PySpark - Explode Arrays into Rows of a DataFrame

Introduction

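In this tutorial, we explode arrays into rows of a PySpark DataFrame. To do this, we use the explode() and explode_outer() functions of PySpark. Both produce one output row per array element. They differ in how they treat null or empty arrays: explode() drops those rows, while explode_outer() keeps them and returns null for the element.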
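To make the difference between the two functions concrete, here is a minimal plain-Python sketch of their row semantics. It is not Spark code. The helper names explode_rows and explode_outer_rows are hypothetical and only model the behavior described above, and the sample rows are illustrative.

```python
# Plain-Python model of the row semantics of explode() and explode_outer().
# This does not use Spark; the helper names are hypothetical.
rows = [
    ("Python", ["FastAPI", "Django", "Flask"]),
    ("JavaScript", ["ReactJS", "AngularJS"]),
    ("Java", None),  # null array
]


def explode_rows(rows):
    """Model explode(): one row per element; null or empty arrays yield no rows."""
    return [(key, item) for key, items in rows for item in (items or [])]


def explode_outer_rows(rows):
    """Model explode_outer(): null or empty arrays yield one row with None."""
    return [(key, item) for key, items in rows for item in (items or [None])]


exploded = explode_rows(rows)
exploded_outer = explode_outer_rows(rows)

for row in exploded_outer:
    print(row)
```

In this model, the "Java" row disappears from exploded but survives in exploded_outer with None as its element.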

Import Libraries

First, we import the following Python modules:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, explode_outer

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all Spark functionality.

To create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, we create the PySpark DataFrame from a list of example data. To do this, we call the createDataFrame() method and pass the data and the column names as arguments. Note that the last row has None instead of an array, which lets us see how null arrays are handled.

column_names = ["language", "frameworks"]
data = [
    ("Python", ["FastAPI", "Django", "Flask"]),
    ("JavaScript", ["ReactJS", "AngularJS"]),
    ("Java", None),
]
df = spark.createDataFrame(data, column_names)
df.show()