In this tutorial, we join PySpark DataFrames. To do this, we use the join() method of PySpark.

Import Libraries

First, we import the following Python module:

from pyspark.sql import SparkSession

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all functionality of Spark.

In order to create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrames

We create two PySpark DataFrames with some example data from lists. To do this, we use the method createDataFrame() and pass the data and the column names as arguments.

First, we create the PySpark DataFrame "df_languages":

column_names = ["id", "language"]
data = [
    (1, "Python"),
    (2, "JavaScript"),
    (3, "C++"),
    (4, "Visual Basic"),
]
df_languages = spark.createDataFrame(data, column_names)

Next, we create the PySpark DataFrame "df_frameworks":

column_names = ["framework_id", "framework", "language_id"]
data = [
    (1, "Spring", 5),
    (2, "FastAPI", 1),
    (3, "ReactJS", 2),
    (4, "Django", 1),
    (5, "Flask", 1),
    (6, "AngularJS", 2),
]
df_frameworks = spark.createDataFrame(data, column_names)

Inner Join

Now, we would like to combine the two DataFrames with an inner join. The DataFrame "df_languages" has the primary key "id", and the corresponding foreign key in the DataFrame "df_frameworks" is "language_id". An inner join keeps only the rows where the join condition matches on both sides.
