PySpark - Group and Concatenate Strings in a DataFrame

Introduction

In this tutorial, we show how to group the rows of a PySpark DataFrame and concatenate the string values within each group. To do this, we combine the groupBy() method with the PySpark functions concat_ws(), collect_list() and array_distinct().

Import Libraries

First, we import the following Python modules:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all Spark functionality.

To create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, we create the PySpark DataFrame with some example data from a list. To do this, we use the method createDataFrame() and pass the data and the column names as arguments.

column_names = ["language", "framework", "users"]
data = [
    ("Python", "Django", 20000),
    ("Python", "FastAPI", 9000),
    ("JavaScript", "AngularJS", 7000),
    ("JavaScript", "ReactJS", 5000),
    ("Python", "FastAPI", 13000)
]
df = spark.createDataFrame(data, column_names)
df.show()

Group and Concatenate Strings

We want to group the rows of the PySpark DataFrame based on the column "language". For each group, the string values of column "framework" should be concatenated into a single string.