PySpark - Group a DataFrame and apply Aggregations

Introduction

One of the key tasks in data analysis is grouping data to gain insights and make informed decisions. In this tutorial, we will show you how to group the rows of a PySpark DataFrame and apply different aggregations to the grouped data. To do this, we will use the groupBy() function in combination with the agg() function and several of PySpark's aggregation functions.

Import Libraries

First, we import the following Python modules:

from pyspark.sql import SparkSession
# Note: this import shadows Python's built-in sum() for the rest of the script
from pyspark.sql.functions import avg, sum

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all Spark functionality.

To create a basic SparkSession programmatically, we use the following command. The getOrCreate() call returns the session that is already running, if there is one, and creates a new one otherwise:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, we create the PySpark DataFrame from a list of example data. To do this, we use the method createDataFrame() and pass the data and the column names as arguments. Because we pass only column names, Spark infers the column types from the data.

column_names = ["language", "framework", "users"]
data = [
    ("Python", "Django", 20000),
    ("Python", "FastAPI", 9000),
    ("Java", "Spring", 7000),
    ("JavaScript", "ReactJS", 5000),
    ("Python", "FastAPI", 13000)
]
df = spark.createDataFrame(data, column_names)
df.show()
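
Before grouping the DataFrame, it can help to know which values a grouped aggregation over this example data should produce. The following is a minimal plain-Python sketch, not PySpark code: it groups the same rows by language and computes the sum and the average of the users column by hand. The names users_by_language, summary, sum_users and avg_users are chosen here for illustration only. Since the block is standalone, sum() refers to Python's built-in function, not to the PySpark function imported earlier.

```python
from collections import defaultdict

# Same example rows as in the tutorial's data list.
data = [
    ("Python", "Django", 20000),
    ("Python", "FastAPI", 9000),
    ("Java", "Spring", 7000),
    ("JavaScript", "ReactJS", 5000),
    ("Python", "FastAPI", 13000),
]

# Collect the "users" values for each "language" key (the grouping step).
users_by_language = defaultdict(list)
for language, _framework, users in data:
    users_by_language[language].append(users)

# Compute the total and the mean per group (the aggregation step).
# The keys sum_users and avg_users are illustrative names.
summary = {
    language: {
        "sum_users": sum(values),
        "avg_users": sum(values) / len(values),
    }
    for language, values in users_by_language.items()
}

for language, stats in summary.items():
    print(language, stats["sum_users"], stats["avg_users"])
```

The three Python rows collapse into a single group, while Java and JavaScript each form a group with one row. A grouped sum and average computed with PySpark on the language column should match these numbers.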