PySpark - Aggregate Functions

Introduction

In this tutorial, we perform aggregate operations on the columns of a PySpark DataFrame. To do this, we use several of PySpark's built-in aggregate functions.

Import Libraries

First, we import the following Python modules:

from pyspark.sql import SparkSession
# Note: the wildcard import shadows Python built-ins such as sum, min, and max.
from pyspark.sql.functions import *

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all Spark functionality.

In order to create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, we create the PySpark DataFrame with some example data from a list. To do this, we use the method createDataFrame() and pass the data and the column names as arguments.

column_names = ["language", "framework", "users"]
data = [
    ("Python", "Django", 20000),
    ("Python", "FastAPI", 9000),
    (None, "Spring", 9000),
    ("JavaScript", "ReactJS", 5000)
]
df = spark.createDataFrame(data, column_names)
df.show()

First and Last Value

We would like to extract the first and the last value from the column "framework" of the DataFrame.

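Both first() and last() depend on the order of the rows, which Spark does not guarantee after a shuffle, so the result is only well defined for a DataFrame whose row order is stable. To extract the values, we use the two functions as follows: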

first_value = df.select(first("framework")).collect()[0][0]
last_value = df.select(last("framework")).collect()[0][0]

print(f"First value: {first_value}")
print(f"Last value: {last_value}")
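
Both functions also accept an optional ignorenulls argument, for example first("language", ignorenulls=True), which skips null values. The following is a minimal plain-Python sketch of that behaviour, not PySpark code: the helper names first_value_of and last_value_of and the reordered list languages_null_first are hypothetical, introduced only for illustration, and the sketch assumes the rows are seen in exactly the order listed.

```python
# Plain-Python model of first()/last() semantics; this is NOT PySpark code.
# The rows mirror the tutorial's example data, in the order they were defined.
rows = [
    ("Python", "Django", 20000),
    ("Python", "FastAPI", 9000),
    (None, "Spring", 9000),
    ("JavaScript", "ReactJS", 5000),
]


def first_value_of(values, ignorenulls=False):
    """Return the first value, optionally skipping None entries (hypothetical helper)."""
    for value in values:
        if value is not None or not ignorenulls:
            return value
    return None  # empty input, or only nulls with ignorenulls=True


def last_value_of(values, ignorenulls=False):
    """Return the last value, optionally skipping None entries (hypothetical helper)."""
    return first_value_of(list(reversed(values)), ignorenulls)


frameworks = [row[1] for row in rows]
languages = [row[0] for row in rows]

# A reordered view in which the null language comes first (assumption for illustration).
languages_null_first = [None] + [v for v in languages if v is not None]

print(first_value_of(frameworks), last_value_of(frameworks))
print(first_value_of(languages_null_first))
print(first_value_of(languages_null_first, ignorenulls=True))
```

With the default setting, a null in the first position is returned as is. With ignorenulls=True, the first non-null value is returned instead, and null is returned only if every value is null.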