In the realm of big data processing, PySpark has emerged as a powerful tool for handling large-scale datasets. Its distributed computing framework allows for efficient processing of massive volumes of data. However, despite its capabilities, performing certain data transformations in PySpark can sometimes be cumbersome and complex. That's where Pandas User Defined Functions (UDFs) come into play, offering a simpler and more intuitive way to manipulate data within a PySpark DataFrame.

In this tutorial, we will provide a step-by-step guide on how to create a Pandas UDF and apply it to a PySpark DataFrame. We will demonstrate two different methods: using the pandas_udf() function and using the @pandas_udf decorator.

What are Pandas UDFs?

Pandas UDFs leverage the functionality of the popular Python library Pandas within the PySpark environment. They allow you to apply Pandas functions to PySpark DataFrames, enabling you to perform complex data manipulations with ease. This integration of Pandas with PySpark opens up a wide range of possibilities for data transformation tasks.

Import Libraries

First, we import the following Python modules:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import StringType

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point into all functionalities of Spark.

In order to create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()
Create PySpark DataFrame

Next, we create the PySpark DataFrame with some example data from a list. To do this, we use the method createDataFrame() and pass the data and the column names as arguments.

column_names = ["language", "framework", "users"]
data = [
    ("Python", "Django", 20000),
    ("Python", "FastAPI", 9000),
    ("JavaScript", "ReactJS", 5000),
]

df = spark.createDataFrame(data, column_names)

How to Use Pandas UDFs in PySpark

To demonstrate the usage of Pandas UDFs in PySpark, we want to convert the values of the PySpark DataFrame column "framework" to lowercase.

To do this, we will create and apply a Pandas UDF using two different approaches: the pandas_udf() function and the @pandas_udf decorator.
