Introduction

In this tutorial, we use regular expressions (regex) to filter, replace, and extract strings from a PySpark DataFrame based on specific patterns. To do this, we use the rlike() method together with the regexp_replace() and regexp_extract() functions of PySpark.

Import Libraries

First, we import the following Python modules:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_extract, regexp_replace

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all functionality of Spark.

In order to create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, we create a PySpark DataFrame from some example data held in a list. To do this, we use the createDataFrame() method and pass the data and the column names as arguments.

column_names = ["language", "framework", "users"]
data = [
    ("Python", "FastAPI 0.92.0", 9000),
    ("JavaScript", "ReactJS 18.0", 7000),
    ("Python", "Django 4.1", 20000),
    ("Java", "Spring Boot 3.1", 12000),
]
df = spark.createDataFrame(data, column_names)
df.show()
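
The show() call should print the DataFrame roughly as follows:

+----------+---------------+-----+
|  language|      framework|users|
+----------+---------------+-----+
|    Python| FastAPI 0.92.0| 9000|
|JavaScript|   ReactJS 18.0| 7000|
|    Python|     Django 4.1|20000|
|      Java|Spring Boot 3.1|12000|
+----------+---------------+-----+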

Filter Data

We would like to filter the rows of the DataFrame based on a certain string pattern.

In this example, we want to select all rows where the value of the column "language" starts with "Py".
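
One way to write this filter is with the rlike() method of the Column class. The regex "^Py" matches any string that starts with "Py", since "^" anchors the pattern to the beginning of the string:

# Keep only rows whose "language" value starts with "Py"
df_filtered = df.filter(col("language").rlike("^Py"))
df_filtered.show()

This keeps the two Python rows. Note that rlike() matches the pattern anywhere in the string, so it is the "^" anchor that restricts the match to the beginning.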
