Introduction

In this tutorial, we use regular expressions (regex) to filter, replace, and extract strings from a PySpark DataFrame based on specific patterns. To do this, we use the rlike() method together with the regexp_replace() and regexp_extract() functions of PySpark.

Import Libraries

First, we import the following Python modules:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_extract, regexp_replace

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all functionality of Spark.

In order to create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, we create a PySpark DataFrame from some example data held in a list. To do this, we use the createDataFrame() method and pass the data and the column names as arguments.

column_names = ["language", "framework", "users"]
data = [
    ("Python", "FastAPI 0.92.0", 9000),
    ("JavaScript", "ReactJS 18.0", 7000),
    ("Python", "Django 4.1", 20000),
    ("Java", "Spring Boot 3.1", 12000),
]
df = spark.createDataFrame(data, column_names)
df.show()
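
The show() call should print the DataFrame roughly as follows:

+----------+---------------+-----+
|  language|      framework|users|
+----------+---------------+-----+
|    Python| FastAPI 0.92.0| 9000|
|JavaScript|   ReactJS 18.0| 7000|
|    Python|     Django 4.1|20000|
|      Java|Spring Boot 3.1|12000|
+----------+---------------+-----+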

Filter Data

We would like to filter the rows of the DataFrame based on a certain string pattern.

In this example, we want to select all rows where the value of the column "language" starts with "Py".
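
One way to write this filter is with the rlike() method of the Column class. The regex "^Py" matches any string that starts with "Py", since "^" anchors the pattern to the beginning of the string:

# Keep only rows whose "language" value starts with "Py"
df_filtered = df.filter(col("language").rlike("^Py"))
df_filtered.show()

This keeps the two Python rows. Note that rlike() matches the pattern anywhere in the string, so it is the "^" anchor that restricts the match to the beginning.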
