PySpark - Remove Whitespaces from a String Column of a DataFrame

Introduction

In this tutorial, we show how to remove leading and trailing whitespace from a string column of a PySpark DataFrame. To do this, we use the PySpark functions trim(), ltrim() and rtrim(): trim() strips both ends of the string, ltrim() strips only the leading (left) side, and rtrim() strips only the trailing (right) side.
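Before touching Spark, the difference between the three functions can be sketched in plain Python. This is only an analogy: the helper names trim_like, ltrim_like and rtrim_like are hypothetical and not part of PySpark, and the sketch assumes that the Spark functions strip space characters only (not tabs or newlines), as the Spark documentation describes for their default behavior.

```python
# Plain-Python analogy (hypothetical helpers, not PySpark API).
# Assumption: only the space character " " is stripped.

def trim_like(value: str) -> str:
    """Strip spaces from both ends, analogous to trim()."""
    return value.strip(" ")

def ltrim_like(value: str) -> str:
    """Strip spaces from the left end only, analogous to ltrim()."""
    return value.lstrip(" ")

def rtrim_like(value: str) -> str:
    """Strip spaces from the right end only, analogous to rtrim()."""
    return value.rstrip(" ")

sample = "    Django    "
results = {
    "trim": trim_like(sample),
    "ltrim": ltrim_like(sample),
    "rtrim": rtrim_like(sample),
}

# Brackets make any remaining spaces visible.
for name, value in results.items():
    print(f"{name}: [{value}]")
```

The brackets in the printed output make it easy to see which side still carries spaces.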

Import Libraries

First, we import the following Python modules:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim, ltrim, rtrim

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all Spark functionality.

To create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, we create the PySpark DataFrame with some example data from a list. To do this, we use the method createDataFrame() and pass the data and the column names as arguments.

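# Column names for the example DataFrame
column_names = ["language", "framework", "users"]
# The "framework" values are deliberately padded with leading and/or trailing spaces
data = [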
    ("Python", "    Django    ", 20000),
    ("Python", "    FastAPI", 9000),
    ("JavaScript", "  AngularJS", 7000),
    ("JavaScript", "  ReactJS     ", 5000),
    ("Python", "  FastAPI      ", 13000)
]
df = spark.createDataFrame(data, column_names)
df.show()