Introduction

In this tutorial, we replace null values in a PySpark DataFrame. To do this, we use the fillna() method of PySpark.

Import Libraries

First, we import the following Python modules:

from pyspark.sql import SparkSession
from pyspark.sql.functions import mean

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all of Spark's functionality.

In order to create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, we create the PySpark DataFrame "df" with some example data from a list. To do this, we use the createDataFrame() method and pass the data and the column names as arguments.

column_names = ["language", "framework", "users"]
data = [
    ("Python", "FastAPI", None),
    (None, None, 7000),
    ("Python", "Django", 20000),
    ("Java", None, None),
]
df = spark.createDataFrame(data, column_names)
df.show()

Replace Missing Values with Constant Values

Next, we would like to replace all null values of the DataFrame "df" with constant values.

The null values of the columns "language" and "framework" should be replaced with the value "unknown". The null values of the column "users" should be replaced with the value 0.

To do this, we use the fillna() method of PySpark. We pass the new value and the column names as arguments:

df_cleaned = df.fillna(value="unknown", subset=["language", "framework"])
df_cleaned = df_cleaned.fillna(value=0, subset=["users"])
df_cleaned.show()

Replace Missing Values with Aggregated Values

Next, we would like to replace null values of the DataFrame "df" with aggregated values.

The null values of the column "users" should be replaced with the mean of the column values.

To do this, we use the mean() function of PySpark to calculate the mean of the column and the fillna() method of PySpark to replace the null values with it:

# Calculate the mean of the "users" column (mean() ignores null values)
mean_value = df.select(mean(df["users"])).collect()[0][0]

# Replace the nulls; since "users" is an integer (long) column,
# the float mean is cast to the column's type when it is filled in
df_cleaned = df.fillna(mean_value, subset=["users"])
df_cleaned.show()

Conclusion

Congratulations! Now you are one step closer to becoming an AI expert. You have seen that it is very easy to replace null values in a PySpark DataFrame: we can simply use the fillna() method of PySpark. Try it yourself!

Instagram

Also check out our Instagram page. We appreciate your like or comment. Feel free to share this post with your friends.