Introduction

In this tutorial, we drop rows with null values from a PySpark DataFrame. To do this, we use PySpark's dropna() method.

Import Libraries

First, we import the required Python module:

from pyspark.sql import SparkSession

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all of Spark's functionality.

In order to create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, we create a PySpark DataFrame with some example data from a list. To do this, we use the method createDataFrame() and pass the data and the column names as arguments.

column_names = ["language", "framework", "users"]
data = [
    ("Python", "FastAPI", None),
    ("JavaScript", None, 7000),
    ("Python", "Django", 20000),
    ("Java", None, None),
    (None, None, None),
]
df = spark.createDataFrame(data, column_names)
df.show()
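
With the example data above, df.show() should print something like the following (depending on your Spark version, missing values may be rendered as null or NULL):

+----------+---------+-----+
|  language|framework|users|
+----------+---------+-----+
|    Python|  FastAPI| null|
|JavaScript|     null| 7000|
|    Python|   Django|20000|
|      Java|     null| null|
|      null|     null| null|
+----------+---------+-----+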

Remove Rows with Missing Values in any Column

Next, we would like to remove all rows from the DataFrame that have null values in any column.

To do this, we use the dropna() method of PySpark and pass the value "any" to the how parameter (this is also the default, so calling df.dropna() without arguments behaves the same way):

df_cleaned = df.dropna(how="any")
df_cleaned.show()
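
Since every row except one contains at least one null value, only the complete row remains. The output should look roughly like this:

+--------+---------+-----+
|language|framework|users|
+--------+---------+-----+
|  Python|   Django|20000|
+--------+---------+-----+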

Remove Rows with Missing Values in all Columns

Next, we would like to remove all rows from the DataFrame that have null values in all columns.

To do this, we use the dropna() method again, this time passing the value "all" to the how parameter:

df_cleaned = df.dropna(how="all")
df_cleaned.show()
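
With how="all", only rows in which every column is null are dropped. In our example data, that is just the last row, so the output should look roughly like this:

+----------+---------+-----+
|  language|framework|users|
+----------+---------+-----+
|    Python|  FastAPI| null|
|JavaScript|     null| 7000|
|    Python|   Django|20000|
|      Java|     null| null|
+----------+---------+-----+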

Remove Rows with Missing Values in a certain Column

Next, we would like to remove all rows from the DataFrame that have null values in the column "framework".

To do this, we use the dropna() method of PySpark with the subset parameter and pass the column name as an argument:

df_cleaned = df.dropna(subset=["framework"])
df_cleaned.show()
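
Only the rows with a non-null value in the "framework" column remain, so the output should look roughly like this:

+--------+---------+-----+
|language|framework|users|
+--------+---------+-----+
|  Python|  FastAPI| null|
|  Python|   Django|20000|
+--------+---------+-----+

The subset parameter also accepts several column names. For example, df.dropna(subset=["framework", "users"]) would keep only rows where both of these columns are non-null.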

Conclusion

Congratulations! Now you are one step closer to becoming an AI expert. You have seen that it is very easy to drop rows with null values from a PySpark DataFrame using the dropna() method. Try it yourself!
