PySpark - Drop Columns from a DataFrame

Introduction

In this tutorial, we want to drop columns from a PySpark DataFrame. To do this, we use the drop() method of the PySpark DataFrame.

Import Libraries

First, we import SparkSession from the pyspark.sql module:

from pyspark.sql import SparkSession

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all Spark functionality.

In order to create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, we create a PySpark DataFrame with some example data from a list. To do this, we use the method createDataFrame() and pass the data and the column names as arguments.

column_names = ["language", "framework", "users"]
data = [
    ("Python", "Django", 20000), 
    ("Python", "FastAPI", 9000), 
    ("Java", "Spring", 7000), 
    ("JavaScript", "ReactJS", 5000)
]
df = spark.createDataFrame(data, column_names)
df.show()

Drop a Single Column

We would like to remove a single column from the DataFrame.

To do this, we use the drop() method and pass the column name as an argument. The method returns a new DataFrame and leaves the original one unchanged:

new_df = df.drop("users")
new_df.show()

Drop Multiple Columns

Next, we would like to remove multiple columns from the DataFrame.

To do this, we use the drop() method of PySpark and pass the column names as arguments:

new_df = df.drop("framework", "users")
new_df.show()

Conclusion

Congratulations! Now you are one step closer to becoming an AI expert. You have seen that it is very easy to drop columns from a PySpark DataFrame: simply call the drop() method with the names of the columns to remove. Try it yourself!
