Introduction

In this tutorial, we select specific columns from a PySpark DataFrame. To do this, we use the select() method, which accepts the columns in several different forms.

Import Libraries

First, we import the following Python modules:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all Spark functionality.

To create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, we create a PySpark DataFrame with some example data from a list. To do this, we use the method createDataFrame() and pass the data and the column names as arguments.

column_names = ["language", "framework", "users"]
data = [
    ("Python", "Django", 20000), 
    ("Python", "FastAPI", 9000), 
    ("Java", "Spring", 7000), 
    ("JavaScript", "ReactJS", 5000)
]
df = spark.createDataFrame(data, column_names)
df.show()

Select Columns - Option 1

Now, we would like to select the columns "framework" and "users" of the DataFrame. To do this, we use the select() method, which returns a new DataFrame and leaves the original unchanged.

There are different ways to specify the columns. One option is to pass the column names as strings to select():

new_df = df.select("framework", "users")
new_df.show()

Select Columns - Option 2

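A second way is to access the columns as attributes of the DataFrame. This only works for column names that are valid Python identifiers and do not clash with an existing DataFrame attribute: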

new_df = df.select(df.framework, df.users)
new_df.show()

Select Columns - Option 3

Similarly, we can select the columns with bracket notation, which also works for names that are not valid Python identifiers:

new_df = df.select(df["framework"], df["users"])
new_df.show()

Select Columns - Option 4

Another option is to use the col() function inside the select() method. col() refers to a column by name without going through a particular DataFrame variable:

new_df = df.select(col("framework"), col("users"))
new_df.show()

Select Columns - Option 5

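Another option is to select the columns by position. df.columns is a Python list of the column names, so the slice df.columns[1:] yields every column except the first, here "framework" and "users":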

new_df = df.select(df.columns[1:])
new_df.show()
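
Because df.columns is an ordinary Python list, the slicing itself can be tried without Spark. The following is a minimal sketch in which the list is written out by hand to mirror column_names from above; the variable names tail and first_and_last are illustrative.

```python
# Hand-written copy of the column list; in the tutorial this is df.columns.
columns = ["language", "framework", "users"]

# Slice: every column except the first one.
tail = columns[1:]

# Individual positions: the first and the last column.
first_and_last = [columns[0], columns[-1]]

print(tail)
print(first_and_last)
```

Since select() accepts a list of column names, either list can be passed to it directly, for example df.select(tail).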

Conclusion

Congratulations! Now you are one step closer to becoming an AI Expert. You have seen that it is very easy to select specific columns from a PySpark DataFrame with the select() method. Try it yourself!
