Introduction

In this tutorial, we add columns to a PySpark DataFrame. To do this, we use the withColumn() method of PySpark.

Import Libraries

First, we import the following Python modules:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, col, sum, when
from pyspark.sql.window import Window

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point into all functionalities of Spark.

In order to create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()
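
If you run this script locally (outside an environment like Databricks, where a SparkSession already exists), you may need to set the master explicitly. A minimal sketch, assuming a local standalone run:

# Local sketch: use all available cores of the machine
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("Python PySpark Example") \
    .getOrCreate()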

Create PySpark DataFrame

Next, we create a PySpark DataFrame with some example data from a list. To do this, we use the method createDataFrame() and pass the data and the column names as arguments.

column_names = ["language", "framework", "users"]
data = [
    ("Python", "Django", 20000), 
    ("Python", "FastAPI", 9000), 
    ("Java", "Spring", 7000), 
    ("JavaScript", "ReactJS", 5000)
]
df = spark.createDataFrame(data, column_names)
df.show()
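
The show() call prints the DataFrame as a table. For the data above, the output should look something like this (row order may vary):

+----------+---------+-----+
|  language|framework|users|
+----------+---------+-----+
|    Python|   Django|20000|
|    Python|  FastAPI| 9000|
|      Java|   Spring| 7000|
|JavaScript|  ReactJS| 5000|
+----------+---------+-----+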

Add Column with constant Value

Now, we would like to add the column "country" with the constant value "Germany".

To do this, we use the withColumn() method of PySpark and pass the column name and the values as arguments. For providing the constant value, we use the lit() function of PySpark:

df = df.withColumn(
    "country", 
    lit("Germany")
)
df.show()
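
As a side note: if you are on Spark 3.3 or newer and want to add several constant columns in one step, there is also the withColumns() method, which takes a dict. A minimal sketch (the column "source" is just a hypothetical example):

# Spark >= 3.3: add several constant columns in one call
df = df.withColumns({
    "country": lit("Germany"),
    "source": lit("blog"),  # hypothetical example column
})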

Add Column based on another Column

Next, we would like to add the column "users_percent", which contains each row's share of the total of the existing column "users".

To do this, we use the withColumn() method of PySpark and pass the column name and the values as arguments. For calculating the percentage, we use the col() and sum() functions together with a Window specification of PySpark:

df = df.withColumn(
    "users_percent",
    # divide each row's "users" value by the total over a global window
    col("users") / sum("users").over(Window.partitionBy())
)
df.show()
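
Note that Window.partitionBy() without arguments defines a global window, so Spark moves all rows into a single partition to compute the total (and logs a warning about this). For large DataFrames, an alternative sketch is to aggregate the total first and join it back:

# Alternative sketch: compute the total once, then join it back
total = df.agg(sum("users").alias("total_users"))
df = (
    df.crossJoin(total)
      .withColumn("users_percent", col("users") / col("total_users"))
      .drop("total_users")
)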

Add Column based on Condition

Next, we would like to add the column "type" with the values "Backend", "Frontend", and None, depending on the values of the column "language".

To do this, we use the withColumn() method of PySpark and pass the column name and the values as arguments. For calculating the values based on a condition, we use the when() and lit() functions of PySpark:

df = df.withColumn(
    "type",
    when(df.language.isin(["Python", "Java"]), lit("Backend"))
        .when(df.language=="JavaScript", lit("Frontend"))
        .otherwise(lit(None))
)
df.show()
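
The same conditional logic can also be expressed as a SQL CASE WHEN expression via the expr() function, which some readers may find easier to read. A minimal sketch:

from pyspark.sql.functions import expr

# Equivalent conditional logic as a SQL CASE WHEN expression
df = df.withColumn(
    "type",
    expr("""
        CASE
            WHEN language IN ('Python', 'Java') THEN 'Backend'
            WHEN language = 'JavaScript' THEN 'Frontend'
            ELSE NULL
        END
    """)
)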

Conclusion

Congratulations! You are now one step closer to becoming an AI expert. You have seen that it is very easy to add columns to a PySpark DataFrame: we can simply use the withColumn() method of PySpark. Try it yourself!
