Introduction

Data manipulation tasks often involve converting column data types to ensure consistency and accuracy in analysis. In this tutorial, we will show you how to change the column types of a PySpark DataFrame. To do this, we will use the cast() method of PySpark's Column class.

Import Libraries

First, we import the following Python modules:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType, BooleanType, DateType

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all of Spark's functionality.

In order to create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, we create a PySpark DataFrame from a list of example data. To do this, we use the method createDataFrame() and pass the data and the column names as arguments.

column_names = ["language", "framework", "users", "backend", "date"]
data = [
    ("Python", "Django", "20000", "true", "2022-03-15"),
    ("Python", "FastAPI", "9000", "true", "2022-06-21"),
    ("Java", "Spring", "7000", "true", "2023-12-04"),
    ("JavaScript", "ReactJS", "5000", "false", "2023-01-11")
]
df = spark.createDataFrame(data, column_names)
df.show()

Let's print the schema of the DataFrame:

df.printSchema()
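Because every value in the example data was passed as a Python string, Spark infers all columns as strings. The output should look like this:

```
root
 |-- language: string (nullable = true)
 |-- framework: string (nullable = true)
 |-- users: string (nullable = true)
 |-- backend: string (nullable = true)
 |-- date: string (nullable = true)
```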

Change Data Type of a Single Column

Let's start with an example of converting the data type of a single column within a PySpark DataFrame.

We want to convert the data type of the column "users" from string to integer. To do this, we call the cast() method on the column:

# change column type
df_new = df.withColumn("users", col("users").cast(IntegerType()))

# print schema
df_new.printSchema()
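The schema should now report "users" as an integer:

```
root
 |-- language: string (nullable = true)
 |-- framework: string (nullable = true)
 |-- users: integer (nullable = true)
 |-- backend: string (nullable = true)
 |-- date: string (nullable = true)
```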

As you can see, the column "users" now has the desired data type.

💡
Alternatively, you can change the data type of the column "users" to integer with the expressions col("users").cast("int") or col("users").cast("integer").

Change Data Type of Multiple Columns

Now, let's see how to change the data types of multiple columns at once.

We want to do the following:

  • Convert the data type of the column "users" from string to integer.
  • Convert the data type of the column "backend" from string to boolean.
  • Convert the data type of the column "date" from string to date.

To do this, we chain withColumn() calls, using cast() once per column:

# change column types
df_new = df.withColumn("users", col("users").cast(IntegerType())) \
    .withColumn("backend", col("backend").cast(BooleanType())) \
    .withColumn("date", col("date").cast(DateType()))

# print schema
df_new.printSchema()
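The schema should now show the three converted types:

```
root
 |-- language: string (nullable = true)
 |-- framework: string (nullable = true)
 |-- users: integer (nullable = true)
 |-- backend: boolean (nullable = true)
 |-- date: date (nullable = true)
```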

As you can see, the columns "users", "backend" and "date" now have the desired data types.

Conclusion

Congratulations! Now you are one step closer to becoming an AI expert. You have seen that it is very easy to change the column types of a PySpark DataFrame: we can simply use the cast() method. Try it yourself!
