In this tutorial, we convert a Pandas DataFrame into a PySpark DataFrame with a specific schema. To do this, we use the createDataFrame() function of PySpark.

Import Libraries

First, we import the following Python modules:

import pandas as pd

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point into all functionalities of Spark.

In order to create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create Pandas DataFrame

We would like to create a Pandas DataFrame based on a dictionary. To do this, we use the pandas class DataFrame:

my_dict = {
    "language": ["Python", "Python", "Java", "JavaScript"],
    "framework": ["Django", "FastAPI", "Spring", "ReactJS"],
    "users": [20000, 9000, 7000, 5000]
}

pandas_df = pd.DataFrame(my_dict)
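Before converting, it can help to sanity-check the shape and the dtypes that Pandas inferred, since these influence how PySpark maps the columns. A minimal check might look like this:

```python
import pandas as pd

my_dict = {
    "language": ["Python", "Python", "Java", "JavaScript"],
    "framework": ["Django", "FastAPI", "Spring", "ReactJS"],
    "users": [20000, 9000, 7000, 5000]
}
pandas_df = pd.DataFrame(my_dict)

# The DataFrame has 4 rows and 3 columns
print(pandas_df.shape)

# Pandas infers the object dtype for the string columns
# and int64 for the users column
print(pandas_df.dtypes)
```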

Define Schema

Next, we define the underlying schema of the PySpark DataFrame. We would like to specify the column names along with their data types.

To do this, we use the classes StructType and StructField. StructField is used to define the column name, the data type, and a flag indicating whether the column is nullable.

schema = StructType([
    StructField("language", StringType(), True),
    StructField("framework", StringType(), True),
    StructField("users", IntegerType(), True)
])

Convert to PySpark DataFrame

Finally, we convert the Pandas DataFrame into a PySpark DataFrame. To do this, we use the createDataFrame() function and pass the Pandas DataFrame and the schema as arguments:

pyspark_df = spark.createDataFrame(pandas_df, schema)


Congratulations! Now you are one step closer to becoming an AI Expert. You have seen that it is very easy to convert a Pandas DataFrame into a PySpark DataFrame. We can simply use the createDataFrame() function of PySpark. Try it yourself!

