Introduction

In this tutorial, we read a CSV file into a PySpark DataFrame. To do this, we use either the csv() method or the format("csv").load() method of the PySpark DataFrameReader. We obtain a DataFrameReader instance through spark.read.

Import Libraries

First, we import SparkSession from the pyspark.sql module:

from pyspark.sql import SparkSession

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all Spark functionality.

To create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

CSV File

We consider the CSV file "frameworks.csv".

We have to keep in mind the following attributes of the CSV file:

  • The file includes a header row with the column names.
  • The columns are separated by a semicolon (;).
  • The file path is "data/frameworks.csv".
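
To follow along without the original file, one option is to generate a small stand-in file with the same three attributes. The sketch below uses only the Python standard library. The column names and rows are hypothetical placeholders, not the contents of the tutorial's frameworks.csv, and write_sample_csv is a helper name introduced here for illustration.

```python
import csv
from pathlib import Path

# Hypothetical sample rows; the tutorial's actual file contents are not reproduced here.
SAMPLE_ROWS = [
    {"id": 1, "framework": "Spark", "language": "Scala"},
    {"id": 2, "framework": "Django", "language": "Python"},
    {"id": 3, "framework": "Rails", "language": "Ruby"},
]


def write_sample_csv(path):
    """Write SAMPLE_ROWS to `path` as a semicolon-separated CSV with a header row."""
    target = Path(path)
    # Create the parent directory (e.g. "data/") if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["id", "framework", "language"], delimiter=";"
        )
        writer.writeheader()            # first line: column names
        writer.writerows(SAMPLE_ROWS)   # one line per sample row
    return target
```

Calling write_sample_csv("data/frameworks.csv") would place the stand-in file at the path used in the rest of this tutorial. Note that it overwrites any existing file at that location.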

Read CSV File into PySpark DataFrame

Next, we would like to read the CSV file into a PySpark DataFrame. The schema of the DataFrame should be inferred automatically from the underlying data. We can do this in two different ways.

Option 1: csv()

To do this, we first create a DataFrameReader instance with spark.read. Afterwards, we use the csv() method in combination with the option() method of DataFrameReader:

df = spark.read.option("header", True) \
    .option("delimiter", ";") \
    .option("inferSchema", True) \
    .csv("data/frameworks.csv")

df.show()
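
As a more compact variant, the csv() method also accepts these settings as keyword arguments, where sep is the keyword for the delimiter. The following is a minimal sketch under the assumption that a SparkSession has already been created as shown earlier. The function name read_csv_with_kwargs is introduced here for illustration only.

```python
def read_csv_with_kwargs(spark, path):
    """Read a semicolon-separated CSV file with a header row.

    `spark` is an existing SparkSession. The options are passed directly
    to csv() as keyword arguments instead of chained option() calls.
    """
    return spark.read.csv(path, sep=";", header=True, inferSchema=True)
```

With the session from above, df = read_csv_with_kwargs(spark, "data/frameworks.csv") followed by df.show() should display the same DataFrame as the chained option() calls.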

Option 2: format("csv").load()

Now, we consider another option to read the CSV file into a PySpark DataFrame.

First, we create a DataFrameReader instance with spark.read. Afterwards, we use the load() method in combination with the format() method and the option() method of DataFrameReader:

df = spark.read.option("header", True) \
    .option("delimiter", ";") \
    .option("inferSchema", True) \
    .format("csv") \
    .load("data/frameworks.csv")

df.show()
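
Since both options use the same reader settings, it can be convenient to define them once and pass them through the options() method of DataFrameReader, which takes several settings at once as keyword arguments. The sketch below assumes an existing SparkSession. The names CSV_OPTIONS, read_with_csv and read_with_load are hypothetical and introduced here for illustration.

```python
# Reader settings shared by both read styles in this tutorial.
CSV_OPTIONS = {"header": True, "delimiter": ";", "inferSchema": True}


def read_with_csv(spark, path):
    """Option 1: apply the shared settings, then call csv()."""
    return spark.read.options(**CSV_OPTIONS).csv(path)


def read_with_load(spark, path):
    """Option 2: apply the shared settings, then call format("csv").load()."""
    return spark.read.options(**CSV_OPTIONS).format("csv").load(path)
```

Calling printSchema() on the DataFrame returned by either function is a quick way to check which column types were inferred.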

Conclusion

Congratulations! Now you are one step closer to becoming an AI expert. You have seen that it is easy to read a CSV file into a PySpark DataFrame: we can use either the csv() method or the format("csv").load() method of the PySpark DataFrameReader. A DataFrameReader instance can be created with spark.read. Try it yourself!
