PySpark - Read CSV File into DataFrame

Introduction

In this tutorial, we read a CSV file into a PySpark DataFrame. To do this, we use either the csv() method or the format("csv").load() method of the PySpark DataFrameReader. We obtain a DataFrameReader instance through spark.read.

Import Libraries

First, we import SparkSession from the pyspark.sql module:

from pyspark.sql import SparkSession

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to Spark's functionality.

To create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

CSV File

We consider the CSV file "frameworks.csv". We have to keep in mind the following attributes of the CSV file:

The file includes a header row with the column names.
Columns are separated by a semicolon (;).
The file path is "data/frameworks.csv".
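
To follow along without the original file, a stand-in with the same three attributes can be generated with Python's standard csv module. This is only a sketch: the helper name write_sample_csv, the column names and the rows are hypothetical placeholders, not the tutorial's actual data.

```python
import csv
from pathlib import Path


def write_sample_csv(path):
    """Write a small semicolon-separated CSV file with a header row."""
    # Hypothetical placeholder rows, not the tutorial's original data.
    rows = [
        {"id": 1, "name": "Spark"},
        {"id": 2, "name": "PyTorch"},
        {"id": 3, "name": "TensorFlow"},
    ]
    target = Path(path)
    # Create the parent directory (for example "data/") if it is missing.
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "name"], delimiter=";")
        writer.writeheader()
        writer.writerows(rows)
    return target
```

Calling write_sample_csv("data/frameworks.csv") creates the file at the path used in the rest of this tutorial.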

Read CSV File into PySpark DataFrame

Next, we would like to read the CSV file into a PySpark DataFrame. The schema of the DataFrame should be inferred automatically from the underlying data. We can do this in two different ways.

Option 1: csv()

To do this, we first create a DataFrameReader instance with spark.read. Afterwards, we use the csv() method in combination with the option() method of DataFrameReader:

df = spark.read.option("header", True) \
    .option("delimiter", ";") \
    .option("inferSchema", True) \
    .csv("data/frameworks.csv")

df.show()

Option 2: format("csv").load()

Now, we consider another option to read the CSV file into a PySpark DataFrame.

First, we create a DataFrameReader instance with spark.read. Afterwards, we use the load() method in combination with the format() method and the option() method of DataFrameReader:

df = spark.read.option("header", True) \
    .option("delimiter", ";") \
    .option("inferSchema", True) \
    .format("csv") \
    .load("data/frameworks.csv")

df.show()

Conclusion

Congratulations! Now you are one step closer to becoming an AI expert. You have seen that it is easy to read a CSV file into a PySpark DataFrame: we use either the csv() method or the format("csv").load() method of the PySpark DataFrameReader, which we obtain through spark.read. Try it yourself!
