📘 Introduction

Real-time data pipelines are used when data should be processed continuously instead of waiting for a daily batch job. In this tutorial, you will learn how PySpark Structured Streaming works and how to build your first small streaming pipeline with JSON files, transformations, console output, and checkpointing.

💡
Structured Streaming is a great next step after learning normal PySpark DataFrames, because you can use a very similar API for continuously arriving data.

💡 What is PySpark Structured Streaming?

PySpark Structured Streaming is Spark's high-level API for processing live data streams. You describe your logic with DataFrame operations, and Spark runs that logic incrementally as new data arrives.

Structured Streaming is a scalable and fault-tolerant way to process live data with the Spark SQL engine. For beginners, the most important idea is simple: a stream behaves like a table that keeps receiving new rows.

ConceptMeaningIn this tutorial
Streaming DataFrameA DataFrame that receives new rows over timeNew JSON files in data/input/
Micro-batchSpark processes new data in small repeated batchesEvery 5 seconds
readStreamReads a streaming sourceReads JSON files from a folder
writeStreamWrites streaming resultsPrints transformed rows to the console
CheckpointStores progress and recovery metadataUses checkpoints/orders_console/

🧠 Why this matters for data engineering

Many modern data systems do not only process yesterday's data. They process events, logs, transactions, IoT records, application activity, or order updates as they arrive. Structured Streaming helps you build these pipelines without switching to a completely different programming model.

In this beginner example, we will simulate a real-time order stream by dropping JSON files into a folder. Spark will detect the new files, apply transformations, and print the processed result.

📌
This folder-based example is intentionally simple. The same Structured Streaming ideas also apply when your source is Kafka, cloud storage, event hubs, or a lakehouse table.

✅ Prerequisites

Before getting started, make sure you have:

☑️ Python installed
☑️ Basic knowledge of PySpark DataFrames
☑️ A terminal or command prompt
☑️ Java available on your machine, because Spark runs on the JVM

⚙️1️⃣ Set up the project folders

Now that the prerequisites are clear, create a small local project folder for the streaming tutorial:

mkdir pyspark-structured-streaming
cd pyspark-structured-streaming

Now create folders for input files, checkpoint metadata, and optional output files:

mkdir -p data/input checkpoints output

On Windows PowerShell, you can use:

mkdir data\input, checkpoints, output

🐍2️⃣ Create a virtual environment

Create a virtual environment so the tutorial dependencies stay inside this project:

python -m venv .venv
source .venv/bin/activate

On Windows, activate it with:

.venv\Scripts\activate

📦3️⃣ Install libraries

Then install PySpark:

pip install pyspark
⚙️
If this is your first Spark project, installing PySpark may take a moment. Spark also needs Java, so check your Java installation if the SparkSession cannot start.

🧪4️⃣ Understand the streaming scenario

We will build a small order-processing stream. Every time a new JSON file appears in data/input/, Spark will read it as new streaming data, calculate a total_amount, convert the product name to uppercase, and print the processed rows to the console.

data/input/orders_001.json
                |
                v
PySpark readStream -> transform rows -> writeStream console
                |
                v
checkpoints/orders_console/

The key beginner-friendly idea is that Spark waits for new files. When you add another file later, Spark treats it as another small batch of new data.

🎓
Want the full step-by-step implementation? The Academy section continues with the complete streaming script, test files, result checks, output modes, triggers, checkpoints, and the final project structure.

You can view this post with the tier: Academy Membership

Join academy now to read the post and get access to the full library of premium posts for academy members only.

Join Academy Already have an account? Sign In