📘 Introduction

Real-time data ingestion is a critical part of modern data architectures. Organizations need to process and store continuous streams of information for analytics, monitoring, and machine learning. Databricks, with the combined power of PySpark and Delta Lake, provides an efficient way to build end-to-end streaming pipelines that handle data ingestion, transformation, and persistence seamlessly.

In this guide, you’ll learn how to ingest data from Kafka streams into Delta tables using PySpark in Databricks. You’ll see how to configure Kafka as a streaming source, parse JSON messages, and write the output directly into Delta Lake — ensuring reliability, scalability, and consistency for your real-time workloads.

💡 Why Use Kafka with Delta Lake in Databricks?

Kafka is the backbone of many real-time data pipelines, allowing systems to publish and subscribe to continuous streams of events. Delta Lake complements Kafka perfectly by adding ACID transactions, scalable metadata handling, and schema evolution capabilities on top of data lakes.

💡
When you combine the two in Databricks using PySpark, you get a fault-tolerant streaming architecture where incoming Kafka events are ingested into Delta tables for immediate analysis and long-term storage — all with minimal configuration and maximum performance.

✅ Prerequisites

Before you begin, make sure you have:

🐍☑️ Python and PySpark installed
🔥☑️ Access to a Databricks workspace and cluster
📦☑️ A Kafka cluster with an accessible topic

📥1️⃣ Import Libraries

Start by importing the required Python modules:

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql.functions import from_json, col

⚙️2️⃣ Build a Spark Session

Before connecting to Kafka, you’ll need a Spark session. This serves as the main entry point for working with Spark SQL, DataFrames, and streaming APIs in Databricks.

spark = SparkSession.builder \
    .appName("KafkaToDeltaDemo") \
    .getOrCreate()

💡
In a Databricks notebook, a SparkSession is already provisioned for you as the spark variable, so getOrCreate() simply returns that existing session and attaches your code to the underlying cluster; outside Databricks, the same builder call initializes a fresh PySpark environment.
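
If you want to confirm the session is attached before wiring up the stream, a quick sanity check (optional, not part of the pipeline itself) looks like this:

print(spark.version)                          # Spark version of the attached cluster
print(spark.sparkContext.defaultParallelism)  # rough indication of available cores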

🔗3️⃣ Configure Kafka Stream Source

Now, connect Spark Structured Streaming to your Kafka topic. The following code configures Kafka as the streaming source and sets startingOffsets="earliest", which tells Spark to begin reading from the very beginning of the topic on the first run; once the query has a checkpoint, it resumes from the offsets recorded there instead. Use "latest" if you only want events that arrive after the query starts.

kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") \
    .option("subscribe", "orders_topic") \
    .option("startingOffsets", "earliest") \
    .load()
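
At this point kafka_df exposes the raw Kafka fields such as key, value, topic, partition, offset, and timestamp, with value delivered as binary. To make the events queryable, you would cast value to a string, parse the JSON with from_json, and stream the result into a Delta table. The sketch below shows the general shape; the order_id/customer_id/amount fields, the checkpoint location, and the table path are illustrative assumptions rather than values from this guide, so adjust them to match your topic's payload and storage layout.

# Hypothetical schema for the JSON payload on orders_topic
order_schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("amount", StringType(), True)
])

# Kafka delivers value as bytes: cast to string, then parse the JSON into columns
parsed_df = kafka_df \
    .selectExpr("CAST(value AS STRING) AS json_value") \
    .select(from_json(col("json_value"), order_schema).alias("data")) \
    .select("data.*")

# Stream the parsed rows into a Delta table (paths are placeholders)
query = parsed_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/tmp/checkpoints/orders") \
    .start("/tmp/delta/orders")

The checkpointLocation is required for a streaming write; it records which Kafka offsets have already been processed so the query can recover from failures without losing or duplicating events in the Delta table.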
