📘 Introduction

In many real-world datasets, the same type of information can appear in more than one column. A customer may provide an email address, a phone number, or a backup contact, and different systems may populate different fields. When you want to select the first available non-null value from several options, PySpark’s coalesce() function lets you do it in a single expression.

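coalesce() checks each column or expression in order, left to right, and returns the first non-null value for each row. If every argument is null, the result is null. This avoids complicated conditional logic and keeps your transformations easy to read and maintain. Note that this is the column function from pyspark.sql.functions, not the DataFrame.coalesce() method, which reduces the number of partitions.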
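To make the "first non-null wins" rule concrete, here is a minimal plain-Python sketch of the same idea applied to a single row. The helper name first_non_null is hypothetical and is not part of PySpark; it only models the semantics, with Python's None standing in for null.

```python
from typing import Any, Optional


def first_non_null(*values: Any) -> Optional[Any]:
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


# The first non-None value is returned, scanning left to right.
print(first_non_null(None, "555-1234", "backup"))  # 555-1234

# An empty string is not null, so it is returned as-is.
print(repr(first_non_null("", "555-1234")))  # ''

# If every argument is None, the result is None.
print(first_non_null(None, None))  # None
```

The second call is worth noticing: only null counts as missing, so an empty string or a zero is treated as a real value and will be picked.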

💡 Why Use the coalesce() Function?

You might need coalesce() when you want to:

  • Pick the first non-null value across multiple columns
  • Apply fallback logic in a single expression
  • Simplify pipelines that receive partial or inconsistent data
  • Replace long when().otherwise() chains with something cleaner
Instead of writing nested conditionals like:

“If email exists use it, else use phone, else use backup, otherwise use a default.”

coalesce() lets you express that logic in one simple line.

✅ Prerequisites

Before starting, make sure you have the following:

🐍☑️ Python installed
🔥☑️ A working Spark environment

📦1️⃣ Install Libraries

Install PySpark using pip:

pip install pyspark

📥2️⃣ Import Libraries

Start by importing the required Python modules:

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, col, lit

⚙️3️⃣ Build a Spark Session

Next, initialize your Spark session — the entry point for working with DataFrames:

spark = SparkSession.builder \
    .appName("PySparkTutorial") \
    .getOrCreate()

✍️4️⃣ Create Sample DataFrame

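Let’s create a small PySpark DataFrame in which each row has an email, a phone number, or neither. Python’s None becomes null in the DataFrame.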

data = [
    (1, "alice@example.com", None),
    (2, None, "555-1234"),
    (3, None, None)
]

df = spark.createDataFrame(
    data,
    ["id", "email", "phone"]
)

df.show()

Output:

+---+-----------------+--------+
| id|            email|   phone|
+---+-----------------+--------+
|  1|alice@example.com|    null|
|  2|             null|555-1234|
|  3|             null|    null|
+---+-----------------+--------+

🔄5️⃣ Choose the Best Contact Using coalesce()

Now let’s apply coalesce() to select the preferred contact method in this order:

email → phone → "no_contact"
