📘 Introduction

If you're new to dbt (data build tool) and want to transform raw data into clean, analytics-ready tables using Python, you're in the right place. In this tutorial, we’ll walk you through building your first Python model in dbt, step by step.

📌 This is a must-know topic for the dbt Analytics Engineering Certification Exam, so mastering it now puts you one step closer to passing the exam and leveling up your data engineering skills! 👨‍🎓

✅ Prerequisites

Before you start, make sure you have:

☑️ A dbt project set up
☑️ A Medallion Architecture in place
☑️ Source data loaded into your data warehouse
☑️ Source configurations defined in sources.yml

🐍 What are Python Models?

In dbt, a Python model is a .py file that defines a function named model(). The function takes two arguments, dbt and session, and returns a single DataFrame. Which kind of DataFrame depends on your data platform: a Snowpark DataFrame on Snowflake, a PySpark DataFrame on platforms like Databricks, or a pandas DataFrame. When you run dbt run, dbt submits the Python code to your data platform, which executes it and materializes the result in your data warehouse.

Python models only support the following materializations:

  • table
  • incremental
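
As a rough illustration of the shape described above, here is a minimal sketch of a Python model. The name upstream_model is hypothetical and only stands in for some existing model in your project; the materialization is set with dbt.config() inside the function.

```python
def model(dbt, session):
    # Python models support only "table" and "incremental" here.
    dbt.config(materialized="table")

    # "upstream_model" is a hypothetical model name used for illustration.
    # dbt.ref() hands the referenced model back as a DataFrame.
    df = dbt.ref("upstream_model")

    # model() must return a single DataFrame, which dbt then materializes.
    return df
```

Switching the string to "incremental" selects the other supported materialization; an incremental model usually also needs filtering logic, which is left out of this sketch.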

✍️1️⃣ Specify Requirements

Let’s define the objective of our model:

We want to create a model that selects specific columns from the student table in the source udc. This source table lives in the landing schema of our data warehouse.

The table student contains the following data:

Here’s what we want our model to do:

  • ✅ Select the columns: ID, Name, Major, Number

📁2️⃣ Create Python model

In your dbt project, navigate to the models folder and create a new .py file in the appropriate layer. In our example, we create a file named cleaned_student.py in the folder 02_bronze.

💻3️⃣ Write Python code

Open cleaned_student.py and add the following code:
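
A minimal sketch of what cleaned_student.py could look like, under a few assumptions: the column names ID, Name, Major, and Number are an assumed reading of the requirements, the table materialization is an arbitrary choice, and the DataFrame returned by dbt.source() is assumed to offer a select() method (as Snowpark and PySpark DataFrames do).

```python
# cleaned_student.py (sketch, not a definitive implementation)

# Assumed column names, taken from the requirements for this model.
SELECTED_COLUMNS = ["ID", "Name", "Major", "Number"]


def model(dbt, session):
    # Materialize the result as a table (assumed choice for this example).
    dbt.config(materialized="table")

    # Read the student table from the udc source defined in sources.yml.
    student_df = dbt.source("udc", "student")

    # Keep only the required columns and return the resulting DataFrame.
    return student_df.select(*SELECTED_COLUMNS)
```

If your platform hands you a pandas DataFrame instead, the selection would be written as student_df[SELECTED_COLUMNS], since pandas has no select() method.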
