Defining source configurations in sources.yml in dbt

Introduction

In dbt (data build tool), sources represent raw data tables that already exist in your data warehouse, often provided by upstream systems such as operational databases, APIs, or other applications. Defining these sources in sources.yml is a crucial step in any dbt project. This tutorial shows how to define and configure sources in sources.yml, so that dbt knows where your data comes from and you can reference these tables in your models.

📌 This is a must-know topic for the dbt Analytics Engineering Certification Exam, so mastering it now puts you one step closer to passing the exam and leveling up your data engineering skills! 👨‍🎓

✅ Prerequisites

Before you start, make sure you have:

☑️ A dbt project set up

☑️ Source data loaded into your data warehouse

📁 What is `sources.yml`?

In dbt, sources.yml is a declarative YAML file, typically placed in your models/ directory, where you register raw data tables (your sources) that already exist in your data warehouse but are not created or managed by dbt itself. The file name is a convention rather than a requirement: dbt reads any .yml file in the models directory. The data itself is typically loaded by EL (Extract and Load) tools such as Fivetran or Airbyte, by Python-based frameworks like dlt (data load tool), or by custom pipelines that pull data from upstream systems like operational databases, APIs, or SaaS applications.

By declaring sources, you unlock essential features:

- Lineage & Dependencies: dbt understands how your models depend on raw data.
- Testing & Quality Checks: you can attach tests directly to sources to validate your assumptions about the raw data.
- Freshness Monitoring: configurable freshness thresholds let you detect stale or delayed data.
- Auto-Generated Documentation: sources, descriptions, tests, and freshness settings all appear in your dbt docs, which makes the project more transparent.

🔍1️⃣ Identify Source Data

First, let's identify the sources in our warehouse. The following data has already been loaded into the landing layer of the warehouse and comes from two different source systems.

💡 The following sources are completely fictional and tailored for the university example used in this tutorial.

Source 1: University Data Center (UDC)

The following data is provided by the University Data Center (UDC) 🟧:

🟧 student
🟧 tutor
🟧 attendance

▶️ UDC is the source for these tables.

Source 2: Student Information System (SIS)

The following data is provided by the Student Information System (SIS) 🟪:

🟪 course
🟪 course_name

▶️ SIS is the source for these tables.
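The two source systems above could be registered roughly as follows. This is a sketch under assumptions, not the tutorial's own configuration: the source names `udc` and `sis` and the schema name `landing` are guesses based on the descriptions above, so adjust them to match your warehouse.

```yaml
# Sketch only: source names and the schema name are assumptions.
version: 2

sources:
  - name: udc                       # assumed name for the University Data Center
    description: Tables provided by the University Data Center (UDC).
    schema: landing                 # assumption: raw tables live in a schema called "landing"
    tables:
      - name: student
      - name: tutor
      - name: attendance

  - name: sis                       # assumed name for the Student Information System
    description: Tables provided by the Student Information System (SIS).
    schema: landing
    tables:
      - name: course
      - name: course_name
```

A model could then select from a table with `{{ source('udc', 'student') }}`, which is how dbt records the dependency on the raw data.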


🗂️2️⃣ Create `sources.yml`

Now, create a YAML file called sources.yml within your dbt project. Since the source data in our case is stored in the landing layer, place the file inside the landing/ folder.

💡 Instead of using a single YAML file for all sources, you can also create separate YAML files - one for each source.
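As a sketch of the one-file-per-source layout, the SIS tables could live in their own file next to a similar file for UDC. The file name `sis_sources.yml`, the path `models/landing/`, and the schema name `landing` are hypothetical; dbt reads every .yml file under the models directory, so the name is free to choose.

```yaml
# models/landing/sis_sources.yml (hypothetical file name and path)
version: 2

sources:
  - name: sis                       # assumed source name
    schema: landing                 # assumption: raw tables live in a schema called "landing"
    tables:
      - name: course
      - name: course_name
```

Splitting by source keeps each file short and makes it easier to see which team or system owns which tables.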

⚙️3️⃣ Configure `sources.yml`

Basic configurations
