Introduction

In dbt (data build tool), sources represent raw data tables that already exist in your data warehouse, usually loaded from upstream systems such as operational databases, APIs, or other applications. Declaring these tables in sources.yml is one of the first steps in a dbt project. This tutorial walks you through defining and configuring sources in sources.yml, so that dbt knows where your data comes from and you can reference these tables in your models.

📌 This is a must-know topic for the dbt Analytics Engineering Certification Exam, so mastering it now puts you one step closer to passing the exam and leveling up your data engineering skills! 👨‍🎓

✅ Prerequisites

Before you start, make sure you have:

☑️ A dbt project set up

Set up a new dbt Project from Scratch: A Beginner’s Guide

☑️ Source data loaded into your data warehouse

📁 What is sources.yml?

In dbt, sources.yml is a declarative file, typically placed in your models/ directory, where you register raw data tables (your sources) that already exist in your data warehouse but are not created or managed by dbt itself. The file name is a convention: dbt reads source definitions from any .yml file under your model paths. The data is typically loaded by EL (Extract and Load) tools such as Fivetran, Airbyte, Python-based frameworks like DLT (Data Load Tool), or custom pipelines that pull data from upstream systems like operational databases, APIs, or SaaS applications.

By declaring sources, you unlock essential features:

  • Lineage & Dependencies
    Ensures dbt understands how models depend on raw data.
  • Testing & Quality Checks
    You can attach tests directly to sources to validate assumptions.
  • Freshness Monitoring
    Configurable freshness thresholds let you detect stale or delayed data.
  • Auto‑Generated Documentation
    Sources, descriptions, tests, and freshness settings all appear in your dbt docs, which makes your project more transparent.
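
To make these features concrete, here is a minimal sketch of a source definition that touches all four of them. Every name in it (the source app_db, the schema raw, the table orders, and the column _loaded_at) is hypothetical and only there for illustration. The placement of some keys, freshness in particular, has shifted between dbt versions, so treat this as a starting point and check the documentation for the version you run.

```yaml
# models/sources.yml (all names below are hypothetical)
version: 2

sources:
  - name: app_db                  # the name you use in source('app_db', ...)
    description: Raw tables replicated from an operational database.
    schema: raw                   # assumed schema that holds the raw tables
    loaded_at_field: _loaded_at   # assumed timestamp column written by the EL tool
    freshness:                    # freshness monitoring thresholds
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
        description: One row per order.   # shows up in the generated docs
        columns:
          - name: order_id
            description: Primary key of the order.
            tests:                # newer dbt versions also accept data_tests:
              - unique
              - not_null
```

With a definition like this in place, a model can select from {{ source('app_db', 'orders') }} instead of a hard-coded table name, which is how dbt picks up the dependency for lineage. Running dbt source freshness checks the thresholds, and dbt test runs the column tests.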

🔍1️⃣ Identify Source Data

First, let's identify the sources in our warehouse. The following data has already been loaded into the landing layer of the warehouse and comes from two different source systems.

💡
The following sources are completely fictional and tailored for the university example used in this tutorial.

Source 1: University Data Center (UDC)

The following data is provided by the University Data Center (UDC) 🟧:

🟧 student
🟧 tutor
🟧 attendance

▶️ UDC is the source for these tables.

Source 2: Student Information System (SIS)

The following data is provided by the Student Information System (SIS) 🟪:

🟪 course
🟪 course_name

▶️ SIS is the source for these tables.

🗂️2️⃣ Create sources.yml

Now, create a YAML file called sources.yml within your dbt project. Since the source data in our case is stored in the landing layer, place the file inside the landing/ folder.

💡
Instead of using a single YAML file for all sources, you can also create separate YAML files - one for each source.

⚙️3️⃣ Configure sources.yml

Basic configurations
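
As a rough sketch of what a basic configuration could look like for the two sources identified in step 1: each source system becomes one entry under sources, and each raw table becomes one entry under tables. The source names udc and sis and the schema name landing are assumptions derived from the example above, so adjust them to match your own warehouse.

```yaml
# landing/sources.yml (sketch; source and schema names are assumptions)
version: 2

sources:
  - name: udc
    description: University Data Center (UDC)
    schema: landing        # assumed: the schema that holds the landing tables
    tables:
      - name: student
      - name: tutor
      - name: attendance

  - name: sis
    description: Student Information System (SIS)
    schema: landing        # assumed: same landing schema as above
    tables:
      - name: course
      - name: course_name
```

Once these entries are defined, a model can reference a raw table with, for example, {{ source('udc', 'student') }} rather than a hard-coded schema and table name.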
