Refactoring a Legacy dbt Project

🛠 Refactoring a Legacy dbt Project – Lessons Learned from a Healthcare Org

How we untangled a messy dbt repo, optimized model performance, and brought clarity to chaos.


Legacy code is a rite of passage for every data engineer.

When I joined a healthcare analytics project, I inherited a monolithic dbt repository built over 2+ years by different teams, each with their own logic, naming conventions, and… let’s just say creative approaches.

If you’ve ever opened a dbt repo and thought, “Where do I even begin?”, this post is for you.

Let me walk you through the real-world challenges we faced, the refactoring strategy we followed, and the lessons we learned along the way.


🧩 The Legacy Mess – What We Found

Here’s a snapshot of what we walked into:

  • 300+ models, mostly in a flat structure
  • No model categorization (staging, intermediate, marts… all mixed)
  • No clear data lineage or documentation
  • ref() calls inside Jinja loops 😬
  • Zero tests or validations
  • Multiple full-refresh models running daily on millions of rows

Worse: the code was working, but nobody knew why.


🛠 Step 1: Audit the Existing Repo

We started with dbt’s built-in documentation (dbt docs generate) and explored the lineage graph. It was spaghetti.

Then we used these tools:

  • dbt ls to list and categorize models
  • dbt run --select with tags to understand runtime and dependency chains
  • Basic SQL linting tools to detect anti-patterns

We also spoke to stakeholders to understand which models mattered and which were dead weight.

Lesson #1: Don’t refactor blindly. Identify what’s critical and what’s junk.


📦 Step 2: Introduce Structure

We restructured the repo using the medallion architecture:

  • staging/: Raw tables from source systems, cleaned column names
  • intermediate/: Business logic transformations
  • marts/: Analytics-ready models (fact and dimension)

Example structure:

models/
├── staging/
│   └── emr/
├── intermediate/
│   └── patient_encounters/
├── marts/
│   └── clinical/

Lesson #2: A consistent directory structure improves readability and onboarding.


🧪 Step 3: Add Tests & Snapshots

We added dbt tests across all critical models:

  • unique, not_null, and accepted_values
  • Custom tests like duplicate patient IDs with mismatched birthdates

Then we added snapshots to track slowly changing patient demographics (SCD Type 2).

Lesson #3: A small set of tests can catch big issues before they hit dashboards.


⚡ Step 4: Convert Full Refreshes to Incremental

Several models were doing full table scans daily. We migrated these to incremental models using:

{{ config(
  materialized='incremental',
  unique_key='encounter_id',
  incremental_strategy='merge'
) }}

SELECT ...
FROM source_table
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})

Lesson #4: Incremental models can reduce run times by 90%+ if designed correctly.


🧠 Step 5: Documentation and Knowledge Sharing

Once refactoring was done, we:

  • Wrote model-level and column-level descriptions
  • Generated docs with dbt docs serve
  • Conducted walkthrough sessions with analysts and new engineers

Lesson #5: Documentation is not a luxury—it’s the bridge between engineering and business.


🔍 Real Metrics Before and After

MetricBeforeAfter
Full project run time3.5 hours42 minutes
Number of tests085+
Bugs reported monthly~101-2
Onboarding time for new devs~3 weeks< 5 days

💡 Key Takeaways

  1. Start with understanding, not rewriting.
  2. Structure your models for clarity and maintainability.
  3. Use tests, snapshots, and incremental logic to scale.
  4. Automate documentation to support the data community.
  5. Refactoring isn’t just tech—it’s change management.

📣 Final Thoughts

Legacy dbt projects are like archaeology—layer upon layer of old logic and quick fixes. But with the right strategy, you can turn a fragile mess into a robust, well-documented pipeline.

If you’re working on a legacy dbt project or planning a refactor, feel free to reach out or drop your questions in the comments. I’d love to help (and commiserate 😅).


Tags: #dbt #dataengineering #refactoring #healthcaredata #incrementalmodels #datatesting #modernstack