
🛠 Refactoring a Legacy dbt Project – Lessons Learned from a Healthcare Org
How we untangled a messy dbt repo, optimized model performance, and brought clarity to chaos.
Legacy code is a rite of passage for every data engineer.
When I joined a healthcare analytics project, I inherited a monolithic dbt repository built over 2+ years by different teams, each with their own logic, naming conventions, and… let’s just say creative approaches.
If you’ve ever opened a dbt repo and thought, “Where do I even begin?”, this post is for you.
Let me walk you through the real-world challenges we faced, the refactoring strategy we followed, and the lessons we learned along the way.
🧩 The Legacy Mess – What We Found
Here’s a snapshot of what we walked into:
- 300+ models, mostly in a flat structure
- No model categorization (
staging
,intermediate
,marts
… all mixed) - No clear data lineage or documentation
ref()
calls inside Jinja loops 😬- Zero tests or validations
- Multiple full-refresh models running daily on millions of rows
Worse: the code was working, but nobody knew why.
🛠 Step 1: Audit the Existing Repo
We started with dbt’s built-in documentation (dbt docs generate
) and explored the lineage graph. It was spaghetti.
Then we used these tools:
dbt ls
to list and categorize modelsdbt run --select
with tags to understand runtime and dependency chains- Basic SQL linting tools to detect anti-patterns
We also spoke to stakeholders to understand which models mattered and which were dead weight.
Lesson #1: Don’t refactor blindly. Identify what’s critical and what’s junk.
📦 Step 2: Introduce Structure
We restructured the repo using the medallion architecture:
staging/
: Raw tables from source systems, cleaned column namesintermediate/
: Business logic transformationsmarts/
: Analytics-ready models (fact and dimension)
Example structure:
models/
├── staging/
│ └── emr/
├── intermediate/
│ └── patient_encounters/
├── marts/
│ └── clinical/
Lesson #2: A consistent directory structure improves readability and onboarding.
🧪 Step 3: Add Tests & Snapshots
We added dbt tests across all critical models:
unique
,not_null
, andaccepted_values
- Custom tests like duplicate patient IDs with mismatched birthdates
Then we added snapshots to track slowly changing patient demographics (SCD Type 2).
Lesson #3: A small set of tests can catch big issues before they hit dashboards.
⚡ Step 4: Convert Full Refreshes to Incremental
Several models were doing full table scans daily. We migrated these to incremental models using:
{{ config(
materialized='incremental',
unique_key='encounter_id',
incremental_strategy='merge'
) }}
SELECT ...
FROM source_table
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
Lesson #4: Incremental models can reduce run times by 90%+ if designed correctly.
🧠 Step 5: Documentation and Knowledge Sharing
Once refactoring was done, we:
- Wrote model-level and column-level descriptions
- Generated docs with
dbt docs serve
- Conducted walkthrough sessions with analysts and new engineers
Lesson #5: Documentation is not a luxury—it’s the bridge between engineering and business.
🔍 Real Metrics Before and After
Metric | Before | After |
---|---|---|
Full project run time | 3.5 hours | 42 minutes |
Number of tests | 0 | 85+ |
Bugs reported monthly | ~10 | 1-2 |
Onboarding time for new devs | ~3 weeks | < 5 days |
💡 Key Takeaways
- Start with understanding, not rewriting.
- Structure your models for clarity and maintainability.
- Use tests, snapshots, and incremental logic to scale.
- Automate documentation to support the data community.
- Refactoring isn’t just tech—it’s change management.
📣 Final Thoughts
Legacy dbt projects are like archaeology—layer upon layer of old logic and quick fixes. But with the right strategy, you can turn a fragile mess into a robust, well-documented pipeline.
If you’re working on a legacy dbt project or planning a refactor, feel free to reach out or drop your questions in the comments. I’d love to help (and commiserate 😅).
Tags: #dbt #dataengineering #refactoring #healthcaredata #incrementalmodels #datatesting #modernstack