dbt is an open-source framework that lets data teams transform, test, and document data in the warehouse using version-controlled SQL.
dbt, short for data build tool, has rapidly become the de-facto standard for transforming raw warehouse tables into reliable, analytics-ready datasets. By treating transformations as software—complete with modular SQL, version control, testing, and documentation—dbt enables analytics engineering practices that dramatically improve data quality and team productivity.
dbt is an open-source command-line tool (`pip install dbt-core`) and ecosystem that lets you write modular SQL models, run dependency-aware builds, add tests, auto-generate documentation, and maintain everything in Git. You think in SQL; dbt handles dependency graphs, execution order, and metadata.
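A minimal setup might look like the following sketch; the `dbt-snowflake` adapter and the project name are illustrative, so substitute the adapter for your warehouse:

```bash
# Install dbt Core plus a warehouse adapter (Snowflake shown as an example)
pip install dbt-core dbt-snowflake

# Scaffold a new project and verify the warehouse connection
dbt init my_analytics_project
dbt debug
```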
As companies move analytics to cloud data warehouses (Snowflake, BigQuery, Redshift, Databricks, etc.), they need a scalable way to convert raw ingested data into clean, governed tables. dbt solves this by turning those transformations into modular, tested, version-controlled SQL.
The core idea is simple: each SQL file in your project is a model that selects from upstream sources and models. When you run `dbt run`, dbt materializes those SELECT statements as views or tables in your warehouse, honoring the dependency graph.
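In other words, a model is nothing more than a SELECT statement, optionally with a config block that overrides how it is materialized. A sketch (the model and source names here are hypothetical):

```sql
-- models/stg_payments.sql (hypothetical model)
-- config() overrides the project-level materialization for this one model
{{ config(materialized='view') }}

select
    id as payment_id,
    order_id,
    amount
from {{ source('shopify', 'payments') }}
```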
Models live under `models/` and define transformations. Sources declare raw tables, giving them rich metadata and enabling freshness checks. Seeds are CSV files you want loaded as reference tables.
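A source declaration is a YAML block. Here is a minimal sketch in which the schema, `identifier`, freshness thresholds, and `_loaded_at` column are all assumptions:

```yaml
# models/sources.yml
version: 2

sources:
  - name: shopify
    schema: raw
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    loaded_at_field: _loaded_at
    tables:
      - name: orders
        identifier: shopify_orders  # actual table name in the warehouse
```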
Add YAML blocks under `tests:` for schema tests (e.g., `not_null`, `unique`) or write custom SQL tests. dbt Docs then compiles a website where each model shows SQL lineage, tests, and descriptions.
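For example, a schema file attaching tests and descriptions to a model might look like this sketch (the model and column names follow the e-commerce example below):

```yaml
# models/staging/stg_orders.yml
version: 2

models:
  - name: stg_orders
    description: "Cleaned Shopify orders, one row per order."
    columns:
      - name: order_id
        description: "Primary key of the order."
        tests:
          - not_null
          - unique
      - name: customer_id
        tests:
          - not_null
```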
Suppose you ingest raw e-commerce data into `raw.shopify_orders`. You might create:
- `stg_orders.sql` – standardizes column names
- `dim_customers.sql` – aggregates customer info
- `fct_orders.sql` – fact table joining orders to customers

Because `fct_orders` references `stg_orders` and `dim_customers`, dbt automatically determines the correct build order.
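Those references are made with `{{ ref() }}`, which is what dbt parses to build the dependency graph. A sketch of `fct_orders.sql` under that assumption (the join keys and the `customer_name` column are illustrative):

```sql
-- models/marts/fct_orders.sql
select
    o.order_id,
    o.customer_id,
    c.customer_name,  -- assumed column from dim_customers
    o.subtotal_amount,
    o.created_at
from {{ ref('stg_orders') }} as o
left join {{ ref('dim_customers') }} as c
    on o.customer_id = c.customer_id
```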
Keep raw logic in `stg_*` models and business logic in `dim_*`/`fct_*` marts. This separation simplifies debugging and change management.
Add `not_null`, `unique`, and conditional tests in staging models so bad data never reaches downstream dashboards.
Set up GitHub Actions to run `dbt build --select state:modified+` on pull requests. You’ll prevent bad SQL from merging to `main`.
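A minimal workflow sketch; the adapter, the secret name, and the `./prod-artifacts` directory holding a production `manifest.json` for state comparison are all assumptions:

```yaml
# .github/workflows/dbt-ci.yml
name: dbt CI
on: pull_request

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-core dbt-snowflake  # adapter is an example
      - run: dbt deps
      # Build only models changed on this branch, plus everything downstream.
      # state:modified+ compares against the manifest from the last prod run.
      - run: dbt build --select state:modified+ --state ./prod-artifacts
        env:
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
```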
Description blocks power dbt Docs and help new teammates onboard quickly.
Declare `exposures` to clarify lineage.

Use `var()` and `env_var()` so models deploy cleanly across dev, staging, and prod.
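As a sketch of the `var()` pattern, a model can read a run-time variable with a fallback default; the variable name and filter here are hypothetical. (`env_var()` works the same way but reads environment variables, and is most often used in `profiles.yml` for credentials.)

```sql
-- models/staging/stg_orders_recent.sql (hypothetical)
select *
from {{ ref('stg_orders') }}
-- var() reads a value set in dbt_project.yml or via --vars, with a default
where created_at >= '{{ var("start_date", "2024-01-01") }}'
```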
Here is the full `stg_orders.sql` from the e-commerce example:

```sql
-- models/stg_orders.sql
with raw as (

    select *
    from {{ source('shopify', 'orders') }}

)

select
    id as order_id,
    cast(customer_id as int) as customer_id,
    cast(subtotal_price as numeric) as subtotal_amount,
    cast(created_at as timestamp) as created_at
from raw
```
Run it with `dbt run --select stg_orders`. dbt renders the Jinja, resolves `{{ source() }}`, and creates `stg_orders` as a view or table.
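Conceptually, the compiled SQL written to `target/compiled/` looks something like this; the resolved table name depends on your source configuration (here it follows the `raw.shopify_orders` example above):

```sql
-- target/compiled/.../stg_orders.sql (illustrative)
with raw as (
    select *
    from raw.shopify_orders  -- {{ source('shopify', 'orders') }}, resolved
)
select
    id as order_id,
    cast(customer_id as int) as customer_id,
    cast(subtotal_price as numeric) as subtotal_amount,
    cast(created_at as timestamp) as created_at
from raw
```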
dbt is not an ETL tool in the full sense: it focuses on the T (transform) once data already lives in your warehouse. You'll still need an EL tool (Fivetran, Airbyte, custom pipelines) to load raw data.
Stored procedures mix logic and execution. dbt externalizes logic into version-controlled SQL files, adds testing, and shows lineage—making it easier to collaborate and audit.
Galaxy pairs well with dbt: because dbt models are plain SQL, you can use Galaxy's lightning-fast editor and AI copilot to write, refactor, and share dbt SQL files before running `dbt` locally or in CI.
dbt Core is free and open source. dbt Cloud adds a managed scheduler, IDE, and governance. Many teams start with Core and move to Cloud as needs grow.
Without a structured transformation layer, data warehouses turn into junk drawers of half-clean tables, eroding trust in analytics. dbt introduces version control, testing, and documentation so data products stay accurate and reproducible—critical for any organization that relies on data-driven decisions.
dbt doesn't replace ETL entirely: it handles transformations after data lands in your warehouse; you still need EL tooling for extraction and loading.
dbt enforces modularity, version control, testing, and documentation, turning ad-hoc SQL into maintainable code.
Galaxy's IDE-style SQL editor and AI copilot accelerate writing and refactoring dbt models in your local repo.
dbt Cloud isn't required for scheduling: you can run `dbt run` from any orchestrator (Airflow, GitHub Actions). dbt Cloud simply provides a managed option.