dbt Snapshot

What is a dbt snapshot and why would I use one?

A dbt snapshot is a special model type that automatically captures and versions row-level changes in a source table over time, giving you an auditable, slowly-changing-dimension history without hand-written change-data-capture logic.

Welcome to the Galaxy, Guardian!
You'll be receiving a confirmation email

Follow us on twitter :)

Oops! Something went wrong while submitting the form.

Description

Example H2

Example H3

What Is a dbt Snapshot?

A dbt snapshot is a declarative configuration that tells dbt to detect changes in a source table, copy the modified rows into a managed table, and record when each version became valid. Unlike an incremental model, a snapshot keeps every historical version of a row, enabling point-in-time analysis and slowly changing dimension (SCD) use cases.

Why Snapshots Matter

Analysts and data engineers frequently need to answer questions such as “What did the customer record look like last month?” or “How many active subscriptions did we have on January 1st?” Traditional ETL pipelines overwrite rows, making historical states hard to reconstruct. dbt snapshots solve this by:

Preserving history — every update is recorded with dbt_valid_from and dbt_valid_to timestamps.
Simplifying SCD logic — dbt generates the SQL needed for slowly changing dimension Type 2 storage.
Enabling auditing — snapshots provide a fully auditable trail of changes, vital for finance, compliance, and debugging.
Reducing boilerplate — no need to hand-roll CDC SQL or manage triggers; you describe the rules and dbt does the rest.

How Snapshots Work Under the Hood

A snapshot is defined in a snapshots/ directory using Jinja + YAML. During a dbt snapshot run, dbt performs these steps:

Compute a unique key per source row (e.g., id).
Hash the selected columns (or the whole row) to detect changes.
Compare the hash to the most recent version stored in the snapshot table.
If the hash differs, close the existing record (dbt_valid_to = current_timestamp) and insert the new version (dbt_valid_from = current_timestamp, dbt_valid_to = null).

Because dbt executes SQL directly in your warehouse, snapshots scale with your database's native performance and partitioning features.

Key Configuration Options

`target_schema` & `target_database`

Tell dbt where to create the physical snapshot table. Best practice is to store snapshots in their own schema (e.g., snapshots) to simplify permissioning.

`unique_key`

The business key that identifies a logical row. Without it, dbt cannot tell which historical chain a record belongs to.

`strategy`

timestamp: Detect change when a source updated_at column advances.
check: Detect change by comparing a list of columns or a hash of the entire row.

Column Selection

Use the updated_at or check_cols parameters to restrict change detection to only the fields you care about, reducing churn and storage cost.

End-to-End Example

Suppose we need to capture historical versions of a customers table that gets nightly backfills.

{% snapshot customers_snapshot %} {{ config( target_database='analytics', target_schema='snapshots', unique_key='id', strategy='check', check_cols=['name', 'email', 'plan', 'company_size'] ) }} select * from {{ source('stripe', 'customers') }} {% endsnapshot %}

Running dbt snapshot creates snapshots.customers_snapshot with four extra fields:

dbt_scd_id — surrogate key for the version
dbt_updated_at — load timestamp
dbt_valid_from
dbt_valid_to

Querying Snapshots

To find the state of a customer on a specific date:

select * from snapshots.customers_snapshot where id = 42 and '2024-03-01' between dbt_valid_from and coalesce(dbt_valid_to, '9999-12-31');

If you use the Galaxy SQL editor, you can save the query in a Collection named “Customer History” and let teammates endorse it for downstream analytics.

Best Practices

Always define a stable unique_key — composite keys are acceptable as long as they never change.
Use the narrowest column set for change detection to minimize storage.
Schedule snapshots after source loads to avoid capturing intermediate states.
Add tests to ensure dbt_valid_to is null for only one current row per unique_key.
Partition & cluster large snapshots on dbt_valid_from or unique_key for query speed.

Common Misconceptions

“Snapshots are the same as incremental models.” Incremental models overwrite old rows; snapshots preserve every version.
“I need CDC logs to build SCD tables.” dbt snapshots work with plain tables; no database-level CDC is required.
“Snapshots are slow and expensive.” When configured with selective columns and proper partitioning, they are usually only a few seconds per run and cost pennies.

Integrating Snapshots with Galaxy

After a dbt job completes, your warehouse contains snapshot tables like snapshots.customers_snapshot. Open Galaxy, connect to the warehouse, and:

Use the sidebar metadata explorer to find the snapshot table.
Write point-in-time queries with Galaxy’s AI Copilot to autocomplete date predicates.
Save the query to a team Collection and endorse it, ensuring everyone uses a single, trusted definition of “customer on date X.”

Conclusion

dbt snapshots offer a simple, declarative route to historical data tracking. By combining snapshots with modern tooling like Galaxy, teams gain both trusted history and a collaborative environment to explore it.

Why dbt Snapshot is important

Modern analytics must answer questions about how data looked at any point in the past. Overwriting tables destroys that history, while database-level CDC is often complex and platform-specific. dbt snapshots provide a vendor-agnostic, SQL-only approach to retain every version of a record. They power regulatory audits, churn analysis, funnel reconstruction, A/B test backfills, and more—without extra infrastructure. Because snapshots run directly in the warehouse, every analytics engineer can implement SCD logic using familiar dbt workflows.

dbt Snapshot Example Usage


How do I create a snapshot of the customers table using dbt so I can see what a row looked like last month?

dbt Snapshot Syntax

Common Mistakes

Forgetting to set a stable unique_key. Without it, dbt inserts a full copy of the source on every run. Fix it by specifying a unique business identifier or composite key in the snapshot config.
Including volatile columns like updated_at in check_cols, causing unnecessary new versions. Limit check_cols to business attributes that matter for analysis.
Running snapshots before the upstream load finishes, leading to partial records. Schedule your orchestrator so snapshots execute after the raw data ingest completes.

Frequently Asked Questions (FAQs)

How often should I run dbt snapshots?

Match your snapshot cadence to the rate of change in the source table. Daily is common, but rapidly changing tables may need hourly snapshots.

Do snapshots support Slowly Changing Dimension Type 1?

Type 1 (overwrite) doesn’t require snapshots because you don’t preserve history. Snapshots implement SCD Type 2 by design.

Can I query dbt snapshot tables from Galaxy?

Yes. Point Galaxy at the same warehouse; the snapshot schema appears in the sidebar. Galaxy’s AI Copilot can even draft time-travel queries for you.

What happens if I change the unique_key after snapshots exist?

dbt treats the new key as unrelated, producing a fresh history and potentially duplicating data. Instead, create a new snapshot file and deprecate the old one.