Slowly Changing Dimension Type 2

What is a slowly changing dimension type 2 and how do I implement it in SQL?

A Slowly Changing Dimension (SCD) Type 2 is a data-warehousing technique that preserves the full history of attribute changes by inserting a new dimension record for every change instead of overwriting existing values.

Welcome to the Galaxy, Guardian!
You'll be receiving a confirmation email

Follow us on twitter :)

Oops! Something went wrong while submitting the form.

Description

Example H2

Example H3

What Is a Slowly Changing Dimension (Type 2)?

A Slowly Changing Dimension Type 2 (SCD 2) is a modeling and ETL pattern used in star or snowflake schemas to track historical changes to dimension attributes—such as a customer’s address—by inserting a new row with a surrogate key and date range each time a change occurs.

The term "slowly changing" refers to dimensions whose attributes change infrequently—think about customer addresses, product categories, or employee job titles. Type 2 is one of several SCD strategies (others include Type 0, 1, 3, 4, and 6). Its distinguishing feature is that it never updates or deletes existing rows; instead, it adds a new row and marks it current with flags or date ranges.

Why SCD Type 2 Matters

Data products frequently require "as was" analysis: How many orders shipped when the customer lived in New York? How did a sales pipeline look before a re-org? Without Type 2, such questions are impossible once attributes are overwritten. Preserving historical context is vital for:

Regulatory compliance (SOX, GDPR, HIPAA auditability)
Time-travel analytics and attribution
Machine-learning features whose values must align with event time
Debugging data pipelines by replaying past states

Core Concepts

Surrogate Key

A system-generated, immutable identifier (e.g., customer_sk) that uniquely identifies each version of the dimension.

Business Key

The natural key in the source system (e.g., customer_id).

Current Flag / Date Range

Columns—typically is_current, effective_date, and end_date—that define which record is active for a given business key at any point in time.

Implementation Blueprint

Landing/Staging – Load raw source data for the current day into a staging table.
Hash or Diff – Calculate a hash (e.g., MD5) of the tracked attributes to detect changes.
Split Logic
- Unchanged: Skip rows whose hash matches the current dimension version.
- Changed: Update the existing dimension row’s end_date to yesterday and is_current = 'N'.
- Insert: Create a new dimension row with is_current = 'Y', effective_date = today, end_date = '9999-12-31'.
Load Fact Tables – Join facts to the dimension on business_key and event_date BETWEEN effective_date AND end_date.

Practical SQL Example

-- Step 1: detect changed customers in staging vs dimension WITH hashed_staging AS ( SELECT s.customer_id, md5(concat_ws('|', s.first_name, s.last_name, s.address, s.city)) AS attr_hash, s.* FROM stage_customer s ), latest_dim AS ( SELECT d.* FROM dim_customer d WHERE d.is_current = 'Y' ) , to_change AS ( SELECT hs.* FROM hashed_staging hs JOIN latest_dim ld ON hs.customer_id = ld.customer_id WHERE hs.attr_hash <> ld.attr_hash ) -- Step 2: close out old versions UPDATE dim_customer d SET is_current = 'N', end_date = CURRENT_DATE - INTERVAL '1 day' FROM to_change c WHERE d.customer_id = c.customer_id AND d.is_current = 'Y'; -- Step 3: insert new versions INSERT INTO dim_customer ( customer_sk, customer_id, first_name, last_name, address, city, attr_hash, effective_date, end_date, is_current ) SELECT nextval('customer_sk_seq'), s.customer_id, s.first_name, s.last_name, s.address, s.city, s.attr_hash, CURRENT_DATE, DATE '9999-12-31', 'Y' FROM to_change s;

Best Practices

Use Surrogate Keys – Avoid using business keys in fact tables; surrogate keys shield downstream queries from nulls and re-keys.
Default Open-Ended Dates – A distant future date (e.g., 9999-12-31) simplifies BETWEEN joins.
Hash Columns – Hashes simplify change detection compared with row-by-row comparisons, especially for wide dimensions.
Idempotent Loads – Design ETL so reruns produce the same result, using transaction boundaries and MERGE statements where possible.
Partitioning & Indexing – Index on business_key and is_current; optionally partition by is_current or effective_year.

How Galaxy Helps

Although SCD 2 is conceptually a modeling technique, its implementation is SQL-heavy. Galaxy’s modern SQL editor speeds up this work by:

AI Copilot – Generate or optimize MERGE statements for Type 2 loads.
Version Control – Commit and annotate ETL queries in Collections, enabling peer review and endorsement across data teams.
Metadata Awareness – Auto-complete surrogate key sequences and dimension columns without hunting through DESCRIBEs.

Common Pitfalls

1. Overwriting Instead of Closing Out

Updating the existing row corrupts the history. Always set is_current = 'N' and end_date before inserting the new version.

2. Missing Time Zones

If effective_date or end_date ignore time zones, facts stamped in UTC may mismatch. Normalize to UTC or store both date and timestamp with TZ.

3. Joining Only on Business Key

Analysts often forget the date range condition. A correct join uses business_key and a window predicate (BETWEEN effective/end dates).

Validations and Audits

After each load, consider:

SELECT customer_id, COUNT(*) FROM dim_customer GROUP BY 1 HAVING COUNT(*) > 1 AND SUM(is_current='Y') <> 1; – Ensures exactly one current row per business key.
Row counts between staging and inserts to detect missed changes.

Key Takeaways

SCD 2 keeps every version of a dimension record, enabling true historical analysis.
Implementation hinges on surrogate keys, date ranges, and idempotent ETL.
Tools like Galaxy accelerate authoring, reviewing, and optimizing Type 2 SQL.

Why Slowly Changing Dimension Type 2 is important

Without Type 2 dimensions, any update to an attribute overwrites history, making it impossible to answer time-variant business questions or comply with audit requirements. Implementing SCD 2 correctly preserves data lineage, supports historical analytics, and ensures regulatory compliance—all foundational needs in modern data engineering and analytics.

Slowly Changing Dimension Type 2 Example Usage

Slowly Changing Dimension Type 2 Syntax

Common Mistakes

Using UPDATE statements to overwrite values instead of inserting a new row, which destroys historical data. Fix by setting the old row’s end_date and inserting a new current row.
Forgetting to set end_date for the previous version, causing overlapping date ranges and ambiguous joins. Fix by always closing the old record before inserting the new one in a single transaction.
Joining fact tables to the dimension only on the business key, leading to duplicate joins or incorrect history. Fix by joining on business_key and date range (effective_date <= event_date <= end_date).

Frequently Asked Questions (FAQs)

What is the difference between SCD Type 1 and Type 2?

Type 1 simply overwrites the dimension record, losing history. Type 2 inserts a new record for every change and preserves all historical versions via date ranges or flags.

How do I query the current version of an SCD 2 dimension?

Filter on is_current = 'Y' or end_date = '9999-12-31'. Indexing that column speeds up lookups.

Does SCD 2 impact fact table sizes?

No. Facts remain the same; only the dimension stores multiple versions. However, ensure you join on both the business key and date range to avoid duplicates.

Can Galaxy help implement SCD 2?

Yes. Galaxy’s AI copilot can auto-generate MERGE statements, while Collections allow teams to review and endorse the Type 2 ETL scripts—all within a fast, memory-light SQL IDE.