Normalizing data in Amazon Redshift splits a wide, denormalized table into well-structured dimension and fact tables to improve query speed, reduce storage, and strengthen data quality.
Normalization reorganizes columns and rows so that each table holds one entity type, reducing redundancy and update anomalies. In Redshift, you usually migrate from a flat staging table to dimension and fact tables.
Even though Redshift is columnar, smaller dimension tables boost join performance, cut storage, and simplify incremental loads. They also let DISTKEY and SORTKEY settings work efficiently.
Use CREATE TABLE AS SELECT (CTAS) to extract unique customers and products. Apply DISTKEY on the surrogate key and SORTKEY on frequently filtered columns.
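As a minimal sketch, assuming a flat staging table named stg.orders_flat and an analytics schema (all names here are illustrative), a product dimension with a generated surrogate key could look like this:

```sql
-- Assumed staging table: stg.orders_flat (one row per order line,
-- with customer and product attributes repeated on every row).
CREATE TABLE analytics.dim_product
DISTKEY (product_sk)
SORTKEY (category, product_id)
AS
SELECT ROW_NUMBER() OVER (ORDER BY product_id) AS product_sk,  -- surrogate key
       product_id,                                             -- natural key
       product_name,
       category,
       unit_price
FROM (
    SELECT DISTINCT product_id, product_name, category, unit_price
    FROM stg.orders_flat
) AS src;
```

ROW_NUMBER is convenient for a one-off migration; for dimensions that are reloaded incrementally, the IDENTITY approach in the next step keeps surrogate keys stable.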
Add IDENTITY columns or the GENERATED BY DEFAULT AS IDENTITY syntax so each dimension row gets a compact integer surrogate key. Replace the natural keys in the fact table with these surrogate keys later.
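CTAS cannot declare an IDENTITY column, so one common pattern is to create the dimension first and then populate it with INSERT ... SELECT. A sketch using the same assumed names:

```sql
CREATE TABLE analytics.dim_customer (
    customer_sk   BIGINT IDENTITY(1, 1),   -- surrogate key, auto-generated
    customer_id   VARCHAR(64) NOT NULL,    -- natural key from the source system
    customer_name VARCHAR(256),
    region        VARCHAR(64)
)
DISTKEY (customer_sk)
SORTKEY (customer_id);

INSERT INTO analytics.dim_customer (customer_id, customer_name, region)
SELECT DISTINCT customer_id, customer_name, region
FROM stg.orders_flat;
```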
Insert from the staging orders table, joining to the new dimension tables to look up surrogate keys. Store metrics such as quantity and total_amount, and set the fact table's DISTKEY on the customer or order key.
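Continuing the same assumed schema, the fact load joins staging rows to the dimensions so that natural keys are swapped for surrogate keys:

```sql
CREATE TABLE analytics.fact_orders (
    order_id     VARCHAR(64) NOT NULL,
    customer_sk  BIGINT      NOT NULL,
    product_sk   BIGINT      NOT NULL,
    order_date   DATE        NOT NULL,
    quantity     INTEGER,
    total_amount DECIMAL(18, 2)
)
DISTKEY (customer_sk)   -- co-locates joins with dim_customer
SORTKEY (order_date);

INSERT INTO analytics.fact_orders
    (order_id, customer_sk, product_sk, order_date, quantity, total_amount)
SELECT o.order_id,
       c.customer_sk,
       p.product_sk,
       o.order_date,
       o.quantity,
       o.total_amount
FROM stg.orders_flat o
JOIN analytics.dim_customer c ON c.customer_id = o.customer_id
JOIN analytics.dim_product  p ON p.product_id  = o.product_id;
```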
Wrap the CTAS and INSERT-SELECT statements in a stored procedure. Use an AWS Lambda function or a scheduled Amazon EventBridge rule to call the procedure after each batch COPY into the staging schema.
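A sketch of such a procedure over the assumed tables; the Lambda function or EventBridge target would simply issue the CALL, for example through the Redshift Data API:

```sql
CREATE OR REPLACE PROCEDURE analytics.sp_load_star_schema()
AS $$
BEGIN
    -- Add customers from the latest batch that are not in the dimension yet
    -- (dim_product would be handled the same way).
    INSERT INTO analytics.dim_customer (customer_id, customer_name, region)
    SELECT DISTINCT s.customer_id, s.customer_name, s.region
    FROM stg.orders_flat s
    LEFT JOIN analytics.dim_customer d ON d.customer_id = s.customer_id
    WHERE d.customer_id IS NULL;

    -- Append new facts, translating natural keys to surrogate keys.
    INSERT INTO analytics.fact_orders
        (order_id, customer_sk, product_sk, order_date, quantity, total_amount)
    SELECT o.order_id, c.customer_sk, p.product_sk,
           o.order_date, o.quantity, o.total_amount
    FROM stg.orders_flat o
    JOIN analytics.dim_customer c ON c.customer_id = o.customer_id
    JOIN analytics.dim_product  p ON p.product_id  = o.product_id;
END;
$$ LANGUAGE plpgsql;

-- Run after each staging COPY:
CALL analytics.sp_load_star_schema();
```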
Pick DISTSTYLE KEY on the most common join key, usually the customer key. Keep SORTKEY on date or ID columns used in WHERE clauses. Run ANALYZE and VACUUM after large backfills.
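If you need to change these choices after the tables already exist, Redshift lets you alter them in place; the maintenance commands are plain statements (table names as assumed above):

```sql
-- Adjust distribution and sort keys on an existing table.
ALTER TABLE analytics.fact_orders ALTER DISTSTYLE KEY DISTKEY customer_sk;
ALTER TABLE analytics.fact_orders ALTER SORTKEY (order_date);

-- Refresh planner statistics and reclaim space after a large backfill.
ANALYZE analytics.fact_orders;
VACUUM analytics.fact_orders;
```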
Heavy, read-only dashboards may prefer a single wide table. You can keep both: normalized tables for writes and a denormalized reporting table refreshed by CTAS.
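One way to keep both, sketched with the names assumed above: rebuild the reporting table with CTAS after each load of the normalized tables.

```sql
DROP TABLE IF EXISTS analytics.rpt_orders_wide;

CREATE TABLE analytics.rpt_orders_wide
DISTKEY (customer_id)
SORTKEY (order_date)
AS
SELECT f.order_id,
       f.order_date,
       c.customer_id,
       c.customer_name,
       c.region,
       p.product_name,
       p.category,
       f.quantity,
       f.total_amount
FROM analytics.fact_orders f
JOIN analytics.dim_customer c ON c.customer_sk = f.customer_sk
JOIN analytics.dim_product  p ON p.product_sk  = f.product_sk;
```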
Yes, you can declare primary key and foreign key constraints, but they are informational only. Redshift does not enforce referential integrity at runtime, so you must manage consistency in ETL code; the query planner can still use the declared constraints as hints.
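Declaring them on the assumed tables is straightforward; Redshift records the constraints, but nothing stops an orphaned fact row:

```sql
ALTER TABLE analytics.dim_customer
    ADD CONSTRAINT pk_dim_customer PRIMARY KEY (customer_sk);

ALTER TABLE analytics.fact_orders
    ADD CONSTRAINT fk_fact_orders_customer
    FOREIGN KEY (customer_sk) REFERENCES analytics.dim_customer (customer_sk);
```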
For small daily loads, running VACUUM DELETE ONLY weekly is usually enough. A full VACUUM can be expensive; schedule it during low-traffic windows.
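The two variants, using the assumed fact table (the sort threshold is illustrative):

```sql
-- Weekly: reclaim space from deleted rows without re-sorting.
VACUUM DELETE ONLY analytics.fact_orders;

-- Occasionally, in a quiet window: reclaim space and re-sort.
VACUUM FULL analytics.fact_orders TO 99 PERCENT;
```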
Yes. Use PartiQL to query nested elements, then CTAS to write them into relational dimension tables.
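A sketch, assuming the raw orders land in a staging table with a SUPER column named payload whose items field is an array of product objects; PartiQL unnests the array and CTAS materializes the result:

```sql
-- Assumed raw table: stg.orders_json (order_id VARCHAR, payload SUPER).
CREATE TABLE analytics.dim_product_from_json
DISTKEY (sku)
SORTKEY (sku)
AS
SELECT DISTINCT
       item.sku::VARCHAR(64)      AS sku,
       item.name::VARCHAR(256)    AS product_name,
       item.category::VARCHAR(64) AS category
FROM stg.orders_json o,
     o.payload.items AS item;     -- PartiQL unnesting of the SUPER array
```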