Normalizing data in Amazon Redshift splits a wide, denormalized table into well-structured dimension and fact tables to improve query speed, reduce storage, and strengthen data quality.
Normalization reorganizes columns and rows so that each table holds one entity type, reducing redundancy and update anomalies. In Redshift, you usually migrate from a flat staging table to dimension and fact tables.
Even though Redshift is columnar, smaller dimension tables boost join performance, cut storage, and simplify incremental loads. They also let DISTKEY and SORTKEY settings work efficiently.
Use CREATE TABLE AS SELECT (CTAS) to extract unique customers and products. Apply DISTKEY on the surrogate key and SORTKEY on frequently filtered columns.
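As a minimal sketch, assuming a flat staging table named stg.orders_flat and an analytics schema (all names here are illustrative), a product dimension with a generated surrogate key could look like this:

```sql
-- Assumed staging table: stg.orders_flat (one row per order line,
-- with customer and product attributes repeated on every row).
CREATE TABLE analytics.dim_product
DISTKEY (product_sk)
SORTKEY (category, product_id)
AS
SELECT ROW_NUMBER() OVER (ORDER BY product_id) AS product_sk,  -- surrogate key
       product_id,                                             -- natural key
       product_name,
       category,
       unit_price
FROM (
    SELECT DISTINCT product_id, product_name, category, unit_price
    FROM stg.orders_flat
) AS src;
```

ROW_NUMBER is convenient for a one-off migration; for dimensions that are reloaded incrementally, the IDENTITY approach in the next step keeps surrogate keys stable.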
Add IDENTITY columns or the GENERATED BY DEFAULT AS IDENTITY syntax so each dimension row gets a compact integer surrogate key. Replace the natural keys in the fact table with these surrogate keys later.
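CTAS cannot declare an IDENTITY column, so one common pattern is to create the dimension first and then populate it with INSERT ... SELECT. A sketch using the same assumed names:

```sql
CREATE TABLE analytics.dim_customer (
    customer_sk   BIGINT IDENTITY(1, 1),   -- surrogate key, auto-generated
    customer_id   VARCHAR(64) NOT NULL,    -- natural key from the source system
    customer_name VARCHAR(256),
    region        VARCHAR(64)
)
DISTKEY (customer_sk)
SORTKEY (customer_id);

INSERT INTO analytics.dim_customer (customer_id, customer_name, region)
SELECT DISTINCT customer_id, customer_name, region
FROM stg.orders_flat;
```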
Insert from the staging orders table, joining to the new dimension tables to look up surrogate keys. Store metrics such as quantity and total_amount, and set the fact table's DISTKEY on the customer or order key.
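Continuing the same assumed schema, the fact load joins staging rows to the dimensions so that natural keys are swapped for surrogate keys:

```sql
CREATE TABLE analytics.fact_orders (
    order_id     VARCHAR(64) NOT NULL,
    customer_sk  BIGINT      NOT NULL,
    product_sk   BIGINT      NOT NULL,
    order_date   DATE        NOT NULL,
    quantity     INTEGER,
    total_amount DECIMAL(18, 2)
)
DISTKEY (customer_sk)   -- co-locates joins with dim_customer
SORTKEY (order_date);

INSERT INTO analytics.fact_orders
    (order_id, customer_sk, product_sk, order_date, quantity, total_amount)
SELECT o.order_id,
       c.customer_sk,
       p.product_sk,
       o.order_date,
       o.quantity,
       o.total_amount
FROM stg.orders_flat o
JOIN analytics.dim_customer c ON c.customer_id = o.customer_id
JOIN analytics.dim_product  p ON p.product_id  = o.product_id;
```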
Wrap the CTAS and INSERT-SELECT statements in a stored procedure. Use an AWS Lambda function or a scheduled Amazon EventBridge rule to call the procedure after each batch COPY into the staging schema.
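A sketch of such a procedure over the assumed tables; the Lambda function or EventBridge target would simply issue the CALL, for example through the Redshift Data API:

```sql
CREATE OR REPLACE PROCEDURE analytics.sp_load_star_schema()
AS $$
BEGIN
    -- Add customers from the latest batch that are not in the dimension yet
    -- (dim_product would be handled the same way).
    INSERT INTO analytics.dim_customer (customer_id, customer_name, region)
    SELECT DISTINCT s.customer_id, s.customer_name, s.region
    FROM stg.orders_flat s
    LEFT JOIN analytics.dim_customer d ON d.customer_id = s.customer_id
    WHERE d.customer_id IS NULL;

    -- Append new facts, translating natural keys to surrogate keys.
    INSERT INTO analytics.fact_orders
        (order_id, customer_sk, product_sk, order_date, quantity, total_amount)
    SELECT o.order_id, c.customer_sk, p.product_sk,
           o.order_date, o.quantity, o.total_amount
    FROM stg.orders_flat o
    JOIN analytics.dim_customer c ON c.customer_id = o.customer_id
    JOIN analytics.dim_product  p ON p.product_id  = o.product_id;
END;
$$ LANGUAGE plpgsql;

-- Run after each staging COPY:
CALL analytics.sp_load_star_schema();
```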
Pick DISTSTYLE KEY on the most common join key, usually the customer key. Keep SORTKEY on date or ID columns used in WHERE clauses. Run ANALYZE and VACUUM after large backfills.
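If you need to change these choices after the tables already exist, Redshift lets you alter them in place; the maintenance commands are plain statements (table names as assumed above):

```sql
-- Adjust distribution and sort keys on an existing table.
ALTER TABLE analytics.fact_orders ALTER DISTSTYLE KEY DISTKEY customer_sk;
ALTER TABLE analytics.fact_orders ALTER SORTKEY (order_date);

-- Refresh planner statistics and reclaim space after a large backfill.
ANALYZE analytics.fact_orders;
VACUUM analytics.fact_orders;
```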
Heavy, read-only dashboards may prefer a single wide table. You can keep both: normalized tables for writes and a denormalized reporting table refreshed by CTAS.
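One way to keep both, sketched with the names assumed above: rebuild the reporting table with CTAS after each load of the normalized tables.

```sql
DROP TABLE IF EXISTS analytics.rpt_orders_wide;

CREATE TABLE analytics.rpt_orders_wide
DISTKEY (customer_id)
SORTKEY (order_date)
AS
SELECT f.order_id,
       f.order_date,
       c.customer_id,
       c.customer_name,
       c.region,
       p.product_name,
       p.category,
       f.quantity,
       f.total_amount
FROM analytics.fact_orders f
JOIN analytics.dim_customer c ON c.customer_sk = f.customer_sk
JOIN analytics.dim_product  p ON p.product_sk  = f.product_sk;
```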
Yes, you can declare primary key and foreign key constraints, but they are informational only. Redshift does not enforce referential integrity at runtime, so you must manage consistency in ETL code; the query planner can still use the declared constraints as hints.
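Declaring them on the assumed tables is straightforward; Redshift records the constraints, but nothing stops an orphaned fact row:

```sql
ALTER TABLE analytics.dim_customer
    ADD CONSTRAINT pk_dim_customer PRIMARY KEY (customer_sk);

ALTER TABLE analytics.fact_orders
    ADD CONSTRAINT fk_fact_orders_customer
    FOREIGN KEY (customer_sk) REFERENCES analytics.dim_customer (customer_sk);
```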
For small daily loads, running VACUUM DELETE ONLY weekly is usually enough. A full VACUUM can be expensive; schedule it during low-traffic windows.
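The two variants, using the assumed fact table (the sort threshold is illustrative):

```sql
-- Weekly: reclaim space from deleted rows without re-sorting.
VACUUM DELETE ONLY analytics.fact_orders;

-- Occasionally, in a quiet window: reclaim space and re-sort.
VACUUM FULL analytics.fact_orders TO 99 PERCENT;
```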
Yes. Use PartiQL to query nested elements, then CTAS to write them into relational dimension tables.
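A sketch, assuming the raw orders land in a staging table with a SUPER column named payload whose items field is an array of product objects; PartiQL unnests the array and CTAS materializes the result:

```sql
-- Assumed raw table: stg.orders_json (order_id VARCHAR, payload SUPER).
CREATE TABLE analytics.dim_product_from_json
DISTKEY (sku)
SORTKEY (sku)
AS
SELECT DISTINCT
       item.sku::VARCHAR(64)      AS sku,
       item.name::VARCHAR(256)    AS product_name,
       item.category::VARCHAR(64) AS category
FROM stg.orders_json o,
     o.payload.items AS item;     -- PartiQL unnesting of the SUPER array
```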