Break large, denormalized tables into smaller, related tables to eliminate redundancy and speed up analytical queries.
Normalization converts a wide, duplicate-heavy table into multiple related tables, reducing storage, preventing update anomalies, and improving query efficiency.
Normalize when the original table repeats customer, product, or order details. Splitting these into Customers, Orders, Products, and OrderItems tables minimizes redundancy and speeds up targeted lookups.
Create one MergeTree table per entity with a stable primary key. Use LowCardinality for lookup columns and materialized views for fast aggregates.
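A minimal sketch of such entity tables, using assumed column names (customer_id, order_id, product_id, and so on) rather than the original schema; a Products table follows the same pattern and is sketched further below:

    -- Customers: one row per customer, keyed by a stable surrogate id
    CREATE TABLE Customers
    (
        customer_id UInt64,
        name        String,
        country     LowCardinality(String)  -- repeated categorical text compresses well
    )
    ENGINE = MergeTree
    ORDER BY customer_id;

    -- Orders: one row per order, linked to Customers by customer_id
    CREATE TABLE Orders
    (
        order_id    UInt64,
        customer_id UInt64,
        order_date  Date32,
        status      LowCardinality(String)
    )
    ENGINE = MergeTree
    ORDER BY order_id;

    -- OrderItems: one row per line item, linked to Orders and Products
    CREATE TABLE OrderItems
    (
        order_id   UInt64,
        product_id UInt64,
        quantity   UInt32,
        price      Decimal(18, 2)
    )
    ENGINE = MergeTree
    ORDER BY (order_id, product_id);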
Use CREATE TABLE to define each entity, then INSERT INTO normalized_table SELECT ... FROM raw_table to backfill. Optional TTL clauses keep history lean.
The example query below moves data from a raw denormalized OrdersRaw table into Orders and OrderItems, linking them by order_id.
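A sketch of that backfill, assuming OrdersRaw carries one row per line item with columns such as order_id, customer_id, order_date, status, product_id, quantity, and price (the column names are assumptions):

    -- Collapse the line-item grain to one row per order
    INSERT INTO Orders (order_id, customer_id, order_date, status)
    SELECT
        order_id,
        any(customer_id),
        any(order_date),
        any(status)
    FROM OrdersRaw
    GROUP BY order_id;

    -- Keep one row per line item, linked back to Orders by order_id
    INSERT INTO OrderItems (order_id, product_id, quantity, price)
    SELECT order_id, product_id, quantity, price
    FROM OrdersRaw;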
Pick UInt64 surrogate keys, store dates as Date32, and declare LowCardinality(String) for categorical columns. Use ReplacingMergeTree where duplicate rows need to be collapsed.
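As an illustration, a Products table combining those choices (the version column updated_at is an assumption):

    CREATE TABLE Products
    (
        product_id   UInt64,                  -- surrogate key
        name         String,
        category     LowCardinality(String),  -- categorical text
        first_listed Date32,
        updated_at   DateTime                 -- version column: latest row wins
    )
    ENGINE = ReplacingMergeTree(updated_at)   -- collapses duplicates per sorting key at merge time
    ORDER BY product_id;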
Duplicate primary keys: MergeTree's ORDER BY key is a sort key, not a uniqueness constraint, so repeated inserts leave multiple rows per id. Always set ORDER BY (id) and rely on ReplacingMergeTree or query-time FINAL when exactly one row per key is required.
Using String instead of LowCardinality: plain String inflates disk usage; switch to LowCardinality for repeated text values.
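Two checks and fixes along those lines, using the tables assumed above:

    -- Find keys that appear more than once
    SELECT order_id, count() AS copies
    FROM Orders
    GROUP BY order_id
    HAVING copies > 1;

    -- With ReplacingMergeTree, force deduplication at query time if merges lag behind
    SELECT * FROM Products FINAL WHERE product_id = 42;

    -- Convert an existing String column to LowCardinality in place
    ALTER TABLE Orders MODIFY COLUMN status LowCardinality(String);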
No—proper joins on primary keys are fast, and column pruning means fewer bytes scanned.
Yes, create materialized views that aggregate normalized tables into wide reporting tables without touching source data.
Use INSERT SELECT for batch loads or a Kafka engine + materialized view for streaming sync.
ClickHouse joins on primary keys are well optimized, read only the referenced columns, and are often faster than scanning a wide denormalized table.
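A sketch of such a join over the tables assumed above; only the referenced columns are read, so the scan stays narrow:

    -- Revenue per customer country across three normalized tables
    SELECT
        c.country,
        sum(oi.quantity * oi.price) AS revenue
    FROM OrderItems AS oi
    INNER JOIN Orders AS o ON o.order_id = oi.order_id
    INNER JOIN Customers AS c ON c.customer_id = o.customer_id
    WHERE o.order_date >= toDate32('2024-01-01')
    GROUP BY c.country
    ORDER BY revenue DESC;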
Set up a Kafka or RabbitMQ pipeline feeding a raw table and use materialized views to insert into normalized tables in real time.
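A minimal sketch of the Kafka path, assuming a JSONEachRow topic named orders_raw and a broker at kafka:9092 (both placeholders):

    -- Kafka engine table: a streaming source, not durable storage
    CREATE TABLE OrdersQueue
    (
        order_id    UInt64,
        customer_id UInt64,
        order_date  Date32,
        status      String,
        product_id  UInt64,
        quantity    UInt32,
        price       Decimal(18, 2)
    )
    ENGINE = Kafka
    SETTINGS kafka_broker_list = 'kafka:9092',
             kafka_topic_list = 'orders_raw',
             kafka_group_name = 'normalize_orders',
             kafka_format = 'JSONEachRow';

    -- Materialized view: each consumed batch is written into the normalized table
    CREATE MATERIALIZED VIEW OrderItemsConsumer TO OrderItems AS
    SELECT order_id, product_id, quantity, price
    FROM OrdersQueue;

A second materialized view targeting Orders completes the split; since the stream repeats order_id per line item, that table would typically use ReplacingMergeTree or be deduplicated at query time.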
Create a materialized view or SELECT with joins; ClickHouse handles it without duplicating data.
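For example, a rollup kept current by a materialized view over OrderItems (table and column names continue the assumptions above):

    -- Pre-aggregated reporting table: running totals per product
    CREATE TABLE ProductTotals
    (
        product_id UInt64,
        units      UInt64,
        revenue    Decimal(38, 2)
    )
    ENGINE = SummingMergeTree
    ORDER BY product_id;

    -- Keeps ProductTotals in sync with every insert into OrderItems
    CREATE MATERIALIZED VIEW ProductTotalsMV TO ProductTotals AS
    SELECT
        product_id,
        sum(quantity)         AS units,
        sum(quantity * price) AS revenue
    FROM OrderItems
    GROUP BY product_id;

    -- Query with GROUP BY: SummingMergeTree collapses rows only at merge time
    SELECT product_id, sum(units) AS units, sum(revenue) AS revenue
    FROM ProductTotals
    GROUP BY product_id;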