Data modeling in ClickHouse is the process of designing tables, engines, and partitioning strategies to maximize read/write performance and compression.
Column-oriented storage, append-only writes, and aggressive compression demand schemas that minimize random updates, leverage ordering keys, and favor wide fact tables over many joins.
Use MergeTree when you need custom partitioning and primary keys, ReplacingMergeTree to deduplicate rows, and AggregatingMergeTree to pre-aggregate metrics.
Create a wide Orders fact table with denormalized customer and product attributes to reduce joins. Split infrequently queried text columns into separate tables if they bloat storage.
CREATE TABLE OrdersWide
(
id UInt64,
order_date Date,
customer_id UInt64,
customer_name String,
total_amount Decimal(10,2),
product_ids Array(UInt64),
quantities Array(UInt8)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(order_date)
ORDER BY (order_date, id);
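A filter on the partition and sorting key shows why this layout pays off. A query like the following (the date range is illustrative) scans only the matching monthly partition and seeks within the sparse primary index:

SELECT count(), sum(total_amount)
FROM OrdersWide
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';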
Store the raw dimension in a lookup table using ReplacingMergeTree and embed surrogate keys in the fact table. Use LowCardinality(String) for text values that compress well.
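A minimal sketch of such a lookup table; the Customers name and columns are assumptions, and ReplacingMergeTree(updated_at) keeps the row with the highest version when parts merge:

CREATE TABLE Customers
(
    customer_id UInt64,
    customer_name String,
    segment LowCardinality(String),
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY customer_id;

Deduplication happens only at merge time, so query with FINAL (or an argMax aggregation) when you need the latest version guaranteed.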
Partition by coarse-grain columns, such as month or week on order_date, to balance mutation cost and query pruning. Never partition by high-cardinality columns such as customer_id.
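For a weekly grain, toMonday(order_date) works the same way. A sketch reusing the OrdersWide structure (the OrdersWeekly name is illustrative):

CREATE TABLE OrdersWeekly AS OrdersWide
ENGINE = MergeTree()
PARTITION BY toMonday(order_date)
ORDER BY (order_date, id);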
Pre-aggregate when queries repeatedly group by the same keys with large scans. Use AggregatingMergeTree to store running sums and counts.
CREATE TABLE DailyRevenue
(
order_date Date,
total_amount AggregateFunction(sum, Decimal(10,2))
)
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(order_date)
ORDER BY order_date;
Materialized views automatically keep derived tables in sync. Create one that writes to DailyRevenue whenever OrdersWide receives new data.
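A sketch of that view, assuming the OrdersWide and DailyRevenue definitions above (the name DailyRevenueMV is illustrative); aggregate states are written with sumState and read back with sumMerge:

CREATE MATERIALIZED VIEW DailyRevenueMV TO DailyRevenue AS
SELECT
    order_date,
    sumState(total_amount) AS total_amount
FROM OrdersWide
GROUP BY order_date;

-- Query the rollup by merging the stored states:
SELECT order_date, sumMerge(total_amount) AS revenue
FROM DailyRevenue
GROUP BY order_date
ORDER BY order_date;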
Two pitfalls dominate in practice. First, over-normalizing tables causes expensive joins. Second, choosing a primary key that doesn’t match query filters leads to full scans.
ClickHouse sustains high ingest rates as long as writes are batched: use INSERTs with thousands of rows to avoid per-row overhead.
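For example, a single multi-row INSERT into the OrdersWide table above (values are illustrative):

INSERT INTO OrdersWide (id, order_date, customer_id, customer_name, total_amount, product_ids, quantities)
VALUES
    (1, '2024-01-15', 42, 'Acme Corp', 199.90, [10, 11], [2, 1]),
    (2, '2024-01-15', 43, 'Globex', 59.00, [12], [3]);

In production each batch would carry thousands of rows, typically accumulated in a buffer before flushing.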
Row updates are possible, but mutations are asynchronous and costly. Prefer versioning or ReplacingMergeTree for soft updates.
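A mutation looks like an ordinary UPDATE but rewrites whole data parts in the background; the soft-update alternative re-inserts a newer version and lets merges discard the old one (the filter and values are illustrative, and Customers refers to the sketch above):

-- Asynchronous mutation: rewrites every part containing matching rows
ALTER TABLE OrdersWide UPDATE customer_name = 'Acme Corp.' WHERE customer_id = 42;

-- Soft update via ReplacingMergeTree: insert a newer version instead
INSERT INTO Customers VALUES (42, 'Acme Corp.', 'enterprise', now());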
Apply LowCardinality(String) to string columns with ≤100k distinct values to save memory and improve compression.
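An existing column can be converted in place; here customer_name from OrdersWide is used for illustration:

ALTER TABLE OrdersWide MODIFY COLUMN customer_name LowCardinality(String);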