BigQuery data modeling organizes tables, partitions, and relationships to minimize cost and maximize query speed.
Sound modeling cuts scan costs, accelerates queries, and simplifies maintenance. Poorly organized tables force full scans and hinder BI adoption.
Denormalized, partitioned, and clustered tables reduce shuffles and slot usage. Store semi-relational data (orders with items) in nested, repeated columns when joins are minimal.
Choose wide tables for write-once analytical data (event logs). Use star schemas—fact plus dimension tables—when dimensions change frequently or are reused across facts.
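A minimal star-schema sketch of the fact-plus-dimension layout above; dataset, table, and column names (shop, fct_order_items, dim_products) are illustrative assumptions, not from the source:

```sql
-- Fact table: one row per order line, partitioned for pruning.
CREATE TABLE shop.fct_order_items (
  order_id     STRING,
  customer_id  STRING,   -- logical FK to a customer dimension
  product_id   STRING,   -- logical FK to dim_products
  quantity     INT64,
  revenue      NUMERIC,
  order_date   DATE
)
PARTITION BY order_date;

-- Dimension table: small, reusable across facts, updated as attributes change.
CREATE TABLE shop.dim_products (
  product_id  STRING,
  name        STRING,
  category    STRING
);

-- Typical star query: filter the fact, join the dimension.
SELECT d.category, SUM(f.revenue) AS revenue
FROM shop.fct_order_items AS f
JOIN shop.dim_products AS d USING (product_id)
WHERE f.order_date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY d.category;
```

Because the dimension is a separate table, a category rename touches one row in dim_products instead of millions of fact rows.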
Partitioning prunes whole date ranges; clustering sorts rows, enabling block-level pruning inside each partition. Together they slash scanned bytes without manual indexes.
Partition on a column queried with range filters ≥90% of the time, typically PARTITION BY DATE(order_date) or TIMESTAMP_TRUNC(created_at, DAY).
Cluster on columns frequently filtered or grouped (customer_id, product_id); unlike partitioning, clustering handles high-cardinality columns well. Limit keys to ≤4 (BigQuery's maximum) to avoid diminishing returns.
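Combining both mechanisms might look like the following; the shop.orders table and its columns are assumed names for illustration:

```sql
-- Partition by day on the range-filtered column, then cluster
-- rows inside each partition by the common filter columns.
CREATE TABLE shop.orders (
  order_id     STRING,
  customer_id  STRING,
  product_id   STRING,
  created_at   TIMESTAMP,
  total        NUMERIC
)
PARTITION BY TIMESTAMP_TRUNC(created_at, DAY)
CLUSTER BY customer_id, product_id;

-- Both prunings apply: the date range prunes partitions,
-- the customer_id filter prunes clustered blocks within them.
SELECT SUM(total)
FROM shop.orders
WHERE created_at >= TIMESTAMP '2024-06-01'
  AND created_at <  TIMESTAMP '2024-07-01'
  AND customer_id = 'C-42';
```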
BigQuery stores arrays and structs column-wise, so nested OrderItems inside Orders keep data co-located and remove joins—ideal for daily exports from OLTP.
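A sketch of the nested orders-with-items pattern; table and field names are illustrative:

```sql
-- Order items nested as an ARRAY of STRUCTs: co-located with the
-- order row, so no join is needed at query time.
CREATE TABLE shop.orders_nested (
  order_id   STRING,
  order_date DATE,
  items      ARRAY<STRUCT<product_id STRING, quantity INT64, price NUMERIC>>
)
PARTITION BY order_date;

-- UNNEST flattens the repeated column on demand.
SELECT o.order_id, i.product_id, i.quantity * i.price AS line_total
FROM shop.orders_nested AS o, UNNEST(o.items) AS i
WHERE o.order_date = '2024-06-15';
```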
Add nullable columns instead of dropping or renaming. Use views to abstract old names. Schedule data quality tests to catch breaking changes early.
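The additive-evolution pattern above, sketched with assumed table and column names (orders, discount, orders_v1):

```sql
-- Additive change: a new nullable column leaves existing readers unaffected.
ALTER TABLE shop.orders ADD COLUMN discount NUMERIC;

-- A view keeps a legacy column name alive for downstream consumers
-- instead of renaming the underlying column.
CREATE OR REPLACE VIEW shop.orders_v1 AS
SELECT order_id, customer_id, total AS order_total  -- old name preserved
FROM shop.orders;
```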
Adopt descriptive naming (stg_, dim_, fct_), document column meaning with column descriptions (queryable via INFORMATION_SCHEMA.COLUMN_FIELD_PATHS), and lock production datasets with IAM roles (dataViewer, dataOwner).
No enforced keys exist. Model relationships logically and use naming conventions or scripts for validation.
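One such validation script, sketched with assumed table names: a periodic query that surfaces fact rows whose logical foreign key has no matching dimension row.

```sql
-- No FK enforcement, so check for orphans yourself:
-- order rows whose customer_id is missing from the dimension.
SELECT o.customer_id, COUNT(*) AS orphan_rows
FROM shop.orders AS o
LEFT JOIN shop.dim_customers AS c USING (customer_id)
WHERE c.customer_id IS NULL
GROUP BY o.customer_id;
```

Wiring this into a scheduled query (or a dbt test) turns a silent modeling drift into an alert.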
One to four; four is BigQuery's hard limit. Columns later in the clustering order contribute little to sort locality, so additional keys give diminishing benefit.
No direct ALTER. Create a new table with desired partitioning, backfill data with INSERT … SELECT, then swap names.
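The backfill-and-swap can be done in a few statements; orders/orders_new are assumed names, and CREATE TABLE … AS SELECT combines creation and backfill in one step:

```sql
-- 1. New table with the desired partitioning, backfilled in one statement.
CREATE TABLE shop.orders_new
PARTITION BY DATE(created_at)
AS SELECT * FROM shop.orders;

-- 2. Swap names once the backfill is verified.
ALTER TABLE shop.orders RENAME TO orders_old;
ALTER TABLE shop.orders_new RENAME TO orders;
```

Keep the old table until downstream jobs have run cleanly against the new one, then drop it.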