Export ClickHouse tables, store them in Google Cloud Storage, and load them into BigQuery with type-safe schemas and minimal downtime.
Migrating from ClickHouse to BigQuery means moving table structures and data files from a self-managed ClickHouse cluster into Google BigQuery so you can run serverless analytics without managing infrastructure.
1️⃣ Export data from ClickHouse
2️⃣ Upload the export to Google Cloud Storage (GCS)
3️⃣ Create matching tables in BigQuery
4️⃣ Load the files from GCS into BigQuery
5️⃣ Validate row counts & query results
Use clickhouse-client --query, redirecting output or using INTO OUTFILE, to write each table to a compressed Parquet or CSV file. Parquet keeps column types intact, reducing BigQuery schema hassles.
clickhouse-client \
  --query="SELECT * FROM Customers" \
  --format=Parquet \
  > customers.parquet
Install the gcloud CLI and run gsutil cp. Keep paths consistent so your load jobs can use wildcards.
gsutil cp *.parquet gs://ecom-migration/2024-05/
Use the BigQuery UI or bq mk. Match ClickHouse types (e.g., String → STRING, Decimal(10,2) → NUMERIC).
CREATE TABLE `ecom.Orders` (
  id INT64,
  customer_id INT64,
  order_date DATE,
  total_amount NUMERIC
);
Run bq load --source_format=PARQUET --replace for each table. With Parquet, BigQuery reads column names and types directly from the file.
bq load --source_format=PARQUET \
  ecom.Customers \
  gs://ecom-migration/2024-05/customers.parquet
Run COUNT(*), MIN/MAX, or checksums in both systems. Differences usually indicate timezone or null-handling issues.
SELECT COUNT(*) FROM `ecom.Customers`;
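A minimal sketch of a side-by-side count check, assuming both CLIs are installed and authenticated and using the table names from the examples above:

# Row count in the source ClickHouse table
CH_COUNT=$(clickhouse-client --query="SELECT count() FROM Customers")

# Row count in the migrated BigQuery table (CSV output is a header row, then the value)
BQ_COUNT=$(bq query --use_legacy_sql=false --format=csv \
  'SELECT COUNT(*) FROM `ecom.Customers`' | tail -n 1)

echo "ClickHouse: ${CH_COUNT}  BigQuery: ${BQ_COUNT}"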
After loading, verify joins behave as expected.
SELECT c.name, o.order_date, o.total_amount
FROM `ecom.Customers` c
JOIN `ecom.Orders` o ON o.customer_id = c.id
WHERE o.order_date > DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);
• Export during low-traffic windows
• Replicate new ClickHouse inserts into a staging bucket
• Perform a final incremental load before cutover
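One way to handle that final incremental load, as a rough sketch: export only the rows written since the last full export and append them to the already-loaded table. The created_at column and the cutoff timestamp are assumptions, not part of the schemas shown above.

# Hypothetical cutoff: the timestamp of the last full export
CUTOFF="2024-05-31 00:00:00"

# Export only the new rows from ClickHouse
clickhouse-client \
  --query="SELECT * FROM Orders WHERE created_at >= '${CUTOFF}'" \
  --format=Parquet > orders_delta.parquet

# Stage the delta file alongside the full export
gsutil cp orders_delta.parquet gs://ecom-migration/2024-05/delta/

# Without --replace, bq load appends to the existing table
bq load --source_format=PARQUET \
  ecom.Orders \
  gs://ecom-migration/2024-05/delta/orders_delta.parquet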
Mistake 1 – Ignoring nested data types: ClickHouse Array or Nested columns need BigQuery ARRAY or STRUCT. Cast or flatten before export, as sketched below.
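For instance, a hypothetical tags Array(String) column on Customers could be handled either way before export (arrayJoin and arrayStringConcat are standard ClickHouse functions; the column itself is an assumption):

# Option 1: flatten the array into one row per element
clickhouse-client \
  --query="SELECT id, name, arrayJoin(tags) AS tag FROM Customers" \
  --format=Parquet > customers_tags_flat.parquet

# Option 2: cast the array to a single delimited string
clickhouse-client \
  --query="SELECT id, name, arrayStringConcat(tags, ',') AS tags_csv FROM Customers" \
  --format=Parquet > customers_flat.parquet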
Mistake 2 – Loading CSV without a schema file: BigQuery may auto-detect wrong types. Prefer Parquet or provide an explicit JSON schema.
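If you do stick with CSV, a minimal sketch of an explicit schema load (the file name, fields, and header handling are illustrative):

# customers_schema.json would hold a JSON array of field definitions, e.g.
# [{"name": "id", "type": "INT64"}, {"name": "name", "type": "STRING"}]
bq load --source_format=CSV --skip_leading_rows=1 \
  --schema=customers_schema.json \
  ecom.Customers \
  gs://ecom-migration/2024-05/customers.csv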
Use Galaxy’s AI copilot to generate consistent DDL and load scripts automatically, reducing manual errors.
Yes. Use Dataflow or Pub/Sub to stream ClickHouse inserts into BigQuery for near-real-time sync.
No direct equivalent. Recreate ClickHouse materialized views as scheduled queries or materialized views within BigQuery.
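For example, an aggregation that a ClickHouse materialized view kept up to date could be rebuilt as a BigQuery materialized view; the view name and rollup below are illustrative:

# Hypothetical rollup of daily revenue over the migrated Orders table
bq query --use_legacy_sql=false '
CREATE MATERIALIZED VIEW `ecom.daily_revenue` AS
SELECT order_date, SUM(total_amount) AS revenue
FROM `ecom.Orders`
GROUP BY order_date'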
Map ClickHouse users to Google IAM roles; BigQuery controls access at the dataset/table level.
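As a rough sketch, a read-only ClickHouse account might map to the BigQuery Data Viewer role; the project and member below are placeholders:

# Grant read-only access to BigQuery data at the project level
gcloud projects add-iam-policy-binding my-ecom-project \
  --member="user:analyst@example.com" \
  --role="roles/bigquery.dataViewer"
# Running queries also requires roles/bigquery.jobUser on the project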
Yes. Schedule a daily ClickHouse export of new partitions, upload it to GCS, and run bq load without --replace so the new rows are appended.
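As a sketch, the schedule could be a cron entry that wraps the export, upload, and load commands from the steps above with a filter on the previous day's date; the script path is hypothetical:

# Run the hypothetical incremental script at 02:00 every day
0 2 * * * /opt/migration/export_and_load_daily.sh >> /var/log/ch-to-bq.log 2>&1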
Parquet files are columnar and compressed; BigQuery reads them natively, saving storage and speeding up loads.
Replicate TTL logic with BigQuery scheduled queries that delete or partition-expire old data.
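For example, a 90-day ClickHouse TTL could be approximated either with a partition expiration (assuming ecom.Orders was created with date partitioning) or with a periodic DELETE:

# Expire partitions older than 90 days (7776000 seconds)
bq update --time_partitioning_expiration 7776000 ecom.Orders

# Or delete old rows explicitly; schedule this query to run daily
bq query --use_legacy_sql=false \
  'DELETE FROM `ecom.Orders` WHERE order_date < DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)'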