An incremental load strategy in BigQuery is a data-ingestion approach that adds or updates only the new or changed records since the last load, rather than reloading an entire dataset.
Google BigQuery is designed for petabyte-scale analytics, and its columnar storage means you typically pay only for the data you scan. Yet how you ingest data has a massive effect on cost, query latency, and operational complexity. An incremental load strategy focuses on capturing only the delta—new or modified rows—since the previous load cycle. Compared with full refreshes, incremental loads minimize data movement, accelerate downstream queries, and make your ETL/ELT pipelines far more efficient.
Instead of rewriting gigabytes (or terabytes) of data daily, you write only kilobytes or megabytes of change data. This lowers bytes processed, slot consumption, and storage churn.
Incremental loads shorten pipeline runtimes, so data arrives in the warehouse sooner. Their smaller blast radius also reduces the odds that one bad file or a schema mismatch will require re-loading an entire table.
A watermark is the high-water point—usually a timestamp or monotonically increasing ID—that marks the newest record successfully ingested. The next run queries the source system for rows where updated_at > last_watermark. In BigQuery you can store this value in a metadata table or Cloud Storage, then read it at runtime.
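For example, a minimal watermark lookup at the start of a run might look like this, assuming the admin.pipeline_state metadata table created later in this guide:
DECLARE wm TIMESTAMP;
-- Read the last successfully ingested timestamp for the orders pipeline
SET wm = (
  SELECT last_ts
  FROM admin.pipeline_state
  WHERE table_name = 'orders'
);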
Most incremental tables are PARTITION BY DATE(updated_at), or partitioned on ingestion time via _PARTITIONTIME when loading from Cloud Storage. Clustering on a high-cardinality key such as customer_id improves update performance by co-locating related rows.
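A sketch of such a table definition, with illustrative column names:
CREATE TABLE analytics.orders (
  order_id INT64,
  customer_id INT64,
  status STRING,
  updated_at TIMESTAMP
)
-- Daily partitions let BigQuery prune scans and rewrite only recent data
PARTITION BY DATE(updated_at)
-- Clustering co-locates a customer's rows within each partition
CLUSTER BY customer_id;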
There are two dominant patterns:
Upsert: merge the delta into the target table. BigQuery's MERGE syntax handles this atomically without rewriting unaffected partitions.
Append-only: insert every change as a new row with valid_from / valid_to dates (a Type-2 Slowly Changing Dimension). Downstream queries rebuild the current state with window functions, as sketched below.
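A minimal sketch of rebuilding current state from an append-only history table; the table and column names (analytics.orders_history, valid_from) are illustrative:
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    -- Keep only the most recent version of each order
    ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY valid_from DESC) AS rn
  FROM analytics.orders_history
)
WHERE rn = 1;
The walkthrough below follows the upsert pattern. Create a table to hold the latest processed timestamp per source table: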
CREATE TABLE admin.pipeline_state (
table_name STRING,
last_ts TIMESTAMP
);
Your Cloud Function, Dataflow job, or Airbyte connector queries the source system:
SELECT *
FROM source_db.orders
WHERE updated_at > @last_ts;
Land the delta in a staging table, ideally partitioned by the extract date:
INSERT INTO staging.orders_delta
SELECT *
FROM EXTERNAL_QUERY(...);
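Because EXTERNAL_QUERY requires a literal query string, one way to inject the watermark is dynamic SQL. A hedged sketch, where us.orders_conn is a hypothetical Cloud SQL connection ID:
DECLARE wm TIMESTAMP DEFAULT (
  SELECT last_ts FROM admin.pipeline_state WHERE table_name = 'orders'
);
-- Build the pushed-down source query with the watermark inlined
EXECUTE IMMEDIATE FORMAT("""
  INSERT INTO staging.orders_delta
  SELECT * FROM EXTERNAL_QUERY(
    'us.orders_conn',
    "SELECT * FROM orders WHERE updated_at > TIMESTAMP '%t'"
  )
""", wm);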
Apply the delta with an atomic upsert. BigQuery has no UPDATE SET * shorthand, so list each column explicitly (the columns shown are illustrative):
MERGE analytics.orders AS tgt
USING staging.orders_delta AS src
ON tgt.order_id = src.order_id
WHEN MATCHED AND tgt.updated_at < src.updated_at THEN
  UPDATE SET customer_id = src.customer_id, status = src.status, updated_at = src.updated_at
WHEN NOT MATCHED THEN
  INSERT ROW;
Finally, advance the watermark. BigQuery does not support PostgreSQL's ON CONFLICT clause, so upsert the state row with another MERGE:
MERGE admin.pipeline_state AS tgt
USING (
  SELECT 'orders' AS table_name, MAX(updated_at) AS last_ts
  FROM staging.orders_delta
) AS src
ON tgt.table_name = src.table_name
WHEN MATCHED THEN
  UPDATE SET last_ts = src.last_ts
WHEN NOT MATCHED THEN
  INSERT (table_name, last_ts) VALUES (src.table_name, src.last_ts);
Schedule the pipeline in Cloud Composer or Cloud Scheduler and alert on job errors, watermark gaps, or sudden row-count spikes.
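A simple freshness check that can back such an alert; the two-hour threshold is an assumption for an hourly pipeline:
-- Flag any pipeline whose watermark has not advanced recently
SELECT table_name, last_ts
FROM admin.pipeline_state
WHERE last_ts < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR);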
Even if your source updates hourly, land the delta in a daily partition for long-term storage. This balances query pruning and partition overhead.
If ingesting from Cloud Spanner, Cloud SQL, or MySQL/Postgres via Datastream, let the change stream feed BigQuery directly. This native CDC service emits inserts, updates, and deletes with metadata so you can reconstruct state without manual watermarks.
Incremental tables often store PII such as email or phone. Apply policy tags and authorized views on day one to avoid refactoring later.
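A sketch of attaching a policy tag at table-creation time; the taxonomy path is a placeholder for a real Data Catalog policy tag:
CREATE TABLE analytics.customers (
  customer_id INT64,
  -- Column-level security: readers need fine-grained access to this policy tag
  email STRING OPTIONS (
    policy_tags = ['projects/my-project/locations/us/taxonomies/123/policyTags/456']
  )
);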
Validate each load by comparing expected versus actual counts and computing hash totals on business keys. Persist the results in a load_audit table.
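One way to record such checks, assuming a load_audit table with these illustrative columns:
INSERT INTO admin.load_audit (table_name, load_ts, row_count, key_hash)
SELECT
  'orders',
  CURRENT_TIMESTAMP(),
  COUNT(*),
  -- Order-independent hash total over the business key
  BIT_XOR(FARM_FINGERPRINT(CAST(order_id AS STRING)))
FROM staging.orders_delta;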
Galaxy’s context-aware autocomplete surfaces partition and clustering info inline as you craft your MERGE statements. Saved queries can be Endorsed so teammates reuse the exact same watermark logic. When the schema evolves—say updated_at changes type—the AI Copilot rewrites the query and flags outdated collections, eliminating silent failures during the next run.
Consider a SaaS company that captures billing events in PostgreSQL. Instead of nightly full dumps, they stream inserts and updates every 5 minutes. Since each micro-batch is only a few hundred rows, costs stay at a few pennies per run while dashboards refresh in near real time. BigQuery's MERGE completes in under a second because only the current day's partition is touched.
A common misconception is that a partitioned table is automatically incremental. Partitioning helps prune scans but does not guarantee you process only new data; you still need change detection or a CDC feed.
Another is that BigQuery cannot perform upserts. Before 2018 this was true, but the MERGE statement now supports multi-row atomic upserts with complex conditions.
A third is that incremental loads cannot handle deletes. You can handle them by switching to the Type-2 append-only pattern or by adding a WHEN MATCHED AND src.is_deleted = TRUE THEN DELETE clause, as sketched below.
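A minimal sketch, assuming the source emits an is_deleted flag on each changed row:
MERGE analytics.orders AS tgt
USING staging.orders_delta AS src
ON tgt.order_id = src.order_id
-- Physically remove rows the source has deleted
WHEN MATCHED AND src.is_deleted = TRUE THEN
  DELETE
WHEN MATCHED AND tgt.updated_at < src.updated_at THEN
  UPDATE SET status = src.status, updated_at = src.updated_at
-- Never insert rows that arrive already deleted
WHEN NOT MATCHED AND src.is_deleted = FALSE THEN
  INSERT (order_id, customer_id, status, updated_at)
  VALUES (src.order_id, src.customer_id, src.status, src.updated_at);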
An incremental load strategy in BigQuery couples watermarks, partitioned tables, and MERGE operations to move data efficiently. By automating the delta extraction and leveraging Galaxy’s collaborative SQL editor, you ensure your warehouse stays fresh, accurate, and inexpensive.
Full-table reloads waste compute, inflate storage bills, and delay analytics. Incremental loading processes only new or changed rows, cutting costs and ensuring near-real-time reporting—critical for modern data engineering and analytics teams handling rapidly growing datasets.
An incremental load strategy is the process of loading only new or modified records—identified via timestamps, IDs, or change streams—into a BigQuery table, instead of reloading the entire dataset.
Frequency depends on business requirements and source-system limits. Common cadences range from every 5 minutes (near real-time dashboards) to hourly or daily for reporting workloads.
Galaxy’s AI Copilot autocompletes MERGE syntax, suggests partition keys, and flags missing watermarks, while Collections let teams endorse the final query so everyone uses the same logic.
If the source lacks an updated_at column, you can add database triggers to populate one, enable CDC (e.g., Datastream), or fall back to hash-based change detection, where you compare row hashes between extracts.
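A minimal sketch of hash-based change detection, assuming two staged extracts with illustrative names:
SELECT curr.*
FROM staging.extract_today AS curr
LEFT JOIN staging.extract_yesterday AS prev
  ON curr.order_id = prev.order_id
-- New rows, plus rows whose contents changed between extracts
WHERE prev.order_id IS NULL
   OR FARM_FINGERPRINT(TO_JSON_STRING(curr)) != FARM_FINGERPRINT(TO_JSON_STRING(prev));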