An incremental load strategy in BigQuery ingests only the data that is new or has changed since the last load, minimizing costs and speeding up ELT pipelines.
An incremental load strategy in Google BigQuery is an ELT/ETL approach that ingests and processes only the records that have been added or modified since the previous load cycle, rather than re-loading an entire source table. The goal is to reduce data-processing costs, limit query slots consumed, and shorten pipeline run times while still preserving accurate, up-to-date analytics tables.
BigQuery bills by the number of bytes processed and stored. Loading the full dataset every hour or day multiplies costs and amplifies the risk of hitting slot or reservation limits. Incremental loads minimize the volume of data scanned and moved, making near-real-time analytics financially viable. Several BigQuery features make them practical:
Partitioning by ingestion time (_PARTITIONTIME) or a logical date column lets BigQuery prune unnecessary partitions during reads, slashing scan costs.
Clustering on high-cardinality keys such as id or updated_at further narrows the storage blocks scanned within each partition.
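A minimal sketch of a target table that combines both techniques (the dataset and column names are illustrative):

```sql
-- Hypothetical target table for incremental loads.
-- Partition pruning works on DATE(updated_at); clustering on id
-- narrows the blocks scanned within each partition.
CREATE TABLE IF NOT EXISTS analytics.orders_current (
  id          STRING,
  customer_id STRING,
  amount      NUMERIC,
  updated_at  TIMESTAMP
)
PARTITION BY DATE(updated_at)
CLUSTER BY id;
```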
The BigQuery Storage Write API and Dataflow allow low-latency streaming of new records into a raw “landing” table. Streamed rows bypass staging files and become queryable within seconds.
MERGE allows you to upsert (insert, update, or delete) in one atomic statement. It’s the workhorse for Change Data Capture (CDC) style incremental loads.
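A basic upsert sketch (table and column names are illustrative, not from the original):

```sql
-- Upsert staged rows into the current-state table in one atomic statement.
MERGE analytics.orders_current AS t
USING analytics.orders_staging AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET customer_id = s.customer_id,
             amount      = s.amount,
             updated_at  = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, customer_id, amount, updated_at)
  VALUES (s.id, s.customer_id, s.amount, s.updated_at);
```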
If your data source is append-only (e.g., server logs), simply append new records to an ingestion-time partitioned table. Downstream transformations can rely on the ingestion timestamp to filter the last N days of data.
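For example, a downstream transformation can restrict its scan to recent ingestion partitions (the table name is illustrative):

```sql
-- Read only the last 7 days of an ingestion-time partitioned table.
-- _PARTITIONTIME lets BigQuery prune all older partitions.
SELECT *
FROM analytics.server_logs
WHERE _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY);
```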
Many operational databases emit change streams containing INSERT, UPDATE, and DELETE events. Land those events in a staging table, then periodically merge them into a target table that holds the current state:
Rows with op='DELETE' trigger a delete of the matching target row; rows carrying a newer updated_at trigger an update.

Some teams also store a daily “type 2” snapshot to keep historical versions while maintaining an up-to-date “current” table via MERGE. BigQuery supports time travel for up to 7 days, but snapshots provide longer retention.
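A hedged sketch of such a CDC merge, assuming the staging table carries an op column and these illustrative column names:

```sql
-- Apply change-stream events to the current-state table.
-- Table names, column names, and the op values are assumptions.
MERGE analytics.orders_current AS t
USING analytics.orders_changes AS s
ON t.id = s.id
WHEN MATCHED AND s.op = 'DELETE' THEN
  DELETE
WHEN MATCHED AND s.updated_at > t.updated_at THEN
  UPDATE SET amount = s.amount, updated_at = s.updated_at
WHEN NOT MATCHED AND s.op != 'DELETE' THEN
  INSERT (id, amount, updated_at)
  VALUES (s.id, s.amount, s.updated_at);
```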
Select a reliable updated_at timestamp or high-water-mark column. In the absence of a timestamp, maintain a monotonically increasing surrogate key or use log-based CDC tools like Datastream.
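One way to compute and apply the high-water mark with BigQuery scripting (table names are illustrative):

```sql
-- Capture the target's high-water mark, then pull only newer source rows.
DECLARE watermark TIMESTAMP DEFAULT (
  SELECT IFNULL(MAX(updated_at), TIMESTAMP '1970-01-01')
  FROM analytics.orders_current
);

SELECT *
FROM analytics.orders_raw
WHERE updated_at > watermark;
```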
Load the changed data into a raw (a.k.a. staging) table. Use partitioning by the landing date to simplify clean-up policies.
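As a sketch, BigQuery’s LOAD DATA statement can land exported batch files into the staging table (the bucket path and table name are illustrative):

```sql
-- Land exported change files into the staging table.
LOAD DATA INTO analytics.orders_staging
FROM FILES (
  format = 'AVRO',
  uris = ['gs://example-bucket/changes/dt=2024-01-01/*.avro']
);
```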
Use MERGE to bring the target table up to date. A typical pattern is:
Filter the source to rows whose updated_at is greater than the maximum updated_at already present in the target. Apply the changes with MERGE, writing to a partitioned, clustered target table. Finally, validate the load by comparing expected versus actual row counts. If a bug is discovered, you can backfill by re-running the job for the affected partitions without touching the entire table.
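The watermark-plus-MERGE pattern can be sketched as a single statement (all table and column names here are illustrative):

```sql
-- Merge only source rows newer than the target's high-water mark.
MERGE analytics.orders_current AS t
USING (
  SELECT id, amount, updated_at
  FROM analytics.orders_staging
  WHERE updated_at > (
    SELECT IFNULL(MAX(updated_at), TIMESTAMP '1970-01-01')
    FROM analytics.orders_current
  )
) AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET amount = s.amount, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, amount, updated_at)
  VALUES (s.id, s.amount, s.updated_at);
```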
Persist the last processed updated_at or partition id in a control table so jobs can resume gracefully after a failure. Configure partition expiration on staging tables (the partition_expiration_days table option) to expire obsolete partitions automatically and rein in storage costs. Publish metrics on each run, for example rows merged, bytes processed, and job duration.
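Partition expiration can be set on an existing table like so (the table name and 30-day retention are illustrative):

```sql
-- Automatically drop staging partitions older than 30 days.
ALTER TABLE analytics.orders_staging
SET OPTIONS (partition_expiration_days = 30);
```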
Use Cloud Logging sinks and INFORMATION_SCHEMA views (such as INFORMATION_SCHEMA.JOBS_BY_USER) to alert on anomalies.
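A monitoring sketch against the jobs view (the region qualifier is an assumption; adjust it to where your datasets live):

```sql
-- Recent jobs for the current user, with scan volume, for anomaly checks.
SELECT
  job_id,
  creation_time,
  total_bytes_processed,
  state
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC;
```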
Galaxy’s modern SQL editor streamlines designing and tuning MERGE statements, accelerating incremental-load development.
Because Galaxy connects directly to BigQuery, you can iterate on incremental-load queries, benchmark slot usage, and commit the approved version to CI/CD pipelines.
Watch out for common pitfalls. Normalize updated_at timestamps to UTC before comparison to avoid missed or duplicated rows. Make sure source deletions are captured explicitly (for example, via a soft-DELETE flag) and propagate them via MERGE.

Implementing an incremental load strategy in BigQuery is one of the highest-impact optimizations you can make for scalable, cost-efficient analytics. By combining partitioned tables, MERGE DML, and automated watermarks, and by avoiding common pitfalls, you can deliver near-real-time insights without blowing out your budget.
Data warehouses grow by gigabytes to terabytes daily. Reloading entire tables wastes BigQuery scan bytes and increases slot consumption, leading to higher costs and slower pipelines. Incremental loading minimizes data movement, enables near-real-time analytics, and keeps engineering teams compliant with SLAs—all while preserving historical accuracy.
Incremental loading ingests only new or changed data since the last run, whereas a full reload re-imports the entire dataset every time. Incremental loads are faster and cheaper.
No. If your source data is append-only (e.g., logs), a simple INSERT into partitioned tables may suffice. MERGE is essential when you must keep the target table in sync with updates and deletes.
Partitioning limits scans to relevant date ranges, so both the MERGE and downstream analytical queries read only the partitions that contain new or updated data.
Yes. Galaxy’s SQL editor and AI copilot help you scaffold MERGE statements, enforce coding standards, and share endorsed incremental-load scripts across your team.