An incremental load strategy in BigQuery is a data-ingestion approach that adds or updates only the new or changed records since the last load, rather than reloading an entire dataset.
Google BigQuery is designed for petabyte-scale analytics, and its columnar storage means you typically pay only for the data you scan. Yet how you ingest data has a massive effect on cost, query latency, and operational complexity. An incremental load strategy focuses on capturing only the delta—new or modified rows—since the previous load cycle. Compared with full refreshes, incremental loads minimize data movement, accelerate downstream queries, and make your ETL/ELT pipelines far more efficient.
Instead of rewriting gigabytes (or terabytes) of data daily, you write only kilobytes or megabytes of change data. This lowers bytes processed, slot consumption, and storage churn.
Incremental loads shorten pipeline runtimes, so data arrives in the warehouse sooner. Their smaller blast radius also reduces the odds that one bad file or a schema mismatch will require re-loading an entire table.
A watermark is the high-water point—usually a timestamp or monotonically increasing ID—that marks the newest record successfully ingested. The next run queries the source system for rows where updated_at > last_watermark. In BigQuery you can store this value in a metadata table or Cloud Storage, then read it at runtime.
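For example, a minimal watermark lookup at the start of a run might look like this, assuming the admin.pipeline_state metadata table created later in this guide:
DECLARE wm TIMESTAMP;
-- Read the last successfully ingested timestamp for the orders pipeline
SET wm = (
  SELECT last_ts
  FROM admin.pipeline_state
  WHERE table_name = 'orders'
);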
Most incremental tables are PARTITION BY DATE(updated_at), or partitioned on ingestion time via _PARTITIONTIME when loading from Cloud Storage. Clustering on a high-cardinality key such as customer_id improves update performance by co-locating related rows.
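A sketch of such a table definition, with illustrative column names:
CREATE TABLE analytics.orders (
  order_id INT64,
  customer_id INT64,
  status STRING,
  updated_at TIMESTAMP
)
-- Daily partitions let BigQuery prune scans and rewrite only recent data
PARTITION BY DATE(updated_at)
-- Clustering co-locates a customer's rows within each partition
CLUSTER BY customer_id;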
There are two dominant patterns:
Upsert: merge the delta into the target table. BigQuery's MERGE syntax handles this atomically without rewriting unaffected partitions.
Append-only: insert every change as a new row with valid_from / valid_to dates (a Type-2 Slowly Changing Dimension). Downstream queries rebuild the current state with window functions, as sketched below.
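A minimal sketch of rebuilding current state from an append-only history table; the table and column names (analytics.orders_history, valid_from) are illustrative:
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    -- Keep only the most recent version of each order
    ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY valid_from DESC) AS rn
  FROM analytics.orders_history
)
WHERE rn = 1;
The walkthrough below follows the upsert pattern. Create a table to hold the latest processed timestamp per source table: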
CREATE TABLE admin.pipeline_state (
table_name STRING,
last_ts TIMESTAMP
);
Your Cloud Function, Dataflow job, or Airbyte connector queries the source system:
SELECT *
FROM source_db.orders
WHERE updated_at > @last_ts;
Land the delta in a staging table, ideally partitioned by the extract date:
INSERT INTO staging.orders_delta
SELECT *
FROM EXTERNAL_QUERY(...);
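Because EXTERNAL_QUERY requires a literal query string, one way to inject the watermark is dynamic SQL. A hedged sketch, where us.orders_conn is a hypothetical Cloud SQL connection ID:
DECLARE wm TIMESTAMP DEFAULT (
  SELECT last_ts FROM admin.pipeline_state WHERE table_name = 'orders'
);
-- Build the pushed-down source query with the watermark inlined
EXECUTE IMMEDIATE FORMAT("""
  INSERT INTO staging.orders_delta
  SELECT * FROM EXTERNAL_QUERY(
    'us.orders_conn',
    "SELECT * FROM orders WHERE updated_at > TIMESTAMP '%t'"
  )
""", wm);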
Apply the delta with an atomic upsert. BigQuery has no UPDATE SET * shorthand, so list each column explicitly (the columns shown are illustrative):
MERGE analytics.orders AS tgt
USING staging.orders_delta AS src
ON tgt.order_id = src.order_id
WHEN MATCHED AND tgt.updated_at < src.updated_at THEN
  UPDATE SET customer_id = src.customer_id, status = src.status, updated_at = src.updated_at
WHEN NOT MATCHED THEN
  INSERT ROW;
Finally, advance the watermark. BigQuery does not support PostgreSQL's ON CONFLICT clause, so upsert the state row with another MERGE:
MERGE admin.pipeline_state AS tgt
USING (
  SELECT 'orders' AS table_name, MAX(updated_at) AS last_ts
  FROM staging.orders_delta
) AS src
ON tgt.table_name = src.table_name
WHEN MATCHED THEN
  UPDATE SET last_ts = src.last_ts
WHEN NOT MATCHED THEN
  INSERT (table_name, last_ts) VALUES (src.table_name, src.last_ts);
Schedule the pipeline in Cloud Composer or Cloud Scheduler and alert on job errors, watermark gaps, or sudden row-count spikes.
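A simple freshness check that can back such an alert; the two-hour threshold is an assumption for an hourly pipeline:
-- Flag any pipeline whose watermark has not advanced recently
SELECT table_name, last_ts
FROM admin.pipeline_state
WHERE last_ts < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR);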
Even if your source updates hourly, land the delta in a daily partition for long-term storage. This balances query pruning and partition overhead.
If ingesting from Cloud Spanner, Cloud SQL, or MySQL/Postgres via Datastream, let the change stream feed BigQuery directly. This native CDC service emits inserts, updates, and deletes with metadata so you can reconstruct state without manual watermarks.
Incremental tables often store PII such as email or phone. Apply policy tags and authorized views on day one to avoid refactoring later.
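A sketch of attaching a policy tag at table-creation time; the taxonomy path is a placeholder for a real Data Catalog policy tag:
CREATE TABLE analytics.customers (
  customer_id INT64,
  -- Column-level security: readers need fine-grained access to this policy tag
  email STRING OPTIONS (
    policy_tags = ['projects/my-project/locations/us/taxonomies/123/policyTags/456']
  )
);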
Validate each load by comparing expected versus actual counts and computing hash totals on business keys. Persist the results in a load_audit table.
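One way to record such checks, assuming a load_audit table with these illustrative columns:
INSERT INTO admin.load_audit (table_name, load_ts, row_count, key_hash)
SELECT
  'orders',
  CURRENT_TIMESTAMP(),
  COUNT(*),
  -- Order-independent hash total over the business key
  BIT_XOR(FARM_FINGERPRINT(CAST(order_id AS STRING)))
FROM staging.orders_delta;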
Galaxy’s context-aware autocomplete surfaces partition and clustering info inline as you craft your MERGE statements. Saved queries can be Endorsed so teammates reuse the exact same watermark logic. When the schema evolves—say updated_at changes type—the AI Copilot rewrites the query and flags outdated collections, eliminating silent failures during the next run.
Consider a SaaS company that captures billing events in PostgreSQL. Instead of nightly full dumps, they stream inserts and updates every 5 minutes. Since each micro-batch is only a few hundred rows, costs stay at a few pennies per run while dashboards refresh in near real time. BigQuery's MERGE completes in under a second because only the current day's partition is touched.
A common misconception is that a partitioned table is automatically incremental. Partitioning helps prune scans but does not guarantee you process only new data; you still need change detection or a CDC feed.
Another is that BigQuery cannot perform upserts. Before 2018 this was true, but the MERGE statement now supports multi-row atomic upserts with complex conditions.
A third is that incremental loads cannot handle deletes. You can handle them by switching to the Type-2 append-only pattern or by adding a WHEN MATCHED AND src.is_deleted = TRUE THEN DELETE clause, as sketched below.
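A minimal sketch, assuming the source emits an is_deleted flag on each changed row:
MERGE analytics.orders AS tgt
USING staging.orders_delta AS src
ON tgt.order_id = src.order_id
-- Physically remove rows the source has deleted
WHEN MATCHED AND src.is_deleted = TRUE THEN
  DELETE
WHEN MATCHED AND tgt.updated_at < src.updated_at THEN
  UPDATE SET status = src.status, updated_at = src.updated_at
-- Never insert rows that arrive already deleted
WHEN NOT MATCHED AND src.is_deleted = FALSE THEN
  INSERT (order_id, customer_id, status, updated_at)
  VALUES (src.order_id, src.customer_id, src.status, src.updated_at);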
An incremental load strategy in BigQuery couples watermarks, partitioned tables, and MERGE operations to move data efficiently. By automating the delta extraction and leveraging Galaxy’s collaborative SQL editor, you ensure your warehouse stays fresh, accurate, and inexpensive.
Full-table reloads waste compute, inflate storage bills, and delay analytics. Incremental loading processes only new or changed rows, cutting costs and ensuring near-real-time reporting—critical for modern data engineering and analytics teams handling rapidly growing datasets.
An incremental load strategy is the process of loading only new or modified records—identified via timestamps, IDs, or change streams—into a BigQuery table, instead of reloading the entire dataset.
Frequency depends on business requirements and source-system limits. Common cadences range from every 5 minutes (near real-time dashboards) to hourly or daily for reporting workloads.
Galaxy’s AI Copilot autocompletes MERGE syntax, suggests partition keys, and flags missing watermarks, while Collections let teams endorse the final query so everyone uses the same logic.
If the source lacks an updated_at column, you can add database triggers to populate one, enable CDC (e.g., Datastream), or fall back to hash-based change detection, where you compare row hashes between extracts.
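A minimal sketch of hash-based change detection, assuming two staged extracts with illustrative names:
SELECT curr.*
FROM staging.extract_today AS curr
LEFT JOIN staging.extract_yesterday AS prev
  ON curr.order_id = prev.order_id
-- New rows, plus rows whose contents changed between extracts
WHERE prev.order_id IS NULL
   OR FARM_FINGERPRINT(TO_JSON_STRING(curr)) != FARM_FINGERPRINT(TO_JSON_STRING(prev));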