Airflow Backfill Without Duplicates

Galaxy Glossary

How do I backfill historical data in Airflow without creating duplicate rows?

Executing historical runs of an Apache Airflow DAG in a way that guarantees idempotency—so repeated backfills never create duplicate rows in downstream data stores.


Description

Airflow Backfill Without Duplicates

Learn proven strategies to rerun Apache Airflow DAGs on historical dates while ensuring every downstream table remains free of duplicate records.

Definition

Backfilling in Apache Airflow is the act of rerunning a Directed Acyclic Graph (DAG) for past execution dates so that the data pipeline can populate historical partitions or correct previously missing data. Backfill without duplicates means doing so safely—each run is idempotent, producing the same result no matter how many times it is executed for the same time period.

Why Does It Matter?

Data teams often discover late-arriving source files, schema changes, or business logic updates long after a DAG has already been running in production. Without a reliable backfill strategy:

  • Dashboards show inconsistent historical metrics.
  • Analysts waste hours manually deleting duplicates.
  • Pipelines lose stakeholder trust.

Given Airflow’s popularity as an orchestration layer, mastering duplicate-free backfills is essential for every data engineer.

Core Principles

1. Idempotent Task Design

All operators touching storage systems (warehouses, object stores, Kafka topics) must be safe to run repeatedly: a second execution for the same period leaves the data in exactly the same state as the first. Common techniques include:

  • DELETE-then-INSERT: remove the target partition first (e.g. DELETE FROM target WHERE ds = '{{ ds }}') before re-loading. Note that TRUNCATE TABLE cannot be scoped with a WHERE clause, so partition-level deletes require DELETE.
  • Upserts (MERGE / INSERT ... ON CONFLICT); see the sketch below.
  • Writing to a temporary table and swapping partitions atomically.
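
To make the upsert pattern concrete, here is a minimal sketch of a MERGE wrapped in an Airflow task. It assumes Snowflake, hypothetical analytics.orders and staging.orders tables, and an order_id key; {{ ds }} is rendered by Airflow because sql is a templated field.

from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

# Hypothetical tables and key column; adapt to your schema.
MERGE_SQL = """
MERGE INTO analytics.orders AS t
USING (SELECT * FROM staging.orders WHERE ds = '{{ ds }}') AS s
    ON t.order_id = s.order_id AND t.ds = s.ds
WHEN MATCHED THEN UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN INSERT (order_id, ds, amount)
    VALUES (s.order_id, s.ds, s.amount);
"""

# Instantiate inside your `with DAG(...)` block.
upsert_orders = SnowflakeOperator(task_id="upsert_orders", sql=MERGE_SQL)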

2. Partition Awareness

Partition downstream tables on the same calendar grain (e.g., a daily ds column) that Airflow schedules on. Backfill scopes then align naturally with these partitions, making duplicate removal trivial.
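
For instance, a BigQuery-flavored DDL for a daily-partitioned target might look like this (table and column names are illustrative; most warehouses have an equivalent):

# Illustrative BigQuery-dialect DDL; run once when creating the target table.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS analytics.orders (
    order_id INT64,
    amount   NUMERIC,
    ds       DATE
)
PARTITION BY ds;
"""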

3. DAG Sterility

A sterile DAG encapsulates all state in its outputs; it never depends on mutable global state, non-deterministic random numbers, or the wall-clock time at which it runs. Sterility ensures reruns produce byte-for-byte identical results.
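
A typical sterility bug is stamping rows with the current wall-clock time instead of the run's logical date. A minimal TaskFlow sketch of the sterile version (Airflow injects ds when the argument is declared):

from airflow.decorators import task

@task
def load_partition(ds=None):
    # `ds` is the logical date string (e.g. "2023-01-01"), identical on every
    # rerun of this interval. Avoid pendulum.now()/datetime.now() here: the
    # value would change on each rerun and break reproducibility.
    print(f"loading partition {ds}")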

4. Audit & Data Quality Checks

Add a GreatExpectationsOperator or similar check at the end of each task group. If row counts exceed expectations (often a symptom of duplicates), the backfill fails fast.
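
If you do not run Great Expectations, a lightweight alternative is the generic SQLCheckOperator from the common SQL provider. A sketch with hypothetical table names:

from airflow.providers.common.sql.operators.sql import SQLCheckOperator

# Fails the task (and therefore the backfill) if any order_id appears more
# than once in the partition being loaded.
no_duplicates = SQLCheckOperator(
    task_id="assert_no_duplicates",
    conn_id="snowflake_default",
    sql="""
        SELECT COUNT(*) = 0
        FROM (
            SELECT order_id
            FROM analytics.orders
            WHERE ds = '{{ ds }}'
            GROUP BY order_id
            HAVING COUNT(*) > 1
        );
    """,
)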

Hands-On Workflow

Step 1 – Freeze Code & Config

Create a release/<date> branch so the code used during backfill equals the code that will run in production tomorrow.

Step 2 – Parameterize Execution Window

Expose start_date and end_date variables inside your DAG or read them from Airflow Variables. Example:

import pendulum
from airflow import DAG

DEFAULT_START = pendulum.datetime(2023, 1, 1, tz="UTC")

with DAG(
    dag_id="example_etl",
    default_args={"retries": 2},
    start_date=DEFAULT_START,
    schedule_interval="@daily",
) as dag:
    ...
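
If you would rather drive the window from Airflow Variables than hard-code it, a sketch like the following works (the Variable name backfill_start is an assumption):

import pendulum
from airflow.models import Variable

# Set via the UI or `airflow variables set backfill_start 2023-01-01`.
# Note: Variable.get at module level is evaluated on every DAG parse,
# which is acceptable for occasional backfill tuning.
start_str = Variable.get("backfill_start", default_var="2023-01-01")
DEFAULT_START = pendulum.parse(start_str)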

Step 3 – Use the CLI Backfill Command

airflow dags backfill \
    --start-date 2023-01-01 \
    --end-date 2023-03-31 \
    --reset-dagruns \
    --rerun-failed-tasks \
    example_etl

The --reset-dagruns flag clears existing DagRuns so every task executes again, even tasks previously marked success; --rerun-failed-tasks automatically reruns tasks that failed in a prior backfill instead of raising an exception.

Step 4 – Monitor & Reconcile

Use the Airflow UI’s Gantt view to watch progress, or run small DAGs with the --local flag. When the run finishes, validate counts with a separate reconciliation query (such as the sketch below) before unpausing the normal schedule.
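
A reconciliation query for duplicates can be as simple as the following (assuming an analytics.orders table keyed by order_id):

# Any row returned here is a duplicate that needs investigation.
RECONCILE_SQL = """
SELECT ds, order_id, COUNT(*) AS copies
FROM analytics.orders
GROUP BY ds, order_id
HAVING COUNT(*) > 1
ORDER BY ds;
"""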

Best Practices

  • Immutable Loads: If your warehouse supports INSERT OVERWRITE or CREATE OR REPLACE TABLE, prefer those over row-level deletes (see the sketch after this list).
  • Backfill Windows: Start small (one day) and widen to weeks or months after asserting correctness.
  • Simulated Production: Run the backfill in a temporary schema first; swap into production only after verification.
  • Version Tables: Include a load_timestamp column for debugging duplicate root causes later.
  • Off-Hours Execution: Disable concurrent scheduled runs to avoid race conditions.
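
For the immutable-load bullet above, a Snowflake-flavored rebuild might look like this (names are illustrative; per-partition overwrite syntax varies by engine):

# CREATE OR REPLACE swaps the table atomically, so readers never observe a
# half-written state. Engines such as Spark SQL also support narrower
# per-partition rewrites via INSERT OVERWRITE ... PARTITION (ds = '...').
REBUILD_SQL = """
CREATE OR REPLACE TABLE analytics.orders AS
SELECT * FROM staging.orders;
"""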

Common Misconceptions

  • “Airflow prevents duplicates for me.” Airflow only orchestrates; idempotency lives in your task logic.
  • “Upserts slow my warehouse.” Modern warehouses like Snowflake and BigQuery optimize MERGE; the performance cost is often minimal compared with debugging duplicates.
  • “I can delete duplicates later.” Post-hoc clean-ups are risky if business users already consumed erroneous data.

Practical Example

Below is a simplified DAG snippet using Snowflake that keeps loads duplicate-free by deleting the target partition before inserting.

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
import pendulum

# {{ ds }} is rendered by Airflow at runtime because `sql` is a templated
# field, so no f-string interpolation is needed.
DELETE_SQL = "DELETE FROM analytics.orders WHERE ds = '{{ ds }}';"
INSERT_SQL = (
    "INSERT INTO analytics.orders "
    "SELECT * FROM staging.orders WHERE ds = '{{ ds }}';"
)

with DAG(
    dag_id="orders_etl",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule_interval="@daily",
    catchup=True,
) as dag:

    remove_existing = SnowflakeOperator(
        task_id="remove_existing",
        sql=DELETE_SQL,
    )

    load_new = SnowflakeOperator(
        task_id="load_new",
        sql=INSERT_SQL,
    )

    remove_existing >> load_new

Galaxy Tip

While Airflow handles orchestration, you still write complex SQL for TRUNCATE, MERGE, and reconciliation. Galaxy’s context-aware AI Copilot can auto-generate these queries, suggest partition filters, and even validate row counts—making your backfill iteration loop much faster than a traditional editor.

Conclusion

Backfilling historical data in Airflow is straightforward once you embrace idempotent task design, partition alignment, and rigorous validation. Follow the principles above and you’ll never fear running a massive backfill again.

Why Airflow Backfill Without Duplicates is important

Bad backfills can corrupt months or years of analytics, destroying user trust and requiring painful manual clean-ups. Ensuring that every backfill run is idempotent lets teams iterate on business logic, onboard new sources, and meet audit obligations without fear of polluting production tables.

Airflow Backfill Without Duplicates Example Usage


airflow dags backfill --start-date 2022-10-01 --end-date 2022-12-31 --reset-dagruns my_sales_dag

Frequently Asked Questions (FAQs)

What is a backfill in Airflow?

Backfill is the process of running a DAG for past execution dates so that historical data is generated or corrected.

How do I guarantee no duplicates during backfill?

Design each task to be idempotent by truncating or upserting its target partition, and always run the backfill with --reset-dagruns so tasks are not skipped.

Can Galaxy help me with Airflow backfills?

Yes. Galaxy’s AI Copilot auto-completes partition-aware SQL like DELETE/MERGE statements and surfaces table metadata, making it faster to build idempotent tasks used in Airflow backfill jobs.

Is it safe to backfill while regular schedules run?

Generally no. Pause the DAG or ensure max_active_runs = 1 and distinct execution windows to avoid overlapping writes that create duplicates.
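
For reference, capping concurrency is a single DAG argument (a minimal sketch):

import pendulum
from airflow import DAG

with DAG(
    dag_id="orders_etl",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule_interval="@daily",
    catchup=True,
    max_active_runs=1,  # at most one DagRun executes at a time
) as dag:
    ...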
