Executing historical runs of an Apache Airflow DAG in a way that guarantees idempotency—so repeated backfills never create duplicate rows in downstream data stores.
Airflow Backfill Without Duplicates
Learn proven strategies to rerun Apache Airflow DAGs on historical dates while ensuring every downstream table remains free of duplicate records.
Backfilling in Apache Airflow is the act of rerunning a Directed Acyclic Graph (DAG) for past execution dates so that the data pipeline can populate historical partitions or correct previously missing data. Backfill without duplicates means doing so safely—each run is idempotent, producing the same result no matter how many times it is executed for the same time period.
Data teams often discover late-arriving source files, schema changes, or business logic updates long after a DAG has already been running in production. Without a reliable backfill strategy, reruns can silently duplicate or corrupt historical data and erode trust in downstream analytics.
Given Airflow’s popularity as an orchestration layer, mastering duplicate-free backfills is essential for every data engineer.
All operators touching storage systems (warehouses, object stores, Kafka topics) must be able to run repeatedly with no side effects. Common techniques include:
Delete-then-load: remove the target partition (DELETE FROM ... WHERE ds = '{{ ds }}') before re-loading.
Upserts: write with MERGE or INSERT ... ON CONFLICT so reruns update existing rows instead of adding new ones (see the sketch after this list).
Partition alignment: design downstream tables partitioned by the same calendar grain (e.g. daily ds) that Airflow uses. Backfill scopes naturally align with these partitions, making duplicate removal trivial.
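As a minimal, hedged sketch of the upsert pattern, the DAG below uses Snowflake's MERGE so that rerunning a date updates rows in place rather than duplicating them. The DAG id, table names, and columns (order_id, ds, amount) are assumptions made for illustration; the delete-then-load variant appears in the Snowflake example later in this article.

import pendulum
from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

# Illustrative MERGE: match on the assumed business key (order_id), update the
# row if it already exists, insert it otherwise.
MERGE_SQL = """
MERGE INTO analytics.orders AS target
USING (SELECT * FROM staging.orders WHERE ds = '{{ ds }}') AS source
    ON target.order_id = source.order_id
WHEN MATCHED THEN
    UPDATE SET ds = source.ds, amount = source.amount
WHEN NOT MATCHED THEN
    INSERT (order_id, ds, amount) VALUES (source.order_id, source.ds, source.amount);
"""

with DAG(
    dag_id="orders_merge_etl",  # hypothetical DAG id for the upsert variant
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule_interval="@daily",
    catchup=True,
) as dag:
    # Rerunning this task for the same {{ ds }} updates matching rows in place
    # instead of inserting duplicates.
    upsert_orders = SnowflakeOperator(
        task_id="upsert_orders",
        sql=MERGE_SQL,
    )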
A sterile DAG encapsulates all state in its outputs; it never depends on mutable global state, non-deterministic random numbers, or the current wall-clock time (use the templated logical date instead). Sterility ensures reruns produce byte-for-byte identical results.
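As a hedged illustration, the TaskFlow sketch below filters on the logical date Airflow injects (ds) rather than the wall clock, so a rerun for 2023-01-15 always sees the same slice of data; the table name and query are invented for the example.

from airflow.decorators import task

@task
def extract_orders(ds=None):
    # Sterile: the filter is derived from the logical date Airflow passes in,
    # so every rerun of the same interval scans exactly the same rows.
    query = f"SELECT * FROM staging.orders WHERE ds = '{ds}'"
    # Non-sterile alternative to avoid: filtering on datetime.utcnow() would
    # return a different result every time the task is re-executed.
    return query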
Add a GreatExpectationsOperator or similar check at the end of each task group. If row counts exceed expectations (often a symptom of duplicates), the backfill fails fast.
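If Great Expectations is not available, a plain SQL check can serve the same purpose. The sketch below uses SQLCheckOperator from the common SQL provider and plugs into the DAG like any other task; the table, key column, and connection id are assumptions. The query returns false, and therefore fails the task, whenever the partition contains duplicate keys.

from airflow.providers.common.sql.operators.sql import SQLCheckOperator

# Fails the backfill for a given {{ ds }} if any order_id appears more than once.
check_no_duplicates = SQLCheckOperator(
    task_id="check_no_duplicates",
    conn_id="snowflake_default",  # assumed connection id
    sql="""
        SELECT COUNT(*) = COUNT(DISTINCT order_id)
        FROM analytics.orders
        WHERE ds = '{{ ds }}'
    """,
)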
Create a release/<date> branch so the code used during the backfill equals the code that will run in production tomorrow.
Expose start_date and end_date variables inside your DAG or read them from Airflow Variables. Example:
import pendulum
from airflow import DAG

DEFAULT_START = pendulum.datetime(2023, 1, 1, tz="UTC")

with DAG(
    dag_id="example_etl",
    default_args={"retries": 2},
    start_date=DEFAULT_START,
    schedule_interval="@daily",
) as dag:
    ...
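The prose above also mentions Airflow Variables; a hedged sketch of that variant is shown below, where backfill_start_date is a Variable key invented for the example and the code falls back to a hard-coded default when it is not set.

import pendulum
from airflow.models import Variable

# Variable.get returns default_var when the key is absent, so normal scheduled
# runs keep using the default while a backfill can override it from the UI/CLI.
BACKFILL_START = Variable.get("backfill_start_date", default_var="2023-01-01")
DEFAULT_START = pendulum.parse(BACKFILL_START, tz="UTC")

Note that calling Variable.get at the top level of a DAG file adds a metadata-database lookup on every DAG parse, so keep such lookups light.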
airflow dags backfill \
--start-date 2023-01-01 \
--end-date 2023-03-31 \
--reset-dagruns \
--rerun-failed-tasks \
example_etl
The --reset-dagruns flag clears any existing DagRuns in the date range and re-creates them, so every task runs again even if it was previously marked success, while --rerun-failed-tasks automatically reruns previously failed tasks instead of aborting the backfill with an error.
Use the Airflow UI's Gantt view to monitor progress, or run small DAGs with the --local argument. When the run finishes, validate counts with a separate reconciliation query before unpausing normal schedules.
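One way to express that reconciliation step is sketched below as a TaskFlow task that compares row counts between the freshly loaded partition and its staging source; the connection id and table names are assumptions made for illustration.

from airflow.decorators import task
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

@task
def reconcile_counts(ds=None):
    # A mismatch means the backfill dropped rows or introduced duplicates.
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")  # assumed conn id
    target = hook.get_first(f"SELECT COUNT(*) FROM analytics.orders WHERE ds = '{ds}'")[0]
    source = hook.get_first(f"SELECT COUNT(*) FROM staging.orders WHERE ds = '{ds}'")[0]
    if target != source:
        raise ValueError(f"Row count mismatch for {ds}: target={target}, source={source}")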
Where the warehouse supports INSERT OVERWRITE or CREATE OR REPLACE TABLE, prefer those over row-level deletes. Add a load_timestamp column so the root cause of any duplicates can be debugged later. Default to idempotent write patterns such as MERGE; the performance cost is usually minimal compared with the cost of debugging duplicates.
Below is a simplified DAG snippet that loads Snowflake and stays duplicate-free by deleting the target partition before re-inserting it.
from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
import pendulum

# {{ ds }} is rendered by Airflow at runtime to each DagRun's logical date, so
# every run (including backfills) only ever touches its own daily partition.
PARTITION = "{{ ds }}"
DELETE_SQL = f"DELETE FROM analytics.orders WHERE ds = '{PARTITION}';"
INSERT_SQL = f"INSERT INTO analytics.orders SELECT * FROM staging.orders WHERE ds = '{PARTITION}';"

with DAG(
    dag_id="orders_etl",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule_interval="@daily",
    catchup=True,
) as dag:
    # Clear the partition first so re-running the same date cannot create duplicates.
    remove_existing = SnowflakeOperator(
        task_id="remove_existing",
        sql=DELETE_SQL,
    )

    # Re-load the partition from staging.
    load_new = SnowflakeOperator(
        task_id="load_new",
        sql=INSERT_SQL,
    )

    remove_existing >> load_new
While Airflow handles orchestration, you still write complex SQL for the DELETE, MERGE, and reconciliation steps. Galaxy's context-aware AI Copilot can auto-generate these queries, suggest partition filters, and even validate row counts, making your backfill iteration loop much faster than in a traditional editor.
Backfilling historical data in Airflow is straightforward once you embrace idempotent task design, partition alignment, and rigorous validation. Follow the principles above and you’ll never fear running a massive backfill again.
Bad backfills can corrupt months or years of analytics, destroying user trust and requiring painful manual clean-ups. Ensuring that every backfill run is idempotent lets teams iterate on business logic, onboard new sources, and meet audit obligations without fear of polluting production tables.
Backfill is the process of running a DAG for past execution dates so that historical data is generated or corrected.
Design each task to be idempotent by deleting or upserting its target partition, and always run the backfill with --reset-dagruns so tasks are not skipped.
Yes. Galaxy's AI Copilot auto-completes partition-aware SQL such as DELETE and MERGE statements and surfaces table metadata, making it faster to build the idempotent tasks used in Airflow backfill jobs.
Generally no. Pause the DAG, or ensure max_active_runs = 1 and distinct execution windows, to avoid overlapping writes that create duplicates.
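A minimal sketch of that guardrail, reusing the illustrative orders_etl DAG from above:

import pendulum
from airflow import DAG

with DAG(
    dag_id="orders_etl",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule_interval="@daily",
    catchup=True,
    max_active_runs=1,  # cap concurrent DagRuns for this DAG to one
) as dag:
    ...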