Database backfill is the deliberate, controlled process of (re-)populating historical or missing rows in a data store after schema changes, logic fixes, or late-arriving source data.
Database backfills keep data sets correct and trustworthy after schema changes, bug fixes, or the arrival of late source data. They can be as simple as one UPDATE statement or as complex as a multi-day, multi-service job coordinated across production and analytics systems.
A backfill (sometimes called a replay or historical reload) is the act of inserting, updating, or deleting data retroactively so that the database reflects what the state would have been had the corrected code or schema existed from day one. In practice this means running ad-hoc SQL, batch scripts, or data-pipeline jobs that revisit old partitions and apply new logic.
Adding a new non-nullable column that requires values for past rows.
Correcting historical calculations—e.g., sales tax was applied incorrectly for three months.
Source systems came online late or were offline; once data arrives, historical gaps must be filled.
Downstream aggregates (e.g., dbt incremental models) need to be regenerated after logic changes.
Backfills aren’t always inserts—sometimes you must delete or anonymize historical data for compliance.
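To make the compliance scenario concrete, such a backfill might scrub personal fields rather than add rows; the customers and erasure_requests tables below are hypothetical:

-- Hypothetical right-to-erasure backfill: anonymize historical rows in place
UPDATE customers
SET email     = NULL,
    full_name = 'REDACTED'
WHERE id IN (
  SELECT customer_id FROM erasure_requests WHERE fulfilled_at IS NULL
);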
Analytics, ML models, invoicing, and customer dashboards are all only as accurate as their historical data. When past data is wrong, trust erodes. Automated reports trigger false alarms, cohort analyses mislead product decisions, and revenue may be misstated. A disciplined backfill process preserves data integrity, reduces fire-drills, and shortens incident resolution time.
A safe backfill can be re-run without harmful side effects. Techniques include DELETE-INSERT by partition, MERGE with deterministic keys, or writing to a new table and swapping names atomically.
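As a sketch of the MERGE approach (PostgreSQL 15+ syntax; the corrected_orders staging table is illustrative), re-running the statement converges on the same end state because rows are matched on their primary key:

-- Idempotent backfill via MERGE: matching on the key makes re-runs harmless
MERGE INTO orders AS o
USING corrected_orders AS c        -- staging table holding recomputed values
   ON o.id = c.id
WHEN MATCHED THEN
  UPDATE SET total_price = c.total_price
WHEN NOT MATCHED THEN
  INSERT (id, quantity, unit_price, total_price)
  VALUES (c.id, c.quantity, c.unit_price, c.total_price);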
Updating billions of rows in one shot will lock tables and bloat WAL/redo logs. Instead, chunk the work by DATE partition or primary-key range, and apply small batches (e.g., LIMIT 10000) in a loop until rows = 0. Run backfills during off-peak hours or on replicas to avoid starving production traffic. For analytics databases (Snowflake, BigQuery), spin up a dedicated warehouse or slot pool.
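A day-at-a-time DELETE-INSERT is one way to combine partition chunking with idempotency; the daily_sales rollup and its source query below are placeholders:

-- Rebuild a single day: delete the partition's rows, then re-insert from source
DELETE FROM daily_sales WHERE sales_date = DATE '2024-03-01';

INSERT INTO daily_sales (sales_date, store_id, revenue)
SELECT order_date, store_id, SUM(quantity * unit_price)
FROM orders
WHERE order_date = DATE '2024-03-01'
GROUP BY order_date, store_id;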
Emit metrics (rows processed, duration, errors) and write a backfill_runs log table capturing parameters, code hash, and user. This provides rollback instructions and compliance evidence.
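A minimal version of that log table could look like the following; the columns are suggestions rather than a fixed schema:

-- One row per backfill execution, for rollback instructions and audits
CREATE TABLE IF NOT EXISTS backfill_runs (
  id           BIGSERIAL PRIMARY KEY,
  started_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
  finished_at  TIMESTAMPTZ,
  run_by       TEXT NOT NULL,
  code_hash    TEXT,    -- e.g., git SHA of the script that ran
  parameters   JSONB,   -- batch size, key ranges, target tables
  rows_written BIGINT,
  notes        TEXT
);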
Suppose a new total_price column must become NOT NULL on orders. The column is added as nullable first; historical rows then need to be populated without blocking the application, and the constraint applied only at the end.
-- 1. Process in 10k-row batches ordered by primary key
DO $$
DECLARE
  _rows INTEGER := 1;  -- rows touched by the most recent batch
BEGIN
  WHILE _rows > 0 LOOP
    WITH cte AS (
      SELECT id, quantity * unit_price AS calc
      FROM orders
      WHERE total_price IS NULL
      ORDER BY id
      LIMIT 10000
    )
    UPDATE orders o
    SET total_price = cte.calc
    FROM cte
    WHERE o.id = cte.id;

    GET DIAGNOSTICS _rows = ROW_COUNT;  -- how many rows this batch updated
    RAISE NOTICE 'Rows updated: %', _rows;

    COMMIT;                -- release locks between batches (PostgreSQL 11+, DO run outside a transaction block)
    PERFORM pg_sleep(0.2); -- small pause to reduce load
  END LOOP;
END $$;
This loop avoids long-running locks, surfaces progress, and can be resumed safely.
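With every historical row populated, the constraint from the original requirement can be applied; note that validating NOT NULL still scans the whole table, so it is worth scheduling off-peak as well:

-- Enforce the new rule only after the loop reports zero remaining NULLs
ALTER TABLE orders
  ALTER COLUMN total_price SET NOT NULL;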
Backfill work often starts as exploratory SQL. Galaxy’s desktop SQL editor makes that rapid iteration painless: batch boundaries and the throttle interval can be expressed as variables such as :start_id, :end_id, and :sleep_ms, and the vetted statement replayed range by range. While Galaxy does not orchestrate long-running jobs, it is the ideal control room for crafting, reviewing, and safely executing the SQL stages of a backfill.
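A single parameterized batch along these lines could be iterated from the editor; the :start_id and :end_id placeholders assume colon-prefixed variables are substituted before execution:

-- One key-range batch; re-run with the next :start_id / :end_id pair
UPDATE orders
SET total_price = quantity * unit_price
WHERE id >  :start_id
  AND id <= :end_id
  AND total_price IS NULL;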
Heavy backfills can cascade unintended writes (e.g., audit triggers) or violate FK constraints mid-run. Temporarily disable or defer them, or write to an intermediate table.
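In PostgreSQL, for example, a trigger can be switched off for the duration of the work and deferrable foreign-key checks postponed until commit; the trigger name below is illustrative and both steps need appropriate privileges:

-- Silence an audit trigger while the backfill runs (re-enable afterwards)
ALTER TABLE orders DISABLE TRIGGER orders_audit_trg;

BEGIN;
SET CONSTRAINTS ALL DEFERRED;  -- affects only constraints declared DEFERRABLE
-- ... backfill statements ...
COMMIT;

ALTER TABLE orders ENABLE TRIGGER orders_audit_trg;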
Wrapping millions of updates in a single transaction risks running out of log space and causing downtime on COMMIT. Batch in smaller, independent transactions.
Backfilling production tables without coordinating cache invalidation, search indexes, or derived warehouses leaves systems inconsistent. Publish an RFC, notify stakeholders, and chain downstream rebuilds.
Finally, running without an audit trail makes rollback and compliance reviews painful; record every execution in the backfill_runs audit table described above.
Database backfills are routine yet delicate operations that keep historical data honest. By planning carefully, batching updates, observing progress, and leveraging modern editors like Galaxy, teams can correct years of data without risking minutes of downtime.
Historical data drives metrics, billing, ML models, and compliance reporting. When schema changes or bugs leave gaps, a precise backfill restores trust and prevents costly mis-analysis. Understanding backfill mechanics equips data engineers to fix issues quickly without service disruption.
It depends on data volume and database throughput. Measure a single batch, extrapolate, and decide whether to parallelize or offload to replicas. Small fixes finish in minutes; multi-terabyte reloads may span days.
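For a rough illustration: if one 10,000-row batch completes in 2 seconds, 500 million rows would need about 50,000 batches, or roughly 28 hours on a single connection, which is usually the point at which parallelizing by key range becomes worthwhile.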
Yes, if you throttle load and monitor locks. Otherwise schedule during low-traffic windows, or run the job against a writable clone and swap the results in, to prevent user impact.
Galaxy’s AI Copilot writes chunked, idempotent SQL, while Collections let teams share the vetted script. Run/edit history offers an audit trail critical for compliance.
Have a point-in-time backup or snapshots before starting. For analytics tables, swap back to the original table or reload from raw sources. For OLTP, use the backup to restore or reverse the changes with a compensating script.
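When the backfill was staged through a table swap, rollback can be as simple as renaming the untouched copy back into place; the orders_pre_backfill name below is a placeholder:

-- Swap the original table back in a single transaction
BEGIN;
ALTER TABLE orders RENAME TO orders_backfilled;
ALTER TABLE orders_pre_backfill RENAME TO orders;
COMMIT;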