Change data capture with Azure Data Factory lets you identify, move, and process only the inserts, updates, and deletes that have occurred in a source system since the last data load.
Change-data capture (CDC) is the process of detecting and delivering only the data that has changed in a source system. In Azure, the most common way to operationalize CDC for analytical or operational workloads is by orchestrating change extraction and loading with Azure Data Factory (ADF). This article walks through the end-to-end approach, design decisions, and best practices for implementing CDC pipelines in ADF.
Ingesting an entire table every hour or day may be acceptable for small datasets, but as data volumes grow the approach becomes expensive and slow. Re-reading unchanged rows wastes I/O, network bandwidth, and compute. Modern analytics and near-real-time applications demand fresher data delivered efficiently—hence the need for CDC.
At a high level, an ADF CDC solution has four layers: a change-enabled source, parameterized linked services, an incremental extraction pipeline, and a merge into the destination. First, enable change tracking at the source:
• SQL Server / Azure SQL Database: run `sys.sp_cdc_enable_db` and `sys.sp_cdc_enable_table` to turn on CDC and create the capture job and change tables.
• Azure Cosmos DB, MySQL, PostgreSQL: use the change feed, binlog export, or timestamp columns.
• SaaS APIs: lean on incremental endpoints or cursor-based pagination.
Define secure, parameterized connections to the source database and the destination (ADLS Gen2, Synapse Analytics, or Azure SQL).
Option A – Copy activity with query watermark
1. Add a Copy Data activity.
2. Parameterize the SQL query to filter rows where `modified_date > @last_watermark`.
3. Store `@last_watermark` in an ADF pipeline variable, Azure Key Vault, or a control table.
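The watermark loop in Option A can be sketched in a few lines. This is a minimal illustration, not ADF code: the dict stands in for the control table or Key Vault secret, and the table and column names are the illustrative ones used above.

```python
# Minimal sketch of the high-water-mark pattern (Option A).
# watermark_store stands in for a control table or Key Vault secret.
watermark_store = {"dbo.Orders": "2024-06-10T00:00:00+00:00"}

def build_incremental_query(table: str) -> str:
    """Build the filtered extraction query using the stored watermark."""
    last = watermark_store[table]
    return f"SELECT * FROM {table} WHERE modified_date > '{last}'"

def advance_watermark(table: str, batch_max_modified: str) -> None:
    """After a successful copy, persist the new high-water mark."""
    watermark_store[table] = batch_max_modified

query = build_incremental_query("dbo.Orders")
# After the copy succeeds, record the max modified_date seen in the batch.
advance_watermark("dbo.Orders", "2024-06-11T03:15:00+00:00")
```

The key design point is that the watermark is advanced only after a successful load, so a failed run simply re-reads the same window on retry.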
Option B – SQL CDC tables
1. Copy directly from the `cdc.<capture_instance>_CT` tables.
2. The `__$operation` column indicates the change type: 1 = delete, 2 = insert, 3 = update (before image), 4 = update (after image).
3. Filter on `__$start_lsn` greater than the previous high-water mark.
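The steps above can be simulated to show how change rows are filtered and classified. The rows below are hand-made stand-ins for real change-table output, and the key names mirror SQL Server's `__$operation` / `__$start_lsn` columns.

```python
# Sketch of interpreting SQL Server CDC change rows (Option B).
OPERATIONS = {1: "DELETE", 2: "INSERT", 3: "UPDATE_BEFORE", 4: "UPDATE_AFTER"}

change_rows = [
    {"__$start_lsn": 0x10, "__$operation": 2, "OrderID": 1},  # insert
    {"__$start_lsn": 0x11, "__$operation": 3, "OrderID": 1},  # pre-update image
    {"__$start_lsn": 0x11, "__$operation": 4, "OrderID": 1},  # post-update image
    {"__$start_lsn": 0x12, "__$operation": 1, "OrderID": 2},  # delete
]

def new_changes(rows, last_lsn):
    """Keep only rows past the previous high-water mark, labelled by type."""
    return [(OPERATIONS[r["__$operation"]], r["OrderID"])
            for r in rows if r["__$start_lsn"] > last_lsn]

changes = new_changes(change_rows, last_lsn=0x10)
```

Note that an update produces two rows (before and after images); downstream logic usually keeps only the after image when applying changes.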
Option C – Mapping Data Flow CDC source
ADF Data Flows include a dedicated CDC source that detects inserts, updates, and deletes in SQL sources by comparing snapshots. This is simplest for low-volume datasets.
Save incremental results as partitioned files, for example `/raw/<table>/load_date=2024-06-11/part-000.parquet`. Parquet preserves schema and supports efficient downstream merges.
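A small helper makes the partition layout above reproducible across tables; the base path and naming here are assumptions matching the `/raw/<table>/load_date=...` example, not a required convention.

```python
from datetime import date

# Build the partitioned landing path used in the example layout above.
def partition_path(table: str, load_date: date, part: int = 0) -> str:
    return f"/raw/{table}/load_date={load_date.isoformat()}/part-{part:03d}.parquet"

path = partition_path("orders", date(2024, 6, 11))
```

Partitioning by load date lets downstream merges read only the latest increment instead of scanning the whole raw zone.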
To apply the changes downstream, a common pattern is to `MERGE` the incremental rows into the target (for example, `MERGE INTO` Delta tables). After each successful run, write the latest high-water mark (timestamp or LSN) to a control table such as `etl_control.watermarks`. On rerun, the pipeline reads this value to fetch only new changes.
Create a trigger to execute the pipeline at the required cadence (every 5 minutes to 24 hours). Use Azure Monitor alerts or the ADF Pipeline Run reports to detect failures.
Worked example for a table `Sales.Orders`:
1. Extract: `SELECT * FROM cdc.Orders_CT WHERE __$start_lsn > @last_lsn`
2. Land: `adls://datalake/raw/orders/lsn=@last_lsn/`
3. Merge: `MERGE dbo.Orders AS T USING staging.Orders_Incr AS S ON T.OrderID = S.OrderID WHEN MATCHED THEN UPDATE SET ... WHEN NOT MATCHED THEN INSERT ... WHEN NOT MATCHED BY SOURCE THEN DELETE;`
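The MERGE semantics can be illustrated outside SQL. This sketch simulates upsert-plus-delete against an in-memory target; a real pipeline runs the SQL statement above, and here explicit `op` flags stand in for the matched/not-matched conditions (a simplification of `WHEN NOT MATCHED BY SOURCE`).

```python
# Simulated MERGE: upsert matched/new keys, delete rows flagged as deletes.
# Dicts stand in for the target table and the staged incremental rows.
target = {1: {"OrderID": 1, "Amount": 100}, 2: {"OrderID": 2, "Amount": 50}}

incremental = [
    {"OrderID": 1, "Amount": 120, "op": "UPDATE"},
    {"OrderID": 3, "Amount": 75,  "op": "INSERT"},
    {"OrderID": 2, "Amount": None, "op": "DELETE"},
]

def apply_changes(target, rows):
    for r in rows:
        if r["op"] == "DELETE":
            target.pop(r["OrderID"], None)
        else:  # INSERT and UPDATE both reduce to an upsert on the key
            target[r["OrderID"]] = {"OrderID": r["OrderID"], "Amount": r["Amount"]}
    return target

apply_changes(target, incremental)
```

Because inserts and updates collapse into a single upsert, the merge stays idempotent: replaying the same increment leaves the target unchanged.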
ADF provides helpful features, but you must still plan watermarks, error handling, and downstream merges.
A common misconception is that filtering on `modified_date` alone is enough. Timestamps can miss out-of-order commits or clock skew; where possible, rely on log-based CDC for accuracy.
For very small tables, a full reload may outperform the overhead of CDC extraction logic. Benchmark first.
While Galaxy is a modern SQL editor rather than an orchestration tool, its AI copilot helps engineers author the MERGE statements, watermarked queries, and monitoring SQL central to CDC pipelines. You can share parameterized queries inside a Galaxy Collection so the entire team reuses the correct CDC logic across environments.
• Prototype CDC on a single table before scaling.
• Evaluate whether SQL native CDC, timestamp filtering, or a third-party log-based tool (Debezium, Fivetran) best fits your latency and cost goals.
• Automate data quality checks post-merge to ensure no drift between source and target.
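As a starting point for the data-quality bullet above, a post-merge check can compare keys and row contents between source and target extracts. This is a minimal sketch with illustrative in-memory rows, not a real database read; the `OrderID` key matches the example table used earlier.

```python
# Post-merge reconciliation sketch: compare keys and row contents
# between a source extract and a target extract.
def reconcile(source_rows, target_rows, key="OrderID"):
    src = {r[key]: tuple(sorted(r.items())) for r in source_rows}
    tgt = {r[key]: tuple(sorted(r.items())) for r in target_rows}
    return {
        "count_match": len(src) == len(tgt),
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        "mismatched_rows": sorted(k for k in src.keys() & tgt.keys()
                                  if src[k] != tgt[k]),
    }

report = reconcile(
    [{"OrderID": 1, "Amount": 120}, {"OrderID": 2, "Amount": 50}],
    [{"OrderID": 1, "Amount": 120}],
)
```

Running a check like this after every merge, and alerting on any non-empty drift lists, catches missed deletes or late updates before they reach reports.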
CDC delivers fresher data to analytics platforms while minimizing compute and storage costs. For data engineers, mastering CDC in Azure Data Factory unlocks near-real-time insights, reduces pipeline runtimes, and ensures scalable ingestion as source tables grow.
How does watermark-based CDC differ from SQL Server native CDC? Watermark CDC relies on a timestamp or incrementing key; SQL Server CDC uses the transaction log, guaranteeing no missing changes but adding overhead. Choose based on latency, reliability, and licensing.
Can you implement CDC without enabling SQL Server's native CDC feature? Yes. You can query based on `modified_date`, triggers, or snapshots, but these methods may miss deletes or late updates.
How often should the pipeline run? Match business needs. Critical operational reporting may require every few minutes; nightly is fine for batch analytics. Balance cost and SLAs.
Can ADF deliver true real-time CDC? ADF is micro-batch oriented. For sub-second latency, use Azure Stream Analytics, Event Hubs, or third-party log-based tools, then land data for ADF processing.