Change-Data Capture with Azure Data Factory

Galaxy Glossary

How do I implement change-data capture with Azure Data Factory?

Using Azure Data Factory to identify, move, and process only the inserts, updates, and deletes that have occurred in a source system since the last data load.

Description

Change-data capture (CDC) is the process of detecting and delivering only the data that has changed in a source system. In Azure, the most common way to operationalize CDC for analytical or operational workloads is by orchestrating change extraction and loading with Azure Data Factory (ADF). This article walks through the end-to-end approach, design decisions, and best practices for implementing CDC pipelines in ADF.

Why traditional batch loads fall short

Ingesting an entire table every hour or day may be acceptable for small datasets, but as data volumes grow the approach becomes expensive and slow. Re-reading unchanged rows wastes I/O, network bandwidth, and compute. Modern analytics and near-real-time applications demand fresher data delivered efficiently—hence the need for CDC.

Architectural overview of CDC in Azure Data Factory

At a high level, an ADF CDC solution has four layers:

  1. Change Detection – Use native database CDC (SQL Server, Azure SQL, PostgreSQL), timestamp columns, or log-based connectors to identify changed rows.
  2. Landing – Land changes as incremental files (Parquet/CSV) in Azure Data Lake Storage (ADLS) or stage them in Azure SQL / Synapse.
  3. Apply/Merge – Use Data Flow, Stored Procedure, or Synapse MERGE to integrate changes into target tables.
  4. Orchestration & Monitoring – Schedule, parameterize, and monitor the entire flow in ADF.

Step-by-step implementation guide

1. Enable CDC or equivalent on the source

SQL Server / Azure SQL: sys.sp_cdc_enable_db and sys.sp_cdc_enable_table add CDC capture and relational change tables (see the sketch after this list).
Azure Cosmos DB, MySQL, PostgreSQL: use Change Feed, binlog export, or timestamp columns.
SaaS APIs: lean on incremental endpoints or cursor-based pagination.
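
For SQL Server or Azure SQL, enabling CDC comes down to the two system procedures above. A minimal sketch, assuming an illustrative SalesDb database and the Sales.Orders table used later in this article:

USE SalesDb;
GO
-- Enable CDC at the database level (creates the cdc schema and capture metadata).
EXEC sys.sp_cdc_enable_db;
GO
-- Enable CDC for one table; a capture instance and change table are created for it.
EXEC sys.sp_cdc_enable_table
    @source_schema        = N'Sales',
    @source_name          = N'Orders',
    @role_name            = NULL,   -- NULL skips the gating role; restrict access in production
    @supports_net_changes = 1;      -- requires a primary key or a unique index
GO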

2. Create source and sink linked services in ADF

Define secure, parameterized connections to the source database and the destination (ADLS Gen2, Synapse Analytics, or Azure SQL).

3. Build the incremental extraction pipeline

Option A – Copy activity with query watermark
1. Add a Copy Data activity.
2. Parameterize the SQL query to filter rows where modified_date > @last_watermark.
3. Persist @last_watermark in a control table or Azure Key Vault; an ADF pipeline variable only holds it for the duration of a single run.
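
A sketch of the watermarked source query; the Sales.Orders table, the modified_date column, and both parameters are illustrative values that ADF would inject via dynamic content:

-- Bound the window on both sides so a rerun with the same parameters is idempotent.
SELECT *
FROM   Sales.Orders
WHERE  modified_date >  @last_watermark
  AND  modified_date <= @current_watermark;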

Option B – SQL CDC tables
1. Copy directly from cdc.<capture_instance>_CT tables.
2. The __$operation field indicates INSERT (2), UPDATE (3/4), DELETE (1).
3. Filter on __$start_lsn greater than the previous high-water-mark.
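
SQL Server also exposes table-valued functions over these change tables. A sketch of the extraction query, assuming a capture instance named Orders (matching the cdc.Orders_CT table used later in this article) and an @last_lsn value supplied by the pipeline:

-- Start just after the last processed LSN and read up to the current maximum.
DECLARE @from_lsn binary(10) = sys.fn_cdc_increment_lsn(@last_lsn);
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_all_changes_Orders(@from_lsn, @to_lsn, N'all');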

Option C – Mapping Data Flow CDC source
ADF Data Flows include a dedicated CDC source that detects inserts, updates, and deletes in SQL sources by comparing snapshots. This is simplest for low-volume datasets.

4. Land changes in a raw zone

Save incremental results as partitioned files: /raw/<table>/load_date=2024-06-11/part-000.parquet. Parquet preserves schema and supports efficient downstream merges.
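
If the target is Synapse, the landed Parquet can later be staged with a COPY INTO statement along these lines (a sketch; the storage account, container path, and staging.Orders_Incr table are assumptions):

-- Bulk-load one day's incremental Parquet files into a staging table before the merge.
COPY INTO staging.Orders_Incr
FROM 'https://datalake.dfs.core.windows.net/raw/orders/load_date=2024-06-11/*.parquet'
WITH (
    FILE_TYPE  = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);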

5. Apply changes to the target

There are three common patterns:

  • MERGE Stored Procedures – Use a native MERGE statement in Synapse/SQL DW to upsert rows (sketched after this list).
  • Data Flows – Leverage the CDC source + Sink alter row settings for insert/update/delete logic.
  • Delta Lake – If using Azure Databricks or Synapse Spark, simply MERGE INTO Delta tables.
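
For the stored-procedure pattern, a hedged MERGE sketch; dbo.Orders, staging.Orders_Incr, and the column list are placeholders, and the staging table is assumed to carry the __$operation flag landed from CDC:

MERGE dbo.Orders AS T
USING staging.Orders_Incr AS S
    ON T.OrderID = S.OrderID
WHEN MATCHED AND S.__$operation = 1 THEN      -- row was deleted at the source
    DELETE
WHEN MATCHED THEN
    UPDATE SET T.CustomerID = S.CustomerID,
               T.OrderDate  = S.OrderDate,
               T.Amount     = S.Amount
WHEN NOT MATCHED BY TARGET AND S.__$operation <> 1 THEN
    INSERT (OrderID, CustomerID, OrderDate, Amount)
    VALUES (S.OrderID, S.CustomerID, S.OrderDate, S.Amount);

Note that a WHEN NOT MATCHED BY SOURCE clause is deliberately avoided here: because the staging table holds only the current increment, it would delete every target row absent from that batch.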

6. Maintain and advance watermarks

After each successful run, write the latest high-water-mark (timestamp or LSN) to:

  • An ADF pipeline variable set via Set Variable (scoped to the current run only, so pair it with one of the persistent stores below).
  • A dedicated etl_control.watermarks table.
  • Azure App Configuration or Key Vault.

On rerun, the pipeline reads this value to fetch only new changes.
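
A minimal sketch of the control-table option, assuming the etl_control.watermarks table mentioned above (and that the etl_control schema already exists) plus a @new_watermark value produced by the run:

-- One row per source table; read before extraction, advanced after a successful load.
CREATE TABLE etl_control.watermarks (
    table_name     sysname      NOT NULL PRIMARY KEY,
    last_watermark datetime2(3) NULL,        -- for timestamp-based sources
    last_lsn       binary(10)   NULL,        -- for SQL CDC (LSN-based) sources
    updated_at     datetime2(3) NOT NULL DEFAULT SYSUTCDATETIME()
);

-- Advance the watermark only after the downstream merge has committed.
UPDATE etl_control.watermarks
SET    last_watermark = @new_watermark,
       updated_at     = SYSUTCDATETIME()
WHERE  table_name = 'Sales.Orders';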

7. Schedule and monitor

Create a trigger to execute the pipeline at the required cadence (every 5 minutes to 24 hours). Use Azure Monitor alerts or the ADF Pipeline Run reports to detect failures.

Best practices

  • Idempotency – Design your pipelines so reruns with the same watermark produce identical results. Never double-count.
  • Schema drift handling – Enable schema drift in Data Flow or store Avro-encoded schema with Parquet to cope with new columns.
  • Partition pruning – Store data in date/LSN-partitioned folders so downstream Spark or Synapse jobs read only new partitions.
  • Secure secrets – Keep connection strings and watermarks in Azure Key Vault rather than ADF variables.
  • Atomic MERGE – Wrap MERGE operations in a transaction or take an exclusive table lock (e.g., TABLOCKX) to avoid dirty reads.

Practical example: Incremental load from Azure SQL to Synapse

  1. Enable CDC on Sales.Orders.
  2. ADF pipeline fetches max LSN from control table.
  3. Copy activity queries SELECT * FROM cdc.Orders_CT WHERE __$start_lsn > @last_lsn.
  4. Copy lands results in adls://datalake/raw/orders/lsn=@last_lsn/.
  5. Stored procedure in Synapse executes:
    MERGE dbo.Orders AS T
    USING staging.Orders_Incr AS S ON T.OrderID = S.OrderID
    WHEN MATCHED AND S.__$operation = 1 THEN DELETE
    WHEN MATCHED THEN UPDATE SET ...
    WHEN NOT MATCHED THEN INSERT ...;
  6. Update control table with new max LSN.
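
Steps 2 and 6 could be implemented as a Lookup activity and a Stored Procedure (or Script) activity running T-SQL along these lines; this assumes the control table stores the LSN in the binary(10) last_lsn column sketched earlier and lives alongside the source database:

-- Step 2: read the last processed LSN and the current maximum in one lookup.
SELECT w.last_lsn,
       sys.fn_cdc_get_max_lsn() AS max_lsn
FROM   etl_control.watermarks AS w
WHERE  w.table_name = 'Sales.Orders';

-- Step 6: advance the watermark after the MERGE in Synapse succeeds.
UPDATE etl_control.watermarks
SET    last_lsn = @max_lsn
WHERE  table_name = 'Sales.Orders';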

Common misconceptions

"ADF has a built-in CDC toggle, so I don't have to design anything."

ADF provides helpful features, but you must still plan watermarks, error handling, and downstream merges.

"Using modified_date alone is enough."

Timestamps can miss out-of-order commits or clock skew. Where possible, rely on log-based CDC for accuracy.

"CDC is always faster."

For very small tables, a full reload may outperform the overhead of CDC extraction logic. Benchmark first.

How does this relate to Galaxy?

While Galaxy is a modern SQL editor rather than an orchestration tool, its AI copilot helps engineers author the MERGE statements, watermarked queries, and monitoring SQL central to CDC pipelines. You can share parameterized queries inside a Galaxy Collection so the entire team reuses the correct CDC logic across environments.

Next steps

• Prototype CDC on a single table before scaling.
• Evaluate whether SQL native CDC, timestamp filtering, or a third-party log-based tool (Debezium, Fivetran) best fits your latency and cost goals.
• Automate data quality checks post-merge to ensure no drift between source and target.

Why Change-Data Capture with Azure Data Factory is important

CDC delivers fresher data to analytics platforms while minimizing compute and storage costs. For data engineers, mastering CDC in Azure Data Factory unlocks near-real-time insights, reduces pipeline runtimes, and ensures scalable ingestion as source tables grow.

Change-Data Capture with Azure Data Factory Example Usage


-- Return changes committed after LSN 0x00000090, up to and including 0x000000A0 (an illustrative LSN window).
SELECT * FROM cdc.Orders_CT WHERE __$start_lsn > 0x00000090 AND __$start_lsn <= 0x000000A0;

Frequently Asked Questions (FAQs)

What is the difference between watermark-based CDC and SQL Server CDC tables?

Watermark CDC relies on a timestamp or incrementing key; SQL Server CDC uses the transaction log, guaranteeing no missing changes but adding overhead. Choose based on latency, reliability, and licensing.

Can I implement CDC without enabling SQL Server's CDC feature?

Yes. You can query based on modified_date, triggers, or snapshots, but these methods may miss deletes or late updates.

How often should I run a CDC pipeline?

Match business needs. Critical operational reporting may require every few minutes; nightly is fine for batch analytics. Balance cost and SLAs.

Does Azure Data Factory support real-time streaming CDC?

ADF is micro-batch oriented. For sub-second latency, use Azure Stream Analytics, Event Hubs, or third-party log-based tools, then land data for ADF processing.
