Delta Lake is an open-source storage layer that brings ACID transactions, schema enforcement, and versioned data (“time travel”) to data lakes built on Apache Spark and cloud object stores.
Modern analytics pipelines generate petabytes of semi-structured data that land in cheap cloud object storage (AWS S3, Azure Data Lake Storage, Google Cloud Storage). While inexpensive and highly scalable, these "data lakes" historically lacked the reliability guarantees provided by traditional databases. Delta Lake fills this gap by layering transactional control, schema management, and data versioning on top of open file formats like Parquet—without sacrificing the low cost and open architecture of a lake.
As organizations adopt the lakehouse pattern—combining a data warehouse’s reliability with a data lake’s flexibility—engineers need strong guarantees around atomicity, consistency, isolation, and durability.
Delta Lake delivers these ACID properties using a combination of transaction logs, checkpoint files, and optimistic concurrency control. This eliminates the "missing data" and "half-written table" issues that plague raw Parquet folders, enabling reliable machine-learning features, reproducible dashboards, and auditable data pipelines.
Each Delta table has a _delta_log/ directory that contains JSON files called <version>.json. Every write operation—append, overwrite, merge, delete—creates a new JSON commit file describing:
- the data files added or removed by the operation
- per-file stats (min/max values, counts, etc.)
- txn identifiers used for idempotency
Because JSON files are immutable and ordered by version, readers can reconstruct a table’s state at any point in time simply by replaying the log up to that version.
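For intuition, each commit file is newline-delimited JSON with one action per line. The minimal sketch below (the log path and version number are hypothetical) prints the action type of every entry in a single commit:

import json

# Hypothetical path to one commit file inside _delta_log/
commit_file = "/data/delta/events/_delta_log/00000000000000000042.json"

with open(commit_file) as f:
    for line in f:
        action = json.loads(line)
        # Each line carries exactly one action, keyed by its type,
        # e.g. "commitInfo", "metaData", "add", "remove", or "txn".
        print(next(iter(action)))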
Over time, thousands of small JSON commits would degrade performance. Delta Lake therefore writes binary .checkpoint.parquet files every N transactions (default 10) or when the log size exceeds a threshold. A checkpoint contains a compact Parquet representation of the table’s state (list of active files plus metadata). New readers start by loading the latest checkpoint and then replay only the JSON commits that follow, keeping startup fast even for long-lived tables.
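The cadence can also be tuned per table through the delta.checkpointInterval table property. The snippet below is a sketch; it assumes a SparkSession already configured with the Delta extensions (as in the walkthrough later in this article) and reuses that walkthrough's table path:

# Sketch: checkpoint every 25 commits instead of the default 10.
spark.sql("""
    ALTER TABLE delta.`s3://bucket/delta/events`
    SET TBLPROPERTIES ('delta.checkpointInterval' = '25')
""")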
Unlike traditional locks, Delta Lake uses optimistic concurrency. Each writer reads the latest snapshot, performs its modifications, and tries to commit a new JSON file with the next version number. If another writer has already claimed that version, the commit fails and Spark automatically retries by re-reading a fresh snapshot. This approach scales to many parallel writers with minimal contention.
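Conflicts that cannot be resolved automatically surface as exceptions the caller can catch and retry. The sketch below assumes a recent delta-spark release, which exposes conflict errors such as ConcurrentAppendException in the delta.exceptions module; the helper simply re-runs the write against a fresh snapshot:

import time
from delta.exceptions import ConcurrentAppendException

def write_with_retry(write_fn, max_attempts=3):
    # Sketch: re-run a write (e.g. a merge) when a concurrent writer has
    # modified data that this transaction read.
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except ConcurrentAppendException:
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)  # brief backoff before retrying

# Usage (hypothetical): write_with_retry(lambda: merge_builder.execute())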
Delta Lake validates the incoming DataFrame’s schema against the table schema during writes:
- Schema enforcement (the default): writes whose columns do not match the table schema are rejected.
- Schema evolution (spark.databricks.delta.schema.autoMerge.enabled=true): new columns are merged into the table schema automatically.
This prevents the “quiet corruptions” that occur when a producer silently changes column order or data types.
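Evolution can also be enabled for a single write with the per-write mergeSchema option instead of the session-wide setting. A minimal sketch, in which df_with_new_column and path are placeholders:

# Sketch: permit this one append to add columns not yet present in the table.
(df_with_new_column.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # per-write schema evolution
    .save(path))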
Because every change is versioned, you can query historical snapshots—critical for root-cause analysis, audits, and reproducible experiments.
-- View table as it was 24 hours ago
SELECT * FROM events TIMESTAMP AS OF current_timestamp() - INTERVAL 24 HOURS;
-- Or by explicit version
SELECT * FROM events VERSION AS OF 1756;
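The same snapshots are reachable from the DataFrame reader, assuming the configured SparkSession from the walkthrough later in this article. A sketch with a hypothetical timestamp (version-based reads appear in that walkthrough):

# Sketch: read the table as of a wall-clock timestamp (value is hypothetical).
path = "s3://bucket/delta/events"
events_yesterday = (spark.read.format("delta")
    .option("timestampAsOf", "2024-05-01 00:00:00")
    .load(path))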
Common transactional write patterns include:
- Appends: df.write.format("delta").mode("append").save(path)
- Overwrites: mode("overwrite").option("overwriteSchema", "true") when the data (and optionally the schema) must be replaced
- Upserts: MERGE INTO to handle slowly changing dimensions
- Row-level changes: DELETE FROM and UPDATE ... SET
All of the above generate one or more JSON commits. If a job fails midway, its files are never referenced by a committed log entry, so readers ignore them; the orphaned files are cleaned up by the next VACUUM operation, preserving atomicity.
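Row-level deletes and updates can also be issued from Python through the DeltaTable API. A sketch, assuming the event_date and status columns exist on the table:

from delta.tables import DeltaTable

tbl = DeltaTable.forPath(spark, "s3://bucket/delta/events")

# Each call commits its own entry to _delta_log/, just like the SQL forms.
tbl.delete("event_date < '2023-01-01'")          # transactional DELETE
tbl.update(
    condition="status = 'pending'",
    set={"status": "'expired'"},                 # values are SQL expressions
)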
Delta Lake provides snapshot isolation: readers see a consistent snapshot of the table as of the moment their query started, regardless of subsequent writes. They never observe partially written data, yet writers can proceed concurrently.
Streaming and small batch writes create many tiny Parquet files, which hurt downstream query performance. The OPTIMIZE command compacts files into larger ones and records removed files in the transaction log. Later, VACUUM permanently deletes files older than the retention threshold (default 7 days). The two-step process guarantees durability: data is only deleted after every snapshot that might reference it has aged out.
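Both steps are scriptable. The sketch below assumes open-source Delta Lake 2.0 or later (where OPTIMIZE and ZORDER BY are available) and a user_id column to cluster on:

from delta.tables import DeltaTable

# Compact small files and co-locate rows by a high-cardinality column.
spark.sql("OPTIMIZE delta.`s3://bucket/delta/events` ZORDER BY (user_id)")

# Permanently delete files no longer referenced by any retained snapshot.
DeltaTable.forPath(spark, "s3://bucket/delta/events").vacuum(168)   # retention in hours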
from pyspark.sql import SparkSession
spark = (SparkSession.builder
.appName("delta-demo")
.config("spark.sql.extensions","io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog","org.apache.spark.sql.delta.catalog.DeltaCatalog")
.getOrCreate())
path = "s3://bucket/delta/events"
df = spark.read.json("/data/raw/events/")
# 1. Append new data atomically
(df.write.format("delta")
.mode("append")
.partitionBy("event_date")
.save(path))
# 2. Upsert corrections
from delta.tables import DeltaTable
delta_tbl = DeltaTable.forPath(spark, path)
updates = spark.read.json("/data/corrections/")
(delta_tbl.alias("t")
.merge(updates.alias("u"), "t.event_id = u.event_id")
.whenMatchedUpdateAll()
.whenNotMatchedInsertAll()
.execute())
# 3. Query historical version
spark.read.format("delta").option("versionAsOf", 5).load(path).show()
You can register Delta paths as external tables in Hive Metastore or the Unity Catalog, exposing them via ANSI SQL. Tools like Galaxy—a modern, developer-centric SQL editor—can then query those tables, leverage autocomplete against Delta metadata, and share vetted queries via Collections. While Delta’s ACID logic runs in the underlying Spark engine, Galaxy provides a fast interface for writing and collaborating on SELECT, MERGE, or OPTIMIZE commands without context-switching to notebooks.
A few operational best practices:
- Enable spark.databricks.delta.autoCompact.enabled=true for write-time compaction of small files (see the configuration sketch below).
- Use ZORDER BY on high-cardinality columns to optimize data skipping.
- Keep the VACUUM retention period compliant with your audit requirements; never drop below 168 hours if streaming consumers rewind.
- Partition by low-cardinality columns (e.g., event_date, country); over-partitioning hurts performance.
- Monitor _delta_log size; write checkpoints manually with CHECKPOINT if large backfills generate >10,000 commits.
While Delta provides warehouse-grade reliability, you may still need specialized engines (DuckDB, ClickHouse) for ultra-low-latency BI. Think of Delta as the storage foundation.
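A configuration sketch covering the first and third items in the checklist above: the autoCompact conf is the one named there, while delta.deletedFileRetentionDuration is an additional table property introduced here for illustration (verify the name against your Delta release):

# Sketch: enable write-time compaction for the session and extend the
# retention window that governs what VACUUM may delete for one table.
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.sql("""
    ALTER TABLE delta.`s3://bucket/delta/events`
    SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 30 days')
""")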
Each historical snapshot references old files retained on object storage. Leaving VACUUM disabled can double or triple storage costs.
Allowing unchecked evolution across dozens of upstream teams can lead to sparsely populated columns and query complexity. Govern schema changes via pull requests or a data contract.
Delta Lake bridges the gap between inexpensive, flexible data lakes and the ACID guarantees historically available only in databases. Its transaction log, checkpoints, and optimistic concurrency control let teams build reliable pipelines, experiment safely with time travel, and query petabyte-scale datasets—all while using open file formats compatible with the broader big-data ecosystem.
Without Delta Lake, data lakes suffer from inconsistent reads, partial writes, and silent schema drift—issues that break production dashboards, ML models, and compliance audits. By adding an ACID-compliant layer atop low-cost object storage, Delta Lake enables the coveted “lakehouse” architecture where a single copy of data fuels ETL, streaming, BI, and AI workloads with high reliability.
Is Delta Lake open source? Yes. Delta Lake was open-sourced by Databricks under the Apache 2.0 license and is governed by the Linux Foundation.
How does Delta Lake compare with Apache Iceberg and Apache Hudi? All three provide ACID semantics on data lakes, but they vary in write concurrency models, metadata layouts, and ecosystem tooling. Delta is tightly integrated with Apache Spark and emphasizes simple SQL commands, whereas Iceberg focuses on engine-agnostic design and Hudi targets incremental ingestion use cases.
Can I query Delta tables with Galaxy? Absolutely. Register your Delta path in Hive Metastore or Unity Catalog, connect Galaxy to the same metastore, and write standard SQL. Galaxy’s AI Copilot will autocomplete columns, suggest filters, and let you share the query via Collections.
What happens if a write job fails partway through? The transaction never commits, so its partially written Parquet files are not referenced by any log entry; readers ignore them, and they become eligible for deletion during the next VACUUM, preserving atomicity.