Delta Lake is an open-source storage layer that brings ACID transactions, schema enforcement, and versioned data (“time travel”) to data lakes built on Apache Spark and cloud object stores.
Modern analytics pipelines generate petabytes of semi-structured data that land in cheap cloud object storage (AWS S3, Azure Data Lake Storage, Google Cloud Storage). While inexpensive and highly scalable, these "data lakes" historically lacked the reliability guarantees provided by traditional databases. Delta Lake fills this gap by layering transactional control, schema management, and data versioning on top of open file formats like Parquet—without sacrificing the low cost and open architecture of a lake.
As organizations adopt the lakehouse pattern—combining a data warehouse’s reliability with a data lake’s flexibility—engineers need strong guarantees around atomicity, consistency, isolation, and durability.
Delta Lake delivers these ACID properties using a combination of transaction logs, checkpoint files, and optimistic concurrency control. This eliminates the "missing data" and "half-written table" issues that plague raw Parquet folders, enabling reliable machine-learning features, reproducible dashboards, and auditable data pipelines.
Each Delta table has a _delta_log/ directory that contains JSON files called <version>.json. Every write operation—append, overwrite, merge, delete—creates a new JSON commit file describing:
- the data files added or removed by the operation
- per-file stats (min/max values, counts, etc.)
- txn identifiers used for idempotency
Because JSON files are immutable and ordered by version, readers can reconstruct a table’s state at any point in time simply by replaying the log up to that version.
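For intuition, each commit file is newline-delimited JSON with one action per line. The minimal sketch below (the log path and version number are hypothetical) prints the action type of every entry in a single commit:

import json

# Hypothetical path to one commit file inside _delta_log/
commit_file = "/data/delta/events/_delta_log/00000000000000000042.json"

with open(commit_file) as f:
    for line in f:
        action = json.loads(line)
        # Each line carries exactly one action, keyed by its type,
        # e.g. "commitInfo", "metaData", "add", "remove", or "txn".
        print(next(iter(action)))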
Over time, thousands of small JSON commits would degrade performance. Delta Lake therefore writes binary .checkpoint.parquet files every N transactions (default 10) or when the log size exceeds a threshold. A checkpoint contains a compact Parquet representation of the table’s state (list of active files plus metadata). New readers start by loading the latest checkpoint and then replay only the JSON commits that follow, keeping startup fast even for long-lived tables.
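The cadence can also be tuned per table through the delta.checkpointInterval table property. The snippet below is a sketch; it assumes a SparkSession already configured with the Delta extensions (as in the walkthrough later in this article) and reuses that walkthrough's table path:

# Sketch: checkpoint every 25 commits instead of the default 10.
spark.sql("""
    ALTER TABLE delta.`s3://bucket/delta/events`
    SET TBLPROPERTIES ('delta.checkpointInterval' = '25')
""")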
Unlike traditional locks, Delta Lake uses optimistic concurrency. Each writer reads the latest snapshot, performs its modifications, and tries to commit a new JSON file with the next version number. If another writer has already claimed that version, the commit fails and Spark automatically retries by re-reading a fresh snapshot. This approach scales to many parallel writers with minimal contention.
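Conflicts that cannot be resolved automatically surface as exceptions the caller can catch and retry. The sketch below assumes a recent delta-spark release, which exposes conflict errors such as ConcurrentAppendException in the delta.exceptions module; the helper simply re-runs the write against a fresh snapshot:

import time
from delta.exceptions import ConcurrentAppendException

def write_with_retry(write_fn, max_attempts=3):
    # Sketch: re-run a write (e.g. a merge) when a concurrent writer has
    # modified data that this transaction read.
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except ConcurrentAppendException:
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)  # brief backoff before retrying

# Usage (hypothetical): write_with_retry(lambda: merge_builder.execute())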
Delta Lake validates the incoming DataFrame’s schema against the table schema during writes:
- Schema enforcement (the default): writes whose columns do not match the table schema are rejected.
- Schema evolution (spark.databricks.delta.schema.autoMerge.enabled=true): new columns are merged into the table schema automatically.
This prevents the “quiet corruptions” that occur when a producer silently changes column order or data types.
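Evolution can also be enabled for a single write with the per-write mergeSchema option instead of the session-wide setting. A minimal sketch, in which df_with_new_column and path are placeholders:

# Sketch: permit this one append to add columns not yet present in the table.
(df_with_new_column.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # per-write schema evolution
    .save(path))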
Because every change is versioned, you can query historical snapshots—critical for root-cause analysis, audits, and reproducible experiments.
-- View table as it was 24 hours ago
SELECT * FROM events TIMESTAMP AS OF current_timestamp() - INTERVAL 24 HOURS;
-- Or by explicit version
SELECT * FROM events VERSION AS OF 1756;
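The same snapshots are reachable from the DataFrame reader, assuming the configured SparkSession from the walkthrough later in this article. A sketch with a hypothetical timestamp (version-based reads appear in that walkthrough):

# Sketch: read the table as of a wall-clock timestamp (value is hypothetical).
path = "s3://bucket/delta/events"
events_yesterday = (spark.read.format("delta")
    .option("timestampAsOf", "2024-05-01 00:00:00")
    .load(path))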
Common transactional write patterns include:
- Appends: df.write.format("delta").mode("append").save(path)
- Overwrites: mode("overwrite").option("overwriteSchema", "true") when the data (and optionally the schema) must be replaced
- Upserts: MERGE INTO to handle slowly changing dimensions
- Row-level changes: DELETE FROM and UPDATE ... SET
All of the above generate one or more JSON commits. If a job fails midway, its files are never referenced by a committed log entry, so readers ignore them; the orphaned files are cleaned up by the next VACUUM operation, preserving atomicity.
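Row-level deletes and updates can also be issued from Python through the DeltaTable API. A sketch, assuming the event_date and status columns exist on the table:

from delta.tables import DeltaTable

tbl = DeltaTable.forPath(spark, "s3://bucket/delta/events")

# Each call commits its own entry to _delta_log/, just like the SQL forms.
tbl.delete("event_date < '2023-01-01'")          # transactional DELETE
tbl.update(
    condition="status = 'pending'",
    set={"status": "'expired'"},                 # values are SQL expressions
)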
Delta Lake provides snapshot isolation: readers see a consistent snapshot of the table as of the moment their query started, regardless of subsequent writes. They never observe partially written data, yet writers can proceed concurrently.
Streaming and small batch writes create many tiny Parquet files, which hurt downstream query performance. The OPTIMIZE command compacts files into larger ones and records removed files in the transaction log. Later, VACUUM permanently deletes files older than the retention threshold (default 7 days). The two-step process guarantees durability: data is only deleted after every snapshot that might reference it has aged out.
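Both steps are scriptable. The sketch below assumes open-source Delta Lake 2.0 or later (where OPTIMIZE and ZORDER BY are available) and a user_id column to cluster on:

from delta.tables import DeltaTable

# Compact small files and co-locate rows by a high-cardinality column.
spark.sql("OPTIMIZE delta.`s3://bucket/delta/events` ZORDER BY (user_id)")

# Permanently delete files no longer referenced by any retained snapshot.
DeltaTable.forPath(spark, "s3://bucket/delta/events").vacuum(168)   # retention in hours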
from pyspark.sql import SparkSession
spark = (SparkSession.builder
.appName("delta-demo")
.config("spark.sql.extensions","io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog","org.apache.spark.sql.delta.catalog.DeltaCatalog")
.getOrCreate())
path = "s3://bucket/delta/events"
df = spark.read.json("/data/raw/events/")
# 1. Append new data atomically
(df.write.format("delta")
.mode("append")
.partitionBy("event_date")
.save(path))
# 2. Upsert corrections
from delta.tables import DeltaTable
delta_tbl = DeltaTable.forPath(spark, path)
updates = spark.read.json("/data/corrections/")
(delta_tbl.alias("t")
.merge(updates.alias("u"), "t.event_id = u.event_id")
.whenMatchedUpdateAll()
.whenNotMatchedInsertAll()
.execute())
# 3. Query historical version
spark.read.format("delta").option("versionAsOf", 5).load(path).show()
You can register Delta paths as external tables in Hive Metastore or the Unity Catalog, exposing them via ANSI SQL. Tools like Galaxy—a modern, developer-centric SQL editor—can then query those tables, leverage autocomplete against Delta metadata, and share vetted queries via Collections. While Delta’s ACID logic runs in the underlying Spark engine, Galaxy provides a fast interface for writing and collaborating on SELECT, MERGE, or OPTIMIZE commands without context-switching to notebooks.
A few operational best practices:
- Enable spark.databricks.delta.autoCompact.enabled=true for write-time compaction of small files (see the configuration sketch below).
- Use ZORDER BY on high-cardinality columns to optimize data skipping.
- Keep the VACUUM retention period compliant with your audit requirements; never drop below 168 hours if streaming consumers rewind.
- Partition by low-cardinality columns (e.g., event_date, country); over-partitioning hurts performance.
- Monitor _delta_log size; write checkpoints manually with CHECKPOINT if large backfills generate >10,000 commits.
While Delta provides warehouse-grade reliability, you may still need specialized engines (DuckDB, ClickHouse) for ultra-low-latency BI. Think of Delta as the storage foundation.
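A configuration sketch covering the first and third items in the checklist above: the autoCompact conf is the one named there, while delta.deletedFileRetentionDuration is an additional table property introduced here for illustration (verify the name against your Delta release):

# Sketch: enable write-time compaction for the session and extend the
# retention window that governs what VACUUM may delete for one table.
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.sql("""
    ALTER TABLE delta.`s3://bucket/delta/events`
    SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 30 days')
""")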
Each historical snapshot references old files retained on object storage. Leaving VACUUM disabled can double or triple storage costs.
Allowing unchecked evolution across dozens of upstream teams can lead to sparsely populated columns and query complexity. Govern schema changes via pull requests or a data contract.
Delta Lake bridges the gap between inexpensive, flexible data lakes and the ACID guarantees historically available only in databases. Its transaction log, checkpoints, and optimistic concurrency control let teams build reliable pipelines, experiment safely with time travel, and query petabyte-scale datasets—all while using open file formats compatible with the broader big-data ecosystem.
Without Delta Lake, data lakes suffer from inconsistent reads, partial writes, and silent schema drift—issues that break production dashboards, ML models, and compliance audits. By adding an ACID-compliant layer atop low-cost object storage, Delta Lake enables the coveted “lakehouse” architecture where a single copy of data fuels ETL, streaming, BI, and AI workloads with high reliability.
Is Delta Lake open source? Yes. Delta Lake was open-sourced by Databricks under the Apache 2.0 license and is governed by the Linux Foundation.
How does Delta Lake compare with Apache Iceberg and Apache Hudi? All three provide ACID semantics on data lakes, but they vary in write concurrency models, metadata layouts, and ecosystem tooling. Delta is tightly integrated with Apache Spark and emphasizes simple SQL commands, whereas Iceberg focuses on engine-agnostic design and Hudi targets incremental ingestion use cases.
Can I query Delta tables with Galaxy? Absolutely. Register your Delta path in Hive Metastore or Unity Catalog, connect Galaxy to the same metastore, and write standard SQL. Galaxy’s AI Copilot will autocomplete columns, suggest filters, and let you share the query via Collections.
What happens if a write job fails partway through? The transaction never commits, so its partially written Parquet files are not referenced by any log entry; readers ignore them, and they become eligible for deletion during the next VACUUM, preserving atomicity.