Data Lakehouse
A data lakehouse is an analytics architecture that combines the low-cost, flexible storage of a data lake with the transactional guarantees and management features of a data warehouse, so structured and unstructured data can live in open formats and be queried with SQL or other engines. By merging the scalability of lakes with the management and performance of warehouses, it lets teams run BI, ML, and real-time analytics from a single, open data platform.
A data lakehouse is an open analytics architecture that layers ACID transactions, schema enforcement, and data governance directly on top of low-cost object storage. It allows teams to ingest raw data like a traditional data lake, while simultaneously offering the reliability, performance, and SQL semantics expected from a data warehouse. The result is a single source of truth for both structured and unstructured data that supports business intelligence (BI), data science, and streaming workloads without forcing data copies across systems.
Traditional data stacks force organizations to maintain both a data lake for inexpensive storage and a warehouse for curated, query-ready data, a setup that introduces costly data movement, duplicated pipelines, and governance headaches. Lakehouses address these pain points by bringing warehouse-style management to a single copy of the data on lake storage.
Data warehouses excel at fast SQL queries over structured data but struggle with semi-structured and unstructured formats, and their proprietary storage makes storing petabyte-scale datasets expensive.
Data lakes democratized storage by placing raw files—CSV, Parquet, images—into cheap object stores like Amazon S3. Unfortunately, data lakes lacked transactions and governance, causing so-called “data swamps.”
The lakehouse architecture emerged around 2019 (pioneered by Databricks with Delta Lake, followed by Apache Iceberg and Hudi) to fuse these paradigms. By adding a transaction log and metadata layer to open file formats, lakehouses deliver warehouse-grade capabilities on lake storage.
A lakehouse is typically assembled from four core layers.
Object storage: S3, Azure Data Lake Storage, or Google Cloud Storage holds the immutable Parquet/ORC files.
Transaction log: an ordered record of every write (e.g., _delta_log for Delta Lake, metadata for Iceberg) that guarantees ACID semantics; see the sketch after this list.
Catalog: table definitions, schemas, and statistics registered in Apache Hive Metastore, AWS Glue, Unity Catalog, or open catalogs like Nessie.
Query engines: Spark, Trino, Presto, Flink, Snowflake, BigQuery, DuckDB, or even Postgres via a foreign data wrapper can all query the same data.
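To make the transaction-log layer concrete, the following sketch lists the numbered JSON commit files a Delta table keeps under its _delta_log directory and prints which actions (add, remove, metaData, commitInfo) each commit contains. The table path is hypothetical, and the snippet assumes a Spark session with S3 access already configured.

from pyspark.sql import SparkSession
import json

# Hypothetical table path; any Delta table's _delta_log works the same way.
LOG_PATH = "s3://my-bucket/lakehouse/some_table/_delta_log/*.json"

spark = SparkSession.builder.appName("inspect-delta-log").getOrCreate()

# Each committed write adds one numbered JSON file (00000000000000000000.json, ...).
# Readers replay these action files to reconstruct a consistent table snapshot.
commits = spark.sparkContext.wholeTextFiles(LOG_PATH).collect()

for path, content in sorted(commits):
    actions = [json.loads(line) for line in content.splitlines() if line.strip()]
    kinds = sorted({key for action in actions for key in action})
    print(path.rsplit("/", 1)[-1], kinds)  # e.g. 00000000000000000001.json ['add', 'commitInfo']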
Delta Lake (Linux Foundation), Apache Iceberg, and Apache Hudi are the dominant open-source table formats. Proprietary offerings such as Snowflake Iceberg Tables, Databricks Unity Catalog, and BigQuery Object Tables bring similar ideas to managed platforms.
Assume sales data lands as daily CSVs in s3://company-data/raw/sales/. A Spark job converts them to Delta Lake and creates a table:
from pyspark.sql import SparkSession

# Assumes the delta-spark package is on the classpath and the session is
# configured with the Delta Lake SQL extension.
spark = SparkSession.builder.appName("ingest").getOrCreate()

# Read the raw daily CSV drops.
raw_df = spark.read.option("header", True).csv("s3://company-data/raw/sales/")

# Write them out as a partitioned Delta table on object storage.
(raw_df
    .write
    .format("delta")
    .partitionBy("sale_date")
    .mode("append")
    .save("s3://company-data/lakehouse/sales"))

# Register the table in the catalog so engines can query it by name.
spark.sql("""
    CREATE TABLE analytics.sales
    USING DELTA
    LOCATION 's3://company-data/lakehouse/sales'
""")
A downstream analyst can query the same table with Trino or Galaxy’s SQL editor:
SELECT customer_id, SUM(amount) AS revenue
FROM analytics.sales
WHERE sale_date >= DATE '2024-01-01'
GROUP BY customer_id
ORDER BY revenue DESC
LIMIT 100;
Because most lakehouse table formats expose a standard ANSI-SQL interface, they plug seamlessly into Galaxy. Connect Trino, Postgres, or DuckDB to your object store, launch Galaxy's desktop app, and query lakehouse tables the same way you would any warehouse table.
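As one lightweight example, the same Delta table can be checked locally from Python with DuckDB before wiring it into Galaxy. This sketch assumes DuckDB's delta and httpfs extensions are installable in your environment and that S3 credentials are already configured.

import duckdb

con = duckdb.connect()

# Assumes the delta and httpfs extensions can be installed and that S3
# credentials are provided via the environment.
for ext in ("delta", "httpfs"):
    con.sql(f"INSTALL {ext}")
    con.sql(f"LOAD {ext}")

top_customers = con.sql("""
    SELECT customer_id, SUM(amount) AS revenue
    FROM delta_scan('s3://company-data/lakehouse/sales')
    WHERE sale_date >= DATE '2024-01-01'
    GROUP BY customer_id
    ORDER BY revenue DESC
    LIMIT 100
""").df()

print(top_customers.head())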
Pick a lakehouse if you need open storage, diverse analytics, or ML on large volumes of semi-structured data. For extremely small, purely relational workloads, a cloud warehouse alone may suffice.
The data lakehouse delivers the elusive single platform for all analytics by marrying lake economics with warehouse reliability. With open formats like Delta Lake, Iceberg, and Hudi—and SQL editors like Galaxy—it has become the modern default for scalable, cost-effective data architectures.
Data teams are under pressure to support real-time dashboards, advanced ML, and petabyte-scale storage while keeping costs down. Traditional approaches split workloads between costly warehouses and messy data lakes, creating silos and governance nightmares. The lakehouse resolves these contradictions by introducing ACID semantics, schema management, and fine-grained security directly on lake storage, letting organizations perform BI, streaming, and AI from one open platform while avoiding vendor lock-in.
A lakehouse removes the need for separate data lakes and warehouses, cutting storage costs, eliminating data copies, and providing one governed platform for BI, ML, and real-time analytics.
Even modest datasets benefit from open formats, low storage cost, and rollback capabilities, so a lakehouse is worth considering for small teams too. Managed offerings or serverless engines lower the operational burden.
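For instance, the rollback capability mentioned above is exposed in Delta Lake as time travel over the transaction log. The sketch below reads an earlier snapshot of the example sales table by version number; the version value is illustrative, and the delta-spark package is assumed to be configured.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("time-travel").getOrCreate()

TABLE_PATH = "s3://company-data/lakehouse/sales"

# Read the table as it existed at an earlier commit (version 3 is illustrative).
previous = (spark.read
            .format("delta")
            .option("versionAsOf", 3)
            .load(TABLE_PATH))

current = spark.read.format("delta").load(TABLE_PATH)

# Compare snapshots, e.g. to verify what a bad load changed before restoring it.
print(previous.count(), current.count())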
Galaxy connects to any SQL engine that can query lakehouse tables—such as Trino, DuckDB, or Spark Thrift Server. Once connected, Galaxy’s AI copilot assists with auto-generated, optimized SQL against Delta/Iceberg/Hudi tables.
All three provide ACID transactions and schema evolution. Delta Lake emphasizes simplicity and tight Spark integration, Iceberg focuses on engine neutrality and hidden partitioning, while Hudi shines in streaming upserts. Choose based on workload and ecosystem.
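To illustrate the hidden partitioning that sets Iceberg apart, the sketch below declares a table partitioned by a transform of a timestamp column. The catalog, table, and column names are hypothetical, and the Spark session is assumed to be configured with an Iceberg catalog.

# Hypothetical Iceberg table using a partition transform ("hidden partitioning").
# Assumes a Spark session configured with an Iceberg catalog named `demo`.
spark.sql("""
    CREATE TABLE demo.analytics.events (
        event_id   BIGINT,
        user_id    BIGINT,
        event_time TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(event_time))
""")

# Queries filter on event_time directly; Iceberg prunes partitions behind the
# scenes, so no explicit partition column ever appears in the query.
spark.sql("""
    SELECT COUNT(*) AS events_in_2024
    FROM demo.analytics.events
    WHERE event_time >= TIMESTAMP '2024-01-01 00:00:00'
""").show()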