Querying Delta Tables from Presto

Galaxy Glossary

How do I query Delta tables from Presto?

Use Presto’s Hive connector or the native Delta Lake connector to read transactionally consistent Delta tables stored on object storage or HDFS.


Description

Learn how to combine Delta Lake’s ACID-compliant Parquet tables with the speed of Presto’s distributed SQL engine.

What Are Delta Tables?

Delta Lake is an open-source storage layer that adds ACID transactions, schema evolution, versioning, and time travel to Parquet files stored on object storage or HDFS. A Delta table is simply a directory of Parquet data files accompanied by a _delta_log directory that records every commit as JSON files plus periodic Parquet checkpoints. By reading the log before scanning data files, an engine can reconstruct a snapshot-consistent view of the table without locking the underlying files.
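For example, a freshly written Delta table on S3 might look like the following (an illustrative layout; bucket, file names, and partition values are made up):

s3://gold/delta/events/
  _delta_log/
    00000000000000000000.json                 # commit records
    00000000000000000010.checkpoint.parquet   # periodic checkpoint
  event_date=2024-04-15/
    part-00000-c000.snappy.parquet            # data file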

Why Query Delta Tables from Presto?

Presto (and its fork Trino) shines as a federated MPP SQL engine capable of interactive analytics at petabyte scale. Being able to query Delta tables in Presto means you can:

  • Run low-latency ad-hoc SQL on data produced by Spark streaming or batch pipelines that land in Delta format.
  • Join Delta data with other Presto catalogs (MySQL, Postgres, Kafka, Iceberg, etc.) in a single query.
  • Expose a consistent analytics layer to BI tools and modern SQL editors like Galaxy without duplicating data or sacrificing transactional guarantees.

How Presto Understands Delta Lake

There are two main integration patterns, depending on which Presto distribution you use:

1. Native Delta Lake Connector (Trino & Starburst)

Recent versions of Trino (formerly PrestoSQL) ship a Delta Lake connector that reads the _delta_log directly. Configuration is as simple as:

# etc/catalog/delta.properties
connector.name=delta-lake
hive.metastore.uri=thrift://hms:9083

Point the catalog at a Hive metastore (or AWS Glue) that tracks the table location. The connector handles snapshot resolution, schema, partition pruning, and predicate pushdown automatically.
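After restarting the coordinator you can sanity-check the catalog like any other Presto catalog. A quick sketch, assuming the catalog is named delta, tables live in the default schema, and a table such as delta_events already exists:

SHOW SCHEMAS FROM delta;
SHOW TABLES FROM delta.default;
SELECT count(*) FROM delta.default.delta_events;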

2. Hive Connector + Delta Lake Reader (PrestoDB)

The original PrestoDB line does not yet include a dedicated Delta connector, but you can enable Delta support in the Hive connector by flipping a flag:

# etc/catalog/hive.properties
connector.name=hive-hadoop2
delta.enabled=true

When delta.enabled=true, Presto’s Hive split enumerator recognizes Delta directories, parses the JSON log, and exposes the table just like any Hive table.
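Once enabled, a Delta table registered in the metastore is queryable through the hive catalog like any regular table. An illustrative query, reusing the delta_events table created in the walkthrough below:

SELECT event_date, count(*) AS events
FROM hive.default.delta_events
GROUP BY event_date;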

Step-by-Step: Registering and Querying a Table

  1. Write or convert data to Delta format.
    Example Spark conversion:
    (spark.read.parquet('s3://raw/events')
        .write.format('delta')
        .partitionBy('event_date')
        .save('s3://gold/delta/events'))
  2. Create a Hive metastore entry.
    -- Spark SQL; supplying LOCATION makes the table external
    CREATE TABLE delta_events
    USING DELTA
    LOCATION 's3://gold/delta/events';
  3. Verify Presto catalog configuration. In etc/catalog/delta.properties or hive.properties, ensure Hive metastore URI, S3 credentials, and (optionally) delta.hide-non-delta-lake-tables=true are set.
  4. Query from Presto.
    SELECT event_type, count(*) AS n
    FROM delta.default.delta_events
    WHERE event_date = DATE '2024-04-15'
    GROUP BY event_type
    ORDER BY n DESC;

Presto will:

  1. Read the latest _delta_log/*.json commit.
  2. Download the most recent checkpoint parquet if available.
  3. Generate splits only for active parquet files matching the partition and predicate.
  4. Stream results back with millisecond-level startup.
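If you would rather register the table from SQL instead of Spark (step 2 above), recent Trino releases also ship a register_table procedure in the Delta Lake connector. A hedged sketch, since the procedure and the flag that enables it vary by version:

CALL delta.system.register_table(
    schema_name => 'default',
    table_name => 'delta_events',
    table_location => 's3://gold/delta/events');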

Performance Best Practices

  • Optimize tables. Periodically run OPTIMIZE or VACUUM from Spark to compact small files and remove tombstones, so Presto scans fewer objects (see the sketch after this list).
  • Partition wisely. Choose low-cardinality columns that appear in common filters (e.g., event_date). Avoid over-partitioning; Presto already pushes predicates down into Parquet.
  • Use DELTA_LAKE_CONCURRENT_READS. Trino’s connector can parallelize JSON log parsing; set this session property to speed up parsing of huge commit logs.
  • Leverage manifest files via GENERATE symlink_format_manifest. If you run very old Presto versions, point Hive at the manifest to avoid Delta-specific logic.
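As a sketch of the maintenance commands above, run from Spark SQL against the table path (OPTIMIZE and GENERATE require newer Delta Lake releases; the 168-hour retention is just an example):

-- Compact small files
OPTIMIZE delta.`s3://gold/delta/events`;

-- Remove files no longer referenced by the log
VACUUM delta.`s3://gold/delta/events` RETAIN 168 HOURS;

-- Produce a manifest for engines without Delta support
GENERATE symlink_format_manifest FOR TABLE delta.`s3://gold/delta/events`;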

Versioned Queries and Time Travel

Delta Lake’s history is fully accessible from Presto. You can query a snapshot as of a particular version or timestamp:

-- Snapshot by version
SELECT *
FROM delta.default.delta_events FOR VERSION AS OF 325;

-- Snapshot by timestamp
SELECT *
FROM delta.default.delta_events
FOR TIMESTAMP AS OF TIMESTAMP '2024-04-10 00:00:00';

This is invaluable for debugging data issues, auditing, or building slowly changing dimension logic without physically copying data.
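To find a version worth travelling to, Trino’s Delta Lake connector exposes the commit history as a $history metadata table. A sketch; metadata-table support and its columns vary by version:

SELECT version, timestamp, operation
FROM delta.default."delta_events$history"
ORDER BY version DESC
LIMIT 10;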

Managing Updates and Deletes

When Spark updates a Delta table, Presto automatically sees the new snapshot on the next query; no refresh is required. Starting with Trino 410, you can even run DELETE, UPDATE, and MERGE directly from Presto, making the lakehouse fully bi-directional.
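For illustration, the following statements run directly from Trino 410+ against the catalog configured earlier; the staging table and its event_id column are hypothetical:

-- Remove noise rows in place
DELETE FROM delta.default.delta_events
WHERE event_type = 'heartbeat';

-- Upsert new arrivals from a (hypothetical) staging table
MERGE INTO delta.default.delta_events AS t
USING delta.default.delta_events_staging AS s
  ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET event_type = s.event_type
WHEN NOT MATCHED THEN INSERT (event_id, event_type, event_date)
  VALUES (s.event_id, s.event_type, s.event_date);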

Error Handling & Troubleshooting

  • Missing metastore entry: If a SELECT fails with Table not found, ensure you created an external table or enabled delta.register-table-on-read.
  • Incompatible protocol version: Spark writers may bump the Delta protocol version. Upgrade Presto or set delta.allow-high-protocol=true (read-only).
  • S3 404 errors: Validate IAM permissions; Presto needs s3:ListBucket and s3:GetObject for every Parquet path in the snapshot.

Galaxy and Delta-on-Presto

Because Galaxy speaks native Presto over JDBC, you can add your Presto cluster as a connection once and instantly explore every Delta table through Galaxy’s AI copilot, autocomplete, and shared collections:

  • Start typing SELECT * FROM and Galaxy suggests Delta tables plus column docs.
  • Let the copilot rewrite a query when the Delta schema evolves.
  • Endorse the canonical events_last_30_days query so teammates stop copy-pasting SQL in Slack.

Conclusion

Querying Delta tables from Presto brings together the best of both worlds: ACID data lakes and blazing-fast distributed SQL. With a few configuration tweaks you can run sub-second analytics, join Delta with any other source, run point-in-time queries via time travel, and surface everything through developer-focused tools like Galaxy.

Enable Delta in your Presto catalog today and unlock a unified, scalable lakehouse that just speaks SQL.

Why Querying Delta Tables from Presto is important

Modern data platforms often land raw data in object storage as Delta Lake while interactive workloads rely on Presto for federated SQL. Bridging the two enables analysts and applications to leverage ACID guarantees without duplicating data or moving it into specialized warehouses. Mastering this integration eliminates silos, reduces latency, and supports advanced features such as time-travel debugging and in-place updates—all critical capabilities for data engineers who build reliable, self-service analytics layers.

Querying Delta Tables from Presto Example Usage


SELECT * FROM delta.default.sales_orders FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00' LIMIT 100;


Frequently Asked Questions (FAQs)

Can Presto write to Delta tables or only read?

Recent Trino versions (410+) and Starburst Enterprise allow INSERT, DELETE, UPDATE, and MERGE statements on Delta tables. PrestoDB remains read-only today.

Is Hive metastore mandatory for Delta?

No. The Delta connector can read a table straight from the filesystem when delta.register-table-on-read=true is set. However, a metastore simplifies discovery, permissions, and statistics.

How does Galaxy help when querying Delta via Presto?

Galaxy’s desktop SQL editor connects to Presto over JDBC, autocompletes Delta tables, suggests column docs, and lets teams share reviewed queries in Collections—speeding up collaboration.

What Presto version do I need?

Use Trino 356+ or Starburst 364+ for stable Delta read support, and Trino 410+ for write operations. For PrestoDB, run 0.282 or later with delta.enabled=true.
