Column-level lineage tools trace how every individual field in your datasets is produced, transformed, and consumed across your data stack.
Column-level lineage provides field-by-field visibility into data flows, helping teams debug pipelines faster, meet regulatory requirements, and build trust in analytics.
Data lineage is the record of how data moves and changes from its origin to its final destination. Column-level lineage takes this one step deeper by mapping transformation logic for each individual column rather than at the coarse table or dataset level. Instead of saying "table `orders_daily` comes from `orders_raw`," column-level lineage can tell you that:

- `orders_daily.order_id` is a direct copy of `orders_raw.id`
- `orders_daily.revenue_usd` is derived from `orders_raw.price_cents / 100`
- `orders_daily.source_country` comes from a lookup inside `dim_countries`
This granular map is invaluable for impact analysis, debugging, governance, and cost-efficient pipeline design.
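To see why impact analysis gets so much easier, note that once field-level edges exist, finding the blast radius of a column change is just a graph traversal. Here is a minimal, dependency-free sketch; the edge list and column names are illustrative, not output from any particular tool:

```python
from collections import defaultdict, deque

# Column-level lineage edges: parent column -> child columns.
# Illustrative only; real tools derive these edges from parsed SQL.
EDGES = {
    "orders_raw.id": ["orders_daily.order_id"],
    "orders_raw.price_cents": ["orders_daily.revenue_usd"],
    "orders_daily.revenue_usd": ["finance_dashboard.total_revenue"],
}

def blast_radius(column: str) -> set[str]:
    """Return every downstream column affected by a change to `column`."""
    graph = defaultdict(list, EDGES)
    seen, queue = set(), deque([column])
    while queue:
        for child in graph[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(sorted(blast_radius("orders_raw.price_cents")))
# ['finance_dashboard.total_revenue', 'orders_daily.revenue_usd']
```

A table-level view would only say "something downstream of `orders_raw` may break"; the column-level edges pinpoint exactly which dashboard field is affected.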
Column-level lineage also shows exactly where sensitive fields such as `email` or `ssn` flow, which matters for privacy and compliance work.

Originally open-sourced by LinkedIn, DataHub stores metadata in a graph database and offers automated column-level lineage for supported sources, including Spark jobs via its Spline integration. DataHub's React UI renders a lineage graph where each column can be expanded to show its parents and children.
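As one possible starting point, a DataHub ingestion recipe for a warehouse source looks roughly like the following. The source type, credentials, and server URL are placeholders, and the exact option names vary by connector, so treat this as a sketch and check the DataHub docs for your source:

```yaml
# Hypothetical recipe: ingest Snowflake metadata into DataHub.
source:
  type: snowflake
  config:
    account_id: my_account            # placeholder
    warehouse: COMPUTE_WH
    username: datahub_reader
    password: ${SNOWFLAKE_PASSWORD}
    include_column_lineage: true      # assumption: verify the flag name for your connector
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080     # placeholder DataHub GMS endpoint
```

Run it with the `datahub ingest` CLI and the lineage graph populates as metadata lands.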
Backed by the CNCF, OpenMetadata includes a built-in SQL parser and supports column-level lineage for more than 20 databases and orchestrators, capturing lineage automatically during metadata ingestion. The UI highlights column-to-column edges and allows tag propagation for PII or data quality scores.
OpenLineage is an open standard for lineage events; Marquez is the reference implementation that stores and visualizes them. Recent versions of the spec include a `columnLineage` facet, allowing integrations to emit field-level edges. Spark, dbt, Airflow, and Flink emitters already support this.
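To make the `columnLineage` facet concrete, here is a sketch of the JSON an emitter might attach to an output dataset. The field layout follows the ColumnLineageDatasetFacet schema; the namespaces and producer URL are placeholders, so consult the OpenLineage spec for the authoritative schema:

```python
import json

# Sketch of an OpenLineage output dataset carrying a columnLineage facet:
# each output field lists the input fields it was derived from.
output_dataset = {
    "namespace": "snowflake://acme",  # placeholder namespace
    "name": "analytics.orders_daily",
    "facets": {
        "columnLineage": {
            "_producer": "https://example.com/my-emitter",  # placeholder
            "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/ColumnLineageDatasetFacet.json",
            "fields": {
                "revenue_usd": {
                    "inputFields": [
                        {
                            "namespace": "snowflake://acme",
                            "name": "raw.orders_raw",
                            "field": "price_cents",
                        }
                    ]
                }
            },
        }
    },
}

payload = json.dumps(output_dataset, indent=2)
print(payload)
```

Any consumer that understands the facet (Marquez, DataHub, and others) can turn these `inputFields` entries into column-to-column edges.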
Atlas provides governance and classification for the Hadoop ecosystem. The Hive hook and Spark Atlas Connector can emit field-level lineage into Atlas's graph, backed by JanusGraph with Solr for search. UI support is less modern than DataHub's or OpenMetadata's but still functional.
Spline (Spark Lineage) captures execution plans from Apache Spark and stores column transformations in a MongoDB / ArangoDB graph. While Spark-specific, it excels at deep lineage for complex DataFrame operations.
Tokern, built on `sqlparse`, ingests query history from Snowflake, Redshift, Postgres, and BigQuery to compute column-level lineage. A minimal UI and an Airflow operator make it attractive for teams wanting a lightweight solution.
`sqllineage` (Python) can parse SQL strings and output column dependencies. It's more of a library than a platform, but great for embedding lineage directly in CI checks or notebooks.
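That CI-embedding idea can be sketched without any dependencies: compute the column edge list, then fail the build when a sensitive field reaches an exposed table. In a real check the `LINEAGE` dict below would come from a parser such as `sqllineage` rather than being hand-written, and the table names here are hypothetical:

```python
# Toy CI guard: flag any PII column that can reach a public-facing table.
# Edges are hard-coded for illustration; a real check would compute them
# with a SQL lineage parser over the project's queries.
LINEAGE = {
    "users_raw.email": {"users_clean.email"},
    "users_clean.email": {"public_dashboard.contact"},
    "orders_raw.price_cents": {"orders_daily.revenue_usd"},
}
PII = {"users_raw.email"}

def reachable(start: str) -> set[str]:
    """All columns downstream of `start`, via depth-first search."""
    out, stack = set(), [start]
    while stack:
        for child in LINEAGE.get(stack.pop(), ()):
            if child not in out:
                out.add(child)
                stack.append(child)
    return out

violations = {
    col for src in PII for col in reachable(src)
    if col.startswith("public_")
}
if violations:
    print(f"CI check would fail: PII reaches {sorted(violations)}")
```

Wired into a pre-merge pipeline, a check like this blocks a pull request before `email` ever lands in a public dashboard.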
Consider a change to `orders_raw.vat_rate`: who screams? Column-level lineage answers that question before the change ships.

For a concrete dbt walkthrough: configure the `openmetadata_dbt` ingestion pipeline to read `target/manifest.json` and `target/run_results.json`, run `dbt run && dbt test`, then open `customers_lifetime_value.ltv` in the lineage view to see its upstream columns (`orders.amount_usd`, `payments.fee`, etc.).

Galaxy's SQL editor stores rich query history and metadata. While Galaxy does not yet generate lineage graphs, its APIs can export executed SQL statements and parameter values. You can feed those logs into OpenLineage or Tokern to achieve column-level lineage without changing your workflow. Future roadmap items like built-in data catalog features may surface lineage directly inside Galaxy's UI.
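A sketch of that hand-off: build a minimal OpenLineage run event from one exported query and prepare a POST to Marquez's default lineage endpoint. The namespace, producer URL, job name, and port are assumptions, and the event omits optional spec fields, so adapt it to your setup:

```python
import json
import urllib.request
import uuid
from datetime import datetime, timezone

def openlineage_event(sql: str, job_name: str) -> dict:
    """Build a minimal OpenLineage run event for one executed query."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/query-history-export",  # placeholder
        "job": {
            "namespace": "galaxy",                 # assumed namespace
            "name": job_name,                      # hypothetical job name
            "facets": {"sql": {"query": sql}},     # simplified sql job facet
        },
        "run": {"runId": str(uuid.uuid4())},
        "inputs": [],
        "outputs": [],
    }

event = openlineage_event("SELECT id FROM orders_raw", "daily_orders")
req = urllib.request.Request(
    "http://localhost:5000/api/v1/lineage",  # Marquez's default API port
    data=json.dumps(event).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment to actually send the event
```

Looping this over an exported query log gives Marquez (or any OpenLineage consumer) the raw material to build lineage from queries you have already run.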
Without column-level lineage, engineering and analytics teams struggle to debug broken dashboards, comply with regulations, and understand the blast radius of schema changes. Field-level visibility dramatically shortens incident response times, simplifies compliance audits, and reduces compute costs by highlighting unused columns.
Is column-level lineage the same as data quality testing? No. Lineage tells you where data comes from, while quality tests verify whether the data is correct. They are complementary.
What is the fastest way to get started? DataHub and OpenMetadata both ingest dbt artifacts out of the box; you can be viewing column-level graphs in under an hour.
Does Galaxy support column-level lineage? Galaxy currently focuses on query authoring and collaboration. However, you can export query history to OpenLineage or Tokern to build lineage today, and native support is on the roadmap.
How much does capturing lineage cost? Storage costs grow with history length, but open-source tools let you configure TTL or partitioned storage, and capture overhead is typically negligible if you stream events asynchronously.