Column-level lineage is the automated tracking of how every individual column in a dataset is produced and transformed across pipelines.
Column-level lineage answers the question “Where did this column come from, how was it calculated, and who will be affected if I change it?” Unlike table-level lineage, which shows only dataset-to-dataset relationships, column-level lineage drills down to every field, mapping dependencies from raw sources through SQL, ELT tools, notebooks, and dashboards. This granularity is rapidly becoming a must-have for modern data teams that embrace self-service analytics, complex transformations, and stringent compliance requirements.
Tracking data at the column level delivers tangible value in several areas: impact analysis before schema changes, faster root-cause debugging when a metric looks wrong, compliance and audit documentation for sensitive fields, and greater trust in self-service analytics because anyone can verify how a column is calculated.
Lineage engines parse code (SQL, Python, Scala), observe runtimes (Spark execution plans, dbt manifests), or listen to logs (Snowflake QUERY_HISTORY) to capture transformation metadata. Each parsed statement produces "read" and "write" events at the column level.
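As a rough illustration of this step, the sketch below parses a single statement with sqlglot (the same parser DataHub uses for static analysis) and derives read and write events. A production engine would also resolve aliases, SELECT *, CTEs, and subqueries; the table and column names here are made up for the example.

# Minimal sketch: derive column-level "read"/"write" events from one statement.
import sqlglot
from sqlglot import exp

sql = """
CREATE TABLE analytics.order_totals AS
SELECT o.order_id, SUM(o.total_amount) AS total_amount
FROM raw.orders AS o
GROUP BY o.order_id
"""

statement = sqlglot.parse_one(sql)

# "read" events: every source column referenced anywhere in the statement
reads = sorted({(col.table, col.name) for col in statement.find_all(exp.Column)})

# "write" events: the columns produced by the outermost SELECT
select = statement.find(exp.Select)
writes = [e.alias_or_name for e in select.expressions]

print("reads:", reads)    # [('o', 'order_id'), ('o', 'total_amount')]
print("writes:", writes)  # ['order_id', 'total_amount']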
The events are stitched into a Directed Acyclic Graph (DAG) where nodes represent <dataset, column> pairs and edges describe transformation functions (e.g., SUM, JOIN, CAST). Open-source lineage frameworks typically persist this graph in a graph database (Neo4j), a relational store (PostgreSQL), or even as JSON files.
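As an in-memory illustration, the same stitching can be expressed with networkx; the dataset and column names are placeholders, and a production system would persist the identical structure in Neo4j, PostgreSQL, or JSON as noted above.

# Illustrative sketch: stitch column-level events into a DAG where each node
# is a (dataset, column) pair and each edge records the transformation.
import json
import networkx as nx

graph = nx.DiGraph()
graph.add_edge(
    ("raw.orders", "total_amount"),
    ("analytics.order_totals", "total_amount"),
    transform="SUM",
)
graph.add_edge(
    ("analytics.order_totals", "total_amount"),
    ("dashboards.revenue", "monthly_revenue"),
    transform="CAST",
)

# Upstream trace: every column that feeds dashboards.revenue.monthly_revenue
print(nx.ancestors(graph, ("dashboards.revenue", "monthly_revenue")))

# Downstream impact: everything affected if raw.orders.total_amount changes
print(nx.descendants(graph, ("raw.orders", "total_amount")))

# Serialize the graph as JSON, one of the persistence options mentioned above
print(json.dumps(nx.node_link_data(graph), indent=2))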
APIs or UI components expose the DAG so users can visualize end-to-end flows and run impact analysis (e.g., "What breaks if I drop orders.total_amount?").
OpenMetadata combines a metadata store, ingestion framework, and React UI. Column-level lineage is extracted via SQL parsers and database query logs. It supports Snowflake, BigQuery, Redshift, Postgres, dbt, and Spark.
Marquez is a spec-compliant implementation of the OpenLineage standard: it receives OpenLineage events from jobs (Airflow, Spark, dbt) and builds a column-level graph stored in PostgreSQL.
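To make that event flow concrete, here is a rough sketch of an OpenLineage run event carrying a column-lineage facet, written as a Python dict. The field names follow the OpenLineage spec's column-lineage facet, but every value below (namespaces, job, run ID) is a placeholder; consult the spec for the authoritative schema before emitting events to Marquez.

# Illustrative OpenLineage event with a columnLineage facet (placeholder values).
event = {
    "eventType": "COMPLETE",
    "eventTime": "2024-01-01T00:00:00Z",
    "producer": "https://example.com/pipelines/build_order_totals",
    "job": {"namespace": "analytics", "name": "build_order_totals"},
    "run": {"runId": "00000000-0000-0000-0000-000000000000"},
    "inputs": [{"namespace": "warehouse", "name": "raw.orders"}],
    "outputs": [
        {
            "namespace": "warehouse",
            "name": "analytics.order_totals",
            "facets": {
                "columnLineage": {
                    "fields": {
                        "total_amount": {
                            "inputFields": [
                                {"namespace": "warehouse", "name": "raw.orders", "field": "total_amount"}
                            ]
                        }
                    }
                }
            },
        }
    ],
}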
Originally created at LinkedIn, DataHub uses sqlglot for static SQL parsing and supports dbt's manifest.json. Column-level edges are materialized in an Apache Kafka-backed metadata store.
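As a rough illustration of what such static parsing yields, sqlglot ships a lineage helper that walks one output column back through a query; the SQL below is a made-up example, and real deployments usually pass a schema so unqualified columns resolve correctly.

# Hedged sketch: trace one output column back to its sources with sqlglot.
from sqlglot.lineage import lineage

sql = """
SELECT o.order_id, SUM(o.total_amount) AS total_amount
FROM raw.orders AS o
GROUP BY o.order_id
"""

node = lineage("total_amount", sql)

# walk() yields the target column first, then each upstream column it depends on
for hop in node.walk():
    print(hop.name)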
SQLLineage is a lightweight Python library that parses SQL with sqlparse plus custom grammars. It is a good fit for embedding lineage into existing ETL codebases.
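A minimal sketch of embedding it, assuming the sqllineage package is installed; get_column_lineage() is available in recent releases but may differ in older versions.

# Illustrative use of SQLLineage to extract table- and column-level lineage.
from sqllineage.runner import LineageRunner

sql = """
INSERT INTO analytics.order_totals
SELECT order_id, SUM(total_amount) AS total_amount
FROM raw.orders
GROUP BY order_id
"""

runner = LineageRunner(sql)
print(runner.source_tables())       # tables read by the statement
print(runner.target_tables())       # tables written by the statement
print(runner.get_column_lineage())  # (source column, target column) tuples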
Spline captures column-level lineage directly from Spark execution plans without parsing source code, making it ideal for JVM-based big data pipelines.
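As a sketch of that codeless setup, the Spark session below registers Spline's query-execution listener purely through configuration. The listener class and config keys are taken from Spline's agent documentation and vary by Spline and Spark version, so verify them (and add the agent bundle to the classpath) before relying on this.

# Hedged sketch: enable the Spline agent on a PySpark session via configuration only.
# Assumes the Spline Spark agent bundle is already on the driver/executor classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orders_pipeline")
    .config(
        "spark.sql.queryExecutionListeners",
        "za.co.absa.spline.harvester.listener.SplineQueryExecutionListener",
    )
    .config(
        "spark.spline.lineageDispatcher.http.producer.url",
        "http://localhost:8080/producer",
    )
    .getOrCreate()
)
# From here on, every DataFrame write is reported to the Spline server, which
# derives column-level lineage from the Spark execution plan.

Putting these tools to work, the snippet below sketches how a dbt project's lineage can be ingested into OpenMetadata with its Python client. Module paths and class names differ across OpenMetadata releases, so treat it as illustrative rather than copy-paste ready.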
from openmetadata.client import OpenMetadata
from openmetadata.ingestion.source.dbt import DbtSourceConfig, DbtSource

# Point the dbt source at the artifacts produced by `dbt compile` / `dbt docs generate`
config = DbtSourceConfig(
    manifest_path="./target/manifest.json",
    catalog_path="./target/catalog.json",
    project_dir="./",
    target_dataset="analytics",
)

# Initialize the OpenMetadata client against a local server (no auth for a sandbox)
ometa = OpenMetadata("http://localhost:8585/api", auth_method="no-auth")

# Run ingestion: parse dbt's manifest/catalog and publish column-level lineage
source = DbtSource(config, ometa)
source.prepare()
source.launch()
After ingestion, OpenMetadata’s UI lets you click any column and view upstream/downstream columns, SQL logic, owners, and tests derived from dbt.
Columns can also be tagged as PII, SENSITIVE, etc., and those tags propagate downstream automatically.
While Galaxy is primarily a modern SQL editor, its collaborative environment and AI copilot produce and modify the SQL that defines your data lineage. By exporting executed queries or integrating with OpenLineage emitters, Galaxy can become a first-class producer of column-level metadata, ensuring that lineage graphs stay accurate even when ad-hoc analysts iterate quickly in the editor.
Without column-level lineage, data teams fly blind—schema changes break dashboards, compliance audits stall, and engineers waste hours tracking transformations. Granular lineage provides instant impact analysis, accelerates debugging, and ensures trustworthy analytics, making it foundational to modern data governance and self-service BI.
Table-level lineage shows which datasets feed others, but it cannot reveal which specific columns are used. Column-level lineage maps dependencies field-by-field, enabling precise impact and root-cause analysis.
Not necessarily. Static SQL parsers (e.g., dbt, sqlglot) can generate lineage offline. Runtime agents or log connectors add extra fidelity for dynamic SQL and BI tools.
If you already run dbt, OpenMetadata or DataHub provide the smoothest integrations. For Spark-heavy shops, Spline captures lineage without code changes. Evaluate based on supported connectors, UI, and community health.
Galaxy focuses on query authoring, but its API can emit OpenLineage events every time a query runs. These events can then be consumed by OpenMetadata or DataHub, allowing lineage graphs to reflect work done in Galaxy.