Visualizing lineage with Marquez means using the platform’s metadata graph to render interactive views of the upstream and downstream relationships between datasets and jobs.
Marquez is an open-source metadata service that captures, stores, and visualizes data lineage. By emitting OpenLineage events from your pipelines and sending them to a Marquez backend, you gain a continuously updated graph of datasets, jobs, and their relationships. This article walks through the underlying concepts, shows how to get lineage into Marquez, and demonstrates multiple ways to visualize that lineage for debugging, governance, and collaboration.
Data lineage is the record of how data moves and transforms through an organization, from its origin through each processing step to its final destination. It answers questions such as where a dataset originated, which jobs transformed it along the way, and what depends on it downstream.
Visualizing lineage turns this metadata into an interactive graph, making complex dependencies easy to understand.
Marquez focuses on three pillars: capturing metadata through OpenLineage events, storing it centrally, and visualizing lineage in its web UI and API.
Together, these capabilities let you observe dataflows, trace incidents, and meet compliance requirements.
To run Marquez locally, start Postgres, the Marquez API, and the Marquez web UI with Docker Compose. The compose file in the Marquez repository is the canonical reference; this sketch keeps only the essentials:

# docker-compose.yml
version: '3.9'
services:
  postgres:
    image: postgres:14
    environment:
      POSTGRES_USER: marquez
      POSTGRES_PASSWORD: marquez
      POSTGRES_DB: marquez

  marquez:
    image: marquezproject/marquez:latest
    ports: ["5000:5000"]
    environment:
      # DB connection settings; check the marquez.yml shipped with your
      # Marquez version for the exact variable names it substitutes.
      MARQUEZ_DB_USER: marquez
      MARQUEZ_DB_PASSWORD: marquez
      MARQUEZ_DB_HOST: postgres
      MARQUEZ_PORT: 5000
    depends_on: [postgres]

  marquez-web:
    image: marquezproject/marquez-web:latest
    ports: ["3000:3000"]
    environment:
      MARQUEZ_HOST: marquez
      MARQUEZ_PORT: 5000
    depends_on: [marquez]
Run docker compose up. The API is now available at localhost:5000 and the UI at localhost:3000.
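Before wiring up a pipeline, it can be worth confirming the API responds. A minimal sketch using Python's requests library against Marquez's namespace listing endpoint (/api/v1/namespaces):

```python
import requests

# Ask the local Marquez API for its namespaces; a fresh install should
# list at least the default namespace.
resp = requests.get("http://localhost:5000/api/v1/namespaces", timeout=10)
resp.raise_for_status()
for ns in resp.json().get("namespaces", []):
    print(ns["name"])
```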
Next, add the openlineage-airflow integration (built on the openlineage-python client) to your pipeline environment. For example, inside an Airflow DAG:
from openlineage.airflow import DAG  # DAG class provided by the openlineage-airflow integration
from airflow.operators.python import PythonOperator
from datetime import datetime

def extract():
    # fetch from source
    pass

def transform():
    # heavy lifting
    pass

def load():
    # write to warehouse
    pass

dag = DAG(
    dag_id="etl_quickstart",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)

extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
transform_task = PythonOperator(task_id="transform", python_callable=transform, dag=dag)
load_task = PythonOperator(task_id="load", python_callable=load, dag=dag)

# Chain the tasks so the emitted run events reflect the ETL order.
extract_task >> transform_task >> load_task
Set the following environment variables so Airflow will send lineage to Marquez:
OPENLINEAGE_URL=http://localhost:5000
OPENLINEAGE_NAMESPACE=demo
OPENLINEAGE_API_KEY= # leave blank for local
Each task run now posts START, COMPLETE, or FAIL events to Marquez containing inputs, outputs, schema, and run metadata.
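For reference, the payload behind each of those events follows the OpenLineage spec. An abridged sketch of a COMPLETE event written as a Python dict; the run ID, producer URI, job, and dataset names are illustrative:

```python
from datetime import datetime, timezone
from uuid import uuid4

# Abridged OpenLineage run event; real events carry additional facets.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/my-pipeline",  # who emitted the event (illustrative URI)
    "run": {"runId": str(uuid4())},
    "job": {"namespace": "demo", "name": "etl_quickstart.load"},
    "inputs": [{"namespace": "demo", "name": "public.staging_orders"}],
    "outputs": [
        {
            "namespace": "demo",
            "name": "public.orders",
            "facets": {
                "schema": {  # schema facet: column names and types
                    "fields": [
                        {"name": "order_id", "type": "INTEGER"},
                        {"name": "order_total", "type": "DECIMAL"},
                    ]
                }
            },
        }
    ],
}
```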
Open http://localhost:3000 and select the demo namespace. You will see the etl_quickstart jobs, the datasets they read and write, and the interactive lineage graph connecting them.
Need lineage inside automated checks or docs? Use the REST API. Marquez serves lineage from /api/v1/lineage, keyed by a nodeId such as dataset:<namespace>:<name>:

curl -s "http://localhost:5000/api/v1/lineage?nodeId=dataset:demo:public.orders&depth=3" | jq
This returns JSON containing nodes and edges suitable for rendering with D3, GraphViz, or internal tooling.
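As a sketch of that "internal tooling" path, the snippet below walks the response and draws it with the graphviz Python package. It assumes a response shape of a graph list whose nodes carry id, type, and outEdges entries with origin/destination; adjust the field names if your Marquez version differs.

```python
import requests
import graphviz

# Starting node for the lineage query; use any dataset or job nodeId.
NODE_ID = "dataset:demo:public.orders"

resp = requests.get(
    "http://localhost:5000/api/v1/lineage",
    params={"nodeId": NODE_ID, "depth": 3},
    timeout=10,
)
resp.raise_for_status()

def gv_id(node_id: str) -> str:
    # Graphviz treats ":" as a port separator, so build a safe node key
    # and keep the original id as the visible label.
    return node_id.replace(":", "_")

dot = graphviz.Digraph("lineage", graph_attr={"rankdir": "LR"})
for node in resp.json().get("graph", []):
    # Jobs and datasets get different shapes so the graph reads at a glance.
    shape = "box" if node.get("type") == "JOB" else "ellipse"
    dot.node(gv_id(node["id"]), label=node["id"], shape=shape)
    for edge in node.get("outEdges", []):
        dot.edge(gv_id(edge["origin"]), gv_id(edge["destination"]))

dot.render("lineage", format="svg", cleanup=True)  # writes lineage.svg
```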
A few practices keep the lineage graph useful:

- Emit lineage at the job boundary that has business meaning (e.g., an ETL step or dbt model) rather than from every helper script. This keeps the graph readable.
- Include column-level facets such as schema or columnLineage; Marquez aggregates these so you can inspect field-level impact (see the facet sketch after this list).
- Choose a namespace convention (such as team.project) and stick to it. Lineage graphs are only intuitive when dataset IDs follow a pattern.
- Run Marquez in Kubernetes with Helm (the marquez-chart) so that lineage is always available in staging and prod.
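For the facet bullet above, a columnLineage facet maps each output column to the input columns it was derived from. An abridged sketch as a Python dict; the dataset and column names are hypothetical:

```python
# Abridged columnLineage facet attached to an output dataset.
# Each output field lists the input fields it is derived from.
column_lineage_facet = {
    "fields": {
        "order_total": {
            "inputFields": [
                {"namespace": "demo", "name": "public.order_items", "field": "price"},
                {"namespace": "demo", "name": "public.order_items", "field": "quantity"},
            ]
        }
    }
}
```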
Marquez focuses on lineage capture. While it exposes dataset metadata, it’s not a full catalog like DataHub. Many orgs integrate Marquez with a catalog, feeding lineage via Kafka.
The OpenLineage emitters are non-blocking HTTP calls executed after job state transitions. Overhead is typically under 50 ms per event.
There are official emitters for Spark, dbt, Flink, Dagster, Kubernetes Jobs, and custom apps via the API. Any tool can send events if it can POST JSON.
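For a custom app, a minimal sketch using the openlineage-python client. The exact module layout has shifted across client versions; this follows the openlineage.client.run API, and the job and dataset names are illustrative:

```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")

run = Run(runId=str(uuid4()))
job = Job(namespace="demo", name="nightly_export")          # illustrative job
output = Dataset(namespace="demo", name="public.orders_export")

# Emit START, then COMPLETE with the dataset the job produced.
for state, outputs in [(RunState.START, []), (RunState.COMPLETE, [output])]:
    client.emit(
        RunEvent(
            eventType=state,
            eventTime=datetime.now(timezone.utc).isoformat(),
            run=run,
            job=job,
            producer="https://example.com/custom-emitter",
            inputs=[],
            outputs=outputs,
        )
    )
```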
Galaxy is a SQL editor and therefore is tangential to Marquez. If your team uses Galaxy to develop SQL transformations that later run in dbt or Airflow, emitting OpenLineage events from those jobs will let you trace queries authored in Galaxy through to downstream dashboards—bridging the gap between authoring and observability.
Visualizing lineage with Marquez gives teams x-ray vision into their dataflows, accelerating debugging and safeguarding trust in analytics.
Without lineage, data teams spend hours untangling dependencies, delaying incident response and risking incorrect analytics. Marquez turns emitted OpenLineage events into an always-up-to-date graph so engineers can instantly see where data came from, how it was transformed, and what will break if upstream changes occur.
Marquez supports any platform that can emit OpenLineage events. Official libraries exist for Airflow, Spark, Flink, dbt, Dagster, and Kubernetes Jobs. You can also write custom emitters in Python, Java, or any language that can send HTTP POST requests.
Yes. Retrieve lineage via the REST API or GraphQL endpoint and render it with D3, React Flow, or GraphViz. The Marquez UI can also be embedded in an iframe behind your SSO.
While dataset-level lineage is standard, OpenLineage facets allow column lineage as experimental metadata. Tools like dbt automatically include column mappings, which Marquez displays in the UI.
SQL authored in Galaxy often compiles into dbt models or Airflow tasks. If those jobs emit OpenLineage events, Marquez will visualize the downstream impact of changes to queries written in Galaxy, helping teams validate modifications before deployment.