Automated metadata ingestion is the practice of programmatically collecting, transforming, and publishing technical, operational, and business context into data catalogs like Amundsen or DataHub without manual effort.
Automating metadata ingestion into Amundsen or DataHub centralizes data knowledge, keeps the catalog fresh, and removes human toil by scheduling repeatable extraction jobs from source systems.
Data catalogs live or die by the completeness and freshness of their metadata. Manual population does not scale when you manage hundreds of databases, thousands of tables, or rapidly evolving ETL pipelines. Automated ingestion solves this with a repeatable pipeline built from a few standard components:
Extractors connect to data sources (e.g., Snowflake, BigQuery, Kafka, dbt) and read catalog data, usage metrics, or lineage. Both Amundsen and DataHub ship with dozens of built-in extractors.
Transformers are optional modules that enrich raw metadata by adding owners, cost centers, PII tags, or glossary terms.
Loaders and publishers write the canonical metadata model into the target catalog service (Neo4j/Elasticsearch for Amundsen; Kafka or the graph service for DataHub).
An external orchestrator (Apache Airflow, Dagster, Prefect, GitHub Actions, etc.) coordinates extraction frequency, manages credentials, and handles retries.
Scheduling batch ingestion from Airflow is the most common approach in production because both catalogs provide first-class Airflow integrations; a minimal DAG sketch follows the install commands below.
```bash
pip install amundsen-databuilder
# or, for DataHub
pip install 'acryl-datahub[datahub-rest]'
```
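Below is a minimal sketch of such a DAG, assuming the DataHub CLI route with a checked-in recipe; the recipe path, schedule, and retry settings are placeholders to adapt to your environment.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Nightly metadata ingestion: shells out to the DataHub CLI with a
# version-controlled recipe. Path and schedule are illustrative.
with DAG(
    dag_id="datahub_metadata_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",  # 02:00 every night
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    ingest_snowflake = BashOperator(
        task_id="ingest_snowflake",
        bash_command="datahub ingest -c /opt/recipes/snowflake.yml",
    )
```

The same pattern works for Amundsen: swap the BashOperator for a PythonOperator that launches a databuilder job like the one sketched later in this article.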
DataHub also supports declarative ingestion recipes. Teams commit YAML into a version-controlled repo, and a CI runner executes datahub ingest -c recipe.yml; merges to main trigger ingestion, keeping the whole pipeline managed as infrastructure as code.
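A recipe is a small YAML file describing a source, optional transformers, and a sink. The sketch below assumes a Snowflake source and a REST sink; exact config fields and transformer names vary by connector and library version, and the credentials are environment-variable placeholders.

```yaml
# recipe.yml -- minimal, illustrative ingestion recipe
source:
  type: snowflake
  config:
    account_id: my_account            # placeholder
    warehouse: COMPUTE_WH
    username: ${SNOWFLAKE_USER}       # resolved from the environment
    password: ${SNOWFLAKE_PASSWORD}
transformers:                         # optional enrichment step; name/config may vary by version
  - type: simple_add_dataset_tags
    config:
      tag_urns: ["urn:li:tag:ingested-by-ci"]
sink:
  type: datahub-rest
  config:
    server: http://datahub-gms:8080
```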
For real-time systems, emit MetadataChangeEvent (MCE) messages directly into the DataHub Kafka topic, achieving sub-minute freshness.
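As a sketch of this push-based path, the acryl-datahub Python library exposes a Kafka emitter. The broker and schema-registry addresses below are placeholders, this uses the newer MetadataChangeProposal wrapper rather than raw MCEs, and config field names may differ across library versions.

```python
import datahub.emitter.mce_builder as builder
from datahub.emitter.kafka_emitter import DatahubKafkaEmitter, KafkaEmitterConfig
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Placeholder connection details -- adjust to your Kafka cluster.
config = KafkaEmitterConfig.parse_obj(
    {
        "connection": {
            "bootstrap": "kafka:9092",
            "schema_registry_url": "http://schema-registry:8081",
        }
    }
)
emitter = DatahubKafkaEmitter(config)

# Describe one dataset and push the change onto the DataHub topic.
mcp = MetadataChangeProposalWrapper(
    entityUrn=builder.make_dataset_urn("snowflake", "analytics.public.orders", env="PROD"),
    aspect=DatasetPropertiesClass(description="Orders fact table, updated hourly"),
)
emitter.emit(mcp)
emitter.flush()
```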
For a concrete Amundsen example, start by installing databuilder and the Snowflake connector:

```bash
pip install amundsen-databuilder snowflake-connector-python
```
Use SnowflakeMetadataExtractor to pull columns, types, and comments. DbtExtractor parses manifest.json for upstream/downstream lineage nodes. Neo4jCsvPublisher sends nodes and relationships to Neo4j, while FsNeo4jCSVLoader stages the CSVs it publishes. Apply env tags (dev, prod) to support multi-tenant catalogs.
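A condensed sketch of how these pieces wire together into a databuilder job is shown below. The configuration keys (connection string, CSV staging paths, Neo4j credentials) are omitted and model constructor details may differ across databuilder versions, so treat this as a wiring diagram rather than a drop-in script.

```python
from pyhocon import ConfigTree

from databuilder.extractor.snowflake_metadata_extractor import SnowflakeMetadataExtractor
from databuilder.job.job import DefaultJob
from databuilder.loader.file_system_neo4j_csv_loader import FsNeo4jCSVLoader
from databuilder.publisher.neo4j_csv_publisher import Neo4jCsvPublisher
from databuilder.task.task import DefaultTask


def run_snowflake_ingestion(job_config: ConfigTree) -> None:
    """Extract Snowflake metadata, stage it as CSVs, and publish to Neo4j.

    `job_config` is a pyhocon config holding the SQLAlchemy connection string,
    CSV staging directories, and Neo4j endpoint/credentials (see the
    databuilder sample scripts for the exact keys).
    """
    task = DefaultTask(
        extractor=SnowflakeMetadataExtractor(),  # tables, columns, comments
        loader=FsNeo4jCSVLoader(),               # stage nodes/relations as CSVs
    )
    job = DefaultJob(
        conf=job_config,
        task=task,
        publisher=Neo4jCsvPublisher(),           # upsert the CSVs into Neo4j
    )
    job.launch()
```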
While Galaxy focuses on interactive SQL editing, it also becomes a producer of valuable query metadata. By exporting Galaxy's endorsed query collections (via its API) into Amundsen or DataHub, teams surface trusted analytical logic directly in their catalog. A lightweight Airflow DAG can call Galaxy APIs nightly, transform query text into the appropriate metadata model (Dashboard/Query entity), and publish it alongside database schemas, giving engineers a single pane of glass for both tables and canonical SQL.
Ingesting everything. Why it’s wrong: Extracting every temp table or staging schema floods your catalog with noise. Fix: Use extractor allowlists/regex to limit ingestion to production schemas.
Full reloads on every run. Why it’s wrong: Full reloads strain source databases and increase Neo4j/Kafka churn. Fix: Configure stateful ingestion (DataHub) or watermark filters (Amundsen).
Hardcoding credentials. Why it’s wrong: Secrets leak in Git and violate compliance. Fix: Store credentials in Vault/Secrets Manager and reference them via environment variables, as in the snippet below.
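For DataHub, all three fixes can be expressed in the recipe itself. The fragment below is illustrative and assumes a Snowflake source; pattern and stateful-ingestion option names vary slightly by connector version, and the secrets are resolved from environment variables rather than committed to Git.

```yaml
source:
  type: snowflake
  config:
    username: ${SNOWFLAKE_USER}        # injected from Vault/Secrets Manager at runtime
    password: ${SNOWFLAKE_PASSWORD}
    schema_pattern:                    # keep temp/staging schemas out of the catalog
      allow:
        - "ANALYTICS"
        - "MARTS"
      deny:
        - ".*_STAGING"
        - ".*_TMP"
    stateful_ingestion:                # incremental runs instead of full reloads
      enabled: true
      remove_stale_metadata: true
sink:
  type: datahub-rest
  config:
    server: ${DATAHUB_GMS_URL}
```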
Automated metadata ingestion transforms Amundsen or DataHub from static glossaries into living, searchable maps of your data estate. By combining out-of-the-box extractors with a robust orchestrator, you ensure that tables, dashboards, and even Galaxy queries stay discoverable and trustworthy—fueling self-service analytics and data governance at scale.
Without automation, data catalogs become stale and lose user trust. Automating metadata ingestion guarantees timely updates, captures lineage and usage, and empowers data teams to discover reliable assets—making governance and analytics far more effective.
Does automation eliminate the need for manual curation? No. Automation gathers raw metadata; humans still curate descriptions and validate ownership.
How often should ingestion jobs run? Most teams schedule nightly jobs for structural metadata and hourly or streaming jobs for usage statistics.
What if my source has no built-in connector? Both Amundsen and DataHub allow custom extractors written in Python; you can subclass a base extractor and emit the required record type.
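As an illustration of the Amundsen side, a custom extractor only needs to implement init, extract, and get_scope. The internal service and record values below are hypothetical, and model constructor arguments may differ across databuilder versions.

```python
from typing import Any, Iterator, Optional

from pyhocon import ConfigTree

from databuilder.extractor.base_extractor import Extractor
from databuilder.models.table_metadata import ColumnMetadata, TableMetadata


class InHouseCatalogExtractor(Extractor):
    """Illustrative custom extractor for a hypothetical internal metadata API."""

    def init(self, conf: ConfigTree) -> None:
        # A real extractor would open connections / read credentials here.
        self._conf = conf
        self._records: Iterator[TableMetadata] = iter(self._fetch())

    def _fetch(self) -> list:
        # Hypothetical: call your internal service and map its payload
        # into Amundsen's TableMetadata / ColumnMetadata models.
        return [
            TableMetadata(
                database="inhouse",
                cluster="prod",
                schema="sales",
                name="orders",
                description="Orders pulled from the internal catalog API",
                columns=[ColumnMetadata("order_id", "Primary key", "bigint", 0)],
            )
        ]

    def extract(self) -> Optional[Any]:
        # Databuilder calls extract() repeatedly until it returns None.
        return next(self._records, None)

    def get_scope(self) -> str:
        return "extractor.inhouse_catalog"
```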
How do Galaxy queries fit in? Galaxy can export endorsed SQL queries via its API. Adding an ingestion pipeline that maps these queries into your data catalog surfaces trusted analytics alongside database schemas.