Enterprise Context Strategy Data Flows: One Entity, End-to-End Walkthrough

Most architecture diagrams for enterprise data are drawn at the 30,000-foot level: boxes, arrows, clouds labeled "AI/ML." They help with budgets and vendor selection. They do almost nothing for the engineer who needs to understand what actually happens to a single record as it moves from a source system into a context graph and out to an AI agent.

This walkthrough operates at ground level. It follows one business entity, a Customer, from source capture through identity resolution, constraint validation, provenance capture, publication, and serving. At each step, it names the control point, the durable artifact produced, and the failure mode that appears when teams skip it.

This article is the operational companion to the enterprise context strategy reference architecture. It is the narrative spine of the series: the blueprint defines the system, and this walkthrough shows how one entity actually moves through it. For deeper treatment of specific modules, see identity resolution and entity mastering, ontology management and semantic modeling, and provenance and lineage for AI-ready enterprise context.

Why this walkthrough exists (and what it is not)

The semantic layer concept is well established. dbt Labs describes it as an "abstraction layer that sits between your raw data sources … and your business intelligence or analytics tools," designed to eliminate conflicting metric definitions and redundant data transformations. An enterprise context strategy extends that idea beyond metrics to entities, relationships, and operational rules.

This walkthrough gives teams an operational mental model they can hold in their heads during design reviews. It is implementation-neutral: no specific graph database, no specific orchestrator, no specific vendor. If the goal is a deployment guide with Terraform modules and YAML configs, that is a different document.

What this article is: a spine that connects the minimum set of artifacts, including definitions, contracts, IDs, constraints, provenance records, and serving contracts, to the steps that produce them. Each artifact exists because skipping it creates a concrete failure mode downstream. For the structural overview that frames these steps, see the enterprise context strategy reference architecture.

The entity used in this walkthrough

A single entity is needed that is complex enough to be realistic and simple enough to follow without a wall of diagrams. A Customer works well because every enterprise has one, and it touches multiple source systems.

Minimum fields for this walkthrough:

  • legal_name (string, required)

  • email (string, format-constrained)

  • mailing_address (structured object)

  • account_id (foreign key to Account)

Minimum relationships:

  • Customer → Account (many-to-one)

  • Customer → Order (one-to-many)

These fields and relationships are enough to exercise every step in the flow: normalization, identity resolution, constraint validation, provenance capture, and serving. When "the entity" appears below, picture a Customer record with these properties and links.
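
The Customer described above can be sketched as a small data structure. This is a minimal illustration using the field names from the lists above; the sub-fields of the mailing address are assumptions added for completeness.

```python
from dataclasses import dataclass, field

@dataclass
class MailingAddress:
    # Structured address object; the exact sub-fields here are illustrative.
    street: str
    city: str
    postal_code: str
    country: str

@dataclass
class Customer:
    legal_name: str                  # required
    email: str                       # format-constrained string
    mailing_address: MailingAddress  # structured object
    account_id: str                  # Customer -> Account (many-to-one)
    order_ids: list[str] = field(default_factory=list)  # Customer -> Order (one-to-many)
```

The many-to-one side is a single foreign key; the one-to-many side is a list of Order references.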

The minimum viable artifacts created along the way

Before walking the flow, it helps to name the six durable artifacts the process produces. Think of them as the paperwork that makes context trustworthy, not just available. AI systems need reliable context about definitions, relationships, and operational rules to operate safely, and these artifacts encode exactly that.

  1. Shared definitions. A governed vocabulary of entity types, properties, and relationships. "Customer" means one thing across the organization.

  2. Mappings / semantic contract. Versioned mappings from source fields to governed properties. The contract specifies which source field becomes which governed property, along with any transformation logic.

  3. Stable IDs. A persistent, source-independent identifier assigned during identity resolution. This ID survives merges and source-system migrations.

  4. Constraints / shapes. Formal rules the entity must satisfy before publication: required fields, cardinality limits, format patterns, and valid relationship targets.

  5. Provenance / lineage. Records of which sources contributed, which activities transformed the data, and which actors, human or automated, approved it.

  6. Serving contract. The agreement between the context store and its consumers, including dashboards, APIs, and AI agents, about schema, freshness, and access controls.

Every step below produces or consumes at least one of these artifacts. If a step does not produce a durable artifact, it is worth questioning whether the step belongs in the architecture.

End-to-end flow: one entity through the enterprise context strategy architecture

Step 0: Source capture (systems of record and event streams)

The Customer record originates somewhere: a CRM, an e-commerce platform, a support ticketing system, or all three. Step 0 is ingestion, and the critical discipline here is capturing metadata at the edge, not retrofitting it later.

At minimum, each ingested record or event should carry a source system identifier, an extraction timestamp, the schema version of the source payload, and a batch or event ID for replay. These four fields cost almost nothing to capture and are expensive to reconstruct after the fact.
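
A minimal sketch of capturing those four fields at the edge, assuming a simple dict-envelope pattern (the field names are illustrative, not a standard):

```python
import uuid
from datetime import datetime, timezone

def wrap_with_edge_metadata(record: dict, source_system: str, schema_version: str) -> dict:
    """Envelope the raw payload with the four minimum edge-metadata fields."""
    return {
        "payload": record,
        "source_system": source_system,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "source_schema_version": schema_version,
        "batch_id": str(uuid.uuid4()),  # stable handle for replay
    }
```

Because the schema version rides with every record, a later rename in the source system can be tied back to the exact schema each record was emitted under.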

Artifact produced: Raw record plus edge metadata, which later feeds provenance.

Failure mode if skipped: The source schema version is omitted. When the CRM team renames cust_email to contact_email, the mapping breaks silently because the record cannot be tied to the schema version it was emitted under.

Step 1: Normalize and map to shared definitions (semantic contract)

Raw fields arrive with names like cust_nm, LEGAL_NAME, or customerFullName. Step 1 maps each raw field to a governed property in the shared definitions. The mapping is the semantic contract: a versioned, machine-readable document that says "cust_nm from CRM v3.2 maps to Customer.legal_name."

Versioning the semantic contract is not optional. When a source system changes its schema, a new contract version is needed, and the old version must remain available for reprocessing historical data. The contract should also declare relationship mappings: the CRM's acct_id foreign key maps to the governed Customer → Account relationship.
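
A semantic contract can be as simple as a versioned mapping document plus a function that applies it. The sketch below is a hypothetical shape, not a standard format; the source field names come from the CRM example above, and the contract version number is invented.

```python
# Hypothetical contract document for CRM schema v3.2.
CONTRACT_CRM_V3_2 = {
    "source_system": "CRM",
    "source_schema_version": "3.2",
    "contract_version": "1.4",
    "field_mappings": {
        "cust_nm": "legal_name",
        "cust_email": "email",
        "acct_id": "account_id",
    },
    "relationship_mappings": {
        # The CRM foreign key realizes the governed Customer -> Account relationship.
        "acct_id": ("Customer", "Account", "many-to-one"),
    },
}

def apply_contract(raw: dict, contract: dict) -> dict:
    """Translate a raw source record into governed properties."""
    return {
        governed: raw[source]
        for source, governed in contract["field_mappings"].items()
        if source in raw
    }
```

Keeping old contract versions available means historical records can always be reprocessed against the mapping that was current when they were emitted.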

This step depends on a governed model of what the entity is allowed to mean. For the operating model behind that layer, see ontology management and semantic modeling.

Artifact produced: Versioned semantic contract.

Failure mode if skipped: Mapping drift. Two teams independently map the same source field to different governed properties, or a mapping is updated in one pipeline but not another. Without a single versioned contract as the source of truth, the architecture recreates the same conflicting-definition problem it was meant to solve.

Step 2: Identity resolution and entity mastering (match, merge, survivorship)

Three source systems each have a record for the same person. Step 2 determines that they represent one Customer, assigns or retrieves a stable ID, and applies survivorship rules to decide which source wins for each property.

The stable ID is the backbone of the context graph. It must be source-independent and must persist even when two previously distinct entities are merged. Match confidence should be recorded as a numeric score or categorical level attached to each merge decision, because downstream consumers may want to filter on confidence.

Deterministic vs. probabilistic matching. Deterministic rules fire on exact field agreement. If two records share the same tax ID, they represent the same entity with near-certainty. Probabilistic matching scores partial agreement across multiple fields, such as normalized name, address tokens, and email domain, and computes a composite confidence. Most production systems use both, applying deterministic rules first and then running probabilistic models on the remaining unmatched population.

A blocking step narrows the candidate space before expensive pairwise comparison. Common blocking keys include postal code, name phonetic code, or account ID. Records that score between the auto-merge threshold and the auto-reject threshold land in a review queue for human adjudication, where analysts confirm or deny the proposed merge with supporting evidence.
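
The deterministic-first, probabilistic-fallback pattern with thresholded decisions can be sketched as follows. The weights and thresholds are illustrative assumptions, and the similarity function is a stand-in for whatever matching model a team actually deploys.

```python
from difflib import SequenceMatcher

AUTO_MERGE, AUTO_REJECT = 0.90, 0.50  # thresholds are illustrative

def block_key(rec: dict) -> str:
    # Cheap blocking key: only records sharing a postal code are compared pairwise.
    return rec.get("postal_code", "")

def match_score(a: dict, b: dict) -> float:
    # Deterministic rule first: an exact tax-ID match is near-certain.
    if a.get("tax_id") and a.get("tax_id") == b.get("tax_id"):
        return 1.0
    # Probabilistic fallback: composite of name similarity and email-domain agreement.
    name_sim = SequenceMatcher(None, a.get("name", "").lower(), b.get("name", "").lower()).ratio()
    dom_a = a.get("email", "").rsplit("@", 1)[-1]
    dom_b = b.get("email", "").rsplit("@", 1)[-1]
    same_domain = bool(dom_a) and dom_a == dom_b
    return 0.7 * name_sim + 0.3 * (1.0 if same_domain else 0.0)

def decide(score: float) -> str:
    if score >= AUTO_MERGE:
        return "merge"
    if score < AUTO_REJECT:
        return "reject"
    return "review"  # lands in the human adjudication queue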

Merge traceability means recording which source records contributed to the merged entity, which match rule fired, and what the confidence was. Without merge traceability, debugging a bad merge requires archaeology rather than a query.
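
A minimal sketch of such a merge decision record, with hypothetical field names; the point is that every item the paragraph lists is captured as a durable, queryable artifact:

```python
from datetime import datetime, timezone

def merge_decision_record(output_id, input_ids, rule_id, confidence, decision):
    """Durable traceability for one merge: enough to audit or reverse it later."""
    return {
        "output_entity": output_id,
        "input_records": list(input_ids),
        "match_rule": rule_id,
        "confidence": confidence,
        "decision": decision,  # "merge" | "reject" | "review"
        "decided_at": datetime.now(timezone.utc).isoformat(),
    }
```

An unmerge capability is then a query over these records rather than archaeology across source systems.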

For the dedicated deep dive, see identity resolution and entity mastering.

Artifact produced: Stable entity ID and merge decision record.

Failure mode if skipped: Two distinct customers share a common name and mailing address, and a loose match rule collapses them into one entity. Without the merge decision record, the error cannot be efficiently identified and reversed.

Step 3: Constraint validation (context QA gate)

The merged, normalized entity is now a candidate for publication. Before it enters the context store, it passes through a data quality gate that checks it against governed constraints.

The conceptual model here borrows cleanly from the W3C's Shapes Constraint Language (SHACL), which defines validation as checking a data graph against a shapes graph. SHACL frames this as validating RDF graphs against a set of conditions, but the pattern applies regardless of whether the implementation uses RDF. The key insight is the separation of concerns: constraints live in a governed, versioned artifact, not embedded in pipeline code.

Example constraints for Customer:

  • legal_name is required

  • email matches a format pattern

  • Customer → Account has max cardinality 1

  • account_id must reference an existing Account entity in the context store

A candidate that fails validation does not get published. It gets routed to a remediation queue with a structured validation report: which constraint failed, which property was involved, and what value triggered the failure.
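
The gate can be sketched as a small validator that reads a governed shapes document and emits the structured report described above. This is a toy stand-in for SHACL or JSON Schema, assuming a hypothetical shapes format; the shapes version string follows the example used later in this article.

```python
import re

# Hypothetical shapes document, versioned separately from pipeline code.
CUSTOMER_SHAPES = {
    "version": "customer-shapes-v1.7",
    "required": ["legal_name"],
    "patterns": {"email": r"[^@\s]+@[^@\s]+\.[^@\s]+"},
}

def validate(entity: dict, shapes: dict) -> list[dict]:
    """Return a structured report; an empty list means the candidate may publish."""
    report = []
    for prop in shapes["required"]:
        if not entity.get(prop):
            report.append({"constraint": "required", "property": prop, "value": entity.get(prop)})
    for prop, pattern in shapes["patterns"].items():
        value = entity.get(prop)
        if value is not None and not re.fullmatch(pattern, value):
            report.append({"constraint": "pattern", "property": prop, "value": value})
    return report
```

Each report entry names the constraint, the property, and the offending value, which is exactly what a remediation queue needs to route the failure.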

Artifact produced: Validation result and structured report.

Failure mode if skipped: A pipeline is configured to skip the QA gate "temporarily" during a migration. Months later, the context store contains thousands of entities that violate its own rules, and downstream consumers, including AI agents, treat those violations as trusted facts.

For the dedicated deep dive on this control point, see constraint validation for enterprise context.

Step 4: Provenance and lineage capture (fact-level and pipeline-level)

Provenance answers a deceptively simple question: where did this fact come from, and why should it be trusted? The W3C PROV model defines provenance as information about entities, activities, and people involved in producing a piece of data or thing that can be used to assess quality, reliability, or trustworthiness.

Two layers matter in an enterprise context strategy. Pipeline-level lineage records which jobs ran, in what order, reading from which sources and writing to which targets. Most data engineering teams already have some version of this through orchestrator metadata. Fact-level provenance is finer-grained and less common. It records, for each property on the published entity, which source record contributed the value, which survivorship rule selected it, and when the value was last validated.

For a Customer, fact-level provenance for legal_name might look like this: value Acme Corp sourced from CRM record crm-4481, selected by survivorship rule prefer-CRM-for-legal-name, validated at 2025-01-15T09:32Z.

Example provenance event: A merge activity at Step 2 generates a provenance record with activity type identity-merge, input entities [crm-4481, erp-7720, support-1192], output entity ctx-00382, agent identity-resolution-service-v2.4, timestamp, match rule ID, and confidence score. A validation run at Step 3 generates another with activity type constraint-validation, input entity ctx-00382, result pass, and shapes version customer-shapes-v1.7.
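
The two events above can be sketched as PROV-inspired records. The record shape and the agent name for the validation run are assumptions for illustration; the entity IDs, service name, and shapes version come from the examples in this section.

```python
from datetime import datetime, timezone

def prov_record(activity: str, used: list, generated: str, agent: str, **details) -> dict:
    """PROV-inspired record tying input entities, an activity, and an agent together."""
    return {
        "activity": activity,
        "used": used,            # input entities
        "generated": generated,  # output entity
        "agent": agent,
        "at": datetime.now(timezone.utc).isoformat(),
        **details,
    }

merge_event = prov_record(
    "identity-merge",
    ["crm-4481", "erp-7720", "support-1192"],
    "ctx-00382",
    "identity-resolution-service-v2.4",
    match_rule="tax-id-exact",  # hypothetical rule ID
    confidence=0.97,
)

validation_event = prov_record(
    "constraint-validation",
    ["ctx-00382"],
    "ctx-00382",
    "validation-service",  # hypothetical agent name
    result="pass",
    shapes_version="customer-shapes-v1.7",
)
```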

For the dedicated deep dive, see provenance and lineage for AI-ready enterprise context.

Artifact produced: Provenance records at both pipeline and fact level.

Failure mode if skipped: An AI agent returns a wrong answer sourced from a Customer entity, and no one can trace which source contributed the bad value. The remediation path becomes "re-check everything." Provenance turns that into a targeted query.

Step 5: Publish to the context store (context graph / semantic hub)

Publication is the moment the entity transitions from work in progress to trusted, servable context. Operationally, "published" means three things:

  1. The entity passed constraint validation.

  2. Provenance records are attached.

  3. The entity is written to the context store with a version identifier and a publication timestamp.

Versioning and immutability deserve a brief note. Some teams version entities as immutable snapshots: each change creates a new version, and old versions remain queryable. Others use a mutable-current-plus-audit-log model: the current entity is overwritten, but a separate log captures every previous state. Both patterns work. What matters is that a consumer can always retrieve the version of the entity that was current at a specific point in time.
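
The immutable-snapshot variant can be sketched as an append-only store with as-of reads. This is a minimal in-memory illustration, assuming ISO-8601 timestamps (which sort lexicographically); a real context store would back this with durable storage.

```python
import bisect

class VersionedStore:
    """Immutable-snapshot pattern: every publish appends; reads are as-of a timestamp."""

    def __init__(self):
        self._versions = {}  # entity_id -> list of (published_at, snapshot), kept sorted

    def publish(self, entity_id: str, published_at: str, snapshot: dict) -> None:
        versions = self._versions.setdefault(entity_id, [])
        versions.append((published_at, snapshot))
        versions.sort(key=lambda v: v[0])

    def as_of(self, entity_id: str, timestamp: str):
        """Return the version current at `timestamp`, or None if none existed yet."""
        versions = self._versions.get(entity_id, [])
        times = [t for t, _ in versions]
        idx = bisect.bisect_right(times, timestamp) - 1
        return versions[idx][1] if idx >= 0 else None
```

The mutable-current-plus-audit-log model supports the same `as_of` query; it just answers it from the log instead of from stored snapshots.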

The context store, whether it is a knowledge graph, a labeled property graph, or a relational model with semantic metadata, is the single authoritative location for this entity. If a downstream system caches a copy, the serving contract governs staleness and refresh.

Artifact produced: Published entity version in the context store, with attached provenance.

Failure mode if skipped: Publishing without a version identifier. When a downstream report shifts and someone asks what changed, the absence of versioning means the entity's state before and after the shift cannot be diffed.

Step 6: Serve to analytics and AI agents (APIs, query, retrieval)

The published Customer entity is now available to consumers: BI dashboards, operational applications, analytics queries, and AI agents. The serving contract governs what each consumer can access, how fresh the data is, and what schema they should expect.

For AI agents specifically, the semantic contract from Step 1 and the constraints from Step 3 play a second role: they constrain how the agent can use the entity. An agent building a query or calling a tool knows that Customer.email is a string with a specific format, that Customer → Account is a many-to-one relationship, and that legal_name is always present. These constraints reduce hallucination surface area by giving the agent a reliable schema to reason against, rather than forcing it to infer structure from example data.

Choosing a serving pattern. The right pattern depends on the consumer's access shape:

  • Entity API (REST or GraphQL): Best when consumers request a single entity or a small set by ID. Low latency, simple caching, and well-suited for operational applications and agent tool calls that resolve one Customer at a time.

  • Semantic query (SPARQL, Cypher, SQL with semantic metadata): Best when consumers need to traverse relationships or filter across entity types.

  • Graph traversal: Best for path-finding and influence analysis, such as identifying which Customers are two hops from a flagged Account.

  • Hybrid retrieval (structured lookup + vector search): Best for RAG pipelines where an AI agent combines a structured entity lookup with semantic similarity search over entity descriptions or associated documents.

Most production deployments combine two or more of these patterns behind a unified serving contract. The contract specifies which endpoints expose which entity types, the freshness SLA for each, and deprecation timelines for schema changes. Consumers that violate the serving contract should fail explicitly rather than receive stale or partial data.
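
The "fail explicitly" behavior can be sketched as a serving layer that checks every request against the contract before touching data. The contract document shape, property names, and version numbers below are illustrative assumptions.

```python
# Hypothetical serving contract for the Customer entity type.
SERVING_CONTRACT = {
    "contract_version": "2.1",
    "entity_type": "Customer",
    "schema": {"legal_name": "string", "email": "string", "account_id": "string"},
    "freshness_sla_seconds": 900,
    "deprecated": {"contact_email": "renamed to 'email'"},
}

def serve(entity: dict, requested: list, contract: dict) -> dict:
    """Fail explicitly on contract violations instead of silently returning nulls."""
    for prop in requested:
        if prop in contract["deprecated"]:
            raise KeyError(f"'{prop}' is deprecated: {contract['deprecated'][prop]}")
        if prop not in contract["schema"]:
            raise KeyError(f"'{prop}' is not in serving contract v{contract['contract_version']}")
    return {p: entity.get(p) for p in requested}
```

An agent requesting a renamed property gets a loud, diagnosable error at the contract boundary rather than a null it might reason over.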

Artifact produced: Serving contract.

Failure mode if skipped: An AI agent built six months ago queries a property that has since been renamed. Without a serving contract and deprecation process, the agent silently receives null values and incorporates the absence into its reasoning.

Control points and failure modes

Four failure patterns account for most context quality incidents in enterprise architectures.

Mapping drift

Two pipelines ingest from the same source but use different mapping versions. One maps cust_email to Customer.email; the other maps it to Customer.contact_email_primary. The context store now has two properties for the same fact, and neither team knows about the collision. The fix is a single versioned semantic contract per source, enforced at ingestion.

Bad merges

Identity resolution collapses two distinct entities because the match rules are too loose, or it fails to merge two records for the same entity because the rules are too strict. The fix is recorded merge decisions with confidence scores, a review queue for low-confidence merges, and an unmerge capability that leverages the merge traceability artifact.

Constraint bypass

A pipeline is configured to skip the QA gate "temporarily" during a migration. Months later, the context store contains thousands of entities that violate the current shapes. Downstream consumers, including AI agents, treat these entities as valid. The fix is enforcing the QA gate as a required step in the publication path, with a structured exception process rather than a boolean bypass flag.

Missing provenance

An executive asks why a revenue figure changed after a Customer entity was updated. Without fact-level provenance, the data team spends days comparing snapshots. With provenance, they query the entity's history, identify the survivorship rule change that swapped the authoritative source for account_id, and resolve the issue in hours.

Frequently asked questions

What is a semantic contract in a data pipeline?

A semantic contract is a versioned, machine-readable mapping that declares how each source field translates to a governed property in the shared definitions. It includes transformation logic and relationship mappings.

What is a stable ID in entity resolution, and why not just use the source system's key?

A stable ID is a persistent, source-independent identifier assigned during identity resolution. Source system keys are tied to a single system and change during migrations, merges, or platform swaps. A stable ID survives all of these events.

Does the constraint validation step require SHACL or RDF?

No. SHACL is useful because its conceptual model maps cleanly to the QA gate, but the same pattern can be implemented with JSON Schema, custom validation frameworks, or other tools that separate constraint definitions from pipeline code.

What is the difference between pipeline-level lineage and fact-level provenance?

Pipeline-level lineage tracks job execution. Fact-level provenance records which source record contributed a specific property value, which survivorship rule selected it, and when it was last validated.

What is a serving contract, and who defines it?

A serving contract is an agreement between the context store and its consumers specifying schema, freshness SLA, access controls, and deprecation timelines. Typically, the context platform team defines and publishes it.

How do AI agents use entity constraints to reduce hallucination?

When an agent has access to constraint definitions, it can validate its own query construction and output against known rules. Constraints act as guardrails the agent reasons against.

What happens when an entity fails constraint validation?

The entity is not published. It is routed to a remediation queue with a structured validation report, then resubmitted after the issue is resolved.

A quick checklist for implementing this flow in an enterprise

| Step | Artifact to produce | Key question to answer |
| --- | --- | --- |
| 0. Source capture | Raw record + edge metadata | Can this ingestion be replayed with the original schema version? |
| 1. Normalize and map | Versioned semantic contract | Is there exactly one mapping per source field per contract version? |
| 2. Identity resolution | Stable ID + merge decision record | Can a merge be reversed without re-running the entire pipeline? |
| 3. Constraint validation | Validation report | Is the QA gate enforced in all publication paths, with no silent bypass? |
| 4. Provenance capture | Fact-level and pipeline-level provenance records | Can "where did this value come from?" be answered for any property? |
| 5. Publish | Versioned entity in context store | Can a consumer retrieve the entity as it existed at any past timestamp? |
| 6. Serve | Serving contract | Does every consumer know which contract version it depends on? |

If the answer is "yes" to each key question, the flow is operating at minimum viable maturity. If any answer is "we think so, but we would have to check," that step is the highest-priority gap.
