Enterprise Context Strategy Data Flows: One Entity, End-to-End Walkthrough
Mar 5, 2026
Enterprise Context Strategy

Most architecture diagrams for enterprise data are drawn at the 30,000-foot level: boxes, arrows, clouds labeled "AI/ML." They help with budgets and vendor selection. They do almost nothing for the engineer who needs to understand what actually happens to a single record as it moves from a source system into a context graph and out to an AI agent.
This walkthrough operates at ground level. We will follow one business entity, a Customer, from source capture through identity resolution, constraint validation, provenance capture, publication, and serving. At each step, we will name the control point, the durable artifact produced, and the failure mode that bites teams who skip it.
Why this walkthrough exists (and what it is not)
The semantic layer concept is well established. dbt Labs describes it as an "abstraction layer that sits between your raw data sources … and your business intelligence or analytics tools," designed to eliminate conflicting metric definitions and redundant data transformations. An enterprise context strategy extends that idea beyond metrics to entities, relationships, and operational rules.
The walkthrough gives you an operational mental model you can hold in your head during design reviews. It is implementation-neutral: no specific graph database, no specific orchestrator, no specific vendor. If you are looking for a deployment guide with Terraform modules and YAML configs, that is a different document.
What it is: a spine that connects the minimum set of artifacts (definitions, contracts, IDs, constraints, provenance records, serving contracts) to the steps that produce them. Each artifact exists because skipping it creates a concrete failure mode downstream. For the structural overview that frames these steps, see the enterprise context strategy reference architecture.
The entity used in this walkthrough
We need a single entity complex enough to be realistic and simple enough to follow without a wall of diagrams. A Customer works well because every enterprise has one, and it touches multiple source systems.
Minimum fields for this walkthrough:
legal_name (string, required)
email (string, format-constrained)
mailing_address (structured object)
account_id (foreign key to Account)
Minimum relationships:
Customer → Account (many-to-one)
Customer → Order (one-to-many)
These fields and relationships are enough to exercise every step in the flow: normalization, identity resolution, constraint validation, and serving. When you see "the entity" below, picture a Customer record with these properties and links.
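The entity above can be sketched as a minimal data model. This is an illustrative Python shape, not a prescribed schema; field names follow the walkthrough, and the representation of addresses and orders is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class Customer:
    """Minimal Customer shape used throughout the walkthrough (illustrative)."""
    legal_name: str        # required
    email: str             # format-constrained (checked at the QA gate, Step 3)
    mailing_address: dict  # structured object (street, city, ...) -- representation assumed
    account_id: str        # foreign key: Customer -> Account (many-to-one)
    order_ids: list = field(default_factory=list)  # Customer -> Order (one-to-many)
```

Anything richer (typed address objects, order references as entities) is deliberately out of scope here; the point is the minimum surface the flow has to carry.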
The minimum viable artifacts (the "paperwork") created along the way
Before we walk the flow, here are the six durable artifacts the process produces. Think of them as the paperwork that makes context trustworthy, not just available. AI systems need reliable context about "definitions, relationships, and operational rules" to operate safely, and these artifacts encode exactly that.
Shared definitions. A governed vocabulary of entity types, properties, and relationships. "Customer" means one thing, organization-wide.
Mappings / semantic contract. Versioned mappings from source fields to governed properties. The contract specifies which source field becomes which governed property, along with any transformation logic.
Stable IDs. A persistent, source-independent identifier assigned during identity resolution. This ID survives merges and source-system migrations.
Constraints / shapes. Formal rules the entity must satisfy before publication: required fields, cardinality limits, format patterns, valid relationship targets.
Provenance / lineage. Records of which sources contributed, which activities transformed the data, and which actors (human or automated) approved it.
Serving contract. The agreement between the context store and its consumers (dashboards, APIs, AI agents) about schema, freshness, and access controls.
Every step below produces or consumes at least one of these artifacts. If a step does not produce a durable artifact, question whether the step belongs in the architecture.
End-to-end flow: one entity through the enterprise context strategy architecture
Step 0: Source capture (systems of record and event streams)
The Customer record originates somewhere: a CRM, an e-commerce platform, a support ticketing system, or all three. Step 0 is ingestion, and the critical discipline here is capturing metadata at the edge, not retrofitting it later.
At minimum, each ingested record or event should carry: source system identifier, extraction timestamp, schema version of the source payload, and batch or event ID for replay. These four fields cost almost nothing to capture and are expensive to reconstruct after the fact.
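The four edge-metadata fields can be captured as a small envelope around the raw payload. A minimal sketch, assuming Python and illustrative field values; the envelope type and names are not from the source.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class IngestEnvelope:
    """Edge metadata captured at ingestion time (Step 0), wrapping the raw record."""
    source_system: str          # e.g. "crm"
    extracted_at: str           # ISO-8601 extraction timestamp
    source_schema_version: str  # schema version of the source payload
    batch_id: str               # batch or event ID for replay
    payload: dict               # the raw record, untouched

# Hypothetical raw CRM record and envelope:
raw = {"cust_nm": "Acme Corp", "cust_email": "ops@acme.example"}
rec = IngestEnvelope(
    source_system="crm",
    extracted_at=datetime.now(timezone.utc).isoformat(),
    source_schema_version="crm-v3.2",
    batch_id="batch-0001",
    payload=raw,
)
```

Freezing the envelope keeps edge metadata immutable once captured, which is what makes it usable as provenance input later.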
Artifact produced: Raw record plus edge metadata (feeds into provenance later). Failure mode if skipped: Omitting the source schema version. When the CRM team renames cust_email to contact_email in a release, your mapping breaks silently if you cannot tie the record to the schema version it was emitted under.
Step 1: Normalize and map to shared definitions (semantic contract)
Raw fields arrive with names like cust_nm, LEGAL_NAME, or customerFullName. Step 1 maps each raw field to a governed property in the shared definitions. The mapping is the semantic contract: a versioned, machine-readable document that says "cust_nm from CRM v3.2 maps to Customer.legal_name."
Versioning the semantic contract is not optional. When a source system changes its schema, you need a new contract version that reflects the new mapping, and you need the old version to remain available for reprocessing historical data. The contract should also declare the relationship mappings: the CRM's acct_id foreign key maps to the governed Customer → Account relationship.
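A semantic contract of this kind can be expressed as plain, versioned data plus a tiny application function. This is a sketch under the assumptions of the walkthrough (CRM v3.2 field names); the contract structure itself is illustrative, not a standard format.

```python
# A versioned semantic contract as data: one mapping per source field.
CONTRACT = {
    "contract_version": "crm-v3.2-to-customer-v1",  # illustrative version string
    "source_system": "crm",
    "source_schema_version": "crm-v3.2",
    "mappings": {
        "cust_nm": "legal_name",
        "cust_email": "email",
        "acct_id": "account_id",  # realizes the governed Customer -> Account relationship
    },
}

def apply_contract(raw: dict, contract: dict) -> dict:
    """Translate raw source fields into governed properties; unmapped fields are dropped."""
    return {governed: raw[src]
            for src, governed in contract["mappings"].items()
            if src in raw}

normalized = apply_contract({"cust_nm": "Acme Corp", "cust_email": "ops@acme.example"}, CONTRACT)
# normalized == {"legal_name": "Acme Corp", "email": "ops@acme.example"}
```

Because the contract is data, publishing a new version means publishing a new document while the old one stays retrievable for reprocessing, exactly the property versioning is meant to guarantee.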
Artifact produced: Versioned semantic contract (mapping document). Failure mode if skipped: Mapping drift. Two teams independently map the same source field to different governed properties, or a mapping is updated in one pipeline but not another. Without a single, versioned contract as the source of truth, you get the exact "conflicting metric definitions" problem the semantic layer was designed to prevent.
Step 2: Identity resolution and entity mastering (match, merge, survivorship)
Three source systems each have a record for the same person. Step 2 determines that they represent one Customer, assigns (or retrieves) a stable ID, and applies survivorship rules to decide which source wins for each property.
The stable ID is the backbone of the context graph. It must be source-independent (not a CRM ID, not an email hash) and must persist even when two previously distinct entities are merged. Match confidence should be recorded as a numeric score or categorical level (exact, high, probable, low) attached to each merge decision, because downstream consumers may want to filter on confidence.
Deterministic vs. probabilistic matching. Deterministic rules fire on exact field agreement: if two records share the same tax ID, they represent the same entity with near-certainty. Probabilistic matching scores partial agreement across multiple fields (normalized name, address tokens, email domain) and computes a composite confidence. Most production systems use both, applying deterministic rules first and then running probabilistic models on the remaining unmatched population. A blocking step narrows the candidate space before expensive pairwise comparison; common blocking keys include postal code, name phonetic code, or account ID. Records that score between the auto-merge threshold and the auto-reject threshold land in a review queue for human adjudication, where analysts confirm or deny the proposed merge with supporting evidence.
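The deterministic-first, then-probabilistic decision flow described above can be sketched compactly. Weights, thresholds, and field choices here are illustrative assumptions; real systems tune them per domain.

```python
def deterministic_match(a: dict, b: dict) -> bool:
    """Exact-agreement rule: a shared tax ID implies the same entity with near-certainty."""
    return bool(a.get("tax_id")) and a.get("tax_id") == b.get("tax_id")

def probabilistic_score(a: dict, b: dict) -> float:
    """Composite confidence from partial field agreement. Weights are illustrative."""
    weights = {"legal_name": 0.5, "email": 0.3, "postal_code": 0.2}
    return sum(w for f, w in weights.items() if a.get(f) and a.get(f) == b.get(f))

AUTO_MERGE, AUTO_REJECT = 0.8, 0.3  # assumed thresholds; tune per domain

def decide(a: dict, b: dict) -> str:
    if deterministic_match(a, b):
        return "merge"               # deterministic rules fire first
    score = probabilistic_score(a, b)
    if score >= AUTO_MERGE:
        return "merge"
    if score <= AUTO_REJECT:
        return "reject"
    return "review"                  # lands in the human adjudication queue
```

A real pipeline would run a blocking step (e.g. on postal code or phonetic name code) before scoring pairs, so `decide` is only ever called on a narrowed candidate set.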
Merge traceability means recording which source records contributed to the merged entity, which match rule fired, and what the confidence was. Without merge traceability, debugging a bad merge requires archaeology rather than a query. For a deeper treatment of match algorithms, survivorship rule design, and unmerge workflows, see match-merge-survivorship (entity mastering).
Artifact produced: Stable entity ID, merge decision record (source IDs, match rule, confidence). Failure mode if skipped: Two distinct customers share a common name and mailing address, and a loose match rule collapses them into one entity. Without the merge decision record, you cannot efficiently identify and reverse the error.
Step 3: Constraint validation (context QA gate)
The merged, normalized entity is now a candidate for publication. Before it enters the context store, it passes through a data quality gate that checks it against governed constraints.
The conceptual model here borrows cleanly from the W3C's Shapes Constraint Language (SHACL), which defines validation as checking a data graph (your candidate entity) against a shapes graph (your governed constraints). SHACL frames this as "validating RDF graphs against a set of conditions", but the pattern applies regardless of whether your implementation uses RDF. The separation of concerns is the key insight: constraints live in a governed, versioned artifact (the shapes graph), not embedded in pipeline code.
Example constraints for Customer:
legal_name is required (min cardinality 1).
email matches a format pattern (RFC 5322, or a pragmatic regex subset).
Customer → Account relationship has max cardinality 1 (a Customer belongs to exactly one Account).
account_id must reference an existing Account entity in the context store.
A candidate that fails validation does not get published. It gets routed to a remediation queue with a structured validation report: which constraint failed, which property was involved, and what value triggered the failure. For details on constraint authoring, shapes versioning, and remediation queue design, see validate entities against context constraints.
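The separation SHACL insists on (shapes as a governed artifact, validation as a generic check) can be sketched without any RDF machinery. The shape format, shapes version, and the pragmatic email regex below are illustrative assumptions, not SHACL syntax.

```python
import re

# Constraints live as governed, versioned data (the "shapes"), not in pipeline code.
CUSTOMER_SHAPE = {
    "shapes_version": "customer-shapes-v1.7",
    "required": ["legal_name"],
    "patterns": {"email": r"^[^@\s]+@[^@\s]+\.[^@\s]+$"},  # pragmatic subset, not full RFC 5322
}

def validate(entity: dict, shape: dict) -> dict:
    """Check a candidate entity against a shape; return a structured validation report."""
    violations = []
    for prop in shape["required"]:
        if not entity.get(prop):
            violations.append({"constraint": "required", "property": prop,
                               "value": entity.get(prop)})
    for prop, pattern in shape["patterns"].items():
        value = entity.get(prop)
        if value is not None and not re.match(pattern, value):
            violations.append({"constraint": "pattern", "property": prop, "value": value})
    return {"conforms": not violations,
            "shapes_version": shape["shapes_version"],
            "violations": violations}
```

The report carries exactly what the remediation queue needs: which constraint failed, on which property, with which value, under which shapes version.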
Artifact produced: Validation result (pass/fail plus structured report); the constraints themselves are a versioned artifact maintained by data governance. Failure mode if skipped: A pipeline is configured to skip the QA gate "temporarily" during a migration. Months later, the context store contains thousands of entities that violate its own rules, and downstream consumers (including AI agents) treat those violations as trusted facts.
Step 4: Provenance and lineage capture (fact-level and pipeline-level)
Provenance answers a deceptively simple question: where did this fact come from, and why should I trust it? The W3C PROV model defines provenance as "information about entities, activities, and people involved in producing a piece of data or thing" that "can be used to form assessments about its quality, reliability or trustworthiness."
Two layers of provenance matter in an enterprise context strategy. Pipeline-level lineage records which jobs ran, in what order, reading from which sources and writing to which targets. Most data engineering teams already have some version of pipeline lineage through orchestrator metadata. Fact-level provenance is finer-grained and less common. It records, for each property on the published entity, which source record contributed the value, which survivorship rule selected it, and when the value was last validated.
For our Customer, fact-level provenance for legal_name might look like: "value Acme Corp sourced from CRM record crm-4481, selected by survivorship rule prefer-CRM-for-legal-name, validated at 2025-01-15T09:32Z."
Example provenance event: A merge activity at Step 2 generates a provenance record: activity type identity-merge, input entities [crm-4481, erp-7720, support-1192], output entity ctx-00382, agent identity-resolution-service-v2.4, timestamp, match rule ID, confidence score. A validation run at Step 3 generates another: activity type constraint-validation, input entity ctx-00382, result pass, shapes version customer-shapes-v1.7.
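A PROV-style record for such events can be sketched as follows. Entity IDs and the agent name follow the walkthrough's example; the record structure, rule ID, and confidence value are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ProvenanceRecord:
    """PROV-style record: which activity, run by which agent, turned inputs into an output."""
    activity: str   # e.g. "identity-merge", "constraint-validation"
    inputs: tuple   # entity IDs consumed
    output: str     # entity ID produced
    agent: str      # human or automated actor responsible
    at: str         # ISO-8601 timestamp
    detail: dict = field(default_factory=dict)  # rule ID, confidence, shapes version, ...

merge_prov = ProvenanceRecord(
    activity="identity-merge",
    inputs=("crm-4481", "erp-7720", "support-1192"),
    output="ctx-00382",
    agent="identity-resolution-service-v2.4",
    at="2025-01-15T09:31:02Z",                       # hypothetical timestamp
    detail={"match_rule": "rule-07", "confidence": 0.94},  # hypothetical values
)
```

The same shape serves both layers: pipeline-level records point at jobs and tables, fact-level records point at individual property values.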
Artifact produced: Provenance records (entities, activities, agents) at both pipeline and fact level. Failure mode if skipped: When an AI agent returns a wrong answer sourced from a Customer entity, and no one can trace which source contributed the bad value, the remediation path is "re-check everything." Provenance turns a multi-day investigation into a targeted query. For storage patterns, query approaches, and regulatory use cases, see fact provenance and pipeline lineage.
Step 5: Publish to the context store (context graph / semantic hub)
Publication is the moment the entity transitions from "work in progress" to "trusted, servable context." What "published" means operationally:
The entity passed constraint validation (Step 3).
Provenance records are attached (Step 4).
The entity is written to the context store with a version identifier and a publication timestamp.
Versioning and immutability deserve a brief note. Some teams version entities as immutable snapshots: each change creates a new version, and old versions remain queryable. Others use a mutable-current-plus-audit-log model: the "current" entity is overwritten, but a separate log captures every previous state. Both patterns work. The choice depends on query patterns and regulatory requirements. What matters is that a consumer can always retrieve the version of the entity that was current at a specific point in time.
The context store (whether it is a knowledge graph, a labeled property graph, or a relational model with semantic metadata) is the single authoritative location for this entity. If a downstream system caches a copy, the serving contract (Step 6) governs staleness and refresh.
Artifact produced: Published entity version in the context store, with attached provenance. Failure mode if skipped: Publishing without a version identifier. When a downstream report shifts and someone asks "what changed?", the absence of versioning means you cannot diff the entity's state before and after the shift.
Step 6: Serve to analytics and AI agents (APIs, query, retrieval)
The published Customer entity is now available to consumers: BI dashboards, operational applications, analytics queries, and AI agents. The serving contract governs what each consumer can access, how fresh the data is, and what schema they should expect.
For AI agents specifically, the semantic contract from Step 1 and the constraints from Step 3 play a second role: they constrain how the agent can use the entity. An agent building a query or calling a tool knows that Customer.email is a string with a specific format, that Customer → Account is a many-to-one relationship, and that legal_name is always present. These constraints reduce hallucination surface area by giving the agent a reliable schema to reason against, rather than requiring it to infer structure from example data.
Choosing a serving pattern. The right pattern depends on the consumer's access shape:
Entity API (REST or GraphQL): Best when consumers request a single entity or a small set by ID. Low latency, simple caching, well-suited for operational applications and agent tool calls that resolve one Customer at a time.
Semantic query (SPARQL, Cypher, SQL with semantic metadata): Best when consumers need to traverse relationships or filter across entity types, for example "all Customers linked to Accounts in a specific region." Supports ad hoc exploration and BI workloads.
Graph traversal: Best for path-finding and influence analysis, such as "which Customers are two hops from a flagged Account." Requires a graph-native store or index.
Hybrid retrieval (structured lookup + vector search): Best for RAG pipelines where an AI agent combines a structured entity lookup with a semantic similarity search over entity descriptions or associated documents. Use this when the agent's query mixes precise filters ("Account ID = X") with fuzzy intent ("recent complaints about billing").
Most production deployments combine two or more of these patterns behind a unified serving contract. The contract specifies which endpoints expose which entity types, the freshness SLA for each, and deprecation timelines for schema changes. Consumers that violate the serving contract (querying deprecated properties, ignoring access controls) should fail explicitly rather than receive stale or partial data. For API design specifics, semantic contract enforcement during agent tool use, and freshness management, see APIs and retrieval for agents.
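A serving contract expressed as data, with explicit failure on violations, can be sketched like this. The contract fields, property names, and SLA value are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

SERVING_CONTRACT = {
    "entity_type": "Customer",
    "schema_version": "customer-api-v2",        # illustrative version string
    "exposed_properties": {"legal_name", "email", "account_id"},
    "deprecated_properties": {"contact_email_primary"},
    "freshness_sla_seconds": 3600,
}

def check_request(properties: set, published_at: datetime, contract: dict) -> list:
    """Fail explicitly on contract violations rather than serving stale or partial data."""
    errors = []
    unknown = properties - contract["exposed_properties"]
    deprecated = properties & contract["deprecated_properties"]
    if unknown - deprecated:
        errors.append(f"unknown properties: {sorted(unknown - deprecated)}")
    if deprecated:
        errors.append(f"deprecated properties: {sorted(deprecated)}")
    age = (datetime.now(timezone.utc) - published_at).total_seconds()
    if age > contract["freshness_sla_seconds"]:
        errors.append("freshness SLA exceeded")
    return errors
```

An agent tool call that requests a deprecated property gets a structured error, not a silent null, which is precisely the behavior the failure mode below warns about.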
Artifact produced: Serving contract (schema, freshness SLA, access controls). Failure mode if skipped: An AI agent built six months ago queries a property that has since been renamed. Without a serving contract and a deprecation process, the agent silently receives null values and incorporates the absence into its reasoning.
Control points and failure modes (what breaks in practice)
Four failure patterns account for most context quality incidents in enterprise architectures:
Mapping drift. Two pipelines ingest from the same source but use different mapping versions. One maps cust_email to Customer.email; the other maps it to Customer.contact_email_primary. The context store now has two properties for the same fact, and neither team knows about the collision. The fix is a single versioned semantic contract per source, enforced at ingestion.
Bad merges. Identity resolution collapses two distinct entities because the match rules are too loose, or it fails to merge two records for the same entity because the rules are too strict. The fix is recorded merge decisions with confidence scores, a review queue for low-confidence merges, and an unmerge capability that leverages the merge traceability artifact.
Constraint bypass. A pipeline is configured to skip the QA gate "temporarily" during a migration. Months later, the context store contains thousands of entities that violate the current shapes. Downstream consumers, including AI agents, treat these entities as valid. The fix is enforcing the QA gate as a required step in the publication path, with a structured exception process (not a boolean flag) for genuine edge cases.
Missing provenance. An executive asks why a revenue figure changed after a Customer entity was updated. Without fact-level provenance, the data team spends days comparing snapshots. With provenance, they query the entity's history, identify the survivorship rule change that swapped the authoritative source for account_id, and resolve the issue in hours.
Frequently asked questions
What is a semantic contract in a data pipeline? A semantic contract is a versioned, machine-readable mapping that declares how each source field translates to a governed property in the shared definitions. It includes transformation logic and relationship mappings. When the source schema changes, a new contract version is published while the old version remains available for reprocessing historical data.
What is a stable ID in entity resolution, and why not just use the source system's key? A stable ID is a persistent, source-independent identifier assigned during identity resolution. Source system keys are tied to a single system and change during migrations, merges, or platform swaps. A stable ID survives all of these events, giving every downstream consumer a single, durable reference to the entity.
Does the constraint validation step require SHACL or RDF? No. The walkthrough references the W3C SHACL specification because its conceptual model (validating a data graph against a shapes graph) maps cleanly to what the QA gate does. You can implement the same pattern with JSON Schema, custom validation frameworks, or any tool that separates constraint definitions from pipeline code. The principle is the separation of the shapes artifact from the data artifact, not the specific technology.
What is the difference between pipeline-level lineage and fact-level provenance? Pipeline-level lineage tracks job execution: which orchestrated tasks ran, what they read, and what they wrote. Fact-level provenance is finer-grained, recording which source record contributed a specific property value, which survivorship rule selected it, and when it was last validated. Pipeline lineage tells you "this job ran at 3:00 AM and wrote to table X." Fact-level provenance tells you "legal_name = Acme Corp came from CRM record crm-4481 via survivorship rule prefer-CRM-for-legal-name."
What is a serving contract, and who defines it? A serving contract is an agreement between the context store and its consumers specifying the schema exposed, the freshness SLA (how stale data can be before refresh), access controls, and deprecation timelines for schema changes. Typically, the context platform team defines and publishes the contract, and consuming teams (BI, application, AI agent developers) depend on a specific contract version.
How do AI agents use entity constraints to reduce hallucination? When an agent has access to the constraint definitions (required fields, cardinality limits, valid relationship targets), it can validate its own query construction and output against known rules. For example, an agent knows that Customer → Account is many-to-one, so it will not generate a query assuming multiple Accounts per Customer. Constraints act as guardrails the agent reasons against, reducing the chance of structurally invalid outputs.
What happens when an entity fails constraint validation? The entity is not published to the context store. Instead, it is routed to a remediation queue along with a structured validation report listing which constraint failed, which property was involved, and what value triggered the failure. A data steward or automated remediation process resolves the issue, and the entity is resubmitted for validation. Bypassing this gate, even temporarily, is the root cause of most context quality debt.
A quick checklist for implementing this flow in an enterprise
| Step | Artifact to produce | Key question to answer |
|---|---|---|
| 0. Source capture | Raw record + edge metadata | Can you replay this ingestion with the original schema version? |
| 1. Normalize and map | Versioned semantic contract | Is there exactly one mapping per source field per contract version? |
| 2. Identity resolution | Stable ID + merge decision record | Can you reverse a merge without re-running the entire pipeline? |
| 3. Constraint validation | Validation report (pass/fail + details) | Is the QA gate enforced in all publication paths, with no silent bypass? |
| 4. Provenance capture | Fact-level and pipeline-level provenance records | Can you answer "where did this value come from?" for any property? |
| 5. Publish | Versioned entity in context store | Can a consumer retrieve the entity as it existed at any past timestamp? |
| 6. Serve | Serving contract (schema, freshness, access) | Does every consumer know which contract version it depends on? |
If you can answer "yes" to each key question, the flow is operating at minimum viable maturity. If any answer is "we think so, but we would have to check," that step is your highest-priority gap.
Where this fits in the broader series
This walkthrough is the narrative spine of the Enterprise Context Strategy series. It connects to the other articles as follows:
The hub blueprint provides the reference architecture that this walkthrough animates. If you have not read it, start with the enterprise context strategy reference architecture for the structural overview before returning here for the operational detail.
Upcoming deep dives will expand individual steps:
Match-merge-survivorship (entity mastering) covers match algorithms, survivorship rule design, and unmerge workflows (expanding Step 2).
Validate entities against context constraints details constraint authoring, shapes versioning, and remediation queue design (expanding Step 3).
Fact provenance and pipeline lineage addresses fact-level provenance storage, query patterns, and regulatory use cases (expanding Step 4).
APIs and retrieval for agents covers API design, semantic contract enforcement for agent tool use, and freshness management (expanding Step 6).
Each deep dive will reference the entity and artifacts defined in this end-to-end enterprise context data flow, so the walkthrough serves as a shared vocabulary for the rest of the series.
© 2025 Intergalactic Data Labs, Inc.