
Every enterprise has the same dirty secret: the same customer, supplier, or product exists as dozens of slightly different records across CRM, ERP, billing, support, and procurement systems. Identity resolution is the process of determining whether different data records refer to the same real-world entity. Entity mastering is the process of deciding what the enterprise should trust once they do, producing a governed golden record that downstream systems can rely on. Getting both right is a prerequisite for any master data management (MDM) strategy and for any enterprise context layer that aims to support reliable analytics, retrieval, policy enforcement, and AI agent behavior.
Neo4j frames entity resolution as determining when different data records represent the same real-world entity. That sounds simple. In practice, it is a decision problem under uncertainty because exact identifiers are often missing, inconsistent, or overloaded across enterprise systems.
This article is the third in a series on enterprise context strategy. It builds on the enterprise context strategy reference architecture and connects to later pieces on provenance and context QA gates. The goal here is to give data and AI leaders a concrete, reusable framework for identity resolution, data deduplication, and entity mastering inside a context layer.
Why identity resolution belongs inside the context layer
Unresolved duplicates do more than waste storage. They fragment every system that consumes entity data. When a customer appears as three separate records, analytics double- or triple-counts revenue, retrieval surfaces incomplete context, and AI agents generate conflicting answers depending on which record they happen to find first.
Policy enforcement breaks in similar ways. If a supplier entity is fragmented, spend limits apply to each fragment independently, and consolidated exposure is invisible. An enterprise context layer, the shared semantic infrastructure that feeds analytics, retrieval, and agent behavior, needs resolved identities to function as a coherent world model.
Identity resolution is where raw records become trustworthy entities. If you push this process outside the context layer into downstream consumers, each team builds its own resolution logic, and the enterprise ends up with competing, incompatible views of the same real-world objects.
Identity resolution vs entity mastering
These two concepts are related but distinct. Identity resolution answers the question: do these records describe the same thing? Entity mastering answers the follow-up: given that they do, what should the canonical enterprise representation look like?
Resolution is a linking problem. It compares records, evaluates evidence, and proposes sameness. Mastering is a governance problem. It selects trusted attribute values, assigns a stable identifier, and maintains that identity over time as sources change.
You can resolve identities without mastering them (just maintain the links), and you can attempt mastering without proper resolution (just pick a "primary" system). Neither shortcut produces a reliable MDM foundation or a trustworthy enterprise context layer. You need both, in sequence, with clear boundaries between them.
The core flow: match, merge, survivorship
Reltio's documentation describes match and merge as core MDM functions that move data toward a "golden state" where it serves as a single source of truth. That framing is useful, but it can obscure the fact that match, merge, and survivorship are three distinct stages with different inputs, controls, and failure modes.
Match
Matching is the comparison stage. It takes incoming or existing records and evaluates whether they are likely to represent the same entity. Deterministic matching uses exact key agreement (same tax ID, same email, same DUNS number). Probabilistic matching uses weighted evidence across multiple attributes: name similarity, address proximity, phone overlap, behavioral signals.
IBM's MDM documentation similarly describes matching as the step that determines whether records can be collected into master data entities. In practice, most enterprises need both deterministic and probabilistic approaches because no single identifier is universally present or reliable across all source systems. This matching stage is the engine behind data deduplication at enterprise scale.
The output of matching is a set of candidate pairs or clusters with associated confidence scores, not a merged entity. Those candidates then move to merge decisions, either automatically or through human review.
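The hybrid approach can be sketched in a few lines. This is a minimal illustration, not a production matcher: the attribute weights, field names, and the choice of `tax_id` as the deterministic key are all assumptions for the example, and real systems use blocking, phonetic encodings, and trained weights rather than a simple string ratio.

```python
from difflib import SequenceMatcher

# Illustrative attribute weights for probabilistic scoring (assumed values).
WEIGHTS = {"name": 0.4, "email": 0.35, "phone": 0.25}

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Confidence in [0, 1] that two source records are the same entity.

    Deterministic rule: exact agreement on a strong identifier
    (here, tax_id) short-circuits to full confidence.
    Probabilistic rule: weighted similarity across softer attributes.
    """
    if rec_a.get("tax_id") and rec_a.get("tax_id") == rec_b.get("tax_id"):
        return 1.0
    return sum(w * similarity(rec_a.get(f, ""), rec_b.get(f, ""))
               for f, w in WEIGHTS.items())

a = {"name": "Acme Corp", "email": "ap@acme.com", "phone": "555-0100"}
b = {"name": "ACME Corporation", "email": "ap@acme.com", "phone": "555-0100"}
print(round(match_score(a, b), 2))  # weighted score, well above a typical review threshold
```

Note that the output is a score attached to a candidate pair, not a merge: the decision of what to do with that score belongs to the next stage.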
Merge
Merge is the consolidation step. When a match is accepted, the contributing source records are linked under a single entity. The merge operation does not discard source records. Instead, it creates a parent entity that references all contributing records while maintaining their provenance.
A merge can be additive (new source record joins an existing entity) or consolidating (two previously independent entities collapse into one). Both cases require the system to track which sources contributed, when the merge happened, and on what evidence basis. Losing those links makes future audits and corrections nearly impossible.
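A merge that preserves its evidence trail might look like the following sketch. The structure and field names are hypothetical, chosen to show the key property: source record IDs and per-merge evidence are appended, never overwritten or discarded.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical structures: field names are illustrative, not a vendor schema.
@dataclass
class MergeEvent:
    source_record_ids: list   # records linked by this merge
    evidence: str             # which match rule fired
    confidence: float
    merged_at: str

@dataclass
class MasteredEntity:
    entity_id: str
    source_record_ids: list = field(default_factory=list)
    merge_history: list = field(default_factory=list)

    def merge_in(self, record_ids, evidence, confidence):
        """Additive merge: link new source records without discarding anything."""
        self.merge_history.append(MergeEvent(
            source_record_ids=list(record_ids),
            evidence=evidence,
            confidence=confidence,
            merged_at=datetime.now(timezone.utc).isoformat(),
        ))
        self.source_record_ids.extend(record_ids)

entity = MasteredEntity(entity_id="cust-0001")
entity.merge_in(["crm:123", "erp:A9"], evidence="exact tax_id", confidence=1.0)
entity.merge_in(["billing:77"], evidence="probabilistic name+email", confidence=0.91)
print(entity.source_record_ids)  # all contributing sources remain linked
```

Because every merge event records its contributions and evidence, later audits (and unmerges) can reconstruct exactly how the entity came to be.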
Survivorship
After a merge, an entity may contain multiple conflicting values for the same attribute. Two source systems may supply different phone numbers, different legal names, or different addresses. Reltio's survivorship documentation describes configurable rules that determine the "operational value" for each attribute based on strategies like source priority, recency, completeness, or frequency.
Survivorship in MDM is attribute-level governance. A mastered customer might get its legal name from the ERP (highest authority for legal data), its phone number from the CRM (most recent update), and its address from a third-party enrichment provider (best completeness). Each attribute can follow a different rule.
Survivorship can also be recalculated as source data changes or business rules evolve. A context layer should support this recalculation without breaking the stable entity identity that downstream systems depend on.
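Attribute-level survivorship can be sketched as a rule-per-attribute mapping. The source priorities, rule assignments, and record shapes below are assumptions for illustration; the point is that recalculating the golden values is a pure function of the contributions plus the current rules, so it can be rerun whenever either changes.

```python
# Sketch of attribute-level survivorship. Source names, priorities, and the
# rule-per-attribute mapping are illustrative assumptions.
SOURCE_PRIORITY = {"erp": 3, "crm": 2, "enrichment": 1}

def by_source_priority(values):
    return max(values, key=lambda v: SOURCE_PRIORITY.get(v["source"], 0))

def by_recency(values):
    return max(values, key=lambda v: v["updated_at"])  # ISO dates sort lexically

def by_completeness(values):
    return max(values, key=lambda v: len(v["value"] or ""))

# Each attribute can follow a different rule.
RULES = {"legal_name": by_source_priority, "phone": by_recency,
         "address": by_completeness}

def survive(contributions):
    """contributions: attribute -> list of {source, value, updated_at} dicts."""
    return {attr: RULES[attr](vals)["value"] for attr, vals in contributions.items()}

golden = survive({
    "legal_name": [
        {"source": "crm", "value": "Acme", "updated_at": "2024-06-01"},
        {"source": "erp", "value": "Acme Corporation Ltd", "updated_at": "2023-01-15"},
    ],
    "phone": [
        {"source": "erp", "value": "555-0100", "updated_at": "2023-01-15"},
        {"source": "crm", "value": "555-0199", "updated_at": "2024-06-01"},
    ],
})
print(golden)  # legal name wins on ERP priority, phone wins on CRM recency
```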
Golden record vs graph entity
There is a useful distinction between two representations of a mastered identity. The golden record (sometimes called the golden entity) is the canonical, flat, governed record: the single set of operational attribute values that the enterprise treats as authoritative. The graph entity is the richer representation that includes the golden attributes plus all source contributions, relationships, match evidence, and provenance edges.
Reltio's data model draws a similar line. An entity in Reltio is a record node with attributes, while a broader profile includes connected entities and interaction data. In a context layer built on a knowledge graph, you typically need both representations. The golden record serves operational systems and APIs that need a single, clean answer. The graph entity serves governance workflows, audit, and any consumer that needs to understand why the golden record looks the way it does.
Collapsing everything into a flat golden record throws away evidence. Keeping only the graph without a canonical view forces every consumer to re-derive the "right" answer. A well-designed context layer maintains both and makes the relationship between them explicit.
Stable entity IDs and why they matter {#stable-entity-ids}
When a mastered entity gets a new ID every time survivorship recalculates or a new source record merges in, downstream systems break. Reports lose historical continuity, API consumers lose their references, and agents that cached an entity ID yesterday cannot find the same entity today.
Stable entity IDs persist across source churn, survivorship recalculation, and re-mastering events. The ID represents the real-world entity, not any particular snapshot of its attributes. When two entities merge, one ID survives and the other becomes an alias that redirects to the surviving ID.
If you are designing serving APIs for entity data, ID stability is a hard requirement. Any system that stores or caches entity references needs confidence that those references will resolve correctly over time.
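The alias-redirect behavior described above can be sketched as a small registry. This is an illustrative in-memory version (a real implementation would persist the alias table and expose it through the serving API), but it shows the contract: a retired ID keeps resolving to the surviving entity indefinitely.

```python
# Minimal alias table sketch: when entities merge, the retired ID becomes an
# alias that redirects to the surviving ID. Class and method names are
# illustrative, not a specific product's API.
class EntityRegistry:
    def __init__(self):
        self.aliases = {}  # retired_id -> surviving_id

    def record_merge(self, surviving_id, retired_id):
        self.aliases[retired_id] = surviving_id

    def resolve(self, entity_id):
        """Follow the alias chain so cached references resolve after merges."""
        seen = set()
        while entity_id in self.aliases:
            if entity_id in seen:          # guard against accidental cycles
                raise ValueError("alias cycle detected")
            seen.add(entity_id)
            entity_id = self.aliases[entity_id]
        return entity_id

reg = EntityRegistry()
reg.record_merge("cust-0001", "cust-0042")   # cust-0042 merged into cust-0001
reg.record_merge("cust-0001", "cust-0107")
print(reg.resolve("cust-0042"))  # prints cust-0001
```

Chained resolution matters because an entity that survived one merge may later be retired by another; following the chain keeps even very old references valid.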
Confidence scoring and review workflows
Not every match decision should be automated. Confidence scoring is the control surface that determines what happens after matching. A practical operating model uses three bands:
High confidence matches (above an upper threshold) auto-merge. The evidence is strong enough that human review adds cost without adding quality. Medium confidence matches (between thresholds) create review tasks for data stewards or candidate-link relationships that preserve the proposed match without acting on it. Low confidence matches (below a lower threshold) remain separate. The system observed a similarity but judged it insufficient.
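The three-band routing is simple enough to state directly in code. The threshold values below are placeholders; as the next paragraphs argue, choosing them is a business decision, not a technical default.

```python
# Illustrative thresholds; in practice these are tuned per entity type
# and business risk tolerance.
AUTO_MERGE_THRESHOLD = 0.95
REVIEW_THRESHOLD = 0.75

def route_match(score: float) -> str:
    """Map a match confidence score to an action band."""
    if score >= AUTO_MERGE_THRESHOLD:
        return "auto_merge"
    if score >= REVIEW_THRESHOLD:
        return "steward_review"   # queue a candidate link for human review
    return "keep_separate"

print(route_match(0.97), route_match(0.85), route_match(0.40))
```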
Setting those thresholds is a business decision. The cost of a false positive merge (incorrectly collapsing two distinct entities) is almost always higher than the cost of a false negative. A false positive can contaminate customer views, permissions, analytics, and agent retrieval simultaneously. A false negative means two records stay separate and might cause a duplicated outreach or a slightly inaccurate count, which is easier to detect and correct.
I have seen teams default to aggressive auto-merge thresholds in the name of data cleanliness and then spend months untangling incorrect merges that propagated into reporting, billing, and compliance systems. Start conservative. Lower your auto-merge threshold only after you have confidence in your post-merge validation.
Post-merge validation {#post-merge-validation}
A high match score tells you two records probably refer to the same entity. It does not tell you the resulting mastered entity is valid, complete, and safe to serve. These are different questions, and conflating them is a common source of downstream data quality failures.
Consider a supplier entity that, after merge, ends up with two conflicting tax jurisdictions, a revenue figure that exceeds the parent company's total, or a combination of industry codes that violates the ontology. The match was correct (same supplier), but the merged result is semantically broken.
Post-merge validation runs business rules and ontology constraints against the mastered entity. It checks for impossible attribute combinations, missing required fields, relationship violations, and value-range breaches. This stage connects directly to the context QA gate, which applies similar constraint checks across the full context layer. If validation fails, the entity can be flagged for steward review rather than served as-is.
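A validation pass over the supplier scenario above might look like this sketch. The specific checks and field names are examples, not a complete rule set; real deployments would drive these from the ontology and a governed rule catalog.

```python
# Sketch of post-merge validation rules. The checks mirror the supplier
# scenario above; field names are illustrative assumptions.
def validate_entity(entity: dict) -> list:
    """Return a list of violation messages; empty means the entity is servable."""
    violations = []
    # Required-field check.
    for required in ("legal_name", "tax_jurisdiction"):
        if not entity.get(required):
            violations.append(f"missing required field: {required}")
    # Impossible-combination check: one supplier, one tax jurisdiction.
    jurisdictions = entity.get("tax_jurisdictions", [])
    if len(set(jurisdictions)) > 1:
        violations.append(f"conflicting tax jurisdictions: {jurisdictions}")
    # Value-range check against a related entity.
    parent_rev = entity.get("parent_revenue")
    if parent_rev is not None and entity.get("revenue", 0) > parent_rev:
        violations.append("revenue exceeds parent company total")
    return violations

merged = {"legal_name": "Acme GmbH", "tax_jurisdiction": "DE",
          "tax_jurisdictions": ["DE", "AT"], "revenue": 120, "parent_revenue": 100}
for v in validate_entity(merged):
    print("flag for steward review:", v)
```

The key design point is that validation returns violations rather than raising errors: a failed entity is routed to steward review, not silently dropped or served broken.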
Provenance for merges and survivorship {#provenance-merges}
Every merge decision and every survivorship choice should carry provenance. That means recording which source records contributed to a mastered entity, which match rule fired and at what confidence, which survivorship rule selected each operational value, and whether any human override occurred.
Without merge provenance, governance becomes guesswork. When a downstream analyst asks "why does this customer's address show a Dallas location when our CRM says Chicago?", the answer should be traceable: the survivorship rule selected the enrichment provider's address because it scored higher on completeness, and the merge was based on a 94% probabilistic match on name, email, and phone.
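One plausible shape for an attribute-level provenance record, answering exactly that question, is sketched below. Every field name here is an assumption for illustration; the essential content is the match rationale, the per-attribute winning source and rule, and the losing values.

```python
# Illustrative attribute-level provenance record. Field names are assumed,
# not a standard or vendor schema.
provenance = {
    "entity_id": "cust-0001",
    "merge": {
        "rule": "probabilistic:name+email+phone",
        "confidence": 0.94,
        "contributing_records": ["crm:123", "enrichment:E55"],
        "human_override": None,
    },
    "attributes": {
        "address": {
            "operational_value": "100 Main St, Dallas, TX",
            "survivorship_rule": "completeness",
            "winning_source": "enrichment:E55",
            "losing_values": [{"source": "crm:123", "value": "Chicago, IL"}],
        },
    },
}

def explain(prov, attr):
    """Render a human-readable answer to 'why does this value look this way?'."""
    a = prov["attributes"][attr]
    return (f"{attr} = {a['operational_value']!r} from {a['winning_source']} "
            f"via rule '{a['survivorship_rule']}'")

print(explain(provenance, "address"))
```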
The provenance article in this series covers lineage and evidence tracking across the full context layer. For identity resolution specifically, provenance needs to capture match rationale, rule outcomes, source contributions per attribute, and the full history of overrides and re-mastering events.
Governance controls for entity mastering
Entity mastering is not a batch job you run once. It requires ongoing governance controls: approval policies for merges above certain risk thresholds, exception handling for entities that fail validation, unmerge paths for correcting false positives, and audit trails that satisfy regulatory and internal compliance requirements.
Unmerge deserves particular attention. When a false positive merge is discovered, the system needs to cleanly separate the entities, reassign downstream references, and propagate the correction. If your architecture treats merges as irreversible, you will eventually face a situation where the only fix is manual cleanup across every consuming system.
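Because source records are never discarded, an unmerge can be a clean split rather than a reconstruction. The sketch below is a simplified illustration with assumed structures; a real system would also rewind survivorship for both entities and replay the alias table.

```python
# Sketch of a reversible merge. Because source records were never discarded
# and merge events recorded their contributions, an unmerge can split the
# wrongly linked records back out. Structure names are illustrative.
def unmerge(entity, bad_record_ids, new_entity_id):
    """Split incorrectly merged source records into a new entity and emit
    correction events for downstream consumers."""
    kept = [r for r in entity["sources"] if r not in bad_record_ids]
    split = [r for r in entity["sources"] if r in bad_record_ids]
    entity["sources"] = kept
    corrected = {"entity_id": new_entity_id, "sources": split}
    # Downstream consumers are notified so caches and references update.
    notifications = [{"event": "unmerge", "from": entity["entity_id"],
                      "to": new_entity_id, "records": split}]
    return corrected, notifications

supplier = {"entity_id": "sup-0001", "sources": ["erp:A1", "crm:B2", "crm:B3"]}
corrected, events = unmerge(supplier, {"crm:B3"}, "sup-0002")
print(supplier["sources"], corrected["sources"])
```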
Steward review workflows should surface the match evidence, the proposed survivorship values, and any validation warnings in a single view. Reviewers should not have to reconstruct the merge rationale from raw logs.
How mastered entities improve downstream systems
When identity resolution and entity mastering work correctly, the benefits propagate through every system that consumes entity data. Analytics teams get accurate counts and aggregations because they operate on deduplicated, governed entities rather than fragmented source records. Retrieval systems return complete context for a query because the entity's full attribute set and relationship neighborhood are consolidated in one place.
Policy and compliance systems can evaluate rules against the authoritative golden record rather than guessing which of several partial records to trust. AI agents that ground their responses in entity data produce consistent answers because they reference a single, stable identity rather than whichever record the vector search happened to return.
The operational value of mastered entities compounds as more systems consume them. Each additional consumer that relies on the context layer instead of building its own resolution logic reduces drift and duplication across the enterprise.
Common failure modes
The most damaging failure in any MDM pipeline is the false positive merge: two distinct entities collapsed into one. The error propagates to every downstream consumer before anyone notices because each system trusts the golden record. Conservative auto-merge thresholds and rigorous post-merge validation are the primary defenses, but neither eliminates the risk entirely. Teams need fast unmerge paths and downstream notification mechanisms to contain the blast radius when a false positive slips through.
Unstable entity IDs create a subtler but equally persistent problem. When a re-mastering run produces new IDs, reports lose continuity, caches go stale, and API integrations fail silently. The failure is quiet: no error messages, just slowly diverging data across systems that once agreed. ID stability has to be a design constraint from day one, because retrofitting it after downstream systems have already stored references is painful.
Opaque scoring undermines the human governance layer. If stewards cannot see why a match was proposed or why a survivorship rule selected a particular value, they cannot make informed approval or override decisions. The system becomes a black box that people rubber-stamp or, worse, ignore.
A related pattern is the team that reports a 98% match accuracy rate while serving entities that violate business rules. Match accuracy and entity validity are different metrics. Without post-merge validation, high match scores can mask semantically broken output.
Finally, survivorship rules that never get updated degrade quietly. Source quality shifts, new systems come online, and the relative trustworthiness of different sources changes over months and years. Survivorship governance needs periodic review, ideally triggered by data quality metrics rather than calendar reminders.
A practical operating model
A layered operating model for identity resolution and entity mastering in a context layer separates concerns so that each stage can evolve independently. The five tiers below represent distinct responsibilities, controls, and failure surfaces.
Source records are ingested and standardized but never modified. They remain the system of record for what each source actually said.
Candidate matches are proposed by matching logic, scored, and stored as explicit relationships between source records. No merge has occurred yet. This is where data deduplication logic lives.
Mastered entities are created when a candidate match is accepted (automatically or by a steward). Each mastered entity carries a stable ID, survivorship-selected operational values, and links to all contributing source records.
Validation runs post-merge checks against ontology constraints and business rules, flagging any entity that fails for review before it reaches consumers.
Serving exposes mastered entities (with their stable IDs and governed attributes) to downstream consumers through APIs, graph queries, and event streams.
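The five tiers can be sketched as a thin orchestration where each stage is a swappable function. This is a toy illustration with stubbed stages, assumed names, and none of the real persistence or review machinery; its purpose is to show the separation of concerns, where matching, merge acceptance, survivorship, and validation can each change without touching the others.

```python
# Toy orchestration of the five tiers. Each function argument stands in for
# a layer that would be independently owned and versioned in practice.
def pipeline(source_records, match_fn, accept_fn, survive_fn, validate_fn):
    """Run records through match -> merge -> survivorship -> validation."""
    candidates = match_fn(source_records)             # tier 2: scored clusters only
    entities = []
    for cluster, score in candidates:
        if not accept_fn(score):                      # merge decision gate
            continue
        entity = {
            "entity_id": f"ent-{len(entities):04d}",  # stable ID, assigned once
            "sources": cluster,                       # tier 1 records stay intact
            "golden": survive_fn(cluster),            # tier 3: survivorship
        }
        entity["violations"] = validate_fn(entity)    # tier 4: post-merge checks
        entities.append(entity)
    # tier 5: only entities with no violations are served
    return [e for e in entities if not e["violations"]]

records = [{"id": "crm:1", "name": "Acme"}, {"id": "erp:9", "name": "ACME"}]
served = pipeline(
    records,
    match_fn=lambda recs: [(recs, 0.96)],       # stub: one high-confidence cluster
    accept_fn=lambda s: s >= 0.95,
    survive_fn=lambda cluster: {"name": cluster[0]["name"]},
    validate_fn=lambda e: [],                   # stub: no violations
)
print(served[0]["entity_id"])  # prints ent-0000
```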
This separation means matching logic can improve without disrupting mastered entities. Survivorship rules can change without breaking stable IDs. Validation can tighten without requiring re-matching. Each layer has its own controls, its own failure modes, and its own audit surface.
Where this fits in the broader enterprise context strategy
Identity resolution and entity mastering sit at the center of the enterprise context strategy reference architecture. They operate after ingestion and standardization, and before serving and consumption. In the data flows walkthrough, identity resolution is the stage where raw records become entity candidates and then governed, mastered identities.
The provenance article covers how merge decisions and survivorship choices feed into the broader lineage model. The context QA gate describes the constraint-validation patterns that apply after mastering to ensure that served entities meet semantic and business-rule requirements. Together, these components form the backbone of an AI-ready enterprise context layer: a system where identity is stable, evidence is preserved, and governance is continuous.
Frequently asked questions
What is the difference between identity resolution and entity mastering? Identity resolution determines whether two or more records refer to the same real-world entity. Entity mastering takes resolved records and produces a single, governed golden record with trusted attribute values and a stable identifier. Resolution is a linking problem; mastering is a governance problem.
What is a golden record? A golden record is the canonical, authoritative representation of a mastered entity. It contains the survivorship-selected values for each attribute (for example, legal name from ERP, phone from CRM, address from an enrichment provider) and serves as the single version of truth for downstream systems.
Why are stable entity IDs important? Stable entity IDs persist across merges, survivorship recalculations, and source changes. Without them, downstream reports lose historical continuity, API consumers lose their references, and cached entity lookups silently break. The ID should represent the real-world entity, not a particular snapshot of its attributes.
What is survivorship in MDM? Survivorship is the set of rules that determine which attribute value "wins" when multiple source records contribute conflicting data for the same field. Common strategies include source priority, recency, completeness, and frequency. Survivorship operates at the attribute level, so different attributes on the same entity can follow different rules.
How does identity resolution relate to data deduplication? The term data deduplication typically refers to removing duplicate records within a single system. Identity resolution extends that concept across multiple systems, using deterministic and probabilistic matching to identify records that refer to the same entity even when identifiers differ or are missing.
What happens when a merge is wrong? A false positive merge (two distinct entities incorrectly combined) requires an unmerge: cleanly separating the entities, reassigning downstream references, and propagating the correction. Architectures that treat merges as irreversible eventually face costly manual cleanup across every consuming system.
Conclusion
Identity resolution and entity mastering are the mechanism that turns fragmented source records into stable, governed identities across the enterprise. Resolution decides sameness. Mastering decides trust. Both require distinct stages (match, merge, survivorship), clear governance controls, stable identifiers, confidence-driven workflows, and post-merge validation.
The goal is to create a stable enterprise identity while preserving the evidence, relationships, and semantics that explain why that identity exists. When the context layer gets identity right, every downstream system inherits that correctness. When it gets identity wrong, the error compounds with every additional consumer. The compounding works in both directions, which is why identity resolution belongs at the foundation of the context layer, not bolted on as an afterthought.