What is Entity Resolution?

Jan 20, 2026

Glossary

A mid-sized healthcare network discovered they had five different patient records for "Robert Johnson"—one under Bob, another as R.J., a third with a hyphenated last name from a previous marriage. Each record sat in a different system, fragmenting his medical history across disconnected databases. When Robert arrived unconscious at the emergency room, physicians couldn't access his complete allergy information or medication list. This isn't a data problem. It's a patient safety crisis.

Entity resolution solves exactly this challenge: determining when different records represent the same real-world entity despite inconsistencies in data entry or formatting. Organizations lose an average of $12.9 million annually from poor data quality and duplicate records. Meanwhile, 82% of enterprises report that data silos disrupt their critical workflows, leaving 68% of enterprise data completely unanalyzed.

Evaluating entity resolution tools?

Entity resolution is often bundled into broader master data management, data integration, and knowledge graph initiatives. For a practical shortlist of platforms and what to look for in an evaluation, see the Entity Resolution Tools and Technologies section below.

What is Entity Resolution?

Core Definition

Entity resolution identifies, links, and merges records that correspond to the same real-world entities across different data sources. In practice, this means recognizing that "Robert Johnson," "Bob Johnson," and "R.J. Johnson" all refer to the same person, even when their addresses differ slightly or one record uses a former surname.

The process applies to any entity type: people, organizations, locations, accounts, products. An entity in this context refers to a real-world object or concept represented in data—most commonly customers, companies, addresses, devices, or inventory items.

Entity resolution serves as the critical foundation for master data management, consolidating and maintaining accurate, unified views of core business entities. Without it, fragmented records prevent unified analytics, consistent reporting, and coherent customer experiences.

Alternative Terminology

The field goes by several names depending on context and industry. Record linkage remains the most common academic term, while enterprise practitioners prefer entity resolution. You'll also encounter data matching, merge-purge, fuzzy matching, deduplication, and identity resolution—all describing variations of the same fundamental challenge.

Healthcare organizations talk about patient matching. Marketing teams discuss identity resolution. Financial services firms focus on KYC (know your customer) matching. The terminology shifts, but the underlying problem stays consistent: determining when separate records describe the same thing.

Relationship to Master Data Management

Entity resolution and MDM share an interdependent relationship. Most MDM vendors either built their solutions around entity resolution capabilities or acquired companies whose primary product was an entity resolution application.

Organizations increasingly begin their MDM journey with entity resolution to ensure cleansed and harmonized data before launching full MDM programs. This approach delivers immediate value through improved data quality while establishing the clean foundation necessary for successful MDM implementation.

The Business Problem: Why Entity Resolution Matters

Data Silos and Fragmentation

The average enterprise runs on nearly 900 applications, with only one-third integrated. This fragmentation forces employees to lose 30% of their weekly work hours chasing data across disconnected systems.

When departments operate in isolation, no single team accesses complete data. AI systems trained on limited datasets make decisions based on partial truths, creating downstream risks that compound over time.

Duplicate Records Impact

Large healthcare facilities typically spend more than $1 million annually fixing duplicate data issues, including staff time, technology costs, and downstream operational impacts. The problem starts early: 92% of duplicate records are created during initial registration when overworked staff create new records rather than searching for existing ones.

These duplicates create confusion in customer databases, unnecessarily inflate storage volumes, increase processing times, and introduce critical errors in reporting and analytics. Without resolution, organizations treat variations of the same customer as separate entities, fragmenting insights and creating inconsistencies that ripple through every downstream process.

Customer 360 Challenges

88% of executives consider a single customer view critical for business success. Yet only 14% of organizations have achieved a 360-degree customer view.

The gap isn't primarily technological. 92.2% of organizations report that their biggest impediment to becoming data-driven is business processes and culture, not technology. Entity resolution addresses the technical foundation, but successful implementation requires organizational commitment to data quality and governance.

Core Entity Resolution Methods and Techniques

Deterministic (Rules-Based) Matching

Deterministic record linkage links two records when their identifiers agree according to fixed rules: either every identifier matches exactly, or the number of agreeing identifiers meets a set threshold. A rule might require names to match exactly, phone numbers to be identical, and addresses to reach at least 90% similarity.

This approach yields the highest confidence matches because it makes no assumptions or inferences. The caveat: deterministic matching breaks down when data isn't clean or personally identifiable information isn't available.
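A minimal sketch of such a rule in Python follows. The field names (name, phone, dob), the normalization steps, and the sample records are illustrative assumptions, not taken from any particular product.

```python
import re


def normalize_phone(phone: str) -> str:
    """Keep digits only so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", phone)


def normalize_name(name: str) -> str:
    """Lowercase and collapse whitespace; no nickname or typo handling."""
    return " ".join(name.lower().split())


def deterministic_match(a: dict, b: dict) -> bool:
    """Hypothetical rule: normalized name, phone, and date of birth must all agree."""
    return (
        normalize_name(a["name"]) == normalize_name(b["name"])
        and normalize_phone(a["phone"]) == normalize_phone(b["phone"])
        and a["dob"] == b["dob"]
    )


r1 = {"name": "Robert Johnson", "phone": "(555) 010-1234", "dob": "1970-03-14"}
r2 = {"name": "robert  johnson", "phone": "555.010.1234", "dob": "1970-03-14"}
r3 = {"name": "Bob Johnson", "phone": "555-010-1234", "dob": "1970-03-14"}

print(deterministic_match(r1, r2))  # True: differences are formatting only
print(deterministic_match(r1, r3))  # False: "Bob" is not an exact match for "Robert"
```

The second comparison shows the caveat in action: a nickname is enough to defeat an exact-match rule, even though phone and date of birth agree.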

Probabilistic Matching

Probabilistic approaches compute weights for each identifier based on its estimated ability to correctly identify a match or non-match. The Fellegi-Sunter framework from 1969 remains the most well-known probabilistic classification method, later proven mathematically equivalent to Naive Bayes classification under independence assumptions.

Record pairs with probabilities above a certain threshold are considered matches. This method handles messier data better than deterministic approaches, finding matches at greater scale while accepting some accuracy tradeoff.
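As a rough illustration, the Fellegi-Sunter style of scoring can be sketched as below. Each field contributes log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees, where m is the probability of agreement among true matches and u the probability of agreement among non-matches. The m/u values, field names, and thresholds here are made-up assumptions; real systems estimate them from data.

```python
import math

# Hypothetical parameters: field -> (m, u)
# m = P(field agrees | same entity), u = P(field agrees | different entities)
FIELD_PARAMS = {
    "last_name": (0.95, 0.01),
    "dob": (0.98, 0.003),
    "zip": (0.90, 0.05),
}


def match_weight(a: dict, b: dict, params=FIELD_PARAMS) -> float:
    """Sum per-field agreement or disagreement weights (log base 2)."""
    total = 0.0
    for field, (m, u) in params.items():
        if a[field] == b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight: float, upper: float = 8.0, lower: float = 0.0) -> str:
    """Two illustrative thresholds leave a middle band for manual review."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


a = {"last_name": "johnson", "dob": "1970-03-14", "zip": "30301"}
b = {"last_name": "johnson", "dob": "1970-03-14", "zip": "30305"}  # moved house
c = {"last_name": "johnson", "dob": "1982-11-02", "zip": "30301"}  # different DOB
d = {"last_name": "lopez", "dob": "1982-11-02", "zip": "30301"}    # only ZIP agrees

for other in (b, c, d):
    w = match_weight(a, other)
    print(round(w, 2), classify(w))
```

Because a date of birth rarely agrees by chance, its agreement adds far more weight than a ZIP code agreement, which is the behavior the weighting scheme is designed to capture.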

Fuzzy Matching and String Similarity

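Fuzzy matching finds strings that match approximately rather than exactly. Edit distance metrics such as Levenshtein and Damerau-Levenshtein count the minimum number of single-character operations required to transform one string into another. Jaro-Winkler takes a different route: it scores similarity from shared characters and transpositions, with extra weight for a common prefix.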

These algorithms prove necessary for linking records with spelling variations, typos, and formatting differences. Without fuzzy matching, "Jonson" and "Johnson" remain forever separated despite obviously referring to the same surname.
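For a concrete sense of how edit distance works, here is a small Levenshtein implementation using the standard dynamic-programming approach. The similarity helper, which scales the distance by the longer string's length, is one common convention and is an assumption of this sketch.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs
                curr[j - 1] + 1,     # insert ct
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def similarity(s: str, t: str) -> float:
    """Scale distance to a 0..1 score; 1.0 means identical strings."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1 - levenshtein(s, t) / longest


print(levenshtein("Jonson", "Johnson"))           # 1 (insert "h")
print(round(similarity("Jonson", "Johnson"), 3))  # 0.857
```

A matching rule would then accept name pairs whose similarity exceeds a chosen threshold, which has to be tuned against real data.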

Machine Learning Approaches

Supervised learning models trained on labeled match/non-match pairs can predict the match likelihood for new record pairs. Algorithms including random forests, support vector machines (SVMs), and logistic regression often outperform traditional Fellegi-Sunter methods.

Machine learning techniques allow businesses to refine matching processes over time based on historical data patterns. The models learn which attribute combinations most reliably indicate matches, adapting as data characteristics evolve.
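To make the idea concrete, the sketch below trains a tiny logistic regression on labeled pairs using plain Python and batch gradient descent. The three features (name similarity, date-of-birth agreement, ZIP agreement), the training pairs, and the hyperparameters are all invented for illustration; a production system would typically use an established ML library and far more labeled data.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x) -> float:
    """Estimated probability that a record pair is a match."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(features, labels, lr=0.5, epochs=2000):
    """Fit logistic regression with full-batch gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = predict(weights, bias, x) - y
            for k, xk in enumerate(x):
                grad_w[k] += error * xk
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Each row: [name_similarity, dob_agrees, zip_agrees]; label 1 = same entity
X = [
    [0.95, 1, 1], [0.85, 1, 0], [1.00, 1, 1], [0.90, 1, 1],
    [0.30, 0, 0], [0.40, 0, 1], [0.20, 0, 0], [0.60, 0, 0],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = train(X, y)
likely_match = predict(weights, bias, [0.88, 1, 1])
likely_non_match = predict(weights, bias, [0.35, 0, 1])
print(likely_match > 0.5, likely_non_match < 0.5)
```

The learned weights play the same role as hand-set field weights in probabilistic matching, except that they are fitted to the labeled examples and can be refit as new labels arrive.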

Hybrid and Graph-Based Approaches

Hybrid strategies combine deterministic exact matching with probabilistic or ML techniques for fuzzier connections. This combination provides finer control over match quality while strengthening overall matching accuracy.

Graph-based resolution treats pairwise matches as edges in a graph, with clusters of connected records representing entities. In simple cases, this equals finding connected components; in complex scenarios, it reveals relationship networks that simpler methods miss.
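The connected-components case can be sketched with a union-find structure. The record IDs and matched pairs below are hypothetical; in practice the pairs would come from an upstream matcher.

```python
def resolve_clusters(record_ids, matched_pairs):
    """Group records into entities by treating each matched pair as a graph edge."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk up to the root, shortening the path along the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), []).append(r)
    return sorted(sorted(members) for members in clusters.values())


records = ["A", "B", "C", "D", "E", "F"]
pairs = [("A", "B"), ("B", "C"), ("D", "E")]  # hypothetical matcher output
print(resolve_clusters(records, pairs))
# [['A', 'B', 'C'], ['D', 'E'], ['F']]
```

Note that A and C end up in the same entity even though they were never directly matched. This transitive closure is useful, but it also means a single false-positive edge can merge two unrelated entities, so match thresholds deserve extra care here.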

Enterprise Use Cases and Real-World Examples

Single Customer View / Customer 360

Entity resolution deduplicates data to create a complete 360-degree view of each customer. Businesses store customer data across CRM, e-commerce, marketing automation, and support platforms—entity resolution connects these fragments into unified profiles.

These consolidated views enable personalized recommendations, targeted marketing, and improved service. A retailer might discover that their "best online customer" and "frequent in-store shopper" are the same person, fundamentally changing how they approach that relationship.

Healthcare - Patient Matching

Entity resolution in healthcare determines when different medical records represent the same patient across multiple systems. Without proper matching, a single patient appears as multiple identities, fragmenting medical history across disconnected systems.

This fragmentation leads to incomplete views of patient medical history, missed drug interactions or allergies, duplicate tests and procedures, and delayed diagnoses. The Centers for Medicare and Medicaid Services uses probabilistic algorithms matching their National Plan and Provider Enumeration System with OpenPayments datasets for provider entity resolution.

Financial Services - Fraud Detection and AML

Financial institutions use entity resolution to uncover fraud rings by connecting seemingly unrelated accounts and transactions. A bank can link accounts opened under different names but sharing the same address and phone number, uncovering potential money laundering schemes.

Fraudsters frequently create accounts with variations of the same identity. Entity resolution helps link related records—shared phone numbers, addresses, devices—to detect fraudulent patterns and reduce losses while ensuring compliance with anti-money laundering regulations.
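A simplified version of this linking step might look like the following. The account fields and sample data are invented, and real AML systems combine many more signals; this sketch only flags groups of accounts that share a phone number and address but carry different names.

```python
from collections import defaultdict


def shared_contact_groups(accounts):
    """Return sorted ID lists for accounts sharing phone and address under different names."""
    groups = defaultdict(list)
    for acct in accounts:
        # Light normalization so trivial formatting differences do not hide a link
        key = (acct["phone"], acct["address"].lower().strip())
        groups[key].append(acct)

    flagged = []
    for members in groups.values():
        distinct_names = {a["name"] for a in members}
        if len(distinct_names) > 1:
            flagged.append(sorted(a["id"] for a in members))
    return sorted(flagged)


accounts = [
    {"id": "A1", "name": "Ann Lee", "phone": "5550100001", "address": "12 Oak St"},
    {"id": "A2", "name": "Mark Poe", "phone": "5550100001", "address": "12 oak st "},
    {"id": "A3", "name": "Sue Kim", "phone": "5550100002", "address": "9 Elm Rd"},
]
print(shared_contact_groups(accounts))  # [['A1', 'A2']]
```

A flagged group is a lead for investigation rather than proof of fraud, since family members and roommates legitimately share contact details.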

Supply Chain and Product Master Data

E-commerce platforms merge product listings to enhance search, recommendations, and customer experience. CRM systems consolidate information to improve service and enable targeted marketing.

Resolving duplicate product codes and descriptions across suppliers and internal systems creates consistency in catalogs, pricing, and inventory management. What appears as three different SKUs might actually be the same product entered by different vendors.

Technical Implementation Challenges

Scalability and Performance Issues

Brute-force matching compares every record with every other record, creating n(n-1)/2 unique pairs. When the record count grows 10 times, comparisons grow roughly 100 times. This quadratic growth quickly becomes computationally prohibitive.

Blocking techniques restrict comparisons to records where particularly discriminating identifiers agree, tremendously reducing pairs to compare. Modern cloud solutions like AWS Entity Resolution reduce months of development to minutes of setup, handling scale challenges through optimized infrastructure.
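The effect of blocking is easy to see in a small sketch. The blocking key used here (ZIP code plus birth year) and the sample records are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n: int) -> int:
    """Number of unique pairs a brute-force comparison would generate."""
    return n * (n - 1) // 2


def blocked_pairs(records, block_key):
    """Generate candidate pairs only among records that share a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = [
    {"id": 1, "zip": "30301", "birth_year": 1970},
    {"id": 2, "zip": "30301", "birth_year": 1970},
    {"id": 3, "zip": "30301", "birth_year": 1985},
    {"id": 4, "zip": "10001", "birth_year": 1970},
    {"id": 5, "zip": "10001", "birth_year": 1970},
    {"id": 6, "zip": "94105", "birth_year": 1992},
]

candidates = blocked_pairs(records, lambda r: (r["zip"], r["birth_year"]))
print(all_pairs_count(len(records)))  # 15 pairs without blocking
print(sorted(candidates))             # [(1, 2), (4, 5)]
print(all_pairs_count(1_000), all_pairs_count(10_000))  # 499500 49995000
```

The trade-off is recall: two records for the same person with a mistyped ZIP code land in different blocks and are never compared, which is why multiple blocking passes with different keys are common.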

Data Quality Prerequisites

Small differences in identifier recording—spelling variations, mobile versus home phone numbers, work versus personal email addresses—prevent unique matching. 47.7% of organizations identify data quality as their biggest challenge implementing enterprise analytics initiatives.

Tools must cleanse, match, enrich, and unify disparate records before resolution can succeed. Without addressing underlying data quality, even sophisticated algorithms produce unreliable results.

Implementation Complexity

Home-grown solutions face limitations in harmonization, deduplication, and matching robustness. Understanding these limitations often drives companies to pivot from in-house master data solutions to external MDM providers.

Starting with strong entity resolution capabilities early increases MDM success probability. Organizations that treat entity resolution as an afterthought typically face longer implementation cycles and lower confidence in results.

Knowledge Graphs and Semantic Data Platforms

Knowledge Graphs for Entity Resolution

Entity-Resolved Knowledge Graphs play a central role in scaling semantic layer solutions. Graph databases aggregate data from multiple sources to create comprehensive entity profiles, breaking down silos and making data more connected, contextual, and usable.

Knowledge graphs represent real-world objects and their semantic relations as graph structures of nodes and edges. Unlike traditional databases that simply store data, knowledge graphs focus on definitions of entities and connections between them.

Semantic Layer Benefits

Semantic layers add real-world meaning to structured and unstructured data, making insights clearer and previously unseen connections explicit. This approach creates a contextualized view of data without requiring movement outside storage systems where it resides.

Cross-functional teams can query complex relationships across domain silos. The semantic layer eliminates data silos, activates unused data, and enables new levels of on-demand business insight.

Modern Platform Approaches

Platforms like Galaxy integrate entity resolution into semantic data unification, reducing manual mapping that plagues traditional approaches. By creating explicit models of entities and relationships, these platforms unify fragmented enterprise data into a shared foundation that both humans and AI can reason over.

Automated entity resolution combined with semantic understanding addresses the long implementation cycles associated with legacy MDM tools. Organizations gain a living model of their business where entities, relationships, and meaning become explicit rather than implicit.

Entity Resolution Tools and Technologies

AWS Entity Resolution

Best for: Organizations needing to match customer records across multiple applications and channels quickly.

AWS Entity Resolution offers rule-based, ML-based, and data service provider-led matching techniques. The service matches, links, and enhances records across multiple applications, channels, and data stores.

Pros:

Quick setup: Configure in minutes versus months for bespoke solutions
Multiple techniques: Offers rule-based, ML-based, and data service provider-led matching
Native integration: Works seamlessly with other AWS services

Cons:

AWS lock-in: Requires commitment to AWS ecosystem
Cost unpredictability: Pay-per-use pricing can be difficult to forecast at scale
Limited customization: Less flexible than building custom solutions

Pricing: Pay-per-use based on records processed

Galaxy

Best for: Organizations building a semantic understanding of their business across fragmented systems.

Galaxy creates a living model of business entities and relationships, making structure and meaning explicit. The platform combines entity resolution with semantic modeling to unify data without replacing existing sources.

Pros:

Semantic foundation: Models businesses as systems, not just tables
Incremental adoption: Connects to existing data sources without migration
AI-ready context: Provides grounded, inspectable models for both humans and AI

Cons:

Technical maturity required: Best suited for organizations beyond basic BI needs
Implementation investment: Requires thoughtful modeling of business semantics
Newer platform: Less established than legacy MDM vendors

Pricing: Contact for enterprise pricing

Neo4j

Best for: Organizations needing to analyze complex relationship networks alongside entity resolution.

Neo4j performs entity resolution through graph database technology, treating relationships as first-class citizens. The platform aggregates data from multiple sources to create comprehensive entity profiles.

Pros:

Relationship analysis: Excels at uncovering connections between entities
Query flexibility: Graph query language enables complex pattern matching
Established ecosystem: Mature platform with extensive community support

Cons:

Learning curve: Graph databases require different thinking than relational databases
Specialized use cases: Overkill for simple deduplication needs
Performance tuning: Requires expertise to optimize for large-scale operations

Pricing: Free community edition; enterprise pricing varies

Senzing

Best for: Real-time entity resolution in master data management contexts.

Senzing provides entity resolution capabilities specifically focused on MDM and data quality initiatives. The platform handles real-time matching as new records arrive.

Pros:

Real-time processing: Resolves entities as data streams in
MDM integration: Built specifically for master data workflows
Accuracy focus: Sophisticated matching algorithms for high-confidence results

Cons:

Narrow scope: Primarily entity resolution without broader semantic capabilities
Integration effort: Requires connection to surrounding data infrastructure
Limited transparency: Proprietary algorithms less inspectable than open approaches

Pricing: Contact for enterprise pricing

Best Practices and Implementation Guidelines

Starting with Entity Resolution

Establish a clean, deduplicated foundation before attempting full MDM implementation. Run multiple blocking passes to catch records with identifier errors—a first pass might join only records with identical ZIP codes and birth years, while a second pass uses first and last names.

Create additional linkage rules for missing identifiers using phonetic algorithms. When Social Security numbers are absent, compare name, date of birth, sex, and ZIP code. Phonetic algorithms like Soundex, NYSIIS, or Metaphone help resolve name variations.
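As an example of a phonetic algorithm, here is a compact implementation of American Soundex, which reduces a name to its first letter plus three digits. The resulting code can be compared directly or used as a blocking key in an additional pass. This is a sketch of the standard rules, not a drop-in replacement for a tested library.

```python
# Consonant groups that share a Soundex digit
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code, or '' if no letters are present."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate repeated codes
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code after a vowel is kept
    return (result + "000")[:4]


print(soundex("Johnson"), soundex("Jonson"))  # J525 J525
print(soundex("Robert"), soundex("Rupert"))   # R163 R163
```

Soundex is tuned for English surnames and is deliberately coarse, so it works better for generating candidate pairs than for confirming a match on its own.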

Choosing the Right Approach

Use deterministic matching for highest confidence when clean data and personally identifiable information are available. The exact matching provides certainty but requires pristine inputs.

Apply probabilistic methods for messier data and greater scale, accepting a reasonable confidence tradeoff. Implement hybrid strategies combining both methods for optimal match quality—exact matching where possible, fuzzy matching where necessary.
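One way to sketch a hybrid strategy is a two-stage function: an exact rule on a strong identifier first, then a fuzzy fallback when that identifier is missing. The field names, the 0.85 threshold, and the choice to treat conflicting identifiers as a non-match are assumptions of this example. Name similarity uses difflib.SequenceMatcher from the Python standard library.

```python
from difflib import SequenceMatcher


def hybrid_match(a: dict, b: dict, fuzzy_threshold: float = 0.85) -> str:
    """Return 'exact', 'fuzzy', or 'no match' for a pair of records."""
    # Stage 1: deterministic. If both records carry the identifier, trust it.
    if a.get("ssn") and b.get("ssn"):
        return "exact" if a["ssn"] == b["ssn"] else "no match"

    # Stage 2: fuzzy fallback. Require the same date of birth and a similar name.
    if a["dob"] != b["dob"]:
        return "no match"
    score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return "fuzzy" if score >= fuzzy_threshold else "no match"


# Sample records with obviously fake identifiers
a = {"ssn": "000-00-0001", "name": "Robert Johnson", "dob": "1970-03-14"}
b = {"ssn": "000-00-0001", "name": "Bob Johnson", "dob": "1970-03-14"}
c = {"ssn": None, "name": "Robert Jonson", "dob": "1970-03-14"}
d = {"ssn": None, "name": "Maria Lopez", "dob": "1970-03-14"}

print(hybrid_match(a, b))  # exact
print(hybrid_match(a, c))  # fuzzy
print(hybrid_match(a, d))  # no match
```

Returning the match type rather than a bare boolean lets downstream processes treat exact and fuzzy links differently, for example by routing fuzzy links to review.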

Governance and Maintenance

Define acceptable differences in identifiers before implementation. ZIP codes might be flexible, but dates of birth should match strictly. These thresholds depend on your specific data quality and business requirements.

Create golden records representing entities with highest confidence from multiple sources. Continuously refine machine learning models based on historical match and non-match patterns, treating entity resolution as an evolving capability rather than a one-time project.
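A golden record is usually assembled with survivorship rules that decide which source value wins for each field. The sketch below uses one simple rule, "most recently updated non-empty value wins", which is an assumption of this example; other common rules rank sources by trust or pick the most frequent value.

```python
def build_golden_record(records, fields):
    """For each field, keep the newest non-empty value across the matched records."""
    newest_first = sorted(records, key=lambda r: r["updated"], reverse=True)
    golden = {}
    for field in fields:
        golden[field] = next(
            (r[field] for r in newest_first if r.get(field)), None
        )
    return golden


# Hypothetical records already resolved to the same customer
cluster = [
    {"source": "crm", "name": "Robert Johnson", "phone": "",
     "email": "rj@example.com", "updated": "2024-01-10"},
    {"source": "billing", "name": "Bob Johnson", "phone": "555-010-1234",
     "email": "", "updated": "2025-06-02"},
]

print(build_golden_record(cluster, ["name", "phone", "email"]))
# {'name': 'Bob Johnson', 'phone': '555-010-1234', 'email': 'rj@example.com'}
```

The example also shows a weakness of pure recency: the newer nickname replaces the fuller legal name, which is the kind of outcome governance rules should decide on deliberately.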

Common Questions About Entity Resolution

Entity Resolution vs. Record Linkage vs. Identity Resolution

Identity resolution determines what data elements uniquely identify an entity. Record linkage applies identity resolution across multiple sources to solve for a specific entity. Entity resolution ensures unique, complete, accurate, and consistent data representation.

The terms overlap significantly in practice. Academic literature favors record linkage, while enterprise contexts prefer entity resolution or identity resolution depending on industry.

Deterministic vs. Probabilistic: Which is Better?

Deterministic matching produces fewer false matches when identifiers are clean and present, but this doesn't end the conversation. Probabilistic approaches find more of the true matches in messier data and at greater scale, giving up some precision to achieve that coverage.

Neither is definitively better. The choice depends on your data quality, volume, and confidence requirements. Many organizations use both: deterministic for high-confidence matches, probabilistic for broader coverage.

Handling Missing or Incomplete Data

Create additional rules comparing alternative identifiers when primary ones are missing. If Social Security numbers aren't available, fall back to name, date of birth, sex, and ZIP code combinations.

Use phonetic algorithms to resolve name variations and typos. Define which identifier differences are acceptable and which should block matches entirely—this varies by entity type and business context.
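The fallback logic can be sketched as a rule that skips missing values, treats disagreement on strict identifiers as disqualifying, and requires a minimum number of agreements overall. Which fields count as strict, and the minimum of three agreements, are illustrative assumptions.

```python
def fallback_match(a: dict, b: dict,
                   strict=("dob", "sex"),
                   flexible=("first_name", "last_name", "zip"),
                   min_agreements: int = 3) -> bool:
    """Match on alternative identifiers, ignoring fields missing from either record."""
    agreements = 0
    for field in strict:
        va, vb = a.get(field), b.get(field)
        if va and vb:
            if va != vb:
                return False  # a strict identifier disagrees: block the match
            agreements += 1
    for field in flexible:
        va, vb = a.get(field), b.get(field)
        if va and vb and va == vb:
            agreements += 1  # flexible fields may disagree without blocking
    return agreements >= min_agreements


a = {"first_name": "Robert", "last_name": "Johnson", "dob": "1970-03-14",
     "sex": "M", "zip": "30301"}
b = {"first_name": "Robert", "last_name": "Johnson", "dob": "1970-03-14",
     "sex": "M", "zip": None}
c = {"first_name": "Robert", "last_name": "Johnson", "dob": "1971-03-14",
     "sex": "M", "zip": "30301"}
d = {"first_name": None, "last_name": "Johnson", "dob": None,
     "sex": None, "zip": "30301"}

print(fallback_match(a, b))  # True: missing ZIP is ignored, four fields agree
print(fallback_match(a, c))  # False: date of birth disagrees
print(fallback_match(a, d))  # False: too few populated fields to decide
```

Treating a missing value as "unknown" rather than as a disagreement keeps sparse records from being rejected outright, while the minimum-agreement count stops them from matching on thin evidence.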

When to Prioritize Entity Resolution Before MDM

Entity resolution provides immediate value through improved data quality and relationship insights, delivering early wins that build momentum. It establishes the clean, deduplicated foundation necessary for successful MDM implementation.

Modern entity resolution technologies scale to enterprise volumes while maintaining accuracy. Starting here reduces risk and increases the probability of MDM success compared to attempting comprehensive MDM without first addressing entity resolution.

Conclusion

Entity resolution transforms fragmented records into unified, actionable entity views that organizations can actually trust. The techniques range from simple deterministic matching to sophisticated machine learning models, with the right approach depending on your data quality, scale, and accuracy requirements.

Modern semantic platforms automate resolution while reducing the manual mapping and long implementation cycles that plague traditional MDM tools. Organizations that establish strong entity resolution capabilities early achieve higher success rates and faster time-to-value than those treating it as an afterthought.

The healthcare network that started this story eventually implemented entity resolution across their patient records. Robert Johnson now has one unified medical record, regardless of which system clinicians access. When he arrives at any facility in the network, physicians see his complete history—allergies, medications, past procedures—in seconds rather than hours. That's not just better data management. That's better patient care.
