The End of “Gritty” Entity Resolution: How We Overcame Data Cleansing Limits by Entrusting 250,000 Comparisons to AI

In the world of data engineering, if one were to name the most dreaded yet unavoidable task, it would undoubtedly be “Entity Resolution” (known as nayose in Japan).

Inconsistent notations, duplicate records, and minute differences in address formatting: untangling these one by one to identify the same individual or corporation is a form of "penance" that feels like walking through a data abyss. However, a ray of light called AI (LLM) is now shining onto this gritty process that has long stifled engineers' creativity.

This article is a real-world account of how we smartly broke through the desperate phase of comparing 250,000 combinations using LLMs. This is not just a record of efficiency; it represents a paradigm shift in how modern engineers should interact with data.

Why AI-Driven Entity Resolution is Essential Now

In data analysis and CRM (Customer Relationship Management) implementation, data integrity is the lifeline. However, traditional rule-based entity resolution (using regular expressions or dictionary matching) has its limits. It becomes a game of “whack-a-mole,” where new rules must be added every time an unexpected input pattern appears, bloating development man-hours and significantly degrading maintainability.
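To see why rules alone don't scale, consider a toy illustration (not taken from any real pipeline): a hand-rolled company-name normalizer. The SUFFIXES list and normalize_company function below are hypothetical, and the point is simply that every new notation pattern demands yet another hand-written rule.

```python
# Toy illustration of rule-based "whack-a-mole": the rule list only ever grows.
import re

SUFFIXES = [r"株式会社", r"\(株\)", r"Inc\.?", r"Corp\.?", r"Co\.,? Ltd\.?"]  # keeps growing

def normalize_company(name: str) -> str:
    """Strip known legal-entity suffixes, whitespace, and case differences."""
    for suffix in SUFFIXES:
        name = re.sub(suffix, "", name, flags=re.IGNORECASE)
    return re.sub(r"\s+", "", name).lower()

# "ACME Corp." and "ACME corp" now match, but "ACME K.K." or a new suffix
# slips through until someone adds yet another rule.
print(normalize_company("ACME Corp.") == normalize_company("ACME corp"))  # True
```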

【TechTrend Watch Perspective】 Traditional entity resolution was a "systematic matching" process seeking exact hits. However, the rise of LLMs has introduced a human-like flexibility: "contextual similarity judgment." While manually comparing 250,000 pairs is practically infeasible for a human, an AI can complete the job in a few hours at a remarkable cost-performance ratio of a few dozen to a few hundred dollars. This is no longer just a choice of methodology; it is a business decision to transform data "debt" into "assets."

Implementation Architecture: How to Integrate LLMs “Smartly”

Simply throwing all data into an LLM is unrealistic from both a cost and time perspective. Sophisticated engineers achieve both high precision and low cost through the following three steps (a minimal code sketch follows the list):

  1. Blocking (Candidate Filtering): Avoid a brute-force comparison of all 250,000 combinations. Use highly reliable keys, such as the last four digits of a phone number or a zip code, to pre-filter candidates down to a few hundred cases for the LLM to judge.
  2. Context Prompting: Instruct the LLM to consider domain-specific variations for the filtered candidates, such as the presence of legal entity status (Inc./Corp.), floor level notations in addresses, or abbreviations of building names.
  3. Probabilistic Scoring: Have the LLM output a “confidence score” from 0 to 100 regarding the match. This allows for the automation of most decisions, extracting only the “gray zones” that require final human verification.
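To make the three steps concrete, here is a minimal sketch in Python. It assumes the OpenAI Python SDK and a plain list of record dictionaries; the field names (zip, phone), the prompt wording, the thresholds, and helper names such as block_key and judge_pair are illustrative assumptions rather than our production code.

```python
# Minimal sketch of the three-step flow: blocking -> contextual prompt -> confidence score.
import json
from collections import defaultdict
from itertools import combinations
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def block_key(record: dict) -> str:
    """Step 1: build a blocking key from reliable fields (zip code + last 4 phone digits)."""
    phone = "".join(ch for ch in record.get("phone", "") if ch.isdigit())
    return f'{record.get("zip", "")}|{phone[-4:]}'

def candidate_pairs(records: list[dict]):
    """Only compare records sharing a blocking key, instead of all N*(N-1)/2 pairs."""
    buckets = defaultdict(list)
    for r in records:
        buckets[block_key(r)].append(r)
    for bucket in buckets.values():
        yield from combinations(bucket, 2)

PROMPT = """You are deduplicating a corporate customer list.
Decide whether the two records refer to the same entity, allowing for
legal-entity suffixes (Inc./Corp.), floor-level notation in addresses,
and abbreviated building names.
Record A: {a}
Record B: {b}
Respond as JSON: {{"confidence": <0-100>, "reasoning": "<one sentence>"}}"""

def judge_pair(a: dict, b: dict) -> dict:
    """Steps 2-3: domain-aware prompt plus a 0-100 confidence score."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": PROMPT.format(
            a=json.dumps(a, ensure_ascii=False),
            b=json.dumps(b, ensure_ascii=False))}],
    )
    return json.loads(resp.choices[0].message.content)

def triage(records: list[dict], auto_merge: int = 90, auto_reject: int = 20):
    """Auto-handle clear cases; route the gray zone to human review."""
    for a, b in candidate_pairs(records):
        verdict = judge_pair(a, b)
        score = verdict["confidence"]
        bucket = "merge" if score >= auto_merge else "reject" if score <= auto_reject else "review"
        yield bucket, a, b, verdict
```

The thresholds in triage are the main knob: widening the gap between auto_merge and auto_reject sends more pairs to the human "gray zone," trading review effort against the risk of silent mis-merges.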

Comparison with Existing Methods: Why LLMs are Game Changers

| Evaluation Criteria | Traditional Rule-Based | Vector Search (Embeddings) | LLM (GPT-4o, etc.) |
| --- | --- | --- | --- |
| Flexibility | Low (rejected if outside definitions) | Medium (nearest-neighbor search) | Highest (deeply understands context) |
| Implementation Lead Time | Long (extensive requirement definition) | Short | Shortest (same-day implementation possible) |
| Matching Accuracy | Rigid (brittle to unseen patterns) | Moderate (keyword dependent) | Extremely high |
| Cost Structure | Massive human labor costs | Low (compute resources only) | API costs (pay-as-you-go) |

In conclusion, the hybrid configuration—“entrusting the last mile of complex human judgment to the LLM”—is the current best practice.

Prescriptions for Practice: “Pitfalls” to Avoid

When deploying this method in the field, the strategic points to keep in mind can be summarized into the following three areas (a code sketch covering all three follows the list):

  • Managing Hallucinations: LLMs can sometimes confidently produce incorrect answers. It is crucial to ensure verifiability by asking the AI to output the “Reasoning” behind its judgment, rather than just a “Yes/No” answer.
  • Privacy and Governance: Sending customer data to external APIs requires caution. PII (Personally Identifiable Information) masking or the use of secure, private environments like Azure OpenAI Service is a mandatory requirement.
  • Scalability via Asynchronous Processing: In large-scale data processing, sequential processing is a death sentence. It is essential to design systems that use asynchronous (Async) processing or Batch APIs to avoid timeouts and rate limits.
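Putting those three points together, the sketch below shows one possible shape: records are reduced to matching-relevant, masked fields before leaving your environment, every response carries a reasoning string next to its score, and a semaphore plus exponential backoff keeps concurrent requests under the rate limit. It assumes the AsyncOpenAI client from the OpenAI Python SDK; mask_pii, the concurrency cap, and the retry policy are hypothetical illustrations, not a prescribed implementation.

```python
# Sketch: asynchronous scoring with PII masking, reasoning output, and rate-limit backoff.
import asyncio
import json
import re
from openai import AsyncOpenAI, RateLimitError

client = AsyncOpenAI()
semaphore = asyncio.Semaphore(10)  # cap the number of in-flight requests

def mask_pii(record: dict) -> dict:
    """Keep only fields needed for matching and redact the phone number to its last 4 digits."""
    masked = {k: record[k] for k in ("name", "address", "zip") if k in record}
    if "phone" in record:
        masked["phone_last4"] = re.sub(r"\D", "", record["phone"])[-4:]
    return masked

async def judge_pair(a: dict, b: dict, retries: int = 3) -> dict:
    prompt = (
        "Do these two records refer to the same entity? "
        'Reply as JSON: {"confidence": <0-100>, "reasoning": "<one sentence>"}\n'
        f"A: {json.dumps(mask_pii(a), ensure_ascii=False)}\n"
        f"B: {json.dumps(mask_pii(b), ensure_ascii=False)}"
    )
    async with semaphore:
        for attempt in range(retries):
            try:
                resp = await client.chat.completions.create(
                    model="gpt-4o",
                    temperature=0,
                    response_format={"type": "json_object"},
                    messages=[{"role": "user", "content": prompt}],
                )
                return json.loads(resp.choices[0].message.content)
            except RateLimitError:
                await asyncio.sleep(2 ** attempt)  # exponential backoff, then retry
    return {"confidence": -1, "reasoning": "gave up after rate-limit retries"}

async def judge_all(pairs: list[tuple[dict, dict]]) -> list[dict]:
    return await asyncio.gather(*(judge_pair(a, b) for a, b in pairs))

# results = asyncio.run(judge_all(candidate_pairs))
```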

FAQ: Responding to Questions from the Field

Q: Is the cost-performance ratio justifiable? A: With proper blocking, the cost of processing tens of thousands of records can range from $50 to $100. Compared to the labor costs of a skilled engineer or clerk spending a month on manual entity resolution, the difference is stark.
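As a sanity check on that range, here is a back-of-envelope calculation. Every number in it is an assumption (candidate-pair count after blocking, per-pair token sizes, and per-million-token prices all vary by model and over time); the point is only that the order of magnitude is plausible.

```python
# Back-of-envelope cost check (all numbers are assumptions; verify current API pricing).
pairs_after_blocking = 50_000          # candidate pairs surviving blocking
tokens_in, tokens_out = 400, 60        # rough per-pair prompt / response size
price_in, price_out = 2.50, 10.00      # assumed USD per 1M input / output tokens
cost = pairs_after_blocking * (tokens_in * price_in + tokens_out * price_out) / 1_000_000
print(f"~${cost:.0f}")                 # ~= $80 under these assumptions
```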

Q: Can local LLMs (like Llama 3) be used as a substitute? A: While theoretically possible, large-scale commercial models like GPT-4o still have an edge in understanding “ambiguous nuances” based on specific regional address formats or unique business customs. If accuracy is the priority, commercial models are recommended.

Q: Can you guarantee 100% accuracy? A: One should not demand “perfection” from AI. The essence of this method lies in “releasing human resources from 95% of low-value repetitive tasks to allow them to focus on the 5% of critical judgments.”

Conclusion: Data Engineering in the AI Era

The experience of “manually scrutinizing 250,000 combinations” might have once been a badge of honor. However, in an age where we have the powerful leverage of AI, it is nothing more than an inefficiency to be avoided.

Our mission as engineers is to tame AI as a “tool” and dedicate our time to designing architectures that are more creative and bring direct value to the business. If you have “dirty data” lying dormant because you gave up on utilizing it, now is the time to build an entity resolution pipeline powered by LLMs. Beyond that lies the true value of data that was previously invisible.


This article is also available in Japanese.