Entity resolution, also known as record linkage, is a critical step in many data workflows. It involves identifying and matching records that refer to the same entity across different datasets. This task is challenging when names, formats, or structures differ, and requires a blend of automation and domain expertise.
Large language models (LLMs) are increasingly being explored to tackle entity resolution problems more intelligently. However, in this article, we focus on a more transparent and accessible approach, combining rule-based preprocessing, fuzzy joins, and human validation. This allows us to introduce key practical aspects of entity resolution (human validation, output structure, usage for data enrichment) and to highlight the limits of such approaches, revealing opportunities for more advanced techniques like LLMs to step in.
An example area of application is ESG (Environmental, Social, and Governance) analytics, where organizations rely on a combination of internal and third-party data to measure the externalities of their value chain, remedy them strategically, and comply with regulatory obligations. The parameters of this exercise differ depending on the type of value chain.
In this article, we explore how Dataiku was used to create an entity resolution tool for matching company records from internal and external datasets. We describe the problem, outline the solution, demonstrate validation interfaces, and explain how the results support data enrichment.
The project we created can be accessed on the Dataiku Gallery, a public, read-only instance of Dataiku. Throughout the article, links to the project's key datasets and recipes (aka data transformation operations) are provided.
In our use case, we wanted to enrich internal data on 2,001 companies with sustainability data from an external provider covering 9,895 companies. The challenge? Company names were often inconsistent between the two datasets. For instance:
Such discrepancies made exact matches unfeasible. The solution required a combination of name preprocessing, fuzzy matching, and human validation to ensure high accuracy.
Below, you can see examples of records from both datasets for companies in Switzerland’s financial services industry.
The solution aims to provide a matching table with one column for the company identifier used by the CRM and another for the one used by the external provider.
To improve match accuracy, we standardized company names (e.g., transforming "Mercedes-Benz Group AG" into "mercedes benz group").
We used the following steps:
This preprocessing ensured a consistent format for comparisons. Detailed steps can be found in the compute entities prepared recipe.
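To make this concrete, here is a minimal Python sketch of this kind of name standardization. It is an illustration only: the actual project uses a visual Prepare recipe, and the suffix list and function name below are assumptions rather than the recipe's exact configuration.

```python
import re
import unicodedata

# Hypothetical list of legal-form suffixes to strip; the actual recipe may use a different list.
LEGAL_SUFFIXES = {"ag", "sa", "gmbh", "ltd", "plc", "inc", "corp"}

def normalize_company_name(name: str) -> str:
    """Lowercase, remove accents and punctuation, and drop common legal-form suffixes."""
    # Remove accents (e.g., "é" -> "e")
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Lowercase and replace anything that isn't a letter or digit with a space
    name = re.sub(r"[^a-z0-9]+", " ", name.lower())
    # Drop legal-form tokens and collapse whitespace
    tokens = [t for t in name.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)

print(normalize_company_name("Mercedes-Benz Group AG"))  # -> "mercedes benz group"
```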
We applied a fuzzy join to identify matches where the textual distance was below a defined threshold. Here's how it worked:
1. Left Join: Internal CRM names were compared to external provider names based on textual distance (computed by the “Damerau-Levenshtein” algorithm, as shown in the screenshot below). Pairs with less than 20% distance between names and the same Country/Industry were kept. This resulted in 2,146 rows, which shows that a few entities had multiple matches.
2. Select Closest Match: For entities with multiple matches, only the closest one was kept. This was achieved by grouping by ID and keeping the first matched entity when sorted by distance.
This resulted in a dataset with the same number of rows as the CRM data (2,001). The fuzzy join computed distance values as percentages, and we turned them into Confidence values by taking 1 minus the distance (values are between 0 and 100%, the latter corresponding to a perfect match), which we think is more intuitive for human reviewers to work with.
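Outside of Dataiku's visual recipes, the same logic can be sketched in a few lines of Python. This is a simplified illustration, not the project's implementation: it assumes pandas and the rapidfuzz library, and the toy rows, identifiers, and column names are invented for the example.

```python
import pandas as pd
from rapidfuzz.distance import DamerauLevenshtein

# Toy stand-ins for the prepared CRM and external-provider datasets
# (illustrative IDs, names, and column names; the real datasets are much larger).
crm = pd.DataFrame({
    "ID": ["C1"],
    "name_prepared": ["mercedes benz group"],
    "Country": ["Germany"],
    "Industry": ["Automotive"],
})
provider = pd.DataFrame({
    "ID_ext": ["E1", "E2"],
    "name_prepared": ["mercedes benz grp", "mercedes benz museum"],
    "Country": ["Germany", "Germany"],
    "Industry": ["Automotive", "Automotive"],
})

# Candidate pairs: same Country and Industry, mirroring the fuzzy join's additional conditions
candidates = crm.merge(provider, on=["Country", "Industry"], suffixes=("", "_ext"))

# Normalized Damerau-Levenshtein distance between prepared names (0 = identical, 1 = completely different)
candidates["distance"] = candidates.apply(
    lambda row: DamerauLevenshtein.normalized_distance(
        row["name_prepared"], row["name_prepared_ext"]
    ),
    axis=1,
)

# Keep pairs under the 20% distance threshold, then the closest match per CRM entity
matches = (
    candidates[candidates["distance"] < 0.2]
    .sort_values("distance")
    .drop_duplicates(subset="ID", keep="first")
)

# Express the result as a confidence score (1 = perfect match), easier for reviewers to interpret
matches["Confidence"] = 1 - matches["distance"]
matching_table = matches[["ID", "ID_ext", "Confidence"]]
print(matching_table)
```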
Notice that we chose to identify matched entities by their IDs rather than their names, so that our matching table remains robust to potential adjustments of company names in the external dataset.
We categorized entities of our internal dataset into:
These categories were split into separate datasets for further analysis and validation.
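For illustration, and continuing the Python sketch above, such a split could be expressed as follows. The 0.95 confidence threshold and the idea of an automatically accepted category are assumptions made for the example; only the Uncertain Matches and Missing Matches datasets are referenced later in the project.

```python
# Hypothetical confidence threshold; the actual project may use a different cut-off.
CONFIDENCE_THRESHOLD = 0.95

# Entities whose best match is confident enough to accept automatically
confident_matches = matching_table[matching_table["Confidence"] >= CONFIDENCE_THRESHOLD]

# Entities with a match below the threshold: to be confirmed by a human reviewer
uncertain_matches = matching_table[matching_table["Confidence"] < CONFIDENCE_THRESHOLD]

# Entities from the CRM with no candidate match at all: to be resolved manually
missing_matches = crm[~crm["ID"].isin(matching_table["ID"])]
```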
The flow pipeline below illustrates the whole process described above:
Automated matching successfully identified connections like:
However, challenges remain:
Such cases highlight the importance of domain expertise and human review in the process. Advanced techniques, such as large language models, could further reduce errors, but human validation will remain essential.
Using Dataiku’s Visual Edit plugin, we deployed two no-code web interfaces to involve domain experts, based on the Uncertain Matches and Missing Matches datasets. One is used to validate the machine's work, and the other is used to do the work that the machine couldn’t do, i.e., to find missing matches. (The links point to webapps hosted on the Dataiku Gallery; please note that the Gallery is read-only and therefore the interfaces don’t allow editing data.)
With the Audit Trail:
In the final “Dispatch Edits” flow zone, validations and edits made in the web interfaces are fed into the data pipeline to create the final matching table:
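The actual mechanism relies on the Visual Edit plugin's editlog; conceptually, though, the reviewers' decisions simply override or complete the automated suggestions. A simplified sketch, continuing the toy data above, could look like this (reviewed_edits and ID_ext_reviewed are illustrative names, not the plugin's actual schema):

```python
# Toy stand-in for the reviewers' decisions: validated or corrected external identifiers
reviewed_edits = pd.DataFrame({"ID": ["C2"], "ID_ext_reviewed": ["E7"]})

# Start from the automated matches, then apply the reviewers' decisions on top
final = matching_table.merge(
    reviewed_edits[["ID", "ID_ext_reviewed"]], on="ID", how="outer"
)

# Reviewed values take precedence over the automated suggestion
final["ID_ext"] = final["ID_ext_reviewed"].combine_first(final["ID_ext"])
final_matching_table = final[["ID", "ID_ext"]].dropna(subset=["ID_ext"])
```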
We recommend implementing ESG analytics in separate projects, using the matching table as a bridge between the internal data and the sustainability enrichment dataset from the external provider.
In Dataiku, this enrichment can be done with a three-way left join recipe, where the ID column matches identifiers used in the internal dataset, and ID_ext matches identifiers used in the enrichment dataset. We prefer to avoid updating source data (e.g. updating company names in the internal dataset so that they match those of the external provider), and we consider the matching table as a new source of data. This way, we have the flexibility to un-resolve if a matching error was made, while keeping a history of the error and its fix via the editlog.
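Continuing the toy sketches above, the equivalent of this join in pandas would be the following (the crm DataFrame stands in for the internal dataset, and the sustainability column is illustrative):

```python
# Toy stand-in for the provider's sustainability data (metric name is illustrative)
sustainability_data = pd.DataFrame({"ID_ext": ["E1"], "co2_emissions": [1234.5]})

# Left joins: keep all internal records, bring in the external identifier, then the sustainability metrics
enriched = (
    crm
    .merge(final_matching_table, on="ID", how="left")      # internal dataset -> matching table
    .merge(sustainability_data, on="ID_ext", how="left")   # matching table -> enrichment dataset
)
```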
The enriched dataset can be used to build sustainability analytics and dashboards. We recommend centralizing dashboards and datasets from both projects in a single Dataiku Workspace for easy access by all stakeholders:
This simple project demonstrated how Dataiku accelerates data workflows such as entity resolution by combining automation with domain expertise:
This is achieved thanks to a no-code, transparent environment:
While our example focused on using entity resolution to enrich company data with external sustainability information for ESG reporting and compliance, its applications extend far beyond this scenario. For instance:
You can download the example project here and import it as a new project on your Dataiku instance (if you don’t have one, free trials are available on Dataiku Cloud). You’ll find more information on how to adapt it to your own data and on how to use it in practice in the project’s Wiki.