In the last few weeks, I had to work on matching a large number of records relating to various companies against an internal, feature-rich dataset of 10 million companies. Needless to say, there were no readily available standardized identifiers across the two databases on which one could perform a join operation. The record matching had to rely on approximate and probabilistic matching. Until this piece of work came my way, I had never heard of “Record Matching” as a subject in its own right, one that people do PhDs in.

I immersed myself in quite a few papers, books and blog posts to understand this field just enough to get my work done. In the process I found several interesting talks, books, decks and papers. In the days to come, I will try to summarize a few of them. This post summarizes a series of posts written by Robin Linacre, who works at the Ministry of Justice, UK.

Interactive Introduction to Probabilistic Record Linkage

  • Probabilistic record linkage is a technique used to link together records that lack unique identifiers
  • A combination of non-unique variables such as name, gender and date of birth can be used to identify individuals
  • It can be done within datasets or between datasets
  • Linkage is probabilistic in nature as it relies on the balance of evidence.
  • It is impossible to classify pairs of records as matches or non-matches beyond any doubt. Instead, the aim of probabilistic record linkage is to quantify the probability that a pair of records refers to the same entity by considering the evidence in favor of and against a match and weighing it appropriately
  • We start with a prior, which represents the probability that two records drawn at random are a match. We then compare the two records, increasing the match probability where columns agree and decreasing it where they disagree
  • Columns with a higher number of distinct values tend to represent stronger evidence in favor of a match, since it is less likely that two records chosen at random would agree by coincidence
  • By comparing records and weighing the evidence appropriately, we estimate the probability of a match (a toy sketch of this updating logic follows this list)
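A minimal sketch of that prior-plus-evidence updating in Python. The prior and the per-column Bayes factors below are invented purely for illustration; the later sections explain where such factors actually come from.

```python
# Toy illustration of the updating logic: start from a prior match
# probability and multiply the odds by a Bayes factor per column.
# All numbers here are invented for illustration only.

def prob_to_odds(p):
    return p / (1 - p)

def odds_to_prob(o):
    return o / (1 + o)

prior = 1 / 10_000  # chance that two random records are a match

# Bayes factor > 1: column agrees (evidence for a match);
# Bayes factor < 1: column disagrees (evidence against).
bayes_factors = {
    "first_name": 80,      # agrees; fairly distinctive
    "date_of_birth": 300,  # agrees; high-cardinality, strong evidence
    "gender": 0.5,         # disagrees; weak evidence either way
}

odds = prob_to_odds(prior)
for column, bf in bayes_factors.items():
    odds *= bf
    print(f"after {column}: match probability = {odds_to_prob(odds):.4f}")
```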

The mathematics of the Fellegi-Sunter Model

  • Whenever two records are compared, the comparison vector element for a specific field takes a value of 1 if the similarity score is high enough to be considered a match, and 0 otherwise
  • The Fellegi-Sunter model takes these comparisons as inputs and outputs match probabilities
  • How do we combine the elements of the comparison vector into an overall match probability?
    • The m and u probabilities have to be estimated. For each field, m is the probability that the field agrees given that the two records are truly a match, and u is the probability that the field agrees given that the records are truly a non-match.
    • Intuitively, these probabilities measure the strength of evidence carried by a field-level agreement or disagreement (the formula below makes this precise)
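In symbols, following the standard presentation of the model (with $\gamma_i$ the comparison vector element for field $i$):

$$
\frac{P(\text{match} \mid \gamma)}{P(\text{non-match} \mid \gamma)}
= \frac{P(\text{match})}{P(\text{non-match})}
\prod_i
\begin{cases}
\dfrac{m_i}{u_i} & \text{if } \gamma_i = 1 \\[4pt]
\dfrac{1 - m_i}{1 - u_i} & \text{if } \gamma_i = 0
\end{cases}
$$

where $m_i = P(\gamma_i = 1 \mid \text{match})$ and $u_i = P(\gamma_i = 1 \mid \text{non-match})$. Each term in the product is the Bayes factor contributed by one field.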

Visualizing the Fellegi-Sunter Model

  • The post does a great job of showing that each field’s posterior becomes the prior for the next field’s computation (see the sketch after this list)
  • By playing with the m and u probabilities in the app displayed on the page, one gets a good understanding of how a comparison vector is morphed into a match probability for the pair of records
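A small sketch of that chaining, with invented m and u values: each field contributes a Bayes factor of m/u on agreement and (1 − m)/(1 − u) on disagreement, and the running posterior feeds the next field as its prior.

```python
# Sketch of chained updating: each field's posterior becomes the prior
# for the next field. The m and u values are invented for illustration.
#   m = P(field agrees | records truly match)
#   u = P(field agrees | records truly do not match)
fields = {
    #           (m,    u)
    "surname": (0.95, 0.01),
    "dob":     (0.90, 0.001),
    "city":    (0.80, 0.20),
}
comparison_vector = {"surname": 1, "dob": 1, "city": 0}  # 1 = agree

prior = 1 / 1000
odds = prior / (1 - prior)
for field, (m, u) in fields.items():
    # Agreement contributes m/u; disagreement contributes (1-m)/(1-u).
    bf = m / u if comparison_vector[field] else (1 - m) / (1 - u)
    odds *= bf  # the posterior after this field is the next field's prior
    print(f"{field}: Bayes factor = {bf:.2f}, "
          f"match probability so far = {odds / (1 + odds):.4f}")
```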

Intuition behind the use of EM to train record linkage models

  • How does one compute the probability that two records match? The underlying statistical model is the Fellegi-Sunter model. It works by computing partial match weights, which measure the importance of the different columns in computing the probability of a match.
    • However, there’s a paradox at the heart of training a record linkage model: to estimate the parameters of the model, you need to know which records match, but finding which records match is the whole problem you are trying to solve. Manual labeling of comparisons is error-prone and costly.
  • The key to the EM approach is leveraging the fact that matching and non-matching record comparisons form two fairly distinct ‘clusters’
    • Amongst matching record comparisons, the information in most or all columns in the two records usually matches.
    • Amongst non-matching record comparisons, the information in most or all columns in the two records usually does not match.
    • This results in a bimodal distribution of record comparisons
    • This separation between matches and non-matches means it’s relatively easy to come up with a rough decision rule that is able to correctly identify most matches and most non-matches.
  • Expectation Step: Use the estimated model parameters to predict which record comparisons are matches
  • Maximization Step: Use the predictions to re-estimate the model parameters (a stylized sketch of this loop appears after this list)
  • The Splink software uses a different method for estimating the u probabilities: it takes random pairwise record comparisons and assumes they do not match. Since the probability of two random records being a match is usually very low, this yields a good estimate of the u values
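A highly stylized sketch of the EM loop on simulated binary comparison vectors. The data, starting values and dimensions are all invented for illustration; real implementations such as Splink handle many columns and graded comparison levels.

```python
import numpy as np

# Stylized EM for Fellegi-Sunter parameter estimation on binary
# comparison vectors (rows = record comparisons, columns = fields).
rng = np.random.default_rng(0)

# Simulate comparisons: 5% true matches (fields mostly agree),
# 95% non-matches (fields mostly disagree) -> a bimodal mixture.
n, k = 10_000, 4
is_match = rng.random(n) < 0.05
true_m, true_u = 0.9, 0.05
gamma = np.where(is_match[:, None],
                 rng.random((n, k)) < true_m,
                 rng.random((n, k)) < true_u).astype(float)

# Initial (rough) guesses for the parameters.
lam = 0.5            # P(match)
m = np.full(k, 0.8)  # P(field agrees | match)
u = np.full(k, 0.2)  # P(field agrees | non-match)

for _ in range(50):
    # Expectation step: probability that each comparison is a match,
    # given the current parameters.
    pm = lam * np.prod(m**gamma * (1 - m)**(1 - gamma), axis=1)
    pu = (1 - lam) * np.prod(u**gamma * (1 - u)**(1 - gamma), axis=1)
    w = pm / (pm + pu)

    # Maximization step: re-estimate parameters from the soft labels.
    lam = w.mean()
    m = (w[:, None] * gamma).sum(axis=0) / w.sum()
    u = ((1 - w)[:, None] * gamma).sum(axis=0) / (1 - w).sum()

print(f"estimated P(match) = {lam:.3f}")
print("estimated m:", np.round(m, 3))
print("estimated u:", np.round(u, 3))
```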

Understanding Match Weights

  • Some columns are more important than others to the calculation of match probability. The relative importance of columns is quantified in a model’s match weights
  • The post uses a waterfall chart to show how the final match probability is built up from the prior and the partial match weight of each column
  • A Bayes factor is similar to the concept of odds, but it differs from odds in that it is only meaningful in the context of a prior. A Bayes factor is an adjustment: it tells us how much more or less likely something becomes.
  • When a column matches, the denominator of the Bayes factor m/u (the u probability) drives the result, since m is typically close to 1
  • When a column does not match, the numerator of the Bayes factor (1 − m)/(1 − u) drives the result, since 1 − u is typically close to 1 (see the sketch below)
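In Linacre’s posts (and in Splink), a match weight is the log2 of a Bayes factor, so the evidence from each column adds rather than multiplies, which is what makes the waterfall chart possible. A toy sketch with invented m and u values:

```python
import math

# Sketch of match weights: the match weight of a column comparison is
# log2 of its Bayes factor, so evidence adds instead of multiplying.
def match_weight(m, u, agrees):
    bf = m / u if agrees else (1 - m) / (1 - u)
    return math.log2(bf)

prior = 1 / 1000
total = math.log2(prior / (1 - prior))  # the prior expressed as a weight

for name, m, u, agrees in [
    ("surname", 0.95, 0.01, True),   # match: driven by small u (denominator)
    ("dob",     0.90, 0.001, True),  # match: very small u -> big weight
    ("city",    0.80, 0.20, False),  # mismatch: driven by 1 - m (numerator)
]:
    w = match_weight(m, u, agrees)
    total += w
    print(f"{name}: partial weight = {w:+.2f}, running total = {total:+.2f}")

# Convert the accumulated weight back to a match probability.
odds = 2 ** total
print(f"final match probability = {odds / (1 + odds):.4f}")
```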