Record Linkage Primer
In the last few weeks, I had to work on matching a large number of company records against an internal, feature-rich dataset covering 10 million companies. Needless to say, there were no readily available standardized identifiers across the two databases that one could use to perform a join. The matching had to be approximate and probabilistic. Until this piece of work came my way, I had never heard of “Record Matching” as a subject in itself, let alone one that people do PhDs in.
I immersed myself in quite a few papers, books and blog posts to understand the field just well enough to get my work done. In the process I found several interesting talks, books, decks and papers. In the days to come, I hope to summarize a few of these papers and books. This post summarizes a series of posts written by Robin Linacre, who works at the UK Ministry of Justice.