In the last few weeks, I had to work on matching a large number of records
relating to various companies against an internal feature-rich dataset covering
10 million companies. Needless to say, there were no readily available
standardized identifiers across the two databases on which one could join. The
record matching had to be based on approximate matching and probabilistic
matching. Until this piece of work came my way, I had never heard of
“Record Matching” as a subject in itself, one in which people do PhDs.
I immersed myself in quite a few papers, books and blog posts to understand this
field just enough to get my work done. In the process I found several
interesting talks, books, decks and papers. In the days to come, I hope to
summarize a few of these papers and books. This post summarizes a series of
posts written by Robin Linacre, who works at the UK Ministry of Justice.