State and local government organizations manage a lot of data. One agency may capture home title information, while another may be collecting test information for COVID-19. Direct links between these separate systems often do not exist. The ability to combine information about individuals across systems can help government employees better serve the public and simplify internal processes. However, directly merging datasets is often impossible because they lack a common key or ID.
Probabilistic matching, a Bayesian statistical approach for joining tables without a common key, can help IT teams solve this problem in a robust way. In this post, we’ll introduce probabilistic matching, and in a future post, we’ll show how to implement this solution in Azure Databricks using the Splink package, a popular distributed implementation of the Fellegi-Sunter matching model.
Let us consider an example using the datasets we mentioned earlier: home ownership and COVID-19 test results. Perhaps an agency is looking to combine these two sources for the purposes of a COVID-19 contact tracing application. Suppose we want to join these two tables on four fields: First Name, Last Name, Address, and Occupation.
As we can see in the above example, the two records match on last name and address but fail to match on first name and occupation. Most basic approaches to directly joining these two tables would fail, whether we joined on all four fields or just first name and last name. Probabilistic matching offers flexibility where a direct join does not. Rather than returning a binary yes or no based on whether every rule is satisfied, a probabilistic matching algorithm quantifies the likelihood of a match as a probability score between 0 and 1, given what it has learned about the dataset and which features have matched.
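To make the failure of a direct join concrete, here is a minimal Python sketch. Only the first names ("Andrew" and "Andy") come from the example; the last name, address, and occupation values are hypothetical, chosen so that the records agree on last name and address and disagree on occupation.

```python
# Hypothetical records. Only "Andrew" / "Andy" come from the example;
# every other value is invented for illustration.
title_record = {
    "first_name": "Andrew",
    "last_name": "Smith",
    "address": "12 Main St",
    "occupation": "Data Scientist",
}
test_record = {
    "first_name": "Andy",
    "last_name": "Smith",
    "address": "12 Main St",
    "occupation": "Analyst",
}

FIELDS = ["first_name", "last_name", "address", "occupation"]


def exact_match(a, b, fields):
    """A direct join keeps a pair only if every join field is identical."""
    return all(a[f] == b[f] for f in fields)


def field_agreement(a, b, fields):
    """Record which fields agree; this is the input a probabilistic matcher scores."""
    return {f: a[f] == b[f] for f in fields}


all_four = exact_match(title_record, test_record, FIELDS)
names_only = exact_match(title_record, test_record, ["first_name", "last_name"])
agreement = field_agreement(title_record, test_record, FIELDS)

print("join on all four fields:", all_four)
print("join on first/last name:", names_only)
print("per-field agreement:", agreement)
```

Both direct joins reject the pair, even though two of the four fields agree. The per-field agreement pattern is the information a probabilistic approach uses instead of a single yes or no.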
To better understand how probabilistic matching works, we can follow the waterfall chart. First, the algorithm determines the probability that two records would match randomly. In our example, let’s say that the probability of a match is low, so we’ll start it at 0.1 (in practice, this number is typically much closer to 0, but we’ll set it to 0.1 for visual purposes).
Next, the algorithm evaluates whether the first names match. As “Andrew” and “Andy” do not match, it decreases the probability. Following this, the algorithm evaluates the last names. A last-name match is rare in our dataset, so the probability increases by 0.35.
Let’s suppose address is also a highly distinguishing feature in our dataset, and combinations of last name and address are especially rare, so the probability jumps significantly, by 0.5, to 0.9.
Finally, the probability decreases after failing to match on our occupation field, which rarely joins. This results in a probability of 0.8, which is high! For the purposes of our contact tracing project, we will likely want to see if other people at this address have tested positive.
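The waterfall walk-through can be traced as simple arithmetic. The sketch below uses the figures stated in the example (a starting probability of 0.1, +0.35 for last name, +0.5 for address, and a final value of 0.8). The two decreases are not stated in the text, so the -0.05 and -0.10 used here are inferred from the other numbers and should be read as assumptions.

```python
# Starting probability that two records match at random (from the example).
prior = 0.1

# (label, change in probability). The two negative values are inferred
# from the stated figures, not quoted from the text.
steps = [
    ("first name: no match", -0.05),  # inferred: 0.9 - 0.5 - 0.35 - 0.1
    ("last name: match", +0.35),      # stated in the example
    ("address: match", +0.50),        # stated in the example
    ("occupation: no match", -0.10),  # inferred: 0.9 -> 0.8
]

probability = prior
waterfall = [("starting probability", prior)]
for label, delta in steps:
    # Round to two decimals to avoid floating-point noise in the display.
    probability = round(probability + delta, 2)
    waterfall.append((label, probability))

for label, value in waterfall:
    print(f"{label:<24}{value:.2f}")
```

Adding and subtracting probabilities like this mirrors the chart and is useful for intuition only; it is a simplified illustration rather than the exact update the underlying model performs.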
Probabilistic Matching Advantages and Caveats
The advantage of this approach is that the algorithm treats the probability of a successful join on a feature and the probability of an unsuccessful join on that feature as separate things. As a result, rare matches, like address and occupation, are highly rewarded when they occur but not severely punished when they don’t. This is very useful in a person-matching context, as an individual’s information is often in flux. Addresses change regularly, people have nicknames, and systems capture information in different ways.
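One way to see this asymmetry is through the two per-field parameters used in the Fellegi-Sunter model: m, the probability that a field agrees when two records refer to the same person, and u, the probability that it agrees when they do not. The following is a minimal sketch in plain Python, not Splink's API. All m and u values are hypothetical, and the names `FieldParams`, `bayes_factor`, and `match_probability` are invented for illustration. Note that the model multiplies odds rather than adding probabilities, so the result will not reproduce the simplified waterfall figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldParams:
    m: float  # P(field agrees | records refer to the same person)
    u: float  # P(field agrees | records refer to different people)


# Hypothetical parameters; in practice these are estimated from the data.
PARAMS = {
    "first_name": FieldParams(m=0.70, u=0.01),
    "last_name": FieldParams(m=0.95, u=0.01),
    "address": FieldParams(m=0.80, u=0.001),
    "occupation": FieldParams(m=0.60, u=0.05),
}


def bayes_factor(p: FieldParams, agrees: bool) -> float:
    """Evidence from one field: m/u on agreement, (1-m)/(1-u) on disagreement."""
    return p.m / p.u if agrees else (1 - p.m) / (1 - p.u)


def match_probability(prior: float, agreement: dict, params: dict) -> float:
    """Bayesian update: convert the prior to odds, multiply in each field, convert back."""
    odds = prior / (1 - prior)
    for field, agrees in agreement.items():
        odds *= bayes_factor(params[field], agrees)
    return odds / (1 + odds)


# Agreement pattern from the example: last name and address match,
# first name and occupation do not.
agreement = {
    "first_name": False,
    "last_name": True,
    "address": True,
    "occupation": False,
}
posterior = match_probability(0.1, agreement, PARAMS)

reward = bayes_factor(PARAMS["address"], True)       # multiplier when address agrees
penalty = bayes_factor(PARAMS["address"], False)     # multiplier when it disagrees
print(f"posterior match probability: {posterior:.4f}")
print(f"address agreement multiplies the odds by {reward:.0f}")
print(f"address disagreement multiplies the odds by {penalty:.3f}")
```

Because u is small for a rare field, agreement multiplies the odds by a large factor (m/u), while disagreement only shrinks them by roughly 1 - m. That is the sense in which rare matches are highly rewarded but not severely punished.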
It is important to remember that this algorithm – while very useful for combining information about people – is not a replacement for an absolute join. Like many machine learning solutions, probabilistic matching requires oversight to ensure the algorithm is performing as intended.
Probabilistic matching can remove the need for complex manual mapping processes to join disparate government systems. In the next post, we’ll walk through how to use probabilistic matching for state and local government with the Splink package on Azure Databricks.
Connect with Andrew Kraemer, Tallan Data Science Lead on LinkedIn
Interested in learning more about data science? Join us on July 27th for Tallan’s Data Scientists: Ask Us Anything virtual event. Lee Harper, our Director of Data Science & AI Solutions, and Andrew Kraemer, our Data Science Lead, will go live on Microsoft Teams to answer any and all questions that you may have regarding Tallan’s Data Science process and offerings!