
Hands-on Tutorial

The Optimization of Fuzzy String Matching Using TF-IDF and Nearest Neighbors Algorithm

How to accelerate the computation time of fuzzy string matching from hours to seconds

5 min read · Feb 13, 2021


When working with real-world data, the hardest problems tend to arise during pre-processing. The details vary, but record matching is one of the biggest challenges many analysts face. For instance, George Washington and G Washington clearly refer to the same person, the first President of the United States, so the two records are duplicates. Fortunately, researchers have developed probabilistic data matching algorithms, better known as fuzzy matching.

What is fuzzy string matching?

Probabilistic data matching, often referred to as fuzzy string matching, is a technique that compares a query string against a collection of strings in a database and reports how similar each pair is. The output is a similarity score (a value in the range 0 to 1, or the equivalent percentage) rather than a yes-or-no exact match.
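To make the idea of a similarity score concrete, here is a minimal sketch using only Python's standard library. It relies on `difflib.SequenceMatcher`, which is one of many possible similarity measures and not necessarily the one used later in this article. The helper names `similarity` and `best_match` and the sample list of names are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(query: str, candidates: list[str]) -> tuple[str, float]:
    """Return the candidate most similar to the query, with its score."""
    scored = ((candidate, similarity(query, candidate)) for candidate in candidates)
    return max(scored, key=lambda pair: pair[1])


# Hypothetical "database" of names, for illustration only.
names = ["George Washington", "Abraham Lincoln", "Thomas Jefferson"]

match, score = best_match("G Washington", names)
print(f"{match}: {score:.0%}")
```

A score of 1 means the strings are identical after lowercasing, and values closer to 0 mean they share little in common. In practice a threshold on the score decides whether two records count as duplicates.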

There are many ways to perform fuzzy string matching, for instance, Levenshtein distance, but it has a problem with the algorithm…
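For readers unfamiliar with Levenshtein distance, the sketch below shows the standard dynamic-programming formulation: the distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The function names and the normalization into a 0 to 1 similarity are assumptions for illustration, not code from this article.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale the distance into [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


print(levenshtein("George Washington", "G Washington"))
print(round(levenshtein_similarity("George Washington", "G Washington"), 2))
```

Each comparison fills a table whose size is the product of the two string lengths, and matching one list against another repeats that work for every pair of strings.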


Audhi Aprilliant

Written by Audhi Aprilliant

Data Scientist. Tech Writer. Statistics, Data Analytics, and Computer Science Enthusiast. Portfolio & social media links at http://audhiaprilliant.github.io/