Member-only story
Hands-on Tutorial
The Optimization of Fuzzy String Matching Using TF-IDF and Nearest Neighbors Algorithm
How to accelerate the computation time of fuzzy string matching from hours to seconds
When working with real data, the biggest problems are mostly in data pre-processing. It may vary, but matching can be one of the biggest challenges faced by a lot of analysts. For instance, when we are talking about George Washington and G Washington, of course, we are talking about one person, namely the first President of the United States. We are dealing with duplicate data. Luckily, researchers have developed the probabilistic data matching algorithm or well-known as fuzzy matching.
What is fuzzy string matching?
Probabilistic data matching often referred to as fuzzy string matching, is the algorithm to match a pattern between a string with a sequence of strings in the database and give a matching similarity — in percentage. It explicitly indicates that the output must be the probability (in the range 0 to 1 or the percentage of similarity) instead of an exact number.
There are many ways to perform fuzzy string matching, for instance, Levenshtein distance, but it has a problem with the algorithm performance. The reason is that each record is compared against all the other records in the data. This phenomenon is well-known as quadratic time complexity.
How do TF-IDF and nearest neighbors algorithm accelerate the computation time?
Term Frequency — Inverse Document Frequency (TF-IDF) is a Natural Language Processing (NLP) technique that tries to represent the text in numerical data and to measure how important a word is in a given document. For example, from a collection of documents, how important is the word “flower”?

The TF-IDF is implemented using n-grams of groups of letters caused by the possibility of misspellings or typos. For instance, the word independence is chunked into the following…