Photo by Fabrizio Verrecchia on Unsplash

Member-only story

Hands-on Tutorial

The Optimization of Fuzzy String Matching Using TF-IDF and Nearest Neighbors Algorithm

How to accelerate the computation time of fuzzy string matching from hours to seconds

5 min readFeb 13, 2021

When working with real data, the biggest problems are mostly in data pre-processing. It may vary, but matching can be one of the biggest challenges faced by a lot of analysts. For instance, when we are talking about George Washington and G Washington, of course, we are talking about one person, namely the first President of the United States. We are dealing with duplicate data. Luckily, researchers have developed the probabilistic data matching algorithm or well-known as fuzzy matching.

What is fuzzy string matching?

Probabilistic data matching often referred to as fuzzy string matching, is the algorithm to match a pattern between a string with a sequence of strings in the database and give a matching similarity — in percentage. It explicitly indicates that the output must be the probability (in the range 0 to 1 or the percentage of similarity) instead of an exact number.

There are many ways to perform fuzzy string matching, for instance, Levenshtein distance, but it has a problem with the algorithm performance. The reason is that each record is compared against all the other records in the data. This phenomenon is well-known as quadratic time complexity.

How do TF-IDF and nearest neighbors algorithm accelerate the computation time?

Term Frequency — Inverse Document Frequency (TF-IDF) is a Natural Language Processing (NLP) technique that tries to represent the text in numerical data and to measure how important a word is in a given document. For example, from a collection of documents, how important is the word “flower”?

Term Frequency — Inverse Document Frequency n-grams — Term Frequency — Inverse Document Frequency of n-grams (Image by Author)

The TF-IDF is implemented using n-grams of groups of letters caused by the possibility of misspellings or typos. For instance, the word independence is chunked into the following…

Hands-on Tutorial

The Optimization of Fuzzy String Matching Using TF-IDF and Nearest Neighbors Algorithm

How to accelerate the computation time of fuzzy string matching from hours to seconds

What is fuzzy string matching?

How do TF-IDF and nearest neighbors algorithm accelerate the computation time?

Written by Audhi Aprilliant

Responses (4)