When working with text values provided by real users, you must deal with various approximations or typing errors. Let's have a look at how Data Science Studio makes dealing with a list of misspelled movie titles incredibly easy.
In a previous article, I illustrated how to use text normalization and text clustering within Data Science Studio to work on text based datasets. Let's now see another use case.
We have two different lists:
This is a typical use case: we want to match every searched title from the queries list with the actual movie from the IMDB database. This will allow us to add some additional data in our queries dataset (like for example the genre and average rating). This could allow us to better understand what our users search for.
We call this operation a join. When the values to compare are not exactly identical, it can become quite difficult to perform.
Thus, we perform what we call a 'fuzzy join'. It is a special join that can match even if the two strings being compared are not exactly equal, but close. For example, we would like to match the query 'inglourios basterds' with the real movie title 'Inglourious Basterds'.
Here is an overview of the two lists :
The transformation is easy to perform in Data Science Studio. On the list of misspelled titles, let us apply the processor 'Fuzzy join' and just fill a few parameters:
The key parameter is the maximum distance for the match. The algorithm uses the Damerau–Levenshtein distance. Increasing this distance will lead to more matches but also probably imply more mistakes in matching.
Let us have a look at how this paramater changes the match. With a distance at 1, we match 14 movies out of our list of 26 misspelled movie titles.B asically, it finds titles that are one operation away from the real title where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters.
With a distance at 2, we get 21 matches. At 3, we get 24. And all titles match for a distance of 4.
Note that few matches are not the ones probably expected.
'toy stories' matches with 'Dog Stories' but 'toy stories 2' goes with 'Toy Story 2'.
You may consult the full final list here.
Fuzzy join, directly integrated in Data Science Studio's visual interactive data preparation is really convenient whenever you need to match together text values that were entered manually. I would be happy to discuss with you in the comments below if you want more information.
Please fill out the form below to receive the success story by email:
How can we come back to you ?