Working on text-based datasets is a different world than dealing with numbers. Comparisons between words are much harder and it may be difficult to group or aggregate similar values. Find out more how to deal with that.
A textual input in your database is probably composed of a sequence of words. If you want to compare two inputs to check how close they are, it is difficult.
The new Yuzu version of Data Science Studio integrates tools to help you with natural language processing. Today, I present you two tips I often use.
To illustrate this article, I downloaded a publicly available dataset of web queries collected on AOL Search in 2006 (as of 2016, this dataset can't be downloaded anymore from infochimps). From the 20M search queries, I extracted 300,000 queries to create the dataset we are going to use.
Each row is a search query issued by an user. The columns are :
This is a really interesting dataset because the queries come from a real case and with misspelling.
Learn more about the dataset in the README file provided.
Let us have a look to the data. We have around 300,000 rows. We see that a text transformation has already been applied to the dataset: all queries are in lower case.
The top queries are:
A first transformation we are going to do is to simplify the text of the queries. It will help us to compare and regroup similar queries.
Clicking on the "add script" in the interface, we choose the "simplify text" processor. Then we change the settings to transform the 'Query' column into a new column with the following actions:
Top queries are now:
The next step is to group or to cluster the similar queries. The idea behind the transformation is to reduce variance among the user queries. Simplifying the query allows to have more relevant statistics by merging queries that should be considered as the same.
With Data Science Studio, this transformation is very easy. We open the Category clustering window, choose few settings and wait for suggested merge. Then, we choose the clusters to merge.
In our case, it calculated that we have 198,760 distinct values (out of ~300,000 queries). Among the distinct values, it found 1,320 clusters. Here is an overview:
By merging clusters with a set of settings, we can easily reduce the number of distinct values by few thousands.
Top queries after a two-step clustering:
That is it ! This is a good way to work with text-based datasets. We have other great tools in the studio such as the fuzzy join for joining datasets on an approximate string matching.
Feel free to leave a comment in you want to know more, or request for a demo instance!
Please fill out the form below to receive the success story by email:
How can we come back to you ?