Looking for associations between items is a common way to mine transaction data. A famous example is “market basket analysis”, which looks for products frequently bought together, for instance at a grocery store. In this post, we show how to mine frequent itemsets using R in DSS.
You’ll need a proper installation of R on the server running DSS, as well as the arules R package.
We’ll be using the excellent MovieLens dataset in this tutorial, available at this website, in its 1 million ratings version. This dataset consists of a series of ratings given by users to movies, and we are going to look for pairs of movies frequently reviewed, and hence seen, by the same users. This may serve as a basis for a very simple movie recommendation engine.
Download the MovieLens 1M file, uncompress the zip archive, and upload its 3 data files (movies.dat, ratings.dat and users.dat) to DSS.
Import each file with a “One record per line” setting:
Now that the 3 input files are ready, they can be cleaned using a visual data preparation script available under the “Analyze” section. The script is mostly the same for the three files: it splits each line on the “::” delimiter and assigns proper names to the columns. Let’s see an example with the Movie dataset:
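Outside DSS, the same split-and-rename step can be sketched in a few lines of plain Python. This is only an illustration of the logic, not what the visual script runs internally: the helper name parse_record is hypothetical, and the column names are assumed from the MovieLens 1M README.

```python
# Column names assumed from the MovieLens 1M README (movies.dat).
MOVIE_COLUMNS = ["MovieID", "Title", "Genres"]

def parse_record(line, columns, delimiter="::"):
    """Split one raw line on the delimiter and label each field."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
    return dict(zip(columns, fields))

# A sample line in the movies.dat layout.
sample = "1::Toy Story (1995)::Animation|Children's|Comedy"
movie = parse_record(sample, MOVIE_COLUMNS)
print(movie["Title"])
```

The ratings and users files would use the same helper with their own column lists.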
Note that we use a regex-based processor to extract the year of the movie. We essentially look for 4 consecutive digits in the title (which has some limitations, for instance when the title itself contains a 4-digit number), and store the result in a “year” variable:
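To make the limitation concrete, here is a small Python sketch of the "4 consecutive digits" rule next to a stricter variant. Both function names are hypothetical, and the stricter pattern (a year in parentheses at the end of the title) is an assumption about how titles are formatted, not the processor configuration used in this tutorial.

```python
import re

def first_four_digits(title):
    """Naive rule from the text: the first run of 4 consecutive digits."""
    match = re.search(r"\d{4}", title)
    return int(match.group()) if match else None

def trailing_year(title):
    """Stricter variant (an assumption): a 4-digit year in parentheses at the end."""
    match = re.search(r"\((\d{4})\)\s*$", title)
    return int(match.group(1)) if match else None

print(first_four_digits("Toy Story (1995)"))              # 1995
print(first_four_digits("2001: A Space Odyssey (1968)"))  # 2001, not the release year
print(trailing_year("2001: A Space Odyssey (1968)"))      # 1968
```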
Deploy the scripts and build the 3 datasets.
We are almost done. The last step is to use a Join processor to create the final dataset, merging the previous 3 together.
Simply follow the on-screen instructions to merge the 3 datasets, using inner joins:
This will create a completely denormalized dataset, ready for the association rules analysis:
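For readers who want to see what two chained inner joins amount to, here is a minimal plain-Python sketch. The inner_join helper and the tiny rows are made up for illustration. In DSS the Join processor does this work, and nothing here reflects its internals.

```python
def inner_join(left, right, key):
    """Inner join two lists of dicts on `key`; unmatched rows are dropped."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined

# Tiny made-up rows, for illustration only.
ratings = [
    {"UserID": "1", "MovieID": "10", "Rating": "5"},
    {"UserID": "2", "MovieID": "10", "Rating": "3"},
    {"UserID": "2", "MovieID": "99", "Rating": "4"},  # no matching movie
]
movies = [{"MovieID": "10", "Title": "Movie A", "year": "1995"}]
users = [{"UserID": "1", "Gender": "F"}, {"UserID": "2", "Gender": "M"}]

# Ratings joined to movies, then to users: one wide row per rating.
ratings_full = inner_join(inner_join(ratings, movies, "MovieID"), users, "UserID")
print(len(ratings_full))  # 2
```

Because the joins are inner joins, the rating pointing at the unknown MovieID 99 disappears from the result.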
Creating association rules, which builds on mining frequent itemsets, covers a set of techniques that can be used to look for movies frequently reviewed, or seen, together by users. The “arules” R package provides an implementation of the apriori algorithm, which we will rely on here.
We only need some pretty simple data: a “grouping” key, here the UserID, and an “item” column, here the movies seen:
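Conceptually, these two columns are turned into one "basket" of movies per user before any mining happens. A minimal Python sketch of that grouping step follows. The function name, the choice of "Title" as the item column and the rows are assumptions for illustration.

```python
from collections import defaultdict

def to_transactions(rows, key="UserID", item="Title"):
    """Group items by key: one set of movies per user."""
    baskets = defaultdict(set)
    for row in rows:
        baskets[row[key]].add(row[item])
    return dict(baskets)

# Made-up rows in the (grouping key, item) shape described above.
rows = [
    {"UserID": "1", "Title": "Movie A"},
    {"UserID": "1", "Title": "Movie B"},
    {"UserID": "2", "Title": "Movie A"},
    {"UserID": "1", "Title": "Movie A"},  # duplicate, collapsed by the set
]
transactions = to_transactions(rows)
print(sorted(transactions["1"]))  # ['Movie A', 'Movie B']
```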
From the Flow screen, create a new R recipe that takes the fully joined dataset as input (“ratings_full” here) and outputs a new one, called “associations” here. In the Code tab of the recipe, just enter the R code required to create the association rules:
This is pretty straightforward. The code does the following:
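As a rough illustration of the kind of computation involved, here is a plain-Python sketch that counts pairs of items across baskets and keeps the rules passing a minimum support and a minimum confidence. It is a brute-force count restricted to pairs, not the apriori implementation from arules (which prunes candidates and handles larger itemsets), and it is not the R code of the recipe. The function name, thresholds and baskets are all assumptions.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Brute-force rules between pairs of items, filtered by support and confidence."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append({"lhs": lhs, "rhs": rhs,
                              "support": support, "confidence": confidence})
    return rules

# Made-up baskets: each set is the movies seen by one user.
baskets = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B"}]
rules = pair_rules(baskets)
for rule in rules:
    print(rule["lhs"], "=>", rule["rhs"], round(rule["confidence"], 2))
```

With these baskets, the pair (B, C) appears in only one basket out of four and is dropped by the support threshold.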
The final workflow should look like this:
The association rules are stored in the “associations” dataset, which has the following content:
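The rules with a lift higher than 1 are the ones of interest: a lift above 1 means the two movies are seen together more often than would be expected if they were independent. The higher the value, the stronger the association. Going further, you may want to adjust the parameters of the algorithm, being more or less restrictive on the different settings.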
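For reference, lift is the support of the pair divided by the product of the supports of its two sides. The short Python sketch below computes it from raw counts. The function name and the counts are made up for illustration, and the formula is written in an algebraically equivalent form that avoids intermediate rounding.

```python
def lift(n_transactions, count_lhs, count_rhs, count_both):
    """Lift = support(both) / (support(lhs) * support(rhs)).

    With supports expressed as count / n, this simplifies to
    count_both * n / (count_lhs * count_rhs).
    """
    return count_both * n_transactions / (count_lhs * count_rhs)

# Made-up counts: 1000 users, 200 saw movie X, 100 saw movie Y.
print(lift(1000, 200, 100, 50))  # 2.5: seen together more often than by chance
print(lift(1000, 200, 100, 20))  # 1.0: exactly what independence would predict
```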
Using R and DSS, association rules are pretty simple to build. They make a powerful tool to explore frequent associations between items, and can be used in a wide array of applications. They could serve, for instance, as a basis for an item-to-item recommender.