Mining frequent itemsets in R

September 08, 2015

Looking for associations between items is a common way to mine transaction data. A famous example is "market basket analysis", which looks for products frequently bought together, for instance at a grocery store. In this post, we show how to mine frequent itemsets using R in DSS.

Assumptions

You'll need a proper installation of R on the server running DSS, as well as the arules R package.

Supporting data

We'll be using the excellent MovieLens dataset in this tutorial, available at this website, in its 1 million ratings version. This dataset consists of a series of ratings given by users to movies, and we are going to look for pairs of movies frequently reviewed (and hence presumably seen) by the same users. This may serve as a basis for a very simple recommendation engine for movies.

Download the MovieLens 1M file, uncompress the zip archive, and upload the 3 files to DSS:

  • ratings.dat contains the list of user / movie / rating
  • movies.dat has some metadata about the movies
  • users.dat has some metadata about the users

Import each file with a "One record per line" setting:

Now that the 3 input files are ready, they can be cleaned using a visual data preparation script, available under the "Analyze" section. The script is mostly the same for the three files: it splits each line on the delimiter ("::") and assigns proper names to the columns. Let's see an example with the Movie dataset:

Note that we use a regex-based processor to extract the year of the movie. We essentially look for 4 consecutive digits in the title and store them in a "year" variable. This has some limitations: because the leading ".*" is greedy, the last run of 4 digits in the title is the one captured, and a title that contains 4 digits but no release year would be mis-parsed.

^.*(?<year>\d{4}).*$
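As a rough illustration outside DSS, the same two steps (splitting on "::" and pulling out the year) could be sketched in base R. The sample lines only imitate the movies.dat layout, the column names are assumptions, and the pattern uses a plain capture group instead of the named group shown above.

```r
# Illustrative sample lines in a MovieID::Title::Genres layout (assumed)
lines <- c(
  "1::Toy Story (1995)::Animation|Comedy",
  "2::2001: A Space Odyssey (1968)::Drama|Sci-Fi"
)

# Split each line on the literal "::" delimiter and name the columns
parts <- do.call(rbind, strsplit(lines, "::", fixed = TRUE))
movies <- data.frame(
  MovieID = as.integer(parts[, 1]),
  Title   = parts[, 2],
  Genres  = parts[, 3],
  stringsAsFactors = FALSE
)

# The greedy leading ".*" makes the LAST run of 4 digits the captured one
pattern  <- "^.*(\\d{4}).*$"
has_year <- grepl(pattern, movies$Title, perl = TRUE)
year_chr <- ifelse(has_year, sub(pattern, "\\1", movies$Title, perl = TRUE), NA)
movies$year <- as.integer(year_chr)

print(movies[, c("Title", "year")])
```

The second title shows why the greedy match matters: it starts with "2001", yet the year kept is the one in parentheses at the end.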


Deploy the scripts and build the 3 datasets.

We are almost done. The last step is to use a Join processor to create the final dataset, merging the previous 3 together.

You'll simply need to follow the on-screen instructions to merge the 3 datasets, using inner joins:

This will create a completely denormalized dataset, ready for the association rules analysis:
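For readers who want to see what the two inner joins amount to, here is a minimal base R sketch on toy data frames. UserID and Movie_Title match the names used later in this post; MovieID, Rating, Gender and all the values are assumptions made for the example.

```r
# Toy stand-ins for the three cleaned datasets (values are made up)
ratings <- data.frame(
  UserID  = c(1, 1, 2),
  MovieID = c(10, 20, 10),
  Rating  = c(5, 3, 4)
)
movies <- data.frame(
  MovieID     = c(10, 20),
  Movie_Title = c("Movie A (1990)", "Movie B (1995)"),
  stringsAsFactors = FALSE
)
users <- data.frame(
  UserID = c(1, 2),
  Gender = c("F", "M"),
  stringsAsFactors = FALSE
)

# merge() performs an inner join by default (all = FALSE)
ratings_full <- merge(
  merge(ratings, movies, by = "MovieID"),
  users,
  by = "UserID"
)

print(ratings_full)
```

Each rating row ends up carrying the movie and user metadata next to it, which is what "denormalized" means here.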

Mining frequent associations with R

Mining association rules, which builds on mining frequent itemsets, is a set of techniques that can be used to look for movies frequently reviewed, or seen, together by users. The "arules" R package provides an implementation of the Apriori algorithm, which we will rely on here.

We only need two columns: a "grouping" key, which here is the UserID, and an "item" column, which here holds the movies seen:

From the Flow screen, create a new R recipe that takes the fully joined dataset as input ("ratings_full" here) and outputs a new one, called "associations" here. In the Code tab of the recipe, just enter the R code required to create the association rules:
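To make that shape concrete, the following base R sketch groups a toy version of those two columns into one "basket" of movies per user. The titles and IDs are made up for the example.

```r
# Toy long-format data: one row per (user, movie) pair
ratings_full <- data.frame(
  UserID      = c(1, 1, 2, 2, 3),
  Movie_Title = c("Movie A", "Movie B", "Movie A", "Movie C", "Movie B"),
  stringsAsFactors = FALSE
)

# One character vector of movie titles per user, named by UserID
baskets <- split(ratings_full$Movie_Title, ratings_full$UserID)

print(baskets)
```

This list of baskets is the kind of structure that gets coerced to the arules "transactions" class in the recipe.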

library(dataiku)
library(arules)

# Input datasets
transactions <- read.dataset("ratings_full")

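# Turn the long table into one basket of movie titles per user,
# then coerce the resulting list to the arules "transactions" class
transactions <- as(
  split(as.vector(transactions$Movie_Title), as.vector(transactions$UserID)),
  "transactions"
)

# Mine rules made of exactly 2 items (minlen = maxlen = 2) that meet
# the minimum support (2%) and confidence (80%) thresholds
rules <- apriori(
  transactions,
  parameter = list(supp = 0.02, conf = 0.8, target = "rules", minlen = 2, maxlen = 2)
)

# Strongest associations first
rules <- sort(rules, by = "lift")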

# Output datasets
write.dataset_with_schema(as(rules, "data.frame"), "associations")


This is pretty straightforward. The code does the following:

  • import the required packages, including the Dataiku API to read the input dataset
  • read the dataset
  • convert the dataset into the "transactions" format expected by the arules functions
  • apply the apriori algorithm using a few parameters:
    • minimum level of support and confidence (more on this later)
    • extract only the rules made of 2 elements
  • sort the results by descending lift
  • write the resulting dataframe into a DSS dataset
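The same steps can be tried outside DSS on a handful of toy baskets. This is only a sketch: it assumes the arules package is installed, the baskets and thresholds are made up so that the result is easy to check by hand, and building separate lhs/rhs columns with labels() is one possible way to lay out the output.

```r
library(arules)

# Toy baskets: one character vector of items per (hypothetical) user
baskets <- list(
  u1 = c("A", "B"),
  u2 = c("A", "B"),
  u3 = c("A", "B", "C"),
  u4 = c("C"),
  u5 = c("B", "C")
)
trans <- as(baskets, "transactions")

# Same parameter names as in the recipe, with thresholds suited to 5 baskets
rules <- apriori(
  trans,
  parameter = list(supp = 0.5, conf = 0.8, target = "rules", minlen = 2, maxlen = 2),
  control = list(verbose = FALSE)
)
rules <- sort(rules, by = "lift")

# One row per rule, with the quality measures next to each side of the rule
rules_df <- data.frame(
  lhs = labels(lhs(rules)),
  rhs = labels(rhs(rules)),
  quality(rules),
  stringsAsFactors = FALSE
)
print(rules_df)
```

Only {A} => {B} survives: A and B appear together in 3 of 5 baskets (support 0.6), and every basket containing A also contains B (confidence 1). The reverse rule {B} => {A} has a confidence of 0.75 and is dropped by the 0.8 threshold.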

The final workflow should look like this:

The association rules are stored in the "associations" dataset, which has the following content:

  • the "lhs" and "rhs" columns stand respectively for "left hand side" and "right hand side", the two components of a rule. If we take the first line as an example, it can be read as: "people who saw Nightmare on Elm Street 5 also saw Nightmare on Elm Street 4".
  • the "support" is the fraction of all users who saw both movies
  • the "confidence" tells us, taking the first record again, that 80% of the users who saw Nightmare on Elm Street 5 also saw Nightmare on Elm Street 4
  • finally, the "lift" is an interesting measure that lets us filter out trivial rules, which can appear only because both movies in the rule are popular. It is the ratio between the observed support of the pair and the support we would expect if the two movies were seen independently: support(lhs and rhs) / (support(lhs) × support(rhs)).

The rules with a lift higher than 1 are the ones of interest: the higher the value, the stronger the association. Going further, you may want to adjust the parameters of the algorithm, for instance lowering the minimum support or confidence to obtain more rules, or raising them to keep only the strongest ones.
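These three measures are easy to compute by hand, which helps when reading the output. The sketch below does so in base R for a rule A => B on a few made-up baskets; the support() helper is defined here for illustration and is not part of arules.

```r
# Toy baskets: one character vector of items per (hypothetical) user
baskets <- list(
  u1 = c("A", "B"),
  u2 = c("A", "B"),
  u3 = c("A", "B", "C"),
  u4 = c("C"),
  u5 = c("B", "C")
)

# Fraction of baskets that contain every item in `items`
support <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

supp_ab    <- support(c("A", "B"))                        # A and B together
confidence <- supp_ab / support("A")                      # share of A-baskets that also hold B
lift       <- supp_ab / (support("A") * support("B"))     # observed vs. expected if independent

print(c(support = supp_ab, confidence = confidence, lift = lift))
```

Here B appears in 4 of the 5 baskets, so a confidence of 1 is less impressive than it looks; the lift of 1.25 is what corrects for B's popularity.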

Using R and DSS, association rules are pretty simple to build. They make a powerful tool to explore frequent associations between items, and can be used in a wide array of applications. They could serve, for instance, as a basis for an item-to-item recommender.
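As a very rough sketch of that last idea, a rules table with lhs, rhs and lift columns can be queried directly: given a movie, return the right-hand sides of its rules, strongest lift first. The recommend() function, the table contents and the lift values below are all hypothetical.

```r
# Hypothetical rules table; in practice this would come from the "associations" dataset
rules_df <- data.frame(
  lhs  = c("Movie A", "Movie A", "Movie B"),
  rhs  = c("Movie B", "Movie C", "Movie C"),
  lift = c(3.2, 1.4, 2.1),
  stringsAsFactors = FALSE
)

# Return up to n movies associated with `movie`, ordered by descending lift
recommend <- function(movie, rules, n = 5) {
  hits <- rules[rules$lhs == movie & rules$lift > 1, ]
  head(hits$rhs[order(hits$lift, decreasing = TRUE)], n)
}

print(recommend("Movie A", rules_df))
```

A movie with no matching rule simply yields an empty result, so a real recommender would need a fallback such as overall popularity.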