Predicting purchasing behavior using Apache Spark


Suppose you are leading the Data Science activities at a large retail chain, and you are given the task of helping to grow the revenue of specific departments of your stores. One way to approach this problem is to create a set of predictive models that will help you target the customers who have the highest propensity to purchase in these departments, and set up personalized communications accordingly.

In this tutorial, you will learn how to develop Data Science workflows for this kind of activity using Apache Spark only, as retailers often need to manage very large datasets.


We'll assume that you have a working installation of Dataiku DSS version 2.1+, configured to work with Apache Spark version 1.4+. Note that for the purpose of this tutorial, a simple local, non-distributed Spark installation will suffice.

Supporting data

Throughout this tutorial, we'll be using a very nice dataset provided by Dunnhumby: The Complete Journey. From Dunnhumby's website: "this dataset contains household level transactions over two years from a group of 2,500 households who are frequent shoppers at a retailer. It contains all of each household’s purchases, not just those from a limited number of categories."

We do not distribute this dataset, so to run the tutorial, you must first register on the website and download the files yourself.

Creating the datasets in DSS

Log in to your DSS instance via your web browser, and create a new project called "Retail Prediction":

You first need to create the DSS datasets by pointing at the source files you just downloaded. Click the blue + IMPORT YOUR FIRST DATASET button, and in the connectors menu, select Server Filesystem (we assume that you downloaded the files to the machine hosting DSS).

In your New Filesystem dataset configuration screen, browse your filesystem to the directory where the files reside:

Note that we have defined here a convenient shortcut, called "datasets", to a specific path on the local filesystem.

You'll need to import 3 files, and hence create 3 DSS datasets:

  • transaction_data.csv: the item-level transactions; this is the largest file
  • hh_demographic.csv: demographic information about the households making the transactions
  • product.csv: the product look-up table, with a set of attributes describing each product
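
Before creating the datasets in DSS, it can help to confirm that the three files listed above are actually present on the machine hosting DSS. The following is only a minimal sketch in plain Python, not part of DSS itself: the directory path and the helper names (missing_files, header_columns) are hypothetical and chosen for illustration; only the three file names come from this tutorial.

```python
import csv
import os

# File names come from the tutorial; everything else here is an assumption.
EXPECTED_FILES = ("transaction_data.csv", "hh_demographic.csv", "product.csv")


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(data_dir, name))]


def header_columns(path):
    """Return the column names found on the first line of a CSV file."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle), [])


if __name__ == "__main__":
    # Hypothetical location: replace with the directory holding your download.
    data_dir = "/path/to/datasets/dunnhumby"
    absent = missing_files(data_dir)
    if absent:
        print("Missing files:", ", ".join(absent))
    else:
        for name in EXPECTED_FILES:
            print(name, "->", header_columns(os.path.join(data_dir, name)))
```

Printing the header row of each file gives a quick preview of the columns that DSS should detect when you import it.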

Proceed to import the files. For instance, importing the hh_demographic.csv file will lead to this screen:

Create your dataset. The content of the dataset will be displayed at the end of the process:

Repeat this operation for the 2 other datasets (transaction_data and product). You have now created the 3 datasets that will support our workflow, visible from the Dataset section of DSS:

Defining your objective and methodology

This step is critical in any Data Science project: it is one of the key factors of success.