
Recommendation system

Build a recommendation system using collaborative filtering and machine learning.

Plugin Information

Version 0.0.4
Author Dataiku
Released 2021-12
Last updated 2023-04
License Apache-2.0
Source code Github
Reporting issues Github


This plugin provides a set of tools to create a recommendation system workflow and predict future user-item interactions. It is composed of:

  • A set of recipes to compute collaborative filtering and generate negative samples:
    • Auto collaborative filtering: Compute collaborative filtering scores from a dataset of user-item samples
    • Custom collaborative filtering: Compute affinity scores from a dataset of user-item samples and a dataset of similarity scores (i.e. similarity between pairs of users or items)
    • Sampling: Generate negative samples from user-item implicit feedback (which necessarily includes only positive samples)
  • A pre-packaged recommendation system workflow in a Dataiku Application, so you can create your first recommendation system in a few clicks.

How to set up

Right after installing the plugin, you will need to build its code environment. If this is the first time you install this plugin, click on Build new environment.

Note that Python version 3.6 or 3.7 is required.

Connections

The plugin recipes run on SQL databases. Both the input and output datasets of a recipe must be in the same SQL connection.

Supported SQL connections: PostgreSQL, Snowflake, Google BigQuery, Microsoft SQL Server, Azure Synapse.

How to use

The plugin provides 3 recipes that can be used together to build a complete recommendation flow in DSS.
You can also generate a first recommendation system in a few clicks thanks to the Dataiku Application.

Auto collaborative filtering

Use this recipe to compute collaborative filtering scores from a dataset of user-item samples. Optionally, you can provide explicit feedback (a rating associated with each interaction between a user and an item), which will be taken into account to compute the affinity scores.

Summary

In this recipe, the user-item samples are first filtered based on the pre-processing parameters (users or items with too few interactions, as well as old interactions, are removed).

Then, depending on whether you chose user-based or item-based collaborative filtering, similarity scores between users (or items) are computed.

  • For user-based, the similarity between user 1 and user 2 is based on the number of items that both user 1 and user 2 have interacted with
User based similarity
  • For item-based, the similarity between item 1 and item 2 is based on the number of users that have interacted with both item 1 and item 2
Item based similarity

Finally, using the similarity matrix generated before, we compute the affinity score between a user and an item using the user’s top N most similar users that have interacted with the item.

Notes:

  • In case of explicit feedbacks (if a rating column is provided):
    • The similarity between users is computed using the Pearson correlation.
    • The affinity score between a user u and an item i is computed by taking the weighted average of the ratings given to item i by the top N users most similar to user u.
  • All of the above describes user-based collaborative filtering; the item-based approach is symmetrical.
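
To make the computation more concrete, here is a minimal pandas/NumPy sketch of user-based collaborative filtering on implicit feedback. The column names, toy data, cosine (L2) normalisation and sum-of-similarities aggregation are illustrative assumptions; the plugin performs the equivalent computation in-database (SQL), and its exact aggregation may differ.

```python
import numpy as np
import pandas as pd

# Toy implicit-feedback samples: one row per observed user-item interaction.
samples = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3", "u3", "u3"],
    "item_id": ["a",  "b",  "a",  "c",  "b",  "c",  "d"],
})

# Binary user x item matrix (1 = interaction observed).
ui = pd.crosstab(samples["user_id"], samples["item_id"]).clip(upper=1)

# User-based similarity: number of shared items, L2-normalised (cosine).
norm = np.sqrt((ui ** 2).sum(axis=1))
sim = ui.dot(ui.T).div(norm, axis=0).div(norm, axis=1)
sim = sim.mask(np.eye(len(sim), dtype=bool), 0.0)  # a user is not their own neighbour

# Affinity of user u for item i: aggregate the similarities of u's top N
# most similar users that have interacted with i.
TOP_N = 2
def affinity(user, item):
    neighbours = sim.loc[user].nlargest(TOP_N)
    return float((neighbours * ui.loc[neighbours.index, item]).sum())

print(affinity("u1", "c"))  # score for a pair not present in the input samples
```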

Input

  • Samples dataset with user-item pairs (one column for the items, another column for the users) and optionally a timestamp column and a numerical explicit feedback column.

Output

  • Scored samples dataset of new user-item pairs (not in the input dataset) with affinity scores.
  • (Optional) Similarity scores dataset of either users or items similarity scores used to compute the affinity scores.

Settings

Auto collaborative filtering recipe

Input parameters

  • Users column: Column with users id
  • Items column: Column with items id
  • (Optional) Ratings column: Column with numerical explicit feedback (such as ratings)
    • If not specified, the recipe will use implicit feedback.
    • With explicit feedback, the Pearson correlation is used to compute the similarity between either users or items.

Pre-processing parameters

  • Minimum visits per user: Users that have interacted with fewer items are removed
  • Minimum visits per item: Items that fewer users have interacted with are removed
  • Normalisation method: Choose between
    • L1 normalisation: To normalise user-item visits using the L1 norm 
    • L2 normalisation: To normalise user-item visits using the L2 norm 
  • Use timestamp filtering: Whether to filter interactions based on a timestamp column
  • Timestamp column: Column used to order interactions (can be dates or numerical values; higher values mean more recent)
  • Nb. of items to keep per user: Only the N most recent items seen per user are kept based on the timestamp column

Collaborative filtering parameters

  • Collaborative filtering method: Choose between
    • User-based: Compute user-based collaborative filtering scores 
    • Item-based: Compute item-based collaborative filtering scores 
  • Nb. of most similar users/items: Compute user-item affinity scores using the N most similar users (in case of user-based) or items (in case of item-based)

Performance

During the score computation, the longest task is to compute the similarity matrix between users for user-based collaborative filtering (resp. between items for item-based).

To do so, the recipe computes a table of size:

number of users X average number of visits per user X average number of visits per item

(resp. number of items X average number of visits per user X average number of visits per item)

Reducing these metrics will decrease the memory usage and running time.
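
For example, with purely illustrative numbers, 100,000 users with an average of 20 visits per user and 50 visits per item would lead to an intermediate table of roughly 100,000 × 20 × 50 = 100 million rows; keeping only the 10 most recent items per user (timestamp filtering) would cut that size at least in half.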


Custom collaborative filtering

Use this recipe to compute collaborative filtering scores from a dataset of user-item samples and your own similarity scores between users or items.

Summary

This recipe uses the same formula as the Auto collaborative filtering recipe except that it doesn’t compute the similarity scores between users (or items). Instead, you provide these similarity scores yourself. You may have obtained them using some content-based approach.
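
For example, an item-item similarity dataset can come from a content-based approach on item metadata. The sketch below is a minimal illustration (hypothetical item descriptions, TF-IDF plus cosine similarity with scikit-learn) that produces the three-column layout expected in the Similarity input section; any method that outputs pairwise similarity scores works equally well.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item metadata; the recipe only needs the resulting similarity scores.
items = pd.DataFrame({
    "item_id": ["a", "b", "c"],
    "description": ["red cotton shirt", "blue cotton shirt", "leather wallet"],
})

# Content-based similarity: cosine similarity between TF-IDF descriptions.
tfidf = TfidfVectorizer().fit_transform(items["description"])
sim = cosine_similarity(tfidf)

# Flatten to the (item 1, item 2, similarity score) layout used as recipe input.
pairs = (
    pd.DataFrame(sim, index=items["item_id"].to_numpy(), columns=items["item_id"].to_numpy())
    .stack()
    .rename_axis(["item_id_1", "item_id_2"])
    .reset_index(name="similarity")
)
pairs = pairs[pairs["item_id_1"] != pairs["item_id_2"]]
print(pairs)
```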

Input

  • Samples dataset with user-item pairs and optionally a timestamp column and a numerical explicit feedback column.
  • Similarity scores dataset of user-user or item-item similarity scores (two columns for the items/users and one column containing scores).

Output

  • Scored samples dataset of new user-item pairs with collaborative filtering scores.

Settings

Custom collaborative filtering recipe

All parameters are the same as in the Auto collaborative filtering recipe except for the Similarity input section.

Similarity input

  • Similarity scores type: Choose between
    • User similarity: If the input Similarity scores dataset contains user similarity scores
    • Item similarity: If the input Similarity scores dataset contains item similarity scores
  • Users/items column 1: First column with users/items id 
  • Users/items column 2: Second column with users/items id
  • Similarity column: Column with the similarity scores between the users or items

Sampling

Use this recipe to create positive and negative samples from positive user-item samples and scored user-item samples.

Negative sampling example

Summary

In the case of implicit feedback, you only have positive samples (positive interactions between users and items).

In order to build a model that uses the affinity scores (computed using the previous recipes) as features and predicts whether a user is likely to interact with an item, you need negative samples (user-item pairs with no interaction).

This recipe generates negative samples that have affinity scores but are not positive samples.
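
The logic can be sketched in a few lines of pandas (illustrative column names and toy data; the actual recipe runs in SQL and can also use the optional historical samples to exclude pairs that interacted in the past):

```python
import pandas as pd

scored = pd.DataFrame({      # user-item pairs with affinity scores
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "item_id": ["a",  "b",  "c",  "a",  "c"],
    "score":   [0.9,  0.2,  0.4,  0.3,  0.7],
})
positives = pd.DataFrame({   # observed (implicit, hence positive) interactions
    "user_id": ["u1", "u2"],
    "item_id": ["a",  "c"],
})

keys = ["user_id", "item_id"]
labelled = scored.merge(positives.assign(target=1), on=keys, how="left")
labelled["target"] = labelled["target"].fillna(0).astype(int)

# Keep, for example, at most 2 negatives per user (the recipe exposes this
# as a percentage of negative samples per user).
negatives = (labelled[labelled["target"] == 0]
             .groupby("user_id", group_keys=False)
             .apply(lambda g: g.sample(min(len(g), 2), random_state=0)))
samples_with_target = pd.concat([labelled[labelled["target"] == 1], negatives])
print(samples_with_target)
```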

Input

  • Scored samples dataset of user-item samples with one or more affinity scores.
  • Training samples dataset of user-item positive samples.
  • (Optional) Historical samples dataset of historical user-item samples used to compute the affinity scores.

Output

  • Positive and negative scored samples dataset of user-item positive and negative samples with affinity scores.

Settings

Sampling recipe

Scored samples input

  • Users column: Column with users id
  • Items column: Column with items id
  • Columns with affinity scores: Columns with scores obtained from the collaborative filtering recipes

Training samples input

  • Users column: Column with users id
  • Items column: Column with items id

Historical samples input

  • Use historical samples: Whether to exclude the historical samples from the generated negative samples
  • Users column: Column with users id
  • Items column: Column with items id

Parameters

  • Sampling method: Choose between
    • No sampling: Generate all possible negative samples (all user-item pairs not in the training samples dataset)
    • Negative samples percentage: Generate negative samples to obtain a fixed ratio of positive/negative samples per user
  • Negative samples percentage: Percentage of negative samples to generate per user

Pre-packaged Recommendation System workflow

Summary

Alongside the recipes, a Dataiku application is provided in the plugin. This application can be used to create a first basic recommendation workflow in SQL using the plugin recipes. Once the flow is instantiated through the application, it becomes easier to customize it by adding more features, algorithms and affinity scores.

The complete flow can be integrated into a production project that evaluates the recommendations.

It accepts as input a dataset of dated interactions between users and items (with users, items and timestamp columns).

The system will base its recommendations on implicit feedback (no ratings are used).

Once the recommendation model is trained, an additional dataset of users (with a users column) can be provided and a dataset of the top items to recommend to each user will be built.

The input datasets must be stored in a PostgreSQL or Snowflake connection and all computation will be done in the selected connection.

To get a better understanding of the workflow generated by the Dataiku Application, you can look at the project wiki. You can also find more details here.

Settings

Dataiku application settings

Input table

  • Connection settings: Select settings of the SQL connection to be used in the flow
    • Connection name: Name of the SQL connection to use (SQL connections must be set by an admin user of DSS)
    • Connection type: Type of the SQL connection (the application supports PostgreSQL and Snowflake)
    • (Optional) Connection schema: Schema of the selected SQL database to use (can be left empty)
  • Apply connection settings: Run this scenario to change the connection of all datasets in the flow to the selected connection
  • Fetch input samples table: This link redirects to the settings of the input dataset of the flow (see next image). There you can fetch your input table by first clicking on the Get tables list button. Once the SQL table is selected, you can test it (with the Test table button) and save it (the blue Save button) before going back to the application parameters page.
Fetch input samples table

Recommendation parameters

  • Columns: Select required columns from the fetched input table
    • Users column: Column with users id
    • Items column: Column with items id
    • Timestamp column: Column with timestamps (or dates)
  • Build complete flow: Run this scenario to build all datasets of the complete recommendation flow

Recommend to new users

  • Fetch users list table: As for the input samples table, use this link to fetch a SQL table containing a column of users id for whom to make recommendations
  • Users columns:
    • Users column: Column with users id
    • Nb. items to recommend per user: How many items to recommend to each of the input users. Some users may not have any item recommendations.
  • Recommend to users: Run this scenario to compute a new dataset containing the selected users top recommendations
  • Users with top recommendation: This link redirects to a dashboard showing the dataset containing the top item recommendations per user

Example of a recommendation flow

In this section, we explain how the 3 recipes can be used to build a complete recommendation flow. The same flow is used in the Dataiku application.

Example of a recommendation flow

Intro

Before going into details, let’s first get an overview of what a recommendation system could look like in DSS. The goal of the workflow is to predict whether a user is likely to interact with an item, based on a set of historical interactions. The predictions can be computed using the Dataiku Machine Learning capabilities, but you still need to build predictive features. For that, we will use the collaborative filtering recipes to compute a set of affinity scores. These different affinity scores are then joined into a single dataset and serve as the input features of the machine learning model.

Time-based split

First we split the all_samples dataset of user-item interactions based on the timestamp to get 2 datasets of old and recent interactions:

  • samples_for_cf_scores: old interactions used to compute scores between users and items
  • samples_to_train_ml: recent interactions used to get positive samples to train a ML model on the affinity scores

It’s important to train the ML model with more recent interactions than the ones used to compute the affinity scores to prevent data leakage. In production, all interactions are used to compute affinity scores and new samples are scored by the model.
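
A minimal pandas sketch of such a split (the column names and the cutoff date are illustrative; in the flow this is typically done with a split recipe on the timestamp column):

```python
import pandas as pd

all_samples = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2"],
    "item_id": ["a",  "b",  "a",  "c"],
    "timestamp": pd.to_datetime(
        ["2021-01-05", "2021-03-10", "2021-02-01", "2021-03-20"]),
})

cutoff = pd.Timestamp("2021-03-01")  # illustrative split date
samples_for_cf_scores = all_samples[all_samples["timestamp"] < cutoff]   # old interactions
samples_to_train_ml = all_samples[all_samples["timestamp"] >= cutoff]    # recent interactions
```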

Input dataset

Collaborative filtering scores

Then we compute multiple affinity scores using the samples_for_cf_scores dataset of interactions and the collaborative filtering recipes.

We can also provide our own users (or items) similarity datasets as input of the Custom collaborative filtering recipe (here the custom_users_similarity dataset).

The multiple scores are joined together into the all_scores_samples dataset (first a stack recipe with distinct rows to retrieve all user-item pairs, then a full join recipe to gather the multiple scores).
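
In pandas terms, the stack-then-join step amounts to collecting the distinct user-item pairs and then joining each score column onto them; the toy datasets and column names below are assumptions:

```python
import pandas as pd

keys = ["user_id", "item_id"]
# Toy outputs of two collaborative filtering recipes (key columns + one score each).
scores_user_cf = pd.DataFrame({"user_id": ["u1", "u2"], "item_id": ["b", "a"],
                               "user_cf_score": [0.8, 0.3]})
scores_item_cf = pd.DataFrame({"user_id": ["u1", "u3"], "item_id": ["b", "c"],
                               "item_cf_score": [0.6, 0.9]})

# Stack with distinct rows: all user-item pairs that received at least one score.
all_pairs = pd.concat([scores_user_cf[keys], scores_item_cf[keys]]).drop_duplicates()

# Join each score column back onto the distinct pairs.
all_scores_samples = all_pairs
for df in (scores_user_cf, scores_item_cf):
    all_scores_samples = all_scores_samples.merge(df, on=keys, how="left")
print(all_scores_samples)
```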

all_scores_samples dataset

Sampling

We have computed affinity scores for user-item pairs. Some of these pairs are interactions present in samples_to_train_ml; they are positive samples. The others are pairs with no interaction (present in neither samples_to_train_ml nor samples_for_cf_scores); they are negative samples.

The Sampling recipe takes as inputs the samples_for_cf_scores, samples_to_train_ml and all_scores_samples datasets and outputs the scored pairs with a target column indicating whether they are positive or negative samples (the ratio of positive to negative samples per user can be fixed with a recipe parameter).

Positive and negative samples dataset

The samples_with_target output dataset can finally be used to train a Machine Learning model to predict the target column using the score columns as features.
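
As an illustration of this last step, outside of the Dataiku visual ML, training and applying such a model could look like the scikit-learn sketch below (toy data; the score and target column names are assumptions carried over from the flow description):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy samples_with_target: affinity scores as features, target = 1 for
# positive (observed) pairs, 0 for sampled negative pairs.
samples_with_target = pd.DataFrame({
    "user_cf_score": [0.8, 0.1, 0.6, 0.2, 0.7, 0.05],
    "item_cf_score": [0.7, 0.2, 0.5, 0.1, 0.9, 0.10],
    "target":        [1,   0,   1,   0,   1,   0],
})

features = ["user_cf_score", "item_cf_score"]
model = LogisticRegression().fit(samples_with_target[features],
                                 samples_with_target["target"])

# In production, the same model scores the duplicated flow's output
# (all_scores_samples_duplicate) to rank items per user.
proba = model.predict_proba(samples_with_target[features])[:, 1]
print(proba)
```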

Duplicated flow for scoring

Once the ML model is trained, it can be used in production to predict samples whose affinity scores are obtained from all past interactions (not just the interactions before the time-based split used for training).

To compute the affinity scores used to train a model, only a subset of the interactions was used. Some interactions were left aside to have positive samples in the training.

In production, all past interactions are used to compute the samples affinity scores. The trained model then predicts these scored samples and the predictions are used to make recommendations.

To compute affinity scores using all past interactions, we need to duplicate the collaborative filtering recipes (with the same parameters), make them use the all_samples dataset as input and again join all the computed scores to get the all_scores_samples_duplicate dataset.

Finally, we can predict all samples that have affinity scores by scoring the all_scores_samples_duplicate dataset with the trained model.

Duplicated flow for scoring

Dataiku application

To help users build a recommendation system faster, the complete flow explained above is packaged into a Dataiku application.
