
Similarity Search

Find similar items in your data using Nearest Neighbor Search indices

Plugin information

Version 0.3.0
Author Dataiku (Liev GARCIA, Thibault DESFONTAINES, Alex COMBESSIE)
Released 2020-11
Last updated 2023-08
License Apache Software License
Source code Github
Reporting issues Github

With this plugin, you will be able to:

  • Build the index required to search for nearest neighbors
  • Find the nearest neighbors of each row of a dataset using a pre-computed index

Note that this plugin requires at least DSS version 8.0.2.

How to set up

Right after installing the plugin, you will need to build its code environment. Note that this plugin requires Python version 3.6 and that conda is not supported.

How to use

To use this plugin, you need to have a dataset with:

  • One column to identify your items
  • Numeric columns representing these items, which can be any combination of
    • Simple numeric columns with integer or decimal values
    • Vector columns with embeddings* computed from Deep Learning models
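As a sketch of what a vector column may look like, assuming your embeddings are serialized as JSON-like strings (a common convention; adapt the parsing if your data is stored differently), each cell can be turned back into a list of floats like this:

```python
import json

# A toy row as it might appear in the input dataset: an ID column plus a
# vector column storing an embedding as a JSON-like string (illustrative
# column names and values, not taken from a real dataset).
row = {"image_id": "img_001", "embedding": "[0.12, -0.53, 0.98, 0.07]"}

def parse_vector(cell):
    """Turn a serialized vector cell into a list of floats."""
    return [float(x) for x in json.loads(cell)]

features = parse_vector(row["embedding"])
print(features)  # [0.12, -0.53, 0.98, 0.07]
```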

In this documentation, we will work on image data as an example, using the Deep Learning for Images plugin to compute embeddings. If you have text data, you can also leverage the Sentence Embedding plugin. Alternatively, you can use your numeric columns directly or compute embeddings with your own code.

Example dataset with image paths and embedding

Once your dataset is ready, navigate to the Flow and select the Similarity Search plugin from the +RECIPE dropdown menu under the Recommender System category. If your dataset is selected in the Flow, you can directly find the plugin on the right panel.

Accessing the Similarity Search plugin from the Flow

This plugin contains two recipes, Build Nearest Neighbor Search Index and Find Nearest Neighbors.

Plugin components

1. Build Nearest Neighbor Search Index

Build the index required to search for nearest neighbors

Input

  • Dataset containing numeric or vector data (e.g. embeddings)

Output

  • Folder where the index will be saved

Settings

Build Nearest Neighbor Search Index Recipe Settings

Input parameters

  • Unique ID column which uniquely identifies each row
  • Feature column(s) with numeric or vector data (e.g. embeddings)

Note

To avoid memory issues, the feature column(s) must not contain vectors longer than 65,536 = 2^16.

Modeling parameters

  • Algorithm: Choose Annoy (Spotify) or Faiss (Facebook)
  • Expert mode: If activated, display Advanced parameters depending on the chosen algorithm
    • Annoy: Distance metric and Number of trees according to this documentation
    • Faiss: Index type and Number of LSH bits (if Index type is Locality-Sensitive Hashing) according to this documentation

2. Find Nearest Neighbors

Find the nearest neighbors of each row of a dataset using a pre-computed index

Input

  • Dataset containing numeric or vector data (e.g. embeddings) – may differ from the one used to build the index
  • Folder containing a pre-computed index

Output

  • Dataset with identified nearest neighbors for each row

Output dataset containing the distance between the nearest neighbors

Settings

Find Nearest Neighbors Recipe Settings

Input parameters

  • Unique ID column which uniquely identifies each row
  • Feature column(s) with numeric or vector data (e.g. embeddings) in the same order as the index
    • You can check the order of columns used in the index in the output folder of the previous recipe, inside the config.json file

Note

To avoid memory issues, the feature column(s) must not contain vectors longer than 65,536 = 2^16.

Lookup parameters

  • Number of Neighbors: Choose how many nearest neighbors to retrieve from the pre-computed index
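Conceptually, for each row of the input dataset the recipe looks up the k closest items in the pre-computed index and outputs their IDs and distances. The real recipe delegates this lookup to Annoy or Faiss, but a brute-force sketch in plain Python (with illustrative names and toy 2-d vectors) shows the shape of the result:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_nearest_neighbors(queries, indexed, k):
    """For each query ID, return the k closest indexed IDs with distances."""
    results = {}
    for qid, qvec in queries.items():
        scored = sorted(
            (euclidean(qvec, ivec), iid) for iid, ivec in indexed.items()
        )
        results[qid] = [(iid, round(d, 4)) for d, iid in scored[:k]]
    return results

indexed = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [0.1, 0.0]}
queries = {"q1": [0.08, 0.0]}
print(find_nearest_neighbors(queries, indexed, k=2))
# {'q1': [('c', 0.02), ('a', 0.08)]}
```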

To conclude

With these two recipes, you can build simple yet powerful recommender systems to answer real-life use cases. If you run a support team, you can help your agents find similar tickets to the ones they are working on. If you run an e-commerce website, you can help your users find similar products to the one they are looking for.

Final flow

Happy similarity search!

For the curious ones

* An embedding is a representation of complex objects like images, videos, texts, or sounds as points in a multi-dimensional vector space. Neural networks used for text classification or image recognition, for example, learn embeddings in their hidden layers in order to produce a prediction. Geometrically, items that are similar with respect to a prediction task will be close to one another in the embedding space.
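For intuition, "close in the embedding space" can be measured with cosine similarity. A small sketch with made-up 3-dimensional embeddings (real embeddings typically have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Made-up embeddings: two pictures of cats and one of a truck.
cat_1 = [0.9, 0.1, 0.0]
cat_2 = [0.8, 0.2, 0.1]
truck = [0.0, 0.1, 0.9]

# The two cat images point in similar directions, the truck does not.
print(cosine_similarity(cat_1, cat_2) > cosine_similarity(cat_1, truck))  # True
```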
