ML Assisted Labeling

When you need to manually label rows for a machine learning classification problem, active learning can help optimize the order in which you process the unlabeled data. The ML-assisted Labeling plugin enables active learning techniques in Dataiku DSS.

Plugin information

Version 2.0.0
Author Dataiku (Alexandre Abraham, Andrey Avtomonov)
Released 2020-03-16
Last updated 2020-08-10
License MIT License
Source code Github
Reporting issues Github

 


Not all samples bring the same amount of information when it comes to training a model. Labeling a sample that is very similar to an already labeled one might not improve model performance at all. Active learning aims to estimate how much additional information labeling a sample would bring to a model, and selects the next sample to label accordingly. As an example, see how active learning performs compared to random sample annotation on the task of predicting wine color:

Description

This plugin offers a collection of visual webapps to label data (whether tabular, sound, image classification or object detection), a visual recipe to determine which sample needs to be labeled next, and a scenario trigger to automate the rebuild of the ML model that assists in determining which sample is labeled next.

When to use this plugin

The webapp’s purpose is to ease the labeling task. For large datasets or with a limited labeling budget, the Active Learning recipe and scenario can be leveraged. This becomes an iterative ML-assisted labeling process.

Dataiku Applications

To make it easier to use, this plugin ships with three Dataiku Applications.

Labeling Webapp

This plugin offers labeling webapps for multiple use cases:

  • Tabular data classification
  • Image classification
  • Object detection on images
  • Sound labeling

To label data, first select the webapp that fits the data to be labeled.

All labeling webapps offer the same settings. For image labeling, those are:

  • Images to label – managed folder containing unlabeled images.
  • Categories – set of labels to be assigned to images.
  • Labeling status and metadata – dataset name for the labeling metadata.
  • Labels dataset – dataset to save the labels into.
  • Label column name – column name under which the manual labels will be stored.
  • Queries (optional) – dataset containing the unlabeled data with an associated uncertainty score.

Note that the queries dataset is optional, as labeling can always be done without Active Learning. In that case, the samples are offered for labeling in a random order.
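The two modes can be pictured with a minimal sketch. The function name `labeling_order`, the `uncertainty` mapping, and the choice to present the highest score first are assumptions made for illustration; this is not the webapp's actual code.

```python
import random


def labeling_order(sample_ids, uncertainty=None, seed=None):
    """Return the order in which samples would be shown to the annotator.

    Hypothetical helper: `uncertainty` maps a sample id to its score.
    """
    if uncertainty is None:
        # No queries dataset: fall back to a random order.
        order = list(sample_ids)
        random.Random(seed).shuffle(order)
        return order
    # Queries dataset available: assume the most uncertain sample comes first.
    return sorted(sample_ids, key=lambda s: uncertainty[s], reverse=True)


ids = ["a", "b", "c"]
with_scores = labeling_order(ids, uncertainty={"a": 0.1, "b": 0.9, "c": 0.5})
without_scores = labeling_order(ids, seed=0)
```

With scores, the sample with the highest uncertainty ("b" here) is presented first; without them, the same samples are returned in a shuffled order.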

After the webapp has started, the annotation process can start.

Note: For implementation purposes, to distinguish labeled samples from unlabeled ones in the tabular case, the webapp adds a column (called label_id by default) to the output dataset. This column should not be used as a feature in any model.

Active Learning Recipe

Once a sufficient number of samples has been labeled, a classifier can be trained from the DSS Visual Machine Learning interface to predict the labels, and deployed in the project’s flow. To use the Active Learning plugin afterwards, the model must be trained in a Python 3 code environment. See the DSS documentation on how to create a new code environment, and make sure that it is based on Python 3.

From the plugin, after the Query Sampler recipe is selected, the proposed inputs are:

  • Classifier Model – deployed classifier model.
  • Unlabeled Data – dataset containing the raw unlabeled data.
  • Data to be labeled – dataset containing the unlabeled data with an associated uncertainty score.

There is only one setting to choose from, the Active Learning strategy.

This plugin proposes the three most common active learning strategies: Smallest confidence, Smallest margin, and Greatest entropy. They are defined below in the general multiclass classification setting with n classes, where p^(i) denotes the i-th highest predicted probability among all n classes.

Note: In the binary classification case, all three strategies produce the same ranking. In that case, use the Smallest confidence strategy, which is the least computationally costly.
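To see why the rankings coincide, assume the forms commonly used for the three scores (the plugin's exact formulas may differ). With two classes, the second probability is fully determined by the first, so every score becomes a strictly decreasing function of p^(1) alone:

```latex
% Binary case: p^{(2)} = 1 - p^{(1)}, with p^{(1)} \in [1/2, 1].
\begin{align*}
\text{confidence score} &= 1 - p^{(1)} \\
\text{margin score}     &= 1 - \bigl(p^{(1)} - p^{(2)}\bigr) = 2\,\bigl(1 - p^{(1)}\bigr) \\
\text{entropy score}    &= -\frac{p^{(1)} \log p^{(1)} + \bigl(1 - p^{(1)}\bigr) \log\bigl(1 - p^{(1)}\bigr)}{\log 2}
\end{align*}
```

Since all three decrease as p^(1) grows from 1/2 to 1, sorting samples by any of them yields the same order.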

Smallest confidence

This is a confidence score based on the probability of the most probable class.
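As a minimal sketch, a common formulation of this score is 1 − p^(1), so that a higher value means a less confident prediction. This form is an assumption, since the plugin's exact expression is not reproduced here, and the function name and sample probabilities are illustrative.

```python
def smallest_confidence_scores(probabilities):
    """Return 1 - p^(1) for each row of predicted class probabilities.

    Assumed formulation; higher means the model is less confident.
    """
    return [1.0 - max(row) for row in probabilities]


# Hypothetical predicted probabilities: three unlabeled samples, three classes.
probas = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.60, 0.30, 0.10],
]
scores = smallest_confidence_scores(probas)

# Indices of the samples, most uncertain first.
ranking = sorted(range(len(probas)), key=lambda i: scores[i], reverse=True)
```

Here the second sample, whose top probability is only 0.40, is ranked first for labeling.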

Smallest margin

This approach focuses on the difference between the top two probabilities:
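A common way to turn this difference into an uncertainty score is 1 − (p^(1) − p^(2)). The sketch below assumes that form (the plugin's exact expression is not reproduced here) and uses illustrative names and values.

```python
def smallest_margin_scores(probabilities):
    """Return 1 - (p^(1) - p^(2)) for each row of predicted probabilities.

    Assumed formulation; higher means the top two classes are harder to separate.
    """
    scores = []
    for row in probabilities:
        first, second = sorted(row, reverse=True)[:2]
        scores.append(1.0 - (first - second))
    return scores


# Two hypothetical samples sharing the same top probability (0.50).
probas = [
    [0.50, 0.45, 0.05],  # runner-up is very close to the top class
    [0.50, 0.25, 0.25],  # runner-up is far behind
]
scores = smallest_margin_scores(probas)
```

Both rows would tie on the probability of the most probable class alone, but the margin separates them: the first row gets the higher score.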

Greatest entropy

Shannon’s entropy measures the information carried by a random variable. This leads to the following definition:

To keep the output homogeneous with the other strategies, the entropy is normalized to lie between 0 and 1.
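One standard way to obtain a value between 0 and 1 is to divide Shannon's entropy by log(n), its maximum for n classes. The sketch below assumes that normalization, which the text does not spell out; the function name and inputs are illustrative.

```python
import math


def normalized_entropy_scores(probabilities):
    """Return Shannon's entropy divided by log(n) for each row.

    Assumes each row holds the probabilities of n >= 2 classes.
    """
    scores = []
    for row in probabilities:
        # Terms with p == 0 contribute nothing (limit of p * log p is 0).
        entropy = -sum(p * math.log(p) for p in row if p > 0)
        scores.append(entropy / math.log(len(row)))
    return scores


probas = [
    [0.25, 0.25, 0.25, 0.25],  # uniform: maximal uncertainty
    [1.00, 0.00, 0.00, 0.00],  # one-hot: no uncertainty
    [0.70, 0.20, 0.10, 0.00],  # in between
]
scores = normalized_entropy_scores(probas)
```

Under this normalization a uniform prediction scores 1, a fully confident prediction scores 0, and everything else falls in between.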

Sessions

Tracking how performance evolves is useful in an active learning setting. For this purpose, a session_id counter recording how many times the query sampler has been run so far is added to the queries dataset.

This session_id is then reported in the metadata dataset, output of the labeling webapp.
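As an illustration of how this counter could be used, the sketch below computes how many labels were available at the end of each session, which gives the training set sizes against which model performance can be compared. Only `session_id` comes from the text; the other column names, the sample values, and the numbering from 0 are assumptions.

```python
from collections import Counter

# Hypothetical rows of the labeling metadata dataset.
metadata = [
    {"sample": "row_01", "label": "red", "session_id": 0},
    {"sample": "row_02", "label": "white", "session_id": 0},
    {"sample": "row_03", "label": "red", "session_id": 1},
    {"sample": "row_04", "label": "red", "session_id": 1},
    {"sample": "row_05", "label": "white", "session_id": 1},
    {"sample": "row_06", "label": "white", "session_id": 2},
]


def cumulative_label_counts(rows):
    """Map each session_id to the number of labels collected up to that session."""
    per_session = Counter(row["session_id"] for row in rows)
    total = 0
    cumulative = {}
    for session_id in sorted(per_session):
        total += per_session[session_id]
        cumulative[session_id] = total
    return cumulative


counts = cumulative_label_counts(metadata)
```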

Active Learning Scenario

The Active Learning process is intrinsically a loop in which the samples labeled so far and the trained classifier are leveraged to select the next batch of samples to be labeled. In DSS, this loop is carried out by the webapp, which takes the queries and fills the training data of the model, and by a scenario that regularly retrains the model and generates new queries.
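The loop can be pictured with a toy, self-contained simulation. Everything in it is an assumption made for illustration: the one-dimensional threshold "model", the `annotate` function standing in for the human annotator, and the `retrain_every` and `budget` parameters. It does not use any DSS or plugin API.

```python
def fit_threshold(labeled):
    """Toy 'model': midpoint between the largest 0-labeled and smallest 1-labeled value."""
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    return (max(zeros) + min(ones)) / 2.0


def generate_queries(unlabeled, threshold):
    """Order unlabeled values so that those closest to the threshold come first."""
    return sorted(unlabeled, key=lambda x: abs(x - threshold))


def annotate(x):
    """Stand-in for the human annotator; the 0.62 boundary is arbitrary."""
    return int(x >= 0.62)


def active_learning_loop(pool, seed_labeled, retrain_every, budget):
    labeled = list(seed_labeled)
    threshold = fit_threshold(labeled)
    queries = generate_queries(pool, threshold)
    retrain_count = 0
    since_last_training = 0
    for _ in range(budget):
        if not queries:
            break
        x = queries.pop(0)                # webapp: next sample to label
        labeled.append((x, annotate(x)))  # webapp: label joins the training data
        since_last_training += 1
        if since_last_training >= retrain_every:
            # scenario: retrain the model and regenerate the queries
            threshold = fit_threshold(labeled)
            queries = generate_queries(queries, threshold)
            retrain_count += 1
            since_last_training = 0
    return threshold, retrain_count, labeled


pool = [i / 20 for i in range(1, 20)]  # unlabeled values 0.05 .. 0.95
seed = [(0.0, 0), (1.0, 1)]            # two samples labeled up front
threshold, retrain_count, labeled = active_learning_loop(
    pool, seed, retrain_every=2, budget=6
)
```

With a budget of six labels and retraining every two labels, the toy model is retrained three times, and each retraining moves the queries toward the region where the current model is least certain.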

To set up this scenario, this plugin provides a custom trigger that can be used to retrain the model every n labelings. Follow these steps to set up the training:

  • Create the scenario and add the custom trigger Every n labeling.

The following is then displayed:

Last but not least, the following three steps constitute the full Active Learning scenario: