| Author | Dataiku (Alexandre Abraham, Andrey Avtomonov) |
When you need to manually label rows for a machine learning classification problem, active learning can help optimize the order in which you process the unlabeled data. The ML-assisted Labeling plugin enables active learning techniques in Dataiku DSS.
Not all samples bring the same amount of information when training a model. Labeling a sample that is very similar to an already labeled one might not improve model performance at all. Active learning aims at estimating how much additional information labeling a sample would bring to the model, and at selecting the next sample to label accordingly. As an example, see how active learning performs compared to random sample annotation on a task of predicting wine color:
This plugin offers a collection of visual webapps to label data (tabular data, sound, image classification, or object detection), a visual recipe to determine which samples need to be labeled next, and a scenario trigger to automate the rebuild of the ML model that assists in determining which sample is labeled next.
- Data Labeling (webapp)
- Score an unlabeled dataset (recipe)
- Set up an Active Learning scenario (scenario)
When to use this plugin
The webapp’s purpose is to ease the labeling task. For large datasets or a limited labeling budget, the Active Learning recipe and scenario can be leveraged, turning labeling into an iterative ML-assisted process.
To facilitate its usage, this plugin ships with 3 Dataiku Applications.
This plugin offers labeling webapps for multiple use cases:
| Tabular data classification | Image classification |
| Object detection on images | Sound labeling |
In order to label data, first select the webapp that fits the data to be labeled.
All labeling webapps offer the same settings. For image labeling, those are:
- Images to label – managed folder containing the unlabeled images.
- Categories – set of labels to be assigned to images.
- Labeling status and metadata – dataset name for the labeling metadata.
- Labels dataset – dataset to save the labels into.
- Label column name – column name under which the manual labels will be stored.
- Queries (optional) – dataset containing the unlabeled data with an associated uncertainty score.
Note that the queries dataset is optional, as labeling can always be done without Active Learning. In that case, samples are offered for labeling in random order.
After the webapp has started, the annotation process can begin.
Note: For implementation purposes, in order to distinguish labeled samples from unlabeled ones in the tabular case, the webapp adds a column (called label_id by default) to the output dataset. This column should not be used as a feature in any model.
When a sufficient number of samples has been labeled, a classifier from the DSS Visual Machine Learning interface can be trained to predict the labels and deployed in the project’s flow. In order to later use the Active Learning plugin, the model must be trained with a Python 3 code environment. Here’s a link describing how to create a new code environment in DSS. Make sure that it’s based on Python 3.
From the plugin, after the Query Sampler recipe is selected, the proposed inputs are:
- Classifier Model – deployed classifier model.
- Unlabeled Data – dataset containing the raw unlabeled data.
- Data to be labeled – dataset containing the unlabeled data with an associated uncertainty score.
There is only one setting to choose from, the Active Learning strategy.
This plugin proposes the three most common active learning strategies: Smallest confidence, Smallest margin, and Greatest entropy. Here are their definitions in the general multiclass classification setting with n classes, where p^(i) denotes the i-th highest predicted probability among the n classes.
Note: In the binary classification case, all strategies produce the same ranking. In that case, one should therefore go with the Smallest confidence strategy, which is the least computationally costly.
This is a confidence score based on the probability of the most probable class.
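A standard formulation consistent with the notation above (the original formula image is missing here, and the exact normalization used by the plugin may differ) is to query the sample whose top class probability is smallest:

```latex
\mathrm{Confidence}(x) = p^{(1)}(x), \qquad x^{*} = \arg\min_{x}\; p^{(1)}(x)
```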
This approach focuses on the difference between the top two probabilities:
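A standard formulation consistent with the notation above (reconstructed, as the original formula image is missing) queries the sample with the smallest gap between the two most probable classes:

```latex
\mathrm{Margin}(x) = p^{(1)}(x) - p^{(2)}(x), \qquad x^{*} = \arg\min_{x}\; \mathrm{Margin}(x)
```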
Shannon’s entropy measures the information carried by a random variable. This leads to the following definition:
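The standard Shannon entropy over the n predicted class probabilities (reconstructed here, as the original formula image is missing) is:

```latex
H(x) = -\sum_{i=1}^{n} p^{(i)}(x)\,\log p^{(i)}(x)
```

The sample with the greatest entropy is queried first.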
In order to have an homogeneous output, this is normalized between 0 and 1.
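The three strategies can be sketched in NumPy as follows. This is an illustrative implementation, not the plugin's actual code; in particular, the scaling of the smallest-confidence score to [0, 1] is an assumed convention.

```python
import numpy as np

def smallest_confidence(proba):
    # High when the top predicted probability is low; scaled so that a
    # uniform prediction scores 1 (this normalization is an assumption).
    n = proba.shape[1]
    return (1 - proba.max(axis=1)) * n / (n - 1)

def smallest_margin(proba):
    # High when the two highest class probabilities are close to each other.
    sorted_p = np.sort(proba, axis=1)
    return 1 - (sorted_p[:, -1] - sorted_p[:, -2])

def greatest_entropy(proba):
    # Shannon entropy, divided by log(n) to land in [0, 1].
    n = proba.shape[1]
    safe = np.where(proba > 0, proba, 1.0)  # avoid log(0); term is 0 anyway
    terms = np.where(proba > 0, proba * np.log(safe), 0.0)
    return -terms.sum(axis=1) / np.log(n)

proba = np.array([
    [0.90, 0.05, 0.05],  # confident prediction -> low uncertainty scores
    [0.40, 0.35, 0.25],  # ambiguous prediction -> high uncertainty scores
])
```

All three scores rank the second (ambiguous) row as more uncertain than the first, which is exactly the behavior the recipe relies on to pick the next queries.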
Tracking performance evolution is useful in an active learning setting. For this purpose, a session_id counter, incremented each time the query sampler is run, is added to the queries dataset. The session_id is then reported in the metadata dataset, the output of the labeling webapp.
The Active Learning process is intrinsically a loop in which the samples labeled so far and the trained classifier are leveraged to select the next batch of samples to label. In DSS, this loop takes place through the webapp, which uses the queries to fill the model’s training data, and a scenario that regularly retrains the model and generates new queries.
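This loop can be sketched outside DSS with scikit-learn: train on the labeled pool, score the unlabeled pool, and move the most uncertain samples into the labeled pool. The names below (BATCH_SIZE, the smallest-confidence scoring) are illustrative assumptions, not plugin settings, and the human labeler is simulated by reading the known targets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
labeled = list(rng.choice(len(X), size=20, replace=False))  # initial random labels
unlabeled = [i for i in range(len(X)) if i not in labeled]
BATCH_SIZE = 20  # illustrative value

for session_id in range(5):  # each pass mimics one scenario run
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[unlabeled])
    uncertainty = 1 - proba.max(axis=1)  # smallest-confidence score
    picked = np.argsort(uncertainty)[::-1][:BATCH_SIZE]  # most uncertain first
    queries = [unlabeled[i] for i in picked]
    labeled += queries  # in the real loop, a human labels these in the webapp
    unlabeled = [i for i in unlabeled if i not in queries]
```

In the plugin, the retraining and query generation happen in the scenario, while the "move into the labeled pool" step is performed by the labeler in the webapp.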
To set up this scenario, the plugin proposes a custom trigger that can be used to retrain the model every n labelings. Here are the steps to follow to set up the training:
- Create the scenario and add the custom trigger Every n labeling.
The following is then displayed:
Last but not least, the following three steps constitute the full Active Learning scenario: