Deep Learning for Image Classification

Applies to DSS 4.1.x and above | March 06, 2018

Deep learning models are powerful tools for image classification, but are difficult and expensive to create from scratch. Dataiku provides a plugin that supplies a number of pre-trained deep learning models that you can use to classify images. You can also retrain a model to specialize it on a particular set of images, a process known as transfer learning.

Objectives

We will show you how to:

  • Install the deep learning for images plugin
  • Add a pre-trained deep learning model to the flow
  • Classify a set of test images with the pre-trained model
  • Use transfer learning with a set of labeled training images to improve the pre-trained model

Prerequisites

We will work with images of lions and tigers. Download the lions_and_tigers.zip (189MB) file from our website and extract the contents. You should find a folder named Images to classify, a folder named Images for retraining, and a Python script.

Install the Plugin

First you need to install the Deep Learning for Images plugin. This requires Administrator privileges on the Dataiku DSS instance.

Log in as the Dataiku DSS Administrator, and from the Admin Tools menu in the top navigation bar, choose Administration.

Dataiku administration tools, Plugin tab

Navigate to the Plugins tab and click Store. Search for “deep learning”. There are two options here. If you have Dataiku installed with GPU support for deep learning, choose the GPU option. Otherwise, install Deep Learning image (CPU).

The plugin requires a dedicated code environment. Choose to create it.

Installing the deep learning for images plugin from the store

Create Your Project and Prepare the Data

Create a new project and give it a name like Lions and Tigers.

In the flow, create two folders (from the + Dataset dropdown, select Folder) named:

  • Images to classify
  • Images for retraining

… and populate these folders with the contents of the same-named folders from the zip file you downloaded.

Adding folders with images to the flow

In order to use the images for retraining, we need a dataset that labels each image as a lion or tiger. Fortunately, the name of each image file contains the text “lion” or “tiger”, correctly identifying which big cat is in the image. We can process the filenames with Python code to create this labels dataset.

Select the Images for retraining folder and choose the Python code recipe from the actions menu. In the recipe creation dialog, create a new dataset called Labels. Then click Create Recipe.

Between the code for Recipe inputs and Recipe outputs, insert the following code:

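# List every file in the input folder; each path starts with "/".
paths = images_for_retraining.list_paths_in_partition()

LABEL_0 = "lion"
LABEL_1 = "tiger"

# Collect one [path, label] row per image whose filename contains a label.
rows = []
for path in paths:
    if LABEL_0 in path:
        # path[1:] drops the leading "/" so the path is relative to the folder.
        rows.append([path[1:], LABEL_0])
    elif LABEL_1 in path:
        rows.append([path[1:], LABEL_1])

# Build the DataFrame once, rather than growing it row by row.
pandas_dataframe = pd.DataFrame(rows, columns=['path', 'label'])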

Alternatively, you can copy the code from the Python script file in the zip file you downloaded. If you do, just be sure to change the reference to the Dataiku folder from 63RF8lzq to the reference in your project (it’s visible in the left panel under the Input Datasets).

Adding code to parse the image file names to the Python recipe

Click Run, then check the output dataset to ensure it looks right. Now the data is ready for use!
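One way to check that the output "looks right" is to confirm that every image received a label and that both classes are represented. The sketch below reproduces the filename-matching logic on a few made-up paths (the file names are hypothetical and are not taken from the zip file) using only the standard library, so it can be run outside DSS.

```python
from collections import Counter

LABELS = ("lion", "tiger")


def label_for(path, labels=LABELS):
    """Return the first label found in the path, or None if there is none."""
    for label in labels:
        if label in path:
            return label
    return None


# Hypothetical paths, shaped like the output of list_paths_in_partition().
sample_paths = [
    "/img_001_lion.jpg",
    "/img_002_tiger.jpg",
    "/img_003_lion.jpg",
    "/notes.txt",
]

# Map each path (without its leading "/") to its label.
labeled = {p[1:]: label_for(p) for p in sample_paths}
counts = Counter(v for v in labeled.values() if v is not None)
unlabeled = [p for p, v in labeled.items() if v is None]

print(counts)     # how many images per class
print(unlabeled)  # files that matched neither label
```

A very uneven class count, or a non-empty list of unlabeled files, would be worth investigating before retraining.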

Add a Pre-Trained Model to the Flow

The plugin includes a macro for downloading a pre-trained deep learning model. Navigate to the project home, then to Macros in the top navigation bar. Click Download pre-trained model.

Macro for downloading a pre-trained deep learning model for image classification

In the Download pre-trained model dialog, type Pre-trained model (imagenet) as the output folder name. Click Run Macro. When the process completes, go back to the flow to see the pre-trained model has been added and is ready for use.

Pre-trained deep learning model added to the flow

Classify a Set of Test Images with the Pre-Trained Model

In order to use the pre-trained model, from the + Recipe dropdown, select Deep Learning Image (CPU) > Image Classification. In the create recipe dialog, select Images to classify as the images folder and Pre-trained model (imagenet) as the model folder. Create a new output dataset called Classification. Click Create Recipe.

In the Image Classification dialog, set the Max number of class labels to 1. We want the model to make a single prediction for each image. Click Run.

Image classification settings

The resulting dataset contains a column with the predictions. Each prediction is a simple JSON object containing the predicted label and the model-predicted probability that the label is correct. Manually scanning the predictions to see which are correct is time-consuming and error-prone, so you can use a Prepare recipe to find the correct and incorrect classifications.
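If you prefer to inspect the predictions in code, the JSON can be parsed directly. The sketch below assumes each cell is a JSON object that maps a label to its probability, such as {"tiger": 0.87}; that layout is an assumption for illustration, so check a real cell before relying on it. The function name top_prediction is also hypothetical.

```python
import json


def top_prediction(prediction_json):
    """Return (label, probability) for the most probable label.

    Assumes the cell holds a JSON object mapping labels to probabilities.
    """
    scores = json.loads(prediction_json)
    label = max(scores, key=scores.get)
    return label, scores[label]


# Hypothetical cell value; the real column may be laid out differently.
label, probability = top_prediction('{"tiger": 0.87}')
print(label, probability)
```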

Classification dataset, showing three misclassified images

From the Actions menu of the Classification dataset, select the Prepare recipe. In the recipe creation dialog, rename the output dataset Classification_results, then click Create Recipe.

From the images column dropdown, select More actions > Find and replace…. Type labels as the output column. Select Regular expression as the matching mode. Type .*_(.*)\..* as the regular expression and $1 as the replacement value.

From the prediction column dropdown, select More actions > Find and replace…. Select Regular expression as the matching mode. Type .*"(.*)".* as the regular expression and $1 as the replacement value.

Click Add a New Step and choose Formula from the processors library. Type good_prediction as the name of the output column. Type if(labels==prediction,1,0) as the expression. Sort the new good_prediction column in ascending order.
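To see what these three steps compute, here is a rough Python equivalent. It is only a sketch: the file names and prediction cells are hypothetical, and Python writes the replacement as \1 where the Prepare recipe uses $1. The first pattern keeps the text between the last underscore and the last dot of the file name, and the second keeps the text inside the quotes of the prediction.

```python
import re

# Same patterns as the two Find and replace steps above.
LABEL_PATTERN = re.compile(r'.*_(.*)\..*')
PREDICTION_PATTERN = re.compile(r'.*"(.*)".*')


def good_prediction(image_name, prediction_cell):
    """Mirror the Prepare steps: extract both labels, then compare them."""
    label = LABEL_PATTERN.sub(r'\1', image_name)
    predicted = PREDICTION_PATTERN.sub(r'\1', prediction_cell)
    return 1 if label == predicted else 0


# Hypothetical rows: (file name, prediction cell).
rows = [
    ("img_001_lion.jpg", '{"lion": 0.93}'),
    ("img_002_lion.jpg", '{"bison": 0.41}'),
]
flags = [good_prediction(name, cell) for name, cell in rows]
print(flags)
```

Sorting good_prediction in ascending order then brings the rows flagged 0, the misclassified images, to the top.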

Right out of the box, the pre-trained model can classify lions and tigers; however, three lions are misclassified as brown_bear, bison, and baboon.

Prepared classification dataset, showing three misclassified images

Finally, click Run to create the output dataset and return to the Flow.

Transfer Learning to Retrain the Model

We can use the folder with training images to improve the pre-trained model with transfer learning. In order to re-train the model, from the + Recipe dropdown, select Deep Learning Image (CPU) > Retraining Image Classification Model. In the create recipe dialog, select Labels as the label dataset, Images for retraining as the images folder, and Pre-trained model (imagenet) as the model folder. Create a new output folder called Retrained model. Click Create Recipe.

In the Retraining Image Classification Model dialog, set the Image filename column to path, and set the Label column to label. To speed up retraining, reduce Batch Size to 10, Steps per Epoch to 10, and Number of Validation Steps to 5. Select the option described as “You can access tensorboard via a DSS webapp”. Click Run.
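To get a feel for how small these settings make each training pass, the arithmetic can be sketched as follows. This assumes the plugin follows the usual Keras convention, where each step processes one batch, and that validation uses the same batch size; neither point is confirmed by this tutorial.

```python
def images_per_epoch(batch_size, steps):
    """Number of images processed when each step consumes one batch."""
    return batch_size * steps


# Values from the retraining dialog above.
BATCH_SIZE = 10
STEPS_PER_EPOCH = 10
VALIDATION_STEPS = 5

train_images = images_per_epoch(BATCH_SIZE, STEPS_PER_EPOCH)
validation_images = images_per_epoch(BATCH_SIZE, VALIDATION_STEPS)
print(train_images, validation_images)
```

Under those assumptions, each epoch trains on 100 images and validates on 50, which keeps retraining short at the cost of a less thoroughly trained model.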

Retraining image classification settings

Classification after Transfer Learning

Now let’s use the retrained model to classify the test set. From the + Recipe dropdown, select Deep Learning Image (CPU) > Image Classification. In the create recipe dialog, select Images to classify as the images folder and Retrained model as the model folder. Create a new output dataset called Classification_after_retrain. Click Create Recipe.

In the Image Classification dialog, set the Max number of class labels to 1. We want the model to make a single prediction for each image. Click Run, then return to the Flow.

Flow after classifying test set on retrained model

We can prepare the classification output as before. Select the Classification_after_retrain dataset and choose Prepare from the actions. In the recipe creation dialog, rename the output dataset Classification_after_retrain_results, then click Create Recipe.

From the images column dropdown, select More actions > Find and replace…. Type labels as the output column. Select Regular expression as the matching mode. Type .*_(.*)\..* as the regular expression and $1 as the replacement value.

From the prediction column dropdown, select More actions > Find and replace…. Select Regular expression as the matching mode. Type .*"(.*)".* as the regular expression and $1 as the replacement value.

Click Add a New Step and choose Formula from the processors library. Type good_prediction as the name of the output column. Type if(labels==prediction,1,0) as the expression. Sort the new good_prediction column in ascending order.

The retrained model, somewhat disappointingly, misclassifies 4 images instead of 3. This is not entirely surprising: our retraining set is not particularly large, and we reduced some of the settings to shorten the retraining time. On the other hand, the transfer learning has focused the model on lions and tigers.
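To compare the two runs as accuracy figures rather than raw counts, you could summarize the good_prediction columns as sketched below. The 20-image test set size is purely hypothetical (the tutorial does not state how many test images there are); only the counts of 3 and 4 misclassified images come from the results above.

```python
def summarize(flags):
    """Return (misclassified count, accuracy) for a list of 0/1 flags."""
    wrong = flags.count(0)
    return wrong, (len(flags) - wrong) / len(flags)


# Hypothetical good_prediction columns for an assumed 20-image test set.
before = [0] * 3 + [1] * 17   # pre-trained model: 3 misclassified
after = [0] * 4 + [1] * 16    # retrained model: 4 misclassified

print(summarize(before))
print(summarize(after))
```

With a test set this small, a difference of one image is well within what chance could produce, so it says little about which model is actually better.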

Prepared classification dataset, showing four misclassified images

Finally, click Run to create the output dataset and return to the Flow.

What’s Next

There is a Dataiku gallery project that shows a completed project using the plugin.

To visualize the retrained model, you can create a webapp that accesses Tensorboard. The dashboard of the gallery project shows an example.

The plugin also allows you to extract features from images for use in building predictive models. For example, the goal of the Two Sigma Connect Kaggle competition was to predict how popular an apartment rental listing would be, based on various characteristics, including pictures of the apartment. Using the deep learning image plugin, you can extract features from those pictures and use them as inputs to the model.