howto

Sentiment Analysis in Dataiku DSS

Applies to DSS 4.3.0 and above | July 18, 2018

Binary Sentiment Analysis is the task of automatically analyzing a text data to decide whether it is positive or negative. This is useful when faced with a lot of text data that would be too time-consuming to manually label. Dataiku provides a plugin that allows you to compute binary sentiment scores for English text data.

Objectives

We will show you how to:

  • Install the sentiment analysis plugin
  • Compute sentiment scores for text data

Prerequisites

We will be working with IMDB movie reviews. The original data is from the Large Movie Review Dataset, which is a compressed folder with many text files, each corresponding to a review. In order to simplify this how-to, we have provided a single csv file for download.

Install the Plugin

First you need to install the Sentiment Analysis plugin. This requires Administrator privileges on the Dataiku DSS instance.

Log in as the Dataiku DSS Administrator, and from the Application menu in the top navigation bar, choose Administration.

Dataiku administration tools, Plugin tab

Navigate to the Plugins tab and click Store. Search for “sentiment analysis”.

The plugin requires a dedicated code environment. Choose to create it.

Create Your Project And Prepare The Data

Create a new project and give it a name like IMDB Sentiment Analysis. In the flow, create a files-based dataset and upload the CSV file you downloaded earlier.

The dataset has three columns, one containing the text of the review, one containing the rating given by the customer on a 1-10 scale, and one containing a mapping of that rating to sentiment polarity. When a text is positive we say that it has a polarity of 1, otherwise we say it has a polarity of 0.

IMDB test sample

Let’s predict the sentiment of these reviews and then compare the predicted sentiment polarities with the actual values to get a sense of how the plugin works and how well it does.

Compute Sentiment Scores

In the project Flow, click on +RECIPE then select the Sentiment Analysis plugin.

Sentiment analysis plugin dialog

Select the recipe “Compute sentiment scores”, specify an input dataset where the reviews can be found, and specify an output dataset.

After creating the recipe, you can run it by simply selecting the column containing the texts (here, our movie reviews). Also set the Output confidence scores checkbox, which outputs the model’s confidence on each prediction as a new column. Then click Run.

Compute sentiment scores dialog

After a few seconds, the plugin outputs a copy of the original dataset with 2 additional columns:

Dataset with computed sentiment scores

You can see that the plugin is sometimes right in its predictions and sometimes wrong. To get a better idea of how well the plugin did on our task, let’s compute an accuracy score (the number of good predictions over the number of reviews). To do that we go to “Status > Edit” and create a new Python Probe, using the following code as the metric:

def process(dataset, partition_id):
    df = dataset.get_dataframe()
    prediction = df["predicted_sentiment"].values == "positive"
    original = df["polarity"].values

    return {"accuracy": (prediction == original).mean()}

Then we run this probe, save it, and get the accuracy in “Status > Metrics”:

Accuracy metric for sentiment scores

So, we can see that using the Sentiment Analysis plugin, we get ~ 89.5% accuracy over 25,000 movie reviews.

What’s Next

There is a Dataiku gallery project that shows a completed project using the plugin.

There is also a page dedicated to the Sentiment Analysis plugin.

You can build your own deep learning models for Sentiment Analysis using Keras and Tensorflow in Dataiku’s Visual Machine learning.