Amazon Comprehend NLP

This plugin provides recipes to call the Amazon Comprehend APIs.

Plugin information

Version: 1.0.0
Author: Dataiku (Alex COMBESSIE and Joachim ZENTICI)
Released: 2020-05
Last updated: 2020-05
License: Apache Software License
Source code: GitHub
Reporting issues: GitHub

How to set up

If you are a Dataiku and AWS admin user, follow these configuration steps right after you install the plugin. If you are not an admin, you can forward this to your admin and scroll down to the How to use section.

Note that the Amazon Comprehend API is a paid service. You can consult the API pricing page to estimate your costs.

1. Create an IAM user with the Amazon Comprehend policy – in AWS

Let’s assume that your AWS account has already been created and that you have full admin access. If not, please follow this guide.

Start by creating a dedicated IAM user to centralize access to the Comprehend API, or select an existing one. Next, you will need to attach a policy to this user following this documentation. We recommend using the “ComprehendFullAccess” managed policy, as shown below:

IAM policy for Amazon Comprehend

Alternatively, you can create a custom IAM policy that allows “comprehend:*” actions. After completing this step, you will be able to retrieve the user's Access key ID and Secret access key.
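As a rough illustration, a custom policy allowing all “comprehend:*” actions could look like the document built below. This is only a sketch: the helper name build_comprehend_policy is hypothetical, and the unrestricted "Resource" value is an assumption rather than something this plugin prescribes.

```python
import json


def build_comprehend_policy():
    """Return an IAM policy document allowing every Amazon Comprehend action."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "comprehend:*",
                # Assumption: no resource-level restriction is applied.
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    # Print the JSON so it can be pasted into the IAM policy editor.
    print(json.dumps(build_comprehend_policy(), indent=2))
```

You may prefer a narrower list of actions if the IAM user should only run some of the recipes.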

Completed IAM User Creation

2. Create an API configuration preset – in Dataiku DSS

In Dataiku DSS, navigate to the Plugin page > Settings > API configuration and create your first preset.

API Configuration Preset Creation

3. Configure the preset – in Dataiku DSS

Completed API Configuration Preset
  • Fill in the AUTHENTIFICATION settings.
    • Copy-paste your Access key ID and Secret access key from Step 1 in the corresponding fields.
    • Set the AWS region parameter to one of the regions in this list.
    • Alternatively, you may leave the fields empty so that the credentials are ascertained from the server environment. If you choose this option, please follow this documentation on the server hosting DSS.
  • (Optional) Review the API QUOTA and PARALLELIZATION settings.
    • The default API Quota settings ensure that one recipe calling the API will be throttled at 25 requests (Rate limit parameter) per second (Period parameter). In other words, after sending 25 requests, it will wait for 1 second, then send another 25, etc.
    • By default, each request to the API contains a batch of 10 documents (Batch size parameter). Combined with the previous settings, it means that it will send 25 * 10 = 250 rows to the API every second.
    • This default quota is defined by Amazon. You can request a quota increase, as documented on this page.
    • You may need to decrease the Rate limit parameter if you expect multiple recipes to call the API concurrently. For instance, to allow 5 concurrent DSS activities, you can set this parameter to 25/5 = 5 requests per second.
    • The default Concurrency parameter means that 4 threads will call the API in parallel. This parallelization operates within the API Quota settings defined above. We do not recommend changing this default unless your server has a much higher number of CPU cores.
  • Set the Permissions of your preset.
    • You can declare yourself as Owner of this preset and make it available to everybody, or to a specific group of users.
    • Any user belonging to one of these groups on your Dataiku DSS instance will be able to see and use this preset.
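The quota arithmetic described above can be sketched in a few lines of Python. This is an illustration of the reasoning only, not the plugin's implementation; the function names are hypothetical.

```python
def rows_per_period(rate_limit, batch_size):
    """Maximum rows sent per period: requests per period times documents per request."""
    return rate_limit * batch_size


def rate_limit_per_recipe(total_rate_limit, concurrent_recipes):
    """Split the overall quota evenly across recipes expected to run concurrently."""
    return total_rate_limit // concurrent_recipes


def make_batches(rows, batch_size):
    """Group rows into consecutive batches of at most batch_size documents."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


if __name__ == "__main__":
    # Default settings from the preset: 25 requests per second, 10 documents per request.
    print(rows_per_period(rate_limit=25, batch_size=10))
    # Sharing the same quota between 5 concurrent DSS activities.
    print(rate_limit_per_recipe(total_rate_limit=25, concurrent_recipes=5))
    # 23 rows split into batches of 10 documents.
    print([len(batch) for batch in make_batches(list(range(23)), 10)])
```

With the defaults, a dataset of 10,000 rows would therefore need at least 40 seconds of API calls, since at most 250 rows are sent per second.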

Voilà! Your preset is ready to be used.

Later, you (or another Dataiku admin) will be able to add more presets. This can be useful to segment plugin usage by user group. For instance, you can create a “Default” preset for everyone and a “High performance” one for your Marketing team, with separate billing by team.


How to use

Let’s assume that you have a Dataiku DSS project with a dataset containing text data. As an example, we will use the Amazon Review dataset for instant videos. You can follow the same steps with your own data.

First, create an Amazon Comprehend NLP recipe from the + RECIPE button or from the right panel if your dataset is selected.

Plugin Recipe Creation

Language Detection

Language Detection Recipe
  • Fill INPUT PARAMETERS
    • Set the Text column parameter to the column containing your text data.
  • Review CONFIGURATION parameters
    • The API configuration preset parameter is automatically filled with the default preset made available by your Dataiku admin.
    • You may select another one if multiple presets have been created.
  • (Optional) Review ADVANCED parameters
    • You can activate the Expert mode to access advanced parameters.
    • The Error handling parameter determines how the recipe will behave if the API returns an error.
      • In “Log” error handling, the error is logged to the output and does not cause the recipe to fail.
      • We do not recommend changing this parameter to “Fail” mode unless you want the recipe to stop at the first API error.
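To make the difference between the two Error handling modes concrete, here is a minimal sketch. It is not the plugin's code: the ErrorHandling enum, the call_with_error_handling helper, and the api_call argument are all hypothetical names standing in for whatever function actually sends a document to the API.

```python
import logging
from enum import Enum


class ErrorHandling(Enum):
    LOG = "Log"
    FAIL = "Fail"


def call_with_error_handling(api_call, text, mode=ErrorHandling.LOG):
    """Call api_call(text) and return a (response, error_message) pair.

    In LOG mode an exception is recorded and returned as a message,
    so processing can continue with the next document.
    In FAIL mode the exception is re-raised and stops the whole run.
    """
    try:
        return api_call(text), None
    except Exception as error:
        if mode is ErrorHandling.FAIL:
            raise
        logging.warning("API error for one document: %s", error)
        return None, str(error)
```

In “Log” mode the error message can be written to an output column next to the affected row, which makes it possible to inspect and reprocess failed rows afterwards.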

Sentiment Analysis

Sentiment Analysis Recipe

The parameters are almost exactly the same as the Language Detection recipe (see above).

The only change is the addition of Language parameters. By default, we assume the Text column is in English. You can change it to any of the supported languages listed here, or choose “Detected language column” if your dataset contains multiple languages. In that case, you will need to reuse the language code column computed by the Language Detection recipe.
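The two Language options can be pictured as follows. This is a hedged sketch rather than the plugin's logic: the function names and the column names review_text and language_code are hypothetical. Grouping rows by language is shown because a batch request to a sentiment API typically carries a single language code for all of its documents.

```python
from collections import defaultdict


def resolve_language(row, language="en", language_column=None):
    """Return the language code to use for one row.

    With no language_column, the fixed language applies to every row.
    Otherwise the code is read from the row, e.g. a column produced
    by a previous language detection step.
    """
    if language_column is None:
        return language
    return row[language_column]


def group_rows_by_language(rows, language="en", language_column=None):
    """Group rows so that each group shares a single language code."""
    groups = defaultdict(list)
    for row in rows:
        groups[resolve_language(row, language, language_column)].append(row)
    return dict(groups)


if __name__ == "__main__":
    rows = [
        {"review_text": "Great show", "language_code": "en"},
        {"review_text": "Très bonne série", "language_code": "fr"},
        {"review_text": "Loved it", "language_code": "en"},
    ]
    grouped = group_rows_by_language(rows, language_column="language_code")
    print({code: len(group) for code, group in grouped.items()})
```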

Named Entity Recognition

Named Entity Recognition Recipe

The parameters under INPUT PARAMETERS and CONFIGURATION are the same as the Sentiment Analysis recipe (see above).

Under ADVANCED with Expert mode activated, you have access to additional parameters which are specific to this recipe:

  • Entity types: select one or more types from this list
  • Minimum score: a threshold between 0 and 1; increase it to filter out entities with a low confidence score. The default is 0, so no filtering is applied.
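The effect of these two parameters can be sketched as a simple filter. The example assumes each entity is a dictionary with "Type", "Text" and "Score" keys, as in the Amazon Comprehend entity detection response; the filter_entities helper itself is hypothetical and not taken from the plugin.

```python
def filter_entities(entities, entity_types=None, minimum_score=0.0):
    """Keep entities of the selected types whose score reaches the threshold.

    entity_types=None keeps every type; minimum_score=0.0 applies no filtering.
    """
    return [
        entity
        for entity in entities
        if (entity_types is None or entity["Type"] in entity_types)
        and entity["Score"] >= minimum_score
    ]


if __name__ == "__main__":
    entities = [
        {"Type": "PERSON", "Text": "Jane Doe", "Score": 0.98},
        {"Type": "ORGANIZATION", "Text": "Amazon", "Score": 0.91},
        {"Type": "PERSON", "Text": "it", "Score": 0.35},
    ]
    kept = filter_entities(entities, entity_types={"PERSON"}, minimum_score=0.5)
    print([entity["Text"] for entity in kept])
```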

Key Phrase Extraction

Key Phrase Extraction Recipe

The parameters under INPUT PARAMETERS and CONFIGURATION are the same as the Sentiment Analysis recipe (see above).

Under ADVANCED with Expert mode activated, you can tune the Number of key phrases to extract. Key phrases are ranked by decreasing confidence score, and the default value keeps the top 3.
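Selecting the top key phrases by confidence can be sketched as below. The example assumes each key phrase is a dictionary with "Text" and "Score" keys, as in the Amazon Comprehend key phrase detection response; the top_key_phrases helper is a hypothetical name used for illustration.

```python
def top_key_phrases(key_phrases, number=3):
    """Return the text of the highest-scoring key phrases, best first."""
    ranked = sorted(key_phrases, key=lambda phrase: phrase["Score"], reverse=True)
    return [phrase["Text"] for phrase in ranked[:number]]


if __name__ == "__main__":
    key_phrases = [
        {"Text": "the first season", "Score": 0.87},
        {"Text": "a great cast", "Score": 0.99},
        {"Text": "the ending", "Score": 0.75},
        {"Text": "my family", "Score": 0.93},
    ]
    print(top_key_phrases(key_phrases))
```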

Putting It All Together: Visualization

Thanks to the output datasets produced by the plugin, you can create charts to analyze results from the API. For instance, you can:

  • filter documents to focus on one language
  • analyze the distribution of sentiment scores
  • identify which entities are mentioned
  • understand which key phrases reviewers use
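Outside of the chart builder, the first two ideas can also be approximated in a few lines of Python. This is a sketch on made-up rows: the column names language_code and sentiment are hypothetical and may not match the plugin's actual output columns.

```python
from collections import Counter


def sentiment_distribution(rows, language=None):
    """Count predicted sentiment labels, optionally for a single language."""
    selected = [
        row for row in rows
        if language is None or row["language_code"] == language
    ]
    return Counter(row["sentiment"] for row in selected)


if __name__ == "__main__":
    rows = [
        {"language_code": "en", "sentiment": "POSITIVE"},
        {"language_code": "en", "sentiment": "NEGATIVE"},
        {"language_code": "fr", "sentiment": "POSITIVE"},
        {"language_code": "en", "sentiment": "POSITIVE"},
    ]
    print(sentiment_distribution(rows, language="en"))
```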

After crafting these charts, you can share them with business users in a dashboard such as the one below:

Example Dashboard Analyzing Amazon product reviews with the Amazon Comprehend API

Happy natural language processing!
