Language Detection

This plugin provides a recipe and a processor to detect languages in text data, among 114 languages.

Plugin information

Version: 1.0.0
Author: Dataiku (Damien JACQUEMART and Alex COMBESSIE)
Released: 2020-07
Last updated: 2020-07
License: Apache Software License
Source code: GitHub
Reporting issues: GitHub

With this plugin, you will be able to detect dominant languages in text data, among 114 languages in the ISO 639-1 standard. If you have multilingual data, this step is necessary to apply filtering and/or custom processing per language.

How to set up

Right after installing the plugin from the Store, you will be asked to create a code environment.

Creating code environment

You can choose Python version 3.5 or 3.6. Note that Conda is not supported, as one of the plugin dependencies (pycld3) is not available on Anaconda. If you are installing the plugin on macOS, be aware that pycld3 requires macOS 10.14 or later.
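
Under the hood, the plugin relies on pycld3, the Python bindings to Google's CLD3 language identification model. As a minimal sketch of what this dependency returns (shown here for illustration, separate from the plugin's own code):

    import cld3

    # CLD3 returns a prediction with the detected language code,
    # a confidence score between 0 and 1, and a reliability flag.
    prediction = cld3.get_language("Ceci est un texte écrit en français.")
    print(prediction.language)     # 'fr'
    print(prediction.probability)  # e.g. 0.99
    print(prediction.is_reliable)  # True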

Finally, you have the option to build container images for this code environment, if you plan to push down computation to a Kubernetes cluster.

How to use

Let’s assume that you have a Dataiku DSS project with a dataset containing text data in multiple languages. As an example, we will use the Wikipedia Language Identification database (WiLI-2018).

Language detection is available both as a standalone recipe and as a processor for the Prepare recipe.

Language Detection recipe

First, create the recipe from the + RECIPE button, or from the right panel if your dataset is selected.

Plugin Recipe Creation

After creation, you can choose the following recipe settings:

Language Detection Recipe Settings
  • Fill INPUT PARAMETERS
    • Specify which column in your input dataset contains text data
  • (Optional) Review ADVANCED parameters
    • You can activate the Expert mode to access advanced parameters.
    • The Language scope parameter allows you to restrict detection to the set of languages present in your dataset. You can leave it empty (default) to use all 114 possible languages.
    • The Minimum score parameter allows you to filter detected languages based on the confidence score of the prediction. You can leave it at 0 (default) to apply no filtering.
    • The Fallback language parameter covers cases where the detected language is not in your language scope, or the confidence score is below your specified minimum. You can leave it at “None” (default) to output an empty cell. The sketch below illustrates how these three parameters interact.
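
To make the interplay of these parameters concrete, here is an illustrative Python sketch of the scope, score, and fallback logic, using pycld3 as the detector. The parameter values are hypothetical, and the plugin's actual implementation may differ:

    import cld3

    # Hypothetical parameter values, mirroring the recipe settings above
    language_scope = {"en", "fr", "de"}  # empty scope = all 114 languages
    minimum_score = 0.5                  # 0 = no filtering on confidence
    fallback_language = None             # None = output an empty cell

    def detect_language(text):
        prediction = cld3.get_language(text)
        if prediction is None:
            return fallback_language
        in_scope = not language_scope or prediction.language in language_scope
        confident = prediction.probability >= minimum_score
        if in_scope and confident:
            return prediction.language
        return fallback_language

    print(detect_language("Der schnelle braune Fuchs"))  # 'de'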

Language Detection processor

You will first need to create a Prepare recipe. Then, click on + ADD A NEW STEP, browse to the Natural Language category and select Detect languages.

Prepare Recipe Processor Creation

After creation, you can specify the same parameters as in the Language Detection recipe (see above).

Language Detection Processor Settings

The advantage of using a processor instead of a recipe is that processors can be chained. For instance, you can create multiple Language Detection processors for different columns and then continue with other Natural Language processors, such as Simplify text. This is more efficient from a pipeline perspective, as it avoids creating multiple intermediate datasets.
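
As a rough analogy in plain Python (not the plugin's code), chaining one Detect languages step per column amounts to a single pass over the data, as in the following sketch with a hypothetical two-column dataset:

    import cld3
    import pandas as pd

    # Hypothetical multilingual dataset with two text columns
    df = pd.DataFrame({
        "title": ["Guten Morgen", "Bonjour tout le monde"],
        "body": ["Hello world", "Hola a todos"],
    })

    # One detection step per column, with no intermediate datasets
    for column in ["title", "body"]:
        df[column + "_language"] = df[column].map(
            lambda text: cld3.get_language(text).language
        )

    print(df)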

Putting It All Together: Visualization

You can now create charts to analyze the results. For instance, you can:

  • filter documents to focus on a specific set of languages
  • analyze the distribution of languages
  • identify which languages are well detected (assuming you have true labels)

After crafting these charts, you can share them with business users in a dashboard such as the one below:

Example Dashboard Analyzing Language Detection Results

Happy natural language processing!
