| Author | Dataiku (Damien JACQUEMART and Alex COMBESSIE) |
| License | Apache Software License |
With this plugin, you will be able to detect the dominant language in text data, among 114 languages of the ISO 639-1 standard. If you have multilingual data, this step is necessary before applying filtering and/or custom processing per language.
How to set up
Right after installing the plugin from the Store, you will be asked to create a code environment.
You can choose Python version 3.6 or 3.5. Note that Conda is not supported, as one of the plugin dependencies (pycld3) is not available on Anaconda. If you are installing the plugin on macOS, note that the pycld3 dependency requires at least macOS 10.14.
Finally, you have the option to build container images for this code environment, if you plan to push down computation to a Kubernetes cluster.
How to use
Let’s assume that you have a Dataiku DSS project with a dataset containing text data in multiple languages. As an example, we will use the Wikipedia Language Identification database (WiLI-2018).
Language detection is available both as a standalone recipe and as a processor for the Prepare recipe.
Language Detection recipe
First, create the recipe from the + RECIPE button, or from the right panel if your dataset is selected.
After creation, you can choose the following recipe settings:
- Fill INPUT PARAMETERS
- Specify which column in your input dataset contains text data
- (Optional) Review ADVANCED parameters
- You can activate the Expert mode to access advanced parameters.
- The Language scope parameter allows you to specify the set of languages in your specific dataset. You can leave it empty (default) to use all 114 possible languages.
- The Minimum score parameter allows you to filter detected languages based on the confidence score in the prediction. You can leave it at 0 (default) to apply no filtering.
- The Fallback language parameter is for cases where the detected language is not in your language scope, or the confidence score is below your specified minimum. You can leave it at “None” (default) to output an empty cell.
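The interaction between these three advanced parameters can be summarized in a few lines of Python. This is a hypothetical sketch of the decision logic, not the plugin's actual code; the function and argument names are illustrative:

```python
def resolve_language(detected_lang, score, language_scope=None,
                     minimum_score=0.0, fallback_language=None):
    """Apply the Language scope, Minimum score and Fallback language
    rules to a single prediction (illustrative sketch)."""
    in_scope = language_scope is None or detected_lang in language_scope
    if in_scope and score >= minimum_score:
        return detected_lang
    # Out of scope, or confidence below the threshold: use the fallback.
    return fallback_language  # None -> empty output cell

# A confident prediction within the scope is kept as-is:
resolve_language("fr", 0.98, language_scope={"en", "fr"}, minimum_score=0.5)
# A low-confidence prediction falls back (here, to English):
resolve_language("de", 0.30, minimum_score=0.5, fallback_language="en")
```

With the defaults (empty scope, minimum score of 0, fallback of "None"), every detected language passes through unchanged, which matches the behavior described above.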
Language Detection processor
You will first need to create a Prepare recipe. Then, click on + ADD A NEW STEP, browse to the Natural Language category and select Detect languages.
After creation, you can specify the same parameters as in the Language Detection recipe (see above).
The advantage of using a processor instead of a recipe is that you can chain processors. For instance, you can create multiple Language Detection processors for different columns and then continue with other Natural Language processors such as Simplify text. This is more efficient from a pipeline perspective, as it avoids creating multiple datasets.
Putting It All Together: Visualization
You can now create charts to analyze the results. For instance, you can:
- filter documents to focus on a specific set of languages
- analyze the distribution of languages
- identify which languages are well detected (assuming you have true labels)
After crafting these charts, you can share them with business users in a dashboard such as the one below:
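If you do have true labels (as in WiLI-2018), the per-language detection quality behind such a chart boils down to a simple accuracy computation. A self-contained sketch with made-up data; in practice the (true label, detected language) pairs would come from the recipe's output dataset:

```python
from collections import Counter

# Illustrative (true_label, detected_language) pairs, e.g. exported
# from the output dataset of the Language Detection recipe.
rows = [("en", "en"), ("en", "en"), ("fr", "fr"), ("fr", "en"), ("de", "de")]

totals, correct = Counter(), Counter()
for true_lang, detected in rows:
    totals[true_lang] += 1
    if detected == true_lang:
        correct[true_lang] += 1

# Per-language detection accuracy:
accuracy = {lang: correct[lang] / totals[lang] for lang in totals}
print(accuracy)  # -> {'en': 1.0, 'fr': 0.5, 'de': 1.0}
```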
Happy natural language processing!