Text Preparation

This plugin provides recipes to detect languages and correct misspellings in text data.

Plugin information

Version: 1.0.1
Author: Dataiku (Alex COMBESSIE, Damien JACQUEMART)
Released: 2020-09
Last updated: 2020-09
License: Apache Software License
Source code: GitHub
Reporting issues: GitHub

With this plugin, you will be able to:

  • Detect the dominant language of each text in a column, among 114 languages identified by their ISO 639-1 codes. If you have multilingual data, this step is necessary to apply custom preparation per language.
  • Identify and correct misspellings in a text column for 35 languages.

 

How to set up

Right after installing the plugin, you will need to build its code environment.

Code environment creation

If you are installing the plugin on macOS, note that the pycld3 dependency requires macOS 10.14 or later. Conda is also not supported, because pycld3 is not currently available on Anaconda Cloud.

If you want to push down computation to a Kubernetes cluster, you can build container images for this code environment on the same screen.

 

How to use

Let’s assume that you have a Dataiku DSS project with a dataset containing raw text data in multiple languages.

Navigate to the Flow, click on the + RECIPE button and open the Natural Language Processing menu. If your dataset is already selected, you can find the plugin directly in the right panel.

Plugin Recipe Creation

Language Detection recipe

Input

  • Dataset with a text column

Output

  • Dataset with 3 additional columns
    • ISO 639-1 language code
    • ISO 639-1 language name
    • Confidence score from 0 to 1
Language Detection Output Dataset

Settings

Language Detection Recipe Settings
  • Fill INPUT PARAMETERS
    • The Text column parameter lets you choose the column of your input dataset containing text data.
  • (Optional) Review ADVANCED parameters
    • You can activate the Expert mode to access advanced parameters.
      • The Language scope parameter lets you specify the set of languages you expect in your dataset. You can leave it empty (default) to detect all 114 languages.
      • The Minimum score parameter allows you to filter detected languages based on the confidence score in the prediction. You can leave it at 0 (default) to apply no filtering.
      • The Fallback language parameter is for cases where the detected language is not in your language scope, or the confidence score is below your specified minimum. You can leave it at “None” (default) to output an empty cell.
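
To make the interaction between the three advanced parameters concrete, here is a minimal Python sketch of the decision described above. It is not the plugin's actual code: the function name resolve_language and its argument names are hypothetical, and it assumes that a score exactly equal to the minimum is accepted.

```python
from typing import Optional, Set


def resolve_language(
    detected_code: str,
    score: float,
    language_scope: Optional[Set[str]] = None,
    minimum_score: float = 0.0,
    fallback_language: Optional[str] = None,
) -> Optional[str]:
    """Hypothetical helper: decide which language code to output for one row.

    detected_code and score stand for the raw prediction of a language
    detector. An empty or missing scope means "all languages are allowed".
    """
    in_scope = not language_scope or detected_code in language_scope
    if in_scope and score >= minimum_score:
        return detected_code
    # Out of scope, or confidence below the minimum: use the fallback.
    # None plays the role of the "None" default, i.e. an empty output cell.
    return fallback_language
```

With the defaults (empty scope, minimum score of 0, no fallback), every prediction is returned unchanged, which matches the "no filtering" behavior described for the default settings.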

Spell Checking recipe

Input

  • Dataset with a text column
  • (Optional) Custom vocabulary dataset
    • This dataset should contain a single column for words that should not be corrected
    • This input is case-sensitive
  • (Optional) Custom corrections dataset
    • This dataset should contain two columns, the first one for words and the second one for your custom correction
    • This input is also case-sensitive and will override the custom vocabulary input if the same word is present in both inputs.
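
The precedence between the two optional inputs can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than the plugin's implementation: resolve_word is a hypothetical helper, and the sample words are invented for the example.

```python
from typing import Dict, Optional, Set


def resolve_word(
    word: str,
    custom_vocabulary: Set[str],
    custom_corrections: Dict[str, str],
) -> Optional[str]:
    """Hypothetical helper: apply the custom inputs to a single word.

    Returns the text to keep for this word, or None when neither input
    mentions it and the regular spell checker should decide.
    Lookups are exact, so matching is case-sensitive.
    """
    # Custom corrections win when a word appears in both inputs.
    if word in custom_corrections:
        return custom_corrections[word]
    # Words in the custom vocabulary are kept as they are.
    if word in custom_vocabulary:
        return word
    return None


# Invented sample inputs, for illustration only.
vocabulary = {"Dataiku", "DSS"}
corrections = {"DSS": "Dataiku DSS", "teh": "the"}
```

Here "DSS" appears in both inputs, so the custom correction is applied, while "dataiku" in lowercase matches neither input because the lookups are case-sensitive.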

Output

  • Dataset with 3 additional columns
    • Corrected text
    • List of detected misspellings
    • Number of misspellings
Spell Checking Output Dataset
  • (Optional) Diagnosis dataset with spell checking information on each word
Spell Checking Diagnosis Dataset

Note that including this optional diagnosis dataset will increase the recipe runtime.

Settings

Spell Checking Recipe Settings
  • Fill INPUT PARAMETERS
    • The Text column parameter lets you choose the column of your input dataset containing text data.
    • The Language parameter lets you choose among 35 supported languages.
      • You can either choose a single language or select “Detected language column” to specify the Language column parameter.
      • This Language column parameter can use the output of the Language Detection recipe, or ISO 639-1 language codes computed by other means.
  • (Optional) Review ADVANCED parameters
    • You can activate the Expert mode to access advanced parameters.
      • The Edit distance parameter allows you to tune the maximum edit distance between a word and its potential correction.
      • The Ignore pattern parameter lets you define a regular expression matching words which should not be corrected.
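
As a rough illustration of what the two advanced parameters control, the following Python sketch filters candidate corrections by edit distance and skips words matching an ignore pattern. Several things here are assumptions: the plain Levenshtein distance is used (the plugin may rely on a different definition), the pattern is required to match the whole word, and the names edit_distance and candidate_corrections are hypothetical.

```python
import re
from typing import Iterable, List, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def candidate_corrections(
    word: str,
    dictionary: Iterable[str],
    max_edit_distance: int = 2,
    ignore_pattern: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: dictionary entries close enough to word."""
    # Assumption: a word is ignored when the pattern matches it entirely.
    if ignore_pattern and re.fullmatch(ignore_pattern, word):
        return []
    scored = [(edit_distance(word, entry), entry) for entry in dictionary]
    # Closest candidates first, ties broken alphabetically.
    return [entry for distance, entry in sorted(scored) if distance <= max_edit_distance]
```

A larger maximum edit distance lets more distant dictionary entries qualify as corrections, at the cost of more candidates to consider. An ignore pattern such as `[A-Z]+-\d+` would leave identifiers like ticket numbers untouched.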

Happy natural language processing!
