Text Preparation

Prepare text data with language detection, spell checking and text cleaning

Plugin information

Version 1.1.1
Author Dataiku (Alex COMBESSIE, Damien JACQUEMART)
Released 2020-09
Last updated 2020-12
License Apache Software License
Source code GitHub
Reporting issues GitHub

With this plugin, you will be able to:

  • Detect dominant languages among 114 languages
    • If you have multilingual data, this step is necessary to apply custom processing per language
  • Identify and correct misspellings in 36 languages
  • Tokenize, filter and lemmatize text data in 58 languages

Note that languages are defined as per the ISO 639-1 standard with 2-letter codes.
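As a minimal illustration, ISO 639-1 codes are lowercase two-letter identifiers that map to language names. The subset below is hypothetical, just a sketch of the convention used throughout the plugin's outputs:

```python
# Illustrative subset of ISO 639-1 two-letter codes -- the plugin
# supports far more languages than are listed here.
ISO_639_1 = {
    "en": "English",
    "fr": "French",
    "de": "German",
    "es": "Spanish",
    "ja": "Japanese",
}

def language_name(code: str) -> str:
    """Look up the language name for a 2-letter ISO 639-1 code."""
    return ISO_639_1[code.lower()]
```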

How to set up

Right after installing the plugin, you will need to build its code environment.

Code environment creation

Note that Python version 3.6 is required. If you are installing the plugin on macOS, the pycld3 dependency requires at least macOS 10.14. Finally, Conda is not supported, as pycld3 is not currently available on Anaconda Cloud.

If you want to push down computation to a Kubernetes cluster, you can build container images for this code environment on the same screen.

How to use

Let’s assume that you have a Dataiku DSS project with a dataset containing raw text data in multiple languages.

Navigate to the Flow, click on the + RECIPE button and access the Natural Language Processing menu. If your dataset is selected, you can directly find the plugin on the right panel.

Plugin Recipe Creation

Language Detection recipe

Detect dominant languages among 114 languages

Input

  • Dataset with a text column

Output

  • Dataset with 3 additional columns
    • ISO 639-1 language code
    • ISO 639-1 language name
    • Confidence score from 0 to 1
Language Detection Output Dataset

Settings

Language Detection Recipe Settings
  • Fill INPUT PARAMETERS
    • The Text column parameter lets you choose the column of your input dataset containing text data.
    • The Language scope parameter allows you to specify the set of languages in your specific dataset. You can leave it empty (default) to detect all 114 languages.
  • (Optional) Review ADVANCED parameters
You can activate the Expert mode to access advanced parameters.
      • The Minimum score parameter allows you to filter detected languages based on the confidence score in the prediction. You can leave it at 0 (default) to apply no filtering.
      • The Fallback language parameter is for cases where the detected language is not in your language scope, or the confidence score is below your specified minimum.
        • You can leave it at “None” (default) to output an empty cell.
        • If the fallback is used, the confidence score will be an empty cell.
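The interaction between the Language scope, Minimum score, and Fallback language parameters can be sketched as follows. This is a simplified illustration of the documented behavior, not the plugin's actual code; `apply_fallback` is a hypothetical helper:

```python
from typing import Optional, Set, Tuple

def apply_fallback(
    detected: str,
    score: float,
    scope: Optional[Set[str]] = None,
    min_score: float = 0.0,
    fallback: Optional[str] = None,
) -> Tuple[Optional[str], Optional[float]]:
    """Return (language, score) after applying scope and score filtering.

    If the detected language is outside the scope, or its confidence
    score is below min_score, the fallback language is returned and
    the score cell is left empty (None), mirroring the recipe docs.
    """
    in_scope = scope is None or detected in scope
    if in_scope and score >= min_score:
        return detected, score
    return fallback, None  # fallback used: confidence score is empty
```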

Spell Checking recipe

Identify and correct misspellings in 36 languages

Input

  • Dataset with a text column
  • (Optional) Custom vocabulary dataset
    • This dataset should contain a single column for words that should not be corrected.
This input is case-sensitive, so “NY” and “Ny” are considered different words.
    • If your data contains special terms or slang which are not in the bundled dictionaries, we recommend using this dataset.
  • (Optional) Custom corrections dataset
    • This dataset should contain two columns, the first one for words and the second one for your custom correction.
    • Words are also case-sensitive.
    • If the same word is present in both custom vocabulary and corrections, the custom correction will overrule the custom vocabulary.
    • Use this dataset if you wish to correct the special terms or slang contained in your data.
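The precedence between the two optional inputs can be sketched like this. It is a simplified illustration, not the plugin's implementation; `correct_word` and the pass-through `spellcheck` default are hypothetical:

```python
def correct_word(word, vocabulary=frozenset(), corrections=None,
                 spellcheck=lambda w: w):
    """Resolve one word against the custom inputs (both case-sensitive).

    Custom corrections overrule the custom vocabulary; words in the
    vocabulary are never sent to the underlying spell checker.
    """
    corrections = corrections or {}
    if word in corrections:    # custom correction takes precedence
        return corrections[word]
    if word in vocabulary:     # protected word: left unchanged
        return word
    return spellcheck(word)    # fall through to the spell checker
```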

Output

  • Dataset with 4 additional columns
    • Corrected text
    • Misspelled text
    • List of unique misspellings
    • Number of misspellings
Spell Checking Output Dataset
  • (Optional) Diagnosis dataset with spell checking information on each word
Spell Checking Diagnosis Dataset

Note that including this optional diagnosis dataset will increase the recipe runtime.

Settings

Spell Checking Recipe Settings
  • Fill INPUT PARAMETERS
    • The Text column parameter lets you choose the column of your input dataset containing text data.
    • The Language parameter lets you choose among 36 supported languages.
      • You can either choose a single language or select “Detected language column” to specify the Language column parameter.
      • This Language column parameter can use the output of the Language Detection recipe, or ISO 639-1 language codes computed by other means.
  • (Optional) Review ADVANCED parameters
    • You can activate the Expert mode to access advanced parameters.
      • The Edit distance parameter allows you to tune the maximum edit distance between a word and its potential correction.
      • The Ignore pattern parameter lets you define a regular expression matching words which should not be corrected.
This is useful if you work with special-domain data containing acronyms and codes.
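Both parameters can be illustrated with a plain Levenshtein edit distance and a hypothetical ignore pattern for all-caps acronyms. This is a sketch of the concepts, not the plugin's actual matching logic:

```python
import re

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical ignore pattern: skip all-caps acronyms and numeric codes.
IGNORE_PATTERN = re.compile(r"^[A-Z0-9]{2,}$")

def should_correct(word: str, candidate: str, max_edit_distance: int = 2) -> bool:
    """Correct only if the word is not ignored and the candidate is close enough."""
    if IGNORE_PATTERN.match(word):
        return False
    return edit_distance(word, candidate) <= max_edit_distance
```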

Text cleaning recipe

Tokenize, filter and lemmatize text data in 58 languages

Input

  • Dataset with a text column

Output

  • Dataset with additional columns
    • Cleaned text after tokenization, filtering and lemmatization
    • If the Keep filtered tokens parameter is activated, one column for each filter: punctuation, stopwords, numbers, etc.
Text Cleaning Output Dataset

Settings

Text Cleaning Recipe Settings
  • Fill INPUT PARAMETERS
    • The Text column parameter lets you choose the column of your input dataset containing text data.
    • The Language parameter lets you choose among 58 supported languages.
      • You can either choose a single language or select “Detected language column” to specify the Language column parameter.
      • This Language column parameter can use the output of the Language Detection recipe, or ISO 639-1 language codes computed by other means.
  • Review CLEANING PARAMETERS
    • Select which types of tokens to filter in the Token filters parameter, among:
      • Punctuation: if all token characters are within the Unicode “P” class
      • Stopword: if the token matches our per-language stopword lists
        • These stopword lists are based on spaCy and NLTK, with an additional human review done by native speakers at Dataiku
      • Number: if the token contains only digits (e.g. 11) or matches the written form of a number in the corresponding language (e.g. eleven)
      • Currency sign: if all token characters are within the Unicode “Sc” class
      • Datetime: if the token matches this regular expression for tokens like “10:04” or “2/10/2049”
      • Measure: if the token matches this rule for tokens like “17th” or “10km”
      • URL: if the token matches this rule for tokens starting with “http://”, “https://”, “www.” or matching a correct domain name
      • Email: if the token matches this regular expression
      • Username: if the token begins with “@”
      • Hashtag: if the token begins with “#”
      • Emoji: if one of the token characters is recognized as a Unicode emoji
      • Symbol: if the token is not an emoji and all token characters are within the Unicode “M” and “S” classes
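A few of the character-class rules above can be sketched with Python's standard unicodedata module. This illustrates the Unicode categories involved, not the plugin's code:

```python
import unicodedata

def is_punctuation(token: str) -> bool:
    # All characters fall in the Unicode "P" (punctuation) categories.
    return all(unicodedata.category(c).startswith("P") for c in token)

def is_currency_sign(token: str) -> bool:
    # All characters fall in the Unicode "Sc" (currency symbol) category.
    return all(unicodedata.category(c) == "Sc" for c in token)

def is_number(token: str) -> bool:
    # Digit check only -- the plugin also matches written forms like
    # "eleven", which would need per-language word lists not sketched here.
    return token.isdigit()
```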
    • Activate Lemmatization to simplify words to their “lemma” form, sometimes called “dictionary” form.
    • Activate Lowercase to convert all words to lowercase.
  • (Optional) Review ADVANCED parameters
    • You can activate the Expert mode to access advanced parameters.
      • The Unicode normalization parameter allows you to add a post-processing step to apply one of the unicodedata normalization methods.
      • If you activate the Keep filtered tokens parameter, you will get an additional column for each selected Token filters.
        • This is useful if you want to analyze the usage of tokens like emojis or hashtags in your text data.
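The Unicode normalization step corresponds to Python's unicodedata.normalize with one of the forms NFC, NFD, NFKC, or NFKD. For example:

```python
import unicodedata

# "é" can be one precomposed code point or "e" + a combining accent.
decomposed = "cafe\u0301"   # 5 code points: c, a, f, e, U+0301
composed = unicodedata.normalize("NFC", decomposed)  # 4 code points

# NFKC additionally folds compatibility characters, e.g. the "fi" ligature.
folded = unicodedata.normalize("NFKC", "\ufb01")
```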

Happy natural language processing!
