en

Text Analysis

Analyze text data with ontology tagging

Plugin information

Version 0.2.0
Author Dataiku (Jane BELLAICHE, Alex COMBESSIE)
Released 2021-06
Last updated 2023-04
License Apache Software License
Source code Github
Reporting issues Github

With this plugin, you can tag documents matching keywords within a corpus of text documents

Table of contents

How to set up

After installing the plugin, you need to build its code environment. 

Code environment creation
Code environment creation

Note that Python version 3.6 or 3.7 is required.

To use this plugin with containers, you will need to customize the base image. Please follow this documentation with this Dockerfile fragment.

How to use

Let’s assume that you have a Dataiku DSS project with a dataset containing raw text data in one or more languages. This text data must be stored as a single column in the dataset, with one row for each document.

To create your first recipe, navigate to the Flow, click on the + RECIPE button and access the Natural Language Processing menu. If your dataset is selected, you can directly find the plugin on the right panel.

Plugin recipe creation
Plugin recipe creation

Ontology tagging recipe

Assign tags to documents matching keywords within a corpus of text documents

Inputs

  • Document dataset: Dataset with a text column, and potentially a language column 
Document dataset example (kaggle)
Document dataset example (kaggle)
  • Ontology dataset
    You will also need to create or import a dataset composed of the following columns:

    •  a column such as “keywords”, which are the terms that will be searched for in the corpus of documents.
    • a column such as “tags”, which are the tags you wish to assign to your documents. A tag can be linked to a set of keywords.
    • (Optional) a column such as “category”, which are the categories you wish to group your tags by. All tags without categories will later be classified as “uncategorized”.
Ontology dataset example
Ontology dataset example

Settings

Ontology tagging settings
Ontology tagging settings
  •  Document dataset
    • Text column: name of the column containing the text to be tagged
    • Language: the language parameter lets you choose among 59 supported languages if your documents are monolingual. Else, the Multilingual option will let you specify an additional Language column, that is expected to contain the ISO 639-1 language code corresponding to each document. 
  • Ontology dataset
    • Tag column: name of the tag column
    • Keyword column: name of the keyword column
    •  (Optional) Category column: name of the category column
  •  Matching Parameters
    You can widen the search by activating the following matching parameters:

    • Ignore case: To match documents by ignoring case (e.g will try to match ‘guitar’, ‘Guitar’, ‘GUITAR’)
    • Ignore diacritics: To match documents by ignoring diacritics marks such as accents, cedillas, tildes
    • Lemmatize: To match documents by simplifying words to their lemma form (e.g., going → go, mice → mouse).
  • Output parameters
    There are three available output formats. All the output formats preserve the initial data of the input dataset

    • Default output “One row per document (array style) ”: A row for each document, with additional columns:  
      •  tag_list column: the assigned tags of the document (array-like). If you gave a category column in the matching parameters, you will have one column per category, each one containing a list of associated tags
      • tag_keywords column: the keywords that matched the document (array-like)
      • tag_sentences column: the concatenated sentences that matched the keywords (string-like).
    • Output “One row per document (JSON style) ”: A row for each document, with additional output columns:
      • tag_json_full column: a dictionary with details about occurrences for each tag, matched keywords for each tag, and the sentences where they appear. (object-like)
      • If you gave a category column in the matching parameters, you will also have a column tag_json_categories (object-like), which is a simplified version of the previous dictionary containing categories (keys) and associated tags (values)
    • Output “One row per match ”: A row per match per document. This mode may create an output dataset with more rows than the input dataset, as you may have multiple matching keywords in a single document, which would result in duplication of that document in the output for each matched keyword. In this case, the additional output columns would be:
      • tag column: a tag column (string-like)
      • keyword column: a keyword column (string-like)
      • sentence column: a sentence column (string-like)
      • category column: (Optional) a category column (string-like)

Output

  • Dataset with assigned tags for each document
Default output format one row per doc (array style)
Default output format: one row per doc (array style)

Advanced topics

Supported languages

Ontology tagging

Here are the 59 supported languages and their ISO 639-1 format :

  • Afrikaans (af) *
  • Albanian (sq) *
  • Arabic (ar) *
  • Armenian (hy) *
  • Basque (eu) *
  • Bengali (bn)
  • Bulgarian (bg) *
  • Catalan (ca)
  • Chinese (simplified) (zh) *
  • Croatian (hr)
  • Czech (cs)
  • Danish (da)
  • Dutch (nl)
  • English (en)
  • Estonian (et) *
  • Finnish (fi) *
  • French (fr)
  • German (de)
  • Greek (el)
  • Gujarati (gu) *
  • Hebrew (he) *
  • Hindi (hi) *
  • Hungarian (hu)
  • Icelandic (is) *
  • Indonesian (id)
  • Irish (ga) *
  • Italian (it)
  • Japanese (ja) *
  • Kannada (kn) *
  • Latvian (lv) *
  • Lithuanian (lt)
  • Luxembourgish (lb)
  • Macedonian (mk)
  • Malayalam (ml) *
  • Marathi (mr) *
  • Nepali (ne) *
  • Norwegian Bokmål (nb)
  • Persian (fa)
  • Polish (pl)
  • Portuguese (pt)
  • Romanian (ro)
  • Russian (ru)
  • Sanskrit (sa) *
  • Serbian (sr)
  • Sinhala (si) *
  • Slovak (sk) *
  • Slovenian (sl) *
  • Spanish (es)
  • Swedish (sv)
  • Tagalog (tl)
  • Tamil (ta) *
  • Tatar (tt) *
  • Telugu (te) *
  • Thai (th) *
  • Turkish (tr)
  • Ukrainian (uk) *
  • Urdu (ur)
  • Vietnamese (vi) *
  • Yoruba (yo) *

* Lemmatization not supported

Get the Dataiku Data Sheet

Learn everything you ever wanted to know about Dataiku (but were afraid to ask), including detailed specifications on features and integrations.

Get the data sheet