This plugin provides a tool for computing numerical sentence representations (also known as Sentence Embeddings).
These embeddings can be used as features to train a downstream machine learning model (for sentiment analysis, for example). They can also be used to compare texts and compute their similarity using your favorite distance or similarity measure (such as cosine similarity).
|Author||Dataiku (Hicham EL BOUKKOURI)|
|License||Apache Software License|
This plugin comes with:
A macro that allows you to download pre-trained word embeddings from various models: Word2vec, GloVe, FastText or ELMo.
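A recipe that allows you to use these vectors to compute sentence embeddings. This recipe relies on one of two possible aggregation methods: a simple average of the word embeddings (default) or a weighted average (so-called SIF embeddings).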
A second recipe that allows you to compute the similarity (or rather the distance) between pairs of texts. This is achieved by computing a text representation for each text column, just like in the previous recipe, and then computing the distance between the two representations using one of the following metrics:
Cosine distance: 1 - cos(x, y), where x and y are the sentences' vectors.
Euclidean distance: L2 distance between two vectors.
Absolute distance: L1 distance between two vectors.
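The three vector distances listed above can be sketched with NumPy as follows. This is only an illustration: the function names are hypothetical and are not taken from the plugin's code, and the cosine distance assumes that neither vector is all zeros.

```python
import numpy as np

def cosine_distance(x, y):
    """Return 1 - cos(x, y). Assumes neither vector is all zeros."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return 1.0 - float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def euclidean_distance(x, y):
    """Return the L2 distance between two vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.linalg.norm(diff))

def absolute_distance(x, y):
    """Return the L1 distance between two vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sum(np.abs(diff)))
```

Smaller values mean more similar texts for all three metrics. The cosine distance ignores vector length and only compares direction, which is one reason it is a common default for embeddings.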
This macro downloads the specified model's pre-trained embeddings into the specified managed folder of the flow. If the folder doesn't exist, the macro creates it first and then downloads the embeddings.
These are the available models: Word2vec, GloVe, FastText and ELMo.
Note: Unlike the other models, ELMo produces contextualized word embeddings. This means that the model processes the whole sentence in which a word occurs to produce a context-dependent representation. As a result, ELMo embeddings often capture meaning better, but they are also slower to compute.
This recipe creates sentence embeddings for the texts of a given column. The sentence embeddings are obtained using pre-trained word embeddings and one of the following two aggregation methods: a simple average aggregation (by default) or a weighted aggregation (so-called SIF embeddings).
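A minimal sketch of the default average aggregation is shown below. It assumes lowercased whitespace tokenization and a plain dictionary mapping words to NumPy vectors; both are simplifications chosen for illustration, and the plugin's actual preprocessing may differ. The function name `average_embedding` is hypothetical.

```python
import numpy as np

def average_embedding(text, word_vectors, dim):
    """Average the vectors of the known words in `text`.

    `word_vectors` is assumed to be a dict {word: np.ndarray of size dim}.
    Words missing from the dict are skipped; a text with no known word
    gets a zero vector.
    """
    tokens = text.lower().split()
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)
```

For example, with a toy vocabulary where `good` maps to `[1, 0]` and `movie` maps to `[0, 1]`, the text "Good movie" is represented by the midpoint of the two word vectors.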
How to use the recipe
Using the recipe is straightforward. After downloading the pre-trained word embeddings of your choice, plug in your dataset and the pre-trained vectors, select the column(s) containing your texts, choose an aggregation method and run the recipe.
Note: For SIF embeddings you can set advanced hyper-parameters such as the model's smoothing parameter and the number of components to extract.
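To make the two hyper-parameters concrete, here is a rough sketch of SIF-style aggregation in the spirit of Arora et al.: each word vector is weighted by `a / (a + p(w))`, where `a` is the smoothing parameter and `p(w)` is the word's relative frequency, and the projections onto the first `n_components` principal directions are then removed. This is not the plugin's implementation. In particular, the sketch estimates word frequencies from the input sentences themselves and uses whitespace tokenization, both of which are assumptions; the names `sif_embeddings`, `a` and `n_components` are hypothetical.

```python
from collections import Counter

import numpy as np

def sif_embeddings(sentences, word_vectors, dim, a=1e-3, n_components=1):
    """Weighted average of word vectors followed by common-component removal."""
    tokenized = [s.lower().split() for s in sentences]

    # Assumption: word probabilities are estimated from the input sentences.
    counts = Counter(t for tokens in tokenized for t in tokens)
    total = sum(counts.values())

    emb = np.zeros((len(sentences), dim))
    for i, tokens in enumerate(tokenized):
        known = [t for t in tokens if t in word_vectors]
        if not known:
            continue  # leave a zero vector for texts with no known word
        weights = np.array([a / (a + counts[t] / total) for t in known])
        vectors = np.array([word_vectors[t] for t in known])
        emb[i] = weights @ vectors / len(known)

    if n_components > 0:
        # Rows of vt are the principal directions of the sentence matrix.
        _, _, vt = np.linalg.svd(emb, full_matrices=False)
        pc = vt[:n_components]
        emb = emb - emb @ pc.T @ pc
    return emb
```

A larger `a` flattens the weights toward a plain average, while a smaller `a` down-weights frequent words more strongly. Removing more components strips more of the direction shared by all sentences.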
Note: You can also use your own custom word embeddings. To do that, you will need to create a managed folder and put the embeddings in a text file where each line corresponds to a different word embedding in the following format:
word emb1 emb2 emb3 ... embN
where word is the word itself and emb1 ... embN are the embedding values.
For example, if the word dataiku has the word vector [0.2, 1.2, 1, -0.6], then its corresponding line in the text file should be:
dataiku 0.2 1.2 1 -0.6
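The file format described above can be read with a few lines of Python. The sketch below is only meant to illustrate the format: `load_word_vectors` is a hypothetical helper, not part of the plugin, and it assumes values are separated by whitespace with one word per line.

```python
import numpy as np

def load_word_vectors(lines):
    """Parse lines of the form 'word emb1 emb2 ... embN' into a dict."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word = parts[0]
        vectors[word] = np.array([float(v) for v in parts[1:]])
    return vectors

# Example with the line from the text above
vectors = load_word_vectors(["dataiku 0.2 1.2 1 -0.6"])
```

Any iterable of strings works as input, so an open text file can be passed directly, for example `load_word_vectors(open(path, encoding="utf-8"))` with a `path` of your choosing.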
This recipe takes two text columns and computes the similarity (distance) of each pair of sentences. The similarity is based on sentence vectors computed using pre-trained word embeddings, which are compared using one of the available metrics: cosine distance (default), Euclidean distance (L2), absolute distance (L1) or earth-mover distance.
How to use the recipe
Using this recipe is similar to using the "Compute sentence embeddings" recipe. The only differences are that you now choose exactly two text columns and that you can choose a distance from the list of available distances.
Sanjeev Arora, Yingyu Liang and Tengyu Ma, A Simple but Tough-to-Beat Baseline for Sentence Embeddings
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.
P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer. Deep contextualized word representations. NAACL 2018.