Introduction to GloVe Embeddings
In the previous articles, we discussed what word embeddings are and how to train them from scratch or with word2vec models. This article is an intuitive guide to GloVe embeddings, a powerful word vector learning technique. We focus on why GloVe improves on word2vec in some ways and derive the cost function GloVe uses for training word vectors.
To recap, word embeddings map words into a vector space where similar words land close together and dissimilar words end up far apart. Word2vec models consider only local (context, target) word pairs when training word vectors, whereas GloVe does not rely solely on the local context of words but also incorporates global co-occurrence statistics.
Drawbacks of Word2Vec models
To refresh on the idea behind word2vec models: we considered local context and target words for training word embeddings, and it works quite well, so why not just stick with it? Because word2vec relies only on local information, the semantics learnt for a target word depend only on its surrounding context words. For example, consider the following sentence:
An apple a day keeps the doctor away
When trained with word2vec, the model cannot tell whether “the” is a special context word for “day” and “doctor” or simply a frequent stopword, because it never sees the global co-occurrence counts that would settle the question.
Global Vectors — GloVe
The GloVe model learns word vectors, or embeddings, from co-occurrence information, that is, how frequently words appear together in a large text corpus. Unlike word2vec, which is a predictive model, GloVe is a count-based model.
Consider a vocabulary of V words. The co-occurrence matrix X will be of size V x V, where Xij denotes the number of times word j appears in the context of word i. Let's use these two sentences to form a co-occurrence matrix.
I love NLP.
I love to write blogs.
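To make this concrete, here is a minimal sketch of how such a co-occurrence matrix could be built for the two sentences above, assuming lowercasing, whitespace tokenization and a symmetric context window of size 1 (all choices made for illustration, not prescribed by GloVe):

```python
import numpy as np

# Toy corpus: the two example sentences above
corpus = ["I love NLP.", "I love to write blogs."]
window = 1  # symmetric context window size (an illustrative choice)

# Simple tokenization: lowercase, drop the final period, split on spaces
tokens = [s.lower().strip(".").split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
word2id = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# X[i, j] = number of times word j appears in the context of word i
X = np.zeros((V, V))
for sent in tokens:
    for pos, word in enumerate(sent):
        i = word2id[word]
        for ctx in range(max(0, pos - window), min(len(sent), pos + window + 1)):
            if ctx != pos:
                X[i, word2id[sent[ctx]]] += 1

print(vocab)
print(X)  # a symmetric V x V matrix of raw co-occurrence counts
```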
Now, how do we obtain a metric that captures semantic similarity between words from this symmetric co-occurrence matrix? To do that, we need to look at three words at a time, as shown below.
The behavior of Pik / Pjk for different words [1]
Consider the expression
Pik / Pjk, where Pik = Xik / Xi
Pik is the probability of seeing word k in the context of word i, computed by dividing the number of times words i and k appeared together (Xik) by the total number of co-occurrences of word i (Xi).
Let words (i, j) be (ice, steam). The following inferences can be drawn from the table above:
- When word k = solid, the ratio Pik / Pjk will be very high (>1) as the word solid is similar to ice but irrelevant to steam.
- When word k = gas, the ratio Pik / Pjk will be very low (<1) as the word gas is similar to steam but irrelevant to ice.
- When word k is related to both words (e.g. water) or to neither (e.g. fashion), the ratio Pik / Pjk will be close to 1, since k is equally relevant (or irrelevant) to ice and steam.
So, if we can find a way to incorporate the ratio Pik / Pjk into how we compute word embeddings, we will achieve our goal of using global statistics when training word vectors.
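To see the ratio behave this way numerically, here is a small sketch with made-up toy counts for ice and steam; the numbers are purely illustrative (not the values reported in the paper), but they reproduce the qualitative pattern in the three bullets above:

```python
# Hypothetical toy counts: X_ik and the row totals X_i (illustrative values only)
X = {
    "ice":   {"solid": 80, "gas": 2,  "water": 300, "fashion": 1},
    "steam": {"solid": 3,  "gas": 70, "water": 290, "fashion": 1},
}
totals = {"ice": 10_000, "steam": 10_000}  # X_i = total co-occurrences of word i

def P(i, k):
    """P_ik = X_ik / X_i: probability of seeing word k in the context of word i."""
    return X[i][k] / totals[i]

for k in ["solid", "gas", "water", "fashion"]:
    print(f"k = {k:8s}  P_ik / P_jk = {P('ice', k) / P('steam', k):6.2f}")

# Expected pattern: >> 1 for 'solid', << 1 for 'gas', close to 1 for 'water' and 'fashion'
```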
To achieve this, we have three main problems to overcome, listed below:
- The ratio Pik / Pjk is a scalar, whereas the word vectors are high-dimensional, so there is a dimensional mismatch.
- With three entities (i, j and k) involved, it is difficult to form a loss function, so the expression should be reduced to only two terms.
- We only have an expression so far, not a concrete function F such that F(i, j, k) = Pik / Pjk.
Transforming independent variables of function F
Going forward, we will solve these three problems and see how doing so leads us to the GloVe word vector algorithm. Suppose we have a function F which gives us the ratio, as shown below.
F(wi, wj, uk) = Pik / Pjk        (1)
where w and u are two separate sets of embeddings; as described in the GloVe paper, keeping two sets helps the model reduce overfitting, and they differ only in their random initialization. Since the word vectors wi, wj, uk live in a linear vector space, we can perform arithmetic operations on them, e.g.
wking - wmale + wfemale ≈ wqueen
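If you want to see this arithmetic in action on real vectors, one quick way is to load a pretrained GloVe model through gensim's downloader and run the standard king/man/woman analogy (the model name below is one of gensim's bundled pretrained sets; downloading it needs an internet connection):

```python
import gensim.downloader as api

# 100-dimensional GloVe vectors trained on Wikipedia + Gigaword
glove = api.load("glove-wiki-gigaword-100")

# king - man + woman should land near queen
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```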
Transforming eq. (1) into F(wi - wj, uk) = Pik/Pjk has added advantages: the difference vector wi - wj naturally encodes the contrast between words i and j, which is exactly what the ratio Pik/Pjk measures in the embedding space.
Consider the four scenarios shown in the figure below: starting from the vector wi - wj and measuring the distance to the target word k (which correlates with the reciprocal of Pik), the length of the dashed lines clearly changes as different words k are considered. So it is a good idea to start with wi - wj.
Changes of target word k with respect to vector wi-wj
Vector — Scalar Problem
We can overcome this with a simple fix: introduce a transpose and a dot product between the two vectors in the following way.
F((wi-wj)T . uk) = Pik/Pjk
If each word embedding is of size D x 1, then (wi - wj)T is of size 1 x D, and multiplying it with uk (D x 1) gives a scalar as output.
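A quick numpy check of the shapes involved (D here is arbitrary and the vectors are random placeholders):

```python
import numpy as np

D = 5  # embedding dimension (arbitrary for this check)
w_i, w_j, u_k = np.random.randn(3, D)  # three random word vectors of length D

score = (w_i - w_j) @ u_k  # inner product of two length-D vectors
print(score.shape)         # () -> a single scalar, matching the scalar ratio P_ik / P_jk
```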
Finding out the function F
Assume the function F is a homomorphism between the additive and multiplicative groups, i.e. F(A - B) can be written as F(A) / F(B). This gives
F(wi * uk - wj * uk) = F(wi * uk) / F(wj * uk) = Pik / Pjk
In the paper [1], the relation F(wi * uk) / F(wj * uk) = Pik / Pjk is satisfied by setting
F(wi * uk) = cPik for some constant c
The constant cancels in the ratio, so we can simply take c = 1.
As you have probably guessed by now, the exp function satisfies the homomorphism property. Therefore,
exp(wi * uk) = Pik = Xik / Xi, which gives
wi * uk = log(Pik) = log(Xik) - log(Xi)
Since log(Xi) is independent of k, we can move it to the left-hand side:
wi * uk + log(Xi) = log(Xik)
Finally, expressing log(Xi) in neural network jargon as a bias term bwi, and adding a second bias buk for the context word to keep the expression symmetric, we obtain
wi * uk + bwi + buk - log(Xik) = 0, where bwi and buk are the bias terms of the network.
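Putting the soft constraint into code for a single (i, k) pair (random initial vectors and a toy count; all values here are hypothetical):

```python
import numpy as np

D = 50
rng = np.random.default_rng(0)

# One word vector, one context vector and their biases, randomly initialized
w_i, u_k = rng.normal(scale=0.1, size=(2, D))
b_wi = b_uk = 0.0
X_ik = 23.0  # toy co-occurrence count for the pair (i, k)

# The soft constraint: zero for perfect embeddings, non-zero otherwise
residual = w_i @ u_k + b_wi + b_uk - np.log(X_ik)
print(residual)  # training will push this toward 0 for every co-occurring pair
```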
Cost Function
Based on the soft constraint equation above, we get exactly 0 for perfect word embedding vectors, so our goal is to minimize the objective function J below:
J = Σ (i, j = 1..V) f(Xij) * (wi * uj + bwi + buj - log(Xij))^2
Here V is the size of the vocabulary, and bwi and buj are the scalar bias terms associated with words i and j, respectively.
Some word pairs i, j may co-occur rarely or never, in which case the term log(Xij) blows up (log(0) is undefined), while very frequent pairs would otherwise dominate the loss. To handle both issues, the authors added the weighting function f(x) shown below:
f(x) = (x / xmax)^α if x < xmax, else 1
Weighting function f with α = 3 / 4 [1]
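Putting the objective and the weighting function together, here is a minimal, self-contained numpy sketch of GloVe training with plain gradient descent. It is a sketch under assumptions, not the reference implementation: the paper trains with AdaGrad, and the hyperparameters below (x_max, alpha, learning rate, epochs, the toy matrix) are illustrative choices.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(x): (x / x_max)^alpha below x_max, 1 above it."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def train_glove(X, dim=10, epochs=200, lr=0.05, seed=0):
    """Minimal GloVe trainer over a dense co-occurrence matrix X (V x V)."""
    V = X.shape[0]
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(V, dim))   # word vectors w_i
    U = rng.normal(scale=0.1, size=(V, dim))   # context vectors u_j
    bw = np.zeros(V)                           # word biases
    bu = np.zeros(V)                           # context biases

    # Only non-zero counts contribute to the loss (f(0) = 0 by convention)
    ii, jj = np.nonzero(X)
    f = weight(X[ii, jj])
    logx = np.log(X[ii, jj])

    for _ in range(epochs):
        diff = np.sum(W[ii] * U[jj], axis=1) + bw[ii] + bu[jj] - logx
        loss = np.sum(f * diff ** 2)

        # Gradients of the weighted squared error, applied with plain SGD
        g = 2.0 * f * diff                     # dJ/d(diff), one value per pair
        gW = g[:, None] * U[jj]                # gradient w.r.t. w_i for each pair
        gU = g[:, None] * W[ii]                # gradient w.r.t. u_j for each pair
        np.add.at(W, ii, -lr * gW)
        np.add.at(U, jj, -lr * gU)
        np.add.at(bw, ii, -lr * g)
        np.add.at(bu, jj, -lr * g)

    return W + U, loss                         # the paper sums W and U for the final vectors

# Toy usage with a small random symmetric co-occurrence matrix
X_toy = np.random.default_rng(1).integers(0, 5, size=(6, 6)).astype(float)
X_toy = (X_toy + X_toy.T) / 2.0
embeddings, final_loss = train_glove(X_toy)
print(embeddings.shape, final_loss)
```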
Advantages
- Fast training: the loss decomposes over the non-zero entries of the co-occurrence matrix, so training parallelizes easily across those entries, which is harder to do with word2vec's sequential pass over context windows.
- Scales to very large corpora and also gives good performance on small corpora.
Drawbacks
- Memory intensive: to keep training fast, the co-occurrence matrix is held in RAM, typically as a hash map whose counts are incremented while scanning the corpus (see the sketch below), which can be costly for large vocabularies.
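For reference, the in-memory accumulation this bullet describes typically looks like a nested hash map of counts, along these lines (a rough sketch; real implementations shard or stream the counts to keep memory bounded):

```python
from collections import defaultdict

def accumulate_cooccurrences(sentences, window=5):
    """Accumulate co-occurrence counts in a hash map: counts[word][context] = X_ij."""
    counts = defaultdict(lambda: defaultdict(float))
    for sent in sentences:
        for pos, word in enumerate(sent):
            for ctx in range(max(0, pos - window), min(len(sent), pos + window + 1)):
                if ctx != pos:
                    # GloVe weights context words by 1 / distance from the target word
                    counts[word][sent[ctx]] += 1.0 / abs(ctx - pos)
    return counts

counts = accumulate_cooccurrences([["i", "love", "nlp"],
                                   ["i", "love", "to", "write", "blogs"]])
print(dict(counts["love"]))
```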
Conclusion
So, we have seen how the GloVe model exploits the main benefit of count data, capturing global co-occurrence statistics, which makes it a powerful model for the unsupervised learning of word representations; the original paper reports that it outperforms other models on word analogy, named entity recognition and word similarity tasks.
References
[1] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. EMNLP 2014 (original paper).