2014 Word2vecExplainedDerivingMikolo


Subject Headings: Word Vector, Fixed-Length Vector Representation, word2vec Algorithm.

Notes

Cited By

Quotes

Abstract

The word2vec software of Tomáš Mikolov and colleagues[1] has gained a lot of traction lately, and provides state-of-the-art word embeddings. The learning models behind the software are described in two research papers. We found the description of the models in these papers to be somewhat cryptic and hard to follow.

While the motivations and presentation may be obvious to the neural networks language-modeling crowd, we had to struggle quite a bit to figure out the rationale behind the equations.

This note is an attempt to explain equation (4) (negative sampling) in “Distributed Representations of Words and Phrases and their Compositionality” by Tomáš Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean.

1 The skip-gram model

The departure point of the paper is the skip-gram model. In this model we are given a corpus of words [math]\displaystyle{ w }[/math] and their contexts [math]\displaystyle{ c }[/math]. We consider the conditional probabilities [math]\displaystyle{ p(c \mid w) }[/math], and given a corpus [math]\displaystyle{ Text }[/math], the goal is to set the parameters [math]\displaystyle{ \theta }[/math] of [math]\displaystyle{ p(c \mid w; \theta) }[/math] so as to maximize the corpus probability:

$\displaystyle \arg \max _{\theta} \prod_{w \in \text{Text}}\left[\prod_{c \in C(w)} p(c \mid w ; \theta)\right]$ (1)

in this equation, $C(w)$ is the set of contexts of word $w$. Alternatively:

$\displaystyle \arg \max _{\theta} \prod_{(w, c) \in D} p(c \mid w ; \theta)$ (2)

here $D$ is the set of all word and context pairs we extract from the text.
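
To make the notation concrete, here is a minimal Python sketch of how the pair set $D$ of equation (2) can be extracted from a tokenized sentence. The function name and the fixed symmetric window are our own illustrative assumptions; the actual software uses the dynamic windows and pruning described in Section 3.

    def extract_pairs(tokens, k=2):
        """Collect the (word, context) pairs D used in objective (2).

        Each word contributes its neighbours within a window of size k on
        each side as its contexts.
        """
        pairs = []
        for i, w in enumerate(tokens):
            for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
                if j != i:
                    pairs.append((w, tokens[j]))
        return pairs

    # D for a toy sentence with window size 2.
    D = extract_pairs("the quick brown fox jumps".split(), k=2)
    print(D[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]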

1.1 Parameterization of the skip-gram model

One approach for parameterizing the skip-gram model follows the neural-network language models literature, and models the conditional probability [math]\displaystyle{ p(c \mid w; \theta) }[/math] using soft-max:

$p(c \mid w; \theta) = \dfrac{e^{v_c \cdot v_w}}{\sum_{c' \in C} e^{v_{c'} \cdot v_w}}$ (3)

where [math]\displaystyle{ v_c }[/math] and [math]\displaystyle{ v_w \in \mathbb{R}^d }[/math] are vector representations for [math]\displaystyle{ c }[/math] and [math]\displaystyle{ w }[/math] respectively, and [math]\displaystyle{ C }[/math] is the set of all available contexts [2]. The parameters $\theta$ are $v_{c_i}$, $v_{w_i}$ for $w \in V,\; c \in C,\; i \in 1, \ldots , d$ (a total of $|C| \times |V| \times d$ parameters). We would like to set the parameters such that the product (2) is maximized. Taking the log of (2) and substituting the soft-max (3) yields:

$\displaystyle \arg \max _{\theta} \sum_{(w, c) \in D} \log p(c \mid w)=\sum_{(w, c) \in D}\left(\log e^{v_{c} \cdot v_{w}}-\log \sum_{c^{\prime}} e^{v_{c^{\prime}} \cdot v_{w}}\right)$ (4)
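
As a sanity check on the notation, the following NumPy sketch (our own illustration, not the word2vec implementation) evaluates objective (4) directly, assuming the word and context vectors are stored as rows of two matrices:

    import numpy as np

    def log_softmax_objective(pairs, v_w, v_c):
        """Log-likelihood of objective (4) under the soft-max model (3).

        v_w: |V| x d matrix of word vectors; v_c: |C| x d matrix of context
        vectors; `pairs` holds (word_index, context_index) tuples, i.e. D.
        The normalizer sums over every context vector for every pair, which
        is exactly the expensive term discussed below.
        """
        total = 0.0
        for w, c in pairs:
            scores = v_c @ v_w[w]  # v_{c'} . v_w for every context c'
            total += scores[c] - np.log(np.sum(np.exp(scores)))
        return total

    # Toy example: d = 5, |V| = |C| = 10, random vectors.
    rng = np.random.default_rng(0)
    v_w, v_c = rng.normal(size=(10, 5)), rng.normal(size=(10, 5))
    print(log_softmax_objective([(0, 1), (2, 3)], v_w, v_c))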

An assumption underlying the embedding process is the following:

Assumption: maximizing objective (4) will result in good embeddings $v_w\ \forall w \in V$, in the sense that similar words will have similar vectors.

It is not clear to us at this point why this assumption holds.

While objective (4) can be computed, it is computationally expensive to do so, because the term $p\left(c \mid w; \theta\right)$ is very expensive to compute due to the summation $\displaystyle \sum_{c'\in C} e^{v_{c'} \cdot v_w}$ over all the contexts $c'$ (there can be hundreds of thousands of them). One way of making the computation more tractable is to replace the softmax with a hierarchical softmax. We will not elaborate on this direction.

2 Negative Sampling

Mikolov et al. present the negative-sampling approach as a more efficient way of deriving word embeddings. While negative-sampling is based on the skip-gram model, it is in fact optimizing a different objective. What follows is the derivation of the negative-sampling objective.

Consider a pair $(w, c)$ of word and context. Did this pair come from the training data? Let’s denote by $p(D = 1 \mid w, c)$ the probability that $(w, c)$ came from the corpus data. Correspondingly, $p(D = 0 \mid w, c) = 1 - p(D = 1 \mid w, c)$ will be the probability that $(w, c)$ did not come from the corpus data. As before, assume there are parameters $\theta$ controlling the distribution: $p(D = 1 \mid w, c; \theta)$.
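
The quoted excerpt stops before the derivation is completed. In the standard derivation, $p(D = 1 \mid w, c; \theta)$ is modeled as $\sigma(v_c \cdot v_w) = \frac{1}{1 + e^{-v_c \cdot v_w}}$, and each observed pair is contrasted with $k$ sampled "negative" contexts. The NumPy sketch below is our own illustration of the resulting per-pair term; for simplicity the negatives are drawn uniformly here, whereas word2vec samples them from a smoothed unigram distribution.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def neg_sampling_term(w, c, v_w, v_c, k=5, rng=None):
        """Negative-sampling term for one observed (word, context) pair.

        Models p(D=1 | w, c) = sigmoid(v_c . v_w); the k sampled contexts
        play the role of D=0 pairs. They are drawn uniformly for simplicity,
        whereas word2vec samples from a smoothed unigram distribution.
        """
        if rng is None:
            rng = np.random.default_rng(0)
        positive = np.log(sigmoid(v_c[c] @ v_w[w]))
        negatives = rng.integers(0, v_c.shape[0], size=k)
        return positive + np.sum(np.log(sigmoid(-(v_c[negatives] @ v_w[w]))))

    # Toy example: d = 5, |V| = |C| = 10, random vectors.
    rng = np.random.default_rng(1)
    v_w, v_c = rng.normal(size=(10, 5)), rng.normal(size=(10, 5))
    print(neg_sampling_term(0, 1, v_w, v_c, k=3))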

2.1 Remarks

3 Context definitions

This section lists some peculiarities of the contexts used in the word2vec software, as reflected in the code. Generally speaking, for a sentence of $n$ words $w_1, \ldots, w_n$, the contexts of a word $w_i$ come from a window of size $k$ around the word: $C(w_i) = w_{i-k}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+k}$, where $k$ is a parameter. However, there are two subtleties:

Dynamic window size
the window size that is being used is dynamic: the parameter $k$ denotes the maximal window size. For each word in the corpus, a window size $k'$ is sampled uniformly from $1, \ldots, k$.
Effect of subsampling and rare-word pruning
word2vec has two additional parameters for discarding some of the input words: words appearing less than min-count times are not considered as either words or contexts, and in addition frequent words (as defined by the sample parameter) are down-sampled. Importantly, these words are removed from the text before generating the contexts. This has the effect of increasing the effective window size for certain words. According to Mikolov et al., sub-sampling of frequent words improves the quality of the resulting embedding on some benchmarks. The original motivation for sub-sampling was that frequent words are less informative. Here we see another explanation for its effectiveness: the effective window size grows, including context-words which are both content-full and linearly far away from the focus word, thus making the similarities more topical. Both subtleties are illustrated in the sketch after this list.
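
The following Python sketch (our own approximation, not the actual word2vec C code) illustrates both subtleties: pruned and subsampled words are removed before windows are drawn, and the window size is re-sampled uniformly from $1, \ldots, k$ for each focus word.

    import random

    def contexts(tokens, k, keep):
        """Approximate word2vec-style context generation.

        keep(word) stands in for min-count pruning and frequent-word
        subsampling; discarded words are removed before windows are drawn,
        so a pruned neighbour effectively widens the window of the words
        around it. The window size is re-sampled for each focus word.
        """
        kept = [w for w in tokens if keep(w)]
        pairs = []
        for i, w in enumerate(kept):
            k_i = random.randint(1, k)  # dynamic window size in 1..k
            lo, hi = max(0, i - k_i), min(len(kept), i + k_i + 1)
            pairs.extend((w, kept[j]) for j in range(lo, hi) if j != i)
        return pairs

    random.seed(0)
    sentence = "the cat sat on the mat".split()
    # Pretend "the" is a frequent word that was subsampled away.
    print(contexts(sentence, k=2, keep=lambda w: w != "the"))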

4 Why does this produce good word representations?

Good question. We don’t really know. The distributional hypothesis states that words in similar contexts have similar meanings. The objective above clearly tries to increase the quantity [math]\displaystyle{ v_w · v_c }[/math] for good word-context pairs, and decrease it for bad ones. Intuitively, this means that words that share many contexts will be similar to each other (note also that contexts sharing many words will also be similar to each other). This is, however, very hand-wavy. Can we make this intuition more precise? We’d really like to see something more formal.

Footnotes

  1. https://code.google.com/p/word2vec/
  2. Throughout this note, we assume that the words and the contexts come from distinct vocabularies, so that, for example, the vector associated with the word dog will be different from the vector associated with the context dog. This assumption follows the literature, where it is not motivated. One motivation for making this assumption is the following: consider the case where both the word dog and the context dog share the same vector v. Words hardly appear in the contexts of themselves, and so the model should assign a low probability to p(dog|dog), which entails assigning a low value to v · v which is impossible.

References

  • [1] Tomáš Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. (2013). “Efficient Estimation of Word Representations in Vector Space.” CoRR, abs/1301.3781.
  • [2] Tomáš Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. (2013). “Distributed Representations of Words and Phrases and Their Compositionality.” In: Advances in Neural Information Processing Systems, 26.


Author: Yoav Goldberg, Omer Levy
Title: word2vec Explained: Deriving Mikolov Et Al.'s Negative-sampling Word-embedding Method
Year: 2014