# 2014 NeuralWordEmbeddingAsImplicitMa

- (Levy & Goldberg, 2014) ⇒ Omer Levy, and Yoav Goldberg. (2014). “Neural Word Embedding As Implicit Matrix Factorization.” In: Advances in Neural Information Processing Systems.

**Subject Headings:** SPPMI, Continuous Dense Distributional Word Model Training Algorithm.

## Notes

## Cited By

### 2015

- (Vilnis & McCallum, 2015) ⇒ Luke Vilnis, and Andrew McCallum. (2015). “Word Representations via Gaussian Embedding.” In: arXiv preprint arXiv:1412.6623, submitted to ICLR 2015.
- QUOTE: … Related work in probabilistic matrix factorization (Mnih & Salakhutdinov, 2007) embeds rows and columns as Gaussians, and some forms of this do provide each row and column with its own variance (Salakhutdinov & Mnih, 2008). Given the parallels between embedding models and matrix factorization (Deerwester et al., 1990; Riedel et al., 2013; Levy & Goldberg, 2014), this is relevant to our approach.

### 2014

- Radim Řehůřek. (2014-12-23). http://radimrehurek.com/2014/12/making-sense-of-word2vec/
- QUOTE: In a manner analogous to GloVe, Levy and Goldberg analyzed the objective function of word2vec with negative sampling. … “SPPMI” name simply reflects that we’re subtracting [math]\log(k)[/math] from PMI (“shifting”) ...

## Quotes

### Abstract

We analyze skip-gram with negative-sampling (SGNS), a word embedding method introduced by Mikolov et al., and show that it is implicitly factorizing a word-context matrix, whose cells are the pointwise mutual information (PMI) of the respective word and context pairs, shifted by a global constant. We find that another embedding method, NCE, is implicitly factorizing a similar matrix, where each cell is the (shifted) log conditional probability of a word given its context. We show that using a sparse Shifted Positive PMI word-context matrix to represent words improves results on two word similarity tasks and one of two analogy tasks. When dense low-dimensional vectors are preferred, exact factorization with SVD can achieve solutions that are at least as good as SGNS's solutions for word similarity tasks. On analogy questions SGNS remains superior to SVD. We conjecture that this stems from the weighted nature of SGNS's factorization.

### 1. Introduction

Most tasks in natural language processing and understanding involve looking at words, and could benefit from word representations that do not treat individual words as unique symbols, but instead reflect similarities and dissimilarities between them. The common paradigm for deriving such representations is based on the distributional hypothesis of Harris [15], which states that words in similar contexts have similar meanings. This has given rise to many word representation methods in the NLP literature, the vast majority of which can be described in terms of a word-context matrix M in which each row i corresponds to a word, each column j to a context in which the word appeared, and each matrix entry [math]M_{ij}[/math] corresponds to some association measure between the word and the context. Words are then represented as rows in M or in a dimensionality-reduced matrix based on M.
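As a concrete sketch of such a matrix (illustrative code, not from the paper), the following builds a small word-context matrix M from observed (word, context) pairs, using the raw co-occurrence count as the association measure; the paper focuses on PMI-based measures instead:

```python
def build_count_matrix(pairs):
    """Build a word-context matrix M from observed (word, context) pairs.

    Rows index words, columns index contexts, and M[i][j] is the raw
    co-occurrence count -- one simple choice of association measure.
    """
    words = sorted({w for w, _ in pairs})
    contexts = sorted({c for _, c in pairs})
    row = {w: i for i, w in enumerate(words)}
    col = {c: j for j, c in enumerate(contexts)}
    M = [[0] * len(contexts) for _ in words]
    for w, c in pairs:
        M[row[w]][col[c]] += 1
    return words, contexts, M

words, contexts, M = build_count_matrix(
    [("cat", "sat"), ("cat", "sat"), ("dog", "sat"), ("dog", "ran")]
)
```

A word's representation is then its row of M (or of a dimensionality-reduced version of M).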

Recently, there has been a surge of work proposing to represent words as dense vectors, derived using various training methods inspired from neural-network language modeling [3, 9, 23, 21]. These representations, referred to as “neural embeddings” or “word embeddings”, have been shown to perform well in a variety of NLP tasks [26, 10, 1]. In particular, a sequence of papers by Mikolov and colleagues [20, 21] culminated in the skip-gram with negative-sampling (SGNS) training method which is both efficient to train and provides state-of-the-art results on various linguistic tasks. The training method (as implemented in the word2vec software package) is highly popular, but not well understood. While it is clear that the training objective follows the distributional hypothesis – by trying to maximize the dot-product between the vectors of frequently occurring word-context pairs, and minimize it for random word-context pairs – very little is known about the quantity being optimized by the algorithm, or the reason it is expected to produce good word representations.

In this work, we aim to broaden the theoretical understanding of neural-inspired word embeddings. Specifically, we cast SGNS’s training method as weighted matrix factorization, and show that its objective is implicitly factorizing a shifted PMI matrix – the well-known word-context PMI matrix from the word-similarity literature, shifted by a constant offset. A similar result holds also for the NCE embedding method of Mnih and Kavukcuoglu [24]. While it is impractical to directly use the very high-dimensional and dense shifted PMI matrix, we propose to approximate it with the positive shifted PMI matrix (Shifted PPMI), which is sparse. Shifted PPMI is far better at optimizing SGNS’s objective, and performs slightly better than word2vec derived vectors on several linguistic tasks.
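A minimal sketch of the Shifted PPMI construction (an illustrative implementation, not the authors’ code): each cell keeps [math]\max(PMI(w,c) - \log k, 0)[/math], so only positive entries are stored and the matrix stays sparse.

```python
import math
from collections import Counter

def shifted_ppmi(pair_counts, k):
    """Sparse Shifted Positive PMI: cell (w, c) holds
    max(PMI(w, c) - log k, 0); zero cells are simply omitted."""
    total = sum(pair_counts.values())
    w_counts, c_counts = Counter(), Counter()
    for (w, c), n in pair_counts.items():
        w_counts[w] += n
        c_counts[c] += n
    sppmi = {}
    for (w, c), n in pair_counts.items():
        # PMI(w, c) = log( #(w,c) * |D| / (#(w) * #(c)) )
        pmi = math.log(n * total / (w_counts[w] * c_counts[c]))
        shifted = pmi - math.log(k)
        if shifted > 0:
            sppmi[(w, c)] = shifted
    return sppmi
```

Because cells with non-positive shifted PMI are dropped rather than stored as zeros, the representation remains a sparse dictionary even over large vocabularies.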

Finally, we suggest a simple spectral algorithm that is based on performing SVD over the Shifted PPMI matrix. The spectral algorithm outperforms both SGNS and the Shifted PPMI matrix on the word similarity tasks, and is scalable to large corpora. However, it lags behind the SGNS-derived representation on word-analogy tasks. We conjecture that this behavior is related to the fact that SGNS performs weighted matrix factorization, giving more influence to frequent pairs, as opposed to SVD, which gives the same weight to all matrix cells. While the weighted and non-weighted objectives share the same optimal solution (perfect reconstruction of the shifted PMI matrix), they result in different generalizations when combined with dimensionality constraints.
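The spectral approach can be sketched as follows (an illustrative NumPy version, assuming a dense SPPMI matrix small enough to factorize directly; the symmetric square-root weighting of the singular values is one of the variants the paper discusses):

```python
import numpy as np

def svd_embeddings(M, d):
    """Factorize a (dense, for illustration) SPPMI matrix M with truncated SVD.

    Uses the symmetric variant W = U_d * sqrt(Sigma_d), C = V_d * sqrt(Sigma_d),
    so that W @ C.T is the rank-d approximation of M.
    """
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    W = U[:, :d] * np.sqrt(S[:d])
    C = Vt[:d].T * np.sqrt(S[:d])
    return W, C
```

Note that SVD minimizes unweighted squared reconstruction error over all cells, whereas SGNS weights cells by their corpus frequency – the distinction the paper conjectures explains their different behavior on analogy tasks.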

### 2 Background: Skip-Gram with Negative Sampling (SGNS)

Our departure point is SGNS – the skip-gram neural embedding model introduced in [20] trained using the negative-sampling procedure presented in [21]. In what follows, we summarize the SGNS model and introduce our notation. A detailed derivation of the SGNS model is available in [14].

- Setting and Notation

The skip-gram model assumes a corpus of words [math]w \in V_W[/math] and their contexts [math]c \in V_C[/math], where [math]V_W[/math] and [math]V_C[/math] are the word and context vocabularies. In [20, 21] the words come from an unannotated textual corpus of words [math]w_1, w_2, \ldots, w_n[/math] (typically [math]n[/math] is in the billions) and the contexts for word [math]w_i[/math] are the words surrounding it in an L-sized window [math]w_{i-L},...,w_{i-1},w_{i+1},...,w_{i+L}[/math]. Other definitions of contexts are possible [18]. We denote the collection of observed words and context pairs as [math]D[/math]. We use [math]\#(w,c)[/math] to denote the number of times the pair [math](w,c)[/math] appears in [math]D[/math]. Similarly, [math]\#(w) = \sum_{c' \in V_C} \#(w,c')[/math] and [math]\#(c) = \sum_{w' \in V_W} \#(w',c)[/math] are the number of times [math]w[/math] and [math]c[/math] occurred in [math]D[/math], respectively.
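A small sketch of this setup (illustrative code): extract [math]D[/math] from a token stream with an L-sized window, then check that the marginal count [math]\#(w)[/math] agrees with summing [math]\#(w,c')[/math] over contexts:

```python
from collections import Counter

def extract_pairs(tokens, L):
    """Collect D: for each w_i, the contexts are the words in the
    window w_{i-L}, ..., w_{i-1}, w_{i+1}, ..., w_{i+L}."""
    D = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - L), min(len(tokens), i + L + 1)):
            if j != i:
                D.append((w, tokens[j]))
    return D

tokens = "a b c b a".split()
D = extract_pairs(tokens, L=1)
pair_counts = Counter(D)                # #(w, c)
word_counts = Counter(w for w, _ in D)  # #(w)
# sanity check: #(w) = sum over c' of #(w, c')
assert word_counts["b"] == sum(n for (w, _), n in pair_counts.items() if w == "b")
```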

Each word [math]w \in V_W[/math] is associated with a vector [math]\vec{w} \in \mathbb{R}^d[/math] and similarly each context [math]c \in V_C[/math] is represented as a vector [math]\vec{c} \in \mathbb{R}^d[/math], where [math]d[/math] is the embedding’s dimensionality. The entries in the vectors are latent, and treated as parameters to be learned. We sometimes refer to the vectors [math]\vec{w}[/math] as rows in a [math]|V_W| \times d[/math] matrix [math]W[/math], and to the vectors [math]\vec{c}[/math] as rows in a [math]|V_C| \times d[/math] matrix [math]C[/math]. In such cases, [math]W_i[/math] ([math]C_i[/math]) refers to the vector representation of the [math]i[/math]th word (context) in the corresponding vocabulary. When referring to embeddings produced by a specific method [math]x[/math], we will usually use [math]W^x[/math] and [math]C^x[/math] explicitly, but may use just [math]W[/math] and [math]C[/math] when the method is clear from the discussion.

- SGNS’s Objective

Consider a word-context pair [math](w,c)[/math]. Did this pair come from the observed data [math]D[/math]? Let [math]P(D=1|w,c)[/math] be the probability that [math](w,c)[/math] came from the data, and [math]P(D=0|w,c) = 1 - P(D=1|w,c)[/math] the probability that [math](w,c)[/math] did not. The distribution is modeled as: [math]P(D=1|w,c) = \sigma(\vec{w} \cdot \vec{c}) = \frac{1}{1 + e^{-\vec{w} \cdot \vec{c}}}[/math] where [math]\vec{w}[/math] and [math]\vec{c}[/math] (each a [math]d[/math]-dimensional vector) are the model parameters to be learned.
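The modeled probability is simply a logistic function of the dot product; a quick illustrative check:

```python
import math

def p_observed(w_vec, c_vec):
    """P(D = 1 | w, c) = sigma(w . c) = 1 / (1 + exp(-(w . c)))."""
    dot = sum(wi * ci for wi, ci in zip(w_vec, c_vec))
    return 1.0 / (1.0 + math.exp(-dot))
```

With a zero dot product the model is maximally uncertain (probability 0.5); a large positive dot product pushes the probability of "came from the data" toward 1.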

The negative sampling objective tries to maximize [math]P(D=1|w,c)[/math] for observed [math](w,c)[/math] pairs while maximizing [math]P(D=0|w,c)[/math] for randomly sampled “negative” examples (hence the name “negative sampling”), under the assumption that randomly selecting a context for a given word is likely to result in an unobserved [math](w,c)[/math] pair. SGNS’s objective for a single [math](w,c)[/math] observation is then: [math]\log \sigma(\vec{w} \cdot \vec{c}) + k \cdot \mathbb{E}_{c_N \sim P_D}\left[\log \sigma(-\vec{w} \cdot \vec{c_N})\right] \quad (1)[/math] where [math]k[/math] is the number of “negative” samples and [math]c_N[/math] is the sampled context, drawn according to the empirical unigram distribution [math]P_D(c) = \frac{\#(c)}{|D|}[/math]. ^{[1]}

The objective is trained in an online fashion using stochastic gradient updates over the observed pairs in the corpus [math]D[/math]. The global objective then sums over the observed [math](w,c)[/math] pairs in the corpus: [math]\ell = \sum_{w \in V_W} \sum_{c \in V_C} \#(w,c) \left( \log \sigma(\vec{w} \cdot \vec{c}) + k \cdot \mathbb{E}_{c_N \sim P_D}\left[\log \sigma(-\vec{w} \cdot \vec{c_N})\right] \right) \quad (2)[/math]
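For intuition, the global objective can be evaluated exactly on a toy model by replacing the sampled expectation with a full sum over the empirical unigram distribution (an illustrative sketch; the actual training uses stochastic sampling):

```python
import math
from collections import Counter

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgns_objective(pair_counts, W, C, k):
    """Global SGNS objective (2): sum over (w, c) of
    #(w,c) * ( log sigma(w.c) + k * E_{cN ~ PD}[log sigma(-w.cN)] ),
    with the expectation computed exactly from the empirical unigram PD."""
    c_counts = Counter()
    for (w, c), n in pair_counts.items():
        c_counts[c] += n
    total = sum(pair_counts.values())

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    obj = 0.0
    for (w, c), n in pair_counts.items():
        pos = math.log(sigmoid(dot(W[w], C[c])))
        neg = sum(cn_n / total * math.log(sigmoid(-dot(W[w], C[cn])))
                  for cn, cn_n in c_counts.items())
        obj += n * (pos + k * neg)
    return obj
```

With all-zero vectors every sigmoid evaluates to 0.5, so each observed pair contributes [math](1+k)\log 0.5[/math]; training increases the objective by raising the dot products of observed pairs and lowering those of sampled negatives.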

Optimizing this objective makes observed word-context pairs have similar embeddings, while scattering unobserved pairs. Intuitively, words that appear in similar contexts should have similar embeddings, though we are not familiar with a formal proof that SGNS does indeed maximize the dot-product of similar words.

### 3 SGNS as Implicit Matrix Factorization

SGNS embeds both words and their contexts into a low-dimensional space [math]\mathbb{R}^d[/math], resulting in the word and context matrices [math]W[/math] and [math]C[/math]. The rows of matrix [math]W[/math] are typically used in NLP tasks (such as computing word similarities) while [math]C[/math] is ignored. It is nonetheless instructive to consider the product [math]W \cdot C^\top = M[/math]. Viewed this way, SGNS can be described as factorizing an implicit matrix [math]M[/math] of dimensions [math]|V_W| \times |V_C|[/math] into two smaller matrices.
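A quick numerical illustration of the implicit matrix (toy shapes, arbitrary random values): cell [math](i,j)[/math] of [math]M = W \cdot C^\top[/math] is exactly the dot product [math]\vec{w_i} \cdot \vec{c_j}[/math] that the SGNS objective manipulates.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # |V_W| = 4 words, embedding dimension d = 3
C = rng.normal(size=(5, 3))   # |V_C| = 5 contexts
M = W @ C.T                   # implicit matrix, shape |V_W| x |V_C|

assert M.shape == (4, 5)
assert np.isclose(M[1, 2], W[1] @ C[2])  # cell (i, j) = w_i . c_j
```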

...

## References


Author | Title | Year
---|---|---
Omer Levy, Yoav Goldberg | Neural Word Embedding As Implicit Matrix Factorization | 2014

- ↑ In the algorithm described in [21], the negative contexts are sampled according to [math]p_{3/4}(c) = \frac{\#(c)^{3/4}}{Z}[/math] instead of the unigram distribution [math]\frac{\#(c)}{Z}[/math]. Sampling according to [math]p_{3/4}[/math] indeed produces somewhat superior results on some of the semantic evaluation tasks. It is straightforward to modify the PMI metric in a similar fashion by replacing the [math]p(c)[/math] term with [math]p_{3/4}(c)[/math], and doing so shows similar trends in the matrix-based methods as it does in word2vec’s stochastic gradient based training method. We do not explore this further in this paper, and report results using the unigram distribution.
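The smoothed context distribution from the footnote can be sketched as (illustrative code):

```python
def smoothed_context_dist(context_counts, alpha=0.75):
    """p_{3/4}(c) = #(c)^{3/4} / Z: each context count is raised to the
    3/4 power before normalizing, flattening the raw unigram distribution."""
    weights = {c: n ** alpha for c, n in context_counts.items()}
    Z = sum(weights.values())
    return {c: w / Z for c, w in weights.items()}

dist = smoothed_context_dist({"the": 16, "mat": 1})
```

Relative to the raw unigram distribution (here 16/17 vs. 1/17), the 3/4 power upweights rare contexts, so they are sampled as negatives more often.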