2015 EigenwordsSpectralWordEmbedding

From GM-RKB

Subject Headings: Spectral Learning Algorithm.

Notes

Cited By

Quotes

Abstract

Spectral learning algorithms have recently become popular in data-rich domains, driven in part by recent advances in large-scale randomized SVD and in spectral estimation of Hidden Markov Models. Extensions of these methods lead to statistical estimation algorithms which are not only fast, scalable, and useful on real data sets, but are also provably correct. Following this line of research, we propose four fast and scalable spectral algorithms for [[learning word embeddings]] - low dimensional real vectors (called Eigenwords) that capture the "meaning" of words from their context. All the proposed algorithms harness the multi-view nature of text data, i.e., the left and right context of each word, are fast to train, and have strong theoretical properties. Some of the variants also have lower sample complexity and hence higher statistical power for rare words. We provide theory which establishes relationships between these algorithms and optimality criteria for the estimates they provide. We also perform a thorough qualitative and quantitative evaluation of Eigenwords, showing that simple linear approaches give performance comparable to or superior to that of state-of-the-art non-linear deep learning based methods.

1. Introduction

In recent years there has been immense interest in learning embeddings for words from large amounts of raw text.[1] Word embeddings map each word in text to a k-dimensional (typically k ≈ 50) real-valued vector. They are typically learned in a totally unsupervised manner by exploiting the co-occurrence structure of words in unlabeled text. Ideally these embeddings should capture a rich variety of information about each word, including topic, part of speech, and word features such as animacy, gender, whether a number denotes a year or a small quantity, and sentiment polarity (happy vs. sad).

The importance of word embeddings has been amplified by the fact that over the past decade there has been increased interest in using unlabeled data to supplement the [[labeled data]] in semi-supervised learning. Semi-supervised learning reduces data sparsity and gives improved generalization accuracies in high dimensional domains like NLP. Approaches like those of Ando and Zhang (2005) and Suzuki and Isozaki (2008) have been empirically very successful, achieving excellent accuracies on a variety of NLP tasks. However, it is often difficult to adapt these approaches for use in conjunction with an existing supervised NLP system, as they enforce a particular choice of model.

An increasingly popular alternative is to learn representational embeddings for words from a large collection of unlabeled data, either using a generative model or an artificial neural network, and to use these embeddings to augment the feature set of a supervised learner, thereby improving the performance of a state-of-the-art NLP system such as a sentiment analyzer, parser or part of speech tagger.

Word embeddings have proven useful and have given state-of-the-art performance on many natural language processing tasks, e.g., syntactic parsing (Täckström et al., 2012; Parikh et al., 2014), POS tagging (Dhillon et al., 2012b; Huang et al., 2013), dependency parsing (Bansal et al., 2014; Koo et al., 2008; Dhillon et al., 2012a), sentiment analysis (Dhillon et al., 2012b), chunking (Turian et al., 2010; Dhillon et al., 2011), Named Entity Recognition (NER) (Turian et al., 2010; Dhillon et al., 2011), word analogies (Mikolov et al., 2013a, b), and word similarity (Huang et al., 2012), to name a few.

These NLP systems use labeled data to learn a model, but there is often only a limited amount of labeled text available for these tasks. (This is less of a problem for English, but other languages often have very little labeled data.) Thus, word embeddings, which can be learned from large amounts of unlabeled data, provide a highly discriminative set of features which enable the supervised learner to perform better.

As mentioned earlier, embedding methods produce features in low dimensional spaces, unlike the traditional approach of working in the original high dimensional vocabulary space with only one dimension “on” at a given time.
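As a hedged illustration of this contrast (toy vocabulary, random values; nothing here is taken from the paper), the following sketch compares a sparse one-hot feature in vocabulary space with a dense low-dimensional embedding lookup:

```python
# Hedged illustration: one-hot vocabulary-space feature vs. dense embedding.
# Vocabulary, dimensionality, and values are made up for illustration only.
import numpy as np

vocab = ["the", "bank", "river", "loan"]            # |V| = 4 here; ~10^5 in practice
one_hot_bank = np.zeros(len(vocab))
one_hot_bank[vocab.index("bank")] = 1.0             # only one dimension is "on"

k = 3                                               # ~50 dimensions in typical embeddings
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), k))  # one dense row per word type
dense_bank = embedding_table[vocab.index("bank")]   # low-dimensional real-valued vector

print(one_hot_bank)    # e.g. [0. 1. 0. 0.]
print(dense_bank)      # a length-3 real vector
```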

Broadly speaking, embedding methods fall into two categories:

  1. Clustering-based word embeddings: Clustering methods, often hierarchical, are used to group distributionally similar words based on their contexts. The two dominant approaches are Brown clustering (Brown et al., 1992) and the distributional clustering of Pereira et al. (1993). As recently shown, HMMs can also be used to induce a multinomial distribution over possible clusters (Huang and Yates, 2009).
  2. Dense embeddings: These embeddings are dense, low dimensional, and real-valued. Each dimension of these embeddings captures latent information about a combination of syntactic and semantic word properties. They can either be induced using neural networks, like C&W embeddings (Collobert and Weston, 2008), Hierarchical Log-Bilinear (HLBL) embeddings (Mnih and Hinton, 2007), and word2vec embeddings (Mikolov et al., 2013a, b), or by eigen-decomposition of the word co-occurrence matrix, e.g., Latent Semantic Analysis / Latent Semantic Indexing (LSA / LSI) (Dumais et al., 1988).

The most classic and successful algorithm for learning word embeddings is Latent Semantic Analysis (LSA) (Landauer et al., 2008), which works by performing SVD on the word-by-document matrix.
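As a hedged sketch of this LSA-style pipeline (toy corpus, illustrative k, and scikit-learn's TruncatedSVD standing in for whatever SVD routine a particular LSA implementation uses), word embeddings can be obtained from a word-by-document count matrix as follows:

```python
# Minimal LSA-style sketch: truncated SVD of a word-by-document count matrix.
# The corpus and number of components are illustrative, not from the paper.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the bank approved the loan",
    "the river bank was muddy",
    "the loan had high interest",
    "the river was muddy after rain",
]

vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(docs)      # documents x words
word_doc = doc_term.T                          # words x documents (LSA view)

svd = TruncatedSVD(n_components=2, random_state=0)
word_vectors = svd.fit_transform(word_doc)     # one k-dimensional vector per word type

for word, idx in vectorizer.vocabulary_.items():
    print(f"{word:10s} {word_vectors[idx]}")
```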

Unfortunately, the state-of-the-art embedding methods suffer from a number of shortcomings:

1) They are slow to train (especially the Deep Learning based approaches of Collobert and Weston (2008) and Mnih and Hinton (2007)). Recently, Mikolov et al. (2013a, b) have proposed neural network based embeddings which avoid the hidden layers that are typical in Deep Learning. This, coupled with good engineering, allows their embeddings to be trained in minutes.

2) They are sensitive to the scaling of the embeddings (especially ℓ2-based approaches like LSA / PCA).

3) They learn a single embedding for a given word type; i.e., all occurrences of the word “bank” get the same embedding, irrespective of whether the context of the word suggests it means “a financial institution” or “a river bank.” Recently, Huang et al. (2012) have proposed context-specific word embeddings, but their Deep Learning based approach is slow and cannot scale to large vocabularies.

In this paper we provide spectral algorithms (based on eigen-decomposition) for learning word embeddings, as they have been shown to be fast and scalable for learning from large amounts of unlabeled data (Turney and Pantel, 2010), have a strong theoretical grounding, and are guaranteed to converge to globally optimal solutions (Hsu et al., 2009). In particular, we are interested in Canonical Correlation Analysis (CCA) (Hotelling, 1935) based methods since: (1) unlike PCA or LSA based methods, they are scale invariant, and (2) unlike LSA, they can capture multi-view information. In text applications the left and right contexts of each word provide a natural split into two views, which is totally ignored by LSA as it throws the entire context into a bag of words while constructing the term-document matrix.
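To make the "two views" concrete, here is a hedged sketch (toy corpus and window size; an illustration of the view construction only, not the paper's actual estimator) that builds left-context and right-context count matrices, one row per token, between which CCA could then be run:

```python
# Hedged sketch: the "two views" (left vs. right context) of each token position.
# Corpus, window size h, and vocabulary handling are illustrative only.
import numpy as np

tokens = "the river bank was muddy and the bank approved the loan".split()
vocab = sorted(set(tokens))
index = {w: i for i, w in enumerate(vocab)}
V, h = len(vocab), 2                      # h = context window on each side

n = len(tokens)
L = np.zeros((n, V))                      # left-context view,  one row per token
R = np.zeros((n, V))                      # right-context view, one row per token
for t in range(n):
    for j in range(max(0, t - h), t):
        L[t, index[tokens[j]]] += 1
    for j in range(t + 1, min(n, t + h + 1)):
        R[t, index[tokens[j]]] += 1

# CCA between L and R (or between a word view and the combined context view)
# would then yield low-dimensional, eigenword-style projections.
print(L.shape, R.shape)                   # (n_tokens, |V|) for each view
```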

We propose a family of dense embeddings learned by performing Canonical Correlation Analysis (CCA) (Hotelling, 1935) between the past and future views of the data. All our embeddings share a number of common characteristics and address the shortcomings of the current state-of-the-art embeddings. In particular, they:

  1. Are fast, scalable and scale invariant.
  2. Provide better sample complexity [2] for rare words.
  3. Can induce context-specific embeddings, i.e., different embeddings for “bank” depending on whether it means “a financial institution” or “a river bank.”
  4. Have strong theoretical foundations.

Most importantly, in this paper we show that simple linear methods based on eigen-decomposition of the context matrices give, even at their simplest, accuracies comparable to or better than state-of-the-art highly non-linear deep learning based approaches (Collobert and Weston, 2008; Mnih and Hinton, 2007; Mikolov et al., 2013a, b).

The remainder of the paper is organized as follows. In the next section we give a brief overview of CCA, which forms the core of our method. The following section describes our four proposed algorithms. After a brief description of context-specific embeddings and of the efficient SVD method we use, we present a set of systematic studies. These studies evaluate our CCA variants against alternatives, including those derived from deep neural networks (C&W, HLBL, SENNA, and word2vec), on problems in POS tagging, word similarity, generalized sentiment classification, NER, cross-lingual WSD, and semantic & syntactic analogies.

2. Brief Review: Canonical Correlation Analysis (CCA)

CCA (Hotelling, 1935) is the analog to Principal Component Analysis (PCA) for pairs of matrices. PCA computes the directions of maximum covariance between elements in a single matrix, whereas CCA computes the directions of maximal correlation between a pair of matrices. Like PCA, CCA can be cast as an eigenvalue problem on a covariance matrix, but can also be interpreted as deriving from a generative mixture model (Bach and Jordan, 2005). See (Hardoon et al., 2004) for a review of CCA with applications to machine learning. More specifically, given $n$ i.i.d. samples from two sets of multivariate data $D_z = \{z_1, \ldots, z_n\} \in \mathbb{R}^{m_1}$ and $D_w = \{w_1, \ldots, w_n\} \in \mathbb{R}^{m_2}$, where the pairs $(z_1, w_1)$ have correspondence and so on, CCA tries to find a pair of linear transformations $\Phi_z \in \mathbb{R}^{m_1 \times k}$ and $\Phi_w \in \mathbb{R}^{m_2 \times k}$ (where $k \le m_1 \le m_2$) such that the correlation between the projection of $z$ onto $\Phi_z$ and of $w$ onto $\Phi_w$ is maximized. This can be expressed as the following optimization problem:
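The displayed optimization problem is cut off in the excerpt above; for reference, the standard multi-dimensional CCA objective in this notation (an assumption about the exact display the paper uses) can be written as

$$
\max_{\Phi_z, \Phi_w} \; \operatorname{tr}\!\left( \Phi_z^\top C_{zw} \Phi_w \right)
\quad \text{s.t.} \quad
\Phi_z^\top C_{zz} \Phi_z = I_k, \qquad \Phi_w^\top C_{ww} \Phi_w = I_k,
$$

where $C_{zz}$ and $C_{ww}$ are the within-view sample covariance matrices, $C_{zw}$ is the cross-covariance matrix, and $I_k$ is the $k \times k$ identity. The solution is given by the top-$k$ singular vectors of the whitened cross-covariance $C_{zz}^{-1/2} C_{zw} C_{ww}^{-1/2}$, which is what makes CCA computable via SVD. A minimal NumPy sketch of this whitening-plus-SVD computation (regularization and centering choices here are illustrative, not the paper's estimator) is:

```python
# Hedged sketch: CCA of two views Z (n x m1) and W (n x m2) via whitening + SVD.
import numpy as np

def cca(Z, W, k, reg=1e-8):
    # Center both views and form (regularized) covariance matrices.
    Z = Z - Z.mean(axis=0)
    W = W - W.mean(axis=0)
    n = Z.shape[0]
    Czz = Z.T @ Z / n + reg * np.eye(Z.shape[1])
    Cww = W.T @ W / n + reg * np.eye(W.shape[1])
    Czw = Z.T @ W / n

    def inv_sqrt(C):
        # Symmetric inverse square root via eigen-decomposition.
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Czz_is, Cww_is = inv_sqrt(Czz), inv_sqrt(Cww)
    U, s, Vt = np.linalg.svd(Czz_is @ Czw @ Cww_is)
    Phi_z = Czz_is @ U[:, :k]      # projection for view z
    Phi_w = Cww_is @ Vt[:k].T      # projection for view w
    return Phi_z, Phi_w, s[:k]     # s[:k] holds the canonical correlations
```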

Footnotes

  1. This paper is based in part on work in (Dhillon et al., 2011) and (Dhillon et al., 2012b).
  2. In the sense that relative statistical efficiency is better.

References


Paramveer S. Dhillon, Dean P. Foster, and Lyle H. Ungar (2015). “Eigenwords: Spectral Word Embeddings.”