2015 Word Representations via Gaussian Embedding

From GM-RKB

Subject Headings: Fitting Gaussian Mixture Models, Gaussian Embedding, Gaussian Vector Space.

Notes

Cited By

2015

Quotes

Abstract

Current work in lexical distributed representations maps each word to a point vector in low-dimensional space. Mapping instead to a density provides many interesting advantages, including better capturing uncertainty about a representation and its relationships, expressing asymmetries more naturally than dot product or cosine similarity, and enabling more expressive parameterization of decision boundaries. This paper advocates for density-based distributed embeddings and presents a method for learning representations in the space of Gaussian distributions. We compare performance on various word embedding benchmarks, investigate the ability of these embeddings to model entailment and other asymmetric relationships, and explore novel properties of the representation.

1 INTRODUCTION

In recent years there has been a surge of interest in learning compact distributed representations or embeddings for many machine learning tasks, including collaborative filtering (Koren et al., 2009), image retrieval (Weston et al., 2011), relation extraction (Riedel et al., 2013), word semantics and language modeling (Bengio et al., 2006; Mnih & Hinton, 2008; Mikolov et al., 2013), and many others. In these approaches, input objects (such as images, relations, or words) are mapped to dense vectors of lower dimensionality than the cardinality of the inputs, with the goal that the geometry of this low-dimensional latent embedded space be smooth with respect to some measure of similarity in the target domain. That is, objects associated with similar targets should be mapped to nearby points in the embedded space.

While this approach has proven powerful, representing an object as a single point in space carries some important limitations. An embedded vector representing a point estimate does not naturally express uncertainty about the target concepts with which the input may be associated. Point vectors are typically compared by dot products, cosine distance, or Euclidean distance, none of which provide for asymmetric comparisons between objects (as is necessary to represent inclusion or entailment), and relationships between points are normally measured by distances, which are required to obey the triangle inequality.

This paper advocates moving beyond vector point representations to potential functions (Aizerman et al., 1964), or continuous densities in latent space. In particular we explore Gaussian function embeddings (currently with diagonal covariance), in which both means and variances are learned from data. Gaussians innately represent uncertainty and provide a distance function per object. KL divergence between Gaussian distributions is straightforward to calculate, naturally asymmetric, and has a geometric interpretation as an inclusion between families of ellipses.
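
For concreteness, the asymmetry of KL divergence between diagonal-covariance Gaussians can be seen directly from its closed form. The following is a minimal NumPy sketch (our own illustration with toy numbers, not code from the paper):

```python
import numpy as np

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for d-dimensional Gaussians with diagonal covariances:
    0.5 * [ tr(Sq^-1 Sp) + (mu_q - mu_p)^T Sq^-1 (mu_q - mu_p) - d + log(|Sq| / |Sp|) ]."""
    d = mu_p.shape[0]
    diff = mu_q - mu_p
    return 0.5 * (np.sum(var_p / var_q)
                  + np.sum(diff * diff / var_q)
                  - d
                  + np.sum(np.log(var_q)) - np.sum(np.log(var_p)))

# Toy example: a broad "general" Gaussian and a narrow "specific" one inside it.
mu_general, var_general = np.zeros(2), np.array([4.0, 4.0])
mu_specific, var_specific = np.array([0.5, 0.5]), np.array([0.25, 0.25])

print(kl_diag_gaussians(mu_specific, var_specific, mu_general, var_general))  # ~1.9 (small)
print(kl_diag_gaussians(mu_general, var_general, mu_specific, var_specific))  # ~13.2 (large)
```

The two directions give very different values: the divergence from the narrow Gaussian to the broad one is small, while the reverse is large, which is exactly the kind of asymmetric, inclusion-like comparison point vectors cannot express.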

There is a long line of previous work in mapping data cases to probability distributions, perhaps the most famous being radial basis functions (RBFs), used both in the kernel and neural network literature. We draw inspiration from this work to propose novel word embedding algorithms that embed words directly as Gaussian distributional potential functions in an infinite dimensional function space. This allows us to map word types not only to vectors but to soft regions in space, modeling uncertainty, inclusion, and entailment, as well as providing a rich geometry of the latent space.

Figure 1: Learned diagonal variances for each word, with the first letter of each word indicating the position of its mean, projected onto generalized eigenvectors between the mixture means and variance of the query word Bach. Here we might infer that Bach was a famous classical composer.

After discussing related work and presenting our algorithms below, we explore properties of our algorithms with multiple qualitative and quantitative evaluations on several real and synthetic datasets. We show that concept containment and specificity matches common intuition on examples concerning people, genres, foods, and others. We compare our embeddings to Skip-Gram on seven standard word similarity tasks, and evaluate the ability of our method to learn unsupervised lexical entailment. We also demonstrate that our training method supports new styles of supervised training that explicitly incorporate asymmetry into the objective.

2 RELATED WORK

This paper builds on a long line of work on both distributed and distributional semantic word vectors, including distributional semantics, neural language models, count-based language models, and, more broadly, the field of representation learning.

Related work in probabilistic matrix factorization (Mnih & Salakhutdinov, 2007) embeds rows and columns as Gaussians, and some forms of this do provide each row and column with its own variance (Salakhutdinov & Mnih, 2008). Given the parallels between embedding models and matrix factorization (Deerwester et al., 1990; Riedel et al., 2013; Levy & Goldberg, 2014), this is relevant to our approach. However, these Bayesian methods apply Bayes’ rule to observed data to infer the latent distributions, whereas our model works directly in the space of probability distributions and discriminatively trains them. This allows us to go beyond the Bayesian approach and use arbitrary (and even asymmetric) training criteria, and is more similar to methods that learn kernels (Lanckriet et al., 2004) or function-valued neural networks such as mixture density networks (Bishop, 1994).

Other work in multiplicative tensor factorization for word embeddings (Kiros et al., 2014) and metric learning (Xing et al., 2002) learns some combinations of representations, clusters, and a distance metric jointly; however, it does not effectively learn a distance function per item. [[Fitting Gaussian mixture models on embeddings]] has been done in order to apply Fisher kernels to entire documents (Clinchant & Perronnin, 2013b; a). Preliminary concurrent work from Kiyoshiyo et al. (2014) describes a significantly different model, similar to Bayesian matrix factorization, that uses a probabilistic Gaussian graphical model to define a distribution over pairs of words, but it lacks quantitative experiments or evaluation.

In linguistic semantics, work on the distributional inclusion hypothesis (Geffet & Dagan, 2005) uses traditional count-based vectors to define regions in vector space (Erk, 2009) such that subordinate concepts are included in these regions. In fact, one strength of our proposed work is that we extend these intuitively appealing ideas (as well as the ability to use a variety of asymmetric distances between vectors) to the dense, low-dimensional distributed vectors that are now gaining popularity.

3 BACKGROUND

Our goal is to map every word type w in some dictionary D and context word type c in a dictionary C to a Gaussian distribution over a latent embedding space, such that linguistic properties of the words are captured by properties of and relationships between the distributions. For precision, we call an element of the dictionary a word type, and a particular observed token in some context a word token. This is analogous to the class vs. instance distinction in object-oriented programming. In unsupervised learning of word vectors, we observe a sequence of word tokens [math]\displaystyle{ \{t(w)_i\} }[/math] for each type w, and their contexts (sets of nearby word tokens) [math]\displaystyle{ \{c(w)_i\} }[/math]. The goal is to map each word type w and context word type c to a vector, such that types that appear in similar contexts have similar vectors. When it is unambiguous, we also use the variables w and c to denote the vectors associated to that given word type or context word type.
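
As a concrete, simplified illustration of this setup, the sketch below draws (word token, observed context token, sampled negative context tokens) triples from a token stream. The function name and the uniform negative sampling are our own simplifications for illustration, not the paper's pipeline:

```python
import random

def context_triples(tokens, window=2, num_neg=1, seed=0):
    """Yield (word, positive context, [negative contexts]) triples.
    Positive contexts are tokens within `window` positions; negatives are
    sampled uniformly from the vocabulary (a simplification of common practice).
    """
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            negatives = [rng.choice(vocab) for _ in range(num_neg)]
            yield w, tokens[j], negatives

for triple in context_triples("bach wrote many keyboard works".split(), window=1):
    print(triple)
```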

An energy function (LeCun et al., 2006) is a function [math]\displaystyle{ E_\theta (x, y) }[/math] that scores pairs of inputs x and outputs y, parametrized by [math]\displaystyle{ \theta }[/math]. The goal of energy-based learning is to train the parameters of the energy function to score observed positive input-output pairs higher (or lower, depending on sign conventions) than negative pairs. This is accomplished by means of a loss function L which defines which pairs are positive and negative according to some supervision, and provides gradients on the parameters given the predictions of the energy function. In prediction-based (energy-based) word embedding models, the parameters [math]\displaystyle{ \theta }[/math] correspond to our learned word representations, and the x and y input-output pairs correspond to word tokens and their contexts. These contexts can be either positive (observed) or negative (often randomly sampled). In the word2vec Skip-Gram (Mikolov et al., 2013) word embedding model, the energy function takes the form of a dot product between the vectors of an observed word and an observed context, [math]\displaystyle{ w^\text{T} c }[/math]. The loss function is a binary logistic regression classifier that treats the score of a word and its observed context as the score of a positive example, and the score of a word and a randomly sampled context as the score of a negative example.
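
A minimal sketch of the scoring and loss described above, written in plain NumPy for a single (word, observed context, sampled context) triple; this is our own simplified rendering for illustration, not the word2vec implementation:

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -np.logaddexp(0.0, -x)

def skipgram_pair_loss(w, c_pos, c_neg):
    """Binary logistic loss over dot-product energies: the observed pair
    (w, c_pos) is a positive example, the sampled pair (w, c_neg) a negative."""
    return -(log_sigmoid(np.dot(w, c_pos)) + log_sigmoid(-np.dot(w, c_neg)))

rng = np.random.default_rng(0)
w, c_pos, c_neg = rng.normal(scale=0.1, size=(3, 50))  # toy 50-dimensional vectors
print(skipgram_pair_loss(w, c_pos, c_neg))
```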

Backpropagating (Rumelhart et al., 1986) this loss to the word vectors trains them to be predictive of their contexts, achieving the desired effect (words in similar contexts have similar vectors). In recent work, word2vec has been shown to be equivalent to factoring certain types of weighted pointwise mutual information matrices (Levy & Goldberg, 2014).

In our work, we use a slightly different loss function than Skip-Gram word2vec embeddings. Our energy functions take on a more limited range of values than do vector dot products, and their dynamic ranges depend in complex ways on the parameters. Therefore, we had difficulty using the word2vec loss that treats scores of positive and negative pairs as positive and negative examples to a binary classifier, since this relies on the ability to push up on the energy surface in an absolute, rather than relative, manner. To avoid the problem of absolute energies, we train with a ranking-based loss. We chose a max-margin ranking objective, similar to that used in Rank-SVM (Joachims, 2002) or Wsabie (Weston et al., 2011), which pushes scores of positive pairs above negatives by a margin: [math]\displaystyle{ L_m (w, c_p, c_n) = \operatorname{max} (0, m - E(w,c_p) + E(w,c_n)) }[/math] where m is the margin, [math]\displaystyle{ c_p }[/math] is an observed (positive) context, and [math]\displaystyle{ c_n }[/math] is a sampled (negative) context. In this terminology, the contribution of our work is a pair of energy functions for training Gaussian distributions to represent word types.
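
A minimal sketch of this ranking objective, written generically over any energy function E; the placeholder dot-product energy below is only a stand-in for the Gaussian energies introduced in Section 5:

```python
import numpy as np

def max_margin_loss(energy, w, c_pos, c_neg, margin=1.0):
    """L_m(w, c_p, c_n) = max(0, m - E(w, c_p) + E(w, c_n)): zero once the
    positive context outscores the negative context by at least the margin."""
    return max(0.0, margin - energy(w, c_pos) + energy(w, c_neg))

def dot_energy(a, b):  # placeholder energy; Section 5 defines Gaussian energies
    return float(np.dot(a, b))

rng = np.random.default_rng(1)
w, c_pos, c_neg = rng.normal(size=(3, 20))
print(max_margin_loss(dot_energy, w, c_pos, c_neg))
```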

4 WARMUP: EMPIRICAL COVARIANCES

Given a pre-trained set of word embeddings trained from contexts, there is a simple way to construct variances using the empirical variance of a word type's set of context vectors. For a word w with N occurrences, sets of context word vectors [math]\displaystyle{ \{c(w)_i\} }[/math] representing the words found in its contexts, and window size W, the empirical variance is: [math]\displaystyle{ \Sigma_w = \frac{1}{NW} \sum_{i}^{N} \sum_{j}^{W} \left(c(w)_{ij} - w\right) \left(c(w)_{ij} - w\right)^\text{T} }[/math] This is an estimator for the covariance of a distribution assuming that the mean is fixed at w. In practice, it is also necessary to add a small ridge term [math]\displaystyle{ \delta \gt 0 }[/math] to the diagonal of the matrix to regularize and avoid numerical problems when inverting.
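
A minimal sketch of this estimator (full covariance plus ridge, as in the equation above), assuming the context vectors for a word have been stacked into a single array; the names are ours:

```python
import numpy as np

def empirical_covariance(word_vec, context_vecs, ridge=1e-3):
    """Covariance of a word's context vectors around the fixed word vector w,
    as in the equation above, plus a ridge term delta on the diagonal.

    context_vecs has shape (N * W, d): all context vectors observed across the
    word's N occurrences with window size W.
    """
    diffs = context_vecs - word_vec                # c(w)_ij - w
    cov = diffs.T @ diffs / len(context_vecs)      # (1 / NW) * sum of outer products
    return cov + ridge * np.eye(word_vec.shape[0])

rng = np.random.default_rng(2)
w = rng.normal(size=10)
contexts = w + 0.5 * rng.normal(size=(40, 10))     # toy contexts scattered around w
print(np.diag(empirical_covariance(w, contexts)))
```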

However, in Section 6.2 we note that the distributions learned by this empirical estimator do not possess properties that we would want from Gaussian distributional embeddings, such as unsupervised entailment represented as inclusion between ellipsoids. By discriminatively embedding our predictive vectors in the space of Gaussian distributions, we can improve this performance. Our models can learn certain forms of entailment during unsupervised training, as discussed in Section 6.2 and exemplified in Figure 1.

5 ENERGY-BASED LEARNING OF GAUSSIANS

As discussed in Section 3, our architecture learns Gaussian distributional embeddings to predict words in context given the current word, and ranks these over negatively sampled words. We present two energy functions to train these embeddings.
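
The full definitions of these two energies appear later in the paper and are not reproduced in this excerpt. As an illustration of what such energies can look like for diagonal Gaussians, the sketch below gives two natural candidates consistent with the ideas above and with the cited probability product kernels (Jebara et al., 2004): a symmetric log expected-likelihood inner product and the asymmetric negative KL divergence discussed in Section 1. These should be read as plausible stand-ins under those assumptions, not necessarily the paper's exact formulations:

```python
import numpy as np

def log_expected_likelihood(mu_i, var_i, mu_j, var_j):
    """Symmetric energy: log of the Gaussian inner product
    log ∫ N(x; mu_i, S_i) N(x; mu_j, S_j) dx = log N(mu_i - mu_j; 0, S_i + S_j)."""
    d = mu_i.shape[0]
    var = var_i + var_j
    diff = mu_i - mu_j
    return -0.5 * (d * np.log(2.0 * np.pi)
                   + np.sum(np.log(var))
                   + np.sum(diff * diff / var))

def negative_kl_energy(mu_i, var_i, mu_j, var_j):
    """Asymmetric energy: -KL(N_i || N_j), which is high when N_i sits
    inside N_j (e.g. a specific concept inside a general one)."""
    d = mu_i.shape[0]
    diff = mu_j - mu_i
    kl = 0.5 * (np.sum(var_i / var_j)
                + np.sum(diff * diff / var_j)
                - d
                + np.sum(np.log(var_j) - np.log(var_i)))
    return -kl

mu_a, var_a = np.zeros(5), np.full(5, 1.0)
mu_b, var_b = 0.1 * np.ones(5), np.full(5, 2.0)
print(log_expected_likelihood(mu_a, var_a, mu_b, var_b))
print(negative_kl_energy(mu_a, var_a, mu_b, var_b), negative_kl_energy(mu_b, var_b, mu_a, var_a))
```

Either energy can be dropped into the max-margin ranking loss of Section 3 in place of the dot product; the symmetric choice behaves like a similarity, while the asymmetric one can express entailment-like containment.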

7 CONCLUSION AND FUTURE WORK

In this work we introduced a method to embed word types into the space of Gaussian distributions, and to learn the embeddings directly in that space. This allows us to represent words not as low-dimensional vectors, but as densities over a latent space, directly representing notions of uncertainty and enabling a richer geometry in the embedded space. We demonstrated the effectiveness of these embeddings on a linguistic task requiring asymmetric comparisons, as well as standard word similarity benchmarks, learning of synthetic hierarchies, and several qualitative examinations. In future work, we hope to move beyond spherical or diagonal covariances and into combinations of low-rank and diagonal matrices. Efficient updates and scalable learning are still possible due to the Sherman-Morrison-Woodbury formula. Additionally, going beyond diagonal covariances will enable us to keep our semantics from being axis-aligned, which will increase model capacity and expressivity. We also hope to move past stochastic gradient descent and warm starting, and to learn the Gaussian representations robustly in one pass from scratch by using, e.g., proximal or block coordinate descent methods. Improved optimization strategies will also be helpful on the highly nonconvex problem of training supervised hierarchies with KL divergence.
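
For reference, the matrix identity alluded to above is a standard linear-algebra result (not specific to this paper): for a diagonal-plus-low-rank covariance [math]\displaystyle{ \Sigma = D + U U^\text{T} }[/math] with [math]\displaystyle{ U }[/math] of rank k, the inverse requires only the inverse of the diagonal part and a small [math]\displaystyle{ k \times k }[/math] system, [math]\displaystyle{ (D + U U^\text{T})^{-1} = D^{-1} - D^{-1} U \left(I + U^\text{T} D^{-1} U\right)^{-1} U^\text{T} D^{-1} }[/math], which keeps updates efficient even in high dimensions.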

Representing words and concepts as different types of distributions (including other elliptic distributions such as the Student’s t) is an exciting direction – Gaussians concentrate their density on a thin spherical ellipsoidal shell, which can lead to counterintuitive behavior in high dimensions. Combining ideas from kernel methods and manifold learning with deep learning and linguistic representation learning is an exciting frontier.

In other domains, we want to extend the use of potential function representations to other tasks requiring embeddings, such as relational learning with the universal schema (Riedel et al., 2013). We hope to leverage the asymmetric measures, probabilistic interpretation, and flexible training criteria of our model to tackle tasks involving similarity-in-context, comparison of sentences and paragraphs, and more general common sense reasoning.

References

  • Aizerman, M. A., Braverman, E. A., and Rozonoer, L. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.
  • Baroni, Marco, Bernardini, Silvia, Ferraresi, Adriano, and Zanchetta, Eros. The wacky wide web: a collection of very large linguistically processed web-crawled corpora. Language resources and evaluation, 43(3):209–226, 2009.
  • Baroni, Marco, Bernardi, Raffaella, Do, Ngoc-Quynh, and Shan, Chung-chieh. Entailment above the word level in distributional semantics. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL ’12, pp. 23–32, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
  • Baroni, Marco, Dinu, Georgiana, and Kruszewski, Germán. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 238–247. Association for Computational Linguistics, 2014.
  • Bengio, Yoshua, Schwenk, Holger, Senécal, Jean-Sébastien, Morin, Fréderic, and Gauvain, Jean-Luc. Neural probabilistic language models. In Innovations in Machine Learning, pp. 137–186. Springer, 2006.
  • Bishop, Christopher M. Mixture density networks, 1994.
  • Brown, Gerald G and Rutemiller, Herbert C. Means and variances of stochastic vector products with applications to random linear models. Management Science, 24(2):210–216, 1977.
  • Bruni, Elia, Tran, Nam-Khanh, and Baroni, Marco. Multimodal distributional semantics. JAIR, 2014.
  • Clinchant, Stéphane and Perronnin, Florent. Textual similarity with a bag-of-embedded-words model. In: Proceedings of the 2013 Conference on the Theory of Information Retrieval, ICTIR ’13, pp. 25:117–25:120, New York, NY, USA, 2013a. ACM.
  • Clinchant, Stéphane and Perronnin, Florent. Aggregating continuous word embeddings for information retrieval. ACL 2013, pp. 100, 2013b.
  • Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., and Harshman, R.A. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, pp. 391–407, 1990.
  • Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
  • Erk, Katrin. Representing words as regions in vector space. In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pp. 57–65. Association for Computational Linguistics, 2009.
  • Faruqui, Manaal and Dyer, Chris. Community evaluation and exchange of word vectors at wordvectors.org. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, USA, June 2014. Association for Computational Linguistics.
  • Finkelstein, Lev, Gabrilovich, Evgeniy, Matias, Yossi, Rivlin, Ehud, Solan, Zach, Wolfman, Gadi, and Ruppin, Eytan. Placing search in context: The concept revisited. In: Proceedings of the 10th International Conference on World Wide Web, pp. 406–414. ACM, 2001.
  • Geffet, Maayan and Dagan, Ido. The distributional inclusion hypotheses and lexical entailment. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 107–114. Association for Computational Linguistics, 2005.
  • Hill, Felix, Reichart, Roi, and Korhonen, Anna. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. arXiv preprint arXiv:1408.3456, 2014.
  • Jebara, Tony, Kondor, Risi, and Howard, Andrew. Probability product kernels. The Journal of Machine Learning Research, 5:819–844, 2004.
  • Joachims, Thorsten. Optimizing search engines using clickthrough data. In: Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 133–142. ACM, 2002.
  • Kay, S.M. Fundamentals of Statistical Signal Processing, Volume III: Practical Algorithm Development. Fundamentals of Statistical Signal Processing. Prentice-Hall PTR, 2013. ISBN 9780132808033.
  • Kiros, Ryan, Zemel, Richard, and Salakhutdinov, Ruslan R. A multiplicative model for learning distributed text-based attribute representations. In NIPS, 2014.
  • Kiyoshiyo, Shimaoka, Masayasu, Muraoka, Futo, Yamamoto, Watanabe, Yotaro, Okazaki, Naoaki, and Inui, Kentaro. Distribution representation of the meaning of words and phrases by a gaussian distribution. In Language Processing Society 20th Annual Conference (In Japanese), 2014.
  • Koren, Yehuda, Bell, Robert, and Volinsky, Chris. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, August 2009. ISSN 0018-9162.
  • Lanckriet, Gert RG, Cristianini, Nello, Bartlett, Peter, Ghaoui, Laurent El, and Jordan, Michael I. Learning the kernel matrix with semidefinite programming. The Journal of Machine Learning Research, 5:27–72, 2004.
  • LeCun, Yann, Chopra, Sumit, and Hadsell, Raia. A tutorial on energy-based learning. 2006.
  • Levy, Omer and Goldberg, Yoav. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pp. 2177–2185, 2014.
  • Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119, 2013.
  • Miller, George A and Charles, Walter G. Contextual correlates of semantic similarity. Language and cognitive processes, 6(1):1–28, 1991.
  • Mnih, Andriy and Hinton, Geoffrey E. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems, pp. 1081–1088, 2008.
  • Mnih, Andriy and Salakhutdinov, Ruslan. Probabilistic matrix factorization. In Advances in neural information processing systems, pp. 1257–1264, 2007.
  • Petersen, Kaare Brandt. The matrix cookbook. 2006.
  • Riedel, Sebastian, Yao, Limin, McCallum, Andrew, and Marlin, Benjamin M. Relation extraction with matrix factorization and universal schemas. 2013.
  • Rubenstein, Herbert and Goodenough, John B. Contextual correlates of synonymy. Commun. ACM, 8(10):627–633, October 1965. ISSN 0001-0782.
  • Rumelhart, D.E., Hintont, G.E., and Williams, R.J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
  • Salakhutdinov, Ruslan and Mnih, Andriy. Bayesian probabilistic matrix factorization using markov chain monte carlo. In: Proceedings of the 25th International Conference on Machine learning, pp. 880–887. ACM, 2008.
  • Szumlanski, Sean R, Gomez, Fernando, and Sims, Valerie K. A new set of norms for semantic relatedness measures. In ACL, 2013.
  • Weston, Jason, Bengio, Samy, and Usunier, Nicolas. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI, volume 11, pp. 2764–2770, 2011.
  • Xing, Eric P, Jordan, Michael I, Russell, Stuart, and Ng, Andrew Y. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems, pp. 505–512, 2002.
  • Yang, Dongqiang and Powers, David M. W. Verb similarity on the taxonomy of WordNet. In the 3rd International WordNet Conference (GWC-06), Jeju Island, Korea, 2006.


Luke Vilnis, and Andrew McCallum. (2015). "Word Representations via Gaussian Embedding."