2012 Improving Word Representations via Global Context and Multiple Word Prototypes


Subject Headings: Unsupervised Semantic Word Modeling, Word Vector-Space Modeling Algorithms.

Notes

Cited By

Quotes

Abstract

Unsupervised word representations are very useful in NLP tasks both as inputs to learning algorithms and as extra word features in NLP systems. However, most of these models are built with only local context and one representation per word. This is problematic because words are often polysemous and global context can also provide useful information for learning word meanings. We present a new neural network architecture which 1) learns word embeddings that better capture the semantics of words by incorporating both local and global document context, and 2) accounts for homonymy and polysemy by learning multiple embeddings per word. We introduce a new dataset with human judgments on pairs of words in sentential context, and evaluate our model on it, showing that our model outperforms competitive baselines and other neural language models. [1]

1 Introduction

Vector-space models (VSM) represent word meanings with vectors that capture semantic and syntactic information of words. These representations can be used to induce similarity measures by computing distances between the vectors, leading to many useful applications, such as information retrieval (Manning et al., 2008), document classification (Sebastiani, 2002) and question answering (Tellex et al., 2003).

Despite their usefulness, most VSMs share a common problem that each word is only represented with one vector, which clearly fails to capture homonymy and polysemy. Reisinger and Mooney (2010b) introduced a multi-prototype VSM where word sense discrimination is first applied by clustering contexts, and then prototypes are built using the contexts of the sense-labeled words. However, in order to cluster accurately, it is important to capture both the syntax and semantics of words. While many approaches use local contexts to disambiguate word meaning, global contexts can also provide useful topical information (Ng and Zelle, 1997). Several studies in psychology have also shown that global context can help language comprehension (Hess et al., 1995) and acquisition (Li et al., 2000).

We introduce a new neural-network-based language model that distinguishes and uses both local and global context via a joint training objective. The model learns word representations that better capture the semantics of words, while still keeping syntactic information. These improved representations can be used to represent contexts for clustering word instances, which is used in the multi-prototype version of our model that accounts for words with multiple senses.

We evaluate our new model on the standard WordSim-353 (Finkelstein et al., 2001) dataset that includes human similarity judgments on pairs of words, showing that combining both local and global context outperforms using only local or global context alone, and is competitive with state-of-the-art methods. However, one limitation of this evaluation is that the human judgments are on pairs of words presented in isolation, ignoring meaning variations in context. Since word interpretation in context is important especially for homonymous and polysemous words, we introduce a new dataset with human judgments on similarity between pairs of words in sentential context. To capture interesting word pairs, we sample different senses of words using WordNet (Miller, 1995). The dataset includes verbs and adjectives, in addition to nouns. We show that our multi-prototype model improves upon the single-prototype version and outperforms other neural language models and baselines on this dataset.

Figure 1: An overview of our neural language model. The model makes use of both local and global context to compute a score that should be large for the actual next word (bank in the example), compared to the score for other words. When word meaning is still ambiguous given local context, information in global context can help disambiguation.

2 Global Context-Aware Neural Language Model

In this section, we describe the training objective of our model, followed by a description of the neural network architecture, ending with a brief description of our model’s training method.

2.1 Training Objective

Our model jointly learns word representations while learning to discriminate the next word given a short word sequence (local context) and the document (global context) in which the word sequence occurs. Because our goal is to learn useful word representations and not the probability of the next word given previous words (which prohibits looking ahead), our model can utilize the entire document to provide global context.

Given a word sequence s and document d in which the sequence occurs, our goal is to discriminate the correct last word in s from other random words. We compute scores g(s, d) and g(s^w, d), where s^w is s with the last word replaced by word w, and g(·, ·) is the scoring function that represents the neural networks used. We want g(s, d) to be larger than g(s^w, d) by a margin of 1, for any other word w in the vocabulary, which corresponds to the training objective of minimizing the following ranking loss for each (s, d) found in the corpus:

[math]\displaystyle{ C_{s,d} = \sum_{w \in V} \max\left(0,\, 1 - g(s, d) + g(s^w, d)\right) \quad (1) }[/math]

Collobert and Weston (2008) showed that this ranking approach can produce good word embeddings that are useful in several NLP tasks, and allows much faster training of the model compared to optimizing log-likelihood of the next word.
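The following minimal sketch illustrates this ranking loss; the scoring function `g`, the sequence and document representations, and the explicit sum over the vocabulary are placeholders (in practice the full sum is typically approximated by sampling corrupt words), not the authors' implementation:

```python
def ranking_loss(g, s, d, vocab, last_index=-1):
    """Margin ranking loss C_{s,d} = sum_w max(0, 1 - g(s, d) + g(s^w, d)).

    g     : scoring function taking (word sequence, document) -> float
    s     : list of word ids, the observed local context window
    d     : document representation, passed through to g
    vocab : iterable of candidate word ids used to corrupt the last word
    """
    gold = g(s, d)
    loss = 0.0
    for w in vocab:
        if w == s[last_index]:
            continue                 # skip the correct last word itself
        s_w = list(s)
        s_w[last_index] = w          # replace the last word with w
        loss += max(0.0, 1.0 - gold + g(s_w, d))
    return loss
```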

2.2 Neural Network Architecture

We define two scoring components that contribute to the final score of a (word sequence, document) pair. The scoring components are computed by two neural networks, one capturing local context and the other global context, as shown in Figure 1. We now describe how each scoring component is computed.
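Before the detailed description, here is a minimal sketch of how the two scoring components could be combined, assuming single-hidden-layer feedforward scorers and a plain (unweighted) average for the document vector; the layer sizes, the averaging scheme, and all function names are illustrative assumptions rather than the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_scorer(in_dim, hidden_dim):
    # One hidden tanh layer followed by a linear scoring unit.
    return {
        "W1": rng.normal(scale=0.1, size=(hidden_dim, in_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(scale=0.1, size=hidden_dim),
        "b2": 0.0,
    }

def scorer_output(params, x):
    h = np.tanh(params["W1"] @ x + params["b1"])
    return float(params["w2"] @ h + params["b2"])

def score(embeddings, window_ids, doc_ids, local_net, global_net):
    """Total score g(s, d) = score_local + score_global.

    embeddings : (|V|, n) word embedding matrix
    window_ids : word ids of the local context window s (last id = target word)
    doc_ids    : word ids of the whole document d (global context)
    """
    # Local component: concatenation of the window's word vectors.
    x_local = np.concatenate([embeddings[i] for i in window_ids])
    score_l = scorer_output(local_net, x_local)

    # Global component: a document vector (a plain average here; the paper
    # uses a weighted average) concatenated with the target word's vector.
    doc_vec = embeddings[doc_ids].mean(axis=0)
    x_global = np.concatenate([doc_vec, embeddings[window_ids[-1]]])
    score_g = scorer_output(global_net, x_global)

    return score_l + score_g
```

For example, with a 5-word window and 50-dimensional embeddings, the local scorer would take a 250-dimensional input and the global scorer a 100-dimensional one.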

... …

Using Amazon Mechanical Turk, we collected 10 human similarity ratings for each pair, as Snow et al. (2008) found that 10 non-expert annotators can achieve very close inter-annotator agreement with expert raters. To ensure worker quality, we only allowed workers with over 95% approval rate to work on our task. Furthermore, we discarded all ratings by a worker if he/she entered scores out of the accepted range or missed a rating, signaling low-quality work.

We obtained a total of 2,003 word pairs and their sentential contexts. The word pairs consist of 1,712 unique words. Of the 2,003 word pairs, 1,328 are noun-noun pairs, 399 verb-verb, 140 verb-noun, 97 adjective-adjective, 30 noun-adjective, and 9 verb-adjective. 241 pairs are same-word pairs.

4.3.2 Evaluations on Word Similarity in Context

For evaluation, we also compute Spearman correlation between a model's computed similarity scores and human judgments. Table 5 compares different models' results on this dataset. We compare against the following baselines: tf-idf represents words in a word-word matrix capturing co-occurrence counts in all 10-word context windows. Reisinger and Mooney (2010b) found that pruning the low-value tf-idf features helps performance. We report the result of this pruning technique after tuning the threshold value on this dataset, removing all but the top 200 features in each word vector. We tried the same multi-prototype approach and used spherical k-means [3] to cluster the contexts using tf-idf representations, but obtained lower numbers than single-prototype (55.4 with AvgSimC). We then tried using pruned tf-idf representations on contexts with our clustering assignments (included in Table 5), but still got results worse than the single-prototype version of the pruned tf-idf model (60.5 with AvgSimC). This suggests that the pruned tf-idf representations might be more susceptible to noise or mistakes in context clustering.
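A rough sketch of the pruned tf-idf baseline described above, assuming a standard tf-idf weighting (the paper only specifies co-occurrence counts in 10-word context windows and pruning to the top 200 features per word vector), could look like:

```python
from collections import Counter, defaultdict
import math

def pruned_tfidf_vectors(corpus, window=10, top_k=200):
    """Word-word co-occurrence counts in `window`-word context windows,
    reweighted by tf-idf and pruned to each word's top_k features.

    corpus : list of tokenized sentences (lists of word strings)
    The exact tf-idf variant is an assumption; only the pruning to the
    top 200 features per word follows the text above.
    """
    cooc = defaultdict(Counter)   # target word -> context word counts
    df = Counter()                # in how many targets' contexts a word occurs

    for sent in corpus:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    cooc[w][sent[j]] += 1

    for ctx in cooc.values():
        for c in ctx:
            df[c] += 1

    n_targets = len(cooc)
    vectors = {}
    for w, ctx in cooc.items():
        weighted = {c: tf * math.log(n_targets / df[c]) for c, tf in ctx.items()}
        # Keep only the top_k highest-weighted context features.
        top = sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        vectors[w] = dict(top)
    return vectors
```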

By utilizing global context, our model outperforms C&W's vectors and the above baselines on this dataset. With multiple representations per word, we show that the multi-prototype approach can improve over the single-prototype version without using context (62.8 vs. 58.6). Moreover, using AvgSimC [4], which takes contexts into account, the multi-prototype model obtains the best performance (65.7).
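The AvgSimC measure follows Reisinger and Mooney (2010b); a minimal sketch, using the inverse-distance cluster weighting mentioned in footnote [4] (normalized here so the weights sum to 1, which is an assumption) and cosine similarity between prototypes, might look like:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def cluster_probs(context_vec, centroids):
    # Footnote [4]: the probability of a cluster is taken as the inverse of the
    # distance to its centroid (normalized here -- an assumption).
    inv = np.array([1.0 / (np.linalg.norm(context_vec - c) + 1e-12)
                    for c in centroids])
    return inv / inv.sum()

def avg_sim_c(protos_w, centroids_w, ctx_w, protos_v, centroids_v, ctx_v):
    """AvgSimC: expected prototype similarity under each word's
    context-conditioned cluster distribution.

    protos_*    : (K, n) prototype vectors for each word
    centroids_* : (K, n) cluster centroids used to assign contexts
    ctx_*       : (n,)   vector representing the word's sentential context
    """
    p_w = cluster_probs(ctx_w, centroids_w)
    p_v = cluster_probs(ctx_v, centroids_v)
    sim = 0.0
    for i, pw in enumerate(p_w):
        for j, pv in enumerate(p_v):
            sim += pw * pv * cosine(protos_w[i], protos_v[j])
    return sim
```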

5 Related Work

Neural language models (Bengio et al., 2003; Mnih and Hinton, 2007; Collobert and Weston, 2008; Schwenk and Gauvain, 2002; Emami et al., 2003) have been shown to be very powerful at language modeling, a task where models are asked to accurately predict the next word given previously seen words. By using distributed representations of words which model words' similarity, this type of model addresses the data sparseness problem that n-gram models encounter when large contexts are used. Most of these models used relatively local contexts of between 2 and 10 words. Schwenk and Gauvain (2002) tried to incorporate larger context by combining partial parses of past word sequences and a neural language model. They used up to 3 previous head words and showed increased performance on language modeling. Our model uses a similar neural network architecture as these models and uses the ranking-loss training objective proposed by Collobert and Weston (2008), but introduces a new way to combine local and global context to train word embeddings.

Besides language modeling, word embeddings induced by neural language models have been useful in chunking and NER (Turian et al., 2010), parsing (Socher et al., 2011b), sentiment analysis (Socher et al., 2011c) and paraphrase detection (Socher et al., 2011a). However, they have not been directly evaluated on word similarity tasks, which are important for tasks such as information retrieval and summarization. Our experiments show that our word embeddings are competitive in word similarity tasks.

Most of the previous vector-space models use a single vector to represent a word even though many words have multiple meanings. The multi-prototype approach has been widely studied in models of categorization in psychology (Rosseel, 2002; Griffiths et al., 2009), while Schütze (1998) used clustering of contexts to perform word sense discrimination. Reisinger and Mooney (2010b) combined the two approaches and applied them to vector-space models, which was further improved in Reisinger and Mooney (2010a). Two other recent papers (Dhillon et al., 2011; Reddy et al., 2011) present models for constructing word representations that deal with context. It would be interesting to evaluate those models on our new dataset.

Many datasets with human similarity ratings on pairs of words, such as WordSim-353 (Finkelstein et al., 2001), MC (Miller and Charles, 1991) and RG (Rubenstein and Goodenough, 1965), have been widely used to evaluate vector-space models. Motivated by the need to evaluate composition models, Mitchell and Lapata (2008) introduced a dataset where an intransitive verb, presented with a subject noun, is compared to another verb chosen to be either similar or dissimilar to the intransitive verb in context. The context is short, with only one word, and only verbs are compared. Erk and Padó (2008), Thater et al. (2011) and Dinu and Lapata (2010) evaluated word similarity in context with a modified task where systems are asked to rerank gold-standard paraphrase candidates given the SemEval 2007 Lexical Substitution Task dataset. This task only indirectly evaluates similarity, as only the reranking of already similar words is evaluated.

6 Conclusion

We presented a new neural network architecture that learns more semantic word representations by using both local and global context in learning. These learned word embeddings can be used to represent word contexts as low-dimensional weighted average vectors, which are then clustered to form different meaning groups and used to learn multi-prototype vectors. We introduced a new dataset with human judgments on similarity between pairs of words in sentential context, so as to evaluate models' abilities to capture homonymy and polysemy of words in context. Our new multi-prototype neural language model outperforms previous neural models and competitive baselines on this new dataset.

Footnotes

  1. The dataset and word vectors can be downloaded at http://ai.stanford.edu/~ehhuang/
  3. We first tried movMF as in Reisinger and Mooney (2010b), but were unable to get decent results (only 31.5).
  4. The probability of being in a cluster is calculated as the inverse of the distance to the cluster centroid.

References


Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng (2012). "Improving Word Representations via Global Context and Multiple Word Prototypes." In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012).