2014 AnExplorationofEmbeddingsforGen


Subject Headings: Phrase Vector.





Deep learning embeddings have been successfully used for many natural language processing problems. Embeddings are mostly computed for word forms, although a number of recent papers have extended this to other linguistic units like morphemes and word sequences. In this paper, we define the concept of generalized phrase that includes conventional linguistic phrases as well as skip-bigrams. We compute embeddings for generalized phrases and show in experimental evaluations on coreference resolution and paraphrase identification that such embeddings perform better than word form embeddings.

1 Motivation

One advantage of recent work in deep learning on natural language processing (NLP) is that linguistic units are represented by rich and informative embeddings. These embeddings support better performance on a variety of NLP tasks (Collobert et al., 2011) than symbolic linguistic representations that do not directly represent information about similarity and other linguistic properties. Embeddings are mostly derived for word forms, although a number of recent papers have extended this to other linguistic units like morphemes (Luong et al., 2013), phrases and word sequences (Socher et al., 2010; Mikolov et al., 2013). [1] Thus, an important question is: what are the basic linguistic units that should be represented by embeddings in a deep learning NLP system? Building on the prior work in (Socher et al., 2010; Mikolov et al., 2013), we generalize the notion of phrase to include skip-bigrams (SkipBs) and lexicon entries, where lexicon entries can be both “continuous” and “noncontinuous” linguistic phrases. Examples of skip-bigrams at distance 2 in the sentence “this tea helped me to relax” are: “this*helped”, “tea*me”, “helped*to”. … Examples of linguistic phrases listed in a typical lexicon are continuous phrases like “cold cuts” and “White House” that only occur without intervening words and discontinuous phrases like “take over” and “turn off” that can occur with intervening words. We consider it promising to compute embeddings for these phrases because many phrases, including the four examples we just gave, are noncompositional or weakly compositional, i.e., it is difficult to compute the meaning of the phrase from the meaning of its parts. We write gaps as “*” for SkipBs and “” for phrases.

We can approach the question of what basic linguistic units should have representations from a practical as well as from a cognitive point of view. In practical terms, we want representations to be optimized for good generalization. There are many situations where a particular task involving a word cannot be solved based on the word itself, but it can be solved by analyzing the context of the word. For example, if a coreference resolution system needs to determine whether the unknown word “Xiulan” (a Chinese first name) in “he helped Xiulan to find a flat” refers to an animate or an inanimate entity, then the SkipB “helped*to” is a good indicator for the animacy of the unknown word – whereas the unknown word itself provides no clue.

From a cognitive point of view, it can be argued that many basic units that the human cognitive system uses have multiple words. Particularly convincing examples for such units are phrasal verbs in English, which often have a non-compositional meaning. It is implausible to suppose that we retrieve atomic representations for, say, “keep”, “up”, “on” and “from” and then combine them to form the meanings of the expressions “keep your head up,” “keep the pressure on,” “keep him from laughing”. Rather, it is more plausible that we recognize “keep up”, “keep on” and “keep from” as relevant basic linguistic units in these contexts and that the human cognitive system represents them as units.

We can view SkipBs and discontinuous phrases as extreme cases of treating two words that do not occur next to each other as a unit. SkipBs are defined purely statistically and we will consider any pair of words as a potential SkipB in our experiments below. In contrast, discontinuous phrases are well motivated. It is clear that the words “picked” and “up” in the sentence “I picked it up” belong together and form a unit very similar to the word “collected” in “I collected it”. The most useful definition of discontinuous units probably lies in between SkipBs and phrases: we definitely want to include all phrases, but also some (but not all) statistical SkipBs. The initial work presented in this paper may help in finding a good “compromise” definition.

This paper contributes a preliminary investigation of generalized phrase embeddings and shows that they are better suited than word embeddings for a coreference resolution classification task and for paraphrase identification. Another contribution is that the phrase embeddings we release[2] could be a valuable resource for others.

The remainder of this paper is organized as follows. Section 2 and Section 3 introduce how to learn embeddings for SkipBs and phrases, respectively. Experiments are provided in Section 4. Subsequently, we analyze related work in Section 5, and conclude our work in Section 6.

2 Embedding learning for SkipBs

Using the English Gigaword corpus (Parker et al., 2009), we use the skip-gram model as implemented in word2vec[3] (Mikolov et al., 2013) to induce embeddings. The word2vec skip-gram scheme is a neural network language model that uses a given word to predict its context words within a window. To be able to use word2vec directly without code changes, we represent the corpus as a sequence of sentences, each consisting of two tokens: a SkipB and a word that occurs between the two enclosing words of the SkipB. The distance k between the two enclosing words can be varied. In our experiments, we use either distance k = 2 or distance 2 ≤ k ≤ 3. For example, for k = 2, the trigram w_{i-1} w_i w_{i+1} generates the single sentence “w_{i-1}*w_{i+1} w_i”; and for 2 ≤ k ≤ 3, the fourgram w_{i-2} w_{i-1} w_i w_{i+1} generates the four sentences “w_{i-2}*w_i w_{i-1}”, “w_{i-1}*w_{i+1} w_i”, “w_{i-2}*w_{i+1} w_{i-1}” and “w_{i-2}*w_{i+1} w_i”.
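The corpus reformatting described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `skipb_sentences`, whitespace tokenization, and the ordering of the output pairs are all assumptions made for the example.

```python
from typing import Iterable, List, Tuple


def skipb_sentences(tokens: List[str],
                    distances: Iterable[int] = (2,)) -> List[Tuple[str, str]]:
    """Return two-token "sentences" (SkipB, middle word).

    For every pair of enclosing words at distance k, one pair is emitted
    per word that occurs strictly between the two enclosing words.
    (Hypothetical helper; the name and output order are assumptions.)
    """
    pairs = []
    for k in distances:
        for left in range(len(tokens) - k):
            right = left + k
            # The gap in a SkipB is written as "*", as in the text.
            skipb = f"{tokens[left]}*{tokens[right]}"
            for middle in tokens[left + 1:right]:
                pairs.append((skipb, middle))
    return pairs


if __name__ == "__main__":
    sentence = "this tea helped me to relax".split()
    for skipb, middle in skipb_sentences(sentence, distances=(2,)):
        # Each printed line is one two-token sentence of the reformatted corpus.
        print(skipb, middle)
```

Writing each pair on its own line, separated by a space, yields a file in which every line is a two-token sentence, so an unmodified skip-gram trainer sees the middle word as the context of the SkipB.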

In this setup, the middle context of each SkipB is kept (i.e., the second token in the new sentences), and the surrounding context words of the original sentences are also kept (i.e., the SkipB in the new sentences). We can run word2vec without any changes on the reformatted corpus to learn embeddings for SkipBs. As a baseline, we run word2vec on the original corpus to compute embeddings for words. Embedding size is set to 200.

3 Embedding learning for phrases

3.1 Phrase collection

Phrases defined by a lexicon have not been deeply investigated before in deep learning. To collect a canonical phrase set, we extract two-word phrases defined in Wiktionary[4] and two-word phrases defined in WordNet (Miller and Fellbaum, 1998) to form a collection of size 95218. This collection contains phrases whose parts always occur next to each other (e.g., “cold cuts”) and phrases whose parts more often occur separated from each other (e.g., “take (something) apart”).

3.2 Identification of phrase continuity

Wiktionary and WordNet do not categorize phrases as continuous or discontinuous, so we need a heuristic for determining this automatically.

For each phrase “A B”, we compute [c_1, c_2, c_3, c_4, c_5] where c_i, 1 ≤ i ≤ 5, indicates there are c_i occurrences of A and B in that order with a distance of i. We compute these statistics for a corpus consisting of Gigaword and Wikipedia. We set the maximal distance to 5 because discontinuous phrases are rarely separated by more than 5 tokens.
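The distance statistics can be illustrated with a short Python sketch. It rests on assumptions not stated in the text: distance is taken to be the difference in token positions (so adjacent words have distance 1, consistent with the SkipB definition), every B within the window is counted for each A rather than only the nearest one, and sentence boundaries are ignored. The function name `distance_counts` is hypothetical.

```python
from typing import List

MAX_DISTANCE = 5  # maximal distance used in the text


def distance_counts(tokens: List[str], a: str, b: str,
                    max_distance: int = MAX_DISTANCE) -> List[int]:
    """Return [c_1, ..., c_max_distance] for the phrase "A B".

    counts[i - 1] is the number of positions where `a` is followed by `b`
    exactly i tokens later (assumption: distance = position difference).
    """
    counts = [0] * max_distance
    for pos, token in enumerate(tokens):
        if token != a:
            continue
        for i in range(1, max_distance + 1):
            if pos + i < len(tokens) and tokens[pos + i] == b:
                counts[i - 1] += 1
    return counts


if __name__ == "__main__":
    text = "i picked it up and then picked up the phone".split()
    print(distance_counts(text, "picked", "up"))
```

In this toy input, “picked … up” occurs once with one intervening word and once adjacently; a phrase whose counts concentrate on c_1 would look continuous, while mass on larger distances suggests a discontinuous phrase.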

6 Conclusion and Future-Work

We have argued that generalized phrases are part of the inventory of linguistic units that we should compute embeddings for, and we have shown that such embeddings are superior to word form embeddings in a coreference resolution task and a standard paraphrase identification task. In this paper we have presented initial work on several problems that we plan to continue in the future: (i) How should the inventory of continuous and discontinuous phrases be determined? We used a purely statistical definition on the one hand and dictionaries on the other. A combination of the two methods would be desirable. (ii) How can we distinguish between phrases that only occur in continuous form and phrases that must or can occur discontinuously? (iii) Given a sentence that contains the parts of a discontinuous phrase in correct order, how do we determine that the cooccurrence of the two parts constitutes an instance of the discontinuous phrase? (iv) Which tasks benefit most significantly from the introduction of generalized phrases?


  1. Socher et al. use the term “word sequence”. Mikolov et al. use the term “phrase” for word sequences that are mostly frequent continuous collocations.
  2. http://www.cis.lmu.de/pub/phraseEmbedding.txt.bz2
  3. https://code.google.com/p/word2vec/
  4. http://en.wiktionary.org/wiki/Wiktionary:Main_Page



Authors: Hinrich Schütze, Wenpeng Yin
Title: An Exploration of Embeddings for Generalized Phrases
Year: 2014