2013 BetterWordRepresentationswithRe

(Luong et al., 2013) ⇒ Thang Luong, Richard Socher, and Christopher Manning. (2013). “Better Word Representations with Recursive Neural Networks for Morphology.” In: Proceedings of the Seventeenth Conference on Computational Natural Language Learning (CoNLL-2013).

Subject Headings: Word Embedding System, Character-Level Embedding System, MorphoRNN Embedding System, Morphological Segmentation Toolkit, Morfessor.

Notes

Cited By

Google Scholar ~ 849 Citations.

2014

(Pennington et al., 2014) ⇒ Jeffrey Pennington, Richard Socher, and Christopher Manning. (2014). “Glove: Global Vectors for Word Representation.” In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532-1543.

Quotes

Abstract

Vector-space word representations have been very successful in recent years at improving performance across a variety of NLP tasks. However, common to most existing work, words are regarded as independent entities without any explicit relationship among morphologically related words being modeled. As a result, rare and complex words are often poorly estimated, and all unknown words are represented in a rather crude way using only one or a few vectors. This paper addresses this shortcoming by proposing a novel model that is capable of building representations for morphologically complex words from their morphemes. We combine recursive neural networks (RNNs), where each morpheme is a basic unit, with neural language models (NLMs) to consider contextual information in learning morphologically aware word representations. Our learned models outperform existing word representations by a good margin on word similarity tasks across many datasets, including a new dataset we introduce focused on rare words to complement existing ones in an interesting way.

1. Introduction

The use of word representations or word clusters pretrained in an unsupervised fashion from lots of text has become a key “secret sauce” for the success of many NLP systems in recent years, across tasks including named entity recognition, part-of-speech tagging, parsing, and semantic role labeling. This is particularly true in deep neural network models (Collobert et al., 2011), but it is also true in conventional feature-based models (Koo et al., 2008; Ratinov and Roth, 2009).

Deep learning systems give each word a distributed representation, i.e., a dense low-dimensional real-valued vector]] or an embedding. The main advantage of having such a distributed representation over word classes is that it can capture various dimensions of both semantic and syntactic information in a vector where each dimension corresponds to a latent feature of the word. As a result, a distributed representation is compact, less susceptible to data sparsity, and can implicitly represent an exponential number of word clusters.

However, despite the widespread use of word clusters and word embeddings, and despite much work on improving the learning of word representations, from feed-forward networks (Bengio et al., 2003) to hierarchical models (Morin, 2005; Mnih and Hinton, 2009) and recently recurrent neural networks (Mikolov et al., 2010; Mikolov et al., 2011), these approaches treat each full-form word as an independent entity and fail to capture the explicit relationship among morphological variants of a word.^[1] The fact that morphologically complex words are often rare exacerbates the problem. Though existing clusterings and embeddings represent well frequent words, such as “distinct”, they often badly model rare ones, such as “distinctiveness”.

In this work, we use recursive neural networks (Socher et al., 2011b), in a novel way to model morphology and its compositionality. Essentially, we treat each morpheme as a basic unit in the RNNs and construct representations for morphologically complex words on the fly from their morphemes. By training a neural language model (NLM) and integrating RNN structures for complex words, we utilize contextual information in an interesting way to learn morphemic semantics and their compositional properties. Our model has the capability of building representations for any new unseen word comprised of known morphemes, giving the model an infinite (if still incomplete) covered vocabulary.

Our learned representations outperform publicly available embeddings by a good margin on word similarity tasks across many datasets, which include our newly released dataset focusing on rare words (see Section 5). The detailed analysis in Section 6 reveals that our models can blend well syntactic information, i.e., the word structure, and the semantics in grouping related words.^[2]

Common to many of these works is use of a distributed word representation as the basic input unit. These representations usually capture local co-occurrence statistics but have also been extended to include document-wide context (Huang et al., 2012). Their main advantage is that they can both be learned unsupervisedly as well as be tuned for supervised tasks. In the former training regiment, they are evaluated by how well they can capture human similarity judgments. They have also been shown to perform well as features for supervised tasks, e.g., NER (Turian et al., 2010).

While much work has focused on different objective functions for training single and multiword vector representations, very little work has been done to tackle sub-word units and how they can be used to compute syntactic-semantic word vectors. Collobert et al. (2011) enhanced word vectors with additional character-level features such as capitalization but still can not recover more detailed semantics for very rare or unseen words, which is the focus of this work.

This is somewhat ironic, since working out correct morphological inflections was a very central problem in early work in the parallel distributed processing paradigm and criticisms of it (Rumelhart and McClelland, 1986; Plunkett and Marchman, 1991), and later work developed more sophisticated models of morphological structure and meaning (Gasser and Lee, 1990; Gasser, 1994), while not providing a compositional semantics nor working at the scale of what we present.

To the best of our knowledge, the work closest to ours in terms of handing unseen words are the factored NLMs (Alexandrescu and Kirchhoff, 2006) and the compositional distributional semantic models (DSMs) (Lazaridou et al., 2013). In the former work, each word is viewed as a vector of features such as stems, morphological tags, and cases, in which a single embedding matrix is used to look up all of these features.^[3] Though this is a principled way of handling new words in NLMs, the by-product word representations, i.e. the concatenations of factor vectors, do not encode in them the compositional information (they are stored in the NN parameters). Our work does not simply concatenate vectors of morphemes, but rather combines them using RNNs, which captures morphological compositionality.

The latter work experimented with different compositional DSMs, originally designed to learn meanings of phrases, to derive representations for complex words, in which the base unit is the morpheme similar to ours. However, their models can only combine a stem with an affix and does not support recursive morpheme composition. It is, however, interesting to compare our neural-based representations with their DSM-derived ones and cross test these models on both our rare word similarity dataset and their nearest neighbor one, which we leave as future work.

Mikolov et al. (2013) examined existing word embeddings and showed that these representations already captured meaningful syntactic and semantic regularities such as the singular / plural relation that $x_{apple} - x_{apples} \approx x_{car} - x_{cars}$. However, we believe that these nice relationships will not hold for rare and complex words when their vectors are poorly estimated as we analyze in Section 6. Our model, on the other hand, explicitly represents these regularities through morphological structures of words.

3. Morphological RNNs

Our morphological Recursive Neural Network (morphoRNN) is similar to (Socher et al., 2011b), but operates at the morpheme level instead of at the word level. Specifically, morphemes, the minimum meaning-bearing unit in languages, are modeled as real-valued vectors of parameters, and are used to build up more complex words. We assume access to a dictionary of morphemic analyses of words, which will be detailed in Section 4.

Following (Collobert and Weston, 2008), distinct morphemes are encoded by column vectors in a morphemic embedding matrix $\mathbf{W}_e \in \R^{d\times |M|}, where $d$ is the vector dimension and $M$ is an ordered set of all morphemes in a language.

As illustrated in Figure 1, vectors of morphologically complex words are gradually built up from their morphemic representations. At any local decision (a dotted node), a new parent word vector ($\mathbf{p}$) is constructed by combining a stem vector ($\mathbf{x}_{stem}$) and an affix vector ($x_{affix}$) as follow:

$\mathbf{p}=f\left(\mathbf{W_m}\left[\mathbf{x}_m;\mathbf{x}_{affix}\right]+\mathbf{b_m}\right)$

(1)

Here, $\mathbf{W_m} \in \R^{d\times 2d}$ is a matrix of morphemic parameters while $\mathbf{b_m} \in \R^{d\times 1}$ is an intercept vector. We denote an element-wise activation function as $f$, such as $\tanh$. This forms the basis of our morphoRNN models with $\boldsymbol{\theta} = \{\mathbf{W_e}, \mathbf{W_m}, \mathbf{b_m}\}$ being the parameters to be learned.

**Figure 1: Morphological Recursive Neural Network**. A vector representation for the word “`unfortunately`” is constructed from morphemic vectors: `un_pre`, `fortunate_stm`, `ly_suf`. Dotted nodes are computed on-the-fly and not in the lexicon.

3.1 Context-insensitive Morphological RNN

Our first model examines how well morphoRNNs could construct word vectors simply from the morphemic representation without referring to any context information. Input to the model is a reference embedding matrix, i.e. word vectors trained by an NLM such as (Collobert and Weston, 2008) and (Huang et al., 2012). By assuming that these reference vectors are right, the goal of the model is to construct new representations for morphologically complex words from their morphemes that closely match the corresponding reference ones.

Specifically, the structure of the context insensitive morphoRNN (cimRNN) is the same as the basic morphoRNN. For learning, we first define a cost function s for each word xi as the squared Euclidean distance between the newly-constructed representation $\mathbf{p}_c \left(x_i\right)$ and its reference vector $\mathbf{p}_r \left(xi\right): s \left(x_i\right) = \parallel\mathbf{p}_c \left(x_i\right) − \mathbf{p}_r \left(x_i\right) \parallel^2_2$.

The objective function is then simply the sum of all individual costs over $N$ training examples, plus a regularization term, which we try to minimize:

$J\left(\boldsymbol{\theta}\right) = \displaystyle \sum^N_{i=1} s\left(x_i\right) + \dfrac{\lambda}{2} \parallel\boldsymbol{\theta}\parallel^2_2 $

(2)

3.2 Context-sensitive Morphological RNN

The cimRNN model, though simple, is interesting to attest if morphemic semantics could be learned solely from an embedding. However, it is limited in several aspects. Firstly, the model has no chance of improving representations for rare words which might have been poorly estimated. For example, “distinctness” and “unconcerned” are very rare, occurring only 141 and 340 times in Wikipedia documents, even though their corresponding stems “distinct” and “concern” are very frequent (35323 and 26080 respectively). Trying to construct exactly those poorly-estimated word vectors might result in a bad model with parameters being pushed in wrong directions.

Secondly, though word embeddings learned from an NLM could, in general, blend well both the semantic and syntactic information, it would be useful to explicitly model another kind of syntactic information, the word structure, as we train our embeddings. Motivated by these limitations, we propose a context-sensitive morphoRNN (csmRNN) which integrates RNN structures into NLM training, allowing for contextual information being taken into account in learning morphemic compositionality. Specifically, we adopt the NLM training approach proposed in (Collobert et al., 2011) to learn word embeddings, but build representations for complex words from their morphemes. During learning, updates at the top level of the neural network will be back-propagated all the way till the morphemic layer.

**Figure 2: Context-sensitive morphological RNN** has two layers: (a) the morphological RNN, which constructs representations for words from their morphemes and (b) the word-based neural language which optimizes scores for relevant ngrams.

Structure-wise, we stack the NLM on top of our morphoRNN as illustrated in Figure 2. Complex words like “unfortunately” and “closed” are constructed from their morphemic vectors, $un_{pre} + fortunate_{stm} + ly_{suf}$ and $close_{stm} + d_{suf}$, whereas simple words^[4], i.e. stems, and affixes could be looked up from the morphemic embedding matrix $\mathbf{W}_e$ as in standard NLMs. Once vectors of all complex words have been built, the NLM assigns a score for each ngram $n_i$ consisting of words $x_1,\ldots , x_n$ as follows:

$s\left(n_{i}\right)=\boldsymbol{v}^{\top} f\left(\boldsymbol{W}\left[\boldsymbol{x}_{1} ; \ldots ; \boldsymbol{x}_{n}\right]+\boldsymbol{b}\right)$

Here, $\mathbf{x}_j$ is the vector representing the word $x_j$. We follow (Huang et al., 2012) to use a simple feed-forward network with one $h$-dimensional hidden layer. $\mathbf{W} \in \R^{h\times nd}, \mathbf{b} \in \R^{h\times 1}$, and $\mathbf{v} \in \R^{h\times 1}$ are parameters of the NLM, and $f$ is an element-wise activation function as in Eq.(1). We adopt a ranking-type cost in defining our objective function to minimize as below:

$J(\boldsymbol{\theta})=\sum_{i=1}^{N} \max \left\{0,1-s\left(n_{i}\right)+s\left(\bar{n}_{i}\right)\right\}$

(3)

Here, $N$ is the number of all available ngrams in the training corpus, whereas $\bar{n}_i$ is a “corrupted” ngram created from $n_i$ by replacing its last word with a random word similar in spirit to (Smith and Eisner, 2005). Our model parameters are $\boldsymbol{\theta} =\{\mathbf{W}_e, \mathbf{W}_m, \mathbf{b}_m, \mathbf{W}, \mathbf{b}, \mathbf{v}\}$.

Such a ranking criterion influences the model to assign higher scores to valid ngrams than to invalid ones and has been demonstrated in (Collobert et al., 2011) to be both efficient and effective in learning word representations.

3.3 Learning

Our models alternate between two stages: (1) forward pass – recursively construct morpheme trees (cimRNN, csmRNN) and language model structures (csmRNN) to derive scores for training examples and (2) back-propagation pass – compute the gradient of the corresponding object function with respect to the model parameters.

For the latter pass, computing the objective gradient amounts to estimating the gradient for each individual cost $\frac{\partial s (x)}{\partial\boldsymbol{\theta}}$, where $x$ could be either a word (cimRNN) or an ngram (csmRNN). We have the objective gradient for the cimRNN derived as:

$\frac{\partial J(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\displaystyle\sum_{i=1}^{N} \frac{\partial s\left(x_{i}\right)}{\partial \boldsymbol{\theta}}+\lambda \boldsymbol{\theta}$

In the case of csmRNN, since the objective function in Eq. (3) is not differentiable, we use the subgradient method (Ratliff et al., 2007) to estimate the objective gradient as:

$\frac{\partial J(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\displaystyle\sum_{i: 1-s\left(n_{i}\right)+s\left(\bar{n}_{i}\right)>0}-\frac{\partial s\left(n_{i}\right)}{\partial \boldsymbol{\theta}}+\frac{\partial s\left(\bar{n}_{i}\right)}{\partial \boldsymbol{\theta}}$

Back-propagation through structures (Goller and Kuchler, 1996) is employed to compute the gradient for each individual cost with similar formulae as in (Socher et al., 2010). Unlike their RNN structures over sentences, where each sentence could have an exponential number of derivations, our morphoRNN structure per word is, in general, deterministic. Each word has a single morphological tree structure which is constructed from the main morpheme (the stem) and gradually appended affixes in a fixed order (see Section 4 for more details). As a result, both our forward and backward passes over morphological structures are efficient with no recursive calls implementation-wise.

4. Unsupervised Morphological Structures

We utilize an unsupervised morphological segmentation toolkit, named Morfessor by Creutz and Lagus (2007), to obtain segmentations for words in our vocabulary. Morfessor segments words in two stages: (a) it recursively splits words to minimize an objective inspired by the minimum description length principle and (b) it labels morphemes with tags $\mathrm{pre}$ (prefixes), $\mathrm{stm}$ (stems), and $\mathbf{suf}$ (suffixes) using hidden Markov models.

Morfessor captures a general word structure of the form $(\mathrm{pre}^{\star}\; \mathrm{stm}\; \mathrm{suf}^{\star})^+$, which is handy for words in morphologically rich languages like Finnish or Turkish. However, such general form is currently unnecessary in our models as the morphoRNNs assume input of the form $\mathrm{pre}^{\star}\; \mathrm{stm}\; \mathrm{suf}^{\star}$ for efficient learning of the RNN structures: a stem is always combined with an affix to yield a new stem.^[5] We, thus, postprocess as follows:

(1) Restrict segmentations to the form $\mathrm{pre}^{\star}\;\mathrm{stm}\{1, 2\}\; \mathrm{suf}^{\star}$ : allow us to capture compounds.
(2) Split hyphenated words $A-B$ as $A_{stm}$ $B_{stm}$.
(3) For a segmentation with two stems, $\mathrm{pre}^{\star}\; A_{stm} B_{stm}\; \mathrm{suf}^{\star}$, we decide if one could be a main stem while the other could functions as an affix.^[6] Otherwise, we reject the segmentation. This will provide us with more interesting morphemes such as alpre in Arabic names (al-jazeera, al-salem) and relatedsuf in compound adjectives (health-related, government-related).
(4) To enhance precision, we reject a segmentation if it has either an affix or an unknown stem (not a word by itself) whose type count is below a predefined threshold.^[7]

The final list of affixes produced is given in Table 1. Though generally reliable, our final segmentations do contain errors, most notably non-compositional ones, e.g. $de_{pre}$ $fault_{stm}$ $ed_{suf}$ or $re_{pre}$ $turn_{stm}$ $s_{suf}$. With a sufficiently large number of segmentation examples, we hope that the model would be able to pick up general trends from the data. In total, we have about 22K complex words out of a vocabulary of 130K words.

**Table 1:** List of prefixes and suffixes discovered – conventional affixes in English are italicized.
Prefixes	Suffixes
0 al all anti auto co counter cross de dis electro end ex first five focus four half high hyper ill im in inter ir jan jean long low market mc micro mid multi neuro newly no non off one over post pre pro re second self semi seven short six state sub super third three top trans two un under uni well	able al ally american ance ate ation backed bank based born controlled d dale down ed en er es field ford free ful general head ia ian ible ic in ing isation ise ised ish ism ist ity ive ization ize ized izing land led less ling listed ly made making man ment ness off on out owned related s ship shire style ton town up us ville wood

Examples of words with interesting affixes are given in Table 2. Beside conventional affixes, non-conventional ones like “0” or “mc” help further categorize rare or unknown words into meaningful groups such as measurement words or names.

**Table 2:**Sample affixes and corresponding words.
Affix	Words
0	0-acre, 0-aug, 0-billion, 0-centistoke
anti	anti-immigrant, antipsychotics
counter	counterexample, counterinsurgency
hyper	hyperactivity, hypercholesterolemia
mc	mcchesney, mcchord, mcdevitt
bank	baybank, brockbank, commerzbank
ford	belford, blandford, carlingford
land	adventureland, bodoland, bottomland
less	aimlessly, artlessness, effortlessly
owned	bank-owned, city-owned disney-owned

5. Experiments

As our focus is in learning morphemic semantics, we do not start training from scratch, but rather, initialize our models with existing word representations. In our experiments, we make use of two publicly-available embeddings (50-dimensional) provided by (Collobert et al., 2011) (denoted as C&W)^[8] and Huang et al. (2012) (referred as HSMN)^[9].

Both of these representations are trained on Wikipedia documents using the same ranking-type cost function as in Eq. (3). The latter further utilizes global context and adopts a multi-prototype approach, i.e. each word is represented by multiple vectors, to better capture word semantics in various contexts. However, we only use their single-prototype embedding^[10] and as we train, we do not consider the global sentence-level context information. It is worth to note that these aspects of the HSMN embedding – incorporating global context and maintaining multiple prototypes – are orthogonal to our approach, which would be interesting to investigate in future work.

For the context-sensitive morphoRNN model, we follow Huang et al. (2012) to use the April 2010 snapshot of the Wikipedia corpus (Shaoul and Westbury, 2010). All paragraphs containing non-roman characters are removed while the remaining text are lowercased and then tokenized. The resulting clean corpus contains about 986 million tokens. Each digit is then mapped into 0, i.e. 2013 will become 0000. Other rare words not in the vocabularies of C&W and HSMN are mapped to an UNKNOWN token, and we use and for padding tokens representing the beginning and end of each sentence.

Follow (Huang et al., 2012)'s implementation, which our code is based on initially, we use 50-dimensional vectors to represent morphemic and word embeddings. For cimRNN, the regularization weight $\lambda$ is set to $10^{−2}$. For csmRNN, we use 10-word windows of text as the local context, 100 hidden units, and no weight regularization.

5.1 Word Similarity Task

Similar to (Reisinger and Mooney, 2010) and (Huang et al., 2012), we evaluate the quality of our morphologically-aware embeddings on the popular WordSim-353 dataset (Finkelstein et al., 2002), WS353 for short. In this task, we compare correlations between the similarity scores given by our models and those rated by human.

To avoid overfitting our models to a single dataset, we benchmark our models on a variety of others including MC (Miller and Charles, 1991), RG (Rubenstein and Goodenough, 1965), SCWS^∗^[11] (Huang et al., 2012), and our new rare word (RW) dataset (details in §5.1.1). Information about these datasets are summarized in Table 3.

**Table 3: Word similarity datasets** and their statistics: number of pairs/raters/type counts as well as rating scales. The number of complex words are shown as well (both type and token counts). RW denotes our new rare word dataset.
	pairs	type	raters	scale	Complex words
	pairs	type	raters	scale	token	type
WS353	353	437	13-16	0-10	24	17
MC	30	39	38	0-4	0	0
RG	65	48	51	0-4	0	0
SCWS^∗	1762	1703	10	0-10	190	113
RW (new)	2034	2951	10	0-10	987	686

We also examine these datasets from the “rareness” aspect by looking at distributions of words across frequencies as in Table 4. The first bin counts unknown words in each dataset, while the remaining bins group words based on their frequencies extracted from Wikipedia documents. It is interesting to observe that WS353, MC, RG contain very frequent words and have few complex words (only WS353 has).^[12] SCWS^∗ and RW have a more diverse set of words in terms of frequencies and RW has the largest number of unknown and rare words, which makes it a challenging dataset.

**Table 4: Word distribution by frequencies** – distinct words in each dataset are grouped based on frequencies and counts are reported for the following bins : $unknown | [1, 100] / [101, 1000] / [1001, 10000] / [10001, \infty)$. We report counts for all words in each dataset as well as complex ones
	All words	Complex words
WS353	$0 \vert 0 / 9 / 87 / 341$	$0 \vert 0 / 1 / 6 / 10$
MC	$0 \vert 0 / 1 / 17 / 21$	$0 \vert 0 / 0 / 0 / 0$
RG	$0 \vert 0 / 4 / 22 / 22$	$0 \vert 0 / 0 / 0 / 0$
SCWS^∗	$26 \vert 2 / 140 / 472 / 1063 $	$8 \vert 2 / 22 / 44 / 45$
RW	$801 \vert 41 / 676 / 719 / 714 $	$ 621\vert 34 / 311 / 238 / 103$

5.1.1 Rare Word Dataset

As evidenced in Table 4, most existing word similarity datasets contain frequent words and few of them possesses enough rare or morphologically complex words that we could really attest the expressiveness of our morphoRNN models. In fact, we believe a good embedding in general should be able to learn useful representations for not just frequent words but also rare ones. That motivates us to construct another dataset focusing on rare words to complement existing ones.

Our dataset construction proceeds in three stages: (1) select a list of rare words, (2) for each of the rare words, find another word (not necessarily rare) to form a pair, and (3) collect human judgments on how similar each pair is.

Rare word selection: our choices of rare words ($word_1$) are based on their frequencies – based on five bins $(5, 10 ], (10, 100 ], (100, 1000 ], (1000, 10000 ]$, and the affixes they possess. To create a diverse set of candidates, we randomly select 15 words for each configuration (a frequency bin, an affix). At the scale of Wikipedia, a word with frequency of 1-5 is most likely a junk word, and even restricted to words with frequencies above five, there are still many non-English words. To counter such problems, each word selected is required to have a non-zero number of synsets in WordNet (Miller, 1995).

Table 5 (top) gives examples of rare words selected and organized by frequencies and affixes. It is interesting to find out that words like obtainment and acquirement are extremely rare (not in traditional dictionaries) but are perfectly understandable. We also have less frequent words like revetment from French or organismal from biology.

**Table 5: Rare words** (top) – $word_1$ by affixes and frequencies and sample word pairs (bottom).
	(5, 10]	(10, 100]	(100, 1000]
un−	untracked unrolls undissolved	unrehearsed unflagging unfavourable	unprecedented unmarried uncomfortable
−al	apocalyptical traversals bestowals	acoustical extensional organismal	directional diagonal spherical
−ment	obtainment acquirement retrenchments	discernment revetment rearrangements	confinement establishment management
word₁	untracked unflagging unprecedented	apocalyptical organismal diagonal	obtainment discernment confinement
word₂	inaccessible constant new	prophetic system line	acquiring knowing restraint

Pair construction: following (Huang et al., 2012), we create pairs with interesting relationships for each $word_1$ as follow. First, a WordNet synset of $word_1$ is randomly selected, and we construct a set of candidates which connect to that synset through various relations, e.g., hypernyms, hyponyms, holonyms, meronyms, and attributes. A $word_2$ is then randomly selected from these candidates, and the process is repeated another time to generate a total of two pairs for each $word_1$. Sample word pairs are given in Table 5 in which $word_2$ includes mostly frequent words, implying a balance of words in terms of frequencies in our dataset. We collected 3145 pairs after this stage.

Human judgment: we use Amazon Mechanical Turk to collect 10 human similarity ratings on a scale of $[ 0, 10 ]$ per word pair.^[13] Such procedure has been demonstrated by Snow et al. (2008) in replicating ratings for the MC dataset, achieving close inter-annotator agreement with expert raters. Since our pairs contain many rare words which are challenging even to native speakers, we ask raters to indicate for each pair if they do not know the first word, the second word, or both. We use such information to collect reliable ratings by either discard pairs which many people do not know or collect additional ratings to ensure we have 10 ratings per pair.^[14] As a result, only 2034 pairs are retained. ==== 5.2 Results ====

**Table 6: Nearest neighbors**. We show morphologically related words and their closest words in different representations (“unaffect” is a pseudo-word; ∅ marks no results due to unknown words).
Words	C&W	C&W+cimRNN	C&W+csmRNN
commenting	insisting insisted focusing hinted	republishing accounting expounding	commented comments criticizing
comment	commentary rant statement remark	commentary rant statement remark	rant commentary statement anecdote
distinctness	morphologies pesawat clefts	modality indistinct tonality spatiality	indistinct distinctiveness largeness uniqueness
distinct	different distinctive broader narrower	different distinctive broader divergent	divergent diverse distinctive homogeneous
unaffected	unnoticed dwarfed mitigated	disaffected unconstrained uninhibited	undesired unhindered unrestricted
affected	caused plagued impacted damaged	disaffected unaffected mitigated disturbed	complicated desired constrained reasoned
unaffect	∅	affective affecting affectation unobserved	affective affecting affectation restrictive
affect	exacerbate impacts characterize	affects affectation exacerbate characterize	decrease arise complicate exacerbate
heartlessness	∅	fearlessness vindictiveness restlessness	depersonalization terrorizes sympathizes
heartless	merciless sadistic callous mischievous	merciless sadistic callous mischievous	sadistic callous merciless hideous
heart	death skin pain brain life blood	death skin pain brain life blood	death brain blood skin lung mouth
saudi-owned	avatar mohajir kripalani fountainhead	saudi-based somaliland al-jaber	saudi-based syrian-controlled syrian-backed
short-changed	kindled waylaid endeared peopled	conformal conformist unquestionable	short-termism short-positions self-sustainable

We evaluate the quality of our morphoRNN embeddings through the word similarity task discussed previously. The Spearman's rank correlation is used to gauge how well the relationship between two variables, the similarity scores given by the NLMs and the human annotators, could be described using a monotonic function.

Detailed performance of the morphoRNN embeddings trained from either the HSMN or the C&W embeddings are given in Table 7 for all datasets. We also report baseline results (rows HSMN, C&W) using these initial embeddings alone, which interestingly reveals strengths and weaknesses of existing embeddings. While HSMN is good for datasets with frequent words (WS353, MC, and RG), its performances for those with more rare and complex words (SCWS^∗ and RW) are much inferior than those of C&W, and vice versa. Additionally, we consider two slightly more competitive baselines (rows + stem) based on the morphological segmentation of unknown words: instead of using a universal vector representing all unknown words, we use vectors representing the stems of unknown words. These baselines yield slightly better performance for the SCWS^∗ and RW datasets while the trends we mentioned earlier remain the same.

**Table 7: Word similarity task** – shown are Spearman's rank correlation coefficient ($\rho\times 100$) between similarity scores assigned by neural language models and by human annotators. stem indicates baseline systems in which unknown words are represented by their stem vectors. cimRNN and csmRNN refer to our context insensitive and sensitive morphological RNNs respectively.
	WS353	MC	RG	SCWS^∗	RW
HSMN	62.58	65.90	62.81	32.11	1.97
+stem	62.58	65.90	62.81	32.11	3.40
+cimRNN	62.81	65.90	62.81	32.97	14.85
+csmRNN	64.58	71.72	65.45	43.65	22.31
C&W	49.77	57.37	49.30	48.59	26.75
+stem	49.77	57.37	49.30	49.05	28.03
+cimRNN	51.76	57.37	49.30	47.00	33.24
+csmRNN	57.01	60.20	55.40	48.48	34.36

Our first model, the context-insensitive morphoRNN (cimRNN), outperforms its corresponding baseline significantly over the rare word dataset. The performance is constant for MC and RG (with no complex words) and modestly improved for MC (with some complex words – see Table 4). This is expected since the cimRNN model only concerns about reconstructing the original embedding (while learning word structures), and the new representation mostly differs at morphologically complex words. For SCWS^∗, the performance, however, decreases when training with C&W, which perhaps is due to: (a) the baseline performance of C&W for SCWS^∗ is competitive and (b) the model trades off between learning syntactics (the word structure) and capturing semantics, which requires context information.

On the other hand, the context-sensitive morphoRNN (csmRNN) consistently improves correlations over the cimRNN model for all datasets, demonstrating the effectiveness of using surrounding contexts in learning both morphological syntactics and semantics. It also outperforms the corresponding baselines by a good margin for all datasets (except for SCWS^∗). This highlights the fact that our method is reliable and potentially applicable for other embeddings.

6. Analysis

To gain a deeper understanding of how our morphoRNN models have "moved" word vectors around, we look at nearest neighbors of several complex words given by various embeddings, where cosine similarity is used as a distance metric. Examples are shown in Table 6 for three representations: C&W and the context-insensitive/sensitive morphoRNN models trained on the C&W embedding.^[15]

Syntactically, it is interesting to observe that the cimRNN model could well enforce structural agreement among related words. For example, it returns V-ing as nearest neighbors for “commenting” and similarly, JJ-ness for “fearlessness”, an unknown word that C&W cannot handle. However, for those cases, the nearest neighbors are badly unrelated.

On the semantic side, we notice that when structural agreement is not enforced, the cimRNN model tends to cluster words sharing the same stem together, e.g., rows with words of the form --affect--^[16]. This might be undesirable when we want to differentiate semantics of words sharing the same stem, e.g. “affected” and “unaffected".

The csmRNN model seems to balance well between the two extremes (syntactic and semantic) by taking into account contextual information when learning morphological structures. It returns neighbors of the same structure un--ed for “unaffected", but does not include any negation of “affected” in the top 10 results when “affected” is queried.^[17] Even better, the answers for “distinctness” have blended well both types of results.

7. Conclusion

This paper combines recursive neural networks (RNNs) and neural language models (NLMs) in a novel way to learn better word representations. Each of these components contributes to the learned syntactic-semantic word vectors in a unique way. The RNN explicitly models the morphological structures of words, i.e., the syntactic information, to learn morphemic compositionality. This allows for better estimation of rare and complex words and a more principled way of handling unseen words, whose representations could be constructed from vectors of known morphemes.

The NLMs, on the other hand, utilize surrounding word contexts to provide further semantics to the learned morphemic representations. As a result, our context-sensitive morphoRNN embeddings could significantly outperform existing embeddings on word similarity tasks for many datasets. Our analysis reveals that the model could blend well both the syntactic and semantic information in clustering related words. We have also made available a word similarity dataset focusing on rare words to complement existing ones which tend to include frequent words.

Lastly, as English is still considered limited in terms of morphology, our model could potentially yield even better performance when applied to other morphologically complex languages such as Finnish or Turkish, which we leave for future work. Also, even within English, we expect our model to be value to other domains, such as bioNLP with complicated but logical taxonomy.

Acknowledgments

We thank the anonymous reviewers for their feedback and Eric Huang for making his various pieces of code available to us as well as answering our questions on different datasets. Stanford University gratefully acknowledges the support of the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filtering of Text (DEFT) Program under Air Force Research Laboratory (AFRL) contract no. FA8750-13-2-0040 and the DARPA Broad Operational Language Translation (BOLT) program through IBM. Any opinions, findings, and conclusion or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of the DARPA, AFRL, or the US government.

Footnotes

↑ An almost exception is the word clustering of (Clark, 2003), which does have a model of morphology to encourage words ending with the same suffix to appear in the same class, but it still does not capture the relationship between a word and its morphologically derived forms.
↑ The rare word dataset and trained word vectors can be found at http://nlp.stanford.edu/˜lmthang/ morphoNLM
↑ (Collobert et al., 2011) used multiple embeddings, one per discrete feature type, e.g., POS, Gazeteer, etc.
↑ "fortunate", “the", “bank", “was", and “close"
↑ When multiple affixes are present, we use a simple heuristic to first merge suffixes into stems and then combine prefixes. Ideally, we would want to learn and generate an order for such combination, which we leave for future work.
↑ We first aggregate type counts of pairs $(A,\; left )$ and $(B,\; right )$ across all segmentations with two stems. Once done, we label $A$ as $\mathrm{stm}$ and $B$ as $\mathrm{suf}$ if count $(B,\; right ) > 2 \times count (A,\; left )$, and conversely, we label them as $A_{pre}$ $B_{stm}$ if $count (A,\; left ) > 2 \times count(B,\; right )$. Our rationale was that affixes occur more frequently than stems.
↑ Set to 15 and 3 for affixes and stems respectively.
↑ http://ronan.collobert.com/senna/
↑ http://www-nlp.stanford.edu/~ehhuang/
↑ The embedding obtained just before the clustering step to build multi-prototype representation
↑ SCWS^∗ is a modified version of the Stanford's contextual word similarities dataset. The original one utilizes surrounding contexts in judging word similarities and includes pairs of identical words, e.g. financial bank vs. river bank. We exclude these pairs and ignore the provided contexts.
↑ All these counts are with respect to the vocabulary list in the C&W embedding (we obtain similar figures for HSMN).
↑ We restrict to only US-based workers with 95% approval rate and ask for native speakers to rate 20 pairs per hit.
↑ In our later experiments, an aggregated rating is derived for each pair. We first discard ratings not within one standard deviation of the mean, and then estimate a new mean from the remaining ones to use as an aggregated rating.
↑ Results of HSMN-related embeddings are not shown, but similar trends follow.
↑ "unaffect” is a pseudo-word that we inserted.
↑ "disaffected” is ranked 5th for the first query while “affecting” occurs at position 8 for the latter.

References

2011

(Socher et al., 2011) ⇒ Richard Socher, Cliff Chiung-Yu Lin, Andrew Y. Ng and Christopher D. Manning (2011). "Parsing Natural Scenes and Natural Language with Recursive Neural Networks". In: Proceedings of the 28th International Conference on Machine Learning (ICML 2011).

2008

(Collobert & Weston, 2008) ⇒ Ronan Collobert, and Jason Weston. (2008). “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning.” In: Proceedings of the 25th International Conference on Machine learning. ISBN:978-1-60558-205-4 doi:10.1145/1390156.1390177----

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2013 BetterWordRepresentationswithRe	Christopher D. Manning Richard Socher Thang Luong			Better Word Representations with Recursive Neural Networks for Morphology						2013

[ftn-1-1] An almost exception is the word clustering of (Clark, 2003), which does have a model of morphology to encourage words ending with the same suffix to appear in the same class, but it still does not capture the relationship between a word and its morphologically derived forms.

[ftn-2-2] The rare word dataset and trained word vectors can be found at http://nlp.stanford.edu/˜lmthang/ morphoNLM

[ftn-3-3] (Collobert et al., 2011) used multiple embeddings, one per discrete feature type, e.g., POS, Gazeteer, etc.

[ftn-4-4] "fortunate", “the", “bank", “was", and “close"

[ftn-5-5] When multiple affixes are present, we use a simple heuristic to first merge suffixes into stems and then combine prefixes. Ideally, we would want to learn and generate an order for such combination, which we leave for future work.

[ftn-6-6] We first aggregate type counts of pairs $(A,\; left )$ and $(B,\; right )$ across all segmentations with two stems. Once done, we label $A$ as $\mathrm{stm}$ and $B$ as $\mathrm{suf}$ if count $(B,\; right ) > 2 \times count (A,\; left )$, and conversely, we label them as $A_{pre}$ $B_{stm}$ if $count (A,\; left ) > 2 \times count(B,\; right )$. Our rationale was that affixes occur more frequently than stems.

[ftn-7-7] Set to 15 and 3 for affixes and stems respectively.

[ftn-8-8] ttp://ronan.collobert.com/senna/

[ftn-9-9] ttp://www-nlp.stanford.edu/~ehhuang/

[ftn-10-10] The embedding obtained just before the clustering step to build multi-prototype representation

[ftn-11-11] SCWS^∗ is a modified version of the Stanford's contextual word similarities dataset. The original one utilizes surrounding contexts in judging word similarities and includes pairs of identical words, e.g. financial bank vs. river bank. We exclude these pairs and ignore the provided contexts.

[ftn-12-12] All these counts are with respect to the vocabulary list in the C&W embedding (we obtain similar figures for HSMN).

[ftn-13-13] We restrict to only US-based workers with 95% approval rate and ask for native speakers to rate 20 pairs per hit.

[ftn-14-14] In our later experiments, an aggregated rating is derived for each pair. We first discard ratings not within one standard deviation of the mean, and then estimate a new mean from the remaining ones to use as an aggregated rating.

[ftn-15-15] Results of HSMN-related embeddings are not shown, but similar trends follow.

[ftn-16-16] "unaffect” is a pseudo-word that we inserted.

[ftn-17-17] "disaffected” is ranked 5th for the first query while “affecting” occurs at position 8 for the latter.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

2013 BetterWordRepresentationswithRe

Notes

Cited By

2014

Quotes

Abstract

1. Introduction

2. Related Work

3. Morphological RNNs

3.1 Context-insensitive Morphological RNN

3.2 Context-sensitive Morphological RNN

3.3 Learning

4. Unsupervised Morphological Structures

5. Experiments

5.1 Word Similarity Task

5.1.1 Rare Word Dataset

6. Analysis

7. Conclusion

Acknowledgments

Footnotes

References

2011

2008

Navigation menu

2013 BetterWordRepresentationswithRe

Notes

Cited By

2014

Quotes

Abstract

1. Introduction

2. Related Work

3. Morphological RNNs

3.1 Context-insensitive Morphological RNN

3.2 Context-sensitive Morphological RNN

3.3 Learning

4. Unsupervised Morphological Structures

5. Experiments

5.1 Word Similarity Task

5.1.1 Rare Word Dataset

6. Analysis

7. Conclusion

Acknowledgments

Footnotes

References

2011

2008

Navigation menu

Search