2011 SemiSupervisedRecursiveAutoenco

From GM-RKB

Subject Headings: Recursive Autoencoder.

Notes

Cited By

Quotes

Abstract

We introduce a novel machine learning framework based on recursive autoencoders for sentence-level prediction of sentiment label distributions. Our method learns vector space representations for multi-word phrases. In sentiment prediction tasks these representations outperform other state-of-the-art approaches on commonly used datasets, such as movie reviews, without using any pre-defined sentiment lexica or polarity shifting rules. We also evaluate the model's ability to predict sentiment distributions on a new dataset based on confessions from the experience project. The dataset consists of personal user stories annotated with multiple labels which, when aggregated, form a multinomial distribution that captures emotional reactions. Our algorithm can more accurately predict distributions over such labels compared to several competitive baselines.

1. Introduction

The ability to identify sentiments about personal experiences, products, movies etc. is crucial to understand user generated content in social networks, blogs or product reviews. Detecting sentiment in these data is a challenging task which has recently spawned a lot of interest (Pang and Lee, 2008).

Current baseline methods often use bag-of-words representations which cannot properly capture more complex linguistic phenomena in sentiment analysis (Pang et al., 2002). For instance, while the two phrases “white blood cells destroying an infection” and “an infection destroying white blood cells” have the same bag-of-words representation, the former is a positive reaction while the latter is very negative. More advanced methods such as (Nakagawa et al., 2010) that can capture such phenomena use many manually constructed resources (sentiment lexica, parsers, polarity-shifting rules). This limits the applicability of these methods to a broader range of tasks and languages. Lastly, almost all previous work is based on single, positive / negative categories or scales such as star ratings. Examples are movie reviews (Pang and Lee, 2005), opinions (Wiebe et al., 2005), customer reviews (Ding et al., 2008) or multiple aspects of restaurants (Snyder and Barzilay, 2007). Such a one-dimensional scale does not accurately reflect the complexity of human emotions and sentiments.

In this work, we seek to address three issues. (i) Instead of using a bag-of-words representation, our model exploits hierarchical structure and uses compositional semantics to understand sentiment. (ii) Our system can be trained both on unlabeled domain data and on supervised sentiment data and does not require any language-specific sentiment lexica, parsers, etc. (iii) Rather than limiting sentiment to a positive / negative scale, we predict a multidimensional distribution over several complex, interconnected sentiments.

We introduce an approach based on semi-supervised, recursive autoencoders (RAE) which use as input continuous word vectors. Fig. 1 shows an illustration of the model which learns vector representations of phrases and full sentences as well as their hierarchical structure from unsupervised text. We extend our model to also learn a distribution over sentiment labels at each node of the hierarchy. We evaluate our approach on several standard datasets where we achieve state-of-the-art performance. Furthermore, we show results on the recently introduced experience project (EP) dataset (Potts, 2010) that captures a broader spectrum of human sentiments and emotions. The dataset consists of very personal confessions anonymously made by people on the experience project website http://www.experienceproject.com. Confessions are labeled with a set of five reactions by other users. Reaction labels are you rock (expressing approvement), tehee (amusement), I understand, Sorry, hugs and Wow, just wow (displaying shock). For evaluation on this dataset we predict both the label with the most votes as well as the full distribution over the sentiment categories. On both tasks our model outperforms competitive baselines. A set of over 31,000 confessions as well as the code of our model are available at www.socher.org.

Figure 1: Illustration of our recursive autoencoder architecture which learns semantic vector representations of phrases. 

Word indices (orange) are first mapped into a semantic vector space (blue). Then they are recursively merged by the same autoencoder network into a fixed length sentence representation. The vectors at each node are used as features to predict a distribution over sentiment labels.

After describing the model in detail, we evaluate it qualitatively by analyzing the learned n-gram vector representations and compare quantitatively against other methods on standard datasets and the EP dataset.

2. Semi-Supervised Recursive Autoencoders

Our model aims to find vector representations for variable-sized phrases in either unsupervised or semi-supervised training regimes. These representations can then be used for subsequent tasks. We first describe neural word representations and then proceed to review a related recursive model based on autoencoders, introduce our recursive autoencoder (RAE) and describe how it can be modified to jointly learn phrase representations, phrase structure and sentiment distributions.

2.1 Neural Word Representations

We represent words as [[continuous vectors of parameters]]. We explore two settings. In the first setting we simply initialize each word vector x ∈ R^n by sampling it from a zero-mean Gaussian distribution: x ∼ N(0, σ²). These word vectors are then stacked into a word embedding matrix L ∈ R^{n×|V|}, where |V| is the size of the vocabulary. This initialization works well in supervised settings where a network can subsequently modify these vectors to capture certain label distributions. In the second setting, we pre-train the word vectors with an unsupervised neural language model (Bengio et al., 2003; Collobert and Weston, 2008). These models jointly learn an embedding of words into a vector space and use these vectors to predict how likely a word occurs given its context. After learning via gradient ascent, the word vectors capture syntactic and semantic information from their co-occurrence statistics. In both cases we can use the resulting matrix of word vectors L for subsequent tasks as follows. Assume we are given a sentence as an ordered list of m words. Each word has an associated vocabulary index k into the embedding matrix which we use to retrieve the word’s vector representation. Mathematically, this look-up operation can be seen as a simple projection layer where we use a binary vector b_k which is zero in all positions except at the k-th index:

x_i = L b_k ∈ R^n.   (1)

In the remainder of this paper, we represent a sentence (or any n-gram) as an ordered list of these vectors (x1, ..., xm). This word representation is better suited to autoencoders than the binary number representations used in previous related autoencoder models such as the recursive autoassociative memory (RAAM) model (Pollack, 1990; Voegtlin and Dominey, 2005) or recurrent neural networks (Elman, 1991) since sigmoid units are inherently continuous. Pollack circumvented this problem by having vocabularies with only a handful of words and by manually defining a threshold to binarize the resulting vectors.
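As a concrete illustration of the look-up in Eq. 1, the following sketch (not from the paper; the function name embedding_lookup, the dimensions and the index values are invented for illustration) initializes a random embedding matrix L as in the first setting and retrieves word vectors by column index:

    import numpy as np

    n, vocab_size = 50, 10000        # embedding dimension n and vocabulary size |V| (illustrative values)
    sigma = 0.1                      # standard deviation of the zero-mean Gaussian initialization

    # Embedding matrix L with one n-dimensional column per vocabulary word.
    L = np.random.normal(0.0, sigma, size=(n, vocab_size))

    def embedding_lookup(L, word_indices):
        """Map a sentence, given as vocabulary indices k, to its word vectors x_i = L b_k (Eq. 1)."""
        # Indexing the columns of L is equivalent to multiplying L with the binary selector vector b_k.
        return [L[:, k] for k in word_indices]

    sentence_indices = [42, 7, 1999]                     # hypothetical indices of a three-word sentence
    x1, x2, x3 = embedding_lookup(L, sentence_indices)   # three vectors in R^n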

x1  x2  x3  x4
y1 = f(W^(1)[x3; x4] + b)
y2 = f(W^(1)[x2; y1] + b)
y3 = f(W^(1)[x1; y2] + b)
Figure 2: Illustration of an application of a recursive autoencoder to a binary tree. 

The nodes which are not filled are only used to compute reconstruction errors. A standard autoencoder (in box) is re-used at each node of the tree.

2.2 Traditional Recursive Autoencoders

The goal of autoencoders is to learn a representation of their inputs. In this section we describe how to obtain a reduced dimensional vector representation for sentences.

In the past, autoencoders have only been used in settings where the tree structure was given a priori. We review this setting before continuing with our model which does not require a given tree structure. Fig. 2 shows an instance of a recursive autoencoder (RAE) applied to a given tree. Assume we are given a list of word vectors x = (x1, ..., xm) as described in the previous section, as well as a binary tree structure for this input in the form of branching triplets of parents with children: (p → c1 c2). Each child can be either an input word vector x_i or a nonterminal node in the tree. For the example in Fig. 2, we have the following triplets: ((y1 → x3 x4), (y2 → x2 y1), (y3 → x1 y2)). In order to be able to apply the same neural network to each pair of children, the hidden representations y_i have to have the same dimensionality as the x_i’s.

Given this tree structure, we can now compute the parent representations. The first parent vector y1 is computed from the children (c1, c2) = (x3, x4):

p = f(W^(1)[c1; c2] + b^(1)),   (2)

where we multiplied a matrix of parameters W^(1) ∈ R^{n×2n} by the concatenation of the two children. After adding a bias term, we applied an element-wise activation function such as tanh to the resulting vector. One way of assessing how well this n-dimensional vector represents its children is to try to reconstruct the children in a reconstruction layer:

[c'_1; c'_2] = W^(2) p + b^(2).   (3)

During training, the goal is to minimize the reconstruction errors of this input pair. For each pair, we compute the Euclidean distance between the original input and its reconstruction:

E_rec([c1; c2]) = ||[c1; c2] − [c'_1; c'_2]||².   (4)

This model of a standard autoencoder is boxed in Fig. 2. Now that we have defined how an autoencoder can be used to compute an n-dimensional vector representation (p) of two n-dimensional children (c1, c2), we can describe how such a network can be used for the rest of the tree.

Essentially, the same steps repeat. Now that y1 is given, we can use Eq. 2 to compute y2 by setting the children to be (c1, c2) = (x2, y1). Again, after computing the intermediate parent vector y2, we can assess how well this vector captures the content of the children by computing the reconstruction error as in Eq. 4. The process repeats until the full tree is constructed and we have a reconstruction error at each nonterminal node. This model is similar to the RAAM model (Pollack, 1990) which also requires a fixed tree structure.
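The single autoencoder step of Eqs. 2–4 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' released code; the parameter shapes follow the text, tanh is used as the activation, and the parameter initialization is an arbitrary choice made here:

    import numpy as np

    n = 50                                       # dimensionality of word and phrase vectors
    rng = np.random.default_rng(0)

    # Parameters of the standard autoencoder that is re-used at every node (boxed in Fig. 2).
    W1 = rng.normal(0.0, 0.01, size=(n, 2 * n))  # W^(1) in R^{n x 2n}
    b1 = np.zeros(n)                             # b^(1)
    W2 = rng.normal(0.0, 0.01, size=(2 * n, n))  # W^(2), reconstruction layer
    b2 = np.zeros(2 * n)                         # b^(2)

    def encode(c1, c2):
        """Eq. 2: parent p = f(W^(1)[c1; c2] + b^(1)) with f = tanh."""
        return np.tanh(W1 @ np.concatenate([c1, c2]) + b1)

    def reconstruct(p):
        """Eq. 3: [c1'; c2'] = W^(2) p + b^(2)."""
        c = W2 @ p + b2
        return c[:n], c[n:]

    def reconstruction_error(c1, c2):
        """Eq. 4: squared Euclidean distance between the children and their reconstruction."""
        p = encode(c1, c2)
        c1r, c2r = reconstruct(p)
        err = np.sum((np.concatenate([c1, c2]) - np.concatenate([c1r, c2r])) ** 2)
        return err, p

    # Example: merge two random child vectors, analogous to computing y1 from (x3, x4) in Fig. 2.
    x3, x4 = rng.normal(size=n), rng.normal(size=n)
    err, y1 = reconstruction_error(x3, x4)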

2.3 Unsupervised Recursive Autoencoder for Structure Prediction

Now, assume there is no tree structure given for the input vectors in x. The goal of our structure-prediction RAE is to minimize the reconstruction error of all vector pairs of children in a tree. We define A(x) as the set of all possible trees that can be built from an input sentence x. Further, let T(y) be a function that returns the triplets of a tree y, indexed by s over all the non-terminal nodes of that tree. Using the reconstruction error of Eq. 4, we compute

RAE_θ(x) = argmin_{y ∈ A(x)} Σ_{s ∈ T(y)} E_rec([c1; c2]_s).   (5)

We now describe a greedy approximation that constructs such a tree.

Greedy Unsupervised RAE.

For a sentence with m words, we apply the autoencoder recursively. It takes the first pair of neighboring vectors, defines them as potential children of a phrase (c1; c2) = (x1; x2), concatenates them and gives them as input to the autoencoder. For each word pair, we save the potential parent node p and the resulting reconstruction error.

After computing the score for the first pair, the network is shifted by one position and takes as input vectors (c1, c2) = (x2, x3) and again computes a potential parent node and a score. This process repeats until it hits the last pair of words in the sentence: (c1, c2) = (x_{m-1}, x_m). Next, it selects the pair which had the lowest reconstruction error (E_rec) and its parent representation p will represent this phrase and replace both children in the sentence word list. For instance, consider the sequence (x1, x2, x3, x4) and assume the lowest E_rec was obtained by the pair (x3, x4). After the first pass, the new sequence then consists of (x1, x2, p(3,4)). The process repeats and treats the new vector p(3,4) like any other input vector. For instance, subsequent states could be either: (x1, p(2,(3,4))) or (p(1,2), p(3,4)). Both states would then finish with a deterministic choice of collapsing the remaining two states into one parent to obtain (p(1,(2,(3,4)))) or (p((1,2),(3,4))) respectively. The tree is then recovered by unfolding the collapsing decisions. The resulting tree structure captures as much of the single-word information as possible (in order to allow reconstructing the word vectors) but does not necessarily follow standard syntactic constraints. We also experimented with a method that finds better solutions to Eq. 5 based on CKY-like beam search algorithms (Socher et al., 2010; Socher et al., 2011) but the performance is similar and the greedy version is much faster.
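A minimal sketch of this greedy procedure follows. It is an illustrative re-implementation, not the released code; it reuses the hypothetical encode / reconstruction_error helpers and the rng and n setup from the previous sketch, and it ignores the weighting and normalization discussed next:

    def greedy_rae_tree(word_vectors):
        """Greedily merge the adjacent pair with the lowest reconstruction error
        until one phrase vector remains; return that vector and the tree of merges."""
        vectors = list(word_vectors)
        nodes = list(range(len(word_vectors)))        # each leaf starts as its own subtree
        while len(vectors) > 1:
            # Score every adjacent pair with the (unweighted) autoencoder error.
            scores = [reconstruction_error(vectors[i], vectors[i + 1])
                      for i in range(len(vectors) - 1)]
            i = min(range(len(scores)), key=lambda j: scores[j][0])
            _, parent = scores[i]
            # Replace the two children by their parent representation and record the merge.
            vectors[i:i + 2] = [parent]
            nodes[i:i + 2] = [(nodes[i], nodes[i + 1])]
        return vectors[0], nodes[0]

    # Example: build a tree over four random word vectors.
    words = [rng.normal(size=n) for _ in range(4)]
    sentence_vector, tree = greedy_rae_tree(words)    # tree is, e.g., ((0, 1), (2, 3))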

Weighted Reconstruction.

One problem with simply using the reconstruction error of both children equally as described in Eq. 4 is that each child could represent a different number of previously collapsed words and hence be of greater importance for the overall meaning reconstruction of the sentence. For instance, in the case of (x1, p(2,(3,4))) one would like to give more importance to reconstructing p than x1. We capture this desideratum by adjusting the reconstruction error. Let n1, n2 be the number of words underneath a current potential child; we re-define the reconstruction error to be

E_rec([c1; c2]; θ) = (n1 / (n1 + n2)) ||c1 − c'_1||² + (n2 / (n1 + n2)) ||c2 − c'_2||².   (6)
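Under these assumptions, the re-weighted error might be sketched as follows (again illustrative, reusing the hypothetical encode / reconstruct helpers above; n1 and n2 count the words under each child):

    def weighted_reconstruction_error(c1, c2, n1, n2):
        """Reconstruction error with each child weighted by the fraction of words it covers,
        so that larger subtrees contribute more, as in the weighted error above."""
        p = encode(c1, c2)
        c1r, c2r = reconstruct(p)
        e1 = np.sum((c1 - c1r) ** 2)
        e2 = np.sum((c2 - c2r) ** 2)
        return (n1 / (n1 + n2)) * e1 + (n2 / (n1 + n2)) * e2, p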

Length Normalization.

One of the goals of RAEs is to induce semantic vector representations that allow us to compare n-grams of different lengths. The RAE tries to lower the reconstruction error not only of the bigrams but also of nodes higher in the tree. Unfortunately, since the RAE computes the hidden representations that it then tries to reconstruct, it can simply lower the reconstruction error by making the hidden layer very small in magnitude. To prevent such undesirable behavior, we modify the hidden layer such that the resulting parent representation always has length one: after computing p as in Eq. 2, we simply set p = p / ||p||.
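The normalization itself is a one-line change to the encoder sketch above (the small eps guarding against division by zero is an addition here, not something the excerpt discusses):

    def encode_normalized(c1, c2, eps=1e-8):
        """Eq. 2 followed by length normalization p <- p / ||p||."""
        p = np.tanh(W1 @ np.concatenate([c1, c2]) + b1)
        return p / (np.linalg.norm(p) + eps)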

2.4 Semi-Supervised Recursive Autoencoders

So far, the RAE was completely unsupervised and induced general representations that capture the semantics of multi-word phrases. In this section, we extend RAEs to a semi-supervised setting in order to predict a sentence- or phrase-level target distribution t.

References

Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning (2011). “Semi-supervised Recursive Autoencoders for Predicting Sentiment Distributions.” In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011).