2012 ModelingSentencesintheLatentSpace


Subject Headings:

Notes

Cited By

Quotes

Abstract

Sentence Similarity is the process of computing a similarity score between two sentences. Previous sentence similarity work finds that latent semantics approaches to the problem do not perform well due to insufficient information in single sentences. In this paper, we show that by carefully handling words that are not in the sentences (missing words), we can train a reliable latent variable model on sentences. In the process, we propose a new evaluation framework for sentence similarity: Concept Definition Retrieval. The new framework allows for large scale tuning and testing of Sentence Similarity models. Experiments on the new task and previous data sets show significant improvement of our model over baselines and other traditional latent variable models. Our results indicate comparable and even better performance than current state of the art systems addressing the problem of sentence similarity.

1. Introduction

Identifying the degree of semantic similarity (SS) between two sentences is at the core of many NLP applications that focus on sentence level semantics such as Machine Translation (Kauchak and Barzilay, 2006), Summarization (Zhou et al., 2006), Text Coherence Detection (Lapata and Barzilay, 2005), etc. To date, almost all Sentence Similarity (SS) approaches work in the high-dimensional word space and rely mainly on word similarity. There are two main (not unrelated) disadvantages to word similarity based approaches: 1. lexical ambiguity, as the pairwise word similarity ignores the semantic interaction between the word and its sentential context; 2. word co-occurrence information is not sufficiently exploited.

Latent variable models, such as Latent Semantic Analysis (LSA) (Landauer et al., 1998), Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999), Latent Dirichlet Allocation (LDA) (Blei et al., 2003) can solve the two issues naturally by modeling the semantics of words and sentences simultaneously in the low-dimensional latent space. However, attempts at addressing SS using LSA perform significantly below high dimensional word similarity based models (Mihalcea et al., 2006; O’Shea et al., 2008).

We believe that the latent semantics approaches applied to date to the SS problem have not yielded positive results due to the deficient modeling of the sparsity in the semantic space. SS operates in a very limited contextual setting where the sentences are typically too short to derive robust latent semantics. Apart from the SS setting, robust modeling of the latent semantics of short sentences / texts is becoming a pressing need due to the pervasive presence of bursty data sets such as Twitter feeds and SMS, where short contexts are an inherent characteristic of the data.

In this paper, we propose to model the missing words (words that are not in the sentence), a feature that is typically overlooked in the text modeling literature, to address the sparseness issue for the SS task. We define the missing words of a sentence as the whole vocabulary of a corpus minus the observed words in the sentence. Our intuition is that, since the observed words in a sentence are too few to tell us what the sentence is about, the missing words can be used to tell us what the sentence is not about. We assume that the semantics of the observed and missing words together make up the complete semantic profile of a sentence.
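To make the definition concrete (this sketch is illustrative and not the authors' code), the missing-word set of a sentence is simply the corpus vocabulary minus its observed words:

```python
# Illustrative sketch: observed vs. missing words for a sentence.
corpus = [
    "a financial institution that accepts deposits and channels the money into lending activities",
    "sloping land beside a body of water",
]
tokenized = [sent.split() for sent in corpus]

# The vocabulary is the set of all words observed anywhere in the corpus.
vocabulary = {w for sent in tokenized for w in sent}

def missing_words(sentence_tokens, vocabulary):
    """Missing words = whole corpus vocabulary minus the words observed in the sentence."""
    return vocabulary - set(sentence_tokens)

observed = set(tokenized[0])
missing = missing_words(tokenized[0], vocabulary)
print(len(observed), "observed words;", len(missing), "missing words")
```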

After analyzing the way traditional latent variable models (LSA, PLSA / LDA) handle missing words, we decide to model sentences using a weighted matrix factorization approach (Srebro and Jaakkola, 2003), which allows us to treat observed words and missing words differently. We handle missing words using a weighting scheme that distinguishes missing words from observed words yielding robust latent vectors for sentences.

Since we use a feature that is already implied by the text itself, our approach is very general (similar to LSA / LDA) in that it can be applied to any format of short texts. In contrast, existing work on modeling short texts focuses on exploiting additional data, e.g., Ramage et al. (2010) model tweets using their metadata (author, hashtag, etc.).

Moreover, in this paper, we introduce a new evaluation framework for SS: Concept Definition Retrieval (CDR). Compared to existing data sets, the CDR data set allows for large scale tuning and testing of SS modules without further human annotation.

2. Limitations of Topic Models and LSA for Modeling Sentences

Usually latent variable models aim to find a latent semantic profile for a sentence that is most relevant to the observed words. By explicitly modeling missing words, we set another criterion on the latent semantic profile: it should not be related to the missing words in the sentence. However, missing words are not as informative as observed words; hence, a model that captures missing words at the right level of emphasis / impact is central to completing the semantic picture for a sentence.

LSA and PLSA / LDA work on a word-sentence co-occurrence matrix. Given a corpus, the rows of the matrix correspond to the M unique words in the corpus, and the N columns to the sentence ids. The resulting M × N co-occurrence matrix X holds a TF-IDF value in each cell [math]\displaystyle{ X_{ij} }[/math], namely the TF-IDF value of word [math]\displaystyle{ w_i }[/math] in sentence [math]\displaystyle{ s_j }[/math]. For ease of exposition, we illustrate the problem using a special case of the SS framework where the sentences are concept definitions in a dictionary such as WordNet (Fellbaum, 1998) (WN). The sentence corresponding to the concept definition of bank#n#1 is then a sparse vector in X whose observed words (cells with [math]\displaystyle{ X_{ij} \neq 0 }[/math]) are: the 0.1, financial 5.5, institution 4, that 0.2, accept 2.1, deposit 3, and 0.1, channel 6, the 0.1, money 5, into 0.3, lend 3.5, activity 3.

All the other words (girl, car, ..., check, loan, business, ...) in matrix X that do not occur in the concept definition are considered missing words for the concept entry bank#n#1; hence their [math]\displaystyle{ X_{ij} = 0 }[/math].
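For concreteness (this is an illustration rather than the paper's pipeline), such a word-by-sentence TF-IDF matrix can be assembled with standard tooling; the choice of scikit-learn's TfidfVectorizer below is an assumption, not something the paper specifies:

```python
# Illustrative sketch: building an M x N word-by-sentence TF-IDF matrix X.
from sklearn.feature_extraction.text import TfidfVectorizer

definitions = [
    "a financial institution that accepts deposits and channels the money into lending activities",
    "sloping land beside a body of water",
    "a container for keeping money at home",
]

vectorizer = TfidfVectorizer()
# fit_transform yields an N x M sentence-by-word matrix; transpose it to the M x N layout used here.
X = vectorizer.fit_transform(definitions).T
vocab = vectorizer.get_feature_names_out()

print(X.shape)  # (M words, N sentences); cells equal to 0 correspond to missing words
```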

Topic models (PLSA / LDA) do not explicitly model missing words. PLSA assumes each document has a distribution over K topics, [math]\displaystyle{ P(z_k|d_j) }[/math], and each topic has a distribution over the whole vocabulary, [math]\displaystyle{ P(w_i|z_k) }[/math]. Therefore, PLSA finds a topic distribution for each concept definition that maximizes the log likelihood of the corpus X (LDA has a similar form): …
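The equation itself did not survive extraction in this excerpt; for reference, the standard PLSA log-likelihood over the word-sentence matrix has the form below (notation assumed to match the excerpt; the paper's exact equation may differ):

[math]\displaystyle{ \sum_i \sum_j X_{ij} \log \sum_k P(w_i|z_k) P(z_k|d_j) }[/math]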

3. The Proposed Approach

3.1 Weighted Matrix Factorization

The weighted matrix factorization (WMF) approach is very similar to SVD, except that it allows for direct control over each matrix cell [math]\displaystyle{ X_{ij} }[/math]. The model factorizes the original matrix X into two matrices such that [math]\displaystyle{ X \approx P^\text{T} Q }[/math], where P is a K × M matrix and Q is a K × N matrix (Figure 1). The model parameters (the vectors in P and Q) are optimized by minimizing the objective function:

[math]\displaystyle{ \sum_i \sum_j W_{ij} \left( P_{\cdot,i} \cdot Q_{\cdot,j} - X_{ij} \right)^2 + \lambda \lVert P \rVert^2_2 + \lambda \lVert Q \rVert^2_2 \qquad (3) }[/math]

where [math]\displaystyle{ \lambda }[/math] is a free regularization factor, and the weight matrix W defines a weight for each cell in X.
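The concrete weighting scheme is not reproduced in this excerpt; a natural instantiation of "distinguishes missing words from observed words" (and the one assumed in the sketch after the update equations below) is

[math]\displaystyle{ W_{ij} = \begin{cases} 1, & X_{ij} \neq 0 \\ w_m, & X_{ij} = 0 \end{cases} }[/math]

with a small constant [math]\displaystyle{ w_m \ll 1 }[/math] that down-weights the (far more numerous) missing-word cells.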

Accordingly, [math]\displaystyle{ P_{\cdot,i} }[/math] is a K-dimensional latent semantic vector profile for word [math]\displaystyle{ w_i }[/math]; similarly, [math]\displaystyle{ Q_{\cdot,j} }[/math] is the K-dimensional vector profile that represents the sentence [math]\displaystyle{ s_j }[/math]. Operations on these K-dimensional vectors have very intuitive semantic meanings:

  1. the inner product of [math]\displaystyle{ P_{\cdot,i} }[/math] and [math]\displaystyle{ Q_{\cdot,j} }[/math] is used to approximate the semantic relatedness of word [math]\displaystyle{ w_i }[/math] and sentence [math]\displaystyle{ s_j }[/math]: [math]\displaystyle{ P_{\cdot,i} \cdot Q_{\cdot,j} \approx X_{ij} }[/math], as in the shaded parts of Figure 1;
  2. equation 3 explicitly requires that a sentence should not be related to its missing words, by forcing [math]\displaystyle{ P_{\cdot,i} \cdot Q_{\cdot,j} = 0 }[/math] for missing words ([math]\displaystyle{ X_{ij} = 0 }[/math]);
  3. the similarity of two sentences [math]\displaystyle{ s_j }[/math] and [math]\displaystyle{ s_{j'} }[/math] can be computed as the cosine similarity between [math]\displaystyle{ Q_{\cdot,j} }[/math] and [math]\displaystyle{ Q_{\cdot,j'} }[/math].

The latent vectors in P and Q are first randomly initialized and then computed iteratively by the following equations (the derivation is omitted due to limited space; it can be found in (Srebro and Jaakkola, 2003)):
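The update equations themselves did not survive extraction in this excerpt. For the weighted objective in (3), the standard alternating-least-squares updates (following Srebro and Jaakkola, 2003; the paper's exact notation may differ) are

[math]\displaystyle{ P_{\cdot,i} = \left( Q \tilde{W}^{(i)} Q^\text{T} + \lambda I \right)^{-1} Q \tilde{W}^{(i)} X_{i,\cdot}^\text{T} }[/math]

[math]\displaystyle{ Q_{\cdot,j} = \left( P \hat{W}^{(j)} P^\text{T} + \lambda I \right)^{-1} P \hat{W}^{(j)} X_{\cdot,j} }[/math]

where [math]\displaystyle{ \tilde{W}^{(i)} = \mathrm{diag}(W_{i,\cdot}) }[/math] and [math]\displaystyle{ \hat{W}^{(j)} = \mathrm{diag}(W_{\cdot,j}) }[/math]. A minimal numpy sketch of the full procedure follows; the hyperparameter defaults (K, [math]\displaystyle{ \lambda }[/math], [math]\displaystyle{ w_m }[/math], iteration count) are placeholders, not values taken from this excerpt:

```python
# Illustrative numpy sketch of WMF trained with alternating least squares.
# Hyperparameter defaults are placeholders, not the paper's reported settings.
import numpy as np

def wmf(X, K=100, lam=20.0, w_m=0.01, n_iter=20, seed=0):
    """Factorize X (M words x N sentences) as P^T Q under per-cell weights W."""
    M, N = X.shape
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.01, size=(K, M))   # word latent vectors (columns)
    Q = rng.normal(scale=0.01, size=(K, N))   # sentence latent vectors (columns)
    W = np.where(X != 0, 1.0, w_m)            # observed cells get weight 1, missing cells w_m
    I = lam * np.eye(K)
    for _ in range(n_iter):
        for i in range(M):                    # update each word vector P[:, i]
            Wi = W[i, :]
            A = (Q * Wi) @ Q.T + I            # Q diag(W_i,.) Q^T + lambda I
            b = (Q * Wi) @ X[i, :]
            P[:, i] = np.linalg.solve(A, b)
        for j in range(N):                    # update each sentence vector Q[:, j]
            Wj = W[:, j]
            A = (P * Wj) @ P.T + I            # P diag(W_.,j) P^T + lambda I
            b = (P * Wj) @ X[:, j]
            Q[:, j] = np.linalg.solve(A, b)
    return P, Q

def sentence_similarity(Q, j, j2):
    """Cosine similarity between the latent vectors of sentences j and j2 (item 3 above)."""
    a, b = Q[:, j], Q[:, j2]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```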



References

Weiwei Guo, and Mona Diab. (2012). "Modeling Sentences in the Latent Space."