2007 Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis


Subject Headings: Document Semantic Similarity Measure, Wikipedia-based Corpus, Document Relatedness Measure Training Algorithm.



[[Computing semantic relatedness of natural language texts]] requires access to vast amounts of common-sense and domain-specific world knowledge. We propose Explicit Semantic Analysis (ESA), a novel method that represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia. We use machine learning techniques to explicitly represent the meaning of any text as a weighted vector of Wikipedia-based concepts. Assessing the relatedness of texts in this space amounts to comparing the corresponding vectors using conventional metrics (e.g., cosine). Compared with the previous state of the art, using ESA results in substantial improvements in correlation of computed relatedness scores with human judgments: from r=0.56 to 0.75 for individual words and from r=0.60 to 0.72 for texts. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.

1. Introduction

How related are “cat” and “mouse”? And what about “preparing a manuscript” and “writing an article”? Reasoning about semantic relatedness of natural language utterances is routinely performed by humans but remains an insurmountable obstacle for computers. Humans do not judge text relatedness merely at the level of text words. Words trigger reasoning at a much deeper level that manipulates concepts — the basic units of meaning that humans use to organize and share their knowledge. Thus, humans interpret the specific wording of a document in the much larger context of their background knowledge and experience.

It has long been recognized that in order to process natural language, computers require access to vast amounts of common-sense and domain-specific world knowledge [Buchanan and Feigenbaum, 1982; Lenat and Guha, 1990]. However, prior work on semantic relatedness was based on purely statistical techniques that did not make use of background knowledge [Baeza-Yates and Ribeiro-Neto, 1999; Deerwester et al., 1990], or on lexical resources that incorporate very limited knowledge about the world [Budanitsky and Hirst, 2006; Jarmasz, 2003].

We propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic representation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of natural concepts derived from Wikipedia (http://en.wikipedia.org), the largest encyclopedia in existence. We employ text classification techniques that allow us to explicitly represent the meaning of any text in terms of Wikipedia-based concepts. We evaluate the effectiveness of our method on automatically computing the degree of semantic relatedness between fragments of natural language text.
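The representation described above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the paper's implementation: the three-article toy "Wikipedia" and the plain TF-IDF weighting are invented for the example, whereas the actual method builds its inverted index over the full encyclopedia with additional pruning and weighting refinements.

```python
# Minimal sketch of Explicit Semantic Analysis (ESA).
# Assumption: a tiny hand-built concept space stands in for Wikipedia.
import math
from collections import defaultdict

# Toy "Wikipedia": each concept (article) is a bag of words.
concepts = {
    "Cat":      "cat feline pet whiskers prey".split(),
    "Mouse":    "mouse rodent cheese cat prey".split(),
    "Computer": "computer keyboard mouse screen software".split(),
}

# Build term-frequency and document-frequency tables for TF-IDF weighting.
df = defaultdict(int)
tf = {c: defaultdict(int) for c in concepts}
for c, words in concepts.items():
    for w in words:
        tf[c][w] += 1
    for w in set(words):
        df[w] += 1
N = len(concepts)

def word_vector(word):
    """Weighted vector of Wikipedia-based concepts for a single word."""
    if word not in df:
        return {}
    idf = math.log(N / df[word])
    return {c: tf[c][word] * idf for c in concepts if tf[c][word]}

def text_vector(text):
    """ESA representation of a text: sum of its words' concept vectors."""
    vec = defaultdict(float)
    for w in text.lower().split():
        for c, weight in word_vector(w).items():
            vec[c] += weight
    return vec

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def relatedness(t1, t2):
    """Relatedness of two texts = cosine of their concept vectors."""
    return cosine(text_vector(t1), text_vector(t2))

print(relatedness("cat", "mouse"))     # nonzero: shared concepts
print(relatedness("cat", "keyboard"))  # zero in this toy space
```

Note how the same `relatedness` function handles both individual words and longer text fragments, since a text's vector is just the sum of its words' vectors.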

The contributions of this paper are threefold. First, we present Explicit Semantic Analysis, a new approach to representing semantics of natural language texts using natural concepts. Second, we propose a uniform way for computing relatedness of both individual words and arbitrarily long text fragments. Finally, the results of using ESA for computing semantic relatedness of texts are superior to the existing state of the art. Moreover, using Wikipedia-based concepts makes our model easy to interpret, as we illustrate with a number of examples in what follows.


3.1 Datasets and Evaluation Procedure

Humans have an innate ability to judge semantic relatedness of texts. Human judgements on a reference set of text pairs can thus be considered correct by definition, a kind of “gold standard” against which computer algorithms are evaluated. Several studies measured inter-judge correlations and found them to be consistently high (Budanitsky and Hirst, 2006; Jarmasz, 2003; Finkelstein et al., 2002), r = 0.88 − 0.95. These findings are to be expected — after all, it is this consensus that allows people to understand each other.

In this work, we use two such datasets, which are to the best of our knowledge the largest publicly available collections of their kind. To assess word relatedness, we use the WordSimilarity-353 collection (Finkelstein et al., 2002), which contains 353 word pairs. Each pair has 13–16 human judgements, which were averaged for each pair to produce a single relatedness score. The Spearman rank-order correlation coefficient was used to compare computed relatedness scores with human judgements.
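The evaluation step above can be sketched as follows. Spearman correlation is Pearson correlation applied to the ranks of the scores; with no ties it reduces to the closed form 1 − 6Σd²/(n(n²−1)). The five score pairs below are hypothetical, invented for the example.

```python
# Spearman rank-order correlation between computed relatedness scores
# and averaged human judgements (simple closed form; assumes no ties).
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(xs, ys):
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical averaged human scores and computed scores for five pairs:
human    = [9.1, 7.4, 6.0, 3.2, 1.1]
computed = [0.83, 0.70, 0.55, 0.40, 0.05]
print(spearman(human, computed))  # 1.0 — the rankings agree exactly
```

Because only ranks matter, the computed scores need not be on the same scale as the human judgements, which is why rank correlation suits word-level evaluation.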

For document similarity, we used a collection of 50 documents from the Australian Broadcasting Corporation’s news mail service (Lee et al., 2005). These documents were paired in all possible ways, and each of the 1,225 pairs has 8–12 human judgements. After averaging the human judgements for each pair, the resulting collection of 1,225 relatedness scores has only 67 distinct values. With so many tied scores, Spearman correlation is not appropriate, and we therefore used Pearson’s linear correlation coefficient.
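Pearson's coefficient, in contrast, compares the score values themselves rather than their ranks, so heavily tied data poses no problem. A minimal sketch, with toy scores invented for the example:

```python
# Pearson linear correlation coefficient between two score lists.
# Unlike Spearman, it uses raw values, so ties are unproblematic.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)  # undefined if either list is constant

# Hypothetical averaged human scores with many ties, as in the Lee dataset:
human    = [1.0, 1.0, 2.0, 2.0, 4.0, 4.0]
computed = [0.15, 0.20, 0.30, 0.35, 0.80, 0.85]
print(pearson(human, computed))
```

A perfectly linear relationship yields 1.0; the tied human scores above merely flatten the fit rather than breaking the statistic.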




Evgeniy Gabrilovich and Shaul Markovitch. “Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis.” 2007.