2010 FromFrequencytoMeaningVectorSpa

From GM-RKB

Subject Headings: Word Vector Space Models.

Notes

Cited By

Quotes

Abstract

Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.


1. Introduction

One of the biggest obstacles to making full use of the power of computers is that they currently understand very little of the meaning of human language. Recent progress in search engine technology is only scratching the surface of human language, and yet the impact on society and the economy is already immense. This hints at the transformative impact that deeper semantic technologies will have. Vector space models (VSMs), surveyed in this paper, are likely to be a part of these new semantic technologies.

In this paper, we use the term semantics in a general sense, as the meaning of a word, a phrase, a sentence, or any text in human language, and the study of such meaning. We are not concerned with narrower senses of semantics, such as the semantic web or approaches to semantics based on formal logic. We present a survey of VSMs and their relation with the distributional hypothesis as an approach to representing some aspects of natural language semantics.

The VSM was developed for the SMART information retrieval system (Salton, 1971) by Gerard Salton and his colleagues (Salton, Wong, & Yang, 1975). SMART pioneered many of the concepts that are used in modern search engines (Manning, Raghavan, & Schutze, 2008). The idea of the VSM is to represent each document in a collection as a point in a space (a vector in a vector space). Points that are close together in this space are semantically similar and points that are far apart are semantically distant. The user’s query is represented as a point in the same space as the documents (the query is a pseudodocument). The documents are sorted in order of increasing distance (decreasing semantic similarity) from the query and then presented to the user.
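To make this retrieval idea concrete, the following is a minimal sketch (not from the paper; the toy corpus, query, and helper functions are illustrative assumptions): each document and the query are represented as vectors of raw term frequencies, and documents are ranked by cosine similarity to the query pseudodocument.

    import math
    from collections import Counter

    documents = [
        "the cat sat on the mat",
        "dogs and cats make good pets",
        "stock markets fell sharply today",
    ]
    query = "cat pets"

    def term_counts(text):
        # Represent a text as a sparse vector of raw term frequencies.
        return Counter(text.split())

    def cosine(u, v):
        # Cosine similarity between two sparse count vectors.
        dot = sum(u[t] * v[t] for t in set(u) & set(v))
        norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    # The query is a pseudodocument in the same space; rank documents by
    # decreasing similarity (equivalently, increasing semantic distance).
    q = term_counts(query)
    for doc in sorted(documents, key=lambda d: cosine(term_counts(d), q), reverse=True):
        print(round(cosine(term_counts(doc), q), 3), doc)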

The success of the VSM for information retrieval has inspired researchers to extend the VSM to other semantic tasks in natural language processing, with impressive results. For instance, Rapp (2003) used a vector-based representation of word meaning to achieve a score of 92.5% on multiple-choice synonym questions from the Test of English as a Foreign Language (TOEFL), whereas the average human score was 64.5%.[1] Turney (2006) used a vector-based representation of semantic relations to attain a score of 56% on multiple-choice analogy questions from the SAT college entrance test, compared to an average human score of 57%. [2]

In this survey, we have organized past work with VSMs according to the type of matrix involved: term–document, word–context, and pair–pattern. We believe that the choice of a particular matrix type is more fundamental than other choices, such as the particular linguistic processing or mathematical processing. Although these three matrix types cover most of the work, there is no reason to believe that these three types exhaust the possibilities. We expect future work will introduce new types of matrices and higher-order tensors.[3]

1.1 Motivation for Vector Space Models of Semantics

VSMs have several attractive properties. VSMs extract knowledge automatically from a given corpus, thus they require much less labour than other approaches to semantics, such as hand-coded knowledge bases and ontologies. For example, the main resource used in Rapp’s (2003) VSM system for measuring word similarity is the British National Corpus (BNC),[4] whereas the main resource used in non-VSM systems for measuring word similarity (Hirst & St-Onge, 1998; Leacock & Chodorow, 1998; Jarmasz & Szpakowicz, 2003) is a lexicon, such as WordNet[5] or Roget’s Thesaurus. Gathering a corpus for a new language is generally much easier than building a lexicon, and building a lexicon often involves also gathering a corpus, such as SemCor for WordNet (Miller, Leacock, Tengi, & Bunker, 1993).

VSMs perform well on tasks that involve measuring the similarity of meaning between words, phrases, and documents. Most search engines use VSMs to measure the similarity between a query and a document (Manning et al., 2008). The leading algorithms for measuring semantic relatedness use VSMs (Pantel & Lin, 2002a; Rapp, 2003; Turney, Littman, Bigham, & Shnayder, 2003). The leading algorithms for measuring the similarity of semantic relations also use VSMs (Lin & Pantel, 2001; Turney, 2006; Nakov & Hearst, 2008). (Section 2.4 discusses the differences between these types of similarity.)

We find VSMs especially interesting due to their relation with the distributional hypothesis and related hypotheses (see Section 2.7). The distributional hypothesis is that words that occur in similar contexts tend to have similar meanings (Wittgenstein, 1953; Harris, 1954; Weaver, 1955; Firth, 1957; Deerwester, Dumais, Landauer, Furnas, & Harshman, 1990). Efforts to apply this abstract hypothesis to concrete algorithms for measuring the similarity of meaning often lead to vectors, matrices, and higher-order tensors. This intimate connection between the distributional hypothesis and VSMs is a strong motivation for taking a close look at VSMs.

Not all uses of vectors and matrices count as vector space models. For the purposes of this survey, we take it as a defining property of VSMs that the values of the elements in a VSM must be derived from event frequencies, such as the number of times that a given word appears in a given context (see Section 2.6). For example, often a lexicon or a knowledge base may be viewed as a graph, and a graph may be represented using an adjacency matrix, but this does not imply that a lexicon is a VSM, because, in general, the values of the elements in an adjacency matrix are not derived from event frequencies. This emphasis on event frequencies brings unity to the variety of VSMs and explicitly connects them to the distributional hypothesis; furthermore, it avoids triviality by excluding many possible matrix representations.

1.2 Vectors in AI and Cognitive Science

Vectors are common in AI and cognitive science; they were common before the VSM was introduced by Salton et al. (1975). The novelty of the VSM was to use frequencies in a corpus of text as a clue for discovering semantic information. In machine learning, a typical problem is to learn to classify or cluster a set of items (i.e., examples, cases, individuals, entities) represented as feature vectors (Mitchell, 1997; Witten & Frank, 2005). In general, the features are not derived from event frequencies, although this is possible (see Section 4.6). For example, a machine learning algorithm can be applied to classifying or clustering documents (Sebastiani, 2002).

Collaborative filtering and recommender systems also use vectors (Resnick, Iacovou, Suchak, Bergstrom, & Riedl, 1994; Breese, Heckerman, & Kadie, 1998; Linden, Smith, & York, 2003). In a typical recommender system, we have a person-item matrix, in which the rows correspond to people (customers, consumers), the columns correspond to items (products, purchases), and the value of an element is the rating (poor, fair, excellent) that the person has given to the item. Many of the mathematical techniques that work well with term–document matrices (see Section 4) also work well with person-item matrices, but ratings are not derived from event frequencies.

In cognitive science, prototype theory often makes use of vectors. The basic idea of prototype theory is that some members of a category are more central than others (Rosch & Lloyd, 1978; Lakoff, 1987). For example, robin is a central (prototypical) member of the category bird, whereas penguin is more peripheral. Concepts have varying degrees of membership in categories (graded categorization). A natural way to formalize this is to represent concepts as vectors and categories as sets of vectors (Nosofsky, 1986; Smith, Osherson, Rips, & Keane, 1988). However, these vectors are usually based on numerical scores that are elicited by questioning human subjects; they are not based on event frequencies.

Another area of psychology that makes extensive use of vectors is psychometrics, which studies the measurement of psychological abilities and traits. The usual instrument for measurement is a test or questionnaire, such as a personality test. The results of a test are typically represented as a subject-item matrix, in which the rows represent the subjects (people) in an experiment and the columns represent the items (questions) in the test (questionnaire). The value of an element in the matrix is the answer that the corresponding subject gave for the corresponding item. Many techniques for vector analysis, such as factor analysis (Spearman, 1904), were pioneered in psychometrics.

In cognitive science, Latent Semantic Analysis (LSA) (Deerwester et al., 1990; Landauer & Dumais, 1997), Hyperspace Analogue to Language (HAL) (Lund, Burgess, & Atchley, 1995; Lund & Burgess, 1996), and related research (Landauer, McNamara, Dennis, & Kintsch, 2007) is entirely within the scope of VSMs, as defined above, since this research uses vector space models in which the values of the elements are derived from event frequencies, such as the number of times that a given word appears in a given context. Cognitive scientists have argued that there are empirical and theoretical reasons for believing that VSMs, such as LSA and HAL, are plausible models of some aspects of human cognition (Landauer et al., 2007). In AI, computational linguistics, and information retrieval, such plausibility is not essential, but it may be seen as a sign that VSMs are a promising area for further research.

1.3 Motivation for This Survey

This paper is a survey of vector space models of semantics. There is currently no comprehensive, up-to-date survey of this field. As we show in the survey, vector space models are a highly successful approach to semantics, with a wide range of potential and actual applications. There has been much recent growth in research in this area. This paper should be of interest to all AI researchers who work with natural language, especially researchers who are interested in semantics. The survey will serve as a general introduction to this area and it will provide a framework – a unified perspective – for organizing the diverse literature on the topic. It should encourage new research in the area, by pointing out open problems and areas for further exploration.

This survey makes the following contributions:

New framework: We provide a new framework for organizing the literature: term–document, word–context, and pair–pattern matrices (see Section 2). This framework shows the importance of the structure of the matrix (the choice of rows and columns) in determining the potential applications and may inspire researchers to explore new structures (different kinds of rows and columns, or higher-order tensors instead of matrices).

New developments: We draw attention to pair–pattern matrices. The use of pair–pattern matrices is relatively new and deserves more study. These matrices address some criticisms that have been directed at word–context matrices, regarding lack of sensitivity to word order.

Breadth of approaches and applications: There is no existing survey that shows the breadth of potential and actual applications of VSMs for semantics. Existing summaries omit pair–pattern matrices (Landauer et al., 2007).

Focus on NLP and CL: Our focus in this survey is on systems that perform practical tasks in natural language processing and computational linguistics. Existing overviews focus on cognitive psychology (Landauer et al., 2007).

Success stories: We draw attention to the fact that VSMs are arguably the most successful approach to semantics, so far.

1.4 Intended Readership

Our goal in writing this paper has been to survey the state of the art in vector space models of semantics, to introduce the topic to those who are new to the area, and to give a new perspective to those who are already familiar with the area. We assume our reader has a basic understanding of vectors, matrices, and linear algebra, such as one might acquire from an introductory undergraduate course in linear algebra, or from a textbook (Golub & Van Loan, 1996). The basic concepts of vectors and matrices are more important here than the mathematical details. Widdows (2004) gives a gentle introduction to vectors from the perspective of semantics. We also assume our reader has some familiarity with computational linguistics or information retrieval. Manning et al. (2008) provide a good introduction to information retrieval. For computational linguistics, we recommend Manning and Schütze’s (1999) text. If our reader is familiar with linear algebra and computational linguistics, this survey should present no barriers to understanding. Beyond this background, it is not necessary to be familiar with VSMs as they are used in information retrieval, natural language processing, and computational linguistics. However, if the reader would like to do some further background reading, we recommend Landauer et al.’s (2007) collection.

1.5 Highlights and Outline

This article is structured as follows. Section 2 explains our framework for organizing the literature on VSMs according to the type of matrix involved: term–document, word–context, and pair–pattern. In this section, we present an overview of VSMs, without getting into the details of how a matrix can be generated from a corpus of raw text. After the high-level framework is in place, Sections 3 and 4 examine the steps involved in generating a matrix. Section 3 discusses linguistic processing and Section 4 reviews mathematical processing. This is the order in which a corpus would be processed in most VSM systems (first linguistic processing, then mathematical processing). When VSMs are used for semantics, the input to the model is usually plain text. Some VSMs work directly with the raw text, but most first apply some linguistic processing to the text, such as stemming, part-of-speech tagging, word sense tagging, or parsing. Section 3 looks at some of these linguistic tools for semantic VSMs.

In a simple VSM, such as a simple term–document VSM, the value of an element in a document vector is the number of times that the corresponding word occurs in the given document, but most VSMs apply some mathematical processing to the raw frequency values. Section 4 presents the main mathematical operations: weighting the elements, smoothing the matrix, and comparing the vectors. This section also describes optimization strategies for comparing the vectors, such as distributed sparse matrix multiplication and randomized techniques.
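As an illustration of this kind of mathematical processing, here is a hedged sketch (the tiny matrix and weighting choices are assumptions for illustration, not the paper's specific recipe): raw term frequencies are reweighted with tf-idf, and document vectors are then compared with cosine similarity.

    import numpy as np

    # Rows are terms, columns are documents; entries are raw occurrence counts.
    X = np.array([[2, 0, 1],    # "cat"
                  [0, 3, 0],    # "stock"
                  [1, 1, 1]],   # "the"
                 dtype=float)

    n_docs = X.shape[1]
    df = np.count_nonzero(X, axis=1)     # number of documents containing each term
    idf = np.log(n_docs / df)            # rare terms get higher weight; ubiquitous terms get ~0
    W = X * idf[:, np.newaxis]           # tf-idf weighted term-document matrix

    def cosine(a, b):
        # Cosine similarity between two document (column) vectors.
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(W[:, 0], W[:, 2]))      # documents 1 and 3 both contain "cat"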

By the end of Section 4, the reader will have a general view of the concepts involved in vector space models of semantics. We then take a detailed look at three VSM systems in Section 5. As a representative of term–document VSMs, we present the Lucene information retrieval library.[6] For word–context VSMs, we explore the Semantic Vectors package, which builds on Lucene.[7] As the representative of pair–pattern VSMs, we review the Latent Relational Analysis module in the S-Space package, which also builds on Lucene.[8] The source code for all three of these systems is available under open source licensing. We turn to a broad survey of applications for semantic VSMs in Section 6. This section also serves as a short historical view of research with semantic VSMs, beginning with information retrieval in Section 6.1. Our purpose here is to give the reader an idea of the breadth of applications for VSMs and also to provide pointers into the literature, if the reader wishes to examine any of these applications in detail.

In a term–document matrix, rows correspond to terms and columns correspond to documents (Section 6.1). A document provides a context for understanding the term. If we generalize the idea of documents to chunks of text of arbitrary size (phrases, sentences, paragraphs, chapters, books, collections), the result is the word–context matrix, which includes the term–document matrix as a special case. Section 6.2 discusses applications for word–context matrices. Section 6.3 considers pair–pattern matrices, in which the rows correspond to pairs of terms and the columns correspond to the patterns in which the pairs occur.
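For the word–context case, the following is a small sketch of how such a matrix might be populated (the toy sentence, window size, and data structure are assumptions for illustration): each cell counts how often a word co-occurs with a context word within a fixed window. A pair–pattern matrix would analogously count how often a pair of terms occurs with the patterns that join them in the corpus.

    from collections import defaultdict

    sentences = ["the quick brown fox jumps over the lazy dog".split()]
    window = 2
    # cooc[word][context_word] -> count; a sparse word-context matrix
    cooc = defaultdict(lambda: defaultdict(int))

    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    cooc[word][tokens[j]] += 1

    print(dict(cooc["fox"]))   # counts for "quick", "brown", "jumps", "over"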

In Section 7, we discuss alternatives to VSMs for semantics. Section 8 considers the future of VSMs, raising some questions about their power and their limitations. We conclude in Section 9.

2. Vector Space Models of Semantics

The theme that unites the various forms of VSMs that we discuss in this paper can be stated as the statistical semantics hypothesis: statistical patterns of human word usage can be used to figure out what people mean. [9] This general hypothesis underlies several more specific hypotheses, such as the bag of words hypothesis, the distributional hypothesis, the extended distributional hypothesis, and the latent relation hypothesis, discussed below.

2.1 Similarity of Documents: The Term–Document Matrix

In this paper, we use the following notational conventions: Matrices are denoted by bold capital letters, [math]\displaystyle{ \mathbf{A} }[/math]. Vectors are denoted by bold lowercase letters, [math]\displaystyle{ \mathbf{b} }[/math]. Scalars are represented by lowercase italic letters, [math]\displaystyle{ c }[/math].

If we have a large collection of documents, and hence a large number of document vectors, it is convenient to organize the vectors into a matrix. The row vectors of the matrix correspond to terms (usually terms are words, but we will discuss some other possibilities) and the column vectors correspond to documents (web pages, for example). This kind of matrix is called a term–document matrix.
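As a concrete illustration (the three short "documents" below are invented; this is a sketch, not code from the paper), a term–document matrix can be assembled by counting how often each vocabulary term appears in each document. Each column is then a document vector of the kind ranked against a query in retrieval.

    import numpy as np

    docs = ["human computer interaction",
            "the computer reads the text",
            "human language and text"]
    vocab = sorted({w for d in docs for w in d.split()})

    # A[i, j] = number of times term vocab[i] occurs in document j.
    A = np.zeros((len(vocab), len(docs)), dtype=int)
    for j, d in enumerate(docs):
        for w in d.split():
            A[vocab.index(w), j] += 1

    print(vocab)
    print(A)   # column j is the document vector for docs[j]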

Footnotes

  1. Regarding the average score of 64.5% on the TOEFL questions, Landauer and Dumais (1997) note that, “Although we do not know how such a performance would compare, for example, with U.S. school children of a particular age, we have been told that the average score is adequate for admission to many universities.”
  2. This is the average score for high-school students in their senior year, applying to US universities. For more discussion of this score, see Section 6.3 in Turney’s (2006) paper.
  3. A vector is a first-order tensor and a matrix is a second-order tensor. See Section 2.5.
  4. See http://www.natcorp.ox.ac.uk/
  5. See http://wordnet.princeton.edu/
  6. See http://lucene.apache.org/java/docs/
  7. See http://code.google.com/p/semanticvectors/.
  8. See http://code.google.com/p/airhead-research/wiki/LatentRelationalAnalysis.
  9. This phrase was taken from the Faculty Profile of George Furnas at the University of Michigan, http://www.si.umich.edu/people/faculty-detail.htm?sid=41. The full quote is, “Statistical Semantics – Studies of how the statistical patterns of human word usage can be used to figure out what people mean, at least to a level sufficient for information access.” The term statistical semantics appeared in the work of Furnas, Landauer, Gomez, and Dumais (1983), but it was not defined there.


References


Peter D. Turney, and Patrick Pantel. (2010). “From Frequency to Meaning: Vector Space Models of Semantics.” In: Journal of Artificial Intelligence Research, 37.