2009 MoreDataTrumpsSmarterAlgorithms

From GM-RKB

Subject Headings: Pointwise Mutual Information.

Notes

Cited By

Quotes

Abstract

Computational models of lexical semantics, such as latent semantic analysis, can automatically generate semantic similarity measures between words from statistical redundancies in text. These measures are useful for experimental stimulus selection and for evaluating a model's cognitive plausibility as a mechanism that people might use to organize meaning in memory. Although humans are exposed to enormous quantities of speech, practical constraints limit the amount of data that many current computational models can learn from. We follow up on previous work evaluating a simple metric of pointwise mutual information. Controlling for confounds in previous work, we demonstrate that this metric benefits from training on extremely large amounts of data and correlates more closely with human semantic similarity ratings than do publicly available implementations of several more complex models. We also present a simple tool for building simple and scalable models from large corpora quickly and efficiently.

Introduction

Explaining how semantic representations of words are derived from experience is a central task for high-dimensional semantic space models such as the hyperspace analogue to language (HAL) framework of Burgess and Lund (2000), latent semantic analysis (LSA; Landauer & Dumais, 1997), and other lexical co-occurrence models of semantic memory. Although the models differ considerably in the algorithms used, they are all fundamentally based on the principle that a word’s meaning can be induced by observing its statistical usage across a large sample of language. Evaluation typically proceeds by first training a model on a corpus of text, after which the model can generate similarity ratings between word pairs for comparison with human judgments. For example, if humans rate the words “cat” and “feline” as closer in meaning than the words “cat” and “cupboard,” it would be desirable for a semantic space model to do so as well. Because the similarity ratings of semantic space models are used for experimental stimulus selection and for model evaluation, close correspondence with human data is critically important.

Starting from the premise that “simple associations in context are aggregated into conceptual representations” (Burgess & Lund, 2000), semantic space models typically apply techniques that go beyond direct lexical co-occurrence to induce more abstract semantic representations. Perhaps the best-known example is LSA’s use of singular value decomposition (SVD), a technique from linear algebra. As described by Landauer, Foltz, and Laham (1998), LSA operates by first constructing a term–document matrix in which the value of each cell (i,j) represents the number of occurrences of word i in document j. Each term is then weighted with a log–entropy transform applied to the value of each cell in order to reduce the influence of very frequent words. Next, the matrix is factored, using SVD, allowing the construction of a new matrix of lower rank which can be thought of as a low-dimensional approximation of the original matrix. Finally, two rows of the final matrix can be correlated to obtain the semantic similarity between the rows’ corresponding terms. Because the optimal choice of dimensionality for an LSA space varies depending on the choice of task and training corpus, it is generally chosen by recomputing the SVD over a wide range of possible choices and selecting the one for which the final matrix performs the best on the task at hand (Quesada, 2006). Similar probabilistic methods for inferring latent semantic components are arguably even more sophisticated and computationally intensive (e.g., Griffiths, Steyvers, & Tenenbaum, 2007; Hoffman, 1999).
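
To make the pipeline just described concrete, the following is a minimal sketch of the LSA procedure on a toy list of tokenized documents: build a term–document count matrix, apply a log–entropy weighting, take a truncated SVD, and correlate two word rows. The function and parameter names (lsa_similarity, k) are illustrative assumptions, not the implementation at lsa.colorado.edu, and in practice the rank k would be tuned as described above.

```python
import numpy as np
from scipy.sparse.linalg import svds

def lsa_similarity(docs, word_a, word_b, k=2):
    """docs: list of tokenized documents (needs more docs and vocabulary than k)."""
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}

    # Term-document matrix: cell (i, j) holds the count of word i in document j.
    X = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d:
            X[index[w], j] += 1

    # Log-entropy weighting to reduce the influence of very frequent words.
    p = X / X.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    global_weight = 1 + plogp.sum(axis=1) / np.log(len(docs))
    X = np.log(X + 1) * global_weight[:, None]

    # Low-rank approximation via truncated SVD; the scaled left singular
    # vectors serve as the rows (word vectors) of the reduced matrix.
    U, S, _ = svds(X, k=k)
    W = U * S

    # Correlate the two words' rows to obtain their semantic similarity.
    return float(np.corrcoef(W[index[word_a]], W[index[word_b]])[0, 1])
```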

The computational expense of the dimension reduction step raises several challenging issues for LSA and related semantic inference models. The first is lack of scalability. Because standard algorithms for computing SVDs require the entire term–document matrix to be held in memory, training LSA on corpora of many tens of millions of tokens is infeasible, even with high-end supercomputing resources. The problem is exacerbated by the fact that as the size of the corpus increases, the number of rows and columns in the matrix both increase significantly, the number of columns growing linearly with the number of documents and the number of rows growing in approximate proportion to the square root of the number of tokens. A promising class of algorithms known as “out-of-core” methods for SVD calculation (Lin, 2000) has recently been shown to calculate large SVDs without having to store the entire matrix in memory (Martin, Martin, Berry, & Browne, 2007). However, no publicly available implementations of LSA currently make use of these methods. Even if out-of-core SVD algorithms can be shown to solve LSA’s memory bottleneck, the high computational demands of computing extremely large SVDs may continue to be prohibitive for quite some time.
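
As a rough illustration of this scaling argument (not a measurement from the paper), the sketch below estimates how a dense term–document matrix would grow under the assumptions above. The corpus figures, documents-per-token ratio, and vocabulary-growth constant are all hypothetical, and real implementations store the matrix sparsely, which lowers storage but does not remove the cost of computing the SVD over both growing dimensions.

```python
# Hypothetical back-of-envelope estimate of term-document matrix growth.
# All constants below are assumed for illustration only.
def dense_matrix_gib(n_tokens, tokens_per_doc=250, vocab_coeff=30, cell_bytes=8):
    n_docs = n_tokens / tokens_per_doc       # columns grow linearly with documents
    n_terms = vocab_coeff * n_tokens ** 0.5  # rows grow roughly with sqrt(tokens)
    return n_terms * n_docs * cell_bytes / 2**30

for tokens in (11e6, 110e6, 1.1e9):          # a TASA-sized corpus, 10x, and 100x
    print(f"{int(tokens):>13,} tokens -> ~{dense_matrix_gib(tokens):,.0f} GiB dense")
```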

The standard LSA space used by most researchers (and available at http://lsa.colorado.edu) is trained on the Touchstone Applied Science Associates (TASA) corpus of textbook readings from a variety of grade levels (Zeno, Ivens, Millard, & Duvvuri, 1995). This corpus comprises some 11 million tokens, a reasonable (perhaps even generous) approximation of the number of tokens most adults schooled in the United States have been exposed to through reading, given estimates that most children have read only around 3.8 million words by late grade school (Landauer & Dumais, 1997).

However, these estimates do not take into account the amount of language input that humans are exposed to through speech; indeed, 11 million tokens is approximately one third the number that children are estimated to have heard by the time they are 3 years old (Risley & Hart, 2006). Risley and Hart estimated that most American children hear between 10 and 33 million words in their first 3 years of life, not counting the 4 to 12 million that they produce during this time. By the age of 18 (a lower bound for the age of study participants from whom semantic similarity judgments are collected), it is safe to assume that most would have experienced many times this amount. Although Landauer and Dumais (1997) argued convincingly that older children acquire most of their new word meanings from reading, this does not mean that these meanings cannot change over time as a result of additional exposure. If co-occurrence information plays a significant role in shaping humans' lexical semantic representations over time, one would expect our representations of word meaning to be shaped by co-occurrences in speech as well as in print. Given that semantic similarity judgments are collected from adult study participants, being able to scale semantic space models to large data sets is extremely important. We found training LSA or the topics model (Griffiths et al., 2007) on corpora significantly larger than TASA to be infeasible, given our resources and the massive computational demands of the learning algorithms.

Another challenge for LSA is a lack of incrementality: an inability to [[update semantic representations incrementally]] in response to a continual accumulation of language input. Humans update their semantic representations in response to new data continuously over time. In contrast, after an LSA space has been built, additional documents cannot be incorporated into the space without recomputing the low-rank approximation of the original matrix. This is problematic, because the original matrix is no longer presumed to be stored in memory after the SVD step has been computed (Landauer & Dumais, 1997). This lack of incrementality decreases the cognitive plausibility of LSA as a model of semantic organization (Lemaire & Denhière, 2004; Perfetti, 1998) and limits the appeal of static systems for learning from dynamically changing textbases.
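
To illustrate the contrast, the sketch below shows the kind of incremental update that simple co-occurrence count models permit: incorporating a new document only increments a handful of counts, whereas an LSA space would require recomputing the SVD over the full matrix. The class and method names are illustrative assumptions, not an existing implementation.

```python
from collections import Counter
from itertools import combinations

class CooccurrenceCounts:
    """Incrementally updatable co-occurrence statistics (no SVD to recompute)."""

    def __init__(self):
        self.word_counts = Counter()
        self.pair_counts = Counter()

    def update(self, document_tokens):
        # A new document only touches the counts for the words it contains.
        types = set(document_tokens)
        self.word_counts.update(types)
        for pair in combinations(sorted(types), 2):
            self.pair_counts[pair] += 1

# Usage: counts can keep growing as new language input arrives.
model = CooccurrenceCounts()
model.update(["cat", "feline", "purr"])
model.update(["cat", "cupboard", "kitchen"])
```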

Finally, there is the issue of complexity. Computational models of word segmentation, semantics, and syntax “often assume a computational complexity and linguistic knowledge likely to be beyond the abilities of developing young children” (Onnis & Christiansen, 2008). Given two models that correlate with human data equally well, the principle of scientific parsimony would tend to favor the one that makes fewer assumptions about the capabilities of the developing brain. SVD is arguably a complex process without an obvious explanation of how it might be implemented in humans. From the beginning, the SVD realization of LSA has been regarded more as a convenient expedient for dimensionality reduction than as a claim about the specific analysis that humans apply to the data. Landauer and Dumais (1997, p. 218) stated that SVD should be considered only one of a class of mathematical techniques worth exploring, and that additional features could be added to the dimensionality reduction step to “make it more closely resemble what we know or think we know about the basic processes of perception, learning, and memory.” These considerations motivated us to take a different approach. Rather than adding features to dimensionality reduction algorithms to make them more scalable and psychologically plausible, our approach was to throw much more data at the problem, even if doing so required a less sophisticated learning algorithm. Recent work in computational linguistics has shown that greater progress on automatic parsing and sense disambiguation has often come from training existing simple systems on more language data as it became available, rather than from investing in the development of superior algorithms. We take this lesson from computational linguistics as a useful indicator that computational models of semantics may reap greater benefits from more data than from more clever learning algorithms, and our goal in this article is to explore that hypothesis.

Pointwise mutual information (PMI) is one such measure that meets the previously discussed desiderata of scalability, incrementality, and simplicity, and it has been shown to yield high performance on forced-choice tests of semantic similarity (Bullinaria & Levy, 2006; Terra & Clarke, 2003; Turney, 2001). However, PMI has not yet been shown to outperform LSA on a wide variety of evaluation tasks. Budiu, Royer, and Pirolli (2007) found that PMI outperformed LSA on a variety of semantic tasks when trained on a larger corpus, but their work confounded corpus size and corpus quality. Bullinaria and Levy, using PMI to compare vectors of co-occurrence counts (i.e., the rows of a term × term matrix), found that PMI outperformed LSA on the Test of English as a Foreign Language (TOEFL) forced-choice synonymy test when both methods were trained on a small corpus derived from Grolier’s Academic American Encyclopedia. However, they did not correlate the performance of PMI and LSA with human judgments of semantic similarity.

First, we set out to determine whether PMI could provide a better match than LSA to human judgments of similarity when both measures were trained on the same type of text and only corpus size was varied. Finding this to be the case, we compared the performance of PMI trained on a large text corpus with several other measures of semantic relatedness, its high performance motivating us to release an easy-to-use resource for obtaining similarity judgments from arbitrary corpora. We end with a brief discussion of the theoretical and practical ramifications of these results.

Previous Research

PMI was first introduced in the context of word associations by Church and Hanks (1990). PMI is a very simple information-theoretic measure that, when computed between two words x and y, “compares the probability of observing x and y together (the joint probability) with the probabilities of observing x and y independently (chance)” (Church & Hanks, 1990, p. 23). It is defined as

$I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$   (1)

In practice, P(x) can be approximated as the number of times that x appears in the corpus, P(y) as the number of times y appears in the corpus, and P(x, y) as the number of times the two words co-occur in a context. Because a logarithm is a monotonically increasing function, relative ordinal rankings between PMI estimates are maintained if the log is dropped. Because we are interested in rank correlations to human similarity judgments, we use the term PMI in the remainder of this article to refer simply to the number of co-occurrences of x and y divided by the product of their individual frequencies.
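
As a concrete illustration, the following is a minimal sketch of this log-free PMI-style score computed from raw counts over a toy set of contexts. The function names and the treatment of each token list as a single co-occurrence context are simplifying assumptions, not the tool released with the paper.

```python
from collections import Counter
from itertools import combinations

def pmi_scores(contexts):
    """contexts: iterable of token lists, each list treated as one co-occurrence context."""
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in contexts:
        types = set(tokens)
        word_counts.update(types)
        # Count each unordered pair of distinct words appearing in the same context.
        for pair in combinations(sorted(types), 2):
            pair_counts[pair] += 1

    def score(x, y):
        joint = pair_counts.get(tuple(sorted((x, y))), 0)
        # Co-occurrence count divided by the product of individual frequencies;
        # the log is dropped, as in the article, since only rank order matters.
        return joint / (word_counts[x] * word_counts[y]) if joint else 0.0

    return score

# Toy usage: "cat" should score closer to "feline" than to "cupboard".
score = pmi_scores([["cat", "feline", "purr"],
                    ["cat", "cupboard", "kitchen"],
                    ["feline", "cat", "whiskers"],
                    ["cupboard", "kitchen", "plate"]])
print(score("cat", "feline"), score("cat", "cupboard"))
```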

Turney (2001) found that a version of PMI that used search engine hit counts to estimate lexical co-occurrences outperformed LSA on two forced-choice vocabulary tests (TOEFL and ESL, described later), suggesting that PMI might be capable of approximating human similarity judgments more closely than LSA. However, LSA and PMI were trained on vastly different genres and sizes of text, leaving open the question of whether LSA would have performed better if trained on the same kind of text as PMI. Terra and Clarke (2003) also found high performance of PMI on TOEFL. They found that, in general, performance seemed to increase with corpus size up to a point of around 850 GB, although the picture was muddied by the discrete nature of the evaluation metric. Furthermore, ESL and TOEFL constitute forced-choice synonymy tests; correlations with human judgments of word similarity were not attempted in either study.

In an excellent comparison of algorithms and metrics, Bullinaria and Levy (2006) systematically investigated the effects of the corpus size and quality on PMI. Their evaluation tasks consisted of TOEFL, a distance comparison (another forced-choice task in which the model must select the most semantically similar word out of a collection of alternatives), and two categorization tasks, one semantic and one syntactic. Rather than using the PMI score itself as the similarity value, Bullinaria and Levy constructed vectors for each word w in which each element corresponded to the PMI between w and another term from the training data. One can think of this as building up a term × term matrix; similarity scores between two words are estimated by calculating the distance between the corresponding two rows of the matrix. Using this method, the distance metric used to assess the similarity between vectors significantly affects the results. Bullinaria and Levy exhaustively tested a wide variety of distance metrics, along with other parameters, such as the number of vector components and the size of the context window. Although TOEFL was the only task on which Bullinaria and Levy directly compared LSA with PMI, they found that when these two methods were equated for corpus size and quality, comparing vectors of PMI scores using the cosine distance metric outperformed LSA even on a small corpus. They also found that additional training data greatly enhanced the performance of PMI on each of their evaluation tasks.
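
The sketch below illustrates the vector-of-PMI-scores idea described above: a row of PMI values is built for each word against a chosen set of context words, and two words are then compared by the cosine between their rows. The windowing (whole token lists as contexts), the zero value for unseen pairs, and the absence of any component selection are simplifying assumptions, not Bullinaria and Levy's exact configuration.

```python
import numpy as np
from collections import Counter

def pmi_vectors(contexts, context_words):
    """Build a vector of PMI values for every word, one component per context word."""
    word_counts, pair_counts, n_contexts = Counter(), Counter(), 0
    for tokens in contexts:
        n_contexts += 1
        types = set(tokens)
        word_counts.update(types)
        for x in types:
            for y in types:
                if x != y:
                    pair_counts[(x, y)] += 1

    def pmi(x, y):
        joint = pair_counts.get((x, y), 0)
        if joint == 0:
            return 0.0  # simplification: unseen pairs contribute zero
        return np.log2(joint * n_contexts / (word_counts[x] * word_counts[y]))

    return {w: np.array([pmi(w, c) for c in context_words]) for w in word_counts}

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0
```

Two words would then be compared with cosine(vectors[a], vectors[b]); in Bullinaria and Levy's experiments, the distance metric, the number of vector components, and the context window size were all varied systematically.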

Finally, Budiu, Royer, and Pirolli (2007) compared the performance of PMI with that of LSA and GLSA (Matveeva, Levow, Farahat, & Royer, 2005) on a variety of tasks across two corpora: the aforementioned TASA corpus of general textbook readings ranging from first grade through college, and the much larger Stanford corpus, consisting of the first 6.7 million pages of the WebBase project (Cho et al., 2006), a public-domain collection of Web page snapshots archived from 2004 to 2008. They found that a version of PMI trained on Stanford (PMI-Stanford) performed better than a version of LSA trained on TASA on several semantic similarity tasks. They also found that PMI-Stanford performed better on these tasks than a version of PMI trained on TASA did. They concluded that training on more data was what caused PMI’s performance to improve.

However, the study confounded corpus size and corpus type, since TASA and the Stanford corpus represent two very different genres of text. TASA consists of a collection of readings “carefully put together to reflect college students’ vocabulary,” whereas Stanford is generated by a Web crawl (Budiu et al., 2007). It may be that PMI-Stanford produced a closer correspondence with human judgments, not because the Stanford corpus is larger, but rather because it is a better representation of the sort of language to which humans are exposed in everyday life. If LSA had been trained not on TASA but rather on a subset of the Stanford corpus, it might have performed better. Therefore, the experiment was inconclusive on the question of whether PMI would outperform LSA if both measures were trained on the same type of text. Likewise, a version of PMI trained on only a small but representative subset of the Stanford corpus might have performed just as well as PMI-Stanford did. In Experiment 1, we compared the performance of LSA and PMI trained on a representative, TASA-sized subset of the Wikipedia corpus with that of PMI trained on the full Wikipedia corpus (an infeasible task for LSA), in order to clarify whether corpus size or type of text was the causal factor behind PMI’s success.

Experiment 1

Method

References

Gabriel Recchia, and Michael N. Jones (2009). “More Data Trumps Smarter Algorithms: Comparing Pointwise Mutual Information with Latent Semantic Analysis.”