2007 ProbabilisticTopicModels


Subject Headings: Probabilistic Topic Model.

Notes

Cited By

~212 http://scholar.google.com/scholar?cites=9982802760707177758

Quotes

1. Introduction

Many chapters in this book illustrate that applying a statistical method such as Latent Semantic Analysis (LSA; Landauer & Dumais, 1997; Landauer, Foltz, & Laham, 1998) to large databases can yield insight into human cognition. The LSA approach makes three claims: that semantic information can be derived from a word-document co-occurrence matrix; that dimensionality reduction is an essential part of this derivation; and that words and documents can be represented as points in Euclidean space. In this chapter, we pursue an approach that is consistent with the first two of these claims, but differs in the third, describing a class of statistical models in which the semantic properties of words and documents are expressed in terms of probabilistic topics.
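
As a minimal, hedged sketch of those first two claims (using a hypothetical three-document corpus and scikit-learn, neither of which comes from the chapter), one can form a word-document count matrix and reduce its dimensionality with a truncated SVD:

```python
# Minimal LSA-style sketch: count matrix + truncated SVD (corpus is hypothetical).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "drugs affect the brain and body",
    "red and blue are bright colors",
    "memory and the mind store thought",
]

# Claim 1: semantic information can be derived from a word-document co-occurrence matrix.
counts = CountVectorizer().fit_transform(docs)   # documents x words

# Claim 2: dimensionality reduction is an essential part of the derivation (rank 2 here).
lsa = TruncatedSVD(n_components=2)
doc_vectors = lsa.fit_transform(counts)          # documents as points
word_vectors = lsa.components_.T                 # words as points

# Claim 3, which topic models do NOT share: words and documents end up as
# points in Euclidean space rather than as probability distributions.
print(doc_vectors.shape, word_vectors.shape)
```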

Topic models (e.g., Blei, Ng, & Jordan, 2003; Griffiths & Steyvers, 2002; 2003; 2004; Hofmann, 1999; 2001) are based upon the idea that documents are mixtures of topics, where a topic is a probability distribution over words. A topic model is a generative model for documents: it specifies a simple probabilistic procedure by which documents can be generated. To make a new document, one chooses a distribution over topics. Then, for each word in that document, one chooses a topic at random according to this distribution, and draws a word from that topic. Standard statistical techniques can be used to invert this process, inferring the set of topics that were responsible for generating a collection of documents. Figure 1 shows four example topics that were derived from the TASA corpus, a collection of over 37,000 text passages from educational materials (e.g., language arts, social studies, health, sciences) collected by Touchstone Applied Science Associates (see Landauer, Foltz, & Laham, 1998). The figure shows the sixteen words that have the highest probability under each topic. The words in these topics relate to drug use, colors, memory and the mind, and doctor visits. Documents with different content can be generated by choosing different distributions over topics. For example, by giving equal probability to the first two topics, one could construct a document about a person who has taken too many drugs, and how that affected color perception. By giving equal probability to the last two topics, one could construct a document about a person who experienced a loss of memory, which required a visit to the doctor.
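
The generative procedure just described can be made concrete in a few lines. The following is a hedged sketch with two hypothetical toy topics over a six-word vocabulary; the topics, vocabulary, document length, and Dirichlet hyperparameter are illustrative assumptions, not the Figure 1 topics:

```python
# Sketch of the generative process for one document (toy topics, not Figure 1's).
import numpy as np

rng = np.random.default_rng(0)

vocab = ["drugs", "body", "red", "colors", "memory", "mind"]
topics = np.array([
    [0.40, 0.40, 0.05, 0.05, 0.05, 0.05],   # topic 0: drug use
    [0.05, 0.05, 0.40, 0.40, 0.05, 0.05],   # topic 1: colors
])

# Step 1: choose a distribution over topics for this document.
theta = rng.dirichlet(alpha=[1.0, 1.0])          # e.g., roughly equal weight

# Step 2: for each word, choose a topic from theta, then draw a word from it.
document = []
for _ in range(10):
    z = rng.choice(len(topics), p=theta)         # topic assignment
    w = rng.choice(len(vocab), p=topics[z])      # word drawn from topic z
    document.append(vocab[w])

print(document)
```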

2. Generative Models

A generative model for documents is based on simple probabilistic sampling rules that describe how words in documents might be generated on the basis of latent (random) variables. When fitting a generative model, the goal is to find the best set of latent variables that can explain the observed data (i.e., observed words in documents), assuming that the model actually generated the data.
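
As an illustration of this fitting step, the sketch below uses scikit-learn's LatentDirichletAllocation (a variational-inference implementation of one standard topic model) on a hypothetical four-document corpus; it is an off-the-shelf stand-in, not the chapter's own inference procedure:

```python
# Sketch: inferring latent topics from observed documents (hypothetical corpus).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "drugs body drug effects drugs",
    "red blue colors bright red",
    "memory mind thought memory brain",
    "doctor patient visit health doctor",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)               # the observed data: word counts

# Latent variables: per-document topic mixtures and per-topic word distributions.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:4]              # highest-probability words
    print(f"topic {k}:", [words[i] for i in top])
```

Fitting here searches for values of the latent variables under which the observed word counts are probable, on the assumption that the model generated them.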

References


(Griffiths & Steyvers, 2007) ⇒ Thomas L. Griffiths, and Mark Steyvers. (2007). "Probabilistic Topic Models." In: Handbook of Latent Semantic Analysis. http://cocosci.berkeley.edu/tom/papers/topics chapter.pdf