2008 ModelingScience

From GM-RKB

Subject Headings: Topic Modeling, Latent Dirichlet Allocation, Graphical Models.

Notes

Cited By

Quotes

Abstract

Modeling Science

  • On-line archives of document collections require better organization. Manual organization is not practical.
  • Our goal: To discover the hidden thematic structure with hierarchical probabilistic models called topic models.
  • Use this structure for browsing, search, and similarity.
  • Our data are the pages of Science from 1880–2002 (from JSTOR).
  • No reliable punctuation, meta-data, or references.
  • Note: this is just a subset of JSTOR’s archive.

Discover topics from a corpus

  • Four example topics (top words, one topic per line):
    • human genome dna genetic genes sequence …
    • evolution evolutionary species organisms life origin …
    • disease host bacteria diseases resistance bacteria …
    • computer models information data computers …

Model the evolution of topics over time

Model connections between topics

Probabilistic modeling

  • Treat data as observations that arise from a generative probabilistic process that includes hidden variables
    • For documents, the hidden variables reflect the thematic structure of the collection.
  • Infer the hidden structure using posterior inference
    • What are the topics that describe this collection?
  • Situate new data into the estimated model.
    • How does this query or new document fit into the estimated topic structure?
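The generative process these bullets describe can be sketched directly for LDA (a minimal illustration; the corpus sizes and the NumPy-based sampling are assumptions for this sketch, though the symbols alpha, beta, theta, and z follow the standard LDA notation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_topics, vocab_size, n_docs, doc_len = 4, 1000, 5, 50

alpha = np.full(n_topics, 0.1)  # Dirichlet prior on per-document topic proportions
# Each topic is a distribution over the vocabulary.
beta = rng.dirichlet(np.full(vocab_size, 0.01), size=n_topics)

docs = []
for _ in range(n_docs):
    theta = rng.dirichlet(alpha)           # hidden: this document's topic proportions
    words = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=theta)  # hidden: topic assignment for this word
        w = rng.choice(vocab_size, p=beta[z])  # observed: the word itself
        words.append(w)
    docs.append(words)
```

Posterior inference runs this process in reverse: given only the observed words, it recovers a distribution over the hidden theta and z variables.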

Graphical models (Aside)

  • Nodes are random variables
  • Edges denote possible dependence
  • Observed variables are shaded
  • Plates denote replicated structure

LDA summary

  • LDA is a powerful model for
    • Visualizing the hidden thematic structure in large corpora
    • Generalizing new data to fit into that structure
  • LDA is a mixed membership model (Erosheva, 2004) that builds on the work of Deerwester et al. (1990) and Hofmann (1999).
  • For document collections and other grouped data, this might be more appropriate than a simple finite mixture


  • Modular: It can be embedded in more complicated models.
    • E.g., syntax and semantics; authorship; word sense
  • General: The data generating distribution can be changed.
    • E.g., images; social networks; population genetics data
  • Variational inference is fast, letting us analyze large data sets.
  • See Blei et al., 2003 for details and a quantitative comparison.
  • Code to play with LDA is freely available on my website, http://www.cs.princeton.edu/~blei.
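The author's linked code is a separate implementation; as a hedged sketch of the same kind of analysis, scikit-learn's LatentDirichletAllocation (an assumed, widely available alternative, not the talk's own software) can fit topics to a toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny toy corpus, invented for illustration.
docs = [
    "genome dna genetic genes sequence",
    "evolution species organisms life origin",
    "dna genes sequence genome genetic",
    "species evolution origin organisms life",
]

counts = CountVectorizer().fit_transform(docs)  # document-term count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-document topic proportions

# Each row of doc_topics sums to one; lda.components_ holds the
# (unnormalized) per-topic word weights.
```

On a real archive like the Science pages, the same two calls scale up: build a count matrix, fit, then inspect the topic-word weights and document-topic proportions.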

The hidden assumptions of the Dirichlet distribution

  • The Dirichlet is an exponential family distribution on the simplex (positive vectors that sum to one).
  • However, the near independence of components makes it a poor choice for modeling topic proportions.
  • An article about fossil fuels is more likely to also be about geology than about genetics.
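This weakness can be checked numerically: every pairwise covariance of a Dirichlet is negative, so the distribution cannot express the positive correlation between topics described above (a small sketch; the dimension and concentration values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample topic proportions from a symmetric 3-dimensional Dirichlet.
samples = rng.dirichlet(np.full(3, 1.0), size=100_000)

# Empirical covariance between distinct components.
cov = np.cov(samples, rowvar=False)
off_diag = cov[~np.eye(3, dtype=bool)]

# All off-diagonal covariances are negative (theory:
# Cov(theta_i, theta_j) = -a_i * a_j / (a0**2 * (a0 + 1)) for i != j),
# so a Dirichlet cannot make "fossil fuels" and "geology" co-occur
# more often than chance.
assert (off_diag < 0).all()
```

Correlated topic models replace the Dirichlet with a logistic normal precisely to recover this kind of positive dependence.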

Summary

  • Topic models provide useful descriptive statistics for analyzing and understanding the latent structure of large text collections.
  • Probabilistic graphical models are a useful way to express assumptions about the hidden structure of complicated data.
  • Variational methods allow us to perform posterior inference to automatically infer that structure from large data sets.
  • Current research
    • Choosing the number of topics
    • Continuous time dynamic topic models
    • Topic models for prediction
    • Inferring the impact of a document

References


  • (Blei, 2008) ⇒ David M. Blei. (2008). "Modeling Science." http://www.cs.princeton.edu/~blei/modeling-science.pdf