2009 EfficientMethodsforTopicModelInferenceonStreamingDocumentCollections


Subject Headings: Topic Modeling Algorithm, SparseLDA Algorithm, Gibbs Sampling Algorithm.



Author Keywords

Topic Modeling, Inference


Topic models provide a powerful tool for analyzing large text collections by representing high dimensional data in a low dimensional subspace. Fitting a topic model given a set of training documents requires approximate inference techniques that are computationally expensive. With today's large-scale, constantly expanding document collections, it is useful to be able to infer topic distributions for new documents without retraining the model. In this paper, we empirically evaluate the performance of several methods for topic inference in previously unseen documents, including methods based on Gibbs sampling, variational inference, and a new method inspired by text classification. The classification-based inference method produces results similar to iterative inference methods, but requires only a single matrix multiplication. In addition to these inference methods, we present SparseLDA, an algorithm and data structure for evaluating Gibbs sampling distributions. Empirical results indicate that SparseLDA can be approximately 20 times faster than traditional LDA and provide twice the speedup of previously published fast sampling methods, while also using substantially less memory.

1. Introduction

... Second, since many of the methods we discuss rely on Gibbs sampling to infer topic distributions, we also present a simple method, SparseLDA, for efficient Gibbs sampling in topic models along with a data structure that results in very fast sampling performance with a small memory footprint.



We evaluate three different sampling-based inference methods for LDA. Gibbs sampling is an MCMC method that involves iterating over a set of variables [math]z_1, z_2, \ldots, z_N[/math], sampling each [math]z_i[/math] from [math]P(z_i \mid \mathbf{z}_{\setminus i}, \mathbf{w})[/math]. Each iteration over all variables is referred to as a Gibbs sweep. Given enough iterations, Gibbs sampling for LDA [4] produces samples from the posterior [math]P(\mathbf{z} \mid \mathbf{w})[/math]. The difference between the three methods we explore is in the set of variables [math]z[/math] that are sampled, as illustrated in Figure 1, and which portion of the complete data is used in estimating [math]\Phi[/math].
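A collapsed Gibbs sweep of the kind described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function and variable names (`gibbs_sweep`, `n_dk`, `n_kw`, `n_k`) are our own, and symmetric hyperparameters `alpha` and `beta` are assumed.

```python
import numpy as np

def init_counts(docs, z, K, V):
    """Build the count matrices the sampler maintains incrementally."""
    n_dk = np.zeros((len(docs), K))   # document-topic counts
    n_kw = np.zeros((K, V))           # topic-word counts
    n_k = np.zeros(K)                 # total tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_dk, n_kw, n_k

def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, rng):
    """One Gibbs sweep: resample every token's topic z_i from
    P(z_i | z_\\i, w) ∝ (alpha + n_dk) * (beta + n_kw) / (beta*V + n_k)."""
    V = n_kw.shape[1]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # remove the token's current assignment from the counts
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # full conditional over topics for this token
            p = (alpha + n_dk[d]) * (beta + n_kw[:, w]) / (beta * V + n_k)
            k = int(rng.choice(len(p), p=p / p.sum()))
            # add the token back under its new assignment
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
            z[d][i] = k
    return z
```

Note that evaluating `p` here costs O(K) per token; the SparseLDA algorithm presented later avoids exactly this dense evaluation.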


3.4 Time- and Memory-Efficient Gibbs Sampling for LDA

The efficiency of Gibbs sampling-based inference methods depends almost entirely on how fast we can evaluate the sampling distribution over topics for a given token. We therefore present SparseLDA, our new algorithm and data structure that substantially improves sampling performance.
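The excerpt does not reproduce the algorithm itself; as we understand the paper's technique, SparseLDA rewrites the unnormalized sampling mass [math](\alpha + n_{d,k})(\beta + n_{w,k})/(\beta V + n_k)[/math] as a sum of three "buckets" — a smoothing-only term, a document-topic term that is nonzero only for topics present in the document, and a topic-word term that is nonzero only for topics that have been assigned to the word. The sketch below shows the bucket decomposition only (the paper's caching and sorted count data structures are omitted), with hypothetical names and symmetric `alpha`/`beta`:

```python
import numpy as np

def sparse_lda_sample(d, w, n_dk, n_kw, n_k, alpha, beta, V, rng):
    """Draw a topic for a token of word w in document d by splitting
    P(z=k) ∝ (alpha + n_dk[d,k]) * (beta + n_kw[k,w]) / (beta*V + n_k[k])
    into three buckets; the document (r) and topic-word (q) buckets only
    range over the sparse nonzero counts."""
    denom = beta * V + n_k                              # shape (K,)
    s = float(np.sum(alpha * beta / denom))             # smoothing-only mass
    doc_topics = np.nonzero(n_dk[d])[0]                 # topics with n_dk > 0
    r_terms = n_dk[d, doc_topics] * beta / denom[doc_topics]
    word_topics = np.nonzero(n_kw[:, w])[0]             # topics with n_kw > 0
    q_terms = ((alpha + n_dk[d, word_topics]) * n_kw[word_topics, w]
               / denom[word_topics])
    u = rng.uniform(0.0, s + r_terms.sum() + q_terms.sum())
    if u < s:
        # smoothing bucket: walk all K topics
        for k in range(len(n_k)):
            u -= alpha * beta / denom[k]
            if u <= 0.0:
                return k
        return len(n_k) - 1
    u -= s
    if u < r_terms.sum():
        # document bucket: walk only topics present in the document
        for k, t in zip(doc_topics, r_terms):
            u -= t
            if u <= 0.0:
                return int(k)
        return int(doc_topics[-1])
    u -= r_terms.sum()
    # topic-word bucket, typically holding most of the mass:
    # walk only topics observed with this word
    for k, t in zip(word_topics, q_terms):
        u -= t
        if u <= 0.0:
            return int(k)
    return int(word_topics[-1])
```

Because the topic-word bucket usually dominates, the sampler rarely needs to walk more than the handful of topics in which the current word actually appears, which is the source of the reported speedup.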


Another class of approximate inference methods widely used in fitting topic models is variational EM. Variational inference involves defining a parametric family of distributions that forms a tractable approximation to an intractable true joint distribution. In the case of LDA, Blei, Ng, and Jordan [3] suggest a factored distribution consisting of a variational Dirichlet distribution [math]\tilde{\gamma}_d[/math] for each document and a variational multinomial [math]\tilde{\phi}_{di}[/math] over topics for each word position in the document.
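The coordinate-ascent updates for this factored family are not spelled out in this excerpt; for reference, the standard mean-field updates from Blei, Ng, and Jordan [3], written with the variational parameters above, are

```latex
\tilde{\phi}_{dik} \;\propto\; \beta_{k, w_{di}} \, \exp\!\left(\Psi(\tilde{\gamma}_{dk})\right),
\qquad
\tilde{\gamma}_{dk} \;=\; \alpha_k + \sum_{i} \tilde{\phi}_{dik},
```

where [math]\Psi[/math] is the digamma function; iterating these updates to convergence for a document constitutes the per-document E-step of variational EM.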




Limin Yao, David Mimno, and Andrew McCallum (2009). "Efficient Methods for Topic Model Inference on Streaming Document Collections." In: Proceedings of KDD-2009. doi:10.1145/1557019.1557121