2009 EfficientMethodsforTopicModelIn

(Yao et al., 2009) ⇒ Limin Yao, David Mimno, and Andrew McCallum. (2009). “Efficient Methods for Topic Model Inference on Streaming Document Collections.” In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2009). doi:10.1145/1557019.1557121

Subject Headings: Topic Modeling Algorithm, SparseLDA Algorithm, Gibbs Sampling Algorithm.

Notes

Categories and Subject Descriptors: H.4 Information Systems Applications: Miscellaneous.
General Terms: Experimentation, Performance, Design.

Cited By

Quotes

Author Keywords

Topic Modeling, Inference

Abstract

Topic models provide a powerful tool for analyzing large text collections by representing high dimensional data in a low dimensional subspace. Fitting a topic model given a set of training documents requires approximate inference techniques that are computationally expensive. With today's large-scale, constantly expanding document collections, it is useful to be able to infer topic distributions for new documents without retraining the model. In this paper, we empirically evaluate the performance of several methods for topic inference in previously unseen documents, including methods based on Gibbs sampling, variational inference, and a new method inspired by text classification. The classification-based inference method produces results similar to iterative inference methods, but requires only a single matrix multiplication. In addition to these inference methods, we present SparseLDA, an algorithm and data structure for evaluating Gibbs sampling distributions. Empirical results indicate that SparseLDA can be approximately 20 times faster than traditional LDA and provide twice the speedup of previously published fast sampling methods, while also using substantially less memory.

1. Introduction

... Second, since many of the methods we discuss rely on Gibbs sampling to infer topic distributions, we also present a simple method, SparseLDA, for efficient Gibbs sampling in topic models along with a data structure that results in very fast sampling performance with a small memory footprint.

…

3. SAMPLING-BASED INFERENCE

We evaluate three different sampling-based inference methods for LDA. Gibbs sampling is an MCMC method that involves iterating over a set of variables $z_1, z_2, \ldots, z_n$, sampling each $z_i$ from $P(z_i \mid z_{\setminus i}, w)$. Each iteration over all variables is referred to as a Gibbs sweep. Given enough iterations, Gibbs sampling for LDA [4] produces samples from the posterior $P(z \mid w)$. The difference between the three methods we explore is in the set of variables $z$ that are sampled, as illustrated in Figure 1, and which portion of the complete data is used in estimating $\Phi$.
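The Gibbs sweep described in this excerpt can be illustrated with a minimal sketch of collapsed Gibbs sampling for LDA. It assumes symmetric hyperparameters `alpha` and `beta` and dense count arrays; the function and variable names (`init_state`, `gibbs_sweep`, `n_dt`, `n_tw`, `n_t`) are hypothetical and do not come from the paper, and the sketch does not distinguish between the three variants the paper compares.

```python
import random

def init_state(docs, num_topics, vocab_size, rng):
    """Randomly assign a topic to every token and build the count tables."""
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    n_dt = [[0] * num_topics for _ in docs]               # document-topic counts
    n_tw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    n_t = [0] * num_topics                                # tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dt[d][t] += 1
            n_tw[t][w] += 1
            n_t[t] += 1
    return z, n_dt, n_tw, n_t

def gibbs_sweep(docs, z, n_dt, n_tw, n_t, alpha, beta, vocab_size, rng):
    """One Gibbs sweep: resample each z_i given all other assignments and the words."""
    num_topics = len(n_t)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            old = z[d][i]
            # Remove the current token from the counts (conditioning on "all but i").
            n_dt[d][old] -= 1
            n_tw[old][w] -= 1
            n_t[old] -= 1
            # Unnormalized conditional probability of each topic for this token.
            weights = [
                (alpha + n_dt[d][t]) * (beta + n_tw[t][w]) / (beta * vocab_size + n_t[t])
                for t in range(num_topics)
            ]
            new = rng.choices(range(num_topics), weights=weights)[0]
            # Record the new assignment.
            z[d][i] = new
            n_dt[d][new] += 1
            n_tw[new][w] += 1
            n_t[new] += 1
```

In this sketch, inference for a previously unseen document would amount to running sweeps over that document's tokens only, while choosing whether or not the topic-word counts are updated; which choice corresponds to which of the paper's three methods is not stated in the excerpt.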

…

3.4 Time- and Memory-Efficient Gibbs Sampling for LDA

The efficiency of Gibbs sampling-based inference methods depends almost entirely on how fast we can evaluate the sampling distribution over topics for a given token. We therefore present SparseLDA, our new algorithm and data structure that substantially improves sampling performance.
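The excerpt does not spell out how SparseLDA evaluates the sampling distribution. A minimal sketch follows, assuming the three-bucket decomposition usually associated with the algorithm: the unnormalized weight $(\alpha + n_{t|d})(\beta + n_{w|t}) / (\beta V + n_t)$ is split into a smoothing-only term, a document-topic term, and a topic-word term. All names (`bucket_masses`, `sample_topic`, and the argument names) are hypothetical, and the dense loops here stand in for the sparse data structure and cached sums that give the real method its speed.

```python
def dense_weights(n_dt_d, n_tw_w, n_t, alpha, beta, vocab_size):
    """Standard unnormalized Gibbs weights for one token, one entry per topic."""
    return [
        (alpha + n_dt_d[t]) * (beta + n_tw_w[t]) / (beta * vocab_size + n_t[t])
        for t in range(len(n_t))
    ]

def bucket_masses(n_dt_d, n_tw_w, n_t, alpha, beta, vocab_size):
    """Split the total sampling mass into three buckets (s, r, q)."""
    topics = range(len(n_t))
    denom = [beta * vocab_size + n_t[t] for t in topics]
    # s: smoothing-only mass, the same for every document and word type.
    s = sum(alpha * beta / denom[t] for t in topics)
    # r: document-topic mass, nonzero only for topics present in the document.
    r = sum(n_dt_d[t] * beta / denom[t] for t in topics if n_dt_d[t] > 0)
    # q: topic-word mass, nonzero only for topics in which the word occurs.
    q = sum((alpha + n_dt_d[t]) * n_tw_w[t] / denom[t] for t in topics if n_tw_w[t] > 0)
    return s, r, q

def sample_topic(u, n_dt_d, n_tw_w, n_t, alpha, beta, vocab_size):
    """Map a uniform draw u in [0, s + r + q) to a topic, checking q, then r, then s."""
    topics = range(len(n_t))
    denom = [beta * vocab_size + n_t[t] for t in topics]
    for t in topics:  # topic-word bucket
        if n_tw_w[t] > 0:
            u -= (alpha + n_dt_d[t]) * n_tw_w[t] / denom[t]
            if u < 0:
                return t
    for t in topics:  # document-topic bucket
        if n_dt_d[t] > 0:
            u -= n_dt_d[t] * beta / denom[t]
            if u < 0:
                return t
    for t in topics:  # smoothing-only bucket
        u -= alpha * beta / denom[t]
        if u < 0:
            return t
    return len(n_t) - 1  # guard against floating-point round-off
```

The three buckets sum to the same total as the dense weights because $(\alpha + n_{t|d})(\beta + n_{w|t}) = \alpha\beta + n_{t|d}\beta + (\alpha + n_{t|d})\,n_{w|t}$. Checking the topic-word bucket first is a design choice in this sketch: when a word is concentrated in a few topics, most draws end after visiting only those topics.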

4. VARIATIONAL INFERENCE

Another class of approximate inference method widely used in fitting topic models is variational EM. Variational inference involves defining a parametric family of distributions that forms a tractable approximation to an intractable true joint distribution. In the case of LDA, Blei, Ng, and Jordan [3] suggest a factored distribution consisting of a variational Dirichlet distribution $\tilde{\alpha}_d$ for each document and a variational multinomial $\tilde{\gamma}_{di}$ over topics for each word position in the document.
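For a single new document with fixed topic-word probabilities, the factored distribution described here can be fit by coordinate ascent. The sketch below assumes the standard mean-field updates from Blei, Ng, and Jordan: each word position's multinomial is proportional to the word's probability under each topic times the exponentiated digamma of the document's Dirichlet parameter, and the Dirichlet parameter is the prior plus the summed multinomials. The names `infer_document`, `phi`, and `dirichlet` are hypothetical, `phi` is assumed to be strictly positive, and `digamma` is a hand-written approximation used only to keep the example dependency-free.

```python
import math

def digamma(x):
    """Approximate the digamma function via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:  # shift x upward until the series is accurate
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 * inv - series

def infer_document(doc, phi, alpha, iterations=50):
    """Mean-field updates for one document; phi[t][w] is P(word w | topic t)."""
    num_topics = len(phi)
    # Variational Dirichlet parameters for the document, initialized uniformly.
    dirichlet = [alpha + len(doc) / num_topics] * num_topics
    multinomials = []
    for _ in range(iterations):
        exp_psi = [math.exp(digamma(a)) for a in dirichlet]
        new_dirichlet = [alpha] * num_topics
        multinomials = []
        for w in doc:
            # Variational multinomial over topics for this word position.
            weights = [phi[t][w] * exp_psi[t] for t in range(num_topics)]
            total = sum(weights)
            probs = [x / total for x in weights]
            multinomials.append(probs)
            for t in range(num_topics):
                new_dirichlet[t] += probs[t]
        dirichlet = new_dirichlet
    return dirichlet, multinomials
```

For example, `infer_document([0, 0, 0, 1], [[0.9, 0.1], [0.1, 0.9]], alpha=0.1)` returns Dirichlet parameters that sum to the prior mass plus the document length and that favor the first topic, since three of the four tokens are far more probable under it.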

…

References


	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2009 EfficientMethodsforTopicModelIn	Limin Yao David Mimno Andrew McCallum			Efficient Methods for Topic Model Inference on Streaming Document Collections		KDD-2009 Proceedings		10.1145/1557019.1557121		2009