- (Yao et al., 2009) ⇒ Limin Yao, David Mimno, and Andrew McCallum. (2009). “Efficient Methods for Topic Model Inference on Streaming Document Collections.” In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2009). doi:10.1145/1557019.1557121
- Categories and Subject Descriptors: H.4 Information Systems Applications: Miscellaneous
- General Terms: Experimentation, Performance, Design.
Topic models provide a powerful tool for analyzing large text collections by representing high-dimensional data in a low-dimensional subspace. Fitting a topic model given a set of training documents requires approximate inference techniques that are computationally expensive. With today's large-scale, constantly expanding document collections, it is useful to be able to infer topic distributions for new documents without retraining the model. In this paper, we empirically evaluate the performance of several methods for topic inference in previously unseen documents, including methods based on Gibbs sampling, variational inference, and a new method inspired by text classification. The classification-based inference method produces results similar to iterative inference methods, but requires only a single matrix multiplication. In addition to these inference methods, we present SparseLDA, an algorithm and data structure for evaluating Gibbs sampling distributions. Empirical results indicate that SparseLDA can be approximately 20 times faster than traditional LDA and provide twice the speedup of previously published fast sampling methods, while also using substantially less memory.
... Second, since many of the methods we discuss rely on Gibbs sampling to infer topic distributions, we also present a simple method, SparseLDA, for efficient Gibbs sampling in topic models along with a data structure that results in very fast sampling performance with a small memory footprint.
3. SAMPLING-BASED INFERENCE
We evaluate three different sampling-based inference methods for LDA. Gibbs sampling is an MCMC method that involves iterating over a set of variables [math]z_1, z_2, \ldots, z_n[/math], sampling each [math]z_i[/math] from [math]P(z_i \mid z_{\setminus i}, w)[/math]. Each iteration over all variables is referred to as a Gibbs sweep. Given enough iterations, Gibbs sampling for LDA produces samples from the posterior [math]P(z \mid w)[/math]. The three methods we explore differ in the set of variables [math]z[/math] that are sampled, as illustrated in Figure 1, and in which portion of the complete data is used to estimate [math]\Phi[/math].
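A Gibbs sweep of this kind can be sketched as follows. This is a minimal illustration of collapsed Gibbs sampling for LDA using the standard conditional [math](\alpha + n_{t|d})(\beta + n_{w|t})/(\beta V + n_t)[/math]; the function and variable names (`gibbs_sweep`, `n_dt`, `n_wt`, `n_t`) are illustrative, not from the paper:

```python
import numpy as np

def gibbs_sweep(docs, z, n_dt, n_wt, n_t, alpha, beta, rng):
    """One Gibbs sweep: resample every z_i from P(z_i | z_{\\i}, w).

    docs: list of documents, each a list of word ids
    z:    current topic assignment for each token
    n_dt: document-topic counts (D x K), n_wt: word-topic counts (V x K),
    n_t:  total tokens per topic (K,)
    """
    K = n_t.shape[0]
    V = n_wt.shape[0]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # remove the current assignment from the counts
            n_dt[d, t] -= 1; n_wt[w, t] -= 1; n_t[t] -= 1
            # P(z_i = t) proportional to (alpha + n_{t|d})(beta + n_{w|t}) / (beta*V + n_t)
            p = (alpha + n_dt[d]) * (beta + n_wt[w]) / (beta * V + n_t)
            t = rng.choice(K, p=p / p.sum())
            # record the new assignment and restore the counts
            z[d][i] = t
            n_dt[d, t] += 1; n_wt[w, t] += 1; n_t[t] += 1
    return z
```

Note that evaluating `p` touches all K topics for every token; this per-token cost is exactly what SparseLDA (Section 3.4) reduces.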
3.4 Time- and Memory-Efficient Gibbs Sampling for LDA
The efficiency of Gibbs sampling-based inference methods depends almost entirely on how fast we can evaluate the sampling distribution over topics for a given token. We therefore present SparseLDA, our new algorithm and data structure that substantially improves sampling performance.
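The key algebraic observation behind SparseLDA is that the unnormalized sampling mass for a token factors into three buckets: a smoothing-only term that is constant across documents, a document-topic term that is nonzero only for topics present in the current document, and a topic-word term that is nonzero only for topics in which the current word has been observed. A minimal sketch of this decomposition (function and argument names are illustrative, not the paper's):

```python
import numpy as np

def sparse_lda_masses(alpha, beta, V, n_t, n_dt_d, n_wt_w):
    """Split the unnormalized LDA sampling mass over topics into the
    three SparseLDA buckets:
      s: smoothing-only,  alpha * beta / (beta*V + n_t)          (dense, cacheable)
      r: document-topic,  n_{t|d} * beta / (beta*V + n_t)        (sparse in t)
      q: topic-word,      (alpha + n_{t|d}) * n_{w|t} / (beta*V + n_t)  (sparse in t)
    Their sum recovers the usual (alpha + n_{t|d})(beta + n_{w|t}) / (beta*V + n_t).
    """
    denom = beta * V + n_t
    s = alpha * beta / denom
    r = n_dt_d * beta / denom
    q = (alpha + n_dt_d) * n_wt_w / denom
    return s, r, q
```

Because `r` and `q` range only over the topics actually present in the document and word counts, a sampler can draw from the three buckets while inspecting far fewer than K topics in the common case, which is the source of SparseLDA's speedup.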
4. VARIATIONAL INFERENCE
Another class of approximate inference methods widely used in fitting topic models is variational EM. Variational inference involves defining a parametric family of distributions that forms a tractable approximation to an intractable true joint distribution. In the case of LDA, Blei, Ng, and Jordan suggest a factored distribution consisting of a variational Dirichlet distribution [math]\tilde{\gamma}_d[/math] for each document and a variational multinomial [math]\tilde{\phi}_{di}[/math] over topics for each word position in the document.
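The resulting coordinate-ascent updates for a single document can be sketched as below, using the standard mean-field updates from Blei, Ng, and Jordan: [math]\phi_{ik} \propto \beta_{k,w_i} \exp(\Psi(\gamma_k))[/math] and [math]\gamma_k = \alpha + \sum_i \phi_{ik}[/math]. This is a minimal sketch with illustrative names; the numerical digamma is a stdlib-only stand-in for a proper special-function implementation:

```python
import math
import numpy as np

def digamma(x, h=1e-5):
    # numerical digamma via central difference of log-gamma (stdlib only)
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def variational_updates_doc(word_ids, beta, alpha, iters=50):
    """Mean-field updates for one document under LDA:
      phi_{i,k} proportional to beta_{k, w_i} * exp(digamma(gamma_k))
      gamma_k   = alpha + sum_i phi_{i,k}
    beta is a K x V topic-word matrix with rows summing to 1.
    """
    K = beta.shape[0]
    N = len(word_ids)
    gamma = np.full(K, alpha + N / K)  # gamma implied by a uniform phi
    for _ in range(iters):
        expdig = np.exp([digamma(g) for g in gamma])
        phi = beta[:, word_ids].T * expdig      # N x K, unnormalized
        phi /= phi.sum(axis=1, keepdims=True)   # normalize per word position
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi
```

Each word position gets its own multinomial `phi` row, while `gamma` summarizes the document's topic proportions; iterating the two updates is one E-step of variational EM.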