2006 TopicModelingBeyondBoW

Subject Headings: Topic Modeling Algorithm

Notes

Cited By

Quotes

Abstract

  • Some models of textual corpora employ text generation methods involving n-gram statistics, while others use latent topic variables inferred using the "bag-of-words" assumption, in which word order is ignored. Previously, these methods have not been combined. In this work, I explore a hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model. The model hyperparameters are inferred using a Gibbs EM algorithm. On two data sets, each of 150 documents, the new model exhibits better predictive accuracy than either a hierarchical Dirichlet bigram language model or a unigram topic model. Additionally, the inferred topics are less dominated by function words than are topics discovered using unigram statistics, potentially making them more meaningful.
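Below is a minimal, hypothetical Python sketch (not from the paper) of how a collapsed Gibbs sampler for a bigram topic model of this kind might look: each word is drawn from a distribution conditioned jointly on its topic and on the previous word. For simplicity it assumes fixed symmetric Dirichlet priors (alpha, beta) rather than the paper's hierarchical Dirichlet priors, and it omits the Gibbs EM hyperparameter inference entirely; all function names and parameters are illustrative.

# Illustrative collapsed Gibbs sampler for a bigram topic model:
# word w_i is generated from a distribution conditioned on its topic
# z_i AND the previous word w_{i-1}. Fixed symmetric Dirichlet priors
# (alpha, beta) are an assumption; the paper instead uses hierarchical
# priors with hyperparameters inferred via Gibbs EM.
import numpy as np

def bigram_topic_gibbs(docs, V, T, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of lists of word ids in [0, V); T: number of topics."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    START = V                          # extra id marking the document start
    ndt = np.zeros((D, T))             # topic counts per document
    npw = np.zeros((V + 1, T, V))      # counts of word w after prev word p, topic t
    np_t = np.zeros((V + 1, T))        # row sums of npw, for normalization
    z = []
    for d, doc in enumerate(docs):     # random initialization of topic assignments
        zd = rng.integers(T, size=len(doc))
        z.append(zd)
        prev = START
        for i, w in enumerate(doc):
            t = zd[i]
            ndt[d, t] += 1; npw[prev, t, w] += 1; np_t[prev, t] += 1
            prev = w
    for _ in range(iters):
        for d, doc in enumerate(docs):
            prev = START
            for i, w in enumerate(doc):
                t = z[d][i]
                # remove the current assignment from all counts
                ndt[d, t] -= 1; npw[prev, t, w] -= 1; np_t[prev, t] -= 1
                # P(z_i = t | rest) is proportional to
                #   (n_dt + alpha) * (n_{prev,t,w} + beta) / (n_{prev,t} + V*beta)
                p = (ndt[d] + alpha) * (npw[prev, :, w] + beta) / (np_t[prev] + V * beta)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t
                ndt[d, t] += 1; npw[prev, t, w] += 1; np_t[prev, t] += 1
                prev = w
    return z, ndt, npw

if __name__ == "__main__":
    # Tiny synthetic corpus: word ids over a 3-word vocabulary.
    docs = [[0, 1, 2, 1, 0], [2, 2, 1, 0], [1, 0, 0, 2, 2]]
    z, ndt, npw = bigram_topic_gibbs(docs, V=3, T=2, iters=100)
    print("per-document topic counts:\n", ndt)

Note on the update: because the next word depends on the current word but not on the current word's topic, resampling z_i only involves the factor for w_i itself, so each per-token update needs just the document-topic counts and the (previous word, topic, word) counts.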


Author: Hanna M. Wallach
Title: Topic Modeling: Beyond Bag-of-Words
Venue: Proceedings of the 23rd International Conference on Machine Learning (ICML 2006)
URL: http://www.cs.umass.edu/~wallach/publications/wallach06beyond.pdf
DOI: 10.1145/1143844.1143967
Year: 2006