2007 Mixtures of Hierarchical Topics with Pachinko Allocation

From GM-RKB

Subject Headings: Pachinko Allocation Model.

Notes

Cited By

Quotes

Abstract

The four-level pachinko allocation model (PAM) (Li & McCallum, 2006) represents correlations among topics using a DAG structure. It does not, however, represent a nested hierarchy of topics, with some topical word distributions representing the vocabulary that is shared among several more specific topics. This paper presents hierarchical PAM --- an enhancement that explicitly represents a topic hierarchy. This model can be seen as combining the advantages of hLDA's topical hierarchy representation with PAM's ability to mix multiple leaves of the topic hierarchy. Experimental results show improvements in likelihood of held-out documents, as well as mutual information between automatically-discovered topics and human-generated categories such as journals.

1. Introduction

Topic models are an important tool because of their ability to identify latent semantic components in unlabeled text data. Recently, attention has focused on models that are able not only to identify topics but also to discover the organization and cooccurrences of the topics themselves.

In this paper, we focus on discovering topics organized into hierarchies. A hierarchical topical structure is intuitively appealing. Some language is shared by large numbers of documents, while other language may be restricted to a specific subset of a corpus. Within these subsets, there may be further divisions, each with its own characteristic words. We believe that a topic model that takes such structure into account will have two primary advantages over a "flat" topic model. First, explicitly modeling the hierarchical cooccurrence patterns of topics should allow us to learn better, more predictive models. For example, knowing that hockey and baseball are both contained in a more general class "sports" should help to predict what words will be contained in previously unseen documents. Second, a hierarchical topic model should be able to describe the organization of a corpus more accurately than a topic model that does not represent such structure.

A natural representation for a hierarchical topic model is to organize topics into a tree. This approach is taken in the hierarchical LDA model of Blei et al. (2004). In hLDA, each document is assigned to a path through the topic tree, and each word in a given document is assigned to a topic at one of the levels of that path. A tree-structured hierarchical topic model has several limitations. First, it is critically important to identify the correct tree. In order to learn the tree structure, the hLDA model uses a non-parametric nested Chinese restaurant process (NCRP) to provide a prior on tree structures. Second, it is not unusual for documents that are in clearly distinct subsets of a corpus to share a topic. For example, various topics in a professional sports sub-hierarchy and various topics in a computer games sub-hierarchy would both use similar words describing "games," "players," and "points." The only way for sports and computer gaming to share this language would be for both sub-hierarchies to descend from a common parent, which may not be the most appropriate organization for the corpus.
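To make the tree-structured generative process concrete, the following sketch (not from the paper) shows how a single word could be drawn in an hLDA-style model, assuming a fixed three-level path and hypothetical symmetric Dirichlet hyperparameters; the actual hLDA model additionally samples the tree itself from the nCRP prior.

```python
# Minimal sketch of word generation in a tree-structured (hLDA-like) model.
# The tree/path is fixed here; hLDA would draw it from the nCRP prior.
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 1000
depth = 3

# One word distribution per node on the document's path (root, child, leaf).
path_topics = rng.dirichlet(np.full(vocab_size, 0.01), size=depth)

# Per-document distribution over the levels of its path.
level_probs = rng.dirichlet(np.full(depth, 1.0))

def generate_word():
    """Pick a level on the document's path, then a word from that node."""
    level = rng.choice(depth, p=level_probs)
    return rng.choice(vocab_size, p=path_topics[level])

print([generate_word() for _ in range(5)])
```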

Another approach to representing the organization of topics is the pachinko allocation model (PAM) (Li & McCallum, 2006). PAM is a family of generative models in which words are generated by a directed acyclic graph (DAG) consisting of distributions over words and distributions over other nodes. A simple example of the PAM framework, four-level PAM, is described in Li and McCallum (2006). There is a single node at the top of the DAG that defines a distribution over nodes in the second level, which we refer to as super-topics. Each node in the second level defines a distribution over all nodes in the third level, or sub-topics. Each sub-topic maps to a single distribution over the vocabulary. Only the sub-topics, therefore, actually produce words. The super-topics represent clusters of topics that frequently cooccur.
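The following sketch (not from the paper) illustrates the four-level PAM generative structure just described, with hypothetical topic counts and Dirichlet hyperparameters: per document, the root draws a distribution over super-topics and each super-topic draws a distribution over sub-topics, while only sub-topics hold word distributions.

```python
# Minimal sketch of word generation in four-level PAM: root -> super-topic
# -> sub-topic -> word. Only sub-topics carry distributions over the vocabulary.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_super, n_sub = 1000, 5, 20

# Corpus-level word distributions, one per sub-topic.
sub_topic_words = rng.dirichlet(np.full(vocab_size, 0.01), size=n_sub)

# Document-level multinomials: root over super-topics, each super-topic over sub-topics.
root_over_super = rng.dirichlet(np.full(n_super, 1.0))
super_over_sub = rng.dirichlet(np.full(n_sub, 1.0), size=n_super)

def generate_word():
    """Sample a super-topic, then a sub-topic, then a word from the sub-topic."""
    s = rng.choice(n_super, p=root_over_super)
    t = rng.choice(n_sub, p=super_over_sub[s])
    return rng.choice(vocab_size, p=sub_topic_words[t])

print([generate_word() for _ in range(5)])
```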

In this paper, we develop a different member of the PAM family and apply it to the task of hierarchical topic modeling. This model, hierarchical PAM (hPAM), includes multinomials over the vocabulary at each internal node in the DAG. This model addresses the problems outlined above: we no longer have to commit to a single hierarchy, so getting the tree structure exactly right is not as important as in hLDA. Furthermore, "methodological" topics such as one referring to "points" and "players" can be shared between segments of the corpus.
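One way to read this description is sketched below (this is an illustrative reading, not the paper's exact specification): the PAM sketch above is extended so that the root and super-topics also carry word distributions, and a distribution over levels decides whether a word comes from the root, the chosen super-topic, or the sub-topic. All sizes and hyperparameters are hypothetical.

```python
# Minimal sketch of hPAM-style generation: every node in the DAG, not just the
# leaves, has a word distribution; a word is emitted at one level of the chosen path.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_super, n_sub = 1000, 5, 20

# Word distributions for every node in the DAG.
root_words = rng.dirichlet(np.full(vocab_size, 0.01))
super_words = rng.dirichlet(np.full(vocab_size, 0.01), size=n_super)
sub_words = rng.dirichlet(np.full(vocab_size, 0.01), size=n_sub)

# Document-level routing distributions plus a distribution over levels.
root_over_super = rng.dirichlet(np.full(n_super, 1.0))
super_over_sub = rng.dirichlet(np.full(n_sub, 1.0), size=n_super)
level_probs = rng.dirichlet(np.full(3, 1.0))  # root / super-topic / sub-topic

def generate_word():
    """Choose a path through the DAG, then a level on that path, then a word."""
    s = rng.choice(n_super, p=root_over_super)
    t = rng.choice(n_sub, p=super_over_sub[s])
    level = rng.choice(3, p=level_probs)
    node_words = (root_words, super_words[s], sub_words[t])[level]
    return rng.choice(vocab_size, p=node_words)

print([generate_word() for _ in range(5)])
```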

Computer Science provides a good example of the benefits of the hPAM model. Consider three subfields of Computer Science: Natural Language Processing, Machine Learning, and Computer Vision. All three can be considered part of Artificial Intelligence. Vision and NLP both use ML extensively, but all three subfields also appear independently. In order to represent ML as a single topic in a tree-structured model, NLP and Vision must both be children of ML; otherwise words about Machine Learning must be spread between an NLP topic, a Vision topic, and an ML-only topic. In contrast, hPAM allows higher-level topics to share lower-level topics. For this work we use a fixed number of topics, although it is possible to use nonparametric priors over the number of topics.

We evaluate hPAM, hLDA and LDA based on the criteria mentioned earlier. We measure the ability of a topic model to predict unseen documents based on the empirical likelihood of held-out data given simulations drawn from the generative process of each model. We measure the ability of a model to describe the hierarchical structure of a corpus by calculating the mutual information between topics and human-generated categories such as journals. We find a 1.1% increase in empirical log likelihood for hPAM over hLDA and a five-fold increase in super-topic / journal mutual information.
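The mutual-information criterion can be computed directly from the joint counts of topic assignments and category labels. The sketch below (not the paper's evaluation code; document labels and topic assignments are made up) shows one such computation.

```python
# Minimal sketch: mutual information (in nats) between a model's dominant-topic
# assignments and human-generated category labels such as journals.
import numpy as np

def mutual_information(x_labels, y_labels):
    """MI between two discrete labelings of the same documents."""
    x_vals, x = np.unique(x_labels, return_inverse=True)
    y_vals, y = np.unique(y_labels, return_inverse=True)
    joint = np.zeros((len(x_vals), len(y_vals)))
    for i, j in zip(x, y):
        joint[i, j] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

topics = ["t1", "t2", "t1", "t3", "t2", "t1"]               # dominant topic per document
journals = ["JMLR", "JMLR", "CVPR", "ACL", "CVPR", "JMLR"]  # category per document
print(mutual_information(topics, journals))
```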

References


David Mimno, Andrew McCallum, and Wei Li (2007). "Mixtures of Hierarchical Topics with Pachinko Allocation." In: Proceedings of the 24th International Conference on Machine Learning (ICML 2007). doi:10.1145/1273496.1273576