2010 LearningAuthorTopicModelsfromTextCorpora


Subject Headings:

Notes

Cited By

Quotes

Author Keywords

Topic models, Gibbs sampling, unsupervised learning, author models, perplexity.

Abstract

We propose an unsupervised learning technique for extracting information about authors and topics from large text collections. We model documents as if they were generated by a two-stage stochastic process. An author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words. The probability distribution over topics in a multi-author paper is a mixture of the distributions associated with the authors. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to three large text corpora: 150,000 abstracts from the CiteSeer digital library, 1740 papers from the Neural Information Processing Systems (NIPS) Conferences, and 121,000 emails from the Enron corporation. We discuss in detail the interpretation of the results discovered by the system, including specific topic and author models, ranking of authors by topic and topics by author, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. Experiments based on perplexity scores for test documents and precision-recall for document retrieval are used to illustrate systematic differences between the proposed author-topic model and a number of alternatives. Extensions to the model, allowing, for example, generalizations of the notion of an author, are also briefly discussed.

1. INTRODUCTION

With the advent of the Web and specialized digital text collections, automated extraction of useful information from text has become an increasingly important research area in information retrieval, statistical natural language processing, and machine learning. Applications include document annotation, database organization, query answering, and automated summarization of text collections. Statistical approaches based upon generative models have proven effective in addressing these problems, providing efficient methods for extracting structured representations from large document collections.

In this article we describe a generative model for document collections, the author-topic (AT) model, which simultaneously models the content of documents and the interests of authors. This generative model represents each document as a mixture of probabilistic topics, in a manner similar to Latent Dirichlet Allocation (Blei et al. 2003). It extends previous probabilistic topic models to author modeling by allowing the mixture weights for different topics to be determined by the authors of the document. By learning the parameters of the model, we obtain the set of topics that appear in a corpus and their relevance to different documents, and we identify which topics are used by which authors. Figure 1 shows an example of several such topics with associated authors and words, as obtained from a single sample of the Gibbs sampler trained on a collection of papers from the annual Neural Information Processing Systems (NIPS) conferences (these will be discussed in more detail later in the article). Both the words and the authors associated with each topic are quite focused and reflect a variety of specific research areas associated with the NIPS conference. The model used in Figure 1 also produces a topic distribution for each author; Figure 2 shows the likely topics for a set of well-known NIPS authors from this model. By modeling the interests of authors, we can answer a range of queries about the content of document collections, including, for example, which subjects an author writes about, which authors are likely to have written documents similar to an observed document, and which authors produce similar work.

Fig. 1. Eight examples of topics (out of 100 topics in total) from a model fit to NIPS papers from 1987 to 1999 — shown are the 10 most likely words and 10 most likely authors per topic.

The generative model at the heart of our approach is based upon the idea that a document can be represented as a mixture of topics. This idea has motivated several different approaches in machine learning and statistical natural language processing (Hofmann 1999; Blei et al. 2003; Minka and Lafferty 2002; Griffiths and Steyvers 2004; Buntine and Jakulin 2004). Topic models have three major advantages over other approaches to document modeling: the topics are extracted in a completely unsupervised fashion, requiring no document labels and no special initialization; each topic is individually interpretable, providing a representation that can be understood by the user; and each document can express multiple topics, capturing the topic combinations that arise in text documents.

Fig. 2. Selected authors from the NIPS corpus, and four high-probability topics for each author from the author-topic model.

Topics unrelated to technical content (e.g., topics containing words such as results, methods, and experiments) were excluded.

Supervised learning techniques for automated categorization of documents into known classes or topics have received considerable attention in recent years (e.g., Yang (1999)). However, unsupervised methods are often necessary for addressing the challenges of modeling large document collections. For many document collections, neither predefined topics nor labeled documents may be available. Furthermore, there is considerable motivation to uncover hidden topic structure in large corpora, particularly in rapidly changing fields such as computer science and biology, where predefined topic categories may not reflect dynamically evolving content.

Topic models provide an unsupervised method for extracting an interpretable representation from a collection of documents. Prior work on automatic extraction of representations from text has used a number of different approaches. One approach, within the general “bag of words” framework, is to represent high-dimensional term vectors in a lower-dimensional space. Local regions in the lower-dimensional space can then be associated with specific topics. For example, the WEBSOM system [Lagus et al. 1999] uses nonlinear dimensionality reduction via self-organizing maps to represent term vectors in a two-dimensional layout. Linear projection techniques, such as latent semantic indexing (LSI), are also widely used (e.g., Berry et al. [1994]). Deerwester et al. [1990], while not using the term “topics” per se, state: “In various problems, we have approximated the original term-document matrix using 50–100 orthogonal factors or derived dimensions. Roughly speaking, these factors may be thought of as artificial concepts; they represent extracted common meaning components of many different words and documents.” A well-known drawback of the LSI approach is that the resulting representation is often hard to interpret. The derived dimensions indicate axes of a space, but there is no guarantee that such dimensions will make sense to the user of the method. Another limitation of LSI is that it implicitly assumes a Gaussian (squared-error) noise model for the word-count data, which can lead to implausible results such as predictions of negative counts.
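As a concrete illustration of the linear-projection idea behind LSI, the following minimal sketch (our own illustration, not from the article; it assumes scikit-learn is available and uses a tiny toy corpus) approximates a term-document count matrix with a small number of orthogonal factors via truncated SVD:

```python
# Minimal LSI sketch (illustrative only; not from the article).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "neural networks learn distributed representations",
    "bayesian inference with markov chain monte carlo",
    "topic models extract semantic structure from text",
]

# Build the document-term count matrix.
X = CountVectorizer().fit_transform(docs)

# Approximate it with 2 orthogonal factors ("derived dimensions").
svd = TruncatedSVD(n_components=2, random_state=0)
doc_coords = svd.fit_transform(X)  # each document as a point in factor space
print(doc_coords)
```

Each document becomes a point in the factor space, but, as noted above, the factor axes themselves carry no guaranteed interpretation.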

A different approach to unsupervised topic extraction relies on clustering documents into groups containing presumably similar semantic content. A variety of well-known document clustering techniques have been used for this purpose (e.g., Cutting et al. [1992]; McCallum et al. [2000]; Popescul et al. [2000]; Dhillon and Modha [2001]). Each cluster of documents can then be associated with a latent topic, as represented, for example, by the mean term vector of the documents in the cluster. While clustering can provide useful broad information about topics, clusters are inherently limited by the fact that each document is typically associated with only one cluster. This is often at odds with the multi-topic nature of text documents in many contexts: combinations of diverse topics within a single document are difficult to represent. For example, the present article contains at least two significantly different topics: document modeling and Bayesian estimation. For this reason, representations that allow documents to be composed of multiple topics generally provide better models for sets of documents (e.g., better out-of-sample predictions; Blei et al. [2003]).

There are several generative models for document collections that model individual documents as mixtures of topics. Hofmann (1999) introduced the aspect model (also referred to as probabilistic LSI, or pLSI) as a probabilistic alternative to projection and clustering methods. In pLSI, topics are modeled as multinomial probability distributions over words, and documents are assumed to be generated by the activation of multiple topics. While the pLSI model produced impressive results on a number of text document problems such as information retrieval, its parameterization was susceptible to overfitting and did not provide a straightforward way to make inferences about documents not seen in the training data. Blei et al. [2003] addressed these limitations by proposing a more general Bayesian probabilistic topic model called latent Dirichlet allocation (LDA). The parameters of the LDA model (the topic-word and document-topic distributions) are estimated using an approximate inference technique known as variational EM, since exact inference is intractable. Griffiths and Steyvers [2004] further showed how Gibbs sampling, a Markov chain Monte Carlo technique, could be applied to parameter estimation for this model on relatively large data sets.
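To make the Gibbs sampling approach concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA, in the spirit of Griffiths and Steyvers [2004]. This is our own illustration rather than code from the article; alpha and beta are the usual symmetric Dirichlet smoothing hyperparameters:

```python
import numpy as np

def lda_gibbs(docs, W, T, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on `docs`, a list of word-id lists.

    W: vocabulary size, T: number of topics. Returns topic assignments
    and the count matrices from which phi/theta can be estimated.
    """
    rng = np.random.default_rng(seed)
    ndt = np.zeros((len(docs), T))  # document-topic counts
    ntw = np.zeros((T, W))          # topic-word counts
    nt = np.zeros(T)                # total words assigned to each topic
    z = [rng.integers(T, size=len(d)) for d in docs]  # random initialization
    for d, doc in enumerate(docs):  # accumulate initial counts
        for i, w in enumerate(doc):
            ndt[d, z[d][i]] += 1; ntw[z[d][i], w] += 1; nt[z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]  # remove the current assignment from the counts
                ndt[d, t] -= 1; ntw[t, w] -= 1; nt[t] -= 1
                # P(z = t | rest) is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + W * beta)
                p = (ndt[d] + alpha) * (ntw[:, w] + beta) / (nt + W * beta)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t  # record and count the new assignment
                ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1
    return z, ndt, ntw
```

The topic-word and document-topic distributions can then be estimated from the final count matrices, for example [math]\displaystyle{ \hat{\phi}_{tw} \propto n_{tw} + \beta }[/math].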

More recent research on topic models in information retrieval has focused on including additional sources of information to constrain the learned topics. For example, Cohn and Hofmann (2001) proposed an extension of pLSI to model both document content and the citations or hyperlinks between documents. Similarly, Erosheva et al. [2004] extended the LDA model to cover both text and citations, and applied their model to scientific papers from the Proceedings of the National Academy of Sciences. Other recent work on topic models has focused on incorporating correlations among topics [Blei and Lafferty 2006a; Li and McCallum 2006], incorporating time-dependent topics [Blei and Lafferty 2006b], and incorporating context [Mei and Zhai 2006].

Our aim in this article is to extend probabilistic topic models to include authorship information. Joint author-topic modeling has received little or no attention as far as we are aware. The areas of stylometry, authorship attribution, and forensic linguistics focus on the related but different problem of identifying which author (among a set of possible authors) wrote a particular piece of text [Holmes 1998]. For example, Mosteller and Wallace [1964] used Bayesian techniques to infer whether Hamilton or Madison was the more likely author of disputed Federalist papers. More recent work of a similar nature includes authorship analysis of a purported poem by Shakespeare [Thisted and Efron 1987], identifying authors of software [Gray et al. 1997], and the use of techniques such as neural networks [Kjell 1994] and support vector machines [Diederich et al. 2003] for author identification.

These author identification methods emphasize the use of distinctive stylistic features, such as sentence length, that characterize a specific author. In contrast, the models we present here focus on extracting the general semantic content of a document rather than the stylistic details of how it was written. For example, in our model we omit common “stop” words, since they are generally irrelevant to the topic of the document; in stylometry, by contrast, the distributions of stop words can be quite informative. While topic information could usefully be combined with stylistic features for author classification, we do not pursue this idea in this article.

Graph-based and network-based models are also frequently used as a basis for representation and analysis of relations among scientific authors. For example, McCain [1990], Newman (2001), Mutschke (2003), and Erten et al. (2003) use a variety of methods from bibliometrics, social networks, and graph theory to analyze and visualize coauthor and citation relations in the scientific literature. Kautz et al. (1997) developed the interactive ReferralWeb system for exploring networks of computer scientists working in artificial intelligence and information retrieval, and White and Smyth (2003) used PageRank-style ranking algorithms to analyze coauthor graphs. In all of this work only the network connectivity information is used; the text information from the underlying documents is not used in modeling. Thus, while the grouping of authors via these network models can implicitly provide indications of latent topics, there is no explicit representation of the topics in terms of the content (the words) of the documents.

This article presents a novel probabilistic model that represents both authors and topics. It extends the previous work introduced by the authors in Steyvers et al. [2004] and Rosen-Zvi et al. [2004] by providing a systematic comparison of the author-topic model to other existing models, showing how the author-topic model can be applied as an extension of the LDA model through the use of fictitious authors, and illustrating a number of applications of the model.

The outline of the article is as follows: Section 2 describes the author-topic model, and Section 3 outlines how the parameters of the model (the topic-word distributions and author-topic distributions) can be learned from training data consisting of documents with known authors. Section 4 discusses the application of the model to three different document collections: papers from the NIPS conferences, abstracts from the CiteSeer collection, and emails from Enron. The section includes a general discussion of convergence and stability in learning, and examples of specific topics and specific author models that are learned by the algorithm. In Section 5 we describe illustrative applications of the model, including detecting unusual papers for selected authors and detecting which parts of a text were written by different authors. Section 6 compares and contrasts the proposed author-topic model with a number of related models, including the LDA model, a simple author model (with no topics), and a model allowing fictitious authors, and includes experiments quantifying test-set perplexity and precision-recall for different models. Section 7 contains a brief discussion and concluding comments.

2. THE AUTHOR-TOPIC (AT) MODEL

In this section we introduce the author-topic model. The author-topic model belongs to a family of generative models for text in which words are viewed as discrete random variables, a document contains a fixed number of words, and each word takes one value from a predefined vocabulary. We will use integers to denote the entries in the vocabulary, with each word [math]\displaystyle{ w }[/math] taking a value from [math]\displaystyle{ 1, \ldots, W }[/math], where [math]\displaystyle{ W }[/math] is the number of unique words in the vocabulary. A document [math]\displaystyle{ d }[/math] is represented as a vector of words, [math]\displaystyle{ \mathbf{w}_d }[/math], with [math]\displaystyle{ N_d }[/math] entries. A corpus with [math]\displaystyle{ D }[/math] documents is represented as a concatenation of the document vectors, which we will denote [math]\displaystyle{ \mathbf{w} }[/math], having [math]\displaystyle{ N = \sum^D_{d=1} N_d }[/math] entries. In addition to these words, we have information about the authors of each document. We define [math]\displaystyle{ \mathbf{a}_d }[/math] to be the set of authors of document [math]\displaystyle{ d }[/math]; it consists of integers from [math]\displaystyle{ 1, \ldots, A }[/math], where [math]\displaystyle{ A }[/math] is the number of authors who generated the documents in the corpus. [math]\displaystyle{ A_d }[/math] will be used to denote the number of authors of document [math]\displaystyle{ d }[/math].
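The two-stage generative process summarized in the abstract (an author is chosen for each word, the author chooses a topic, and the topic generates the word) can be expressed as the following minimal sketch. This is our own illustration, not code from the article; theta and phi stand for the author-topic and topic-word distributions that the model section goes on to define formally:

```python
import numpy as np

def generate_document(authors, N_d, theta, phi, rng):
    """Generate one document under the author-topic model (illustrative sketch).

    authors: the set a_d of author ids for this document
    N_d:     number of words to generate
    theta:   A x T matrix; theta[a] is author a's distribution over topics
    phi:     T x W matrix; phi[t] is topic t's distribution over words
    """
    words = []
    for _ in range(N_d):
        x = rng.choice(authors)                  # pick an author uniformly from a_d
        t = rng.choice(len(phi), p=theta[x])     # that author picks a topic
        w = rng.choice(phi.shape[1], p=phi[t])   # the topic emits a word
        words.append(w)
    return words

# Tiny example: A = 2 authors, T = 2 topics, W = 5 vocabulary words.
rng = np.random.default_rng(0)
theta = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
phi = np.array([[0.40, 0.30, 0.10, 0.10, 0.10],
                [0.05, 0.05, 0.30, 0.30, 0.30]])
print(generate_document(authors=[0, 1], N_d=10, theta=theta, phi=phi, rng=rng))
```

A multi-author document thus mixes the topic distributions of its authors, which is exactly the property the learning algorithm in the following sections exploits.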

References

Thomas L. Griffiths, Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, and Chaitanya Chemudugunta (2010). “Learning Author-topic Models from Text Corpora.” ACM Transactions on Information Systems. doi:10.1145/1658377.1658381.