2011 CollaborativeTopicModelingforRecommendingScientificArticles

From GM-RKB

Subject Headings: Scientific Paper Recommendation, Collaborative Topic Regression.

Notes

Cited By

Quotes

Author Keywords

Abstract

Researchers have access to large online archives of scientific articles. As a consequence, finding relevant papers has become more difficult. Newly formed online communities of researchers sharing citations provide a new way to solve this problem. In this paper, we develop an algorithm to recommend scientific articles to users of an online community. Our approach combines the merits of traditional collaborative filtering and probabilistic topic modeling. It provides an interpretable latent structure for users and items, and can form recommendations about both existing and newly published articles. We study a large subset of data from CiteULike, a bibliography sharing service, and show that our algorithm provides a more effective recommender system than traditional collaborative filtering.

1. INTRODUCTION

Modern researchers have access to large archives of scientific articles. These archives are growing as new articles are placed online and old articles are scanned and indexed. While this growth has allowed researchers to quickly access more scientific information, it has also made it more difficult for them to find articles relevant to their interests. Modern researchers need new tools for managing what is available to them.

Historically, one way that researchers find articles is by following citations in other articles that they are interested in. This is an effective practice — and one that we should continue — but it limits researchers to specific citation communities, and it is biased towards heavily cited papers. A statistician may miss a relevant paper in economics or biology because the two literatures rarely cite each other; and she may miss a relevant paper in statistics because it was also missed by the authors of the papers that she has read. One of the opportunities of online archives is to inform researchers about literature that they might not be aware of.

A complementary method of finding articles is keyword search. This is a powerful approach, but it is also limited. Forming queries for finding new scientific articles can be difficult as a researcher may not know what to look for; search is mainly based on content, while good articles are also those that many others found valuable; and search is only good for directed exploration, while many researchers would also like a “feed” of new and interesting articles.

Recently, websites like CiteULike¹ and Mendeley² allow researchers to create their own reference libraries for the articles they are interested in and share them with other researchers. This has opened the door to using recommendation methods [13] as a third way to help researchers find interesting articles. In this paper, we develop an algorithm for recommending scientific articles to users of online archives. Each user has a library of articles that he or she is interested in, and our goal is to match each user to articles of interest that are not in his or her library.

We have several criteria for an algorithm to recommend scientific articles. First, recommending older articles is important. Users of scientific archives are interested in older articles for learning about new fields and understanding the foundations of their fields. When recommending old articles, the opinions of other users play a role. A foundational article will be in many users' libraries; a less important article will be in few.

Second, recommending new articles is also important. For example, when a conference publishes its proceedings, users would like to see recommendations from these new articles to keep up with the state of the art in their discipline. Since the articles are new, there is little information about which or how many other users placed the articles in their libraries, and thus traditional collaborative filtering methods have difficulty making recommendations. With new articles, a recommendation system must use their content.

Finally, exploratory variables can be valuable in online scientific archives and communities. For example, we can summarize and describe each user's preference profile based on the content of the articles that he or she likes. This lets us connect similar users to enhance the community, and indicate why we are connecting them. Further, we can describe articles in terms of what kinds of users like them. For example, we might detect that a machine learning article is of strong interest to computer vision researchers.

¹ http://www.citeulike.org

² http://www.mendeley.com

If enough researchers use such services, these variables might also give an alternative measure of the impact of an article within a field.

With these criteria in mind, we develop a machine learning algorithm for recommending scientific articles to users in an online scientific community. Our algorithm uses two types of data — the other users' libraries and the content of the articles — to form its recommendations. For each user, our algorithm can find both older papers that are important to other similar users and newly written papers whose content reflects the user's specific interests. Finally, our algorithm gives interpretable representations of users and articles.

Our approach combines ideas from collaborative filtering based on latent factor models [17, 18, 13, 1, 22] and content analysis based on probabilistic topic modeling [7, 8, 20, 2]. Like latent factor models, our algorithm uses information from other users' libraries. For a particular user, it can recommend articles from other users who liked similar articles. Latent factor models work well for recommending known articles, but cannot generalize to previously unseen articles.

To generalize to unseen articles, our algorithm uses topic modeling. Topic modeling provides a representation of the articles in terms of latent themes discovered from the collection. When used in our recommender system, this component can recommend articles that have similar content to other articles that a user likes. The topic representation of articles allows the algorithm to make meaningful recommendations about articles before anyone has rated them. We combine these approaches in a probabilistic model, where making a recommendation for a particular user is akin to computing a conditional expectation of hidden variables. We will show how the algorithm for computing these expectations naturally balances the influence of the content of the articles and the libraries of the other users. An article that has not been seen by many will be recommended based more on its content; an article that has been widely seen will be recommended based more on the other users.

We studied our algorithm with data from CiteULike: 5,551 users, 16,980 articles, and 204,986 bibliography entries. We will demonstrate that combining content-based and collaborative-based methods works well for recommending scientific articles. Our method provides better performance than matrix factorization methods alone, indicating that content can improve recommendation systems. Further, while traditional collaborative filtering cannot suggest articles before anyone has rated them, our method can use the content of new articles to make predictions about who will like them.

2. BACKGROUND

We first give some background. We describe two types of recommendation problems we address; we describe the classical matrix factorization solution to recommendation; and we review latent Dirichlet allocation (LDA) for topic modeling of text corpora.

2.1 Recommendation Tasks

The two elements in a recommender system are users and items. In our problem, items are scientific articles and users are researchers. We will assume [math]\displaystyle{ I }[/math] users and [math]\displaystyle{ J }[/math] items. The rating variable [math]\displaystyle{ r_{ij} \in \{0, 1\} }[/math] denotes whether user [math]\displaystyle{ i }[/math] includes article [math]\displaystyle{ j }[/math] in her library [12]. If it is in the library, this means that user [math]\displaystyle{ i }[/math] is interested in article [math]\displaystyle{ j }[/math]. (This differs from some other systems where users explicitly rate items on a scale.) Note that [math]\displaystyle{ r_{ij} = 0 }[/math] can be interpreted in two ways. One way is that user [math]\displaystyle{ i }[/math] is not interested in article [math]\displaystyle{ j }[/math]; the other is that user [math]\displaystyle{ i }[/math] does not know about article [math]\displaystyle{ j }[/math].

For each user, our task is to recommend articles that are not in her library but are potentially interesting. There are two types of recommendation: in-matrix prediction and out-of-matrix prediction. Figure 1 illustrates the idea.

Figure 1: Illustration of the two tasks for scientific article recommendation systems, where ✓ indicates “like,” ✗ “dislike,” and ? “unknown.”

In-matrix prediction. Figure 1 (a) illustrates in-matrix prediction. This refers to the problem of making recommendations about those articles that have been rated by at least one user in the system. This is the task that traditional collaborative filtering can address.

Out-of-matrix prediction. Figure 1 (b) illustrates out-of-matrix prediction, where articles 4 and 5 have never been rated. (This is sometimes called “cold start recommendation.”) Traditional collaborative filtering algorithms cannot make predictions about these articles because those algorithms only use information about other users' ratings. This task is important for online scientific archives, however, because users want to see new articles in their fields. A recommender system that cannot handle out-of-matrix prediction cannot recommend newly published papers to its users.

2.2 Recommendation by Matrix Factorization

The traditional approach to recommendation is collaborative filtering (CF), where items are recommended to a user based on other users with similar patterns of selected items. (Note that collaborative filtering does not use the content of the items.) Most successful recommendation methods are latent factor models [17, 18, 13, 1, 22], which provide better recommendation results than the neighborhood methods [11, 13]. In this paper, we focus on latent factor models.

Among latent factor methods, matrix factorization performs well [13]. In matrix factorization, we represent users and items in a shared latent low-dimensional space of dimension [math]\displaystyle{ K }[/math]: user [math]\displaystyle{ i }[/math] is represented by a latent vector [math]\displaystyle{ u_i \in \mathbb{R}^K }[/math] and item [math]\displaystyle{ j }[/math] by a latent vector [math]\displaystyle{ v_j \in \mathbb{R}^K }[/math]. We form the prediction of whether user [math]\displaystyle{ i }[/math] will like item [math]\displaystyle{ j }[/math] with the inner product between their latent representations,

[math]\displaystyle{ \hat{r}_{ij} = u^T_iv_j. \ (1) }[/math]

Biases for different users and items can also be incorporated [13]. To use matrix factorization, we must compute the latent representations of the users and items given an observed matrix of ratings. The common approach is to minimize the regularized squared error loss with respect to [math]\displaystyle{ U = (u_i)_{i=1}^I }[/math] and [math]\displaystyle{ V = (v_j)_{j=1}^J }[/math],

[math]\displaystyle{ \min_{U,V} \sum_{i,j} (r_{ij} - u_i^T v_j)^2 + \lambda_u \lVert u_i \rVert^2 + \lambda_v \lVert v_j \rVert^2, \ (2) }[/math]

where [math]\displaystyle{ \lambda_u }[/math] and [math]\displaystyle{ \lambda_v }[/math] are regularization parameters.

This matrix factorization for collaborative filtering can be generalized as a probabilistic model [18]. In probabilistic matrix factorization (PMF), we assume the following generative process:

1. For each user [math]\displaystyle{ i }[/math], draw user latent vector [math]\displaystyle{ u_i \sim \mathcal{N}(0, \lambda_u^{-1} I_K) }[/math].
2. For each item [math]\displaystyle{ j }[/math], draw item latent vector [math]\displaystyle{ v_j \sim \mathcal{N}(0, \lambda_v^{-1} I_K) }[/math].
3. For each user-item pair [math]\displaystyle{ (i, j) }[/math], draw the response

[math]\displaystyle{ r_{ij} \sim \mathcal{N}(u_i^T v_j, c_{ij}^{-1}), \ (3) }[/math]

where [math]\displaystyle{ c_{ij} }[/math] is the precision parameter for [math]\displaystyle{ r_{ij} }[/math]. (Note that [math]\displaystyle{ I_K }[/math] is a K-dimensional identity matrix.) This is the interpretation of matrix factorization that we will build on. When [math]\displaystyle{ c_{ij} = 1 }[/math] for all [math]\displaystyle{ i, j }[/math], the maximum a posteriori (MAP) estimate of PMF corresponds to the solution of Eq. 2. Here, the precision parameter [math]\displaystyle{ c_{ij} }[/math] serves as a confidence parameter for rating [math]\displaystyle{ r_{ij} }[/math]. If [math]\displaystyle{ c_{ij} }[/math] is large, we trust [math]\displaystyle{ r_{ij} }[/math] more. As we mentioned above, [math]\displaystyle{ r_{ij} = 0 }[/math] can be interpreted in two ways — the user [math]\displaystyle{ i }[/math] is either not interested in item [math]\displaystyle{ j }[/math] or is unaware of it. This is thus a “one-class collaborative filtering problem,” similar to the TV program and news article recommendation problems studied in [12] and [16]. In that work, the authors introduce different confidence parameters [math]\displaystyle{ c_{ij} }[/math] for different ratings [math]\displaystyle{ r_{ij} }[/math]. We will use the same strategy, setting [math]\displaystyle{ c_{ij} }[/math] to a higher value when [math]\displaystyle{ r_{ij} = 1 }[/math] than when [math]\displaystyle{ r_{ij} = 0 }[/math]:

[math]\displaystyle{ c_{ij} = \begin{cases} a, & \text{if } r_{ij} = 1, \\ b, & \text{if } r_{ij} = 0, \end{cases} \ (4) }[/math]

where a and b are tuning parameters satisfying a > b > 0. We fit a CF model by finding a locally optimal solution of the user variables U and item variables V , usually with an iterative algorithm [12]. We then use Eq. 1 to predict the ratings of the articles outside of each user’s library.
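To make Eqs. 1 and 4 concrete, here is a minimal sketch in Python/numpy. The toy rating matrix and the factor values are illustrative assumptions, not the paper's implementation; the sketch simply builds the confidence weights and ranks one user's unseen articles by predicted rating.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 100, 500, 10
R = (rng.random((I, J)) < 0.02).astype(float)  # toy binary rating matrix (hypothetical)
U = rng.normal(size=(I, K))   # stand-in for fitted user vectors u_i
V = rng.normal(size=(J, K))   # stand-in for fitted item vectors v_j

a, b = 1.0, 0.01
C = np.where(R > 0, a, b)     # confidence c_ij (Eq. 4): trust observed 1's more than 0's

i = 0
scores = U[i] @ V.T           # predicted ratings r_hat_ij = u_i^T v_j (Eq. 1)
unseen = np.where(R[i] == 0)[0]
top10 = unseen[np.argsort(-scores[unseen])[:10]]  # recommend the top unseen articles
```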

There are two main disadvantages to matrix factorization for recommendation. First, the learnt latent space is not easy to interpret; second, as mentioned, matrix factorization only uses information from other users — it cannot generalize to completely unrated items.

2.3 Probabilistic Topic Models

Topic modeling algorithms [5] are used to discover a set of “topics” from a large collection of documents, where a topic is a distribution over terms that is biased toward those associated with a single theme. Topic models provide an interpretable low-dimensional representation of the documents [8]. They have been used for tasks like corpus exploration, document classification, and information retrieval. Here we will exploit the discovered topic structure for recommendation.

The simplest topic model is latent Dirichlet allocation (LDA) [7]. Assume there are [math]\displaystyle{ K }[/math] topics [math]\displaystyle{ \beta = \beta_{1:K} }[/math], each of which is a distribution over a fixed vocabulary. The generative process of LDA is as follows. For each article [math]\displaystyle{ w_j }[/math] in the corpus,

1. Draw topic proportions [math]\displaystyle{ \theta_j \sim \text{Dirichlet}(\alpha) }[/math].
2. For each word [math]\displaystyle{ n }[/math],
(a) Draw topic assignment [math]\displaystyle{ z_{jn} \sim \text{Mult}(\theta_j) }[/math].
(b) Draw word [math]\displaystyle{ w_{jn} \sim \text{Mult}(\beta_{z_{jn}}) }[/math].

This process reveals how the words of each document are assumed to come from a mixture of topics: the topic proportions are document-specific, but the set of topics is shared by the corpus.
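As a concrete illustration, the following sketch samples one synthetic article from the generative process above (Python/numpy; the topic matrix and Dirichlet parameter are placeholder inputs that would normally be learned or set by hand):

```python
import numpy as np

def lda_generate(beta, alpha, n_words, seed=0):
    """Sample one article from LDA: beta is a K x W topic-word matrix,
    alpha a length-K Dirichlet parameter, n_words the article length."""
    rng = np.random.default_rng(seed)
    K, W = beta.shape
    theta = rng.dirichlet(alpha)                # 1. topic proportions theta_j
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)              # 2(a). topic assignment z_jn
        words.append(rng.choice(W, p=beta[z]))  # 2(b). word w_jn from topic z_jn
    return theta, words

# Example: 3 toy topics over a 6-word vocabulary.
beta = np.array([[.5, .5, 0, 0, 0, 0],
                 [0, 0, .5, .5, 0, 0],
                 [0, 0, 0, 0, .5, .5]])
theta, words = lda_generate(beta, alpha=np.ones(3), n_words=20)
```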

Given a collection, the posterior distribution (or maximum likelihood estimate) of the topics reveals the K topics that likely generated its documents. Unlike a clustering model, where each document is assigned to one cluster, LDA allows documents to exhibit multiple topics. For example, LDA can capture that one article might be about biology and statistics, while another might be about biology and physics. Since LDA is unsupervised, the themes of “physics,” “biology,” and “statistics” can be discovered from the corpus; the mixed-membership assumptions lead to sharper estimates of word co-occurrence patterns.

Figure 2: The graphical model for the CTR model.

Given a corpus of documents, we can use variational EM to learn the topics and decompose the documents according to them [7]. Further, given a new document, we can use variational inference to situate its content in terms of the topics. Our goal is to use topic modeling to give a content-based representation of items in a recommender system.

3. COLLABORATIVE TOPIC REGRESSION

In this section, we describe the collaborative topic regression (CTR) model. CTR combines traditional collaborative filtering with topic modeling. A first approach to combining collaborative filtering and topic modeling is to fit a model that uses the latent topic space to explain both the observed ratings and the observed words. For example, we can use the topic proportions [math]\displaystyle{ \theta_j }[/math] in place of the item latent vector [math]\displaystyle{ v_j }[/math] in Eq. 3,

[math]\displaystyle{ r_{ij} \sim \mathcal{N}(u_i^T \theta_j, c_{ij}^{-1}). \ (5) }[/math]

(We note that [19] proposed a similar approach to Eq. 5, but based on correlated topic models [4]. It showed modest improvement over matrix factorization on several movie recommendation datasets.) This model suffers from the limitation that it cannot distinguish topics for explaining recommendations from topics important for explaining content. Consider two articles A and B that are both about machine learning applied to social networks. They are similar and, therefore, have similar topic proportions �A and �B. Now further suppose that these articles are interesting to different kinds of users: Article A might give an interesting machine learning algorithm that is applied to social network applications; article B uses standard machine learning techniques, but gives an important piece of data analysis on social network data.

Users that work in machine learning will prefer article A and rarely consider article B; users that work in social networks will prefer the opposite. However, using the topic proportions as in Eq. 5 will be likely to make similar recommendations for both articles to both types of users. Collaborative topic regression can detect this difference — that one type of user likes the first article and another type likes the second.

As above, collaborative topic regression (CTR) represents users with topic interests and assumes that documents are generated by a topic model. CTR additionally includes a latent variable �j that offsets the topic proportions �j when modeling the user ratings. As more users rate articles, we have a better idea of what this offset is. This offset variable can explain, for example, that article A is more interesting to machine learning researchers than it is to social network analysis researchers. How much of the prediction relies on content and how much it relies on other users depends on how many users have rated the article.

Figure 2 shows the graphical model. Again, assume there are [math]\displaystyle{ K }[/math] topics [math]\displaystyle{ \beta = \beta_{1:K} }[/math]. The generative process of CTR is as follows:

1. For each user [math]\displaystyle{ i }[/math], draw user latent vector [math]\displaystyle{ u_i \sim \mathcal{N}(0, \lambda_u^{-1} I_K) }[/math].
2. For each item [math]\displaystyle{ j }[/math],
(a) Draw topic proportions [math]\displaystyle{ \theta_j \sim \text{Dirichlet}(\alpha) }[/math].
(b) Draw item latent offset [math]\displaystyle{ \epsilon_j \sim \mathcal{N}(0, \lambda_v^{-1} I_K) }[/math] and set the item latent vector as [math]\displaystyle{ v_j = \epsilon_j + \theta_j }[/math].
(c) For each word [math]\displaystyle{ w_{jn} }[/math],
i. Draw topic assignment [math]\displaystyle{ z_{jn} \sim \text{Mult}(\theta_j) }[/math].
ii. Draw word [math]\displaystyle{ w_{jn} \sim \text{Mult}(\beta_{z_{jn}}) }[/math].
3. For each user-item pair [math]\displaystyle{ (i, j) }[/math], draw the rating

[math]\displaystyle{ r_{ij} \sim \mathcal{N}(u_i^T v_j, c_{ij}^{-1}). \ (6) }[/math]

The key property of CTR lies in how the item latent vector [math]\displaystyle{ v_j }[/math] is generated. Note that [math]\displaystyle{ v_j = \epsilon_j + \theta_j }[/math], where [math]\displaystyle{ \epsilon_j \sim \mathcal{N}(0, \lambda_v^{-1} I_K) }[/math], is equivalent to [math]\displaystyle{ v_j \sim \mathcal{N}(\theta_j, \lambda_v^{-1} I_K) }[/math]: we assume the item latent vector [math]\displaystyle{ v_j }[/math] is close to the topic proportions [math]\displaystyle{ \theta_j }[/math], but can diverge from them if it has to. Note that the expectation of [math]\displaystyle{ r_{ij} }[/math] is a linear function of [math]\displaystyle{ \theta_j }[/math],

[math]\displaystyle{ E[r_{ij} \mid u_i, \theta_j, \epsilon_j] = u_i^T (\theta_j + \epsilon_j). }[/math]

This is why we call the model collaborative topic regression.

Learning the parameters. Given topic parameter [math]\displaystyle{ \beta }[/math], computing the full posterior of [math]\displaystyle{ u_i }[/math], [math]\displaystyle{ v_j }[/math] and [math]\displaystyle{ \theta_j }[/math] is intractable. We develop an EM-style algorithm to learn the maximum a posteriori (MAP) estimates. Maximization of the posterior is equivalent to maximizing the complete log likelihood of [math]\displaystyle{ U }[/math], [math]\displaystyle{ V }[/math], [math]\displaystyle{ \theta_{1:J} }[/math], and [math]\displaystyle{ R }[/math] given [math]\displaystyle{ \lambda_u }[/math], [math]\displaystyle{ \lambda_v }[/math] and [math]\displaystyle{ \beta }[/math],

[math]\displaystyle{ \mathcal{L} = -\frac{\lambda_u}{2} \sum_i u_i^T u_i - \frac{\lambda_v}{2} \sum_j (v_j - \theta_j)^T (v_j - \theta_j) + \sum_j \sum_n \log \Big( \sum_k \theta_{jk} \beta_{k, w_{jn}} \Big) - \sum_{i,j} \frac{c_{ij}}{2} (r_{ij} - u_i^T v_j)^2. \ (7) }[/math]

We have omitted a constant and set [math]\displaystyle{ \alpha = 1 }[/math]. We optimize this function by coordinate ascent, iteratively optimizing the collaborative filtering variables [math]\displaystyle{ \{u_i, v_j\} }[/math] and the topic proportions [math]\displaystyle{ \theta_j }[/math]. For [math]\displaystyle{ u_i }[/math] and [math]\displaystyle{ v_j }[/math], maximization follows in a similar fashion as for basic matrix factorization [12]. Given the current estimate of [math]\displaystyle{ \theta_j }[/math], taking the gradient of [math]\displaystyle{ \mathcal{L} }[/math] with respect to [math]\displaystyle{ u_i }[/math] and [math]\displaystyle{ v_j }[/math] and setting it to zero leads to (recall the matrix definitions [math]\displaystyle{ U = (u_i)_{i=1}^I }[/math] and [math]\displaystyle{ V = (v_j)_{j=1}^J }[/math])

[math]\displaystyle{ u_i \leftarrow (V C_i V^T + \lambda_u I_K)^{-1} V C_i R_i \ (8) }[/math]

[math]\displaystyle{ v_j \leftarrow (U C_j U^T + \lambda_v I_K)^{-1} (U C_j R_j + \lambda_v \theta_j), \ (9) }[/math]

where [math]\displaystyle{ C_i }[/math] is a diagonal matrix with [math]\displaystyle{ c_{ij}, j = 1, \cdots, J }[/math] as its diagonal elements and [math]\displaystyle{ R_i = (r_{ij})_{j=1}^J }[/math] for user [math]\displaystyle{ i }[/math]. For item [math]\displaystyle{ j }[/math], [math]\displaystyle{ C_j }[/math] and [math]\displaystyle{ R_j }[/math] are similarly defined. Eq. 9 shows how the topic proportions [math]\displaystyle{ \theta_j }[/math] affect the item latent vector [math]\displaystyle{ v_j }[/math], where [math]\displaystyle{ \lambda_v }[/math] balances this effect. Finally, we note that the complexity is linear in the number of articles in the users' libraries. This follows from the special structure of [math]\displaystyle{ c_{ij} }[/math] defined in Eq. 4. (See [12] for details.)
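A minimal dense implementation of the updates in Eqs. 8 and 9 might look like the following (Python/numpy; shapes and initialization are hypothetical). Unlike the paper's implementation, this sketch does not exploit the structure of c_ij from Eq. 4, so it is not linear in the number of library entries:

```python
import numpy as np

def update_factors(R, theta, U, V, a=1.0, b=0.01, lam_u=0.01, lam_v=100.0):
    """One sweep of the CTR coordinate-ascent updates (Eqs. 8-9).
    R: I x J binary ratings; theta: J x K topic proportions;
    U: I x K user vectors; V: J x K item vectors (rows are u_i, v_j)."""
    I, J = R.shape
    K = U.shape[1]
    C = np.where(R > 0, a, b)                    # confidence weights (Eq. 4)
    for i in range(I):                           # Eq. 8: u_i update
        A = V.T @ (C[i][:, None] * V) + lam_u * np.eye(K)
        U[i] = np.linalg.solve(A, V.T @ (C[i] * R[i]))
    for j in range(J):                           # Eq. 9: v_j pulled toward theta_j
        A = U.T @ (C[:, j][:, None] * U) + lam_v * np.eye(K)
        V[j] = np.linalg.solve(A, U.T @ (C[:, j] * R[:, j]) + lam_v * theta[j])
    return U, V
```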

Given [math]\displaystyle{ U }[/math] and [math]\displaystyle{ V }[/math], we now describe how to learn the topic proportions [math]\displaystyle{ \theta_j }[/math].³ We first define [math]\displaystyle{ q(z_{jn} = k) = \phi_{jnk} }[/math]. Then we separate the items that contain [math]\displaystyle{ \theta_j }[/math] and apply Jensen's inequality,

[math]\displaystyle{ \mathcal{L}(\theta_j) \geq -\frac{\lambda_v}{2} (v_j - \theta_j)^T (v_j - \theta_j) + \sum_n \sum_k \phi_{jnk} \big( \log \theta_{jk} \beta_{k, w_{jn}} - \log \phi_{jnk} \big) = \mathcal{L}(\theta_j, \phi_j). \ (10) }[/math]

Let [math]\displaystyle{ \phi_j = (\phi_{jnk})_{n=1,k=1}^{N \times K} }[/math]. The optimal [math]\displaystyle{ \phi_{jnk} }[/math] satisfies

[math]\displaystyle{ \phi_{jnk} \propto \theta_{jk} \beta_{k, w_{jn}}. \ (11) }[/math]

[math]\displaystyle{ \mathcal{L}(\theta_j, \phi_j) }[/math] gives a tight lower bound of [math]\displaystyle{ \mathcal{L}(\theta_j) }[/math]. We cannot optimize [math]\displaystyle{ \theta_j }[/math] analytically, so we use projection gradient [3]. We use coordinate ascent to optimize the remaining parameters, [math]\displaystyle{ U }[/math], [math]\displaystyle{ V }[/math], [math]\displaystyle{ \theta_{1:J} }[/math] and [math]\displaystyle{ \phi_{1:J} }[/math].

³ On our data, we found that simply fixing [math]\displaystyle{ \theta_j }[/math] as the estimate from vanilla LDA gives comparable performance and saves computation.
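For reference, the optimal φ from Eq. 11 is a simple normalized product; a sketch follows (Python/numpy, with hypothetical inputs). Updating θ_j itself requires projection gradient and is not shown; as footnote 3 notes, fixing θ_j at the vanilla LDA estimate performs comparably:

```python
import numpy as np

def update_phi(theta_j, beta, word_ids):
    """Eq. 11: phi_jnk is proportional to theta_jk * beta_{k, w_jn}.
    theta_j: length-K proportions; beta: K x W topic matrix;
    word_ids: length-N word indices of article j. Returns N x K."""
    phi = theta_j[None, :] * beta[:, word_ids].T  # unnormalized N x K
    return phi / phi.sum(axis=1, keepdims=True)   # normalize over topics k
```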

After we estimate [math]\displaystyle{ U }[/math], [math]\displaystyle{ V }[/math] and [math]\displaystyle{ \phi }[/math], we can optimize the topics [math]\displaystyle{ \beta }[/math],

[math]\displaystyle{ \beta_{kw} \propto \sum_j \sum_n \phi_{jnk} 1[w_{jn} = w]. \ (12) }[/math]

Note this is the same M-step update for topics as in LDA [7].

Prediction. After all the (locally) optimal parameters [math]\displaystyle{ U^* }[/math], [math]\displaystyle{ V^* }[/math], [math]\displaystyle{ \theta^*_{1:J} }[/math] and [math]\displaystyle{ \beta^* }[/math] are learned, the CTR model can be used for both in-matrix and out-of-matrix prediction. Let [math]\displaystyle{ D }[/math] be the observed data; in general, each prediction is estimated as

[math]\displaystyle{ E[r_{ij} \mid D] \approx E[u_i \mid D]^T \big( E[\theta_j \mid D] + E[\epsilon_j \mid D] \big). \ (13) }[/math]

For in-matrix prediction, we use the point estimates of [math]\displaystyle{ u_i }[/math], [math]\displaystyle{ \theta_j }[/math] and [math]\displaystyle{ \epsilon_j }[/math] to approximate their expectations,

[math]\displaystyle{ r^*_{ij} \approx (u^*_i)^T (\theta^*_j + \epsilon^*_j) = (u^*_i)^T v^*_j, \ (14) }[/math]

where recall that [math]\displaystyle{ v_j = \theta_j + \epsilon_j }[/math]. In out-of-matrix prediction the article is new, and no other ratings are available. Thus [math]\displaystyle{ E[\epsilon_j] = 0 }[/math], and we predict with

[math]\displaystyle{ r^*_{ij} \approx (u^*_i)^T \theta^*_j. \ (15) }[/math]

To obtain the topic proportions [math]\displaystyle{ \theta^*_j }[/math] for a new article, we optimize Eq. 10. The first term is dropped because [math]\displaystyle{ v_j = \theta_j }[/math].

Related work. Several other works use content for recommendation [15, 14, 1, 2]. Among these, the closest to ours is fLDA [2]. fLDA generalizes the supervised topic model (sLDA) [6], using the empirical topic proportions [math]\displaystyle{ \bar{z}_j = (1/N) \sum_{n=1}^N z_{jn} }[/math] as well as several other covariates to form predictions. In our setting, where we do not have additional covariates, their approach is roughly akin to setting [math]\displaystyle{ v_j = \theta_j }[/math]. We show in Section 4 that a similar setting does not perform as well as the CTR model because it largely ignores the other users' ratings.

Other recent work considers the related problem of using topic modeling to predict legislative votes [21, 10]. Neither of these methods introduces offset terms to account for votes (i.e., ratings). Legislative votes might be an interesting application for the CTR model.

4. EMPIRICAL STUDY

We demonstrate our model by analyzing a real-world community of researchers and their citation files.⁴

Dataset. Our data are users and their libraries of articles obtained from CiteULike.⁵ At CiteULike, registered users create personal reference libraries; each article usually has a title and abstract. (The other information about the articles, such as the authors, publications and keywords, is not used in this paper.)

We merged duplicated articles, removed empty articles, and removed users with fewer than 10 articles to obtain a data set of 5,551 users and 16,980 articles with 204,986 observed user-item pairs. (This matrix has a sparsity of 99.8%; it is highly sparse.) On average, each user has 37 articles in the library, ranging from 10 to 403; 93% of the users have fewer than 100 articles.

For each article, we concatenate its title and abstract. We remove stop words and use tf-idf to choose the top 8,000 distinct words as the vocabulary [5]. This yielded a corpus of 1.6M words. These articles were added to CiteULike between 2004 and 2010. On average, each article appears in 12 users' libraries, ranging from 1 to 321; 97% of the articles appear in fewer than 40 libraries.

Evaluation. In our experiments, we will analyze a set of articles and user libraries. We will evaluate recommendation algorithms on sets of held-out articles and ratings. We will (hypothetically) present each user with M articles sorted by their predicted rating and evaluate based on which of these articles were actually in each user's library.
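The vocabulary-selection step described above could be sketched as follows (Python with scikit-learn; the exact tf-idf ranking rule the authors used is not specified, so scoring each term by its maximum tf-idf value is an assumption):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# docs: one string per article, the concatenated title and abstract.
docs = ["latent dirichlet allocation topic model text corpora",
        "collaborative filtering for implicit feedback datasets"]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
# Score each distinct term by its peak tf-idf and keep the top 8,000.
scores = np.asarray(tfidf.max(axis=0).todense()).ravel()
terms = vectorizer.get_feature_names_out()
vocab = terms[np.argsort(-scores)[:8000]]
```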

⁴ A demo of the results can be found at http://www.cs.princeton.edu/~chongw/citeulike/

⁵ http://www.citeulike.org/faq/data.adp

Two possible metrics are precision and recall. However, as we discussed earlier, zero ratings are uncertain: they may indicate that a user does not like an article or simply does not know about it. This makes it difficult to compute precision accurately. Rather, since ratings of [math]\displaystyle{ r_{ij} = 1 }[/math] are known to be true positives, we focus on recall. Recall only considers the positively rated articles within the top M; a system that achieves high recall at a lower M is better. For each user, recall@M is defined as

[math]\displaystyle{ \text{recall@}M = \frac{\text{number of articles the user likes in the top } M}{\text{total number of articles the user likes}}. }[/math]

The recall for the entire system can be summarized using the average recall from all users.
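Recall@M for one user follows directly from a predicted ranking; here is a small sketch (Python/numpy, with hypothetical variable names) that ranks candidates by predicted rating, as in Eq. 14, and checks the held-out positives:

```python
import numpy as np

def recall_at_m(scores, held_out_likes, train_mask, M=100):
    """scores: length-J predicted ratings for one user (e.g., u_i^T v_j);
    held_out_likes: set of test article indices the user likes;
    train_mask: boolean length-J, True for articles already in her library."""
    candidates = np.where(~train_mask)[0]                  # don't re-recommend
    top_m = candidates[np.argsort(-scores[candidates])[:M]]
    hits = len(set(top_m.tolist()) & held_out_likes)
    return hits / len(held_out_likes)

# System-level recall is then the average of recall_at_m over all users.
```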

The recall defined above is user-oriented. We also consider article-oriented recall for testing the system's predictive performance on a particular article. For article j, we consider the population of users that like the article and the proportion of those users for whom the article appears in their top M recommended articles. This evaluates the predictive power of the system on a chosen set of articles.

As we discussed in Section 2, we consider two recommendation tasks: in-matrix prediction and out-of-matrix prediction.

In-matrix prediction. In-matrix prediction considers the case where each user has a set of articles that she has not seen, but that at least one other user has seen. We ask the question: how good is each system at rating that set of articles for each user? As discussed in Section 2.1, this task is similar to traditional collaborative filtering. We split the data into a training set and a test set, ensuring that all articles in the test set have appeared in the training set. Content information is not required to perform recommendations — though we will see that it helps — and thus matrix factorization can be used.

We use 5-fold cross-validation. For every article that appears at least 5 times in the users’ libraries, we evenly split their user-item pairs (both 1’s and 0’s) into 5 folds. We iteratively consider each fold to be a test set and the others to be the training set. For those articles that appear fewer than 5 times, we always put them into the training set. This guarantees that all articles in the test set must appear in the training set. (9% of the articles are always in the training set, since they appear fewer than 5 times.)
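A sketch of this in-matrix splitting scheme (Python, over the observed positive pairs only; the paper also splits the corresponding zeros, which is omitted here for brevity):

```python
import numpy as np
from collections import defaultdict

def in_matrix_folds(pairs, n_folds=5, seed=0):
    """pairs: list of (user, article) tuples. Returns n_folds test sets;
    articles seen fewer than n_folds times always stay in training."""
    rng = np.random.default_rng(seed)
    by_article = defaultdict(list)
    for user, article in pairs:
        by_article[article].append((user, article))
    folds = [[] for _ in range(n_folds)]
    for article, ps in by_article.items():
        if len(ps) < n_folds:
            continue                  # too rare: keep in training for every fold
        rng.shuffle(ps)               # spread this article's pairs evenly
        for k, p in enumerate(ps):
            folds[k % n_folds].append(p)
    return folds
```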

For each fold, we fit a model to the training set and test on the within-fold articles for each user. (Note: each user has a different set of within-fold articles.) We form predictive ratings for the test set, and generate a list of the top M recommended articles.

Out-of-matrix prediction. Out-of-matrix prediction considers the case where a new set of articles is published and no one has seen them. Again we ask, how good is each system at rating that set of articles for each user?

We again use 5-fold cross-validation. First, we evenly group all articles into 5 folds. For each fold, we fit the model to the sub-matrix formed by the out-of-fold articles and then test the recommendations for each user on the within-fold articles. Note that in this case, each user has the same set of within-fold articles, and we are guaranteed that none of these articles is in the training set for any user. Again, we form predictive ratings for the test set, and generate a list of the top M recommended articles.

Figure 3: Recall comparison on in-matrix and out-of-matrix prediction tasks by varying the number of recommended articles. For CTR, we set [math]\displaystyle{ \lambda_v = 100 }[/math]. Error bars are too small to show. The maximum expected recall for random recommendation is about 6%. CF cannot do out-of-matrix prediction. CTR performs best.

These two experimental set-ups — in-matrix and out-of-matrix prediction — are designed to be comparable: the top M articles are computed from candidate populations of the same size.

Experimental settings. For matrix factorization for collaborative filtering (CF), we used grid search to find that [math]\displaystyle{ K = 200 }[/math], [math]\displaystyle{ \lambda_u = \lambda_v = 0.01 }[/math], [math]\displaystyle{ a = 1 }[/math], [math]\displaystyle{ b = 0.01 }[/math] gives good performance on held-out recommendations. We use CF to denote this method. For collaborative topic regression (CTR), we set the parameters similarly as for CF: [math]\displaystyle{ K = 200 }[/math], [math]\displaystyle{ \lambda_u = 0.01 }[/math], [math]\displaystyle{ a = 1 }[/math] and [math]\displaystyle{ b = 0.01 }[/math]. In addition, the precision parameter [math]\displaystyle{ \lambda_v }[/math] balances how much the article's latent vector [math]\displaystyle{ v_j }[/math] diverges from the topic proportions [math]\displaystyle{ \theta_j }[/math]. We vary [math]\displaystyle{ \lambda_v \in \{10, 100, 1000, 10000\} }[/math], where a larger [math]\displaystyle{ \lambda_v }[/math] increases the penalty of [math]\displaystyle{ v_j }[/math] diverging from [math]\displaystyle{ \theta_j }[/math].

We also compare to the model that only uses LDA-like features, as we discussed at the beginning of Section 3. This is equivalent to fixing the per-item latent vector [math]\displaystyle{ v_j = \theta_j }[/math] in the CTR model. This is a nearly content-only model — while the per-user vectors are fit to the ratings data, the document vectors [math]\displaystyle{ \theta_j }[/math] are based only on the words of the document. We use LDA to denote this method. (Note that we use the resulting topics and proportions of LDA to initialize the CTR model.)

The baseline is the random model, where a user sees M randomly recommended articles. We note that the expected recall for the random method from a pool of [math]\displaystyle{ M_{tot} }[/math] articles does not depend on library size; it is always [math]\displaystyle{ M / M_{tot} }[/math]. (For example, with the five-fold splits used here the candidate pool is roughly 16,980/5 ≈ 3,400 articles, so M = 100 random recommendations give an expected recall of about 3%, and M = 200 about 6%, matching the values quoted in Figures 3 and 4.)

Comparisons. Figure 3 shows the overall performance for in-matrix and out-of-matrix prediction when we vary the number of returned articles [math]\displaystyle{ M = 20, 40, \cdots, 200 }[/math]; for CTR, we pick [math]\displaystyle{ \lambda_v = 100 }[/math]. Figure 4 shows the performance when we vary [math]\displaystyle{ \lambda_v }[/math] for the CTR model, compared with CF and LDA, when we fix [math]\displaystyle{ M = 100 }[/math]. Figures 3 and 4 show that matrix factorization works well for in-matrix prediction, but adding content with CTR improves performance. The improvement is greater when the number of returned documents M is larger. The reason is as follows. Popular articles are more likely to be recommended by both methods. However, when M becomes large, few user ratings are available to ensure that CF gives good recommendations; the contribution of the content becomes more important.

Figure 4: Recall comparison on in-matrix and out-of-matrix prediction tasks, fixing the number of recommended articles at M = 100 and varying the precision parameter [math]\displaystyle{ \lambda_v }[/math]. Error bars are too small to show. This shows how [math]\displaystyle{ \lambda_v }[/math] affects the performance of CTR. The expected recall of random recommendation is about 3%. CF cannot do out-of-matrix prediction.

Figure 5: These scatter plots show how the number of articles a user has affects his or her recall. Red lines indicate the average. In these plots, the number of recommended articles is 100. CF cannot do out-of-matrix prediction. This shows that CTR performs the best over user-oriented recall.

Figure 6: These scatter plots show how the number of users that like an article affects its recall. Red lines indicate the average. In these plots, the number of recommended articles is 100. CF cannot do out-of-matrix prediction. This shows that CTR performs best over article-oriented recall as well.

Compared to both CF and CTR, LDA suffers for in-matrix prediction. It does not account enough for the users' information in forming its predicted ratings. The gap between CF and LDA is interesting — other users provide a better assessment of preferences than content alone.

Out-of-matrix prediction is a harder problem, as shown by the relatively lower recall. In this task, CTR performs slightly better than LDA. Matrix factorization cannot perform out-of-matrix prediction. (Note also that LDA performs almost the same on both in-matrix and out-of-matrix predictions. This is expected because, in both settings, it makes its recommendations almost entirely based on content.) Overall, CTR is the best model.

In Figure 4 we study the effect of the precision parameter [math]\displaystyle{ \lambda_v }[/math]. When [math]\displaystyle{ \lambda_v }[/math] is small in CTR, the per-item latent vector [math]\displaystyle{ v_j }[/math] can diverge significantly from the topic proportions [math]\displaystyle{ \theta_j }[/math]. Here, CTR behaves more like matrix factorization, where no content is considered. When [math]\displaystyle{ \lambda_v }[/math] increases, CTR is penalized for [math]\displaystyle{ v_j }[/math] diverging from the topic proportions; this brings the content into the recommendations. When [math]\displaystyle{ \lambda_v }[/math] is too large, [math]\displaystyle{ v_j }[/math] is nearly the same as [math]\displaystyle{ \theta_j }[/math] and, consequently, CTR behaves more like LDA.

We next study the relationship, across models, between recommendation performance and properties of the users and articles. For this study we set the number of recommended articles [math]\displaystyle{ M = 100 }[/math] and the precision [math]\displaystyle{ \lambda_v = 100 }[/math]. Figure 5 shows how the performance varies as a function of the number of articles in a user's library; Figure 6 shows how the performance varies as a function of the number of users that like an article.

As we see from Figure 5, for both in-matrix and out-of-matrix prediction, users with more articles tend to have less variance in their predictions. Users with few articles show more diverse outcomes, with recall values scattered toward the extremes of 0 and 1. In addition, we see that recall for users with more articles shows a decreasing trend. This is reasonable: when a user has more articles, more of them will be infrequent ones. As we see next, these articles are harder to predict.

From Figure 6, for in-matrix prediction with CF, CTR and LDA, articles with high frequencies tend to have high recall, and their predictions have less variance. This is because these articles have more collaborative information than infrequent ones, and, furthermore, CF and CTR make use of this information.

Figure 7: Analysis of “Maximum likelihood from incomplete data via the EM algorithm.” Here, “theta” denotes [math]\displaystyle{ \theta_j }[/math] and “theta correction” denotes the offset [math]\displaystyle{ \epsilon_j }[/math]. The 10 topics are obtained by joining the top 5 topics ranked by [math]\displaystyle{ \theta_{jk} }[/math] and another top 5 topics ranked by [math]\displaystyle{ |\epsilon_{jk}| }[/math], [math]\displaystyle{ k = 1, \cdots, K }[/math]. Topic 1: estimate, estimates, likelihood, maximum, estimated, missing, distances. Topic 10: parameters, Bayesian, inference, optimal, procedure, prior, assumptions. Under CTR, an article of wide interest is likely to exhibit more topics than its text exhibits. For example, this article brings in several other topics, including one on “Bayesian statistics” (topic 10). Note that the EM article is mainly about parameter estimation (topic 1), though it is frequently referenced by Bayesian statisticians (and scholars in other fields as well).

For LDA, this trend is much smaller. In out-of-matrix predictions, since predictions are made on new articles, these frequencies do not have an effect on training the model.

We now turn to an exploratory analysis of our results on the CTR model. (In the following, the precision [math]\displaystyle{ \lambda_v = 100 }[/math].)

Examining User Profiles. One advantage of the CTR model is that it can explain the user latent space using the topics learned from the data. For one user, we can find the top matched topics by ranking the entries of her latent vector [math]\displaystyle{ u_i }[/math]. Table 1 shows two example users and their top 3 matched topics along with their top 10 preferred articles as predicted by the CTR model.

The learned topics serve as a summary of what users might be interested in. For user I, we see that he or she might be a researcher working on machine learning and its applications to texts and images. Although the predicted top 10 articles don’t contain a vision article, we see such articles when more articles are retrieved. For user II, he or she might be a researcher who is interested in user interfaces and collaborative filtering.

Examining the Latent Space of Articles. We can also examine the latent space of articles beyond their topic proportions. Here we inspect the articles with the largest overall offsets [math]\displaystyle{ \epsilon_j }[/math]. Table 2 shows the top 10 articles with the largest offsets, measured by the distance between [math]\displaystyle{ v_j }[/math] and [math]\displaystyle{ \theta_j }[/math]: [math]\displaystyle{ \epsilon_j^T \epsilon_j = (v_j - \theta_j)^T (v_j - \theta_j) }[/math]. The last two columns show the average of predicted ratings over those users who actually have that article (avg-like) and those users who do not have that article (avg-dislike).
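Given fitted item vectors and topic proportions, ranking articles by this offset distance is straightforward; a hypothetical sketch (Python/numpy):

```python
import numpy as np

def rank_by_offset(V, theta, top=10):
    """V, theta: J x K arrays of item vectors v_j and proportions theta_j.
    Returns indices of the articles with the largest offset distance."""
    offsets = np.sum((V - theta) ** 2, axis=1)  # eps_j^T eps_j per article
    return np.argsort(-offsets)[:top]
```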

These articles are popular in this data. Among the top 50 articles by this measure, 94% of them have at least 50 appearances. Articles with large offsets enjoy readership from different areas, and their item latent vectors have to diverge from the topic proportions to account for this. For example, Figure 7 illustrates the article that is the main citation for the expectation-maximization algorithm, “Maximum likelihood from incomplete data via the EM algorithm” [9]. Its top topic (found by [math]\displaystyle{ k = \arg\max_k \theta_{jk} }[/math]) is shown as topic 1. It is about “parameter estimation,” which is the main focus of this article. We can also examine the topics that are offset the most, [math]\displaystyle{ k = \arg\max_k |\epsilon_{jk}| = \arg\max_k |v_{jk} - \theta_{jk}| }[/math]. The maximum offset is for topic 10, a topic about “Bayesian statistics.” Topic 10 has a low value in [math]\displaystyle{ \theta_j }[/math] — the EM paper is not a Bayesian paper — but readers of Bayesian statistics typically have this paper in their library.

Examining the offset can yield the opposite kind of article. For example, consider the article “Phase-of-firing coding of natural visual stimuli in primary visual cortex” in Figure 8. Its most probable topic is topic 1 (about “computational neuroscience”). Taking the offset into account, the most probable topic does not change, and no new topics are brought in. This indicates that the offset [math]\displaystyle{ \epsilon_j }[/math] only adjusts [math]\displaystyle{ v_j }[/math] so that the objective function is well minimized. This article is not as interesting to users outside of neuroscience.

Figure 8: Analysis of “Phase-of-firing coding of natural visual stimuli in primary visual cortex.” This figure was created in the same way as Figure 7. Topic 1: neurons, responses, neuronal, spike, cortical, stimuli, stimulus. It shows that a less popular article might also have a high offset value [math]\displaystyle{ \epsilon_j }[/math]. In this case, the offset changes the actual magnitudes in [math]\displaystyle{ \theta_j }[/math], but does not bring in other topics.

5. CONCLUSIONS AND FUTURE WORK

We proposed an algorithm for recommending scientific articles to users based on both content and other users’ ratings. Our study showed that this approach works well relative to traditional matrix factorization methods and makes good predictions on completely unrated articles.

Further, our algorithm provides interpretable user profiles. Such profiles could be useful in real-world recommender systems. For example, if a particular user recognizes her profile as representing different topics, she can choose to “hide” some topics when seeking recommendations about a subject.

Table 1: Two example users. We show their position in latent space via their highest weighted topics, and list the top 10 preferred articles as predicted by CTR. ✓ indicates the article is in the user's library; ✗ indicates it is not.

User I, top 3 topics:
1. image, measure, measures, images, motion, matching, transformation, entropy, overlap, computed, match
2. learning, machine, training, vector, learn, machines, kernel, learned, classifiers, classifier, generalization
3. sets, objects, defined, categories, representations, universal, category, attributes, consisting, categorization

User I, top 10 articles:
1. Information theory, inference and learning algorithms ✓
2. Machine learning in automated text categorization ✓
3. Artificial intelligence: a modern approach ✗
4. Data mining: practical machine learning tools and techniques ✗
5. Statistical learning theory ✗
6. Modern information retrieval ✓
7. Pattern recognition and machine learning, information science and statistics ✓
8. Recognition by components: a theory of human image understanding ✗
9. Data clustering: a review ✓
10. Indexing by latent semantic analysis ✓

User II, top 3 topics:
1. users, user, interface, interfaces, needs, explicit, implicit, usability, preferences, interests, personalized
2. based, world, real, characteristics, actual, exploring, exploration, quite, navigation, possibilities, dealing
3. evaluation, collaborative, products, filtering, product, reviews, items, recommendations, recommender

User II, top 10 articles:
1. Combining collaborative filtering with personal agents for better recommendations ✗
2. An adaptive system for the personalized access to news ✓
3. Implicit interest indicators ✗
4. Footprints: history-rich tools for information foraging ✓
5. Using social tagging to improve social navigation ✓
6. User models for adaptive hypermedia and adaptive educational systems ✓
7. Collaborative filtering recommender systems ✓
8. Knowledge tree: a distributed architecture for adaptive e-learning ✓
9. Evaluating collaborative filtering recommender systems ✓
10. Personalizing search via automated analysis of interests and activities ✓

References

  • 1. Deepak Agarwal, Bee-Chung Chen, Regression-based Latent Factor Models, Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, June 28-July 01, 2009, Paris, France doi:10.1145/1557019.1557029
  • 2. Deepak Agarwal, Bee-Chung Chen, FLDA: Matrix Factorization through Latent Dirichlet Allocation, Proceedings of the Third ACM International Conference on Web Search and Data Mining, February 04-06, 2010, New York, New York, USA doi:10.1145/1718487.1718499
  • 3. D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
  • 4. D. Blei and J. Lafferty. A Correlated Topic Model of Science. Annals of Applied Statistics, 1(1):17--35, 2007.
  • 5. D. Blei and J. Lafferty. Topic Models. In A. Srivastava and M. Sahami, Editors, Text Mining: Theory and Applications. Taylor and Francis, 2009.
  • 6. D. Blei and J. McAuliffe. Supervised Topic Models. In Neural Information Processing Systems, 2007.
  • 7. David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet Allocation, The Journal of Machine Learning Research, 3, p.993-1022, 3/1/2003
  • 8. J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, and D. Blei. Reading Tea Leaves: How Humans Interpret Topic Models. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, Editors, Advances in Neural Information Processing Systems 22, Pages 288--296, 2009.
  • 9. A. Dempster, N. Laird, and D. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39:1--38, 1977.
  • 10. S. M. Gerrish and D. M. Blei. Predicting Legislative Roll Calls from Text. In: Proceedings of the 28th Annual International Conference on Machine Learning, ICML '11, 2011.
  • 11. Jonathan L. Herlocker, Joseph A. Konstan, Al Borchers, John Riedl, An Algorithmic Framework for Performing Collaborative Filtering, Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, p.230-237, August 15-19, 1999, Berkeley, California, USA doi:10.1145/312624.312682
  • 12. Yifan Hu, Yehuda Koren, Chris Volinsky, Collaborative Filtering for Implicit Feedback Datasets, Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, p.263-272, December 15-19, 2008 doi:10.1109/ICDM.2008.22
  • 13. Yehuda Koren, Robert Bell, Chris Volinsky, Matrix Factorization Techniques for Recommender Systems, Computer, v.42 n.8, p.30-37, August 2009 doi:10.1109/MC.2009.263
  • 14. Prem Melville, Raymond J. Mooney, Ramadass Nagarajan, Content-boosted Collaborative Filtering for Improved Recommendations, Eighteenth National Conference on Artificial Intelligence, p.187-192, July 28-August 01, 2002, Edmonton, Alberta, Canada
  • 15. Raymond J. Mooney, Loriene Roy, Content-based Book Recommending Using Learning for Text Categorization, Proceedings of the Fifth ACM Conference on Digital Libraries, p.195-204, June 02-07, 2000, San Antonio, Texas, USA doi:10.1145/336597.336662
  • 16. Rong Pan, Yunhong Zhou, Bin Cao, Nathan N. Liu, Rajan Lukose, Martin Scholz, Qiang Yang, One-Class Collaborative Filtering, Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, p.502-511, December 15-19, 2008 doi:10.1109/ICDM.2008.16
  • 17. Ruslan Salakhutdinov, Andriy Mnih, Bayesian Probabilistic Matrix Factorization Using Markov Chain Monte Carlo, Proceedings of the 25th International Conference on Machine Learning, p.880-887, July 05-09, 2008, Helsinki, Finland doi:10.1145/1390156.1390267
  • 18. R. Salakhutdinov and A. Mnih. Probabilistic Matrix Factorization. Advances in Neural Information Processing Systems, 20:1257--1264, 2008.
  • 19. Hanhuai Shan, Arindam Banerjee, Generalized Probabilistic Matrix Factorizations for Collaborative Filtering, Proceedings of the 2010 IEEE International Conference on Data Mining, p.1025-1030, December 13-17, 2010 doi:10.1109/ICDM.2010.116
  • 20. Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101(476):1566--1581, 2007.
  • 21. E. Wang, D. Liu, J. Silva, D. Dunson, and L. Carin. Joint Analysis of Time-evolving Binary Matrices and Associated Documents. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Editors, Advances in Neural Information Processing Systems 23, Pages 2370--2378. 2010.
  • 22. Kai Yu, John Lafferty, Shenghuo Zhu, Yihong Gong, Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model, Proceedings of the 26th Annual International Conference on Machine Learning, p.1185-1192, June 14-18, 2009, Montreal, Quebec, Canada doi:10.1145/1553374.1553525



Chong Wang; David M. Blei (2011). “Collaborative Topic Modeling for Recommending Scientific Articles.” doi:10.1145/2020408.2020480