2015 CollaborativeDeepLearningforRec


Subject Headings: Collaborative Deep Learning Algorithm.

Notes

Cited By

Quotes

Author Keywords

Abstract

Collaborative filtering (CF) is a successful approach commonly used by many recommender systems. Conventional CF-based methods use the ratings given to items by users as the sole source of information for learning to make recommendations. However, the ratings are often very sparse in many applications, causing CF-based methods to degrade significantly in their recommendation performance. To address this sparsity problem, auxiliary information such as item content information may be utilized. Collaborative topic regression (CTR) is an appealing recent method taking this approach which tightly couples the two components that learn from two different sources of information. Nevertheless, the latent representation learned by CTR may not be very effective when the auxiliary information is very sparse. To address this problem, we generalize recent advances in deep learning from i.i.d. input to non-i.i.d. (CF-based) input and propose in this paper a hierarchical Bayesian model called collaborative deep learning (CDL), which jointly performs deep representation learning for the content information and collaborative filtering for the ratings (feedback) matrix. Extensive experiments on three real-world datasets from different domains show that CDL can significantly advance the state of the art.

1. INTRODUCTION

Due to the abundance of choice in many online services, recommender systems (RS) now play an increasingly significant role [40]. For individuals, using RS allows us to make more effective use of information. Besides, many companies (e.g., Amazon and Netflix) have been using RS extensively to target their customers by recommending products or services. Existing methods for RS can roughly be categorized into three classes [6]: content-based methods, collaborative filtering (CF) based methods, and hybrid methods. Content-based methods [17] make use of user profiles or product descriptions for recommendation. CF-based methods [23, 27] use the past activities or preferences, such as user ratings on items, without using user or product content information. Hybrid methods [1, 18, 12] seek to get the best of both worlds by combining content-based and CF-based methods.

Because of privacy concerns, it is generally more difficult to collect user profiles than past activities. Nevertheless, CF-based methods do have their limitations. The prediction accuracy often drops significantly when the ratings are very sparse. Moreover, they cannot be used for recommending new products which have yet to receive rating information from users. Consequently, it is inevitable for CF-based methods to exploit auxiliary information and hence hybrid methods have gained popularity in recent years. According to whether two-way interaction exists between the rating information and auxiliary information, we may further divide hybrid methods into two sub-categories: loosely coupled and tightly coupled methods. Loosely coupled methods like [29] process the auxiliary information once and then use it to provide features for the CF models. Since information flow is one-way, the rating information cannot provide feedback to guide the extraction of useful features. For this sub-category, improvement often has to rely on a manual and tedious feature engineering process. On the contrary, tightly coupled methods like [34] allow two-way interaction. On one hand, the rating information can guide the learning of features. On the other hand, the extracted features can further improve the predictive power of the CF models (e.g., based on matrix factorization of the sparse rating matrix). With two-way interaction, tightly coupled methods can automatically learn features from the auxiliary information and naturally balance the influence of the rating and auxiliary information. This is why tightly coupled methods often outperform loosely coupled ones [35].

Collaborative topic regression (CTR) [34] is a recently proposed tightly coupled method. It is a probabilistic graphical model that seamlessly integrates a topic model, latent Dirichlet allocation (LDA) [5], and a model-based CF method, probabilistic matrix factorization (PMF) [27]. CTR is an appealing method in that it produces promising and interpretable results. Nevertheless, the latent representation learned is often not effective enough, especially when the auxiliary information is very sparse. It is this representation learning problem that we will focus on in this paper. On the other hand, deep learning models have recently shown great potential for learning effective representations and delivering state-of-the-art performance in computer vision [38] and natural language processing [15, 26] applications. In deep learning models, features are learned in a supervised or unsupervised manner. Although they are more appealing than shallow models in that the features can be learned automatically (e.g., effective feature representation is learned from text content), they are inferior to shallow models such as CF in capturing and learning the similarity and implicit relationship between items. This calls for integrating deep learning with CF by performing deep learning collaboratively.

Unfortunately, very few attempts have been made to develop deep learning models for CF. [28] uses restricted Boltzmann machines instead of the conventional matrix factorization formulation to perform CF and [9] extends this work by incorporating user-user and item-item correlations. Although these methods involve both deep learning and CF, they actually belong to CF-based methods because they do not incorporate content information like CTR, which is crucial for accurate recommendation. [24] uses low-rank matrix factorization in the last weight layer of a deep network to significantly reduce the number of model parameters and speed up training, but it is for classification instead of recommendation tasks. On music recommendation, [21, 39] directly use conventional CNN or deep belief networks (DBN) to assist representation learning for content information, but the deep learning components of their models are deterministic without modeling the noise and hence they are less robust. The models achieve performance boost mainly by loosely coupled methods without exploiting the interaction between content information and ratings. Besides, the CNN is linked directly to the rating matrix, which means the models will perform poorly when the ratings are sparse, as shown in the following experiments.

To address the challenges above, we develop a hierarchical Bayesian model called collaborative deep learning (CDL) as a novel tightly coupled method for RS. We first present a Bayesian formulation of a deep learning model called stacked denoising autoencoder (SDAE) [32]. With this, we then present our CDL model which tightly couples deep representation learning for the content information and collaborative filtering for the ratings (feedback) matrix, allowing two-way interaction between the two. Experiments show that CDL significantly outperforms the state of the art. Note that although we present CDL as using SDAE for its feature learning component, CDL is actually a more general framework which can also admit other deep learning models such as deep Boltzmann machines [25], recurrent neural networks [10], and convolutional neural networks [16].

The main contributions of this paper are summarized below:

• By performing deep learning collaboratively, CDL can simultaneously extract an effective deep feature representation from content and capture the similarity and implicit relationship between items (and users). The learned representation may also be used for tasks other than recommendation.

• Unlike previous deep learning models which use simple targets like classification [15] and reconstruction [32], we propose to use CF as a more complex target in a probabilistic framework.

• Besides the algorithm for attaining maximum a posteriori (MAP) estimates, we also derive a sampling-based algorithm for the Bayesian treatment of CDL, which, interestingly, turns out to be a Bayesian generalized version of back-propagation.

• To the best of our knowledge, CDL is the first hierarchical Bayesian model to bridge the gap between state-of-the-art deep learning models and RS. Besides, due to its Bayesian nature, CDL can be easily extended to incorporate other auxiliary information to further boost the performance.

• Extensive experiments on three real-world datasets from different domains show that CDL can significantly advance the state of the art.

2. NOTATION AND PROBLEM FORMULATION

Similar to the work in [34], the recommendation task considered in this paper takes implicit feedback [13] as the training and test data. The entire collection of J items (articles or movies) is represented by a J-by-S matrix X_c, where row j is the bag-of-words vector X_{c,j*} for item j based on a vocabulary of size S. With I users, we define an I-by-J binary rating matrix R = [R_ij]_{I×J}. For example, in the dataset citeulike-a, R_ij = 1 if user i has article j in his or her personal library and R_ij = 0 otherwise. Given part of the ratings in R and the content information X_c, the problem is to predict the other ratings in R. Note that although we focus on movie recommendation (where plots of movies are considered as content information) and article recommendation like [34] in this paper, our model is general enough to handle other recommendation tasks (e.g., tag recommendation).

The matrix X_c plays the role of clean input to the SDAE while the noise-corrupted matrix, also a J-by-S matrix, is denoted by X_0. The output of layer l of the SDAE is denoted by X_l, which is a J-by-K_l matrix. Similar to X_c, row j of X_l is denoted by X_{l,j*}. W_l and b_l are the weight matrix and bias vector, respectively, of layer l, W_{l,*n} denotes column n of W_l, and L is the number of layers. For convenience, we use W+ to denote the collection of all layers of weight matrices and biases. Note that an L/2-layer SDAE corresponds to an L-layer network.
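To make the notation concrete, the following is a minimal sketch (Python/NumPy; the sizes, sparsity levels, and variable names are illustrative assumptions, not values from the paper) of the objects just defined: the clean bag-of-words matrix X_c, the binary rating matrix R, and a masking-noise-corrupted input X_0 of the kind used later in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

I, J, S = 100, 50, 300   # toy numbers of users, items, and vocabulary words (illustrative only)

# X_c: J-by-S bag-of-words matrix holding the clean item content.
X_c = rng.poisson(0.05, size=(J, S)).astype(float)

# R: I-by-J binary rating matrix from implicit feedback
# (R[i, j] = 1 if user i has item j in his or her library).
R = (rng.random((I, J)) < 0.02).astype(float)

# X_0: noise-corrupted version of X_c obtained by masking a fraction of the entries.
noise_level = 0.3
mask = rng.random(X_c.shape) >= noise_level
X_0 = X_c * mask
```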

3. COLLABORATIVE DEEP LEARNING

We are now ready to present details of our CDL model. We first briefly review SDAE and give a Bayesian formulation of SDAE. This is then followed by the presentation of CDL as a hierarchical Bayesian model which tightly integrates the ratings and content information.

Figure 1: On the left is the graphical model of CDL. The part inside the dashed rectangle represents an SDAE. An example SDAE with L = 2 is shown. On the right is the graphical model of the degenerated CDL. The part inside the dashed rectangle represents the encoder of an SDAE. An example SDAE with L = 2 is shown on the right of it. Note that although L is still 2, the decoder of the SDAE vanishes. To prevent clutter, we omit all variables x_l except x_0 and x_{L/2} in the graphical models.

3.1 Stacked Denoising Autoencoders

SDAE [32] is a feedforward neural network for learning representations (encodings) of the input data by learning to predict the clean input itself in the output, as shown in Figure 2. Usually the hidden layer in the middle, i.e., X_2 in the figure, is constrained to be a bottleneck and the input layer X_0 is a corrupted version of the clean input data.

Figure 2: A 2-layer SDAE with L = 4.

SDAE solves the following optimization problem:

$$\min_{\{\mathbf{W}_l\},\{\mathbf{b}_l\}} \ \|\mathbf{X}_c - \mathbf{X}_L\|_F^2 + \lambda \sum_l \|\mathbf{W}_l\|_F^2,$$

where λ is a regularization parameter and ‖·‖_F denotes the Frobenius norm.
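For illustration, here is a small NumPy sketch of this objective: a forward pass X_l = σ(X_{l-1} W_l + b_l) through a 2-layer SDAE (L = 4) followed by evaluation of the reconstruction error plus weight decay. All layer sizes, the initialization, and the function names are assumptions made for this sketch, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sdae_objective(X0, Xc, weights, biases, lam):
    """Forward pass through the SDAE layers and evaluation of
    ||Xc - XL||_F^2 + lam * sum_l ||Wl||_F^2 (the objective in Section 3.1)."""
    X = X0
    for W, b in zip(weights, biases):
        X = sigmoid(X @ W + b)          # X_l = sigma(X_{l-1} W_l + b_l)
    recon = np.sum((Xc - X) ** 2)       # squared Frobenius norm of the residual
    reg = lam * sum(np.sum(W ** 2) for W in weights)
    return recon + reg

# Illustrative 2-layer SDAE (L = 4) with layer sizes S-K1-K-K1-S.
rng = np.random.default_rng(0)
S, K1, K, J = 300, 200, 50, 50
sizes = [S, K1, K, K1, S]
weights = [0.01 * rng.standard_normal((a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]
Xc = rng.random((J, S))
X0 = Xc * (rng.random(Xc.shape) >= 0.3)     # masking-noise corruption
print(sdae_objective(X0, Xc, weights, biases, lam=0.1))
```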

3.2 Generalized Bayesian SDAE

If we assume that both the clean input X_c and the corrupted input X_0 are observed, similar to [4, 19, 3, 7], we can define the following generative process:

1. For each layer l of the SDAE network,
   (a) For each column n of the weight matrix W_l, draw $\mathbf{W}_{l,*n} \sim \mathcal{N}(\mathbf{0}, \lambda_w^{-1}\mathbf{I}_{K_l})$.
   (b) Draw the bias vector $\mathbf{b}_l \sim \mathcal{N}(\mathbf{0}, \lambda_w^{-1}\mathbf{I}_{K_l})$.
   (c) For each row j of X_l, draw
       $$\mathbf{X}_{l,j*} \sim \mathcal{N}\big(\sigma(\mathbf{X}_{l-1,j*}\mathbf{W}_l + \mathbf{b}_l), \lambda_s^{-1}\mathbf{I}_{K_l}\big). \quad (1)$$
2. For each item j, draw a clean input $\mathbf{X}_{c,j*} \sim \mathcal{N}(\mathbf{X}_{L,j*}, \lambda_n^{-1}\mathbf{I}_J)$.

Note that if λ_s goes to infinity, the Gaussian distribution in Equation (1) will become a Dirac delta distribution [31] centered at σ(X_{l-1,j*} W_l + b_l), where σ(·) is the sigmoid function. The model will then degenerate to a Bayesian formulation of SDAE. That is why we call it generalized SDAE. Note that the first L/2 layers of the network act as an encoder and the last L/2 layers act as a decoder. Maximization of the posterior probability is equivalent to minimization of the reconstruction error with weight decay taken into consideration.

(Note that while generation of the clean input X_c from X_L is part of the generative process of the Bayesian SDAE, generation of the noise-corrupted input X_0 from X_c is an artificial noise-injection process that helps the SDAE learn a more robust feature representation.)
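The generative process above can be simulated directly. The sketch below (NumPy; sizes and hyperparameter values are illustrative assumptions) draws weights and biases from their zero-mean Gaussian priors, propagates a corrupted input through the layers with per-layer Gaussian noise of precision λ_s, and finally draws the clean input around the last layer's output with precision λ_n.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_generalized_sdae(X0, sizes, lam_w, lam_s, lam_n, rng):
    """Draw one sample from the generative process of the generalized SDAE:
    W_l, b_l ~ N(0, lam_w^{-1} I), X_l ~ N(sigma(X_{l-1} W_l + b_l), lam_s^{-1} I),
    and the clean input X_c ~ N(X_L, lam_n^{-1} I).
    As lam_s -> infinity this collapses to a plain (deterministic) SDAE."""
    X = X0
    for K_in, K_out in zip(sizes[:-1], sizes[1:]):
        W = rng.normal(0.0, lam_w ** -0.5, size=(K_in, K_out))
        b = rng.normal(0.0, lam_w ** -0.5, size=K_out)
        mean = sigmoid(X @ W + b)
        X = mean + rng.normal(0.0, lam_s ** -0.5, size=mean.shape)
    X_c = X + rng.normal(0.0, lam_n ** -0.5, size=X.shape)
    return X_c

rng = np.random.default_rng(0)
X0 = rng.random((5, 300))                      # 5 corrupted item rows, vocabulary of 300
Xc = sample_generalized_sdae(X0, [300, 200, 50, 200, 300],
                             lam_w=100.0, lam_s=1e6, lam_n=100.0, rng=rng)
```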

3.3 Collaborative Deep Learning

Using the Bayesian SDAE as a component, the generative process of CDL is defined as follows:

1. For each layer l of the SDAE network,
   (a) For each column n of the weight matrix W_l, draw $\mathbf{W}_{l,*n} \sim \mathcal{N}(\mathbf{0}, \lambda_w^{-1}\mathbf{I}_{K_l})$.
   (b) Draw the bias vector $\mathbf{b}_l \sim \mathcal{N}(\mathbf{0}, \lambda_w^{-1}\mathbf{I}_{K_l})$.
   (c) For each row j of X_l, draw $\mathbf{X}_{l,j*} \sim \mathcal{N}\big(\sigma(\mathbf{X}_{l-1,j*}\mathbf{W}_l + \mathbf{b}_l), \lambda_s^{-1}\mathbf{I}_{K_l}\big)$.
2. For each item j,
   (a) Draw a clean input $\mathbf{X}_{c,j*} \sim \mathcal{N}(\mathbf{X}_{L,j*}, \lambda_n^{-1}\mathbf{I}_J)$.
   (b) Draw a latent item offset vector $\boldsymbol{\epsilon}_j \sim \mathcal{N}(\mathbf{0}, \lambda_v^{-1}\mathbf{I}_K)$ and then set the latent item vector to be $\mathbf{v}_j = \boldsymbol{\epsilon}_j + \mathbf{X}_{L/2,j*}^T$.
3. Draw a latent user vector for each user i: $\mathbf{u}_i \sim \mathcal{N}(\mathbf{0}, \lambda_u^{-1}\mathbf{I}_K)$.
4. Draw a rating R_ij for each user-item pair (i, j): $R_{ij} \sim \mathcal{N}(\mathbf{u}_i^T\mathbf{v}_j, C_{ij}^{-1})$.

Here λ_w, λ_n, λ_u, λ_s, and λ_v are hyperparameters and C_ij is a confidence parameter similar to that for CTR (C_ij = a if R_ij = 1 and C_ij = b otherwise). Note that the middle layer X_{L/2} serves as a bridge between the ratings and content information. This middle layer, along with the latent offset ε_j, is the key that enables CDL to simultaneously learn an effective feature representation and capture the similarity and (implicit) relationship between items (and users). Similar to the generalized SDAE, for computational efficiency, we can also take λ_s to infinity. The graphical model of CDL when λ_s approaches positive infinity is shown in Figure 1, where, for notational simplicity, we use x_0, x_{L/2}, and x_L in place of X_{0,j*}^T, X_{L/2,j*}^T, and X_{L,j*}^T, respectively.
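A rough simulation of steps 2(b)-4 of this generative process is sketched below (NumPy; the encoder weights, sizes, and hyperparameter values are assumed for illustration). It forms the item latent vectors as the middle-layer encoding plus a Gaussian offset and computes the means u_i^T v_j of the rating distributions; the confidence parameters a and b only enter at learning time.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(X0, weights, biases):
    """Encoder f_e: the first L/2 layers of the SDAE (deterministic, lam_s -> infinity)."""
    half = len(weights) // 2
    X = X0
    for W, b in zip(weights[:half], biases[:half]):
        X = sigmoid(X @ W + b)
    return X                                   # rows are X_{L/2, j*}

def simulate_cdl_latents(X0, weights, biases, lam_v, lam_u, n_users, rng):
    """Steps 2(b)-4 of the CDL generative process (means only):
    v_j = eps_j + X_{L/2,j*}^T with eps_j ~ N(0, lam_v^{-1} I_K),
    u_i ~ N(0, lam_u^{-1} I_K), and the rating means u_i^T v_j."""
    X_half = encode(X0, weights, biases)       # J-by-K middle-layer representation
    J, K = X_half.shape
    eps = rng.normal(0.0, lam_v ** -0.5, size=(J, K))
    V = X_half + eps                           # latent item vectors
    U = rng.normal(0.0, lam_u ** -0.5, size=(n_users, K))
    return U, V, U @ V.T                       # rating means R_ij = u_i^T v_j

rng = np.random.default_rng(0)
sizes = [300, 200, 50, 200, 300]
weights = [0.01 * rng.standard_normal((a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]
X0 = rng.random((40, 300))                     # 40 corrupted item rows
U, V, R_mean = simulate_cdl_latents(X0, weights, biases,
                                    lam_v=10.0, lam_u=1.0, n_users=100, rng=rng)
```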

Figure 3: NN representation for degenerated CDL.

3.4 Maximum A Posteriori Estimates

Based on the CDL model above, all parameters could be treated as random variables so that fully Bayesian methods such as Markov chain Monte Carlo (MCMC) or variational approximation methods [14] may be applied. However, such treatment typically incurs high computational cost. Besides, since CTR is our primary baseline for comparison, it would be fair and reasonable to take an approach analogous to that used in CTR. Consequently, we devise below an EM-style algorithm for obtaining the MAP estimates, as in [34].

Like in CTR, maximizing the posterior probability is equivalent to maximizing the joint log-likelihood of U, V, {X_l}, X_c, {W_l}, {b_l}, and R given λ_u, λ_v, λ_w, λ_s, and λ_n:

$$\mathcal{L} = -\frac{\lambda_u}{2}\sum_i \|\mathbf{u}_i\|_2^2 - \frac{\lambda_w}{2}\sum_l \big(\|\mathbf{W}_l\|_F^2 + \|\mathbf{b}_l\|_2^2\big) - \frac{\lambda_v}{2}\sum_j \|\mathbf{v}_j - \mathbf{X}_{L/2,j*}^T\|_2^2 - \frac{\lambda_n}{2}\sum_j \|\mathbf{X}_{L,j*} - \mathbf{X}_{c,j*}\|_2^2 - \frac{\lambda_s}{2}\sum_l \sum_j \|\sigma(\mathbf{X}_{l-1,j*}\mathbf{W}_l + \mathbf{b}_l) - \mathbf{X}_{l,j*}\|_2^2 - \sum_{i,j}\frac{C_{ij}}{2}\big(R_{ij} - \mathbf{u}_i^T\mathbf{v}_j\big)^2.$$

If λ_s goes to infinity, the likelihood becomes:

$$\mathcal{L} = -\frac{\lambda_u}{2}\sum_i \|\mathbf{u}_i\|_2^2 - \frac{\lambda_w}{2}\sum_l \big(\|\mathbf{W}_l\|_F^2 + \|\mathbf{b}_l\|_2^2\big) - \frac{\lambda_v}{2}\sum_j \|\mathbf{v}_j - f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T\|_2^2 - \frac{\lambda_n}{2}\sum_j \|f_r(\mathbf{X}_{0,j*}, \mathbf{W}^+) - \mathbf{X}_{c,j*}\|_2^2 - \sum_{i,j}\frac{C_{ij}}{2}\big(R_{ij} - \mathbf{u}_i^T\mathbf{v}_j\big)^2, \quad (2)$$

where the encoder function f_e(·, W+) takes the corrupted content vector X_{0,j*} of item j as input and computes the encoding of the item, and the function f_r(·, W+) also takes X_{0,j*} as input but computes the encoding and then the reconstructed content vector of item j. For example, if the number of layers L = 6, f_e(X_{0,j*}, W+) is the output of the third layer while f_r(X_{0,j*}, W+) is the output of the sixth layer. From the perspective of optimization, the third term in the objective function (2) above is equivalent to a multi-layer perceptron using the latent item vectors v_j as the target while the fourth term is equivalent to an SDAE minimizing the reconstruction error. Seen from the view of neural networks (NN), when λ_s approaches positive infinity, training of the probabilistic graphical model of CDL in Figure 1 (left) degenerates to simultaneously training two neural networks overlaid together with a common input layer (the corrupted input) but different output layers, as shown in Figure 3. Note that the second network is much more complex than typical neural networks due to the involvement of the rating matrix.
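The following sketch evaluates the negative of the objective in Equation (2) for given U, V, SDAE weights, and confidence matrix C (NumPy; the looping style and variable names are assumptions of this sketch, and the deterministic encoder/decoder corresponds to the λ_s → ∞ case).

```python
import numpy as np

def cdl_objective(U, V, X0, Xc, weights, biases, C, R,
                  lam_u, lam_v, lam_n, lam_w):
    """Negative joint log-likelihood of Equation (2) (lam_s -> infinity):
    weight decay on U and {W_l, b_l}, the encoder term tying v_j to f_e(X_{0,j*}),
    the reconstruction term tying f_r(X_{0,j*}) to X_{c,j*}, and the
    confidence-weighted squared rating error."""
    half = len(weights) // 2
    X = X0
    enc = None
    for idx, (W, b) in enumerate(zip(weights, biases)):
        X = 1.0 / (1.0 + np.exp(-(X @ W + b)))
        if idx == half - 1:
            enc = X                              # f_e(X_0, W+): middle-layer output
    rec = X                                      # f_r(X_0, W+): last-layer output
    obj  = 0.5 * lam_u * np.sum(U ** 2)
    obj += 0.5 * lam_w * sum(np.sum(W ** 2) + np.sum(b ** 2)
                             for W, b in zip(weights, biases))
    obj += 0.5 * lam_v * np.sum((V - enc) ** 2)
    obj += 0.5 * lam_n * np.sum((rec - Xc) ** 2)
    obj += 0.5 * np.sum(C * (R - U @ V.T) ** 2)
    return obj
```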

When the ratio λ_n/λ_v approaches positive infinity, it will degenerate to a two-step model in which the latent representation learned using SDAE is put directly into the CTR. Another extreme happens when λ_n/λ_v goes to zero, where the decoder of the SDAE essentially vanishes. On the right of Figure 1 is the graphical model of the degenerated CDL when λ_n/λ_v goes to zero. As demonstrated in the experiments, the predictive performance will suffer greatly for both extreme cases.

For u_i and v_j, coordinate ascent similar to [34, 13] is used. Given the current W+, we compute the gradients of $\mathcal{L}$ with respect to u_i and v_j and set them to zero, leading to the following update rules:

$$\mathbf{u}_i \leftarrow (\mathbf{V}\mathbf{C}_i\mathbf{V}^T + \lambda_u\mathbf{I}_K)^{-1}\mathbf{V}\mathbf{C}_i\mathbf{R}_i$$
$$\mathbf{v}_j \leftarrow (\mathbf{U}\mathbf{C}_j\mathbf{U}^T + \lambda_v\mathbf{I}_K)^{-1}\big(\mathbf{U}\mathbf{C}_j\mathbf{R}_j + \lambda_v f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T\big),$$

where $\mathbf{U} = (\mathbf{u}_i)_{i=1}^I$, $\mathbf{V} = (\mathbf{v}_j)_{j=1}^J$, $\mathbf{C}_i = \mathrm{diag}(C_{i1}, \ldots, C_{iJ})$ is a diagonal matrix, $\mathbf{R}_i = (R_{i1}, \ldots, R_{iJ})^T$ is a column vector containing all the ratings of user i, and C_ij reflects the confidence controlled by a and b as discussed in [13].

Given U and V, we can learn the weights W_l and biases b_l for each layer using the back-propagation learning algorithm. The gradients of the likelihood with respect to W_l and b_l are as follows:

$$\nabla_{\mathbf{W}_l}\mathcal{L} = -\lambda_w\mathbf{W}_l - \lambda_v\sum_j \nabla_{\mathbf{W}_l} f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T\big(f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T - \mathbf{v}_j\big) - \lambda_n\sum_j \nabla_{\mathbf{W}_l} f_r(\mathbf{X}_{0,j*}, \mathbf{W}^+)\big(f_r(\mathbf{X}_{0,j*}, \mathbf{W}^+) - \mathbf{X}_{c,j*}\big)$$

$$\nabla_{\mathbf{b}_l}\mathcal{L} = -\lambda_w\mathbf{b}_l - \lambda_v\sum_j \nabla_{\mathbf{b}_l} f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T\big(f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T - \mathbf{v}_j\big) - \lambda_n\sum_j \nabla_{\mathbf{b}_l} f_r(\mathbf{X}_{0,j*}, \mathbf{W}^+)\big(f_r(\mathbf{X}_{0,j*}, \mathbf{W}^+) - \mathbf{X}_{c,j*}\big).$$

By alternating the update of U, V, W_l, and b_l, we can find a local optimum for $\mathcal{L}$. Several commonly used techniques such as using a momentum term may be used to alleviate the local optimum problem. For completeness, we also provide a sampling-based algorithm for CDL in the appendix.
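A direct NumPy transcription of the two block updates above is sketched below, treating U as an I-by-K and V as a J-by-K matrix (so the paper's V C_i V^T becomes V^T diag(C_i) V here); the variable names and the per-row loop are assumptions of this sketch, not the authors' MATLAB/C++ implementation.

```python
import numpy as np

def update_user_item_factors(R, C, U, V, enc, lam_u, lam_v):
    """One sweep of the coordinate-ascent updates in Section 3.4:
    u_i <- (V C_i V^T + lam_u I)^-1 V C_i R_i
    v_j <- (U C_j U^T + lam_v I)^-1 (U C_j R_j + lam_v f_e(X_{0,j*}, W+)^T)
    R is the I-by-J rating matrix, C the matching confidence matrix, and
    enc holds the encoder outputs f_e(X_{0,j*}, W+) as a J-by-K matrix."""
    I, K = U.shape
    J = V.shape[0]
    for i in range(I):
        Ci = C[i]                                        # length-J confidence vector
        A = (V * Ci[:, None]).T @ V + lam_u * np.eye(K)  # V C_i V^T + lam_u I
        b = (V * Ci[:, None]).T @ R[i]                   # V C_i R_i
        U[i] = np.linalg.solve(A, b)
    for j in range(J):
        Cj = C[:, j]                                     # length-I confidence vector
        A = (U * Cj[:, None]).T @ U + lam_v * np.eye(K)  # U C_j U^T + lam_v I
        b = (U * Cj[:, None]).T @ R[:, j] + lam_v * enc[j]
        V[j] = np.linalg.solve(A, b)
    return U, V
```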

3.5 Prediction

Let D be the observed test data. Similar to [34], we use the point estimates of u_i, W+ and ε_j to calculate the predicted rating:

$$\mathbb{E}[R_{ij} \mid \mathcal{D}] \approx \mathbb{E}[\mathbf{u}_i \mid \mathcal{D}]^T\big(\mathbb{E}[f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T \mid \mathcal{D}] + \mathbb{E}[\boldsymbol{\epsilon}_j \mid \mathcal{D}]\big),$$

where $\mathbb{E}[\cdot]$ denotes the expectation operation. In other words, we approximate the predicted rating as:

$$R_{ij}^* \approx (\mathbf{u}_i^*)^T\big(f_e(\mathbf{X}_{0,j*}, \mathbf{W}^{+*})^T + \boldsymbol{\epsilon}_j^*\big) = (\mathbf{u}_i^*)^T\mathbf{v}_j^*.$$

Note that for any new item j with no rating in the training data, its offset ε_j will be 0.
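In code, the prediction step amounts to one matrix product once the encoder outputs and offsets are available; a minimal sketch (NumPy, with illustrative names) follows.

```python
import numpy as np

def predict_ratings(U, enc, eps):
    """Approximate predicted ratings from Section 3.5:
    R*_ij ~= (u_i*)^T (f_e(X_{0,j*}, W+*)^T + eps_j*) = (u_i*)^T v_j*.
    enc holds the encoder outputs (J-by-K); for a new item with no training
    ratings, the corresponding row of eps is simply zero."""
    V = enc + eps                 # v_j = f_e(X_{0,j*}, W+)^T + eps_j
    return U @ V.T                # I-by-J matrix of predicted ratings
```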

4. EXPERIMENTS

Extensive experiments are conducted on three real-world datasets from different domains to demonstrate the effectiveness of our model both quantitatively and qualitatively. (Code and data are available at www.wanghao.in.)

4.1 Datasets

We use three datasets from different real-world domains, two from CiteULike and one from Netflix, for our experiments. (CiteULike allows users to create their own collections of articles. There are abstract, title, and tags for each article. More details about the CiteULike data can be found at http://www.citeulike.org.) The first two datasets, from [35], were collected in different ways, specifically, with different scales and different degrees of sparsity to mimic different practical situations. The first dataset, citeulike-a, is mostly from [34]. The second dataset, citeulike-t, was collected independently of the first one. The authors of [35] manually selected 273 seed tags and collected all the articles with at least one of those tags. Similar to [34], users with fewer than 3 articles are not included. As a result, citeulike-a contains 5551 users and 16980 items. For citeulike-t, the numbers are 7947 and 25975. We can see that citeulike-t contains more users and items than citeulike-a. Also, citeulike-t is much sparser, as only 0.07% of its user-item matrix entries contain ratings whereas citeulike-a has ratings in 0.22% of its user-item matrix entries.

The last dataset, Netflix, consists of two parts. The first part, with ratings and movie titles, is from the Netflix challenge dataset. The second part, with plots of the corresponding movies, was collected by us from IMDB (http://www.imdb.com). Similar to [41], in order to be consistent with the implicit feedback setting of the first two datasets, we extract only positive ratings (rating 5) for training and testing. After removing users with fewer than 3 positive ratings and movies without plots, we have 407261 users, 9228 movies, and 15348808 ratings in the final dataset.

We follow the same procedure as that in [34] to preprocess the text information (item content) extracted from the titles and abstracts of the articles and the plots of the movies. After removing stop words, the top S discriminative words according to the tf-idf values are chosen to form the vocabulary (S is 8000, 20000, and 20000 for the three datasets, respectively).

4.2 Evaluation Scheme

For each dataset, similar to [35, 36], we randomly select P items associated with each user to form the training set and use all the rest of the dataset as the test set. To evaluate and compare the models under both sparse and dense settings, we set P to 1 and 10, respectively, in our experiments. For each value of P, we repeat the evaluation five times with different randomly selected training sets and the average performance is reported.

As in [34, 22, 35], we use recall as the performance measure because the rating information is in the form of implicit feedback [13, 23]. Specifically, a zero entry may be due to the fact that the user is not interested in the item, or that the user is not aware of its existence. As such, precision is not a suitable performance measure. Like most recommender systems, we sort the predicted ratings of the candidate items and recommend the top M items to the target user. The recall@M for each user is then defined as:

$$\text{recall@}M = \frac{\text{number of items that the user likes among the top } M}{\text{total number of items that the user likes}}.$$

The final result reported is the average recall over all users.
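A straightforward way to compute this measure is sketched below (NumPy; the data layout, with per-user lists of training and test items, is an assumption of this sketch): training items are excluded from the candidate pool, the top M remaining items are taken, and the per-user recalls are averaged.

```python
import numpy as np

def recall_at_m(pred_scores, train_likes, test_likes, M=300):
    """Average recall@M over users, following the definition above: for each user,
    rank all items not in the training set by predicted score, take the top M, and
    divide the number of liked test items among them by the user's total number of
    liked test items. Users with no liked test items are skipped."""
    recalls = []
    for i in range(pred_scores.shape[0]):
        liked = list(test_likes[i])
        if len(liked) == 0:
            continue
        scores = pred_scores[i].copy()
        scores[list(train_likes[i])] = -np.inf       # exclude training items
        top_m = np.argsort(-scores)[:M]
        hits = len(set(top_m.tolist()) & set(liked))
        recalls.append(hits / len(liked))
    return float(np.mean(recalls))
```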

Another evaluation metric is the mean average precision (mAP). Exactly as in [21], we set the cutoff point at 500 for each user.

4.3 Baselines and Experimental Settings

The models included in our comparison are listed as follows:

• CMF: Collective Matrix Factorization [30] is a model incorporating different sources of information by simultaneously factorizing multiple matrices. In this paper, the two factorized matrices are R and X_c.

• SVDFeature: SVDFeature (Chen, Zhang, et al., 2012) is a model for feature-based collaborative filtering. In this paper we use the content information X_c as raw features to feed into SVDFeature.

• DeepMusic: DeepMusic [21] is a model for music recommendation mentioned in Section 1. We use the variant, a loosely coupled method, that achieves the best performance as our baseline.

• CTR: Collaborative Topic Regression [34] is a model performing topic modeling and collaborative filtering simultaneously, as mentioned in the previous section.

• CDL: Collaborative Deep Learning is our proposed model as described above. It allows different levels of model complexity by varying the number of layers.

In the experiments, we first use a validation set to find the optimal hyperparameters for CMF, SVDFeature, CTR, and DeepMusic. For CMF, we set the regularization hyperparameters for the latent factors of different contexts to 10. After the grid search, we find that CMF performs best when the weights for the rating matrix and content matrix (BOW) are both 5 in the sparse setting. For the dense setting the weights are 8 and 2, respectively. For SVDFeature, the best performance is achieved when the regularization hyperparameters for the users and items are both 0.004 with the learning rate equal to 0.005. For DeepMusic, we find that the best performance is achieved using a CNN with two convolutional layers. We also try our best to tune the other hyperparameters. For CTR, we find that it can achieve good prediction performance when λ_u = 0.1, λ_v = 10, a = 1, b = 0.01, and K = 50 (note that a and b determine the confidence parameters C_ij). For CDL, we directly set a = 1, b = 0.01, K = 50 and perform grid search on the hyperparameters λ_u, λ_v, λ_n, and λ_w. For the grid search, we split the training data and use 5-fold cross validation. We use masking noise with a noise level of 0.3 to get the corrupted input X_0 from the clean input X_c. For CDL with more than one layer of SDAE (L > 2), we use a dropout rate [2, 33, 11] of 0.1 to achieve adaptive regularization. In terms of network architecture, the number of hidden units K_l is set to 200 for l such that l ≠ L/2 and 0 < l < L. While both K_0 and K_L are equal to the number of words S in the dictionary, K_{L/2} is set to K, which is the number of dimensions of the learned representation.


Figure 4: Performance comparison of CDL, CTR, DeepMusic, CMF, and SVDFeature based on recall@M for datasets citeulike-a, citeulike-t, and Netflix in the sparse setting. A 2-layer CDL is used.


Figure 5: Performance comparison of CDL, CTR, DeepMusic, CMF, and SVDFeature based on recall@M for datasets citeulike-a, citeulike-t, and Netflix in the dense setting. A 2-layer CDL is used.

Table 1: mAP for three datasets

              citeulike-a   citeulike-t   Netflix
CDL           0.0514        0.0453        0.0312
CTR           0.0236        0.0175        0.0223
DeepMusic     0.0159        0.0118        0.0167
CMF           0.0164        0.0104        0.0158
SVDFeature    0.0152        0.0103        0.0187

For example, the 2-layer CDL model (L = 4) has a Bayesian SDAE of architecture '8000-200-50-200-8000' for the citeulike-a dataset.
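The layer sizes described above follow a simple pattern (input and output layers of size S, middle layer of size K, and 200 hidden units elsewhere), which the following illustrative helper reproduces; the function name and the mirroring logic are assumptions of this sketch.

```python
def cdl_layer_sizes(S, K, L, hidden=200):
    """Layer sizes of the Bayesian SDAE in CDL: K_0 = K_L = S, K_{L/2} = K,
    and `hidden` units in every other layer (encoder mirrored to form the decoder)."""
    half = [S] + [hidden] * (L // 2 - 1) + [K]
    return half + half[-2::-1]

print(cdl_layer_sizes(S=8000, K=50, L=4))   # [8000, 200, 50, 200, 8000]
print(cdl_layer_sizes(S=8000, K=50, L=6))   # [8000, 200, 200, 50, 200, 200, 8000]
```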

4.4 Quantitative Comparison

Figures 4 and 5 show the results that compare CDL, CTR, DeepMusic, CMF, and SVDFeature using the three datasets under both the sparse (P = 1) and dense (P = 10) settings. We can see that CTR is a strong baseline which beats DeepMusic, CMF, and SVDFeature in all datasets even though DeepMusic has a deep architecture. In the sparse setting, CMF outperforms SVDFeature most of the time and sometimes even achieves performance comparable to CTR. DeepMusic performs poorly due to lack of ratings and overfitting. In the dense setting, SVDFeature is significantly better than CMF for citeulike-a and citeulike-t but is inferior to CMF for Netflix. DeepMusic is still slightly worse than CTR due to the reasons mentioned in Section 1. To focus more specifically on comparing CDL with CTR, we can see that for citeulike-a, 2-layer CDL outperforms CTR by a margin of 4.2%–6.0% in the sparse setting and 3.3%–4.6% in the dense setting. If we increase the number of layers to 3 (L = 6), the margin will go up to 5.8%–8.0% and 4.3%–5.8%, respectively. Similarly for citeulike-t, 2-layer CDL outperforms CTR by a margin of 10.4%–13.1% in the sparse setting and 4.7%–7.6% in the dense setting.

When the number of layers is increased to 3, the margin will even go up to 11.0%–14.9% and 5.2%–8.2%, respectively. For Netflix, 2-layer CDL outperforms CTR by a margin of 1.9%–5.9% in the sparse setting and 1.5%–2.0% in the dense setting. As we can see, seamless and successful integration of deep learning and RS requires careful designs to avoid overfitting and achieve significant performance boost.

Table 2: Recall@300 in the sparse setting (%)

              #layers = 1   #layers = 2   #layers = 3
citeulike-a   27.89         31.06         30.70
citeulike-t   32.58         34.67         35.48
Netflix       29.20         30.50         31.01

Table 1 shows the mAP for all models in the sparse setting. We can see that the mAP of CDL is almost twice or more than twice that of CTR. Tables 2 and 3 show the recall@300 results when CDL with different numbers of layers is applied to the three datasets under both the sparse and dense settings. As we can see, for citeulike-t and Netflix, the recall increases as the number of layers increases. For citeulike-a, CDL starts to overfit when it exceeds two layers. Since the standard deviation is always very small ($4.31 \times 10^{-5}$ to $9.31 \times 10^{-3}$), we do not include it in the figures and tables as it is not noticeable anyway.

Note that the results are somewhat different for the first two datasets although they are from the same domain. This is due to the different ways in which the datasets were collected, as discussed above. Specifically, both the text information and the rating matrix in citeulike-t are much sparser (each article in citeulike-a has 66.6 words on average while that for citeulike-t is 18.8). By seamlessly integrating deep representation learning for content information and CF for the rating matrix, CDL can handle both the sparse rating matrix and the sparse text information much better and learn a much more effective latent representation for each item and hence each user.


Figure 6: Performance of CDL based on recall@M for different values of λ_n on citeulike-t. The left plot is for L = 2 and the right one is for L = 6.

Figure 6 shows the results for different values of λ_n using citeulike-t under the dense setting. We set λ_u = 0.01, λ_v = 100, and L to 2 and 6. Similar phenomena are observed when the number of layers and the value of P are varied, but they are omitted here due to space constraints. As mentioned in the previous section, when λ_n is extremely large, λ_n/λ_v will approach positive infinity so that CDL degenerates to two separate models. In this case the latent item representation will be learned by the SDAE in an unsupervised manner and then put directly into (a simplified version of) the CTR. Consequently, there is no interaction between the Bayesian SDAE and the collaborative filtering component based on matrix factorization and hence the prediction performance will suffer greatly. For the other extreme, when λ_n is extremely small, λ_n/λ_v will approach zero so that CDL degenerates to the model on the right of Figure 1, in which the decoder of the Bayesian SDAE component essentially vanishes. This way the encoder of the Bayesian SDAE component will easily overfit the latent item vectors learned by simple matrix factorization. As we can see in Figure 6, the prediction performance degrades significantly as λ_n gets very large or very small. When λ_n < 0.1, the recall@M is already very close to (or even worse than) the result of PMF.

4.5 Qualitative Comparison

To gain a better insight into CDL, we first take a look at two example users in the citeulike-t dataset and represent the profile of each of them using the top three matched topics. We examine the top 10 recommended articles returned by a 3-layer (L = 6) CDL and CTR. The models are trained under the sparse setting (P = 1). From Table 4, we can speculate that user I might be a computer scientist with focus on tag recommendation, as clearly indicated by the first topic in CDL and the second one in CTR. CDL correctly recommends many articles on tagging systems while CTR focuses on social networks instead. When digging into the data, we find that the only rated article in the training data is 'What drives content tagging: the case of photos on Flickr', which is an article that talks about the impact of social networks on tagging behaviors. This may explain why CTR focuses its recommendation on social networks. On the other hand, CDL can better understand the key points of the article (i.e., tagging and CF) to make appropriate recommendations accordingly. Consequently, the precision of CDL and CTR is 70% and 10%, respectively.

From the matched topics returned by both CDL and CTR, user II might be a researcher on blood flow dynamic theory, particularly in the field of medical science.

Table 3: Recall@300 in the dense setting (%)

              #layers = 1   #layers = 2   #layers = 3
citeulike-a   58.35         59.43         59.31
citeulike-t   52.68         53.81         54.48
Netflix       69.26         70.40         70.42

CDL correctly captures the user profile and achieves a precision of 100%. However, CTR recommends quite a few articles on astronomy instead. When examining the data, we find that the only rated article returned by CTR is 'Simulating deformable particle suspensions using a coupled lattice-Boltzmann and finite-element method'. As expected, this article is on deformable particle suspension and the flow of blood cells.

CTR might have misinterpreted this article, focusing its recommendation on words like 'flows' and 'formation' separately. This explains why CTR recommends articles like 'Formation versus destruction: the evolution of the star cluster population in galaxy mergers' (formation) and 'Macroscopic effects of the spectral structure in turbulent flows' (flows). As a result, its precision is only 30%. From these two users, we can see that with a more effective representation, CDL can capture the key points of articles and the user preferences more accurately (e.g., user I). Besides, it can model the co-occurrence and relations of words better (e.g., user II).

We next present another case study, which is for the Netflix dataset under the dense setting (P = 10). In this case study, we choose one user (user III) and vary the number of ratings (positive feedback) in the training set given by the user from 1 to 10. The partition of training and test data remains the same for all other users. This is to examine how the recommendation of CTR and CDL adapts as user III expresses preference for more and more movies. Table 5 shows the recommendation lists of CTR and CDL when the number of training samples is set to 2, 4, and 10. When there are only two training samples, the two movies user III likes are 'Moonstruck' and 'True Romance', which are both romance movies. For now the precision of CTR and CDL is close (20% and 30%). When two more samples are added, the precision of CDL is boosted to 50% while that of CTR remains unchanged (20%). That is because the two new movies, 'Johnny English' and 'American Beauty', belong to action and drama movies. CDL successfully captures the user's change of taste and gets two more recommendations right but CTR fails to do so. Similar phenomena can be observed when the number of training samples increases from 4 to 10. From this case study, we can see that CDL is sensitive enough to changes of user taste and hence can provide more accurate recommendation.

5. COMPLEXITY ANALYSIS AND IMPLEMENTATION

Following the update rules in this paper, the computational complexity of updating u_i is O(K^2 J + K^3), where K is the dimensionality of the learned representation and J is the number of items. The complexity for v_j is O(K^2 I + K^3 + S K_1), where I is the number of users, S is the size of the vocabulary, and K_1 is the dimensionality of the output in the first layer. Note that the third term O(S K_1) is the cost of computing the output of the encoder and it is dominated by the computation of the first layer. For the update of all the weights and biases, the complexity is O(J S K_1) since the computation is dominated by the first layer. Thus for a complete epoch the total time complexity is O(J S K_1 + K^2 J^2 + K^2 I^2 + K^3).

Table 4: Interpretability of the latent structures learned

User I (CDL), top 3 topics:
1. search, image, query, images, queries, tagging, index, tags, searching, tag
2. social, online, internet, communities, sharing, networking, facebook, friends, ties, participation
3. collaborative, optimization, filtering, recommendation, contextual, planning, items, preferences

User I (CDL), top 10 articles (in user's lib?):
1. The structure of collaborative tagging systems (yes)
2. Usage patterns of collaborative tagging systems (yes)
3. Folksonomy as a complex network (no)
4. HT06, tagging paper, taxonomy, Flickr, academic article, to read (yes)
5. Why do tagging systems work (yes)
6. Information retrieval in folksonomies: search and ranking (no)
7. tagging, communities, vocabulary, evolution (yes)
8. The complex dynamics of collaborative tagging (yes)
9. Improved annotation of the blogosphere via autotagging and hierarchical clustering (no)
10. Collaborative tagging as a tripartite network (yes)

User I (CTR), top 3 topics:
1. social, online, internet, communities, sharing, networking, facebook, friends, ties, participation
2. search, image, query, images, queries, tagging, index, tags, searching, tag
3. feedback, event, transformation, wikipedia, indicators, vitamin, log, indirect, taxonomy

User I (CTR), top 10 articles (in user's lib?):
1. HT06, tagging paper, taxonomy, Flickr, academic article, to read (yes)
2. Structure and evolution of online social networks (no)
3. Group formation in large social networks: membership, growth, and evolution (no)
4. Measurement and analysis of online social networks (no)
5. A face(book) in the crowd: social searching vs. social browsing (no)
6. The strength of weak ties (no)
7. Flickr tag recommendation based on collective knowledge (no)
8. The computer-mediated communication network (no)
9. Social capital, self-esteem, and use of online social network sites: A longitudinal analysis (no)
10. Increasing participation in online communities: A framework for human-computer interaction (no)

User II (CDL), top 3 topics:
1. flow, cloud, codes, matter, boundary, lattice, particles, galaxies, fluid, galaxy
2. mobile, membrane, wireless, sensor, mobility, lipid, traffic, infrastructure, monitoring, ad
3. hybrid, orientation, stress, fluctuations, load, temperature, centrality, mechanical, two-dimensional, heat

User II (CDL), top 10 articles (in user's lib?):
1. Modeling the flow of dense suspensions of deformable particles in three dimensions (yes)
2. Simplified particulate model for coarse-grained hemodynamics simulations (yes)
3. Lattice Boltzmann simulations of blood flow: non-newtonian rheology and clotting processes (yes)
4. A genome-wide association study for celiac disease identifies risk variants (yes)
5. Efficient and accurate simulations of deformable particles (yes)
6. A multiscale model of thrombus development (yes)
7. Multiphase hemodynamic simulation of pulsatile flow in a coronary artery (yes)
8. Lattice Boltzmann modeling of thrombosis in giant aneurysms (yes)
9. A lattice Boltzmann simulation of clotting in stented aneursysms (yes)
10. Predicting dynamics and rheology of blood flow (yes)

User II (CTR), top 3 topics:
1. flow, cloud, codes, matter, boundary, lattice, particles, galaxies, fluid, galaxy
2. transition, equations, dynamical, discrete, equation, dimensions, chaos, transitions, living, trust
3. mobile, membrane, wireless, sensor, mobility, lipid, traffic, infrastructure, monitoring, ad

User II (CTR), top 10 articles (in user's lib?):
1. Multiphase hemodynamic simulation of pulsatile flow in a coronary artery (yes)
2. The metallicity evolution of star-forming galaxies from redshift 0 to 3 (no)
3. Formation versus destruction: the evolution of the star cluster population in galaxy mergers (no)
4. Clearing the gas from globular clusters (no)
5. Macroscopic effects of the spectral structure in turbulent flows (no)
6. The WiggleZ dark energy survey (no)
7. Lattice-Boltzmann simulation of blood flow in digitized vessel networks (no)
8. Global properties of 'ordinary' early-type galaxies (no)
9. Proteus: a direct forcing method in the simulations of particulate flows (yes)
10. Analysis of mechanisms for platelet near-wall excess under arterial blood flow conditions (yes)

All our experiments are conducted on servers with 2 Intel E5-2650 CPUs and 4 NVIDIA Tesla M2090 GPUs each. Using the MATLAB implementation with GPU/C++ acceleration, each epoch takes only about 40 seconds and each run takes 200 epochs for the first two datasets. For Netflix it takes about 60 seconds per epoch and needs many fewer epochs (about 100) to get satisfactory recommendation performance. Since Netflix is much larger than the other two datasets, this shows that CDL is very scalable. We expect that changing the implementation to a pure C++/CUDA one would significantly reduce the time cost.

6. CONCLUSION AND FUTURE WORK

We have demonstrated in this paper that state-of-the-art performance can be achieved by jointly performing deep representation learning for the content information and collaborative filtering for the ratings (feedback) matrix. As far as we know, CDL is the first hierarchical Bayesian model to bridge the gap between state-of-the-art deep learning models and RS. In terms of learning, besides the algorithm for attaining the MAP estimates, we also derive a sampling-based algorithm for the Bayesian treatment of CDL as a Bayesian generalized version of back-propagation.

Among the possible extensions that could be made to CDL, the bag-of-words representation may be replaced by more powerful alternatives, such as [20]. The Bayesian nature of CDL also provides a potential performance boost if other side information is incorporated as in [37]. Besides, as remarked above, CDL actually provides a framework that can also admit deep learning models other than SDAE. One promising choice is the convolutional neural network model which, among other things, can explicitly take the context and order of words into account. Further performance boost may be possible when using such deep learning models.

Table 5: Example user with recommended movies

User III. Movies in the training set: Moonstruck, True Romance, Johnny English, American Beauty, The Princess Bride, Top Gun, Double Platinum, Rising Sun, Dead Poets Society, Waiting for Guffman.

Top 10 recommended movies by CTR:
- 2 training samples: Swordfish; A Fish Called Wanda; Terminator 2; A Clockwork Orange; Sling Blade; Bridget Jones's Diary; Raising Arizona; A Streetcar Named Desire; The Untouchables; The Full Monty
- 4 training samples: Pulp Fiction; A Clockwork Orange; Being John Malkovich; Raising Arizona; Sling Blade; Swordfish; A Fish Called Wanda; Saving Grace; The Graduate; Monster's Ball
- 10 training samples: Best in Snow; Chocolat; Good Will Hunting; Monty Python and the Holy Grail; Being John Malkovich; Raising Arizona; The Graduate; Swordfish; Tootsie; Saving Private Ryan

Top 10 recommended movies by CDL:
- 2 training samples: Snatch; The Big Lebowski; Pulp Fiction; Kill Bill; Raising Arizona; The Big Chill; Tootsie; Sense and Sensibility; Sling Blade; Swinger
- 4 training samples: Pulp Fiction; Snatch; The Usual Suspect; Kill Bill; Momento; The Big Lebowski; One Flew Over the Cuckoo's Nest; As Good as It Gets; Goodfellas; The Matrix
- 10 training samples: Good Will Hunting; Best in Show; The Big Lebowski; A Few Good Men; Monty Python and the Holy Grail; Pulp Fiction; The Matrix; Chocolat; The Usual Suspect; CaddyShack

References

  • 1. D. Agarwal and B.-C. Chen. Regression-based Latent Factor Models. In KDD, Pages 19--28, 2009.
  • 2. P. Baldi and P. J. Sadowski. Understanding Dropout. In NIPS, Pages 2814--2822, 2013.
  • 3. Y. Bengio, L. Yao, G. Alain, and P. Vincent. Generalized Denoising Auto-encoders As Generative Models. In NIPS, Pages 899--907, 2013.
  • 4. C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
  • 5. D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. JMLR, 3:993--1022, 2003.
  • 6. J. Bobadilla, F. Ortega, A. Hernando, and A. Gutiérrez. Recommender Systems Survey. Knowledge Based Systems, 46:109--132, 2013.
  • 7. M. Chen, Z. E. Xu, K. Q. Weinberger, and F. Sha. Marginalized Denoising Autoencoders for Domain Adaptation. In ICML, Pages 767--774, 2012.
  • 8. T. Chen, W. Zhang, Q. Lu, K. Chen, Z. Zheng, and Y. Yu. SVDFeature: A Toolkit for Feature-based Collaborative Filtering. JMLR, 13:3619--3622, 2012.
  • 9. K. Georgiev and P. Nakov. A Non-iid Framework for Collaborative Filtering with Restricted Boltzmann Machines. In ICML, Pages 1148--1156, 2013.
  • 10. A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In ICML, Pages 369--376, 2006.
  • 11. G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving Neural Networks by Preventing Co-adaptation of Feature Detectors. CoRR, Abs/1207.0580, 2012.
  • 12. L. Hu, J. Cao, G. Xu, L. Cao, Z. Gu, and C. Zhu. Personalized Recommendation via Cross-domain Triadic Factorization. In WWW, Pages 595--606, 2013.
  • 13. Y. Hu, Y. Koren, and C. Volinsky. Collaborative Filtering for Implicit Feedback Datasets. In ICDM, Pages 263--272, 2008.
  • 14. M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. K. Saul. An Introduction to Variational Methods for Graphical Models. Machine Learning, 37(2):183--233, 1999.
  • 15. N. Kalchbrenner, E. Grefenstette, and P. Blunsom. A Convolutional Neural Network for Modelling Sentences. ACL, Pages 655--665, 2014.
  • 16. A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, Pages 1106--1114, 2012.
  • 17. K. Lang. Newsweeder: Learning to Filter Netnews. In ICML, Pages 331--339, 1995.
  • 18. W.-J. Li, D.-Y. Yeung, and Z. Zhang. Generalized Latent Factor Models for Social Network Analysis. In IJCAI, Pages 1705--1710, 2011.
  • 19. D. J. C. MacKay. A Practical Bayesian Framework for Backpropagation Networks. Neural Computation, 4(3):448--472, 1992.
  • 20. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. In NIPS, Pages 3111--3119, 2013.
  • 21. A. V. D. Oord, S. Dieleman, and B. Schrauwen. Deep Content-based Music Recommendation. In NIPS, Pages 2643--2651, 2013.
  • 22. S. Purushotham, Y. Liu, and C.-C. J. Kuo. Collaborative Topic Regression with Social Matrix Factorization for Recommendation Systems. In ICML, Pages 759--766, 2012.
  • 23. S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian Personalized Ranking from Implicit Feedback. In UAI, Pages 452--461, 2009.
  • 24. T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran. Low-rank Matrix Factorization for Deep Neural Network Training with High-dimensional Output Targets. In ICASSP, Pages 6655--6659, 2013.
  • 25. R. Salakhutdinov and G. E. Hinton. Deep Boltzmann Machines. In AISTATS, Pages 448--455, 2009.
  • 26. R. Salakhutdinov and G. E. Hinton. Semantic Hashing. Int. J. Approx. Reasoning, 50(7):969--978, 2009.
  • 27. R. Salakhutdinov and A. Mnih. Probabilistic Matrix Factorization. In NIPS, Pages 1257--1264, 2007.
  • 28. R. Salakhutdinov, A. Mnih, and G. E. Hinton. Restricted Boltzmann Machines for Collaborative Filtering. In ICML, Pages 791--798, 2007.
  • 29. S. G. Sevil, O. Kucuktunc, P. Duygulu, and F. Can. Automatic Tag Expansion Using Visual Similarity for Photo Sharing Websites. Multimedia Tools Appl., 49(1):81--99, 2010.
  • 30. A. P. Singh and G. J. Gordon. Relational Learning via Collective Matrix Factorization. In KDD, Pages 650--658, 2008.
  • 31. R. S. Strichartz. A Guide to Distribution Theory and Fourier Transforms. World Scientific, 2003.
  • 32. J. Tang, R. W. White, and P. Bailey. Recommending Interesting Activity-related Local Entities. In SIGIR, Pages 1161--1162, 2011.
  • 33. P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. JMLR, 11:3371--3408, 2010.
  • 34. S. Wager, S. Wang, and P. Liang. Dropout Training As Adaptive Regularization. In NIPS, Pages 351--359, 2013.
  • 35. C. Wang and D. M. Blei. Collaborative Topic Modeling for Recommending Scientific Articles. In KDD, Pages 448--456, 2011.
  • 36. H. Wang, B. Chen, and W.-J. Li. Collaborative Topic Regression with Social Regularization for Tag Recommendation. In IJCAI, Pages 2719--2725, 2013.
  • 37. H. Wang and W. Li. Relational Collaborative Topic Regression for Recommender Systems. TKDE, 27(5):1343--1355, 2015.
  • 38. H. Wang, X. Shi, and D. Yeung. Relational Stacked Denoising Autoencoder for Tag Recommendation. In AAAI, Pages 3052--3058, 2015.
  • 39. N. Wang and D.-Y. Yeung. Learning a Deep Compact Image Representation for Visual Tracking. In NIPS, Pages 809--817, 2013.
  • 40. X. Wang and Y. Wang. Improving Content-based and Hybrid Music Recommendation Using Deep Learning. In ACM MM, Pages 627--636, 2014.
  • 41. W. Zhang, H. Sun, X. Liu, and X. Guo. Temporal Qos-aware Web Service Recommendation via Non-negative Tensor Factorization. In WWW, Pages 585--596, 2014.
  • 42. K. Zhou and H. Zha. Learning Binary Codes for Collaborative Filtering. In KDD, Pages 498--506, 2012.

Author: Hao Wang, Naiyan Wang, Dit-Yan Yeung
Title: Collaborative Deep Learning for Recommender Systems
DOI: 10.1145/2783258.2783273
Year: 2015