2009 NonparametricLatentFeatureModel

(Miller et al., 2009) ⇒ Kurt T Miller, Thomas L Griffiths, and Michael I Jordan. (2009). “Nonparametric Latent Feature Models for Link Prediction..” In: Advances in Neural Information Processing Systems 22 (NIPS 2009).

Subject Headings: Latent Factor Model, Nonparametric Latent Model.

Notes

Cited By

http://scholar.google.com/scholar?q=%222009%22+Nonparametric+Latent+Feature+Models+for+Link+Prediction.

Quotes

Abstract

As the availability and importance of relational data -- such as the friendships summarized on a social networking website -- increases, it becomes increasingly important to have good models for such data. The kinds of latent structure that have been considered for use in predicting links in such networks have been relatively limited. In particular, the machine learning community has focused on latent class models, adapting nonparametric Bayesian methods to jointly infer how many latent classes there are while learning which entities belong to each class. We pursue a similar approach with a richer kind of latent variable -- latent features -- using a nonparametric Bayesian technique to simultaneously infer the number of features at the same time we learn which entities have each feature. The greater expressiveness of this approach allows us to improve link prediction on three datasets.

1 Introduction

Statistical analysis of social networks and other relational data has been an active area of research for over seventy years and is becoming an increasingly important problem as the scope and availability of social network datasets increase [1]. In these problems, we observe the interactions between a set of entities and we wish to extract informative representations that are useful for making predictions about the entities and their relationships. One basic challenge is link prediction, where we observe the relationships (or “links”) between some pairs of entities in a network (or “graph”) and we try to predict unobserved links. For example, in a social network, we might only know some subset of people are friends and some are not, and seek to predict which other people are likely to get along. Our goal is to improve the expressiveness and performance of generative models based on extracting latent structure representing the properties of individual entities from the observed data, so we will focus on these kinds of models. This rules out approaches like the popular [math]\displaystyle{ p^* }[/math] model that uses global quantities of the graph, such as how many edges or triangles are present [2, 3]. Of the approaches that do link prediction based on attributes of the individual entities, these can largely be classified into class-based and feature-based approaches. There are many models that can be placed under these approaches, so we will focus on the models that are most comparable to our approach.

Most generative models using a class-based representation are based on the stochastic blockmodel, introduced in [4] and further developed in [5]. In the most basic form of the model, we assume there are a finite number of classes that entities can belong to and that these classes entirely determine the structure of the graph, with the probability of a link existing between two entities depending only on the classes of those entities. In general, these classes are unobserved, and inference reduces to assigning entities to classes and inferring the class interactions.

One of the important issues that arise in working with this model is determining how many latent classes there are for a given problem. The Infinite Relational Model (IRM) [6 ] used methods from nonparametric Bayesian statistics to tackle this problem, allowing the number of classes to be determined at inference time. The Infinite Hidden Relational Model [ 7] further elaborated on this model and the Mixed Membership Stochastic Blockmodel (MMSB) [8] extended it to allow entities to have mixed memberships. All these class-based models share a basic limitation in the kinds of relational structure they naturally capture. For example, in a social network, we might find a class which contains “male high school athletes” and another which contains “male high school musicians.” We might believe these two classes will behave similarly, but with a class-based model, our options are to either merge the classes or duplicate our knowledge about common aspects of them. In a similar vein, with a limited amount of data, it might be reasonable to combine these into a single class “male high school students,” but with more data we would want to split this group into athletes and musicians. For every new attribute like this that we add, the number of classes would potentially double, quickly leading to an overabundance of classes. In addition, if someone is both an athlete and a musician, we would either have to add another class for that or use a mixed membership model, which would say that the more a student is an athlete, the less he is a musician.

An alternative approach that addresses this problem is to use features to describe the entities. There could be a separate feature for “high school student,” “male,” “athlete,” and “musician” and the presence or absence of each of these features is what defines each person and determines their relationships. One class of latent-feature models for social networks has been developed by [9, 10, 11], who proposed real-valued vectors as latent representations of the entities in the network where depending on the model, either the distance, inner product, or weighted combination of the vectors corresponding to two entities affects the likelihood of there being a link between them. However, extending our high school student example, we might hope that instead of having arbitrary real-valued features (which are still useful for visualization), we would infer binary features where each feature could correspond to an attribute like “male” or “athlete.” Continuing our earlier example, if we had a limited amount of data, we might not pick up on a feature like “athlete.” However, as we observe more interactions, this could emerge as a clear feature. Instead of doubling the numbers of classes in our model, we simply add an additional feature. Determining the number of features will therefore be of extreme importance.

In this paper, we present the nonparametric latent feature relational model, a Bayesian nonparametric model in which each entity has binary-valued latent features that influences its relations. In addition, the relations depend on a set of known covariates. This model allows us to simultaneously infer how many latent features there are while at the same time inferring what features each entity has and how those features influence the observations. This model is strictly more expressive than the stochastic blockmodel. In Section 2, we describe a simplified version of our model and then the full model. In Section 3, we discuss how to perform inference. In Section 4, we illustrate the properties of our model using synthetic data and then show that the greater expressiveness of the latent feature representation results in improved link prediction on three real datasets. Finally, we conclude in Section 5.

2 The nonparametric latent feature relational model

Assume we observe the directed relational links between a set of N entities. Let Y be the N × N binary matrix that contains these links. That is, let [math]\displaystyle{ y_{ij} \equiv Y (i, j) = 1 }[/math] if we observe a link from entity i to entity j in that relation and y_{ij} = 0 if we observe that there is not a link. Unobserved links are left unfilled. Our goal will be to learn a model from the observed links such that we can predict the values of the unfilled entries.

2.1 Basic model

In our basic model, each entity is described by a set of binary features. We are not given these features a priori and will attempt to infer them. We assume that the probability of having a link from one entity to another is entirely determined by the combined effect of all pairwise feature interactions. If there are K features, then let Z be the N × K binary matrix where each row corresponds to an entity and each column corresponds to a feature such that Z_ik \equiv Z (i, k) = 1 if the ith entity has feature k and Z_ik = 0 otherwise. and let Z_i denote the feature vector corresponding to entity i. Let W be a K × K real-valued weight matrix where wkk0 \equiv W (k, k0) is the weight that affects the probability of there being a link from entity i to entity j if both entity i has feature k and entity j has feature k0.

We assume that links are independent conditioned on Z and W, and that only the features of entities i and j influence the probability of a link between those entities. This defines the likelihood

[math]\displaystyle{ Pr (Y|Z, W) = Y i, j Pr (y_{ij}|Z_i, Z_j, W) (1) }[/math] where the product ranges over all pairs of entities.

Given the feature matrix Z and weight matrixW, the probability that there is a link from entity i to entity j is

[math]\displaystyle{ Pr (y_{ij} = 1|Z, W) = \sigma “Z_iWZ\gt j ” = \sigma 0 @ X k, k0 Z_ikZ_jk0wkk0 1 A (2) }[/math] where \sigma (·) is a function that transforms values on (- 1, 1) to (0, 1) such as the sigmoid function</math> \sigma (x) = 1 1 + exp (- x)</math> or the probit function \sigma (x) = \ Phi (x).

An important aspect of this model is that all-zero columns of Z do not affect the likelihood. We will take advantage of this in Section 2.2.

This model is very flexible. With a single feature per entity, it is equivalent to a stochastic blockmodel. However, since entities can have more than a single feature, the model is more expressive. In the high school student example, each feature can correspond to an attribute like “male,” “musician,” and “athlete.” If we were looking at the relation “friend of” (not necessarily symmetric !), then the weight at the (athlete, musician) entry of would correspond to the weight that an athlete would be a friend of a musician. A positive weight would correspond to an increased probability, a negative weight a decreased probability, and a zero weight would indicate that there is no correlation between those two features and the observed relation. The more positively correlated features people have, the more likely they are to be friends. Another advantage of this representation is that if our data contained observations of students in two distant locations, we could have a geographic feature for the different locations. While other features such as “athlete” or “musician” might indicate that one person could be a friend of another, the geographic features could have extremely negative weights so that people who live far from each other are less likely to be friends. However, the parameters for the non-geographic features would still be tied for all people, allowing us to make stronger inferences about how they influence the relations. Class-based models would need an abundance of classes to capture these effects and would not have the same kind of parameter sharing.

Given the full set of observations Y, we wish to infer the posterior distribution of the feature matrix Z and the weights W. We do this using Bayes’ theorem, [math]\displaystyle{ p (Z, W|Y) / p (Y|Z, W) p (Z) p (W) }[/math], where we have placed an independent prior on Z and W. Without any prior knowledge about the features or their weights, a natural prior for W involves placing an independent [math]\displaystyle{ N (0, \sigma 2w) }[/math] prior on each wij. However, placing a prior on Z is more challenging. If we knew how many features there were, we could place an arbitrary parametric prior on Z. However, we wish to have a flexible prior that allows us to simultaneously infer the number of features at the same time we infer all the entries in Z. The Indian Buffet Process is such a prior.

2.2 The Indian Buffet Process and the basic generative model

As mentioned in the previous section, any features which are all-zero do not affect the likelihood. That means that even if we added an infinite number of all-zero features, the likelihood would remain the same. The Indian Buffet Process (IBP) [12] is a prior on infinite binary matrices such that with probability one, a feature matrix drawn from it for a finite number of entities will only have a finite number of non-zero features. Moreover, any feature matrix, no matter how many non-zero features it contains, has positive probability under the IBP prior. It is therefore a useful nonparametric prior to place on our latent feature matrix Z.

The [[generative process to sample matrices from the IBP can be described through a culinary metaphor that gave the IBP its name. In this metaphor, each row of Z corresponds to a diner at an Indian buffet and each column corresponds to a dish at the infinitely long buffet. If a customer takes a particular dish, then the entry that corresponds to the customer’s row and the dish’s column is a one and the entry is zero otherwise. The culinary metaphor describes how people choose the dishes. In the IBP, the first customer chooses a Poisson (\alpha) number of dishes to sample, where \alpha is a parameter of the IBP. The ith customer tries each previously sampled dish with probability proportional to the number of people that have already tried the dish and then samples a Poisson (\alpha / i) number of new dishes. This process is exchangeable, which means that the order in which the customers enter the restaurant does not affect the configuration of the dishes that people try (up to permutations of the dishes as described in [12]). This insight leads to a straightforward Gibbs sampler to do posterior inference that we describe in Section 3.

Using an IBP prior on Z, our basic generative latent feature relational model is:

[math]\displaystyle{ Z \sim IBP( \alpha ) }[/math].

[math]\displaystyle{ wkk0 \sim N(0, \sigma 2w ) }[/math] for all k, k0 for which features k and k0 are non-zero

[math]\displaystyle{ y_{ij} \sim \sigma ?? Z_iWZ\gt j }[/math] for each observation.

2.3 Full nonparametric latent feature relational model

We have described the basic nonparametric latent feature relational model. We now combine it with ideas from the social network community to get our full model. First, we note that there are many instances of logit models used in statistical network analysis that make use of covariates in link prediction [2]. Here we will focus on a subset of ideas discussed in [10]. …

References

[1] Stanley-Wasserman and Katherine Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
[2] Stanley Wasserman and Philippa Pattison. Logit models and logistic regressions for social networks: I. an introduction to Markov random graphs and p*. Psychometrika, 61(3):401–425, 1996.
[3] Garry Robins, Tom Snijders, PengWang, Mark Handcock, and Philippa Pattison. Recent developments in exponential random graph (p*) models for social networks. Social Networks, 29(2):192–215, May 2007.
[4] Yuchung J. Wang and George Y. Wong. Stochastic blockmodels for directed graphs. Journal of the American Statistical Association, 82(397):8–19, 1987.
[5] Krzysztof Nowicki and Tom A. B. Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455):1077–1087, 2001.
[6] Charles Kemp, Joshua B. Tenenbaum, Thomas L. Griffiths, Takeshi Yamada, and Naonori Ueda. Learning systems of concepts with an infinite relational model. In: Proceedings of the American Association for Artificial Intelligence (AAAI), 2006.
[7] Zhao Xu, Volker Tresp, Kai Yu, and Hans-Peter Kriegel. Infinite hidden relational models. In: Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI), 2006.
[8] Edoardo M. Airoldi, David M. Blei, Eric P. Xing, and Stephen E. Fienberg. Mixed membership stochastic block models. In D. Koller, Y. Bengio, D. Schuurmans, and L. Bottou, editors, Advances in Neural Information Processing Systems (NIPS) 21. Red Hook, NY: Curran Associates, 2009.
[9] Peter D. Hoff, Adrian E. Raftery, and Mark S. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090–1098, 2002.
[10] Peter D. Hoff. Bilinear mixed-effects models for dyadic data. Journal of the American Statistical Association, 100(469):286–295, 2005.
[11] Peter D. Hoff. Multiplicative latent factor modelsfor description and prediction of social networks. Computational and Mathematical Organization Theory, 2008.
[12] Thomas L. Griffiths and Zoubin Ghahramani. Infinite latent feature models and the Indian Buffet Process. In Y. Weiss, B. Sch¨olkopf, and J. Platt, editors, Advances in Neural Information Processing Systems (NIPS) 18. Cambridge, MA: MIT Press, 2006.
[13] Edward Meeds, Zoubin Ghahramani, Radford Neal, and Sam Roweis. Modeling dyadic data with binary latent factors. In B. Sch¨olkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems (NIPS) 19. Cambridge, MA: MIT Press, 2007.
[14] Daniel L. Navarro and Thomas L. Griffiths. Latent features in similarity judgment: A nonparametric Bayesian approach. Neural Computation, 20(11):2597–2628, 2008.
[15] Christian P. Robert and George Casella. Monte Carlo Statistical Methods. Springer, 2004.
[16] Radford M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
[17] James H. Albert and Siddhartha Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–679, 1993.
[18] Dilan G¨or¨ur, Frank J¨akel, and Carl Edward Rasmussen. A choice model with infinitely many latent features. In: Proceedings of the 23rd International Conference on Machine learning (ICML), 2006.
[19] Rudolph J. Rummel. Dimensionality of nations project: Attributes of nations and behavior of nation dyads, 1950–1965. ICPSR data file, 1999.
[20] Woodrow W. Denham. The Detection of Patterns in Alyawarra Nonverbal Behavior. PhD thesis, University of Washington, 1973.
[21] Jin Huang and Charles X. Ling. Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 17(3):299–310, 2005.
[22] Amir Globerson, Gal Chechik, Fernando Pereira, and Naftali Tishby. Euclidean embedding of cooccurrence data. The Journal of Machine Learning Research, 8:2265–2295, 2007.

;

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2009 NonparametricLatentFeatureModel	Thomas L. Griffiths Kurt T Miller Michael I. Jordan			Nonparametric Latent Feature Models for Link Prediction.						2009