2009 ReflectAndCorrect


Subject Headings: Collective Classification Task.

Notes

Cited By

Quotes

Author Key Words

Active inference, label acquisition, collective classification, viral marketing, information diffusion

Abstract

Information diffusion, viral marketing, graph-based semi-supervised learning, and collective classification all attempt to model and exploit the relationships among nodes in a network to improve the performance of node labeling algorithms. However, sometimes the advantage of exploiting the relationships can become a disadvantage. Simple models like label propagation and iterative classification can aggravate a misclassification by propagating mistakes in the network, while more complex models that define and optimize a global objective function, such as Markov random fields and graph mincuts, can misclassify a set of nodes jointly. This problem can be mitigated if the classification system is allowed to ask for the correct labels for a few of the nodes during inference. However, determining the optimal set of labels to acquire is intractable under relatively general assumptions, which forces us to resort to approximate and heuristic techniques. We describe three such techniques in this article. The first is based on directly approximating the value of the objective function of label acquisition and greedily acquiring the label that provides the most improvement. The second is a simpler technique based on an analogy we draw between viral marketing and label acquisition. Finally, we propose a method, which we refer to as reflect and correct, that can learn and predict when the classification system is likely to make mistakes and suggests acquisitions to correct those mistakes. We empirically show on a variety of synthetic and real-world datasets that the reflect and correct method significantly outperforms the other two techniques, as well as other approaches based on network structural measures such as node degree and network clustering.

1. INTRODUCTION

Information diffusion, viral marketing, graph-based semi-supervised learning, and collective classification all attempt to exploit relationships in a network to reason and make inferences about the labels of the nodes in the network. The common intuition is that knowing (or inferring) something about the label of a particular node can tell us something useful about the other nodes’ labels in the network. For instance, in many domains the labels of linked nodes tend to be correlated (not necessarily positively); hence, finding the correct label of a node is useful not only for that particular node, but also for the predictions made about the nodes in the rest of the network. Thus, it has been shown that methods such as collective classification, that is, classifying the nodes of a network simultaneously, can significantly outperform content-only classification methods, which make use of only the attributes of nodes and ignore the relationships between them [Chakrabarti et al. 1998; Neville and Jensen 2000; Getoor et al. 2001; Taskar et al. 2002; Lu and Getoor 2003a; Jensen et al. 2004; Macskassy and Provost 2007; Sen et al. 2008]. However, sometimes the advantage of exploiting the relationships can become a disadvantage. In addition to the typical errors made by content-only classification models (errors due to model limitations, noise in the data, etc.), collective classification models can also make mistakes by propagating misclassifications in the network. This can sometimes even have a domino effect, leading to misclassification of most of the nodes in the network. For example, consider a simple binary classification problem where an island of nodes that should be labeled positive is surrounded by a sea of negatively labeled nodes. The island may be flooded with the labels of the neighboring sea of negative nodes.

This flooding of the whole network (or part of it) can occur for simple models such as iterative classification [Lu and Getoor 2003a; Neville and Jensen 2000] and label propagation [Zhu and Ghahramani 2002]. A misclassification can be propagated to the rest of the network, especially if the misclassification is systematic and common, such as misclassifying nodes as belonging to the majority class. Flooding can also happen for more complex models that define a global objective function to be optimized. For example, for pairwise Markov random field models [Taskar et al. 2002] with parameter values that prefer intra-class interactions over inter-class interactions, the most probable configuration of the labels might be the one where most of the network is labeled with one class. Or, for a graph mincut formulation [Blum and Chawla 2001], where we pay a penalty for inter-class interactions, the best objective value might be achieved by assigning only one label to each connected component in the graph.
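
To make the flooding behavior concrete, the following is a minimal sketch (ours, not code from the paper) of label propagation on a toy graph in which a small positive island is surrounded by a clamped sea of negative nodes; the graph and all names are illustrative assumptions.

```python
import numpy as np

# Toy graph: nodes 0-2 form a "positive island"; nodes 3-9 are a sea
# whose labels are observed (clamped) as negative, encoded as 0.0.
edges = [(0, 1), (1, 2), (0, 2),                                   # island
         (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (3, 9),   # sea ring
         (0, 3), (0, 4), (1, 5), (1, 6), (2, 7), (2, 8)]           # island-sea links
n = 10
observed = {i: 0.0 for i in range(3, 10)}   # the sea's negative labels

neighbors = [[] for _ in range(n)]
for a, b in edges:
    neighbors[a].append(b)
    neighbors[b].append(a)

# Start the island at the (true) positive score 1.0.
scores = np.array([1.0] * 3 + [0.0] * 7)
for _ in range(20):                          # propagate until (near) convergence
    new = scores.copy()
    for i in range(n):
        if i in observed:
            new[i] = observed[i]             # clamp observed labels
        else:
            new[i] = np.mean([scores[j] for j in neighbors[i]])
    scores = new

print(scores[:3])  # ~[0, 0, 0]: the negative sea has flooded the island
```

With no observed positive labels to anchor it, the island's scores decay toward the sea's value; clamping even one correctly acquired island label changes the fixed point, which is exactly the leverage that label acquisition exploits.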

One strategy for avoiding flooding is to have an expert in the loop during inference, who can guide the inference and constrain the solution space in the right directions by providing the correct labels for a few nodes. Depending on the application, labels can be acquired by asking an expert to rate specific items, by providing free samples to a small set of customers and observing their viral networking or purchasing behavior, or by performing targeted laboratory experiments to determine protein functions. However, acquiring these additional labels is often costly, and we must typically operate within a given budget. As we show later, determining the optimal set of labels to acquire is intractable under relatively general assumptions. Therefore, we are forced to resort to approximate and heuristic techniques to get practical solutions.

2. PROBLEM FORMULATION

We begin by reviewing the collective classification problem and defining the objective function for label acquisition for collective classification. In this problem, we assume that our data is represented as a graph with nodes and edges, [math]\displaystyle{ G = (V, E) }[/math]. Each node [math]\displaystyle{ V_i \in V }[/math] is described by an attribute vector [math]\displaystyle{ \mathbf{X}_i }[/math] and class label [math]\displaystyle{ Y_i }[/math] pair, [math]\displaystyle{ V_i = \langle \mathbf{X}_i, Y_i \rangle }[/math]. [math]\displaystyle{ \mathbf{X}_i }[/math] is a vector of individual attributes [math]\displaystyle{ \langle X_{i1}, X_{i2}, \ldots, X_{ip} \rangle }[/math]. The domain of [math]\displaystyle{ X_{ij} }[/math] can be either discrete or continuous, whereas the domain of the class label [math]\displaystyle{ Y_i }[/math] is discrete and denoted [math]\displaystyle{ \{y_1, y_2, \ldots, y_m\} }[/math]. Each edge [math]\displaystyle{ E_{ij} \in E }[/math] describes some sort of relationship between its endpoints, [math]\displaystyle{ E_{ij} = \langle V_i, V_j \rangle }[/math]. Examples include the following (a minimal data-structure sketch follows the list):

    • Social Networks. Here, the nodes are people, the attributes may include demographic information such as age and income, and the edges are friendships. The labels indicate categories of people; for example, we may be interested in labeling the people that are likely to partake in some activity (e.g., smoking, IV drug use), have some disease (e.g., tuberculosis, obesity), or exhibit some behavior (e.g., buying a product, spreading a rumor).
    • Citation Networks. The nodes are publications, the attributes include content information, and the edges represent citations. The labels may be the topics of the publications, or an indication of the reputation of the paper, for example whether or not the paper is seminal.
    • Biological Networks. Here, for example, the nodes represent proteins, the attributes include annotations, and the edges represent interactions. In this domain, we may be interested in inferring protein function.
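
A minimal sketch of this node/edge representation (ours, not code from the paper; all class and field names are illustrative assumptions):

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Node:
    """A node V_i = <X_i, Y_i>: attribute vector plus (possibly unknown) label."""
    attributes: list[float]          # X_i = <X_i1, ..., X_ip>
    label: str | None = None         # Y_i, from {y_1, ..., y_m}; None if unobserved

@dataclass
class Graph:
    nodes: list[Node] = field(default_factory=list)
    edges: list[tuple[int, int]] = field(default_factory=list)  # E_ij = <V_i, V_j>

    def neighbors(self, i: int) -> list[int]:
        """Indices of nodes sharing an edge with node i."""
        return [b if a == i else a for a, b in self.edges if i in (a, b)]

# Example: a tiny citation network with one labeled and one unlabeled paper.
g = Graph(nodes=[Node([0.1, 0.9], label="ML"), Node([0.2, 0.8])],
          edges=[(0, 1)])
print(g.neighbors(1))  # -> [0]
```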

2.1 Collective Classification

In graph data, the labels of neighboring nodes are often correlated (though not necessarily positively correlated). For example, friends tend to have similar smoking behaviors, papers are likely to have similar topics to the papers that they cite, and proteins are likely to have complementary functions. Exploiting these correlations can significantly improve classification performance over using only the attributes, [math]\displaystyle{ \mathbf{X}_i }[/math], of the nodes. However, when predicting the label of a node, the labels of the related instances are also unknown and need to be predicted. Collective classification is the term used for simultaneously predicting the labels [math]\displaystyle{ \mathbf{Y} }[/math] of [math]\displaystyle{ V }[/math] in the graph [math]\displaystyle{ G }[/math], where [math]\displaystyle{ \mathbf{Y} }[/math] denotes the set of labels of all of the nodes, [math]\displaystyle{ \mathbf{Y} = \{Y_1, Y_2, \ldots, Y_n\} }[/math]. In general, the label [math]\displaystyle{ Y_i }[/math] of a node can be influenced by its own attributes [math]\displaystyle{ \mathbf{X}_i }[/math] as well as the labels [math]\displaystyle{ Y_j }[/math] and attributes [math]\displaystyle{ \mathbf{X}_j }[/math] of other nodes in the graph.

There are many collective classification models proposed to date that make different modeling assumptions about these dependencies. They can be grouped into two broad categories. In the first category, local collective classification models, the collective model consists of a collection of local vector-based classifiers, such as logistic regression. For this category of collective models, each object is described as a vector of its local attributes [math]\displaystyle{ \mathbf{X}_i }[/math] and an aggregation of the attributes and labels of its neighbors. Examples include Chakrabarti et al. (1998), Neville and Jensen (2000), Lu and Getoor (2003a), Macskassy and Provost (2007), and McDowell et al. (2007). The second category of collective classification models are global collective classification models. In this case, collective classification is defined as a global objective function to be optimized. In many cases, a relational graphical model defines a joint probability distribution over all the attributes and labels in the graph, which is learned and then optimized during inference. Examples of this category include conditional random fields [Lafferty et al. 2001], relational Markov networks [Taskar et al. 2002], probabilistic relational models [Getoor et al. 2002], and Markov logic networks [Richardson and Domingos 2006].

In this article, we use an example model from each category, which we explain briefly here. For the local collective classification model, we use the Iterative Classification Algorithm (ICA) [Neville and Jensen 2000; Lu and Getoor 2003a], and, for the global collective classification model, we use a pairwise Markov Random Field (MRF) based on the relational Markov network of Taskar et al. [2002]. We first introduce notation and assumptions common to both models and then describe the two approaches.

  • Let Ni denote the labels of the neighboring nodes of Vi, Ni = {Yj | ⟨Vi, Vj⟩ ∈ E}. A general assumption that is made is the Markov assumption that Yi is directly influenced only by Xi and Ni. Given the values of Ni, Yi is independent of Y \ Ni and is independent of X \ {Xi}, where X denotes the set of all attribute vectors in the graph, X = {X1, X2, . . ., Xn}. That is, once we know the values of Ni, then Yi is independent of the attribute vectors Xj of all neighbors and non-neighbors, and it is independent of the labels Yj of all non-neighbors (stated as an equation below).
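
In symbols, this Markov assumption (our restatement of the sentence above) is:

[math]\displaystyle{ P(Y_i \mid \mathbf{X}, \mathbf{Y} \setminus \{Y_i\}) = P(Y_i \mid \mathbf{X}_i, N_i) }[/math]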

2.1.1 Iterative Classification Algorithm (ICA).

In the ICA model, each node in the graph is represented as a vector that is a combination of node features, Xi, and features that are constructed using the labels of the node's immediate neighbors. Because each node can have a varying number of neighbors, we use an aggregation function over the neighbor labels in order to get a fixed-length vector representation. For example, the count aggregation constructs a fixed-size feature vector by counting how many of the neighbors belong to each label; other examples of aggregations include proportion, mode, etc. (a sketch of count aggregation follows this paragraph). Once the features are constructed, an off-the-shelf probabilistic classifier can be used to learn P(Yi | Xi, aggr(Ni)), where aggr is an aggregate function that converts a set of inputs into a fixed-length vector. One can use a single classifier to learn P(Yi | Xi, aggr(Ni)), or one can use a structured classifier to learn P(Yi | Xi) and P(Yi | aggr(Ni)) separately, which can then be combined in a variety of ways to compute P(Yi | Xi, aggr(Ni)).
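
A minimal sketch of the count aggregation described above (our illustration; the label set and all names are assumptions, not from the paper):

```python
from collections import Counter

LABELS = ["A", "B", "C"]  # assumed label domain {y1, ..., ym}

def count_aggregate(neighbor_labels):
    """Fixed-length feature vector: how many neighbors carry each label.

    Works for any number of neighbors, which is exactly why ICA uses
    an aggregation rather than the raw (variable-length) label list.
    """
    counts = Counter(neighbor_labels)
    return [counts.get(y, 0) for y in LABELS]

def build_features(x_i, neighbor_labels):
    """Concatenate node attributes X_i with aggregated neighbor labels."""
    return list(x_i) + count_aggregate(neighbor_labels)

# Three neighbors labeled A, A, C -> counts [2, 0, 1]
print(build_features([0.5, 1.2], ["A", "A", "C"]))  # [0.5, 1.2, 2, 0, 1]
```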

A key challenge for this approach is that, during inference, the labels of the neighboring instances are often not known. ICA addresses this issue, and performs collective classification, by using the predicted labels of the neighbors for feature construction. ICA iterates over all nodes, making a new prediction for each based on the predictions made for the unknown labels of the neighbors in the previous iteration; in the first step of the algorithm, initial labels can be inferred based solely on attribute information, or based on attribute information and any observed neighboring labels (a sketch of this inference loop follows). ICA learning is typically done using fully labeled training data; however, there are also approaches that use semi-supervised techniques for learning ICA [Lu and Getoor 2003b; Xiang and Neville 2008].
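
A minimal sketch of the ICA inference loop (ours, not the authors' implementation; it assumes a pre-trained classifier exposing a predict(feature_vector) -> label interface, and the Graph/build_features sketches from earlier are passed in):

```python
def ica_inference(graph, clf, build_features, n_iters=10):
    """Iterative Classification Algorithm inference (illustrative sketch).

    graph: object with .nodes (each having .attributes and .label, where
           label is None if unobserved) and .neighbors(i) -> indices.
    clf:   pre-trained classifier with predict(feature_vector) -> label.
    """
    # Bootstrap: predict unknown labels from attributes plus any observed
    # neighbor labels (unknown neighbors contribute nothing to the counts).
    predicted = {i: node.label for i, node in enumerate(graph.nodes)}
    for i, node in enumerate(graph.nodes):
        if predicted[i] is None:
            known = [predicted[j] for j in graph.neighbors(i)
                     if predicted[j] is not None]
            predicted[i] = clf.predict(build_features(node.attributes, known))

    # Iterate: re-predict each unobserved label using the predictions
    # made for its neighbors in the previous iteration.
    for _ in range(n_iters):
        for i, node in enumerate(graph.nodes):
            if graph.nodes[i].label is None:   # only re-predict unobserved nodes
                nbr_labels = [predicted[j] for j in graph.neighbors(i)]
                predicted[i] = clf.predict(
                    build_features(node.attributes, nbr_labels))
    return predicted
```

Note the danger the paper highlights: if the bootstrap misclassifies a node, that wrong label is fed into its neighbors' features on the next pass, which is how mistakes propagate.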

5. RELATED WORK

Substantial research has been done in the area of active learning [Cohn et al. 1994; Seung et al. 1992]. While active learning is related to label acquisition during inference, there are two fundamental differences between existing active learning work and active inference. The first is that the objectives of the two problems are different: the aim of active learning is to learn a good model, while active inference assumes that a learned model (or training data) already exists. The second difference is that most of the existing active learning work is on non-graph data, while active inference makes the most sense on graph data.

Nonetheless, there are many similarities between the two approaches, and some active learning techniques could be used for active inference as well. The closest active learning work to the approximate inference greedy acquisition (AIGA) method is that of Roy and McCallum [2001]. They estimate the usefulness of a label in terms of the expected error reduction; similarly, AIGA estimates the usefulness of a label in terms of the improvement in the value of the objective function. The closest works to reflect and correct (RAC) are uncertainty sampling [Lewis and Gale 1994] and query by committee [Seung et al. 1992; Freund et al. 1997]. Uncertainty sampling acquires the label that the underlying classifier is most uncertain about, where uncertainty is measured in terms of the posterior probability distribution of the label [Lewis and Gale 1994]. RAC, on the other hand, queries the label that the collective model is most likely to misclassify, and the possibility of misclassification is not measured in terms of the probability distribution of the label according to the collective model, but using a misclassification prediction classifier. Similarly, the query by committee work samples a committee of classifiers from the set of possible hypotheses using a sampling strategy like Gibbs sampling, and queries the label on which the disagreement of the committee members is highest [Seung et al. 1992; Freund et al. 1997]. We can think of the content-only classifier and the collective classifier as two committee members, though not in the same sense that Seung et al. [1992] intended (they do not belong to the same class of hypotheses). RAC capitalizes on the disagreement between the content-only classifier and the collective classifier, but this disagreement is only one of the three features that the misclassification prediction classifier uses.
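
To make the contrast concrete, here is a minimal sketch of the two selection flavors discussed above (ours; the toy posteriors and the disagreement-based stand-in for RAC's misclassification predictor are illustrative assumptions, not the paper's method):

```python
import numpy as np

def uncertainty_sampling(posteriors):
    """Pick the node whose predicted label distribution is least confident.

    posteriors: (n_nodes, n_labels) array of P(Y_i = y) from the model.
    """
    confidence = posteriors.max(axis=1)   # max-probability confidence per node
    return int(confidence.argmin())       # least confident node

def disagreement_pick(content_only_preds, collective_preds, scores):
    """Crude stand-in for one RAC ingredient: among nodes where the
    content-only and collective classifiers disagree, pick the one that a
    (hypothetical) misclassification predictor scores highest."""
    disagree = [i for i, (a, b) in
                enumerate(zip(content_only_preds, collective_preds)) if a != b]
    if not disagree:                      # no disagreement: fall back to scores
        return int(np.argmax(scores))
    return max(disagree, key=lambda i: scores[i])

posteriors = np.array([[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]])
print(uncertainty_sampling(posteriors))  # -> 1 (closest to 50/50)
```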

Another area related to active inference is viral (or targeted) marketing [Richardson and Domingos 2002; Kempe et al. 2003; Leskovec et al. 2007; Provost et al. 2007], where a subset of customers needs to be selected for targeted advertisement so as to maximize product sales. We showed how viral marketing is related to label acquisition and used Richardson and Domingos's model [2002] to compare against. Other models could very well be used and compared against; one of the reasons we chose [Richardson and Domingos 2002] is that its exact solution is tractable. The work on feature-value acquisition during testing [Sheng and Ling 2006; Bilgic and Getoor 2007] is closely related to the label acquisition problem; however, the focus in that field has been on acquiring feature values, not labels.

The most closely related work that we are aware of is that of Rattigan et al. [2007]. They are the first to directly describe label acquisition during inference. They compared different acquisition methods based on network structural measures, such as degree and betweenness, and they suggested a method based on clustering the network. They showed empirically that the clustering method performed best. They assumed that the nodes did not have any attributes; thus, their method did not require any training data. We made different assumptions about the data; that is, the nodes have attributes and we have training data available.

References

  • 1. Bilgic, M. and Getoor, L. (2007). VOILA: Efficient feature-value acquisition for classification. In: Proceedings of the AAAI Conference on Artificial Intelligence. 1225--1230.
  • 2. Bilgic, M. and Getoor, L. (2008). Effective label acquisition for collective classification. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, 43--51.
  • 3. Blum, A. and Chawla, S. (2001). Learning from labeled and unlabeled data using graph mincuts. In: Proceedings of the International Conference on Machine Learning. 19--26.
  • 4. Chakrabarti, S., Dom, B., and Indyk, P. (1998). Enhanced hypertext categorization using hyperlinks. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 307--318.
  • 5. Chapelle, O., Schölkopf, B., and Zien, A., Eds. (2006). Semi-Supervised Learning. MIT Press, Cambridge, MA.
  • 6. Cohn, D., Atlas, L., and Ladner, R. (1994). Improving generalization with active learning. Mach. Learn. 15, 2, 201--221.
  • 7. Cohn, D. A., Ghahramani, Z., and Jordan, M. I. (1996). Active learning with statistical models. J. Artif. Intell. Res. 4, 129--145.
  • 8. Freund, Y., Seung, H. S., Shamir, E., and Tishby, N. (1997). Selective sampling using the query by committee algorithm. Mach. Learn. 28, 2, 133--168.
  • 9. Getoor, L., Friedman, N., Koller, D., and Taskar, B. (2002). Learning probabilistic models of link structure. J. Mach. Learn. Res. 3, 679--707.
  • 10. Getoor, L., Segal, E., Taskar, B., and Koller, D. (2001). Probabilistic models of text and link structure for hypertext classification. In: Proceedings of the IJCAI Workshop on Text Learning: Beyond Supervision. ACM, New York, 24--29.
  • 11. Giles, C. L., Bollacker, K. D., and Lawrence, S. (1998). CiteSeer: An automatic citation indexing system. In: Proceedings of the ACM Conference on Digital Libraries. ACM, New York, 89--98.
  • 12. Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in Practice. Interdisciplinary Statistics. Chapman & Hall/CRC.
  • 13. Howard, R. A. (1966). Information value theory. IEEE Trans. Syst. Sci. Cybernet. 2, 1, 22--26.
  • 14. Jensen, D., Neville, J., and Gallagher, B. (2004). Why collective inference improves relational classification. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, 593--598.
  • 15. Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Mach. Learn. 37, 2, 183--233.
  • 16. Kempe, D., Kleinberg, J., and Tardos, E. (2003). Maximizing the spread of influence through a social network. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, 137--146.
  • 17. Krause, A. and Guestrin, C. (2005). Optimal nonmyopic value of information in graphical models — efficient algorithms and theoretical limits. In: Proceedings of the International Joint Conference on Artificial Intelligence. 1339--1345.
  • 18. Lafferty, J. D., McCallum, A., and Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the International Conference on Machine Learning. 282--289.
  • 19. Leskovec, J., Adamic, L. A., and Huberman, B. A. (2007). The dynamics of viral marketing. ACM Trans. Web 1, 1, 5.
  • 20. Leskovec, J., Kleinberg, J., and Faloutsos, C. (2007). Graph evolution: Densification and shrinking diameters. ACM Trans. Knowl. Discov. Data 1, 1, 177--187.
  • 21. Lewis, D. D. and Gale, W. A. (1994). A sequential algorithm for training text classifiers. In: Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 3--12.
  • 22. Lu, Q. and Getoor, L. (2003a). Link-based classification. In: Proceedings of the International Conference on Machine Learning. 496--503.
  • 23. Lu, Q. and Getoor, L. (2003b). Link-based classification using labeled and unlabeled data. In: Proceedings of the ICML Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining.
  • 24. Macskassy, S. and Provost, F. (2003). A simple relational classifier. In: Proceedings of the ACM Workshop on Multi-Relational Data Mining. ACM, New York.
  • 25. Macskassy, S. and Provost, F. (2007). Classification in networked data: A toolkit and a univariate case study. J. Mach. Learn. Res. 8, 935--983.
  • 26. McCallum, A. and Nigam, K. (1998). Employing EM and pool-based active learning for text classification. In: Proceedings of the International Conference on Machine Learning. 350--358.
  • 27. McCallum, A. K., Nigam, K., Rennie, J., and Seymore, K. (2000). Automating the construction of internet portals with machine learning. Inf. Retrieval 3, 2, 127--163.
  • 28. McDowell, L., Gupta, K. M., and Aha, D. W. (2007). Cautious inference in collective classification. In: Proceedings of the AAAI Conference on Artificial Intelligence. 596--601.
  • 29. Melville, P. and Mooney, R. J. (2004). Diverse ensembles for active learning. In: Proceedings of the International Conference on Machine Learning. 584--591.
  • 30. Neville, J. and Jensen, D. (2000). Iterative classification in relational data. In: Proceedings of the SRL Workshop in AAAI.
  • 31. Newman, M. E. J. (2003). Mixing patterns in networks. Phys. Rev. E 67, 2, 026126.
  • 32. Provost, F., Melville, P., and Saar-Tsechansky, M. (2007). Data acquisition and cost-effective predictive modeling: targeting offers for electronic commerce. In: Proceedings of the ACM International Conference on Electronic Commerce. 389--398.
  • 33. Rattigan, M., Maier, M., and Jensen, D. (2007). Exploiting network structure for active inference in collective classification. In: Proceedings of the ICDM Workshop on Mining Graphs and Complex Structures. 429--434.
  • 34. Richardson, M. and Domingos, P. (2002). Mining knowledge-sharing sites for viral marketing. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, 61--70.
  • 35. Richardson, M. and Domingos, P. (2006). Markov logic networks. Mach. Learn. 62, 1-2, 107--136.
  • 36. Roy, N. and McCallum, A. (2001). Toward optimal active learning through sampling estimation of error reduction. In: Proceedings of the International Conference on Machine Learning. 441--448.
  • 37. Saar-Tsechansky, M. and Provost, F. (2004). Active sampling for class probability estimation and ranking. Mach. Learn. 54, 2, 153--178.
  • 38. Sen, P., Namata, G. M., Bilgic, M., Getoor, L., Gallagher, B., and Eliassi-Rad, T. (2008). Collective classification in network data. AI Magazine 29, 3.
  • 39. Seung, H. S., Opper, M., and Sompolinsky, H. (1992). Query by committee. In: Proceedings of the ACM Annual Workshop on Computational Learning Theory. ACM, New York, 287--294.
  • 40. Sheng, V. S. and Ling, C. X. (2006). Feature value acquisition in testing: a sequential batch test algorithm. In: Proceedings of the International Conference on Machine Learning. 809--816.
  • 41. Taskar, B., Abbeel, P., and Koller, D. (2002). Discriminative probabilistic models for relational data. In: Proceedings of the Annual Conference on Uncertainty in Artificial Intelligence. 485--492.
  • 42. Tong, S. and Koller, D. (2002). Support vector machine active learning with applications to text classification. J. Mach. Learn. Res. 2, 45--66.
  • 43. Xiang, R. and Neville, J. (2008). Pseudolikelihood EM for within-network relational learning. In: Proceedings of the IEEE International Conference on Data Mining. IEEE Computer Society Press, Los Alamitos, CA, 1103--1108.
  • 44. Yedidia, J., Freeman, W. T., and Weiss, Y. (2000). Generalized belief propagation. In: Advances in Neural Information Processing Systems. 689--695.
  • 45. Zadrozny, B. and Elkan, C. (2001). Learning and making decisions when costs and probabilities are both unknown. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 204--213.
  • 46. Zhu, X. and Ghahramani, Z. (2002). Learning from labeled and unlabeled data with label propagation. Tech. Rep. CMU-CALD-02-107, Carnegie Mellon University.
  • 47. Zhu, X., Ghahramani, Z., and Lafferty, J. D. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the International Conference on Machine Learning. 912--919.

Citation:
Mustafa Bilgic, Lise Getoor (2009). "Reflect and Correct: A Misclassification Prediction Approach to Active Inference." ACM Transactions on Knowledge Discovery from Data. DOI: 10.1145/1631162.1631168. URL: http://waimea.cs.umd.edu:8080/basilic/web/Publications/2009/bilgic:tkdd09/bilgic-tkdd09.pdf