- (Taskar et al., 2003) ⇒ Ben Taskar, Carlos Guestrin, Daphne Koller. (2003). “Max-Margin Markov Networks.” In: Advances in Neural Information Processing Systems (NIPS 2003).
Subject Headings: Max-Margin Markov Networks.
In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches.
In supervised classification, our goal is to classify instances into some set of discrete categories. Recently, support vector machines (SVMs) have demonstrated impressive successes on a broad range of tasks, including document categorization, character recognition, image classification, and many more. SVMs owe a great part of their success to their ability to use kernels, allowing the classifier to exploit a very high-dimensional (possibly even infinite-dimensional) feature space. In addition to their empirical success, SVMs are also appealing due to the existence of strong generalization guarantees, derived from the margin-maximizing properties of the learning algorithm.
However, many supervised learning tasks exhibit much richer structure than a simple categorization of instances into one of a small number of classes. In some cases, we might need to label a set of inter-related instances. For example: optical character recognition (OCR) or part-of-speech tagging both involve labeling an entire sequence of elements into some number of classes; image segmentation involves labeling all of the pixels in an image; and collective webpage classification involves labeling an entire set of interlinked webpages. In other cases, we might want to label an instance (e.g., a news article) with multiple non-exclusive labels. In both of these cases, we need to assign multiple labels simultaneously, leading to a classification problem that has an exponentially large set of joint labels. A common solution is to treat such problems as a set of independent classification tasks, dealing with each instance in isolation. However, it is well-known that this approach fails to exploit significant amounts of correlation information .
An alternative approach is offered by the probabilistic framework, and specifically by probabilistic graphical models. In this case, we can define and learn a joint probabilistic model over the set of label variables. For example, we can learn a hidden Markov model, or a conditional random field (CRF)  over the labels and features of a sequence, and then use a probabilistic inference algorithm (such as the Viterbi algorithm) to classify these instances collectively, finding the most likely joint assignment to all of the labels simultaneously. This approach has the advantage of exploiting the correlations between the different labels, often resulting in significant improvements in accuracy over approaches that classify instances independently [7, 10]. The use of graphical models also allows problem structure to be exploited very effectively. Unfortunately, even probabilistic graphical models that are trained discriminatively do not usually achieve the same level of generalization accuracy as SVMs, especially when kernel features are used. Moreover, they are not (yet) associated with generalization bounds comparable to those of margin-based classifiers.
We present a discriminative framework for labeling and segmentation of structured data such as sequences, images, etc. Our approach seamlessly integrates state-of-the-art kernel methods developed for classification of independent instances with the rich language of graphical models that can exploit the structure of complex data. In our experiments with the OCR task, for example, our sequence model significantly outperforms other approaches by incorporating high-dimensional decision boundaries of polynomial kernels over character images while capturing correlations between consecutive characters. We construct our models by solving a convex quadratic program that maximizes the per-label margin. Although the number of variables and constraints of our QP formulation is polynomial in the example size (e.g., sequence length), we also address its quadratic growth using an effective optimization procedure inspired by SMO. We provide theoretical guarantees on the average per-label generalization error of our models in terms of the training set margin. Our generalization bound significantly tightens previous results of Collins  and suggests possibilities for analyzing per-label generalization properties of graphical models. For brevity, we simplified our presentation of graphical models to only pairwise Markov networks. Our formulation and generalization bound easily extend to interaction patterns involving more than two labels (e.g., higher-order Markov models). Overall, we believe that M3 networks will significantly further the applicability of high accuracy margin-based methods to real-world structured data.
-  Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden markov support vector machines. In: Proceedings. ICML, 2003.
-  D. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1999.
-  M. Collins. Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. In IWPT, 2001.
-  R.G. Cowell, A.P. Dawid, S.L. Lauritzen, and D.J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer, New York, 1999.
-  K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernelbased vector machines. Journal of Machine Learning Research, 2(5):265–292, 2001.
-  R. Kassel. A Comparison of Approaches to On-line Handwritten Character Recognition. PhD thesis, MIT Spoken Language Systems Group, 1995.
-  J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings. ICML01, 2001.
-  J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
-  J. Platt. Using sparseness and analytic QP to speed training of support vector machines. In NIPS, 1999.
-  Ben Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In: Proceedings. UAI02, Edmonton, Canada, 2002.
-  V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
-  J. Yedidia, W. Freeman, and Yair Weiss. Generalized belief propagation. In NIPS, 2000.
-  T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.,