1999 SupportVectorLearningforOrdinalRegression


Subject Headings:

Notes

Cited By

2007

Quotes

Abstract

We investigate the problem of predicting variables of ordinal scale. This task is referred to as ordinal regression and is complementary to the standard machine learning tasks of classification and metric regression. In contrast to statistical models we present a distribution independent formulation of the problem together with uniform bounds of the risk functional. The approach presented is based on a mapping from objects to scalar utility values. Similar to support vector methods we derive a new learning algorithm for the task of ordinal regression based on large margin rank boundaries. We give experimental results for an information retrieval task: learning the order of documents with respect to an initial query. Experimental results indicate that the presented algorithm outperforms more naive approaches to ordinal regression such as support vector classification and support vector regression in the case of more than two ranks.

1. Introduction

Problems of ordinal regression arise in many fields, e.g., in information retrieval (Herbrich et al. 1998), in econometric models (Tangian and Gruber 1995), and in classical statistics (McCullagh 1980; Anderson 1984). They can be related to the standard machine learning paradigm as follows: Given an i.i.d. sample [math]\displaystyle{ S = \{(x_i, y_i)\}_{i=1}^\ell \sim P^\ell_{XY} }[/math] on [math]\displaystyle{ X \times Y }[/math] and a set [math]\displaystyle{ H }[/math] of mappings [math]\displaystyle{ h }[/math] from [math]\displaystyle{ X }[/math] to [math]\displaystyle{ Y }[/math], a learning procedure selects one mapping [math]\displaystyle{ h_\ell }[/math] such that — using a predefined loss [math]\displaystyle{ l: Y \times Y \to \R }[/math] — the risk functional [math]\displaystyle{ R(h_\ell) }[/math] is minimized. Typically, in machine learning the risk functional [math]\displaystyle{ R(h) }[/math] under consideration is the expectation value of the loss [math]\displaystyle{ l(y, h(x)) }[/math], i.e., the loss at each point [math]\displaystyle{ (x, y) }[/math] weighted by its (unknown) probability [math]\displaystyle{ P_{XY}(x, y) }[/math]. Using the principle of Empirical Risk Minimization (ERM), one chooses that function [math]\displaystyle{ h_\ell }[/math] which minimizes the mean of the loss [math]\displaystyle{ R_{emp}(h_\ell) }[/math] given the sample [math]\displaystyle{ S }[/math]. Two main scenarios were considered in the past: (i) If [math]\displaystyle{ Y }[/math] is a finite unordered set (nominal scale), the task is referred to as classification. Since [math]\displaystyle{ Y }[/math] is unordered, the 0-1 loss, i.e., [math]\displaystyle{ l_{0-1}(y,\hat{y}) = 0 }[/math] iff [math]\displaystyle{ y = \hat{y} }[/math], and [math]\displaystyle{ l_{0-1}(y,\hat{y}) = 1 }[/math] iff [math]\displaystyle{ y \neq \hat{y} }[/math], is adequate to capture the loss at each point [math]\displaystyle{ (x, y) }[/math]. (ii) If [math]\displaystyle{ Y }[/math] is a metric space, e.g., the set of real numbers, the task is referred to as regression estimation. In this case the loss function can take into account the full metric structure (see Smola (1998) for a detailed discussion of loss functions for regression).
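To make the two standard scenarios concrete, the following Python sketch (not from the paper) shows the 0-1 loss used for classification, a metric loss used for regression estimation, and the empirical risk as the mean loss over the sample; the function names, the toy hypothesis, and the choice of squared loss are illustrative assumptions.

```python
import numpy as np

def zero_one_loss(y, y_hat):
    """0-1 loss for classification: 0 iff y == y_hat, else 1."""
    return 0.0 if y == y_hat else 1.0

def squared_loss(y, y_hat):
    """One example of a metric loss for regression estimation."""
    return (y - y_hat) ** 2

def empirical_risk(h, sample, loss):
    """R_emp(h) = mean of loss(y_i, h(x_i)) over the sample S = {(x_i, y_i)}."""
    return float(np.mean([loss(y, h(x)) for x, y in sample]))

# Illustrative usage on a toy sample:
S = [(1.0, 2), (2.0, 3), (3.0, 3)]
h = lambda x: int(round(x)) + 1                 # a toy hypothesis
print(empirical_risk(h, S, zero_one_loss))      # -> 0.333...
```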

In ordinal regression, we consider a problem which shares properties of both classification (i) and metric regression (ii). Like in (i) Y is a finite set and like in (ii) there exists an ordering among the elements of Y. A variable of the above type exhibits an ordinal scale and can be thought of as the result of coarse measurement of a continuous variable (Anderson 1984). The ordinal scale leads to problems in defining an appropriate loss function for our task (see McCullagh 1980). In Section 2 we present a distribution independent model for ordinal regression, which is based on a loss function that acts on pairs of ranks. We give explicit uniform convergence bounds for the proposed loss function and show the relation between ordinal regression and preference learning.
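For a concrete (hypothetical) illustration of such a coarse measurement, the short Python snippet below discretizes a latent continuous score into q = 3 ordered ranks; the scores and thresholds are arbitrary and only serve to show how an ordinal variable can arise.

```python
import numpy as np

# Hypothetical example: a continuous relevance score is only observed as one
# of q = 3 ordered ranks r_1 < r_2 < r_3 ("coarse measurement").
scores = np.array([0.1, 0.4, 0.35, 0.9, 0.7])   # latent continuous variable
thresholds = np.array([0.3, 0.6])               # arbitrary rank boundaries
ranks = np.digitize(scores, thresholds) + 1     # ranks in {1, 2, 3}
print(ranks)                                    # -> [1 2 2 3 3]
```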

As an application of the theory we derive an algorithm for ordinal regression in Section 3 by modeling ranks as intervals on the real line. Considering pairs of objects, the task of learning reduces to finding a utility function that best reflects the preferences induced by the unknown distribution PXY. The resulting algorithm is similar to Support Vector Machines (SVM) (Vapnik 1998) and enforces large margin rank boundaries. It is easily extended to non–linear utility functions using the “kernel trick” (Smola 1998).
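A rough, illustrative sketch of this idea (not the authors' exact optimization problem) is to train an off-the-shelf linear SVM on difference vectors x_i − x_j, labelled by which object of the pair carries the higher rank; the learned weight vector w then acts as a linear utility function U(x) = w·x. The sketch assumes scikit-learn's LinearSVC, integer-encoded ranks, and a linear utility; a non-linear utility would instead require kernelizing the pairwise formulation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def make_pair_differences(X, y):
    """Build difference vectors x_i - x_j for all pairs with distinct ranks,
    labelled +1 if x_i has the higher rank and -1 otherwise."""
    diffs, signs = [], []
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] != y[j]:
                diffs.append(X[i] - X[j])
                signs.append(1 if y[i] > y[j] else -1)
    return np.asarray(diffs), np.asarray(signs)

def fit_linear_utility(X, y, C=1.0):
    """Fit w so that U(x) = w . x tends to order objects like their ranks."""
    D, s = make_pair_differences(X, y)
    svm = LinearSVC(C=C, fit_intercept=False)   # the bias cancels on differences
    svm.fit(D, s)
    return svm.coef_.ravel()

# Illustrative usage: ranks encoded as integers, X a (l, n) array.
X = np.array([[0.1, 0.2], [0.4, 0.1], [0.9, 0.8], [0.7, 0.9]])
y = np.array([1, 1, 2, 3])
w = fit_linear_utility(X, y)
print(X @ w)   # utility scores; a higher score should indicate a higher rank
```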

Finally, in Section 4 we present learning curves of our approach in a controlled experiment and in a real–world experiment on data from information retrieval.

2. A Risk Formulation for Ordinal Regression

Consider an input space [math]\displaystyle{ X \subset \R^n }[/math] with objects being represented by feature vectors [math]\displaystyle{ x = (x_1, \ldots, x_n)^T \in \R^n }[/math], where [math]\displaystyle{ n }[/math] denotes the number of features. Furthermore, let us assume that there is an outcome space [math]\displaystyle{ Y = \{r_1, \ldots, r_q\} }[/math] with ordered ranks [math]\displaystyle{ r_q \succ_Y r_{q-1} \succ_Y \cdots \succ_Y r_1 }[/math]. The symbol [math]\displaystyle{ \succ_Y }[/math] denotes the ordering between different ranks and can be interpreted as “is preferred to”. Suppose that an i.i.d. sample [math]\displaystyle{ S = \{(x_i, y_i)\}_{i=1}^\ell \subset X \times Y }[/math] is given. Let us consider a model space [math]\displaystyle{ H = \{h(\cdot): X \to Y\} }[/math] of mappings from objects to ranks. Moreover, each such function [math]\displaystyle{ h }[/math] induces an ordering [math]\displaystyle{ \succ_X }[/math] on the elements of the input space by the following rule

[math]\displaystyle{ x_i \succ_X x_j \Leftrightarrow h(x_i) \succ_Y h(x_j) \quad (1) }[/math]
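In code, rule (1) is simply a comparison of predicted ranks. A minimal sketch, assuming ranks are encoded as integers 1..q so that the built-in ordering stands in for the rank ordering; the toy predictor is hypothetical:

```python
def prefers(h, x_i, x_j):
    """Rule (1): x_i is preferred to x_j iff h assigns x_i a higher rank."""
    return h(x_i) > h(x_j)

# Illustrative usage with a toy rank predictor mapping to {1, 2, 3}:
h = lambda x: 1 + int(sum(x) > 1.0) + int(sum(x) > 2.0)
print(prefers(h, [1.5, 1.0], [0.2, 0.3]))   # -> True
```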

A distribution independent model of ordinal regression has to single out that function [math]\displaystyle{ h^*_{pref} }[/math] which induces the ordering of the space [math]\displaystyle{ X }[/math] that incurs the smallest number of inversions on pairs [math]\displaystyle{ (x_1, x_2) }[/math] of objects (for a similar reasoning see Sobel 1990; McCullagh 1980). Given a pair [math]\displaystyle{ (x_1, y_1) }[/math] and [math]\displaystyle{ (x_2, y_2) }[/math] of objects we have to distinguish between two different outcomes: [math]\displaystyle{ y_1 \succ_Y y_2 }[/math] and [math]\displaystyle{ y_2 \succ_Y y_1 }[/math]. Thus, the probability of incurred inversion is given by …

The ERM principle recommends to take that mapping [math]\displaystyle{ h_\ell }[/math] which minimizes the empirical risk [math]\displaystyle{ R_{emp}(h; S) }[/math],

[math]\displaystyle{ R_{emp}(h; S) = \frac{1}{\ell^2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} l^s_{pref}\big(h(x_i), h(x_j); y_i, y_j\big), }[/math]

which is effectively based on a new training set whose elements are pairs of objects. Using the shorthand notation [math]\displaystyle{ x^{(1)} }[/math] and [math]\displaystyle{ x^{(2)} }[/math] to denote the first and second object of a pair, the new training set [math]\displaystyle{ S' \subseteq X \times X \times \{-1, +1\} }[/math] can be derived from [math]\displaystyle{ S }[/math] if we use all 2-sets
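A small Python sketch of this construction, assuming (since the excerpt does not spell out the pair loss) a 0-1 loss on pairs that counts an error whenever the predicted ordering of a pair disagrees with the ordering of the true ranks, and assuming integer-encoded ranks; all names are illustrative.

```python
def pref_loss(h_xi, h_xj, y_i, y_j):
    """Assumed 0-1 pair loss: 1 when the predicted ordering of the pair
    disagrees with the ordering of the true ranks, else 0."""
    if y_i == y_j:
        return 0.0
    return 0.0 if (h_xi > h_xj) == (y_i > y_j) else 1.0

def pairwise_empirical_risk(h, X, y):
    """R_emp(h; S) = (1 / l^2) * sum_i sum_j pref_loss(h(x_i), h(x_j); y_i, y_j)."""
    l = len(y)
    preds = [h(x) for x in X]
    total = sum(pref_loss(preds[i], preds[j], y[i], y[j])
                for i in range(l) for j in range(l))
    return total / (l * l)

def derive_pair_set(X, y):
    """S': one element ((x_i, x_j), +1/-1) per pair of objects with distinct ranks."""
    return [((X[i], X[j]), +1 if y[i] > y[j] else -1)
            for i in range(len(y)) for j in range(len(y))
            if y[i] != y[j]]
```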

References

Ralf Herbrich, Thore Graepel, and Klaus Obermayer. (1999). "Support Vector Learning for Ordinal Regression."