GM-RKB β

2004 Learning the Kernel Matrix with Semidefinite Programming

Notes

Cited By

Quotes

Abstract

Kernel-based learning algorithms work by embedding the data into a Euclidean space, and then searching for linear relations among the embedded data points. The embedding is performed implicitly, by specifying the inner products between each pair of points in the embedding space. This information is contained in the so-called kernel matrix, a symmetric and positive semidefinite matrix that encodes the relative positions of all points. Specifying this matrix amounts to specifying the geometry of the embedding space and inducing a notion of similarity in the input space --- classical model selection problems in machine learning. In this paper we show how the kernel matrix can be learned from data via semidefinite programming (SDP) techniques. When applied to a kernel matrix associated with both training and test data this gives a powerful transductive algorithm --- using the labeled part of the data one can learn an embedding also for the unlabeled part. The similarity between test points is inferred from training points and their labels. Importantly, these learning problems are convex, so we obtain a method for learning both the model class and the function without local minima. Furthermore, this approach leads directly to a convex method for learning the 2-norm soft margin parameter in support vector machines, solving an important open problem.

1. Introduction

Recent advances in kernel-based learning algorithms have brought the field of machine learning closer to the desirable goal of autonomy: the goal of providing learning systems that require as little intervention as possible on the part of a human user. In particular, kernel-based algorithms are generally formulated in terms of convex optimization problems, which have a single global optimum and thus do not require heuristic choices of learning rates, starting configurations or other free parameters. There are, of course, statistical model selection problems to be faced within the kernel approach; in particular, the choice of the kernel and the corresponding feature space are central choices that must generally be made by a human user. While this provides opportunities for prior knowledge to be brought to bear, it can also be difficult in practice to find prior justification for the use of one kernel instead of another. It would be desirable to explore model selection methods that allow kernels to be chosen in a more automatic way based on data.

It is important to observe that we do not necessarily need to choose a kernel function, specifying the inner product between the images of all possible data points when mapped from an input space X to an appropriate feature space F. Since kernel-based learning methods extract all information needed from inner products of training data points in F, the values of the kernel function at pairs which are not present are irrelevant. So, there is no need to learn a kernel function over the entire sample space to specify the embedding of a finite training data set via a kernel function mapping. Instead, it is sufficient to specify a finite-dimensional kernel matrix (also known as a Gram matrix) that contains as its entries the inner products in F between all pairs of data points. Note also that it is possible to show that any symmetric positive semidefinite matrix is a valid Gram matrix, based on an inner product in some Hilbert space. This suggests viewing the model selection problem in terms of Gram matrices rather than kernel functions.
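The fact that any symmetric positive semidefinite matrix is a valid Gram matrix can be checked numerically: an eigendecomposition of such a matrix yields an explicit embedding whose inner products reproduce the matrix. A minimal sketch with made-up data (NumPy assumed):

```python
import numpy as np

# Any symmetric PSD matrix K admits an embedding Phi such that
# K[i, j] = <Phi[i], Phi[j]>, recoverable via eigendecomposition.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
K = A @ A.T                        # symmetric PSD by construction

# K = V diag(w) V^T with w >= 0; take Phi = V diag(sqrt(w))
w, V = np.linalg.eigh(K)
w = np.clip(w, 0.0, None)          # guard against tiny negative round-off
Phi = V * np.sqrt(w)               # rows are the embedded points

# The inner products of the embedded points reproduce K exactly
assert np.allclose(Phi @ Phi.T, K)
```

The embedding is not unique (any rotation of `Phi` gives the same Gram matrix), which is precisely why only the matrix of inner products matters.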

In this paper our main focus is transduction: the problem of completing the labeling of a partially labeled dataset. In other words, we are required to make predictions only at a finite set of points, which are specified a priori. Thus, instead of learning a function, we only need to learn a set of labels. There are many practical problems in which this formulation is natural. An example is the prediction of gene function, where the genes of interest are specified a priori, but the function of many of these genes is unknown.

We will address this problem by learning a kernel matrix corresponding to the entire dataset, a matrix that optimizes a certain cost function that depends on the available labels. In other words, we use the available labels to learn a good embedding, and we apply it to both the labeled and the unlabeled data. The resulting kernel matrix can then be used in combination with any of a number of existing learning algorithms that use kernels. One example that we discuss in detail is the support vector machine (SVM), where our methods yield a new transduction method for SVMs that scales polynomially with the number of test points. Furthermore, this approach will offer us a method to optimize the 2-norm soft margin parameter for these SVM learning algorithms, solving an important open problem.

All this can be done in full generality by using techniques from semidefinite programming (SDP), a branch of convex optimization that deals with the optimization of convex functions over the convex cone of positive semidefinite matrices, or convex subsets thereof. Any convex set of kernel matrices is a set of this kind. Furthermore, it turns out that many natural cost functions, motivated by error bounds, are convex in the kernel matrix.

A second application of the ideas that we present here is to the problem of combining data from multiple sources. Specifically, assume that each source is associated with a kernel function, such that a training set yields a set of kernel matrices. The tools that we develop in this paper make it possible to optimize over the coefficients in a linear combination of such kernel matrices. These coefficients can then be used to form linear combinations of kernel functions in the overall classifier. Thus this approach allows us to combine possibly heterogeneous data sources, making use of the reduction of heterogeneous data types to the common framework of kernel matrices, and choosing coefficients that emphasize those sources most useful in the classification decision.
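A nonnegative linear combination of kernel matrices from different sources is itself a valid kernel matrix (the PSD cone is closed under such combinations). A small sketch with made-up data and illustrative coefficients:

```python
import numpy as np

# Two heterogeneous sources over the same 6 examples (synthetic data).
rng = np.random.default_rng(1)
X1 = rng.standard_normal((6, 3))   # source 1: 3 features per example
X2 = rng.standard_normal((6, 4))   # source 2: 4 features per example

K1 = X1 @ X1.T                     # linear kernel on source 1
sq_dists = np.sum((X2[:, None] - X2[None]) ** 2, axis=-1)
K2 = np.exp(-0.5 * sq_dists)       # Gaussian (RBF) kernel on source 2

mu = np.array([0.7, 0.3])          # illustrative coefficients, mu >= 0
K = mu[0] * K1 + mu[1] * K2

# The combination stays symmetric and positive semidefinite
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() >= -1e-10
```

In the paper the coefficients are not fixed by hand as here, but optimized over via SDP using the available labels.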

In Section 2, we recall the main ideas from kernel-based learning algorithms, and introduce a variety of criteria that can be used to assess the suitability of a kernel matrix: the hard margin, the 1-norm and 2-norm soft margin, and the kernel alignment. Section 3 reviews the basic concepts of semidefinite programming. In Section 4 we put these ideas together and consider the optimization of the various criteria over sets of kernel matrices. For a set of linear combinations of fixed kernel matrices, these optimization problems reduce to SDP. If the linear coefficients are constrained to be positive, they can be simplified even further, yielding a quadratically-constrained quadratic program, a special case of the SDP framework. If the linear combination contains the identity matrix, we obtain a convex method for optimizing the 2-norm soft margin parameter in support vector machines. Section 5 presents statistical error bounds that motivate one of our cost functions. Empirical results are reported in Section 6.

Notation

Vectors are represented in bold notation, e.g., [math]\mathbf{v} \in \mathbb{R}^n[/math], and their scalar components in italic script, e.g., [math]v_1, v_2, \ldots, v_n[/math]. Matrices are represented in italic script, e.g., [math]X \in \mathbb{R}^{m \times n}[/math]. For a square, symmetric matrix [math]X[/math], [math]X \succeq 0[/math] means that [math]X[/math] is positive semidefinite, while [math]X \succ 0[/math] means that [math]X[/math] is positive definite. For a vector [math]\mathbf{v}[/math], the notations [math]\mathbf{v} \geq 0[/math] and [math]\mathbf{v} > 0[/math] are understood componentwise.

2. Kernel Methods

Kernel-based learning algorithms (see, for example, Cristianini and Shawe-Taylor, 2000; Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004) work by embedding the data into a Hilbert space, and searching for linear relations in such a space. The embedding is performed implicitly, by specifying the inner product between each pair of points rather than by giving their coordinates explicitly. This approach has several advantages, the most important deriving from the fact that the inner product in the embedding space can often be computed much more easily than the coordinates of the points themselves.

Given an input set [math]\mathcal{X}[/math], and an embedding space [math]\mathcal{F}[/math], we consider a map [math]\Phi: \mathcal{X} \to \mathcal{F}[/math].

Given two points [math]x_i \in \mathcal{X}[/math] and [math]x_j \in \mathcal{X}[/math], the function that returns the inner product between their images in the space F is known as the kernel function.

Definition 1: A kernel is a function [math]k[/math], such that [math]k(\mathbf{x}, \mathbf{z}) = \langle \Phi(\mathbf{x}), \Phi(\mathbf{z})\rangle[/math] for all [math]\mathbf{x}, \mathbf{z} \in \mathcal{X}[/math], where [math]\Phi[/math] is a mapping from [math]\mathcal{X}[/math] to an (inner product) feature space [math]\mathcal{F}[/math]. A kernel matrix is a square matrix [math]K \in \mathbb{R}^{n \times n}[/math] such that [math]K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)[/math] for some [math]\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathcal{X}[/math] and some kernel function [math]k[/math]. The kernel matrix is also known as the Gram matrix. It is a symmetric, positive semidefinite matrix, and since it specifies the inner products between all pairs of points [math]\{\mathbf{x}_i\}_{i=1}^n[/math], it completely determines the relative positions of those points in the embedding space.
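Definition 1 can be made concrete with a toy, explicitly chosen feature map (the map and data below are hypothetical, purely for illustration):

```python
import numpy as np

# Toy feature map Phi(x) = (x, x^2), so k(x, z) = <Phi(x), Phi(z)>.
def phi(x):
    return np.array([x, x ** 2])

def k(x, z):
    return float(phi(x) @ phi(z))

# Build the kernel (Gram) matrix K_ij = k(x_i, x_j) for a small sample.
xs = [0.5, -1.0, 2.0]
K = np.array([[k(xi, xj) for xj in xs] for xi in xs])

# As Definition 1 states, K is symmetric and positive semidefinite.
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() >= -1e-10
```

In practice the feature map is usually implicit; only the function `k` (or, as this paper argues, only the matrix `K`) needs to be specified.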

Since in this paper we will consider a finite input set X, we can characterize kernel functions and matrices in the following simple way.

Proposition 2: Every positive semidefinite and symmetric matrix is a kernel matrix. Conversely, every kernel matrix is symmetric and positive semidefinite.

...

...

3. Semidefinite Programming (SDP)

...

3.1 Definition of Semidefinite Programming

A linear matrix inequality, abbreviated LMI, is a constraint of the form: [math]F(u) := F_0 + u_1 F_1 + \ldots + u_q F_q \preceq 0.[/math] Here, [math]u[/math] is the vector of decision variables, and [math]F_0, \ldots, F_q[/math] are given symmetric [math]p \times p[/math] matrices. The notation [math]F(u) \preceq 0[/math] means that the symmetric matrix [math]F(u)[/math] is negative semidefinite. Note that such a constraint is in general a nonlinear constraint; the term "linear" in the name LMI merely emphasizes that [math]F[/math] is affine in [math]u[/math]. Perhaps the most important feature of an LMI constraint is its convexity: the set of [math]u[/math] that satisfy the LMI is a convex set.
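A tiny numerical sketch of an LMI feasibility check (the matrices below are made up, not from the paper): feasibility of [math]u[/math] amounts to all eigenvalues of [math]F(u)[/math] being nonpositive.

```python
import numpy as np

# Illustrative LMI data: F(u) = F0 + u1 * F1 with q = 1 decision variable.
F0 = np.array([[-2.0, 0.0], [0.0, -2.0]])
F1 = np.array([[1.0, 0.0], [0.0, 1.0]])

def feasible(u1):
    """u1 satisfies the LMI iff F(u1) is negative semidefinite."""
    F = F0 + u1 * F1
    return np.linalg.eigvalsh(F).max() <= 1e-10   # all eigenvalues <= 0

assert feasible(1.0)       # F(1) = -I: negative semidefinite
assert not feasible(3.0)   # F(3) = +I: not negative semidefinite
```

Consistent with the convexity remark above, the feasible set here is the interval [math]u_1 \leq 2[/math], which is convex.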

An LMI constraint can be seen as an infinite set of scalar, affine constraints. Indeed, for a given [math]u[/math], [math]F(u) \preceq 0[/math] if and only if [math]z^T F(u) z \leq 0[/math] for every [math]z[/math]; every constraint indexed by [math]z[/math] is an affine inequality, in the ordinary sense, i.e., the left-hand side of the inequality is a scalar, composed of a linear term in [math]u[/math] and a constant term. Alternatively, using a standard result from linear algebra, we may state the constraint as [math]\forall Z \succeq 0: \operatorname{trace}(F(u)\,Z) \leq 0.[/math] This can be seen by writing down the spectral decomposition of [math]Z[/math] and using the fact that [math]z^T F(u) z \leq 0[/math] for every [math]z[/math].
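The trace formulation can be sanity-checked numerically: for a negative semidefinite matrix, the trace of its product with any PSD matrix is nonpositive (matrices below are illustrative):

```python
import numpy as np

# A fixed negative semidefinite matrix standing in for F(u).
F = np.diag([-0.9, -0.95, -1.0])
assert np.linalg.eigvalsh(F).max() <= 0

# trace(F Z) <= 0 should hold for every PSD Z; spot-check many random Z.
rng = np.random.default_rng(2)
for _ in range(100):
    B = rng.standard_normal((3, 3))
    Z = B @ B.T                      # random PSD matrix
    assert np.trace(F @ Z) <= 1e-10
```

This is exactly the spectral-decomposition argument in the text: each PSD [math]Z[/math] is a nonnegative combination of rank-one terms [math]z z^T[/math], and [math]\operatorname{trace}(F\, z z^T) = z^T F z \leq 0[/math].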

A semi-definite program (SDP) is an optimization problem with a linear objective, and linear matrix inequality and affine equality constraints.

Definition 7: A semidefinite program is a problem of the form ...

...

...

7. Discussion

In this paper we have presented a new method for learning a kernel matrix from data. Our approach makes use of semidefinite programming (SDP) ideas. It is motivated by the fact that every symmetric, positive semidefinite matrix can be viewed as a kernel matrix (corresponding to a certain embedding of a finite set of data), and the fact that SDP deals with the optimization of convex cost functions over the convex cone of positive semidefinite matrices (or convex subsets of this cone). Thus convex optimization and machine learning concerns merge to provide a powerful methodology for learning the kernel matrix with SDP.

We have focused on the transductive setting, where the labeled data are used to learn an embedding, which is then applied to the unlabeled part of the data. Based on a new generalization bound for transduction, we have shown how to impose convex constraints that effectively control the capacity of the search space of possible kernels and yield an efficient learning procedure that can be implemented by SDP. Furthermore, this approach leads to a convex method to learn the 2-norm soft margin parameter in support vector machines, solving an important open problem. Promising empirical results are reported on standard benchmark datasets; these results show that the new approach provides a principled way to combine multiple kernels to yield a classifier that is comparable with the best individual classifier, and can perform better than any individual kernel. Performance is also comparable with a classifier in which the kernel hyperparameter is tuned with cross-validation; our approach achieves the effect of this tuning without cross-validation. We have also shown how optimizing a linear combination of kernel matrices provides a novel method for fusing heterogeneous data sources. In this case, the empirical results show a significant improvement of the classification performance for the optimal combination of kernels when compared to individual kernels.

References