# 2005 A Unifying View of Sparse Approximate Gaussian Process Regression

## Quotes

### Abstract

We provide a new unifying view, including all existing proper probabilistic sparse approximations for Gaussian process regression. Our approach relies on expressing the effective prior which the methods are using. This allows new insights to be gained, and highlights the relationship between existing methods. It also allows for a clear theoretically justified ranking of the closeness of the known approximations to the corresponding full GPs. Finally we point directly to designs of new better sparse approximations, combining the best of the existing strategies, within attractive computational constraints.

### Introduction

Regression models based on Gaussian processes (GPs) are simple to implement, flexible, fully probabilistic models, and thus a powerful tool in many areas of application. Their main limitation is that memory requirements and computational demands grow as the square and the cube, respectively, of the number of training cases n, effectively limiting a direct implementation to problems with at most a few thousand cases. To overcome the computational limitations numerous authors have recently suggested a wealth of sparse approximations. Common to all these approximation schemes is that only a subset of the latent variables are treated exactly, and the remaining variables are given some approximate, but computationally cheaper treatment. However, the published algorithms have widely different motivations, emphasis and exposition, so it is difficult to get an overview (see [Rasmussen and Williams, 2006, chapter 8]) of how they relate to each other, and which can be expected to give rise to the best algorithms.

In this paper we provide a unifying view of sparse approximations for GP regression. Our approach is simple, but powerful: for each algorithm we analyze the posterior, and compute the effective prior which it is using. Thus, we reinterpret the algorithms as “exact inference with an approximated prior”, rather than the existing (ubiquitous) interpretation “approximate inference with the exact prior”. This approach has the advantage of directly expressing the approximations in terms of prior assumptions about the function, which makes the consequences of the approximations much easier to understand. While our view of the approximations is not the only one possible, it has the advantage of putting all existing probabilistic sparse approximations under one umbrella, thus enabling direct comparison and revealing the relation between them.

In Section 1 we briefly introduce GP models for regression. In Section 2 we present our unifying framework and write out the key equations in preparation for the unifying analysis of sparse algorithms in Sections 4-7. The relation of transduction and augmentation to our sparse framework is covered in Section 8. All our approximations are written in terms of a new set of inducing variables. The choice of these variables is itself a challenging problem, and is discussed in Section 9. We comment on a few special approximations outside our general scheme in Section 10 and conclusions are drawn at the end.

### 1. Gaussian Processes for Regression

Probabilistic regression is usually formulated as follows: given a training set $D = \{(\mathbf{x}_i, y_i),\ i = 1, \ldots, n\}$ of $n$ pairs of (vectorial) inputs $\mathbf{x}_i$ and noisy (real, scalar) outputs $y_i$, compute the predictive distribution of the function values $f_*$ (or noisy $y_*$) at test locations $\mathbf{x}_*$.

In the simplest case (which we deal with here) we assume that the noise is additive, independent and Gaussian, such that the relationship between the (latent) function $f(\mathbf{x})$ and the observed noisy targets $y$ is given by $y_i = f(\mathbf{x}_i) + \varepsilon_i$, where $\varepsilon_i \sim \mathcal{N}(0, \sigma^2_{\mathrm{noise}})$, (1) where $\sigma^2_{\mathrm{noise}}$ is the variance of the noise.

Definition 1
A Gaussian process (GP) is a collection of random variables, any finite number of which have consistent joint Gaussian distributions.
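The consistency requirement in Definition 1 means that marginalizing a larger set of function values simply drops the corresponding rows and columns of the covariance matrix. A minimal NumPy sketch of this property, using a squared-exponential covariance as an illustrative choice (the excerpt does not fix a particular $k$):

```python
import numpy as np

# Squared-exponential covariance, an illustrative choice of k(x, x');
# the consistency property holds for any valid covariance function.
def k(x, y, lengthscale=1.0):
    return np.exp(-0.5 * ((x - y) / lengthscale) ** 2)

def cov_matrix(xs):
    return np.array([[k(a, b) for b in xs] for a in xs])

xs = np.array([0.0, 0.5, 1.2])
K_full = cov_matrix(xs)        # covariance of (f_1, f_2, f_3)
K_subset = cov_matrix(xs[:2])  # covariance of (f_1, f_2) alone

# Marginalizing f_3 out of the joint Gaussian leaves exactly the
# covariance of the smaller collection: the two agree entry-by-entry.
assert np.allclose(K_full[:2, :2], K_subset)
```

This is what makes the GP well-defined as a prior over functions: the distribution over any finite subset of function values does not depend on which other inputs are considered.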

Gaussian process (GP) regression is a Bayesian approach which assumes a GP prior[1] over functions, i.e. assumes a priori that the function values behave according to $p(\mathbf{f}\,|\,\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n) = \mathcal{N}(\mathbf{0}, K)$, (2) where $\mathbf{f} = [f_1, f_2, \ldots, f_n]^\top$ is a vector of latent function values, $f_i = f(\mathbf{x}_i)$, and $K$ is a covariance matrix whose entries are given by the covariance function, $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$. Note that the GP treats the latent function values $f_i$ as random variables, indexed by the corresponding input. In the following, for simplicity we will always neglect the explicit conditioning on the inputs; the GP model and all expressions are always conditional on the corresponding inputs. The GP model is concerned only with the conditional of the outputs given the inputs; we do not model anything about the inputs themselves.
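Combining the zero-mean prior of Eq. (2) with the noise model of Eq. (1) gives exact GP regression. A minimal NumPy sketch, using a squared-exponential covariance as an illustrative choice and the standard GP predictive equations (which this excerpt implies but does not state):

```python
import numpy as np

def k(x, y, lengthscale=1.0):
    """Squared-exponential covariance; an illustrative choice of k."""
    return np.exp(-0.5 * ((x[:, None] - y[None, :]) / lengthscale) ** 2)

rng = np.random.default_rng(0)
sigma2_noise = 0.1

# Training data: noisy observations of a latent function, as in Eq. (1).
x_train = np.linspace(-3, 3, 20)
y = np.sin(x_train) + np.sqrt(sigma2_noise) * rng.standard_normal(x_train.size)

# Zero-mean GP prior over the latent values, Eq. (2): f ~ N(0, K).
K = k(x_train, x_train)

# Standard exact GP predictive distribution at test inputs x_*:
#   mean = K_*n (K + sigma^2 I)^{-1} y
#   cov  = K_** - K_*n (K + sigma^2 I)^{-1} K_n*
x_test = np.linspace(-3, 3, 50)
K_star = k(x_test, x_train)
K_ss = k(x_test, x_test)
A = K + sigma2_noise * np.eye(x_train.size)
mean = K_star @ np.linalg.solve(A, y)
cov = K_ss - K_star @ np.linalg.solve(A, K_star.T)
```

The linear solve against the n x n matrix A is the O(n^3) cost (and O(n^2) storage) that motivates the sparse approximations surveyed in the paper.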

## References

• 1. Corinna Cortes, Vladimir Vapnik, Support-Vector Networks, Machine Learning, v.20 n.3, p.273-297, Sept. 1995 doi:10.1023/A:1022627411411
• 2. Lehel Csató, Manfred Opper, Sparse on-line Gaussian Processes, Neural Computation, v.14 n.3, p.641-668, March 2002 doi:10.1162/089976602317250933
• 3. Sathiya Keerthi and Wei Chu. A Matching Pursuit Approach to Sparse Gaussian Process Regression. In Y. Weiss, B. Schölkopf, and J. Platt, Editors, Advances in Neural Information Processing Systems 18, Cambridge, Massachusetts, 2006. The MIT Press.
• 4. Malte Kuss, Carl Edward Rasmussen, Assessing Approximate Inference for Binary Gaussian Process Classification, The Journal of Machine Learning Research, 6, p.1679-1704, 12/1/2005
• 5. Joaquin Quiñonero-Candela. Learning with Uncertainty - Gaussian Processes and Relevance Vector Machines. PhD Thesis, Technical University of Denmark, Lyngby, Denmark, 2004.
• 6. Carl Edward Rasmussen. Reduced Rank Gaussian Process Learning. Technical Report, Gatsby Computational Neuroscience Unit, UCL, 2002.
• 7. Carl Edward Rasmussen, Joaquin Quiñonero-Candela, Healing the Relevance Vector Machine through Augmentation, Proceedings of the 22nd International Conference on Machine Learning, p.689-696, August 07-11, 2005, Bonn, Germany doi:10.1145/1102351.1102438
• 8. Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
• 9. Anton Schwaighofer and Volker Tresp. Transductive and Inductive Methods for Approximate Gaussian Process Regression. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, Editors, Advances in Neural Information Processing Systems 15, Pages 953-960, Cambridge, Massachusetts, 2003. The MIT Press.
• 10. Matthias Seeger, Christopher K. I. Williams, and Neil Lawrence. Fast Forward Selection to Speed Up Sparse Gaussian Process Regression. In Christopher M. Bishop and Brendan J. Frey, Editors, Ninth International Workshop on Artificial Intelligence and Statistics. Society for Artificial Intelligence and Statistics, 2003.
• 11. Bernhard W. Silverman. Some Aspects of the Spline Smoothing Approach to Non-parametric Regression Curve Fitting. J. Roy. Stat. Soc. B, 47(1):1-52, 1985. (with Discussion).
• 12. Alexander J. Smola and Peter L. Bartlett. Sparse Greedy Gaussian Process Regression. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, Editors, Advances in Neural Information Processing Systems 13, Pages 619-625, Cambridge, Massachusetts, 2001. The MIT Press.
• 13. Edward Snelson and Zoubin Ghahramani. Sparse Gaussian Processes Using Pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. Platt, Editors, Advances in Neural Information Processing Systems 18, Cambridge, Massachusetts, 2006. The MIT Press.
• 14. Michael E. Tipping, Sparse Bayesian Learning and the Relevance Vector Machine, The Journal of Machine Learning Research, 1, p.211-244, 9/1/2001 doi:10.1162/15324430152748236
• 15. Volker Tresp, A Bayesian Committee Machine, Neural Computation, v.12 n.11, p.2719-2741, November 2000 doi:10.1162/089976600300014908
• 16. Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag New York, Inc., New York, NY, 1995
• 17. Grace Wahba, Xiwu Lin, Fangyu Gao, Dong Xiang, Ronald Klein, Barbara Klein, The Bias-variance Tradeoff and the Randomized GACV, Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, p.620-626, July 1999
• 18. Christopher K. I. Williams and Carl Edward Rasmussen. Gaussian Processes for Regression. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, Editors, Advances in Neural Information Processing Systems 8, Pages 514-520, Cambridge, Massachusetts, 1996. The MIT Press.
• 19. Christopher K. I. Williams, Carl Edward Rasmussen, Anton Schwaighofer, and Volker Tresp. Observations of the Nyström Method for Gaussian Process Prediction. Technical Report, University of Edinburgh, Edinburgh, Scotland, 2002.
• 20. Christopher K. I. Williams and Matthias Seeger. Using the Nyström Method to Speed Up Kernel Machines. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, Editors, Advances in Neural Information Processing Systems 13, Pages 682-688, Cambridge, Massachusetts, 2001. The MIT Press.
• 21. David Wipf, Jason Palmer, and Bhaskar Rao. Perspectives on Sparse Bayesian Learning. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, Editors, Advances in Neural Information Processing Systems 16, Cambridge, Massachusetts, 2004. The MIT Press.


[1] For notational simplicity we exclusively use zero-mean priors.
Authors: Joaquin Quiñonero-Candela and Carl Edward Rasmussen. Title: A Unifying View of Sparse Approximate Gaussian Process Regression. Year: 2005.