Gaussian Process Regression Task


A Gaussian Process Regression Task is a nonparametric regression task that is based on the Kernel Method and Gaussian Processes.



References

2017a

2017b


  • (Quadrianto et al., 2017) ⇒ Novi Quadrianto, Kristian Kersting, and Zhao Xu (2017). "Gaussian Process." In: "Encyclopedia of Machine Learning and Data Mining," pp. 535-548.
    • QUOTE: In a Regression problem, we are interested to recover a functional dependency [math]\displaystyle{ y_i= f(x_i) + \epsilon_i }[/math] from [math]\displaystyle{ N }[/math] observed training data points [math]\displaystyle{ \{(x_i, y_i)\}^N_{i=1} }[/math], where [math]\displaystyle{ y_i \in \mathbb{R} }[/math] is the noisy observed output at input location [math]\displaystyle{ x_i \in \mathbb{R}^d }[/math]. Traditionally, in the Bayesian Linear Regression model, this regression problem is tackled by requiring us to parameterize the latent function [math]\displaystyle{ f }[/math] by a parameter [math]\displaystyle{ w \in \mathbb{R}^H,\; f(x):=\langle\phi(x), w\rangle }[/math] for [math]\displaystyle{ H }[/math] fixed basis functions [math]\displaystyle{ \{\phi_h(x)\}^H_{h=1} }[/math]. A prior distribution is then defined over parameter [math]\displaystyle{ w }[/math]. The idea of the Gaussian process regression (in the geostatistical literature, this is also called kriging; see, e.g., Krige 1951; Matheron 1963) is to place a prior directly on the space of functions without parameterizing the function (...)
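
The contrast drawn in the quote between the weight-space view and the function-space view can be sketched in a few lines of Python/NumPy. The basis functions, their number, and the prior variance below are illustrative assumptions rather than anything prescribed by the quoted text; the point is only that a Gaussian prior over the weights [math]\displaystyle{ w }[/math] of fixed basis functions already induces a (degenerate) Gaussian distribution over function values, whereas Gaussian process regression places such a prior directly on the space of functions.

<syntaxhighlight lang="python">
import numpy as np

# Weight-space view: f(x) = <phi(x), w> with a Gaussian prior on w.
# The basis functions and prior variance are illustrative choices.
rng = np.random.default_rng(0)

def phi(x, centers, width=0.5):
    """H fixed radial basis functions evaluated at the scalar inputs x."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

centers = np.linspace(-3, 3, 20)   # H = 20 fixed basis functions
x = np.linspace(-3, 3, 100)
Phi = phi(x, centers)              # design matrix, shape (100, 20)

# A prior w ~ N(0, sigma_w^2 I) induces f = Phi @ w ~ N(0, sigma_w^2 Phi Phi^T),
# i.e. a Gaussian distribution over the function values at the inputs x.
sigma_w = 1.0
prior_function_samples = Phi @ (sigma_w * rng.standard_normal((centers.size, 3)))  # 3 draws
</syntaxhighlight>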

2017c

  • (Scikit-Learn, 2017) ⇒ "1.7.1. Gaussian Process Regression (GPR)" in http://scikit-learn.org/stable/modules/gaussian_process.html Retrieved:2017-09-03
    • QUOTE: The GaussianProcessRegressor implements Gaussian processes (GP) for regression purposes. For this, the prior of the GP needs to be specified. The prior mean is assumed to be constant and zero (for normalize_y=False) or the training data’s mean (for normalize_y=True). The prior’s covariance is specified by passing a kernel object. The hyperparameters of the kernel are optimized during fitting of GaussianProcessRegressor by maximizing the log-marginal-likelihood (LML) based on the passed optimizer. As the LML may have multiple local optima, the optimizer can be started repeatedly by specifying n_restarts_optimizer. The first run is always conducted starting from the initial hyperparameter values of the kernel; subsequent runs are conducted from hyperparameter values that have been chosen randomly from the range of allowed values. If the initial hyperparameters should be kept fixed, None can be passed as optimizer.

      The noise level in the targets can be specified by passing it via the parameter alpha, either globally as a scalar or per datapoint. Note that a moderate noise level can also be helpful for dealing with numeric issues during fitting as it is effectively implemented as Tikhonov regularization, i.e., by adding it to the diagonal of the kernel matrix. An alternative to specifying the noise level explicitly is to include a WhiteKernel component into the kernel, which can estimate the global noise level from the data (see example below).

      The implementation is based on Algorithm 2.1 of [RW2006]. In addition to the API of standard scikit-learn estimators, GaussianProcessRegressor:

      • allows prediction without prior fitting (based on the GP prior)
      • provides an additional method sample_y(X), which evaluates samples drawn from the GPR (prior or posterior) at given inputs
      • exposes a method log_marginal_likelihood(theta), which can be used externally for other ways of selecting hyperparameters, e.g., via Markov chain Monte Carlo.
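
The quoted scikit-learn description translates into a short usage sketch. The kernel choice, the toy data, and the noise level below are illustrative assumptions; only the names GaussianProcessRegressor, RBF, WhiteKernel, alpha, n_restarts_optimizer, sample_y, and log_marginal_likelihood come from the documentation quoted above.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(30)   # toy noisy targets

# RBF kernel plus a WhiteKernel component that estimates the global noise level
# from the data; alternatively, the noise could be passed explicitly via alpha.
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)
gpr.fit(X, y)   # kernel hyperparameters are tuned by maximizing the LML

X_test = np.linspace(-3, 3, 100).reshape(-1, 1)
mean, std = gpr.predict(X_test, return_std=True)    # posterior mean and std
samples = gpr.sample_y(X_test, n_samples=3)         # draws from the posterior GP
lml = gpr.log_marginal_likelihood(gpr.kernel_.theta)
</syntaxhighlight>

Calling predict or sample_y before fit would instead evaluate the GP prior, matching the first two bullet points of the quote.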

2017d

  • (Schulz et al., 2017) ⇒ Schulz, E., Speekenbrink, M., & Krause, A. (2017). "A tutorial on Gaussian process regression with a focus on exploration-exploitation scenarios." bioRxiv, 095190. DOI:10.1101/095190.
    • QUOTE: In Gaussian process regression, we assume the output [math]\displaystyle{ y }[/math] of a function [math]\displaystyle{ f }[/math] at input [math]\displaystyle{ x }[/math] can be written as

      [math]\displaystyle{ y = f (x) + \epsilon\quad }[/math](3)

      with [math]\displaystyle{ \epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2) }[/math]. Note that this is similar to the assumption made in linear regression, in that we assume an observation consists of an independent “signal” term [math]\displaystyle{ f(x) }[/math] and “noise” term [math]\displaystyle{ \epsilon }[/math]. New in Gaussian process regression, however, is that we assume that the signal term is also a random variable which follows a particular distribution. This distribution is subjective in the sense that the distribution reflects our uncertainty regarding the function. The uncertainty regarding [math]\displaystyle{ f }[/math] can be reduced by observing the output of the function at different input points. The noise term reflects the inherent randomness in the observations, which is always present no matter how many observations we make. In Gaussian process regression, we assume the function [math]\displaystyle{ f(x) }[/math] is distributed as a Gaussian process:

      [math]\displaystyle{ f(x) \sim \mathcal{GP} (m(x), k(x, x' )) }[/math]

      A Gaussian process (GP) is a distribution over functions and is defined by a mean and a covariance function. The mean function [math]\displaystyle{ m(x) }[/math] reflects the expected function value at input [math]\displaystyle{ x }[/math]:

      [math]\displaystyle{ m(x) = E[f(x)] }[/math]

      i.e., the average of all functions in the distribution evaluated at input [math]\displaystyle{ x }[/math]. The prior mean function is often set to [math]\displaystyle{ m(x) = 0 }[/math] in order to avoid expensive posterior computations and only do inference via the covariance directly. The covariance function [math]\displaystyle{ k(x, x') }[/math] models the dependence between the function values at different input points [math]\displaystyle{ x }[/math] and [math]\displaystyle{ x' }[/math]:

      [math]\displaystyle{ k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))] }[/math].

      The function [math]\displaystyle{ k }[/math] is commonly called the kernel of the Gaussian process (Jäkel, Schölkopf, & Wichmann, 2007). The choice of an appropriate kernel is based on assumptions such as smoothness and likely patterns to be expected in the data (more on this later). A sensible assumption is usually that the correlation between two points decays with the distance between the points according to a power function. This just means that closer points are expected to behave more similarly than points which are further away from each other. One very popular choice of a kernel fulfilling this assumption is the radial basis function kernel, which is defined as

      [math]\displaystyle{ k(x, x') = \sigma_f^2\exp(-\frac{\parallel x-x'\parallel^2}{2 \lambda^2}) }[/math]

      The radial basis function provides an expressive kernel to model smooth functions. The two hyper-parameters [math]\displaystyle{ \lambda }[/math] (called the length-scale) and [math]\displaystyle{ \sigma_f^2 }[/math] (the signal variance) can be varied to increase or reduce the correlation between points and consequentially the smoothness of the resulting function.
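
As a closing illustration, here is a minimal NumPy sketch (the input grid, length-scale, and signal variance are arbitrary illustrative values) that evaluates the radial basis function kernel above and uses it as the covariance of a zero-mean Gaussian process prior, so that sample functions can be drawn from a multivariate normal. The small jitter added to the diagonal is a standard numerical stabiliser, not part of the formula.

<syntaxhighlight lang="python">
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, signal_var=1.0):
    """k(x, x') = sigma_f^2 * exp(-||x - x'||^2 / (2 * lambda^2)) for scalar inputs."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return signal_var * np.exp(-sq_dist / (2.0 * length_scale ** 2))

x = np.linspace(-5, 5, 200)
K = rbf_kernel(x, x, length_scale=1.0, signal_var=1.0)

# Zero-mean GP prior: the function values at x are jointly Gaussian, N(0, K).
# The tiny jitter keeps the covariance matrix numerically positive definite.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=np.zeros_like(x),
                                  cov=K + 1e-8 * np.eye(x.size),
                                  size=3)
# Decreasing length_scale makes the sampled functions wigglier; increasing it
# makes them smoother, in line with the role of the two hyper-parameters above.
</syntaxhighlight>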