Gaussian process regression
A Gaussian process captures the typical behavior of a system based on observations of that system and yields a probability distribution over the possible interpolation functions for the problem at hand.
Gaussian process regression makes use of Bayes' theorem, which is therefore explained briefly first.
Bayes' theorem is usually stated as follows:
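In standard notation, with two events A and B (the symbols are chosen here for illustration) and P(B) > 0, the theorem reads:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```

Here P(A) is the prior probability, P(B | A) the likelihood of the observation, and P(A | B) the posterior probability.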
It can be used to infer unknown values from known values. A frequently used example is disease detection. In the case of rapid tests, for instance, one is interested in how high the probability is that a person who tests positive actually has the disease. [Fah16]
In the following, this principle is applied to Gaussian processes.
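The rapid-test example can be worked through numerically. The following is a minimal sketch; the prevalence, sensitivity, and specificity values are hypothetical and chosen only for illustration, and the function name is not taken from any library.

```python
def posterior_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) computed with Bayes' theorem."""
    # P(positive) by the law of total probability:
    # true positives among the sick plus false positives among the healthy.
    p_positive = sensitivity * prevalence + (1.0 - specificity) * (1.0 - prevalence)
    return sensitivity * prevalence / p_positive


# Hypothetical numbers: 1 % of people are sick, the test detects 95 % of the
# sick, and it correctly clears 98 % of the healthy.
p = posterior_positive(prevalence=0.01, sensitivity=0.95, specificity=0.98)
print(f"P(disease | positive) = {p:.3f}")
```

With these assumed numbers the posterior probability is only about 0.32, because the disease is rare and false positives from the large healthy group outweigh the true positives.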
A Gaussian process is determined, for each of its random variables, by a mean function m(x) and a covariance function k(x, x′).
The mean function m(x) reflects the prior assumption about the function for the problem in question and is based on known trends or prior beliefs. If the expected value (mean function) is constantly 0, the process is called a centered Gaussian process.
The covariance function k(x, x′), also called the "kernel", describes the covariance of the random variables at x and x′. These functions are defined analytically.
The kernel defines the shape and course of the model functions and is used to describe abstract properties such as smoothness, roughness, and noise. Kernels can also be combined according to certain calculation rules in order to model systems with overlapping properties. [Ebd08][Kuß06][Ras06][Vaf17]
The following are the three most commonly used kernels:
Squared exponential kernel
A popular kernel is the squared exponential kernel (radial basis function, RBF), which has established itself as the "standard kernel" for Gaussian processes and support vector machines. [Sci18l]
The following figures show an example of an a priori Gaussian process with its mean function m(x) (black line) and its confidence interval (gray background). In general, a confidence interval indicates the range in which, over infinitely many repetitions of a random experiment, the true position of the parameter lies with a given probability [Fah16][Enc18]. In this case, the boundaries of the confidence interval are determined by the standard deviation σ.
The colored curves represent some random functions drawn from the Gaussian process. These example curves serve only as an abstract illustration of the possible result functions. In principle, an infinite number of such curves could be generated.
There are only two hyperparameters in the kernel:
- l (length_scale) describes the characteristic length scale of the covariance function. The length_scale affects the "waviness" of the functions of the Gaussian process.
- variance σ² determines the average distance of the function from its mean. A high value should be chosen for functions that cover a large range on the y-axis. [Ebd08]
The following figure shows the effects of the hyperparameters on the a priori Gaussian process and its functions.
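The effect of the two hyperparameters can be explored with a short NumPy sketch. It assumes the common form of the squared exponential kernel, k(x, x′) = σ² · exp(−(x − x′)² / (2l²)), and a zero mean function; the function name, grid, and seed are illustrative choices and not taken from the original text.

```python
import numpy as np


def squared_exponential(xa, xb, length_scale=1.0, variance=1.0):
    """k(x, x') = variance * exp(-(x - x')^2 / (2 * length_scale^2))."""
    diff = xa[:, None] - xb[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)


x = np.linspace(0.0, 5.0, 50)
K = squared_exponential(x, x, length_scale=1.0, variance=1.0)

# Draw three sample functions from the zero-mean a priori process.
# A small jitter on the diagonal keeps the matrix numerically positive definite.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)

# Band around the mean, bounded by the prior standard deviation at every x.
sigma = np.sqrt(np.diag(K))
lower, upper = -sigma, sigma
```

Rerunning the sketch with a smaller length_scale should produce more rapidly varying samples, while a larger variance widens the band and the range the samples cover on the y-axis.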
Rational quadratic kernel
The rational quadratic kernel can be seen as a combination of several squared exponential kernels with different length_scale settings (l). The parameter α determines the relative weighting of "large-scale" and "small-scale" functions. As α approaches infinity, the rational quadratic kernel becomes identical to the squared exponential kernel. [Duv14][Sci18k][Mur12]
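The limiting behavior for large α can be checked numerically. The sketch below assumes the common parameterization k(d) = (1 + d² / (2αl²))^(−α) for the rational quadratic kernel and k(d) = exp(−d² / (2l²)) for the squared exponential kernel, where d is the distance between two inputs; the function names and the test distance are illustrative.

```python
import math


def squared_exponential(d, length_scale=1.0):
    """Squared exponential kernel as a function of the distance d."""
    return math.exp(-d**2 / (2.0 * length_scale**2))


def rational_quadratic(d, length_scale=1.0, alpha=1.0):
    """Rational quadratic kernel as a function of the distance d."""
    return (1.0 + d**2 / (2.0 * alpha * length_scale**2)) ** (-alpha)


d = 1.5  # arbitrary distance between two inputs
alphas = [0.5, 5.0, 500.0, 5e5]
gaps = [abs(rational_quadratic(d, alpha=a) - squared_exponential(d)) for a in alphas]

for a, gap in zip(alphas, gaps):
    print(f"alpha = {a:>8}: |k_RQ - k_SE| = {gap:.2e}")
```

The gap between the two kernels shrinks as α grows, which is the convergence described above.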
Periodic kernel
The periodic kernel allows functions to repeat themselves. The parameter p describes the distance between repetitions of the function. The length_scale parameter (l) has the same meaning as described above. [Sci18j]
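A small sketch can make the role of p concrete. It assumes the widely used form k(x, x′) = exp(−2 · sin²(π · |x − x′| / p) / l²); the function name and the values of p and l are illustrative.

```python
import math


def periodic_kernel(xa, xb, length_scale=1.0, period=2.0):
    """k(x, x') = exp(-2 * sin^2(pi * |x - x'| / p) / l^2)."""
    s = math.sin(math.pi * abs(xa - xb) / period)
    return math.exp(-2.0 * s**2 / length_scale**2)


k_near = periodic_kernel(0.0, 0.3)
k_shifted = periodic_kernel(0.0, 0.3 + 2.0)  # one full period further away
k_half = periodic_kernel(0.0, 1.0)           # half a period apart

print(k_near, k_shifted, k_half)
```

Points that are exactly one period apart are as strongly correlated as if they were shifted back by that period, whereas points half a period apart have the lowest covariance.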
The kernel function and the mean function together describe the a priori Gaussian process. Using some measured values, an a posteriori Gaussian process can be determined that takes into account all available information about the problem. More precisely, the result is not a single solution but all possible interpolation functions, weighted with different probabilities. In regression problems in particular, the solution (function) with the highest probability is the decisive one. [Ras06][Wik18a]
For regression, a data set with values of the independent variable X ∈ R and the associated dependent variable f ∈ R is typically given, and the goal is to predict the output values f∗ for new values X∗. [Vaf17]
In the simplest case, a noise-free process, the multivariate Gaussian distribution is defined as follows:
The covariance matrix can be divided into four blocks: the covariance within the unknown values K_X∗X∗, the covariance between unknown and known values K_X∗X together with its transpose K_XX∗, and the covariance within the known values K_XX.
Since f is fully known, substituting the probability density into Bayes' theorem yields the a posteriori Gaussian distribution.
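Written out in the block notation used above, and assuming a zero mean function, the joint distribution of known and unknown values and the resulting a posteriori distribution take the following standard form (compare [Ras06]):

```latex
\begin{bmatrix} f \\ f_* \end{bmatrix}
\sim \mathcal{N}\!\left(
\mathbf{0},\;
\begin{bmatrix}
K_{XX} & K_{XX_*} \\
K_{X_*X} & K_{X_*X_*}
\end{bmatrix}
\right)

f_* \mid X_*, X, f \;\sim\; \mathcal{N}\!\left(
K_{X_*X} K_{XX}^{-1} f,\;
K_{X_*X_*} - K_{X_*X} K_{XX}^{-1} K_{XX_*}
\right)
```

The first argument of the conditional distribution is the a posteriori mean, the second the a posteriori covariance.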
Rasmussen gives a detailed introduction in his book "Gaussian Processes for Machine Learning". [Ras06, p.8 ff.]
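The noise-free conditioning step can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than a reference implementation: it uses a zero mean function, an RBF kernel with unit variance, hypothetical training data taken from sin(x), and a tiny jitter term for numerical stability.

```python
import numpy as np


def rbf(xa, xb, length_scale=1.0):
    """Squared exponential kernel with unit variance."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


# Hypothetical known values (X, f) and new inputs X_new.
X = np.array([1.0, 3.0, 4.5])
f = np.sin(X)
X_new = np.linspace(0.0, 5.0, 6)  # 0, 1, 2, 3, 4, 5

K_XX = rbf(X, X) + 1e-10 * np.eye(len(X))  # jitter keeps the solve stable
K_sX = rbf(X_new, X)                       # covariance between unknown and known
K_ss = rbf(X_new, X_new)                   # covariance within unknown

# A posteriori mean and covariance of f* given f.
post_mean = K_sX @ np.linalg.solve(K_XX, f)
post_cov = K_ss - K_sX @ np.linalg.solve(K_XX, K_sX.T)
post_std = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))
```

At new inputs that coincide with a training input (here x = 1 and x = 3), the a posteriori mean reproduces the observed value and the standard deviation collapses to practically zero, while it grows again further away from the data.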
From the a priori to the a posteriori Gaussian process: explained with a simple example
In practice, numerous other kernels are used, including combinations of several kernel functions. The constant kernel, for example, is usually used in conjunction with others. Using this kernel on its own usually makes little sense, because only constant correlations can be modeled. Nevertheless, the constant kernel is used in the following to explain and illustrate Gaussian process regression in a simple way.
The image below shows an a priori Gaussian process with a variance of one. Because a constant kernel is defined as the covariance function, all sample functions are lines parallel to the x-axis.
Since no statement is made in advance about possible noise in the measured data, the process assumes that the given measurement point is part of the actual function. This limits the possible functions to those passing directly through that point. Since the constant kernel only allows horizontal lines, the number of possible lines in this simple case narrows to exactly one function. The covariance of the a posteriori Gaussian process is therefore zero.
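The collapse to a single horizontal line can be reproduced with scalar arithmetic. The sketch below assumes a zero prior mean, a constant kernel with variance one, and a single hypothetical noise-free measurement; all names and values are illustrative.

```python
variance = 1.0  # prior variance of the constant kernel


def constant_kernel(xa, xb):
    """The covariance is the same for every pair of inputs."""
    return variance


x_obs, y_obs = 2.0, 0.7  # hypothetical measurement point
x_new = 5.0              # any other position on the x-axis

k_oo = constant_kernel(x_obs, x_obs)
k_no = constant_kernel(x_new, x_obs)
k_nn = constant_kernel(x_new, x_new)

# Noise-free conditioning on the single observation.
post_mean = k_no / k_oo * y_obs
post_var = k_nn - k_no / k_oo * k_no

print(post_mean, post_var)
```

Regardless of the chosen x_new, the a posteriori mean equals the measured value and the a posteriori variance is zero, which corresponds to exactly one horizontal line through the measurement point.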
With the RBF kernel, arbitrary processes can be modeled, but this time the a posteriori Gaussian process does not yield a single straight line but numerous functions. The function with the highest probability is the mean function of the a posteriori Gaussian process. The following image shows the a posteriori Gaussian process and the measurement points used.
To implement Gaussian process regression in Python, the a priori Gaussian process must be defined in advance. The mean function m(x) is generally assumed to be constant and zero. By setting the parameter normalize_y = True, the process uses the mean of the values in the data set as the constant expected value. The covariance function is chosen by selecting the kernel. [Sci18m]
Gaussian process regression in scikit-learn
The following source code shows how Gaussian process regression is implemented with scikit-learn, using an RBF kernel as the covariance function. The first optimization run starts from the preset values of the kernel hyperparameters (such as the variance). With the parameter alpha, the noise level of the training data can be specified in advance.
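A minimal sketch of such a setup with scikit-learn might look as follows. The training data, the initial hyperparameter values, the number of optimizer restarts, and the random seed are assumptions made for illustration; the variance is modeled here by multiplying the RBF kernel with a ConstantKernel.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

# Hypothetical training data: a few samples of sin(x).
X_train = np.array([[0.5], [1.5], [3.0], [4.0], [5.5]])
y_train = np.sin(X_train).ravel()

# A priori covariance: variance (ConstantKernel) times RBF with a length scale.
# These values are only the starting point of the first optimization run.
kernel = ConstantKernel(constant_value=1.0) * RBF(length_scale=1.0)

gpr = GaussianProcessRegressor(
    kernel=kernel,
    alpha=1e-6,              # assumed noise level of the training data
    normalize_y=True,        # use the data mean instead of a zero mean
    n_restarts_optimizer=5,  # extra optimizer runs from random start values
    random_state=0,
)
gpr.fit(X_train, y_train)

# A posteriori mean and standard deviation at new inputs.
X_test = np.linspace(0.0, 6.0, 25).reshape(-1, 1)
y_mean, y_std = gpr.predict(X_test, return_std=True)

print("optimized kernel:", gpr.kernel_)
print("log marginal likelihood:", gpr.log_marginal_likelihood_value_)
```

After fitting, gpr.kernel_ holds the kernel with the optimized hyperparameters, while the kernel object passed in keeps its initial values.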
Optimization process: estimating the hyperparameters with the maximum likelihood method
Hyperparameters are optimized during model fitting by maximizing the log marginal likelihood (LML). Maximum likelihood estimation (MLE) is a method for determining the parameters of a statistical model. Whereas the regression methods presented earlier, such as linear regression, aim to minimize the mean squared error, Gaussian process regression tries to maximize the likelihood function. In other words, the parameters of the model are chosen so that the observed data appear most probable under the assumed distribution.
The probability function f of a random variable X is defined as follows:
This distribution is assumed to depend on a parameter ϑ. Given the observed data, the probability can then be regarded as a function of ϑ:
Maximum likelihood estimators aim to maximize this function. The maximum is usually found by differentiating the function and setting the derivative to zero. Because the log-likelihood function attains its maximum at the same point as the likelihood function but is easier to calculate, it is usually used instead. [Goo16, p.128]
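One common way to write these quantities is given below. The notation is an assumption made for illustration: x_1, …, x_n denote the observed data, which are taken to be independent, L is the likelihood function, and ℓ is its logarithm.

```latex
L(\vartheta) = f(x_1, \dots, x_n \mid \vartheta) = \prod_{i=1}^{n} f(x_i \mid \vartheta)

\ell(\vartheta) = \log L(\vartheta) = \sum_{i=1}^{n} \log f(x_i \mid \vartheta)

\hat{\vartheta} = \arg\max_{\vartheta} \ell(\vartheta)
```

The logarithm turns the product into a sum, which is what makes the derivative easier to compute.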
A random variable X is said to follow a normal or Gaussian distribution if it has the following probability density [Fah16, p.83]:
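In the usual notation, with mean µ and variance σ², this density reads:

```latex
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```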
The maximum likelihood method is explained below using a simple one-dimensional example. The image below shows the data. All three plotted probability distributions could describe the distribution of the data. Of these, the one of interest is the most probable distribution.
The goal is to determine the parameters σ² and µ so that the probability is maximized over all data points considered. For example, given three data points with x-values of 9, 10, and 13, similar to the data points in the figure, the joint probability is calculated from the individual probabilities as follows:
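Assuming the three data points are independent and each follows a Gaussian density with parameters µ and σ², the joint probability is the product of the three individual densities:

```latex
P(9, 10, 13;\, \mu, \sigma)
= \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(9 - \mu)^2}{2\sigma^2}\right)
\cdot \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(10 - \mu)^2}{2\sigma^2}\right)
\cdot \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(13 - \mu)^2}{2\sigma^2}\right)
```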
This function must be maximized. The mean µ then corresponds to the x-value that is most likely to occur.
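For the three data points 9, 10, and 13, the maximization can be carried out in closed form. The sketch below assumes independent Gaussian observations; setting the derivatives of the log-likelihood to zero then gives the sample mean for µ and the mean squared deviation for σ². The function and variable names are illustrative.

```python
import math

data = [9.0, 10.0, 13.0]


def log_likelihood(mu, sigma2, xs):
    """Log of the joint Gaussian density of independent observations xs."""
    n = len(xs)
    squared_dev = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - squared_dev / (2.0 * sigma2)


# Closed-form maximum likelihood estimates.
mu_hat = sum(data) / len(data)
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / len(data)

best = log_likelihood(mu_hat, sigma2_hat, data)
print(f"mu_hat = {mu_hat:.3f}, sigma2_hat = {sigma2_hat:.3f}, log-likelihood = {best:.3f}")
```

The estimates are µ = 32/3 ≈ 10.67 and σ² = 26/9 ≈ 2.89; any other parameter pair passed to log_likelihood yields a lower value.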
The following figure shows a Gaussian process regression model.