# Gaussian Processes
Gaussian Processes (GPs), also known as Kriging models, have become one of the most commonly used surrogate modeling techniques in engineering design applications due to their ability to deliver both predictions and uncertainty estimates [1, 2, 3, 4]. They are a non-parametric supervised learning method used to solve regression, optimization, and probabilistic classification problems [5]. In particular, a Gaussian process can be viewed as a collection of random variables, any finite number of which have a joint Gaussian distribution. A Gaussian process model is completely specified by the mean function \(m(\mathbf{x})\) and covariance function \(k(\mathbf{x}, \mathbf{x}')\) of a real process \(f(\mathbf{x})\), given below:

\[
m(\mathbf{x}) = \mathbb{E}\left[f(\mathbf{x})\right], \qquad k(\mathbf{x}, \mathbf{x}') = \mathbb{E}\left[\left(f(\mathbf{x}) - m(\mathbf{x})\right)\left(f(\mathbf{x}') - m(\mathbf{x}')\right)\right]
\]
The Gaussian process can then be written as:

\[
f(\mathbf{x}) \sim \mathcal{GP}\left(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')\right)
\]
The covariance function in Equation (1) is specified by a kernel function, which assigns a covariance to each pair of random variables so that \(\text{cov}\left(f(\mathbf{x}), f(\mathbf{x}')\right) = k(\mathbf{x}, \mathbf{x}')\). The kernel can take many forms; one of the most commonly used, the squared exponential, is given below:

\[
k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2} \sum_{i=1}^{d} \frac{\left(x_i - x_i'\right)^2}{\theta_i^2}\right)
\]
In this function, the parameters \(\sigma_f^2\) and \(\boldsymbol{\theta}\), collectively denoted \(\boldsymbol{\psi} = \{\sigma_f^2, \boldsymbol{\theta}\}\), are the hyperparameters of the model, which are learned from the data:
- \(\sigma_f^2\) is the signal variance, a scaling factor that controls the overall vertical variation of the function.
- \(\theta_i\) is the length scale for input dimension \(i\). It determines how quickly the correlation between points decreases with distance: a small length scale means the function varies rapidly, while a large length scale indicates a smoother function.
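As a concrete illustration, the squared exponential kernel can be written in a few lines of NumPy. This is a minimal sketch, not part of the formal development: the function name and arguments (`sigma_f2` for \(\sigma_f^2\), `theta` for the per-dimension length scales) are our own.

```python
import numpy as np

def squared_exponential(X1, X2, sigma_f2=1.0, theta=None):
    """Squared exponential kernel with one length scale per input dimension.

    Illustrative sketch: sigma_f2 is the signal variance, theta the length
    scales (argument names are ours). Returns K with K[i, j] = k(x_i, x_j).
    """
    X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
    theta = np.ones(X1.shape[1]) if theta is None else np.asarray(theta, float)
    # Pairwise differences, scaled per dimension, then summed over dimensions.
    diff = (X1[:, None, :] - X2[None, :, :]) / theta   # shape (n1, n2, d)
    return sigma_f2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))

X = np.array([[0.0], [1.0], [3.0]])
K = squared_exponential(X, X, sigma_f2=1.0, theta=[1.0])
```

Note how the kernel encodes the behavior described above: each point has covariance \(\sigma_f^2\) with itself, and the covariance between two points decays as their scaled distance grows.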
## Making Predictions
To make predictions using a Gaussian process model, we assume an observed training dataset \(\mathcal{D}\) containing \(n\) input–output pairs:

\[
\mathcal{D} = \left\{ \left(\mathbf{x}_i, y_i\right) \right\}_{i=1}^{n}
\]
where each input \(\mathbf{x}_i \in \mathbb{R}^d\) has a corresponding observed output \(y_i = f(\mathbf{x}_i)\). For now, we assume the output of \(f\) is scalar, though this framework generalizes to vector outputs, as discussed in the Multi-Output GPs section.
To predict outputs at a new set of test locations \(\mathbf{X}_*\), we model the joint distribution over the training and test outputs. It is common to assume a zero-mean function for the GP, such that \(m(\mathbf{x}) = 0\). This assumption simplifies the model but is not overly restrictive, as it only constrains the prior distribution; the posterior predictive mean will be non-zero and adapt to the data. With these assumptions, the joint distribution of the training outputs \(\mathbf{y} = f(\mathbf{X})\) and the test outputs \(\mathbf{y}_* = f(\mathbf{X}_*)\) is given by:

\[
\begin{bmatrix} \mathbf{y} \\ \mathbf{y}_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K(\mathbf{X}, \mathbf{X}') & K(\mathbf{X}, \mathbf{X}_*) \\ K(\mathbf{X}_*, \mathbf{X}) & K(\mathbf{X}_*, \mathbf{X}_*) \end{bmatrix}\right)
\]
Here, the prime is a notational convenience: \(\mathbf{X}'\) represents the same set of training points as \(\mathbf{X}\), but as the second argument of the covariance function. Thus \(K(\mathbf{X}, \mathbf{X}')\) evaluates the kernel between all pairs of training points, producing an \(n \times n\) covariance matrix with entries \([K(\mathbf{X}, \mathbf{X}')]_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)\). For notational simplicity, we partition this covariance matrix into blocks:

\[
\begin{bmatrix} \mathbf{y} \\ \mathbf{y}_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix}\right)
\]
where \(\boldsymbol{\Sigma}_{11}\) is the covariance of training points with themselves, \(\boldsymbol{\Sigma}_{12}\) is the covariance of training points with test points, \(\boldsymbol{\Sigma}_{22}\) is the covariance of the test points with themselves, and \(\boldsymbol{\Sigma}_{21} = \boldsymbol{\Sigma}_{12}^T\).
## Posterior Predictive Distribution
By a standard property of the multivariate Gaussian, the conditional distribution of a Gaussian is itself Gaussian. This fact allows for making predictions at the test points \(\mathbf{X}_*\) conditioned on the training data \(\mathcal{D}\). Using the block matrix notation defined in Equation (6), the posterior predictive distribution is given by:

\[
\mathbf{y}_* \mid \mathbf{X}_*, \mathcal{D} \sim \mathcal{N}\left(\boldsymbol{\mu}_{*}, \boldsymbol{\Sigma}_{*}\right), \qquad \boldsymbol{\mu}_{*} = \boldsymbol{\Sigma}_{21} \boldsymbol{\Sigma}_{11}^{-1} \mathbf{y}, \qquad \boldsymbol{\Sigma}_{*} = \boldsymbol{\Sigma}_{22} - \boldsymbol{\Sigma}_{21} \boldsymbol{\Sigma}_{11}^{-1} \boldsymbol{\Sigma}_{12}
\]
Here, \(\boldsymbol{\mu}_{*}\) is the posterior predictive mean and \(\boldsymbol{\Sigma}_{*}\) is the posterior predictive covariance.
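The conditioning step above can be sketched directly in NumPy. This is a minimal, noise-free illustration: the kernel, the jitter term added to \(\boldsymbol{\Sigma}_{11}\) for numerical stability, and all function names are our own choices, not prescribed by the text.

```python
import numpy as np

def sq_exp_kernel(A, B, sigma_f2=1.0, theta=1.0):
    # Squared exponential kernel (single length scale for simplicity).
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return sigma_f2 * np.exp(-0.5 * sq / theta**2)

def gp_posterior(X_train, y_train, X_test):
    """Zero-mean GP posterior, following the block notation in the text."""
    S11 = sq_exp_kernel(X_train, X_train) + 1e-10 * np.eye(len(X_train))  # jitter
    S12 = sq_exp_kernel(X_train, X_test)   # Sigma_12; its transpose is Sigma_21
    S22 = sq_exp_kernel(X_test, X_test)
    # mu_* = Sigma_21 Sigma_11^{-1} y ;  Sigma_* = Sigma_22 - Sigma_21 Sigma_11^{-1} Sigma_12
    mu = S12.T @ np.linalg.solve(S11, y_train)
    Sigma = S22 - S12.T @ np.linalg.solve(S11, S12)
    return mu, Sigma

X_train = np.array([[-2.0], [0.0], [2.0]])
y_train = np.sin(X_train).ravel()
X_test = np.array([[0.0], [1.0]])
mu, Sigma = gp_posterior(X_train, y_train, X_test)
```

At a training location the noise-free posterior mean reproduces the observed output and the predictive variance collapses toward zero, while uncertainty grows at test points away from the data.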
## Learning Hyperparameters
The kernel hyperparameters \(\boldsymbol{\psi}\) are not set manually but are instead learned from the training data \(\mathcal{D}\). A standard approach is to find the values that maximize the log marginal likelihood of the observations, \(p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\psi})\) [5]. For a noise-free model, the log marginal likelihood is given by:

\[
\log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\psi}) = -\frac{1}{2} \mathbf{y}^\top \mathbf{K}_{\boldsymbol{\psi}}^{-1} \mathbf{y} - \frac{1}{2} \log \left| \mathbf{K}_{\boldsymbol{\psi}} \right| - \frac{n}{2} \log 2\pi
\]
where \(\mathbf{K}_{\boldsymbol{\psi}} = K(\mathbf{X},\mathbf{X})\). Optimized hyperparameters \(\boldsymbol{\psi}^*\) are then selected according to:

\[
\boldsymbol{\psi}^* = \arg\max_{\boldsymbol{\psi}} \, \log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\psi})
\]
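In practice this maximization is carried out numerically, usually by minimizing the negative log marginal likelihood with a gradient-based optimizer. The sketch below uses SciPy's L-BFGS-B; optimizing the hyperparameters in log space to keep them positive, and the small jitter added to \(\mathbf{K}_{\boldsymbol{\psi}}\), are our own implementation choices.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_psi, X, y):
    """Negative log marginal likelihood of a noise-free (jittered) GP.

    log_psi packs the hyperparameters in log space,
    [log sigma_f^2, log theta_1, ..., log theta_d] -- a parameterization
    we chose to keep them positive, not one prescribed by the text.
    """
    sigma_f2, theta = np.exp(log_psi[0]), np.exp(log_psi[1:])
    sq = np.sum(((X[:, None, :] - X[None, :, :]) / theta) ** 2, axis=-1)
    K = sigma_f2 * np.exp(-0.5 * sq) + 1e-6 * np.eye(len(X))  # jitter for stability
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -log p(y | X, psi) = 1/2 y^T K^{-1} y + 1/2 log|K| + n/2 log 2pi,
    # with 1/2 log|K| computed as the sum of log diag(L).
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(y) * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(20, 1))
y = np.sin(X).ravel()
res = minimize(neg_log_marginal_likelihood, x0=np.zeros(2), args=(X, y), method="L-BFGS-B")
psi_star = np.exp(res.x)  # optimized [sigma_f^2, theta]
```

Because the objective is generally non-convex, it is common to restart the optimizer from several initial values of \(\boldsymbol{\psi}\) and keep the best result.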
## Computational Considerations
Note that directly computing the expression in Equation (8) requires evaluating both the inverse and determinant of the matrix \(\mathbf{K}_{\boldsymbol{\psi}}\), operations that are computationally expensive and numerically unstable for large \(n\). To address these challenges, the Cholesky decomposition is typically employed [5]. For a symmetric positive definite matrix \(\mathbf{K}\), the Cholesky decomposition yields a lower triangular matrix \(\mathbf{L}\) such that \(\mathbf{K} = \mathbf{L}\mathbf{L}^\top\). This factorization enables efficient computation of both the matrix inverse and the log-determinant. Indeed, the Cholesky method is approximately twice as fast as Gaussian elimination for symmetric positive definite matrices [6], and it provides better numerical stability, especially for large-scale problems where the kernel matrix may be ill-conditioned.
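A short sketch of both tricks, using SciPy's Cholesky routines on a small kernel matrix (the example matrix and variable names are ours): the log-determinant follows from \(\log|\mathbf{K}| = 2\sum_i \log L_{ii}\), and linear systems with \(\mathbf{K}\) reduce to two triangular solves, so \(\mathbf{K}^{-1}\) is never formed explicitly.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# A small SPD kernel matrix: squared exponential on a 1-D grid, plus jitter.
X = np.linspace(0.0, 1.0, 5)[:, None]
K = np.exp(-0.5 * (X - X.T) ** 2) + 1e-8 * np.eye(5)
y = np.sin(X).ravel()

# Cholesky factorization K = L L^T.
c, low = cho_factor(K)
# log|K| = 2 * sum(log(diag(L))) -- no explicit determinant needed.
logdet = 2.0 * np.sum(np.log(np.diag(c)))
# Solve K alpha = y with two triangular solves instead of forming K^{-1}.
alpha = cho_solve((c, low), y)
```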
A Comparison of Numerical Optimizers in Developing High Dimensional Surrogate Models. In Volume 2B: 45th Design Automation Conference, International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, August 2019. URL: https://doi.org/10.1115/DETC2019-97499, doi:10.1115/DETC2019-97499.
Georges Matheron. Principles of geostatistics. Economic Geology, 58(8):1246–1266, 1963.
D. G. Krige. A statistical approach to some basic mine valuation problems on the Witwatersrand. OR, 4(1):18, 1953. URL: http://www.jstor.org/stable/3006914 (visited on 2025-02-20).
William J. Welch, Robert J. Buck, Jerome Sacks, Henry P. Wynn, and Toby J. Mitchell. Screening, predicting, and computer experiments. Technometrics, 34(1):15–25, 1992. URL: https://www.tandfonline.com/doi/abs/10.1080/00401706.1992.10485229, doi:10.1080/00401706.1992.10485229.
Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006. ISBN 9780262182539. URL: http://www.gaussianprocess.org/gpml/.
Gregoire Allaire and Sidi Mahmoud Kaber. Numerical Linear Algebra. Texts in applied mathematics. Springer, New York, NY, January 2008.
Weiyu Liu and Stephen Batill. Gradient-enhanced response surface approximations using kriging models. In 9th AIAA/ISSMO Symposium on Multidisciplinary Analysis and Optimization. Reston, Virginia, September 2002. American Institute of Aeronautics and Astronautics.
Wataru Yamazaki, Markus Rumpfkeil, and Dimitri Mavriplis. Design optimization utilizing gradient/hessian enhanced surrogate model. In 28th AIAA Applied Aerodynamics Conference. Reston, Virginia, June 2010. American Institute of Aeronautics and Astronautics.
Selvakumar Ulaganathan, Ivo Couckuyt, Tom Dhaene, Joris Degroote, and Eric Laermans. Performance study of gradient-enhanced kriging. Eng. Comput., 32(1):15–34, January 2016.
Alexander I.J. Forrester and Andy J. Keane. Recent advances in surrogate-based optimization. Progress in Aerospace Sciences, 45(1):50–79, 2009. URL: https://www.sciencedirect.com/science/article/pii/S0376042108000766, doi:10.1016/j.paerosci.2008.11.001.
Youwei He, Kuan Tan, Chunming Fu, and Jinliang Luo. An efficient gradient-enhanced kriging modeling method assisted by fast kriging for high-dimension problems. International Journal of Numerical Methods for Heat & Fluid Flow, 33(12):3967–3993, 2023.
Selvakumar Ulaganathan, Ivo Couckuyt, Tom Dhaene, Eric Laermans, and Joris Degroote. On the use of gradients in kriging surrogate models. In Proceedings of the Winter Simulation Conference 2014. IEEE, December 2014.
Liming Chen, Haobo Qiu, Liang Gao, Chen Jiang, and Zan Yang. A screening-based gradient-enhanced kriging modeling method for high-dimensional problems. Applied Mathematical Modelling, 69:15–31, 2019. URL: https://www.sciencedirect.com/science/article/pii/S0307904X18305900, doi:10.1016/j.apm.2018.11.048.
Zhong-Hua Han, Yu Zhang, Chen-Xing Song, and Ke-Shi Zhang. Weighted gradient-enhanced kriging for high-dimensional surrogate modeling and design optimization. AIAA Journal, 55(12):4330–4346, 2017. URL: https://doi.org/10.2514/1.J055842, doi:10.2514/1.J055842.
Yiming Yao, Fei Liu, and Qingfu Zhang. High-throughput multi-objective Bayesian optimization using gradients. In 2024 IEEE Congress on Evolutionary Computation (CEC), volume 2, 1–8. IEEE, June 2024.
Misha Padidar, Xinran Zhu, Leo Huang, Jacob R Gardner, and David Bindel. Scaling Gaussian processes with derivative information using variational inference. Advances in Neural Information Processing Systems, 34:6442–6453, 2021. arXiv:2107.04061.
Haitao Liu, Jianfei Cai, and Yew-Soon Ong. Remarks on multi-output Gaussian process regression. Knowledge-Based Systems, 144:102–121, 2018. URL: https://www.sciencedirect.com/science/article/pii/S0950705117306123, doi:10.1016/j.knosys.2017.12.034.
B. Rakitsch, Christoph Lippert, K. Borgwardt, and Oliver Stegle. It is all in the noise: efficient multi-task Gaussian process inference with structured residuals. In Advances in Neural Information Processing Systems, January 2013.
Edwin V Bonilla, Kian Chai, and Christopher Williams. Multi-task Gaussian process prediction. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007. URL: https://proceedings.neurips.cc/paper_files/paper/2007/file/66368270ffd51418ec58bd793f2d9b1b-Paper.pdf.
Chen Zhou Xu, Zhong Hua Han, Ke Shi Zhang, and Wen Ping Song. Improved weighted gradient-enhanced kriging model for high-dimensional aerodynamic modeling problems. In 32nd Congress of the International Council of the Aeronautical Sciences, ICAS 2021. International Council of the Aeronautical Sciences, 2021.