Derivative-Enhanced Gaussian Processes
When derivative information is available, it can be incorporated to improve model accuracy in high-dimensional or highly nonlinear problems [7, 8]. The most common approach involves including first-order gradient information, which has been shown to significantly improve the accuracy of GP models, particularly for functions with a high number of dimensions \((d \geq 8)\) [9].
This is achieved by augmenting the observation vector to include the partial derivatives of the function at each training point. The observation vector \(\mathbf{y}\) is expanded into an augmented vector, \(\mathbf{y}^{G}\). In the general case, the predictions at the test locations \(\mathbf{X}_*\) are also augmented to include derivatives, forming a vector \(\mathbf{y}^{G}_*\):
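In the standard gradient-enhanced formulation (a reconstruction consistent with the \(n(d+1)\) dimension quoted below), these augmented vectors stack the function values and all first-order partial derivatives:

\[
\mathbf{y}^{G} =
\begin{bmatrix}
f(\mathbf{x}_1) \\ \vdots \\ f(\mathbf{x}_n) \\ \nabla f(\mathbf{x}_1) \\ \vdots \\ \nabla f(\mathbf{x}_n)
\end{bmatrix}
\in \mathbb{R}^{n(d+1)},
\qquad
\mathbf{y}^{G}_{*} =
\begin{bmatrix}
f(\mathbf{X}_{*}) \\ \nabla f(\mathbf{X}_{*})
\end{bmatrix}.
\]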
The joint distribution between the augmented training observations and the augmented test predictions is a multivariate Gaussian:
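Assuming a zero-mean prior, as in the standard GP formulation, this joint distribution takes the form:

\[
\begin{bmatrix}
\mathbf{y}^{G} \\ \mathbf{y}^{G}_{*}
\end{bmatrix}
\sim
\mathcal{N}\left(
\mathbf{0},\;
\begin{bmatrix}
\boldsymbol{\Sigma}^{G}_{11} & \boldsymbol{\Sigma}^{G}_{12} \\
\boldsymbol{\Sigma}^{G}_{21} & \boldsymbol{\Sigma}^{G}_{22}
\end{bmatrix}
\right).
\]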
The blocks of this covariance matrix are also augmented. The training covariance block, \(\boldsymbol{\Sigma}^{G}_{11}\), is an \(n(d + 1) \times n(d + 1)\) matrix:
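One standard way to write this block, with \(\partial K / \partial \mathbf{X}\) collecting the kernel derivatives with respect to the first argument and \(\partial K / \partial \mathbf{X}'\) those with respect to the second, is:

\[
\boldsymbol{\Sigma}^{G}_{11} =
\begin{bmatrix}
K(\mathbf{X}, \mathbf{X}) & \dfrac{\partial K(\mathbf{X}, \mathbf{X})}{\partial \mathbf{X}'} \\[6pt]
\dfrac{\partial K(\mathbf{X}, \mathbf{X})}{\partial \mathbf{X}} & \dfrac{\partial^{2} K(\mathbf{X}, \mathbf{X})}{\partial \mathbf{X}\,\partial \mathbf{X}'}
\end{bmatrix}.
\]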
The training-test covariance block, \(\boldsymbol{\Sigma}^{G}_{12}\), contains the covariances between all training and test observations:
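Under the same conventions, a standard reconstruction of this block is:

\[
\boldsymbol{\Sigma}^{G}_{12} =
\begin{bmatrix}
K(\mathbf{X}, \mathbf{X}_{*}) & \dfrac{\partial K(\mathbf{X}, \mathbf{X}_{*})}{\partial \mathbf{X}_{*}} \\[6pt]
\dfrac{\partial K(\mathbf{X}, \mathbf{X}_{*})}{\partial \mathbf{X}} & \dfrac{\partial^{2} K(\mathbf{X}, \mathbf{X}_{*})}{\partial \mathbf{X}\,\partial \mathbf{X}_{*}}
\end{bmatrix}.
\]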
The remaining blocks are defined as \(\boldsymbol{\Sigma}^{G}_{21} = \left(\boldsymbol{\Sigma}^{G}_{12}\right)^T\), and \(\boldsymbol{\Sigma}^{G}_{22}\) has the same structure as \(\boldsymbol{\Sigma}^{G}_{11}\) but is evaluated at the test points \(\mathbf{X}_*\). The posterior predictive distribution for the augmented test vector \(\mathbf{y}^{G}_*\) is then given by:
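Assuming noise-free observations, the standard conditional-Gaussian identities give:

\[
\mathbf{y}^{G}_{*} \mid \mathbf{X}, \mathbf{y}^{G}, \mathbf{X}_{*} \sim \mathcal{N}\left(\boldsymbol{\mu}_{*}, \boldsymbol{\Sigma}_{*}\right),
\]

\[
\boldsymbol{\mu}_{*} = \boldsymbol{\Sigma}^{G}_{21} \left(\boldsymbol{\Sigma}^{G}_{11}\right)^{-1} \mathbf{y}^{G},
\qquad
\boldsymbol{\Sigma}_{*} = \boldsymbol{\Sigma}^{G}_{22} - \boldsymbol{\Sigma}^{G}_{21} \left(\boldsymbol{\Sigma}^{G}_{11}\right)^{-1} \boldsymbol{\Sigma}^{G}_{12}.
\]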
The posterior mean \(\boldsymbol{\mu}_{*}\) now provides predictions for both function values and derivatives, while \(\boldsymbol{\Sigma}_{*}\) provides their uncertainty.
Similar to the standard GP, the kernel hyperparameters \(\boldsymbol{\psi}\) are determined by maximizing the log marginal likelihood (MLL) of the augmented observations:
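In its standard form, this objective reads:

\[
\log p\left(\mathbf{y}^{G} \mid \mathbf{X}, \boldsymbol{\psi}\right)
= -\frac{1}{2} \left(\mathbf{y}^{G}\right)^{T} \left(\boldsymbol{\Sigma}^{G}_{11}\right)^{-1} \mathbf{y}^{G}
- \frac{1}{2} \log\left|\boldsymbol{\Sigma}^{G}_{11}\right|
- \frac{n(d+1)}{2} \log 2\pi.
\]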
Evaluating this function during optimization is computationally demanding. The primary bottleneck is computing the inverse of \(\boldsymbol{\Sigma}^{G}_{11}\) and the log-determinant \(\log|\boldsymbol{\Sigma}^{G}_{11}|\), typically via Cholesky decomposition. The cost of this decomposition is approximately \(\mathcal{O}(M^3)\), where \(M\) is the matrix dimension. For the gradient-enhanced case, \(M = n(d + 1)\), resulting in a cost of \(\mathcal{O}\left((n(d + 1))^3\right)\). This cubic scaling in both \(n\) and \(d\) makes hyperparameter optimization prohibitively expensive for problems with many data points or high dimensionality [10, 11].
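As a quick numerical illustration of this scaling argument (the sizes \(n = 500\), \(d = 9\) are hypothetical, chosen only to match the \(d \geq 8\) regime discussed above):

```python
# Sketch of the cost scaling: Cholesky factorization costs ~M^3/3 flops,
# with M = n(d+1) for a gradient-enhanced GP (hypothetical n and d).
n, d = 500, 9
M_plain = n              # standard GP: function values only
M_grad = n * (d + 1)     # gradient-enhanced: value + d partials per point
flop_ratio = (M_grad / M_plain) ** 3
print(M_grad, flop_ratio)  # → 5000 1000.0
```

A ten-fold growth in matrix dimension thus translates into a thousand-fold growth in factorization cost, which is why the same decomposition that is cheap for a standard GP dominates the runtime here.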
Hessian-Enhanced Gaussian Processes
The framework can be further extended to include second-order derivative information (Hessians), which is particularly useful for capturing behavior in highly nonlinear problems [8]. This is achieved by further augmenting the observation vectors to include all function values, gradients, and unique Hessian components.
The augmented training vector, now denoted \(\mathbf{y}^{H}\), concatenates the function values, gradients, and the \(d(d+1)/2\) unique components of the symmetric Hessian matrix at each of the \(n\) training points, for a total length of \(n\left(1 + d + d(d+1)/2\right)\). For a general model that also predicts these quantities, the test vector \(\mathbf{y}^{H}_*\) is augmented similarly.
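Under the same conventions as the gradient-enhanced case, and with \(\operatorname{vech}\) denoting the half-vectorization of a symmetric matrix, this vector can be written as:

\[
\mathbf{y}^{H} =
\begin{bmatrix}
\mathbf{y} \\
\nabla f(\mathbf{x}_1) \\ \vdots \\ \nabla f(\mathbf{x}_n) \\
\operatorname{vech} \mathbf{H}(\mathbf{x}_1) \\ \vdots \\ \operatorname{vech} \mathbf{H}(\mathbf{x}_n)
\end{bmatrix}
\in \mathbb{R}^{\,n\left(1 + d + d(d+1)/2\right)},
\]

where \(\mathbf{H}(\mathbf{x}_i)\) is the Hessian of \(f\) at \(\mathbf{x}_i\).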
The joint distribution over these fully augmented vectors remains Gaussian, but the covariance matrix blocks are expanded further to include up to the fourth-order derivatives of the kernel function. The augmented training-training covariance block, \(\boldsymbol{\Sigma}^{H}_{11}\), is a 3 × 3 block matrix with the following structure:
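Using kernel derivatives up to fourth order, one standard reconstruction of this block is:

\[
\boldsymbol{\Sigma}^{H}_{11} =
\begin{bmatrix}
\mathbf{K} & \dfrac{\partial \mathbf{K}}{\partial \mathbf{X}'} & \dfrac{\partial^{2} \mathbf{K}}{\partial \mathbf{X}'^{2}} \\[6pt]
\dfrac{\partial \mathbf{K}}{\partial \mathbf{X}} & \dfrac{\partial^{2} \mathbf{K}}{\partial \mathbf{X}\,\partial \mathbf{X}'} & \dfrac{\partial^{3} \mathbf{K}}{\partial \mathbf{X}\,\partial \mathbf{X}'^{2}} \\[6pt]
\dfrac{\partial^{2} \mathbf{K}}{\partial \mathbf{X}^{2}} & \dfrac{\partial^{3} \mathbf{K}}{\partial \mathbf{X}^{2}\,\partial \mathbf{X}'} & \dfrac{\partial^{4} \mathbf{K}}{\partial \mathbf{X}^{2}\,\partial \mathbf{X}'^{2}}
\end{bmatrix}
\]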
where \(\mathbf{K} = K(\mathbf{X}, \mathbf{X})\). The training-test block \(\boldsymbol{\Sigma}^{H}_{12}\) and test-test block \(\boldsymbol{\Sigma}^{H}_{22}\) are constructed analogously. The posterior predictive equations retain their standard form but now operate on these much larger matrices.
Note on Simplified Test Predictions
In many applications, only the function values \(f(\mathbf{X}_*)\) are required at the test points, not their derivatives. In this case, the augmented training-test covariance blocks \(\Sigma_{12}^{(\cdot)}\) simplify by retaining only the columns corresponding to the test function values, while still leveraging derivative information from the training points. Additionally, the test-test covariance block \(\Sigma_{22}^{(\cdot)}\) reduces to the standard form \(K(\mathbf{X}_*, \mathbf{X}_*')\), since no derivatives are predicted at the test locations.
For a gradient-enhanced GP (GEK), the full training-test block is
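A standard reconstruction of this block, with derivatives predicted at the test points as well, is:

\[
\boldsymbol{\Sigma}^{G}_{12} =
\begin{bmatrix}
K(\mathbf{X}, \mathbf{X}_{*}) & \dfrac{\partial K(\mathbf{X}, \mathbf{X}_{*})}{\partial \mathbf{X}_{*}} \\[6pt]
\dfrac{\partial K(\mathbf{X}, \mathbf{X}_{*})}{\partial \mathbf{X}} & \dfrac{\partial^{2} K(\mathbf{X}, \mathbf{X}_{*})}{\partial \mathbf{X}\,\partial \mathbf{X}_{*}}
\end{bmatrix}.
\]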
When only \(f(\mathbf{X}_*)\) is needed, this reduces to
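Keeping only the column associated with the test function values:

\[
\boldsymbol{\Sigma}^{G}_{12} =
\begin{bmatrix}
K(\mathbf{X}, \mathbf{X}_{*}) \\[6pt]
\dfrac{\partial K(\mathbf{X}, \mathbf{X}_{*})}{\partial \mathbf{X}}
\end{bmatrix}.
\]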
For a Hessian-enhanced GP (HEGP), the full block is
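With \(\mathbf{K}_{*} = K(\mathbf{X}, \mathbf{X}_{*})\), a standard reconstruction of the full Hessian-enhanced block is:

\[
\boldsymbol{\Sigma}^{H}_{12} =
\begin{bmatrix}
\mathbf{K}_{*} & \dfrac{\partial \mathbf{K}_{*}}{\partial \mathbf{X}_{*}} & \dfrac{\partial^{2} \mathbf{K}_{*}}{\partial \mathbf{X}_{*}^{2}} \\[6pt]
\dfrac{\partial \mathbf{K}_{*}}{\partial \mathbf{X}} & \dfrac{\partial^{2} \mathbf{K}_{*}}{\partial \mathbf{X}\,\partial \mathbf{X}_{*}} & \dfrac{\partial^{3} \mathbf{K}_{*}}{\partial \mathbf{X}\,\partial \mathbf{X}_{*}^{2}} \\[6pt]
\dfrac{\partial^{2} \mathbf{K}_{*}}{\partial \mathbf{X}^{2}} & \dfrac{\partial^{3} \mathbf{K}_{*}}{\partial \mathbf{X}^{2}\,\partial \mathbf{X}_{*}} & \dfrac{\partial^{4} \mathbf{K}_{*}}{\partial \mathbf{X}^{2}\,\partial \mathbf{X}_{*}^{2}}
\end{bmatrix}.
\]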
which reduces to
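Retaining only the first block column, corresponding to the test function values:

\[
\boldsymbol{\Sigma}^{H}_{12} =
\begin{bmatrix}
\mathbf{K}_{*} \\[6pt]
\dfrac{\partial \mathbf{K}_{*}}{\partial \mathbf{X}} \\[6pt]
\dfrac{\partial^{2} \mathbf{K}_{*}}{\partial \mathbf{X}^{2}}
\end{bmatrix},
\qquad
\mathbf{K}_{*} = K(\mathbf{X}, \mathbf{X}_{*}).
\]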
This simplification significantly reduces the computational cost of making predictions while still benefiting from derivative information during training.
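The pieces above can be assembled into a minimal end-to-end sketch of gradient-enhanced prediction. This is not the text's implementation: it assumes a squared-exponential kernel with a fixed unit lengthscale, noise-free data, and a toy quadratic target (whose gradient \(2\mathbf{x}\) is known analytically); all names and sizes are illustrative. Training uses the full augmented covariance, while prediction uses the simplified cross-covariance that returns function values only.

```python
import numpy as np

ELL = 1.0  # assumed fixed RBF lengthscale; hyperparameter optimization omitted

def rbf(A, B):
    """Squared-exponential kernel matrix K(A, B)."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * ELL**2))

def gek_train_cov(X):
    """Augmented n(d+1) x n(d+1) training covariance: values + gradients."""
    n, d = X.shape
    diff = X[:, None, :] - X[None, :, :]              # x_i - x_j, shape (n,n,d)
    K = rbf(X, X)
    # Cov(f(x_i), grad f(x_j)): kernel derivative w.r.t. the second argument
    K_fg = (K[:, :, None] * diff / ELL**2).reshape(n, n * d)
    # Cov(grad f(x_i), grad f(x_j)): mixed second derivative of the kernel
    K_gg = K[:, :, None, None] * (np.eye(d)[None, None] / ELL**2
           - diff[:, :, :, None] * diff[:, :, None, :] / ELL**4)
    K_gg = K_gg.transpose(0, 2, 1, 3).reshape(n * d, n * d)
    return np.block([[K, K_fg], [K_fg.T, K_gg]])

def gek_cross_cov(Xs, X):
    """Transpose of the simplified training-test block: values at test points."""
    m, (n, d) = Xs.shape[0], X.shape
    diff = Xs[:, None, :] - X[None, :, :]
    K = rbf(Xs, X)
    K_sg = (K[:, :, None] * diff / ELL**2).reshape(m, n * d)
    return np.hstack([K, K_sg])

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (15, 2))
f = lambda Z: np.sum(Z**2, axis=1)                  # toy target, gradient 2x
y_aug = np.concatenate([f(X), (2 * X).ravel()])     # y^G: values then gradients
Sigma11 = gek_train_cov(X) + 1e-8 * np.eye(15 * 3)  # jitter for conditioning
Xs = rng.uniform(-1, 1, (5, 2))
mu = gek_cross_cov(Xs, X) @ np.linalg.solve(Sigma11, y_aug)
err = np.max(np.abs(mu - f(Xs)))                    # small interpolation error
```

Note that the expensive factorization involves only `Sigma11`; once it is solved against `y_aug`, predicting function values at new points needs just the slim cross-covariance, which is the cost saving described above.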
A comparison of numerical optimizers in developing high dimensional surrogate models. In Volume 2B: 45th Design Automation Conference, International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, August 2019. URL: https://doi.org/10.1115/DETC2019-97499, doi:10.1115/DETC2019-97499.
Georges Matheron. Principles of geostatistics. Economic Geology, 58(8):1246–1266, 1963.
D. G. Krige. A statistical approach to some basic mine valuation problems on the Witwatersrand. OR, 4(1):18–18, 1953. URL: http://www.jstor.org/stable/3006914 (visited on 2025-02-20).
William J. Welch, Robert J. Buck, Jerome Sacks, Henry P. Wynn, and Toby J. Mitchell. Screening, predicting, and computer experiments. Technometrics, 34(1):15–25, 1992. URL: https://www.tandfonline.com/doi/abs/10.1080/00401706.1992.10485229, doi:10.1080/00401706.1992.10485229.
Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006. ISBN 9780262182539. URL: http://www.gaussianprocess.org/gpml/.
Gregoire Allaire and Sidi Mahmoud Kaber. Numerical Linear Algebra. Texts in applied mathematics. Springer, New York, NY, January 2008.
Weiyu Liu and Stephen Batill. Gradient-enhanced response surface approximations using kriging models. In 9th AIAA/ISSMO Symposium on Multidisciplinary Analysis and Optimization. Reston, Virginia, September 2002. American Institute of Aeronautics and Astronautics.
Wataru Yamazaki, Markus Rumpfkeil, and Dimitri Mavriplis. Design optimization utilizing gradient/hessian enhanced surrogate model. In 28th AIAA Applied Aerodynamics Conference. Reston, Virginia, June 2010. American Institute of Aeronautics and Astronautics.
Selvakumar Ulaganathan, Ivo Couckuyt, Tom Dhaene, Joris Degroote, and Eric Laermans. Performance study of gradient-enhanced kriging. Eng. Comput., 32(1):15–34, January 2016.
Alexander I.J. Forrester and Andy J. Keane. Recent advances in surrogate-based optimization. Progress in Aerospace Sciences, 45(1):50–79, 2009. URL: https://www.sciencedirect.com/science/article/pii/S0376042108000766, doi:10.1016/j.paerosci.2008.11.001.
Youwei He, Kuan Tan, Chunming Fu, and Jinliang Luo. An efficient gradient-enhanced kriging modeling method assisted by fast kriging for high-dimension problems. International Journal of Numerical Methods for Heat & Fluid Flow, 33(12):3967–3993, 2023.
Selvakumar Ulaganathan, Ivo Couckuyt, Tom Dhaene, Eric Laermans, and Joris Degroote. On the use of gradients in kriging surrogate models. In Proceedings of the Winter Simulation Conference 2014. IEEE, December 2014.
Liming Chen, Haobo Qiu, Liang Gao, Chen Jiang, and Zan Yang. A screening-based gradient-enhanced kriging modeling method for high-dimensional problems. Applied Mathematical Modelling, 69:15–31, 2019. URL: https://www.sciencedirect.com/science/article/pii/S0307904X18305900, doi:10.1016/j.apm.2018.11.048.
Zhong-Hua Han, Yu Zhang, Chen-Xing Song, and Ke-Shi Zhang. Weighted gradient-enhanced kriging for high-dimensional surrogate modeling and design optimization. AIAA Journal, 55(12):4330–4346, 2017. URL: https://doi.org/10.2514/1.J055842, doi:10.2514/1.J055842.
Yiming Yao, Fei Liu, and Qingfu Zhang. High-throughput multi-objective Bayesian optimization using gradients. In 2024 IEEE Congress on Evolutionary Computation (CEC), volume 2, pages 1–8. IEEE, June 2024.
Misha Padidar, Xinran Zhu, Leo Huang, Jacob R Gardner, and David Bindel. Scaling gaussian processes with derivative information using variational inference. Advances in Neural Information Processing Systems, 34:6442–6453, 2021. arXiv:2107.04061.
Haitao Liu, Jianfei Cai, and Yew-Soon Ong. Remarks on multi-output gaussian process regression. Knowledge-Based Systems, 144:102–121, 2018. URL: https://www.sciencedirect.com/science/article/pii/S0950705117306123, doi:10.1016/j.knosys.2017.12.034.
B. Rakitsch, Christoph Lippert, K. Borgwardt, and Oliver Stegle. It is all in the noise: efficient multi-task gaussian process inference with structured residuals. In Advances in Neural Information Processing Systems, 2013.
Edwin V Bonilla, Kian Chai, and Christopher Williams. Multi-task gaussian process prediction. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007. URL: https://proceedings.neurips.cc/paper_files/paper/2007/file/66368270ffd51418ec58bd793f2d9b1b-Paper.pdf.
Chen Zhou Xu, Zhong Hua Han, Ke Shi Zhang, and Wen Ping Song. Improved weighted gradient-enhanced kriging model for high-dimensional aerodynamic modeling problems. In 32nd Congress of the International Council of the Aeronautical Sciences, ICAS 2021. International Council of the Aeronautical Sciences, 2021.