Directional Derivative-Enhanced Gaussian Processes

First-order directional derivatives offer an alternative way to incorporate derivative information in Gaussian processes [15, 16]. Rather than using all \(d\) partial derivatives at each training point, as in gradient-enhanced kriging (GEK), directional derivative GPs (DDGP) use only \(q \ll d\) derivatives along user-selected directions. This reduces computational cost while still capturing important local geometric information. A key feature of this approach is that the directions can be chosen independently for each training point, allowing for a more flexible and adaptive model.

For a training point \(\mathbf{x}_i\), let \(\mathbf{v}_{i,j} \in \mathbb{R}^d\) denote the \(j\)-th direction vector (with \(\|\mathbf{v}_{i,j}\| = 1\)), for \(j = 1, \ldots, q\). The directional derivative of \(f\) at \(\mathbf{x}_i\) along direction \(\mathbf{v}_{i,j}\) is:

\[\frac{\partial f(\mathbf{x}_i)}{\partial \mathbf{v}_{i,j}} = \nabla f(\mathbf{x}_i)^T \mathbf{v}_{i,j} = \sum_{k=1}^{d} \frac{\partial f(\mathbf{x}_i)}{\partial x_k} v_{i,j,k}\]
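As a quick sanity check of the identity above, the following sketch compares \(\nabla f^T \mathbf{v}\) with a central finite difference along \(\mathbf{v}\). The test function `f`, its gradient `grad_f`, the point `x`, and the direction `v` are illustrative assumptions, not part of the method.

```python
import numpy as np

def f(x):
    # Illustrative test function (an assumption for this example).
    return np.sin(x[0]) + x[1] ** 2

def grad_f(x):
    # Analytic gradient of the test function above.
    return np.array([np.cos(x[0]), 2.0 * x[1]])

x = np.array([0.3, -1.2])
v = np.array([3.0, 4.0])
v = v / np.linalg.norm(v)  # normalize so that ||v|| = 1

# Directional derivative as the inner product of gradient and direction.
dd_exact = grad_f(x) @ v

# Central finite difference along v, for comparison.
h = 1e-6
dd_fd = (f(x + h * v) - f(x - h * v)) / (2.0 * h)

print(dd_exact, dd_fd)
```

The two values should agree to within the finite-difference error.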

The training data is augmented with \(q\) directional derivatives at each of the \(n\) points. Let \(\mathbf{V} = \{\mathbf{v}_{i,j} \mid i = 1,\ldots,n, \, j = 1,\ldots,q\}\) denote the collection of all training direction vectors. The augmented training vector is:

\[\begin{split}\mathbf{y}^{DD} = \begin{bmatrix} f(\mathbf{x}_1) \\ \vdots \\ f(\mathbf{x}_n) \\ \frac{\partial f(\mathbf{x}_1)}{\partial \mathbf{v}_{1,1}} \\ \vdots \\ \frac{\partial f(\mathbf{x}_1)}{\partial \mathbf{v}_{1,q}} \\ \vdots \\ \frac{\partial f(\mathbf{x}_n)}{\partial \mathbf{v}_{n,q}} \end{bmatrix} \in \mathbb{R}^{n(1+q)}\end{split}\tag{1}\]
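The ordering of the augmented vector (all function values first, then derivatives grouped by training point) can be sketched in NumPy as follows. The helper name `augmented_vector`, the array layout `V[i, j, :]` for \(\mathbf{v}_{i,j}\), and the test function \(f(\mathbf{x}) = \|\mathbf{x}\|^2\) are assumptions made for this example.

```python
import numpy as np

def augmented_vector(f_vals, grads, V):
    """Stack function values and directional derivatives.

    f_vals: (n,) function values
    grads:  (n, d) gradients at the training points
    V:      (n, q, d) unit directions, V[i, j] is the j-th direction at x_i
    Returns a vector of length n * (1 + q).
    """
    # dd[i, j] = grads[i] . V[i, j]
    dd = np.einsum("nd,nqd->nq", grads, V)
    # Row-major flattening keeps the derivatives grouped by training point.
    return np.concatenate([f_vals, dd.ravel()])

rng = np.random.default_rng(0)
n, d, q = 3, 4, 2
X = rng.standard_normal((n, d))
V = rng.standard_normal((n, q, d))
V /= np.linalg.norm(V, axis=-1, keepdims=True)  # unit-length directions

f_vals = np.sum(X**2, axis=1)  # f(x) = ||x||^2 (illustrative)
grads = 2.0 * X                # its analytic gradient
y_dd = augmented_vector(f_vals, grads, V)
print(y_dd.shape)
```

With this layout, the derivative along \(\mathbf{v}_{i,j}\) sits at zero-based index `n + i * q + j`.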

For predictions at test locations \(\mathbf{X}_*\) with direction vectors \(\mathbf{V}_*\), the augmented test vector \(\mathbf{y}^{DD}_*\) is defined analogously.

Covariance Structure

The joint distribution over the augmented training and test vectors is multivariate Gaussian:

\[\begin{split}\begin{pmatrix} \mathbf{y}^{DD} \\ \mathbf{y}^{DD}_* \end{pmatrix} \sim \mathcal{N}\left( \mathbf{0}, \begin{pmatrix} \boldsymbol{\Sigma}_{11}^{DD} & \boldsymbol{\Sigma}_{12}^{DD} \\ \boldsymbol{\Sigma}_{21}^{DD} & \boldsymbol{\Sigma}_{22}^{DD} \end{pmatrix} \right)\end{split}\tag{2}\]

The training covariance block \(\boldsymbol{\Sigma}_{11}^{DD}\) is an \(n(1+q) \times n(1+q)\) matrix with block structure:

\[\begin{split}\boldsymbol{\Sigma}_{11}^{DD} = \begin{pmatrix} K(\mathbf{X}, \mathbf{X}') & \frac{\partial K(\mathbf{X}, \mathbf{X}')}{\partial \mathbf{V}'} \\ \frac{\partial K(\mathbf{X}, \mathbf{X}')}{\partial \mathbf{V}} & \frac{\partial^2 K(\mathbf{X}, \mathbf{X}')}{\partial \mathbf{V} \partial \mathbf{V}'} \end{pmatrix}\end{split}\tag{3}\]

The upper-left block \(K(\mathbf{X}, \mathbf{X}')\) is the standard \(n \times n\) kernel matrix between function values. The lower-left block \(\frac{\partial K(\mathbf{X}, \mathbf{X}')}{\partial \mathbf{V}}\) (size \(nq \times n\)) contains covariances between directional derivatives and function values; the upper-right block \(\frac{\partial K(\mathbf{X}, \mathbf{X}')}{\partial \mathbf{V}'}\) (size \(n \times nq\)) is its transpose. The lower-right block \(\frac{\partial^2 K(\mathbf{X}, \mathbf{X}')}{\partial \mathbf{V} \partial \mathbf{V}'}\) (size \(nq \times nq\)) contains covariances between all pairs of directional derivatives across the training points.
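To make the block structure concrete, here is a minimal sketch that assembles the covariance matrix for a squared-exponential kernel \(k(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp(-\|\mathbf{x} - \mathbf{x}'\|^2 / (2\ell^2))\). The kernel choice, the function name `ddgp_cov`, and the parameters `ell` and `sigma2` are assumptions for illustration; the text above does not prescribe a particular kernel. With \(\mathbf{r} = \mathbf{x} - \mathbf{x}'\), the derivative blocks follow from \(\partial k / \partial \mathbf{x} = -\mathbf{r} k / \ell^2\) and \(\partial^2 k / \partial \mathbf{x} \partial \mathbf{x}'^T = (\mathbf{I}/\ell^2 - \mathbf{r}\mathbf{r}^T/\ell^4) k\).

```python
import numpy as np

def ddgp_cov(X1, V1, X2, V2, ell=1.0, sigma2=1.0):
    """Covariance between augmented vectors at (X1, V1) and (X2, V2).

    X1: (n1, d), V1: (n1, q1, d), X2: (n2, d), V2: (n2, q2, d).
    Returns an (n1 * (1 + q1), n2 * (1 + q2)) matrix.
    Assumes a squared-exponential kernel (illustrative choice).
    """
    n1, q1, _ = V1.shape
    n2, q2, _ = V2.shape
    R = X1[:, None, :] - X2[None, :, :]           # (n1, n2, d), r = x - x'
    K = sigma2 * np.exp(-0.5 * np.sum(R**2, axis=-1) / ell**2)
    rv1 = np.einsum("abd,aqd->aqb", R, V1)        # (n1, q1, n2): r . v
    rv2 = np.einsum("abd,bpd->abp", R, V2)        # (n1, n2, q2): r . v'
    vv = np.einsum("aqd,bpd->aqbp", V1, V2)       # (n1, q1, n2, q2): v . v'
    # cov(f(x), D_{v'} f(x')) = (r . v') / ell^2 * k
    K_fd = (rv2 / ell**2) * K[:, :, None]
    # cov(D_v f(x), f(x')) = -(r . v) / ell^2 * k
    K_df = (-rv1 / ell**2) * K[:, None, :]
    # cov(D_v f(x), D_{v'} f(x')) = (v . v' / ell^2 - (r.v)(r.v') / ell^4) * k
    cross = rv1[:, :, :, None] * rv2[:, None, :, :]
    K_dd = (vv / ell**2 - cross / ell**4) * K[:, None, :, None]
    top = np.hstack([K, K_fd.reshape(n1, n2 * q2)])
    bottom = np.hstack([K_df.reshape(n1 * q1, n2),
                        K_dd.reshape(n1 * q1, n2 * q2)])
    return np.vstack([top, bottom])

rng = np.random.default_rng(0)
n, d, q = 4, 3, 2
X = rng.standard_normal((n, d))
V = rng.standard_normal((n, q, d))
V /= np.linalg.norm(V, axis=-1, keepdims=True)   # unit-length directions
Sigma11 = ddgp_cov(X, V, X, V)
print(Sigma11.shape)
```

Calling the same function with test points and test directions as the second pair of arguments yields a training-test covariance with the same block layout.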

Similarly, the training-test covariance block \(\boldsymbol{\Sigma}_{12}^{DD}\) is an \(n(1+q) \times n_*(1+q_*)\) matrix, where \(n_*\) is the number of test points and \(q_*\) the number of directions per test point:

\[\begin{split}\boldsymbol{\Sigma}_{12}^{DD} = \begin{pmatrix} K(\mathbf{X}, \mathbf{X}_*) & \frac{\partial K(\mathbf{X}, \mathbf{X}_*)}{\partial \mathbf{V}_*'} \\ \frac{\partial K(\mathbf{X}, \mathbf{X}_*)}{\partial \mathbf{V}} & \frac{\partial^2 K(\mathbf{X}, \mathbf{X}_*)}{\partial \mathbf{V} \partial \mathbf{V}_*'} \end{pmatrix}\end{split}\tag{4}\]

where \(\boldsymbol{\Sigma}_{21}^{DD} = (\boldsymbol{\Sigma}_{12}^{DD})^T\), and \(\boldsymbol{\Sigma}_{22}^{DD}\) has the same structure as \(\boldsymbol{\Sigma}_{11}^{DD}\) but evaluated at test points. The posterior predictive distribution follows the standard GP conditioning formula:

\[\begin{split}\boldsymbol{\mu}^{DD}_{*} &= \boldsymbol{\Sigma}_{21}^{DD} (\boldsymbol{\Sigma}_{11}^{DD})^{-1} \mathbf{y}^{DD}, \\ \boldsymbol{\Sigma}^{DD}_{*} &= \boldsymbol{\Sigma}_{22}^{DD} - \boldsymbol{\Sigma}_{21}^{DD} (\boldsymbol{\Sigma}_{11}^{DD})^{-1} \boldsymbol{\Sigma}_{12}^{DD}\end{split}\tag{5}\]
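In practice the inverse in the conditioning formula is usually not formed explicitly. The sketch below evaluates the posterior mean and covariance through a Cholesky factorization of the training block. The function name `gp_condition` and the small `jitter` term added to the diagonal are assumptions for numerical robustness, and a synthetic symmetric positive-definite matrix stands in for the actual DDGP covariance blocks.

```python
import numpy as np

def gp_condition(S11, S12, S22, y, jitter=1e-10):
    """Posterior mean and covariance of the test block given training data.

    S11: (M, M) training covariance, S12: (M, M*) training-test covariance,
    S22: (M*, M*) test covariance, y: (M,) augmented training vector.
    """
    M = S11.shape[0]
    # jitter is an assumed stabilizer, not part of the formula itself.
    L = np.linalg.cholesky(S11 + jitter * np.eye(M))
    # alpha = S11^{-1} y via two solves with the Cholesky factor.
    # (np.linalg.solve is a general solver; a triangular solver such as
    # scipy.linalg.solve_triangular would exploit the structure.)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    W = np.linalg.solve(L, S12)   # W = L^{-1} S12
    mean = S12.T @ alpha          # S21 S11^{-1} y
    cov = S22 - W.T @ W           # S22 - S21 S11^{-1} S12
    return mean, cov

# Synthetic SPD joint covariance standing in for the DDGP blocks.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
S = A @ A.T + 8.0 * np.eye(8)
S11, S12, S22 = S[:6, :6], S[:6, 6:], S[6:, 6:]
y = rng.standard_normal(6)

mean, cov = gp_condition(S11, S12, S22, y)
print(mean.shape, cov.shape)
```

Because the mean and covariance are over the augmented test vector, the first \(n_*\) entries of the mean are predicted function values and the remaining entries are predicted directional derivatives.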

Hyperparameter Optimization

As with other derivative-enhanced GPs, the kernel hyperparameters \(\boldsymbol{\psi}\) are determined by maximizing the log marginal likelihood:

\[\log p(\mathbf{y}^{DD}|\mathbf{X}, \mathbf{V}, \boldsymbol{\psi}) = -\frac{1}{2} (\mathbf{y}^{DD})^T (\boldsymbol{\Sigma}_{11}^{DD})^{-1} \mathbf{y}^{DD} - \frac{1}{2}\log|\boldsymbol{\Sigma}_{11}^{DD}| - \frac{n(1+q)}{2}\log 2\pi\tag{6}\]
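A common way to evaluate this expression is through a Cholesky factor \(\mathbf{L}\) of \(\boldsymbol{\Sigma}_{11}^{DD}\), using \(\log|\boldsymbol{\Sigma}_{11}^{DD}| = 2\sum_i \log L_{ii}\). The following is a minimal sketch; the function name `log_marginal_likelihood` is hypothetical, and a synthetic positive-definite matrix again stands in for the covariance that the kernel hyperparameters would produce.

```python
import numpy as np

def log_marginal_likelihood(Sigma11, y):
    """Log marginal likelihood of y under N(0, Sigma11)."""
    M = y.shape[0]                      # M = n * (1 + q)
    L = np.linalg.cholesky(Sigma11)     # Sigma11 = L L^T
    z = np.linalg.solve(L, y)           # z = L^{-1} y, so y^T Sigma^{-1} y = z.z
    quad = 0.5 * (z @ z)
    half_logdet = np.sum(np.log(np.diag(L)))   # equals 0.5 * log|Sigma11|
    return -quad - half_logdet - 0.5 * M * np.log(2.0 * np.pi)

# Synthetic SPD matrix and data vector (illustrative only).
rng = np.random.default_rng(1)
M = 9
A = rng.standard_normal((M, M))
Sigma11 = A @ A.T + M * np.eye(M)
y = rng.standard_normal(M)

lml = log_marginal_likelihood(Sigma11, y)
print(lml)
```

In a hyperparameter search, this function would be called once per candidate \(\boldsymbol{\psi}\), with `Sigma11` rebuilt from the kernel each time, so the factorization cost dominates the optimization.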

Computational Advantages

The primary advantage of DDGP is its favorable computational scaling. The matrix to be inverted has dimension \(M = n(1+q)\), so a dense factorization costs \(\mathcal{O}((n(1+q))^3)\). Since \(q\) is typically chosen as a small constant (\(q \ll d\)), this is substantially lower than the \(\mathcal{O}((n(d+1))^3)\) cost of full GEK, which makes DDGP practical for high-dimensional problems where computing all partial derivatives would be prohibitive. The trade-off is that DDGP captures less geometric information than GEK. With careful selection of directions (e.g., along principal curvature directions or gradient directions), however, it can still provide significant accuracy improvements over standard GP models.
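To give a rough sense of scale, the following sketch compares the two system sizes for one hypothetical problem. The values `n = 100`, `d = 50`, and `q = 3` are assumptions chosen for illustration, and the cubic ratio ignores constant factors and any structure a solver might exploit.

```python
# Hypothetical problem size (illustrative values only).
n, d, q = 100, 50, 3

m_ddgp = n * (1 + q)   # DDGP system size: 400
m_gek = n * (1 + d)    # full GEK system size: 5100

# Ratio of the cubic factorization costs, ignoring constants.
ratio = (m_gek / m_ddgp) ** 3

print(m_ddgp, m_gek, ratio)
```

For these values the GEK system is 12.75 times larger per side, which corresponds to a cubic cost ratio of roughly 2000.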

References

[1]

A Comparison of Numerical Optimizers in Developing High Dimensional Surrogate Models. In Volume 2B: 45th Design Automation Conference, International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, August 2019. URL: https://doi.org/10.1115/DETC2019-97499, doi:10.1115/DETC2019-97499.

[2]

Georges Matheron. Principles of geostatistics. Economic Geology, 58(8):1246–1266, 1963.

[3]

D. G. Krige. A statistical approach to some basic mine valuation problems on the Witwatersrand. OR, 4(1):18–18, 1953. URL: http://www.jstor.org/stable/3006914 (visited on 2025-02-20).

[4]

William J. Welch, Robert J. Buck, Jerome Sacks, Henry P. Wynn, and Toby J. Mitchell. Screening, predicting, and computer experiments. Technometrics, 34(1):15–25, 1992. URL: https://www.tandfonline.com/doi/abs/10.1080/00401706.1992.10485229, doi:10.1080/00401706.1992.10485229.

[5]

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006. ISBN 9780262182539. URL: http://www.gaussianprocess.org/gpml/.

[6]

Gregoire Allaire and Sidi Mahmoud Kaber. Numerical Linear Algebra. Texts in Applied Mathematics. Springer, New York, NY, January 2008.

[7]

Weiyu Liu and Stephen Batill. Gradient-enhanced response surface approximations using kriging models. In 9th AIAA/ISSMO Symposium on Multidisciplinary Analysis and Optimization. Reston, Virginia, September 2002. American Institute of Aeronautics and Astronautics.

[8]

Wataru Yamazaki, Markus Rumpfkeil, and Dimitri Mavriplis. Design optimization utilizing gradient/Hessian enhanced surrogate model. In 28th AIAA Applied Aerodynamics Conference. Reston, Virginia, June 2010. American Institute of Aeronautics and Astronautics.

[9]

Selvakumar Ulaganathan, Ivo Couckuyt, Tom Dhaene, Joris Degroote, and Eric Laermans. Performance study of gradient-enhanced kriging. Eng. Comput., 32(1):15–34, January 2016.

[10]

Alexander I.J. Forrester and Andy J. Keane. Recent advances in surrogate-based optimization. Progress in Aerospace Sciences, 45(1):50–79, 2009. URL: https://www.sciencedirect.com/science/article/pii/S0376042108000766, doi:10.1016/j.paerosci.2008.11.001.

[11]

Youwei He, Kuan Tan, Chunming Fu, and Jinliang Luo. An efficient gradient-enhanced kriging modeling method assisted by fast kriging for high-dimension problems. International Journal of Numerical Methods for Heat & Fluid Flow, 33(12):3967–3993, 2023.

[12]

Selvakumar Ulaganathan, Ivo Couckuyt, Tom Dhaene, Eric Laermans, and Joris Degroote. On the use of gradients in kriging surrogate models. In Proceedings of the Winter Simulation Conference 2014. IEEE, December 2014.

[13]

Liming Chen, Haobo Qiu, Liang Gao, Chen Jiang, and Zan Yang. A screening-based gradient-enhanced kriging modeling method for high-dimensional problems. Applied Mathematical Modelling, 69:15–31, 2019. URL: https://www.sciencedirect.com/science/article/pii/S0307904X18305900, doi:10.1016/j.apm.2018.11.048.

[14]

Zhong-Hua Han, Yu Zhang, Chen-Xing Song, and Ke-Shi Zhang. Weighted gradient-enhanced kriging for high-dimensional surrogate modeling and design optimization. AIAA Journal, 55(12):4330–4346, 2017. URL: https://doi.org/10.2514/1.J055842, doi:10.2514/1.J055842.

[15]

Yiming Yao, Fei Liu, and Qingfu Zhang. High-throughput multi-objective Bayesian optimization using gradients. In 2024 IEEE Congress on Evolutionary Computation (CEC), volume 2, 1–8. IEEE, June 2024.

[16]

Misha Padidar, Xinran Zhu, Leo Huang, Jacob R Gardner, and David Bindel. Scaling Gaussian processes with derivative information using variational inference. Advances in Neural Information Processing Systems, 34:6442–6453, 2021. arXiv:2107.04061.

[17]

Haitao Liu, Jianfei Cai, and Yew-Soon Ong. Remarks on multi-output Gaussian process regression. Knowledge-Based Systems, 144:102–121, 2018. URL: https://www.sciencedirect.com/science/article/pii/S0950705117306123, doi:10.1016/j.knosys.2017.12.034.

[18]

B. Rakitsch, Christoph Lippert, K. Borgwardt, and Oliver Stegle. It is all in the noise: efficient multi-task Gaussian process inference with structured residuals. Advances in Neural Information Processing Systems, 2013.

[19]

Edwin V Bonilla, Kian Chai, and Christopher Williams. Multi-task Gaussian process prediction. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007. URL: https://proceedings.neurips.cc/paper_files/paper/2007/file/66368270ffd51418ec58bd793f2d9b1b-Paper.pdf.

[20]

Chen Zhou Xu, Zhong Hua Han, Ke Shi Zhang, and Wen Ping Song. Improved weighted gradient-enhanced kriging model for high-dimensional aerodynamic modeling problems. In 32nd Congress of the International Council of the Aeronautical Sciences, ICAS 2021. International Council of the Aeronautical Sciences, 2021.