Weighted Gradient-Enhanced Gaussian Processes#

\[\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min}\]

As previously outlined in the derivative-enhanced GPs section, operating with a large correlation matrix that includes all cross-correlations between function values and derivatives becomes computationally prohibitive for large datasets. An alternative strategy, known as weighted gradient-enhanced kriging (WGEK), alleviates this burden by decomposing the problem into multiple smaller submodels [14]. Rather than training a single GP on all derivative data, WGEK constructs submodels, each incorporating only a subset of the available derivatives. These submodels are then combined through a weighted sum that preserves the full set of interpolation conditions. The key advantage is that each submodel operates on a much smaller correlation matrix, avoiding the computational cost of decomposing the full gradient-enhanced covariance matrix.

Problem Formulation#

The WGEK formulation begins with an observed training dataset \(\mathcal{D}\) containing \(n\) input-output pairs with gradients:

\[\mathcal{D} = \left\{\mathbf{X}, f(\mathbf{X}), \nabla f(\mathbf{X})\right\} := \left\{(\mathbf{x}_i, f(\mathbf{x}_i), \nabla f(\mathbf{x}_i)) \mid i = 1, \ldots, n \right\}\]

The gradient information \(\nabla f(\mathbf{X})\) is partitioned into \(M \leq n\) disjoint sets:

\[\begin{split}D_1 &= \left\{\nabla f(\mathbf{x}_i) \mid i = 1, \ldots, m_1\right\}, \\ D_2 &= \left\{\nabla f(\mathbf{x}_i) \mid i = m_1 + 1, \ldots, m_2\right\}, \\ &\vdots \\ D_M &= \left\{\nabla f(\mathbf{x}_i) \mid i = m_{M-1} + 1, \ldots, n\right\},\end{split}\]
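The contiguous partitioning above can be sketched in a few lines. This is a hypothetical example with \(n = 10\) gradients split into \(M = 3\) sets; the boundary values \(m_1 = 4\) and \(m_2 = 7\) are illustrative (0-based indices are used, as is conventional in code):

```python
# Hypothetical example: n = 10 gradient observations, M = 3 contiguous
# partitions with cumulative boundaries m_1 = 4 and m_2 = 7
n = 10
boundaries = [0, 4, 7, n]  # sentinels 0 and n added around m_1, m_2

# Index sets I_k for each partition D_k
partitions = [list(range(boundaries[k], boundaries[k + 1]))
              for k in range(len(boundaries) - 1)]
# The sets are disjoint and together cover all n gradient indices
```
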

where \(0 < m_1 < m_2 < \cdots < m_{M-1} < n\) are cumulative partition boundaries. The training data for the \(k^{\text{th}}\) submodel is then:

\[\mathcal{D}_k = \left\{(\mathbf{x}_i, f(\mathbf{x}_i)) \mid i = 1, \ldots, n \right\} \cup D_k\]

That is, each submodel contains all \(n\) function values but only the gradient information from partition \(D_k\). Let \(\mathcal{I}_k = \{i \mid \nabla f(\mathbf{x}_i) \in D_k\}\) denote the indices corresponding to partition \(D_k\), and let \(\mathbf{X}_k = \{\mathbf{x}_i \mid i \in \mathcal{I}_k\}\) denote the locations of these gradients. The observation vector for the \(k^{\text{th}}\) submodel is augmented to include only these selected gradients:

(1)#\[\begin{split}\mathbf{y}^{(k)} = \begin{bmatrix} f(\mathbf{X}) \\ \nabla f(\mathbf{X}_k) \end{bmatrix} \in \mathbb{R}^{n + d|\mathcal{I}_k|}\end{split}\]
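Assembling the augmented observation vector of Equation (1) is a simple concatenation. A minimal sketch with made-up data (the sizes and the index set \(\mathcal{I}_k\) are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 6, 2                       # training points and input dimension
f_X = rng.normal(size=n)          # function values f(X)
grad_X = rng.normal(size=(n, d))  # gradients, one row per training point

I_k = [1, 3]                      # hypothetical index set I_k for partition D_k

# Eq. (1): all n function values, then the stacked gradients at X_k
y_k = np.concatenate([f_X, grad_X[I_k].ravel()])  # length n + d*|I_k|
```
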

An important distinction of the WGEK formulation is that predictions are made only for function values \(f(\mathbf{X}_*)\) at test locations \(\mathbf{X}_*\), not for derivatives. Because each submodel is trained on an incomplete set of gradient observations and the submodels are combined through weighted summation, the framework does not directly provide predictive distributions for derivative values or their associated uncertainties. This contrasts with the full GEK formulation and PGEK, which naturally support predictions for both function values and derivatives with uncertainty quantification.

Submodel Covariance Structure#

For each submodel \(k\), the joint distribution between the augmented training observations \(\mathbf{y}^{(k)}\) and the test predictions \(f(\mathbf{X}_*)\) is multivariate Gaussian:

(2)#\[\begin{split}\begin{pmatrix} \mathbf{y}^{(k)} \\ f(\mathbf{X}_*) \end{pmatrix} \sim \mathcal{N}\left( \mathbf{0}, \begin{pmatrix} \boldsymbol{\Sigma}^{(k)}_{11} & \boldsymbol{\Sigma}^{(k)}_{12} \\ \boldsymbol{\Sigma}^{(k)}_{21} & \boldsymbol{\Sigma}^{(k)}_{22} \end{pmatrix} \right)\end{split}\]

The training covariance block \(\boldsymbol{\Sigma}^{(k)}_{11}\) is an \((n + d|\mathcal{I}_k|) \times (n + d|\mathcal{I}_k|)\) matrix:

(3)#\[\begin{split}\boldsymbol{\Sigma}^{(k)}_{11} = \begin{pmatrix} K(\mathbf{X}, \mathbf{X}') & \frac{\partial K(\mathbf{X}, \mathbf{X}_k')}{\partial \mathbf{X}_k'} \\ \frac{\partial K(\mathbf{X}_k, \mathbf{X}')}{\partial \mathbf{X}_k} & \frac{\partial^2 K(\mathbf{X}_k, \mathbf{X}_k')}{\partial \mathbf{X}_k \partial \mathbf{X}_k'} \end{pmatrix}\end{split}\]

The off-diagonal block \(\frac{\partial K(\mathbf{X}, \mathbf{X}_k')}{\partial \mathbf{X}_k'}\) is an \(n \times d|\mathcal{I}_k|\) matrix representing cross-covariances between function values at all \(n\) training points and gradients at the \(|\mathcal{I}_k|\) points in \(\mathbf{X}_k\). The lower-right block \(\frac{\partial^2 K(\mathbf{X}_k, \mathbf{X}_k')}{\partial \mathbf{X}_k \partial \mathbf{X}_k'}\) is a \(d|\mathcal{I}_k| \times d|\mathcal{I}_k|\) matrix representing cross-covariances between gradients at points in \(\mathbf{X}_k\) only.
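The block structure of Equation (3) can be assembled explicitly once a differentiable kernel is fixed. The sketch below assumes a squared-exponential kernel (a common choice, not mandated by the WGEK formulation) and uses plain loops for clarity rather than speed:

```python
import numpy as np

def rbf(x, xp, ell=1.0, sigma2=1.0):
    """Squared-exponential kernel k(x, x') (an assumed kernel choice)."""
    r = x - xp
    return sigma2 * np.exp(-0.5 * np.dot(r, r) / ell**2)

def rbf_dxp(x, xp, ell=1.0, sigma2=1.0):
    """dk/dx': cross-covariance of f(x) with the gradient at x'."""
    return rbf(x, xp, ell, sigma2) * (x - xp) / ell**2

def rbf_dx_dxp(x, xp, ell=1.0, sigma2=1.0):
    """d^2 k / dx dx': the d x d gradient-gradient covariance block."""
    r = (x - xp) / ell**2
    return rbf(x, xp, ell, sigma2) * (np.eye(len(x)) / ell**2 - np.outer(r, r))

def sigma11(X, I_k, ell=1.0, sigma2=1.0, jitter=1e-10):
    """Assemble the (n + d|I_k|) x (n + d|I_k|) matrix of Eq. (3)."""
    n, d = X.shape
    S = np.zeros((n + d * len(I_k),) * 2)
    for i in range(n):
        for j in range(n):
            S[i, j] = rbf(X[i], X[j], ell, sigma2)          # K(X, X')
        for b, jk in enumerate(I_k):                         # upper-right block
            S[i, n + b*d:n + (b+1)*d] = rbf_dxp(X[i], X[jk], ell, sigma2)
    S[n:, :n] = S[:n, n:].T                                  # lower-left, by symmetry
    for a, ik in enumerate(I_k):                             # lower-right block
        for b, jk in enumerate(I_k):
            S[n + a*d:n + (a+1)*d, n + b*d:n + (b+1)*d] = \
                rbf_dx_dxp(X[ik], X[jk], ell, sigma2)
    return S + jitter * np.eye(len(S))  # small jitter for numerical stability

# Example: n = 3 points in d = 2, gradients retained at indices I_k = {0, 2}
X = np.array([[0.0, 0.0], [1.0, 0.5], [0.2, 0.9]])
S = sigma11(X, [0, 2])  # (3 + 2*2) x (3 + 2*2), symmetric positive definite
```
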

Since predictions of derivatives are not made at the test points \(\mathbf{X}_*\), the cross-covariance block \(\boldsymbol{\Sigma}^{(k)}_{12}\) between training data and test points is a \(\left(n + d|\mathcal{I}_k|\right) \times n_*\) matrix:

(4)#\[\begin{split}\boldsymbol{\Sigma}^{(k)}_{12} = \begin{pmatrix} K(\mathbf{X}, \mathbf{X}_*) \\ \frac{\partial K(\mathbf{X}_k, \mathbf{X}_*)}{\partial \mathbf{X}_k} \end{pmatrix}\end{split}\]

and the test covariance block \(\boldsymbol{\Sigma}^{(k)}_{22} = K(\mathbf{X}_*, \mathbf{X}_*')\), an \(n_* \times n_*\) matrix where \(n_* = |\mathbf{X}_*|\) is the number of test points. The posterior predictive distribution for the \(k^{\text{th}}\) submodel follows the standard GP conditioning formula (analogous to Equation (5)):

(5)#\[\begin{split}\boldsymbol{\mu}^{(k)}_{*} &= \boldsymbol{\Sigma}^{(k)}_{21} \left(\boldsymbol{\Sigma}^{(k)}_{11}\right)^{-1} \mathbf{y}^{(k)}, \\ \boldsymbol{\Sigma}^{(k)}_{*} &= \boldsymbol{\Sigma}^{(k)}_{22} - \boldsymbol{\Sigma}^{(k)}_{21} \left(\boldsymbol{\Sigma}^{(k)}_{11}\right)^{-1} \boldsymbol{\Sigma}^{(k)}_{12}\end{split}\]
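Equation (5) is standard GP conditioning and is typically computed via a Cholesky factorization rather than an explicit inverse. A minimal sketch; the toy check uses function values only (an empty gradient partition) to stay short, but the routine applies unchanged to the gradient-augmented blocks:

```python
import numpy as np

def submodel_posterior(S11, S12, S22, y):
    """GP conditioning of Eq. (5) for one submodel, via Cholesky."""
    L = np.linalg.cholesky(S11)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # S11^{-1} y
    V = np.linalg.solve(L, S12)                          # L^{-1} S12
    mu = S12.T @ alpha                                   # Sigma_21 S11^{-1} y
    cov = S22 - V.T @ V                                  # Sigma_22 - Sigma_21 S11^{-1} Sigma_12
    return mu, cov

# Toy check: interpolate sin with a 1-D RBF kernel at three points
X = np.array([0.0, 1.0, 2.0])
Xs = np.array([0.5])
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)
mu, cov = submodel_posterior(k(X, X) + 1e-10 * np.eye(3),
                             k(X, Xs), k(Xs, Xs), np.sin(X))
```
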

Weighted Combination of Submodels#

By construction, each submodel prediction \(\boldsymbol{\mu}_*^{(k)}\) exactly interpolates its training observations: \(\boldsymbol{\mu}_*^{(k)}(\mathbf{x}_i) = f(\mathbf{x}_i)\) for all \(i\), and \(\nabla \boldsymbol{\mu}_*^{(k)}(\mathbf{x}_i) = \nabla f(\mathbf{x}_i)\) for \(i \in \mathcal{I}_k\). The \(M\) submodel predictions are then combined through a weighted summation to obtain the global prediction \(\boldsymbol{\mu}_*\) at test locations \(\mathbf{X}_*\):

(6)#\[\boldsymbol{\mu}_* = \sum_{k=1}^{M} w_k \boldsymbol{\mu}_*^{(k)}\]

where \(w_k: \mathbb{R}^d \to \mathbb{R}\) are weight functions. To ensure that the WGEK predictor preserves these interpolation conditions globally (i.e., \(\boldsymbol{\mu}_*(\mathbf{x}_i) = f(\mathbf{x}_i)\) for all training points), the weights must satisfy:

(7)#\[\sum_{k=1}^{M} w_k(\mathbf{x}) = 1 \quad \forall \mathbf{x} \in \mathbb{R}^d\]

and

\[\begin{split}w_k(\mathbf{x}_i) = \begin{cases} 1 & \text{if } i \in \mathcal{I}_k, \\ 0 & \text{otherwise}. \end{cases}\end{split}\]

That is, at each training point \(\mathbf{x}_i\), only the submodel containing that point’s gradient information (submodel \(k\) where \(i \in \mathcal{I}_k\)) contributes to the prediction, thereby ensuring that both function and derivative interpolation conditions are satisfied.

Computing Weight Functions#

To compute the weight functions satisfying these constraints, we follow the methodology outlined in [14]. For a given test set \(\mathbf{X}_*\), the weights are determined by solving the following constrained linear system:

(8)#\[\begin{split}\begin{bmatrix} K(\mathbf{X}, \mathbf{X}') & \mathbf{f} \\ \mathbf{f}^T & 0 \end{bmatrix} \begin{bmatrix} \mathbf{W} \\ \boldsymbol{\mu} \end{bmatrix} = \begin{bmatrix} K(\mathbf{X}, \mathbf{X}_*) \\ \mathbf{1}_{1 \times n_*} \end{bmatrix}\end{split}\]

where:

  • \(\mathbf{W} \in \mathbb{R}^{n \times n_*}\) contains weight contributions from each training point (rows) to each test point (columns)

  • \(\mathbf{f} \in \mathbb{R}^{n \times 1}\) is a vector of ones enforcing the partition-of-unity constraint

  • \(\boldsymbol{\mu} \in \mathbb{R}^{1 \times n_*}\) contains Lagrange multipliers for each test point

  • \(\mathbf{1}_{1 \times n_*}\) is a row vector of \(n_*\) ones

The submodel weight coefficients \(w_k(\mathbf{x}_*^{(j)})\) for the \(k\)-th submodel at test point \(\mathbf{x}_*^{(j)}\) are obtained by summing the weight contributions from all training points belonging to partition \(\mathcal{I}_k\):

(9)#\[w_k(\mathbf{x}_*^{(j)}) = \sum_{i \in \mathcal{I}_k} W_{ij}\]
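Equations (8) and (9) can be sketched directly with a dense solve. The setup below is hypothetical (a 1-D RBF kernel, random points, and two partitions); by construction the row \(\mathbf{f}^T \mathbf{W} = \mathbf{1}\) of the system enforces the partition-of-unity condition of Equation (7) at every test point:

```python
import numpy as np

# Hypothetical setup: n = 4 training points, n_* = 2 test points, M = 2 partitions
rng = np.random.default_rng(1)
X = rng.uniform(size=(4, 1))
Xs = rng.uniform(size=(2, 1))
k = lambda A, B: np.exp(-0.5 * (A - B.T) ** 2)   # 1-D RBF kernel

n, n_star = len(X), len(Xs)
K = k(X, X) + 1e-10 * np.eye(n)
f = np.ones((n, 1))

# Constrained linear system of Eq. (8)
A = np.block([[K, f], [f.T, np.zeros((1, 1))]])
b = np.vstack([k(X, Xs), np.ones((1, n_star))])
sol = np.linalg.solve(A, b)
W = sol[:n]                       # n x n_* weight contributions
# sol[n:] holds the Lagrange multipliers mu, one per test point

# Eq. (9): submodel weights by summing rows over each partition I_k
partitions = [[0, 1], [2, 3]]
w = np.array([W[idx].sum(axis=0) for idx in partitions])  # M x n_*
```
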

The predictive variance is computed following [14] by weighting the submodel standard deviations:

(10)#\[\left[\boldsymbol{\Sigma}_{*}\right]_{jj} = \left(\sum_{k=1}^{M} w_k(\mathbf{x}_*^{(j)}) \sqrt{\left[\boldsymbol{\Sigma}^{(k)}_{*}\right]_{jj}}\right)^2\]

where \(\left[\boldsymbol{\Sigma}^{(k)}_{*}\right]_{jj}\) is the predictive variance from the \(k\)-th submodel at test point \(\mathbf{x}_*^{(j)}\).
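The weighted combination of Equations (6) and (10) reduces to elementwise arithmetic over the submodel outputs. A sketch with made-up numbers (\(M = 2\) submodels, \(n_* = 3\) test points; the weight columns sum to one as Equation (7) requires):

```python
import numpy as np

# Hypothetical submodel outputs at n_* = 3 test points for M = 2 submodels
mu_sub = np.array([[1.0, 2.0, 3.0],      # mu_*^(1)
                   [1.2, 1.8, 3.1]])     # mu_*^(2)
var_sub = np.array([[0.04, 0.09, 0.01],  # diagonal of Sigma_*^(1)
                    [0.01, 0.16, 0.04]]) # diagonal of Sigma_*^(2)
w = np.array([[0.7, 0.5, 0.2],           # w_k(x_*^(j)); columns sum to 1
              [0.3, 0.5, 0.8]])

mu_star = (w * mu_sub).sum(axis=0)                  # Eq. (6): weighted means
var_star = ((w * np.sqrt(var_sub)).sum(axis=0))**2  # Eq. (10): weighted std devs
```
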

Hyperparameter Optimization#

As with the previous models, the kernel hyperparameters \(\boldsymbol{\psi}\) must be determined. Since the submodels represent different views of the same underlying random process, they share the same hyperparameters. For the \(k\)-th submodel, the marginal log-likelihood (MLL) of the augmented observations \(\mathbf{y}^{(k)}\) is:

(11)#\[\log p(\mathbf{y}^{(k)}|\mathbf{X}, \boldsymbol{\psi}) = -\frac{1}{2} \left(\mathbf{y}^{(k)}\right)^\top \left(\boldsymbol{\Sigma}^{(k)}_{11}\right)^{-1} \mathbf{y}^{(k)} - \frac{1}{2}\log|\boldsymbol{\Sigma}^{(k)}_{11}| - \frac{n + d|\mathcal{I}_k|}{2}\log 2\pi\]

Following [14], the shared hyperparameters are determined by maximizing an aggregated log-likelihood function:

(12)#\[\text{JLL}(\boldsymbol{\psi}) = \frac{1}{M}\sum_{k=1}^{M}\log p(\mathbf{y}^{(k)}|\mathbf{X}, \boldsymbol{\psi})\]

This uniform averaging provides a computationally efficient approximation to the full joint likelihood while ensuring all submodels contribute equally to the hyperparameter estimation. The optimal hyperparameters are obtained as:

(13)#\[\boldsymbol{\psi}^* = \argmax_{\boldsymbol{\psi}} \text{JLL}(\boldsymbol{\psi})\]
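A minimal 1-D sketch of the JLL optimization, assuming a squared-exponential kernel with a single length-scale hyperparameter and SciPy's Nelder-Mead optimizer (the toy data, partition layout, and optimizer choice are illustrative, not prescribed by [14]):

```python
import numpy as np
from scipy.optimize import minimize

X = np.linspace(0.0, 3.0, 6)
f = np.sin(X)
g = np.cos(X)                        # exact gradients of sin(x)

def sigma11(ell, I_k):
    """1-D gradient-augmented covariance Sigma_11^(k) for an RBF kernel."""
    n, m = len(X), len(I_k)
    D = X[:, None] - X[None, :]
    K = np.exp(-0.5 * D**2 / ell**2)
    S = np.zeros((n + m, n + m))
    S[:n, :n] = K
    S[:n, n:] = (K * D / ell**2)[:, I_k]          # Cov(f(x_i), f'(x_j))
    S[n:, :n] = S[:n, n:].T
    S[n:, n:] = (K * (1.0 / ell**2 - D**2 / ell**4))[np.ix_(I_k, I_k)]
    return S + 1e-6 * np.eye(n + m)               # jitter for stability

def mll(y_k, S):
    """Marginal log-likelihood of Eq. (11) via a Cholesky factorization."""
    L = np.linalg.cholesky(S)
    a = np.linalg.solve(L.T, np.linalg.solve(L, y_k))
    return (-0.5 * y_k @ a - np.log(np.diag(L)).sum()
            - 0.5 * len(y_k) * np.log(2.0 * np.pi))

partitions = [[0, 1, 2], [3, 4, 5]]  # M = 2 gradient partitions

def neg_jll(log_ell):
    """Negative aggregated log-likelihood of Eq. (12), shared hyperparameter."""
    ell = np.exp(log_ell[0])
    try:
        return -np.mean([mll(np.concatenate([f, g[I]]), sigma11(ell, I))
                         for I in partitions])
    except np.linalg.LinAlgError:    # ill-conditioned for extreme length-scales
        return np.inf

res = minimize(neg_jll, x0=[0.0], method="Nelder-Mead")  # Eq. (13)
ell_opt = float(np.exp(res.x[0]))
```

Optimizing in \(\log\) of the length-scale keeps the search unconstrained while guaranteeing a positive hyperparameter, a common practice in GP implementations.
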

Computational Trade-offs#

This decomposition reduces the total training complexity from \(\mathcal{O}((n(d+1))^3)\) for standard GEK to \(\mathcal{O}\left(\sum_{k=1}^M (n + d|\mathcal{I}_k|)^3\right)\) for WGEK, providing substantial computational savings when the partitions distribute the gradient information appropriately. However, this computational efficiency comes at the cost of model accuracy. By partitioning the derivative data, WGEK disrupts the derivative cross-correlation structure that full GEK exploits. As the number of partitions \(M\) increases, each submodel contains less derivative information, making the global predictor progressively less informative. The extreme case of \(M=n\) (one gradient per submodel) provides maximum computational savings but minimal derivative cross-correlation information, resulting in the least accurate predictions. Conversely, reducing \(M\) retains more of the derivative correlation structure, improving model accuracy at the expense of increased computational cost. Thus, the choice of \(M\) represents a trade-off between computational efficiency and predictive fidelity.
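The cost comparison above can be made concrete with a quick back-of-the-envelope helper (assuming \(M\) equal-sized partitions with \(M\) dividing \(n\); the example sizes are illustrative):

```python
def wgek_cost_ratio(n, d, M):
    """Cubic factorization cost of WGEK relative to full GEK,
    assuming M equal partitions (M divides n)."""
    gek_cost = (n * (d + 1)) ** 3   # one factorization of the full matrix
    sub_size = n + d * (n // M)     # rows in each submodel's Sigma_11^(k)
    return M * sub_size ** 3 / gek_cost

# Example: n = 200 samples in d = 10 dimensions; larger M means cheaper
ratios = {M: wgek_cost_ratio(200, 10, M) for M in (2, 5, 10, 20)}
```

At `M = 1` the ratio is exactly one (WGEK degenerates to full GEK), and it falls monotonically as `M` grows, mirroring the accuracy-for-cost trade-off described above.
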

References#

[1]

A Comparison of Numerical Optimizers in Developing High Dimensional Surrogate Models. In Volume 2B: 45th Design Automation Conference, International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, August 2019. URL: https://doi.org/10.1115/DETC2019-97499, doi:10.1115/DETC2019-97499.

[2]

Georges Matheron. Principles of geostatistics. Economic Geology, 58(8):1246–1266, 1963.

[3]

D. G. Krige. A statistical approach to some basic mine valuation problems on the Witwatersrand. OR, 4(1):18–18, 1953. URL: http://www.jstor.org/stable/3006914 (visited on 2025-02-20).

[4]

William J. Welch, Robert J. Buck, Jerome Sacks, Henry P. Wynn, and Toby J. Mitchell. Screening, predicting, and computer experiments. Technometrics, 34(1):15–25, 1992. URL: https://www.tandfonline.com/doi/abs/10.1080/00401706.1992.10485229, doi:10.1080/00401706.1992.10485229.

[5]

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006. ISBN 9780262182539. URL: http://www.gaussianprocess.org/gpml/.

[6]

Gregoire Allaire and Sidi Mahmoud Kaber. Numerical Linear Algebra. Texts in applied mathematics. Springer, New York, NY, January 2008.

[7]

Weiyu Liu and Stephen Batill. Gradient-enhanced response surface approximations using kriging models. In 9th AIAA/ISSMO Symposium on Multidisciplinary Analysis and Optimization, Reston, Virginia, September 2002. American Institute of Aeronautics and Astronautics.

[8]

Wataru Yamazaki, Markus Rumpfkeil, and Dimitri Mavriplis. Design optimization utilizing gradient/hessian enhanced surrogate model. In 28th AIAA Applied Aerodynamics Conference, Reston, Virginia, June 2010. American Institute of Aeronautics and Astronautics.

[9]

Selvakumar Ulaganathan, Ivo Couckuyt, Tom Dhaene, Joris Degroote, and Eric Laermans. Performance study of gradient-enhanced kriging. Eng. Comput., 32(1):15–34, January 2016.

[10]

Alexander I.J. Forrester and Andy J. Keane. Recent advances in surrogate-based optimization. Progress in Aerospace Sciences, 45(1):50–79, 2009. URL: https://www.sciencedirect.com/science/article/pii/S0376042108000766, doi:10.1016/j.paerosci.2008.11.001.

[11]

Youwei He, Kuan Tan, Chunming Fu, and Jinliang Luo. An efficient gradient-enhanced kriging modeling method assisted by fast kriging for high-dimension problems. International Journal of Numerical Methods for Heat & Fluid Flow, 33(12):3967–3993, 2023.

[12]

Selvakumar Ulaganathan, Ivo Couckuyt, Tom Dhaene, Eric Laermans, and Joris Degroote. On the use of gradients in kriging surrogate models. In Proceedings of the Winter Simulation Conference 2014. IEEE, December 2014.

[13]

Liming Chen, Haobo Qiu, Liang Gao, Chen Jiang, and Zan Yang. A screening-based gradient-enhanced kriging modeling method for high-dimensional problems. Applied Mathematical Modelling, 69:15–31, 2019. URL: https://www.sciencedirect.com/science/article/pii/S0307904X18305900, doi:10.1016/j.apm.2018.11.048.

[14] (1,2,3,4)

Zhong-Hua Han, Yu Zhang, Chen-Xing Song, and Ke-Shi Zhang. Weighted gradient-enhanced kriging for high-dimensional surrogate modeling and design optimization. AIAA Journal, 55(12):4330–4346, 2017. URL: https://doi.org/10.2514/1.J055842, doi:10.2514/1.J055842.

[15]

Yiming Yao, Fei Liu, and Qingfu Zhang. High-throughput multi-objective Bayesian optimization using gradients. In 2024 IEEE Congress on Evolutionary Computation (CEC), volume 2, 1–8. IEEE, June 2024.

[16]

Misha Padidar, Xinran Zhu, Leo Huang, Jacob R Gardner, and David Bindel. Scaling Gaussian processes with derivative information using variational inference. Advances in Neural Information Processing Systems, 34:6442–6453, 2021. arXiv:2107.04061.

[17]

Haitao Liu, Jianfei Cai, and Yew-Soon Ong. Remarks on multi-output Gaussian process regression. Knowledge-Based Systems, 144:102–121, 2018. URL: https://www.sciencedirect.com/science/article/pii/S0950705117306123, doi:10.1016/j.knosys.2017.12.034.

[18]

B. Rakitsch, Christoph Lippert, K. Borgwardt, and Oliver Stegle. It is all in the noise: efficient multi-task Gaussian process inference with structured residuals. Advances in Neural Information Processing Systems, January 2013.

[19]

Edwin V Bonilla, Kian Chai, and Christopher Williams. Multi-task Gaussian process prediction. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007. URL: https://proceedings.neurips.cc/paper_files/paper/2007/file/66368270ffd51418ec58bd793f2d9b1b-Paper.pdf.

[20]

Chen Zhou Xu, Zhong Hua Han, Ke Shi Zhang, and Wen Ping Song. Improved weighted gradient-enhanced kriging model for high-dimensional aerodynamic modeling problems. In 32nd Congress of the International Council of the Aeronautical Sciences, ICAS 2021, 32nd Congress of the International Council of the Aeronautical Sciences, ICAS 2021. International Council of the Aeronautical Sciences, 2021.