Multi-output Gaussian Processes#
A multi-output Gaussian process (MOGP) extends the standard single-output GP to jointly approximate multiple outputs \(\{ \mathbf{y}_t \}_{t=1}^T\), explicitly modeling their correlations to improve predictive accuracy compared to independent modeling [17]. For convenience, an isotopic training setup is considered, in which all outputs share the same training inputs: \(\mathbf{X}_t = \mathbf{X}\) for \(t \in \left\{ 1, \ldots, T \right\}\).
Formally, the latent outputs are modeled jointly as

\[
\mathbf{f}(\mathbf{x}) = \left[ f_1(\mathbf{x}), \ldots, f_T(\mathbf{x}) \right]^T \sim \mathcal{GP}\left( \mathbf{0}, \mathbf{K}^{MO}\left( \mathbf{x}, \mathbf{x}' \right) \right),
\]
where the multi-output covariance is defined as

\[
\mathbf{K}^{MO}\left( \mathbf{x}, \mathbf{x}' \right) =
\begin{bmatrix}
k_{11}(\mathbf{x}, \mathbf{x}') & \cdots & k_{1T}(\mathbf{x}, \mathbf{x}') \\
\vdots & \ddots & \vdots \\
k_{T1}(\mathbf{x}, \mathbf{x}') & \cdots & k_{TT}(\mathbf{x}, \mathbf{x}')
\end{bmatrix},
\]
where, in particular, \(k_{tt'}(\mathbf{x}, \mathbf{x}')\) is the covariance between outputs \(f_{t}(\mathbf{x})\) and \(f_{t'}(\mathbf{x}')\). The observations are then assumed to follow

\[
y_t(\mathbf{x}) = f_t(\mathbf{x}) + \epsilon_t,
\]
where \(\epsilon_t \sim \mathcal{N}(0, \sigma_{s, t}^2)\) is additive independent and identically distributed (i.i.d.) Gaussian noise on the \(t^{\text{th}}\) output. Given this observation model, the likelihood of the stacked observation vector \(\mathbf{y}^{MO} = \left[ \mathbf{y}_1^T, \ldots, \mathbf{y}_T^T \right]^T\) given the correspondingly stacked latent values \(\mathbf{f}^{MO}\) follows:

\[
p\left( \mathbf{y}^{MO} \mid \mathbf{f}^{MO} \right) = \mathcal{N}\left( \mathbf{f}^{MO}, \, \boldsymbol{\Sigma}_s \otimes I_n \right),
\]
where \(\boldsymbol{\Sigma}_s \in \mathbb{R}^{T \times T}\) is a diagonal matrix with elements \(\{\sigma_{s,t}^2\}_{t=1}^{T}\) on the diagonal, reflecting the independence of the observation noise across outputs. Accounting for this noise improves not only the robustness of the covariance matrix but also the transfer of information across the outputs [18, 19].
To predict outputs at a new set of test locations, \(\mathbf{X}_*\), we model the joint distribution implied by the MOGP prior over the training and test outputs, where the training outputs are described by \([f_1(\mathbf{X}), \ldots, f_T(\mathbf{X})]^T\) and the test outputs by \([f_1(\mathbf{X}_*), \ldots, f_T(\mathbf{X}_*)]^T\).
The joint distribution between the augmented training observations and the augmented test predictions is a multivariate Gaussian:

\[
\begin{bmatrix} \mathbf{y}^{MO} \\ \mathbf{y}_*^{MO} \end{bmatrix}
\sim \mathcal{N}\left( \mathbf{0},
\begin{bmatrix}
\boldsymbol{\Sigma}^{MO}_{11} + \Sigma_M & \boldsymbol{\Sigma}^{MO}_{12} \\
\boldsymbol{\Sigma}^{MO}_{21} & \boldsymbol{\Sigma}^{MO}_{22}
\end{bmatrix}
\right).
\]
Here, the training covariance block, \(\boldsymbol{\Sigma}^{MO}_{11}\), is an \(nT \times nT\) matrix:

\[
\boldsymbol{\Sigma}^{MO}_{11} =
\begin{bmatrix}
k_{11}(\mathbf{X}, \mathbf{X}) & \cdots & k_{1T}(\mathbf{X}, \mathbf{X}) \\
\vdots & \ddots & \vdots \\
k_{T1}(\mathbf{X}, \mathbf{X}) & \cdots & k_{TT}(\mathbf{X}, \mathbf{X})
\end{bmatrix}.
\]
To account for observation noise, we define \(\Sigma_M = \boldsymbol{\Sigma}_s \otimes I_n \in \mathbb{R}^{nT \times nT}\), where \(\otimes\) denotes the Kronecker product. This matrix adds the appropriate noise variance \(\sigma_{s,t}^2\) to the diagonal of each output’s block in the training covariance. The training-test covariance block, \(\boldsymbol{\Sigma}_{12}^{MO}\), contains the covariances between all training and test observations:

\[
\boldsymbol{\Sigma}^{MO}_{12} =
\begin{bmatrix}
k_{11}(\mathbf{X}, \mathbf{X}_*) & \cdots & k_{1T}(\mathbf{X}, \mathbf{X}_*) \\
\vdots & \ddots & \vdots \\
k_{T1}(\mathbf{X}, \mathbf{X}_*) & \cdots & k_{TT}(\mathbf{X}, \mathbf{X}_*)
\end{bmatrix}.
\]
The remaining blocks are defined as \(\boldsymbol{\Sigma}_{21}^{MO} = (\boldsymbol{\Sigma}_{12}^{MO})^T\), and \(\boldsymbol{\Sigma}_{22}^{MO}\) has the same structure as \(\boldsymbol{\Sigma}_{11}^{MO}\) but is evaluated at the test points \(\mathbf{X}_*\). Note that \(\boldsymbol{\Sigma}_{22}^{MO}\) does not include observation noise, as we are predicting the latent function values rather than noisy observations. The posterior predictive distribution for the augmented test vector \(\mathbf{y}_*^{MO}\) is then given by:

\[
\mathbf{y}_*^{MO} \mid \mathbf{y}^{MO} \sim \mathcal{N}\left( \boldsymbol{\mu}_*, \boldsymbol{\Sigma}_* \right),
\]

with

\[
\boldsymbol{\mu}_* = \boldsymbol{\Sigma}^{MO}_{21} \left( \boldsymbol{\Sigma}^{MO}_{11} + \Sigma_M \right)^{-1} \mathbf{y}^{MO},
\qquad
\boldsymbol{\Sigma}_* = \boldsymbol{\Sigma}^{MO}_{22} - \boldsymbol{\Sigma}^{MO}_{21} \left( \boldsymbol{\Sigma}^{MO}_{11} + \Sigma_M \right)^{-1} \boldsymbol{\Sigma}^{MO}_{12}.
\]
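As a hedged illustration (the function name and the toy setup are ours, not from the text), this Gaussian conditioning step can be computed with standard Cholesky-based solves:

```python
import numpy as np

def mogp_posterior(Sigma11, Sigma12, Sigma22, Sigma_M, y):
    """Posterior mean/covariance of the latent test values given the
    stacked noisy training observations y (length nT)."""
    # Cholesky of the noisy training block for numerically stable solves
    L = np.linalg.cholesky(Sigma11 + Sigma_M)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Sigma12.T @ alpha     # Sigma21 (Sigma11 + Sigma_M)^{-1} y
    V = np.linalg.solve(L, Sigma12)
    cov = Sigma22 - V.T @ V      # Sigma22 - Sigma21 (Sigma11 + Sigma_M)^{-1} Sigma12
    return mean, cov
```

With near-zero noise and test points equal to the training points, the posterior mean interpolates the observations, which is a quick sanity check on any implementation of these formulas.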
Each element \(k_{tt'}(\mathbf{X}, \mathbf{X}')\) specifies the covariance between outputs \(f_t(\mathbf{X})\) and \(f_{t'}(\mathbf{X}')\), and the model can be trained by maximum likelihood estimation of the kernel hyperparameters. It has been observed that the performance of MOGPs depends strongly on the multi-output covariance structure [17], which should ensure that:
- \(\boldsymbol{\Sigma}^{MO}_{11}\) is a positive semidefinite (PSD) matrix;
- the correlations between outputs, and thus the transfer of information across outputs, are captured.
One such formulation for this covariance matrix, as denoted in [19], is given by:

\[
\boldsymbol{\Sigma}^{MO}_{11} = k^t \otimes k^{x},
\]
where \(k^t \in \mathbb{R}^{T \times T}\) represents the correlation across outputs and \(k^{x} \in \mathbb{R}^{n \times n}\) is typically a stationary covariance matrix over the inputs. Constructing a valid PSD matrix \(k^t\) is a central challenge in the MOGP formulation. One methodology involves parameterizing \(k^t\) as \(k^t = LL^T\), where:

\[
L =
\begin{bmatrix}
a_1 & 0 & \cdots & 0 \\
a_2 & a_3 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
a_{w-T+1} & a_{w-T+2} & \cdots & a_w
\end{bmatrix},
\]
where \(w = T(T + 1)/2\) denotes the number of free parameters in \(k^t\).
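A minimal sketch of this parameterization (the function name is ours), filling the \(w\) free parameters row-wise into the lower triangle of \(L\):

```python
import numpy as np

def task_covariance(a, T):
    """Assemble k^t = L L^T from the w = T(T+1)/2 free parameters a,
    filled row-wise into the lower triangle of L. The product L L^T
    is symmetric PSD by construction."""
    assert len(a) == T * (T + 1) // 2
    L = np.zeros((T, T))
    L[np.tril_indices(T)] = a  # tril_indices enumerates entries row by row
    return L @ L.T
```

The full training covariance then follows from the Kronecker structure above, e.g. `np.kron(task_covariance(a, T), Kx)` for a given input covariance matrix `Kx`.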
Similar to standard GPs, the model hyperparameters are determined by maximizing the log marginal likelihood (MLL):

\[
\log p\left( \mathbf{y}^{MO} \mid \mathbf{X}, \boldsymbol{\psi} \right)
= -\frac{1}{2} \left( \mathbf{y}^{MO} \right)^T \mathbf{K}^{-1} \mathbf{y}^{MO}
- \frac{1}{2} \log \left| \mathbf{K} \right|
- \frac{nT}{2} \log 2\pi,
\qquad
\mathbf{K} = \boldsymbol{\Sigma}^{MO}_{11} + \Sigma_M.
\]
In the MOGP formulation, the hyperparameter vector \(\boldsymbol{\psi}\) includes not only the parameters of the spatial covariance \(k^{x}\), but also the elements \(\{a_i\}\) of the lower triangular matrix \(L\) that parametrizes the task correlation matrix \(k^t = LL^T\), as well as the noise variances in \(\boldsymbol{\Sigma}_M\). All of these parameters are jointly optimized during maximum likelihood estimation.
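For concreteness, the MLL (written here as its negative, as minimized in practice) can be evaluated via a Cholesky factorization; a sketch under the notation above, with the function name ours:

```python
import numpy as np

def neg_log_marginal_likelihood(K_noisy, y):
    """Negative log marginal likelihood of the stacked observations y
    under N(0, K_noisy), where K_noisy = Sigma11 + Sigma_M."""
    N = y.shape[0]
    L = np.linalg.cholesky(K_noisy)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # 0.5 y^T K^{-1} y + 0.5 log|K| + (N/2) log 2 pi,
    # with 0.5 log|K| = sum(log diag(L))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * N * np.log(2 * np.pi)
```

This scalar would be minimized over \(\boldsymbol{\psi}\) (spatial kernel parameters, the \(\{a_i\}\), and the noise variances) with a generic optimizer such as `scipy.optimize.minimize`.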
A Note on DEGPs#
A gradient-enhanced GP (GEK) or derivative-enhanced GP (DEGP) can be interpreted as a structured MOGP, where the outputs correspond to the function value and its partial derivatives:

\[
\mathbf{f}(\mathbf{x}) = \left[ f(\mathbf{x}), \frac{\partial f(\mathbf{x})}{\partial x_1}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_d} \right]^T.
\]
In this setting, the cross-covariances \(k_{tt'}\) are not arbitrary but are derived by differentiation of a single latent covariance function \(k\):

\[
\operatorname{cov}\left( f(\mathbf{x}), \frac{\partial f(\mathbf{x}')}{\partial x'_j} \right) = \frac{\partial k(\mathbf{x}, \mathbf{x}')}{\partial x'_j},
\qquad
\operatorname{cov}\left( \frac{\partial f(\mathbf{x})}{\partial x_i}, \frac{\partial f(\mathbf{x}')}{\partial x'_j} \right) = \frac{\partial^2 k(\mathbf{x}, \mathbf{x}')}{\partial x_i \, \partial x'_j}.
\]
Thus, GEK/DEGP can be viewed as a special case of MOGP where the correlation between outputs is dictated by calculus rather than learned freely. This structured interpretation highlights that gradient information augments the GP model within the same multi-output framework, providing richer posterior inference without introducing additional latent processes.
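To make the calculus-dictated structure concrete, here is a 1-D sketch (ours, for illustration) with an RBF latent kernel, returning the three covariance blocks of the value/derivative MOGP:

```python
import numpy as np

def degp_kernel_blocks(x, xp, ell=1.0):
    """Covariance blocks of a 1-D gradient-enhanced GP with latent RBF
    kernel k(x, x') = exp(-(x - x')^2 / (2 ell^2))."""
    r = x - xp
    k = np.exp(-r**2 / (2 * ell**2))
    k_fg = (r / ell**2) * k                   # cov(f(x), f'(x')) = dk/dx'
    k_gg = (1 / ell**2 - r**2 / ell**4) * k   # cov(f'(x), f'(x')) = d^2k/dx dx'
    return k, k_fg, k_gg
```

The derivative identities can be checked against finite differences of \(k\), which is a useful test for any hand-derived GEK/DEGP kernel block.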
References#
A comparison of numerical optimizers in developing high dimensional surrogate models. In Volume 2B: 45th Design Automation Conference, International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, August 2019. URL: https://doi.org/10.1115/DETC2019-97499, doi:10.1115/DETC2019-97499.
Georges Matheron. Principles of geostatistics. Economic Geology, 58(8):1246–1266, 1963.
D. G. Krige. A statistical approach to some basic mine valuation problems on the Witwatersrand. OR, 4(1):18–18, 1953. URL: http://www.jstor.org/stable/3006914 (visited on 2025-02-20).
William J. Welch, Robert J. Buck, Jerome Sacks, Henry P. Wynn, and Toby J. Mitchell. Screening, predicting, and computer experiments. Technometrics, 34(1):15–25, 1992. URL: https://www.tandfonline.com/doi/abs/10.1080/00401706.1992.10485229, doi:10.1080/00401706.1992.10485229.
Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006. ISBN 9780262182539. URL: http://www.gaussianprocess.org/gpml/.
Gregoire Allaire and Sidi Mahmoud Kaber. Numerical Linear Algebra. Texts in applied mathematics. Springer, New York, NY, January 2008.
Weiyu Liu and Stephen Batill. Gradient-enhanced response surface approximations using kriging models. In 9th AIAA/ISSMO Symposium on Multidisciplinary Analysis and Optimization. Reston, Virginia, September 2002. American Institute of Aeronautics and Astronautics.
Wataru Yamazaki, Markus Rumpfkeil, and Dimitri Mavriplis. Design optimization utilizing gradient/hessian enhanced surrogate model. In 28th AIAA Applied Aerodynamics Conference. Reston, Virginia, June 2010. American Institute of Aeronautics and Astronautics.
Selvakumar Ulaganathan, Ivo Couckuyt, Tom Dhaene, Joris Degroote, and Eric Laermans. Performance study of gradient-enhanced kriging. Eng. Comput., 32(1):15–34, January 2016.
Alexander I.J. Forrester and Andy J. Keane. Recent advances in surrogate-based optimization. Progress in Aerospace Sciences, 45(1):50–79, 2009. URL: https://www.sciencedirect.com/science/article/pii/S0376042108000766, doi:10.1016/j.paerosci.2008.11.001.
Youwei He, Kuan Tan, Chunming Fu, and Jinliang Luo. An efficient gradient-enhanced kriging modeling method assisted by fast kriging for high-dimension problems. International Journal of Numerical Methods for Heat & Fluid Flow, 33(12):3967–3993, 2023.
Selvakumar Ulaganathan, Ivo Couckuyt, Tom Dhaene, Eric Laermans, and Joris Degroote. On the use of gradients in kriging surrogate models. In Proceedings of the Winter Simulation Conference 2014. IEEE, December 2014.
Liming Chen, Haobo Qiu, Liang Gao, Chen Jiang, and Zan Yang. A screening-based gradient-enhanced kriging modeling method for high-dimensional problems. Applied Mathematical Modelling, 69:15–31, 2019. URL: https://www.sciencedirect.com/science/article/pii/S0307904X18305900, doi:10.1016/j.apm.2018.11.048.
Zhong-Hua Han, Yu Zhang, Chen-Xing Song, and Ke-Shi Zhang. Weighted gradient-enhanced kriging for high-dimensional surrogate modeling and design optimization. AIAA Journal, 55(12):4330–4346, 2017. URL: https://doi.org/10.2514/1.J055842, doi:10.2514/1.J055842.
Yiming Yao, Fei Liu, and Qingfu Zhang. High-throughput multi-objective Bayesian optimization using gradients. In 2024 IEEE Congress on Evolutionary Computation (CEC), pages 1–8. IEEE, June 2024.
Misha Padidar, Xinran Zhu, Leo Huang, Jacob R Gardner, and David Bindel. Scaling Gaussian processes with derivative information using variational inference. Advances in Neural Information Processing Systems, 34:6442–6453, 2021. arXiv:2107.04061.
Haitao Liu, Jianfei Cai, and Yew-Soon Ong. Remarks on multi-output Gaussian process regression. Knowledge-Based Systems, 144:102–121, 2018. URL: https://www.sciencedirect.com/science/article/pii/S0950705117306123, doi:10.1016/j.knosys.2017.12.034.
B. Rakitsch, Christoph Lippert, K. Borgwardt, and Oliver Stegle. It is all in the noise: efficient multi-task Gaussian process inference with structured residuals. In Advances in Neural Information Processing Systems, January 2013.
Edwin V Bonilla, Kian Chai, and Christopher Williams. Multi-task Gaussian process prediction. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007. URL: https://proceedings.neurips.cc/paper_files/paper/2007/file/66368270ffd51418ec58bd793f2d9b1b-Paper.pdf.
Chen Zhou Xu, Zhong Hua Han, Ke Shi Zhang, and Wen Ping Song. Improved weighted gradient-enhanced kriging model for high-dimensional aerodynamic modeling problems. In 32nd Congress of the International Council of the Aeronautical Sciences, ICAS 2021. International Council of the Aeronautical Sciences, 2021.