Hyperparameter Optimization#
Overview#
JetGP uses maximum likelihood estimation (MLE) to optimize the hyperparameters of Gaussian Process models.
The optimization is performed after model initialization and before making predictions.
All GP modules share a common optimization interface through the optimize_hyperparameters() method.
JetGP supports multiple optimization algorithms, ranging from gradient-based methods to population-based metaheuristics, allowing users to choose the most appropriate approach for their problem.
Basic Usage#
After initializing a GP model (DEGP, DDEGP, GDDEGP, or WDEGP),
hyperparameters are optimized by calling the optimize_hyperparameters() method:
Example:
# Initialize the model
gp = degp(
X_train, y_train, n_order, n_bases=1, der_indices=der_indices,
normalize=True, kernel="SE", kernel_type="anisotropic"
)
# Optimize hyperparameters
params = gp.optimize_hyperparameters(
optimizer='cobyla',
n_restart_optimizer=10
)
The method returns the optimized hyperparameters, which are automatically stored within the model and used for subsequent predictions.
For model initialization details, see initialization.
What is Optimized#
The hyperparameters optimized depend on the kernel and kernel type selected during model initialization.
Common hyperparameters:
Signal variance (\(\sigma_f^2\)): Controls the overall magnitude of function variation. Larger values indicate greater function amplitude.
Length scale(s) (\(\ell\) or \(\ell_j\)): Controls how quickly correlation decays with distance. Smaller length scales allow the model to capture more rapid variation.
Isotropic kernels: Single \(\ell\) shared across all input dimensions
Anisotropic kernels: Separate \(\ell_j\) for each input dimension, allowing different correlation structures along different axes
Noise variance (\(\sigma_n^2\)): Models observation noise or measurement uncertainty. Set to a small value (or fixed) for noise-free simulations.
Kernel-specific hyperparameters:
Rational Quadratic (RQ): Shape parameter \(\alpha\) controlling the mixture of length scales
Matern: Smoothness parameter \(\nu\) (typically fixed via
smoothness_parameter)SineExp: Period parameters \(p_j\) for periodic structure
Number of hyperparameters:
Kernel |
Isotropic |
Anisotropic |
|---|---|---|
SE (RBF) |
3 (\(\sigma_f^2, \ell, \sigma_n^2\)) |
\(2 + d\) (\(\sigma_f^2, \ell_1, \ldots, \ell_d, \sigma_n^2\)) |
RQ |
4 (\(\sigma_f^2, \ell, \alpha, \sigma_n^2\)) |
\(3 + d\) |
Matern |
3 (ν typically fixed) |
\(2 + d\) |
SineExp |
\(3 + d\) (includes periods) |
\(2 + 2d\) |
where \(d\) is the input dimension (n_bases).
Note
For WDEGP models, hyperparameters are shared across all submodels. The optimization maximizes the combined log marginal likelihood, ensuring consistent covariance structure across the weighted ensemble.
Common Arguments#
optimizer Specifies which optimization algorithm to use for hyperparameter tuning. Available options:
"pso"– Particle Swarm Optimization"jade"– JADE (Adaptive Differential Evolution)"lbfgs"– Limited-memory BFGS (gradient-based)"powell"– Powell’s derivative-free method"cobyla"– Constrained Optimization BY Linear Approximation
n_restart_optimizer Number of random restarts for the optimization procedure. Multiple restarts help avoid local optima by initializing the search from different starting points.
For gradient-based and derivative-free methods (
lbfgs,powell,cobyla): Each restart begins from a randomly sampled initial hyperparameter configuration (unlessx0is provided for the first restart). The best result across all restarts is returned.For population-based methods (
pso,jade): This parameter is ignored. These methods perform a single optimization run using their population-based search strategy, which inherently explores multiple regions of the search space.
Default: 10
x0 (optional)
Initial guess for hyperparameters, used only for restart-based methods (lbfgs, powell, cobyla).
If provided, the first optimization run starts from this point, while subsequent restarts use random initialization.
Population-based methods (pso, jade) use initial_positions instead.
debug
If True, prints intermediate optimization results during execution.
Useful for monitoring convergence and diagnosing optimization issues.
Default: False
Available Optimizers#
1. pso (Particle Swarm Optimization)
A population-based metaheuristic that simulates social behavior of particles searching for optima. Each particle represents a candidate solution that moves through the search space, influenced by its own best position and the global best position found by the swarm.
Unlike restart-based methods, PSO performs a single optimization run without restarts, as the population naturally explores multiple regions simultaneously.
Recommended for: - High-dimensional hyperparameter spaces - Non-smooth likelihood surfaces - Initial exploration when gradient information is unreliable - Multi-modal optimization landscapes
Additional keyword arguments:
pop_size– Number of particles in the swarm (default:20) Larger populations improve exploration but increase computational costn_generations– Maximum number of generations/iterations (default:50) More generations allow better convergence to global optimumlocal_opt_every– Frequency of local gradient-based refinement (default:15) Every N generations, applies a gradient-based optimization (L-BFGS-B) to the current global best position. This hybrid approach combines global exploration from PSO with local exploitation from gradients, significantly improving convergence speed and final solution quality. Set toNoneor a large number to disable local optimization.initial_positions– Optional array of initial particle positions (default:None) Shape:(pop_size, n_dims). If provided, initializes swarm at specified locations instead of random samplingomega– Inertia weight controlling particle momentum (default:0.5) Higher values (0.7-0.9) encourage exploration; lower values (0.4-0.6) encourage exploitationphip– Cognitive acceleration coefficient (default:0.5) Controls attraction to particle’s personal best position (typical range: 0.5-2.5)phig– Social acceleration coefficient (default:0.5) Controls attraction to global best position (typical range: 0.5-2.5)seed– Random seed for reproducibility (default:42)
Note: The n_restart_optimizer parameter is ignored for PSO.
Example:
params = gp.optimize_hyperparameters(
optimizer='pso',
pop_size=30,
n_generations=100,
local_opt_every=20, # Apply gradient refinement every 20 generations
omega=0.7,
phip=1.5,
phig=1.5,
seed=123,
debug=True
)
2. jade (Adaptive Differential Evolution)
An evolutionary algorithm based on the JADE (Adaptive Differential Evolution with Optional External Archive) variant. It adaptively adjusts mutation and crossover parameters during optimization and maintains an archive of recently explored solutions to improve diversity.
Like PSO, JADE performs a single optimization run without restarts, as the population-based search inherently explores multiple regions of the hyperparameter space.
Recommended for: - Rugged likelihood surfaces with many local optima - Problems where gradient-based methods fail to converge - Robust global search requirements - Complex, non-linear optimization landscapes
Additional keyword arguments:
pop_size– Number of individuals in the population (default:20) Larger populations improve exploration but increase computational costn_generations– Maximum number of generations (default:50) More generations allow better convergence to global optimump– Proportion of best individuals for mutation (default:0.1) Controls the greediness of the mutation strategy (typical range: 0.05-0.2). Lower values make the search more exploitative; higher values increase exploration.c– Learning rate for parameter adaptation (default:0.1) Controls how quickly F and CR parameters adapt (typical range: 0.05-0.2). Higher values lead to faster adaptation but may be less stable.local_opt_every– Frequency of local gradient-based refinement (default:15) Every N generations, applies a gradient-based optimization (L-BFGS-B) to the current best individual. This hybrid strategy combines JADE’s robust global search with gradient-based local refinement, improving both convergence speed and solution quality. Set toNoneor a large number to disable.initial_positions– Optional array of initial population positions (default:None) Shape:(pop_size, n_dims). If provided, initializes population at specified locationsseed– Random seed for reproducibility (default:42)
Note: The n_restart_optimizer parameter is ignored for JADE.
Example:
params = gp.optimize_hyperparameters(
optimizer='jade',
pop_size=40,
n_generations=150,
p=0.15,
c=0.1,
local_opt_every=25, # Apply gradient refinement every 25 generations
seed=456,
debug=True
)
3. lbfgs (Limited-memory BFGS)
A gradient-based quasi-Newton method that approximates the inverse Hessian matrix using limited memory. Extremely efficient for smooth objective functions with reliable gradients.
This optimizer uses a multi-restart strategy to explore different regions of the hyperparameter space and avoid local optima.
Recommended for: - Low to moderate dimensional problems (typically < 20 hyperparameters) - Smooth likelihood surfaces - Fast convergence when good initial guesses are available - Production use when gradients are reliable
Additional keyword arguments:
x0– Initial guess for first restart (default:None) If not provided, uses random initialization within boundsn_restart_optimizer– Number of random restarts (default:10) More restarts improve robustness against local optimamaxiter– Maximum number of iterations per restart (default:100)ftol– Function tolerance for convergence (default:1e-8) Optimization stops when(f^k - f^{k+1})/max{|f^k|,|f^{k+1}|,1} <= ftolgtol– Gradient tolerance for convergence (default:1e-8) Optimization stops whenmax{|proj g_i|} <= gtoldisp– Display convergence messages (default:False)debug– Print intermediate results (default:False)
Example:
params = gp.optimize_hyperparameters(
optimizer='lbfgs',
n_restart_optimizer=15,
maxiter=200,
ftol=1e-9,
gtol=1e-9,
debug=True
)
4. powell (Powell’s Method)
A derivative-free optimization algorithm that performs sequential line searches along conjugate directions. Does not require gradient information but can still achieve fast convergence on smooth problems.
This optimizer uses a multi-restart strategy to explore different initial conditions and improve robustness.
Recommended for: - Problems where gradient computation is expensive or unavailable - Moderate-dimensional spaces (typically < 15 hyperparameters) - Situations where analytical derivatives are unreliable - Smooth but noisy objective functions
Additional keyword arguments:
x0– Initial guess for first restart (default:None)n_restart_optimizer– Number of random restarts (default:10) More restarts improve robustness against local optimamaxiter– Maximum number of iterations per restart (default: SciPy default)xtol– Relative tolerance for parameter convergence (default: SciPy default) Absolute error in solution acceptable for convergenceftol– Relative tolerance for function convergence (default: SciPy default) Relative error in objective function acceptable for convergencedisp– Display convergence messages (default:False)debug– Print intermediate results (default:False)
Example:
params = gp.optimize_hyperparameters(
optimizer='powell',
n_restart_optimizer=20,
maxiter=500,
xtol=1e-6,
ftol=1e-6,
debug=True
)
5. cobyla (Constrained Optimization BY Linear Approximation)
A derivative-free method designed for constrained optimization problems. Constructs linear approximations of the objective function and handles bounds through inequality constraints.
This optimizer uses a multi-restart strategy to explore different starting points and avoid local optima.
Recommended for: - Problems with bound constraints on hyperparameters - When gradients are unavailable or unreliable - Moderate-dimensional optimization (typically < 15 hyperparameters) - Noisy or discontinuous objective functions
Additional keyword arguments:
x0– Initial guess for first restart (default:None)n_restart_optimizer– Number of random restarts (default:10) More restarts improve robustness against local optimamaxiter– Maximum number of function evaluations per restart (default: SciPy default)rhobeg– Initial trust region radius (default: SciPy default, typically 1.0) Controls the initial step sizecatol– Absolute tolerance for constraint violations (default: SciPy default)f_target– Target function value; stops if reached (default: SciPy default)disp– Display convergence messages (default:False)debug– Print intermediate results (default:False)
Example:
params = gp.optimize_hyperparameters(
optimizer='cobyla',
n_restart_optimizer=20,
maxiter=500,
rhobeg=0.5,
catol=1e-6,
debug=True
)
Diagnosing Optimization Quality#
Successful hyperparameter optimization is critical for accurate GP predictions. Here are signs to look for when evaluating optimization results.
Signs of successful optimization:
Consistent log-likelihood values across multiple restarts (for restart-based methods)
Log-likelihood stabilizes before reaching maximum iterations (for population-based methods)
Predictions show reasonable uncertainty (not too wide or too narrow)
Length scales are within a reasonable range relative to the data spread
Hyperparameters do not hit bounds
Signs of potential issues:
Very different log-likelihood values across restarts (indicates multiple local optima)
Hyperparameters frequently hitting bounds (may need to adjust bound settings)
Very large length scales (\(\ell \gg\) data range): Model is underfitting, treating data as nearly constant
Very small length scales (\(\ell \ll\) typical point spacing): Model may be overfitting or capturing noise
Very small noise variance with noisy data: Model may be interpolating noise
Optimization fails to converge within iteration limits
Troubleshooting strategies:
Multiple local optima: Increase
n_restart_optimizerfor restart-based methods, or switch to population-based methods (pso,jade)Slow convergence: For population-based methods, increase
n_generationsorpop_size. Enable hybrid refinement withlocal_opt_every=15-25Hitting bounds: Check if data is properly normalized (
normalize=True). Consider if the kernel choice is appropriate for the problemUnreasonable length scales: Verify data scaling and consider using
normalize=True. Check if the number of training points is sufficientPoor predictions despite convergence: Try a different kernel or kernel type. Consider whether derivative information is helping or introducing numerical issues
Notes#
All optimizers maximize the log marginal likelihood (or equivalently minimize its negative).
Hyperparameter bounds are typically set automatically based on the data scale and problem characteristics.
The optimization is performed in the normalized space when
normalize=True, with hyperparameters automatically transformed back to the original scale.Gradient-based methods (
lbfgs): - Use gradients for efficient convergence on smooth surfaces - Use multiple restarts controlled byn_restart_optimizer- First restart usesx0if provided; subsequent restarts use random initializationDerivative-free methods (
powell,cobyla): - Do not require gradient information - Use multiple restarts controlled byn_restart_optimizer- First restart usesx0if provided; subsequent restarts use random initializationPopulation-based methods (
pso,jade): - Perform a single run;n_restart_optimizeris ignored - Useinitial_positionsinstead ofx0to specify starting population - Thelocal_opt_everyparameter enables hybrid optimization by periodically applying gradient-based refinement
Summary#
JetGP provides a flexible hyperparameter optimization framework supporting both gradient-based and gradient-free methods. The unified interface allows easy experimentation with different optimizers while maintaining consistent model behavior.
Quick reference:
Fastest, smooth problems:
lbfgswith 10-20 restartsNo gradients, moderate complexity:
powellorcobylawith 15-25 restartsMulti-modal, global search:
psoorjadewith hybrid refinement (local_opt_every=15-25)Highly non-smooth problems:
psoorjadewithout gradient refinement (local_opt_every=None)Default recommendation: Start with
lbfgs(n_restart_optimizer=10-20); fall back topsowithlocal_opt_every=15if needed
Key distinctions:
Gradient-based methods (
lbfgs) use gradients for efficient local convergenceDerivative-free methods (
powell,cobyla) work without gradient informationBoth gradient-based and derivative-free methods use multiple restarts to explore different starting points
Population-based methods (
pso,jade) use single runs with inherent parallel explorationThe
local_opt_everyparameter in PSO and JADE enables hybrid optimization, combining global exploration with gradient-based local refinement for improved performance
For most applications, starting with lbfgs and falling back to hybrid pso or jade
for difficult cases provides a robust and efficient optimization strategy.
For making predictions with optimized parameters, see predictions.