Hyperparameter Optimization

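Hyperparameter optimization searches for the configuration settings of a learning algorithm that yield the best generalization. Because these settings are fixed before training rather than fitted to the training data, they have to be chosen by a separate search around the training procedure.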

Definition

Hyperparameter optimization is the process of selecting values for a model's hyperparameters (configuration settings fixed before training rather than learned from the data) by evaluating candidate settings on held-out validation data and choosing the configuration that generalizes best.
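The selection loop in this definition can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not a reference implementation: the model is a one-dimensional ridge regression with a closed-form fit, the data are synthetic, and the names `fit_ridge_1d`, `select_lambda`, and the candidate list are hypothetical. The weight `w` is the learned parameter; the regularization strength `lam` is the hyperparameter chosen on validation data.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit for y ~ w * x. The weight w is the learned parameter."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def mse(w, xs, ys):
    """Mean squared error of the fitted weight on a dataset."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def select_lambda(train, val, candidates):
    """Fit once per candidate on the training split, score on the validation split."""
    scores = {}
    for lam in candidates:
        w = fit_ridge_1d(train[0], train[1], lam)
        scores[lam] = mse(w, val[0], val[1])
    best = min(scores, key=scores.get)
    return best, fit_ridge_1d(train[0], train[1], best), scores

# Synthetic data (an assumption for illustration): y = 2x plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0.0, 0.3) for x in xs]
train = (xs[:30], ys[:30])   # used to learn w
val = (xs[30:], ys[30:])     # used only to choose lam

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam, best_w, scores = select_lambda(train, val, candidates)
print("chosen lam:", best_lam, "fitted w:", round(best_w, 3))
```

A final performance estimate would come from a third, untouched test split, which this sketch omits.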

Scope

This topic covers methods for tuning the settings that govern a learning algorithm, such as learning rate, regularization strength, and architecture choices: grid search, random search, Bayesian optimization with surrogate models, and successive-halving and bandit-based approaches. It addresses why hyperparameters must be chosen on validation data and how search cost is managed.
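As one way to picture how search cost is managed, the sketch below illustrates the successive-halving idea: score every configuration on a small budget, keep the better fraction, and repeat with a larger budget. It is a simplified illustration rather than any library's implementation. The function `toy_validation_loss` is a hypothetical stand-in for training a model for `budget` epochs, and the halving factor `eta` and the log-uniform learning-rate range are assumptions.

```python
import math
import random

def toy_validation_loss(lr, budget):
    """Hypothetical stand-in for training with learning rate lr for `budget` epochs.
    The loss shrinks as the budget grows and is lowest near lr = 0.1."""
    gap = (math.log10(lr) + 1.0) ** 2
    return gap + 1.0 / budget

def successive_halving(configs, min_budget=1, eta=2):
    """Repeatedly keep the best 1/eta of the configurations and multiply the budget by eta."""
    budget = min_budget
    survivors = list(configs)
    history = []  # (budget, number of configurations evaluated) per round
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda lr: toy_validation_loss(lr, budget))
        history.append((budget, len(survivors)))
        keep = max(1, len(survivors) // eta)
        survivors = ranked[:keep]
        budget *= eta
    return survivors[0], history

# Sample 16 learning rates log-uniformly from [1e-4, 1] (an assumed range).
rng = random.Random(0)
configs = [10 ** rng.uniform(-4.0, 0.0) for _ in range(16)]
best_lr, history = successive_halving(configs)
print("best learning rate:", best_lr)
print("rounds (budget, configs):", history)
```

In this toy setting the ranking does not depend on the budget, so early rounds never discard the eventual winner. With real training curves, low-budget scores are noisy, and that risk is the price of the reduced cost.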

Core questions

What distinguishes hyperparameters from model parameters?
How do grid and random search differ in efficiency?
How does Bayesian optimization use past evaluations to guide the search?
Why must hyperparameters be tuned on validation rather than test data?

Key theories

Grid and random search: Grid search evaluates all combinations on a predefined grid, while random search samples configurations at random and is often more efficient when only a few hyperparameters strongly affect performance.
Bayesian optimization: Bayesian optimization fits a probabilistic surrogate model of performance as a function of hyperparameters and uses it to choose promising configurations to evaluate next, reducing the number of expensive trials.
Validation-based selection: Because hyperparameters control complexity and fit, they must be chosen using validation data separate from the final test set to avoid optimistic performance estimates.
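The Bayesian optimization entry above can be made concrete with a small sketch. The code below is a minimal illustration using only the Python standard library, under several assumptions: a one-dimensional search over the base-10 logarithm of the learning rate, a hand-rolled Gaussian-process surrogate with a fixed squared-exponential kernel, expected improvement as the acquisition rule, and a hypothetical objective `toy_loss` standing in for an expensive train-and-validate run. Practical work would normally rely on an established library instead.

```python
import math

def toy_loss(log_lr):
    """Hypothetical validation loss as a function of log10(learning rate)."""
    return 0.3 * (log_lr + 1.5) ** 2 + 0.1 * math.sin(3.0 * log_lr)

def rbf(a, b, length=1.0):
    """Squared-exponential kernel; the length scale is a fixed assumption."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_posterior(X, y, x_new, jitter=1e-4):
    """Posterior mean and standard deviation of the surrogate at x_new."""
    n = len(X)
    K = [[rbf(X[i], X[j]) + (jitter if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    k = [rbf(xi, x_new) for xi in X]
    y_mean = sum(y) / n
    alpha = solve(K, [yi - y_mean for yi in y])
    mu = y_mean + sum(ki * ai for ki, ai in zip(k, alpha))
    v = solve(K, k)
    var = max(1.0 - sum(ki * vi for ki, vi in zip(k, v)), 1e-12)
    return mu, math.sqrt(var)

def expected_improvement(mu, sigma, best):
    """Expected improvement over the best observed loss (minimization)."""
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mu) * cdf + sigma * pdf

# Candidate values of log10(lr) on [-4, 0]; three are evaluated up front.
candidates = [-4.0 + 0.05 * i for i in range(81)]
tried = [0, 40, 80]
X = [candidates[i] for i in tried]
y = [toy_loss(x) for x in X]
initial_best = min(y)

for _ in range(7):  # seven further evaluations, ten in total
    best = min(y)

    def score(i):
        mu, sigma = gp_posterior(X, y, candidates[i])
        return expected_improvement(mu, sigma, best)

    # Evaluate next the untried candidate with the highest expected improvement.
    nxt = max((i for i in range(len(candidates)) if i not in tried), key=score)
    tried.append(nxt)
    X.append(candidates[nxt])
    y.append(toy_loss(candidates[nxt]))

best_log_lr = X[y.index(min(y))]
print("best log10(lr):", best_log_lr, "loss:", min(y))
```

Each new evaluation is chosen from the surrogate fitted to all earlier results, which is what separates this loop from grid or random search, where trial locations do not depend on past outcomes.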

Practical relevance

Hyperparameter choices can change a model from useless to state of the art, so systematic tuning is essential, and automated methods make it tractable for models that are expensive to train. Tuning also has to be done correctly: the search must use validation data kept separate from the test set and must account honestly for how many configurations were tried, because otherwise the reported final performance is overstated.

History

Grid search was long the default for tuning, but Bergstra and Bengio showed in 2012 that random search is often more efficient. Bayesian optimization and bandit-based methods such as successive halving subsequently advanced automated tuning, and hyperparameter optimization became a core part of automated machine learning.

Key figures

James Bergstra
Yoshua Bengio
Trevor Hastie

Seminal works

hastie2009
goodfellow2016
bergstra2012

Frequently asked questions

What is the difference between a parameter and a hyperparameter?: Parameters, such as the weights of a model, are learned from the training data. Hyperparameters, such as the learning rate or regularization strength, are set before training and control how learning proceeds; they are chosen by searching over candidate values and evaluating on validation data.
Why is random search often better than grid search?: When only a few hyperparameters strongly affect performance, grid search wastes many trials varying the unimportant ones. Random sampling explores the important dimensions more thoroughly for the same number of trials, so it tends to find good settings faster.
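The answer to the second question can be illustrated with a small counting experiment. The sketch below assumes two hyperparameters scaled to the interval [0, 1], of which only the first meaningfully affects a hypothetical `toy_loss`; the target value 0.37 and the 3 by 3 grid are arbitrary choices made for illustration.

```python
import itertools
import random

def toy_loss(important, unimportant):
    """Hypothetical loss dominated by the first hyperparameter."""
    return (important - 0.37) ** 2 + 0.001 * unimportant

def distinct_important(trials):
    """Number of distinct values tried for the important hyperparameter."""
    return len({a for a, _ in trials})

# Grid search: 3 values per axis gives 9 trials.
grid_axis = [0.0, 0.5, 1.0]
grid_trials = list(itertools.product(grid_axis, grid_axis))

# Random search: the same number of trials, sampled uniformly.
rng = random.Random(0)
random_trials = [(rng.random(), rng.random()) for _ in range(len(grid_trials))]

for name, trials in (("grid", grid_trials), ("random", random_trials)):
    best = min(toy_loss(a, b) for a, b in trials)
    print(name, "distinct important values:", distinct_important(trials),
          "best loss:", round(best, 4))
```

With nine trials each, the grid tests only three distinct values of the hyperparameter that matters, while random sampling tests nine. Whether random search also reaches a lower loss in a given run depends on the draw.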