Multi-Armed Bandit (UCB, Thompson Sampling)

Also known as: MAB, bandit algorithm, Çok Kollu Bandit (Turkish). Common algorithms: UCB1, Thompson Sampling, epsilon-greedy.

The multi-armed bandit (MAB) is an adaptive experimental framework that allocates trials sequentially across competing arms to minimise cumulative regret while learning which arm performs best. Formalised by Robbins in 1952 and given finite-time guarantees by Auer et al. (2002), it balances exploration of uncertain options against exploitation of the option that currently looks best. It tends to outperform classical A/B testing when early stopping or cost-sensitive allocation matters.

When to use it

Use a multi-armed bandit when you need to learn which of K arms (variants, treatments, content items) yields the highest reward while minimising losses during the learning phase. The reward distributions should be stationary or only slowly drifting; non-stationary settings call for discounted or sliding-window variants. Both binary and continuous reward signals work. Thompson Sampling requires a prior distribution that roughly reflects reality; UCB1 is prior-free but assumes bounded rewards (conventionally scaled to [0, 1]). As a rough rule of thumb, about 50 observations across arms are needed before the running estimates stabilise.
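The sliding-window idea mentioned above can be sketched as a per-arm tracker that forgets old rewards. This is a minimal illustration, not a full non-stationary bandit: the class name SlidingWindowArm and the default window of 100 are assumptions made here, and the selection rule that would consume these estimates is left out.

```python
from collections import deque


class SlidingWindowArm:
    """Tracks the mean reward of one arm over its most recent `window` pulls."""

    def __init__(self, window: int = 100):
        # `window` is an illustrative default, not a recommended setting.
        self.rewards = deque(maxlen=window)

    def update(self, reward: float) -> None:
        # Appending to a full deque silently discards the oldest reward.
        self.rewards.append(reward)

    @property
    def pulls(self) -> int:
        # Number of rewards currently inside the window.
        return len(self.rewards)

    @property
    def mean(self) -> float:
        # Report 0.0 until the arm has been pulled at least once.
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0
```

A shorter window reacts faster to drift but gives noisier estimates, so the window length is a tuning choice that depends on how quickly the arm means are expected to move.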

Strengths & limitations

Strengths

Minimises opportunity cost: the algorithm continuously shifts traffic toward better-performing arms rather than splitting equally until the end.
Principled regret bounds: UCB1 has finite-time regret bounds, proven by Auer et al. (2002), that grow only logarithmically in the number of rounds when the arm means are separated.
Bayesian updating with Thompson Sampling provides natural uncertainty quantification and integrates prior knowledge.
Scales to many arms and extends naturally to contextual settings where user features inform arm selection.

Limitations

Assumes reward stationarity; performance degrades when arm means drift rapidly over time.
Thompson Sampling requires specifying a prior whose misspecification can slow convergence.
Harder to communicate to non-technical stakeholders than a standard A/B test with a fixed sample size.
Regret bounds describe growth rates rather than small-sample performance; with very few trials per arm the exploration bonus dominates and UCB1 may over-explore.

Frequently asked

How does a bandit differ from a standard A/B test?

A standard A/B test fixes the traffic split in advance (usually equal) and waits until a predetermined sample is collected before declaring a winner, ignoring the losses incurred on the inferior arm during the trial. A bandit adapts allocation in real time, directing more traffic to better-performing arms as evidence accumulates. This reduces expected opportunity cost, at the price of more complex inference.
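A small simulation can make the difference in opportunity cost concrete. The sketch below compares a fixed equal split with epsilon-greedy, chosen here only because it is the simplest adaptive rule; it is not UCB1 or Thompson Sampling. The function name simulate, the conversion rates, and epsilon = 0.1 are all illustrative assumptions, and regret is computed from the true rates (pseudo-regret), which is only possible in a simulation.

```python
import random


def simulate(true_rates, rounds, epsilon=None, seed=0):
    """Return (pulls per arm, pseudo-regret) for simulated Bernoulli arms.

    epsilon=None  -> fixed equal split (round-robin), as in a classical A/B test.
    epsilon=0.1   -> epsilon-greedy: explore at random 10% of the time,
                     otherwise play the arm with the best observed mean.
    """
    rng = random.Random(seed)
    k = len(true_rates)
    pulls, wins = [0] * k, [0] * k
    for t in range(rounds):
        if epsilon is None:
            arm = t % k
        elif rng.random() < epsilon or 0 in pulls:
            # Explore: also used until every arm has at least one pull.
            arm = rng.randrange(k)
        else:
            # Exploit: the arm with the highest observed success rate.
            arm = max(range(k), key=lambda i: wins[i] / pulls[i])
        pulls[arm] += 1
        wins[arm] += rng.random() < true_rates[arm]
    best = max(true_rates)
    # Pseudo-regret: expected reward lost on every pull of a non-best arm.
    regret = sum(n * (best - p) for n, p in zip(pulls, true_rates))
    return pulls, regret


if __name__ == "__main__":
    print("equal split   :", simulate([0.2, 0.5], 5000))
    print("epsilon-greedy:", simulate([0.2, 0.5], 5000, epsilon=0.1))
```

With rates 0.2 and 0.5 over 5,000 rounds, the equal split pulls each arm 2,500 times, for a pseudo-regret of 2,500 × 0.3 = 750. The adaptive rule sends most traffic to the better arm once its estimate pulls ahead, so its regret is far lower in this setup.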

UCB1 or Thompson Sampling — which should I choose?

UCB1 is deterministic and prior-free, making it easier to audit and explain. Thompson Sampling is randomised, often converges faster in practice, and integrates naturally with Bayesian analysis. If you have reasonable prior information (e.g. historical click rates), Thompson Sampling is generally preferred; if you want transparent exploration rules and no prior assumptions, UCB1 is a safe default.
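The two selection rules can be sketched side by side for Bernoulli (click / no-click) rewards. This is a minimal illustration under stated assumptions: the function names, the uniform Beta(1, 1) prior for Thompson Sampling, and the small run helper are choices made here, not part of either source paper's presentation.

```python
import math
import random


def ucb1_select(counts, sums, t):
    """Return the arm chosen by UCB1 at round t (1-based).

    counts[i] is the number of pulls of arm i and sums[i] its total reward;
    rewards are assumed to lie in [0, 1].
    """
    # Play every arm once before the index is defined.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    # Index = empirical mean + exploration bonus sqrt(2 ln t / n_i).
    scores = [s / n + math.sqrt(2.0 * math.log(t) / n)
              for s, n in zip(sums, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def thompson_select(successes, failures, rng):
    """Return the arm chosen by Beta-Bernoulli Thompson Sampling.

    Each arm's posterior is Beta(1 + successes, 1 + failures), which
    corresponds to a uniform Beta(1, 1) prior (an assumption of this sketch).
    """
    # Draw one plausible success rate per arm and play the largest draw.
    draws = [rng.betavariate(1 + s, 1 + f)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)


def run(policy, true_rates, rounds, seed=0):
    """Simulate Bernoulli arms under one policy; return pulls per arm."""
    rng = random.Random(seed)
    k = len(true_rates)
    counts, wins = [0] * k, [0] * k
    for t in range(1, rounds + 1):
        if policy == "ucb1":
            arm = ucb1_select(counts, wins, t)
        else:
            losses = [n - w for n, w in zip(counts, wins)]
            arm = thompson_select(wins, losses, rng)
        reward = 1 if rng.random() < true_rates[arm] else 0
        counts[arm] += 1
        wins[arm] += reward
    return counts


if __name__ == "__main__":
    for policy in ("ucb1", "thompson"):
        print(policy, run(policy, [0.2, 0.8], 2000))
```

Note that ucb1_select needs no random number generator: given the same history it always picks the same arm, which is what makes it easy to audit. Historical data would enter thompson_select by replacing the added 1s with informative prior counts.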

What is 'regret' and why does it matter?

Regret is the expected total reward lost by not always selecting the best arm. It quantifies the cost of exploration. In the worst case over reward distributions, UCB1 guarantees that cumulative regret over T rounds with K arms grows no faster than O(√(KT ln T)), so the average per-round loss shrinks to zero as the experiment runs longer. Minimising regret is the right objective when every trial has a real cost, unlike a fixed-sample test where that cost is treated as fixed.
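For readers who want the quantities behind these statements, regret and the UCB1 index can be written out as follows. The notation (arm means μ_i, gaps Δ_i, pull counts n_i) is introduced here for illustration; the gap-dependent bound is the one given as Theorem 1 in Auer et al. (2002) for rewards in [0, 1].

```latex
% Expected cumulative regret after T rounds.
% \mu_i: mean reward of arm i, \mu^* = \max_i \mu_i, A_t: arm played at round t,
% n_i(T): number of pulls of arm i up to round T.
R_T = T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} \mu_{A_t}\right]
    = \sum_{i=1}^{K} \Delta_i \, \mathbb{E}\left[n_i(T)\right],
\qquad \Delta_i = \mu^* - \mu_i

% UCB1 index: empirical mean \bar{x}_i plus an exploration bonus
% that shrinks as arm i accumulates n_i pulls.
A_t = \arg\max_{1 \le i \le K} \left( \bar{x}_i + \sqrt{\frac{2 \ln t}{n_i}} \right)

% Gap-dependent finite-time bound for UCB1.
R_T \le 8 \sum_{i:\,\Delta_i > 0} \frac{\ln T}{\Delta_i}
      + \left(1 + \frac{\pi^2}{3}\right) \sum_{i=1}^{K} \Delta_i
```

The logarithmic bound applies when the gaps Δ_i are fixed; the O(√(KT ln T)) rate is what remains when the gaps are allowed to be arbitrarily small.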

Can I use a bandit when I have more than two arms?

Yes. Bandit algorithms are designed for K ≥ 2 arms and scale naturally. Running K separate pairwise A/B tests would inflate the familywise error rate and increase total opportunity cost. A bandit sidesteps the multiplicity problem because its objective is regret minimisation rather than a set of significance tests, although it does not by itself control error rates if formal inference is needed afterwards.

Sources

Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-Time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47(2–3), 235–256. DOI: 10.1023/A:1013689704352
Russo, D., Van Roy, B., Kazerouni, A., Osband, I., & Wen, Z. (2018). A Tutorial on Thompson Sampling. Foundations and Trends in Machine Learning, 11(1), 1–96. DOI: 10.1561/2200000070

How to cite this page

ScholarGate. (2026, June 1). Multi-Armed Bandit (UCB, Thompson Sampling). ScholarGate. https://scholargate.app/en/experimental-design/multiarm-bandit

Which method?

Closely related methods to compare against:

A/B Test (Experimental design)
Adaptive Clinical Trial Design (Experimental design)
Randomized Controlled Trial (Experimental design)
Sequential Design (Experimental design)

Related reference concepts

Reinforcement Learning
Hyperparameter Optimization
Stochastic Optimization
Markov Decision Processes
Sequential Decision Making (MDPs)
Model Evaluation and Selection
