
Multi-Armed Bandit (UCB, Thompson Sampling)

The multi-armed bandit (MAB) is an adaptive experimental framework that allocates trials sequentially across competing arms to minimise cumulative regret while simultaneously learning which arm performs best. It was formalised by Robbins in 1952 and given finite-time guarantees by Auer et al. (2002). The framework balances exploration of uncertain arms against exploitation of the arm that currently looks best, which can make it preferable to classical A/B testing when early stopping or cost-sensitive allocation matters.
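
The exploration and exploitation trade-off described above can be illustrated with a minimal Python sketch of the two policies named in the title. It is not taken from the sources and rests on several assumptions: rewards are Bernoulli (0 or 1), Thompson Sampling uses a uniform Beta(1, 1) prior on each arm, and the UCB index is the UCB1 form (empirical mean plus sqrt(2 ln n / n_j)). The function names `ucb1_select`, `thompson_select`, and `run_bandit`, and the simulated success rates, are hypothetical and chosen only for illustration.

```python
import math
import random


def ucb1_select(counts, sums):
    """Choose an arm by the UCB1 index: empirical mean + sqrt(2 ln n / n_j)."""
    # Play every arm once first; the index is undefined for an unplayed arm.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    total = sum(counts)  # n: total number of plays so far
    scores = [
        sums[a] / counts[a] + math.sqrt(2.0 * math.log(total) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def thompson_select(counts, sums, rng):
    """Choose an arm by sampling each Beta(1 + successes, 1 + failures) posterior."""
    draws = [
        rng.betavariate(1 + sums[a], 1 + counts[a] - sums[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=draws.__getitem__)


def run_bandit(true_rates, horizon, policy, seed=0):
    """Simulate Bernoulli arms (an assumption) and return (pull counts, regret)."""
    rng = random.Random(seed)
    k = len(true_rates)
    counts = [0] * k  # number of pulls per arm
    sums = [0] * k    # number of successes per arm
    best = max(true_rates)
    regret = 0.0      # cumulative expected (pseudo-)regret
    for _ in range(horizon):
        if policy == "ucb1":
            arm = ucb1_select(counts, sums)
        elif policy == "thompson":
            arm = thompson_select(counts, sums, rng)
        else:
            raise ValueError(f"unknown policy: {policy!r}")
        reward = 1 if rng.random() < true_rates[arm] else 0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - true_rates[arm]
    return counts, regret


if __name__ == "__main__":
    # Hypothetical success rates, used only to exercise the sketch.
    for policy in ("ucb1", "thompson"):
        pulls, regret = run_bandit([0.1, 0.9], horizon=5000, policy=policy)
        print(policy, pulls, round(regret, 1))
```

Both policies should concentrate pulls on the better arm as evidence accumulates. The regret tracked here is the expected shortfall relative to always playing the best arm, which is computable only in simulation because it needs the true rates.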


Sources

  1. Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-Time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47(2–3), 235–256. DOI: 10.1023/A:1013689704352
  2. Russo, D., Van Roy, B., Kazerouni, A., Osband, I., & Wen, Z. (2018). A Tutorial on Thompson Sampling. Foundations and Trends in Machine Learning, 11(1), 1–96. DOI: 10.1561/2200000070


ScholarGate. Multi-Armed Bandit (UCB, Thompson Sampling). Retrieved 2026-06-04 from https://scholargate.app/en/experimental-design/multiarm-bandit