Reinforcement Learning

Reinforcement learning trains an agent to make sequences of decisions by trial and error, maximizing cumulative reward through interaction with an environment.

Definition

Reinforcement learning is the problem of learning a policy, a mapping from situations to actions, that maximizes expected cumulative reward, where the agent learns from the consequences of its own actions rather than from labeled examples of correct behavior.
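One common way to write this objective in symbols is sketched below. The discount factor \gamma in [0, 1), the reward symbol R, and the return G_t are notational assumptions introduced here for illustration; the definition above only states that expected cumulative reward is maximized.

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1},
\qquad
J(\pi) = \mathbb{E}_{\pi}\!\left[ G_0 \right],
\qquad
\pi^{*} = \arg\max_{\pi} J(\pi)
```

Here \pi is the policy, G_t is the discounted sum of rewards received after time t, and \pi^{*} is a policy that maximizes the expected return from the start of the interaction.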

Scope

This area covers learning to act: the Markov decision process framework of states, actions, rewards, and transitions; value functions and the Bellman equations; value-based methods such as temporal-difference learning and Q-learning; policy-gradient methods that optimize a policy directly; and the combination of these ideas with deep neural networks. It addresses the exploration-exploitation trade-off and the challenge of delayed reward.
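As a rough illustration of the Markov decision process framework and the Bellman equations, the sketch below encodes a made-up two-state problem and solves it by value iteration. The state names, action names, transition probabilities, rewards, and the discount factor GAMMA are all hypothetical, and value iteration assumes the transition model is known, which a reinforcement-learning agent usually cannot assume.

```python
# Hypothetical two-state MDP, for illustration only.
# P[state][action] is a list of (probability, next_state, reward) outcomes.
P = {
    "low":  {"wait": [(1.0, "low", 0.0)],
             "work": [(0.8, "high", 1.0), (0.2, "low", 0.0)]},
    "high": {"wait": [(1.0, "high", 2.0)],
             "work": [(1.0, "low", 3.0)]},
}
GAMMA = 0.9  # assumed discount factor


def q_value(V, state, action):
    """Expected immediate reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[state][action])


def value_iteration(tol=1e-8):
    """Apply the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(q_value(V, s, a) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


def greedy_policy(V):
    """Pick, in each state, the action with the highest backed-up value."""
    return {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}


V = value_iteration()
policy = greedy_policy(V)
print(V, policy)
```

In this toy problem the larger immediate reward in the "high" state (3.0 for "work") is not the best long-term choice, which is the kind of distinction between immediate reward and long-term value that the Bellman equations capture.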

Sub-topics

Core questions

How can an agent learn good behavior from reward signals alone?
How are long-term value and immediate reward related through the Bellman equations?
How should an agent balance exploring new actions against exploiting known good ones?
How is credit assigned to earlier actions for later rewards?
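One simple view of the last question is the discounted return: each time step is credited with the rewards that follow it, weighted less the further away they are. The sketch below assumes a discount factor gamma of 0.9 and a made-up episode in which the only reward arrives at the final step; both are illustrative choices, not part of the text above.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t for every step t, where G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backward so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical episode: no reward until the last of four steps.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_returns(episode_rewards))
```

With these inputs the earliest step receives a return of about 0.729 (that is, 0.9 cubed) and the final step receives 1.0, so earlier actions still get some credit for the delayed reward.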

Key theories

Markov decision processes and value functions: The interaction is modeled as a Markov decision process, and value functions summarize expected future reward, satisfying Bellman equations that underlie nearly all reinforcement-learning algorithms.
Temporal-difference learning: Agents can learn value estimates by bootstrapping, updating each prediction toward the observed reward plus the prediction for the next state, which enables learning online and from incomplete episodes.
Deep reinforcement learning: Using deep neural networks to approximate value functions or policies lets reinforcement learning scale to high-dimensional inputs, as demonstrated by agents that learned to play Atari games and the game of Go.
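The bootstrapped update described under temporal-difference learning can be sketched with tabular Q-learning, one temporal-difference method. Everything concrete here is an assumption made for illustration: the five-state corridor environment, the step function, the values of ALPHA and GAMMA, and the uniformly random behavior policy (chosen because Q-learning is off-policy, so it can learn about the greedy policy while acting randomly).

```python
import random

N_STATES = 5          # hypothetical corridor: states 0..4, state 4 is terminal
ACTIONS = (-1, +1)    # index 0 steps left, index 1 steps right
ALPHA, GAMMA = 0.5, 0.9  # assumed step size and discount factor


def step(state, action):
    """Move along the corridor; reward 1.0 only on reaching the last state."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=200, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            a = rng.randrange(2)  # uniformly random behavior policy
            next_state, reward, done = step(state, ACTIONS[a])
            # Bootstrapped target: observed reward plus the discounted
            # current estimate of the best value in the next state.
            target = reward if done else reward + GAMMA * max(Q[next_state])
            Q[state][a] += ALPHA * (target - Q[state][a])
            state = next_state
    return Q


Q = q_learning()
greedy = ["left" if q[0] > q[1] else "right" for q in Q[:-1]]
print(greedy)
```

In this sketch the value estimates propagate backward from the rewarding end of the corridor, one update at a time, without waiting for a model of the environment. A deep reinforcement-learning variant would replace the table Q with a neural network, which is beyond this small example.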

Practical relevance

Reinforcement learning addresses sequential decision-making under uncertainty and has driven advances in game playing, robotics, recommendation, and control, as well as the alignment of large language models through learning from feedback. Its trial-and-error nature and the difficulty of specifying reward make safe and sample-efficient learning active concerns.

History

Reinforcement learning unified ideas from optimal control, dynamic programming, and animal learning. Temporal-difference learning and Q-learning emerged in the 1980s and early 1990s, and Sutton and Barto's textbook codified the field. The 2010s combination with deep learning produced agents reaching human-level play on Atari games and superhuman play at Go.

Debates

Sample efficiency and reward design: Reinforcement learning can require enormous amounts of interaction with the environment and is sensitive to how reward is specified, prompting debate over how to make it more data-efficient and how to prevent agents from exploiting misspecified rewards.

Key figures

Richard Sutton
Andrew Barto
Christopher Watkins
David Silver

Seminal works

sutton2018
mnih2015
silver2016

Frequently asked questions

How does reinforcement learning differ from supervised learning?: In supervised learning, the learner is told the correct output for each input. A reinforcement-learning agent receives only a reward signal that evaluates the outcomes of its actions, so it must discover good behavior by trial and error and cope with rewards that may arrive long after the actions that earned them.
What is the exploration-exploitation trade-off?: An agent must choose between exploiting actions known to give good reward and exploring untried actions that might be even better. Too little exploration can lock in a suboptimal strategy, while too much wastes opportunities, so balancing the two is central to reinforcement learning.
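A minimal way to see the exploration-exploitation trade-off is an epsilon-greedy agent on a multi-armed bandit. The sketch below is an illustration under stated assumptions, not a recommended algorithm: the function name, the Gaussian reward noise, the arm means, and the value of epsilon are all hypothetical.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Run an epsilon-greedy agent on a bandit with Gaussian rewards."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms          # how often each arm was pulled
    estimates = [0.0] * n_arms     # running average reward per arm
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            # exploit: pick the arm with the highest current estimate
            arm = max(range(n_arms), key=lambda i: estimates[i])
        reward = rng.gauss(true_means[arm], 1.0)  # assumed unit-variance noise
        counts[arm] += 1
        # Incremental mean update of the pulled arm's estimate.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward


# Hypothetical arms; the third has the highest true mean reward.
estimates, counts, total = epsilon_greedy_bandit([0.0, 0.5, 2.0])
print(counts, [round(e, 2) for e in estimates])
```

Setting epsilon to 0.0 in this sketch makes the agent purely greedy, which can lock onto whichever arm happened to look good first, while a large epsilon spends many pulls on arms already known to be worse.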