Reinforcement Learning
Reinforcement learning trains an agent to make sequences of decisions by trial and error, maximizing cumulative reward through interaction with an environment.
Definition
Reinforcement learning is the problem of learning a policy, a mapping from situations to actions, that maximizes expected cumulative reward, where the agent learns from the consequences of its own actions rather than from labeled examples of correct behavior.
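One common way to make "expected cumulative reward" precise is the discounted return. The formulation below is a sketch under stated assumptions rather than the only possible one: the discount factor \gamma and the infinite-horizon sum are not specified in the definition above, and undiscounted episodic or average-reward objectives are also used.

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad 0 \le \gamma < 1,
\qquad
\pi^{*} \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ G_0 \right]
```

Here R_{t+k+1} is the reward received k+1 steps after time t, G_t is the return from time t, and the expectation is taken over trajectories generated by following policy \pi.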
Scope
This area covers learning to act: the Markov decision process framework of states, actions, rewards, and transitions; value functions and the Bellman equations; value-based methods such as temporal-difference learning and Q-learning; policy-gradient methods that optimize a policy directly; and the combination of these ideas with deep neural networks. It addresses the exploration-exploitation trade-off and the challenge of delayed reward.
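For reference, the Bellman equations named above can be written as follows in one standard notation. The symbols are assumptions introduced here for illustration: P(s' | s, a) is the transition probability, R(s, a, s') the expected reward for that transition, \gamma a discount factor, V^{\pi} the state-value function of policy \pi, and Q^{*} the optimal action-value function.

```latex
\begin{aligned}
V^{\pi}(s) &= \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma V^{\pi}(s') \bigr] \\
Q^{*}(s, a) &= \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma \max_{a'} Q^{*}(s', a') \bigr]
\end{aligned}
```

The first line is the Bellman expectation equation for a fixed policy; the second is the Bellman optimality equation. Both express long-term value as immediate reward plus the discounted value of the successor state.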
Sub-topics
Core questions
- How can an agent learn good behavior from reward signals alone?
- How are long-term value and immediate reward related through the Bellman equations?
- How should an agent balance exploring new actions against exploiting known good ones?
- How is credit assigned to earlier actions for later rewards?
Key theories
- Markov decision processes and value functions
- The interaction between agent and environment is modeled as a Markov decision process. Value functions summarize expected future reward and satisfy the Bellman equations, which underlie nearly all reinforcement-learning algorithms.
- Temporal-difference learning
- Agents can learn value estimates by bootstrapping: each prediction is updated toward the observed reward plus the next prediction, which enables learning from incomplete episodes and from online experience.
- Deep reinforcement learning
- Using deep neural networks to approximate value functions or policies lets reinforcement learning scale to high-dimensional inputs, as demonstrated by agents that learned to play Atari games and the game of Go.
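The first two ideas above can be illustrated on a toy problem. The sketch below is a minimal example under stated assumptions, not a reference implementation: the five-state chain, its reward of 1 at the goal, the discount factor, and the names `step`, `backup`, `value_iteration`, and `q_learning` are all hypothetical choices made for illustration. `value_iteration` applies the Bellman optimality backup using the known model, while `q_learning` uses a temporal-difference update on sampled transitions only.

```python
import random

N_STATES = 5            # states 0..4 on a line; state 4 is the terminal goal
GOAL = N_STATES - 1
ACTIONS = (0, 1)        # 0 = move left, 1 = move right
GAMMA = 0.9             # discount factor (an illustrative choice)


def step(state, action):
    """Deterministic toy dynamics: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def backup(reward, next_value, done):
    """One-step target: reward plus discounted value of the successor."""
    return reward + (0.0 if done else GAMMA * next_value)


def value_iteration(tol=1e-10):
    """Solve the Bellman optimality equation using the known model."""
    v = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(GOAL):
            candidates = []
            for a in ACTIONS:
                s2, r, done = step(s, a)
                candidates.append(backup(r, v[s2], done))
            best = max(candidates)
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v


def q_learning(episodes=2000, alpha=0.5, epsilon=0.2, seed=0):
    """Learn action values from sampled transitions (tabular Q-learning)."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)          # explore
            else:
                best = max(q[s])                 # exploit, breaking ties at random
                a = rng.choice([b for b in ACTIONS if q[s][b] == best])
            s2, r, done = step(s, a)
            # TD update: move the estimate toward reward + later prediction.
            td_target = backup(r, max(q[s2]), done)
            q[s][a] += alpha * (td_target - q[s][a])
            s = s2
    return q


if __name__ == "__main__":
    print("value iteration:", [round(x, 3) for x in value_iteration()])
    print("Q-learning max_a Q:", [round(max(row), 3) for row in q_learning()])
```

Because the toy environment is deterministic, the learned greedy values should approach the ones computed by value iteration. In stochastic environments a smaller or decaying step size `alpha` is usually needed.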
Practical relevance
Reinforcement learning addresses sequential decision-making under uncertainty. It has driven advances in game playing, robotics, recommendation, and control, and is used to align large language models through learning from feedback. Because learning proceeds by trial and error and reward is difficult to specify, safe and sample-efficient learning remain active concerns.
History
Reinforcement learning unified ideas from optimal control, dynamic programming, and animal learning. Temporal-difference learning and Q-learning emerged in the 1980s and early 1990s, and Sutton and Barto's textbook, first published in 1998, codified the field. In the 2010s, combining these methods with deep learning produced agents that reached human-level play on Atari games and superhuman play at Go.
Debates
- Sample efficiency and reward design
- Reinforcement learning can require enormous amounts of interaction with the environment and is sensitive to how reward is specified. This has prompted debate over how to make it more data-efficient and how to prevent agents from exploiting misspecified rewards.
Key figures
- Richard Sutton
- Andrew Barto
- Christopher Watkins
- David Silver
Related topics
Seminal works
- sutton2018
- mnih2015
- silver2016
Frequently asked questions
- How does reinforcement learning differ from supervised learning?
- A supervised learner is told the correct output for each input. A reinforcement-learning agent receives only a reward signal that evaluates the outcomes of its actions, must discover good behavior by trial and error, and must cope with rewards that arrive long after the actions that earned them.
- What is the exploration-exploitation trade-off?
- An agent must choose between exploiting actions known to give good reward and exploring untried actions that might be even better. Too little exploration can lock in a suboptimal strategy, while too much wastes opportunities, so balancing the two is central to reinforcement learning.
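The trade-off in the last answer can be seen in a small simulation. The sketch below uses epsilon-greedy action selection on a multi-armed bandit with Bernoulli rewards. It is an illustrative setup only: the function name `run_bandit`, the arm success probabilities, and the step count are assumptions made for this example, and epsilon-greedy is just one of several exploration strategies.

```python
import random


def run_bandit(success_probs, epsilon, steps=5000, seed=0):
    """Play a Bernoulli bandit with epsilon-greedy action selection.

    Returns (average reward per step, number of pulls per arm).
    """
    rng = random.Random(seed)
    n_arms = len(success_probs)
    estimates = [0.0] * n_arms   # running mean reward observed for each arm
    counts = [0] * n_arms
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                           # explore
        else:
            arm = max(range(n_arms), key=lambda i: estimates[i])  # exploit
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the sample mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return total_reward / steps, counts


if __name__ == "__main__":
    probs = [0.2, 0.5, 0.8]      # hypothetical arms; the last one is best
    for eps in (0.0, 0.1, 1.0):
        avg, pulls = run_bandit(probs, eps)
        print(f"epsilon={eps}: average reward {avg:.3f}, pulls {pulls}")
```

With `epsilon=0.0` the agent never explores: ties are broken toward the first arm, so it pulls that arm on every step and never discovers the better ones. With `epsilon=1.0` it samples all arms uniformly and never capitalizes on what it has learned. A small positive value such as `0.1` sits between these extremes.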