Value-Based Methods

Value-based methods learn how good states and actions are, then act greedily with respect to those estimates to obtain a good policy.

Definition

Value-based methods estimate the expected return of states or state-action pairs and derive a policy by choosing the action with the highest estimated value. They learn these estimates incrementally from experience, often through temporal-difference updates that adjust a prediction toward a later, better-informed one.
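
In symbols, the definition can be written as follows. This is a sketch in standard textbook notation, which is an assumption here because the text does not fix any symbols: $\gamma$ is the discount factor, $\alpha$ the step size, $G_t$ the return from time $t$, $V$ and $Q$ the learned state-value and action-value estimates.

```latex
\begin{align*}
G_t &= \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \\
q_\pi(s, a) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s,\ A_t = a \right], \\
\pi_{\text{greedy}}(s) &= \arg\max_{a} Q(s, a), \\
V(S_t) &\leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right].
\end{align*}
```

The last line is the one-step temporal-difference update: the bracketed term is the gap between the later, better-informed prediction and the current one.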

Scope

This topic covers reinforcement-learning algorithms that center on value functions: Monte Carlo estimation from complete episodes, temporal-difference learning that bootstraps from later estimates, and the control algorithms Sarsa and Q-learning. It addresses on-policy versus off-policy learning, exploration through strategies such as epsilon-greedy, and the use of function approximation when the state space is too large to enumerate.
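
To make "Monte Carlo estimation from complete episodes" concrete, the sketch below computes the discounted return for every step of one finished episode. The function name `discounted_returns` and the example rewards are illustrative assumptions, not part of the source.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return G_t for every step t of one complete episode.

    rewards[t] is the reward received after the action taken at step t.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    # Work backwards so each return reuses the one after it:
    # G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


# Hypothetical three-step episode with a single reward at the end.
episode_rewards = [0.0, 0.0, 1.0]
print(discounted_returns(episode_rewards, gamma=0.5))  # [0.25, 0.5, 1.0]
```

A Monte Carlo method would average such returns per state, which is why it has to wait for the episode to end before it can update anything.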

Core questions

How are action values learned from experience?
How does temporal-difference learning combine sampling with bootstrapping?
What is the difference between on-policy and off-policy learning?
How is exploration handled when acting greedily on value estimates?
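
As one possible answer to the exploration question, here is a minimal sketch of epsilon-greedy action selection. The function name, the tie-breaking rule, and the example values are assumptions made for illustration.

```python
import random
from typing import Sequence


def epsilon_greedy(q_values: Sequence[float], epsilon: float,
                   rng: random.Random) -> int:
    """Pick a random action with probability epsilon, else the best-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Ties resolve to the lowest index; an agent might break ties randomly instead.
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
estimates = [0.1, 0.9, 0.3]  # hypothetical action-value estimates for one state
print(epsilon_greedy(estimates, epsilon=0.0, rng=rng))  # 1, the greedy action
```

With `epsilon=0.0` the choice is purely greedy. Larger values trade some immediate return for visiting actions whose estimates may still be wrong.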

Key theories

Temporal-difference learning: Temporal-difference methods update a value estimate toward the observed reward plus the discounted value estimate of the next state. They learn online, from incomplete episodes, and without a model of the environment.
Q-learning: Q-learning estimates the value of the best action in each state. In the tabular setting it converges to the optimal action-value function regardless of the policy used to gather experience, provided every state-action pair keeps being visited and the step sizes decay appropriately, which makes it a foundational off-policy method.
Value approximation with deep networks: Representing the action-value function with a deep network lets value-based methods handle high-dimensional inputs such as raw pixels, as in the deep Q-network that learned to play many Atari games.
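
The temporal-difference and Q-learning ideas above can be sketched together as tabular Q-learning on a toy problem. Everything concrete here is an assumption chosen for illustration: the five-state chain environment, the `step` helper, the constant step size, and the uniformly random behavior policy.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal
ACTIONS = (0, 1)      # 0 = left, 1 = right
GAMMA = 0.9           # discount factor
ALPHA = 0.5           # constant step size (fine here: the chain is deterministic)


def step(state, action):
    """Deterministic chain: reward 1.0 only on entering the terminal state."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)  # behavior policy: uniformly random
            next_state, reward, done = step(state, action)
            # Bootstrap from the greedy value of the next state;
            # a terminal state is worth 0.
            bootstrap = 0.0 if done else max(q[next_state])
            target = reward + GAMMA * bootstrap
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q


q = q_learning(episodes=500)
greedy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(greedy)
```

Although the agent behaves randomly throughout, the greedy policy read off the learned table should move right in every non-terminal state, which is the off-policy property in miniature.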

Practical relevance

Value-based methods are among the most widely used reinforcement-learning algorithms. Q-learning combined with deep networks produced the first agents to reach human-level performance on many Atari games directly from high-dimensional sensory input, showing that value estimation can scale to complex tasks.

History

Sutton introduced temporal-difference learning in 1988. Watkins proposed Q-learning in 1989, and the convergence proof he published with Dayan in 1992 established it as a convergent off-policy control method. The deep Q-network of 2015 combined Q-learning with deep networks, bringing value-based reinforcement learning to high-dimensional problems and launching the modern deep reinforcement-learning era.

Key figures

Richard Sutton
Christopher Watkins
Volodymyr Mnih

Seminal works

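sutton2018: Sutton and Barto, Reinforcement Learning: An Introduction, 2nd edition (2018)
mnih2015: Mnih et al., Human-level control through deep reinforcement learning (2015)
watkins1992: Watkins and Dayan, Q-learning (1992)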

Frequently asked questions

What does temporal-difference learning bootstrap from?
It updates the value of the current state using the observed reward plus its own discounted estimate of the next state's value. Because it relies partly on another estimate rather than waiting for the final outcome, it can learn online and from incomplete episodes.

Why is Q-learning called off-policy?
Q-learning learns the value of the optimal policy even while the agent follows a different, exploratory policy to collect experience. The behavior policy that gathers the data and the target policy being learned can differ, which is what off-policy means.
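
The on-policy versus off-policy distinction shows up directly in the update target. The sketch below contrasts the Sarsa target with the Q-learning target for a single transition. The function names and the numbers are illustrative assumptions.

```python
from typing import Sequence


def sarsa_target(reward: float, gamma: float,
                 next_q: Sequence[float], next_action: int) -> float:
    """On-policy: bootstrap from the action the agent actually takes next."""
    return reward + gamma * next_q[next_action]


def q_learning_target(reward: float, gamma: float,
                      next_q: Sequence[float]) -> float:
    """Off-policy: bootstrap from the greedy action, whatever is taken next."""
    return reward + gamma * max(next_q)


next_q = [2.0, 4.0]   # hypothetical estimates for the two actions in the next state
explored_action = 0   # suppose exploration picks the lower-valued action

print(sarsa_target(1.0, 0.5, next_q, explored_action))  # 2.0
print(q_learning_target(1.0, 0.5, next_q))              # 3.0
```

When the next action happens to be the greedy one, the two targets coincide. They differ only on exploratory steps, which is where Sarsa accounts for the behavior policy and Q-learning ignores it.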