Deep Reinforcement Learning

Deep reinforcement learning uses neural networks to approximate value functions or policies, scaling reinforcement learning to high-dimensional inputs such as images and complex games.

Definition

Deep reinforcement learning is reinforcement learning in which deep neural networks serve as the function approximators for value functions, policies, or models, enabling agents to learn directly from high-dimensional raw observations rather than hand-engineered state features.

Scope

This topic covers the combination of reinforcement learning with deep neural networks: deep Q-networks with experience replay and target networks for stability, deep actor-critic and policy-optimization methods, and the integration of learning with search as in game-playing systems. It also addresses why training value functions with function approximation is prone to instability, and the landmark results these methods achieved.

Core questions

How do neural networks let reinforcement learning handle raw high-dimensional input?
Why is combining value learning with function approximation prone to instability?
What techniques such as experience replay and target networks stabilize training?
How are learning and search combined in game-playing agents?

Key theories

Deep Q-networks: Approximating action values with a deep network, stabilized by experience replay and a periodically updated target network, allowed a single architecture to learn many Atari games directly from pixels, reaching human-level performance on many of them.
Learning combined with search: Pairing deep policy and value networks with Monte Carlo tree search and training through self-play produced systems that mastered the game of Go, exceeding the strongest human players.
Stability of function approximation: Combining bootstrapping, off-policy learning, and function approximation (a combination Sutton and Barto call the "deadly triad") can cause value estimates to diverge, so deep reinforcement learning relies on stabilizing techniques such as experience replay, target networks, and careful learning-rate choices.
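
The two stabilizers named in the deep Q-network entry can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published DQN: a linear function over one-hot features (the hypothetical LinearQ class) stands in for the convolutional network, and the two-step task, the constants (capacity 1000, batch 32, target sync every 50 steps, gamma 0.9, learning rate 0.05) and all names are invented for the example. The replay buffer stores transitions and samples them uniformly, and the bootstrap term max over a' of Q(s', a') is computed with a separate target network that is copied from the online network only every few steps.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity transition store; uniform sampling breaks up temporal correlations."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # the oldest transitions are discarded first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


class LinearQ:
    """Hypothetical stand-in for the deep Q-network: Q(s, a) = dot(w[a], s)."""

    def __init__(self, n_features, n_actions):
        self.w = [[0.0] * n_features for _ in range(n_actions)]

    def q_values(self, state):
        return [sum(wi * si for wi, si in zip(w_a, state)) for w_a in self.w]

    def copy_from(self, other):
        self.w = [row[:] for row in other.w]


def dqn_update(online, target, batch, gamma, lr):
    """Semi-gradient steps on the squared TD error, bootstrapping from the frozen target network."""
    for state, action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * max(target.q_values(next_state))
        td_error = reward + bootstrap - online.q_values(state)[action]
        # For a linear Q, the gradient of Q(s, a) with respect to w[a] is the feature vector s.
        online.w[action] = [wi + lr * td_error * si for wi, si in zip(online.w[action], state)]


# Hypothetical two-step task: from S0 any action leads to S1 with reward 0;
# in S1, action 1 ends the episode with reward 1 and action 0 ends it with reward 0.
S0, S1 = [1.0, 0.0], [0.0, 1.0]


def env_step(state, action):
    if state == S0:
        return S1, 0.0, False
    return S0, (1.0 if action == 1 else 0.0), True  # next_state is ignored when done


rng = random.Random(0)
buffer = ReplayBuffer(capacity=1000)
online, target = LinearQ(2, 2), LinearQ(2, 2)
state = S0
for step in range(1, 2001):
    action = rng.randrange(2)  # uniformly random behaviour policy, so the data are off-policy
    next_state, reward, done = env_step(state, action)
    buffer.add(state, action, reward, next_state, done)
    state = S0 if done else next_state
    if len(buffer) >= 32:
        dqn_update(online, target, buffer.sample(32, rng), gamma=0.9, lr=0.05)
    if step % 50 == 0:
        target.copy_from(online)  # periodic hard copy into the target network

print("Q(S0, .) =", [round(q, 3) for q in online.q_values(S0)])
print("Q(S1, .) =", [round(q, 3) for q in online.q_values(S1)])
```

Because the target network is frozen between syncs, each batch regresses toward a fixed target rather than one that shifts with every update; with these settings the estimates should settle near 0.9 for both actions in S0 and near 0 and 1 for the two actions in S1. Updates are applied one transition at a time here for brevity, whereas DQN itself used minibatch gradient updates on a convolutional network.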
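
The combination of learning and search can be illustrated with a PUCT-style selection rule of the kind used in AlphaGo-style Monte Carlo tree search: the policy network's prior P(s, a) steers exploration, while the mean backed-up value Q(s, a) favours actions that evaluated well. The sketch below is a one-level simplification under loud assumptions: the priors, the fixed leaf evaluations standing in for a value network, the constant c_puct = 1.0, and all function names are hypothetical, and a real search expands and backs up values through a whole tree rather than a single root.

```python
import math


def puct_select(prior, visit_count, mean_value, c_puct=1.0):
    """Return argmax_a of Q(s,a) + c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))."""
    total_visits = sum(visit_count.values())

    def score(action):
        exploration = c_puct * prior[action] * math.sqrt(total_visits) / (1 + visit_count[action])
        return mean_value[action] + exploration

    return max(prior, key=score)  # ties go to the first action in the prior's order


def run_root_search(prior, evaluate, n_simulations=200, c_puct=1.0):
    """Repeatedly select an action, evaluate it, and back the value up into the root statistics."""
    visit_count = {a: 0 for a in prior}
    value_sum = {a: 0.0 for a in prior}
    mean_value = {a: 0.0 for a in prior}
    for _ in range(n_simulations):
        action = puct_select(prior, visit_count, mean_value, c_puct)
        leaf_value = evaluate(action)  # hypothetical stand-in for a value-network evaluation
        visit_count[action] += 1
        value_sum[action] += leaf_value
        mean_value[action] = value_sum[action] / visit_count[action]
    return visit_count


# Hypothetical numbers: the prior favours "a", but the evaluations favour "b".
prior = {"a": 0.6, "b": 0.3, "c": 0.1}
leaf_values = {"a": 0.1, "b": 0.8, "c": 0.3}
visits = run_root_search(prior, leaf_values.get)
best = max(visits, key=visits.get)  # play the most-visited action
print(visits, best)
```

Early on the prior dominates the exploration bonus, but as visit counts grow the bonus for "a" shrinks and the search concentrates on "b", whose evaluations are higher. Choosing the most-visited root action, rather than the one with the highest raw score, is how AlphaGo selected its move.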

Practical relevance

Deep reinforcement learning produced some of the most visible demonstrations of artificial intelligence, including superhuman game play and advances in robotics and control, and its techniques inform the reward-driven fine-tuning of large models; its high sample cost and training instability remain important practical limitations.

History

The deep Q-network of 2015 showed that reinforcement learning with deep function approximation could learn directly from pixels, and the Go-playing systems of 2016 combined deep networks with search and self-play to defeat top human players. These results, building on the reinforcement-learning foundations codified by Sutton and Barto, established deep reinforcement learning as a major research direction.

Key figures

Volodymyr Mnih
David Silver
Demis Hassabis

Seminal works

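Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518.
Silver, D., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.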

Frequently asked questions

What did the deep Q-network demonstrate?: It showed that a single neural-network agent could learn to play dozens of different Atari games directly from screen pixels and the score, reaching human-level performance on many of them without game-specific tuning, using experience replay and a target network for stability.
Why is deep reinforcement learning often unstable?: Combining bootstrapped value estimates, off-policy data, and neural-network approximation can amplify errors and cause training to diverge. Techniques such as experience replay, target networks, and careful learning-rate choices are used to keep learning stable.
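
This divergence already appears in a two-state example of the kind Sutton and Barto use to illustrate off-policy divergence. Assume a linear value estimate v(s) = w * x(s) with a single shared weight, features x = 1 for the first state and x = 2 for the second, and a transition from the first state to the second with reward 0. If an off-policy data distribution only ever updates that one transition, each semi-gradient TD(0) step multiplies w by 1 + alpha * (2 * gamma - 1), so the weight grows without bound whenever gamma > 0.5. The function name and the constants below are illustrative choices.

```python
def off_policy_td_weights(gamma, alpha=0.1, w0=1.0, steps=100):
    """Repeat the semi-gradient TD(0) update on only the transition x=1 -> x=2 (reward 0).

    The value estimate is linear, v(s) = w * x(s), with features x = 1 and x = 2.
    """
    w = w0
    history = [w]
    for _ in range(steps):
        td_error = 0.0 + gamma * (2 * w) - w  # bootstrapped target minus current estimate
        w += alpha * td_error * 1.0            # gradient of v(first state) w.r.t. w is its feature, 1
        history.append(w)
    return history


diverging = off_policy_td_weights(gamma=0.9)
shrinking = off_policy_td_weights(gamma=0.4)
print(f"gamma=0.9: w after 100 updates = {diverging[-1]:.1f}")
print(f"gamma=0.4: w after 100 updates = {shrinking[-1]:.4f}")
```

With alpha = 0.1, the weight grows by a factor of 1.08 per update when gamma = 0.9 and shrinks by a factor of 0.98 when gamma = 0.4. Each ingredient plays a role: the shared weight is the function approximation, the target 2w is the bootstrap, and never updating the second state is the off-policy part; if the transitions out of the second state were also updated, as on-policy sampling would require, those updates could pull w back down.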