Q-Learning

Q-Learning (Off-Policy Temporal-Difference Control) · Also known as: Q-learning algorithm, tabular Q-learning, off-policy TD control, Q-öğrenme

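Q-learning, introduced by Christopher Watkins in his 1989 PhD thesis and given a convergence proof by Watkins and Peter Dayan in 1992, is a model-free reinforcement-learning algorithm that learns the Q-function, the value of taking each action in each state, purely from experience and without a model of the environment. It is off-policy: it learns the optimal action-values while following an exploratory behaviour policy. Under standard conditions (every state-action pair is visited infinitely often and the step sizes decay suitably) its estimates provably converge to the optimal action-values, and acting greedily on them gives an optimal policy.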

When to use it

Use Q-learning for sequential decision problems framed as a Markov decision process (MDP) where you can sample transitions and rewards but have no model of the dynamics, or only an unreliable one. Typical domains are control, games, routing, scheduling, and adaptive systems. Tabular Q-learning suits small, discrete state-action spaces and converges to the optimal action-values given sufficient exploration and a suitably decaying step size. For large or continuous state spaces the table is replaced by a function approximator, as in Deep Q-Networks. The method assumes the Markov property and a stationary environment, and it can be sample-inefficient. Its max operator also induces an optimistic (maximization) bias, which variants such as Double Q-learning address. When a stochastic or continuous-action policy is needed, policy-gradient methods are the alternative.
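
To make the tabular case concrete, here is a minimal sketch of Q-learning with an epsilon-greedy behaviour policy. The environment is a hypothetical five-state chain invented for illustration (it is not taken from the sources), and the function name, parameters, and fixed step size are all assumptions. A fixed step size is a simplification that is adequate here only because the toy environment is deterministic.

```python
import random

def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical chain: states 0..n_states-1,
    actions 0 (left) and 1 (right); entering the last state pays 1 and ends the episode."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # Q-table: q[state][action]
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        while s != goal:
            # Epsilon-greedy behaviour policy; ties are broken at random
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            # Deterministic toy dynamics (an assumption of this sketch)
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0
            # The terminal state has value 0; otherwise bootstrap from the best next action
            best_next = 0.0 if s_next == goal else max(q[s_next])
            q[s][a] += alpha * (r + gamma * best_next - q[s][a])
            s = s_next
    return q

q = q_learning()
# Greedy action in each non-terminal state (1 means "right", towards the reward)
greedy = [max((0, 1), key=lambda a: row[a]) for row in q[:-1]]
```

In this toy problem the greedy policy read off the learned table moves right in every non-terminal state, even though the agent kept exploring while it learned.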

Strengths & limitations

Strengths

Model-free: needs no transition/reward model, only sampled experience.
Off-policy: learns the optimal policy while exploring with another policy.
Provably converges to the optimal action-values under standard conditions.
Simple, foundational, and the basis for Deep Q-Networks and many extensions.

Limitations

Tabular form does not scale to large or continuous state/action spaces.
Sample-inefficient; can need many episodes to converge.
Maximization bias (overestimation) from the max operator; mitigated by Double Q-learning.
Assumes a (stationary) Markov decision process; struggles under partial observability or drift.

Frequently asked

What does 'off-policy' mean in Q-learning?

It means Q-learning learns the value of the optimal (greedy) policy while the agent follows a different, exploratory behaviour policy. The update target is r + γ·max_a' Q(s', a'), which bootstraps from the best action available in the next state rather than from the action the agent actually takes there, so exploration does not bias the learned optimum.
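
The difference is easiest to see by putting the off-policy target next to an on-policy one. The sketch below contrasts the Q-learning target with the SARSA target; SARSA is not discussed on this page and appears only as a point of comparison. The function names and the numbers in the toy table are hypothetical.

```python
def q_learning_target(q, reward, next_state, gamma):
    """Off-policy target: bootstrap from the best action in the next state."""
    return reward + gamma * max(q[next_state].values())

def sarsa_target(q, reward, next_state, next_action, gamma):
    """On-policy target: bootstrap from the action the behaviour policy actually took."""
    return reward + gamma * q[next_state][next_action]

# Hypothetical estimates for one next state with two actions
q = {"s1": {"left": 0.2, "right": 1.0}}

# Suppose the exploratory policy picked "left" in s1
off_policy = q_learning_target(q, reward=0.5, next_state="s1", gamma=0.9)
on_policy = sarsa_target(q, reward=0.5, next_state="s1", next_action="left", gamma=0.9)
```

The Q-learning target ignores the exploratory choice of "left" and uses the value of "right", while the on-policy target is pulled down by the action that was actually taken.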

How does Q-learning relate to Deep Q-Networks (DQN)?

DQN replaces the Q-table with a deep neural network that approximates Q(s, a), enabling Q-learning in large or continuous (e.g., pixel) state spaces. It adds stabilizing tricks — experience replay and a target network — but the core learning rule is Q-learning's temporal-difference update.
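
The two stabilizing ideas can be sketched without a deep-learning library. The code below is an illustrative simplification, not the DQN reference implementation: `ReplayBuffer`, `td_targets`, and the stand-in `target_q` function are hypothetical names, and a real agent would use a neural network and a gradient step where this sketch only computes the regression targets.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps
        return self.rng.sample(list(self.buffer), batch_size)

def td_targets(batch, target_q, gamma):
    """Q-learning targets computed from a frozen copy of the value function."""
    return [r if done else r + gamma * max(target_q(s_next))
            for (_s, _a, r, s_next, done) in batch]

# Stand-in for a target network: a hypothetical function returning two action-values
def target_q(state):
    return [0.1 * state, 0.2 * state]

buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add((step, 0, 1.0, step + 1, step == 4))  # the last transition is terminal

batch = buffer.sample(2)
targets = td_targets(batch, target_q, gamma=0.5)
```

Terminal transitions use the reward alone as the target. In a full agent the frozen `target_q` would be refreshed from the online network only every so often, which is what keeps the targets from shifting at every update.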

Why does Q-learning overestimate values?

The max operator in the update tends to select over-optimistic estimates, biasing Q upward, especially with noisy rewards. Double Q-learning reduces this by decoupling action selection from value evaluation using two estimators, yielding less biased and often better policies.
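
A small simulation makes the bias visible. In this sketch every action has a true value of 0, so any nonzero average is pure estimation bias. The setup (ten actions, unit Gaussian noise, the function name `estimate_bias`) is an assumption chosen for illustration, and it isolates the estimator idea behind Double Q-learning rather than running the full algorithm.

```python
import random

def estimate_bias(n_actions=10, n_trials=2000, seed=0):
    """Average value of the single and double estimators when all true values are 0."""
    rng = random.Random(seed)
    single_total, double_total = 0.0, 0.0
    for _ in range(n_trials):
        # Two independent noisy estimates of the same (zero) action-values
        est_a = [rng.gauss(0.0, 1.0) for _ in range(n_actions)]
        est_b = [rng.gauss(0.0, 1.0) for _ in range(n_actions)]
        # Single estimator: select and evaluate with the same noisy values
        single_total += max(est_a)
        # Double estimator: select the action with A, evaluate it with B
        best = max(range(n_actions), key=lambda a: est_a[a])
        double_total += est_b[best]
    return single_total / n_trials, double_total / n_trials

single_bias, double_bias = estimate_bias()
```

The single estimator comes out clearly positive although nothing is actually worth more than 0, while the double estimator stays close to 0 because the noise that drove the selection is independent of the noise in the evaluation.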

Sources

Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292. DOI: 10.1007/BF00992698
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. ISBN: 978-0-262-03924-6

How to cite this page

ScholarGate. (2026, June 2). Q-Learning (Off-Policy Temporal-Difference Control). ScholarGate. https://scholargate.app/en/machine-learning/q-learning

Which method?

Set this method beside its closest kin and read them side by side — the library lays the books on the table; the choice is yours.

Deep Reinforcement Learning (Deep learning)
Dynamic Programming (Optimization)
Policy Gradient (Machine learning)

Referenced by

Policy Gradient

Related reference concepts

Value-Based Methods
Reinforcement Learning
Deep Reinforcement Learning
Markov Decision Processes
Policy Gradient Methods
Sequential Decision Making (MDPs)
