
Reinforcement Learning

Reinforcement Learning (Agent-Environment Reward Optimization) · Also known as: RL, reward-based learning, trial-and-error learning, policy optimization

Reinforcement Learning (RL) is a framework in which an agent learns to make sequential decisions by interacting with an environment, receiving scalar reward signals, and updating a policy to maximise expected cumulative (typically discounted) future reward. Unlike supervised learning, no labeled examples are provided; the agent must discover good behavior through experience and delayed feedback.
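The interaction loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `CorridorEnv`, `run_episode`, the two policies, and the discount factor `gamma=0.9` are all hypothetical names and values chosen for the example.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: start in cell 0, the goal is the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Action 1 moves right, action 0 moves left; the position is clipped to the corridor.
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        done = self.pos == self.length - 1
        reward = 1.0 if done else 0.0  # reward arrives only at the goal (delayed feedback)
        return self.pos, reward, done


def run_episode(env, policy, gamma=0.9, max_steps=100):
    """Roll out one episode and return the discounted return sum_t gamma^t * r_t."""
    state = env.reset()
    discounted_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent acts
        state, reward, done = env.step(action)  # the environment responds
        discounted_return += discount * reward
        discount *= gamma
        if done:
            break
    return discounted_return


_rng = random.Random(0)


def always_right(state):
    return 1


def random_policy(state):
    return _rng.choice([0, 1])


env = CorridorEnv(length=5)
print(run_episode(env, always_right))   # goal reached on the 4th step, so the return is 0.9**3
print(run_episode(env, random_policy))  # never higher than the always-right return
```

A learning algorithm would use returns like these to adjust the policy; here the two policies are fixed so that only the loop and the return computation are on display.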


When to use it

Use reinforcement learning when the task is sequential decision-making with delayed rewards and no labeled training data: game playing, robotic control, resource scheduling, recommendation sequencing, or fine-tuning language models with human feedback (RLHF). RL is most appropriate when the environment can be simulated cheaply and the agent can interact with it many times. Do not use RL when a supervised or self-supervised approach can solve the problem, because RL is comparatively sample-inefficient and unstable; when the reward function is hard to specify; or when interactions with the real environment are expensive or dangerous and no simulator is available. When only a static, fixed dataset is available, online RL does not apply; offline RL is an option, but use it with caution.

Strengths & limitations

Strengths

Learns directly from interaction without labeled data, making it applicable where annotation is impossible.
Capable of discovering superhuman strategies in complex sequential tasks (games, robotics, scheduling).
Naturally models long-horizon planning and temporal credit assignment.
Scales to high-dimensional state spaces (pixels, text) when combined with deep neural networks.
RLHF enables aligning large language models with human preferences.

Limitations

Extremely sample-inefficient: may require millions of environment interactions to converge.
Training is often unstable and sensitive to hyperparameters, reward shaping, and random seeds.
Reward function design is difficult; poorly specified rewards cause unintended optimisation (reward hacking).
Generalisation to unseen states or environments is not guaranteed.
Real-world deployment is risky without extensive simulation; exploration can cause dangerous actions.

Frequently asked

What distinguishes reinforcement learning from supervised learning?

Supervised learning requires labeled input-output pairs and learns a fixed mapping. RL requires no labels; instead the agent receives reward signals after actions and must discover what to do through trial and error, making it suitable for sequential decision problems where the correct action is not known in advance.
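To make the "reward only, no labels" point concrete, here is a small sketch of an epsilon-greedy agent on a multi-armed bandit. The function name `run_bandit`, the Gaussian reward noise, and the arm means are assumptions made for illustration; the agent is never told which arm is correct and only sees a noisy scalar reward after each pull.

```python
import random


def run_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a Gaussian multi-armed bandit (hypothetical toy setup)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms  # running estimate of each arm's mean reward
    counts = [0] * n_arms       # how often each arm has been pulled
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                           # explore: try a random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit the current best guess
        reward = rng.gauss(true_means[arm], 1.0)                  # the only feedback: a noisy reward
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental sample mean
    return estimates, counts


estimates, counts = run_bandit([0.2, 0.5, 1.5])
print("pull counts:", counts)
print("estimated means:", [round(e, 2) for e in estimates])
```

In a supervised setting the index of the best arm would be given as a label; here it has to be inferred from the rewards, which is why some exploration is needed.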

What is the difference between model-free and model-based RL?

Model-free RL (e.g. DQN, PPO) learns policies or value functions directly from experience without an explicit model of environment dynamics. Model-based RL learns or is given a transition model and uses it for planning. Model-based methods tend to be more sample-efficient but require an accurate model, which can be hard to learn.
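As a rough illustration of the model-based side, the sketch below plans with value iteration on a tiny MDP whose transition model is given explicitly. The three-state `model` dictionary, its probabilities, and the function names are hypothetical; a model-free method would have to estimate the same values from sampled transitions instead of reading them from the model.

```python
# Hypothetical 3-state MDP given as an explicit model:
# model[state][action] = list of (probability, next_state, reward)
model = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(0.8, 2, 1.0), (0.2, 1, 0.0)]},
    2: {"stay": [(1.0, 2, 0.0)], "go": [(1.0, 2, 0.0)]},  # absorbing goal state
}


def q_value(model, values, state, action, gamma):
    """Expected one-step reward plus discounted value of the next state."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in model[state][action])


def value_iteration(model, gamma=0.9, tol=1e-8):
    """Plan with the known model: repeat Bellman optimality backups until they stop changing."""
    values = {s: 0.0 for s in model}
    while True:
        delta = 0.0
        for s in model:
            best = max(q_value(model, values, s, a, gamma) for a in model[s])
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {
        s: max(model[s], key=lambda a: q_value(model, values, s, a, gamma))
        for s in model
    }
    return values, policy


values, policy = value_iteration(model)
print({s: round(v, 4) for s, v in values.items()})
print(policy)
```

No environment interaction happens here at all, which is the source of the sample-efficiency advantage; the catch is that the plan is only as good as the model.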

How do I choose between a value-based and policy-gradient method?

Value-based methods like DQN work well with discrete action spaces. Policy-gradient methods (e.g. REINFORCE, PPO) handle continuous action spaces naturally and can learn stochastic policies. Actor-critic methods combine both ideas by training a policy alongside a learned value function; PPO is usually implemented this way, and SAC is an off-policy actor-critic method designed for continuous control. For many modern applications, PPO or SAC are reasonable starting points.
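The two families can be contrasted on toy problems. The sketch below is illustrative only and far simpler than DQN, PPO, or SAC: `q_learning` is tabular Q-learning on a hypothetical five-state chain with two discrete actions, and `gaussian_policy_gradient` is a REINFORCE-style update with a running-average baseline on a hypothetical one-step task with a continuous action. All names, rewards, and hyperparameters are assumptions.

```python
import random

N_STATES, GOAL, GAMMA = 5, 4, 0.9


def chain_step(state, action):
    """Hypothetical chain: action 1 moves right, action 0 moves left; reward 1 on reaching GOAL."""
    nxt = min(max(state + (1 if action == 1 else -1), 0), GOAL)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL


def q_learning(episodes=500, alpha=0.5, epsilon=0.2, seed=0):
    """Value-based, discrete actions: learn Q(s, a), then act greedily on it."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done, steps = 0, False, 0
        while not done and steps < 100:
            # Explore with probability epsilon, and also break ties at random.
            explore = rng.random() < epsilon or q[state][0] == q[state][1]
            action = rng.randrange(2) if explore else int(q[state][1] > q[state][0])
            nxt, reward, done = chain_step(state, action)
            target = reward if done else reward + GAMMA * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])  # temporal-difference update
            state, steps = nxt, steps + 1
    return q


def gaussian_policy_gradient(target=2.0, sigma=0.5, lr=0.01, steps=5000, seed=0):
    """Policy gradient, continuous action: learn the mean of a Gaussian policy.

    One-step task with reward -(action - target)^2. The update is
    lr * (reward - baseline) * d/d_mean log N(action | mean, sigma^2).
    """
    rng = random.Random(seed)
    mean, baseline = 0.0, 0.0
    for _ in range(steps):
        action = rng.gauss(mean, sigma)        # sample from the stochastic policy
        reward = -(action - target) ** 2
        grad_log_prob = (action - mean) / sigma ** 2
        mean += lr * (reward - baseline) * grad_log_prob
        baseline += 0.1 * (reward - baseline)  # running-average baseline to reduce variance
    return mean


q = q_learning()
print([int(q[s][1] > q[s][0]) for s in range(GOAL)])  # greedy action per non-goal state
print(round(gaussian_policy_gradient(), 2))           # learned action mean, near the target
```

The Q-table needs one entry per action, which is why value-based methods fit discrete action sets; the Gaussian policy outputs a real-valued action directly, so no enumeration of actions is required.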

How many environment interactions does RL typically need?

RL is notoriously sample-hungry. Simple tabular tasks may converge in thousands of steps; Atari DQN required tens of millions of frames; complex robotics tasks can need hundreds of millions of simulation steps. This makes a fast simulator essential and rules out RL for many real-world problems with expensive interactions.

What is RLHF and why is it important?

Reinforcement Learning from Human Feedback (RLHF) trains a reward model from human preference comparisons and then fine-tunes a language model with RL to maximise that learned reward. It is a widely used technique for aligning large language models with human preferences and is central to models like InstructGPT and Claude.
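The reward-modelling step can be sketched with a pairwise (Bradley-Terry style) preference loss, which is a common choice for this stage. Everything concrete below is an assumption for illustration: the linear reward model stands in for a neural network, the two-dimensional feature vectors and the `pairs` data are invented, and the subsequent RL fine-tuning stage is not shown.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reward(weights, features):
    """Linear reward model r(x) = w . x (a stand-in for a neural reward model)."""
    return sum(w * f for w, f in zip(weights, features))


def preference_loss(weights, pairs):
    """Mean pairwise loss: -log sigmoid(r(chosen) - r(rejected))."""
    total = 0.0
    for chosen, rejected in pairs:
        total -= math.log(sigmoid(reward(weights, chosen) - reward(weights, rejected)))
    return total / len(pairs)


def fit_reward_model(pairs, dim, lr=0.5, epochs=200):
    """Plain gradient descent on the pairwise preference loss."""
    weights = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            coeff = sigmoid(margin) - 1.0  # derivative of -log sigmoid(margin) w.r.t. margin
            for i in range(dim):
                grad[i] += coeff * (chosen[i] - rejected[i]) / len(pairs)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Hypothetical features per response: (helpfulness score, verbosity score).
# Each pair is (response the annotator preferred, response the annotator rejected).
pairs = [
    ((0.9, 0.2), (0.1, 0.8)),
    ((0.8, 0.5), (0.3, 0.4)),
    ((0.7, 0.1), (0.2, 0.9)),
]
weights = fit_reward_model(pairs, dim=2)
print([round(w, 2) for w in weights])
print(round(preference_loss(weights, pairs), 3))
```

After fitting, the model scores every preferred response above its rejected counterpart; in a full RLHF pipeline those scores would serve as the reward signal for the policy-optimisation step.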

Sources

Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. ISBN: 978-0-262-03924-6
Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533. DOI: 10.1038/nature14236

How to cite this page

ScholarGate. (2026, June 3). Reinforcement Learning (Agent-Environment Reward Optimization). ScholarGate. https://scholargate.app/en/deep-learning/reinforcement-learning


Referenced by

Agent-based dynamic programming
Bayesian Dynamic Programming
Explainable Reinforcement Learning
Fine-Tuned Reinforcement Learning
Multilingual Reinforcement Learning
Multimodal Reinforcement Learning
Self-supervised Reinforcement Learning
Semi-supervised Reinforcement Learning
Transfer Learning with Reinforcement Learning
Weakly supervised reinforcement learning

Related reference concepts

Reinforcement Learning
Deep Reinforcement Learning
Markov Decision Processes
Policy Gradient Methods
Value-Based Methods
Sequential Decision Making (MDPs)
