Policy Gradient Methods

Policy Gradient Methods (REINFORCE / Actor-Critic) · Also known as: REINFORCE, actor-critic, policy optimization, politika gradyanı (Turkish)

Policy gradient methods are reinforcement-learning algorithms that optimize a parameterized policy directly by gradient ascent on the expected return, rather than learning action-values and acting greedily. Founded on Ronald Williams' 1992 REINFORCE algorithm and the policy gradient theorem of Sutton and colleagues (2000), they naturally handle stochastic and continuous action spaces and underpin modern actor-critic and deep-RL algorithms.
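
As a reference point for "gradient ascent on the expected return", the objective and the REINFORCE form of its gradient can be written as below. The notation is an assumption of this sketch and does not come from the page: \theta for the policy parameters, \tau for a trajectory of length T, \gamma for the discount factor, \alpha for the step size, and G_t for the return from step t.

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T-1} \gamma^{t}\, r_{t+1}\right],
\qquad
G_t = \sum_{k=t}^{T-1} \gamma^{k-t}\, r_{k+1}

\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T-1} \gamma^{t}\, G_t\,
      \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right],
\qquad
\theta \leftarrow \theta + \alpha\, \gamma^{t}\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```

The expectation is over trajectories sampled from the current policy, which is why the basic estimator is on-policy.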

When to use it

Use policy gradient methods when the action space is continuous or high-dimensional, when a stochastic policy is desirable (for exploration, partial observability, or multi-agent settings), or when you want to optimize a policy end-to-end with a neural network. Typical applications include robotics control, continuous-control benchmarks, and dialogue or recommendation policies; policy gradients are also the foundation of RLHF for language models. They optimize the expected return directly and handle continuous actions that value-based methods struggle with.

The costs: gradient estimates are high-variance (mitigated by baselines or critics), and the methods are typically on-policy and sample-inefficient, sensitive to step size and reward scaling, and prone to local optima. When the action space is small and discrete, value-based methods such as Q-learning or DQN can be simpler and more sample-efficient; trust-region variants (TRPO, PPO) address step-size instability.

Strengths & limitations

Strengths

Directly optimize the policy; handle continuous and high-dimensional action spaces.
Naturally represent stochastic policies, aiding exploration and partial observability.
Integrate seamlessly with neural-network function approximation (deep RL).
Foundation for modern algorithms (A2C/A3C, TRPO, PPO, DDPG) and RLHF.

Limitations

High-variance gradient estimates; need baselines or critics to be practical.
Usually on-policy and sample-inefficient compared with off-policy value methods.
Sensitive to learning rate and reward scaling; can converge to local optima.
Stability requires care; naive large steps can collapse the policy.
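
One common guard against overly large policy steps is the clipped surrogate objective used by PPO, which the page names as a trust-region variant. The sketch below evaluates that objective for a single sample; the function name, argument names, and the default `eps=0.2` are illustrative assumptions, and a real implementation would average this over a batch and differentiate it.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective for one sample (to be maximized).

    ratio     -- pi_new(a|s) / pi_old(a|s), the probability ratio
    advantage -- advantage estimate for the sampled action
    eps       -- clip range (assumed default, tune per task)
    """
    # Keep the ratio inside [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # Take the more pessimistic of the clipped and unclipped terms, so the
    # objective stops rewarding moves that push the ratio past the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


# With a positive advantage, raising the ratio beyond 1 + eps gains nothing.
capped = clipped_surrogate(1.5, 1.0)
# With a negative advantage, the unclipped (worse) term is kept as a penalty.
penalized = clipped_surrogate(1.5, -1.0)
```

Because the objective is flat once the ratio leaves the clip range in the favorable direction, a single update has little incentive to move the policy far from the one that collected the data.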

Frequently asked

How do policy gradients differ from Q-learning?

Q-learning learns action-values and acts greedily; policy gradients parameterize and optimize the policy directly via gradient ascent on expected return. Policy gradients handle continuous/stochastic actions naturally but are higher-variance and usually on-policy, while Q-learning is off-policy and often more sample-efficient for small discrete actions.
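
To make the contrast concrete, here is a minimal sketch of a policy gradient update on a two-armed bandit, assuming a softmax policy over per-arm preferences. The task, the arm means, and all names (`reinforce_bandit`, `theta`, `lr`) are hypothetical choices for illustration. Note that no action-value table is learned: the sampled reward directly scales the gradient of the log-probability of the chosen action.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_means, steps=2000, lr=0.1, seed=0):
    """REINFORCE without a baseline on a Gaussian bandit (illustrative)."""
    rng = random.Random(seed)
    theta = [0.0] * len(arm_means)          # policy parameters (preferences)
    for _ in range(steps):
        probs = softmax(theta)
        # Sample from the stochastic policy instead of acting greedily.
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = rng.gauss(arm_means[action], 1.0)
        for a in range(len(theta)):
            # Gradient of log softmax w.r.t. theta[a]: one_hot(action) - probs.
            grad_log = (1.0 if a == action else 0.0) - probs[a]
            theta[a] += lr * reward * grad_log   # gradient ascent step
    return softmax(theta)


probs = reinforce_bandit([0.0, 1.0])   # arm 1 pays more on average
```

After training, `probs` should place most of its mass on the better arm while remaining a proper probability distribution, whereas a Q-learning agent would instead keep value estimates per arm and pick the largest.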

What is an actor-critic method?

It combines a policy (the actor), updated by the policy gradient, with a learned value function (the critic) that estimates how good states or actions are. The critic supplies an advantage signal with lower variance than raw Monte Carlo returns, at the cost of some bias from the learned estimate, which stabilizes and speeds up learning. A2C/A3C, PPO, and DDPG are actor-critic algorithms.
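
The actor and critic roles can be sketched with a tabular one-step actor-critic. Everything about the environment here is a hypothetical assumption: a four-state corridor where the agent starts at state 0, pays a reward of -1 per move, and terminates at state 3. The step sizes and episode count are likewise illustrative, and the discount weighting on the actor update is omitted for simplicity.

```python
import math
import random

N_STATES = 4          # states 0..3; state 3 is terminal (hypothetical task)
MOVES = (-1, +1)      # action 0 = left, action 1 = right


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def step(state, action):
    """Move along the corridor; walls clamp the position."""
    nxt = min(max(state + MOVES[action], 0), N_STATES - 1)
    return nxt, -1.0, nxt == N_STATES - 1


def actor_critic(episodes=500, gamma=0.99, lr_actor=0.1, lr_critic=0.2, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]   # actor: action preferences
    value = [0.0] * N_STATES                        # critic: state values
    for _ in range(episodes):
        state, done, t = 0, False, 0
        while not done and t < 100:                 # cap episode length
            probs = softmax(theta[state])
            action = rng.choices((0, 1), weights=probs)[0]
            nxt, reward, done = step(state, action)
            # One-step TD error: the critic's estimate of the advantage.
            target = reward + (0.0 if done else gamma * value[nxt])
            td_error = target - value[state]
            value[state] += lr_critic * td_error    # critic update
            for b in (0, 1):                        # actor update
                grad_log = (1.0 if b == action else 0.0) - probs[b]
                theta[state][b] += lr_actor * td_error * grad_log
            state, t = nxt, t + 1
    return theta, value


theta, value = actor_critic()
right_probs = [softmax(theta[s])[1] for s in range(N_STATES - 1)]
```

Here `right_probs` holds the learned probability of moving right in each non-terminal state; since moving right is the shortest route to the goal, the actor is expected to favor it everywhere.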

Why do policy gradients use a baseline?

The raw gradient estimate is very noisy. Subtracting a baseline that does not depend on the action, typically a state-value estimate, from the return reduces the variance of the gradient without introducing bias, which makes learning far more stable and efficient. With the state value as the baseline, the weighting term becomes an advantage estimate (return minus value), which is the common choice.
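
The effect of a baseline can be checked empirically. The sketch below draws single-sample gradient estimates for one softmax preference in a two-armed bandit, once with no baseline and once with the average reward as a baseline. The uniform policy, the arm means of 10 and 11, and the function name are assumptions made for illustration; under them the true gradient component is 0.25.

```python
import random
import statistics


def gradient_samples(baseline, n=20000, seed=0):
    """Per-sample REINFORCE estimates of dJ/d(theta_1) for a fixed policy."""
    rng = random.Random(seed)
    p = [0.5, 0.5]                 # fixed uniform softmax policy (assumed)
    means = [10.0, 11.0]           # assumed arm reward means
    samples = []
    for _ in range(n):
        action = 0 if rng.random() < p[0] else 1
        reward = rng.gauss(means[action], 1.0)
        # d log pi(action) / d theta_1 for a softmax policy.
        grad_log = (1.0 if action == 1 else 0.0) - p[1]
        samples.append((reward - baseline) * grad_log)
    return samples


raw = gradient_samples(baseline=0.0)        # no baseline
centered = gradient_samples(baseline=10.5)  # baseline = expected reward

raw_mean, raw_var = statistics.mean(raw), statistics.variance(raw)
ctr_mean, ctr_var = statistics.mean(centered), statistics.variance(centered)
```

Both sample means estimate the same gradient, so the baseline leaves the estimator unbiased, while the variance drops sharply: analytically, from about 27.8 to 0.25 under these assumed numbers.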

Sources

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), 229–256. DOI: 10.1007/BF00992696
Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1057–1063.

How to cite this page

ScholarGate. (2026, June 2). Policy Gradient Methods (REINFORCE / Actor-Critic). ScholarGate. https://scholargate.app/en/machine-learning/policy-gradient

Referenced by

Q-Learning, Reinforcement Learning

Related reference concepts

Policy Gradient Methods, Reinforcement Learning, Value-Based Methods, Deep Reinforcement Learning, Markov Decision Processes, Sequential Decision Making (MDPs)
