Why optimize the policy directly instead of a value function?

Direct policy optimization naturally handles stochastic policies and continuous action spaces, where extracting a policy from a value function is awkward. It also allows smooth, incremental improvement of behavior, which suits control and robotics tasks.

What is an actor-critic method?

An actor-critic method maintains two learned components: an actor, the policy that selects actions, and a critic, a value estimate that judges how good those actions were. The critic's feedback reduces the variance of the policy updates, making learning more stable.

Policy Gradient Methods

Policy gradient methods directly optimize a parameterized policy by ascending the gradient of expected reward, rather than deriving the policy from a value function.

Find Topic with PaperMindSoonFind papers & topics

Tools & resources

Download slides

Learn & explore

VideoSoon

Definition

Policy gradient methods represent the policy as a differentiable function of parameters and update those parameters in the direction that increases expected cumulative reward, estimating the required gradient from sampled trajectories of the agent's interaction with the environment.

Scope

This topic covers reinforcement-learning methods that adjust policy parameters directly: the policy gradient theorem and the REINFORCE algorithm, the use of baselines and advantage estimates to reduce variance, actor-critic methods that combine a learned policy with a learned value function, and modern trust-region and proximal policy optimization. It addresses why direct policy optimization suits continuous actions and stochastic policies.

Core questions

How can a policy be improved directly by gradient ascent?
What does the policy gradient theorem express?
How do baselines and critics reduce the variance of gradient estimates?
Why are policy gradient methods well suited to continuous action spaces?

Key theories

The policy gradient theorem: The gradient of expected reward with respect to policy parameters can be written as an expectation over trajectories, allowing it to be estimated from sampled experience without differentiating the environment.
Actor-critic methods: Combining a policy that is improved by gradient ascent with a learned value function that provides a low-variance critique yields actor-critic methods that learn more stably and efficiently than pure policy gradients.
Policy optimization at scale: Policy-based learning, often combined with value estimation and search, underlies large-scale successes such as the Go-playing systems that mastered the game through self-play.

Clinical relevance

Policy gradient and actor-critic methods are the standard approach for reinforcement learning in continuous control, robotics, and the fine-tuning of large language models from human feedback, because they optimize stochastic policies directly and handle action spaces that value-based methods struggle with.

History

Williams's REINFORCE algorithm in 1992 gave a direct way to estimate policy gradients, and the policy gradient theorem of the late 1990s provided a rigorous foundation. Actor-critic architectures and later trust-region and proximal methods improved stability, making policy optimization central to modern large-scale reinforcement learning.

Key figures

Ronald Williams
Richard Sutton
David Silver

Seminal works

sutton2018
silver2016
williams1992

Frequently asked questions

Why optimize the policy directly instead of a value function?: Direct policy optimization naturally handles stochastic policies and continuous action spaces, where extracting a policy from a value function is awkward. It also allows smooth, incremental improvement of behavior, which suits control and robotics tasks.
What is an actor-critic method?: An actor-critic method maintains two learned components: an actor, the policy that selects actions, and a critic, a value estimate that judges how good those actions were. The critic's feedback reduces the variance of the policy updates, making learning more stable.