How is this different from reinforcement learning?

Sequential decision making with MDPs assumes the transition and reward model is known, so an optimal policy can be computed directly by dynamic programming. Reinforcement learning addresses the case where the model is unknown and the agent must learn a good policy from experience; it uses the MDP as its underlying formalism.

What is a belief state in a POMDP?

In a partially observable MDP the agent cannot see the true state, so it maintains a belief state, a probability distribution over the possible states, updated as it takes actions and receives observations. Planning then takes place over these belief states rather than over the hidden states directly.

Sequential Decision Making (MDPs)

Sequential decision making formalizes how an agent should act over time in a stochastic environment, using Markov decision processes in which actions yield rewards and probabilistically change the state, to compute a policy maximizing long-run expected reward.

Намерете тема с PaperMindСкороFind papers & topics

Tools & resources

Изтегляне на слайдове

Learn & explore

ВидеоСкоро

Definition

A Markov decision process is defined by states, actions, a transition probability function, and a reward function; sequential decision making seeks a policy mapping states to actions that maximizes expected cumulative (typically discounted) reward, given the model.

Scope

This topic covers decision-theoretic planning over time: the Markov decision process (MDP) model of states, actions, transition probabilities, rewards, and discounting; policies and value functions; the Bellman equations characterizing optimal behavior; and the dynamic-programming algorithms of value iteration and policy iteration for solving a known model. It also introduces partially observable MDPs (POMDPs) and belief-state planning. The focus is on planning when the model is given; learning a policy from experience without a known model is reinforcement learning, which belongs to the machine-learning subfield.

Core questions

How is acting over time under stochastic transitions modeled as states, actions, transitions, and rewards?
What does the Bellman optimality equation say about the value of an optimal policy?
How do value iteration and policy iteration compute an optimal policy when the model is known?
How does partial observability lead to POMDPs and planning over belief states?

Key concepts

states, actions, transitions, rewards
policy
value function
discount factor
Bellman equations
value iteration
policy iteration
POMDP and belief state

Key theories

Bellman optimality equation: The optimal value of a state equals the best immediate reward plus the discounted expected optimal value of the next state; this recursive relation characterizes optimal sequential behavior and is the foundation of dynamic-programming solutions.
Value and policy iteration: For a known MDP, value iteration repeatedly applies the Bellman update until convergence, and policy iteration alternates policy evaluation and improvement; both are guaranteed to find an optimal policy.
Partially observable MDPs: When the state is not directly observable, planning is done over a belief state (a distribution over states) updated from observations; solving such POMDPs is far harder than the fully observable case but captures realistic sensing limitations.

Clinical relevance

MDP- and POMDP-based decision making underlies robot navigation and control, automated dialogue management, maintenance and inventory decisions, and resource allocation, and provides the decision-theoretic planning foundation on which reinforcement learning builds when the environment model must instead be learned.

History

Sequential decision making grew from Bellman's dynamic programming (1957) and Howard's policy iteration (1960). Puterman's 1994 monograph consolidated the theory of Markov decision processes, and Kaelbling, Littman, and Cassandra (1998) brought partially observable MDPs into mainstream AI as a model for acting under uncertain perception.

Key figures

Richard Bellman
Ronald A. Howard
Martin L. Puterman
Leslie P. Kaelbling
Michael L. Littman

Seminal works

bellman1957
puterman1994
kaelbling1998

Frequently asked questions

How is this different from reinforcement learning?: Sequential decision making with MDPs assumes the transition and reward model is known, so an optimal policy can be computed directly by dynamic programming. Reinforcement learning addresses the case where the model is unknown and the agent must learn a good policy from experience; it uses the MDP as its underlying formalism.
What is a belief state in a POMDP?: In a partially observable MDP the agent cannot see the true state, so it maintains a belief state, a probability distribution over the possible states, updated as it takes actions and receives observations. Planning then takes place over these belief states rather than over the hidden states directly.