Markov Decision Processes
Markov decision processes formalize sequential decision-making, modeling an agent that chooses actions in states to maximize long-term reward.
Definition
A Markov decision process is a model of sequential decision-making defined by a set of states, available actions, probabilities of transitioning between states given actions, and rewards, in which the goal is to find a policy that maximizes expected cumulative discounted reward.
Scope
This topic covers the mathematical framework underlying reinforcement learning: states, actions, transition probabilities, rewards, and the discount factor; policies and value functions; the Bellman optimality equations; and the dynamic-programming methods of value iteration and policy iteration that solve a known process. It assumes the Markov property that the future depends only on the current state.
Core questions
- What components define a Markov decision process?
- How do the Bellman equations relate the value of a state to its successors?
- How do value iteration and policy iteration find optimal policies?
- What does the Markov property assume about the environment?
Key theories
- Bellman optimality equations
- The value of acting optimally from a state equals the best immediate reward plus the discounted value of the resulting state, a recursive relation whose solution defines the optimal policy.
- Dynamic programming
- When the process is fully known, value iteration and policy iteration compute optimal value functions and policies by repeatedly applying the Bellman update, guaranteeing convergence to the optimum.
- Discounting and return
- Future rewards are weighted by a discount factor so that the total return is well defined and nearer rewards count more, shaping how far ahead the agent effectively plans.
Clinical relevance
Markov decision processes are the conceptual backbone of reinforcement learning and of much of operations research and control, providing the language of states, actions, and value that nearly all learning algorithms approximate when the model is unknown or too large to solve exactly.
History
The framework arose from Bellman's dynamic programming in the 1950s and Howard's policy-iteration work, providing exact solution methods for known decision processes. Reinforcement learning later adopted the Markov decision process as its standard formalism for the case where transitions and rewards must be learned from experience.
Key figures
- Richard Bellman
- Ronald Howard
- Richard Sutton
Related topics
Seminal works
- sutton2018
- bellman1957
- puterman1994
Frequently asked questions
- What is the Markov property?
- The Markov property says that the future evolution of the process depends only on the current state and action, not on the full history of how the agent got there. This makes the current state a sufficient summary for decision-making.
- Why is a discount factor used?
- Discounting weights nearer rewards more heavily than distant ones. It keeps the total return finite over long or infinite horizons and encodes a preference for sooner reward, while also controlling how far into the future the agent effectively plans.