
Policy Gradient Methods

Policy gradient methods are reinforcement learning algorithms that, rather than learning action values and acting greedily, directly optimize a parameterized policy by gradient ascent on the expected return. Building on Ronald Williams's 1992 REINFORCE algorithm and the policy gradient theorem of Sutton et al. (2000), these methods handle stochastic policies and continuous action spaces naturally and form the foundation of modern actor-critic and deep reinforcement learning (deep-RL) algorithms.
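As an illustration of gradient ascent on the expected return, the following is a minimal REINFORCE-style sketch, not a reference implementation. It assumes the simplest possible setting: a multi-armed bandit with a softmax policy holding one logit per arm. The names `reinforce_bandit`, `arm_means`, `reward_std`, and `lr`, as well as the running-average baseline used to reduce variance, are illustrative choices and do not come from the cited sources.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()


def reinforce_bandit(arm_means, steps=2000, lr=0.1, reward_std=1.0, seed=0):
    """REINFORCE-style update on a bandit (hypothetical toy setup).

    The policy is pi(a) = softmax(theta)[a]. Each step samples an action,
    observes a noisy reward, and moves theta along
    (reward - baseline) * grad log pi(action).
    """
    arm_means = np.asarray(arm_means, dtype=float)
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(arm_means))  # policy parameters, one logit per arm
    baseline = 0.0                    # running average of observed rewards

    for t in range(1, steps + 1):
        probs = softmax(theta)
        action = rng.choice(len(theta), p=probs)
        reward = arm_means[action] + rng.normal(0.0, reward_std)

        # For a softmax policy, grad_theta log pi(action) = one_hot(action) - probs.
        grad_log_pi = -probs
        grad_log_pi[action] += 1.0

        # Gradient ascent step on the sampled estimate of the expected return.
        theta += lr * (reward - baseline) * grad_log_pi

        # Update the baseline after the step so it does not depend on this reward.
        baseline += (reward - baseline) / t

    return softmax(theta)


if __name__ == "__main__":
    # Arm 1 has the higher mean reward, so its probability should grow.
    print(reinforce_bandit([0.0, 1.0]))
```

Note that no action values are learned and no greedy maximization is performed here: the policy parameters themselves are adjusted, and the policy stays stochastic throughout training.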


Sources

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), 229–256. DOI: 10.1007/BF00992696
Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1057–1063.

How to cite this page

ScholarGate. (2026, June 2). Policy Gradient Methods (REINFORCE / Actor-Critic). ScholarGate. https://scholargate.app/ko/machine-learning/policy-gradient


