Policy Gradient (PG) is a model-free deep reinforcement learning (DRL) method that optimizes the policy directly by learning a probability distribution over actions. Unlike value-based methods such as Q-learning, which derive a policy from a learned value function, PG maximizes the expected cumulative reward by performing gradient ascent on the policy parameters themselves.
Core Concepts
- Policy Function: Maps states to action probabilities (e.g., π(a|s)).
- Gradient Ascent: Adjusts policy parameters to increase reward by computing gradients of the expected return.
- Exploration vs. Exploitation: Balances trying new actions against exploiting known good ones, commonly encouraged via an entropy regularization term (all three concepts appear in the sketch below).
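To make these concepts concrete, here is a minimal sketch, assuming PyTorch, of a softmax policy π(a|s) and a single gradient-ascent step on the log-probability-weighted return, with an entropy bonus for exploration. The network sizes, the batch of states, and the returns are placeholder assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Hypothetical dimensions and placeholder data (not from a specific benchmark).
obs_dim, n_actions = 4, 2
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

states = torch.randn(32, obs_dim)              # batch of states (placeholder)
actions = torch.randint(0, n_actions, (32,))   # actions taken (placeholder)
returns = torch.randn(32)                      # discounted returns G_t (placeholder)

dist = Categorical(logits=policy(states))      # pi(a|s) as a categorical distribution
log_probs = dist.log_prob(actions)             # log pi(a_t|s_t)
entropy = dist.entropy().mean()                # entropy bonus encourages exploration

# Gradient ascent on E[log pi(a|s) * G] is done by minimizing its negative.
loss = -(log_probs * returns).mean() - 0.01 * entropy
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Minimizing the negated objective with a standard optimizer is equivalent to gradient ascent on the expected return; the entropy coefficient (0.01 here) is an illustrative choice.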
Key Advantages
- Direct Policy Optimization: Avoids the need for value function estimation.
- Continuous Action Spaces: Handles non-discrete actions such as robotic control (see the Gaussian-policy sketch after this list).
- Stochastic Policies: Naturally incorporate exploration by sampling actions from a learned distribution.
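As an illustration of the continuous-action and stochastic-policy points, the following sketch (again assuming PyTorch) models π(a|s) as a Gaussian whose mean comes from a small network. The observation and action dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

# Sketch of a stochastic Gaussian policy for a continuous action space
# (e.g., joint torques). Dimensions are illustrative assumptions.
obs_dim, act_dim = 8, 2
mean_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
log_std = nn.Parameter(torch.zeros(act_dim))   # state-independent log std

state = torch.randn(1, obs_dim)                # placeholder state
dist = Normal(mean_net(state), log_std.exp())  # pi(a|s) = N(mu(s), sigma^2)
action = dist.sample()                         # stochastic action -> built-in exploration
log_prob = dist.log_prob(action).sum(-1)       # plugs into the policy-gradient loss
```

Because actions are sampled rather than chosen greedily, exploration falls out of the policy itself; no separate epsilon-greedy schedule is needed.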
Common Algorithms
- REINFORCE (see the sketch after this list)
- Actor-Critic
- PPO (Proximal Policy Optimization)
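Below is a compact REINFORCE sketch, assuming PyTorch and the Gymnasium CartPole-v1 environment are installed. It collects one episode, computes discounted returns, and takes one gradient-ascent step on Σ_t log π(a_t|s_t) · G_t. The hyperparameters (learning rate, discount factor, episode count) are illustrative choices, not prescribed values.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical
import gymnasium as gym   # assumes the gymnasium package is installed

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(200):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        dist = Categorical(logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted return G_t for every step, computed backwards from the end.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # variance reduction

    # Gradient ascent on sum_t log pi(a_t|s_t) * G_t via the negated loss.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Actor-Critic and PPO build on this loop by replacing the raw return with a learned baseline or advantage estimate and, in PPO's case, clipping the policy update.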
For deeper insights into Actor-Critic methods, check our tutorial: /en/tutorials/rl/actor_critic