Policy Gradient (PG) is a family of model-free deep reinforcement learning (DRL) methods that optimize the policy directly: the policy is represented as a parameterized probability distribution over actions, and its parameters are adjusted by gradient ascent to maximize the expected cumulative reward. Value-based methods such as Q-learning, by contrast, derive a policy indirectly from learned value estimates.
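
Formally, if π_θ(a|s) is the policy with parameters θ and J(θ) is the expected return, the policy gradient theorem gives an estimator that can be computed from sampled trajectories:

    ∇_θ J(θ) = E_{τ∼π_θ} [ Σ_t ∇_θ log π_θ(a_t|s_t) · G_t ]

where G_t = Σ_{k≥t} γ^(k−t) r_k is the discounted return from timestep t onward. The algorithms listed below all build on this estimator.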

Core Concepts

  • Policy Function: Maps states to action probabilities (e.g., π(a|s)).
  • Gradient Ascent: Adjusts policy parameters to increase reward by computing gradients of the expected return.
  • Exploration vs. Exploitation: Balances trying new actions against exploiting known good ones; entropy regularization is a common way to keep the policy stochastic enough to explore (see the sketch after this list).
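
To make these three concepts concrete, here is a minimal PyTorch sketch of a softmax policy trained by gradient ascent with an entropy bonus. The network sizes, placeholder data, and entropy coefficient are illustrative assumptions, not values from this article.

```python
import torch
import torch.nn as nn

# Policy function: a small network mapping states to action logits,
# turned into a distribution π(a|s). Sizes (4 obs dims, 2 actions) are made up.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

obs = torch.randn(8, 4)      # batch of states (placeholder data)
returns = torch.randn(8)     # returns G_t for each sample (placeholder data)

dist = torch.distributions.Categorical(logits=policy(obs))  # π(a|s)
actions = dist.sample()      # sampling the stochastic policy = exploration

# Gradient ascent on E[log π(a|s) · G] is done by descending its negative;
# the entropy bonus discourages the policy from becoming deterministic too soon.
loss = -(dist.log_prob(actions) * returns).mean() - 0.01 * dist.entropy().mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```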

Key Advantages

  • Direct Policy Optimization: Avoids the need for value function estimation.
  • Continuous Action Spaces: Handles non-discrete actions (e.g., robotic control) by parameterizing a continuous distribution such as a Gaussian (see the sketch after this list).
  • Stochastic Policies: Naturally incorporates exploration.
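
For the continuous-action case, one common choice is a Gaussian policy whose mean comes from a network and whose standard deviation is a learned parameter. The dimensions below are made up for illustration.

```python
import torch
import torch.nn as nn

# Gaussian policy sketch for continuous actions (e.g., joint torques).
mean_net = nn.Linear(3, 1)              # state (3-dim) -> action mean (1-dim)
log_std = nn.Parameter(torch.zeros(1))  # learned, state-independent log std

state = torch.randn(5, 3)                          # placeholder states
dist = torch.distributions.Normal(mean_net(state), log_std.exp())
action = dist.sample()                             # real-valued actions
log_prob = dist.log_prob(action).sum(dim=-1)       # plugs into the same log π(a|s) · G update
```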

Common Algorithms

  1. REINFORCE (a minimal training loop is sketched after this list)
  2. Actor-Critic
  3. PPO (Proximal Policy Optimization)
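
As a reference point, here is a compact REINFORCE training loop. It is a sketch only: it assumes the gymnasium package with the CartPole-v1 environment, and the hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import gymnasium as gym  # assumed dependency for this sketch

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(200):
    obs, _ = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=policy(obs_t))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Monte Carlo returns G_t, accumulated backwards through the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # variance reduction

    # One gradient-ascent step on Σ_t log π(a_t|s_t) · G_t.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Actor-Critic replaces the Monte Carlo return G_t with a learned value baseline (an advantage estimate), and PPO additionally clips each policy update to keep it close to the previous policy.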

For deeper insights into Actor-Critic methods, check our tutorial: /en/tutorials/rl/actor_critic