Proximal Policy Optimization (PPO) Algorithm


A Proximal Policy Optimization (PPO) Algorithm is a model-free reinforcement learning algorithm that is a policy-based algorithm (one that learns a policy function mapping the available states to actions, rather than a value function).



References

2023

INPUT:
  - Initial policy parameters θ
  - Environment Env with states s and actions a
  - Number of iterations N
  - Number of timesteps T per trajectory
  - Number of optimization epochs E
  - Minibatch size M
  - Clipping parameter ε
  - Learning rate α

OUTPUT:
  - Optimized policy parameters θ

ALGORITHM:

θ_old = θ  # initialize the old (behavior) policy used in the ratio below
FOR i = 1 to N DO:
    # Collect trajectories by running the current policy in the environment
    FOR t = 1 to T DO:
        a_t ∼ π_θ(·|s_t)                # sample action from the current policy (equal to π_{θ_old} at collection time)
        (s_{t+1}, r_t) = Env.step(a_t)  # execute the action in the environment
    # Calculate advantage estimates for the collected trajectory
    A_t = CALCULATE_ADVANTAGES(trajectory)
    FOR e = 1 to E DO:
        # Shuffle the trajectory and split it into minibatches of size M
        MINIBATCHES = CREATE_MINIBATCHES(trajectory, M)
        FOR each minibatch in MINIBATCHES DO:
            # Probability ratio between the new and old policies
            r(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t)
            # Clipped surrogate objective
            L(θ) = MIN(r(θ) * A_t, CLIP(r(θ), 1-ε, 1+ε) * A_t)
            # Policy update: gradient ascent on L(θ)
            θ = θ + α * GRADIENT_OF(L(θ), θ)
            # Value function update (optional, when a learned critic is used)
            UPDATE_VALUE_FUNCTION(minibatch)
    # The updated parameters become the old policy for the next iteration
    θ_old = θ
END

RETURN θ
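
The clipped surrogate step above can be illustrated with a short sketch (a minimal illustration in PyTorch, not a reference implementation; the tensor names logp_new, logp_old, and advantages are assumed placeholders for quantities computed from a collected minibatch):

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate for one minibatch.

    logp_new   : log π_θ(a_t|s_t) under the current policy (requires grad)
    logp_old   : log π_{θ_old}(a_t|s_t) recorded at collection time (no grad)
    advantages : advantage estimates A_t for the same transitions
    """
    # Probability ratio r(θ) = π_θ(a|s) / π_{θ_old}(a|s), computed in log space
    ratio = torch.exp(logp_new - logp_old)
    # Clipped surrogate: take the pessimistic (minimum) of the unclipped and
    # clipped terms, then negate so that a minimizer performs gradient ascent
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Example usage inside the epoch/minibatch loops of the pseudocode:
#   loss = ppo_clip_loss(logp_new, logp_old.detach(), advantages.detach())
#   optimizer.zero_grad(); loss.backward(); optimizer.step()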

2022

  • https://chat.openai.com/chat
    • QUESTION: How does PPO differ from Q-learning?
    • ANSWER: Proximal Policy Optimization (PPO) and Q-learning are both reinforcement learning algorithms, but they work in quite different ways.

      Q-learning is a value-based reinforcement learning algorithm. It involves learning a value function that estimates the expected reward for an agent taking a given action in a given state. The goal of Q-learning is to find the optimal action-value function, which is the function that maximizes the expected reward for the agent. To do this, the Q-learning algorithm updates the action-value function based on the difference between the expected reward and the actual reward that the agent receives after taking an action. This process is repeated until the action-value function converges to the optimal solution.

      PPO, on the other hand, is a policy-based reinforcement learning algorithm. It involves learning a policy function that determines the actions that an agent should take in a given state. The goal of PPO is to find the optimal policy, which is the policy that maximizes the expected reward for the agent. To do this, PPO uses a combination of gradient ascent and importance sampling to update the policy function. The policy function is updated based on the difference between the expected reward and the actual reward that the agent receives after taking an action, weighted by the probability of the action under the current policy. This process is repeated until the policy function converges to the optimal solution.

      In summary, Q-learning is a value-based algorithm that learns a value function to determine the optimal action, while PPO is a policy-based algorithm that learns a policy function to determine the optimal action.
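
The value-based versus policy-based distinction can be made concrete with a small sketch (an illustrative tabular Q-learning update; the array Q and the state/action indices are assumed for the example):

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Value-based update: move Q(s, a) toward the bootstrapped TD target.
    The policy is implicit (e.g. greedy or ε-greedy with respect to Q)."""
    td_target = r + gamma * np.max(Q[s_next])   # best estimated value of the next state
    Q[s, a] += alpha * (td_target - Q[s, a])    # temporal-difference step toward the target
    return Q

PPO, by contrast, keeps no action-value table: it directly adjusts the parameters θ of the policy π_θ by gradient ascent on the clipped surrogate objective shown in the pseudocode above.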

2020

Algorithm | Description                  | Model      | Policy    | Action Space | State Space | Operator
PPO       | Proximal Policy Optimization | Model-Free | On-policy | Continuous   | Continuous  | Advantage

2017

  • https://openai.com/blog/openai-baselines-ppo/
    • QUOTE: We’re releasing a new class of reinforcement learning algorithms, Proximal Policy Optimization (PPO), which perform comparably or better than state-of-the-art approaches while being much simpler to implement and tune. PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance.

2017

  • (Schulman et al., 2017) ⇒ John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. (2017). “Proximal Policy Optimization Algorithms.” arXiv preprint arXiv:1707.06347
    • ABSTRACT: We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
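
The "surrogate" objective described in this abstract is the clipped objective from the paper; in the paper's notation, with r_t(θ) the probability ratio and Â_t the estimated advantage:

  L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \Big[ \min\big( r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \big) \Big],
  \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}.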