
Proximal Policy Optimization (PPO)

Overview

PPO is one of the most popular DRL algorithms. It runs reasonably fast by leveraging vector (parallel) environments and naturally works well with different action spaces, therefore supporting a variety of games. It also has good sample efficiency compared to algorithms such as DQN.

Original paper:

  • Proximal Policy Optimization Algorithms (Schulman et al., 2017)

Reference resources:

  • The 37 Implementation Details of Proximal Policy Optimization (Huang et al., 2022)1

All our PPO implementations below are augmented with the same code-level optimizations presented in openai/baselines' PPO. To see how we matched those implementation details, check out our blog post The 37 Implementation Details of Proximal Policy Optimization.

Implemented Variants

Variants Implemented | Description
ppo.py, docs | For classic control tasks like CartPole-v1.
ppo_atari.py, docs | For playing Atari games. It uses convolutional layers and common Atari-based pre-processing techniques.
ppo_continuous_action.py, docs | For continuous action spaces. Also implements MuJoCo-specific code-level optimizations.

Below are our single-file implementations of PPO:

ppo.py

ppo.py has the following features:

  • Works with the Box observation space of low-level features
  • Works with the Discrete action space
  • Works with envs like CartPole-v1
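
Taken together, these features suggest a small MLP actor-critic over low-dimensional observations with a Categorical policy head. The following is a minimal, hypothetical sketch of such an agent (AgentSketch and layer_init are illustrative names, not the exact ppo.py code):

```python
# Minimal, hypothetical sketch of an MLP actor-critic for a Box observation
# space and a Discrete action space (e.g., CartPole-v1). Not the exact ppo.py code.
import numpy as np
import torch
import torch.nn as nn
from torch.distributions import Categorical


def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    # Orthogonal weight init + constant bias init (implementation detail 2 below).
    nn.init.orthogonal_(layer.weight, std)
    nn.init.constant_(layer.bias, bias_const)
    return layer


class AgentSketch(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.critic = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 1), std=1.0),
        )
        self.actor = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, n_actions), std=0.01),
        )

    def get_action_and_value(self, obs, action=None):
        logits = self.actor(obs)
        dist = Categorical(logits=logits)
        if action is None:
            action = dist.sample()
        return action, dist.log_prob(action), dist.entropy(), self.critic(obs)


# CartPole-v1 has a 4-dimensional observation and 2 discrete actions.
agent = AgentSketch(obs_dim=4, n_actions=2)
```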

Usage

poetry install
python cleanrl/ppo.py --help
python cleanrl/ppo.py --env-id CartPole-v1

Implementation details

ppo.py is based on the "13 core implementation details" in The 37 Implementation Details of Proximal Policy Optimization, which are as follows (a minimal code sketch of several of these details appears after the list):

  1. Vectorized architecture (common/cmd_util.py#L22)
  2. Orthogonal Initialization of Weights and Constant Initialization of Biases (a2c/utils.py#L58)
  3. The Adam Optimizer's Epsilon Parameter (ppo2/model.py#L100)
  4. Adam Learning Rate Annealing (ppo2/ppo2.py#L133-L135)
  5. Generalized Advantage Estimation (ppo2/runner.py#L56-L65)
  6. Mini-batch Updates (ppo2/ppo2.py#L157-L166)
  7. Normalization of Advantages (ppo2/model.py#L139)
  8. Clipped Surrogate Objective (ppo2/model.py#L81-L86)
  9. Value Function Loss Clipping (ppo2/model.py#L68-L75)
  10. Overall Loss and Entropy Bonus (ppo2/model.py#L91)
  11. Global Gradient Clipping (ppo2/model.py#L102-L108)
  12. Debug Variables (ppo2/model.py#L115-L116)
  13. Separate MLP Networks for Policy and Value Functions (common/policies.py#L156-L160, baselines/common/models.py#L75-L103)
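
The sketch below illustrates details 5, 7, 8, 9, and 10: GAE, advantage normalization, the clipped surrogate objective, value-function loss clipping, and the combined loss with an entropy bonus. It is a minimal, hypothetical rendering over flat mini-batch tensors; the function names are illustrative and this is not the exact ppo.py code.

```python
# Hypothetical sketch of implementation details 5 and 7-10; not the exact ppo.py code.
import torch


def compute_gae(rewards, values, dones, next_value, next_done,
                gamma=0.99, gae_lambda=0.95):
    # rewards, values, dones: tensors of shape (num_steps,) for one rollout;
    # next_value, next_done: bootstrap value and done flag for the step after it.
    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    lastgaelam = 0.0
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            nextnonterminal, nextvalue = 1.0 - next_done, next_value
        else:
            nextnonterminal, nextvalue = 1.0 - dones[t + 1], values[t + 1]
        delta = rewards[t] + gamma * nextvalue * nextnonterminal - values[t]
        lastgaelam = delta + gamma * gae_lambda * nextnonterminal * lastgaelam
        advantages[t] = lastgaelam
    returns = advantages + values  # targets for the value function
    return advantages, returns


def ppo_loss(new_logprob, old_logprob, entropy, new_value, old_value,
             returns, advantages, clip_coef=0.2, ent_coef=0.01, vf_coef=0.5):
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)  # detail 7
    ratio = (new_logprob - old_logprob).exp()
    pg_loss1 = -advantages * ratio
    pg_loss2 = -advantages * torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef)
    pg_loss = torch.max(pg_loss1, pg_loss2).mean()  # detail 8: clipped surrogate objective
    v_loss_unclipped = (new_value - returns) ** 2
    v_clipped = old_value + torch.clamp(new_value - old_value, -clip_coef, clip_coef)
    v_loss = 0.5 * torch.max(v_loss_unclipped, (v_clipped - returns) ** 2).mean()  # detail 9
    return pg_loss - ent_coef * entropy.mean() + vf_coef * v_loss  # detail 10
```

Taking torch.max over the two negated policy terms is equivalent to taking the minimum of the clipped and unclipped surrogate objectives, as in the PPO paper.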

Experiment results

PR vwxyzjn/cleanrl#120 tracks our effort to conduct experiments, and the reproduction instructions can be found at vwxyzjn/cleanrl/benchmark/ppo.

Below are the average episodic returns for ppo.py. To ensure the quality of the implementation, we compared the results against openai/baselines' PPO.

Environment | ppo.py | openai/baselines' PPO (Huang et al., 2022)1
CartPole-v1 | 492.40 ± 13.05 | 497.54 ± 4.02
Acrobot-v1 | -89.93 ± 6.34 | -81.82 ± 5.58
MountainCar-v0 | -200.00 ± 0.00 | -200.00 ± 0.00

Learning curves:

Tracked experiments and game play videos:

Video tutorial

If you'd like to learn ppo.py in-depth, consider checking out the following video tutorial:

ppo_atari.py

ppo_atari.py has the following features:

  • For playing Atari games. It uses convolutional layers and common Atari-based pre-processing techniques.
  • Works with Atari's pixel Box observation space of shape (210, 160, 3)
  • Works with the Discrete action space
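
To make "common Atari-based pre-processing" concrete, the following is a minimal, hypothetical sketch of the wrapper stack such pre-processing typically involves. It assumes gym plus the Atari wrappers re-exported by stable-baselines3, and it is not the exact ppo_atari.py code.

```python
# Hypothetical sketch of an Atari pre-processing stack; assumes gym plus the
# Atari wrappers from stable-baselines3. The actual ppo_atari.py may differ.
import gym
from stable_baselines3.common.atari_wrappers import (
    ClipRewardEnv, EpisodicLifeEnv, FireResetEnv, MaxAndSkipEnv, NoopResetEnv,
)


def make_atari_env(env_id="BreakoutNoFrameskip-v4"):
    env = gym.make(env_id)
    env = gym.wrappers.RecordEpisodeStatistics(env)  # log episodic returns
    env = NoopResetEnv(env, noop_max=30)             # random number of no-ops on reset
    env = MaxAndSkipEnv(env, skip=4)                 # frame skipping + max-pooling
    env = EpisodicLifeEnv(env)                       # treat a life loss as episode end
    if "FIRE" in env.unwrapped.get_action_meanings():
        env = FireResetEnv(env)                      # press FIRE to start the game
    env = ClipRewardEnv(env)                         # clip rewards to {-1, 0, +1}
    env = gym.wrappers.ResizeObservation(env, (84, 84))
    env = gym.wrappers.GrayScaleObservation(env)
    env = gym.wrappers.FrameStack(env, 4)            # stack the last 4 frames
    return env
```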

Usage

poetry install -E atari
python cleanrl/ppo_atari.py --help
python cleanrl/ppo_atari.py --env-id BreakoutNoFrameskip-v4

Implementation details

ppo_atari.py is based on the "9 Atari-specific implementation details" in The 37 Implementation Details of Proximal Policy Optimization, which are as follows (a minimal sketch of the shared network appears after the list):

  1. The Use of NoopResetEnv (common/atari_wrappers.py#L12)
  2. The Use of MaxAndSkipEnv (common/atari_wrappers.py#L97)
  3. The Use of EpisodicLifeEnv (common/atari_wrappers.py#L61)
  4. The Use of FireResetEnv (common/atari_wrappers.py#L41)
  5. The Use of WarpFrame (Image transformation) (common/atari_wrappers.py#L134)
  6. The Use of ClipRewardEnv (common/atari_wrappers.py#L125)
  7. The Use of FrameStack (common/atari_wrappers.py#L188)
  8. Shared Nature-CNN network for the policy and value functions (common/policies.py#L157, common/models.py#L15-L26)
  9. Scaling the Images to Range [0, 1] (common/models.py#L19)
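
The sketch below illustrates details 8 and 9: a shared Nature-CNN torso feeding both a policy head and a value head, with pixel observations scaled from [0, 255] to [0, 1] inside the forward pass. It is a minimal, hypothetical rendering, not the exact ppo_atari.py code.

```python
# Hypothetical sketch of details 8 and 9: shared Nature-CNN torso plus separate
# policy and value heads, with image scaling. Not the exact ppo_atari.py code.
import torch
import torch.nn as nn
from torch.distributions import Categorical


class AtariAgentSketch(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        self.network = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        self.actor = nn.Linear(512, n_actions)   # policy head
        self.critic = nn.Linear(512, 1)          # value head

    def get_action_and_value(self, obs, action=None):
        # obs: frames of shape (batch, 4, 84, 84); detail 9 scales them to [0, 1].
        hidden = self.network(obs / 255.0)
        dist = Categorical(logits=self.actor(hidden))
        if action is None:
            action = dist.sample()
        return action, dist.log_prob(action), dist.entropy(), self.critic(hidden)
```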

Experiment results

PR vwxyzjn/cleanrl#120 tracks our effort to conduct experiments, and the reproduction instructions can be found at vwxyzjn/cleanrl/benchmark/ppo.

Below are the average episodic returns for ppo_atari.py. To ensure the quality of the implementation, we compared the results against openai/baselines' PPO.

Environment | ppo_atari.py | openai/baselines' PPO
BreakoutNoFrameskip-v4 | 416.31 ± 43.92 | 406.57 ± 31.554
PongNoFrameskip-v4 | 20.59 ± 0.35 | 20.512 ± 0.50
BeamRiderNoFrameskip-v4 | 2445.38 ± 528.91 | 2642.97 ± 670.37

Learning curves:

Tracked experiments and game play videos:

Video tutorial

If you'd like to learn ppo_atari.py in-depth, consider checking out the following video tutorial:

ppo_continuous_action.py

ppo_continuous_action.py has the following features:

  • For continuous action spaces. Also implements MuJoCo-specific code-level optimizations
  • Works with the Box observation space of low-level features
  • Works with the Box (continuous) action space
  • Includes the 8 implementation details as shown in the following video tutorial
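
A Box action space is typically handled in this style of PPO with a diagonal Gaussian policy: the actor outputs the mean of a Normal distribution, and a state-independent log standard deviation is learned as a free parameter. Below is a minimal, hypothetical sketch (ContinuousActorSketch is an illustrative name, not the exact ppo_continuous_action.py code):

```python
# Hypothetical sketch of a diagonal Gaussian policy for a Box action space;
# not the exact ppo_continuous_action.py code.
import torch
import torch.nn as nn
from torch.distributions import Normal


class ContinuousActorSketch(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.actor_mean = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )
        # State-independent log standard deviation, learned as a free parameter.
        self.actor_logstd = nn.Parameter(torch.zeros(1, act_dim))

    def get_action(self, obs, action=None):
        mean = self.actor_mean(obs)
        std = self.actor_logstd.expand_as(mean).exp()
        dist = Normal(mean, std)
        if action is None:
            action = dist.sample()
        # Sum log-probabilities and entropies over the action dimensions.
        return action, dist.log_prob(action).sum(-1), dist.entropy().sum(-1)
```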

  1. Huang, Shengyi; Dossa, Rousslan Fernand Julien; Raffin, Antonin; Kanervisto, Anssi; Wang, Weixun (2022). The 37 Implementation Details of Proximal Policy Optimization. ICLR 2022 Blog Track. https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
