PPO code experiment
This is a simple experiment to test the PPO algorithm on OpenAI Gym environments.
All the code comes from the repository Link1, which is itself inspired by the blog post 37 PPO Tricks.
Here I break down the code structure and note what I have learned from the code.
Code Frame
- argparse, set the hyperparameters
- tensorboard, log the training process
- evaluate policy, test the policy
- train policy, train the policy
- Actor Module
  - its distribution
  - network structure
- Critic Module
  - network structure
- PPO Agent
  - evaluate, return the deterministic action (the mean)
  - choose_action, return the sampled action and its log probability
  - update, update the policy and value networks once the replay buffer is full. The same batch can be reused for several updates, because the importance-sampling (IS) ratio corrects for the gap between the current policy and the policy that collected the data.
- Replay Buffer
  - init, allocate numpy arrays to store the data
  - store, store one transition
  - numpy_to_tensor, convert the buffer to tensors. Once the buffer is full, "agent.update" converts all the data in the buffer to tensors and then updates the policy and value networks.
- RunningMeanStd, compute the mean and std of the input data incrementally as new samples arrive.
- Normalization, a class that normalizes data to zero mean and unit variance.
- RewardScaling, a trick that scales the reward to a proper range (it simply divides the reward by a running std).
Trick explanation
In the following, I'll explain 12 tricks used in the PPO algorithm. Some of them clearly work, but others are still unclear to me: sometimes they help, and sometimes they even reduce performance.
1. Use GAE to estimate the advantage
GAE (Generalized Advantage Estimation) makes the advantage estimate more stable by trading off bias against variance through the parameter λ.
For more details, see my other blog post, RL_toolbox.
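As a rough illustration of the recursion behind GAE, here is a minimal sketch in plain Python. The function name `compute_gae`, its arguments, and the single `dones` flag are my own assumptions for illustration (the original code may separate "terminal" from "episode truncated"); it is not the repository's implementation.

```python
def compute_gae(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
    """Return (advantages, value_targets) for one rollout using GAE.

    Assumption: dones[t] is 1.0 when step t ended the episode, else 0.0.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    # Walk backwards so each step can reuse the advantage accumulated after it.
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        # One-step TD error; do not bootstrap from a terminal state.
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        # Exponentially weighted sum of TD errors, cut at episode boundaries.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    # Regression targets for the critic: advantage plus the old value estimate.
    value_targets = [a + v for a, v in zip(advantages, values)]
    return advantages, value_targets
```

With `lam=0` this reduces to the one-step TD error, and with `lam=1` it becomes the discounted return minus the value baseline.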
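2. Use the clipped IS ratio to update the policy
Clipping the importance-sampling (IS) ratio keeps each update close to the policy that collected the data, which makes the update more stable and avoids large variance.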
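To make the clipping concrete, here is a minimal per-sample sketch of the clipped surrogate objective. The function name and the default `epsilon=0.2` are assumptions chosen for illustration; a real implementation would work on batched tensors.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, epsilon=0.2):
    """Clipped PPO objective for one sample (to be maximized)."""
    # IS ratio pi_new(a|s) / pi_old(a|s), computed from log probabilities.
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes any gain from pushing the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)
```

For a positive advantage the objective stops growing once the ratio exceeds `1 + epsilon`; for a negative advantage it stops improving once the ratio drops below `1 - epsilon`.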
3. Advantage Normalization
After we calculate all the advantages in a batch via GAE, we normalize them to zero mean and unit variance.
It indeed plays an important role in the training process. Training can hardly be completed without this trick!
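The normalization itself is a one-liner. The sketch below uses the population standard deviation and a small `eps` to avoid division by zero; both choices, and the function name, are assumptions for illustration.

```python
import statistics


def normalize_advantages(advantages, eps=1e-5):
    """Shift and scale a batch of advantages to zero mean and (almost) unit variance."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)  # population std; an assumption of this sketch
    return [(a - mean) / (std + eps) for a in advantages]
```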
4. State Normalization
The core of state normalization is to maintain a running mean and std of the states, and to use them to normalize each state to zero mean and unit variance. Note that the mean and std must be updated dynamically as new states arrive.
With normalized state inputs, the policy network can be trained more efficiently.
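A running mean and std can be maintained with Welford's online algorithm. The sketch below handles scalar inputs only to stay short; for a state vector the same update would be applied to each dimension. The class and method names mirror the ones listed in the code frame, but the bodies are my own assumption, not the repository's code.

```python
import math


class RunningMeanStd:
    """Incrementally track the mean and population std of a stream of scalars."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.s = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        old_mean = self.mean
        self.mean = old_mean + (x - old_mean) / self.n
        self.s += (x - old_mean) * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self.s / self.n) if self.n > 0 else 0.0


class Normalization:
    """Normalize inputs with the statistics collected so far."""

    def __init__(self):
        self.running = RunningMeanStd()

    def __call__(self, x, update=True):
        if update:  # pass update=False at evaluation time to freeze the statistics
            self.running.update(x)
        return (x - self.running.mean) / (self.running.std + 1e-8)
```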
5. Reward Scaling
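- From ChatGPT:
The reward scaling technique proposed in the paper "Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO" dynamically computes the standard deviation of a rolling discounted sum of rewards and divides the current reward by that standard deviation. It has the following advantages:
Improved learning stability: by standardizing the reward, reward scaling helps keep the learning process numerically stable. Scaled rewards usually fall within a relatively small and more uniform range, which reduces instability when the learning algorithm faces extreme rewards or rewards of different magnitudes.
Adaptive adjustment: unlike static reward scaling, reward scaling adjusts the scale dynamically from the standard deviation computed in real time. It can therefore adapt to changes in the reward distribution across environments and tasks, without manual tuning of a scaling factor.
Easy to implement and integrate: as a preprocessing step, reward scaling is relatively simple and easy to add to existing reinforcement learning frameworks and algorithms. This makes it a low-cost and effective way to improve the performance of deep policy gradient methods.
Overall, reward scaling is an effective technique that, by dynamically adjusting how rewards are scaled, improves training stability, speeds up convergence, and makes training more adaptive and efficient.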
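The description above can be sketched as follows: keep a rolling discounted sum of rewards, track its running std, and divide each incoming reward by that std. The class layout, the scalar-only statistics, and the choice to return the first reward unscaled (the std is still zero at that point) are assumptions of this sketch.

```python
import math


class RewardScaling:
    """Divide each reward by the running std of the rolling discounted return."""

    def __init__(self, gamma=0.99):
        self.gamma = gamma
        self.discounted_return = 0.0
        self.n = 0
        self.mean = 0.0
        self.s = 0.0  # running sum of squared deviations (Welford)

    def __call__(self, reward):
        # Rolling discounted sum of the rewards seen in this episode.
        self.discounted_return = self.gamma * self.discounted_return + reward
        self.n += 1
        old_mean = self.mean
        self.mean += (self.discounted_return - old_mean) / self.n
        self.s += (self.discounted_return - old_mean) * (self.discounted_return - self.mean)
        if self.n < 2:
            return reward  # assumption: no scaling until a std is available
        std = math.sqrt(self.s / self.n)
        # Only divide; the mean is not subtracted, so the reward's sign is preserved.
        return reward / (std + 1e-8)

    def reset(self):
        """Call at the start of every episode; the statistics are kept."""
        self.discounted_return = 0.0
```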
6. Policy Entropy
We use entropy to measure the uncertainty of the policy. An entropy bonus encourages the policy to explore more in the environment.
In our code, once the actor network gives us the action distribution (one component per action dimension), we compute the entropy directly from that distribution.
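The original snippet did not survive here, so the following is only a hedged sketch of the idea for a diagonal Gaussian policy. The function names and the coefficient `entropy_coef=0.01` are assumptions. In PyTorch, `torch.distributions.Normal(mean, std).entropy()` returns the same per-dimension quantity, which is then summed over the action dimensions.

```python
import math


def gaussian_entropy(stds):
    """Entropy of a diagonal Gaussian, summed over the action dimensions."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s * s) for s in stds)


def actor_loss_with_entropy(surrogate, entropy, entropy_coef=0.01):
    """Loss to minimize: negative of (surrogate objective + entropy bonus)."""
    return -(surrogate + entropy_coef * entropy)
```

A wider distribution (larger std) has higher entropy, so the bonus pushes against the policy collapsing to a near-deterministic one too early.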
7. Learning Rate Decay
Learning rate decay can improve stability in the later stages of training to a certain extent. Here we use linear learning rate decay.
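The original snippet is missing, so here is a minimal sketch of a linear schedule. The function and argument names are hypothetical; in a real training loop the returned value would be written into each optimizer parameter group.

```python
def linear_lr(initial_lr, total_steps, max_train_steps):
    """Learning rate that decays linearly from initial_lr to 0 over max_train_steps."""
    fraction_left = max(0.0, 1.0 - total_steps / max_train_steps)
    return initial_lr * fraction_left
```

For example, with `initial_lr=3e-4` and `max_train_steps=1000`, the rate is halved at step 500 and reaches zero at step 1000.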
8. Gradient Clip
To prevent gradient explosion, we clip the gradients before each update step (for example, when updating the actor).
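The original snippet is missing, so the sketch below only shows what clipping by global norm does to a flat list of gradients. The function name and `max_norm=0.5` are assumptions. In PyTorch the equivalent call is `torch.nn.utils.clip_grad_norm_(parameters, max_norm)`, placed between `loss.backward()` and `optimizer.step()`.

```python
import math


def clip_grad_norm(grads, max_norm=0.5):
    """Rescale grads so their global L2 norm is at most max_norm.

    Returns (clipped_grads, original_norm).
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale down only when the norm exceeds the threshold; never scale up.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm
```

Because all gradients share one scale factor, the update direction is preserved and only its length is limited.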
9. Orthogonal Initialization
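To prevent vanishing or exploding gradients at the beginning of training, we use orthogonal initialization.
I again asked GPT for the details of this trick.
Orthogonal initialization is a parameter initialization method commonly used in deep learning models, especially when training deep neural networks. It has several notable advantages:
1. Mitigates vanishing and exploding gradients
2. Promotes training stability
3. Speeds up convergence
4. Improves the performance of deep networks
5. Applies to many network architectures
1. Preserves the signal range
2. Promotes stable gradient propagation
3. Reduces internal covariate shift
4. Improves learning efficiency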
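To show what "orthogonal" means for a weight matrix, here is a small pure-Python sketch that builds orthonormal rows with Gram-Schmidt and scales them by a gain. The function name, the seed argument, and the restriction to `rows <= cols` are assumptions of this sketch. In PyTorch one would normally call `torch.nn.init.orthogonal_(layer.weight, gain=gain)` instead, often with a small gain on the policy output layer.

```python
import random


def orthogonal_matrix(rows, cols, gain=1.0, seed=0):
    """Return a rows x cols matrix whose rows are orthonormal, scaled by gain."""
    if rows > cols:
        raise ValueError("this sketch only supports rows <= cols")
    rng = random.Random(seed)
    basis = []
    while len(basis) < rows:
        v = [rng.gauss(0.0, 1.0) for _ in range(cols)]
        # Remove the components along the rows already accepted (Gram-Schmidt).
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > 1e-8:  # skip the (very unlikely) degenerate draw
            basis.append([x / norm for x in v])
    return [[gain * x for x in row] for row in basis]
```

Multiplying by a matrix with orthonormal rows preserves the norm of the component of the input lying in the span of those rows, which is the intuition behind "preserves the signal range" above.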
10. Adam Optimizer Epsilon Parameter
We change it from the default eps = 1e-8 to eps = 1e-5, which can make training more stable.
11. Tanh Activation Function
Just use it. I don’t know exactly why it works.
12. Gaussian Distribution and Beta Distribution
In practice, we use a Gaussian distribution to output the action most of the time. However, a Gaussian distribution is unbounded, so we have to clip the action to the valid range, which has a negative effect on performance.
Alternatively, we can use a Beta distribution, which outputs an action in the range [0, 1]. We can then map [0, 1] to any action range.
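The mapping from [0, 1] to an arbitrary action range is a simple affine transform. In the sketch below the function names and the `low`/`high` bounds are hypothetical, and `alpha` and `beta` stand in for the two positive parameters the actor network would output.

```python
import random


def sample_beta_action(alpha, beta, low, high, rng=random):
    """Sample u ~ Beta(alpha, beta) in [0, 1] and map it to [low, high]."""
    u = rng.betavariate(alpha, beta)
    return low + (high - low) * u


def beta_mean_action(alpha, beta, low, high):
    """Deterministic action for evaluation: the Beta mean mapped to [low, high]."""
    return low + (high - low) * alpha / (alpha + beta)
```

Since every sample already lies inside the bounds, no clipping is needed, which is the advantage over the Gaussian policy described above.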
Code details
If you want to know more about the code details, you can read my other blog post, RL_Code_Details, which analyzes some of them. The PPO code is included, of course.