Hardware: Google Colab L4
| Model Type | Discrete Actions | Average Reward | Total Training Steps | HuggingFace Repo |
|---|---|---|---|---|
| PPO | No | 220.66 | 750,000 | Link |
| PPO | Yes | 214.55 | 750,000 | Link |
| SAC | No | 288.74 | 750,000 | Link |
| DQN | Yes | 218.56 | 750,000 | Link |
- Set `ent_coef` to a value above 0 for PPO, as the entropy bonus encourages exploration of other actions. Stable Baselines3 defaults the value to 0.0. More Information
- Do not set your `eval_freq` too low (e.g. keep it >= 10,000), as being interrupted by evaluation too often can sometimes cause instability during learning.
- Stable Baselines3's DQN parameters `exploration_initial_eps` and `exploration_final_eps` help determine how exploratory your model is at the beginning and end of training.
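
The tips above can be sketched roughly as follows. This is a minimal illustration, not the configuration used for the runs in the table: the `ent_coef` value of 0.01 is an arbitrary example, the DQN values shown are what I understand to be the Stable Baselines3 defaults, and `build`, `linear_epsilon`, and the caller-supplied `env_id` are hypothetical names introduced here. `linear_epsilon` approximates the linear decay that DQN applies between the two epsilon values over `exploration_fraction` of training.

```python
# Illustrative values only; none of these numbers come from the runs in the table.
PPO_KWARGS = {"ent_coef": 0.01}  # assumption: small entropy bonus (SB3 default is 0.0)
DQN_KWARGS = {
    "exploration_initial_eps": 1.0,  # epsilon at the start of training
    "exploration_final_eps": 0.05,   # epsilon once the decay has finished
    "exploration_fraction": 0.1,     # share of training spent decaying epsilon
}
EVAL_FREQ = 10_000  # evaluate every 10,000 callback calls, not more often


def linear_epsilon(progress_done, initial_eps, final_eps, fraction):
    """Approximate DQN's linear epsilon decay.

    progress_done is the share of training completed, from 0.0 to 1.0.
    """
    if progress_done >= fraction:
        return final_eps
    return initial_eps + progress_done * (final_eps - initial_eps) / fraction


def build(env_id, algo="ppo"):
    """Hypothetical helper: create a model and an evaluation callback.

    env_id is a placeholder for whichever registered environment you train on.
    """
    # Imported lazily so the settings above can be inspected without SB3 installed.
    import gymnasium as gym
    from stable_baselines3 import DQN, PPO
    from stable_baselines3.common.callbacks import EvalCallback
    from stable_baselines3.common.monitor import Monitor

    if algo == "ppo":
        model = PPO("MlpPolicy", env_id, **PPO_KWARGS)
    else:
        model = DQN("MlpPolicy", env_id, **DQN_KWARGS)
    # A separate, monitored environment is used for periodic evaluation.
    eval_callback = EvalCallback(Monitor(gym.make(env_id)), eval_freq=EVAL_FREQ)
    return model, eval_callback
```

With these pieces, training would look like `model, cb = build(env_id)` followed by `model.learn(total_timesteps=750_000, callback=cb)`. Note that `eval_freq` counts callback calls, so with several parallel environments the number of timesteps between evaluations is correspondingly larger.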


