Policy Gradient Learning with CartPole-v0
The challenge of the week: solve a simple game (other than Pong) using policy gradients. I chose CartPole-v0 because it's a basic game and there is a ton of documentation and tutorials about this kind of game.
CartPole-v0 defines "solving" as getting an average reward of 195.0 over 100 consecutive trials.
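As a quick illustration (the variable name `episode_rewards` is mine, not from the notebook), the criterion can be checked like this:

```python
import numpy as np

# `episode_rewards` is assumed to hold the total reward of each finished episode.
episode_rewards = [200.0] * 120  # dummy data for the example
solved = len(episode_rewards) >= 100 and np.mean(episode_rewards[-100:]) >= 195.0
print(solved)  # True
```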
Requirements (installable as shown below):
- numpy
- gym https://github.com/openai/gym
- tensorflow
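If needed, the dependencies can be installed with pip (assuming a standard Python setup):

```
pip install numpy gym tensorflow
```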
To keep the code readable and easier to explain, I use a Jupyter Notebook.
Open your terminal, go to the Policy_gradients_CartPole folder, and launch the notebook:

```
jupyter notebook
```
4 kinds of information given by the state:
- Position of the cart
- Velocity of the cart
- Angle of the pole
- Angular velocity of the pole
An agent can push the cart in one of two directions (see the sketch after this list):
- 0: left
- 1: right
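A minimal sketch of how this looks through the classic gym API (this snippet is illustrative, not taken from the notebook):

```python
import gym

env = gym.make("CartPole-v0")
state = env.reset()

# state is a 4-element array:
# [cart position, cart velocity, pole angle, pole angular velocity]
print(state)

# 0 pushes the cart to the left, 1 pushes it to the right.
next_state, reward, done, info = env.step(1)
print(next_state, reward, done)
```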
What we must understand here is that immediate rewards matter more than delayed rewards.
That's why we use gamma as a discount factor.
Why? Because delayed rewards have less impact: imagine you screw up at step 5 (the pole leans too far). We don't care about the rewards that come after, because you are going to lose anyway. That's why rewards further in the future are discounted more and more.
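A common way to implement this is to walk backwards over the episode's rewards (a sketch; gamma = 0.95 is an assumed value, the notebook may use another):

```python
import numpy as np

def discount_rewards(episode_rewards, gamma=0.95):
    discounted = np.zeros_like(episode_rewards, dtype=np.float64)
    running_add = 0.0
    # Walk backwards: each step accumulates its own reward plus the
    # discounted sum of everything that comes after it.
    for t in reversed(range(len(episode_rewards))):
        running_add = running_add * gamma + episode_rewards[t]
        discounted[t] = running_add
    # Normalizing the returns is a standard trick to stabilize training.
    return (discounted - discounted.mean()) / (discounted.std() + 1e-8)

print(discount_rewards([1.0, 1.0, 1.0, 1.0]))
```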
Originally taken from the Deep RL Bootcamp, Core Lecture 4b: Pong from Pixels, by Andrej Karpathy.
Remember that (the loss sketch below makes this concrete):
- A positive advantage --> make the action more likely to happen in the future, at that state
- A negative advantage --> make the action less likely to happen in the future, at that state
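A minimal TF1-style sketch of how the advantage enters the loss (the layer sizes and variable names are my assumptions, not necessarily the notebook's):

```python
import tensorflow as tf

states_ = tf.placeholder(tf.float32, [None, 4], name="states")
actions_ = tf.placeholder(tf.int32, [None], name="actions")
advantages_ = tf.placeholder(tf.float32, [None], name="advantages")

# A small policy network: 4 state inputs -> 2 action logits.
hidden = tf.layers.dense(states_, 16, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, 2)

# Cross-entropy gives -log(pi(action | state)) for the action actually taken.
neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=logits, labels=actions_)

# Scaling by the advantage makes actions with a positive advantage more
# likely at that state, and actions with a negative advantage less likely.
loss = tf.reduce_mean(neg_log_prob * advantages_)
train_op = tf.train.AdamOptimizer(learning_rate=0.01).minimize(loss)
```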
This was made possible thanks to these two fantastic resources:
- Simple Reinforcement Learning with Tensorflow: Part 2 - Policy-based Agents: this article helped me define part of the architecture and helped me a lot with the training part.
- Policy gradients for reinforcement learning in TensorFlow