- Define the Environment
- Define the States
- Define the Actions
- Define the Rewards
- The AI Solution
- What is Reinforcement Learning?
- Markov Decision Process (MDP)
- Policy
- Bellman's Equation
- Living Penalty
- Q-Learning
- Temporal Difference
- Implementation
- Building the AI solution
- Coding
- Improvement
Suppose we want to train a helicopter to fly. We are given the position of the helicopter (its state) at a given point in time, and we need to take actions to make it fly along a certain trajectory. There is no direct mapping from x to y that tells us how to fly the helicopter, so it is hard to use supervised learning for this problem. With reinforcement learning, however, we don't need to specify the correct answer at every step; we only need a reward function that tells the helicopter when it is flying well and when it is doing poorly. A high reward signifies that the helicopter is following the correct trajectory, and a negative reward is given when it crashes.
We can think of this process as similar to training a dog, where the dog is given a treat whenever it does what it is asked. In our case, rewards with values of +1, -1 or 0 do the same job as the treat. One of the challenges of RL is the Credit Assignment Problem. In a game of chess, for example, we cannot say that it was the last move that made us lose. It could have been the 15th move, the 10th, or even the first. This happens because our rewards are sparse and infrequent. With sparse rewards, we get the reward only at the end, whereas with dense rewards we know what we are doing right or wrong as we do it. Therefore, we address this problem indirectly by proposing small rewards along the way and a bigger reward at the end.
Early on, reinforcement learning was used to play games like Atari Breakout. More recently, using OpenAI Gym, we can make a humanoid learn how to walk and complete a parkour course. Now we are trying to teach self-driving cars to drive using RL. Instead of going physically on the road, where there are risks of collision and damage, we train the car in a simulator and, after many trials, let it drive on real physical roads.
Reinforcement Learning problems are modelled by a Markov Decision Process (MDP). An MDP is a formalism to represent the world our agent will interact with. It is a tuple of five elements:
- $S$: Set of States (in chess this equals all possible chess positions; in our warehouse problem it equals all possible positions our robot can navigate)
- $A$: Set of Actions (in the helicopter example this could be all the positions we can move our control sticks; in our case it will be all the possible moves our robot can make)
- $P_{sa}$: State Transition Probabilities (if we take a certain action $a$ in state $s$, what is the probability of reaching state $s'$?)
- $\gamma$: Discount Factor
- $R$: Reward Function
For simplicity, we will reduce our warehouse to a simple 3 x 4 grid world with one obstacle at coordinates (2,2) as shown below:
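The figure gives the layout; as a rough sketch, the same grid world could be written down in Python along the following lines. The +1/-1 terminal rewards and the small living penalty are illustrative assumptions rather than values taken from the text.

```python
# A minimal sketch of the 3 x 4 grid world as an MDP (coordinates are (column, row)).
# The +1/-1 terminal rewards and the small living penalty are illustrative assumptions.

GRID_WIDTH, GRID_HEIGHT = 4, 3
OBSTACLE = (2, 2)                      # blocked cell
TERMINALS = {(4, 3): +1.0,             # assumed goal cell
             (4, 2): -1.0}             # assumed "fire" cell
LIVING_PENALTY = -0.04                 # assumed small negative reward per step

# S: set of states = every free cell in the grid
STATES = [(x, y)
          for x in range(1, GRID_WIDTH + 1)
          for y in range(1, GRID_HEIGHT + 1)
          if (x, y) != OBSTACLE]

# A: set of actions, each expressed as a (dx, dy) move
ACTIONS = {"Left": (-1, 0), "Right": (1, 0), "Up": (0, 1), "Down": (0, -1)}

# R: reward function
def reward(state):
    return TERMINALS.get(state, LIVING_PENALTY)

print(len(STATES))  # 11 states, as described in the text
```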
In order to understand an MDP, it is important to understand what a Markov process is. By definition, a Markov process is:
"A Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. A countably infinite sequence, in which the chain moves state at discrete time steps, gives a discrete-time Markov chain."
In simpler terms, it means that it does not matter where our agent has been before in our gridworld. It is a random process in which the future is independent of the past, given the present. What will happen in the future is determined only by the state our agent is in now, plus the actions it will take, with an overlay of randomness.
Our simplified gridworld contains 11 states and four possible actions per state: Left, Right, Up, Down. When we want our robot to go Up, for example, it does not always go Up. Due to friction or the dynamic noise of our robot, it can veer slightly left or right, and we need to add this stochastic behavior, which we would experience in the real world, to our environment.
If our robot is in cell (3,1) and we want it to go Up to cell (3,2), then the chance of it actually going Up is as follows:
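A common model for this kind of grid world, and the one consistent with the 10% slip chance cited later in the text, is 80% for the intended move and 10% for each perpendicular direction. The sketch below encodes that assumed slip model.

```python
# Assumed slip model: 80% intended direction, 10% to each perpendicular side.
# These numbers are an assumption consistent with the 10% slip chance cited later.
PERPENDICULAR = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
                 "Left": ("Up", "Down"), "Right": ("Up", "Down")}

def transition_probabilities(action):
    """Return a dict mapping the action actually executed to its probability."""
    side_a, side_b = PERPENDICULAR[action]
    return {action: 0.8, side_a: 0.1, side_b: 0.1}

print(transition_probabilities("Up"))  # {'Up': 0.8, 'Left': 0.1, 'Right': 0.1}
```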
To sum up:
- Our robot will wake up at state $s_0$ (at the moment we turn on the robot)
- Based on the state it is in, it will choose some action $a_0$
- Based on that action, it will get to some state $s_1$, which is distributed according to the state transition probabilities $P_{s_0 a_0}$
- Then it will choose a new action $a_1$
- As a consequence of action $a_1$, it will get to state $s_2$, governed by the state transition probabilities $P_{s_1 a_1}$

Therefore our robot will go through a sequence of states $s_0, s_1, s_2, \dots$ whereby the total payoff is the sum of discounted rewards:

$$R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \dots$$
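As a quick check on the formula, a few lines of Python are enough to compute a discounted sum of rewards; the reward sequence and gamma below are just examples.

```python
# Discounted total payoff: R(s0) + gamma*R(s1) + gamma^2*R(s2) + ...
def discounted_return(rewards, gamma=0.9):  # gamma chosen only for illustration
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([0, 0, 0, 1]))  # 0.9**3 * 1 = 0.729
```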
There are two reasons to use a discount factor:
- The concept is similar to the Time Value of Money, where money today has more value than money tomorrow. We discount future rewards so that they don't count as much as the current reward. Similar to human behavior, we prefer short-term rewards to long-term rewards.
- It makes it more practical for RL algorithms to converge.
Our goal is to choose actions that maximize our expected total payoff. In order to do so, we need to devise a policy $\pi$ that maps states to actions.
The policy or controller is the output of our RL algorithm; it maps states to actions. So for our MDP, if we are in state (3,1), our policy will be $\pi((3,1)) = \text{Right}$. The optimal policy for this MDP is such that whenever we are in state $s$, we take action $\pi^*(s)$, and this policy maximizes the expected total payoff.
We might have assumed that from cell (3,1) we would go to cell (3,2), then (3,3), then (4,3). However, our optimal policy suggests that since there is a slight possibility (10%, as per our state transition probabilities) that we slide into the fire in cell (4,2), it is better to take the longer route. Note that this can be adjusted by using a higher living penalty.
There is an exponentially large number of policies. For our simple grid of 11 states and 4 actions per state, there are $4^{11} \approx 4.2$ million possible policies, which is still manageable as we are dealing with a small MDP. For our real warehouse, however, the number of policies is astronomically larger.
Hence, to find the optimal policy, we need to define our Value Function.
To find our optimal policy, we first need to define $V^\pi$, $V^*$ and $\pi^*$.
For a policy $\pi$, the value function $V^\pi: S \to \mathbb{R}$ (the value function for policy $\pi$) is such that $V^\pi(s)$ is the expected total payoff for starting in state $s$ and executing $\pi$:

$$V^\pi(s) = E\big[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \dots \mid s_0 = s, \pi\big]$$

which leads us to the Bellman Equation.
We start by introducing the immediate reward, whereby we reward the agent just for being in the starting state $s_0$. The agent will perform some action and go to a new state $s_1$, where it will receive a reward $R(s_1)$, perform some action again and receive another reward $R(s_2)$, and so on. We can write this as follows:

$$V^\pi(s_0) = E\big[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \dots\big]$$

If we factor out one factor of gamma, the equation becomes:

$$V^\pi(s_0) = E\big[R(s_0) + \gamma\big(R(s_1) + \gamma R(s_2) + \dots\big)\big]$$

where $R(s_1) + \gamma R(s_2) + \dots$ is known as the Expected Future Rewards, or the sum of discounted rewards when the robot wakes up in state $s_1$.
From this we can write Bellman's Equation:

$$V^\pi(s) = R(s) + \gamma \sum_{s'} P_{s\pi(s)}(s')\, V^\pi(s')$$

We have $a = \pi(s)$; that is, in state $s$, take action $\pi(s)$. Now we need to solve for $V^\pi$. Given $\pi$, we get a system of linear equations in terms of $V^\pi(s)$: one equation for each of the 11 states, with 11 unknowns, which we can solve with standard linear algebra.
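Because the Bellman equation for a fixed policy is linear in the values, the system can be solved directly. Below is a rough NumPy sketch of that idea; the 3-state transition matrix and rewards are placeholders, not the actual grid-world numbers.

```python
import numpy as np

# Policy evaluation sketch: solve V = R + gamma * P_pi @ V  =>  (I - gamma * P_pi) V = R
# P_pi[i, j] = probability of moving from state i to state j under the fixed policy.
# The 3-state transition matrix and rewards below are placeholders for illustration.
gamma = 0.9
P_pi = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.0, 0.0, 1.0]])
R = np.array([-0.04, -0.04, 1.0])

V = np.linalg.solve(np.eye(len(R)) - gamma * P_pi, R)
print(V)  # expected total payoff of each state under this policy
```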
If our agent is in cell (3,1) and follows policy $\pi$, then its value function satisfies:

$$V^\pi((3,1)) = R((3,1)) + \gamma \sum_{s'} P_{(3,1)\pi((3,1))}(s')\, V^\pi(s')$$
$V^*(s)$ is the value of the best possible policy for that state: $V^*(s) = \max_\pi V^\pi(s)$.
Bellman's equation for the optimal value function:

$$V^*(s) = R(s) + \max_a \gamma \sum_{s'} P_{sa}(s')\, V^*(s')$$

where $\sum_{s'} P_{sa}(s')\, V^*(s')$ is the expected future reward.
If our robot is in state $s$, what is the best action it could take from that state? The action that maximizes the total expected payoff:

$$\pi^*(s) = \arg\max_a \sum_{s'} P_{sa}(s')\, V^*(s')$$

From the optimal policy we find the value of $a$ at which the maximum is attained; once we have $V^*$, we plug it into this equation.
To sum up, the strategy we can use to find our optimal policy: first compute $V^*$, then use the argmax equation above to read off $\pi^*(s)$ for every state.
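One standard way to compute $V^*$ for a small MDP like this is value iteration, which repeatedly applies the Bellman optimality update until the values stop changing. The sketch below assumes hypothetical `reward(s)` and `transitions(s, a)` helpers (the latter returning `(probability, next_state)` pairs) standing in for the grid-world definitions.

```python
# Value iteration sketch: V(s) <- R(s) + gamma * max_a sum_s' P_sa(s') * V(s')
# `states`, `actions`, `reward(s)` and `transitions(s, a)` are assumed to be
# provided by the grid-world definition; they are stand-ins here.
def value_iteration(states, actions, reward, transitions, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(sum(p * V[s2] for p, s2 in transitions(s, a))
                       for a in actions)
            new_v = reward(s) + gamma * best
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V
```

Once $V^*$ has converged, $\pi^*(s)$ is read off by taking, in each state, the action whose expected next-state value is largest.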
A Q-value function $Q^\pi(s, a)$ tells us how good a certain action is, given a state, for an agent following policy $\pi$. The optimal Q-value function $Q^*(s, a)$ gives us the maximum return achievable from a given state-action pair by any policy. The Q-function captures the expected total future reward an agent in state $s$ can receive by executing a certain action $a$.
We want to take actions that maximize our Q-value. Ultimately, the agent needs a policy $\pi(s)$ to infer the best action to take in its state $s$.
So how do we use Q-value to take the next action?
- We feed in all possible actions we can execute at that time.
- We evaluate our Q-function (high Q-value and low Q-value)
- We pick the action that gives us the highest Q-value, as sketched below
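A minimal sketch of that selection step, assuming a hypothetical helper `q_values(state)` that returns one Q-value per action:

```python
# Greedy action selection from Q-values; `q_values` is an assumed helper
# returning a dict {action: Q(s, a)} for the given state.
def best_action(state, q_values):
    q = q_values(state)
    return max(q, key=q.get)  # action with the highest Q-value

# Example with made-up numbers:
print(best_action((3, 1), lambda s: {"Left": 0.2, "Right": 0.7, "Up": 0.5, "Down": 0.1}))
```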
There are two ways to find that optimal policy:
- Value Learning
- Policy Learning
We want our neural network to learn the Q-function and then we will use that Q-function to define our policy.
Instead of feeding in each possible action at a given state one at a time and outputting its corresponding Q-value, we want the network to output the Q-values of all possible actions at that state at once, and to maximize the target return, which is used to train the agent. The target return is maximized over some (in principle infinite) time horizon and serves as the ground truth for training the agent.
So we can basically roll out the agent, see how it did in the future and, based on its performance, use that as ground truth. Our target Q-value is the reward we got at that time step by taking that action, plus the discounted value of the best action we can take at the next state:

$$y = r + \gamma \max_{a'} Q(s', a')$$

We use a neural network to learn the Q-function and then use it to infer the optimal policy $\pi^*(s) = \arg\max_a Q(s, a)$. Finally, we send the action back to the environment and receive the next state.
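As a rough sketch of how that target drives the training of a Q-network (PyTorch here; the network size, discount factor and batch tensors are illustrative assumptions rather than the article's exact setup):

```python
import torch
import torch.nn as nn

# Sketch of the deep Q-learning target described above; the network size,
# gamma, and the batch tensors are illustrative assumptions.
q_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 4))  # 2-d state -> 4 action values
gamma = 0.9

def q_learning_loss(states, actions, rewards, next_states, dones):
    # Q(s, a) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Target: r + gamma * max_a' Q(s', a')  (no bootstrapping from terminal states)
    with torch.no_grad():
        max_next_q = q_net(next_states).max(dim=1).values
        target = rewards + gamma * max_next_q * (1 - dones)
    return nn.functional.mse_loss(q_sa, target)
```

In practice, deep Q-learning implementations typically compute the target with a separate, periodically updated copy of the network to stabilize training.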
- Q-value learning is really suited to a deterministic environment with a discrete action space; it cannot learn stochastic policies.
- It does not perform well in complex scenarios where the action space is large.
To address these problems, we need to use Policy Learning.
In policy learning, we do not output a Q-function; we directly optimize the policy, a probability distribution over the space of all actions given the state. For each action, it gives the probability that taking that action will result in the highest Q-value. We can then act by sampling from that distribution.
- We are no longer constrained to only categorical action spaces.
- We can parametrize this probability distribution however we would like.
With a discrete action space we ask which direction the agent should move: Left, Right, Up or Down. With a continuous action space, we ask how fast the agent should move and in which direction.
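To make the discrete case concrete, here is a minimal sketch (PyTorch again; the layer sizes and the 2-dimensional state encoding are assumptions) of a policy network that outputs a probability distribution over the four moves and samples an action from it:

```python
import torch
import torch.nn as nn

# Policy network sketch for the discrete case: state -> probabilities over 4 actions.
# Layer sizes and the 2-d state encoding are illustrative assumptions.
policy_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 4))

def select_action(state):
    logits = policy_net(state)                       # unnormalized action scores
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                           # sample from the distribution
    return action, dist.log_prob(action)             # log-prob is needed for the loss later

action, log_prob = select_action(torch.tensor([3.0, 1.0]))
```

For a continuous action space, the network would instead output, say, the mean and standard deviation of a Gaussian and sample the speed and direction from that.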
- Initialize the agent
- Run a given policy until crash
- Record all states, rewards and actions
- Decrease probability of actions that resulted in low reward
- Increase probability of actions that resulted in high reward
The loss is built from the log-likelihood of the action, $\log P(a_t \mid s_t)$, which tells us how likely the action we selected was, multiplied by the return that followed it:

$$\text{loss} = -\log P(a_t \mid s_t)\, R_t$$
Scenario 1: If our agent got a lot of reward from an action that had a very high likelihood of being selected, then
Loss = - (large number x large number) = - large number
Therefore, we have a minimum loss.
Scenario 2: If the reward is low but the probability of selecting that action is high, then
Loss = - (large number x small number) = - small number
Therefore, we have a comparatively high loss.
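Putting the training recipe and the two scenarios together, here is a rough sketch of a single policy-gradient update; the network size, learning rate and the way log-probabilities and returns are collected while running the policy are assumptions.

```python
import torch
import torch.nn as nn

# A rough sketch of one policy-gradient update; network size, learning rate and the
# way log-probs/returns are recorded while running the policy are assumptions.
policy_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def policy_gradient_update(log_probs, returns):
    # loss = - sum_t log P(a_t | s_t) * R_t
    # High return and a likely action -> large negative term -> low loss  (action reinforced)
    # Low return and a likely action  -> term near zero      -> higher loss (action discouraged)
    loss = -(torch.stack(log_probs) * torch.tensor(returns)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```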