These are my personal lecture notes for Georgia Tech's Reinforcement Learning course (CS 7642, Spring 2024) by Charles Isbell and Michael Littman. All images are taken from the course's lectures unless stated otherwise.
Readings:

- Littman, M. L., & Szepesvári, C. (1996). A generalized reinforcement-learning model: Convergence and applications. In ICML (Vol. 96, pp. 310-318).
- Littman, M. L. (1996). Algorithms for sequential decision-making (Chapter 3). PhD thesis, Brown University.
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press. (Chapters 7.1-7.2, 7.7, 12.1-12.5, 12.7, 12.10-12.13)
1. There's some $t^*$ (less than infinity) that's polynomial in:
    - the size of the problem (the number of states $|S|$ and the number of actions $|A|$),
    - the magnitude of the rewards ($R_{\max} = \max_{s,a} |R(s,a)|$),
    - $\frac{1}{1-\gamma}$, and
    - the bits of precision (e.g. the number of bits that are used to specify the transition probabilities),

    such that if we run value iteration for $t^*$ steps, the policy $\pi(s)$ we get is optimal: $$\pi(s) = \arg\max_a Q_{t^*}(s,a)$$
    - Note that the policy is a greedy policy w.r.t. $Q_{t^*}$.
    - In other words, we don't need to run value iteration for an infinite number of steps, but only for a polynomial number of steps ($t^*$), and that will already give us a Q-function that is close enough to the optimal Q-function. We can then extract the optimal policy from that Q-function.
    - This is saying that once we fix the MDP and the precision (see the polynomial quantities above), value iteration will reach a point where, at every state, the second-best action is some distance (a fixed value gap that depends on the precision?) away from the best action. This separation between actions allows the greedy policy to choose the best action.
2. When the value difference between any two consecutive iterations is small, we are close to the optimal policy.
- If the change in the value between two consecutive iterations is less than $\epsilon$ for all states in the MDP, i.e. $||V_{t+1} - V_t||_\infty < \epsilon$, then the maximum difference between the value function of the policy $\pi_{V_{t+1}}$ and the optimal value function is small:
  $$||V^{\pi_{V_{t+1}}} - V^*||_\infty < \frac{2\epsilon\gamma}{1-\gamma}$$
  Note: $V^{\pi_{V_{t+1}}}$ is the value function obtained by following the policy $\pi_{V_{t+1}}$, where $\pi_{V_{t+1}}$ is the greedy policy w.r.t. $V_{t+1}$. In other words, if you can get a good enough approximation of the optimal value function, then you know how much you're off from the optimal policy by just looking at the value difference between two consecutive iterations (you don't need to know what the optimal value function is!).
- This helps us decide when it's a good time to stop value iteration!
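Below is a minimal sketch of tabular value iteration with this stopping rule, assuming the MDP is given as a reward matrix `R[s, a]` and a transition tensor `T[s, a, s']` (the array convention is mine, not from the lecture):

```python
import numpy as np

def value_iteration(T, R, gamma, epsilon):
    """Run VI until the largest per-state change is below epsilon.

    T: transitions, shape (S, A, S); R: rewards, shape (S, A).
    Returns the final value function and the greedy policy w.r.t. it.
    """
    n_states, _ = R.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') * V(s')
        Q = R + gamma * T @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < epsilon:
            # Greedy policy w.r.t. the latest value function; per the bound above,
            # it is within 2 * epsilon * gamma / (1 - gamma) of optimal (max-norm).
            Q_final = R + gamma * T @ V_new
            return V_new, Q_final.argmax(axis=1)
        V = V_new
```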
- About $\gamma$:
  - If we don't care so much about the future, we can set a smaller $\gamma$. A small $\gamma$ will make the value function converge faster (fewer iterations), since $t^*$ is polynomial in $\frac{1}{1-\gamma}$.
  - If we care a lot about the future and set a very large $\gamma$ (close to 1), the value function will take longer to converge.
  - In other words, there's a trade-off between the horizon of the problem (how much we care about the future) and the computational time.
- If we apply the Bellman operator $B$ $k$ times to two different Q-functions $Q_1$ and $Q_2$:
  $$||B^k Q_1 - B^k Q_2||_\infty \leq \gamma^k ||Q_1 - Q_2||_\infty$$
  i.e. value iteration brings the two Q-functions closer together. How close they get depends on the number of iterations.
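A quick numerical check of the contraction property on a randomly generated MDP (the setup below is my own, just to illustrate the inequality):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random MDP: rewards R[s, a] and a row-stochastic transition tensor T[s, a, s'].
R = rng.normal(size=(n_states, n_actions))
T = rng.random(size=(n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)

def bellman(Q):
    # (BQ)(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') * max_a' Q(s', a')
    return R + gamma * T @ Q.max(axis=1)

Q1 = rng.normal(size=(n_states, n_actions))
Q2 = rng.normal(size=(n_states, n_actions))
gap0 = np.max(np.abs(Q1 - Q2))

for k in range(1, 6):
    Q1, Q2 = bellman(Q1), bellman(Q2)
    gap = np.max(np.abs(Q1 - Q2))
    # The measured gap stays below the theoretical bound gamma^k * ||Q1 - Q2||_inf.
    print(f"k={k}: gap = {gap:.4f} <= bound = {gamma**k * gap0:.4f}")
```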
- Recall that the number of iterations $t^*$ is polynomial in $\frac{1}{1-\gamma}$.
  - However, $\frac{1}{1-\gamma}$ is not polynomial in the number of bits of precision needed to represent $\gamma$ (e.g. we can specify a $\gamma$ that is very close to 1 with just a few bits of precision and make the $\frac{1}{1-\gamma}$ term explode).
  - So value iteration is not a polynomial-time algorithm for solving MDPs.
- Linear programming (LP) gives us a polynomial-time algorithm for solving MDPs.
- LP is an optimization technique for solving problems with a linear objective function and a set of linear constraints.
- We can encode an MDP as a linear program and solve it with an LP solver.
- To solve for the value function, we have a set of constraints to satisfy, one for each state $s$:
  $$V(s) = \max_a \left( R(s,a) + \gamma \sum_{s'} T(s,a,s') V(s') \right) \quad \forall s$$
- Since the max function is not a linear operator, the set of equations above is non-linear, and thus we can't solve it using LP.
- However, we can express the max operator as a set of linear constraints and an objective function. For example:
  - Let's say $m$ is the maximum value of a list of numbers: $m = \max(x_1, x_2, \ldots, x_n)$.
  - We have the following constraints: $m \geq x_1$, $m \geq x_2$, $\ldots$, $m \geq x_n$.
  - There is a whole set of $m$'s that satisfies all these constraints (any large enough $m$ works). To find the maximum value, we need to find the smallest among these $m$'s. Therefore, we add an objective function that minimizes $m$: $\min m$.
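A tiny sanity check of this trick with `scipy.optimize.linprog` (the numbers are arbitrary and only for illustration):

```python
from scipy.optimize import linprog

# Find m = max(3, 7, 2) by minimizing m subject to m >= x_i for every x_i.
# linprog expects constraints as A_ub @ z <= b_ub, so m >= x_i becomes -m <= -x_i.
xs = [3, 7, 2]
result = linprog(c=[1.0],                    # minimize 1 * m
                 A_ub=[[-1.0]] * len(xs),    # -m <= -x_i  <=>  m >= x_i
                 b_ub=[-x for x in xs],
                 bounds=[(None, None)])      # m is a free variable
print(result.x[0])  # 7.0
```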
- Therefore, we can rewrite the above set of value constraints as linear inequality constraints, one for each state $s$ and action $a$:
  $$V(s) \geq R(s,a) + \gamma \sum_{s'} T(s,a,s') V(s') \quad \forall s, a$$
  - This is a set of linear constraints.
- We can add an objective function that minimizes the values $V(s)$ over all states $s$:
  $$\min_{V} \sum_s V(s)$$
  Note: we can't just write $\min V(s)$ here because $V(s)$ is a set of variables (one for each state $s$). An easy way to push every $V(s)$ down to its smallest feasible value is to first sum the $V(s)$ over all states and then minimize that sum.
- This is a linear program that we can use to solve the MDP.
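Here's a minimal sketch of this primal LP with `scipy.optimize.linprog`, using the same `T[s, a, s']` / `R[s, a]` array convention as the earlier value iteration sketch (my own setup, not from the lecture):

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_primal_lp(T, R, gamma):
    """Minimize sum_s V(s) subject to
    V(s) >= R(s, a) + gamma * sum_s' T(s, a, s') V(s') for all s, a."""
    n_states, n_actions = R.shape
    # Constraint row for (s, a): -V(s) + gamma * sum_s' T(s, a, s') V(s') <= -R(s, a)
    A_ub = np.zeros((n_states * n_actions, n_states))
    b_ub = np.zeros(n_states * n_actions)
    for s in range(n_states):
        for a in range(n_actions):
            row = s * n_actions + a
            A_ub[row] = gamma * T[s, a]   # coefficients on V(s') from the expectation
            A_ub[row, s] -= 1.0           # the -V(s) term
            b_ub[row] = -R[s, a]
    res = linprog(c=np.ones(n_states), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n_states)
    return res.x  # the optimal value function V*(s)
```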
- The advantage of LP is that it's easy to add other constraints to the problem.
- Another LP (called the dual) can be derived from the above LP (the original one, i.e. the primal).
- Each variable in the primal becomes a constraint in the dual, and vice versa.
- The objective direction is reversed.
Dual of the above LP:
- Objective function:
  $$\max_{q} \sum_s \sum_a q(s,a) R(s,a)$$
  Here we want to maximize the rewards over all states and actions. $q(s,a)$ is what Michael calls the "policy flow" in the lecture (see below). Each state-action pair has its own $q(s,a)$.
- Constraints:
  $$1 + \gamma \sum_{s} \sum_{a} q(s,a) T(s,a,s') = \sum_{a'} q(s',a') \quad \forall s'$$
  $$q(s,a) \geq 0 \quad \forall s, a$$
  The idea is that the total "policy flow" that arrives at a state $s'$ (the left-hand side) equals the flow that the state can send out ($\sum_{a'} q(s',a')$); think of it as a "conservation of the flow". We add a $1$ and discount the incoming flow by $\gamma$ on the left-hand side to account for the MDP's dynamics. The policy flow must be non-negative, so that we actually pass flow on to the next states; we can't send all the flow to one state and send negative flow to the other states to balance the total flow.
- Intuitively, the dual LP is about finding a policy that has its "policy flow" distributed in a way that maximizes the rewards.
- Solving the dual is also polynomial-time, but it focuses on the policy (how the "flow" is distributed) rather than the value function.
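And a sketch of the dual with the same convention; the flow-conservation constraints become equality constraints, `q(s,a) >= 0` is handled by `linprog`'s default bounds, and reading the policy off the flow at the end is my own addition:

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_dual_lp(T, R, gamma):
    """Maximize sum_{s,a} q(s,a) R(s,a) subject to, for every state s':
    sum_{a'} q(s',a') - gamma * sum_{s,a} q(s,a) T(s,a,s') = 1."""
    n_states, n_actions = R.shape
    n_vars = n_states * n_actions               # one q(s, a) per state-action pair
    A_eq = np.zeros((n_states, n_vars))
    for s_next in range(n_states):
        for s in range(n_states):
            for a in range(n_actions):
                A_eq[s_next, s * n_actions + a] -= gamma * T[s, a, s_next]
        for a in range(n_actions):
            A_eq[s_next, s_next * n_actions + a] += 1.0
    b_eq = np.ones(n_states)
    # linprog minimizes, so negate the rewards to maximize; default bounds give q >= 0.
    res = linprog(c=-R.flatten(), A_eq=A_eq, b_eq=b_eq)
    q = res.x.reshape(n_states, n_actions)
    return q.argmax(axis=1)                     # the action with the most flow per state
```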
Policy iteration (PI) steps:
- Initialize $Q_0$ (e.g. $Q_0 = 0$)
- Policy improvement (greedy policy): $\pi_t = \text{greedy}(Q_t)$, where $\pi_t(s) = \arg\max_a Q_t(s,a)$
- Policy evaluation: $Q_{t+1} = Q^{\pi_t}$, i.e. compute the Q-function of the policy $\pi_t$
Note:
- $Q_t$ converges to $Q^*$
- Convergence is exact and complete in finite time
- Converges at least as fast as VI
- Trade-off: each PI iteration is more computationally expensive than a VI iteration
- The policy evaluation step is like a value iteration process
- Open question: we only know that the convergence time of PI is at least linear in the size of the problem and at most exponential in the number of states, but we don't know the exact convergence time
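A compact sketch of these steps under the same array convention; here policy evaluation is done exactly by solving the linear system $V = R_\pi + \gamma T_\pi V$ (an iterative evaluation would work too):

```python
import numpy as np

def policy_iteration(T, R, gamma):
    """Alternate greedy improvement and exact policy evaluation until the
    greedy policy stops changing. T: (S, A, S) transitions, R: (S, A) rewards."""
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        pi = Q.argmax(axis=1)                   # policy improvement (greedy policy)
        # Policy evaluation: solve V = R_pi + gamma * T_pi V for the current policy.
        R_pi = R[np.arange(n_states), pi]       # R(s, pi(s))
        T_pi = T[np.arange(n_states), pi]       # T(s, pi(s), s')
        V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
        Q_new = R + gamma * T @ V               # Q^{pi}(s, a)
        if np.array_equal(Q_new.argmax(axis=1), pi):
            return pi, Q_new                    # the policy is stable, hence optimal
        Q = Q_new
```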
- A policy $\pi_1$ dominates another policy $\pi_2$ if it is not worse in any state:
  $$\pi_1 \geq \pi_2 \iff V^{\pi_1}(s) \geq V^{\pi_2}(s) \;\; \forall s$$
- Strict domination: $\pi_1 > \pi_2$ iff $\pi_1 \geq \pi_2$ and $\exists s: V^{\pi_1}(s) > V^{\pi_2}(s)$ (one policy is strictly better than the other in at least one state).
- A policy $\pi$ is $\epsilon$-optimal iff $|V^{\pi}(s) - V^{*}(s)| \leq \epsilon \;\; \forall s$ (in other words, the policy is nearly as good as the optimal policy, and the loss/regret is bounded by $\epsilon$).
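These definitions translate directly into code; the helper names below are mine, for illustration only:

```python
import numpy as np

def dominates(V1, V2):
    """pi_1 >= pi_2  iff  V^{pi_1}(s) >= V^{pi_2}(s) for every state s."""
    return bool(np.all(V1 >= V2))

def strictly_dominates(V1, V2):
    """Domination plus a strict improvement in at least one state."""
    return dominates(V1, V2) and bool(np.any(V1 > V2))

def is_eps_optimal(V_pi, V_star, eps):
    """|V^{pi}(s) - V^{*}(s)| <= eps for every state s."""
    return bool(np.max(np.abs(V_pi - V_star)) <= eps)
```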
- Let's say we have two policies $\pi_1$ and $\pi_2$.
- We can update a value function $V$ by following $\pi_1$ or $\pi_2$, using the Bellman operators $B_1$ and $B_2$:
  $$[B_1 V](s) = R(s,\pi_1(s)) + \gamma \sum_{s'} T(s,\pi_1(s),s') V(s')$$
  $$[B_2 V](s) = R(s,\pi_2(s)) + \gamma \sum_{s'} T(s,\pi_2(s),s') V(s')$$
- We will show why PI works with two properties:
  - Monotonicity: $B_1$ and $B_2$ are monotonic operators.
  - Value improvement: applying the Bellman operator of the greedy policy improves the value function.
- Monotonicity:
  - Monotonicity means that the Bellman operator preserves the order of any two value functions (i.e. if one value function dominates another, it still dominates it after both are updated by the Bellman operator).
  - We will show that $B_2$ is monotonic ($B_1$ is the same, but we won't need it to prove PI).
  - We want to prove that if $V \geq V'$ ($V$ dominates $V'$), then the Bellman updates of the two value functions are related in the same way, i.e. $B_2 V \geq B_2 V'$.
  - Proof (subtract the two Bellman updates and check whether one is greater than or equal to the other): for every state $s$,
    $$[B_2 V](s) - [B_2 V'](s) = \gamma \sum_{s'} T(s,\pi_2(s),s') \left[ V(s') - V'(s') \right]$$
    Since $V \geq V'$ and the transition probabilities are non-negative, the right-hand side is non-negative. Therefore $B_2 V \geq B_2 V'$.
  - P.S. The lecture used $B_1$ here, but the same property holds for $B_2$.
- Again, $B_1$ is the Bellman operator associated with policy $\pi_1$.
- $V_1$ is the fixed point of $B_1$: $B_1 V_1 = V_1$ (by the contraction property).
- Let $\pi_2$ be the greedy policy w.r.t. $V_1$.
- Let $B_2$ be the Bellman operator associated with $\pi_2$.
- We want to show that after applying the Bellman operator w.r.t. the greedy policy (i.e. $B_2$) to $V_1$, we get an updated value function that dominates the value function before the update: $B_2 V_1 \geq V_1$.
- Proof:
  Let:
  - $B_1$ be the Bellman operator associated with policy $\pi_1$,
  - $V_1$ be the fixed point of $B_1$ (so $B_1 V_1 = V_1$),
  - $\pi_2$ be the greedy policy w.r.t. $V_1$, and
  - $B_2$ be the Bellman operator associated with $\pi_2$.

  For every state $s$, $\pi_2(s)$ maximizes over actions what $\pi_1(s)$ achieves with one particular action, so:
  $$[B_2 V_1](s) = \max_a \left( R(s,a) + \gamma \sum_{s'} T(s,a,s') V_1(s') \right) \geq R(s,\pi_1(s)) + \gamma \sum_{s'} T(s,\pi_1(s),s') V_1(s') = [B_1 V_1](s) = V_1(s)$$
  Therefore $B_2 V_1 \geq V_1$.
- Putting the pieces together, the value function of the greedy policy $\pi_2$ dominates $V_1$:
  - Value improvement (or at least not getting worse): applying $B_2$ to $V_1$ gives us a value function that dominates $V_1$: $B_2 V_1 \geq V_1$.
  - Monotonicity: applying $B_2$ $k+1$ times gives us a value function that dominates the value function after applying it $k$ times: $B_2^{k+1} V_1 \geq B_2^{k} V_1$.
  - Transitivity (if $x \geq y$ and $y \geq z$, then $x \geq z$): if we keep applying $B_2$, we get a sequence of value functions where each one dominates the previous one: $\cdots \geq B_2^{2} V_1 \geq B_2 V_1 \geq V_1$.
  - Fixed point: we will reach a fixed point where the value function doesn't change anymore: $\lim_{k \to \infty} B_2^{k} V_1 = V_2$, the value function of $\pi_2$, and therefore $V_2 \geq V_1$.
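A quick numerical illustration of this chain on a random MDP (same array convention as the earlier sketches; the specific setup is mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 2, 0.9
R = rng.normal(size=(n_states, n_actions))
T = rng.random(size=(n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)

# V_1: fixed point of B_1 for an arbitrary policy pi_1 (here: always take action 0).
pi_1 = np.zeros(n_states, dtype=int)
T_1 = T[np.arange(n_states), pi_1]
V_1 = np.linalg.solve(np.eye(n_states) - gamma * T_1, R[np.arange(n_states), pi_1])

# pi_2: greedy policy w.r.t. V_1; B_2 is its Bellman operator.
pi_2 = (R + gamma * T @ V_1).argmax(axis=1)
T_2, R_2 = T[np.arange(n_states), pi_2], R[np.arange(n_states), pi_2]

def B_2(V):
    # (B_2 V)(s) = R(s, pi_2(s)) + gamma * sum_s' T(s, pi_2(s), s') V(s')
    return R_2 + gamma * T_2 @ V

# Repeatedly applying B_2 never decreases any state's value.
V = V_1
for k in range(20):
    V_next = B_2(V)
    assert np.all(V_next >= V - 1e-10)   # B_2^{k+1} V_1 >= B_2^k V_1
    V = V_next
print(np.all(V >= V_1 - 1e-10))          # the limit V_2 dominates V_1
```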
- PI does not get stuck in a local optimum because it always improves the policy if it can.
- Recall that the property of value improvement is that a value function dominates the previous value function after an update. There is actually a nuance here we didn't discuss:
- At some states, one value function can strictly dominate the other, while at another set of states the strict domination goes the other way. For example, at states $s_1$ and $s_2$, $V_A$ is strictly better than $V_B$:

  | State | $s_1$ | $s_2$ |
  |---|---|---|
  | $V_A$ | 10 | 6 |
  | $V_B$ | 3 | 3 |

- However, there is another set of states where the same relationship holds in the opposite direction ($V_B$ is strictly better than $V_A$ there):

  | State | $s_3$ | $s_4$ |
  |---|---|---|
  | $V_A$ | 3 | 3 |
  | $V_B$ | 10 | 6 |

- In both cases, switching from one value function to the other improves the values at some states. Does that mean that at some point the policy can just go back and forth between these cases, as long as the values at some states improve?
- No. If this happened, the value functions might just keep changing without actually improving the overall policy when there is still room for improvement.
- Actually, "value improvement" in PI means that not only does $V_{t+1}$ dominate $V_t$ over all states, but also that once the value at a state has improved, it never goes back to its previous (lower) value at that state.