
Commit 10a0bf6

report modified
1 parent 2973baf commit 10a0bf6

File tree

1 file changed: +6 -6 lines changed


report.md

Lines changed: 6 additions & 6 deletions
@@ -1,6 +1,6 @@
# Homework3-Policy-Gradient report
## Problem 1: construct a neural network to represent policy
- In more complex tasks (Atari games, and even real-world tasks), it is hard to apply policy iteration / value iteration directly: the state/action space is so large that storing and computing Q values for every state-action pair becomes infeasible. So we "learn" the Q values or the policy with a neural network. Here in Problem 1, we want to use a simple neural network $f_{Q^*}(s, a; \Theta)$ to represent $Q^*(s, a)$, where $\Theta$ is the parameters of the neural network, as shown in the figure below:
+ In more complex tasks (Atari games, and even real-world tasks), it is hard to apply policy iteration / value iteration directly: the state/action space is so large that storing and computing Q values for every state-action pair becomes infeasible. So we "learn" the Q values or the policy with a neural network. Here in Problem 1, we want to use a simple neural network <img src="https://latex.codecogs.com/gif.latex?f_{Q^*}(s,%20a;\Theta)"> to represent <img src='https://latex.codecogs.com/gif.latex?Q^*%20(s,%20a)'> where <img src='https://latex.codecogs.com/gif.latex?\Theta'> is the parameters of the neural network, as shown in the figure below:
<img src='pictures/DNNforQ.png' width='300'>

To implement this, I added two fully connected layers in the policy.py file:
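
The added layers themselves fall outside the lines shown in this hunk. As a rough sketch only, assuming a TensorFlow 1.x API and made-up layer sizes (not necessarily the actual code in policy.py), two fully connected layers mapping observations to action probabilities could look like this:

```python
import tensorflow as tf  # assuming TensorFlow 1.x

def build_policy_network(observations, hidden_dim=16, n_actions=2):
    # First fully connected layer with a tanh non-linearity
    hidden = tf.layers.dense(observations, hidden_dim, activation=tf.nn.tanh)
    # Second fully connected layer: one logit per action
    logits = tf.layers.dense(hidden, n_actions, activation=None)
    # Softmax turns the logits into action probabilities
    return tf.nn.softmax(logits)
```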
@@ -14,8 +14,8 @@
where the network's output ```probs``` gives the logits of each action's probability, conditioned on the network input ```self._observations```.

## Problem 2: compute the surrogate loss
- In reinforcement learning, our goal is to maximize the accumulated discounted reward $R_t^i = \sum_{t'=t}^T \gamma^{t'-t} r(s_{t'}, a_{t'})$.
- Maximizing the reward is the same as minimizing the negative reward. So in Problem 2, I added a line to compute the <b>negative</b> surrogate loss $-L(\theta) = -\frac{1}{NT}\sum_{i=1}^N \sum_{t=0}^T \log\pi_\theta(a_t^i | s_t^i)\, R_t^i$ and <b>minimize</b> it.
+ In reinforcement learning, our goal is to maximize the accumulated discounted reward <img src='https://imgur.com/zNVb7qv'>.
+ Maximizing the reward is the same as minimizing the negative reward. So in Problem 2, I added a line to compute the <b>negative</b> surrogate loss <img src='https://imgur.com/vZ393ks'> and <b>minimize</b> it.

```python
surr_loss = tf.reduce_mean((-log_prob)*self._advantages)
```
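
The ```log_prob``` tensor itself is outside this hunk. A hedged sketch of how the chosen actions' log-probabilities and the negative surrogate loss could be wired together, using made-up placeholder names rather than the class's actual members:

```python
import tensorflow as tf  # assuming TensorFlow 1.x

probs = tf.placeholder(tf.float32, [None, 2])    # action probabilities from the policy net
actions = tf.placeholder(tf.int32, [None])       # actions a_t actually taken
advantages = tf.placeholder(tf.float32, [None])  # here: accumulated discounted rewards R_t

# log pi_theta(a_t | s_t): pick each row's probability for the taken action, then take the log
idx = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
log_prob = tf.log(tf.gather_nd(probs, idx) + 1e-8)

# Negative surrogate loss: the mean over all N*T samples of -log pi(a_t|s_t) * R_t
surr_loss = tf.reduce_mean((-log_prob) * advantages)
```

Minimizing ```surr_loss``` with any gradient-based optimizer then performs gradient ascent on the policy-gradient objective.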
@@ -33,7 +33,7 @@ It's clear that the reward increases during training and terminates (converges) i
<img src='pictures/p4.png'>
The run without a baseline converges faster (solved in 76 iterations) than the one with a baseline. It's possible that in this task the noisier gradient without baseline subtraction acted somewhat like "exploration", pushing the agent to find the solution faster than the baseline-subtracted version.
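
For reference, a toy illustration with made-up numbers (not the repo's code) of the only difference between the two runs, using the discounted returns directly versus subtracting a predicted baseline:

```python
import numpy as np

returns = np.array([5.0, 4.2, 3.1, 2.0, 1.0])  # toy discounted returns R_t for one trajectory
values  = np.array([4.8, 4.0, 3.0, 2.2, 0.9])  # hypothetical baseline predictions b(s_t)

adv_without_baseline = returns           # noisier gradient signal
adv_with_baseline    = returns - values  # lower variance when b(s_t) is accurate
```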
## Problem 5: actor-critic implementation with bootstrapping
- In the Actor-Critic algorithm, the actor takes actions based on policy iteration and the critic evaluates those actions based on value iteration (Q-learning); that is, the actor improves the current policy while the critic evaluates (criticizes) it. Here in Problem 5, we changed the advantage function from Problem 3 into $A_t^i = r_t^i + \gamma V_{t+1}^i - V_t^i$ using a one-step bootstrap in policy_gradient/util.py:
+ In the Actor-Critic algorithm, the actor takes actions based on policy iteration and the critic evaluates those actions based on value iteration (Q-learning); that is, the actor improves the current policy while the critic evaluates (criticizes) it. Here in Problem 5, we changed the advantage function from Problem 3 into <img src='https://imgur.com/ovyr0s7'> using a one-step bootstrap in policy_gradient/util.py:
```python
# shift the baseline predictions one step forward: new_b[t] = b[t+1], with 0 after the final step
new_b = np.append(b[1:], 0)
# return the one-step bootstrapped target r_t + discount_rate * V_{t+1}
return x + discount_rate * new_b
```
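
Putting the pieces together, a small hypothetical helper (not the repo's actual function) that uses the same shift-and-append trick to form the one-step bootstrapped advantage described above:

```python
import numpy as np

def one_step_td_advantage(rewards, values, discount_rate):
    # V_{t+1} for every step, treating the value after the final step as 0
    next_values = np.append(values[1:], 0)
    # One-step bootstrapped target: r_t + gamma * V_{t+1}
    targets = rewards + discount_rate * next_values
    # Advantage: target minus the critic's current estimate V_t
    return targets - values
```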
@@ -44,9 +44,9 @@ The bootstrapping actor-critic agent doesn't converge since it's not stable enoug
<img src ='pictures/p5.png'>
## Problem 6: generalized advantage estimation (GAE)
Since the original actor-critic is not stable, in Problem 6 we introduce λ to trade off between the advantage estimates of Problem 3 and Problem 5.
- Assume $\delta_t^i$ represents the one-step bootstrapped TD residual (i.e. $\delta_t^i = r_t^i + \gamma V_{t+1}^i - V_t^i$). The generalized advantage estimate is then:
+ Assume <img src='https://imgur.com/qjRZqX8'> represents the one-step bootstrapped TD residual (i.e. <img src='https://imgur.com/jViBLdI'>). The generalized advantage estimate is then:

- $$A_{t}^{GAE} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$
+ <img src='https://imgur.com/6AQ2kDs'>

Here we use ```util.discount``` to calculate the advantages, discounting the δ sequence by ```discount_rate``` and ```LAMBDA``` (i.e. by γλ).
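
As a self-contained sketch with hypothetical names, where ```discount``` re-implements a discounted cumulative sum in the spirit of ```util.discount```, the whole GAE computation reduces to discounting the δ sequence by γλ:

```python
import numpy as np

def discount(x, rate):
    # Discounted cumulative sum: y[t] = sum_{l>=0} rate**l * x[t + l]
    y = np.zeros(len(x))
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + rate * running
        y[t] = running
    return y

def gae_advantages(rewards, values, discount_rate, lam):
    # delta_t = r_t + gamma * V_{t+1} - V_t  (value after the final step taken as 0)
    next_values = np.append(values[1:], 0)
    deltas = rewards + discount_rate * next_values - values
    # A_t^GAE = sum_l (gamma * lambda)**l * delta_{t+l}
    return discount(deltas, discount_rate * lam)
```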
