# Homework3-Policy-Gradient report
## problem1 construct a neural network to represent policy
In more complex tasks (Atari games, and even real-world tasks), it is hard to apply policy iteration / value iteration directly: the large state/action space requires huge storage and makes it infeasible to compute the Q values for every state-action pair. So we "learn" the Q values or the policy with a neural network. Here in problem 1, we use a simple neural network $f_{Q^*}(s, a; \Theta)$ to represent $Q^*(s, a)$, where $\Theta$ denotes the parameters of the neural network, as in the figure shown below:
<img src='pictures/DNNforQ.png' width='300'>
To implement this, I added two fully connected layers in the policy.py file:
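The diff does not show the layers themselves, so the snippet below is only a rough sketch of what two fully connected layers could look like, assuming TensorFlow 1.x; the name `build_policy_network`, the sizes `hidden_dim`/`num_actions`, and the tanh/softmax activations are placeholders, not necessarily the assignment's actual choices:

```python
import tensorflow as tf  # assuming TensorFlow 1.x

def build_policy_network(observations, hidden_dim=16, num_actions=2):
    # First fully connected layer: observations -> hidden features (tanh is an assumption).
    hidden = tf.layers.dense(observations, hidden_dim, activation=tf.nn.tanh)
    # Second fully connected layer: hidden features -> per-action probabilities.
    probs = tf.layers.dense(hidden, num_actions, activation=tf.nn.softmax)
    return probs
```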
where the network's output ```probs``` represents each action's probability conditioned on the network's input ```self._observations```.
## problem2 compute the surrogate loss
In reinforcement learning, our goal is to maximize the accumulated discounted reward $R_t^i = \sum_{t'=t}^T \gamma^{t'-t} r(s_{t'}, a_{t'})$.
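For reference, here is a small sketch of how $R_t^i$ can be computed from one episode's reward sequence with a backward pass; `discount_rewards` is a hypothetical helper, not necessarily the assignment's own util function:

```python
import numpy as np

def discount_rewards(rewards, gamma):
    # R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..., accumulated from the end backwards.
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```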
We want to maximize the reward, which is the same as minimizing the negative reward. So in problem 2 I added a line that computes the <b>negative</b> surrogate loss $-L(\theta) = -\frac{1}{NT}\sum_{i=1}^N \sum_{t=0}^T \log\pi_\theta(a_t^i \mid s_t^i)\, R_t^i$ and <b>minimizes</b> it.
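As a sketch of that single line, assuming TensorFlow 1.x and hypothetical tensor names (`log_prob` for $\log\pi_\theta(a_t^i \mid s_t^i)$ and `returns` for $R_t^i$, both flattened over episodes and time steps):

```python
import tensorflow as tf  # assuming TensorFlow 1.x

log_prob = tf.placeholder(tf.float32, shape=[None])  # log pi_theta(a | s) for every sampled step
returns = tf.placeholder(tf.float32, shape=[None])   # accumulated discounted reward R_t for the same steps

# reduce_mean over all N*T samples supplies the 1/(NT) factor; the leading minus sign
# turns "maximize the expected return" into a loss suitable for a minimizer.
surr_loss = -tf.reduce_mean(log_prob * returns)
```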
It's clear that the reward increases during training and terminated (converged) in …
<img src='pictures/p4.png'>
The run without the baseline converges faster (solved in 76 iterations) than the one with the baseline. It's possible that in this task the less stable gradient pushed the agent into something like "exploration", and it therefore found the solution faster than the run with baseline subtraction.
## problem5 actor-critic implementation with bootstrapping
In the Actor-Critic algorithm, the actor takes actions based on policy iteration and the critic evaluates those actions based on value iteration (Q-learning). Actor-Critic combines the two: the actor improves the current policy, while the critic evaluates (criticizes) it. Here in problem 5, we change the advantage function of problem 3 into $A_t^i = r_t^i + \gamma V_{t+1}^i - V_t^i$ using a one-step bootstrap in policy_gradient/util.py:
```python
new_b = np.append(b[1:], 0)  # shift b one step left: new_b[t] = b[t+1], with 0 after the terminal step
return x + discount_rate * new_b  # one-step bootstrapped estimate: x_t + discount_rate * b_{t+1}
```
The bootstrapping actor-critic agent doesn't converge since it's not stable enough …
Since the original actor-critic is not stable, in problem 6 we introduce λ to trade off between the advantage functions of problem 3 and problem 5.
Assume $\delta_t^i$ represents the one-step bootstrapped estimate (e.g. $\delta_t^i = r_t^i + \gamma V_{t+1}^i - V_t^i$). The generalized advantage estimation will then be:
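The formula itself is cut off in this diff; the standard generalized advantage estimator from Schulman et al. (2016), which is presumably what is intended here, is

$A_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$

so that λ = 0 recovers the one-step estimate of problem 5, while λ = 1 telescopes into the Monte-Carlo-style advantage of problem 3.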