824 changes: 619 additions & 205 deletions Lab3-policy-gradient.ipynb

Large diffs are not rendered by default.

Binary file added pictures/DNNforQ.png
Binary file added pictures/p123.png
Binary file added pictures/p4.png
Binary file added pictures/p5.png
Binary file added pictures/p6.png
4 changes: 4 additions & 0 deletions policy_gradient/policy.py
@@ -30,6 +30,9 @@ def __init__(self, in_dim, out_dim, hidden_dim, optimizer, session):
Sample solution is about 2~4 lines.
"""
# YOUR CODE HERE >>>>>>
fc1 = tf.contrib.layers.fully_connected(self._observations, num_outputs=hidden_dim, activation_fn=tf.tanh)
fc2 = tf.contrib.layers.fully_connected(fc1, num_outputs=out_dim, activation_fn=None)
probs = tf.nn.softmax(fc2)
# <<<<<<<<

# --------------------------------------------------
@@ -72,6 +75,7 @@ def __init__(self, in_dim, out_dim, hidden_dim, optimizer, session):
Sample solution is about 1~3 lines.
"""
# YOUR CODE HERE >>>>>>
surr_loss = tf.reduce_mean((-log_prob)*self._advantages)
# <<<<<<<<

grads_and_vars = self._opt.compute_gradients(surr_loss)
2 changes: 2 additions & 0 deletions policy_gradient/util.py
@@ -32,6 +32,8 @@ def discount_bootstrap(x, discount_rate, b):
Sample code should be about 3 lines
"""
# YOUR CODE >>>>>>>>>>>>>>>>>>>
new_b = np.append(b[1:], 0)
return x + discount_rate * new_b
# <<<<<<<<<<<<<<<<<<<<<<<<<<<<

def plot_curve(data, key, filename=None):
56 changes: 55 additions & 1 deletion report.md
@@ -1,3 +1,57 @@
# Homework3-Policy-Gradient report
## Problem 1: construct a neural network to represent the policy
In more complex tasks (Atari games, and even real-world tasks), it is hard to apply policy iteration / value iteration directly: the state/action spaces are large, so storing and computing the Q values for every state-action pair is infeasible. Instead, we "learn" the Q values or the policy with a neural network. Here in Problem 1, we use a simple neural network <img src="https://latex.codecogs.com/gif.latex?f_{Q^*}(s,%20a;\Theta)"> to represent <img src='https://latex.codecogs.com/gif.latex?Q^*%20(s,%20a)'>, where <img src='https://latex.codecogs.com/gif.latex?\Theta'> denotes the parameters of the neural network, as in the figure below:
<img src='pictures/DNNforQ.png' width='300'>

To implement this, I added two fully connected layers in `policy.py`:
```python
# hidden layer with tanh activation, then a linear layer producing one logit per action
fc1 = tf.contrib.layers.fully_connected(self._observations, num_outputs=hidden_dim, activation_fn=tf.tanh)
fc2 = tf.contrib.layers.fully_connected(fc1, num_outputs=out_dim, activation_fn=None)
# softmax turns the logits into a probability distribution over actions
probs = tf.nn.softmax(fc2)
```


where the network output ```probs``` is the probability of each action conditioned on the input ```self._observations``` (```fc2``` holds the raw logits, and the softmax converts them into probabilities).
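
As a side note, here is a minimal NumPy sketch (not part of the homework code; names and numbers are illustrative) of how such a probability vector could be turned into a sampled action:
```python
import numpy as np

def sample_action(probs, rng=np.random):
    """Sample an action index from the categorical distribution given by `probs`."""
    probs = np.asarray(probs, dtype=np.float64)
    probs = probs / probs.sum()              # guard against small numerical drift
    return rng.choice(len(probs), p=probs)

# e.g. a 2-action CartPole-style policy output
print(sample_action([0.7, 0.3]))             # prints 0 roughly 70% of the time
```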

TA: try to elaborate on the algorithms that you implemented and any details worth mentioning.
## Problem 2: compute the surrogate loss
In reinforcement learning, our goal is to maximize the accumulated discounted reward <img src='https://imgur.com/zNVb7qv.png'>.
Maximizing the reward is the same as minimizing the negative reward, so in Problem 2 I added one line that computes the <b>negative</b> surrogate loss <img src='https://imgur.com/vZ393ks.png'> and <b>minimizes</b> it:

```python
surr_loss = tf.reduce_mean((-log_prob)*self._advantages)
```
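To make the formula concrete, here is a small NumPy check (toy numbers, not taken from the homework) of what this line computes:
```python
import numpy as np

# log-probabilities of the actions actually taken at three timesteps, and their advantages
log_prob = np.log(np.array([0.7, 0.2, 0.9]))
advantages = np.array([1.0, -0.5, 2.0])

surr_loss = np.mean(-log_prob * advantages)   # same expression as the TensorFlow line above
print(surr_loss)  # minimizing this raises the probability of high-advantage actions
```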
## Problem 3: use a baseline to reduce variance
In Problem 3, we reduce the variance of the gradient estimate. To achieve this, we replace the ```Reward``` in the surrogate loss with ```Reward - Baseline```. Subtracting a baseline leaves the expected gradient unchanged but lowers its variance (loosely reminiscent of subtracting a reference signal, as in residual nets), which helps make learning more stable:
```python
# advantage = discounted return minus the predicted baseline value
a = r - b
```
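For context, here is a minimal sketch of how `r` and `b` could be formed, assuming `r` is the discounted return-to-go and `b` a value-function estimate (names and numbers are illustrative, not the homework's exact helpers):
```python
import numpy as np

def discount_cumsum(rewards, gamma):
    """Discounted return-to-go: R_t = sum_k gamma**k * r_{t+k}."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

r = discount_cumsum(np.array([1.0, 1.0, 1.0]), gamma=0.99)  # -> [2.9701, 1.99, 1.0]
b = np.array([2.5, 1.8, 0.9])                                # e.g. predicted state values
a = r - b                                                    # advantage after baseline subtraction
```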
## Verifying Solution
After implementing Problems 1, 2, and 3, I trained my agent and obtained the average-return curve below:
<img src='pictures/p123.png'>
The reward clearly increases during training, and the run terminated (converged) in 83 episodes.
## Problem 4: compare results with and without a baseline
<img src='pictures/p4.png'>
The run without the baseline converges faster (solved in 76 iterations) than the one with the baseline. Possibly, in this task the noisier gradients acted a bit like extra exploration and pushed the agent toward the solution faster than baseline subtraction did.
## Problem 5: actor-critic with bootstrapping
In the actor-critic algorithm, the actor selects actions based on the current policy (as in policy iteration) and the critic evaluates those actions with a learned value function (as in value iteration / Q-learning). The algorithm combines the two: the actor improves the current policy while the critic evaluates (criticizes) it. In Problem 5, we replace the advantage function from Problem 3 with <img src='https://imgur.com/ovyr0s7.png'>, using a one-step bootstrap in policy_gradient/util.py:
```python
# shift the baseline so new_b[t] = V(s_{t+1}), with 0 after the final step
new_b = np.append(b[1:], 0)
# one-step bootstrapped target: r_t + gamma * V(s_{t+1})
return x + discount_rate * new_b
```
which replaces the total discounted return with the immediate reward plus the discounted baseline estimate of the next state.
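
As a quick sanity check, the same logic rerun on toy numbers (values chosen purely for illustration):
```python
import numpy as np

def discount_bootstrap(x, discount_rate, b):
    """One-step target r_t + gamma * V(s_{t+1}), with V treated as 0 after the last step."""
    new_b = np.append(b[1:], 0)
    return x + discount_rate * new_b

x = np.array([1.0, 1.0, 1.0])            # immediate rewards
b = np.array([2.9, 2.0, 1.0])            # baseline estimates V(s_t)
print(discount_bootstrap(x, 0.99, b))    # -> [2.98, 1.99, 1.0]
```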

The bootstrapping actor-critic agent does not converge, since the one-step estimate is not stable enough.
<img src ='pictures/p5.png'>
## Problem 6: generalized advantage estimation (GAE)
Since the plain actor-critic is not stable, Problem 6 introduces λ to blend the advantage estimates of Problems 3 and 5.
Let <img src='https://imgur.com/qjRZqX8.png'> denote the i-step bootstrapped estimate (e.g. <img src='https://imgur.com/jViBLdI.png'>). The generalized advantage estimate is then:

<img src='https://imgur.com/6AQ2kDs.png'>
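
Equivalently, in the commonly used δ-form (my reconstruction; the image above may use slightly different notation):
```latex
\hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_t
  = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l},
\qquad
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
```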

Here we use ```util.discount``` to compute the advantages with rate ```discount_rate * LAMBDA```:

```python
# accumulate the per-step advantages with rate gamma * lambda to obtain the GAE
a = util.discount(a, self.discount_rate * LAMBDA)
```
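A self-contained sketch of the same computation on toy numbers, assuming ```util.discount``` implements the recurrence y_t = x_t + rate * y_{t+1} (an assumption; the actual helper may differ):
```python
import numpy as np

def discount(x, rate):
    """y_t = x_t + rate * y_{t+1}, computed backwards over the trajectory."""
    y = np.zeros(len(x))
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + rate * running
        y[t] = running
    return y

# one-step TD errors delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), then GAE
r = np.array([1.0, 1.0, 1.0])
v = np.array([2.9, 2.0, 1.0])
gamma, LAMBDA = 0.99, 0.98
delta = r + gamma * np.append(v[1:], 0) - v
a = discount(delta, gamma * LAMBDA)   # generalized advantage estimates
print(a)
```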
The GAE agent converged in 73 episodes.
<img src ='pictures/p6.png'>