526 changes: 330 additions & 196 deletions Lab3-policy-gradient.ipynb

Large diffs are not rendered by default.

Binary file added imgs/tanh.png
Binary file added imgs/wrong-award.png
Binary file added imgs/wrong-loss.png
Binary file added policy_gradient/__pycache__/__init__.cpython-36.pyc
Binary file not shown.
Binary file added policy_gradient/__pycache__/policy.cpython-36.pyc
Binary file not shown.
Binary file added policy_gradient/__pycache__/util.cpython-36.pyc
Binary file not shown.
6 changes: 6 additions & 0 deletions policy_gradient/policy.py
@@ -31,6 +31,11 @@ def __init__(self, in_dim, out_dim, hidden_dim, optimizer, session):
"""
# YOUR CODE HERE >>>>>>
# <<<<<<<<
hidden1 = tf.layers.dense(inputs=self._observations, units=hidden_dim, activation=tf.nn.tanh, name='hidden1')
# hidden2 = tf.layers.dense(inputs=hidden1, units=out_dim, activation=tf.nn.tanh, name='hidden2')
# Mistake: using tf.nn.tanh as the activation of the second layer; it should output raw logits instead.
hidden2 = tf.layers.dense(inputs=hidden1, units=out_dim, activation=None, name='hidden2')
probs = tf.nn.softmax(hidden2)  # softmax over the logits gives the action probabilities

# --------------------------------------------------
# This operation (variable) is used when choosing action during data sampling phase
@@ -72,6 +77,7 @@ def __init__(self, in_dim, out_dim, hidden_dim, optimizer, session):
Sample solution is about 1~3 lines.
"""
# YOUR CODE HERE >>>>>>
# Negative of the advantage-weighted log-probability, so minimizing it performs policy-gradient ascent.
surr_loss = -tf.reduce_mean(self._advantages * log_prob)
# <<<<<<<<

grads_and_vars = self._opt.compute_gradients(surr_loss)
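Putting the two added pieces together, here is a minimal standalone sketch of the policy network and surrogate loss, assuming TensorFlow 1.x (where `tf.placeholder` and `tf.layers.dense` exist). The placeholder names and the log-probability computation are illustrative stand-ins, not the homework's actual attribute names.

```python
# Minimal sketch: two-layer softmax policy + surrogate loss (TF 1.x assumed).
import tensorflow as tf

in_dim, out_dim, hidden_dim = 4, 2, 8            # e.g. CartPole sizes (illustrative)
obs_ph = tf.placeholder(tf.float32, [None, in_dim])
actions_ph = tf.placeholder(tf.int32, [None])
advantages_ph = tf.placeholder(tf.float32, [None])

# Two-layer policy: tanh only on the first layer, raw logits on the second.
hidden1 = tf.layers.dense(obs_ph, hidden_dim, activation=tf.nn.tanh, name='hidden1')
logits = tf.layers.dense(hidden1, out_dim, activation=None, name='hidden2')
probs = tf.nn.softmax(logits)

# Log-probability of the actions actually taken, then the surrogate loss:
# maximizing E[advantage * log pi(a|s)] == minimizing the negative mean.
idxs = tf.stack([tf.range(tf.shape(actions_ph)[0]), actions_ph], axis=1)
log_prob = tf.log(tf.gather_nd(probs, idxs) + 1e-8)
surr_loss = -tf.reduce_mean(advantages_ph * log_prob)
train_op = tf.train.AdamOptimizer(1e-2).minimize(surr_loss)
```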
3 changes: 3 additions & 0 deletions policy_gradient/util.py
@@ -32,6 +32,9 @@ def discount_bootstrap(x, discount_rate, b):
Sample code should be about 3 lines
"""
# YOUR CODE >>>>>>>>>>>>>>>>>>>
# One-step bootstrapped target: x[t] + discount_rate * b[t+1], with the value after the last step treated as 0.
b[0] = 0                                   # after np.roll(b, -1) this zero lands in the last position
return x + discount_rate * np.roll(b, -1)  # shift the baseline left by one step so index t sees b[t+1]

# <<<<<<<<<<<<<<<<<<<<<<<<<<<<

def plot_curve(data, key, filename=None):
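A small numeric check of what the added `discount_bootstrap` lines compute, under the assumption (based on the signature and how it is used in this lab) that `x` holds per-step rewards and `b` holds the predicted values of the same time steps:

```python
# target[t] = x[t] + discount_rate * b[t+1], with the value after the last step taken as 0.
import numpy as np

x = np.array([1.0, 1.0, 1.0])    # rewards r_0..r_2
b = np.array([0.5, 0.4, 0.3])    # baseline values V(s_0)..V(s_2)
discount_rate = 0.99

b = b.copy()                      # copy to avoid mutating the caller's array (the diff mutates b in place)
b[0] = 0                          # after the roll, this zero lands in the last slot
target = x + discount_rate * np.roll(b, -1)
print(target)                     # [1 + 0.99*0.4, 1 + 0.99*0.3, 1 + 0.99*0]
```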
22 changes: 20 additions & 2 deletions report.md
@@ -1,3 +1,21 @@
# Homework3-Policy-Gradient report
# Homework3 report
In this lab, I implemented several variants of policy gradient methods. The problems described below are the ones I think are worth mentioning.

TA: try to elaborate the algorithms that you implemented and any details worth mentioned.
1. For Problem 1, I didn't notice that the tanh activation is meant only for the first layer, and set the activation of the second layer to tanh as well. Although the network could still converge, it took much longer to solve the CartPole problem: the fastest of my several runs solved it in 120 iterations, and it generally takes about 160 iterations, roughly 2~3 times slower than the ~60 iterations of the correct implementation. The loss curve and average return of this slower implementation are shown below.
![alt text](imgs/wrong-loss.png)
![alt text](imgs/wrong-award.png)
![alt text](imgs/tanh.png)

This makes sense: tanh squashes the output of the second layer into (-1, 1), so the logits fed into the softmax are confined to a narrow range. The resulting probabilities stay close to uniform and the gradients are correspondingly small, which is why it takes longer to solve the problem (see the numeric sketch after this list).


2. The following table shows the experiment results I finally obtained.

| Problem Num | Algorithm | Num of Iterations to solve the problem |
| ---------- |:-------------:| -----:|
| 1,2,3 | with baseline in gradient estimate | 70 |
| 4 | without baseline | 80 |
| 5 | Actor-Critic algorithm (with bootstrapping) | 200 |
| 6 | Generalized Advantage Estimation | 76 |

3. I also found that the number of iterations needed to solve a problem is unstable across runs. Take Problem 6 for example: it varies from 70 to 100 iterations. The amount of variance might also depend on the specific algorithm.
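A quick numeric illustration of the point in item 1 (this is an editor-added sketch, not part of the homework code): squashing the second layer's output with tanh confines the logits to (-1, 1), so the softmax stays close to uniform even when the unconstrained logits would already express a confident policy.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

raw_logits = np.array([3.0, -3.0])    # unconstrained second-layer output
squashed = np.tanh(raw_logits)        # confined to (-1, 1)

print(softmax(raw_logits))  # ~[0.9975, 0.0025] -> confident policy
print(softmax(squashed))    # ~[0.88, 0.12]     -> much closer to uniform, smaller gradient signal
```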