526 changes: 330 additions & 196 deletions Lab3-policy-gradient.ipynb

Large diffs are not rendered by default.

Binary file added imgs/tanh.png
Binary file added imgs/wrong-award.png
Binary file added imgs/wrong-loss.png
Binary file added policy_gradient/__pycache__/__init__.cpython-36.pyc
Binary file not shown.
Binary file added policy_gradient/__pycache__/policy.cpython-36.pyc
Binary file not shown.
Binary file added policy_gradient/__pycache__/util.cpython-36.pyc
Binary file not shown.
6 changes: 6 additions & 0 deletions policy_gradient/policy.py
@@ -31,6 +31,11 @@ def __init__(self, in_dim, out_dim, hidden_dim, optimizer, session):
"""
# YOUR CODE HERE >>>>>>
# <<<<<<<<
hidden1 = tf.layers.dense(inputs=self._observations, units=hidden_dim, activation=tf.nn.tanh, name='hidden1')
# hidden2 = tf.layers.dense(inputs=hidden1, units=out_dim, activation=tf.nn.tanh, name='hidden2')
# Mistake: using tf.nn.tanh as the activation of the second layer; it should output raw logits instead.
hidden2 = tf.layers.dense(inputs=hidden1, units=out_dim, activation=None, name='hidden2')
probs = tf.nn.softmax(hidden2)  # softmax over the logits gives the action probabilities

# --------------------------------------------------
# This operation (variable) is used when choosing action during data sampling phase
@@ -72,6 +77,7 @@ def __init__(self, in_dim, out_dim, hidden_dim, optimizer, session):
Sample solution is about 1~3 lines.
"""
# YOUR CODE HERE >>>>>>
# Negative of the advantage-weighted log-probability, so minimizing it performs policy-gradient ascent.
surr_loss = -tf.reduce_mean(self._advantages * log_prob)
# <<<<<<<<

grads_and_vars = self._opt.compute_gradients(surr_loss)
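Putting the two added pieces together, here is a minimal standalone sketch of the policy network and surrogate loss, assuming TensorFlow 1.x (where `tf.placeholder` and `tf.layers.dense` exist). The placeholder names and the log-probability computation are illustrative stand-ins, not the homework's actual attribute names.

```python
# Minimal sketch: two-layer softmax policy + surrogate loss (TF 1.x assumed).
import tensorflow as tf

in_dim, out_dim, hidden_dim = 4, 2, 8            # e.g. CartPole sizes (illustrative)
obs_ph = tf.placeholder(tf.float32, [None, in_dim])
actions_ph = tf.placeholder(tf.int32, [None])
advantages_ph = tf.placeholder(tf.float32, [None])

# Two-layer policy: tanh only on the first layer, raw logits on the second.
hidden1 = tf.layers.dense(obs_ph, hidden_dim, activation=tf.nn.tanh, name='hidden1')
logits = tf.layers.dense(hidden1, out_dim, activation=None, name='hidden2')
probs = tf.nn.softmax(logits)

# Log-probability of the actions actually taken, then the surrogate loss:
# maximizing E[advantage * log pi(a|s)] == minimizing the negative mean.
idxs = tf.stack([tf.range(tf.shape(actions_ph)[0]), actions_ph], axis=1)
log_prob = tf.log(tf.gather_nd(probs, idxs) + 1e-8)
surr_loss = -tf.reduce_mean(advantages_ph * log_prob)
train_op = tf.train.AdamOptimizer(1e-2).minimize(surr_loss)
```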
3 changes: 3 additions & 0 deletions policy_gradient/util.py
@@ -32,6 +32,9 @@ def discount_bootstrap(x, discount_rate, b):
Sample code should be about 3 lines
"""
# YOUR CODE >>>>>>>>>>>>>>>>>>>
# One-step bootstrapped target: x[t] + discount_rate * b[t+1], with the value after the last step treated as 0.
b[0] = 0                                   # after np.roll(b, -1) this zero lands in the last position
return x + discount_rate * np.roll(b, -1)  # shift the baseline left by one step so index t sees b[t+1]

# <<<<<<<<<<<<<<<<<<<<<<<<<<<<

def plot_curve(data, key, filename=None):
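A small numeric check of what the added `discount_bootstrap` lines compute, under the assumption (based on the signature and how it is used in this lab) that `x` holds per-step rewards and `b` holds the predicted values of the same time steps:

```python
# target[t] = x[t] + discount_rate * b[t+1], with the value after the last step taken as 0.
import numpy as np

x = np.array([1.0, 1.0, 1.0])    # rewards r_0..r_2
b = np.array([0.5, 0.4, 0.3])    # baseline values V(s_0)..V(s_2)
discount_rate = 0.99

b = b.copy()                      # copy to avoid mutating the caller's array (the diff mutates b in place)
b[0] = 0                          # after the roll, this zero lands in the last slot
target = x + discount_rate * np.roll(b, -1)
print(target)                     # [1 + 0.99*0.4, 1 + 0.99*0.3, 1 + 0.99*0]
```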
22 changes: 20 additions & 2 deletions report.md
@@ -1,3 +1,21 @@
# Homework3-Policy-Gradient report
# Homework3 report
In this lab, I implemented several variants of policy gradient methods. The problems described below are the ones I think are worth mentioning.

TA: try to elaborate the algorithms that you implemented and any details worth mentioned.
1. For Problem 1, I didn't notice that the tanh activation is meant only for the first layer, and set the activation of the second layer to tanh as well. Although the network could still converge, it took much longer to solve the CartPole problem: the fastest of my several runs solved it in 120 iterations, and it generally takes about 160 iterations, roughly 2~3 times slower than the ~60 iterations of the correct implementation. The loss curve and average return of this slower implementation are shown below.
![alt text](imgs/wrong-loss.png)
![alt text](imgs/wrong-award.png)
![alt text](imgs/tanh.png)

This makes sense: tanh squashes the output of the second layer into (-1, 1), so the logits fed into the softmax are confined to a narrow range. The resulting probabilities stay close to uniform and the gradients are correspondingly small, which is why it takes longer to solve the problem (see the numeric sketch after this list).


2. The following table shows the experiment results I finally obtained.

| Problem Num | Algorithm | Num of Iterations to solve the problem |
| ---------- |:-------------:| -----:|
| 1,2,3 | with baseline in gradient estimate | 70 |
| 4 | without baseline | 80 |
| 5 | Actor-Critic algorithm (with bootstrapping) | 200 |
| 6 | Generalized Advantage Estimation | 76 |

3. I also found that the number of iterations needed to solve a problem is unstable across runs. Take Problem 6 for example: it varies from 70 to 100 iterations. The amount of variance might also depend on the specific algorithm.
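A quick numeric illustration of the point in item 1 (this is an editor-added sketch, not part of the homework code): squashing the second layer's output with tanh confines the logits to (-1, 1), so the softmax stays close to uniform even when the unconstrained logits would already express a confident policy.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

raw_logits = np.array([3.0, -3.0])    # unconstrained second-layer output
squashed = np.tanh(raw_logits)        # confined to (-1, 1)

print(softmax(raw_logits))  # ~[0.9975, 0.0025] -> confident policy
print(softmax(squashed))    # ~[0.88, 0.12]     -> much closer to uniform, smaller gradient signal
```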