COSE490: Understand the fundamentals of Reinforcement Learning and basic techniques
This is a repository of notes for Special Topics in Computer Science (Reinforcement Learning).
It contains the completed homework assignments, from basic RL techniques up to DQN.
Goal: The goal of the first homework is (i) to familiarize the student with the simulator environment, called OpenAI Gym, and (ii) to understand the basic programming structure of RL.
Tasks: Use the provided file and run the program in it. Understand how one can implement policy evaluation, policy improvement, policy iteration, and value iteration (a structural sketch is given after the tasks below).
Task 1: Set up the programming environment needed to run the provided program (e.g., Python, Jupyter Notebook, and the packages for RL and OpenAI Gym). The provided file is in Jupyter notebook format.
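If the setup is correct, a minimal check like the one below should run without errors. It assumes the classic `gym` package and the FrozenLake environment that the provided notebook appears to use; the exact environment id and API details depend on the installed gym version.

```python
# Quick setup check: if these imports and calls work, the notebook should run too.
# Assumes the classic OpenAI Gym package, installed e.g. with `pip install gym`.
import gym

env = gym.make("FrozenLake-v1")   # use "FrozenLake-v0" on older gym releases
state = env.reset()               # newer gym versions return (state, info) instead
print("observation space:", env.observation_space)
print("action space:", env.action_space)
print("initial state:", state)
```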
Task 2: Make the (provided) program run without error.
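For reference, here is a minimal sketch of how iterative policy evaluation and value iteration are typically structured on a tabular Gym environment, using the transition model exposed as `env.P[state][action] = [(prob, next_state, reward, done), ...]`. This only illustrates the structure the notebook asks you to understand; it is not the provided code itself.

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.99, theta=1e-8):
    """Iteratively evaluate V^pi for a deterministic policy (array: state -> action)."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = sum(prob * (r + gamma * V[s2] * (not done))
                    for prob, s2, r, done in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def value_iteration(P, n_actions, gamma=0.99, theta=1e-8):
    """Compute the optimal value function and a greedy policy from the model P."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            q = [sum(prob * (r + gamma * V[s2] * (not done))
                     for prob, s2, r, done in P[s][a]) for a in range(n_actions)]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # greedy policy extraction from the converged value function
    policy = np.array([
        np.argmax([sum(prob * (r + gamma * V[s2] * (not done))
                       for prob, s2, r, done in P[s][a]) for a in range(n_actions)])
        for s in range(n_states)
    ])
    return V, policy
```

For FrozenLake, `P` is available as `env.P` (or `env.unwrapped.P`) and `n_actions = env.action_space.n`; policy iteration simply alternates `policy_evaluation` with a greedy improvement step until the policy stops changing.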
Goal: The goal of the second homework is (i) to familiarize the student with implementing an RL algorithm, and (ii) to understand the online rollout algorithm.
Tasks: Use the provided notebook file, and complete the “do_rollout” function (a structural sketch is given after the description below).
- We have modified the Frozen Lake (FL) environment. The new MDP is described by the matrix “P” (see line 72 in cell [5] of the notebook file). The transitions and rewards are deterministic. There is no hole; instead, several states yield a negative reward (-5 or -10) when entered. The reward value for each (state, action) pair is given in the accompanying figure (states shown in blue, with one reward per action direction).
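The exact signature of `do_rollout` is defined in the notebook; the sketch below only illustrates the general idea of an online rollout on a deterministic model `P` (same `(prob, next_state, reward, done)` format as in Gym): each candidate action is simulated for a fixed horizon while following a base policy afterwards, and the best-scoring action is chosen. All names and arguments here are illustrative, not the notebook's.

```python
def rollout_action(P, state, base_policy, n_actions, horizon=20, gamma=1.0):
    """Pick the action whose simulated return (one step + base policy thereafter) is largest.
    Assumes deterministic transitions: each P[s][a] holds a single (1.0, s', r, done) tuple."""
    def simulate(s, a, steps_left):
        _, s2, r, done = P[s][a][0]          # deterministic: exactly one outcome
        if done or steps_left == 0:
            return r
        # keep following the base policy from the successor state
        return r + gamma * simulate(s2, base_policy[s2], steps_left - 1)

    returns = [simulate(state, a, horizon) for a in range(n_actions)]
    return int(max(range(n_actions), key=lambda a: returns[a]))
```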
Goal: The goal of the third homework is to practice Deep Q Networks (DQN).
(i) In the previous homework, you implemented a lookup table to store the values of state-action pairs. This was possible because the state-action space was small.
(ii) In this homework, you train the agent with the DQN algorithm, which uses a neural network to approximate the Q function.
Tasks: Use the provided notebook file, and complete the functions.
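To make "a neural network that approximates the Q function" concrete: since Taxi-v3 has a small discrete state space, one common (but by no means the only) choice is to one-hot encode the state index and map it to one Q-value per action. The sketch below assumes PyTorch; the provided notebook may use a different framework or input encoding.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a one-hot encoded discrete state to one Q-value per action."""
    def __init__(self, n_states=500, n_actions=6, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state_ids):
        # state_ids: LongTensor of shape (batch,) holding Taxi-v3 state indices
        x = nn.functional.one_hot(state_ids, num_classes=self.net[0].in_features).float()
        return self.net(x)  # shape (batch, n_actions)
```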
In this homework, we use the taxi environment (“Taxi-v3”) provided by OpenAI Gym. Find the environment details at
https://github.com/openai/gym/blob/master/gym/envs/toy_text/taxi.py (detailed descriptions and code are also provided in the first cell of the given notebook). The goal of this environment is to control the taxi so that it picks up a passenger and drives to the destination.
We consider a 5x5 grid with a total of 25 “squares”, containing four designated “locations” indicated by R(ed), G(reen), Y(ellow), and B(lue). When an episode starts, the taxi starts at a random square and the passenger is at a random location. The taxi drives to the passenger's location, picks up the passenger, drives to the passenger's destination (another one of the four designated locations), and then drops off the passenger. Once the passenger is dropped off, the episode ends.
Observations: There are 500 discrete states, since there are 25 taxi positions, 5 possible locations of the passenger (including the case when the passenger is in the taxi), and 4 destination locations (25 × 5 × 4 = 500).
Passenger locations:
- 0: R(ed)
- 1: G(reen)
- 2: Y(ellow)
- 3: B(lue)
- 4: in taxi
Destinations:
- 0: R(ed)
- 1: G(reen)
- 2: Y(ellow)
- 3: B(lue)
Actions: There are 6 discrete deterministic actions:
- 0: move south
- 1: move north
- 2: move east
- 3: move west
- 4: pickup passenger
- 5: drop off passenger
Rewards:
- There is a default per-step reward of -1
- Delivering the passenger: +20
- Executing "pickup" or "drop-off" actions illegally: -10
State space = (taxi_row, taxi_col, passenger_location, destination)
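The environment exposes this encoding directly in the taxi.py source linked above: a flat state index in [0, 500) can be decoded into this tuple and encoded back. The quick check below (which also shows the default -1 step reward listed above) assumes a classic gym version where `reset()` returns the state index and `step()` returns four values.

```python
import gym

env = gym.make("Taxi-v3")
state = env.reset()            # newer gym versions return (state, info) instead

# decode the flat state index into the (row, col, passenger, destination) tuple
taxi_row, taxi_col, passenger_loc, destination = env.unwrapped.decode(state)
print(taxi_row, taxi_col, passenger_loc, destination)
assert env.unwrapped.encode(taxi_row, taxi_col, passenger_loc, destination) == state

next_state, reward, done, info = env.step(0)   # action 0 = move south
print(reward)                                  # -1: the default per-step reward
```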
Task 1: Complete the main code and the training function (marked in the notebook). We recommend that you refer to the sources cited at the head of the notebook. If you refer to other sources, you MUST specify those references (in the notebook, as Markdown).
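As a reference for the overall structure only (not the notebook's code), a minimal DQN training loop combines an epsilon-greedy behavior policy, a replay buffer, and a periodically synchronized target network. The sketch below assumes PyTorch, the `QNetwork` sketched earlier, and the classic gym `step()` API returning `(state, reward, done, info)`; all names and default values are assumptions.

```python
import random
from collections import deque

import torch
import torch.nn as nn

def train_dqn(env, q_net, target_net, episodes=2000, gamma=0.99,
              lr=1e-3, batch_size=64, buffer_size=50_000,
              eps_start=1.0, eps_end=0.05, eps_decay=0.999,
              target_sync_every=500):
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    buffer = deque(maxlen=buffer_size)
    eps, step_count = eps_start, 0

    for ep in range(episodes):
        state = env.reset()                      # classic gym API; newer gym returns (state, info)
        done, ep_return = False, 0.0
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    action = int(q_net(torch.tensor([state])).argmax(dim=1).item())

            next_state, reward, done, _ = env.step(action)
            buffer.append((state, action, reward, next_state, done))
            state, ep_return = next_state, ep_return + reward
            step_count += 1

            # one gradient step on a random minibatch from the replay buffer
            if len(buffer) >= batch_size:
                batch = random.sample(buffer, batch_size)
                s, a, r, s2, d = map(torch.tensor, zip(*batch))
                q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
                with torch.no_grad():
                    target = r.float() + gamma * target_net(s2).max(dim=1).values * (~d)
                loss = nn.functional.mse_loss(q_sa, target)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            # periodically copy online weights into the target network
            if step_count % target_sync_every == 0:
                target_net.load_state_dict(q_net.state_dict())

        eps = max(eps_end, eps * eps_decay)
        if (ep + 1) % 100 == 0:
            print(f"episode {ep + 1}: return {ep_return:.1f}, epsilon {eps:.2f}")
```

A typical call would create two networks, copy the online weights into the target once with `target_net.load_state_dict(q_net.state_dict())`, and then run `train_dqn(gym.make("Taxi-v3"), q_net, target_net)`.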
Task 2: Optimize several parameters (marked in the notebook). Even after you successfully complete Task 1, the results may not be satisfactory due to non-optimized parameters. You will need to tune the parameters carefully to achieve good performance.
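Which parameters are exposed is marked in the notebook; typical DQN knobs, with illustrative (not prescribed) starting values, include the following. Whether each of these appears in the notebook is an assumption.

```python
# Illustrative starting points only; the notebook marks which parameters you must tune.
hyperparams = {
    "learning_rate": 1e-3,        # too high diverges, too low learns slowly
    "gamma": 0.99,                # discount factor
    "epsilon_start": 1.0,         # initial exploration rate
    "epsilon_end": 0.05,          # final exploration rate
    "epsilon_decay": 0.999,       # per-episode multiplicative decay
    "batch_size": 64,             # replay minibatch size
    "replay_buffer_size": 50_000,
    "target_sync_every": 500,     # steps between target-network updates
    "num_episodes": 2000,
}
```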