In this group project, you will utilize preference-based learning and imitation learning methods to train control policies for a robot manipulation task.
The target task is a robot food-serving task built base on panda-gym, where a robot arm needs to learn a control policy to pick up a banana that is randomly placed in a food preparation area, and place it on the plate for food serving. For more high-level details regarding the code structure, refer to Section 2 and Section 3 in this paper.
The state of the task is defined as a dictionary with keys "observation", "achieved_goal", and "desired_goal".
Each value is a 1D np.ndarray.
The task observation is the concatenation of:
- Robot observation (UR5):
- end-effector position (3):
ee_pos_x, ee_pos_y, ee_pos_z - end-effector linear velocity (3):
ee_vel_x, ee_vel_y, ee_vel_z - gripper opening width (1):
finger_width
- end-effector position (3):
- Object observation (banana):
- object position (3):
obj_pos_x, obj_pos_y, obj_pos_z - object rotation quaternion (4):
obj_quat_x, obj_quat_y, obj_quat_z, obj_quat_w - object linear velocity (3):
obj_vel_x, obj_vel_y, obj_vel_z - object angular velocity (3):
obj_ang_vel_x, obj_ang_vel_y, obj_ang_vel_z
- object position (3):
state["observation"] = np.array([
# robot (7)
ee_pos_x, ee_pos_y, ee_pos_z,
ee_vel_x, ee_vel_y, ee_vel_z,
finger_width,
# object (13)
obj_pos_x, obj_pos_y, obj_pos_z,
obj_quat_x, obj_quat_y, obj_quat_z, obj_quat_w,
obj_vel_x, obj_vel_y, obj_vel_z,
obj_ang_vel_x, obj_ang_vel_y, obj_ang_vel_z
]) # total: 20 dimsThe goal representation follows the standard goal-conditioned interface:
state["achieved_goal"] = np.array([obj_pos_x, obj_pos_y, obj_pos_z]) # 3 dims
state["desired_goal"] = np.array([target_pos_x, target_pos_y, target_pos_z]) # 3 dimsFor this robot food-serving task, the desired goal position is fixed and given, defined as the center of the red plate. The values of the observation and the desired goal will change at each time step in accordance with the physical interaction between the robot and the banana object. More details can be found in pick_and_place.py.
Tips for training: For many RL algorithms it is convenient to flatten the dict observation into a single vector. A common option is to concatenate state["observation"] and state["desired_goal"], via reconstruct_state() in env_wrappers.
The action is a 1D np.ndarray in [-1, 1] with:
- 3 values controlling the UR5 arm
- 1 value controlling the gripper (if block_gripper=False)
action = np.array([dx, dy, dz, gripper_cmd])(dx, dy, dz) is an end-effector displacement command (delta in Cartesian space):
target_ee_position=current_ee_position + 0.05 * [dx, dy, dz]- the controller then uses inverse kinematics to compute the 6 UR5 joint targets.
gripper_cmd controls the gripper opening:
gripper_cmd> 0 closes the grippergripper_cmd<= 0 opens the gripper
More details can be found in ur_robot.py.
The task is designed with a sparse reward function. The robot will get a positive reward of +1000.0 if it manages to place the banana at the plate. Otherwise, the robot will only get a reward of -1.0 at each step. More details can be found in task_envs.py.
An episode will terminate immediately if the banana is placed on the plate, or will terminate if the episode reaches its maximum length of 150 steps. More details can be found in task_envs.py.
It is very easy to create the task environment. You can do it by:
env = PnPNewRobotEnv(render=render)
env = ResetWrapper(env)
env = ActionNormalizer(env)
env = TimeLimitWrapper(env, max_steps=150)
env.reset(seed=0)
terminated = truncated = False
while not (terminated or truncated):
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)Refer to test_env.py for how to roll out in the environment episodically. More details about ActionNormalizer, ResetWrapper, and TimeLimitWrapper can be found in env_wrappers.py.
-
Create a conda environment under python 3.8 and activate it in your terminal. Make sure you already downloaded and installed Anaconda on your laptop before running the following commands in the terminal:
$ conda create --name hial_project_env python=3.8 $ conda activate hial_project_env
-
Install the required packages:
$ pip install -r requirements.txt
-
Install the library of APReL following the instructions here.
Notice: If you are a Windows user, or a MacOS user but run into troubles with ffmpeg, check this post and try to install it directly via conda:
$ conda install -c conda-forge ffmpeg=4.2.2
-
After installing the APReL library, remember to go back to the parent directory (e.g., if you installed APReL at "/Desktop/APReL/", then go back to the directory of "/Desktop/"), and clone the hial_project repository under this parent directory:
$ git clone https://github.com/FabianoBusca/HIAL-PROJECT.git
Notice: If you haven't installed git on your laptop, follow the instructions here and install git before you use the commend of "git clone" in your terminal.
-
Try to run the test script
$ python -m scripts.test_env # from the root directory
-
We have prepared 20 human-expert demos for you under the folder of /demo_data/PickAndPlace/, including episodic trajectories of states, actions, rewards, next states, and dones (i.e., whether an episode is terminated or not).
If you want to check and load these demos run the script load_demos.py
$ python -m scripts.load_demos # from the root directoryYou should be able to see demos replaying in the simulator environment like below:
-
1.1 Complete the file named banana.py under the directory of alg.
- design your own features of the recovered reward and implement the feature function given an episodic trajectory
- complete the code to generate 30 video clips of episodic trajectories, including 20 clips of given expert demos and 10 clips of trajectories with random actions.
-
1.2 Complete the file named pref_learn.py, write your code to recover the reward function for the robot food-serving task using the APReL library:
- The human teacher will provide feedback whenever given a comparison between two trajectories chosen from these 30 trajectories. And the way they provide feedback is by watching the clips of the chosen trajectories and give their preference using keyboard.
- Experiment with different
acquisition_functionoptions. Include in your report which one worked best for you and why. - Utilize APReL to write your code to recover the reward function based on your designed feature function and generated clips. More specifically, the human teacher is allowed to give feedback on 10 comparison queries in total. More code-wise procedures can be found in simple.py of APReL.
- Save the learned weights for your designed reward feature as a csv file named "feature_weights" under the directory of saved.
- Create a file named policy_learn.py under the directory of alg.
- In this policy_learn.py, write your code to learn robot control policy using RL from Demonstrations with your learned reward function:
-
Choose your favorite RLfD method (e.g., AWAC) as the underlying policy learning algorithm
-
Load the given 20 expert demonstrations into the demo replay buffer of RLfD before policy training starts
-
Train your policy episodically with your learned reward function. More specifically, write your code to realize a training loop where the robot first roll-outs its current policy for one episode, recovers the rewards for this episode given this roll-out trajectory, saves the trajectory into the replay buffer, then updates the control policy
-
Train your policy for a maximum of 500k environment steps, and save your policy models every 1k steps during the training process
-
To report your results, plot the policy learning curve during the training process every 1k steps, with x-axis as the number of environment steps and y-axis as the average success rate of your trained policy at the current step.
For every data point, you should get the value of it by calculating the average success rate across 10 test runs. And for each test run, you should:
- Randomly initialize the task environment (i.e., env.reset())
- Roll out your saved policy model of the current training step
- Check whether the robot succeeded or not for the current test run
-
- The script of banana.py for recording the clips
- The script of pref_learn.py for reward learning
- The script of policy_learn.py for policy learning
- The
./saved/folder - (Important) Another script named policy_test.py under the directory of alg:
- To facilitate TAs to check the performance of your final trained policy, we kindly ask you to write two simple functions named load_final_policy() and get_policy_action() inside the script of policy_test.py:
- For the function of load_final_policy(), it should be in the format of:
def load_final_policy(path_to_saved_policy): """ load your final trained policy Args: path_to_saved_policy (str): the path to your saved policy model Returns: your saved policy model under the corresponding path """ # your code here - For the function of get_policy_action(), it should be in the format of:
def get_policy_action(state, saved_policy_model): """ get the action that the policy decides to take for the given environment state Args: state (dict): the state of the environment returned by the env.step() or env.reset(), which is a dictionary including keys of "observation", "achieved_goal", and "desired_goal" saved_policy_model: a saved model in the same format as the one returned by your load_final_policy() function Returns: action (np.array): the action that the saved policy model decides to take under the given state """ # your code here
- Realize online policy training with RLfD and preference-based learning, and report your results. In other words, there will be no two separate phases for training. Instead, during every episode of the learning loop, the robot will:
- Ask for human feedback on a certain amount of trajectory-pair comparisons
- Recover a new reward function for the current training episode
- Roll out the current policy and recover the rewards based on just learned reward function
- Update the policy
- Add the rolled out trajectory to the trajectory set of the preference-based learning, and generate the video clip of it for future comparison
