
Configuration: Action Chunk Size and other relevant parameters #15

Open
MichaelRazum opened this issue Nov 23, 2024 · 6 comments


MichaelRazum commented Nov 23, 2024

Hello,

Thank you for providing this code. It works quite well.

After training on LIBERO-90, I was able to reproduce some of the evaluation metrics. As I understand it, one of the relevant parameters to adjust would be the action_horizon (chunk size). Is there a connection to the autoencoder when increasing the chunk size?

It would be great if you could provide a hint about which parameters to adjust. Maybe there are more tweaks for better performance.

I'm currently trying to train the policy on a (cheap) robotic arm with 6 DOF. It kind of works, although ACT seems to perform better at the moment. I suspect this might be a training or configuration issue. I have trained the encoder and prior using one-hot embeddings.

Thanks a lot!

@atharvamete (Collaborator)

Hi, this sounds exciting.
action_horizon is not the chunk size but the execution horizon, meaning how many of the predicted actions to execute. The chunk size would be skill_block_size, which is set to 32.
You can try skill_block_size = 16 and downsample_factor = 2. action_horizon = skill_block_size/2 should generally work.
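To make the relationship concrete, here is a tiny sketch in Python (the variable names mirror the config keys mentioned above; the values are just the ones suggested here, not verified defaults):

```python
# Rough sketch of the suggested settings and how they relate to each other.
skill_block_size = 16                   # length of the predicted action chunk
downsample_factor = 2                   # temporal downsampling in the autoencoder
action_horizon = skill_block_size // 2  # = 8: how many of the predicted actions get executed
```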

Could you answer the questions below?

  1. What do you mean by training the encoder and prior with one-hot embeddings? Is this for the encoding task?
  2. What are the tasks? How many total tasks?
  3. Number of demos/trajectories per task?
  4. What's the action space and the frequency of data collection?

@MichaelRazum
Copy link
Author

MichaelRazum commented Nov 24, 2024

Thanks for the answer! Is there a reason you chose action_horizon = 8 to be smaller than skill_block_size = 16?

  1. What do you mean by training the encoder and prior with one-hot embeddings? Is this for the encoding task?
    Yes, exactly, it is for the encoding task. I am performing the following steps:
    a. Train the autoencoder
    • Using type: onehot for the encoding task. As I understand it, the autoencoder is conditioned on the task_id number. Here's an example of the task config:
    task_embedding_format: clip
    img_height: 128
    img_width: 128
    horizon: 600
    
    shape_meta:
      action_dim: 6
      observation:
        rgb:
          front:
            - 3
            - ${task.img_height}
            - ${task.img_width}
        lowdim:
          robot_states: 6
      task:
        type: onehot
        n_tasks: ${task.n_tasks}
    b. Train the prior for 100 epochs. I didn't try the third step (finetuning), since, as I understand it, it is used in the paper for few-shot learning.

2/3. What are the tasks? How many total tasks? Number of demos/trajectories per task?

  • Rotate 90 degrees right
  • Rotate 90 degrees left
  • Flip Cube. Place the gripper on the corner and flip the cube.
  • Complex Task. This involves multiple manipulations.

In total, we have 4 tasks. Additionally, I’ve included the same 4 tasks recorded from a slightly different camera perspective, making it 8 tasks. I don’t think this additional data should hurt training—at least not for the autoencoder.

Task | Demos | Actions per demo
--- | --- | ---
Flip Cube | 50 | 100–300
Rotate Cube Left | 50 | 200–400
Rotate Cube Right | 50 | 200–400
Complex Task | 60 | 500–700

Other camera perspective:

Task | Demos | Actions per demo
--- | --- | ---
Flip Cube | 160 | 100–200
Rotate Cube Left | 100 | 400–500
Rotate Cube Right | 100 | 400–500
Complex Task | 60 | 500–700

  4. What's the action space and the frequency of data collection?
  • Observation Space: joint angles in [-2π, 2π] and a 128x128 camera image.
  • Action Space: target joint angles.
  • Frequency: rather low, about 10 Hz or even a bit lower.

I didn't try using end-effector positions instead of joint angles yet.

One more remark: ACT was trained on a single task with image resolution 480x640 and a rather large action_horizon=50. For QueST, these images were first downsampled to 128x128 to match the demos, which might be an issue. I will try passing higher-resolution images to see if it improves the results.

@atharvamete (Collaborator)

Is there a reason you chose action_horizon = 8 to be smaller than skill_block_size = 16?

This executes in a receding-horizon fashion, mainly to make the policy more reactive to any errors; diffusion policy also does this.
The autoencoder is not conditioned on task_id (check this line); only the prior is.
Can you try some of the early checkpoints for the prior? We found the 20th checkpoint to do better than the 100th one by 4% on LIBERO-90.
We have never tested QueST on joint angles, though there is no obvious reason why it would do worse than with eef positions. Still, it might be worth trying eef deltas.
The frequency determines the skill_block_size, which should ideally represent 1 to 1.5 seconds of motion. This suggestion is based on some initial experiments with the OpenX dataset and might not be 100% valid.
We expect QueST to work better in the multitask setting, as the autoencoder would have more coverage of the action space.
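A minimal sketch of what receding-horizon execution looks like in a rollout loop (DummyEnv and DummyPolicy below are placeholders, not the repo's actual classes). Note that at roughly 10 Hz, skill_block_size = 16 covers about 1.6 seconds of motion, close to the 1 to 1.5 second guideline above:

```python
import numpy as np

# Placeholders standing in for a real environment and a trained policy.
class DummyEnv:
    def reset(self):
        return np.zeros(6)
    def step(self, action):
        return np.zeros(6), 0.0, False, {}

class DummyPolicy:
    def predict(self, obs, skill_block_size=16, action_dim=6):
        return np.zeros((skill_block_size, action_dim))

skill_block_size = 16
action_horizon = 8  # execute only the first half of each predicted chunk, then re-plan

env, policy = DummyEnv(), DummyPolicy()
obs = env.reset()
for _ in range(10):                               # a few re-planning rounds
    action_chunk = policy.predict(obs)            # shape: (skill_block_size, action_dim)
    for action in action_chunk[:action_horizon]:  # receding horizon: discard the rest
        obs, reward, done, info = env.step(action)
```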

For QueST, these images were first downsampled to 128x128 to match the demos

When you say "to match the demos", do you mean that your demos are 128x128 but at inference it's 480x640?


MichaelRazum commented Nov 27, 2024

This executes in a receding-horizon fashion, mainly to make the policy more reactive to any errors; diffusion policy also does this.

So, as I understand it, it is better to train with skill_block_size=16 and action_horizon=8 rather than setting both to 8. That kind of makes sense to me, since we provide more information during training.

Can you try some of the early checkpoints for the prior? We found the 20th checkpoint to do better than the 100th one by 4% on LIBERO-90.

I tried the 20th checkpoint, but in my case the 100th one performed better. The policy was noticeably smoother with the 100th checkpoint.

When you say "to match the demos", do you mean that your demos are 128x128 but at inference it's 480x640?

As a first step, all the training data was rescaled to 128x128; during evaluation, the camera data was also rescaled to 128x128.
Right now I am training the policy using 480x640. The 20th checkpoint looked better compared to 128x128, but it's still not quite there on the task; some of the movements are weird. Maybe this will improve after training longer.

I am also planning to do a second experiment with two cameras (top & front). It will be interesting to see how well the methods perform with more information.

Just wanted to share the wandb logs for the current run (480x640); it actually looks good to me, but maybe you see something odd. It takes quite a long time to train, so I will report later if there is an improvement.
[image: wandb training curves for the current run]

@atharvamete (Collaborator)

The ideal thing would be to also train the autoencoder (stage-1) beyond convergence; this is for LIBERO-90 with batch size 2048. Could you also show your logs for it?

The way we encode images might not be ideal for the real world. We basically pool the shallow ResNet output, then concatenate this for all cameras along with proprio, and get a single token that we then project to stage two's transformer dimension. This is a lot of feature compression. Instead, we can remove the ResNet pooling, use more ResNet layers, and stack all vision tokens along with the proprio ones. The stage-2 transformer will then attend to more visual information. This also requires changing the attention mask for the vision tokens. It shouldn't be too difficult to add; otherwise I can integrate this into the codebase by the 5th if needed.
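To make the pooled-token vs. spatial-tokens distinction concrete, here is a rough torchvision sketch; it uses the full ResNet-18 trunk for simplicity, whereas the encoder in the repo is shallower, so the actual grid size will differ:

```python
import torch
import torchvision

# ResNet-18 trunk without the final avgpool/fc layers.
resnet = torchvision.models.resnet18(weights=None)
trunk = torch.nn.Sequential(*list(resnet.children())[:-2])

img = torch.randn(1, 3, 128, 128)                 # one camera frame
feat = trunk(img)                                 # (1, 512, 4, 4) feature map

pooled_token = feat.mean(dim=(2, 3))              # (1, 512): one token per camera (current style)
spatial_tokens = feat.flatten(2).transpose(1, 2)  # (1, 16, 512): H*W tokens the stage-2
                                                  # transformer could attend to instead
print(pooled_token.shape, spatial_tokens.shape)
```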

@MichaelRazum (Author)

The ideal thing would be to also train the autoencoder (stage-1) beyond convergence; this is for LIBERO-90 with batch size 2048. Could you also show your logs for it?

The autoencoder loss looks very similar to yours, although it was trained with a batch size of 256.
[image: autoencoder training loss curve]

I have also attached the log and the config for stage-1:

config.txt
wandb_autoencoder.log

The way we encode images might not be ideal for the real world. We basically pool the shallow ResNet output, then concatenate this for all cameras along with proprio, and get a single token that we then project to stage two's transformer dimension. This is a lot of feature compression. Instead, we can remove the ResNet pooling, use more ResNet layers, and stack all vision tokens along with the proprio ones. The stage-2 transformer will then attend to more visual information. This also requires changing the attention mask for the vision tokens. It shouldn't be too difficult to add; otherwise I can integrate this into the codebase by the 5th if needed.

This part is interesting and might help a lot! As I understand it, right now the context window is set to two (one token for the image and one for the proprio). So if we remove the pooling, it would become HxW (for the images) + 1; please correct me if that is wrong. The changes one would need to make would be setting do_projection=False in ResnetEncoder and hwc=True in obs_encode. I hope that is correct. I'm not sure whether this might make the model a bit too large.
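As a back-of-the-envelope check on the resulting context length (assuming the 4x4 feature grid from the sketch above; the real grid size depends on how shallow the encoder is kept):

```python
# Hypothetical per-timestep token count once the ResNet pooling is removed.
n_cameras = 2                 # the planned top + front setup
tokens_per_camera = 4 * 4     # H * W of the feature grid
proprio_tokens = 1
context_per_timestep = n_cameras * tokens_per_camera + proprio_tokens
print(context_per_timestep)   # 33, compared to the couple of tokens used with pooling
```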

I just discovered that the pretrained parameter was set to false in the default ResNet18 configuration. As a first step, I was thinking of training with pretrained=True. I also noticed you have a DINOEncoder, and since the ACT paper used sinusoidal positional embeddings, I'm wondering if either of these approaches might be worth exploring.

Actually, I’m not quite sure which direction would be best here, so thanks in advance for any hint!
