Configuration: Action Chunk Size and other relevant parameters #15
Hello,

Thank you for providing this code. It works quite well.

I have trained on LIBERO-90 and reproduced some of the evaluation metrics. As I understand it, one of the relevant parameters to adjust would be the action_horizon (chunk size). Is there a connection to the autoencoder when increasing the chunk size?

It would be great if you could provide a hint about which parameters to adjust. Maybe there are more tweaks for better performance.

I'm currently trying to train the policy on a (cheap) robotic arm with 6 DOF. It kind of works, although ACT seems to perform better at the moment. I suspect this might be a training or configuration issue. I have trained the encoder and prior using one-hot embeddings.

Thanks a lot!

Comments
Hi, this sounds exciting. Could you answer the questions below?
Thanks for the answer! Is there a reason you chose [...]?
2/3. What are the tasks? How many total tasks? Number of demos/trajectories per task?
In total, we have 4 tasks. Additionally, I’ve included the same 4 tasks recorded from a slightly different camera perspective, making it 8 tasks. I don’t think this additional data should hurt training—at least not for the autoencoder.
I didn't try using end-effector positions instead of joint angles yet. One more remark: ACT was trained on a single task at 480x640 image resolution, with a rather large action_horizon=50. For QueST, these images were first downsampled to 128x128 to match the demos, which might be an issue. I will try passing higher-resolution images to see if it improves the results.
This executes in a receding-horizon fashion, mainly to make the policy more reactive to any errors; Diffusion Policy does this as well.
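For concreteness, here is a minimal sketch of receding-horizon execution at rollout time; `policy.predict`, the gym-style `env` interface, and the parameter names are assumptions for illustration, not the actual QueST API:

```python
# Sketch only: the policy predicts a chunk of `action_horizon` actions,
# but we execute just the first `execute_horizon` of them before
# re-planning from the fresh observation, which keeps the policy
# reactive to accumulated errors.
def receding_horizon_rollout(policy, env, action_horizon=50,
                             execute_horizon=8, max_steps=600):
    obs = env.reset()
    steps = 0
    while steps < max_steps:
        chunk = policy.predict(obs)             # (action_horizon, action_dim)
        for action in chunk[:execute_horizon]:  # discard the tail of the chunk
            obs, reward, done, info = env.step(action)
            steps += 1
            if done or steps >= max_steps:
                return obs
    return obs
```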
When you say "to match the demos", do you mean that your demos are 128x128 but at inference the images are 480x640?
The ideal thing would be to train the autoencoder (stage 1) beyond convergence as well; this is for LIBERO-90 with batch size 2048. Could you also share your logs for it? The way we encode images might not be ideal for the real world. We basically pool the shallow ResNet output, concatenate this across all cameras along with the proprio, and get a single token that we then project to stage two's transformer dimension. That is a lot of feature compression. Instead, we could remove the ResNet pooling, use more ResNet layers, and stack all vision tokens along with the proprio ones; the stage-2 transformer would then attend to more visual information. This also needs a change to the attention mask for the vision tokens. It shouldn't be too difficult to add; otherwise I can integrate it into the codebase by the 5th if needed.
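To make the two options concrete, here is a rough PyTorch sketch; the module names, shapes, and projection layout are my assumptions, not the actual QueST code:

```python
import torch
import torch.nn as nn

class SingleTokenEncoder(nn.Module):
    """Option A (current): pool each camera's ResNet features, concatenate
    with proprio, and compress everything into a single token."""
    def __init__(self, feat_dim, proprio_dim, num_cams, d_model):
        super().__init__()
        self.proj = nn.Linear(num_cams * feat_dim + proprio_dim, d_model)

    def forward(self, cam_feats, proprio):  # cam_feats: list of (B, C, H, W)
        pooled = [f.mean(dim=(2, 3)) for f in cam_feats]  # global average pool
        x = torch.cat(pooled + [proprio], dim=-1)         # (B, num_cams*C + P)
        return self.proj(x).unsqueeze(1)                  # (B, 1, d_model)

class MultiTokenEncoder(nn.Module):
    """Option B (proposed): skip pooling, keep every spatial cell as its own
    token, and append a proprio token; stage 2 then attends over all of them."""
    def __init__(self, feat_dim, proprio_dim, d_model):
        super().__init__()
        self.vis_proj = nn.Linear(feat_dim, d_model)
        self.prop_proj = nn.Linear(proprio_dim, d_model)

    def forward(self, cam_feats, proprio):
        tokens = []
        for f in cam_feats:                      # (B, C, H, W)
            t = f.flatten(2).transpose(1, 2)     # (B, H*W, C)
            tokens.append(self.vis_proj(t))      # (B, H*W, d_model)
        tokens.append(self.prop_proj(proprio).unsqueeze(1))  # (B, 1, d_model)
        return torch.cat(tokens, dim=1)  # (B, num_cams*H*W + 1, d_model)
```

Option B also implies widening the stage-2 attention mask so the new vision tokens are attended to, as mentioned above.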
The autoencoder loss looks very similar to yours, although it was trained with a batch size of 256. I have also attached the log and the stage-1 config: config.txt
This part is interesting and might help a lot! As I understand it, right now the context window is set to two (one token for the image and one for the proprio). So if we remove the pooling, it would become HxW (for the images) + 1; please correct me if that is wrong. The changes one would need to make would be to the parameter [...]. I just discovered that the [...]. Actually, I'm not quite sure which direction would be best here, so thanks in advance for any hint!
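As a back-of-the-envelope check of that token count (the stride here is an assumption: a full ResNet downsamples 32x, while a shallower cut downsamples less and therefore yields more tokens):

```python
def num_tokens(img_h, img_w, num_cams, stride=32, proprio_tokens=1):
    # vision tokens per camera = (H/stride) * (W/stride), plus proprio token(s)
    h, w = img_h // stride, img_w // stride
    return num_cams * h * w + proprio_tokens

print(num_tokens(128, 128, num_cams=1))            # 4*4 + 1 = 17
print(num_tokens(128, 128, num_cams=1, stride=8))  # 16*16 + 1 = 257
```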