Open
This issue tracks the status of merging Rollout Routing Replay (R3) support into SkyRL:
- Initial PR, with support for TP on inference and TP + CP + EP on the trainer: R3 PR: Rollout Routing Replay #1273
- Migrate tests to run on CI (using OLMoE, since it is the smallest MoE model supported on Megatron Bridge): [CI] Migrate MoE Tests to OlMoE #1384
- Plumb the router replay feature through the new inference stack
- Test the router replay plumbing for fully async training and step-wise training
- Support for CP + PP on the Megatron trainer (should only require slicing on the Megatron workers to get the right indices): [train][2/N] Support for Megatron PP + CP for R3 #1335
- Consider support for R2 (replaying the Megatron forward-pass expert routing); lower priority
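
For the CP item above, a rough sketch of what "slicing to get the right indices" could look like. This is illustrative only and is not taken from the SkyRL PRs: the function name `cp_slice_routing` is hypothetical, and it assumes the load-balanced context-parallel layout in which the sequence is split into `2 * cp_size` equal chunks and rank `r` holds chunks `r` and `2 * cp_size - 1 - r`. The actual worker code may use a different layout or operate on tensors rather than lists.

```python
from typing import List, Sequence


def cp_slice_routing(routed_experts: Sequence, cp_size: int, cp_rank: int) -> List:
    """Select the per-token routing entries owned by one context-parallel rank.

    Hypothetical helper. Assumes the sequence is split into 2 * cp_size equal
    chunks and that rank r owns chunks r and (2 * cp_size - 1 - r).
    """
    seq_len = len(routed_experts)
    num_chunks = 2 * cp_size
    if seq_len % num_chunks != 0:
        raise ValueError("sequence length must be divisible by 2 * cp_size")
    chunk = seq_len // num_chunks

    first = cp_rank                    # chunk taken from the front half
    second = num_chunks - 1 - cp_rank  # mirrored chunk from the back half

    out = list(routed_experts[first * chunk:(first + 1) * chunk])
    out += list(routed_experts[second * chunk:(second + 1) * chunk])
    return out


# Toy data: 8 tokens, each with top-2 expert ids recorded during rollout.
routing = [[t, t + 1] for t in range(8)]
shards = [cp_slice_routing(routing, cp_size=2, cp_rank=r) for r in range(2)]
```

Under these assumptions, each rank replays only the recorded expert choices for the tokens it actually holds, so the replayed routing stays aligned with the local activations.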