[skyrl-train] Enable routing replay in SkyRL #815

@erictang000

This issue tracks the status of merging Rollout Routing Replay (R3) support into SkyRL.

  • Initial PR, supporting TP on inference and TP + CP + EP on the trainer: R3 PR: Rollout Routing Replay #1273
  • Migrate tests to run on CI (using OLMoE, since it is the smallest MoE model supported on Megatron Bridge): [CI] Migrate MoE Tests to OlMoE #1384
  • Plumb the router replay feature through the new inference stack
  • Test the router replay plumbing for fully async training and step-wise training
  • Support for CP + PP on the Megatron trainer (should only require slicing on the Megatron workers to get the right indices): [train][2/N] Support for Megatron PP + CP for R3 #1335
  • Consider support for R2 (replaying the expert routing from the Megatron forward pass); lower priority
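
For readers unfamiliar with the "slicing on the workers" idea in the CP + PP item, here is a rough sketch of what it could look like. It is not the SkyRL or Megatron implementation. The function name `slice_for_worker`, the nested-list layout, the even contiguous split of MoE layers across PP stages, and the load-balanced CP split into `2 * cp_size` chunks are all assumptions made for illustration.

```python
from typing import List

# Hypothetical layout: routed[layer][token] is the list of top-k expert ids
# recorded by the inference engine during rollout (MoE layers only).
Routed = List[List[List[int]]]


def slice_for_worker(routed: Routed, pp_size: int, pp_rank: int,
                     cp_size: int, cp_rank: int) -> Routed:
    """Return the part of the recorded routing that one trainer worker replays."""
    num_layers = len(routed)
    seq_len = len(routed[0])

    # Assumption: layers are split evenly and contiguously across PP stages.
    assert num_layers % pp_size == 0
    layers_per_stage = num_layers // pp_size
    first = pp_rank * layers_per_stage
    local_layers = routed[first:first + layers_per_stage]

    # Assumption: load-balanced CP split into 2 * cp_size chunks, where
    # rank r holds chunk r and its mirror chunk (2 * cp_size - 1 - r).
    assert seq_len % (2 * cp_size) == 0
    chunk = seq_len // (2 * cp_size)
    mirror = 2 * cp_size - 1 - cp_rank
    token_ids = list(range(cp_rank * chunk, (cp_rank + 1) * chunk))
    token_ids += list(range(mirror * chunk, (mirror + 1) * chunk))

    return [[layer[t] for t in token_ids] for layer in local_layers]


# Toy data: 4 layers, 8 tokens, top-1 routing, expert id = layer * 100 + token.
routed = [[[layer * 100 + tok] for tok in range(8)] for layer in range(4)]
local = slice_for_worker(routed, pp_size=2, pp_rank=1, cp_size=2, cp_rank=0)
print(local)
```

With these toy inputs, the worker at PP rank 1 and CP rank 0 keeps layers 2 and 3 and tokens 0, 1, 6, 7. If the trainer shards sequences differently (for example, a plain contiguous split), only the `token_ids` computation would need to change.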
