This issue tracks development of the Megatron backend for SkyRL for large-scale MoE training.
Megatron-Core MoE tech report: https://arxiv.org/pdf/2603.07685
Prior tracking issue: #203
New Features
- [P0] Support R3 - Rollout Routing Replay ([skyrl-train] Enable routing replay in SkyRL #815); see the routing-replay sketch after this list
- [P2] Megatron dynamic context parallel support ([megatron][perf] Integrate megatron dynamic context parallelism #1019)
- [P2] Enable `TransformerConfig.cp_comm_type="a2a"` for Ulysses-style sequence parallelism in Megatron @devpatelio; see the config sketch after this list
- [P2] Support Virtual Pipeline Parallel for improving training throughput
- [P2] Support Megatron native FSDP
- [P1] Enable dynamic batch sizing in Megatron
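
For the routing replay item, here is a minimal sketch of the core idea, assuming a generic top-k MoE router in PyTorch (the function name and tensor layout are illustrative, not SkyRL or Megatron APIs): the expert indices recorded during rollout are reused in the trainer's forward pass, so both sides dispatch tokens to the same experts even when their router logits differ slightly.

```python
import torch

def replayed_topk_routing(router_logits: torch.Tensor,
                          recorded_experts: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """router_logits: [num_tokens, num_experts] from the trainer's router.
    recorded_experts: [num_tokens, top_k] expert indices logged by the rollout engine."""
    # Gate values are still computed from the current logits so gradients flow
    # to the router, but the *selection* of experts is pinned to the rollout choices.
    probs = torch.softmax(router_logits, dim=-1)
    routing_weights = torch.gather(probs, dim=-1, index=recorded_experts)
    routing_weights = routing_weights / routing_weights.sum(dim=-1, keepdim=True)
    return routing_weights, recorded_experts
```

In practice the recorded indices would be carried alongside the rollout trajectories and threaded into each MoE layer's router during the training forward pass.

For the `cp_comm_type="a2a"` item, a config sketch, assuming a Megatron-Core version that exposes `cp_comm_type` on `TransformerConfig`; the model sizes below are placeholders:

```python
from megatron.core.transformer import TransformerConfig

cfg = TransformerConfig(
    num_layers=32,
    hidden_size=4096,
    num_attention_heads=32,
    context_parallel_size=4,   # split each sequence across 4 CP ranks
    cp_comm_type="a2a",        # all-to-all (Ulysses-style) exchange instead of ring/p2p
)
```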
Megatron + LoRA
- [P1] Support in-memory, LoRA-only weight sync for Megatron + LoRA ([train] Support LoRA-only weight syncing for Megatron backend #1336); see the sketch after this list
- [P1] LoRA rank normalization via Megatron-Bridge ([feature] LoRA rank normalization for MoE models NVIDIA-NeMo/Megatron-Bridge#2964)
- [P0] Improve LoRA checkpointing support and fix/note upstream bugs (Dense LoRA adapters lose TP shards during PEFT-filtered checkpoint save NVIDIA-NeMo/Megatron-Bridge#2240)
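
A minimal sketch of the LoRA-only sync idea, assuming the conventional `lora_A`/`lora_B` parameter naming; the helper is hypothetical, not SkyRL's actual weight-sync API:

```python
import torch

def collect_lora_state(model: torch.nn.Module) -> dict[str, torch.Tensor]:
    # Keep only LoRA adapter tensors so the weight sync to the inference engine
    # ships the adapters rather than the full MoE checkpoint.
    return {
        name: param.detach()
        for name, param in model.named_parameters()
        if "lora_" in name
    }
```

Shipping only these tensors keeps the in-memory transfer proportional to the adapter size; the rollout engine then applies (or merges) the adapters on top of its base weights.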
Bugs/Improvements
- [P0] Add a DAPO recipe with MoE + Megatron to CI ([CI] Add a DAPO recipe with the Megatron backend to CI #1322)
- [P1] Debug the GLM + LoRA train/infer mismatch issue (see the diagnostic sketch below)
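
As a starting point for the mismatch debugging item, a small diagnostic sketch (a hypothetical helper, not an existing SkyRL utility) that quantifies how far the trainer's per-token log-probs drift from the rollout engine's on the same sampled tokens:

```python
import torch

def logprob_mismatch_stats(train_logprobs: torch.Tensor,
                           rollout_logprobs: torch.Tensor) -> dict[str, float]:
    """Both inputs hold per-token log-probs of the sampled tokens:
    one set recomputed by the trainer, one reported by the rollout engine."""
    diff = train_logprobs - rollout_logprobs
    return {
        "mean_abs_diff": diff.abs().mean().item(),
        "max_abs_diff": diff.abs().max().item(),
        # Positive mean means the trainer assigns higher log-prob on average.
        "mean_signed_diff": diff.mean().item(),
    }
```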