[R3]: Move to new vLLM routed experts format #2487
Open
S1ro1 wants to merge 33 commits into
samsja
previously approved these changes
May 15, 2026
* Guard checkpoint disk metrics mkdir
* Remove test_trainer_utils.py per review feedback
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Simplify ckpt disk metrics guard
  Drop the rank-0 gate and the disk_usage path fallback per review feedback. Catching FileExistsError on mkdir is sufficient: every rank that races on mkdir either wins or harmlessly catches the BeeGFS race, and shutil.disk_usage can then operate on the now-existing ckpt_dir.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erts
# Conflicts:
#   pyproject.toml
#   src/prime_rl/inference/patches.py
#   src/prime_rl/inference/vllm/serving_chat_with_tokens.py
#   src/prime_rl/inference/vllm/serving_tokens.py
#   src/prime_rl/orchestrator/trajectories.py
#   src/prime_rl/trainer/batch.py
#   src/prime_rl/trainer/rl/data.py
#   tests/unit/orchestrator/test_batch.py
#   uv.lock
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit aa1fc36.

The PR is ready and has been verified with verifiers/renderers (we need to pin to main). However, it is waiting for vllm-project/vllm#39568 to be included in a vLLM release, expected in 0.21.1.
- `choices[i].routed_experts` from the prime-rl vLLM token path as compact raw-`uint8` JSON payloads: `{"data": base64(raw_bytes), "shape": [...]}`
- `RequestOutput.prompt_routed_experts` with per-completion decode experts before serializing
- `RoutedExperts(data, shape, dtype)`, explicit dtype maps, and `_pack_routed_experts` / `_unpack_routed_experts` for multi-turn stitching
- `RoutedExperts` transport struct with `torch.frombuffer`
- `vllm-router` to `0.1.25` for the matching raw-`uint8` schema and add `pybase64`
- `deps/verifiers` to the cleaned production-equivalent routed-experts response path from verifiers PR [Feat] Multi-lora layer, data packing, optimizer, broadcast and scheduler #1433

Related PRs
Verification
- `uv sync --all-extras`
- `uv run ruff check src/prime_rl/inference/patches.py src/prime_rl/inference/vllm/routed_experts.py`
- `bash -n /beegfs/outputs/qwen3-30b-a3b-router-replay-diag-r3-v3-clean/rl.sbatch`
- 19354, output `/beegfs/outputs/qwen3-30b-a3b-router-replay-diag-r3-v3-clean`, node 53 excluded. Orchestrator completed 5/5 steps using the direct renderer rollout client; trainer completed steps 0-3 with finite grad norms and began step 4 before the generated script terminated remaining processes after orchestrator completion.
- `Failed to merge`, `Rollout error`, `Aborted rollout`, `ERROR`, `Traceback`, and `Exception` under the validation output logs returned no matches.

Note
Medium Risk
Medium risk because it changes the on-wire/transport representation of `routed_experts` end-to-end (inference responses, orchestrator parsing, batching/packing, and trainer tensor reconstruction), and adds a vLLM `VllmConfig.__post_init__` monkey patch that could affect inference startup/validation in disaggregated setups.

Overview

Updates routed-experts handling end-to-end to a new compact format. The vLLM `/inference/v1/generate` path now captures per-choice `routed_experts` and serializes it as `{"data": base64(raw uint8 bytes), "shape": [...]}` (via new `inference/vllm/routed_experts.py`), replacing the prior numpy/list-style encoding.

Propagates the new representation through the training pipeline. The orchestrator decodes the compact payload to numpy for step stitching, then packs it into a new `transport.types.RoutedExperts` struct (raw `bytes` + `shape` + `dtype`) carried by `TrainingSample` / `MicroBatch`; trainer batching now slices/appends/pads this byte payload and reconstructs tensors with `torch.frombuffer`.

Adds safety and compatibility guardrails. RL config validation now rejects router replay when `inference.kv_cache_offload` is enabled, and inference installs a vLLM monkey patch to allow routed-experts capture when using the NIXL KV connector (while still rejecting unsupported PP/v2 runner cases). Dependencies are updated to add `pybase64` and bump `vllm-router` to `0.1.25`, with tests updated accordingly.

Reviewed by Cursor Bugbot for commit c13b0b3. Bugbot is set up for automated code reviews on this repo.
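
For readers unfamiliar with the compact payload described in the overview, here is a minimal sketch of the round trip and of padding the byte payload along the token dimension. It is not the PR's code: `RoutedExpertsSketch`, `encode_payload`, `decode_payload`, and `pad_tokens` are hypothetical names, the token-major shape ordering is an assumption, and the standard-library `base64` module stands in for `pybase64` (the PR itself rebuilds tensors with `torch.frombuffer`, which is omitted here).

```python
import base64
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutedExpertsSketch:
    """Hypothetical stand-in for the transport struct: raw bytes + shape + dtype."""
    data: bytes
    shape: tuple[int, ...]
    dtype: str = "uint8"


def _num_elements(shape) -> int:
    """Product of all dimensions (one byte per element for uint8)."""
    total = 1
    for dim in shape:
        total *= dim
    return total


def encode_payload(raw: bytes, shape: tuple[int, ...]) -> str:
    """Serialize raw uint8 bytes as {"data": base64(raw_bytes), "shape": [...]}."""
    if len(raw) != _num_elements(shape):
        raise ValueError(f"shape {shape} does not match {len(raw)} bytes")
    encoded = base64.b64encode(raw).decode("ascii")
    return json.dumps({"data": encoded, "shape": list(shape)})


def decode_payload(payload: str) -> RoutedExpertsSketch:
    """Parse the JSON payload back into raw bytes plus shape."""
    obj = json.loads(payload)
    raw = base64.b64decode(obj["data"])
    shape = tuple(obj["shape"])
    if len(raw) != _num_elements(shape):
        raise ValueError("payload bytes do not match declared shape")
    return RoutedExpertsSketch(raw, shape)


def pad_tokens(experts: RoutedExpertsSketch, target_tokens: int, fill: int = 0) -> RoutedExpertsSketch:
    """Pad along the first (assumed token) dimension by appending fill bytes."""
    tokens, *rest = experts.shape
    if target_tokens < tokens:
        raise ValueError("target_tokens is smaller than the current token count")
    row_bytes = _num_elements(rest)  # bytes per token row
    padding = bytes([fill]) * (row_bytes * (target_tokens - tokens))
    return RoutedExpertsSketch(experts.data + padding, (target_tokens, *rest), experts.dtype)


# Example: 2 tokens x 3 layers x 2 experts, padded out to 4 tokens.
raw = bytes(range(12))
decoded = decode_payload(encode_payload(raw, (2, 3, 2)))
padded = pad_tokens(decoded, 4)
```

Because the array is row-major with the token dimension first (under the assumption above), padding and appending reduce to byte concatenation and no intermediate array needs to be materialized.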