feat: upgrade veRL to v0.7.1 with trainer file migration #525
chenyushuo wants to merge 8 commits into agentscope-ai:main
- Update dependencies: verl==0.7.1, vllm<=0.19.0, megatron-core==0.16.1, transformer-engine==2.13.0
- Migrate 7 core trainer files for upstream compatibility
- Add support for: use_prefix_grouper, calculate_sum_pi_squared, sum_pi_squared_checkpointing
- Implement upstream checkpoint manager patterns and metadata handling
- Remove transformers v5 compatibility patch (handled by upstream)
- Add Docker fixtures and init_migration helper script
- Add veRL upgrade checklist and migration plan documentation
|
/unittest-all |
Summary
Tests
Github Test Reporter by CTRF 💚 |
Pull request overview
This PR upgrades Trinity’s veRL integration from v0.7.0 to v0.7.1 by migrating core trainer/worker/actor/checkpoint code to align with upstream interfaces and adding support for new v0.7.1 features (prefix grouper + sum_pi_squared-related paths), alongside dependency/version, Docker, and migration-documentation updates.
Changes:
- Bump dependencies to verl==0.7.1 and expand supported transformers/vllm version ranges; adjust Megatron/TE/mbridge deps.
- Migrate trainer/workers/actors/checkpoint managers to veRL v0.7.1 patterns (LoRA ref-logprob path, MFU images seqlens, checkpoint retention registration, mbridge args passthrough).
- Add migration tooling + documentation for future veRL upgrades.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| trinity/trainer/verl/verl_trainer.py | Add v0.7.1 metrics (compute_variance_proxy_metrics) and multimodal images_seqlens meta propagation. |
| trinity/trainer/verl/verl_config.py | Update config schema to v0.7.1 (mcore config, rollout correction config, prefix grouper + sum_pi_squared flags, reward nesting). |
| trinity/trainer/verl/monkey_patch.py | Wire use_prefix_grouper patch hook into model monkey patch pipeline. |
| trinity/trainer/verl/megatron_workers.py | Align with v0.7.1 worker behavior (MTP wiring, LoRA ref-logprob handling, images_seqlens MFU input, weight export tweaks). |
| trinity/trainer/verl/megatron_checkpoint_manager.py | Migrate to upstream retention/registration patterns and mbridge save_weights signature-based passthrough. |
| trinity/trainer/verl/megatron_actor.py | Align micro-batch rearrangement + MTP loss reporting with v0.7.1. |
| trinity/trainer/verl/fsdp_workers.py | Thread use_prefix_grouper and pad_token_id through actor/ref/logprob paths; pass sum_pi_squared when present. |
| trinity/trainer/verl/fsdp_checkpoint_manager.py | Align retention rotation with upstream ensure_checkpoint_capacity() + checkpoint registration. |
| trinity/trainer/verl/dp_actor.py | Reduce local overrides, pass pad_token_id, and select prefix grouper keys (prompts/uid) when enabled. |
| `trinity/trainer/verl/__init__.py` | Remove transformers-v5 compatibility patch side-effects (now empty). |
| trinity/common/models/vllm_patch/worker_patch.py | Extend supported vLLM versions up to 0.19.0. |
| trinity/common/models/utils.py | Remove transformers-v5 patch calls when loading veRL checkpoints/converters. |
| scripts/migrate_from_verl/init_migration.py | Add helper script to snapshot/migrate upstream veRL files into build/<version>/. |
| scripts/docker/Dockerfile.uv | Update Docker build deps/overrides for the new version set (vLLM/Transformers/TE/Megatron). |
| pyproject.toml | Dependency bumps and tighter version constraints (verl==0.7.1, transformers<=5.3.0, vllm<=0.19.0, etc.). |
| docs/agent_summarization/verl_v0.7.1_migration_plan.md | Add detailed migration plan and “what changed/what to keep” notes. |
| docs/agent_summarization/verl_upgrade_checklist.md | Add upgrade checklist for future veRL bumps. |
| benchmark/bench.py | Add CLI/config support for trainer_strategy. |
| .github/workflows/docker/docker-compose.yaml | Update CI docker image tag + VLM model env vars. |
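The "signature-based passthrough" noted for megatron_checkpoint_manager.py can be illustrated with a minimal sketch. This is not the PR's code: `filter_supported_kwargs`, `save_weights_old`, `save_weights_new`, and the `memory_efficient` argument are hypothetical stand-ins, and the real mbridge `save_weights` signature may differ. The idea is only that optional arguments are forwarded when the installed function declares them.

```python
import inspect


def filter_supported_kwargs(func, **kwargs):
    """Keep only the keyword arguments that func's signature can accept."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter means every keyword can be forwarded unchanged.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    return {
        name: value
        for name, value in kwargs.items()
        if name in params and params[name].kind in keyword_kinds
    }


# Hypothetical stand-ins for an older and a newer save_weights signature.
def save_weights_old(models, path):
    return ("old", path)


def save_weights_new(models, path, memory_efficient=False):
    return ("new", path, memory_efficient)


extra_args = {"memory_efficient": True}  # assumed optional argument
old_result = save_weights_old(
    [], "/tmp/ckpt", **filter_supported_kwargs(save_weights_old, **extra_args)
)
new_result = save_weights_new(
    [], "/tmp/ckpt", **filter_supported_kwargs(save_weights_new, **extra_args)
)
```

With this approach the caller does not need a version check: the older signature silently drops the unsupported argument, while the newer one receives it.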
|
/unittest-all |
Summary
Failed Tests
Skipped
Tests
Github Test Reporter by CTRF 💚 |
JiwaniZakir left a comment
In benchmark/bench.py, the new --trainer_strategy argument writes to config["trainer"]["trainer_strategy"] without first verifying that config["trainer"] exists — the same pattern used for config["synchronizer"] above it, which likely assumes the key is always present. If a benchmark config omits the trainer section entirely, this will raise a KeyError at runtime rather than a clear error message; consider adding a guard like config.setdefault("trainer", {}) before the assignment.
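A minimal sketch of the suggested guard, assuming the benchmark config is a plain nested dict. The helper name `apply_trainer_strategy` is hypothetical and does not appear in benchmark/bench.py; it also treats an empty `trainer:` section (loaded as `None`) the same as a missing one, which goes slightly beyond a bare `setdefault`.

```python
def apply_trainer_strategy(config, trainer_strategy=None):
    """Write trainer_strategy into config, creating the trainer section if needed."""
    if trainer_strategy is None:
        return config  # flag not given on the command line: leave config untouched
    # `or {}` also covers a section that is present but empty (None).
    trainer_section = config.get("trainer") or {}
    trainer_section["trainer_strategy"] = trainer_strategy
    config["trainer"] = trainer_section
    return config


without_section = apply_trainer_strategy({}, "megatron")
with_section = apply_trainer_strategy({"trainer": {"save_interval": 10}}, "fsdp")
```

Existing keys in the trainer section are preserved, and a config without the section no longer raises a KeyError.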
In docker-compose.yaml, both TRINITY_VLM_MODEL_PATH and TRINITY_ALTERNATIVE_VLM_MODEL_PATH are now set to the identical path Qwen3.5-0.8B. If the intent is to test alternative VLM code paths, using the same model for both means any logic that branches on TRINITY_ALTERNATIVE_VLM_MODEL_PATH won't be meaningfully exercised in CI. This should either be intentional (with a comment explaining why) or the alternative should point to a distinct model.
The two new documentation files under docs/agent_summarization/ reference snapshot directories at trinity/trainer/verl/build/v0.7.0/ and trinity/trainer/verl/build/v0.7.1/ as prerequisites for the three-way diff process, but those directories don't appear in the PR diff. If these snapshots are not committed to the repo, the checklist's step 3 ("confirm that the upstream snapshots needed for comparison have been generated") is unverifiable for reviewers and future contributors.
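One way to make that checklist step verifiable is a small pre-flight check that fails when a snapshot directory is absent. The sketch below is an assumption, not part of the PR: `missing_snapshots` is a hypothetical helper, and the base path is taken from the directories quoted in the comment above.

```python
import tempfile
from pathlib import Path

# Base directory quoted in the review comment; adjust if the layout differs.
SNAPSHOT_BASE = "trinity/trainer/verl/build"


def missing_snapshots(repo_root, versions, base=SNAPSHOT_BASE):
    """Return the versions whose snapshot directory does not exist under repo_root."""
    root = Path(repo_root) / base
    return [version for version in versions if not (root / version).is_dir()]


# Demo on a throwaway directory tree that contains only the v0.7.0 snapshot.
with tempfile.TemporaryDirectory() as repo:
    (Path(repo) / SNAPSHOT_BASE / "v0.7.0").mkdir(parents=True)
    demo_missing = missing_snapshots(repo, ["v0.7.0", "v0.7.1"])
```

A migration script could call such a check first and stop with a clear message when the returned list is non-empty.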
|
/unittest-all |
Summary
Skipped
Tests
Github Test Reporter by CTRF 💚 |
Pull request overview
Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.
|
/unittest-module-trainer |
Summary
Failed Tests
Skipped
Tests
Github Test Reporter by CTRF 💚 |
|
/unittest-pattern-ColocateModeTest |
Summary
Tests
Github Test Reporter by CTRF 💚 |
|
/unittest-module-trainer |
Summary
Skipped
Tests
Github Test Reporter by CTRF 💚 |
|
/unittest-all |
Summary
Failed Tests
Skipped
Tests
Github Test Reporter by CTRF 💚 |
Description
Upgrade veRL from v0.7.0 to v0.7.1 with core trainer migration
Checklist
Please check the following items before code is ready to be reviewed.