[feat]: verl integrate msprobe data collection #5186

Draft
Tjh-UKN wants to merge 3 commits into verl-project:main from Tjh-UKN:main

Conversation

@Tjh-UKN Tjh-UKN commented Feb 3, 2026

What does this PR do?

Integrates msprobe PrecisionDebugger into VERL’s rollout/ref/train/update paths with minimal, explicit start/stop calls. Dumps are organized as {data_dir}/{global_step}/{stage} for consistent step/stage separation. This adds a new precision_debugger config block and wires global_steps into training batches. (No related issues/PRs.)

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

Not covered by CI. This change depends on the external msprobe runtime and requires a real training run to validate dump outputs. Please run a short PPO/GRPO step on FSDP and Megatron backends with precision_debugger.enable=true and confirm dump directories are created per stage.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Example config snippet (ppo_trainer.yaml / ppo_megatron_trainer.yaml)
precision_debugger:
  enable: true
  config_path: /path/to/config.json
  data_dir: outputs/precision_debug
  steps: [1, 2, 5]  # optional
  stages: ["rollout", "train_fwd", "train_bwd", "update_actor", "ref_model"]
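A helper pair consuming this config block might look roughly like the sketch below. This is illustrative only: `precision_start`/`precision_stop` are the helper names this PR adds, but the msprobe import path and the `PrecisionDebugger` constructor keyword arguments are assumptions, not code from the PR.

```python
import os

# msprobe is an external, optional dependency; guard the import so training
# still runs when it is absent.
try:
    from msprobe.pytorch import PrecisionDebugger  # import path assumed
    _MSPROBE_AVAILABLE = True
except ImportError:
    PrecisionDebugger = None
    _MSPROBE_AVAILABLE = False


def staged_dump_dir(data_dir: str, global_step: int, stage: str) -> str:
    """Build the {data_dir}/{global_step}/{stage} layout from the PR description."""
    return os.path.join(data_dir, str(global_step), stage)


def precision_start(config_path, data_dir, global_step, stage):
    """Start a debugger scoped to one step/stage; no-op when msprobe is absent."""
    if not _MSPROBE_AVAILABLE:
        return None
    debugger = PrecisionDebugger(
        config_path=config_path,  # constructor kwargs are assumptions
        dump_path=staged_dump_dir(data_dir, global_step, stage),
    )
    debugger.start()
    return debugger


def precision_stop(debugger):
    """step() is always called before stop() to advance internal step counts."""
    if debugger is None:
        return
    debugger.step()
    debugger.stop()
```

With the `enable` gate and the `steps`/`stages` filters applied by the caller, each stage's tensors would land under e.g. `outputs/precision_debug/1/rollout`.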

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

  • Design: Minimal, explicit start/stop around key stages; PrecisionDebugger.step() is always called before stop() to advance internal step counts. Dumps are organized by {global_step}/{stage}.
  • Key changes:
    • Add precision_debugger config block to trainer configs and generated configs.
    • Inject global_steps into DataProto.meta_info in trainer loop.
    • Add precision_start/precision_stop helper (msprobe only) with staged dump paths.
    • Hook start/stop into:
      • Rollout (generate_sequences)
      • Ref model (compute_ref_log_prob)
      • Train forward/backward/update actor

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

  • Read the Contribute Guide.
  • Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
  • Add / Update the documentation.
  • Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: msprobe runtime is external; this integration requires environment-specific profiling runs.
  • Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
  • If your PR is related to the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


TAJh seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@Tjh-UKN changed the title from "add msprobe" to "[feat]: verl integrate msprobe data collection" on Feb 3, 2026
@DistProfiler.annotate(color="olive", role="ref_compute_log_prob")
def compute_ref_log_prob(self, data: TensorDict) -> TensorDict:
    global_step = data.get("global_steps", None)
    handle = precision_start(
Collaborator

Please move precision_start and precision_stop into DistProfiler
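One possible shape for that refactor, with `DistProfiler` and both helpers stubbed out for illustration (the real `DistProfiler` internals are not shown in this PR, so all of this is an assumption):

```python
import functools
from types import SimpleNamespace

events = []  # records start/stop ordering for the demo


def precision_start(step, stage):            # stub for the PR's helper
    events.append(("start", step, stage))
    return (step, stage)


def precision_stop(handle):                  # stub for the PR's helper
    events.append(("stop",) + handle)


class DistProfiler:
    """Illustrative stand-in; the real DistProfiler lives in verl."""

    @staticmethod
    def annotate(color=None, role=None, precision_stage=None):
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(self, data, *args, **kwargs):
                handle = None
                if precision_stage is not None:
                    step = data.meta_info.get("global_steps", 0)
                    handle = precision_start(step, precision_stage)
                try:
                    return fn(self, data, *args, **kwargs)
                finally:
                    if handle is not None:
                        precision_stop(handle)
            return wrapper
        return decorator


class RefWorker:
    @DistProfiler.annotate(color="olive", role="ref_compute_log_prob",
                           precision_stage="ref_model")
    def compute_ref_log_prob(self, data):
        return "log_probs"


out = RefWorker().compute_ref_log_prob(SimpleNamespace(meta_info={"global_steps": 4}))
```

Folding the calls into the existing annotate decorator would remove the per-call-site boilerplate shown in the diff above.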

Author

I’ll do that

kw_args: {}

# precision debugger configs
precision_debugger:
Collaborator

  1. Move it under global_profiler
  2. Add a typed config class for it.
  3. Add doc
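A typed config class for point 2 could look roughly like this; the field names mirror the YAML block in the PR description, while the `should_dump` helper and the placement under `global_profiler` (point 1) are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PrecisionDebuggerConfig:
    """Hypothetical typed mirror of the precision_debugger YAML block."""
    enable: bool = False
    config_path: Optional[str] = None
    data_dir: str = "outputs/precision_debug"
    steps: Optional[List[int]] = None  # None means dump on every step
    stages: List[str] = field(default_factory=lambda: [
        "rollout", "train_fwd", "train_bwd", "update_actor", "ref_model",
    ])

    def should_dump(self, global_step: int, stage: str) -> bool:
        """Gate dumping on the enable flag and the steps/stages filters."""
        if not self.enable or stage not in self.stages:
            return False
        return self.steps is None or global_step in self.steps
```

A typed class gives the trainer a single validated entry point instead of ad-hoc dict lookups, and makes the optional `steps` filter explicit.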

Author

@Tjh-UKN commented Feb 3, 2026


I'll move it. This is the draft version (step 1); the profiler folks will review it to decide the final version (step 2). Thanks for the tips.

@tardis-key
Collaborator

Since this PR introduces a new msprobe dependency, please update the corresponding requirements files, add the necessary CI coverage, and update the Dockerfile.

@tardis-key
Collaborator

Integrate with the DistProfiler control logic as much as possible to maximize reuse and minimize modification. It may be best to confirm the control-flow design for this PR first.

@tardis-key
Collaborator

tardis-key commented Feb 3, 2026

Tests are needed across the different hardware platforms, training engines, and rollout engines.

@tardis-key tardis-key self-requested a review February 4, 2026 01:04
@Tjh-UKN Tjh-UKN marked this pull request as draft February 4, 2026 01:19
