[feat]: verl integrate msprobe data collection #5186

Draft
Tjh-UKN wants to merge 3 commits into verl-project:main from Tjh-UKN:main

Conversation

@Tjh-UKN Tjh-UKN commented Feb 3, 2026

What does this PR do?

Integrates msprobe PrecisionDebugger into VERL’s rollout/ref/train/update paths with minimal, explicit start/stop calls. Dumps are organized as {data_dir}/{global_step}/{stage} for consistent step/stage separation. This adds a new precision_debugger config block and wires global_steps into training batches. (No related issues/PRs.)

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

Not covered by CI. This change depends on the external msprobe runtime and requires a real training run to validate dump outputs. Please run a short PPO/GRPO step on FSDP and Megatron backends with precision_debugger.enable=true and confirm dump directories are created per stage.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Example config snippet (ppo_trainer.yaml / ppo_megatron_trainer.yaml)
precision_debugger:
  enable: true
  config_path: /path/to/config.json
  data_dir: outputs/precision_debug
  steps: [1, 2, 5]  # optional
  stages: ["rollout", "train_fwd", "train_bwd", "update_actor", "ref_model"]
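A helper pair consuming this config block might look roughly like the sketch below. This is illustrative only: `precision_start`/`precision_stop` are the helper names this PR adds, but the msprobe import path and the `PrecisionDebugger` constructor keyword arguments are assumptions, not code from the PR.

```python
import os

# msprobe is an external, optional dependency; guard the import so training
# still runs when it is absent.
try:
    from msprobe.pytorch import PrecisionDebugger  # import path assumed
    _MSPROBE_AVAILABLE = True
except ImportError:
    PrecisionDebugger = None
    _MSPROBE_AVAILABLE = False


def staged_dump_dir(data_dir: str, global_step: int, stage: str) -> str:
    """Build the {data_dir}/{global_step}/{stage} layout from the PR description."""
    return os.path.join(data_dir, str(global_step), stage)


def precision_start(config_path, data_dir, global_step, stage):
    """Start a debugger scoped to one step/stage; no-op when msprobe is absent."""
    if not _MSPROBE_AVAILABLE:
        return None
    debugger = PrecisionDebugger(
        config_path=config_path,  # constructor kwargs are assumptions
        dump_path=staged_dump_dir(data_dir, global_step, stage),
    )
    debugger.start()
    return debugger


def precision_stop(debugger):
    """step() is always called before stop() to advance internal step counts."""
    if debugger is None:
        return
    debugger.step()
    debugger.stop()
```

With the `enable` gate and the `steps`/`stages` filters applied by the caller, each stage's tensors would land under e.g. `outputs/precision_debug/1/rollout`.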

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

  • Design: Minimal, explicit start/stop around key stages; PrecisionDebugger.step() is always called before stop() to advance internal step counts. Dumps are organized by {global_step}/{stage}.
  • Key changes:
    • Add precision_debugger config block to trainer configs and generated configs.
    • Inject global_steps into DataProto.meta_info in trainer loop.
    • Add precision_start/precision_stop helper (msprobe only) with staged dump paths.
    • Hook start/stop into:
      • Rollout (generate_sequences)
      • Ref model (compute_ref_log_prob)
      • Train forward/backward/update actor

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

  • Read the Contribute Guide.
  • Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
  • Add / Update the documentation.
  • Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: msprobe runtime is external; this integration requires environment-specific profiling runs.
  • Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
  • If your PR is related to the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


TAJh seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@Tjh-UKN changed the title from "add msprobe" to "[feat]: verl integrate msprobe data collection" on Feb 3, 2026
@DistProfiler.annotate(color="olive", role="ref_compute_log_prob")
def compute_ref_log_prob(self, data: TensorDict) -> TensorDict:
    global_step = data.get("global_steps", None)
    handle = precision_start(
Collaborator

Please move precision_start and precision_stop into DistProfiler
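One possible shape for that refactor, with `DistProfiler` and both helpers stubbed out for illustration (the real `DistProfiler` internals are not shown in this PR, so all of this is an assumption):

```python
import functools
from types import SimpleNamespace

events = []  # records start/stop ordering for the demo


def precision_start(step, stage):            # stub for the PR's helper
    events.append(("start", step, stage))
    return (step, stage)


def precision_stop(handle):                  # stub for the PR's helper
    events.append(("stop",) + handle)


class DistProfiler:
    """Illustrative stand-in; the real DistProfiler lives in verl."""

    @staticmethod
    def annotate(color=None, role=None, precision_stage=None):
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(self, data, *args, **kwargs):
                handle = None
                if precision_stage is not None:
                    step = data.meta_info.get("global_steps", 0)
                    handle = precision_start(step, precision_stage)
                try:
                    return fn(self, data, *args, **kwargs)
                finally:
                    if handle is not None:
                        precision_stop(handle)
            return wrapper
        return decorator


class RefWorker:
    @DistProfiler.annotate(color="olive", role="ref_compute_log_prob",
                           precision_stage="ref_model")
    def compute_ref_log_prob(self, data):
        return "log_probs"


out = RefWorker().compute_ref_log_prob(SimpleNamespace(meta_info={"global_steps": 4}))
```

Folding the calls into the existing annotate decorator would remove the per-call-site boilerplate shown in the diff above.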

Author

I’ll do that

kw_args: {}

# precision debugger configs
precision_debugger:
Collaborator

  1. Move it under global_profiler
  2. Add a typed config class for it.
  3. Add doc
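A typed config class for point 2 could look roughly like this; the field names mirror the YAML block in the PR description, while the `should_dump` helper and the placement under `global_profiler` (point 1) are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PrecisionDebuggerConfig:
    """Hypothetical typed mirror of the precision_debugger YAML block."""
    enable: bool = False
    config_path: Optional[str] = None
    data_dir: str = "outputs/precision_debug"
    steps: Optional[List[int]] = None  # None means dump on every step
    stages: List[str] = field(default_factory=lambda: [
        "rollout", "train_fwd", "train_bwd", "update_actor", "ref_model",
    ])

    def should_dump(self, global_step: int, stage: str) -> bool:
        """Gate dumping on the enable flag and the steps/stages filters."""
        if not self.enable or stage not in self.stages:
            return False
        return self.steps is None or global_step in self.steps
```

A typed class gives the trainer a single validated entry point instead of ad-hoc dict lookups, and makes the optional `steps` filter explicit.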

Author

@Tjh-UKN commented Feb 3, 2026


I'll move it. This is the draft version (step 1); the profiler folks will review it to decide the final version (step 2). Thanks for the tips.

@tardis-key
Collaborator

Since this PR introduces a new msprobe dependency, please update the corresponding requirements files, add the necessary CI coverage, and update the Dockerfile.

@tardis-key
Collaborator

Integrate with the DistProfiler control logic as much as possible to maximize reuse and minimize modification. It may be best to confirm the control-flow design for this PR first.

@tardis-key
Collaborator

tardis-key commented Feb 3, 2026

Tests are needed across the different hardware platforms, training engines, and rollout engines.

@tardis-key tardis-key self-requested a review February 4, 2026 01:04
@Tjh-UKN Tjh-UKN marked this pull request as draft February 4, 2026 01:19
