[feat]: verl integrate msprobe data collection #5186
Tjh-UKN wants to merge 3 commits into verl-project:main
Conversation
TAJh does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. Have you already signed the CLA but the status is still pending? Let us recheck it.
verl/workers/engine_workers.py
Outdated
```python
@DistProfiler.annotate(color="olive", role="ref_compute_log_prob")
def compute_ref_log_prob(self, data: TensorDict) -> TensorDict:
    global_step = data.get("global_steps", None)
    handle = precision_start(
```
Please move precision_start and precision_stop into DistProfiler
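One possible shape for that refactor is sketched below. It assumes `precision_start`/`precision_stop` keep roughly the stage-plus-global-step calling pattern shown in the diff; the decorator name `annotate_with_precision` and the import path are hypothetical, not the PR's final API.

```python
# A minimal sketch, not the PR's implementation: fold the msprobe calls into a
# DistProfiler-style decorator so worker methods no longer hand-roll start/stop pairs.
# `annotate_with_precision` and the helper import path are hypothetical.
import functools

from verl.utils.debug import precision_start, precision_stop  # hypothetical module path


def annotate_with_precision(stage: str):
    """Wrap a worker method so msprobe dump collection covers exactly this stage."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, data, *args, **kwargs):
            # `global_steps` travels with the batch, mirroring compute_ref_log_prob above.
            global_step = data.get("global_steps", None)
            handle = precision_start(stage=stage, global_step=global_step)  # assumed signature
            try:
                return func(self, data, *args, **kwargs)
            finally:
                precision_stop(handle)  # assumed signature

        return wrapper

    return decorator
```

Worker methods could then stack this next to the existing `DistProfiler.annotate` decorator, or the same logic could live inside `DistProfiler` itself.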
verl/trainer/config/ppo_trainer.yaml
Outdated
```yaml
kw_args: {}

# precision debugger configs
precision_debugger:
```
- Move it under `global_profiler`
- Add a typed config class for it
- Add doc
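A minimal sketch of what the typed config could look like, assuming the block ends up nested under `global_profiler`; the class name and fields are placeholders, not the final schema.

```python
# A minimal sketch of a typed config for the precision debugger block, assuming it is
# nested under the existing global_profiler section. Names and fields are placeholders.
from dataclasses import dataclass, field


@dataclass
class PrecisionDebuggerConfig:
    """Typed counterpart of the `precision_debugger` YAML block (illustrative only)."""

    enable: bool = False                  # opt-in switch for msprobe dump collection
    dump_path: str = "./msprobe_dump"     # root directory; dumps grouped by {global_step}/{stage}
    kw_args: dict = field(default_factory=dict)  # extra arguments forwarded to msprobe
```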
I'll move it.
Here is the draft version [step 1], and it will be reviewed by the profiler teammates to decide the final version [step 2].
Thanks for your tips.
Since this PR introduces the msprobe dependency, the corresponding requirements files need to be updated, the necessary CI supplemented, and the Dockerfile updated.
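Until the packaging decision lands, a guarded import keeps verl importable without msprobe. The `msprobe.pytorch` entry point follows msprobe's documented PyTorch usage; the helper below is a hypothetical sketch, not code from this PR.

```python
# A minimal sketch of treating msprobe as an optional dependency: verl should still
# import cleanly when the package is absent. The msprobe.pytorch entry point follows
# msprobe's documented PyTorch usage; the helper itself is hypothetical.
try:
    from msprobe.pytorch import PrecisionDebugger
    MSPROBE_AVAILABLE = True
except ImportError:
    PrecisionDebugger = None
    MSPROBE_AVAILABLE = False


def precision_debugging_enabled(config) -> bool:
    """Dumps are collected only when msprobe is installed and the user opted in."""
    return MSPROBE_AVAILABLE and bool(getattr(config, "enable", False))
```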
Integrate into the DistProfiler control logic as much as possible to maximize reuse and minimize modification. Maybe we can confirm the control-flow design for this PR first.
Necessary tests are needed across different hardware platforms, training engines, and rollout engines.
What does this PR do?
Checklist Before Starting
- Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI).
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, like `[megatron, fsdp, doc]`.
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`.
  - If the change is breaking, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`.

Test
API and Usage Example
Design & Code Changes
- `PrecisionDebugger.step()` is always called before `stop()` to advance internal step counts. Dumps are organized by `{global_step}/{stage}`.
- Add a `precision_debugger` config block to the trainer configs and generated configs.
- Pass `global_steps` into `DataProto.meta_info` in the trainer loop.
- Add a `precision_start`/`precision_stop` helper (msprobe only) with staged dump paths (see the sketch below).
- Instrumented stages: rollout (`generate_sequences`) and reference log prob (`compute_ref_log_prob`).
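A minimal sketch of how the helper pair described above could look, assuming msprobe's `PrecisionDebugger` accepts a `dump_path` and exposes `start()`/`step()`/`stop()`; the constructor arguments and config fields are assumptions, not the PR's final code.

```python
# A minimal sketch of precision_start/precision_stop with staged dump paths. Assumes
# msprobe's PrecisionDebugger accepts dump_path and exposes start()/step()/stop();
# constructor arguments and config fields are assumptions, not the PR's final code.
import os

try:  # msprobe stays optional; see the guarded-import sketch earlier in the thread
    from msprobe.pytorch import PrecisionDebugger
except ImportError:
    PrecisionDebugger = None


def precision_start(config, stage: str, global_step=None):
    """Begin an msprobe dump scoped to one stage of one global step."""
    if PrecisionDebugger is None or not getattr(config, "enable", False):
        return None
    # Staged dump path: {dump_path}/{global_step}/{stage}
    dump_path = os.path.join(config.dump_path, str(global_step), stage)
    debugger = PrecisionDebugger(dump_path=dump_path, **getattr(config, "kw_args", {}))
    debugger.start()
    return debugger


def precision_stop(debugger):
    """step() always runs before stop() so msprobe's internal step counter advances."""
    if debugger is None:
        return
    debugger.step()
    debugger.stop()
```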
Checklist Before Submitting

Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Once the PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If the PR touches the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.