why are there multiple settings for actor_rollout_ref.model.enable_gradient_checkpointing? Is this a deliberate design choice?#4263
Open
khazic wants to merge 64 commits intoverl-project:mainfrom
Open
why are there multiple settings for actor_rollout_ref.model.enable_gradient_checkpointing? Is this a deliberate design choice?#4263khazic wants to merge 64 commits intoverl-project:mainfrom
actor_rollout_ref.model.enable_gradient_checkpointing? Is this a deliberate design choice?#4263khazic wants to merge 64 commits intoverl-project:mainfrom
Conversation
Contributor
There was a problem hiding this comment.
Code Review
This pull request removes a conflicting and duplicate configuration for actor_rollout_ref.model.enable_gradient_checkpointing in the run_qwen2.5-32b.sh script. This improves the clarity of the configuration. However, as a side effect, this change disables gradient checkpointing for the actor model. I've added a comment highlighting the potential for this to cause out-of-memory errors, as this is a significant change for a large 32B parameter model.
- add FSDP GRPO launcher with vLLM rollout settings - update Megatron launcher to keep workers running and log to W&B - increase Megatron NCCL timeout to 1200s - log validation generations by default in PPO trainer - remove legacy GRPO DLC script
- add single-node 8xGPU Megatron GRPO script with TP/PP=1 - tune batch sizes and validation defaults for single-node runs - update existing GRPO launch scripts to match latest paths/settings
- set WANDB_MODE=offline in single-node Megatron script - avoid proxy failures during W&B logging
- reduce batch sizes and sequence lengths for Megatron single-node - align FSDP single-node script with safer rollout settings - keep vLLM utilization low for constrained free memory
- raise vLLM gpu_memory_utilization to 0.30 for KV cache - lower rollout.n and cap max batched tokens for stability - apply settings to both Megatron and FSDP single-node scripts
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
API and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)