
Why are there multiple settings for actor_rollout_ref.model.enable_gradient_checkpointing? Is this a deliberate design choice? #4263

Open

khazic wants to merge 64 commits into verl-project:main from khazic:main

Conversation

@khazic (Contributor) commented Nov 24, 2025

What does this PR do?

Add a concise overview of what this PR aims to achieve or accomplish, and reference related GitHub issues and PRs that help with the review.

Checklist Before Starting

  • [✅] Search for similar PRs. Paste at least one query link here: ...
  • [✅] Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with commas, like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this
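The line above is an unfilled template placeholder. As an illustration only (the entrypoint is verl's standard one, but the model path and values below are assumptions, not taken from this PR), the setting in question is passed as a command-line override:

```bash
#!/usr/bin/env bash
# Sketch: toggling gradient checkpointing for the actor model via an
# override on verl's PPO entrypoint. A real run also needs data, reward,
# and rollout settings; only the last flag is relevant to this PR.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
    actor_rollout_ref.model.enable_gradient_checkpointing=True
```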

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@gemini-code-assist bot left a comment

Code Review

This pull request removes a conflicting, duplicated configuration for actor_rollout_ref.model.enable_gradient_checkpointing in the run_qwen2.5-32b.sh script, which improves the clarity of the configuration. However, as a side effect, this change disables gradient checkpointing for the actor model. I've added a comment highlighting the potential for this to cause out-of-memory errors, as this is a significant change for a 32B-parameter model.
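For context, a minimal sketch of the kind of conflict being removed; the script contents are not quoted in this thread, so the duplicated values below are illustrative:

```bash
#!/usr/bin/env bash
# Hypothetical excerpt from run_qwen2.5-32b.sh before this PR: the same
# key appears twice among the overrides. Depending on how the config
# system resolves duplicates, one value silently shadows the other or the
# launch errors out; either way the effective setting is ambiguous.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.model.enable_gradient_checkpointing=False
```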

@CLAassistant commented Feb 2, 2026

CLA assistant check
All committers have signed the CLA.

khazic added 21 commits on February 3, 2026 at 14:24
- add FSDP GRPO launcher with vLLM rollout settings
- update Megatron launcher to keep workers running and log to W&B
- increase Megatron NCCL timeout to 1200s
- log validation generations by default in PPO trainer
- remove legacy GRPO DLC script
- add single-node 8xGPU Megatron GRPO script with TP/PP=1
- tune batch sizes and validation defaults for single-node runs
- update existing GRPO launch scripts to match latest paths/settings
- set WANDB_MODE=offline in single-node Megatron script
- avoid proxy failures during W&B logging
- reduce batch sizes and sequence lengths for Megatron single-node
- align FSDP single-node script with safer rollout settings
- keep vLLM utilization low for constrained free memory
- raise vLLM gpu_memory_utilization to 0.30 for KV cache
- lower rollout.n and cap max batched tokens for stability
- apply settings to both Megatron and FSDP single-node scripts
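Taken together, these commits describe a single-node tuning pass. A hedged consolidation of the settings they mention (the config keys are verl's standard ones; the concrete values for rollout.n and max batched tokens are assumptions, since the messages do not state them):

```bash
#!/usr/bin/env bash
# Hypothetical single-node 8xGPU launch reflecting the commit messages:
# offline W&B to avoid proxy failures, vLLM memory fraction raised to
# 0.30 for KV cache, and capped rollout fan-out and batched tokens.
export WANDB_MODE=offline
python3 -m verl.trainer.main_ppo \
    trainer.nnodes=1 \
    trainer.n_gpus_per_node=8 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.30 \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.rollout.max_num_batched_tokens=8192
```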