Skip to content

[rollout,vllm] Fix DP args and local_rank for Ray NOSET_VISIBLE_DEVICES#5233

Open
JohnConnor123 wants to merge 4 commits intoverl-project:mainfrom
JohnConnor123:bugfix/vllm_dp_rank_and_args
Open

[rollout,vllm] Fix DP args and local_rank for Ray NOSET_VISIBLE_DEVICES#5233
JohnConnor123 wants to merge 4 commits intoverl-project:mainfrom
JohnConnor123:bugfix/vllm_dp_rank_and_args

Conversation

@JohnConnor123
Copy link
Contributor

Summary

  • Fix A: ensure vLLM receives DP CLI args when data_parallel_size > 1 (not only when EP is enabled).
  • Fix B: when running under Ray with RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 and data_parallel_size > 1,
    compute the correct vLLM-local LOCAL_RANK (local rank within TP×PP) to avoid DP adjusted local rank ... out of bounds
    and incorrect GPU binding.
  • This is a refinement of the comments from PR [rollout,vllm] Fix DP args and local_rank for Ray NOSET_VISIBLE_DEVICES #5152

Context / Motivation

In DP configs (typically dp=2,tp=1,pp=1,nnodes=1):

  • vLLM could start without DP settings if the condition was tied only to EP.
  • Under Ray NOSET, using ray.get_runtime_context().get_accelerator_ids() to derive local rank can produce a rank that becomes
    out-of-bounds after vLLM DP adjustment.

Checklist

  • Bug fixed
  • Repro / test plan included
  • Regression test (real Ray+vLLM e2e)

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces two important fixes for running vLLM with data parallelism under Ray. The first fix ensures that data parallelism arguments are correctly passed to vLLM, which was previously only done for expert parallelism. The second fix aims to correct the local_rank for workers when using Ray's NOSET_VISIBLE_DEVICES mode to prevent out-of-bounds errors.

My review focuses on the implementation of the local_rank adjustment. While the change to include DP arguments is correct, the logic for local_rank correction in verl/workers/rollout/vllm_rollout/utils.py contains critical bugs related to argument parsing that would prevent it from functioning. I've provided a detailed comment with a corrected implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant