[rollout,vllm] Fix DP args and local_rank for Ray NOSET_VISIBLE_DEVICES #5233
JohnConnor123 wants to merge 4 commits into verl-project:main from
Code Review
This pull request introduces two important fixes for running vLLM with data parallelism under Ray. The first fix ensures that data parallelism arguments are correctly passed to vLLM, which was previously only done for expert parallelism. The second fix aims to correct the local_rank for workers when using Ray's NOSET_VISIBLE_DEVICES mode to prevent out-of-bounds errors.
My review focuses on the implementation of the local_rank adjustment. While the change to include DP arguments is correct, the logic for local_rank correction in verl/workers/rollout/vllm_rollout/utils.py contains critical bugs related to argument parsing that would prevent it from functioning. I've provided a detailed comment with a corrected implementation.
Summary
- Pass the DP arguments to vLLM whenever data_parallel_size > 1 (not only when EP is enabled).
- When RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 and data_parallel_size > 1, compute the correct vLLM-local LOCAL_RANK (the local rank within TP×PP) to avoid the "DP adjusted local rank ... out of bounds" error and incorrect GPU binding.
Context / Motivation
In DP configs (typically dp=2, tp=1, pp=1, nnodes=1), using ray.get_runtime_context().get_accelerator_ids() to derive the local rank can produce a rank that becomes out-of-bounds after vLLM's DP adjustment.
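The failure mode can be illustrated with a minimal sketch. It is not the code from this PR or from vLLM: both function names are hypothetical, the DP adjustment is modeled as adding dp_local_rank × (TP×PP) to the local rank, and worker ranks are assumed to be laid out so that each consecutive block of TP×PP ranks forms one DP replica.

```python
def local_rank_within_tp_pp(rank: int, tp_size: int, pp_size: int) -> int:
    """Hypothetical helper: position of a worker inside its own TP x PP group.

    Assumes each consecutive block of tp_size * pp_size ranks is one DP replica.
    """
    return rank % (tp_size * pp_size)


def dp_adjusted_local_rank(
    local_rank: int,
    dp_local_rank: int,
    tp_size: int,
    pp_size: int,
    device_count: int,
) -> int:
    """Simplified model (an assumption) of the DP adjustment applied to local_rank."""
    adjusted = local_rank + dp_local_rank * tp_size * pp_size
    if adjusted >= device_count:
        raise ValueError(f"DP adjusted local rank {adjusted} is out of bounds")
    return adjusted


# dp=2, tp=1, pp=1 on one node with 2 visible GPUs; look at DP replica 1.
tp, pp, gpus = 1, 1, 2

# Rank relative to the TP x PP group: global rank 1 maps to local rank 0,
# and the adjustment then selects device 1.
good = dp_adjusted_local_rank(local_rank_within_tp_pp(1, tp, pp), 1, tp, pp, gpus)
print(good)

# Physical GPU index 1 used as the local rank: the DP offset is effectively
# applied twice, giving 2, which exceeds the 2 visible devices.
try:
    dp_adjusted_local_rank(1, 1, tp, pp, gpus)
except ValueError as err:
    print(err)
```

Under these assumptions the worker must start from its rank inside the TP×PP group, since the DP offset is added afterwards. Starting from a physical GPU index counts that offset twice.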
Checklist