-
Notifications
You must be signed in to change notification settings - Fork 613
[bugfix] the hccl_buffsize config for EP #4512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request fixes an issue with hccl_buffsize configuration by ensuring the HCCL_BUFFSIZE environment variable is respected when creating HCCL process group options. The changes look good, but I have a couple of suggestions. First, the logic for parsing the buffer size from the environment variable should be improved to handle non-positive values to prevent potential issues. Second, since the special handling for "mc2" communication groups was removed, adding a specific test case for this scenario would increase confidence that this change does not introduce any regressions.
Signed-off-by: mojave2 <[email protected]>
What this PR does / why we need it?
The hccl_buffsize for EP process group is set by the
_DEFAULT_BUFFER_SIZE, which is not enough for fused_moe kernel, especially for large max-num-batched-tokens. So we need to calculate the hccl_buffer for EP from the vllm_config.Does this PR introduce any user-facing change?
How was this patch tested?