
Conversation

@hiworldwzj (Collaborator)

No description provided.

@gemini-code-assist

Summary of Changes

Hello @hiworldwzj, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance of LightLLM in distributed inference environments, particularly during the prefill stage when operating in data-parallel mode with tpsp_mix_mode. It introduces a sophisticated data rebalancing strategy that dynamically redistributes input sequences across DP ranks. By ensuring a more uniform workload distribution before critical computational steps like attention and feed-forward networks, this change effectively addresses bottlenecks arising from imbalanced input data, leading to improved overall prefill efficiency.

Highlights

  • Dynamic Data Rebalancing for Prefill: Introduced a mechanism to dynamically rebalance input data (tokens, position IDs, etc.) across data-parallel (DP) ranks during the prefill phase. This mitigates performance degradation caused by uneven data distribution, giving each DP rank a more balanced workload.
  • Core Rebalancing Logic: Implemented prefill_dp_balance and the associated _all_to_all_balance_get and _all_to_all_unbalance_get methods in InferStateInfo. These methods use all_gather and all_to_all_single operations to redistribute and later restore data across DP ranks (see the sketch after this list).
  • Integration with Model Layers: The rebalancing logic is integrated into the tpsp_context_forward and tpsp_token_forward methods for Llama models, and KV cache allocation is adjusted. Qwen3-MoE also received updates to its FFN layer to support this balancing in both TP and EP modes.
  • New Configuration Option: Added a new command-line argument --enable_dp_prefill_balance to activate this feature. This option requires --enable_tpsp_mix_mode to be enabled and dp (data parallelism) to be greater than 1, ensuring it's used in appropriate distributed setups.
  • Dedicated Communication Group: A specific dp_prefill_balance_group has been added to the CustomProcessGroup to handle the communication required for this prefill balancing, ensuring efficient and isolated data transfers.
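
To make the mechanism concrete, below is a minimal sketch of the balancing pattern the highlights describe, assuming each rank holds a 1-D token tensor: all_gather the per-rank lengths, re-cut the global token sequence into even shares, and exchange the slices with all_to_all_single. The names balance_tokens and _ranges and the even-split policy are illustrative, not the PR's actual code.

import torch
import torch.distributed as dist

def _ranges(lens):
    # prefix-sum helper: [len0, len1, ...] -> [(start, end), ...] in global order
    starts, acc = [], 0
    for n in lens:
        starts.append(acc)
        acc += n
    return [(s, s + n) for s, n in zip(starts, lens)]

def balance_tokens(tokens, group=None):
    rank = dist.get_rank(group=group)
    world = dist.get_world_size(group=group)

    # 1. every rank learns how many tokens each rank currently holds
    local_len = torch.tensor([tokens.numel()], device=tokens.device)
    gathered = [torch.zeros_like(local_len) for _ in range(world)]
    dist.all_gather(gathered, local_len, group=group)
    lens = [int(t.item()) for t in gathered]

    # 2. view all tokens as one global sequence and re-cut it evenly
    total = sum(lens)
    base, rem = divmod(total, world)
    tgt_lens = [base + (1 if r < rem else 0) for r in range(world)]
    src, tgt = _ranges(lens), _ranges(tgt_lens)

    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))

    in_splits = [overlap(src[rank], tgt[j]) for j in range(world)]   # what I send to rank j
    out_splits = [overlap(src[j], tgt[rank]) for j in range(world)]  # what I receive from rank j

    # 3. exchange slices so every rank ends up with ~total/world tokens
    out = torch.empty(tgt_lens[rank], dtype=tokens.dtype, device=tokens.device)
    dist.all_to_all_single(out, tokens, output_split_sizes=out_splits,
                           input_split_sizes=in_splits, group=group)
    return out

Running the same exchange with in_splits and out_splits swapped restores the original per-rank layout, analogous to the role _all_to_all_unbalance_get plays in the PR.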

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a mechanism to balance the workload across data-parallel (DP) ranks during the prefill stage, specifically for the TP+SP mixed mode. This is beneficial when input sequences have highly variable lengths across DP ranks. The implementation involves redistributing input tensors using all_to_all communication. The core logic is added to InferStateInfo and is triggered from the pre_layer_infer modules. My review focuses on the performance implications of the chosen implementation strategy, particularly the frequent use of all_to_all communication within each transformer layer, and some minor maintainability improvements. While the load balancing concept is sound, the per-layer communication overhead might negate the benefits.

Comment on lines 237 to 239
if infer_state.is_prefill and get_env_start_args().enable_dp_prefill_balance:
q = infer_state._all_to_all_unbalance_get(data=q)
cache_kv = infer_state._all_to_all_unbalance_get(data=cache_kv)


Severity: high

The code performs an _all_to_all_unbalance_get before the attention calculation and an _all_to_all_balance_get after the attention calculation (in _tpsp_get_o). This introduces two all-to-all communication steps per layer, which can be a significant performance bottleneck. This seems to defeat the purpose of balancing the load for prefill, as the attention, a compute-heavy part, would run on unbalanced data. Please clarify the reasoning behind this design. If the attention kernel does not support the balanced data layout, this should be documented with a code comment explaining the limitation and the workaround.
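
If the limitation is indeed the attention kernel's expected data layout, the documenting comment the reviewer requests might read roughly as follows (the code lines are from the diff; the NOTE text is a hypothesis for the author to confirm or correct):

if infer_state.is_prefill and get_env_start_args().enable_dp_prefill_balance:
    # NOTE: the attention kernel below expects each rank's original
    # (unbalanced) token layout, so the prefill balance is undone here and
    # re-applied in _tpsp_get_o; this costs two all-to-all exchanges per
    # layer, and only the non-attention path runs on balanced data.
    q = infer_state._all_to_all_unbalance_get(data=q)
    cache_kv = infer_state._all_to_all_unbalance_get(data=cache_kv)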

…prefill inference performance degradation
"""
assert self.is_prefill
import torch.distributed as dist


Severity: medium

The import import torch.distributed as dist is local to the prefill_dp_balance method. It's better to move it to the top of the file for consistency and to avoid repeated import overhead. The same applies to other local imports in _all_to_all_balance_get and _all_to_all_unbalance_get.

self.dp_handle_lens = dp_handle_lens
# partition each dp's original input and the original input after redistribution
origin_datas = collections.deque()
for origin_dp_index, origin_dp_input_len in enumerate(dp_input_lens.numpy()):


Severity: medium

Calling .numpy() inside a loop can be inefficient as it may cause a GPU-to-CPU synchronization on each iteration. It's better to call it once before the loop and iterate over the resulting numpy array.
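
A hypothetical hoist that addresses this, reusing the names from the snippet (add a .cpu() call first if dp_input_lens lives on the GPU):

dp_input_lens_np = dp_input_lens.numpy()  # one conversion before the loop
for origin_dp_index, origin_dp_input_len in enumerate(dp_input_lens_np):
    ...  # loop body unchanged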



def normal_or_p_d_start(args):
from lightllm.server.core.objs.start_args_type import StartArgs


Severity: medium

The import from lightllm.server.core.objs.start_args_type import StartArgs is local to the function normal_or_p_d_start. It's better to move it to the top of the file for consistency and readability.

assert args.router_token_ratio == 0.0

if args.enable_dp_prefill_balance:
assert args.enable_tpsp_mix_mode and args.dp > 1, "need set --enable_tpsp_mix_mode firstly"


Severity: medium

The assertion message is a bit vague. A more descriptive message would be helpful for the user, explaining that --enable_dp_prefill_balance has dependencies.

Suggested change
- assert args.enable_tpsp_mix_mode and args.dp > 1, "need set --enable_tpsp_mix_mode firstly"
+ assert args.enable_tpsp_mix_mode and args.dp > 1, "--enable_dp_prefill_balance requires --enable_tpsp_mix_mode and --dp > 1"

if ranks 0,1,2,3 form one dp and ranks 4,5,6,7 form another dp, then
communication groups are built between [0,4], [1,5], [2,6], [3,7]
"""
from lightllm.utils.envs_utils import get_env_start_args


Severity: medium

The local import from lightllm.utils.envs_utils import get_env_start_args should be moved to the top of the file for better code organization and to follow standard Python conventions.
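
As an aside, the pairing topology described in the docstring above can be reconstructed with plain torch.distributed. The sketch below is hypothetical (build_dp_prefill_balance_groups is an illustrative name, and LightLLM's actual CustomProcessGroup wiring differs); with world_size=8 and dp_size=2 it creates exactly the groups [0,4], [1,5], [2,6], [3,7]:

import torch.distributed as dist

def build_dp_prefill_balance_groups(world_size, dp_size):
    ranks_per_dp = world_size // dp_size
    my_group = None
    # pair the ranks that share the same in-dp index across all dp replicas
    for local_index in range(ranks_per_dp):
        ranks = [dp * ranks_per_dp + local_index for dp in range(dp_size)]
        group = dist.new_group(ranks=ranks)  # collective: every process must call it
        if dist.get_rank() in ranks:
            my_group = group
    return my_group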

@shihaobai shihaobai merged commit 16a41d2 into main Oct 27, 2025
1 check passed
@shihaobai shihaobai deleted the wzj branch October 27, 2025 12:39
