
Conversation

@hiworldwzj (Collaborator)

No description provided.

@gemini-code-assist

Summary of Changes

Hello @hiworldwzj, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance of LightLLM in distributed inference environments, particularly during the prefill stage when operating in data-parallel mode with tpsp_mix_mode. It introduces a sophisticated data rebalancing strategy that dynamically redistributes input sequences across DP ranks. By ensuring a more uniform workload distribution before critical computational steps like attention and feed-forward networks, this change effectively addresses bottlenecks arising from imbalanced input data, leading to improved overall prefill efficiency.

Highlights

  • Dynamic Data Rebalancing for Prefill: Introduced a mechanism to dynamically rebalance input data (tokens, position IDs, etc.) across data-parallel (DP) ranks during the prefill phase. This mitigates performance degradation caused by uneven data distribution, giving each DP rank a more balanced workload.
  • Core Rebalancing Logic: Implemented prefill_dp_balance and the associated _all_to_all_balance_get and _all_to_all_unbalance_get methods in InferStateInfo. These methods use all_gather and all_to_all_single operations to redistribute and later restore data across DP ranks (see the sketch after this list).
  • Integration with Model Layers: The rebalancing logic is integrated into the tpsp_context_forward and tpsp_token_forward methods for Llama models, and KV cache allocation is adjusted. Qwen3-MoE also received updates to its FFN layer to support this balancing in both TP and EP modes.
  • New Configuration Option: Added a new command-line argument --enable_dp_prefill_balance to activate this feature. This option requires --enable_tpsp_mix_mode to be enabled and dp (data parallelism) to be greater than 1, ensuring it's used in appropriate distributed setups.
  • Dedicated Communication Group: A specific dp_prefill_balance_group has been added to the CustomProcessGroup to handle the communication required for this prefill balancing, ensuring efficient and isolated data transfers.
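
To make the mechanism concrete, below is a minimal sketch of the balancing pattern the highlights describe, assuming each rank holds a 1-D token tensor: all_gather the per-rank lengths, re-cut the global token sequence into even shares, and exchange the slices with all_to_all_single. The names balance_tokens and _ranges and the even-split policy are illustrative, not the PR's actual code.

import torch
import torch.distributed as dist

def _ranges(lens):
    # prefix-sum helper: [len0, len1, ...] -> [(start, end), ...] in global order
    starts, acc = [], 0
    for n in lens:
        starts.append(acc)
        acc += n
    return [(s, s + n) for s, n in zip(starts, lens)]

def balance_tokens(tokens, group=None):
    rank = dist.get_rank(group=group)
    world = dist.get_world_size(group=group)

    # 1. every rank learns how many tokens each rank currently holds
    local_len = torch.tensor([tokens.numel()], device=tokens.device)
    gathered = [torch.zeros_like(local_len) for _ in range(world)]
    dist.all_gather(gathered, local_len, group=group)
    lens = [int(t.item()) for t in gathered]

    # 2. view all tokens as one global sequence and re-cut it evenly
    total = sum(lens)
    base, rem = divmod(total, world)
    tgt_lens = [base + (1 if r < rem else 0) for r in range(world)]
    src, tgt = _ranges(lens), _ranges(tgt_lens)

    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))

    in_splits = [overlap(src[rank], tgt[j]) for j in range(world)]   # what I send to rank j
    out_splits = [overlap(src[j], tgt[rank]) for j in range(world)]  # what I receive from rank j

    # 3. exchange slices so every rank ends up with ~total/world tokens
    out = torch.empty(tgt_lens[rank], dtype=tokens.dtype, device=tokens.device)
    dist.all_to_all_single(out, tokens, output_split_sizes=out_splits,
                           input_split_sizes=in_splits, group=group)
    return out

Running the same exchange with in_splits and out_splits swapped restores the original per-rank layout, analogous to the role _all_to_all_unbalance_get plays in the PR.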

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a mechanism to balance the workload across data-parallel (DP) ranks during the prefill stage, specifically for the TP+SP mixed mode. This is beneficial when input sequences have highly variable lengths across DP ranks. The implementation involves redistributing input tensors using all_to_all communication. The core logic is added to InferStateInfo and is triggered from the pre_layer_infer modules. My review focuses on the performance implications of the chosen implementation strategy, particularly the frequent use of all_to_all communication within each transformer layer, and some minor maintainability improvements. While the load balancing concept is sound, the per-layer communication overhead might negate the benefits.

Comment on lines 237 to 239
if infer_state.is_prefill and get_env_start_args().enable_dp_prefill_balance:
q = infer_state._all_to_all_unbalance_get(data=q)
cache_kv = infer_state._all_to_all_unbalance_get(data=cache_kv)


Severity: high

The code performs an _all_to_all_unbalance_get before the attention calculation and an _all_to_all_balance_get after the attention calculation (in _tpsp_get_o). This introduces two all-to-all communication steps per layer, which can be a significant performance bottleneck. This seems to defeat the purpose of balancing the load for prefill, as the attention, a compute-heavy part, would run on unbalanced data. Please clarify the reasoning behind this design. If the attention kernel does not support the balanced data layout, this should be documented with a code comment explaining the limitation and the workaround.
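
If the limitation is indeed the attention kernel's expected data layout, the documenting comment the reviewer requests might read roughly as follows (the code lines are from the diff; the NOTE text is a hypothesis for the author to confirm or correct):

if infer_state.is_prefill and get_env_start_args().enable_dp_prefill_balance:
    # NOTE: the attention kernel below expects each rank's original
    # (unbalanced) token layout, so the prefill balance is undone here and
    # re-applied in _tpsp_get_o; this costs two all-to-all exchanges per
    # layer, and only the non-attention path runs on balanced data.
    q = infer_state._all_to_all_unbalance_get(data=q)
    cache_kv = infer_state._all_to_all_unbalance_get(data=cache_kv)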

…prefill inference performance degradation
"""
assert self.is_prefill
import torch.distributed as dist


Severity: medium

The import import torch.distributed as dist is local to the prefill_dp_balance method. It's better to move it to the top of the file for consistency and to avoid repeated import overhead. The same applies to other local imports in _all_to_all_balance_get and _all_to_all_unbalance_get.

self.dp_handle_lens = dp_handle_lens
# partition each dp's original input and the original input after redistribution
origin_datas = collections.deque()
for origin_dp_index, origin_dp_input_len in enumerate(dp_input_lens.numpy()):


Severity: medium

Calling .numpy() inside a loop can be inefficient as it may cause a GPU-to-CPU synchronization on each iteration. It's better to call it once before the loop and iterate over the resulting numpy array.
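
A hypothetical hoist that addresses this, reusing the names from the snippet (add a .cpu() call first if dp_input_lens lives on the GPU):

dp_input_lens_np = dp_input_lens.numpy()  # one conversion before the loop
for origin_dp_index, origin_dp_input_len in enumerate(dp_input_lens_np):
    ...  # loop body unchanged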



def normal_or_p_d_start(args):
from lightllm.server.core.objs.start_args_type import StartArgs


Severity: medium

The import from lightllm.server.core.objs.start_args_type import StartArgs is local to the function normal_or_p_d_start. It's better to move it to the top of the file for consistency and readability.

assert args.router_token_ratio == 0.0

if args.enable_dp_prefill_balance:
assert args.enable_tpsp_mix_mode and args.dp > 1, "need set --enable_tpsp_mix_mode firstly"


Severity: medium

The assertion message is a bit vague. A more descriptive message would be helpful for the user, explaining that --enable_dp_prefill_balance has dependencies.

Suggested change
- assert args.enable_tpsp_mix_mode and args.dp > 1, "need set --enable_tpsp_mix_mode firstly"
+ assert args.enable_tpsp_mix_mode and args.dp > 1, "--enable_dp_prefill_balance requires --enable_tpsp_mix_mode and --dp > 1"

if ranks 0,1,2,3 form one dp and ranks 4,5,6,7 form another dp, then
communication groups are built between [0,4], [1,5], [2,6], [3,7]
"""
from lightllm.utils.envs_utils import get_env_start_args


Severity: medium

The local import from lightllm.utils.envs_utils import get_env_start_args should be moved to the top of the file for better code organization and to follow standard Python conventions.
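
As an aside, the pairing topology described in the docstring above can be reconstructed with plain torch.distributed. The sketch below is hypothetical (build_dp_prefill_balance_groups is an illustrative name, and LightLLM's actual CustomProcessGroup wiring differs); with world_size=8 and dp_size=2 it creates exactly the groups [0,4], [1,5], [2,6], [3,7]:

import torch.distributed as dist

def build_dp_prefill_balance_groups(world_size, dp_size):
    ranks_per_dp = world_size // dp_size
    my_group = None
    # pair the ranks that share the same in-dp index across all dp replicas
    for local_index in range(ranks_per_dp):
        ranks = [dp * ranks_per_dp + local_index for dp in range(dp_size)]
        group = dist.new_group(ranks=ranks)  # collective: every process must call it
        if dist.get_rank() in ranks:
            my_group = group
    return my_group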

@shihaobai shihaobai merged commit 16a41d2 into main Oct 27, 2025
1 check passed
@shihaobai shihaobai deleted the wzj branch October 27, 2025 12:39
