[None][feat] Integrate MnnvlThroughput into TRTLLM MoE. #8728

bobboli · 2025-10-28T11:54:34Z

Enable MnnvlThroughput for TrtllmGenMoE.
TrtllmGenMoE supports being provided the output tensor by user. Currently only modified w4a8_mxfp4_mxfp8 for gpt-oss.
Decouple max_num_tokens and runtime_max_num_tokens_per_rank for MnnvlThroughput. Due to the changes, integration inside CutlassMoE is also adapted.

Summary by CodeRabbit

Release Notes

New Features
- Added new "mnnvlthroughput" backend option for MoE all-to-all operations with improved metadata-driven layout management.
- Introduced metainfo-based initialization for workspace and offset tracking, enabling more flexible auxiliary data organization.
Refactor
- Reorganized MoE dispatch and combine operation parameters to use runtime token counts and centralized metadata structures.
- Updated internal metadata indexing and offset calculations for improved scalability and maintainability.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Signed-off-by: Bo Li <[email protected]>

coderabbitai · 2025-10-31T08:29:33Z

📝 Walkthrough

Walkthrough

This refactoring restructures the MoE all-to-all communication system around a unified workspace model with metadata-driven offsets. The namespace is renamed to MnnvlThroughput, parameter structs are reorganized to consolidate auxiliary data and distributed counters, and Python bindings are updated to support runtime-configurable token counts and metainfo-based initialization. A new MoeAlltoAll class manages state and workspace lifecycle, with support for backend selection via environment variables.

Changes

Cohort / File(s)	Summary
C++ Kernel Interfaces `cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu`, `cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h`	Namespace renamed from `tensorrt_llm::kernels::moe_a2a` to `tensorrt_llm::kernels::MnnvlThroughput`. Structs `MoeA2ADispatchParams` and `MoeA2ACombineParams` reorganized with new fields for distributed counters, routing metadata (`topk_target_ranks`, `topk_send_indices`), CUDA streams, and unified buffer layout. Function signatures unchanged but now under new namespace.
Meta-info and Offset Management `cpp/tensorrt_llm/thop/moeAlltoAllMeta.h`	Added `<array>` include and nested `MnnvlThroughput` namespace. Enum `MoeA2AMetaInfoIndex` now has explicit `int64_t` underlying type; added `TOPK_TARGET_RANKS_OFFSET_INDEX` and `TOPK_SEND_INDICES_OFFSET_INDEX` entries; `NUM_METAINFO_FIELDS` increased from 7 to 9. Introduced `MoeA2ADataOffsets` type alias as `std::array<int64_t, NUM_METAINFO_FIELDS>`.
PyTorch Operation Implementation `cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp`	Added `moeA2AInitializeOp` for workspace initialization and metainfo generation. Updated `calculateOffsets` to return `MoeA2ADataOffsets`. Modified signatures of `moeA2ADispatchOp`, `moeA2ACombineOp`, `moeA2ASanitizeExpertIdsOp`, and `moeA2AGetCombinePayloadTensorOp` to accept metainfo and use runtime-derived token counts. Kernel parameter population now derives pointers from rank-specific workspace regions. Updated TORCH_LIBRARY bindings.
Python Bindings `cpp/tensorrt_llm/nanobind/thop/bindings.cpp`, `cpp/tensorrt_llm/pybind/thop/bindings.cpp`	Updated MoE A2A constants export to call `torch_ext::MnnvlThroughput::getMoeA2AMetaInfoIndexPairs()` instead of `torch_ext::getMoeA2AMetaInfoIndexPairs()`.
Python Runtime Core `tensorrt_llm/_torch/distributed/moe_alltoall.py`	Introduced `_A2AState` container for operation state tracking. Replaced per-field offset constants with `_METAINFO_INDEX` dictionary populated from C++ offsets. Renamed `max_num_tokens_per_rank` to `max_num_tokens` throughout. Added `MnnvlMemory.initialize()` invocation and unified workspace/metainfo storage in `_WORKSPACE`. Updated `dispatch`, `combine`, and `get_combine_payload_tensor_in_workspace` signatures to accept `runtime_max_tokens_per_rank` and use metainfo-derived offsets.
MoE Module Integration `tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py`	Updated `max_num_token` references to use per-rank `max_tokens_per_rank` in alltoall preparation calls and tensor sizing for result combination. Runtime path derives `runtime_max_tokens_per_rank` from input shape.
MoE Fused Module Configuration `tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py`	Added `moe_alltoall_backend` cached_property (reads `TRTLLM_MOE_ALLTOALL_BACKEND`; default "mnnvllatency"). Added `MoeAlltoAll` import. Constructor branches on backend: "mnnvllatency" retains existing `MnnvlMemory`/`MnnvlMoe` initialization; "mnnvlthroughput" initializes `MoeAlltoAll` instance with workspace size from `TRTLLM_MOE_A2A_WORKSPACE_MB` environment variable. Forward pass includes backend-specific dispatch/combine logic with appropriate parameter transformation for each.
MoE A2A Tests `tests/unittest/_torch/multi_gpu/test_moe_a2a.py`	Replaced per-rank max token assumption with global `max_num_tokens` derived as `max(all_num_tokens)`. Updated `dispatch` calls to accept and use `runtime_max_tokens_per_rank`. Changed return handling from `recv_buffers` to `recv_tensors` with metainfo-based offset lookups for counters and routing metadata. Verification logic updated to validate new tensor-based data structures and extended shape checks for metainfo-derived offsets.

Sequence Diagram(s)

sequenceDiagram
    participant Python as Python Runtime
    participant Init as Initialize Op
    participant Dispatch as Dispatch Op
    participant Combine as Combine Op
    participant Kernel as CUDA Kernels

    Python->>Init: moeA2AInitializeOp(workspace, epRank, epSize, maxNumTokens)
    Init->>Init: calculateOffsets(epSize, maxNumTokens)
    Init->>Kernel: Write offsets to metainfo tensor
    Init-->>Python: metainfo tensor

    Python->>Dispatch: dispatch(tokens, payloads, workspace, metainfo, runtime_max_tokens_per_rank, ...)
    Dispatch->>Dispatch: Parse metainfo offsets
    Dispatch->>Dispatch: Populate MoeA2ADispatchParams from workspace regions
    Dispatch->>Kernel: Launch kernel with derived pointers
    Kernel->>Kernel: Process routing & all-to-all
    Kernel-->>Dispatch: Update counters/flags in workspace
    Dispatch-->>Python: recv_tensors, combine_payload_offset

    Python->>Combine: combine(payload, runtime_max_tokens_per_rank, workspace, metainfo, ...)
    Combine->>Combine: Parse metainfo offsets
    Combine->>Combine: Populate MoeA2ACombineParams from workspace regions
    Combine->>Kernel: Launch combine kernel
    Kernel-->>Combine: Combined output
    Combine-->>Python: result tensor

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:

Workspace layout and offset calculations (cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp): Verify that offset calculations are correct for all new metainfo fields and that pointer derivation for each rank matches the unified workspace layout.
Metainfo indexing consistency (cpp/tensorrt_llm/thop/moeAlltoAllMeta.h and usage sites): Ensure enum indices match across C++ and Python layers; audit all _METAINFO_INDEX lookups in Python.
Backend branching logic (tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py): Verify that parameter transformation and tensor reshaping differ correctly between "mnnvllatency" and "mnnvlthroughput" paths; check environment variable handling and fallback behavior.
Kernel parameter population (cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp): Confirm that all new fields (flag_val, send_counters, topk_target_ranks, etc.) are correctly extracted from workspace and passed to kernel structs.
Python state management (tensorrt_llm/_torch/distributed/moe_alltoall.py): Validate _A2AState transitions and that metainfo is correctly propagated across dispatch/combine phases.
Test verification coverage (tests/unittest/_torch/multi_gpu/test_moe_a2a.py): Check that all new tensor-based counters and routing metadata are correctly validated with correct shape expectations.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 63.64% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Description Check	⚠️ Warning	The pull request description is largely incomplete and does not adequately follow the repository's PR template structure. While the author provided three bullet points at the beginning summarizing the changes (enabling MnnvlThroughput for TrtllmGenMoE, supporting user-provided output tensors, and decoupling max_num_tokens and runtime_max_num_tokens_per_rank), the required template sections are either missing or contain only placeholder comments. The "Description" section lacks any explanation of the issue and solution, the "Test Coverage" section is entirely empty with no list of relevant tests, and the PR title is absent in the required format of `[JIRA/NVBugs/GitHub/None][type] Summary`. Additionally, the PR Checklist items are not addressed or checked.	The author should complete the PR description by filling out all required sections: add a proper PR title following the format `[ticket_id][type] summary`, provide a detailed description explaining the issue and the solution beyond the bullet points, list the relevant test cases that validate these changes (such as tests in test_moe_a2a.py based on the changes summary), and address or check off items from the PR Checklist. This will ensure reviewers have adequate context for understanding the scope and impact of these significant structural changes to the MoE all-to-all communication system.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The PR title "[feat] Integrate MnnvlThroughput into TRTLLM MoE" accurately reflects the primary objective and scope of the changes. The changeset consistently demonstrates this integration across multiple layers: namespace reorganization from moe_a2a to MnnvlThroughput, introduction of a new MoeAlltoAll Python class, addition of backend selection logic in fused_moe_trtllm_gen.py, and updates to C++ kernel parameters and function signatures. The title is clear, concise, and specific enough that a teammate scanning the commit history would immediately understand that a new throughput-focused all-to-all implementation is being integrated into the existing MoE system.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/unittest/_torch/multi_gpu/test_moe_a2a.py (1)

518-553: Align invalid-expert sentinel in tests with runtime contract.

The runtime now needs invalid_expert_id == num_experts (see Line 424 in the module), but the test still injects and asserts -1 (Lines 518, 553, 596, 667). Once the dispatch fix lands, these expectations will flip the tests red and mask regressions. Please derive the sentinel from the same value (ep_size * num_experts_per_rank) and update all related assertions/fixtures (e.g., Lines 466-467) so the test exercises the real contract.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ae57738 and e92c05e.

📒 Files selected for processing (10)

cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (3 hunks)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h (5 hunks)
cpp/tensorrt_llm/nanobind/thop/bindings.cpp (1 hunks)
cpp/tensorrt_llm/pybind/thop/bindings.cpp (1 hunks)
cpp/tensorrt_llm/thop/moeAlltoAllMeta.h (2 hunks)
cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp (9 hunks)
tensorrt_llm/_torch/distributed/moe_alltoall.py (5 hunks)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (5 hunks)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (6 hunks)
tests/unittest/_torch/multi_gpu/test_moe_a2a.py (16 hunks)

🧰 Additional context used

📓 Path-based instructions (8)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}