
Conversation

@amirkl94
Contributor

@amirkl94 amirkl94 commented Oct 23, 2025

📌 Description

Fixed a few compilation issues for L40, and removed one GEMM tactic for sm == 89 that crashes with:

Assertion failed: GPU lacks the shared memory resources to run GroupedGEMM kernel

🧪 Tests

Ran pytest tests/moe/test_trtllm_cutlass_fused_moe.py manually on an L40 GPU and verified all tests passed.

Summary by CodeRabbit

  • New Features

    • Added explicit support for SM89 GPU backend and build flags, plus a public generation path to target it.
    • FP8 execution enabled on SM89 where supported.
  • Bug Fixes / Compatibility

    • FP4 quantized weights are now disallowed on architectures below SM90 to prevent unsupported runtime paths.
  • Performance

    • Streamlined dispatch to prefer SM89-optimized kernels when applicable, improving inference dispatch and kernel selection.

@gemini-code-assist
Contributor

Summary of Changes

Hello @amirkl94, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for L40 GPUs (SM89 architecture) within the CUTLASS FusedMoE kernels. It resolves critical compilation issues and a runtime crash related to shared memory allocation for a specific GEMM tactic on SM89, ensuring stable and efficient operation. The changes also include updates to the Python JIT compilation infrastructure to enable proper module generation and loading for L40.

Highlights

  • L40 FusedMoE Support: Dedicated support has been added for NVIDIA L40 GPUs (SM89 architecture) within the CUTLASS FusedMoE path, enabling optimized performance for these devices.
  • GEMM Tactic Refinement: A specific GroupedGEMM kernel tactic that caused shared memory assertion failures on SM89 (L40) has been identified and removed from the candidate tile selection, enhancing stability.
  • Compilation and FP Handling Fixes: Several compilation issues related to the handling of FP4 and FP8 data types across different Streaming Multiprocessor (SM) architectures have been resolved, improving code robustness.
  • Python API Extension: The Python API has been extended to allow explicit selection and loading of SM89-specific FusedMoE modules, integrating the new L40 support into the existing framework.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Code Review (/gemini review): Performs a code review for the current pull request in its current state.
Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help (/gemini help): Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Contributor

coderabbitai bot commented Oct 23, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Add explicit SM89 handling across CUTLASS kernel heuristics, MOE GEMM dispatch, and Flashinfer JIT: new SM89 NVCC flags, SM89-specific JIT module generator, tightened FP4 guards, and adjusted FP8/FP4 dispatch behavior for SM80–89 paths.

Changes

Cohort / File(s) Summary
CUTLASS Heuristic
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
Split the combined `sm == 89 || sm >= 120` handling into separate branches, so SM89 uses its own candidate tile list that drops `CtaShape16x256x128_WarpShape16x64x128` (which exceeds the L40's GroupedGEMM shared-memory budget), while SM >= 120 keeps it.
MOE GEMM Dispatch Logic
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
For the SM80–89 dispatch path, FP4 is now explicitly disallowed on SM < 90 (runtime error), FP8 consistently routes to the SM89 implementation, and the Sm80 fallback is retained for non-FP8/FP4 cases.
MOE Specialization Guards
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h
Added ENABLE_FP4 requirement to isValidSM120MOESpecialisation and isValidBlackwellMOESpecialisation guards, so they evaluate true only when CUTLASS arch support macros AND ENABLE_FP4 are defined.
Flashinfer: fused_moe Backend Dispatch
flashinfer/fused_moe/core.py
Added support for backend identifier "89": import of gen_cutlass_fused_moe_sm89_module and dispatch to its build_and_load() when backend == "89".
Flashinfer JIT NVCC Flags
flashinfer/jit/core.py
Added module-level constant sm89_nvcc_flags including compute_89 / sm_89 codegen and FP8 E8M0 feature enablement.
Flashinfer SM89 JIT Module Generator
flashinfer/jit/fused_moe.py
Added gen_cutlass_fused_moe_sm89_module(use_fast_build: bool = False) -> JitSpec, which composes NVCC flags from sm89_nvcc_flags plus BF16/FP8 (and optional FP8_BLOCK_SCALE) and delegates to generic generator with device_arch="89"; added sm89_nvcc_flags import.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant PythonCore as flashinfer/fused_moe/core.py
    participant JITFactory as fused_moe.gen_cutlass_fused_moe_sm89_module
    participant JITCore as flashinfer/jit/core.py
    participant CppDispatch as moe_gemm_template_dispatch.h
    participant CUTLASS as CUTLASS Kernels

    User->>PythonCore: get_cutlass_fused_moe_module(backend="89")
    PythonCore->>JITFactory: build_and_load()
    JITFactory->>JITCore: use sm89_nvcc_flags + BF16/FP8 (+FP8_BLOCK_SCALE?)
    JITFactory->>PythonCore: return JitSpec (SM89)
    PythonCore-->>User: compiled module

    Note over CppDispatch,CUTLASS: SM 80–89 dispatch path
    alt FP8 or FP4 detected
        CppDispatch->>CppDispatch: if sm == 89 (new branch)
        CppDispatch->>CUTLASS: dispatch to SM89 implementation
    else No FP8/FP4
        CppDispatch->>CUTLASS: use Sm80 fallback
    end
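
To make the SM 80–89 branch in the diagram concrete, here is a rough, self-contained C++ sketch of the dispatch behavior it describes. The type and function names are placeholders standing in for the real ones in moe_gemm_template_dispatch.h (e.g. dispatchMoeGemmToCutlass, __nv_fp8_e4m3, __nv_fp4_e2m1); the actual signatures and error handling differ.

#include <stdexcept>
#include <type_traits>

namespace arch { struct Sm80 {}; struct Sm89 {}; }  // stand-ins for cutlass::arch tags
struct fp8_e4m3 {};                                  // stand-in for __nv_fp8_e4m3
struct fp4_e2m1 {};                                  // stand-in for __nv_fp4_e2m1

template <typename T, typename WeightType, typename Arch>
void dispatchMoeGemmToCutlass(/* inputs, multi_processor_count */) {
  // The real function instantiates the CUTLASS grouped-GEMM kernels for Arch.
}

template <typename T, typename WeightType>
void dispatchSm80To89() {
  if constexpr (std::is_same_v<WeightType, fp4_e2m1>) {
    // FP4 quantized weights are only supported from SM90 onward.
    throw std::runtime_error("FP4 is not supported on architectures below SM90");
  } else if constexpr (std::is_same_v<T, fp8_e4m3>) {
    dispatchMoeGemmToCutlass<T, WeightType, arch::Sm89>();  // Ada-specific FP8 kernels
  } else {
    dispatchMoeGemmToCutlass<T, WeightType, arch::Sm80>();  // Ampere kernels also run on SM89
  }
}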

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • joker-eph
  • aleozlx
  • yongwww
  • yzh119
  • cyx-6
  • wenscarl

Poem

🐰 I hopped through flags and dispatch trees so fine,
SM89 now sings in CUTLASS line by line.
FP8 found a home, FP4 keeps its gate,
JIT builds compile, the rabbit celebrates—
A tiny hop for kernels, a joyful nibble of design.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Description Check (⚠️ Warning): The pull request description is incomplete relative to the provided template. It includes a Description section about fixing L40 compilation issues and removing a problematic GEMM tactic, and it mentions test verification, but it is missing the Related Issues section, the Pre-commit Checks checklist, and checkbox formatting for the Tests section. Resolution: follow the template by adding a Related Issues section (link any associated issues or note that none exist), the full Pre-commit Checks section with its three checkbox items, and a Tests section with checkboxes showing which tests were run and their status.
Docstring Coverage (⚠️ Warning): Docstring coverage is 12.50%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Title Check (✅ Passed): The title "Feature: Add support for L40 FusedMoE in cutlass path" is specific, clear, and directly describes the main objective of the pull request. The supporting work (fixing compilation issues and removing a problematic GEMM tactic for sm == 89) consists of implementation details behind that feature, so the title remains accurate, concise, and informative for a developer scanning the repository history.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for L40 GPUs (sm_89) in the CUTLASS FusedMoE path. The changes include fixing compilation issues, removing a problematic GEMM tactic for sm_89 that was causing crashes, and adding the necessary build configurations for this architecture. The changes are logical and well-implemented. I have one suggestion to improve code clarity when constructing compiler flags.

Comment on lines +81 to +86
    nvcc_flags = sm89_nvcc_flags + [
        "-DENABLE_BF16",
        "-DENABLE_FP8",
        "-DENABLE_FP8_BLOCK_SCALE" if is_cuda_version_at_least("12.8") else "",
        "-DUSING_OSS_CUTLASS_MOE_GEMM",
    ]
Contributor


medium

Conditionally adding an empty string to the list of compiler flags can be slightly confusing and less clean. It's better to conditionally append the flag to the list to avoid adding empty elements. This improves code readability and maintainability.

Suggested change
    nvcc_flags = sm89_nvcc_flags + [
        "-DENABLE_BF16",
        "-DENABLE_FP8",
        "-DENABLE_FP8_BLOCK_SCALE" if is_cuda_version_at_least("12.8") else "",
        "-DUSING_OSS_CUTLASS_MOE_GEMM",
    ]
    nvcc_flags = sm89_nvcc_flags + [
        "-DENABLE_BF16",
        "-DENABLE_FP8",
        "-DUSING_OSS_CUTLASS_MOE_GEMM",
    ]
    if is_cuda_version_at_least("12.8"):
        nvcc_flags.append("-DENABLE_FP8_BLOCK_SCALE")

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h (1)

42-42: Fix misleading comment about SM120 vs SM100.

The comment mentions CUTLASS_ARCH_MMA_SM100_SUPPORTED but this function checks for CUTLASS_ARCH_MMA_SM120_SUPPORTED (line 35). Update the comment to accurately reflect the macro being checked.

Apply this diff:

-  return false;  // CUTLASS_ARCH_MMA_SM100_SUPPORTED is set when Blackwell kernels are enabled
+  return false;  // CUTLASS_ARCH_MMA_SM120_SUPPORTED is set when Blackwell kernels are enabled
🧹 Nitpick comments (2)
flashinfer/jit/fused_moe.py (1)

80-88: SM89 module generation is correctly implemented.

The function appropriately:

  • Uses sm89_nvcc_flags which excludes FP4 support for L40
  • Omits Hopper-specific TMA GEMM flags (correct for Ada architecture)
  • Includes conditional FP8 block scale support for CUDA ≥12.8

Optional: Consider iterable unpacking for cleaner syntax.

As suggested by Ruff, you could use iterable unpacking instead of concatenation:

-    nvcc_flags = sm89_nvcc_flags + [
+    nvcc_flags = [
+        *sm89_nvcc_flags,
         "-DENABLE_BF16",
         "-DENABLE_FP8",
         "-DENABLE_FP8_BLOCK_SCALE" if is_cuda_version_at_least("12.8") else "",
         "-DUSING_OSS_CUTLASS_MOE_GEMM",
     ]

This is a minor style improvement and can be deferred.

csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)

696-709: LGTM: Correctly reorganizes SM89 dispatch to avoid shared memory issues.

The reorganized control flow properly addresses the L40 (SM89) issue by:

  1. Routing FP8 workloads to Sm89 kernels with runtime validation (line 703)
  2. Routing non-FP8 workloads to Sm80 kernels (lines 707-708)

This aligns with the kernel implementation in moe_cutlass_kernel.h which shows SM89 architecture reusing Sm80 kernels for non-FP8 types, and prevents the "GPU lacks the shared memory resources" assertion mentioned in the PR objectives.

Optional suggestion: Consider adding a brief comment explaining why non-FP8 on SM89 uses the Sm80 path, to help future maintainers understand the shared memory constraint that motivated this design.

Apply this diff to add a clarifying comment:

     } else {
+      // Non-FP8 workloads on SM89 (L40) reuse Sm80 kernels to avoid
+      // Sm89-specific tactics that exceed shared memory limits
       dispatchMoeGemmToCutlass<T, WeightType, ScaleBiasType, cutlass::arch::Sm80, EpilogueTag>(
           inputs, multi_processor_count_);
     }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 739df61 and d490c3f.

📒 Files selected for processing (6)
  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h (2 hunks)
  • flashinfer/fused_moe/core.py (2 hunks)
  • flashinfer/jit/core.py (1 hunks)
  • flashinfer/jit/fused_moe.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
flashinfer/fused_moe/core.py (2)
flashinfer/jit/fused_moe.py (1)
  • gen_cutlass_fused_moe_sm89_module (80-87)
flashinfer/jit/core.py (1)
  • build_and_load (272-284)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (2)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_type_conversion.h (6)
  • __nv_fp8_e5m2 (91-93)
  • cutlass (114-116)
  • cutlass (120-122)
  • cutlass (127-129)
  • cutlass (132-134)
  • cutlass (140-142)
csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/moe_cutlass_kernel.h (1)
  • cutlass (40-677)
flashinfer/jit/fused_moe.py (2)
flashinfer/jit/core.py (2)
  • JitSpec (185-284)
  • gen_jit_spec (287-353)
flashinfer/jit/cpp_ext.py (1)
  • is_cuda_version_at_least (86-87)
🪛 Ruff (0.14.1)
flashinfer/jit/fused_moe.py

81-86: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

🔇 Additional comments (6)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)

161-168: LGTM! Targeted fix for L40 shared memory constraints.

The separation of SM89 from SM >= 120 handling correctly removes CtaShape16x256x128_WarpShape16x64x128 for L40 GPUs. This config's larger K dimension (128 vs 64) would exceed L40's shared memory capacity in GROUPED_GEMM mode, causing the assertion failure mentioned in the PR. The fix is minimal, well-scoped, and maintains full functionality for newer architectures.
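
For readers who want to see the shape of this change without opening the file, here is an illustrative, self-contained sketch of the split. Only the tile configs quoted elsewhere in this review are listed, and the real tables in cutlass_heuristic.cpp contain additional entries and a different surrounding structure.

#include <vector>

enum class CutlassTileConfig {
  CtaShape16x256x128_WarpShape16x64x128,
  CtaShape64x64x128_WarpShape32x64x64,
  CtaShape128x64x64_WarpShape64x32x64,
  CtaShape128x256x64_WarpShape64x64x64,
  CtaShape256x128x64_WarpShape64x64x64,
};

std::vector<CutlassTileConfig> candidateGroupedGemmTiles(int sm) {
  // Tiles shared by SM89 and SM >= 120.
  std::vector<CutlassTileConfig> tiles = {
      CutlassTileConfig::CtaShape64x64x128_WarpShape32x64x64,
      CutlassTileConfig::CtaShape128x64x64_WarpShape64x32x64,
      CutlassTileConfig::CtaShape128x256x64_WarpShape64x64x64,
      CutlassTileConfig::CtaShape256x128x64_WarpShape64x64x64,
  };
  if (sm >= 120) {
    // The 16x256x128 tile needs more shared memory than the L40 (SM89) provides
    // for GroupedGEMM, so it is only offered on newer architectures.
    tiles.push_back(CutlassTileConfig::CtaShape16x256x128_WarpShape16x64x128);
  }
  return tiles;
}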

flashinfer/jit/core.py (1)

93-96: SM89 flags correctly omit FP4 support.

The implementation appropriately excludes -DFLASHINFER_ENABLE_FP4_E2M1 for SM89 (L40), unlike other architectures that use common_nvcc_flags. This aligns with the PR objective of fixing compilation issues for L40 by removing problematic GEMM tactics.

flashinfer/fused_moe/core.py (2)

40-40: LGTM!

Import added correctly to support SM89 module generation.


289-290: LGTM!

Backend dispatch for SM89 follows the established pattern and correctly invokes the new module generation function.

flashinfer/jit/fused_moe.py (1)

21-27: LGTM!

Import correctly updated to include sm89_nvcc_flags needed for the new SM89 module generation function.

csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)

692-694: LGTM: Proper FP4 validation for SM89.

The explicit compile-time check correctly prevents FP4 usage on SM < 90, consistent with the earlier validation for SM75-80 at lines 685-688.

Signed-off-by: Amir Klein <[email protected]>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d490c3f and 50a1bb0.

📒 Files selected for processing (1)
  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs
🔇 Additional comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h (1)

35-36: LGTM: Guard is appropriate for FP4-only SM120 specialization.

The ENABLE_FP4 guard is correct here. Since isValidSM120MOESpecialisation only supports __nv_fp4_e2m1 types (lines 37-40), requiring ENABLE_FP4 at the preprocessor level prevents compilation errors when FP4 support is not available.
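
As a minimal illustration of the approved guard pattern (placeholder types, not the library's actual definitions): both the SM120 arch macro and ENABLE_FP4 must be defined before the FP4-only specialisation can report itself as valid.

#include <type_traits>

struct fp4_e2m1 {};  // stand-in for __nv_fp4_e2m1

template <typename T, typename WeightType>
constexpr bool isValidSM120MOESpecialisation() {
#if defined(CUTLASS_ARCH_MMA_SM120_SUPPORTED) && defined(ENABLE_FP4)
  // FP4-only specialisation: both activation and weight types must be FP4.
  return std::is_same_v<T, fp4_e2m1> && std::is_same_v<WeightType, fp4_e2m1>;
#else
  return false;  // Without SM120 kernels and FP4 support there is nothing to specialise.
#endif
}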

Comment on lines +51 to +52
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) && \
    defined(ENABLE_FP4)  // TODO Is there a better choice
Contributor


⚠️ Potential issue | 🟠 Major

Overly restrictive guard blocks non-FP4 Blackwell specializations.

Requiring ENABLE_FP4 at the top level prevents ALL Blackwell MOE specializations from being valid when FP4 support is not compiled, including non-FP4 cases like FP8/FP8 or BF16/BF16 that should work regardless. Line 53 explicitly supports cutlass::platform::is_same<T, WeightType>::value, which matches any same-type pair.

The Hopper implementation (lines 68-89) uses a better pattern: conditional compilation guards only the FP4-specific parts inside the return statement, allowing non-FP4 specializations to work without ENABLE_FP4.

Apply this diff to follow the Hopper pattern:

-#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) && \
-    defined(ENABLE_FP4)  // TODO Is there a better choice
+#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
   return (cutlass::platform::is_same<T, WeightType>::value ||
+#ifdef ENABLE_FP4
           (cutlass::platform::is_same<T, __nv_fp8_e4m3>::value &&
-           cutlass::platform::is_same<WeightType, __nv_fp4_e2m1>::value)) &&
+           cutlass::platform::is_same<WeightType, __nv_fp4_e2m1>::value) ||
+#else
+          false ||
+#endif
+          false) &&
          cutlass::platform::is_same<EpilogueTag, cutlass_extensions::EpilogueOpDefault>::value &&
          Fusion == TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::NONE;

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h
around lines 51-52, the current top-level guard requires ENABLE_FP4 and blocks
all Blackwell (SM100) MOE specializations when FP4 is not enabled; change this
to only guard FP4-specific logic: keep the CUTLASS_ARCH_MMA_SM100_SUPPORTED
check on the specialization but remove ENABLE_FP4 from the #if, then wrap only
the FP4-specific condition(s) inside the specialization (the return expression)
with #if defined(ENABLE_FP4) / #endif as the Hopper implementation does (lines
~68-89), ensuring the same-type WeightType check remains unguarded so non-FP4
types (FP8/BF16/etc.) compile normally.


@djns99 djns99 left a comment


Looks good to me; mainly I just want to understand why we need to disable the tile shape.
CodeRabbit's comment about a more granular FP4 guard might make sense, but I assume that if we are compiling with Blackwell support we should also have FP4.

CutlassTileConfig::CtaShape64x64x128_WarpShape32x64x64,
CutlassTileConfig::CtaShape128x64x64_WarpShape64x32x64,
CutlassTileConfig::CtaShape128x256x64_WarpShape64x64x64,
CutlassTileConfig::CtaShape256x128x64_WarpShape64x64x64};


Are there any SM89 GPUs that can support CtaShape16x256x128_WarpShape16x64x128, or is this an SM120 addition?
Having a small M value like this helps with low-latency cases, so I'd want to understand why it's not supported before disabling it.



At the very least, can you leave a comment explaining what the difference between the two lists is, so people don't have to compare the items manually?

@yzh119
Collaborator

yzh119 commented Oct 24, 2025

There are some compilation errors on CI, @amirkl94 would you mind taking a look?
https://ci.tlcpack.ai/blue/organizations/jenkins/flashinfer-ci/detail/PR-1973/4/pipeline
