
Conversation

@djmmoss (Collaborator) commented Oct 23, 2025

📌 Description

Enable JIT compilation for the FP8 DeepGEMM kernels. NVRTC is currently disabled; NVCC is used by default.

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Summary by CodeRabbit

  • Refactor
  • JIT include directory discovery now uses the flashinfer-python package instead of tensorrt_llm.
    • Updated resolved include path to the flashinfer data location.
    • Runtime compilation now consistently uses NVCC; the prior environment-variable toggle was removed.
    • Updated warning text when the expected package installation cannot be found.

@coderabbitai bot (Contributor) commented Oct 23, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

The pull request modifies the JIT compilation configuration for the TensorRT-LLM DeepGEMM module. Include directory discovery now uses the flashinfer-python package instead of tensorrt_llm, with updated path resolution. Additionally, the NVCC selection logic is simplified to always use NVCC instead of reading an environment variable.

Changes

  • JIT Include Path Discovery (csrc/nv_internal/tensorrt_llm/deep_gemm/compiler.cuh): Changes the package lookup from tensorrt_llm to flashinfer-python for JIT include directory resolution. Updates the include path from tensorrt_llm/include to flashinfer/data/csrc/nv_internal/tensorrt_llm.
  • JIT Compilation Behavior (csrc/nv_internal/tensorrt_llm/deep_gemm/runtime.cuh): Removes the environment-variable-based conditional from the kJitUseNvcc lambda; it now always returns true, enforcing NVCC usage regardless of the TRTLLM_DG_JIT_USE_NVCC setting. Comments note that the NVRTC switch is blocked by missing headers.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐰 Carrots of code, now aligned with flashinfer's bright light,
Include paths redirected, compilation paths set right,
NVCC stays true, no more switches to debate,
Deep GEMM springs forward—let the rabbits celebrate! 🥕✨

Pre-merge checks

✅ Passed checks (3 passed)

  • Description Check: ✅ Passed. The description includes the critical sections from the template: a clear Description explaining that JIT compilation is being enabled for FP8 DeepGEMM kernels with NVCC as the default, and a completed Pull Request Checklist with all pre-commit and test items marked complete. The "Related Issues" section from the template is absent, a minor omission that does not substantially detract from the description's completeness.
  • Docstring Coverage: ✅ Passed. No functions found in the changes; docstring coverage check skipped.
  • Title Check: ✅ Passed. The title "feat: enable deepgemm jit for fp8 block-scale on SM90" reflects the main change: the modifications to compiler.cuh and runtime.cuh directly enable JIT compilation for FP8 DeepGEMM kernels on SM90 GPUs. It names the kernel type (deepgemm jit), the data type (fp8), the scaling method (block-scale), and the target GPU generation (SM90).

@gemini-code-assist bot (Contributor) commented

Summary of Changes

Hello @djmmoss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates JIT compilation capabilities for FP8 DeepGEMM kernels into the FlashInfer project. The primary goal is to enhance performance by allowing dynamic compilation of these kernels. This is achieved by ensuring that NVCC is always used for JIT compilation and by correctly configuring the include paths to resolve dependencies within the flashinfer-python package, setting the stage for more optimized FP8 operations.

Highlights

  • JIT Compilation for FP8 DeepGEMM: This pull request enables Just-In-Time (JIT) compilation for FP8 DeepGEMM kernels, which is crucial for optimizing performance of these operations.
  • Forced NVCC Usage for JIT: NVRTC (NVIDIA Runtime Compilation) is currently disabled, and NVCC is now explicitly set to be used for all JIT compilation. A TODO comment notes that NVRTC requires specific headers like cuda_bf16.h which are currently missing.
  • Updated Include Path Resolution: The mechanism for locating necessary include directories for JIT compilation has been updated. It now correctly queries flashinfer-python instead of tensorrt_llm to find the internal tensorrt_llm include paths within the FlashInfer project structure.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request enables JIT compilation for FP8 DeepGEMM kernels, defaulting to NVCC. The changes correctly update package paths from tensorrt_llm to flashinfer-python to locate necessary headers. My review focuses on improving code clarity and maintainability. I've suggested updating a log message to reflect the package change and cleaning up commented-out code while preserving important context in a TODO comment.

Comment on lines 38 to 46
 static bool kJitUseNvcc = []() {
-  char const* env_var = getenv("TRTLLM_DG_JIT_USE_NVCC");
-  return env_var && (std::string(env_var) == "1" || std::string(env_var) == "true");
+  // char const* env_var = getenv("TRTLLM_DG_JIT_USE_NVCC");
+  // return env_var && (std::string(env_var) == "1" || std::string(env_var) == "true");
+  // always use nvcc
+  // TODO: Enable nvrtc -- need these headers:
+  // [TensorRT-LLM][INFO] Compilation log:
+  // kernel.cu(16): catastrophic error: cannot open source file "cuda_bf16.h"
+  return true;
 }();
@gemini-code-assist bot commented (severity: medium):

The kJitUseNvcc variable is now hardcoded to true, and the previous implementation that used an environment variable is commented out. While the TODO comment explains why NVRTC is currently disabled, leaving commented-out code can reduce readability and maintainability. It would be cleaner to remove the commented-out lines and refine the comments to keep only the essential context.

static bool kJitUseNvcc = []() {
  // NVRTC is currently disabled due to include issues. Using NVCC by default.
  // TODO: Enable NVRTC. It fails with errors like:
  // [TensorRT-LLM][INFO] Compilation log:
  // kernel.cu(16): catastrophic error: cannot open source file "cuda_bf16.h"
  return true;
}();

@djmmoss djmmoss changed the title feat: enable deepgemm jit for fp8 block-scale feat: enable deepgemm jit for fp8 block-scale on SM90 Oct 23, 2025
Signed-off-by: Duncan Moss <[email protected]>
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
csrc/nv_internal/tensorrt_llm/deep_gemm/compiler.cuh (2)

154-179: Update the outdated comment to reflect the new package name.

The comment at line 155 still references tensorrt_llm, but the code now uses flashinfer-python. Update the comment for consistency.

Apply this diff to update the comment:

       // Parse the location using regex
-      // `pip show tensorrt_llm` will output something like:
+      // `pip show flashinfer-python` will output something like:
       // Location: /usr/local/lib/python3.12/dist-packages
       // Editable project location: /code

180-182: Update the error message to reflect the new package dependency.

The error message references "TensorRT LLM installation" but the code now searches for flashinfer-python. Update the message for consistency.

Apply this diff to update the error message:

     } else {
-      TLLM_LOG_WARNING("Failed to find TensorRT LLM installation, DeepGEMM will be disabled.");
+      TLLM_LOG_WARNING("Failed to find flashinfer-python installation, DeepGEMM will be disabled.");
     }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0260ab3 and c9a1a1a.

📒 Files selected for processing (2)
  • csrc/nv_internal/tensorrt_llm/deep_gemm/compiler.cuh (2 hunks)
  • csrc/nv_internal/tensorrt_llm/deep_gemm/runtime.cuh (1 hunks)
🔇 Additional comments (1)
csrc/nv_internal/tensorrt_llm/deep_gemm/runtime.cuh (1)

38-46: LGTM! Appropriate simplification to force NVCC usage.

The hardcoded return true with commented-out environment variable logic is appropriate given the TODO note about missing headers for NVRTC. This ensures consistent behavior until NVRTC support is fully enabled.

 if (includeDirs.empty()) {
   // Command to execute
-  char const* cmd = "pip show tensorrt_llm 2>/dev/null";
+  char const* cmd = "pip show flashinfer-python 2>/dev/null";
Collaborator:

What's the purpose of this command?

@djmmoss (Collaborator, Author):

For the DeepGEMM JIT, it needs the header files in deep_gemm/; this command finds the installation path, which is then used further down to add deep_gemm/ to the -I include flags.
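For readers following along, here is a minimal sketch of that discovery flow. The helper names pipShowOutput and resolveDeepGemmIncludeDirs are illustrative, not the actual compiler.cuh code; the command, the Location: parsing, and the appended path come from the diffs in this PR, and popen is POSIX-only.

```cpp
#include <array>
#include <cstdio>
#include <regex>
#include <string>
#include <vector>

// Capture the stdout of `pip show flashinfer-python` (POSIX popen).
static std::string pipShowOutput() {
  std::string output;
  std::array<char, 256> buffer{};
  FILE* pipe = popen("pip show flashinfer-python 2>/dev/null", "r");
  if (!pipe) return output;
  while (fgets(buffer.data(), buffer.size(), pipe) != nullptr) {
    output += buffer.data();
  }
  pclose(pipe);
  return output;
}

// Pull the `Location:` line out of the pip output and build the directory
// that later becomes a -I flag on the NVCC command line.
static std::vector<std::string> resolveDeepGemmIncludeDirs() {
  std::vector<std::string> includeDirs;
  std::string const output = pipShowOutput();
  std::smatch match;
  // `pip show flashinfer-python` prints e.g.:
  //   Location: /usr/local/lib/python3.12/dist-packages
  if (std::regex_search(output, match, std::regex("Location: (.*)"))) {
    includeDirs.push_back(match[1].str() +
                          "/flashinfer/data/csrc/nv_internal/tensorrt_llm");
  }
  return includeDirs;
}
```

Each returned directory would then be passed to NVCC as -I<dir> when JIT-compiling the DeepGEMM kernel.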

Collaborator:

I tend to move this logic to Python; pip show flashinfer-python doesn't necessarily report the correct package information (e.g., at AOT time, when the package is not installed yet).

Collaborator:

Or we can obtain the include path from Python and pass the value to C++.
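A minimal sketch of that handoff, assuming a hypothetical FLASHINFER_DG_INCLUDE_DIR environment variable and collectIncludeDirs helper (neither is part of this PR): Python resolves the package data directory once and exports it, and the C++ side prefers the explicit value over shelling out to pip, which also addresses the AOT-time concern above.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical handoff: the Python side would set something like
//   os.environ["FLASHINFER_DG_INCLUDE_DIR"] = str(
//       importlib.resources.files("flashinfer")
//       / "data" / "csrc" / "nv_internal" / "tensorrt_llm")
// before the C++ JIT path runs.
static void collectIncludeDirs(std::vector<std::string>& includeDirs) {
  if (char const* dir = std::getenv("FLASHINFER_DG_INCLUDE_DIR")) {
    // Trust the path Python already resolved; no subprocess needed,
    // and it works even when `pip show` would fail.
    includeDirs.emplace_back(dir);
    return;
  }
  // ... fall back to the `pip show flashinfer-python` discovery above ...
}
```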

@djmmoss (Collaborator, Author):

I think this is where a refactor might be necessary; unfortunately, these deep_gemm kernels aren't captured as part of AOT.

   }
 } else {
-  TLLM_LOG_WARNING("Failed to find TensorRT LLM installation, DeepGEMM will be disabled.");
+  TLLM_LOG_WARNING("Failed to find FlashInfer installation, DeepGEMM will be disabled.");
Collaborator:

I guess we can safely assume flashinfer is installed if this function is called?

@yzh119 yzh119 merged commit bf03ad4 into flashinfer-ai:main Oct 26, 2025
4 checks passed