
feat(kernels): gate MoE third-party substrates #474

Merged: xiaguan merged 1 commit into main from feat/moe-third-party-gate on Jun 30, 2026

xiaguan (Collaborator) commented Jun 30, 2026

Summary

  • add DeepGEMM and FlashMLA as kernel submodules and introduce the shared moe feature for DeepEP/DeepGEMM/FlashMLA substrates
  • add the narrow glm52 kernel surface for DeepGEMM scale layout/grouped FP8 contracts and FlashMLA SM90 sparse decode wrappers
  • update kernel docs and keep non-GLM52/router/indexer/PP/TRTLLM surfaces out until the model crate has stable callers
  • align bench_serving with the current Qwen3 engine signature so the existing pre-commit clippy gate passes

Verification

  • cargo fmt
  • OPENINFER_CUDA_SM=90 cargo check -p openinfer-kernels --no-default-features
  • OPENINFER_CUDA_SM=90 OPENINFER_NCCL_ROOT=/data/code/workspace-rustllm/ep-moe-demo/.venv/lib/python3.12/site-packages/nvidia/nccl cargo check -p openinfer-kernels --no-default-features --features moe
  • OPENINFER_CUDA_SM=90 OPENINFER_NCCL_ROOT=/data/code/workspace-rustllm/ep-moe-demo/.venv/lib/python3.12/site-packages/nvidia/nccl cargo check -p openinfer-kernels --no-default-features --features glm52
  • OPENINFER_CUDA_SM=90 OPENINFER_NCCL_ROOT=/data/code/workspace-rustllm/ep-moe-demo/.venv/lib/python3.12/site-packages/nvidia/nccl cargo check -p openinfer-kernels --no-default-features --features kimi-k2
  • OPENINFER_CUDA_SM=90 cargo check --release -p openinfer-server --bin bench_serving
  • pre-commit hooks via OPENINFER_CUDA_SM=90 OPENINFER_NCCL_ROOT=/data/code/workspace-rustllm/ep-moe-demo/.venv/lib/python3.12/site-packages/nvidia/nccl git commit ...

Notes

  • GLM5.2 FlashMLA sparse decode is SM90-only, pins V32 topk to 2048, and intentionally does not expose dynamic topk_length.
  • The grouped DeepGEMM compute entry is fail-closed with CUDA_ERROR_NOT_SUPPORTED until a real DeepGEMM runner is wired.
  • Upstream FlashMLA CHECK_CUDA / FLASH_ASSERT paths can still terminate the process on internal CUDA/launch failures; this PR only converts C++ assertion exceptions at the FFI boundary into CUresult.
  • No performance win is claimed; this is feature-gated substrate/API bring-up, so no A/B benchmark is attached.
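
For readers unfamiliar with the fail-closed pattern in the notes above, here is a minimal Rust sketch of what such an entry point could look like. It is not the code in this PR: `GroupedFp8Args`, `grouped_fp8_gemm`, and `check_sparse_decode_topk` are hypothetical names, the argument validation is an assumption, and only the numeric values of the `CUresult` codes come from the CUDA driver API.

```rust
/// CUDA driver result code, mirrored as a plain integer for this sketch.
pub type CuResult = u32;

// Values as defined by the CUDA driver API `CUresult` enum.
pub const CUDA_SUCCESS: CuResult = 0;
pub const CUDA_ERROR_INVALID_VALUE: CuResult = 1;
pub const CUDA_ERROR_NOT_SUPPORTED: CuResult = 801;

/// Top-k pinned by the sparse decode wrapper, per the note above.
pub const SPARSE_DECODE_TOPK: usize = 2048;

/// Hypothetical argument bundle for a grouped FP8 GEMM call.
pub struct GroupedFp8Args {
    pub num_groups: usize,
    pub m: usize,
    pub n: usize,
    pub k: usize,
}

/// Hypothetical fail-closed entry: the contract is checked, but no kernel runs.
pub fn grouped_fp8_gemm(args: &GroupedFp8Args) -> CuResult {
    // Assumed validation step: reject degenerate shapes first.
    if args.num_groups == 0 || args.m == 0 || args.n == 0 || args.k == 0 {
        return CUDA_ERROR_INVALID_VALUE;
    }
    // No runner is wired yet, so refuse explicitly instead of
    // returning success with an untouched output buffer.
    CUDA_ERROR_NOT_SUPPORTED
}

/// Hypothetical check for a wrapper that does not expose a dynamic top-k.
pub fn check_sparse_decode_topk(topk: usize) -> CuResult {
    if topk == SPARSE_DECODE_TOPK {
        CUDA_SUCCESS
    } else {
        CUDA_ERROR_INVALID_VALUE
    }
}
```

Returning an explicit error code keeps a caller from mistaking an unimplemented path for a completed GEMM, which is the point of shipping the API surface before the runner exists.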

@xiaguan xiaguan merged commit 1c71fee into main Jun 30, 2026
1 check passed

chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7ee9da3705

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# Shared MoE/MLA third-party substrate: DeepEP, DeepGEMM, and FlashMLA.
moe = []
glm52 = ["moe"]
kimi-k2 = ["moe"]


[P2] Avoid making Kimi require unused GLM submodules

When building with --features kimi-k2, this feature edge now enables moe, and build.rs::require_moe_submodules asserts that both DeepGEMM and FlashMLA are checked out before any Kimi CUDA is compiled. The Kimi/DeepEP paths do not include those repositories, so existing Kimi-only checkouts that initialized only DeepEP will fail even though the model does not use the GLM5.2 wrappers; please split the DeepEP substrate from the GLM5.2 DeepGEMM/FlashMLA checks or require those only under glm52.
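
One way to read the second option in this suggestion (requiring DeepGEMM and FlashMLA only under glm52) is sketched below as build-script helper logic. This is an illustration under assumptions, not the repository's `build.rs`: the function names and the submodule directory names are hypothetical placeholders. The `CARGO_FEATURE_<NAME>` environment variables are the standard way Cargo exposes enabled features to build scripts.

```rust
use std::path::Path;

/// Submodule directories each feature would need. The directory names are
/// placeholders; the real checkout paths are not shown in this thread.
pub fn required_submodules(feature_enabled: impl Fn(&str) -> bool) -> Vec<&'static str> {
    let mut required = Vec::new();
    // DeepEP is the shared substrate, so it follows the `moe` feature.
    if feature_enabled("moe") {
        required.push("DeepEP");
    }
    // DeepGEMM and FlashMLA are demanded only when GLM5.2 is requested.
    if feature_enabled("glm52") {
        required.push("DeepGEMM");
        required.push("FlashMLA");
    }
    required
}

/// Cargo sets CARGO_FEATURE_<NAME> for build scripts, with the feature name
/// uppercased and `-` replaced by `_` (so `kimi-k2` becomes KIMI_K2).
pub fn cargo_feature_enabled(name: &str) -> bool {
    let key = format!("CARGO_FEATURE_{}", name.to_uppercase().replace('-', "_"));
    std::env::var_os(key).is_some()
}

/// Returns the required submodules that are not checked out under `root`.
pub fn missing_submodules(root: &Path, required: &[&'static str]) -> Vec<&'static str> {
    required
        .iter()
        .copied()
        .filter(|name| !root.join(*name).is_dir())
        .collect()
}
```

In a build script this would be driven by `required_submodules(cargo_feature_enabled)`, so a Kimi-only checkout with just DeepEP initialized would pass while a glm52 build would still fail early on a missing DeepGEMM or FlashMLA checkout.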
