feat(kernels): gate MoE third-party substrates #474
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7ee9da3705
`# Shared MoE/MLA third-party substrate: DeepEP, DeepGEMM, and FlashMLA.`
`moe = []`
`glm52 = ["moe"]`
`kimi-k2 = ["moe"]`
Avoid making Kimi require unused GLM submodules
When building with `--features kimi-k2`, this feature edge now enables `moe`, and `build.rs::require_moe_submodules` asserts that both DeepGEMM and FlashMLA are checked out before any Kimi CUDA code is compiled. The Kimi/DeepEP paths do not include those repositories, so existing Kimi-only checkouts that initialized only DeepEP will fail to build even though the model does not use the GLM5.2 wrappers. Please split the DeepEP substrate from the GLM5.2 DeepGEMM/FlashMLA checks, or require those submodules only under `glm52`.
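A minimal sketch of the split the review asks for, written as build-script logic. It assumes a hypothetical layout in which `moe` covers only DeepEP and `glm52` additionally requires DeepGEMM and FlashMLA; the submodule paths, the `.git`-marker check, and the helper names are illustrative and are not the repository's actual `require_moe_submodules`. It relies on Cargo exposing every enabled feature to build scripts as `CARGO_FEATURE_<NAME>` (upper-cased, `-` replaced by `_`), including features enabled transitively through `kimi-k2 = ["moe"]`.

```rust
use std::path::Path;

// Hypothetical submodule locations; the real paths in this repo may differ.
const DEEPEP: &str = "third_party/DeepEP";
const DEEPGEMM: &str = "third_party/DeepGEMM";
const FLASHMLA: &str = "third_party/FlashMLA";

/// Env var Cargo sets for an enabled feature: `CARGO_FEATURE_<NAME>`,
/// upper-cased with `-` replaced by `_` (e.g. `kimi-k2` -> `CARGO_FEATURE_KIMI_K2`).
fn feature_env_var(feature: &str) -> String {
    format!("CARGO_FEATURE_{}", feature.to_uppercase().replace('-', "_"))
}

/// Submodules required under the assumed split: `moe` (enabled by both
/// `kimi-k2` and `glm52`) needs only DeepEP; `glm52` adds DeepGEMM and FlashMLA.
fn required_submodules(enabled: impl Fn(&str) -> bool) -> Vec<&'static str> {
    let mut required = Vec::new();
    if enabled("moe") {
        required.push(DEEPEP);
    }
    if enabled("glm52") {
        required.extend([DEEPGEMM, FLASHMLA]);
    }
    required
}

/// Required paths with no checkout marker. An initialized submodule has a
/// `.git` file or directory; treating that as "checked out" is an assumption.
fn missing_submodules(root: &Path, required: &[&'static str]) -> Vec<&'static str> {
    required
        .iter()
        .copied()
        .filter(|rel| !root.join(rel).join(".git").exists())
        .collect()
}

fn main() {
    let enabled = |feature: &str| std::env::var_os(feature_env_var(feature)).is_some();
    let root = std::env::var("CARGO_MANIFEST_DIR").unwrap_or_else(|_| ".".to_string());
    let missing = missing_submodules(Path::new(&root), &required_submodules(enabled));
    if !missing.is_empty() {
        // Panicking from a build script fails the build with this message.
        panic!("missing submodules for the enabled features: {missing:?}; run `git submodule update --init`");
    }
}
```

With this shape, a `--features kimi-k2` build sets `CARGO_FEATURE_MOE` and `CARGO_FEATURE_KIMI_K2` but not `CARGO_FEATURE_GLM52`, so only DeepEP has to be initialized.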
Summary
- `moe` feature for the DeepEP/DeepGEMM/FlashMLA substrates
- `glm52` kernel surface for DeepGEMM scale-layout/grouped FP8 contracts and FlashMLA SM90 sparse decode wrappers
- `bench_serving` aligned with the current Qwen3 engine signature so the existing pre-commit clippy gate passes

Verification
- `cargo fmt`
- `OPENINFER_CUDA_SM=90 cargo check -p openinfer-kernels --no-default-features`
- `OPENINFER_CUDA_SM=90 OPENINFER_NCCL_ROOT=/data/code/workspace-rustllm/ep-moe-demo/.venv/lib/python3.12/site-packages/nvidia/nccl cargo check -p openinfer-kernels --no-default-features --features moe`
- `OPENINFER_CUDA_SM=90 OPENINFER_NCCL_ROOT=/data/code/workspace-rustllm/ep-moe-demo/.venv/lib/python3.12/site-packages/nvidia/nccl cargo check -p openinfer-kernels --no-default-features --features glm52`
- `OPENINFER_CUDA_SM=90 OPENINFER_NCCL_ROOT=/data/code/workspace-rustllm/ep-moe-demo/.venv/lib/python3.12/site-packages/nvidia/nccl cargo check -p openinfer-kernels --no-default-features --features kimi-k2`
- `OPENINFER_CUDA_SM=90 cargo check --release -p openinfer-server --bin bench_serving`
- `OPENINFER_CUDA_SM=90 OPENINFER_NCCL_ROOT=/data/code/workspace-rustllm/ep-moe-demo/.venv/lib/python3.12/site-packages/nvidia/nccl git commit ...`

Notes
- `topk=2048`, and intentionally does not expose dynamic `topk_length`.
- `CUDA_ERROR_NOT_SUPPORTED` until a real DeepGEMM runner is wired.
- `CHECK_CUDA`/`FLASH_ASSERT` paths can still terminate the process on internal CUDA/launch failures; this PR only converts C++ assertion exceptions at the FFI boundary into `CUresult`.
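On the Rust side, a converted `CUresult` still has to become something callers can act on. The sketch below assumes the FFI wrappers return the raw driver code as a `u32`, and uses a hypothetical `deepgemm_entry_stub` in place of a real `extern "C"` entry point; `KernelError` and `check` are illustrative names, not this crate's API. It keeps `CUDA_ERROR_NOT_SUPPORTED` (801 in `cuda.h`) distinct so a caller can take a fallback path while the DeepGEMM runner is unwired. Failures that go through `CHECK_CUDA`/`FLASH_ASSERT` terminate the process inside C++ and never reach this code.

```rust
// Raw `CUresult` values from the CUDA driver API (`cuda.h`); only those used here.
const CUDA_SUCCESS: u32 = 0;
const CUDA_ERROR_NOT_SUPPORTED: u32 = 801;

/// Hypothetical error type for the kernel wrappers.
#[derive(Debug, PartialEq, Eq)]
enum KernelError {
    /// The entry point exists but has no backing implementation yet.
    NotSupported,
    /// Any other driver code, including C++ assertion exceptions that the
    /// FFI shim converted into a `CUresult`.
    Cuda(u32),
}

/// Map a raw `CUresult` returned across the FFI boundary to a `Result`.
fn check(code: u32) -> Result<(), KernelError> {
    match code {
        CUDA_SUCCESS => Ok(()),
        CUDA_ERROR_NOT_SUPPORTED => Err(KernelError::NotSupported),
        other => Err(KernelError::Cuda(other)),
    }
}

/// Hypothetical stand-in for a DeepGEMM-backed `extern "C"` entry point that
/// reports "not supported" until a real runner is wired.
fn deepgemm_entry_stub() -> u32 {
    CUDA_ERROR_NOT_SUPPORTED
}

fn main() {
    match check(deepgemm_entry_stub()) {
        Ok(()) => println!("kernel ran"),
        Err(KernelError::NotSupported) => println!("DeepGEMM runner not wired yet; using a fallback path"),
        Err(KernelError::Cuda(code)) => eprintln!("CUDA error {code}"),
    }
}
```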