[deepseek][kernels][blackwell] Cutlass blackwell grouped gemm using cute dsl (forward) #1276

Open. lessw2020 wants to merge 20 commits into main.

lessw2020
Contributor

lessw2020 commented Jun 8, 2025

This PR integrates the new CUTLASS CuTe DSL grouped GEMM into PyTorch through the CUTLASSGroupedGemmStrategy.
The strategy handles the required conversions as well as the pointer and metadata arrays the kernel needs.
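
To make "pointer and metadata arrays" concrete, here is a minimal sketch of the kind of per-group bookkeeping a grouped GEMM needs. It is not the code in this PR: `GroupProblem`, `build_group_metadata`, the packed row-major layout, and the choice to skip empty groups are all assumptions for illustration. Element offsets stand in for the device pointers a real kernel would receive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupProblem:
    """Metadata for one GEMM in the group (hypothetical layout)."""
    m: int         # rows of A routed to this expert
    n: int         # output features
    k: int         # input features
    a_offset: int  # element offset of this group's rows in packed A
    b_offset: int  # element offset of this expert's weight matrix in B
    c_offset: int  # element offset of this group's rows in packed C


def build_group_metadata(tokens_per_expert, n, k):
    """Build problem sizes and offsets for a packed MoE-style layout.

    Assumed layout (not taken from the PR): A is [sum(m), k] row-major with
    rows sorted by expert, B is [num_experts, n, k], and C is [sum(m), n].
    Experts with zero tokens are skipped so no empty GEMM is described.
    """
    problems = []
    row_start = 0
    for expert, m in enumerate(tokens_per_expert):
        if m < 0:
            raise ValueError("token counts must be non-negative")
        if m > 0:
            problems.append(
                GroupProblem(
                    m=m,
                    n=n,
                    k=k,
                    a_offset=row_start * k,
                    b_offset=expert * n * k,
                    c_offset=row_start * n,
                )
            )
        row_start += m
    return problems
```

With `tokens_per_expert=[3, 0, 5]`, `n=4`, and `k=2`, this yields two problems: one for expert 0 starting at offset 0 in every buffer, and one for expert 2 whose A, B, and C blocks start at element offsets 6, 16, and 12. A real integration would turn each offset into a device address (base pointer plus offset times element size) and also pass strides.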

Testing:
Verified via the benchmarking file as a standalone grouped GEMM.
Verified grouped GEMM strategy integration with testMoe.

[Screenshots: 2025-06-08 9:34:01 PM and 2025-06-07 10:37:18 PM]

@facebook-github-bot added the CLA Signed label Jun 8, 2025
@drisspg
Contributor

drisspg commented Jun 9, 2025

OOC, are you not comparing against grouped_gemm because we aren't building on sm100?

@lessw2020
Contributor Author

> OOC, are you not comparing against grouped_gemm because we aren't building on sm100?

yes, exactly:
"Error using torch strategy: torch._grouped_mm is only supported on CUDA devices with compute capability = 9.0"
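
The error above suggests a simple capability gate when choosing which grouped GEMM path to benchmark. The sketch below is only an illustration: `pick_grouped_gemm_strategy` and the strategy names are hypothetical, the 9.0 check mirrors the quoted error message rather than any documented guarantee, and treating compute capability 10.x as the Blackwell (sm100) target is an assumption. In PyTorch, the `(major, minor)` tuple would typically come from `torch.cuda.get_device_capability()`.

```python
def pick_grouped_gemm_strategy(capability):
    """Pick a grouped GEMM strategy name for a (major, minor) capability.

    Hypothetical helper: the names "torch", "cutlass", and "loop" are
    placeholders, not identifiers taken from this PR.
    """
    major, minor = capability
    if (major, minor) == (9, 0):
        # Per the quoted error, torch._grouped_mm only accepted 9.0.
        return "torch"
    if major == 10:
        # Assumption: 10.x is the sm100 Blackwell target of the CUTLASS path.
        return "cutlass"
    # Assumed fallback: run one ordinary matmul per group.
    return "loop"
```

For example, `pick_grouped_gemm_strategy((10, 0))` returns `"cutlass"`, which matches the situation described here: on sm100 the torch strategy raises, so only the CUTLASS path can be measured.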

@drisspg
Contributor

drisspg commented Jun 9, 2025

Will open an issue for this

@lessw2020 changed the title from "[WIP][kernels][blackwell] Cutlass blackwell grouped gemm using cute dsl" to "[deepseek][kernels][blackwell] Cutlass blackwell grouped gemm using cute dsl (forward)" Jun 9, 2025
Labels
CLA Signed This label is managed by the Meta Open Source bot.
4 participants