
[deepseek][blackwell] add manual looping group gemm to enable base working inference on Blackwell #1272


Open
wants to merge 2 commits into base: main
Conversation

lessw2020
Contributor

This PR enables deepseek inference to run on Blackwell (B200).
Currently, torch._grouped_mm is specific to Hopper, so trying to run on B200 via TorchBF16GroupGEMM yields:

"Error using torch strategy: torch._grouped_mm is only supported on CUDA devices with compute capability = 9.0"

This PR therefore adds a manual looping group GEMM to get DeepSeek inference working on Blackwell.
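The fallback described above can be sketched as a loop over expert groups, where each group's token slice is multiplied by that group's weight matrix. This is a minimal illustration, not the PR's actual code; the function name `manual_grouped_mm` and the offsets convention (cumulative end indices per group, matching torch._grouped_mm's 2D-input case) are assumptions for the example:

```python
import torch

def manual_grouped_mm(x: torch.Tensor, w: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    """Hypothetical looping fallback for torch._grouped_mm.

    x:       [total_tokens, in_dim]   -- tokens for all groups, concatenated
    w:       [num_groups, in_dim, out_dim]  -- one weight matrix per expert group
    offsets: [num_groups]             -- cumulative end index of each group's tokens
    """
    out = x.new_empty(x.shape[0], w.shape[-1])
    start = 0
    for g in range(w.shape[0]):
        end = int(offsets[g])
        if end > start:
            # Plain matmul per group works on any GPU architecture,
            # at the cost of launching one kernel per group.
            out[start:end] = x[start:end] @ w[g]
        start = end
    return out
```

The trade-off is straightforward: one kernel launch per expert instead of a single fused grouped kernel, which explains the modest throughput reported below.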
*Note that you must use Symmetric Memory for the all-to-all; dist.all_to_all_single does not yet work on Blackwell.

With this PR:
[Screenshot: inference output on B200, 2025-06-07]

Tokens per second of 1.21 is not great, but we have now moved from 'not working' to working inference on B200.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 7, 2025
@lessw2020 lessw2020 requested a review from kwen2501 June 7, 2025 23:25