[deepseek][blackwell] add manual looping group gemm to enable base working inference on Blackwell #1272

lessw2020 · 2025-06-07T23:25:45Z

This PR enables DeepSeek inference to run on Blackwell (B200).
Currently, torch._grouped_mm is specific to Hopper, so trying to run on B200 via TorchBF16GroupGEMM yields:

"Error using torch strategy: torch._grouped_mm is only supported on CUDA devices with compute capability = 9.0"

This PR therefore adds a manual looping group GEMM to get DeepSeek inference working on Blackwell.
*Note that you must use Symmetric Memory for the all2all; dist.all_to_all_single does not yet work on Blackwell.
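
For readers unfamiliar with the idea, a manual looping group GEMM can be sketched roughly as follows. This is an illustrative sketch, not the code added in this PR: the function name `looped_grouped_mm`, the argument names, and the assumed layouts (tokens pre-sorted so each expert's rows are contiguous, weights stacked as `(num_experts, k, n)`) are all assumptions made for the example.

```python
import torch


def looped_grouped_mm(x, w, m_sizes):
    """Hypothetical grouped GEMM fallback: one plain matmul per expert.

    x:       (total_tokens, k), rows sorted so each expert's tokens are contiguous
    w:       (num_experts, k, n), stacked expert weights (assumed layout)
    m_sizes: number of rows of x routed to each expert, in expert order
    """
    assert sum(m_sizes) == x.shape[0], "group sizes must cover every row of x"
    out = x.new_empty((x.shape[0], w.shape[-1]))
    start = 0
    for expert, rows in enumerate(m_sizes):
        if rows:  # skip experts that received no tokens
            out[start:start + rows] = x[start:start + rows] @ w[expert]
        start += rows
    return out


# Small CPU example: 3 experts, the middle one receives no tokens.
torch.manual_seed(0)
m_sizes = [3, 0, 2]
x = torch.randn(5, 4)
w = torch.randn(3, 4, 6)
y = looped_grouped_mm(x, w, m_sizes)
```

Because each group is handled by an ordinary matmul, a loop like this does not depend on the fused grouped kernel, at the cost of launching one GEMM per expert instead of a single fused call.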

With this PR:

A throughput of 1.21 tokens per second is not great, but we have now moved from 'not working' to working inference on B200.

add manual looping group gemm to enable on Blackwell

9b97d58

facebook-github-bot added the CLA Signed label Jun 7, 2025

lessw2020 requested a review from kwen2501 June 7, 2025 23:25

linting for import ordering

d9c14fa
