Correction to TP logic for Mamba Mixer 2 when Num Groups not divisible by TP Size #13660
Conversation
…f num_groups > 1. Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Thanks for fixing, can we add some tests to avoid future regressions?
yes, currently we have a TP test for selected models, but AFAIK the tests for mamba2 are not yet automated, and there is some problem running them in a suite because they are not set up properly using torch multiprocessing. Let me discuss this with @tlrmchlsmth
I'm a little confused by this because we're seeing poor gsm8k results on TP size 2 on Mamba Codestral. For reference, these are the GSM8k results on Codestral-7B from @yury-tokpanov, comparing TP=1 vs TP=2.
This does make sense to me, and I'm on board with this approach
```python
self.tp_size = get_tensor_model_parallel_world_size()
tp_rank = get_tensor_model_parallel_rank()

assert num_heads % self.tp_size == 0, \
    "Tensor parallel world size must divide num heads."

assert (n_groups % self.tp_size) != 0 and n_groups == 0, \
```
Should this be:

`assert (n_groups % self.tp_size) == 0 or n_groups == 1, ...`
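For reference, a quick sketch (not part of the PR) comparing the condition as written in the diff with the suggested one, for a few example configurations:

```python
# Illustrative only: contrast the assert condition in the diff above with the
# suggested correction, for a few (n_groups, tp_size) pairs.
cases = [
    (8, 2),  # n_groups divisible by TP size
    (1, 4),  # single group, intended to be replicated across TP ranks
    (3, 5),  # general non-divisible case, not supported
]
for n_groups, tp_size in cases:
    as_written = (n_groups % tp_size) != 0 and n_groups == 0  # always False here, so the assert always fires
    suggested = (n_groups % tp_size) == 0 or n_groups == 1    # True, True, False
    print(n_groups, tp_size, as_written, suggested)
```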
I hit this assert when running

```
lm_eval --model vllm --model_args pretrained=mistralai/Mamba-Codestral-7B-v0.1,gpu_memory_utilization=0.8,max_model_len=4096,tensor_parallel_size=2 --batch_size auto --trust_remote_code --cache_requests true --tasks gsm8k
```

but with my suggestion, the TP==2 gsm8k results look good:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.4701|± |0.0137|
| | |strict-match | 5|exact_match|↑ |0.4549|± |0.0137|
oh sorry you are right
@tlrmchlsmth sorry that was a typo, I fixed it in the main description
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
I think originally groups were introduced to support easy TP for mamba2, right? So, in the original logic n_groups should always be divisible by TP size. I'm wondering whether the general case of n_groups not being divisible by TP size would be quite exotic. Probably supporting the special case of n_groups == 1 would be enough for now.
Thanks for the fix and all the work on TP for mamba2!
Thanks a lot for the fix!
The current logic is incorrect for the case when `n_groups` ~~cannot~~ can be divided by TP size; supporting the general case where it cannot requires more changes to the kernels. This is because the code currently uses a simple ratio to map each head to the correct group, and one can imagine that the more general case is more complicated. For example, with `n_groups=3`, `n_heads=15`, and `TP_size=5`, we end up with the following splitting of the heads (below, the digits 0, 1, 2 indicate which group each of the 15 heads maps to):

`000 | 001 | 111 | 122 | 222`

In this case, we end up with a very heterogeneous situation, where some TP ranks need heads from two different groups while others need heads from only one. Of course, if we duplicated groups to the extent that they equal the number of heads, then it's possible, but I'm not sure there is a more efficient method.
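To make the example above concrete, here is a minimal sketch (illustrative only, not the vLLM implementation) of the simple ratio-based head-to-group mapping, reproducing the split for `n_groups=3`, `n_heads=15`, `TP_size=5`:

```python
# Illustrative only: reproduce the head -> group split from the example above
# using the simple ratio mapping (head index // heads_per_group).
n_groups, n_heads, tp_size = 3, 15, 5
heads_per_group = n_heads // n_groups  # 5 heads map to each group
heads_per_rank = n_heads // tp_size    # 3 heads are placed on each TP rank

for rank in range(tp_size):
    heads = range(rank * heads_per_rank, (rank + 1) * heads_per_rank)
    groups = "".join(str(h // heads_per_group) for h in heads)
    print(f"rank {rank}: {groups}")
# Output: 000 | 001 | 111 | 122 | 222 -- ranks 1 and 3 need heads from two
# different groups, which is why the general case needs kernel changes.
```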
**Current Strategy in this PR**
For now, maybe it's easier to patch it to specifically support only the following two special cases:

1. the TP size divides `n_groups`, or
2. the TP size does not divide `n_groups`, but `n_groups == 1`.

These two scenarios support existing models such as Codestral, Bamba, Zamba, etc., where `n_groups` is either 1 or some power of 2.

cc: @tlrmchlsmth @yury-tokpanov
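As an illustration of the strategy above, a minimal sketch of the per-rank group count (`groups_per_rank` is a hypothetical helper for illustration, not the actual vLLM code):

```python
# Hypothetical helper, not the actual vLLM code: how many groups each TP rank
# holds under the two supported special cases.
def groups_per_rank(n_groups: int, tp_size: int) -> int:
    assert (n_groups % tp_size) == 0 or n_groups == 1, \
        "Only n_groups divisible by tp_size, or n_groups == 1, is supported."
    if n_groups == 1:
        # The single group is duplicated so that every rank owns a full copy.
        return 1
    return n_groups // tp_size

print(groups_per_rank(8, 2))  # 4: n_groups divisible by TP size
print(groups_per_rank(1, 4))  # 1: the one group replicated on every rank
```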