
Conversation

kfojcik-intel

Unit tests for vLLM multimodal input and processing.
Inspired by upstream test_inputs.py and test_processing.py

kfojcik-intel and others added 25 commits September 12, 2025 14:04
Signed-off-by: Katarzyna Fojcik <[email protected]>
Updating states in the defragmentator on dummy data is redundant and
should be avoided.
Right now, doing warmup on the defragmentator will also cause a crash in
case of contiguous PA due to
vllm-project#126
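
A minimal sketch of the kind of guard this describes, with hypothetical names (the plugin's actual defragmenter API may differ):

```python
# Hypothetical sketch: skip defragmenter state updates for warmup/dummy data.
from dataclasses import dataclass, field


@dataclass
class Defragmenter:
    # logical block id -> physical block id; only real batches may touch it
    mapping: dict[int, int] = field(default_factory=dict)

    def update(self, block_ids: list[int]) -> None:
        for block_id in block_ids:
            self.mapping.setdefault(block_id, block_id)


def maybe_update_defragmenter(defrag: Defragmenter, block_ids: list[int],
                              is_dummy_run: bool) -> None:
    if is_dummy_run:
        # Warmup block tables are synthetic; updating state here is redundant
        # and can corrupt the mapping used by contiguous PA later on.
        return
    defrag.update(block_ids)
```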

Signed-off-by: Marcin Swiniarski <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Set of commits for the vllm docker

---------

Signed-off-by: PatrykWo <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
This PR adds support to hpu_model_runner to execute pooling models.

---------

Signed-off-by: slokesha <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
…t#137)

Sometimes requests that don't return tokens get mixed up with the rest of
the prefills in the merged-prefill case - we want to remove them from sampling.
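
For illustration, a hedged sketch of dropping such requests from sampling (names here are hypothetical, not the runner's actual code):

```python
# Keep only the logits rows of requests that actually return tokens.
import torch


def select_sampled_logits(logits: torch.Tensor,
                          returns_tokens: list[bool]) -> torch.Tensor:
    idx = torch.tensor([i for i, keep in enumerate(returns_tokens) if keep],
                       dtype=torch.long, device=logits.device)
    return logits.index_select(0, idx)
```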

---------

Signed-off-by: Agata Dobrzyniewicz <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
…t#107)

Signed-off-by: taran2210 <[email protected]>
Co-authored-by: Michał Kuligowski <[email protected]>
Co-authored-by: Agata Dobrzyniewicz <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
I added two tests for custom op registration.
Additionally, in `vllm_gaudi/ops/__init__.py`, I wrapped the imports in a
function. I did this because currently, if someone imported a custom
operator before ops registration, for example `from
vllm_gaudi.ops.hpu_layernorm import HPURMSNorm`, then all other custom
ops would be registered as an unexpected side effect. With this change,
only `HPURMSNorm` is registered in that case.
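
A sketch of the import-wrapping pattern described above; the layout is illustrative, and only `hpu_layernorm` is taken from the text:

```python
# vllm_gaudi/ops/__init__.py (sketch)

def register_ops() -> None:
    """Import custom-op modules only when registration is actually requested."""
    # Importing a module is what triggers its op registration, so keeping the
    # imports inside this function means that `from vllm_gaudi.ops.hpu_layernorm
    # import HPURMSNorm` no longer registers every other op as a side effect.
    from vllm_gaudi.ops import hpu_layernorm  # noqa: F401
    # ... further op modules would be imported here in the same way
```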

---------

Signed-off-by: Kacper Pietkun <[email protected]>
Co-authored-by: Michał Kuligowski <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
This PR adds support to hpu_model_runner to execute pooling models.
Note: warmup is not yet enabled for pooling.

---------

Signed-off-by: slokesha <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Vivek <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
This PR introduces warmup for merged prefill but also changes the warmup
design a bit:
- separate get-cfg and get-range functions in strategies
- strategies no longer handle bucket filtering
- the bucketing manager creates buckets from three ranges (bs, query, ctx)
and filters out unwanted buckets based on a filtering map (see the sketch below)
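
A minimal sketch of that flow, with hypothetical names: buckets come from the cross product of the three ranges, and a filter decides which combinations survive.

```python
from itertools import product
from typing import Callable, Iterable

Bucket = tuple[int, int, int]  # (batch_size, query_len, num_ctx_blocks)


def generate_buckets(bs_range: Iterable[int], query_range: Iterable[int],
                     ctx_range: Iterable[int],
                     keep: Callable[[Bucket], bool]) -> list[Bucket]:
    # Strategies only provide the ranges; filtering happens here.
    return [b for b in product(bs_range, query_range, ctx_range) if keep(b)]


# Arbitrary example filter: drop buckets whose prefill token count is too large.
buckets = generate_buckets([1, 2, 4], [128, 256, 512], [0, 4, 8],
                           keep=lambda b: b[0] * b[1] <= 1024)
```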

---------

Signed-off-by: Agata Dobrzyniewicz <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Introduces a new attention backend, Unified Attention, to handle both
prefills and decodes (and potentially, in the future, mixed batches).
* To enable it, run with VLLM_UNIFIED_ATTN=true
* Unified Attention by default implies contiguous_pa and merged_prefill,
but either can be disabled by specifying its respective flag
(VLLM_CONTIGUOUS_PA=false or VLLM_MERGED_PREFILL=false); see the sketch below
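
A hedged sketch of that flag interplay (helper names are illustrative, not the plugin's actual config code):

```python
import os


def env_flag(name: str, default: bool) -> bool:
    value = os.environ.get(name)
    if value is None:
        return default
    return value.lower() in ("1", "t", "true", "y", "yes")


# Unified attention implies both features unless they are disabled explicitly.
unified_attn = env_flag("VLLM_UNIFIED_ATTN", False)
contiguous_pa = env_flag("VLLM_CONTIGUOUS_PA", default=unified_attn)
merged_prefill = env_flag("VLLM_MERGED_PREFILL", default=unified_attn)
```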

---------

Signed-off-by: Michal Adamczyk <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Warming up the sampler with different configurations removes the graph
recompilations of bigger sampler graphs seen during actual execution.
As tested with example workloads and batch sizes, the only
recompilations left from the sampler come from minor graphs, which have
minimal influence on execution time.

The warmup of the sampler takes around 1-3 seconds, depending on the
buckets' batch sizes to be warmed up.
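
An illustrative-only sketch of such a warmup loop, assuming the sampler is exposed as a callable over logits (the real runner passes its sampling metadata per bucket):

```python
import torch


def warmup_sampler(sampler, batch_sizes: list[int], vocab_size: int,
                   device: str = "cpu") -> None:
    # Run the sampler once per batch-size bucket so each graph variant is
    # compiled before real requests arrive.
    for bs in sorted(set(batch_sizes)):
        dummy_logits = torch.zeros(bs, vocab_size, device=device)
        sampler(dummy_logits)


# Stand-in sampler for demonstration purposes only.
warmup_sampler(lambda logits: torch.argmax(logits, dim=-1),
               batch_sizes=[1, 2, 4, 8], vocab_size=32000)
```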

Additionally, removed a case where the warmup method was called
twice (visible as duplicated prints within the warmup phase, but with empty
warmed-up buckets, since these had all already been warmed up).

---------

Signed-off-by: Krzysztof Smusz <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Vivek <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
This adds data parallel support for the V1 Gaudi plugin.

- [x] add DP-aware padding (sketched below)
- [x] use all_gather and reduce_scatter
- [x] add a data parallel example
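
A hedged sketch of DP-aware padding under simple assumptions (1-D token tensors, an already-initialized process group; names are illustrative):

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F


def pad_for_dp(local_tokens: torch.Tensor, dp_group) -> torch.Tensor:
    # Every DP rank pads its local batch to the max size across ranks so the
    # subsequent all_gather / reduce_scatter calls see matching shapes.
    local_size = torch.tensor([local_tokens.shape[0]],
                              device=local_tokens.device)
    sizes = [torch.zeros_like(local_size)
             for _ in range(dist.get_world_size(dp_group))]
    dist.all_gather(sizes, local_size, group=dp_group)
    pad = int(torch.stack(sizes).max()) - local_tokens.shape[0]
    return F.pad(local_tokens, (0, pad)) if pad > 0 else local_tokens
```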

---------

Signed-off-by: Wuxun Zhang <[email protected]>
Co-authored-by: Konrad Zawora <[email protected]>
Co-authored-by: Agata Dobrzyniewicz <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Chendi.Xue <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Add @afierka-intel user

Signed-off-by: Artur Fierka <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Kacper Pietkun <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Wuxun Zhang <[email protected]>
Co-authored-by: Agata Dobrzyniewicz <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Konrad Zawora <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Currently there are dynamo recompilations for each layer, due to the
`layer_name` arg passed to the forward function:
```
(Worker pid=79578) [rank0]:W0910 15:26:29.372000 79578 torch/_dynamo/convert_frame.py:1016] [3/8] torch._dynamo hit config.recompile_limit (8)
(Worker pid=79578) [rank0]:W0910 15:26:29.372000 79578 torch/_dynamo/convert_frame.py:1016] [3/8]    function: 'forward' (vllm/vllm/model_executor/models/mixtral.py:230)
(Worker pid=79578) [rank0]:W0910 15:26:29.372000 79578 torch/_dynamo/convert_frame.py:1016] [3/8]    last reason: 3/7: self._modules['block_sparse_moe']._modules['experts'].layer_name == 'model.layers.7.block_sparse_moe.experts'
(Worker pid=79578) [rank0]:W0910 15:26:29.372000 79578 torch/_dynamo/convert_frame.py:1016] [3/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
(Worker pid=79578) [rank0]:W0910 15:26:29.372000 79578 torch/_dynamo/convert_frame.py:1016] [3/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
```

This causes a huge perf drop when using torch.compile instead of lazy mode
(~5x worse perf) -- in the traces we can observe a lot of
`transpose_mme` and `broadcast_nd` blocks between all the MME nodes:

![Trace showing transpose_mme and broadcast_nd blocks between MME nodes](https://github.com/user-attachments/assets/343ae137-20d0-447c-b687-387eefe19e41)

To avoid this, I propose a solution similar to the one we used to have in vllm-fork
([FusedMoe.__init__()](https://github.com/HabanaAI/vllm-fork/blob/habana_main/vllm/model_executor/layers/fused_moe/layer.py#L866)
and
[FusedMoE.forward()](https://github.com/HabanaAI/vllm-fork/blob/habana_main/vllm/model_executor/layers/fused_moe/layer.py#L1442))
-- using the `FusedMoE.forward_impl()` function for the cases where
`dp_size` is equal to 1.
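
A simplified sketch of the dispatch idea (not the actual vLLM FusedMoE code): when `dp_size == 1` the layer calls `forward_impl()` directly, so the `layer_name` string never reaches the compiled region and dynamo stops specializing one graph per layer.

```python
import torch


class FusedMoELike(torch.nn.Module):
    def __init__(self, dp_size: int, layer_name: str):
        super().__init__()
        self.dp_size = dp_size
        self.layer_name = layer_name

    def forward_impl(self, hidden_states: torch.Tensor,
                     router_logits: torch.Tensor) -> torch.Tensor:
        return hidden_states  # placeholder for the real expert computation

    def forward(self, hidden_states: torch.Tensor,
                router_logits: torch.Tensor) -> torch.Tensor:
        if self.dp_size == 1:
            # Bypass the layer_name-keyed custom-op wrapper entirely.
            return self.forward_impl(hidden_states, router_logits)
        # Stand-in for the custom-op path that needs layer_name for DP.
        return self._dp_moe_forward(hidden_states, router_logits,
                                    self.layer_name)

    def _dp_moe_forward(self, hidden_states, router_logits, layer_name):
        raise NotImplementedError("DP path omitted in this sketch")
```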

---------

Signed-off-by: Karol Damaszke <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
This PR fixes the current Schrödinger's CI pipelines: pipelines that should
fail now actually fail (breakages are no longer reported as passes), and
pipelines that should pass now actually pass (former spurious failures are
gone thanks to adjusted tolerances). Basically, if you break something, the
CI pipeline will fail as it should, and pipelines that used to be broken are
no longer broken.

---------

Signed-off-by: Konrad Zawora <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
wuxun-zhang and others added 5 commits September 12, 2025 14:04
For DP, dummy decode input data will be created with
`schedulerOutput=None`; this change skips preparing spec_decode_inputs in
that case.
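
A minimal sketch of that guard, with hypothetical names:

```python
from typing import Optional


def prepare_spec_decode_inputs(scheduler_output: Optional[object]):
    if scheduler_output is None:
        # Dummy DP decode step, created only to keep ranks in lockstep;
        # there is nothing to build draft-token inputs from.
        return None
    ...  # build spec-decode inputs from the real scheduler output
```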

---------

Signed-off-by: Wuxun Zhang <[email protected]>
Co-authored-by: Chendi.Xue <[email protected]>
Co-authored-by: Konrad Zawora <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
port nixl

---------

Signed-off-by: Harish Subramony <[email protected]>
Signed-off-by: Chendi.Xue <[email protected]>
Co-authored-by: Chendi.Xue <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Harish Subramony <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
```bash
QUANT_CONFIG=vllm-gaudi/tests/models/language/generation/inc_dynamic_quant.json VLLM_HPU_FORCE_CHANNEL_FP8=false \
HABANA_VISIBLE_DEVICES=all VLLM_CONTIGUOUS_PA=False VLLM_SKIP_WARMUP=true PT_HPU_LAZY_MODE=1 VLLM_USE_V1=1 \
lm_eval   --model vllm --tasks gsm8k --num_fewshot 5 --batch_size 128 \
--model_args "pretrained=/mnt/disk8/Qwen/Qwen3-8B-FP8,tensor_parallel_size=1,trust_remote_code=true,max_model_len=4096,dtype=bfloat16"
```
```bash
vllm (pretrained=/mnt/disk8/Qwen/Qwen3-8B-FP8,tensor_parallel_size=1,trust_remote_code=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 128
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8817|±  |0.0089|
|     |       |strict-match    |     5|exact_match|↑  |0.8749|±  |0.0091|

```

---------

Signed-off-by: yiliu30 <[email protected]>
Co-authored-by: Chendi.Xue <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
@kfojcik-intel force-pushed the dev/kfojcik/uts_multimodal branch from 51c38ed to 6606502 on September 12, 2025 11:05
@adobrzyn (Collaborator)

/run-gaudi-tests

Signed-off-by: Katarzyna Fojcik <[email protected]>
@adobrzyn (Collaborator)

/run-gaudi-tests

Signed-off-by: Katarzyna Fojcik <[email protected]>
@adobrzyn (Collaborator)

/run-gaudi-tests

Signed-off-by: Katarzyna Fojcik <[email protected]>
@adobrzyn (Collaborator)

/run-gaudi-tests

"""Test that HPU processor is initialized with correct kwargs."""
mock_tokenizer = cast(AnyTokenizer, object())

ctx = InputProcessingContext(
@kamil-kaczor (Contributor), Sep 12, 2025

ModelConfig read takes 10s+ of the time when running all tests. I've profiled this and it seems to come from `_run_in_subprocess`, and the overhead is gone once I remove it. I suggest monkey-patching `_run_in_subprocess` out of
`vllm/vllm/model_executor/models/registry.py`,
so that
`return _run_in_subprocess(lambda: _ModelInfo.from_model_cls(self.load_model_cls()))`
changes to
`return _ModelInfo.from_model_cls(self.load_model_cls())`

This reduced the test time of test_hpu_multimodal_processing.py::test_hf_processor_init_kwargs from 12s to 2s.
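
A hedged sketch of that monkey-patch as a pytest fixture, assuming `_run_in_subprocess(fn)` takes a zero-argument callable as in the quoted code (verify against your vLLM version):

```python
import pytest


@pytest.fixture(autouse=True)
def fast_model_registry(monkeypatch):
    import vllm.model_executor.models.registry as registry

    # Run the model inspection in-process instead of in a subprocess,
    # trading isolation for a much faster ModelConfig read in unit tests.
    monkeypatch.setattr(registry, "_run_in_subprocess", lambda fn: fn())
```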
