
Conversation

kfojcik-intel

Unit tests for vLLM multimodal input and processing.
Inspired by upstream test_inputs.py and test_processing.py

kfojcik-intel and others added 25 commits September 12, 2025 14:04
Signed-off-by: Katarzyna Fojcik <[email protected]>
Updating states in the defragmentator on dummy data is redundant and
should be avoided.
Right now, doing warmup on the defragmentator will also cause a crash in
case of contiguous PA due to
vllm-project#126
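
A minimal sketch of the kind of guard this describes, with hypothetical names (the plugin's actual defragmenter API may differ):

```python
# Hypothetical sketch: skip defragmenter state updates for warmup/dummy data.
from dataclasses import dataclass, field


@dataclass
class Defragmenter:
    # logical block id -> physical block id; only real batches may touch it
    mapping: dict[int, int] = field(default_factory=dict)

    def update(self, block_ids: list[int]) -> None:
        for block_id in block_ids:
            self.mapping.setdefault(block_id, block_id)


def maybe_update_defragmenter(defrag: Defragmenter, block_ids: list[int],
                              is_dummy_run: bool) -> None:
    if is_dummy_run:
        # Warmup block tables are synthetic; updating state here is redundant
        # and can corrupt the mapping used by contiguous PA later on.
        return
    defrag.update(block_ids)
```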

Signed-off-by: Marcin Swiniarski <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Set of commits for the vllm docker

---------

Signed-off-by: PatrykWo <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
This PR adds support to hpu_model_runner to execute pooling models.

---------

Signed-off-by: slokesha <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
…t#137)

Sometimes requests that don't return tokens get mixed up with the rest of
the prefills in the merged-prefill case - we want to remove them from sampling.
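
For illustration, a hedged sketch of dropping such requests from sampling (names here are hypothetical, not the runner's actual code):

```python
# Keep only the logits rows of requests that actually return tokens.
import torch


def select_sampled_logits(logits: torch.Tensor,
                          returns_tokens: list[bool]) -> torch.Tensor:
    idx = torch.tensor([i for i, keep in enumerate(returns_tokens) if keep],
                       dtype=torch.long, device=logits.device)
    return logits.index_select(0, idx)
```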

---------

Signed-off-by: Agata Dobrzyniewicz <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
…t#107)

Signed-off-by: taran2210 <[email protected]>
Co-authored-by: Michał Kuligowski <[email protected]>
Co-authored-by: Agata Dobrzyniewicz <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
I added two tests for custom op registration.
Additionally, in `vllm_gaudi/ops/__init__.py`, I wrapped the imports in a
function. I did this because currently, if someone imported a custom
operator before ops registration, for example `from
vllm_gaudi.ops.hpu_layernorm import HPURMSNorm`, then all other custom
ops would be registered as an unexpected side effect. With this change,
only `HPURMSNorm` is registered in that case.
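
A sketch of the import-wrapping pattern described above; the layout is illustrative, and only `hpu_layernorm` is taken from the text:

```python
# vllm_gaudi/ops/__init__.py (sketch)

def register_ops() -> None:
    """Import custom-op modules only when registration is actually requested."""
    # Importing a module is what triggers its op registration, so keeping the
    # imports inside this function means that `from vllm_gaudi.ops.hpu_layernorm
    # import HPURMSNorm` no longer registers every other op as a side effect.
    from vllm_gaudi.ops import hpu_layernorm  # noqa: F401
    # ... further op modules would be imported here in the same way
```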

---------

Signed-off-by: Kacper Pietkun <[email protected]>
Co-authored-by: Michał Kuligowski <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
This PR adds support to hpu_model_runner to execute pooling models.
Note: warmup is not yet enabled for pooling.

---------

Signed-off-by: slokesha <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Vivek <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
This PR introduces warmup for merged prefill but also changes the warmup
design a bit:
- separate get-cfg and get-range functions in strategies
- strategies no longer handle bucket filtering
- the bucketing manager creates buckets from three ranges (bs, query, ctx)
and filters out unwanted buckets based on a filtering map (see the sketch below)
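
A minimal sketch of that flow, with hypothetical names: buckets come from the cross product of the three ranges, and a filter decides which combinations survive.

```python
from itertools import product
from typing import Callable, Iterable

Bucket = tuple[int, int, int]  # (batch_size, query_len, num_ctx_blocks)


def generate_buckets(bs_range: Iterable[int], query_range: Iterable[int],
                     ctx_range: Iterable[int],
                     keep: Callable[[Bucket], bool]) -> list[Bucket]:
    # Strategies only provide the ranges; filtering happens here.
    return [b for b in product(bs_range, query_range, ctx_range) if keep(b)]


# Arbitrary example filter: drop buckets whose prefill token count is too large.
buckets = generate_buckets([1, 2, 4], [128, 256, 512], [0, 4, 8],
                           keep=lambda b: b[0] * b[1] <= 1024)
```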

---------

Signed-off-by: Agata Dobrzyniewicz <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Introduces a new attention backend, Unified Attention, to handle both
prefills and decodes (and potentially, in the future, mixed batches).
* To enable it, run with VLLM_UNIFIED_ATTN=true
* Unified Attention by default implies contiguous_pa and merged_prefill,
but either can be disabled by specifying its respective flag
(VLLM_CONTIGUOUS_PA=false or VLLM_MERGED_PREFILL=false); see the sketch below
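
A hedged sketch of that flag interplay (helper names are illustrative, not the plugin's actual config code):

```python
import os


def env_flag(name: str, default: bool) -> bool:
    value = os.environ.get(name)
    if value is None:
        return default
    return value.lower() in ("1", "t", "true", "y", "yes")


# Unified attention implies both features unless they are disabled explicitly.
unified_attn = env_flag("VLLM_UNIFIED_ATTN", False)
contiguous_pa = env_flag("VLLM_CONTIGUOUS_PA", default=unified_attn)
merged_prefill = env_flag("VLLM_MERGED_PREFILL", default=unified_attn)
```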

---------

Signed-off-by: Michal Adamczyk <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Warming up the sampler with different configurations removes the graph
recompilations of bigger sampler graphs seen during actual execution.
As tested with example workloads and batch sizes, the only
recompilations left from the sampler come from minor graphs, which have
minimal influence on execution time.

The warmup of the sampler takes around 1-3 seconds, depending on the
buckets' batch sizes to be warmed up.
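
An illustrative-only sketch of such a warmup loop, assuming the sampler is exposed as a callable over logits (the real runner passes its sampling metadata per bucket):

```python
import torch


def warmup_sampler(sampler, batch_sizes: list[int], vocab_size: int,
                   device: str = "cpu") -> None:
    # Run the sampler once per batch-size bucket so each graph variant is
    # compiled before real requests arrive.
    for bs in sorted(set(batch_sizes)):
        dummy_logits = torch.zeros(bs, vocab_size, device=device)
        sampler(dummy_logits)


# Stand-in sampler for demonstration purposes only.
warmup_sampler(lambda logits: torch.argmax(logits, dim=-1),
               batch_sizes=[1, 2, 4, 8], vocab_size=32000)
```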

Additionally, removed a case where the warmup method was called
twice (visible as duplicated prints within the warmup phase, but with empty
warmed-up buckets, since these had all already been warmed up).

---------

Signed-off-by: Krzysztof Smusz <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Vivek <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
This adds data parallel support for the V1 Gaudi plugin.

- [x] add DP-aware padding (sketched below)
- [x] use all_gather and reduce_scatter
- [x] add a data parallel example
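
A hedged sketch of DP-aware padding under simple assumptions (1-D token tensors, an already-initialized process group; names are illustrative):

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F


def pad_for_dp(local_tokens: torch.Tensor, dp_group) -> torch.Tensor:
    # Every DP rank pads its local batch to the max size across ranks so the
    # subsequent all_gather / reduce_scatter calls see matching shapes.
    local_size = torch.tensor([local_tokens.shape[0]],
                              device=local_tokens.device)
    sizes = [torch.zeros_like(local_size)
             for _ in range(dist.get_world_size(dp_group))]
    dist.all_gather(sizes, local_size, group=dp_group)
    pad = int(torch.stack(sizes).max()) - local_tokens.shape[0]
    return F.pad(local_tokens, (0, pad)) if pad > 0 else local_tokens
```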

---------

Signed-off-by: Wuxun Zhang <[email protected]>
Co-authored-by: Konrad Zawora <[email protected]>
Co-authored-by: Agata Dobrzyniewicz <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Chendi.Xue <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Add @afierka-intel user

Signed-off-by: Artur Fierka <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Kacper Pietkun <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Wuxun Zhang <[email protected]>
Co-authored-by: Agata Dobrzyniewicz <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Konrad Zawora <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Currently there are dynamo recompilations for each layer, due to the
`layer_name` arg passed to the forward function:
```
(Worker pid=79578) [rank0]:W0910 15:26:29.372000 79578 torch/_dynamo/convert_frame.py:1016] [3/8] torch._dynamo hit config.recompile_limit (8)
(Worker pid=79578) [rank0]:W0910 15:26:29.372000 79578 torch/_dynamo/convert_frame.py:1016] [3/8]    function: 'forward' (vllm/vllm/model_executor/models/mixtral.py:230)
(Worker pid=79578) [rank0]:W0910 15:26:29.372000 79578 torch/_dynamo/convert_frame.py:1016] [3/8]    last reason: 3/7: self._modules['block_sparse_moe']._modules['experts'].layer_name == 'model.layers.7.block_sparse_moe.experts'
(Worker pid=79578) [rank0]:W0910 15:26:29.372000 79578 torch/_dynamo/convert_frame.py:1016] [3/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
(Worker pid=79578) [rank0]:W0910 15:26:29.372000 79578 torch/_dynamo/convert_frame.py:1016] [3/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
```

This causes a huge perf drop when using torch.compile instead of lazy mode
(~5x worse perf) -- in the traces we can observe a lot of
`transpose_mme` and `broadcast_nd` blocks between all the MME nodes:

![Trace showing transpose_mme and broadcast_nd blocks between MME nodes](https://github.com/user-attachments/assets/343ae137-20d0-447c-b687-387eefe19e41)

To avoid this, I propose a solution similar to the one we used to have in vllm-fork
([FusedMoe.__init__()](https://github.com/HabanaAI/vllm-fork/blob/habana_main/vllm/model_executor/layers/fused_moe/layer.py#L866)
and
[FusedMoE.forward()](https://github.com/HabanaAI/vllm-fork/blob/habana_main/vllm/model_executor/layers/fused_moe/layer.py#L1442))
-- using the `FusedMoE.forward_impl()` function for the cases where
`dp_size` is equal to 1.
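
A simplified sketch of the dispatch idea (not the actual vLLM FusedMoE code): when `dp_size == 1` the layer calls `forward_impl()` directly, so the `layer_name` string never reaches the compiled region and dynamo stops specializing one graph per layer.

```python
import torch


class FusedMoELike(torch.nn.Module):
    def __init__(self, dp_size: int, layer_name: str):
        super().__init__()
        self.dp_size = dp_size
        self.layer_name = layer_name

    def forward_impl(self, hidden_states: torch.Tensor,
                     router_logits: torch.Tensor) -> torch.Tensor:
        return hidden_states  # placeholder for the real expert computation

    def forward(self, hidden_states: torch.Tensor,
                router_logits: torch.Tensor) -> torch.Tensor:
        if self.dp_size == 1:
            # Bypass the layer_name-keyed custom-op wrapper entirely.
            return self.forward_impl(hidden_states, router_logits)
        # Stand-in for the custom-op path that needs layer_name for DP.
        return self._dp_moe_forward(hidden_states, router_logits,
                                    self.layer_name)

    def _dp_moe_forward(self, hidden_states, router_logits, layer_name):
        raise NotImplementedError("DP path omitted in this sketch")
```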

---------

Signed-off-by: Karol Damaszke <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
This PR fixes the current Schrödinger's CI pipelines: pipelines that should
fail now actually fail (breakages are no longer reported as passes), and
pipelines that should pass now actually pass (former spurious failures are
gone thanks to adjusted tolerances). Basically, if you break something, the
CI pipeline will fail as it should, and pipelines that used to be broken are
no longer broken.

---------

Signed-off-by: Konrad Zawora <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
wuxun-zhang and others added 5 commits September 12, 2025 14:04
For DP, dummy decode input data will be created with
`schedulerOutput=None`; this change skips preparing spec_decode_inputs in
that case.
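
A minimal sketch of that guard, with hypothetical names:

```python
from typing import Optional


def prepare_spec_decode_inputs(scheduler_output: Optional[object]):
    if scheduler_output is None:
        # Dummy DP decode step, created only to keep ranks in lockstep;
        # there is nothing to build draft-token inputs from.
        return None
    ...  # build spec-decode inputs from the real scheduler output
```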

---------

Signed-off-by: Wuxun Zhang <[email protected]>
Co-authored-by: Chendi.Xue <[email protected]>
Co-authored-by: Konrad Zawora <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
port nixl

---------

Signed-off-by: Harish Subramony <[email protected]>
Signed-off-by: Chendi.Xue <[email protected]>
Co-authored-by: Chendi.Xue <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Harish Subramony <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
```bash
QUANT_CONFIG=vllm-gaudi/tests/models/language/generation/inc_dynamic_quant.json VLLM_HPU_FORCE_CHANNEL_FP8=false \
HABANA_VISIBLE_DEVICES=all VLLM_CONTIGUOUS_PA=False VLLM_SKIP_WARMUP=true PT_HPU_LAZY_MODE=1 VLLM_USE_V1=1 \
lm_eval   --model vllm --tasks gsm8k --num_fewshot 5 --batch_size 128 \
--model_args "pretrained=/mnt/disk8/Qwen/Qwen3-8B-FP8,tensor_parallel_size=1,trust_remote_code=true,max_model_len=4096,dtype=bfloat16"
```
```bash
vllm (pretrained=/mnt/disk8/Qwen/Qwen3-8B-FP8,tensor_parallel_size=1,trust_remote_code=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 128
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8817|±  |0.0089|
|     |       |strict-match    |     5|exact_match|↑  |0.8749|±  |0.0091|

```

---------

Signed-off-by: yiliu30 <[email protected]>
Co-authored-by: Chendi.Xue <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
@kfojcik-intel force-pushed the dev/kfojcik/uts_multimodal branch from 51c38ed to 6606502 on September 12, 2025 11:05
@adobrzyn (Collaborator)

/run-gaudi-tests

Signed-off-by: Katarzyna Fojcik <[email protected]>
@adobrzyn (Collaborator)

/run-gaudi-tests

Signed-off-by: Katarzyna Fojcik <[email protected]>
@adobrzyn (Collaborator)

/run-gaudi-tests

Signed-off-by: Katarzyna Fojcik <[email protected]>
@adobrzyn (Collaborator)

/run-gaudi-tests

"""Test that HPU processor is initialized with correct kwargs."""
mock_tokenizer = cast(AnyTokenizer, object())

ctx = InputProcessingContext(
@kamil-kaczor (Contributor), Sep 12, 2025

ModelConfig read takes 10s+ of the time when running all tests. I've profiled this and it seems to come from `_run_in_subprocess`, and the overhead is gone once I remove it. I suggest monkey-patching `_run_in_subprocess` out of
`vllm/vllm/model_executor/models/registry.py`,
so that
`return _run_in_subprocess(lambda: _ModelInfo.from_model_cls(self.load_model_cls()))`
changes to
`return _ModelInfo.from_model_cls(self.load_model_cls())`

This reduced the test time of test_hpu_multimodal_processing.py::test_hf_processor_init_kwargs from 12s to 2s.
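
A hedged sketch of that monkey-patch as a pytest fixture, assuming `_run_in_subprocess(fn)` takes a zero-argument callable as in the quoted code (verify against your vLLM version):

```python
import pytest


@pytest.fixture(autouse=True)
def fast_model_registry(monkeypatch):
    import vllm.model_executor.models.registry as registry

    # Run the model inspection in-process instead of in a subprocess,
    # trading isolation for a much faster ModelConfig read in unit tests.
    monkeypatch.setattr(registry, "_run_in_subprocess", lambda fn: fn())
```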
