[SW-236089] UTs: multimodality correctness #136
base: main
Conversation
Branch force-pushed from a7965e3 to 1322090
Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
Updating states in the defragmentator on dummy data is redundant and we should avoid it. Right now, doing warmup on the defragmentator would also cause a crash in the contiguous PA case due to vllm-project#126. Signed-off-by: Marcin Swiniarski <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
A set of commits for the vLLM Docker. --------- Signed-off-by: PatrykWo <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
This PR adds support to hpu_model_runner for executing pooling models. --------- Signed-off-by: slokesha <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
Reverts vllm-project#120 Signed-off-by: Katarzyna Fojcik <[email protected]>
…t#137) In the merged-prefill case, requests that don't return any tokens sometimes get mixed in with the rest of the prefills - we want to remove them from sampling. --------- Signed-off-by: Agata Dobrzyniewicz <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
…t#107) Signed-off-by: taran2210 <[email protected]> Co-authored-by: Michał Kuligowski <[email protected]> Co-authored-by: Agata Dobrzyniewicz <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
I added two tests for custom op registration. Additionally, in `vllm_gaudi/ops/__init__.py`, I wrapped the imports in a function. I did this because currently, if someone imported a custom operator before ops registration, for example `from vllm_gaudi.ops.hpu_layernorm import HPURMSNorm`, all other custom ops would be registered as an unexpected side effect. With this change, only `HPURMSNorm` is registered in such a case. --------- Signed-off-by: Kacper Pietkun <[email protected]> Co-authored-by: Michał Kuligowski <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
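A minimal sketch of the lazy-registration pattern described above (the function name and the exact registration mechanism are illustrative assumptions, not the actual vllm_gaudi API):

```python
# vllm_gaudi/ops/__init__.py -- illustrative sketch only.
# Importing this package no longer registers every custom op as a side
# effect; registration happens when the function below is called explicitly.

def register_ops():
    # The imports live inside the function, so importing a single op module
    # directly (e.g. `from vllm_gaudi.ops.hpu_layernorm import HPURMSNorm`)
    # does not pull in and register all the other custom ops.
    from vllm_gaudi.ops.hpu_layernorm import HPURMSNorm  # noqa: F401
    # ... further op imports (and their registrations) would go here ...
```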
This PR adds support to hpu_model_runner to execute pooling models. Note: warmup is not yet enabled for pooling. --------- Signed-off-by: slokesha <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Vivek <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
This PR introduces warmup for merged prefill and also changes the warmup design a bit:
- separate get-cfg and get-range functions in the strategies
- strategies no longer handle bucket filtering
- the bucketing manager creates buckets from 3 ranges (bs, query, ctx) and filters out unwanted buckets based on a filtering map
--------- Signed-off-by: Agata Dobrzyniewicz <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
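Conceptually, the new bucketing flow could look like the following sketch (names and the filtering-map signature are assumptions, not the actual bucketing manager API):

```python
from itertools import product
from typing import Callable, Iterable

# Hypothetical sketch: build buckets from three ranges (batch size, query
# length, context) and filter them centrally, instead of having each
# strategy filter its own buckets.
def generate_buckets(
    bs_range: Iterable[int],
    query_range: Iterable[int],
    ctx_range: Iterable[int],
    keep: Callable[[int, int, int], bool],
) -> list[tuple[int, int, int]]:
    # Cartesian product of the three ranges, then drop unwanted combinations.
    return [(bs, q, ctx)
            for bs, q, ctx in product(bs_range, query_range, ctx_range)
            if keep(bs, q, ctx)]

# Example: keep only buckets whose total query token count stays under a budget.
buckets = generate_buckets(
    bs_range=[1, 2, 4],
    query_range=[128, 256],
    ctx_range=[0, 1, 2],
    keep=lambda bs, q, ctx: bs * q <= 512,
)
```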
Introduces a new attention backend - Unified Attention - to handle both prefills and decodes (and potentially, in the future, mixed batches).
* To enable it, run with VLLM_UNIFIED_ATTN=true
* Unified Attention implies contiguous_pa and merged_prefill by default, but either can be disabled via its respective flag (VLLM_CONTIGUOUS_PA=false or VLLM_MERGED_PREFILL=false)
--------- Signed-off-by: Michal Adamczyk <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
Warming up the sampler with different configurations removes graph recompilations of the bigger sampler graphs seen during actual execution. As tested with example workloads and batch sizes, the only recompilations left from the sampler come from minor graphs, which have minimal influence on the execution time. The warmup of the sampler takes around 1-3 seconds, depending on the buckets' batch sizes to be warmed up. Additionally, removed the case where the warmup method was called twice (seen as duplicated prints within the warmup phase but with no buckets warmed up, as these had all already been warmed up). --------- Signed-off-by: Krzysztof Smusz <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Vivek <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
This is to add data parallel support for the V1 gaudi plugin.
- [x] add DP-aware padding
- [x] use all_gather and reduce_scatter
- [x] add a data parallel example
--------- Signed-off-by: Wuxun Zhang <[email protected]> Co-authored-by: Konrad Zawora <[email protected]> Co-authored-by: Agata Dobrzyniewicz <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
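The DP-aware padding item above boils down to padding each rank's token count to a common value before collectives such as all_gather/reduce_scatter; a rough sketch (the torch.distributed calls are standard, but the function name and integration point are assumptions):

```python
import torch
import torch.distributed as dist

def dp_padded_num_tokens(num_tokens: int, dp_group=None) -> int:
    # Each DP rank may schedule a different number of tokens, but collective
    # ops need matching shapes, so every rank pads up to the group maximum.
    t = torch.tensor([num_tokens], dtype=torch.int64)
    dist.all_reduce(t, op=dist.ReduceOp.MAX, group=dp_group)
    return int(t.item())
```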
…ect#130) Signed-off-by: Konrad Zawora <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Chendi.Xue <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
Add @afierka-intel user Signed-off-by: Artur Fierka <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Kacper Pietkun <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Wuxun Zhang <[email protected]> Co-authored-by: Agata Dobrzyniewicz <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Konrad Zawora <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
Currently there are dynamo recompilations for each layer, due to the `layer_name` arg passed to the forward function:

```
(Worker pid=79578) [rank0]:W0910 15:26:29.372000 79578 torch/_dynamo/convert_frame.py:1016] [3/8] torch._dynamo hit config.recompile_limit (8)
(Worker pid=79578) [rank0]:W0910 15:26:29.372000 79578 torch/_dynamo/convert_frame.py:1016] [3/8] function: 'forward' (vllm/vllm/model_executor/models/mixtral.py:230)
(Worker pid=79578) [rank0]:W0910 15:26:29.372000 79578 torch/_dynamo/convert_frame.py:1016] [3/8] last reason: 3/7: self._modules['block_sparse_moe']._modules['experts'].layer_name == 'model.layers.7.block_sparse_moe.experts'
(Worker pid=79578) [rank0]:W0910 15:26:29.372000 79578 torch/_dynamo/convert_frame.py:1016] [3/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
(Worker pid=79578) [rank0]:W0910 15:26:29.372000 79578 torch/_dynamo/convert_frame.py:1016] [3/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
```

This causes a huge perf drop when using torch.compile instead of lazy mode (~5x worse perf) -- on the traces we can observe a lot of `transpose_mme` and `broadcast_nd` blocks between all the MME nodes (trace screenshot: https://github.com/user-attachments/assets/343ae137-20d0-447c-b687-387eefe19e41).

To avoid it, I proposed a solution similar to the one we used to have in vllm-fork ([FusedMoe.__init__()](https://github.com/HabanaAI/vllm-fork/blob/habana_main/vllm/model_executor/layers/fused_moe/layer.py#L866) and [FusedMoE.forward()](https://github.com/HabanaAI/vllm-fork/blob/habana_main/vllm/model_executor/layers/fused_moe/layer.py#L1442)) -- using the `FusedMoE.forward_impl()` function for the cases where `dp_size` is equal to 1.

--------- Signed-off-by: Karol Damaszke <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
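As a rough illustration of that workaround: the point is that the per-layer `layer_name` string never becomes an input to the traced graph when there is no data parallelism. The class below is a hypothetical stand-in, not the actual FusedMoE code; every name other than `forward_impl` is invented.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Hypothetical stand-in for FusedMoE, illustrating the dp_size == 1 dispatch."""

    def __init__(self, dp_size: int = 1):
        super().__init__()
        self.dp_size = dp_size

    def forward_impl(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The actual expert computation would live here.
        return hidden_states

    def _forward_via_custom_op(self, hidden_states: torch.Tensor,
                               layer_name: str) -> torch.Tensor:
        # Passing the per-layer `layer_name` string into the traced graph is
        # what makes dynamo specialize (and recompile) once per layer.
        return self.forward_impl(hidden_states)

    def forward(self, hidden_states: torch.Tensor,
                layer_name: str = "") -> torch.Tensor:
        if self.dp_size == 1:
            # No data parallelism: call the implementation directly, so the
            # layer_name string never becomes a graph input.
            return self.forward_impl(hidden_states)
        return self._forward_via_custom_op(hidden_states, layer_name)
```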
This PR fixes the current Schrödinger's CI pipelines - it makes failing pipelines fail (failures reported as false positives are now true negatives), and it also makes failing pipelines pass (former false positives are now true positives due to adjusted tolerances). Basically, if you break something, the CI pipeline will fail as it should, and pipelines that used to be broken are now not broken. --------- Signed-off-by: Konrad Zawora <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
For DP, dummy decode input data is created with `schedulerOutput=None`; this lets us skip preparing spec_decode inputs in that case. --------- Signed-off-by: Wuxun Zhang <[email protected]> Co-authored-by: Chendi.Xue <[email protected]> Co-authored-by: Konrad Zawora <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
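In other words, the spec-decode input preparation is simply guarded on the scheduler output; a minimal sketch with hypothetical names, not the actual hpu_model_runner code:

```python
def prepare_spec_decode_inputs(scheduler_output):
    # DP dummy decode steps call into the runner with the scheduler output
    # set to None, so there is nothing to prepare and we bail out early.
    if scheduler_output is None:
        return None
    # ... regular speculative-decoding input preparation would follow ...
```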
…llm-project#161) Signed-off-by: Chendi.Xue <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
port nixl --------- Signed-off-by: Harish Subramony <[email protected]> Signed-off-by: Chendi.Xue <[email protected]> Co-authored-by: Chendi.Xue <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Harish Subramony <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
```bash
QUANT_CONFIG=vllm-gaudi/tests/models/language/generation/inc_dynamic_quant.json VLLM_HPU_FORCE_CHANNEL_FP8=false \
HABANA_VISIBLE_DEVICES=all VLLM_CONTIGUOUS_PA=False VLLM_SKIP_WARMUP=true PT_HPU_LAZY_MODE=1 VLLM_USE_V1=1 \
VLLM_SKIP_WARMUP=true VLLM_CONTIGUOUS_PA=False PT_HPU_LAZY_MODE=1 \
lm_eval --model vllm --tasks gsm8k --num_fewshot 5 --batch_size 128 \
--model_args "pretrained=/mnt/disk8/Qwen/Qwen3-8B-FP8,tensor_parallel_size=1,trust_remote_code=true,max_model_len=4096,dtype=bfloat16"
```

```bash
vllm (pretrained=/mnt/disk8/Qwen/Qwen3-8B-FP8,tensor_parallel_size=1,trust_remote_code=true,max_model_len=4096,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 128
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8817|±  |0.0089|
|     |       |strict-match    |     5|exact_match|↑  |0.8749|±  |0.0091|
```

--------- Signed-off-by: yiliu30 <[email protected]> Co-authored-by: Chendi.Xue <[email protected]> Signed-off-by: Katarzyna Fojcik <[email protected]>
Branch force-pushed from 51c38ed to 6606502
Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
/run-gaudi-tests
Signed-off-by: Katarzyna Fojcik <[email protected]>
/run-gaudi-tests
Signed-off-by: Katarzyna Fojcik <[email protected]>
/run-gaudi-tests
Signed-off-by: Katarzyna Fojcik <[email protected]>
/run-gaudi-tests
"""Test that HPU processor is initialized with correct kwargs.""" | ||
mock_tokenizer = cast(AnyTokenizer, object()) | ||
|
||
ctx = InputProcessingContext( |
ModelConfig reads take 10s+ of the time when running all the tests. I've profiled this and it seems to come from `_run_in_subprocess`, and it is gone once I remove it. I suggest monkey-patching `_run_in_subprocess` out of
vllm/vllm/model_executor/models/registry.py
so that
return _run_in_subprocess(lambda: _ModelInfo.from_model_cls(self.load_model_cls()))
changes to
return _ModelInfo.from_model_cls(self.load_model_cls())
This reduced the test time of test_hpu_multimodal_processing.py::test_hf_processor_init_kwargs from 12s to 2s.
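For reference, the suggested change could also be applied from the test side with pytest's monkeypatch instead of editing the file; a sketch, assuming `_run_in_subprocess` takes a single zero-argument callable as the snippet above suggests (the fixture name is mine):

```python
import pytest

@pytest.fixture
def no_subprocess_model_inspect(monkeypatch):
    # Skip the subprocess round-trip that the profiling above attributes most
    # of the ModelConfig read time to: run the inspection callable in-process.
    from vllm.model_executor.models import registry
    monkeypatch.setattr(registry, "_run_in_subprocess", lambda fn: fn())
```

Any test that constructs a ModelConfig would then simply request this fixture to get the faster in-process path.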
Unit tests for vLLM multimodal input and processing.
Inspired by upstream test_inputs.py and test_processing.py