Added support of subfunction for VLMs #653

abhishek-singh591 · 2025-12-05T07:43:57Z

Currently it's only for Qwen2.5VL.

Signed-off-by: abhishek-singh591 <[email protected]>

# Support for Diffusers Architecture in Efficient Transformers

## Overview

This pull request introduces **Diffusers architecture support** to the **Efficient Transformers** framework, enabling seamless integration of diffusion models.

## Key Highlights

1. **Support for the model [black-forest-labs/FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell)**
2. **Flexible Configuration**
   - Supports JSON-based configuration files for easy compilation and execution.
3. **Performance Benchmarking**
   - Implements a performance matrix for Diffusers models to enable benchmarking of each module.
4. **Testing Framework**
   - Includes initial test scripts for Diffusers (in progress).
5. **Support for ONNX subfunction graphs using the flag `use_onnx_function`**
6. **Support for parallel compilation of modules using the flag `parallel_compile`**

---------

Signed-off-by: Amit Raj <[email protected]>
Signed-off-by: Amit Raj <[email protected]>
Signed-off-by: tv-karthikeya <[email protected]>
Signed-off-by: vtirumal <[email protected]>
Co-authored-by: tv-karthikeya <[email protected]>
Co-authored-by: Amit Raj <[email protected]>
Co-authored-by: Karthikeya <[email protected]>

Signed-off-by: abhishek-singh591 <[email protected]>

Signed-off-by: Abukhoyer Shaik <[email protected]>

# We should be using disaggregated serving for the GPT-OSS model for best performance

- The GPT-OSS model has a total_experts/experts_per_tok ratio of 128/4 (for 120b) or 32/4.
- The prefill-only model always uses a "read all experts only once" strategy.
- The decode-only model treats weights as activations, meaning it reads only the chosen experts.

# Prefill-only model

## Blocking: default behaviour when `prefill_only=True` in compile API

- `NUM_Q_BLOCKS=<int>` sets the number of Q blocks in attention.
- `NUM_FFN_BLOCKS=<int>` sets the number of blocks in FFN.
- `ENABLE_OPT_SWA=0` or `1` to enable/disable optimized SWA. When enabled, only the valid KVs for a given block are used in attention, reducing MACs.
- prefix_caching is not supported in this mode.

## Chunking: pass `enable_chunking=True` and `prefill_only=True` in compile API

- Optimized SWA, i.e. reading only the valid KV as per the diagonal attention mask, is enabled by default for this version.
- This model can be used for prefix_caching by passing `kv_cache_batch_size=<int>` in compile API.

# Decode-only model

## Retain sliding window length of KV for sliding window layers: default behaviour when `prefill_seq_len=1` in compile API

- This reduces the amount of DDR used by the model.
- CB is enabled for this version: pass `continuous_batching=True` in the `from_pretrained` call, strictly pass `full_batch_size=<int>`, and optionally pass `kv_cache_batch_size=<int>` if needed.

## Full KV for sliding window layers: pass `retain_full_kv=True` along with `prefill_seq_len=1` in compile API

- This uses more DDR, as ctx_len KV is retained even for sliding window layers, but only sliding window length KV is read in attention.
- CB is enabled for this version: pass `continuous_batching=True` in the `from_pretrained` call, strictly pass `full_batch_size=<int>`, and optionally pass `kv_cache_batch_size=<int>` if needed.
- This is enabled for the multi-turn chat use case, where we run prefill -> decode and then use the combined cache of prefill and decode to run prefill again, so we want to retain full KV for sliding window layers.

NOTE:

* The decode-only model currently fails compilation with `use_onnx_subfunctions=True`, so avoid using it.
* The 120B model needs NPI. There are two versions of NPI, one with and one without subfunction; both are uploaded here. Pass it as `node_precision_info=<path to file>`.
* It is advised to use `use_onnx_subfunctions=True` with the prefill-only model, otherwise the compilation times are too high. With this flag the model is expected to export and then fail during compile, as it needs assert sdk, so the user is expected to run the compilation manually by pasting the command printed in the error.

---------

Signed-off-by: vbaddi <[email protected]>
Signed-off-by: Onkar Chougule <[email protected]>
Signed-off-by: Mamta Singh <[email protected]>
Signed-off-by: Onkar Chougule <[email protected]>
Co-authored-by: Vinayak Baddi <[email protected]>
Co-authored-by: Vinayak Baddi <[email protected]>
Co-authored-by: Mamta Singh <[email protected]>
Co-authored-by: Mamta Singh <[email protected]>
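As a rough summary of the four variants described above, the compile-time keyword arguments can be collected per mode. This is only a sketch: `COMPILE_MODES` and `compile_kwargs` are hypothetical helper names and the mode labels are invented for illustration; only the keyword arguments themselves (`prefill_only`, `enable_chunking`, `prefill_seq_len`, `retain_full_kv`, `kv_cache_batch_size`) come from the description.

```python
# Hypothetical helper: the mode labels are illustrative and not part of the
# library. Only the keyword arguments are taken from the description above.
COMPILE_MODES = {
    "prefill_blocking": {"prefill_only": True},
    "prefill_chunking": {"prefill_only": True, "enable_chunking": True},
    "decode_sliding_kv": {"prefill_seq_len": 1},
    "decode_full_kv": {"prefill_seq_len": 1, "retain_full_kv": True},
}


def compile_kwargs(mode, **extra):
    """Return the compile keyword arguments for one serving mode.

    `extra` carries optional values such as kv_cache_batch_size; they are
    merged on top of the defaults for the chosen mode.
    """
    if mode not in COMPILE_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return {**COMPILE_MODES[mode], **extra}


# Chunked prefill with a KV cache batch size, as described for prefix caching.
chunked = compile_kwargs("prefill_chunking", kv_cache_batch_size=4)
# Decode-only model that retains the full KV for sliding window layers.
decode = compile_kwargs("decode_full_kv")
```

The resulting dictionary would then be unpacked into the compile call, for example `model.compile(**chunked)`, assuming `model` was loaded elsewhere with `from_pretrained`.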

Update tests of onnx_subfunction to compare the hash of the .onnx file when the `use_onnx_subfunction` flag is toggled

---------

Signed-off-by: Amit Raj <[email protected]>
Co-authored-by: Amit Raj <[email protected]>
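A minimal sketch of what such a file-hash comparison could look like. The commit does not say which hash function is used, so SHA-256 is an assumption here, `file_sha256` is a hypothetical helper name, and the two temporary files merely stand in for the exported `.onnx` files.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large exports need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-in files: real tests would hash the two exported .onnx files instead.
with tempfile.TemporaryDirectory() as tmp:
    with_flag = Path(tmp) / "with_subfunctions.onnx"
    without_flag = Path(tmp) / "without_subfunctions.onnx"
    with_flag.write_bytes(b"graph with function definitions")
    without_flag.write_bytes(b"flat graph")
    hash_with = file_sha256(with_flag)
    hash_without = file_sha256(without_flag)
    # Different bytes on disk give different digests.
    hashes_differ = hash_with != hash_without
```

A differing digest only shows that the two files are not byte-identical; it says nothing about how their structure differs.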

**Overview**

On-device sampling can significantly reduce host overhead and improve inference throughput; however, so far it has only been implemented for `QEffForCausalLM` models. This PR extends on-device sampling support to the language decoder of dual QPC vision language models, `QEffCausalLMForTextImageToTextModel`. In addition, it fixes the bug in gumbel noise so that it correctly simulates a multinomial distribution for random sampling.

**Implementation details**

```
class _QEffAutoModelForImageTextToTextDualQPC:
    def __init__(
        self,
        model: nn.Module,
        continuous_batching: bool = False,
        qaic_config: Optional[dict] = None,
        **kwargs,
    ):
        # Omitting unchanged parts
        self.lang_model = QEffCausalLMForTextImageToTextModel(model, qaic_config=qaic_config, **kwargs)
        # ---Sampling---
        # Note: SamplerTransform should be applied after all other transforms
        # are done. The role of the sampler is to just add nodes at the output of the
        # previous transform function.
        self.lang_model.model, _ = SamplerTransform.apply(self.lang_model.model, qaic_config, **kwargs)
```

**Usage**

The usage is similar to enabling on-device sampling for `QEffForCausalLM`.

```
from QEfficient import QEFFAutoModelForImageTextToText

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
qeff_model = QEFFAutoModelForImageTextToText.from_pretrained(
    model_id,
    attn_implementation="eager",
    kv_offload=True,
    continuous_batching=True,
    qaic_config={
        "include_sampler": True,
        "return_pdfs": False,
        "max_top_k_ids": 512,
    },
)
```

---------

Signed-off-by: quic-xiyushi <[email protected]>
Signed-off-by: quic-sanising <[email protected]>
Signed-off-by: sanising <[email protected]>
Signed-off-by: Mamta Singh <[email protected]>
Co-authored-by: sanising <[email protected]>
Co-authored-by: Mamta Singh <[email protected]>
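For background on the gumbel noise remark, the generic Gumbel-max trick draws an index with probability `softmax(logits)` by adding independent Gumbel(0, 1) noise to each logit and taking the argmax. The sketch below is plain Python for illustration and is not the sampler code from this PR; all function names in it are hypothetical.

```python
import math
import random
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_max_sample(logits, rng):
    """Draw one index distributed as softmax(logits) via Gumbel-max."""
    scores = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        scores.append(logit + gumbel_noise)
    return max(range(len(logits)), key=scores.__getitem__)


def empirical_frequencies(logits, draws, seed=0):
    """Estimate how often each index is sampled over many draws."""
    rng = random.Random(seed)
    counts = Counter(gumbel_max_sample(logits, rng) for _ in range(draws))
    return [counts[i] / draws for i in range(len(logits))]
```

With enough draws, `empirical_frequencies(logits, draws)` should approach `softmax(logits)`, which is the sense in which the noise simulates a multinomial distribution.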

…of hash comparison (quic#670)

## Summary

Refactored the subfunction unit test to directly verify ONNX subfunction usage by inspecting the exported model structure, replacing the previous hash-based validation approach.

## Changes

- Removed hash-based checks (`export_hash` and file hash comparisons)
- Added ONNX model inspection utilities:
  - `has_gpt2block_function()`: Checks for QEffGPT2Block function definitions
- Added explicit assertions to verify:
  - QEffGPT2Block function is defined when `use_onnx_subfunctions=True`
  - QEffGPT2Block function is NOT defined when `use_onnx_subfunctions=False`
  - QEffGPT2Block calls exist in graph nodes when subfunctions are enabled
  - No QEffGPT2Block calls when subfunctions are disabled
- Maintained functional equivalence testing (generation output comparison)

Signed-off-by: Vinayak Baddi <[email protected]>
Co-authored-by: vbaddi <[email protected]>
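The structural checks listed above could be sketched roughly as follows. An ONNX `ModelProto` exposes its function definitions as `model.functions` (each with a `name`) and its graph nodes as `model.graph.node` (each with an `op_type`). The helper names `has_function_named`, `count_calls` and `fake_model` are hypothetical and are not the test's actual utilities; the stand-in objects only mimic those attribute names so the sketch runs without an exported model.

```python
from types import SimpleNamespace


def has_function_named(model, fragment):
    """True if any function definition in the model has `fragment` in its name."""
    return any(fragment in fn.name for fn in model.functions)


def count_calls(model, fragment):
    """Count top-level graph nodes whose op_type contains `fragment`."""
    return sum(1 for node in model.graph.node if fragment in node.op_type)


def fake_model(function_names, op_types):
    """Stand-in exposing the same attribute names as an ONNX ModelProto."""
    return SimpleNamespace(
        functions=[SimpleNamespace(name=name) for name in function_names],
        graph=SimpleNamespace(node=[SimpleNamespace(op_type=op) for op in op_types]),
    )


# Illustrative models: one exported with subfunctions, one flat.
with_subfunctions = fake_model(
    ["QEffGPT2Block"], ["Gather", "QEffGPT2Block", "QEffGPT2Block", "MatMul"]
)
without_subfunctions = fake_model([], ["Gather", "MatMul", "Add"])
```

On a real export the same helpers would be applied to the object returned by `onnx.load(path)`.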

Signed-off-by: abhishek-singh591 <[email protected]>

quic-rishinr · 2025-12-18T10:19:41Z

QEfficient/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py

-    cos = torch.cat([m[i % 3] for i, m in enumerate(cos.split(mrope_section, dim=-1))], dim=-1).unsqueeze(unsqueeze_dim)
-    sin = torch.cat([m[i % 3] for i, m in enumerate(sin.split(mrope_section, dim=-1))], dim=-1).unsqueeze(unsqueeze_dim)
-
+    cos = torch.cat([cos[0, ..., 0:32], cos[0, ..., 32:80], cos[0, ..., 80:128]], dim=-1).unsqueeze(0)


Is this part of the subfunction change?

quic-rishinr · 2025-12-18T10:19:49Z

QEfficient/utils/torch_patches.py

+            # onnx_attrs = {}
+            try:
+                _C._jit_pass_onnx_track_scope_attributes(graph, onnx_attrs)
+            except Exception as e:


Use the QEff logger.
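A sketch of what the requested change might look like, using the standard `logging` module as a stand-in for the project logger. The logger name, the wrapper function `track_scope_attributes` and the message text are assumptions made for illustration; the real code calls `_C._jit_pass_onnx_track_scope_attributes(graph, onnx_attrs)` directly.

```python
import logging

# Stand-in for the project logger; the real code would import the QEff logger.
logger = logging.getLogger("qeff_example")


def track_scope_attributes(pass_fn, graph, onnx_attrs):
    """Run the attribute-tracking pass and log, rather than print, any failure.

    `pass_fn` stands in for _C._jit_pass_onnx_track_scope_attributes.
    Returns True when the pass succeeded and False when it raised.
    """
    try:
        pass_fn(graph, onnx_attrs)
        return True
    except Exception as exc:
        logger.warning("Failed to track ONNX scope attributes: %s", exc)
        return False


def failing_pass(graph, onnx_attrs):
    """Illustrative pass that always fails."""
    raise RuntimeError("unsupported graph")


def recording_pass(graph, onnx_attrs):
    """Illustrative pass that records one attribute."""
    onnx_attrs["visited"] = True


attrs = {}
succeeded = track_scope_attributes(recording_pass, None, attrs)
failed = track_scope_attributes(failing_pass, None, {})
```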

Signed-off-by: abhishek-singh591 <[email protected]>

quic-amitraj · 2025-12-23T09:29:24Z

Please resolve all conflicts.

abhishek-singh591 added 2 commits December 5, 2025 07:41

Added support of subfunction to Qwen2.5VL

abd9648

Signed-off-by: abhishek-singh591 <[email protected]>

Added support of subfunction to Qwen2.5VL

8f78722

Signed-off-by: abhishek-singh591 <[email protected]>

abhishek-singh591 requested review from ochougul, quic-amitraj, quic-hemagnih and quic-rishinr as code owners December 5, 2025 07:43

abhishek-singh591 added 2 commits December 5, 2025 07:54

Resolved lint and format error

2871558

Signed-off-by: abhishek-singh591 <[email protected]>

Made minor fixes

7e1327c

Signed-off-by: abhishek-singh591 <[email protected]>

abhishek-singh591 marked this pull request as draft December 5, 2025 08:00

abhishek-singh591 and others added 4 commits December 5, 2025 15:06

Merge branch 'quic:main' into subfunction_for_vlm

c606d86

Merge branch 'quic:main' into subfunction_for_vlm

4bcf41a

Subfunction fixes for KV cache transform (quic#655)

5da8325

Signed-off-by: abhishek-singh591 <[email protected]>

abhishek-singh591 changed the title ~~Added support of subfunction to Qwen 2.5VL~~ Added support of subfunction for VLMs Dec 11, 2025

abhishek-singh591 and others added 8 commits December 11, 2025 06:52

Rebased

a4c790b

Signed-off-by: abhishek-singh591 <[email protected]>

[Test]: subfunction test moved to qaic Test Stage (quic#665)

1b2fabe

Signed-off-by: Abukhoyer Shaik <[email protected]>

Updated tests of onnx_subfunction (quic#668)

f9d73b1

Merge branch 'quic:main' into subfunction_for_vlm

6a34942

Added sample changes for scaling subfunction

353d03b

Signed-off-by: abhishek-singh591 <[email protected]>

abhishek-singh591 force-pushed the subfunction_for_vlm branch from 8961c32 to 353d03b Compare December 18, 2025 05:06

vbaddi marked this pull request as ready for review December 18, 2025 09:53

vbaddi added the ready for review label Dec 18, 2025

quic-rishinr requested changes Dec 18, 2025

Added a function in all the modeling files to return repetitive blocks

c6b123d

Signed-off-by: abhishek-singh591 <[email protected]>

quic-amitraj force-pushed the main branch from 18de278 to d8182b8 Compare December 22, 2025 08:26

7 participants
