# Updated seq_len for prefill_only issue #680
Open: asmigosw wants to merge 16 commits into `quic:main` from `asmigosw:prefill_issue` (base: `main`)
+11,346 −666
Conversation
# Support for Diffusers Architecture in Efficient Transformers

## Overview
This pull request introduces **Diffusers architecture support** to the **Efficient Transformers** framework, enabling seamless integration of diffusion models.

## Key Highlights
1. **Support for the model [black-forest-labs/FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell)**
2. **Flexible Configuration** - Supports JSON-based configuration files for easy compilation and execution.
3. **Performance Benchmarking** - Implements a performance matrix for Diffusers models to enable benchmarking of each module.
4. **Testing Framework** - Includes initial test scripts for Diffusers (in progress).
5. **Support for ONNX subfunction graphs via the `use_onnx_function` flag**
6. **Support for parallel compilation of modules via the `parallel_compile` flag**

---------

Signed-off-by: Amit Raj <[email protected]>
Signed-off-by: Amit Raj <[email protected]>
Signed-off-by: tv-karthikeya <[email protected]>
Signed-off-by: vtirumal <[email protected]>
Co-authored-by: tv-karthikeya <[email protected]>
Co-authored-by: Amit Raj <[email protected]>
Co-authored-by: Karthikeya <[email protected]>
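As a rough illustration of how the highlights above might fit together, here is a minimal sketch. The pipeline class name `QEFFAutoPipelineForText2Image` and the placement of the `use_onnx_function` and `parallel_compile` keyword arguments are assumptions for illustration only; only the model id and the two flag names come from the commit message.

```python
# Hypothetical sketch -- the class name and argument placement are assumptions,
# not the confirmed Efficient Transformers Diffusers API.
from QEfficient import QEFFAutoPipelineForText2Image  # name is an assumption

pipeline = QEFFAutoPipelineForText2Image.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
)

# Compile each module, optionally exporting ONNX subfunction graphs and
# compiling modules in parallel (flag names taken from the highlights above).
pipeline.compile(
    use_onnx_function=True,
    parallel_compile=True,
)

image = pipeline("A photo of a cat wearing a spacesuit").images[0]
```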
Signed-off-by: abhishek-singh591 <[email protected]>
Signed-off-by: Abukhoyer Shaik <[email protected]>
# We should be using disaggregated serving for the GPT-OSS model for best performance
- GPT-OSS has a total_experts/experts_per_tok ratio of 128/4 for the 120B model and 32/4 for the 20B model
- The prefill-only model uses a "read all experts exactly once" strategy
- The decode-only model treats weights as activations, meaning only the chosen experts are read

# Prefill-only model

## Blocking (default behaviour when `prefill_only=True` in the compile API)
- `NUM_Q_BLOCKS=<int>` sets the number of Q blocks in attention
- `NUM_FFN_BLOCKS=<int>` sets the number of blocks in the FFN
- `ENABLE_OPT_SWA=0` or `1` disables/enables optimized SWA; when enabled, only the valid KVs for a given block are used in attention, reducing MACs
- prefix_caching is not supported in this mode

## Chunking (pass `enable_chunking=True` and `prefill_only=True` in the compile API)
- Optimized SWA, i.e. reading only the valid KV per the diagonal attention mask, is enabled by default in this version
- This model can be used for prefix_caching by passing `kv_cache_batch_size=<int>` in the compile API

# Decode-only model

## Retain sliding-window-length KV for sliding-window layers (default behaviour when `prefill_seq_len=1` in the compile API)
- This reduces the amount of DDR used by the model
- CB is enabled for this version: pass `continuous_batching=True` in the `from_pretrained` call, strictly pass `full_batch_size=<int>`, and optionally `kv_cache_batch_size=<int>` if needed

## Full KV for sliding-window layers (pass `retain_full_kv=True` along with `prefill_seq_len=1` in the compile API)
- This uses more DDR because ctx_len KV is retained even for sliding-window layers, but attention still reads only sliding-window-length KV
- CB is enabled for this version: pass `continuous_batching=True` in the `from_pretrained` call, strictly pass `full_batch_size=<int>`, and optionally `kv_cache_batch_size=<int>` if needed
- This is intended for the multi-turn chat use case, where we run prefill -> decode and then use the combined prefill and decode cache to run prefill again, so we want to retain the full KV for sliding-window layers

NOTE:
* The decode-only model currently fails compilation with `use_onnx_subfunctions=True`, so avoid using it
* The 120B model needs an NPI; two versions of the NPI, with and without subfunctions, are uploaded here, and the file is passed as `node_precision_info=<path to file>`
* It is advised to use `use_onnx_subfunctions=True` with the prefill-only model, otherwise compilation times are too high. With this flag the model is expected to export and then fail during compile because it needs the assert SDK, so the user should run the compilation manually by pasting the command printed in the error

---------

Signed-off-by: vbaddi <[email protected]>
Signed-off-by: Onkar Chougule <[email protected]>
Signed-off-by: Mamta Singh <[email protected]>
Signed-off-by: Onkar Chougule <[email protected]>
Co-authored-by: Vinayak Baddi <[email protected]>
Co-authored-by: Vinayak Baddi <[email protected]>
Co-authored-by: Mamta Singh <[email protected]>
Co-authored-by: Mamta Singh <[email protected]>
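To make the two modes above concrete, here is a minimal sketch of how the compile calls might look. It assumes the `QEFFAutoModelForCausalLM` entry point and a public GPT-OSS checkpoint id; the flag names (`prefill_only`, `enable_chunking`, `kv_cache_batch_size`, `use_onnx_subfunctions`, `prefill_seq_len`, `retain_full_kv`, `continuous_batching`, `full_batch_size`) are taken from the description above, while the remaining values are illustrative assumptions, not recommended settings.

```python
from QEfficient import QEFFAutoModelForCausalLM

# Prefill-only model with chunking (flags per the description above).
prefill_model = QEFFAutoModelForCausalLM.from_pretrained("openai/gpt-oss-120b")
prefill_model.compile(
    prefill_only=True,
    enable_chunking=True,        # optimized SWA is on by default in this mode
    kv_cache_batch_size=4,       # enables prefix caching with chunking
    use_onnx_subfunctions=True,  # advised for prefill-only; expect to re-run the printed compile command manually
)

# Decode-only model with continuous batching and full KV retained for
# sliding-window layers (multi-turn chat use case).
decode_model = QEFFAutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-120b",
    continuous_batching=True,
)
decode_model.compile(
    prefill_seq_len=1,
    retain_full_kv=True,
    full_batch_size=16,
    kv_cache_batch_size=16,
)
```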
Update tests of onnx_subfunction to compare the hash of the .onnx file when the `use_onnx_subfunction` flag is toggled

---------

Signed-off-by: Amit Raj <[email protected]>
Co-authored-by: Amit Raj <[email protected]>
**Overview**
On-device sampling can significantly reduce host overhead and improve
inference throughput; however, so far it has only been implemented for
`QEffForCausalLM` models. This PR extends on-device sampling support to
the language decoder of dual QPC vision language models,
`QEffCausalLMForTextImageToTextModel`. In addition, it fixes a bug in the
Gumbel noise so that it correctly simulates a multinomial distribution
for random sampling.
**Implementation details**
```python
class _QEffAutoModelForImageTextToTextDualQPC:
def __init__(
self,
model: nn.Module,
continuous_batching: bool = False,
qaic_config: Optional[dict] = None,
**kwargs,
):
# Omitting unchanged parts
self.lang_model = QEffCausalLMForTextImageToTextModel(model, qaic_config=qaic_config, **kwargs)
# ---Sampling---
# Note: SamplerTransform should be applied after all other transforms
# are done. The role of the sampler is to just add nodes at the output of the
# previous transform function.
self.lang_model.model, _ = SamplerTransform.apply(self.lang_model.model, qaic_config, **kwargs)
```
**Usage**
Usage is similar to enabling on-device sampling for `QEffForCausalLM`.
```python
from QEfficient import QEFFAutoModelForImageTextToText
model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
qeff_model = QEFFAutoModelForImageTextToText.from_pretrained(
model_id,
attn_implementation="eager",
kv_offload=True,
continuous_batching=True,
qaic_config={
"include_sampler": True,
"return_pdfs": False,
"max_top_k_ids": 512,
},
)
```
---------
Signed-off-by: quic-xiyushi <[email protected]>
Signed-off-by: quic-sanising <[email protected]>
Signed-off-by: sanising <[email protected]>
Signed-off-by: Mamta Singh <[email protected]>
Co-authored-by: sanising <[email protected]>
Co-authored-by: Mamta Singh <[email protected]>
…of hash comparison (quic#670)

## Summary
Refactored the subfunction unit test to directly verify ONNX subfunction usage by inspecting the exported model structure, replacing the previous hash-based validation approach.

## Changes
- Removed hash-based checks (`export_hash` and file hash comparisons)
- Added ONNX model inspection utilities:
  - `has_gpt2block_function()`: checks for QEffGPT2Block function definitions
- Added explicit assertions to verify:
  - QEffGPT2Block function is defined when `use_onnx_subfunctions=True`
  - QEffGPT2Block function is NOT defined when `use_onnx_subfunctions=False`
  - QEffGPT2Block calls exist in graph nodes when subfunctions are enabled
  - No QEffGPT2Block calls when subfunctions are disabled
- Maintained functional equivalence testing (generation output comparison)

Signed-off-by: Vinayak Baddi <[email protected]>
Co-authored-by: vbaddi <[email protected]>
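As an illustration of the kind of inspection utility described above, here is a minimal sketch of what a `has_gpt2block_function()`-style check could look like. It uses only the standard `onnx` Python API; the actual helpers in the test suite may differ in name and detail.

```python
import onnx


def has_gpt2block_function(onnx_path: str, function_name: str = "QEffGPT2Block") -> bool:
    """Return True if the exported model defines a local function with the given name."""
    model = onnx.load(onnx_path, load_external_data=False)
    return any(func.name == function_name for func in model.functions)


def has_gpt2block_calls(onnx_path: str, function_name: str = "QEffGPT2Block") -> bool:
    """Return True if any node in the main graph calls the given local function."""
    model = onnx.load(onnx_path, load_external_data=False)
    return any(node.op_type == function_name for node in model.graph.node)
```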
quic#661) Installing PyTorch 2.9 for the FT CI test

---------

Signed-off-by: Dhiraj Kumar Sah <[email protected]>
## ✨ Add Support for Guided Decoding to On Device Sampling

### 📌 Overview
This PR introduces **guided decoding** capabilities in On Device Sampling for `QEffForCausalLM` and `QEffCausalLMForTextImageToTextModel` models.

### 🚀 Motivation
As outlined in [this blog on structured decoding](https://blog.vllm.ai/2025/01/14/struct-decode-intro.html), structured decoding represents a fundamental shift in controlling LLM outputs. Instead of relying on post-processing, constraints are enforced during token generation via **logits manipulation**. This approach ensures:
* **Format compliance** at generation time.
* Reduced error rates for structured outputs.
* Performance improvements through optimized backends like **XGrammar**, which can deliver up to **5× faster token generation under load**.

The constraints are provided through `token_bitmasks`, a Boolean matrix of shape `(batch_size, vocab_size)` in which each element indicates whether a token should be kept (1) or masked (0). During sampling, this mask is applied to the logits before token selection, ensuring that only allowed tokens are considered. By performing this operation directly on the device, we eliminate host-device transfers, reduce latency, and improve throughput for structured decoding workloads.

### 🛠️ Implementation Details
The guided decoding logic is injected via `include_guided_decoding=True` during model loading. No changes to the model architecture are required.

```python
from QEfficient import QEFFAutoModelForCausalLM as AutoModelForCausalLM

# Load model with On Device Sampler enabled
qeff_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    continuous_batching=True,
    qaic_config={
        "include_sampler": True,
        "return_pdfs": False,
        "max_top_k_ids": 512,
        "include_guided_decoding": True,
    },
)

# Compile as usual
qeff_model.compile(
    prefill_seq_length=128,
    ctx_len=256,
    full_batch_size=16,
    num_devices=4,
    num_speculative_tokens=0,
    mxint8_kv_cache=True,
    mxfp6_matmul=True,
)
```

To disable guided decoding, simply set `include_guided_decoding=False`.

---------

Signed-off-by: quic-xiyushi <[email protected]>
Signed-off-by: quic-sanising <[email protected]>
Signed-off-by: sanising <[email protected]>
Signed-off-by: Mamta Singh <[email protected]>
Co-authored-by: quic-xiyushi <[email protected]>
Co-authored-by: sanising <[email protected]>
Co-authored-by: Mamta Singh <[email protected]>
Co-authored-by: Hem Agnihotri <[email protected]>
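To illustrate the `token_bitmasks` format described above, here is a minimal sketch that builds a mask allowing only a small set of token ids. The NumPy construction is just an example; how the mask is handed to the runtime (the exact argument name on the generate call) is not shown here and would follow the sampler's I/O spec.

```python
import numpy as np

batch_size, vocab_size = 2, 128256      # Llama-3.1 vocabulary size
allowed_token_ids = [128000, 9906, 1917]  # example ids permitted by a grammar/regex constraint

# Boolean matrix of shape (batch_size, vocab_size): True keeps a token, False masks it out.
token_bitmasks = np.zeros((batch_size, vocab_size), dtype=bool)
token_bitmasks[:, allowed_token_ids] = True

# Conceptually, during sampling the mask is applied to the logits before token
# selection, e.g. disallowed positions are driven to -inf on device.
```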
Added memory profiling tool (scripts/memory_profiling) that tracks memory, CPU, and disk I/O usage across QEfficient workflow stages. The profiler supports manual operation marking, child process tracking for accurate compilation metrics, and generates 4-panel visualizations with detailed performance reports to help identify bottlenecks and optimize resource usage. Signed-off-by: Rishin Raj <[email protected]>
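The snippet below is a rough sketch of the kind of per-stage sampling such a profiler performs, using only `psutil`; it is not the actual `scripts/memory_profiling` implementation, and the function and field names are illustrative.

```python
import time

import psutil


def sample_stage(stage_name: str) -> dict:
    """Snapshot memory, CPU, and disk I/O for the current process and its children."""
    proc = psutil.Process()
    children = proc.children(recursive=True)  # include compiler child processes
    rss_mb = (proc.memory_info().rss + sum(c.memory_info().rss for c in children)) / 1e6
    io = psutil.disk_io_counters()
    return {
        "stage": stage_name,
        "timestamp": time.time(),
        "rss_mb": rss_mb,
        "cpu_percent": psutil.cpu_percent(interval=None),
        "disk_read_mb": io.read_bytes / 1e6,
        "disk_write_mb": io.write_bytes / 1e6,
    }
```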
…ence after model export. (quic#678)

Signed-off-by: Dhiraj Kumar Sah <[email protected]>
…es not provide lists (quic#663)

To use the CCL feature during prefilling and decoding, the user needs to pass two lists of context lengths to be used during these processes. Here, we add an option to generate these lists automatically when the user does not provide them manually. This list generation is suitable for a general-purpose application that considers both prefilling and decoding and generates CCL lists for both.

---------

Signed-off-by: Rishin Raj <[email protected]>
Signed-off-by: Vahid Janfaza <[email protected]>
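As an illustration only: one plausible way to auto-generate such context-length lists is to bucket lengths geometrically between the prefill length and the full context length. The function below is a hypothetical sketch of that idea, not the strategy actually implemented in quic#663.

```python
def default_ccl_lists(prefill_seq_len: int, ctx_len: int):
    """Hypothetical helper: build context-length buckets by doubling up to ctx_len."""
    lengths = []
    current = max(prefill_seq_len, 1)
    while current < ctx_len:
        lengths.append(current)
        current *= 2
    lengths.append(ctx_len)
    # Reuse the same buckets for prefill and decode; a real implementation may differ.
    return lengths, lengths


prefill_ccl, decode_ccl = default_ccl_lists(prefill_seq_len=128, ctx_len=4096)
# prefill_ccl == decode_ccl == [128, 256, 512, 1024, 2048, 4096]
```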
Model: lightx2v/Wan2.2-Lightning

Support Wan Unified Transformer on QAIC

---------

Signed-off-by: vtirumal <[email protected]>
Signed-off-by: Karthikeya <[email protected]>
Added blocking support to flux

---------

Signed-off-by: Amit Raj <[email protected]>
Co-authored-by: Amit Raj <[email protected]>
Signed-off-by: Asmita Goswami <[email protected]>
**quic-rishinr** approved these changes on Dec 22, 2025
```diff
  )
- kv_cache_shape[2] = seq_len + self.model.config.sliding_window if enable_chunking else seq_len
+ kv_cache_shape[2] = (
+     seq_len + (self.model.config.sliding_window if hasattr(self.model.config, "sliding_window") else 0)
```
**Contributor:** This could be chunked attention as well, right? Can we handle it in a better way?
```diff
  if kwargs.get("retain_full_kv", False):
-     kv_cache_shape[2] = seq_len + self.model.config.sliding_window
      self.hash_params["retain_full_kv"] = True
+     kv_cache_shape[2] = seq_len + (
```
**Contributor:** Same as above.
Signed-off-by: Asmita Goswami <[email protected]>
JIRA: https://jira-dc.qualcomm.com/jira/browse/QRANIUMSW-59121
Previously, seq_len was only adjusted in prefill_only mode for models that have sliding_window in their config. For models that don't have a sliding window, we need to provide only the original seq_len, even when chunking is enabled.
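A minimal sketch of the guarded calculation described above, matching the diff excerpts in the review comments. The function name `compute_kv_seq_len` and the stand-in config class are hypothetical; the variable names (`seq_len`, `enable_chunking`, `sliding_window`) come from those excerpts, and the final expression in the merged code may differ slightly.

```python
def compute_kv_seq_len(config, seq_len: int, enable_chunking: bool) -> int:
    """Add the sliding-window length only when chunking is enabled AND the model
    config actually defines sliding_window, so models without a sliding window
    keep the original seq_len."""
    sliding_window = getattr(config, "sliding_window", None) or 0
    return seq_len + (sliding_window if enable_chunking else 0)


# Example: a config without sliding_window keeps seq_len unchanged even with chunking.
class _Cfg:  # hypothetical stand-in for a transformers config object
    pass


assert compute_kv_seq_len(_Cfg(), seq_len=2048, enable_chunking=True) == 2048
```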