Conversation

@abukhoy (Contributor) commented Dec 9, 2025

The following models have been successfully tested without using the model cache:

  • TinyLlama/TinyLlama-1.1B-Chat-v1.0
  • gpt2
  • Salesforce/codegen-350M-mono
  • microsoft/Phi-3-mini-4k-instruct
  • tiiuae/falcon-7b
  • Qwen/Qwen2-0.5B
  • Qwen/Qwen3-0.6B
  • bigcode/starcoder2-3b
  • Felladrin/Minueza-32M-Base
  • wtang06/mpt-125m-c4
  • hakurei/gpt-j-random-tinier
  • meta-llama/Llama-3.2-1B
  • unsloth/gemma-2b
  • unsloth/gemma-2-2b
  • TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ (AWQ model)
  • neuralmagic/Llama-3.2-3B-Instruct-FP8 (compressed-tensors float quantization, per-tensor for weights and activations)
  • neuralmagic/Qwen2-0.5B-Instruct-FP8 (FP8 quant method, static, LM head ignored)
  • ibm-granite/granite-3.1-2b-instruct
  • ibm-granite/granite-guardian-3.1-2b
  • allenai/OLMo-2-0425-1B

abukhoy and others added 5 commits December 9, 2025 12:32
Signed-off-by: Abukhoyer Shaik <[email protected]>
# Support for Diffusers Architecture in Efficient Transformers

## Overview
This pull request introduces **Diffusers architecture support** to the
**Efficient Transformers** framework, enabling seamless integration of
diffusion models.

## Key Highlights
1. **Support for the
[black-forest-labs/FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) model**
2. **Flexible Configuration**  
- Supports JSON-based configuration files for easy compilation and
execution.
3. **Performance Benchmarking**  
- Implements a performance matrix for Diffusers models to enable
benchmarking of each module.
4. **Testing Framework**  
   - Includes initial test scripts for Diffusers (In progress).
5. **Support for ONNX subfunction graphs via the `use_onnx_function` flag**
6. **Support for parallel compilation of modules via the
`parallel_compile` flag** (see the usage sketch after this list)
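
As a rough illustration of how these pieces might fit together, here is a minimal sketch. The `QEFFFluxPipeline` class name, the config-file parameter, and the generation call are assumptions rather than confirmed API; only the `use_onnx_function` and `parallel_compile` flags come from this PR.

```python
# Hypothetical sketch: compiling FLUX.1-schnell with the new Diffusers support.
# Class name, config parameter, and generation call are assumptions, not confirmed API.
from QEfficient import QEFFFluxPipeline  # assumed entry point for Diffusers models

# JSON-based configuration (highlight 2); the file name and schema are illustrative only.
pipeline = QEFFFluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    config_file="flux_schnell_compile_config.json",  # assumed parameter name
)

# Export ONNX subfunction graphs (highlight 5) and compile modules in parallel (highlight 6).
pipeline.compile(
    use_onnx_function=True,   # flag named in this PR
    parallel_compile=True,    # flag named in this PR
)

# Illustrative generation call; the actual inference interface may differ.
images = pipeline("A photo of a red bicycle leaning against a brick wall")
```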

---------

Signed-off-by: Amit Raj <[email protected]>
Signed-off-by: Amit Raj <[email protected]>
Signed-off-by: tv-karthikeya <[email protected]>
Signed-off-by: vtirumal <[email protected]>
Co-authored-by: tv-karthikeya <[email protected]>
Co-authored-by: Amit Raj <[email protected]>
Co-authored-by: Karthikeya <[email protected]>
# We should be using disaggregated serving for the GPT-OSS model for best
performance
- GPT-OSS has a total_experts/experts_per_tok ratio of 128/4 for the 120B
model and 32/4 for the smaller model
- In the prefill-only model we use a "read all experts exactly once"
strategy
- In the decode-only model we treat expert weights like activations,
i.e. we read only the chosen experts

# Prefill-only model
## Blocking: default behaviour when `prefill_only=True` in compile API
- `NUM_Q_BLOCKS=<int>` sets the number of Q blocks in attention
- `NUM_FFN_BLOCKS=<int>` sets the number of blocks in the FFN
- `ENABLE_OPT_SWA=0|1` enables/disables optimized SWA. When enabled, only
the valid KVs for a given block are used in attention, reducing MACs
- prefix_caching is not supported in this mode (see the sketch below)
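
A minimal sketch of driving the blocking mode, assuming the environment variables are read at export/compile time and that `prefill_only` is a keyword accepted by `compile` as this PR describes; the model ID, block counts, sequence lengths, and core counts are placeholders.

```python
import os
from QEfficient import QEFFAutoModelForCausalLM

# Blocking knobs from this PR; values here are placeholders.
os.environ["NUM_Q_BLOCKS"] = "4"      # number of Q blocks in attention
os.environ["NUM_FFN_BLOCKS"] = "4"    # number of blocks in the FFN
os.environ["ENABLE_OPT_SWA"] = "1"    # 1 = read only valid KVs per block in attention

model = QEFFAutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b")

# Prefill-only compile (prefix caching is not supported in this mode).
model.compile(
    prefill_only=True,        # flag named in this PR
    prefill_seq_len=2048,
    ctx_len=4096,
    num_cores=16,
    num_devices=1,
)
```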

## Chunking: pass `enable_chunking=True` and `prefill_only=True` in the
compile API
- Optimized SWA, i.e. reading only the KVs that are valid under the
diagonal attention mask, is enabled by default in this version
- This model can be used for prefix_caching by passing
`kv_cache_batch_size=<int>` in the compile API (see the sketch below)
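
A sketch of the chunked prefill-only variant with prefix caching, under the same assumptions as above (`enable_chunking`, `prefill_only`, and `kv_cache_batch_size` are taken from this PR; everything else is a placeholder).

```python
from QEfficient import QEFFAutoModelForCausalLM

model = QEFFAutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b")

# Chunked prefill-only compile; optimized SWA is on by default in this mode.
model.compile(
    prefill_only=True,
    enable_chunking=True,       # flag named in this PR
    kv_cache_batch_size=4,      # enables prefix caching (placeholder value)
    prefill_seq_len=2048,
    ctx_len=4096,
    num_cores=16,
)
```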

# Decode-only model
## Retain sliding-window length of KV for sliding-window layers: default
behaviour when `prefill_seq_len=1` in the compile API
- This reduces the amount of DDR used by the model
- CB is enabled for this version: pass `continuous_batching=True` in the
`from_pretrained` call, strictly pass `full_batch_size=<int>`, and
optionally `kv_cache_batch_size=<int>` if needed (see the sketch below)
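
A sketch of the decode-only default path; `continuous_batching` and `full_batch_size` are existing QEfficient options, while the model ID, context length, batch sizes, and core count are placeholders.

```python
from QEfficient import QEFFAutoModelForCausalLM

# Continuous batching is enabled at load time for the decode-only model.
model = QEFFAutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    continuous_batching=True,
)

# Decode-only compile: prefill_seq_len=1 triggers the default behaviour of
# retaining only sliding-window-length KV for sliding-window layers.
model.compile(
    prefill_seq_len=1,
    ctx_len=4096,
    full_batch_size=8,          # required with continuous batching
    kv_cache_batch_size=8,      # optional (placeholder value)
    num_cores=16,
)
```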
## Full KV for sliding-window layers: pass `retain_full_kv=True` along
with `prefill_seq_len=1` in the compile API
- This uses more DDR, since we retain ctx_len KV even for sliding-window
layers, but only sliding-window-length KV is read in attention
- CB is enabled for this version: pass `continuous_batching=True` in the
`from_pretrained` call, strictly pass `full_batch_size=<int>`, and
optionally `kv_cache_batch_size=<int>` if needed
- This targets the multi-turn chat use case, where we run prefill -> decode
and then reuse the combined prefill and decode cache to run prefill again,
so we want to retain the full KV for sliding-window layers (see the sketch
below)
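
The full-KV variant differs from the previous sketch only in the extra flag; `retain_full_kv` is taken from this PR, the rest remains a placeholder.

```python
from QEfficient import QEFFAutoModelForCausalLM

model = QEFFAutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    continuous_batching=True,
)

# Full KV retained for sliding-window layers (multi-turn chat use case);
# attention still reads only sliding-window-length KV.
model.compile(
    prefill_seq_len=1,
    retain_full_kv=True,        # flag named in this PR
    ctx_len=4096,
    full_batch_size=8,          # required with continuous batching
    num_cores=16,
)
```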


NOTE:
* The decode-only model currently fails compilation with
`use_onnx_subfunctions=True`, so avoid using it there
* The 120B model needs an NPI; two versions of the NPI, one with and one
without subfunctions, are uploaded here. Pass it as
`node_precision_info=<path to file>`
* It is advised to use `use_onnx_subfunctions=True` with the prefill-only
model, otherwise compilation times are too high. With this flag the model
is expected to export and then fail during compile because it needs the
assert SDK, so the user should run the compilation manually by pasting the
command printed in the error (a combined example follows below)
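
Putting these notes together for the 120B prefill-only flow, a hedged sketch; the model ID, the NPI file path, and the assumption that `use_onnx_subfunctions` and `node_precision_info` are plain keyword arguments to `compile` are placeholders based on this PR's description.

```python
from QEfficient import QEFFAutoModelForCausalLM

model = QEFFAutoModelForCausalLM.from_pretrained("openai/gpt-oss-120b")

# Prefill-only compile with ONNX subfunctions and the node precision info
# file. Export is expected to succeed and compilation to fail (assert SDK
# needed); rerun the printed compile command manually as noted above.
model.compile(
    prefill_only=True,
    use_onnx_subfunctions=True,                        # advised for prefill-only
    node_precision_info="npi_with_subfunctions.yaml",  # placeholder path
    prefill_seq_len=2048,
    ctx_len=4096,
    num_cores=16,
)
```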

---------

Signed-off-by: vbaddi <[email protected]>
Signed-off-by: Onkar Chougule <[email protected]>
Signed-off-by: Mamta Singh <[email protected]>
Signed-off-by: Onkar Chougule <[email protected]>
Co-authored-by: Vinayak Baddi <[email protected]>
Co-authored-by: Vinayak Baddi <[email protected]>
Co-authored-by: Mamta Singh <[email protected]>
Co-authored-by: Mamta Singh <[email protected]>
@abukhoy abukhoy marked this pull request as draft December 15, 2025 06:49