6 changes: 3 additions & 3 deletions QEfficient/transformers/models/modeling_auto.py
@@ -3550,10 +3550,10 @@ class QEFFAutoModelForCTC(QEFFTransformersBase):
including Wav2Vec2 and other encoder-only speech models optimized for alignment-free transcription.
Although it is possible to initialize the class directly, we highly recommend using the ``from_pretrained`` method for initialization.

``Mandatory`` Args:
:model (nn.Module): PyTorch model

Example
-------
.. code-block:: python

import torchaudio
from QEfficient import QEFFAutoModelForCTC
from transformers import AutoProcessor
18 changes: 13 additions & 5 deletions README.md
@@ -6,18 +6,26 @@
---

*Latest news* :fire: <br>

- [12/2025] Enabled [disaggregated serving](examples/disagg_serving) for GPT-OSS model
- [12/2025] Added support for wav2vec2 Audio Model [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h)
- [12/2025] Added support for diffuser video generation model [WAN 2.2 Model Card](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B-Diffusers)
- [12/2025] Added support for diffuser image generation model [FLUX.1 Model Card](https://huggingface.co/black-forest-labs/FLUX.1-schnell)
- [12/2025] Added support for [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)
- [12/2025] Added support for [OpenGVLab/InternVL3_5-1B](https://huggingface.co/OpenGVLab/InternVL3_5-1B)
- [12/2025] Added support for Olmo Model [allenai/OLMo-2-0425-1B](https://huggingface.co/allenai/OLMo-2-0425-1B)
- [10/2025] Added support for Qwen3 MOE Model [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507)
- [10/2025] Added support for Qwen2.5VL Multi-Model [Qwen/Qwen2.5-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct)
- [10/2025] Added support for Mistral3 Multi-Model [mistralai/Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503)
- [10/2025] Added support for Molmo Multi-Model [allenai/Molmo-7B-D-0924](https://huggingface.co/allenai/Molmo-7B-D-0924)
- [06/2025] Added support for Llama4 Multi-Model [meta-llama/Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct)
- [06/2025] Added support for Gemma3 Multi-Modal-Model [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)
- [06/2025] Added support for model `hpcai-tech/grok-1` [hpcai-tech/grok-1](https://huggingface.co/hpcai-tech/grok-1)
- [06/2025] Added support for sentence embedding which improves efficiency, Flexible/Custom Pooling configuration and compilation with multiple sequence lengths, [Embedding model](https://github.com/quic/efficient-transformers/pull/424).


<details>
<summary>More</summary>

- [06/2025] Added support for Llama4 Multi-Model [meta-llama/Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct)
- [06/2025] Added support for Gemma3 Multi-Modal-Model [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)
- [06/2025] Added support for model `hpcai-tech/grok-1` [hpcai-tech/grok-1](https://huggingface.co/hpcai-tech/grok-1)
- [06/2025] Added support for sentence embedding which improves efficiency, Flexible/Custom Pooling configuration and compilation with multiple sequence lengths, [Embedding model](https://github.com/quic/efficient-transformers/pull/424)
- [04/2025] Support for [SpD, multiprojection heads](https://quic.github.io/efficient-transformers/source/quick_start.html#draft-based-speculative-decoding). Implemented post-attention hidden size projections to speculate tokens ahead of the base model
- [04/2025] [QNN Compilation support](https://github.com/quic/efficient-transformers/pull/374) for AutoModel classes. QNN compilation capabilities for multi-models, embedding models and causal models.
- [04/2025] Added support for separate prefill and decode compilation for encoder (vision) and language models. This feature will be utilized for [disaggregated serving](https://github.com/quic/efficient-transformers/pull/365).
1 change: 1 addition & 0 deletions docs/index.rst
@@ -38,6 +38,7 @@ Welcome to Efficient-Transformers Documentation!
:maxdepth: 4

source/qeff_autoclasses
source/diffuser_classes
source/cli_api

.. toctree::
84 changes: 84 additions & 0 deletions docs/source/diffuser_classes.md
@@ -0,0 +1,84 @@
# Diffuser Classes

Contributor comment: Can we follow an approach similar to qeff_autoclasses.html? Add small examples and keep only the user-exposed classes. @quic-amitraj, can you suggest?



## Pipeline API

(QEffTextEncoder)=
### `QEffTextEncoder`

```{eval-rst}
.. autoclass:: QEfficient.diffusers.pipelines.pipeline_module.QEffTextEncoder
:members:
:no-show-inheritance:
```

---

(QEffUNet)=
### `QEffUNet`

```{eval-rst}
.. autoclass:: QEfficient.diffusers.pipelines.pipeline_module.QEffUNet
:members:
:no-show-inheritance:
```

---

(QEffVAE)=
### `QEffVAE`

```{eval-rst}
.. autoclass:: QEfficient.diffusers.pipelines.pipeline_module.QEffVAE
:members:
:no-show-inheritance:
```

---

(QEffFluxTransformerModel)=
### `QEffFluxTransformerModel`

```{eval-rst}
.. autoclass:: QEfficient.diffusers.pipelines.pipeline_module.QEffFluxTransformerModel
:members:
:no-show-inheritance:
```

----

(QEffWanUnifiedTransformer)=
### `QEffWanUnifiedTransformer`

```{eval-rst}
.. autoclass:: QEfficient.diffusers.pipelines.pipeline_module.QEffWanUnifiedTransformer
:members:
:no-show-inheritance:
```

----


## Model Classes

(QEffWanPipeline)=
### `QEffWanPipeline`

```{eval-rst}
.. autoclass:: QEfficient.diffusers.pipelines.wan.pipeline_wan.QEffWanPipeline
:members:
:no-show-inheritance:
```

----

(QEffFluxPipeline)=
### `QEffFluxPipeline`

```{eval-rst}
.. autoclass:: QEfficient.diffusers.pipelines.flux.pipeline_flux.QEffFluxPipeline
:members:
:no-show-inheritance:
```

----
20 changes: 16 additions & 4 deletions docs/source/introduction.md
@@ -23,14 +23,26 @@ For other models, there is comprehensive documentation to inspire upon the chang
***Latest news*** : <br>

- [coming soon] Support for more popular [models](models_coming_soon)<br>
- [06/2025] Added support for Llama4 Multi-Model [meta-llama/Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct)
- [06/2025] Added support for Gemma3 Multi-Modal-Model [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)
- [06/2025] Added support for model `hpcai-tech/grok-1` [hpcai-tech/grok-1](https://huggingface.co/hpcai-tech/grok-1)
- [06/2025] Added support for sentence embedding which improves efficiency, Flexible/Custom Pooling configuration and compilation with multiple sequence lengths, [Embedding model](https://github.com/quic/efficient-transformers/pull/424).
- [12/2025] Enabled [disaggregated serving](https://github.com/quic/efficient-transformers/tree/main/examples/disagg_serving) for GPT-OSS model
- [12/2025] Added support for wav2vec2 Audio Model [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h)
- [12/2025] Added support for diffuser video generation model [WAN 2.2 Model Card](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B-Diffusers)
- [12/2025] Added support for diffuser image generation model [FLUX.1 Model Card](https://huggingface.co/black-forest-labs/FLUX.1-schnell)
- [12/2025] Added support for [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)
- [12/2025] Added support for [OpenGVLab/InternVL3_5-1B](https://huggingface.co/OpenGVLab/InternVL3_5-1B)
- [12/2025] Added support for Olmo Model [allenai/OLMo-2-0425-1B](https://huggingface.co/allenai/OLMo-2-0425-1B)
- [10/2025] Added support for Qwen3 MOE Model [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507)
- [10/2025] Added support for Qwen2.5VL Multi-Model [Qwen/Qwen2.5-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct)
- [10/2025] Added support for Mistral3 Multi-Model [mistralai/Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503)
- [10/2025] Added support for Molmo Multi-Model [allenai/Molmo-7B-D-0924](https://huggingface.co/allenai/Molmo-7B-D-0924)


<details>
<summary>More</summary>

- [06/2025] Added support for Llama4 Multi-Model [meta-llama/Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct)
- [06/2025] Added support for Gemma3 Multi-Modal-Model [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)
- [06/2025] Added support for model `hpcai-tech/grok-1` [hpcai-tech/grok-1](https://huggingface.co/hpcai-tech/grok-1)
- [06/2025] Added support for sentence embedding which improves efficiency, Flexible/Custom Pooling configuration and compilation with multiple sequence lengths, [Embedding model](https://github.com/quic/efficient-transformers/pull/424)
- [04/2025] Support for [SpD, multiprojection heads](https://quic.github.io/efficient-transformers/source/quick_start.html#draft-based-speculative-decoding). Implemented post-attention hidden size projections to speculate tokens ahead of the base model
- [04/2025] [QNN Compilation support](https://github.com/quic/efficient-transformers/pull/374) for AutoModel classes. QNN compilation capabilities for multi-models, embedding models and causal models.
- [04/2025] Added support for separate prefill and decode compilation for encoder (vision) and language models. This feature will be utilized for [disaggregated serving](https://github.com/quic/efficient-transformers/pull/365).
20 changes: 20 additions & 0 deletions docs/source/qeff_autoclasses.md
@@ -115,3 +115,23 @@
.. automethod:: QEfficient.transformers.models.modeling_auto.QEFFAutoModelForSpeechSeq2Seq.compile
.. automethod:: QEfficient.transformers.models.modeling_auto.QEFFAutoModelForSpeechSeq2Seq.generate
```

(QEFFAutoModelForCTC)=
## `QEFFAutoModelForCTC`

Contributor comment: Please add an example here.



```{eval-rst}
.. autoclass:: QEfficient.transformers.models.modeling_auto.QEFFAutoModelForCTC
:noindex:
:no-members:
:no-show-inheritance:
```

### High-Level API

```{eval-rst}
.. automethod:: QEfficient.transformers.models.modeling_auto.QEFFAutoModelForCTC.from_pretrained
.. automethod:: QEfficient.transformers.models.modeling_auto.QEFFAutoModelForCTC.export
.. automethod:: QEfficient.transformers.models.modeling_auto.QEFFAutoModelForCTC.compile
.. automethod:: QEfficient.transformers.models.modeling_auto.QEFFAutoModelForCTC.generate
```
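
As background for the CTC head that `QEFFAutoModelForCTC` wraps, the "alignment-free transcription" mentioned in the class docstring can be illustrated with greedy CTC decoding: take the most likely label per audio frame, collapse consecutive repeats, then drop the blank symbol. The sketch below is plain Python and does not call the QEfficient API; the toy vocabulary, the `blank_id=0` default, and the function name `ctc_greedy_decode` are assumptions made for illustration only.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeated frame labels, then drop CTC blanks."""
    # groupby yields one key per run of identical consecutive labels.
    collapsed = [label for label, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary; a real checkpoint ships its own tokenizer.
vocab = {0: "<blank>", 1: "h", 2: "e", 3: "l", 4: "o"}

# Frame-wise argmax labels, i.e. "h h _ e _ l l _ l _ o o" with "_" as blank.
frames = [1, 1, 0, 2, 0, 3, 3, 0, 3, 0, 4, 4]

transcript = ctc_greedy_decode(frames, vocab)
print(transcript)
```

The blank between the two runs of `l` is what lets CTC emit a genuine double letter; without it the repeats would be collapsed into one.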
113 changes: 111 additions & 2 deletions docs/source/release_docs.md
@@ -1,11 +1,120 @@
# Efficient Transformer Library - 1.21.0 Release Notes

Welcome to the official release of **Efficient Transformer Library v1.21.0**! This release introduces advanced attention mechanisms, expanded model support, optimized serving capabilities, and significant improvements to fine-tuning and deployment workflows.

> ✅ All features and models listed below are available on the [`release/v1.21.0`](https://github.com/quic/efficient-transformers/tree/release/v1.21.0) branch and [`mainline`](https://github.com/quic/efficient-transformers/tree/main).

---

## Newly Supported Models

- **Flux (Diffusers - Image Generation)**
- Diffusion-based image generation model
- [Flux.1 Schnell Example Script](https://github.com/quic/efficient-transformers/blob/main/examples/diffusers/flux/flux_1_schnell.py)

- **WAN (Diffusers - Video Generation)**
- Diffusion-based video generation model (WAN 2.2), with a Lightning example
- [Wan_lightning Example Script](https://github.com/quic/efficient-transformers/blob/main/examples/diffusers/wan/wan_lightning.py)

- **Qwen2.5-VL (Vision Language)**
- Executable via [`QEFFAutoModelForImageTextToText`](#QEFFAutoModelForImageTextToText)
- Multi-image prompt support
- Continuous batching enabled
- [Qwen2.5-VL Usage Guide](https://github.com/quic/efficient-transformers/tree/main/examples/image_text_to_text/models/qwen_vl)

- **Mistral 3.1 (24B)**
- Executable via [`QEFFAutoModelForImageTextToText`](#QEFFAutoModelForImageTextToText)
- [Mistral-3.1 Example Script](https://github.com/quic/efficient-transformers/blob/main/examples/image_text_to_text/models/mistral_vision/mistral3_example.py)


- **GPT-OSS (Decoder-Only)**
- Executable via [`QEffAutoModelForCausalLM`](#QEffAutoModelForCausalLM)
- Separate prefill and decode compilation supported
- Disaggregated serving ready
- [GPT-OSS Example Scripts](https://github.com/quic/efficient-transformers/blob/main/examples/disagg_serving/gpt_oss_disagg_mode.py)

- **Olmo2**
- Executable via [`QEffAutoModelForCausalLM`](#QEffAutoModelForCausalLM)
- Full CausalLM support with optimizations
- Refer to [Text generation Example Scripts](https://github.com/quic/efficient-transformers/tree/main/examples/text_generation) for usage details.

- **Molmo**
- Executable via [`QEffAutoModelForCausalLM`](#QEffAutoModelForCausalLM)
- Multi-modal capabilities
- [Molmo Example Script](https://github.com/quic/efficient-transformers/blob/main/examples/image_text_to_text/models/molmo/molmo_example.py)

- **InternVL 3.5 Series**
- Executable via [`QEffAutoModelForCausalLM`](#QEffAutoModelForCausalLM)
- Full Vision-Language support
- Multi-image handling with continuous batching
- Refer to [InternVL 3.5 Example Scripts](https://github.com/quic/efficient-transformers/tree/main/examples/image_text_to_text/models/internvl) for usage details.

- **Qwen3-MOE (Mixture of Experts)**
- Executable via [`QEffAutoModelForCausalLM`](#QEffAutoModelForCausalLM)
- Efficient expert routing
- [Qwen3-MOE Example Scripts](https://github.com/quic/efficient-transformers/blob/main/examples/text_generation/moe_inference.py)

- **Wav2Vec2 (Audio)**
- Executable via [`QEFFAutoModelForCTC`](#QEFFAutoModelForCTC)
- Speech recognition and audio feature extraction
- [Wav2Vec2 Example Scripts](https://github.com/quic/efficient-transformers/blob/main/examples/audio/wav2vec2_inference.py)

- **Multilingual-e5-Large (Embedding Model)**
- Executable via [`QEffAutoModel`](#QEffAutoModel)
- Multilingual text embedding capabilities
- Refer to the [usage details](https://github.com/quic/efficient-transformers/tree/main/examples/embeddings).

---

## Key Features & Enhancements

- **Framework Upgrades**: Transformers `4.55`, PyTorch `2.7.0+cpu`, Torchvision `0.22.0+cpu`
- **Python Support**: Requires Python `3.10`
- **ONNX Opset**: Updated to version `17` for broader operator support
- **Advanced Attention**: Flux blocking support, BlockedKV attention for CausalLM models
- **Diffusers Integration**: Full support for diffuser-based image generation and video generation models
- **Compute-Context-Length (CCL) support**: Optimizes throughput when handling very large context lengths
- **Prefill/Decode Separation**: Support for GPT-OSS using disaggregated serving
- **Continuous Batching (VLMs)**: Extended to Vision Language Models with multi-image handling
- **ONNX Sub-Functions**: Feature enabling more efficient model compilation and execution on hardware
- **Memory Profiling**: Built-in utilities for optimization analysis
- **Extended on-device sampling**: Extends on-device sampling to dual-QPC VLMs and adds guided decoding for on-device sampling
- **ONNX transform, memory & time optimizations**: Optimizations for faster ONNX Transform and reduced memory footprint
- **Removed platform SDK dependency**: Support QPC generation on systems without the Platform SDK
- **Example Scripts Revamp**: New example scripts for audio, embeddings, and image-text-to-text tasks
- **Onboarding Guide**: Simplified setup and deployment process for new users
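
To make the "BlockedKV attention" item above more concrete, here is a minimal pure-Python sketch of attention computed block by block over the K/V cache, using a running maximum and running denominator so the result matches ordinary softmax attention. This is an illustration of the general technique under stated assumptions, not the library's implementation: the function names, the single-query setup, and the block size are all hypothetical.

```python
import math


def full_attention(q, keys, values):
    """Reference: softmax(q . k / sqrt(d)) weighted sum over all values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def blocked_attention(q, keys, values, block_size):
    """Same result, but the cache is read one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new reference maximum.
        correction = math.exp(running_max - new_max)  # 0.0 on the first block
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.3, 0.7]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

reference = full_attention(q, keys, values)
blocked = blocked_attention(q, keys, values, block_size=2)
print(reference, blocked)
```

Only one block of keys and values is touched per loop iteration, which is the property that makes this access pattern attractive for long-context decode.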



---

## Embedding Model Upgrades

- **Multi-Sequence Length Support**: Auto-selects optimal graph at runtime
- **Enhanced Pooling**: Flexible pooling strategies for various embedding tasks
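
The two embedding items above can be sketched in plain Python. The example assumes that "optimal graph" means the smallest compiled sequence length that still fits the input, and that pooling operates over token embeddings with an attention mask; both are assumptions for illustration, and `pick_seq_len` and `pool` are hypothetical helper names, not QEfficient functions.

```python
from bisect import bisect_left


def pick_seq_len(compiled_lens, n_tokens):
    """Return the smallest compiled sequence length that fits n_tokens."""
    lens = sorted(compiled_lens)
    idx = bisect_left(lens, n_tokens)
    if idx == len(lens):
        raise ValueError(f"input of {n_tokens} tokens exceeds all compiled lengths")
    return lens[idx]


def pool(token_embeddings, attention_mask, strategy="mean"):
    """Reduce per-token embeddings to one vector, ignoring padded positions."""
    if strategy == "cls":
        return list(token_embeddings[0])
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if strategy == "mean":
        return [sum(col) / len(kept) for col in zip(*kept)]
    if strategy == "max":
        return [max(col) for col in zip(*kept)]
    raise ValueError(f"unknown pooling strategy: {strategy}")


chosen = pick_seq_len([32, 128, 512], n_tokens=100)

# Two real tokens followed by one padded position (mask value 0).
embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
mean_vec = pool(embeddings, mask, "mean")
max_vec = pool(embeddings, mask, "max")
print(chosen, mean_vec, max_vec)
```

Picking the smallest length that fits keeps padding, and therefore wasted compute, to a minimum for short inputs while still supporting long ones.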

---

## Fine-Tuning Support

- **Checkpoint Management**: Resume from epochs with proper state restoration
- **Enhanced Loss Tracking**: Corrected data type handling for accurate loss computation
- **Custom Dataset Support**: Improved handling with better tokenization
- **Device-Aware Scaling**: Optimized GradScaler for multi-device training
- **Comprehensive Testing**: Unit tests for fine-tuning workflows

---


# Efficient Transformer Library - 1.20.0 Release Notes

Welcome to the official release of **Efficient Transformer Library v1.20.0**! This release brings a host of new model integrations, performance enhancements, and fine-tuning capabilities to accelerate your AI development.
Welcome to the official release of **Efficient Transformer Library v1.20.0**! This release introduces advanced attention mechanisms, expanded model support, optimized serving capabilities, and significant improvements to fine-tuning and deployment workflows.

> ✅ All features and models listed below are available on the [`release/1.20.0`](https://github.com/quic/efficient-transformers/tree/release/v1.20.0) branch and [`mainline`](https://github.com/quic/efficient-transformers/tree/main).
> ✅ All features and models listed below are available on the [`release/v1.20.0`](https://github.com/quic/efficient-transformers/tree/release/v1.20.0) branch and [`mainline`](https://github.com/quic/efficient-transformers/tree/main).

---


## Newly Supported Models

- **Llama-4-Scout-17B-16E-Instruct**
10 changes: 8 additions & 2 deletions docs/source/supported_features.rst
@@ -6,6 +6,14 @@ Supported Features

* - Feature
- Impact
* - `Diffusion Models <https://github.com/quic/efficient-transformers/tree/main/examples/diffusers>`_
- Support for diffuser-based image and video generation models, such as FLUX.1 and WAN 2.2, enabling efficient image and video synthesis.
* - `Disaggregated Serving for GPT-OSS <https://github.com/quic/efficient-transformers/tree/main/examples/disagg_serving>`_
- Enabled for GPT-OSS models, allowing for flexible deployment of large language models across different hardware configurations.
* - `ONNX Sub-Functions <https://github.com/quic/efficient-transformers/pull/621>`_
- Feature enabling more efficient model compilation and execution on hardware.
* - `BlockedKV attention in CausalLM <https://github.com/quic/efficient-transformers/pull/618>`_
- Implements a blocked K/V cache layout so attention reads and processes the cache block by block, improving long-context decode performance.
* - `Compute Context Length (CCL) <https://github.com/quic/efficient-transformers/blob/main/examples/performance/compute_context_length/README.md>`_
- Optimizes inference by using different context lengths during prefill and decode phases, reducing memory footprint and computation for shorter sequences while maintaining support for longer contexts. Supports both text-only and vision-language models. Refer to the `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/performance/compute_context_length/basic_inference.py>`_ for more **details**.
* - Sentence embedding, Flexible Pooling configuration and compilation with multiple sequence lengths
@@ -58,5 +66,3 @@ Supported Features
- A script for computing the perplexity of a model, allowing for the evaluation of model performance and comparison across different models and datasets. Refer to the `sample script <https://github.com/quic/efficient-transformers/blob/main/scripts/perplexity_computation/calculate_perplexity.py>`_ for more **details**.
* - KV Heads Replication Script
- A sample script for replicating key-value (KV) heads for the Llama-3-8B-Instruct model, running inference with the original model, replicating KV heads, validating changes, and exporting the modified model to ONNX format. Refer to the `sample script <https://github.com/quic/efficient-transformers/blob/main/scripts/replicate_kv_head/replicate_kv_heads.py>`_ for more **details**.
* - Block Attention (in progress)
- Reduces inference latency and computational cost by dividing context into blocks and reusing key-value states, particularly useful in RAG.