diff --git a/QEfficient/transformers/models/modeling_auto.py b/QEfficient/transformers/models/modeling_auto.py index 236f6c9f5..8b2f3edd6 100644 --- a/QEfficient/transformers/models/modeling_auto.py +++ b/QEfficient/transformers/models/modeling_auto.py @@ -3550,10 +3550,10 @@ class QEFFAutoModelForCTC(QEFFTransformersBase): including Wav2Vec2 and other encoder-only speech models optimized for alignment-free transcription. Although it is possible to initialize the class directly, we highly recommend using the ``from_pretrained`` method for initialization. - ``Mandatory`` Args: - :model (nn.Module): PyTorch model - + Example + ------- .. code-block:: python + import torchaudio from QEfficient import QEFFAutoModelForCTC from transformers import AutoProcessor diff --git a/README.md b/README.md index cb6f32382..257fd6344 100644 --- a/README.md +++ b/README.md @@ -6,18 +6,26 @@ --- *Latest news* :fire:
- +- [12/2025] Enabled [disaggregated serving](examples/disagg_serving) for GPT-OSS model +- [12/2025] Added support for wav2vec2 Audio Model [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h) +- [12/2025] Added support for diffuser video generation model [WAN 2.2 Model Card](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B-Diffusers) +- [12/2025] Added support for diffuser image generation model [FLUX.1 Model Card](https://huggingface.co/black-forest-labs/FLUX.1-schnell) +- [12/2025] Added support for [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) +- [12/2025] Added support for [OpenGVLab/InternVL3_5-1B](https://huggingface.co/OpenGVLab/InternVL3_5-1B) +- [12/2025] Added support for Olmo Model [allenai/OLMo-2-0425-1B](https://huggingface.co/allenai/OLMo-2-0425-1B) +- [10/2025] Added support for Qwen3 MOE Model [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) - [10/2025] Added support for Qwen2.5VL Multi-Model [Qwen/Qwen2.5-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct) - [10/2025] Added support for Mistral3 Multi-Model [mistralai/Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503) - [10/2025] Added support for Molmo Multi-Model [allenai/Molmo-7B-D-0924](https://huggingface.co/allenai/Molmo-7B-D-0924) -- [06/2025] Added support for Llama4 Multi-Model [meta-llama/Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct) -- [06/2025] Added support for Gemma3 Multi-Modal-Model [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it) -- [06/2025] Added support of model `hpcai-tech/grok-1` [hpcai-tech/grok-1](https://huggingface.co/hpcai-tech/grok-1) -- [06/2025] Added support for sentence embedding which improves efficiency, Flexible/Custom Pooling configuration and compilation with multiple sequence lengths, [Embedding model](https://github.com/quic/efficient-transformers/pull/424). +
More +- [06/2025] Added support for Llama4 Multi-Model [meta-llama/Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct) +- [06/2025] Added support for Gemma3 Multi-Modal-Model [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it) +- [06/2025] Added support of model `hpcai-tech/grok-1` [hpcai-tech/grok-1](https://huggingface.co/hpcai-tech/grok-1) +- [06/2025] Added support for sentence embedding which improves efficiency, Flexible/Custom Pooling configuration and compilation with multiple sequence lengths, [Embedding model](https://github.com/quic/efficient-transformers/pull/424) - [04/2025] Support for [SpD, multiprojection heads](https://quic.github.io/efficient-transformers/source/quick_start.html#draft-based-speculative-decoding). Implemented post-attention hidden size projections to speculate tokens ahead of the base model - [04/2025] [QNN Compilation support](https://github.com/quic/efficient-transformers/pull/374) for AutoModel classes. QNN compilation capabilities for multi-models, embedding models and causal models. - [04/2025] Added support for separate prefill and decode compilation for encoder (vision) and language models. This feature will be utilized for [disaggregated serving](https://github.com/quic/efficient-transformers/pull/365). diff --git a/docs/index.rst b/docs/index.rst index e83337db2..5e0c8f634 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -38,6 +38,7 @@ Welcome to Efficient-Transformers Documentation! :maxdepth: 4 source/qeff_autoclasses + source/diffuser_classes source/cli_api .. toctree:: diff --git a/docs/source/diffuser_classes.md b/docs/source/diffuser_classes.md new file mode 100644 index 000000000..7154f8c0d --- /dev/null +++ b/docs/source/diffuser_classes.md @@ -0,0 +1,84 @@ +# Diffuser Classes + + +## Pipeline API + +(QEffTextEncoder)= +### `QEffTextEncoder` + +```{eval-rst} +.. autoclass:: QEfficient.diffusers.pipelines.pipeline_module.QEffTextEncoder + :members: + :no-show-inheritance: +``` + +--- + +(QEffUNet)= +### `QEffUNet` + +```{eval-rst} +.. autoclass:: QEfficient.diffusers.pipelines.pipeline_module.QEffUNet + :members: + :no-show-inheritance: +``` + +--- + +(QEffVAE)= +### `QEffVAE` + +```{eval-rst} +.. autoclass:: QEfficient.diffusers.pipelines.pipeline_module.QEffVAE + :members: + :no-show-inheritance: +``` + +--- + +(QEffFluxTransformerModel)= +### `QEffFluxTransformerModel` + +```{eval-rst} +.. autoclass:: QEfficient.diffusers.pipelines.pipeline_module.QEffFluxTransformerModel + :members: + :no-show-inheritance: +``` + +---- + +(QEffWanUnifiedTransformer)= +### `QEffWanUnifiedTransformer` + +```{eval-rst} +.. autoclass:: QEfficient.diffusers.pipelines.pipeline_module.QEffWanUnifiedTransformer + :members: + :no-show-inheritance: +``` + +---- + + +## Model Classes + +(QEffWanPipeline)= +### `QEffWanPipeline` + +```{eval-rst} +.. autoclass:: QEfficient.diffusers.pipelines.wan.pipeline_wan.QEffWanPipeline + :members: + :no-show-inheritance: +``` + +---- + +(QEffFluxPipeline)= +### `QEffFluxPipeline` + +```{eval-rst} +.. autoclass:: QEfficient.diffusers.pipelines.flux.pipeline_flux.QEffFluxPipeline + :members: + :no-show-inheritance: +``` + +---- diff --git a/docs/source/introduction.md b/docs/source/introduction.md index 9fdc814d8..3fbbb1813 100644 --- a/docs/source/introduction.md +++ b/docs/source/introduction.md @@ -23,14 +23,26 @@ For other models, there is comprehensive documentation to inspire upon the chang ***Latest news*** :
- [coming soon] Support for more popular [models](models_coming_soon)
-- [06/2025] Added support for Llama4 Multi-Model [meta-llama/Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct) -- [06/2025] Added support for Gemma3 Multi-Modal-Model [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it) -- [06/2025] Added support of model `hpcai-tech/grok-1` [hpcai-tech/grok-1](https://huggingface.co/hpcai-tech/grok-1) -- [06/2025] Added support for sentence embedding which improves efficiency, Flexible/Custom Pooling configuration and compilation with multiple sequence lengths, [Embedding model](https://github.com/quic/efficient-transformers/pull/424). +- [12/2025] Enabled [disaggregated serving](https://github.com/quic/efficient-transformers/tree/main/examples/disagg_serving) for GPT-OSS model +- [12/2025] Added support for wav2vec2 Audio Model [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h) +- [12/2025] Added support for diffuser video generation model [WAN 2.2 Model Card](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B-Diffusers) +- [12/2025] Added support for diffuser image generation model [FLUX.1 Model Card](https://huggingface.co/black-forest-labs/FLUX.1-schnell) +- [12/2025] Added support for [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) +- [12/2025] Added support for [OpenGVLab/InternVL3_5-1B](https://huggingface.co/OpenGVLab/InternVL3_5-1B) +- [12/2025] Added support for Olmo Model [allenai/OLMo-2-0425-1B](https://huggingface.co/allenai/OLMo-2-0425-1B) +- [10/2025] Added support for Qwen3 MOE Model [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) +- [10/2025] Added support for Qwen2.5VL Multi-Model [Qwen/Qwen2.5-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct) +- [10/2025] Added support for Mistral3 Multi-Model [mistralai/Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503) +- [10/2025] Added support for Molmo Multi-Model [allenai/Molmo-7B-D-0924](https://huggingface.co/allenai/Molmo-7B-D-0924) +
More +- [06/2025] Added support for Llama4 Multi-Model [meta-llama/Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct) +- [06/2025] Added support for Gemma3 Multi-Modal-Model [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it) +- [06/2025] Added support of model `hpcai-tech/grok-1` [hpcai-tech/grok-1](https://huggingface.co/hpcai-tech/grok-1) +- [06/2025] Added support for sentence embedding which improves efficiency, Flexible/Custom Pooling configuration and compilation with multiple sequence lengths, [Embedding model](https://github.com/quic/efficient-transformers/pull/424) - [04/2025] Support for [SpD, multiprojection heads](https://quic.github.io/efficient-transformers/source/quick_start.html#draft-based-speculative-decoding). Implemented post-attention hidden size projections to speculate tokens ahead of the base model - [04/2025] [QNN Compilation support](https://github.com/quic/efficient-transformers/pull/374) for AutoModel classes. QNN compilation capabilities for multi-models, embedding models and causal models. - [04/2025] Added support for separate prefill and decode compilation for encoder (vision) and language models. This feature will be utilized for [disaggregated serving](https://github.com/quic/efficient-transformers/pull/365). diff --git a/docs/source/qeff_autoclasses.md b/docs/source/qeff_autoclasses.md index 1b1d8657d..7ec21b97b 100644 --- a/docs/source/qeff_autoclasses.md +++ b/docs/source/qeff_autoclasses.md @@ -115,3 +115,23 @@ .. automethod:: QEfficient.transformers.models.modeling_auto.QEFFAutoModelForSpeechSeq2Seq.compile .. automethod:: QEfficient.transformers.models.modeling_auto.QEFFAutoModelForSpeechSeq2Seq.generate ``` + +(QEFFAutoModelForCTC)= +## `QEFFAutoModelForCTC` + + +```{eval-rst} +.. autoclass:: QEfficient.transformers.models.modeling_auto.QEFFAutoModelForCTC + :noindex: + :no-members: + :no-show-inheritance: +``` + +### High-Level API + +```{eval-rst} +.. automethod:: QEfficient.transformers.models.modeling_auto.QEFFAutoModelForCTC.from_pretrained +.. automethod:: QEfficient.transformers.models.modeling_auto.QEFFAutoModelForCTC.export +.. automethod:: QEfficient.transformers.models.modeling_auto.QEFFAutoModelForCTC.compile +.. automethod:: QEfficient.transformers.models.modeling_auto.QEFFAutoModelForCTC.generate +``` \ No newline at end of file diff --git a/docs/source/release_docs.md b/docs/source/release_docs.md index 97389e571..c71d13d30 100644 --- a/docs/source/release_docs.md +++ b/docs/source/release_docs.md @@ -1,11 +1,120 @@ +# Efficient Transformer Library - 1.21.0 Release Notes + +Welcome to the official release of **Efficient Transformer Library v1.21.0**! This release introduces advanced attention mechanisms, expanded model support, optimized serving capabilities, and significant improvements to fine-tuning and deployment workflows. + +> ✅ All features and models listed below are available on the [`release/v1.21.0`](https://github.com/quic/efficient-transformers/tree/release/v1.21.0) branch and [`mainline`](https://github.com/quic/efficient-transformers/tree/main). 
+
+---
+
+## Newly Supported Models
+
+- **Flux (Diffusers - Image Generation)**
+  - Diffusion-based image generation model
+  - [Flux.1 Schnell Example Script](https://github.com/quic/efficient-transformers/blob/main/examples/diffusers/flux/flux_1_schnell.py)
+
+- **WAN (Diffusers - Video Generation)**
+  - Diffusion-based video generation model (Wan 2.2)
+  - [Wan_lightning Example Script](https://github.com/quic/efficient-transformers/blob/main/examples/diffusers/wan/wan_lightning.py)
+
+- **Qwen2.5-VL (Vision Language)**
+  - Executable via [`QEFFAutoModelForImageTextToText`](#QEFFAutoModelForImageTextToText)
+  - Multi-image prompt support
+  - Continuous batching enabled
+  - [Qwen2.5-VL Usage Guide](https://github.com/quic/efficient-transformers/tree/main/examples/image_text_to_text/models/qwen_vl)
+
+- **Mistral 3.1 (24B)**
+  - Executable via [`QEFFAutoModelForImageTextToText`](#QEFFAutoModelForImageTextToText)
+  - [Mistral-3.1 Example Script](https://github.com/quic/efficient-transformers/blob/main/examples/image_text_to_text/models/mistral_vision/mistral3_example.py)
+
+- **GPT-OSS (Decoder-Only)**
+  - Executable via [`QEffAutoModelForCausalLM`](#QEffAutoModelForCausalLM)
+  - Separate prefill and decode compilation supported
+  - Disaggregated serving ready
+  - [GPT-OSS Example Scripts](https://github.com/quic/efficient-transformers/blob/main/examples/disagg_serving/gpt_oss_disagg_mode.py)
+
+- **Olmo2**
+  - Executable via [`QEffAutoModelForCausalLM`](#QEffAutoModelForCausalLM)
+  - Full CausalLM support with optimizations
+  - Refer to [Text generation Example Scripts](https://github.com/quic/efficient-transformers/tree/main/examples/text_generation) for usage details; a minimal usage sketch also follows this list.
+
+- **Molmo**
+  - Executable via [`QEffAutoModelForCausalLM`](#QEffAutoModelForCausalLM)
+  - Multi-modal capabilities
+  - [Molmo Example Script](https://github.com/quic/efficient-transformers/blob/main/examples/image_text_to_text/models/molmo/molmo_example.py)
+
+- **InternVL 3.5 Series**
+  - Executable via [`QEffAutoModelForCausalLM`](#QEffAutoModelForCausalLM)
+  - Full Vision-Language support
+  - Multi-image handling with continuous batching
+  - Refer to [InternVL 3.5 Example Scripts](https://github.com/quic/efficient-transformers/tree/main/examples/image_text_to_text/models/internvl) for usage details.
+
+- **Qwen3-MOE (Mixture of Experts)**
+  - Executable via [`QEffAutoModelForCausalLM`](#QEffAutoModelForCausalLM)
+  - Efficient expert routing
+  - [Qwen3-MOE Example Scripts](https://github.com/quic/efficient-transformers/blob/main/examples/text_generation/moe_inference.py)
+
+- **Wav2Vec2 (Audio)**
+  - Executable via [`QEFFAutoModelForCTC`](#QEFFAutoModelForCTC)
+  - Speech recognition and audio feature extraction
+  - [Wav2Vec2 Example Scripts](https://github.com/quic/efficient-transformers/blob/main/examples/audio/wav2vec2_inference.py)
+
+- **Multilingual-e5-Large (Embedding Model)**
+  - Executable via [`QEffAutoModel`](#QEffAutoModel)
+  - Multilingual text embedding capabilities
+  - Refer to the [usage details](https://github.com/quic/efficient-transformers/tree/main/examples/embeddings) for more information.
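For the models above that are marked as executable via the CausalLM auto class, a minimal usage sketch following the library's documented CausalLM flow is shown below. The model name and `num_cores` value are illustrative; adjust compile arguments for your target device.

```python
from transformers import AutoTokenizer

from QEfficient import QEFFAutoModelForCausalLM

model_name = "allenai/OLMo-2-0425-1B"  # any supported CausalLM checkpoint from the list above

# Export to ONNX and compile for Cloud AI 100 (num_cores is illustrative)
model = QEFFAutoModelForCausalLM.from_pretrained(model_name)
model.compile(num_cores=16)

# Run text generation on the compiled QPC
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.generate(prompts=["What is disaggregated serving?"], tokenizer=tokenizer)
```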
+
+---
+
+## Key Features & Enhancements
+
+- **Framework Upgrades**: Transformers `4.55`, PyTorch `2.7.0+cpu`, Torchvision `0.22.0+cpu`
+- **Python Support**: Requires Python `3.10`
+- **ONNX Opset**: Updated to version `17` for broader operator support
+- **Advanced Attention**: Attention blocking for Flux and BlockedKV attention for CausalLM models
+- **Diffusers Integration**: Full support for diffusers-based image generation and video generation models
+- **Compute-Context-Length (CCL) support**: Optimizes throughput when handling very large context lengths
+- **Prefill/Decode Separation**: Separate prefill and decode compilation for GPT-OSS, enabling disaggregated serving
+- **Continuous Batching (VLMs)**: Extended to Vision Language Models with multi-image handling
+- **ONNX Sub-Functions**: Enables more efficient model compilation and execution on hardware
+- **Memory Profiling**: Built-in utilities for optimization analysis
+- **Extended On-Device Sampling**: On-device sampling extended to dual-QPC VLMs, plus guided decoding for on-device sampling
+- **ONNX transform, memory & time optimizations**: Faster ONNX transforms with a reduced memory footprint
+- **Removed platform SDK dependency**: Supports QPC generation on systems without the Platform SDK
+- **Example Scripts Revamp**: New example scripts for audio, embeddings, and image-text-to-text tasks
+- **Onboarding Guide**: Simplified setup and deployment process for new users
+
+---
+
+## Embedding Model Upgrades
+
+- **Multi-Sequence Length Support**: Auto-selects the optimal graph at runtime
+- **Enhanced Pooling**: Flexible pooling strategies for various embedding tasks
+
+---
+
+## Fine-Tuning Support
+
+- **Checkpoint Management**: Resume from epochs with proper state restoration
+- **Enhanced Loss Tracking**: Corrected data type handling for accurate loss computation
+- **Custom Dataset Support**: Improved handling with better tokenization
+- **Device-Aware Scaling**: Optimized GradScaler for multi-device training
+- **Comprehensive Testing**: Unit tests for fine-tuning workflows
+
+---
+
 # Efficient Transformer Library - 1.20.0 Release Notes
-Welcome to the official release of **Efficient Transformer Library v1.20.0**! This release brings a host of new model integrations, performance enhancements, and fine-tuning capabilities to accelerate your AI development.
+Welcome to the official release of **Efficient Transformer Library v1.20.0**! This release introduces advanced attention mechanisms, expanded model support, optimized serving capabilities, and significant improvements to fine-tuning and deployment workflows.
-> ✅ All features and models listed below are available on the [`release/1.20.0`](https://github.com/quic/efficient-transformers/tree/release/v1.20.0) branch and [`mainline`](https://github.com/quic/efficient-transformers/tree/main).
+> ✅ All features and models listed below are available on the [`release/v1.20.0`](https://github.com/quic/efficient-transformers/tree/release/v1.20.0) branch and [`mainline`](https://github.com/quic/efficient-transformers/tree/main).
 ---
+
 ## Newly Supported Models

 - **Llama-4-Scout-17B-16E-Instruct**
diff --git a/docs/source/supported_features.rst b/docs/source/supported_features.rst
index 8260342f2..24551e904 100644
--- a/docs/source/supported_features.rst
+++ b/docs/source/supported_features.rst
@@ -6,6 +6,14 @@ Supported Features
    * - Feature
      - Impact
+   * - `Diffusion Models `_
+     - Full support for diffusers-based image and video generation models such as FLUX.1 and Wan 2.2, enabling efficient image and video synthesis tasks.
+   * - `Disaggregated Serving for GPT-OSS `_
+     - Enabled for GPT-OSS models, allowing for flexible deployment of large language models across different hardware configurations.
+   * - `ONNX Sub-Functions `_
+     - Enables more efficient model compilation and execution on hardware.
+   * - `BlockedKV attention in CausalLM `_
+     - Implements a blocked K/V cache layout so attention reads and processes the cache block-by-block, improving long-context decode performance.
    * - `Compute Context Length (CCL) `_
      - Optimizes inference by using different context lengths during prefill and decode phases, reducing memory footprint and computation for shorter sequences while maintaining support for longer contexts. Supports both text-only and vision-language models. Refer `sample script `_ for more **details**.
    * - Sentence embedding, Flexible Pooling configuration and compilation with multiple sequence lengths
@@ -58,5 +66,3 @@ Supported Features
      - A script for computing the perplexity of a model, allowing for the evaluation of model performance and comparison across different models and datasets. Refer `sample script `_ for more **details**.
    * - KV Heads Replication Script
      - A sample script for replicating key-value (KV) heads for the Llama-3-8B-Instruct model, running inference with the original model, replicating KV heads, validating changes, and exporting the modified model to ONNX format. Refer `sample script `_ for more **details**.
-   * - Block Attention (in progress)
-     - Reduces inference latency and computational cost by dividing context into blocks and reusing key-value states, particularly useful in RAG.
diff --git a/docs/source/validate.md b/docs/source/validate.md
index b5ab87629..2c948e175 100644
--- a/docs/source/validate.md
+++ b/docs/source/validate.md
@@ -8,17 +8,20 @@
 | Architecture | Model Family | Representative Models | [vLLM Support](https://quic.github.io/cloud-ai-sdk-pages/latest/Getting-Started/Installation/vLLM/vLLM/index.html) |
 |-------------------------|--------------------|--------------------------------------------------------------------------------------|--------------|
-| **FalconForCausalLM** | Falcon** | [tiiuae/falcon-40b](https://huggingface.co/tiiuae/falcon-40b) | ✔️ |
+| **MolmoForCausalLM** | Molmo① | [allenai/Molmo-7B-D-0924](https://huggingface.co/allenai/Molmo-7B-D-0924) | ✕ |
+| **Olmo2ForCausalLM** | OLMo-2 | [allenai/OLMo-2-0425-1B](https://huggingface.co/allenai/OLMo-2-0425-1B) | ✕ |
+| **FalconForCausalLM** | Falcon② | [tiiuae/falcon-40b](https://huggingface.co/tiiuae/falcon-40b) | ✔️ |
 | **Qwen3MoeForCausalLM** | Qwen3Moe | [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) | ✕ |
 | **GemmaForCausalLM** | CodeGemma | [google/codegemma-2b](https://huggingface.co/google/codegemma-2b)
[google/codegemma-7b](https://huggingface.co/google/codegemma-7b) | ✔️ | -| | Gemma*** | [google/gemma-2b](https://huggingface.co/google/gemma-2b)
[google/gemma-7b](https://huggingface.co/google/gemma-7b)
[google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b)
[google/gemma-2-9b](https://huggingface.co/google/gemma-2-9b)
[google/gemma-2-27b](https://huggingface.co/google/gemma-2-27b) | ✔️ | +| | Gemma③ | [google/gemma-2b](https://huggingface.co/google/gemma-2b)
[google/gemma-7b](https://huggingface.co/google/gemma-7b)
[google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b)
[google/gemma-2-9b](https://huggingface.co/google/gemma-2-9b)
[google/gemma-2-27b](https://huggingface.co/google/gemma-2-27b) | ✔️ | +| **GptOssForCausalLM** | GPT-OSS | [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) | ✔️ | | **GPTBigCodeForCausalLM** | Starcoder1.5 | [bigcode/starcoder](https://huggingface.co/bigcode/starcoder) | ✔️ | | | Starcoder2 | [bigcode/starcoder2-15b](https://huggingface.co/bigcode/starcoder2-15b) | ✔️ | | **GPTJForCausalLM** | GPT-J | [EleutherAI/gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6b) | ✔️ | | **GPT2LMHeadModel** | GPT-2 | [openai-community/gpt2](https://huggingface.co/openai-community/gpt2) | ✔️ | | **GraniteForCausalLM** | Granite 3.1 | [ibm-granite/granite-3.1-8b-instruct](https://huggingface.co/ibm-granite/granite-3.1-8b-instruct)
[ibm-granite/granite-guardian-3.1-8b](https://huggingface.co/ibm-granite/granite-guardian-3.1-8b) | ✔️ | | | Granite 20B | [ibm-granite/granite-20b-code-base-8k](https://huggingface.co/ibm-granite/granite-20b-code-base-8k)
[ibm-granite/granite-20b-code-instruct-8k](https://huggingface.co/ibm-granite/granite-20b-code-instruct-8k) | ✔️ | -| **InternVLChatModel** | Intern-VL | [OpenGVLab/InternVL2_5-1B](https://huggingface.co/OpenGVLab/InternVL2_5-1B) | ✔️ | | | +| **InternVLChatModel** | Intern-VL① | [OpenGVLab/InternVL2_5-1B](https://huggingface.co/OpenGVLab/InternVL2_5-1B)
[OpenGVLab/InternVL3_5-1B](https://huggingface.co/OpenGVLab/InternVL3_5-1B) | ✔️ | | | | **LlamaForCausalLM** | CodeLlama | [codellama/CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf)
[codellama/CodeLlama-13b-hf](https://huggingface.co/codellama/CodeLlama-13b-hf)
[codellama/CodeLlama-34b-hf](https://huggingface.co/codellama/CodeLlama-34b-hf) | ✔️ | | | DeepSeek-R1-Distill-Llama | [deepseek-ai/DeepSeek-R1-Distill-Llama-70B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B) | ✔️ | | | InceptionAI-Adapted | [inceptionai/jais-adapted-7b](https://huggingface.co/inceptionai/jais-adapted-7b)
[inceptionai/jais-adapted-13b-chat](https://huggingface.co/inceptionai/jais-adapted-13b-chat)
[inceptionai/jais-adapted-70b](https://huggingface.co/inceptionai/jais-adapted-70b) | ✔️ | @@ -31,13 +34,15 @@ | **MistralForCausalLM** | Mistral | [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) | ✔️ | | **MixtralForCausalLM** | Codestral
Mixtral | [mistralai/Codestral-22B-v0.1](https://huggingface.co/mistralai/Codestral-22B-v0.1)
[mistralai/Mixtral-8x7B-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) | ✔️ | | **MPTForCausalLM** | MPT | [mosaicml/mpt-7b](https://huggingface.co/mosaicml/mpt-7b) | ✔️ | -| **Phi3ForCausalLM** | Phi-3**, Phi-3.5** | [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) | ✔️ | +| **Phi3ForCausalLM** | Phi-3②, Phi-3.5② | [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) | ✔️ | | **QwenForCausalLM** | DeepSeek-R1-Distill-Qwen | [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) | ✔️ | | | Qwen2, Qwen2.5 | [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) | ✔️ | | **LlamaSwiftKVForCausalLM** | swiftkv | [Snowflake/Llama-3.1-SwiftKV-8B-Instruct](https://huggingface.co/Snowflake/Llama-3.1-SwiftKV-8B-Instruct) | ✔️ | -| **Grok1ModelForCausalLM** | grok-1 | [hpcai-tech/grok-1](https://huggingface.co/hpcai-tech/grok-1) | ✕ | -- ** set "trust-remote-code" flag to True for e2e inference with vLLM -- *** pass "disable-sliding-window" flag for e2e inference of Gemma-2 family of models with vLLM +| **Grok1ModelForCausalLM** | grok-1② | [hpcai-tech/grok-1](https://huggingface.co/hpcai-tech/grok-1) | ✕ | + + +--- + ## Embedding Models ### Text Embedding Task @@ -47,12 +52,14 @@ |--------------|--------------|---------------------------------|--------------| | **BertModel** | BERT-based | [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5)
[BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5)
[BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
[e5-large-v2](https://huggingface.co/intfloat/e5-large-v2) | ✔️ | | **MPNetForMaskedLM** | MPNet | [sentence-transformers/multi-qa-mpnet-base-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-cos-v1) | ✕ | -| **MistralModel** | Mistral | [e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) | ✕ | -| **NomicBertModel** | NomicBERT | [nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) | ✕ | -| **Qwen2ForCausalLM** | Qwen2 | [stella_en_1.5B_v5](https://huggingface.co/NovaSearch/stella_en_1.5B_v5) | ✔️ | +| **MistralModel** | Mistral | [intfloat/e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) | ✕ | +| **NomicBertModel** | NomicBERT② | [nomic-ai/nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) | ✕ | +| **Qwen2ForCausalLM** | Qwen2 | [NovaSearch/stella_en_1.5B_v5](https://huggingface.co/NovaSearch/stella_en_1.5B_v5) | ✔️ | | **RobertaModel** | RoBERTa | [ibm-granite/granite-embedding-30m-english](https://huggingface.co/ibm-granite/granite-embedding-30m-english)
[ibm-granite/granite-embedding-125m-english](https://huggingface.co/ibm-granite/granite-embedding-125m-english) | ✔️ |
| **XLMRobertaForSequenceClassification** | XLM-RoBERTa | [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) | ✕ |
-| **XLMRobertaModel** | XLM-RoBERTa |[ibm-granite/granite-embedding-107m-multilingual](https://huggingface.co/ibm-granite/granite-embedding-107m-multilingual)<br>
[ibm-granite/granite-embedding-278m-multilingual](https://huggingface.co/ibm-granite/granite-embedding-278m-multilingual) | ✔️ | +| **XLMRobertaModel** | XLM-RoBERTa |[ibm-granite/granite-embedding-107m-multilingual](https://huggingface.co/ibm-granite/granite-embedding-107m-multilingual)
[ibm-granite/granite-embedding-278m-multilingual](https://huggingface.co/ibm-granite/granite-embedding-278m-multilingual)
[intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | ✔️ | + +--- ## Multimodal Language Models @@ -65,8 +72,10 @@ | **MllamaForConditionalGeneration** | Llama 3.2 | [meta-llama/Llama-3.2-11B-Vision Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)
[meta-llama/Llama-3.2-90B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct) | ✔️ | ✔️ | ✔️ | ✔️ | | **LlavaNextForConditionalGeneration** | Granite Vision | [ibm-granite/granite-vision-3.2-2b](https://huggingface.co/ibm-granite/granite-vision-3.2-2b) | ✕ | ✔️ | ✕ | ✔️ | | **Llama4ForConditionalGeneration** | Llama-4-Scout | [Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct) | ✔️ | ✔️ | ✔️ | ✔️ | -| **Gemma3ForConditionalGeneration** | Gemma3*** | [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it) | ✔️ | ✔️ | ✔️ | ✕ | -- *** pass "disable-sliding-window" flag for e2e inference with vLLM +| **Gemma3ForConditionalGeneration** | Gemma3③ | [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it) | ✔️ | ✔️ | | | +| **Qwen2_5_VLForConditionalGeneration** | Qwen2.5-VL | [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) | ✔️ | ✔️ | | | +| **Mistral3ForConditionalGeneration** | Mistral3| [mistralai/Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503)| ✔️ | ✔️ | | | + **Dual QPC:** @@ -84,26 +93,56 @@ In the single QPC(Qualcomm Program Container) setup, the entire model—includin -**Note:** +```{NOTE} The choice between Single and Dual QPC is determined during model instantiation using the `kv_offload` setting. If the `kv_offload` is set to `True` it runs in dual QPC and if its set to `False` model runs in single QPC mode. +``` ---- ### Audio Models (Automatic Speech Recognition) - Transcription Task + **QEff Auto Class:** `QEFFAutoModelForSpeechSeq2Seq` | Architecture | Model Family | Representative Models | vLLM Support | |--------------|--------------|----------------------------------------------------------------------------------------|--------------| | **Whisper** | Whisper | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)
[openai/whisper-base](https://huggingface.co/openai/whisper-base)
[openai/whisper-small](https://huggingface.co/openai/whisper-small)
[openai/whisper-medium](https://huggingface.co/openai/whisper-medium)
[openai/whisper-large](https://huggingface.co/openai/whisper-large)
[openai/whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) | ✔️ | +| **Wav2Vec2** | Wav2Vec2 | [facebook/wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base)
[facebook/wav2vec2-large](https://huggingface.co/facebook/wav2vec2-large) | |
+
+---
+
+## Diffusion Models
+
+### Image Generation Models
+**QEff Auto Class:** `QEffFluxPipeline`
+
+| Architecture | Model Family | Representative Models | vLLM Support |
+|--------------|--------------|----------------------------------------------------------------------------------------|--------------|
+| **FluxPipeline** | FLUX.1 | [black-forest-labs/FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) | |
+
+### Video Generation Models
+**QEff Auto Class:** `QEffWanPipeline`
+
+| Architecture | Model Family | Representative Models | vLLM Support |
+|--------------|--------------|----------------------------------------------------------------------------------------|--------------|
+| **WanPipeline** | Wan2.2 | [Wan-AI/Wan2.2-T2V-A14B-Diffusers](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B-Diffusers) | |
+
+---
+
+```{NOTE}
+① Intern-VL and Molmo models are Vision-Language Models but use `QEFFAutoModelForCausalLM` for inference to stay compatible with HuggingFace Transformers.
+
+② Set `trust_remote_code=True` for end-to-end inference with vLLM.
+
+③ Pass `disable_sliding_window` for the Gemma model families marked above when using vLLM.
+```
+---
 (models_coming_soon)=
 # Models Coming Soon

 | Architecture | Model Family | Representative Models |
 |-------------------------|--------------|--------------------------------------------|
-| **Qwen3MoeForCausalLM** |Qwen3| [Qwen/Qwen3-MoE-15B-A2B]() |
-| **Mistral3ForConditionalGeneration**|Mistral 3.1| [mistralai/Mistral-Small-3.1-24B-Base-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503) |
-| **BaichuanForCausalLM** | Baichuan2 | [baichuan-inc/Baichuan2-7B-Base](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base) |
-| **CohereForCausalLM** | Command-R | [CohereForAI/c4ai-command-r-v01](https://huggingface.co/CohereForAI/c4ai-command-r-v01) |
-| **DbrxForCausalLM** | DBRX | [databricks/dbrx-base](https://huggingface.co/databricks/dbrx-base) |
\ No newline at end of file
+| **NemotronHForCausalLM** | NVIDIA Nemotron v3 | [NVIDIA Nemotron v3](https://huggingface.co/collections/nvidia/nvidia-nemotron-v3) |
+| **Sam3Model** | facebook/sam3 | [facebook/sam3](https://huggingface.co/facebook/sam3) |
+| **StableDiffusionModel** | HiDream-ai | [HiDream-ai/HiDream-I1-Full](https://huggingface.co/HiDream-ai/HiDream-I1-Full) |
+| **MistralLarge3Model** | Mistral Large 3 | [mistralai/mistral-large-3](https://huggingface.co/collections/mistralai/mistral-large-3) |
\ No newline at end of file
diff --git a/examples/README.md b/examples/README.md
index 3913b25ce..ed2779fdf 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -72,6 +72,14 @@ Optimization techniques.
 [See all performance examples →](performance/)
+### Disaggregated Serving
+Distributed inference across multiple devices.
+
+| Example | Description | Script |
+|---------|-------------|--------|
+| Basic Disaggregated Serving | Multi-device serving | [disagg_serving/gpt_oss_disagg_mode.py](disagg_serving/gpt_oss_disagg_mode.py) |
+| Chunking Disaggregated Serving | Multi-device serving with prefill chunking | [disagg_serving/gpt_oss_disagg_mode_with_chunking.py](disagg_serving/gpt_oss_disagg_mode_with_chunking.py) |
+
 ## Installation
 For installation instructions, see the [Quick Installation guide](../README.md#quick-installation) in the main README.
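Relating to the `kv_offload` note in the Multimodal Language Models section of validate.md above, the sketch below illustrates how the Single vs. Dual QPC mode is selected at instantiation. It is a minimal sketch assuming the documented `kv_offload` flag of `from_pretrained`; the model name and compile arguments are illustrative and additional compile options (context length, image size, device count) may be required for a given model.

```python
from QEfficient import QEFFAutoModelForImageTextToText

# kv_offload=True  -> dual QPC: vision encoder and language model compiled as separate QPCs
# kv_offload=False -> single QPC: the entire model runs in one QPC
model = QEFFAutoModelForImageTextToText.from_pretrained(
    "ibm-granite/granite-vision-3.2-2b",  # illustrative; any supported VLM from the table above
    kv_offload=True,
)
model.compile(num_cores=16)  # illustrative compile settings
```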
diff --git a/examples/text_generation/README.md b/examples/text_generation/README.md index 6b80442c2..2d8754768 100644 --- a/examples/text_generation/README.md +++ b/examples/text_generation/README.md @@ -24,6 +24,7 @@ Popular model families include: - GPT-2, GPT-J - Falcon, MPT, Phi-3 - Granite, StarCoder +- OLMo 2 ---
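The QEFFAutoModelForCTC docstring updated at the top of this diff begins a torchaudio-based example; the sketch below fills out that flow for the newly supported Wav2Vec2 models. The `generate` call signature and the audio file path are assumptions for illustration, not the library's confirmed API — see the `examples/audio/wav2vec2_inference.py` script referenced in the release notes for the exact call.

```python
import torchaudio
from transformers import AutoProcessor

from QEfficient import QEFFAutoModelForCTC

model_id = "facebook/wav2vec2-base-960h"
processor = AutoProcessor.from_pretrained(model_id)

# Export and compile the encoder for Cloud AI 100 (compile arguments are illustrative)
model = QEFFAutoModelForCTC.from_pretrained(model_id)
model.compile(num_cores=16)

# Load a 16 kHz mono clip; Wav2Vec2 expects raw waveform input ("sample.wav" is a placeholder)
waveform, sample_rate = torchaudio.load("sample.wav")

# Assumed generate interface for alignment-free (CTC) transcription — verify against
# the QEFFAutoModelForCTC example in examples/audio/wav2vec2_inference.py
transcription = model.generate(processor=processor, inputs=waveform.squeeze(0).numpy())
print(transcription)
```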