
Conversation

@abukhoy (Contributor) commented on Dec 22, 2025:

This PR updates the documentation for the 1.21.0 release.

Note: First Draft

Signed-off-by: Abukhoyer Shaik <[email protected]>

- **Mistral 3.1 (24B)**
- Executable via [`QEffAutoModelForCausalLM`](#QEffAutoModelForCausalLM)
- Production-ready deployment
Contributor: remove

- **Olmo2**
- Executable via [`QEffAutoModelForCausalLM`](#QEffAutoModelForCausalLM)
- Full CausalLM support with optimizations
- Bug fixes included
Contributor: remove

Signed-off-by: Abukhoyer Shaik <[email protected]>
@quic-rishinr (Contributor) commented:

We are missing a few more models: Olmo, Molmo, and Wav2Vec2. @abukhoy, please add these models as well.

@abukhoy marked this pull request as draft on December 23, 2025, 05:31.
Signed-off-by: Abukhoyer Shaik <[email protected]>
## Key Features & Enhancements

- **Framework Upgrades**: Transformers `4.55`, PyTorch `2.7.0+cpu`, Torchvision `0.22.0+cpu`
- **Python Support**: Now requires Python `>=3.9`
Contributor: Better to keep the Python support requirement at 3.10.
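
For anyone verifying the pinned stack locally while the floor is settled, a quick sanity check (illustrative only; the exact Python requirement is still under discussion above):

```python
# Print installed versions to compare against the release notes
# (expected: Transformers 4.55.x, PyTorch 2.7.0+cpu, Torchvision 0.22.0+cpu).
import sys

import torch
import torchvision
import transformers

print("Python      :", sys.version.split()[0])  # release notes: >=3.9 (or 3.10?)
print("Transformers:", transformers.__version__)
print("PyTorch     :", torch.__version__)
print("Torchvision :", torchvision.__version__)
```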

@@ -0,0 +1,84 @@
# Diffuser Classes
Contributor: Can we follow a similar approach to qeff_autoclasses.html? Add small examples and keep only the user-exposed classes. @quic-amitraj, can you suggest?


(QEFFAutoModelForCTC)=
## `QEFFAutoModelForCTC`
Contributor: Please add an example here.
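
For reference, a minimal sketch of what such an example could look like, assuming `QEFFAutoModelForCTC` follows the same `from_pretrained` → `compile` → inference flow as the other QEff auto classes; the checkpoint, compile arguments, and inference helper below are illustrative assumptions, not the documented API:

```python
# Hypothetical sketch -- verify against the actual QEFFAutoModelForCTC docs.
from transformers import AutoProcessor

from QEfficient import QEFFAutoModelForCTC

model_id = "facebook/wav2vec2-base-960h"  # illustrative CTC checkpoint
processor = AutoProcessor.from_pretrained(model_id)

model = QEFFAutoModelForCTC.from_pretrained(model_id)
model.compile(num_cores=16)  # export to ONNX and compile for Cloud AI 100
transcript = model.generate(processor, "sample.wav")  # assumed helper signature
print(transcript)
```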

- Enabled for GPT-OSS models, allowing for flexible deployment of large language models across different hardware configurations.
* - `ONNX Sub-Functions <https://github.com/quic/efficient-transformers/pull/621>`_
- Feature enabling more efficient model compilation and execution on hardware.
* - `Continuous Batching (VLMs) <https://github.com/quic/efficient-transformers/pull/610>`_
Contributor: You can remove Continuous Batching (VLMs).

- Implements a blocked K/V cache layout so attention reads/processes the cache block-by-block, improving long-context decode performance (a generic sketch follows below).
* - `Memory Profiling Tool <https://github.com/quic/efficient-transformers/pull/674>`_
- Adds scripts to profile memory during export/compile/infer (peak usage, cache footprint) for quicker diagnosis. Refer to the `sample scripts <https://github.com/quic/efficient-transformers/tree/main/scripts/memory_profiling>`_ for more details.
* - `ONNX transform, memory & time optimizations <https://github.com/quic/efficient-transformers/pull/640>`_
Contributor: Remove it; it's an optimization, not a standalone feature.
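
For context on the BlockedKV entry above, here is a generic, self-contained sketch (not the PR's implementation) of one decode step that streams over K/V cache blocks with an online softmax, which is the essence of a blocked layout:

```python
import torch


def blocked_decode_attention(q, k_blocks, v_blocks):
    """One decode step over a blocked K/V cache.

    q: (H, 1, D); each cache block: (H, B, D). An online softmax keeps
    only one block's scores live at a time, instead of materializing
    attention over the whole context."""
    H, _, D = q.shape
    scale = D ** -0.5
    m = torch.full((H, 1, 1), float("-inf"))  # running max of scores
    l = torch.zeros((H, 1, 1))                # running softmax normalizer
    acc = torch.zeros((H, 1, D))              # running weighted sum of values
    for k, v in zip(k_blocks, v_blocks):
        s = (q @ k.transpose(-1, -2)) * scale             # (H, 1, B) block scores
        m_new = torch.maximum(m, s.amax(-1, keepdim=True))
        alpha = torch.exp(m - m_new)                      # rescale old partials
        p = torch.exp(s - m_new)
        l = l * alpha + p.sum(-1, keepdim=True)
        acc = acc * alpha + p @ v
        m = m_new
    return acc / l


# Agrees with full attention over the concatenated cache:
H, B, D, n = 8, 64, 64, 4
q = torch.randn(H, 1, D)
ks = [torch.randn(H, B, D) for _ in range(n)]
vs = [torch.randn(H, B, D) for _ in range(n)]
full = torch.softmax(
    (q @ torch.cat(ks, 1).transpose(-1, -2)) * D ** -0.5, -1
) @ torch.cat(vs, 1)
assert torch.allclose(blocked_decode_attention(q, ks, vs), full, atol=1e-5)
```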

- Adds scripts to profile memory during export/compile/infer (peak usage, cache footprint) for quicker diagnosis. Refer to the `sample scripts <https://github.com/quic/efficient-transformers/tree/main/scripts/memory_profiling>`_ for more details.
* - `ONNX transform, memory & time optimizations <https://github.com/quic/efficient-transformers/pull/640>`_
- Adds periodic memory cleanup (e.g., to FP16ClipTransform / SplitTensorsTransform) during large-tensor processing, and avoids redundant external data loading when already present.
* - Onboarding Guide
Contributor: remove

- Extended to Vision Language Models with multi-image handling capabilities, optimizing throughput and latency by dynamically batching requests with varying image counts. Refer to the `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/image_text_to_text/models/granite_vision/continuous_batching.py>`_ for more details.
* - `BlockedKV attention in CausalLM <https://github.com/quic/efficient-transformers/pull/618>`_
- Implements a blocked K/V cache layout so attention reads/processes the cache block-by-block, improving long-context decode performance.
* - `Memory Profiling Tool <https://github.com/quic/efficient-transformers/pull/674>`_
Contributor: remove
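
For readers curious what the memory-profiling scripts referenced above measure, a generic peak-RSS sampler (illustrative; the repo's `scripts/memory_profiling` directory has the project's actual tooling):

```python
# Generic peak-memory sampler; not the repo's script.
import threading
import time

import psutil


def run_with_peak_rss(fn, interval=0.01):
    """Run fn() while sampling this process's resident set size.

    Returns (result, peak_rss_bytes); useful around export/compile/infer calls."""
    proc = psutil.Process()
    peak = proc.memory_info().rss
    done = threading.Event()

    def sampler():
        nonlocal peak
        while not done.is_set():
            peak = max(peak, proc.memory_info().rss)
            time.sleep(interval)

    t = threading.Thread(target=sampler, daemon=True)
    t.start()
    try:
        result = fn()
    finally:
        done.set()
        t.join()
    return result, peak
```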

| Architecture | Model Family | Representative Models | [vLLM Support](https://quic.github.io/cloud-ai-sdk-pages/latest/Getting-Started/Installation/vLLM/vLLM/index.html) |
|-------------------------|--------------------|--------------------------------------------------------------------------------------|--------------|
| **FalconForCausalLM** | Falcon** | [tiiuae/falcon-40b](https://huggingface.co/tiiuae/falcon-40b) | ✔️ |
| **MolmoForCausalLM** | Molmo① | [allenai/Molmo-7B-D-0924](https://huggingface.co/allenai/Molmo-7B-D-0924) ||
Contributor: (screenshot) There are some stray characters coming in these (e.g., `Falcon**`, `Molmo①`). Check all the newly added models.

Contributor: Check with @quic-vargupt to confirm whether these models are supported on vLLM, and update accordingly.

Signed-off-by: Abukhoyer Shaik <[email protected]>