
Commit e876275

init
1 parent c2e5ece commit e876275

2 files changed: +193 -0 lines changed

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -72,6 +72,8 @@
   title: Accelerate inference
 - local: optimization/cache
   title: Caching
+- local: optimization/attention_backends
+  title: Attention backends
 - local: optimization/memory
   title: Reduce memory usage
 - local: optimization/speed-memory-optims
Lines changed: 191 additions & 0 deletions
@@ -0,0 +1,191 @@
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# Attention backends

Through its *attention dispatcher*, Diffusers provides several optimized attention algorithms that are more memory- and compute-efficient. The dispatcher acts as a router for managing and switching between different attention implementations and provides a unified interface for interacting with them.

Available attention implementations include the following.

| attention family | main feature |
|---|---|
| FlashAttention | minimizes memory reads/writes through tiling and recomputation |
| SageAttention | quantizes attention to int8 |
| PyTorch native | built-in PyTorch implementation using [scaled_dot_product_attention](./fp16#scaled-dot-product-attention) |
| xFormers | memory-efficient attention with support for various attention kernels |

This guide shows you how to use the dispatcher to set and use the different attention backends.
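
To make the router idea concrete, here is a minimal pure-Python sketch of a dispatcher. It is not the Diffusers implementation: `register_backend`, `set_backend`, `get_backend`, `dispatch_attention`, and the two toy backends are hypothetical names that stand in for real attention kernels.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a backend name to a callable.
_REGISTRY: Dict[str, Callable[[float, float], float]] = {}
_active = "native"


def register_backend(name: str):
    """Decorator that records an implementation under `name`."""
    def wrap(fn):
        _REGISTRY[name] = fn
        return fn
    return wrap


@register_backend("native")
def _native(q: float, k: float) -> float:
    return q * k


@register_backend("fast")
def _fast(q: float, k: float) -> float:
    # Stand-in for an optimized kernel: same result, different code path.
    return k * q


def set_backend(name: str) -> None:
    """Select which registered implementation future calls are routed to."""
    global _active
    if name not in _REGISTRY:
        raise ValueError(f"Unknown backend: {name!r}")
    _active = name


def get_backend() -> str:
    return _active


def dispatch_attention(q: float, k: float) -> float:
    """Single entry point; callers never reference a backend directly."""
    return _REGISTRY[_active](q, k)
```

Callers only ever use `dispatch_attention`, so swapping the implementation is a one-line `set_backend("fast")` call and requires no change to the calling code.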

## FlashAttention

[FlashAttention](https://github.com/Dao-AILab/flash-attention) reduces memory traffic by making better use of fast on-chip shared memory (SRAM) instead of repeatedly reading from and writing to global GPU memory, so the data doesn't have to travel as far. The latest variant, FlashAttention-3, is further optimized for modern GPUs (Hopper/Blackwell); it also overlaps computation with data movement and handles FP8 attention better.
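
The tiling idea can be illustrated without a GPU. The sketch below is a simplified pure-Python version of the online-softmax trick for a single query: it walks over the keys and values one tile at a time, keeps only a running max, a running normalizer, and a running output, and still matches the naive result. The function names and the `tile` parameter are illustrative assumptions; real kernels work on blocks of queries held in on-chip memory.

```python
import math
from typing import List


def naive_attention(q: List[float], keys: List[List[float]], values: List[List[float]]) -> List[float]:
    """Reference: materializes every score before applying the softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def tiled_attention(q: List[float], keys: List[List[float]], values: List[List[float]], tile: int = 2) -> List[float]:
    """Processes keys/values tile by tile with a running max and normalizer."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized output accumulated so far

    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]

        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]

        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max

    return [a / running_sum for a in acc]
```

Only one tile of scores exists at any time, which is what lets the real kernels avoid writing the full attention matrix to global memory.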

Several FlashAttention variants are available, including variable-length versions and the original FlashAttention. For a full list of supported implementations, check the list [here](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L163).

The example below demonstrates how to enable the `_flash_3_hub` implementation. The [kernels](https://github.com/huggingface/kernels) library lets you use optimized compute kernels from the Hub without compiling them yourself.

Pass the attention backend to the [`~ModelMixin.set_attention_backend`] method.

```py
import torch
from diffusers import QwenImagePipeline

pipeline = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
)
# Route the transformer's attention through the FlashAttention-3 Hub kernel.
pipeline.transformer.set_attention_backend("_flash_3_hub")
```

You can also use the [attention_backend](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L225) context manager to temporarily set an attention backend for a model within the context.

```py
import torch
from diffusers import QwenImagePipeline
from diffusers.models.attention_dispatch import attention_backend

pipeline = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
)
prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""

# The backend applies only inside the `with` block.
with attention_backend("_flash_3_hub"):
    image = pipeline(prompt).images[0]
```
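
Conceptually, a context manager like this records the active backend, switches to the requested one, and restores the previous value on exit, even if an exception is raised. The following is a minimal sketch of that pattern in plain Python. `use_backend`, `current_backend`, and the module-level `_active_backend` variable are hypothetical and are not the Diffusers internals.

```python
from contextlib import contextmanager

_active_backend = "native"  # hypothetical global default


def current_backend() -> str:
    return _active_backend


@contextmanager
def use_backend(name: str):
    """Temporarily switch the active backend, restoring it afterwards."""
    global _active_backend
    previous = _active_backend
    _active_backend = name
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions alike.
        _active_backend = previous
```

The `try`/`finally` is the important design choice: without it, an error during generation would leave the temporary backend active for all later calls.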

To restore the default attention backend, call [`~ModelMixin.reset_attention_backend`].

```py
pipeline.transformer.reset_attention_backend()
```
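
Optimized backends depend on specific hardware and installed packages, so it can be convenient to try a preferred backend and fall back to another. The helper below is a sketch under the assumption that `set_attention_backend` raises an exception when a backend can't be used. `set_first_available` is a hypothetical helper, not part of Diffusers, and `FakeModel` exists only so the snippet runs without a GPU.

```python
from typing import Iterable, Optional


def set_first_available(model, backends: Iterable[str]) -> Optional[str]:
    """Try each backend name in order and return the first one that is accepted."""
    for name in backends:
        try:
            model.set_attention_backend(name)
            return name
        except Exception:  # assumed failure mode: unsupported or missing dependency
            continue
    # Nothing worked: go back to the default backend.
    model.reset_attention_backend()
    return None


class FakeModel:
    """Stand-in for a Diffusers model, used only to demonstrate the helper."""

    def __init__(self, supported):
        self.supported = set(supported)
        self.backend = "default"

    def set_attention_backend(self, name: str) -> None:
        if name not in self.supported:
            raise RuntimeError(f"{name} is not available")
        self.backend = name

    def reset_attention_backend(self) -> None:
        self.backend = "default"
```

With a real pipeline, the call would look like `set_first_available(pipeline.transformer, ["_flash_3_hub", "_native_flash"])`.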

## SageAttention

[SageAttention](https://github.com/thu-ml/SageAttention) quantizes attention by computing the product of the queries (Q) and keys (K) in INT8. The product of the attention probabilities (P) and values (V) is computed in either FP8 or FP16 to minimize error. This significantly increases inference throughput with little to no degradation in output quality.
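
The core idea, quantizing Q and K to INT8 and rescaling the integer dot product back to floating point, can be sketched in plain Python. This is a simplified per-vector symmetric quantization chosen for illustration; SageAttention's actual kernels are more elaborate, and `quantize_int8` and `int8_dot` are hypothetical names.

```python
from typing import List, Tuple


def quantize_int8(x: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization: map values to integers in [-127, 127] plus a scale."""
    max_abs = max(abs(v) for v in x)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(v / scale) for v in x], scale


def int8_dot(q: List[float], k: List[float]) -> float:
    """Approximate q . k using integer arithmetic and a single float rescale."""
    q_int, q_scale = quantize_int8(q)
    k_int, k_scale = quantize_int8(k)
    integer_dot = sum(a * b for a, b in zip(q_int, k_int))  # exact integer math
    return integer_dot * q_scale * k_scale
```

The result differs slightly from the full-precision dot product because each element is rounded to the nearest integer step; the trade-off is that the inner loop runs entirely on small integers.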

There are several SageAttention variants, which differ in whether PV is computed in FP8 or FP16 and in whether the kernel is CUDA- or Triton-based. For a full list of supported implementations, check the list [here](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L182).

The example below uses the `_sage_qk_int8_pv_fp8_cuda` implementation.

```py
import torch
from diffusers import QwenImagePipeline

pipeline = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
)
# INT8 for QK, FP8 for PV, CUDA kernel.
pipeline.transformer.set_attention_backend("_sage_qk_int8_pv_fp8_cuda")
```

You can also use the [attention_backend](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L225) context manager to temporarily set an attention backend for a model within the context.

```py
import torch
from diffusers import QwenImagePipeline
from diffusers.models.attention_dispatch import attention_backend

pipeline = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
)
prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""

# The backend applies only inside the `with` block.
with attention_backend("_sage_qk_int8_pv_fp8_cuda"):
    image = pipeline(prompt).images[0]
```

To restore the default attention backend, call [`~ModelMixin.reset_attention_backend`].

```py
pipeline.transformer.reset_attention_backend()
```

## PyTorch native

PyTorch ships a [native implementation](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) of several optimized attention algorithms, including [FlexAttention](https://pytorch.org/blog/flexattention/), FlashAttention, memory-efficient attention, and a C++ version.

For a full list of supported implementations, check the list [here](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L171).

The example below uses the `_native_flash` implementation.

```py
import torch
from diffusers import QwenImagePipeline

pipeline = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
)
# Use PyTorch's built-in FlashAttention implementation.
pipeline.transformer.set_attention_backend("_native_flash")
```

You can also use the [attention_backend](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L225) context manager to temporarily set an attention backend for a model within the context.

```py
import torch
from diffusers import QwenImagePipeline
from diffusers.models.attention_dispatch import attention_backend

pipeline = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
)
prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""

# The backend applies only inside the `with` block.
with attention_backend("_native_flash"):
    image = pipeline(prompt).images[0]
```

To restore the default attention backend, call [`~ModelMixin.reset_attention_backend`].

```py
pipeline.transformer.reset_attention_backend()
```

## xFormers

[xFormers](https://github.com/facebookresearch/xformers) provides memory-efficient attention algorithms such as sparse attention and block-sparse attention. Pass `xformers` to [`~ModelMixin.set_attention_backend`] to enable it.

```py
import torch
from diffusers import QwenImagePipeline

pipeline = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
)
# Use xFormers' memory-efficient attention.
pipeline.transformer.set_attention_backend("xformers")
```

You can also use the [attention_backend](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L225) context manager to temporarily set an attention backend for a model within the context.

```py
import torch
from diffusers import QwenImagePipeline
from diffusers.models.attention_dispatch import attention_backend

pipeline = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
)
prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""

# The backend applies only inside the `with` block.
with attention_backend("xformers"):
    image = pipeline(prompt).images[0]
```

To restore the default attention backend, call [`~ModelMixin.reset_attention_backend`].

```py
pipeline.transformer.reset_attention_backend()
```
