docs/source/en/api/parallel.md (+1, -1)
@@ -11,7 +11,7 @@ specific language governing permissions and limitations under the License. -->
# Parallelism

- Parallelism strategies help speed up diffusion transformers by distributing computations across multiple devices, allowing for faster inference/training times.
+ Parallelism strategies help speed up diffusion transformers by distributing computations across multiple devices, allowing for faster inference/training times. Refer to the [Distributed inference](../training/distributed_inference) guide to learn more.

docs/source/en/training/distributed_inference.md (+12, -14)
@@ -232,19 +232,21 @@ By selectively loading and unloading the models you need at a given stage and sh
[Context parallelism](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=context_parallelism) splits input sequences across multiple GPUs to reduce memory usage. Each GPU processes its own slice of the sequence.

- Key (K) and value (V) representations communicate between devices using [Ring Attention](https://huggingface.co/papers/2310.01889). This ensures each split sees every other token's K/V. Each GPU computes attention for its local K/V and passes it to the next GPU in the ring. No single GPU holds the full sequence, which reduces communication latency.
+ Use [`~ModelMixin.set_attention_backend`] to switch to a more optimized [attention backend](../optimization/attention_backends). The example below uses the FlashAttention backend.

- Pass a [`ContextParallelConfig`]to the `parallel_config` argument of the transformer model. The config supports the `ring_degree` argument that determines how many devices to use for Ring Attention.
+ Refer to the table below for the supported attention backends enabled by [`~ModelMixin.set_attention_backend`].

- Use [`~ModelMixin.set_attention_backend`] to switch to a more optimized [attention backend](../optimization/attention_backends). The example below uses the FlashAttention backend.
+ | attention family | support type | argument |
+ |---|---|---|
+ | native cuDNN | inference and training |`_native_cudnn`|
+ | FlashAttention-2/3 | inference and training |`flash` or `_flash_3`|
+ | SageAttention | inference |`sage`|
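
As a rough sketch of how the `argument` column is used, assuming a recent diffusers release with the attention dispatcher and FlashAttention kernels installed (the checkpoint name is only illustrative):

```py
import torch
from diffusers import AutoModel

# Illustrative checkpoint; any diffusion transformer that supports the
# attention dispatcher should work the same way.
transformer = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)

# "flash" comes from the `argument` column above; swap in "_native_cudnn"
# or "sage" to try the other backends listed in the table.
transformer.set_attention_backend("flash")
```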

- Refer to the table below for the supported attention backends enabled by [`~ModelMixin.set_attention_backend`].
+ ### Ring Attention

- | attention family | support type |
- |---|---|
- | native cuDNN | inference and training |
- | FlashAttention-2/3 | inference and training |
- | SageAttention | inference |
+ Key (K) and value (V) representations communicate between devices using [Ring Attention](https://huggingface.co/papers/2310.01889). This ensures each split sees every other token's K/V. Each GPU computes attention for its local K/V and passes it to the next GPU in the ring. No single GPU holds the full sequence, which reduces communication latency.
+
+ Pass a [`ContextParallelConfig`] to the `parallel_config` argument of the transformer model. The config supports the `ring_degree` argument that determines how many devices to use for Ring Attention.
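
A minimal sketch of how these pieces could fit together, assuming a two-GPU `torchrun` launch; the checkpoint name and the `config=` keyword are assumptions, and the `ContextParallelConfig`/`enable_parallelism` usage follows the surrounding description rather than a verified API signature:

```py
import torch
from diffusers import AutoModel, ContextParallelConfig

# Launch one process per GPU, e.g. `torchrun --nproc-per-node=2 cp_example.py`,
# so that two ranks participate in the ring.
torch.distributed.init_process_group("nccl")
rank = torch.distributed.get_rank()
torch.cuda.set_device(rank)

# Illustrative checkpoint; ring_degree=2 splits the sequence across two GPUs.
transformer = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
transformer.set_attention_backend("flash")
transformer.enable_parallelism(config=ContextParallelConfig(ring_degree=2))
```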

```py
import torch
@@ -292,8 +294,4 @@ Pass the [`ContextParallelConfig`] to [`~ModelMixin.enable_parallelism`].
- Take a look at this [script](https://gist.github.com/sayakpaul/cfaebd221820d7b43fae638b4dfa01ba) for a minimal example of distributed inference with Accelerate.
- - For more details, check out Accelerate's [Distributed inference](https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference#distributed-inference-with-accelerate) guide.
- - The `device_map` argument assign models or an entire pipeline to devices. Refer to the [device placement](../using-diffusers/loading#device-placement) docs for more information.
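
As a rough illustration of the `device_map` behavior mentioned in the last bullet, a minimal sketch, assuming the `"balanced"` placement strategy is available for the pipeline being loaded (the checkpoint name is illustrative):

```py
import torch
from diffusers import DiffusionPipeline

# "balanced" spreads the pipeline's models across the available GPUs.
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    device_map="balanced",
)
image = pipeline("a photo of an astronaut riding a horse").images[0]
```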