docs/source/en/training/distributed_inference.md
Lines changed: 5 additions & 11 deletions
@@ -244,7 +244,7 @@ By selectively loading and unloading the models you need at a given stage and sh
Key (K) and value (V) representations communicate between devices using [Ring Attention](https://huggingface.co/papers/2310.01889). This ensures each split sees every other token's K/V. Each GPU computes attention for its local K/V and passes it to the next GPU in the ring. No single GPU holds the full sequence, which reduces communication latency.
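The ring schedule described above can be sketched in a single process. This is an illustrative simulation only, not the diffusers or Ring Attention paper implementation: the chunk lists stand in for GPUs, the hand-rolled online softmax skips the max-subtraction a real kernel uses for numerical stability, and no actual device-to-device communication happens.

```py
import torch


def ring_attention_sketch(q_chunks, k_chunks, v_chunks):
    """Simulate the Ring Attention schedule in one process.

    Each list entry plays the role of one GPU and holds a chunk of shape
    (batch, heads, local_seq_len, head_dim).
    """
    world_size = len(q_chunks)
    scale = q_chunks[0].shape[-1] ** -0.5
    outputs = []
    for rank in range(world_size):
        q = q_chunks[rank]                      # local queries never leave this "GPU"
        num = torch.zeros_like(q)               # running softmax numerator
        den = q.new_zeros(*q.shape[:-1], 1)     # running softmax denominator
        for step in range(world_size):
            # In a real ring, this K/V block is received from the previous GPU
            # and forwarded to the next one; here we simply index into the list.
            src = (rank + step) % world_size
            scores = (q @ k_chunks[src].transpose(-2, -1)) * scale
            weights = scores.exp()              # unshifted exp, fine for a toy example
            num = num + weights @ v_chunks[src]
            den = den + weights.sum(dim=-1, keepdim=True)
        outputs.append(num / den)               # attention output for the local tokens
    return torch.cat(outputs, dim=-2)
```

Splitting random full-sequence `q`, `k`, `v` tensors into equal chunks and comparing the result against `torch.nn.functional.scaled_dot_product_attention` on the unsplit tensors should match up to floating-point error, since the schedule only reorders the softmax accumulation.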
- Call [`parallelize`] on the model and pass a [`ContextParallelConfig`]. The config supports the `ring_degree` argument that determines how many devices to use for Ring Attention.
+ Pass a [`ContextParallelConfig`] to the `parallel_config` argument of the transformer model. The config supports the `ring_degree` argument that determines how many devices to use for Ring Attention.
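As a rough sketch of that step (the model id, the `subfolder`, and the assumption that `from_pretrained` accepts `parallel_config` directly are illustrative; only `parallel_config` and `ring_degree` are named in the text above):

```py
import torch

from diffusers import AutoModel, ContextParallelConfig

# Illustrative only: the model id and loading pattern are assumptions, while
# `parallel_config` and `ring_degree` come from the documentation text above.
transformer = AutoModel.from_pretrained(
    "Qwen/Qwen-Image",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
    parallel_config=ContextParallelConfig(ring_degree=2),  # 2 devices in the ring
)
```

With `ring_degree=2`, the K/V for the sequence is split across two devices that exchange blocks around the ring during attention.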
Use [`~ModelMixin.set_attention_backend`] to switch to a more optimized [attention backend](../optimization/attention_backends). The example below uses the FlashAttention backend.
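For instance (assuming `transformer` was loaded as in the sketch above; the `"flash"` backend identifier is an assumption, so check the supported-backends table referenced below for the exact names):

```py
# Backend name is an assumption; see the attention backends table for valid identifiers.
transformer.set_attention_backend("flash")
```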
@@ -258,32 +258,26 @@ Refer to the table below for the supported attention backends enabled by [`~Mode
```py
import torch
- from diffusers import QwenImagePipeline, ContextParallelConfig, enable_parallelism
+ from diffusers import AutoModel, QwenImagePipeline, ContextParallelConfig