Commit b58e74b

committed
feedback
1 parent 0eed9f5 commit b58e74b

2 files changed, +13 -15 lines changed


docs/source/en/api/parallel.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ specific language governing permissions and limitations under the License. -->
# Parallelism

-Parallelism strategies help speed up diffusion transformers by distributing computations across multiple devices, allowing for faster inference/training times.
+Parallelism strategies help speed up diffusion transformers by distributing computations across multiple devices, allowing for faster inference/training times. Refer to the [Distributed inference](../training/distributed_inference) guide to learn more.

## ParallelConfig

docs/source/en/training/distributed_inference.md

Lines changed: 12 additions & 14 deletions
@@ -232,19 +232,21 @@ By selectively loading and unloading the models you need at a given stage and sh
[Context parallelism](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=context_parallelism) splits input sequences across multiple GPUs to reduce memory usage. Each GPU processes its own slice of the sequence.

-Key (K) and value (V) representations communicate between devices using [Ring Attention](https://huggingface.co/papers/2310.01889). This ensures each split sees every other token's K/V. Each GPU computes attention for its local K/V and passes it to the next GPU in the ring. No single GPU holds the full sequence, which reduces communication latency.
+Use [`~ModelMixin.set_attention_backend`] to switch to a more optimized [attention backend](../optimization/attention_backends). The example below uses the FlashAttention backend.

-Pass a [`ContextParallelConfig`] to the `parallel_config` argument of the transformer model. The config supports the `ring_degree` argument that determines how many devices to use for Ring Attention.
+Refer to the table below for the supported attention backends enabled by [`~ModelMixin.set_attention_backend`].

-Use [`~ModelMixin.set_attention_backend`] to switch to a more optimized [attention backend](../optimization/attention_backends). The example below uses the FlashAttention backend.
+| attention family | support type | argument |
+|---|---|---|
+| native cuDNN | inference and training | `_native_cudnn` |
+| FlashAttention-2/3 | inference and training | `flash` or `_flash_3` |
+| SageAttention | inference | `sage` |

-Refer to the table below for the supported attention backends enabled by [`~ModelMixin.set_attention_backend`].
+### Ring Attention

-| attention family | support type |
-|---|---|
-| native cuDNN | inference and training |
-| FlashAttention-2/3 | inference and training |
-| SageAttention | inference |
+Key (K) and value (V) representations communicate between devices using [Ring Attention](https://huggingface.co/papers/2310.01889). This ensures each split sees every other token's K/V. Each GPU computes attention for its local K/V and passes it to the next GPU in the ring. No single GPU holds the full sequence, which reduces communication latency.
+
+Pass a [`ContextParallelConfig`] to the `parallel_config` argument of the transformer model. The config supports the `ring_degree` argument that determines how many devices to use for Ring Attention.

```py
import torch
@@ -292,8 +294,4 @@ Pass the [`ContextParallelConfig`] to [`~ModelMixin.enable_parallelism`].
```py
pipeline.transformer.enable_parallelism(config=ContextParallelConfig(ulysses_degree=2))
-```
-
-- Take a look at this [script](https://gist.github.com/sayakpaul/cfaebd221820d7b43fae638b4dfa01ba) for a minimal example of distributed inference with Accelerate.
-- For more details, check out Accelerate's [Distributed inference](https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference#distributed-inference-with-accelerate) guide.
-- The `device_map` argument assign models or an entire pipeline to devices. Refer to the [device placement](../using-diffusers/loading#device-placement) docs for more information.
+```
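
To make the updated section concrete, below is a minimal sketch of the workflow it describes: pick an attention backend with `set_attention_backend`, then enable Ring Attention by passing a `ContextParallelConfig` to `enable_parallelism`. The pipeline class, checkpoint, prompt, and distributed setup are illustrative assumptions only (including the assumption that `ContextParallelConfig` is importable from the top-level `diffusers` namespace); the backend arguments and the `ContextParallelConfig`/`enable_parallelism` calls follow the documentation text above.

```py
# Hedged sketch: Ring Attention context parallelism across 2 GPUs.
# Launch with: torchrun --nproc_per_node=2 cp_ring_example.py
# The Flux checkpoint and prompt are placeholders for illustration.
import torch
import torch.distributed as dist
from diffusers import ContextParallelConfig, FluxPipeline

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to(f"cuda:{rank}")

# Optional: switch to an optimized attention backend (see the table above),
# e.g. "flash" for FlashAttention-2 or "_flash_3" for FlashAttention-3.
pipeline.transformer.set_attention_backend("flash")

# ring_degree determines how many devices participate in Ring Attention.
pipeline.transformer.enable_parallelism(config=ContextParallelConfig(ring_degree=2))

image = pipeline("a photo of a cat", num_inference_steps=28).images[0]
if rank == 0:
    image.save("output.png")

dist.destroy_process_group()
```

Launched with `torchrun`, each process drives one GPU and the K/V for its sequence slice circulates around the ring; only rank 0 saves the output.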
