init #2852
base: main
Conversation
Thanks for working on this.
diffusers-quantization.md (Outdated)
**BF16:**

![](bf16_combined.png)
Let's also provide an actual caption for the figure.
diffusers-quantization.md (Outdated)
**BnB 4-bit:**

![](bnb_4bit_combined.png)

**BnB 8-bit:**

![](bnb_8bit_combined.png)
Can we combine the three images here?
- BF16
- 4-bit
- 8-bit

Along with the caption?
diffusers-quantization.md (Outdated)
**BnB 8-bit:**

![](bnb_8bit_combined.png)

| BnB Precision | Memory after loading | Peak memory | Inference time |
Let's include the BF16 numbers too.
| Q8_0 | 21.502 GB | 25.973 GB | 15 seconds |
| Q2_k | 13.264 GB | 17.752 GB | 26 seconds |

**Example (Flux-dev with GGUF Q4_1)**
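For reference, a minimal sketch of what the GGUF Q4_1 loading example could look like. The `city96/FLUX.1-dev-gguf` checkpoint URL, prompt, and step count are assumptions, not taken from the article:

```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

# Assumed community GGUF checkpoint; swap in whichever Q4_1 file the article uses
ckpt_url = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q4_1.gguf"

# Load only the transformer from the GGUF file, dequantizing to BF16 for compute
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_url,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

image = pipe("A serene lake at sunrise", num_inference_steps=28).images[0]
image.save("flux_gguf_q4_1.png")
```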
I don't think we have to be exhaustive about showing snippets for every configuration unless they vary significantly from one another.
For more information check out the [GGUF docs](https://huggingface.co/docs/diffusers/quantization/gguf).

### FP8 Layerwise Casting (`enable_layerwise_casting`)
Could also write that it can be combined with group offloading:
https://huggingface.co/docs/diffusers/en/optimization/memory#group-offloading
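For what it's worth, a minimal sketch of combining the two. Argument names follow the current diffusers memory docs; treat the model ID and offload settings as assumptions:

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Store transformer weights in FP8 and upcast to BF16 only for computation
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)

# Group offloading keeps idle weight groups on the CPU and streams them in when needed
pipe.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
)
```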
diffusers-quantization.md (Outdated)
We created a setup where you can provide a prompt, and we generate results using both the original, high-precision model (e.g., Flux-dev in BF16) and several quantized versions (BnB 4-bit, BnB 8-bit). The generated images are then presented to you and your challenge is to identify which ones came from the quantized models.

Try it out [here](https://huggingface.co/spaces/derekl35/flux-quant)!
Let's embed the space inside the blog post.
diffusers-quantization.md (Outdated)
Here's a quick guide to choosing a quantization backend:

* **Easiest Memory Savings (NVIDIA):** Start with `bitsandbytes` 4/8-bit.
* **Prioritize Inference Speed:** `torchao` + `torch.compile` offers the best performance potential.
GGUF also supports `torch.compile()`. So does bitsandbytes. I think we should mention that.
* **Simplicity (Hopper/Ada):** Explore FP8 Layerwise Casting (`enable_layerwise_casting`).
* **For Using Existing GGUF Models:** Use GGUF loading (`from_single_file`).

Quantization significantly lowers the barrier to entry for using large diffusion models. Experiment with these backends to find the best balance of memory, speed, and quality for your needs.
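To make the speed bullet concrete, a hedged sketch of `torchao` int8 weight-only quantization plus `torch.compile`. The quant type string, model ID, and compile flags are assumptions; per the comment above, bitsandbytes- and GGUF-quantized transformers can be compiled the same way:

```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, TorchAoConfig

# Quantize the transformer weights to int8 (weight-only) via torchao
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=TorchAoConfig("int8wo"),
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("cuda")

# Compile the quantized transformer for faster repeated inference
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
```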
Should we hint to readers that they can expect a follow-up blog around training with quantization?
Co-authored-by: Sayak Paul <[email protected]>
### bitsandbytes (BnB)

[`bitsandbytes`](https://github.com/bitsandbytes-foundation/bitsandbytes) is a popular and user-friendly library for 8-bit and 4-bit quantization, widely used for LLMs and QLoRA fine-tuning. We can use it for transformer-based diffusion and flow models, too.
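As a reference point for this section, a minimal 4-bit NF4 sketch using the diffusers `BitsAndBytesConfig`. The model ID and exact settings are placeholders; adjust them to match the article:

```python
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

# 4-bit NF4 quantization with BF16 compute
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Quantize only the transformer; the text encoders and VAE stay in BF16
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
```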
Not sure if this is what you meant by combining the images
@ChunTeLee possible to reduce the size of the middle object a bit so that the "exploring" and "quantization" words are clear?
Preparing the Article

`md` file. You can also specify `guest` or `org` for the authors.