
Implement GPU memory optimization #1

@siriux

When Stable Diffusion was released, the most requested features were about reducing GPU memory requirements, as this makes it usable by more people with cheaper GPUs. I think this is an important feature to implement in diffusers-rs.

Here is a list of the optimizations in the diffusers library: https://github.com/huggingface/diffusers/blob/main/docs/source/optimization/fp16.mdx

The main memory reduction feature, I think, would be allowing half-precision models (fp16), which AFAIK is not yet implemented (but I might have missed the conversion somewhere).

I don't know if there is a better option to set the fp16 requirement in advance, but otherwise just loading the VarStore on the CPU, calling half() on it, and moving it to the GPU with set_device should do the trick. This could be done in the example without touching the library, I think.
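A rough, untested sketch of what I mean in user code (the function name and weights path are placeholders, and it assumes a set_device method is available on VarStore as mentioned above):

```rust
use tch::{nn::VarStore, Device};

// Sketch only: load the fp32 weights on the CPU, convert the whole store to
// fp16, then move it to the GPU.
fn load_unet_fp16(weights_path: &str) -> Result<VarStore, tch::TchError> {
    // Start on the CPU so the full fp32 weights never have to fit in GPU memory.
    let mut vs = VarStore::new(Device::Cpu);
    // ... build the UNet against vs.root() here so the variables exist ...
    vs.load(weights_path)?;
    // Convert every variable in the store to half precision.
    vs.half();
    // Move the now-fp16 weights to the GPU; assumes VarStore::set_device.
    vs.set_device(Device::Cuda(0));
    Ok(vs)
}
```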

Then, the next important thing would be sliced attention. This should be really straightforward: as we can see here, it's just a matter of splitting the normal attention into slices and computing one slice at a time in a loop: https://github.com/huggingface/diffusers/blob/08a6dc8a5840e0cc09e65e71e9647321ab9bb254/src/diffusers/models/attention.py#L526

After that, it's just a matter of exposing a slice_size configuration option on CrossAttention and adding an attention_slice_size setting to everything that uses it. A rough sketch of the slicing itself is shown below.
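Something along these lines is what I have in mind, assuming query/key/value come in as [batch * heads, seq_len, head_dim] tensors as in the Python version (function name and shapes are just illustrative):

```rust
use tch::Tensor;

// Illustrative sketch of sliced attention: compute scaled dot-product
// attention for slice_size rows of the (batch * heads) dimension at a time,
// so the full attention matrix is never materialized all at once.
fn sliced_attention(query: &Tensor, key: &Tensor, value: &Tensor, slice_size: i64) -> Tensor {
    let (batch_heads, _seq_len, head_dim) = query.size3().unwrap();
    let scale = 1.0 / (head_dim as f64).sqrt();
    let mut outputs = Vec::new();
    let mut start = 0;
    while start < batch_heads {
        let len = slice_size.min(batch_heads - start);
        let q = query.narrow(0, start, len);
        let k = key.narrow(0, start, len);
        let v = value.narrow(0, start, len);
        // Standard attention, restricted to this slice only.
        let attn = (q.matmul(&k.transpose(-1, -2)) * scale).softmax(-1, q.kind());
        outputs.push(attn.matmul(&v));
        start += len;
    }
    Tensor::cat(&outputs, 0)
}
```

With a small slice_size the peak memory taken by the attention matrix drops roughly in proportion, at the cost of looping over the slices.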

Supporting Flash Attention from https://github.com/HazyResearch/flash-attention would also be nice, but it's much more complicated, as it needs to be compiled for CUDA and only works on Nvidia cards. Still, it is the main attention-related optimization used by the xformers library, and it provides a very good speedup in many cases.

Finally, the last important missing piece would be allowing some models to be moved to the CPU when not in use (see huggingface/diffusers#850), or even running them on the CPU as needed, leaving only the unet on the GPU (see huggingface/diffusers#537).

They use the accelerate library, but I think this can be implemented directly on top of tch. Just providing a set_device method on all the models should be enough, as everything else can be handled in the example or in user code, for instance along the lines sketched below.
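As a sketch of what I mean (the struct and method names are placeholders, the real models in diffusers-rs may be structured differently, and it again assumes a set_device method on VarStore):

```rust
use tch::{nn::VarStore, Device};

// Placeholder model wrapper: each model owns its VarStore, so moving the
// weights between devices is just a matter of moving the store.
struct TextEncoder {
    vs: VarStore,
    // ... layers built from vs.root() ...
}

impl TextEncoder {
    // Move all of this model's weights to the given device.
    fn set_device(&mut self, device: Device) {
        self.vs.set_device(device);
    }
}
```

User code could then, for example, run the text encoder once on the GPU, call set_device(Device::Cpu) on it, and keep only the unet resident on the GPU during the denoising loop.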

What do you think? I'm planning to play a bit with the diffusers-rs library and Stable Diffusion over the next few weeks, and I can try to implement a few of these optimizations, but my knowledge of ML is still very basic.
