
Implement GPU memory optimization #1

@siriux

When Stable Diffusion was released, the most requested features were about reducing GPU memory requirements, as this makes it usable by more people with cheaper GPUs. I think this is an important feature to implement in diffusers-rs.

Here is a list of the optimizations in the diffusers library: https://github.com/huggingface/diffusers/blob/main/docs/source/optimization/fp16.mdx

The main memory reduction feature, I think, would be allowing half-precision models (fp16), which AFAIK is not yet implemented (but I might have missed the conversion somewhere).

I don't know if there is a better option to set the fp16 requirement in advance, but otherwise just loading the VarStore on the CPU, calling half() on it, and moving it to the GPU with set_device should do the trick. This could be done in the example without touching the library, I think.
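A rough, untested sketch of what I mean in user code (the function name and weights path are placeholders, and it assumes a set_device method is available on VarStore as mentioned above):

```rust
use tch::{nn::VarStore, Device};

// Sketch only: load the fp32 weights on the CPU, convert the whole store to
// fp16, then move it to the GPU.
fn load_unet_fp16(weights_path: &str) -> Result<VarStore, tch::TchError> {
    // Start on the CPU so the full fp32 weights never have to fit in GPU memory.
    let mut vs = VarStore::new(Device::Cpu);
    // ... build the UNet against vs.root() here so the variables exist ...
    vs.load(weights_path)?;
    // Convert every variable in the store to half precision.
    vs.half();
    // Move the now-fp16 weights to the GPU; assumes VarStore::set_device.
    vs.set_device(Device::Cuda(0));
    Ok(vs)
}
```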

Then, the next important thing would be sliced attention. This should be really straightforward: as we can see here, it's just a matter of splitting the normal attention into slices and computing one slice at a time in a loop: https://github.com/huggingface/diffusers/blob/08a6dc8a5840e0cc09e65e71e9647321ab9bb254/src/diffusers/models/attention.py#L526

After that, it's just a matter of exposing a slice_size configuration option on CrossAttention and adding an attention_slice_size setting to everything that uses it. A rough sketch of the slicing itself is shown below.
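Something along these lines is what I have in mind, assuming query/key/value come in as [batch * heads, seq_len, head_dim] tensors as in the Python version (function name and shapes are just illustrative):

```rust
use tch::Tensor;

// Illustrative sketch of sliced attention: compute scaled dot-product
// attention for slice_size rows of the (batch * heads) dimension at a time,
// so the full attention matrix is never materialized all at once.
fn sliced_attention(query: &Tensor, key: &Tensor, value: &Tensor, slice_size: i64) -> Tensor {
    let (batch_heads, _seq_len, head_dim) = query.size3().unwrap();
    let scale = 1.0 / (head_dim as f64).sqrt();
    let mut outputs = Vec::new();
    let mut start = 0;
    while start < batch_heads {
        let len = slice_size.min(batch_heads - start);
        let q = query.narrow(0, start, len);
        let k = key.narrow(0, start, len);
        let v = value.narrow(0, start, len);
        // Standard attention, restricted to this slice only.
        let attn = (q.matmul(&k.transpose(-1, -2)) * scale).softmax(-1, q.kind());
        outputs.push(attn.matmul(&v));
        start += len;
    }
    Tensor::cat(&outputs, 0)
}
```

With a small slice_size the peak memory taken by the attention matrix drops roughly in proportion, at the cost of looping over the slices.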

Supporting Flash Attention from https://github.com/HazyResearch/flash-attention would also be nice, but it's much more complicated, as it needs to be compiled for CUDA and only works on Nvidia cards. Still, it is the main attention-related optimization used by the xformers library, and it provides a very good speedup in many cases.

Finally, the last important missing piece would be allowing some models to be moved to the CPU when not in use (see huggingface/diffusers#850), or even running them on the CPU as needed, leaving only the unet on the GPU (see huggingface/diffusers#537).

They use the accelerate library, but I think this can be implemented directly on top of tch. Just providing a set_device method on all the models should be enough, as everything else can be handled in the example or in user code, for instance along the lines sketched below.
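As a sketch of what I mean (the struct and method names are placeholders, the real models in diffusers-rs may be structured differently, and it again assumes a set_device method on VarStore):

```rust
use tch::{nn::VarStore, Device};

// Placeholder model wrapper: each model owns its VarStore, so moving the
// weights between devices is just a matter of moving the store.
struct TextEncoder {
    vs: VarStore,
    // ... layers built from vs.root() ...
}

impl TextEncoder {
    // Move all of this model's weights to the given device.
    fn set_device(&mut self, device: Device) {
        self.vs.set_device(device);
    }
}
```

User code could then, for example, run the text encoder once on the GPU, call set_device(Device::Cpu) on it, and keep only the unet resident on the GPU during the denoising loop.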

What do you think? I'm planning to play a bit with the diffusers-rs library and Stable Diffusion over the next few weeks, and I can try to implement a few of these optimizations, but my knowledge of ML is still very basic.
