Diffusers, Gradio, and the Elusive Memory Leak: A Cautionary Tale (and Solution!) 👻 #10936
Replies: 4 comments 13 replies
-
wow this is such a good report thanks
-
Thanks for your investigation! I remember using …
-
I can provide a simple way to offload models:

```python
import gc


def delete(obj):
    """Null out every reference to `obj` that the garbage collector can see.

    Example:
        >>> a = [1, 2, 3]
        >>> b = a
        >>> delete(a)
    """
    if obj is None:
        return 0, [], 0, []
    i = 0
    _i = 0
    referrers = []
    _referrers = []
    for item in gc.get_referrers(obj):
        if hasattr(item, "__dict__"):
            # Get the correct __dict__ via object.__getattribute__:
            # item.__dict__ may not work when item.__getattribute__ is overridden.
            __dict__ = object.__getattribute__(item, "__dict__")
        elif isinstance(item, dict):
            __dict__ = item
        elif isinstance(item, list):
            for index, element in enumerate(item):
                if element is obj:
                    item[index] = None
                    referrers.append(f"list.{index}")
                    i += 1
            continue
        else:
            # Referrer we don't know how to clear; just record it.
            _referrers.append(id(item))
            _i += 1
            continue
        target_keys = []
        for key, value in __dict__.items():
            if value is obj:
                target_keys.append(key)
                referrers.append(f"dict.{key}")
                i += 1
        for target_key in target_keys:
            __dict__[target_key] = None
    return i, referrers, _i, _referrers
```

Once you want to offload a model, just:

```python
for component in pipeline.components.values():
    delete(component)
```
-
I am using pipe.remove_all_hooks() and didn't face this issue. I remember yiyixuxu mentioned in one of the topics that using remove_all_hooks is the better approach. I have tried generating from multiple models (one after another); I may need to do more testing to see whether remove_all_hooks alone is sufficient.
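For reference, a sketch of that kind of teardown when swapping models (the base model ID here is just an example, not from the original app):

```python
import gc

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

# ... generate images ...

# Teardown before loading the next model:
pipe.remove_all_hooks()        # remove the offload hooks added by enable_model_cpu_offload()
pipe.to("cpu")                 # make sure no weights stay on the GPU
del pipe                       # drop the reference
gc.collect()
torch.cuda.empty_cache()
```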
-
Hey fellow Diffusers and Gradio enthusiasts! 👋
I recently spent way too long debugging a stubborn memory leak in a Gradio app using a Diffusers pipeline (specifically, StableDiffusionXLControlNetUnionImg2ImgPipeline). To save you from the same headache, I’m sharing my journey and the solution I discovered. Let’s dive in! 😅

The Setup
I was building a Gradio app that allowed users to switch between SDXL models (like "RealVisXL 5 Lightning" and "RealVisXL 5") and apply ControlNet. To save VRAM, I used enable_model_cpu_offload() and thought I was doing everything right: moving the pipeline to the CPU, deleting variables, calling gc.collect(), and torch.cuda.empty_cache().

But every time I switched models, memory usage (both GPU VRAM and CPU RAM) crept up and never fully returned to baseline. A classic memory leak, but where was it coming from? 🕵️♀️
The Debugging Saga 🛣️
I tried everything:

- Calling del in every possible way.
- Watching nvidia-smi.

The breakthrough came when I realized the problem wasn’t in the complex parts of the app; it was in a seemingly harmless line of code at the very beginning.
The Culprit: Pre-loading the Pipeline Outside Gradio's Context
I had this (seemingly sensible) code at the global scope:
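Roughly, the pattern was the following (a sketch rather than the exact original code; the model IDs and ControlNet checkpoint are illustrative placeholders):

```python
# Sketch of the problematic pattern: the ControlNet and pipeline are built at
# module scope, before the Gradio UI is even defined.
import torch
import gradio as gr
from diffusers import ControlNetUnionModel, StableDiffusionXLControlNetUnionImg2ImgPipeline

controlnet = ControlNetUnionModel.from_pretrained(
    "xinsir/controlnet-union-sdxl-1.0", torch_dtype=torch.float16
)
pipeline = StableDiffusionXLControlNetUnionImg2ImgPipeline.from_pretrained(
    "SG161222/RealVisXL_V5.0_Lightning", controlnet=controlnet, torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()

with gr.Blocks() as app:
    ...  # UI and event handlers defined here, after the model is already loaded
```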
This was done before the with gr.Blocks() as app: block that defined the Gradio UI. My goal was to pre-load a default model so the app would be ready immediately. This was the mistake!

Why This is a Problem
Gradio apps have their own internal context and event loop. By creating the pipeline instance and loading the model before Gradio was fully initialized, the pipeline and its memory were created outside Gradio's managed environment.

When switching models within Gradio event handlers (like button clicks), cleanup operations (del, moving to CPU, etc.) didn’t fully work because the initial model was loaded in a different context. Gradio, PyTorch, or CUDA itself might have held onto hidden references, preventing proper garbage collection.

The Solution: Initialize Everything Within Gradio's Context
The fix? Ensure the pipeline is created and the initial model is loaded inside a function called after the Gradio UI is defined. Use app.load() for this.

Here’s the corrected structure (minimal, working example):
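A sketch of that structure follows; the model IDs, ControlNet checkpoint, and UI wiring here are illustrative, not a verbatim copy of the original example:

```python
import gc

import gradio as gr
import torch
from diffusers import ControlNetUnionModel, StableDiffusionXLControlNetUnionImg2ImgPipeline

pipeline = None  # created lazily, inside Gradio's managed context


def load_model(model_id: str) -> str:
    """(Re)load the pipeline, cleaning up the previous one first."""
    global pipeline
    if pipeline is not None:
        pipeline.to("cpu")            # move weights off the GPU first
        del pipeline                  # drop the reference explicitly
        pipeline = None
        gc.collect()                  # collect Python objects ...
        torch.cuda.empty_cache()      # ... then release cached CUDA memory
    controlnet = ControlNetUnionModel.from_pretrained(
        "xinsir/controlnet-union-sdxl-1.0", torch_dtype=torch.float16
    )
    pipeline = StableDiffusionXLControlNetUnionImg2ImgPipeline.from_pretrained(
        model_id, controlnet=controlnet, torch_dtype=torch.float16
    )
    pipeline.enable_model_cpu_offload()
    return f"Loaded {model_id}"


with gr.Blocks() as app:
    status = gr.Textbox(label="Status")
    model_choice = gr.Dropdown(
        choices=["SG161222/RealVisXL_V5.0_Lightning", "SG161222/RealVisXL_V5.0"],
        value="SG161222/RealVisXL_V5.0_Lightning",
        label="Model",
    )
    model_choice.change(load_model, inputs=model_choice, outputs=status)
    # Load the default model only once the app starts, inside Gradio's context.
    app.load(lambda: load_model("SG161222/RealVisXL_V5.0_Lightning"), outputs=status)

app.launch()
```

The key design point: app.load() only fires after the Blocks context exists, so both the initial load and every later model switch go through the same cleanup-then-load path inside Gradio's event handling.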
Key Takeaways
Initialization Order Matters! ⏰
Create your pipeline instance and load the initial model within a function called by app.load() in Gradio. This ensures everything happens within Gradio's managed context. In this example, you need to uncomment this call line or use …

Explicit del is Your Friend

When switching models, explicitly del the pipeline object and its components (controlnet, vae, etc.) before creating the new pipeline. Don’t just rely on reassignment.

Move to CPU Before Deletion

Always call .to("cpu") on your pipeline before deleting it. This ensures tensors are moved to CPU memory, which Python's garbage collector can manage.

Monitor Both CPU RAM and GPU VRAM

When using enable_model_cpu_offload(), models move between CPU and GPU. Monitor both CPU RAM (e.g., with psutil) and GPU VRAM (e.g., with nvidia-smi).

gc.collect() and torch.cuda.empty_cache()

These are helpful, but they’re not a substitute for proper reference management with del. Use them after the del operations.
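Putting these last few takeaways together, the switch-time cleanup plus some quick memory reporting might look like this sketch (the helper names report_memory and release_pipeline are mine, not from the original app, and psutil is an extra dependency):

```python
import gc

import psutil
import torch

pipeline = None    # module-level reference, as in the corrected structure above


def report_memory(tag: str) -> None:
    """Print rough CPU RAM and GPU VRAM usage, handy while hunting leaks."""
    cpu_gb = psutil.Process().memory_info().rss / 1024**3
    gpu_gb = torch.cuda.memory_allocated() / 1024**3 if torch.cuda.is_available() else 0.0
    print(f"[{tag}] CPU RSS: {cpu_gb:.2f} GiB | GPU allocated: {gpu_gb:.2f} GiB")


def release_pipeline() -> None:
    """Drop the current pipeline before loading a new model."""
    global pipeline
    report_memory("before release")
    if pipeline is not None:
        pipeline.to("cpu")          # move tensors to CPU so Python's GC can manage them
        del pipeline                # explicit del, not just reassignment
        pipeline = None
        # If you hold separate references to components (controlnet, vae, ...), del those too.
        gc.collect()                # only after the del ...
        torch.cuda.empty_cache()    # ... collect, then release cached CUDA blocks
    report_memory("after release")
```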
Important Considerations
This example demonstrates proper initialization and cleanup but isn’t optimized for performance. In real-world apps, you’ll likely have more complex logic. The key is ensuring all model loading happens within Gradio's event handling context and meticulously cleaning up old references.
I hope this saves you from the same frustration I experienced! Let me know if you have questions—happy coding! 🚀