Possibility to unload/reload model from VRAM/RAM after IDLE timeout #196
Comments
I'd like to point out that it implies energy savings as well.
Wouldn't it be this feature?
Yes, that's the PR I also linked up there.
I have this same problem and would really like this implemented. Can I help at all?
I've found a slimmed-down version of subgen (which generates subtitles for Plex or through Bazarr by connecting to them directly) called slim-bazarr-subgen, which pretty much does this. It only connects to bazarr, uses the latest faster-whisper, and takes about 20 seconds for a 22-minute audio file on an RTX 3090 with large distil v3 and int8_bfloat16.

Disclaimer: I'm not a coder, so I'm just guessing and interpreting from limited knowledge. This slim version seems to use a task-queue approach that more or less "deletes" the model (purges it from VRAM) when it's done with its tasks and then reloads it into VRAM when a new task is queued. The model reload takes only a few seconds on my system (most likely depending on whether you put it on an SSD or in /dev/shm, for example). When it's unloaded, the main process only takes up about ~200 MB of VRAM.

Maybe someone more knowledgeable could take a look at the main script. It doesn't seem overly complicated to implement for someone with more experience. In comparison, it would take me more than a week of fumbling about, and I sadly don't have the resources to take on the responsibility right now, so I'm counting on you kind strangers out there! 🙏 It would be fantastic to have this implemented in whisper-asr! Some excerpts from the main script:
Btw: if you're interested in running slim-bazarr-subgen yourself but are still on Ubuntu 22.04 (I was on 23.10, but the same might apply), here's a modified Dockerfile with an older CUDA version, as you might otherwise run into problems due to the newer libs/drivers not being available:
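The queue-drain unload pattern described above can be sketched roughly as follows. This is a minimal illustration, not slim-bazarr-subgen's actual code: `LazyTranscriber` and `load_model` are placeholder names, and `load_model` stands in for the expensive call that would build e.g. a `faster_whisper.WhisperModel`.

```python
import gc
import queue
import threading

def load_model():
    # Placeholder for the real, expensive load, e.g.
    # faster_whisper.WhisperModel("large-v3", device="cuda",
    #                             compute_type="int8_bfloat16")
    return object()

class LazyTranscriber:
    """Loads the model on the first queued task and frees it once the
    queue drains, so VRAM is only held while work is pending."""

    def __init__(self):
        self.tasks = queue.Queue()
        self.model = None
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, audio_path, done):
        self.tasks.put((audio_path, done))

    def _worker(self):
        while True:
            audio_path, done = self.tasks.get()  # block until work arrives
            if self.model is None:
                self.model = load_model()        # reload into (V)RAM
            done(f"transcribed:{audio_path}")    # stands in for model.transcribe(...)
            self.tasks.task_done()
            if self.tasks.empty():               # queue drained: purge the model
                self.model = None
                gc.collect()                     # with torch: also torch.cuda.empty_cache()
```

The main process keeps running (hence the small residual footprint the comment mentions); only the model object itself is dropped and rebuilt on demand.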
First of all thanks for this great project!
Description
I would like to have an option to set an idle time after which the model is unloaded from RAM/VRAM.
Background:
I have several applications that use the VRAM of my GPU, one of these is LocalAI.
Since I don't have unlimited VRAM, these applications have to share the available memory among themselves.
Luckily, LocalAI has for some time now had a watchdog feature that unloads the model after a specified idle timeout. I'd love to have similar functionality in whisper-asr-webservice.
For now, whisper-asr-webservice occupies about a third of my VRAM even though it is only used from time to time.
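An idle-timeout watchdog of this kind could be sketched as below. This is an illustrative stand-alone sketch, not whisper-asr-webservice or LocalAI code: `IdleUnloader`, `load_model`, and the timeout default are all assumed names/values.

```python
import gc
import threading
import time

def load_model():
    # Placeholder for the real model construction.
    return object()

class IdleUnloader:
    """Hands out the model on demand and unloads it after
    `idle_timeout` seconds without a request; the next request
    triggers a lazy reload."""

    def __init__(self, idle_timeout=300.0):
        self.idle_timeout = idle_timeout
        self._model = None
        self._last_used = time.monotonic()
        self._lock = threading.Lock()
        threading.Thread(target=self._watchdog, daemon=True).start()

    def get(self):
        with self._lock:
            self._last_used = time.monotonic()
            if self._model is None:
                self._model = load_model()  # lazy reload after an unload
            return self._model

    def _watchdog(self):
        while True:
            time.sleep(min(self.idle_timeout, 1.0))
            with self._lock:
                idle = time.monotonic() - self._last_used
                if self._model is not None and idle > self.idle_timeout:
                    self._model = None      # drop the only reference
                    gc.collect()            # with CUDA: also torch.cuda.empty_cache()
```

A request handler would then call `unloader.get().transcribe(...)` instead of holding the model directly, so memory is reclaimed during idle periods at the cost of one reload delay on the first request afterwards.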