Possibility to unload/reload model from VRAM/RAM after IDLE timeout #196
Comments
I'd like to point out that it implies energy savings as well.
Wouldn't it be this feature?
Yes, that's the PR I also linked up there.
I have this same problem and would really like this implemented. Can I help at all?
I've found a slimmed-down version of subgen (which generates subtitles for Plex or through Bazarr by connecting to them directly) called slim-bazarr-subgen, which pretty much does this. It only connects to bazarr, uses the latest faster-whisper, and takes about 20 seconds for a 22-minute audio file on an RTX 3090 with large distil v3 and int8_bfloat16.

Disclaimer: I'm not a coder, so I'm just guessing and interpreting from limited knowledge. This slim version seems to use a task-queue approach that more or less "deletes" the model (purges it from VRAM) when it's done with its tasks and then reloads it into VRAM when a new task is queued. The model reload takes only a few seconds on my system (most likely depending on whether you put it on an SSD or in /dev/shm, for example). When it's unloaded, the main process only takes up about ~200 MB of VRAM.

Maybe someone more knowledgeable could take a look at the main script. It doesn't seem overly complicated to implement for someone with more experience. In comparison, it would take me more than a week of fumbling about, and I sadly don't have the resources to take on the responsibility right now, so I'm counting on you kind strangers out there! 🙏 It would be fantastic to have this implemented in whisper-asr! Some excerpts from the main script:
Btw: if you're interested in running slim-bazarr-subgen yourself but are still on Ubuntu 22.04 (I was on 23.10, but the same might apply), here's a modified Dockerfile with an older CUDA version, as you might otherwise run into problems due to the newer libs/drivers not being available:
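The queue-drain unload pattern described above can be sketched roughly as follows. This is a minimal illustration, not slim-bazarr-subgen's actual code: `LazyTranscriber` and `load_model` are placeholder names, and `load_model` stands in for the expensive call that would build e.g. a `faster_whisper.WhisperModel`.

```python
import gc
import queue
import threading

def load_model():
    # Placeholder for the real, expensive load, e.g.
    # faster_whisper.WhisperModel("large-v3", device="cuda",
    #                             compute_type="int8_bfloat16")
    return object()

class LazyTranscriber:
    """Loads the model on the first queued task and frees it once the
    queue drains, so VRAM is only held while work is pending."""

    def __init__(self):
        self.tasks = queue.Queue()
        self.model = None
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, audio_path, done):
        self.tasks.put((audio_path, done))

    def _worker(self):
        while True:
            audio_path, done = self.tasks.get()  # block until work arrives
            if self.model is None:
                self.model = load_model()        # reload into (V)RAM
            done(f"transcribed:{audio_path}")    # stands in for model.transcribe(...)
            self.tasks.task_done()
            if self.tasks.empty():               # queue drained: purge the model
                self.model = None
                gc.collect()                     # with torch: also torch.cuda.empty_cache()
```

The main process keeps running (hence the small residual footprint the comment mentions); only the model object itself is dropped and rebuilt on demand.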
First of all thanks for this great project!
Description
I would like to have an option to set an idle time after which the model is unloaded from RAM/VRAM.
Background:
I have several applications that use the VRAM of my GPU, one of these is LocalAI.
Since I don't have unlimited VRAM, these applications have to share the available memory among themselves.
Luckily, LocalAI has for some time now had a watchdog feature that unloads the model after a specified idle timeout. I'd love to have similar functionality in whisper-asr-webservice.
For now, whisper-asr-webservice occupies about a third of my VRAM even though it is only used from time to time.
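An idle-timeout watchdog of this kind could be sketched as below. This is an illustrative stand-alone sketch, not whisper-asr-webservice or LocalAI code: `IdleUnloader`, `load_model`, and the timeout default are all assumed names/values.

```python
import gc
import threading
import time

def load_model():
    # Placeholder for the real model construction.
    return object()

class IdleUnloader:
    """Hands out the model on demand and unloads it after
    `idle_timeout` seconds without a request; the next request
    triggers a lazy reload."""

    def __init__(self, idle_timeout=300.0):
        self.idle_timeout = idle_timeout
        self._model = None
        self._last_used = time.monotonic()
        self._lock = threading.Lock()
        threading.Thread(target=self._watchdog, daemon=True).start()

    def get(self):
        with self._lock:
            self._last_used = time.monotonic()
            if self._model is None:
                self._model = load_model()  # lazy reload after an unload
            return self._model

    def _watchdog(self):
        while True:
            time.sleep(min(self.idle_timeout, 1.0))
            with self._lock:
                idle = time.monotonic() - self._last_used
                if self._model is not None and idle > self.idle_timeout:
                    self._model = None      # drop the only reference
                    gc.collect()            # with CUDA: also torch.cuda.empty_cache()
```

A request handler would then call `unloader.get().transcribe(...)` instead of holding the model directly, so memory is reclaimed during idle periods at the cost of one reload delay on the first request afterwards.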