Duplicate model GPU overload #1787
Unanswered
kirthi-exe
asked this question in Q&A
Replies: 1 comment
@kirthi-exe this is expected given the option you are enabling: this is a feature that should be used with small GPUs, where you can actually have only one model loaded at a time.
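(The option being enabled isn't named in the reply as captured here. If it refers to LocalAI's single-active-backend setting, disabling it in the environment file would look roughly like the sketch below; the variable name is an assumption and may differ between LocalAI releases, so check the documentation for your version.)

```env
# Sketch only: re-allow multiple models to stay resident on the GPU by
# disabling the single-active-backend behaviour. Variable name assumed;
# verify against your LocalAI release.
SINGLE_ACTIVE_BACKEND=false
```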
Greetings everyone,
I've been working on integrating LocalAI into Nextcloud via the Nextcloud app. Everything proceeded smoothly until a peculiar issue emerged: whenever I create an image, the image model is cached in vGPU memory, and when I generate text, the text model is cached there as well. However, when I try to create another image, the cached model isn't reused; instead, an identical copy of the model is cached again, and strangely the text model gets purged from the cache.
This repeated caching overloads the GPU with duplicate image models during image generation, eventually leading to crashes. Despite setting parallel requests to true, with llamacpp_parallel=1 and python_grpc_max_workers=1, which should allow only one model to be cached and reused, the issue persists.
My development environment is a Proxmox VM with 16 GB of RAM, 64 CPU cores, and an NVIDIA L4 Tensor GPU with 24 GB of memory. I'm using the v2.6.1-cublas-cuda12-ffmpeg image and the Nextcloud app available at https://apps.nextcloud.com/apps/integration_openai.
My Environment File:
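(The contents of the environment file are not reproduced in this post as captured. Based on the settings described above, the relevant entries presumably looked something like the sketch below; the variable names follow LocalAI's documented environment variables, but this is not the author's actual file.)

```env
# Sketch of the settings described in the post, not the original file.
# Allow parallel requests:
PARALLEL_REQUESTS=true
# Limit the llama.cpp and Python gRPC backends to a single worker each:
LLAMACPP_PARALLEL=1
PYTHON_GRPC_MAX_WORKERS=1
```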
My Docker Compose File:
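(The compose file is likewise not included in the post as captured. A minimal sketch of a LocalAI service using the image named above, with the GPU passed through to the container, might look like this; the registry, ports, volume paths, and env file name are assumptions rather than the author's actual configuration.)

```yaml
# Sketch only: a typical compose service for the LocalAI image mentioned above.
services:
  localai:
    image: quay.io/go-skynet/local-ai:v2.6.1-cublas-cuda12-ffmpeg  # registry assumed
    env_file:
      - .env                 # assumed name of the environment file
    ports:
      - "8080:8080"          # LocalAI's default API port
    volumes:
      - ./models:/models     # assumed host path for the model files
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```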
After seeking assistance on Discord, I was directed to report this as a bug, so I'm reaching out here for further insights and solutions. Any guidance or assistance would be greatly appreciated.