feat: allow parallel requests for same model #1142

Closed
mudler opened this issue Oct 6, 2023 · 3 comments

@mudler (Owner) commented Oct 6, 2023

Is your feature request related to a problem? Please describe.
Currently, each model gets only a single instance.

Describe the solution you'd like
A flag or a YAML config option to set the maximum number of instances to spawn/connect to, defaulting to 1 (mimicking the current behavior).
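
Purely to illustrate the request, a minimal per-model YAML sketch; the `max_instances` key is hypothetical (not an existing LocalAI setting as of this issue), and the surrounding keys follow a typical model config:

```yaml
# Hypothetical sketch of the requested option; `max_instances` is a made-up
# field name, not an existing LocalAI setting.
name: my-model                  # model name exposed through the API
parameters:
  model: ggml-model.q4_0.bin    # weights file to load for this model
max_instances: 1                # proposed: max instances to spawn/connect to (default 1)
```

A value above 1 would let the loader spawn several backend instances and route each request to a free one, as the additional context below describes.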

Describe alternatives you've considered

Additional context
Ideally we should keep track of what models are being used and redirect to free slots.
We also need to take into account the logic already in place for handling a single GPU device (only one model loaded at a time).

Related: go-skynet/go-llama.cpp#253

mudler added the enhancement (New feature or request) and roadmap labels on Oct 6, 2023
mudler self-assigned this on Oct 6, 2023
@localai-bot (Contributor) commented:

⚠️⚠️⚠️⚠️⚠️

Hi! I'm a bot running with LocalAI ( a crazy experiment of @mudler ) - please beware that I might hallucinate sometimes!

but... I can also be funny or helpful 😸 and, generally speaking, I can provide good tips or point to places to look in the documentation or the code, based on what you wrote in the issue.

Don't engage in conversation with me,
I don't support (yet) replying!

⚠️⚠️⚠️⚠️⚠️

To preload models during startup, you can use the PRELOAD_MODELS parameter in the YAML config file or pass it as a flag when starting the API. The PRELOAD_MODELS_CONFIG parameter is used to specify the path to a YAML configuration file containing a list of models with their corresponding URLs and names.
To set the number of threads, you can use the THREADS parameter in the YAML config file or pass it as a flag when starting the API. The default value is 14, but you can adjust it based on the number of physical cores available on your machine. Overbooking the CPU degrades performance notably.
To enable prompt caching, you can set the PROMPT_CACHE parameter in the YAML config file or pass it as a flag when starting the API. This will automatically cache prompts for faster prompt loading.
To control the number of instances per model, you can use the MAX_INSTANCES parameter in the YAML config file or pass it as a flag when starting the API. The default value is 1, which means each model gets only one instance. You can adjust this based on your needs and available resources.
To keep track of what models are being used and redirect to free slots, you can use a load balancer that dynamically assigns instances to available devices based on the current workload. This would require additional infrastructure and development effort.
To handle a single GPU device, you can use the SINGLE_ACTIVE_BACKEND parameter in the YAML config file or pass it as a flag when starting the API. This will enable the single backend mode, which means only one GPU will be used for all models.
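
Gathering the parameters the bot mentions into one place, a minimal .env sketch; the names are copied from the reply above and, since the bot warns it may hallucinate, not all of them are verified against LocalAI (MAX_INSTANCES and PROMPT_CACHE in particular should be treated as unconfirmed):

```
# Sketch only: variable names are taken verbatim from the bot's reply and are
# not all verified against the LocalAI codebase.
THREADS=14                                     # match your physical core count; overbooking degrades performance
# PRELOAD_MODELS=[{"url":"...","name":"..."}]  # preload models at startup
# PRELOAD_MODELS_CONFIG=/path/to/models.yaml   # or point to a YAML file listing models
# PROMPT_CACHE=true                            # prompt caching as described by the bot (unconfirmed name)
# MAX_INSTANCES=1                              # per-model instance cap (unconfirmed name)
# SINGLE_ACTIVE_BACKEND=true                   # single backend mode for a single GPU
```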

Sources:

@ACiDGRiM commented Dec 1, 2023

I'm a fan of this, especially with the potential of localagi.

I want to run two GPU LLMs on different servers I run; currently I just load-balance round robin between distinct LocalAI front-/back-ends.

Ideally, one API front end could queue incoming requests and send them to the next available LLM. Allowing all LLMs in a pool to store and retrieve from long-term memory would be awesome.

@mudler (Owner, Author) commented Dec 1, 2023

This is actually already possible with llama.cpp by specifying PARALLEL_REQUESTS (or --parallel-requests) along with the number of parallel requests for llama.cpp; see LocalAI/.env, line 72 at commit d6073ac:

`# LLAMACPP_PARALLEL=1`

Other backends like vllm should also support this natively.
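
For concreteness, a minimal .env sketch of that setting; LLAMACPP_PARALLEL is the name shown in the quoted .env line, PARALLEL_REQUESTS / --parallel-requests are the names used in the comment, and the exact spelling may vary between LocalAI versions:

```
# Let the llama.cpp backend serve several requests to the same model in parallel.
# Choose a slot count that fits your GPU memory and compute budget.
LLAMACPP_PARALLEL=4
# The comment above also mentions PARALLEL_REQUESTS / --parallel-requests;
# check your LocalAI version for the exact flag name.
# PARALLEL_REQUESTS=true
```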

@mudler mudler closed this as completed Dec 1, 2023