feat: allow parallel requests for same model #1142

Closed
mudler opened this issue Oct 6, 2023 · 3 comments

@mudler (Owner) commented Oct 6, 2023

Is your feature request related to a problem? Please describe.
Currently, each model gets only a single instance.

Describe the solution you'd like
A flag or a YAML config option to set the maximum number of instances to spawn/connect to, defaulting to 1 (mimicking the current behavior).
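
Purely to illustrate the request, a minimal per-model YAML sketch; the `max_instances` key is hypothetical (not an existing LocalAI setting as of this issue), and the surrounding keys follow a typical model config:

```yaml
# Hypothetical sketch of the requested option; `max_instances` is a made-up
# field name, not an existing LocalAI setting.
name: my-model                  # model name exposed through the API
parameters:
  model: ggml-model.q4_0.bin    # weights file to load for this model
max_instances: 1                # proposed: max instances to spawn/connect to (default 1)
```

A value above 1 would let the loader spawn several backend instances and route each request to a free one, as the additional context below describes.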

Describe alternatives you've considered

Additional context
Ideally we should keep track of what models are being used and redirect to free slots.
We also need to take into account the logic already in place for handling a single GPU device (only one model loaded at a time).

Related: go-skynet/go-llama.cpp#253

mudler added the enhancement (New feature or request) and roadmap labels on Oct 6, 2023
mudler self-assigned this on Oct 6, 2023
@localai-bot (Contributor) commented:

⚠️⚠️⚠️⚠️⚠️

Hi! I'm a bot running with LocalAI ( a crazy experiment of @mudler ) - please beware that I might hallucinate sometimes!

but... I can also be funny or helpful 😸 and, generally speaking, I can provide good tips or point to places to look in the documentation or the code, based on what you wrote in the issue.

Don't engage in conversation with me,
I don't support (yet) replying!

⚠️⚠️⚠️⚠️⚠️

To preload models during startup, you can use the PRELOAD_MODELS parameter in the YAML config file or pass it as a flag when starting the API. The PRELOAD_MODELS_CONFIG parameter is used to specify the path to a YAML configuration file containing a list of models with their corresponding URLs and names.
To set the number of threads, you can use the THREADS parameter in the YAML config file or pass it as a flag when starting the API. The default value is 14, but you can adjust it based on the number of physical cores available on your machine. Overbooking the CPU degrades performance notably.
To enable prompt caching, you can set the PROMPT_CACHE parameter in the YAML config file or pass it as a flag when starting the API. This will automatically cache prompts for faster prompt loading.
To control the number of instances per model, you can use the MAX_INSTANCES parameter in the YAML config file or pass it as a flag when starting the API. The default value is 1, which means each model gets only one instance. You can adjust this based on your needs and available resources.
To keep track of what models are being used and redirect to free slots, you can use a load balancer that dynamically assigns instances to available devices based on the current workload. This would require additional infrastructure and development effort.
To handle a single GPU device, you can use the SINGLE_ACTIVE_BACKEND parameter in the YAML config file or pass it as a flag when starting the API. This will enable the single backend mode, which means only one GPU will be used for all models.
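
Gathering the parameters the bot mentions into one place, a minimal .env sketch; the names are copied from the reply above and, since the bot warns it may hallucinate, not all of them are verified against LocalAI (MAX_INSTANCES and PROMPT_CACHE in particular should be treated as unconfirmed):

```
# Sketch only: variable names are taken verbatim from the bot's reply and are
# not all verified against the LocalAI codebase.
THREADS=14                                     # match your physical core count; overbooking degrades performance
# PRELOAD_MODELS=[{"url":"...","name":"..."}]  # preload models at startup
# PRELOAD_MODELS_CONFIG=/path/to/models.yaml   # or point to a YAML file listing models
# PROMPT_CACHE=true                            # prompt caching as described by the bot (unconfirmed name)
# MAX_INSTANCES=1                              # per-model instance cap (unconfirmed name)
# SINGLE_ACTIVE_BACKEND=true                   # single backend mode for a single GPU
```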

Sources:

@ACiDGRiM commented Dec 1, 2023

I'm a fan of this, especially with the potential of localagi.

I want to run two GPU LLMs on different servers I run; currently I just load-balance round robin between distinct LocalAI front-/back-ends.

Ideally, one API front end could queue incoming requests and send them to the next available LLM. Allowing all LLMs in a pool to store and retrieve from long-term memory would be awesome.

@mudler (Owner, Author) commented Dec 1, 2023

This is actually already possible with llama.cpp by specifying PARALLEL_REQUESTS (or --parallel-requests) along with the number of parallel requests for llama.cpp; see LocalAI/.env, line 72 at commit d6073ac:

`# LLAMACPP_PARALLEL=1`

Other backends like vllm should also support this natively.
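
For concreteness, a minimal .env sketch of that setting; LLAMACPP_PARALLEL is the name shown in the quoted .env line, PARALLEL_REQUESTS / --parallel-requests are the names used in the comment, and the exact spelling may vary between LocalAI versions:

```
# Let the llama.cpp backend serve several requests to the same model in parallel.
# Choose a slot count that fits your GPU memory and compute budget.
LLAMACPP_PARALLEL=4
# The comment above also mentions PARALLEL_REQUESTS / --parallel-requests;
# check your LocalAI version for the exact flag name.
# PARALLEL_REQUESTS=true
```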

@mudler mudler closed this as completed Dec 1, 2023