Replies: 1 comment
-
With a CPU, slower inference times are expected, but you can enable parallel requests by setting the environment variable appropriately (see line 72 in 39a6b56). However, that would probably not work very well with CPUs.
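Roughly, something like this in the compose file; the variable names below are assumptions on my side, so check the line referenced above for the exact ones your LocalAI version expects:

```yaml
# Sketch only -- the env var names are assumptions, verify them against the
# line referenced above before relying on them.
services:
  api:
    image: quay.io/go-skynet/local-ai:v2.7.0   # tag shown for illustration
    ports:
      - "8080:8080"
    environment:
      - PARALLEL_REQUESTS=true   # assumed switch that allows concurrent requests
      - LLAMACPP_PARALLEL=4      # assumed number of parallel llama.cpp slots
      - THREADS=14               # per-request threads; 4 slots x 14 threads = 56 threads
```

On a CPU the parallel slots end up competing for the same memory bandwidth, which is the main reason this tends not to help much without a GPU.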
-
I have deployed the latest LocalAI container (v2.7.0) on 2 x Xeon 2680V4 CPUs (56 threads in total) with 198 GB of RAM, but from what I can tell, the requests hitting the /v1/chat/completions endpoint are being processed one after the other, not in parallel.
I believe there are enough resources on my system to process these requests in parallel.
I am using the openhermes-2.5-mistral-7b.Q8_0.gguf model.
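For reference, this is roughly the kind of check I'm basing that on (the payload is just illustrative, and I'm assuming the default port 8080 here):

```bash
# Fire two identical requests at the same time and time each one.
# If they were processed in parallel, both should finish in roughly the same
# wall-clock time; if the second takes about twice as long, requests are queued.
PAYLOAD='{"model":"openhermes-2.5-mistral-7b.Q8_0.gguf","messages":[{"role":"user","content":"Say hello"}]}'

( time curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD" > /dev/null ) &

( time curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD" > /dev/null ) &

wait
```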
Getting a response to the curl example below takes 20+ seconds. Is that good or bad for the system I have?
Thanks
PS: I can't install a GPU on this system as it is a 1U unit.
Curl:
Docker-compose:
LocalAI logs: