Completion endpoint does not count tokens when using vLLM backend #3436
Labels
area/backends, area/vllm, bug, python, roadmap
LocalAI version:
localai/localai:v2.20.1-cublas-cuda12
Environment, CPU architecture, OS, and Version:
Linux dev-box 6.8.0-41-generic #41-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 2 20:41:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Describe the bug
When making calls to both the `/chat/completions` and `/completions` endpoints, models backed with vLLM do not count tokens correctly and report that no tokens were used, despite correctly completing the prompt. This is not an issue with vLLM itself: running the exact same model using vLLM's provided OpenAI server Docker image correctly returns the actual token counts of the response.
To Reproduce
What Works (vLLM direct)
First, we can show the correct behavior coming from vLLM:
```shell
docker run --name localai --runtime nvidia --gpus all \
  -v ~/models:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model akjindal53244/Llama-3.1-Storm-8B \
  --gpu-memory-utilization 0.95 \
  --max-model-len 49000
```
Then POST to `http://localhost:8000/v1/chat/completions` with a chat completion request body. The response contains correct usage data.
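To make "correct usage data" concrete, here is a minimal sketch of the check being applied. The payload below is a hypothetical example of the OpenAI-style response shape vLLM returns, not the actual response from this report, and `usage_is_counted` is an illustrative helper, not part of either project:

```python
# Hypothetical example: verify that an OpenAI-style chat completion
# response reports non-zero token usage. The payload is illustrative,
# not the actual response from the issue.

def usage_is_counted(response: dict) -> bool:
    """Return True if the response's usage block reports non-zero tokens."""
    usage = response.get("usage", {})
    return all(
        usage.get(key, 0) > 0
        for key in ("prompt_tokens", "completion_tokens", "total_tokens")
    )

vllm_response = {
    "choices": [{"message": {"role": "assistant", "content": "Hello!"}}],
    "usage": {"prompt_tokens": 25, "completion_tokens": 7, "total_tokens": 32},
}

print(usage_is_counted(vllm_response))  # True: vLLM reports real counts
```

The same check applied to the LocalAI response described below would return False, since every field in its `usage` block is 0.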
What Doesn't Work (vLLM via LocalAI)
Now we'll try the same model, with the same configurations but running through localAI instead of directly through vLLM.
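The reporter's LocalAI model definition was not included. A minimal sketch, assuming LocalAI's documented YAML model-config format, might look like the following (file name and model alias are hypothetical; only `backend: vllm` is the essential part):

```yaml
# Hypothetical LocalAI model config (not the reporter's actual file).
# backend: vllm routes inference through LocalAI's vLLM backend.
name: llama-3.1-storm-8b
backend: vllm
parameters:
  model: akjindal53244/Llama-3.1-Storm-8B
```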
Send the same request to LocalAI's `http://localhost:8000/v1/chat/completions` endpoint. However now, notice the response contains all 0s for usage data:
Expected behavior
The response from the vLLM server and from the LocalAI server running a vLLM backend should be identical; in particular, LocalAI's usage data should be correct. Instead, it reports all 0s for usage despite returning a non-empty completion.
Logs
Additional context
This issue only happens with vLLM-backed models. It does not happen when, for example, we run the same model on LocalAI with a llama.cpp backend.