Problem
All inference responses return zero token counts (prompt_tokens=0, completion_tokens=0, total_tokens=0) in client.queryFinalInfo, regardless of actual token consumption.
Root Cause
The vLLM worker returns "usage": null in every SSE chunk, including the final one with finish_reason: "stop". The proxy's ByteTokenCounter parses the SSE JSON looking for usage.prompt_tokens and the related fields, but because usage is null rather than an object, get_json_value falls back to its default of 0 and every counter stays at zero.
Observed SSE Response
data: {"id":"...","object":"chat.completion.chunk","created":1773050183,
"model":"Qwen/Qwen3-32B",
"choices":[{"index":0,"delta":{"content":"Hello","reasoning_content":null},
"finish_reason":"stop","matched_stop":151645}],
"usage":null}
Every chunk has "usage": null, even though the proxy correctly injects stream_options.include_usage = true in ValidateRequest.cpp:199-201:
if (stream && !has_stream_options) {
  b["stream_options"]["include_usage"] = true;
}

Likely Cause
The vLLM instance serving Qwen/Qwen3-32B does not honor stream_options.include_usage, so no usage object is ever emitted even though the proxy requests one.