fix(vllm-frontend): report real KV capacity for engines that publish it by FeathBow · Pull Request #475 · openinfer-project/openinfer

FeathBow · 2026-06-30T10:54:07Z

Description

The vLLM bridge used hardcoded KV-capacity placeholders for every engine. Report real EngineHandle::kv_capacity() for the engines that publish it, and wire qwen35 to publish the same metadata qwen3 already does.

After

qwen3 and qwen35 advertise real usable KV capacity from EngineHandle::kv_capacity(): num_gpu_blocks and kv_cache_size_tokens report the usable count (pool minus the CUDA-graph padding page), and kv_cache_max_concurrency follows vLLM's formula, num_gpu_blocks / ceil(max_model_len / block_size). qwen35 wires .with_kv_capacity(), mirroring qwen3.
Engines that do not publish kv_capacity() keep the previous ready-response placeholders (num_gpu_blocks = 0, block_size = 16, kv_cache_size_tokens = None, kv_cache_max_concurrency = None), unchanged by this PR.
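
To make "usable count" concrete, here is a minimal sketch of the pool-minus-padding arithmetic. The constant and both helper names are hypothetical and are not taken from the engine code; the sketch assumes exactly one page is held back for CUDA-graph padding, as the wording above suggests.

```rust
/// Hypothetical: number of pages held back from the pool for CUDA-graph padding.
/// Assumed to be 1 based on the PR description ("the CUDA-graph padding page").
const CUDA_GRAPH_PADDING_PAGES: usize = 1;

/// Blocks that can actually hold request KV data.
/// saturating_sub keeps an empty pool at 0 instead of underflowing.
fn usable_blocks(pool_blocks: usize) -> usize {
    pool_blocks.saturating_sub(CUDA_GRAPH_PADDING_PAGES)
}

/// Token capacity of the usable blocks.
fn usable_tokens(pool_blocks: usize, block_size: usize) -> usize {
    usable_blocks(pool_blocks) * block_size
}
```

Under that assumption, a pool of 15377 blocks with block_size 16 would advertise 15376 blocks and 246016 tokens. The pool size 15377 is illustrative only; the PR reports just the usable figure.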

Tests

Single GPU (sm_89, x86_64), local engine KV capacity startup log:

qwen3 num_gpu_blocks=15376 block_size=16 kv_cache_size_tokens=246016 kv_cache_max_concurrency=6.00625
qwen35 num_gpu_blocks=61980 block_size=16 kv_cache_size_tokens=991680 kv_cache_max_concurrency=3.782958984375
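
The logged values can be cross-checked against the formula in the description. A minimal sketch, assuming max_model_len is 40960 for qwen3 and 262144 for qwen35; these lengths are back-derived from the logged concurrency and are not stated in this PR. KvCapacity here is a local stand-in modeling only the two fields named in the review thread, and max_concurrency is a hypothetical helper.

```rust
/// Local stand-in for the capacity metadata; only the two fields named in
/// the review thread are modeled here.
#[derive(Debug, Clone, Copy)]
struct KvCapacity {
    total_blocks: usize,
    block_size: usize,
}

impl KvCapacity {
    /// Total tokens the usable blocks can hold.
    fn total_tokens(&self) -> usize {
        self.total_blocks * self.block_size
    }
}

/// Hypothetical helper: num_gpu_blocks / ceil(max_model_len / block_size).
fn max_concurrency(cap: KvCapacity, max_model_len: u32) -> f64 {
    // Blocks one request needs at full context length, rounded up.
    let blocks_per_req = u64::from(max_model_len).div_ceil(cap.block_size as u64);
    cap.total_blocks as f64 / blocks_per_req as f64
}
```

With the assumed lengths, this gives 15376 / 2560 = 6.00625 for qwen3 and 61980 / 16384 = 3.782958984375 for qwen35, and total_tokens() yields 246016 and 991680, all matching the log.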

frontend_e2e (CI gate) covers the bridge path.

Follow-up

kimi-k2 appears to have a real pool but does not publish capacity metadata yet, so it keeps the placeholder. Tracked as a sub-issue of #221.

xiaguan · 2026-06-30T13:34:46Z

Great little wiring PR — the real-value path reads well and the None fallback keeping the old placeholder is the right call.

Two nits, both non-blocking:

nit: avoid re-opening the same Option four times
openinfer-vllm-frontend/src/bridge.rs:64-76 currently does four separate kv_capacity.map_or(...) / map(...) calls, each reopening the same Option and re-copying c. A single match constructs all four fields at once and reads a little cleaner:

```rust
let (num_gpu_blocks, block_size, kv_cache_size_tokens, kv_cache_max_concurrency) =
    match kv_capacity {
        Some(c) => {
            let blocks_per_req = u64::from(self.max_model_len).div_ceil(c.block_size as u64);
            (
                c.total_blocks as u64,
                c.block_size as u64,
                Some(c.total_tokens() as u64),
                Some(c.total_blocks as f64 / blocks_per_req as f64),
            )
        }
        None => (0, 16, None, None),
    };
```

Not blocking — just a readability nit. 🙂

nit: kv_capacity Debug is logged twice
bridge.rs:78-85 prints {kv_capacity:?} (which already serializes Some(KvCapacity { total_blocks, block_size })) and then immediately repeats num_gpu_blocks={} and block_size={} the line after. One of the two is enough; keeping both just makes the line longer than it needs to be. nit, feel free to ignore.
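
One way to emit each field exactly once, sketched under assumptions: capacity_log_line is a hypothetical helper, and the line layout is modeled on the startup log quoted in the Tests section rather than on the actual format string in bridge.rs.

```rust
/// Hypothetical formatter: emits each capacity field once instead of printing
/// the Debug form of the Option and then repeating its fields.
fn capacity_log_line(
    engine: &str,
    num_gpu_blocks: u64,
    block_size: u64,
    kv_cache_size_tokens: Option<u64>,
    kv_cache_max_concurrency: Option<f64>,
) -> String {
    let mut line = format!("{engine} num_gpu_blocks={num_gpu_blocks} block_size={block_size}");
    // The optional fields are appended only when the engine publishes capacity.
    if let Some(tokens) = kv_cache_size_tokens {
        line.push_str(&format!(" kv_cache_size_tokens={tokens}"));
    }
    if let Some(conc) = kv_cache_max_concurrency {
        line.push_str(&format!(" kv_cache_max_concurrency={conc}"));
    }
    line
}
```

For an engine on the placeholder path, the line would stop after block_size, which also makes the fallback case easy to spot in logs.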

fix(vllm-frontend): report real KV capacity for engines that publish it

849d865

chore(vllm-frontend): centralize ready-response KV capacity fields

6b3a8a4

xiaguan merged commit 03351a3 into openinfer-project:main Jun 30, 2026
1 check passed

xiaguan merged 2 commits into openinfer-project:main from FeathBow:fix/vllm-frontend-real-kv-ready
