fix(vllm-frontend): report real KV capacity for engines that publish it #475

Merged: xiaguan merged 2 commits into openinfer-project:main from FeathBow:fix/vllm-frontend-real-kv-ready on Jun 30, 2026

@FeathBow

Collaborator

Description

Refs #401

The vLLM bridge used hardcoded KV-capacity placeholders for every engine. Report real EngineHandle::kv_capacity() for the engines that publish it, and wire qwen35 to publish the same metadata qwen3 already does.

After

  • qwen3 and qwen35 advertise real usable KV capacity from EngineHandle::kv_capacity(). num_gpu_blocks and kv_cache_size_tokens report the usable count (the pool minus the CUDA-graph padding page); kv_cache_max_concurrency follows vLLM's definition, num_gpu_blocks / ceil(max_model_len / block_size). qwen35 now wires .with_kv_capacity(), mirroring qwen3.
  • Engines that do not publish kv_capacity() keep the previous ready-response placeholders (0 / 16 / None / None), unchanged by this PR.

Tests

Single GPU (sm_89, x86_64), local engine KV capacity startup log:

  • qwen3 num_gpu_blocks=15376 block_size=16 kv_cache_size_tokens=246016 kv_cache_max_concurrency=6.00625
  • qwen35 num_gpu_blocks=61980 block_size=16 kv_cache_size_tokens=991680 kv_cache_max_concurrency=3.782958984375
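As a cross-check on the logged values, the mapping can be sketched in isolation. This is a minimal sketch, not the bridge code: `KvCapacity`, `ReadyKv`, and `ready_kv` are hypothetical stand-ins whose field names follow the review discussion, and the `max_model_len` values used in the check (40960 for qwen3, 262144 for qwen35) are assumptions. The PR does not state them; they are simply the values that reproduce the logged concurrency.

```rust
/// Hypothetical stand-in for the engine's capacity metadata.
/// The real type in the engine crate may differ.
#[derive(Debug, Clone, Copy)]
pub struct KvCapacity {
    pub total_blocks: usize,
    pub block_size: usize,
}

impl KvCapacity {
    /// Usable tokens: usable blocks times tokens per block.
    pub fn total_tokens(&self) -> usize {
        self.total_blocks * self.block_size
    }
}

/// The four ready-response fields discussed in this PR (hypothetical container).
#[derive(Debug, PartialEq)]
pub struct ReadyKv {
    pub num_gpu_blocks: u64,
    pub block_size: u64,
    pub kv_cache_size_tokens: Option<u64>,
    pub kv_cache_max_concurrency: Option<f64>,
}

/// Map optional capacity metadata to the ready-response fields.
pub fn ready_kv(kv_capacity: Option<KvCapacity>, max_model_len: u32) -> ReadyKv {
    match kv_capacity {
        Some(c) => {
            // Blocks needed by a single request of maximum length.
            let blocks_per_req = u64::from(max_model_len).div_ceil(c.block_size as u64);
            ReadyKv {
                num_gpu_blocks: c.total_blocks as u64,
                block_size: c.block_size as u64,
                kv_cache_size_tokens: Some(c.total_tokens() as u64),
                kv_cache_max_concurrency: Some(c.total_blocks as f64 / blocks_per_req as f64),
            }
        }
        // No published capacity: keep the previous placeholders (0 / 16 / None / None).
        None => ReadyKv {
            num_gpu_blocks: 0,
            block_size: 16,
            kv_cache_size_tokens: None,
            kv_cache_max_concurrency: None,
        },
    }
}
```

Under those assumed lengths, qwen3 needs 40960 / 16 = 2560 blocks per request and 15376 / 2560 = 6.00625, while qwen35 needs 262144 / 16 = 16384 blocks and 61980 / 16384 = 3.782958984375, matching the log lines above.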

frontend_e2e (CI gate) covers the bridge path.

Follow-up

kimi-k2 appears to have a real pool but does not publish capacity metadata yet, so it keeps the placeholder. Tracked as a sub-issue of #221.

@xiaguan

xiaguan commented Jun 30, 2026

Collaborator

Great little wiring PR — the real-value path reads well and the None fallback keeping the old placeholder is the right call.

Two nits, both non-blocking:


nit: avoid re-opening the same Option four times
openinfer-vllm-frontend/src/bridge.rs:64-76 currently does four separate kv_capacity.map_or(...) / map(...) calls, each reopening the same Option and re-copying c. A single match constructs all four fields at once and reads a little cleaner:

```rust
let (num_gpu_blocks, block_size, kv_cache_size_tokens, kv_cache_max_concurrency) =
    match kv_capacity {
        Some(c) => {
            let blocks_per_req = u64::from(self.max_model_len).div_ceil(c.block_size as u64);
            (
                c.total_blocks as u64,
                c.block_size as u64,
                Some(c.total_tokens() as u64),
                Some(c.total_blocks as f64 / blocks_per_req as f64),
            )
        }
        None => (0, 16, None, None),
    };
```

Not blocking — just a readability nit. 🙂


nit: kv_capacity Debug is logged twice
bridge.rs:78-85 prints {kv_capacity:?} (which already serializes Some(KvCapacity { total_blocks, block_size })) and then immediately repeats num_gpu_blocks={} and block_size={} the line after. One of the two is enough; keeping both just makes the line longer than it needs to be. nit, feel free to ignore.

@xiaguan xiaguan merged commit 03351a3 into openinfer-project:main Jun 30, 2026
1 check passed