Great little wiring PR — the real-value path reads well and the None fallback keeping the old placeholder is the right call.
Two nits, both non-blocking:
nit: avoid re-opening the same Option four times
openinfer-vllm-frontend/src/bridge.rs:64-76 currently does four separate kv_capacity.map_or(...) / map(...) calls, each reopening the same Option and re-copying c. A single match constructs all four fields at once and reads a little cleaner:
```rust
let (num_gpu_blocks, block_size, kv_cache_size_tokens, kv_cache_max_concurrency) = match kv_capacity {
    Some(c) => {
        // Blocks one request needs at the maximum model length.
        let blocks_per_req = u64::from(self.max_model_len).div_ceil(c.block_size as u64);
        (
            c.total_blocks as u64,
            c.block_size as u64,
            Some(c.total_tokens() as u64),
            Some(c.total_blocks as f64 / blocks_per_req as f64),
        )
    }
    // Engine does not publish capacity: keep the old placeholders.
    None => (0, 16, None, None),
};
```
Not blocking — just a readability nit. 🙂
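For anyone who wants to try the single-match shape in isolation, here is a minimal self-contained sketch. It assumes a hypothetical KvCapacity with usize fields and a total_tokens() helper, and a free function ready_fields standing in for the bridge method; the real types and signatures in bridge.rs may differ.

```rust
/// Hypothetical stand-in for the engine's capacity metadata (field types assumed).
#[derive(Clone, Copy, Debug)]
struct KvCapacity {
    total_blocks: usize,
    block_size: usize,
}

impl KvCapacity {
    /// Usable tokens in the pool: blocks times tokens per block.
    fn total_tokens(&self) -> usize {
        self.total_blocks * self.block_size
    }
}

/// Hypothetical free-function version of the bridge logic: builds all four
/// ready-response fields from a single match on the Option.
fn ready_fields(
    kv_capacity: Option<KvCapacity>,
    max_model_len: u32,
) -> (u64, u64, Option<u64>, Option<f64>) {
    match kv_capacity {
        Some(c) => {
            // Blocks one request needs at the maximum model length.
            let blocks_per_req = u64::from(max_model_len).div_ceil(c.block_size as u64);
            (
                c.total_blocks as u64,
                c.block_size as u64,
                Some(c.total_tokens() as u64),
                Some(c.total_blocks as f64 / blocks_per_req as f64),
            )
        }
        // No published capacity: fall back to the previous placeholders.
        None => (0, 16, None, None),
    }
}

fn main() {
    let cap = KvCapacity { total_blocks: 100, block_size: 16 };
    println!("{:?}", ready_fields(Some(cap), 160));
    println!("{:?}", ready_fields(None, 160));
}
```

The sketch assumes block_size is non-zero when capacity is published, since div_ceil panics on a zero divisor.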
nit: kv_capacity Debug is logged twice
bridge.rs:78-85 prints {kv_capacity:?} (which already serializes Some(KvCapacity { total_blocks, block_size })) and then immediately repeats num_gpu_blocks={} and block_size={} the line after. One of the two is enough; keeping both just makes the line longer than it needs to be. Nit, feel free to ignore.
Description
Refs #401
The vLLM bridge used hardcoded KV-capacity placeholders for every engine. Report real EngineHandle::kv_capacity() for the engines that publish it, and wire qwen35 to publish the same metadata qwen3 already does.
After
qwen3 and qwen35 advertise real usable KV capacity from EngineHandle::kv_capacity(): num_gpu_blocks / kv_cache_size_tokens are the usable count (pool minus the CUDA-graph padding page), kv_cache_max_concurrency is vLLM's num_gpu_blocks / ceil(max_model_len / block_size). qwen35 wires .with_kv_capacity(), mirroring qwen3.
Engines that do not publish kv_capacity() keep the previous ready-response placeholders (0 / 16 / None / None), unchanged by this PR.
Tests
Single GPU (sm_89, x86_64), local engine KV capacity startup log:
num_gpu_blocks=15376 block_size=16 kv_cache_size_tokens=246016 kv_cache_max_concurrency=6.00625
num_gpu_blocks=61980 block_size=16 kv_cache_size_tokens=991680 kv_cache_max_concurrency=3.782958984375
frontend_e2e (CI gate) covers the bridge path.
Follow-up
kimi-k2 appears to have a real pool but does not publish capacity metadata yet, so it keeps the placeholder. Tracked as a sub-issue of #221.
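The two startup-log lines under Tests can be cross-checked against the formula num_gpu_blocks / ceil(max_model_len / block_size). The sketch below does that arithmetic. The max_model_len values (40960 and 262144) are not stated in this PR; they are back-solved from the logged concurrency figures and should be treated as assumptions, as should the function name.

```rust
/// Hypothetical helper: usable blocks divided by the number of blocks a
/// single full-length request occupies.
fn kv_cache_max_concurrency(num_gpu_blocks: u64, block_size: u64, max_model_len: u64) -> f64 {
    let blocks_per_req = max_model_len.div_ceil(block_size);
    num_gpu_blocks as f64 / blocks_per_req as f64
}

fn main() {
    // (num_gpu_blocks, block_size, assumed max_model_len) for the two logged lines.
    // The max_model_len values are inferred from the log, not taken from the PR.
    let cases = [(15376u64, 16u64, 40960u64), (61980, 16, 262144)];
    for (blocks, block_size, max_model_len) in cases {
        let tokens = blocks * block_size;
        let conc = kv_cache_max_concurrency(blocks, block_size, max_model_len);
        println!(
            "num_gpu_blocks={blocks} block_size={block_size} \
             kv_cache_size_tokens={tokens} kv_cache_max_concurrency={conc}"
        );
    }
}
```

Under those assumed lengths, a full-length request needs 2560 and 16384 blocks respectively, which reproduces the logged 6.00625 and 3.782958984375.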