[codex] server: tighten cache_key slot reuse #157
Draft
ezcoder wants to merge 1 commit into
Summary
Tightens OpenAI-compatible server `cache_key` slot reuse so a keyed request only reuses the previously mapped slot when the new prompt actually overlaps the cached prompt enough to be safe.

Changes:
- `cache_key -> slot id` mappings server-side.
- `--slot-cache-key-similarity` and `--slot-cache-key-min-prefix` server flags.

Why
The original keyed scheduling could route a low-overlap request with the same `cache_key` onto an unrelated cached prompt. In local testing, an unrelated same-key prompt had only `sim = 0.125` and `common = 3`, but still reused the keyed slot under the old behavior. That is unsafe for agent/OpenClaw style traffic where keys may represent a session or topic but prompts can diverge.

Validation
Built on latest `origin/feature/turboquant-kv-cache` at `2cbfdc62a` with Release + Metal + Accelerate + OpenMP/libomp.

Focused cache-key probe:
- `cache_key: 2`
- `sim = 0.125`, `common = 3`

Threshold sweep:
- `0.25`, `0.50`, `0.75`, `0.90`
- `16`, `32`, `64`

Busy-slot `-np 2` test:
- `ignoring busy cache_key slot`

OpenClaw E2E smoke:
- `openclaw agent --agent turboquant-test` turns succeeded through the local llama-server path
- `4.488 s`
- `67.2 tok/s`

Throughput sanity check (`llama-bench -r 3`, Qwen3.6-35B-A3B Q8 GGUF, M4 Pro 64 GB):
- `q8_0/q8_0 b2048 ub512 t10`: `pp512=770.23`, `pp8192=617.85`, `tg64=41.27`
- `q8_0/turbo3 b2048 ub1024 t8`: `pp512=768.62`, `pp8192=623.10`, `tg64=39.11`
- `q8_0/turbo3 b1024 ub512 t10`: `pp512=744.64`, `pp8192=569.27`, `tg64=39.30`

Long-context retrieval harness sanity:
- `29316` prompt tokens, `488.29 tok/s` prefill
- `54668` prompt tokens, `204.88 tok/s` prefill
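
For reviewers, the reuse gate described in the Summary can be sketched roughly as follows. This is an illustrative Python model, not the server code from this PR. It assumes `sim` is the common-prefix length divided by the new prompt's token count, that both thresholds must pass, and that a busy keyed slot is skipped. The function and parameter names (`should_reuse_keyed_slot`, `min_similarity`, `min_prefix`, `slot_busy`) are hypothetical stand-ins for whatever `--slot-cache-key-similarity` and `--slot-cache-key-min-prefix` control internally.

```python
from dataclasses import dataclass
from typing import Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class ReuseDecision:
    reuse: bool
    common: int
    sim: float


def should_reuse_keyed_slot(
    cached_tokens: Sequence[int],
    prompt_tokens: Sequence[int],
    min_similarity: float,   # hypothetical analogue of --slot-cache-key-similarity
    min_prefix: int,         # hypothetical analogue of --slot-cache-key-min-prefix
    slot_busy: bool = False,
) -> ReuseDecision:
    common = common_prefix_len(cached_tokens, prompt_tokens)
    # Assumed definition: fraction of the NEW prompt covered by the cached prefix.
    sim = common / len(prompt_tokens) if prompt_tokens else 0.0
    # Assumed policy: a busy slot is never reused, and both thresholds must hold.
    reuse = (not slot_busy) and sim >= min_similarity and common >= min_prefix
    return ReuseDecision(reuse=reuse, common=common, sim=sim)


# Synthetic token ids standing in for a cached prompt under some cache_key.
cached = list(range(100))

# Unrelated same-key prompt: shares 3 leading tokens out of 24.
unrelated = [0, 1, 2] + [1000 + i for i in range(21)]

# Overlapping same-key prompt: shares 40 leading tokens out of 48.
overlapping = cached[:40] + [2000 + i for i in range(8)]

unrelated_decision = should_reuse_keyed_slot(cached, unrelated, 0.50, 16)
overlapping_decision = should_reuse_keyed_slot(cached, overlapping, 0.50, 16)
busy_decision = should_reuse_keyed_slot(cached, overlapping, 0.50, 16, slot_busy=True)
```

Under the assumed definition, the synthetic `unrelated` prompt reproduces the reported `sim = 0.125` and `common = 3` and is rejected at a threshold pair such as `0.50` / `16`, while the `overlapping` prompt passes unless the slot is busy. Requiring both a ratio and an absolute prefix length is one way to keep very short prompts from passing on ratio alone; whether the PR combines the two flags this way is an assumption.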