[codex] server: tighten cache_key slot reuse #157

Draft
ezcoder wants to merge 1 commit into TheTom:feature/turboquant-kv-cache from ezcoder:codex/cache-key-slot-reuse

ezcoder commented May 27, 2026

Summary

Tightens OpenAI-compatible server cache_key slot reuse so a keyed request only reuses the previously mapped slot when the new prompt actually overlaps the cached prompt enough to be safe.

Changes:

  • Track cache_key -> slot id mappings server-side.
  • Reject stale, empty, or currently busy mapped slots.
  • Require configurable prompt overlap before keyed reuse.
  • Skip generic LCP fallback for keyed requests after a cache-key miss/reject, so unrelated same-key prompts do not accidentally reuse a warm slot.
  • Add --slot-cache-key-similarity and --slot-cache-key-min-prefix server flags.
  • Clear child-task cache keys for parallel completions.
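The selection rules listed above can be sketched roughly as follows. This is an illustrative model only, not the PR's server code: `Slot`, `pick_slot`, the default thresholds, the AND-combination of the two thresholds, and the choice to remap the key after an LRU fallback are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    id: int
    cached_tokens: list = field(default_factory=list)
    busy: bool = False
    last_used: int = 0  # larger value means more recently used


def common_prefix_len(a, b):
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_slot(slots, key_to_slot, cache_key, prompt, min_sim=0.5, min_prefix=16):
    """Pick a slot for a keyed request; returns (slot, reason).

    Hypothetical logic: thresholds and their combination are assumptions.
    """
    slot_id = key_to_slot.get(cache_key)
    mapped = next((s for s in slots if s.id == slot_id), None)

    # Reject stale (no such slot), empty, or busy mapped slots.
    if mapped is not None and not mapped.busy and mapped.cached_tokens:
        common = common_prefix_len(mapped.cached_tokens, prompt)
        sim = common / len(prompt) if prompt else 0.0
        # Require enough overlap before reusing the keyed slot.
        if sim >= min_sim and common >= min_prefix:
            return mapped, "cache_key"

    # Keyed miss/reject: no generic LCP matching, go straight to LRU.
    free = [s for s in slots if not s.busy]
    if not free:
        return None, "no_free_slot"
    chosen = min(free, key=lambda s: s.last_used)
    key_to_slot[cache_key] = chosen.id  # assumption: remap key to new slot
    return chosen, "lru"
```

With two slots where slot 0 holds a 100-token cached prompt mapped to the key, a prompt extending that cached prompt would come back as `"cache_key"`, while a prompt sharing only 3 of 24 tokens would fall through to `"lru"`.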

Why

The original keyed scheduling could route a low-overlap request with the same cache_key onto an unrelated cached prompt. In local testing, an unrelated same-key prompt had only sim = 0.125 and common = 3, yet still reused the keyed slot under the old behavior. That is unsafe for agent/OpenClaw-style traffic, where a key may represent a session or topic while the prompts themselves diverge.
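To make the logged numbers concrete, here is a small worked example. It assumes that `common` is the length of the shared token prefix and that `sim` is that length divided by the new prompt's token count; the PR text does not spell out either definition, and the 24-token prompt below is hypothetical, chosen only because 3 / 24 reproduces the logged values.

```python
def prefix_similarity(cached, prompt):
    """Return (sim, common) under the assumed definitions:
    common = shared prefix length, sim = common / len(prompt)."""
    common = 0
    for x, y in zip(cached, prompt):
        if x != y:
            break
        common += 1
    sim = common / len(prompt) if prompt else 0.0
    return sim, common


cached = list(range(100))        # hypothetical cached prompt tokens
prompt = [0, 1, 2] + [-1] * 21   # 24 tokens, only the first 3 match

sim, common = prefix_similarity(cached, prompt)
print(f"sim = {sim:.3f}, common = {common}")  # sim = 0.125, common = 3
```

Under these assumptions, only an eighth of the new prompt could be served from the cached slot, which is why reusing it by key alone is a poor trade.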

Validation

Built on latest origin/feature/turboquant-kv-cache at 2cbfdc62a with Release + Metal + Accelerate + OpenMP/libomp.

Focused cache-key probe:

  • exact/same-prefix same-key prompts selected by cache_key: 2
  • low-overlap same-key prompt rejected: 1
  • LRU fallbacks: 3
  • low-overlap rejection log: sim = 0.125, common = 3

Threshold sweep:

  • similarity: 0.25, 0.50, 0.75, 0.90
  • min-prefix: 16, 32, 64
  • all 12 combinations selected the matching prompts and rejected the low-overlap prompt.
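The sweep grid can be reproduced as a quick sanity check. The sketch below assumes a request is accepted only when both thresholds are met; the low-overlap values come from the rejection log above, while the matching prompt's `sim = 1.0, common = 128` is a hypothetical stand-in for an exact-match prompt.

```python
from itertools import product

SIMILARITIES = [0.25, 0.50, 0.75, 0.90]
MIN_PREFIXES = [16, 32, 64]


def accepts(sim, common, min_sim, min_prefix):
    """Assumed acceptance rule: both thresholds must be satisfied."""
    return sim >= min_sim and common >= min_prefix


low_overlap = (0.125, 3)   # from the rejection log
matching = (1.0, 128)      # hypothetical exact-match prompt

grid = list(product(SIMILARITIES, MIN_PREFIXES))
all_matching_selected = all(accepts(*matching, ms, mp) for ms, mp in grid)
all_low_rejected = not any(accepts(*low_overlap, ms, mp) for ms, mp in grid)

print(len(grid), all_matching_selected, all_low_rejected)  # 12 True True
```

Because 0.125 is below the smallest similarity threshold and 3 is below the smallest min-prefix, the low-overlap prompt is rejected at every grid point regardless of whether the two checks are combined with AND or OR.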

Busy-slot -np 2 test:

  • overlapping same-key request logged "ignoring busy cache_key slot"
  • second request selected the other free slot by LRU
  • both requests returned HTTP 200

OpenClaw E2E smoke:

  • 3/3 openclaw agent --agent turboquant-test turns succeeded through the local llama-server path
  • average wall: 4.488 s
  • average decode: 67.2 tok/s

Throughput sanity check (llama-bench -r 3, Qwen3.6-35B-A3B Q8 GGUF, M4 Pro 64 GB):

  • q8_0/q8_0 b2048 ub512 t10: pp512=770.23, pp8192=617.85, tg64=41.27
  • q8_0/turbo3 b2048 ub1024 t8: pp512=768.62, pp8192=623.10, tg64=39.11
  • q8_0/turbo3 b1024 ub512 t10: pp512=744.64, pp8192=569.27, tg64=39.30

Long-context retrieval harness sanity:

  • 32k run: exact match true, 29316 prompt tokens, 488.29 tok/s prefill
  • 64k run: exact match true, 54668 prompt tokens, 204.88 tok/s prefill
