[codex] server: tighten cache_key slot reuse by ezcoder · Pull Request #157 · TheTom/llama-cpp-turboquant

ezcoder · 2026-05-27T18:47:05Z

Summary

Tightens OpenAI-compatible server cache_key slot reuse so a keyed request only reuses the previously mapped slot when the new prompt actually overlaps the cached prompt enough to be safe.

Changes:

Track cache_key -> slot id mappings server-side.
Reject stale, empty, or currently busy mapped slots.
Require configurable prompt overlap before keyed reuse.
Skip generic LCP fallback for keyed requests after a cache-key miss/reject, so unrelated same-key prompts do not accidentally reuse a warm slot.
Add --slot-cache-key-similarity and --slot-cache-key-min-prefix server flags.
Clear child-task cache keys for parallel completions.
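
The selection rules listed above can be sketched roughly as follows. This is a minimal Python illustration, not the PR's C++ code: the `Slot` class, `pick_keyed_slot`, the default threshold values, and the choice to require both thresholds at once are hypothetical, and similarity is assumed to be the common-prefix length divided by the new prompt's length.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Slot:
    """Hypothetical stand-in for a server slot."""
    id: int
    tokens: List[int] = field(default_factory=list)  # cached prompt tokens
    busy: bool = False


def common_prefix_len(a: List[int], b: List[int]) -> int:
    """Number of leading tokens shared by both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_keyed_slot(
    cache_key: str,
    prompt: List[int],
    slots: Dict[int, Slot],
    key_to_slot: Dict[str, int],
    min_sim: float = 0.5,    # illustrative default, not taken from the PR
    min_prefix: int = 16,    # illustrative default, not taken from the PR
) -> Optional[Slot]:
    """Return the mapped slot only if keyed reuse looks safe, else None."""
    slot_id = key_to_slot.get(cache_key)
    if slot_id is None:
        return None  # cache-key miss
    slot = slots.get(slot_id)
    if slot is None or not slot.tokens or slot.busy:
        return None  # stale, empty, or busy mapping
    common = common_prefix_len(slot.tokens, prompt)
    sim = common / len(prompt) if prompt else 0.0
    if sim < min_sim or common < min_prefix:
        return None  # not enough overlap with the cached prompt
    return slot
```

In this sketch a `None` result means the caller goes straight to its ordinary free-slot choice rather than a generic longest-common-prefix search, which mirrors the intent of skipping the LCP fallback for keyed requests.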

Why

The original keyed scheduling could route a low-overlap request onto an unrelated cached prompt simply because both carried the same cache_key. In local testing, an unrelated same-key prompt had only sim = 0.125 and common = 3, but still reused the keyed slot under the old behavior. That is unsafe for agent/OpenClaw-style traffic, where a key may represent a session or topic but the prompts under it can diverge.
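
For intuition, the logged values are consistent with a similarity defined as common-prefix length over new-prompt length: 3 shared leading tokens in a 24-token prompt give 0.125. Both the 24-token length and this definition of `sim` are assumptions made for illustration; the description only reports the two logged numbers.

```python
from typing import List, Tuple


def prefix_overlap(cached: List[int], prompt: List[int]) -> Tuple[int, float]:
    """Return (common prefix length, common / len(prompt))."""
    common = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        common += 1
    sim = common / len(prompt) if prompt else 0.0
    return common, sim


# Hypothetical token ids: the cached prompt has 40 tokens, the new prompt
# shares its first 3 tokens and then diverges for 21 more (24 in total).
cached = list(range(1, 41))
prompt = [1, 2, 3] + list(range(1000, 1021))

common, sim = prefix_overlap(cached, prompt)
print(common, sim)  # 3 0.125
```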

Validation

Built on latest origin/feature/turboquant-kv-cache at 2cbfdc62a with Release + Metal + Accelerate + OpenMP/libomp.

Focused cache-key probe:

exact/same-prefix same-key prompts selected by cache_key: 2
low-overlap same-key prompt rejected: 1
LRU fallbacks: 3
low-overlap rejection log: sim = 0.125, common = 3

Threshold sweep:

similarity: 0.25, 0.50, 0.75, 0.90
min-prefix: 16, 32, 64
all 12 combinations selected the matching prompts and rejected the low-overlap prompt.
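
The sweep grid above can be reproduced as a small sketch. The `passes` helper and its rule that both thresholds must hold are assumptions; only the threshold values and the logged sim/common pair come from the description.

```python
from itertools import product

SIMILARITIES = [0.25, 0.50, 0.75, 0.90]
MIN_PREFIXES = [16, 32, 64]


def passes(sim: float, common: int, min_sim: float, min_prefix: int) -> bool:
    """Assumed acceptance rule: both thresholds must be met."""
    return sim >= min_sim and common >= min_prefix


# Every (similarity, min-prefix) pair from the sweep: 4 x 3 combinations.
grid = list(product(SIMILARITIES, MIN_PREFIXES))

# The low-overlap prompt from the rejection log.
low_sim, low_common = 0.125, 3
rejected = [not passes(low_sim, low_common, s, p) for s, p in grid]

print(len(grid), all(rejected))  # 12 True
```

Because 0.125 is below the smallest similarity threshold and 3 is below the smallest min-prefix, the low-overlap prompt fails each condition on its own, so it would be rejected across the grid whether the two checks are combined with "and" or "or".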

Busy-slot -np 2 test:

the overlapping same-key request logged that it was ignoring the busy cache_key slot
second request selected the other free slot by LRU
both requests returned HTTP 200
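
The busy-slot outcome can be pictured with a minimal least-recently-used pick over free slots. The `Slot` fields and `pick_lru_free_slot` are hypothetical names for illustration, and the server's real bookkeeping may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    """Hypothetical slot record."""
    id: int
    busy: bool
    last_used: int  # hypothetical timestamp; smaller means used longer ago


def pick_lru_free_slot(slots: List[Slot]) -> Optional[Slot]:
    """Return the free slot that was used least recently, or None if all are busy."""
    free = [s for s in slots if not s.busy]
    if not free:
        return None
    return min(free, key=lambda s: s.last_used)


# Two slots, as with -np 2: slot 0 is still serving the first keyed request,
# so the second request skips it and lands on the other slot.
slots = [Slot(0, busy=True, last_used=10), Slot(1, busy=False, last_used=5)]
chosen = pick_lru_free_slot(slots)
print(chosen.id)  # 1
```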

OpenClaw E2E smoke:

3/3 openclaw agent --agent turboquant-test turns succeeded through the local llama-server path
average wall: 4.488 s
average decode: 67.2 tok/s

Throughput sanity check (llama-bench -r 3, Qwen3.6-35B-A3B Q8 GGUF, M4 Pro 64 GB; results in tok/s):

q8_0/q8_0 b2048 ub512 t10: pp512=770.23, pp8192=617.85, tg64=41.27
q8_0/turbo3 b2048 ub1024 t8: pp512=768.62, pp8192=623.10, tg64=39.11
q8_0/turbo3 b1024 ub512 t10: pp512=744.64, pp8192=569.27, tg64=39.30
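
To read these rows side by side, the relative change of each turbo3 run against the q8_0/q8_0 row can be computed as below. Treating the first row as the baseline is an assumption made for illustration, and the rows also differ in batch, micro-batch, and thread settings, so the percentages are a sanity check rather than a controlled comparison.

```python
# Figures copied from the llama-bench rows above.
baseline = {"pp512": 770.23, "pp8192": 617.85, "tg64": 41.27}  # q8_0/q8_0 b2048 ub512 t10
turbo3_runs = {
    "q8_0/turbo3 b2048 ub1024 t8": {"pp512": 768.62, "pp8192": 623.10, "tg64": 39.11},
    "q8_0/turbo3 b1024 ub512 t10": {"pp512": 744.64, "pp8192": 569.27, "tg64": 39.30},
}


def pct_delta(value: float, base: float) -> float:
    """Relative change of value versus base, in percent."""
    return 100.0 * (value - base) / base


for name, run in turbo3_runs.items():
    deltas = {test: round(pct_delta(v, baseline[test]), 2) for test, v in run.items()}
    print(name, deltas)
```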

Long-context retrieval harness sanity:

32k run: exact match true, 29316 prompt tokens, 488.29 tok/s prefill
64k run: exact match true, 54668 prompt tokens, 204.88 tok/s prefill

server: tighten cache_key slot reuse

4dfcfe6

github-actions bot added the examples and server labels on May 27, 2026

ezcoder mentioned this pull request May 27, 2026

[codex] Key llama.cpp sessions for local reuse openclaw/openclaw#87414

Draft

ezcoder wants to merge 1 commit into TheTom:feature/turboquant-kv-cache from ezcoder:codex/cache-key-slot-reuse
