docs(qwen35): add prefix cache design document #423
Add comprehensive design document for Qwen3.5-4B prefix caching covering the hybrid architecture challenge (24 linear + 8 full-attention layers), snapshot strategy, two-tier pool design, and memory budget analysis.

Key decisions documented:
- 256-token snapshot interval (4×GDR chunks)
- Two-tier pool: ~29 GPU slots + CPU backup on RTX 4090
- Joint KV+snapshot matching requirement
- LRU eviction per tier

Includes quantitative analysis, design rationale, pitfalls, and prerequisites (paged-prefill migration blocker).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6d928c1d1d
**Core idea:** Checkpoint recurrent state at fixed 256-token intervals during prefill. Store snapshots in a two-tier pool (GPU primary, CPU eviction backup). On cache hit, restore both KV and snapshot at the matched boundary, then prefill only the suffix.

**Key components:**
1. **Snapshot checkpointing**: at every 256-token boundary, D2H-copy the 52 MB recurrent-state snapshot
Avoid D2H-copying every new snapshot
Because the design makes the GPU tier the primary cache and describes the CPU tier as an eviction backup, inserting a fresh checkpoint should keep/copy the recurrent state into a GPU slot and only D2H-copy it when evicting or intentionally creating a CPU backup. If this bullet is implemented literally, every cold 4096-token prefill pays 16 PCIe D2H transfers (~32 ms by the doc’s own numbers) even while GPU slots are free, defeating the GPU-tier hot-cache path and inflating TTFT.
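The transfer count in this comment can be checked with a small back-of-the-envelope model. The sketch below is only illustrative: the 2 ms per-snapshot figure is an assumption derived from the ~32 ms / 16 transfers quoted above, and the function names and the "spill only what does not fit" policy are hypothetical simplifications, not the design doc's implementation.

```python
SNAPSHOT_INTERVAL = 256       # tokens between snapshots (from the design doc)
D2H_MS_PER_SNAPSHOT = 2.0     # assumption: ~32 ms / 16 transfers, per the comment above


def eager_d2h_cost_ms(prompt_tokens: int) -> float:
    """Cost if every new snapshot is D2H-copied as soon as it is produced."""
    boundaries = prompt_tokens // SNAPSHOT_INTERVAL
    return boundaries * D2H_MS_PER_SNAPSHOT


def evict_only_d2h_cost_ms(prompt_tokens: int, free_gpu_slots: int) -> float:
    """Cost if snapshots stay on the GPU and are copied out only on eviction.

    Simplified model: every snapshot that does not fit in a free GPU slot
    forces exactly one D2H copy of some evicted snapshot.
    """
    boundaries = prompt_tokens // SNAPSHOT_INTERVAL
    spilled = max(0, boundaries - free_gpu_slots)
    return spilled * D2H_MS_PER_SNAPSHOT


if __name__ == "__main__":
    print(eager_d2h_cost_ms(4096))            # 16 boundaries, all copied
    print(evict_only_d2h_cost_ms(4096, 29))   # all 16 fit in free GPU slots
```

Under these assumptions, a cold 4096-token prefill pays for 16 copies in the eager scheme and for none when at least 16 GPU slots are free.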
Thanks for writing this up, but I don’t think this is ready to land as a committed design doc yet.

The hard part here is not mainly adding snapshot offload. For Qwen3.5, prefix caching needs a real allocation / indexing / lifetime design for linear-attention state. A valid hit has to jointly restore full-attention KV, GDR recurrent state, and conv state at the same token boundary, with the same token hash / adapter salt. That means we need to define snapshot handles, ownership, pinning, eviction, and how radix lookup joins KV blocks with recurrent snapshots.

There is also an implementation gap: the current prefill path does not naturally produce a whole-model snapshot at every 256-token boundary. The GDR chunk scratch is per linear layer, and conv state is only kept as final state, not as per-boundary snapshots. So “D2H copy at GDR boundaries” is not enough as an implementation plan.

I’d prefer to move this to an RFC issue first and narrow the first step to:
CPU-tier offload and two-tier LRU can be a later optimization once the core indexing/lifetime model is clear.
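To make the handle / pinning / eviction questions above concrete, here is a minimal GPU-only sketch. Everything in it is hypothetical: `SnapshotHandle`, `SnapshotIndex`, the `(prefix_hash, adapter_salt, boundary)` key, and the `kv_matched_tokens` argument are illustrative names, not existing code in this repo, and a real design would attach snapshots to the radix tree rather than to a flat map.

```python
from collections import OrderedDict
from dataclasses import dataclass

SNAPSHOT_INTERVAL = 256  # tokens, from the design doc


@dataclass
class SnapshotHandle:
    slot: int      # hypothetical GPU slot holding GDR + conv state for all linear layers
    pins: int = 0  # number of in-flight requests currently using this snapshot


class SnapshotIndex:
    """Hypothetical GPU-only index with LRU eviction that skips pinned entries."""

    def __init__(self, num_slots: int) -> None:
        self._free = list(range(num_slots))
        self._lru: "OrderedDict[tuple, SnapshotHandle]" = OrderedDict()

    def insert(self, prefix_hash: str, adapter_salt: str, boundary: int):
        assert boundary > 0 and boundary % SNAPSHOT_INTERVAL == 0
        key = (prefix_hash, adapter_salt, boundary)
        if key in self._lru:
            self._lru.move_to_end(key)
            return self._lru[key]
        if not self._free and not self._evict_one():
            return None  # every slot is pinned; caller skips caching
        handle = SnapshotHandle(slot=self._free.pop())
        self._lru[key] = handle
        return handle

    def _evict_one(self) -> bool:
        # Oldest-first scan; pinned snapshots are never evicted.
        victim = next((k for k, h in self._lru.items() if h.pins == 0), None)
        if victim is None:
            return False
        self._free.append(self._lru.pop(victim).slot)
        return True

    def acquire(self, prefix_hash: str, adapter_salt: str, boundary: int,
                kv_matched_tokens: int):
        """Joint hit: the snapshot exists AND cached KV covers the same boundary."""
        key = (prefix_hash, adapter_salt, boundary)
        handle = self._lru.get(key)
        if handle is None or kv_matched_tokens < boundary:
            return None
        handle.pins += 1
        self._lru.move_to_end(key)
        return handle

    def release(self, handle: SnapshotHandle) -> None:
        assert handle.pins > 0
        handle.pins -= 1
```

The point of the sketch is the shape of the contract: a hit is only valid when both the snapshot key and the KV match agree on the boundary, and a pinned snapshot cannot be reclaimed until its request releases it.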
Thanks for the review. I agree with narrowing the scope first. I drafted an RFC comment under #257 that focuses on the first step: a GPU-only recurrent snapshot design for Qwen3.5 prefix caching.
Description
Fixes #257
Add design document for Qwen3.5-4B prefix caching. Qwen3.5's hybrid architecture (24 linear-attention + 8 full-attention layers) breaks Qwen3's KV-only caching assumption: a prefix hit must restore both paged KV blocks and recurrent-state snapshots at the same token boundary.
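As a rough illustration of the "same boundary" requirement, the sketch below picks the restore point as the largest snapshot boundary that the matched KV prefix also covers. The function name and arguments are hypothetical, and the 256-token interval is taken from the design doc; real lookup would go through the radix tree.

```python
def restore_boundary(kv_matched_tokens: int,
                     snapshot_boundaries: set,
                     interval: int = 256) -> int:
    """Largest token boundary where both KV and a recurrent snapshot exist.

    Returns 0 when no joint hit is possible, meaning a full prefill.
    """
    aligned = (kv_matched_tokens // interval) * interval
    candidates = [b for b in snapshot_boundaries
                  if 0 < b <= aligned and b % interval == 0]
    return max(candidates, default=0)


if __name__ == "__main__":
    prompt_len = 1500
    boundary = restore_boundary(1000, {256, 512, 1024})
    # KV matches 1000 tokens, but the snapshot at 1024 is past the match,
    # so the hit falls back to 512 and the suffix to prefill is 1500 - 512.
    print(boundary, prompt_len - boundary)
```

A KV match that is longer than the nearest snapshot boundary is only usable up to that boundary, so the effective hit length is always a multiple of the snapshot interval under this model.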
Key design decisions:
See `docs/models/qwen35/prefix-cache.md` for full design rationale, quantitative analysis, pitfalls, and implementation roadmap.
Type of Change
Checklist
docs/conventions/coding-style.md).CLAUDE.md).