docs(qwen35): add prefix cache design document #423

Open
Ke-Wng wants to merge 1 commit into openinfer-project:main from Ke-Wng:docs/qwen35-prefix-cache

@Ke-Wng Ke-Wng commented Jun 18, 2026

Contributor

Description

Fixes #257

Adds a design document for Qwen3.5-4B prefix caching. Qwen3.5's hybrid architecture (24 linear-attention + 8 full-attention layers) breaks Qwen3's KV-only caching assumption: a prefix hit must restore both the paged KV blocks and the recurrent-state snapshot at the same token boundary.

Key design decisions:

  • Snapshot interval: 256 tokens (4×GDR chunks), chosen to balance cold-prefill overhead (+32 ms for a 4096-token prompt) against hit-rate granularity
  • Two-tier pool: GPU primary (~29 slots on RTX 4090 = 7,424 cached tokens) + CPU eviction backup (unlimited, +2.18 ms H2D restore cost)
  • Joint matching: radix lookup requires both KV and snapshot availability at the same boundary
  • LRU eviction: independent per-tier, with active-request protection
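
The figures in the list above follow from simple interval arithmetic. The sketch below reproduces them; the constant and function names are illustrative only, and the 52 MB snapshot size is the figure quoted from the design document later in this thread, not something measured here.

```python
# Back-of-the-envelope check of the numbers quoted in this PR.
# All names are hypothetical; only the constants come from the PR text.
SNAPSHOT_INTERVAL = 256   # tokens between snapshot boundaries
GPU_SLOTS = 29            # approximate GPU slot count quoted for an RTX 4090
SNAPSHOT_MB = 52          # per-snapshot size quoted from the design doc


def snapshots_for_prompt(prompt_tokens: int, interval: int = SNAPSHOT_INTERVAL) -> int:
    """Number of complete snapshot boundaries crossed by a cold prefill."""
    return prompt_tokens // interval


def cached_token_capacity(slots: int = GPU_SLOTS, interval: int = SNAPSHOT_INTERVAL) -> int:
    """Tokens covered if every slot holds one interval's worth of state."""
    return slots * interval


def reusable_prefix(matched_tokens: int, interval: int = SNAPSHOT_INTERVAL) -> int:
    """Round a raw token match down to the nearest snapshot boundary."""
    return (matched_tokens // interval) * interval


if __name__ == "__main__":
    print(snapshots_for_prompt(4096))   # boundaries in a 4096-token prompt
    print(cached_token_capacity())      # tokens covered by the GPU tier
    print(GPU_SLOTS * SNAPSHOT_MB)      # GPU pool footprint in MB
    print(reusable_prefix(1000))        # usable prefix for a 1000-token match
```

A 1000-token match can therefore reuse only 768 tokens, which is the hit-rate granularity cost that the 256-token interval trades against snapshot overhead.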

See docs/models/qwen35/prefix-cache.md for full design rationale, quantitative analysis, pitfalls, and implementation roadmap.

Type of Change

  • Documentation update

Checklist

  • My code follows the style guidelines of this project (see docs/conventions/coding-style.md).
  • I have performed a self-review of my own code.
  • I have formatted my commits according to Commitizen conventions.
  • I have run the local test suite and all tests pass (see CLAUDE.md).

Add comprehensive design document for Qwen3.5-4B prefix caching covering
the hybrid architecture challenge (24 linear + 8 full-attention layers),
snapshot strategy, two-tier pool design, and memory budget analysis.

Key decisions documented:
- 256-token snapshot interval (4×GDR chunks)
- Two-tier pool: ~29 GPU slots + CPU backup on RTX 4090
- Joint KV+snapshot matching requirement
- LRU eviction per tier

Includes quantitative analysis, design rationale, pitfalls, and prerequisites
(paged-prefill migration blocker).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6d928c1d1d

**Core idea:** Checkpoint recurrent state at fixed 256-token intervals during prefill. Store snapshots in a two-tier pool (GPU primary, CPU eviction backup). On cache hit, restore both KV and snapshot at the matched boundary, then prefill only the suffix.

**Key components:**
1. **Snapshot checkpointing**: at every 256-token boundary, D2H-copy the 52 MB recurrent-state snapshot

P2: Avoid D2H-copying every new snapshot

Because the design makes the GPU tier the primary cache and describes the CPU tier as an eviction backup, inserting a fresh checkpoint should keep/copy the recurrent state into a GPU slot and only D2H-copy it when evicting or intentionally creating a CPU backup. If this bullet is implemented literally, every cold 4096-token prefill pays 16 PCIe D2H transfers (~32 ms by the doc’s own numbers) even while GPU slots are free, defeating the GPU-tier hot-cache path and inflating TTFT.
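
One way to read this suggestion is sketched below, assuming a simple LRU policy per tier. `TwoTierSnapshotPool` and its fields are hypothetical names, and plain Python objects stand in for device and host buffers; the point is only that a D2H copy is counted when a GPU slot is evicted, not when a checkpoint is inserted.

```python
from collections import OrderedDict


class TwoTierSnapshotPool:
    """Toy model: GPU tier is primary, CPU tier only receives evictions."""

    def __init__(self, gpu_slots: int):
        if gpu_slots < 1:
            raise ValueError("gpu_slots must be at least 1")
        self.gpu_slots = gpu_slots
        self.gpu = OrderedDict()  # key -> snapshot, least recently used first
        self.cpu = {}             # eviction backup (stand-in for host memory)
        self.d2h_copies = 0       # counts simulated device-to-host transfers

    def insert(self, key, snapshot):
        """Store a fresh checkpoint on the GPU; spill the LRU entry only if full."""
        if key in self.gpu:
            self.gpu[key] = snapshot
            self.gpu.move_to_end(key)
            return
        while len(self.gpu) >= self.gpu_slots:
            old_key, old_snapshot = self.gpu.popitem(last=False)
            self.cpu[old_key] = old_snapshot  # the only place a D2H copy happens
            self.d2h_copies += 1
        self.gpu[key] = snapshot


if __name__ == "__main__":
    pool = TwoTierSnapshotPool(gpu_slots=29)
    for boundary in range(256, 4096 + 1, 256):  # 16 boundaries of a cold prefill
        pool.insert(boundary, object())
    print(pool.d2h_copies)  # no transfers while GPU slots are free
```

With 29 free slots, the 16 checkpoints of a cold 4096-token prefill trigger no transfers in this model; transfers appear only once the GPU tier overflows.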

@xiaguan xiaguan self-assigned this Jun 20, 2026

xiaguan commented Jun 20, 2026

Collaborator

Thanks for writing this up, but I don’t think this is ready to land as a committed design doc yet.

The hard part here is not mainly adding snapshot offload. For Qwen3.5, prefix caching needs a real allocation / indexing / lifetime design for linear-attention state. A valid hit has to jointly restore full-attention KV, GDR recurrent state, and conv state at the same token boundary, with the same token hash / adapter salt. That means we need to define snapshot handles, ownership, pinning, eviction, and how radix lookup joins KV blocks with recurrent snapshots.
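
To make the "same token hash / adapter salt" requirement concrete, here is a minimal sketch of a snapshot key. `SnapshotKey`, `snapshot_key`, and the SHA-256 choice are assumptions for illustration and are not taken from the project's code; the real radix index may hash per block rather than over the whole prefix.

```python
import hashlib
from typing import NamedTuple, Sequence

SNAPSHOT_INTERVAL = 256  # boundary spacing proposed in the PR


class SnapshotKey(NamedTuple):
    boundary: int  # token position the snapshot was taken at
    digest: str    # hash of the adapter salt plus tokens[:boundary]


def snapshot_key(tokens: Sequence[int], boundary: int,
                 adapter_salt: bytes = b"") -> SnapshotKey:
    """Key a snapshot by its boundary, the prefix tokens, and the adapter salt."""
    if boundary <= 0 or boundary % SNAPSHOT_INTERVAL != 0:
        raise ValueError("boundary must be a positive multiple of the interval")
    if boundary > len(tokens):
        raise ValueError("boundary exceeds prompt length")
    h = hashlib.sha256()
    # Length-prefix the salt so salt bytes cannot be confused with token bytes.
    h.update(len(adapter_salt).to_bytes(4, "little"))
    h.update(adapter_salt)
    for tok in tokens[:boundary]:
        h.update(tok.to_bytes(4, "little"))  # assumes 0 <= token id < 2**32
    return SnapshotKey(boundary, h.hexdigest())
```

Two prompts that share their first 256 tokens map to the same key at boundary 256 and diverge at later boundaries, while a different adapter salt yields a different key even for identical tokens.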

There is also an implementation gap: the current prefill path does not naturally produce a whole-model snapshot at every 256-token boundary. The GDR chunk scratch is per linear layer, and conv state is only kept as final state, not as per-boundary snapshots. So “D2H copy at GDR boundaries” is not enough as an implementation plan.
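
The gap can be illustrated with a toy model, under heavy assumptions: the recurrence below is an arbitrary integer update rather than the real GDR kernel, `CONV_WIDTH` is made up, and a single pair of values stands in for the state of all linear-attention layers. It shows what a prefill loop would need to capture at each boundary (recurrent state and conv state together) for a resumed prefill to match a cold one.

```python
GDR_CHUNK = 64                      # 256 tokens = 4 x GDR chunks, per the PR
SNAPSHOT_INTERVAL = 4 * GDR_CHUNK
CONV_WIDTH = 3                      # hypothetical conv window length
MOD = 2**61 - 1


def step_chunk(state, conv, chunk):
    """Toy stand-in for one chunk of a recurrent layer plus its conv state."""
    for tok in chunk:
        state = (state * 31 + tok) % MOD
        conv = (conv + (tok,))[-CONV_WIDTH:]  # keep only the last few tokens
    return state, conv


def prefill(tokens, start=0, state=0, conv=()):
    """Run chunked prefill from `start`, capturing both states at boundaries."""
    snapshots = {}
    pos = start
    while pos < len(tokens):
        end = min(pos + GDR_CHUNK, len(tokens))
        state, conv = step_chunk(state, conv, tokens[pos:end])
        pos = end
        if pos % SNAPSHOT_INTERVAL == 0:
            snapshots[pos] = (state, conv)  # per-boundary copy, not final only
    return state, conv, snapshots


if __name__ == "__main__":
    tokens = list(range(1000))
    final_state, final_conv, snaps = prefill(tokens)
    s, c = snaps[512]
    resumed_state, resumed_conv, _ = prefill(tokens, start=512, state=s, conv=c)
    print(resumed_state == final_state and resumed_conv == final_conv)
```

If the loop stored only the recurrent value and kept conv state as final-only, the resumed run would start from the wrong conv window, which is the mismatch this comment describes.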

I’d prefer to move this to an RFC issue first and narrow the first step to:

  • GPU-only recurrent snapshot allocator
  • exact snapshot contents: GDR state + conv state
  • snapshot key and boundary semantics
  • joint KV+snapshot lookup
  • request lifetime / eviction protection
  • how prefill will actually materialize boundary snapshots
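
As a rough illustration of two items from this list, the sketch below pairs a joint KV+snapshot boundary lookup with a GPU-only slot allocator that refuses to evict pinned entries. `longest_joint_boundary` and `GpuSnapshotAllocator` are hypothetical names, and the pin-count scheme is one possible lifetime model rather than a settled design.

```python
from collections import OrderedDict

SNAPSHOT_INTERVAL = 256


def longest_joint_boundary(kv_cached_tokens, snapshot_boundaries,
                           interval=SNAPSHOT_INTERVAL):
    """Largest boundary covered by cached KV that also has a snapshot, else 0."""
    boundary = (kv_cached_tokens // interval) * interval
    while boundary > 0:
        if boundary in snapshot_boundaries:
            return boundary
        boundary -= interval
    return 0


class GpuSnapshotAllocator:
    """Fixed number of slots; entries pinned by active requests are never evicted."""

    def __init__(self, slots: int):
        self.slots = slots
        self.entries = OrderedDict()  # key -> pin count, least recently used first

    def acquire(self, key) -> bool:
        """Pin `key`, allocating a slot if needed; False if every slot is pinned."""
        if key in self.entries:
            self.entries[key] += 1
            self.entries.move_to_end(key)
            return True
        if len(self.entries) >= self.slots:
            victim = next((k for k, pins in self.entries.items() if pins == 0), None)
            if victim is None:
                return False          # nothing evictable; caller falls back
            del self.entries[victim]
        self.entries[key] = 1
        return True

    def release(self, key) -> None:
        """Drop one pin; the entry stays cached until evicted."""
        if self.entries.get(key, 0) <= 0:
            raise ValueError("release without matching acquire")
        self.entries[key] -= 1
```

For example, with KV cached for 1000 tokens and snapshots at 256 and 768, the lookup returns 768; with KV cached for only 700 tokens it falls back to 256, since a snapshot beyond the cached KV cannot be used.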

CPU-tier offload and two-tier LRU can be a later optimization once the core indexing/lifetime model is clear.

Ke-Wng commented Jun 27, 2026

Contributor Author

Thanks for the review. I agree with narrowing the scope first. I drafted an RFC comment under #257 that focuses on the first step: a GPU-only recurrent snapshot design for Qwen3.5 prefix caching.

Development

Successfully merging this pull request may close these issues.

qwen35: prefix caching needs a design for recurrent state (discussion)

2 participants