docs(qwen35): add prefix cache design document by Ke-Wng · Pull Request #423 · openinfer-project/openinfer

Ke-Wng · 2026-06-18T07:21:19Z

Description

Fixes #257

Add a design document for Qwen3.5-4B prefix caching. Qwen3.5's hybrid architecture (24 linear-attention + 8 full-attention layers) breaks Qwen3's KV-only caching assumption: a prefix hit must restore both the paged KV blocks and the recurrent-state snapshot at the same token boundary.

Key design decisions:

- Snapshot interval: 256 tokens (4×GDR chunks), chosen to balance cold-prefill overhead (+32 ms for a 4096-token prompt) against hit-rate granularity
- Two-tier pool: GPU primary (~29 slots on RTX 4090 = 7,424 cached tokens) + CPU eviction backup (unlimited, +2.18 ms H2D restore cost)
- Joint matching: radix lookup requires both KV and snapshot availability at the same boundary
- LRU eviction: independent per tier, with active-request protection
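
The figures in the list above can be cross-checked with a little arithmetic. The following is a minimal sketch, not code from the design document: the constant and function names are hypothetical, the 64-token GDR chunk is inferred from "256 tokens (4×GDR chunks)", and the 52 MB snapshot size is taken from the reviewed diff in this PR.

```python
# Back-of-the-envelope check of the numbers quoted in this PR.
# All names are illustrative; only the numeric inputs come from the PR text.

GDR_CHUNK_TOKENS = 64                      # inferred: 256 tokens = 4 x GDR chunk
SNAPSHOT_INTERVAL = 4 * GDR_CHUNK_TOKENS   # 256 tokens
GPU_SLOTS = 29                             # "~29 slots on RTX 4090"
SNAPSHOT_MB = 52                           # snapshot size quoted in the reviewed diff
COLD_OVERHEAD_MS = 32.0                    # "+32ms for 4096-token prompt"
H2D_RESTORE_MS = 2.18                      # "+2.18ms H2D restore cost"


def snapshots_for_prompt(prompt_tokens: int, interval: int = SNAPSHOT_INTERVAL) -> int:
    """Number of complete snapshot boundaries crossed while prefilling a prompt."""
    return prompt_tokens // interval


def gpu_cached_tokens(slots: int = GPU_SLOTS, interval: int = SNAPSHOT_INTERVAL) -> int:
    """Tokens covered by the GPU tier if every slot holds a distinct boundary."""
    return slots * interval


def gpu_pool_mb(slots: int = GPU_SLOTS, snapshot_mb: int = SNAPSHOT_MB) -> int:
    """GPU memory consumed by a fully populated snapshot pool."""
    return slots * snapshot_mb


n = snapshots_for_prompt(4096)             # 16 boundaries
per_snapshot_ms = COLD_OVERHEAD_MS / n     # 2.0 ms per checkpoint (derived)
implied_gb_per_s = SNAPSHOT_MB / H2D_RESTORE_MS  # MB/ms equals GB/s (decimal units)

print(n, gpu_cached_tokens(), gpu_pool_mb(), per_snapshot_ms, round(implied_gb_per_s, 1))
```

Under these assumptions a fully populated GPU tier costs about 1.5 GB, and the quoted restore cost implies a host-to-device rate of roughly 24 GB/s.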

See docs/models/qwen35/prefix-cache.md for full design rationale, quantitative analysis, pitfalls, and implementation roadmap.

Type of Change

Documentation update

Checklist

My code follows the style guidelines of this project (see docs/conventions/coding-style.md).
I have performed a self-review of my own code.
I have formatted my commits according to Commitizen conventions.
I have run the local test suite and all tests pass (see CLAUDE.md).

Add comprehensive design document for Qwen3.5-4B prefix caching covering the hybrid architecture challenge (24 linear + 8 full-attention layers), snapshot strategy, two-tier pool design, and memory budget analysis.

Key decisions documented:
- 256-token snapshot interval (4×GDR chunks)
- Two-tier pool: ~29 GPU slots + CPU backup on RTX 4090
- Joint KV+snapshot matching requirement
- LRU eviction per tier

Includes quantitative analysis, design rationale, pitfalls, and prerequisites (paged-prefill migration blocker).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6d928c1d1d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review"

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-18T07:23:44Z

+**Core idea:** Checkpoint recurrent state at fixed 256-token intervals during prefill. Store snapshots in a two-tier pool (GPU primary, CPU eviction backup). On cache hit, restore both KV and snapshot at the matched boundary, then prefill only the suffix.
+
+**Key components:**
+1. **Snapshot checkpointing**: at every 256-token boundary, D2H-copy the 52 MB recurrent-state snapshot


Avoid D2H-copying every new snapshot

Because the design makes the GPU tier the primary cache and describes the CPU tier as an eviction backup, inserting a fresh checkpoint should keep/copy the recurrent state into a GPU slot and only D2H-copy it when evicting or intentionally creating a CPU backup. If this bullet is implemented literally, every cold 4096-token prefill pays 16 PCIe D2H transfers (~32 ms by the doc’s own numbers) even while GPU slots are free, defeating the GPU-tier hot-cache path and inflating TTFT.
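
The behaviour this comment asks for can be illustrated with a small model of the pool. This is a minimal sketch under stated assumptions, not the design document's implementation: `TwoTierSnapshotPool` and its counters are hypothetical names, snapshots are stand-in Python objects rather than device buffers, and the two copy counters only mark where real D2H and H2D transfers would occur.

```python
from collections import OrderedDict


class TwoTierSnapshotPool:
    """Toy model: GPU tier is primary; the CPU tier only receives evicted entries."""

    def __init__(self, gpu_slots: int):
        if gpu_slots < 1:
            raise ValueError("gpu_slots must be at least 1")
        self.gpu_slots = gpu_slots
        self.gpu = OrderedDict()   # key -> snapshot, least recently used first
        self.cpu = {}              # eviction backup
        self.d2h_copies = 0        # would be device-to-host transfers
        self.h2d_copies = 0        # would be host-to-device transfers

    def _make_room(self) -> None:
        # Demote least recently used GPU entries until one slot is free.
        while len(self.gpu) >= self.gpu_slots:
            key, snapshot = self.gpu.popitem(last=False)
            self.cpu[key] = snapshot
            self.d2h_copies += 1

    def insert(self, key, snapshot) -> None:
        """Store a fresh checkpoint on the GPU; copy to host only when evicting."""
        if key in self.gpu:
            self.gpu[key] = snapshot
            self.gpu.move_to_end(key)
            return
        self.cpu.pop(key, None)    # drop any stale CPU copy of the same key
        self._make_room()
        self.gpu[key] = snapshot

    def lookup(self, key):
        """Return the snapshot, promoting it from the CPU tier if necessary."""
        if key in self.gpu:
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.cpu:
            snapshot = self.cpu.pop(key)
            self._make_room()
            self.gpu[key] = snapshot
            self.h2d_copies += 1
            return snapshot
        return None


pool = TwoTierSnapshotPool(gpu_slots=29)
for boundary in range(256, 4096 + 1, 256):   # 16 checkpoints of a 4096-token prefill
    pool.insert(boundary, object())
print(pool.d2h_copies)                       # no host copies while GPU slots are free
```

In this model a cold 4096-token prefill fills 16 of the 29 slots and triggers no host copy at all; the D2H cost appears only once the GPU tier overflows.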


xiaguan · 2026-06-20T17:44:38Z

Thanks for writing this up, but I don’t think this is ready to land as a committed design doc yet.

The hard part here is not mainly adding snapshot offload. For Qwen3.5, prefix caching needs a real allocation / indexing / lifetime design for linear-attention state. A valid hit has to jointly restore full-attention KV, GDR recurrent state, and conv state at the same token boundary, with the same token hash / adapter salt. That means we need to define snapshot handles, ownership, pinning, eviction, and how radix lookup joins KV blocks with recurrent snapshots.
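
To make the joint-hit requirement concrete, here is a minimal sketch. It swaps the radix tree for flat hash sets to stay short, and every name in it (`prefix_key`, `longest_joint_hit`, `kv_index`, `snapshot_index`) is hypothetical rather than taken from the codebase. A boundary counts as a hit only if the KV blocks and the recurrent snapshot are both indexed under the same key, which covers the token prefix and the adapter salt.

```python
import hashlib

SNAPSHOT_INTERVAL = 256  # snapshot boundary spacing proposed in this PR


def prefix_key(tokens, boundary, adapter_salt):
    """Hash the adapter salt plus the first `boundary` token ids into one key."""
    h = hashlib.sha256()
    h.update(adapter_salt.encode("utf-8"))
    h.update(b"\x00")  # separator so the salt cannot run into the token bytes
    for token in tokens[:boundary]:
        h.update(token.to_bytes(4, "little"))
    return h.hexdigest()


def longest_joint_hit(tokens, adapter_salt, kv_index, snapshot_index,
                      interval=SNAPSHOT_INTERVAL):
    """Longest boundary at which both KV and recurrent state are cached.

    Returns 0 when no boundary has both halves available.
    """
    boundary = (len(tokens) // interval) * interval
    while boundary > 0:
        key = prefix_key(tokens, boundary, adapter_salt)
        if key in kv_index and key in snapshot_index:
            return boundary
        boundary -= interval
    return 0


tokens = list(range(1000))
kv_index = {prefix_key(tokens, b, "base") for b in (256, 512, 768)}
snapshot_index = {prefix_key(tokens, b, "base") for b in (256, 512)}

# KV exists at 768 but the snapshot does not, so the usable hit is 512.
print(longest_joint_hit(tokens, "base", kv_index, snapshot_index))
# A different adapter salt yields different keys, so nothing matches.
print(longest_joint_hit(tokens, "lora-a", kv_index, snapshot_index))
```

The sketch rehashes the prefix for each candidate boundary, which is quadratic in prompt length; a radix tree or an incremental hash avoids that, but the matching rule is the same.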

There is also an implementation gap: the current prefill path does not naturally produce a whole-model snapshot at every 256-token boundary. The GDR chunk scratch is per linear layer, and conv state is only kept as final state, not as per-boundary snapshots. So “D2H copy at GDR boundaries” is not enough as an implementation plan.

I’d prefer to move this to an RFC issue first and narrow the first step to:

- GPU-only recurrent snapshot allocator
- exact snapshot contents: GDR state + conv state
- snapshot key and boundary semantics
- joint KV+snapshot lookup
- request lifetime / eviction protection
- how prefill will actually materialize boundary snapshots

CPU-tier offload and two-tier LRU can be a later optimization once the core indexing/lifetime model is clear.
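
As a starting point for the lifetime discussion, the allocator and eviction-protection items above might look roughly like the following. This is a minimal sketch, not a proposal for the final API: `SnapshotAllocator`, `allocate`, `pin`, and `unpin` are hypothetical names, slots are modelled as dictionary entries rather than GPU buffers, and the pin count stands in for "a running request still depends on this snapshot".

```python
from collections import OrderedDict


class SnapshotAllocator:
    """Fixed-capacity, GPU-only snapshot slots with pin-count eviction protection."""

    def __init__(self, num_slots: int):
        self.num_slots = num_slots
        self.entries = OrderedDict()   # key -> pin count, least recently used first

    def allocate(self, key) -> bool:
        """Reserve a slot for `key`, pinned once for the calling request.

        Evicts the least recently used unpinned entry when full. Returns False
        if every slot is pinned, in which case the caller skips checkpointing.
        """
        if key in self.entries:
            return self.pin(key)
        if len(self.entries) >= self.num_slots:
            victim = next((k for k, pins in self.entries.items() if pins == 0), None)
            if victim is None:
                return False
            del self.entries[victim]
        self.entries[key] = 1
        return True

    def pin(self, key) -> bool:
        """Protect an existing snapshot from eviction while a request uses it."""
        if key not in self.entries:
            return False
        self.entries[key] += 1
        self.entries.move_to_end(key)
        return True

    def unpin(self, key) -> None:
        """Release one reference; the entry becomes evictable at zero pins."""
        if self.entries.get(key, 0) <= 0:
            raise ValueError(f"snapshot {key!r} is not pinned")
        self.entries[key] -= 1


alloc = SnapshotAllocator(num_slots=2)
alloc.allocate("a")
alloc.allocate("b")
print(alloc.allocate("c"))   # both slots are pinned, so nothing can be evicted
alloc.unpin("a")
print(alloc.allocate("c"))   # "a" is now evictable and gives up its slot
```

Returning `False` instead of blocking keeps prefill moving when the pool is saturated; whether a request should wait or simply skip the checkpoint is one of the questions the RFC would need to settle.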

Ke-Wng · 2026-06-27T08:50:03Z

Thanks for the review. I agree with narrowing the scope first. I drafted an RFC comment under #257 that focuses on the first step: a GPU-only recurrent snapshot design for Qwen3.5 prefix caching.

chatgpt-codex-connector Bot reviewed Jun 18, 2026


xiaguan self-assigned this Jun 20, 2026

Ke-Wng wants to merge 1 commit into openinfer-project:main from Ke-Wng:docs/qwen35-prefix-cache

2 participants
