Improve Qwen3.5 recurrent cache handling #323

Open
aleroot wants to merge 1 commit into
ml-explore:mainfrom
aleroot:qwen3_5perf

Conversation

@aleroot aleroot commented May 30, 2026

Contributor

Proposed changes

Qwen3.6 models using model_type: qwen3_5 rely heavily on hybrid linear-attention / GatedDelta layers. Keeping the convolution state contiguous avoids carrying strided slices through decode steps, which should reduce per-token cache overhead and better match upstream MLX Python behavior.

Advancing array-cache metadata and preserving left-padding masks also improves stability for padded or batched generation paths.

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

Comment thread Libraries/MLXLMCommon/KVCache.swift Outdated
Comment thread Libraries/MLXLLM/Models/Qwen35.swift
@aleroot aleroot force-pushed the qwen3_5perf branch 2 times, most recently from d2c2851 to c34d2d9 Compare June 1, 2026 19:53
Store GatedDelta convolution state contiguously and advance array-cache metadata after each recurrent step, matching upstream mlx-lm behavior for Qwen3.5/Qwen3-Next style models.

Keep left-padding masks active after recurrent cache state initialization, add coverage for ArraysCache metadata advancement, and align Qwen3 RoPE setup with the shared rope initializer.

aleroot commented Jun 12, 2026

Contributor Author

Updated by rebasing on the latest main. Also fixed an MTP tail-budget issue by making speculative rounds budget-aware: the iterator now drafts only when the remaining output budget can hold both the accepted draft prefix and the verifier correction/bonus.
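
For readers following along, the budget check described above could look roughly like the sketch below. The function name `draftCount` and its parameters are hypothetical and not taken from this PR; it assumes the verifier always contributes exactly one correction/bonus token per round, which is only one possible reading of the comment.

```swift
/// Hypothetical helper (not from the PR): how many draft tokens to propose
/// in the next speculative round. Returns 0 when there is no room to draft.
func draftCount(remainingBudget: Int, maxDraft: Int) -> Int {
    // Assumption: reserve one slot for the verifier's correction/bonus token.
    let room = remainingBudget - 1
    // Never draft more than the configured maximum, and never a negative amount.
    return max(0, min(maxDraft, room))
}
```

With this rule, a round near the end of generation shrinks its draft instead of overshooting the output limit, and a return value of 0 signals that the caller should take an ordinary decode step.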

Comment on lines +116 to +130
    fileprivate func prepareArrayMetadata(lengths: [Int]?) {
        if let arrays = self as? ArraysCache {
            arrays.prepare(lengths: lengths)
        } else if let list = self as? CacheList {
            list.prepare(lengths: lengths)
        }
    }

    fileprivate func finalizeArrayMetadata() {
        if let arrays = self as? ArraysCache {
            arrays.finalize()
        } else if let list = self as? CacheList {
            list.finalize()
        }
    }

Collaborator

I wonder whether this should be done through methods on the KVCache protocol with default (empty) implementations, or by adding a KVCacheLifecycle (or similarly named) protocol. I worry that new cache types could get added (we already have quite a few) and this particular check wouldn't be updated.
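
A minimal sketch of the first option, protocol methods with default no-op implementations, might look like the following. All type names ending in `Sketch` are hypothetical stand-ins rather than the real `KVCache`, `ArraysCache`, or `CacheList` types, and what `prepare`/`finalize` actually do in the library is not modeled here; the sketch only records the calls.

```swift
// Hypothetical stand-in for the KVCache protocol; only the lifecycle hooks are shown.
protocol KVCacheSketch: AnyObject {
    func prepare(lengths: [Int]?)
    func finalize()
}

// Default no-op implementations: cache types without array metadata need no changes.
extension KVCacheSketch {
    func prepare(lengths: [Int]?) {}
    func finalize() {}
}

// A cache that relies entirely on the defaults.
final class SimpleCacheSketch: KVCacheSketch {}

// A cache that overrides the hooks; here it only records what it was given.
final class ArraysCacheSketch: KVCacheSketch {
    private(set) var lengths: [Int]?
    private(set) var finalizeCount = 0

    func prepare(lengths: [Int]?) { self.lengths = lengths }
    func finalize() { finalizeCount += 1 }
}

// A composite cache forwards the hooks to its children.
final class CacheListSketch: KVCacheSketch {
    let caches: [any KVCacheSketch]
    init(_ caches: [any KVCacheSketch]) { self.caches = caches }

    func prepare(lengths: [Int]?) { caches.forEach { $0.prepare(lengths: lengths) } }
    func finalize() { caches.forEach { $0.finalize() } }
}
```

Because the hooks are protocol requirements, callers can invoke `cache.prepare(lengths:)` and `cache.finalize()` on any cache without `as?` checks, and a newly added cache type picks up the no-op behavior unless it opts in.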

@davidkoski davidkoski left a comment

Collaborator

Looks good. See what you think about my question on KVCache finalize.
