Improve Qwen3.5 recurrent cache handling by aleroot · Pull Request #323 · ml-explore/mlx-swift-lm

aleroot · 2026-05-30T04:10:23Z

Proposed changes

Qwen3.6 models using model_type: qwen3_5 rely heavily on hybrid linear-attention / GatedDelta layers. Keeping the convolution state contiguous avoids carrying strided slices through decode steps, which should reduce per-token cache overhead and better match upstream MLX Python behavior.

Advancing array-cache metadata and preserving left-padding masks also improves stability for padded or batched generation paths.

Checklist

Put an x in the boxes that apply.

I have read the CONTRIBUTING document
I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
I have added tests that prove my fix is effective or that my feature works
I have updated the necessary documentation (if needed)

Store GatedDelta convolution state contiguously and advance array-cache metadata after each recurrent step, matching upstream mlx-lm behavior for Qwen3.5/Qwen3-Next style models. Keep left-padding masks active after recurrent cache state initialization, add coverage for ArraysCache metadata advancement, and align Qwen3 RoPE setup with the shared rope initializer.

aleroot · 2026-06-12T11:44:50Z

Updated by rebasing on the latest main. Also fixed an MTP tail-budget issue by making speculative rounds budget-aware: the iterator now drafts only when the remaining output budget can hold both the accepted draft prefix and the verifier correction/bonus token.
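The budget check described above might look roughly like the following. This is a minimal sketch, not the code from this PR: the helper name `draftBudget(remaining:maxDraft:)` is hypothetical, and it assumes a speculative round emits at most the accepted draft tokens plus one verifier correction/bonus token.

```swift
/// Hypothetical helper (not from the PR): how many tokens to draft this round.
/// Assumes a round emits at most `drafted + 1` tokens: the accepted draft
/// prefix plus one verifier correction/bonus token.
func draftBudget(remaining: Int, maxDraft: Int) -> Int {
    // Reserve one slot for the verifier's correction or bonus token.
    let room = remaining - 1
    // With no room for even one draft token, skip speculation entirely.
    guard room >= 1, maxDraft >= 1 else { return 0 }
    // Otherwise draft as much as fits, capped by the configured maximum.
    return min(maxDraft, room)
}
```

Under these assumptions, a caller would fall back to ordinary single-token decoding whenever the helper returns 0, so the tail of the generation never exceeds the output budget.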

davidkoski · 2026-06-15T19:54:55Z

+    fileprivate func prepareArrayMetadata(lengths: [Int]?) {
+        if let arrays = self as? ArraysCache {
+            arrays.prepare(lengths: lengths)
+        } else if let list = self as? CacheList {
+            list.prepare(lengths: lengths)
+        }
+    }
+
+    fileprivate func finalizeArrayMetadata() {
+        if let arrays = self as? ArraysCache {
+            arrays.finalize()
+        } else if let list = self as? CacheList {
+            list.finalize()
+        }
+    }


I wonder if this should be done either through methods on the KVCache protocol with default (empty) implementations, or by adding a KVCacheLifecycle (or something) protocol. I worry that new cache types could get added (we already have quite a few) and this particular check wouldn't be updated.
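The first option in this comment could be sketched as follows. This is an illustration only: `SketchCache` and the three classes are hypothetical stand-ins, not the real `KVCache`, `ArraysCache`, or `CacheList` types, and the hook signatures are simply copied from the diff excerpt above.

```swift
// Hypothetical stand-in for the real KVCache protocol; all names are illustrative.
protocol SketchCache: AnyObject {
    // Declared as requirements so calls dispatch dynamically to the
    // conforming type, even through an existential value.
    func prepare(lengths: [Int]?)
    func finalize()
}

extension SketchCache {
    // Default no-op implementations: caches without array metadata
    // get correct behavior without writing any code.
    func prepare(lengths: [Int]?) {}
    func finalize() {}
}

// A cache with nothing to track relies entirely on the defaults.
final class PlainSketchCache: SketchCache {}

// A cache that tracks per-sequence lengths overrides both hooks.
final class ArraysSketchCache: SketchCache {
    private(set) var lengths: [Int]? = nil
    private(set) var finalizeCount = 0

    func prepare(lengths: [Int]?) { self.lengths = lengths }
    func finalize() { finalizeCount += 1 }
}

// A composite cache forwards the hooks to each of its children.
final class ListSketchCache: SketchCache {
    let children: [any SketchCache]
    init(_ children: [any SketchCache]) { self.children = children }

    func prepare(lengths: [Int]?) { children.forEach { $0.prepare(lengths: lengths) } }
    func finalize() { children.forEach { $0.finalize() } }
}
```

With this shape, the call site no longer needs `as?` casts: it calls `prepare(lengths:)` and `finalize()` on any cache, and a newly added cache type participates simply by overriding the hooks it needs.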

davidkoski

Looks good. See what you think about my question on KVCache finalize.

davidkoski reviewed Jun 1, 2026

Comment thread Libraries/MLXLMCommon/KVCache.swift Outdated

davidkoski reviewed Jun 1, 2026

Comment thread Libraries/MLXLLM/Models/Qwen35.swift

aleroot force-pushed the qwen3_5perf branch 2 times, most recently from d2c2851 to c34d2d9 Compare June 1, 2026 19:53

aleroot force-pushed the qwen3_5perf branch from c34d2d9 to fd40a0e Compare June 12, 2026 11:39

aleroot force-pushed the qwen3_5perf branch from fd40a0e to 3eee371 Compare June 12, 2026 11:43

davidkoski reviewed Jun 15, 2026

aleroot wants to merge 1 commit into ml-explore:main from aleroot:qwen3_5perf