Interest in TurboQuant / rotating quantized KV cache support? #294

Hi, we have been experimenting in a fork with a TurboQuant path for MLX Swift and an LM-side rotating quantized KV cache. Before trying to shape it into upstream PRs, we wanted to ask whether this is directionally interesting.

The implementation is currently split across:

- `mlx-swift`: https://github.com/RNT56/mlx-swift/tree/schtack/turboquant-kv
- `mlx-swift-lm`: https://github.com/RNT56/mlx-swift-lm/tree/schtack/turboquant-kv

What exists in the fork:

- A `TurboQuant` API in `mlx-swift` with packed tensor support, reference encode/decode, Metal codec kernels, compressed attention kernels, runtime capability probing, and fallback diagnostics.
- `mlx-swift-lm` integration with TurboQuant KV cache strategies, rotating quantized KV cache support, compressed attention routing, cache diagnostics, and tests around cache copying/serialization/fallback behavior.
- Runtime fallback behavior: the public path can fall back to MLX packed quantized lanes when the PolarQuant/QJL Metal backend is unavailable or fails the self-test. The Metal attention path is gated behind a runtime probe, so unsupported devices should keep using the safer fallback path.
- Device/profile handling distinguishes portable Apple GPU profiles from wider/sustained profiles using Metal availability, GPU family signals, architecture name, and recommended working-set size. The code currently treats A16/A17-class devices as portable, A18/A19 and Apple8+ devices as wider, and devices with larger working sets or an A19 Pro as sustained.
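
To make the fallback behavior concrete, here is a minimal sketch of the probe-gated selection described above. All type and function names are hypothetical and are not the fork's actual API; the sketch only illustrates the rule that the Metal path is chosen only when the backend is available and its self-test passes.

```swift
// Hypothetical names for illustration only; not the fork's actual API.
enum AttentionBackend: Equatable {
    case compressedMetal   // PolarQuant/QJL Metal kernels
    case packedQuantized   // MLX packed quantized fallback lanes
}

// Outcome of a runtime capability probe (assumed shape).
struct ProbeResult {
    var metalAvailable: Bool
    var selfTestPassed: Bool
}

// Pick a backend and record why, so the reason can be surfaced in diagnostics.
func selectBackend(_ probe: ProbeResult) -> (backend: AttentionBackend, reason: String) {
    guard probe.metalAvailable else {
        return (.packedQuantized, "Metal backend unavailable")
    }
    guard probe.selfTestPassed else {
        return (.packedQuantized, "Metal self-test failed")
    }
    return (.compressedMetal, "probe passed")
}
```

Returning the reason alongside the backend keeps the fallback observable: an unsupported device silently staying on the existing path is the intended behavior, but the diagnostics should still say why.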

Measured/observable benefit so far:

- The main concrete benefit is reduced KV-cache memory pressure. The compressed representation stores K/V payloads in low-bit packed form instead of keeping full fp16/bf16 raw cache tensors. For 4-bit values, the raw value payload is about 4x smaller than fp16 before counting per-group scale/bias metadata; the 2.5-bit / 3.5-bit TurboQuant presets reduce payload lanes further. The LM-side compressed-attention path also has tests that verify raw-free compressed cache state when the Metal attention backend is available.
- We have not yet produced polished end-to-end tokens/sec benchmark numbers suitable for an upstream PR. If this direction is interesting, we can turn the fork into a smaller benchmarked PR series rather than sending the current fork stack wholesale.
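
For reference, the payload arithmetic behind the memory claim can be sketched as follows. This is back-of-the-envelope math, not a measurement. The model shape in the usage note, the group size of 64, and the fp16 scale plus fp16 bias per group are assumptions for illustration; the metadata layout of the 2.5-bit / 3.5-bit TurboQuant presets is not modeled here.

```swift
/// Payload compression ratio versus fp16 (16 bits per value), ignoring metadata.
func payloadRatio(bitsPerValue: Double) -> Double {
    16.0 / bitsPerValue
}

/// Ratio including per-group metadata, assuming one fp16 scale and one
/// fp16 bias (32 bits total) per group of `groupSize` values.
func ratioWithGroupMetadata(bitsPerValue: Double, groupSize: Int) -> Double {
    let effectiveBits = bitsPerValue + 32.0 / Double(groupSize)
    return 16.0 / effectiveBits
}

/// Payload bytes for the K and V caches across all layers (metadata excluded).
func cachePayloadBytes(layers: Int, tokens: Int, kvHeads: Int, headDim: Int,
                       bitsPerValue: Double) -> Double {
    let values = Double(2 * layers * tokens * kvHeads * headDim) // K and V
    return values * bitsPerValue / 8.0
}
```

Under these assumptions, a hypothetical model with 32 layers, 8 KV heads, and head dimension 128 at 4096 tokens needs 512 MiB of fp16 payload versus 128 MiB at 4 bits. With group size 64, the metadata brings the effective 4-bit ratio down from 4x to roughly 3.56x.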

Suggested upstreaming shape, if maintainers are interested:

1. First discuss the desired API surface and benchmark/device matrix.
2. Split the `mlx-swift` core support from the `mlx-swift-lm` cache integration.
3. Keep the fallback/probe behavior mandatory so unsupported devices remain on existing paths.
4. Add focused benchmarks for memory footprint and generation throughput before any large PR.

I also prepared a few small unrelated cleanup/fix branches separately, because those are more suitable as conventional PRs than the TurboQuant stack:

- `mlx-swift` linalg nuclear norm API/docs: https://github.com/RNT56/mlx-swift/tree/upstream-pr/linalg-norm-kind-nuc
- `mlx-swift-lm` VLM processor completions: https://github.com/RNT56/mlx-swift-lm/tree/upstream-pr/vlm-processor-completions
- `mlx-swift-lm` model config/runtime error hardening: https://github.com/RNT56/mlx-swift-lm/tree/upstream-pr/model-config-runtime-hardening
- `mlx-swift-lm` model compatibility docs: https://github.com/RNT56/mlx-swift-lm/tree/upstream-pr/model-compatibility-docs

Would you be interested in us preparing a benchmarked TurboQuant PR series, and if so, which part would you prefer to evaluate first?

