Hi, we have been experimenting in a fork with a TurboQuant path for MLX Swift and an LM-side rotating quantized KV cache, and wanted to ask whether this is directionally interesting before trying to shape it into upstream PRs.
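The implementation is currently split across:
- mlx-swift: https://github.com/RNT56/mlx-swift/tree/schtack/turboquant-kv
- mlx-swift-lm: https://github.com/RNT56/mlx-swift-lm/tree/schtack/turboquant-kv
What exists in the fork: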
- A TurboQuant API in mlx-swift with packed tensor support, reference encode/decode, Metal codec kernels, compressed attention kernels, runtime capability probing, and fallback diagnostics.
- mlx-swift-lm integration with TurboQuant KV cache strategies, rotating quantized KV cache support, compressed attention routing, cache diagnostics, and tests around cache copying/serialization/fallback behavior.
- Runtime fallback behavior: the public path can fall back to MLX packed quantized lanes when the PolarQuant/QJL Metal backend is unavailable or fails the self-test. The Metal attention path is gated behind a runtime probe, so unsupported devices should keep using the safer fallback path.
- Device/profile handling currently distinguishes portable Apple GPU profiles from wider/sustained profiles using Metal availability, GPU family signals, architecture name, and recommended working-set size. The code currently treats A16/A17-class devices as portable, A18/A19 / Apple8+ as wider, and larger working sets / A19 Pro as sustained.
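To make the probe-gated fallback concrete, here is a minimal sketch of the decision we have in mind. All type and function names below are hypothetical and chosen for illustration only; they are not the fork's actual API, and the real probe checks more than two flags.

```swift
// Hypothetical names throughout; none of these are taken from the fork.
enum KVCacheBackend: Equatable {
    case metalCompressedAttention   // PolarQuant/QJL Metal path
    case packedQuantizedFallback    // existing MLX packed quantized lanes
}

// Outcome of a (hypothetical) runtime capability probe.
struct ProbeResult {
    var metalAvailable: Bool
    var selfTestPassed: Bool
}

// The chosen backend plus a human-readable reason for diagnostics.
struct BackendDecision: Equatable {
    var backend: KVCacheBackend
    var reason: String
}

// Only take the Metal path when the backend exists AND its self-test passed;
// every other case stays on the safer fallback path.
func selectBackend(_ probe: ProbeResult) -> BackendDecision {
    guard probe.metalAvailable else {
        return BackendDecision(backend: .packedQuantizedFallback,
                               reason: "Metal backend unavailable")
    }
    guard probe.selfTestPassed else {
        return BackendDecision(backend: .packedQuantizedFallback,
                               reason: "Metal self-test failed")
    }
    return BackendDecision(backend: .metalCompressedAttention,
                           reason: "probe passed")
}

let decision = selectBackend(ProbeResult(metalAvailable: true, selfTestPassed: false))
print(decision.backend, "-", decision.reason)
```

The point of returning a reason alongside the backend is that a fallback is never silent: callers can log or surface why the compressed attention path was not taken.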
Measured/observable benefit so far:
- The main concrete benefit is KV-cache memory pressure reduction. The compressed representation stores K/V payloads in low-bit packed form instead of keeping full fp16/bf16 raw cache tensors. For 4-bit values, the raw value payload is about 4x smaller than fp16 before per-group scale/bias metadata; the 2.5-bit / 3.5-bit TurboQuant presets reduce payload lanes further. The LM-side compressed-attention path also has tests that verify raw-free compressed cache state when the Metal attention backend is available.
- We have not yet produced polished end-to-end tokens/sec benchmark numbers suitable for an upstream PR. If this direction is interesting, we can turn the fork into a smaller benchmarked PR series rather than sending the current fork stack wholesale.
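The payload arithmetic above can be sketched as a small back-of-the-envelope calculation. The model shape (8 KV heads, head dimension 128), the group size of 64, and the 4 bytes of fp16 scale/bias metadata per group are assumptions picked for illustration, not measurements from the fork; the 2.5-bit and 3.5-bit figures are payload-only because their metadata layout is not described here.

```swift
// Bytes for `count` cached values stored raw (fp16/bf16 = 2 bytes each).
func rawBytes(count: Int, bytesPerValue: Int = 2) -> Int {
    return count * bytesPerValue
}

// Bytes for `count` values stored at `bits` bits each, plus optional
// per-group metadata (e.g. a scale and a bias). Assumed layout, not the fork's.
func quantizedBytes(count: Int, bits: Double, groupSize: Int,
                    metadataBytesPerGroup: Int) -> Double {
    let payload = Double(count) * bits / 8.0
    let groups = (Double(count) / Double(groupSize)).rounded(.up)
    return payload + groups * Double(metadataBytesPerGroup)
}

// Hypothetical shape: K and V for one token in one layer.
let valuesPerToken = 8 * 128 * 2            // 2048 values
let raw = Double(rawBytes(count: valuesPerToken))

// 4-bit payload only, then with assumed fp16 scale + bias per group of 64.
let payload4 = quantizedBytes(count: valuesPerToken, bits: 4,
                              groupSize: 64, metadataBytesPerGroup: 0)
let total4 = quantizedBytes(count: valuesPerToken, bits: 4,
                            groupSize: 64, metadataBytesPerGroup: 4)
print("4-bit payload only:", raw / payload4)   // 4.0
print("4-bit with metadata:", raw / total4)

// Payload-only ratios for the lower-bit presets.
for bits in [3.5, 2.5] {
    let payload = quantizedBytes(count: valuesPerToken, bits: bits,
                                 groupSize: 64, metadataBytesPerGroup: 0)
    print("\(bits)-bit payload only:", raw / payload)
}
```

Under these assumptions the 4x figure drops to roughly 3.6x once scale/bias metadata is counted, which is why we would want the memory benchmarks to report total cache bytes rather than payload bytes alone.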
Suggested upstreaming shape, if maintainers are interested:
- First discuss the desired API surface and benchmark/device matrix.
- Split the mlx-swift core support from the mlx-swift-lm cache integration.
- Keep the fallback/probe behavior mandatory so unsupported devices remain on existing paths.
- Add focused benchmarks for memory footprint and generation throughput before any large PR.
I also prepared a few small unrelated cleanup/fix branches separately, because those are more suitable as conventional PRs than the TurboQuant stack:
- mlx-swift linalg nuclear norm API/docs: https://github.com/RNT56/mlx-swift/tree/upstream-pr/linalg-norm-kind-nuc
- mlx-swift-lm VLM processor completions: https://github.com/RNT56/mlx-swift-lm/tree/upstream-pr/vlm-processor-completions
- mlx-swift-lm model config/runtime error hardening: https://github.com/RNT56/mlx-swift-lm/tree/upstream-pr/model-config-runtime-hardening
- mlx-swift-lm model compatibility docs: https://github.com/RNT56/mlx-swift-lm/tree/upstream-pr/model-compatibility-docs
Would you be interested in us preparing a benchmarked TurboQuant PR series, and if so, which part would you prefer to evaluate first?