
Interest in TurboQuant / rotating quantized KV cache support? #294

@RNT56


Hi, we have been experimenting in a fork with a TurboQuant path for MLX Swift and an LM-side rotating quantized KV cache, and wanted to ask whether this is directionally interesting before trying to shape it into upstream PRs.

The implementation is currently split across mlx-swift and mlx-swift-lm.

What exists in the fork:

  • A TurboQuant API in mlx-swift with packed tensor support, reference encode/decode, Metal codec kernels, compressed attention kernels, runtime capability probing, and fallback diagnostics.
  • mlx-swift-lm integration with TurboQuant KV cache strategies, rotating quantized KV cache support, compressed attention routing, cache diagnostics, and tests around cache copying/serialization/fallback behavior.
  • Runtime fallback behavior: the public path can fall back to MLX packed quantized lanes when the PolarQuant/QJL Metal backend is unavailable or fails the self-test. The Metal attention path is gated behind a runtime probe, so unsupported devices should keep using the safer fallback path.
  • Device/profile handling currently distinguishes portable Apple GPU profiles from wider/sustained profiles using Metal availability, GPU family signals, architecture name, and recommended working-set size. The code currently treats A16/A17-class devices as portable, A18/A19 / Apple8+ as wider, and larger working sets / A19 Pro as sustained.
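
To make the fallback behavior concrete, here is a minimal sketch of the gating logic described above. All type and function names (`AttentionBackend`, `ProbeResult`, `selectBackend`) are hypothetical and chosen for illustration only; they are not the fork's actual API, and the real probe likely carries more state than two booleans.

```swift
// Hypothetical names throughout: this illustrates the gating idea only.
enum AttentionBackend: Equatable {
    case metalCompressedAttention  // PolarQuant/QJL Metal path
    case mlxPackedQuantized        // existing MLX packed quantized lanes
}

// Assumed shape of a runtime probe result.
struct ProbeResult {
    var metalBackendAvailable: Bool
    var selfTestPassed: Bool
}

// Pick the compressed Metal path only when the backend is present and its
// self-test passed; every other case stays on the fallback path.
func selectBackend(_ probe: ProbeResult) -> AttentionBackend {
    guard probe.metalBackendAvailable, probe.selfTestPassed else {
        return .mlxPackedQuantized
    }
    return .metalCompressedAttention
}
```

The point of routing everything through one decision function is that a device the probe does not recognize, or one that fails the self-test, never reaches the Metal attention kernels.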

Measured/observable benefit so far:

  • The main concrete benefit is KV-cache memory pressure reduction. The compressed representation stores K/V payloads in low-bit packed form instead of keeping full fp16/bf16 raw cache tensors. For 4-bit values, the raw value payload is about 4x smaller than fp16 before per-group scale/bias metadata; the 2.5-bit / 3.5-bit TurboQuant presets reduce payload lanes further. The LM-side compressed-attention path also has tests that verify raw-free compressed cache state when the Metal attention backend is available.
  • We have not yet produced polished end-to-end tokens/sec benchmark numbers suitable for an upstream PR. If this direction is interesting, we can turn the fork into a smaller benchmarked PR series rather than sending the current fork stack wholesale.
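
As a rough illustration of the "about 4x before metadata" figure, the sketch below computes the reduction relative to a 16-bit cache with and without per-group metadata. The group size of 64 and the 32 bits of metadata per group (one fp16 scale plus one fp16 bias) are assumptions for the example; the fork's actual group size and metadata layout may differ, and the function names are hypothetical.

```swift
// Average stored bits per cached value, including per-group metadata.
func effectiveBitsPerValue(payloadBits: Double, groupSize: Int, metadataBitsPerGroup: Int) -> Double {
    return payloadBits + Double(metadataBitsPerGroup) / Double(groupSize)
}

// Size reduction relative to a 16-bit (fp16/bf16) raw cache.
func reductionVersusFP16(payloadBits: Double, groupSize: Int, metadataBitsPerGroup: Int) -> Double {
    let bits = effectiveBitsPerValue(
        payloadBits: payloadBits,
        groupSize: groupSize,
        metadataBitsPerGroup: metadataBitsPerGroup
    )
    return 16.0 / bits
}

// 4-bit payload only: 16 / 4 = 4.0
let payloadOnly = reductionVersusFP16(payloadBits: 4, groupSize: 64, metadataBitsPerGroup: 0)

// Assumed metadata: fp16 scale + fp16 bias per group of 64 values,
// so 4 + 32/64 = 4.5 bits per value and 16 / 4.5 is roughly 3.56.
let withMetadata = reductionVersusFP16(payloadBits: 4, groupSize: 64, metadataBitsPerGroup: 32)

print(payloadOnly, withMetadata)
```

Under these assumptions the metadata overhead trims the 4x payload reduction to roughly 3.6x, which is why the memory benchmarks should report total cache footprint rather than payload size alone.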

Suggested upstreaming shape, if maintainers are interested:

  1. First discuss the desired API surface and benchmark/device matrix.
  2. Split the mlx-swift core support from the mlx-swift-lm cache integration.
  3. Keep the fallback/probe behavior mandatory so unsupported devices remain on existing paths.
  4. Add focused benchmarks for memory footprint and generation throughput before any large PR.

I have also prepared a few small, unrelated cleanup/fix branches separately, since those are better suited to conventional PRs than the TurboQuant stack.

Would you be interested in us preparing a benchmarked TurboQuant PR series, and if so, which part would you prefer to evaluate first?
