 > Values shown as `generation speed · GPU memory allocated`

 **Key takeaways:**
-- 🖥️ **8 GB Mac Mini**: SSD Stream runs a 26B model at **4.6 GB Active RAM**
-- 📄 **40K context on 24 GB MacBook Pro**: SSD + TurboQuant fits in **22.7 GB**
-- 📚 **100K context on 32 GB Mac Studio**: SSD + TurboQuant fits in **33.3 GB** — previously required 64 GB
+- 🚀 **Speed Doubled**: The newer MLX backend modifications have more than doubled raw `SSD Stream` inference speed (from 4.5 -> **10.8 tok/s**) while maintaining streaming stability.
+- 📄 **40K context on 24 GB MacBook Pro**: SSD + TurboQuant effortlessly fits a 26B model in **22.5 GB** of memory footprint.
+- 📚 **100K context on 24 GB MacBook Pro**: Due to hyper-efficient 3-bit KV compression paired with SSD weight streaming, you can process 100,000 tokens of context on a 24 GB machine — only utilizing **22.3 GB** total. (Previously required a 64 GB Mac Studio).

 > Run `./run_benchmark.sh` to generate these metrics on your own device. (See **Benchmarks & Testing** below).
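To make the 3-bit KV compression claim in the takeaways easier to sanity-check, here is a rough back-of-the-envelope sketch of how KV cache size scales with bits per stored value. The layer count, KV head count, and head dimension are hypothetical placeholders, not the dimensions of the 26B model benchmarked here, and the formula ignores per-block metadata such as norms or scales that a real quantized cache would also store.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_value):
    """Approximate KV cache size in bytes (keys + values, all layers)."""
    # 2 accounts for storing both a key and a value vector per token.
    stored_values = 2 * layers * kv_heads * head_dim * tokens
    return stored_values * bits_per_value / 8


# Hypothetical dimensions, NOT taken from any specific model.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128
TOKENS = 100_000

fp16_bytes = kv_cache_bytes(TOKENS, LAYERS, KV_HEADS, HEAD_DIM, 16)
q3_bytes = kv_cache_bytes(TOKENS, LAYERS, KV_HEADS, HEAD_DIM, 3)

print(f"16-bit cache: {fp16_bytes / 2**30:.2f} GiB")
print(f" 3-bit cache: {q3_bytes / 2**30:.2f} GiB")
print(f"reduction:    {fp16_bytes / q3_bytes:.2f}x")
```

Whatever the real dimensions are, the ratio between a 16-bit and a 3-bit cache under this simplified model is 16/3, which is the kind of saving that lets a long context fit next to SSD-streamed weights.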
@@ -245,24 +245,18 @@ The breakthrough arrived when we realized the **embedding scale** was missing. T
 The model instantly woke up from "whispering" whitespace and successfully responded to `"What is 2+2?"` with a perfect `"2 + 2 equals 4."` — proving that the entire massive structural pipeline from Swift to Metal was working.

-## 📄 Dependencies & License
+## 🙏 Acknowledgments & Credits

-Built entirely on the hard work of the Apple MLX community.
-- [mlx-swift](https://github.com/ml-explore/mlx-swift) — Apple MLX framework for Swift
-- [mlx-lm](https://github.com/ml-explore/mlx/tree/main/mlx_lm) — Python reference implementation for MLX Language Models (inspiration for prompt chunking architecture)
-- [Hummingbird](https://github.com/hummingbird-project/hummingbird) — Event-driven Swift HTTP server
-- [flash-moe](https://github.com/danveloper/flash-moe) — Reference for SSD Expert Streaming
+`SwiftLM` leverages the powerful foundation of the Apple MLX community and relies heavily on the open-source ecosystem. While the custom C++ implementations, Metal optimizations, and high-performance pipeline architecture were engineered natively for this engine, we owe massive thanks to the following projects for their indispensable reference materials and underlying protocols:

-### 🙏 TurboQuant Credits
+- **[mlx-swift](https://github.com/ml-explore/mlx-swift)** — The core Apple MLX wrapper bringing Metal-accelerated operations into the Swift ecosystem.
+- **[mlx-lm](https://github.com/ml-explore/mlx/tree/main/mlx_lm)** — The official Python language models implementation, serving as the core inspiration for our chunked-prefill architecture and attention manipulation logic.
+- **[flash-moe](https://github.com/danveloper/flash-moe)** — Inspired the memory-mapped out-of-core SSD Expert Streaming mechanics that we implemented natively in SwiftLM.
+- **[Hummingbird](https://github.com/hummingbird-project/hummingbird)** — The incredible event-driven Swift HTTP engine powering the OpenAI-compatible REST API.
+- **[TurboQuant Paper](https://arxiv.org/abs/2504.19874)** — *"TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate"* (Zandieh et al., AISTATS 2026). Provided the initial algorithmic framework for the dual-stage PolarQuant + QJL engine.
+- **[TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant/tree/feature/turboquant-kv-cache)** — Served as an invaluable reference architecture for the C and GPU quantization tables, guiding the development of our native `turbo-wht` Walsh-Hadamard kernels and custom Metal wrapper layers.
+- **[TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus)** — Essential Python validation logic used to certify the correctness of our manually constructed Lloyd-Max codebook generation math.
+- **[amirzandieh/QJL](https://github.com/amirzandieh/QJL)** — The original 1-bit residual correction engine backing the paper, which informed our QJL error recovery in dot-product regimes.

-The TurboQuant KV cache compression implemented in `SwiftLM` is directly based on the following open-source work and research:
-
-- **[TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant/tree/feature/turboquant-kv-cache)** — The primary reference for the C and Metal GPU implementation. The `turbo-wht.h` Fast Walsh-Hadamard kernel, WHT sign arrays (seed=42), Lloyd-Max centroid tables, and the `ggml-turbo-quant.c` quantize/dequantize logic were ported directly from this repository into our MLX C++ and Metal backend.
-
-- **[TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus)** — Python reference implementation used to validate the algorithm math, codebook construction (Lloyd's algorithm for N(0, 1/d)), and KV cache integration design.
-
-- **TurboQuant Paper** — *"TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate"*, Zandieh et al., AISTATS/ICLR 2026. The two-stage PolarQuant + QJL algorithm described in Section 3 and Appendix A is the mathematical foundation of this implementation.
-
-- **[amirzandieh/QJL](https://github.com/amirzandieh/QJL)** — Original Quantized Johnson-Lindenstrauss (QJL) 1-bit residual correction implementation by the paper authors.
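For readers unfamiliar with the Walsh-Hadamard kernels and seeded sign arrays credited above, the following is a minimal sketch of the general technique: flip signs with a fixed random pattern, then apply an orthonormal fast Walsh-Hadamard transform so that outliers are spread evenly before quantization. It is an illustration only, not the `turbo-wht` kernel or the SwiftLM implementation. The function names are hypothetical, and Python's `random` module with seed 42 will not reproduce the sign arrays of the referenced repositories.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly pass: combine pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform its own inverse
    return [v * scale for v in out]


def random_signs(n, seed=42):
    """Fixed +/-1 pattern; the seed here is illustrative only."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def rotate(vector, signs):
    """Sign flip followed by the orthonormal WHT."""
    return fwht([v * s for v, s in zip(vector, signs)])


def unrotate(rotated, signs):
    """Inverse rotation: apply the WHT again, then undo the sign flips."""
    return [v * s for v, s in zip(fwht(rotated), signs)]


signs = random_signs(8)
outlier = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
rotated = rotate(outlier, signs)
print([round(abs(v), 3) for v in rotated])    # every entry has the same magnitude
print([round(v, 3) for v in unrotate(rotated, signs)])
```

Because the rotation is orthonormal, it preserves vector norms and dot products, so a quantizer with a fixed codebook can be applied to the rotated coordinates and the result rotated back afterwards.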