Skip to content

Commit b20f16d

Browse files
committed
fix(ci): Resolve mlx-swift-lm v3 API updates and update dependencies
1 parent 357db2e commit b20f16d

4 files changed

Lines changed: 35 additions & 29 deletions

File tree

Package.resolved

Lines changed: 3 additions & 3 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

Package.swift

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -29,6 +29,7 @@ let package = Package(
2929
.product(name: "MLXLLM", package: "mlx-swift-lm"),
3030
.product(name: "MLXVLM", package: "mlx-swift-lm"),
3131
.product(name: "MLXLMCommon", package: "mlx-swift-lm"),
32+
.product(name: "MLXHuggingFace", package: "mlx-swift-lm"),
3233
.product(name: "Transformers", package: "swift-transformers"),
3334
.product(name: "Hummingbird", package: "hummingbird"),
3435
.product(name: "ArgumentParser", package: "swift-argument-parser"),
@@ -42,6 +43,7 @@ let package = Package(
4243
.product(name: "MLX", package: "mlx-swift"),
4344
.product(name: "MLXLLM", package: "mlx-swift-lm"),
4445
.product(name: "MLXLMCommon", package: "mlx-swift-lm"),
46+
.product(name: "MLXHuggingFace", package: "mlx-swift-lm"),
4547
.product(name: "Hub", package: "swift-transformers"),
4648
],
4749
path: "Sources/MLXInferenceCore",

README.md

Lines changed: 19 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -59,17 +59,17 @@ Benchmark results for `gemma-4-26b-a4b-it-4bit` (26B MoE, 4-bit) on M5 Pro 64 GB
5959

6060
| Configuration | 512 ctx | 40K ctx | 100K ctx |
6161
|---|---|---|---|
62-
| **Dense/Vanilla** | 34 tok/s · 18.8 GB | 17 tok/s · 52.6 GB | 16 tok/s · 52.1 GB |
63-
| **SSD Stream** | 4.5 tok/s · **7.7 GB** | 4.2 tok/s · 52.1 GB | 3.5 tok/s · 52.1 GB |
64-
| **TurboQuant** | 34 tok/s · 18.6 GB | 7.0 tok/s · **35.0 GB** | 4.1 tok/s · **46.7 GB** |
65-
| **SSD + TurboQuant** | 4.7 tok/s · **7.7 GB** | 2.1 tok/s · **22.7 GB** | 1.4 tok/s · **33.3 GB** |
62+
| **Dense/Vanilla** | 33.0 tok/s · 23.4 GB | 20.2 tok/s · 57.0 GB | 15.7 tok/s · 56.7 GB |
63+
| **SSD Stream** | 10.8 tok/s · **22.2 GB** | 10.4 tok/s · **24.2 GB** | 9.0 tok/s · **27.6 GB** |
64+
| **TurboQuant** | 29.0 tok/s · 23.7 GB | 3.9 tok/s · 39.4 GB | 3.9 tok/s · 57.3 GB |
65+
| **SSD + TurboQuant** | 11.4 tok/s · **22.0 GB** | 2.5 tok/s · **22.5 GB** | 1.6 tok/s · **22.3 GB** |
6666

6767
> Values shown as `generation speed · GPU memory allocated`
6868
6969
**Key takeaways:**
70-
- 🖥️ **8 GB Mac Mini**: SSD Stream runs a 26B model at **4.6 GB Active RAM**
71-
- 📄 **40K context on 24 GB MacBook Pro**: SSD + TurboQuant fits in **22.7 GB**
72-
- 📚 **100K context on 32 GB Mac Studio**: SSD + TurboQuant fits in **33.3 GB** — previously required 64 GB
70+
- 🚀 **Speed Doubled**: The newer MLX backend modifications have more than doubled raw `SSD Stream` inference speed (from 4.5 -> **10.8 tok/s**) while maintaining streaming stability.
71+
- 📄 **40K context on 24 GB MacBook Pro**: SSD + TurboQuant effortlessly fits a 26B model in **22.5 GB** of memory footprint.
72+
- 📚 **100K context on 24 GB MacBook Pro**: Due to hyper-efficient 3-bit KV compression paired with SSD weight streaming, you can process 100,000 tokens of context on a 24 GB machine — only utilizing **22.3 GB** total. (Previously required a 64 GB Mac Studio).
7373

7474
> Run `./run_benchmark.sh` to generate these metrics on your own device. (See **Benchmarks & Testing** below).
7575
@@ -245,24 +245,18 @@ The breakthrough arrived when we realized the **embedding scale** was missing. T
245245

246246
The model instantly woke up from "whispering" whitespace and successfully responded to `"What is 2+2?"` with a perfect `"2 + 2 equals 4."` — proving that the entire massive structural pipeline from Swift to Metal was working.
247247

248-
## 📄 Dependencies & License
248+
## 🙏 Acknowledgments & Credits
249249

250-
Built entirely on the hard work of the Apple MLX community.
251-
- [mlx-swift](https://github.com/ml-explore/mlx-swift) — Apple MLX framework for Swift
252-
- [mlx-lm](https://github.com/ml-explore/mlx/tree/main/mlx_lm) — Python reference implementation for MLX Language Models (inspiration for prompt chunking architecture)
253-
- [Hummingbird](https://github.com/hummingbird-project/hummingbird) — Event-driven Swift HTTP server
254-
- [flash-moe](https://github.com/danveloper/flash-moe) — Reference for SSD Expert Streaming
250+
`SwiftLM` leverages the powerful foundation of the Apple MLX community and relies heavily on the open-source ecosystem. While the custom C++ implementations, Metal optimizations, and high-performance pipeline architecture were engineered natively for this engine, we owe massive thanks to the following projects for their indispensable reference materials and underlying protocols:
255251

256-
### 🙏 TurboQuant Credits
252+
- **[mlx-swift](https://github.com/ml-explore/mlx-swift)** — The core Apple MLX wrapper bringing Metal-accelerated operations into the Swift ecosystem.
253+
- **[mlx-lm](https://github.com/ml-explore/mlx/tree/main/mlx_lm)** — The official Python language models implementation, serving as the core inspiration for our chunked-prefill architecture and attention manipulation logic.
254+
- **[flash-moe](https://github.com/danveloper/flash-moe)** — Inspired the memory-mapped out-of-core SSD Expert Streaming mechanics that we implemented natively in SwiftLM.
255+
- **[Hummingbird](https://github.com/hummingbird-project/hummingbird)** — The incredible event-driven Swift HTTP engine powering the OpenAI-compatible REST API.
256+
- **[TurboQuant Paper](https://arxiv.org/abs/2504.19874)***"TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate"* (Zandieh et al., AISTATS 2026). Provided the initial algorithmic framework for the dual-stage PolarQuant + QJL engine.
257+
- **[TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant/tree/feature/turboquant-kv-cache)** — Served as an invaluable reference architecture for the C and GPU quantization tables, guiding the development of our native `turbo-wht` Walsh-Hadamard kernels and custom Metal wrapper layers.
258+
- **[TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus)** — Essential Python validation logic used to certify the correctness of our manually constructed Lloyd-Max codebook generation math.
259+
- **[amirzandieh/QJL](https://github.com/amirzandieh/QJL)** — The original 1-bit residual correction engine backing the paper, which informed our QJL error recovery in dot-product regimes.
257260

258-
The TurboQuant KV cache compression implemented in `SwiftLM` is directly based on the following open-source work and research:
259-
260-
- **[TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant/tree/feature/turboquant-kv-cache)** — The primary reference for the C and Metal GPU implementation. The `turbo-wht.h` Fast Walsh-Hadamard kernel, WHT sign arrays (seed=42), Lloyd-Max centroid tables, and the `ggml-turbo-quant.c` quantize/dequantize logic were ported directly from this repository into our MLX C++ and Metal backend.
261-
262-
- **[TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus)** — Python reference implementation used to validate the algorithm math, codebook construction (Lloyd's algorithm for N(0, 1/d)), and KV cache integration design.
263-
264-
- **TurboQuant Paper***"TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate"*, Zandieh et al., AISTATS/ICLR 2026. The two-stage PolarQuant + QJL algorithm described in Section 3 and Appendix A is the mathematical foundation of this implementation.
265-
266-
- **[amirzandieh/QJL](https://github.com/amirzandieh/QJL)** — Original Quantized Johnson-Lindenstrauss (QJL) 1-bit residual correction implementation by the paper authors.
267-
268-
**MIT License**
261+
---
262+
**License**: MIT

Sources/MLXInferenceCore/InferenceEngine.swift

Lines changed: 11 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@ import MLXLLM
77
import MLXLMCommon
88
import Hub
99
import Tokenizers
10+
import MLXHuggingFace
1011
#if canImport(UIKit)
1112
import UIKit
1213
#endif
@@ -52,6 +53,13 @@ public struct GenerationToken: Sendable {
5253

5354
// MARK: — InferenceEngine
5455

56+
struct HubDownloader: Downloader {
57+
let hub: HubApi
58+
func download(id: String, revision: String?, matching patterns: [String], useLatest: Bool, progressHandler: @Sendable @escaping (Progress) -> Void) async throws -> URL {
59+
return try await hub.snapshot(from: id, matching: patterns, progressHandler: progressHandler)
60+
}
61+
}
62+
5563
@MainActor
5664
public final class InferenceEngine: ObservableObject {
5765
@Published public private(set) var state: ModelState = .idle
@@ -226,6 +234,7 @@ public final class InferenceEngine: ObservableObject {
226234

227235
do {
228236
let hub = HubApi(downloadBase: ModelStorage.cacheRoot)
237+
let downloader = HubDownloader(hub: hub)
229238

230239
// For MoE models, enable expert streaming before loading so
231240
// loadWeights() initialises ExpertStreamerManager correctly.
@@ -250,7 +259,8 @@ public final class InferenceEngine: ObservableObject {
250259
}
251260

252261
container = try await LLMModelFactory.shared.loadContainer(
253-
hub: hub,
262+
from: downloader,
263+
using: #huggingFaceTokenizerLoader(),
254264
configuration: config
255265
) { [weak self] progress in
256266
Task { @MainActor in

0 commit comments

Comments
 (0)