feat: SSD expert streaming for DeepSeek-V4 MoE (complete engine, WIP) by kokorogg · Pull Request #45 · Layr-Labs/mlx-swift-lm

kokorogg · 2026-06-17T20:56:38Z

Summary

This PR contains the complete SSD expert-streaming engine: it streams
DeepSeek-V4-Flash's routed experts from NVMe/SSD on demand instead of holding
the ~151 GB expert pool resident in RAM. This is the implementation that runs
DeepSeek-V4-Flash-4bit from SSD (smoke-tested: coherent generation, experts
served entirely from NVMe), authored in our SSDStream research program.

Status: DRAFT / WIP. Everything related to SSD weight-streaming is here,
but it does not build standalone on this fork yet: it expects engine
pieces that differ here (see What it needs to build on this fork). Opened
complete, per request, for integration.

What's in the PR (9 files)

Streaming engine

SwitchLayers.swift — streaming MoE dispatch: LRU expert slot cache, async
miss-fetch pool (SSDFetchPool), stacked/fused kernels, live SSD telemetry
(SSDStreamMetrics). Replaces the fork's 254-line MoE path.
DeepseekV4.swift — streamed DeepSeek-V4: DSA (window + compressed-KV)
attention, HC Sinkhorn, hash routing, MLA RoPE. Replaces the fork's
DeepSeek-V4.
ConcurrentError.swift (new) — SSDStreamingError + SSDStreamingErrorLatch
+ ThreadSafeError (async-fetch error propagation).
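
The LRU expert slot cache mentioned above can be pictured with a small sketch. This is not the code in SwitchLayers.swift: the type name `ExpertSlotCache`, the `load` closure (standing in for the SSD read), and the array-based recency list are all assumptions chosen for brevity.

```swift
// Hypothetical sketch of a fixed-capacity LRU cache of expert slots.
// Names and shapes are illustrative, not the PR's API.
struct ExpertSlotCache<Weights> {
    private let capacity: Int
    private var slots: [Int: Weights] = [:]   // expert id -> resident weights
    private var recency: [Int] = []           // least recently used first
    private(set) var hits = 0
    private(set) var misses = 0

    init(capacity: Int) {
        precondition(capacity > 0, "cache needs at least one slot")
        self.capacity = capacity
    }

    /// Returns the weights for `expert`, calling `load` (the SSD read) on a miss.
    @discardableResult
    mutating func get(_ expert: Int, load: (Int) -> Weights) -> Weights {
        if let cached = slots[expert] {
            hits += 1
            touch(expert)
            return cached
        }
        misses += 1
        // Evict the least recently used expert when every slot is taken.
        if slots.count >= capacity, let victim = recency.first {
            recency.removeFirst()
            slots[victim] = nil
        }
        let loaded = load(expert)
        slots[expert] = loaded
        recency.append(expert)
        return loaded
    }

    /// Moves `expert` to the most-recently-used end of the list.
    private mutating func touch(_ expert: Int) {
        if let i = recency.firstIndex(of: expert) { recency.remove(at: i) }
        recency.append(expert)
    }

    var hitRate: Double {
        let total = hits + misses
        return total == 0 ? 0 : Double(hits) / Double(total)
    }
}

var cache = ExpertSlotCache<String>(capacity: 2)
cache.get(1) { id in "expert-\(id)" }   // miss: loaded
cache.get(2) { id in "expert-\(id)" }   // miss: loaded
cache.get(1) { id in "expert-\(id)" }   // hit: expert 1 becomes most recent
cache.get(3) { id in "expert-\(id)" }   // miss: evicts expert 2
print("hit rate:", cache.hitRate)
```

The linear scan in `touch` keeps the sketch short; a real slot cache would more likely use a linked list or generation counters so that a hit stays O(1).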

Config + caches (new, additive — these compile)

ExpertStreamingConfig.swift — streaming mode (disabled / mmap page-cache /
direct NVMe) + machine-size-aware slot budgeting.
SharedKVCache.swift — K==V shared-cache wrapper.
DSAKVCache.swift — sliding-window + compressed-slot KV cache.
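
To make the three streaming modes and the idea of machine-size-aware slot budgeting concrete, here is a minimal sketch. The real ExpertStreamingConfig is not reproduced here: the type names, the `memoryFraction` knob, and the budgeting formula are assumptions, and the expert size and count in the usage lines are placeholders rather than DeepSeek-V4-Flash figures.

```swift
import Foundation

// Hypothetical shapes; the PR's ExpertStreamingConfig may differ.
enum ExpertStreamingMode {
    case disabled        // all experts resident in RAM
    case mmapPageCache   // experts mmap'd, OS page cache decides residency
    case directNVMe      // explicit reads from SSD into a slot cache
}

struct StreamingConfigSketch {
    var mode: ExpertStreamingMode
    /// Fraction of physical memory the slot cache may use (assumed knob).
    var memoryFraction: Double = 0.5

    /// Number of expert slots that fit in the memory budget for this machine.
    func slotBudget(physicalMemoryBytes: UInt64,
                    bytesPerExpert: UInt64,
                    totalExperts: Int) -> Int {
        // With streaming off, every expert is resident.
        guard mode != .disabled else { return totalExperts }
        precondition(bytesPerExpert > 0, "expert size must be positive")
        let budget = Double(physicalMemoryBytes) * memoryFraction
        let slots = Int(budget / Double(bytesPerExpert))
        // Always keep at least one slot, never more than there are experts.
        return max(1, min(slots, totalExperts))
    }
}

let config = StreamingConfigSketch(mode: .directNVMe)
let slots = config.slotBudget(
    physicalMemoryBytes: ProcessInfo.processInfo.physicalMemory,
    bytesPerExpert: 16 * 1024 * 1024,   // placeholder size, not from the PR
    totalExperts: 4096)                 // placeholder count, not from the PR
print("slot budget:", slots)
```

Passing the physical memory in as a parameter, rather than reading it inside the method, keeps the budgeting rule testable on any machine.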

Generation-loop hooks

Evaluate.swift, Gemma4Text.swift — SSDFetchPool.waitAll() fetch
ordering + per-token SSD telemetry tick.
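
The ordering guarantee behind a `waitAll()` call, together with the first-error latch used for async-fetch error propagation, might look roughly like the sketch below. `FetchPoolSketch`, `ErrorLatchSketch`, and `ReadFailed` are hypothetical stand-ins, not the PR's SSDFetchPool or SSDStreamingErrorLatch, and building on `DispatchGroup` is an assumption made for illustration.

```swift
import Foundation

// Hypothetical stand-ins; names and shapes are assumptions, not the PR's types.
struct ReadFailed: Error {}

final class ErrorLatchSketch: @unchecked Sendable {
    private let lock = NSLock()
    private var first: Error?

    /// Records only the first error; later ones are dropped.
    func record(_ error: Error) {
        lock.lock(); defer { lock.unlock() }
        if first == nil { first = error }
    }

    /// Throws the recorded error, if any.
    func rethrow() throws {
        lock.lock(); defer { lock.unlock() }
        if let error = first { throw error }
    }
}

final class FetchPoolSketch {
    private let queue = DispatchQueue(label: "fetch-pool-sketch", attributes: .concurrent)
    private let group = DispatchGroup()
    private let latch = ErrorLatchSketch()

    /// Schedules an asynchronous miss fetch.
    func submit(_ fetch: @escaping @Sendable () throws -> Void) {
        queue.async(group: group) { [latch] in
            do { try fetch() } catch { latch.record(error) }
        }
    }

    /// Blocks until every submitted fetch has finished, then surfaces the first error.
    func waitAll() throws {
        group.wait()
        try latch.rethrow()
    }
}

let pool = FetchPoolSketch()
pool.submit { /* read one expert from SSD */ }
pool.submit { throw ReadFailed() }
do {
    try pool.waitAll()   // the generation loop would proceed only after this returns
} catch {
    print("fetch failed:", error)
}
```

Latching only the first error keeps the failure path deterministic when several fetches fail in the same step.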

Tests

DeepseekV4Tests.swift — model-correctness suite (HC Sinkhorn, MLA RoPE,
hash routing).

What it needs to build on this fork

The engine was developed against a different (SharpAI-lineage) base, so on this
ml-explore-lineage fork it references pieces that differ. To compile here:

MTPLanguageModel protocol, MambaCache.checkpoint, and
TokenIteratorProtocol public-access requirements — pulled in by
Evaluate.swift's MTP iterator; reconcile with this fork's MTP/cache types.
mlx-swift core: the F_NOCACHE streaming-read tweak (fast.cpp,
MLX_SSD_NOCACHE) lives in the core repo — needs a companion PR to
Layr-Labs/mlx-swift.
DeepseekV4.swift / SwitchLayers.swift replace this fork's existing
(diverged) implementations — a reconciliation decision (the fork has its own
sparse-attention DeepSeek-V4 + continuous-batching MoE path).

Motivation (measured)

Cost model: TPS ≈ SSD_BW / bytes_streamed_per_token. On an M5 Max (128 GB,
internal NVMe ~10.8 GB/s random) this engine streams DeepSeek-V4-Flash-4bit at a
measured warm 10.6 tok/s decode at ~0.23 GB/token device traffic (80.6%
expert-cache hit rate), with coherent output and experts served entirely from SSD.
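
As a worked example of the cost model, the sketch below plugs the figures quoted above into the formula. The function names are made up for illustration. Read as a bandwidth ceiling, the formula gives roughly 47 tok/s for these inputs, and the measured 10.6 tok/s corresponds to roughly 2.4 GB/s of device traffic.

```swift
/// Bandwidth-bound decode rate: TPS ≈ SSD_BW / bytes_streamed_per_token.
func boundTokensPerSecond(ssdGBPerSecond: Double, gbPerToken: Double) -> Double {
    precondition(gbPerToken > 0, "traffic per token must be positive")
    return ssdGBPerSecond / gbPerToken
}

/// Device bandwidth actually consumed at a measured decode rate.
func effectiveGBPerSecond(tokensPerSecond: Double, gbPerToken: Double) -> Double {
    return tokensPerSecond * gbPerToken
}

// Inputs are the figures quoted in this section.
let bound = boundTokensPerSecond(ssdGBPerSecond: 10.8, gbPerToken: 0.23)  // about 47 tok/s
let used = effectiveGBPerSecond(tokensPerSecond: 10.6, gbPerToken: 0.23)  // about 2.4 GB/s
print("bound tok/s:", bound, "effective GB/s:", used)
```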

Provenance

Authored in the SSDStream research program (gaj). Developed on a SwiftLM-lineage
base whose git history is unrelated to this fork, so it's contributed as file
contents. Conflicting files take the streaming implementation's version.

Phase 1 of porting the SSDStream research engine (DeepSeek-V4-Flash streamed from NVMe on Apple Silicon) onto this fork: the model-agnostic configuration and KV-cache scaffolding for streaming MoE expert weights from SSD instead of holding the full routed-expert pool resident in RAM.

New (purely additive, not referenced by any existing code path yet):

- ExpertStreamingConfig: public API for expert-streaming mode (disabled / mmap page-cache / direct NVMe) plus machine-size-aware slot-cache budgeting. Replaces the EXPERIMENTAL_SSD_STREAM env gate so streaming can be enabled on iOS, which cannot set environment variables.
- SharedKVCache: KVCache wrapper for models where K == V (shared cache).
- DSAKVCache: sliding-window + compressed-slot KV cache for DSA-style attention.

Verified with `swift build` (compiles clean against MLXLMCommon). No behavior change: these types are not yet wired into any model.

The streaming MoE dispatch path (SwitchLayers), SSD telemetry, and DeepSeek-V4 model wiring land in follow-ups; their current implementations on this fork differ substantially from the research lineage and need dedicated integration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The full SSD weight-streaming implementation from the SSDStream research program: stream DeepSeek-V4-Flash's routed experts from NVMe on demand instead of holding the ~151 GB expert pool resident in RAM.

Our complete versions of every streaming-touched file:

- SwitchLayers.swift: streaming MoE dispatch with LRU expert slot cache, async miss-fetch pool (SSDFetchPool), stacked/fused kernels, live SSD telemetry (SSDStreamMetrics). The streaming engine.
- DeepseekV4.swift: streamed DeepSeek-V4 with DSA (window + compressed-KV) attention, HC Sinkhorn, hash routing, MLA RoPE.
- ConcurrentError.swift: SSDStreamingError + SSDStreamingErrorLatch + ThreadSafeError (async-fetch error propagation).
- Evaluate.swift / Gemma4Text.swift: generation-loop streaming hooks (SSDFetchPool.waitAll ordering, per-token SSD telemetry tick).
- ExpertStreamingConfig / SharedKVCache / DSAKVCache: config + cache types.
- DeepseekV4Tests.swift: model-correctness suite.

DRAFT: does not build standalone on this fork yet. It expects engine pieces that differ here: MTPLanguageModel protocol, MambaCache.checkpoint, and TokenIteratorProtocol public access (Evaluate.swift's MTP path), plus the core mlx-swift F_NOCACHE streaming-read tweak (separate repo). See PR body.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

kokorogg and others added 2 commits June 17, 2026 13:55

kokorogg changed the title ~~feat: SSD expert-streaming configuration foundation (phase 1)~~ feat: SSD expert streaming for DeepSeek-V4 MoE (complete engine, WIP) Jun 17, 2026

kokorogg marked this pull request as draft June 17, 2026 21:06

kokorogg wants to merge 2 commits into Layr-Labs:main from kokorogg:feat/ssd-expert-streaming-foundation
