feat: SSD expert streaming for DeepSeek-V4 MoE (complete engine, WIP)#45

Draft
kokorogg wants to merge 2 commits into Layr-Labs:main from kokorogg:feat/ssd-expert-streaming-foundation

@kokorogg kokorogg commented Jun 17, 2026

Summary

The complete SSD expert-streaming engine — stream DeepSeek-V4-Flash's routed
experts from NVMe/SSD on demand instead of holding the ~151 GB expert pool
resident in RAM. This is the implementation that runs DeepSeek-V4-Flash-4bit
from SSD (smoke-tested: coherent generation, experts served entirely from
NVMe), authored in our SSDStream research program.

Status: DRAFT / WIP. Everything related to SSD weight-streaming is here,
but it does not build standalone on this fork yet — it expects engine
pieces that differ here (see What it needs to build). Opened complete, per
request, for integration.

What's in the PR (9 files)

Streaming engine

  • SwitchLayers.swift — streaming MoE dispatch: LRU expert slot cache, async
    miss-fetch pool (SSDFetchPool), stacked/fused kernels, live SSD telemetry
    (SSDStreamMetrics). Replaces the fork's 254-line MoE path.
  • DeepseekV4.swift — streamed DeepSeek-V4: DSA (window + compressed-KV)
    attention, HC Sinkhorn, hash routing, MLA RoPE. Replaces the fork's
    DeepSeek-V4.
  • ConcurrentError.swift (new): SSDStreamingError + SSDStreamingErrorLatch +
    ThreadSafeError (async-fetch error propagation).
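
For reviewers unfamiliar with the slot-cache idea, here is a minimal sketch of an LRU expert slot cache. It is not the code in SwitchLayers.swift: `ExpertSlotCache`, `get(_:load:)` and the generic `Weights` placeholder are hypothetical names chosen for illustration, and the real engine fetches misses asynchronously rather than inline.

```swift
// Hypothetical sketch of an LRU expert slot cache (names are illustrative,
// not the PR's API). A fixed number of slots hold expert weights; a miss
// loads the expert and evicts the least recently used one when full.
struct ExpertSlotCache<Weights> {
    private let capacity: Int
    private var slots: [Int: Weights] = [:]
    private var recency: [Int] = []   // least recently used first
    private(set) var hits = 0
    private(set) var misses = 0

    init(capacity: Int) {
        precondition(capacity > 0, "cache needs at least one slot")
        self.capacity = capacity
    }

    /// Returns the weights for `expert`, calling `load` only on a miss.
    mutating func get(_ expert: Int, load: (Int) throws -> Weights) rethrows -> Weights {
        if let cached = slots[expert] {
            hits += 1
            touch(expert)
            return cached
        }
        misses += 1
        let loaded = try load(expert)
        // Evict the least recently used expert once every slot is occupied.
        if slots.count >= capacity, let victim = recency.first {
            recency.removeFirst()
            slots[victim] = nil
        }
        slots[expert] = loaded
        recency.append(expert)
        return loaded
    }

    /// Moves `expert` to the most-recently-used end of the recency list.
    private mutating func touch(_ expert: Int) {
        if let index = recency.firstIndex(of: expert) {
            recency.remove(at: index)
        }
        recency.append(expert)
    }

    /// Fraction of lookups served from a resident slot.
    var hitRate: Double {
        let total = hits + misses
        return total == 0 ? 0 : Double(hits) / Double(total)
    }
}
```

The array-based recency list keeps the sketch short at the cost of O(n) updates; a production cache would more likely use a linked list or per-slot timestamps.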

Config + caches (new, additive — these compile)

  • ExpertStreamingConfig.swift — streaming mode (disabled / mmap page-cache /
    direct NVMe) + machine-size-aware slot budgeting.
  • SharedKVCache.swift — K==V shared-cache wrapper.
  • DSAKVCache.swift — sliding-window + compressed-slot KV cache.
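
To make "streaming mode + machine-size-aware slot budgeting" concrete, the following is a rough sketch of what such a configuration type could look like. It is not the contents of ExpertStreamingConfig.swift: the type names, the `residentFraction` knob and its 0.5 default are all assumptions made for illustration.

```swift
// Hypothetical sketch only; the real ExpertStreamingConfig API may differ.
enum StreamingModeSketch: Equatable {
    case disabled        // experts stay fully resident in RAM
    case mmapPageCache   // experts are mmap'd and paged in by the OS
    case directNVMe      // experts are read from SSD into explicit slots
}

struct StreamingConfigSketch {
    var mode: StreamingModeSketch
    /// Total machine RAM in bytes, supplied by the caller.
    var physicalMemoryBytes: UInt64
    /// Assumed share of RAM the slot cache may occupy (illustrative default).
    var residentFraction: Double = 0.5

    /// Number of expert slots that fit in the budget; 0 when streaming is off.
    func slotBudget(bytesPerExpert: UInt64) -> Int {
        guard mode != .disabled, bytesPerExpert > 0 else { return 0 }
        let budgetBytes = Double(physicalMemoryBytes) * residentFraction
        // Always keep at least one slot so dispatch can make progress.
        return max(1, Int(budgetBytes / Double(bytesPerExpert)))
    }
}
```

On Apple platforms the RAM size could be taken from Foundation's `ProcessInfo.processInfo.physicalMemory`, which is one way a config object can replace an environment-variable gate on iOS.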

Generation-loop hooks

  • Evaluate.swift, Gemma4Text.swift: SSDFetchPool.waitAll() fetch
    ordering + per-token SSD telemetry tick.
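
The ordering constraint behind `waitAll()` can be illustrated with a small sketch: the generation loop must not consume expert weights until every outstanding fetch has finished, and a failure on a background fetch has to surface on the calling thread. The types below are hypothetical stand-ins built on `DispatchGroup` and `NSLock`; they are not the PR's SSDFetchPool or SSDStreamingErrorLatch.

```swift
import Foundation

// Hypothetical stand-in for an error latch: keeps the first error recorded
// by any background fetch until the caller collects it.
final class ErrorLatchSketch: @unchecked Sendable {
    private let lock = NSLock()
    private var first: Error?

    func record(_ error: Error) {
        lock.lock(); defer { lock.unlock() }
        if first == nil { first = error }
    }

    /// Returns and clears the stored error, if any.
    func take() -> Error? {
        lock.lock(); defer { lock.unlock() }
        let error = first
        first = nil
        return error
    }
}

// Hypothetical stand-in for a fetch pool: runs fetches concurrently and
// lets the caller block until all of them are done.
final class FetchPoolSketch: @unchecked Sendable {
    private let queue = DispatchQueue(label: "sketch.fetch", attributes: .concurrent)
    private let group = DispatchGroup()
    private let latch = ErrorLatchSketch()

    /// Schedules one fetch; a thrown error is latched instead of lost.
    func submit(_ work: @escaping @Sendable () throws -> Void) {
        queue.async(group: group) { [latch] in
            do { try work() } catch { latch.record(error) }
        }
    }

    /// Blocks until every submitted fetch has finished, then rethrows the
    /// first latched error so the generation loop can stop cleanly.
    func waitAll() throws {
        group.wait()
        if let error = latch.take() { throw error }
    }
}
```

In a generation loop, the call to `waitAll()` would sit between issuing the miss fetches for a token and evaluating the layers that read those experts.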

Tests

  • DeepseekV4Tests.swift — model-correctness suite (HC Sinkhorn, MLA RoPE,
    hash routing).

What it needs to build on this fork

The engine was developed against a different (SharpAI-lineage) base, so on this
ml-explore-lineage fork it references pieces that differ. To compile here:

  • MTPLanguageModel protocol, MambaCache.checkpoint, and
    TokenIteratorProtocol public-access requirements — pulled in by
    Evaluate.swift's MTP iterator; reconcile with this fork's MTP/cache types.
  • mlx-swift core: the F_NOCACHE streaming-read tweak (fast.cpp,
    MLX_SSD_NOCACHE) lives in the core repo — needs a companion PR to
    Layr-Labs/mlx-swift.
  • DeepseekV4.swift / SwitchLayers.swift replace this fork's existing
    (diverged) implementations — a reconciliation decision (the fork has its own
    sparse-attention DeepSeek-V4 + continuous-batching MoE path).

Motivation (measured)

Cost model: TPS ≈ SSD_BW / bytes_streamed_per_token. On an M5 Max (128 GB,
internal NVMe ~10.8 GB/s random) this engine streams DeepSeek-V4-Flash-4bit at a
measured warm 10.6 tok/s decode at ~0.23 GB/token device traffic (80.6 %
expert-cache hit) — coherent output, experts served entirely from SSD.
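
As a sanity check on the cost model, the quoted figures can be plugged in directly. This is only arithmetic on the numbers above, with hypothetical function and variable names. Taking 10.8 GB/s and 0.23 GB/token at face value gives a bandwidth-only ceiling of roughly 47 tok/s, so the measured 10.6 tok/s is about 23% of that ceiling. Under this simple model, decode would therefore not be limited by SSD bandwidth alone.

```swift
// Bandwidth-only ceiling from the cost model TPS ≈ SSD_BW / bytes_per_token.
// Function and variable names are illustrative; inputs are the PR's figures.
func bandwidthBoundTPS(ssdBandwidthGBps: Double, gbStreamedPerToken: Double) -> Double {
    precondition(gbStreamedPerToken > 0, "traffic per token must be positive")
    return ssdBandwidthGBps / gbStreamedPerToken
}

let ceilingTPS = bandwidthBoundTPS(ssdBandwidthGBps: 10.8, gbStreamedPerToken: 0.23)
let measuredTPS = 10.6

// Share of the bandwidth-only ceiling that the measured decode rate reaches.
let ceilingUtilization = measuredTPS / ceilingTPS

// Device read rate implied by the measured decode speed, in GB/s.
let impliedReadGBps = measuredTPS * 0.23

print(ceilingTPS, ceilingUtilization, impliedReadGBps)
```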

Provenance

Authored in the SSDStream research program (gaj). Developed on a SwiftLM-lineage
base whose git history is unrelated to this fork, so it's contributed as file
contents. Conflicting files take the streaming implementation's version.

kokorogg and others added 2 commits June 17, 2026 13:55
Phase 1 of porting the SSDStream research engine (DeepSeek-V4-Flash streamed
from NVMe on Apple Silicon) onto this fork: the model-agnostic configuration
and KV-cache scaffolding for streaming MoE expert weights from SSD instead of
holding the full routed-expert pool resident in RAM.

New (purely additive — not referenced by any existing code path yet):
- ExpertStreamingConfig: public API for expert-streaming mode
  (disabled / mmap page-cache / direct NVMe) plus machine-size-aware
  slot-cache budgeting. Replaces the EXPERIMENTAL_SSD_STREAM env gate so
  streaming can be enabled on iOS, which cannot set environment variables.
- SharedKVCache: KVCache wrapper for models where K == V (shared cache).
- DSAKVCache: sliding-window + compressed-slot KV cache for DSA-style
  attention.

Verified with `swift build` (compiles clean against MLXLMCommon). No behavior
change: these types are not yet wired into any model. The streaming MoE
dispatch path (SwitchLayers), SSD telemetry, and DeepSeek-V4 model wiring land
in follow-ups — their current implementations on this fork differ
substantially from the research lineage and need dedicated integration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The full SSD weight-streaming implementation from the SSDStream research
program: stream DeepSeek-V4-Flash's routed experts from NVMe on demand instead
of holding the ~151 GB expert pool resident in RAM.

Our complete versions of every streaming-touched file:
- SwitchLayers.swift: streaming MoE dispatch — LRU expert slot cache, async
  miss-fetch pool (SSDFetchPool), stacked/fused kernels, live SSD telemetry
  (SSDStreamMetrics). The streaming engine.
- DeepseekV4.swift: streamed DeepSeek-V4 — DSA (window + compressed-KV)
  attention, HC Sinkhorn, hash routing, MLA RoPE.
- ConcurrentError.swift: SSDStreamingError + SSDStreamingErrorLatch +
  ThreadSafeError (async-fetch error propagation).
- Evaluate.swift / Gemma4Text.swift: generation-loop streaming hooks
  (SSDFetchPool.waitAll ordering, per-token SSD telemetry tick).
- ExpertStreamingConfig / SharedKVCache / DSAKVCache: config + cache types.
- DeepseekV4Tests.swift: model-correctness suite.

DRAFT — does not build standalone on this fork yet. It expects engine pieces
that differ here: MTPLanguageModel protocol, MambaCache.checkpoint, and
TokenIteratorProtocol public access (Evaluate.swift's MTP path), plus the core
mlx-swift F_NOCACHE streaming-read tweak (separate repo). See PR body.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@kokorogg kokorogg changed the title from "feat: SSD expert-streaming configuration foundation (phase 1)" to "feat: SSD expert streaming for DeepSeek-V4 MoE (complete engine, WIP)" Jun 17, 2026
@kokorogg kokorogg marked this pull request as draft June 17, 2026 21:06