feat: SSD expert streaming for DeepSeek-V4 MoE (complete engine, WIP) #45
Draft
kokorogg wants to merge 2 commits into
Phase 1 of porting the SSDStream research engine (DeepSeek-V4-Flash streamed from NVMe on Apple Silicon) onto this fork: the model-agnostic configuration and KV-cache scaffolding for streaming MoE expert weights from SSD instead of holding the full routed-expert pool resident in RAM.

New (purely additive; not referenced by any existing code path yet):
- ExpertStreamingConfig: public API for expert-streaming mode (disabled / mmap page-cache / direct NVMe) plus machine-size-aware slot-cache budgeting. Replaces the EXPERIMENTAL_SSD_STREAM env gate so streaming can be enabled on iOS, which cannot set environment variables.
- SharedKVCache: KVCache wrapper for models where K == V (shared cache).
- DSAKVCache: sliding-window + compressed-slot KV cache for DSA-style attention.

Verified with `swift build` (compiles clean against MLXLMCommon). No behavior change: these types are not yet wired into any model. The streaming MoE dispatch path (SwitchLayers), SSD telemetry, and DeepSeek-V4 model wiring land in follow-ups; their current implementations on this fork differ substantially from the research lineage and need dedicated integration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The full SSD weight-streaming implementation from the SSDStream research program: stream DeepSeek-V4-Flash's routed experts from NVMe on demand instead of holding the ~151 GB expert pool resident in RAM.

Our complete versions of every streaming-touched file:
- SwitchLayers.swift: the streaming engine. Streaming MoE dispatch with an LRU expert slot cache, an async miss-fetch pool (SSDFetchPool), stacked/fused kernels, and live SSD telemetry (SSDStreamMetrics).
- DeepseekV4.swift: streamed DeepSeek-V4 with DSA (window + compressed-KV) attention, HC Sinkhorn, hash routing, and MLA RoPE.
- ConcurrentError.swift: SSDStreamingError + SSDStreamingErrorLatch + ThreadSafeError (async-fetch error propagation).
- Evaluate.swift / Gemma4Text.swift: generation-loop streaming hooks (SSDFetchPool.waitAll ordering, per-token SSD telemetry tick).
- ExpertStreamingConfig / SharedKVCache / DSAKVCache: config + cache types.
- DeepseekV4Tests.swift: model-correctness suite.

DRAFT: does not build standalone on this fork yet. It expects engine pieces that differ here: MTPLanguageModel protocol, MambaCache.checkpoint, and TokenIteratorProtocol public access (Evaluate.swift's MTP path), plus the core mlx-swift F_NOCACHE streaming-read tweak (separate repo). See PR body.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
The complete SSD expert-streaming engine — stream DeepSeek-V4-Flash's routed
experts from NVMe/SSD on demand instead of holding the ~151 GB expert pool
resident in RAM. This is the implementation that runs DeepSeek-V4-Flash-4bit
from SSD (smoke-tested: coherent generation, experts served entirely from
NVMe), authored in our SSDStream research program.
What's in the PR (9 files)
Streaming engine
- SwitchLayers.swift: streaming MoE dispatch with an LRU expert slot cache, an async miss-fetch pool (SSDFetchPool), stacked/fused kernels, and live SSD telemetry (SSDStreamMetrics). Replaces the fork's 254-line MoE path.
- DeepseekV4.swift: streamed DeepSeek-V4 with DSA (window + compressed-KV) attention, HC Sinkhorn, hash routing, and MLA RoPE. Replaces the fork's DeepSeek-V4.
- ConcurrentError.swift (new): SSDStreamingError + SSDStreamingErrorLatch + ThreadSafeError (async-fetch error propagation).

Config + caches (new, additive; these compile)
- ExpertStreamingConfig.swift: streaming mode (disabled / mmap page-cache / direct NVMe) + machine-size-aware slot budgeting.
- SharedKVCache.swift: K==V shared-cache wrapper.
- DSAKVCache.swift: sliding-window + compressed-slot KV cache.

Generation-loop hooks
- Evaluate.swift, Gemma4Text.swift: SSDFetchPool.waitAll() fetch ordering + per-token SSD telemetry tick.

Tests
- DeepseekV4Tests.swift: model-correctness suite (HC Sinkhorn, MLA RoPE, hash routing).
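
For reviewers unfamiliar with the slot-cache idea, here is a minimal sketch of an LRU expert slot cache. It is not the code in SwitchLayers.swift: the type name `ExpertSlotCache`, its methods, and the synchronous `load` closure are all hypothetical, and the real engine fetches misses asynchronously through SSDFetchPool. It only illustrates the eviction policy and how a hit rate could be counted.

```swift
/// Hypothetical fixed-capacity LRU cache keyed by expert index.
/// Illustration only; not the SwitchLayers.swift implementation.
struct ExpertSlotCache<Weights> {
    private let capacity: Int
    private var slots: [Int: Weights] = [:]
    private var recency: [Int] = []   // least recently used first
    private(set) var hits = 0
    private(set) var misses = 0

    init(capacity: Int) {
        precondition(capacity > 0, "capacity must be positive")
        self.capacity = capacity
    }

    /// Returns the weights for `expert`, calling `load` only on a miss.
    /// In a streaming engine `load` would stand in for the SSD read.
    mutating func weights(for expert: Int, load: (Int) -> Weights) -> Weights {
        if let cached = slots[expert] {
            hits += 1
            touch(expert)
            return cached
        }
        misses += 1
        // Evict the least recently used expert when every slot is taken.
        if slots.count >= capacity, let victim = recency.first {
            recency.removeFirst()
            slots[victim] = nil
        }
        let loaded = load(expert)
        slots[expert] = loaded
        recency.append(expert)
        return loaded
    }

    /// Fraction of lookups served from resident slots.
    var hitRate: Double {
        let total = hits + misses
        return total == 0 ? 0 : Double(hits) / Double(total)
    }

    /// Moves `expert` to the most-recently-used end of the list.
    private mutating func touch(_ expert: Int) {
        if let i = recency.firstIndex(of: expert) { recency.remove(at: i) }
        recency.append(expert)
    }
}
```

The recency list makes each hit O(n) in the number of slots, which keeps the sketch short; a production cache would more likely use a linked list or generation counters.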
What it needs to build on this fork
The engine was developed against a different (SharpAI-lineage) base, so on this
ml-explore-lineage fork it references pieces that differ. To compile here:
- MTPLanguageModel protocol, MambaCache.checkpoint, and TokenIteratorProtocol public-access requirements: pulled in by Evaluate.swift's MTP iterator; reconcile with this fork's MTP/cache types.
- mlx-swift core: the F_NOCACHE streaming-read tweak (fast.cpp, MLX_SSD_NOCACHE) lives in the core repo and needs a companion PR to Layr-Labs/mlx-swift.
- DeepseekV4.swift / SwitchLayers.swift replace this fork's existing (diverged) implementations, which is a reconciliation decision (the fork has its own sparse-attention DeepSeek-V4 + continuous-batching MoE path).
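
As background on the F_NOCACHE item, the sketch below shows what that flag does at the file-descriptor level. It is not the fast.cpp change: the helper names `openUncached` and `readBytes` are hypothetical, and how MLX_SSD_NOCACHE gates the behaviour in the core repo is not shown here. `F_NOCACHE` is a Darwin `fcntl` command, so the call is compiled only on Apple platforms.

```swift
import Foundation

/// Hypothetical helper: opens `path` read-only and, on Apple platforms,
/// asks the kernel not to keep data read through this descriptor in the
/// buffer cache. Returns the descriptor, or nil on failure.
func openUncached(_ path: String) -> Int32? {
    let fd = open(path, O_RDONLY)
    guard fd >= 0 else { return nil }
    #if canImport(Darwin)
    // A non-zero argument turns data caching off for this descriptor.
    if fcntl(fd, F_NOCACHE, 1) == -1 {
        close(fd)
        return nil
    }
    #endif
    return fd
}

/// Hypothetical helper: reads up to `count` bytes at `offset` with pread,
/// which leaves the descriptor's file position untouched.
func readBytes(fd: Int32, offset: Int, count: Int) -> [UInt8] {
    var buffer = [UInt8](repeating: 0, count: count)
    let n = buffer.withUnsafeMutableBytes { raw in
        pread(fd, raw.baseAddress, count, off_t(offset))
    }
    return n > 0 ? Array(buffer[0..<n]) : []
}
```

Bypassing the cache matters for streaming because expert pages read once per token would otherwise compete for RAM with the resident slot cache.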
Motivation (measured)
Cost model: TPS ≈ SSD_BW / bytes_streamed_per_token. On an M5 Max (128 GB, internal NVMe ~10.8 GB/s random) this engine streams DeepSeek-V4-Flash-4bit at a measured warm 10.6 tok/s decode at ~0.23 GB/token device traffic (80.6% expert-cache hit): coherent output, experts served entirely from SSD.
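
To make the cost model concrete, the sketch below plugs the figures quoted above into it. The function names are hypothetical and this is plain arithmetic on the reported numbers, not a measurement: 10.8 GB/s at 0.23 GB/token gives a bandwidth-only ceiling of roughly 47 tok/s, and the measured 10.6 tok/s corresponds to roughly 2.4 GB/s of sustained device traffic. The formula is therefore best read as an upper bound; the measured decode rate sits below it.

```swift
/// Bandwidth-only decode ceiling: TPS ≈ SSD_BW / bytes_streamed_per_token.
/// Both arguments are in GB, so the result is tokens per second.
func bandwidthCeilingTPS(ssdGBPerSec: Double, gbPerToken: Double) -> Double {
    precondition(gbPerToken > 0, "bytes streamed per token must be positive")
    return ssdGBPerSec / gbPerToken
}

/// Device traffic implied by a decode rate and a per-token streaming cost.
func sustainedTrafficGBPerSec(tokensPerSec: Double, gbPerToken: Double) -> Double {
    return tokensPerSec * gbPerToken
}

// Figures quoted in this PR description.
let ceiling = bandwidthCeilingTPS(ssdGBPerSec: 10.8, gbPerToken: 0.23)
let traffic = sustainedTrafficGBPerSec(tokensPerSec: 10.6, gbPerToken: 0.23)
print("bandwidth ceiling (tok/s):", ceiling)
print("sustained traffic (GB/s):", traffic)
```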
Provenance
Authored in the SSDStream research program (gaj). Developed on a SwiftLM-lineage base whose git history is unrelated to this fork, so it's contributed as file contents. Conflicting files take the streaming implementation's version.