Recover from SSD streaming errors without crashing by solderzzc · Pull Request #38 · SharpAI/mlx-swift-lm

solderzzc · 2026-04-28T06:27:09Z

Summary

Replace the fatal crash in SSD streaming error handling with a latched, recoverable error.
Surface streaming I/O failures back to the generation loop so callers can recover cleanly.

Testing

Exercised through the parent SwiftBuddy model-loading recovery work.
Validated by loading the branch in the integrated SwiftBuddy harness and building in Xcode.

Convert ThreadSafeError.check() from fatalError (which crashes the entire app) to a global SSDStreamingErrorLatch pattern. When a pread I/O error occurs on truncated/corrupted safetensors files, the error is now posted to the latch instead of killing the process.

The generation loop in Evaluate.swift checks the latch:

- After model.prepare() during prefill (catches errors during prompt processing)
- After each token in the generation loop (catches errors during decoding)

This allows the consuming code (InferenceEngine) to surface the error gracefully in the UI and prompt the user to re-download the model.

Also adds SSDStreamingError and SSDStreamingErrorLatch as public types for downstream consumers.
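To illustrate the latch pattern described above, here is a minimal sketch. The type names `DemoStreamingError` and `DemoErrorLatch` are hypothetical stand-ins for `SSDStreamingError` and `SSDStreamingErrorLatch`. The method names `set`, `consume`, and `throwIfSet` come from the diff, but their bodies are assumptions (including the first-error-wins and clear-on-read behavior), not the PR's actual implementation.

```swift
import Foundation

/// Hypothetical stand-in for the PR's `SSDStreamingError`.
struct DemoStreamingError: Error {
    let underlyingError: Error
}

/// Hypothetical stand-in for `SSDStreamingErrorLatch`.
/// The method bodies below are assumed, not taken from the PR.
final class DemoErrorLatch: @unchecked Sendable {
    static let shared = DemoErrorLatch()

    private let lock = NSLock()
    private var _error: Error?

    /// Record an error from a non-throwing path.
    /// Assumption: the first error wins until the latch is cleared.
    func set(_ error: Error) {
        lock.lock()
        defer { lock.unlock() }
        if _error == nil { _error = error }
    }

    /// Return the latched error, if any, and clear the latch.
    func consume() -> Error? {
        lock.lock()
        defer { lock.unlock() }
        let error = _error
        _error = nil
        return error
    }

    /// Throw the latched error, if any, clearing the latch first.
    func throwIfSet() throws {
        if let error = consume() { throw error }
    }
}
```

Clearing the latch on read is one possible design choice: it keeps a stale error from a previous generation from aborting the next one, at the cost of requiring exactly one consumer to handle each failure.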

Copilot

Pull request overview

This PR changes MLXLMCommon’s SSD expert streaming error handling from a process-terminating crash to a recoverable, latched error that can be detected during prompt prefill and token generation.

Changes:

Introduces SSDStreamingError and a global SSDStreamingErrorLatch to record/consume streaming I/O failures from non-throwing paths.
Adds latch checks during TokenIterator.prepare(...) (prefill/logits) to throw early if streaming errors occurred.
Adds a per-token latch check in the async generation loop to stop generation when a streaming error is detected.
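The per-token check listed above can be sketched roughly as follows. Everything here is hypothetical scaffolding: `TinyLatch`, `FakeReadError`, `StopReason`, `fakeStep`, and `generate` are illustration-only names, and whether the real loop drops the token produced by a failed step is an assumption of this sketch.

```swift
import Foundation

/// Hypothetical minimal latch standing in for `SSDStreamingErrorLatch`.
final class TinyLatch: @unchecked Sendable {
    private let lock = NSLock()
    private var stored: Error?

    func set(_ error: Error) {
        lock.lock()
        defer { lock.unlock() }
        if stored == nil { stored = error }
    }

    func consume() -> Error? {
        lock.lock()
        defer { lock.unlock() }
        let error = stored
        stored = nil
        return error
    }
}

struct FakeReadError: Error {}

enum StopReason { case finished, cancelled }

/// Hypothetical non-throwing step: it cannot `throw`, so a simulated
/// read failure is posted to the latch instead.
func fakeStep(_ index: Int, failAt: Int?, latch: TinyLatch) -> Int {
    if index == failAt { latch.set(FakeReadError()) }
    return index
}

/// Generate up to `maxTokens` fake tokens, checking the latch after each step.
func generate(maxTokens: Int, failAt: Int?, latch: TinyLatch)
    -> (tokens: [Int], stopReason: StopReason)
{
    var tokens: [Int] = []
    var stopReason = StopReason.finished
    for index in 0..<maxTokens {
        let token = fakeStep(index, failAt: failAt, latch: latch)
        // Stop as soon as the step reported a streaming failure.
        // Assumption: the token from the failed step is discarded.
        if latch.consume() != nil {
            stopReason = .cancelled
            break
        }
        tokens.append(token)
    }
    return (tokens, stopReason)
}
```

Calling `generate(maxTokens: 4, failAt: 2, latch: TinyLatch())` stops before the third token is appended and reports `.cancelled`, while passing `failAt: nil` runs to completion with `.finished`.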

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| Libraries/MLXLMCommon/Evaluate.swift | Checks the SSD streaming error latch during prefill and during async generation iteration. |
| Libraries/MLXLMCommon/ConcurrentError.swift | Replaces `fatalError` with a recoverable error latch and introduces a typed `SSDStreamingError`. |


+    /// Check if any error was recorded during concurrent I/O.
+    ///
+    /// Instead of calling `fatalError` (which crashes the entire app), this
+    /// posts the error to the global `SSDStreamingErrorLatch` so the generation
+    /// loop can detect it after the current token and surface it gracefully
+    /// in the UI (e.g., prompting a re-download).
    package func check() {
        if let error = error {
-            fatalError("MLX SSD Streaming Error: \(error.localizedDescription). (The model safetensors file may be corrupted, truncated, or incomplete).")
+            SSDStreamingErrorLatch.shared.set(
+                SSDStreamingError(underlyingError: error)
+            )
        }


+            // Check for SSD streaming errors that occurred during prefill.
+            // The MoE expert pread path uses a non-throwing callAsFunction,
+            // so errors are posted to the global latch instead.
+            try SSDStreamingErrorLatch.shared.throwIfSet()
+
            // evaluate the remainder of the prompt -- this primes the pump
            let token = step(previous: y)
+
+            // Check again after step() which also runs through MoE layers
+            try SSDStreamingErrorLatch.shared.throwIfSet()


+                if let ssdError = SSDStreamingErrorLatch.shared.consume() {
+                    print("[MLXLMCommon] SSD streaming error detected: \(ssdError.localizedDescription)")
+                    stopReason = .cancelled
+                    break


+public final class SSDStreamingErrorLatch: @unchecked Sendable {
+    public static let shared = SSDStreamingErrorLatch()
+    private let lock = NSLock()
+    private var _error: Error?
+


Copilot AI review requested due to automatic review settings April 28, 2026 06:27

solderzzc mentioned this pull request Apr 28, 2026

Harden SwiftBuddy model loading and align local server settings SharpAI/SwiftLM#95

Closed

Copilot started reviewing on behalf of solderzzc April 28, 2026 06:27

Copilot AI reviewed Apr 28, 2026


solderzzc mentioned this pull request Apr 28, 2026

Harden SwiftBuddy model loading and align local server settings SharpAI/SwiftLM#96

Merged

Aegis-AI added 2 commits April 28, 2026 08:16

Fix SSD streaming review issues

88faf35

Restore downstream compatibility

38d7ff2

solderzzc merged commit 2c2cd9e into main Apr 28, 2026
6 checks passed

solderzzc deleted the fix/ssd-streaming-crash-recovery branch April 28, 2026 19:36

solderzzc merged 3 commits into main from fix/ssd-streaming-crash-recovery
