Recover from SSD streaming errors without crashing #38

Merged
solderzzc merged 3 commits into main from fix/ssd-streaming-crash-recovery on Apr 28, 2026

Conversation

@solderzzc

Member

Summary

  • replace fatal crash behavior in SSD streaming error handling with a latched recoverable error
  • surface streaming I/O failures back to the generation loop so callers can recover cleanly

Testing

  • exercised through the parent SwiftBuddy model-loading recovery work
  • validated by loading the branch in the integrated SwiftBuddy harness and Xcode build

Convert ThreadSafeError.check() from fatalError (which crashes the entire
app) to a global SSDStreamingErrorLatch pattern. When a pread I/O error
occurs on truncated/corrupted safetensors files, the error is now posted
to the latch instead of killing the process.

The generation loop in Evaluate.swift checks the latch:
- After model.prepare() during prefill (catches errors during prompt processing)
- After each token in the generation loop (catches errors during decoding)

This allows the consuming code (InferenceEngine) to surface the error
gracefully in the UI and prompt the user to re-download the model.

Also adds SSDStreamingError and SSDStreamingErrorLatch as public types
for downstream consumers.
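
The latch pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `DemoStreamingError`, `DemoErrorLatch`, `DemoIOError`, and `readExpert` are hypothetical names, and only the method names `set`, `consume`, and `throwIfSet` and the `underlyingError:` label mirror what is visible in the review snippets. The first-error-wins policy and the clearing behavior of `throwIfSet` are assumptions.

```swift
import Foundation

// Hypothetical low-level failure, standing in for a failed pread.
enum DemoIOError: Error { case shortRead }

// Hypothetical stand-in for the PR's SSDStreamingError.
struct DemoStreamingError: Error {
    let underlyingError: Error
}

// Hypothetical stand-in for SSDStreamingErrorLatch: a lock-guarded slot
// that non-throwing code can post to and throwing code can drain.
final class DemoErrorLatch: @unchecked Sendable {
    static let shared = DemoErrorLatch()
    private let lock = NSLock()
    private var _error: Error?

    /// Record an error. Assumption: the first error wins until consumed.
    func set(_ error: Error) {
        lock.lock()
        defer { lock.unlock() }
        if _error == nil { _error = error }
    }

    /// Return the recorded error, if any, and clear the latch.
    func consume() -> Error? {
        lock.lock()
        defer { lock.unlock() }
        let recorded = _error
        _error = nil
        return recorded
    }

    /// Throw the recorded error, if any. Assumption: this also clears it.
    func throwIfSet() throws {
        if let recorded = consume() { throw recorded }
    }
}

// A non-throwing worker: it cannot propagate the failure directly,
// so it posts to the latch instead of calling fatalError.
func readExpert(simulateFailure: Bool) {
    if simulateFailure {
        DemoErrorLatch.shared.set(
            DemoStreamingError(underlyingError: DemoIOError.shortRead)
        )
    }
}
```

A caller on a throwing path would then run the worker and call `try DemoErrorLatch.shared.throwIfSet()` at the next point where an error can be surfaced.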

Copilot AI left a comment

Pull request overview

This PR changes MLXLMCommon’s SSD expert streaming error handling from a process-terminating crash to a recoverable, latched error that can be detected during prompt prefill and token generation.

Changes:

  • Introduces SSDStreamingError and a global SSDStreamingErrorLatch to record/consume streaming I/O failures from non-throwing paths.
  • Adds latch checks during TokenIterator.prepare(...) (prefill/logits) to throw early if streaming errors occurred.
  • Adds a per-token latch check in the async generation loop to stop generation when a streaming error is detected.
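
The two check points listed above (a throwing check around prefill, and a per-token check that stops generation) might look like the toy loop below. It is a sketch under stated assumptions: `DemoLatch`, `DemoStopReason`, `DemoStreamingError`, and `generate` are hypothetical, the "model" just emits token indices, and dropping the token computed during the failing step is a choice made for this example rather than something the PR confirms.

```swift
import Foundation

enum DemoStopReason { case finished, cancelled }  // hypothetical
struct DemoStreamingError: Error {}               // hypothetical

// Compact lock-guarded latch (hypothetical, first error wins).
final class DemoLatch: @unchecked Sendable {
    private let lock = NSLock()
    private var stored: Error?

    func set(_ error: Error) {
        lock.lock()
        defer { lock.unlock() }
        if stored == nil { stored = error }
    }

    func consume() -> Error? {
        lock.lock()
        defer { lock.unlock() }
        let recorded = stored
        stored = nil
        return recorded
    }

    func throwIfSet() throws {
        if let recorded = consume() { throw recorded }
    }
}

/// Toy generation loop. `failDuringPrefill` and `failAtToken` simulate a
/// non-throwing layer posting a streaming error to the latch.
func generate(
    maxTokens: Int,
    failDuringPrefill: Bool,
    failAtToken: Int?,
    latch: DemoLatch
) throws -> (tokens: [Int], stopReason: DemoStopReason) {
    // Prefill: the caller can handle a thrown error, so throw early.
    if failDuringPrefill { latch.set(DemoStreamingError()) }
    try latch.throwIfSet()

    var tokens: [Int] = []
    var stopReason = DemoStopReason.finished
    for index in 0..<maxTokens {
        // Simulated decode step that may hit a read error.
        if index == failAtToken { latch.set(DemoStreamingError()) }

        // Per-token check: stop the loop instead of crashing.
        if latch.consume() != nil {
            stopReason = .cancelled
            break
        }
        tokens.append(index)
    }
    return (tokens, stopReason)
}
```

With `failAtToken: 2` the loop in this sketch returns the first two tokens and a `.cancelled` stop reason, while `failDuringPrefill: true` throws before any token is produced.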

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| Libraries/MLXLMCommon/Evaluate.swift | Checks the SSD streaming error latch during prefill and during async generation iteration. |
| Libraries/MLXLMCommon/ConcurrentError.swift | Replaces fatalError with a recoverable error latch and introduces a typed SSDStreamingError. |

Comment on lines +65 to 76
     /// Check if any error was recorded during concurrent I/O.
     ///
+    /// Instead of calling `fatalError` (which crashes the entire app), this
+    /// posts the error to the global `SSDStreamingErrorLatch` so the generation
+    /// loop can detect it after the current token and surface it gracefully
+    /// in the UI (e.g., prompting a re-download).
     package func check() {
         if let error = error {
-            fatalError("MLX SSD Streaming Error: \(error.localizedDescription). (The model safetensors file may be corrupted, truncated, or incomplete).")
+            SSDStreamingErrorLatch.shared.set(
+                SSDStreamingError(underlyingError: error)
+            )
         }
Comment thread Libraries/MLXLMCommon/Evaluate.swift Outdated
Comment on lines +653 to +662
// Check for SSD streaming errors that occurred during prefill.
// The MoE expert pread path uses a non-throwing callAsFunction,
// so errors are posted to the global latch instead.
try SSDStreamingErrorLatch.shared.throwIfSet()

// evaluate the remainder of the prompt -- this primes the pump
let token = step(previous: y)

// Check again after step() which also runs through MoE layers
try SSDStreamingErrorLatch.shared.throwIfSet()
Comment thread Libraries/MLXLMCommon/Evaluate.swift Outdated
Comment on lines +1723 to +1726
if let ssdError = SSDStreamingErrorLatch.shared.consume() {
    print("[MLXLMCommon] SSD streaming error detected: \(ssdError.localizedDescription)")
    stopReason = .cancelled
    break
Comment on lines +17 to +21
public final class SSDStreamingErrorLatch: @unchecked Sendable {
    public static let shared = SSDStreamingErrorLatch()
    private let lock = NSLock()
    private var _error: Error?

@solderzzc solderzzc merged commit 2c2cd9e into main Apr 28, 2026
6 checks passed
@solderzzc solderzzc deleted the fix/ssd-streaming-crash-recovery branch April 28, 2026 19:36

2 participants