Merge pull request #76 from SharpAI/fix/issue-72-draft-model-ssd-ram

solderzzc · web-flow · commit 336c8a853d89 · 2026-04-23T00:10:52.000-07:00
fix(ssd-stream): prevent RAM explosion when --draft-model + --stream-experts combined (#72)
diff --git a/.agents/workflows/review-github-pr.md b/.agents/workflows/review-github-pr.md
@@ -0,0 +1,216 @@
+---
+description: Review a GitHub Issue or PR for SharpAI/SwiftLM — fetch, analyze, implement fixes, address review comments, and push back to the correct branch
+---
+
+# Review GitHub Issue / PR
+
+This workflow guides end-to-end handling of a GitHub Issue or Pull Request for the
+`SharpAI/SwiftLM` repository: from fetching context, through implementing or
+reviewing code changes, to pushing a clean commit back to the correct fork branch.
+
+---
+
+## Prerequisites
+
+- `gh` CLI path on macOS: **`/opt/homebrew/bin/gh`**
+  ```bash
+  export PATH="/opt/homebrew/bin:$PATH"
+  which gh  # → /opt/homebrew/bin/gh
+  ```
+- `gh` must be authenticated (`gh auth status`)
+- Working directory: `/Users/simba/workspace/mlx-server`
+- Remote `fork` may need to be added if pushing to a contributor's fork:
+  ```bash
+  git remote add fork https://github.com/<contributor>/SwiftLM.git
+  ```
+
+---
+
+## Steps
+
+### 1. Fetch the Issue or PR
+
+Determine whether the user supplied an **Issue number** or a **PR number**, then
+pull the full context using `gh`:
+
+```bash
+# For a PR
+gh pr view <NUMBER> --repo SharpAI/SwiftLM \
+  --json number,title,body,state,baseRefName,headRefName,headRepository,commits,files
+
+# For an Issue
+gh issue view <NUMBER> --repo SharpAI/SwiftLM \
+  --json number,title,body,state,labels,comments
+```
+
+Note the **`headRepository`** field — if it is not `SharpAI/SwiftLM`, the PR comes
+from a fork. You must push back to the fork's branch (see Step 6).
+
+---
+
+### 2. Understand the Scope
+
+Read the PR/Issue body and associated comments carefully. Identify:
+
+- **Category** — bug fix, feature, test improvement, CI/CD, documentation.
+- **Files touched** — run `gh pr diff <NUMBER> --repo SharpAI/SwiftLM` or read
+  the `files` field.
+- **CI status** — check the latest run:
+  ```bash
+  gh run list --repo SharpAI/SwiftLM --branch <headRefName> --limit 3
+  ```
+- **Review comments** — if Copilot or a human left inline review comments, read
+  them all before writing a single line of code:
+  ```bash
+  gh pr view <NUMBER> --repo SharpAI/SwiftLM --comments
+  ```
+
+---
+
+### 3. Check Out the Branch Locally
+
+```bash
+# If the PR is from SharpAI directly
+git fetch origin
+git checkout <headRefName>
+
+# If the PR is from a fork
+git remote add fork https://github.com/<forkOwner>/SwiftLM.git   # once only
+git fetch fork <headRefName>
+git checkout -b <headRefName> fork/<headRefName>
+```
+
+Verify you are on the correct branch:
+```bash
+git status
+git log --oneline -5
+```
+
+---
+
+### 4. Triage Review Comments (for PRs)
+
+For each Copilot or human review comment:
+
+1. **Classify** the severity:
+   - 🔴 **Must fix** — correctness bugs, resource leaks, race conditions, broken CI.
+   - 🟡 **Should fix** — test coverage gaps, false-pass logic, missing imports.
+   - 🟢 **Optional** — style, wording, architecture refactors beyond the PR scope.
+
+2. **Implement** all 🔴 and 🟡 items. For 🟢 items, document them as follow-up
+   work in a code comment or GitHub comment but do not expand the PR scope.
+
+3. **Key patterns learned from SwiftLM history**:
+   - Shell scripts use `set -euo pipefail` — every `grep`, `jq`, or pipeline that
+     may produce no output **must** be guarded with `|| true` or placed inside an
+     `if` condition to prevent silent script abort.
+   - Heartbeat / background `Task` objects in Swift **must** be cancelled via
+     `defer { task?.cancel() }` so all exit paths (including client disconnect)
+     are covered — not just the happy path.
+   - CORS-related shell tests must target the dedicated `--cors` server instance,
+     not the main server started without the flag.
+   - Concurrent-request tests must use `--parallel N` (N ≥ 2) to actually exercise
+     parallel code paths.
+   - When adding new Swift test files that use `Data` / `JSONSerialization`,
+     always add `import Foundation` — XCTest does not re-export it in all SPM environments.
+
+---
+
+### 5. Verify Locally
+
+Build and run the relevant test suite before pushing:
+
+```bash
+# Swift unit tests
+swift test --filter SwiftLMTests
+
+# Integration tests (server)
+./tests/test-server.sh .build/release/SwiftLM 15413
+
+# OpenCode / SDK compatibility test
+./tests/test-opencode.sh .build/release/SwiftLM 15414
+```
+
+If CI previously failed with a specific test number, reproduce it locally first:
+```bash
+gh run view <RUN_ID> --repo SharpAI/SwiftLM --log-failed 2>&1 | grep -E "FAIL|error|Test [0-9]+"
+```
+
+---
+
+### 6. Commit and Push to the Correct Remote
+
+> [!IMPORTANT]
+> Always push to the **fork's branch** when updating a fork-originated PR.
+> Pushing to `origin` (SharpAI) creates a new branch and does NOT update the PR.
+
+```bash
+git add <files>
+git commit -m "<type>(<scope>): <summary>
+
+<body: what changed and why>"
+
+# PR from a fork → push to fork
+git push fork <headRefName>:<headRefName>
+
+# PR from SharpAI directly → push to origin
+git push origin <headRefName>
+```
+
+Verify the PR was updated:
+```bash
+gh pr view <NUMBER> --repo SharpAI/SwiftLM --json commits --jq '.commits[].messageHeadline'
+```
+
+---
+
+### 7. Monitor CI
+
+After pushing, monitor the triggered workflow:
+
+```bash
+# List recent runs on the branch
+gh run list --repo SharpAI/SwiftLM --branch <headRefName> --limit 5
+
+# Stream logs for the latest run
+gh run view <RUN_ID> --repo SharpAI/SwiftLM --log
+
+# Pull only failed steps
+gh run view <RUN_ID> --repo SharpAI/SwiftLM --log-failed 2>&1 | grep -E "FAIL|error|exit code"
+```
+
+If tests fail, go back to Step 4. Iterate until CI is green.
+
+---
+
+### 8. Respond to Reviewers (Optional)
+
+If a human or Copilot reviewer left inline comments that you have addressed,
+leave a reply comment summarising what was changed and why each item was handled
+(or deferred):
+
+```bash
+gh pr comment <NUMBER> --repo SharpAI/SwiftLM \
+  --body "Addressed all 🔴/🟡 review comments in commit <SHA>:
+- heartbeat leak: added defer cleanup in both streaming handlers
+- import Foundation: added to ServerSSETests.swift
+- CORS test: redirected to CORS_PORT server
+- parallel test: dedicated --parallel 2 server on PORT+3
+- set -e trap: guarded grep/jq pipelines with || true"
+```
+
+---
+
+## Quick Reference
+
+| Task | Command |
+|------|---------|
+| View PR | `gh pr view <N> --repo SharpAI/SwiftLM` |
+| View PR diff | `gh pr diff <N> --repo SharpAI/SwiftLM` |
+| View PR comments | `gh pr view <N> --repo SharpAI/SwiftLM --comments` |
+| View Issue | `gh issue view <N> --repo SharpAI/SwiftLM` |
+| List CI runs | `gh run list --repo SharpAI/SwiftLM --branch <branch>` |
+| Failed CI logs | `gh run view <ID> --repo SharpAI/SwiftLM --log-failed` |
+| Push to fork | `git push fork <branch>:<branch>` |
+| Push to SharpAI | `git push origin <branch>` |
+| Verify PR commits | `gh pr view <N> --repo SharpAI/SwiftLM --json commits --jq '.commits[].messageHeadline'` |
diff --git a/Sources/SwiftLM/Server.swift b/Sources/SwiftLM/Server.swift
@@ -301,6 +301,22 @@ struct MLXServer: AsyncParsableCommand {
         // Resolve model directory for profiling (checks HuggingFace cache)
         let modelDirectory = resolveModelDirectory(modelId: modelId)
         
+        // ── Fix #72: Compute draft model footprint ONCE (Copilot review) ──────
+        // Resolved before the streamExperts block so the exact byte count can be
+        // reused for the early cap, both strategy branches, and logging without
+        // repeating the filesystem walk.  Use weightFileSizeBytes (exact bytes)
+        // instead of weightMemoryGB * 1_073_741_824 to avoid the ~7% GiB/GB
+        // mismatch flagged in Copilot review (weightMemoryGB = bytes / 1e9, not /2^30).
+        let draftFootprintBytes: Int
+        if self.streamExperts,
+           let draftPath = self.draftModel,
+           let draftDir = resolveModelDirectory(modelId: draftPath),
+           let draftProfile = ModelProfiler.profile(modelDirectory: draftDir, modelId: draftPath) {
+            draftFootprintBytes = draftProfile.weightFileSizeBytes
+        } else {
+            draftFootprintBytes = 0
+        }
+
         if self.streamExperts, let modelDir = modelDirectory {
             setenv("EXPERIMENTAL_SSD_STREAM", modelDir.path, 1)
             // Activate the modern Swift ExpertStreamingConfig so Load.swift can:
@@ -314,6 +330,24 @@ struct MLXServer: AsyncParsableCommand {
             // Cap Metal command buffer size to avoid the 5s Apple GPU Watchdog.
             setenv("MLX_MAX_OPS_PER_BUFFER", "50", 1)
             print("[SwiftLM] Enabled Async SSD Streaming on directory: \(modelDir.lastPathComponent)")
+
+            // ── Fix #72: Apply SSD memory cap EARLY (before any model loads) ──
+            // Both the main model and draft model must load under the budget.
+            // The sentinel memoryLimit bypasses MLX eval_impl's spin-wait loop.
+            // Also address Copilot comment: apply the cap even when modelDirectory
+            // is nil (first-run download) so downloads also respect the budget.
+            let system = ModelProfiler.systemProfile()
+            if draftFootprintBytes > 0 {
+                print("[SwiftLM] 📦 Draft model footprint: \(String(format: "%.2f", Double(draftFootprintBytes) / 1e9))GB reserved from SSD budget")
+            }
+            Memory.cacheLimit = computeSSDMemoryBudget(totalRAMBytes: system.totalRAMBytes, draftWeightBytes: draftFootprintBytes)
+            Memory.memoryLimit = 200 * 1024 * 1024 * 1024 // 200 GB sentinel
+        } else if self.streamExperts {
+            // modelDirectory is nil — model not yet downloaded (first-run).
+            // Still apply the SSD memory cap so the download itself is bounded.
+            let system = ModelProfiler.systemProfile()
+            Memory.cacheLimit = computeSSDMemoryBudget(totalRAMBytes: system.totalRAMBytes, draftWeightBytes: draftFootprintBytes)
+            Memory.memoryLimit = 200 * 1024 * 1024 * 1024 // 200 GB sentinel
         }
         
         var partitionPlan: PartitionPlan?
@@ -338,7 +372,8 @@ struct MLXServer: AsyncParsableCommand {
                 if self.streamExperts {
                     // SSD Streaming: expert weights are mmap'd from SSD via the OS page cache.
                     // No swap involved — the page cache evicts stale expert pages cleanly.
-                    let physicalBudget = Int(Double(system.totalRAMBytes) * 0.85) - (4 * 1024 * 1024 * 1024)
+                    // draftFootprintBytes pre-computed once above (Copilot review).
+                    let physicalBudget = computeSSDMemoryBudget(totalRAMBytes: system.totalRAMBytes, draftWeightBytes: draftFootprintBytes)
                     Memory.cacheLimit = physicalBudget
                     Memory.memoryLimit = 200 * 1024 * 1024 * 1024 // 200GB sentinel to bypass MLX eval_impl spin loop
                     print("[SwiftLM] 💾 Memory strategy: SSD STREAMING (page-cache managed, \(physicalBudget / (1024*1024*1024))GB RAM budget, no swap)")
@@ -349,7 +384,8 @@ struct MLXServer: AsyncParsableCommand {
                 }
             case .layerPartitioned:
                 if self.streamExperts {
-                    let physicalBudget = Int(Double(system.totalRAMBytes) * 0.85) - (4 * 1024 * 1024 * 1024)
+                    // draftFootprintBytes pre-computed once above (Copilot review).
+                    let physicalBudget = computeSSDMemoryBudget(totalRAMBytes: system.totalRAMBytes, draftWeightBytes: draftFootprintBytes)
                     Memory.cacheLimit = physicalBudget
                     Memory.memoryLimit = 200 * 1024 * 1024 * 1024 // 200GB sentinel to bypass MLX eval_impl spin loop
                     print("[SwiftLM] 💾 Memory strategy: SSD STREAMING (page-cache managed, \(physicalBudget / (1024*1024*1024))GB RAM budget, no swap)")
@@ -476,6 +512,11 @@ struct MLXServer: AsyncParsableCommand {
             } else {
                 draftConfig = ModelConfiguration(id: draftModelPath)
             }
+            // Fix #72: mirror lazyLoad so the draft model's weights are mmap'd
+            // (not eagerly paged into unified RAM) when SSD streaming is active.
+            if self.streamExperts {
+                draftConfig.lazyLoad = true
+            }
             let draftDownloader = HubDownloader(hub: HubApi(downloadBase: cacheRoot))
             let draftContainer = try await LLMModelFactory.shared.loadContainer(
                 from: draftDownloader,
@@ -833,6 +874,24 @@ struct ServerConfig: Sendable {
     let turboKV: Bool
 }
 
+// ── SSD Memory Budget ────────────────────────────────────────────────────────
+
+/// Compute the page-cache budget (bytes) for SSD streaming mode.
+///
+/// Formula: `totalRAM × 0.85 − osHeadroom − draftWeightBytes`, floored at 2 GB.
+///
+/// - Parameters:
+///   - totalRAMBytes: Physical RAM reported by the OS (e.g. `system.totalRAMBytes`).
+///   - draftWeightBytes: Weight size (bytes) of the draft model, or 0 if none.
+///     Subtracted so the draft model's resident pages don't push the main model's
+///     page cache over the physical limit and trigger swap (Issue #72).
+/// - Returns: The recommended `Memory.cacheLimit` value in bytes.
+func computeSSDMemoryBudget(totalRAMBytes: UInt64, draftWeightBytes: Int = 0) -> Int {
+    let osHeadroom = 4 * 1024 * 1024 * 1024  // 4 GB for OS + system processes
+    let raw = Int(Double(totalRAMBytes) * 0.85) - osHeadroom - draftWeightBytes
+    return max(raw, 2 * 1024 * 1024 * 1024)  // floor at 2 GB
+}
+
 // ── Model Directory Resolution ───────────────────────────────────────────────
 
 /// Resolve a model ID to its local directory (if already downloaded).