feat(doctor): detect and repair stale runtime run lock #785
mmTheBest wants to merge 2 commits into paperclipai:master
Greptile Summary
This PR introduces a new Key findings:
Confidence Score: 2/5
Important Files Changed
Prompt To Fix All With AI
This is a comment left during a code review.
Path: cli/src/checks/runtime-lock-check.ts
Line: 14-19
Comment:
**`EPERM` mis-classified as dead process**
`process.kill(pid, 0)` on POSIX systems throws `EPERM` when the target **process is alive but owned by a different user** (you lack permission to signal it). The current catch-all means a live lock held by another OS user will be reported as stale and offered for deletion, which can corrupt a legitimately running instance.
The fix is to treat `EPERM` as "alive":
```suggestion
try {
  process.kill(pid, 0);
  return true;
} catch (err) {
  if ((err as NodeJS.ErrnoException).code === "EPERM") return true;
  return false;
}
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: cli/src/__tests__/runtime-lock-check.test.ts
Line: 36
Comment:
**Fragile PID assumption in stale lock test**
The test writes PID `999999` into the lock file, betting that no process with that PID is running. On Linux systems where `/proc/sys/kernel/pid_max` is raised above the default (e.g. to `4194304`), PID 999999 is a valid, potentially live process, and the check would return `"warn"` with `canRepair: false` (live lock path) instead of `"warn"` with a repair function — silently breaking the assertion on line 40.
A more reliable approach is to spawn a child process, record its PID, let it exit, and then write that PID into the lock file — guaranteeing the PID is dead at assertion time. Alternatively, capture the ESRCH error from a probing call to verify the PID is dead before writing the lock file.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: cli/src/__tests__/runtime-lock-check.test.ts
Line: 33-53
Comment:
**Missing test for corrupt/unreadable lock file**
The PR description lists "warn + repairable: unreadable/corrupt lock file" as a distinct case, but there is no corresponding test. The code path (lines 39-50 of `runtime-lock-check.ts`) is completely untested. Consider adding a test that writes a non-JSON payload (e.g. `"not-json"`) to the lock file, asserts `status === "warn"` and `canRepair === true`, calls `repair()`, and confirms the file is removed.
How can I resolve this? If you propose a fix, please make it concise.
Last reviewed commit: a1840ef
```ts
try {
  process.kill(pid, 0);
  return true;
} catch {
  return false;
}
```
EPERM mis-classified as dead process
process.kill(pid, 0) on POSIX systems throws EPERM when the target process is alive but owned by a different user (you lack permission to signal it). The current catch-all means a live lock held by another OS user will be reported as stale and offered for deletion, which can corrupt a legitimately running instance.
The fix is to treat EPERM as "alive":
```diff
- try {
-   process.kill(pid, 0);
-   return true;
- } catch {
-   return false;
- }
+ try {
+   process.kill(pid, 0);
+   return true;
+ } catch (err) {
+   if ((err as NodeJS.ErrnoException).code === "EPERM") return true;
+   return false;
+ }
```
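The `EPERM` handling described in this comment can be sketched as a complete helper. The function name `isPidAlive` and the guard against non-positive PIDs are assumptions added here, not code from the PR. The guard is included because on POSIX a PID of 0 or a negative PID addresses a process group rather than a single process, so a corrupt lock file could otherwise appear to be held by a live process.

```typescript
// Hypothetical helper: the name `isPidAlive` is illustrative, not from the PR.
function isPidAlive(pid: number): boolean {
  // Assumption: reject non-integer, zero, and negative PIDs up front.
  // On POSIX, pid 0 and negative pids target process groups, not one process.
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    // Signal 0 runs the existence and permission checks without sending a signal.
    process.kill(pid, 0);
    return true;
  } catch (err) {
    const code = (err as NodeJS.ErrnoException).code;
    // EPERM: the process exists but is owned by another user, so it is alive.
    if (code === "EPERM") return true;
    // ESRCH (no such process) and any other error: treat as not running.
    return false;
  }
}

// The current process is always alive from its own point of view.
console.log(isPidAlive(process.pid));
```

With this shape, only a definite "no such process" style failure leads to the lock being offered for deletion; a permission error keeps the lock on the live path.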
```ts
it("warns and repair removes stale lock", async () => {
  const home = setupHome();
  const lock = path.join(home, "instances", "default", "run.lock.json");
  fs.writeFileSync(lock, JSON.stringify({ pid: 999999, startedAt: new Date().toISOString() }), "utf8");
```
Fragile PID assumption in stale lock test
The test writes PID 999999 into the lock file, betting that no process with that PID is running. On Linux systems where /proc/sys/kernel/pid_max is raised above the default (e.g. to 4194304), PID 999999 is a valid, potentially live process, and the check would return "warn" with canRepair: false (live lock path) instead of "warn" with a repair function — silently breaking the assertion on line 40.
A more reliable approach is to spawn a child process, record its PID, let it exit, and then write that PID into the lock file — guaranteeing the PID is dead at assertion time. Alternatively, capture the ESRCH error from a probing call to verify the PID is dead before writing the lock file.
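One way to obtain a PID that is known to be dead is sketched below, assuming that spawning a short-lived Node child is acceptable in the test environment. The helper name `deadPid` is hypothetical. `spawnSync` returns only after the child has exited, so the returned PID can be written into the lock file. PID reuse between exit and assertion remains theoretically possible, so the sketch also probes the PID as a sanity check.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical test helper: `deadPid` is not a name from the PR.
function deadPid(): number {
  // Run a Node child that exits immediately; spawnSync blocks until it has exited.
  const child = spawnSync(process.execPath, ["-e", ""]);
  if (child.error || typeof child.pid !== "number" || child.pid <= 0) {
    throw new Error("could not spawn a short-lived child process");
  }
  return child.pid;
}

const pid = deadPid();

// Probe the PID: a dead process is expected to fail the signal-0 check.
let probeCode: string | undefined;
try {
  process.kill(pid, 0);
} catch (err) {
  probeCode = (err as NodeJS.ErrnoException).code;
}

console.log(pid, probeCode);
```

In the stale-lock test, the value returned by `deadPid()` would take the place of the hardcoded `999999`.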
```ts
it("warns and repair removes stale lock", async () => {
  const home = setupHome();
  const lock = path.join(home, "instances", "default", "run.lock.json");
  fs.writeFileSync(lock, JSON.stringify({ pid: 999999, startedAt: new Date().toISOString() }), "utf8");

  const r = runtimeLockCheck();
  expect(r.status).toBe("warn");
  expect(typeof r.repair).toBe("function");
  await r.repair?.();
  expect(fs.existsSync(lock)).toBe(false);
});

it("warns on live lock", () => {
  const home = setupHome();
  const lock = path.join(home, "instances", "default", "run.lock.json");
  fs.writeFileSync(lock, JSON.stringify({ pid: process.pid, startedAt: new Date().toISOString() }), "utf8");

  const r = runtimeLockCheck();
  expect(r.status).toBe("warn");
  expect(r.message).toContain("active");
});
```
Missing test for corrupt/unreadable lock file
The PR description lists "warn + repairable: unreadable/corrupt lock file" as a distinct case, but there is no corresponding test. The code path (lines 39-50 of runtime-lock-check.ts) is completely untested. Consider adding a test that writes a non-JSON payload (e.g. "not-json") to the lock file, asserts status === "warn" and canRepair === true, calls repair(), and confirms the file is removed.
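The steps of the suggested test can be sketched in a self-contained form. Because the real `runtimeLockCheck` is not shown here, the sketch uses a hypothetical stand-in, `checkLockFile`, and a hypothetical result shape built from the field names mentioned in the review (`status`, `canRepair`, `repair`). The actual implementation and types in the PR may differ.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical result shape; the real type in the PR may differ.
interface CheckResult {
  status: "pass" | "warn";
  canRepair: boolean;
  repair?: () => void;
}

// Stand-in for the PR's runtimeLockCheck, reduced to the corrupt-file path.
function checkLockFile(lockPath: string): CheckResult {
  if (!fs.existsSync(lockPath)) return { status: "pass", canRepair: false };
  try {
    const parsed = JSON.parse(fs.readFileSync(lockPath, "utf8"));
    if (typeof parsed?.pid !== "number") throw new Error("lock has no numeric pid");
  } catch {
    // Unreadable or corrupt lock: warn and offer to delete the file.
    return {
      status: "warn",
      canRepair: true,
      repair: () => fs.rmSync(lockPath, { force: true }),
    };
  }
  // Liveness probing is omitted; this sketch reports any parseable lock
  // as a non-repairable warning.
  return { status: "warn", canRepair: false };
}

// Arrange: a temporary directory containing a lock file with a non-JSON payload.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "lock-check-"));
const lockPath = path.join(dir, "run.lock.json");
fs.writeFileSync(lockPath, "not-json", "utf8");

// Act: run the check, then the repair it offers.
const result = checkLockFile(lockPath);
result.repair?.();
const removed = !fs.existsSync(lockPath);

// Clean up the temporary directory.
fs.rmSync(dir, { recursive: true, force: true });

console.log(result.status, result.canRepair, removed);
```

In the project's own test file, the same arrange, act, and assert steps would use `setupHome()`, `runtimeLockCheck()`, and `expect(...)` as in the existing tests.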
Addressed review feedback in e7edb72:
- Treat EPERM as alive in runtime lock PID probing
- Harden stale-lock test (no hardcoded 999999 PID assumption)
- Add corrupt/unreadable lock file repair test

Validated with:

Follow-up (escaped): validated with
Summary
Add a doctor check that inspects runtime run-lock state and helps recover from stale locks.
Why
A recurring reliability issue is stale or conflicting `run.lock.json` state, which causes confusing startup behavior on reruns.
What changed
`runtimeLockCheck` `<instanceRoot>/run.lock.json` `paperclipai doctor` flow before port check.
Validation
`cd cli && pnpm exec vitest run src/__tests__/runtime-lock-check.test.ts` ✅