feat(doctor): detect and repair stale runtime run lock by mmTheBest · Pull Request #785 · paperclipai/paperclip

mmTheBest · 2026-03-13T12:38:26Z

Summary

Add a doctor check that inspects runtime run-lock state and helps recover from stale locks.

Why

A recurring reliability issue is stale or conflicting run.lock.json state, which causes confusing startup behavior on reruns.

What changed

New check: runtimeLockCheck
- inspects <instanceRoot>/run.lock.json
- pass: no lock
- warn: active lock owner PID appears alive
- warn + repairable: stale lock PID appears dead
- warn + repairable: unreadable/corrupt lock file
Wired into paperclipai doctor flow before port check.
Added tests for no lock, stale lock repair, and live lock warning.

Validation

cd cli && pnpm exec vitest run src/__tests__/runtime-lock-check.test.ts ✅

greptile-apps · 2026-03-13T12:53:58Z

Greptile Summary

This PR introduces a new runtimeLockCheck doctor check that inspects run.lock.json to detect and optionally repair stale or conflicting runtime locks, and wires it into the existing paperclipai doctor flow as step 9.

Key findings:

Logic bug in isPidAlive: On POSIX systems process.kill(pid, 0) throws EPERM when the target process is alive but owned by a different OS user. The catch-all silently treats this as "dead", so a live lock held by another user would be classified as stale and offered for deletion — potentially corrupting a legitimately running instance.
Incomplete test coverage: The PR description explicitly calls out the "unreadable/corrupt lock file" repair path as a supported case, but no test exercises it.
Fragile stale-lock test: The stale-lock test hardcodes PID 999999, which is not guaranteed to be dead on all Linux configurations (systems with a high pid_max value).

Confidence Score: 2/5

Not safe to merge — a logic bug in isPidAlive can cause live locks to be deleted when they belong to another OS user.
The EPERM mis-classification in isPidAlive is a correctness bug that directly undermines the primary purpose of this check (protecting live runs from being clobbered). The fix is small, but without it the check can do more harm than good in multi-user environments.
cli/src/checks/runtime-lock-check.ts (isPidAlive logic) and cli/src/__tests__/runtime-lock-check.test.ts (missing corrupt-lock test, fragile PID assumption)

Important Files Changed

Filename	Overview
cli/src/checks/runtime-lock-check.ts	New check that inspects run.lock.json; contains a logic bug in isPidAlive where EPERM is treated as "dead" instead of "alive", which can cause live locks owned by other OS users to be incorrectly classified as stale and deleted.
cli/src/tests/runtime-lock-check.test.ts	Covers no-lock and live-lock paths; missing test for corrupt/unreadable lock file, and the stale-lock test relies on PID 999999 being dead which is not guaranteed on all Linux configurations.
cli/src/commands/doctor.ts	Wires runtimeLockCheck into the doctor flow as step 9 via the existing runRepairableCheck helper; integration looks correct.
cli/src/checks/index.ts	Trivial re-export of runtimeLockCheck; no issues.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: cli/src/checks/runtime-lock-check.ts
Line: 14-19

Comment:
**`EPERM` mis-classified as dead process**

`process.kill(pid, 0)` on POSIX systems throws `EPERM` when the target **process is alive but owned by a different user** (you lack permission to signal it). The current catch-all means a live lock held by another OS user will be reported as stale and offered for deletion, which can corrupt a legitimately running instance.

The fix is to treat `EPERM` as "alive":

```suggestion
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "EPERM") return true;
    return false;
  }
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: cli/src/__tests__/runtime-lock-check.test.ts
Line: 36

Comment:
**Fragile PID assumption in stale lock test**

The test writes PID `999999` into the lock file, betting that no process with that PID is running. On Linux systems where `/proc/sys/kernel/pid_max` is raised above the default (e.g. to `4194304`), PID 999999 is a valid, potentially live process, and the check would return `"warn"` with `canRepair: false` (live lock path) instead of `"warn"` with a repair function — silently breaking the assertion on line 40.

A more reliable approach is to spawn a child process, record its PID, let it exit, and then write that PID into the lock file — guaranteeing the PID is dead at assertion time. Alternatively, capture the ESRCH error from a probing call to verify the PID is dead before writing the lock file.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: cli/src/__tests__/runtime-lock-check.test.ts
Line: 33-53

Comment:
**Missing test for corrupt/unreadable lock file**

The PR description lists "warn + repairable: unreadable/corrupt lock file" as a distinct case, but there is no corresponding test. The code path (lines 39-50 of `runtime-lock-check.ts`) is completely untested. Consider adding a test that writes a non-JSON payload (e.g. `"not-json"`) to the lock file, asserts `status === "warn"` and `canRepair === true`, calls `repair()`, and confirms the file is removed.

How can I resolve this? If you propose a fix, please make it concise.

_{Last reviewed commit: a1840ef}

greptile-apps · 2026-03-13T12:54:01Z

+  try {
+    process.kill(pid, 0);
+    return true;
+  } catch {
+    return false;
+  }


EPERM mis-classified as dead process

process.kill(pid, 0) on POSIX systems throws EPERM when the target process is alive but owned by a different user (you lack permission to signal it). The current catch-all means a live lock held by another OS user will be reported as stale and offered for deletion, which can corrupt a legitimately running instance.

The fix is to treat EPERM as "alive":

Suggested change

try {

process.kill(pid, 0);

return true;

} catch {

return false;

}

try {

process.kill(pid, 0);

return true;

} catch (err) {

if ((err as NodeJS.ErrnoException).code === "EPERM") return true;

return false;

}

Prompt To Fix With AI

This is a comment left during a code review. Path: cli/src/checks/runtime-lock-check.ts Line: 14-19 Comment: **`EPERM` mis-classified as dead process** `process.kill(pid, 0)` on POSIX systems throws `EPERM` when the target **process is alive but owned by a different user** (you lack permission to signal it). The current catch-all means a live lock held by another OS user will be reported as stale and offered for deletion, which can corrupt a legitimately running instance. The fix is to treat `EPERM` as "alive": ```suggestion try { process.kill(pid, 0); return true; } catch (err) { if ((err as NodeJS.ErrnoException).code === "EPERM") return true; return false; } ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-03-13T12:54:02Z

+  it("warns and repair removes stale lock", async () => {
+    const home = setupHome();
+    const lock = path.join(home, "instances", "default", "run.lock.json");
+    fs.writeFileSync(lock, JSON.stringify({ pid: 999999, startedAt: new Date().toISOString() }), "utf8");


Fragile PID assumption in stale lock test

The test writes PID 999999 into the lock file, betting that no process with that PID is running. On Linux systems where /proc/sys/kernel/pid_max is raised above the default (e.g. to 4194304), PID 999999 is a valid, potentially live process, and the check would return "warn" with canRepair: false (live lock path) instead of "warn" with a repair function — silently breaking the assertion on line 40.

A more reliable approach is to spawn a child process, record its PID, let it exit, and then write that PID into the lock file — guaranteeing the PID is dead at assertion time. Alternatively, capture the ESRCH error from a probing call to verify the PID is dead before writing the lock file.

Prompt To Fix With AI

This is a comment left during a code review. Path: cli/src/__tests__/runtime-lock-check.test.ts Line: 36 Comment: **Fragile PID assumption in stale lock test** The test writes PID `999999` into the lock file, betting that no process with that PID is running. On Linux systems where `/proc/sys/kernel/pid_max` is raised above the default (e.g. to `4194304`), PID 999999 is a valid, potentially live process, and the check would return `"warn"` with `canRepair: false` (live lock path) instead of `"warn"` with a repair function — silently breaking the assertion on line 40. A more reliable approach is to spawn a child process, record its PID, let it exit, and then write that PID into the lock file — guaranteeing the PID is dead at assertion time. Alternatively, capture the ESRCH error from a probing call to verify the PID is dead before writing the lock file. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-03-13T12:54:03Z

+  it("warns and repair removes stale lock", async () => {
+    const home = setupHome();
+    const lock = path.join(home, "instances", "default", "run.lock.json");
+    fs.writeFileSync(lock, JSON.stringify({ pid: 999999, startedAt: new Date().toISOString() }), "utf8");
+
+    const r = runtimeLockCheck();
+    expect(r.status).toBe("warn");
+    expect(typeof r.repair).toBe("function");
+    await r.repair?.();
+    expect(fs.existsSync(lock)).toBe(false);
+  });
+
+  it("warns on live lock", () => {
+    const home = setupHome();
+    const lock = path.join(home, "instances", "default", "run.lock.json");
+    fs.writeFileSync(lock, JSON.stringify({ pid: process.pid, startedAt: new Date().toISOString() }), "utf8");
+
+    const r = runtimeLockCheck();
+    expect(r.status).toBe("warn");
+    expect(r.message).toContain("active");
+  });


Missing test for corrupt/unreadable lock file

The PR description lists "warn + repairable: unreadable/corrupt lock file" as a distinct case, but there is no corresponding test. The code path (lines 39-50 of runtime-lock-check.ts) is completely untested. Consider adding a test that writes a non-JSON payload (e.g. "not-json") to the lock file, asserts status === "warn" and canRepair === true, calls repair(), and confirms the file is removed.

Prompt To Fix With AI

This is a comment left during a code review. Path: cli/src/__tests__/runtime-lock-check.test.ts Line: 33-53 Comment: **Missing test for corrupt/unreadable lock file** The PR description lists "warn + repairable: unreadable/corrupt lock file" as a distinct case, but there is no corresponding test. The code path (lines 39-50 of `runtime-lock-check.ts`) is completely untested. Consider adding a test that writes a non-JSON payload (e.g. `"not-json"`) to the lock file, asserts `status === "warn"` and `canRepair === true`, calls `repair()`, and confirms the file is removed. How can I resolve this? If you propose a fix, please make it concise.

mmTheBest · 2026-03-13T13:01:42Z

Addressed review feedback in e7edb72:\n- Treat EPERM as alive in runtime lock PID probing\n- Harden stale-lock test (no hardcoded 999999 PID assumption)\n- Add corrupt/unreadable lock file repair test\n\nValidated with:
RUN v3.2.4 /private/tmp/paperclip/cli

mmTheBest · 2026-03-13T13:02:15Z

Follow-up (escaped): validated with cd /tmp/paperclip/cli && pnpm exec vitest run src/__tests__/runtime-lock-check.test.ts (5 tests passing).

feat(doctor): detect and repair stale runtime run lock

a1840ef

mmTheBest mentioned this pull request Mar 13, 2026

feat(doctor): detect and repair stale runtime run lock #781

Closed

mmTheBest marked this pull request as ready for review March 13, 2026 12:52

greptile-apps bot reviewed Mar 13, 2026

View reviewed changes

fix(doctor): handle EPERM lock checks and harden runtime lock tests

e7edb72

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(doctor): detect and repair stale runtime run lock#785

feat(doctor): detect and repair stale runtime run lock#785
mmTheBest wants to merge 2 commits intopaperclipai:masterfrom
mmTheBest:feat/doctor-runtime-lock-check-clean

mmTheBest commented Mar 13, 2026

Uh oh!

greptile-apps bot commented Mar 13, 2026

Uh oh!

greptile-apps bot Mar 13, 2026

Uh oh!

greptile-apps bot Mar 13, 2026

Uh oh!

greptile-apps bot Mar 13, 2026

Uh oh!

mmTheBest commented Mar 13, 2026

Uh oh!

mmTheBest commented Mar 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mmTheBest commented Mar 13, 2026

Summary

Why

What changed

Validation

Uh oh!

greptile-apps bot commented Mar 13, 2026

Greptile Summary

Confidence Score: 2/5

Important Files Changed

Uh oh!

greptile-apps bot Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

mmTheBest commented Mar 13, 2026

Uh oh!

mmTheBest commented Mar 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant