Skip to content

feat(doctor): detect and repair stale runtime run lock#785

Open
mmTheBest wants to merge 2 commits intopaperclipai:masterfrom
mmTheBest:feat/doctor-runtime-lock-check-clean
Open

feat(doctor): detect and repair stale runtime run lock#785
mmTheBest wants to merge 2 commits intopaperclipai:masterfrom
mmTheBest:feat/doctor-runtime-lock-check-clean

Conversation

@mmTheBest
Copy link
Copy Markdown

Summary

Add a doctor check that inspects runtime run-lock state and helps recover from stale locks.

Why

A recurring reliability issue is stale or conflicting run.lock.json state, which causes confusing startup behavior on reruns.

What changed

  • New check: runtimeLockCheck
    • inspects <instanceRoot>/run.lock.json
    • pass: no lock
    • warn: active lock owner PID appears alive
    • warn + repairable: stale lock PID appears dead
    • warn + repairable: unreadable/corrupt lock file
  • Wired into paperclipai doctor flow before port check.
  • Added tests for no lock, stale lock repair, and live lock warning.

Validation

  • cd cli && pnpm exec vitest run src/__tests__/runtime-lock-check.test.ts

@mmTheBest mmTheBest marked this pull request as ready for review March 13, 2026 12:52
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 13, 2026

Greptile Summary

This PR introduces a new runtimeLockCheck doctor check that inspects run.lock.json to detect and optionally repair stale or conflicting runtime locks, and wires it into the existing paperclipai doctor flow as step 9.

Key findings:

  • Logic bug in isPidAlive: On POSIX systems process.kill(pid, 0) throws EPERM when the target process is alive but owned by a different OS user. The catch-all silently treats this as "dead", so a live lock held by another user would be classified as stale and offered for deletion — potentially corrupting a legitimately running instance.
  • Incomplete test coverage: The PR description explicitly calls out the "unreadable/corrupt lock file" repair path as a supported case, but no test exercises it.
  • Fragile stale-lock test: The stale-lock test hardcodes PID 999999, which is not guaranteed to be dead on all Linux configurations (systems with a high pid_max value).

Confidence Score: 2/5

  • Not safe to merge — a logic bug in isPidAlive can cause live locks to be deleted when they belong to another OS user.
  • The EPERM mis-classification in isPidAlive is a correctness bug that directly undermines the primary purpose of this check (protecting live runs from being clobbered). The fix is small, but without it the check can do more harm than good in multi-user environments.
  • cli/src/checks/runtime-lock-check.ts (isPidAlive logic) and cli/src/__tests__/runtime-lock-check.test.ts (missing corrupt-lock test, fragile PID assumption)

Important Files Changed

Filename Overview
cli/src/checks/runtime-lock-check.ts New check that inspects run.lock.json; contains a logic bug in isPidAlive where EPERM is treated as "dead" instead of "alive", which can cause live locks owned by other OS users to be incorrectly classified as stale and deleted.
cli/src/tests/runtime-lock-check.test.ts Covers no-lock and live-lock paths; missing test for corrupt/unreadable lock file, and the stale-lock test relies on PID 999999 being dead which is not guaranteed on all Linux configurations.
cli/src/commands/doctor.ts Wires runtimeLockCheck into the doctor flow as step 9 via the existing runRepairableCheck helper; integration looks correct.
cli/src/checks/index.ts Trivial re-export of runtimeLockCheck; no issues.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: cli/src/checks/runtime-lock-check.ts
Line: 14-19

Comment:
**`EPERM` mis-classified as dead process**

`process.kill(pid, 0)` on POSIX systems throws `EPERM` when the target **process is alive but owned by a different user** (you lack permission to signal it). The current catch-all means a live lock held by another OS user will be reported as stale and offered for deletion, which can corrupt a legitimately running instance.

The fix is to treat `EPERM` as "alive":

```suggestion
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "EPERM") return true;
    return false;
  }
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: cli/src/__tests__/runtime-lock-check.test.ts
Line: 36

Comment:
**Fragile PID assumption in stale lock test**

The test writes PID `999999` into the lock file, betting that no process with that PID is running. On Linux systems where `/proc/sys/kernel/pid_max` is raised above the default (e.g. to `4194304`), PID 999999 is a valid, potentially live process, and the check would return `"warn"` with `canRepair: false` (live lock path) instead of `"warn"` with a repair function — silently breaking the assertion on line 40.

A more reliable approach is to spawn a child process, record its PID, let it exit, and then write that PID into the lock file — guaranteeing the PID is dead at assertion time. Alternatively, capture the ESRCH error from a probing call to verify the PID is dead before writing the lock file.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: cli/src/__tests__/runtime-lock-check.test.ts
Line: 33-53

Comment:
**Missing test for corrupt/unreadable lock file**

The PR description lists "warn + repairable: unreadable/corrupt lock file" as a distinct case, but there is no corresponding test. The code path (lines 39-50 of `runtime-lock-check.ts`) is completely untested. Consider adding a test that writes a non-JSON payload (e.g. `"not-json"`) to the lock file, asserts `status === "warn"` and `canRepair === true`, calls `repair()`, and confirms the file is removed.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: a1840ef

Comment on lines +14 to +19
try {
process.kill(pid, 0);
return true;
} catch {
return false;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EPERM mis-classified as dead process

process.kill(pid, 0) on POSIX systems throws EPERM when the target process is alive but owned by a different user (you lack permission to signal it). The current catch-all means a live lock held by another OS user will be reported as stale and offered for deletion, which can corrupt a legitimately running instance.

The fix is to treat EPERM as "alive":

Suggested change
try {
process.kill(pid, 0);
return true;
} catch {
return false;
}
try {
process.kill(pid, 0);
return true;
} catch (err) {
if ((err as NodeJS.ErrnoException).code === "EPERM") return true;
return false;
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: cli/src/checks/runtime-lock-check.ts
Line: 14-19

Comment:
**`EPERM` mis-classified as dead process**

`process.kill(pid, 0)` on POSIX systems throws `EPERM` when the target **process is alive but owned by a different user** (you lack permission to signal it). The current catch-all means a live lock held by another OS user will be reported as stale and offered for deletion, which can corrupt a legitimately running instance.

The fix is to treat `EPERM` as "alive":

```suggestion
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "EPERM") return true;
    return false;
  }
```

How can I resolve this? If you propose a fix, please make it concise.

it("warns and repair removes stale lock", async () => {
const home = setupHome();
const lock = path.join(home, "instances", "default", "run.lock.json");
fs.writeFileSync(lock, JSON.stringify({ pid: 999999, startedAt: new Date().toISOString() }), "utf8");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fragile PID assumption in stale lock test

The test writes PID 999999 into the lock file, betting that no process with that PID is running. On Linux systems where /proc/sys/kernel/pid_max is raised above the default (e.g. to 4194304), PID 999999 is a valid, potentially live process, and the check would return "warn" with canRepair: false (live lock path) instead of "warn" with a repair function — silently breaking the assertion on line 40.

A more reliable approach is to spawn a child process, record its PID, let it exit, and then write that PID into the lock file — guaranteeing the PID is dead at assertion time. Alternatively, capture the ESRCH error from a probing call to verify the PID is dead before writing the lock file.

Prompt To Fix With AI
This is a comment left during a code review.
Path: cli/src/__tests__/runtime-lock-check.test.ts
Line: 36

Comment:
**Fragile PID assumption in stale lock test**

The test writes PID `999999` into the lock file, betting that no process with that PID is running. On Linux systems where `/proc/sys/kernel/pid_max` is raised above the default (e.g. to `4194304`), PID 999999 is a valid, potentially live process, and the check would return `"warn"` with `canRepair: false` (live lock path) instead of `"warn"` with a repair function — silently breaking the assertion on line 40.

A more reliable approach is to spawn a child process, record its PID, let it exit, and then write that PID into the lock file — guaranteeing the PID is dead at assertion time. Alternatively, capture the ESRCH error from a probing call to verify the PID is dead before writing the lock file.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +33 to +53
it("warns and repair removes stale lock", async () => {
const home = setupHome();
const lock = path.join(home, "instances", "default", "run.lock.json");
fs.writeFileSync(lock, JSON.stringify({ pid: 999999, startedAt: new Date().toISOString() }), "utf8");

const r = runtimeLockCheck();
expect(r.status).toBe("warn");
expect(typeof r.repair).toBe("function");
await r.repair?.();
expect(fs.existsSync(lock)).toBe(false);
});

it("warns on live lock", () => {
const home = setupHome();
const lock = path.join(home, "instances", "default", "run.lock.json");
fs.writeFileSync(lock, JSON.stringify({ pid: process.pid, startedAt: new Date().toISOString() }), "utf8");

const r = runtimeLockCheck();
expect(r.status).toBe("warn");
expect(r.message).toContain("active");
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing test for corrupt/unreadable lock file

The PR description lists "warn + repairable: unreadable/corrupt lock file" as a distinct case, but there is no corresponding test. The code path (lines 39-50 of runtime-lock-check.ts) is completely untested. Consider adding a test that writes a non-JSON payload (e.g. "not-json") to the lock file, asserts status === "warn" and canRepair === true, calls repair(), and confirms the file is removed.

Prompt To Fix With AI
This is a comment left during a code review.
Path: cli/src/__tests__/runtime-lock-check.test.ts
Line: 33-53

Comment:
**Missing test for corrupt/unreadable lock file**

The PR description lists "warn + repairable: unreadable/corrupt lock file" as a distinct case, but there is no corresponding test. The code path (lines 39-50 of `runtime-lock-check.ts`) is completely untested. Consider adding a test that writes a non-JSON payload (e.g. `"not-json"`) to the lock file, asserts `status === "warn"` and `canRepair === true`, calls `repair()`, and confirms the file is removed.

How can I resolve this? If you propose a fix, please make it concise.

@mmTheBest
Copy link
Copy Markdown
Author

Addressed review feedback in e7edb72:\n- Treat EPERM as alive in runtime lock PID probing\n- Harden stale-lock test (no hardcoded 999999 PID assumption)\n- Add corrupt/unreadable lock file repair test\n\nValidated with:
RUN v3.2.4 /private/tmp/paperclip/cli

@mmTheBest
Copy link
Copy Markdown
Author

Follow-up (escaped): validated with cd /tmp/paperclip/cli && pnpm exec vitest run src/__tests__/runtime-lock-check.test.ts (5 tests passing).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant