
fix(runtime): enforce single-instance run + deterministic readiness#778

Open
mmTheBest wants to merge 2 commits into paperclipai:master from mmTheBest:fix/runtime-lock-readiness-openclaw

Conversation

@mmTheBest

Summary

This PR hardens paperclipai run startup reliability by enforcing single-instance ownership and adding deterministic readiness gating.

Why

During OpenClaw + Paperclip setup/recovery runs, we hit repeated split-runtime failures:

  • multiple paperclipai run processes active at the same time
  • API endpoint healthy while UI route behavior was inconsistent
  • false-positive "running" state from an operator perspective

What changed

  • Added runtime lock support:

    • new file: cli/src/commands/run-runtime-guard.ts
    • lock file: <instanceRoot>/run.lock.json
    • blocks concurrent paperclipai run for same instance
    • stale lock replacement when previous PID is dead
    • lock cleanup on SIGINT / SIGTERM / process exit
  • Added readiness gating in run command:

    • full mode (default): requires both API and UI probes to pass
    • api-only mode: requires API probe only
    • startup exits non-zero with diagnostics when readiness times out
  • Improved startup logs:

    • lock path + owner pid
    • readiness mode and final readiness result

Tests

Added: cli/src/__tests__/run-runtime-guard.test.ts

  • lock acquire + release
  • conflict on second live owner
  • stale lock replacement
  • readiness pass/fail behavior
  • PID liveness helper

Run command used:

cd cli && pnpm exec vitest run src/__tests__/run-runtime-guard.test.ts

Notes

pnpm typecheck currently reports existing unrelated errors in the repo (e.g. initdbFlags typing in other files). This PR does not touch those code paths.

Follow-ups

@mmTheBest mmTheBest marked this pull request as ready for review March 13, 2026 12:52
@greptile-apps
Contributor

greptile-apps bot commented Mar 13, 2026

Greptile Summary

This PR adds single-instance enforcement (via a JSON lock file) and deterministic readiness gating (polling /api/health and /) to paperclipai run, addressing split-runtime failures during concurrent or rapid restarts. The approach is sound, but two logic bugs need to be resolved before merging.
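
The readiness gating described here can be sketched as a polling loop with an injectable fetcher. This is a minimal sketch under assumptions inferred from the discussion (option and result shapes are not necessarily the PR's real API); it requires Node 18+ for the global `Response`.

```typescript
type ReadinessMode = "full" | "api-only";

interface ReadinessResult {
  ok: boolean;
  apiOk: boolean;
  uiOk: boolean;
}

// Poll /api/health (and, in "full" mode, the UI root) until both pass or
// the deadline expires. The fetcher is injected so tests can stub it.
async function waitForReadiness(opts: {
  baseUrl: string;
  mode: ReadinessMode;
  timeoutMs: number;
  intervalMs: number;
  fetcher: typeof fetch;
}): Promise<ReadinessResult> {
  const deadline = Date.now() + opts.timeoutMs;
  let apiOk = false;
  let uiOk = opts.mode === "api-only"; // UI probe skipped in api-only mode
  while (Date.now() < deadline) {
    if (!apiOk) apiOk = await probe(opts.fetcher, `${opts.baseUrl}/api/health`);
    if (!uiOk) uiOk = await probe(opts.fetcher, `${opts.baseUrl}/`);
    if (apiOk && uiOk) return { ok: true, apiOk, uiOk };
    await new Promise((r) => setTimeout(r, opts.intervalMs));
  }
  return { ok: false, apiOk, uiOk };
}

async function probe(fetcher: typeof fetch, url: string): Promise<boolean> {
  try {
    return (await fetcher(url)).ok; // non-2xx counts as not ready
  } catch {
    return false; // connection refused etc. counts as not ready
  }
}
```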

Key findings:

  • Critical – SIGINT/SIGTERM do not terminate the process (run.ts:53-56): Registering process.once("SIGINT", cleanup) overrides Node.js's default termination behavior. After the handler fires the lock is released, but the server keeps running. A second paperclipai run can then acquire the lock while the original process is still live — the exact split-runtime scenario this PR is designed to prevent.
  • Logic – TOCTOU race in stale-lock replacement (run-runtime-guard.ts:49-57): The fallback writeFileSync (without "wx") that replaces a stale lock is not atomic. Two concurrent processes that both observe the same stale lock can each conclude it is dead and overwrite each other's valid lock.
  • Style – Stale-lock test uses non-deterministic PID (run-runtime-guard.test.ts:42): PID 999999 is assumed dead but is not guaranteed to be, making the test potentially flaky.
  • Style – api-only readiness mode has no test coverage (run-runtime-guard.test.ts): Easy-to-regress path with zero test coverage.

Confidence Score: 2/5

  • Not safe to merge — the SIGINT/SIGTERM handler silently keeps the server running after the lock is released, which directly reproduces the split-runtime failure the PR is trying to fix.
  • Two logic bugs undermine the core guarantees of the PR: (1) signal handlers don't exit the process, leaving a lock-less server running; (2) the stale-lock replacement is not atomic, allowing concurrent processes to both believe they hold the lock. The readiness polling logic itself is correct, and the overall architecture is good — but the signal-handling regression is severe enough to block merge.
  • cli/src/commands/run.ts (signal handler regression) and cli/src/commands/run-runtime-guard.ts (TOCTOU in stale-lock path) need attention before this is safe to merge.

Important Files Changed

| Filename | Overview |
| --- | --- |
| `cli/src/commands/run.ts` | Integrates lock acquisition and readiness gating into startup; critical bug: SIGINT/SIGTERM handlers release the lock but do not exit the process, allowing a second instance to acquire the lock while the first server remains live. |
| `cli/src/commands/run-runtime-guard.ts` | New module providing lock acquisition and readiness polling; the stale-lock replacement path has a TOCTOU race where two concurrent processes can each overwrite the other's valid lock, defeating single-instance enforcement. |
| `cli/src/__tests__/run-runtime-guard.test.ts` | New test file with good coverage of lock acquire/release and readiness probing; the stale-lock test uses PID 999999, which is not guaranteed to be dead, and api-only readiness mode is not exercised. |
Last reviewed commit: 054cf62

Comment thread cli/src/commands/run.ts
Comment on lines +53 to +56

```ts
const cleanup = () => lock.release();
process.once("SIGINT", cleanup);
process.once("SIGTERM", cleanup);
process.once("exit", cleanup);
```

SIGINT/SIGTERM handlers do not exit the process

process.once("SIGINT", cleanup) and process.once("SIGTERM", cleanup) override Node.js's default signal termination behavior. After the listener fires (released is set to true, lock is cleaned up), the process continues running — the server is still active and accepting connections. This directly breaks the single-instance guarantee this PR is meant to provide: another paperclipai run invocation could now acquire the lock and start, while the original server instance is still live.

The fix is to terminate the process after cleanup when handling a signal:

Suggested change:

```diff
-const cleanup = () => lock.release();
-process.once("SIGINT", cleanup);
-process.once("SIGTERM", cleanup);
-process.once("exit", cleanup);
+const cleanup = () => lock.release();
+const signalCleanup = (signal: NodeJS.Signals) => {
+  cleanup();
+  process.kill(process.pid, signal);
+};
+process.once("SIGINT", signalCleanup);
+process.once("SIGTERM", signalCleanup);
+process.once("exit", cleanup);
```

Re-sending the signal after removing the handler restores default OS behavior (correct exit code/termination) rather than using process.exit(0), which produces the wrong exit code for signal termination.


Comment thread cli/src/commands/run-runtime-guard.ts (Outdated)
Comment on lines +49 to +57

```ts
} catch {
  const existing = readLock(lockPath);
  if (existing && isPidAlive(existing.pid)) {
    throw new Error(
      `Another paperclipai run appears active for instance '${instanceId}' (pid=${existing.pid}, startedAt=${existing.startedAt}).`,
    );
  }
  // stale/corrupt lock: replace
  fs.writeFileSync(lockPath, JSON.stringify(state, null, 2), { encoding: "utf8" });
```

TOCTOU race condition in stale lock replacement

Between checking isPidAlive(existing.pid) (line 51) and overwriting the lock (line 57) without the exclusive "wx" flag, two concurrent processes that both observe the same stale lock can each:

  1. Read the dead PID
  2. Conclude it is stale
  3. Both write their own lock

The second write silently overwrites the first process's valid lock, leaving two processes believing they hold the lock. This is exactly the split-runtime scenario the PR is trying to prevent.

To close the race, the replacement write should also use "wx" (or a platform-specific rename-based atomic swap). A simple defensive approach is to retry with "wx" and, if that fails, re-read and re-check the PID:

```ts
// stale/corrupt lock: attempt atomic replacement
try {
  fs.writeFileSync(lockPath, JSON.stringify(state, null, 2), { encoding: "utf8", flag: "wx" });
} catch {
  // Lost the replacement race — re-read and check
  const contested = readLock(lockPath);
  if (contested && isPidAlive(contested.pid)) {
    throw new Error(
      `Another paperclipai run appears active for instance '${instanceId}' (pid=${contested.pid}, startedAt=${contested.startedAt}).`,
    );
  }
  // Still stale (or file was removed); last-writer-wins is acceptable here
  fs.writeFileSync(lockPath, JSON.stringify(state, null, 2), { encoding: "utf8" });
}
```

Comment on lines +40 to +43

```ts
fs.writeFileSync(
  lockPath,
  JSON.stringify({ pid: 999999, startedAt: new Date().toISOString(), command: "x", instanceId: "default" }),
  "utf8",
```

Stale-lock test uses non-guaranteed dead PID

PID 999999 is assumed to be dead, but on Linux the max PID is configurable (typically 4,194,304) and some environments recycle high PIDs. If PID 999999 happens to be live, isPidAlive returns true, the stale-lock code path throws, and the test fails non-deterministically.

A more robust approach is to look up a definitely-dead PID:

Suggested change:

```diff
 fs.writeFileSync(
   lockPath,
-  JSON.stringify({ pid: 999999, startedAt: new Date().toISOString(), command: "x", instanceId: "default" }),
+  JSON.stringify({ pid: 1, startedAt: new Date().toISOString(), command: "x", instanceId: "default" }),
   "utf8",
+);
```

PID 1 (init/systemd) will never be owned by the test runner, so process.kill(1, 0) will always throw EPERM (process exists but not owned), meaning isPidAlive(1) returns false. Alternatively, spawn a child process, record its PID, wait for it to exit, then use that PID — that is provably dead.
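
The spawn-then-wait alternative mentioned above can be sketched as follows; `isPidAlive` is redeclared here to keep the sketch self-contained and stands in for the helper under test.

```typescript
import { spawnSync } from "node:child_process";

// Same liveness probe shape the guard module would use.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// Run a trivial child to completion; once spawnSync returns, the child has
// exited, so its PID is dead (barring immediate PID reuse, which is
// vanishingly unlikely right after exit).
function provablyDeadPid(): number {
  const child = spawnSync(process.execPath, ["-e", "0"]);
  if (child.error || child.pid === undefined) {
    throw child.error ?? new Error("failed to spawn child");
  }
  return child.pid;
}
```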


Comment on lines +76 to +82

```ts
const result = await waitForReadiness({
  baseUrl: "http://127.0.0.1:3100",
  mode: "full",
  timeoutMs: 60,
  intervalMs: 10,
  fetcher: fetcher as typeof fetch,
});
```

Missing test for api-only readiness mode

The test suite covers "full" mode (pass and fail) and isPidAlive, but there is no test exercising "api-only" mode. In "api-only" mode, uiOk is pre-set to true and the UI probe is never executed — a straightforward path that is easy to introduce a regression in (e.g., accidentally adding uiOk = false before the mode check). Adding a test case would close this gap:

```ts
it("waitForReadiness passes in api-only mode when api is ok", async () => {
  const fetcher = async (): Promise<Response> => new Response("ok", { status: 200 });
  const result = await waitForReadiness({
    baseUrl: "http://127.0.0.1:3100",
    mode: "api-only",
    timeoutMs: 100,
    intervalMs: 10,
    fetcher: fetcher as typeof fetch,
  });
  expect(result.ok).toBe(true);
  expect(result.apiOk).toBe(true);
  expect(result.uiOk).toBe(true); // pre-set for api-only
});
```

@mmTheBest
Author

Addressed latest review feedback in 1705d4b:

  • Fixed SIGINT/SIGTERM behavior to release the lock and then re-signal for correct process-termination semantics
  • Hardened stale-lock acquisition to avoid the non-atomic overwrite race and fail safely on contested acquisition
  • Replaced the stale-lock test with a deterministic dead-PID pattern (spawn child -> wait exit)
  • Added api-only readiness mode test coverage

Validated with:
RUN v3.2.4 /private/tmp/paperclip/cli

✓ src/__tests__/run-runtime-guard.test.ts (7 tests) 114ms

Test Files  1 passed (1)
     Tests  7 passed (7)
  Start at  21:20:57
  Duration  311ms (transform 15ms, setup 0ms, collect 15ms, tests 114ms, environment 0ms, prepare 30ms)
