
fix(runtime): enforce single-instance run + deterministic readiness#778

Open
mmTheBest wants to merge 2 commits into paperclipai:master from mmTheBest:fix/runtime-lock-readiness-openclaw

Conversation

@mmTheBest

Summary

This PR hardens paperclipai run startup reliability by enforcing single-instance ownership and adding deterministic readiness gating.

Why

During OpenClaw + Paperclip setup/recovery runs, we hit repeated split-runtime failures:

  • multiple paperclipai run processes active at the same time
  • API endpoint healthy while UI route behavior was inconsistent
  • false-positive "running" state from an operator perspective

What changed

  • Added runtime lock support:

    • new file: cli/src/commands/run-runtime-guard.ts
    • lock file: <instanceRoot>/run.lock.json
    • blocks concurrent paperclipai run for same instance
    • stale lock replacement when previous PID is dead
    • lock cleanup on SIGINT / SIGTERM / process exit
  • Added readiness gating in run command:

    • full mode (default): requires both API and UI probes to pass
    • api-only mode: requires API probe only
    • startup exits non-zero with diagnostics when readiness times out
  • Improved startup logs:

    • lock path + owner pid
    • readiness mode and final readiness result

Tests

Added: cli/src/__tests__/run-runtime-guard.test.ts

  • lock acquire + release
  • conflict on second live owner
  • stale lock replacement
  • readiness pass/fail behavior
  • PID liveness helper

Run command used:

cd cli && pnpm exec vitest run src/__tests__/run-runtime-guard.test.ts

Notes

pnpm typecheck currently reports existing unrelated errors in the repo (e.g. initdbFlags typing in other files). This PR does not touch those code paths.

Follow-ups

@mmTheBest mmTheBest marked this pull request as ready for review March 13, 2026 12:52
@greptile-apps
Contributor

greptile-apps bot commented Mar 13, 2026

Greptile Summary

This PR adds single-instance enforcement (via a JSON lock file) and deterministic readiness gating (polling /api/health and /) to paperclipai run, addressing split-runtime failures during concurrent or rapid restarts. The approach is sound, but two logic bugs need to be resolved before merging.
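
The readiness gating described here can be sketched as a polling loop with an injectable fetcher. This is a minimal sketch under assumptions inferred from the discussion (option and result shapes are not necessarily the PR's real API); it requires Node 18+ for the global `Response`.

```typescript
type ReadinessMode = "full" | "api-only";

interface ReadinessResult {
  ok: boolean;
  apiOk: boolean;
  uiOk: boolean;
}

// Poll /api/health (and, in "full" mode, the UI root) until both pass or
// the deadline expires. The fetcher is injected so tests can stub it.
async function waitForReadiness(opts: {
  baseUrl: string;
  mode: ReadinessMode;
  timeoutMs: number;
  intervalMs: number;
  fetcher: typeof fetch;
}): Promise<ReadinessResult> {
  const deadline = Date.now() + opts.timeoutMs;
  let apiOk = false;
  let uiOk = opts.mode === "api-only"; // UI probe skipped in api-only mode
  while (Date.now() < deadline) {
    if (!apiOk) apiOk = await probe(opts.fetcher, `${opts.baseUrl}/api/health`);
    if (!uiOk) uiOk = await probe(opts.fetcher, `${opts.baseUrl}/`);
    if (apiOk && uiOk) return { ok: true, apiOk, uiOk };
    await new Promise((r) => setTimeout(r, opts.intervalMs));
  }
  return { ok: false, apiOk, uiOk };
}

async function probe(fetcher: typeof fetch, url: string): Promise<boolean> {
  try {
    return (await fetcher(url)).ok; // non-2xx counts as not ready
  } catch {
    return false; // connection refused etc. counts as not ready
  }
}
```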

Key findings:

  • Critical – SIGINT/SIGTERM do not terminate the process (run.ts:53-56): Registering process.once("SIGINT", cleanup) overrides Node.js's default termination behavior. After the handler fires the lock is released, but the server keeps running. A second paperclipai run can then acquire the lock while the original process is still live — the exact split-runtime scenario this PR is designed to prevent.
  • Logic – TOCTOU race in stale-lock replacement (run-runtime-guard.ts:49-57): The fallback writeFileSync (without "wx") that replaces a stale lock is not atomic. Two concurrent processes that both observe the same stale lock can each conclude it is dead and overwrite each other's valid lock.
  • Style – Stale-lock test uses non-deterministic PID (run-runtime-guard.test.ts:42): PID 999999 is assumed dead but is not guaranteed to be, making the test potentially flaky.
  • Style – api-only readiness mode has no test coverage (run-runtime-guard.test.ts): Easy-to-regress path with zero test coverage.

Confidence Score: 2/5

  • Not safe to merge — the SIGINT/SIGTERM handler silently keeps the server running after the lock is released, which directly reproduces the split-runtime failure the PR is trying to fix.
  • Two logic bugs undermine the core guarantees of the PR: (1) signal handlers don't exit the process, leaving a lock-less server running; (2) the stale-lock replacement is not atomic, allowing concurrent processes to both believe they hold the lock. The readiness polling logic itself is correct, and the overall architecture is good — but the signal-handling regression is severe enough to block merge.
  • cli/src/commands/run.ts (signal handler regression) and cli/src/commands/run-runtime-guard.ts (TOCTOU in stale-lock path) need attention before this is safe to merge.

Important Files Changed

| Filename | Overview |
| --- | --- |
| `cli/src/commands/run.ts` | Integrates lock acquisition and readiness gating into startup; critical bug: SIGINT/SIGTERM handlers release the lock but do not exit the process, allowing a second instance to acquire the lock while the first server remains live. |
| `cli/src/commands/run-runtime-guard.ts` | New module providing lock acquisition and readiness polling; the stale-lock replacement path has a TOCTOU race where two concurrent processes can each overwrite the other's valid lock, defeating single-instance enforcement. |
| `cli/src/__tests__/run-runtime-guard.test.ts` | New test file with good coverage of lock acquire/release and readiness probing; the stale-lock test uses PID 999999, which is not guaranteed to be dead, and api-only readiness mode is not exercised. |
Last reviewed commit: 054cf62

Comment thread cli/src/commands/run.ts
Comment on lines +53 to +56

```ts
const cleanup = () => lock.release();
process.once("SIGINT", cleanup);
process.once("SIGTERM", cleanup);
process.once("exit", cleanup);
```

SIGINT/SIGTERM handlers do not exit the process

process.once("SIGINT", cleanup) and process.once("SIGTERM", cleanup) override Node.js's default signal termination behavior. After the listener fires (released is set to true, lock is cleaned up), the process continues running — the server is still active and accepting connections. This directly breaks the single-instance guarantee this PR is meant to provide: another paperclipai run invocation could now acquire the lock and start, while the original server instance is still live.

The fix is to terminate the process after cleanup when handling a signal:

Suggested change:

```diff
-const cleanup = () => lock.release();
-process.once("SIGINT", cleanup);
-process.once("SIGTERM", cleanup);
-process.once("exit", cleanup);
+const cleanup = () => lock.release();
+const signalCleanup = (signal: NodeJS.Signals) => {
+  cleanup();
+  process.kill(process.pid, signal);
+};
+process.once("SIGINT", signalCleanup);
+process.once("SIGTERM", signalCleanup);
+process.once("exit", cleanup);
```

Re-sending the signal after removing the handler restores default OS behavior (correct exit code/termination) rather than using process.exit(0), which produces the wrong exit code for signal termination.


Comment thread cli/src/commands/run-runtime-guard.ts (Outdated)
Comment on lines +49 to +57

```ts
} catch {
  const existing = readLock(lockPath);
  if (existing && isPidAlive(existing.pid)) {
    throw new Error(
      `Another paperclipai run appears active for instance '${instanceId}' (pid=${existing.pid}, startedAt=${existing.startedAt}).`,
    );
  }
  // stale/corrupt lock: replace
  fs.writeFileSync(lockPath, JSON.stringify(state, null, 2), { encoding: "utf8" });
```

TOCTOU race condition in stale lock replacement

Between checking isPidAlive(existing.pid) (line 51) and overwriting the lock (line 57) without the exclusive "wx" flag, two concurrent processes that both observe the same stale lock can each:

  1. Read the dead PID
  2. Conclude it is stale
  3. Both write their own lock

The second write silently overwrites the first process's valid lock, leaving two processes believing they hold the lock. This is exactly the split-runtime scenario the PR is trying to prevent.

To close the race, the replacement write should also use "wx" (or a platform-specific rename-based atomic swap). A simple defensive approach is to retry with "wx" and, if that fails, re-read and re-check the PID:

```ts
// stale/corrupt lock: attempt atomic replacement
try {
  fs.writeFileSync(lockPath, JSON.stringify(state, null, 2), { encoding: "utf8", flag: "wx" });
} catch {
  // Lost the replacement race — re-read and check
  const contested = readLock(lockPath);
  if (contested && isPidAlive(contested.pid)) {
    throw new Error(
      `Another paperclipai run appears active for instance '${instanceId}' (pid=${contested.pid}, startedAt=${contested.startedAt}).`,
    );
  }
  // Still stale (or file was removed); last-writer-wins is acceptable here
  fs.writeFileSync(lockPath, JSON.stringify(state, null, 2), { encoding: "utf8" });
}
```

Comment on lines +40 to +43

```ts
fs.writeFileSync(
  lockPath,
  JSON.stringify({ pid: 999999, startedAt: new Date().toISOString(), command: "x", instanceId: "default" }),
  "utf8",
```

Stale-lock test uses non-guaranteed dead PID

PID 999999 is assumed to be dead, but on Linux the max PID is configurable (typically 4,194,304) and some environments recycle high PIDs. If PID 999999 happens to be live, isPidAlive returns true, the stale-lock code path throws, and the test fails non-deterministically.

A more robust approach is to look up a definitely-dead PID:

Suggested change:

```diff
 fs.writeFileSync(
   lockPath,
-  JSON.stringify({ pid: 999999, startedAt: new Date().toISOString(), command: "x", instanceId: "default" }),
+  JSON.stringify({ pid: 1, startedAt: new Date().toISOString(), command: "x", instanceId: "default" }),
   "utf8",
+);
```

PID 1 (init/systemd) will never be owned by the test runner, so process.kill(1, 0) will always throw EPERM (process exists but not owned), meaning isPidAlive(1) returns false. Alternatively, spawn a child process, record its PID, wait for it to exit, then use that PID — that is provably dead.
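
The spawn-then-wait alternative mentioned above can be sketched as follows; `isPidAlive` is redeclared here to keep the sketch self-contained and stands in for the helper under test.

```typescript
import { spawnSync } from "node:child_process";

// Same liveness probe shape the guard module would use.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// Run a trivial child to completion; once spawnSync returns, the child has
// exited, so its PID is dead (barring immediate PID reuse, which is
// vanishingly unlikely right after exit).
function provablyDeadPid(): number {
  const child = spawnSync(process.execPath, ["-e", "0"]);
  if (child.error || child.pid === undefined) {
    throw child.error ?? new Error("failed to spawn child");
  }
  return child.pid;
}
```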


Comment on lines +76 to +82

```ts
const result = await waitForReadiness({
  baseUrl: "http://127.0.0.1:3100",
  mode: "full",
  timeoutMs: 60,
  intervalMs: 10,
  fetcher: fetcher as typeof fetch,
});
```

Missing test for api-only readiness mode

The test suite covers "full" mode (pass and fail) and isPidAlive, but there is no test exercising "api-only" mode. In "api-only" mode, uiOk is pre-set to true and the UI probe is never executed — a straightforward path that is easy to introduce a regression in (e.g., accidentally adding uiOk = false before the mode check). Adding a test case would close this gap:

```ts
it("waitForReadiness passes in api-only mode when api is ok", async () => {
  const fetcher = async (): Promise<Response> => new Response("ok", { status: 200 });
  const result = await waitForReadiness({
    baseUrl: "http://127.0.0.1:3100",
    mode: "api-only",
    timeoutMs: 100,
    intervalMs: 10,
    fetcher: fetcher as typeof fetch,
  });
  expect(result.ok).toBe(true);
  expect(result.apiOk).toBe(true);
  expect(result.uiOk).toBe(true); // pre-set for api-only
});
```

@mmTheBest
Author

Addressed latest review feedback in 1705d4b:

  • Fixed SIGINT/SIGTERM behavior to release the lock and then re-signal for correct process-termination semantics
  • Hardened stale-lock acquisition to avoid the non-atomic overwrite race and fail safely on contested acquisition
  • Replaced the stale-lock test with a deterministic dead-PID pattern (spawn child -> wait exit)
  • Added api-only readiness mode test coverage

Validated with:
RUN v3.2.4 /private/tmp/paperclip/cli

✓ src/__tests__/run-runtime-guard.test.ts (7 tests) 114ms

Test Files  1 passed (1)
     Tests  7 passed (7)
  Start at  21:20:57
  Duration  311ms (transform 15ms, setup 0ms, collect 15ms, tests 114ms, environment 0ms, prepare 30ms)
