fix(runtime): enforce single-instance run + deterministic readiness #778
mmTheBest wants to merge 2 commits into paperclipai:master from
Conversation
Greptile Summary

This PR adds single-instance enforcement (via a JSON lock file) and deterministic readiness gating (polling).

Key findings:
Confidence Score: 2/5
Important Files Changed
Prompt To Fix All With AI

This is a comment left during a code review.
Path: cli/src/commands/run.ts
Line: 53-56
Comment:
**SIGINT/SIGTERM handlers do not exit the process**
`process.once("SIGINT", cleanup)` and `process.once("SIGTERM", cleanup)` override Node.js's default signal termination behavior. After the listener fires (`released` is set to `true`, lock is cleaned up), the process **continues running** — the server is still active and accepting connections. This directly breaks the single-instance guarantee this PR is meant to provide: another `paperclipai run` invocation could now acquire the lock and start, while the original server instance is still live.
The fix is to terminate the process after cleanup when handling a signal:
```suggestion
const cleanup = () => lock.release();
const signalCleanup = (signal: NodeJS.Signals) => {
cleanup();
process.kill(process.pid, signal);
};
process.once("SIGINT", signalCleanup);
process.once("SIGTERM", signalCleanup);
process.once("exit", cleanup);
```
Re-sending the signal after removing the handler restores default OS behavior (correct exit code/termination) rather than using `process.exit(0)`, which produces the wrong exit code for signal termination.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: cli/src/commands/run-runtime-guard.ts
Line: 49-57
Comment:
**TOCTOU race condition in stale lock replacement**
Between checking `isPidAlive(existing.pid)` (line 51) and overwriting the lock (line 57) without the exclusive `"wx"` flag, two concurrent processes that both observe the same stale lock can each:
1. Read the dead PID
2. Conclude it is stale
3. Both write their own lock
The second write silently overwrites the first process's valid lock, leaving two processes believing they hold the lock. This is exactly the split-runtime scenario the PR is trying to prevent.
To close the race, the replacement write should also use `"wx"` (or a platform-specific rename-based atomic swap). A simple defensive approach is to retry with `"wx"` and, if that fails, re-read and re-check the PID:
```ts
// stale/corrupt lock: attempt atomic replacement
try {
fs.writeFileSync(lockPath, JSON.stringify(state, null, 2), { encoding: "utf8", flag: "wx" });
} catch {
// Lost the replacement race — re-read and check
const contested = readLock(lockPath);
if (contested && isPidAlive(contested.pid)) {
throw new Error(
`Another paperclipai run appears active for instance '${instanceId}' (pid=${contested.pid}, startedAt=${contested.startedAt}).`,
);
}
// Still stale (or file was removed); last-writer-wins is acceptable here
fs.writeFileSync(lockPath, JSON.stringify(state, null, 2), { encoding: "utf8" });
}
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: cli/src/__tests__/run-runtime-guard.test.ts
Line: 40-43
Comment:
**Stale-lock test uses non-guaranteed dead PID**
PID `999999` is assumed to be dead, but on Linux the max PID is configurable (typically 4,194,304) and some environments recycle high PIDs. If PID 999999 happens to be live, `isPidAlive` returns `true`, the stale-lock code path throws, and the test fails non-deterministically.
A more robust approach is to look up a definitely-dead PID:
```suggestion
fs.writeFileSync(
lockPath,
JSON.stringify({ pid: 1, startedAt: new Date().toISOString(), command: "x", instanceId: "default" }),
"utf8",
);
```
PID 1 (`init`/`systemd`) will never be owned by the test runner, so `process.kill(1, 0)` will always throw `EPERM` (process exists but not owned), meaning `isPidAlive(1)` returns `false`. Alternatively, spawn a child process, record its PID, wait for it to exit, then use that PID — that is provably dead.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: cli/src/__tests__/run-runtime-guard.test.ts
Line: 76-82
Comment:
**Missing test for `api-only` readiness mode**
The test suite covers `"full"` mode (pass and fail) and `isPidAlive`, but there is no test exercising `"api-only"` mode. In `"api-only"` mode, `uiOk` is pre-set to `true` and the UI probe is never executed — a straightforward path that is easy to introduce a regression in (e.g., accidentally adding `uiOk = false` before the mode check). Adding a test case would close this gap:
```ts
it("waitForReadiness passes in api-only mode when api is ok", async () => {
const fetcher = async (): Promise<Response> => new Response("ok", { status: 200 });
const result = await waitForReadiness({
baseUrl: "http://127.0.0.1:3100",
mode: "api-only",
timeoutMs: 100,
intervalMs: 10,
fetcher: fetcher as typeof fetch,
});
expect(result.ok).toBe(true);
expect(result.apiOk).toBe(true);
expect(result.uiOk).toBe(true); // pre-set for api-only
});
```
How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 054cf62
Addressed latest review feedback in 1705d4b:

- Fixed SIGINT/SIGTERM behavior to release the lock and then re-signal for correct process termination semantics
- Hardened stale-lock acquisition to avoid the non-atomic overwrite race and fail safely on contested acquisition
- Replaced the stale-lock test with a deterministic dead-PID pattern (spawn child, wait for exit)
- Added api-only readiness mode test coverage

Validated with: ✓ src/tests/run-runtime-guard.test.ts (7 tests) 114ms. Test Files: 1 passed (1)
Summary
This PR hardens `paperclipai run` startup reliability by enforcing single-instance ownership and adding deterministic readiness gating.

Why

During OpenClaw + Paperclip setup/recovery runs, we hit repeated split-runtime failures: `paperclipai run` processes active at the same time.

What changed

Added runtime lock support:

- `cli/src/commands/run-runtime-guard.ts`
- Lock file at `<instanceRoot>/run.lock.json`
- Rejects a second `paperclipai run` for the same instance
- Releases the lock on `SIGINT`/`SIGTERM`/process exit

Added readiness gating in the `run` command:

- `full` mode (default): requires both API and UI probes to pass
- `api-only` mode: requires the API probe only

Improved startup logs.
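The two modes can be illustrated with a small polling loop. This is a hypothetical sketch mirroring the option names used in the tests (`baseUrl`, `mode`, `timeoutMs`, `intervalMs`, `fetcher`); the probe paths (`/api/health`, the UI root) and the control flow are assumptions, not the repo's actual implementation:

```typescript
type ReadinessOptions = {
  baseUrl: string;
  mode: "full" | "api-only";
  timeoutMs: number;
  intervalMs: number;
  fetcher: typeof fetch;
};

// Sketch: poll until the required probes pass or timeoutMs elapses.
// In "api-only" mode uiOk is pre-set to true and the UI probe is skipped.
async function waitForReadinessSketch(opts: ReadinessOptions) {
  const deadline = Date.now() + opts.timeoutMs;
  let apiOk = false;
  let uiOk = opts.mode === "api-only";
  while (Date.now() < deadline) {
    try {
      // Hypothetical probe endpoints; the real paths may differ.
      if (!apiOk) apiOk = (await opts.fetcher(`${opts.baseUrl}/api/health`)).ok;
      if (!uiOk) uiOk = (await opts.fetcher(opts.baseUrl)).ok;
    } catch {
      // Probe failed (e.g. connection refused); retry after the interval.
    }
    if (apiOk && uiOk) return { ok: true, apiOk, uiOk };
    await new Promise((r) => setTimeout(r, opts.intervalMs));
  }
  return { ok: false, apiOk, uiOk };
}
```

Injecting `fetcher` keeps the loop deterministic in tests: a stubbed fetcher that always returns 200 exercises the pass path without binding a real server.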
Tests

Added: `cli/src/__tests__/run-runtime-guard.test.ts`

Run command used:

Notes

`pnpm typecheck` currently reports existing unrelated errors in the repo (e.g. `initdbFlags` typing in other files). This PR does not touch those code paths.

Follow-ups

- `doctor --openclaw` integration checks