refactor(cli): port start-services.sh to TypeScript by prekshivyas · Pull Request #1307 · NVIDIA/NemoClaw

prekshivyas · 2026-04-02T06:37:13Z

Summary

Closes #1297. Part of #924 shell consolidation (PR 2 scope).

Ports the PID-managed service lifecycle (start/stop/status for telegram-bridge and cloudflared) from scripts/start-services.sh to nemoclaw/src/lib/services.ts:

Uses execa with detached: true + unref() for background service spawning (replaces nohup)
Native process.kill(pid, 0) for liveness checks (replaces kill -0)
File-based PID management preserved (/tmp/nemoclaw-services-{sandbox}/)
bin/nemoclaw.js start/stop/status now call the TS module directly instead of bash scripts/start-services.sh
scripts/start-services.sh kept — still used by remote deploy (ssh ... bash scripts/start-services.sh) and uninstall.sh

Pattern:

TS source in nemoclaw/src/lib/services.ts with flat ServiceOptions interface
Thin CJS shim in bin/lib/services.js
Tests in src/lib/services.test.ts importing from ../../dist/lib/services for coverage

Test plan

npx vitest run --project plugin — 201 tests pass (includes new services tests)
npx vitest run --project cli — 336 tests pass
npx vitest run test/service-env.test.js — 11 tests pass (bash script still works)
npx eslint nemoclaw/src/lib/services.ts — clean
Pre-commit hooks pass (prettier, eslint, gitleaks, tsc, vitest)

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Service management to start, stop, monitor background services and report statuses, including public connectivity URLs when available
- Installer now detects when the user’s shell needs reloading and shows a conditional reload hint
- CLI start/stop/status commands now use the integrated service manager for in-process control
Bug Fixes
- More robust handling of stale/invalid PID files and safer service shutdown behavior
Tests
- Added comprehensive tests for service behaviors and installer reload hint detection

Replace the bash shell-outs for start/stop/status in bin/nemoclaw.js with direct calls to a new nemoclaw/src/lib/services.ts module that uses execa for subprocess management and native Node APIs for PID file lifecycle. - TS implementation in nemoclaw/src/lib/services.ts with ServiceOptions interface - Thin CJS shim in bin/lib/services.js - Co-located tests importing from ../../dist/lib/services for coverage - bin/nemoclaw.js start/stop/status now call the TS module directly - scripts/start-services.sh preserved for remote deploy and uninstall Closes #1297 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-04-02T06:37:26Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Added a new TypeScript service-management module (start/stop/status) with PID/log handling and tunnel detection, tests for compiled output, a CommonJS shim re-exporting the compiled module, CLI wiring to call the module in-process, and installer logic to conditionally print a shell-reload hint.

Changes

Cohort / File(s)	Summary
Service Module (source & build shim) `src/lib/services.ts`, `bin/lib/services.js`	New TypeScript implementation of service lifecycle (startAll/stopAll/showStatus/getServiceStatuses) with PID files, detached processes, log parsing for tunnel URL; added CommonJS shim re-exporting `../../dist/lib/services`.
Service Tests `src/lib/services.test.ts`	New Vitest suite exercising compiled `dist/lib/services` for status detection, PID-dir creation, stale/invalid PID handling, showStatus output, stopAll idempotence, and log-based URL behavior using isolated temp dirs.
CLI entrypoint `bin/nemoclaw.js`	Replaced shell-out start/stop/status behavior with in-process calls to `startAll`, `stopAll`, and `showStatus` from `./lib/services`.
Installer script & tests `install.sh`, `test/install-preflight.test.js`	Added `needs_shell_reload()` and `_needs_reload` logic to print a `source ...` reload hint only when installer-modified PATH entries are not present; tests added/updated to assert reload hint when PATH was extended by installer.
Other tests `test/cli.test.js`	Updated CLI test expectations to reflect in-process services invocation (asserts presence of "NemoClaw Services" output instead of filesystem marker).

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI
    participant ServicesModule
    participant ProcessManager
    participant FileSystem
    participant LogWatcher

    User->>CLI: run `nemoclaw start`
    CLI->>ServicesModule: call startAll(opts)
    ServicesModule->>FileSystem: ensure pidDir exists
    ServicesModule->>ProcessManager: start telegram-bridge (if token present)
    ProcessManager->>FileSystem: write telegram-bridge.pid & redirect logs
    ServicesModule->>ProcessManager: start cloudflared (if available)
    ProcessManager->>FileSystem: write cloudflared.pid & redirect logs
    ServicesModule->>LogWatcher: poll cloudflared.log for trycloudflare.com URL
    LogWatcher-->>ServicesModule: return public URL (when found)
    ServicesModule->>CLI: print banner/status (includes URL)
    User-->>CLI: sees status/banner

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I hopped through PID files beneath the moon,

Spawned daemons hum and logs unfurl soon,
A shim to bridge the built and running streams,
Tests snug in tempdir, installer hints in beams,
I nibble code, then thump — services resume.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 18.42% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'refactor(cli): port start-services.sh to TypeScript' clearly and concisely describes the main change: porting a bash script to TypeScript. It is directly related to the core objective of the changeset.
Linked Issues check	✅ Passed	The PR successfully implements all coding requirements from issue `#1297`: ports scripts/start-services.sh to src/lib/services.ts using Node subprocess management, updates bin/nemoclaw.js to call the TS module directly, preserves PID file management and service lifecycle, follows the established migration pattern with TS source, compiled dist/, and co-located tests, adds comprehensive test coverage, and maintains bash script behavior (warnings for missing keys, skipping telegram bridge conditionally).
Out of Scope Changes check	✅ Passed	All changes are within scope: new TS module and tests for service management, CommonJS shim for compatibility, updates to bin/nemoclaw.js CLI integration, shell script enhancements for reload hints (related to service startup workflow), and corresponding test updates. No unrelated refactoring or feature changes detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch refactor/port-start-services-to-ts-1297

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

nemoclaw/src/lib/services.test.ts (1)

123-146: Consider adding verification that process.kill is actually called during stopAll.

The stopAll tests verify PID file removal and log output, but don't confirm that process.kill(pid, "SIGTERM") is invoked for running processes. Since the mock allows tracking calls, adding an assertion would strengthen coverage of the stop logic.

♻️ Suggested enhancement

+    it("sends SIGTERM to running processes", () => {
+      store.set("/tmp/test-svc/telegram-bridge.pid", "111");
+      // Make process alive
+      const killCalls: Array<{ pid: number; signal: string | number }> = [];
+      killMock = (pid: number, signal?: string | number) => {
+        killCalls.push({ pid, signal: signal ?? 0 });
+        if (signal === 0) return true; // alive check
+        return true; // kill succeeded
+      };
+      process.kill = killMock as typeof process.kill;
+
+      const logSpy = vi.spyOn(console, "log").mockImplementation(() => {});
+      stopAll({ pidDir: "/tmp/test-svc" });
+      
+      expect(killCalls).toContainEqual({ pid: 111, signal: "SIGTERM" });
+      logSpy.mockRestore();
+    });

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@nemoclaw/src/lib/services.test.ts` around lines 123 - 146, Add assertions
that process.kill is invoked with the PIDs from the PID files when stopAll({
pidDir: "/tmp/test-svc" }) runs: mock/spy on process.kill before calling
stopAll, assert it was called twice with 111 and 222 and signal "SIGTERM" (or at
least called with the expected pid values), and restore the mock after the test;
additionally in the "handles already-stopped services gracefully" test assert
that process.kill is not called (or that calls are absent) when no PID files
exist. Use the stopAll function name and the existing PID filenames
(telegram-bridge.pid, cloudflared.pid) to locate where to add the process.kill
spy and assertions.

nemoclaw/src/lib/services.ts (2)

128-150: stopService sends SIGTERM but immediately falls back to SIGKILL without waiting.

The current logic sends SIGTERM and immediately catches any error to send SIGKILL. However, if SIGTERM succeeds (no error), SIGKILL is never sent—which is correct. But if SIGTERM itself throws (e.g., EPERM), the fallback to SIGKILL may also fail for the same reason. More importantly, there's no grace period for the process to terminate before escalating.

For CLI tooling this is likely acceptable, but consider adding a brief delay before SIGKILL if graceful shutdown is desired.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemoclaw/src/lib/services.ts` around lines 128 - 150, stopService currently
sends SIGTERM and immediately falls back to SIGKILL on any error; change it to
send SIGTERM first, then wait briefly (e.g., poll process.kill(pid, 0) with a
short timeout/interval, up to a few seconds) to see if the process exits, and
only if it remains alive after the grace period send SIGKILL; keep error
handling for EPERM/other errors around both signals, ensure removePid(pidDir,
name) is called only after the process is confirmed dead or known not running,
and update the try/catch blocks around process.kill, readPid and removePid
accordingly (refer to stopService, readPid, removePid to locate the logic).
265-280: Consider using async file read in the polling loop.

The loop uses synchronous readFileSync within an async function. While functional, using fs/promises.readFile would be more idiomatic for async code and avoid blocking the event loop during the 15-second polling window.
♻️ Suggested refactor
+import { readFile } from "node:fs/promises";

 // In the polling loop:
-      if (existsSync(logFile)) {
-        const log = readFileSync(logFile, "utf-8");
+      try {
+        const log = await readFile(logFile, "utf-8");
         if (/https:\/\/[a-z0-9-]*\.trycloudflare\.com/.test(log)) {
           break;
         }
-      }
+      } catch {
+        // File doesn't exist yet
+      }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemoclaw/src/lib/services.ts` around lines 265 - 280, The polling loop
currently uses synchronous readFileSync which blocks the event loop; change it
to use the fs/promises API by importing readFile (and optionally stat or access)
from "fs/promises" and replace readFileSync(logFile, "utf-8") with await
readFile(logFile, "utf-8") inside the for loop (or use await stat/access in a
try/catch to detect file existence and then await readFile), keeping the same
regex check (/https:\/\/[a-z0-9-]*\.trycloudflare\.com/) and the surrounding
control flow (isRunning(pidDir, "cloudflared"), logFile variable, and the delay
via setTimeout) so the loop remains non-blocking.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/services.js`:
- Around line 4-5: The runtime fails because bin/lib/services.js uses require()
to load an ESM package; replace the CJS require shim with an async dynamic
import so the ESM-built output can be loaded at runtime: change the
module.exports = require("../../nemoclaw/dist/lib/services.js") pattern to
perform an import("../../nemoclaw/dist/lib/services.js") and then re-export the
imported namespace (or its default) via module.exports (or export default) so
callers continue to work; target the symbol in the diff (the module.exports =
require(...) line) when making the change.

---

Nitpick comments:
In `@nemoclaw/src/lib/services.test.ts`:
- Around line 123-146: Add assertions that process.kill is invoked with the PIDs
from the PID files when stopAll({ pidDir: "/tmp/test-svc" }) runs: mock/spy on
process.kill before calling stopAll, assert it was called twice with 111 and 222
and signal "SIGTERM" (or at least called with the expected pid values), and
restore the mock after the test; additionally in the "handles already-stopped
services gracefully" test assert that process.kill is not called (or that calls
are absent) when no PID files exist. Use the stopAll function name and the
existing PID filenames (telegram-bridge.pid, cloudflared.pid) to locate where to
add the process.kill spy and assertions.

In `@nemoclaw/src/lib/services.ts`:
- Around line 128-150: stopService currently sends SIGTERM and immediately falls
back to SIGKILL on any error; change it to send SIGTERM first, then wait briefly
(e.g., poll process.kill(pid, 0) with a short timeout/interval, up to a few
seconds) to see if the process exits, and only if it remains alive after the
grace period send SIGKILL; keep error handling for EPERM/other errors around
both signals, ensure removePid(pidDir, name) is called only after the process is
confirmed dead or known not running, and update the try/catch blocks around
process.kill, readPid and removePid accordingly (refer to stopService, readPid,
removePid to locate the logic).
- Around line 265-280: The polling loop currently uses synchronous readFileSync
which blocks the event loop; change it to use the fs/promises API by importing
readFile (and optionally stat or access) from "fs/promises" and replace
readFileSync(logFile, "utf-8") with await readFile(logFile, "utf-8") inside the
for loop (or use await stat/access in a try/catch to detect file existence and
then await readFile), keeping the same regex check
(/https:\/\/[a-z0-9-]*\.trycloudflare\.com/) and the surrounding control flow
(isRunning(pidDir, "cloudflared"), logFile variable, and the delay via
setTimeout) so the loop remains non-blocking.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3fcb2341-b705-4f22-a6c8-e4279bfe35b6

📥 Commits

Reviewing files that changed from the base of the PR and between cb668d7 and 013709f.

📒 Files selected for processing (4)

bin/lib/services.js
bin/nemoclaw.js
nemoclaw/src/lib/services.test.ts
nemoclaw/src/lib/services.ts

coderabbitai · 2026-04-02T06:42:29Z

bin/lib/services.js

+// Thin CJS shim — implementation lives in nemoclaw/src/lib/services.ts
+module.exports = require("../../nemoclaw/dist/lib/services.js");


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Check nemoclaw package.json for "type" field echo "=== nemoclaw/package.json type field ===" cat nemoclaw/package.json | jq '.type // "commonjs (default)"' # Check tsconfig for module output format echo -e "\n=== nemoclaw/tsconfig.json module setting ===" cat nemoclaw/tsconfig.json | jq '.compilerOptions.module // "not set"' # Check if dist/lib/services.js exists and its format echo -e "\n=== Check compiled output format (if exists) ===" if [ -f nemoclaw/dist/lib/services.js ]; then head -20 nemoclaw/dist/lib/services.js else echo "dist/lib/services.js not found (not built yet)" fi

Repository: NVIDIA/NemoClaw

Length of output: 261

CJS/ESM interoperability conflict: this shim will fail at runtime.

The shim uses require() to load compiled output, but nemoclaw/package.json declares "type": "module" (ESM package) while nemoclaw/tsconfig.json specifies "module": "Node16" (CommonJS output). When Node.js loads nemoclaw/dist/lib/services.js, it treats it as ESM due to the package type, but the file contains CommonJS syntax, causing an immediate runtime failure.

Fix one of these:

Change nemoclaw/tsconfig.json to output ESM (e.g., "module": "ES2020")

Use .cjs file extension for CommonJS output in ESM packages

Switch the shim to use dynamic import() instead of require()

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bin/lib/services.js` around lines 4 - 5, The runtime fails because bin/lib/services.js uses require() to load an ESM package; replace the CJS require shim with an async dynamic import so the ESM-built output can be loaded at runtime: change the module.exports = require("../../nemoclaw/dist/lib/services.js") pattern to perform an import("../../nemoclaw/dist/lib/services.js") and then re-export the imported namespace (or its default) via module.exports (or export default) so callers continue to work; target the symbol in the diff (the module.exports = require(...) line) when making the change.

Replace the bash shell-outs for start/stop/status in bin/nemoclaw.js with direct calls to a new nemoclaw/src/lib/services.ts module that uses child_process.spawn for detached service management and native Node APIs for PID file lifecycle. - Uses spawn() with a single fd for stdout+stderr (mirrors bash >log 2>&1) - Native process.kill(pid, 0) for liveness checks (replaces kill -0) - Removed dead node-availability check (we're already in Node) - bin/nemoclaw.js start/stop/status now call the TS module directly - scripts/start-services.sh preserved for remote deploy and uninstall Closes #1297 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemoclaw/src/lib/services.ts`:
- Around line 139-157: stopService currently sends SIGTERM then immediately logs
stopped and removes the PID file, which can orphan a process; change stopService
(using readPid, process.kill, removePid, info) to after sending SIGTERM poll the
process liveness with process.kill(pid, 0) in a short retry loop (e.g., sleep
100ms up to a configurable timeout) and only send SIGKILL if the process remains
alive after the timeout; log appropriate messages for escalation and only call
removePid once process.kill(pid, 0) confirms the process is gone (or after
SIGKILL succeeds/throws indicating death).
- Around line 167-170: resolvePidDir is falling back to "default" when
opts.sandboxName is missing, causing startAll (which receives the registry
default sandbox) to write PIDs to a different directory than stop/showStatus
(which call this with no opts). Fix by making resolvePidDir use the same
registry default sandbox when opts.sandboxName is undefined: retrieve the
registry default sandbox (the same source used by startAll) inside resolvePidDir
or require callers (stop, showStatus) to pass that registry default through
ServiceOptions; update calls in stop and showStatus to pass the registry default
sandbox as well (see startAll, stop, showStatus, resolvePidDir, and
ServiceOptions references) and adjust the fallback logic (also update the other
occurrences noted around the resolvePidDir usages at the other locations).
- Around line 188-195: Don't surface an old cloudflared URL from
cloudflared.log; only show the URL if the tunnel process is actually running.
Update the code that reads logFile (variables: pidDir, logFile; functions/uses:
existsSync, readFileSync, the /https:\/\/[a-z0-9-]*\.trycloudflare\.com/ regex
match) to first check for a valid pid (e.g., read a cloudflared.pid in pidDir),
parse it, and verify the process is alive (process.kill(pid, 0) in a try/catch
or equivalent); only if that liveness check passes should you read the log and
emit info(match[0]). Apply this same guard to both places that read
cloudflared.log (the startup banner and the status output) so stale logs from
stopAll() are ignored.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 852ab34c-87f9-4ab8-a340-a55a00d941ec

📥 Commits

Reviewing files that changed from the base of the PR and between 013709f and 0bed503.

📒 Files selected for processing (1)

nemoclaw/src/lib/services.ts

nemoclaw/src/lib/services.ts

src/lib/services.ts

cv

Good progress on the services port. A few things to address:

1. Wrong directory — same issue as #1306. nemoclaw/src/lib/services.ts should be src/lib/services.ts. See the pattern in #1262, #1265, #1271.

2. PID directory mismatch (critical) — startAll() in bin/nemoclaw.js passes the registry default sandbox name to build the PID directory, but stop() and showStatus() call the module with no options. They resolve a different PID path, so services started under sandbox "alpha" cannot be stopped because stop looks in the wrong directory. The PID directory must be resolved consistently across start/stop/status — either always from the registry default, or passed explicitly from the CLI dispatcher.

3. Premature PID file deletion — After sending SIGTERM, the PID file is deleted immediately without confirming the process exited. If the service needs a moment to shut down, it is now orphaned with no PID tracking. Add a brief poll (e.g., check process.kill(pid, 0) in a loop with a timeout) before removing the PID file, falling back to SIGKILL if it does not exit.

4. Needs rebase — merge conflicts with main.

The execa-based spawning with detached: true + unref() is the right approach for background services.

Replace the bash shell-outs for start/stop/status in bin/nemoclaw.js with direct calls to a new nemoclaw/src/lib/services.ts module that uses execa for subprocess management and native Node APIs for PID file lifecycle. - TS implementation in nemoclaw/src/lib/services.ts with ServiceOptions interface - Thin CJS shim in bin/lib/services.js - Co-located tests importing from ../../dist/lib/services for coverage - bin/nemoclaw.js start/stop/status now call the TS module directly - scripts/start-services.sh preserved for remote deploy and uninstall Closes #1297 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace the bash shell-outs for start/stop/status in bin/nemoclaw.js with direct calls to a new nemoclaw/src/lib/services.ts module that uses child_process.spawn for detached service management and native Node APIs for PID file lifecycle. - Uses spawn() with a single fd for stdout+stderr (mirrors bash >log 2>&1) - Native process.kill(pid, 0) for liveness checks (replaces kill -0) - Removed dead node-availability check (we're already in Node) - bin/nemoclaw.js start/stop/status now call the TS module directly - scripts/start-services.sh preserved for remote deploy and uninstall Closes #1297 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… tests Address review feedback: move TS source from nemoclaw/src/lib/ to src/lib/ (matching the established pattern for ported shell scripts), replace execa dependency with child_process.execFileSync/execSync, update bin/lib/services.js shim to point to dist/lib/services, and add comprehensive test suite using real temp directories. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…NVIDIA/NemoClaw into refactor/port-start-services-to-ts-1297 # Conflicts: # bin/lib/services.js # bin/nemoclaw.js

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

src/lib/services.test.ts (1)
16-142: Please add a mocked startAll() suite.

This file covers status/stop paths, but most of the migration risk now sits in startAll()—env gating, spawn args, cloudflared detection, and tunnel URL polling. A mocked node:child_process/timer test would cover that without touching real binaries.

Based on learnings "Mock external dependencies; don't call real NVIDIA APIs in unit tests".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/services.test.ts` around lines 16 - 142, Add a new describe block to
test startAll() that mocks node:child_process spawn and timer behavior: use
vi.spyOn(require("node:child_process"), "spawn") to return a fake
ChildProcess-like object (with stdout.on, stderr.on, pid, kill) and use
vi.useFakeTimers() to control polling timers; assert startAll() spawns the
expected commands and args for both "telegram-bridge" and "cloudflared",
respects env gating (pass through process.env checks), handles cloudflared
detection by simulating a successful tunnel URL written to the cloudflared.log
stream and that the URL polling resolves, and verify startAll() returns/creates
PID/log files and calls console.log messages—also include cleanup to restore
spies and timers to keep tests isolated.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/services.ts`:
- Around line 344-350: getServiceStatuses currently returns a numeric pid even
when running is false, causing inconsistent status objects; update
getServiceStatuses (using resolvePidDir, ensurePidDir, SERVICE_NAMES, isRunning,
readPid) to set pid to null whenever isRunning(pidDir, name) is false (or
alternatively remove stale PID files before assembling the array) so the
returned ServiceStatus has pid:null for dead services rather than a stale
number. Ensure the change happens inside the map that builds the status list so
callers get consistent objects.
- Around line 193-196: resolvePidDir currently interpolates opts.sandboxName and
environment variables directly into a filesystem path which allows path
traversal; update resolvePidDir (and its use of ServiceOptions.sandboxName) to
validate the chosen sandbox name against the same sandbox-name validator used in
scripts/telegram-bridge.js (or import that validator) before building the path,
rejecting or throwing an Error for any name containing separators, “..”, or
otherwise invalid characters, and only then return the pidDir using the
validated sandbox token; also ensure the env var fallbacks (NEMOCLAW_SANDBOX,
SANDBOX_NAME) are validated the same way.
- Around line 124-139: The current spawn flow writes the PID and logs success
immediately after calling spawn(), which can record stale PIDs or crash if the
child fails to launch; update the code around spawn()/subprocess to (1) replace
the hardcoded "node" invocation with process.execPath for the Node bridge, (2)
attach subprocess.once('spawn', ...) and move writePid(pidDir, name, pid) and
info(`${name} started (PID ${String(pid)})`) into that handler so they only run
after the child actually spawns, and (3) add subprocess.once('error', err => {
warn(`${name} failed to start`); /* clean up: closeFd/remove any partial state
*/ }) before calling subprocess.unref() to avoid unhandled errors and stale
state.

---

Nitpick comments:
In `@src/lib/services.test.ts`:
- Around line 16-142: Add a new describe block to test startAll() that mocks
node:child_process spawn and timer behavior: use
vi.spyOn(require("node:child_process"), "spawn") to return a fake
ChildProcess-like object (with stdout.on, stderr.on, pid, kill) and use
vi.useFakeTimers() to control polling timers; assert startAll() spawns the
expected commands and args for both "telegram-bridge" and "cloudflared",
respects env gating (pass through process.env checks), handles cloudflared
detection by simulating a successful tunnel URL written to the cloudflared.log
stream and that the URL polling resolves, and verify startAll() returns/creates
PID/log files and calls console.log messages—also include cleanup to restore
spies and timers to keep tests isolated.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8def4363-61fb-4db3-b7cc-db923fbdd581

📥 Commits

Reviewing files that changed from the base of the PR and between 0bed503 and e5cadb2.

📒 Files selected for processing (5)

bin/lib/services.js
install.sh
src/lib/services.test.ts
src/lib/services.ts
test/install-preflight.test.js

✅ Files skipped from review due to trivial changes (1)

bin/lib/services.js

coderabbitai · 2026-04-02T17:46:39Z

src/lib/services.ts

+  const subprocess = spawn(command, args, {
+    detached: true,
+    stdio: ["ignore", logFd, logFd],
+    env: { ...process.env, ...env },
+  });
+  closeSync(logFd);
+
+  const pid = subprocess.pid;
+  if (pid === undefined) {
+    warn(`${name} failed to start`);
+    return;
+  }
+
+  subprocess.unref();
+  writePid(pidDir, name, pid);
+  info(`${name} started (PID ${String(pid)})`);


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In Node.js child_process.spawn(), when is the 'spawn' event emitted, what happens if the executable cannot be launched, and what is the effect of an unhandled ChildProcess 'error' event?

💡 Result:

The 'spawn' event is emitted once the child process has spawned successfully (Added in: v15.1.0, v14.17.0). It comes before all other events and before any data is received via stdout or stderr. The 'spawn' event will fire regardless of whether an error occurs within the spawned process itself. If the executable cannot be launched (e.g., ENOENT error when command not found), the 'spawn' event is not emitted. Instead, the 'error' event is emitted. An unhandled 'error' event on a ChildProcess causes Node.js to throw an uncaught exception, which typically crashes the parent process (e.g., "throw er; // Unhandled 'error' event"). Always attach an 'error' event listener to handle spawn failures and prevent crashes.

Citations:

1: https://nodejs.org/api/child_process.html

2: https://nodejs.org/docs/latest-v18.x/api/child_process.html

3: http://nodejs.org/docs/latest/api/child_process.html

4: https://stackoverflow.com/questions/27688804/how-do-i-debug-error-spawn-enoent-on-node-js

5: https://stackoverflow.com/questions/17749055/nodejs-throw-er-unhandled-error-event-events-js72-when-using-child-proc

🏁 Script executed:

# Check the file exists and get the line count wc -l src/lib/services.ts

Repository: NVIDIA/NemoClaw

Length of output: 82

🏁 Script executed:

# Read the startService function around lines 106-140 sed -n '106,140p' src/lib/services.ts | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 1242

🏁 Script executed:

# Read the startAll function and bridge spawn call around lines 264-274 sed -n '264,274p' src/lib/services.ts | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 488

🏁 Script executed:

# Check if there are error event handlers on subprocess before unref() rg -A 10 -B 5 'spawn\(command' src/lib/services.ts

Repository: NVIDIA/NemoClaw

Length of output: 601

Move the PID write and success log behind spawn event confirmation; use process.execPath for the Node bridge.

spawn() returning a ChildProcess does not guarantee the child has actually spawned. If the executable cannot be launched, an error event is emitted and an unhandled error crashes the parent process. The code at lines 124–139 writes the PID and logs success immediately after spawn() returns, before confirming the spawn succeeded. Combined with line 271's "node" lookup via PATH, a missing or incorrect interpreter or TOCTOU around command discovery will leave stale PIDs on disk and potentially crash the CLI.

Attach a spawn event listener to gate the writePid() and success log, or add an error handler before unref() to prevent crashes. Replace "node" with process.execPath to use the current interpreter and avoid PATH dependencies.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/lib/services.ts` around lines 124 - 139, The current spawn flow writes the PID and logs success immediately after calling spawn(), which can record stale PIDs or crash if the child fails to launch; update the code around spawn()/subprocess to (1) replace the hardcoded "node" invocation with process.execPath for the Node bridge, (2) attach subprocess.once('spawn', ...) and move writePid(pidDir, name, pid) and info(`${name} started (PID ${String(pid)})`) into that handler so they only run after the child actually spawns, and (3) add subprocess.once('error', err => { warn(`${name} failed to start`); /* clean up: closeFd/remove any partial state */ }) before calling subprocess.unref() to avoid unhandled errors and stale state.

src/lib/services.ts

Address remaining review feedback: - Validate sandboxName against path traversal (reject ../, slashes, etc.) - Normalize dead PIDs to null in getServiceStatuses (consistent API) - Attach error handler on detached subprocess to prevent unhandled crash Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

src/lib/services.ts (3)

254-257: Avoid process.exit() in library code.

Calling process.exit(1) in a library function makes it untestable and prevents callers from handling the error gracefully. Consider throwing an error instead.

Suggested fix

   if (!process.env.NVIDIA_API_KEY) {
-    console.error(`${RED}[services]${NC} NVIDIA_API_KEY required`);
-    process.exit(1);
+    throw new Error("NVIDIA_API_KEY required");
   }

The CLI entry point can catch this and exit with the appropriate code.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/services.ts` around lines 254 - 257, The current check in the NVIDIA
API key guard uses console.error(...) and calls process.exit(1), which is unsafe
for a library; change the behavior in the function that performs this check (the
block referencing process.env.NVIDIA_API_KEY and process.exit) to throw an Error
(or a specific custom Error type) instead of exiting the process so callers
(e.g., CLI entry point) can catch it and decide whether to log and call
process.exit; update any surrounding function (e.g., the service initialization
function that contains this guard) to document/propagate the thrown error so
callers can handle termination.

264-275: Sandbox readiness check is less rigorous than existing utility.

The output.includes("Ready") check (line 270) can produce false negatives/positives:

Matches "NotReady" since it contains "Ready"
Doesn't verify the specific sandbox is ready
Doesn't strip ANSI codes from output

src/lib/gateway-state.ts already exports isSandboxReady(output, sandboxName) which handles these edge cases correctly.

Suggested fix

+import { isSandboxReady } from "./gateway-state";
 // ...
   try {
     const output = execFileSync("openshell", ["sandbox", "list"], {
       encoding: "utf-8",
       stdio: ["ignore", "pipe", "pipe"],
     });
-    if (!output.includes("Ready")) {
+    const sandboxName =
+      opts.sandboxName ?? process.env.NEMOCLAW_SANDBOX ?? process.env.SANDBOX_NAME ?? "default";
+    if (!isSandboxReady(output, sandboxName)) {
       warn("No sandbox in Ready state. Telegram bridge may not work until sandbox is running.");
     }
   } catch {

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/services.ts` around lines 264 - 275, The sandbox readiness check
using execFileSync(...).includes("Ready") is too fragile; replace that logic to
call the existing isSandboxReady(output, sandboxName) exported from
src/lib/gateway-state.ts (import it) and use its result to decide whether to
warn (keep using warn). Ensure you still run execFileSync("openshell",
["sandbox", "list"], ...) and pass the correct sandboxName (or
undefined/default) into isSandboxReady so ANSI codes and exact-name matching are
handled and false positives like "NotReady" are avoided; preserve the try/catch
behavior for missing openshell.

170-178: Busy-wait spin loop is CPU-intensive.

The inner spin loop (lines 175-177) consumes CPU cycles unnecessarily. For a CLI tool this is acceptable, but consider using a more efficient polling mechanism if this code path is called frequently.

Alternative using sync sleep from a native binding or simpler busy loop

If you want to keep it synchronous without a native sleep, the current approach works. Alternatively, for Node 16+, you could restructure using Atomics.wait on a shared buffer for a true sync sleep without CPU spin.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/services.ts` around lines 170 - 178, The inner busy-wait inside the
polling loop (while (Date.now() - start < 100) {/* spin */}) is CPU-intensive;
replace that spin with a synchronous sleep using Atomics.wait on a
SharedArrayBuffer to block the thread for ~100ms instead. In the polling block
that uses deadline and isAlive(pid), create an Int32Array backed by a
SharedArrayBuffer and call Atomics.wait(sabInt32, 0, 0, 100) to sleep, with a
small fallback to the current spin if Atomics.wait is unavailable (or guard with
a Node 16+ feature check), so the outer loop and isAlive(pid) checks remain
unchanged. Ensure the new code references the same loop (deadline, start,
isAlive) and does not change the polling semantics or total timeout.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/services.ts`:
- Around line 248-353: The CLI still shells out to start/stop/status via bash
scripts instead of calling the new TypeScript module; update bin/nemoclaw.js to
import (or require) the compiled module and call startAll, stopAll, and
showStatus directly instead of exec'ing bash "${SCRIPTS}/start-services.sh" in
the start(), stop(), and status() handlers—map any flags/environment
(dashboardPort, repoDir, SANDBOX_NAME, TELEGRAM_BOT_TOKEN, PID dir handling)
into the ServiceOptions you pass to startAll(opts) and stopAll(opts) and ensure
showStatus() is invoked for status; remove or replace the exec/execFileSync
calls so the CLI uses the module exports (startAll, stopAll, showStatus).

---

Nitpick comments:
In `@src/lib/services.ts`:
- Around line 254-257: The current check in the NVIDIA API key guard uses
console.error(...) and calls process.exit(1), which is unsafe for a library;
change the behavior in the function that performs this check (the block
referencing process.env.NVIDIA_API_KEY and process.exit) to throw an Error (or a
specific custom Error type) instead of exiting the process so callers (e.g., CLI
entry point) can catch it and decide whether to log and call process.exit;
update any surrounding function (e.g., the service initialization function that
contains this guard) to document/propagate the thrown error so callers can
handle termination.
- Around line 264-275: The sandbox readiness check using
execFileSync(...).includes("Ready") is too fragile; replace that logic to call
the existing isSandboxReady(output, sandboxName) exported from
src/lib/gateway-state.ts (import it) and use its result to decide whether to
warn (keep using warn). Ensure you still run execFileSync("openshell",
["sandbox", "list"], ...) and pass the correct sandboxName (or
undefined/default) into isSandboxReady so ANSI codes and exact-name matching are
handled and false positives like "NotReady" are avoided; preserve the try/catch
behavior for missing openshell.
- Around line 170-178: The inner busy-wait inside the polling loop (while
(Date.now() - start < 100) {/* spin */}) is CPU-intensive; replace that spin
with a synchronous sleep using Atomics.wait on a SharedArrayBuffer to block the
thread for ~100ms instead. In the polling block that uses deadline and
isAlive(pid), create an Int32Array backed by a SharedArrayBuffer and call
Atomics.wait(sabInt32, 0, 0, 100) to sleep, with a small fallback to the current
spin if Atomics.wait is unavailable (or guard with a Node 16+ feature check), so
the outer loop and isAlive(pid) checks remain unchanged. Ensure the new code
references the same loop (deadline, start, isAlive) and does not change the
polling semantics or total timeout.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a9acc254-30df-48ff-82c9-aa695dc12bdd

📥 Commits

Reviewing files that changed from the base of the PR and between e5cadb2 and 5a84739.

📒 Files selected for processing (2)

src/lib/services.test.ts
src/lib/services.ts

✅ Files skipped from review due to trivial changes (1)

src/lib/services.test.ts

src/lib/services.ts

- Update start/stop/status handlers in bin/nemoclaw.js to call the new TS module (startAll, stopAll, showStatus) instead of shelling out to bash start-services.sh - Match bash behavior: treat missing NVIDIA_API_KEY as a warning and skip telegram bridge, not a hard exit - Port WSL2 IPv4-first DNS workaround from bash script - Guard telegram bridge on both TELEGRAM_BOT_TOKEN and NVIDIA_API_KEY - Update CLI test to reflect in-process service execution Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

start() passed the registry default sandboxName to resolve the PID directory, but stop() and showStatus() called with no args, falling back to env vars or "default". This mismatch meant services started under one sandbox name could not be stopped. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cv

PID directory mismatch fixed in 04baa47 — stop() and showStatus() now pass sandboxName from the registry, consistent with start(). Path traversal validation, SIGTERM wait-for-exit, and directory structure all look good.

## Summary Closes NVIDIA#1297. Part of NVIDIA#924 shell consolidation (PR 2 scope). Ports the PID-managed service lifecycle (start/stop/status for telegram-bridge and cloudflared) from `scripts/start-services.sh` to `nemoclaw/src/lib/services.ts`: - Uses `execa` with `detached: true` + `unref()` for background service spawning (replaces `nohup`) - Native `process.kill(pid, 0)` for liveness checks (replaces `kill -0`) - File-based PID management preserved (`/tmp/nemoclaw-services-{sandbox}/`) - `bin/nemoclaw.js` start/stop/status now call the TS module directly instead of `bash scripts/start-services.sh` - `scripts/start-services.sh` kept — still used by remote deploy (`ssh ... bash scripts/start-services.sh`) and `uninstall.sh` Pattern: 1. TS source in `nemoclaw/src/lib/services.ts` with flat `ServiceOptions` interface 2. Thin CJS shim in `bin/lib/services.js` 3. Tests in `src/lib/services.test.ts` importing from `../../dist/lib/services` for coverage ## Test plan - [x] `npx vitest run --project plugin` — 201 tests pass (includes new services tests) - [x] `npx vitest run --project cli` — 336 tests pass - [x] `npx vitest run test/service-env.test.js` — 11 tests pass (bash script still works) - [x] `npx eslint nemoclaw/src/lib/services.ts` — clean - [x] Pre-commit hooks pass (prettier, eslint, gitleaks, tsc, vitest) 🤖 Generated with [Claude Code](https://claude.com/claude-code)  ## Summary by CodeRabbit * **New Features** * Service management to start, stop, monitor background services and report statuses, including public connectivity URLs when available * Installer now detects when the user’s shell needs reloading and shows a conditional reload hint * CLI start/stop/status commands now use the integrated service manager for in-process control * **Bug Fixes** * More robust handling of stale/invalid PID files and safer service shutdown behavior * **Tests** * Added comprehensive tests for service behaviors and installer reload hint detection  --------- Co-authored-by: Prekshi Vyas <prekshivyas@nvidia.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>

coderabbitai bot reviewed Apr 2, 2026

View reviewed changes

nemoclaw/src/lib/services.ts Outdated Show resolved Hide resolved

src/lib/services.ts Show resolved Hide resolved

cv requested changes Apr 2, 2026

View reviewed changes

cv mentioned this pull request Apr 2, 2026

refactor(cli): port debug.sh to TypeScript #1308

Merged

7 tasks

Prekshi Vyas and others added 4 commits April 2, 2026 10:22

Merge branch 'refactor/port-start-services-to-ts-1297' of github.com:…

e5cadb2

…NVIDIA/NemoClaw into refactor/port-start-services-to-ts-1297 # Conflicts: # bin/lib/services.js # bin/nemoclaw.js

coderabbitai bot reviewed Apr 2, 2026

View reviewed changes

src/lib/services.ts Show resolved Hide resolved

prekshivyas requested a review from cv April 2, 2026 18:33

cv and others added 3 commits April 2, 2026 11:56

Merge branch 'main' into refactor/port-start-services-to-ts-1297

4b625a2

Merge branch 'main' into refactor/port-start-services-to-ts-1297

23b3df8

cv approved these changes Apr 2, 2026

View reviewed changes

Merge branch 'main' into refactor/port-start-services-to-ts-1297

20e804d

cv enabled auto-merge (squash) April 2, 2026 20:38

cv merged commit 7040236 into main Apr 2, 2026
8 checks passed

latenighthackathon mentioned this pull request Apr 2, 2026

fix(cli): resolve sandbox name for stop and status commands #1135

Closed

3 tasks

		// Thin CJS shim — implementation lives in nemoclaw/src/lib/services.ts
		module.exports = require("../../nemoclaw/dist/lib/services.js");

Conversation

prekshivyas commented Apr 2, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

cv left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cv left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

prekshivyas commented Apr 2, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 2, 2026 •

edited

Loading