
refactor(installer): unify host preflight and thin deploy compatibility#1470

Open
kjw3 wants to merge 4 commits into main from refactor/installer-unification

Conversation


@kjw3 kjw3 commented Apr 4, 2026

Summary

  • unify installer and onboarding host detection around shared TypeScript preflight logic
  • move deploy behavior into TypeScript, thin the Brev compatibility wrapper, and harden Brev readiness handling
  • demote or remove legacy platform-specific setup paths (setup-spark, brev-setup.sh) in favor of the canonical installer + onboard flow
  • update docs, CLI help, and Brev E2E coverage to match the new behavior

What Changed

  • added shared host assessment and remediation planning in src/lib/preflight.ts
  • wired installer and onboard flows to the same host preflight decisions
  • changed Podman handling from hard block to unsupported-runtime warning
  • migrated deploy logic into src/lib/deploy.ts
  • updated nemoclaw deploy to use the authenticated Brev CLI, current Brev create flags, explicit GCP provider default, stricter readiness checks, and standard installer/onboard flow
  • removed scripts/setup-spark.sh and reduced scripts/brev-setup.sh to a deprecated compatibility wrapper
  • updated README/docs/help text and hardened the Brev E2E cleanup path
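
The shared preflight in src/lib/preflight.ts is only described at a high level above; here is a minimal TypeScript sketch of how a host assessment could feed remediation planning. All type names, fields, and the function body are illustrative assumptions, not the module's real API:

```typescript
// Illustrative shapes only; the real src/lib/preflight.ts API may differ.
interface HostAssessment {
  runtime: "docker" | "podman" | "unknown";
  dockerRunning: boolean;
  notes: string[];
}

interface RemediationAction {
  title: string;
  reason: string;
  blocking: boolean; // blocking actions stop onboarding; warnings do not
  commands: string[];
}

// Turn raw host facts into remediation actions. Podman produces a
// non-blocking warning, matching the "unsupported-runtime warning"
// behavior described in the bullets above.
function planHostRemediation(host: HostAssessment): RemediationAction[] {
  const actions: RemediationAction[] = [];
  if (host.runtime === "podman") {
    actions.push({
      title: "Unsupported runtime",
      reason: "Podman is not officially supported; use a Docker-based runtime.",
      blocking: false,
      commands: [],
    });
  }
  if (host.runtime === "docker" && !host.dockerRunning) {
    actions.push({
      title: "Start Docker",
      reason: "The Docker daemon is not reachable.",
      blocking: true,
      commands: ["sudo systemctl start docker"],
    });
  }
  return actions;
}

const plan = planHostRemediation({
  runtime: "docker",
  dockerRunning: false,
  notes: [],
});
console.log(plan.length, plan[0].blocking); // prints: 1 true
```

Because installer and onboard consume the same plan, a blocking action fails both flows identically instead of each path re-implementing its own checks.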

Validation

  • npm run build:cli
  • targeted Vitest coverage for src/lib/preflight.test.ts, src/lib/deploy.test.ts, test/install-preflight.test.js, test/cli.test.js, test/runner.test.js
  • live Brev validation with TEST_SUITE=deploy-cli on cpu-e2.4vcpu-16gb
  • confirmed successful end-to-end remote deploy after waiting for Brev status=RUNNING, build_status=COMPLETED, shell_status=READY
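
The readiness gate used during live validation reduces to a three-field predicate. The BrevInstance shape below is an assumption inferred from the statuses listed above, not the real `brev ls --json` schema:

```typescript
// Assumed shape of one entry from `brev ls --json`; field names are
// inferred from the statuses above, not taken from Brev's actual schema.
interface BrevInstance {
  name: string;
  status: string;       // e.g. "RUNNING"
  build_status: string; // e.g. "COMPLETED"
  shell_status: string; // e.g. "READY"
}

// Deploy proceeds only when all three statuses line up.
function isInstanceReady(inst: BrevInstance): boolean {
  return (
    inst.status === "RUNNING" &&
    inst.build_status === "COMPLETED" &&
    inst.shell_status === "READY"
  );
}

const booting: BrevInstance = {
  name: "cpu-e2.4vcpu-16gb",
  status: "RUNNING",
  build_status: "IN_PROGRESS",
  shell_status: "READY",
};

console.log(isInstanceReady(booting)); // prints: false (build not finished)
console.log(isInstanceReady({ ...booting, build_status: "COMPLETED" })); // prints: true
```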

Related Issues

Credit / Prior Work

This branch builds on ideas and prior work from:

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced host diagnostics with actionable remediation guidance during onboarding preflight checks.
  • Improvements

    • Updated runtime compatibility: macOS (Intel) now recommends Docker Desktop over Podman.
    • Simplified DGX Spark setup to use standard installer followed by nemoclaw onboard.
    • Improved preflight error messages with clear remediation steps for Docker issues.
  • Deprecations

    • nemoclaw deploy now delegates to standard installer; prefer separate host provisioning + nemoclaw onboard.
    • nemoclaw setup-spark marked as deprecated compatibility alias for nemoclaw onboard.


coderabbitai bot commented Apr 4, 2026

📝 Walkthrough

Walkthrough

The changes shift NemoClaw's deployment strategy from a Brev-specific wrapper to using the standard installer on remote hosts with onboarding. A new preflight host assessment module validates Docker/runtime readiness. Deploy functionality moves to a dedicated TypeScript module with comprehensive Brev VM orchestration. Commands deploy and setup-spark are deprecated in favor of the standard installer flow.

Changes

Cohort / File(s) Summary
Documentation
README.md, docs/get-started/quickstart.md, docs/reference/commands.md, docs/reference/troubleshooting.md, docs/deployment/deploy-to-remote-gpu.md, spark-install.md
Updated container runtime support matrix (Docker Desktop on macOS instead of Podman), repositioned deploy as deprecated in favor of standard installer + onboard workflow, reframed troubleshooting guidance for Docker reachability and Podman warnings.
Host Assessment & Remediation
src/lib/preflight.ts, src/lib/preflight.test.ts
New module providing assessHost() and planHostRemediation() for detecting container runtime, Docker status, cgroup configuration, WSL/headless environments, and generating remediation actions with platform-specific commands.
Brev Deploy Implementation
src/lib/deploy.ts, src/lib/deploy.test.ts
New module implementing executeDeploy() orchestrating Brev VM provisioning, instance creation/polling, rsync file sync, remote installer execution with credentials, and optional service startup and SSH connection.
CLI Entry Point & Onboarding
bin/nemoclaw.js, bin/lib/onboard.js
Refactored CLI to delegate deploy to new executeDeploy() module, deprecated setup-spark and deploy commands with warnings, replaced local preflight checks in onboard with assessHost()-driven validation and remediation reporting.
Installation & Bootstrap Scripts
scripts/install.sh, scripts/brev-setup.sh, scripts/start-services.sh
install.sh adds preflight checks via Node, gating onboarding on Docker readiness; brev-setup.sh converted to thin wrapper delegating to install.sh; removed direct cloudflared setup guidance from start-services.
Removed Script
scripts/setup-spark.sh
Entire script deleted; Docker cgroup v2 configuration logic no longer needed with current OpenShell releases.
Test Coverage
test/cli.test.js, test/runner.test.js, test/install-preflight.test.js, test/e2e/brev-e2e.test.js, test/e2e/test-spark-install.sh
Added new preflight/deploy test cases, updated CLI help assertions, added Brev e2e deploy flow testing with instance lifecycle helpers, removed Spark-specific e2e setup steps, extended regression guards for credential exposure and deploy behavior.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as nemoclaw.js
    participant Preflight as assessHost
    participant Deploy as executeDeploy
    participant Brev as brev CLI
    participant VM as Remote VM
    participant Installer as install.sh

    User->>CLI: nemoclaw deploy <instance>
    CLI->>Preflight: assessHost() on local machine
    Preflight-->>CLI: host diagnostics & remediation actions
    CLI->>Deploy: executeDeploy(instanceName, ...)
    Deploy->>Brev: brev ls --json
    alt Instance Missing
        Deploy->>Brev: brev create <instance>
    end
    Deploy->>Brev: brev refresh
    Deploy->>Brev: poll brev ls --json for readiness
    Brev-->>Deploy: instance RUNNING, build COMPLETED, shell READY
    Deploy->>VM: wait for SSH availability
    Deploy->>VM: rsync local rootDir to remote
    Deploy->>VM: scp .env file with credentials
    Deploy->>VM: run scripts/install.sh --non-interactive
    Installer->>VM: execute remote preflight checks
    Installer->>VM: run onboard in non-interactive mode
    VM-->>Installer: sandbox ready
    alt NEMOCLAW_DEPLOY_NO_CONNECT not set
        Deploy->>VM: openshell sandbox connect
        VM-->>User: interactive shell session
    else
        Deploy-->>User: print SSH connection instructions
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A rabbit's ode to your deploy refactor!

Brev's old dance was tangled and twisted,
Now preflight checks host dreams insisted,
From wrapper to module, the logic flows clean,
Docker and Podman get properly seen,
The installer takes center stage with such grace! 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 13.64%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately describes the main refactoring work: unifying host preflight logic and moving deploy behavior to TypeScript while maintaining backward compatibility.




…ification

# Conflicts:
#	scripts/setup-spark.sh
#	spark-install.md

github-actions bot commented Apr 4, 2026

🚀 Docs preview ready!

https://NVIDIA.github.io/NemoClaw/pr-preview/pr-1470/


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
test/install-preflight.test.js (1)

738-752: ⚠️ Potential issue | 🟡 Minor

Assert the installer still exits successfully in this skip-onboarding path.

Right now the test only checks the message and the missing log file. A non-zero exit that prints the same text would still pass, which weakens the regression coverage for the intended success case.

Suggested fix
     const result = spawnSync("bash", [INSTALLER, "--non-interactive"], {
       cwd: path.join(import.meta.dirname, ".."),
       encoding: "utf-8",
       env: {
         ...process.env,
@@
       },
     });
 
+    expect(result.status).toBe(0);
     expect(`${result.stdout}${result.stderr}`).toContain("Skipping onboarding");
     expect(fs.existsSync(onboardLog)).toBe(false);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/install-preflight.test.js` around lines 738 - 752, The test currently
only asserts the installer output and absence of the onboard log but doesn't
verify the process exit code; update the assertion after the spawnSync call (the
variable result from spawnSync using INSTALLER) to assert the installer exited
successfully (e.g., expect(result.status).toBe(0) or equivalent success
condition) so non-zero exits that print the same message will fail the test.
docs/deployment/deploy-to-remote-gpu.md (1)

30-35: ⚠️ Potential issue | 🟠 Major

The quick start skips the setup flow the page just recommended.

Lines 26-27 say the preferred path is to install NemoClaw on the remote host and run nemoclaw onboard, but this example jumps straight to nemoclaw my-assistant connect and openclaw tui. Those commands only work after onboarding has already created a sandbox, so a fresh Brev VM cannot follow this section as written.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/deployment/deploy-to-remote-gpu.md` around lines 30 - 35, The example
jumps straight to running "nemoclaw my-assistant connect" and "openclaw tui" but
those commands require an existing sandbox created by onboarding; update the
paragraph to either (a) explicitly state this is for already-onboarded Brev
instances, or (b) prepend the onboarding step by instructing users to run
"nemoclaw onboard" (or the recommended remote-host install + "nemoclaw onboard")
before "nemoclaw my-assistant connect" and "openclaw tui", and add a short note
about the prerequisite sandbox so fresh VMs can follow the flow.
🧹 Nitpick comments (4)
docs/reference/troubleshooting.md (1)

135-135: Keep one sentence per source line in this paragraph.

Line 135 currently packs two sentences onto one source line. Split them so the docs stay diff-friendly and consistent with the rest of docs/.

Suggested fix
-If you are using Podman, NemoClaw warns and continues, but OpenShell officially documents Docker-based runtimes only. If onboarding or sandbox lifecycle fails, switch to Docker Desktop, Colima, or Docker Engine and rerun onboarding.
+If you are using Podman, NemoClaw warns and continues, but OpenShell officially documents Docker-based runtimes only.
+If onboarding or sandbox lifecycle fails, switch to Docker Desktop, Colima, or Docker Engine and rerun onboarding.

As per coding guidelines, "One sentence per line in source (makes diffs readable). Flag paragraphs where multiple sentences appear on the same line."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/reference/troubleshooting.md` at line 135, The paragraph in
docs/reference/troubleshooting.md currently has two sentences on a single source
line; edit that source line so each sentence is on its own line (i.e., split "If
you are using Podman, NemoClaw warns and continues, but OpenShell officially
documents Docker-based runtimes only. If onboarding or sandbox lifecycle fails,
switch to Docker Desktop, Colima, or Docker Engine and rerun onboarding." into
two separate source lines) to follow the one-sentence-per-line guideline and
keep diffs clean.
test/install-preflight.test.js (1)

570-598: Drop the overwritten docker stub in this test setup.

The first writeExecutable(..., "docker") is immediately replaced by the second one, so half of this setup never runs. Keeping only the failing stub makes the scenario unambiguous.

Suggested fix
-    writeExecutable(
-      path.join(fakeBin, "docker"),
-      `#!/usr/bin/env bash
-if [ "$1" = "info" ]; then
-  echo '{"ServerVersion":"29.3.1","OperatingSystem":"Ubuntu 24.04","CgroupVersion":"2"}'
-  exit 0
-fi
-exit 0
-`,
-    );
     writeExecutable(
       path.join(fakeBin, "openshell"),
       `#!/usr/bin/env bash
 if [ "$1" = "--version" ]; then
   echo "openshell 0.0.22"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/install-preflight.test.js` around lines 570 - 598, The test sets two
different docker stubs via writeExecutable(...) with target "docker", and the
first one is immediately overwritten by the second; remove the initial
successful docker stub block so only the failing docker stub remains (keep the
openshell stub intact), i.e. ensure there is a single writeExecutable call for
"docker" (the one that exits 1 for "info") to make the scenario unambiguous.
docs/reference/commands.md (1)

212-213: Use active voice in this deprecation warning.

Line 213 says the installer/onboard flow “should be used instead,” which is passive. Address the reader directly here.

As per coding guidelines, "Active voice required. Flag passive constructions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/reference/commands.md` around lines 212 - 213, Replace the passive
phrasing "the standard installer and `nemoclaw onboard` flow should be used
instead" with an active imperative addressing the reader; reword the sentence
about `nemoclaw setup-spark` deprecation so it reads something like: "Use the
standard installer and run `nemoclaw onboard` instead, because current OpenShell
releases handle the older DGX Spark cgroup behavior." Update the sentence that
mentions `nemoclaw setup-spark` and `nemoclaw onboard` to this active
construction.
test/e2e/brev-e2e.test.js (1)

68-112: Good retry/cleanup logic with a minor suggestion.

The deleteBrevInstance helper implements robust retry logic with a brev refresh fallback. One observation: the function returns false when deletion fails after all attempts, and the calling code at lines 685-687 throws in that case, so the two are consistent.

However, consider that if hasBrevInstance returns true but brev("delete", ...) succeeds without throwing, you immediately call brev("refresh") and then check hasBrevInstance again. If the Brev API has eventual consistency, the instance might still appear in the list briefly after deletion succeeds. The current retry loop handles this gracefully, but adding a small delay after successful delete (before the existence check) might reduce unnecessary retries.

💡 Optional: Add small delay after successful delete
     try {
       brev("delete", instanceName);
+      sleep(2); // Allow Brev API to propagate deletion
     } catch {
       // Best-effort delete. We'll verify via ls below and retry if needed.
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/brev-e2e.test.js` around lines 68 - 112, The deleteBrevInstance
retry loop can spur unnecessary retries when brev("delete", instanceName)
succeeds but the API is eventually consistent; after calling brev("delete",
instanceName) inside deleteBrevInstance, insert a short sleep (use the existing
sleep(seconds) helper, e.g., 1–3 seconds) before calling brev("refresh") and
re-checking hasBrevInstance(instanceName) so the list has time to converge; keep
the existing try/catch around brev("delete") and brev("refresh") and preserve
the attempts/intervalSeconds behavior.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5a93e840-8490-4a99-8aa7-214864ddfd40

📥 Commits

Reviewing files that changed from the base of the PR and between 3328359 and dccdfb8.

📒 Files selected for processing (22)
  • README.md
  • bin/lib/onboard.js
  • bin/nemoclaw.js
  • docs/deployment/deploy-to-remote-gpu.md
  • docs/get-started/quickstart.md
  • docs/reference/commands.md
  • docs/reference/troubleshooting.md
  • scripts/brev-setup.sh
  • scripts/install.sh
  • scripts/setup-spark.sh
  • scripts/start-services.sh
  • spark-install.md
  • src/lib/deploy.test.ts
  • src/lib/deploy.ts
  • src/lib/preflight.test.ts
  • src/lib/preflight.ts
  • test/cli.test.js
  • test/e2e/brev-e2e.test.js
  • test/e2e/test-spark-install.sh
  • test/install-preflight.test.js
  • test/runner.test.js
  • test/usage-notice.test.js
💤 Files with no reviewable changes (1)
  • scripts/setup-spark.sh

Comment on lines 28 to +35
[ -n "${NVIDIA_API_KEY:-}" ] || fail "NVIDIA_API_KEY not set"
[ -f "$INSTALLER_SCRIPT" ] || fail "Installer not found at $INSTALLER_SCRIPT"

# Suppress needrestart noise from apt (Scanning processes, No services need...)
export NEEDRESTART_MODE=a
export DEBIAN_FRONTEND=noninteractive
export NEMOCLAW_NON_INTERACTIVE=1
export NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1
export NEMOCLAW_PROVIDER="${NEMOCLAW_PROVIDER:-build}"

⚠️ Potential issue | 🟠 Major

Don't hard-require NVIDIA_API_KEY for every provider.

Line 28 aborts before Line 35 can steer onboarding to openai, anthropic, gemini, or the compatible endpoint providers. That makes the compatibility deploy path fail for any non-build provider even when its own credential is present. Let the installer/onboard flow validate the selected provider, or gate this check on the effective provider instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/brev-setup.sh` around lines 28 - 35, The script currently
unconditionally fails if NVIDIA_API_KEY is missing; change the check so the key
is only required when the effective provider (NEMOCLAW_PROVIDER) is the
GPU-focused "build" provider (or absent/default to "build"). Concretely, replace
the unconditional check of NVIDIA_API_KEY with a conditional that reads
NEMOCLAW_PROVIDER (or uses the default "${NEMOCLAW_PROVIDER:-build}") and only
invokes fail "NVIDIA_API_KEY not set" when that provider equals "build"; leave
the installer existence check ([ -f "$INSTALLER_SCRIPT" ] || fail ...) and the
exports (NEEDRESTART_MODE, DEBIAN_FRONTEND, NEMOCLAW_* ) unchanged so onboarding
for openai/anthropic/gemini or endpoint providers can proceed without an
NVIDIA_API_KEY.

Comment on lines +924 to +956
if (host.runtime && host.runtime !== "unknown") {
lines.push(`Detected container runtime: ${host.runtime}`);
}
if (host.notes && host.notes.includes("Running under WSL")) {
lines.push("Running under WSL");
}
for (const action of actions) {
lines.push(`- ${action.title}: ${action.reason}`);
for (const command of action.commands || []) {
lines.push(` ${command}`);
}
}
if (lines.length > 0) {
process.stdout.write(lines.join("\n"));
}
process.exit(blockingActions.length > 0 ? 10 : 0);
} catch {
process.exit(0);
}
' "$preflight_module"
)"; then
status=0
else
status=$?
fi

if [[ -n "$output" ]]; then
echo ""
if [[ "$status" -eq 10 ]]; then
warn "Host preflight found issues that will prevent onboarding right now."
else
warn "Host preflight found warnings."
fi

⚠️ Potential issue | 🟡 Minor

Separate informational host notes from actual warnings.

Lines 924-929 always add Detected container runtime: docker on a healthy Docker host, so Lines 950-956 print Host preflight found warnings. even when planHostRemediation(host) returned no actions. Keep runtime/WSL notes out of the warning path, or only emit the warning header when there is remediation to show.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/install.sh` around lines 924 - 956, The preflight is mixing
informational notes (host.runtime, host.notes) with actionable warnings so the
script prints "Host preflight found warnings." even when
planHostRemediation(host) returned no actions; split the output into two buffers
(e.g., info_lines and action_lines) so Detected container runtime and WSL
messages go into info_lines and only remediation items from
actions/blockingActions go into action_lines, only set status/exit code based on
blockingActions and only print the warning header ("Host preflight found
warnings." / remediation list) when action_lines is non-empty; keep printing
info_lines unconditionally so runtime/WSL remain visible without triggering the
warning path.

Comment on lines +240 to +248
let exists = false;
try {
const out = execFileSync("brev", ["ls"], { encoding: "utf-8" });
exists = out.includes(name);
} catch (caught) {
const err = caught as { stdout?: string; stderr?: string };
if (err.stdout && err.stdout.includes(name)) exists = true;
if (err.stderr && err.stderr.includes(name)) exists = true;
}

⚠️ Potential issue | 🟠 Major

Use exact instance matching instead of substring checks.

out.includes(name) will treat foo as existing when only something like foo-prod is present, or when the name appears in an error message. In that case we skip brev create and then wait for a VM that does not actually exist.

Suggested fix
   let exists = false;
   try {
-    const out = execFileSync("brev", ["ls"], { encoding: "utf-8" });
-    exists = out.includes(name);
+    const out = execFileSync("brev", ["ls", "--json"], { encoding: "utf-8" });
+    exists = findBrevInstanceStatus(out, name) !== null;
   } catch (caught) {
-    const err = caught as { stdout?: string; stderr?: string };
-    if (err.stdout && err.stdout.includes(name)) exists = true;
-    if (err.stderr && err.stderr.includes(name)) exists = true;
+    const err = caught as { stdout?: string };
+    if (err.stdout) exists = findBrevInstanceStatus(err.stdout, name) !== null;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/deploy.ts` around lines 240 - 248, The current existence check uses
substring matching (out.includes(name)) which can falsely match names like "foo"
in "foo-prod" or in error text; update the check after calling
execFileSync("brev", ["ls"]) to split the output into lines (e.g.,
out.split(/\r?\n/)), trim each line and test for an exact equality to name (use
lines.some(line => line.trim() === name)) to set exists; apply the same
exact-line check to err.stdout and err.stderr instead of includes so only an
exact instance name marks existence (referencing the exists variable and the
out/err handling in this try/catch block).

Comment on lines +294 to +297
execFileSync(
"ssh",
["-o", "ConnectTimeout=5", "-o", "StrictHostKeyChecking=no", name, "echo", "ok"],
{ encoding: "utf-8", stdio: "ignore" },

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n --fixed-strings 'StrictHostKeyChecking=no' src/lib/deploy.ts

Repository: NVIDIA/NemoClaw

Length of output: 1380


Drop StrictHostKeyChecking=no from all SSH/scp operations to prevent MITM attacks on credential transmission.

All nine SSH and scp commands in this file skip host key verification with StrictHostKeyChecking=no, including the scp at line 338 that uploads the .env file containing API credentials. This silently accepts host-key mismatches and can disclose secrets to the wrong host. Use a pinned known_hosts file, or at least a trust-on-first-use mode that prevents key changes.

Affected instances: lines 296, 313, 320, 323, 338, 341, 358, 367, 387

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/deploy.ts` around lines 294 - 297, Replace the insecure SSH/SCP
option StrictHostKeyChecking=no used in the execFileSync calls (the SSH checks
invoking "ssh" and the scp upload of the .env) with a safer host-key policy:
either Point SSH to a pinned known_hosts file via -o
UserKnownHostsFile=/path/to/known_hosts and remove StrictHostKeyChecking=no, or
use TOFU by setting -o StrictHostKeyChecking=accept-new; apply this change to
every execFileSync invocation that runs "ssh" or "scp" (the calls that pass name
and the one that uploads the .env) and ensure the commands also reference the
chosen known_hosts file when appropriate.
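The two host-key policies the prompt proposes could be built like this (the helper name `sshArgs` is hypothetical; `accept-new` requires OpenSSH 7.6+):

```typescript
// Hypothetical helper sketching the safer option sets the review suggests.
// With a pinned known_hosts file, key mismatches are rejected outright;
// without one, accept-new gives trust-on-first-use and still rejects
// key *changes* on later connections.
function sshArgs(host: string, knownHostsPath?: string): string[] {
  const hostKeyOpts = knownHostsPath
    ? ["-o", `UserKnownHostsFile=${knownHostsPath}`, "-o", "StrictHostKeyChecking=yes"]
    : ["-o", "StrictHostKeyChecking=accept-new"];
  return ["-o", "ConnectTimeout=5", ...hostKeyOpts, host];
}
```

Each `execFileSync("ssh", ...)` and `execFileSync("scp", ...)` call would then splice in the same option set instead of hard-coding `StrictHostKeyChecking=no`.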

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/install-preflight.test.js (1)

709-755: ⚠️ Potential issue | 🟠 Major

This no longer verifies the non-interactive terms gate.

Because this case doesn't force preflight to pass, the installer can skip onboarding before show_usage_notice() runs and still satisfy /Skipping onboarding/. That means the test now passes for unrelated Docker/OpenShell state instead of proving the acceptance requirement.

Suggested fix
     writeNodeStub(fakeBin);
+    writeExecutable(
+      path.join(fakeBin, "docker"),
+      `#!/usr/bin/env bash
+if [ "$1" = "info" ]; then
+  echo '{"ServerVersion":"29.3.1","OperatingSystem":"Ubuntu 24.04","CgroupVersion":"2"}'
+  exit 0
+fi
+exit 0
+`,
+    );
+    writeExecutable(
+      path.join(fakeBin, "openshell"),
+      `#!/usr/bin/env bash
+if [ "$1" = "--version" ]; then
+  echo "openshell 0.0.22"
+  exit 0
+fi
+exit 0
+`,
+    );
     writeNpmStub(
       fakeBin,
       `if [ "$1" = "pack" ]; then
@@
     const result = spawnSync("bash", [INSTALLER, "--non-interactive"], {
@@
     });
 
-    expect(`${result.stdout}${result.stderr}`).toMatch(
-      /Skipping onboarding|--yes-i-accept-third-party-software|NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1/,
-    );
+    expect(result.status).not.toBe(0);
+    expect(`${result.stdout}${result.stderr}`).toMatch(
+      /--yes-i-accept-third-party-software|NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1/,
+    );
     expect(fs.existsSync(onboardLog)).toBe(false);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/install-preflight.test.js` around lines 709 - 755, The test no longer
guarantees the installer reached the terms gate because preflight can fail
earlier; update the test so preflight is forced to succeed and the installer
reaches show_usage_notice() before asserting onboarding was skipped. Concretely,
in the test around INSTALLER invocation ensure you arrange/stub the preflight
checks (or set the env variable(s) the preflight uses) so they return success
(so show_usage_notice() is executed), keep NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE
empty to simulate non-acceptance, then assert the output contains the
terms-acceptance hints and that onboardLog does not exist; reference the
installer invocation via INSTALLER, the show_usage_notice() path, the
NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE env var, and the onboardLog file to locate
where to change the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/preflight.ts`:
- Around line 203-207: detectNvidiaGpu currently returns true just because the
nvidia-smi binary exists; change it to first verify commandExists("nvidia-smi",
runCaptureImpl) and then actually probe the device by running nvidia-smi (e.g.
runCaptureImpl("nvidia-smi -L", { ignoreError: true }) or
runCaptureImpl("nvidia-smi --query-gpu=name --format=csv,noheader", {
ignoreError: true })) and return true only if the probe returns
non-empty/expected output (trimmed) indicating at least one GPU; otherwise
return false. Use the existing runCaptureImpl and commandExists helpers and keep
error handling via the ignoreError option so an absent/failed probe yields false
rather than throwing.

In `@test/e2e/brev-e2e.test.js`:
- Around line 279-285: The pre-cleanup incorrectly assumes
deleteBrevInstance(INSTANCE_NAME) succeeded; change the pre-cleanup to check the
boolean result and throw or retry/log an error when it returns false so we don't
treat a failed delete as success. Also ensure instanceCreated is set as soon as
the VM is actually created (either by returning a created flag from
runLocalDeploy or by setting instanceCreated = true immediately after detecting
creation) so that afterAll() will always attempt deletion even if later
bootstrap steps fail; update runLocalDeploy usage and afterAll cleanup logic to
rely on that explicit created flag.

---

Outside diff comments:
In `@test/install-preflight.test.js`:
- Around line 709-755: The test no longer guarantees the installer reached the
terms gate because preflight can fail earlier; update the test so preflight is
forced to succeed and the installer reaches show_usage_notice() before asserting
onboarding was skipped. Concretely, in the test around INSTALLER invocation
ensure you arrange/stub the preflight checks (or set the env variable(s) the
preflight uses) so they return success (so show_usage_notice() is executed),
keep NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE empty to simulate non-acceptance, then
assert the output contains the terms-acceptance hints and that onboardLog does
not exist; reference the installer invocation via INSTALLER, the
show_usage_notice() path, the NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE env var, and
the onboardLog file to locate where to change the test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 80955231-b32a-4dc6-8604-041a7326547d

📥 Commits

Reviewing files that changed from the base of the PR and between dccdfb8 and 30a5081.

📒 Files selected for processing (9)
  • README.md
  • bin/lib/onboard.js
  • bin/nemoclaw.js
  • docs/reference/commands.md
  • scripts/install.sh
  • spark-install.md
  • src/lib/preflight.ts
  • test/e2e/brev-e2e.test.js
  • test/install-preflight.test.js
✅ Files skipped from review due to trivial changes (1)
  • docs/reference/commands.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • README.md
  • spark-install.md
  • bin/lib/onboard.js

Comment on lines +203 to +207
function detectNvidiaGpu(
runCaptureImpl: (command: string, options?: { ignoreError?: boolean }) => string,
): boolean {
return commandExists("nvidia-smi", runCaptureImpl);
}

⚠️ Potential issue | 🟡 Minor

Probe nvidia-smi, don't just check that it's on PATH.

hasNvidiaGpu flips to true whenever the binary exists, even if the host has no usable NVIDIA device/driver. On CPU images that still ship nvidia-smi, that suppresses the headless-host hint and mislabels the assessment.

Suggested fix
 function detectNvidiaGpu(
   runCaptureImpl: (command: string, options?: { ignoreError?: boolean }) => string,
 ): boolean {
-  return commandExists("nvidia-smi", runCaptureImpl);
+  if (!commandExists("nvidia-smi", runCaptureImpl)) {
+    return false;
+  }
+  try {
+    const output = runCaptureImpl("nvidia-smi -L 2>/dev/null", { ignoreError: true });
+    return Boolean(String(output || "").trim());
+  } catch {
+    return false;
+  }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/preflight.ts` around lines 203 - 207, detectNvidiaGpu currently
returns true just because the nvidia-smi binary exists; change it to first
verify commandExists("nvidia-smi", runCaptureImpl) and then actually probe the
device by running nvidia-smi (e.g. runCaptureImpl("nvidia-smi -L", {
ignoreError: true }) or runCaptureImpl("nvidia-smi --query-gpu=name
--format=csv,noheader", { ignoreError: true })) and return true only if the
probe returns non-empty/expected output (trimmed) indicating at least one GPU;
otherwise return false. Use the existing runCaptureImpl and commandExists
helpers and keep error handling via the ignoreError option so an absent/failed
probe yields false rather than throwing.

Comment on lines 279 to 285
// Pre-cleanup: delete any leftover instance with the same name.
// This can happen when a previous run's create succeeded on the backend
// but the CLI got a network error (unexpected EOF) before confirming,
// then the retry/fallback fails with "duplicate workspace".
try {
brev("delete", INSTANCE_NAME);
deleteBrevInstance(INSTANCE_NAME);
console.log(`[${elapsed()}] Deleted leftover instance "${INSTANCE_NAME}"`);

⚠️ Potential issue | 🟠 Major

Don't lose cleanup state on partial instance lifecycle failures.

deleteBrevInstance() returns false when the instance still exists, but the pre-cleanup branch treats any return as success. Also, instanceCreated is only set after runLocalDeploy() returns; if deploy creates the VM and then fails during remote bootstrap, afterAll() skips deletion and leaks the instance.

Suggested fix
     try {
-      deleteBrevInstance(INSTANCE_NAME);
-      console.log(`[${elapsed()}] Deleted leftover instance "${INSTANCE_NAME}"`);
+      const hadLeftover = hasBrevInstance(INSTANCE_NAME);
+      if (hadLeftover && !deleteBrevInstance(INSTANCE_NAME)) {
+        throw new Error(`Failed to delete leftover instance "${INSTANCE_NAME}" before test start`);
+      }
+      if (hadLeftover) {
+        console.log(`[${elapsed()}] Deleted leftover instance "${INSTANCE_NAME}"`);
+      }
     } catch {
       // Expected — no leftover instance exists
     }
 
     if (TEST_SUITE === "deploy-cli") {
       console.log(`[${elapsed()}] Running nemoclaw deploy end to end...`);
-      runLocalDeploy(INSTANCE_NAME);
-      instanceCreated = true;
+      try {
+        runLocalDeploy(INSTANCE_NAME);
+      } finally {
+        instanceCreated = hasBrevInstance(INSTANCE_NAME);
+      }
       try {
         brev("refresh");
       } catch {

Also applies to: 290-299

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/brev-e2e.test.js` around lines 279 - 285, The pre-cleanup
incorrectly assumes deleteBrevInstance(INSTANCE_NAME) succeeded; change the
pre-cleanup to check the boolean result and throw or retry/log an error when it
returns false so we don't treat a failed delete as success. Also ensure
instanceCreated is set as soon as the VM is actually created (either by
returning a created flag from runLocalDeploy or by setting instanceCreated =
true immediately after detecting creation) so that afterAll() will always
attempt deletion even if later bootstrap steps fail; update runLocalDeploy usage
and afterAll cleanup logic to rely on that explicit created flag.

Development

Successfully merging this pull request may close these issues.

[NemoClaw][Brev] nemoclaw deploy — brev create --gpu flag incompatible with current brev CLI