
refactor(installer): unify host preflight and thin deploy compatibility#1470

Open
kjw3 wants to merge 4 commits into main from refactor/installer-unification

Conversation


@kjw3 kjw3 commented Apr 4, 2026

Summary

  • unify installer and onboarding host detection around shared TypeScript preflight logic
  • move deploy behavior into TypeScript, thin the Brev compatibility wrapper, and harden Brev readiness handling
  • demote or remove legacy platform-specific setup paths (setup-spark, brev-setup.sh) in favor of the canonical installer + onboard flow
  • update docs, CLI help, and Brev E2E coverage to match the new behavior

What Changed

  • added shared host assessment and remediation planning in src/lib/preflight.ts
  • wired installer and onboard flows to the same host preflight decisions
  • changed Podman handling from hard block to unsupported-runtime warning
  • migrated deploy logic into src/lib/deploy.ts
  • updated nemoclaw deploy to use the authenticated Brev CLI, current Brev create flags, explicit GCP provider default, stricter readiness checks, and standard installer/onboard flow
  • removed scripts/setup-spark.sh and reduced scripts/brev-setup.sh to a deprecated compatibility wrapper
  • updated README/docs/help text and hardened the Brev E2E cleanup path
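
The shared preflight in src/lib/preflight.ts is only described at a high level above; here is a minimal TypeScript sketch of how a host assessment could feed remediation planning. All type names, fields, and the function body are illustrative assumptions, not the module's real API:

```typescript
// Illustrative shapes only; the real src/lib/preflight.ts API may differ.
interface HostAssessment {
  runtime: "docker" | "podman" | "unknown";
  dockerRunning: boolean;
  notes: string[];
}

interface RemediationAction {
  title: string;
  reason: string;
  blocking: boolean; // blocking actions stop onboarding; warnings do not
  commands: string[];
}

// Turn raw host facts into remediation actions. Podman produces a
// non-blocking warning, matching the "unsupported-runtime warning"
// behavior described in the bullets above.
function planHostRemediation(host: HostAssessment): RemediationAction[] {
  const actions: RemediationAction[] = [];
  if (host.runtime === "podman") {
    actions.push({
      title: "Unsupported runtime",
      reason: "Podman is not officially supported; use a Docker-based runtime.",
      blocking: false,
      commands: [],
    });
  }
  if (host.runtime === "docker" && !host.dockerRunning) {
    actions.push({
      title: "Start Docker",
      reason: "The Docker daemon is not reachable.",
      blocking: true,
      commands: ["sudo systemctl start docker"],
    });
  }
  return actions;
}

const plan = planHostRemediation({
  runtime: "docker",
  dockerRunning: false,
  notes: [],
});
console.log(plan.length, plan[0].blocking); // prints: 1 true
```

Because installer and onboard consume the same plan, a blocking action fails both flows identically instead of each path re-implementing its own checks.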

Validation

  • npm run build:cli
  • targeted Vitest coverage for src/lib/preflight.test.ts, src/lib/deploy.test.ts, test/install-preflight.test.js, test/cli.test.js, test/runner.test.js
  • live Brev validation with TEST_SUITE=deploy-cli on cpu-e2.4vcpu-16gb
  • confirmed successful end-to-end remote deploy after waiting for Brev status=RUNNING, build_status=COMPLETED, shell_status=READY
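
The readiness gate used during live validation reduces to a three-field predicate. The BrevInstance shape below is an assumption inferred from the statuses listed above, not the real `brev ls --json` schema:

```typescript
// Assumed shape of one entry from `brev ls --json`; field names are
// inferred from the statuses above, not taken from Brev's actual schema.
interface BrevInstance {
  name: string;
  status: string;       // e.g. "RUNNING"
  build_status: string; // e.g. "COMPLETED"
  shell_status: string; // e.g. "READY"
}

// Deploy proceeds only when all three statuses line up.
function isInstanceReady(inst: BrevInstance): boolean {
  return (
    inst.status === "RUNNING" &&
    inst.build_status === "COMPLETED" &&
    inst.shell_status === "READY"
  );
}

const booting: BrevInstance = {
  name: "cpu-e2.4vcpu-16gb",
  status: "RUNNING",
  build_status: "IN_PROGRESS",
  shell_status: "READY",
};

console.log(isInstanceReady(booting)); // prints: false (build not finished)
console.log(isInstanceReady({ ...booting, build_status: "COMPLETED" })); // prints: true
```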

Related Issues

Credit / Prior Work

This branch builds on ideas and prior work from:

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced host diagnostics with actionable remediation guidance during onboarding preflight checks.
  • Improvements

    • Updated runtime compatibility: macOS (Intel) now recommends Docker Desktop over Podman.
    • Simplified DGX Spark setup to use standard installer followed by nemoclaw onboard.
    • Improved preflight error messages with clear remediation steps for Docker issues.
  • Deprecations

    • nemoclaw deploy now delegates to standard installer; prefer separate host provisioning + nemoclaw onboard.
    • nemoclaw setup-spark marked as deprecated compatibility alias for nemoclaw onboard.


coderabbitai bot commented Apr 4, 2026

📝 Walkthrough

Walkthrough

The changes shift NemoClaw's deployment strategy from a Brev-specific wrapper to using the standard installer on remote hosts with onboarding. A new preflight host assessment module validates Docker/runtime readiness. Deploy functionality moves to a dedicated TypeScript module with comprehensive Brev VM orchestration. Commands deploy and setup-spark are deprecated in favor of the standard installer flow.

Changes

Cohort / File(s) Summary
Documentation
README.md, docs/get-started/quickstart.md, docs/reference/commands.md, docs/reference/troubleshooting.md, docs/deployment/deploy-to-remote-gpu.md, spark-install.md
Updated container runtime support matrix (Docker Desktop on macOS instead of Podman), repositioned deploy as deprecated in favor of standard installer + onboard workflow, reframed troubleshooting guidance for Docker reachability and Podman warnings.
Host Assessment & Remediation
src/lib/preflight.ts, src/lib/preflight.test.ts
New module providing assessHost() and planHostRemediation() for detecting container runtime, Docker status, cgroup configuration, WSL/headless environments, and generating remediation actions with platform-specific commands.
Brev Deploy Implementation
src/lib/deploy.ts, src/lib/deploy.test.ts
New module implementing executeDeploy() orchestrating Brev VM provisioning, instance creation/polling, rsync file sync, remote installer execution with credentials, and optional service startup and SSH connection.
CLI Entry Point & Onboarding
bin/nemoclaw.js, bin/lib/onboard.js
Refactored CLI to delegate deploy to new executeDeploy() module, deprecated setup-spark and deploy commands with warnings, replaced local preflight checks in onboard with assessHost()-driven validation and remediation reporting.
Installation & Bootstrap Scripts
scripts/install.sh, scripts/brev-setup.sh, scripts/start-services.sh
install.sh adds preflight checks via Node, gating onboarding on Docker readiness; brev-setup.sh converted to thin wrapper delegating to install.sh; removed direct cloudflared setup guidance from start-services.
Removed Script
scripts/setup-spark.sh
Entire script deleted; Docker cgroup v2 configuration logic no longer needed with current OpenShell releases.
Test Coverage
test/cli.test.js, test/runner.test.js, test/install-preflight.test.js, test/e2e/brev-e2e.test.js, test/e2e/test-spark-install.sh
Added new preflight/deploy test cases, updated CLI help assertions, added Brev e2e deploy flow testing with instance lifecycle helpers, removed Spark-specific e2e setup steps, extended regression guards for credential exposure and deploy behavior.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as nemoclaw.js
    participant Preflight as assessHost
    participant Deploy as executeDeploy
    participant Brev as brev CLI
    participant VM as Remote VM
    participant Installer as install.sh

    User->>CLI: nemoclaw deploy <instance>
    CLI->>Preflight: assessHost() on local machine
    Preflight-->>CLI: host diagnostics & remediation actions
    CLI->>Deploy: executeDeploy(instanceName, ...)
    Deploy->>Brev: brev ls --json
    alt Instance Missing
        Deploy->>Brev: brev create <instance>
    end
    Deploy->>Brev: brev refresh
    Deploy->>Brev: poll brev ls --json for readiness
    Brev-->>Deploy: instance RUNNING, build COMPLETED, shell READY
    Deploy->>VM: wait for SSH availability
    Deploy->>VM: rsync local rootDir to remote
    Deploy->>VM: scp .env file with credentials
    Deploy->>VM: run scripts/install.sh --non-interactive
    Installer->>VM: execute remote preflight checks
    Installer->>VM: run onboard in non-interactive mode
    VM-->>Installer: sandbox ready
    alt NEMOCLAW_DEPLOY_NO_CONNECT not set
        Deploy->>VM: openshell sandbox connect
        VM-->>User: interactive shell session
    else
        Deploy-->>User: print SSH connection instructions
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A rabbit's ode to your deploy refactor!

Brev's old dance was tangled and twisted,
Now preflight checks host dreams insisted,
From wrapper to module, the logic flows clean,
Docker and Podman get properly seen,
The installer takes center stage with such grace! 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 13.64%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately describes the main refactoring work: unifying host preflight logic and moving deploy behavior to TypeScript while maintaining backward compatibility.




…ification

# Conflicts:
#	scripts/setup-spark.sh
#	spark-install.md

github-actions bot commented Apr 4, 2026

🚀 Docs preview ready!

https://NVIDIA.github.io/NemoClaw/pr-preview/pr-1470/


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
test/install-preflight.test.js (1)

738-752: ⚠️ Potential issue | 🟡 Minor

Assert the installer still exits successfully in this skip-onboarding path.

Right now the test only checks the message and the missing log file. A non-zero exit that prints the same text would still pass, which weakens the regression coverage for the intended success case.

Suggested fix
     const result = spawnSync("bash", [INSTALLER, "--non-interactive"], {
       cwd: path.join(import.meta.dirname, ".."),
       encoding: "utf-8",
       env: {
         ...process.env,
@@
       },
     });
 
+    expect(result.status).toBe(0);
     expect(`${result.stdout}${result.stderr}`).toContain("Skipping onboarding");
     expect(fs.existsSync(onboardLog)).toBe(false);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/install-preflight.test.js` around lines 738 - 752, The test currently
only asserts the installer output and absence of the onboard log but doesn't
verify the process exit code; update the assertion after the spawnSync call (the
variable result from spawnSync using INSTALLER) to assert the installer exited
successfully (e.g., expect(result.status).toBe(0) or equivalent success
condition) so non-zero exits that print the same message will fail the test.
docs/deployment/deploy-to-remote-gpu.md (1)

30-35: ⚠️ Potential issue | 🟠 Major

The quick start skips the setup flow the page just recommended.

Lines 26-27 say the preferred path is to install NemoClaw on the remote host and run nemoclaw onboard, but this example jumps straight to nemoclaw my-assistant connect and openclaw tui. Those commands only work after onboarding has already created a sandbox, so a fresh Brev VM cannot follow this section as written.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/deployment/deploy-to-remote-gpu.md` around lines 30 - 35, The example
jumps straight to running "nemoclaw my-assistant connect" and "openclaw tui" but
those commands require an existing sandbox created by onboarding; update the
paragraph to either (a) explicitly state this is for already-onboarded Brev
instances, or (b) prepend the onboarding step by instructing users to run
"nemoclaw onboard" (or the recommended remote-host install + "nemoclaw onboard")
before "nemoclaw my-assistant connect" and "openclaw tui", and add a short note
about the prerequisite sandbox so fresh VMs can follow the flow.
🧹 Nitpick comments (4)
docs/reference/troubleshooting.md (1)

135-135: Keep one sentence per source line in this paragraph.

Line 135 currently packs two sentences onto one source line. Split them so the docs stay diff-friendly and consistent with the rest of docs/.

Suggested fix
-If you are using Podman, NemoClaw warns and continues, but OpenShell officially documents Docker-based runtimes only. If onboarding or sandbox lifecycle fails, switch to Docker Desktop, Colima, or Docker Engine and rerun onboarding.
+If you are using Podman, NemoClaw warns and continues, but OpenShell officially documents Docker-based runtimes only.
+If onboarding or sandbox lifecycle fails, switch to Docker Desktop, Colima, or Docker Engine and rerun onboarding.

As per coding guidelines, "One sentence per line in source (makes diffs readable). Flag paragraphs where multiple sentences appear on the same line."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/reference/troubleshooting.md` at line 135, The paragraph in
docs/reference/troubleshooting.md currently has two sentences on a single source
line; edit that source line so each sentence is on its own line (i.e., split "If
you are using Podman, NemoClaw warns and continues, but OpenShell officially
documents Docker-based runtimes only. If onboarding or sandbox lifecycle fails,
switch to Docker Desktop, Colima, or Docker Engine and rerun onboarding." into
two separate source lines) to follow the one-sentence-per-line guideline and
keep diffs clean.
test/install-preflight.test.js (1)

570-598: Drop the overwritten docker stub in this test setup.

The first writeExecutable(..., "docker") is immediately replaced by the second one, so half of this setup never runs. Keeping only the failing stub makes the scenario unambiguous.

Suggested fix
-    writeExecutable(
-      path.join(fakeBin, "docker"),
-      `#!/usr/bin/env bash
-if [ "$1" = "info" ]; then
-  echo '{"ServerVersion":"29.3.1","OperatingSystem":"Ubuntu 24.04","CgroupVersion":"2"}'
-  exit 0
-fi
-exit 0
-`,
-    );
     writeExecutable(
       path.join(fakeBin, "openshell"),
       `#!/usr/bin/env bash
 if [ "$1" = "--version" ]; then
   echo "openshell 0.0.22"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/install-preflight.test.js` around lines 570 - 598, The test sets two
different docker stubs via writeExecutable(...) with target "docker", and the
first one is immediately overwritten by the second; remove the initial
successful docker stub block so only the failing docker stub remains (keep the
openshell stub intact), i.e. ensure there is a single writeExecutable call for
"docker" (the one that exits 1 for "info") to make the scenario unambiguous.
docs/reference/commands.md (1)

212-213: Use active voice in this deprecation warning.

Line 213 says the installer/onboard flow “should be used instead,” which is passive. Address the reader directly here.

As per coding guidelines, "Active voice required. Flag passive constructions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/reference/commands.md` around lines 212 - 213, Replace the passive
phrasing "the standard installer and `nemoclaw onboard` flow should be used
instead" with an active imperative addressing the reader; reword the sentence
about `nemoclaw setup-spark` deprecation so it reads something like: "Use the
standard installer and run `nemoclaw onboard` instead, because current OpenShell
releases handle the older DGX Spark cgroup behavior." Update the sentence that
mentions `nemoclaw setup-spark` and `nemoclaw onboard` to this active
construction.
test/e2e/brev-e2e.test.js (1)

68-112: Good retry/cleanup logic with a minor suggestion.

The deleteBrevInstance helper implements robust retry logic with a brev refresh fallback. One observation: the function returns false when deletion fails after all attempts, and the calling code at lines 685-687 throws in that case, so the two are consistent.

However, consider that if hasBrevInstance returns true but brev("delete", ...) succeeds without throwing, you immediately call brev("refresh") and then check hasBrevInstance again. If the Brev API has eventual consistency, the instance might still appear in the list briefly after deletion succeeds. The current retry loop handles this gracefully, but adding a small delay after successful delete (before the existence check) might reduce unnecessary retries.

💡 Optional: Add small delay after successful delete
     try {
       brev("delete", instanceName);
+      sleep(2); // Allow Brev API to propagate deletion
     } catch {
       // Best-effort delete. We'll verify via ls below and retry if needed.
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/brev-e2e.test.js` around lines 68 - 112, The deleteBrevInstance
retry loop can spur unnecessary retries when brev("delete", instanceName)
succeeds but the API is eventually consistent; after calling brev("delete",
instanceName) inside deleteBrevInstance, insert a short sleep (use the existing
sleep(seconds) helper, e.g., 1–3 seconds) before calling brev("refresh") and
re-checking hasBrevInstance(instanceName) so the list has time to converge; keep
the existing try/catch around brev("delete") and brev("refresh") and preserve
the attempts/intervalSeconds behavior.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5a93e840-8490-4a99-8aa7-214864ddfd40

📥 Commits

Reviewing files that changed from the base of the PR and between 3328359 and dccdfb8.

📒 Files selected for processing (22)
  • README.md
  • bin/lib/onboard.js
  • bin/nemoclaw.js
  • docs/deployment/deploy-to-remote-gpu.md
  • docs/get-started/quickstart.md
  • docs/reference/commands.md
  • docs/reference/troubleshooting.md
  • scripts/brev-setup.sh
  • scripts/install.sh
  • scripts/setup-spark.sh
  • scripts/start-services.sh
  • spark-install.md
  • src/lib/deploy.test.ts
  • src/lib/deploy.ts
  • src/lib/preflight.test.ts
  • src/lib/preflight.ts
  • test/cli.test.js
  • test/e2e/brev-e2e.test.js
  • test/e2e/test-spark-install.sh
  • test/install-preflight.test.js
  • test/runner.test.js
  • test/usage-notice.test.js
💤 Files with no reviewable changes (1)
  • scripts/setup-spark.sh

Comment on lines 28 to +35
[ -n "${NVIDIA_API_KEY:-}" ] || fail "NVIDIA_API_KEY not set"
[ -f "$INSTALLER_SCRIPT" ] || fail "Installer not found at $INSTALLER_SCRIPT"

# Suppress needrestart noise from apt (Scanning processes, No services need...)
export NEEDRESTART_MODE=a
export DEBIAN_FRONTEND=noninteractive
export NEMOCLAW_NON_INTERACTIVE=1
export NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1
export NEMOCLAW_PROVIDER="${NEMOCLAW_PROVIDER:-build}"

⚠️ Potential issue | 🟠 Major

Don't hard-require NVIDIA_API_KEY for every provider.

Line 28 aborts before Line 35 can steer onboarding to openai, anthropic, gemini, or the compatible endpoint providers. That makes the compatibility deploy path fail for any non-build provider even when its own credential is present. Let the installer/onboard flow validate the selected provider, or gate this check on the effective provider instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/brev-setup.sh` around lines 28 - 35, The script currently
unconditionally fails if NVIDIA_API_KEY is missing; change the check so the key
is only required when the effective provider (NEMOCLAW_PROVIDER) is the
GPU-focused "build" provider (or absent/default to "build"). Concretely, replace
the unconditional check of NVIDIA_API_KEY with a conditional that reads
NEMOCLAW_PROVIDER (or uses the default "${NEMOCLAW_PROVIDER:-build}") and only
invokes fail "NVIDIA_API_KEY not set" when that provider equals "build"; leave
the installer existence check ([ -f "$INSTALLER_SCRIPT" ] || fail ...) and the
exports (NEEDRESTART_MODE, DEBIAN_FRONTEND, NEMOCLAW_* ) unchanged so onboarding
for openai/anthropic/gemini or endpoint providers can proceed without an
NVIDIA_API_KEY.

Comment on lines +924 to +956
if (host.runtime && host.runtime !== "unknown") {
lines.push(`Detected container runtime: ${host.runtime}`);
}
if (host.notes && host.notes.includes("Running under WSL")) {
lines.push("Running under WSL");
}
for (const action of actions) {
lines.push(`- ${action.title}: ${action.reason}`);
for (const command of action.commands || []) {
lines.push(` ${command}`);
}
}
if (lines.length > 0) {
process.stdout.write(lines.join("\n"));
}
process.exit(blockingActions.length > 0 ? 10 : 0);
} catch {
process.exit(0);
}
' "$preflight_module"
)"; then
status=0
else
status=$?
fi

if [[ -n "$output" ]]; then
echo ""
if [[ "$status" -eq 10 ]]; then
warn "Host preflight found issues that will prevent onboarding right now."
else
warn "Host preflight found warnings."
fi

⚠️ Potential issue | 🟡 Minor

Separate informational host notes from actual warnings.

Lines 924-929 always add Detected container runtime: docker on a healthy Docker host, so Lines 950-956 print Host preflight found warnings. even when planHostRemediation(host) returned no actions. Keep runtime/WSL notes out of the warning path, or only emit the warning header when there is remediation to show.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/install.sh` around lines 924 - 956, The preflight is mixing
informational notes (host.runtime, host.notes) with actionable warnings so the
script prints "Host preflight found warnings." even when
planHostRemediation(host) returned no actions; split the output into two buffers
(e.g., info_lines and action_lines) so Detected container runtime and WSL
messages go into info_lines and only remediation items from
actions/blockingActions go into action_lines, only set status/exit code based on
blockingActions and only print the warning header ("Host preflight found
warnings." / remediation list) when action_lines is non-empty; keep printing
info_lines unconditionally so runtime/WSL remain visible without triggering the
warning path.

Comment on lines +240 to +248
let exists = false;
try {
const out = execFileSync("brev", ["ls"], { encoding: "utf-8" });
exists = out.includes(name);
} catch (caught) {
const err = caught as { stdout?: string; stderr?: string };
if (err.stdout && err.stdout.includes(name)) exists = true;
if (err.stderr && err.stderr.includes(name)) exists = true;
}

⚠️ Potential issue | 🟠 Major

Use exact instance matching instead of substring checks.

out.includes(name) will treat foo as existing when only something like foo-prod is present, or when the name appears in an error message. In that case we skip brev create and then wait for a VM that does not actually exist.

Suggested fix
   let exists = false;
   try {
-    const out = execFileSync("brev", ["ls"], { encoding: "utf-8" });
-    exists = out.includes(name);
+    const out = execFileSync("brev", ["ls", "--json"], { encoding: "utf-8" });
+    exists = findBrevInstanceStatus(out, name) !== null;
   } catch (caught) {
-    const err = caught as { stdout?: string; stderr?: string };
-    if (err.stdout && err.stdout.includes(name)) exists = true;
-    if (err.stderr && err.stderr.includes(name)) exists = true;
+    const err = caught as { stdout?: string };
+    if (err.stdout) exists = findBrevInstanceStatus(err.stdout, name) !== null;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/deploy.ts` around lines 240 - 248, The current existence check uses
substring matching (out.includes(name)) which can falsely match names like "foo"
in "foo-prod" or in error text; update the check after calling
execFileSync("brev", ["ls"]) to split the output into lines (e.g.,
out.split(/\r?\n/)), trim each line and test for an exact equality to name (use
lines.some(line => line.trim() === name)) to set exists; apply the same
exact-line check to err.stdout and err.stderr instead of includes so only an
exact instance name marks existence (referencing the exists variable and the
out/err handling in this try/catch block).

Comment on lines +294 to +297
execFileSync(
"ssh",
["-o", "ConnectTimeout=5", "-o", "StrictHostKeyChecking=no", name, "echo", "ok"],
{ encoding: "utf-8", stdio: "ignore" },

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n --fixed-strings 'StrictHostKeyChecking=no' src/lib/deploy.ts

Repository: NVIDIA/NemoClaw

Length of output: 1380


Drop StrictHostKeyChecking=no from all SSH/scp operations to prevent MITM attacks on credential transmission.

All nine SSH and scp commands in this file skip host key verification with StrictHostKeyChecking=no, including the scp at line 338 that uploads the .env file containing API credentials. This silently accepts host-key mismatches and can disclose secrets to the wrong host. Use a pinned known_hosts file, or at least a trust-on-first-use mode that prevents key changes.

Affected instances: lines 296, 313, 320, 323, 338, 341, 358, 367, 387

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/deploy.ts` around lines 294 - 297, Replace the insecure SSH/SCP
option StrictHostKeyChecking=no used in the execFileSync calls (the SSH checks
invoking "ssh" and the scp upload of the .env) with a safer host-key policy:
either Point SSH to a pinned known_hosts file via -o
UserKnownHostsFile=/path/to/known_hosts and remove StrictHostKeyChecking=no, or
use TOFU by setting -o StrictHostKeyChecking=accept-new; apply this change to
every execFileSync invocation that runs "ssh" or "scp" (the calls that pass name
and the one that uploads the .env) and ensure the commands also reference the
chosen known_hosts file when appropriate.
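The two host-key policies the prompt proposes could be built like this (the helper name `sshArgs` is hypothetical; `accept-new` requires OpenSSH 7.6+):

```typescript
// Hypothetical helper sketching the safer option sets the review suggests.
// With a pinned known_hosts file, key mismatches are rejected outright;
// without one, accept-new gives trust-on-first-use and still rejects
// key *changes* on later connections.
function sshArgs(host: string, knownHostsPath?: string): string[] {
  const hostKeyOpts = knownHostsPath
    ? ["-o", `UserKnownHostsFile=${knownHostsPath}`, "-o", "StrictHostKeyChecking=yes"]
    : ["-o", "StrictHostKeyChecking=accept-new"];
  return ["-o", "ConnectTimeout=5", ...hostKeyOpts, host];
}
```

Each `execFileSync("ssh", ...)` and `execFileSync("scp", ...)` call would then splice in the same option set instead of hard-coding `StrictHostKeyChecking=no`.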

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/install-preflight.test.js (1)

709-755: ⚠️ Potential issue | 🟠 Major

This no longer verifies the non-interactive terms gate.

Because this case doesn't force preflight to pass, the installer can skip onboarding before show_usage_notice() runs and still satisfy /Skipping onboarding/. That means the test now passes for unrelated Docker/OpenShell state instead of proving the acceptance requirement.

Suggested fix
     writeNodeStub(fakeBin);
+    writeExecutable(
+      path.join(fakeBin, "docker"),
+      `#!/usr/bin/env bash
+if [ "$1" = "info" ]; then
+  echo '{"ServerVersion":"29.3.1","OperatingSystem":"Ubuntu 24.04","CgroupVersion":"2"}'
+  exit 0
+fi
+exit 0
+`,
+    );
+    writeExecutable(
+      path.join(fakeBin, "openshell"),
+      `#!/usr/bin/env bash
+if [ "$1" = "--version" ]; then
+  echo "openshell 0.0.22"
+  exit 0
+fi
+exit 0
+`,
+    );
     writeNpmStub(
       fakeBin,
       `if [ "$1" = "pack" ]; then
@@
     const result = spawnSync("bash", [INSTALLER, "--non-interactive"], {
@@
     });
 
-    expect(`${result.stdout}${result.stderr}`).toMatch(
-      /Skipping onboarding|--yes-i-accept-third-party-software|NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1/,
-    );
+    expect(result.status).not.toBe(0);
+    expect(`${result.stdout}${result.stderr}`).toMatch(
+      /--yes-i-accept-third-party-software|NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1/,
+    );
     expect(fs.existsSync(onboardLog)).toBe(false);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/install-preflight.test.js` around lines 709 - 755, The test no longer
guarantees the installer reached the terms gate because preflight can fail
earlier; update the test so preflight is forced to succeed and the installer
reaches show_usage_notice() before asserting onboarding was skipped. Concretely,
in the test around INSTALLER invocation ensure you arrange/stub the preflight
checks (or set the env variable(s) the preflight uses) so they return success
(so show_usage_notice() is executed), keep NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE
empty to simulate non-acceptance, then assert the output contains the
terms-acceptance hints and that onboardLog does not exist; reference the
installer invocation via INSTALLER, the show_usage_notice() path, the
NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE env var, and the onboardLog file to locate
where to change the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/preflight.ts`:
- Around line 203-207: detectNvidiaGpu currently returns true just because the
nvidia-smi binary exists; change it to first verify commandExists("nvidia-smi",
runCaptureImpl) and then actually probe the device by running nvidia-smi (e.g.
runCaptureImpl("nvidia-smi -L", { ignoreError: true }) or
runCaptureImpl("nvidia-smi --query-gpu=name --format=csv,noheader", {
ignoreError: true })) and return true only if the probe returns
non-empty/expected output (trimmed) indicating at least one GPU; otherwise
return false. Use the existing runCaptureImpl and commandExists helpers and keep
error handling via the ignoreError option so an absent/failed probe yields false
rather than throwing.

In `@test/e2e/brev-e2e.test.js`:
- Around line 279-285: The pre-cleanup incorrectly assumes
deleteBrevInstance(INSTANCE_NAME) succeeded; change the pre-cleanup to check the
boolean result and throw or retry/log an error when it returns false so we don't
treat a failed delete as success. Also ensure instanceCreated is set as soon as
the VM is actually created (either by returning a created flag from
runLocalDeploy or by setting instanceCreated = true immediately after detecting
creation) so that afterAll() will always attempt deletion even if later
bootstrap steps fail; update runLocalDeploy usage and afterAll cleanup logic to
rely on that explicit created flag.

---

Outside diff comments:
In `@test/install-preflight.test.js`:
- Around line 709-755: The test no longer guarantees the installer reached the
terms gate because preflight can fail earlier; update the test so preflight is
forced to succeed and the installer reaches show_usage_notice() before asserting
onboarding was skipped. Concretely, in the test around INSTALLER invocation
ensure you arrange/stub the preflight checks (or set the env variable(s) the
preflight uses) so they return success (so show_usage_notice() is executed),
keep NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE empty to simulate non-acceptance, then
assert the output contains the terms-acceptance hints and that onboardLog does
not exist; reference the installer invocation via INSTALLER, the
show_usage_notice() path, the NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE env var, and
the onboardLog file to locate where to change the test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 80955231-b32a-4dc6-8604-041a7326547d

📥 Commits

Reviewing files that changed from the base of the PR and between dccdfb8 and 30a5081.

📒 Files selected for processing (9)
  • README.md
  • bin/lib/onboard.js
  • bin/nemoclaw.js
  • docs/reference/commands.md
  • scripts/install.sh
  • spark-install.md
  • src/lib/preflight.ts
  • test/e2e/brev-e2e.test.js
  • test/install-preflight.test.js
✅ Files skipped from review due to trivial changes (1)
  • docs/reference/commands.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • README.md
  • spark-install.md
  • bin/lib/onboard.js

Comment on lines +203 to +207
function detectNvidiaGpu(
runCaptureImpl: (command: string, options?: { ignoreError?: boolean }) => string,
): boolean {
return commandExists("nvidia-smi", runCaptureImpl);
}

⚠️ Potential issue | 🟡 Minor

Probe nvidia-smi, don't just check that it's on PATH.

hasNvidiaGpu flips to true whenever the binary exists, even if the host has no usable NVIDIA device/driver. On CPU images that still ship nvidia-smi, that suppresses the headless-host hint and mislabels the assessment.

Suggested fix
 function detectNvidiaGpu(
   runCaptureImpl: (command: string, options?: { ignoreError?: boolean }) => string,
 ): boolean {
-  return commandExists("nvidia-smi", runCaptureImpl);
+  if (!commandExists("nvidia-smi", runCaptureImpl)) {
+    return false;
+  }
+  try {
+    const output = runCaptureImpl("nvidia-smi -L 2>/dev/null", { ignoreError: true });
+    return Boolean(String(output || "").trim());
+  } catch {
+    return false;
+  }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/preflight.ts` around lines 203 - 207, detectNvidiaGpu currently
returns true just because the nvidia-smi binary exists; change it to first
verify commandExists("nvidia-smi", runCaptureImpl) and then actually probe the
device by running nvidia-smi (e.g. runCaptureImpl("nvidia-smi -L", {
ignoreError: true }) or runCaptureImpl("nvidia-smi --query-gpu=name
--format=csv,noheader", { ignoreError: true })) and return true only if the
probe returns non-empty/expected output (trimmed) indicating at least one GPU;
otherwise return false. Use the existing runCaptureImpl and commandExists
helpers and keep error handling via the ignoreError option so an absent/failed
probe yields false rather than throwing.

Comment on lines 279 to 285
// Pre-cleanup: delete any leftover instance with the same name.
// This can happen when a previous run's create succeeded on the backend
// but the CLI got a network error (unexpected EOF) before confirming,
// then the retry/fallback fails with "duplicate workspace".
try {
brev("delete", INSTANCE_NAME);
deleteBrevInstance(INSTANCE_NAME);
console.log(`[${elapsed()}] Deleted leftover instance "${INSTANCE_NAME}"`);

⚠️ Potential issue | 🟠 Major

Don't lose cleanup state on partial instance lifecycle failures.

deleteBrevInstance() returns false when the instance still exists, but the pre-cleanup branch treats any return as success. Also, instanceCreated is only set after runLocalDeploy() returns; if deploy creates the VM and then fails during remote bootstrap, afterAll() skips deletion and leaks the instance.

Suggested fix
     try {
-      deleteBrevInstance(INSTANCE_NAME);
-      console.log(`[${elapsed()}] Deleted leftover instance "${INSTANCE_NAME}"`);
+      const hadLeftover = hasBrevInstance(INSTANCE_NAME);
+      if (hadLeftover && !deleteBrevInstance(INSTANCE_NAME)) {
+        throw new Error(`Failed to delete leftover instance "${INSTANCE_NAME}" before test start`);
+      }
+      if (hadLeftover) {
+        console.log(`[${elapsed()}] Deleted leftover instance "${INSTANCE_NAME}"`);
+      }
     } catch {
       // Expected — no leftover instance exists
     }
 
     if (TEST_SUITE === "deploy-cli") {
       console.log(`[${elapsed()}] Running nemoclaw deploy end to end...`);
-      runLocalDeploy(INSTANCE_NAME);
-      instanceCreated = true;
+      try {
+        runLocalDeploy(INSTANCE_NAME);
+      } finally {
+        instanceCreated = hasBrevInstance(INSTANCE_NAME);
+      }
       try {
         brev("refresh");
       } catch {

Also applies to: 290-299

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/brev-e2e.test.js` around lines 279 - 285, The pre-cleanup
incorrectly assumes deleteBrevInstance(INSTANCE_NAME) succeeded; change the
pre-cleanup to check the boolean result and throw or retry/log an error when it
returns false so we don't treat a failed delete as success. Also ensure
instanceCreated is set as soon as the VM is actually created (either by
returning a created flag from runLocalDeploy or by setting instanceCreated =
true immediately after detecting creation) so that afterAll() will always
attempt deletion even if later bootstrap steps fail; update runLocalDeploy usage
and afterAll cleanup logic to rely on that explicit created flag.

Development

Successfully merging this pull request may close these issues.

[NemoClaw][Brev] nemoclaw deploy — brev create --gpu flag incompatible with current brev CLI