refactor(installer): unify host preflight and thin deploy compatibility #1470
Conversation
📝 Walkthrough
The changes shift NemoClaw's deployment strategy from a Brev-specific wrapper to using the standard installer on remote hosts with onboarding. A new preflight host assessment module validates Docker/runtime readiness. Deploy functionality moves to a dedicated TypeScript module with comprehensive Brev VM orchestration.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant CLI as nemoclaw.js
    participant Preflight as assessHost
    participant Deploy as executeDeploy
    participant Brev as brev CLI
    participant VM as Remote VM
    participant Installer as install.sh
    User->>CLI: nemoclaw deploy <instance>
    CLI->>Preflight: assessHost() on local machine
    Preflight-->>CLI: host diagnostics & remediation actions
    CLI->>Deploy: executeDeploy(instanceName, ...)
    Deploy->>Brev: brev ls --json
    alt Instance Missing
        Deploy->>Brev: brev create <instance>
    end
    Deploy->>Brev: brev refresh
    Deploy->>Brev: poll brev ls --json for readiness
    Brev-->>Deploy: instance RUNNING, build COMPLETED, shell READY
    Deploy->>VM: wait for SSH availability
    Deploy->>VM: rsync local rootDir to remote
    Deploy->>VM: scp .env file with credentials
    Deploy->>VM: run scripts/install.sh --non-interactive
    Installer->>VM: execute remote preflight checks
    Installer->>VM: run onboard in non-interactive mode
    VM-->>Installer: sandbox ready
    alt NEMOCLAW_DEPLOY_NO_CONNECT not set
        Deploy->>VM: openshell sandbox connect
        VM-->>User: interactive shell session
    else
        Deploy-->>User: print SSH connection instructions
    end
```
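The "poll brev ls --json for readiness" step in the diagram is a generic retry-until-ready loop. A minimal bash sketch of that pattern follows; the `poll_until` helper and the usage line are illustrative, not code from this PR:

```shell
#!/usr/bin/env bash
# Retry a predicate command until it succeeds or attempts run out.
# Usage: poll_until <attempts> <interval-seconds> <command...>
poll_until() {
  local attempts="$1" interval="$2"
  shift 2
  local i
  for ((i = 0; i < attempts; i++)); do
    "$@" && return 0
    sleep "$interval"
  done
  return 1
}

# Hypothetical usage mirroring the deploy flow:
#   poll_until 30 10 instance_is_ready "$instance_name"
```

The same helper shape covers the SSH-availability wait as well, since both are "run a cheap check, sleep, repeat" loops.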
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
…ification
# Conflicts:
#   scripts/setup-spark.sh
#   spark-install.md
🚀 Docs preview ready!
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
test/install-preflight.test.js (1)
738-752: ⚠️ Potential issue | 🟡 Minor
Assert the installer still exits successfully in this skip-onboarding path.
Right now the test only checks the message and the missing log file. A non-zero exit that prints the same text would still pass, which weakens the regression coverage for the intended success case.
Suggested fix
```diff
   const result = spawnSync("bash", [INSTALLER, "--non-interactive"], {
     cwd: path.join(import.meta.dirname, ".."),
     encoding: "utf-8",
     env: {
       ...process.env,
@@
     },
   });
+  expect(result.status).toBe(0);
   expect(`${result.stdout}${result.stderr}`).toContain("Skipping onboarding");
   expect(fs.existsSync(onboardLog)).toBe(false);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/install-preflight.test.js` around lines 738 - 752, The test currently only asserts the installer output and absence of the onboard log but doesn't verify the process exit code; update the assertion after the spawnSync call (the variable result from spawnSync using INSTALLER) to assert the installer exited successfully (e.g., expect(result.status).toBe(0) or equivalent success condition) so non-zero exits that print the same message will fail the test.
docs/deployment/deploy-to-remote-gpu.md (1)
30-35: ⚠️ Potential issue | 🟠 Major
The quick start skips the setup flow the page just recommended.
Lines 26-27 say the preferred path is to install NemoClaw on the remote host and run `nemoclaw onboard`, but this example jumps straight to `nemoclaw my-assistant connect` and `openclaw tui`. Those commands only work after onboarding has already created a sandbox, so a fresh Brev VM cannot follow this section as written.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/deployment/deploy-to-remote-gpu.md` around lines 30 - 35, The example jumps straight to running "nemoclaw my-assistant connect" and "openclaw tui" but those commands require an existing sandbox created by onboarding; update the paragraph to either (a) explicitly state this is for already-onboarded Brev instances, or (b) prepend the onboarding step by instructing users to run "nemoclaw onboard" (or the recommended remote-host install + "nemoclaw onboard") before "nemoclaw my-assistant connect" and "openclaw tui", and add a short note about the prerequisite sandbox so fresh VMs can follow the flow.
🧹 Nitpick comments (4)
docs/reference/troubleshooting.md (1)
135-135: Keep one sentence per source line in this paragraph.
Line 135 currently packs two sentences onto one source line. Split them so the docs stay diff-friendly and consistent with the rest of docs/.
Suggested fix
```diff
-If you are using Podman, NemoClaw warns and continues, but OpenShell officially documents Docker-based runtimes only. If onboarding or sandbox lifecycle fails, switch to Docker Desktop, Colima, or Docker Engine and rerun onboarding.
+If you are using Podman, NemoClaw warns and continues, but OpenShell officially documents Docker-based runtimes only.
+If onboarding or sandbox lifecycle fails, switch to Docker Desktop, Colima, or Docker Engine and rerun onboarding.
```
As per coding guidelines, "One sentence per line in source (makes diffs readable). Flag paragraphs where multiple sentences appear on the same line."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/reference/troubleshooting.md` at line 135, The paragraph in docs/reference/troubleshooting.md currently has two sentences on a single source line; edit that source line so each sentence is on its own line (i.e., split "If you are using Podman, NemoClaw warns and continues, but OpenShell officially documents Docker-based runtimes only. If onboarding or sandbox lifecycle fails, switch to Docker Desktop, Colima, or Docker Engine and rerun onboarding." into two separate source lines) to follow the one-sentence-per-line guideline and keep diffs clean.
test/install-preflight.test.js (1)
570-598: Drop the overwritten `docker` stub in this test setup.
The first `writeExecutable(..., "docker")` is immediately replaced by the second one, so half of this setup never runs. Keeping only the failing stub makes the scenario unambiguous.
Suggested fix
```diff
-  writeExecutable(
-    path.join(fakeBin, "docker"),
-    `#!/usr/bin/env bash
-if [ "$1" = "info" ]; then
-  echo '{"ServerVersion":"29.3.1","OperatingSystem":"Ubuntu 24.04","CgroupVersion":"2"}'
-  exit 0
-fi
-exit 0
-`,
-  );
   writeExecutable(
     path.join(fakeBin, "openshell"),
     `#!/usr/bin/env bash
 if [ "$1" = "--version" ]; then
   echo "openshell 0.0.22"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/install-preflight.test.js` around lines 570 - 598, The test sets two different docker stubs via writeExecutable(...) with target "docker", and the first one is immediately overwritten by the second; remove the initial successful docker stub block so only the failing docker stub remains (keep the openshell stub intact), i.e. ensure there is a single writeExecutable call for "docker" (the one that exits 1 for "info") to make the scenario unambiguous.
docs/reference/commands.md (1)
212-213: Use active voice in this deprecation warning.
Line 213 says the installer/onboard flow "should be used instead," which is passive. Address the reader directly here.
As per coding guidelines, "Active voice required. Flag passive constructions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/reference/commands.md` around lines 212 - 213, Replace the passive phrasing "the standard installer and `nemoclaw onboard` flow should be used instead" with an active imperative addressing the reader; reword the sentence about `nemoclaw setup-spark` deprecation so it reads something like: "Use the standard installer and run `nemoclaw onboard` instead, because current OpenShell releases handle the older DGX Spark cgroup behavior." Update the sentence that mentions `nemoclaw setup-spark` and `nemoclaw onboard` to this active construction.
test/e2e/brev-e2e.test.js (1)
68-112: Good retry/cleanup logic with a minor suggestion.
The `deleteBrevInstance` helper implements robust retry logic with a `brev refresh` fallback. One observation: the function returns `false` when deletion fails after all attempts, but the calling code at line 685-687 throws if deletion fails. This is consistent.
However, consider that if `hasBrevInstance` returns `true` but `brev("delete", ...)` succeeds without throwing, you immediately call `brev("refresh")` and then check `hasBrevInstance` again. If the Brev API has eventual consistency, the instance might still appear in the list briefly after deletion succeeds. The current retry loop handles this gracefully, but adding a small delay after successful delete (before the existence check) might reduce unnecessary retries.
💡 Optional: Add small delay after successful delete
```diff
   try {
     brev("delete", instanceName);
+    sleep(2); // Allow Brev API to propagate deletion
   } catch {
     // Best-effort delete. We'll verify via ls below and retry if needed.
   }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/brev-e2e.test.js` around lines 68 - 112, The deleteBrevInstance retry loop can spur unnecessary retries when brev("delete", instanceName) succeeds but the API is eventually consistent; after calling brev("delete", instanceName) inside deleteBrevInstance, insert a short sleep (use the existing sleep(seconds) helper, e.g., 1–3 seconds) before calling brev("refresh") and re-checking hasBrevInstance(instanceName) so the list has time to converge; keep the existing try/catch around brev("delete") and brev("refresh") and preserve the attempts/intervalSeconds behavior.
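The delete-verify-retry pattern discussed above, with the suggested post-delete delay folded in, can be sketched in bash. The `brev` subcommands come from the review; the helper itself is an illustration, not the test's actual JavaScript, and the `grep` pattern assumes one bare name per line rather than real `brev ls` column output:

```shell
#!/usr/bin/env bash
# Best-effort delete, then verify the instance is gone; retry a few times.
delete_with_retry() {
  local name="$1" attempts="${2:-5}"
  local i
  for ((i = 0; i < attempts; i++)); do
    brev delete "$name" || true   # best-effort: verified via ls below
    sleep 2                       # let the eventually consistent API converge
    brev refresh || true
    brev ls | grep -qxF -- "$name" || return 0
  done
  return 1
}
```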
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/brev-setup.sh`:
- Around line 28-35: The script currently unconditionally fails if
NVIDIA_API_KEY is missing; change the check so the key is only required when the
effective provider (NEMOCLAW_PROVIDER) is the GPU-focused "build" provider (or
absent/default to "build"). Concretely, replace the unconditional check of
NVIDIA_API_KEY with a conditional that reads NEMOCLAW_PROVIDER (or uses the
default "${NEMOCLAW_PROVIDER:-build}") and only invokes fail "NVIDIA_API_KEY not
set" when that provider equals "build"; leave the installer existence check ([
-f "$INSTALLER_SCRIPT" ] || fail ...) and the exports (NEEDRESTART_MODE,
DEBIAN_FRONTEND, NEMOCLAW_* ) unchanged so onboarding for
openai/anthropic/gemini or endpoint providers can proceed without an
NVIDIA_API_KEY.
In `@scripts/install.sh`:
- Around line 924-956: The preflight is mixing informational notes
(host.runtime, host.notes) with actionable warnings so the script prints "Host
preflight found warnings." even when planHostRemediation(host) returned no
actions; split the output into two buffers (e.g., info_lines and action_lines)
so Detected container runtime and WSL messages go into info_lines and only
remediation items from actions/blockingActions go into action_lines, only set
status/exit code based on blockingActions and only print the warning header
("Host preflight found warnings." / remediation list) when action_lines is
non-empty; keep printing info_lines unconditionally so runtime/WSL remain
visible without triggering the warning path.
In `@spark-install.md`:
- Around line 5-14: The document is now internally inconsistent: the
intro/prereqs claim current OpenShell no longer needs the Spark-specific cgroup
workaround and that the NemoClaw installer/onboarding will install OpenShell
automatically, but later sections still instruct manual OpenShell installation
and the old setup-spark/default-cgroupns-mode=host fix; update the Quick Start
and "What's Different on Spark" sections to match the new supported path by
removing or revising any instructions that tell users to run setup-spark or set
default-cgroupns-mode=host and by deleting manual OpenShell install steps, and
if necessary replace them with a short note pointing to the onboarding installer
behavior (referencing the "Quick Start" and "What's Different on Spark" headings
and the phrases "setup-spark" and "default-cgroupns-mode=host" and "OpenShell
CLI") so the page consistently documents only the automatic onboarding/install
flow.
In `@src/lib/deploy.ts`:
- Around line 240-248: The current existence check uses substring matching
(out.includes(name)) which can falsely match names like "foo" in "foo-prod" or
in error text; update the check after calling execFileSync("brev", ["ls"]) to
split the output into lines (e.g., out.split(/\r?\n/)), trim each line and test
for an exact equality to name (use lines.some(line => line.trim() === name)) to
set exists; apply the same exact-line check to err.stdout and err.stderr instead
of includes so only an exact instance name marks existence (referencing the
exists variable and the out/err handling in this try/catch block).
- Around line 294-297: Replace the insecure SSH/SCP option
StrictHostKeyChecking=no used in the execFileSync calls (the SSH checks invoking
"ssh" and the scp upload of the .env) with a safer host-key policy: either Point
SSH to a pinned known_hosts file via -o UserKnownHostsFile=/path/to/known_hosts
and remove StrictHostKeyChecking=no, or use TOFU by setting -o
StrictHostKeyChecking=accept-new; apply this change to every execFileSync
invocation that runs "ssh" or "scp" (the calls that pass name and the one that
uploads the .env) and ensure the commands also reference the chosen known_hosts
file when appropriate.
---
Outside diff comments:
In `@docs/deployment/deploy-to-remote-gpu.md`:
- Around line 30-35: The example jumps straight to running "nemoclaw
my-assistant connect" and "openclaw tui" but those commands require an existing
sandbox created by onboarding; update the paragraph to either (a) explicitly
state this is for already-onboarded Brev instances, or (b) prepend the
onboarding step by instructing users to run "nemoclaw onboard" (or the
recommended remote-host install + "nemoclaw onboard") before "nemoclaw
my-assistant connect" and "openclaw tui", and add a short note about the
prerequisite sandbox so fresh VMs can follow the flow.
In `@test/install-preflight.test.js`:
- Around line 738-752: The test currently only asserts the installer output and
absence of the onboard log but doesn't verify the process exit code; update the
assertion after the spawnSync call (the variable result from spawnSync using
INSTALLER) to assert the installer exited successfully (e.g.,
expect(result.status).toBe(0) or equivalent success condition) so non-zero exits
that print the same message will fail the test.
---
Nitpick comments:
In `@docs/reference/commands.md`:
- Around line 212-213: Replace the passive phrasing "the standard installer and
`nemoclaw onboard` flow should be used instead" with an active imperative
addressing the reader; reword the sentence about `nemoclaw setup-spark`
deprecation so it reads something like: "Use the standard installer and run
`nemoclaw onboard` instead, because current OpenShell releases handle the older
DGX Spark cgroup behavior." Update the sentence that mentions `nemoclaw
setup-spark` and `nemoclaw onboard` to this active construction.
In `@docs/reference/troubleshooting.md`:
- Line 135: The paragraph in docs/reference/troubleshooting.md currently has two
sentences on a single source line; edit that source line so each sentence is on
its own line (i.e., split "If you are using Podman, NemoClaw warns and
continues, but OpenShell officially documents Docker-based runtimes only. If
onboarding or sandbox lifecycle fails, switch to Docker Desktop, Colima, or
Docker Engine and rerun onboarding." into two separate source lines) to follow
the one-sentence-per-line guideline and keep diffs clean.
In `@test/e2e/brev-e2e.test.js`:
- Around line 68-112: The deleteBrevInstance retry loop can spur unnecessary
retries when brev("delete", instanceName) succeeds but the API is eventually
consistent; after calling brev("delete", instanceName) inside
deleteBrevInstance, insert a short sleep (use the existing sleep(seconds)
helper, e.g., 1–3 seconds) before calling brev("refresh") and re-checking
hasBrevInstance(instanceName) so the list has time to converge; keep the
existing try/catch around brev("delete") and brev("refresh") and preserve the
attempts/intervalSeconds behavior.
In `@test/install-preflight.test.js`:
- Around line 570-598: The test sets two different docker stubs via
writeExecutable(...) with target "docker", and the first one is immediately
overwritten by the second; remove the initial successful docker stub block so
only the failing docker stub remains (keep the openshell stub intact), i.e.
ensure there is a single writeExecutable call for "docker" (the one that exits 1
for "info") to make the scenario unambiguous.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5a93e840-8490-4a99-8aa7-214864ddfd40
📒 Files selected for processing (22)
- README.md
- bin/lib/onboard.js
- bin/nemoclaw.js
- docs/deployment/deploy-to-remote-gpu.md
- docs/get-started/quickstart.md
- docs/reference/commands.md
- docs/reference/troubleshooting.md
- scripts/brev-setup.sh
- scripts/install.sh
- scripts/setup-spark.sh
- scripts/start-services.sh
- spark-install.md
- src/lib/deploy.test.ts
- src/lib/deploy.ts
- src/lib/preflight.test.ts
- src/lib/preflight.ts
- test/cli.test.js
- test/e2e/brev-e2e.test.js
- test/e2e/test-spark-install.sh
- test/install-preflight.test.js
- test/runner.test.js
- test/usage-notice.test.js
💤 Files with no reviewable changes (1)
- scripts/setup-spark.sh
```bash
[ -n "${NVIDIA_API_KEY:-}" ] || fail "NVIDIA_API_KEY not set"
[ -f "$INSTALLER_SCRIPT" ] || fail "Installer not found at $INSTALLER_SCRIPT"

# Suppress needrestart noise from apt (Scanning processes, No services need...)
export NEEDRESTART_MODE=a
export DEBIAN_FRONTEND=noninteractive
export NEMOCLAW_NON_INTERACTIVE=1
export NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1
export NEMOCLAW_PROVIDER="${NEMOCLAW_PROVIDER:-build}"
```
Don't hard-require NVIDIA_API_KEY for every provider.
Line 28 aborts before Line 35 can steer onboarding to openai, anthropic, gemini, or the compatible endpoint providers. That makes the compatibility deploy path fail for any non-build provider even when its own credential is present. Let the installer/onboard flow validate the selected provider, or gate this check on the effective provider instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/brev-setup.sh` around lines 28 - 35, The script currently
unconditionally fails if NVIDIA_API_KEY is missing; change the check so the key
is only required when the effective provider (NEMOCLAW_PROVIDER) is the
GPU-focused "build" provider (or absent/default to "build"). Concretely, replace
the unconditional check of NVIDIA_API_KEY with a conditional that reads
NEMOCLAW_PROVIDER (or uses the default "${NEMOCLAW_PROVIDER:-build}") and only
invokes fail "NVIDIA_API_KEY not set" when that provider equals "build"; leave
the installer existence check ([ -f "$INSTALLER_SCRIPT" ] || fail ...) and the
exports (NEEDRESTART_MODE, DEBIAN_FRONTEND, NEMOCLAW_* ) unchanged so onboarding
for openai/anthropic/gemini or endpoint providers can proceed without an
NVIDIA_API_KEY.
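A minimal sketch of the provider-gated check this prompt describes. The variable names come from the script under review; the `fail` stand-in returns instead of exiting so the behavior is easy to exercise, and `check_credentials` is a hypothetical wrapper name:

```shell
#!/usr/bin/env bash
fail() { echo "error: $*" >&2; return 1; }   # stand-in for the script's real fail()

check_credentials() {
  local provider="${NEMOCLAW_PROVIDER:-build}"
  # Only the GPU-focused "build" provider needs an NVIDIA API key;
  # other providers validate their own credentials during onboarding.
  if [ "$provider" = "build" ]; then
    [ -n "${NVIDIA_API_KEY:-}" ] || fail "NVIDIA_API_KEY not set"
  fi
}
```

With this gate, onboarding for openai, anthropic, gemini, or endpoint providers proceeds without an NVIDIA key, while the default "build" path still fails fast.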
```bash
    if (host.runtime && host.runtime !== "unknown") {
      lines.push(`Detected container runtime: ${host.runtime}`);
    }
    if (host.notes && host.notes.includes("Running under WSL")) {
      lines.push("Running under WSL");
    }
    for (const action of actions) {
      lines.push(`- ${action.title}: ${action.reason}`);
      for (const command of action.commands || []) {
        lines.push(`  ${command}`);
      }
    }
    if (lines.length > 0) {
      process.stdout.write(lines.join("\n"));
    }
    process.exit(blockingActions.length > 0 ? 10 : 0);
  } catch {
    process.exit(0);
  }
' "$preflight_module"
)"; then
  status=0
else
  status=$?
fi

if [[ -n "$output" ]]; then
  echo ""
  if [[ "$status" -eq 10 ]]; then
    warn "Host preflight found issues that will prevent onboarding right now."
  else
    warn "Host preflight found warnings."
  fi
```
Separate informational host notes from actual warnings.
Lines 924-929 always add `Detected container runtime: docker` on a healthy Docker host, so Lines 950-956 print `Host preflight found warnings.` even when `planHostRemediation(host)` returned no actions. Keep runtime/WSL notes out of the warning path, or only emit the warning header when there is remediation to show.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/install.sh` around lines 924 - 956, The preflight is mixing
informational notes (host.runtime, host.notes) with actionable warnings so the
script prints "Host preflight found warnings." even when
planHostRemediation(host) returned no actions; split the output into two buffers
(e.g., info_lines and action_lines) so Detected container runtime and WSL
messages go into info_lines and only remediation items from
actions/blockingActions go into action_lines, only set status/exit code based on
blockingActions and only print the warning header ("Host preflight found
warnings." / remediation list) when action_lines is non-empty; keep printing
info_lines unconditionally so runtime/WSL remain visible without triggering the
warning path.
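The two-buffer split the prompt asks for can be sketched on the bash side. The buffer names come from the prompt; `warn` stands in for the installer's helper, `print_preflight` is a hypothetical wrapper, and the classification assumes remediation lines start with "- " (as in the node snippet):

```shell
#!/usr/bin/env bash
warn() { echo "warning: $*" >&2; }   # stand-in for install.sh's warn()

# Read preflight lines from stdin and classify them:
# informational notes vs. actionable remediation items.
print_preflight() {
  local line
  local -a info_lines=() action_lines=()
  while IFS= read -r line; do
    case "$line" in
      "- "*|"  "*) action_lines+=("$line") ;;   # remediation item or its command
      *)           info_lines+=("$line") ;;     # runtime/WSL note
    esac
  done
  if [ "${#info_lines[@]}" -gt 0 ]; then
    printf '%s\n' "${info_lines[@]}"            # always show runtime/WSL notes
  fi
  if [ "${#action_lines[@]}" -gt 0 ]; then
    warn "Host preflight found warnings."       # header only when remediation exists
    printf '%s\n' "${action_lines[@]}"
  fi
}
```

A healthy Docker host then prints only its runtime note, without the spurious warning header.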
```typescript
let exists = false;
try {
  const out = execFileSync("brev", ["ls"], { encoding: "utf-8" });
  exists = out.includes(name);
} catch (caught) {
  const err = caught as { stdout?: string; stderr?: string };
  if (err.stdout && err.stdout.includes(name)) exists = true;
  if (err.stderr && err.stderr.includes(name)) exists = true;
}
```
Use exact instance matching instead of substring checks.
`out.includes(name)` will treat `foo` as existing when only something like `foo-prod` is present, or when the name appears in an error message. In that case we skip `brev create` and then wait for a VM that does not actually exist.
Suggested fix
```diff
 let exists = false;
 try {
-  const out = execFileSync("brev", ["ls"], { encoding: "utf-8" });
-  exists = out.includes(name);
+  const out = execFileSync("brev", ["ls", "--json"], { encoding: "utf-8" });
+  exists = findBrevInstanceStatus(out, name) !== null;
 } catch (caught) {
-  const err = caught as { stdout?: string; stderr?: string };
-  if (err.stdout && err.stdout.includes(name)) exists = true;
-  if (err.stderr && err.stderr.includes(name)) exists = true;
+  const err = caught as { stdout?: string };
+  if (err.stdout) exists = findBrevInstanceStatus(err.stdout, name) !== null;
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/deploy.ts` around lines 240 - 248, The current existence check uses
substring matching (out.includes(name)) which can falsely match names like "foo"
in "foo-prod" or in error text; update the check after calling
execFileSync("brev", ["ls"]) to split the output into lines (e.g.,
out.split(/\r?\n/)), trim each line and test for an exact equality to name (use
lines.some(line => line.trim() === name)) to set exists; apply the same
exact-line check to err.stdout and err.stderr instead of includes so only an
exact instance name marks existence (referencing the exists variable and the
out/err handling in this try/catch block).
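The false positive this prompt describes is easy to demonstrate in the shell; the sample names below are hypothetical, and the `sed | grep -x` pipeline mirrors the trim-then-exact-match logic the prompt asks for:

```shell
#!/usr/bin/env bash
out=$'foo-prod\n  foo-staging\nbar'   # pretend "brev ls" output
name="foo"

# Substring check (current behavior): false positive via "foo-prod".
[[ "$out" == *"$name"* ]] && echo "substring: exists"

# Exact per-line check after trimming: correctly reports missing.
if printf '%s\n' "$out" | sed 's/^ *//; s/ *$//' | grep -qxF -- "$name"; then
  echo "exact: exists"
else
  echo "exact: missing"
fi
```

Running this prints "substring: exists" followed by "exact: missing", which is exactly the divergence the comment warns about.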
```typescript
execFileSync(
  "ssh",
  ["-o", "ConnectTimeout=5", "-o", "StrictHostKeyChecking=no", name, "echo", "ok"],
  { encoding: "utf-8", stdio: "ignore" },
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
rg -n --fixed-strings 'StrictHostKeyChecking=no' src/lib/deploy.ts
```
Repository: NVIDIA/NemoClaw
Length of output: 1380
Stop passing StrictHostKeyChecking=no to SSH/scp operations; disabling host-key verification exposes credential transmission to MITM attacks.
All nine SSH and scp commands in this file skip host key verification with StrictHostKeyChecking=no, including the scp at line 338 that uploads the .env file containing API credentials. This silently accepts host-key mismatches and can disclose secrets to the wrong host. Use a pinned known_hosts file, or at least a trust-on-first-use mode that prevents key changes.
Affected instances: lines 296, 313, 320, 323, 338, 341, 358, 367, 387
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/deploy.ts` around lines 294 - 297, Replace the insecure SSH/SCP
option StrictHostKeyChecking=no used in the execFileSync calls (the SSH checks
invoking "ssh" and the scp upload of the .env) with a safer host-key policy:
either Point SSH to a pinned known_hosts file via -o
UserKnownHostsFile=/path/to/known_hosts and remove StrictHostKeyChecking=no, or
use TOFU by setting -o StrictHostKeyChecking=accept-new; apply this change to
every execFileSync invocation that runs "ssh" or "scp" (the calls that pass name
and the one that uploads the .env) and ensure the commands also reference the
chosen known_hosts file when appropriate.
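The safer option set this prompt describes can be sketched in the shell: pin a known_hosts file and use accept-new (trust-on-first-use) instead of disabling verification. The known_hosts path and the `NEMOCLAW_KNOWN_HOSTS` variable are assumptions for illustration:

```shell
#!/usr/bin/env bash
# Build the SSH/scp options once and reuse them for every remote call.
known_hosts="${NEMOCLAW_KNOWN_HOSTS:-$HOME/.nemoclaw/known_hosts}"
ssh_opts=(
  -o ConnectTimeout=5
  -o StrictHostKeyChecking=accept-new   # TOFU: record new keys, reject changed ones
  -o "UserKnownHostsFile=$known_hosts"
)

# Hypothetical usage (commands from the review, not run here):
#   ssh "${ssh_opts[@]}" "$name" echo ok
#   scp "${ssh_opts[@]}" .env "$name:~/.env"
```

accept-new still connects to first-seen hosts unattended, but a key that later changes aborts the connection instead of silently handing the .env to the wrong machine.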
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/install-preflight.test.js (1)
709-755: ⚠️ Potential issue | 🟠 Major
This no longer verifies the non-interactive terms gate.
Because this case doesn't force preflight to pass, the installer can skip onboarding before `show_usage_notice()` runs and still satisfy `/Skipping onboarding/`. That means the test now passes for unrelated Docker/OpenShell state instead of proving the acceptance requirement.
Suggested fix
```diff
   writeNodeStub(fakeBin);
+  writeExecutable(
+    path.join(fakeBin, "docker"),
+    `#!/usr/bin/env bash
+if [ "$1" = "info" ]; then
+  echo '{"ServerVersion":"29.3.1","OperatingSystem":"Ubuntu 24.04","CgroupVersion":"2"}'
+  exit 0
+fi
+exit 0
+`,
+  );
+  writeExecutable(
+    path.join(fakeBin, "openshell"),
+    `#!/usr/bin/env bash
+if [ "$1" = "--version" ]; then
+  echo "openshell 0.0.22"
+  exit 0
+fi
+exit 0
+`,
+  );
   writeNpmStub(
     fakeBin,
     `if [ "$1" = "pack" ]; then
@@
   const result = spawnSync("bash", [INSTALLER, "--non-interactive"], {
@@
   });
-  expect(`${result.stdout}${result.stderr}`).toMatch(
-    /Skipping onboarding|--yes-i-accept-third-party-software|NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1/,
-  );
+  expect(result.status).not.toBe(0);
+  expect(`${result.stdout}${result.stderr}`).toMatch(
+    /--yes-i-accept-third-party-software|NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1/,
+  );
   expect(fs.existsSync(onboardLog)).toBe(false);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/install-preflight.test.js` around lines 709 - 755, The test no longer guarantees the installer reached the terms gate because preflight can fail earlier; update the test so preflight is forced to succeed and the installer reaches show_usage_notice() before asserting onboarding was skipped. Concretely, in the test around INSTALLER invocation ensure you arrange/stub the preflight checks (or set the env variable(s) the preflight uses) so they return success (so show_usage_notice() is executed), keep NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE empty to simulate non-acceptance, then assert the output contains the terms-acceptance hints and that onboardLog does not exist; reference the installer invocation via INSTALLER, the show_usage_notice() path, the NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE env var, and the onboardLog file to locate where to change the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/preflight.ts`:
- Around line 203-207: detectNvidiaGpu currently returns true just because the
nvidia-smi binary exists; change it to first verify commandExists("nvidia-smi",
runCaptureImpl) and then actually probe the device by running nvidia-smi (e.g.
runCaptureImpl("nvidia-smi -L", { ignoreError: true }) or
runCaptureImpl("nvidia-smi --query-gpu=name --format=csv,noheader", {
ignoreError: true })) and return true only if the probe returns
non-empty/expected output (trimmed) indicating at least one GPU; otherwise
return false. Use the existing runCaptureImpl and commandExists helpers and keep
error handling via the ignoreError option so an absent/failed probe yields false
rather than throwing.
In `@test/e2e/brev-e2e.test.js`:
- Around line 279-285: The pre-cleanup incorrectly assumes
deleteBrevInstance(INSTANCE_NAME) succeeded; change the pre-cleanup to check the
boolean result and throw or retry/log an error when it returns false so we don't
treat a failed delete as success. Also ensure instanceCreated is set as soon as
the VM is actually created (either by returning a created flag from
runLocalDeploy or by setting instanceCreated = true immediately after detecting
creation) so that afterAll() will always attempt deletion even if later
bootstrap steps fail; update runLocalDeploy usage and afterAll cleanup logic to
rely on that explicit created flag.
---
Outside diff comments:
In `@test/install-preflight.test.js`:
- Around line 709-755: The test no longer guarantees the installer reached the
terms gate because preflight can fail earlier; update the test so preflight is
forced to succeed and the installer reaches show_usage_notice() before asserting
onboarding was skipped. Concretely, in the test around INSTALLER invocation
ensure you arrange/stub the preflight checks (or set the env variable(s) the
preflight uses) so they return success (so show_usage_notice() is executed),
keep NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE empty to simulate non-acceptance, then
assert the output contains the terms-acceptance hints and that onboardLog does
not exist; reference the installer invocation via INSTALLER, the
show_usage_notice() path, the NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE env var, and
the onboardLog file to locate where to change the test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 80955231-b32a-4dc6-8604-041a7326547d
📒 Files selected for processing (9)
- README.md
- bin/lib/onboard.js
- bin/nemoclaw.js
- docs/reference/commands.md
- scripts/install.sh
- spark-install.md
- src/lib/preflight.ts
- test/e2e/brev-e2e.test.js
- test/install-preflight.test.js
✅ Files skipped from review due to trivial changes (1)
- docs/reference/commands.md
🚧 Files skipped from review as they are similar to previous changes (3)
- README.md
- spark-install.md
- bin/lib/onboard.js
```typescript
function detectNvidiaGpu(
  runCaptureImpl: (command: string, options?: { ignoreError?: boolean }) => string,
): boolean {
  return commandExists("nvidia-smi", runCaptureImpl);
}
```
Probe nvidia-smi, don't just check that it's on PATH.
`hasNvidiaGpu` flips to `true` whenever the binary exists, even if the host has no usable NVIDIA device/driver. On CPU images that still ship `nvidia-smi`, that suppresses the headless-host hint and mislabels the assessment.
Suggested fix
function detectNvidiaGpu(
runCaptureImpl: (command: string, options?: { ignoreError?: boolean }) => string,
): boolean {
- return commandExists("nvidia-smi", runCaptureImpl);
+ if (!commandExists("nvidia-smi", runCaptureImpl)) {
+ return false;
+ }
+ try {
+ const output = runCaptureImpl("nvidia-smi -L 2>/dev/null", { ignoreError: true });
+ return Boolean(String(output || "").trim());
+ } catch {
+ return false;
+ }
}

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
function detectNvidiaGpu(
  runCaptureImpl: (command: string, options?: { ignoreError?: boolean }) => string,
): boolean {
  if (!commandExists("nvidia-smi", runCaptureImpl)) {
    return false;
  }
  try {
    const output = runCaptureImpl("nvidia-smi -L 2>/dev/null", { ignoreError: true });
    return Boolean(String(output || "").trim());
  } catch {
    return false;
  }
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/preflight.ts` around lines 203 - 207, detectNvidiaGpu currently
returns true just because the nvidia-smi binary exists; change it to first
verify commandExists("nvidia-smi", runCaptureImpl) and then actually probe the
device by running nvidia-smi (e.g. runCaptureImpl("nvidia-smi -L", {
ignoreError: true }) or runCaptureImpl("nvidia-smi --query-gpu=name
--format=csv,noheader", { ignoreError: true })) and return true only if the
probe returns non-empty/expected output (trimmed) indicating at least one GPU;
otherwise return false. Use the existing runCaptureImpl and commandExists
helpers and keep error handling via the ignoreError option so an absent/failed
probe yields false rather than throwing.
  // Pre-cleanup: delete any leftover instance with the same name.
  // This can happen when a previous run's create succeeded on the backend
  // but the CLI got a network error (unexpected EOF) before confirming,
  // then the retry/fallback fails with "duplicate workspace".
  try {
-   brev("delete", INSTANCE_NAME);
+   deleteBrevInstance(INSTANCE_NAME);
    console.log(`[${elapsed()}] Deleted leftover instance "${INSTANCE_NAME}"`);
Don't lose cleanup state on partial instance lifecycle failures.
deleteBrevInstance() returns false when the instance still exists, but the pre-cleanup branch treats any return as success. Also, instanceCreated is only set after runLocalDeploy() returns; if deploy creates the VM and then fails during remote bootstrap, afterAll() skips deletion and leaks the instance.
Suggested fix
try {
- deleteBrevInstance(INSTANCE_NAME);
- console.log(`[${elapsed()}] Deleted leftover instance "${INSTANCE_NAME}"`);
+ const hadLeftover = hasBrevInstance(INSTANCE_NAME);
+ if (hadLeftover && !deleteBrevInstance(INSTANCE_NAME)) {
+ throw new Error(`Failed to delete leftover instance "${INSTANCE_NAME}" before test start`);
+ }
+ if (hadLeftover) {
+ console.log(`[${elapsed()}] Deleted leftover instance "${INSTANCE_NAME}"`);
+ }
} catch {
// Expected — no leftover instance exists
}
if (TEST_SUITE === "deploy-cli") {
console.log(`[${elapsed()}] Running nemoclaw deploy end to end...`);
- runLocalDeploy(INSTANCE_NAME);
- instanceCreated = true;
+ try {
+ runLocalDeploy(INSTANCE_NAME);
+ } finally {
+ instanceCreated = hasBrevInstance(INSTANCE_NAME);
+ }
try {
brev("refresh");
    } catch {

Also applies to: 290-299
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/brev-e2e.test.js` around lines 279 - 285, The pre-cleanup
incorrectly assumes deleteBrevInstance(INSTANCE_NAME) succeeded; change the
pre-cleanup to check the boolean result and throw or retry/log an error when it
returns false so we don't treat a failed delete as success. Also ensure
instanceCreated is set as soon as the VM is actually created (either by
returning a created flag from runLocalDeploy or by setting instanceCreated =
true immediately after detecting creation) so that afterAll() will always
attempt deletion even if later bootstrap steps fail; update runLocalDeploy usage
and afterAll cleanup logic to rely on that explicit created flag.
Summary
- Move `deploy` behavior into TypeScript, thin the Brev compatibility wrapper, and harden Brev readiness handling
- Retire the legacy setup scripts (`setup-spark`, `brev-setup.sh`) in favor of the canonical installer + onboard flow

What Changed
- Added `src/lib/preflight.ts` (host preflight assessment) and `src/lib/deploy.ts` (deploy orchestration)
- Reworked `nemoclaw deploy` to use the authenticated Brev CLI, current Brev create flags, explicit GCP provider default, stricter readiness checks, and the standard installer/onboard flow
- Removed `scripts/setup-spark.sh` and reduced `scripts/brev-setup.sh` to a deprecated compatibility wrapper
- `npm run build:cli`
- Unit tests: `src/lib/preflight.test.ts`, `src/lib/deploy.test.ts`, `test/install-preflight.test.js`, `test/cli.test.js`, `test/runner.test.js`
- E2E: `TEST_SUITE=deploy-cli` on `cpu-e2.4vcpu-16gb`, reaching `status=RUNNING`, `build_status=COMPLETED`, `shell_status=READY`
Credit / Prior Work
This branch builds on ideas and prior work from:
Summary by CodeRabbit
Release Notes
New Features
Improvements
- … `nemoclaw onboard`.

Deprecations

- `nemoclaw deploy` now delegates to the standard installer; prefer separate host provisioning + `nemoclaw onboard`.
- `nemoclaw setup-spark` marked as a deprecated compatibility alias for `nemoclaw onboard`.