
fix(onboard): run inference curl probes without shell expansion#890

Open
OffbeatOps wants to merge 2 commits into NVIDIA:main from OffbeatOps:fix/onboard-inference-curl-probes

Conversation

@OffbeatOps

@OffbeatOps OffbeatOps commented Mar 25, 2026

Summary

Onboarding inference validation builds curl commands under `bash -c` and passes the API token via `$NEMOCLAW_PROBE_API_KEY`. That is fragile: corrupted or pasted keys (concatenation, stray CR/LF, shell-sensitive characters) can make curl fail before any HTTP response is received. Users then saw messages like `HTTP 43` even though 43 was curl's exit code, not an HTTP status.

Changes

  • Run the same probes and /models fetches with spawnSync("curl", argv, …) so headers and bodies are passed as argv (no shell re-parsing of secrets).
  • Add --http1.1 on those requests to reduce occasional HTTP/2-related failures on some networks.
  • When curl exits non-zero, report curl failed (exit N) and include a short stderr snippet instead of labeling the code as HTTP.
  • Normalize credentials: trim and strip \r for values read from the environment and ~/.nemoclaw/credentials.json.

Testing

  • node --check on the modified files.
  • Manually exercised NVIDIA Endpoints onboarding after these changes (chat completions probe succeeds; Responses probe still 404 as expected for integrate.api.nvidia.com).

Happy to adjust if you prefer keeping HTTP/2 for probes or want the credential normalization scoped to specific keys only.

Summary by CodeRabbit

  • New Features

    • More robust onboarding wizard with clearer prompts and repeat-until-valid API key entry; interactive prompts now mask input.
  • Bug Fixes

    • Consistent credential normalization (trim/CR/whitespace) and more reliable handling of missing or empty stored secrets.
    • Improved terminal input/backspace behavior during interactive prompts.
  • Improvements

    • More informative network probe and HTTP-status reporting; safer, more reliable external call handling.
  • Documentation

    • Large docs refresh: new/updated onboarding, inference options, security best-practices, deployment, and many skill guides.

- Use spawnSync("curl", argv) for OpenAI-like probes, Anthropic probes, and /models fetches so credentials are not reinterpreted by bash.
- Add --http1.1 on those requests to reduce HTTP/2-related failures on some networks.
- Report curl exit codes as curl failures (not HTTP status) and include trimmed stderr.
- Trim secrets loaded from the environment and credentials.json and strip CR characters.
@coderabbitai

coderabbitai bot commented Mar 25, 2026

📝 Walkthrough

Walkthrough

Lazy credential path resolution, consistent credential normalization/redaction, and interactive prompt robustness were added. Network probes were refactored to use direct curl spawns with improved error reporting. Many bin shims were replaced with compiled dist exports, extensive docs/skills added/rewritten, new TypeScript libraries (preflight, onboard-session, inference, nim, debug, etc.) were introduced, and CI/workflow updates applied.

Changes

  • Credentials & Interactive IO (`bin/lib/credentials.js`): Lazy HOME-resolution getters for CREDS_DIR/CREDS_FILE, added normalizeCredentialValue(value), consistent trimming/`\r` stripping, null handling for absent secrets, improved interactive prompt (backspace handling, `*` for secret input), explicit SIGINT finish handling.
  • Network Probes / Onboarding (`bin/lib/onboard.js`, `src/lib/local-inference.ts`, `src/lib/inference-config.ts`): Replaced shell `bash -c` curl invocations with `spawnSync("curl", args)`, added getCurlSpawnArgs()/probeDisplayCode(), expanded summarizeProbeError() to include stderr, and moved endpoint/model probing into typed src helpers used by onboarding flows.
  • Bin shims → dist & CLI changes (`bin/lib/*` re-exports, `bin/nemoclaw.js`): Numerous bin/lib files converted to thin shims that require `../../dist/lib/...`; `bin/nemoclaw.js` significantly extended for OpenShell integration, registry recovery, gateway/sandbox reconciliation, and robust connect/delete flows.
  • New TypeScript libraries & tests (`src/lib/*.ts`, `src/lib/*.test.ts`: preflight, onboard-session, runtime-recovery, nim, local-inference, dashboard, gateway-state, inference-config, build-context, debug): Added many typed modules implementing preflight checks, onboard-session persistence and locking, runtime recovery heuristics, NIM container management, local inference validation, dashboard URL building, and test coverage suites. Notable API surface additions are documented in the files.
  • Registry, Policies, Runner, Redaction (`bin/lib/registry.js`, `bin/lib/policies.js`, `bin/lib/runner.js`): Registry gained an advisory lock (acquire/release/withLock) and atomic saves; policies parsing now uses YAML merge with structured fallback and an interactive preset selector; runner now captures and redacts secrets, exposing a redact export and writing redacted command output.
  • Docs, Skills & Security Docs (`.agents/skills/**`, `docs/**`, AGENTS.md, CLAUDE.md, CONTRIBUTING.md): Added/rewrote many agent skills and reference docs (inference-options, sandbox-hardening, security-best-practices, security-code-review, get-started, etc.), changed the frontmatter schema (description.main/agent), and updated multiple docs cross-links.
  • CI / Workflow / Tooling (`.github/**`, `.pre-commit-config.yaml`, package.json, Makefile, Dockerfile*): Added new composite actions and workflows (main, base-image, sandbox-images-and-e2e, dco-check), updated the PR workflow to use basic-checks, pinned some actions, changed base-image handling, added a Docker base image recipe and build/push workflow, and updated formatting/typecheck hooks.
  • Installers & Scripts (install.sh, scripts/install.sh, scripts/*: brev launchable, dns proxy, setup-dns-proxy, setup-spark, brev-launchable-ci-cpu.sh, etc.): Reworked the top-level installer bootstrap and full installer, added many operational scripts (launchable CI, DNS proxy, swap/setup helpers), removed legacy scripts/setup.sh, and hardened installers with integrity checks and onboarding resilience.
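As an illustration of the runner's secret-redaction idea summarized above (the real `redact` export's signature is not shown here, so names and behavior are assumptions):

```javascript
// Hypothetical sketch: build a redactor over known secret values, then run all
// captured command output through it before writing anywhere.
function makeRedactor(secrets) {
  // Skip empty/very short values so we never blank out common substrings.
  const values = secrets.filter((s) => s && s.length >= 4);
  return function redact(text) {
    let out = String(text);
    for (const v of values) {
      out = out.split(v).join("[REDACTED]"); // literal replace, no regex escaping needed
    }
    return out;
  };
}
```

The split/join trick avoids regex-escaping the secret, which may contain characters like `$` or `(` that would otherwise need quoting.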

Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant User as CLI (nemoclaw)
  participant Registry as Local Registry (sandboxes.json)
  participant Gateway as OpenShell Gateway
  participant Onboard as Onboard Session / Host

  User->>Registry: list / request connect
  Registry->>Gateway: recoverRegistryFromLiveGateway()
  Gateway-->>Registry: live sandbox list / gateway info
  Registry->>User: reconciled registry entries
  User->>Gateway: ensureLiveSandboxOrExit()
  alt Gateway missing or inconsistent
    Gateway-->>Onboard: getRecoveryCommand()
    Onboard-->>User: prompt/resume onboarding
    User->>Onboard: run onboard (--resume if available)
    Onboard-->>Gateway: register sandbox / configure providers
    Gateway-->>Registry: updated live state
  else Gateway healthy
    User->>Gateway: openshell sandbox connect
  end
```

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested labels

NemoClaw CLI

Poem

🐇 I nibble lines of secrets, trim each stray CR,
I hop from shell to spawn so curl behaves like a star,
Shims lead to compiled lands where typed helpers play,
Docs bloom like carrots, CI polishes the way —
Huzzah! a rabbit cheers, builds green paths for devs today. 🥕



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/credentials.js`:
- Around line 28-38: getCredential currently returns a normalized secret but
does not propagate that cleaned value back into process.env, which lets
onboarding's setupInference -> upsertProvider still read the original env var
(resolvedCredentialEnv) with CR/LF; update getCredential to normalize and then
assign the cleaned value back into process.env[key] before returning (use
normalizeSecret and loadCredentials as needed), or alternatively change the
provider-setup path (setupInference/upsertProvider) to call
getCredential(resolvedCredentialEnv) instead of reading process.env directly so
the cleaned value is consumed when exporting credentials.

In `@bin/lib/onboard.js`:
- Around line 365-369: probeDisplayCode currently only checks result.status and
returns null for spawnSync launch failures; change it to first check
result.error and result.signal and raise/return a clear failure instead of null:
if result.error exists, throw (or return an error object) with a message like
"curl failed to start: <result.error.code|message>", and if result.signal
exists, throw/return "curl terminated by signal <result.signal>". Apply the same
change to the other identical probe block referenced (the logic around the
second probe at the other occurrence) so callers receive actionable error
information rather than HTTP null.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fb779e01-5d57-413c-a5a6-c1868fe72e44

📥 Commits

Reviewing files that changed from the base of the PR and between b2164e7 and c5658ad.

📒 Files selected for processing (2)
  • bin/lib/credentials.js
  • bin/lib/onboard.js

Comment on lines +28 to +38
```diff
+function normalizeSecret(value) {
+  if (value == null) return null;
+  return String(value).replace(/\r/g, "").trim();
+}
+
 function getCredential(key) {
-  if (process.env[key]) return process.env[key];
+  if (process.env[key]) return normalizeSecret(process.env[key]);
   const creds = loadCredentials();
-  return creds[key] || null;
+  const raw = creds[key];
+  if (raw == null) return null;
+  return normalizeSecret(raw);
 }
```


⚠️ Potential issue | 🟠 Major

Propagate the cleaned secret into the environment.

getCredential() only returns the normalized value. In non-interactive onboarding, setupInference() still forwards process.env[resolvedCredentialEnv] unchanged to upsertProvider() at bin/lib/onboard.js Line 1957 through Line 1960, so a key with trailing CR/LF can pass the probe here and then still be written to OpenShell with the bad bytes. Please normalize before exporting, or have the provider-setup path consume getCredential() instead of the raw env var.
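A minimal sketch of that suggestion (the real `loadCredentials` is module-level state; here it is injected so the sketch stays self-contained, and names mirror the PR only loosely):

```javascript
function normalizeSecret(value) {
  if (value == null) return null;
  return String(value).replace(/\r/g, "").trim();
}

// Normalize and write the cleaned value back into process.env, so code that
// later reads the raw env var (e.g. provider setup) sees the same bytes.
function getCredential(key, loadCredentials) {
  if (process.env[key]) {
    const cleaned = normalizeSecret(process.env[key]);
    process.env[key] = cleaned; // propagate so setupInference reads the cleaned secret
    return cleaned;
  }
  const raw = loadCredentials()[key];
  return raw == null ? null : normalizeSecret(raw);
}
```

The alternative the reviewer mentions, having the provider-setup path call `getCredential()` instead of reading `process.env` directly, avoids mutating the environment at all.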

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/credentials.js` around lines 28 - 38, getCredential currently returns
a normalized secret but does not propagate that cleaned value back into
process.env, which lets onboarding's setupInference -> upsertProvider still read
the original env var (resolvedCredentialEnv) with CR/LF; update getCredential to
normalize and then assign the cleaned value back into process.env[key] before
returning (use normalizeSecret and loadCredentials as needed), or alternatively
change the provider-setup path (setupInference/upsertProvider) to call
getCredential(resolvedCredentialEnv) instead of reading process.env directly so
the cleaned value is consumed when exporting credentials.

Comment on lines +365 to +369
```js
function probeDisplayCode(result) {
  const httpCode = Number(String(result.stdout || "").trim());
  if (result.status !== 0) return result.status;
  return Number.isFinite(httpCode) ? httpCode : 0;
}
```


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In Node.js child_process.spawnSync, what values do result.status, result.signal, and result.error have when the executable cannot be found or when the child process is terminated by a signal?

💡 Result:

When the executable cannot be found (ENOENT error):
  • result.status: null
  • result.signal: null
  • result.error: Error object with code 'ENOENT' (e.g., Error: spawnSync not_found ENOENT)

When the child process is terminated by a signal:
  • result.status: null
  • result.signal: the signal name (e.g., 'SIGTERM', 'SIGKILL')
  • result.error: undefined



🏁 Script executed:

```bash
cat -n bin/lib/onboard.js | sed -n '360,375p'
```

Repository: NVIDIA/NemoClaw

Length of output: 1870


🏁 Script executed:

```bash
cat -n bin/lib/onboard.js | sed -n '525,545p'
```

Repository: NVIDIA/NemoClaw

Length of output: 899


🏁 Script executed:

```bash
# Search for callers of probeDisplayCode to understand the impact
rg "probeDisplayCode" bin/lib/onboard.js -B 3 -A 3
```

Repository: NVIDIA/NemoClaw

Length of output: 2136


Handle spawnSync() launch failures before formatting them as HTTP errors.

If curl cannot be started or the child is terminated by a signal, `spawnSync()` sets `result.status` to null and reports the failure via `result.error` (with code `ENOENT` for a missing executable) or `result.signal` (for signal termination). The `probeDisplayCode()` helper currently returns null in these cases, so callers end up showing `HTTP null` with no response body instead of an actionable "curl failed to start" or "curl terminated" message.

Check result.error and result.signal in addition to result.status to provide meaningful error messages to users.

Also applies to: 532-536
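One way to implement that suggestion (illustrative only; `describeProbeResult` is a hypothetical name, and whether to throw or return a structured result is the maintainers' call):

```javascript
// Distinguish spawn launch failures and signal terminations from real HTTP
// codes before formatting the probe result for the user.
function describeProbeResult(result) {
  if (result.error) {
    // spawnSync couldn't start curl at all (e.g. ENOENT: curl not installed)
    return { kind: "launch-failure", message: `curl failed to start: ${result.error.code || result.error.message}` };
  }
  if (result.signal) {
    // child was killed; result.status is null in this case
    return { kind: "signal", message: `curl terminated by signal ${result.signal}` };
  }
  if (result.status !== 0) {
    return { kind: "curl-error", message: `curl failed (exit ${result.status})` };
  }
  const httpCode = Number(String(result.stdout || "").trim());
  return { kind: "http", code: Number.isFinite(httpCode) ? httpCode : 0 };
}
```

Checking `result.error` and `result.signal` first matters because both leave `result.status` as null, which is exactly the value the current helper passes through.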

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 365 - 369, probeDisplayCode currently only
checks result.status and returns null for spawnSync launch failures; change it
to first check result.error and result.signal and raise/return a clear failure
instead of null: if result.error exists, throw (or return an error object) with
a message like "curl failed to start: <result.error.code|message>", and if
result.signal exists, throw/return "curl terminated by signal <result.signal>".
Apply the same change to the other identical probe block referenced (the logic
around the second probe at the other occurrence) so callers receive actionable
error information rather than HTTP null.

kjw3 added a commit that referenced this pull request Mar 31, 2026
## Summary

Smooth out inference configuration during `install.sh` / `nemoclaw
onboard`, especially when provider authorization, credential formatting,
endpoint probing, or final inference application fail.

This PR makes the hosted-provider onboarding path recoverable instead of
brittle:
- normalize and safely handle credential input
- classify validation failures more accurately
- let users re-enter credentials in place
- make final `openshell inference set` failures recoverable
- normalize over-specified custom base URLs
- add lower-level `back` / `exit` navigation so users can move up a
level without restarting the whole install
- clarify recovery prompts with explicit commands (`retry`, `back`,
`exit`)

## What Changed

- refactored provider probe execution to use direct `curl` argv
invocation instead of `bash -c`
- normalized credential values before use/persistence
- added structured auth / transport / model / endpoint failure
classification
- added in-place credential re-entry for hosted providers:
  - NVIDIA Endpoints
  - OpenAI
  - Anthropic
  - Google Gemini
  - custom OpenAI-compatible endpoints
  - custom Anthropic-compatible endpoints
- wrapped final provider/apply failures in interactive recovery instead
of hard abort
- added command-style recovery prompts:
  - `retry`
  - `back`
  - `exit`
- allowed `back` from lower-level inference prompts (model entry, base
URL entry, recovery prompts)
- normalized custom endpoint inputs to the minimum usable base URL
- removed stale `NVIDIA Endpoints (recommended)` wording
- secret prompts now show masked `*` feedback while typing/pasting
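The base-URL normalization can be sketched like this (a hypothetical helper under assumed rules; the commit's actual implementation may handle more endpoint suffixes):

```javascript
// Trim an over-specified base URL (e.g. a pasted ".../v1/chat/completions")
// down to the minimum usable base the provider config expects.
function normalizeBaseUrl(input) {
  let url = String(input).trim().replace(/\/+$/, ""); // drop trailing slashes
  // Strip a trailing endpoint path the user pasted along with the base.
  url = url.replace(/\/(chat\/completions|completions|responses|models|messages)$/, "");
  return url;
}
```

A URL that is already minimal passes through unchanged, so the helper is safe to apply unconditionally.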

## Validation

```bash
npx vitest run test/credentials.test.js test/onboard-selection.test.js test/onboard.test.js
npx vitest run test/cli.test.js
npx eslint bin/lib/credentials.js bin/lib/onboard.js test/credentials.test.js test/onboard-selection.test.js test/onboard.test.js
npx tsc -p jsconfig.json --noEmit
```

## Issue Mapping

Fully addressed in this PR:
- Fixes #1099
- Fixes #1101
- Fixes #1130

Substantially addressed / partially addressed:
- #987
- improves NVIDIA validation behavior and failure classification so
false/misleading connectivity failures are much less likely, but this PR
is framed as onboarding recovery hardening rather than a WSL-specific
networking fix
- #301
- improves graceful handling when validation/apply fails, especially for
transport/upstream problems, but does not add provider auto-fallback or
a broader cloud-outage fallback strategy
- #446
- improves recovery specifically for the inference-configuration step,
but does not fully solve general resumability across all onboarding
steps

Related implementation direction:
- #890
- this PR aligns with the intent of safer/non-shell probe execution and
clearer validation reporting
- #380
- not implemented here; no automatic provider fallback was added in this
branch

## Notes

- This PR intentionally does not weaken validation or reopen old
onboarding shortcuts.
- Unrelated local `tmp/` noise was left out of the branch.

Signed-off-by: Kevin Jones <kejones@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Interactive onboarding navigation (`back`/`exit`/`quit`) with
credential re-prompting and retry flows.
* Improved probe/validation flow with clearer recovery options and more
robust sandbox build progress messages.
  * Secret input masks with reliable backspace behavior.

* **Bug Fixes**
* Credential sanitization (trim/line-ending normalization) and API key
validation now normalize and retry instead of exiting.
* Better classification and messaging for authorization/validation
failures; retries where appropriate.

* **Tests**
* Expanded tests for credential prompts, masking, retry flows,
validation classification, and onboarding navigation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran added the fix label Mar 31, 2026
@wscurran

✨ Thanks for submitting this PR with a detailed summary. It proposes a fix that improves the onboarding experience and prevents issues with curl probes.

@wscurran wscurran added the Getting Started Use this label to identify setup, installation, or onboarding issues. label Mar 31, 2026
cv added a commit that referenced this pull request Apr 1, 2026
…ript modules (#1240)

## Summary

- Extract ~210 lines of pure, side-effect-free functions from the
3,800-line `onboard.js` into **5 typed TypeScript modules** under
`src/lib/`:
- `gateway-state.ts` — gateway/sandbox state classification from
openshell output
- `validation.ts` — failure classification, API key validation, model ID
checks
  - `url-utils.ts` — URL normalization, text compaction, env formatting
  - `build-context.ts` — Docker build context filtering, recovery hints
  - `dashboard.ts` — dashboard URL resolution and construction
- Add **56 co-located unit tests** (`src/lib/*.test.ts`) for the
extracted modules
- Set up CLI TypeScript compilation: `tsconfig.src.json` compiles `src/`
→ `dist/` as CJS
- `onboard.js` imports from compiled `dist/lib/` output — transparent to
callers
- Pre-commit hook updated to build TS and include `dist/lib/` in
coverage

These functions are **not touched by any #924 blocker PR** (#781, #782,
#819, #672, #634, #890), so this extraction is safe to land immediately.

## Test plan

- [x] 598 CLI tests pass (542 existing + 56 new)
- [x] `tsc -p tsconfig.src.json` compiles cleanly
- [x] `tsc -p tsconfig.cli.json` type-checks cleanly
- [x] `tsc -p jsconfig.json` type-checks cleanly
- [x] Coverage ratchet passes with `dist/lib/` included

Closes #1237. Relates to #924 (shell consolidation).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Improved sandbox-creation recovery hints and targeted remediation
commands.
  * Smarter dashboard URL resolution and control-UI URL construction.

* **Bug Fixes**
  * More accurate gateway and sandbox state detection.
* Enhanced classification of validation/apply failures and safer
model/key validation.
  * Better provider URL normalization and loopback handling.

* **Tests**
  * Added comprehensive tests covering new utilities.

* **Chores**
* CLI now builds before CLI tests; CI/commit hooks updated to run the
CLI build.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
laitingsheng pushed a commit that referenced this pull request Apr 2, 2026
laitingsheng pushed a commit that referenced this pull request Apr 2, 2026
Merges main (1c50618) into the PR branch. Conflict resolution strategy:
- credentials.js: keep PR's normalizeSecret() + explicit null-check in
  getCredential(); keep main's normalizeCredentialValue() for saveCredential();
  add falsy guard so blank stored values still return null
- onboard.js: adopt main's runCurlProbe() helper (cleaner refactor, also
  avoids shell expansion); add --http1.1 to getCurlTimingArgs() to honour
  the PR's intent of dodging flaky HTTP/2 paths; remove now-redundant
  getCurlSpawnArgs() and probeDisplayCode() helpers introduced by the PR
- test/onboard.test.js: update getCurlTimingArgs regex to match --http1.1
- scripts/install.sh: use -e instead of -d for .git check to support git
  worktrees (where .git is a file, not a directory)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 17

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.agents/skills/nemoclaw-deploy-remote/SKILL.md (1)

1-181: ⚠️ Potential issue | 🟠 Major

This file should not be edited directly.

Per coding guidelines, .agents/skills/nemoclaw-*/*.md files are autogenerated from docs/ via scripts/docs-to-skills.py. Edit the source documentation under docs/ and regenerate skills instead of modifying this file directly.

```bash
#!/bin/bash
# Check if the skills regeneration script exists and how it should be run
fd -t f "docs-to-skills.py" scripts/
# Check for any documentation about regenerating skills
rg -l "docs-to-skills" README.md CONTRIBUTING.md docs/ 2>/dev/null | head -5
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/skills/nemoclaw-deploy-remote/SKILL.md around lines 1 - 181, The
SKILL.md file (.agents/skills/nemoclaw-deploy-remote/SKILL.md) is autogenerated
and must not be edited directly; instead update the source documentation under
docs/ and then run the generator (scripts/docs-to-skills.py) to regenerate this
skill file so changes persist; do not modify the .agents file in-place — change
the appropriate docs/* source content and run the docs-to-skills.py script to
produce the updated .agents/skills output.
🟡 Minor comments (13)
eslint.config.mjs-42-43 (1)

42-43: ⚠️ Potential issue | 🟡 Minor

Fix the complexity comment/value mismatch.

The comment says the threshold is being ratcheted to 15, but the rule enforces 20. Please align one with the other to avoid misleading future maintenance.

Suggested fix
```diff
-      // Cyclomatic complexity — ratchet down to 15 as we refactor suppressed functions
+      // Cyclomatic complexity — currently capped at 20 (target: 15 after refactors)
       "complexity": ["error", { max: 20 }],
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@eslint.config.mjs` around lines 42 - 43, The inline comment and the
"complexity" rule are inconsistent: the comment says ratcheting to 15 but the
rule sets max: 20; update one to match the other in eslint.config.mjs by either
changing the comment text to reflect max: 20 or changing the rule value to
"complexity": ["error", { max: 15 }]; locate the rule by the unique key
"complexity" and ensure the comment immediately above it (mentioning ratcheting)
and the rule's max value are identical.
.agents/skills/security-code-review/SKILL.md-65-69 (1)

65-69: ⚠️ Potential issue | 🟡 Minor

“Read full content” is required but not operationalized.

Step 4 asks for full-file + diff review, but only shows a diff command. Add an explicit full-content read command to match the stated process.

Suggested doc fix
Read the full content of each changed file and the diff for that file:

```diff
+cat <file>
 git diff main...HEAD -- <file>
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/skills/security-code-review/SKILL.md around lines 65 - 69, Update
Step 4 in SKILL.md to operationalize the “Read full content” requirement by
inserting an explicit full-file read command before the existing git diff
command; specifically, modify the Step 4 instructions so they first instruct the
reviewer to output the entire file contents (for example using a file-cat
command) and then run the provided git diff command, ensuring the documentation
includes both commands in sequence and the example suggested in the review
content.


</details>

</blockquote></details>
<details>
<summary>scripts/start-services.sh-129-131 (1)</summary><blockquote>

`129-131`: _⚠️ Potential issue_ | _🟡 Minor_

**Reflect the API-key gate in the startup banner.**

The bridge now stays down when `NVIDIA_API_KEY` is missing, but the banner still reports `not started (no token)`. That points operators at the wrong fix.



Also applies to: 155-158

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @scripts/start-services.sh around lines 129 - 131, Update the startup banner
text to reflect the actual NVIDIA_API_KEY gating: when NVIDIA_API_KEY is unset
the script currently warns ("NVIDIA_API_KEY not set — Telegram bridge will not
start.") but the banner still shows "not started (no token)"; change the banner
message(s) (the startup banner print at the block around lines referenced and
the similar block at 155-158) to indicate the bridge is disabled due to missing
NVIDIA_API_KEY (reference the NVIDIA_API_KEY variable and the Telegram bridge)
so operators see the correct remediation rather than suggesting a missing
Telegram token; keep the existing warn messages intact.


</details>

</blockquote></details>
<details>
<summary>bin/lib/platform.js-38-44 (1)</summary><blockquote>

`38-44`: _⚠️ Potential issue_ | _🟡 Minor_

**Whitelist the runtimes that need the CoreDNS patch.**

`runtime !== "unknown"` enables this path for `podman` and any future runtime string, even though the workaround is described as Docker-specific.

<details>
<summary>Suggested change</summary>

```diff
 function shouldPatchCoredns(runtime, opts = {}) {
   // k3s CoreDNS defaults to a loopback DNS that pods can't reach.
   // Patch it to use a real upstream on most Docker-based runtimes.
   // On WSL2, the host DNS is not routable from k3s pods - skip the
   // patch and let setup-dns-proxy.sh handle resolution instead.
   if (isWsl(opts)) return false;
-  return runtime !== "unknown";
+  return runtime === "colima" || runtime === "docker" || runtime === "docker-desktop";
 }
```

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In `@bin/lib/platform.js` around lines 38 - 44, The current
shouldPatchCoredns(runtime, opts) returns true for any runtime string other than
"unknown", which wrongly enables the Docker-specific CoreDNS patch for
podman/future runtimes; change the runtime check to an explicit whitelist:
define an array of allowed runtimes (e.g., ['docker', 'moby', 'docker-desktop']
or the actual Docker engine identifiers you expect) and return true only if
isWsl(opts) is false and that array includes the provided runtime (use
Array.prototype.includes). Update the logic inside shouldPatchCoredns to use
this whitelist instead of runtime !== "unknown" so only the intended Docker-like
runtimes trigger the patch.

</details>

</blockquote></details>
scripts/setup-dns-proxy.sh-81-82 (1)

81-82: ⚠️ Potential issue | 🟡 Minor

Don't operate on the first substring match.

`grep -F -- "$SANDBOX_NAME" | head -1` can select the wrong pod when names share a prefix or suffix. This script then edits files, iptables rules, and network namespaces inside that pod, so it should require a unique exact match or a label selector before continuing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup-dns-proxy.sh` around lines 81 - 82, The POD selection is using
a substring grep which can pick the wrong pod; update the POD assignment (the
command that sets POD using kctl/kubectl and $SANDBOX_NAME) to require an exact
match or use a label selector: e.g. use kubectl get pod -n openshell -o name
--field-selector metadata.name="$SANDBOX_NAME" or use kubectl get pods -n
openshell -l "yourLabelKey=$SANDBOX_NAME" to ensure uniqueness, and add a check
that POD is non-empty (and fail/exit if multiple matches are found) before
proceeding to modify files, iptables, or namespaces.
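The stricter selection described above can be sketched in bash. Here the candidate list is read from stdin so the logic is testable in isolation; in the script it would come from `kubectl get pods -n openshell -o name`. `select_pod_exact` is a hypothetical helper, not code from the repository:

```shell
# Sketch: resolve a pod by exact name instead of substring grep, and
# refuse to continue on zero or multiple matches.
select_pod_exact() {
  local want="$1" matches count
  matches=$(grep -Fx -- "pod/$want" || true)   # full-line match only
  count=$(printf '%s' "$matches" | grep -c . || true)
  if [ "$count" -eq 0 ]; then
    echo "ERROR: no pod named exactly '$want'" >&2
    return 1
  elif [ "$count" -gt 1 ]; then
    echo "ERROR: multiple pods matched '$want'" >&2
    return 1
  fi
  printf '%s\n' "$matches"
}
```

With `grep -Fx`, a sandbox named `sbx-1` no longer matches `sbx-10`, and the non-empty/unique check runs before anything is mutated inside the pod.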
scripts/fix-coredns.sh-50-53 (1)

50-53: ⚠️ Potential issue | 🟡 Minor

Preserve any existing runtime detection.

Line 50 resets RUNTIME to "unknown" after lib/runtime.sh has already been sourced, so a valid RUNTIME=colima is lost unless DOCKER_HOST also happens to contain colima. That bypasses the Colima-specific upstream fallback and unnecessarily pushes those hosts down the generic discovery path.

Possible fix

```diff
-RUNTIME="unknown"
-if command -v colima >/dev/null 2>&1 && [[ "${DOCKER_HOST:-}" == *colima* ]]; then
+RUNTIME="${RUNTIME:-unknown}"
+if [ "$RUNTIME" = "unknown" ] && command -v colima >/dev/null 2>&1 && [[ "${DOCKER_HOST:-}" == *colima* ]]; then
   RUNTIME="colima"
 fi
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/fix-coredns.sh` around lines 50 - 53, The script resets
RUNTIME="unknown" after sourcing lib/runtime.sh which can overwrite a
previously-detected runtime (e.g. RUNTIME=colima); change the logic so you do
not clobber an existing RUNTIME value—either remove the unconditional
RUNTIME="unknown" assignment or only set RUNTIME if it is unset (e.g. test -z
"${RUNTIME:-}" before assigning), and keep the colima detection block (the
command -v colima and DOCKER_HOST check) so it can set RUNTIME="colima" when
appropriate; reference the RUNTIME variable and the colima detection code to
locate and update the lines.
.agents/skills/nemoclaw-security-best/SKILL.md-33-81 (1)

33-81: ⚠️ Potential issue | 🟡 Minor

Missing closing fence for Mermaid code block.

The Mermaid diagram opening at line 33 lacks a closing fence. The file ends at line 81 with "*Full details in `references/best-practices.md`.*", which will render inside the code block instead of as regular Markdown text. Add a closing ``` fence after line 81 to properly close the code block.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/skills/nemoclaw-security-best/SKILL.md around lines 33 - 81, The
Mermaid code block opened with the "```mermaid" fence is missing a closing
triple-backtick; close the block by adding a ``` fence immediately after the
diagram (after the line containing "*Full details in
`references/best-practices.md`.*") so the Mermaid diagram ends and the following
text renders as normal Markdown.
docs/get-started/quickstart.md-25-30 (1)

25-30: ⚠️ Potential issue | 🟡 Minor

Use a supported MyST callout type instead of generic admonition.

Use note, tip, or warning here instead of admonition to match docs callout conventions.

As per coding guidelines, "Use MyST admonitions (:::{tip}, :::{note}, :::{warning}) for callouts, not bold text or blockquotes."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/get-started/quickstart.md` around lines 25 - 30, Replace the generic
MyST admonition tag ':::{admonition} Alpha software' with a supported callout
type (for example ':::{warning}', ':::{note}', or ':::{tip}') to follow docs
conventions; specifically, update the block starting with ':::{admonition}' and
its closing marker ':::' so the opening tag uses a supported type (e.g.,
':::{warning} Alpha software') and keep the same inner text unchanged.
docs/get-started/quickstart.md-56-56 (1)

56-56: ⚠️ Potential issue | 🟡 Minor

Replace repository links with official documentation links.

These links point to GitHub repository/blob pages; docs pages should reference official documentation locations instead.

As per coding guidelines, "Do not add links to third-party code repositories or community collections; only official tool documentation links are acceptable."

Also applies to: 66-66, 124-124

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/get-started/quickstart.md` at line 56, Replace GitHub repository/blob
links in the quickstart table with the official product documentation URLs:
locate the markdown link entries such as the
"[OpenShell](https://github.com/NVIDIA/OpenShell)" table row and the other links
referenced at the same section (noted also around the other occurrences at lines
referenced) and update their hrefs to point to the official vendor or project
documentation pages rather than GitHub repo/blob pages; ensure the visible link
text (e.g., "OpenShell") remains the same but the URL is the official docs site,
and apply the same change to the other two occurrences mentioned.
scripts/brev-setup.sh-186-187 (1)

186-187: ⚠️ Potential issue | 🟡 Minor

Build errors are silently swallowed.

Redirecting both stdout and stderr to /dev/null hides npm build failures, so if the build fails, nemoclaw is silently unavailable and there is no output to debug from.

Suggested fix: log stderr to a file

```diff
-(cd "$REPO_DIR/nemoclaw" && npm install && npm run build) >/dev/null 2>&1
-(cd "$REPO_DIR" && npm install --ignore-scripts && npm link) >/dev/null 2>&1
+(cd "$REPO_DIR/nemoclaw" && npm install && npm run build) >/dev/null 2>/tmp/nemoclaw-build.log || { cat /tmp/nemoclaw-build.log >&2; fail "nemoclaw build failed"; }
+(cd "$REPO_DIR" && npm install --ignore-scripts && npm link) >/dev/null 2>/tmp/nemoclaw-link.log || { cat /tmp/nemoclaw-link.log >&2; fail "nemoclaw link failed"; }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/brev-setup.sh` around lines 186 - 187, The current subshell commands
for building and linking (the lines invoking (cd "$REPO_DIR/nemoclaw" && npm
install && npm run build) and (cd "$REPO_DIR" && npm install --ignore-scripts &&
npm link)) swallow all output by redirecting both stdout and stderr to
/dev/null; change them to capture stderr to a log file (e.g., redirect only
stderr to "$REPO_DIR/setup-build.log" or separate files like
"nemoclaw-build.err"), preserve or optionally tee stdout, and check the command
exit status after each subshell (using || { echo "Build failed, see log"; cat
<logfile>; exit 1; }) so failures are logged and cause the script to stop;
reference the subshell commands and the variables REPO_DIR and the npm steps
(npm install, npm run build, npm link) when making these changes.
docs/security/best-practices.md-501-508 (1)

501-508: ⚠️ Potential issue | 🟡 Minor

Rename the closing section to Next Steps.

The links are useful, but new pages are expected to end with a Next Steps section rather than Related Topics.

As per coding guidelines, A "Next Steps" section at the bottom links to related pages.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/security/best-practices.md` around lines 501 - 508, Rename the closing
section header "Related Topics" to "Next Steps" in the document (change the
heading line that currently reads "## Related Topics" to "## Next Steps") and
keep the existing list of links unchanged so the bottom of
docs/security/best-practices.md follows the project's guideline that pages end
with a "Next Steps" section linking related pages.
docs/security/best-practices.md-2-4 (1)

2-4: ⚠️ Potential issue | 🟡 Minor

Make the H1 match title.page.

The frontmatter title uses the long form, but the page heading is shortened. Pick one and keep them identical so the page metadata and rendered content stay in sync.

As per coding guidelines, "H1 heading matches the title.page frontmatter value."

Also applies to: 23-23

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/security/best-practices.md` around lines 2 - 4, The H1 in the document
must exactly match the frontmatter value title.page; update either the
frontmatter or the page H1 so they are identical (e.g., set the H1 to "NemoClaw
Security Best Practices — Controls, Risks, and Posture Profiles" to match
title.page or change title.page to the shorter H1), ensuring the H1 text and the
title.page frontmatter key are an exact string match.
scripts/install.sh-362-366 (1)

362-366: ⚠️ Potential issue | 🟡 Minor

Array element removal leaves empty strings instead of removing elements.

The pattern `("${arr[@]/$val/}")` replaces matching elements with empty strings but doesn't actually remove them from the array, so any later iteration over these arrays will still visit the empty entries.

🐛 Proposed fix

```diff
-  _cleanup_pids=("${_cleanup_pids[@]/$pid/}")
-  _cleanup_files=("${_cleanup_files[@]/$log/}")
+  local new_pids=()
+  for p in "${_cleanup_pids[@]:-}"; do
+    [[ "$p" != "$pid" && -n "$p" ]] && new_pids+=("$p")
+  done
+  _cleanup_pids=("${new_pids[@]}")
+
+  local new_files=()
+  for f in "${_cleanup_files[@]:-}"; do
+    [[ "$f" != "$log" && -n "$f" ]] && new_files+=("$f")
+  done
+  _cleanup_files=("${new_files[@]}")
```

Alternatively, since the global cleanup already handles empty entries gracefully (the kill and rm commands will just skip them), this is a minor concern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/install.sh` around lines 362 - 366, The current removal uses pattern
substitution which leaves empty-string elements in _cleanup_pids and
_cleanup_files; replace each substitution with logic that rebuilds the array
excluding the matching element (e.g., iterate over _cleanup_pids and only append
elements not equal to $pid, then assign back to _cleanup_pids; do the same for
_cleanup_files comparing to $log) so the matching entries are truly removed
rather than turned into empty strings; update the lines touching _cleanup_pids
and _cleanup_files accordingly and keep the subsequent return $status unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/skills/security-code-review/SKILL.md:
- Around line 54-55: The docs hardcode the git diff command as "git diff
main...HEAD" which yields incorrect diffs when the PR base isn't main; replace
every occurrence of that hardcoded string (including the instances near lines
54-55, 57-61, and 68-69) with a templated reference that uses the existing
baseRefName variable (e.g., use baseRefName...HEAD) so all diff commands
consistently use baseRefName.
- Around line 25-47: The document conflates PR and issue flows in "Step 2: Check
Out the Code" by only using PR commands (`gh pr checkout`, `gh pr view`); update
the skill to handle issues separately or explicitly restrict it to PRs: add an
explicit branch that detects issue URLs and uses `gh issue view` (or clones and
checks out the repo default branch via `gh repo view`/git checkout the default
branch) and then analyzes referenced commits/files, or change the docs to state
"PRs only" and remove issue examples; ensure both flows are clearly described
and include the distinct commands (`gh pr checkout`/`gh pr view` for PR mode,
`gh issue view` + checkout default branch for issue mode) in the "Step 2: Check
Out the Code" section.

In @.github/workflows/e2e-brev.yaml:
- Line 101: Uncomment and restore the repository guard so the workflow only runs
in the upstream repo: replace the commented line "# if: github.repository ==
'NVIDIA/NemoClaw'" with the active condition "if: github.repository ==
'NVIDIA/NemoClaw'" (so the workflow will not run on forks and will protect
secrets like BREV_API_TOKEN and NVIDIA_API_KEY); ensure the exact conditional
string is used and the YAML indentation matches surrounding steps so the guard
applies correctly to the job or workflow scope.

In `@bin/lib/credentials.js`:
- Around line 241-255: The loop that asks for the NVIDIA API Key using
prompt(...) can spin forever on non-interactive stdin because prompt returns
empty on EOF; update the logic around the while loop (the block that uses key
and normalizeCredentialValue) to fail fast in non-interactive contexts: before
entering the loop check process.stdin.isTTY (or equivalent) and throw/exit with
a clear error if not interactive, and/or after receiving an empty key from
prompt, if process.stdin.isTTY is false (or prompt indicates EOF), log a clear
error and exit instead of continuing; reference the key variable, the prompt(" 
NVIDIA API Key: ", { secret: true }) call, and normalizeCredentialValue so you
change the control flow there.
- Around line 10-42: The resolveHomeDir function currently only rejects exact
matches in UNSAFE_HOME_PATHS; update it to also reject HOME values that are
nested under those unsafe directories by checking whether the resolved and real
paths are equal to or are descendants of any entry in UNSAFE_HOME_PATHS (e.g.,
for each unsafe path check real === unsafe OR real startsWith(unsafe +
path.sep), and do the same for the non-real `home` value), using
path.resolve/path.normalize to canonicalize inputs before comparison; keep using
fs.realpathSync for the real path and throw the same error message when a match
or descendant is detected to prevent HOME paths like /tmp/alice or /dev/shm/user
from being allowed.

In `@docs/get-started/quickstart.md`:
- Around line 77-79: Replace the fenced CLI blocks that currently use ```bash
(for example the block containing the command "curl -fsSL
https://www.nvidia.com/nemoclaw.sh | bash") with ```console and prefix the
command lines with a shell prompt ($) so they read like "$ curl -fsSL
https://www.nvidia.com/nemoclaw.sh | bash"; do the same update for the other CLI
blocks noted (around the blocks at 103-105, 109-111, 115-117, 123-125) so all
command examples use the console language tag and include the $ prompt.

In `@scripts/brev-launchable-ci-cpu.sh`:
- Around line 134-140: Remove the sudo chmod 666 /var/run/docker.sock line and
instead run any immediate Docker commands as the docker group temporarily so SSH
sessions that haven't reloaded group membership can still use Docker; keep the
sudo usermod -aG docker "$TARGET_USER" line, then change the info line to call
docker via sg docker -c "docker --version" (or newgrp docker -c "docker
--version") and capture its output for the log (e.g., info "Docker enabled ($(sg
docker -c 'docker --version' 2>/dev/null | head -c 40))"). This preserves the
immediate feedback without making /var/run/docker.sock world-writable.

In `@scripts/brev-setup.sh`:
- Line 44: Update the hardcoded NodeSource installer checksum by replacing the
outdated value in the NODESOURCE_SHA256 variable with the current SHA-256 for
the setup_22.x installer
(cf868ff8927f7a76f5426c8957929b429841bec5052b94f1dd29f27e55fc3e34). Locate the
NODESOURCE_SHA256 declaration in the script and substitute the old checksum
string with the new one so the checksum verification for the NodeSource
setup_22.x installer succeeds.

In `@scripts/docs-to-skills.py`:
- Around line 789-799: The generated description may contain characters that
break YAML frontmatter, so add an import json at the top and use JSON
string-quoting when writing the SKILL.md frontmatter (i.e., serialize the
combined description with json.dumps(combined) instead of inserting combined
raw); locate where `descriptions` is merged into `combined` and ensure the value
written to the frontmatter uses the JSON-escaped string returned from
json.dumps.

In `@scripts/fix-coredns.sh`:
- Around line 63-65: The script currently falls back to 8.8.8.8 when
UPSTREAM_DNS is unset which leaks DNS off-host; change the behavior in
scripts/fix-coredns.sh so that if UPSTREAM_DNS is empty the script exits
non-zero with a clear error asking for an explicit UPSTREAM_DNS override (or
provide a CLI flag/env var like UPSTREAM_DNS_OVERRIDE) instead of assigning
"8.8.8.8"; update the error message to explain how to supply a safe upstream and
reference the UPSTREAM_DNS variable and the main execution block that uses it
(e.g., the conditional that checks [ -z "$UPSTREAM_DNS" ]) so callers must
opt-in rather than silently using a public resolver.
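The fail-fast behavior suggested for fix-coredns.sh can be sketched as a small guard. The variable name matches the script; the function wrapper and message wording are illustrative, not the required implementation:

```shell
# Sketch: refuse to run without an explicit upstream instead of
# silently defaulting to a public resolver like 8.8.8.8.
require_upstream_dns() {
  if [ -z "${UPSTREAM_DNS:-}" ]; then
    echo "ERROR: UPSTREAM_DNS is not set and no safe default exists." >&2
    echo "Set UPSTREAM_DNS=<resolver-ip> explicitly to opt in to an upstream." >&2
    return 1
  fi
}
```

Callers then opt in with `UPSTREAM_DNS=10.0.0.2 ./fix-coredns.sh` rather than inheriting an off-host resolver.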

In `@scripts/nemoclaw-start.sh`:
- Around line 304-329: The current block sets _SANDBOX_HOME then calls
_write_proxy_snippet which may create dotfiles as root; change the flow so
dotfiles are created/owned by the sandbox user rather than root: when running as
root (id -u == 0) ensure any target files "${_SANDBOX_HOME}/.bashrc" and
"${_SANDBOX_HOME}/.profile" are either created and chown'd to the sandbox user
(use getent result to get the UID/GID) before calling _write_proxy_snippet, or
invoke _write_proxy_snippet as the sandbox user (e.g., with gosu) so files are
owned by sandbox; additionally when not root skip targets that are not writable
to avoid aborts under set -e — reference _SANDBOX_HOME, _write_proxy_snippet,
and the targets "${_SANDBOX_HOME}/.bashrc" and "${_SANDBOX_HOME}/.profile" when
implementing this change.

In `@scripts/setup-dns-proxy.sh`:
- Around line 25-35: The script's current positional parsing treats the sole
argument as GATEWAY_NAME (GATEWAY_NAME="${1:-}" and SANDBOX_NAME="${2:-}"),
causing a one-argument invocation to fail; change the argument handling in
scripts/setup-dns-proxy.sh so that when only one positional arg is provided it
is assigned to SANDBOX_NAME and GATEWAY_NAME is left empty, while when two args
are provided assign GATEWAY_NAME=$1 and SANDBOX_NAME=$2; update the validation
that checks SANDBOX_NAME (and the Usage echo) to reflect the optional gateway
and required sandbox, locating the changes around the GATEWAY_NAME and
SANDBOX_NAME assignments and the subsequent SANDBOX_NAME empty check.
- Around line 122-135: The current forward() logic spawns an unbounded daemon
thread per packet (in the recv loop using threading.Thread(..., daemon=True)),
which can exhaust resources; replace that per-packet thread creation with a
bounded worker pool (e.g., use concurrent.futures.ThreadPoolExecutor or a fixed
set of worker threads consuming a queue) so at most N concurrent forwards run;
keep the forward(data, addr) function but submit tasks to
executor.submit(forward, d, a) (or enqueue to a Queue and have N worker threads
call forward()), and ensure socket timeouts and socket.close() handling remain
intact while removing the unbounded threading.Thread(...) creation in the while
True recv loop and referencing UPSTREAM and sock as before.
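The positional-argument handling described in the first item above can be sketched as a small parser. The variable names mirror the script's; the `parse_args` wrapper is a hypothetical helper for illustration:

```shell
# Sketch for setup-dns-proxy.sh: with one positional argument it is
# SANDBOX_NAME; with two, GATEWAY_NAME comes first.
parse_args() {
  GATEWAY_NAME=""
  SANDBOX_NAME=""
  if [ "$#" -eq 1 ]; then
    SANDBOX_NAME="$1"
  elif [ "$#" -ge 2 ]; then
    GATEWAY_NAME="$1"
    SANDBOX_NAME="$2"
  fi
  if [ -z "$SANDBOX_NAME" ]; then
    echo "Usage: setup-dns-proxy.sh [GATEWAY_NAME] SANDBOX_NAME" >&2
    return 1
  fi
}
```

This keeps the two-argument form working while making `setup-dns-proxy.sh my-sandbox` valid on its own.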

In `@src/lib/debug.ts`:
- Around line 424-433: The createTarball routine currently calls
spawnSync("tar", ...) and always logs success then runDebug deletes collectDir
in finally, which risks losing diagnostics when tar fails; update createTarball
to inspect the spawnSync result (check status/exitCode, error, and
stderr/stdout) and only delete or treat the tarball as canonical if the command
succeeded (status === 0 and no error), otherwise keep collectDir intact, log a
clear error including spawnSync.error and stderr, and suggest alternative
actions; apply the same change to the other tar invocation mentioned (the
similar spawnSync("tar", ...) block referenced around lines 477-485) so both
places preserve the collected directory on failure and surface detailed error
info.
- Around line 305-317: The SSH config is written to a predictable path
(sshConfigPath) in tmpdir and written with writeFileSync, which can be hijacked
via symlinks; replace that with a securely-created exclusive temp
directory/file: use mkdtempSync to create a unique temp dir, construct the SSH
file path inside it, and write the file with exclusive creation and restrictive
mode (e.g., O_CREAT|O_EXCL and 0o600) instead of plain writeFileSync; update the
code around spawnSync("openshell", ["sandbox", "ssh-config", sandboxName], {
timeout: TIMEOUT_MS, ... }) to write sshResult.stdout into the exclusive file
and ensure proper cleanup if needed.

In `@src/lib/nim.ts`:
- Around line 171-180: Replace hard process termination with thrown errors so
callers can handle failures: in pullNimImage, when getImageForModel(model)
returns falsy, throw a descriptive Error (e.g., including model) instead of
calling process.exit(1); keep the console.error/console.log messages as needed
but do not exit the process. Apply the same change in startNimContainerByName —
detect the failure condition currently followed by process.exit(1) and throw an
Error with a clear message referencing startNimContainerByName and the
container/name so callers can catch and handle it.

In `@src/lib/resolve-openshell.test.ts`:
- Line 5: The test imports the compiled artifact instead of the source; update
the import for resolveOpenshell so the test imports the source module (change
the import from "../../dist/lib/resolve-openshell" to the local source import
"./resolve-openshell") so Vitest runs against the live src implementation and
not stale built output; ensure the import statement that references
resolveOpenshell is adjusted accordingly.

---

Outside diff comments:
In @.agents/skills/nemoclaw-deploy-remote/SKILL.md:
- Around line 1-181: The SKILL.md file
(.agents/skills/nemoclaw-deploy-remote/SKILL.md) is autogenerated and must not
be edited directly; instead update the source documentation under docs/ and then
run the generator (scripts/docs-to-skills.py) to regenerate this skill file so
changes persist; do not modify the .agents file in-place — change the
appropriate docs/* source content and run the docs-to-skills.py script to
produce the updated .agents/skills output.

---

Minor comments:
In @.agents/skills/nemoclaw-security-best/SKILL.md:
- Around line 33-81: The Mermaid code block opened with the "```mermaid" fence
is missing a closing triple-backtick; close the block by adding a ``` fence
immediately after the diagram (after the line containing "*Full details in
`references/best-practices.md`.*") so the Mermaid diagram ends and the following
text renders as normal Markdown.

In @.agents/skills/security-code-review/SKILL.md:
- Around line 65-69: Update Step 4 in SKILL.md to operationalize the “Read full
content” requirement by inserting an explicit full-file read command before the
existing git diff command; specifically, modify the Step 4 instructions so they
first instruct the reviewer to output the entire file contents (for example
using a file-cat command) and then run the provided git diff command, ensuring
the documentation includes both commands in sequence and the example suggested
in the review content.

In `@bin/lib/platform.js`:
- Around line 38-44: The current shouldPatchCoredns(runtime, opts) returns true
for any runtime string other than "unknown", which wrongly enables the
Docker-specific CoreDNS patch for podman/future runtimes; change the runtime
check to an explicit whitelist: define an array of allowed runtimes (e.g.,
['docker', 'moby', 'docker-desktop'] or the actual Docker engine identifiers you
expect) and return true only if isWsl(opts) is false and that array includes the
provided runtime (use Array.prototype.includes). Update the logic inside
shouldPatchCoredns to use this whitelist instead of runtime !== "unknown" so
only the intended Docker-like runtimes trigger the patch.

In `@docs/get-started/quickstart.md`:
- Around line 25-30: Replace the generic MyST admonition tag ':::{admonition}
Alpha software' with a supported callout type (for example ':::{warning}',
':::{note}', or ':::{tip}') to follow docs conventions; specifically, update the
block starting with ':::{admonition}' and its closing marker ':::' so the
opening tag uses a supported type (e.g., ':::{warning} Alpha software') and keep
the same inner text unchanged.
- Line 56: Replace GitHub repository/blob links in the quickstart table with the
official product documentation URLs: locate the markdown link entries such as
the "[OpenShell](https://github.com/NVIDIA/OpenShell)" table row and the other
links referenced at the same section (noted also around the other occurrences at
lines referenced) and update their hrefs to point to the official vendor or
project documentation pages rather than GitHub repo/blob pages; ensure the
visible link text (e.g., "OpenShell") remains the same but the URL is the
official docs site, and apply the same change to the other two occurrences
mentioned.

In `@docs/security/best-practices.md`:
- Around line 501-508: Rename the closing section header "Related Topics" to
"Next Steps" in the document (change the heading line that currently reads "##
Related Topics" to "## Next Steps") and keep the existing list of links
unchanged so the bottom of docs/security/best-practices.md follows the project's
guideline that pages end with a "Next Steps" section linking related pages.
- Around line 2-4: The H1 in the document must exactly match the frontmatter
value title.page; update either the frontmatter or the page H1 so they are
identical (e.g., set the H1 to "NemoClaw Security Best Practices — Controls,
Risks, and Posture Profiles" to match title.page or change title.page to the
shorter H1), ensuring the H1 text and the title.page frontmatter key are an
exact string match.

In `@eslint.config.mjs`:
- Around line 42-43: The inline comment and the "complexity" rule are
inconsistent: the comment says ratcheting to 15 but the rule sets max: 20;
update one to match the other in eslint.config.mjs by either changing the
comment text to reflect max: 20 or changing the rule value to "complexity":
["error", { max: 15 }]; locate the rule by the unique key "complexity" and
ensure the comment immediately above it (mentioning ratcheting) and the rule's
max value are identical.

In `@scripts/brev-setup.sh`:
- Around line 186-187: The current subshell commands for building and linking
(the lines invoking (cd "$REPO_DIR/nemoclaw" && npm install && npm run build)
and (cd "$REPO_DIR" && npm install --ignore-scripts && npm link)) swallow all
output by redirecting both stdout and stderr to /dev/null; change them to
capture stderr to a log file (e.g., redirect only stderr to
"$REPO_DIR/setup-build.log" or separate files like "nemoclaw-build.err"),
preserve or optionally tee stdout, and check the command exit status after each
subshell (using || { echo "Build failed, see log"; cat <logfile>; exit 1; }) so
failures are logged and cause the script to stop; reference the subshell
commands and the variables REPO_DIR and the npm steps (npm install, npm run
build, npm link) when making these changes.

In `@scripts/fix-coredns.sh`:
- Around line 50-53: The script resets RUNTIME="unknown" after sourcing
lib/runtime.sh which can overwrite a previously-detected runtime (e.g.
RUNTIME=colima); change the logic so you do not clobber an existing RUNTIME
value—either remove the unconditional RUNTIME="unknown" assignment or only set
RUNTIME if it is unset (e.g. test -z "${RUNTIME:-}" before assigning), and keep
the colima detection block (the command -v colima and DOCKER_HOST check) so it
can set RUNTIME="colima" when appropriate; reference the RUNTIME variable and
the colima detection code to locate and update the lines.

In `@scripts/install.sh`:
- Around line 362-366: The current removal uses pattern substitution which
leaves empty-string elements in _cleanup_pids and _cleanup_files; replace each
substitution with logic that rebuilds the array excluding the matching element
(e.g., iterate over _cleanup_pids and only append elements not equal to $pid,
then assign back to _cleanup_pids; do the same for _cleanup_files comparing to
$log) so the matching entries are truly removed rather than turned into empty
strings; update the lines touching _cleanup_pids and _cleanup_files accordingly
and keep the subsequent return $status unchanged.

In `@scripts/setup-dns-proxy.sh`:
- Around line 81-82: The POD selection is using a substring grep which can pick
the wrong pod; update the POD assignment (the command that sets POD using
kctl/kubectl and $SANDBOX_NAME) to require an exact match or use a label
selector: e.g. use kubectl get pod -n openshell -o name --field-selector
metadata.name="$SANDBOX_NAME" or use kubectl get pods -n openshell -l
"yourLabelKey=$SANDBOX_NAME" to ensure uniqueness, and add a check that POD is
non-empty (and fail/exit if multiple matches are found) before proceeding to
modify files, iptables, or namespaces.

In `@scripts/start-services.sh`:
- Around line 129-131: Update the startup banner text to reflect the actual
NVIDIA_API_KEY gating: when NVIDIA_API_KEY is unset the script currently warns
("NVIDIA_API_KEY not set — Telegram bridge will not start.") but the banner
still shows "not started (no token)"; change the banner message(s) (the startup
banner print at the block around lines referenced and the similar block at
155-158) to indicate the bridge is disabled due to missing NVIDIA_API_KEY
(reference the NVIDIA_API_KEY variable and the Telegram bridge) so operators see
the correct remediation rather than suggesting a missing Telegram token; keep
the existing warn messages intact.

---

Nitpick comments:
In @.github/actions/basic-checks/action.yaml:
- Around line 21-25: The script assumes /usr/local/bin is writable but may not
be on some runners; update the steps that download and install hadolint (the
curl to /usr/local/bin/hadolint, the sha256sum check for
/usr/local/bin/hadolint, and chmod +x) to run with elevated privileges when
necessary by prefixing the filesystem-modifying commands with sudo (or using a
conditional sudo wrapper) so that the variables HADOLINT_URL, EXPECTED, ACTUAL
and the target path /usr/local/bin/hadolint work reliably on both GitHub-hosted
and self-hosted runners.

In @.github/workflows/dco-check.yaml:
- Line 47: Update the DCO check regex used in the if condition that tests
"$normalized_body" so it more strictly validates the email portion and avoids
false positives; replace the permissive pattern
'^Signed-off-by:\s+.+\s+<[^<>]+>$' with a tighter PCRE that still accepts names
with punctuation but requires a basic email structure inside angle brackets
(user@domain.tld) and keep the grep -P usage in the same if statement to perform
the match.

In @.github/workflows/docker-pin-check.yaml:
- Around line 28-30: Run both pin checks and only fail the step after both have
executed: invoke scripts/update-docker-pin.sh with and without
DOCKERFILE=Dockerfile.base so the first non-zero exit doesn’t abort the block;
capture each script’s exit status (referencing the two invocations of
update-docker-pin.sh and the DOCKERFILE=Dockerfile.base invocation) and at the
end exit non-zero if either check failed.

In @.github/workflows/nightly-e2e.yaml:
- Around line 92-154: Extract the repeated NVM/PATH initialization into a new
script (e.g., create test/e2e/setup-env.sh) and replace the duplicated block in
the three workflow steps that call bash
test/e2e/e2e-cloud-experimental/check-docs.sh, bash
test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh, and bash
test/e2e/e2e-cloud-experimental/cleanup.sh --verify with a single sourced
invocation (source test/e2e/setup-env.sh) followed by the existing bash command;
ensure the new script includes the same set -euo pipefail, .bashrc sourcing,
NVM_DIR/nvm.sh load, and PATH update logic so behavior is unchanged.

In `@bin/lib/policies.js`:
- Around line 292-337: In selectFromList, add a timeout for the readline prompt
so rl.question can't hang when stdin is non-interactive; create a timer (e.g.,
setTimeout) after rl is created that will rl.close(), perform the same non-TTY
pause/unref cleanup, clear the timer on normal answer handling, and
resolve(null) when the timeout fires; ensure the timer is cleared in the
rl.question callback and any early-return branches to avoid leaks and duplicate
resolves.

In `@bin/nemoclaw.js`:
- Around line 805-814: In the "--help"/"-h" switch case the process.exit(0) call
terminates execution so the following unreachable break should be removed; edit
the switch handling for the "--help"/"-h" labels (the block containing
console.log(...) and process.exit(0)) and delete the trailing break statement
after process.exit(0) to eliminate dead code.
- Around line 102-113: The stderr inclusion is inverted in captureOpenshell:
currently stderr is omitted when opts.ignoreError is true; change the
concatenation logic so stderr is included when opts.ignoreError is true (e.g.,
include result.stderr when opts.ignoreError is set, otherwise keep the current
behavior of excluding it), update the output construction in captureOpenshell
and add a short comment by getOpenshellBinary/captureOpenshell explaining the
rationale for including stderr when ignoring errors; reference symbols:
captureOpenshell, opts.ignoreError, result.stderr.
- Around line 459-516: getReconciledSandboxGatewayState is over the cyclomatic
complexity limit; extract the inline state-matching logic into small
predicate/helper functions to remove the eslint-disable and reduce branching.
Create helpers (e.g., isPresentOrMissing(lookup),
isHandshakeVerificationFailure(output), isNoGatewayConfigured(status),
isConnectionRefusedForNemoclaw(status), and
isNamedUnreachableRecovery(recovery)) and move the respective checks that
reference getSandboxGatewayState, recoverNamedGatewayRuntime,
getNamedGatewayLifecycleState, and stripAnsi into those helpers; then replace
the long if/return blocks in getReconciledSandboxGatewayState with calls to
these helpers to keep the function linear and under the complexity threshold.

In `@docs/about/how-it-works.md`:
- Line 42: The line containing "OpenShell handles *how* to sandbox an agent
securely. NemoClaw handles *what* goes in the sandbox and makes the setup
accessible. For the full system diagram, see
[Architecture](../reference/architecture.md)." should be split so each sentence
is on its own line (e.g., one line for "OpenShell handles *how* to sandbox an
agent securely.", one line for "NemoClaw handles *what* goes in the sandbox and
makes the setup accessible.", and one line for the Architecture reference);
apply the same one-sentence-per-line split to the other affected line referenced
(around the content at 145) so every sentence occupies its own source line.

In `@docs/conf.py`:
- Around line 103-106: The announcement string assigned to the "announcement"
key in docs/conf.py includes a leading bell emoji (&#x1F514;); remove that emoji
so the banner text conforms to the docs prose style guide, leaving the rest of
the message ("NVIDIA NemoClaw is <strong>alpha software</strong>. APIs and
behavior may change without notice. Do not use in production.") unchanged.

In `@docs/deployment/deploy-to-remote-gpu.md`:
- Around line 95-99: The paragraph explaining NemoClaw dashboard origin
allowlist and CHAT_UI_URL contains multiple sentences on one line; update the
docs in the deploy-to-remote-gpu.md section so each sentence is on its own line
for clearer diffs — e.g., split the single multi-sentence paragraph about the
allowlist, default `http://127.0.0.1:18789`, remote access examples (Brev/SSH
port-forward), and the instruction to set `CHAT_UI_URL` before running setup
into separate lines, preserving the same wording and meaning.

In `@docs/get-started/quickstart.md`:
- Line 46: The paragraph containing "The sandbox image is approximately 2.4 GB
compressed..." has multiple sentences on one source line; split it so each
sentence is on its own line (i.e., create four lines: one for "The sandbox image
is approximately 2.4 GB compressed.", one for the sentence starting "During
image push, the Docker daemon, k3s, and the OpenShell gateway...", one for "On
machines with less than 8 GB of RAM, this combined usage can trigger the OOM
killer.", and one for "If you cannot add memory, configuring at least 8 GB of
swap can work around the issue at the cost of slower performance.").

In `@docs/index.md`:
- Around line 202-206: The "Notice and Disclaimer" admonition block is currently
one long line; break its text so each sentence is on its own line inside the
admonition (split after periods) to improve diff readability; locate the block
starting with "```{admonition} Notice and Disclaimer" and replace the
single-line paragraph with multiple lines where each sentence is a separate line
while preserving the original wording and blank line structure.

In `@docs/reference/troubleshooting.md`:
- Around line 52-67: Reword the Node.js version troubleshooting paragraph to
avoid repeating sentences that start with "If": update the block that currently
begins "If the installer exits..." and "If the version is below 22.16..." to
something like "For versions below 22.16, install a supported release." while
keeping the same guidance about checking the version with `node --version` and
the nvm commands (`nvm install 22` / `nvm use 22`); ensure the flow reads
smoothly and retains all original instructions in the Node.js version check
section.

In `@docs/security/best-practices.md`:
- Line 208: Replace the bolded routine callout token "**Recommendation:**" with
a MyST admonition (e.g., :::{tip} ... :::) or plain sentence text; update the
occurrence on the shown line and the other instance noted (line with the same
token at 441) so routine guidance uses a tip/note admonition (or plain text)
rather than bold emphasis, and ensure the admonition content retains the
original sentence about applying presets and reviewing YAML.
- Around line 46-93: The Mermaid diagram uses emoji in node labels (e.g.,
YOU/"👤 Operator", AGENT/"🤖 Agent", PROC/"⚙️ Process Layer", FS/"📁 Filesystem
Layer", NET/"🌐 Network Layer", INF/"🧠 Inference Layer", OUTSIDE/"🌍 Outside
World") which violates the no-emoji documentation rule; edit those node label
strings to remove all emoji and leave plain text equivalents (e.g., "Operator",
"Agent", "Process Layer", "Filesystem Layer", "Network Layer", "Inference
Layer", "Outside World") while preserving the existing HTML tags and line breaks
in labels and keeping node/class identifiers (YOU, AGENT, PROC, FS, NET, INF,
OUTSIDE) and styling unchanged.

In `@Makefile`:
- Around line 13-16: Add the format-cli target to the Makefile's .PHONY list so
the phony target declaration includes format-cli; update the .PHONY line (or the
.PHONY block) to include the symbol "format-cli" alongside other phony targets
to ensure "make format-cli" always runs the Prettier command in the format-cli
rule.

In `@nemoclaw/src/commands/migration-state.ts`:
- Around line 568-570: The filter passed when options?.stripCredentials uses
path.basename(source).toLowerCase() against CREDENTIAL_SENSITIVE_BASENAMES but
relies on the set already being lowercase; make the comparison explicitly
case-insensitive to avoid edge cases on case-sensitive filesystems: normalize
the basename and the set membership check in the filter (or build a
case-insensitive lookup) so names like "Auth-Profiles.json" are reliably
matched; update the filter expression around options?.stripCredentials,
path.basename, and CREDENTIAL_SENSITIVE_BASENAMES to perform an explicit
case-insensitive lookup.

In `@package.json`:
- Line 18: The inline "prepare" npm script in package.json is large and hard to
maintain; extract it to a dedicated executable shell script (e.g.,
scripts/prepare.sh) and replace the package.json "prepare" value with a single
invocation (e.g., "sh ./scripts/prepare.sh" or "./scripts/prepare.sh"). In the
new scripts/prepare.sh implement the same logic: detect tsc or
node_modules/.bin/tsc and run npm run build:cli, run npm install --omit=dev
--ignore-scripts 2>/dev/null || true, and handle git-hook setup by checking .git
then command -v prek or node_modules/@j178/prek with the same error/skip
messages; make the script executable and add set -euo pipefail at top so it can
be shellchecked and maintained easily.

In `@README.md`:
- Line 1: The README title contains an emoji ("🦞") which violates the "No emoji
in technical prose" guideline; remove the emoji from the top-level heading so
the title becomes plain text (e.g., change "# 🦞 NVIDIA NemoClaw: Reference
Stack for Running OpenClaw in OpenShell" to "# NVIDIA NemoClaw: Reference Stack
for Running OpenClaw in OpenShell"), and scan the rest of README.md for any
other emoji in headings or sentences to remove or replace with plain text;
update the commit message to note removal of emoji from README title.
- Around line 70-72: The README note block contains an emoji ("ℹ️") that
violates the "No emoji in technical prose" guideline; edit the note text (the
block referencing NemoClaw and OpenClaw created inside the sandbox during
onboarding) to remove the emoji and leave plain text (e.g., replace "ℹ️ Note"
with "Note" or similar) so the content referencing NemoClaw and OpenClaw remains
unchanged except for removing the emoji.

In `@scripts/check-coverage-ratchet.ts`:
- Around line 41-45: The failures computation assumes summary.total[metric].pct
exists and will throw if the coverage summary is malformed; update the METRICS
mapping that builds failures to defensively check that summary.total and
summary.total[metric] and summary.total[metric].pct are defined (or coerce
missing pct to 0) and surface a clear error/warning when a metric is missing so
the script fails with an informative message; specifically modify the block that
computes failures (referencing METRICS, failures, summary.total[metric].pct,
thresholds, and TOLERANCE) to guard each access, use a safe value or log the
missing metric, and then compare the safe actual value to thresholds[metric] -
TOLERANCE.
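
The guarded access could look roughly like this; `METRICS`, `thresholds`, and `TOLERANCE` mirror the names in the review, but the metric list and values here are assumptions.

```javascript
// Defensive variant: each summary.total[metric].pct access is guarded,
// and a missing metric becomes an informative failure instead of a throw.
const METRICS = ["lines", "statements", "functions", "branches"];
const TOLERANCE = 0.1;

function computeFailures(summary, thresholds) {
  const failures = [];
  for (const metric of METRICS) {
    const entry = summary?.total?.[metric];
    if (!entry || typeof entry.pct !== "number") {
      failures.push(`${metric}: missing from coverage summary`);
      continue;
    }
    if (entry.pct < thresholds[metric] - TOLERANCE) {
      failures.push(`${metric}: ${entry.pct}% is below ${thresholds[metric]}%`);
    }
  }
  return failures;
}
```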

In `@scripts/clean-staged-tree.sh`:
- Around line 9-15: The script currently only checks that target_dir is
non-empty; add a validation after that to ensure the path exists and is a
directory (e.g., test -d "$target_dir") and if not print a clear usage/error
message and exit non-zero; update the logic around target_dir and the removal
commands (rm -rf ... and find ...) so they only run when the directory check
passes, referencing the target_dir variable in your check.

In `@scripts/install.sh`:
- Around line 600-604: Document that when neither sha256sum nor shasum is
present the script skips integrity verification and sets
actual_hash="$NVM_SHA256", and update the warn call (warn) to explain the
security trade-off and point users to install a SHA-256 tool; additionally,
implement an option to fail in non-interactive/CI mode by detecting the
non-interactive flag or CI env and exiting with an error instead of assigning
actual_hash="$NVM_SHA256"—reference the actual_hash variable, the NVM_SHA256
sentinel, and the warn call so you modify that branch accordingly.

In `@scripts/setup-spark.sh`:
- Around line 62-69: The script currently warns when REAL_USER is empty but then
still prints "DGX Spark Docker configuration complete." because
DOCKER_GROUP_ADDED was never set; either (A) add an early exit after the warn
call when REAL_USER is empty (e.g., call exit 1) so the script stops and never
reaches the final info/echo, or (B) explicitly set a state variable when
skipping configuration (e.g., set DOCKER_GROUP_ADDED=false or set
DOCKER_GROUP_SKIPPED=true in the REAL_USER-empty branch) and update the final
conditional that prints the completion message to check that flag (use warn()
when skipped and info() only when DOCKER_GROUP_ADDED=true) so messaging is
consistent; touch the REAL_USER detection block and the final DOCKER_GROUP_ADDED
check to implement one of these approaches.

In `@src/lib/inference-config.ts`:
- Around line 9-10: Replace the CommonJS require with an ES module import for
DEFAULT_OLLAMA_MODEL: change the require("../../bin/lib/local-inference") usage
to an import statement that imports DEFAULT_OLLAMA_MODEL from the
"bin/lib/local-inference" module, ensure the target module exports
DEFAULT_OLLAMA_MODEL as an ES export, and if necessary update TypeScript config
or add an `export`/`export default` in the local-inference module so the import
(e.g., import { DEFAULT_OLLAMA_MODEL } from "bin/lib/local-inference")
type-checks correctly.

In `@src/lib/nim.test.ts`:
- Around line 41-48: Replace the indirect type assertions in the test "each
model has name, image, and minGpuMemoryMB": use Vitest's toBeTypeOf to assert
m.minGpuMemoryMB is "number" and use a numeric matcher like toBeGreaterThan(0)
for positivity; similarly consider asserting m.name and m.image types (e.g.,
toBeTypeOf "string") or keep toBeTruthy if presence-only is intended. Update the
assertions around nim.listModels() / m.minGpuMemoryMB / m.name / m.image
accordingly.

In `@src/lib/nim.ts`:
- Around line 222-224: Replace the platform-dependent subprocess sleep call that
uses require("child_process").spawnSync("sleep", [String(intervalSec)]) with a
Node-native delay: if the surrounding function can be async, change it to return
a Promise and use an await-based delay (setTimeout) to sleep for
intervalSec*1000; if it must remain synchronous, replace the subprocess approach
with a native synchronous wait via Atomics.wait on a shared Int32Array for the
desired milliseconds. Update the function signature where this occurs (the scope
using intervalSec) to match the async/sync choice and remove the child_process
require.

In `@src/lib/onboard-session.ts`:
- Around line 13-16: Add a brief inline comment above SESSION_DIR (and by
extension SESSION_FILE and LOCK_FILE) documenting that when process.env.HOME is
unset the code falls back to /tmp/.nemoclaw, noting the potential for unexpected
behavior in headless/container environments and that directory/file permissions
are set restrictively (0o700/0o600) to mitigate multi-user risk; update the
comment near the SESSION_DIR declaration so maintainers see the fallback and
security implications when reading SESSION_DIR, SESSION_FILE, or LOCK_FILE.
- Around line 204-252: The normalizeSession function is getting complex; extract
field-specific normalization into helpers to improve readability and
maintainability: create a normalizeMetadata helper that takes d.metadata and
returns SessionMetadata or undefined and a normalizeSteps (or
normalizeStepEntry) helper that validates/normalizes each step entry and returns
the fully-typed steps map; replace the inline metadata and steps blocks in
normalizeSession with calls to normalizeMetadata(d.metadata) and
normalizeSteps(d.steps) (referencing normalized.steps and validateStep where
needed), and then remove or narrow the eslint-disable-next-line complexity by
keeping normalizeSession high-level and delegating detailed logic to these new
helpers.

In `@src/lib/preflight.test.ts`:
- Around line 135-148: The "detects EADDRINUSE on an occupied port (real net
probe)" test can hang if the server lifecycle fails; add an explicit timeout for
the test to prevent CI from hanging by limiting how long the
probePortAvailability call can run (e.g., set a per-test timeout via your test
runner or wrap probePortAvailability in a Promise.race with a short timeout)
while keeping the existing srv creation and the finally block that closes srv;
target the test name and the probePortAvailability call and ensure srv.close
still runs in the finally clause.

In `@src/lib/preflight.ts`:
- Around line 302-313: The writeManagedSwapMarker function uses
runCapture("mkdir -p ...") to create the .nemoclaw directory but otherwise uses
fs APIs; replace the shell call with fs.mkdirSync to avoid shell expansion and
keep consistency: check for the directory with fs.existsSync (or just call
fs.mkdirSync(nemoclawDir, { recursive: true })) before writing the marker, use
fs.mkdirSync(path.join(os.homedir(), ".nemoclaw"), { recursive: true }) (or wrap
in try/catch) and then call fs.writeFileSync(path.join(nemoclawDir,
"managed_swap"), "/swapfile"); preserve the current best-effort error handling
around the write.

Comment on lines +25 to +47
If the user provided a PR or issue URL, extract the owner, repo, and number. If not, ask for one.

Supported URL formats:

- `https://github.com/OWNER/REPO/pull/NUMBER`
- `https://github.com/OWNER/REPO/issues/NUMBER`

## Step 2: Check Out the Code

Determine whether you are already in the target repository (compare `gh repo view --json nameWithOwner -q .nameWithOwner` against the URL). If you are:

```bash
gh pr checkout <number>
```

If reviewing a different repo, clone it to a temporary directory first:

```bash
TMPDIR=$(mktemp -d)
gh repo clone OWNER/REPO "$TMPDIR"
cd "$TMPDIR"
gh pr checkout <number>
```

⚠️ Potential issue | 🟠 Major

PR/Issue flow is conflated; issue reviews will fail.

The skill says it supports issue URLs, but the execution path is PR-only (gh pr checkout, gh pr view). Add a branch for issue-mode (analyze current default/base branch + referenced files/commits) or explicitly scope this skill to PRs only.

Suggested doc fix

```diff
-If the user provided a PR or issue URL, extract the owner, repo, and number. If not, ask for one.
+If the user provided a PR or issue URL, extract the owner, repo, and number. If not, ask for one.
+
+- If URL type is `pull`: follow Steps 2-6 using `gh pr checkout` and PR metadata.
+- If URL type is `issues`: do not run `gh pr checkout`; instead, inspect the default branch (or linked PR/commit if present in the issue) and produce an issue-focused security triage report.
```

Also applies to: 60-61

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/skills/security-code-review/SKILL.md around lines 25 - 47, The
document conflates PR and issue flows in "Step 2: Check Out the Code" by only
using PR commands (`gh pr checkout`, `gh pr view`); update the skill to handle
issues separately or explicitly restrict it to PRs: add an explicit branch that
detects issue URLs and uses `gh issue view` (or clones and checks out the repo
default branch via `gh repo view`/git checkout the default branch) and then
analyzes referenced commits/files, or change the docs to state "PRs only" and
remove issue examples; ensure both flows are clearly described and include the
distinct commands (`gh pr checkout`/`gh pr view` for PR mode, `gh issue view` +
checkout default branch for issue mode) in the "Step 2: Check Out the Code"
section.

Comment on lines +54 to +55
```bash
git diff main...HEAD --name-status
```

⚠️ Potential issue | 🟠 Major

Base-branch handling is inconsistent and can yield wrong diffs.

You correctly mention non-main base branches, but commands still hardcode main...HEAD. Use baseRefName consistently in all diff commands.

Suggested doc fix

```diff
-git diff main...HEAD --name-status
+BASE_REF=$(gh pr view <number> --json baseRefName -q .baseRefName)
+git diff "${BASE_REF}"...HEAD --name-status
 ...
-git diff main...HEAD -- <file>
+git diff "${BASE_REF}"...HEAD -- <file>
```

Also applies to: 57-61, 68-69

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/skills/security-code-review/SKILL.md around lines 54 - 55, The docs
hardcode the git diff command as "git diff main...HEAD" which yields incorrect
diffs when the PR base isn't main; replace every occurrence of that hardcoded
string (including the instances near lines 54-55, 57-61, and 68-69) with a
templated reference that uses the existing baseRefName variable (e.g., use
baseRefName...HEAD) so all diff commands consistently use baseRefName.

```diff
 jobs:
   e2e-brev:
-    if: github.repository == 'NVIDIA/NemoClaw'
+    # if: github.repository == 'NVIDIA/NemoClaw'  # Disabled for fork testing — re-enable before merge
```

⚠️ Potential issue | 🟠 Major

Re-enable the repository guard before merging.

The comment indicates this was disabled for fork testing. Leaving this commented out in production could allow the workflow to run on forks, potentially exposing secrets (BREV_API_TOKEN, NVIDIA_API_KEY) to untrusted code.

🔒 Proposed fix

```diff
-    # if: github.repository == 'NVIDIA/NemoClaw'  # Disabled for fork testing — re-enable before merge
+    if: github.repository == 'NVIDIA/NemoClaw'
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
-    # if: github.repository == 'NVIDIA/NemoClaw'  # Disabled for fork testing — re-enable before merge
+    if: github.repository == 'NVIDIA/NemoClaw'
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/e2e-brev.yaml at line 101, Uncomment and restore the
repository guard so the workflow only runs in the upstream repo: replace the
commented line "# if: github.repository == 'NVIDIA/NemoClaw'" with the active
condition "if: github.repository == 'NVIDIA/NemoClaw'" (so the workflow will not
run on forks and will protect secrets like BREV_API_TOKEN and NVIDIA_API_KEY);
ensure the exact conditional string is used and the YAML indentation matches
surrounding steps so the guard applies correctly to the job or workflow scope.

Comment on lines +10 to +42
```js
const UNSAFE_HOME_PATHS = new Set(["/tmp", "/var/tmp", "/dev/shm", "/"]);

function resolveHomeDir() {
  const raw = process.env.HOME || os.homedir();
  if (!raw) {
    throw new Error(
      "Cannot determine safe home directory for credential storage. " +
        "Set the HOME environment variable to a user-owned directory.",
    );
  }
  const home = path.resolve(raw);
  try {
    const real = fs.realpathSync(home);
    if (UNSAFE_HOME_PATHS.has(real)) {
      throw new Error(
        "Cannot store credentials: HOME resolves to '" +
          real +
          "' which is world-readable. " +
          "Set the HOME environment variable to a user-owned directory.",
      );
    }
  } catch (e) {
    if (e.code !== "ENOENT") throw e;
  }
  if (UNSAFE_HOME_PATHS.has(home)) {
    throw new Error(
      "Cannot store credentials: HOME resolves to '" +
        home +
        "' which is world-readable. " +
        "Set the HOME environment variable to a user-owned directory.",
    );
  }
  return home;
```

⚠️ Potential issue | 🟠 Major

Reject HOME values under temp directories too.

This guard only blocks the literal paths in UNSAFE_HOME_PATHS. HOME=/tmp/alice or /dev/shm/user still passes, so credentials end up under a world-writable parent even though this code is trying to forbid that class of location.

Possible hardening

```diff
-const UNSAFE_HOME_PATHS = new Set(["/tmp", "/var/tmp", "/dev/shm", "/"]);
+const UNSAFE_HOME_PATHS = ["/tmp", "/var/tmp", "/dev/shm", "/"];
+
+function isUnsafeHomePath(p) {
+  return UNSAFE_HOME_PATHS.some((base) => p === base || p.startsWith(base + path.sep));
+}
@@
-    if (UNSAFE_HOME_PATHS.has(real)) {
+    if (isUnsafeHomePath(real)) {
@@
-  if (UNSAFE_HOME_PATHS.has(home)) {
+  if (isUnsafeHomePath(home)) {
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/credentials.js` around lines 10 - 42, The resolveHomeDir function
currently only rejects exact matches in UNSAFE_HOME_PATHS; update it to also
reject HOME values that are nested under those unsafe directories by checking
whether the resolved and real paths are equal to or are descendants of any entry
in UNSAFE_HOME_PATHS (e.g., for each unsafe path check real === unsafe OR real
startsWith(unsafe + path.sep), and do the same for the non-real `home` value),
using path.resolve/path.normalize to canonicalize inputs before comparison; keep
using fs.realpathSync for the real path and throw the same error message when a
match or descendant is detected to prevent HOME paths like /tmp/alice or
/dev/shm/user from being allowed.

Comment on lines +241 to 255
```diff
   while (true) {
     key = normalizeCredentialValue(await prompt("  NVIDIA API Key: ", { secret: true }));

-    if (!key || !key.startsWith("nvapi-")) {
-      console.error("  Invalid key. Must start with nvapi-");
-      process.exit(1);
-    }
+    if (!key) {
+      console.error("  NVIDIA API Key is required.");
+      continue;
+    }
+
+    if (!key.startsWith("nvapi-")) {
+      console.error("  Invalid key. Must start with nvapi-");
+      continue;
+    }
+
+    break;
   }
```

⚠️ Potential issue | 🟠 Major

Fail fast on non-interactive stdin here.

Lines 241-255 now retry forever, but prompt() on EOF/non-TTY stdin just yields another empty answer. In CI or non-interactive onboarding, a missing key becomes a hang/spam loop instead of a clear failure.

One way to stop the loop

```diff
+  if (!process.stdin.isTTY) {
+    throw new Error(
+      "NVIDIA_API_KEY is required in non-interactive mode. Set NVIDIA_API_KEY or save it in ~/.nemoclaw/credentials.json.",
+    );
+  }
+
   while (true) {
     key = normalizeCredentialValue(await prompt("  NVIDIA API Key: ", { secret: true }));
```
📝 Committable suggestion


Suggested change

```js
if (!process.stdin.isTTY) {
  throw new Error(
    "NVIDIA_API_KEY is required in non-interactive mode. Set NVIDIA_API_KEY or save it in ~/.nemoclaw/credentials.json.",
  );
}

while (true) {
  key = normalizeCredentialValue(await prompt("  NVIDIA API Key: ", { secret: true }));

  if (!key) {
    console.error("  NVIDIA API Key is required.");
    continue;
  }

  if (!key.startsWith("nvapi-")) {
    console.error("  Invalid key. Must start with nvapi-");
    continue;
  }

  break;
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/credentials.js` around lines 241 - 255, The loop that asks for the
NVIDIA API Key using prompt(...) can spin forever on non-interactive stdin
because prompt returns empty on EOF; update the logic around the while loop (the
block that uses key and normalizeCredentialValue) to fail fast in
non-interactive contexts: before entering the loop check process.stdin.isTTY (or
equivalent) and throw/exit with a clear error if not interactive, and/or after
receiving an empty key from prompt, if process.stdin.isTTY is false (or prompt
indicates EOF), log a clear error and exit instead of continuing; reference the
key variable, the prompt("  NVIDIA API Key: ", { secret: true }) call, and
normalizeCredentialValue so you change the control flow there.

Comment on lines +122 to +135
```python
def forward(data, addr):
    try:
        f = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        f.settimeout(5)
        f.sendto(data, UPSTREAM)
        r, _ = f.recvfrom(4096)
        sock.sendto(r, addr)
        f.close()
    except Exception:
        pass

while True:
    d, a = sock.recvfrom(4096)
    threading.Thread(target=forward, args=(d, a), daemon=True).start()
```

⚠️ Potential issue | 🟠 Major

Bound the DNS forwarder's concurrency.

The proxy starts a brand-new daemon thread for every packet and never caps them. A burst of lookups can exhaust threads or PIDs inside the pod and take DNS resolution down. A fixed worker pool or a synchronous forward loop is safer here.

One simple way to cap it

```diff
-import socket, threading, os, sys
+import socket, os, sys
+from concurrent.futures import ThreadPoolExecutor
@@
-while True:
-    d, a = sock.recvfrom(4096)
-    threading.Thread(target=forward, args=(d, a), daemon=True).start()
+with ThreadPoolExecutor(max_workers=32) as pool:
+    while True:
+        d, a = sock.recvfrom(4096)
+        pool.submit(forward, d, a)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup-dns-proxy.sh` around lines 122 - 135, The current forward()
logic spawns an unbounded daemon thread per packet (in the recv loop using
threading.Thread(..., daemon=True)), which can exhaust resources; replace that
per-packet thread creation with a bounded worker pool (e.g., use
concurrent.futures.ThreadPoolExecutor or a fixed set of worker threads consuming
a queue) so at most N concurrent forwards run; keep the forward(data, addr)
function but submit tasks to executor.submit(forward, d, a) (or enqueue to a
Queue and have N worker threads call forward()), and ensure socket timeouts and
socket.close() handling remain intact while removing the unbounded
threading.Thread(...) creation in the while True recv loop and referencing
UPSTREAM and sock as before.

Comment on lines +305 to +317
// Generate temporary SSH config
const sshConfigPath = join(tmpdir(), `nemoclaw-ssh-${String(Date.now())}`);
try {
  const sshResult = spawnSync("openshell", ["sandbox", "ssh-config", sandboxName], {
    timeout: TIMEOUT_MS,
    stdio: ["ignore", "pipe", "ignore"],
    encoding: "utf-8",
  });
  if (sshResult.status !== 0) {
    warn(`Could not generate SSH config for sandbox '${sandboxName}', skipping internals`);
    return;
  }
  writeFileSync(sshConfigPath, sshResult.stdout ?? "");
⚠️ Potential issue | 🟠 Major

Create the SSH config with an exclusive temp file.

sshConfigPath is predictable under the shared temp directory, and writeFileSync() will follow a pre-existing symlink. That turns this into an arbitrary-file overwrite/read risk for any other local user on the machine. Use mkdtempSync() plus exclusive creation and restrictive permissions instead.

Suggested fix
-  const sshConfigPath = join(tmpdir(), `nemoclaw-ssh-${String(Date.now())}`);
+  const sshConfigDir = mkdtempSync(join(tmpdir(), "nemoclaw-ssh-"));
+  const sshConfigPath = join(sshConfigDir, "config");
   try {
     const sshResult = spawnSync("openshell", ["sandbox", "ssh-config", sandboxName], {
       timeout: TIMEOUT_MS,
       stdio: ["ignore", "pipe", "ignore"],
       encoding: "utf-8",
@@
-    writeFileSync(sshConfigPath, sshResult.stdout ?? "");
+    writeFileSync(sshConfigPath, sshResult.stdout ?? "", {
+      flag: "wx",
+      mode: 0o600,
+    });
@@
   } finally {
-    if (existsSync(sshConfigPath)) {
-      unlinkSync(sshConfigPath);
-    }
+    rmSync(sshConfigDir, { recursive: true, force: true });
   }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/debug.ts` around lines 305 - 317, The SSH config is written to a
predictable path (sshConfigPath) in tmpdir and written with writeFileSync, which
can be hijacked via symlinks; replace that with a securely-created exclusive
temp directory/file: use mkdtempSync to create a unique temp dir, construct the
SSH file path inside it, and write the file with exclusive creation and
restrictive mode (e.g., O_CREAT|O_EXCL and 0o600) instead of plain
writeFileSync; update the code around spawnSync("openshell", ["sandbox",
"ssh-config", sandboxName], { timeout: TIMEOUT_MS, ... }) to write
sshResult.stdout into the exclusive file and ensure proper cleanup if needed.

Comment on lines +424 to +433
function createTarball(collectDir: string, output: string): void {
  spawnSync("tar", ["czf", output, "-C", dirname(collectDir), basename(collectDir)], {
    stdio: "inherit",
    timeout: 60_000,
  });
  info(`Tarball written to ${output}`);
  warn(
    "Known secrets are auto-redacted, but please review for any remaining sensitive data before sharing.",
  );
  info("Attach this file to your GitHub issue.");
⚠️ Potential issue | 🟠 Major

Preserve the collected files when archiving fails.

The result of spawnSync("tar", ...) is ignored here, success is always logged, and runDebug() deletes collectDir in its finally block. If tar is missing, times out, or cannot write output, the user loses the only copy of the diagnostics.

Suggested fix
-function createTarball(collectDir: string, output: string): void {
-  spawnSync("tar", ["czf", output, "-C", dirname(collectDir), basename(collectDir)], {
+function createTarball(collectDir: string, output: string): boolean {
+  const result = spawnSync("tar", ["czf", output, "-C", dirname(collectDir), basename(collectDir)], {
     stdio: "inherit",
     timeout: 60_000,
   });
+  if (result.error || result.status !== 0) {
+    warn(`Failed to write tarball to ${output}; leaving diagnostics in ${collectDir}`);
+    return false;
+  }
   info(`Tarball written to ${output}`);
@@
   info("Attach this file to your GitHub issue.");
+  return true;
 }
@@
 export function runDebug(opts: DebugOptions = {}): void {
+  let cleanupCollectDir = true;
   const quick = opts.quick ?? false;
@@
-    if (output) {
-      createTarball(collectDir, output);
-    }
+    if (output) cleanupCollectDir = createTarball(collectDir, output);
@@
   } finally {
-    rmSync(collectDir, { recursive: true, force: true });
+    if (cleanupCollectDir) {
+      rmSync(collectDir, { recursive: true, force: true });
+    }
   }
 }

Also applies to: 477-485

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/debug.ts` around lines 424 - 433, The createTarball routine currently
calls spawnSync("tar", ...) and always logs success then runDebug deletes
collectDir in finally, which risks losing diagnostics when tar fails; update
createTarball to inspect the spawnSync result (check status/exitCode, error, and
stderr/stdout) and only delete or treat the tarball as canonical if the command
succeeded (status === 0 and no error), otherwise keep collectDir intact, log a
clear error including spawnSync.error and stderr, and suggest alternative
actions; apply the same change to the other tar invocation mentioned (the
similar spawnSync("tar", ...) block referenced around lines 477-485) so both
places preserve the collected directory on failure and surface detailed error
info.

Comment on lines +171 to +180
export function pullNimImage(model: string): string {
  const image = getImageForModel(model);
  if (!image) {
    console.error(`  Unknown model: ${model}`);
    process.exit(1);
  }
  console.log(`  Pulling NIM image: ${image}`);
  run(`docker pull ${shellQuote(image)}`);
  return image;
}
⚠️ Potential issue | 🟠 Major

Avoid process.exit() in library code — throw an error instead.

Calling process.exit(1) in a library function prevents callers from handling the error gracefully (e.g., cleaning up resources, showing custom error messages, or attempting fallback logic). The same issue exists in startNimContainerByName (Line 189-191).

🛠️ Proposed fix
 export function pullNimImage(model: string): string {
   const image = getImageForModel(model);
   if (!image) {
-    console.error(`  Unknown model: ${model}`);
-    process.exit(1);
+    throw new Error(`Unknown NIM model: ${model}`);
   }
   console.log(`  Pulling NIM image: ${image}`);
   run(`docker pull ${shellQuote(image)}`);
   return image;
 }

And similarly for startNimContainerByName:

 export function startNimContainerByName(name: string, model: string, port = 8000): string {
   const image = getImageForModel(model);
   if (!image) {
-    console.error(`  Unknown model: ${model}`);
-    process.exit(1);
+    throw new Error(`Unknown NIM model: ${model}`);
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
export function pullNimImage(model: string): string {
  const image = getImageForModel(model);
  if (!image) {
    console.error(`  Unknown model: ${model}`);
    process.exit(1);
  }
  console.log(`  Pulling NIM image: ${image}`);
  run(`docker pull ${shellQuote(image)}`);
  return image;
}
export function pullNimImage(model: string): string {
  const image = getImageForModel(model);
  if (!image) {
    throw new Error(`Unknown NIM model: ${model}`);
  }
  console.log(`  Pulling NIM image: ${image}`);
  run(`docker pull ${shellQuote(image)}`);
  return image;
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/nim.ts` around lines 171 - 180, Replace hard process termination with
thrown errors so callers can handle failures: in pullNimImage, when
getImageForModel(model) returns falsy, throw a descriptive Error (e.g.,
including model) instead of calling process.exit(1); keep the
console.error/console.log messages as needed but do not exit the process. Apply
the same change in startNimContainerByName — detect the failure condition
currently followed by process.exit(1) and throw an Error with a clear message
referencing startNimContainerByName and the container/name so callers can catch
and handle it.

// SPDX-License-Identifier: Apache-2.0

import { describe, it, expect } from "vitest";
import { resolveOpenshell } from "../../dist/lib/resolve-openshell";
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
fd -HI 'vitest' .
rg -n --type=ts --type=js "from ['\"][^'\"]*dist/" src

Repository: NVIDIA/NemoClaw

Length of output: 1113


🏁 Script executed:

cat vitest.config.ts

Repository: NVIDIA/NemoClaw

Length of output: 1060


🏁 Script executed:

rg -A 5 '"test"' package.json

Repository: NVIDIA/NemoClaw

Length of output: 315


Import from the source module, not dist.

Vitest is configured to run tests directly against source files in src/, not built output. Importing from ../../dist/lib/resolve-openshell means this suite will fail or execute against stale/missing compiled artifacts instead of the actual source code being tested. Change to ./resolve-openshell.

Suggested fix
-import { resolveOpenshell } from "../../dist/lib/resolve-openshell";
+import { resolveOpenshell } from "./resolve-openshell";
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import { resolveOpenshell } from "../../dist/lib/resolve-openshell";
import { resolveOpenshell } from "./resolve-openshell";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/resolve-openshell.test.ts` at line 5, The test imports the compiled
artifact instead of the source; update the import for resolveOpenshell so the
test imports the source module (change the import from
"../../dist/lib/resolve-openshell" to the local source import
"./resolve-openshell") so Vitest runs against the live src implementation and
not stale built output; ensure the import statement that references
resolveOpenshell is adjusted accordingly.

lakamsani pushed a commit to lakamsani/NemoClaw that referenced this pull request Apr 4, 2026
…#1136)

## Summary

Smooth out inference configuration during `install.sh` / `nemoclaw
onboard`, especially when provider authorization, credential formatting,
endpoint probing, or final inference application fail.

This PR makes the hosted-provider onboarding path recoverable instead of
brittle:
- normalize and safely handle credential input
- classify validation failures more accurately
- let users re-enter credentials in place
- make final `openshell inference set` failures recoverable
- normalize over-specified custom base URLs
- add lower-level `back` / `exit` navigation so users can move up a
level without restarting the whole install
- clarify recovery prompts with explicit commands (`retry`, `back`,
`exit`)

## What Changed

- refactored provider probe execution to use direct `curl` argv
invocation instead of `bash -c`
- normalized credential values before use/persistence
- added structured auth / transport / model / endpoint failure
classification
- added in-place credential re-entry for hosted providers:
  - NVIDIA Endpoints
  - OpenAI
  - Anthropic
  - Google Gemini
  - custom OpenAI-compatible endpoints
  - custom Anthropic-compatible endpoints
- wrapped final provider/apply failures in interactive recovery instead
of hard abort
- added command-style recovery prompts:
  - `retry`
  - `back`
  - `exit`
- allowed `back` from lower-level inference prompts (model entry, base
URL entry, recovery prompts)
- normalized custom endpoint inputs to the minimum usable base URL
- removed stale `NVIDIA Endpoints (recommended)` wording
- secret prompts now show masked `*` feedback while typing/pasting

## Validation

```bash
npx vitest run test/credentials.test.js test/onboard-selection.test.js test/onboard.test.js
npx vitest run test/cli.test.js
npx eslint bin/lib/credentials.js bin/lib/onboard.js test/credentials.test.js test/onboard-selection.test.js test/onboard.test.js
npx tsc -p jsconfig.json --noEmit
```

## Issue Mapping

Fully addressed in this PR:
- Fixes NVIDIA#1099
- Fixes NVIDIA#1101
- Fixes NVIDIA#1130

Substantially addressed / partially addressed:
- NVIDIA#987
- improves NVIDIA validation behavior and failure classification so
false/misleading connectivity failures are much less likely, but this PR
is framed as onboarding recovery hardening rather than a WSL-specific
networking fix
- NVIDIA#301
- improves graceful handling when validation/apply fails, especially for
transport/upstream problems, but does not add provider auto-fallback or
a broader cloud-outage fallback strategy
- NVIDIA#446
- improves recovery specifically for the inference-configuration step,
but does not fully solve general resumability across all onboarding
steps

Related implementation direction:
- NVIDIA#890
- this PR aligns with the intent of safer/non-shell probe execution and
clearer validation reporting
- NVIDIA#380
- not implemented here; no automatic provider fallback was added in this
branch

## Notes

- This PR intentionally does not weaken validation or reopen old
onboarding shortcuts.
- Unrelated local `tmp/` noise was left out of the branch.

Signed-off-by: Kevin Jones <kejones@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Interactive onboarding navigation (`back`/`exit`/`quit`) with
credential re-prompting and retry flows.
* Improved probe/validation flow with clearer recovery options and more
robust sandbox build progress messages.
  * Secret input masks with reliable backspace behavior.

* **Bug Fixes**
* Credential sanitization (trim/line-ending normalization) and API key
validation now normalize and retry instead of exiting.
* Better classification and messaging for authorization/validation
failures; retries where appropriate.

* **Tests**
* Expanded tests for credential prompts, masking, retry flows,
validation classification, and onboarding navigation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
lakamsani pushed a commit to lakamsani/NemoClaw that referenced this pull request Apr 4, 2026
…ript modules (NVIDIA#1240)

## Summary

- Extract ~210 lines of pure, side-effect-free functions from the
3,800-line `onboard.js` into **5 typed TypeScript modules** under
`src/lib/`:
- `gateway-state.ts` — gateway/sandbox state classification from
openshell output
- `validation.ts` — failure classification, API key validation, model ID
checks
  - `url-utils.ts` — URL normalization, text compaction, env formatting
  - `build-context.ts` — Docker build context filtering, recovery hints
  - `dashboard.ts` — dashboard URL resolution and construction
- Add **56 co-located unit tests** (`src/lib/*.test.ts`) for the
extracted modules
- Set up CLI TypeScript compilation: `tsconfig.src.json` compiles `src/`
→ `dist/` as CJS
- `onboard.js` imports from compiled `dist/lib/` output — transparent to
callers
- Pre-commit hook updated to build TS and include `dist/lib/` in
coverage

These functions are **not touched by any NVIDIA#924 blocker PR** (NVIDIA#781, NVIDIA#782,
NVIDIA#819, NVIDIA#672, NVIDIA#634, NVIDIA#890), so this extraction is safe to land immediately.

## Test plan

- [x] 598 CLI tests pass (542 existing + 56 new)
- [x] `tsc -p tsconfig.src.json` compiles cleanly
- [x] `tsc -p tsconfig.cli.json` type-checks cleanly
- [x] `tsc -p jsconfig.json` type-checks cleanly
- [x] Coverage ratchet passes with `dist/lib/` included

Closes NVIDIA#1237. Relates to NVIDIA#924 (shell consolidation).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Improved sandbox-creation recovery hints and targeted remediation
commands.
  * Smarter dashboard URL resolution and control-UI URL construction.

* **Bug Fixes**
  * More accurate gateway and sandbox state detection.
* Enhanced classification of validation/apply failures and safer
model/key validation.
  * Better provider URL normalization and loopback handling.

* **Tests**
  * Added comprehensive tests covering new utilities.

* **Chores**
* CLI now builds before CLI tests; CI/commit hooks updated to run the
CLI build.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Labels

fix, Getting Started (use this label to identify setup, installation, or onboarding issues)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants