
fix: harden inference onboarding auth and validation recovery #1136

Merged
kjw3 merged 16 commits into main from fix/onboard-inference-auth-handling
Mar 31, 2026

Conversation

Contributor

@kjw3 kjw3 commented Mar 31, 2026

Summary

Smooth out inference configuration during install.sh / nemoclaw onboard, especially when provider authorization, credential formatting, endpoint probing, or final inference application fails.

This PR makes the hosted-provider onboarding path recoverable instead of brittle:

  • normalize and safely handle credential input
  • classify validation failures more accurately
  • let users re-enter credentials in place
  • make final openshell inference set failures recoverable
  • normalize over-specified custom base URLs
  • add lower-level back / exit navigation so users can move up a level without restarting the whole install
  • clarify recovery prompts with explicit commands (retry, back, exit)
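The retry / back / exit commands above could be parsed along these lines. This is a minimal sketch; the function name and the re-prompt-on-unknown behavior are assumptions, not the PR's actual implementation:

```javascript
// Hypothetical sketch of recovery-command parsing; names are illustrative.
function resolveRecoveryCommand(input) {
  const command = String(input).trim().toLowerCase();
  if (command === "retry") return "retry"; // try the same step again
  if (command === "back") return "back";   // move up one prompt level
  if (command === "exit") return "exit";   // abandon onboarding
  return null;                             // unrecognized: re-prompt the user
}
```

Returning a sentinel (`null`) for unrecognized input lets the prompt loop re-ask without special-casing each call site.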

What Changed

  • refactored provider probe execution to use direct curl argv invocation instead of bash -c
  • normalized credential values before use/persistence
  • added structured auth / transport / model / endpoint failure classification
  • added in-place credential re-entry for hosted providers:
    • NVIDIA Endpoints
    • OpenAI
    • Anthropic
    • Google Gemini
    • custom OpenAI-compatible endpoints
    • custom Anthropic-compatible endpoints
  • wrapped final provider/apply failures in interactive recovery instead of hard abort
  • added command-style recovery prompts:
    • retry
    • back
    • exit
  • allowed back from lower-level inference prompts (model entry, base URL entry, recovery prompts)
  • normalized custom endpoint inputs to the minimum usable base URL
  • removed stale NVIDIA Endpoints (recommended) wording
  • secret prompts now show masked * feedback while typing/pasting
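The credential-normalization bullet above amounts to something like the following sketch. Assumed behavior only (the real helper lives in bin/lib/credentials.js): strip carriage returns and surrounding whitespace from pasted keys, and treat an empty result as "no credential":

```javascript
// Illustrative sketch; not the exact implementation from this PR.
function normalizeCredentialValue(value) {
  if (typeof value !== "string") return null;
  const normalized = value.replace(/\r/g, "").trim(); // drop CRs, trim whitespace
  return normalized.length > 0 ? normalized : null;   // empty -> no credential
}
```

This matters most for keys pasted from files or over SSH, where trailing `\r\n` silently breaks `Authorization` headers.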

Validation

npx vitest run test/credentials.test.js test/onboard-selection.test.js test/onboard.test.js
npx vitest run test/cli.test.js
npx eslint bin/lib/credentials.js bin/lib/onboard.js test/credentials.test.js test/onboard-selection.test.js test/onboard.test.js
npx tsc -p jsconfig.json --noEmit

Issue Mapping

Fully addressed in this PR:

Substantially addressed / partially addressed:

Related implementation direction:

Notes

  • This PR intentionally does not weaken validation or reopen old onboarding shortcuts.
  • Unrelated local tmp/ noise was left out of the branch.

Signed-off-by: Kevin Jones kejones@nvidia.com

Summary by CodeRabbit

  • New Features

    • Interactive onboarding navigation (back/exit/quit) with credential re-prompting and retry flows.
    • Improved probe/validation flow with clearer recovery options and more robust sandbox build progress messages.
    • Secret input masks with reliable backspace behavior.
  • Bug Fixes

    • Credential sanitization (trim/line-ending normalization) and API key validation now normalize and retry instead of exiting.
    • Better classification and messaging for authorization/validation failures; retries where appropriate.
  • Tests

    • Expanded tests for credential prompts, masking, retry flows, validation classification, and onboarding navigation.
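The masking behavior summarized above (echo `*` per character, emit `"\b \b"` to erase one mask on backspace) can be sketched as a keystroke handler. The helper name and buffer shape are assumptions for illustration:

```javascript
// Hypothetical keystroke handler for a masked secret prompt.
function maskKeystroke(char, buffer, write) {
  if (char === "\u007f" || char === "\b") { // DEL or BS
    if (buffer.length > 0) {
      buffer.pop();
      write("\b \b"); // erase one "*" from the terminal
    }
    return;
  }
  buffer.push(char);
  write("*"); // echo a mask instead of the real character
}
```

Only emitting the erase sequence when there is something to delete avoids the cursor walking left past the prompt.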

@coderabbitai

coderabbitai bot commented Mar 31, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds credential normalization and stricter secret-prompt masking/backspace handling; introduces interactive back/exit navigation across onboarding prompts; centralizes endpoint probing with a curl probe helper and failure classification; refactors validators to structured {ok, retry} results and implements interactive retry/recovery loops.
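A structured `{ok, retry}` verdict like the one described could look roughly like this. The status-code mapping and field names are assumptions for illustration, not the PR's exact classifier:

```javascript
// Sketch of probe-failure classification; thresholds are assumed.
function classifyValidationFailure({ curlStatus, httpStatus }) {
  if (curlStatus !== 0) return { ok: false, retry: "endpoint", kind: "transport" };
  if (httpStatus === 401 || httpStatus === 403) return { ok: false, retry: "credential", kind: "auth" };
  if (httpStatus === 404) return { ok: false, retry: "model", kind: "model" };
  if (httpStatus >= 200 && httpStatus < 300) return { ok: true };
  return { ok: false, retry: "endpoint", kind: "endpoint" };
}
```

Routing on `retry` (credential vs model vs endpoint) is what lets the recovery loop re-prompt for exactly the input that was wrong.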

Changes

Cohort / File(s) Summary
Credential Normalization & Prompting
bin/lib/credentials.js
Added and exported normalizeCredentialValue(value); normalize values in saveCredential/getCredential and return null for empty results; updated promptSecret to append chars, echo *, and only emit backspace erase sequence when deleting; improved non-secret prompt finish/SIGINT handling and made ensureApiKey loop until normalized and prefixed with nvapi-.
Onboarding Flow, Navigation & Probing
bin/lib/onboard.js
Integrated credential normalization into Authorization/x-api-key headers; added navigation tokens and prompt-aware helpers (BACK_TO_SELECTION, getNavigationChoice, exitOnboardFromPrompt); consolidated probing into runCurlProbe(); added classifyValidationFailure and normalizeProviderBaseUrl; refactored validators to return structured { ok, ... } and updated selection/setup loops for interactive retry/recovery and non-interactive handling.
Provider Upsert / Inference Setup
bin/lib/onboard.js (related sections)
upsertProvider() now returns structured { ok, status?, message? }; setupInference() and related flows now perform interactive retries and honor BACK_TO_SELECTION instead of exiting immediately.
Sandbox/Create UX
bin/lib/onboard.js (sandbox sections)
Sandbox creation now tracks build/upload/create/ready phases, prints phase transitions and heartbeats after silent periods, and refreshes phase state from observed output patterns.
Tests — Credentials
test/credentials.test.js
Added tests for SIGINT re-raising, infinite-retry validation loop for NVIDIA key prefix, secret masking (output.write("*")) and backspace erase (output.write("\\b \\b")), and runtime coverage for normalizeCredentialValue trimming CR/whitespace.
Tests — Onboard Selection
test/onboard-selection.test.js
Added helper scripts to emulate curl auth-fail/reauth flows; extended tests for provider list UX, lower-level back/exit navigation, and authorization retry behavior across NVIDIA, OpenAI, Anthropic, Gemini, and custom endpoints; adjusted test inputs to fully qualified proxy URLs.
Tests — End-to-End Onboard
test/onboard.test.js
Added unit tests for classifyValidationFailure and normalizeProviderBaseUrl; introduced integration-style test that stubs credentials/runner to force an initial auth failure then success, asserting retry and env key transition and verifying provider invocation sequence.
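One plausible shape for the base-URL minimization mentioned in the table above: strip a known API suffix (e.g. a pasted full `/chat/completions` endpoint) down to the minimal base the probes can safely extend. The suffix list and return format are assumptions:

```javascript
// Sketch of base-URL normalization; suffixes here are illustrative.
function normalizeProviderBaseUrl(input, suffixes = ["/chat/completions", "/v1"]) {
  let url;
  try {
    url = new URL(String(input).trim());
  } catch {
    return null; // not a usable URL
  }
  let pathname = url.pathname.replace(/\/+$/, ""); // drop trailing slashes
  for (const suffix of suffixes) {
    if (pathname.endsWith(suffix)) {
      pathname = pathname.slice(0, -suffix.length);
      break;
    }
  }
  return url.origin + pathname; // minimal base, no trailing slash
}
```

Including a bare `/v1` in the suffix list is exactly what the review comments below flag for the Anthropic-compatible path, so over-specified bases don't later probe `/v1/v1/...`.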

Sequence Diagram

sequenceDiagram
    participant User
    participant Onboard as Onboarding Flow
    participant Cred as Credentials Manager
    participant Probe as Endpoint Prober
    participant Validator as Validator
    participant Runner as Runner/Provider Apply

    User->>Onboard: start / select provider
    Onboard->>Cred: promptSecret() / prompt()
    Cred->>Cred: normalizeCredentialValue()
    Cred->>User: echo "*" per char, handle backspace
    User->>Cred: provide API key or "back"/"exit"
    Cred-->>Onboard: normalized key or navigation command

    Onboard->>Probe: runCurlProbe(endpoint, credential)
    Probe->>Validator: classifyFailure(result)
    Validator-->>Onboard: { ok: true } or { ok: false, retry: "credential|model|endpoint", helpUrl? }

    alt Validation fails
        Onboard->>User: promptValidationRecovery()
        User->>Onboard: retry / back / exit
        Onboard->>Cred: re-prompt or navigate
    else Validation succeeds
        Onboard->>Runner: upsertProvider / apply model
        Runner-->>Onboard: { ok / error }
        alt Runner error and recoverable
            Onboard->>User: promptValidationRecovery()
        end
    end

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I nibble CRs and trim each stray space,
I dot each secret with a silent * in place,
Type "back" and I hop to try the path anew,
Curl knocks, we listen, retry until true —
Hooray, the provider lands, the onboarding view!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 8.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title check — ✅ Passed: the title 'fix: harden inference onboarding auth and validation recovery' directly matches the main objective of the PR, which is to refactor and improve the inference onboarding flow for better error handling and recovery.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
test/credentials.test.js (1)

48-64: Prefer behavior tests over source-string assertions.

These checks lock the suite to the current bin/lib/credentials.js implementation (while (true), exact messages, and specific output.write() calls). A harmless refactor will fail the tests even if prompting, retrying, and masking still work.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/credentials.test.js` around lines 48 - 64, The tests currently assert
exact source strings in bin/lib/credentials.js (e.g., matching "while (true)",
exact prompt text, and specific output.write calls), which makes them brittle;
instead, update the tests to exercise observable behavior: spawn or require the
credentials module (use child_process.spawn or import exported functions like
the prompt/validation and masking helpers if available) and assert that it (1)
prompts and retries on invalid keys by sending invalid then valid input and
observing repeated prompts, and (2) masks secret input by sending keystrokes and
asserting the terminal output shows masking characters (or that the masking
helper returns masked output) while the returned/collected secret remains the
unmasked value. Reference the credentials module and its
prompt/validation/masking entry points rather than matching source-code strings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 669-683: The retry branch for recovery.kind === "credential"
currently calls replaceNamedCredential(credentialEnv, ...) which only rejects
empty input and therefore lets malformed NVIDIA API keys (NVIDIA_API_KEY) be
accepted; update this path to preserve the NVIDIA-specific validation by calling
the existing ensureApiKey(...) (or performing the same 'nvapi-' prefix and
format checks) when credentialEnv corresponds to NVIDIA_API_KEY (or when label
indicates NVIDIA), otherwise keep using replaceNamedCredential; ensure the
function names replaceNamedCredential and ensureApiKey and the credentialEnv
check are used to locate and modify the logic so the NVIDIA retry path enforces
the same validation as the initial prompt.
- Around line 2672-2676: The recovery handling after calling
promptValidationRecovery(...) treats any result other than "credential" or
"retry" as terminal and calls process.exit, which makes a "back" selection abort
onboarding; update the conditional around the variable retry (from
promptValidationRecovery) so that "back" is also treated as a non-terminal
choice (e.g., include retry === "back" in the branch that continues) and ensure
the same fix is applied to the duplicate block that handles provider/apply
recovery (the similar code around the other providerResult handling at the
second occurrence).
- Around line 485-491: The Anthropic normalization leaves a bare "/v1" behind
causing duplicate segments; update the suffix list used when computing pathname
in onboard.js so the Anthropic branch includes "/v1" (i.e., add "/v1" to the
array assigned to suffixes when flavor === "anthropic") so that
stripEndpointSuffix(...) will remove a trailing "/v1" and pathname normalization
won't produce "/v1/v1" on later probes.
- Around line 605-639: In runCurlProbe, handle spawnSync errors by checking
result.error before treating result.status as a successful exit; if result.error
is present (e.g., ETIMEDOUT or spawn failure) set ok=false, curlStatus to a
non-zero value (use result.error.code or 1), httpStatus=0, stderr to
result.error.message, and set message via
summarizeCurlFailure(result.error.code||1,
result.error.message||String(result.error)); only use result.status for normal
completions (avoid coercing null with result.status||0) and keep
summarizeProbeFailure for successful spawns where result.status is non-null.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e3e611ec-a1d4-4825-916d-32b66fc6b0ef

📥 Commits

Reviewing files that changed from the base of the PR and between 8530cde and 2d37749.

📒 Files selected for processing (5)
  • bin/lib/credentials.js
  • bin/lib/onboard.js
  • test/credentials.test.js
  • test/onboard-selection.test.js
  • test/onboard.test.js


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (4)
bin/lib/onboard.js (4)

485-491: ⚠️ Potential issue | 🟠 Major

Strip a bare /v1 from Anthropic-compatible base URLs.

A user-entered base ending in /v1 survives this normalization, so later probes hit .../v1/v1/messages and .../v1/v1/models. Add "/v1" to the Anthropic suffix list here.

🐛 Minimal fix
-      ? ["/v1/messages", "/v1/models", "/messages", "/models"]
+      ? ["/v1/messages", "/v1/models", "/messages", "/models", "/v1"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 485 - 491, The Anthropic suffix list in the
normalization logic (the suffixes array used before calling stripEndpointSuffix)
is missing the bare "/v1", causing user-provided bases ending with "/v1" to
remain and later produce duplicated segments; update the ternary branch that
builds suffixes for flavor === "anthropic" (the variable named suffixes) to
include "/v1" among ["/v1/messages", "/v1/models", "/v1", "/messages",
"/models"] so stripEndpointSuffix can remove a trailing "/v1" and subsequent
code (the pathname normalization and url.pathname assignment) will produce the
correct base URL.

672-685: ⚠️ Potential issue | 🟡 Minor

Preserve NVIDIA-specific key validation on credential retry.

This branch now always calls replaceNamedCredential(), which only rejects empty input. For NVIDIA_API_KEY that bypasses the ensureApiKey() format checks, so malformed values can be saved during recovery.

🛡️ Minimal fix
-      await replaceNamedCredential(credentialEnv, `${label} API key`, helpUrl);
+      if (credentialEnv === "NVIDIA_API_KEY") {
+        await ensureApiKey();
+      } else {
+        await replaceNamedCredential(credentialEnv, `${label} API key`, helpUrl);
+      }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 672 - 685, When retrying credential entry we
currently always call replaceNamedCredential(), which only rejects empty input
and therefore lets malformed NVIDIA_API_KEY values bypass the original
ensureApiKey validation; update the retry path in the recovery.kind ===
"credential" branch to preserve NVIDIA-specific validation by detecting the
NVIDIA credential (e.g., credentialEnv === "NVIDIA_API_KEY" or label containing
"NVIDIA") and invoking the same validation flow used elsewhere (call
ensureApiKey or run the NVIDIA-specific validator before persisting) instead of
the generic replaceNamedCredential, or pass a validator callback into
replaceNamedCredential so it enforces the NVIDIA format and rejects malformed
values.

613-629: ⚠️ Potential issue | 🟠 Major

Check result.error before coercing status.

When curl times out or cannot spawn, spawnSync() reports that on result.error and leaves result.status null. The result.status || 0 fallback turns those transport failures into curlStatus: 0 / HTTP 0, so recovery falls through to the wrong branch.

(In Node.js, when child_process.spawnSync() times out or cannot spawn the child, result.error is set and result.status is null; on a timeout, result.signal is typically 'SIGTERM'.)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 613 - 629, The spawnSync call's result may
have result.error set when curl fails to spawn or times out, so update the
post-spawn handling around the spawnSync result (the result variable used to
compute ok, httpStatus, curlStatus, body, stderr, and message) to first check
result.error and treat that as a transport failure: set curlStatus to a distinct
non-0 value (e.g., -1 or use result.error.code), set httpStatus to 0, and ensure
ok is false; pass the error details into summarizeProbeFailure and stderr
instead of coercing result.status || 0 or parsing stdout as an HTTP code when
result.error is present. Ensure references to result.error, result.status,
result.stdout, curlStatus, httpStatus, ok, and summarizeProbeFailure are updated
accordingly.

2680-2684: ⚠️ Potential issue | 🟠 Major

back still aborts from provider/apply recovery.

promptValidationRecovery() maps back to "selection", but both failure paths here still exit for anything except "credential" / "retry". That makes the new navigation option terminal instead of letting onboarding return to provider selection.

Also applies to: 2702-2706

🧹 Nitpick comments (2)
bin/lib/onboard.js (1)

2230-2450: Break the provider/recovery state machines into smaller helpers.

setupNim() and setupInference() are already well past the repo’s complexity cap, and the added retry/back/auth branches make control-flow bugs much easier to miss. Extract the per-provider validation/recovery cases so these top-level functions only coordinate state transitions.

As per coding guidelines, **/*.{js,ts,tsx}: Maintain cyclomatic complexity limit of 20 (target: ratchet down to 15) for all functions.

Also applies to: 2667-2707

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 2230 - 2450, The provider selection loop in
setupNim()/setupInference() is too large and must be split: extract per-provider
validation and recovery logic into small helper functions (e.g.,
validateOpenAiLikeSelectionWithRetry, validateAnthropicSelectionWithRetry,
validateCustomOpenAiLikeSelectionHelper, validateCustomAnthropicSelectionHelper)
that each accept (remoteConfig, selected.key, endpointUrl, model, credentialEnv)
and return a normalized { ok, api, retry } result; move navigation/back/exit
handling into a tiny helper (e.g., handleNavigationChoice) so the top-level loop
only orchestrates state transitions (select provider → call appropriate helper →
handle retry/back/exit/continue), ensuring setupNim and setupInference drop
below a cyclomatic complexity of 20 (target ~15) by delegating branches to these
helpers and reusing shared pieces like
getCredential/normalizeProviderBaseUrl/validateOpenAiLikeModel.
test/onboard.test.js (1)

907-927: Prefer behavioral checks over source-regex tests here.

These assertions only grep onboard.js text, so they break on harmless refactors and can still miss runtime regressions. The surrounding subprocess-based tests are the stronger pattern for validating curl timeout handling and provider-create noise suppression.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/onboard.test.js` around lines 907 - 927, Replace the fragile
source-regex assertions in the two tests named "uses split curl timeout args and
does not mislabel curl usage errors as timeouts" and "suppresses expected
provider-create AlreadyExists noise when update succeeds" with behavioral checks
that run the onboard.js binary and assert runtime behavior (e.g., spawn the
subprocess/child_process for onboard.js, inspect stdout/stderr, exit codes, and
subprocess error/exit properties) instead of grepping file text; remove the
assert.match(...) checks that look for the literal strings and instead assert
the observable outcomes (curl timeout handling and suppressed AlreadyExists
noise) using the existing subprocess-based test patterns in the suite so
refactors of onboard.js won't break the tests.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e663db14-a5b2-4f4f-983d-f6a8efb59926

📥 Commits

Reviewing files that changed from the base of the PR and between 2d37749 and 8451b77.

📒 Files selected for processing (4)
  • bin/lib/credentials.js
  • bin/lib/onboard.js
  • test/credentials.test.js
  • test/onboard.test.js
✅ Files skipped from review due to trivial changes (1)
  • test/credentials.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • bin/lib/credentials.js


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (4)
bin/lib/onboard.js (4)

613-630: ⚠️ Potential issue | 🟠 Major

Handle spawnSync().error before deriving probe status.

Timeouts and spawn failures surface on result.error with status === null; result.status || 0 collapses both cases to curlStatus: 0, so ETIMEDOUT / ENOENT probes get misreported as HTTP 0 and skip the transport-retry path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 613 - 630, The probe result handling after
the spawnSync("curl", ...) call must detect and surface process errors via
result.error before deriving OK/http status: in the block that returns the
object (referencing spawnSync call, result, curlStatus, httpStatus, ok, and
summarizeProbeFailure), check if result.error exists and set curlStatus to
result.error.code || result.error.errno (or a non-zero sentinel) and ensure ok
is false when result.error is present; keep httpStatus as 0 in that case and
include the result.error details (error.code/errno and result.stderr) in the
message passed to summarizeProbeFailure so ETIMEDOUT/ENOENT are not collapsed
into curlStatus 0 and will correctly trigger transport-retry logic.

2682-2686: ⚠️ Potential issue | 🟠 Major

Treat selection / model recovery results as non-terminal here.

promptValidationRecovery() can return selection (including back) and model, but both loops still exit for anything except credential / retry. That makes provider create/update and final inference set failures terminal again instead of routing back up the onboarding flow.

Also applies to: 2704-2708

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 2682 - 2686, The failure-handling branch for
promptValidationRecovery currently only treats "credential" and "retry" as
non-terminal, but promptValidationRecovery may return "selection", "back", or
"model" which should also continue the onboarding flow; update the logic in the
blocks that inspect retry (the branch using
promptValidationRecovery(config.label,
classifyApplyFailure(providerResult.message), resolvedCredentialEnv,
config.helpUrl) and the equivalent block later) so that if retry is one of
"credential", "retry", "selection", "back", or "model" you continue the loop,
and only call process.exit(providerResult.status || 1) for other values.

485-491: ⚠️ Potential issue | 🟠 Major

Strip a bare /v1 from Anthropic-compatible bases.

https://proxy.example.com/v1 still survives normalization as /v1, and every Anthropic probe later appends /v1/messages / /v1/models, so validation hits /v1/v1/....

Possible fix
-      ? ["/v1/messages", "/v1/models", "/messages", "/models"]
+      ? ["/v1/messages", "/v1/models", "/messages", "/models", "/v1"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 485 - 491, The Anthropic normalization
misses a bare "/v1" path, causing later probes to double-append; update the
suffixes used by stripEndpointSuffix so that when flavor === "anthropic" you
include "/v1" (e.g., change the anthropic suffix array in the const suffixes
assignment to include "/v1"), then ensure you still trim trailing slashes on
url.pathname and set url.pathname = pathname || "/"; this will make
stripEndpointSuffix(stripEndpointSuffix(url.pathname..., suffixes)) correctly
normalize bases like "https://proxy.example.com/v1" before probes run.

672-685: ⚠️ Potential issue | 🟡 Minor

Preserve NVIDIA key validation on credential retry.

When credentialEnv === "NVIDIA_API_KEY", this path now calls replaceNamedCredential(), which only rejects empty input. That bypasses the nvapi-... checks enforced by ensureApiKey() during the initial prompt.

Possible fix
-      await replaceNamedCredential(credentialEnv, `${label} API key`, helpUrl);
+      if (credentialEnv === "NVIDIA_API_KEY") {
+        await ensureApiKey();
+      } else {
+        await replaceNamedCredential(credentialEnv, `${label} API key`, helpUrl);
+      }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 672 - 685, The retry path for credential
recovery bypasses NVIDIA key format validation because when recovery.kind ===
"credential" and credentialEnv is set the code calls
replaceNamedCredential(credentialEnv, ...) which only rejects empty input;
modify this path so that when credentialEnv === "NVIDIA_API_KEY" you validate
the entered key with the same nvapi- checks used by ensureApiKey (or call
ensureApiKey) before accepting and returning "credential" (or pass a validator
into replaceNamedCredential that enforces the nvapi- prefix and length checks);
ensure you still allow empty-retry/back/exit flows and preserve label and
helpUrl usage.
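For illustration, a validator with the shape the prompt describes might look like the following; the exact rules ensureApiKey() enforces (length, charset) are an assumption here:

```javascript
// Hypothetical stand-in for the nvapi- format check; the real ensureApiKey()
// in bin/lib/onboard.js may enforce stricter length/charset rules.
function isPlausibleNvidiaKey(key) {
  const trimmed = String(key || "").trim();
  return trimmed.startsWith("nvapi-") && trimmed.length > "nvapi-".length;
}

console.log(isPlausibleNvidiaKey("nvapi-abc123"));  // → true
console.log(isPlausibleNvidiaKey("sk-not-nvidia")); // → false
```

Passing a validator like this into replaceNamedCredential would keep the retry path and the initial prompt enforcing the same format.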
🧹 Nitpick comments (1)
bin/lib/onboard.js (1)

2184-2656: Split setupNim() by provider path.

The new recovery loops push this function even further past the repo’s complexity limit. Extracting the hosted-provider, custom-endpoint, and local-provider branches into helpers would make the retry/navigation state machine much easier to reason about and test.

As per coding guidelines, **/*.{js,ts,tsx}: Maintain cyclomatic complexity limit of 20 (target: ratchet down to 15) for all functions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 2184 - 2656, The setupNim function is too
large/complex; split provider-specific paths into small helpers to reduce
cyclomatic complexity: extract the hosted/provider branch (the logic under if
(REMOTE_PROVIDER_CONFIG[selected.key]) including credential checks, model
prompting and validateOpenAiLikeSelection) into a helper like
handleHostedProviderSelection(selected, requestedModel, requestedProvider) that
returns {model, provider, endpointUrl, credentialEnv, preferredInferenceApi} or
navigation signals; extract the custom/anthropic-compatible endpoint handling
into handleCustomEndpointSelection that normalizes endpointUrl and handles
navigation back/exit; extract the local-provider branches (nim-local, ollama,
install-ollama, vllm) into separate helpers handleNimLocal(gpu, requestedModel),
handleOllamaFlow(requestedModel,gpu,runCapture), handleInstallOllamaFlow(...),
and handleVllmFlow(...) that encapsulate their pulls/starts/validation and
return the same selection result shape; keep selectionLoop and navigation intact
by calling these helpers and acting on their return values (including signals
like "back"/"retry"/"exit") so the main setupNim only orchestrates flow and
logging while all validation and retry loops live in the new helper functions.
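The suggested split can be illustrated with a toy dispatcher; the handler names follow the prompt above and are hypothetical until implemented:

```javascript
// Toy dispatcher illustrating the suggested shape: setupNim would only
// route to per-provider handlers and act on a standardized result.
// Handler bodies are stubs here; real ones would wrap prompts/validation.
function handleOllamaFlow() {
  return { status: "ok", provider: "ollama" };
}
function handleHostedProviderSelection(key) {
  return { status: "ok", provider: key };
}

function dispatchProvider(selectedKey) {
  const handlers = { ollama: handleOllamaFlow };
  const handler = handlers[selectedKey];
  return handler ? handler() : handleHostedProviderSelection(selectedKey);
}

console.log(dispatchProvider("ollama").provider); // → ollama
console.log(dispatchProvider("openai").provider); // → openai
```

Because every handler returns the same `{ status, ... }` shape, the outer selection loop only has to switch on `status` ("ok", "retry", "back", "exit") instead of carrying each provider's retry state inline.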
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 563-575: The error classification logic in onboard.js currently
checks for httpStatus 404/405 before testing the response string against
model-related patterns, causing model-miss responses to be misclassified as
endpoint issues; move the model regex check (/model.+not found|unknown
model|unsupported model|bad model/i) so it runs before the 404/405 branch (the
block that returns { kind: "endpoint", retry: "selection" }) and keep the
unauthorized/forbidden credential check as-is, ensuring that model-related
responses are classified as { kind: "model", retry: "model" } and only true
404/405 non-model endpoint errors fall through to endpoint selection.
- Around line 2503-2512: The local-provider branches only check validation.ok
and ignore the new recovery contract; update each place that calls
validateOpenAiLikeSelection (e.g., the "Local NVIDIA NIM" call that assigns
preferredInferenceApi) to first test validation.retry (and validation.back where
applicable) and, when set, jump back to the outer provider selector by executing
continue selectionLoop; only if not retry/back proceed to check validation.ok
and set preferredInferenceApi—apply the same change to the other call sites you
noted so validateOpenAiLikeSelection's retry/back paths are honored.

---

Duplicate comments:
In `@bin/lib/onboard.js`:
- Around line 613-630: The probe result handling after the spawnSync("curl",
...) call must detect and surface process errors via result.error before
deriving OK/http status: in the block that returns the object (referencing
spawnSync call, result, curlStatus, httpStatus, ok, and summarizeProbeFailure),
check if result.error exists and set curlStatus to result.error.code ||
result.error.errno (or a non-zero sentinel) and ensure ok is false when
result.error is present; keep httpStatus as 0 in that case and include the
result.error details (error.code/errno and result.stderr) in the message passed
to summarizeProbeFailure so ETIMEDOUT/ENOENT are not collapsed into curlStatus 0
and will correctly trigger transport-retry logic.
- Around line 2682-2686: The failure-handling branch for
promptValidationRecovery currently only treats "credential" and "retry" as
non-terminal, but promptValidationRecovery may return "selection", "back", or
"model" which should also continue the onboarding flow; update the logic in the
blocks that inspect retry (the branch using
promptValidationRecovery(config.label,
classifyApplyFailure(providerResult.message), resolvedCredentialEnv,
config.helpUrl) and the equivalent block later) so that if retry is one of
"credential", "retry", "selection", "back", or "model" you continue the loop,
and only call process.exit(providerResult.status || 1) for other values.
- Around line 485-491: The Anthropic normalization misses a bare "/v1" path,
causing later probes to double-append; update the suffixes used by
stripEndpointSuffix so that when flavor === "anthropic" you include "/v1" (e.g.,
change the anthropic suffix array in the const suffixes assignment to include
"/v1"), then ensure you still trim trailing slashes on url.pathname and set
url.pathname = pathname || "/"; this will make
stripEndpointSuffix(stripEndpointSuffix(url.pathname..., suffixes)) correctly
normalize bases like "https://proxy.example.com/v1" before probes run.
- Around line 672-685: The retry path for credential recovery bypasses NVIDIA
key format validation because when recovery.kind === "credential" and
credentialEnv is set the code calls replaceNamedCredential(credentialEnv, ...)
which only rejects empty input; modify this path so that when credentialEnv ===
"NVIDIA_API_KEY" you validate the entered key with the same nvapi- checks used
by ensureApiKey (or call ensureApiKey) before accepting and returning
"credential" (or pass a validator into replaceNamedCredential that enforces the
nvapi- prefix and length checks); ensure you still allow empty-retry/back/exit
flows and preserve label and helpUrl usage.

---

Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 2184-2656: The setupNim function is too large/complex; split
provider-specific paths into small helpers to reduce cyclomatic complexity:
extract the hosted/provider branch (the logic under if
(REMOTE_PROVIDER_CONFIG[selected.key]) including credential checks, model
prompting and validateOpenAiLikeSelection) into a helper like
handleHostedProviderSelection(selected, requestedModel, requestedProvider) that
returns {model, provider, endpointUrl, credentialEnv, preferredInferenceApi} or
navigation signals; extract the custom/anthropic-compatible endpoint handling
into handleCustomEndpointSelection that normalizes endpointUrl and handles
navigation back/exit; extract the local-provider branches (nim-local, ollama,
install-ollama, vllm) into separate helpers handleNimLocal(gpu, requestedModel),
handleOllamaFlow(requestedModel,gpu,runCapture), handleInstallOllamaFlow(...),
and handleVllmFlow(...) that encapsulate their pulls/starts/validation and
return the same selection result shape; keep selectionLoop and navigation intact
by calling these helpers and acting on their return values (including signals
like "back"/"retry"/"exit") so the main setupNim only orchestrates flow and
logging while all validation and retry loops live in the new helper functions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6cb4ce50-412e-47a0-8a3a-6fcf7b00f47c

📥 Commits

Reviewing files that changed from the base of the PR and between 8451b77 and d4ccd4c.

📒 Files selected for processing (2)
  • bin/lib/onboard.js
  • test/onboard.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/onboard.test.js

Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (6)
bin/lib/onboard.js (6)

742-755: ⚠️ Potential issue | 🟡 Minor

Keep NVIDIA key validation on the retry path.

When credentialEnv === "NVIDIA_API_KEY", this branch bypasses ensureApiKey() and falls back to replaceNamedCredential(), which only rejects empty input. retry can now persist obviously malformed NVIDIA keys that the initial prompt would reject.

🐛 Minimal fix
     if (choice === "" || choice === "retry") {
-      await replaceNamedCredential(credentialEnv, `${label} API key`, helpUrl);
+      if (credentialEnv === "NVIDIA_API_KEY") {
+        await ensureApiKey();
+      } else {
+        await replaceNamedCredential(credentialEnv, `${label} API key`, helpUrl);
+      }
       return "credential";
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 742 - 755, The retry branch for credential
recovery currently calls replaceNamedCredential(credentialEnv, `${label} API
key`, helpUrl) which only rejects empty input and skips ensureApiKey()
validation for NVIDIA_API_KEY; update the retry path inside the recovery.kind
=== "credential" branch so that when credentialEnv === "NVIDIA_API_KEY" you call
ensureApiKey() (or run the same NVIDIA key validation used on initial prompt)
before persisting the credential, falling back to replaceNamedCredential() only
after validation succeeds; target the logic around the choice === "" || choice
=== "retry" handling to add this conditional validation.

633-643: ⚠️ Potential issue | 🟠 Major

Classify model misses before the 404/405 endpoint fallback.

getProbeRecovery(..., { allowModelRetry: true }) depends on model misses being tagged as model, but this ordering turns 404/405 "model not found" responses into selection first. The compatible-endpoint flows then bounce users back to provider selection instead of re-prompting for the model.

🐛 Minimal fix
   if (httpStatus === 400) {
     return { kind: "model", retry: "model" };
   }
-  if (httpStatus === 404 || httpStatus === 405) {
-    return { kind: "endpoint", retry: "selection" };
-  }
   if (/unauthorized|forbidden|invalid api key|invalid_auth|permission/i.test(normalized)) {
     return { kind: "credential", retry: "credential" };
   }
   if (/model.+not found|unknown model|unsupported model|bad model/i.test(normalized)) {
     return { kind: "model", retry: "model" };
   }
+  if (httpStatus === 404 || httpStatus === 405) {
+    return { kind: "endpoint", retry: "selection" };
+  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 633 - 643, The classification for "model not
found" should run before the 404/405 endpoint fallback so model misses are
returned as { kind: "model", retry: "model" } (which getProbeRecovery(..., {
allowModelRetry: true }) expects); move the /model.+not found|unknown
model|unsupported model|bad model/i test to appear before the if (httpStatus ===
404 || httpStatus === 405) branch in the error-classification logic inside
getProbeRecovery (or the surrounding function that contains httpStatus and
normalized), ensuring the model-regex branch is evaluated first so model-related
errors are classified as "model" rather than "endpoint".
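Applied, the reordering gives a classifier of roughly this shape (a sketch of the described logic, not the exact onboard.js code):

```javascript
// Sketch of the reordered classifier: credential and model-miss patterns
// are tested before the generic 404/405 endpoint fallback.
function classifyProbeFailure(httpStatus, body) {
  const normalized = String(body || "").toLowerCase();
  if (httpStatus === 400) {
    return { kind: "model", retry: "model" };
  }
  if (/unauthorized|forbidden|invalid api key|invalid_auth|permission/.test(normalized)) {
    return { kind: "credential", retry: "credential" };
  }
  if (/model.+not found|unknown model|unsupported model|bad model/.test(normalized)) {
    return { kind: "model", retry: "model" };
  }
  if (httpStatus === 404 || httpStatus === 405) {
    return { kind: "endpoint", retry: "selection" };
  }
  return { kind: "transport", retry: "retry" };
}

console.log(classifyProbeFailure(404, "model gpt-x not found").kind); // → model
console.log(classifyProbeFailure(404, "no such route").kind);         // → endpoint
```

A 404 whose body names a missing model now re-prompts for the model, while a true routing 404 still falls through to endpoint selection.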

555-561: ⚠️ Potential issue | 🟠 Major

Strip a bare /v1 from Anthropic-compatible base URLs.

If the user enters https://proxy.example.com/v1, this normalization leaves /v1 intact, and the later Anthropic probes append another /v1/..., so validation goes to /v1/v1/messages and /v1/v1/models.

🐛 Minimal fix
-    const suffixes = flavor === "anthropic"
-      ? ["/v1/messages", "/v1/models", "/messages", "/models"]
+    const suffixes = flavor === "anthropic"
+      ? ["/v1/messages", "/v1/models", "/messages", "/models", "/v1"]
       : ["/responses", "/chat/completions", "/completions", "/models"];
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 555 - 561, The normalization fails to remove
a bare "/v1" for Anthropic-style URLs, causing probes to hit "/v1/v1/…"; update
the suffix list used by stripEndpointSuffix when flavor === "anthropic" to
include "/v1" (in addition to "/v1/messages", "/v1/models", "/messages",
"/models") so that a base URL ending in "/v1" is stripped to "/" and
url.pathname becomes "/" (ensuring url.origin is returned and subsequent probes
do not double-up "/v1").

683-699: ⚠️ Potential issue | 🟠 Major

Handle spawnSync() failures before interpreting status.

On timeout or spawn failure, Node reports the problem on result.error and leaves result.status null. Coercing that with result.status || 0 turns those failures into a fake HTTP-0 path, so transport recovery never sees the real timeout/ENOENT condition.

🐛 Minimal fix
     const body = fs.existsSync(bodyFile) ? fs.readFileSync(bodyFile, "utf8") : "";
+    if (result.error) {
+      const stderr = result.error.message || String(result.error);
+      const curlStatus = result.error.code === "ETIMEDOUT" ? 28 : 1;
+      return {
+        ok: false,
+        httpStatus: 0,
+        curlStatus,
+        body,
+        stderr,
+        message: summarizeCurlFailure(curlStatus, stderr, body),
+      };
+    }
     const status = Number(String(result.stdout || "").trim());
     return {
       ok: result.status === 0 && status >= 200 && status < 300,
       httpStatus: Number.isFinite(status) ? status : 0,
-      curlStatus: result.status || 0,
+      curlStatus: result.status ?? 0,
       body,
       stderr: String(result.stderr || ""),
-      message: summarizeProbeFailure(body, status || 0, result.status || 0, String(result.stderr || "")),
+      message: summarizeProbeFailure(body, status || 0, result.status ?? 0, String(result.stderr || "")),
     };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 683 - 699, The spawnSync() result must be
checked for result.error before coercing result.status; modify the probe logic
around the spawnSync("curl", ...) call to detect if result.error is present and,
in that case, set ok=false, httpStatus=0, and set curlStatus to a predictable
sentinel (e.g., result.error.errno, result.error.code, or -1) instead of using
result.status || 0; also include result.error (its code/message) in stderr and
pass it into summarizeProbeFailure so transport/retry logic can see
timeouts/ENOENT rather than a fake HTTP-0. Reference the existing variables
result, bodyFile, summarizeProbeFailure and the spawnSync call when making the
change.

2752-2756: ⚠️ Potential issue | 🟠 Major

Don’t terminate non-terminal recovery results in Step 4.

promptValidationRecovery() can return selection and model here, but both branches still process.exit() unless the answer is credential or retry. That makes back from provider/apply recovery abort onboarding, and model-related apply failures still can’t unwind to Step 3.

Also applies to: 2774-2778

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 2752 - 2756, The code currently calls
process.exit for any promptValidationRecovery result other than "credential" or
"retry", which aborts onboarding when promptValidationRecovery returns
"selection" or "model"; change the logic around promptValidationRecovery (the
call with args config.label, classifyApplyFailure(providerResult.message),
resolvedCredentialEnv, config.helpUrl) so that when it returns "selection" or
"model" you do NOT call process.exit but instead unwind the flow (e.g., loop
back to the prior step/Step 3 or re-enter the selection/model handling logic) to
allow recovery to proceed; preserve the existing behavior of continuing on
"credential" or "retry" and only use providerResult.status (or exit) for truly
terminal failures.

2573-2582: ⚠️ Potential issue | 🟠 Major

Honor validation.retry in the local-provider branches.

These call sites still collapse every failed validation into a fixed control flow, so the new recovery contract is lost: retry on local NIM/vLLM jumps back to provider selection, and back from the Ollama recovery prompt just reopens the model picker instead of returning to selection.

Also applies to: 2623-2633, 2665-2675, 2703-2712

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 2573 - 2582, The current local-provider call
sites unconditionally do "continue selectionLoop" when
validateOpenAiLikeSelection returns !ok, which ignores the recovery contract;
update the validateOpenAiLikeSelection error handling (the blocks that call
validateOpenAiLikeSelection for "Local NVIDIA NIM", vLLM and Ollama) to examine
validation.retry (and any equivalent recovery flag) instead of always
continuing: if validation.retry === 'retry' then continue selectionLoop to
re-open provider selection, if validation.retry === 'back' then perform the
"back" control flow (e.g. break/return to the previous menu or invoke the
existing back-handling path) and otherwise follow the default retry behavior;
apply the same change in the other similar call sites that set
preferredInferenceApi after validation.
🧹 Nitpick comments (1)
bin/lib/onboard.js (1)

2271-2712: Split setupNim() into per-provider handlers.

The added back/retry paths push this dispatcher even further into “state machine in one function” territory, which is exactly how misses like the local retry handling slipped through. Keeping only provider selection/orchestration here and moving each provider branch into a helper would make the new recovery behavior much easier to reason about and test.

As per coding guidelines, **/*.{js,ts,tsx}: Maintain cyclomatic complexity limit of 20 (target: ratchet down to 15) for all functions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 2271 - 2712, This function has grown too
large and mixes provider selection with per-provider state/retry logic; extract
each provider branch into its own handler (e.g., handleNimLocal, handleOllama,
handleInstallOllama, handleVllm, handleRemoteProvider) and move the
NIM/start/validation loops out of the main selectionLoop; each handler should
accept the necessary inputs (options, requestedModel, gpu, runCapture, etc.),
perform provider-specific prompts/validation using existing helpers like
validateOpenAiLikeSelection, validateAnthropicSelectionWithRetryMessage,
prepareOllamaModel, nim.listModels/ pullNimImage/ startNimContainer, and return
a standardized result (status: "ok"|"retry"|"back"|"exit", plus provider,
credentialEnv, endpointUrl, model, preferredInferenceApi, nimContainer) so the
outer selectionLoop only dispatches to handlers and acts on the returned status
(continue selectionLoop, exit, or accept result), reducing cyclomatic complexity
of the main function (selectionLoop/ setupNim) below the guideline.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@bin/lib/onboard.js`:
- Around line 742-755: The retry branch for credential recovery currently calls
replaceNamedCredential(credentialEnv, `${label} API key`, helpUrl) which only
rejects empty input and skips ensureApiKey() validation for NVIDIA_API_KEY;
update the retry path inside the recovery.kind === "credential" branch so that
when credentialEnv === "NVIDIA_API_KEY" you call ensureApiKey() (or run the same
NVIDIA key validation used on initial prompt) before persisting the credential,
falling back to replaceNamedCredential() only after validation succeeds; target
the logic around the choice === "" || choice === "retry" handling to add this
conditional validation.
- Around line 633-643: The classification for "model not found" should run
before the 404/405 endpoint fallback so model misses are returned as { kind:
"model", retry: "model" } (which getProbeRecovery(..., { allowModelRetry: true
}) expects); move the /model.+not found|unknown model|unsupported model|bad
model/i test to appear before the if (httpStatus === 404 || httpStatus === 405)
branch in the error-classification logic inside getProbeRecovery (or the
surrounding function that contains httpStatus and normalized), ensuring the
model-regex branch is evaluated first so model-related errors are classified as
"model" rather than "endpoint".
- Around line 555-561: The normalization fails to remove a bare "/v1" for
Anthropic-style URLs, causing probes to hit "/v1/v1/…"; update the suffix list
used by stripEndpointSuffix when flavor === "anthropic" to include "/v1" (in
addition to "/v1/messages", "/v1/models", "/messages", "/models") so that a base
URL ending in "/v1" is stripped to "/" and url.pathname becomes "/" (ensuring
url.origin is returned and subsequent probes do not double-up "/v1").
- Around line 683-699: The spawnSync() result must be checked for result.error
before coercing result.status; modify the probe logic around the
spawnSync("curl", ...) call to detect if result.error is present and, in that
case, set ok=false, httpStatus=0, and set curlStatus to a predictable sentinel
(e.g., result.error.errno, result.error.code, or -1) instead of using
result.status || 0; also include result.error (its code/message) in stderr and
pass it into summarizeProbeFailure so transport/retry logic can see
timeouts/ENOENT rather than a fake HTTP-0. Reference the existing variables
result, bodyFile, summarizeProbeFailure and the spawnSync call when making the
change.
- Around line 2752-2756: The code currently calls process.exit for any
promptValidationRecovery result other than "credential" or "retry", which aborts
onboarding when promptValidationRecovery returns "selection" or "model"; change
the logic around promptValidationRecovery (the call with args config.label,
classifyApplyFailure(providerResult.message), resolvedCredentialEnv,
config.helpUrl) so that when it returns "selection" or "model" you do NOT call
process.exit but instead unwind the flow (e.g., loop back to the prior step/Step
3 or re-enter the selection/model handling logic) to allow recovery to proceed;
preserve the existing behavior of continuing on "credential" or "retry" and only
use providerResult.status (or exit) for truly terminal failures.
- Around line 2573-2582: The current local-provider call sites unconditionally
do "continue selectionLoop" when validateOpenAiLikeSelection returns !ok, which
ignores the recovery contract; update the validateOpenAiLikeSelection error
handling (the blocks that call validateOpenAiLikeSelection for "Local NVIDIA
NIM", vLLM and Ollama) to examine validation.retry (and any equivalent recovery
flag) instead of always continuing: if validation.retry === 'retry' then
continue selectionLoop to re-open provider selection, if validation.retry ===
'back' then perform the "back" control flow (e.g. break/return to the previous
menu or invoke the existing back-handling path) and otherwise follow the default
retry behavior; apply the same change in the other similar call sites that set
preferredInferenceApi after validation.

---

Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 2271-2712: This function has grown too large and mixes provider
selection with per-provider state/retry logic; extract each provider branch into
its own handler (e.g., handleNimLocal, handleOllama, handleInstallOllama,
handleVllm, handleRemoteProvider) and move the NIM/start/validation loops out of
the main selectionLoop; each handler should accept the necessary inputs
(options, requestedModel, gpu, runCapture, etc.), perform provider-specific
prompts/validation using existing helpers like validateOpenAiLikeSelection,
validateAnthropicSelectionWithRetryMessage, prepareOllamaModel, nim.listModels/
pullNimImage/ startNimContainer, and return a standardized result (status:
"ok"|"retry"|"back"|"exit", plus provider, credentialEnv, endpointUrl, model,
preferredInferenceApi, nimContainer) so the outer selectionLoop only dispatches
to handlers and acts on the returned status (continue selectionLoop, exit, or
accept result), reducing cyclomatic complexity of the main function
(selectionLoop/ setupNim) below the guideline.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f7a7da0c-04b2-43b2-9840-3e0c56600881

📥 Commits

Reviewing files that changed from the base of the PR and between d4ccd4c and a5b57c8.

📒 Files selected for processing (2)
  • bin/lib/onboard.js
  • test/onboard.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/onboard.test.js

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
bin/lib/onboard.js (1)

2794-2798: ⚠️ Potential issue | 🟠 Major

Bubble Step 4 recovery out of this loop.

Line 2795 and Line 2817 treat "selection", "back", and "model" like a plain retry by continue-ing the same while (true). That never returns to setupNim(), so users still can’t actually change provider/model after openshell provider or openshell inference set failures. Only "credential" and "retry" should stay local here.

Also applies to: 2816-2820

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 2794 - 2798, The loop handling
promptValidationRecovery treats "selection", "back", and "model" the same as a
local retry, preventing control from returning to setupNim; update the logic
around promptValidationRecovery (the variables retry and providerResult.message
used with classifyApplyFailure and promptValidationRecovery) so only
"credential" and "retry" trigger a local continue, while "selection", "back",
and "model" bubble out of this loop (e.g., break or return so setupNim can run
again) and preserve the existing process.exit(providerResult.status||1) for
other failures.
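One way to express the contract is a small helper that separates local retries from signals that must bubble out (a sketch with assumed names, not code from the PR):

```javascript
// Hypothetical helper: only credential/retry stay inside the Step 4 apply
// loop; selection/back/model must bubble out so setupNim() can rerun.
function applyLoopAction(retry) {
  if (retry === "credential" || retry === "retry") return "continue";
  if (retry === "selection" || retry === "back" || retry === "model") return "bubble";
  return "exit";
}

console.log(applyLoopAction("retry")); // → continue
console.log(applyLoopAction("back"));  // → bubble
console.log(applyLoopAction("boom"));  // → exit
```

The apply loop would then `continue` on "continue", `break`/`return` on "bubble", and reserve process.exit for "exit".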
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 678-726: The runCurlProbe function is over complexity limits;
refactor by extracting the spawn+file handling and the two response-shaping
branches into small helpers: create a helper spawnCurlAndReadBody(argv,
bodyFile) that runs spawnSync("curl", ...), reads bodyFile, and returns an
object { result, body } (normalize errors there), and create two responders
formatCurlError(resultError, body) and formatCurlSuccess(result, body) that
build the return payloads currently produced in the error branch and success
branch (use the existing symbols compactText, summarizeProbeFailure,
summarizeCurlFailure when composing messages). Replace the inline branches in
runCurlProbe with calls to spawnCurlAndReadBody and then one-line returns from
formatCurlError/formatCurlSuccess so runCurlProbe's cyclomatic complexity drops
below the limit without changing behavior.
- Around line 2373-2389: The onboarding flow currently calls process.exit(1)
when normalizeProviderBaseUrl returns falsy for a custom OpenAI-compatible
endpoint, which aborts interactive onboarding on accidental blank input; update
the logic around endpointUrl (and the similar block later) to only hard-exit
when isNonInteractive() is true, and for interactive mode either re-prompt the
user or route back to provider selection (use getNavigationChoice/selectionLoop
or call exitOnboardFromPrompt as appropriate) so the user can retry instead of
the process terminating. Ensure the checks reference endpointUrl,
isNonInteractive(), getNavigationChoice, normalizeProviderBaseUrl, and
selectionLoop so the behavior differs between interactive and non-interactive
runs.
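The interactive-aware fallback in the second comment can be sketched like this; the real flow would re-prompt via the existing navigation helpers rather than return a string:

```javascript
// Sketch of the interactive-aware handling described above; function and
// action names are illustrative, not taken from bin/lib/onboard.js.
function resolveBaseUrlOrAction(input, nonInteractive) {
  const trimmed = String(input || "").trim();
  if (!trimmed) {
    // Hard-exit is only appropriate when no TTY is attached.
    return { action: nonInteractive ? "exit" : "reprompt" };
  }
  return { action: "ok", endpointUrl: trimmed };
}

console.log(resolveBaseUrlOrAction("", false).action); // → reprompt
console.log(resolveBaseUrlOrAction("", true).action);  // → exit
console.log(resolveBaseUrlOrAction("https://proxy.example.com/v1", true).action); // → ok
```

Blank input in an interactive run stays recoverable, while scripted (non-interactive) runs still fail fast.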

---

Duplicate comments:
In `@bin/lib/onboard.js`:
- Around line 2794-2798: The loop handling promptValidationRecovery treats
"selection", "back", and "model" the same as a local retry, preventing control
from returning to setupNim; update the logic around promptValidationRecovery
(the variables retry and providerResult.message used with classifyApplyFailure
and promptValidationRecovery) so only "credential" and "retry" trigger a local
continue, while "selection", "back", and "model" bubble out of this loop (e.g.,
break or return so setupNim can run again) and preserve the existing
process.exit(providerResult.status||1) for other failures.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 03fd22ad-3bf0-4e4e-bbef-99fc16e3ec45

📥 Commits

Reviewing files that changed from the base of the PR and between a5b57c8 and 2f24593.

📒 Files selected for processing (2)
  • bin/lib/onboard.js
  • test/onboard.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/onboard.test.js

Comment on lines +678 to +726
function runCurlProbe(argv) {
  const bodyFile = path.join(os.tmpdir(), `nemoclaw-curl-probe-${Date.now()}-${Math.random().toString(36).slice(2)}.json`);
  try {
    const args = [...argv];
    const url = args.pop();
    const result = spawnSync("curl", [...args, "-o", bodyFile, "-w", "%{http_code}", url], {
      cwd: ROOT,
      encoding: "utf8",
      timeout: 30_000,
      env: {
        ...process.env,
      },
    });
    const body = fs.existsSync(bodyFile) ? fs.readFileSync(bodyFile, "utf8") : "";
    if (result.error) {
      const errorCode = result.error.code || result.error.errno || 1;
      const errorMessage = compactText(
        `${result.error.message || String(result.error)} ${String(result.stderr || "")}`
      );
      return {
        ok: false,
        httpStatus: 0,
        curlStatus: errorCode,
        body,
        stderr: errorMessage,
        message: summarizeProbeFailure(body, 0, errorCode, errorMessage),
      };
    }
    const status = Number(String(result.stdout || "").trim());
    return {
      ok: result.status === 0 && status >= 200 && status < 300,
      httpStatus: Number.isFinite(status) ? status : 0,
      curlStatus: result.status || 0,
      body,
      stderr: String(result.stderr || ""),
      message: summarizeProbeFailure(body, status || 0, result.status || 0, String(result.stderr || "")),
    };
  } catch (error) {
    return {
      ok: false,
      httpStatus: 0,
      curlStatus: error?.status || 1,
      body: "",
      stderr: error?.message || String(error),
      message: summarizeCurlFailure(error?.status || 1, error?.message || String(error)),
    };
  } finally {
    fs.rmSync(bodyFile, { force: true });
  }

🛠️ Refactor suggestion | 🟠 Major

Split runCurlProbe() before merge.

Line 678 is still over the repo limit and is already failing CI (complexity 25). Extract the spawn/error normalization and response-shaping branches into a couple of small helpers so this clears lint without changing behavior.

As per coding guidelines, **/*.{js,ts,tsx}: Maintain cyclomatic complexity limit of 20 (target: ratchet down to 15) for all functions.

🧰 Tools
🪛 GitHub Actions: pr

[error] 678-678: ESLint (JavaScript) failed. 'runCurlProbe' has a complexity of 25. Maximum allowed is 20 (complexity).


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 678 - 726, The runCurlProbe function is over
complexity limits; refactor by extracting the spawn+file handling and the two
response-shaping branches into small helpers: create a helper
spawnCurlAndReadBody(argv, bodyFile) that runs spawnSync("curl", ...), reads
bodyFile, and returns an object { result, body } (normalize errors there), and
create two responders formatCurlError(resultError, body) and
formatCurlSuccess(result, body) that build the return payloads currently
produced in the error branch and success branch (use the existing symbols
compactText, summarizeProbeFailure, summarizeCurlFailure when composing
messages). Replace the inline branches in runCurlProbe with calls to
spawnCurlAndReadBody and then one-line returns from
formatCurlError/formatCurlSuccess so runCurlProbe's cyclomatic complexity drops
below the limit without changing behavior.
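The suggested extraction could look roughly like the sketch below. The helper names follow the review prompt; the real `bin/lib/onboard.js` code also threads in `compactText` and `summarizeProbeFailure`, which are omitted here so the sketch stays self-contained.

```javascript
// Sketch of the response-shaping helpers the review suggests extracting
// from runCurlProbe(). They are pure functions, so they are easy to test
// and pull two branches' worth of complexity out of the main function.

// Shape the payload for a spawn-level failure (curl never ran or crashed).
function formatCurlError(error, body, stderrText) {
  const curlStatus = error.code || error.errno || 1;
  const message = `${error.message || String(error)} ${stderrText}`.trim();
  return { ok: false, httpStatus: 0, curlStatus, body, stderr: message, message };
}

// Shape the payload for a completed curl run; stdout carries the
// "-w %{http_code}" value written by the probe invocation.
function formatCurlSuccess(result, body) {
  const status = Number(String(result.stdout || "").trim());
  return {
    ok: result.status === 0 && status >= 200 && status < 300,
    httpStatus: Number.isFinite(status) ? status : 0,
    curlStatus: result.status || 0,
    body,
    stderr: String(result.stderr || ""),
  };
}
```

With these in place, `runCurlProbe()` reduces to the spawn call plus one-line returns from each helper, which is what drops its complexity below the limit.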

@kjw3 kjw3 merged commit e893e9d into main Mar 31, 2026
9 checks passed
@kjw3 kjw3 deleted the fix/onboard-inference-auth-handling branch March 31, 2026 04:12
laitingsheng pushed a commit that referenced this pull request Apr 2, 2026
## Summary

Smooth out inference configuration during `install.sh` / `nemoclaw
onboard`, especially when provider authorization, credential formatting,
endpoint probing, or final inference application fails.

This PR makes the hosted-provider onboarding path recoverable instead of
brittle:
- normalize and safely handle credential input
- classify validation failures more accurately
- let users re-enter credentials in place
- make final `openshell inference set` failures recoverable
- normalize over-specified custom base URLs
- add lower-level `back` / `exit` navigation so users can move up a
level without restarting the whole install
- clarify recovery prompts with explicit commands (`retry`, `back`,
`exit`)

## What Changed

- refactored provider probe execution to use direct `curl` argv
invocation instead of `bash -c`
- normalized credential values before use/persistence
- added structured auth / transport / model / endpoint failure
classification
- added in-place credential re-entry for hosted providers:
  - NVIDIA Endpoints
  - OpenAI
  - Anthropic
  - Google Gemini
  - custom OpenAI-compatible endpoints
  - custom Anthropic-compatible endpoints
- wrapped final provider/apply failures in interactive recovery instead
of hard abort
- added command-style recovery prompts:
  - `retry`
  - `back`
  - `exit`
- allowed `back` from lower-level inference prompts (model entry, base
URL entry, recovery prompts)
- normalized custom endpoint inputs to the minimum usable base URL
- removed stale `NVIDIA Endpoints (recommended)` wording
- secret prompts now show masked `*` feedback while typing/pasting
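
Two of the bullets above (credential normalization and base-URL trimming) can be sketched as small pure helpers. These are illustrative assumptions, not the shipped implementation; the exact rules in `bin/lib/credentials.js` and `bin/lib/onboard.js` may differ.

```javascript
// Illustrative sketch only: strip pasted line endings (e.g. CRLF from
// Windows terminals) and surrounding whitespace so a credential is a
// single clean token before use/persistence.
function sanitizeCredential(raw) {
  return String(raw ?? "").replace(/[\r\n]+/g, "").trim();
}

// Illustrative sketch only: reduce an over-specified custom endpoint
// (e.g. a full chat-completions URL) to the minimal usable base URL.
function normalizeBaseUrl(input) {
  let url = String(input).trim().replace(/\/+$/, "");
  // Drop a pasted request path such as .../chat/completions or .../completions.
  url = url.replace(/\/(chat\/)?completions$/, "");
  return url.replace(/\/+$/, "");
}
```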

## Validation

```bash
npx vitest run test/credentials.test.js test/onboard-selection.test.js test/onboard.test.js
npx vitest run test/cli.test.js
npx eslint bin/lib/credentials.js bin/lib/onboard.js test/credentials.test.js test/onboard-selection.test.js test/onboard.test.js
npx tsc -p jsconfig.json --noEmit
```

## Issue Mapping

Fully addressed in this PR:
- Fixes #1099
- Fixes #1101
- Fixes #1130

Substantially addressed / partially addressed:
- #987
- improves NVIDIA validation behavior and failure classification so
false/misleading connectivity failures are much less likely, but this PR
is framed as onboarding recovery hardening rather than a WSL-specific
networking fix
- #301
- improves graceful handling when validation/apply fails, especially for
transport/upstream problems, but does not add provider auto-fallback or
a broader cloud-outage fallback strategy
- #446
- improves recovery specifically for the inference-configuration step,
but does not fully solve general resumability across all onboarding
steps

Related implementation direction:
- #890
- this PR aligns with the intent of safer/non-shell probe execution and
clearer validation reporting
- #380
- not implemented here; no automatic provider fallback was added in this
branch

## Notes

- This PR intentionally does not weaken validation or reopen old
onboarding shortcuts.
- Unrelated local `tmp/` noise was left out of the branch.

Signed-off-by: Kevin Jones <kejones@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Interactive onboarding navigation (`back`/`exit`/`quit`) with
credential re-prompting and retry flows.
* Improved probe/validation flow with clearer recovery options and more
robust sandbox build progress messages.
  * Secret input masks with reliable backspace behavior.

* **Bug Fixes**
* Credential sanitization (trim/line-ending normalization) and API key
validation now normalize and retry instead of exiting.
* Better classification and messaging for authorization/validation
failures; retries where appropriate.

* **Tests**
* Expanded tests for credential prompts, masking, retry flows,
validation classification, and onboarding navigation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
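The masked secret input described above can be modeled as a small keystroke reducer. This is a hypothetical sketch, not the shipped prompt, which wires the same idea to a raw-mode TTY stream and also handles pasted chunks.

```javascript
// Hypothetical sketch: track the real secret while echoing "*" per
// character, with backspace (DEL or BS) removing the last character.
function applyKeystroke(state, ch) {
  if (ch === "\x7f" || ch === "\b") {
    return { value: state.value.slice(0, -1) };
  }
  return { value: state.value + ch };
}

// The prompt renders this mask instead of the typed characters.
function maskFor(state) {
  return "*".repeat(state.value.length);
}
```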
lakamsani pushed a commit to lakamsani/NemoClaw that referenced this pull request Apr 4, 2026
(commit message is a verbatim copy of the PR description above)