
ci: add CI-Ready GPU Brev launchable for E2E tests #1456

Open

ksapru wants to merge 1 commit into NVIDIA:main from ksapru:feat/gpu-ci-launchable

Conversation

Contributor

@ksapru ksapru commented Apr 3, 2026

Summary

This PR introduces a CI-Ready GPU Brev launchable (scripts/brev-launchable-ci-gpu.sh) that mirrors the architecture of the CPU launchable, so our GPU E2E tests no longer pay a repeated, expensive setup cost. Pre-baking Docker, Node.js, the OpenShell CLI, and the NVIDIA Container Toolkit, and pre-pulling the Ollama models (qwen3:0.6b / qwen2.5:0.5b), eliminates 5-10 minutes of bootstrapping from each CI run. The PR also mitigates CI flakiness by raising default Vitest timeouts on heavy integration tests from 5s to 30s so they run reliably on VM instances.

Related Issue

Fixes #1328

Changes

  • New GPU Startup Script: Added scripts/brev-launchable-ci-gpu.sh with robust GPU passthrough validation, graceful apt-lock handling, and reliable defaults for asynchronous Ollama polling.
  • GPU Test Suite Enablement: Updated test/e2e/brev-e2e.test.js to run the GPU E2E suite conditionally when TEST_SUITE === "gpu".
  • Workflow Pipeline: Modified .github/workflows/nightly-e2e.yaml to integrate the GPU-specific launch pipeline path.
  • Test Timeout Fixes: Increased Vitest timeouts for test/install-preflight.test.js, test/nemoclaw-cli-recovery.test.js, and test/onboard.test.js from 5,000ms to 30,000ms to eliminate spurious timeout failures during heavy local and remote runs.

Type of Change

  • Code change for a new feature, bug fix, or refactor.
  • Code change with doc updates.
  • Doc only. Prose changes without code sample modifications.
  • Doc only. Includes code sample changes.

Testing

  • npx prek run --all-files passes (or equivalently make check).
  • npm test passes.
  • make docs builds without warnings. (for doc-only changes)

Checklist

General

Code Changes

  • Formatters applied — npx prek run --all-files auto-fixes formatting (or make format for targeted runs).
  • Tests added or updated for new or changed behavior.
  • No secrets, API keys, or credentials committed.
  • Doc pages updated for any user-facing behavior changes.

Doc Changes

  • Follows the style guide.
  • New pages include SPDX license header and frontmatter, if creating a new page.
  • Cross-references and links verified.

Signed-off-by: Krish Sapru ksapru@bu.edu

Summary by CodeRabbit

  • Tests

    • Updated GPU E2E testing infrastructure to use temporary instances instead of persistent runners
    • Added GPU test suite selection option
  • Chores

    • Added automation for GPU environment provisioning and setup
    • Enhanced CI/CD pipeline for GPU testing

Contributor

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough

GPU E2E testing infrastructure transitions from persistent self-hosted runners to ephemeral cloud-provisioned instances via Brev CLI, including a comprehensive startup script for VM provisioning (Docker, NVIDIA toolkit, Ollama) and corresponding workflow orchestration with readiness polling and remote test execution.

Changes

Cohort / File(s) — Summary

  • Workflow GPU Integration — .github/workflows/nightly-e2e.yaml
    Replaces the persistent self-hosted GPU runner with an ephemeral ubuntu-latest runner that provisions a temporary GPU instance via the Brev CLI; includes a readiness polling loop, remote test execution via brev exec, and instance teardown with artifact recovery.
  • GPU VM Provisioning — scripts/brev-launchable-ci-gpu.sh
    New startup script that provisions a CI-ready GPU VM: installs system dependencies (curl, git, jq, tar), Docker, Node.js 22, and the OpenShell CLI; clones the NemoClaw repository; pre-pulls Docker images; installs the NVIDIA Container Toolkit; installs Ollama with model pre-pulling; validates services; and writes a readiness sentinel file. Includes retry logic, lock handling, and environment variable gates.
  • E2E Test Suite — test/e2e/brev-e2e.test.js
    Extends the TEST_SUITE environment variable documentation and adds a conditional Vitest test case for GPU E2E execution when TEST_SUITE === "gpu", delegating to runRemoteTest("test/e2e/test-gpu-e2e.sh") with a 900-second timeout.

Sequence Diagram(s)

sequenceDiagram
    participant GHA as GitHub Actions<br/>(nightly-e2e)
    participant Brev as Brev CLI
    participant GPU as GPU Instance<br/>(Ephemeral)
    participant Script as Startup Script<br/>(brev-launchable-ci-gpu.sh)
    participant Tests as GPU E2E Test

    GHA->>GHA: Install Brev CLI
    GHA->>Brev: Create GPU instance<br/>(ubuntu-latest base)
    Brev->>GPU: Provision with startup script
    GPU->>Script: Execute on boot
    Script->>Script: Install dependencies<br/>(Docker, Node, NVIDIA toolkit)
    Script->>Script: Pull Docker images
    Script->>Script: Install & pre-pull Ollama
    Script->>Script: Write readiness sentinel<br/>(/var/run/nemoclaw-launchable-ready)
    GHA->>GPU: Poll for readiness sentinel<br/>(bounded retry loop)
    GPU-->>GHA: Ready signal
    GHA->>Brev: Execute via brev exec<br/>(with env vars)
    Brev->>GPU: Run test/e2e/test-gpu-e2e.sh
    GPU->>Tests: Execute GPU E2E suite
    Tests-->>GPU: Output result (PASS/FAIL)
    GPU-->>GHA: Test output
    GHA->>GHA: Recover logs via brev scp
    GHA->>Brev: Delete GPU instance<br/>(always teardown)
    GHA->>GHA: Upload artifacts

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


🐰 Ephemeral clouds now dance on high,
GPU instances bloom and quickly die—
Brev brings them forth with startup grace,
While Ollama hums in cyberspace.
Tests execute remote and fleet,
Making our CI pipeline sweet! 🚀✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Out of Scope Changes check — ⚠️ Warning. Changes to test timeout values (install-preflight.test.js, nemoclaw-cli-recovery.test.js, onboard.test.js) are not mentioned in the linked issue #1328 objectives, falling outside the GPU launchable scope. Resolution: remove the timeout increases from unrelated test files, or create a separate PR for the timeout adjustments, keeping this PR focused on the GPU launchable implementation.
  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

  • Description check — ✅ Passed. Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title check — ✅ Passed. The title clearly and concisely summarizes the main change: adding a CI-Ready GPU Brev launchable for E2E tests, which matches the primary objective and scope of the pull request.
  • Linked Issues check — ✅ Passed. The PR delivers all three coding-related objectives from #1328: the GPU startup script (scripts/brev-launchable-ci-gpu.sh), GPU E2E test integration (test/e2e/brev-e2e.test.js), and the nightly-e2e.yaml workflow updates.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/nightly-e2e.yaml (1)

224-248: ⚠️ Potential issue | 🟠 Major

Delete the VM after failure logs are copied.

On failures, this always() step runs before both brev scp steps. That leaves nothing to copy, so the artifact uploads are empty right when diagnostics are needed most. Move teardown below the log-copy steps.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-e2e.yaml around lines 224 - 248, The teardown step
"Tear down GPU instance" currently runs before the failure log copy steps,
causing logs to be deleted; move the "Tear down GPU instance" step (the block
that runs `brev delete e2e-gpu-nightly-${{ github.run_id }}` with if: always())
to after the "Copy install log on failure", "Upload install log on failure", and
"Copy test log on failure" steps so that the `brev scp` and upload actions can
run first and collect artifacts, keeping the teardown step's if: always()
behavior but ensuring it executes last.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7119745b-c573-4f21-9c3e-b69f28d691f5

📥 Commits

Reviewing files that changed from the base of the PR and between f4a01cf and e6994ae.

📒 Files selected for processing (3)
  • .github/workflows/nightly-e2e.yaml
  • scripts/brev-launchable-ci-gpu.sh
  • test/e2e/brev-e2e.test.js

Comment on lines +181 to +185
      - name: Install Brev CLI
        run: |
          curl -fsSL -o /tmp/brev.tar.gz "https://github.com/brevdev/brev-cli/releases/download/v0.6.322/brev-cli_0.6.322_linux_amd64.tar.gz"
          tar -xzf /tmp/brev.tar.gz -C /usr/local/bin brev
          chmod +x /usr/local/bin/brev

⚠️ Potential issue | 🟠 Major

Install brev into a writable path.

GitHub-hosted steps run as the runner user, so extracting straight into /usr/local/bin is usually permission-denied. Use sudo for the install or unpack into $HOME/.local/bin and export that path before the next step.

Possible fix
       - name: Install Brev CLI
         run: |
           curl -fsSL -o /tmp/brev.tar.gz "https://github.com/brevdev/brev-cli/releases/download/v0.6.322/brev-cli_0.6.322_linux_amd64.tar.gz"
-          tar -xzf /tmp/brev.tar.gz -C /usr/local/bin brev
-          chmod +x /usr/local/bin/brev
+          mkdir -p "$HOME/.local/bin"
+          tar -xzf /tmp/brev.tar.gz -C "$HOME/.local/bin" brev
+          chmod +x "$HOME/.local/bin/brev"
+          echo "$HOME/.local/bin" >> "$GITHUB_PATH"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-e2e.yaml around lines 181 - 185, The "Install Brev
CLI" step currently extracts the archive directly to /usr/local/bin (and sets
chmod) which can fail due to permissions; change the step to either use sudo
when writing to /usr/local/bin or unpack into a writable user directory (e.g.,
$HOME/.local/bin) and ensure that directory is created and added to PATH before
subsequent steps; update the step that references brev so it uses the updated
install location (references: the "Install Brev CLI" step and the target path
/usr/local/bin or $HOME/.local/bin and the brev binary name).

Comment on lines +193 to +195
          brev create --name "$INSTANCE_NAME" \
            --flavor "t4" \
            --startup-script "https://raw.githubusercontent.com/NVIDIA/NemoClaw/${{ github.sha }}/scripts/brev-launchable-ci-gpu.sh"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check if the script file exists
find . -name "brev-launchable-ci-gpu.sh" -type f

Repository: NVIDIA/NemoClaw

Length of output: 94


🏁 Script executed:

# Check the workflow file content around lines 193-195
if [ -f ".github/workflows/nightly-e2e.yaml" ]; then
  sed -n '190,200p' ".github/workflows/nightly-e2e.yaml"
else
  echo "File not found"
fi

Repository: NVIDIA/NemoClaw

Length of output: 675


🏁 Script executed:

# Search for Brev usage patterns in the repository
rg "brev\s+create" -A 3 -B 3

Repository: NVIDIA/NemoClaw

Length of output: 7496


🏁 Script executed:

# Search for any Brev configuration or documentation
find . -type f -name "*.md" -o -name "*.txt" -o -name "*.json" | xargs grep -l "brev" 2>/dev/null | head -10

Repository: NVIDIA/NemoClaw

Length of output: 261


Use @filepath instead of URL for --startup-script.

The Brev CLI (v0.6.322+) does not accept URLs for --startup-script—only inline content or @filepath format. The test file (test/e2e/brev-e2e.test.js) explicitly documents this: "brev create (v0.6.322+) accepts --startup-script as a string or @filepath — not a URL." The current command will fail to run the GPU bootstrap script. Use @scripts/brev-launchable-ci-gpu.sh instead.

Suggested fix
           brev create --name "$INSTANCE_NAME" \
             --flavor "t4" \
-            --startup-script "https://raw.githubusercontent.com/NVIDIA/NemoClaw/${{ github.sha }}/scripts/brev-launchable-ci-gpu.sh"
+            --startup-script "@scripts/brev-launchable-ci-gpu.sh"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-e2e.yaml around lines 193 - 195, The
--startup-script argument passed to the brev create command currently uses a
remote URL which Brev CLI no longer accepts; update the command that builds the
instance (the brev create invocation with --name "$INSTANCE_NAME" and --flavor
"t4") to use the `@filepath` form by replacing the URL value with
`@scripts/brev-launchable-ci-gpu.sh` so the CLI reads the local script file
content instead of a URL.

Comment on lines +243 to +271
if command -v nvidia-smi >/dev/null 2>&1; then
  if ! dpkg -s nvidia-container-toolkit >/dev/null 2>&1; then
    info "Installing NVIDIA Container Toolkit..."
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
      | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
      | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
      | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list >/dev/null
    wait_for_apt_lock
    retry 3 10 "apt-get update" sudo apt-get update -qq >/dev/null 2>&1
    retry 3 10 "install nvidia-container-toolkit" sudo apt-get install -y -qq nvidia-container-toolkit >/dev/null 2>&1
    sudo nvidia-ctk runtime configure --runtime=docker >/dev/null 2>&1
    sudo systemctl restart docker
    info "NVIDIA Container Toolkit installed"
  else
    info "NVIDIA Container Toolkit already installed"
  fi

  info "Validating GPU passthrough..."
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi || warn "GPU detected but not functioning"
    docker run --rm --gpus all nvidia/cuda:12.2.0-base nvidia-smi \
      || warn "Docker GPU passthrough failed"
  else
    warn "No GPU detected"
  fi
else
  warn "nvidia-smi not found, assuming a CPU-only instance or failed passthrough"
fi

⚠️ Potential issue | 🟠 Major

Do not mark a broken GPU runtime as “ready”.

This is the GPU launchable, but missing nvidia-smi or a failed docker run --gpus all … only logs warnings and still reaches the sentinel. That turns a misconfigured VM into a ready one and pushes the real failure into the E2E phase.

Possible fix
   info "Validating GPU passthrough..."
   if command -v nvidia-smi >/dev/null 2>&1; then
-    nvidia-smi || warn "GPU detected but not functioning"
+    nvidia-smi || fail "GPU detected but not functioning"
     docker run --rm --gpus all nvidia/cuda:12.2.0-base nvidia-smi \
-      || warn "Docker GPU passthrough failed"
+      || fail "Docker GPU passthrough failed"
   else
-    warn "No GPU detected"
+    fail "No GPU detected on the GPU launchable"
   fi
 else
-  warn "nvidia-smi not found, assuming a CPU-only instance or failed passthrough"
+  fail "nvidia-smi not found on the GPU launchable"
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/brev-launchable-ci-gpu.sh` around lines 243 - 271, The GPU validation
currently only logs warnings when nvidia-smi is missing or the docker GPU test
fails, allowing the startup to continue; change the validation in the block that
runs nvidia-smi and the docker run to treat failures as fatal: when nvidia-smi
is not found or returns non-zero, or when docker run --gpus all
nvidia/cuda:12.2.0-base nvidia-smi fails, call an error handler (e.g., use error
or fatal logging) and exit non-zero (avoid creating the ready sentinel) instead
of warn; update the code around the nvidia-smi checks and the docker run command
to exit with non-zero status on failure so a broken GPU runtime is not marked
ready.

Comment on lines +295 to +300
  ollama serve >/dev/null 2>&1 &
  OLLAMA_PID=$!
  info "Waiting for Ollama to start..."
  until curl -s http://localhost:11434 >/dev/null; do
    sleep 1
  done

⚠️ Potential issue | 🟠 Major

Add a timeout to the Ollama startup poll.

until curl -s http://localhost:11434 >/dev/null; do sleep 1; done can block forever if ollama serve never comes up, so the launchable hangs until the outer workflow timeout instead of failing with a clear cause.

Possible fix
   OLLAMA_PID=$!
   info "Waiting for Ollama to start..."
-  until curl -s http://localhost:11434 >/dev/null; do
-    sleep 1
-  done
+  for _ in {1..60}; do
+    if curl -s http://localhost:11434 >/dev/null; then
+      break
+    fi
+    sleep 1
+  done
+  curl -s http://localhost:11434 >/dev/null || fail "Ollama did not become ready in time"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/brev-launchable-ci-gpu.sh` around lines 295 - 300, The until loop
polling Ollama can hang indefinitely; add a bounded timeout (e.g.,
MAX_WAIT_SECS) and check elapsed time or a counter inside the loop that waits
for http://localhost:11434, and if the timeout is reached, kill the background
process (use OLLAMA_PID), log a clear error, and exit non‑zero so the CI fails
fast; update the block that starts "ollama serve >/dev/null 2>&1 &" and the
subsequent loop that uses curl to implement this timeout and cleanup.

Comment on lines +648 to +656
  it.runIf(TEST_SUITE === "gpu")(
    "GPU E2E suite passes on remote VM",
    () => {
      const output = runRemoteTest("test/e2e/test-gpu-e2e.sh");
      expect(output).toContain("PASS");
      expect(output).not.toMatch(/FAIL:/);
    },
    900_000,
  );

⚠️ Potential issue | 🟠 Major

TEST_SUITE="gpu" still boots a CPU environment.

beforeAll always uses the CPU bootstrap (scripts/brev-launchable-ci-cpu.sh or brev search cpu) and later asserts gpuEnabled: false, so this branch never exercises the GPU path it claims to cover. Either switch the setup flow when TEST_SUITE === "gpu" or keep GPU coverage only in the dedicated workflow until this suite can request a GPU launchable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/brev-e2e.test.js` around lines 648 - 656, The test's GPU branch is
never exercising GPU because beforeAll always uses the CPU bootstrap
(scripts/brev-launchable-ci-cpu.sh / brev search cpu) and later asserts
gpuEnabled: false; update the setup in the beforeAll (or the bootstrap helper
invoked there) to conditionally request a GPU when TEST_SUITE === "gpu" — e.g.,
switch to the GPU launch script or a brev search that returns a GPU launchable
and ensure the resulting runtime's gpuEnabled flag is true (references:
TEST_SUITE, beforeAll, scripts/brev-launchable-ci-cpu.sh, brev search cpu,
gpuEnabled) so the "GPU E2E suite passes on remote VM" test actually runs on a
GPU environment.



Development

Successfully merging this pull request may close these issues.

ci: create Flavor 2 Brev launchable — CI-Ready GPU

1 participant