Skip to content

Latest commit

 

History

History
128 lines (93 loc) · 3.76 KB

File metadata and controls

128 lines (93 loc) · 3.76 KB

NemoClaw on DGX Spark

WIP — This page is actively being updated as we work through Spark installs. Expect changes.

Prerequisites

  • Docker (pre-installed, v28.x)
  • Node.js 22 (installed by the install.sh)
  • OpenShell CLI (installed via the Quick Start steps below)
  • NVIDIA API Key from build.nvidia.com — prompted on first run

Quick Start

# Install OpenShell:
curl -LsSf https://raw.githubusercontent.com/NVIDIA/OpenShell/main/install.sh | sh

# Clone NemoClaw:
git clone https://github.com/NVIDIA/NemoClaw.git
cd NemoClaw

# Spark-specific setup
sudo ./scripts/setup-spark.sh

# Install NemoClaw using the NemoClaw/install.sh:
./install.sh

# Alternatively, you can use the hosted install script:
curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash

What's Different on Spark

DGX Spark ships Ubuntu 24.04 + Docker 28.x but no k8s/k3s. OpenShell embeds k3s inside a Docker container, which hits two problems on Spark:

1. Docker permissions

Error in the hyper legacy client: client error (Connect)
  Permission denied (os error 13)

Cause: Your user isn't in the docker group. Fix: setup-spark runs usermod -aG docker $USER. You may need to log out and back in (or newgrp docker) for it to take effect.

2. cgroup v2 incompatibility

K8s namespace not ready
openat2 /sys/fs/cgroup/kubepods/pids.max: no
Failed to start ContainerManager: failed to initialize top level QOS containers

Cause: Spark runs cgroup v2 (Ubuntu 24.04 default). OpenShell's gateway container starts k3s, which tries to create cgroup v1-style paths that don't exist. The fix is --cgroupns=host on the container, but OpenShell doesn't expose that flag.

Fix: setup-spark sets "default-cgroupns-mode": "host" in /etc/docker/daemon.json and restarts Docker. This makes all containers use the host cgroup namespace, which is what k3s needs.

Manual Setup (if setup-spark doesn't work)

Fix Docker cgroup namespace

# Check if you're on cgroup v2
stat -fc %T /sys/fs/cgroup/
# Expected: cgroup2fs

# Add cgroupns=host to Docker daemon config
sudo python3 -c "
import json, os
path = '/etc/docker/daemon.json'
d = json.load(open(path)) if os.path.exists(path) else {}
d['default-cgroupns-mode'] = 'host'
json.dump(d, open(path, 'w'), indent=2)
"

# Restart Docker
sudo systemctl restart docker

Fix Docker permissions

sudo usermod -aG docker $USER
newgrp docker  # or log out and back in

Then run the onboard wizard

nemoclaw onboard

Known Issues

Issue Status Workaround
cgroup v2 kills k3s in Docker Fixed in setup-spark daemon.json cgroupns=host
Docker permission denied Fixed in setup-spark usermod -aG docker
CoreDNS CrashLoop after setup Fixed in fix-coredns.sh Uses container gateway IP, not 127.0.0.11
Image pull failure (k3s can't find built image) OpenShell bug openshell gateway destroy && openshell gateway start, re-run setup
GPU passthrough Untested on Spark Should work with --gpu flag if NVIDIA Container Toolkit is configured

Verifying Your Install

# Check sandbox is running
openshell sandbox list
# Should show: nemoclaw  Ready

# Test the agent
openshell sandbox connect nemoclaw
# Inside sandbox:
nemoclaw-start openclaw agent --agent main --local -m 'hello' --session-id test

# Monitor network egress (separate terminal)
openshell term

Architecture Notes

DGX Spark (Ubuntu 24.04, cgroup v2)
  └── Docker (28.x, cgroupns=host)
       └── OpenShell gateway container
            └── k3s (embedded)
                 └── nemoclaw sandbox pod
                      └── OpenClaw agent + NemoClaw plugin