
fix(k8s): add resource limits to prevent unbounded CPU/memory consumption #1448

Open
BenediktSchackenberg wants to merge 6 commits into NVIDIA:main from BenediktSchackenberg:fix/k8s-resource-limits

Conversation

Contributor

BenediktSchackenberg commented Apr 3, 2026

Summary

Fixes #1447.

The K8s manifest defined resources.requests for both the DinD and workspace containers, but no resources.limits. Without limits, a misbehaving or runaway container can consume all available node resources, causing OOM kills of co-located pods or full node instability.

Changes

Added limits at 2× the requested values to allow reasonable burst while capping worst-case consumption:

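| Container | Memory Request | Memory Limit | CPU Request | CPU Limit |
|-----------|----------------|--------------|-------------|-----------|
| dind      | 8Gi            | 16Gi         | 2           | 4         |
| workspace | 4Gi            | 8Gi          | 2           | 4         |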

The 2× multiplier follows the common Kubernetes practice of setting limits above requests, which avoids unnecessary OOM kills on momentary memory spikes while still providing a hard cap that protects the node.
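
For reference, the values in the table above would look roughly like this in the Pod spec. This is a sketch only: the container names and numbers come from this PR, but the entries are abbreviated (image, command, and volume mounts are omitted) and the exact layout of k8s/nemoclaw-k8s.yaml is an assumption.

```yaml
# Sketch of the resources blocks described above (fragment under the Pod's spec:).
# Only the fields touched by this change are shown; everything else is omitted.
containers:
  - name: dind
    resources:
      requests:
        memory: "8Gi"
        cpu: "2"
      limits:
        memory: "16Gi"   # 2x the memory request
        cpu: "4"         # 2x the CPU request
  - name: workspace
    resources:
      requests:
        memory: "4Gi"
        cpu: "2"
      limits:
        memory: "8Gi"    # 2x the memory request
        cpu: "4"         # 2x the CPU request
```

Because the limits are higher than the requests, the Pod falls into the Burstable QoS class rather than Guaranteed.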

Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>

Summary by CodeRabbit

  • Chores
    • Enforced memory and CPU requests and limits for init and runtime containers to improve stability.
    • Added explicit ephemeral-storage requests and limits to relevant containers to reduce out-of-space failures.
    • Constrained temporary/docker ephemeral storage with a 40Gi size limit to control disk usage and prevent runaway consumption.

…tion

Without limits, the DinD and workspace containers could consume all
available node resources, causing OOM kills of other pods or DoS against
the Kubernetes node.

Added limits at 2x the requested values to allow reasonable burst while
preventing runaway consumption:
- dind: requests 8Gi/2CPU → limits 16Gi/4CPU
- workspace: requests 4Gi/2CPU → limits 8Gi/4CPU

Fixes NVIDIA#1447

Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
Contributor

coderabbitai bot commented Apr 3, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c958f017-521a-4026-b6ed-f0d93cc65a1a

📥 Commits

Reviewing files that changed from the base of the PR and between 78a402f and fa33969.

📒 Files selected for processing (1)
  • k8s/nemoclaw-k8s.yaml

📝 Walkthrough

Updated the Kubernetes Pod spec in k8s/nemoclaw-k8s.yaml to add resource requests and limits (memory, cpu, ephemeral-storage) for dind, workspace, and initContainers.init-docker-config, and set emptyDir.sizeLimit: 40Gi for docker-storage.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| Kubernetes Pod spec (k8s/nemoclaw-k8s.yaml) | Added resource requests and explicit limits (memory, cpu, ephemeral-storage) for the dind and workspace containers; added low resource requests/limits for initContainers.init-docker-config; set sizeLimit: 40Gi on the docker-storage emptyDir volume. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 I hopped through YAML, tidy and bright,
I taught pods to mind memory and byte.
DinD and workspace now snug and right,
EmptyDir capped — no runaway night.
Hop, hop, safe cluster — hold tight!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding resource limits to prevent unbounded CPU/memory consumption in Kubernetes, which aligns with the PR's primary objective. |
| Linked Issues check | ✅ Passed | The PR addresses all coding requirements from issue #1447: it adds resources.limits for the dind and workspace containers (at 2× the requests), constrains ephemeral storage with a sizeLimit on the docker-storage volume, and adds resource bounds to the init-docker-config container. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to the requirements of issue #1447. The addition of ephemeral-storage limits and init-container resources goes beyond the issue description, but these are reasonable preventive measures for resource isolation and do not contradict it. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |



Copilot AI left a comment


Pull request overview

Adds explicit Kubernetes resource limits to the NemoClaw Pod manifest to prevent unbounded CPU/memory consumption on shared nodes, addressing the risk described in #1447.

Changes:

  • Added resources.limits to the dind container (16Gi memory / 4 CPU).
  • Added resources.limits to the workspace container (8Gi memory / 4 CPU).


Contributor

coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@k8s/nemoclaw-k8s.yaml`:
- Around line 33-35: The pod currently sets CPU/memory limits but leaves
ephemeral disk unbounded for the dind container that writes to /var/lib/docker
and uses the emptyDir volume named docker-storage; add ephemeral-storage to the
resources.requests and resources.limits of both containers (e.g., under the same
resource blocks where cpu/memory are defined) and add a sizeLimit on the
docker-storage emptyDir volume to cap node ephemeral usage (adjust the size to
an appropriate value like 10Gi for your workload). Ensure you modify the
resource blocks for the dind container and the other container that has
CPU/memory limits, and add sizeLimit under the docker-storage volume definition.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b925414e-69b8-42d4-861c-d0fc09b8dc7d

📥 Commits

Reviewing files that changed from the base of the PR and between f4a01cf and b53a8a5.

📒 Files selected for processing (1)
  • k8s/nemoclaw-k8s.yaml

@BenediktSchackenberg
Contributor Author

Good catch. Added ephemeral-storage requests/limits to both containers and a sizeLimit on the docker-storage emptyDir volume. The dind container's disk is now capped at 40Gi (2× its 20Gi request, the same multiplier used for the memory limits), and the workspace container's at 8Gi.
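
A rough sketch of what that looks like in the manifest, for anyone following along. The names and values are the ones from this PR; the surrounding fields (cpu/memory entries, images, mounts) are left out, and the exact placement within k8s/nemoclaw-k8s.yaml is assumed rather than copied from the file.

```yaml
# Sketch of the ephemeral-storage bounds (fragment under the Pod's spec:).
# cpu/memory entries in the same resources blocks are omitted for brevity.
containers:
  - name: dind
    resources:
      requests:
        ephemeral-storage: "20Gi"
      limits:
        ephemeral-storage: "40Gi"
  - name: workspace
    resources:
      requests:
        ephemeral-storage: "4Gi"
      limits:
        ephemeral-storage: "8Gi"
volumes:
  - name: docker-storage
    emptyDir:
      sizeLimit: 40Gi   # caps the volume dind uses for /var/lib/docker
```

If usage of the emptyDir exceeds its sizeLimit, the kubelet evicts the Pod, so heavy image builds now fail this Pod instead of filling the node's disk.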

Per CodeRabbit: CPU/memory limits were set but disk remained unbounded.
The dind container writes Docker layers to an emptyDir volume; without
ephemeral-storage limits, heavy image builds can exhaust node disk and
trigger pod eviction.

- dind: ephemeral-storage 20Gi request / 40Gi limit
- workspace: ephemeral-storage 4Gi request / 8Gi limit
- docker-storage emptyDir: sizeLimit 40Gi

Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
Contributor

coderabbitai bot left a comment


🧹 Nitpick comments (1)
k8s/nemoclaw-k8s.yaml (1)

112-117: Consider adding minimal resource constraints to the init container.

The init-docker-config init container lacks resource requests/limits. While it performs a trivial operation (writing a small JSON file), adding minimal constraints follows defense-in-depth principles.

♻️ Suggested addition
   initContainers:
     # Configure Docker daemon for cgroup v2
     - name: init-docker-config
       image: busybox
       command: ["sh", "-c", "echo '{\"default-cgroupns-mode\":\"host\"}' > /etc/docker/daemon.json"]
       volumeMounts:
         - name: docker-config
           mountPath: /etc/docker
+      resources:
+        requests:
+          memory: "16Mi"
+          cpu: "50m"
+        limits:
+          memory: "32Mi"
+          cpu: "100m"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/nemoclaw-k8s.yaml` around lines 112 - 117, The init container
init-docker-config (image busybox, command writing /etc/docker/daemon.json)
should include minimal resource requests and limits to follow defense-in-depth;
add a resources block with small cpu and memory requests (e.g., 10m CPU, 16Mi
memory) and corresponding limits (e.g., 50m CPU, 64Mi memory) under the
init-docker-config container spec so the pod scheduler and kubelet can enforce
predictable resource usage.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c5078cfc-1578-4962-a83c-cbb1f71cd7dc

📥 Commits

Reviewing files that changed from the base of the PR and between b53a8a5 and 78a402f.

📒 Files selected for processing (1)
  • k8s/nemoclaw-k8s.yaml

The init container performs a trivial write (single JSON file) but
had no resource constraints. Added minimal bounds following
defense-in-depth principles:
- requests: 32Mi memory / 100m CPU
- limits: 64Mi memory / 200m CPU

Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
@BenediktSchackenberg
Contributor Author

Added minimal resource limits to the init-docker-config init container (32Mi/100m requests, 64Mi/200m limits). The init container just writes a single JSON file so the bounds are intentionally small.
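
Sketched out, the init container entry would look something like the following. The name, image, and resource values are taken from this thread; the command and volumeMounts shown in the review suggestion are omitted here, and the final field order in the file is an assumption.

```yaml
# Sketch of the init container bounds (fragment under the Pod's spec:).
# command and volumeMounts are unchanged by this commit and omitted here.
initContainers:
  - name: init-docker-config
    image: busybox
    resources:
      requests:
        memory: "32Mi"
        cpu: "100m"
      limits:
        memory: "64Mi"   # 2x the request, same ratio as the main containers
        cpu: "200m"
```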

wscurran added the CI/CD, K8s, and fix labels Apr 4, 2026

Labels

CI/CD: Use this label to identify issues with NemoClaw CI/CD pipeline or GitHub Actions.
fix
K8s: Use this label to identify Kubernetes deployment issues with NemoClaw.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

K8s Manifest Missing Resource Limits — Pod Can Consume Unbounded CPU/Memory - IssueFinder - SN 23

3 participants