fix(k8s): add resource limits to prevent unbounded CPU/memory consumption#1448
BenediktSchackenberg wants to merge 6 commits into NVIDIA:main
Conversation
Without limits, the DinD and workspace containers could consume all available node resources, causing OOM kills of other pods or a DoS against the Kubernetes node. Added limits at 2x the requested values to allow reasonable burst while preventing runaway consumption:

- dind: requests 8Gi/2CPU → limits 16Gi/4CPU
- workspace: requests 4Gi/2CPU → limits 8Gi/4CPU

Fixes NVIDIA#1447

Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
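For reference, the resulting `resources` blocks would look roughly like this (a sketch based on the values in the commit message; container names match the manifest, but other fields such as images and volume mounts are omitted):

```yaml
containers:
  - name: dind
    resources:
      requests:
        memory: "8Gi"
        cpu: "2"
      limits:
        memory: "16Gi"
        cpu: "4"
  - name: workspace
    resources:
      requests:
        memory: "4Gi"
        cpu: "2"
      limits:
        memory: "8Gi"
        cpu: "4"
```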
Pull request overview
Adds explicit Kubernetes resource limits to the NemoClaw Pod manifest to prevent unbounded CPU/memory consumption on shared nodes, addressing the risk described in #1447.
Changes:
- Added `resources.limits` to the `dind` container (16Gi memory / 4 CPU).
- Added `resources.limits` to the `workspace` container (8Gi memory / 4 CPU).
Actionable comments posted: 1
Inline comments:
In `@k8s/nemoclaw-k8s.yaml`:
- Around line 33-35: The pod currently sets CPU/memory limits but leaves
ephemeral disk unbounded for the dind container that writes to /var/lib/docker
and uses the emptyDir volume named docker-storage; add ephemeral-storage to the
resources.requests and resources.limits of both containers (e.g., under the same
resource blocks where cpu/memory are defined) and add a sizeLimit on the
docker-storage emptyDir volume to cap node ephemeral usage (adjust the size to
an appropriate value like 10Gi for your workload). Ensure you modify the
resource blocks for the dind container and the other container that has
CPU/memory limits, and add sizeLimit under the docker-storage volume definition.
Good catch — added `ephemeral-storage` limits.
Per CodeRabbit: CPU/memory limits were set but disk remained unbounded. The dind container writes Docker layers to an emptyDir volume; without ephemeral-storage limits, heavy image builds can exhaust node disk and trigger pod eviction.

- dind: ephemeral-storage 20Gi request / 40Gi limit
- workspace: ephemeral-storage 4Gi request / 8Gi limit
- docker-storage emptyDir: sizeLimit 40Gi

Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
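In manifest form, the committed values sketch out as follows (the field names are standard Kubernetes; the surrounding container spec is abbreviated, and existing cpu/memory entries are omitted for brevity):

```yaml
containers:
  - name: dind
    resources:
      requests:
        ephemeral-storage: "20Gi"
      limits:
        ephemeral-storage: "40Gi"
  - name: workspace
    resources:
      requests:
        ephemeral-storage: "4Gi"
      limits:
        ephemeral-storage: "8Gi"
volumes:
  - name: docker-storage
    emptyDir:
      sizeLimit: 40Gi
```

With a `sizeLimit` on the emptyDir, the kubelet evicts the pod if the volume exceeds the cap, so the node's disk is protected even if the per-container ephemeral-storage accounting misses writes to the shared volume.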
🧹 Nitpick comments (1)
k8s/nemoclaw-k8s.yaml (1)
112-117: Consider adding minimal resource constraints to the init container.

The `init-docker-config` init container lacks resource requests/limits. While it performs a trivial operation (writing a small JSON file), adding minimal constraints follows defense-in-depth principles.

♻️ Suggested addition:

```diff
 initContainers:
   # Configure Docker daemon for cgroup v2
   - name: init-docker-config
     image: busybox
     command: ["sh", "-c", "echo '{\"default-cgroupns-mode\":\"host\"}' > /etc/docker/daemon.json"]
     volumeMounts:
       - name: docker-config
         mountPath: /etc/docker
+    resources:
+      requests:
+        memory: "16Mi"
+        cpu: "50m"
+      limits:
+        memory: "32Mi"
+        cpu: "100m"
```
The init container performs a trivial write (a single JSON file) but had no resource constraints. Added minimal bounds following defense-in-depth principles:

- requests: 32Mi memory / 100m CPU
- limits: 64Mi memory / 200m CPU

Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
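Applied to the manifest, the committed bounds would look roughly like this (a sketch; the init container's image, command, and volume mounts are unchanged and omitted here):

```yaml
initContainers:
  - name: init-docker-config
    resources:
      requests:
        memory: "32Mi"
        cpu: "100m"
      limits:
        memory: "64Mi"
        cpu: "200m"
```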
Added minimal resource limits to the `init-docker-config` init container.
Summary
Fixes #1447.
The K8s manifest defined `resources.requests` for both the DinD and workspace containers, but no `resources.limits`. Without limits, a misbehaving or runaway container can consume all available node resources, causing OOM kills of co-located pods or full node instability.

Changes

Added `limits` at 2× the requested values to allow reasonable burst while capping worst-case consumption:

- dind: requests 8Gi/2CPU → limits 16Gi/4CPU
- workspace: requests 4Gi/2CPU → limits 8Gi/4CPU

The 2× multiplier follows the Kubernetes best practice of setting limits above requests to avoid unnecessary OOMKills on momentary spikes, while still providing a hard cap that protects the node.
Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>