Finding
The GPU health and discovery subsystem in pkg/k8s contains ~32KB of untested code across 4 files:
| File | Size | Description |
|---|---|---|
| client_gpu_discovery.go | 10,864 B | Multi-vendor GPU node discovery (NVIDIA, AMD, Intel, Google TPU, IBM AIU) |
| client_gpu_health.go | 9,210 B | GPU node health monitoring with 7 health checks |
| client_gpu_types.go | 6,718 B | Type definitions for GPU health structures |
| client_gpu_nvidia.go | 5,564 B | NVIDIA operator status inspection |
Pure helper functions with zero coverage:
checkOperatorPod() (client_gpu_health.go) — inspects pod status for GPU operator DaemonSet pods. Checks for CrashLoopBackOff, non-Running state, and pod-not-found scenarios.
isStuckPod() (client_gpu_health.go) — determines if a pod is stuck via 3 conditions: ContainerStatusUnknown, terminating >5min, pending >10min.
deriveGPUNodeStatus() (client_gpu_health.go) — derives overall health (healthy/degraded/unhealthy) from check results with critical vs non-critical classification.
unstructuredNestedMap() (client_gpu_nvidia.go) — traverses nested unstructured K8s objects.
unstructuredNestedSlice() (client_gpu_nvidia.go) — traverses nested unstructured K8s objects for slices.
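To make the critical vs non-critical classification concrete, here is a minimal sketch of how such a derivation could work. It is not the actual deriveGPUNodeStatus() implementation: the checkResult type, the deriveStatusSketch name, and the rule itself (any failed critical check yields unhealthy, any failed non-critical check yields degraded) are assumptions made for illustration.

```go
package main

// checkResult is a hypothetical stand-in for one health check outcome.
// The real types live in client_gpu_types.go and may look different.
type checkResult struct {
	Name     string
	Passed   bool
	Critical bool
}

// deriveStatusSketch applies an assumed rule:
//   - any failed critical check     -> "unhealthy"
//   - any failed non-critical check -> "degraded"
//   - otherwise                     -> "healthy"
func deriveStatusSketch(checks []checkResult) string {
	status := "healthy"
	for _, c := range checks {
		if c.Passed {
			continue
		}
		if c.Critical {
			// A critical failure dominates everything else.
			return "unhealthy"
		}
		status = "degraded"
	}
	return status
}
```

Because the function only reads its arguments, each of the three outcomes can be exercised with a hand-built slice and no cluster access.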
Why this matters:
- The GPU health endpoint drives the console UI's GPU dashboard, so bugs in deriveGPUNodeStatus or isStuckPod silently misclassify node health
- checkOperatorPod handles edge cases (CrashLoopBackOff, missing pods) that determine whether operators see alerts
- unstructuredNestedMap/Slice parse arbitrary CRD structures; a nil panic here crashes the handler
- These are all pure functions that can be unit-tested with constructed inputs (no K8s client mocks needed)
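The panic risk in nested traversal usually comes from single-value type assertions, which panic in Go when the value is missing or has a different type. The sketch below shows the defensive pattern a test would want to pin down. It is an illustration only: nestedMapSketch is a hypothetical name, and the real unstructuredNestedMap() may have a different signature and behavior.

```go
package main

// nestedMapSketch walks obj along path and returns the map found at the end.
// ok is false when obj is nil, a key is missing, or a value on the path is
// not a map. The comma-ok type assertion is what prevents a panic.
func nestedMapSketch(obj map[string]interface{}, path ...string) (map[string]interface{}, bool) {
	cur := obj
	for _, key := range path {
		if cur == nil {
			return nil, false
		}
		// A single-value assertion, cur[key].(map[string]interface{}),
		// would panic here for a missing key or a non-map value.
		next, ok := cur[key].(map[string]interface{})
		if !ok {
			return nil, false
		}
		cur = next
	}
	if cur == nil {
		return nil, false
	}
	return cur, true
}
```

The inputs worth constructing by hand are a nil root object, a missing key, and a value of the wrong type partway down the path.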
Recommendation
- Add table-driven tests for isStuckPod covering all 3 stuck conditions + happy path
- Add tests for deriveGPUNodeStatus covering healthy/degraded/unhealthy transitions
- Add tests for checkOperatorPod covering Running, CrashLoopBackOff, Pending, not-found
- Add tests for unstructuredNestedMap/unstructuredNestedSlice covering nil, missing keys, valid paths
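As a starting point, a table-driven test for the stuck-pod conditions might look like the sketch below. To stay self-contained it tests a stand-in: podSnapshot, isStuckPodSketch, and the injected now parameter are hypothetical, and only the three conditions and their 5 and 10 minute thresholds come from the finding above. The real test would build Kubernetes pod objects and call isStuckPod() directly.

```go
package main

import (
	"testing"
	"time"
)

// podSnapshot is a hypothetical, simplified stand-in for the pod fields
// that a stuck-pod check would read.
type podSnapshot struct {
	Phase             string     // e.g. "Pending", "Running"
	ContainerReason   string     // container state reason, if any
	CreatedAt         time.Time  // when the pod was created
	DeletionRequested *time.Time // non-nil once the pod is terminating
}

const (
	terminatingLimit = 5 * time.Minute
	pendingLimit     = 10 * time.Minute
)

// isStuckPodSketch mirrors the three conditions listed in the finding.
// Passing now as a parameter keeps the function deterministic under test.
func isStuckPodSketch(p podSnapshot, now time.Time) bool {
	if p.ContainerReason == "ContainerStatusUnknown" {
		return true
	}
	if p.DeletionRequested != nil && now.Sub(*p.DeletionRequested) > terminatingLimit {
		return true
	}
	return p.Phase == "Pending" && now.Sub(p.CreatedAt) > pendingLimit
}

// fixedNow is an arbitrary reference time so the cases never depend on the clock.
var fixedNow = time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)

func ago(d time.Duration) time.Time     { return fixedNow.Add(-d) }
func agoPtr(d time.Duration) *time.Time { t := ago(d); return &t }

var stuckPodCases = []struct {
	name string
	pod  podSnapshot
	want bool
}{
	{name: "running pod is not stuck", pod: podSnapshot{Phase: "Running", CreatedAt: ago(time.Hour)}, want: false},
	{name: "ContainerStatusUnknown", pod: podSnapshot{Phase: "Running", ContainerReason: "ContainerStatusUnknown", CreatedAt: ago(time.Minute)}, want: true},
	{name: "terminating for 6 minutes", pod: podSnapshot{Phase: "Running", CreatedAt: ago(time.Hour), DeletionRequested: agoPtr(6 * time.Minute)}, want: true},
	{name: "terminating for 1 minute", pod: podSnapshot{Phase: "Running", CreatedAt: ago(time.Hour), DeletionRequested: agoPtr(time.Minute)}, want: false},
	{name: "pending for 11 minutes", pod: podSnapshot{Phase: "Pending", CreatedAt: ago(11 * time.Minute)}, want: true},
	{name: "pending for 2 minutes", pod: podSnapshot{Phase: "Pending", CreatedAt: ago(2 * time.Minute)}, want: false},
}

func TestIsStuckPodSketch(t *testing.T) {
	for _, tc := range stuckPodCases {
		t.Run(tc.name, func(t *testing.T) {
			if got := isStuckPodSketch(tc.pod, fixedNow); got != tc.want {
				t.Errorf("isStuckPodSketch() = %v, want %v", got, tc.want)
			}
		})
	}
}
```

In the repository this would live in a `_test.go` file with the package clause of pkg/k8s. Each threshold gets one case just above and one well below it, so a change to either limit is caught. If the real isStuckPod() reads the clock internally, the cases would need timestamps relative to time.Now() instead of a fixed reference time.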
Priority
- Impact: high (GPU dashboard correctness, silent health misclassification)
- Effort: low (all pure functions, no mocks needed)
Filed by quality agent (ACMM L4/L6 — full mode)