[quality] pkg/k8s GPU health subsystem has zero test coverage — 5 pure helpers untested #19555

Description

@clubanderson

Finding

The GPU health and discovery subsystem in pkg/k8s contains ~32KB of untested code across 4 files:

| File | Size | Description |
|------|------|-------------|
| client_gpu_discovery.go | 10,864 B | Multi-vendor GPU node discovery (NVIDIA, AMD, Intel, Google TPU, IBM AIU) |
| client_gpu_health.go | 9,210 B | GPU node health monitoring with 7 health checks |
| client_gpu_types.go | 6,718 B | Type definitions for GPU health structures |
| client_gpu_nvidia.go | 5,564 B | NVIDIA operator status inspection |

Pure helper functions with zero coverage:

  1. checkOperatorPod() (client_gpu_health.go) — inspects pod status for GPU operator DaemonSet pods. Checks for CrashLoopBackOff, non-Running state, and pod-not-found scenarios.
  2. isStuckPod() (client_gpu_health.go) — determines if a pod is stuck via 3 conditions: ContainerStatusUnknown, terminating >5min, pending >10min.
  3. deriveGPUNodeStatus() (client_gpu_health.go) — derives overall health (healthy/degraded/unhealthy) from check results with critical vs non-critical classification.
  4. unstructuredNestedMap() (client_gpu_nvidia.go) — traverses nested unstructured K8s objects.
  5. unstructuredNestedSlice() (client_gpu_nvidia.go) — traverses nested unstructured K8s objects for slices.
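
To make the "critical vs non-critical classification" in item 3 concrete, here is a minimal sketch of one way such a derivation could work. It is not the code in client_gpu_health.go: the `checkResult` type, its fields, the `deriveStatusSketch` name, and the rule "any failed critical check means unhealthy, any failed non-critical check means degraded" are all assumptions made for illustration.

```go
package main

// healthStatus holds the three outcomes named in the issue.
type healthStatus string

const (
	statusHealthy   healthStatus = "healthy"
	statusDegraded  healthStatus = "degraded"
	statusUnhealthy healthStatus = "unhealthy"
)

// checkResult is a hypothetical shape for one health check outcome.
// The real type in client_gpu_types.go may look different.
type checkResult struct {
	Name     string
	Passed   bool
	Critical bool
}

// deriveStatusSketch assumes this rule: a failed critical check makes the
// node unhealthy, a failed non-critical check makes it degraded, and a node
// with no failed checks is healthy.
func deriveStatusSketch(checks []checkResult) healthStatus {
	status := statusHealthy
	for _, c := range checks {
		if c.Passed {
			continue
		}
		if c.Critical {
			return statusUnhealthy // a critical failure outranks everything else
		}
		status = statusDegraded
	}
	return status
}
```

Because a function of this shape depends only on its input slice, each of the three outcomes can be reached in a test by constructing a few `checkResult` values by hand.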

Why this matters:

  1. The GPU health endpoint drives the console UI's GPU dashboard — bugs in deriveGPUNodeStatus or isStuckPod silently misclassify node health
  2. checkOperatorPod handles edge cases (CrashLoopBackOff, missing pods) that determine whether operators see alerts
  3. unstructuredNestedMap/Slice parse arbitrary CRD structures — nil panics here crash the handler
  4. These are all pure functions that can be unit-tested with constructed inputs (no K8s client mocks needed)
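
The nil-panic risk in item 3 can be illustrated with a small sketch of panic-free traversal. The functions `nestedMap` and `nestedSlice` below are hypothetical stand-ins, not the actual unstructuredNestedMap/unstructuredNestedSlice helpers, whose signatures are not shown in this issue. The sketch relies on two Go guarantees: reading a missing key from a map (even a nil map) yields the zero value, and a comma-ok type assertion never panics.

```go
package main

// nestedMap walks obj along fields and returns the map found at that path.
// It reports false instead of panicking when obj is nil, a key is missing,
// or an intermediate value is not a map. Hypothetical stand-in only.
func nestedMap(obj map[string]any, fields ...string) (map[string]any, bool) {
	cur := obj
	for _, f := range fields {
		if cur == nil {
			return nil, false
		}
		next, ok := cur[f].(map[string]any) // comma-ok: no panic on wrong type
		if !ok {
			return nil, false
		}
		cur = next
	}
	return cur, cur != nil
}

// nestedSlice resolves the parent map with nestedMap, then asserts that the
// final field holds a slice. Hypothetical stand-in only.
func nestedSlice(obj map[string]any, fields ...string) ([]any, bool) {
	if len(fields) == 0 {
		return nil, false
	}
	parent, ok := nestedMap(obj, fields[:len(fields)-1]...)
	if !ok {
		return nil, false
	}
	s, ok := parent[fields[len(fields)-1]].([]any)
	return s, ok
}
```

Tests for the real helpers would want to pin down the same three situations: a nil object, a missing key, and a value of the wrong type partway along the path.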

Recommendation

  1. Add table-driven tests for isStuckPod covering all 3 stuck conditions + happy path
  2. Add tests for deriveGPUNodeStatus covering healthy/degraded/unhealthy transitions
  3. Add tests for checkOperatorPod covering Running, CrashLoopBackOff, Pending, not-found
  4. Add tests for unstructuredNestedMap/unstructuredNestedSlice covering nil, missing keys, valid paths

Priority

  • Impact: high (GPU dashboard correctness, silent health misclassification)
  • Effort: low (all pure functions, no mocks needed)

Filed by quality agent (ACMM L4/L6 — full mode)

Labels: help wanted, quality, testing
