feat: add bare-metal GPU clusterGroup for NVIDIA CoCo support #76

Closed
butler54 wants to merge 7 commits into validatedpatterns:main from butler54:feat/baremetal-gpu-support

Conversation

@butler54
Collaborator

Summary

  • Adds values-baremetal-gpu.yaml clusterGroup extending bare-metal deployment with NVIDIA GPU operator support (Tech Preview, OSC 1.12)
  • Adds charts/all/nvidia-gpu/ with ClusterPolicy CR (sandbox workloads, VFIO manager, CC manager, sandbox device plugin) and IOMMU MachineConfig for GPU passthrough
  • Adds charts/coco-supported/gpu-workload/ with CUDA vectoradd sample running in a confidential VM (kata-cc-nvidia-gpu RuntimeClass)
  • Supports mixed clusters with both AMD SEV-SNP and Intel TDX nodes
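
For orientation, the ClusterPolicy settings named in the second bullet might look roughly like the sketch below. This is not the chart's actual template: the field names follow the NVIDIA GPU Operator ClusterPolicy CRD, but the resource name `gpu-cluster-policy` and the `defaultWorkload` value are assumptions, the `defaultMode` value is inferred from the test plan, and a real ClusterPolicy carries further fields that are omitted here.

```yaml
# Hypothetical sketch, not the contents of charts/all/nvidia-gpu/.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy            # assumed name
spec:
  sandboxWorkloads:
    enabled: true                     # allow GPUs to be used by sandboxed (VM-based) workloads
    defaultWorkload: vm-passthrough   # assumed default workload type
  vfioManager:
    enabled: true                     # binds GPUs to the vfio-pci driver
  ccManager:
    enabled: true
    defaultMode: "on"                 # GPU confidential computing mode (inferred from the test plan)
  sandboxDevicePlugin:
    enabled: true                     # advertises passthrough GPUs to the kubelet
```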

Test plan

  • Deploy on bare-metal cluster with NVIDIA H100/H200 GPU and AMD SEV-SNP or Intel TDX
  • Verify GPU operator installs and ClusterPolicy reconciles
  • Verify VFIO manager binds GPUs to vfio-pci driver
  • Verify CC manager sets GPU confidential computing mode to "on"
  • Verify gpu-cc-verify pod completes CUDA vectoradd successfully
  • Verify existing hello-openshift and kbs-access workloads still function
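
A verification pod along the lines of gpu-cc-verify could be sketched as follows. Only the pod name and the kata-cc-nvidia-gpu RuntimeClass come from this PR; the container name, the image and tag, and the GPU resource name are assumptions and would need to match what the sandbox device plugin advertises on the cluster.

```yaml
# Hypothetical sketch of a verification pod; not the chart's actual manifest.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-cc-verify
spec:
  runtimeClassName: kata-cc-nvidia-gpu   # RuntimeClass named in the summary
  restartPolicy: Never                   # run once so the pod phase reflects the sample's exit code
  containers:
    - name: cuda-vectoradd               # assumed container name
      # Assumed image and tag for the CUDA vectoradd sample.
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/pgpu: "1"           # assumed resource name; depends on device plugin configuration
```

With `restartPolicy: Never`, a `Succeeded` pod phase would indicate that the sample exited cleanly inside the confidential VM.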

🤖 Generated with Claude Code

butler54 and others added 7 commits March 10, 2026 11:22
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace git branch references (repoURL/targetRevision/path) with
released Helm chart references (chart/chartVersion) for trustee,
sandboxed-containers, and sandboxed-policies in values-baremetal.yaml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
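
The change described in this commit can be illustrated with a before/after values fragment. This is a sketch only: the key names `repoURL`, `targetRevision`, `path`, `chart`, and `chartVersion` come from the commit message, while the application keys, namespace, repository URL, and version pattern are placeholders rather than the values used in values-baremetal.yaml.

```yaml
# Hypothetical values fragment; URL, path, namespace, and version are placeholders.
clusterGroup:
  applications:
    # Before: application sourced from a git branch
    trustee-from-git:
      name: trustee
      namespace: trustee-operator-system
      project: trustee
      repoURL: https://github.com/example-org/trustee-chart.git
      targetRevision: main
      path: .
    # After: application sourced from a released Helm chart
    trustee-from-chart:
      name: trustee
      namespace: trustee-operator-system
      project: trustee
      chart: trustee
      chartVersion: 0.1.*
```

Pinning to a released chart version means the deployment no longer tracks whatever happens to be on a development branch.
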
Add tdx.enabled flag (default true) to baremetal chart to conditionally
set kvm_intel.tdx=1 kernel argument. Without this, the kvm_intel module
does not activate TDX and NFD cannot detect it.

Enable intel-dcap application in values-baremetal.yaml for PCCS/QGS
attestation services.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
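
A conditional kernel argument of this kind is typically expressed as a Helm-templated MachineConfig. The sketch below assumes that shape: the `tdx.enabled` flag and the `kvm_intel.tdx=1` argument come from the commit message, while the MachineConfig name and the worker role label are hypothetical.

```yaml
# Hypothetical Helm template; not the baremetal chart's actual file.
{{- if .Values.tdx.enabled }}
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-kvm-intel-tdx          # assumed name
  labels:
    machineconfiguration.openshift.io/role: worker   # assumed target pool
spec:
  kernelArguments:
    - kvm_intel.tdx=1                    # activates TDX in the kvm_intel module so NFD can detect it
{{- end }}
```

Setting `tdx.enabled: false` in the chart values would then omit the MachineConfig entirely, which may suit clusters that only have AMD SEV-SNP nodes.
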
…mplates

Address PR review feedback:
- Remove detect-runtime-class.yaml (OSC operator manages RuntimeClass)
- Remove bm-kernel-params.yaml and kernel-params-mco.yaml (config should
  be provided via initdata or pod annotations to avoid inconsistencies)
- Remove commented-out runtimeclass templates for AMD SNP and Intel TDX

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chris Butler <chris.butler@redhat.com>
…ner support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@butler54
Collaborator Author

Closing to rebase.

@butler54 butler54 closed this Apr 24, 2026
