Support NVIDIADriver reconciliation without a ClusterPolicy #2572

Draft
karthikvetrivel wants to merge 4 commits into kv-gpuclusterconfig-crd from kv-nvidiadriver-standalone

@karthikvetrivel karthikvetrivel commented Jun 23, 2026

Member

Stacked on #2571 (kv-gpuclusterconfig-crd). This PR's diff is just the four commits below; GitHub will retarget it onto main once #2571 lands.

Summary

Builds on the GPUClusterConfig controller so the NVIDIADriver operand can reconcile on its own, without a ClusterPolicy:

  • Resolves the driver host root from the info catalog rather than requiring the ClusterPolicy CR.
  • Reconciles driver state when no ClusterPolicy is present.
  • Drives GPU node labeling from the GPUClusterConfig controller.
  • Keeps re-reconciling a disabled GPUClusterConfig so that it picks up a later ClusterPolicy removal.
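
The last bullet could be pictured roughly as follows. This is a minimal sketch, not the code in this PR: `Result` is a hypothetical local stand-in for a reconcile result, `reconcileGPUClusterConfig` is an illustrative name, and the 30-second interval is an assumption (the description does not state the actual requeue interval).

```go
package main

import (
	"fmt"
	"time"
)

// Result is a hypothetical stand-in for a controller reconcile result,
// defined locally so the sketch has no external dependencies.
type Result struct {
	RequeueAfter time.Duration
}

// disabledRequeueInterval is an assumed value chosen for illustration only.
const disabledRequeueInterval = 30 * time.Second

// reconcileGPUClusterConfig reports whether the GPUClusterConfig should act.
// While a ClusterPolicy exists it stays disabled but asks to be requeued,
// so a later ClusterPolicy removal is noticed without any other trigger.
func reconcileGPUClusterConfig(clusterPolicyExists bool) (active bool, res Result) {
	if clusterPolicyExists {
		return false, Result{RequeueAfter: disabledRequeueInterval}
	}
	return true, Result{}
}

func main() {
	active, res := reconcileGPUClusterConfig(true)
	fmt.Println("ClusterPolicy present:", active, res.RequeueAfter)

	active, res = reconcileGPUClusterConfig(false)
	fmt.Println("ClusterPolicy removed:", active, res.RequeueAfter)
}
```

The point of the periodic requeue in this sketch is that the disabled object does not depend on a watch event for the ClusterPolicy deletion; whether the real controller uses a timed requeue, a watch, or both is not stated here.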

Testing

Validated live on a single-node Tesla T4 cluster (Kubernetes 1.34, containerd 2.2), plus unit coverage for host-root sourcing.

  • I created a GPUClusterConfig on a cluster with no ClusterPolicy present and verified the node got the relevant nvidia.com/* labels, the NVIDIADriver-owned driver DaemonSet came up, and the DRA kubelet-plugin published its DeviceClasses and ResourceSlice. I then ran a pod with a DRA claim and verified nvidia-smi worked against the operator-installed driver.
  • With a ClusterPolicy present, I verified the NVIDIADriver stays disabled. I deleted the ClusterPolicy and verified the NVIDIADriver recovered to ready on its own within about a minute.
  • With neither a ClusterPolicy nor a GPUClusterConfig in the cluster, I verified the NVIDIADriver goes notReady with "no ClusterPolicy or GPUClusterConfig object found in the cluster". I created a GPUClusterConfig and verified the NVIDIADriver recovered.
  • I confirmed a GPUClusterConfig that's disabled by an existing ClusterPolicy keeps re-reconciling while disabled, so it notices and takes over once that ClusterPolicy is later removed.
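
For reviewers, the NVIDIADriver outcomes observed in the scenarios above can be summarized as a small decision function. This is a sketch of the observed behavior only, under the assumption that the presence of a ClusterPolicy takes precedence; `DriverState`, the state constants, and `expectedDriverState` are illustrative names and not identifiers from the operator. Only the notReady message is quoted from the test results.

```go
package main

import "fmt"

// DriverState is an illustrative type for the NVIDIADriver outcomes
// described in the testing notes; the names are not taken from the operator.
type DriverState string

const (
	StateReady    DriverState = "ready"
	StateDisabled DriverState = "disabled"
	StateNotReady DriverState = "notReady"
)

// noConfigMessage is the message quoted in the testing notes.
const noConfigMessage = "no ClusterPolicy or GPUClusterConfig object found in the cluster"

// expectedDriverState maps which config objects exist to the NVIDIADriver
// state observed during testing, plus a message when one was reported.
func expectedDriverState(hasClusterPolicy, hasGPUClusterConfig bool) (DriverState, string) {
	switch {
	case hasClusterPolicy:
		// Observed: the NVIDIADriver stays disabled while a ClusterPolicy is present.
		return StateDisabled, ""
	case hasGPUClusterConfig:
		// Observed: the NVIDIADriver reconciles and becomes ready on its own.
		return StateReady, ""
	default:
		// Observed: neither object exists, so the NVIDIADriver goes notReady.
		return StateNotReady, noConfigMessage
	}
}

func main() {
	for _, c := range []struct{ cp, gcc bool }{
		{true, true}, {false, true}, {false, false},
	} {
		state, msg := expectedDriverState(c.cp, c.gcc)
		fmt.Printf("ClusterPolicy=%v GPUClusterConfig=%v -> %s %q\n", c.cp, c.gcc, state, msg)
	}
}
```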

@karthikvetrivel karthikvetrivel marked this pull request as draft June 23, 2026 15:54
@karthikvetrivel karthikvetrivel force-pushed the kv-nvidiadriver-standalone branch 2 times, most recently from e852c29 to 8ffcfc3 Compare June 24, 2026 15:28
Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
@karthikvetrivel karthikvetrivel force-pushed the kv-nvidiadriver-standalone branch from 8ffcfc3 to 479562c Compare June 24, 2026 19:05