Add upgradePolicy to NVIDIADriver CRD#2582

Open
cdesiniotis wants to merge 1 commit into
NVIDIA:mainfrom
cdesiniotis:add-upgrade-policy-in-nvd

@cdesiniotis cdesiniotis commented Jun 25, 2026

Contributor

The upgradePolicy controls how the driver-upgrade controller upgrades GPU driver daemonsets. Adding this field to the NVIDIADriver CRD allows users to define different upgrade policies for different NVIDIADriver CRs. If the field is nil or empty, the driver-upgrade controller falls back to a default upgradePolicy defined in code, which aligns with the defaults in our Helm values.
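The nil-or-empty fallback described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `UpgradePolicy`, `NVIDIADriverSpec`, `defaultUpgradePolicy`, and `effectiveUpgradePolicy` are hypothetical names, the struct carries only two illustrative fields, and the default values are assumptions rather than the actual Helm defaults.

```go
package main

import "fmt"

// UpgradePolicy is a simplified, hypothetical stand-in for the upgrade
// policy type. The real field set is defined by the PR, not by this sketch.
type UpgradePolicy struct {
	AutoUpgrade         bool
	MaxParallelUpgrades int
}

// NVIDIADriverSpec is a hypothetical, trimmed-down spec that holds only the
// optional upgrade policy.
type NVIDIADriverSpec struct {
	UpgradePolicy *UpgradePolicy
}

// defaultUpgradePolicy returns illustrative defaults. The values are
// assumptions made for this sketch.
func defaultUpgradePolicy() UpgradePolicy {
	return UpgradePolicy{AutoUpgrade: true, MaxParallelUpgrades: 1}
}

// effectiveUpgradePolicy returns the policy from the CR when it is set and
// non-empty, and the built-in default otherwise.
func effectiveUpgradePolicy(spec NVIDIADriverSpec) UpgradePolicy {
	if spec.UpgradePolicy == nil || *spec.UpgradePolicy == (UpgradePolicy{}) {
		return defaultUpgradePolicy()
	}
	return *spec.UpgradePolicy
}

func main() {
	// A CR without an upgradePolicy gets the built-in default.
	fmt.Printf("%+v\n", effectiveUpgradePolicy(NVIDIADriverSpec{}))

	// A CR with its own upgradePolicy keeps it.
	custom := NVIDIADriverSpec{
		UpgradePolicy: &UpgradePolicy{AutoUpgrade: true, MaxParallelUpgrades: 4},
	}
	fmt.Printf("%+v\n", effectiveUpgradePolicy(custom))
}
```

Treating an all-zero struct the same as nil is one way to read "nil or empty"; the PR may define emptiness differently.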

Code changes in this PR were drafted with the assistance of Claude Code.

Testing

I tested the following scenarios on a k8s cluster with two GPUs:

  1. Install GPU Operator with default options. Verify the entire stack comes up.
  2. Upgrade the ClusterPolicy-managed driver daemonset. Verify the upgrade controller upgrades driver pods in a rolling fashion based on the configured upgrade policy.
  3. Migrate to an NVIDIADriver-managed driver daemonset by a) setting driver.useNvidiaDriverCRD=true in ClusterPolicy; b) creating a default NVIDIADriver CR. Verify the ClusterPolicy-managed pods get orphaned and then deleted by the upgrade controller in a rolling fashion.
  4. Upgrade the default NVIDIADriver-managed driver daemonset by changing spec.version. Verify upgrade controller upgrades pods in a rolling fashion.
  5. Repeat step 4 with a different upgradePolicy configured in the NVIDIADriver CR. Verify upgrade controller upgrades pods according to the new upgrade policy.
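For readers unfamiliar with what "rolling fashion" means in steps 2 through 5, here is a rough sketch of how a parallelism cap could bound each round of upgrades. It is not the upgrade controller's actual logic: `nextBatchSize` and its parameters are hypothetical, and treating a cap of 0 as "no limit" is an assumption made for this example.

```go
package main

import "fmt"

// nextBatchSize returns how many additional nodes may start upgrading now.
//   pending     - nodes still waiting to be upgraded
//   inProgress  - nodes currently being upgraded
//   maxParallel - hypothetical cap on concurrent upgrades; 0 means no limit
//                 (an assumption of this sketch)
func nextBatchSize(pending, inProgress, maxParallel int) int {
	if pending <= 0 {
		return 0
	}
	if maxParallel <= 0 {
		return pending
	}
	free := maxParallel - inProgress
	if free <= 0 {
		return 0
	}
	if free > pending {
		return pending
	}
	return free
}

func main() {
	// Two nodes with a cap of one concurrent upgrade, similar in spirit to
	// the two-GPU test cluster described above.
	pending, inProgress := 2, 0
	for round := 1; pending > 0 || inProgress > 0; round++ {
		started := nextBatchSize(pending, inProgress, 1)
		pending -= started
		inProgress += started
		fmt.Printf("round %d: started %d, upgrading %d, waiting %d\n",
			round, started, inProgress, pending)
		// Assume the whole batch finishes before the next round.
		inProgress = 0
	}
}
```

With a different cap configured per NVIDIADriver CR (step 5), only the `maxParallel` argument would change in this model.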

@cdesiniotis cdesiniotis force-pushed the add-upgrade-policy-in-nvd branch from c145169 to 10fce04 Compare June 25, 2026 21:09
@cdesiniotis cdesiniotis marked this pull request as ready for review June 25, 2026 21:47
Comment thread api/nvidia/v1alpha1/nvidiadriver_types.go Outdated
Comment thread controllers/upgrade_controller.go Outdated
Comment thread controllers/upgrade_controller.go Outdated
Comment thread controllers/upgrade_controller.go Outdated
Comment thread controllers/upgrade_controller.go Outdated
Comment thread api/nvidia/v1alpha1/nvidiadriver_types.go
Comment thread config/samples/nvidia_v1alpha1_nvidiadriver.yaml Outdated
Comment thread controllers/upgrade_controller.go
Comment thread api/nvidia/v1alpha1/nvidiadriver_types.go
@cdesiniotis cdesiniotis self-assigned this Jun 26, 2026
@cdesiniotis cdesiniotis added this to the v26.7 milestone Jun 26, 2026
@cdesiniotis cdesiniotis force-pushed the add-upgrade-policy-in-nvd branch 2 times, most recently from 43f03c5 to 2080190 Compare June 26, 2026 23:27
Comment thread api/nvidia/v1alpha1/nvidiadriver_types.go Outdated
@cdesiniotis cdesiniotis force-pushed the add-upgrade-policy-in-nvd branch 2 times, most recently from fa6df1e to ad29e8b Compare June 26, 2026 23:52
Comment thread api/nvidia/v1alpha1/nvidiadriver_types.go Outdated
Comment thread api/nvidia/v1alpha1/nvidiadriver_types.go Outdated
Comment thread api/nvidia/v1alpha1/nvidiadriver_types.go Outdated
Comment thread api/nvidia/v1alpha1/nvidiadriver_types.go Outdated
Comment thread api/nvidia/v1alpha1/nvidiadriver_types.go Outdated
@cdesiniotis cdesiniotis force-pushed the add-upgrade-policy-in-nvd branch from ad29e8b to 8d2a1ea Compare June 27, 2026 00:42
Comment thread api/nvidia/v1alpha1/nvidiadriver_types.go Outdated
The upgradePolicy influences how the driver-upgrade controller
upgrades GPU driver daemonsets. Adding this field to the
NVIDIADriver CRD allows users to define different upgrade
policies for different NVIDIADriver CRs. If nil or empty,
the driver-upgrade controller will fall back to using a
default upgradePolicy defined in the code which aligns
with the defaults in our Helm values.

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
@cdesiniotis cdesiniotis force-pushed the add-upgrade-policy-in-nvd branch from 8d2a1ea to c4fe7d4 Compare June 27, 2026 01:05
rahulait commented Jun 27, 2026

Contributor

LGTM. I am wondering about backward compatibility. Existing user-defined NVIDIADriver CRs won't have an upgrade configuration, and previously they defaulted to the config in ClusterPolicy. If a custom upgrade config was set in ClusterPolicy and a user-specified NVIDIADriver CR was relying on it, then after the user upgrades to the newer version, that NVIDIADriver CR will start using the default config we specify. Will we document this as a breaking change, or as something users of NVIDIADriver CRs should be aware of when they jump to v26.7.0?
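To make the concern concrete, the sketch below contrasts the two resolution orders as described in this thread. It is only a model of the discussion: `Policy`, `builtinDefault`, `resolveBefore`, and `resolveAfter` are hypothetical names, and the numeric values are illustrative rather than actual operator defaults.

```go
package main

import "fmt"

// Policy is a hypothetical, one-field stand-in for an upgrade policy.
type Policy struct {
	MaxParallelUpgrades int
}

// builtinDefault is an illustrative value, not the operator's real default.
var builtinDefault = Policy{MaxParallelUpgrades: 1}

// resolveBefore models the earlier behavior described in the comment:
// an NVIDIADriver CR picks up the upgrade config from ClusterPolicy.
func resolveBefore(clusterPolicy *Policy) Policy {
	if clusterPolicy != nil {
		return *clusterPolicy
	}
	return builtinDefault
}

// resolveAfter models the behavior described in this PR: the CR's own
// upgradePolicy is used when set, otherwise the built-in default.
func resolveAfter(crPolicy *Policy) Policy {
	if crPolicy != nil {
		return *crPolicy
	}
	return builtinDefault
}

func main() {
	// A cluster that customized the policy in ClusterPolicy only.
	customInClusterPolicy := &Policy{MaxParallelUpgrades: 4}

	before := resolveBefore(customInClusterPolicy)
	after := resolveAfter(nil) // existing CR has no upgradePolicy of its own

	fmt.Println("before upgrade:", before.MaxParallelUpgrades)
	fmt.Println("after upgrade: ", after.MaxParallelUpgrades)
}
```

In this model the effective setting changes silently for an existing CR unless the user copies the custom policy into the NVIDIADriver CR, which is the scenario the question asks to have documented.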
