What
Request a supported way for the driver upgrade flow to free every consumer holding the GPU kernel module open. Two cases the current flow cannot handle:
- Pods that use the GPU via runtimeClassName: nvidia with no nvidia.com/gpu resource request (exclusive mode, no time-slicing / MPS).
- DaemonSet GPU consumers (for us, a standalone dcgm-exporter).
Both keep the module refcount above zero, so an in-place driver unload/upgrade stalls. This is the underlying cause behind the stuck-driver symptom in #2549 (the init container "fails to unload the driver while workloads are active").
Why the current flow doesn't cover these
There are two eviction paths, and neither can free these consumers:
- Node drain (k8s-operator-libs pkg/upgrade/drain_manager.go): PodSelector is configurable and defaults in gpu-operator (controllers/upgrade_controller.go) to nvidia.com/gpu-driver-upgrade-drain.skip!=true. But IgnoreAllDaemonSets: true is hardcoded, so DaemonSet consumers are never evicted by drain.
- Targeted pod deletion (k8s-operator-libs pkg/upgrade/pod_manager.go): uses an injected PodDeletionFilter. gpu-operator's filter keys on the nvidia.com/gpu resource, so pods without that request are never selected. IgnoreAllDaemonSets: true is hardcoded here too.
Net: a DaemonSet GPU consumer cannot be freed by any path, and a non-resource GPU pod is only reachable via a broad node drain - heavier than needed, and in our testing it did not free our runtime-direct Deployments either.
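To make the gap concrete, here is a minimal sketch of the selection logic described above. It is not the gpu-operator or k8s-operator-libs code: the `Pod` struct is a simplified stand-in for a Kubernetes pod, and `requestsGPU` and `reachableByDeletion` are hypothetical helpers that only model "filter keys on the nvidia.com/gpu resource" and "DaemonSets are ignored".

```go
package main

import "fmt"

// Pod is a simplified stand-in for a Kubernetes pod (not corev1.Pod).
// It carries only the fields this sketch needs.
type Pod struct {
	Name             string
	RuntimeClassName string
	OwnerKind        string // e.g. "ReplicaSet" or "DaemonSet"
	GPURequest       int64  // requested nvidia.com/gpu count, 0 if none
}

// requestsGPU models a deletion filter keyed on the nvidia.com/gpu resource.
func requestsGPU(p Pod) bool {
	return p.GPURequest > 0
}

// reachableByDeletion models the targeted pod deletion path:
// DaemonSet pods are always ignored, the rest must pass the resource filter.
func reachableByDeletion(p Pod) bool {
	if p.OwnerKind == "DaemonSet" {
		return false
	}
	return requestsGPU(p)
}

func main() {
	pods := []Pod{
		{Name: "trainer", RuntimeClassName: "nvidia", OwnerKind: "ReplicaSet", GPURequest: 1},
		{Name: "runtime-direct", RuntimeClassName: "nvidia", OwnerKind: "ReplicaSet"},
		{Name: "dcgm-exporter", RuntimeClassName: "nvidia", OwnerKind: "DaemonSet"},
	}
	for _, p := range pods {
		fmt.Printf("%-15s reachable by targeted deletion: %v\n", p.Name, reachableByDeletion(p))
	}
}
```

Only the pod with an explicit nvidia.com/gpu request is selected; the runtime-direct pod and the DaemonSet pod both hold the module open but are never picked.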
Proposal
- Allow declaring additional GPU consumers to evict during an upgrade by label/selector, surfaced through driver.upgradePolicy, so they do not have to request nvidia.com/gpu. This complements the existing nvidia.com/gpu-driver-upgrade-drain.skip opt-out with an opt-in.
- Provide handling for declared DaemonSet GPU consumers during the upgrade window - e.g. an opt-in to cordon/park them (or make the DaemonSet-ignore configurable for a named set), restored on upgrade-done. The operator already quiesces its own operands; this extends that to declared third-party consumers.
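As a rough illustration of the opt-in, the sketch below extends a resource-keyed filter with a declared label selector. Everything here is an assumption for discussion: the `Pod` struct is a simplified stand-in, `matchesAll` and `shouldEvict` are hypothetical helpers, and the `example.com/gpu-consumer` label is invented. How the selector would actually be expressed under driver.upgradePolicy is left open.

```go
package main

import "fmt"

// Pod is a simplified stand-in for a Kubernetes pod (not corev1.Pod).
type Pod struct {
	Name       string
	Labels     map[string]string
	GPURequest int64 // requested nvidia.com/gpu count, 0 if none
}

// matchesAll reports whether every key/value pair in selector is present in labels.
// An empty selector matches nothing, so the opt-in stays off unless configured.
func matchesAll(labels, selector map[string]string) bool {
	if len(selector) == 0 {
		return false
	}
	for k, want := range selector {
		got, ok := labels[k]
		if !ok || got != want {
			return false
		}
	}
	return true
}

// shouldEvict keeps the resource-keyed behaviour and adds the declared opt-in.
func shouldEvict(p Pod, declared map[string]string) bool {
	return p.GPURequest > 0 || matchesAll(p.Labels, declared)
}

func main() {
	// Hypothetical selector an admin might declare for extra GPU consumers.
	declared := map[string]string{"example.com/gpu-consumer": "true"}

	pods := []Pod{
		{Name: "trainer", GPURequest: 1},
		{Name: "runtime-direct", Labels: map[string]string{"example.com/gpu-consumer": "true"}},
		{Name: "web", Labels: map[string]string{"app": "web"}},
	}
	for _, p := range pods {
		fmt.Printf("%-15s evict: %v\n", p.Name, shouldEvict(p, declared))
	}
}
```

With no selector declared, the behaviour is unchanged from the resource-keyed filter, which keeps the feature strictly opt-in.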
Where the change lives
- Core: NVIDIA/k8s-operator-libs pkg/upgrade (drain/pod managers; the hardcoded IgnoreAllDaemonSets: true).
- Surface: NVIDIA/gpu-operator driver.upgradePolicy API and the injected PodDeletionFilter in controllers/upgrade_controller.go.
Workaround today
We run a small sidecar controller that watches nvidia.com/gpu-driver-upgrade-state and, on any transition away from upgrade-done, deletes the runtime-direct pods and parks the DaemonSet consumer (non-matching nodeSelector), then restores them on upgrade-done. With those consumers freed, the operator unloads, rebuilds and reloads the driver in place with no node reboot (boot_id unchanged across the upgrade). Happy to share it.
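The decision logic of that controller is roughly the following. This is a simplified sketch, not the controller itself: `Action` and `actionFor` are illustrative names, the actual pod deletion and DaemonSet parking are omitted, and the non-done state `upgrade-required` is used only as an example value of the label.

```go
package main

import "fmt"

// upgradeDone is the label value on which consumers are restored.
const upgradeDone = "upgrade-done"

// Action is what the sidecar does in response to a label change.
type Action int

const (
	NoOp Action = iota
	FreeConsumers    // delete runtime-direct pods, park the DaemonSet consumer
	RestoreConsumers // undo the parking so consumers are scheduled again
)

func (a Action) String() string {
	switch a {
	case FreeConsumers:
		return "free"
	case RestoreConsumers:
		return "restore"
	default:
		return "noop"
	}
}

// actionFor maps a change of the nvidia.com/gpu-driver-upgrade-state label to an action.
// prev is the last observed value ("" if none was seen yet).
func actionFor(prev, next string) Action {
	switch {
	case prev == next:
		return NoOp
	case next == upgradeDone:
		return RestoreConsumers
	case prev == upgradeDone:
		return FreeConsumers
	default:
		// Moving between two non-done states: consumers are assumed already freed.
		return NoOp
	}
}

func main() {
	fmt.Println(actionFor(upgradeDone, "upgrade-required"))
	fmt.Println(actionFor("upgrade-required", upgradeDone))
	fmt.Println(actionFor(upgradeDone, upgradeDone))
}
```

One caveat of this shape: if the controller starts while an upgrade is already in progress, it has no previous `upgrade-done` observation and does nothing until the next transition, so a real implementation would need to reconcile on startup.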
Related: #2549.