[Feature]: Driver upgrades cannot free GPU consumers that lack an nvidia.com/gpu request or run as DaemonSets #2570

Description

@ctrlsam

What

Request a supported way for the driver upgrade flow to free every consumer holding the GPU kernel module open. Two cases the current flow cannot handle:

  1. Pods that use the GPU via runtimeClassName: nvidia with no nvidia.com/gpu resource request (exclusive mode, no time-slicing / MPS).
  2. DaemonSet GPU consumers (for us, a standalone dcgm-exporter).

Both keep the module refcount above zero, so an in-place driver unload/upgrade stalls. This is the underlying cause behind the stuck-driver symptom in #2549 (the init container "fails to unload the driver while workloads are active").
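To confirm on a node that something still holds the module, the refcount can be read from sysfs. The following is a minimal sketch, assuming the standard Linux `/sys/module/<name>/refcnt` file and the module name `nvidia`; the helper names `parseRefcnt` and `moduleRefcnt` are hypothetical and not part of the operator.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseRefcnt converts the raw contents of a refcnt file ("3\n") to an int.
func parseRefcnt(raw string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(raw))
}

// moduleRefcnt reads /sys/module/<module>/refcnt.
// Assumption: the module name is "nvidia"; related modules such as
// nvidia_uvm keep their own counters and may need checking too.
func moduleRefcnt(module string) (int, error) {
	data, err := os.ReadFile("/sys/module/" + module + "/refcnt")
	if err != nil {
		return 0, err
	}
	return parseRefcnt(string(data))
}

func main() {
	n, err := moduleRefcnt("nvidia")
	if err != nil {
		fmt.Println("could not read refcount:", err)
		return
	}
	// A non-zero value means at least one user still holds the module.
	fmt.Printf("nvidia refcnt=%d, in use=%v\n", n, n > 0)
}
```

A non-zero value after the operator has evicted its selected pods would point at one of the two consumer types described above.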

Why the current flow doesn't cover these

There are two eviction paths, and neither can free these consumers:

  • Node drain (k8s-operator-libs pkg/upgrade/drain_manager.go): PodSelector is configurable, and gpu-operator (controllers/upgrade_controller.go) defaults it to nvidia.com/gpu-driver-upgrade-drain.skip!=true. But IgnoreAllDaemonSets: true is hardcoded, so DaemonSet consumers are never evicted by drain.
  • Targeted pod deletion (pkg/upgrade/pod_manager.go): uses an injected PodDeletionFilter. gpu-operator's filter keys on the nvidia.com/gpu resource, so pods without that request are never selected. IgnoreAllDaemonSets: true is hardcoded here too.

Net: a DaemonSet GPU consumer cannot be freed by any path, and a non-resource GPU pod is only reachable via a broad node drain, which is heavier than needed and, in our testing, did not free our runtime-direct Deployments either.
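The gap in the targeted-deletion path can be illustrated with a small model. This is a sketch only, not the operator's code: `Pod` is a simplified, hypothetical stand-in for `corev1.Pod`, and `selectedForDeletion` merely imitates a filter keyed on the `nvidia.com/gpu` resource combined with a DaemonSet skip.

```go
package main

import "fmt"

const gpuResource = "nvidia.com/gpu"

// Pod is a simplified, hypothetical stand-in for corev1.Pod.
type Pod struct {
	Name             string
	RuntimeClassName string
	OwnerKind        string           // e.g. "ReplicaSet", "DaemonSet"
	Requests         map[string]int64 // resource requests summed over containers
}

// requestsGPU imitates a deletion filter keyed on the nvidia.com/gpu resource.
func requestsGPU(p Pod) bool {
	return p.Requests[gpuResource] > 0
}

// selectedForDeletion applies the DaemonSet skip first, then the resource filter.
// RuntimeClassName is deliberately never consulted, which is the gap.
func selectedForDeletion(p Pod, ignoreDaemonSets bool) bool {
	if ignoreDaemonSets && p.OwnerKind == "DaemonSet" {
		return false
	}
	return requestsGPU(p)
}

func main() {
	pods := []Pod{
		{Name: "trainer", OwnerKind: "ReplicaSet", Requests: map[string]int64{gpuResource: 1}},
		{Name: "runtime-direct", OwnerKind: "ReplicaSet", RuntimeClassName: "nvidia"},
		{Name: "dcgm-exporter", OwnerKind: "DaemonSet", RuntimeClassName: "nvidia"},
	}
	for _, p := range pods {
		fmt.Printf("%s selected=%v\n", p.Name, selectedForDeletion(p, true))
	}
}
```

In this model only `trainer` is selected; the `runtimeClassName: nvidia` pod and the DaemonSet pod both fall through while still holding the device.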

Proposal

  1. Allow declaring additional GPU consumers to evict during an upgrade by label/selector, surfaced through driver.upgradePolicy, so they do not have to request nvidia.com/gpu. This complements the existing nvidia.com/gpu-driver-upgrade-drain.skip opt-out with an opt-in.
  2. Provide handling for declared DaemonSet GPU consumers during the upgrade window - e.g. an opt-in to cordon/park them (or make the DaemonSet-ignore configurable for a named set), restored on upgrade-done. The operator already quiesces its own operands; this extends that to declared third-party consumers.
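One possible shape for the decision logic behind both proposal items, as a rough sketch. Everything here is hypothetical: the `Pod` type is simplified, the `optIn` selector stands in for whatever `driver.upgradePolicy` would expose, and letting the existing skip label override the opt-in is an assumption rather than established behavior.

```go
package main

import "fmt"

const skipLabel = "nvidia.com/gpu-driver-upgrade-drain.skip"

// Pod is a simplified, hypothetical stand-in for corev1.Pod.
type Pod struct {
	Name        string
	Labels      map[string]string
	OwnerKind   string
	RequestsGPU bool // true if any container requests nvidia.com/gpu
}

type Action string

const (
	Leave Action = "leave"
	Evict Action = "evict"
	Park  Action = "park" // quiesce the owning DaemonSet, restore on upgrade-done
)

// matches reports whether every selector pair is present in labels.
// An empty selector matches nothing, so the opt-in is off by default.
func matches(selector, labels map[string]string) bool {
	if len(selector) == 0 {
		return false
	}
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// decide combines the existing opt-out with the proposed opt-in.
func decide(p Pod, optIn map[string]string) Action {
	if p.Labels[skipLabel] == "true" {
		return Leave // assumption: the opt-out always wins
	}
	declared := matches(optIn, p.Labels)
	if p.OwnerKind == "DaemonSet" {
		if declared {
			return Park
		}
		return Leave
	}
	if p.RequestsGPU || declared {
		return Evict
	}
	return Leave
}

func main() {
	// Hypothetical opt-in label; not an existing operator label.
	optIn := map[string]string{"example.com/gpu-consumer": "true"}
	p := Pod{Name: "runtime-direct", OwnerKind: "ReplicaSet", Labels: optIn}
	fmt.Println(p.Name, decide(p, optIn))
}
```

With an empty selector this reduces to the resource-based behavior described earlier, so existing clusters would see no change unless they declare consumers.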

Where the change lives

  • Core: NVIDIA/k8s-operator-libs pkg/upgrade (drain/pod managers; the hardcoded IgnoreAllDaemonSets: true).
  • Surface: NVIDIA/gpu-operator driver.upgradePolicy API and the injected PodDeletionFilter in controllers/upgrade_controller.go.

Workaround today

We run a small sidecar controller that watches nvidia.com/gpu-driver-upgrade-state and, on any transition away from upgrade-done, deletes the runtime-direct pods and parks the DaemonSet consumer (non-matching nodeSelector), then restores them on upgrade-done. With those consumers freed, the operator unloads, rebuilds and reloads the driver in place with no node reboot (boot_id unchanged across the upgrade). Happy to share it.
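The core of that controller is a small transition rule on the node's upgrade-state value. The sketch below is not the actual controller: `nextStep` and the `Step` names are hypothetical, and only the `upgrade-done` value is taken from the description above; the other state string in `main` is an illustrative non-done value.

```go
package main

import "fmt"

const upgradeDone = "upgrade-done"

type Step string

const (
	None    Step = "none"
	Free    Step = "free-consumers"    // delete runtime-direct pods, park the DaemonSet
	Restore Step = "restore-consumers" // undo the parking nodeSelector
)

// nextStep maps a change of the nvidia.com/gpu-driver-upgrade-state value
// to the action to take. It is purely transition-based: a controller that
// starts mid-upgrade with no known previous state takes no action here.
func nextStep(prev, cur string) Step {
	switch {
	case prev == cur:
		return None
	case prev == upgradeDone:
		return Free // any transition away from upgrade-done
	case cur == upgradeDone:
		return Restore
	default:
		return None // moving between two non-done states
	}
}

func main() {
	// "upgrade-required" is used only as an example of a non-done state.
	fmt.Println(nextStep(upgradeDone, "upgrade-required"))
	fmt.Println(nextStep("upgrade-required", upgradeDone))
}
```

Keeping the rule keyed only on `upgrade-done` means the controller does not need to track the operator's intermediate states.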

Related: #2549.

Labels: feature, lifecycle/frozen, needs-triage