Skip to content

Create new controller for all node labeling operations#2569

Open
cdesiniotis wants to merge 1 commit into
NVIDIA:mainfrom
cdesiniotis:ctlr-for-node-labeling
Open

Create new controller for all node labeling operations#2569
cdesiniotis wants to merge 1 commit into
NVIDIA:mainfrom
cdesiniotis:ctlr-for-node-labeling

Conversation

@cdesiniotis

@cdesiniotis cdesiniotis commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

This PR adds a new controller, named node-labeling-controller, which is responsible for labeling k8s nodes with GPU Operator-related state. Currently, both the clusterpolicy and nvidiadriver controller's label nodes. When DRA is integrated, we will have yet another controller that will need to label nodes as well. The intent is for the node-labeling-controller to reconcile any node labels required by any of these controllers. The only exception here is the nvidia.com/gpu-driver-upgrade-state label which will still be managed by the driver-upgrade controller for facilitating driver daemonset upgrades.

Code changes in this PR were drafted with the assistance of Claude Code.

Testing

I tested the following scenarios manually:

  1. Install GPU Operator with default values. Verify relevant nvidia.com/* node labels are added and all GPU Operator pods come up successfully.
  2. Remove nvidia.com/gpu.present label and verify it gets re-added to the node without disruption to operands.
  3. Disable a single operand via node label. I labeled my GPU node with nvidia.com/gpu.deploy.mig-manager=false and verified only the mig-manager pod got removed. I re-set this label to true and verified the mig-manager pod got rescheduled.
  4. Migrate from ClusterPolicy-owned driver daemonset to NVIDIADriver-owned driver daemonset. I set driver.useNvidiaDriverCRD=true in ClusterPolicy and created a default NVIDIADriver CR. Verified node got labeled with nvidia.com/gpu-operator.driver.owner=default. Verified old driver pod get replaced with new one.
  5. Deploy a non-default NVIDIADriver CR that targets my GPU node. Once the CR was created, I verified that the nvidia.com/gpu-operator.driver.owner was updated from default to the name of the CR I just created. Verified that the new driver pod comes up successfully.
  6. Disable all operands by setting nvidia.com/gpu.deploy.operands=false label.
  7. Re-enable all operands by setting nvidia.com/gpu.deploy.operands=true label.
  8. Add new GPU worker node to cluster. Verify appropriate nvidia.com/* node labels are added. Verify default NVIDIADriver gets deployed (nvidia.com/gpu-operator.driver.owner gets set to default on the new node) and all operands come up successfully.

@cdesiniotis cdesiniotis force-pushed the ctlr-for-node-labeling branch 2 times, most recently from ccde8c4 to a9a0f14 Compare June 23, 2026 16:57
@cdesiniotis cdesiniotis marked this pull request as ready for review June 23, 2026 23:21
@cdesiniotis cdesiniotis self-assigned this Jun 23, 2026
Comment thread controllers/nodelabeling_controller.go Outdated
Comment thread controllers/nodelabeling_controller.go Outdated
relevant := hasGPULabels(oldL) || hasGPULabels(newL) ||
hasCommonGPULabel(oldL) || hasCommonGPULabel(newL)
return relevant && !maps.Equal(oldL, newL)
},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will still trigger reconcile on every node label update. To avoid always calculating new owners on every node label update, I was thinking we should reconcile only for selected set of labels:

  1. GPU labels and GPU present label
  2. Labels in nodeSelectors of all nvidiadrivers
  3. Ownership label itself

So like:

func (r *NVIDIADriverOwnershipReconciler) nodeOwnershipLabelsChanged(ctx context.Context, oldLabels, newLabels map[string]string) bool {
	if oldLabels[consts.GPUPresentLabel] != newLabels[consts.GPUPresentLabel] {
		return true
	}
	if oldLabels[consts.NVIDIADriverOwnerLabel] != newLabels[consts.NVIDIADriverOwnerLabel] {
		return true
	}
	if !hasGPULabels(oldLabels) && !hasGPULabels(newLabels) {
		return false
	}

	nvidiaDrivers := &nvidiav1alpha1.NVIDIADriverList{}
	if err := r.List(ctx, nvidiaDrivers); err != nil {
		log.FromContext(ctx).Error(err, "failed to list NVIDIADrivers while checking node label update")
		return true
	}

	for _, nvidiaDriver := range nvidiaDrivers.Items {
		for key := range nvidiaDriver.Spec.NodeSelector {
			if oldLabels[key] != newLabels[key] {
				return true
			}
		}
	}
	return false
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated this significantly now to align with the predicates used in ClusterPolicy previously. This should cover the three events you just listed, PTAL.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question -- I notice the NVIDIADriver controller reconciled on every node label update. Do we want to update this in a follow-up PR?

needsUpdate := hasGPULabels(newLabels) && !maps.Equal(newLabels, oldLabels)

Comment thread controllers/nodelabeling_controller.go Outdated
Comment thread controllers/nodelabeling_controller.go Outdated
Comment thread controllers/nodelabeling_controller.go Outdated
Comment thread controllers/nodelabeling_controller.go Outdated
Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
@cdesiniotis cdesiniotis force-pushed the ctlr-for-node-labeling branch from a9a0f14 to 7d5d0b7 Compare June 25, 2026 23:52
@a-mccarthy

Copy link
Copy Markdown

@cdesiniotis, looking through the PR and description, it seems like this will be a behind the scenes change for users. Is there anything that we should plan to document here? Or just a release note?

Is this planned to be included in the 26.7.0 release? its not currently added to a milestone.

@cdesiniotis cdesiniotis added this to the v26.7 milestone Jun 26, 2026
@cdesiniotis

Copy link
Copy Markdown
Contributor Author

@cdesiniotis, looking through the PR and description, it seems like this will be a behind the scenes change for users. Is there anything that we should plan to document here? Or just a release note?

Is this planned to be included in the 26.7.0 release? its not currently added to a milestone.

@a-mccarthy this is planned for 26.7.0, just added it to the milestone. I would say this is an implementation detail (with regards to how we label nodes in gpu-operator) that does not require documentation. We could consider mentioning this as an "improvement" in our release notes but I haven't put much thought into it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants