Create new controller for all node labeling operations #2569
ccde8c4 to a9a0f14
	relevant := hasGPULabels(oldL) || hasGPULabels(newL) ||
		hasCommonGPULabel(oldL) || hasCommonGPULabel(newL)
	return relevant && !maps.Equal(oldL, newL)
},
This will still trigger a reconcile on every node label update. To avoid recalculating owners on every node label update, I was thinking we should reconcile only for a selected set of labels:
- GPU labels and GPU present label
- Labels in nodeSelectors of all nvidiadrivers
- Ownership label itself
So like:
func (r *NVIDIADriverOwnershipReconciler) nodeOwnershipLabelsChanged(ctx context.Context, oldLabels, newLabels map[string]string) bool {
	if oldLabels[consts.GPUPresentLabel] != newLabels[consts.GPUPresentLabel] {
		return true
	}
	if oldLabels[consts.NVIDIADriverOwnerLabel] != newLabels[consts.NVIDIADriverOwnerLabel] {
		return true
	}
	if !hasGPULabels(oldLabels) && !hasGPULabels(newLabels) {
		return false
	}
	nvidiaDrivers := &nvidiav1alpha1.NVIDIADriverList{}
	if err := r.List(ctx, nvidiaDrivers); err != nil {
		log.FromContext(ctx).Error(err, "failed to list NVIDIADrivers while checking node label update")
		return true
	}
	for _, nvidiaDriver := range nvidiaDrivers.Items {
		for key := range nvidiaDriver.Spec.NodeSelector {
			if oldLabels[key] != newLabels[key] {
				return true
			}
		}
	}
	return false
}
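The label comparison in the suggestion above can be separated from the API call so it is easy to exercise on its own. Below is a minimal, standard-library-only sketch of that idea. The names `relevantLabelChange` and `watchedKeys` are hypothetical, the two constants are assumed stand-ins for `consts.GPUPresentLabel` and `consts.NVIDIADriverOwnerLabel`, and the node selectors are passed in as plain maps rather than listed from the cluster. Unlike the suggestion, this sketch omits the `hasGPULabels` short-circuit and treats a key being added or removed as a change even when its value is the empty string.

```go
package main

import "fmt"

// Assumed values for the two fixed labels; in the operator these would come
// from the consts package referenced in the suggestion above.
const (
	gpuPresentLabel  = "nvidia.com/gpu.present"
	driverOwnerLabel = "nvidia.com/gpu-operator.driver.owner"
)

// watchedKeys (hypothetical helper) returns the label keys that should
// trigger a reconcile: the two fixed labels plus every key that appears in
// any of the given node selectors.
func watchedKeys(nodeSelectors []map[string]string) []string {
	keys := []string{gpuPresentLabel, driverOwnerLabel}
	for _, selector := range nodeSelectors {
		for key := range selector {
			keys = append(keys, key)
		}
	}
	return keys
}

// relevantLabelChange (hypothetical helper) reports whether any watched key
// was added, removed, or had its value changed between the two label sets.
func relevantLabelChange(oldLabels, newLabels map[string]string, watched []string) bool {
	for _, key := range watched {
		oldVal, oldOK := oldLabels[key]
		newVal, newOK := newLabels[key]
		if oldOK != newOK || oldVal != newVal {
			return true
		}
	}
	return false
}

func main() {
	// "example.com/pool" is a made-up selector key used only for illustration.
	watched := watchedKeys([]map[string]string{{"example.com/pool": "a100"}})

	oldL := map[string]string{gpuPresentLabel: "true", "example.com/pool": "a100"}
	newL := map[string]string{gpuPresentLabel: "true", "example.com/pool": "a100", "unrelated": "x"}

	// Only an unrelated label was added, so no reconcile would be needed.
	fmt.Println(relevantLabelChange(oldL, newL, watched))
}
```

Keeping the decision as a pure function over maps means the predicate wrapper only has to gather the watched keys and call it, which also makes the failure mode of the list call (fall back to reconciling) easy to keep separate.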
I have updated this significantly now to align with the predicates used in ClusterPolicy previously. This should cover the three events you just listed, PTAL.
Question -- I notice the NVIDIADriver controller reconciles on every node label update. Do we want to update this in a follow-up PR?
Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
a9a0f14 to 7d5d0b7
@cdesiniotis, looking through the PR and description, it seems like this will be a behind-the-scenes change for users. Is there anything that we should plan to document here? Or just a release note? Is this planned to be included in the 26.7.0 release? It's not currently added to a milestone.
@a-mccarthy this is planned for 26.7.0, just added it to the milestone. I would say this is an implementation detail (with regard to how we label nodes in gpu-operator) that does not require documentation. We could consider mentioning this as an "improvement" in our release notes but I haven't put much thought into it.
This PR adds a new controller, named node-labeling-controller, which is responsible for labeling k8s nodes with GPU Operator-related state. Currently, both the clusterpolicy and nvidiadriver controllers label nodes. When DRA is integrated, we will have yet another controller that will need to label nodes as well. The intent is for the node-labeling-controller to reconcile any node labels required by any of these controllers. The only exception here is the nvidia.com/gpu-driver-upgrade-state label, which will still be managed by the driver-upgrade controller for facilitating driver daemonset upgrades.
Code changes in this PR were drafted with the assistance of Claude Code.
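To illustrate the idea of one controller reconciling labels on behalf of several others, here is a minimal sketch using only the Go standard library (Go 1.21+ for the `maps` package). It is not the PR's implementation: `reconcileLabels` is a hypothetical helper, the rule that later sources override earlier ones is an assumption, and the only detail taken from the description is that the `nvidia.com/gpu-driver-upgrade-state` label is left to a different controller.

```go
package main

import (
	"fmt"
	"maps"
)

// Label owned by the driver-upgrade controller, per the PR description.
const upgradeStateLabel = "nvidia.com/gpu-driver-upgrade-state"

// reconcileLabels (hypothetical helper) applies the desired label sets on top
// of the node's current labels and reports whether anything changed. Later
// sources override earlier ones (an assumption made for this sketch). The
// upgrade-state label is never written here, and the input map is not mutated.
func reconcileLabels(current map[string]string, desired ...map[string]string) (map[string]string, bool) {
	result := maps.Clone(current)
	if result == nil {
		result = map[string]string{}
	}
	for _, source := range desired {
		for key, value := range source {
			if key == upgradeStateLabel {
				continue // managed elsewhere; leave whatever is on the node
			}
			result[key] = value
		}
	}
	return result, !maps.Equal(current, result)
}

func main() {
	current := map[string]string{"nvidia.com/gpu.present": "true"}

	// Two made-up sources standing in for what different controllers need.
	fromClusterPolicy := map[string]string{"nvidia.com/gpu.deploy.operands": "true"}
	fromNVIDIADriver := map[string]string{"nvidia.com/gpu-operator.driver.owner": "default"}

	labels, changed := reconcileLabels(current, fromClusterPolicy, fromNVIDIADriver)
	fmt.Println(len(labels), changed)
}
```

Returning a changed flag lets the caller skip the node update when the labels already match, which keeps repeated reconciles cheap.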
Testing
I tested the following scenarios manually:
- nvidia.com/* node labels are added and all GPU Operator pods come up successfully.
- nvidia.com/gpu.present label and verify it gets re-added to the node without disruption to operands.
- nvidia.com/gpu.deploy.mig-manager=false and verified only the mig-manager pod got removed. I re-set this label to true and verified the mig-manager pod got rescheduled.
- driver.useNvidiaDriverCRD=true in ClusterPolicy and created a default NVIDIADriver CR. Verified node got labeled with nvidia.com/gpu-operator.driver.owner=default. Verified old driver pod gets replaced with a new one.
- nvidia.com/gpu-operator.driver.owner was updated from default to the name of the CR I just created. Verified that the new driver pod comes up successfully.
- nvidia.com/gpu.deploy.operands=false label.
- nvidia.com/gpu.deploy.operands=true label.
- nvidia.com/* node labels are added. Verify default NVIDIADriver gets deployed (nvidia.com/gpu-operator.driver.owner gets set to default on the new node) and all operands come up successfully.