Add Cluster Autoscaler integration example with NPD and NRC #2573
Draft
rajathagasthya wants to merge 1 commit into
When the Cluster Autoscaler adds a GPU node, the node reports Ready long before the driver, container toolkit, and device plugin are installed. Workloads scheduled in that window fail, or occupy the node so that the autoscaler considers the scale-up satisfied, and MIG node pools can deadlock entirely when scaling from zero.
Add a reference guide and example manifests that gate scheduling on GPU readiness using upstream components: the node pool template applies a startup taint, Node Problem Detector (NPD) publishes an nvidia.com/GPUReady node condition from an nvidia-smi probe, and the Node Readiness Controller (NRC) removes the taint once the condition is True. The GPU Operator itself is unchanged; its operands only need tolerations via the existing daemonsets.tolerations and NFD worker toleration values.
The example includes a kind-based simulation for clusters without GPUs: workers join pre-tainted via kubelet registration (matching node pool template semantics), a marker file stands in for the nvidia-smi probe, and a node-scaler script vendored from the NRC repo simulates a scale-up by adding a fresh worker to the running cluster.
Known limitation documented in the guide: NRC requires the readiness.k8s.io/ taint prefix while managed autoscalers on GKE and AKS only support the reserved startup-taint key prefix, so the pattern currently requires a self-managed Cluster Autoscaler; kubernetes-sigs/node-readiness-controller#279 tracks lifting this.
Signed-off-by: Rajath Agasthya <ragasthya@nvidia.com>
Description
When the Cluster Autoscaler adds a GPU node, the node reports Ready long before the driver, container toolkit, and device plugin are installed. Workloads scheduled in that window fail, or occupy the node so that the autoscaler considers the scale-up satisfied.
This adds a reference guide and example manifests that gate scheduling on GPU readiness using upstream components: the node pool template applies a startup taint, Node Problem Detector (NPD) publishes an `nvidia.com/GPUReady` node condition from an nvidia-smi probe, and the Node Readiness Controller (NRC) removes the taint once the condition is True. The GPU Operator itself is unchanged; its operands only need tolerations via the existing `daemonsets.tolerations` and NFD worker toleration values.
The example includes a kind-based simulation for clusters without GPUs: workers join pre-tainted via kubelet registration (matching node pool template semantics), a marker file stands in for the nvidia-smi probe, and a node-scaler script vendored from the NRC repo simulates a scale-up by adding a fresh worker to the running cluster.
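The gating flow described above can be pictured with a small in-memory model. This is only a sketch of the idea, not the Node Readiness Controller's implementation: the `Node` class, `reconcile`, and `schedulable` are hypothetical names, taints are reduced to their keys (effects are ignored), and the taint key suffix `gpu-not-ready` is an assumption (only the `readiness.k8s.io/` prefix and the `nvidia.com/GPUReady` condition come from the description).

```python
from dataclasses import dataclass, field

# Hypothetical taint key: only the "readiness.k8s.io/" prefix is taken from
# the PR description; the suffix is made up for illustration.
STARTUP_TAINT_KEY = "readiness.k8s.io/gpu-not-ready"
# Condition type named in the PR description.
CONDITION_TYPE = "nvidia.com/GPUReady"


@dataclass
class Node:
    """Simplified stand-in for a Kubernetes node (not a real API object)."""
    name: str
    conditions: dict = field(default_factory=dict)  # condition type -> "True"/"False"/"Unknown"
    taints: set = field(default_factory=set)        # taint keys only, effects omitted


def reconcile(node: Node) -> bool:
    """Remove the startup taint once the readiness condition is True.

    Returns True if the taint was removed in this call, False otherwise.
    """
    if STARTUP_TAINT_KEY not in node.taints:
        return False  # nothing to gate
    if node.conditions.get(CONDITION_TYPE) != "True":
        return False  # condition missing, False, or Unknown: keep holding the taint
    node.taints.discard(STARTUP_TAINT_KEY)
    return True


def schedulable(node: Node, tolerations=frozenset()) -> bool:
    """A pod fits only if it tolerates every taint key on the node."""
    return node.taints <= set(tolerations)


if __name__ == "__main__":
    node = Node("worker-1", taints={STARTUP_TAINT_KEY})
    print(schedulable(node))                        # ordinary workload is held back
    print(schedulable(node, {STARTUP_TAINT_KEY}))   # a tolerating daemonset pod fits
    node.conditions[CONDITION_TYPE] = "True"
    reconcile(node)
    print(schedulable(node))                        # taint gone, workload can schedule
```

The tolerating case mirrors why the GPU Operator operands need tolerations: they have to run on the tainted node in order for it to become GPU-ready at all.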
Known limitation documented in the guide: NRC requires the `readiness.k8s.io/` taint prefix while managed autoscalers on GKE and AKS only support the reserved startup-taint key prefix, so the pattern currently requires a self-managed Cluster Autoscaler; kubernetes-sigs/node-readiness-controller#279 tracks this issue.
Checklist
- `make lint`: N/A, documentation and example manifests only
- `make validate-generated-assets`: N/A, no generated assets affected
- `make validate-modules`: N/A, no Go changes
Testing
Manually verified on a kind cluster:
- NPD publishes `nvidia.com/GPUReady`, NRC holds the startup taint while the condition is False and removes it once it turns True, and the gated pod then schedules.
- `reset.sh` re-arms the node.
- Workers added with `kindscaler.sh` join with the startup taint and `nvidia.com/gpu.present` label already applied.
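
The probe side of the kind test can be sketched in the same spirit. The function below is a hypothetical illustration, not the script shipped in the example: when a marker path is given it only checks that the file exists (the stand-in for clusters without GPUs), and otherwise it runs `nvidia-smi` and treats exit code 0 as ready. The function name, the timeout, and the way the result would be wired into a Node Problem Detector check are all assumptions.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def gpu_ready(marker: Optional[Path] = None, timeout_s: float = 10.0) -> bool:
    """Report whether the node looks GPU-ready.

    marker:    if set, readiness is simulated by the existence of this file
               (hypothetical stand-in for the nvidia-smi probe).
    timeout_s: assumed upper bound for the nvidia-smi call.
    """
    if marker is not None:
        return marker.exists()

    # Real probe path: nvidia-smi must be installed and must exit successfully.
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0
```

Creating the marker file flips the simulated probe to ready, and deleting it flips it back, which is one plausible way to exercise the hold-then-release behavior listed above without GPU hardware.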