Add Cluster Autoscaler integration example with NPD and NRC by rajathagasthya · Pull Request #2573 · NVIDIA/gpu-operator

rajathagasthya · 2026-06-23T15:59:57Z

Description

When the Cluster Autoscaler adds a GPU node, the node reports Ready long before the driver, container toolkit, and device plugin are installed. Workloads scheduled in that window either fail or occupy the node, so the autoscaler considers the scale-up satisfied.

This PR adds a reference guide and example manifests that gate scheduling on GPU readiness using upstream components: the node pool template applies a startup taint, Node Problem Detector (NPD) publishes an nvidia.com/GPUReady node condition from an nvidia-smi probe, and the Node Readiness Controller (NRC) removes the taint once the condition is True. The GPU Operator itself is unchanged; its operands only need tolerations via the existing daemonsets.tolerations and NFD worker toleration values.
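The gating flow described above can be modeled in a few lines. This is only an illustrative sketch, not the NRC implementation: the `Node` class, `reconcile`, and `schedulable` are hypothetical names, and the taint key `readiness.k8s.io/gpu-not-ready` is an assumed example (only the `readiness.k8s.io/` prefix and the `nvidia.com/GPUReady` condition come from this PR). It models taint removal only, not any re-tainting behavior.

```python
from dataclasses import dataclass, field

GPU_READY_CONDITION = "nvidia.com/GPUReady"  # condition name from the PR description
# Hypothetical taint key: only the "readiness.k8s.io/" prefix comes from the PR.
STARTUP_TAINT_KEY = "readiness.k8s.io/gpu-not-ready"


@dataclass
class Node:
    """Toy stand-in for a Kubernetes node (not a real client type)."""
    conditions: dict = field(default_factory=dict)  # condition type -> "True"/"False"/"Unknown"
    taints: set = field(default_factory=set)        # taint keys currently on the node


def reconcile(node: Node) -> Node:
    """Remove the startup taint once the GPU readiness condition is True."""
    if node.conditions.get(GPU_READY_CONDITION) == "True":
        node.taints.discard(STARTUP_TAINT_KEY)
    return node


def schedulable(node: Node, tolerations=frozenset()) -> bool:
    """A pod fits only if it tolerates every taint on the node."""
    return node.taints <= set(tolerations)


# A freshly scaled-up node joins with the startup taint already applied.
node = Node(taints={STARTUP_TAINT_KEY})
reconcile(node)
print(schedulable(node))                       # ordinary workload is held back
print(schedulable(node, {STARTUP_TAINT_KEY}))  # operand with a toleration can run

node.conditions[GPU_READY_CONDITION] = "True"
reconcile(node)
print(schedulable(node))                       # workload can schedule after the probe passes
```

The second call illustrates why the GPU Operator operands need tolerations: they have to run on the tainted node to install the components that make the condition turn True.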

The example includes a kind-based simulation for clusters without GPUs: workers join pre-tainted via kubelet registration (matching node pool template semantics), a marker file stands in for the nvidia-smi probe, and a node-scaler script vendored from the NRC repo simulates a scale-up by adding a fresh worker to the running cluster.
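As a rough illustration of the marker-file idea, the simulated probe can be thought of as a check for a file's existence in place of a successful nvidia-smi run. The function name `gpu_ready` and the marker file name below are hypothetical; the actual probe script and path in the example manifests may differ.

```python
import os
import tempfile


def gpu_ready(marker_path: str) -> bool:
    """Simulated probe: report ready only while the marker file exists."""
    return os.path.isfile(marker_path)


with tempfile.TemporaryDirectory() as tmp:
    marker = os.path.join(tmp, "gpu-ready")  # hypothetical marker file name

    before = gpu_ready(marker)       # no marker yet: node stays gated
    open(marker, "w").close()        # simulate the GPU stack finishing installation
    after = gpu_ready(marker)        # marker present: probe passes
    os.remove(marker)                # re-arm the node by removing the marker
    rearmed = gpu_ready(marker)

print(before, after, rearmed)
```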

Known limitation documented in the guide: NRC requires the readiness.k8s.io/ taint prefix while managed autoscalers on GKE and AKS only support the reserved startup-taint key prefix, so the pattern currently requires a self-managed Cluster Autoscaler; kubernetes-sigs/node-readiness-controller#279 tracks this issue.
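The conflict can be stated as a simple prefix check: no single taint key can satisfy both requirements at once. In this sketch the NRC prefix is taken from the description above, while the autoscaler prefix `startup-taint.cluster-autoscaler.kubernetes.io/` is an assumption based on the Cluster Autoscaler documentation and should be verified against the guide; the function name and example keys are hypothetical.

```python
NRC_PREFIX = "readiness.k8s.io/"  # required by NRC, per the PR description
# Assumption: reserved startup-taint prefix honored by managed autoscalers.
CA_STARTUP_PREFIX = "startup-taint.cluster-autoscaler.kubernetes.io/"


def taint_key_support(key: str) -> dict:
    """Report which component would accept a given taint key."""
    return {
        "nrc": key.startswith(NRC_PREFIX),
        "managed_autoscaler": key.startswith(CA_STARTUP_PREFIX),
    }


# Hypothetical example keys, one per prefix.
for key in (NRC_PREFIX + "gpu-not-ready", CA_STARTUP_PREFIX + "gpu-not-ready"):
    print(key, taint_key_support(key))
```

Each key is accepted by exactly one side, which is why the pattern currently depends on a self-managed Cluster Autoscaler.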

Checklist

No secrets, sensitive information, or unrelated changes
Lint checks passing (make lint) — N/A, documentation and example manifests only
Generated assets in-sync (make validate-generated-assets) — N/A, no generated assets affected
Go mod artifacts in-sync (make validate-modules) — N/A, no Go changes
Test cases are added for new code paths — N/A, no code paths added

Testing

Manually verified on a kind cluster:

Walkthrough A (simulation): NPD publishes nvidia.com/GPUReady, NRC holds the startup taint while the condition is False and removes it once it turns True, and the gated pod then schedules. reset.sh re-arms the node.
New workers, including one added via kindscaler.sh, join with the startup taint and nvidia.com/gpu.present label already applied.

When the Cluster Autoscaler adds a GPU node, the node reports Ready long before the driver, container toolkit, and device plugin are installed. Workloads scheduled in that window either fail or occupy the node, so the autoscaler considers the scale-up satisfied, and MIG node pools can deadlock entirely when scaling from zero.

Add a reference guide and example manifests that gate scheduling on GPU readiness using upstream components: the node pool template applies a startup taint, Node Problem Detector publishes an nvidia.com/GPUReady node condition from an nvidia-smi probe, and the Node Readiness Controller removes the taint once the condition is True. The GPU Operator itself is unchanged; its operands only need tolerations via the existing daemonsets.tolerations and NFD worker toleration values.

The example includes a kind-based simulation for clusters without GPUs: workers join pre-tainted via kubelet registration (matching node pool template semantics), a marker file stands in for the nvidia-smi probe, and a node-scaler script vendored from the NRC repo simulates a scale-up by adding a fresh worker to the running cluster.

Known limitation documented in the guide: NRC requires the readiness.k8s.io/ taint prefix while managed autoscalers on GKE and AKS only support the reserved startup-taint key prefix, so the pattern currently requires a self-managed Cluster Autoscaler; kubernetes-sigs/node-readiness-controller#279 tracks lifting this.

Signed-off-by: Rajath Agasthya <ragasthya@nvidia.com>

rajathagasthya wants to merge 1 commit into main from cluster-autoscaler-example
