Add Cluster Autoscaler integration example with NPD and NRC #2573

Draft: rajathagasthya wants to merge 1 commit into main from cluster-autoscaler-example

rajathagasthya (Contributor) commented on Jun 23, 2026

Description

When the Cluster Autoscaler adds a GPU node, the node reports Ready long before the driver, container toolkit, and device plugin are installed. Workloads scheduled in that window fail, or they occupy the node and the autoscaler treats the scale-up as satisfied.

This adds a reference guide and example manifests that gate scheduling on GPU readiness using upstream components: the node pool template applies a startup taint, Node Problem Detector publishes an nvidia.com/GPUReady node condition from an nvidia-smi probe, and the Node Readiness Controller removes the taint once the condition is True. The GPU Operator itself is unchanged; its operands only need tolerations for the startup taint, set through the existing daemonsets.tolerations and NFD worker toleration values.
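
The gating decision described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not NRC's actual implementation: the taint key `readiness.k8s.io/gpu-not-ready` and the helper names are hypothetical, and a node is modeled as a plain dict shaped like the Kubernetes Node object.

```python
# Hypothetical names: the taint key and both helpers are illustrative only;
# they are not taken from NRC or from the example manifests in this PR.
STARTUP_TAINT_KEY = "readiness.k8s.io/gpu-not-ready"
CONDITION_TYPE = "nvidia.com/GPUReady"


def gpu_ready(node: dict) -> bool:
    """Return True only if the GPUReady condition exists with status "True"."""
    for cond in node.get("status", {}).get("conditions") or []:
        if cond.get("type") == CONDITION_TYPE:
            return cond.get("status") == "True"
    # Condition not published yet: treat the node as not ready.
    return False


def taints_after_reconcile(node: dict) -> list:
    """Keep all taints while the GPU is not ready; drop the startup taint once it is."""
    taints = node.get("spec", {}).get("taints") or []
    if not gpu_ready(node):
        return list(taints)
    return [t for t in taints if t.get("key") != STARTUP_TAINT_KEY]
```

Note that the node's own Ready condition plays no part in the decision, which is the point of the pattern: only the GPU-specific condition releases the taint.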

The example includes a kind-based simulation for clusters without GPUs: workers join pre-tainted via kubelet registration (matching node pool template semantics), a marker file stands in for the nvidia-smi probe, and a node-scaler script vendored from the NRC repo simulates a scale-up by adding a fresh worker to the running cluster.
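
As a rough sketch of how a marker file can stand in for the nvidia-smi probe, the function below returns an exit-style status code in either mode. It is an assumption-laden illustration rather than the script shipped in this PR: `probe_gpu`, its parameters, and the marker path are hypothetical, and the 0/1/2 codes follow the Node Problem Detector custom plugin convention (OK, problem, unknown). How a code maps onto the nvidia.com/GPUReady condition status depends on the NPD configuration, which is not reproduced here.

```python
import os
import shutil
import subprocess

# NPD custom plugin exit codes: 0 = OK, 1 = problem detected, 2 = unknown.
OK, NON_OK, UNKNOWN = 0, 1, 2


def probe_gpu(marker_path=None, timeout_s=10.0) -> int:
    """Return an NPD-style status code for GPU readiness.

    Simulation mode (marker_path given): ready means the file exists.
    Real mode (marker_path is None): ready means nvidia-smi exits with 0.
    """
    if marker_path is not None:
        return OK if os.path.exists(marker_path) else NON_OK

    if shutil.which("nvidia-smi") is None:
        # Binary not on PATH, e.g. the driver is not installed yet.
        return NON_OK
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # A hung probe says nothing definite about the GPU.
        return UNKNOWN
    return OK if result.returncode == 0 else NON_OK
```

In simulation mode, creating the marker file plays the role of the driver stack finishing installation, and deleting it puts the node back into the not-ready state.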

Known limitation documented in the guide: NRC requires the readiness.k8s.io/ taint prefix while managed autoscalers on GKE and AKS only support the reserved startup-taint key prefix, so the pattern currently requires a self-managed Cluster Autoscaler; kubernetes-sigs/node-readiness-controller#279 tracks this issue.
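
The prefix mismatch can be made concrete with a small check. This is a sketch under assumptions: `startup-taint.cluster-autoscaler.kubernetes.io/` is assumed here to be the reserved startup-taint prefix referred to above (it is the prefix upstream Cluster Autoscaler documents for startup taints), and `taint_support` and the sample keys are hypothetical.

```python
# Prefix NRC requires, as stated in the description above.
NRC_PREFIX = "readiness.k8s.io/"
# Assumption: the reserved startup-taint prefix meant above is the one
# documented by upstream Cluster Autoscaler.
CA_STARTUP_PREFIX = "startup-taint.cluster-autoscaler.kubernetes.io/"


def taint_support(taint_key: str) -> dict:
    """Report which component would accept the given taint key."""
    return {
        "nrc": taint_key.startswith(NRC_PREFIX),
        "managed_ca": taint_key.startswith(CA_STARTUP_PREFIX),
    }


def usable_with_managed_autoscaler(taint_key: str) -> bool:
    """True only if both NRC and a managed autoscaler would accept the key."""
    support = taint_support(taint_key)
    return support["nrc"] and support["managed_ca"]
```

Because a key can start with only one of the two prefixes, `usable_with_managed_autoscaler` returns False for every key, which is the limitation in code form.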

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint) — N/A, documentation and example manifests only
  • Generated assets in-sync (make validate-generated-assets) — N/A, no generated assets affected
  • Go mod artifacts in-sync (make validate-modules) — N/A, no Go changes
  • Test cases are added for new code paths — N/A, no code paths added

Testing

Manually verified on a kind cluster:

  • Walkthrough A (simulation): NPD publishes nvidia.com/GPUReady, NRC holds the startup taint while the condition is False and removes it once it turns True, and the gated pod then schedules. reset.sh re-arms the node.
  • New workers, including one added via kindscaler.sh, join with the startup taint and nvidia.com/gpu.present label already applied.
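
The second check can be automated against the JSON that `kubectl get node <name> -o json` prints. The sketch below is illustrative only: `joined_pre_gated` and the default taint key `readiness.k8s.io/gpu-not-ready` are hypothetical, and only the presence of the nvidia.com/gpu.present label is checked, not its value.

```python
import json

# Hypothetical default; substitute the taint key used by the node pool template.
DEFAULT_TAINT_KEY = "readiness.k8s.io/gpu-not-ready"
GPU_LABEL = "nvidia.com/gpu.present"


def joined_pre_gated(node_json: str, taint_key: str = DEFAULT_TAINT_KEY) -> bool:
    """Return True if the node carries both the GPU label and the startup taint."""
    node = json.loads(node_json)
    labels = node.get("metadata", {}).get("labels") or {}
    taints = node.get("spec", {}).get("taints") or []
    has_label = GPU_LABEL in labels
    has_taint = any(t.get("key") == taint_key for t in taints)
    return has_label and has_taint
```

Running it right after a worker registers, before the readiness condition turns True, should return True for a correctly templated node.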

When the Cluster Autoscaler adds a GPU node, the node reports Ready
long before the driver, container toolkit, and device plugin are
installed. Workloads scheduled in that window fail or occupy the node
so the autoscaler considers the scale-up satisfied, and MIG node pools
can deadlock entirely when scaling from zero.

Add a reference guide and example manifests that gate scheduling on
GPU readiness using upstream components: the node pool template
applies a startup taint, Node Problem Detector publishes an
nvidia.com/GPUReady node condition from an nvidia-smi probe, and the
Node Readiness Controller removes the taint once the condition is
True. The GPU Operator itself is unchanged; its operands only need
tolerations via the existing daemonsets.tolerations and NFD worker
toleration values.

The example includes a kind-based simulation for clusters without
GPUs: workers join pre-tainted via kubelet registration (matching
node pool template semantics), a marker file stands in for the
nvidia-smi probe, and a node-scaler script vendored from the NRC
repo simulates a scale-up by adding a fresh worker to the running
cluster.

Known limitation documented in the guide: NRC requires the
readiness.k8s.io/ taint prefix while managed autoscalers on GKE and
AKS only support the reserved startup-taint key prefix, so the
pattern currently requires a self-managed Cluster Autoscaler;
kubernetes-sigs/node-readiness-controller#279 tracks lifting this.

Signed-off-by: Rajath Agasthya <ragasthya@nvidia.com>