From e973dc2c2c98b28f3688ec6f94fede75a28a4b1e Mon Sep 17 00:00:00 2001 From: Rajath Agasthya Date: Fri, 12 Jun 2026 12:25:40 -0500 Subject: [PATCH] Add Cluster Autoscaler integration example with NPD and NRC When the Cluster Autoscaler adds a GPU node, the node reports Ready long before the driver, container toolkit, and device plugin are installed. Workloads scheduled in that window fail or occupy the node so the autoscaler considers the scale-up satisfied, and MIG node pools can deadlock entirely when scaling from zero. Add a reference guide and example manifests that gate scheduling on GPU readiness using upstream components: the node pool template applies a startup taint, Node Problem Detector publishes a nvidia.com/GPUReady node condition from an nvidia-smi probe, and the Node Readiness Controller removes the taint once the condition is True. The GPU Operator itself is unchanged; its operands only need tolerations via the existing daemonsets.tolerations and NFD worker toleration values. The example includes a kind-based simulation for clusters without GPUs: workers join pre-tainted via kubelet registration (matching node pool template semantics), a marker file stands in for the nvidia-smi probe, and a node-scaler script vendored from the NRC repo simulates a scale-up by adding a fresh worker to the running cluster. Known limitation documented in the guide: NRC requires the readiness.k8s.io/ taint prefix while managed autoscalers on GKE and AKS only support the reserved startup-taint key prefix, so the pattern currently requires a self-managed Cluster Autoscaler; kubernetes-sigs/node-readiness-controller#279 tracks lifting this. 
Signed-off-by: Rajath Agasthya --- examples/cluster-autoscaler/README.md | 646 ++++++++++++++++++ .../node-readiness-rule.yaml | 33 + .../cluster-autoscaler/npd-gpu-ready.yaml | 173 +++++ .../simulation/kind-config.yaml | 32 + .../simulation/kindscaler.sh | 114 ++++ .../simulation/npd-gpu-ready-simulation.yaml | 163 +++++ .../cluster-autoscaler/simulation/reset.sh | 37 + 7 files changed, 1198 insertions(+) create mode 100644 examples/cluster-autoscaler/README.md create mode 100644 examples/cluster-autoscaler/node-readiness-rule.yaml create mode 100644 examples/cluster-autoscaler/npd-gpu-ready.yaml create mode 100644 examples/cluster-autoscaler/simulation/kind-config.yaml create mode 100755 examples/cluster-autoscaler/simulation/kindscaler.sh create mode 100644 examples/cluster-autoscaler/simulation/npd-gpu-ready-simulation.yaml create mode 100755 examples/cluster-autoscaler/simulation/reset.sh diff --git a/examples/cluster-autoscaler/README.md b/examples/cluster-autoscaler/README.md new file mode 100644 index 000000000..df2cf4f7c --- /dev/null +++ b/examples/cluster-autoscaler/README.md @@ -0,0 +1,646 @@ +# Cluster Autoscaler Integration with GPU Operator + +This guide shows how to keep workloads off autoscaled GPU nodes until the GPU +stack is actually ready, using a startup taint that is removed when the node +passes a GPU readiness probe. + +When the Cluster Autoscaler adds a GPU node, the node reports `Ready` long +before the GPU Operator has finished installing the driver, container toolkit, +and device plugin. Workloads scheduled during that window fail, or occupy the +node so the autoscaler considers the scale-up satisfied. Scaling a GPU pool +up from zero can stall for a separate reason, covered in the Scale-from-zero +section below. 
+ +The integration adds three components; the GPU Operator itself is unchanged: + +| Component | Role | +|---|---| +| Node pool template (cloud provider) | Applies the startup taint to every new GPU node | +| [Node Problem Detector (NPD)](https://github.com/kubernetes/node-problem-detector) | Runs a GPU readiness probe on each GPU node and publishes the `nvidia.com/GPUReady` node condition | +| [Node Readiness Controller (NRC)](https://github.com/kubernetes-sigs/node-readiness-controller) | Removes the startup taint once the condition is `True` | +| GPU Operator | Unchanged; its operands tolerate the startup taint via existing toleration settings | + +The flow on a freshly provisioned node: + +``` +node pool template applies the startup taint + | + v +new GPU node joins: NoSchedule for regular pods + | cluster-autoscaler is informed the + | taint is temporary via + | --startup-taint-prefix=readiness.k8s.io/ + v +GPU Operator operands roll out (they tolerate the taint) + | + v +NPD probe succeeds (nvidia-smi works) + | + v +node condition nvidia.com/GPUReady = True + | + v +NRC removes the startup taint + | + v +pending GPU pods schedule +``` + +## Names used in this example + +| Object | Value | +|---|---| +| Node condition | `nvidia.com/GPUReady` | +| Startup taint | `readiness.k8s.io/nvidia-gpu-not-ready=pending:NoSchedule` | +| NodeReadinessRule | `nvidia-gpu-readiness` | +| NPD monitor source | `gpu-ready-monitor` | +| NPD ConfigMap / DaemonSet | `npd-gpu-ready-config` / `node-problem-detector` (namespace `kube-system`) | +| NRC bootstrap annotation | `readiness.k8s.io/bootstrap-completed-nvidia-gpu-readiness` (written by NRC after it removes the taint) | +| Simulation marker file | `/var/lib/gpu-ready-sim/ready` (on the node) | + +Two naming constraints to be aware of if you change these: + +- NRC requires the taint key to use the `readiness.k8s.io/` prefix; the + `NodeReadinessRule` CRD rejects other prefixes. 
+- Because of that prefix, you cannot use the Cluster Autoscaler's + auto-detected startup-taint prefix + (`startup-taint.cluster-autoscaler.kubernetes.io/`). Configuring the + autoscaler explicitly is therefore required, not optional: + `--startup-taint-prefix=readiness.k8s.io/` on Cluster Autoscaler 1.36 and + newer, or `--startup-taint=` on older versions. A feature + request to allow the autoscaler's startup-taint prefix in NRC rules is + open: + [node-readiness-controller#279](https://github.com/kubernetes-sigs/node-readiness-controller/issues/279). + +The same taint key appears in four places and must match exactly: the node +pool template, the `NodeReadinessRule`, the GPU Operator toleration values, +and the NPD DaemonSet tolerations in `npd-gpu-ready.yaml`. The autoscaler +flag needs only the `readiness.k8s.io/` prefix (or the full key, if you use +`--startup-taint`). + +## Files in this directory + +| File | Purpose | +|---|---| +| `npd-gpu-ready.yaml` | NPD DaemonSet + RBAC + ConfigMap with the nvidia-smi readiness probe | +| `node-readiness-rule.yaml` | NRC rule that removes the startup taint when the condition is `True` | +| `simulation/npd-gpu-ready-simulation.yaml` | NPD variant whose probe checks a marker file instead of nvidia-smi, for clusters without GPUs | +| `simulation/kind-config.yaml` | kind cluster whose workers join with the startup taint and GPU label already applied, like a node pool template | +| `simulation/kindscaler.sh` | Adds workers to the running kind cluster to simulate a scale-up (vendored from the NRC repo) | +| `simulation/reset.sh` | Re-arms the simulation on a node so the flow can be run again | + +All `kubectl apply -f ` commands in this guide are run from this +directory (`examples/cluster-autoscaler/`) of a repository clone. + +## Prerequisites + +These steps target a real GPU cluster and are referenced from Walkthrough B. 
+For the no-GPU simulation, only step 1 (NRC) is needed — Walkthrough A +applies its own NPD variant and the readiness rule inline. + +### 1. Install the Node Readiness Controller + +NRC is an alpha component ([KEP-5233](https://github.com/kubernetes/enhancements/issues/5233)). +This example was validated with v0.3.0. + +```sh +VERSION=v0.3.0 +kubectl apply -f https://github.com/kubernetes-sigs/node-readiness-controller/releases/download/${VERSION}/crds.yaml +kubectl wait --for condition=established --timeout=30s crd/nodereadinessrules.readiness.node.x-k8s.io +kubectl apply -f https://github.com/kubernetes-sigs/node-readiness-controller/releases/download/${VERSION}/install.yaml +kubectl -n nrr-system rollout status deploy/nrr-controller-manager --timeout=120s +``` + +This deploys the controller into the `nrr-system` namespace. See the +[NRC installation guide](https://node-readiness-controller.sigs.k8s.io/user-guide/installation.html) +for the full-install variant (metrics, validation webhook). + +### 2. Install NPD with the GPU readiness plugin + +```sh +kubectl apply -f npd-gpu-ready.yaml +``` + +This deploys NPD to nodes labeled `nvidia.com/gpu.present=true` — the label +the GPU Operator applies to nodes that Node Feature Discovery (NFD, deployed +as a GPU Operator subchart) has identified as having an NVIDIA GPU — with a +single custom-plugin monitor. The probe +runs `nvidia-smi` every 10 seconds — through the driver-container root +(`/run/nvidia/driver`) or the host root — and publishes the +result as the `nvidia.com/GPUReady` node condition. Both the monitor +configuration and the probe script live in the `npd-gpu-ready-config` +ConfigMap. + +If your cluster already runs NPD (some managed Kubernetes offerings deploy +it), do not install a second copy. 
Add the `gpu-ready-monitor.json` and +`check-gpu-ready.sh` keys from the ConfigMap to your existing NPD +configuration and pass an additional +`--config.custom-plugin-monitor=/config/gpu-ready-monitor.json` flag. + +NPD reads its configuration at startup, and ConfigMap updates do not restart +running pods. After changing the config, run +`kubectl -n kube-system rollout restart daemonset/node-problem-detector` +(substitute your NPD DaemonSet's name). + +### 3. Configure GPU Operator tolerations + +The GPU Operator's operands must run while the startup taint is still on the +node — they are what makes the node GPU ready. Two separate values control this, +and both replace their defaults rather than appending, so keep the existing +entries. Save the following as `values-autoscaler.yaml` and apply it with +`helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator +--create-namespace -f values-autoscaler.yaml`: + +```yaml +daemonsets: + tolerations: + # Default entry -- this list replaces the default, so keep it. + - key: nvidia.com/gpu + operator: Exists + effect: NoSchedule + - key: readiness.k8s.io/nvidia-gpu-not-ready + operator: Exists + effect: NoSchedule + +# The NFD subchart is not covered by daemonsets.tolerations. NFD must run on +# new nodes while they are still tainted so the GPU Operator can label them +# (nvidia.com/gpu.present=true). +node-feature-discovery: + worker: + tolerations: + # First two entries are the chart defaults -- keep them. + - key: node-role.kubernetes.io/control-plane + operator: Equal + value: "" + effect: NoSchedule + - key: nvidia.com/gpu + operator: Exists + effect: NoSchedule + - key: readiness.k8s.io/nvidia-gpu-not-ready + operator: Exists + effect: NoSchedule +``` + +Regular GPU workloads must NOT tolerate the startup taint — the taint is what +keeps them off the node until it is ready. + +### 4. Apply the readiness rule + +On a cluster that already has GPU nodes, preview the rule's effect first. 
+NRC adds the taint to any matching node whose condition is not `True`, in +both enforcement modes, so applying the rule before NPD reports readiness +on every existing node makes those nodes unschedulable for new pods. Set +`dryRun: true` in the rule spec; the controller then reports intended taint +changes in the rule's `status.dryRunResults` without modifying nodes: + +```sh +kubectl apply -f node-readiness-rule.yaml +kubectl get nodereadinessrule nvidia-gpu-readiness -o jsonpath='{.status.dryRunResults}' +``` + +Once the dry run shows no unexpected taint additions, remove `dryRun: true` +and re-apply. + +## Walkthrough A: simulation without GPUs (kind) + +This validates the full flow on a machine without GPUs: nodes that join the +cluster already tainted (as a node pool template would create them), the +NPD → condition → NRC → untaint chain, and a scale-up that adds a fresh node. +The probe checks a marker file instead of running nvidia-smi, so +you control readiness by hand. The GPU Operator is not involved: the kind +config registers each worker with the GPU label (simulates NFD and the +GPU Operator) and the startup taint (simulates the node pool +template). Requires `kind`, `docker`, and `jq` on the local machine. + +1. Create the cluster. The config registers both workers with + `nvidia.com/gpu.present=true` and the startup taint, so they are tainted + from the moment they join: + + ```sh + kind create cluster --config simulation/kind-config.yaml + kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key' + ``` + + Expected: both workers list `readiness.k8s.io/nvidia-gpu-not-ready`. + +2. Install NRC (step 1 of Prerequisites above). + +3. 
Install the simulation NPD and verify the condition appears as `False` + on the workers: + + ```sh + kubectl apply -f simulation/npd-gpu-ready-simulation.yaml + kubectl get node gpu-sim-worker -o jsonpath='{.status.conditions[?(@.type=="nvidia.com/GPUReady")]}' | jq + ``` + + Expected within ~15 seconds: + + ``` + { + "type": "nvidia.com/GPUReady", + "status": "False", + "reason": "GPUReadinessPending", + ... + } + ``` + +4. Apply the readiness rule: + + ```sh + kubectl apply -f node-readiness-rule.yaml + ``` + + NRC adopts the existing taints; they stay in place because the + condition is `False`. + +5. Create a pod that needs a GPU node and confirm it stays `Pending`: + + ```sh + cat < -o jsonpath='{.status.conditions[?(@.type=="nvidia.com/GPUReady")]}' + kubectl get node -o jsonpath='{.spec.taints}' # taint removed once True + ``` + + The pod must stay `Pending` until the condition turns `True` and the taint + is removed, then run `nvidia-smi` successfully. + + This is a basic test: the pod requests one whole GPU (`nvidia.com/gpu: 1`), + which the autoscaler can schedule from a zero pool with no extra setup. The + Scale-from-zero and MIG readiness sections below cover the cases that need + more. + +## Scale-from-zero + +The startup taint keeps pods off a node until it is ready. A separate +autoscaler behavior decides whether a node is created at all, and GPU pools — +MIG pools especially — can run into it. + +To scale a pool up, the autoscaler first checks that the pending pod would +fit on a node from that pool. When the pool already has a node, it copies +that node, which advertises its real labels and resources. When the pool is +at zero, there is no node to copy, so it builds a template node from the +pool's static configuration alone — the instance type and the labels and +taints declared on the pool. 
+ +It then matches the pod against that template the way the scheduler matches +it against a real node: the pod's node affinity and node selectors must match +the template's labels, and its resource requests must fit the template's +resources. For an ordinary pod this holds — CPU and memory come from the +instance type, and the labels it selects on are static. A GPU pod can ask for +two things a zero-pool template does not have, because the GPU Operator adds +them only after the node boots; either one keeps the pool at zero: + +- **A label the GPU Operator sets after the node is configured.** It sets + `nvidia.com/mig.config.state` and `nvidia.com/mig.strategy` once MIG + configuration finishes, so they are never in a zero-pool template. + Requiring `nvidia.com/mig.config.state=success` is a common way to keep + pods off a node until MIG is ready — the startup taint provides that gate + instead. Drop the affinity and select the pool on a static label: the + pool-name label (for example `agentpool` on AKS) or a custom one. +- **A GPU resource the autoscaler cannot infer from the instance type.** + Whole GPUs (`nvidia.com/gpu`) are usually inferable, which is why the + Walkthrough B test scales from zero with no extra setup. Per-profile MIG + resources (`mig.strategy=mixed`, for example `nvidia.com/mig-3g.20gb`) are + not — they appear only after the device plugin reports them. Declare them + on the pool so they enter the template: + - EKS / self-managed ASGs: tag the ASG + `k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/mig-3g.20gb` + = `2`. + - Azure VMSS: the same tag, with `_` in place of `/` (Azure tag names + cannot contain slashes). + - GKE: set the accelerator and `gpu-partition-size` on the node pool. + +## MIG readiness + +The shipped probe only checks that the driver is up (`nvidia-smi` succeeds), +which happens before MIG partitioning finishes. On a MIG pool the taint +therefore comes off before the node can serve MIG pods. 
A MIG pod still does +not land early — the scheduler holds it until the MIG resource is +allocatable — but the taint is no longer what gates it, and the autoscaler +may treat the node as ready before MIG is configured. + +Two adjustments for MIG pools: + +- Set the MIG profile in the pool template (the `nvidia.com/mig.config` + label) so partitioning starts as soon as the node joins. +- To make the taint itself wait for MIG, extend `check-gpu-ready.sh` — for + example, require `nvidia-smi -L` to list the expected MIG devices, or read + the node's `nvidia.com/mig.config.state` label (this needs API access from + the probe) and exit ready only on `success`. + +## Day-2: bootstrap-only vs continuous enforcement + +The rule in this example uses `enforcementMode: bootstrap-only`: after NRC +removes the taint from a node, it records the bootstrap annotation and stops +managing that node. A driver upgrade or MIG reconfiguration later flips +`nvidia.com/GPUReady` to `False` (NPD keeps probing), but the node stays +schedulable. + +One caveat: the bootstrap annotation is written only when NRC removes a +taint. A node that matched the rule while already untainted and ready never +gets the annotation, so NRC taints it the first time its condition turns +`False` — even in `bootstrap-only` mode. The dry-run check in Prerequisites +step 4 shows which nodes are in this state. + +Setting `enforcementMode: continuous` makes NRC re-apply the taint whenever +the condition turns `False`, which extends the same mechanism to day-2 +gating. With `continuous`, a routine driver upgrade makes every node briefly +unschedulable, and new pods do not schedule during MIG reconfiguration. For +the autoscaler use case, `bootstrap-only` is the recommended starting point. + +## Troubleshooting + +**NPD pod crash-loops with `panic: No configuration option for any problem +daemon is specified`.** NPD refuses to start without at least one monitor. 
+Check that the `--config.custom-plugin-monitor` flag is present and points at +the mounted JSON file. + +**The `nvidia.com/GPUReady` condition is absent from the node.** The NPD pod +is probably not running on that node: +`kubectl get pods -n kube-system -l app=node-problem-detector -o wide`. The +shipped DaemonSet tolerates only the startup taint and `nvidia.com/gpu`; if +the GPU node pool carries additional taints, add matching tolerations to the +NPD DaemonSet (and to the GPU Operator operands and NFD). + +**The condition keeps being set to `False` (or flips back) unexpectedly.** +More than one writer may be publishing it — typically a second NPD DaemonSet +under a different name left over from earlier experiments. Find DaemonSets +with `kubectl get ds -A | grep -i -e problem -e npd`, and identify which +client owns the condition through managed fields: + +```sh +kubectl get node --show-managed-fields -o yaml | grep -B3 'GPUReady' +``` + +Each `managedFields` entry names the writing client in its `manager` field. + +**NRC does not remove the taint.** Check, in order: + +1. The condition is actually `True`: + `kubectl get node -o jsonpath='{.status.conditions[?(@.type=="nvidia.com/GPUReady")]}'` +2. The rule's `nodeSelector` matches the node's labels + (`nvidia.com/gpu.present=true` in this example — on a real cluster the + GPU Operator applies that label from NFD's feature labels, so both the + operator and NFD must be running, and NFD must tolerate the startup + taint; see Prerequisites step 3). +3. The node does not already have the + `readiness.k8s.io/bootstrap-completed-nvidia-gpu-readiness` annotation — + in `bootstrap-only` mode NRC ignores nodes that completed bootstrap once. + Remove the annotation to make NRC act again. +4. The NRC controller logs: `kubectl logs -n nrr-system deploy/nrr-controller-manager`. + +**The NodeReadinessRule is rejected on apply.** The taint key must use the +`readiness.k8s.io/` prefix; the CRD validates this. 
+ +**The nvidia-smi probe never succeeds on a real GPU node.** Find the NPD pod +on the affected node +(`kubectl get pods -n kube-system -l app=node-problem-detector -o wide`) and +run the probe by hand: + +```sh +kubectl exec -n kube-system -- /config/check-gpu-ready.sh; echo "exit=$?" +``` + +Exit 1 means ready, exit 0 means not ready (NPD's plugin contract is built +for problem detection, so the codes are inverted compared to a typical health +check). If it stays at 0, check that the driver install finished +(`/run/nvidia/driver` populated on the host for driver-container installs) +and that the DaemonSet runs privileged with `/` mounted at `/host` with +`mountPropagation: HostToContainer` — without propagation, the bind mount +the driver container creates at `/run/nvidia/driver` is invisible to an NPD +pod that started before the driver installed (restarting the NPD pod hides +the problem, so it looks intermittent). + +A node whose probe never succeeds — failed hardware, for example — stays +tainted and `Ready` indefinitely. This pattern does not deprovision such +nodes; that takes admin intervention or node pool health checks. + +**Workloads schedule onto the node before the GPU is ready.** The workload +tolerates the startup taint. Only infrastructure that participates in making +the node ready (GPU Operator operands, NFD, NPD) should tolerate it. + +## Cleanup + +Remove the pieces in this order: + +1. Remove the startup taint from the node pool template, and the + `--startup-taint-prefix` / `--startup-taint` flag if you set one. + Skipping this leaves every new node + tainted with nothing in place to untaint it. +2. Delete the rule while NRC is still installed: + `kubectl delete -f node-readiness-rule.yaml`. NRC's cleanup finalizer + removes the rule's taint from any node still carrying it; if NRC is + uninstalled first, the deletion hangs on the finalizer and tainted nodes + stay tainted. +3. 
Uninstall NRC and delete NPD: `kubectl delete -f npd-gpu-ready.yaml` + (or `simulation/npd-gpu-ready-simulation.yaml`). +4. Optionally remove the readiness toleration entries from the GPU Operator + values. + +Stale `nvidia.com/GPUReady` conditions remain on nodes until the Node object +is deleted or another writer overwrites them; they are inert without NRC. +For the kind simulation, `kind delete cluster --name gpu-sim` removes +everything. diff --git a/examples/cluster-autoscaler/node-readiness-rule.yaml b/examples/cluster-autoscaler/node-readiness-rule.yaml new file mode 100644 index 000000000..c45239760 --- /dev/null +++ b/examples/cluster-autoscaler/node-readiness-rule.yaml @@ -0,0 +1,33 @@ +# NodeReadinessRule for the Node Readiness Controller (NRC). +# +# NRC removes the startup taint from a node once the nvidia.com/GPUReady +# condition (published by NPD, see npd-gpu-ready.yaml) is True. +# +# Notes: +# - The taint key must use the readiness.k8s.io/ prefix; the CRD rejects +# other prefixes. The same key must appear in the node pool template, +# the cluster-autoscaler --startup-taint flag, the GPU Operator +# toleration values, and the NPD DaemonSet tolerations. See README.md. +# - On a cluster with existing GPU nodes, preview with spec.dryRun: true +# first -- NRC adds the taint to matching nodes whose condition is not +# True, in both enforcement modes. See README.md prerequisites step 4. +# - bootstrap-only acts once per node: after removing the taint, NRC +# records the readiness.k8s.io/bootstrap-completed-nvidia-gpu-readiness +# annotation on the node and ignores it afterwards. Use `continuous` to +# also re-taint nodes whose condition later turns False (day-2 gating). 
+apiVersion: readiness.node.x-k8s.io/v1alpha1 +kind: NodeReadinessRule +metadata: + name: nvidia-gpu-readiness +spec: + conditions: + - type: nvidia.com/GPUReady + requiredStatus: "True" + taint: + key: readiness.k8s.io/nvidia-gpu-not-ready + effect: NoSchedule + value: pending + enforcementMode: bootstrap-only + nodeSelector: + matchLabels: + nvidia.com/gpu.present: "true" diff --git a/examples/cluster-autoscaler/npd-gpu-ready.yaml b/examples/cluster-autoscaler/npd-gpu-ready.yaml new file mode 100644 index 000000000..bf44c81f2 --- /dev/null +++ b/examples/cluster-autoscaler/npd-gpu-ready.yaml @@ -0,0 +1,173 @@ +# Node Problem Detector (NPD) with a custom plugin that publishes the +# nvidia.com/GPUReady node condition. The probe runs nvidia-smi against the +# node's driver installation; the Node Readiness Controller removes the +# startup taint once the condition is True. +# +# See README.md in this directory for the full setup guide. +--- +apiVersion: v1 +kind: ServiceAccount +metadata: + name: node-problem-detector + namespace: kube-system +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: node-problem-detector +rules: + - apiGroups: [""] + resources: ["nodes"] + verbs: ["get"] + - apiGroups: [""] + resources: ["nodes/status"] + verbs: ["patch"] + - apiGroups: [""] + resources: ["events"] + verbs: ["create", "patch", "update"] +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRoleBinding +metadata: + name: node-problem-detector +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: ClusterRole + name: node-problem-detector +subjects: + - kind: ServiceAccount + name: node-problem-detector + namespace: kube-system +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: npd-gpu-ready-config + namespace: kube-system +data: + gpu-ready-monitor.json: | + { + "plugin": "custom", + "pluginConfig": { + "invoke_interval": "10s", + "timeout": "5s", + "max_output_length": 80, + "concurrency": 1 + }, + "source": 
"gpu-ready-monitor", + "metricsReporting": false, + "conditions": [ + { + "type": "nvidia.com/GPUReady", + "reason": "GPUReadinessPending", + "message": "GPU readiness probe has not succeeded yet" + } + ], + "rules": [ + { + "type": "permanent", + "condition": "nvidia.com/GPUReady", + "reason": "GPUReady", + "path": "/config/check-gpu-ready.sh", + "timeout": "5s" + } + ] + } + check-gpu-ready.sh: | + #!/bin/sh + # Exit 1 when the GPU is ready, exit 0 when it is not. + # + # NPD's permanent-rule contract is built for problem detection: exit 0 + # means "no problem found" (condition stays False) and exit 1 means + # "problem found" (condition becomes True). nvidia.com/GPUReady reports + # a healthy state rather than a problem, so the exit codes are inverted + # compared to a typical health-check script. + # + # The driver can be installed two ways; probe both locations: + # - driver container: rooted at /run/nvidia/driver on the host + # - host-installed driver: nvidia-smi on the host PATH + if chroot /host/run/nvidia/driver nvidia-smi >/dev/null 2>&1; then + echo "nvidia-smi succeeded (driver container)" + exit 1 + fi + if chroot /host nvidia-smi >/dev/null 2>&1; then + echo "nvidia-smi succeeded (host driver)" + exit 1 + fi + echo "nvidia-smi failed: GPU driver not ready" + exit 0 +--- +apiVersion: apps/v1 +kind: DaemonSet +metadata: + name: node-problem-detector + namespace: kube-system + labels: + app: node-problem-detector +spec: + selector: + matchLabels: + app: node-problem-detector + template: + metadata: + labels: + app: node-problem-detector + spec: + serviceAccountName: node-problem-detector + priorityClassName: system-node-critical + nodeSelector: + nvidia.com/gpu.present: "true" + tolerations: + # NPD must run while the startup taint is still on the node -- + # it publishes the condition that gets the taint removed. + # If the GPU node pool carries additional taints, add matching + # tolerations here. 
+ - key: readiness.k8s.io/nvidia-gpu-not-ready + operator: Exists + effect: NoSchedule + - key: nvidia.com/gpu + operator: Exists + effect: NoSchedule + containers: + - name: npd + image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.20 + command: + - /node-problem-detector + - --logtostderr + - --prometheus-port=0 + - --config.custom-plugin-monitor=/config/gpu-ready-monitor.json + env: + - name: NODE_NAME + valueFrom: + fieldRef: + fieldPath: spec.nodeName + securityContext: + # nvidia-smi opens /dev/nvidia* device nodes through the + # hostPath mount, which requires a privileged container. + privileged: true + resources: + requests: + cpu: 10m + memory: 32Mi + limits: + memory: 128Mi + volumeMounts: + - name: config + mountPath: /config + - name: host + mountPath: /host + readOnly: true + # The driver container bind-mounts /run/nvidia/driver on the + # host after this pod starts; without propagation that mount + # stays invisible here and the probe never succeeds. + mountPropagation: HostToContainer + volumes: + - name: config + configMap: + name: npd-gpu-ready-config + # The probe script is executed directly from the mount. + defaultMode: 0755 + - name: host + hostPath: + path: / + type: Directory diff --git a/examples/cluster-autoscaler/simulation/kind-config.yaml b/examples/cluster-autoscaler/simulation/kind-config.yaml new file mode 100644 index 000000000..37653802f --- /dev/null +++ b/examples/cluster-autoscaler/simulation/kind-config.yaml @@ -0,0 +1,32 @@ +# kind cluster for the autoscaling simulation. Both workers register with +# the GPU node label and the startup taint already in place, the way a +# cloud node pool template provisions nodes: there is no window where the +# node is schedulable before the taint exists. +# +# Two workers are required: kindscaler.sh clones -worker2 as the +# template when adding nodes. 
+# +# Adapted from the node-readiness-controller testing setup +# (config/testing/kind/kind-3node-config.yaml in +# kubernetes-sigs/node-readiness-controller). +kind: Cluster +apiVersion: kind.x-k8s.io/v1alpha4 +name: gpu-sim +nodes: + - role: control-plane + - role: worker + kubeadmConfigPatches: + - | + kind: JoinConfiguration + nodeRegistration: + kubeletExtraArgs: + node-labels: "nvidia.com/gpu.present=true" + register-with-taints: "readiness.k8s.io/nvidia-gpu-not-ready=pending:NoSchedule" + - role: worker + kubeadmConfigPatches: + - | + kind: JoinConfiguration + nodeRegistration: + kubeletExtraArgs: + node-labels: "nvidia.com/gpu.present=true" + register-with-taints: "readiness.k8s.io/nvidia-gpu-not-ready=pending:NoSchedule" diff --git a/examples/cluster-autoscaler/simulation/kindscaler.sh b/examples/cluster-autoscaler/simulation/kindscaler.sh new file mode 100755 index 000000000..b899676eb --- /dev/null +++ b/examples/cluster-autoscaler/simulation/kindscaler.sh @@ -0,0 +1,114 @@ +#!/bin/bash + +# Copyright The Kubernetes Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# This script is a simplified version of kindscaler, originally from: +# https://github.com/lobuhi/kindscaler/ +# +# It is modified to specifically support the testing needs of the +# node-readiness-controller project by only scaling worker nodes +# and using a specific node as a template. 
+# +# Vendored unmodified for the GPU Operator cluster-autoscaler example from +# kubernetes-sigs/node-readiness-controller (hack/test-workloads/kindscaler.sh). +# New nodes are cloned from -worker2, so the kind cluster must +# have at least two workers (see kind-config.yaml); added nodes join with the +# same labels and taints the template node registered with. +# +set -euxo pipefail + +# Check for required commands +if ! command -v kind &> /dev/null; then + echo "kind command not found, please install kind to use this script." + exit 1 +fi + +# Check input parameters +if [ $# -lt 2 ]; then + echo "Usage: $0 " + echo "count must be a positive integer" + exit 1 +fi + +CLUSTER_NAME=$1 +COUNT=$2 +ROLE="worker" # Hardcoded for our testing purposes + +# Validate count +if ! [[ "$COUNT" =~ ^[0-9]+$ ]] || [ "$COUNT" -le 0 ]; then + echo "Count must be a positive integer" + exit 1 +fi + +# Get existing nodes and determine the highest node index for the given role +highest_index=0 +existing_nodes=$(kind get nodes --name "$CLUSTER_NAME") +for node in $existing_nodes; do + if [[ $node == "$CLUSTER_NAME-$ROLE"* ]]; then + suffix=$(echo $node | sed -e "s/^$CLUSTER_NAME-$ROLE//") + if [[ "$suffix" =~ ^[0-9]+$ ]] && [ "$suffix" -gt "$highest_index" ]; then + highest_index=$suffix + fi + fi +done + +# Add nodes based on the highest found index and the count specified +start_index=$(($highest_index + 1)) +end_index=$(($highest_index + $COUNT)) +for i in $(seq $start_index $end_index); do + # Use nrr-test-worker2 as the template for all new worker nodes. + # This ensures they get the correct labels and taints. 
+    TEMPLATE_NODE_NAME="$CLUSTER_NAME-worker2"
+    NEW_NODE_NAME="$CLUSTER_NAME-worker$i"
+
+    # Copy the kubeadm file from the template node
+    docker cp $TEMPLATE_NODE_NAME:/kind/kubeadm.conf kubeadm-$i.conf > /dev/null 2>&1
+
+    # Replace the container role name with specific node name in the kubeadm file
+    sed -i.bak "s/$TEMPLATE_NODE_NAME$/$NEW_NODE_NAME/g" "./kubeadm-$i.conf"
+    rm -f "./kubeadm-$i.conf.bak"
+
+    IMAGE=$(docker ps | grep $CLUSTER_NAME | awk '{print $2}' | head -1)
+
+    echo -n "Adding $NEW_NODE_NAME node to $CLUSTER_NAME cluster... "
+    docker run --name $NEW_NODE_NAME --hostname $NEW_NODE_NAME \
+        --label io.x-k8s.kind.role=$ROLE --privileged \
+        --security-opt seccomp=unconfined --security-opt apparmor=unconfined \
+        --tmpfs /tmp --tmpfs /run --volume /var \
+        --volume /lib/modules:/lib/modules:ro -e KIND_EXPERIMENTAL_CONTAINERD_SNAPSHOTTER \
+        --detach --tty --label io.x-k8s.kind.cluster=$CLUSTER_NAME --net kind \
+        --restart=on-failure:1 --init=false $IMAGE > /dev/null 2>&1
+
+    # wait for cgroupv2 initialization before docker exec
+    wait_count=0
+    while [ $wait_count -lt 30 ]; do
+        status=$(docker exec $NEW_NODE_NAME systemctl is-system-running 2>/dev/null || true)
+        if [[ "$status" == "running" ]]; then
+            break
+        fi
+        sleep 2
+        wait_count=$((wait_count + 1))
+    done
+    status=$(docker exec $NEW_NODE_NAME systemctl is-system-running 2>/dev/null || true)
+    if [[ $wait_count -ge 30 ]] && [ "$status" != "running" ]; then
+        echo "Container $NEW_NODE_NAME failed to initialize systemd"
+        echo "Review $NEW_NODE_NAME logs and remove container"
+        exit 1
+    fi
+    docker cp kubeadm-$i.conf $NEW_NODE_NAME:/kind/kubeadm.conf > /dev/null 2>&1
+    docker exec --privileged $NEW_NODE_NAME kubeadm join --config /kind/kubeadm.conf --skip-phases=preflight --v=6 > /dev/null 2>&1
+    rm -f kubeadm-*.conf
+    echo "Done!"
+done
diff --git a/examples/cluster-autoscaler/simulation/npd-gpu-ready-simulation.yaml b/examples/cluster-autoscaler/simulation/npd-gpu-ready-simulation.yaml
new file mode 100644
index 000000000..be9585534
--- /dev/null
+++ b/examples/cluster-autoscaler/simulation/npd-gpu-ready-simulation.yaml
@@ -0,0 +1,163 @@
+# Simulation variant of npd-gpu-ready.yaml for clusters without GPUs
+# (for example, a local kind cluster). Identical to the production manifest
+# except for two things: the probe script checks for a marker file on the
+# node instead of running nvidia-smi, and the container is not privileged
+# (reading a file does not need device access).
+#
+# Mark the simulated GPU ready (kind node names are docker containers):
+#   docker exec mkdir -p /var/lib/gpu-ready-sim
+#   docker exec touch /var/lib/gpu-ready-sim/ready
+# Flip it back:
+#   docker exec rm /var/lib/gpu-ready-sim/ready
+#
+# The resource names match the production manifest on purpose. Two NPD
+# stacks with different names can run at the same time and overwrite each
+# other's condition updates. Apply one manifest or the other, not both.
+---
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: node-problem-detector
+  namespace: kube-system
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: node-problem-detector
+rules:
+  - apiGroups: [""]
+    resources: ["nodes"]
+    verbs: ["get"]
+  - apiGroups: [""]
+    resources: ["nodes/status"]
+    verbs: ["patch"]
+  - apiGroups: [""]
+    resources: ["events"]
+    verbs: ["create", "patch", "update"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: node-problem-detector
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: node-problem-detector
+subjects:
+  - kind: ServiceAccount
+    name: node-problem-detector
+    namespace: kube-system
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: npd-gpu-ready-config
+  namespace: kube-system
+data:
+  gpu-ready-monitor.json: |
+    {
+      "plugin": "custom",
+      "pluginConfig": {
+        "invoke_interval": "10s",
+        "timeout": "5s",
+        "max_output_length": 80,
+        "concurrency": 1
+      },
+      "source": "gpu-ready-monitor",
+      "metricsReporting": false,
+      "conditions": [
+        {
+          "type": "nvidia.com/GPUReady",
+          "reason": "GPUReadinessPending",
+          "message": "GPU readiness probe has not succeeded yet"
+        }
+      ],
+      "rules": [
+        {
+          "type": "permanent",
+          "condition": "nvidia.com/GPUReady",
+          "reason": "GPUReady",
+          "path": "/config/check-gpu-ready.sh",
+          "timeout": "5s"
+        }
+      ]
+    }
+  check-gpu-ready.sh: |
+    #!/bin/sh
+    # Simulation probe: stands in for nvidia-smi on nodes without GPUs.
+    # The simulated GPU is ready when the marker file exists on the node.
+    #
+    # Exit 1 when ready, exit 0 when not. NPD's permanent-rule contract is
+    # built for problem detection (exit 1 = problem found = condition True),
+    # so the exit codes are inverted compared to a typical health check.
+    if [ -f /host/var/lib/gpu-ready-sim/ready ]; then
+      echo "marker file present: simulated GPU ready"
+      exit 1
+    fi
+    echo "marker file absent: simulated GPU not ready"
+    exit 0
+---
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: node-problem-detector
+  namespace: kube-system
+  labels:
+    app: node-problem-detector
+spec:
+  selector:
+    matchLabels:
+      app: node-problem-detector
+  template:
+    metadata:
+      labels:
+        app: node-problem-detector
+    spec:
+      serviceAccountName: node-problem-detector
+      priorityClassName: system-node-critical
+      nodeSelector:
+        nvidia.com/gpu.present: "true"
+      tolerations:
+        - key: readiness.k8s.io/nvidia-gpu-not-ready
+          operator: Exists
+          effect: NoSchedule
+        - key: nvidia.com/gpu
+          operator: Exists
+          effect: NoSchedule
+      containers:
+        - name: npd
+          image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.20
+          command:
+            - /node-problem-detector
+            - --logtostderr
+            - --prometheus-port=0
+            - --config.custom-plugin-monitor=/config/gpu-ready-monitor.json
+          env:
+            - name: NODE_NAME
+              valueFrom:
+                fieldRef:
+                  fieldPath: spec.nodeName
+          resources:
+            requests:
+              cpu: 10m
+              memory: 32Mi
+            limits:
+              memory: 128Mi
+          volumeMounts:
+            - name: config
+              mountPath: /config
+            - name: host
+              mountPath: /host
+              readOnly: true
+              # Matches the production manifest; see the comment there.
+              mountPropagation: HostToContainer
+      volumes:
+        - name: config
+          configMap:
+            name: npd-gpu-ready-config
+            # The probe script is executed directly from the mount.
+            defaultMode: 0755
+        - name: host
+          hostPath:
+            path: /
+            type: Directory
diff --git a/examples/cluster-autoscaler/simulation/reset.sh b/examples/cluster-autoscaler/simulation/reset.sh
new file mode 100755
index 000000000..f10febfb4
--- /dev/null
+++ b/examples/cluster-autoscaler/simulation/reset.sh
@@ -0,0 +1,37 @@
+#!/bin/sh
+# Reset the simulation so the flow can be run again on the same node.
+# Usage: ./reset.sh
+set -e
+
+NODE="${1:?usage: $0 }"
+
+# Remove the marker file. NPD flips nvidia.com/GPUReady back to False on
+# its next probe interval (10s).
+docker exec "$NODE" rm -f /var/lib/gpu-ready-sim/ready
+
+# Wait for the condition to turn False before re-tainting. Re-tainting while
+# it is still True triggers an NRC reconcile that can remove the new taint
+# and re-write the bootstrap annotation, undoing the reset.
+echo "Waiting for nvidia.com/GPUReady to turn False..."
+i=0
+while :; do
+  status="$(kubectl get node "$NODE" -o jsonpath='{.status.conditions[?(@.type=="nvidia.com/GPUReady")].status}')"
+  [ "$status" = "False" ] && break
+  i=$((i + 1))
+  if [ "$i" -ge 30 ]; then
+    echo "condition did not turn False after 60s; is the NPD pod running on $NODE?" >&2
+    exit 1
+  fi
+  sleep 2
+done
+
+# Re-apply the startup taint. In production the node pool template applies
+# it when a node is created.
+kubectl taint node "$NODE" readiness.k8s.io/nvidia-gpu-not-ready=pending:NoSchedule --overwrite
+
+# bootstrap-only mode acts once per node: after removing the taint, NRC
+# records this annotation and ignores the node afterwards. Remove it so NRC
+# manages the node again.
+kubectl annotate node "$NODE" readiness.k8s.io/bootstrap-completed-nvidia-gpu-readiness-
+
+echo "Reset complete. The taint stays until the marker file is recreated."
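
Reviewer note: the inverted exit codes in the simulation probe (exit 1 when ready, exit 0 when not) are easy to misread, so a small sketch for checking the convention locally, without a cluster, may help. The helper names `probe_marker`, `condition_for_exit`, and `gpu_ready_status` are hypothetical and not part of this patch. Treating any exit code other than 0 or 1 as `Unknown` is an assumption about NPD's custom plugin behavior that should be confirmed against the NPD documentation.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; they are not part of the patch.

# Mirrors the simulation probe: exit 1 when the marker file exists (ready),
# exit 0 when it does not (not ready).
probe_marker() {
    if [ -f "$1" ]; then
        return 1
    fi
    return 0
}

# Maps a probe exit code to the nvidia.com/GPUReady condition status.
# Assumption: any exit code other than 0 or 1 is reported as Unknown.
condition_for_exit() {
    case "$1" in
        0) echo "False" ;;
        1) echo "True" ;;
        *) echo "Unknown" ;;
    esac
}

# Runs the probe against a marker path and prints the resulting status.
gpu_ready_status() {
    rc=0
    probe_marker "$1" || rc=$?
    condition_for_exit "$rc"
}
```

Sourcing this file and calling `gpu_ready_status` with a marker path prints the status the condition would be expected to take under the stated assumptions. This makes it easier to see why reset.sh waits for `False` after deleting the marker before it re-applies the taint.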