diff --git a/examples/cluster-autoscaler/README.md b/examples/cluster-autoscaler/README.md new file mode 100644 index 000000000..df2cf4f7c --- /dev/null +++ b/examples/cluster-autoscaler/README.md @@ -0,0 +1,646 @@ +# Cluster Autoscaler Integration with GPU Operator + +This guide shows how to keep workloads off autoscaled GPU nodes until the GPU +stack is actually ready, using a startup taint that is removed when the node +passes a GPU readiness probe. + +When the Cluster Autoscaler adds a GPU node, the node reports `Ready` long +before the GPU Operator has finished installing the driver, container toolkit, +and device plugin. Workloads scheduled during that window fail, or occupy the +node so the autoscaler considers the scale-up satisfied. Scaling a GPU pool +up from zero can stall for a separate reason, covered in the Scale-from-zero +section below. + +The integration adds three components; the GPU Operator itself is unchanged: + +| Component | Role | +|---|---| +| Node pool template (cloud provider) | Applies the startup taint to every new GPU node | +| [Node Problem Detector (NPD)](https://github.com/kubernetes/node-problem-detector) | Runs a GPU readiness probe on each GPU node and publishes the `nvidia.com/GPUReady` node condition | +| [Node Readiness Controller (NRC)](https://github.com/kubernetes-sigs/node-readiness-controller) | Removes the startup taint once the condition is `True` | +| GPU Operator | Unchanged; its operands tolerate the startup taint via existing toleration settings | + +The flow on a freshly provisioned node: + +``` +node pool template applies the startup taint + | + v +new GPU node joins: NoSchedule for regular pods + | cluster-autoscaler is informed the + | taint is temporary via + | --startup-taint-prefix=readiness.k8s.io/ + v +GPU Operator operands roll out (they tolerate the taint) + | + v +NPD probe succeeds (nvidia-smi works) + | + v +node condition nvidia.com/GPUReady = True + | + v +NRC removes the startup taint + | + v 
+pending GPU pods schedule +``` + +## Names used in this example + +| Object | Value | +|---|---| +| Node condition | `nvidia.com/GPUReady` | +| Startup taint | `readiness.k8s.io/nvidia-gpu-not-ready=pending:NoSchedule` | +| NodeReadinessRule | `nvidia-gpu-readiness` | +| NPD monitor source | `gpu-ready-monitor` | +| NPD ConfigMap / DaemonSet | `npd-gpu-ready-config` / `node-problem-detector` (namespace `kube-system`) | +| NRC bootstrap annotation | `readiness.k8s.io/bootstrap-completed-nvidia-gpu-readiness` (written by NRC after it removes the taint) | +| Simulation marker file | `/var/lib/gpu-ready-sim/ready` (on the node) | + +Two naming constraints to be aware of if you change these: + +- NRC requires the taint key to use the `readiness.k8s.io/` prefix; the + `NodeReadinessRule` CRD rejects other prefixes. +- Because of that prefix, you cannot use the Cluster Autoscaler's + auto-detected startup-taint prefix + (`startup-taint.cluster-autoscaler.kubernetes.io/`). Configuring the + autoscaler explicitly is therefore required, not optional: + `--startup-taint-prefix=readiness.k8s.io/` on Cluster Autoscaler 1.36 and + newer, or `--startup-taint=` on older versions. A feature + request to allow the autoscaler's startup-taint prefix in NRC rules is + open: + [node-readiness-controller#279](https://github.com/kubernetes-sigs/node-readiness-controller/issues/279). + +The same taint key appears in four places and must match exactly: the node +pool template, the `NodeReadinessRule`, the GPU Operator toleration values, +and the NPD DaemonSet tolerations in `npd-gpu-ready.yaml`. The autoscaler +flag needs only the `readiness.k8s.io/` prefix (or the full key, if you use +`--startup-taint`). 
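Because the taint key must match in several places, a quick local check can catch a mismatch before anything is applied. The following is a minimal sketch, not part of the shipped example: `check_taint_key` is a hypothetical helper, and it only inspects files you pass to it. The node pool template and the autoscaler flag live outside this directory and still need a manual check.

```shell
#!/bin/sh
# Sketch (assumption: not shipped with this example). Reports whether each
# listed file mentions the startup taint key used throughout this guide.
TAINT_KEY="readiness.k8s.io/nvidia-gpu-not-ready"

# check_taint_key KEY FILE...
# Prints one line per file and returns 1 if any file does not contain KEY.
check_taint_key() {
  key=$1
  shift
  missing=0
  for f in "$@"; do
    # -F treats the key as a fixed string, so "." and "/" are matched literally.
    if grep -qF -- "$key" "$f" 2>/dev/null; then
      echo "ok: $f"
    else
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Example call, run from examples/cluster-autoscaler/ (values-autoscaler.yaml
# is the file name suggested in Prerequisites step 3):
# check_taint_key "$TAINT_KEY" node-readiness-rule.yaml npd-gpu-ready.yaml values-autoscaler.yaml
```

A plain substring match is deliberately loose: it confirms the key is present somewhere in each file, not that it sits in the right field.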
+ +## Files in this directory + +| File | Purpose | +|---|---| +| `npd-gpu-ready.yaml` | NPD DaemonSet + RBAC + ConfigMap with the nvidia-smi readiness probe | +| `node-readiness-rule.yaml` | NRC rule that removes the startup taint when the condition is `True` | +| `simulation/npd-gpu-ready-simulation.yaml` | NPD variant whose probe checks a marker file instead of nvidia-smi, for clusters without GPUs | +| `simulation/kind-config.yaml` | kind cluster whose workers join with the startup taint and GPU label already applied, like a node pool template | +| `simulation/kindscaler.sh` | Adds workers to the running kind cluster to simulate a scale-up (vendored from the NRC repo) | +| `simulation/reset.sh` | Re-arms the simulation on a node so the flow can be run again | + +All `kubectl apply -f ` commands in this guide are run from this +directory (`examples/cluster-autoscaler/`) of a repository clone. + +## Prerequisites + +These steps target a real GPU cluster and are referenced from Walkthrough B. +For the no-GPU simulation, only step 1 (NRC) is needed — Walkthrough A +applies its own NPD variant and the readiness rule inline. + +### 1. Install the Node Readiness Controller + +NRC is an alpha component ([KEP-5233](https://github.com/kubernetes/enhancements/issues/5233)). +This example was validated with v0.3.0. + +```sh +VERSION=v0.3.0 +kubectl apply -f https://github.com/kubernetes-sigs/node-readiness-controller/releases/download/${VERSION}/crds.yaml +kubectl wait --for condition=established --timeout=30s crd/nodereadinessrules.readiness.node.x-k8s.io +kubectl apply -f https://github.com/kubernetes-sigs/node-readiness-controller/releases/download/${VERSION}/install.yaml +kubectl -n nrr-system rollout status deploy/nrr-controller-manager --timeout=120s +``` + +This deploys the controller into the `nrr-system` namespace. 
See the +[NRC installation guide](https://node-readiness-controller.sigs.k8s.io/user-guide/installation.html) +for the full-install variant (metrics, validation webhook). + +### 2. Install NPD with the GPU readiness plugin + +```sh +kubectl apply -f npd-gpu-ready.yaml +``` + +This deploys NPD to nodes labeled `nvidia.com/gpu.present=true` — the label +the GPU Operator applies to nodes that Node Feature Discovery (NFD, deployed +as a GPU Operator subchart) has identified as having an NVIDIA GPU — with a +single custom-plugin monitor. The probe +runs `nvidia-smi` every 10 seconds — through the driver-container root +(`/run/nvidia/driver`) or the host root — and publishes the +result as the `nvidia.com/GPUReady` node condition. Both the monitor +configuration and the probe script live in the `npd-gpu-ready-config` +ConfigMap. + +If your cluster already runs NPD (some managed Kubernetes offerings deploy +it), do not install a second copy. Add the `gpu-ready-monitor.json` and +`check-gpu-ready.sh` keys from the ConfigMap to your existing NPD +configuration and pass an additional +`--config.custom-plugin-monitor=/config/gpu-ready-monitor.json` flag. + +NPD reads its configuration at startup, and ConfigMap updates do not restart +running pods. After changing the config, run +`kubectl -n kube-system rollout restart daemonset/node-problem-detector` +(substitute your NPD DaemonSet's name). + +### 3. Configure GPU Operator tolerations + +The GPU Operator's operands must run while the startup taint is still on the +node — they are what makes the node GPU ready. Two separate values control this, +and both replace their defaults rather than appending, so keep the existing +entries. Save the following as `values-autoscaler.yaml` and apply it with +`helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator +--create-namespace -f values-autoscaler.yaml`: + +```yaml +daemonsets: + tolerations: + # Default entry -- this list replaces the default, so keep it. 
+ - key: nvidia.com/gpu + operator: Exists + effect: NoSchedule + - key: readiness.k8s.io/nvidia-gpu-not-ready + operator: Exists + effect: NoSchedule + +# The NFD subchart is not covered by daemonsets.tolerations. NFD must run on +# new nodes while they are still tainted so the GPU Operator can label them +# (nvidia.com/gpu.present=true). +node-feature-discovery: + worker: + tolerations: + # First two entries are the chart defaults -- keep them. + - key: node-role.kubernetes.io/control-plane + operator: Equal + value: "" + effect: NoSchedule + - key: nvidia.com/gpu + operator: Exists + effect: NoSchedule + - key: readiness.k8s.io/nvidia-gpu-not-ready + operator: Exists + effect: NoSchedule +``` + +Regular GPU workloads must NOT tolerate the startup taint; the taint is what +keeps them off the node until it is ready. + +### 4. Apply the readiness rule + +On a cluster that already has GPU nodes, preview the rule's effect first. +NRC adds the taint to any matching node whose condition is not `True` (in +both enforcement modes), so applying the rule before NPD reports readiness +on every existing node makes those nodes unschedulable for new pods. Set +`dryRun: true` in the rule spec; the controller then reports intended taint +changes in the rule's `status.dryRunResults` without modifying nodes: + +```sh +kubectl apply -f node-readiness-rule.yaml +kubectl get nodereadinessrule nvidia-gpu-readiness -o jsonpath='{.status.dryRunResults}' +``` + +Once the dry run shows no unexpected taint additions, remove `dryRun: true` +and re-apply. + +## Walkthrough A: simulation without GPUs (kind) + +This validates the full flow on a machine without GPUs: nodes that join the +cluster already tainted (as a node pool template would create them), the +NPD → condition → NRC → untaint chain, and a scale-up that adds a fresh node. +The probe checks a marker file instead of running nvidia-smi, so +you control readiness by hand. 
The GPU Operator is not involved: the kind +config registers each worker with the GPU label (simulates NFD and the +GPU Operator) and the startup taint (simulates the node pool +template). Requires `kind`, `docker`, and `jq` on the local machine. + +1. Create the cluster. The config registers both workers with + `nvidia.com/gpu.present=true` and the startup taint, so they are tainted + from the moment they join: + + ```sh + kind create cluster --config simulation/kind-config.yaml + kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key' + ``` + + Expected: both workers list `readiness.k8s.io/nvidia-gpu-not-ready`. + +2. Install NRC (step 1 of Prerequisites above). + +3. Install the simulation NPD and verify the condition appears as `False` + on the workers: + + ```sh + kubectl apply -f simulation/npd-gpu-ready-simulation.yaml + kubectl get node gpu-sim-worker -o jsonpath='{.status.conditions[?(@.type=="nvidia.com/GPUReady")]}' | jq + ``` + + Expected within ~15 seconds: + + ``` + { + "type": "nvidia.com/GPUReady", + "status": "False", + "reason": "GPUReadinessPending", + ... + } + ``` + +4. Apply the readiness rule: + + ```sh + kubectl apply -f node-readiness-rule.yaml + ``` + + NRC adopts the existing taints; they stay in place because the + condition is `False`. + +5. Create a pod that needs a GPU node and confirm it stays `Pending`: + + ```sh + cat < -o jsonpath='{.status.conditions[?(@.type=="nvidia.com/GPUReady")]}' + kubectl get node -o jsonpath='{.spec.taints}' # taint removed once True + ``` + + The pod must stay `Pending` until the condition turns `True` and the taint + is removed, then run `nvidia-smi` successfully. + + This is a basic test: the pod requests one whole GPU (`nvidia.com/gpu: 1`), + which the autoscaler can schedule from a zero pool with no extra setup. The + Scale-from-zero and MIG readiness sections below cover the cases that need + more. 
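The manual condition checks in the walkthrough above can be wrapped in a polling loop when you want to script the test. This is a rough sketch under stated assumptions: `get_gpu_ready_status` and `wait_for_gpu_ready` are hypothetical helper names, the defaults of 30 tries and 10 seconds are arbitrary, and the JSONPath expression is the one used in this guide with `.status` appended to return only the status string.

```shell
#!/bin/sh
# Sketch (assumption: not shipped with this example).

# get_gpu_ready_status NODE
# Prints the status of the nvidia.com/GPUReady condition, or nothing if absent.
get_gpu_ready_status() {
  kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="nvidia.com/GPUReady")].status}'
}

# wait_for_gpu_ready NODE [TRIES] [INTERVAL_SECONDS]
# Polls until the condition is True. Returns 0 on success, 1 after TRIES checks.
wait_for_gpu_ready() {
  node=$1
  tries=${2:-30}
  interval=${3:-10}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # A failed lookup is treated the same as "not ready yet".
    status=$(get_gpu_ready_status "$node" 2>/dev/null || true)
    if [ "$status" = "True" ]; then
      echo "$node: nvidia.com/GPUReady=True"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "$node: nvidia.com/GPUReady not True after $tries checks" >&2
  return 1
}
```

Keeping the lookup in its own function makes the loop easy to try without a cluster: redefine `get_gpu_ready_status` to print a fixed value and call `wait_for_gpu_ready` with an interval of `0`.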
+ +## Scale-from-zero + +The startup taint keeps pods off a node until it is ready. A separate +autoscaler behavior decides whether a node is created at all, and GPU pools — +MIG pools especially — can run into it. + +To scale a pool up, the autoscaler first checks that the pending pod would +fit on a node from that pool. When the pool already has a node, it copies +that node, which advertises its real labels and resources. When the pool is +at zero, there is no node to copy, so it builds a template node from the +pool's static configuration alone — the instance type and the labels and +taints declared on the pool. + +It then matches the pod against that template the way the scheduler matches +it against a real node: the pod's node affinity and node selectors must match +the template's labels, and its resource requests must fit the template's +resources. For an ordinary pod this holds — CPU and memory come from the +instance type, and the labels it selects on are static. A GPU pod can ask for +two things a zero-pool template does not have, because the GPU Operator adds +them only after the node boots; either one keeps the pool at zero: + +- **A label the GPU Operator sets after the node is configured.** It sets + `nvidia.com/mig.config.state` and `nvidia.com/mig.strategy` once MIG + configuration finishes, so they are never in a zero-pool template. + Requiring `nvidia.com/mig.config.state=success` is a common way to keep + pods off a node until MIG is ready — the startup taint provides that gate + instead. Drop the affinity and select the pool on a static label: the + pool-name label (for example `agentpool` on AKS) or a custom one. +- **A GPU resource the autoscaler cannot infer from the instance type.** + Whole GPUs (`nvidia.com/gpu`) are usually inferable, which is why the + Walkthrough B test scales from zero with no extra setup. 
Per-profile MIG + resources (`mig.strategy=mixed`, for example `nvidia.com/mig-3g.20gb`) are + not — they appear only after the device plugin reports them. Declare them + on the pool so they enter the template: + - EKS / self-managed ASGs: tag the ASG + `k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/mig-3g.20gb` + = `2`. + - Azure VMSS: the same tag, with `_` in place of `/` (Azure tag names + cannot contain slashes). + - GKE: set the accelerator and `gpu-partition-size` on the node pool. + +## MIG readiness + +The shipped probe only checks that the driver is up (`nvidia-smi` succeeds), +which happens before MIG partitioning finishes. On a MIG pool the taint +therefore comes off before the node can serve MIG pods. A MIG pod still does +not land early — the scheduler holds it until the MIG resource is +allocatable — but the taint is no longer what gates it, and the autoscaler +may treat the node as ready before MIG is configured. + +Two adjustments for MIG pools: + +- Set the MIG profile in the pool template (the `nvidia.com/mig.config` + label) so partitioning starts as soon as the node joins. +- To make the taint itself wait for MIG, extend `check-gpu-ready.sh` — for + example, require `nvidia-smi -L` to list the expected MIG devices, or read + the node's `nvidia.com/mig.config.state` label (this needs API access from + the probe) and exit ready only on `success`. + +## Day-2: bootstrap-only vs continuous enforcement + +The rule in this example uses `enforcementMode: bootstrap-only`: after NRC +removes the taint from a node, it records the bootstrap annotation and stops +managing that node. A driver upgrade or MIG reconfiguration later flips +`nvidia.com/GPUReady` to `False` (NPD keeps probing), but the node stays +schedulable. + +One caveat: the bootstrap annotation is written only when NRC removes a +taint. 
A node that matched the rule while already untainted and ready never +gets the annotation, so NRC taints it the first time its condition turns +`False` — even in `bootstrap-only` mode. The dry-run check in Prerequisites +step 4 shows which nodes are in this state. + +Setting `enforcementMode: continuous` makes NRC re-apply the taint whenever +the condition turns `False`, which extends the same mechanism to day-2 +gating. With `continuous`, a routine driver upgrade makes every node briefly +unschedulable, and new pods do not schedule during MIG reconfiguration. For +the autoscaler use case, `bootstrap-only` is the recommended starting point. + +## Troubleshooting + +**NPD pod crash-loops with `panic: No configuration option for any problem +daemon is specified`.** NPD refuses to start without at least one monitor. +Check that the `--config.custom-plugin-monitor` flag is present and points at +the mounted JSON file. + +**The `nvidia.com/GPUReady` condition is absent from the node.** The NPD pod +is probably not running on that node: +`kubectl get pods -n kube-system -l app=node-problem-detector -o wide`. The +shipped DaemonSet tolerates only the startup taint and `nvidia.com/gpu`; if +the GPU node pool carries additional taints, add matching tolerations to the +NPD DaemonSet (and to the GPU Operator operands and NFD). + +**The condition keeps being set to `False` (or flips back) unexpectedly.** +More than one writer may be publishing it — typically a second NPD DaemonSet +under a different name left over from earlier experiments. Find DaemonSets +with `kubectl get ds -A | grep -i -e problem -e npd`, and identify which +client owns the condition through managed fields: + +```sh +kubectl get node --show-managed-fields -o yaml | grep -B3 'GPUReady' +``` + +Each `managedFields` entry names the writing client in its `manager` field. + +**NRC does not remove the taint.** Check, in order: + +1. 
The condition is actually `True`: + `kubectl get node -o jsonpath='{.status.conditions[?(@.type=="nvidia.com/GPUReady")]}'` +2. The rule's `nodeSelector` matches the node's labels + (`nvidia.com/gpu.present=true` in this example — on a real cluster the + GPU Operator applies that label from NFD's feature labels, so both the + operator and NFD must be running, and NFD must tolerate the startup + taint; see Prerequisites step 3). +3. The node does not already have the + `readiness.k8s.io/bootstrap-completed-nvidia-gpu-readiness` annotation — + in `bootstrap-only` mode NRC ignores nodes that completed bootstrap once. + Remove the annotation to make NRC act again. +4. The NRC controller logs: `kubectl logs -n nrr-system deploy/nrr-controller-manager`. + +**The NodeReadinessRule is rejected on apply.** The taint key must use the +`readiness.k8s.io/` prefix; the CRD validates this. + +**The nvidia-smi probe never succeeds on a real GPU node.** Find the NPD pod +on the affected node +(`kubectl get pods -n kube-system -l app=node-problem-detector -o wide`) and +run the probe by hand: + +```sh +kubectl exec -n kube-system -- /config/check-gpu-ready.sh; echo "exit=$?" +``` + +Exit 1 means ready, exit 0 means not ready (NPD's plugin contract is built +for problem detection, so the codes are inverted compared to a typical health +check). If it stays at 0, check that the driver install finished +(`/run/nvidia/driver` populated on the host for driver-container installs) +and that the DaemonSet runs privileged with `/` mounted at `/host` with +`mountPropagation: HostToContainer` — without propagation, the bind mount +the driver container creates at `/run/nvidia/driver` is invisible to an NPD +pod that started before the driver installed (restarting the NPD pod hides +the problem, so it looks intermittent). + +A node whose probe never succeeds — failed hardware, for example — stays +tainted and `Ready` indefinitely. 
This pattern does not deprovision such +nodes; that takes admin intervention or node pool health checks. + +**Workloads schedule onto the node before the GPU is ready.** The workload +tolerates the startup taint. Only infrastructure that participates in making +the node ready (GPU Operator operands, NFD, NPD) should tolerate it. + +## Cleanup + +Remove the pieces in this order: + +1. Remove the startup taint from the node pool template, and the + `--startup-taint-prefix` / `--startup-taint` flag if you set one. + Skipping this leaves every new node + tainted with nothing in place to untaint it. +2. Delete the rule while NRC is still installed: + `kubectl delete -f node-readiness-rule.yaml`. NRC's cleanup finalizer + removes the rule's taint from any node still carrying it; if NRC is + uninstalled first, the deletion hangs on the finalizer and tainted nodes + stay tainted. +3. Uninstall NRC and delete NPD: `kubectl delete -f npd-gpu-ready.yaml` + (or `simulation/npd-gpu-ready-simulation.yaml`). +4. Optionally remove the readiness toleration entries from the GPU Operator + values. + +Stale `nvidia.com/GPUReady` conditions remain on nodes until the Node object +is deleted or another writer overwrites them; they are inert without NRC. +For the kind simulation, `kind delete cluster --name gpu-sim` removes +everything. diff --git a/examples/cluster-autoscaler/node-readiness-rule.yaml b/examples/cluster-autoscaler/node-readiness-rule.yaml new file mode 100644 index 000000000..c45239760 --- /dev/null +++ b/examples/cluster-autoscaler/node-readiness-rule.yaml @@ -0,0 +1,33 @@ +# NodeReadinessRule for the Node Readiness Controller (NRC). +# +# NRC removes the startup taint from a node once the nvidia.com/GPUReady +# condition (published by NPD, see npd-gpu-ready.yaml) is True. +# +# Notes: +# - The taint key must use the readiness.k8s.io/ prefix; the CRD rejects +# other prefixes. 
The same key must appear in the node pool template, +# the cluster-autoscaler --startup-taint flag, the GPU Operator +# toleration values, and the NPD DaemonSet tolerations. See README.md. +# - On a cluster with existing GPU nodes, preview with spec.dryRun: true +# first -- NRC adds the taint to matching nodes whose condition is not +# True, in both enforcement modes. See README.md prerequisites step 4. +# - bootstrap-only acts once per node: after removing the taint, NRC +# records the readiness.k8s.io/bootstrap-completed-nvidia-gpu-readiness +# annotation on the node and ignores it afterwards. Use `continuous` to +# also re-taint nodes whose condition later turns False (day-2 gating). +apiVersion: readiness.node.x-k8s.io/v1alpha1 +kind: NodeReadinessRule +metadata: + name: nvidia-gpu-readiness +spec: + conditions: + - type: nvidia.com/GPUReady + requiredStatus: "True" + taint: + key: readiness.k8s.io/nvidia-gpu-not-ready + effect: NoSchedule + value: pending + enforcementMode: bootstrap-only + nodeSelector: + matchLabels: + nvidia.com/gpu.present: "true" diff --git a/examples/cluster-autoscaler/npd-gpu-ready.yaml b/examples/cluster-autoscaler/npd-gpu-ready.yaml new file mode 100644 index 000000000..bf44c81f2 --- /dev/null +++ b/examples/cluster-autoscaler/npd-gpu-ready.yaml @@ -0,0 +1,173 @@ +# Node Problem Detector (NPD) with a custom plugin that publishes the +# nvidia.com/GPUReady node condition. The probe runs nvidia-smi against the +# node's driver installation; the Node Readiness Controller removes the +# startup taint once the condition is True. +# +# See README.md in this directory for the full setup guide. 
+--- +apiVersion: v1 +kind: ServiceAccount +metadata: + name: node-problem-detector + namespace: kube-system +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: node-problem-detector +rules: + - apiGroups: [""] + resources: ["nodes"] + verbs: ["get"] + - apiGroups: [""] + resources: ["nodes/status"] + verbs: ["patch"] + - apiGroups: [""] + resources: ["events"] + verbs: ["create", "patch", "update"] +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRoleBinding +metadata: + name: node-problem-detector +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: ClusterRole + name: node-problem-detector +subjects: + - kind: ServiceAccount + name: node-problem-detector + namespace: kube-system +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: npd-gpu-ready-config + namespace: kube-system +data: + gpu-ready-monitor.json: | + { + "plugin": "custom", + "pluginConfig": { + "invoke_interval": "10s", + "timeout": "5s", + "max_output_length": 80, + "concurrency": 1 + }, + "source": "gpu-ready-monitor", + "metricsReporting": false, + "conditions": [ + { + "type": "nvidia.com/GPUReady", + "reason": "GPUReadinessPending", + "message": "GPU readiness probe has not succeeded yet" + } + ], + "rules": [ + { + "type": "permanent", + "condition": "nvidia.com/GPUReady", + "reason": "GPUReady", + "path": "/config/check-gpu-ready.sh", + "timeout": "5s" + } + ] + } + check-gpu-ready.sh: | + #!/bin/sh + # Exit 1 when the GPU is ready, exit 0 when it is not. + # + # NPD's permanent-rule contract is built for problem detection: exit 0 + # means "no problem found" (condition stays False) and exit 1 means + # "problem found" (condition becomes True). nvidia.com/GPUReady reports + # a healthy state rather than a problem, so the exit codes are inverted + # compared to a typical health-check script. 
+ # + # The driver can be installed two ways; probe both locations: + # - driver container: rooted at /run/nvidia/driver on the host + # - host-installed driver: nvidia-smi on the host PATH + if chroot /host/run/nvidia/driver nvidia-smi >/dev/null 2>&1; then + echo "nvidia-smi succeeded (driver container)" + exit 1 + fi + if chroot /host nvidia-smi >/dev/null 2>&1; then + echo "nvidia-smi succeeded (host driver)" + exit 1 + fi + echo "nvidia-smi failed: GPU driver not ready" + exit 0 +--- +apiVersion: apps/v1 +kind: DaemonSet +metadata: + name: node-problem-detector + namespace: kube-system + labels: + app: node-problem-detector +spec: + selector: + matchLabels: + app: node-problem-detector + template: + metadata: + labels: + app: node-problem-detector + spec: + serviceAccountName: node-problem-detector + priorityClassName: system-node-critical + nodeSelector: + nvidia.com/gpu.present: "true" + tolerations: + # NPD must run while the startup taint is still on the node -- + # it publishes the condition that gets the taint removed. + # If the GPU node pool carries additional taints, add matching + # tolerations here. + - key: readiness.k8s.io/nvidia-gpu-not-ready + operator: Exists + effect: NoSchedule + - key: nvidia.com/gpu + operator: Exists + effect: NoSchedule + containers: + - name: npd + image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.20 + command: + - /node-problem-detector + - --logtostderr + - --prometheus-port=0 + - --config.custom-plugin-monitor=/config/gpu-ready-monitor.json + env: + - name: NODE_NAME + valueFrom: + fieldRef: + fieldPath: spec.nodeName + securityContext: + # nvidia-smi opens /dev/nvidia* device nodes through the + # hostPath mount, which requires a privileged container. 
+ privileged: true + resources: + requests: + cpu: 10m + memory: 32Mi + limits: + memory: 128Mi + volumeMounts: + - name: config + mountPath: /config + - name: host + mountPath: /host + readOnly: true + # The driver container bind-mounts /run/nvidia/driver on the + # host after this pod starts; without propagation that mount + # stays invisible here and the probe never succeeds. + mountPropagation: HostToContainer + volumes: + - name: config + configMap: + name: npd-gpu-ready-config + # The probe script is executed directly from the mount. + defaultMode: 0755 + - name: host + hostPath: + path: / + type: Directory diff --git a/examples/cluster-autoscaler/simulation/kind-config.yaml b/examples/cluster-autoscaler/simulation/kind-config.yaml new file mode 100644 index 000000000..37653802f --- /dev/null +++ b/examples/cluster-autoscaler/simulation/kind-config.yaml @@ -0,0 +1,32 @@ +# kind cluster for the autoscaling simulation. Both workers register with +# the GPU node label and the startup taint already in place, the way a +# cloud node pool template provisions nodes: there is no window where the +# node is schedulable before the taint exists. +# +# Two workers are required: kindscaler.sh clones -worker2 as the +# template when adding nodes. +# +# Adapted from the node-readiness-controller testing setup +# (config/testing/kind/kind-3node-config.yaml in +# kubernetes-sigs/node-readiness-controller). 
+kind: Cluster +apiVersion: kind.x-k8s.io/v1alpha4 +name: gpu-sim +nodes: + - role: control-plane + - role: worker + kubeadmConfigPatches: + - | + kind: JoinConfiguration + nodeRegistration: + kubeletExtraArgs: + node-labels: "nvidia.com/gpu.present=true" + register-with-taints: "readiness.k8s.io/nvidia-gpu-not-ready=pending:NoSchedule" + - role: worker + kubeadmConfigPatches: + - | + kind: JoinConfiguration + nodeRegistration: + kubeletExtraArgs: + node-labels: "nvidia.com/gpu.present=true" + register-with-taints: "readiness.k8s.io/nvidia-gpu-not-ready=pending:NoSchedule" diff --git a/examples/cluster-autoscaler/simulation/kindscaler.sh b/examples/cluster-autoscaler/simulation/kindscaler.sh new file mode 100755 index 000000000..b899676eb --- /dev/null +++ b/examples/cluster-autoscaler/simulation/kindscaler.sh @@ -0,0 +1,114 @@ +#!/bin/bash + +# Copyright The Kubernetes Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# This script is a simplified version of kindscaler, originally from: +# https://github.com/lobuhi/kindscaler/ +# +# It is modified to specifically support the testing needs of the +# node-readiness-controller project by only scaling worker nodes +# and using a specific node as a template. +# +# Vendored unmodified for the GPU Operator cluster-autoscaler example from +# kubernetes-sigs/node-readiness-controller (hack/test-workloads/kindscaler.sh). 
+# New nodes are cloned from -worker2, so the kind cluster must +# have at least two workers (see kind-config.yaml); added nodes join with the +# same labels and taints the template node registered with. +# +set -euxo pipefail + +# Check for required commands +if ! command -v kind &> /dev/null; then + echo "kind command not found, please install kind to use this script." + exit 1 +fi + +# Check input parameters +if [ $# -lt 2 ]; then + echo "Usage: $0 " + echo "count must be a positive integer" + exit 1 +fi + +CLUSTER_NAME=$1 +COUNT=$2 +ROLE="worker" # Hardcoded for our testing purposes + +# Validate count +if ! [[ "$COUNT" =~ ^[0-9]+$ ]] || [ "$COUNT" -le 0 ]; then + echo "Count must be a positive integer" + exit 1 +fi + +# Get existing nodes and determine the highest node index for the given role +highest_index=0 +existing_nodes=$(kind get nodes --name "$CLUSTER_NAME") +for node in $existing_nodes; do + if [[ $node == "$CLUSTER_NAME-$ROLE"* ]]; then + suffix=$(echo $node | sed -e "s/^$CLUSTER_NAME-$ROLE//") + if [[ "$suffix" =~ ^[0-9]+$ ]] && [ "$suffix" -gt "$highest_index" ]; then + highest_index=$suffix + fi + fi +done + +# Add nodes based on the highest found index and the count specified +start_index=$(($highest_index + 1)) +end_index=$(($highest_index + $COUNT)) +for i in $(seq $start_index $end_index); do + # Use nrr-test-worker2 as the template for all new worker nodes. + # This ensures they get the correct labels and taints. 
+ TEMPLATE_NODE_NAME="$CLUSTER_NAME-worker2" + NEW_NODE_NAME="$CLUSTER_NAME-worker$i" + + # Copy the kubeadm file from the template node + docker cp $TEMPLATE_NODE_NAME:/kind/kubeadm.conf kubeadm-$i.conf > /dev/null 2>&1 + + # Replace the container role name with specific node name in the kubeadm file + sed -i.bak "s/$TEMPLATE_NODE_NAME$/$NEW_NODE_NAME/g" "./kubeadm-$i.conf" + rm -f "./kubeadm-$i.conf.bak" + + IMAGE=$(docker ps | grep $CLUSTER_NAME | awk '{print $2}' | head -1) + + echo -n "Adding $NEW_NODE_NAME node to $CLUSTER_NAME cluster... " + docker run --name $NEW_NODE_NAME --hostname $NEW_NODE_NAME \ + --label io.x-k8s.kind.role=$ROLE --privileged \ + --security-opt seccomp=unconfined --security-opt apparmor=unconfined \ + --tmpfs /tmp --tmpfs /run --volume /var \ + --volume /lib/modules:/lib/modules:ro -e KIND_EXPERIMENTAL_CONTAINERD_SNAPSHOTTER \ + --detach --tty --label io.x-k8s.kind.cluster=$CLUSTER_NAME --net kind \ + --restart=on-failure:1 --init=false $IMAGE > /dev/null 2>&1 + + # wait for cgroupv2 initialization before docker exec + wait_count=0 + while [ $wait_count -lt 30 ]; do + status=$(docker exec $NEW_NODE_NAME systemctl is-system-running 2>/dev/null || true) + if [[ "$status" == "running" ]]; then + break + fi + sleep 2 + wait_count=$((wait_count + 1)) + done + status=$(docker exec $NEW_NODE_NAME systemctl is-system-running 2>/dev/null || true) + if [[ $wait_count -ge 30 ]] && [ "$status" != "running" ]; then + echo "Container $NEW_NODE_NAME failed to initialize systemd" + echo "Review $NEW_NODE_NAME logs and remove container" + exit 1 + fi + docker cp kubeadm-$i.conf $NEW_NODE_NAME:/kind/kubeadm.conf > /dev/null 2>&1 + docker exec --privileged $NEW_NODE_NAME kubeadm join --config /kind/kubeadm.conf --skip-phases=preflight --v=6 > /dev/null 2>&1 + rm -f kubeadm-*.conf + echo "Done!" 
+done
diff --git a/examples/cluster-autoscaler/simulation/npd-gpu-ready-simulation.yaml b/examples/cluster-autoscaler/simulation/npd-gpu-ready-simulation.yaml
new file mode 100644
index 000000000..be9585534
--- /dev/null
+++ b/examples/cluster-autoscaler/simulation/npd-gpu-ready-simulation.yaml
@@ -0,0 +1,163 @@
+# Simulation variant of npd-gpu-ready.yaml for clusters without GPUs
+# (for example, a local kind cluster). Identical to the production manifest
+# except for two things: the probe script checks for a marker file on the
+# node instead of running nvidia-smi, and the container is not privileged
+# (reading a file does not need device access).
+#
+# Mark the simulated GPU ready (kind node names are docker containers):
+#   docker exec mkdir -p /var/lib/gpu-ready-sim
+#   docker exec touch /var/lib/gpu-ready-sim/ready
+# Flip it back:
+#   docker exec rm /var/lib/gpu-ready-sim/ready
+#
+# The resource names match the production manifest on purpose. Two NPD
+# stacks with different names can run at the same time and overwrite each
+# other's condition updates. Apply one manifest or the other, not both.
+---
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: node-problem-detector
+  namespace: kube-system
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: node-problem-detector
+rules:
+  - apiGroups: [""]
+    resources: ["nodes"]
+    verbs: ["get"]
+  - apiGroups: [""]
+    resources: ["nodes/status"]
+    verbs: ["patch"]
+  - apiGroups: [""]
+    resources: ["events"]
+    verbs: ["create", "patch", "update"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: node-problem-detector
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: node-problem-detector
+subjects:
+  - kind: ServiceAccount
+    name: node-problem-detector
+    namespace: kube-system
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: npd-gpu-ready-config
+  namespace: kube-system
+data:
+  gpu-ready-monitor.json: |
+    {
+      "plugin": "custom",
+      "pluginConfig": {
+        "invoke_interval": "10s",
+        "timeout": "5s",
+        "max_output_length": 80,
+        "concurrency": 1
+      },
+      "source": "gpu-ready-monitor",
+      "metricsReporting": false,
+      "conditions": [
+        {
+          "type": "nvidia.com/GPUReady",
+          "reason": "GPUReadinessPending",
+          "message": "GPU readiness probe has not succeeded yet"
+        }
+      ],
+      "rules": [
+        {
+          "type": "permanent",
+          "condition": "nvidia.com/GPUReady",
+          "reason": "GPUReady",
+          "path": "/config/check-gpu-ready.sh",
+          "timeout": "5s"
+        }
+      ]
+    }
+  check-gpu-ready.sh: |
+    #!/bin/sh
+    # Simulation probe: stands in for nvidia-smi on nodes without GPUs.
+    # The simulated GPU is ready when the marker file exists on the node.
+    #
+    # Exit 1 when ready, exit 0 when not. NPD's permanent-rule contract is
+    # built for problem detection (exit 1 = problem found = condition True),
+    # so the exit codes are inverted compared to a typical health check.
+    if [ -f /host/var/lib/gpu-ready-sim/ready ]; then
+      echo "marker file present: simulated GPU ready"
+      exit 1
+    fi
+    echo "marker file absent: simulated GPU not ready"
+    exit 0
+---
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: node-problem-detector
+  namespace: kube-system
+  labels:
+    app: node-problem-detector
+spec:
+  selector:
+    matchLabels:
+      app: node-problem-detector
+  template:
+    metadata:
+      labels:
+        app: node-problem-detector
+    spec:
+      serviceAccountName: node-problem-detector
+      priorityClassName: system-node-critical
+      nodeSelector:
+        nvidia.com/gpu.present: "true"
+      tolerations:
+        - key: readiness.k8s.io/nvidia-gpu-not-ready
+          operator: Exists
+          effect: NoSchedule
+        - key: nvidia.com/gpu
+          operator: Exists
+          effect: NoSchedule
+      containers:
+        - name: npd
+          image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.20
+          command:
+            - /node-problem-detector
+            - --logtostderr
+            - --prometheus-port=0
+            - --config.custom-plugin-monitor=/config/gpu-ready-monitor.json
+          env:
+            - name: NODE_NAME
+              valueFrom:
+                fieldRef:
+                  fieldPath: spec.nodeName
+          resources:
+            requests:
+              cpu: 10m
+              memory: 32Mi
+            limits:
+              memory: 128Mi
+          volumeMounts:
+            - name: config
+              mountPath: /config
+            - name: host
+              mountPath: /host
+              readOnly: true
+              # Matches the production manifest; see the comment there.
+              mountPropagation: HostToContainer
+      volumes:
+        - name: config
+          configMap:
+            name: npd-gpu-ready-config
+            # The probe script is executed directly from the mount.
+            defaultMode: 0755
+        - name: host
+          hostPath:
+            path: /
+            type: Directory
diff --git a/examples/cluster-autoscaler/simulation/reset.sh b/examples/cluster-autoscaler/simulation/reset.sh
new file mode 100755
index 000000000..f10febfb4
--- /dev/null
+++ b/examples/cluster-autoscaler/simulation/reset.sh
@@ -0,0 +1,37 @@
+#!/bin/sh
+# Reset the simulation so the flow can be run again on the same node.
+# Usage: ./reset.sh
+set -e
+
+NODE="${1:?usage: $0 }"
+
+# Remove the marker file.
+# NPD flips nvidia.com/GPUReady back to False on
+# its next probe interval (10s).
+docker exec "$NODE" rm -f /var/lib/gpu-ready-sim/ready
+
+# Wait for the condition to turn False before re-tainting. Re-tainting while
+# it is still True triggers an NRC reconcile that can remove the new taint
+# and re-write the bootstrap annotation, undoing the reset.
+echo "Waiting for nvidia.com/GPUReady to turn False..."
+i=0
+while :; do
+  status="$(kubectl get node "$NODE" -o jsonpath='{.status.conditions[?(@.type=="nvidia.com/GPUReady")].status}')"
+  [ "$status" = "False" ] && break
+  i=$((i + 1))
+  if [ "$i" -ge 30 ]; then
+    echo "condition did not turn False after 60s; is the NPD pod running on $NODE?" >&2
+    exit 1
+  fi
+  sleep 2
+done
+
+# Re-apply the startup taint. In production the node pool template applies
+# it when a node is created.
+kubectl taint node "$NODE" readiness.k8s.io/nvidia-gpu-not-ready=pending:NoSchedule --overwrite
+
+# bootstrap-only mode acts once per node: after removing the taint, NRC
+# records this annotation and ignores the node afterwards. Remove it so NRC
+# manages the node again.
+kubectl annotate node "$NODE" readiness.k8s.io/bootstrap-completed-nvidia-gpu-readiness-
+
+echo "Reset complete. The taint stays until the marker file is recreated."
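
The polling loop in reset.sh could be factored into a small reusable helper, which would also serve the opposite direction (waiting for `nvidia.com/GPUReady` to turn `True` after the marker file is recreated). The following is a minimal sketch and not part of this change: `wait_until` and `condition_is` are hypothetical names, and the sketch assumes `kubectl` is configured for the cluster under test.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): retry a command until it
# succeeds, giving up after ATTEMPTS tries with DELAY seconds between tries.
# Usage: wait_until ATTEMPTS DELAY COMMAND [ARGS...]
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  tries=0
  while ! "$@"; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$attempts" ]; then
      return 1
    fi
    sleep "$delay"
  done
  return 0
}

# Hypothetical predicate: succeeds when condition TYPE on NODE has status WANT.
# Usage: condition_is NODE TYPE WANT
condition_is() {
  status="$(kubectl get node "$1" -o jsonpath="{.status.conditions[?(@.type==\"$2\")].status}")"
  [ "$status" = "$3" ]
}
```

With these helpers, the wait in reset.sh would read `wait_until 30 2 condition_is "$NODE" nvidia.com/GPUReady False`, which gives up after roughly 60 seconds like the original loop. Passing `True` instead waits for the probe to report the simulated GPU ready.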