26 changes: 25 additions & 1 deletion cmd/main.go
@@ -83,6 +83,8 @@ func main() { // nolint: gocyclo
		eventPort int
		eventURL string
		eventProtocol string
		redfishMetricLabelsFromBMC string
		redfishMetricLabelsFromServer string
		registryClientTimeout time.Duration
		registryDataMaxAge time.Duration
		registryResyncInterval time.Duration
@@ -143,6 +145,16 @@ func main() { // nolint: gocyclo
	flag.IntVar(&eventPort, "event-port", 10001, "The port to use for the server events endpoint for alerts and metrics.")
	flag.StringVar(&eventProtocol, "event-protocol", "http",
		"The protocol to use for the server events endpoint for alerts and metrics.")
	flag.StringVar(&redfishMetricLabelsFromBMC, "redfish-metric-labels-from-bmc", "",
		"Comma-separated list of 'kubernetes-label-key=prometheus-label-name' pairs. "+
			"Each pair adds an additional label dimension to Redfish telemetry metrics, "+
			"sourced from the matching label on the BMC resource. "+
			"Example: topology.kubernetes.io/region=region,topology.kubernetes.io/zone=zone")
	flag.StringVar(&redfishMetricLabelsFromServer, "redfish-metric-labels-from-server", "",
		"Comma-separated list of 'kubernetes-label-key=prometheus-label-name' pairs. "+
			"Each pair adds an additional label dimension to Redfish telemetry metrics, "+
			"sourced from the matching label on the Server resource linked via spec.bmcRef.name. "+
			"Example: metadata.metal.ironcore.dev/location=location,metadata.metal.ironcore.dev/rack=rack")
	flag.StringVar(&probeImage, "probe-image", "", "Image for the first boot probing of a Server.")
	flag.StringVar(&probeOSImage, "probe-os-image", "", "OS image for the first boot probing of a Server.")
	flag.StringVar(&managerNamespace, "manager-namespace", "default", "Namespace the manager is running in.")
@@ -708,9 +720,21 @@ func main() { // nolint: gocyclo
	}

	if eventURL != "" {
		bmcLabelMappings, err := serverevents.ParseLabelMappings(redfishMetricLabelsFromBMC)
		if err != nil {
			setupLog.Error(err, "Invalid --redfish-metric-labels-from-bmc")
			os.Exit(1)
		}
		serverLabelMappings, err := serverevents.ParseLabelMappings(redfishMetricLabelsFromServer)
		if err != nil {
			setupLog.Error(err, "Invalid --redfish-metric-labels-from-server")
			os.Exit(1)
		}
		if err := mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
			setupLog.Info("starting event server for alerts and metrics", "EventURL", eventURL)
			eventServer := serverevents.NewServer(setupLog, fmt.Sprintf(":%d", eventPort))
			eventServer := serverevents.NewServer(
				setupLog, fmt.Sprintf(":%d", eventPort), mgr.GetClient(), bmcLabelMappings, serverLabelMappings,
			)
			if err := eventServer.Start(ctx); err != nil {
				return fmt.Errorf("unable to start event server: %w", err)
			}
14 changes: 14 additions & 0 deletions dist/chart/templates/_helpers.tpl
@@ -48,3 +48,17 @@ app.kubernetes.io/instance: {{ .Release.Name }}
$hasValidating = true }}{{- end }}
{{- end }}
{{ $hasValidating }}}}{{- end }}

{{/*
chart.redfishLabelFlag renders a single CLI flag string from a map of
kubernetes-label-key -> prometheus-label-name entries.

Usage: {{ include "chart.redfishLabelFlag" (dict "flag" "redfish-metric-labels-from-bmc" "map" .Values.redfishLabels.bmc) }}
*/}}
{{- define "chart.redfishLabelFlag" -}}
{{- $pairs := list -}}
{{- range $k, $v := .map -}}
{{- $pairs = append $pairs (printf "%s=%s" $k $v) -}}
{{- end -}}
{{- printf "--%s=%s" .flag (join "," ($pairs | sortAlpha)) -}}
{{- end }}
6 changes: 6 additions & 0 deletions dist/chart/templates/manager/manager.yaml
@@ -34,6 +34,12 @@ spec:
{{- range .Values.controllerManager.manager.args }}
- {{ . }}
{{- end }}
{{- if .Values.redfishLabels.bmc }}
- {{ include "chart.redfishLabelFlag" (dict "flag" "redfish-metric-labels-from-bmc" "map" .Values.redfishLabels.bmc) | quote }}
{{- end }}
{{- if .Values.redfishLabels.server }}
- {{ include "chart.redfishLabelFlag" (dict "flag" "redfish-metric-labels-from-server" "map" .Values.redfishLabels.server) | quote }}
{{- end }}
command:
- /manager
image: {{ .Values.controllerManager.manager.image.repository }}:{{ .Values.controllerManager.manager.image.tag }}
18 changes: 17 additions & 1 deletion dist/chart/values.yaml
@@ -67,6 +67,23 @@ crd:
  # (Certificates, Issuers, ...) due to garbage collection.
  keep: true

# [REDFISH LABEL ENRICHMENT]: Optional label enrichment for Redfish telemetry metrics.
# Define mappings from Kubernetes resource label keys to Prometheus label names.
# Mapped labels are appended as additional dimensions to redfish_monitor_reading and
# redfish_event_alert_total metrics. Leave empty ({}) to disable enrichment.
redfishLabels:
  # Labels sourced from the BMC resource (matched by hostname == BMC resource name).
  bmc: {}
  # Example:
  #   topology.kubernetes.io/region: region
  #   topology.kubernetes.io/zone: zone

  # Labels sourced from the Server resource linked via spec.bmcRef.name.
  server: {}
  # Example:
  #   metadata.metal.ironcore.dev/location: location
  #   metadata.metal.ironcore.dev/rack: rack

# [METRICS]: Set to true to generate manifests for exporting metrics.
# To disable metrics export set false, and ensure that the
# ControllerManager argument "--metrics-bind-address=:8443" is removed.
@@ -101,4 +118,3 @@ ignition:
# Template content that can be customized - this will be created as a ConfigMap
# and mounted to override the default template
# template: |

91 changes: 91 additions & 0 deletions docs/observability/metrics.md
@@ -281,6 +281,97 @@ rate(metal_server_reconciliation_total{result=~"error_.*"}[5m])
count(metal_server_state{state="Available"} == 1)
```

## Redfish Telemetry Metrics

When the operator is configured with an event URL (`--event-url`), it subscribes to Redfish MetricReport and Alert events from each BMC and exposes two additional metrics.

### Sensor Readings (`redfish_monitor_reading`)

**Type:** Gauge
**Description:** Latest sensor value pushed via a Redfish MetricReport event.
**Fixed labels:**
- `hostname`: BMC Kubernetes resource name
- `metric_id`: Redfish metric ID (e.g., `CPU1Temp`)
- `type`: Metric type (e.g., `Temperature`, `Voltage`)
- `unit`: Unit of measure (e.g., `Cel`, `V`)
- `origin_context`: Originating hardware component path

**Dynamic labels:** Additional label dimensions can be injected from the BMC or Server resource (see [Label Enrichment](#label-enrichment) below).

**Example values:**
```text
redfish_monitor_reading{hostname="node001-bmc", metric_id="CPU1Temp", type="Temperature", unit="Cel", origin_context="/Chassis/1/Thermal"} 42.5
redfish_monitor_reading{hostname="node001-bmc", metric_id="FanSpeed1", type="Rotational", unit="RPM", origin_context="/Chassis/1/Thermal"} 3200
```

**Use cases:**
- Alert on thermal readings exceeding thresholds: `redfish_monitor_reading{type="Temperature"} > 80`
- Track fan speeds: `redfish_monitor_reading{type="Rotational"}`
- Compare readings across regions or racks when enriched with topology labels

### Alert Event Counter (`redfish_event_alert_total`)

**Type:** Counter
**Description:** Total count of Redfish alert/event messages received from each BMC.
**Fixed labels:**
- `hostname`: BMC Kubernetes resource name
- `severity`: Event severity (e.g., `OK`, `Warning`, `Critical`)
- `message_id`: Redfish MessageId (e.g., `Alert.1.0.ResourceStatusChangedOK`)
- `component`: Originating hardware component

**Dynamic labels:** Same enrichment as `redfish_monitor_reading`.

**Example values:**
```text
redfish_event_alert_total{hostname="node001-bmc", severity="Warning", message_id="ThermalEvents.1.0.TemperatureAboveUpperCautionThreshold", component="/Chassis/1/Thermal/CPU1Temp"} 3
redfish_event_alert_total{hostname="node001-bmc", severity="OK", message_id="Alert.1.0.ResourceStatusChangedOK", component="/Systems/1"} 12
```

**Use cases:**
- Alert on any critical event received in the last 5 minutes: `increase(redfish_event_alert_total{severity="Critical"}[5m]) > 0`
- Track warning frequency per host: `rate(redfish_event_alert_total{severity="Warning"}[1h])`

### Label Enrichment

When managing a large number of servers, dashboards and alert rules often need to be filtered by topology or location (e.g., region, availability zone, rack). Both Redfish metrics therefore support optional dynamic labels sourced from Kubernetes resource labels, so operators can slice telemetry by any organisational dimension without modifying the operator itself.

This is configured via two CLI flags:

| Flag | Source resource | Match key |
|------|----------------|-----------|
| `--redfish-metric-labels-from-bmc` | `BMC` resource | resource name == `hostname` label |
| `--redfish-metric-labels-from-server` | `Server` resource | `spec.bmcRef.name` == `hostname` label |

**Flag format:** `kubernetes-label-key=prometheus-label-name,...`

**Example:**
```bash
--redfish-metric-labels-from-bmc=topology.kubernetes.io/region=region,topology.kubernetes.io/zone=zone
--redfish-metric-labels-from-server=metadata.metal.ironcore.dev/location=location,metadata.metal.ironcore.dev/rack=rack
```
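
The flag format can be illustrated with a small parser. This is a minimal sketch, not the operator's `serverevents.ParseLabelMappings`: the `LabelMapping` type, the `parseLabelMappings` function, and the choice to reject empty keys or names are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// LabelMapping is a hypothetical type pairing a Kubernetes label key
// with the Prometheus label name it feeds.
type LabelMapping struct {
	KubeKey  string
	PromName string
}

// parseLabelMappings splits "k8s-key=prom-name,..." into mappings.
// An empty input yields no mappings, which corresponds to enrichment being disabled.
func parseLabelMappings(raw string) ([]LabelMapping, error) {
	if strings.TrimSpace(raw) == "" {
		return nil, nil
	}
	var out []LabelMapping
	for _, pair := range strings.Split(raw, ",") {
		// Neither Kubernetes label keys nor Prometheus label names may
		// contain "=", so cutting at the first "=" is unambiguous.
		key, name, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if !ok || key == "" || name == "" {
			return nil, fmt.Errorf("invalid label mapping %q: want key=name", pair)
		}
		out = append(out, LabelMapping{KubeKey: key, PromName: name})
	}
	return out, nil
}

func main() {
	mappings, err := parseLabelMappings(
		"topology.kubernetes.io/region=region,topology.kubernetes.io/zone=zone")
	if err != nil {
		panic(err)
	}
	for _, m := range mappings {
		fmt.Printf("%s -> %s\n", m.KubeKey, m.PromName)
	}
}
```

Failing fast on a malformed pair mirrors how `cmd/main.go` exits at startup when either flag cannot be parsed.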

When configured, every Redfish metric gains the extra labels. If a label key is missing from the resource, its value is emitted as an empty string, so missing labels never block metric emission.
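
The empty-string fallback can be sketched as a plain map lookup. The `labelValues` helper below is hypothetical and only illustrates the behaviour described above; how the operator actually resolves values is not shown in this change.

```go
package main

import "fmt"

// labelValues is a hypothetical helper: it returns one value per configured
// Kubernetes label key, in order. A key that is absent from the resource
// (or a nil label map) yields "", so the number of values always matches
// the number of configured labels.
func labelValues(resourceLabels map[string]string, keys []string) []string {
	values := make([]string, len(keys))
	for i, k := range keys {
		values[i] = resourceLabels[k] // missing key reads as the zero value ""
	}
	return values
}

func main() {
	bmcLabels := map[string]string{"topology.kubernetes.io/region": "eu-de-1"}
	keys := []string{"topology.kubernetes.io/region", "topology.kubernetes.io/zone"}
	fmt.Printf("%q\n", labelValues(bmcLabels, keys)) // prints ["eu-de-1" ""]
}
```

Keeping the value count fixed matters because a Prometheus metric vector expects exactly one value for each declared label name.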

Labels are read from the controller-runtime informer cache, which is kept up to date by watches rather than by periodic polling. There is no TTL: label changes on BMC or Server resources become visible as soon as the corresponding watch event has been processed.

#### Helm chart configuration

```yaml
redfishLabels:
  bmc:
    topology.kubernetes.io/region: region
    topology.kubernetes.io/zone: zone
  server:
    metadata.metal.ironcore.dev/location: location
    metadata.metal.ironcore.dev/rack: rack
```
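
The chart turns each map into a single CLI argument via the `chart.redfishLabelFlag` helper, which sorts the `key=value` pairs before joining them with commas. The Go function below is a rough equivalent written only for illustration; `renderLabelFlag` is a hypothetical name and is not part of the operator.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderLabelFlag mirrors the Helm helper's logic: build "key=value" pairs,
// sort them so the rendered argument is deterministic, and join with commas.
func renderLabelFlag(flag string, mappings map[string]string) string {
	pairs := make([]string, 0, len(mappings))
	for k, v := range mappings {
		pairs = append(pairs, k+"="+v)
	}
	sort.Strings(pairs)
	return fmt.Sprintf("--%s=%s", flag, strings.Join(pairs, ","))
}

func main() {
	fmt.Println(renderLabelFlag("redfish-metric-labels-from-bmc", map[string]string{
		"topology.kubernetes.io/zone":   "zone",
		"topology.kubernetes.io/region": "region",
	}))
	// prints --redfish-metric-labels-from-bmc=topology.kubernetes.io/region=region,topology.kubernetes.io/zone=zone
}
```

Sorting keeps the rendered manifest stable between `helm template` runs, which avoids spurious Deployment diffs.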

#### Example enriched output

```text
redfish_monitor_reading{hostname="node001-bmc", metric_id="CPU1Temp", type="Temperature", unit="Cel", origin_context="/Chassis/1/Thermal", region="eu-de-1", zone="eu-de-1a", location="building-b", rack="row3-rack7"} 42.5
```

## Implementation Details

### Metric Collection Strategy