Add additional labels from BMC and Server resources to metrics#894
Add additional labels from BMC and Server resources to metrics#894asergeant01 wants to merge 2 commits into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughAdds configurable Kubernetes label enrichment to Redfish telemetry metrics. New CLI flags parse comma-separated ChangesRedfish Metric Label Enrichment
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/main.go`:
- Around line 723-732: The error log messages passed to setupLog.Error after
calling serverevents.ParseLabelMappings (for redfishMetricLabelsFromBMC and
redfishMetricLabelsFromServer) should start with a capital letter and follow
Kubernetes logging style; update the two calls that currently use "invalid
--redfish-metric-labels-from-bmc" and "invalid
--redfish-metric-labels-from-server" to use capitalized, active/past-tense, and
object-specific messages (e.g., "Invalid Redfish metric label mappings from BMC"
and "Invalid Redfish metric label mappings from server") so they adhere to the
logging convention.
In `@docs/observability/metrics.md`:
- Around line 348-351: The fenced code block showing the CLI flags
--redfish-metric-labels-from-bmc and --redfish-metric-labels-from-server lacks a
language identifier; update the block to include a language (e.g., bash) so
syntax highlighting renders correctly when the docs are built by adding the
language token after the opening backticks (```bash) for that snippet.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: fd9f383b-bd16-4023-967f-da43d5d25b00
⛔ Files ignored due to path filters (3)
dist/chart/templates/_helpers.tplis excluded by!**/dist/**dist/chart/templates/manager/manager.yamlis excluded by!**/dist/**dist/chart/values.yamlis excluded by!**/dist/**
📒 Files selected for processing (5)
cmd/main.godocs/observability/metrics.mdinternal/serverevents/metrics.gointernal/serverevents/server.gotest/serverevents/main.go
…hered Signed-off-by: Alan Sergeant <alan.sergeant@sap.com>
8dcb7e6 to
4cc89ef
Compare
|
Thanks for the PR @asergeant01 I think it makes sense to expose topology information, However, I'm wondering whether we should solve this at the Prometheus layer rather than inside the operator. The standard pattern for this in the Prometheus ecosystem is an info-style metric +
# kube-state-metrics custom resource state config
spec:
resources:
- groupVersionKind:
group: metal.ironcore.dev
version: v1alpha1
kind: BMC
labelsFromPath:
bmc_name: [metadata, name]
metrics:
- name: metal_bmc_labels
help: "BMC resource labels for join enrichment"
each:
type: Info
info:
labelsFromPath:
region: [metadata, labels, "topology.kubernetes.io/region"]
zone: [metadata, labels, "topology.kubernetes.io/zone"]
This avoids:
Adding label dimensions becomes a config change in KSM + a recording rule, no operator redeployment needed. |
|
Thanks for the feedback @afritzler. The KSM + group_left pattern is the right call in many contexts, but I think the in-operator approach is still the better fit here. However, you do raise some valid technical concerns which I think I can address. The recording rule requirement is a hard blocker for many users - The KSM approach only works ergonomically if a recording rule materialises the join — raw group_left in every dashboard query and alert rule is not realistic. That recording rule must live somewhere that evaluates against the metrics store. For users who remote-write to a managed observability platform (Grafana Cloud, Thanos, VictoriaMetrics, etc.) they often do not control the ruler component, or the recording rule must be deployed separately to the remote ruler rather than alongside the operator. The label_replace rename is fragile - the proposed join requires What I think I can fix is your concerns around K8s API calls on the hot path and 1-hour stale labels. They're artifacts of the TTL cache implementation, not of the in-operator approach itself. I think the way KSM does this is with runtime manager's informer-backed client.Client. In our implementation that client is already passed in, already watch-based, and reads are local in-memory lookups — zero API round trips, and labels reflect changes near-instantly. The remaining concerns (restart to add a label dimension, separation of concerns) are right but I think are acceptable trade-offs given that topology labels like region and zone are stable once configured. |
a7ccd7c to
9b0b585
Compare
xkonni
left a comment
There was a problem hiding this comment.
Nice, the informer cache fix is the right call. Field index is there, parsing is solid. LGTM 👍
9b0b585 to
ea08240
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/serverevents/metrics.go`:
- Around line 142-169: Add a per-hostname TTL cache (e.g., hostnameLabelsCache
map[string]{labels []string, ts time.Time} plus a small mutex) and have
RedfishEventCollector.getLabels consult that cache first and only perform k8s
lookups (using c.k8sClient and the existing bmcMappings/serverMappings logic)
when the cache is missing or older than 1 hour; update the cache with the
fetched labels and timestamp. Also change Collect so label resolution happens
outside the c.mux critical section (or precompute a hostname->labels map for the
current scrape) and then hold c.mux only while emitting metrics, referencing the
cached/precomputed labels to avoid repeated lookups and lock contention. Ensure
you reference getLabels, Collect, c.mux, bmcMappings, serverMappings, and
c.k8sClient when making these changes.
- Around line 40-67: ParseLabelMappings currently only validates format but must
also reject duplicate PromLabel names and names that collide with the fixed
descriptor labels used by sensorDesc/alertDesc; update ParseLabelMappings to (1)
maintain a set of seen PromLabel strings and return an error if a duplicate is
encountered and (2) check each PromLabel against the reserved fixed label names
(hostname, metric_id, type, unit, origin_context, severity, message_id,
component) and return an error if any mapping uses one of those reserved names;
mention ParseLabelMappings and NewRedfishEventCollector/allLabels in the error
text so callers can identify the problem.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3a041261-6bc2-4400-9ae1-bc873db05d11
⛔ Files ignored due to path filters (3)
dist/chart/templates/_helpers.tplis excluded by!**/dist/**dist/chart/templates/manager/manager.yamlis excluded by!**/dist/**dist/chart/values.yamlis excluded by!**/dist/**
📒 Files selected for processing (6)
LICENSES/BSD-3-Clause.txtcmd/main.godocs/observability/metrics.mdinternal/serverevents/metrics.gointernal/serverevents/server.gotest/serverevents/main.go
💤 Files with no reviewable changes (1)
- LICENSES/BSD-3-Clause.txt
| func ParseLabelMappings(s string) ([]LabelMapping, error) { | ||
| s = strings.TrimSpace(s) | ||
| if s == "" { | ||
| return nil, nil | ||
| } | ||
| tokens := strings.Split(s, ",") | ||
| mappings := make([]LabelMapping, 0, len(tokens)) | ||
| for _, token := range tokens { | ||
| token = strings.TrimSpace(token) | ||
| if token == "" { | ||
| continue | ||
| } | ||
| parts := strings.SplitN(token, "=", 2) | ||
| if len(parts) != 2 { | ||
| return nil, fmt.Errorf("invalid label mapping %q: must be 'kubernetes-label=prometheus-label'", token) | ||
| } | ||
| k8sKey := strings.TrimSpace(parts[0]) | ||
| promLabel := strings.TrimSpace(parts[1]) | ||
| if k8sKey == "" { | ||
| return nil, fmt.Errorf("invalid label mapping %q: Kubernetes label key must not be empty", token) | ||
| } | ||
| if !promLabelPattern.MatchString(promLabel) { | ||
| return nil, fmt.Errorf("invalid label mapping %q: Prometheus label name %q must match [a-zA-Z_][a-zA-Z0-9_]*", token, promLabel) | ||
| } | ||
| mappings = append(mappings, LabelMapping{K8sKey: k8sKey, PromLabel: promLabel}) | ||
| } | ||
| return mappings, nil | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
In prometheus/client_golang v1.23.2, are duplicate variable label names in prometheus.NewDesc considered invalid, and how does registration/collection fail when they are present?
💡 Result:
Yes. In prometheus/client_golang v1.23.2, duplicate variable label names in prometheus.NewDesc are considered invalid. When NewDesc is constructed, it validates that the combined set of variable label names (variableLabels) and constant label names (constLabels) has unique names; if duplicates exist, NewDesc returns a Desc with an internal error set to something like “duplicate label names in constant and variable labels for metric %q” and does not create a usable descriptor. [1] How registration/collection fails when they are present: - If you use the descriptor via MustNewConstMetric, it panics with the Desc error (so collection/gathering aborts for that collector/metric). This “duplicate label names …” panic behavior is observed in the field. [2] - If you instead use a non-must path that returns an error (e.g., creating/collecting metrics from an invalid Desc), the Registry/Gather path surfaces the Desc error and fails the scrape/collection with that error. (The library’s Desc has an internal err that is set during NewDesc validation.) [3] In short: duplicates are caught at Desc creation time; the failure shows up at metric creation/collection time (panic for MustNewConstMetric, or returned error for non-must paths), which prevents successful gathering/scraping. [1][2][3]
Citations:
- 1: grafana-cold-storage/client_golang@3d765a1
- 2: panic: duplicate label names in constant and variable labels for metric "minio_cluster_replication_current_active_workers" minio/minio#18893
- 3: https://github.com/prometheus/client_golang/blob/master/prometheus/desc.go
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show relevant sections of internal/serverevents/metrics.go
sed -n '1,220p' internal/serverevents/metrics.go
sed -n '220,420p' internal/serverevents/metrics.go
# Locate ParseLabelMappings usage and related descriptor building
rg -n "ParseLabelMappings|LabelMapping|NewDesc|promLabelPattern|getLabels|hostname|metric_id" internal/serverevents/metrics.goRepository: ironcore-dev/metal-operator
Length of output: 10261
Reject duplicate and fixed-label-colliding Prometheus label names in ParseLabelMappings
ParseLabelMappings only validates syntax/regex; it doesn’t reject duplicate PromLabels or collisions with the fixed label names used to build sensorDesc and alertDesc (hostname, metric_id, type, unit, origin_context, severity, message_id, component). Because NewRedfishEventCollector appends allLabels (mappings’ PromLabels) to both descriptors’ variable label lists, duplicate/colliding names make the descriptors invalid and Collect’s prometheus.MustNewConstMetric(...) will panic during scraping.
Suggested fix (keep duplicate rejection; extend to reserved fixed-label collisions)
func ParseLabelMappings(s string) ([]LabelMapping, error) {
s = strings.TrimSpace(s)
if s == "" {
return nil, nil
}
tokens := strings.Split(s, ",")
mappings := make([]LabelMapping, 0, len(tokens))
+ seen := make(map[string]struct{}, len(tokens))
for _, token := range tokens {
@@
if !promLabelPattern.MatchString(promLabel) {
return nil, fmt.Errorf("invalid label mapping %q: Prometheus label name %q must match [a-zA-Z_][a-zA-Z0-9_]*", token, promLabel)
}
+ if _, exists := seen[promLabel]; exists {
+ return nil, fmt.Errorf("invalid label mapping %q: duplicate Prometheus label name %q", token, promLabel)
+ }
+ seen[promLabel] = struct{}{}
mappings = append(mappings, LabelMapping{K8sKey: k8sKey, PromLabel: promLabel})
}
return mappings, nil
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/serverevents/metrics.go` around lines 40 - 67, ParseLabelMappings
currently only validates format but must also reject duplicate PromLabel names
and names that collide with the fixed descriptor labels used by
sensorDesc/alertDesc; update ParseLabelMappings to (1) maintain a set of seen
PromLabel strings and return an error if a duplicate is encountered and (2)
check each PromLabel against the reserved fixed label names (hostname,
metric_id, type, unit, origin_context, severity, message_id, component) and
return an error if any mapping uses one of those reserved names; mention
ParseLabelMappings and NewRedfishEventCollector/allLabels in the error text so
callers can identify the problem.
| func (c *RedfishEventCollector) getLabels(hostname string) []string { | ||
| vals := make([]string, c.allLabelCount) | ||
| if c.k8sClient == nil || c.allLabelCount == 0 { | ||
| return vals | ||
| } | ||
| ctx := context.Background() | ||
|
|
||
| // --- BMC labels --- | ||
| if len(c.bmcMappings) > 0 { | ||
| bmc := &metalv1alpha1.BMC{} | ||
| if err := c.k8sClient.Get(ctx, client.ObjectKey{Name: hostname}, bmc); err == nil { | ||
| for i, m := range c.bmcMappings { | ||
| vals[i] = bmc.Labels[m.K8sKey] | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // --- Server labels (looked up via spec.bmcRef.name field index) --- | ||
| if len(c.serverMappings) > 0 { | ||
| serverList := &metalv1alpha1.ServerList{} | ||
| if err := c.k8sClient.List(ctx, serverList, client.MatchingFields{"spec.bmcRef.name": hostname}); err == nil && len(serverList.Items) == 1 { | ||
| for i, m := range c.serverMappings { | ||
| vals[len(c.bmcMappings)+i] = serverList.Items[0].Labels[m.K8sKey] | ||
| } | ||
| } | ||
| } | ||
| return vals | ||
| } |
There was a problem hiding this comment.
Missing hostname TTL cache causes repeated Kubernetes lookups on scrape hot path.
Line 142 executes K8s lookups per getLabels call, and Line 245/257 calls it for every emitted metric while Collect holds the collector read lock (Line 238). This does not implement the stated 1-hour per-hostname cache and can degrade scrape latency under load.
Suggested direction
type RedfishEventCollector struct {
@@
allLabelCount int
+ labelCacheTTL time.Duration
+ labelCache map[string]cachedLabels
+ labelCacheMux sync.RWMutex
}
+
+type cachedLabels struct {
+ values []string
+ expiresAt time.Time
+}Then resolve once per hostname with TTL in getLabels, and in Collect compute labels outside c.mux critical section (or precompute a per-scrape hostname->labels map) to avoid lock contention.
Also applies to: 237-260
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/serverevents/metrics.go` around lines 142 - 169, Add a per-hostname
TTL cache (e.g., hostnameLabelsCache map[string]{labels []string, ts time.Time}
plus a small mutex) and have RedfishEventCollector.getLabels consult that cache
first and only perform k8s lookups (using c.k8sClient and the existing
bmcMappings/serverMappings logic) when the cache is missing or older than 1
hour; update the cache with the fetched labels and timestamp. Also change
Collect so label resolution happens outside the c.mux critical section (or
precompute a hostname->labels map for the current scrape) and then hold c.mux
only while emitting metrics, referencing the cached/precomputed labels to avoid
repeated lookups and lock contention. Ensure you reference getLabels, Collect,
c.mux, bmcMappings, serverMappings, and c.k8sClient when making these changes.
Signed-off-by: Alan Sergeant <alan.sergeant@sap.com>
ea08240 to
417e30b
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
The The concern I'd want addressed is cardinality. Sensor readings already fan out across |
|
@afritzler Ok, I understand. You're right on the On your cardinality point, I get the concern- but the intended use case (topology labels like region/zone/rack that are 1:1 with hostname), adding them doesn't actually multiply cardinality — (hostname, metric_id, ..., region, zone) has the same number of unique series as without them, since region and zone are deterministic given hostname, e.g. A middle path that I think addresses your data model concern without the KSM dependency: instead of baking the labels into the operational metrics, the operator emits a redfish_bmc_info gauge (constant 1, labels from BMC resource) using the same informer cache already in place. Users join at query time with group_left. No schema coupling on the operational metrics, no KSM required, and cardinality is paid once per BMC. Would that be an acceptable direction? Happy to rework the PR along those lines. |
|
What I am asking myself here is: Which deployment scenario specifically motivates this change? If KSM is available + the fact that we here are 100% in the k8s types world -> we can solve that problem with 0 lines of code. |
Fixes #893
Summary by CodeRabbit
New Features
Documentation