
Add additional labels from BMC and Server resources to metrics #894

Open
asergeant01 wants to merge 2 commits into main from feat/add-additional-metric-labels

Conversation

@asergeant01

@asergeant01 asergeant01 commented May 15, 2026

Contributor

Fixes #893

Summary by CodeRabbit

  • New Features

    • Added CLI flags to enrich Redfish metrics with labels from Kubernetes BMC and Server resources using configurable key-to-label mappings.
    • Enhanced observability with additional dynamic label dimensions for metrics when event monitoring is enabled.
  • Documentation

    • Added documentation describing Redfish telemetry metrics, label enrichment configuration, and usage examples.

Review Change Stack

@asergeant01 asergeant01 self-assigned this May 15, 2026
@asergeant01 asergeant01 requested a review from a team as a code owner May 15, 2026 07:53
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label May 15, 2026
@coderabbitai

coderabbitai Bot commented May 15, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 72997b0a-7ad5-4393-b371-f6360ad25117

📥 Commits

Reviewing files that changed from the base of the PR and between ea08240 and 417e30b.

📒 Files selected for processing (4)
  • cmd/main.go
  • docs/observability/metrics.md
  • internal/serverevents/metrics.go
  • internal/serverevents/server.go
✅ Files skipped from review due to trivial changes (1)
  • docs/observability/metrics.md

📝 Walkthrough

Walkthrough

Adds configurable Kubernetes label enrichment to Redfish telemetry metrics. New CLI flags parse comma-separated k8s-label=prometheus-label mappings for BMC and Server resources; the event server wires these mappings through a redesigned collector that resolves label values via controller-runtime cache and appends them to metric emissions.

Changes

Redfish Metric Label Enrichment

| Layer | File(s) | Summary |
|---|---|---|
| Label mapping parsing | internal/serverevents/metrics.go | LabelMapping type and ParseLabelMappings() parse and validate comma-separated kubernetes-label-key=prometheus-label-name pairs with Prometheus label-name regex validation. |
| Collector fields and metadata | internal/serverevents/metrics.go | RedfishEventCollector extended with k8sClient, bmcMappings, serverMappings, and allLabelCount fields to hold controller-runtime client and mapping configuration. |
| Collector constructor & label resolution | internal/serverevents/metrics.go | NewRedfishEventCollector parameterized to accept client and mappings; builds metric descriptors with dynamic label keys; getLabels(hostname) resolves BMC labels and matched Server labels (via spec.bmcRef.name) using controller-runtime cache; minor comment cleanup in UpdateFromEvent. |
| Metric emission with enriched labels | internal/serverevents/metrics.go | Collect appends enrichment label values from getLabels() to base label sets when emitting redfish_monitor_reading gauge and redfish_event_alert_total counter metrics. |
| Server constructor and wiring | internal/serverevents/server.go | NewServer signature extended to accept k8sClient, bmcMappings, serverMappings; initializes Server.collector via updated NewRedfishEventCollector constructor; adds controller-runtime client import. |
| CLI flags and main wiring | cmd/main.go | Two new CLI flags (--redfish-metric-labels-from-bmc, --redfish-metric-labels-from-server) accept label-mapping specs; parsed when eventURL is set via ParseLabelMappings; exits on parse errors; passes parsed mappings to serverevents.NewServer. |
| Metrics documentation | docs/observability/metrics.md | Adds "Redfish Telemetry Metrics" section documenting redfish_monitor_reading gauge and redfish_event_alert_total counter, including fixed/dynamic labels, label-enrichment via CLI flags and Kubernetes resources, Helm configuration, and enriched metric examples. |
| Test callsite update | test/serverevents/main.go | Updates serverevents.NewServer call to pass three nil arguments matching the new constructor signature. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • ironcore-dev/metal-operator#494: Adds serverevents registry and subscription management that relies on the same serverevents server startup API integration points extended by this PR.

Suggested labels

size/M

Suggested reviewers

  • afritzler
  • stefanhipfel
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The description is minimal, providing only a reference to the linked issue (#893) without using the repository's template structure or elaborating on proposed changes. | Expand the description to follow the template format with bullet-pointed proposed changes and include the 'Fixes #' reference as shown in the template. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately and concisely summarizes the main change: adding Kubernetes-sourced labels from BMC and Server resources to Redfish metrics. |
| Linked Issues check | ✅ Passed | The pull request fully implements the requirements from issue #893: adds two CLI flags with correct parsing, integrates label enrichment into metrics collection, caches with TTL, and documents the feature. |
| Out of Scope Changes check | ✅ Passed | All code changes are directly scoped to the linked issue: adding CLI flags, implementing label mapping parsing, enriching metrics, and updating documentation without introducing unrelated functionality. |


@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/main.go`:
- Around line 723-732: The error log messages passed to setupLog.Error after
calling serverevents.ParseLabelMappings (for redfishMetricLabelsFromBMC and
redfishMetricLabelsFromServer) should start with a capital letter and follow
Kubernetes logging style; update the two calls that currently use "invalid
--redfish-metric-labels-from-bmc" and "invalid
--redfish-metric-labels-from-server" to use capitalized, active/past-tense, and
object-specific messages (e.g., "Invalid Redfish metric label mappings from BMC"
and "Invalid Redfish metric label mappings from server") so they adhere to the
logging convention.

In `@docs/observability/metrics.md`:
- Around line 348-351: The fenced code block showing the CLI flags
--redfish-metric-labels-from-bmc and --redfish-metric-labels-from-server lacks a
language identifier; update the block to include a language (e.g., bash) so
syntax highlighting renders correctly when the docs are built by adding the
language token after the opening backticks (```bash) for that snippet.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fd9f383b-bd16-4023-967f-da43d5d25b00

📥 Commits

Reviewing files that changed from the base of the PR and between bf8e075 and 8dcb7e6.

⛔ Files ignored due to path filters (3)
  • dist/chart/templates/_helpers.tpl is excluded by !**/dist/**
  • dist/chart/templates/manager/manager.yaml is excluded by !**/dist/**
  • dist/chart/values.yaml is excluded by !**/dist/**
📒 Files selected for processing (5)
  • cmd/main.go
  • docs/observability/metrics.md
  • internal/serverevents/metrics.go
  • internal/serverevents/server.go
  • test/serverevents/main.go

Comment thread cmd/main.go
Comment thread docs/observability/metrics.md Outdated
…hered

Signed-off-by: Alan Sergeant <alan.sergeant@sap.com>
@asergeant01 asergeant01 force-pushed the feat/add-additional-metric-labels branch from 8dcb7e6 to 4cc89ef Compare May 15, 2026 08:03
@afritzler afritzler changed the title adding additional labels from bmc and server resources to metrics Add additional labels from BMC and Server resources to metrics May 18, 2026
@afritzler afritzler added the enhancement New feature or request label May 18, 2026
@afritzler

Member

Thanks for the PR, @asergeant01. I think it makes sense to expose topology information.

However, I'm wondering whether we should solve this at the Prometheus layer rather than inside the operator. The standard pattern for this in the Prometheus ecosystem is an info-style metric + group_left join:

  1. Expose topology labels via a kube-state-metrics customresourcestate config (no operator code needed):
# kube-state-metrics custom resource state config
spec:
  resources:
    - groupVersionKind:
        group: metal.ironcore.dev
        version: v1alpha1
        kind: BMC
      labelsFromPath:
        bmc_name: [metadata, name]
      metrics:
        - name: metal_bmc_labels
          help: "BMC resource labels for join enrichment"
          each:
            type: Info
            info:
              labelsFromPath:
                region: [metadata, labels, "topology.kubernetes.io/region"]
                zone: [metadata, labels, "topology.kubernetes.io/zone"]
  2. Join at query time (or materialize as a recording rule):
redfish_monitor_reading
  * on(hostname) group_left(region, zone)
  label_replace(metal_bmc_labels, "hostname", "$1", "bmc_name", "(.*)")

This avoids:

  • Kubernetes API calls on the metrics ingestion hot path
  • A TTL cache that can serve stale labels for up to an hour (vs. KSM which watches resources)
  • New operator flags that require a restart to change label dimensions
  • Cardinality risk from misconfiguration (with recording rules you see the cardinality before it hits long-term storage)

Adding label dimensions becomes a config change in KSM + a recording rule, no operator redeployment needed.
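
To make the join semantics above concrete, here is a minimal Go sketch that imitates what `* on(hostname) group_left(region, zone)` does to label sets. It is an illustration only, not Prometheus code: the `series` type and `joinInfo` function are hypothetical names, samples are omitted, and it assumes the info side has at most one series per hostname.

```go
package main

import "fmt"

// series is a simplified label set; real Prometheus series also carry samples.
type series map[string]string

// joinInfo imitates `* on(hostname) group_left(extra...)`: each left-hand
// series is matched to the info series with the same hostname and copies the
// listed labels from it. Left-hand series without a match are dropped.
func joinInfo(left, info []series, extra ...string) []series {
	byHost := make(map[string]series, len(info))
	for _, s := range info {
		byHost[s["hostname"]] = s
	}
	var out []series
	for _, l := range left {
		match, ok := byHost[l["hostname"]]
		if !ok {
			continue // no info series for this hostname: no output
		}
		joined := series{}
		for k, v := range l {
			joined[k] = v
		}
		for _, k := range extra {
			joined[k] = match[k]
		}
		out = append(out, joined)
	}
	return out
}

func main() {
	readings := []series{
		{"hostname": "bmc-0", "metric_id": "cpu_temp"},
		{"hostname": "bmc-9", "metric_id": "cpu_temp"}, // no info series for this host
	}
	info := []series{{"hostname": "bmc-0", "region": "eu-1", "zone": "a"}}
	for _, s := range joinInfo(readings, info, "region", "zone") {
		fmt.Println(s)
	}
}
```

Note that the unmatched `bmc-9` reading simply disappears from the result, which is the "silently returns no data" behaviour that matters when the join key does not line up.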

@asergeant01

Contributor Author

Thanks for the feedback, @afritzler. The KSM + group_left pattern is the right call in many contexts, but I think the in-operator approach is still the better fit here. You do raise some valid technical concerns, though, which I think I can address.

The recording rule requirement is a hard blocker for many users. The KSM approach only works ergonomically if a recording rule materialises the join; raw group_left in every dashboard query and alert rule is not realistic. That recording rule must live somewhere that evaluates against the metrics store. Users who remote-write to a managed observability platform (Grafana Cloud, Thanos, VictoriaMetrics, etc.) often do not control the ruler component, or must deploy the recording rule separately to the remote ruler rather than alongside the operator.

The label_replace rename is fragile. The proposed join requires label_replace(metal_bmc_labels, "hostname", "$1", "bmc_name", "(.*)") to align the join key. A mismatch between the KSM-emitted bmc_name and the hostname label on Redfish metrics silently returns no data, which is not a great experience for users.

What I think I can fix are your concerns about K8s API calls on the hot path and 1-hour stale labels. They are artifacts of the TTL cache implementation, not of the in-operator approach itself. I think the way KSM does this is with the runtime manager's informer-backed client.Client. In our implementation that client is already passed in and already watch-based, and reads are local in-memory lookups: zero API round trips, and labels reflect changes near-instantly.

The remaining concerns (restart to add a label dimension, separation of concerns) are valid, but I think they are acceptable trade-offs given that topology labels like region and zone are stable once configured.

@asergeant01 asergeant01 force-pushed the feat/add-additional-metric-labels branch 2 times, most recently from a7ccd7c to 9b0b585 Compare May 20, 2026 14:26

@xkonni xkonni left a comment

Contributor


Nice, the informer cache fix is the right call. Field index is there, parsing is solid. LGTM 👍

@asergeant01 asergeant01 force-pushed the feat/add-additional-metric-labels branch from 9b0b585 to ea08240 Compare May 21, 2026 09:01

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/serverevents/metrics.go`:
- Around line 142-169: Add a per-hostname TTL cache (e.g., hostnameLabelsCache
map[string]{labels []string, ts time.Time} plus a small mutex) and have
RedfishEventCollector.getLabels consult that cache first and only perform k8s
lookups (using c.k8sClient and the existing bmcMappings/serverMappings logic)
when the cache is missing or older than 1 hour; update the cache with the
fetched labels and timestamp. Also change Collect so label resolution happens
outside the c.mux critical section (or precompute a hostname->labels map for the
current scrape) and then hold c.mux only while emitting metrics, referencing the
cached/precomputed labels to avoid repeated lookups and lock contention. Ensure
you reference getLabels, Collect, c.mux, bmcMappings, serverMappings, and
c.k8sClient when making these changes.
- Around line 40-67: ParseLabelMappings currently only validates format but must
also reject duplicate PromLabel names and names that collide with the fixed
descriptor labels used by sensorDesc/alertDesc; update ParseLabelMappings to (1)
maintain a set of seen PromLabel strings and return an error if a duplicate is
encountered and (2) check each PromLabel against the reserved fixed label names
(hostname, metric_id, type, unit, origin_context, severity, message_id,
component) and return an error if any mapping uses one of those reserved names;
mention ParseLabelMappings and NewRedfishEventCollector/allLabels in the error
text so callers can identify the problem.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3a041261-6bc2-4400-9ae1-bc873db05d11

📥 Commits

Reviewing files that changed from the base of the PR and between 8dcb7e6 and ea08240.

⛔ Files ignored due to path filters (3)
  • dist/chart/templates/_helpers.tpl is excluded by !**/dist/**
  • dist/chart/templates/manager/manager.yaml is excluded by !**/dist/**
  • dist/chart/values.yaml is excluded by !**/dist/**
📒 Files selected for processing (6)
  • LICENSES/BSD-3-Clause.txt
  • cmd/main.go
  • docs/observability/metrics.md
  • internal/serverevents/metrics.go
  • internal/serverevents/server.go
  • test/serverevents/main.go
💤 Files with no reviewable changes (1)
  • LICENSES/BSD-3-Clause.txt

Comment on lines +40 to +67
func ParseLabelMappings(s string) ([]LabelMapping, error) {
	s = strings.TrimSpace(s)
	if s == "" {
		return nil, nil
	}
	tokens := strings.Split(s, ",")
	mappings := make([]LabelMapping, 0, len(tokens))
	for _, token := range tokens {
		token = strings.TrimSpace(token)
		if token == "" {
			continue
		}
		parts := strings.SplitN(token, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("invalid label mapping %q: must be 'kubernetes-label=prometheus-label'", token)
		}
		k8sKey := strings.TrimSpace(parts[0])
		promLabel := strings.TrimSpace(parts[1])
		if k8sKey == "" {
			return nil, fmt.Errorf("invalid label mapping %q: Kubernetes label key must not be empty", token)
		}
		if !promLabelPattern.MatchString(promLabel) {
			return nil, fmt.Errorf("invalid label mapping %q: Prometheus label name %q must match [a-zA-Z_][a-zA-Z0-9_]*", token, promLabel)
		}
		mappings = append(mappings, LabelMapping{K8sKey: k8sKey, PromLabel: promLabel})
	}
	return mappings, nil
}
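
For illustration, the parser quoted above can be exercised in isolation. The following is a self-contained sketch: the function body is copied from the diff, while the `LabelMapping` struct definition and the `promLabelPattern` regex are assumptions inferred from how they are used in the quoted code, and the sample label keys are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// LabelMapping pairs a Kubernetes label key with a Prometheus label name.
// The field names are inferred from the quoted diff.
type LabelMapping struct {
	K8sKey    string
	PromLabel string
}

// promLabelPattern is assumed to match the regex cited in the error message.
var promLabelPattern = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

// ParseLabelMappings parses "k8s-key=prom_label" pairs separated by commas.
func ParseLabelMappings(s string) ([]LabelMapping, error) {
	s = strings.TrimSpace(s)
	if s == "" {
		return nil, nil
	}
	tokens := strings.Split(s, ",")
	mappings := make([]LabelMapping, 0, len(tokens))
	for _, token := range tokens {
		token = strings.TrimSpace(token)
		if token == "" {
			continue
		}
		parts := strings.SplitN(token, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("invalid label mapping %q: must be 'kubernetes-label=prometheus-label'", token)
		}
		k8sKey := strings.TrimSpace(parts[0])
		promLabel := strings.TrimSpace(parts[1])
		if k8sKey == "" {
			return nil, fmt.Errorf("invalid label mapping %q: Kubernetes label key must not be empty", token)
		}
		if !promLabelPattern.MatchString(promLabel) {
			return nil, fmt.Errorf("invalid label mapping %q: Prometheus label name %q must match [a-zA-Z_][a-zA-Z0-9_]*", token, promLabel)
		}
		mappings = append(mappings, LabelMapping{K8sKey: k8sKey, PromLabel: promLabel})
	}
	return mappings, nil
}

func main() {
	// Two valid mappings; whitespace around tokens is trimmed.
	mappings, err := ParseLabelMappings("topology.kubernetes.io/region=region, topology.kubernetes.io/zone=zone")
	fmt.Println(mappings, err)

	// A hyphen is not allowed in a Prometheus label name, so this fails.
	_, err = ParseLabelMappings("topology.kubernetes.io/zone=my-zone")
	fmt.Println(err)
}
```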

Contributor


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In prometheus/client_golang v1.23.2, are duplicate variable label names in prometheus.NewDesc considered invalid, and how does registration/collection fail when they are present?

💡 Result:

Yes. In prometheus/client_golang v1.23.2, duplicate variable label names in prometheus.NewDesc are considered invalid. When NewDesc is constructed, it validates that the combined set of variable label names (variableLabels) and constant label names (constLabels) has unique names; if duplicates exist, NewDesc returns a Desc with an internal error set to something like “duplicate label names in constant and variable labels for metric %q” and does not create a usable descriptor. [1]

How registration/collection fails when they are present:

  • If you use the descriptor via MustNewConstMetric, it panics with the Desc error (so collection/gathering aborts for that collector/metric). This “duplicate label names …” panic behavior is observed in the field. [2]
  • If you instead use a non-must path that returns an error (e.g., creating/collecting metrics from an invalid Desc), the Registry/Gather path surfaces the Desc error and fails the scrape/collection with that error. (The library’s Desc has an internal err that is set during NewDesc validation.) [3]

In short: duplicates are caught at Desc creation time; the failure shows up at metric creation/collection time (panic for MustNewConstMetric, or returned error for non-must paths), which prevents successful gathering/scraping. [1][2][3]

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show relevant sections of internal/serverevents/metrics.go
sed -n '1,220p' internal/serverevents/metrics.go
sed -n '220,420p' internal/serverevents/metrics.go

# Locate ParseLabelMappings usage and related descriptor building
rg -n "ParseLabelMappings|LabelMapping|NewDesc|promLabelPattern|getLabels|hostname|metric_id" internal/serverevents/metrics.go

Repository: ironcore-dev/metal-operator

Length of output: 10261


Reject duplicate and fixed-label-colliding Prometheus label names in ParseLabelMappings

ParseLabelMappings only validates syntax/regex; it doesn’t reject duplicate PromLabels or collisions with the fixed label names used to build sensorDesc and alertDesc (hostname, metric_id, type, unit, origin_context, severity, message_id, component). Because NewRedfishEventCollector appends allLabels (mappings’ PromLabels) to both descriptors’ variable label lists, duplicate/colliding names make the descriptors invalid and Collect’s prometheus.MustNewConstMetric(...) will panic during scraping.

Suggested fix (keep duplicate rejection; extend to reserved fixed-label collisions)
 func ParseLabelMappings(s string) ([]LabelMapping, error) {
 	s = strings.TrimSpace(s)
 	if s == "" {
 		return nil, nil
 	}
 	tokens := strings.Split(s, ",")
 	mappings := make([]LabelMapping, 0, len(tokens))
+	seen := make(map[string]struct{}, len(tokens))
 
 	for _, token := range tokens {
@@
 		if !promLabelPattern.MatchString(promLabel) {
 			return nil, fmt.Errorf("invalid label mapping %q: Prometheus label name %q must match [a-zA-Z_][a-zA-Z0-9_]*", token, promLabel)
 		}
+		if _, exists := seen[promLabel]; exists {
+			return nil, fmt.Errorf("invalid label mapping %q: duplicate Prometheus label name %q", token, promLabel)
+		}
+		seen[promLabel] = struct{}{}
 		mappings = append(mappings, LabelMapping{K8sKey: k8sKey, PromLabel: promLabel})
 	}
 	return mappings, nil
 }
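
The diff above only covers duplicates. A rough sketch of the reserved-name check described in the comment might look like the following; `checkPromLabels` and `reservedLabels` are hypothetical names, and the reserved set is taken from the fixed labels listed in the review text rather than from the actual source file.

```go
package main

import "fmt"

// reservedLabels lists the fixed descriptor labels named in the review comment.
var reservedLabels = map[string]struct{}{
	"hostname": {}, "metric_id": {}, "type": {}, "unit": {},
	"origin_context": {}, "severity": {}, "message_id": {}, "component": {},
}

// checkPromLabels is a hypothetical helper: it rejects names that repeat or
// that collide with one of the reserved fixed labels.
func checkPromLabels(names []string) error {
	seen := make(map[string]struct{}, len(names))
	for _, n := range names {
		if _, ok := reservedLabels[n]; ok {
			return fmt.Errorf("label name %q collides with a fixed metric label", n)
		}
		if _, ok := seen[n]; ok {
			return fmt.Errorf("duplicate Prometheus label name %q", n)
		}
		seen[n] = struct{}{}
	}
	return nil
}

func main() {
	fmt.Println(checkPromLabels([]string{"region", "zone"}))
	fmt.Println(checkPromLabels([]string{"region", "region"}))
	fmt.Println(checkPromLabels([]string{"hostname"}))
}
```

Since the BMC and Server mappings appear to be appended to the same descriptors, a check like this would probably need to run over the combined list of names from both flags, not over each flag separately.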

Comment on lines +142 to +169
func (c *RedfishEventCollector) getLabels(hostname string) []string {
	vals := make([]string, c.allLabelCount)
	if c.k8sClient == nil || c.allLabelCount == 0 {
		return vals
	}
	ctx := context.Background()

	// --- BMC labels ---
	if len(c.bmcMappings) > 0 {
		bmc := &metalv1alpha1.BMC{}
		if err := c.k8sClient.Get(ctx, client.ObjectKey{Name: hostname}, bmc); err == nil {
			for i, m := range c.bmcMappings {
				vals[i] = bmc.Labels[m.K8sKey]
			}
		}
	}

	// --- Server labels (looked up via spec.bmcRef.name field index) ---
	if len(c.serverMappings) > 0 {
		serverList := &metalv1alpha1.ServerList{}
		if err := c.k8sClient.List(ctx, serverList, client.MatchingFields{"spec.bmcRef.name": hostname}); err == nil && len(serverList.Items) == 1 {
			for i, m := range c.serverMappings {
				vals[len(c.bmcMappings)+i] = serverList.Items[0].Labels[m.K8sKey]
			}
		}
	}
	return vals
}
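
The slot layout used by getLabels (BMC values first, Server values after them, empty strings for anything unresolved) can be sketched without a Kubernetes client. In this simplified illustration, plain maps stand in for the labels of the BMC and Server objects; the `resolve` function and the sample keys are hypothetical.

```go
package main

import "fmt"

// LabelMapping mirrors the type used in the quoted code.
type LabelMapping struct {
	K8sKey    string
	PromLabel string
}

// resolve mirrors the slot layout of getLabels: BMC mappings fill the first
// len(bmcMappings) slots, Server mappings fill the rest. A nil or incomplete
// label map leaves the corresponding slots as empty strings.
func resolve(bmcMappings, serverMappings []LabelMapping, bmcLabels, serverLabels map[string]string) []string {
	vals := make([]string, len(bmcMappings)+len(serverMappings))
	for i, m := range bmcMappings {
		vals[i] = bmcLabels[m.K8sKey]
	}
	for i, m := range serverMappings {
		vals[len(bmcMappings)+i] = serverLabels[m.K8sKey]
	}
	return vals
}

func main() {
	bmcMappings := []LabelMapping{{K8sKey: "topology.kubernetes.io/region", PromLabel: "region"}}
	serverMappings := []LabelMapping{{K8sKey: "example.com/rack", PromLabel: "rack"}}

	// The Server lookup "failed" here (nil map), so only the BMC slot is filled.
	vals := resolve(bmcMappings, serverMappings,
		map[string]string{"topology.kubernetes.io/region": "eu-1"}, nil)
	fmt.Printf("%q\n", vals)
}
```

The fixed-length result matters because the number of label values passed to the metric constructor has to match the number of variable labels in the descriptor, even when a lookup fails.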

Contributor


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Missing hostname TTL cache causes repeated Kubernetes lookups on scrape hot path.

Line 142 executes K8s lookups per getLabels call, and Line 245/257 calls it for every emitted metric while Collect holds the collector read lock (Line 238). This does not implement the stated 1-hour per-hostname cache and can degrade scrape latency under load.

Suggested direction
 type RedfishEventCollector struct {
@@
 	allLabelCount  int
+	labelCacheTTL  time.Duration
+	labelCache     map[string]cachedLabels
+	labelCacheMux  sync.RWMutex
 }
+
+type cachedLabels struct {
+	values    []string
+	expiresAt time.Time
+}

Then resolve once per hostname with TTL in getLabels, and in Collect compute labels outside c.mux critical section (or precompute a per-scrape hostname->labels map) to avoid lock contention.

Also applies to: 237-260
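
As a rough sketch of the per-hostname TTL cache suggested here (not the code in this PR), the following shows the general shape. The `labelCache` type, its method names, and the `lookup` callback are all hypothetical; the callback stands in for the Kubernetes lookups.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cachedLabels follows the struct proposed in the review comment.
type cachedLabels struct {
	values    []string
	expiresAt time.Time
}

// labelCache is a hypothetical per-hostname cache with a fixed TTL.
type labelCache struct {
	mu      sync.RWMutex
	ttl     time.Duration
	entries map[string]cachedLabels
}

func newLabelCache(ttl time.Duration) *labelCache {
	return &labelCache{ttl: ttl, entries: make(map[string]cachedLabels)}
}

// get returns the cached values for hostname and calls lookup only when the
// entry is missing or expired. The lookup itself runs without holding the lock.
func (c *labelCache) get(hostname string, lookup func(string) []string) []string {
	c.mu.RLock()
	e, ok := c.entries[hostname]
	c.mu.RUnlock()
	if ok && time.Now().Before(e.expiresAt) {
		return e.values
	}
	vals := lookup(hostname)
	c.mu.Lock()
	c.entries[hostname] = cachedLabels{values: vals, expiresAt: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return vals
}

func main() {
	calls := 0
	lookup := func(string) []string { calls++; return []string{"eu-1"} }

	c := newLabelCache(time.Hour)
	c.get("bmc-0", lookup)
	c.get("bmc-0", lookup) // served from the cache
	fmt.Println("lookups:", calls)
}
```

The trade-off is the one raised earlier in the thread: within the TTL the cache can serve stale labels, which a watch-based informer cache avoids.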


Signed-off-by: Alan Sergeant <alan.sergeant@sap.com>
@asergeant01 asergeant01 force-pushed the feat/add-additional-metric-labels branch from ea08240 to 417e30b Compare May 21, 2026 09:13
@asergeant01

Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 22, 2026

Contributor
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@afritzler

Member

> Thanks for the feedback, @afritzler. The KSM + group_left pattern is the right call in many contexts, but I think the in-operator approach is still the better fit here. You do raise some valid technical concerns, though, which I think I can address.
>
> The recording rule requirement is a hard blocker for many users. The KSM approach only works ergonomically if a recording rule materialises the join; raw group_left in every dashboard query and alert rule is not realistic. That recording rule must live somewhere that evaluates against the metrics store. Users who remote-write to a managed observability platform (Grafana Cloud, Thanos, VictoriaMetrics, etc.) often do not control the ruler component, or must deploy the recording rule separately to the remote ruler rather than alongside the operator.
>
> The label_replace rename is fragile. The proposed join requires label_replace(metal_bmc_labels, "hostname", "$1", "bmc_name", "(.*)") to align the join key. A mismatch between the KSM-emitted bmc_name and the hostname label on Redfish metrics silently returns no data, which is not a great experience for users.
>
> What I think I can fix are your concerns about K8s API calls on the hot path and 1-hour stale labels. They are artifacts of the TTL cache implementation, not of the in-operator approach itself. I think the way KSM does this is with the runtime manager's informer-backed client.Client. In our implementation that client is already passed in and already watch-based, and reads are local in-memory lookups: zero API round trips, and labels reflect changes near-instantly.
>
> The remaining concerns (restart to add a label dimension, separation of concerns) are valid, but I think they are acceptable trade-offs given that topology labels like region and zone are stable once configured.

The label_replace example doesn't apply, since both metrics share hostname. The join is just on(hostname) group_left(...) with no renaming needed. Recording rules are also broadly supported on managed backends (Mimir, Thanos, Grafana Cloud, VictoriaMetrics all support them), and inline group_left in dashboard panels works fine in practice.

The concern I'd want addressed is cardinality. Sensor readings already fan out across metric_id × type × unit × origin_context, and every added topology label multiplies that. An info metric pays the cost once per BMC. The "restart to add a label dimension" trade-off, plus the operator owning user-specific topology vocabulary, are exactly what the KSM pattern exists to avoid.

@asergeant01

asergeant01 commented Jun 3, 2026

Contributor Author

@afritzler OK, I understand. You're right on the label_replace point; I misread the query. Since both metrics already carry hostname and KSM would emit with the same key, the join is just on(hostname) group_left(...).

On your cardinality point, I get the concern. But for the intended use case (topology labels like region/zone/rack that are 1:1 with hostname), adding them doesn't actually multiply cardinality: (hostname, metric_id, ..., region, zone) has the same number of unique series as without them, since region and zone are deterministic given hostname, e.g. redfish_bmc_info{hostname="...", region="...", zone="..."}. Your point about misconfiguration is valid, though: if someone maps a high-cardinality label by mistake there's no guard, and with an info metric you'd see the blast radius before it hits long-term storage.
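
The cardinality argument can be checked with a small counting sketch. This is an illustration with made-up hosts and metric IDs, not operator code; `buildSeries` and `countSeries` are hypothetical helpers, and the label sets are reduced to hostname, metric_id, and region.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// countSeries returns the number of distinct label sets.
func countSeries(sets []map[string]string) int {
	seen := make(map[string]struct{})
	for _, ls := range sets {
		keys := make([]string, 0, len(ls))
		for k := range ls {
			keys = append(keys, k)
		}
		sort.Strings(keys) // canonical order so equal sets produce equal keys
		var b strings.Builder
		for _, k := range keys {
			b.WriteString(k + "=" + ls[k] + ";")
		}
		seen[b.String()] = struct{}{}
	}
	return len(seen)
}

// buildSeries creates one series per (host, metric ID), once without and once
// with a region label that is fully determined by the host.
func buildSeries(regionOf map[string]string, ids []string) (plain, enriched []map[string]string) {
	for host, region := range regionOf {
		for _, id := range ids {
			plain = append(plain, map[string]string{"hostname": host, "metric_id": id})
			enriched = append(enriched, map[string]string{"hostname": host, "metric_id": id, "region": region})
		}
	}
	return plain, enriched
}

func main() {
	regionOf := map[string]string{"bmc-0": "eu-1", "bmc-1": "us-1"} // hypothetical 1:1 topology
	plain, enriched := buildSeries(regionOf, []string{"cpu_temp", "fan_speed", "power"})
	fmt.Println(countSeries(plain), countSeries(enriched))
}
```

Both counts are equal as long as the added label is a function of hostname. A label whose value changes over time for the same host would still create new series, which is the misconfiguration risk discussed above.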

A middle path that I think addresses your data model concern without the KSM dependency: instead of baking the labels into the operational metrics, the operator emits a redfish_bmc_info gauge (constant 1, labels from BMC resource) using the same informer cache already in place. Users join at query time with group_left. No schema coupling on the operational metrics, no KSM required, and cardinality is paid once per BMC.

Would that be an acceptable direction? Happy to rework the PR along those lines.

@afritzler

Member

What I am asking myself here is: which deployment scenario specifically motivates this change? If KSM is available, and given that we are 100% in the world of k8s types here, we can solve that problem with 0 lines of code.


Labels

area/metal-automation documentation Improvements or additions to documentation enhancement New feature or request size/L

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

Add configurable label enrichment to Redfish telemetry metrics

4 participants