Fix ServerMaintenance transitions to Pending when referenced Server is deleted #951

Open
stefanhipfel wants to merge 1 commit into main from worktree-issue367

Conversation

@stefanhipfel

@stefanhipfel stefanhipfel commented Jun 15, 2026

Contributor

Fixes #367

When a BMC is deleted, Kubernetes cascades deletion to owned Server resources. ServerMaintenance objects referencing those servers were left orphaned due to two bugs in the ServerMaintenanceReconciler:

  • reconcile() returned silently on NotFound server before adding the finalizer — so the resource was never garbage collected
  • delete() propagated NotFound as an error, blocking finalizer removal permanently

Both paths now handle a missing Server gracefully.

Summary by CodeRabbit

  • Bug Fixes

    • Improved reconciliation when a referenced Server is missing: ServerMaintenance now records the condition, updates its state/status to Pending, and exits early to avoid further reconciliation steps.
  • Tests

    • Updated cleanup to ignore “not found” errors during deletion.
    • Added coverage asserting that when the referenced Server is deleted, the ServerMaintenance object remains and transitions to Pending (rather than being treated as fully deleted).

@stefanhipfel stefanhipfel requested a review from a team as a code owner June 15, 2026 08:19
@github-actions github-actions Bot added the size/M and bug (Something isn't working) labels Jun 15, 2026
@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e1b7e77d-c225-4650-9a4f-e04134a3d32f

📥 Commits

Reviewing files that changed from the base of the PR and between 50e70b0 and f3ae5a8.

📒 Files selected for processing (2)
  • internal/controller/servermaintenance_controller.go
  • internal/controller/servermaintenance_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/controller/servermaintenance_controller.go

📝 Walkthrough

Walkthrough

ServerMaintenanceReconciler is updated to handle missing Server references by patching the ServerMaintenance status to Pending and returning early during reconciliation. A new test verifies this Pending status transition when the referenced Server is deleted.

Changes

ServerMaintenance Pending status on missing Server

| Layer / File(s) | Summary |
| --- | --- |
| Reconciler NotFound handling and Pending status patch (internal/controller/servermaintenance_controller.go, internal/controller/servermaintenance_controller_test.go) | In reconcile(), a NotFound error when fetching the referenced Server now triggers a log message and a patch of ServerMaintenance status to Pending, followed by early return without proceeding to finalizer logic. Other Get errors still propagate. A test creates a ServerMaintenance, deletes the referenced Server, and asserts the ServerMaintenance status transitions to Pending. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested labels

area/metal-automation, size/XS

Suggested reviewers

  • afritzler
  • Nuckal777
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and accurately summarizes the main change: ServerMaintenance objects transitioning to Pending when referenced Server is deleted. |
| Description check | ✅ Passed | The description provides context, references the issue #367, explains the two bugs fixed, and describes the graceful handling solution. |
| Linked Issues check | ✅ Passed | The PR addresses the core requirement from #367 by handling missing Server resources gracefully in both reconciliation and deletion paths, enabling proper cleanup of ServerMaintenance objects. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to fixing the ServerMaintenance orphaning issue described in #367; no unrelated modifications are present. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (1)
internal/controller/servermaintenance_controller.go (1)

76-76: ⚡ Quick win

Add structured context fields to the orphan-deletion log.

This message should include object keys (for example, ServerMaintenance and referenced Server) to match controller logging conventions.

Suggested change
-			log.V(1).Info("Referenced Server not found, deleting ServerMaintenance")
+			log.V(1).Info(
+				"Referenced Server not found, deleting ServerMaintenance",
+				"ServerMaintenance", client.ObjectKeyFromObject(maintenance),
+				"Server", maintenance.Spec.ServerRef.Name,
+			)

As per coding guidelines: Use structured logging with key-value pairs following Kubernetes conventions - use log := log.FromContext(ctx); log.Info("msg", "key", val).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/servermaintenance_controller.go` at line 76, The log
statement at the orphan-deletion check lacks structured context fields required
by Kubernetes logging conventions. Modify the log.V(1).Info() call to include
key-value pairs for the ServerMaintenance object key and the referenced Server
object key (for example, using "namespace" and "name" or similar identifying
fields). Keep the message text the same but add the structured fields as
additional arguments to the Info() method following the pattern log.Info("msg",
"key", val, "key2", val2).

Source: Coding guidelines

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a3338e7a-d4e1-4ca1-8282-8f65eab0548c

📥 Commits

Reviewing files that changed from the base of the PR and between 1636209 and c866f34.

📒 Files selected for processing (2)
  • internal/controller/servermaintenance_controller.go
  • internal/controller/servermaintenance_controller_test.go

Comment on lines +575 to +578
Eventually(Get(server)).ShouldNot(Succeed())

By("Expecting the ServerMaintenance to be deleted automatically")
Eventually(Get(serverMaintenance)).ShouldNot(Succeed())

Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert NotFound explicitly instead of only “not succeed”.

These checks can pass on unrelated API failures; assert apierrors.IsNotFound to verify the intended deletion outcome.

Suggested change
+	apierrors "k8s.io/apimachinery/pkg/api/errors"
...
-		Eventually(Get(server)).ShouldNot(Succeed())
+		Eventually(Get(server)).Should(Satisfy(apierrors.IsNotFound))
...
-		Eventually(Get(serverMaintenance)).ShouldNot(Succeed())
+		Eventually(Get(serverMaintenance)).Should(Satisfy(apierrors.IsNotFound))
...
-		Eventually(Get(server)).ShouldNot(Succeed())
+		Eventually(Get(server)).Should(Satisfy(apierrors.IsNotFound))
...
-		Eventually(Get(serverMaintenance)).ShouldNot(Succeed())
+		Eventually(Get(serverMaintenance)).Should(Satisfy(apierrors.IsNotFound))

Also applies to: 605-611

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/servermaintenance_controller_test.go` around lines 575 -
578, The test assertions using Eventually(Get(server)).ShouldNot(Succeed()) and
Eventually(Get(serverMaintenance)).ShouldNot(Succeed()) are too loose and can
pass for unrelated API failures. Replace these checks to explicitly verify that
the resources are not found by using apierrors.IsNotFound to confirm the
intended deletion outcome instead of only checking that the Get operation
failed. This change needs to be applied at lines 575-578 (for the server
resource check) and at lines 605-611 (for the serverMaintenance resource check)
in the servermaintenance_controller_test.go file.

@nagadeesh-nagaraja nagadeesh-nagaraja left a comment

Contributor

Seems like we are both working on a similar topic:

#950

I believe the deletion should not be forced, in case it is triggered by a flaky issue.

Should the deletion be handled by the creator of the maintenance? Let me know what you think.

Comment on lines +76 to +79
log.V(1).Info("Referenced Server not found, deleting ServerMaintenance")
if err := r.Delete(ctx, maintenance); err != nil {
	return ctrl.Result{}, client.IgnoreNotFound(err)
}

Contributor

Don't you think a flaky issue could cause this to delete the maintenances?
I think the owner of the ServerMaintenance should be responsible for deleting it.

Contributor Author

Why should we leave orphaned maintenances? In the same way, we don't leave orphaned servers around after their BMC has been deleted.
No one will clean them up!

Contributor

Maintenances are mostly created by someone.
For example, when BMCSettings or a similar operator creates one, that operator needs to delete it, since it is the owner. It should detect that the referenced Server or BMC is gone, delete itself, and thereby delete the maintenance as well. (This is already done when the SET CRD detects that the server is gone.)

If it was created by a (human) user, then we can delete it, as that is easier for the user. Maybe we can check whether it has an owner reference and, if so, not delete it and let the owner handle it?
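
The owner-reference check proposed in this comment could look roughly like the sketch below. It is only an illustration of the suggestion, not code from this PR: `OwnerRef`, `Maintenance`, and `shouldDeleteOrphan` are hypothetical names, and the real object would carry owner references in its Kubernetes metadata.

```go
package main

import "fmt"

// OwnerRef is a simplified, hypothetical model of an owner reference.
type OwnerRef struct {
	Kind string
	Name string
}

// Maintenance is a simplified, hypothetical model of a ServerMaintenance.
type Maintenance struct {
	Name   string
	Owners []OwnerRef
}

// shouldDeleteOrphan reports whether the controller itself may delete a
// maintenance: only when the Server is gone and no owner is responsible.
func shouldDeleteOrphan(m Maintenance, serverExists bool) bool {
	if serverExists {
		return false
	}
	// With an owner reference present, cleanup is left to the owner.
	return len(m.Owners) == 0
}

func main() {
	byUser := Maintenance{Name: "manual"}
	byOperator := Maintenance{
		Name:   "settings",
		Owners: []OwnerRef{{Kind: "BMCSettings", Name: "bmc-a"}},
	}
	fmt.Println(shouldDeleteOrphan(byUser, false))
	fmt.Println(shouldDeleteOrphan(byOperator, false))
}
```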

@afritzler afritzler left a comment

Member

I would suggest not deleting resources in the reconciliation flow. Consider the following analogy: if you create a Pod with spec.nodeName set and the Node is not there, nobody in k8s will delete the Pod. It will transition into a Pending state. For our ServerMaintenance, I would suggest that whoever creates the ServerMaintenance should be responsible for deleting it.

@stefanhipfel

stefanhipfel commented Jun 16, 2026

Contributor Author

ServerMaintenance

I don't fully agree, since people often don't delete their ServerMaintenances or simply forget to. Can we at least set a different state on the ServerMaintenance object? Otherwise it would stay InMaintenance forever even with the server gone, which is not accurate either!

@stefanhipfel stefanhipfel changed the title from "Fix ServerMaintenance not deleted when referenced Server is gone" to "Fix ServerMaintenance transitions to Pending when referenced Server is deleted" Jun 16, 2026

@coderabbitai coderabbitai Bot left a comment

Contributor

♻️ Duplicate comments (3)
internal/controller/servermaintenance_controller_test.go (3)

563-563: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert NotFound explicitly instead of only "not succeed".

The check can pass on unrelated API failures; assert apierrors.IsNotFound to verify the intended deletion outcome.

🔍 Proposed fix
+	apierrors "k8s.io/apimachinery/pkg/api/errors"
 ...
-	Eventually(Get(server)).ShouldNot(Succeed())
+	Eventually(Get(server)).Should(Satisfy(apierrors.IsNotFound))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/servermaintenance_controller_test.go` at line 563, The
assertion using `Eventually(Get(server)).ShouldNot(Succeed())` is too broad and
will pass when the Get operation fails for any reason (network errors,
permissions issues, etc.), not just when the resource is deleted. Replace this
generic success check with an explicit assertion using `apierrors.IsNotFound` to
verify that the Get operation fails specifically because the resource was not
found, ensuring the test validates the intended deletion outcome rather than
just any API failure.

591-591: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert NotFound explicitly instead of only "not succeed".

These checks can pass on unrelated API failures; assert apierrors.IsNotFound to verify the intended deletion outcome.

🔍 Proposed fix
+	apierrors "k8s.io/apimachinery/pkg/api/errors"
 ...
-	Eventually(Get(server)).ShouldNot(Succeed())
+	Eventually(Get(server)).Should(Satisfy(apierrors.IsNotFound))
 ...
-	Eventually(Get(serverMaintenance)).ShouldNot(Succeed())
+	Eventually(Get(serverMaintenance)).Should(Satisfy(apierrors.IsNotFound))

Also applies to: 597-597

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/servermaintenance_controller_test.go` at line 591, The
test assertions at line 591 and line 597 in
internal/controller/servermaintenance_controller_test.go currently only verify
that the Get operation fails, but do not confirm the failure is specifically a
NotFound error. Replace the ShouldNot(Succeed()) assertions with explicit checks
using apierrors.IsNotFound to verify that the server object is actually deleted
and returns a 404 NotFound error, rather than failing due to other unrelated API
failures.

637-637: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert NotFound explicitly instead of only "not succeed".

The check can pass on unrelated API failures; assert apierrors.IsNotFound to verify the intended deletion outcome.

🔍 Proposed fix
+	apierrors "k8s.io/apimachinery/pkg/api/errors"
 ...
-	Eventually(Get(serverMaintenance)).ShouldNot(Succeed())
+	Eventually(Get(serverMaintenance)).Should(Satisfy(apierrors.IsNotFound))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/servermaintenance_controller_test.go` at line 637, The
assertion at the Get(serverMaintenance) call only verifies that the operation
did not succeed, but this can pass for any API failure, not specifically
deletion. Replace the ShouldNot(Succeed()) check with an explicit assertion
using apierrors.IsNotFound to verify that the serverMaintenance resource was
actually deleted and not just failing for an unrelated reason. This ensures the
test validates the intended deletion outcome rather than any arbitrary API
failure.
🧹 Nitpick comments (1)
internal/controller/servermaintenance_controller_test.go (1)

566-566: 💤 Low value

Remove redundant Equal() wrapper.

HaveField already performs equality checking, so wrapping the expected value in Equal() is unnecessary.

♻️ Simplification
-	Eventually(Object(serverMaintenance)).Should(HaveField("Status.State", Equal(metalv1alpha1.ServerMaintenanceStatePending)))
+	Eventually(Object(serverMaintenance)).Should(HaveField("Status.State", metalv1alpha1.ServerMaintenanceStatePending))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/servermaintenance_controller_test.go` at line 566, Remove
the redundant Equal() wrapper from the HaveField assertion. In the
Eventually(Object(serverMaintenance)).Should(HaveField(...)) call, HaveField
already performs equality checking internally, so pass
metalv1alpha1.ServerMaintenanceStatePending directly as the second argument to
HaveField instead of wrapping it with Equal().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 870e4f3b-e13e-49a3-bb45-9b68247bb783

📥 Commits

Reviewing files that changed from the base of the PR and between c866f34 and 50e70b0.

📒 Files selected for processing (2)
  • internal/controller/servermaintenance_controller.go
  • internal/controller/servermaintenance_controller_test.go

When the referenced Server is not found, transition ServerMaintenance to
Pending instead of deleting it. This follows the Kubernetes PVC pattern
where a resource waits in Pending when its dependency is missing, and
keeps the creator responsible for cleanup.

Signed-off-by: Stefan Hipfel <stefan.hipfel@sap.com>

Labels

bug Something isn't working size/M

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

When deleting BMCs the ServerMaintenance is not being deleted

3 participants