Skip to content

Rollbacks: Adding ArgoCD deployer backend#499

Open
goelozev wants to merge 26 commits into
pre-release-5.1.0from
rollback-argocd-backend
Open

Rollbacks: Adding ArgoCD deployer backend#499
goelozev wants to merge 26 commits into
pre-release-5.1.0from
rollback-argocd-backend

Conversation

@goelozev

@goelozev goelozev commented Apr 20, 2026

Copy link
Copy Markdown
Member

Summary

Adding ArgoCD CI/CD backend for the rollback controller. This implementation will allow for faster rollbacks as it will skip our current CI system and directly perform the rollback. To run the rollback controller with ArgoCD, you will need to provide the following cmd flags:

  • --cicd-backend=argocd
  • --argocd-server-url=<your-argocd-url> the base URL of your ArgoCD server (e.g. https://argocd.yourcompany.dev)
  • --argocd-auth-token=<your-argocd-token> authentication token for a robot account, defined in argocd-cm (e.g. accounts.theatre: apiKey`)
  • --argocd-app-name-template="<template>" - define template to resolve the application name. In GC, we are using the compute-lab-{{.Namespace}}-{{.Target}} template for all ArgoCD application names. If you wish, you can modify that template, but currently, the only available fields you can use are .Namespace and .Target. You can also override this on a Rollback level by providing the argocd_app_name deployment option.

Requirements

Rollbacks using this CI/CD backend requires the following deployment options:

  • target_revision - a specific revision or a JSONPath in the Release to find the revision (e.g. '{.config.revisions[?(@.name=="infrastructure")].id}').

Optional deployment options:

  • app_revision - a specific revision or a JSONPath in the Release to find the revision (e.g. '{.config.revisions[?(@.name=="application")].id}').
  • argocd_app_name - you can provide this deployment option if you wish to specify a specific application name if the application is not using the template provided in --argocd-app-name-template.
  • argocd_add_sync_window - when set to true, a sync window will be added to the application, which will pause any further deployments

Changes to the Deployer interface

The Deployer interface (used to provide structure for CI/CD backends) now has a PostDeploymentHooks method, which will be called after the Rollback is deemed Healthy.

Tests

Tests conducted in our lab environment:

  1. Rollback with target_revision, app_revision and argocd_add_sync_window=true set ✅
  2. Rollback with target_revision and argocd_add_sync_window=true set ✅
  3. Rollback with target_revision only set ✅

@goelozev goelozev force-pushed the rollback-argocd-backend branch 4 times, most recently from 06e7315 to 40cf853 Compare April 21, 2026 15:10
@goelozev goelozev requested review from 0x0013 and IsaacJames and removed request for IsaacJames April 22, 2026 08:28
Comment thread internal/controller/deploy/rollback_controller.go
@goelozev goelozev marked this pull request as ready for review April 22, 2026 11:06
Comment thread pkg/cicd/deployer.go Outdated
GetDeploymentStatus(ctx context.Context, deploymentID string) (*DeploymentResult, error)

// PostDeploymentHooks is called after a deployment is successful.
// E.g. adding a sync window on a ArgoCD application

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// E.g. adding a sync window on a ArgoCD application
// E.g. adding a deny sync window on a ArgoCD application

Comment thread pkg/cicd/argocd/README.md Outdated

| Key | Description |
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `target_revision` | The infrastructure revision to deploy. Can be a literal revision string or a JSONPath expression into the Release (e.g. `{.config.revisions[?(@.name=="infrastructure")].id}`). |

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to accept literal revision string here (or for app_revision )? I'm thinking if someone creates a Rollback to release_A and then specifies literal string revisions that aren't actually associated with that release, then the target state would have nothing to do with the target release, which would be misleading.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I will update the documentation to not state that you can provide a literal revision.

I'm thinking if someone creates a Rollback to release_A and then specifies literal string revisions that aren't actually associated with that release, then the target state would have nothing to do with the target release, which would be misleading.

With the interfaces we provide within GC, I don't think this will be an issue, but still a valid point. You can also make the case that people will be able to use a JSONPath that doesn't target the original target_revision, which will result in an inaccurate rollback.

Comment thread internal/controller/deploy/rollback_controller.go
Comment thread pkg/cicd/argocd/deployer.go Outdated
Comment on lines +345 to +361
func (d *Deployer) mapSyncStatus(app applicationResponse) (cicd.DeploymentStatus, string) {
// Check operation state first — it reflects the active sync operation
if op := app.Status.OperationState; op != nil && op.Phase != "" {
switch op.Phase {
case OperationPhaseRunning:
return cicd.DeploymentStatusInProgress, op.Message
case OperationPhaseError, OperationPhaseFailed:
return cicd.DeploymentStatusFailed, op.Message
case OperationPhaseSucceeded:
return cicd.DeploymentStatusSucceeded, op.Message
default:
return cicd.DeploymentStatusPending, fmt.Sprintf("Operation phase: %s", op.Phase)
}
}

return cicd.DeploymentStatusPending, "No active operation"
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still unsure about mapping the operation status of whatever operation happens to be running to the deployment status. I'm not sure if there is an easy way to verify the operation is the same rollback sync that the controller initiated, but not doing so could produce some weird behaviour. E.g. if the controller fails to trigger a sync, but some other actor (CICD, human etc.) triggers one, the rollback could get marked as in progress/completed.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I'm not sure if we need to verify whether the operation is even a sync operation (though I'm unsure if other types of operations exist as I write this).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a valid concern!

After a bit of back and forth with Claude today, I've updated the logic so the application name and the set of revisions are recovered as the deployment URL. Then, when the GetDeploymentStatus is called, the logic makes sure that the latest sync operation has the target revisions we are rolling back to. Even if an external action (person or CD) syncs to the same revision, the rollback will be marked as successful.

Only if the external actor syncs to a different set of revisions before the controller reconciles the rollback to a terminal state, or the sync operation fails, will the rollback be marked as failed.

@0x0013

This comment was marked as outdated.

@goelozev goelozev force-pushed the rollback-argocd-backend branch from 58834d1 to 9651dfb Compare May 6, 2026 15:56
@goelozev

goelozev commented May 7, 2026

Copy link
Copy Markdown
Member Author

While comparing different promp context inclusions for claude reviews, I was running the experiments on the changes in this PR. Pasting one of the reviews (which was considered the highest quality review)

Note that I haven't personally verified the validity of the review.

@0x0013 I've minimised your comment as I changed the logic of the argocd deployer, and I will run the Claude code review separately on the new logic.

@goelozev

This comment was marked as outdated.

@claude

This comment was marked as outdated.

@goelozev

goelozev commented May 8, 2026

Copy link
Copy Markdown
Member Author

@claude review this PR!

@claude

claude Bot commented May 8, 2026

Copy link
Copy Markdown

Claude finished @goelozev's task in 3m 6s —— View job


Code Review: Rollbacks — Adding ArgoCD Deployer Backend

  • Gather PR diff and context
  • Read all changed files in full
  • Review ArgoCD deployer implementation
  • Review controller changes
  • Review interface changes
  • Review tests
  • Post final review

🔴 Critical Issues

[pkg/cicd/argocd/types.go:101-104 + deployer.go:219-234] Plugin struct field with omitempty never omitted — may silently clear ArgoCD plugin config

applicationPatchSource.Plugin is declared as a value type (applicationPatchPlugin) with omitempty:

type applicationPatchSource struct {
    TargetRevision string                 `json:"targetRevision"`
    Plugin         applicationPatchPlugin `json:"plugin,omitempty"`  // ← BUG
}

Go's encoding/json only applies omitempty to primitive types, pointers, slices, maps, and interfaces — not structs. A zero-value struct is never omitted. This means every updateApplication call where appRevision == "" will marshal a patch containing "plugin":{"env":null}, which in a JSON merge-patch will set plugin.env to null on the ArgoCD application, potentially clearing any existing REVISION env variable on applications that were previously configured with one.

Fix: change Plugin to a pointer type — Plugin *applicationPatchPlugin \json:"plugin,omitempty"``. Fix this →


🟡 Major Concerns

[cmd/rollback-manager/main.go:150] http.DefaultClient has no timeout

return argocddeployer.NewDeployer(http.DefaultClient, ...)

http.DefaultClient has no timeout configured. In a production controller reconciling Kubernetes resources, a hung HTTP connection to an ArgoCD server will block the reconcile goroutine indefinitely. Use &http.Client{Timeout: someReasonableTimeout} (e.g. 30s).


[cmd/rollback-manager/main.go:139, main.go:149] Deployers use dev-mode logger instead of the application logger

Both github and argocd deployer construction use zap.New(zap.UseDevMode(true)) instead of the logger derived from commonOptions.Logger(). This was pre-existing in the github deployer but is being carried forward here. Production deployments will emit development-format logs instead of the structured JSON logs expected by the rest of the stack.


[pkg/cicd/argocd/deployer.go:402-418] Dead code: mapSyncStatus method is never called

mapSyncStatus defines a mapping from operation state to DeploymentStatus, but GetDeploymentStatus implements its own inline logic and never calls this method. The dead method could mislead future contributors about which code path is active. It should be removed.


[pkg/cicd/argocd/deployer.go:371-377] Sync window is applied to the entire project (Applications: ["*"]), not just the specific application

proj.Spec.SyncWindows = append(proj.Spec.SyncWindows, SyncWindow{
    Kind:         "deny",
    Schedule:     "* * * * *",
    Duration:     "1h",
    Applications: []string{"*"},  // ← all apps in the project
    ...
})

This blocks syncs for every application in the ArgoCD project, not just the application that was rolled back. If a project houses multiple services, triggering a rollback on one will prevent all others from being synced for up to an hour. The Applications list should at minimum target only the specific appName involved in the rollback.


[pkg/cicd/argocd/deployer.go:371-377] Sync windows are never deduplicated

There is no guard against adding a duplicate deny window. If PostDeploymentHooks is called more than once (e.g. if the controller retries after a transient failure during status update, or the rollback is re-reconciled), additional deny windows accumulate in the project indefinitely. At minimum, check whether an identical window already exists before appending.


[internal/controller/deploy/rollback_controller.go:262-268] PostDeploymentHooks failures are silently swallowed

err := r.triggerPostDeploymentHooks(ctx, rollback, toRelease)
if err != nil {
    logger.Error(err, "failed to trigger post deployment hooks")
    message = fmt.Sprintf("%s (post-deployment hooks failed: %v)", message, err)
}
return r.markRollbackSucceeded(ctx, logger, rollback, message)

A hooks failure only appends to the status.message string — the rollback is still marked Succeeded. There's no dedicated status condition, no event emitted, and no retry. If adding the sync window fails (e.g. a 500 from the ArgoCD project API), the operator has no reliable way to alert on it. This is acknowledged in the PR review thread, but given the operational importance of the sync window feature, at minimum an Event should be emitted via the recorder so the failure is visible in kubectl describe rollback.


[pkg/cicd/argocd/deployer.go:367-369, deployer.go:383-386] addSyncWindow returns plain errors — not *cicd.DeployerError

if err := json.NewDecoder(getResp.Body).Decode(&proj); err != nil {
    return fmt.Errorf("failed to decode project response: %w", err)  // plain error
}
...
if err := json.Marshal(projectUpdate); err != nil {
    return fmt.Errorf("failed to marshal project update: %w", err)   // plain error
}

These error paths return plain error values. The controller uses errors.As(err, &deployerErr) to determine retryability; plain errors are treated as non-retryable. The serialisation failures are probably fine as non-retryable, but the inconsistency with the rest of the package is confusing. Wrapping in cicd.NewDeployerError with explicit retryability is more defensive.


[pkg/cicd/argocd/deployer.go:341] getApplication hardcodes "GetDeploymentStatus" as the operation name regardless of caller

When getApplication is called from PostDeploymentHooks, any resulting DeployerError will report Operation: "GetDeploymentStatus", making diagnostics misleading. The operation name should be passed as a parameter.


🔵 Minor Improvements

[pkg/cicd/argocd/README.md:42-51] Status mapping table is inaccurate after refactor

The README states SucceededSucceeded, but GetDeploymentStatus now performs revisionMatches as its primary check and only falls through to operationState for in-progress/failure cases. OperationPhaseSucceeded maps to Pending (not Succeeded) in the current code. The table should be updated to describe the actual state-machine logic.

[pkg/cicd/argocd/README.md] argocd_add_sync_window option missing from documentation

The argocd_add_sync_window deployment option is described in the PR body and implemented in the code, but absent from the README's optional deployment options table. It should be documented there.

[pkg/cicd/argocd/deployer.go:26] deploymentIDSep = "::" fragility

If an ArgoCD application name or a target revision ever contains "::" (unlikely but possible for exotic revision strings), decodeDeploymentID will split incorrectly. Consider using a character that can't appear in a URL path segment or Git ref, or use a proper encoding (e.g. base64 or structured field separator).

[pkg/cicd/argocd/deployer_test.go] Missing test: GetDeploymentStatus when operationState.Phase == "Succeeded" but revisionMatches is also false for the appRevision component

The test suite covers most branches well. One gap: the case where the last operation succeeded, the targetRevision matches the spec, but appRevision does not yet match (e.g. ArgoCD updated the helm/plugin revision separately). The current code would fall into the OperationPhaseSucceededspec still targets our revisionPending branch — a test covering this scenario would make the intent explicit.


✅ Positive Observations

  • The revisionMatches-first check in GetDeploymentStatus is a sound approach: it makes success determined by the actual applied state, not just the operation phase, which correctly handles the race where an external actor syncs to the same revision.
  • encodeDeploymentID / decodeDeploymentID are cleanly symmetric and well-tested with round-trip tests.
  • HTTP error handling correctly distinguishes retryable (5xx, 429) from non-retryable (4xx) responses throughout handleErrorResponse.
  • Test coverage with gock is thorough and covers all major HTTP failure modes. The Describe/Context/It structure follows the project's Ginkgo conventions consistently.
  • The resolveAppName fallback chain (option → template → error) is intuitive and well-tested.
  • Deployer interface change (PostDeploymentHooks) is backwards-compatible for existing implementations (NoopDeployer and GitHub deployer both provide trivial implementations).

Summary & Next Steps

The PR introduces a solid foundation for ArgoCD-backed rollbacks with good test coverage and a well-reasoned status-detection strategy. The most important fix before merge is the Plugin struct/omitempty bug — it will silently corrupt ArgoCD plugin config on applications that don't use app_revision. The sync window scope (Applications: ["*"]) and lack of deduplication are also high-priority operational correctness issues. The http.DefaultClient timeout and dev-mode logger are straightforward fixes. Dead code (mapSyncStatus) should be cleaned up before it causes confusion.

fix: add default http client timeout

fix: switch to the existing logger

fix: removed unused mapSyncStatus

refactor: move error handler of getApplication

docs: updated ArgoCD deployer docs

test: adding a new test case

fix: switch back to the default client
@goelozev goelozev force-pushed the rollback-argocd-backend branch from f8d8fd5 to 7355a1e Compare May 8, 2026 15:41
@goelozev

goelozev commented May 8, 2026

Copy link
Copy Markdown
Member Author

Claude finished @goelozev's task in 3m 6s —— View job
...

🔴 Critical Issues

[pkg/cicd/argocd/types.go:101-104 + deployer.go:219-234] Plugin struct field with omitempty never omitted — may silently clear ArgoCD plugin config
Addressed ✅

🟡 Major Concerns

[cmd/rollback-manager/main.go:139, main.go:149] Deployers use dev-mode logger instead of the application logger

Addressed ✅

[pkg/cicd/argocd/deployer.go:402-418] Dead code: mapSyncStatus method is never called

Addressed ✅

[pkg/cicd/argocd/deployer.go:371-377] Sync window is applied to the entire project (Applications: ["*"]), not just the specific application

That's expected behaviour!

[pkg/cicd/argocd/deployer.go:371-377] Sync windows are never deduplicated
Deduplication is handled by the ArgoCD API.

[internal/controller/deploy/rollback_controller.go:262-268] PostDeploymentHooks failures are silently swallowed

This is expected behaviour!

[pkg/cicd/argocd/deployer.go:367-369, deployer.go:383-386] addSyncWindow returns plain errors — not *cicd.DeployerError

Addressed ✅

[pkg/cicd/argocd/deployer.go:341] getApplication hardcodes "GetDeploymentStatus" as the operation name regardless of caller

Addressed ✅

When getApplication is called from PostDeploymentHooks, any resulting DeployerError will report Operation: "GetDeploymentStatus", making diagnostics misleading. The operation name should be passed as a parameter.

Addressed ✅

🔵 Minor Improvements

[pkg/cicd/argocd/README.md:42-51] Status mapping table is inaccurate after refactor

Addressed ✅

[pkg/cicd/argocd/README.md] argocd_add_sync_window option missing from documentation

Addressed ✅

[pkg/cicd/argocd/deployer.go:26] deploymentIDSep = "::" fragility

Not addressed.

[pkg/cicd/argocd/deployer_test.go] Missing test: GetDeploymentStatus when operationState.Phase == "Succeeded", but revisionMatchesis also false for theappRevision` component

Addressed ✅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants