fix(site-update): Make site update recovery resilient and recoverable by balamurali27 · Pull Request #6729 · frappe/press

balamurali27 · 2026-06-17T04:49:35Z

Problem

When a site update (Pull/Migrate onto a new bench) fails, recovery could end up Fatal with no automatic path back and no clear signal to the user:

Recovery migrate queries on large sites could exceed the database server's max_statement_time and get killed mid-query, turning a recoverable failure into a fatal one.
A migrate recovery that failed mid table-restore had no automatic follow-up — the site was left Broken even though its tables could still be restored from the backup. Re-running the full recovery doesn't work: the agent's move_site is not idempotent (the site has already been moved back), so it fails at "Move Site".
An update run with backups skipped that failed gave only a generic notification, leaving the user unsure whether manual intervention was needed or how to perform it.

Changes

`max_statement_time` bump before a recovery migrate (and revert after)

Before a recovery migrate on a large database (over LARGE_DATABASE_SIZE, 2 GB), bump max_statement_time by an hour (dynamic MariaDB variable, no restart), recorded as a comment on the Site Update. Smaller databases finish well within the timeout and are skipped.
The pre-bump value is stashed on the Site Update (previous_max_statement_time) and restored once recovery finishes — at the recover job's terminal state, or, when the fallback table restore runs, at that job's callback. Without this the value ratcheted up an hour on every recovery.
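The stash-and-restore behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the Press implementation: `FakeDatabaseServer`, `FakeSiteUpdate`, the method names, and the 7200-second starting value are all hypothetical, and the sketch uses `None` as its "never bumped" marker.

```python
from dataclasses import dataclass
from typing import Optional

STATEMENT_TIME_INCREMENT = 3600  # one hour, in seconds
LARGE_DATABASE_SIZE = 2048  # MB, i.e. 2 GB


@dataclass
class FakeDatabaseServer:
    """Stand-in for the server's max_statement_time variable (hypothetical)."""

    max_statement_time: int = 7200  # assumed starting value, for illustration only


@dataclass
class FakeSiteUpdate:
    """Stand-in for a Site Update record (hypothetical)."""

    server: FakeDatabaseServer
    database_size_mb: int
    previous_max_statement_time: Optional[int] = None

    def bump_before_recovery(self) -> None:
        # Small databases finish within the timeout, so they are skipped.
        if self.database_size_mb <= LARGE_DATABASE_SIZE:
            return
        # Stash the pre-bump value so it can be put back later.
        self.previous_max_statement_time = self.server.max_statement_time
        self.server.max_statement_time += STATEMENT_TIME_INCREMENT

    def restore_after_recovery(self) -> None:
        # No-op unless a bump was recorded.
        if self.previous_max_statement_time is None:
            return
        self.server.max_statement_time = self.previous_max_statement_time
        self.previous_max_statement_time = None
```

Because every bump is paired with a restore, running several recoveries in a row leaves the variable at its original value instead of growing by an hour each time.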

One-shot `Restore Site Tables` fallback when a migrate recovery fails — but only when safe

Press makes one automatic attempt to bring the site back up, only when all hold:

it was a Recover Failed Site Migrate;
its "Move Site" step succeeded (so the site is back on the source bench — otherwise restoring tables would target the wrong bench); and
it failed due to a transient DB error (MySQL server has gone away / Lost connection to MySQL server, detected from the job output/traceback and step output). Other failures are genuine problems left Fatal for manual attention.

Since the failed recovery has already moved the site back, only the table restore is left undone, so Press re-issues just that job (linked on the Site Update as a comment for traceability).
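As a rough sketch of the three conditions above, the gate can be expressed as a single predicate. The `RecoverJob` shape and both function names are hypothetical; only the job type, the "Move Site" step name, and the two error strings come from the description.

```python
from dataclasses import dataclass, field

TRANSIENT_DB_ERRORS = (
    "MySQL server has gone away",
    "Lost connection to MySQL server",
)


@dataclass
class RecoverJob:
    """Hypothetical, simplified view of a failed recover job."""

    job_type: str
    output: str = ""
    traceback: str = ""
    # step name -> (status, output)
    steps: dict = field(default_factory=dict)


def failed_with_transient_db_error(job: RecoverJob) -> bool:
    # Look at the job's own output and traceback as well as every step's output.
    texts = [job.output, job.traceback] + [out for _, out in job.steps.values()]
    return any(err in text for text in texts for err in TRANSIENT_DB_ERRORS)


def should_restore_tables(job: RecoverJob) -> bool:
    move_status, _ = job.steps.get("Move Site", ("Pending", ""))
    return (
        job.job_type == "Recover Failed Site Migrate"
        and move_status == "Success"  # the site is back on the source bench
        and failed_with_transient_db_error(job)
    )
```

Any failure that does not satisfy all three checks falls through to `False`, which corresponds to leaving the update Fatal for manual attention.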

On a successful fallback `Restore Site Tables`, the update stays `Fatal`

The site becomes Active again and the fatal update's cause of failure is marked resolved (set_cause_of_failure_is_resolved, which also clears the site's fatal_site_update). The update itself still failed and is recorded as such, rather than being silently flipped to Recovered.

Actionable notification for skipped-backups failures

Replaces the generic message with one telling the user the site can't be recovered automatically, linking to the SSH docs to fix it manually. (Skipped-backups failures go straight to Fatal — no recovery, no table restore, since there's no backup.)

Notes

Site Usage stores sizes in MB, not bytes (despite unitless Int fields), so the 2 GB threshold is 2048, not 2 * 1024**3. Documented in a new Site Usage README.
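The unit arithmetic is small enough to spell out. In this sketch, `LARGE_DATABASE_SIZE` matches the constant named in the description; the other names are illustrative helpers, not identifiers from the PR.

```python
MB_PER_GB = 1024
BYTES_PER_MB = 1024**2

# Threshold in the unit Site Usage actually stores (MB).
LARGE_DATABASE_SIZE = 2 * MB_PER_GB  # 2048

# The same threshold if the field had been in bytes (it is not).
LARGE_DATABASE_SIZE_IN_BYTES = LARGE_DATABASE_SIZE * BYTES_PER_MB  # 2 * 1024**3


def is_large_database(database_size_mb: int) -> bool:
    """True when a size reported in MB is over the 2 GB threshold."""
    return database_size_mb > LARGE_DATABASE_SIZE
```

Comparing an MB value against a byte-scale constant would make the check false for every realistic site, which is why the unit matters here.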

While here, all agent-job failure notification messages were flattened to single lines via implicit string concatenation — the fix dialog renders them with whitespace-pre-wrap, so source indentation and mid-paragraph line breaks were showing as literal whitespace.
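The difference between the two styles is easy to see in isolation. The message text below is made up for illustration; the point is only how each literal ends up as a string.

```python
# A triple-quoted literal keeps the source indentation and line breaks,
# which a whitespace-pre-wrap container renders literally.
indented = """
    The update failed.
    Please retry.
"""

# Adjacent string literals are joined at compile time into one clean line.
flattened = (
    "The update failed. "
    "Please retry."
)
```

With implicit concatenation, the only whitespace in the result is what is typed inside the quotes, so each paragraph becomes a single line regardless of how the source is wrapped.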

Docs

docs/code/site-update/index.md — the recovery flow, status lifecycle, the transient-only restore fallback, the max_statement_time bump/revert, and the constants.
press/press/doctype/site_update/README.md — Fatal-state behavior with the transient-error fallback.
press/press/doctype/site_usage/README.md — the MB-not-bytes gotcha.
Removed the redundant top-level guide-to-testing.md in favour of docs/code/testing/.

Tests

bench --site <site> run-tests --app press --module press.press.doctype.site_update.test_site_update — passing, including:

a transient-error migrate recovery triggers Restore Site Tables and, on success, leaves the update Fatal with its cause of failure resolved and the site Active
a transient-error recovery followed by a failed table restore goes Fatal with fatal_site_update set
a non-transient recovery failure does not restore tables (stays Fatal)
a recovery that fails before "Move Site" does not restore tables
a successful recovery does not restore tables (stays Recovered)
a skipped-backups update failure goes straight to Fatal with no recovery/restore, and surfaces the actionable SSH notification
the recovery-migrate max_statement_time bump is stashed and restored to its pre-bump value
increase_max_statement_time bumps the DB server variable by an hour; Site.database_size returns the latest Site Usage value (in MB)

Also verified end-to-end on a local Vagrant Frappe Cloud: real Update Site Migrate → Recover Failed Site Migrate (killed mid-restore) → Restore Site Tables success, ending Fatal with cause resolved and the site Active.

🤖 Generated with Claude Code

When a recovery job (Recover Failed Site Update/Pull/Migrate) fails due to a transient database error like "MySQL server has gone away" or "Lost connection to MySQL server", the site update was immediately marked Fatal. Retry the recovery job up to MAX_RECOVERY_RETRIES times before giving up so that a transient blip doesn't strand the site in a Fatal state. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

When an operator manually runs Restore Tables to recover a site stuck in a Fatal site update, a successful restore now clears the site's fatal_site_update and marks the associated Site Update as Recovered, instead of leaving it Fatal forever. Also make restore_tables idempotent with respect to status_before_update so repeated attempts don't overwrite the originally captured status (e.g. with Broken), which would prevent the site from being reactivated on success. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Restoring site tables during recovery runs heavy queries that can exceed the database server's max_statement_time on large sites, killing the restore with "max_statement_time exceeded" and leaving the site Broken. When a Restore Site Tables job fails for that reason, double the max_statement_time variable on the database server — a dynamic variable, so it applies without a restart — and retry the restore, up to MAX_STATEMENT_TIMEOUT_RETRIES times. Each additional attempt is recorded as a comment on the fatal Site Update being recovered so the escalation is auditable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A site update run with backups skipped can fail mid-migration, leaving the database partially migrated. There is no backup to restore from, so the site cannot be recovered automatically — the user has to fix it over SSH. Detect this case when building the agent job failure notification and replace the generic message with an actionable one that tells the user to connect to the bench over SSH and fix the site manually. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A recovery migrate runs heavy restore and migrate queries that can exceed the database server's max_statement_time on large sites and get killed, sending the update Fatal. Proactively increase max_statement_time by an hour before triggering the recovery migrate job so the recovery isn't timed out mid-query in the first place. Extract the bump into a shared Site.increase_max_statement_time() helper (incrementing by an hour rather than doubling) and reuse it from both the proactive path and the restore-tables timeout retry. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The transient-error check only looked at Agent Job Step output, but the "MySQL server has gone away" / "Lost connection" message also surfaces in the recover job's own output/traceback. Check those too so a retryable recovery isn't sent straight to Fatal. Also mark the Site Update (and site) as Recovering when scheduling a retry, so the in-progress state is observable instead of lingering on the prior Failure status until the new recover job starts running. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

STATEMENT_TIME_INCREMENT is defined at the bottom of the module, but using it as a default argument value evaluates it when the class body runs at import time — before the constant exists — raising NameError on import. Default to None and resolve inside the method instead. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
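The import-time failure can be reproduced without Frappe. The sketch below is illustrative: the `Site` class here is a bare stand-in, and the method signature is an assumption rather than the PR's actual code. The first part executes a module-like source string in which the default argument refers to a constant defined further down; the second part shows the `None` default resolved inside the method.

```python
def define_with_eager_default() -> str:
    """Return the name of the error raised when a default uses a later-defined constant."""
    namespace: dict = {}
    source = (
        "class Site:\n"
        "    def increase_max_statement_time(self, increment=STATEMENT_TIME_INCREMENT):\n"
        "        return increment\n"
        "STATEMENT_TIME_INCREMENT = 3600\n"
    )
    try:
        # Defaults are evaluated when the def statement runs, i.e. while the
        # class body executes, before the constant on the last line exists.
        exec(source, namespace)
    except NameError as exc:
        return type(exc).__name__
    return "ok"


class Site:
    def increase_max_statement_time(self, increment=None):
        # Resolved at call time, by which point the module has finished loading.
        if increment is None:
            increment = STATEMENT_TIME_INCREMENT
        return increment


STATEMENT_TIME_INCREMENT = 3600
```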

Cover what runs when an update fails: automatic recovery to the source bench, the proactive max_statement_time bump before a migrate recovery, retrying on transient database errors, restoring tables after a fatal update (with the statement-timeout retry), and the skipped-backups case that requires manual SSH intervention. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add tests for the site update recovery behaviour:
- recovery retries on transient DB errors and goes Fatal after max retries
- restore_tables success reactivates the site and marks the update Recovered
- increase_max_statement_time bumps the database server variable by an hour
- a Restore Site Tables statement timeout retries after bumping the variable
- a skipped-backups update failure surfaces an actionable SSH notification
Add an ignore_validate flag to create_test_site_update so record-only fixtures don't need a full destination-bench setup. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

greptile-apps · 2026-06-17T05:05:06Z

Confidence Score: 3/5

The recovery fallback logic is well-tested and correctly gated, but a new Delivery Failure mapping in process_restore_tables_job_update clears database_name when the restore never ran, and several edge cases in the max_statement_time bump/restore path remain open from prior review rounds.

The Delivery Failure branch introduced here reaches frappe.db.set_value("Site", job.site, "database_name", None) even when the restore job never touched the agent, corrupting the site's database-name record while leaving the actual database intact. Combined with unresolved concerns from earlier iterations around the max_statement_time zero-value sentinel and synchronous Ansible failures propagating out of trigger_recovery_job, the recovery path has multiple failure modes that leave sites broken with no automatic path back.

press/press/doctype/site/site.py (process_restore_tables_job_update Delivery Failure branch and the max_statement_time zero-value guard) and press/press/doctype/site_update/site_update.py (bump_max_statement_time_before_recovery exception propagation)

Important Files Changed

Filename	Overview
press/press/doctype/site_update/site_update.py	Core recovery orchestration; adds max_statement_time bump/restore, the one-shot Restore Site Tables fallback, and transient-error detection — several edge cases remain from prior review rounds
press/press/doctype/site/site.py	Adds database_size property, increase_max_statement_time, set_max_statement_time, and updates process_restore_tables_job_update to handle fatal_update; zero-value and exception-propagation edge cases from prior rounds remain open
press/press/doctype/agent_job/agent_job_notifications.py	Adds skipped-backups actionable notification and flattens multi-line message strings; logic is correct, i18n gap noted in prior thread
press/press/doctype/site_update/test_site_update.py	Comprehensive new tests covering all major recovery scenarios including transient-error fallback, Move Site guard, non-transient no-op, skipped-backups notification, and max_statement_time bump/restore
press/press/doctype/site_update/site_update.json	Adds previous_max_statement_time Int field with no_copy and read_only; field definition is correct

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    UF[Update Job Fails] --> SB{Backups skipped?}
    SB -- Yes --> FT[Fatal + SSH notification]
    SB -- No --> TR[trigger_recovery_job]
    TR --> bump{Large DB AND deploy_type=Migrate?}
    bump -- Yes --> BMP[bump max_statement_time store previous_max_statement_time]
    bump -- No --> RJOB
    BMP --> RJOB[Recover Failed Site job]
    RJOB --> RES{Recovery result}
    RES -- Success --> REC[Recovered restore max_statement_time]
    RES -- Failure --> FAT[Fatal site Broken fatal_site_update set]
    FAT --> COND{Recover Failed Site Migrate + Move Site=Success + transient DB error?}
    COND -- No --> LEAVE[Leave Fatal restore max_statement_time]
    COND -- Yes --> RST[Restore Site Tables job keep max_statement_time elevated]
    RST --> RSTRES{Restore result}
    RSTRES -- Success --> ACT[Site Active update stays Fatal cause_of_failure_is_resolved restore max_statement_time]
    RSTRES -- Failure --> BRK[Site Broken fatal_site_update remains restore max_statement_time]

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
press/press/doctype/site/site.py:5028
**`database_name` cleared on `"Delivery Failure"` even though nothing ran**

For `"Delivery Failure"` the `Restore Site Tables` job was never delivered to the agent — the database was never touched. The else branch still runs `frappe.db.set_value("Site", job.site, "database_name", None)`, which wipes the site's database name even though the database exists and is intact. The site is now `Broken` and trying to reconnect to the database would fail because the name record is gone. For `"Failure"` this clearing is arguably intentional (partial restore, unknown state); for `"Delivery Failure"` it introduces corruption that wouldn't have existed before this PR added the mapping entry.

codecov-commenter · 2026-06-17T05:06:44Z

Codecov Report

❌ Patch coverage is 94.71545% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.91%. Comparing base (b56a7c7) to head (5c6fc7c).
⚠️ Report is 24 commits behind head on develop.

Files with missing lines	Patch %	Lines
...press/doctype/agent_job/agent_job_notifications.py	60.00%	8 Missing ⚠️
press/press/doctype/site/site.py	91.17%	3 Missing ⚠️
press/press/doctype/site_update/site_update.py	97.05%	1 Missing ⚠️
...ress/press/doctype/site_update/test_site_update.py	99.36%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           develop    #6729       +/-   ##
============================================
- Coverage    62.94%   50.91%   -12.03%     
============================================
  Files          117      995      +878     
  Lines        18110    84143    +66033     
  Branches       527      527               
============================================
+ Hits         11399    42843    +31444     
- Misses        6678    41267    +34589     
  Partials        33       33

Flag	Coverage Δ
dashboard	`62.94% <ø> (ø)`

The proactive max_statement_time bump before a recovery migrate is only needed for large sites — small databases finish well within the statement timeout, so bumping the DB server variable for them is pointless churn. Gate the bump on Site.database_size, a new property exposing the latest Site Usage database size. Site Usage stores sizes in MB (not bytes, despite the unitless Int field), so LARGE_DATABASE_SIZE is 1024, not 1024**3. The dashboard confirms the unit: SiteOverview renders these via $format.bytes(v, 2, 2), where current=2 shifts the scale to start at MB. Added a Site Usage README documenting the MB-not-bytes gotcha. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Retrying the Restore Site Tables job on a statement timeout is enough on its own — the heavy queries that timed out get another full max_statement_time window each attempt. Bumping the DB server variable on top of that was redundant churn, so drop it and just retry. The proactive bump before a recovery migrate (Site.increase_max_statement_time) is unchanged; it's the only remaining caller. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The one-hour max_statement_time bump done proactively before a recovery migrate already gives heavy queries enough headroom, so the separate retry-on-timeout loop around Restore Site Tables was redundant. Remove it along with its helper, constants, and the call in process_restore_tables_job_update — on a restore failure the site simply stays Broken. Removes retry_restore_tables_after_statement_timeout, restore_tables_failed_due_to_statement_timeout, STATEMENT_TIMEOUT_ERROR, and MAX_STATEMENT_TIMEOUT_RETRIES. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The docs/code/testing set (index, mocking, best-practices) is a cleaner, reorganized version that covers everything guide-to-testing.md did, namely prerequisites, test-site setup, running tests, mocking, and rerunnability. This removes the duplicate and points AGENTS.md at the new docs. CLAUDE.md picks it up transitively via @AGENTS.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The recover_job link is overwritten with the latest recover job by trigger_recovery_job, so when recovery is retried on a transient DB error the previously failed recover jobs leave no trace on the Site Update — they were only findable by digging through the Agent Job list. Now retry_recovery adds a comment linking the failed recover job and noting the transient error before re-triggering, giving operators a visible trail of how many times recovery was retried. The recover_job field is cleared via db_set rather than frappe.db.set_value so the in-memory doc is updated too. Otherwise the recover_job guard in trigger_recovery_job short-circuits on the stale value and no fresh recover job gets created.
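The stale-value problem described above can be illustrated with a toy model. Everything here is a stand-in: `DATABASE`, `FakeSiteUpdate`, and the module-level `set_value` only mimic the split between writing a row directly and writing through a loaded document.

```python
# Toy "database" keyed by document name (hypothetical).
DATABASE = {"SU-0001": {"recover_job": "JOB-1"}}


def set_value(name: str, field: str, value) -> None:
    """Write straight to the row; already-loaded documents are not touched."""
    DATABASE[name][field] = value


class FakeSiteUpdate:
    def __init__(self, name: str):
        self.name = name
        self.recover_job = DATABASE[name]["recover_job"]

    def db_set(self, field: str, value) -> None:
        """Write to the row and keep the in-memory copy in sync."""
        DATABASE[self.name][field] = value
        setattr(self, field, value)

    def trigger_recovery_job(self, new_job: str):
        # Guard: skip if a recover job is already linked on this document.
        if self.recover_job:
            return None
        self.db_set("recover_job", new_job)
        return new_job
```

Clearing the field with `set_value` leaves the loaded document holding the old job, so the guard short-circuits; clearing it through `db_set` lets the retry create a fresh job.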

When a "Recover Failed Site Migrate" job fails, the site has already been moved back to the source bench but its tables are left half-restored. Re-running the recovery would fail at "Move Site" (the site directory is no longer on the destination bench, and the agent's move_site isn't idempotent), so instead trigger a single "Restore Site Tables" job to bring the site back up from its backup. Restore Site Tables keeps its existing behaviour and callback; the Site Update only carries a reference comment to the triggered job. On a successful restore the update stays Fatal but with cause_of_failure_is_resolved set (the update itself failed for good, but the site was recovered) — it is not flipped to Recovered. This also clears the site's fatal_site_update. Replaces the earlier 3x transient-error retry loop, which couldn't actually re-run the table restore. The max_statement_time bump for large databases is retained.

The skipped-backups update-failure notification told the user to fix the site over SSH but gave no pointer on how. Link the same SSH docs page the dashboard's SiteUpdateDialog uses (docs.frappe.io/cloud/benches/ssh), with the `underline` class so it renders as a visible link. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp

Drop the removed transient-retry section and MAX_RECOVERY_RETRIES, describe the one-shot table-restore fallback, and redraw the status lifecycle. Also fix a stale inline comment claiming the fallback marks the update Recovered. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp

A skipped-backups update has no backup to roll back to, so a failed update must go straight to Fatal with no recover or Restore Site Tables job. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp

A body should give context, not retell the diff. A couple of sentences on the why is enough. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp

The set_cause_of_failure_is_resolved method already clears fatal_site_update, and the else branch only ran when it was already empty — a no-op. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp

Prefer Site("Site", name) over frappe.get_doc("Site", name), especially when the controller lives in the same file: it is shorter and concretely typed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp

Per taste.md, prefer Site("Site", name) over frappe.get_doc in the process_*_job_update callbacks, where the controller is in the same file. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp

A comment that needs more is a sign the code should be clearer or the explanation belongs in a doc. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp

The Restore Site Tables fallback assumes the failed migrate recovery already moved the site to the source bench. Guard on the recover job's Move Site step succeeding, so a recovery that fails at/before Move Site doesn't restore tables onto the wrong bench. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp

greptile-apps · 2026-06-19T04:42:17Z

+		return (
+			frappe.db.get_value("Site Usage", {"site": self.name}, "database", order_by="creation desc") or 0
+		)


database_size silently returns 0 when no Site Usage row exists yet, causing the > LARGE_DATABASE_SIZE guard in trigger_recovery_job to be False for any site without usage history. A large site whose usage was never recorded will have the max_statement_time bump skipped, re-exposing the exact timeout risk this feature aims to prevent. Returning None and handling the sentinel explicitly in the caller makes the "no data" case visible rather than silently treating it as "small".

Suggested change

-		return (
-			frappe.db.get_value("Site Usage", {"site": self.name}, "database", order_by="creation desc") or 0
-		)
+		return (
+			frappe.db.get_value("Site Usage", {"site": self.name}, "database", order_by="creation desc") or None
+		)
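The comment leaves open how the caller should treat the "no data" case. One possible policy, shown here purely as an assumption, is to bump when the size is unknown, on the grounds that an unnecessary bump is cheaper than a timed-out recovery. `should_bump` is a hypothetical helper name.

```python
from typing import Optional

LARGE_DATABASE_SIZE = 2048  # MB, i.e. 2 GB


def should_bump(database_size_mb: Optional[int]) -> bool:
    """Decide whether to raise max_statement_time before a recovery migrate."""
    if database_size_mb is None:
        # No Site Usage recorded yet. This sketch assumes erring towards the
        # bump; the opposite policy is equally easy to express here.
        return True
    return database_size_mb > LARGE_DATABASE_SIZE
```

Whichever policy is chosen, the explicit `None` branch makes the decision visible instead of folding "unknown" into "small".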

The trigger_recovery_job method is already long; move the recovery-migrate timeout bump into bump_max_statement_time_before_recovery. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp

The fix dialog renders messages with whitespace-pre-wrap, so source indentation and mid-paragraph line breaks showed as literal whitespace. Build every message via implicit string concatenation — one clean line per paragraph, matching the existing redis-unpack message. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp

Two related hardenings of the failed-migrate recovery path:
- Only re-issue Restore Site Tables when the recovery failed due to a transient DB error (connection dropped mid-restore); other failures need manual attention, so leave the site Fatal.
- The recovery-migrate max_statement_time bump was never undone, so it grew an hour on the database server every recovery. Stash the pre-bump value on the Site Update and restore it once recovery finishes: at the recover job's terminal state, or, when the fallback table restore runs, at its callback.
Also extracts Site.set_max_statement_time and makes increase_max_statement_time take a plain default increment. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp

Only bump max_statement_time before a recovery migrate for databases over 2 GB (was 1 GB); smaller databases finish well within the timeout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp

Update the recovery runbook and the Site Update README for the transient-DB-error condition on the table-restore fallback, the max_statement_time revert, and the 2 GB threshold. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp

Load the Site Update found by recover_job as a full doc named site_update (lowercase doctype slug) instead of a separate `recovery` var, and rename the `restored` flag to `fallback_triggered`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp

balamurali27 · 2026-06-19T05:52:20Z

Note for reviewers — review the net diff, not the commit history.

The early commits of this PR (roughly 7ff506d9…b5365565) were over-ambitious and no longer reflect the final design — please don't review them commit-by-commit.

They built a full retry mechanism around recovery: up to MAX_RECOVERY_RETRIES (3) recover-job retries on transient DB errors, a statement-timeout retry loop around restore tables, flipping the update to Recovered after a restore, etc. That machinery was later torn back out in favour of a minimal one-shot fallback: when a migrate recovery fails after moving the site back due to a transient DB error, re-issue just Restore Site Tables once; everything else stays Fatal. The max_statement_time bump survived (now also reverted after recovery), and the threshold moved to 2 GB.

So most of those starting commits are effectively superseded. The parts that carried through unchanged are the tests (the scaffolding/helpers in test_site_update.py) and the SSH/skipped-backups notification. The current behaviour is best read from the PR description and the final state of the files, not the intermediate commits.

The transient-error check keys off the recover job, not the failed update job. Cover that a transient error during the update, with a non-transient recovery failure, leaves the site Fatal without a Restore Site Tables job. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp

greptile-apps · 2026-06-19T06:03:51Z

 		if updated_status == "Active":
-			frappe.get_doc("Site", job.site).reset_previous_status(fix_broken=True)
-			frappe.db.set_value("Site", job.site, "fatal_site_update", None)
+			site = Site("Site", job.site)
+			fatal_update = site.fatal_site_update
+			site.reset_previous_status(fix_broken=True)
+			if fatal_update:
+				# The site is back up, but the update itself failed for good. Keep it Fatal and
+				# just mark the cause resolved (this also clears the site's fatal_site_update).
+				site_update = frappe.get_doc("Site Update", fatal_update)
+				site_update.restore_max_statement_time()
+				site_update.set_cause_of_failure_is_resolved()


Unconditional fatal_site_update clear replaced by error-prone conditional

The old code always ran frappe.db.set_value("Site", job.site, "fatal_site_update", None) after reset_previous_status. Now the clear depends on set_cause_of_failure_is_resolved() completing successfully. If frappe.get_doc("Site Update", fatal_update) raises (e.g. the SiteUpdate was deleted) or restore_max_statement_time() raises (e.g. the database server is unreachable), the site will be left Active (status already persisted by reset_previous_status → self.save()) with fatal_site_update still set and cause_of_failure_is_resolved still 0. Adding a try/finally that calls frappe.db.set_value("Site", job.site, "fatal_site_update", None) — mirroring the old unconditional clear — would prevent this inconsistency.
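A minimal sketch of the suggested try/finally shape, using stand-in objects rather than Frappe documents. `FakeSite`, `on_restore_success`, and the `resolve_fatal_update` callback are hypothetical names; the callback stands for the load-restore-resolve sequence that may raise.

```python
from typing import Callable


class FakeSite:
    """Stand-in for a Site document (hypothetical)."""

    def __init__(self) -> None:
        self.status = "Broken"
        self.fatal_site_update = "SU-0001"


def on_restore_success(site: FakeSite, resolve_fatal_update: Callable[[str], None]) -> None:
    site.status = "Active"
    fatal_update = site.fatal_site_update
    try:
        if fatal_update:
            # May raise, e.g. if the Site Update is gone or the DB server is unreachable.
            resolve_fatal_update(fatal_update)
    finally:
        # Mirrors the old unconditional clear, so an Active site never keeps
        # a dangling fatal_site_update link.
        site.fatal_site_update = None
```

The exception still propagates to the caller, so the failure is not hidden; only the site's link to the fatal update is guaranteed to be cleared.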

greptile-apps · 2026-06-21T04:54:21Z

+	def restore_max_statement_time(self) -> None:
+		# No-op unless a recovery migrate bumped it (see bump_max_statement_time_before_recovery).
+		if not self.previous_max_statement_time:
+			return
+		frappe.get_doc("Site", self.site).set_max_statement_time(self.previous_max_statement_time)
+		self.db_set("previous_max_statement_time", 0)


restore_max_statement_time is silently a no-op when the DB server's max_statement_time was already 0 (MariaDB's "no limit" sentinel). increase_max_statement_time calls int(float(current_timeout)) if current_timeout else DEFAULT_MAX_STATEMENT_TIME, so a stored value of "0" yields old_timeout = 0, which is then stored as previous_max_statement_time. The guard if not self.previous_max_statement_time: treats 0 as falsy and returns early without restoring, permanently leaving the DB server at 3600 s after the recovery. Change the clear value to None and guard with is None so the "never bumped" and "bumped from 0" cases are distinguishable.

Suggested change

-	def restore_max_statement_time(self) -> None:
-		# No-op unless a recovery migrate bumped it (see bump_max_statement_time_before_recovery).
-		if not self.previous_max_statement_time:
-			return
-		frappe.get_doc("Site", self.site).set_max_statement_time(self.previous_max_statement_time)
-		self.db_set("previous_max_statement_time", 0)
+	def restore_max_statement_time(self) -> None:
+		# No-op unless a recovery migrate bumped it (see bump_max_statement_time_before_recovery).
+		if self.previous_max_statement_time is None:
+			return
+		frappe.get_doc("Site", self.site).set_max_statement_time(self.previous_max_statement_time)
+		self.db_set("previous_max_statement_time", None)
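To illustrate why the guard matters, here is a small standalone sketch comparing the two checks. The function names are hypothetical. It also assumes the stored value can actually be `None`; whether a Frappe Int field preserves `None` instead of coercing it to `0` is not verified here.

```python
from typing import Optional


def restore_with_truthy_guard(previous: Optional[int]) -> bool:
    """Return True if a restore would run under the `if not previous` guard."""
    if not previous:
        return False
    return True


def restore_with_none_guard(previous: Optional[int]) -> bool:
    """Return True if a restore would run under the `is None` guard."""
    if previous is None:
        return False
    return True


# Three cases: never bumped, bumped from 0 ("no limit"), bumped from one hour.
for label, value in [("never bumped", None), ("bumped from 0", 0), ("bumped from 3600", 3600)]:
    print(label, restore_with_truthy_guard(value), restore_with_none_guard(value))
```

Only the "bumped from 0" case differs: the truthy guard skips the restore, while the `is None` guard runs it.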

greptile-apps · 2026-06-22T08:25:50Z


 			# Attempt to move site to source bench

+			self.bump_max_statement_time_before_recovery(site)


Timeout-bump failure silently blocks all recovery

bump_max_statement_time_before_recovery calls set_max_statement_time → add_or_update_mariadb_variable(update_variables_synchronously=True), which runs Ansible synchronously. If the database server is temporarily unreachable or Ansible fails (which is plausible exactly when an update has just failed), the exception propagates out of trigger_recovery_job before the recovery AgentJob is ever created. The site is left Broken with no recovery path and no user-visible signal — the opposite of what this PR aims to fix. Wrap the bump in a try/except so a bump failure is logged but doesn't abort recovery.
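A rough sketch of the suggested `try/except`, assuming the bump and the job creation can be treated as two independent callables. `trigger_recovery`, `failing_bump`, and the logger name are hypothetical; the real `trigger_recovery_job` has more steps than shown.

```python
import logging

logger = logging.getLogger("recovery_sketch")


def trigger_recovery(bump, create_recovery_job) -> str:
    """Try the timeout bump, but never let its failure block the recovery job."""
    try:
        bump()
    except Exception:
        # Log with traceback and carry on: a missing bump is better than no recovery.
        logger.exception("max_statement_time bump failed; continuing with recovery")
    return create_recovery_job()


def failing_bump() -> None:
    """Simulates the synchronous variable update failing."""
    raise ConnectionError("database server unreachable")


job = trigger_recovery(failing_bump, lambda: "recover-job-1")
```

The trade-off is that a recovery on a large database may then run without the raised timeout, so the log entry (or a comment on the Site Update) is what keeps the failure visible.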

A Restore Site Tables job can come back as Delivery Failure, not just Failure. The status map would raise a KeyError on it, and the max_statement_time revert guard ignored it, leaving the bump in place. Treat it like Failure, mirroring the other job callbacks in this file. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

greptile-apps · 2026-06-23T11:34:44Z

 		"Running": "Updating",
 		"Success": "Active",
 		"Failure": "Broken",
+		"Delivery Failure": "Broken",


database_name cleared on "Delivery Failure" even though nothing ran

For "Delivery Failure" the Restore Site Tables job was never delivered to the agent — the database was never touched. The else branch still runs frappe.db.set_value("Site", job.site, "database_name", None), which wipes the site's database name even though the database exists and is intact. The site is now Broken and trying to reconnect to the database would fail because the name record is gone. For "Failure" this clearing is arguably intentional (partial restore, unknown state); for "Delivery Failure" it introduces corruption that wouldn't have existed before this PR added the mapping entry.
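One possible shape for the fix, sketched with plain dictionaries instead of Frappe documents. `STATUS_MAP` contains only the entries visible in the diff, and `apply_restore_result` is a hypothetical helper, not the callback in `site.py`.

```python
# Only the entries visible in the diff; the real map may contain more.
STATUS_MAP = {
    "Running": "Updating",
    "Success": "Active",
    "Failure": "Broken",
    "Delivery Failure": "Broken",
}


def apply_restore_result(site: dict, job_status: str) -> dict:
    """Return an updated copy of the site record for a Restore Site Tables result."""
    updated = dict(site)
    updated["status"] = STATUS_MAP[job_status]
    if job_status == "Failure":
        # The restore ran and may have left the database in an unknown state.
        updated["database_name"] = None
    # "Delivery Failure": the job never reached the agent, so keep database_name.
    return updated


site = {"status": "Updating", "database_name": "_abc123"}
undelivered = apply_restore_result(site, "Delivery Failure")
failed = apply_restore_result(site, "Failure")
```

Here both outcomes mark the site `Broken`, but only `Failure` clears `database_name`.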

balamurali27 requested a review from tanmoysrt as a code owner June 17, 2026 04:49

balamurali27 changed the title ~~fix(site-update): make site update recovery resilient and recoverable~~ fix(site-update): Make site update recovery resilient and recoverable Jun 17, 2026

balamurali27 and others added 10 commits June 17, 2026 10:25

style(site-update): Sort imports in test_site_update

1a04a2d

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

balamurali27 force-pushed the fix/site-update-recovery branch from aed472a to 1a04a2d Compare June 17, 2026 04:59

greptile-apps Bot reviewed Jun 17, 2026

Comment thread press/press/doctype/agent_job/agent_job_notifications.py Outdated

Comment thread press/press/doctype/site/site.py Outdated

balamurali27 and others added 2 commits June 17, 2026 10:48

greptile-apps Bot reviewed Jun 17, 2026

Comment thread press/press/doctype/site/site.py Outdated

balamurali27 and others added 4 commits June 17, 2026 10:58

greptile-apps Bot reviewed Jun 19, 2026

Comment thread press/press/doctype/site_update/site_update.py Outdated

balamurali27 and others added 6 commits June 19, 2026 09:53

docs(commit): Add guideline to keep bodies short

3fd35a4

A body should give context, not retell the diff. A couple of sentences on the why is enough. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp

balamurali27 and others added 4 commits June 19, 2026 10:03

docs(site-update): Trim recovery comments to two lines

0d72d5a

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp

docs(taste): Keep comments to one or two lines

a5f29a0

A comment that needs more is a sign the code should be clearer or the explanation belongs in a doc. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp

greptile-apps Bot reviewed Jun 19, 2026

balamurali27 and others added 2 commits June 19, 2026 10:37

balamurali27 force-pushed the fix/site-update-recovery branch from e6456c5 to 8f309af Compare June 19, 2026 05:08

greptile-apps Bot reviewed Jun 19, 2026

Comment thread press/press/doctype/site_update/site_update.py Outdated

balamurali27 and others added 4 commits June 19, 2026 10:58

greptile-apps Bot reviewed Jun 19, 2026

Comment thread press/press/doctype/site/site.py

greptile-apps Bot reviewed Jun 19, 2026

fix(site-update): Drop Duration-only props from Int field

a0453a9

The hide_days/hide_seconds properties apply only to Duration fields and have no effect on the previous_max_statement_time Int field. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

greptile-apps Bot reviewed Jun 21, 2026

Merge branch 'develop' into fix/site-update-recovery

79a1675

greptile-apps Bot reviewed Jun 22, 2026

greptile-apps Bot reviewed Jun 23, 2026

Confidence Score: 3/5

2 participants