Investigate RPi hardware watchdog for autonomous reboot #372

Description

@fe51

Context

Pyronear stations run in the field, often in remote areas. When a station freezes or crashes (kernel panic, deadlock, full crash), recovery currently relies on a physical power cut, which risks filesystem corruption and requires either human intervention or, more usually, a hardware timer relay.

The RPi has a built-in hardware watchdog (/dev/watchdog) that can trigger a reboot automatically if the system stops responding: no relay, no physical power cut.

How it works

The watchdog runs a hardware timer. A process must periodically write to /dev/watchdog to reset it ("petting the dog"). If nothing writes for N seconds → the chip triggers a reboot.

It's deliberately dumb: it doesn't know why nothing wrote — freeze, crash, deadlock — it just reboots.
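
To make the petting mechanism concrete, here is a minimal sketch of a petting loop. The function name `pet_watchdog` and its parameters are illustrative, not existing Pyronear code. It assumes the standard Linux watchdog device behaviour, where opening the device arms the timer and each write resets it.

```python
import time


def pet_watchdog(device_path="/dev/watchdog", interval_s=5.0, rounds=3):
    """Write to the watchdog device `rounds` times, `interval_s` apart.

    Opening the device arms the timer and every write resets it.
    Leaving this function without the "magic close" normally leaves the
    timer running, so the board reboots once the timeout expires.
    """
    # buffering=0 so that each write reaches the driver immediately.
    with open(device_path, "wb", buffering=0) as dog:
        for _ in range(rounds):
            # Any write resets the timer; b"V" is avoided because it is
            # reserved for the magic close.
            dog.write(b"\0")
            time.sleep(interval_s)
    return rounds
```

Running this against the real /dev/watchdog will most likely end in a reboot a few seconds after the loop stops, so for experiments it is safer to point `device_path` at a regular file first.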

Two approaches to explore

1. System-level only (watchdog daemon)

  • Activate via dtparam=watchdog=on in /boot/config.txt (/boot/firmware/config.txt on recent Raspberry Pi OS releases)
  • Let the watchdog Linux daemon handle the petting
  • ✅ Simple, no code change
  • ⚠️ Only covers full OS/kernel freeze — won't catch a live-but-broken capture pipeline

2. Application-level (from the main Python script)

  • Pet the watchdog only when the system is actually healthy (camera alive, recent frame, model responding)
  • If a check fails → stop petting → reboot triggered after timeout
  • ✅ Covers application-level failures too
  • ⚠️ Slightly more complex; the watchdog thread must itself be robust
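
Approach 2 could be sketched as below: the pet is written only when every health check passes. `GatedWatchdog` and `frame_is_recent` are hypothetical names, and the 600-second default is a placeholder, not a recommendation. To my knowledge the Pi's hardware timer only supports short timeouts (around 15 s, to be verified on the target board), so a window of several minutes would live in a check like `frame_is_recent` rather than in the hardware timeout itself.

```python
import time


def frame_is_recent(last_frame_ts, max_age_s=600.0, now=None):
    """Hypothetical check: True if a frame arrived within `max_age_s` seconds."""
    now = time.monotonic() if now is None else now
    return (now - last_frame_ts) <= max_age_s


class GatedWatchdog:
    """Pets the watchdog device only when every health check passes."""

    def __init__(self, device, checks):
        self.device = device  # writable binary file-like object
        self.checks = checks  # list of zero-argument callables returning bool

    def healthy(self):
        for check in self.checks:
            try:
                if not check():
                    return False
            except Exception:
                # A check that crashes is treated as a failed check.
                return False
        return True

    def tick(self):
        """Pet once if healthy; return whether a pet was written."""
        if not self.healthy():
            return False
        self.device.write(b"\0")
        return True
```

In a real deployment `device` would be something like `open("/dev/watchdog", "wb", buffering=0)`, and `tick()` would be called from a loop whose period is comfortably shorter than the hardware timeout. Once a check starts failing, the pets stop and the hardware timer does the rest.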

Questions to answer

  • Does approach 1 alone cover enough failure modes for our use case? It might be worth starting with it.
  • What health checks are meaningful to gate the pet on (approach 2)?
  • What's the right timeout value (trade-off between false reboots and slow recovery)? -> 10? 30 min?
  • Impact on clean shutdowns and long updates: the magic character V must be written to /dev/watchdog before closing it, otherwise the timer keeps running. Since deployments/updates are managed via Ansible, a natural solution is to add explicit steps around the update tasks (disable the watchdog before the update, re-enable it after).
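
For the clean-shutdown question, the "magic close" could look like the sketch below. `disarm_watchdog` is a hypothetical helper name. It assumes the driver honours the magic character, which is not the case when the kernel or module is configured with `nowayout`.

```python
def disarm_watchdog(dog):
    """Magic close: write b"V" as the last byte, then close the device.

    `dog` is the already-open watchdog file object that the petting
    process holds. If the driver runs with nowayout, the timer keeps
    running regardless.
    """
    dog.write(b"V")
    dog.close()
```

Whichever process holds the device would call this when asked to stop, for example from a signal handler triggered by the Ansible task that runs before an update.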
