Context
Pyronear stations run in the field, often in remote areas. When a station freezes or crashes (kernel panic, deadlock, full crash), recovery currently relies on a physical power cut, which risks filesystem corruption and requires either human intervention or, more usually, a hardware timer relay.
The RPi has a built-in hardware watchdog (/dev/watchdog) that can trigger a clean software reboot automatically if the system stops responding: no relay, no power cut.
How it works
The watchdog runs a hardware timer. Some process must periodically write to /dev/watchdog to reset it ("petting the dog"). If nothing writes for N seconds → the chip triggers a reboot.
It's deliberately dumb: it doesn't know why nothing wrote — freeze, crash, deadlock — it just reboots.
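As a rough illustration of the petting mechanism, here is a minimal Python sketch. The function name `pet_loop`, the 5-second interval and the `beats` parameter are assumptions made for the example, not part of any existing Pyronear code. On a real station this needs root and arms the timer as soon as the device is opened. The final `V` is the Linux "magic close" character, which asks the driver to disarm the watchdog on close (it has no effect if the driver was built or loaded with `nowayout`).

```python
import time


def pet_loop(device_path="/dev/watchdog", interval_s=5.0, beats=3, sleep=time.sleep):
    """Open the watchdog device, pet it `beats` times, then disarm it.

    Hypothetical helper for illustration only.
    """
    # Unbuffered binary mode so every write reaches the driver immediately.
    with open(device_path, "wb", buffering=0) as wd:
        for _ in range(beats):
            wd.write(b"\0")    # any write resets the hardware timer
            sleep(interval_s)  # must stay well below the watchdog timeout
        # "Magic close": ask the driver to stop the timer before closing.
        wd.write(b"V")
```

If the process dies inside the loop, the `V` is never written, the timer keeps running, and the reboot follows after the timeout.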
Two approaches to explore
1. System-level only (watchdog daemon)
- Activate via dtparam=watchdog=on in /boot/config.txt
- Let the watchdog Linux daemon handle the petting
- ✅ Simple, no code change
- ⚠️ Only covers full OS/kernel freeze — won't catch a live-but-broken capture pipeline
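For reference, the system-level setup might look like the fragment below. This is a sketch, not a tested configuration: the timeout, interval and load values are illustrative assumptions, and on recent Raspberry Pi OS releases the boot config lives at /boot/firmware/config.txt rather than /boot/config.txt.

```
# /boot/config.txt: enable the SoC hardware watchdog
dtparam=watchdog=on

# /etc/watchdog.conf: settings for the watchdog daemon (illustrative values)
watchdog-device = /dev/watchdog
watchdog-timeout = 15
interval = 5
max-load-1 = 24
```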
2. Application-level (from the main Python script)
- Pet the watchdog only when the system is actually healthy (camera alive, recent frame, model responding)
- If a check fails → stop petting → reboot triggered after timeout
- ✅ Covers applicative failures too
- ⚠️ Slightly more complex; the watchdog thread must itself be robust
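The application-level idea could be sketched as follows. All names here (`all_healthy`, `guarded_pet_loop`, `frame_is_fresh`, the 60-second frame age) are hypothetical and only illustrate the pattern; the real health checks would come from the capture pipeline. A check that raises is treated as a failure, so a bug in a check stops the petting instead of hiding the problem.

```python
import time
from typing import Callable, Iterable, Optional


def frame_is_fresh(last_frame_time: float, max_age_s: float = 60.0,
                   now: Optional[float] = None) -> bool:
    """Example check: the last captured frame is recent enough."""
    current = time.time() if now is None else now
    return (current - last_frame_time) <= max_age_s


def all_healthy(checks: Iterable[Callable[[], bool]]) -> bool:
    """Return True only if every check passes; an exception counts as a failure."""
    for check in checks:
        try:
            if not check():
                return False
        except Exception:
            return False
    return True


def guarded_pet_loop(device_path: str, checks, interval_s: float = 5.0,
                     max_cycles: Optional[int] = None, sleep=time.sleep) -> int:
    """Pet the watchdog only on cycles where all checks pass; return the pet count."""
    checks = list(checks)
    pets = 0
    cycle = 0
    # No magic-close "V" is written on purpose: if this loop exits or the
    # process dies, the timer should keep running and reboot the station.
    with open(device_path, "wb", buffering=0) as wd:
        while max_cycles is None or cycle < max_cycles:
            if all_healthy(checks):
                wd.write(b"\0")
                pets += 1
            sleep(interval_s)
            cycle += 1
    return pets
```

In a station this would run in its own thread with `max_cycles=None`, with checks such as `lambda: frame_is_fresh(last_frame_time)` reading state shared with the capture loop.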
Questions to answer
V to /dev/watchdog before closing -> Since deployments/updates are managed via Ansible, a natural solution is to add explicit steps around the update tasks (disable the watchdog before the update and re-enable it after the update).