| title | domain | category | tags | status | created | updated |
|---|---|---|---|---|---|---|
| Cron Heartbeat False Alarm: /var/run Cleared by Reboot | troubleshooting | general | | published | 2026-04-13 | 2026-04-13T10:10 |
# Cron Heartbeat False Alarm: /var/run Cleared by Reboot
If a cron-driven watchdog emails you that a job "may never have run" — but the job's log clearly shows it completed successfully — check whether the heartbeat file lives under /var/run (or /run). On most modern Linux distros, /run is a tmpfs and is wiped on every reboot. Any file there survives only until the next boot.
## Symptoms
- A heartbeat-based watchdog fires a missing-heartbeat or stale-heartbeat alert
- The job the watchdog is monitoring actually ran successfully — its log file shows a clean completion long before the alert fired
- The host was rebooted between when the job wrote its heartbeat and when the watchdog checked it
- `stat /var/run/<your-heartbeat>` returns `No such file or directory`, `readlink -f /var/run` returns `/run`, and `mount | grep ' /run '` shows `tmpfs`
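The last symptom can be confirmed in one step. A minimal check, using a hypothetical heartbeat path (substitute your own):

```shell
# Hypothetical path -- replace with your actual heartbeat file.
HEARTBEAT="/var/run/backup-heartbeat"

# df reports the filesystem type backing the parent directory;
# "tmpfs" means the file cannot survive a reboot.
df --output=fstype "$(dirname "$HEARTBEAT")"
```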
## Why It Happens
Systemd distros mount /run as a tmpfs for runtime state. /var/run is kept only as a compatibility symlink to /run. The whole filesystem is memory-backed: when the host reboots, every file under /run vanishes unless a tmpfiles.d rule explicitly recreates it. The convention is that only things like PID files and sockets — state that is meaningful only for the current boot — should live there.
A daily backup or maintenance job that touches a heartbeat file to prove it ran is not boot-scoped state. If the job runs at 03:00, the host reboots at 07:00 for a kernel update, and a watchdog checks the heartbeat at 08:00, the watchdog sees nothing — even though the job ran four hours earlier and exited 0.
The common mitigation of checking the heartbeat's mtime against a max age (e.g. "alert if older than 25h") does not protect against this. It catches stale heartbeats from real failures, but a deleted file has no mtime to compare.
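A sketch of that age-check pattern makes the gap concrete (the function name, path handling, and 25-hour threshold are illustrative, not taken from the incident's scripts):

```shell
MAX_AGE_MIN=$((25 * 60))  # alert if the heartbeat is older than 25h

check_heartbeat() {
    hb="$1"
    if [ ! -e "$hb" ]; then
        # The branch a reboot-wiped tmpfs lands in: no file, no mtime.
        echo "missing"
        return 1
    fi
    # find prints the path only when its mtime exceeds the threshold
    if [ -n "$(find "$hb" -mmin +"$MAX_AGE_MIN")" ]; then
        echo "stale"
        return 1
    fi
    echo "ok"
}
```

The mtime comparison catches a job that genuinely stopped running, but the `missing` and `stale` branches report different failures: after a reboot the watchdog can only say the file is gone, not whether the job ran.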
## Fix
Move the heartbeat out of tmpfs and into a persistent directory. Good options:
- `/var/lib/<service>/heartbeat` — canonical home for persistent service state
- `/var/log/<service>-heartbeat` — acceptable if you want it alongside existing logs
- Any path on a real disk-backed filesystem
Both the writer (the monitored job) and the reader (the watchdog) need to agree on the new path. Make sure the parent directory exists before the first write:
```sh
HEARTBEAT="/var/lib/myservice/heartbeat"
mkdir -p "$(dirname "$HEARTBEAT")"
# ... later, on success:
touch "$HEARTBEAT"
```
The `mkdir -p` is cheap to run unconditionally and avoids a first-run-after-deploy edge case where the directory hasn't been created yet.
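In cron terms, writer and reader simply share the persistent path (schedule and script names below are illustrative):

```
# crontab fragment (illustrative times and names)
0 3 * * * /usr/local/bin/backup.sh    # touches /var/lib/myservice/heartbeat on success
0 8 * * * /usr/local/bin/watchdog.sh  # checks the same path
```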
## Verification
After deploying the fix:
```sh
# 1. Run the monitored job manually (or wait for its next scheduled run)
sudo bash /path/to/monitored-job.sh

# 2. Confirm the heartbeat was created on persistent storage
ls -la /var/lib/myservice/heartbeat

# 3. Reboot and re-check — the file should survive
sudo reboot
# ... after reboot ...
ls -la /var/lib/myservice/heartbeat  # still there, mtime unchanged

# 4. Run the watchdog manually to confirm it passes
sudo bash /path/to/watchdog.sh
```
## Why Not Use tmpfiles.d Instead
`systemd-tmpfiles` can recreate files in `/run` at boot via an `f /run/<name> 0644 root root - -` entry. That works, but it's the wrong tool for this problem: a boot-created empty file has the boot time as its mtime, which defeats the watchdog's age check. The watchdog would see a fresh heartbeat after every reboot even if the monitored job hasn't actually run in days.
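For completeness, the rejected approach would be a drop-in like this (the file name and entry path are illustrative):

```
# /etc/tmpfiles.d/myservice.conf
# Type  Path                      Mode  User  Group  Age  Argument
f       /run/myservice-heartbeat  0644  root  root   -    -
```

At every boot `systemd-tmpfiles --create` would recreate this file with the boot time as its mtime — exactly the failure mode described above: a fresh-looking heartbeat with no job run behind it.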
Keep /run for true runtime state (PIDs, sockets, locks). Put success markers on persistent storage.
## Related
- Docker & Caddy Recovery After Reboot (Fedora + SELinux) — another class of post-reboot surprise
- rsync Backup Patterns — reusable backup script patterns