wiki: add troubleshooting article on /var/run heartbeat reboot false alarm

Captures the majorlab incident where the backup watchdog emailed a missing heartbeat after a kernel-update reboot wiped /var/run, even though the backup had actually completed cleanly. Documents the tmpfs root cause and the fix of storing heartbeats under /var/lib instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:11:24 -04:00
parent efc8f22f6c
commit 326c87421f
5 changed files with 94 additions and 4 deletions
--- a/05-troubleshooting/cron-heartbeat-tmpfs-reboot-false-alarm.md
+++ b/05-troubleshooting/cron-heartbeat-tmpfs-reboot-false-alarm.md
@@ -0,0 +1,84 @@
+---
+title: "Cron Heartbeat False Alarm: /var/run Cleared by Reboot"
+domain: troubleshooting
+category: general
+tags:
+  - cron
+  - systemd
+  - tmpfs
+  - monitoring
+  - backups
+  - heartbeat
+status: published
+created: 2026-04-13
+updated: 2026-04-13T10:10
+---
+# Cron Heartbeat False Alarm: /var/run Cleared by Reboot
+
+If a cron-driven watchdog emails you that a job "may never have run" — but the job's log clearly shows it completed successfully — check whether the heartbeat file lives under `/var/run` (or `/run`). On most modern Linux distros, `/run` is a **tmpfs** and is wiped on every reboot. Any file there survives only until the next boot.
+
+## Symptoms
+
+- A heartbeat-based watchdog fires a missing-heartbeat or stale-heartbeat alert
+- The job the watchdog is monitoring actually ran successfully — its log file shows a clean completion long before the alert fired
+- The host was rebooted between when the job wrote its heartbeat and when the watchdog checked it
+- `stat /var/run/<your-heartbeat>` fails with `No such file or directory`
+- `readlink -f /var/run` returns `/run`, and `mount | grep ' /run '` shows `tmpfs`
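+
+The filesystem check can also be run against the heartbeat path itself. This is a minimal sketch, assuming GNU coreutils `stat`; the function name `heartbeat_fs_type` is hypothetical and not part of any existing script:
+
+```bash
+# Print the type of the filesystem that backs a heartbeat path.
+# Falls back to the parent directory when the file itself is gone,
+# which is the usual state right after a reboot.
+heartbeat_fs_type() {
+    local path="$1"
+    [ -e "$path" ] || path="$(dirname "$path")"
+    # stat -f reports on the containing filesystem; %T is its type name (GNU stat)
+    stat -f -c %T "$path"
+}
+
+# Example: heartbeat_fs_type "/var/run/myservice.heartbeat"
+```
+
+If this prints `tmpfs`, the heartbeat will not survive a reboot. The helper fails if the parent directory is missing as well.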
+
+## Why It Happens
+
+Systemd distros mount `/run` as a tmpfs for runtime state. `/var/run` is kept only as a compatibility symlink to `/run`. The whole filesystem is memory-backed: when the host reboots, every file under `/run` vanishes, and only paths declared in a `tmpfiles.d` rule are recreated at boot. The convention is that only state that is meaningful **only for the current boot**, such as PID files and sockets, should live there.
+
+A daily backup or maintenance job that touches a heartbeat file to prove it ran is *not* boot-scoped state. If the job runs at 03:00, the host reboots at 07:00 for a kernel update, and a watchdog checks the heartbeat at 08:00, the watchdog sees nothing — even though the job ran four hours earlier and exited 0.
+
+The common mitigation of checking the heartbeat's mtime against a max age (e.g. "alert if older than 25h") does **not** protect against this. It catches stale heartbeats from real failures, but a file that no longer exists has no mtime to compare, so the watchdog can only report it as missing.
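+
+A watchdog can report "missing" and "stale" as separate outcomes, which makes this case easier to recognize. The sketch below assumes GNU `stat` and `date`; `check_heartbeat` and its return codes are hypothetical, not taken from the majorlab watchdog:
+
+```bash
+# Classify a heartbeat file as ok, stale, or missing.
+# Usage: check_heartbeat <path> <max-age-seconds>
+check_heartbeat() {
+    local path="$1" max_age="$2" now mtime
+    if [ ! -e "$path" ]; then
+        # No file means no mtime, so an age check alone never reaches this case
+        echo "missing"
+        return 2
+    fi
+    now="$(date +%s)"
+    mtime="$(stat -c %Y "$path")"   # mtime in seconds since the epoch (GNU stat)
+    if [ $(( now - mtime )) -gt "$max_age" ]; then
+        echo "stale"
+        return 1
+    fi
+    echo "ok"
+}
+
+# Example: 25 hours = 90000 seconds
+# check_heartbeat /var/lib/myservice/heartbeat 90000
+```
+
+Separating the two cases does not fix the false alarm, but a "missing" result on a recently rebooted host is a strong hint to check where the heartbeat is stored.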
+
+## Fix
+
+Move the heartbeat out of tmpfs and into a persistent directory. Good options:
+
+- `/var/lib/<service>/heartbeat` — canonical home for persistent service state
+- `/var/log/<service>-heartbeat` — acceptable if you want it alongside existing logs
+- Any path on a real disk-backed filesystem
+
+Both the writer (the monitored job) and the reader (the watchdog) need to agree on the new path. Make sure the parent directory exists before the first write:
+
+```bash
+HEARTBEAT="/var/lib/myservice/heartbeat"
+mkdir -p "$(dirname "$HEARTBEAT")"
+# ... later, on success:
+touch "$HEARTBEAT"
+```
+
+The `mkdir -p` is cheap to run unconditionally and avoids a first-run-after-deploy edge case where the directory hasn't been created yet.
+
+## Verification
+
+After deploying the fix:
+
+```bash
+# 1. Run the monitored job manually (or wait for its next scheduled run)
+sudo bash /path/to/monitored-job.sh
+
+# 2. Confirm the heartbeat was created on persistent storage
+ls -la /var/lib/myservice/heartbeat
+
+# 3. Reboot and re-check — the file should survive
+sudo reboot
+# ... after reboot ...
+ls -la /var/lib/myservice/heartbeat   # still there, mtime unchanged
+
+# 4. Run the watchdog manually to confirm it passes
+sudo bash /path/to/watchdog.sh
+```
+
+## Why Not Use `tmpfiles.d` Instead
+
+systemd-tmpfiles can recreate files in `/run` at boot via a `f /run/<name> 0644 root root - -` entry. That works, but it's the wrong tool for this problem: a boot-created empty file has the boot time as its mtime, which defeats the watchdog's age check. The watchdog would see a fresh heartbeat after every reboot even if the monitored job hasn't actually run in days.
+
+Keep `/run` for true runtime state (PIDs, sockets, locks). Put success markers on persistent storage.
+
+## Related
+
+- [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](docker-caddy-selinux-post-reboot-recovery.md) — another class of post-reboot surprise
+- [rsync Backup Patterns](../02-selfhosting/storage-backup/rsync-backup-patterns.md) — reusable backup script patterns