MajorLinux 326c87421f wiki: add troubleshooting article on /var/run heartbeat reboot false alarm

Captures the majorlab incident where the backup watchdog emailed a missing
heartbeat after a kernel-update reboot wiped /var/run, even though the
backup had actually completed cleanly. Documents the tmpfs root cause and
the fix of storing heartbeats under /var/lib instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-13 10:11:24 -04:00

Cron Heartbeat False Alarm: /var/run Cleared by Reboot

If a cron-driven watchdog emails you that a job "may never have run" — but the job's log clearly shows it completed successfully — check whether the heartbeat file lives under /var/run (or /run). On most modern Linux distros, /run is a tmpfs and is wiped on every reboot. Any file there survives only until the next boot.

Symptoms

A heartbeat-based watchdog fires a missing-heartbeat or stale-heartbeat alert
The job the watchdog is monitoring actually ran successfully — its log file shows a clean completion long before the alert fired
The host was rebooted between when the job wrote its heartbeat and when the watchdog checked it
stat /var/run/<your-heartbeat> returns No such file or directory
readlink -f /var/run returns /run, and mount | grep ' /run ' shows tmpfs
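
The last two checks can be folded into a small helper. This is a minimal sketch, assuming GNU coreutils (`readlink -m` and `stat -f -c %T`); the function names `fs_type_of` and `is_volatile` and the example path are made up for illustration. It walks up to the nearest existing ancestor because the heartbeat itself may already be gone.

```shell
# Print the filesystem type backing a path (GNU stat output, e.g. "tmpfs", "xfs").
# If the path does not exist, fall back to its nearest existing ancestor.
fs_type_of() {
    local path
    path=$(readlink -m -- "$1")   # resolves symlinks such as /var/run -> /run
    while [ ! -e "$path" ] && [ "$path" != "/" ]; do
        path=$(dirname -- "$path")
    done
    stat -f -c %T -- "$path"
}

# Succeed if the path lives on a memory-backed filesystem.
is_volatile() {
    case "$(fs_type_of "$1")" in
        tmpfs|ramfs) return 0 ;;
        *)           return 1 ;;
    esac
}

# Example (hypothetical heartbeat path):
#   is_volatile /var/run/myservice.heartbeat && echo "heartbeat will not survive a reboot"
```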

Why It Happens

Systemd-based distros mount /run as a tmpfs for runtime state, and /var/run is kept only as a compatibility symlink to /run. The whole filesystem is memory-backed: when the host reboots, every file under /run vanishes. A tmpfiles.d rule can recreate a path at boot, but only as a new file with a new mtime, not with its previous contents. The convention is that only state that is meaningful for the current boot, such as PID files and sockets, should live there.

A daily backup or maintenance job that touches a heartbeat file to prove it ran is not boot-scoped state. If the job runs at 03:00, the host reboots at 07:00 for a kernel update, and a watchdog checks the heartbeat at 08:00, the watchdog sees nothing — even though the job ran four hours earlier and exited 0.
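
To confirm this timeline after the fact, compare the host's boot time with the mtime of something the job wrote to persistent storage, such as its log. The following is a sketch, assuming Linux's /proc/stat and GNU stat; the function names and the log path in the example are hypothetical.

```shell
# Print the kernel's boot time as seconds since the epoch (Linux-specific).
boot_epoch() {
    awk '/^btime / { print $2 }' /proc/stat
}

# rebooted_since FILE: succeed if the host booted after FILE was last modified.
# Returns 2 if FILE cannot be stat'ed.
rebooted_since() {
    local file_mtime
    file_mtime=$(stat -c %Y -- "$1") || return 2
    [ "$(boot_epoch)" -gt "$file_mtime" ]
}

# Example (hypothetical log path):
#   rebooted_since /var/log/myservice.log && echo "host rebooted after the job's last log write"
```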

The common mitigation of checking the heartbeat's mtime against a max age (e.g. "alert if older than 25h") does not protect against this. It catches stale heartbeats from real failures, but a deleted file has no mtime to compare.
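
Because a missing file has no mtime, a watchdog has to treat "missing" as its own case rather than as a variant of "stale". The sketch below shows one way to separate the two so the alert text can say which one occurred. It assumes GNU stat, and `check_heartbeat` is an illustrative name, not the watchdog from this incident.

```shell
# check_heartbeat FILE MAX_AGE_SECONDS
# Prints a status word and returns: 0 = fresh, 1 = stale, 2 = missing.
check_heartbeat() {
    local file=$1 max_age=$2 now mtime age
    if [ ! -e "$file" ]; then
        echo "missing"            # no mtime exists to compare against
        return 2
    fi
    now=$(date +%s)
    mtime=$(stat -c %Y -- "$file")
    age=$(( now - mtime ))
    if [ "$age" -gt "$max_age" ]; then
        echo "stale"
        return 1
    fi
    echo "fresh"
}

# Example: 25 hours = 90000 seconds
#   check_heartbeat /var/lib/myservice/heartbeat 90000
```

A "missing" result on a host that booted recently points toward a heartbeat stored on tmpfs rather than a job that failed to run.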

Fix

Move the heartbeat out of tmpfs and into a persistent directory. Good options:

/var/lib/<service>/heartbeat — canonical home for persistent service state
/var/log/<service>-heartbeat — acceptable if you want it alongside existing logs
Any path on a real disk-backed filesystem

Both the writer (the monitored job) and the reader (the watchdog) need to agree on the new path. Make sure the parent directory exists before the first write:

HEARTBEAT="/var/lib/myservice/heartbeat"
mkdir -p "$(dirname "$HEARTBEAT")"
# ... later, on success:
touch "$HEARTBEAT"

The mkdir -p is cheap to run unconditionally and avoids a first-run-after-deploy edge case where the directory hasn't been created yet.
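
If the monitored job cannot easily be edited, the same pattern can be applied from outside with a wrapper. This is a sketch under that assumption, not the script used in the incident; `run_with_heartbeat` is a hypothetical name.

```shell
# run_with_heartbeat HEARTBEAT_PATH COMMAND [ARGS...]
# Runs COMMAND and touches the heartbeat only if it exits 0.
run_with_heartbeat() {
    local heartbeat=$1
    shift
    mkdir -p -- "$(dirname -- "$heartbeat")" || return
    "$@" || return            # propagate the job's exit status; leave the old heartbeat alone
    touch -- "$heartbeat"
}

# Example:
#   run_with_heartbeat /var/lib/myservice/heartbeat /path/to/monitored-job.sh
```

On failure the previous heartbeat is left untouched, so its mtime keeps ageing and the watchdog's max-age check still fires.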

Verification

After deploying the fix:

# 1. Run the monitored job manually (or wait for its next scheduled run)
sudo bash /path/to/monitored-job.sh

# 2. Confirm the heartbeat was created on persistent storage
ls -la /var/lib/myservice/heartbeat

# 3. Reboot and re-check — the file should survive
sudo reboot
# ... after reboot ...
ls -la /var/lib/myservice/heartbeat   # still there, mtime unchanged

# 4. Run the watchdog manually to confirm it passes
sudo bash /path/to/watchdog.sh

Why Not Use `tmpfiles.d` Instead

systemd-tmpfiles can recreate files in /run at boot via an f /run/<name> 0644 root root - - entry. That works, but it's the wrong tool for this problem: a boot-created empty file has the boot time as its mtime, which defeats the watchdog's age check. The watchdog would see a fresh heartbeat after every reboot even if the monitored job hasn't actually run in days.

Keep /run for true runtime state (PIDs, sockets, locks). Put success markers on persistent storage.

Related

Docker & Caddy Recovery After Reboot (Fedora + SELinux): another class of post-reboot surprise
rsync Backup Patterns: reusable backup script patterns
