diff --git a/05-troubleshooting/cron-heartbeat-tmpfs-reboot-false-alarm.md b/05-troubleshooting/cron-heartbeat-tmpfs-reboot-false-alarm.md
new file mode 100644
index 0000000..14b2ba6
--- /dev/null
+++ b/05-troubleshooting/cron-heartbeat-tmpfs-reboot-false-alarm.md
@@ -0,0 +1,84 @@
+---
+title: "Cron Heartbeat False Alarm: /var/run Cleared by Reboot"
+domain: troubleshooting
+category: general
+tags:
+ - cron
+ - systemd
+ - tmpfs
+ - monitoring
+ - backups
+ - heartbeat
+status: published
+created: 2026-04-13
+updated: 2026-04-13T10:10
+---
+# Cron Heartbeat False Alarm: /var/run Cleared by Reboot
+
+If a cron-driven watchdog emails you that a job "may never have run" — but the job's log clearly shows it completed successfully — check whether the heartbeat file lives under `/var/run` (or `/run`). On most modern Linux distros, `/run` is a **tmpfs** and is wiped on every reboot. Any file there survives only until the next boot.
+
+## Symptoms
+
+- A heartbeat-based watchdog fires a missing-heartbeat or stale-heartbeat alert
+- The job the watchdog is monitoring actually ran successfully — its log file shows a clean completion long before the alert fired
+- The host was rebooted between when the job wrote its heartbeat and when the watchdog checked it
+- `stat /var/run/` returns `No such file or directory`
+- `readlink -f /var/run` returns `/run`, and `mount | grep ' /run '` shows `tmpfs`
+
+## Why It Happens
+
+Systemd distros mount `/run` as a tmpfs for runtime state. `/var/run` is kept only as a compatibility symlink to `/run`. The whole filesystem is memory-backed: when the host reboots, every file under `/run` vanishes unless a `tmpfiles.d` rule explicitly recreates it. The convention is that only things like PID files and sockets — state that is meaningful **only for the current boot** — should live there.
+
+A daily backup or maintenance job that touches a heartbeat file to prove it ran is *not* boot-scoped state.
If the job runs at 03:00, the host reboots at 07:00 for a kernel update, and a watchdog checks the heartbeat at 08:00, the watchdog sees nothing — even though the job ran five hours earlier and exited 0.
+
+The common mitigation of checking the heartbeat's mtime against a max age (e.g. "alert if older than 25h") does **not** protect against this. It catches stale heartbeats from real failures, but a deleted file has no mtime to compare.
+
+## Fix
+
+Move the heartbeat out of tmpfs and into a persistent directory. Good options:
+
+- `/var/lib//heartbeat` — canonical home for persistent service state
+- `/var/log/-heartbeat` — acceptable if you want it alongside existing logs
+- Any path on a real disk-backed filesystem
+
+Both the writer (the monitored job) and the reader (the watchdog) need to agree on the new path. Make sure the parent directory exists before the first write:
+
+```bash
+HEARTBEAT="/var/lib/myservice/heartbeat"
+mkdir -p "$(dirname "$HEARTBEAT")"
+# ... later, on success:
+touch "$HEARTBEAT"
+```
+
+The `mkdir -p` is cheap to run unconditionally and avoids a first-run-after-deploy edge case where the directory hasn't been created yet.
+
+## Verification
+
+After deploying the fix:
+
+```bash
+# 1. Run the monitored job manually (or wait for its next scheduled run)
+sudo bash /path/to/monitored-job.sh
+
+# 2. Confirm the heartbeat was created on persistent storage
+ls -la /var/lib/myservice/heartbeat
+
+# 3. Reboot and re-check — the file should survive
+sudo reboot
+# ... after reboot ...
+ls -la /var/lib/myservice/heartbeat # still there, mtime unchanged
+
+# 4. Run the watchdog manually to confirm it passes
+sudo bash /path/to/watchdog.sh
+```
+
+## Why Not Use `tmpfiles.d` Instead
+
+systemd-tmpfiles can recreate files in `/run` at boot via a `f /run/ 0644 root root - -` entry. That works, but it's the wrong tool for this problem: a boot-created empty file has the boot time as its mtime, which defeats the watchdog's age check.
The watchdog would see a fresh heartbeat after every reboot even if the monitored job hasn't actually run in days.
+
+Keep `/run` for true runtime state (PIDs, sockets, locks). Put success markers on persistent storage.
+
+## Related
+
+- [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](docker-caddy-selinux-post-reboot-recovery.md) — another class of post-reboot surprise
+- [rsync Backup Patterns](../02-selfhosting/storage-backup/rsync-backup-patterns.md) — reusable backup script patterns
diff --git a/05-troubleshooting/index.md b/05-troubleshooting/index.md
index d50be83..1e3b9f5 100644
--- a/05-troubleshooting/index.md
+++ b/05-troubleshooting/index.md
@@ -1,6 +1,6 @@
 ---
 created: 2026-03-15T06:37
-updated: 2026-04-08
+updated: 2026-04-13T10:10
 ---
 # 🔧 General Troubleshooting
@@ -29,6 +29,7 @@ Practical fixes for common Linux, networking, and application problems.
 - [Gitea Actions Runner: Boot Race Condition Fix](gitea-runner-boot-race-network-target.md)
 - [Systemd Session Scope Fails at Login (`session-cN.scope`)](systemd/session-scope-failure-at-login.md)
 - [MajorWiki Setup & Publishing Pipeline](majwiki-setup-and-pipeline.md)
+- [Cron Heartbeat False Alarm: /var/run Cleared by Reboot](cron-heartbeat-tmpfs-reboot-false-alarm.md)
 ## 🔒 SELinux
 - [SELinux: Fixing Dovecot Mail Spool Context (/var/vmail)](selinux-dovecot-vmail-context.md)
diff --git a/README.md b/README.md
index a3a69a6..4a46156 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 ---
 created: 2026-04-06T09:52
-updated: 2026-04-07T21:59
+updated: 2026-04-13T10:10
 ---
 # MajorLinux Tech Wiki — Index
@@ -142,6 +142,7 @@ updated: 2026-04-07T21:59
 - [Gemini CLI Manual Update](05-troubleshooting/gemini-cli-manual-update.md) — how to manually update the Gemini CLI when automatic updates fail
 - [MajorWiki Setup & Pipeline](05-troubleshooting/majwiki-setup-and-pipeline.md) — setting up MajorWiki and the Obsidian → Gitea → MkDocs publishing pipeline
 - [Gitea Actions Runner: Boot Race Condition
Fix](05-troubleshooting/gitea-runner-boot-race-network-target.md) — fixing act_runner crash loop on boot caused by DNS not ready at startup
+- [Cron Heartbeat False Alarm: /var/run Cleared by Reboot](05-troubleshooting/cron-heartbeat-tmpfs-reboot-false-alarm.md) — why `/run` is tmpfs and how a reboot wipes cron heartbeat files, and where to put them instead
 - [SELinux: Fixing Dovecot Mail Spool Context (/var/vmail)](05-troubleshooting/selinux-dovecot-vmail-context.md) — fixing thousands of AVC denials when /var/vmail has wrong SELinux context
 - [mdadm RAID Recovery After USB Hub Disconnect](05-troubleshooting/storage/mdadm-usb-hub-disconnect-recovery.md) — diagnosing and recovering a failed mdadm array caused by a USB hub dropout
 - [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md) — fixing sshd not running after reboot due to Manual startup type
@@ -160,6 +161,7 @@ updated: 2026-04-07T21:59
 | Date | Article | Domain |
 |---|---|---|
+| 2026-04-13 | [Cron Heartbeat False Alarm: /var/run Cleared by Reboot](05-troubleshooting/cron-heartbeat-tmpfs-reboot-false-alarm.md) | Troubleshooting |
 | 2026-04-08 | [wget/curl: URLs with Special Characters Fail in Bash](05-troubleshooting/wget-url-special-characters.md) | Troubleshooting |
 | 2026-04-07 | [SSH Config & Key Management](01-linux/networking/ssh-config-key-management.md) | Linux |
 | 2026-04-07 | [Windows OpenSSH: WSL Default Shell Breaks Remote Commands](05-troubleshooting/networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md) | Troubleshooting |
diff --git a/SUMMARY.md b/SUMMARY.md
index 88a8451..47a3129 100644
--- a/SUMMARY.md
+++ b/SUMMARY.md
@@ -1,6 +1,6 @@
 ---
 created: 2026-04-02T16:03
-updated: 2026-04-08
+updated: 2026-04-13T10:10
 ---
 * [Home](index.md)
 * [Linux & Sysadmin](01-linux/index.md)
@@ -69,6 +69,7 @@ updated: 2026-04-08
  * [Gemini CLI Manual Update](05-troubleshooting/gemini-cli-manual-update.md)
  * [MajorWiki Setup &
Publishing Pipeline](05-troubleshooting/majwiki-setup-and-pipeline.md)
  * [Gitea Actions Runner: Boot Race Condition Fix](05-troubleshooting/gitea-runner-boot-race-network-target.md)
+ * [Cron Heartbeat False Alarm: /var/run Cleared by Reboot](05-troubleshooting/cron-heartbeat-tmpfs-reboot-false-alarm.md)
  * [SELinux: Fixing Dovecot Mail Spool Context (/var/vmail)](05-troubleshooting/selinux-dovecot-vmail-context.md)
  * [mdadm RAID Recovery After USB Hub Disconnect](05-troubleshooting/storage/mdadm-usb-hub-disconnect-recovery.md)
  * [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md)
diff --git a/index.md b/index.md
index d0de3c7..c600623 100644
--- a/index.md
+++ b/index.md
@@ -1,6 +1,6 @@
 ---
 created: 2026-04-06T09:52
-updated: 2026-04-07T21:59
+updated: 2026-04-13T10:11
 ---
 # MajorLinux Tech Wiki — Index
@@ -143,6 +143,7 @@ updated: 2026-04-07T21:59
 - [Gemini CLI Manual Update](05-troubleshooting/gemini-cli-manual-update.md) — how to manually update the Gemini CLI when automatic updates fail
 - [MajorWiki Setup & Pipeline](05-troubleshooting/majwiki-setup-and-pipeline.md) — setting up MajorWiki and the Obsidian → Gitea → MkDocs publishing pipeline
 - [Gitea Actions Runner: Boot Race Condition Fix](05-troubleshooting/gitea-runner-boot-race-network-target.md) — fixing act_runner crash loop on boot caused by DNS not ready at startup
+- [Cron Heartbeat False Alarm: /var/run Cleared by Reboot](05-troubleshooting/cron-heartbeat-tmpfs-reboot-false-alarm.md) — why `/run` is tmpfs and how a reboot wipes cron heartbeat files, and where to put them instead
 - [SELinux: Fixing Dovecot Mail Spool Context (/var/vmail)](05-troubleshooting/selinux-dovecot-vmail-context.md) — fixing thousands of AVC denials when /var/vmail has wrong SELinux context
 - [mdadm RAID Recovery After USB Hub Disconnect](05-troubleshooting/storage/mdadm-usb-hub-disconnect-recovery.md) — diagnosing and recovering a failed mdadm array caused by a
USB hub dropout
 - [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md) — fixing sshd not running after reboot due to Manual startup type
@@ -164,6 +165,7 @@ updated: 2026-04-07T21:59
 | Date | Article | Domain |
 |---|---|---|
+| 2026-04-13 | [Cron Heartbeat False Alarm: /var/run Cleared by Reboot](05-troubleshooting/cron-heartbeat-tmpfs-reboot-false-alarm.md) | Troubleshooting |
 | 2026-04-08 | [wget/curl: URLs with Special Characters Fail in Bash](05-troubleshooting/wget-url-special-characters.md) | Troubleshooting |
 | 2026-04-07 | [SSH Config & Key Management](01-linux/networking/ssh-config-key-management.md) | Linux |
 | 2026-04-07 | [Windows OpenSSH: WSL Default Shell Breaks Remote Commands](05-troubleshooting/networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md) | Troubleshooting |