diff --git a/02-selfhosting/index.md b/02-selfhosting/index.md
index fba6d82..48300bf 100644
--- a/02-selfhosting/index.md
+++ b/02-selfhosting/index.md
@@ -23,6 +23,7 @@ Guides for running your own services at home, including Docker, reverse proxies,
 ## Monitoring
 
 - [Tuning Netdata Web Log Alerts](monitoring/tuning-netdata-web-log-alerts.md)
+- [Tuning Netdata Docker Health Alarms](monitoring/netdata-docker-health-alarm-tuning.md)
 
 ## Security
 
diff --git a/02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md b/02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md
new file mode 100644
index 0000000..fe116ac
--- /dev/null
+++ b/02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md
@@ -0,0 +1,83 @@
+---
+title: "Tuning Netdata Docker Health Alarms to Prevent Update Flapping"
+domain: selfhosting
+category: monitoring
+tags: [netdata, docker, nextcloud, alarms, health, monitoring]
+status: published
+created: 2026-03-18
+updated: 2026-03-18
+---
+
+# Tuning Netdata Docker Health Alarms to Prevent Update Flapping
+
+Netdata's default `docker_container_unhealthy` alarm fires on a 10-second average with no delay. When Nextcloud AIO (or any stack with a watchtower/auto-update setup) does its nightly update cycle, containers restart in sequence and briefly show as unhealthy — generating a flood of false alerts.
+
+## The Default Alarm
+
+```ini
+template: docker_container_unhealthy
+      on: docker.container_health_status
+   every: 10s
+  lookup: average -10s of unhealthy
+    warn: $this > 0
+```
+
+A single container being unhealthy for 10 seconds triggers it. No grace period, no delay.
+
+## The Fix
+
+Create a custom override at `/etc/netdata/health.d/docker.conf` (maps to the Netdata config volume if running in Docker). This file takes precedence over the stock config in `/usr/lib/netdata/conf.d/health.d/docker.conf`.
+
+```ini
+# Custom override — reduces flapping during nightly container updates.
+
+template: docker_container_unhealthy
+       on: docker.container_health_status
+    class: Errors
+     type: Containers
+component: Docker
+    units: status
+    every: 30s
+   lookup: average -5m of unhealthy
+     warn: $this > 0
+    delay: down 5m multiplier 1.5 max 30m
+  summary: Docker container ${label:container_name} health
+     info: ${label:container_name} docker container health status is unhealthy
+       to: sysadmin
+```
+
+| Setting | Default | Tuned | Effect |
+|---|---|---|---|
+| `every` | 10s | 30s | Check less frequently |
+| `lookup` | average -10s | average -5m | Must be unhealthy for sustained 5 minutes |
+| `delay` | none | down 5m (max 30m) | Grace period after recovery before clearing |
+
+A typical Nextcloud AIO update cycle (30–90 seconds of container restarts) won't sustain 5 minutes of unhealthy status, so no alert fires. A genuinely broken container will still be caught.
+
+## Applying the Config
+
+```bash
+# If Netdata runs in Docker, write to the config volume
+sudo tee /var/lib/docker/volumes/netdata_netdataconfig/_data/health.d/docker.conf > /dev/null << 'EOF'
+# paste config here
+EOF
+
+# Reload health alarms without restarting the container
+sudo docker exec netdata netdatacli reload-health
+```
+
+No container restart needed — `reload-health` picks up the new config immediately.
+
+## Verify
+
+In the Netdata UI, navigate to **Alerts → Manage Alerts** and search for `docker_container_unhealthy`. The lookup and delay values should reflect the new config.
+
+## Notes
+
+- This only overrides the `docker_container_unhealthy` alarm. The `docker_container_down` alarm (for exited containers) is left at its default — it already has a `delay: down 1m` and is disabled by default (`chart labels: container_name=!*`).
+- If you want per-container silencing instead of a blanket delay, use the `host labels` or `chart labels` filter to scope the alarm to specific containers.
+- Config volume path on majorlab: `/var/lib/docker/volumes/netdata_netdataconfig/_data/`
+
+## See Also
+
+- [Tuning Netdata Web Log Alerts](tuning-netdata-web-log-alerts.md) — similar tuning for web_log redirect alerts
diff --git a/MajorWiki-Deploy-Status.md b/MajorWiki-Deploy-Status.md
index 2db53c4..6929b5d 100644
--- a/MajorWiki-Deploy-Status.md
+++ b/MajorWiki-Deploy-Status.md
@@ -127,3 +127,12 @@ Every time a new article is added, the following **MUST** be updated to maintain
 - `05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md` — Ollama drops off Tailscale when MajorMac sleeps
 
 **Updated:** `updated: 2026-03-17`
+
+## Session Update — 2026-03-18
+
+**Article count:** 48 (was 47)
+
+**New articles added:**
+- `02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md` — tuning docker_container_unhealthy alarm to prevent flapping during Nextcloud AIO updates
+
+**Updated:** `updated: 2026-03-18`
diff --git a/README.md b/README.md
index 425db2a..32da4a2 100644
--- a/README.md
+++ b/README.md
@@ -2,15 +2,15 @@
 
 > A growing reference of Linux, self-hosting, open source, streaming, and troubleshooting guides. Written by MajorLinux. Used by MajorTwin.
 >
-**Last updated:** 2026-03-17
-**Article count:** 47
+**Last updated:** 2026-03-18
+**Article count:** 48
 
 ## Domains
 
 | Domain | Folder | Articles |
 |---|---|---|
 | 🐧 Linux & Sysadmin | `01-linux/` | 11 |
-| 🏠 Self-Hosting & Homelab | `02-selfhosting/` | 9 |
+| 🏠 Self-Hosting & Homelab | `02-selfhosting/` | 10 |
 | 🔓 Open Source Tools | `03-opensource/` | 9 |
 | 🎙️ Streaming & Podcasting | `04-streaming/` | 2 |
 | 🔧 General Troubleshooting | `05-troubleshooting/` | 16 |
@@ -64,6 +64,7 @@
 
 ### Monitoring
 - [Tuning Netdata Web Log Alerts](02-selfhosting/monitoring/tuning-netdata-web-log-alerts.md) — tuning web_log_1m_redirects threshold for HTTPS-forcing servers
+- [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md) — preventing false alerts during nightly Nextcloud AIO container update cycles
 
 ### Security
 - [Linux Server Hardening Checklist](02-selfhosting/security/linux-server-hardening-checklist.md) — non-root user, SSH key auth, sshd_config, firewall, fail2ban
@@ -128,6 +129,7 @@
 
 | Date | Article | Domain |
 |---|---|---|
+| 2026-03-18 | [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md) | Self-Hosting |
 | 2026-03-17 | [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md) | Troubleshooting |
 | 2026-03-17 | [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md) | Troubleshooting |
 | 2026-03-16 | [Standardizing unattended-upgrades with Ansible](02-selfhosting/security/ansible-unattended-upgrades-fleet.md) | Self-Hosting |
diff --git a/SUMMARY.md b/SUMMARY.md
index f405b59..3590afc 100644
--- a/SUMMARY.md
+++ b/SUMMARY.md
@@ -19,6 +19,7 @@
   * [Tailscale for Homelab Remote Access](02-selfhosting/dns-networking/tailscale-homelab-remote-access.md)
   * [rsync Backup Patterns](02-selfhosting/storage-backup/rsync-backup-patterns.md)
   * [Tuning Netdata Web Log Alerts](02-selfhosting/monitoring/tuning-netdata-web-log-alerts.md)
+  * [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md)
   * [Linux Server Hardening Checklist](02-selfhosting/security/linux-server-hardening-checklist.md)
   * [Standardizing unattended-upgrades with Ansible](02-selfhosting/security/ansible-unattended-upgrades-fleet.md)
 * [Open Source & Alternatives](03-opensource/index.md)
diff --git a/index.md b/index.md
index 98d6b26..30a2d1d 100644
--- a/index.md
+++ b/index.md
@@ -2,15 +2,15 @@
 
 > A growing reference of Linux, self-hosting, open source, streaming, and troubleshooting guides. Written by MajorLinux. Used by MajorTwin.
 >
-> **Last updated:** 2026-03-17
-> **Article count:** 47
+> **Last updated:** 2026-03-18
+> **Article count:** 48
 
 ## Domains
 
 | Domain | Folder | Articles |
 |---|---|---|
 | 🐧 Linux & Sysadmin | `01-linux/` | 11 |
-| 🏠 Self-Hosting & Homelab | `02-selfhosting/` | 9 |
+| 🏠 Self-Hosting & Homelab | `02-selfhosting/` | 10 |
 | 🔓 Open Source Tools | `03-opensource/` | 9 |
 | 🎙️ Streaming & Podcasting | `04-streaming/` | 2 |
 | 🔧 General Troubleshooting | `05-troubleshooting/` | 16 |
@@ -64,6 +64,7 @@
 
 ### Monitoring
 - [Tuning Netdata Web Log Alerts](02-selfhosting/monitoring/tuning-netdata-web-log-alerts.md) — tuning web_log_1m_redirects threshold for HTTPS-forcing servers
+- [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md) — preventing false alerts during nightly Nextcloud AIO container update cycles
 
 ### Security
 - [Linux Server Hardening Checklist](02-selfhosting/security/linux-server-hardening-checklist.md) — non-root user, SSH key auth, sshd_config, firewall, fail2ban
@@ -128,6 +129,7 @@
 
 | Date | Article | Domain |
 |---|---|---|
+| 2026-03-18 | [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md) | Self-Hosting |
 | 2026-03-17 | [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md) | Troubleshooting |
 | 2026-03-17 | [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md) | Troubleshooting |
 | 2026-03-16 | [Standardizing unattended-upgrades with Ansible](02-selfhosting/security/ansible-unattended-upgrades-fleet.md) | Self-Hosting |