wiki: add Netdata Docker health alarm tuning article; update indexes to 48

- 02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md — new
- lookup extended to 5m average, delay: down 5m to prevent Nextcloud AIO update flapping
- SUMMARY.md, index.md, README.md, deploy status updated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 00:10:36 -04:00
parent 59a5cc530e
commit 38fe720e63
6 changed files with 104 additions and 6 deletions

@@ -23,6 +23,7 @@ Guides for running your own services at home, including Docker, reverse proxies,
## Monitoring
- [Tuning Netdata Web Log Alerts](monitoring/tuning-netdata-web-log-alerts.md)
- [Tuning Netdata Docker Health Alarms](monitoring/netdata-docker-health-alarm-tuning.md)
## Security

@@ -0,0 +1,83 @@
---
title: "Tuning Netdata Docker Health Alarms to Prevent Update Flapping"
domain: selfhosting
category: monitoring
tags: [netdata, docker, nextcloud, alarms, health, monitoring]
status: published
created: 2026-03-18
updated: 2026-03-18
---
# Tuning Netdata Docker Health Alarms to Prevent Update Flapping
Netdata's default `docker_container_unhealthy` alarm fires on a 10-second average with no delay. When Nextcloud AIO (or any stack with a watchtower/auto-update setup) does its nightly update cycle, containers restart in sequence and briefly show as unhealthy — generating a flood of false alerts.
## The Default Alarm
```ini
template: docker_container_unhealthy
on: docker.container_health_status
every: 10s
lookup: average -10s of unhealthy
warn: $this > 0
```
A single container being unhealthy for 10 seconds triggers it. No grace period, no delay.
## The Fix
Create a custom override at `/etc/netdata/health.d/docker.conf` (maps to the Netdata config volume if running in Docker). This file takes precedence over the stock config in `/usr/lib/netdata/conf.d/health.d/docker.conf`.
```ini
# Custom override — reduces flapping during nightly container updates.
template: docker_container_unhealthy
on: docker.container_health_status
class: Errors
type: Containers
component: Docker
units: status
every: 30s
lookup: average -5m of unhealthy
warn: $this > 0
delay: down 5m multiplier 1.5 max 30m
summary: Docker container ${label:container_name} health
info: ${label:container_name} docker container health status is unhealthy
to: sysadmin
```
| Setting | Default | Tuned | Effect |
|---|---|---|---|
| `every` | 10s | 30s | Check less frequently |
| `lookup` | average -10s | average -5m | Evaluate the `unhealthy` dimension over a 5-minute window instead of 10 seconds |
| `delay` | none | down 5m (max 30m) | Grace period after recovery before clearing |
A typical Nextcloud AIO update cycle (30-90 seconds of container restarts) won't sustain 5 minutes of unhealthy status, so no alert fires. A genuinely broken container will still be caught.
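The `delay: down 5m multiplier 1.5 max 30m` line can be read as a capped back-off. A minimal sketch of that arithmetic, assuming Netdata multiplies the delay by `multiplier` each time the alarm changes state while a notification is still held, and never exceeds `max`. The `delay_sequence` helper is hypothetical and not part of Netdata:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print successive delay values (in minutes)
# for a base delay, a multiplier, and a cap.
delay_sequence() {
  local base_min=$1 multiplier=$2 max_min=$3 steps=$4
  awk -v d="$base_min" -v m="$multiplier" -v cap="$max_min" -v n="$steps" 'BEGIN {
    d += 0; cap += 0                 # force numeric comparison
    for (i = 1; i <= n; i++) {
      if (d > cap) d = cap           # never exceed the configured max
      printf "%s%g", (i > 1 ? " " : ""), d
      d *= m                         # grow the delay for the next state change
    }
    print ""
  }'
}

# Values from the tuned config: down 5m, multiplier 1.5, max 30m
delay_sequence 5 1.5 30 6
```

With the tuned values the hold grows from 5 to 7.5, 11.25, 16.875, and 25.3125 minutes, then stays at 30.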
## Applying the Config
```bash
# If Netdata runs in Docker, write to the config volume
sudo tee /var/lib/docker/volumes/netdata_netdataconfig/_data/health.d/docker.conf > /dev/null << 'EOF'
# paste config here
EOF
# Reload health alarms without restarting the container
sudo docker exec netdata netdatacli reload-health
```
No container restart needed — `reload-health` picks up the new config immediately.
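Before pasting the config into the volume, it can help to confirm the file defines the lines this override relies on. A rough sketch, assuming the config was first saved to a local file. The `check_health_conf` function is a hypothetical helper, not a Netdata tool, and it only checks that the keys are present, not that their values are valid:

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: verify that a health config file
# contains the keys used by the override in this article.
check_health_conf() {
  local file=$1 key missing=0
  for key in template on lookup warn delay; do
    # Keys may be indented for alignment, so allow leading whitespace.
    if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*:" "$file"; then
      echo "missing key: ${key}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (the local path is an assumption):
#   check_health_conf ./docker.conf && echo "looks complete"
```

A non-zero exit status means at least one key was not found, so the copy and `reload-health` steps can be skipped until the file is fixed.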
## Verify
In the Netdata UI, navigate to **Alerts → Manage Alerts** and search for `docker_container_unhealthy`. The lookup and delay values should reflect the new config.
## Notes
- This only overrides the `docker_container_unhealthy` alarm. The `docker_container_down` alarm (for exited containers) is left at its default — it already has a `delay: down 1m` and is disabled by default (`chart labels: container_name=!*`).
- If you want per-container silencing instead of a blanket delay, use the `host labels` or `chart labels` filter to scope the alarm to specific containers.
- Config volume path on majorlab: `/var/lib/docker/volumes/netdata_netdataconfig/_data/`
## See Also
- [Tuning Netdata Web Log Alerts](tuning-netdata-web-log-alerts.md) — similar tuning for web_log redirect alerts

@@ -127,3 +127,12 @@ Every time a new article is added, the following **MUST** be updated to maintain
- `05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md` — Ollama drops off Tailscale when MajorMac sleeps
**Updated:** `updated: 2026-03-17`
## Session Update — 2026-03-18
**Article count:** 48 (was 47)
**New articles added:**
- `02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md` — tuning docker_container_unhealthy alarm to prevent flapping during Nextcloud AIO updates
**Updated:** `updated: 2026-03-18`

@@ -2,15 +2,15 @@
> A growing reference of Linux, self-hosting, open source, streaming, and troubleshooting guides. Written by MajorLinux. Used by MajorTwin.
>
**Last updated:** 2026-03-17
**Article count:** 47
**Last updated:** 2026-03-18
**Article count:** 48
## Domains
| Domain | Folder | Articles |
|---|---|---|
| 🐧 Linux & Sysadmin | `01-linux/` | 11 |
| 🏠 Self-Hosting & Homelab | `02-selfhosting/` | 9 |
| 🏠 Self-Hosting & Homelab | `02-selfhosting/` | 10 |
| 🔓 Open Source Tools | `03-opensource/` | 9 |
| 🎙️ Streaming & Podcasting | `04-streaming/` | 2 |
| 🔧 General Troubleshooting | `05-troubleshooting/` | 16 |
@@ -64,6 +64,7 @@
### Monitoring
- [Tuning Netdata Web Log Alerts](02-selfhosting/monitoring/tuning-netdata-web-log-alerts.md) — tuning web_log_1m_redirects threshold for HTTPS-forcing servers
- [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md) — preventing false alerts during nightly Nextcloud AIO container update cycles
### Security
- [Linux Server Hardening Checklist](02-selfhosting/security/linux-server-hardening-checklist.md) — non-root user, SSH key auth, sshd_config, firewall, fail2ban
@@ -128,6 +129,7 @@
| Date | Article | Domain |
|---|---|---|
| 2026-03-18 | [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md) | Self-Hosting |
| 2026-03-17 | [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md) | Troubleshooting |
| 2026-03-17 | [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md) | Troubleshooting |
| 2026-03-16 | [Standardizing unattended-upgrades with Ansible](02-selfhosting/security/ansible-unattended-upgrades-fleet.md) | Self-Hosting |

@@ -19,6 +19,7 @@
* [Tailscale for Homelab Remote Access](02-selfhosting/dns-networking/tailscale-homelab-remote-access.md)
* [rsync Backup Patterns](02-selfhosting/storage-backup/rsync-backup-patterns.md)
* [Tuning Netdata Web Log Alerts](02-selfhosting/monitoring/tuning-netdata-web-log-alerts.md)
* [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md)
* [Linux Server Hardening Checklist](02-selfhosting/security/linux-server-hardening-checklist.md)
* [Standardizing unattended-upgrades with Ansible](02-selfhosting/security/ansible-unattended-upgrades-fleet.md)
* [Open Source & Alternatives](03-opensource/index.md)

@@ -2,15 +2,15 @@
> A growing reference of Linux, self-hosting, open source, streaming, and troubleshooting guides. Written by MajorLinux. Used by MajorTwin.
>
> **Last updated:** 2026-03-17
> **Article count:** 47
> **Last updated:** 2026-03-18
> **Article count:** 48
## Domains
| Domain | Folder | Articles |
|---|---|---|
| 🐧 Linux & Sysadmin | `01-linux/` | 11 |
| 🏠 Self-Hosting & Homelab | `02-selfhosting/` | 9 |
| 🏠 Self-Hosting & Homelab | `02-selfhosting/` | 10 |
| 🔓 Open Source Tools | `03-opensource/` | 9 |
| 🎙️ Streaming & Podcasting | `04-streaming/` | 2 |
| 🔧 General Troubleshooting | `05-troubleshooting/` | 16 |
@@ -64,6 +64,7 @@
### Monitoring
- [Tuning Netdata Web Log Alerts](02-selfhosting/monitoring/tuning-netdata-web-log-alerts.md) — tuning web_log_1m_redirects threshold for HTTPS-forcing servers
- [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md) — preventing false alerts during nightly Nextcloud AIO container update cycles
### Security
- [Linux Server Hardening Checklist](02-selfhosting/security/linux-server-hardening-checklist.md) — non-root user, SSH key auth, sshd_config, firewall, fail2ban
@@ -128,6 +129,7 @@
| Date | Article | Domain |
|---|---|---|
| 2026-03-18 | [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md) | Self-Hosting |
| 2026-03-17 | [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md) | Troubleshooting |
| 2026-03-17 | [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md) | Troubleshooting |
| 2026-03-16 | [Standardizing unattended-upgrades with Ansible](02-selfhosting/security/ansible-unattended-upgrades-fleet.md) | Self-Hosting |