---
title: Tuning Netdata Docker Health Alarms to Prevent Update Flapping
domain: selfhosting
category: monitoring
tags:
  - netdata
  - docker
  - nextcloud
  - alarms
  - health
  - monitoring
status: published
created: 2026-03-18
updated: 2026-05-02T11:04
---

# Tuning Netdata Docker Health Alarms to Prevent Update Flapping

Netdata's default `docker_container_unhealthy` alarm fires on a 10-second average with no delay. When Nextcloud AIO (or any stack with a watchtower/auto-update setup) runs its nightly update cycle, containers restart in sequence and briefly report as unhealthy, generating a flood of false alerts.

## The Default Alarm

```ini
template: docker_container_unhealthy
on: docker.container_health_status
every: 10s
lookup: average -10s of unhealthy
warn: $this > 0
```

A single container being unhealthy for 10 seconds triggers it. No grace period, no delay.

## The Fix

Create a custom override at `/etc/netdata/health.d/docker.conf` (this maps to the Netdata config volume if Netdata runs in Docker). This file takes precedence over the stock config in `/usr/lib/netdata/conf.d/health.d/docker.conf`.

### General Container Alarm

This alarm covers all containers **except** `nextcloud-aio-nextcloud`, which gets its own dedicated alarm (see below).

```ini
# Custom override: reduces flapping during nightly container updates.
# General container unhealthy alarm: all containers except nextcloud-aio-nextcloud

template: docker_container_unhealthy
on: docker.container_health_status
class: Errors
type: Containers
component: Docker
units: status
every: 30s
lookup: average -5m of unhealthy
chart labels: container_name=!nextcloud-aio-nextcloud *
warn: $this > 0
delay: up 3m down 5m multiplier 1.5 max 30m
summary: Docker container ${label:container_name} health
info: ${label:container_name} docker container health status is unhealthy
to: sysadmin
```

| Setting | Default | Tuned | Effect |
|---|---|---|---|
| `every` | 10s | 30s | Evaluates the alarm less frequently |
| `lookup` | average -10s | average -5m | Averages the `unhealthy` dimension over 5 minutes instead of 10 seconds |
| `delay: up 3m` | none | 3m | Holds the notification until the alarm has stayed raised for 3 minutes |
| `delay: down 5m` | none | 5m (max 30m) | Grace period after recovery before the clear notification is sent |

### Dedicated Nextcloud AIO Alarm

Added 2026-03-23, updated 2026-05-02. The `nextcloud-aio-nextcloud` container needs a more lenient window than other containers. Its healthcheck (`/healthcheck.sh`) verifies PostgreSQL connectivity (port 5432) and PHP-FPM (port 9000). PHP-FPM takes ~90 seconds to warm up after a normal restart, but during nightly AIO update cycles the full startup (occ upgrade, app updates, migrations) can take 5+ minutes. On 2026-03-27, a startup hung and left the container unhealthy for 20 hours until the next nightly cycle replaced it.

The dedicated alarm uses a 30-minute lookup window and a 10-minute delay to absorb normal startup and update cycles (~40 minutes total grace: 30m window + 10m delay), while still catching sustained failures:

```ini
# Dedicated alarm for nextcloud-aio-nextcloud: lenient window to absorb nightly update cycle
# Full startup during an update can take 5+ minutes; only alert on sustained failure

template: docker_nextcloud_unhealthy
on: docker.container_health_status
class: Errors
type: Containers
component: Docker
units: status
every: 30s
lookup: average -30m of unhealthy
chart labels: container_name=nextcloud-aio-nextcloud
warn: $this >= 1
delay: up 10m down 5m multiplier 1.5 max 30m
summary: Nextcloud container health sustained
info: nextcloud-aio-nextcloud has been continuously unhealthy for 30+ minutes, not a transient update blip
to: sysadmin
```
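
The difference between the two `warn` thresholds can be checked with a small shell sketch. It assumes the `unhealthy` dimension is stored as 0 or 1 per sample and that `average` is a plain mean over the window; `alarm_fires` is a hypothetical helper written for illustration, not a Netdata command. Under those assumptions, `$this > 0` raises the alarm on a single unhealthy sample in the window, while `$this >= 1` needs every sample in the window to be unhealthy.

```bash
#!/bin/sh
# Sketch only: emulates an "average of unhealthy" lookup on 0/1 samples.
# alarm_fires is a hypothetical helper, not part of Netdata.
#
# usage: alarm_fires MODE SAMPLE...
#   MODE=any        behaves like "warn: $this > 0"
#   MODE=sustained  behaves like "warn: $this >= 1"
alarm_fires() {
    mode=$1
    shift
    printf '%s\n' "$@" | awk -v mode="$mode" '
        { sum += $1; n++ }
        END {
            if (n == 0) exit 1
            avg = sum / n
            if (mode == "any")       exit !(avg > 0)
            if (mode == "sustained") exit !(avg >= 1)
            exit 2   # unknown mode
        }'
}

# One unhealthy sample among six: the "> 0" rule fires, the ">= 1" rule stays quiet.
alarm_fires any 0 0 1 0 0 0 && echo "any: fires"
alarm_fires sustained 0 0 1 0 0 0 || echo "sustained: quiet"
```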

**Tuning history:**

| Date | Lookup | Delay | Reason for change | Notes |
|---|---|---|---|---|
| 2026-03-23 | 35m | 35m | Initial split from general alarm | Absorbed PHP-FPM warm-up |
| 2026-04-29 | 15m | 5m | Backup blip (~6m) never triggered | Tightened after stability |
| 2026-05-02 | 30m | 10m | 15m still too aggressive for update cycles | ~40m total grace; catches real outages |

## Watchdog Cron: Auto-Restart on Sustained Unhealthy

If the Nextcloud container is still unhealthy more than 1 hour after it started (well past any normal startup window), a cron watchdog on majorlab restarts it automatically and logs the event. This was added 2026-03-28 after an incident where the container sat unhealthy for 20 hours until the next nightly backup cycle replaced it.

**File:** `/etc/cron.d/nextcloud-health-watchdog`

```bash
# Restart nextcloud-aio-nextcloud if it is unhealthy and has been running for >1 hour
# (% must be escaped as \% inside a crontab entry)
*/15 * * * * root docker inspect --format='{{.State.Health.Status}}' nextcloud-aio-nextcloud 2>/dev/null | grep -q unhealthy && [ "$(docker inspect --format='{{.State.StartedAt}}' nextcloud-aio-nextcloud | xargs -I{} date -d {} +\%s)" -lt "$(date -d "1 hour ago" +\%s)" ] && docker restart nextcloud-aio-nextcloud && logger -t nextcloud-watchdog "Restarted unhealthy nextcloud-aio-nextcloud"
```

- Runs every 15 minutes as root
- Only restarts if the container has been running for >1 hour (avoids interfering with normal startup)
- Logs to syslog as `nextcloud-watchdog`; check with `journalctl -t nextcloud-watchdog`
- Netdata will still fire the `docker_nextcloud_unhealthy` alert during the unhealthy window, but the outage is capped at roughly 1 hour (plus up to 15 minutes until the next cron run) instead of persisting until the next nightly cycle
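
The one-liner is dense, so here is a more readable sketch of the same decision logic as a shell script that cron could call instead. The function names (`should_restart`, `watchdog`) are made up for this example, and it assumes GNU `date`, just as the one-liner does. Treat it as an illustration under those assumptions rather than the script actually deployed on majorlab.

```bash
#!/bin/sh
# Sketch only: same checks as the cron one-liner, split into two functions.
# should_restart and watchdog are hypothetical names.

# usage: should_restart STATUS STARTED_EPOCH NOW_EPOCH
# Succeeds only if the status is "unhealthy" and uptime exceeds 1 hour.
should_restart() {
    [ "$1" = "unhealthy" ] && [ $(( $3 - $2 )) -gt 3600 ]
}

# Queries Docker, then restarts and logs if should_restart agrees.
watchdog() {
    name=nextcloud-aio-nextcloud
    status=$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null) || return 0
    started_at=$(docker inspect --format '{{.State.StartedAt}}' "$name") || return 0
    started=$(date -d "$started_at" +%s) || return 0
    if should_restart "$status" "$started" "$(date +%s)"; then
        docker restart "$name" &&
            logger -t nextcloud-watchdog "Restarted unhealthy $name"
    fi
}

# To use it, save this as a script, add a final line that calls watchdog,
# and point the cron entry at the script.
```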

## Also: Suppress `docker_container_down` for Normally-Exiting Containers

Nextcloud AIO runs `borgbackup` (scheduled backups) and `watchtower` (auto-updates) as containers that exit with code 0 after completing their work. The stock `docker_container_down` alarm fires on any exited container, generating false alerts after every nightly cycle.

Add another override to the same file, using `chart labels` to exclude them:

```ini
# Suppress docker_container_down for Nextcloud AIO containers that exit normally
# (borgbackup runs on schedule then exits; watchtower does updates then exits)
template: docker_container_down
on: docker.container_running_state
class: Errors
type: Containers
component: Docker
units: status
every: 30s
lookup: average -5m of down
chart labels: container_name=!nextcloud-aio-borgbackup !nextcloud-aio-watchtower *
warn: $this > 0
delay: up 3m down 5m multiplier 1.5 max 30m
summary: Docker container ${label:container_name} down
info: ${label:container_name} docker container is down
to: sysadmin
```

The `chart labels` line uses Netdata's simple pattern syntax: a `!` prefix excludes a container, and `*` matches everything else. All other exited containers still alert normally.
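
As a rough illustration of how such a pattern list is evaluated, the sketch below mimics it with shell globbing. It assumes the documented simple-pattern behaviour (space-separated patterns, checked left to right, first match wins); `matches_pattern` is a hypothetical helper, and Netdata's own matcher may differ in edge cases.

```bash
#!/bin/sh
# Sketch only: first-match-wins evaluation of a simple-pattern list.
# matches_pattern is a hypothetical helper, not part of Netdata.
#
# usage: matches_pattern "PATTERN LIST" NAME
# Succeeds if NAME is selected, fails if it is excluded or nothing matches.
matches_pattern() {
    name=$2
    set -f                       # do not expand * against files in the cwd
    for pat in $1; do
        case $pat in
            '!'*) glob=${pat#!}; negate=1 ;;   # negative pattern
            *)    glob=$pat;     negate=0 ;;   # positive pattern
        esac
        case $name in
            $glob) set +f; return "$negate" ;; # first match decides
        esac
    done
    set +f
    return 1                     # no pattern matched
}

list='!nextcloud-aio-borgbackup !nextcloud-aio-watchtower *'
for c in nextcloud-aio-borgbackup nextcloud-aio-nextcloud; do
    if matches_pattern "$list" "$c"; then
        echo "$c: alarm applies"
    else
        echo "$c: excluded"
    fi
done
```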

## Applying the Config

```bash
# If Netdata runs in Docker, write to the config volume
sudo tee /var/lib/docker/volumes/netdata_netdataconfig/_data/health.d/docker.conf > /dev/null << 'EOF'
# paste config here
EOF

# Reload health alarms without restarting the container
sudo docker exec netdata netdatacli reload-health
```

No container restart is needed: `reload-health` picks up the new config immediately.
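
To confirm that every stanza made it into the file (before or after reloading), a quick listing of the `template:` names can help. The sketch below is a hypothetical helper (`list_templates` is not a Netdata tool); for the config in this article it should print `docker_container_unhealthy`, `docker_nextcloud_unhealthy`, and `docker_container_down`.

```bash
#!/bin/sh
# Sketch only: print the template names defined in a health config file.
# list_templates is a hypothetical helper, not part of Netdata.
#
# usage: list_templates FILE
list_templates() {
    awk '/^[[:space:]]*template:/ {
        sub(/^[[:space:]]*template:[[:space:]]*/, "")  # drop the key
        sub(/[[:space:]]+$/, "")                       # drop trailing blanks
        print
    }' "$1"
}

# Example call (path taken from this article; may need sudo to read):
#   list_templates /var/lib/docker/volumes/netdata_netdataconfig/_data/health.d/docker.conf
```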

## Verify

In the Netdata UI, navigate to **Alerts → Manage Alerts** and search for `docker_container_unhealthy`. The lookup and delay values should reflect the new config.

## Notes

- Both `docker_container_unhealthy` and `docker_container_down` are overridden in this config. Any container not explicitly excluded in the `chart labels` filter will still alert normally.
- If you want per-container silencing instead of a blanket delay, use the `chart labels` filter to scope the alarm to specific containers (`host labels` does the same at the host level).
- Config volume path on majorlab: `/var/lib/docker/volumes/netdata_netdataconfig/_data/`

## See Also

- [Tuning Netdata Web Log Alerts](tuning-netdata-web-log-alerts.md): similar tuning for web_log redirect alerts