| title | domain | category | tags | status | created | updated |
|---|---|---|---|---|---|---|
| Tuning Netdata Docker Health Alarms to Prevent Update Flapping | selfhosting | monitoring | | published | 2026-03-18 | 2026-03-18 |
Tuning Netdata Docker Health Alarms to Prevent Update Flapping
Netdata's default docker_container_unhealthy alarm fires on a 10-second average with no delay. When Nextcloud AIO (or any stack with a watchtower/auto-update setup) does its nightly update cycle, containers restart in sequence and briefly show as unhealthy — generating a flood of false alerts.
The Default Alarm
template: docker_container_unhealthy
on: docker.container_health_status
every: 10s
lookup: average -10s of unhealthy
warn: $this > 0
A single container being unhealthy for 10 seconds triggers it. No grace period, no delay.
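To compare against the stock definition before overriding it, a small helper can print a single alarm block from a health file. This is a minimal sketch: `show_alarm` is a hypothetical function, and it assumes alarm blocks are separated by blank lines.

```shell
# show_alarm NAME FILE
# Hypothetical helper: print the alarm block whose "template:" line names NAME.
# Assumes blocks in the health file are separated by blank lines.
show_alarm() {
    name="$1"
    file="$2"
    awk -v name="$name" '
        $1 == "template:" && $2 == name { printing = 1 }  # block starts here
        printing && NF == 0 { exit }                        # blank line ends it
        printing { print }
    ' "$file"
}

# Example, using the stock path mentioned in this article:
# show_alarm docker_container_unhealthy /usr/lib/netdata/conf.d/health.d/docker.conf
```

If Netdata runs in a container, the stock file lives inside that container, so the file would need to be read there.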
The Fix
Create a custom override at /etc/netdata/health.d/docker.conf (maps to the Netdata config volume if running in Docker). This file takes precedence over the stock config in /usr/lib/netdata/conf.d/health.d/docker.conf.
# Custom override — reduces flapping during nightly container updates.
template: docker_container_unhealthy
on: docker.container_health_status
class: Errors
type: Containers
component: Docker
units: status
every: 30s
lookup: average -5m of unhealthy
warn: $this > 0
delay: down 5m multiplier 1.5 max 30m
summary: Docker container ${label:container_name} health
info: ${label:container_name} docker container health status is unhealthy
to: sysadmin
| Setting | Default | Tuned | Effect |
|---|---|---|---|
| every | 10s | 30s | Check less frequently |
| lookup | average -10s | average -5m | Average health over a 5-minute window instead of 10 seconds |
| delay | none | down 5m (max 30m) | Grace period after recovery before clearing |
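A typical Nextcloud AIO update cycle (30–90 seconds of container restarts) no longer produces a burst of raise-and-clear notifications: the 5-minute window and the down delay keep each alarm from flipping state while containers restart in sequence. Because the condition is still $this > 0, a container that reports unhealthy at any point inside the window can raise one warning. A genuinely broken container will still be caught.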
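To gauge how a short outage looks to the lookup, the sketch below computes the average of a 0/1 series over a window. It assumes the unhealthy dimension reads 1 while a container is unhealthy and 0 otherwise. `window_average` is a hypothetical helper, not a Netdata command.

```shell
# window_average BLIP_SECONDS WINDOW_SECONDS
# Prints the mean of a 0/1 series that is 1 for BLIP_SECONDS of the window.
window_average() {
    awk -v blip="$1" -v window="$2" 'BEGIN {
        if (blip > window) blip = window   # a blip cannot exceed the window
        printf "%.2f\n", blip / window
    }'
}

window_average 60 300    # 60 s unhealthy within a 5-minute window
window_average 300 300   # unhealthy for the whole window
```

A 60-second blip averages 0.20 over five minutes, which is still above zero. If brief blips keep alerting, one option is a higher threshold such as `warn: $this > 0.9` (an illustrative value, not from the stock config), which would require the container to be unhealthy for most of the window.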
Applying the Config
# If Netdata runs in Docker, write to the config volume
sudo tee /var/lib/docker/volumes/netdata_netdataconfig/_data/health.d/docker.conf > /dev/null << 'EOF'
# paste config here
EOF
# Reload health alarms without restarting the container
sudo docker exec netdata netdatacli reload-health
No container restart needed — reload-health picks up the new config immediately.
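Before reloading, it can help to confirm that the file on disk contains the tuned settings rather than the placeholder. This is a minimal sketch that assumes the values shown in The Fix. `check_override` is a hypothetical helper.

```shell
# check_override FILE
# Returns 0 only if FILE defines docker_container_unhealthy with the
# 5-minute lookup and the 5-minute down delay used in this article.
check_override() {
    file="$1"
    grep -Eq '^[[:space:]]*template:[[:space:]]*docker_container_unhealthy[[:space:]]*$' "$file" &&
    grep -Eq '^[[:space:]]*lookup:[[:space:]]*average -5m of unhealthy' "$file" &&
    grep -Eq '^[[:space:]]*delay:[[:space:]]*down 5m' "$file"
}

# Example (run as root, since the Docker volume directory is usually root-only):
# conf=/var/lib/docker/volumes/netdata_netdataconfig/_data/health.d/docker.conf
# check_override "$conf" && docker exec netdata netdatacli reload-health
```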
Verify
In the Netdata UI, navigate to Alerts → Manage Alerts and search for docker_container_unhealthy. The lookup and delay values should reflect the new config.
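The same check can be done roughly from a shell. The sketch below assumes the agent listens on the default `localhost:19999` and that its v1 API returns alarm entries with a `"name"` field at `/api/v1/alarms?all`. Both are assumptions to confirm on your version. `count_alarm` is a hypothetical helper.

```shell
# count_alarm NAME
# Reads alarm JSON on stdin and prints how many entries carry "name": "NAME".
count_alarm() {
    grep -oE "\"name\"[[:space:]]*:[[:space:]]*\"$1\"" | wc -l | tr -d '[:space:]'
    echo
}

# Example (endpoint and port are assumptions):
# curl -s 'http://localhost:19999/api/v1/alarms?all' | count_alarm docker_container_unhealthy
```

A non-zero count only shows that the alarm is loaded. The lookup and delay values still need to be checked in the UI or in the raw JSON.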
Notes
- This only overrides the docker_container_unhealthy alarm. The docker_container_down alarm (for exited containers) is left at its default; it already has a delay: down 1m and is disabled by default (chart labels: container_name=!*).
- If you want per-container silencing instead of a blanket delay, use the host labels or chart labels filter to scope the alarm to specific containers.
- Config volume path on majorlab: /var/lib/docker/volumes/netdata_netdataconfig/_data/
See Also
- Tuning Netdata Web Log Alerts — similar tuning for web_log redirect alerts