majorwiki/02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md
MajorLinux 38fe720e63 wiki: add Netdata Docker health alarm tuning article; update indexes to 48
- 02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md — new
- lookup extended to 5m average, delay: down 5m to prevent Nextcloud AIO update flapping
- SUMMARY.md, index.md, README.md, deploy status updated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 00:10:36 -04:00

3.2 KiB
Raw Blame History

title domain category tags status created updated
Tuning Netdata Docker Health Alarms to Prevent Update Flapping selfhosting monitoring
netdata
docker
nextcloud
alarms
health
monitoring
published 2026-03-18 2026-03-18

Tuning Netdata Docker Health Alarms to Prevent Update Flapping

Netdata's default docker_container_unhealthy alarm fires on a 10-second average with no delay. When Nextcloud AIO (or any stack with a watchtower/auto-update setup) does its nightly update cycle, containers restart in sequence and briefly show as unhealthy — generating a flood of false alerts.

The Default Alarm

template: docker_container_unhealthy
       on: docker.container_health_status
    every: 10s
   lookup: average -10s of unhealthy
     warn: $this > 0

A single container being unhealthy for 10 seconds triggers it. No grace period, no delay.

The Fix

Create a custom override at /etc/netdata/health.d/docker.conf (maps to the Netdata config volume if running in Docker). This file takes precedence over the stock config in /usr/lib/netdata/conf.d/health.d/docker.conf.

# Custom override — reduces flapping during nightly container updates.

template: docker_container_unhealthy
       on: docker.container_health_status
    class: Errors
     type: Containers
component: Docker
    units: status
    every: 30s
   lookup: average -5m of unhealthy
     warn: $this > 0
    delay: down 5m multiplier 1.5 max 30m
  summary: Docker container ${label:container_name} health
     info: ${label:container_name} docker container health status is unhealthy
       to: sysadmin
Setting Default Tuned Effect
every 10s 30s Check less frequently
lookup average -10s average -5m Must be unhealthy for sustained 5 minutes
delay none down 5m (max 30m) Grace period after recovery before clearing

A typical Nextcloud AIO update cycle (3090 seconds of container restarts) won't sustain 5 minutes of unhealthy status, so no alert fires. A genuinely broken container will still be caught.

Applying the Config

# If Netdata runs in Docker, write to the config volume
sudo tee /var/lib/docker/volumes/netdata_netdataconfig/_data/health.d/docker.conf > /dev/null << 'EOF'
# paste config here
EOF

# Reload health alarms without restarting the container
sudo docker exec netdata netdatacli reload-health

No container restart needed — reload-health picks up the new config immediately.

Verify

In the Netdata UI, navigate to Alerts → Manage Alerts and search for docker_container_unhealthy. The lookup and delay values should reflect the new config.

Notes

  • This only overrides the docker_container_unhealthy alarm. The docker_container_down alarm (for exited containers) is left at its default — it already has a delay: down 1m and is disabled by default (chart labels: container_name=!*).
  • If you want per-container silencing instead of a blanket delay, use the host labels or chart labels filter to scope the alarm to specific containers.
  • Config volume path on majorlab: /var/lib/docker/volumes/netdata_netdataconfig/_data/

See Also