Files

MajorLinux d1e9571761 wiki: update Netdata Docker alarm tuning — add docker_container_down suppression

Nextcloud AIO borgbackup and watchtower exit normally after nightly update/backup
cycles. Added docker_container_down override with chart labels to exclude them,
preventing false alerts. Documents chart labels pattern syntax.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-22 03:17:31 -04:00

4.8 KiB

Raw Blame History

title, domain, category, tags, status, created, updated

title

domain

Tuning Netdata Docker Health Alarms to Prevent Update Flapping

Netdata's default docker_container_unhealthy alarm fires on a 10-second average with no delay. When Nextcloud AIO (or any stack with a watchtower/auto-update setup) does its nightly update cycle, containers restart in sequence and briefly show as unhealthy — generating a flood of false alerts.

The Default Alarm

template: docker_container_unhealthy
       on: docker.container_health_status
    every: 10s
   lookup: average -10s of unhealthy
     warn: $this > 0

A single container being unhealthy for 10 seconds triggers it. No grace period, no delay.

The Fix

Create a custom override at /etc/netdata/health.d/docker.conf (maps to the Netdata config volume if running in Docker). This file takes precedence over the stock config in /usr/lib/netdata/conf.d/health.d/docker.conf.

# Custom override — reduces flapping during nightly container updates.

template: docker_container_unhealthy
       on: docker.container_health_status
    class: Errors
     type: Containers
component: Docker
    units: status
    every: 30s
   lookup: average -5m of unhealthy
     warn: $this > 0
    delay: up 3m down 5m multiplier 1.5 max 30m
  summary: Docker container ${label:container_name} health
     info: ${label:container_name} docker container health status is unhealthy
       to: sysadmin

Setting	Default	Tuned	Effect
`every`	10s	30s	Check less frequently
`lookup`	average -10s	average -5m	Smooths transient unhealthy samples over 5 minutes
`delay: up 3m`	none	3m	Won't fire until unhealthy condition persists for 3 continuous minutes
`delay: down 5m`	none	5m (max 30m)	Grace period after recovery before clearing

The up delay is the critical addition. Nextcloud AIO's nextcloud-aio-nextcloud container checks both PostgreSQL (port 5432) and PHP-FPM (port 9000). PHP-FPM takes ~90 seconds to warm up after a restart, causing 2–3 failing health checks before the container becomes healthy. With delay: up 3m, Netdata waits for 3 continuous minutes of unhealthy status before firing — absorbing the ~90 second startup window with margin to spare. A genuinely broken container will still trigger the alert.

Also: Suppress `docker_container_down` for Normally-Exiting Containers

Nextcloud AIO runs borgbackup (scheduled backups) and watchtower (auto-updates) as containers that exit with code 0 after completing their work. The stock docker_container_down alarm fires on any exited container, generating false alerts after every nightly cycle.

Add a second override to the same file using chart labels to exclude them:

# Suppress docker_container_down for Nextcloud AIO containers that exit normally
# (borgbackup runs on schedule then exits; watchtower does updates then exits)
template: docker_container_down
       on: docker.container_running_state
    class: Errors
     type: Containers
component: Docker
    units: status
    every: 30s
   lookup: average -5m of down
chart labels: container_name=!nextcloud-aio-borgbackup !nextcloud-aio-watchtower *
     warn: $this > 0
    delay: up 3m down 5m multiplier 1.5 max 30m
  summary: Docker container ${label:container_name} down
     info: ${label:container_name} docker container is down
       to: sysadmin

The chart labels line uses Netdata's simple pattern syntax — ! prefix excludes a container, * matches everything else. All other exited containers still alert normally.

Applying the Config

# If Netdata runs in Docker, write to the config volume
sudo tee /var/lib/docker/volumes/netdata_netdataconfig/_data/health.d/docker.conf > /dev/null << 'EOF'
# paste config here
EOF

# Reload health alarms without restarting the container
sudo docker exec netdata netdatacli reload-health

No container restart needed — reload-health picks up the new config immediately.

Verify

In the Netdata UI, navigate to Alerts → Manage Alerts and search for docker_container_unhealthy. The lookup and delay values should reflect the new config.

Notes

Both docker_container_unhealthy and docker_container_down are overridden in this config. Any container not explicitly excluded in the chart labels filter will still alert normally.
If you want per-container silencing instead of a blanket delay, use the host labels or chart labels filter to scope the alarm to specific containers.
Config volume path on majorlab: /var/lib/docker/volumes/netdata_netdataconfig/_data/

4.8 KiB Raw Blame History Unescape Escape