Files
MajorWiki/02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md
MajorLinux d1e9571761 wiki: update Netdata Docker alarm tuning — add docker_container_down suppression
Nextcloud AIO borgbackup and watchtower exit normally after nightly update/backup
cycles. Added docker_container_down override with chart labels to exclude them,
preventing false alerts. Documents chart labels pattern syntax.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-22 03:17:31 -04:00

4.8 KiB
Raw Blame History

title, domain, category, tags, status, created, updated
title domain category tags status created updated
Tuning Netdata Docker Health Alarms to Prevent Update Flapping selfhosting monitoring
netdata
docker
nextcloud
alarms
health
monitoring
published 2026-03-18 2026-03-22

Tuning Netdata Docker Health Alarms to Prevent Update Flapping

Netdata's default docker_container_unhealthy alarm fires on a 10-second average with no delay. When Nextcloud AIO (or any stack with a watchtower/auto-update setup) does its nightly update cycle, containers restart in sequence and briefly show as unhealthy — generating a flood of false alerts.

The Default Alarm

template: docker_container_unhealthy
       on: docker.container_health_status
    every: 10s
   lookup: average -10s of unhealthy
     warn: $this > 0

A single container being unhealthy for 10 seconds triggers it. No grace period, no delay.

The Fix

Create a custom override at /etc/netdata/health.d/docker.conf (maps to the Netdata config volume if running in Docker). This file takes precedence over the stock config in /usr/lib/netdata/conf.d/health.d/docker.conf.

# Custom override — reduces flapping during nightly container updates.

template: docker_container_unhealthy
       on: docker.container_health_status
    class: Errors
     type: Containers
component: Docker
    units: status
    every: 30s
   lookup: average -5m of unhealthy
     warn: $this > 0
    delay: up 3m down 5m multiplier 1.5 max 30m
  summary: Docker container ${label:container_name} health
     info: ${label:container_name} docker container health status is unhealthy
       to: sysadmin
Setting Default Tuned Effect
every 10s 30s Check less frequently
lookup average -10s average -5m Smooths transient unhealthy samples over 5 minutes
delay: up 3m none 3m Won't fire until unhealthy condition persists for 3 continuous minutes
delay: down 5m none 5m (max 30m) Grace period after recovery before clearing

The up delay is the critical addition. Nextcloud AIO's nextcloud-aio-nextcloud container checks both PostgreSQL (port 5432) and PHP-FPM (port 9000). PHP-FPM takes ~90 seconds to warm up after a restart, causing 23 failing health checks before the container becomes healthy. With delay: up 3m, Netdata waits for 3 continuous minutes of unhealthy status before firing — absorbing the ~90 second startup window with margin to spare. A genuinely broken container will still trigger the alert.

Also: Suppress docker_container_down for Normally-Exiting Containers

Nextcloud AIO runs borgbackup (scheduled backups) and watchtower (auto-updates) as containers that exit with code 0 after completing their work. The stock docker_container_down alarm fires on any exited container, generating false alerts after every nightly cycle.

Add a second override to the same file using chart labels to exclude them:

# Suppress docker_container_down for Nextcloud AIO containers that exit normally
# (borgbackup runs on schedule then exits; watchtower does updates then exits)
template: docker_container_down
       on: docker.container_running_state
    class: Errors
     type: Containers
component: Docker
    units: status
    every: 30s
   lookup: average -5m of down
chart labels: container_name=!nextcloud-aio-borgbackup !nextcloud-aio-watchtower *
     warn: $this > 0
    delay: up 3m down 5m multiplier 1.5 max 30m
  summary: Docker container ${label:container_name} down
     info: ${label:container_name} docker container is down
       to: sysadmin

The chart labels line uses Netdata's simple pattern syntax — ! prefix excludes a container, * matches everything else. All other exited containers still alert normally.

Applying the Config

# If Netdata runs in Docker, write to the config volume
sudo tee /var/lib/docker/volumes/netdata_netdataconfig/_data/health.d/docker.conf > /dev/null << 'EOF'
# paste config here
EOF

# Reload health alarms without restarting the container
sudo docker exec netdata netdatacli reload-health

No container restart needed — reload-health picks up the new config immediately.

Verify

In the Netdata UI, navigate to Alerts → Manage Alerts and search for docker_container_unhealthy. The lookup and delay values should reflect the new config.

Notes

  • Both docker_container_unhealthy and docker_container_down are overridden in this config. Any container not explicitly excluded in the chart labels filter will still alert normally.
  • If you want per-container silencing instead of a blanket delay, use the host labels or chart labels filter to scope the alarm to specific containers.
  • Config volume path on majorlab: /var/lib/docker/volumes/netdata_netdataconfig/_data/

See Also