title: Tuning Netdata Docker Health Alarms to Prevent Update Flapping
domain: selfhosting
category: monitoring
tags: netdata, docker, nextcloud, alarms, health, monitoring
status: published
created: 2026-03-18
updated: 2026-05-02T11:04

Tuning Netdata Docker Health Alarms to Prevent Update Flapping

Netdata's default docker_container_unhealthy alarm fires on a 10-second average with no delay. When Nextcloud AIO (or any stack with a watchtower/auto-update setup) does its nightly update cycle, containers restart in sequence and briefly show as unhealthy — generating a flood of false alerts.

The Default Alarm

template: docker_container_unhealthy
       on: docker.container_health_status
    every: 10s
   lookup: average -10s of unhealthy
     warn: $this > 0

A single container being unhealthy for 10 seconds triggers it. No grace period, no delay.

The Fix

Create a custom override at /etc/netdata/health.d/docker.conf (maps to the Netdata config volume if running in Docker). This file takes precedence over the stock config in /usr/lib/netdata/conf.d/health.d/docker.conf.

General Container Alarm

This alarm covers all containers except nextcloud-aio-nextcloud, which gets its own dedicated alarm (see below).

# Custom override — reduces flapping during nightly container updates.
# General container unhealthy alarm — all containers except nextcloud-aio-nextcloud

template: docker_container_unhealthy
       on: docker.container_health_status
    class: Errors
     type: Containers
component: Docker
    units: status
    every: 30s
   lookup: average -5m of unhealthy
chart labels: container_name=!nextcloud-aio-nextcloud *
     warn: $this > 0
    delay: up 3m down 5m multiplier 1.5 max 30m
  summary: Docker container ${label:container_name} health
     info: ${label:container_name} docker container health status is unhealthy
       to: sysadmin

| Setting | Default | Tuned | Effect |
|---|---|---|---|
| every | 10s | 30s | Check less frequently |
| lookup | average -10s | average -5m | Smooths transient unhealthy samples over 5 minutes |
| delay: up 3m | none | 3m | Won't fire until unhealthy condition persists for 3 continuous minutes |
| delay: down 5m | none | 5m (max 30m) | Grace period after recovery before clearing |
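
The multiplier 1.5 max 30m part of the delay line is easiest to read as arithmetic. The sketch below assumes that Netdata multiplies the delay each time the alarm changes state while a notification is still pending, and that max caps the result. The delay_schedule helper is hypothetical and only prints the resulting sequence; it does not query Netdata.

```shell
# Hypothetical helper (not part of Netdata): print the notification delay in
# seconds after each successive state change, ASSUMING the model
#   delay_n = base * multiplier^n, capped at max.
delay_schedule() {
    # $1 = base delay (s), $2 = multiplier, $3 = max delay (s), $4 = steps
    awk -v base="$1" -v mult="$2" -v max="$3" -v steps="$4" 'BEGIN {
        d = base
        for (i = 0; i < steps; i++) {
            printf "%d\n", (d > max ? max : d)   # fractions are truncated
            d *= mult
        }
    }'
}

# "delay: up 3m ... multiplier 1.5 max 30m" -> 180 s base, 1800 s cap
delay_schedule 180 1.5 1800 7
```

Under that model the up delay starts at 3 minutes and reaches the 30-minute cap on the seventh consecutive state change.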

Dedicated Nextcloud AIO Alarm

Added 2026-03-23, updated 2026-05-02. The nextcloud-aio-nextcloud container needs a more lenient window than other containers. Its healthcheck (/healthcheck.sh) verifies PostgreSQL connectivity (port 5432) and PHP-FPM (port 9000). PHP-FPM takes ~90 seconds to warm up after a normal restart — but during nightly AIO update cycles, the full startup (occ upgrade, app updates, migrations) can take 5+ minutes. On 2026-03-27, a startup hung and left the container unhealthy for 20 hours until the next nightly cycle replaced it.

The dedicated alarm uses a 30-minute lookup window and 10-minute delay to absorb normal startup and update cycles (~40 minutes total grace), while still catching sustained failures:

# Dedicated alarm for nextcloud-aio-nextcloud — lenient window to absorb nightly update cycle
# Full startup can take 5+ minutes during updates; only alert on sustained failure

template: docker_nextcloud_unhealthy
       on: docker.container_health_status
    class: Errors
     type: Containers
component: Docker
    units: status
    every: 30s
   lookup: average -30m of unhealthy
chart labels: container_name=nextcloud-aio-nextcloud
     warn: $this >= 1
    delay: up 10m down 5m multiplier 1.5 max 30m
  summary: Nextcloud container health sustained
     info: nextcloud-aio-nextcloud has been continuously unhealthy for 30+ minutes — not a transient update blip
       to: sysadmin
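
The difference between warn: $this >= 1 here and warn: $this > 0 in the general alarm comes down to how a window average of 0/1 samples behaves. The sketch below is plain arithmetic in awk, not Netdata code; it assumes the unhealthy dimension is 1 while the container is unhealthy and 0 otherwise, and the function names are made up for illustration.

```shell
# Print VALUE on COUNT lines (a stand-in for 30-second health samples).
repeat() {
    awk -v v="$1" -v n="$2" 'BEGIN { for (i = 0; i < n; i++) print v }'
}

# Mirrors "warn: $this >= 1": succeeds only if every sample in the window is 1,
# i.e. the average over the window is exactly 1.
warn_sustained() {
    awk '{ sum += $1; n++ } END { exit !(n > 0 && sum >= n) }'
}

# Mirrors "warn: $this > 0": succeeds if any sample in the window is 1,
# i.e. the average over the window is above 0.
warn_any() {
    awk '{ sum += $1 } END { exit !(sum > 0) }'
}

# A 30-minute window sampled every 30 s holds 60 samples.
# Example: 12 unhealthy samples (a 6-minute blip) followed by 48 healthy ones.
if { repeat 1 12; repeat 0 48; } | warn_sustained; then
    echo "sustained: would warn"
else
    echo "sustained: stays clear"
fi
```

A 6-minute blip contributes 12 of 60 samples, so the average is 0.2 and the >= 1 condition stays false. The same samples would satisfy a > 0 condition.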

Tuning history:

| Date | Lookup | Delay | Trigger | Notes |
|---|---|---|---|---|
| 2026-03-23 | 35m | 35m | Initial split from general alarm | Absorbed PHP-FPM warm-up |
| 2026-04-29 | 15m | 5m | Backup blip (~6m) never triggered | Tightened after stability |
| 2026-05-02 | 30m | 10m | 15m still too aggressive for update cycles | ~40m total grace; catches real outages |

Watchdog Cron: Auto-Restart on Sustained Unhealthy

If the Nextcloud container is still unhealthy more than 1 hour after it last started (well past any normal startup window), a cron watchdog on majorlab restarts it and logs the event. This was added 2026-03-28 after an incident where the container sat unhealthy for 20 hours until the next nightly backup cycle replaced it.

File: /etc/cron.d/nextcloud-health-watchdog

# Restart nextcloud-aio-nextcloud if it is unhealthy and has been running for >1 hour
*/15 * * * * root docker inspect --format={{.State.Health.Status}} nextcloud-aio-nextcloud 2>/dev/null | grep -q unhealthy && [ "$(docker inspect --format={{.State.StartedAt}} nextcloud-aio-nextcloud | xargs -I{} date -d {} +\%s)" -lt "$(date -d "1 hour ago" +\%s)" ] && docker restart nextcloud-aio-nextcloud && logger -t nextcloud-watchdog "Restarted unhealthy nextcloud-aio-nextcloud"
  • Runs every 15 minutes as root
  • Only restarts if the container has been running for >1 hour (avoids interfering with normal startup)
  • Logs to syslog as nextcloud-watchdog — check with journalctl -t nextcloud-watchdog
  • Netdata will still fire the docker_nextcloud_unhealthy alert during the unhealthy window, but the outage is capped at ~1 hour instead of persisting until the next nightly cycle
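
The one-liner above is compact but hard to read. A script form of the same logic might look like the sketch below. This is not the deployed watchdog: the function names and the 3600-second default are illustrative, and it assumes GNU date (for date -d), as the cron line does.

```shell
#!/bin/sh
# Sketch of the watchdog logic as a script. Function names are hypothetical.
CONTAINER=nextcloud-aio-nextcloud

# Decide whether to restart: the container must report "unhealthy" AND have
# been running longer than the minimum age (default 3600 s).
# Args: HEALTH STARTED_EPOCH NOW_EPOCH [MIN_AGE_SECONDS]
should_restart() {
    health=$1
    started=$2
    now=$3
    min_age=${4:-3600}
    [ "$health" = "unhealthy" ] && [ $((now - started)) -gt "$min_age" ]
}

# Gather the inputs from Docker, then restart and log if warranted.
run_watchdog() {
    health=$(docker inspect --format '{{.State.Health.Status}}' "$CONTAINER" 2>/dev/null) || return 0
    started_at=$(docker inspect --format '{{.State.StartedAt}}' "$CONTAINER") || return 0
    started=$(date -d "$started_at" +%s) || return 0
    if should_restart "$health" "$started" "$(date +%s)"; then
        docker restart "$CONTAINER" &&
            logger -t nextcloud-watchdog "Restarted unhealthy $CONTAINER"
    fi
}
```

Saved as a script that ends with a run_watchdog call, this would let the cron entry shrink to the schedule, the user, and the script path. Keeping the decision in should_restart also makes the one-hour rule testable without touching Docker.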

Also: Suppress docker_container_down for Normally-Exiting Containers

Nextcloud AIO runs borgbackup (scheduled backups) and watchtower (auto-updates) as containers that exit with code 0 after completing their work. The stock docker_container_down alarm fires on any exited container, generating false alerts after every nightly cycle.

Add a second override to the same file using chart labels to exclude them:

# Suppress docker_container_down for Nextcloud AIO containers that exit normally
# (borgbackup runs on schedule then exits; watchtower does updates then exits)
template: docker_container_down
       on: docker.container_running_state
    class: Errors
     type: Containers
component: Docker
    units: status
    every: 30s
   lookup: average -5m of down
chart labels: container_name=!nextcloud-aio-borgbackup !nextcloud-aio-watchtower *
     warn: $this > 0
    delay: up 3m down 5m multiplier 1.5 max 30m
  summary: Docker container ${label:container_name} down
     info: ${label:container_name} docker container is down
       to: sysadmin

The chart labels line uses Netdata's simple pattern syntax — ! prefix excludes a container, * matches everything else. All other exited containers still alert normally.
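
To check a pattern list before deploying it, the first-match-wins behaviour can be approximated in plain shell. The sketch below is an emulation for reasoning about the lists used in this article, not Netdata's implementation; matches_simple_pattern is a hypothetical name, and it assumes patterns are space-separated and evaluated left to right.

```shell
# Approximate a Netdata simple-pattern list: patterns are tried left to right,
# the first one that matches decides, and a leading "!" turns a match into an
# exclusion. No match at all counts as "not selected".
# Args: NAME "PATTERN LIST"
matches_simple_pattern() {
    name=$1
    set -f                      # keep "*" literal while splitting the list
    for pat in $2; do
        negate=0
        case $pat in
            '!'*) negate=1; pat=${pat#?} ;;
        esac
        case $name in
            $pat) set +f; return "$negate" ;;
        esac
    done
    set +f
    return 1
}

PATTERNS='!nextcloud-aio-borgbackup !nextcloud-aio-watchtower *'
for c in nextcloud-aio-borgbackup nextcloud-aio-nextcloud; do
    if matches_simple_pattern "$c" "$PATTERNS"; then
        echo "$c: alarm applies"
    else
        echo "$c: excluded"
    fi
done
```

Order matters: if * came first in the list, it would match every container before the exclusions were ever reached.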

Applying the Config

# If Netdata runs in Docker, write to the config volume
sudo tee /var/lib/docker/volumes/netdata_netdataconfig/_data/health.d/docker.conf > /dev/null << 'EOF'
# paste config here
EOF

# Reload health alarms without restarting the container
sudo docker exec netdata netdatacli reload-health

No container restart needed — reload-health picks up the new config immediately.
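
Because the heredoc overwrites docker.conf in place, it can be worth keeping the previous version around while tuning. A minimal sketch, assuming the new config has been written to a scratch file first; install_health_conf is a hypothetical helper and both paths are passed in by the caller.

```shell
# Hypothetical helper: install a new health config, keeping the previous
# version as DEST.bak. Refuses to install a missing or empty source file.
# Args: SRC DEST
install_health_conf() {
    src=$1
    dest=$2
    if [ ! -s "$src" ]; then
        echo "refusing to install missing or empty config: $src" >&2
        return 1
    fi
    if [ -f "$dest" ]; then
        cp -p "$dest" "$dest.bak" || return 1
    fi
    cp "$src" "$dest"
}
```

Run it with sudo-level permissions against the config volume path, then reload with the netdatacli reload-health command shown above. If the new thresholds misbehave, copying docker.conf.bak back and reloading again restores the previous tuning.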

Verify

In the Netdata UI, navigate to Alerts → Manage Alerts and search for docker_container_unhealthy. The lookup and delay values should reflect the new config.
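
The same check can be scripted against the agent's REST API. The sketch below assumes the agent listens on the default localhost:19999 and that curl is available where you run it. NETDATA_HOST, fetch_alarms, and alarm_loaded are hypothetical names, and the match is a plain substring search on the JSON rather than a real parser.

```shell
# Fetch all configured alarms (raised or not) from the Netdata agent.
# NETDATA_HOST is a hypothetical override; the default port is assumed.
fetch_alarms() {
    curl -s "http://${NETDATA_HOST:-localhost:19999}/api/v1/alarms?all"
}

# Exit 0 if the given alarm name appears anywhere in the JSON on stdin.
alarm_loaded() {
    grep -qF -- "$1"
}
```

For example, fetch_alarms | alarm_loaded docker_nextcloud_unhealthy && echo loaded confirms that the dedicated alarm was picked up after the reload. It does not confirm the lookup and delay values, so the UI check is still worth doing.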

Notes

  • Both docker_container_unhealthy and docker_container_down are overridden in this config. Any container not explicitly excluded in the chart labels filter will still alert normally.
  • If you want per-container silencing instead of a blanket delay, use the host labels or chart labels filter to scope the alarm to specific containers.
  • Config volume path on majorlab: /var/lib/docker/volumes/netdata_netdataconfig/_data/

See Also