---
title: "Tuning Netdata Docker Health Alarms to Prevent Update Flapping"
domain: selfhosting
category: monitoring
tags: [netdata, docker, nextcloud, alarms, health, monitoring]
status: published
created: 2026-03-18
updated: 2026-03-22
---

# Tuning Netdata Docker Health Alarms to Prevent Update Flapping

Netdata's default `docker_container_unhealthy` alarm fires on a 10-second average with no delay. When Nextcloud AIO (or any stack with a watchtower/auto-update setup) runs its nightly update cycle, containers restart in sequence and briefly report as unhealthy, generating a flood of false alerts.

## The Default Alarm

```ini
template: docker_container_unhealthy
      on: docker.container_health_status
   every: 10s
  lookup: average -10s of unhealthy
    warn: $this > 0
```

Because the warning condition is `$this > 0`, a single container reporting unhealthy at any point in the 10-second window triggers the alarm. There is no grace period and no delay.

## The Fix

Create a custom override at `/etc/netdata/health.d/docker.conf` (this path maps to the Netdata config volume if Netdata runs in Docker). Because it shares its filename with the stock config at `/usr/lib/netdata/conf.d/health.d/docker.conf`, this file takes precedence over it.

```ini
# Custom override: reduces flapping during nightly container updates.

 template: docker_container_unhealthy
       on: docker.container_health_status
    class: Errors
     type: Containers
component: Docker
    units: status
    every: 30s
   lookup: average -5m of unhealthy
     warn: $this > 0
    delay: up 3m down 5m multiplier 1.5 max 30m
  summary: Docker container ${label:container_name} health
     info: ${label:container_name} docker container health status is unhealthy
       to: sysadmin
```

| Setting | Default | Tuned | Effect |
|---|---|---|---|
| `every` | 10s | 30s | Evaluates the alarm less frequently |
| `lookup` | average -10s | average -5m | Averages the `unhealthy` dimension over 5 minutes instead of 10 seconds |
| `delay: up 3m` | none | 3m | Holds the notification for 3 minutes after the alarm raises, so a brief unhealthy state can recover first |
| `delay: down 5m` | none | 5m (max 30m) | Grace period after recovery before the alarm is reported as cleared |

The `up` delay is the critical addition. The health check for Nextcloud AIO's `nextcloud-aio-nextcloud` container tests both PostgreSQL (port 5432) and PHP-FPM (port 9000). PHP-FPM takes ~90 seconds to warm up after a restart, causing 2–3 failing health checks before the container becomes healthy. With `delay: up 3m`, Netdata holds the notification for 3 minutes after the alarm raises, which is intended to absorb the ~90-second startup window with margin to spare. A genuinely broken container still triggers the alert.
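The `multiplier 1.5 max 30m` part of the `delay` line is not covered in the table. As I understand Netdata's health reference, the multiplier is applied to the delay each time the alarm changes state while a notification is still pending, and `max` caps the result. The sketch below only illustrates that arithmetic for the `up 3m` value; `delay_growth` is a hypothetical helper, not a Netdata command, and the exact moment Netdata applies the multiplier is an assumption here.

```shell
#!/usr/bin/env bash
# Illustrative only: prints successive delays (in seconds) if the multiplier
# were applied once per state change, capped at the configured maximum.
# Usage: delay_growth BASE_SECONDS MULTIPLIER MAX_SECONDS STEPS
delay_growth() {
  awk -v d="$1" -v m="$2" -v max="$3" -v n="$4" 'BEGIN {
    for (i = 0; i < n; i++) {
      if (d > max) d = max   # never exceed the "max" setting
      printf "%d\n", d       # truncate to whole seconds
      d = d * m              # grow the delay for the next state change
    }
  }'
}

# up 3m = 180 s, multiplier 1.5, max 30m = 1800 s
delay_growth 180 1.5 1800 8
```

With these inputs the sequence starts at 180, 270 and 405 seconds and stops growing once it reaches the 1800-second cap, so a container that keeps flapping is reported progressively later rather than on every flap.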

## Also: Suppress `docker_container_down` for Normally-Exiting Containers

Nextcloud AIO runs `borgbackup` (scheduled backups) and `watchtower` (auto-updates) as containers that exit with code 0 after completing their work. The stock `docker_container_down` alarm fires on any exited container, generating false alerts after every nightly cycle.

Add a second override to the same file using `chart labels` to exclude them:

```ini
# Suppress docker_container_down for Nextcloud AIO containers that exit normally
# (borgbackup runs on schedule then exits; watchtower does updates then exits)
    template: docker_container_down
          on: docker.container_running_state
       class: Errors
        type: Containers
   component: Docker
       units: status
       every: 30s
      lookup: average -5m of down
chart labels: container_name=!nextcloud-aio-borgbackup !nextcloud-aio-watchtower *
        warn: $this > 0
       delay: up 3m down 5m multiplier 1.5 max 30m
     summary: Docker container ${label:container_name} down
        info: ${label:container_name} docker container is down
          to: sysadmin
```
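
The `chart labels` line uses Netdata's simple pattern syntax: a `!` prefix excludes a matching container and `*` matches everything else. Patterns are evaluated left to right and the first match wins, so the exclusions must come before the `*`. All other exited containers still alert normally.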
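To check a filter before reloading Netdata, the matching rule can be emulated in the shell. This is a minimal sketch, assuming simple patterns are evaluated left to right with the first match deciding the result; `matches_simple_pattern` is a hypothetical helper that uses bash globbing as a stand-in, so it is not Netdata's own matcher and may differ on characters other than `*`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if NAME is selected by the space-separated
# PATTERNS, 1 if it is excluded or matches nothing. First match wins.
matches_simple_pattern() {
  local name=$1 pat
  local -a pats
  read -r -a pats <<< "$2"        # split on spaces without glob expansion
  for pat in "${pats[@]}"; do
    if [[ $pat == '!'* ]]; then
      # Negative pattern: a match here excludes the name immediately.
      [[ $name == ${pat:1} ]] && return 1
    else
      # Positive pattern: a match here selects the name immediately.
      [[ $name == $pat ]] && return 0
    fi
  done
  return 1
}

filter='!nextcloud-aio-borgbackup !nextcloud-aio-watchtower *'
for c in nextcloud-aio-nextcloud nextcloud-aio-borgbackup nextcloud-aio-watchtower; do
  if matches_simple_pattern "$c" "$filter"; then
    echo "$c: alarm applies"
  else
    echo "$c: excluded"
  fi
done
```

Swapping the order to `* !nextcloud-aio-borgbackup` makes the helper select every container, because `*` matches before the exclusion is reached.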

## Applying the Config

```bash
# If Netdata runs in Docker, write to the config volume
sudo tee /var/lib/docker/volumes/netdata_netdataconfig/_data/health.d/docker.conf > /dev/null << 'EOF'
# paste config here
EOF

# Reload health alarms without restarting the container
sudo docker exec netdata netdatacli reload-health
```

No container restart is needed: `reload-health` picks up the new config immediately.
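Before reloading, it can help to confirm that the file written above actually contains both templates and not just the placeholder comment. The following is a small sketch; `list_templates` is a hypothetical helper, and the path is the config volume location used in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the name on every "template:" line of a file.
list_templates() {
  sed -n 's/^[[:space:]]*template:[[:space:]]*//p' "$1"
}

# Assumed path: the Netdata config volume used elsewhere in this article.
conf=/var/lib/docker/volumes/netdata_netdataconfig/_data/health.d/docker.conf
if [ -r "$conf" ]; then
  list_templates "$conf"
else
  echo "cannot read $conf (the Docker volume directory usually requires root)" >&2
fi
```

For the config in this article, the output should list `docker_container_unhealthy` and `docker_container_down`. An empty result suggests the heredoc was saved without the pasted config.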

## Verify

In the Netdata UI, navigate to **Alerts → Manage Alerts** and search for `docker_container_unhealthy`. The lookup and delay values should reflect the new config.
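As a complementary check from the shell, the sketch below reports whether an override or the stock copy of a health config file is present. It assumes that a same-named file in the user directory replaces the stock one; `effective_health_conf` is a hypothetical helper, and its default directories are the two paths mentioned in this article. When Netdata runs in Docker, those paths exist inside the container, so either run it there or pass the volume path as the second argument.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Netdata: reports which copy of a health
# config file would be used, assuming a same-named user file wins.
# Usage: effective_health_conf NAME [USER_DIR] [STOCK_DIR]
effective_health_conf() {
  local name=$1
  local user_dir=${2:-/etc/netdata/health.d}
  local stock_dir=${3:-/usr/lib/netdata/conf.d/health.d}
  if [ -f "$user_dir/$name" ]; then
    echo "override: $user_dir/$name"
  elif [ -f "$stock_dir/$name" ]; then
    echo "stock: $stock_dir/$name"
  else
    echo "missing: $name"
    return 1
  fi
}

effective_health_conf docker.conf || true
```

A result beginning with `override:` indicates the custom file is in place; `stock:` means only the default alarm definitions were found.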

## Notes

- Both `docker_container_unhealthy` and `docker_container_down` are overridden in this config. Any container not explicitly excluded in the `chart labels` filter will still alert normally.
- If you want per-container silencing instead of a blanket delay, use the `chart labels` filter to scope the alarm to specific containers, or the `host labels` filter to scope it to specific hosts.
- Config volume path on majorlab: `/var/lib/docker/volumes/netdata_netdataconfig/_data/`

## See Also

- [Tuning Netdata Web Log Alerts](tuning-netdata-web-log-alerts.md): similar tuning for web_log redirect alerts