wiki: update Netdata Docker alarm tuning — add docker_container_down suppression
Nextcloud AIO borgbackup and watchtower exit normally after nightly update/backup cycles. Added docker_container_down override with chart labels to exclude them, preventing false alerts. Documents chart labels pattern syntax.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -5,7 +5,7 @@ category: monitoring
 tags: [netdata, docker, nextcloud, alarms, health, monitoring]
 status: published
 created: 2026-03-18
-updated: 2026-03-21
+updated: 2026-03-22
 ---
 
 # Tuning Netdata Docker Health Alarms to Prevent Update Flapping
@@ -55,6 +55,33 @@ component: Docker
 
 The `up` delay is the critical addition. Nextcloud AIO's `nextcloud-aio-nextcloud` container checks both PostgreSQL (port 5432) and PHP-FPM (port 9000). PHP-FPM takes ~90 seconds to warm up after a restart, causing 2–3 failing health checks before the container becomes healthy. With `delay: up 3m`, Netdata waits for 3 continuous minutes of unhealthy status before firing — absorbing the ~90 second startup window with margin to spare. A genuinely broken container will still trigger the alert.
 
+## Also: Suppress `docker_container_down` for Normally-Exiting Containers
+
+Nextcloud AIO runs `borgbackup` (scheduled backups) and `watchtower` (auto-updates) as containers that exit with code 0 after completing their work. The stock `docker_container_down` alarm fires on any exited container, generating false alerts after every nightly cycle.
+
+Add a second override to the same file using `chart labels` to exclude them:
+
+```ini
+# Suppress docker_container_down for Nextcloud AIO containers that exit normally
+# (borgbackup runs on schedule then exits; watchtower does updates then exits)
+template: docker_container_down
+on: docker.container_running_state
+class: Errors
+type: Containers
+component: Docker
+units: status
+every: 30s
+lookup: average -5m of down
+chart labels: container_name=!nextcloud-aio-borgbackup !nextcloud-aio-watchtower *
+warn: $this > 0
+delay: up 3m down 5m multiplier 1.5 max 30m
+summary: Docker container ${label:container_name} down
+info: ${label:container_name} docker container is down
+to: sysadmin
+```
+
+The `chart labels` line uses Netdata's simple pattern syntax — `!` prefix excludes a container, `*` matches everything else. All other exited containers still alert normally.
+
 ## Applying the Config
 
 ```bash
@@ -75,7 +102,7 @@ In the Netdata UI, navigate to **Alerts → Manage Alerts** and search for `dock
 
 ## Notes
 
-- This only overrides the `docker_container_unhealthy` alarm. The `docker_container_down` alarm (for exited containers) is left at its default — it already has a `delay: down 1m` and is disabled by default (`chart labels: container_name=!*`).
+- Both `docker_container_unhealthy` and `docker_container_down` are overridden in this config. Any container not explicitly excluded in the `chart labels` filter will still alert normally.
 - If you want per-container silencing instead of a blanket delay, use the `host labels` or `chart labels` filter to scope the alarm to specific containers.
 - Config volume path on majorlab: `/var/lib/docker/volumes/netdata_netdataconfig/_data/`
 
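
Note on the `chart labels` pattern added in this commit: the value after `container_name=` is a space-separated list that is read left to right, and the first token that matches decides the result. The Python sketch below is only an illustration of that rule, not Netdata's implementation. The function names `compile_simple_pattern` and `alarm_applies` are hypothetical, and the sketch assumes `*` is the only wildcard.

```python
import re


def compile_simple_pattern(expr):
    """Split a space-separated pattern list into (negated, regex) pairs."""
    rules = []
    for token in expr.split():
        negated = token.startswith("!")
        body = token[1:] if negated else token
        # Assumption: only `*` acts as a wildcard; everything else is literal.
        regex = ".*".join(re.escape(part) for part in body.split("*"))
        rules.append((negated, re.compile(regex)))
    return rules


def alarm_applies(expr, value):
    """Return True if the first token matching `value` is a positive one."""
    for negated, regex in compile_simple_pattern(expr):
        if regex.fullmatch(value):
            return not negated  # the first matching token decides
    return False  # no token matched at all


# Pattern part of the `chart labels` line from the override above.
PATTERN = "!nextcloud-aio-borgbackup !nextcloud-aio-watchtower *"

if __name__ == "__main__":
    for name in (
        "nextcloud-aio-borgbackup",
        "nextcloud-aio-watchtower",
        "nextcloud-aio-nextcloud",
    ):
        print(name, alarm_applies(PATTERN, name))
```

Under this model the two excluded containers evaluate to `False` and every other name falls through to the trailing `*`. It also shows why the exclusions have to come before `*`: with the order reversed, `*` would match first and the `!` tokens would never be reached.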