Files
MajorWiki/02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md
MajorLinux 38fe720e63 wiki: add Netdata Docker health alarm tuning article; update indexes to 48
- 02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md — new
- lookup extended to 5m average, delay: down 5m to prevent Nextcloud AIO update flapping
- SUMMARY.md, index.md, README.md, deploy status updated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 00:10:36 -04:00

84 lines
3.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
title: "Tuning Netdata Docker Health Alarms to Prevent Update Flapping"
domain: selfhosting
category: monitoring
tags: [netdata, docker, nextcloud, alarms, health, monitoring]
status: published
created: 2026-03-18
updated: 2026-03-18
---
# Tuning Netdata Docker Health Alarms to Prevent Update Flapping
Netdata's default `docker_container_unhealthy` alarm fires on a 10-second average with no delay. When Nextcloud AIO (or any stack with a watchtower/auto-update setup) does its nightly update cycle, containers restart in sequence and briefly show as unhealthy — generating a flood of false alerts.
## The Default Alarm
```ini
template: docker_container_unhealthy
on: docker.container_health_status
every: 10s
lookup: average -10s of unhealthy
warn: $this > 0
```
A single container being unhealthy for 10 seconds triggers it. No grace period, no delay.
## The Fix
Create a custom override at `/etc/netdata/health.d/docker.conf` (maps to the Netdata config volume if running in Docker). This file takes precedence over the stock config in `/usr/lib/netdata/conf.d/health.d/docker.conf`.
```ini
# Custom override — reduces flapping during nightly container updates.
template: docker_container_unhealthy
on: docker.container_health_status
class: Errors
type: Containers
component: Docker
units: status
every: 30s
lookup: average -5m of unhealthy
warn: $this > 0
delay: down 5m multiplier 1.5 max 30m
summary: Docker container ${label:container_name} health
info: ${label:container_name} docker container health status is unhealthy
to: sysadmin
```
| Setting | Default | Tuned | Effect |
|---|---|---|---|
| `every` | 10s | 30s | Check less frequently |
| `lookup` | average -10s | average -5m | Must be unhealthy for sustained 5 minutes |
| `delay` | none | down 5m (max 30m) | Grace period after recovery before clearing |
A typical Nextcloud AIO update cycle (3090 seconds of container restarts) won't sustain 5 minutes of unhealthy status, so no alert fires. A genuinely broken container will still be caught.
## Applying the Config
```bash
# If Netdata runs in Docker, write to the config volume
sudo tee /var/lib/docker/volumes/netdata_netdataconfig/_data/health.d/docker.conf > /dev/null << 'EOF'
# paste config here
EOF
# Reload health alarms without restarting the container
sudo docker exec netdata netdatacli reload-health
```
No container restart needed — `reload-health` picks up the new config immediately.
## Verify
In the Netdata UI, navigate to **Alerts → Manage Alerts** and search for `docker_container_unhealthy`. The lookup and delay values should reflect the new config.
## Notes
- This only overrides the `docker_container_unhealthy` alarm. The `docker_container_down` alarm (for exited containers) is left at its default — it already has a `delay: down 1m` and is disabled by default (`chart labels: container_name=!*`).
- If you want per-container silencing instead of a blanket delay, use the `host labels` or `chart labels` filter to scope the alarm to specific containers.
- Config volume path on majorlab: `/var/lib/docker/volumes/netdata_netdataconfig/_data/`
## See Also
- [Tuning Netdata Web Log Alerts](tuning-netdata-web-log-alerts.md) — similar tuning for web_log redirect alerts