---
title: "Tuning Netdata Docker Health Alarms to Prevent Update Flapping"
domain: selfhosting
category: monitoring
tags: [netdata, docker, nextcloud, alarms, health, monitoring]
status: published
created: 2026-03-18
updated: 2026-03-28
---

# Tuning Netdata Docker Health Alarms to Prevent Update Flapping

Netdata's default `docker_container_unhealthy` alarm fires on a 10-second average with no delay. When Nextcloud AIO (or any stack with a watchtower/auto-update setup) does its nightly update cycle, containers restart in sequence and briefly show as unhealthy, generating a flood of false alerts.

## The Default Alarm

```ini
template: docker_container_unhealthy
on: docker.container_health_status
every: 10s
lookup: average -10s of unhealthy
warn: $this > 0
```

A single container being unhealthy for 10 seconds triggers it. No grace period, no delay.

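To see why a single blip is enough, the sketch below averages a window of 0/1 health samples in the way the `lookup` line describes. It assumes one sample per second purely for illustration (the real collection interval may differ), and `window_average` is a hypothetical helper, not part of Netdata.

```bash
#!/usr/bin/env bash
# Hypothetical helper: mean of the 0/1 "unhealthy" samples passed as arguments.
window_average() {
  printf '%s\n' "$@" | awk '{ sum += $1 } END { if (NR) printf "%.2f\n", sum / NR }'
}

# Ten samples (assumed one per second) with a single unhealthy reading.
window_average 0 0 0 0 1 0 0 0 0 0
```

The call prints `0.10`. Any value above zero satisfies `warn: $this > 0`, so one unhealthy sample inside the window is enough to raise the alarm.
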
## The Fix

Create a custom override at `/etc/netdata/health.d/docker.conf` (maps to the Netdata config volume if running in Docker). This file takes precedence over the stock config in `/usr/lib/netdata/conf.d/health.d/docker.conf`.

### General Container Alarm

This alarm covers all containers **except** `nextcloud-aio-nextcloud`, which gets its own dedicated alarm (see below).

```ini
# Custom override: reduces flapping during nightly container updates.
# General container unhealthy alarm: all containers except nextcloud-aio-nextcloud

template: docker_container_unhealthy
on: docker.container_health_status
class: Errors
type: Containers
component: Docker
units: status
every: 30s
lookup: average -5m of unhealthy
chart labels: container_name=!nextcloud-aio-nextcloud *
warn: $this > 0
delay: up 3m down 5m multiplier 1.5 max 30m
summary: Docker container ${label:container_name} health
info: ${label:container_name} docker container health status is unhealthy
to: sysadmin
```

| Setting | Default | Tuned | Effect |
|---|---|---|---|
| `every` | 10s | 30s | Evaluates the alarm less frequently |
| `lookup` | average -10s | average -5m | Smooths transient unhealthy samples over 5 minutes |
| `delay: up 3m` | none | 3m | Delays the notification until the alarm has been raised for 3 minutes |
| `delay: down 5m` | none | 5m (max 30m) | Delays the recovery notification by 5 minutes; the delay grows 1.5x on repeated flapping, up to 30m |

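The `multiplier` and `max` parts of the `delay` line are easiest to follow as arithmetic. The sketch below assumes, per the Netdata health reference, that the delay is multiplied each time the alarm changes state while a notification is still pending, and that `max` caps the result. `delay_sequence` is a hypothetical helper that works in whole seconds.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the notification delay (in seconds) for each
# consecutive flap, given a base delay, a multiplier, a cap, and a step count.
delay_sequence() {
  local base="$1" mult="$2" max="$3" steps="$4"
  awk -v d="$base" -v m="$mult" -v max="$max" -v n="$steps" 'BEGIN {
    for (i = 0; i < n; i++) {
      if (d > max) d = max      # never exceed the configured maximum
      printf "%d\n", d          # fractions of a second are truncated
      d *= m                    # grow the delay for the next flap
    }
  }'
}

# "down 5m multiplier 1.5 max 30m" expressed in seconds, six flaps in a row.
delay_sequence 300 1.5 1800 6
```

For these inputs the sequence is 300, 450, 675, 1012, 1518 and 1800 seconds, so under the stated assumption the 30-minute cap is reached on the sixth consecutive flap.
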
### Dedicated Nextcloud AIO Alarm

Added 2026-03-23, updated 2026-03-28. The `nextcloud-aio-nextcloud` container needs a more lenient window than other containers. Its healthcheck (`/healthcheck.sh`) verifies PostgreSQL connectivity (port 5432) and PHP-FPM (port 9000). PHP-FPM takes ~90 seconds to warm up after a normal restart, but during nightly AIO update cycles the full startup (occ upgrade, app updates, migrations) can take 5+ minutes. On 2026-03-27, a startup hung and left the container unhealthy for 20 hours until the next nightly cycle replaced it.

The dedicated alarm uses a 10-minute lookup window and 10-minute delay to absorb normal startup, while still catching sustained failures:

```ini
# Dedicated alarm for nextcloud-aio-nextcloud: lenient window to absorb nightly update cycle
# PHP-FPM can take 5+ minutes to warm up; only alert on sustained failure

template: docker_nextcloud_unhealthy
on: docker.container_health_status
class: Errors
type: Containers
component: Docker
units: status
every: 30s
lookup: average -10m of unhealthy
chart labels: container_name=nextcloud-aio-nextcloud
warn: $this > 0
delay: up 10m down 5m multiplier 1.5 max 30m
summary: Nextcloud container health sustained
info: nextcloud-aio-nextcloud has been unhealthy for a sustained period, not a transient update blip
to: sysadmin
```

## Watchdog Cron: Auto-Restart on Sustained Unhealthy

If the Nextcloud container is still unhealthy more than 1 hour after it started (well past any normal startup window), a cron watchdog on majorlab restarts it and logs the event. This was added 2026-03-28 after an incident where the container sat unhealthy for 20 hours until the next nightly backup cycle replaced it.

**File:** `/etc/cron.d/nextcloud-health-watchdog`

```bash
# Restart nextcloud-aio-nextcloud if it is unhealthy and has been up for >1 hour
*/15 * * * * root docker inspect --format='{{.State.Health.Status}}' nextcloud-aio-nextcloud 2>/dev/null | grep -qx unhealthy && [ "$(docker inspect --format='{{.State.StartedAt}}' nextcloud-aio-nextcloud | xargs -I{} date -d {} +\%s)" -lt "$(date -d "1 hour ago" +\%s)" ] && docker restart nextcloud-aio-nextcloud && logger -t nextcloud-watchdog "Restarted unhealthy nextcloud-aio-nextcloud"
```

- Runs every 15 minutes as root
- Only restarts if the container has been running for >1 hour (avoids interfering with normal startup)
- Logs to syslog as `nextcloud-watchdog`; check with `journalctl -t nextcloud-watchdog`
- Netdata will still fire the `docker_nextcloud_unhealthy` alert during the unhealthy window, but the outage is capped at ~1 hour instead of persisting until the next nightly cycle

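The one-line cron entry is dense, so here is the same decision written out as a script. This is a sketch for readability, not the deployed file: the function names, the `CONTAINER` and `MIN_UPTIME` variables, and the split into a testable `should_restart` step are all assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Readable sketch of the watchdog decision; all names here are illustrative.
CONTAINER="nextcloud-aio-nextcloud"
MIN_UPTIME=3600  # seconds; mirrors the ">1 hour" rule in the cron line

# Succeeds only when the status is "unhealthy" and uptime exceeds MIN_UPTIME.
should_restart() {
  local status="$1" started_epoch="$2" now_epoch="$3"
  [ "$status" = "unhealthy" ] && [ $(( now_epoch - started_epoch )) -gt "$MIN_UPTIME" ]
}

# Gather live values and act; returns quietly if docker or the container is missing.
watchdog_run() {
  local status started
  status=$(docker inspect --format '{{.State.Health.Status}}' "$CONTAINER" 2>/dev/null) || return 0
  started=$(date -d "$(docker inspect --format '{{.State.StartedAt}}' "$CONTAINER")" +%s) || return 0
  if should_restart "$status" "$started" "$(date +%s)"; then
    docker restart "$CONTAINER" && logger -t nextcloud-watchdog "Restarted unhealthy $CONTAINER"
  fi
}

# Run only when executed directly, so the functions can also be sourced.
if [ "${BASH_SOURCE[0]}" = "$0" ]; then
  watchdog_run
fi
```

Keeping the decision in `should_restart` separate from the Docker calls makes the threshold easy to check by hand, for example `should_restart unhealthy 0 3601` succeeds while `should_restart unhealthy 0 3600` does not.
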
## Also: Suppress `docker_container_down` for Normally-Exiting Containers

Nextcloud AIO runs `borgbackup` (scheduled backups) and `watchtower` (auto-updates) as containers that exit with code 0 after completing their work. The stock `docker_container_down` alarm fires on any exited container, generating false alerts after every nightly cycle.

Add a second override to the same file using `chart labels` to exclude them:

```ini
# Suppress docker_container_down for Nextcloud AIO containers that exit normally
# (borgbackup runs on schedule then exits; watchtower does updates then exits)
template: docker_container_down
on: docker.container_running_state
class: Errors
type: Containers
component: Docker
units: status
every: 30s
lookup: average -5m of down
chart labels: container_name=!nextcloud-aio-borgbackup !nextcloud-aio-watchtower *
warn: $this > 0
delay: up 3m down 5m multiplier 1.5 max 30m
summary: Docker container ${label:container_name} down
info: ${label:container_name} docker container is down
to: sysadmin
```

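The `chart labels` line uses Netdata's simple pattern syntax: a `!` prefix excludes a container, and `*` matches everything else. Patterns are evaluated left to right and the first match wins, so the exclusions must come before the `*`. All other exited containers still alert normally.
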
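The ordering rule in the `chart labels` filter can be checked with a small emulation. The function below is an illustrative stand-in, not Netdata's implementation: it walks a space-separated pattern list from left to right, stops at the first pattern that matches, and treats a `!` match as an exclusion. The name `simple_pattern_match` is hypothetical.

```bash
#!/usr/bin/env bash
# Emulates first-match-wins evaluation of a space-separated pattern list.
# Illustrative only; this is not Netdata's code.
simple_pattern_match() {
  local value="$2" pat
  local -a patterns
  read -r -a patterns <<< "$1"   # split on spaces without glob expansion
  for pat in "${patterns[@]}"; do
    if [[ "$pat" == '!'* ]]; then
      # Negative pattern: a hit means the value is explicitly excluded.
      [[ "$value" == ${pat:1} ]] && return 1
    else
      # Positive pattern: a hit means the value is matched.
      [[ "$value" == $pat ]] && return 0
    fi
  done
  return 1  # nothing matched
}

filter='!nextcloud-aio-borgbackup !nextcloud-aio-watchtower *'
for name in nextcloud-aio-borgbackup nextcloud-aio-nextcloud; do
  if simple_pattern_match "$filter" "$name"; then
    echo "$name: alarm applies"
  else
    echo "$name: excluded"
  fi
done
```

With the filter from the config above, `nextcloud-aio-borgbackup` is reported as excluded and `nextcloud-aio-nextcloud` as covered. Moving `*` to the front of the list would match every container before the exclusions are ever reached.
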
## Applying the Config

```bash
# If Netdata runs in Docker, write to the config volume
sudo tee /var/lib/docker/volumes/netdata_netdataconfig/_data/health.d/docker.conf > /dev/null << 'EOF'
# paste config here
EOF

# Reload health alarms without restarting the container
sudo docker exec netdata netdatacli reload-health
```

No container restart is needed: `reload-health` picks up the new config immediately.

## Verify

In the Netdata UI, navigate to **Alerts → Manage Alerts** and search for `docker_container_unhealthy`. The lookup and delay values should reflect the new config.

## Notes

- Both `docker_container_unhealthy` and `docker_container_down` are overridden in this config. Any container not explicitly excluded in the `chart labels` filter will still alert normally.
- If you want per-container silencing instead of a blanket delay, use the `host labels` or `chart labels` filter to scope the alarm to specific containers.
- Config volume path on majorlab: `/var/lib/docker/volumes/netdata_netdataconfig/_data/`

## See Also

- [Tuning Netdata Web Log Alerts](tuning-netdata-web-log-alerts.md): similar tuning for web_log redirect alerts