diff --git a/02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md b/02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md index 01a9e2f..b5fc1cf 100644 --- a/02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md +++ b/02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md @@ -5,7 +5,7 @@ category: monitoring tags: [netdata, docker, nextcloud, alarms, health, monitoring] status: published created: 2026-03-18 -updated: 2026-03-22 +updated: 2026-03-28 --- # Tuning Netdata Docker Health Alarms to Prevent Update Flapping @@ -28,8 +28,13 @@ A single container being unhealthy for 10 seconds triggers it. No grace period, Create a custom override at `/etc/netdata/health.d/docker.conf` (maps to the Netdata config volume if running in Docker). This file takes precedence over the stock config in `/usr/lib/netdata/conf.d/health.d/docker.conf`. +### General Container Alarm + +This alarm covers all containers **except** `nextcloud-aio-nextcloud`, which gets its own dedicated alarm (see below). + ```ini # Custom override — reduces flapping during nightly container updates. +# General container unhealthy alarm — all containers except nextcloud-aio-nextcloud template: docker_container_unhealthy on: docker.container_health_status @@ -39,6 +44,7 @@ component: Docker units: status every: 30s lookup: average -5m of unhealthy +chart labels: container_name=!nextcloud-aio-nextcloud * warn: $this > 0 delay: up 3m down 5m multiplier 1.5 max 30m summary: Docker container ${label:container_name} health @@ -53,7 +59,47 @@ component: Docker | `delay: up 3m` | none | 3m | Won't fire until unhealthy condition persists for 3 continuous minutes | | `delay: down 5m` | none | 5m (max 30m) | Grace period after recovery before clearing | -The `up` delay is the critical addition. Nextcloud AIO's `nextcloud-aio-nextcloud` container checks both PostgreSQL (port 5432) and PHP-FPM (port 9000). 
PHP-FPM takes ~90 seconds to warm up after a restart, causing 2–3 failing health checks before the container becomes healthy. With `delay: up 3m`, Netdata waits for 3 continuous minutes of unhealthy status before firing — absorbing the ~90 second startup window with margin to spare. A genuinely broken container will still trigger the alert. +### Dedicated Nextcloud AIO Alarm + +Added 2026-03-23, updated 2026-03-28. The `nextcloud-aio-nextcloud` container needs a more lenient window than other containers. Its healthcheck (`/healthcheck.sh`) verifies PostgreSQL connectivity (port 5432) and PHP-FPM (port 9000). PHP-FPM takes ~90 seconds to warm up after a normal restart — but during nightly AIO update cycles, the full startup (occ upgrade, app updates, migrations) can take 5+ minutes. On 2026-03-27, a startup hung and left the container unhealthy for 20 hours until the next nightly cycle replaced it. + +The dedicated alarm uses a 10-minute lookup window and 10-minute delay to absorb normal startup, while still catching sustained failures: + +```ini +# Dedicated alarm for nextcloud-aio-nextcloud — lenient window to absorb nightly update cycle +# PHP-FPM can take 5+ minutes to warm up; only alert on sustained failure + +template: docker_nextcloud_unhealthy + on: docker.container_health_status + class: Errors + type: Containers +component: Docker + units: status + every: 30s + lookup: average -10m of unhealthy +chart labels: container_name=nextcloud-aio-nextcloud + warn: $this > 0 + delay: up 10m down 5m multiplier 1.5 max 30m + summary: Nextcloud container health sustained + info: nextcloud-aio-nextcloud has been unhealthy for a sustained period — not a transient update blip + to: sysadmin +``` + +## Watchdog Cron: Auto-Restart on Sustained Unhealthy + +If the Nextcloud container stays unhealthy for more than 1 hour (well past any normal startup window), a cron watchdog on majorlab auto-restarts it and logs the event. 
This was added 2026-03-28 after an incident where the container sat unhealthy for 20 hours until the next nightly backup cycle replaced it. + +**File:** `/etc/cron.d/nextcloud-health-watchdog` + +```bash +# Restart nextcloud-aio-nextcloud if unhealthy for >1 hour +*/15 * * * * root docker inspect --format={{.State.Health.Status}} nextcloud-aio-nextcloud 2>/dev/null | grep -q unhealthy && [ "$(docker inspect --format={{.State.StartedAt}} nextcloud-aio-nextcloud | xargs -I{} date -d {} +\%s)" -lt "$(date -d "1 hour ago" +\%s)" ] && docker restart nextcloud-aio-nextcloud && logger -t nextcloud-watchdog "Restarted unhealthy nextcloud-aio-nextcloud" +``` + +- Runs every 15 minutes as root +- Only restarts if the container has been running for >1 hour (avoids interfering with normal startup) +- Logs to syslog as `nextcloud-watchdog` — check with `journalctl -t nextcloud-watchdog` +- Netdata will still fire the `docker_nextcloud_unhealthy` alert during the unhealthy window, but the outage is capped at ~1 hour instead of persisting until the next nightly cycle ## Also: Suppress `docker_container_down` for Normally-Exiting Containers diff --git a/05-troubleshooting/docker/nextcloud-aio-unhealthy-20h-stuck.md b/05-troubleshooting/docker/nextcloud-aio-unhealthy-20h-stuck.md new file mode 100644 index 0000000..d5f7b0e --- /dev/null +++ b/05-troubleshooting/docker/nextcloud-aio-unhealthy-20h-stuck.md @@ -0,0 +1,82 @@ +--- +title: "Nextcloud AIO Container Unhealthy for 20 Hours After Nightly Update" +domain: troubleshooting +category: docker +tags: [nextcloud, docker, healthcheck, netdata, php-fpm, aio] +status: published +created: 2026-03-28 +updated: 2026-03-28 +--- + +# Nextcloud AIO Container Unhealthy for 20 Hours After Nightly Update + +## Symptom + +Netdata alert `docker_nextcloud_unhealthy` fired on majorlab and stayed in Warning for 20 hours. The `nextcloud-aio-nextcloud` container was running but its Docker healthcheck kept failing. 
No user-facing errors were visible in `nextcloud.log`. + +## Investigation + +### Timeline (2026-03-27, all UTC) + +| Time | Event | +|---|---| +| 04:00 | Nightly backup script started, mastercontainer update kicked off | +| 04:03 | `nextcloud-aio-nextcloud` container recreated | +| 04:05 | Backup finished | +| 07:25 | Mastercontainer logged "Initial startup of Nextcloud All-in-One complete!" (3h20m delay) | +| 10:22 | First entry in `nextcloud.log` (deprecation warnings only — no errors) | +| 04:00 (Mar 28) | Next nightly backup replaced the container; new container came up healthy in ~25 minutes | + +### Key findings + +- **No image update** — the container image dated to Feb 26, so this was not caused by a version change. +- **No app-level errors** — `nextcloud.log` contained only `files_rightclick` deprecation warnings (level 3). No level 2/4 entries. +- **PHP-FPM never stabilized** — the healthcheck (`/healthcheck.sh`) tests `nc -z 127.0.0.1 9000` (PHP-FPM). The container was running but FPM wasn't responding to the port check. +- **6-hour log gap** — no `nextcloud.log` entries between container start (04:03) and first log (10:22), suggesting the AIO init scripts (occ upgrade, app updates, cron jobs) ran for hours before the app became partially responsive. +- **RestartCount: 0** — the container never restarted on its own. It sat there unhealthy for the full 20 hours. +- **Disk space fine** — 40% used on `/`. + +### Healthcheck details + +```bash +#!/bin/bash +# /healthcheck.sh inside nextcloud-aio-nextcloud +nc -z "$POSTGRES_HOST" "$POSTGRES_PORT" || exit 0 # postgres down = pass (graceful) +nc -z 127.0.0.1 9000 || exit 1 # PHP-FPM down = fail +``` + +If PostgreSQL is unreachable, the check passes (exits 0). The only failure path is PHP-FPM not listening on port 9000. + +## Root Cause + +The AIO nightly update cycle recreated the container, but the startup/migration process hung or ran extremely long, preventing PHP-FPM from fully initializing. 
The container sat in this state for 20 hours with no self-recovery mechanism until the next nightly cycle replaced it. + +The exact migration or occ command that stalled could not be confirmed — the old container's entrypoint logs were lost when the Mar 28 backup cycle replaced it. + +## Fix + +Two changes deployed on 2026-03-28: + +### 1. Dedicated Netdata alarm with lenient window + +Split `nextcloud-aio-nextcloud` into its own Netdata alarm (`docker_nextcloud_unhealthy`) with a 10-minute lookup and 10-minute delay, separate from the general container alarm. See [Tuning Netdata Docker Health Alarms](../../02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md). + +### 2. Watchdog cron for auto-restart + +Deployed `/etc/cron.d/nextcloud-health-watchdog` on majorlab: + +```bash +*/15 * * * * root docker inspect --format={{.State.Health.Status}} nextcloud-aio-nextcloud 2>/dev/null | grep -q unhealthy && [ "$(docker inspect --format={{.State.StartedAt}} nextcloud-aio-nextcloud | xargs -I{} date -d {} +\%s)" -lt "$(date -d "1 hour ago" +\%s)" ] && docker restart nextcloud-aio-nextcloud && logger -t nextcloud-watchdog "Restarted unhealthy nextcloud-aio-nextcloud" +``` + +- Checks every 15 minutes +- Only restarts if the container has been running >1 hour (avoids interfering with normal startup) +- Logs to syslog: `journalctl -t nextcloud-watchdog` + +This caps future unhealthy outages at ~1 hour instead of persisting until the next nightly cycle. 
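The age guard in the watchdog one-liner is easy to get backwards, so here is a dry-run of just the comparison logic with fixed timestamps — a sketch only, no Docker required. The `started` and `now` values are illustrative (loosely based on the incident timeline); on a live host, `started` would come from `docker inspect --format '{{.State.StartedAt}}'`:

```shell
# Dry-run of the watchdog's ">1 hour uptime" guard using fixed timestamps.
# started: RFC3339 value as docker inspect would print it (illustrative).
# now: pretend wall-clock time for the dry run (illustrative).
started="2026-03-27T04:03:00Z"
now="2026-03-27T06:00:00Z"

started_epoch=$(date -u -d "$started" +%s)
cutoff_epoch=$(( $(date -u -d "$now" +%s) - 3600 ))   # "now" minus 1 hour

# Restart only when the container started BEFORE the cutoff,
# i.e. it has been up longer than the 1-hour grace window.
if [ "$started_epoch" -lt "$cutoff_epoch" ]; then
  echo "would restart: container started >1h ago"
else
  echo "would skip: still inside the 1h startup grace window"
fi
```

The `-lt` direction is the part to double-check: an *older* start time means a *smaller* epoch value, so "started before the cutoff" is the condition under which a restart is safe.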
+ +## See Also + +- [Tuning Netdata Docker Health Alarms](../../02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md) +- [Debugging Broken Docker Containers](../../02-selfhosting/docker/debugging-broken-docker-containers.md) +- [Docker Healthchecks](../../02-selfhosting/docker/docker-healthchecks.md) diff --git a/SUMMARY.md b/SUMMARY.md index 42db083..b8c1628 100644 --- a/SUMMARY.md +++ b/SUMMARY.md @@ -44,6 +44,7 @@ * [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](05-troubleshooting/networking/fail2ban-imap-self-ban-mail-client.md) * [firewalld: Mail Ports Wiped After Reload](05-troubleshooting/networking/firewalld-mail-ports-reset.md) * [Tailscale SSH: Unexpected Re-Authentication Prompt](05-troubleshooting/networking/tailscale-ssh-reauth-prompt.md) + * [Nextcloud AIO Unhealthy 20h After Nightly Update](05-troubleshooting/docker/nextcloud-aio-unhealthy-20h-stuck.md) * [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](05-troubleshooting/docker-caddy-selinux-post-reboot-recovery.md) * [ISP SNI Filtering with Caddy](05-troubleshooting/isp-sni-filtering-caddy.md) * [Obsidian Vault Recovery — Loading Cache Hang](05-troubleshooting/obsidian-cache-hang-recovery.md)