wiki: document Nextcloud AIO 20h unhealthy incident and watchdog cron fix

Add troubleshooting article for the 2026-03-27 incident where PHP-FPM
hung after the nightly update cycle. Update the Netdata Docker alarm
tuning article with the dedicated Nextcloud alarm split and the new
watchdog cron deployed to majorlab. (54 articles)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 00:52:07 -04:00
parent d37bd60a24
commit cfaee5cf43
3 changed files with 131 additions and 2 deletions


@@ -5,7 +5,7 @@ category: monitoring
tags: [netdata, docker, nextcloud, alarms, health, monitoring]
status: published
created: 2026-03-18
-updated: 2026-03-22
+updated: 2026-03-28
---
# Tuning Netdata Docker Health Alarms to Prevent Update Flapping
@@ -28,8 +28,13 @@ A single container being unhealthy for 10 seconds triggers it. No grace period,
Create a custom override at `/etc/netdata/health.d/docker.conf` (maps to the Netdata config volume if running in Docker). This file takes precedence over the stock config in `/usr/lib/netdata/conf.d/health.d/docker.conf`.
### General Container Alarm
This alarm covers all containers **except** `nextcloud-aio-nextcloud`, which gets its own dedicated alarm (see below).
```ini
# Custom override: reduces flapping during nightly container updates.
# General container unhealthy alarm — all containers except nextcloud-aio-nextcloud
template: docker_container_unhealthy
on: docker.container_health_status
@@ -39,6 +44,7 @@ component: Docker
units: status
every: 30s
lookup: average -5m of unhealthy
chart labels: container_name=!nextcloud-aio-nextcloud *
warn: $this > 0
delay: up 3m down 5m multiplier 1.5 max 30m
summary: Docker container ${label:container_name} health
@@ -53,7 +59,47 @@ component: Docker
| `delay: up 3m` | none | 3m | Won't fire until unhealthy condition persists for 3 continuous minutes |
| `delay: down 5m` | none | 5m (max 30m) | Grace period after recovery before clearing |
-The `up` delay is the critical addition. Nextcloud AIO's `nextcloud-aio-nextcloud` container checks both PostgreSQL (port 5432) and PHP-FPM (port 9000). PHP-FPM takes ~90 seconds to warm up after a restart, causing 23 failing health checks before the container becomes healthy. With `delay: up 3m`, Netdata waits for 3 continuous minutes of unhealthy status before firing, absorbing the ~90 second startup window with margin to spare. A genuinely broken container will still trigger the alert.
+### Dedicated Nextcloud AIO Alarm
Added 2026-03-23, updated 2026-03-28. The `nextcloud-aio-nextcloud` container needs a more lenient window than other containers. Its healthcheck (`/healthcheck.sh`) probes PostgreSQL (port 5432) and PHP-FPM (port 9000), and fails only when PHP-FPM is not listening. PHP-FPM takes ~90 seconds to warm up after a normal restart, but during nightly AIO update cycles the full startup (occ upgrade, app updates, migrations) can take 5+ minutes. On 2026-03-27, a startup hung and left the container unhealthy for 20 hours until the next nightly cycle replaced it.
The dedicated alarm uses a 10-minute lookup window and 10-minute delay to absorb normal startup, while still catching sustained failures:
```ini
# Dedicated alarm for nextcloud-aio-nextcloud — lenient window to absorb nightly update cycle
# Full startup can take 5+ minutes during nightly updates; only alert on sustained failure
template: docker_nextcloud_unhealthy
on: docker.container_health_status
class: Errors
type: Containers
component: Docker
units: status
every: 30s
lookup: average -10m of unhealthy
chart labels: container_name=nextcloud-aio-nextcloud
warn: $this > 0
delay: up 10m down 5m multiplier 1.5 max 30m
summary: Nextcloud container health sustained
info: nextcloud-aio-nextcloud has been unhealthy for a sustained period — not a transient update blip
to: sysadmin
```
## Watchdog Cron: Auto-Restart on Sustained Unhealthy
If the Nextcloud container is unhealthy and has been running for more than 1 hour (well past any normal startup window), a cron watchdog on majorlab restarts it and logs the event. This was added 2026-03-28 after the 2026-03-27 incident, in which the container sat unhealthy for 20 hours until the next nightly backup cycle replaced it.
**File:** `/etc/cron.d/nextcloud-health-watchdog`
```bash
# Restart nextcloud-aio-nextcloud if it is unhealthy and has been running for >1 hour
*/15 * * * * root docker inspect --format={{.State.Health.Status}} nextcloud-aio-nextcloud 2>/dev/null | grep -q unhealthy && [ "$(docker inspect --format={{.State.StartedAt}} nextcloud-aio-nextcloud | xargs -I{} date -d {} +\%s)" -lt "$(date -d "1 hour ago" +\%s)" ] && docker restart nextcloud-aio-nextcloud && logger -t nextcloud-watchdog "Restarted unhealthy nextcloud-aio-nextcloud"
```
- Runs every 15 minutes as root
- Only restarts if the container has been running for >1 hour (avoids interfering with normal startup)
- Logs to syslog as `nextcloud-watchdog` — check with `journalctl -t nextcloud-watchdog`
- Netdata still fires the `docker_nextcloud_unhealthy` alert during the unhealthy window, but the outage is capped at roughly 1 hour (plus up to 15 minutes until the next cron run) instead of persisting until the next nightly cycle
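
The one-liner is compact but hard to audit. Below is a minimal sketch of the same decision written as a script, assuming bash and GNU `date`. The function names `should_restart` and `watchdog_run` are hypothetical and are not part of the deployed setup.

```bash
#!/usr/bin/env bash
# Sketch of the watchdog logic as a script; function names are hypothetical.

CONTAINER="nextcloud-aio-nextcloud"
MIN_UPTIME_SECONDS=3600   # leave the container alone during its first hour

# Decide whether a restart is warranted.
# Args: health status, container start time (epoch seconds), current time (epoch seconds).
should_restart() {
    local status="$1" started="$2" now="$3"
    [ "$status" = "unhealthy" ] && [ $(( now - started )) -gt "$MIN_UPTIME_SECONDS" ]
}

# Query Docker, then restart and log only if should_restart agrees.
watchdog_run() {
    local status started_at started now
    status=$(docker inspect --format '{{.State.Health.Status}}' "$CONTAINER" 2>/dev/null) || return 0
    started_at=$(docker inspect --format '{{.State.StartedAt}}' "$CONTAINER") || return 0
    started=$(date -d "$started_at" +%s) || return 0
    now=$(date +%s)
    if should_restart "$status" "$started" "$now"; then
        docker restart "$CONTAINER" \
            && logger -t nextcloud-watchdog "Restarted unhealthy $CONTAINER"
    fi
}
```

To use something like this, the script would end with a `watchdog_run` call and the cron entry would invoke the script instead of the inline pipeline. Keeping the decision in `should_restart` makes the uptime threshold testable without touching Docker.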
## Also: Suppress `docker_container_down` for Normally-Exiting Containers


@@ -0,0 +1,82 @@
---
title: "Nextcloud AIO Container Unhealthy for 20 Hours After Nightly Update"
domain: troubleshooting
category: docker
tags: [nextcloud, docker, healthcheck, netdata, php-fpm, aio]
status: published
created: 2026-03-28
updated: 2026-03-28
---
# Nextcloud AIO Container Unhealthy for 20 Hours After Nightly Update
## Symptom
Netdata alert `docker_nextcloud_unhealthy` fired on majorlab and stayed in Warning for 20 hours. The `nextcloud-aio-nextcloud` container was running but its Docker healthcheck kept failing. No user-facing errors were visible in `nextcloud.log`.
## Investigation
### Timeline (2026-03-27, all UTC)
| Time | Event |
|---|---|
| 04:00 | Nightly backup script started, mastercontainer update kicked off |
| 04:03 | `nextcloud-aio-nextcloud` container recreated |
| 04:05 | Backup finished |
| 07:25 | Mastercontainer logged "Initial startup of Nextcloud All-in-One complete!" (3h20m delay) |
| 10:22 | First entry in `nextcloud.log` (deprecation warnings only — no errors) |
| 04:00 (Mar 28) | Next nightly backup replaced the container; new container came up healthy in ~25 minutes |
### Key findings
- **No image update** — the container image dated to Feb 26, so this was not caused by a version change.
- **No app-level errors** — `nextcloud.log` contained only `files_rightclick` deprecation warnings (level 3). No level 2/4 entries.
- **PHP-FPM never stabilized** — the healthcheck (`/healthcheck.sh`) tests `nc -z 127.0.0.1 9000` (PHP-FPM). The container was running but FPM wasn't responding to the port check.
- **6-hour log gap** — no `nextcloud.log` entries between container start (04:03) and first log (10:22), suggesting the AIO init scripts (occ upgrade, app updates, cron jobs) ran for hours before the app became partially responsive.
- **RestartCount: 0** — the container never restarted on its own. It sat there unhealthy for the full 20 hours.
- **Disk space fine** — 40% used on `/`.
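
The restart count and health state cited above can be read from `docker inspect`. The sketch below is one way to do that, and is not necessarily what was run during the investigation. It assumes the container defines a healthcheck, since the template errors out otherwise. The helper name `health_summary` is hypothetical.

```bash
# Print restart count, current health status, and consecutive failed checks
# for one container. Assumes the container has a healthcheck configured.
health_summary() {
    local container="$1"
    docker inspect --format \
        'restarts={{.RestartCount}} status={{.State.Health.Status}} failing_streak={{.State.Health.FailingStreak}}' \
        "$container"
}
```

Calling `health_summary nextcloud-aio-nextcloud` on the host prints a single line. A `restarts=0` value alongside `status=unhealthy` matches the pattern described in the findings: a container that is running but never recovers on its own.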
### Healthcheck details
```bash
#!/bin/bash
# /healthcheck.sh inside nextcloud-aio-nextcloud
nc -z "$POSTGRES_HOST" "$POSTGRES_PORT" || exit 0 # postgres down = pass (graceful)
nc -z 127.0.0.1 9000 || exit 1 # PHP-FPM down = fail
```
If PostgreSQL is unreachable, the check passes (exits 0). The only failure path is PHP-FPM not listening on port 9000.
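
The pass/fail behavior can be laid out as a small truth table. This sketch mirrors the script's control flow, with the two probe results passed in as arguments (0 = probe succeeded, 1 = probe failed). The helper `healthcheck_outcome` is hypothetical and not part of AIO.

```bash
# Mirrors the control flow of /healthcheck.sh; prints the exit code it would return.
healthcheck_outcome() {
    local postgres_probe="$1" fpm_probe="$2"
    [ "$postgres_probe" -eq 0 ] || { echo 0; return; }   # PostgreSQL unreachable: pass
    [ "$fpm_probe" -eq 0 ] || { echo 1; return; }        # PHP-FPM not listening: fail
    echo 0                                               # both reachable: pass
}

for pg in 0 1; do
    for fpm in 0 1; do
        echo "postgres=$pg fpm=$fpm -> exit $(healthcheck_outcome "$pg" "$fpm")"
    done
done
# postgres=0 fpm=0 -> exit 0
# postgres=0 fpm=1 -> exit 1
# postgres=1 fpm=0 -> exit 0
# postgres=1 fpm=1 -> exit 0
```

Only the combination of PostgreSQL reachable and PHP-FPM down reports unhealthy. One consequence is that a PostgreSQL outage hides a simultaneous PHP-FPM failure from the healthcheck.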
## Root Cause
The AIO nightly update cycle recreated the container, but the startup/migration process hung or ran extremely long, preventing PHP-FPM from fully initializing. The container sat in this state for 20 hours with no self-recovery mechanism until the next nightly cycle replaced it.
The exact migration or occ command that stalled could not be confirmed — the old container's entrypoint logs were lost when the Mar 28 backup cycle replaced it.
## Fix
Two changes deployed on 2026-03-28:
### 1. Dedicated Netdata alarm with lenient window
Split `nextcloud-aio-nextcloud` into its own Netdata alarm (`docker_nextcloud_unhealthy`) with a 10-minute lookup and 10-minute delay, separate from the general container alarm. See [Tuning Netdata Docker Health Alarms](../../02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md).
### 2. Watchdog cron for auto-restart
Deployed `/etc/cron.d/nextcloud-health-watchdog` on majorlab:
```bash
*/15 * * * * root docker inspect --format={{.State.Health.Status}} nextcloud-aio-nextcloud 2>/dev/null | grep -q unhealthy && [ "$(docker inspect --format={{.State.StartedAt}} nextcloud-aio-nextcloud | xargs -I{} date -d {} +\%s)" -lt "$(date -d "1 hour ago" +\%s)" ] && docker restart nextcloud-aio-nextcloud && logger -t nextcloud-watchdog "Restarted unhealthy nextcloud-aio-nextcloud"
```
- Checks every 15 minutes
- Only restarts if the container has been running >1 hour (avoids interfering with normal startup)
- Logs to syslog: `journalctl -t nextcloud-watchdog`
This caps future unhealthy outages at roughly 1 hour (plus up to 15 minutes until the next cron run) instead of letting them persist until the next nightly cycle.
## See Also
- [Tuning Netdata Docker Health Alarms](../../02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md)
- [Debugging Broken Docker Containers](../../02-selfhosting/docker/debugging-broken-docker-containers.md)
- [Docker Healthchecks](../../02-selfhosting/docker/docker-healthchecks.md)


@@ -44,6 +44,7 @@
* [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](05-troubleshooting/networking/fail2ban-imap-self-ban-mail-client.md)
* [firewalld: Mail Ports Wiped After Reload](05-troubleshooting/networking/firewalld-mail-ports-reset.md)
* [Tailscale SSH: Unexpected Re-Authentication Prompt](05-troubleshooting/networking/tailscale-ssh-reauth-prompt.md)
* [Nextcloud AIO Unhealthy 20h After Nightly Update](05-troubleshooting/docker/nextcloud-aio-unhealthy-20h-stuck.md)
* [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](05-troubleshooting/docker-caddy-selinux-post-reboot-recovery.md)
* [ISP SNI Filtering with Caddy](05-troubleshooting/isp-sni-filtering-caddy.md)
* [Obsidian Vault Recovery — Loading Cache Hang](05-troubleshooting/obsidian-cache-hang-recovery.md)