Merge cowork/majorair/wiki-updates-may02 — fail2ban digest + netdata docker health + 3 new articles

Marcus Summers 2026-05-02 16:28:48 -04:00
commit 021c7f6539
18 changed files with 73 additions and 35 deletions

View file

@ -10,7 +10,7 @@ tags:
- majorrig
status: published
created: 2026-03-16
updated: 2026-04-29T22:45
updated: 2026-04-30T05:21
---
# WSL2 Backup via PowerShell Scheduled Task

View file

@ -10,7 +10,7 @@ tags:
- remote-access
status: published
created: 2026-03-08
updated: 2026-04-22T09:20
updated: 2026-04-30T05:21
---
# SSH Config and Key Management

View file

@ -7,7 +7,7 @@ tags:
- asus
- ssh
created: 2026-04-19
updated: 2026-04-29T22:45
updated: 2026-04-30T05:21
---
# Wake-on-LAN via Router SSH

View file

@ -1,6 +1,6 @@
---
created: 2026-04-13T10:15
updated: 2026-04-29T22:45
updated: 2026-04-30T05:21
---
# 🏠 Self-Hosting & Homelab

View file

@ -1,11 +1,17 @@
---
title: "Tuning Netdata Docker Health Alarms to Prevent Update Flapping"
title: Tuning Netdata Docker Health Alarms to Prevent Update Flapping
domain: selfhosting
category: monitoring
tags: [netdata, docker, nextcloud, alarms, health, monitoring]
tags:
- netdata
- docker
- nextcloud
- alarms
- health
- monitoring
status: published
created: 2026-03-18
updated: 2026-03-28
updated: 2026-05-02T11:04
---
# Tuning Netdata Docker Health Alarms to Prevent Update Flapping
@ -61,9 +67,9 @@ chart labels: container_name=!nextcloud-aio-nextcloud *
### Dedicated Nextcloud AIO Alarm
Added 2026-03-23, updated 2026-03-28. The `nextcloud-aio-nextcloud` container needs a more lenient window than other containers. Its healthcheck (`/healthcheck.sh`) verifies PostgreSQL connectivity (port 5432) and PHP-FPM (port 9000). PHP-FPM takes ~90 seconds to warm up after a normal restart — but during nightly AIO update cycles, the full startup (occ upgrade, app updates, migrations) can take 5+ minutes. On 2026-03-27, a startup hung and left the container unhealthy for 20 hours until the next nightly cycle replaced it.
Added 2026-03-23, updated 2026-05-02. The `nextcloud-aio-nextcloud` container needs a more lenient window than other containers. Its healthcheck (`/healthcheck.sh`) verifies PostgreSQL connectivity (port 5432) and PHP-FPM (port 9000). PHP-FPM takes ~90 seconds to warm up after a normal restart — but during nightly AIO update cycles, the full startup (occ upgrade, app updates, migrations) can take 5+ minutes. On 2026-03-27, a startup hung and left the container unhealthy for 20 hours until the next nightly cycle replaced it.
The dedicated alarm uses a 10-minute lookup window and 10-minute delay to absorb normal startup, while still catching sustained failures:
The dedicated alarm uses a 30-minute lookup window and 10-minute delay to absorb normal startup and update cycles (~40 minutes total grace), while still catching sustained failures:
```ini
# Dedicated alarm for nextcloud-aio-nextcloud — lenient window to absorb nightly update cycle
@ -76,15 +82,23 @@ template: docker_nextcloud_unhealthy
component: Docker
units: status
every: 30s
lookup: average -10m of unhealthy
lookup: average -30m of unhealthy
chart labels: container_name=nextcloud-aio-nextcloud
warn: $this > 0
warn: $this >= 1
delay: up 10m down 5m multiplier 1.5 max 30m
summary: Nextcloud container health sustained
info: nextcloud-aio-nextcloud has been unhealthy for a sustained period — not a transient update blip
info: nextcloud-aio-nextcloud has been continuously unhealthy for 30+ minutes — not a transient update blip
to: sysadmin
```
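To see why `warn: $this >= 1` is stricter than `warn: $this > 0`, here is a small illustrative model, not Netdata code. It assumes the `unhealthy` dimension reads 1 while the container is unhealthy and 0 otherwise, and that the window holds one sample per 30 seconds; the helper name `window_average` and the sample data are hypothetical.

```python
def window_average(samples):
    """Mean of 0/1 'unhealthy' samples across the lookup window."""
    return sum(samples) / len(samples)


# Assumed for illustration: a 30-minute window with one sample every 30s = 60 samples.
blip = [1] * 12 + [0] * 48   # unhealthy for 6 of the 30 minutes
outage = [1] * 60            # unhealthy for the entire window

for name, samples in (("blip", blip), ("outage", outage)):
    avg = window_average(samples)
    # Old rule fires on any unhealthy sample; new rule needs every sample unhealthy.
    print(f"{name}: average={avg:.2f} old(> 0)={avg > 0} new(>= 1)={avg >= 1}")
```

Under these assumptions a short blip satisfies the old `> 0` condition but not `>= 1`, while a window that is unhealthy throughout satisfies both.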
**Tuning history:**
| Date | Lookup | Delay | Reason for change | Notes |
|---|---|---|---|---|
| 2026-03-23 | 35m | 35m | Initial split from general alarm | Absorbed PHP-FPM warm-up |
| 2026-04-29 | 15m | 5m | Backup blip (~6m) never triggered | Tightened after stability |
| 2026-05-02 | 30m | 10m | 15m still too aggressive for update cycles | ~40m total grace; catches real outages |
## Watchdog Cron: Auto-Restart on Sustained Unhealthy
If the Nextcloud container stays unhealthy for more than 1 hour (well past any normal startup window), a cron watchdog on majorlab auto-restarts it and logs the event. This was added 2026-03-28 after an incident where the container sat unhealthy for 20 hours until the next nightly backup cycle replaced it.
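The watchdog's decision rule can be sketched as below. This is not the script running on majorlab: the function name `should_restart`, the `UNHEALTHY_LIMIT` constant, and the idea of passing in the time the container first went unhealthy are all assumptions made for illustration. Only the one-hour threshold comes from the text.

```python
from datetime import datetime, timedelta

UNHEALTHY_LIMIT = timedelta(hours=1)  # threshold described in the text


def should_restart(status, unhealthy_since, now):
    """Return True only when the container has been unhealthy longer than the limit."""
    if status != "unhealthy" or unhealthy_since is None:
        return False
    return now - unhealthy_since > UNHEALTHY_LIMIT


now = datetime(2026, 3, 28, 12, 0)
# A container five minutes into a slow startup is left alone.
print(should_restart("unhealthy", now - timedelta(minutes=5), now))
# A container stuck for two hours qualifies for a restart.
print(should_restart("unhealthy", now - timedelta(hours=2), now))
```

Keeping the rule as a pure function of status and timestamps makes it easy to check the threshold without touching a real container.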

View file

@ -11,7 +11,7 @@ tags:
- cron
status: published
created: 2026-04-18
updated: 2026-04-18T11:13
updated: 2026-04-30T05:21
---
# ClamAV Fleet Deployment with Ansible

View file

@ -1,11 +1,18 @@
---
title: "Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts"
title: Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts
domain: selfhosting
category: security
tags: [fail2ban, security, email, ansible, fleet, cron, digest]
tags:
- fail2ban
- security
- email
- ansible
- fleet
- cron
- digest
status: published
created: 2026-04-22
updated: 2026-04-22
updated: 2026-05-02T14:56
---
# Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts
@ -21,11 +28,11 @@ Three tiers replace the firehose:
| Tier | Jails | Action | Why |
|------|-------|--------|-----|
| **Immediate email** | `sshd`, `recidive` | `action_mwl` | Security-critical — someone is actively targeting auth or is a repeat offender |
| **Immediate email** | `recidive` | `action_mwl` | Repeat offenders only — someone has been banned multiple times across jails |
| **Silent ban** | Everything else | `action_` (default) | Ban happens, firewall rule applied, no email sent |
| **Daily digest** | All jails | Cron script at 08:00 UTC | One summary email per host with ban counts across all jails |
This reduces email volume from hundreds per day to ~10 (one digest per host + occasional sshd/recidive alerts).
This reduces email volume from hundreds per day to ~10 (one digest per host + occasional recidive alerts).
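The daily digest tier boils down to counting ban events per jail. The sketch below is a minimal illustration, not the deployed digest script. It assumes fail2ban's usual log shape, where a ban appears as `NOTICE [jail] Ban <ip>`; the function name `count_bans` and the sample lines (using documentation IP ranges) are made up for the example.

```python
import re
from collections import Counter

# Assumed log shape: "... NOTICE  [sshd] Ban 203.0.113.7"
# "Unban" and "Restore Ban" lines do not match because "Ban" must follow the jail tag directly.
BAN_RE = re.compile(r"\[(?P<jail>[^\]]+)\]\s+Ban\s+(?P<ip>\S+)")


def count_bans(lines):
    """Return a Counter mapping jail name to the number of Ban events seen."""
    counts = Counter()
    for line in lines:
        match = BAN_RE.search(line)
        if match:
            counts[match.group("jail")] += 1
    return counts


# Hypothetical sample lines, not real log output.
sample = [
    "2026-05-02 03:14:07,101 fail2ban.actions [812]: NOTICE  [sshd] Ban 203.0.113.7",
    "2026-05-02 03:20:41,553 fail2ban.actions [812]: NOTICE  [sshd] Ban 198.51.100.23",
    "2026-05-02 04:02:10,009 fail2ban.actions [812]: NOTICE  [recidive] Ban 203.0.113.7",
    "2026-05-02 05:14:07,101 fail2ban.actions [812]: NOTICE  [sshd] Unban 203.0.113.7",
]

for jail, total in sorted(count_bans(sample).items()):
    print(f"{jail}: {total} ban(s)")
```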
## jail.local Configuration
@ -40,18 +47,20 @@ action = %(action_)s
This overrides the stock `action_mwl` for all jails. Bans still happen — the firewall rule is applied — but no email is sent.
### Keep immediate alerts for critical jails
### Keep immediate alerts for recidive only
```ini
[sshd]
enabled = true
action = %(action_mwl)s
action = %(action_)s
[recidive]
enabled = true
action = %(action_mwl)s
```
> **Updated 2026-05-02:** sshd was moved to silent (`action_`). Only recidive (repeat offenders) now triggers immediate email. sshd bans are captured in the daily digest.
### Clean up email subjects with fq-hostname
By default, fail2ban uses the system FQDN in email subjects. On Tailscale hosts, this produces ugly subjects like `[Fail2Ban] sshd: banned 1.2.3.4 on MajorToot.tail7f2d9.ts.net`. Override it in `[DEFAULT]`:
@ -91,8 +100,9 @@ The playbook `configure_fail2ban_digest.yml` deploys the full digest model fleet
### What it does
1. Deploys a Python helper script that performs **section-aware editing** of `jail.local` (see gotchas below)
2. Sets `action = %(action_)s` in `[DEFAULT]`
3. Sets `action = %(action_mwl)s` in `[sshd]` and `[recidive]`
2. Sets `action = %(action_)s` in `[DEFAULT]` and `[sshd]`
3. Sets `action = %(action_mwl)s` in `[recidive]`
4. Removes stale `action = %(action_mwl)s` from `defaults-debian.conf` if present
5. Sets `fq-hostname` per host using an override dict
6. Deploys the digest script from a Jinja2 template
7. Creates the cron job via `ansible.builtin.cron`
@ -143,6 +153,14 @@ option 'action' in section 'DEFAULT' already exists
The Python editor script handles this by replacing existing keys rather than appending.
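One way to do this kind of replace-rather-than-append edit is sketched below. It is not the playbook's actual helper: the function name `set_option` and its behaviour (replace the first matching key in the target section, drop later duplicates, append the key or the whole section if missing) are assumptions chosen to illustrate the idea.

```python
import re

SECTION_RE = re.compile(r"^\s*\[([^\]]+)\]\s*$")


def set_option(text, section, key, value):
    """Set `key = value` inside [section], replacing an existing key instead of appending."""
    key_re = re.compile(rf"^\s*{re.escape(key)}\s*=")
    out = []
    in_target = False      # currently inside the target section
    found_section = False  # target section header was seen at least once
    done = False           # key has been written
    for line in text.splitlines():
        header = SECTION_RE.match(line)
        if header:
            # Leaving the target section without having seen the key: add it here.
            if in_target and not done:
                out.append(f"{key} = {value}")
                done = True
            in_target = header.group(1) == section
            found_section = found_section or in_target
        elif in_target and key_re.match(line):
            if done:
                continue  # drop duplicate key lines that would break the parser
            line = f"{key} = {value}"
            done = True
        out.append(line)
    if in_target and not done:
        out.append(f"{key} = {value}")
    if not found_section:
        out.extend([f"[{section}]", f"{key} = {value}"])
    return "\n".join(out) + "\n"


original = (
    "[DEFAULT]\naction = %(action_mwl)s\n\n"
    "[sshd]\nenabled = true\naction = %(action_mwl)s\n\n"
    "[recidive]\nenabled = true\naction = %(action_mwl)s\n"
)
print(set_option(original, "sshd", "action", "%(action_)s"))
```

Only the `[sshd]` section changes here; the `action` lines in `[DEFAULT]` and `[recidive]` are left as they were, which is the point of editing per section rather than by global search and replace.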
### defaults-debian.conf overrides jail.local
On Debian/Ubuntu, `/etc/fail2ban/jail.d/defaults-debian.conf` is loaded **after** `jail.local`. If it contains `action = %(action_mwl)s`, it quietly overrides the silent default, so every jail sends an email on every ban. The Ansible playbook now removes this line automatically. If you see per-ban emails after deploying digest mode, check this file first:
```bash
grep action /etc/fail2ban/jail.d/defaults-debian.conf
```
### fq-hostname scope
Setting `fq-hostname` in `[DEFAULT]` affects all action templates that use the `<fq-hostname>` tag — including both immediate emails and the digest subject. This is the desired behavior, but be aware that it overrides the system hostname globally within fail2ban.

View file

@ -10,7 +10,7 @@ tags:
- docker
status: published
created: 2026-04-02
updated: 2026-04-29T22:45
updated: 2026-04-30T05:21
---
# Mastodon Instance Tuning

View file

@ -11,7 +11,7 @@ tags:
- troubleshooting
status: published
created: 2026-04-18
updated: 2026-04-29T22:45
updated: 2026-04-30T05:21
---
# Ansible Check Mode False Positives in Verify/Assert Tasks

View file

@ -1,6 +1,6 @@
---
created: 2026-03-15T06:37
updated: 2026-04-29T23:55
updated: 2026-04-30T10:41
---
# 🔧 General Troubleshooting

View file

@ -1,11 +1,17 @@
---
title: "ISP SNI Filtering & Caddy Troubleshooting"
title: ISP SNI Filtering & Caddy Troubleshooting
domain: troubleshooting
category: general
tags: [isp, sni, caddy, tls, dns, cloudflare]
tags:
- isp
- sni
- caddy
- tls
- dns
- cloudflare
status: published
created: 2026-04-02
updated: 2026-04-30
updated: 2026-04-30T13:07
---
# ISP SNI Filtering & Caddy Troubleshooting

View file

@ -11,7 +11,7 @@ tags:
- powershell
status: published
created: 2026-04-03
updated: 2026-04-22T09:20
updated: 2026-04-30T05:21
---
# Windows OpenSSH: WSL as Default Shell Breaks Remote Commands

View file

@ -10,7 +10,7 @@ tags:
- majorrig
status: published
created: 2026-04-02
updated: 2026-04-22T09:20
updated: 2026-04-30T05:21
---
# Windows OpenSSH Server (sshd) Stops After Reboot

View file

@ -10,7 +10,7 @@ tags:
- deno
status: published
created: 2026-04-02
updated: 2026-04-22T11:33
updated: 2026-04-30T05:21
---
# yt-dlp YouTube JS Challenge Fix (Fedora)

View file

@ -2,7 +2,7 @@
title: MajorWiki Deployment Status
status: deployed
project: MajorTwin
updated: 2026-04-07T10:48
updated: 2026-04-30T05:30
created: 2026-04-02T16:10
---

View file

@ -1,6 +1,6 @@
---
created: 2026-04-06T09:52
updated: 2026-04-29T22:46
updated: 2026-04-30T05:21
---
# MajorLinux Tech Wiki — Index

View file

@ -1,6 +1,6 @@
---
created: 2026-04-02T16:03
updated: 2026-04-29T23:55
updated: 2026-04-30T11:24
---
* [Home](index.md)
* [Linux & Sysadmin](01-linux/index.md)

View file

@ -1,6 +1,6 @@
---
created: 2026-04-06T09:52
updated: 2026-04-29T22:45
updated: 2026-04-30T05:21
---
# MajorLinux Tech Wiki — Index