wiki: update fail2ban digest + netdata docker health + 3 new articles

- fail2ban-digest-mode-fleet: recidive-only email model, sshd now silent, defaults-debian.conf gotcha added
- netdata-docker-health-alarm-tuning: 30m/10m config, tuning history table
- New: wp-fail2ban-logpath-debian-ubuntu, lora-adapter-gguf-conversion-fails, tailscale-status-json-hostname-localhost-ios
- Various article updates and nav index refreshes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 14:58:07 -04:00 · 4126656c05
commit 4126656c05
parent f40f497b46
21 changed files with 567 additions and 35 deletions
--- a/01-linux/distro-specific/wsl2-backup-powershell.md
+++ b/01-linux/distro-specific/wsl2-backup-powershell.md
@ -10,7 +10,7 @@ tags:
  - majorrig
 status: published
 created: 2026-03-16
-updated: 2026-04-29T22:45
+updated: 2026-04-30T05:21
 ---
 # WSL2 Backup via PowerShell Scheduled Task
--- a/01-linux/networking/ssh-config-key-management.md
+++ b/01-linux/networking/ssh-config-key-management.md
@ -10,7 +10,7 @@ tags:
  - remote-access
 status: published
 created: 2026-03-08
-updated: 2026-04-22T09:20
+updated: 2026-04-30T05:21
 ---
 # SSH Config and Key Management
--- a/02-selfhosting/dns-networking/wake-on-lan-router-ssh.md
+++ b/02-selfhosting/dns-networking/wake-on-lan-router-ssh.md
@ -7,7 +7,7 @@ tags:
  - asus
  - ssh
 created: 2026-04-19
-updated: 2026-04-29T22:45
+updated: 2026-04-30T05:21
 ---
 # Wake-on-LAN via Router SSH
--- a/02-selfhosting/index.md
+++ b/02-selfhosting/index.md
@ -1,6 +1,6 @@
 ---
 created: 2026-04-13T10:15
-updated: 2026-04-29T22:45
+updated: 2026-04-30T05:21
 ---
 # 🏠 Self-Hosting & Homelab
--- a/02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md
+++ b/02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md
@ -1,11 +1,17 @@
 ---
-title: "Tuning Netdata Docker Health Alarms to Prevent Update Flapping"
+title: Tuning Netdata Docker Health Alarms to Prevent Update Flapping
 domain: selfhosting
 category: monitoring
-tags: [netdata, docker, nextcloud, alarms, health, monitoring]
+tags:
  - netdata
  - docker
  - nextcloud
  - alarms
  - health
  - monitoring
 status: published
 created: 2026-03-18
-updated: 2026-03-28
+updated: 2026-05-02T11:04
 ---
 # Tuning Netdata Docker Health Alarms to Prevent Update Flapping
@ -61,9 +67,9 @@ chart labels: container_name=!nextcloud-aio-nextcloud *
 ### Dedicated Nextcloud AIO Alarm
-Added 2026-03-23, updated 2026-03-28. The `nextcloud-aio-nextcloud` container needs a more lenient window than other containers. Its healthcheck (`/healthcheck.sh`) verifies PostgreSQL connectivity (port 5432) and PHP-FPM (port 9000). PHP-FPM takes ~90 seconds to warm up after a normal restart — but during nightly AIO update cycles, the full startup (occ upgrade, app updates, migrations) can take 5+ minutes. On 2026-03-27, a startup hung and left the container unhealthy for 20 hours until the next nightly cycle replaced it.
+Added 2026-03-23, updated 2026-05-02. The `nextcloud-aio-nextcloud` container needs a more lenient window than other containers. Its healthcheck (`/healthcheck.sh`) verifies PostgreSQL connectivity (port 5432) and PHP-FPM (port 9000). PHP-FPM takes ~90 seconds to warm up after a normal restart — but during nightly AIO update cycles, the full startup (occ upgrade, app updates, migrations) can take 5+ minutes. On 2026-03-27, a startup hung and left the container unhealthy for 20 hours until the next nightly cycle replaced it.
-The dedicated alarm uses a 10-minute lookup window and 10-minute delay to absorb normal startup, while still catching sustained failures:
+The dedicated alarm uses a 30-minute lookup window and 10-minute delay to absorb normal startup and update cycles (~40 minutes total grace), while still catching sustained failures:
 ```ini
 # Dedicated alarm for nextcloud-aio-nextcloud — lenient window to absorb nightly update cycle
@ -76,15 +82,23 @@ template: docker_nextcloud_unhealthy
 component: Docker
    units: status
    every: 30s
-   lookup: average -10m of unhealthy
+   lookup: average -30m of unhealthy
 chart labels: container_name=nextcloud-aio-nextcloud
-     warn: $this > 0
+     warn: $this >= 1
    delay: up 10m down 5m multiplier 1.5 max 30m
  summary: Nextcloud container health sustained
-     info: nextcloud-aio-nextcloud has been unhealthy for a sustained period — not a transient update blip
+     info: nextcloud-aio-nextcloud has been continuously unhealthy for 30+ minutes — not a transient update blip
       to: sysadmin
 ```
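To see why `average -30m of unhealthy` combined with `warn: $this >= 1` only fires on a continuously unhealthy container, here is a minimal Python sketch. It assumes the `unhealthy` dimension is sampled as 0 or 1 every 30 seconds; the function names and sample data are hypothetical, and it models only the lookup average, not Netdata's `delay` handling.

```python
from typing import Sequence

WINDOW_SAMPLES = 60  # 30 minutes at one sample every 30 seconds (assumed sampling)


def window_average(samples: Sequence[int], window: int = WINDOW_SAMPLES) -> float:
    """Average of the most recent `window` samples (1 = unhealthy, 0 = healthy)."""
    recent = samples[-window:]
    return sum(recent) / len(recent)


def would_warn(samples: Sequence[int]) -> bool:
    """Mirror of `warn: $this >= 1`: true only if every sample in the window is 1."""
    return window_average(samples) >= 1


# A 6-minute blip (12 samples) inside an otherwise healthy half hour.
blip = [0] * 48 + [1] * 12
# A container that has been unhealthy for the whole 30-minute window.
outage = [1] * 60

print(would_warn(blip), would_warn(outage))
```

Under these assumptions a short blip keeps the average well below 1, so the previous `$this > 0` threshold would have fired where `$this >= 1` stays quiet.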
 **Tuning history:**
 | Date | Lookup | Delay | Trigger | Notes |
 |---|---|---|---|---|
 | 2026-03-23 | 35m | 35m | Initial split from general alarm | Absorbed PHP-FPM warm-up |
 | 2026-04-29 | 15m | 5m | Backup blip (~6m) never triggered | Tightened after stability |
 | 2026-05-02 | 30m | 10m | 15m still too aggressive for update cycles | ~40m total grace; catches real outages |
 ## Watchdog Cron: Auto-Restart on Sustained Unhealthy
 If the Nextcloud container stays unhealthy for more than 1 hour (well past any normal startup window), a cron watchdog on majorlab auto-restarts it and logs the event. This was added 2026-03-28 after an incident where the container sat unhealthy for 20 hours until the next nightly backup cycle replaced it.
--- a/02-selfhosting/security/clamav-fleet-deployment.md
+++ b/02-selfhosting/security/clamav-fleet-deployment.md
@ -11,7 +11,7 @@ tags:
  - cron
 status: published
 created: 2026-04-18
-updated: 2026-04-18T11:13
+updated: 2026-04-30T05:21
 ---
 # ClamAV Fleet Deployment with Ansible
--- a/02-selfhosting/security/fail2ban-digest-mode-fleet.md
+++ b/02-selfhosting/security/fail2ban-digest-mode-fleet.md
@ -1,11 +1,18 @@
 ---
-title: "Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts"
+title: Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts
 domain: selfhosting
 category: security
-tags: [fail2ban, security, email, ansible, fleet, cron, digest]
+tags:
  - fail2ban
  - security
  - email
  - ansible
  - fleet
  - cron
  - digest
 status: published
 created: 2026-04-22
-updated: 2026-04-22
+updated: 2026-05-02T14:56
 ---
 # Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts
@ -21,11 +28,11 @@ Three tiers replace the firehose:
 | Tier | Jails | Action | Why |
 |------|-------|--------|-----|
-| **Immediate email** | `sshd`, `recidive` | `action_mwl` | Security-critical — someone is actively targeting auth or is a repeat offender |
+| **Immediate email** | `recidive` | `action_mwl` | Repeat offenders only — someone has been banned multiple times across jails |
 | **Silent ban** | Everything else | `action_` (default) | Ban happens, firewall rule applied, no email sent |
 | **Daily digest** | All jails | Cron script at 08:00 UTC | One summary email per host with ban counts across all jails |
-This reduces email volume from hundreds per day to ~10 (one digest per host + occasional sshd/recidive alerts).
+This reduces email volume from hundreds per day to ~10 (one digest per host + occasional recidive alerts).
 ## jail.local Configuration
@ -40,18 +47,20 @@ action = %(action_)s
 This overrides the stock `action_mwl` for all jails. Bans still happen — the firewall rule is applied — but no email is sent.
-### Keep immediate alerts for critical jails
+### Keep immediate alerts for recidive only
 ```ini
 [sshd]
 enabled = true
-action = %(action_mwl)s
+action = %(action_)s
 [recidive]
 enabled = true
 action = %(action_mwl)s
 ```
 > **Updated 2026-05-02:** sshd was moved to silent (`action_`). Only recidive (repeat offenders) now triggers immediate email. sshd bans are captured in the daily digest.
 ### Clean up email subjects with fq-hostname
 By default, fail2ban uses the system FQDN in email subjects. On Tailscale hosts, this produces ugly subjects like `[Fail2Ban] sshd: banned 1.2.3.4 on MajorToot.tail7f2d9.ts.net`. Override it in `[DEFAULT]`:
@ -91,8 +100,9 @@ The playbook `configure_fail2ban_digest.yml` deploys the full digest model fleet
 ### What it does
 1. Deploys a Python helper script that performs **section-aware editing** of `jail.local` (see gotchas below)
-2. Sets `action = %(action_)s` in `[DEFAULT]`
+2. Sets `action = %(action_)s` in `[DEFAULT]` and `[sshd]`
-3. Sets `action = %(action_mwl)s` in `[sshd]` and `[recidive]`
+3. Sets `action = %(action_mwl)s` in `[recidive]`
 4. Removes stale `action = %(action_mwl)s` from `defaults-debian.conf` if present
 5. Sets `fq-hostname` per host using an override dict
 6. Deploys the digest script from a Jinja2 template
 7. Creates the cron job via `ansible.builtin.cron`
@ -143,6 +153,14 @@ option 'action' in section 'DEFAULT' already exists
 The Python editor script handles this by replacing existing keys rather than appending.
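The replace-rather-than-append idea can be sketched as follows. This is not the playbook's actual helper script; `set_option` is a hypothetical function that assumes simple single-line `key = value` options and does not handle fail2ban's multi-line continuation values.

```python
def set_option(text: str, section: str, key: str, value: str) -> str:
    """Set `key = value` inside `[section]`, replacing an existing key instead of appending."""
    out = []
    in_section = False
    section_seen = False
    done = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Leaving the target section without having found the key: add it here.
            if in_section and not done:
                out.append(f"{key} = {value}")
                done = True
            in_section = stripped[1:-1] == section
            section_seen = section_seen or in_section
        elif in_section and not stripped.startswith(("#", ";")):
            name = stripped.split("=", 1)[0].strip()
            if "=" in stripped and name == key:
                if done:
                    continue  # drop duplicate definitions of the same key
                line = f"{key} = {value}"
                done = True
        out.append(line)
    if not done:
        # Key never written: create the section if needed, then add the key at the end.
        if not section_seen:
            out.append(f"[{section}]")
        out.append(f"{key} = {value}")
    return "\n".join(out) + "\n"
```

Because the function rewrites the matching line in place, running it twice with the same arguments leaves the file unchanged, which is the property that avoids the "already exists" error.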
 ### defaults-debian.conf overrides jail.local
 On Debian/Ubuntu, `/etc/fail2ban/jail.d/defaults-debian.conf` is loaded **after** `jail.local`. If it contains `action = %(action_mwl)s`, it silently overrides your silent default — every jail sends email on every ban. The Ansible playbook now removes this line automatically. If you see per-ban emails after deploying digest mode, check this file first:
 ```bash
 grep action /etc/fail2ban/jail.d/defaults-debian.conf
 ```
 ### fq-hostname scope
 Setting `fq-hostname` in `[DEFAULT]` affects all action templates that use the `<fq-hostname>` tag — including both immediate emails and the digest subject. This is the desired behavior, but be aware that it overrides the system hostname globally within fail2ban.
--- a/02-selfhosting/security/wp-fail2ban-logpath-debian-ubuntu.md
+++ b/02-selfhosting/security/wp-fail2ban-logpath-debian-ubuntu.md
@ -0,0 +1,151 @@
 ---
 title: "wp-fail2ban Plugin Logpath on Debian/Ubuntu (auth.log, not syslog)"
 domain: selfhosting
 category: security
 tags: [fail2ban, wordpress, wp-fail2ban, debugging, gotcha, debian, ubuntu]
 status: published
 created: 2026-04-30
 updated: 2026-04-30
 ---
 # wp-fail2ban Plugin Logpath on Debian/Ubuntu (auth.log, not syslog)
 ## The Problem
 You install the [WP fail2ban](https://wordpress.org/plugins/wp-fail2ban/) WordPress plugin, configure the fleet-standard `wordpress-hard`, `wordpress-soft`, and `wordpress-extra` jails, and… nothing. Weeks pass. `fail2ban-client status wordpress-hard` reports `Total failed: 0, Total banned: 0`. Your site is being attacked, but the jails are dead.
 Meanwhile the `wordpress-login` jail (which reads Apache access logs for `POST /wp-login.php` directly) is happily catching brute-forcers. So the problem isn't fail2ban itself — it's specifically the wp-fail2ban-plugin-derived jails.
 ## The Cause
 The wp-fail2ban plugin emits events via PHP's `syslog()` call with facility `LOG_AUTH`. On Debian/Ubuntu, rsyslog routes the `auth` facility to **`/var/log/auth.log`**, NOT `/var/log/syslog`. On RHEL/Fedora it's `/var/log/secure`.
 A lot of tutorials, ansible-galaxy roles, and copy-paste config snippets specify:
 ```ini
 logpath = /var/log/syslog
 ```
 That's wrong on Debian/Ubuntu. The events never land there, so the filter regex has nothing to match, so the jail catches zero events forever. Silently.
 ## Diagnostic Steps
 If a `wordpress-{hard,soft,extra}` jail shows `Total failed: 0` over a long window despite the plugin being active and the site getting attacked:
 **1. Check what the jail thinks it's watching:**
 ```bash
 sudo fail2ban-client status wordpress-hard | grep "File list"
 ```
 **2. Check where wp-fail2ban events actually land:**
 ```bash
 sudo grep -c "wordpress(" /var/log/auth.log /var/log/syslog /var/log/secure 2>/dev/null
 ```
 You'll see something like:
 ```
 /var/log/auth.log:314
 /var/log/syslog:0
 ```
 **3. If the jail's `File list` ≠ the file with events, fix the `logpath`.**
 A real event line on Debian/Ubuntu looks like:
 ```
 2026-04-18T23:28:21.027004-04:00 hostname wordpress(example.com)[719989]: XML-RPC authentication failure for someone from 1.2.3.4
 ```
 The `wordpress(domain)[pid]` syslog tag is the giveaway — those are wp-fail2ban events.
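The same check can be scripted. The sketch below is a rough Python equivalent of the `grep -c` step: it counts lines carrying the `wordpress(domain)[pid]:` tag in each candidate log. The function name and the regex are assumptions based on the sample event line above, not part of the plugin or fail2ban.

```python
import re
from pathlib import Path

# Matches the syslog tag shown above, e.g. "wordpress(example.com)[719989]:"
TAG = re.compile(r"\bwordpress\([^)]*\)\[\d+\]:")


def count_wp_events(paths):
    """Return {path: number of wp-fail2ban style lines} for each readable file."""
    counts = {}
    for path in map(Path, paths):
        try:
            with path.open(errors="replace") as fh:
                counts[str(path)] = sum(1 for line in fh if TAG.search(line))
        except OSError:
            # Missing or unreadable files are skipped, like `2>/dev/null` above.
            continue
    return counts
```

Calling `count_wp_events(["/var/log/auth.log", "/var/log/syslog", "/var/log/secure"])` as root should point at the file the jail's `logpath` needs to name.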
 ## The Fix
 Edit the jail blocks in `/etc/fail2ban/jail.local` (or your Ansible source for the jail) and set:
 ```ini
 [wordpress-hard]
 enabled = true
 port = http,https
 filter = wordpress-hard
 logpath = /var/log/auth.log
 maxretry = 1
 findtime = 60
 bantime = 30d
 backend = polling
 [wordpress-soft]
 enabled = true
 port = http,https
 filter = wordpress-soft
 logpath = /var/log/auth.log
 maxretry = 5
 findtime = 60
 bantime = 30d
 backend = polling
 [wordpress-extra]
 enabled = true
 port = http,https
 filter = wordpress-extra
 logpath = /var/log/auth.log
 maxretry = 5
 findtime = 60
 bantime = 30d
 backend = polling
 ```
 Then:
 ```bash
 sudo fail2ban-client -t          # validate
 sudo fail2ban-client reload
 sudo fail2ban-client status wordpress-hard | grep "File list"
 # should now show /var/log/auth.log
 ```
 ## Verification
 You can prove the filter regex actually matches your real events without waiting for an attack — run `fail2ban-regex` against the rotated log:
 ```bash
 sudo fail2ban-regex /var/log/auth.log.1 /etc/fail2ban/filter.d/wordpress-hard.conf | grep -E "Failregex:|Lines:"
 ```
 Healthy output looks like:
 ```
 Failregex: 81 total
 Lines: 13008 lines, 0 ignored, 81 matched, 12927 missed
 ```
 If you see `Failregex: 0 total`, the filter regex doesn't match what the plugin actually emits — which is a different bug (filter version skew vs. plugin version), not the logpath gotcha. Investigate `/etc/fail2ban/filter.d/wordpress-{hard,soft}.conf` against actual event lines.
 > **Note:** On a freshly-fixed jail, counters will sit at `Total failed: 0` for a while — the `polling` backend starts at the file's current EOF, so old events aren't retroactively counted. New events from the moment of `reload` onward will accumulate. Allow a few days of normal attack traffic before declaring the fix broken.
 ## Distribution Cheat Sheet
 | Distro family | wp-fail2ban events land in |
 |---|---|
 | Debian / Ubuntu | `/var/log/auth.log` |
 | RHEL / CentOS / Fedora | `/var/log/secure` |
 | systemd-journal-only systems | `journalctl SYSLOG_FACILITY=4` (use `backend = systemd` + `journalmatch = SYSLOG_FACILITY=4`) |
 If you have a mixed fleet, parameterize the path:
 ```yaml
 # Ansible vars
 wp_fail2ban_log_path: "{{ '/var/log/auth.log' if ansible_os_family == 'Debian' else '/var/log/secure' }}"
 ```
 ## Why wordpress-login Is Unaffected
 The `wordpress-login` jail is a different beast — it reads `/var/log/apache2/access.log` directly and matches `^<HOST> -.*"POST /wp-login.php` via the `wordpress-login` filter. No plugin involved, no syslog facility involved. So a host can have `wordpress-login` working perfectly while `wordpress-{hard,soft,extra}` are silently dead. Don't let a healthy `wordpress-login` reassure you that the rest of the wp-fail2ban stack is also fine.
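To illustrate how that access-log match works, here is a small Python sketch. fail2ban expands `<HOST>` itself; the IPv4-only substitute below is a simplification, and the sample log lines in the usage note are made up for illustration.

```python
import re

# The failregex quoted above, with fail2ban's <HOST> placeholder left in place.
FAILREGEX = r'^<HOST> -.*"POST /wp-login.php'
# Simplified stand-in for fail2ban's <HOST> expansion (IPv4 only, an assumption).
HOST_IPV4 = r"(?P<host>\d{1,3}(?:\.\d{1,3}){3})"


def matched_host(line: str):
    """Return the client IP if the access-log line matches the failregex, else None."""
    pattern = FAILREGEX.replace("<HOST>", HOST_IPV4)
    match = re.search(pattern, line)
    return match.group("host") if match else None
```

A line such as `1.2.3.4 - - [18/Apr/2026:23:28:21 -0400] "POST /wp-login.php HTTP/1.1" 200 1234` yields `1.2.3.4`, while a `GET` of the same URL yields `None`. No syslog tag is involved at any point.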
 ## Related
 - [[fail2ban-wordpress-login-jail]] — the access-log layer that catches WP brute force without any plugin dependency
 - [[fail2ban-apache-bad-request-jail]]
 - [[fail2ban-apache-php-probe-jail]]
 - [[clamav-fleet-deployment]]
--- a/02-selfhosting/services/mastodon-instance-tuning.md
+++ b/02-selfhosting/services/mastodon-instance-tuning.md
@ -10,7 +10,7 @@ tags:
  - docker
 status: published
 created: 2026-04-02
-updated: 2026-04-29T22:45
+updated: 2026-04-30T05:21
 ---
 # Mastodon Instance Tuning
--- a/05-troubleshooting/ansible-check-mode-false-positives.md
+++ b/05-troubleshooting/ansible-check-mode-false-positives.md
@ -11,7 +11,7 @@ tags:
  - troubleshooting
 status: published
 created: 2026-04-18
-updated: 2026-04-29T22:45
+updated: 2026-04-30T05:21
 ---
 # Ansible Check Mode False Positives in Verify/Assert Tasks
--- a/05-troubleshooting/gpu-display/lora-adapter-gguf-conversion-fails.md
+++ b/05-troubleshooting/gpu-display/lora-adapter-gguf-conversion-fails.md
@ -0,0 +1,119 @@
 ---
 title: "LoRA adapter — GGUF conversion fails with 'config.json not found'"
 domain: troubleshooting
 category: gpu-display
 tags: [lora, qlora, gguf, llama.cpp, unsloth, fine-tuning, qwen]
 status: published
 created: 2026-04-30
 updated: 2026-04-30
 ---
 # LoRA adapter — GGUF conversion fails with 'config.json not found'
 ## Problem
 After a QLoRA fine-tune, you point `llama.cpp/convert_hf_to_gguf.py` at the training output directory and it crashes immediately:
 ```
 FileNotFoundError: [Errno 2] No such file or directory:
  '/path/to/training-runs/<run>/final/config.json'
 ```
 The output directory looks fine — it contains:
 ```
 adapter_config.json
 adapter_model.safetensors  (~150 MB for a 7B base)
 chat_template.jinja
 tokenizer_config.json
 tokenizer.json
 ```
 But no `config.json`, and `adapter_model.safetensors` is 150 MB — way smaller than the ~14 GB you'd expect for a full Qwen2.5-7B 16-bit checkpoint.
 ## Root cause
 `model.save_pretrained()` after a LoRA/QLoRA train saves **only the adapter weights**, not a merged full-precision model. `convert_hf_to_gguf.py` expects a full HuggingFace model directory — it reads `config.json` to identify the architecture. Adapter-only directories don't have one.
 You need to merge the LoRA adapter into the base model first, then point the GGUF converter at the merged dir.
 ## Solution
 ### Quick fix — inline merge step
 Insert this block between training completion and `convert_hf_to_gguf.py`:
 ```python
 from unsloth import FastLanguageModel
 adapter = "/path/to/training-runs/<run>/final"
 merged  = "/path/to/training-runs/<run>/merged"
 model, tok = FastLanguageModel.from_pretrained(
    model_name=adapter,
    max_seq_length=2048,
    load_in_4bit=True,
 )
 model.save_pretrained_merged(merged, tok, save_method="merged_16bit")
 ```
 Then run the GGUF converter against the **merged** dir, not the adapter dir:
 ```bash
 python3 llama.cpp/convert_hf_to_gguf.py /path/to/training-runs/<run>/merged \
  --outfile model-f16.gguf --outtype f16
 ```
 The merged dir will contain `config.json`, `model-00001-of-00004.safetensors` (multiple shards totaling the full base model size), `generation_config.json`, etc.
 ### Cleaner fix — use a wrapper
 If you do this often, encapsulate it:
 1. Wrapper Python script accepts `--adapter`, `--output`, `--skip-merge`, `--all-quants`
 2. Step 1: load adapter via `FastLanguageModel.from_pretrained()`, call `save_pretrained_merged()`
 3. Step 2: subprocess `convert_hf_to_gguf.py` on the merged dir
 4. Step 3: subprocess `llama-quantize` for each requested quant
 This is what `~/corpus/scripts/convert_gguf.py` does on MajorRig (rewritten 2026-04-09 for the MajorTwin v7b cycle).
 ## Why this trips people up
 - Unsloth and PEFT both save adapter-only by default after `trainer.save_model()` or `model.save_pretrained()`. There's no warning that downstream tools expect a merged model.
 - The training output **looks** complete — there's a `tokenizer.json`, a `chat_template.jinja`, and a non-trivial `.safetensors`. It feels like a checkpoint.
 - If a pipeline runs `convert_gguf.py` (which includes the merge) once, and someone later reimplements Step 4 inline and bypasses the wrapper, the merge step is silently lost. This is what happened in MajorTwin v8c (Apr 30, 2026); see [[majortwin-v8b-plan#Pipeline Bug + Fix (2026-04-30)]].
 ## Verification checklist
 After training, before running the GGUF converter, verify the directory you're pointing at:
 | File | Adapter-only dir | Merged dir |
 |---|---|---|
 | `adapter_config.json` | ✅ | ❌ |
 | `adapter_model.safetensors` | ✅ (~150 MB / 7B) | ❌ |
 | `config.json` | ❌ | ✅ |
 | `model-*.safetensors` (sharded) | ❌ | ✅ (~14 GB / 7B) |
 | `generation_config.json` | ❌ | ✅ |
 | `tokenizer.json` | ✅ | ✅ |
 If you see only the left column, you need to merge before converting.
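The table can be turned into a quick pre-flight check. This is a minimal sketch with a hypothetical helper name; it looks only at the two marker files, `adapter_config.json` and `config.json`, and ignores the rest of the table.

```python
from pathlib import Path


def classify_model_dir(path) -> str:
    """Classify a training output directory using the marker files from the table above."""
    directory = Path(path)
    has_adapter = (directory / "adapter_config.json").is_file()
    has_config = (directory / "config.json").is_file()
    if has_config and not has_adapter:
        return "merged"        # safe to pass to convert_hf_to_gguf.py
    if has_adapter and not has_config:
        return "adapter-only"  # merge first
    return "unknown"           # neither or both markers present; inspect by hand
```

A pipeline could refuse to start the GGUF step unless this returns `"merged"`.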
 ## Resuming a failed pipeline without re-training
 The adapter is small and self-contained. If your pipeline crashes at the GGUF step, you do NOT need to retrain — the LoRA adapter at `<run>/final/` is intact. Write a resume wrapper that runs only:
 1. Merge (`save_pretrained_merged`)
 2. F16 conversion (`convert_hf_to_gguf.py`)
 3. Quantization (`llama-quantize`)
 4. Deploy
 This saves the cost of however many GPU-hours the training took. See `~/corpus/scripts/resume_v8c_step4.sh` on MajorRig for an example.
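As a rough illustration of steps 2 and 3, the sketch below only builds the command lines for a resume run; it is not the `resume_v8c_step4.sh` script. It assumes the merge has already produced a merged directory, and the output file names and the default `Q4_K_M` quant are illustrative choices.

```python
from pathlib import Path


def resume_commands(merged_dir, out_dir, quants=("Q4_K_M",)):
    """Build the F16 conversion and quantization commands for an already-merged model."""
    out = Path(out_dir)
    f16 = out / "model-f16.gguf"
    commands = [[
        "python3", "llama.cpp/convert_hf_to_gguf.py", str(merged_dir),
        "--outfile", str(f16), "--outtype", "f16",
    ]]
    for quant in quants:
        # llama-quantize takes: input GGUF, output GGUF, quantization type
        commands.append(["llama-quantize", str(f16), str(out / f"model-{quant}.gguf"), quant])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` so the run stops at the first failing step.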
 ## Related
 - [[qwen-14b-oom-3080ti]] — base model size choice on a 12GB GPU
 - [[majortwin-v8b-plan]] — v8c pipeline architecture and resume
 ## Maintenance
 - 2026-04-30 — Created after MajorTwin v8c pipeline failed Step 4. Root-caused, patched, resumed.
--- a/05-troubleshooting/index.md
+++ b/05-troubleshooting/index.md
@ -1,6 +1,6 @@
 ---
 created: 2026-03-15T06:37
-updated: 2026-04-29T22:45
+updated: 2026-04-30T10:41
 ---
 # 🔧 General Troubleshooting
@ -8,12 +8,14 @@ Practical fixes for common Linux, networking, and application problems.
 ## 🖥️ GPU & AI
 - [Qwen2.5-14B OOM on RTX 3080 Ti (12GB)](gpu-display/qwen-14b-oom-3080ti.md)
 - [LoRA adapter — GGUF conversion fails with 'config.json not found'](gpu-display/lora-adapter-gguf-conversion-fails.md)
 ## 🌐 Networking & Web
 - [Apache Outage: Fail2ban Self-Ban + Missing iptables Rules](networking/fail2ban-self-ban-apache-outage.md)
 - [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](networking/fail2ban-imap-self-ban-mail-client.md)
 - [firewalld: Mail Ports Wiped After Reload](networking/firewalld-mail-ports-reset.md)
 - [Tailscale SSH: Unexpected Re-Authentication Prompt](networking/tailscale-ssh-reauth-prompt.md)
 - [iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators](networking/tailscale-status-json-hostname-localhost-ios.md)
 - [rsync over Tailscale: Hung in TCP Teardown After Transfer Completes](networking/rsync-tailscale-teardown-stall.md)
 - [Windows OpenSSH: WSL Default Shell Breaks Remote Commands](networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md)
 - [Pi-hole AI Blocklist Blocks Claude Desktop (ERR_CONNECTION_REFUSED)](networking/pihole-blocks-claude-desktop.md)
--- a/05-troubleshooting/isp-sni-filtering-caddy.md
+++ b/05-troubleshooting/isp-sni-filtering-caddy.md
@ -1,11 +1,17 @@
 ---
-title: "ISP SNI Filtering & Caddy Troubleshooting"
+title: ISP SNI Filtering & Caddy Troubleshooting
 domain: troubleshooting
 category: general
-tags: [isp, sni, caddy, tls, dns, cloudflare]
+tags:
  - isp
  - sni
  - caddy
  - tls
  - dns
  - cloudflare
 status: published
 created: 2026-04-02
-updated: 2026-04-02
+updated: 2026-04-30T13:07
 ---
 # ISP SNI Filtering & Caddy Troubleshooting
@ -29,3 +35,89 @@ notes.majorshouse.com {
 ```
 Once the hostname was changed to one without the "wiki" keyword, the TLS handshake completed successfully.
 ---
 ## 🔁 2026-04-30 Update — Stale A Record + Cloudflare Proxy Fix
 The hostname rename held for ~4 weeks. On 2026-04-30 the wiki went down with a TLS handshake failure on `notes.majorshouse.com`. The on-the-spot framing was "ISP filter expanded to include 'notes'" — but Cloudflare DNS audit showed a different (and arguably worse) root cause: **the `notes` A record was pointing at `136.54.3.248`, an IP that is not majorlab's current home IP.** Whichever host responds at that address either does not run Caddy or does not know about the `notes.majorshouse.com` SNI, so the TLS handshake was rejected with `internal_error 80`.
 ### Re-diagnosis
 ```bash
 # Cert + Caddy + mkdocs all healthy on majorlab
 $ ssh majorlab 'systemctl is-active caddy; ss -tlnp | grep :443'
 active
 LISTEN 0  4096  *:443  users:(("caddy",pid=1549,fd=7))
 # Loopback-served TLS works fine — cert valid Mar 11 → Jun 9 2026
 $ ssh majorlab 'curl -sS -o /dev/null -w "%{http_code}\n" --resolve notes.majorshouse.com:443:127.0.0.1 https://notes.majorshouse.com/'
 200
 # External TLS handshake gets rejected with internal_error
 $ openssl s_client -servername notes.majorshouse.com -connect 136.54.3.248:443
 … SSL alert number 80 (internal_error) …
 ```
 ### The smoking-gun comparison
 Other `*.majorshouse.com` services worked because they were CNAMEs to the apex, which resolves to majorlab's actual home IP:
 | Subdomain | DNS shape | Final IP | Status |
 |---|---|---|---|
 | `notes.majorshouse.com` | **A → `136.54.3.248`** (stale) | `136.54.3.248` (wrong host) | ❌ TLS rejected |
 | `git.majorshouse.com` | CNAME → `majorshouse.com.` | `136.56.0.55` (majorlab) | ✅ |
 | `n8n.majorshouse.com` | CNAME → `majorshouse.com.` | `136.56.0.55` (majorlab) | ✅ |
 | `matrix.majorshouse.com` | CNAME → `majorshouse.com.` | `136.56.0.55` (majorlab) | ✅ |
 None of the working subdomains were proxied through Cloudflare (`proxied=false` on all of them); they simply had the right IP. The `notes` A record was the only one pointing somewhere wrong — most likely a stale value from a prior ISP / IP change that never got cleaned up.
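The comparison in the table can be automated for unproxied records. This is a minimal sketch, not the audit that was actually run: `find_mismatched` is a hypothetical helper, and the resolver is injectable so the logic can be exercised without network access. Records proxied through Cloudflare resolve to edge IPs, so they will always show up as differing from the apex.

```python
import socket


def find_mismatched(apex, subdomains, resolve=socket.gethostbyname):
    """Return {subdomain: ip} for every subdomain whose IPv4 address differs from the apex."""
    apex_ip = resolve(apex)
    mismatched = {}
    for name in subdomains:
        ip = resolve(name)
        if ip != apex_ip:
            mismatched[name] = ip
    return mismatched
```

With the values from the table, only `notes.majorshouse.com` would be reported, carrying the stale `136.54.3.248`.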
 ### ✅ Fix — switch `notes` to a Cloudflare-proxied CNAME
 Rather than just correcting the A record (which would silently break again the next time the home IP changes), the fix is a CNAME to the apex with proxy on. That gives two protections in one move: it always tracks the apex (so home IP changes propagate automatically) and it puts the wiki behind Cloudflare's edge (so any future ISP-side weirdness like the original `wiki` SNI filter is also bypassed).
 ```bash
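 # via Cloudflare API (token from ansible-vault: vault_cloudflare_api_token)
 # ZONE_ID, NOTES_RECORD_ID, and CF_API_TOKEN are placeholder shell variables; export them first
 curl -sS -X PUT \
   "https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/dns_records/${NOTES_RECORD_ID}" \
   -H "Authorization: Bearer ${CF_API_TOKEN}" \
   -H "Content-Type: application/json" \
   --data '{
  "type":    "CNAME",
  "name":    "notes.majorshouse.com",
  "content": "majorshouse.com",
  "ttl":     1,
  "proxied": true,
  "comment": "switched A→CNAME proxied to bypass stale IP / ISP SNI filter"
 }'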
 ```
 Or via the dashboard:
 1. Cloudflare → `majorshouse.com` zone → DNS → Records
 2. Edit the `notes` record: Type `CNAME`, Target `majorshouse.com`, Proxy `Proxied` (orange cloud)
 3. Save
 External clients now hit Cloudflare edge IPs (`104.21.x.x` / `172.67.x.x`), which terminate TLS at the edge and proxy requests back to majorlab via the apex IP. ACME on majorlab keeps working because Cloudflare passes the HTTP-01 challenge through on port 80. Caddy's `notes.majorshouse.com {}` block needs no change.
 Verify (response should show `server: cloudflare` and `via: 1.0 Caddy`):
 ```bash
 curl -sSI https://notes.majorshouse.com/
 ```
 ### Why a Cloudflare-proxied CNAME is the durable shape
 - **Apex follows the home IP automatically.** Update the apex A record once when the ISP changes; every subdomain inherits it without per-record fixes.
 - **TLS handshake is offloaded to CF.** Any ISP-level SNI weirdness (the original `wiki` ban; theoretical future bans) becomes irrelevant: external clients send SNI=`notes.majorshouse.com` to Cloudflare's edge, which the ISP doesn't filter.
 - **Free.** Cloudflare's free tier covers proxy + TLS termination.
 ### Audit checklist for any home-hosted `*.majorshouse.com` subdomain
 - [ ] DNS record is a **CNAME** to `majorshouse.com.`, not an A record to a literal home IP.
 - [ ] Cloudflare proxy (orange cloud, `proxied=true`) enabled on the record — at minimum for any subdomain where TLS reachability matters.
 - [ ] Caddy entry on majorlab references the public hostname; `reverse_proxy` stays on the localhost port.
 - [ ] HTTPS verified from outside the LAN (phone on cellular is sufficient) within the first hour after the change.
 - [ ] If an A record is genuinely required (e.g. it must NOT go through CF), document why in the deploy notes for that service.
 ### Related
 - [[majwiki-setup-and-pipeline]] — full wiki deploy pipeline; the DNS step there should reference this fix
 - [[Network-Overview]] — fleet IP table
--- a/05-troubleshooting/networking/tailscale-status-json-hostname-localhost-ios.md
+++ b/05-troubleshooting/networking/tailscale-status-json-hostname-localhost-ios.md
@ -0,0 +1,116 @@
 ---
 title: iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators
 domain: troubleshooting
 category: networking
 tags:
  - tailscale
  - ios
  - postfix
  - etc-hosts
  - jq
 status: published
 created: 2026-04-29
 updated: 2026-04-29
 ---
 # iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators
 ## Problem
 A homegrown script that builds an `/etc/hosts` block from `tailscale status --json` silently corrupted the file the moment any iOS device joined the tailnet. After the next run, services bound to `localhost` started failing.
 On the affected host (`majordiscord`), Postfix refused to start with:
 ```
 postfix: fatal: parameter inet_interfaces: no local interface found for 100.127.114.10
 ```
 `/etc/hosts` looked fine at the top — `127.0.0.1 localhost` was still present — but inside the Tailscale-managed block:
 ```
 # TAILSCALE_START
 100.84.42.102 tttpod
 100.110.197.17 majortoot
 100.95.55.40 localhost          <-- WRONG (this is an iPhone)
 100.84.165.52 majormail
 ...
 100.127.114.10 localhost         <-- WRONG (this is an iPad)
 # TAILSCALE_END
 ```
 When Postfix resolved `localhost` (because `inet_interfaces = localhost` in `main.cf`), the **last matching entry** in `/etc/hosts` won — a Tailscale IP that doesn't exist on this host — and the daemon died on bind.
 ## Root Cause
 The script used `.HostName` from the Tailscale JSON:
 ```bash
 tailscale status --json \
  | jq -r '.Peer[] | "\(.TailscaleIPs[0]) \(.HostName)"' \
  >> "$TEMP_HOSTS"
 ```
 iOS Tailscale clients (iPhone, iPad) **always report `HostName: "localhost"`** in the JSON. iOS doesn't expose the real device name to apps the way macOS/Linux/Windows do, so the Tailscale client falls back to the literal string `localhost`.
 Inspect it directly:
 ```bash
 $ tailscale status --json | jq '.Peer[] | select(.OS == "iOS") | {DNSName, HostName, OS}'
 {
  "DNSName": "iphone171.tail7f2d9.ts.net.",
  "HostName": "localhost",
  "OS": "iOS"
 }
 {
  "DNSName": "ipad166.tail7f2d9.ts.net.",
  "HostName": "localhost",
  "OS": "iOS"
 }
 ```
 Every iOS device contributes a line `<tailscale-ip> localhost` to `/etc/hosts`, hijacking the `localhost` lookup.
 ## Fix
 Use `.DNSName` (the unique tailnet DNS name) and take the first dotted component instead of `.HostName`:
 ```bash
 tailscale status --json \
  | jq -r '.Peer[] | "\(.TailscaleIPs[0]) \(.DNSName | rtrimstr(".") | split(".")[0])"' \
  >> "$TEMP_HOSTS"
 ```
 `DNSName` is always set, always unique, and produces clean labels like `iphone171`, `ipad166`, `majorlab`, etc.
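For anyone porting the generator away from `jq`, the same transformation looks roughly like this in Python. It is a sketch under the assumption that `.Peer` is an object whose values carry `TailscaleIPs` and `DNSName`, as the `jq` filter above implies; the function name is hypothetical.

```python
import json


def hosts_lines(status_json: str):
    """Build '<ip> <label>' lines from `tailscale status --json` output using DNSName."""
    status = json.loads(status_json)
    lines = []
    for peer in (status.get("Peer") or {}).values():
        ips = peer.get("TailscaleIPs") or []
        # First dotted component of the tailnet DNS name, trailing dot removed.
        label = (peer.get("DNSName") or "").rstrip(".").split(".")[0]
        if ips and label:
            lines.append(f"{ips[0]} {label}")
    return lines
```

Peers with no IP or no DNS name are skipped instead of producing a malformed line, and `HostName` is never read.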
 After patching the script and re-running it:
 ```bash
 $ bash /root/update_tailscale_hosts.sh
 $ systemctl restart postfix
 $ systemctl is-active postfix
 active
 ```
 ## Why It's Hard to Spot
 - The corruption only triggers when an iOS device is in the tailnet — so the script "worked" for months.
 - `/etc/hosts` files are commonly skimmed top-down. The bogus `localhost` line is buried in the Tailscale block, well below the legitimate `127.0.0.1 localhost` line, and looks superficially like a normal Tailscale entry.
 - Postfix's error message names the IP, not `localhost`, so the connection to `/etc/hosts` isn't obvious.
 - `getent hosts localhost` shows the *first* match (`127.0.0.1`), not the one Postfix's resolver actually picks for `inet_interfaces` lookup.
 ## Verification Checklist
 If you suspect this on any host using a similar generator script:
 ```bash
 # Any "localhost" match other than the 127.0.0.1 loopback line is a bug
 grep -nE '^[0-9]+\..* localhost\s*$' /etc/hosts
 # Look at iOS peers' HostName field
 tailscale status --json | jq '.Peer[] | select(.OS == "iOS") | .HostName'
 ```
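A stricter variant of the first check, which ignores the legitimate loopback lines, could look like the sketch below. The helper name is hypothetical, and it treats any address for which Python's `ipaddress` module reports `is_loopback` as legitimate.

```python
import ipaddress


def bogus_localhost_entries(hosts_text: str):
    """Return (line number, address) for every non-loopback address mapped to 'localhost'."""
    bad = []
    for lineno, raw in enumerate(hosts_text.splitlines(), start=1):
        fields = raw.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) < 2 or "localhost" not in fields[1:]:
            continue
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # not an address line
        if not addr.is_loopback:
            bad.append((lineno, fields[0]))
    return bad
```

Feeding it the contents of `/etc/hosts` returns an empty list on a healthy host.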
 ## Related
 - [[majordiscord]] — affected host (incident logged 2026-04-29)
 - [[Network Overview]] — Tailscale fleet topology
--- a/05-troubleshooting/networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md
+++ b/05-troubleshooting/networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md
@ -11,7 +11,7 @@ tags:
  - powershell
 status: published
 created: 2026-04-03
-updated: 2026-04-22T09:20
+updated: 2026-04-30T05:21
 ---
 # Windows OpenSSH: WSL as Default Shell Breaks Remote Commands
--- a/05-troubleshooting/networking/windows-sshd-stops-after-reboot.md
+++ b/05-troubleshooting/networking/windows-sshd-stops-after-reboot.md
@ -10,7 +10,7 @@ tags:
  - majorrig
 status: published
 created: 2026-04-02
-updated: 2026-04-22T09:20
+updated: 2026-04-30T05:21
 ---
 # Windows OpenSSH Server (sshd) Stops After Reboot
--- a/05-troubleshooting/yt-dlp-fedora-js-challenge.md
+++ b/05-troubleshooting/yt-dlp-fedora-js-challenge.md
@ -10,7 +10,7 @@ tags:
  - deno
 status: published
 created: 2026-04-02
-updated: 2026-04-22T11:33
+updated: 2026-04-30T05:21
 ---
 # yt-dlp YouTube JS Challenge Fix (Fedora)
--- a/MajorWiki-Deploy-Status.md
+++ b/MajorWiki-Deploy-Status.md
@ -2,7 +2,7 @@
 title: MajorWiki Deployment Status
 status: deployed
 project: MajorTwin
-updated: 2026-04-07T10:48
+updated: 2026-04-30T05:30
 created: 2026-04-02T16:10
 ---
@ -79,6 +79,23 @@ git push
 Gitea receives the push → fires webhook → majorlab pulls → MkDocs rebuilds → `notes.majorshouse.com` updates automatically.
 > [!tip] One-liner wrapper
 > On MajorRig, the `~/bin/wiki-commit "msg"` helper runs `git pull --rebase --autostash` → `git add -A` → `git commit` → `git push` in one shot. This sidesteps non-fast-forward push rejections caused by cowork pushes (e.g. MajorAir pushing in parallel) and the empty-credentials issue with HTTPS.
 ## 🔒 Pre-Commit Hook (in repo)
 `.githooks/pre-commit` (tracked) blocks any commit that adds or renames a `*.md` article without a corresponding entry in `SUMMARY.md`. Bypass with `git commit --no-verify` if you genuinely need to.
 **Per-clone setup** (one-time, per workstation that uses the repo):
 ```bash
 cd <wiki-repo>
 git config core.hooksPath .githooks
 git config pull.rebase true
 ```
 The `core.hooksPath` line is required because Git only runs hooks from `.git/hooks` by default, not from a tracked directory. The `pull.rebase true` setting makes plain `git pull` rebase instead of merge, matching the `wiki-commit` wrapper's behavior.
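 The tracked hook itself isn't reproduced here. As a rough idea of the check it describes, here is a minimal sketch; the function name and the matching rule (the article path must appear literally as a link target in `SUMMARY.md`) are assumptions, and the real `.githooks/pre-commit` may work differently:
 ```bash
 # missing_from_summary SUMMARY_FILE PATH...
 # Prints each *.md path that SUMMARY_FILE does not link to; returns 1 if any.
 missing_from_summary() {
   local summary="$1" path status=0
   shift
   for path in "$@"; do
     case "$path" in
       *.md) ;;             # only articles are checked
       *) continue ;;
     esac
     if ! grep -qF -- "($path)" "$summary"; then
       printf 'not in SUMMARY.md: %s\n' "$path"
       status=1
     fi
   done
   return "$status"
 }
 # In a hook, the added/renamed paths could come from:
 #   git diff --cached --name-only --diff-filter=AR
 ```
 A non-zero return from a pre-commit hook aborts the commit, which is what makes a check like this blocking.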
 ## 📋 Wiki Maintenance Protocol
 Every time a new article is added, the following **MUST** be updated to maintain index integrity:
--- a/README.md
+++ b/README.md
@ -1,6 +1,6 @@
 ---
 created: 2026-04-06T09:52
-updated: 2026-04-29T22:46
+updated: 2026-04-30T05:21
 ---
 # MajorLinux Tech Wiki — Index
--- a/SUMMARY.md
+++ b/SUMMARY.md
@ -1,6 +1,6 @@
 ---
 created: 2026-04-02T16:03
-updated: 2026-04-29T22:45
+updated: 2026-04-30T11:24
 ---
 * [Home](index.md)
 * [Linux & Sysadmin](01-linux/index.md)
@ -43,6 +43,7 @@ updated: 2026-04-29T22:45
    * [Fail2ban Custom Jail: Apache 404 Scanner Detection](02-selfhosting/security/fail2ban-apache-404-scanner-jail.md)
    * [Fail2ban Custom Jail: Apache PHP Webshell Probe Detection](02-selfhosting/security/fail2ban-apache-php-probe-jail.md)
    * [Fail2ban Custom Jail: WordPress Login Brute Force](02-selfhosting/security/fail2ban-wordpress-login-jail.md)
    * [wp-fail2ban Plugin Logpath on Debian/Ubuntu (auth.log not syslog)](02-selfhosting/security/wp-fail2ban-logpath-debian-ubuntu.md)
    * [SELinux: Fixing Fail2ban grep execmem Denial](02-selfhosting/security/selinux-fail2ban-execmem-fix.md)
    * [UFW Firewall Management](02-selfhosting/security/ufw-firewall-management.md)
    * [Firewall Hardening with firewalld on Fedora Fleet](02-selfhosting/security/firewalld-fleet-hardening.md)
@ -77,6 +78,7 @@ updated: 2026-04-29T22:45
    * [ISP SNI Filtering with Caddy](05-troubleshooting/isp-sni-filtering-caddy.md)
    * [Obsidian Vault Recovery — Loading Cache Hang](05-troubleshooting/obsidian-cache-hang-recovery.md)
    * [Qwen2.5-14B OOM on RTX 3080 Ti (12GB)](05-troubleshooting/gpu-display/qwen-14b-oom-3080ti.md)
    * [LoRA adapter — GGUF conversion fails with 'config.json not found'](05-troubleshooting/gpu-display/lora-adapter-gguf-conversion-fails.md)
    * [yt-dlp YouTube JS Challenge Fix on Fedora](05-troubleshooting/yt-dlp-fedora-js-challenge.md)
    * [Gemini CLI Manual Update](05-troubleshooting/gemini-cli-manual-update.md)
    * [MajorWiki Setup & Publishing Pipeline](05-troubleshooting/majwiki-setup-and-pipeline.md)
@ -90,6 +92,7 @@ updated: 2026-04-29T22:45
    * [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md)
    * [Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt](05-troubleshooting/ollama-chat-template-pipe-stdin-bypass.md)
    * [rsync over Tailscale: Hung in TCP Teardown After Transfer Completes](05-troubleshooting/networking/rsync-tailscale-teardown-stall.md)
    * [iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators](05-troubleshooting/networking/tailscale-status-json-hostname-localhost-ios.md)
    * [macOS: Repeating Alert Tone from Mirrored iPhone Notification](05-troubleshooting/macos-mirrored-notification-alert-loop.md)
    * [ClamAV CPU Spike: Safe Scheduling with nice/ionice](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md)
    * [Ansible: Vault Password File Not Found](05-troubleshooting/ansible-vault-password-file-missing.md)
--- a/index.md
+++ b/index.md
@ -1,6 +1,6 @@
 ---
 created: 2026-04-06T09:52
-updated: 2026-04-29T22:45
+updated: 2026-04-30T05:21
 ---
 # MajorLinux Tech Wiki — Index