Documents three lessons from the 2026-05-10 fleet outage where the
Fedora half (majorhome, majorlab) had been silently failing to send
notification mail for days:
- Missing /etc/pki/tls/certs/ca-bundle.crt symlink (extracted bundle
exists at /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem but the
consumer-path symlink was lost during a ca-certificates package
event). Diagnosis includes the cross-tool tell — dnf and curl break
with the same path. Fix is a single ln -sfn.
- Methodology: Fedora and majormail log postfix to journald; Debian and
Ubuntu log to /var/log/mail.log. Querying the wrong source returns
false negatives for healthy hosts.
- Bounce-source addresses (Watchtower NOTIFICATION_EMAIL_FROM,
fail2ban sender, root@<host>.localdomain) must resolve to real
mailboxes — otherwise the first failed delivery generates
bounce-of-bounce churn.
Also promoting the article from untracked to committed; it had been
authored on 2026-05-09 and not yet added to the repo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same-day correction. The proposed per-droplet relaxed alert (>95%/30m)
turned out to also trip on a 1 vCPU box during low-traffic weekly scans,
because there's literally no real load for nice 19 to yield to —
clamscan opportunistically fills the vCPU and DO sees 100% utilization
regardless of `%nice` vs `%user` split. Documents the three realistic
options (accept page / switch to clamdscan / disable alert) and the
underlying limit (no DO threshold can distinguish polite from impolite
CPU when the box is fully utilized).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DO's hypervisor-level CPU metric doesn't know about nice/ionice — a
"polite" weekly clamscan on a 1 vCPU droplet still reads 100% utilization
and trips a default >85%/5m alert. Adds a new section explaining the
trade-off and providing the DO API recipe (PUT existing alert with
explicit entities, POST a new relaxed alert scoped to the small
droplet) plus when not to bother (2+ vCPU boxes won't trip).
Triggered by the 2026-05-10 teelia incident where the weekly cron fired
the fleet-wide CPU alert despite the cron script already wrapping
clamscan in nice 19 + ionice idle + cgroup memory limits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the failure mode where issuing a synchronous `ssh host reboot`
through Claude Desktop's shell MCP poisons the local MCP transport when
the target severs its session before responding cleanly — eventually
force-disconnecting every MCP at once. Covers diagnostic chain, recovery,
fire-and-forget reboot patterns, and worked example from the 2026-05-10
majorhome AMD-card reboot.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walks the four-step diagnostic chain (post created → activity delivered →
follower exists → notification semantics) for the common confusion where
a Castopod admin's auto-broadcast "doesn't show up" on a Mastodon account
they expected. Most cases are not federation bugs but the difference
between favouriting/boosting (no follow required) and following + the
fact that Mastodon notifications fire only for mentions/follows/favs/
boosts/etc., not for new posts from people you follow. Documents the bell
icon and `@`-mention escape hatches.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a remote actor updates their avatar, Mastodon (Paperclip) deletes the
old S3 object and stores only the new filename. Castopod 2.0.0 caches the
URL of every federated actor in cp_fediverse_actors and never refetches,
so its admin templates emit a dead link forever (the resulting S3 403 is
anti-enumeration, hiding what is really a 404). Article documents the
diagnosis pattern and three fixes (manual UPDATE, DELETE-and-refetch,
bulk audit), plus the Mastodon-side query for sourcing the correct URL.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The stock alarm definition counts only 1xx/2xx/304/401/429 as successful,
which causes false CRITICALs on WP sites where 301 canonicalization is
normal traffic (legacy /?p=NNNN, slug edits, host/TLS upgrades, etc.).
Article documents the root cause, verification steps via the access log,
and an in-place threshold retune that keeps the alarm useful as an
"obvious meltdown" floor while delegating real outage detection to the
5xx and 4xx alarms.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the long-standing UX regression caused by
`tootctl media remove --prune-profiles` (and `--remove-headers`)
running on a schedule: cached remote avatars are deleted, but
Mastodon does not auto-refetch on profile view, so quiet remote
accounts stay broken indefinitely.
Article covers:
- The mutually-exclusive flag bug (silent skip if combined)
- Mastodon's actual avatar-refresh trigger model (Update activities,
not profile views)
- A `refresh-my-follows.sh` pattern with a defensible WHERE clause
(avatar NULL AND avatar_remote_url present) to avoid infinite
retry on accounts whose origin has no avatar
- Why header_file_name IS NULL is a bad signal (~20% of users
legitimately have no custom header)
- The cron decision: most admins should drop --prune-profiles
Updated article count (89 → 106), domain counts, per-section
listings, and Recently Updated table. Added all articles published
since 2026-04-18 including Pi-hole, Mastodon, fail2ban digest,
LoRA GGUF, Tailscale iOS, and more.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- fail2ban-digest-mode-fleet: recidive-only email model, sshd now silent,
defaults-debian.conf gotcha added
- netdata-docker-health-alarm-tuning: 30m/10m config, tuning history table
- New: wp-fail2ban-logpath-debian-ubuntu, lora-adapter-gguf-conversion-fails,
tailscale-status-json-hostname-localhost-ios
- Various article updates and nav index refreshes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Articles from prior sessions that were written locally but never shipped:
- 02-selfhosting/cloud/aws-s3-cost-management.md — lifecycle rules, storage class selection, bucket inventory, unexpected-growth investigation
- 02-selfhosting/dns-networking/wake-on-lan-router-ssh.md — WOL magic packets via Asus router SSH + ether-wake, Ansible vault integration
- 02-selfhosting/services/claude-code-remote-control.md — mobile access to a persistent host Claude Code session
Nav updated (index.md + SUMMARY.md):
- Added Cloud subsection under Self-Hosting for aws-s3
- Added wake-on-lan and aws-s3 entries to SUMMARY
- Added claude-code-remote-control to index's Services section
- Added ansible-ssh-host-alias-bypass nav entry (article shipped in 2dbeb22)
- Article count 87 → 89, self-hosting 30 → 32, troubleshooting 33 → 34
Local and remote both have 76 articles on disk, but the counter
and per-domain table were stale (74 total / self-hosting 21 /
troubleshooting 29). Trued up to 76 / 22 / 30.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the 2026-04-09 scanner incident where 301-redirected PHP probes
bypassed the existing apache-404scan jail, leaving the scanner unbanned
and firing Netdata web_log_1m_redirects alerts. New jail catches 301/302/
403/404 PHP responses while excluding legitimate WordPress endpoints.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Captures the majorlab incident where the backup watchdog emailed a missing
heartbeat after a kernel-update reboot wiped /var/run, even though the
backup had actually completed cleanly. Documents the tmpfs root cause and
the fix of storing heartbeats under /var/lib instead.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covers shell quoting for URLs containing &, ?, #, and other characters
that Bash interprets as operators. Common gotcha when downloading from
CDNs with token-based URLs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Corrected inflated article count (was 76, actual is 73).
Updated domain breakdown and frontmatter timestamps from Obsidian.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Article count 73 → 74. Added to SUMMARY.md, index.md, README.md,
and 02-selfhosting/index.md (which was also missing 5 other security
articles from prior sessions).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removed Obsidian [[wikilinks]] pointing to vault-only docs (01-Phases, majorlab)
that don't resolve on the MkDocs site. Kept deploy status as a proper markdown link.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the 2026-03-14 incident where MajorAir's public IP was banned
by the postfix-sasl jail after repeated SASL auth failures, silently
blocking all IMAP connections from Spark Desktop.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>