Operational/how-to references updated to the role entry playbooks after the
ADR-0001 migration. Historical incident narrative (dated callouts, commit
refs) preserved.
- clamav-fleet-deployment: override + re-run -> clamav.yml; role note
- ssh-hardening-ansible-fleet: note this is now the ssh_hardening role
- vps-migration-baseline-checklist: table -> clamav.yml / ssh_hardening.yml
- ssh-socket-tailscale-race-condition: Affected Hosts + Prevention + References
-> tailscale role tasks (network_wait/ssh_only_ubuntu/ssh_only_fedora)
- freshclam-logwatch-false-no-updates: codify refs -> clamav role
dcaprod-hetzner + tttpod-hetzner were missing tailscale-wait-ready.service
(inert ssh.service gate -> latent bind race); corrected playbook applied to
both. teelia uses Tailscale SSH (no sshd, immune). All Ubuntu hosts now on
the dependency-free-socket + ssh.service-gate pattern.
- Fedora hosts are NOT automatically immune: a leftover manual
`ListenAddress <tailscale-ip>` drop-in reintroduces the sshd boot bind-race
even under firewalld (hit on majordiscord 2026-06-07; fix = remove it).
- The Ubuntu playbook kept shipping the cycle-causing [Unit] gate on
ssh.socket despite the 2026-06-04 resolution; re-running it re-armed the
ordering cycle (clobbered majorlinux; majortoot-hetzner found armed).
Corrected in MajorAnsible e0d35aa. Fleet ssh-lockdown state is inconsistent
(dcaprod/tttpod lack wait-ready; teelia no override) — needs a per-host audit.
Document the majormail 2026-06-07 incident: when userdb home == maildir
root, the LDA/Sieve duplicate database (.dovecot.lda-dupes + .locks) lands
inside the mail store and the maildir lister exposes it as phantom
mailboxes ("dovecot.lda-dupes"), logging stat(.../tmp) "Not a directory".
Fix: point home at a non-dotted subdir. Wired into the troubleshooting
index and SUMMARY.
Document the majormail spam-routing failure (2026-06-06): a cleanup
header_checks REDIRECT keyed on the milter-added X-Spam-Flag never fired for
real inbound mail (only locally-injected), so spam kept reaching the inbox.
Fix is to route in Sieve at delivery (after the milter), with a redirect +
loop guard. Includes the 'local-injection tests lie' warning.
Document the daily /etc/cron.daily/clamav-freshness watchdog as the real
detector for stale signatures, and the key gotcha that 'mail' is absent on
most fleet hosts so alert scripts must use /usr/sbin/sendmail -t.
logwatch's clam-update counts only 'process started' lines (emitted only at
daemon restart), so daemon-mode freshclam false-alarms on quiet days despite
signatures updating. Fix: $ignore_no_updates=1 drop-in. Includes the
real-vs-false check (a daemonless box with freshclam disabled is a TRUE alert).
New page documenting the majormail (2026-06-05) issue: /etc/localtime
shipped labeled etc_t instead of locale_t on the Hetzner image, so SELinux
denied systemd-timedated and timedatectl/community.general.timezone reported
success while the symlink stayed at UTC. Fix: restorecon before setting TZ.
Indexed in index.md (SELinux) + SUMMARY.md.
- New page: Dovecot IMAP vsz_limit OOM from a bloated/corrupt index.log
(152M index on an empty folder killed IMAP children with error 83).
- fail2ban IMAP self-ban: add permanent ignoreip-whitelist fix + dynamic-IP caveat.
- firewalld mail ports: add 'submission/587 never added' variant + correct
Fedora service name; note Ansible now manages the full mail-service set.
- Index + SUMMARY updated with the new page.
Three updates to the inbound spam filtering guide, all driven by the 2026-06-04
majormail-hetzner Phase 6 cutover and follow-up tuning:
1. Section 6 (Dovecot Sieve): warn explicitly that `plugin/sieve_before` was
dropped in Pigeonhole 2.4 and silently does nothing — no startup warning,
spam just keeps landing in INBOX. The 2.4 replacement is a top-level
`sieve_script <name> { type = before; path = …; }` block. Also note the
Fedora-flat-dovecot.conf pitfall (some packagings ship dovecot.conf
without `!include conf.d/*.conf`, so the block has to live in the main
file directly). Added a `sievec` compile step.
2. New §6b: route spam to a separate `junk@` mailbox via Postfix cleanup
`header_checks` REDIRECT. This makes spam invisible to the user's
mailbox entirely — Spark/IDLE-based clients don't push-notify because
the message never reaches the subscribed mailbox at all. Includes the
`regexp:` vs `pcre:` map-type tip (use regexp on stock Fedora to avoid
the postfix-pcre package dependency).
3. New §7a: weekly systemd timer for sa-learn. The §7 warning about
"don't run sa-learn from cron unless folders are clean" is correct as
the safe default — but when you adopt the §6b REDIRECT-to-junk@
pattern, the junk@ mailbox is pure spam by design and a weekly
`--spam`/`--ham`/`--sync`/`--force-expire` chain becomes safe and
useful. Full unit templates included.
Gotchas table gains four entries:
- Pigeonhole 2.4 silent breakage of plugin/sieve_before
- postfix-pcre vs regexp map type confusion
- Why sieve fileinto Junk still pushes a Spark notification
- Why local `sendmail` injection doesn't trigger the REDIRECT (smtpd
milters skip sendmail-injected mail, so X-Spam-Flag isn't added)
All changes match what's now codified in the `majormail` Ansible role
(commit 7a8b9eb in MajorAnsible).
New 02-selfhosting/services article: the full Postfix/Dovecot inbound spam stack
on Fedora — spamass-milter tag-only wiring (the -r footgun), socket permissions
(sa-milt group + UMask), site-wide Bayes DB, Sieve-to-Junk, and sa-learn training
(folders, spam/ham balance, manual-not-cron). From the majormail setup.
Also extends selinux-dovecot-vmail-context with a Permissive-mode variant + a
postfix_cleanup->mysqld_etc companion-denial note. SUMMARY.md nav updated.
The weekly media-prune cron (and monthly accounts refresh --all) were
removed 2026-06-01 after repeatedly breaking avatars. Update the
majortoot sections: the 648->7GB shrink was a one-time safe attachment
cleanup; automation is now disabled; prune attachments manually if ever
needed, never profiles. Cross-link the two new troubleshooting articles.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
New article mastodon-s3-acl-upload-failures.md: a BucketOwnerEnforced S3
bucket plus a stale S3_PERMISSION/S3_ACL in .env.production makes every
Mastodon upload fail with AccessControlListNotSupported, silently. Covers
symptoms (incl. why a missing object returns 403 not 404), diagnosis,
the fix (S3_PERMISSION= empty, public read via bucket policy), recovery,
a synthetic-write health check, and Ansible enforcement.
Extend mastodon-prune-profiles-trap.md: add a "Bulk restore at scale"
procedure (list existing keys, null missing DB refs, enqueue
RedownloadAvatar/HeaderWorker), a "storage-level deletion without DB
de-ref" section, and a stronger recommendation to disable automated
profile pruning (and scheduled accounts refresh --all) entirely.
Link both from SUMMARY.md and the selfhosting index.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two related additions covering the 2026-05-31 cutover-night incidents on
majorlinux and majortoot-hetzner.
ssh-socket-tailscale-race-condition.md (update Race 1 fix):
- After=tailscaled.service Requires=tailscaled.service orders against the
service becoming active, not against tailscale0 having an IPv4 — hosts
kept losing SSH intermittently after reboots (incident: majorlinux +
majortoot-hetzner 2026-05-31, during cutover-night Ansible reboot).
- Canonical fix: a oneshot tailscale-wait-ready.service that polls
`ip -4 -o addr show tailscale0` until an address is present, with
ssh.socket After=/Requires= that service. Document the full evolution
(2026-05-19 BindsTo → 2026-05-23 Requires → 2026-05-31 wait-ready) so
future readers don't try the half-fixes thinking they're sufficient.
- Add majortoot-hetzner to affected hosts.
mastodon-post-install-hardening.md (new):
Four upstream-install gaps that bit during the majortoot-hetzner cutover:
1. /home/mastodon at 0750 (useradd default) → nginx www-data can't
traverse → every static asset 403s → unstyled "purple screen" in the
browser while API/HTML still work through the puma proxy.
2. .env.production at 0644 (mastodon-setup default) → DB_PASS,
SECRET_KEY_BASE, OTP_SECRET world-readable once gap (1) is fixed.
3. mastodon user shell at /usr/sbin/nologin → `su - mastodon` blocked.
4. rbenv init in .bashrc only → login shells don't source .bashrc; even
when chained, Ubuntu's .bashrc returns early for non-interactive
shells. Fix: .bash_profile sets up rbenv BEFORE sourcing .profile +
.bashrc, so it works for both interactive and non-interactive logins.
All four codified in MajorAnsible configure_mastodon_permissions.yml
with self-asserting verification steps.
02-selfhosting/index.md + SUMMARY.md:
Add a "Services" section to the selfhosting index linking the
mastodon-post-install-hardening article (and the other orphaned
services/ entries while there). SUMMARY.md gains one new entry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BindsTo=tailscaled.service causes a systemd ordering cycle that
prevents ssh.socket from starting on reboot. Updated the recommended
fix to use Requires= and added a warning admonition explaining why
BindsTo must not be used. Added tttpod-hetzner to affected hosts
list and linked the 2026-05-23 dcaprod incident.
Added Race 2: tailscaled starts before network-online.target, causing
Tailscale to get stuck with SetNetworkUp(false). Covers both Ubuntu
ssh.socket and cross-platform tailscaled ordering issues. Updated
references to include majordiscord incident and new Ansible playbook.
Documents the systemd socket activation race where ssh.socket binds
to the Tailscale IP before tailscaled is ready, causing SSH to become
unreachable after a Tailscale reconnect. Includes diagnosis steps and
the After=/BindsTo= fix.
- netdata-apps-fds-group-false-positive: the apps_group_file_descriptors_utilization
false 100% on forking/root app groups (tailscaled on MajorToot 2026-05-15),
the not-a-privilege gotcha, fleet-wide silence fix in MajorAnsible.
- obs-stale-script-paths: pending from prior session (not on remote).
- SUMMARY.md: link both (re-applied onto upstream after concurrent rebase).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fedora 44 Hetzner images ship without rsyslog — logwatch produces
zero output because /var/log/messages doesn't exist. Added rsyslog
to baseline table and new diagnostic section to logwatch article.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nice -n 19 only yields when other processes compete; on single-core
VPS boxes the scan still saturates CPU. Document the expectation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New article documenting missing /etc/pki/tls/certs/ca-bundle.crt symlink
on Hetzner Fedora images breaking Postfix TLS, curl, and dnf. Updated
VPS migration baseline checklist with timezone, CA bundle, and crond
verification steps. Updated logwatch fleet setup with crond check.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents three more patterns surfaced in the 2026-05-10 fleet-mail
investigation, all hitting hosts derived from cloud images or
cross-provider migrations:
- Packer/snapshot-leftover myhostname (postfix EHLO + message-id
identifies the build artifact, not the production hostname; remote
spam scorers hate it)
- Empty relayhost silently routes mail via the public MX instead of
the Tailscale-internal path, exposing it to spamchk that internal
traffic bypasses
- Stale SASL passwd map referencing a missing file from a previous
external-SMTP relay setup, deferring every send with "local data
error"
Each looks benign in isolation. Together they made dcaprod's Logwatch
disappear into spamchk for weeks while showing 250 OK on the source.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generalizes the Castopod/UuidModel incident from 2026-05-10. PHP 8.4
deprecated implicit-nullable parameters (`function f(int $x = null)`).
Old vendor libraries spam E_DEPRECATED warnings; CodeIgniter wraps each
in a 23-frame stack trace; per-minute spark cron amplifies into 53-80
MB/day log bleed and 22% sustained CPU floor on small VPS boxes.
Documents the four-line sed fix AND the substring-match gotcha that
extended the fix from 30 seconds to 30 minutes — bare `int \$limit = null`
patterns substring-match `?int \$limit = null` elsewhere in the file
and produce illegal `??type` syntax. Covers anchored sed patterns,
reference-parameter handling (&\$db), the lint-after-every-edit rule,
and a bonus section on hunting stray developer debug prints
(`log_message('critical', 'ITS HEEEEEEEEEEEERE')`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents three lessons from the 2026-05-10 fleet outage where the
Fedora half (majorhome, majorlab) had been silently failing to send
notification mail for days:
- Missing /etc/pki/tls/certs/ca-bundle.crt symlink (extracted bundle
exists at /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem but the
consumer-path symlink was lost during a ca-certificates package
event). Diagnosis includes the cross-tool tell — dnf and curl break
with the same path. Fix is a single ln -sfn.
- Methodology: Fedora and majormail log postfix to journald; Debian and
Ubuntu log to /var/log/mail.log. Querying the wrong source returns
false negatives for healthy hosts.
- Bounce-source addresses (Watchtower NOTIFICATION_EMAIL_FROM,
fail2ban sender, root@<host>.localdomain) must resolve to real
mailboxes — otherwise the first failed delivery generates
bounce-of-bounce churn.
Also promoting the article from untracked to committed; it had been
authored on 2026-05-09 and not yet added to the repo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same-day correction. The proposed per-droplet relaxed alert (>95%/30m)
turned out to also trip on a 1 vCPU box during low-traffic weekly scans,
because there's literally no real load for nice 19 to yield to —
clamscan opportunistically fills the vCPU and DO sees 100% utilization
regardless of `%nice` vs `%user` split. Documents the three realistic
options (accept page / switch to clamdscan / disable alert) and the
underlying limit (no DO threshold can distinguish polite from impolite
CPU when the box is fully utilized).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DO's hypervisor-level CPU metric doesn't know about nice/ionice — a
"polite" weekly clamscan on a 1 vCPU droplet still reads 100% utilization
and trips a default >85%/5m alert. Adds a new section explaining the
trade-off and providing the DO API recipe (PUT existing alert with
explicit entities, POST a new relaxed alert scoped to the small
droplet) plus when not to bother (2+ vCPU boxes won't trip).
Triggered by the 2026-05-10 teelia incident where the weekly cron fired
the fleet-wide CPU alert despite the cron script already wrapping
clamscan in nice 19 + ionice idle + cgroup memory limits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the failure mode where issuing a synchronous `ssh host reboot`
through Claude Desktop's shell MCP poisons the local MCP transport when
the target severs its session before responding cleanly — eventually
force-disconnecting every MCP at once. Covers diagnostic chain, recovery,
fire-and-forget reboot patterns, and worked example from the 2026-05-10
majorhome AMD-card reboot.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walks the four-step diagnostic chain (post created → activity delivered →
follower exists → notification semantics) for the common confusion where
a Castopod admin's auto-broadcast "doesn't show up" on a Mastodon account
they expected. Most cases are not federation bugs but the difference
between favouriting/boosting (no follow required) and following + the
fact that Mastodon notifications fire only for mentions/follows/favs/
boosts/etc., not for new posts from people you follow. Documents the bell
icon and `@`-mention escape hatches.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a remote actor updates their avatar, Mastodon (Paperclip) deletes the
old S3 object and stores only the new filename. Castopod 2.0.0 caches the
URL of every federated actor in cp_fediverse_actors and never refetches,
so its admin templates emit a dead link forever (the resulting S3 403 is
anti-enumeration, hiding what is really a 404). Article documents the
diagnosis pattern and three fixes (manual UPDATE, DELETE-and-refetch,
bulk audit), plus the Mastodon-side query for sourcing the correct URL.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The stock alarm definition counts only 1xx/2xx/304/401/429 as successful,
which causes false CRITICALs on WP sites where 301 canonicalization is
normal traffic (legacy /?p=NNNN, slug edits, host/TLS upgrades, etc.).
Article documents the root cause, verification steps via the access log,
and an in-place threshold retune that keeps the alarm useful as an
"obvious meltdown" floor while delegating real outage detection to the
5xx and 4xx alarms.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the long-standing UX regression caused by
`tootctl media remove --prune-profiles` (and `--remove-headers`)
running on a schedule: cached remote avatars are deleted, but
Mastodon does not auto-refetch on profile view, so quiet remote
accounts stay broken indefinitely.
Article covers:
- The mutually-exclusive flag bug (silent skip if combined)
- Mastodon's actual avatar-refresh trigger model (Update activities,
not profile views)
- A `refresh-my-follows.sh` pattern with a defensible WHERE clause
(avatar NULL AND avatar_remote_url present) to avoid infinite
retry on accounts whose origin has no avatar
- Why header_file_name IS NULL is a bad signal (~20% of users
legitimately have no custom header)
- The cron decision: most admins should drop --prune-profiles
The pre-commit hook (which enforces SUMMARY.md links for new articles)
was tracked at mode 100644, so even with `core.hooksPath=.githooks`
configured, git silently skipped it. Bump tracked mode to 100755 so
fresh clones get the working hook without a manual chmod step.
Discovered while installing the wiki-commit/hooks setup on MajorMac.
No content change; .githooks/ is outside the MkDocs source so this
will not alter the rendered notes.majorshouse.com site.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the gotcha hit during the 2026-05-06 update.yml refactor:
the second-positional-argument back-reference form of regex_search
('\1') doesn't reliably select capture groups when used inside
set_fact. The fix is to match the broader substring and use
.split()[0] (or [-1], etc.) to peel off the value, with a default()
bridge for the no-match case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>