New article mastodon-s3-acl-upload-failures.md: a BucketOwnerEnforced S3
bucket plus a stale S3_PERMISSION/S3_ACL in .env.production makes every
Mastodon upload fail with AccessControlListNotSupported, silently. Covers
symptoms (incl. why a missing object returns 403 not 404), diagnosis,
the fix (S3_PERMISSION= empty, public read via bucket policy), recovery,
a synthetic-write health check, and Ansible enforcement.
Extend mastodon-prune-profiles-trap.md: add a "Bulk restore at scale"
procedure (list existing keys, null missing DB refs, enqueue
RedownloadAvatar/HeaderWorker), a "storage-level deletion without DB
de-ref" section, and a stronger recommendation to disable automated
profile pruning (and scheduled accounts refresh --all) entirely.
Link both from SUMMARY.md and the selfhosting index.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two related additions covering the 2026-05-31 cutover-night incidents on
majorlinux and majortoot-hetzner.
ssh-socket-tailscale-race-condition.md (update Race 1 fix):
- After=tailscaled.service Requires=tailscaled.service orders against the
service becoming active, not against tailscale0 having an IPv4 — hosts
kept losing SSH intermittently after reboots (incident: majorlinux +
majortoot-hetzner 2026-05-31, during cutover-night Ansible reboot).
- Canonical fix: a oneshot tailscale-wait-ready.service that polls
`ip -4 -o addr show tailscale0` until an address is present, with
ssh.socket After=/Requires= that service. Document the full evolution
(2026-05-19 BindsTo → 2026-05-23 Requires → 2026-05-31 wait-ready) so
future readers don't try the half-fixes thinking they're sufficient.
- Add majortoot-hetzner to affected hosts.
mastodon-post-install-hardening.md (new):
Four upstream-install gaps that bit during the majortoot-hetzner cutover:
1. /home/mastodon at 0750 (useradd default) → nginx www-data can't
traverse → every static asset 403s → unstyled "purple screen" in the
browser while API/HTML still work through the puma proxy.
2. .env.production at 0644 (mastodon-setup default) → DB_PASS,
SECRET_KEY_BASE, OTP_SECRET world-readable once gap (1) is fixed.
3. mastodon user shell at /usr/sbin/nologin → `su - mastodon` blocked.
4. rbenv init in .bashrc only → login shells don't source .bashrc; even
when chained, Ubuntu's .bashrc returns early for non-interactive
shells. Fix: .bash_profile sets up rbenv BEFORE sourcing .profile +
.bashrc, so it works for both interactive and non-interactive logins.
All four codified in MajorAnsible configure_mastodon_permissions.yml
with self-asserting verification steps.
02-selfhosting/index.md + SUMMARY.md:
Add a "Services" section to the selfhosting index linking the
mastodon-post-install-hardening article (and the other orphaned
services/ entries while there). SUMMARY.md gains one new entry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BindsTo=tailscaled.service causes a systemd ordering cycle that
prevents ssh.socket from starting on reboot. Updated the recommended
fix to use Requires= and added a warning admonition explaining why
BindsTo must not be used. Added tttpod-hetzner to affected hosts
list and linked the 2026-05-23 dcaprod incident.
Added Race 2: tailscaled starts before network-online.target, causing
Tailscale to get stuck with SetNetworkUp(false). Covers both Ubuntu
ssh.socket and cross-platform tailscaled ordering issues. Updated
references to include majordiscord incident and new Ansible playbook.
Documents the systemd socket activation race where ssh.socket binds
to the Tailscale IP before tailscaled is ready, causing SSH to become
unreachable after a Tailscale reconnect. Includes diagnosis steps and
the After=/BindsTo= fix.
- netdata-apps-fds-group-false-positive: the apps_group_file_descriptors_utilization
false 100% on forking/root app groups (tailscaled on MajorToot 2026-05-15),
the not-a-privilege gotcha, fleet-wide silence fix in MajorAnsible.
- obs-stale-script-paths: pending from prior session (not on remote).
- SUMMARY.md: link both (re-applied onto upstream after concurrent rebase).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fedora 44 Hetzner images ship without rsyslog — logwatch produces
zero output because /var/log/messages doesn't exist. Added rsyslog
to baseline table and new diagnostic section to logwatch article.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nice -n 19 only yields when other processes compete; on single-core
VPS boxes the scan still saturates CPU. Document the expectation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New article documenting missing /etc/pki/tls/certs/ca-bundle.crt symlink
on Hetzner Fedora images breaking Postfix TLS, curl, and dnf. Updated
VPS migration baseline checklist with timezone, CA bundle, and crond
verification steps. Updated logwatch fleet setup with crond check.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents three more patterns surfaced in the 2026-05-10 fleet-mail
investigation, all hitting hosts derived from cloud images or
cross-provider migrations:
- Packer/snapshot-leftover myhostname (postfix EHLO + message-id
identifies the build artifact, not the production hostname; remote
spam scorers hate it)
- Empty relayhost silently routes mail via the public MX instead of
the Tailscale-internal path, exposing it to spamchk that internal
traffic bypasses
- Stale SASL passwd map referencing a missing file from a previous
external-SMTP relay setup, deferring every send with "local data
error"
Each looks benign in isolation. Together they made dcaprod's Logwatch
disappear into spamchk for weeks while showing 250 OK on the source.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generalizes the Castopod/UuidModel incident from 2026-05-10. PHP 8.4
deprecated implicit-nullable parameters (`function f(int $x = null)`).
Old vendor libraries spam E_DEPRECATED warnings; CodeIgniter wraps each
in a 23-frame stack trace; per-minute spark cron amplifies into 53-80
MB/day log bleed and 22% sustained CPU floor on small VPS boxes.
Documents the four-line sed fix AND the substring-match gotcha that
extended the fix from 30 seconds to 30 minutes — bare `int \$limit = null`
patterns substring-match `?int \$limit = null` elsewhere in the file
and produce illegal `??type` syntax. Covers anchored sed patterns,
reference-parameter handling (&\$db), the lint-after-every-edit rule,
and a bonus section on hunting stray developer debug prints
(`log_message('critical', 'ITS HEEEEEEEEEEEERE')`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents three lessons from the 2026-05-10 fleet outage where the
Fedora half (majorhome, majorlab) had been silently failing to send
notification mail for days:
- Missing /etc/pki/tls/certs/ca-bundle.crt symlink (extracted bundle
exists at /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem but the
consumer-path symlink was lost during a ca-certificates package
event). Diagnosis includes the cross-tool tell — dnf and curl break
with the same path. Fix is a single ln -sfn.
- Methodology: Fedora and majormail log postfix to journald; Debian and
Ubuntu log to /var/log/mail.log. Querying the wrong source returns
false negatives for healthy hosts.
- Bounce-source addresses (Watchtower NOTIFICATION_EMAIL_FROM,
fail2ban sender, root@<host>.localdomain) must resolve to real
mailboxes — otherwise the first failed delivery generates
bounce-of-bounce churn.
Also promoting the article from untracked to committed; it had been
authored on 2026-05-09 and not yet added to the repo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same-day correction. The proposed per-droplet relaxed alert (>95%/30m)
turned out to also trip on a 1 vCPU box during low-traffic weekly scans,
because there's literally no real load for nice 19 to yield to —
clamscan opportunistically fills the vCPU and DO sees 100% utilization
regardless of `%nice` vs `%user` split. Documents the three realistic
options (accept page / switch to clamdscan / disable alert) and the
underlying limit (no DO threshold can distinguish polite from impolite
CPU when the box is fully utilized).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DO's hypervisor-level CPU metric doesn't know about nice/ionice — a
"polite" weekly clamscan on a 1 vCPU droplet still reads 100% utilization
and trips a default >85%/5m alert. Adds a new section explaining the
trade-off and providing the DO API recipe (PUT existing alert with
explicit entities, POST a new relaxed alert scoped to the small
droplet) plus when not to bother (2+ vCPU boxes won't trip).
Triggered by the 2026-05-10 teelia incident where the weekly cron fired
the fleet-wide CPU alert despite the cron script already wrapping
clamscan in nice 19 + ionice idle + cgroup memory limits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the failure mode where issuing a synchronous `ssh host reboot`
through Claude Desktop's shell MCP poisons the local MCP transport when
the target severs its session before responding cleanly — eventually
force-disconnecting every MCP at once. Covers diagnostic chain, recovery,
fire-and-forget reboot patterns, and worked example from the 2026-05-10
majorhome AMD-card reboot.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walks the four-step diagnostic chain (post created → activity delivered →
follower exists → notification semantics) for the common confusion where
a Castopod admin's auto-broadcast "doesn't show up" on a Mastodon account
they expected. Most cases are not federation bugs but the difference
between favouriting/boosting (no follow required) and following + the
fact that Mastodon notifications fire only for mentions/follows/favs/
boosts/etc., not for new posts from people you follow. Documents the bell
icon and `@`-mention escape hatches.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a remote actor updates their avatar, Mastodon (Paperclip) deletes the
old S3 object and stores only the new filename. Castopod 2.0.0 caches the
URL of every federated actor in cp_fediverse_actors and never refetches,
so its admin templates emit a dead link forever (the resulting S3 403 is
anti-enumeration, hiding what is really a 404). Article documents the
diagnosis pattern and three fixes (manual UPDATE, DELETE-and-refetch,
bulk audit), plus the Mastodon-side query for sourcing the correct URL.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The stock alarm definition counts only 1xx/2xx/304/401/429 as successful,
which causes false CRITICALs on WP sites where 301 canonicalization is
normal traffic (legacy /?p=NNNN, slug edits, host/TLS upgrades, etc.).
Article documents the root cause, verification steps via the access log,
and an in-place threshold retune that keeps the alarm useful as an
"obvious meltdown" floor while delegating real outage detection to the
5xx and 4xx alarms.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the long-standing UX regression caused by
`tootctl media remove --prune-profiles` (and `--remove-headers`)
running on a schedule: cached remote avatars are deleted, but
Mastodon does not auto-refetch on profile view, so quiet remote
accounts stay broken indefinitely.
Article covers:
- The mutually-exclusive flag bug (silent skip if combined)
- Mastodon's actual avatar-refresh trigger model (Update activities,
not profile views)
- A `refresh-my-follows.sh` pattern with a defensible WHERE clause
(avatar NULL AND avatar_remote_url present) to avoid infinite
retry on accounts whose origin has no avatar
- Why header_file_name IS NULL is a bad signal (~20% of users
legitimately have no custom header)
- The cron decision: most admins should drop --prune-profiles
The pre-commit hook (which enforces SUMMARY.md links for new articles)
was tracked at mode 100644, so even with `core.hooksPath=.githooks`
configured, git silently skipped it. Bump tracked mode to 100755 so
fresh clones get the working hook without a manual chmod step.
Discovered while installing the wiki-commit/hooks setup on MajorMac.
No content change; .githooks/ is outside the MkDocs source so this
will not alter the rendered notes.majorshouse.com site.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the gotcha hit during the 2026-05-06 update.yml refactor:
the second-positional-argument back-reference form of regex_search
('\1') doesn't reliably select capture groups when used inside
set_fact. The fix is to match the broader substring and use
.split()[0] (or [-1], etc.) to peel off the value, with a default()
bridge for the no-match case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updated article count (89 → 106), domain counts, per-section
listings, and Recently Updated table. Added all articles published
since 2026-04-18 including Pi-hole, Mastodon, fail2ban digest,
LoRA GGUF, Tailscale iOS, and more.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- fail2ban-digest-mode-fleet: recidive-only email model, sshd now silent,
defaults-debian.conf gotcha added
- netdata-docker-health-alarm-tuning: 30m/10m config, tuning history table
- New: wp-fail2ban-logpath-debian-ubuntu, lora-adapter-gguf-conversion-fails,
tailscale-status-json-hostname-localhost-ios
- Various article updates and nav index refreshes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Re-diagnoses today's notes.majorshouse.com outage. Original framing
was "ISP filter expanded to include 'notes'" — but the actual root
cause was a stale A record pointing at 136.54.3.248 (not majorlab's
current home IP). Corrects the comparison table to show CNAMEs to
apex resolve to 136.56.0.55, and recommends a Cloudflare-proxied
CNAME as the durable shape so the apex follows home IP automatically
and ISP-level SNI weirdness is bypassed at the same time.
Includes the working CF API payload used to flip the record, and an
audit checklist for any new *.majorshouse.com subdomain.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the gotcha where convert_hf_to_gguf.py crashes with
'config.json not found' because the training output directory holds
only the LoRA adapter, not a merged HF model. Includes inline
save_pretrained_merged() fix snippet, verification checklist, and
resume-pipeline-without-retraining pattern.
Discovered today during the MajorTwin v8c pipeline failure (Step 4).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the non-obvious failure mode where /etc/hosts generator scripts
using `tailscale status --json | jq '.HostName'` get poisoned by iOS
peers, which always report HostName as the literal string "localhost"
because iOS doesn't expose the device name to apps.
Includes the buggy and fixed jq filter (use .DNSName first label
instead), a real-world Postfix outage example, and a verification
checklist. Linked from troubleshooting index and SUMMARY.
Discovered while diagnosing a 24h Postfix outage on majordiscord.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a 'One-liner wrapper' tip and a 'Pre-Commit Hook (in repo)' section
to MajorWiki-Deploy-Status.md describing the per-clone setup needed on
each workstation:
git config core.hooksPath .githooks
git config pull.rebase true
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hook fails any commit that adds (or renames) a .md article without a
matching SUMMARY.md entry, addressing the recurring 'article exists but
isn't navigable' drift. Excludes meta files (README/index/SUMMARY,
category index.md, MajorWiki-Deploy-Status). Bypass with --no-verify.
Hook lives in .githooks/ (tracked). Each clone needs:
git config core.hooksPath .githooks
Companion wrapper ~/bin/wiki-commit (workstation-only, not in repo) does
pull --rebase --autostash + add -A + commit + push so cowork pushes
don't surprise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the gotcha discovered during the 2026-04-30 DCAProd XML-RPC
outage triage: wp-fail2ban plugin emits via PHP syslog(LOG_AUTH) which
lands in /var/log/auth.log on Debian/Ubuntu, not /var/log/syslog.
wordpress-{hard,soft,extra} jails configured with logpath=/var/log/syslog
(common in tutorials and ansible roles) silently catch zero events.
Article includes diagnostic steps, the fix, fail2ban-regex verification,
distro cheat sheet (Debian/Ubuntu vs RHEL/Fedora vs systemd-journal-only),
and a note on why wordpress-login is unaffected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>