Documents three more patterns surfaced in the 2026-05-10 fleet-mail
investigation, all hitting hosts derived from cloud images or
cross-provider migrations:
- Packer/snapshot-leftover myhostname (postfix EHLO + message-id
identifies the build artifact, not the production hostname; remote
spam scorers hate it)
- Empty relayhost silently routes mail via the public MX instead of
the Tailscale-internal path, exposing it to spamchk that internal
traffic bypasses
- Stale SASL passwd map referencing a missing file from a previous
external-SMTP relay setup, deferring every send with "local data
error"
Each looks benign in isolation. Together they made dcaprod's Logwatch
disappear into spamchk for weeks while showing 250 OK on the source.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generalizes the Castopod/UuidModel incident from 2026-05-10. PHP 8.4
deprecated implicit-nullable parameters (`function f(int $x = null)`).
Old vendor libraries spam E_DEPRECATED warnings; CodeIgniter wraps each
in a 23-frame stack trace; per-minute spark cron amplifies into 53-80
MB/day log bleed and 22% sustained CPU floor on small VPS boxes.
Documents the four-line sed fix AND the substring-match gotcha that
extended the fix from 30 seconds to 30 minutes — bare `int \$limit = null`
patterns substring-match `?int \$limit = null` elsewhere in the file
and produce illegal `??type` syntax. Covers anchored sed patterns,
reference-parameter handling (&\$db), the lint-after-every-edit rule,
and a bonus section on hunting stray developer debug prints
(`log_message('critical', 'ITS HEEEEEEEEEEEERE')`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents three lessons from the 2026-05-10 fleet outage where the
Fedora half (majorhome, majorlab) had been silently failing to send
notification mail for days:
- Missing /etc/pki/tls/certs/ca-bundle.crt symlink (extracted bundle
exists at /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem but the
consumer-path symlink was lost during a ca-certificates package
event). Diagnosis includes the cross-tool tell — dnf and curl break
with the same path. Fix is a single ln -sfn.
- Methodology: Fedora and majormail log postfix to journald; Debian and
Ubuntu log to /var/log/mail.log. Querying the wrong source returns
false negatives for healthy hosts.
- Bounce-source addresses (Watchtower NOTIFICATION_EMAIL_FROM,
fail2ban sender, root@<host>.localdomain) must resolve to real
mailboxes — otherwise the first failed delivery generates
bounce-of-bounce churn.
Also promoting the article from untracked to committed; it had been
authored on 2026-05-09 and not yet added to the repo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same-day correction. The proposed per-droplet relaxed alert (>95%/30m)
turned out to also trip on a 1 vCPU box during low-traffic weekly scans,
because there's literally no real load for nice 19 to yield to —
clamscan opportunistically fills the vCPU and DO sees 100% utilization
regardless of `%nice` vs `%user` split. Documents the three realistic
options (accept page / switch to clamdscan / disable alert) and the
underlying limit (no DO threshold can distinguish polite from impolite
CPU when the box is fully utilized).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DO's hypervisor-level CPU metric doesn't know about nice/ionice — a
"polite" weekly clamscan on a 1 vCPU droplet still reads 100% utilization
and trips a default >85%/5m alert. Adds a new section explaining the
trade-off and providing the DO API recipe (PUT existing alert with
explicit entities, POST a new relaxed alert scoped to the small
droplet) plus when not to bother (2+ vCPU boxes won't trip).
Triggered by the 2026-05-10 teelia incident where the weekly cron fired
the fleet-wide CPU alert despite the cron script already wrapping
clamscan in nice 19 + ionice idle + cgroup memory limits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the failure mode where issuing a synchronous `ssh host reboot`
through Claude Desktop's shell MCP poisons the local MCP transport when
the target severs its session before responding cleanly — eventually
force-disconnecting every MCP at once. Covers diagnostic chain, recovery,
fire-and-forget reboot patterns, and worked example from the 2026-05-10
majorhome AMD-card reboot.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walks the four-step diagnostic chain (post created → activity delivered →
follower exists → notification semantics) for the common confusion where
a Castopod admin's auto-broadcast "doesn't show up" on a Mastodon account
they expected. Most cases are not federation bugs but the difference
between favouriting/boosting (no follow required) and following + the
fact that Mastodon notifications fire only for mentions/follows/favs/
boosts/etc., not for new posts from people you follow. Documents the bell
icon and `@`-mention escape hatches.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a remote actor updates their avatar, Mastodon (Paperclip) deletes the
old S3 object and stores only the new filename. Castopod 2.0.0 caches the
URL of every federated actor in cp_fediverse_actors and never refetches,
so its admin templates emit a dead link forever (the resulting S3 403 is
anti-enumeration, hiding what is really a 404). Article documents the
diagnosis pattern and three fixes (manual UPDATE, DELETE-and-refetch,
bulk audit), plus the Mastodon-side query for sourcing the correct URL.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The stock alarm definition counts only 1xx/2xx/304/401/429 as successful,
which causes false CRITICALs on WP sites where 301 canonicalization is
normal traffic (legacy /?p=NNNN, slug edits, host/TLS upgrades, etc.).
Article documents the root cause, verification steps via the access log,
and an in-place threshold retune that keeps the alarm useful as an
"obvious meltdown" floor while delegating real outage detection to the
5xx and 4xx alarms.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the long-standing UX regression caused by
`tootctl media remove --prune-profiles` (and `--remove-headers`)
running on a schedule: cached remote avatars are deleted, but
Mastodon does not auto-refetch on profile view, so quiet remote
accounts stay broken indefinitely.
Article covers:
- The mutually-exclusive flag bug (silent skip if combined)
- Mastodon's actual avatar-refresh trigger model (Update activities,
not profile views)
- A `refresh-my-follows.sh` pattern with a defensible WHERE clause
(avatar NULL AND avatar_remote_url present) to avoid infinite
retry on accounts whose origin has no avatar
- Why header_file_name IS NULL is a bad signal (~20% of users
legitimately have no custom header)
- The cron decision: most admins should drop --prune-profiles
The pre-commit hook (which enforces SUMMARY.md links for new articles)
was tracked at mode 100644, so even with `core.hooksPath=.githooks`
configured, git silently skipped it. Bump tracked mode to 100755 so
fresh clones get the working hook without a manual chmod step.
Discovered while installing the wiki-commit/hooks setup on MajorMac.
No content change; .githooks/ is outside the MkDocs source so this
will not alter the rendered notes.majorshouse.com site.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the gotcha hit during the 2026-05-06 update.yml refactor:
the second-positional-argument back-reference form of regex_search
('\1') doesn't reliably select capture groups when used inside
set_fact. The fix is to match the broader substring and use
.split()[0] (or [-1], etc.) to peel off the value, with a default()
bridge for the no-match case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updated article count (89 → 106), domain counts, per-section
listings, and Recently Updated table. Added all articles published
since 2026-04-18 including Pi-hole, Mastodon, fail2ban digest,
LoRA GGUF, Tailscale iOS, and more.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- fail2ban-digest-mode-fleet: recidive-only email model, sshd now silent,
defaults-debian.conf gotcha added
- netdata-docker-health-alarm-tuning: 30m/10m config, tuning history table
- New: wp-fail2ban-logpath-debian-ubuntu, lora-adapter-gguf-conversion-fails,
tailscale-status-json-hostname-localhost-ios
- Various article updates and nav index refreshes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Re-diagnoses today's notes.majorshouse.com outage. Original framing
was "ISP filter expanded to include 'notes'" — but the actual root
cause was a stale A record pointing at 136.54.3.248 (not majorlab's
current home IP). Corrects the comparison table to show CNAMEs to
apex resolve to 136.56.0.55, and recommends a Cloudflare-proxied
CNAME as the durable shape so the apex follows home IP automatically
and ISP-level SNI weirdness is bypassed at the same time.
Includes the working CF API payload used to flip the record, and an
audit checklist for any new *.majorshouse.com subdomain.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the gotcha where convert_hf_to_gguf.py crashes with
'config.json not found' because the training output directory holds
only the LoRA adapter, not a merged HF model. Includes inline
save_pretrained_merged() fix snippet, verification checklist, and
resume-pipeline-without-retraining pattern.
Discovered today during the MajorTwin v8c pipeline failure (Step 4).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the non-obvious failure mode where /etc/hosts generator scripts
using `tailscale status --json | jq '.HostName'` get poisoned by iOS
peers, which always report HostName as the literal string "localhost"
because iOS doesn't expose the device name to apps.
Includes the buggy and fixed jq filter (use .DNSName first label
instead), a real-world Postfix outage example, and a verification
checklist. Linked from troubleshooting index and SUMMARY.
Discovered while diagnosing a 24h Postfix outage on majordiscord.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a 'One-liner wrapper' tip and a 'Pre-Commit Hook (in repo)' section
to MajorWiki-Deploy-Status.md describing the per-clone setup needed on
each workstation:
git config core.hooksPath .githooks
git config pull.rebase true
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hook fails any commit that adds (or renames) a .md article without a
matching SUMMARY.md entry, addressing the recurring 'article exists but
isn't navigable' drift. Excludes meta files (README/index/SUMMARY,
category index.md, MajorWiki-Deploy-Status). Bypass with --no-verify.
Hook lives in .githooks/ (tracked). Each clone needs:
git config core.hooksPath .githooks
Companion wrapper ~/bin/wiki-commit (workstation-only, not in repo) does
pull --rebase --autostash + add -A + commit + push so cowork pushes
don't surprise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the gotcha discovered during the 2026-04-30 DCAProd XML-RPC
outage triage: wp-fail2ban plugin emits via PHP syslog(LOG_AUTH) which
lands in /var/log/auth.log on Debian/Ubuntu, not /var/log/syslog.
wordpress-{hard,soft,extra} jails configured with logpath=/var/log/syslog
(common in tutorials and ansible roles) silently catch zero events.
Article includes diagnostic steps, the fix, fail2ban-regex verification,
distro cheat sheet (Debian/Ubuntu vs RHEL/Fedora vs systemd-journal-only),
and a note on why wordpress-login is unaffected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two articles surfaced during the v8 deploy + eval on 2026-04-25:
- Ollama: `ollama run` with piped stdin bypasses the chat template and
SYSTEM prompt — output looks like raw base-model completion. Caught
during initial v8 smoke test. Fix: use /api/chat HTTP endpoint.
- rsync over Tailscale can hang in TCP teardown after the data has
fully transferred. Verify with md5sum, then kill the hung pipeline.
Includes a watcher-threshold gotcha (set below true file size, not
above) and prevention tips.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 files had stale unmerged index entries from a prior aborted merge —
working tree was already clean (no conflict markers), differences were
purely `updated:` field timestamps that drifted across machines via
Obsidian Sync. Working-tree timestamps (most recent) are kept as the
resolution.
Sixth file (mastodon-instance-tuning.md) preserves staged S3 cache
management content that a working-tree revert would have lost.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers creating groups, assigning clients, scoping allow rules to
specific groups via API and CLI. Includes ghost attribution gotcha
(router DNS proxy + secondary DNS causes FTL cache mis-attribution)
and the fix (Pi-hole as sole DNS, remove secondary).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New article covering the conversion from per-ban email alerts to a
three-tier model (silent default, sshd/recidive immediate, daily digest).
Includes Ansible automation, gotchas with lineinfile regex collisions,
and fq-hostname override for clean subjects.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Articles from prior sessions that were written locally but never shipped:
- 02-selfhosting/cloud/aws-s3-cost-management.md — lifecycle rules, storage class selection, bucket inventory, unexpected-growth investigation
- 02-selfhosting/dns-networking/wake-on-lan-router-ssh.md — WOL magic packets via Asus router SSH + ether-wake, Ansible vault integration
- 02-selfhosting/services/claude-code-remote-control.md — mobile access to a persistent host Claude Code session
Nav updated (index.md + SUMMARY.md):
- Added Cloud subsection under Self-Hosting for aws-s3
- Added wake-on-lan and aws-s3 entries to SUMMARY
- Added claude-code-remote-control to index's Services section
- Added ansible-ssh-host-alias-bypass nav entry (article shipped in 2dbeb22)
- Article count 87 → 89, self-hosting 30 → 32, troubleshooting 33 → 34
Documents why `ansible myhost -m ping` fails with Permission denied
while `ssh myhost` works — SSH Host blocks match on literal pattern,
not on resolved HostName, so `ansible_host: <IP>` bypasses the alias
and the declared IdentityFile never gets applied. Covers the portable
fix (ansible_ssh_private_key_file in host_vars), the symlink sidebar
for standardizing key names across control nodes, alternatives, and
a diagnosis checklist.
Also catches index.md up with the ansible-check-mode-false-positives
article that was already published but missing from the nav.
Documents the cosmetic but persistent warning during dnf upgrades:
"/usr/sbin cannot be merged yet, /usr/sbin/ebtables points to
/etc/alternatives/ebtables"
Stale update-alternatives symlinks (not rpm-owned) block Fedora's
/usr/sbin -> /usr/bin consolidation. Article covers root cause,
investigation steps, and the fix (tear down + re-register with
/usr/bin paths only). References the Ansible playbook
fix_ebtables_usrmerge.yml that implements this fleet-wide.
Applied 2026-04-19 across majorlab, majorhome, majormail, majordiscord.
claude-mem 12.1.3 passes --setting-sources with no value, which Claude Code
2.1.x rejects. Documents the silent summaryStored=null symptom, the real
error revealed under DEBUG logging, and the claude-shim workaround.