Compare commits

112 commits

Author SHA1 Message Date
8d9bd34118 Merge branch 'code/majorrig/mastodon-mention-spam-wiki' 2026-06-22 13:50:21 -04:00
2def4c6f30 wiki: add Mastodon crowdfunding/mention-spam triage runbook
Runbook for telling broadcast fundraising solicitation from genuine
mentions: signal checklist, SQL to investigate the account and its
origin instance via nodeinfo, BlockService snippet, and a proportionate
escalation ladder (mute -> block -> report -> domain-limit -> domain-block).
Registered in SUMMARY.md and the self-hosting section index.
2026-06-22 13:49:35 -04:00
44c9d38b9f Merge branch 'code/MajorAir/macos-btm-audit-wiki' 2026-06-21 13:01:34 -04:00
623f04720c Add macOS guide: auditing & cleaning Background App Activity (sfltool dumpbtm) 2026-06-21 13:00:35 -04:00
69d60b7753 Merge branch 'code/MajorAir/restic-snapshot-group-gotcha' 2026-06-21 12:34:06 -04:00
c358e0dfea restic runbook: document the snapshot-group-per-path-set gotcha
Changing a host's restic_paths spawns a new snapshot group (restic
groups by host+paths), so old and new path-sets each keep their own
retention lineage. Surfaced while extending majorlab's backup scope.
2026-06-21 12:33:56 -04:00
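Illustration for c358e0dfea: the grouping behaviour described in that commit (one snapshot group per host plus path set) can be sketched roughly as below. This is a minimal model, not restic's own code; the record layout and the `group_snapshots` helper are hypothetical, and it assumes path order is normalised before comparison.

```python
from collections import defaultdict


def group_snapshots(snapshots):
    """Group snapshot records by (host, path set).

    Hypothetical model of the host+paths grouping: two snapshots share a
    retention lineage only if both the host and the full path set match.
    """
    groups = defaultdict(list)
    for snap in snapshots:
        # Assumption: path order does not matter, so sort before keying.
        key = (snap["host"], tuple(sorted(snap["paths"])))
        groups[key].append(snap["id"])
    return dict(groups)


# Hypothetical records: the third snapshot was taken after /srv was added
# to the host's path list, so it starts a new group.
snapshots = [
    {"id": "a1", "host": "majorlab", "paths": ["/etc", "/home"]},
    {"id": "b2", "host": "majorlab", "paths": ["/home", "/etc"]},
    {"id": "c3", "host": "majorlab", "paths": ["/etc", "/home", "/srv"]},
]
groups = group_snapshots(snapshots)
```

Under this model, the old path set keeps its own group with its own retention history, which is why extending a host's backup scope does not simply continue the existing lineage.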
a45ef55862 Merge branch 'code/MajorAir/wp-textdomain-wiki' 2026-06-21 11:44:52 -04:00
e767ebffcb Add runbook: WordPress 6.7 _load_textdomain_just_in_time notice
Covers the WP 6.7 doing_it_wrong notice fired when a theme/plugin
translates before init (e.g. nav-menu labels on after_setup_theme).
Documents source fix (defer to init) and the update-safe mu-plugin
suppression via doing_it_wrong_trigger_error, plus the renamed-theme
domain gotcha. Derived from the majorlinux.com kappa/marstheme triage.
2026-06-21 11:44:48 -04:00
96db073b78 Add LVM volume-grow guide; publish iPhone Mirroring + Claude Code login fixes 2026-06-19 15:00:19 -04:00
cf5e35da1d Merge branch 'code/majorair/steam-deck-wifi-flap-article' 2026-06-19 11:36:09 -04:00
cb90bb69a2 wiki: add Steam Deck Wi-Fi flapping runbook (IWD periodic scan + rtw88 power save)
Client-side fix for OG Steam Deck (RTL8822CE/rtw88) flapping ~once a minute on
SteamOS: disable IWD periodic scan + disable Wi-Fi power save via NM dispatcher.
Cross-linked with the 160MHz airtime article; registered in SUMMARY.md nav.
2026-06-19 11:36:06 -04:00
4599ed607c wiki: add restic + B2 fleet backups runbook
Architecture, per-engine DB dump patterns, restore procedure, add-a-host,
and gotchas (RESTIC_CACHE_DIR/$HOME, missing sqlite3, docker dump env vars,
delete-capable B2 key). Linked in SUMMARY under storage-backup.
2026-06-19 10:05:16 -04:00
2bed2cbae3 Merge branch 'code/majormac/ansible-roles-migration-article' 2026-06-18 14:32:02 -04:00
ebdb28e9e2 Add wiki article: migrating flat Ansible playbooks to roles (capture-based reconciliation) 2026-06-18 14:31:46 -04:00
4fa5e33d93 Merge branch 'code/majormac/tm-orphaned-previous-article' 2026-06-18 10:09:45 -04:00
cfff75af1c Add troubleshooting article: Time Machine orphaned APFS .previous blocks backups 2026-06-18 10:09:45 -04:00
06162273f7 Merge branch 'code/majormac/ssh-key-backfill-article' 2026-06-17 13:15:19 -04:00
e1767bc19e Add troubleshooting article: Permission denied (publickey) after key rotation
New 05-troubleshooting/networking article covering the per-host nature of
authorized_keys: rotating a workstation SSH key requires backfilling the new
pubkey to every host, or hosts holding only the old key reject it with
Permission denied (publickey). Includes fleet-sweep diagnosis, idempotent
backed-up backfill via a still-trusted transit user, and prevention. Wired
into SUMMARY.md nav.
2026-06-17 13:14:41 -04:00
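Illustration for e1767bc19e: the idempotent part of the backfill can be sketched as a pure function over the contents of an `authorized_keys` file. This is only a sketch under assumptions; `backfill_key` is a hypothetical helper, the key strings are placeholders, and the backup and transit-user steps from the article are not modelled here.

```python
def backfill_key(authorized_keys: str, pubkey: str) -> str:
    """Return authorized_keys content that contains pubkey exactly once.

    Hypothetical helper: running it again on its own output changes
    nothing, which is what makes a fleet-wide backfill safe to repeat.
    """
    pubkey = pubkey.strip()
    existing = {line.strip() for line in authorized_keys.splitlines()}
    if pubkey in existing:
        return authorized_keys  # already present, leave the file untouched
    # Make sure the previous last line is terminated before appending.
    if authorized_keys and not authorized_keys.endswith("\n"):
        authorized_keys += "\n"
    return authorized_keys + pubkey + "\n"


# Placeholder key material, not real keys.
old = "ssh-ed25519 AAAAold user@old\n"
new_key = "ssh-ed25519 AAAAnew user@new"
once = backfill_key(old, new_key)
twice = backfill_key(once, new_key)
```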
0d08e21ee4 Merge branch 'code/majorair/yt-dlp-update-docs' 2026-06-16 19:12:21 -04:00
2121d3ff1b yt-dlp: document -U trap and avoid duplicate pip installs
Add a Maintenance subsection covering why 'yt-dlp -U' fails on PyPI
builds and how to update via pip, plus how to detect/remove a duplicate
user+system install (the issue hit on majorhome 2026-06-16).
2026-06-16 19:12:06 -04:00
1d73b2defa Merge branch 'code/majorair/keychain-prompt-wiki' 2026-06-15 20:12:21 -04:00
34d9ee42b1 Add wiki: Claude Code keychain prompt keeps reappearing on macOS
New troubleshooting article for the recurring 'security wants to access
Claude Code-credentials' prompt that persists even after Always Allow
(ACL invalidation on binary-signature change / token refresh / post-boot
churn). Covers triage, the reset-and-relogin fix, and the file-based
credentials workaround with its plaintext tradeoff. Registered in
SUMMARY + troubleshooting index; cross-linked with the corrupt-credential
login-failure article (distinct symptom).
2026-06-15 20:12:11 -04:00
700ca95158 Merge branch 'code/majorair/iphone-mirroring-regression' 2026-06-15 19:58:24 -04:00
a5df9e4873 Correct iPhone Mirroring article: regressed on 27.0 beta, not a Tailscale fix
2026-06-15: mirroring is reproducibly stuck on Connecting again with
Tailscale accept-routes still off, so the 06-14 it-works conclusion was
wrong. _asquic endpoint resolves but the QUIC/AWDL datapath never
completes; awdl0 bounce, full reboot, and phone radio cycle all failed.
Reframed as an intermittent macOS 27.0 beta AWDL bug; QuickTime USB
remains the workaround.
2026-06-15 19:58:20 -04:00
7703b963e1 Merge branch 'code/majorair/wiki-dummy-ip' 2026-06-15 19:26:58 -04:00
5050001909 Replace real majormail IP with documentation IP in logwatch example
The postfix MX-lookup example hard-coded majormail's real public IP
(stale DO address). Swap in an RFC 5737 documentation IP (203.0.113.10)
so the published wiki doesn't expose a real fleet IP.
2026-06-15 19:26:49 -04:00
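Illustration for 5050001909: a check along these lines could flag example addresses that fall outside the RFC 5737 documentation ranges. It is a minimal sketch; the `is_documentation_ip` helper is hypothetical, and how it would be wired into the wiki's checks is not specified by the commit.

```python
import ipaddress

# The three IPv4 blocks reserved for documentation by RFC 5737.
DOC_NETS = [
    ipaddress.ip_network(net)
    for net in ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24")
]


def is_documentation_ip(addr: str) -> bool:
    """True if addr lies in one of the RFC 5737 documentation ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in DOC_NETS)
```

For example, `is_documentation_ip("203.0.113.10")` holds for the replacement address used in the commit, while a routable public address does not pass.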
9085740fa3 Merge branch 'code/majorair/iphone-mirroring-llw0-correction' 2026-06-14 19:10:33 -04:00
75154ff80c iPhone Mirroring: correct transport finding (video on llw0 not awdl0), it works on ch44, what-changed + MajorMac open test (2026-06-14 evening) 2026-06-14 19:10:06 -04:00
4c95f8a88a Merge branch 'code/majorair/iphone-mirroring-doc-update' 2026-06-14 04:31:55 -04:00
805c0f0a8f iPhone Mirroring AWDL article: refined root cause, Tailscale/congestion ruled out, ch36+ch44 both fail, QuickTime USB workaround, revisit checklist (2026-06-14) 2026-06-14 04:30:22 -04:00
e5d1e39af9 Merge branch 'code/majorair/wiki-stale-hostname-config-variant' 2026-06-14 04:00:25 -04:00
852375ddf0 logwatch-hostname wiki: add hostname-correct-but-config-baked variant
majormail (2026-06-14) had the correct system hostname but still mailed
from majormail-hetzner — the old provisioning label was hardcoded in
logwatch.conf MailFrom and fail2ban jail.local sender. Add a variant
section covering the config grep sweep and the templated-vs-static
Ansible regression caveat.
2026-06-14 04:00:18 -04:00
9dd730fc29 Add nav entries for Warp keychain login + iPhone Mirroring AWDL articles 2026-06-13 09:58:26 -04:00
e0595c04fd Publish drafts: Warp keychain login + iPhone Mirroring AWDL stall 2026-06-13 09:57:37 -04:00
MajorLinux
27ea2dc62b Add troubleshooting article: Wi-Fi 160 MHz airtime saturation breaking game streaming 2026-06-13 09:48:43 -04:00
3f94ebb963 Merge branch 'code/majormac/wiki-forgejo-recovery' 2026-06-12 17:36:55 -04:00
14cc1ba4b8 wiki: Forgejo account recovery & CLI admin when locked out of the GUI
Covers enabling the [mailer] for password recovery (relay via a tailnet mail
server, no-auth/mynetworks, FORCE_TRUST_SERVER_CERT for IP targets), CLI password
reset + the must-change-password=true gotcha, adding an SSH key via the basic-auth
API when locked out, and ruling out a server-side cause for a 'changing' password.
2026-06-12 17:36:54 -04:00
fecae727d1 Merge branch 'code/majormac/logwatch-hostname-wiki' 2026-06-12 10:58:17 -04:00
0d1697c0d6 wiki: Logwatch wrong hostname (<host>-hetzner) after migration
New troubleshooting runbook for Logwatch reports titled with the Hetzner
provisioning label instead of the real hostname; cross-linked from the
logwatch fleet-setup and VPS migration baseline articles, plus a new
'set system hostname' step in the post-migration checklist.
2026-06-12 10:58:17 -04:00
4f6898eb6c Merge branch 'code/majormac/ansible-hostkey-wiki' 2026-06-12 09:32:00 -04:00
11b455a0e2 Add runbook: Ansible host-key verification failed after host rebuild/migration
Documents the Ansible-by-IP known_hosts gap: interactive ssh works (key
stored under hostname) but Ansible connects by inventory IP and fails with
UNREACHABLE/Host key verification failed. Includes tailnet-safe ssh-keyscan
fix and prevention notes. Surfaced by the Hetzner migration IP churn.
2026-06-12 09:30:09 -04:00
bc4ff144df wiki: add Ansible reboot.yml become-timeout-on-WSL2 troubleshooting article
Documents why WSL2 hosts fail an Ansible reboot play at privilege
escalation (Timeout waiting for privilege escalation prompt) — WSL2 has
no real reboot semantics + become stalls over the Windows OpenSSH->WSL2
bridge — and the fix: scope reboot.yml to hosts: all:!wsl. Registered
in SUMMARY.md and 05-troubleshooting/index.md.
2026-06-12 03:57:17 -04:00
950759da52 wiki: add MagicDNS-names-vs-pinned-IPs Tailscale SSH article
New troubleshooting/networking article covering the three SSH failure modes
after a fleet migration (stale hardcoded IP, Tailscale 1.98.x cold-path
teardown, rebuilt-box host-key mismatch) and the durable fix (MagicDNS names +
known_hosts purge + ConnectTimeout), with the WSL2 no-resolver caveat.
Cross-links the existing host-key article (adds a 'when pinning the IP is
wrong' callout) and adds the SUMMARY nav entry.
2026-06-12 01:33:31 -04:00
877c4b815f wiki: add WSL2 Fedora 44 in-place upgrade article (gcc14 blocker + CUDA repo swap) 2026-06-11 22:48:55 -04:00
27b1ae244c Merge branch 'code/majorrig/wiki-hevc-already-failed-skip' 2026-06-11 20:16:21 -04:00
ce2e761d33 hevc-vaapi-batch-encode: add already_failed() skip for streaming content
Document that VAAPI HEVC on Polaris can't beat already-efficient H.264 (YouTube/
Twitch/stream archives), so output comes out larger and lands in hevc_failed.txt.
Add already_failed() guard so the batch skips known-bad files on queue rebuilds
instead of re-attempting them. Also: MIN_FREE_GB note (start-only check) and a
source-bitrate triage snippet for picking real encode candidates.
2026-06-11 20:16:19 -04:00
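Illustration for ce2e761d33: the skip guard could look roughly like the sketch below. Only the names `already_failed` and `hevc_failed.txt` come from the commit; the Python form, the one-path-per-line file layout, and the `load_failed` and `build_queue` helpers are assumptions made for illustration.

```python
from pathlib import Path


def load_failed(failed_list: Path) -> set[str]:
    """Read the failure list (assumed: one source path per line)."""
    if not failed_list.exists():
        return set()
    lines = failed_list.read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


def already_failed(path: str, failed: set[str]) -> bool:
    """True if a previous run recorded this file as a failed encode."""
    return path in failed


def build_queue(candidates, failed):
    """Rebuild the queue, skipping files that are known to fail."""
    return [p for p in candidates if not already_failed(p, failed)]


# Hypothetical file names.
failed = {"/media/streams/archive-01.mp4"}
queue = build_queue(
    ["/media/streams/archive-01.mp4", "/media/rips/film.mkv"], failed
)
```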
513d94aa84 Merge branch 'code/majorrig/wiki-ssh-magicdns-article' 2026-06-11 20:12:34 -04:00
9b066d0e54 Add troubleshooting article: SSH alias MagicDNS fall-through host-key failure
New 05-troubleshooting/networking article covering the case where ssh <alias>
fails host-key verification because no Host block exists and the alias resolves
via Tailscale MagicDNS to a name with no known_hosts entry (key stored under the
IP). Registered in SUMMARY.md and the troubleshooting index.
2026-06-11 20:12:22 -04:00
5ef0fdfad4 draft: WIP wiki articles (warp keychain credential, iPhone Mirroring AWDL stall)
Backing up two unpublished draft articles that existed only in a working-tree
stash. Drafts — NOT in SUMMARY.md nav and NOT merged to main, so not published
to notes.majorshouse.com. Pre-commit nav check bypassed intentionally (--no-verify).

- 05-troubleshooting/claude-code-warp-login-corrupt-keychain-credential.md
- 05-troubleshooting/iphone-mirroring-connecting-hang-awdl-stall-beta.md
2026-06-11 15:41:28 -04:00
a414e4cdbe Merge: Ansible role doc-ref updates across 5 wiki articles 2026-06-11 11:33:42 -04:00
06a794316b docs: point Ansible references at the new roles (clamav/ssh_hardening/tailscale)
Operational/how-to references updated to the role entry playbooks after the
ADR-0001 migration. Historical incident narrative (dated callouts, commit
refs) preserved.

- clamav-fleet-deployment: override + re-run -> clamav.yml; role note
- ssh-hardening-ansible-fleet: note this is now the ssh_hardening role
- vps-migration-baseline-checklist: table -> clamav.yml / ssh_hardening.yml
- ssh-socket-tailscale-race-condition: Affected Hosts + Prevention + References
  -> tailscale role tasks (network_wait/ssh_only_ubuntu/ssh_only_fedora)
- freshclam-logwatch-false-no-updates: codify refs -> clamav role
2026-06-11 11:33:42 -04:00
68bfb099ac Merge branch 'code/majorrig/wiki-ssh-fleet-reconciled' 2026-06-07 06:22:39 -04:00
c3045e33dd troubleshooting: ssh-race article — fleet audited & reconciled 2026-06-07
dcaprod-hetzner + tttpod-hetzner were missing tailscale-wait-ready.service
(inert ssh.service gate -> latent bind race); corrected playbook applied to
both. teelia uses Tailscale SSH (no sshd, immune). All Ubuntu hosts now on
the dependency-free-socket + ssh.service-gate pattern.
2026-06-07 06:22:35 -04:00
0cde19e064 Merge branch 'code/majorrig/wiki-ssh-race-fedora-and-cycle' 2026-06-07 05:56:29 -04:00
8d4dee5da3 troubleshooting: correct ssh tailscale-race article (Fedora ListenAddress variant + playbook cycle landmine)
- Fedora hosts are NOT automatically immune: a leftover manual
  `ListenAddress <tailscale-ip>` drop-in reintroduces the sshd boot bind-race
  even under firewalld (hit on majordiscord 2026-06-07; fix = remove it).
- The Ubuntu playbook kept shipping the cycle-causing [Unit] gate on
  ssh.socket despite the 2026-06-04 resolution; re-running it re-armed the
  ordering cycle (clobbered majorlinux; majortoot-hetzner found armed).
  Corrected in MajorAnsible e0d35aa. Fleet ssh-lockdown state is inconsistent
  (dcaprod/tttpod lack wait-ready; teelia no override) — needs a per-host audit.
2026-06-07 05:56:25 -04:00
fda2d35ea5 Merge branch 'code/majorrig/wiki-dovecot-lda-dupes' 2026-06-07 05:06:57 -04:00
01ae62e621 troubleshooting: Dovecot phantom mailboxes from .dovecot.lda-dupes (mail_home overlapping maildir root)
Document the majormail 2026-06-07 incident: when userdb home == maildir
root, the LDA/Sieve duplicate database (.dovecot.lda-dupes + .locks) lands
inside the mail store and the maildir lister exposes it as phantom
mailboxes ("dovecot.lda-dupes"), logging stat(.../tmp) "Not a directory".
Fix: point home at a non-dotted subdir. Wired into the troubleshooting
index and SUMMARY.
2026-06-07 05:06:43 -04:00
662741e7ad troubleshooting: Postfix header_checks can't act on milter-added headers
Document the majormail spam-routing failure (2026-06-06): a cleanup
header_checks REDIRECT keyed on the milter-added X-Spam-Flag never fired for
real inbound mail (only locally-injected), so spam kept reaching the inbox.
Fix is to route in Sieve at delivery (after the milter), with a redirect +
loop guard. Includes the 'local-injection tests lie' warning.
2026-06-06 10:38:04 -04:00
d8f07e8e2e wiki: add ClamAV freshness watchdog + sendmail (not mail) alert guidance
Document the daily /etc/cron.daily/clamav-freshness watchdog as the real
detector for stale signatures, and the key gotcha that 'mail' is absent on
most fleet hosts so alert scripts must use /usr/sbin/sendmail -t.
2026-06-06 07:17:56 -04:00
5d7354e856 troubleshooting: freshclam daemon-mode logwatch false 'no updates' alert
logwatch's clam-update counts only 'process started' lines (emitted only at
daemon restart), so daemon-mode freshclam false-alarms on quiet days despite
signatures updating. Fix: $ignore_no_updates=1 drop-in. Includes the
real-vs-false check (a daemonless box with freshclam disabled is a TRUE alert).
2026-06-06 07:06:29 -04:00
d755b77126 troubleshooting: SELinux /etc/localtime mislabel silently breaks timezone
New page documenting the majormail (2026-06-05) issue: /etc/localtime
shipped labeled etc_t instead of locale_t on the Hetzner image, so SELinux
denied systemd-timedated and timedatectl/community.general.timezone reported
success while the symlink stayed at UTC. Fix: restorecon before setting TZ.
Indexed in index.md (SELinux) + SUMMARY.md.
2026-06-05 14:22:00 -04:00
26eb13ab2f troubleshooting: document majormail client-connectivity incident (2026-06-05)
- New page: Dovecot IMAP vsz_limit OOM from a bloated/corrupt index.log
  (152M index on an empty folder killed IMAP children with error 83).
- fail2ban IMAP self-ban: add permanent ignoreip-whitelist fix + dynamic-IP caveat.
- firewalld mail ports: add 'submission/587 never added' variant + correct
  Fedora service name; note Ansible now manages the full mail-service set.
- Index + SUMMARY updated with the new page.
2026-06-05 14:04:22 -04:00
5260548caa wiki: spam filtering — add Pigeonhole 2.4 syntax, REDIRECT-to-junk pattern, weekly timer
Three updates to the inbound spam filtering guide, all driven by the 2026-06-04
majormail-hetzner Phase 6 cutover and follow-up tuning:

1. Section 6 (Dovecot Sieve): warn explicitly that `plugin/sieve_before` was
   dropped in Pigeonhole 2.4 and silently does nothing — no startup warning,
   spam just keeps landing in INBOX. The 2.4 replacement is a top-level
   `sieve_script <name> { type = before; path = …; }` block. Also note the
   Fedora-flat-dovecot.conf pitfall (some packagings ship dovecot.conf
   without `!include conf.d/*.conf`, so the block has to live in the main
   file directly). Added a `sievec` compile step.

2. New §6b: route spam to a separate `junk@` mailbox via Postfix cleanup
   `header_checks` REDIRECT. This makes spam invisible to the user's
   mailbox entirely — Spark/IDLE-based clients don't push-notify because
   the message never reaches the subscribed mailbox at all. Includes the
   `regexp:` vs `pcre:` map-type tip (use regexp on stock Fedora to avoid
   the postfix-pcre package dependency).

3. New §7a: weekly systemd timer for sa-learn. The §7 warning about
   "don't run sa-learn from cron unless folders are clean" is correct as
   the safe default — but when you adopt the §6b REDIRECT-to-junk@
   pattern, the junk@ mailbox is pure spam by design and a weekly
   `--spam`/`--ham`/`--sync`/`--force-expire` chain becomes safe and
   useful. Full unit templates included.

Gotchas table gains four entries:
- Pigeonhole 2.4 silent breakage of plugin/sieve_before
- postfix-pcre vs regexp map type confusion
- Why sieve fileinto Junk still pushes a Spark notification
- Why local `sendmail` injection doesn't trigger the REDIRECT (smtpd
  milters skip sendmail-injected mail, so X-Spam-Flag isn't added)

All changes match what's now codified in the `majormail` Ansible role
(commit 7a8b9eb in MajorAnsible).
2026-06-04 20:48:01 -04:00
2e58c4625c wiki: remove deploy-pipeline test marker 2026-06-04 16:44:56 -04:00
b81362bb78 wiki: temporary deploy-pipeline test marker (will be reverted) 2026-06-04 16:43:57 -04:00
110a6d49e5 wiki: add inbound spam filtering guide (spamass-milter + SpamAssassin Bayes)
New 02-selfhosting/services article: the full Postfix/Dovecot inbound spam stack
on Fedora — spamass-milter tag-only wiring (the -r footgun), socket permissions
(sa-milt group + UMask), site-wide Bayes DB, Sieve-to-Junk, and sa-learn training
(folders, spam/ham balance, manual-not-cron). From the majormail setup.

Also extends selinux-dovecot-vmail-context with a Permissive-mode variant + a
postfix_cleanup->mysqld_etc companion-denial note. SUMMARY.md nav updated.
2026-06-04 16:31:14 -04:00
e6a249403c s3-cost-management: prune automation disabled; correct guidance
The weekly media-prune cron (and monthly accounts refresh --all) were
removed 2026-06-01 after repeatedly breaking avatars. Update the
majortoot sections: the 648->7GB shrink was a one-time safe attachment
cleanup; automation is now disabled; prune attachments manually if ever
needed, never profiles. Cross-link the two new troubleshooting articles.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 15:46:42 -04:00
4e63d8546c mastodon: document S3 ACL upload failures + bulk avatar restore
New article mastodon-s3-acl-upload-failures.md: a BucketOwnerEnforced S3
bucket plus a stale S3_PERMISSION/S3_ACL in .env.production makes every
Mastodon upload fail with AccessControlListNotSupported, silently. Covers
symptoms (incl. why a missing object returns 403 not 404), diagnosis,
the fix (S3_PERMISSION= empty, public read via bucket policy), recovery,
a synthetic-write health check, and Ansible enforcement.

Extend mastodon-prune-profiles-trap.md: add a "Bulk restore at scale"
procedure (list existing keys, null missing DB refs, enqueue
RedownloadAvatar/HeaderWorker), a "storage-level deletion without DB
de-ref" section, and a stronger recommendation to disable automated
profile pruning (and scheduled accounts refresh --all) entirely.

Link both from SUMMARY.md and the selfhosting index.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 15:45:23 -04:00
155651c373 wiki: ssh.socket wait-ready gate + mastodon post-install hardening
Two related additions covering the 2026-05-31 cutover-night incidents on
majorlinux and majortoot-hetzner.

ssh-socket-tailscale-race-condition.md (update Race 1 fix):
- After=tailscaled.service Requires=tailscaled.service orders against the
  service becoming active, not against tailscale0 having an IPv4 — hosts
  kept losing SSH intermittently after reboots (incident: majorlinux +
  majortoot-hetzner 2026-05-31, during cutover-night Ansible reboot).
- Canonical fix: a oneshot tailscale-wait-ready.service that polls
  `ip -4 -o addr show tailscale0` until an address is present, with
  ssh.socket After=/Requires= that service. Document the full evolution
  (2026-05-19 BindsTo → 2026-05-23 Requires → 2026-05-31 wait-ready) so
  future readers don't try the half-fixes thinking they're sufficient.
- Add majortoot-hetzner to affected hosts.

mastodon-post-install-hardening.md (new):
Four upstream-install gaps that bit during the majortoot-hetzner cutover:
1. /home/mastodon at 0750 (useradd default) → nginx www-data can't
   traverse → every static asset 403s → unstyled "purple screen" in the
   browser while API/HTML still work through the puma proxy.
2. .env.production at 0644 (mastodon-setup default) → DB_PASS,
   SECRET_KEY_BASE, OTP_SECRET world-readable once gap (1) is fixed.
3. mastodon user shell at /usr/sbin/nologin → `su - mastodon` blocked.
4. rbenv init in .bashrc only → login shells don't source .bashrc; even
   when chained, Ubuntu's .bashrc returns early for non-interactive
   shells. Fix: .bash_profile sets up rbenv BEFORE sourcing .profile +
   .bashrc, so it works for both interactive and non-interactive logins.

All four codified in MajorAnsible configure_mastodon_permissions.yml
with self-asserting verification steps.

02-selfhosting/index.md + SUMMARY.md:
Add a "Services" section to the selfhosting index linking the
mastodon-post-install-hardening article (and the other orphaned
services/ entries while there). SUMMARY.md gains one new entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 11:08:24 -04:00
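Illustration for 155651c373: the wait-ready gate is a systemd oneshot unit, but its polling logic (repeat `ip -4 -o addr show tailscale0` until an IPv4 address appears) can be sketched in Python. The timeout, the interval, and both helper names are assumptions, not values taken from the unit.

```python
import re
import subprocess
import time

# Matches the "inet 100.64.0.5/32" field in `ip -4 -o addr show` output.
INET_RE = re.compile(r"\binet\s+(\d{1,3}(?:\.\d{1,3}){3})/\d+")


def ipv4_from_ip_output(output: str):
    """Return the first IPv4 address in the command output, or None."""
    match = INET_RE.search(output)
    return match.group(1) if match else None


def wait_for_ipv4(iface="tailscale0", timeout=60.0, interval=1.0,
                  run=None, sleep=time.sleep):
    """Poll until iface has an IPv4 address; return it, or None on timeout.

    `run` and `sleep` are injectable so the loop can be exercised without
    a real interface. Timeout and interval are illustrative defaults.
    """
    if run is None:
        def run():
            return subprocess.run(
                ["ip", "-4", "-o", "addr", "show", iface],
                capture_output=True, text=True, check=False,
            ).stdout
    deadline = time.monotonic() + timeout
    while True:
        addr = ipv4_from_ip_output(run())
        if addr:
            return addr
        if time.monotonic() >= deadline:
            return None
        sleep(interval)
```

The point of the gate is visible in the loop condition: it waits for an address on the interface, not for the `tailscaled` service to report active.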
73c10111e0 Merge branch 'cowork/majorair/wiki-batch-may25' 2026-05-25 13:56:23 -04:00
52ca8a0413 wiki: batch update — 4 new articles + 4 updates
New articles:
- Postfix SendGrid TLS handshake failure (port 465 vs 587)
- Plex transcoding troubleshooting
- Ansible Ubuntu reboot detection kernel mismatch
- WSL2 PyTorch checkpoint Windows filesystem deadlock

Updated:
- AWS S3 cost management (expanded)
- Network overview (IP updates)
- HEVC VAAPI batch encode (progress + fixes)
- SUMMARY.md (new entries)
2026-05-25 13:55:10 -04:00
dc897d4a67 Merge branch 'cowork/majorair/ssh-socket-bindsto-fix' 2026-05-23 02:40:45 -04:00
3b8c8b0597 ssh.socket wiki: correct BindsTo→Requires, add warning
BindsTo=tailscaled.service causes a systemd ordering cycle that
prevents ssh.socket from starting on reboot. Updated the recommended
fix to use Requires= and added a warning admonition explaining why
BindsTo must not be used. Added tttpod-hetzner to affected hosts
list and linked the 2026-05-23 dcaprod incident.
2026-05-23 02:40:04 -04:00
318f50c50b Merge branch 'cowork/majorair/tailscale-boot-race-wiki' 2026-05-19 20:39:19 -04:00
65b0aa4567 wiki: expand Tailscale race condition article with network-online race
Added Race 2: tailscaled starts before network-online.target, causing
Tailscale to get stuck with SetNetworkUp(false). Covers both Ubuntu
ssh.socket and cross-platform tailscaled ordering issues. Updated
references to include majordiscord incident and new Ansible playbook.
2026-05-19 20:39:18 -04:00
eb39da9a26 Merge cowork/majorair/ssh-socket-wiki: ssh.socket Tailscale race condition article 2026-05-19 19:36:19 -04:00
7dc591d257 wiki: add ssh.socket Tailscale race condition troubleshooting article
Documents the systemd socket activation race where ssh.socket binds
to the Tailscale IP before tailscaled is ready, causing SSH to become
unreachable after a Tailscale reconnect. Includes diagnosis steps and
the After=/BindsTo= fix.
2026-05-19 19:35:16 -04:00
64ac418a36 wiki: add ClamAV daemonless mode section + HEVC VAAPI article link 2026-05-15 09:02:24 -04:00
Marcus (via Claude Code)
28518e403e Add troubleshooting articles: Netdata apps-group FD false-positive + OBS stale script paths
- netdata-apps-fds-group-false-positive: the apps_group_file_descriptors_utilization
  false 100% on forking/root app groups (tailscaled on MajorToot 2026-05-15),
  the not-a-privilege gotcha, fleet-wide silence fix in MajorAnsible.
- obs-stale-script-paths: pending from prior session (not on remote).
- SUMMARY.md: link both (re-applied onto upstream after concurrent rebase).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 03:22:12 -04:00
a785e85821 Merge branch 'code/majorair/rsyslog-logwatch-fix' 2026-05-13 10:36:06 -04:00
4ec481c584 wiki: add rsyslog requirement to migration checklist and logwatch docs
Fedora 44 Hetzner images ship without rsyslog — logwatch produces
zero output because /var/log/messages doesn't exist. Added rsyslog
to baseline table and new diagnostic section to logwatch article.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 10:36:00 -04:00
c22457f1aa Merge branch 'code/majorair/teelia-cpu-docs' 2026-05-11 18:32:18 -04:00
ac84610380 wiki: add 1 vCPU nice/ionice limitation note to ClamAV article
nice -n 19 only yields when other processes compete; on single-core
VPS boxes the scan still saturates CPU. Document the expectation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 18:32:01 -04:00
3df0979786 Merge branch 'code/majorair/logwatch-ca-bundle-docs'
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 07:37:48 -04:00
de9b661b9d wiki: add Fedora CA bundle article, update migration checklist and logwatch docs
New article documenting missing /etc/pki/tls/certs/ca-bundle.crt symlink
on Hetzner Fedora images breaking Postfix TLS, curl, and dnf. Updated
VPS migration baseline checklist with timezone, CA bundle, and crond
verification steps. Updated logwatch fleet setup with crond check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 07:35:42 -04:00
9c62e7f804 Logwatch fleet article: add cloud-image config-drift section
Documents three more patterns surfaced in the 2026-05-10 fleet-mail
investigation, all hitting hosts derived from cloud images or
cross-provider migrations:

- Packer/snapshot-leftover myhostname (postfix EHLO + message-id
  identifies the build artifact, not the production hostname; remote
  spam scorers hate it)
- Empty relayhost silently routes mail via the public MX instead of
  the Tailscale-internal path, exposing it to spamchk that internal
  traffic bypasses
- Stale SASL passwd map referencing a missing file from a previous
  external-SMTP relay setup, deferring every send with "local data
  error"

Each looks benign in isolation. Together they made dcaprod's Logwatch
disappear into spamchk for weeks while showing 250 OK on the source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 12:58:00 -04:00
724ae2a5e3 Add troubleshooting article: PHP 8.4 implicit-nullable vendor patch
Generalizes the Castopod/UuidModel incident from 2026-05-10. PHP 8.4
deprecated implicit-nullable parameters (`function f(int $x = null)`).
Old vendor libraries spam E_DEPRECATED warnings; CodeIgniter wraps each
in a 23-frame stack trace; per-minute spark cron amplifies into 53-80
MB/day log bleed and 22% sustained CPU floor on small VPS boxes.

Documents the four-line sed fix AND the substring-match gotcha that
extended the fix from 30 seconds to 30 minutes — bare `int \$limit = null`
patterns substring-match `?int \$limit = null` elsewhere in the file
and produce illegal `??type` syntax. Covers anchored sed patterns,
reference-parameter handling (&\$db), the lint-after-every-edit rule,
and a bonus section on hunting stray developer debug prints
(`log_message('critical', 'ITS HEEEEEEEEEEEERE')`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 12:52:25 -04:00
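Illustration for 724ae2a5e3: the substring-match gotcha and the anchored fix can be sketched with a regular expression that refuses to match a type that is already nullable. This is not the sed command from the article; the pattern, the keyword exclusions, and the `make_nullable` helper are assumptions, and the sketch is meant for individual function signature lines rather than whole files.

```python
import re

IMPLICIT_NULLABLE = re.compile(
    r"(?<![?\w\\|])"                      # not already '?type', not mid-name, not in a union
    r"(?!(?:mixed|static|public|protected|private|var|return)\b)"  # partial keyword guard
    r"(?P<type>\\?[A-Za-z_][\w\\]*)"      # the declared type, optionally namespaced
    r"(?P<sep>\s+&?)"                     # whitespace plus optional by-reference '&'
    r"(?P<var>\$\w+\s*=\s*(?i:null)\b)"   # '$name = null' default
)


def make_nullable(signature: str) -> str:
    """Prefix implicitly nullable parameter types with '?'.

    The lookbehind is the anchor: without it, 'int $limit = null' would
    also match inside '?int $limit = null' and yield '??int'.
    """
    return IMPLICIT_NULLABLE.sub(r"?\g<type>\g<sep>\g<var>", signature)


fixed = make_nullable("function f(int $limit = null)")
untouched = make_nullable("function f(?int $limit = null)")
by_ref = make_nullable("function g(ConnectionInterface &$db = null)")
```

Because an already nullable type never matches, running the rewrite twice gives the same result as running it once. Linting each file after the edit, as the article recommends, remains the real safety net.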
631d7e8bc5 Logwatch fleet article: add Fedora CA bundle diagnosis + bounce-source guidance
Documents three lessons from the 2026-05-10 fleet outage where the
Fedora half (majorhome, majorlab) had been silently failing to send
notification mail for days:

- Missing /etc/pki/tls/certs/ca-bundle.crt symlink (extracted bundle
  exists at /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem but the
  consumer-path symlink was lost during a ca-certificates package
  event). Diagnosis includes the cross-tool tell — dnf and curl break
  with the same path. Fix is a single ln -sfn.
- Methodology: Fedora and majormail log postfix to journald; Debian and
  Ubuntu log to /var/log/mail.log. Querying the wrong source returns
  false negatives for healthy hosts.
- Bounce-source addresses (Watchtower NOTIFICATION_EMAIL_FROM,
  fail2ban sender, root@<host>.localdomain) must resolve to real
  mailboxes — otherwise the first failed delivery generates
  bounce-of-bounce churn.

Also promoting the article from untracked to committed; it had been
authored on 2026-05-09 and not yet added to the repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 12:08:15 -04:00
a852f7b7bd ClamAV fleet caveat: add follow-up on the polite-CPU-on-1vCPU edge case
Same-day correction. The proposed per-droplet relaxed alert (>95%/30m)
turned out to also trip on a 1 vCPU box during low-traffic weekly scans,
because there's literally no real load for nice 19 to yield to —
clamscan opportunistically fills the vCPU and DO sees 100% utilization
regardless of `%nice` vs `%user` split. Documents the three realistic
options (accept page / switch to clamdscan / disable alert) and the
underlying limit (no DO threshold can distinguish polite from impolite
CPU when the box is fully utilized).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 02:32:35 -04:00
af14e36caf ClamAV fleet article: add DigitalOcean monitoring caveat for 1vCPU droplets
DO's hypervisor-level CPU metric doesn't know about nice/ionice — a
"polite" weekly clamscan on a 1 vCPU droplet still reads 100% utilization
and trips a default >85%/5m alert. Adds a new section explaining the
trade-off and providing the DO API recipe (PUT existing alert with
explicit entities, POST a new relaxed alert scoped to the small
droplet) plus when not to bother (2+ vCPU boxes won't trip).

Triggered by the 2026-05-10 teelia incident where the weekly cron fired
the fleet-wide CPU alert despite the cron script already wrapping
clamscan in nice 19 + ionice idle + cgroup memory limits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 02:24:17 -04:00
545df9f5c6 Add troubleshooting article: Claude Desktop MCP mass-disconnect from blocking SSH reboot
Documents the failure mode where issuing a synchronous `ssh host reboot`
through Claude Desktop's shell MCP poisons the local MCP transport when
the target severs its session before responding cleanly — eventually
force-disconnecting every MCP at once. Covers diagnostic chain, recovery,
fire-and-forget reboot patterns, and worked example from the 2026-05-10
majorhome AMD-card reboot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 01:28:11 -04:00
7c566cda50 Add: diagnosing Castopod posts that don't appear on Mastodon
Walks the four-step diagnostic chain (post created → activity delivered →
follower exists → notification semantics) for the common confusion where
a Castopod admin's auto-broadcast "doesn't show up" on a Mastodon account
they expected. Most cases are not federation bugs but the difference
between favouriting/boosting (no follow required) and following + the
fact that Mastodon notifications fire only for mentions/follows/favs/
boosts/etc., not for new posts from people you follow. Documents the bell
icon and `@`-mention escape hatches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:05:18 -04:00
1c17bdb60a Add: Castopod federation — stale cached avatar URL fix
When a remote actor updates their avatar, Mastodon (Paperclip) deletes the
old S3 object and stores only the new filename. Castopod 2.0.0 caches the
URL of every federated actor in cp_fediverse_actors and never refetches,
so its admin templates emit a dead link forever (the resulting S3 403 is
anti-enumeration, hiding what is really a 404). Article documents the
diagnosis pattern and three fixes (manual UPDATE, DELETE-and-refetch,
bulk audit), plus the Mastodon-side query for sourcing the correct URL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 01:51:18 -04:00
393df3cc45 Add: tuning Netdata web_log_1m_successful for redirect-heavy WordPress
The stock alarm definition counts only 1xx/2xx/304/401/429 as successful,
which causes false CRITICALs on WP sites where 301 canonicalization is
normal traffic (legacy /?p=NNNN, slug edits, host/TLS upgrades, etc.).
Article documents the root cause, verification steps via the access log,
and an in-place threshold retune that keeps the alarm useful as an
"obvious meltdown" floor while delegating real outage detection to the
5xx and 4xx alarms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 01:12:21 -04:00
306e5f1f16 Merge branch 'cowork/majormac/mastodon-prune-profiles-trap' 2026-05-07 12:01:48 -04:00
3bcc58a805 services: add Mastodon --prune-profiles trap and recovery article
Documents the long-standing UX regression caused by
`tootctl media remove --prune-profiles` (and `--remove-headers`)
running on a schedule: cached remote avatars are deleted, but
Mastodon does not auto-refetch on profile view, so quiet remote
accounts stay broken indefinitely.

Article covers:
- The mutually-exclusive flag bug (silent skip if combined)
- Mastodon's actual avatar-refresh trigger model (Update activities,
  not profile views)
- A `refresh-my-follows.sh` pattern with a defensible WHERE clause
  (avatar NULL AND avatar_remote_url present) to avoid infinite
  retry on accounts whose origin has no avatar
- Why header_file_name IS NULL is a bad signal (~20% of users
  legitimately have no custom header)
- The cron decision: most admins should drop --prune-profiles
2026-05-07 12:01:47 -04:00
5f31a57ae6 Merge branch 'cowork/majormac/githooks-executable' 2026-05-06 09:44:08 -04:00
7e422ee332 githooks: mark pre-commit executable
The pre-commit hook (which enforces SUMMARY.md links for new articles)
was tracked at mode 100644, so even with `core.hooksPath=.githooks`
configured, git silently skipped it. Bump tracked mode to 100755 so
fresh clones get the working hook without a manual chmod step.

Discovered while installing the wiki-commit/hooks setup on MajorMac.
No content change; .githooks/ is outside the MkDocs source so this
will not alter the rendered notes.majorshouse.com site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 09:42:06 -04:00
3c4cc74aef Merge: wiki — Ansible regex_search set_fact gotcha 2026-05-06 08:28:22 -04:00
ca123b0312 wiki: add troubleshooting article — Ansible regex_search capture group fails in set_fact
Documents the gotcha hit during the 2026-05-06 update.yml refactor:
the second-positional-argument back-reference form of regex_search
('\1') doesn't reliably select capture groups when used inside
set_fact. The fix is to match the broader substring and use
.split()[0] (or [-1], etc.) to peel off the value, with a default()
bridge for the no-match case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 08:28:21 -04:00
488268ccd1 Merge branch 'cowork/majorair/rescue-stash-keep-files-may02' 2026-05-02 17:50:28 -04:00
213a84ed79 wiki: add .keep files for 04-streaming and 05-troubleshooting subdirs 2026-05-02 17:50:22 -04:00
ae864452f8 wiki: add Fail2Ban Digest Mode nav entry to SUMMARY.md 2026-05-02 17:17:04 -04:00
49a1173dfc Merge cowork/majorair/index-refresh-may02 — full index refresh (106 articles) 2026-05-02 16:45:37 -04:00
c5b4de4184 wiki: full index refresh — 106 articles, 17 new since Apr 18
Updated article count (89 → 106), domain counts, per-section
listings, and Recently Updated table. Added all articles published
since 2026-04-18 including Pi-hole, Mastodon, fail2ban digest,
LoRA GGUF, Tailscale iOS, and more.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 16:45:30 -04:00
021c7f6539 Merge cowork/majorair/wiki-updates-may02 — fail2ban digest + netdata docker health + 3 new articles 2026-05-02 16:28:48 -04:00
4126656c05 wiki: update fail2ban digest + netdata docker health + 3 new articles
- fail2ban-digest-mode-fleet: recidive-only email model, sshd now silent,
  defaults-debian.conf gotcha added
- netdata-docker-health-alarm-tuning: 30m/10m config, tuning history table
- New: wp-fail2ban-logpath-debian-ubuntu, lora-adapter-gguf-conversion-fails,
  tailscale-status-json-hostname-localhost-ios
- Various article updates and nav index refreshes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 14:58:07 -04:00
264f1f64c3 Merge: wiki — 2026-04-30 SNI filter article update 2026-04-30 13:08:37 -04:00
74c4ed9959 wiki: 2026-04-30 update to ISP SNI filtering article
Re-diagnoses today's notes.majorshouse.com outage. Original framing
was "ISP filter expanded to include 'notes'" — but the actual root
cause was a stale A record pointing at 136.54.3.248 (not majorlab's
current home IP). Corrects the comparison table to show CNAMEs to
apex resolve to 136.56.0.55, and recommends a Cloudflare-proxied
CNAME as the durable shape so the apex follows home IP automatically
and ISP-level SNI weirdness is bypassed at the same time.

Includes the working CF API payload used to flip the record, and an
audit checklist for any new *.majorshouse.com subdomain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 13:08:36 -04:00
34cc5c3d0b Merge: wiki — LoRA→GGUF troubleshooting article 2026-04-30 11:24:38 -04:00
6e7a0ca21f wiki: add troubleshooting article — LoRA adapter GGUF conversion fails
Documents the gotcha where convert_hf_to_gguf.py crashes with
'config.json not found' because the training output directory holds
only the LoRA adapter, not a merged HF model. Includes inline
save_pretrained_merged() fix snippet, verification checklist, and
resume-pipeline-without-retraining pattern.

Discovered today during the MajorTwin v8c pipeline failure (Step 4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 11:22:59 -04:00
85f8a5df2d Merge pull request 'wiki: add troubleshooting article — iOS Tailscale clients report HostName="localhost"' (#1) from code/majormac/tailscale-ios-hostname-fix into main
Reviewed-on: #1
2026-04-30 10:00:01 -04:00
80 changed files with 7451 additions and 234 deletions

0
.githooks/pre-commit Normal file → Executable file

@ -10,7 +10,7 @@ tags:
- majorrig
status: published
created: 2026-03-16
updated: 2026-04-29T22:45
updated: 2026-04-30T05:21
---
# WSL2 Backup via PowerShell Scheduled Task


@ -0,0 +1,119 @@
---
title: WSL2 In-Place Upgrade to Fedora 44 (with gcc14 Blocker + CUDA Repo Swap)
domain: linux
category: distro-specific
tags:
- wsl2
- fedora
- windows
- upgrade
- dnf
- cuda
- majorrig
status: published
created: 2026-06-11
updated: 2026-06-11
---
# WSL2 In-Place Upgrade to Fedora 44 (with gcc14 Blocker + CUDA Repo Swap)
In-place upgrade of the FedoraLinux-43 WSL2 instance on MajorRig to Fedora 44 using `dnf system-upgrade` + `dnf5 offline reboot`. Hit one transaction blocker (`gcc14` compat package retired in F44) and swapped the stale `cuda-fedora39` repo to `cuda-fedora44` afterward. Performed 2026-06-11.
## The Short Answer
```powershell
# PowerShell — backup first
wsl --shutdown
wsl --export FedoraLinux-43 D:\backups\fedora43.tar
```
```bash
# Inside Fedora
sudo dnf upgrade --refresh -y
sudo shutdown -h now
# relaunch, then:
sudo dnf remove gcc14-c++ gcc14 # F44 dropped gcc14 — blocks the transaction
sudo dnf system-upgrade download --releasever=44
sudo dnf5 offline reboot # applies offline upgrade, shuts distro down
# wait a few minutes, relaunch:
cat /etc/fedora-release # → Fedora release 44 (Forty Four)
```
```powershell
# PowerShell — keep WSL itself current
wsl --update
```
## Steps
1. **Back up the instance** (PowerShell). The export tar is roughly the size of the installed system — this one was 86 GB. The target directory must already exist or you get `Wsl/ERROR_PATH_NOT_FOUND`.
```powershell
wsl --shutdown
mkdir D:\backups
wsl --export FedoraLinux-43 D:\backups\fedora43.tar
```
2. **Fully update the current release, then restart the distro**
```bash
sudo dnf upgrade --refresh -y
sudo shutdown -h now
```
3. **Remove upgrade blockers.** `gcc14`/`gcc14-c++` (compat packages) were retired in Fedora 44, so the transaction fails with "does not belong to a distupgrade repository". Remove them (or use `--allowerasing` and review the summary):
```bash
sudo dnf remove gcc14-c++ gcc14
```
4. **Download and apply the upgrade**
```bash
sudo dnf system-upgrade download --releasever=44
sudo dnf5 offline reboot
```
The "reboot" applies the offline transaction and shuts the distro down — there's no real systemd reboot in WSL. Wait a couple of minutes, then relaunch. If it errors on `systemctl`, the fallback is:
```bash
export DNF_SYSTEM_UPGRADE_NO_REBOOT=1
sudo -E dnf system-upgrade reboot
```
5. **Verify and tidy up**
```bash
cat /etc/fedora-release # Fedora release 44 (Forty Four)
sudo dnf upgrade --refresh # catch post-upgrade updates
gcc --version # F44 ships gcc 16; reinstall with `dnf install gcc gcc-c++` if removed
```
```powershell
wsl --update # fixes the post-upgrade Wsl/Service/E_UNEXPECTED catastrophic failure some users hit
```
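The `gcc14` blocker in step 3 only surfaced when the transaction failed. The following is a minimal sketch of a pre-flight check, assuming the only retired packages are the two named in this article; the function name `find_upgrade_blockers` and its package list are illustrative and not part of dnf.

```bash
# Hypothetical helper: print installed package names that this article
# identified as retired in Fedora 44. Extend the list as you find more.
find_upgrade_blockers() {
    # Reads one package name per line on stdin.
    # -F = fixed strings, -x = match the whole line only.
    grep -Fx -e 'gcc14' -e 'gcc14-c++'
}

# Usage (inside the Fedora instance):
#   rpm -qa --qf '%{NAME}\n' | find_upgrade_blockers
```

Any output means those packages should be removed (or reviewed under `--allowerasing`) before running `dnf system-upgrade download`.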
## CUDA Repo Swap
`dnf repolist` still showed `cuda-fedora39-x86_64` — NVIDIA repos are pinned per Fedora release and don't follow distro upgrades. NVIDIA publishes a fedora44 repo:
```bash
sudo rm /etc/yum.repos.d/cuda-fedora39*.repo
sudo dnf config-manager addrepo --from-repofile=https://developer.download.nvidia.com/compute/cuda/repos/fedora44/x86_64/cuda-fedora44.repo
sudo dnf upgrade --refresh
sudo dnf repolist # confirm cuda-fedora44-x86_64
```
**WSL caveat:** never install the NVIDIA *driver* inside WSL — the Windows host driver provides the GPU. Only install toolkit packages (e.g. `cuda-toolkit`).
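Because the repo is pinned per release, the same swap will likely be needed after the next upgrade. A small sketch that builds the repo-file URL from a release number, assuming NVIDIA keeps the URL pattern shown above; `cuda_repo_url` is a hypothetical name, and the function does not check that a repo for that release actually exists.

```bash
# Hypothetical helper: build the CUDA repo-file URL for a Fedora release,
# following the URL pattern used in this article. Whether NVIDIA publishes
# a repo for the given release is NOT verified here.
cuda_repo_url() {
    local releasever="$1"
    printf 'https://developer.download.nvidia.com/compute/cuda/repos/fedora%s/x86_64/cuda-fedora%s.repo\n' \
        "$releasever" "$releasever"
}

# Usage:
#   sudo dnf config-manager addrepo --from-repofile="$(cuda_repo_url 44)"
```

Fetching the URL with `curl -fsI` before adding it is a cheap way to confirm the repo exists for the target release.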
## Gotchas & Notes
- **Don't skip more than two releases** in one jump; for a bigger gap, upgrade in stages.
- **The WSL distro name is just a Windows label** — it still says "FedoraLinux-43" after the upgrade. Cosmetic fixes: Windows Terminal profile name, Start Menu shortcut, and `DistributionName`/`ShortcutPath` under `HKCU\Software\Microsoft\Windows\CurrentVersion\Lxss\{uuid}`.
- **Keep the backup tar** until the upgraded instance has proven stable for a few days, then delete to reclaim the space.
- **Restore path if needed:** `wsl --import FedoraRestore C:\WSL\FedoraRestore D:\backups\fedora43.tar` — remember imports default to root; fix via `/etc/wsl.conf` `[user] default=majorlinux`.
## See Also
- [WSL2 Instance Migration (Fedora 43)](wsl2-instance-migration-fedora43.md)
- [WSL2 Backup via PowerShell](wsl2-backup-powershell.md)


@ -23,7 +23,14 @@ A collection of guides covering Linux administration, shell scripting, networkin
- [Ansible Getting Started](shell-scripting/ansible-getting-started.md)
- [Bash Scripting Patterns](shell-scripting/bash-scripting-patterns.md)
## Storage
- [SnapRAID & MergerFS Storage Setup](storage/snapraid-mergerfs-setup.md)
- [mdadm — Rebuilding a RAID Array After Reinstall](storage/mdadm-raid-rebuild.md)
- [Growing an LVM Volume by Absorbing Another Disk](storage/lvm-grow-volume-absorb-disk.md)
## Distro-Specific
- [Linux Distro Guide for Beginners](distro-specific/linux-distro-guide-beginners.md)
- [WSL2 Instance Migration to Fedora 43](distro-specific/wsl2-instance-migration-fedora43.md)
- [WSL2 In-Place Upgrade to Fedora 44](distro-specific/wsl2-fedora44-inplace-upgrade.md)


@ -10,7 +10,7 @@ tags:
- remote-access
status: published
created: 2026-03-08
updated: 2026-04-22T09:20
updated: 2026-04-30T05:21
---
# SSH Config and Key Management


@ -0,0 +1,159 @@
---
title: "Growing an LVM Volume by Absorbing Another Disk"
domain: linux
category: storage
tags: [lvm, lvextend, vgextend, pvcreate, resize2fs, ext4, storage, disk, homelab]
status: published
created: 2026-06-17
updated: 2026-06-17
---
# Growing an LVM Volume by Absorbing Another Disk
When an LVM-backed filesystem fills up and its volume group (VG) has no free
extents, you can grow it by adding a second physical disk as a new physical
volume (PV), extending the VG onto it, then extending the logical volume (LV)
and its filesystem. With ext4 this can be done **online** — no unmount, no
downtime for the volume being grown.
This guide covers the common case where the disk you want to absorb is currently
in use by its own LVM volume (you must evacuate and tear that down first), and
the precautions that keep it safe.
> [!warning] This enlarges your failure domain
> A single LV spanning two disks linearly (the default — no RAID/mirror) means
> **losing either disk loses the entire volume.** ext4 has no parity. Only do
> this for data you can rebuild, or layer redundancy (mdadm/LVM RAID) underneath.
> Back up anything irreplaceable first.
## The Short Answer
If the target disk (`/dev/sdX`) is already empty and unused:
```bash
sudo pvcreate /dev/sdX
sudo vgextend myvg /dev/sdX
sudo lvextend -l +100%FREE /dev/myvg/mylv
sudo resize2fs /dev/mapper/myvg-mylv # ext4, online; use xfs_growfs for XFS
```
The rest of this article handles the harder case: the target disk is currently
holding its own LVM volume with data on it.
## Step-by-Step
### 1. Survey the current layout
```bash
sudo pvs # physical volumes → which VG each belongs to
sudo vgs # volume groups, free extents (VFree)
sudo lvs # logical volumes and sizes
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
df -h
```
Confirm:
- The VG you want to grow (`myvg`) has `0` `VFree` (that's why you're here).
- The disk you want to absorb (`/dev/sdX`) is a **standalone** PV — not a member
of an mdadm array, a mergerfs branch, or a SnapRAID parity disk. Repurposing a
disk that something else depends on will break that thing silently.
### 2. Evacuate the disk you're about to absorb
Anything on the target disk will be **destroyed**. Move it somewhere with room to
spare, then prove the copy is intact before you trust it.
```bash
# Copy preserving permissions/timestamps
sudo rsync -a /mnt/olddisk/important /destination/with/space/
# Verify byte-for-byte — empty output + exit code 0 means identical
sudo diff -rq /mnt/olddisk/important /destination/with/space/important && echo OK
```
For large trees the `diff -rq` (full byte comparison) is slow but is the
authoritative check — don't skip it before the destructive phase. If an
application tracks files by path (databases, media servers), update its path
references to the new location *now*, while the old copy still exists as a
fallback.
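The verification above can be wrapped so that a script stops before the destructive phase when the trees differ. This is a sketch only; `verify_copy` is a hypothetical name, and it relies on the same `diff -rq` comparison used in the step.

```bash
# Hypothetical wrapper around the diff -rq check: succeeds only when both
# directories exist and their contents are byte-for-byte identical.
verify_copy() {
    local src="$1" dst="$2"
    if [ ! -d "$src" ] || [ ! -d "$dst" ]; then
        echo "missing directory: $src or $dst" >&2
        return 2
    fi
    if diff -rq -- "$src" "$dst"; then
        echo "OK: $dst matches $src"
    else
        echo "MISMATCH: do not proceed to the destructive steps" >&2
        return 1
    fi
}

# Usage:
#   sudo bash -c "$(declare -f verify_copy); verify_copy /mnt/olddisk/important /destination/with/space/important"
```

Note that `diff -rq` compares file contents, not ownership or timestamps, so it confirms the data but not the metadata that `rsync -a` preserved.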
### 3. Unmount and remove the old disk from fstab
```bash
sudo fuser -m /mnt/olddisk # confirm nothing holds it open
sudo umount /mnt/olddisk
mountpoint -q /mnt/olddisk && echo "STILL MOUNTED" || echo "unmounted"
sudo cp /etc/fstab /etc/fstab.bak-$(date +%Y%m%d) # always back up fstab
sudo sed -i '/olddisk/d' /etc/fstab # remove the stale entry
grep olddisk /etc/fstab || echo "fstab line gone"
```
> [!tip] Verify your `sed` pattern only matches the line you mean
> A too-broad pattern can delete the wrong fstab entry. Check the file before and
> after, and keep the backup until you've confirmed the system still boots.
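One way to act on that tip is to preview the match before running `sed -i`. The sketch below assumes the pattern is a basic regular expression, as in the `sed` command above; `preview_fstab_delete` is a hypothetical name, and it only reads the file.

```bash
# Hypothetical guard: show which lines a pattern would delete and succeed
# only if it matches exactly one line. Reads the file; never modifies it.
preview_fstab_delete() {
    local pattern="$1" file="${2:-/etc/fstab}"
    local count
    # grep -c prints 0 and exits non-zero when nothing matches, hence || true.
    count="$(grep -c -- "$pattern" "$file" || true)"
    grep -n -- "$pattern" "$file" || true
    if [ "$count" -ne 1 ]; then
        echo "pattern matched $count lines; refusing" >&2
        return 1
    fi
}

# Usage:
#   preview_fstab_delete olddisk && sudo sed -i '/olddisk/d' /etc/fstab
```

Requiring exactly one match is a deliberately strict choice; relax it if the disk legitimately has several fstab entries.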
### 4. Tear down the old disk's LVM
```bash
sudo lvremove -y /dev/oldvg/oldlv
sudo vgremove -y oldvg
sudo pvremove -y /dev/sdX # wipes the LVM label off the disk
```
This is the point of no return for the old disk's data, which is why steps 2–3
verified the copy first.
### 5. Add the disk to the target VG and extend
```bash
sudo pvcreate -y /dev/sdX
sudo vgextend myvg /dev/sdX
sudo lvextend -l +100%FREE /dev/myvg/mylv
```
`lvs`/`vgs` should now show the LV grown to span both disks and `0` free extents.
### 6. Grow the filesystem (online)
```bash
# ext4 — works while mounted
sudo resize2fs /dev/mapper/myvg-mylv
# XFS — grows online too, but takes the mountpoint, not the device
sudo xfs_growfs /mountpoint
```
`resize2fs` is idempotent — if it gets interrupted, just run it again; it reports
"Nothing to do!" once the filesystem already fills the LV.
### 7. Verify
```bash
df -h /mountpoint # should reflect the new larger size
sudo pvs # /dev/sdX now listed under myvg
sudo vgs myvg # two PVs, larger VSize
```
## Notes & Gotchas
- **Online resize works for the volume being grown, not the one being removed.**
The disk you absorb must be unmounted and torn down; the destination LV stays
mounted throughout.
- **`resize2fs` interruption is safe.** ext4 online resize is journaled; re-run it.
- **macOS cruft on evacuated disks.** Trees touched by macOS often carry
`._*` AppleDouble files and `.DS_Store` — harmless to drop, but they inflate
file counts in `diff`/`rsync` output. Don't mistake them for real data.
- **Check SMART on a disk you're promoting into a bigger role.** A disk with a
pending-sector history is riskier once it's in the critical path for a whole
multi-disk volume than it was holding a small isolated one.
- **Mountpoint cleanup.** After the old disk is gone, its former mountpoint
directory may reappear (it was shadowed by the mount). `rmdir` it if empty.
Note `ls -A` exits `0` on an empty directory, so don't gate cleanup on its exit
status — test contents explicitly.
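For the mountpoint cleanup note, a sketch of an explicit emptiness test that checks what `ls -A` prints rather than how it exits; `dir_is_empty` is a hypothetical helper name.

```bash
# Hypothetical helper: true only if the directory exists and has no entries
# (dotfiles included). Tests the output of ls -A, not its exit status.
dir_is_empty() {
    local dir="$1"
    [ -d "$dir" ] && [ -z "$(ls -A -- "$dir")" ]
}

# Usage:
#   dir_is_empty /mnt/olddisk && sudo rmdir /mnt/olddisk
```

`rmdir` already refuses to remove a non-empty directory, so the test mainly keeps scripts quiet and makes the intent explicit.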
## Related
- [SnapRAID & MergerFS Storage Setup](snapraid-mergerfs-setup.md) — add redundancy/parity instead of a linear span
- [mdadm — Rebuilding a RAID Array After Reinstall](mdadm-raid-rebuild.md)


@ -5,7 +5,7 @@ category: cloud
tags: [aws, s3, cost, billing, mastodon, glacier]
status: published
created: 2026-04-19
updated: 2026-04-19
updated: 2026-06-01
---
# AWS S3 Cost Management
@ -17,24 +17,24 @@ The majorlinux AWS account is used exclusively for S3 object storage. This cover
- **Account ID:** `408469496267`
- **Account name:** majorlinux
- **Services in use:** S3 (Standard + Glacier Deep Archive), AWS Config, Cost Explorer
- **Monthly spend:** ~$32/mo (March 2026); expected ~$16/mo post-media-prune
- **Monthly spend:** ~$24/mo (May 2026, post-media-prune, post-STANDARD_IA revert)
## Buckets and Cost Drivers
| Bucket | Size | Storage Class | Cost/mo | Purpose |
|--------|------|---------------|---------|--------|
| `majortoot` | 648 GB (mostly remote cache) | S3 Standard | ~$15/mo | Mastodon media |
| `majorhomebackup` | 16 TiB | Glacier Deep Archive | ~$16/mo | MLS stream archives (sole copy) |
| `majortoot` | ~7 GB (one-time prune; automation disabled) | S3 Standard | ~$0.16/mo | Mastodon media |
| `majorhomebackup` | 16 TiB | Glacier Deep Archive | ~$11–12/mo | MLS stream archives (sole copy) |
| `config-bucket-*` | ~185 KB | S3 Standard | ~$0.00 | AWS Config snapshots |
## CLI Setup
AWS CLI installed on MajorMac via Homebrew. Credentials configured at `~/.aws/credentials`.
AWS CLI installed on MajorMac via Homebrew. Credentials for `MajorCLI` user at `~/.aws/credentials`.
```bash
brew install awscli
# Credentials pulled from Ansible vault:
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY in group_vars/all/vault.yml
# Credentials: MajorCLI IAM user (S3 + Billing read access)
# Key ID: AKIAV6GVN4HF4Y6EV4NM — created 2026-05-23
```
### Useful commands
@ -42,18 +42,22 @@ brew install awscli
```bash
# Check current month spend by service
aws ce get-cost-and-usage \
--time-period Start=2026-04-01,End=2026-04-30 \
--time-period Start=2026-05-01,End=2026-05-31 \
--granularity MONTHLY \
--metrics "UnblendedCost" \
--group-by Type=DIMENSION,Key=SERVICE
# Daily cost breakdown with top usage types
aws ce get-cost-and-usage \
--time-period Start=2026-05-01,End=2026-05-23 \
--granularity DAILY \
--metrics "UnblendedCost" \
--filter '{"Dimensions":{"Key":"SERVICE","Values":["Amazon Simple Storage Service"]}}' \
--group-by Type=DIMENSION,Key=USAGE_TYPE
# View anomaly alerts
aws ce get-anomalies \
--date-interval StartDate=2026-04-01,EndDate=2026-04-30
# Check conformance pack compliance
aws configservice get-conformance-pack-compliance-details \
--conformance-pack-name MajorConformance
--date-interval StartDate=2026-05-01,EndDate=2026-05-31
# List budgets
aws budgets describe-budgets --account-id 408469496267
@ -62,25 +66,52 @@ aws budgets describe-budgets --account-id 408469496267
## Budget Alert
`MajorS3MonthlyAlert` configured 2026-04-19:
- 80% threshold → email at $20 actual spend
- 100% threshold → email at $25 actual spend
- 80% threshold → email at $24 actual spend
- 100% threshold → email at $30 actual spend
- Recipient: maj.linux@gmail.com
> [!note] Thresholds updated 2026-05-23 to reflect actual ~$24/mo steady-state spend (was $20/$25, set when spend was higher due to large majortoot bucket before prune took effect).
## Cost Reduction Options
### majortoot — S3 Standard-IA
### majortoot — S3 Standard-IA (⚠️ DO NOT USE — tried and reverted)
Switching `S3_STORAGE_CLASS=STANDARD_IA` in Mastodon's `.env.production` reduces storage cost from $0.023/GB to $0.0125/GB for new uploads. Expected saving: ~$4–5/mo after cache is pruned down to local-only content.
**Attempted 2026-05 — reverted 2026-05-17. Do not retry without careful planning.**
See [[mastodon-instance-tuning]] for full instructions.
The theory: switching `S3_STORAGE_CLASS=STANDARD_IA` saves ~$4–5/mo on storage. In practice, the bulk avatar restore operation (`restore-avatars.sh`, May 9–10) ran while STANDARD_IA was active. The ~5,223 account refreshes across 1,095 domains generated ~470,000 SIA Tier 1 PUT requests ($4.72) plus early-deletion fees ($1.21) when the objects were replaced after reverting to STANDARD on May 17.
### majortoot — Weekly media prune
**STANDARD_IA is only economical if:**
- The bucket has no large bulk-write operations (media cache rebuilds, avatar restores)
- Objects are written and left for >30 days (early deletion incurs minimum 30-day fee)
- The per-request cost ($0.01/1,000 for SIA vs $0.005/1,000 for Standard) doesn't offset storage savings
Weekly cron deployed (`0 3 * * 0`) via `configure_mastodon_media_prune.yml`. Removes remote federated cache older than 7 days. Expected to reduce bucket from 648 GB to ~7 GB over time.
With the weekly prune now running correctly and the bucket shrinking toward ~7 GB, the storage savings of SIA are negligible (~$0.05/mo). **Leave at STANDARD.**
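The trade-off can be checked with a couple of lines of arithmetic. This is a rough sketch using only the per-GB and per-request prices quoted in this article; the function names are hypothetical, and real bills also include retrieval and early-deletion charges that are not modeled here.

```bash
# Rough monthly storage saving of Standard-IA over Standard, in USD,
# using the per-GB prices quoted in this article ($0.023 vs $0.0125).
sia_storage_saving() {
    local gb="$1"
    awk -v gb="$gb" 'BEGIN { printf "%.2f\n", gb * (0.023 - 0.0125) }'
}

# Cost of N PUT requests at the SIA rate quoted above ($0.01 per 1,000).
sia_put_cost() {
    local requests="$1"
    awk -v n="$requests" 'BEGIN { printf "%.2f\n", n / 1000 * 0.01 }'
}

sia_storage_saving 7       # monthly saving at the current ~7 GB bucket size
sia_put_cost 470000        # PUT cost of one bulk restore like the May one
```

With these inputs the saving comes to a few cents a month while a single bulk restore costs several dollars in PUT requests, the same order of magnitude as the figures reported above.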
### majortoot — media pruning (automation DISABLED 2026-06-01)
A weekly prune cron (`0 3 * * 0`, via `configure_mastodon_media_prune.yml`) **used to** run `tootctl media remove --days=7`. It shrank the bucket from 648 GB to ~7 GB — a one-time cleanup of years of accumulated remote **attachment** cache, which is safe and accounts for the bulk of the savings above.
**That automation was removed 2026-06-01.** The same playbook also carried a monthly `tootctl accounts refresh --all`, and automated profile pruning (plus a storage-level deletion during the cost-cull/migration) repeatedly broke remote avatars. The playbook is now an *enforce-absent* guard, and a [synthetic upload health check](../services/mastodon-s3-acl-upload-failures.md) alerts if media serving/uploads regress. See [[mastodon-prune-profiles-trap]] and [[mastodon-s3-acl-upload-failures]].
**Going forward:** the bucket is already small (~7 GB) and attachment cache re-accumulates slowly. If it ever grows enough to matter, run an **attachment-only** prune **manually and deliberately** (`bin/tootctl media remove --days=30`) — never automate profile/header pruning or `accounts refresh --all`.
### majorhomebackup — Self-host consideration
Deep Archive at $0.00099/GB is the cheapest cloud tier; no cloud alternative is cheaper. If the MLS archives are no longer needed, deletion would save ~$16/mo. A 20 TB HDD (~$300–400) would break even in ~2 years vs. continued cloud storage. **These are the sole copy. Do not delete without a separate backup.**
Deep Archive at $0.00099/GB is the cheapest cloud tier; no cloud alternative is cheaper. If the MLS archives are no longer needed, deletion would save ~$11–12/mo. A 20 TB HDD (~$300–400) would break even in ~2.5 years vs. continued cloud storage. **These are the sole copy. Do not delete without a separate backup.**
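The break-even estimate is a simple division. A sketch, assuming a mid-range drive price of $350 and $11.50/mo of cloud spend (both illustrative midpoints of the ranges above); it ignores power, drive failure, and the cost of a second copy, and `breakeven_years` is a hypothetical name.

```bash
# Years until a one-time HDD purchase costs less than continued cloud storage.
# Inputs: drive price in USD, monthly cloud cost in USD.
breakeven_years() {
    local hdd_cost="$1" monthly_cloud="$2"
    awk -v c="$hdd_cost" -v m="$monthly_cloud" 'BEGIN { printf "%.1f\n", c / m / 12 }'
}

breakeven_years 350 11.5   # → 2.5
```

A single local disk is still a single copy, so the comparison only holds if the self-hosted option includes its own backup.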
## IAM Users
| User | Scope | Credentials location | Notes |
|------|-------|---------------------|-------|
| `MajorToot` | S3 full (MajorsHouse group) | `~/.aws/credentials` on majortoot | Key rotated 2026-05-23 |
| `MajorHome` | S3 full (MajorsHouse group) | `~/.aws/credentials` on majorhome | Key pending rotation (see below) |
| `MajorCLI` | S3 full + Billing read (MajorsHouse group + AWSBillingReadOnlyAccess) | `~/.aws/credentials` on MajorMac | Created 2026-05-23, replaces root key |
> [!warning] Root access keys deleted 2026-05-23. Do NOT create new root access keys. Use `MajorCLI` for CLI work on MajorMac. The root account password (in Vaultwarden) is sufficient for console access.
> [!warning] MajorHome key (`AKIAV6GVN4HF7POCNW6D`) exposed in shell session 2026-05-23. Rotate via AWS Console → IAM → Users → MajorHome → Security credentials. Update `~/.aws/credentials` on majorhome afterward.
> [!note] `MajorCLI` does not have IAM permissions. Future key rotation requires AWS Console login or temporary IAM policy attachment. Consider adding a `SelfManageKeys` inline policy to `MajorCLI` via console.
## Conformance Pack
@ -92,15 +123,35 @@ Deep Archive at $0.00099/GB is the cheapest cloud tier — no cloud alternative
Evaluations cost $0.001 each and run on a periodic schedule. Safe to ignore; at current scale costs pennies per month.
## IAM Users
| User | Scope | Credentials location |
|------|-------|---------------------|
| `MajorToot` | S3 only — no billing/Cost Explorer | `~/.aws/credentials` on majortoot |
| Root | Full access | `~/.aws/credentials` on MajorMac (configured 2026-04-19) |
## CloudTrail Audit Logging
`MajorTrail` configured 2026-05-23:
- **S3 bucket:** `majorcloudtrail-408469496267`
- **Multi-region:** yes — captures API calls across all regions
- **Global service events:** yes — includes IAM, STS, S3 control plane
- **Log file validation:** enabled — tamper detection via digest files
- **Retention:** logs accumulate in S3; no automatic expiry configured
Use CloudTrail to investigate unexpected cost spikes, IAM key usage, and bucket write activity. Without it, historical API calls are unrecoverable (learned the hard way from the May 2026 SIA spike investigation).
```bash
# List recent CloudTrail events (last 1h, S3 writes only)
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=EventName,AttributeValue=PutObject \
--start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
--query 'Events[].{Time:EventTime,User:Username,Resource:Resources[0].ResourceName}' \
--output table
# Look up events by specific access key
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=AccessKeyId,AttributeValue=AKIAV6GVN4HF3BWAIAGC \
--output table
```
## Related
- [[Services/AWS]] — infrastructure record
- [[mastodon-instance-tuning]] — media cache management
- [[mastodon-prune-profiles-trap]] — avatar restore incident + bulk-restore procedure
- [[mastodon-s3-acl-upload-failures]] — silent upload failures on ACL-disabled buckets
- [[majortoot]] — Mastodon host


@ -0,0 +1,98 @@
---
title: VPS Migration Baseline Checklist
description: What to verify after migrating a server to a new provider — the packages, services, and configs that must match the old box
tags:
- migration
- vps
- hetzner
- digitalocean
- ansible
- checklist
status: published
created: 2026-05-09
updated: 2026-05-13T10:35
---
# VPS Migration Baseline Checklist
When migrating a server from one VPS provider to another, it's easy to focus on the application (bots, web services, databases) and forget the infrastructure baseline. This checklist covers the common components that make a server operational beyond just running the app.
## Background
During the Hetzner migration (2026-05), `majordiscord` was migrated with only the application layer (PhantomBot, Red-DiscordBot) and core infrastructure (Netdata, Tailscale, fail2ban). Missing from the new box: Postfix (email relay), logwatch, ClamAV, and dnf-automatic. The gap went unnoticed for a week because all monitoring email depended on the missing Postfix.
## The Checklist
### Before Migration
Power on both old and new boxes. Run this comparison to find gaps:
```bash
# Fedora — list baseline packages on both hosts
ssh root@OLD_HOST 'rpm -qa --qf "%{NAME}\n" | sort | grep -iE "fail2ban|logwatch|postfix|netdata|clamav|dnf-auto|tailscale|cronie|firewalld"'
ssh root@NEW_HOST 'rpm -qa --qf "%{NAME}\n" | sort | grep -iE "fail2ban|logwatch|postfix|netdata|clamav|dnf-auto|tailscale|cronie|firewalld"'
# Ubuntu — list baseline packages on both hosts
ssh root@OLD_HOST 'dpkg -l | grep -iE "fail2ban|logwatch|postfix|netdata|clamav|unattended|tailscale" | awk "{print \$2}" | sort'
ssh root@NEW_HOST 'dpkg -l | grep -iE "fail2ban|logwatch|postfix|netdata|clamav|unattended|tailscale" | awk "{print \$2}" | sort'
```
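Comparing the two listings by eye is how the Postfix gap was missed. A sketch that prints only what the old host has and the new one lacks, assuming each listing was saved to a file with one package name per line; `missing_on_new` and the file names are hypothetical.

```bash
# Hypothetical helper: print package names present in the old host's list
# but absent from the new host's list (one name per line in each file).
missing_on_new() {
    local old_list="$1" new_list="$2"
    # -F fixed strings, -x whole-line match, -v invert: keep old-only names.
    sort -u "$old_list" | grep -Fxv -f "$new_list"
}

# Usage (Fedora):
#   ssh root@OLD_HOST 'rpm -qa --qf "%{NAME}\n"' > old.txt
#   ssh root@NEW_HOST 'rpm -qa --qf "%{NAME}\n"' > new.txt
#   missing_on_new old.txt new.txt
```

Empty output (and a non-zero exit status from `grep`) means nothing from the old list is missing on the new host.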
Compare enabled services:
```bash
ssh root@HOST 'systemctl list-unit-files --state=enabled --no-pager | grep -iE "fail2ban|logwatch|postfix|netdata|clamav|dnf-auto|tailscale|cronie|firewalld|sshd"'
```
### Baseline Components
Every server in the fleet should have these. Check each one after migration:
| Component | Package (Fedora) | Package (Ubuntu) | Ansible Playbook | Notes |
|-----------|-----------------|------------------|------------------|-------|
| Monitoring | `netdata` | `netdata` | `netdata.yml` | Claim to Netdata Cloud if applicable |
| VPN | `tailscale` | `tailscale` | — (manual join) | Rename node in Tailscale admin |
| Intrusion prevention | `fail2ban` | `fail2ban` | `harden.yml` | Check jail.local, banaction matches firewall |
| Email relay | `postfix` | `postfix` | `configure_postfix_relay.yml` | Required by logwatch, Netdata, fail2ban |
| Log summaries | `logwatch` | `logwatch` | `logwatch.yml` | Override file, not defaults — see [logwatch fleet setup](../monitoring/logwatch-fleet-setup.md) |
| Firewall | `firewalld` | `ufw` | `configure_firewall_*.yml` | Verify fail2ban banaction matches |
| Cron | `cronie` | `cron` | — (usually pre-installed) | Required by logwatch |
| Auto-updates | `dnf-automatic` | `unattended-upgrades` | `ansible-unattended-upgrades-fleet` | Security patches only |
| Antivirus | `clamav` | `clamav` | `clamav.yml` (clamav role) | Internet-facing hosts only |
| SSH hardening | `openssh-server` | `openssh-server` | `ssh_hardening.yml` (ssh_hardening role) | Key-only, no root password |
| Timezone | — | — | — | US servers: `America/New_York`; UK: `Europe/London`. Hetzner defaults to UTC. |
| CA bundle (Fedora) | `ca-certificates` | `ca-certificates` | — | Verify `/etc/pki/tls/certs/ca-bundle.crt` symlink exists — see [Fedora CA bundle fix](../../05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md) |
| Syslog (Fedora) | `rsyslog` | — (pre-installed) | — | Fedora 44 Hetzner images have journald only. Logwatch needs `/var/log/messages` + `/var/log/secure`. |
### After Migration
1. **Set the timezone**: `timedatectl set-timezone America/New_York` (US) or `Europe/London` (UK). Hetzner images default to UTC.
2. **Set the system hostname** — Hetzner provisions the box as `<host>-hetzner`. Run `hostnamectl set-hostname <host>` and fix the loopback line: `sed -i "s/127.0.1.1.*/127.0.1.1 <host> <host>/" /etc/hosts`. Skip this and **Logwatch emails arrive titled `Logwatch for <host>-hetzner`** weeks later. Do it alongside the Tailscale node rename and Postfix `myhostname` — all three read from the provisioning label. See [Logwatch wrong hostname after migration](../../05-troubleshooting/logwatch-wrong-hostname-after-migration.md).
3. **Verify CA bundle (Fedora)**: `ls /etc/pki/tls/certs/ca-bundle.crt`. If missing, Postfix TLS, curl, and dnf will all fail silently. See [Fedora CA bundle fix](../../05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md).
4. **Run `harden.yml` against the new host** — catches most gaps in one pass
5. **Send a test email**: `echo test | mail -s "test" marcus@majorshouse.com`. If this fails, nothing else can alert you.
6. **Verify crond is running**: `systemctl is-active crond` (Fedora) or `systemctl is-active cron` (Ubuntu). cronie can be `enabled` but not `active` after provisioning.
7. **Check Netdata Cloud** — verify the new node appears and alerts are flowing
8. **Compare fail2ban jails**: `fail2ban-client status` on both old and new
9. **Verify logwatch sends**: `sudo logwatch --output mail --range today`
10. **Keep the old box powered off but not destroyed** for at least 7 days after remediation
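The hostname step is the easiest one to forget, so it can help to script the check. The following is a rough sketch, assuming the provisioning suffix is `-hetzner` as described in step 2; the function name `has_provisioning_suffix` is hypothetical.

```bash
# Sketch: succeed if a hostname still ends with the provisioning suffix.
# The default suffix "-hetzner" comes from step 2; pass another as $2 if needed.
has_provisioning_suffix() {
  host="$1"
  suffix="${2:--hetzner}"
  case "$host" in
    *"$suffix") return 0 ;;
    *) return 1 ;;
  esac
}

# Check the live hostname of the machine this runs on
if has_provisioning_suffix "$(uname -n)"; then
  echo "hostname still carries the provisioning suffix; run hostnamectl set-hostname"
fi
```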
### Using doctl to Manage Old Droplets
```bash
# Authenticate (token from Ansible vault)
cd ~/MajorAnsible
ansible-vault view group_vars/all/vault.yml | grep vault_do_oauth_token | awk '{print $2}' | xargs doctl auth init --access-token
# List droplets
doctl compute droplet list --format Name,ID,Status,PublicIPv4
# Power on for comparison
doctl compute droplet-action power-on DROPLET_ID
# Power off when done
doctl compute droplet-action power-off DROPLET_ID
```
## Lesson Learned
Application migration is not server migration. The app can work perfectly while the monitoring, alerting, and email infrastructure is completely broken. Always compare the full package baseline between old and new boxes before calling a migration complete.


@ -5,7 +5,7 @@ category: dns-networking
tags: [tailscale, networking, infrastructure, dns, vpn]
status: published
created: 2026-04-02
updated: 2026-04-02
updated: 2026-05-19
---
# 🌐 Network Overview
@ -19,12 +19,13 @@ The **MajorsHouse** infrastructure is connected via a private **Tailscale** mesh
## 🌍 Geographic Nodes
| Host | Location | IP | OS |
|---|---|---|---|
| `dcaprod` | 🇺🇸 US | 100.104.11.146 | Ubuntu 24.04 |
| `majortoot` | 🇺🇸 US | 100.110.197.17 | Ubuntu 24.04 |
| `majorhome` | 🇺🇸 US | 100.120.209.106 | Fedora 43 |
| `teelia` | 🇬🇧 UK | 100.120.32.69 | Ubuntu 24.04 |
| Host | Location | IP | OS | Notes |
|---|---|---|---|---|
| `dcaprod` | 🇺🇸 US | 100.104.11.146 | Ubuntu 24.04 | DO droplet — live until ~2026-05-22 |
| `dcaprod-hetzner` | 🇺🇸 US | 100.98.223.93 | Ubuntu 24.04 | Hetzner CPX21 — migration target; DNS cutover ~May 22 |
| `majortoot` | 🇺🇸 US | 100.110.197.17 | Ubuntu 24.04 | |
| `majorhome` | 🇺🇸 US | 100.120.209.106 | Fedora 43 | |
| `teelia` | 🇬🇧 UK | 100.120.32.69 | Ubuntu 24.04 | |
## 🔗 Tailscale Setup
@ -35,4 +36,4 @@ Tailscale is configured as a persistent service on all nodes. Key features used
- **ACLs:** Managed via the Tailscale admin console to restrict cross-group communication where necessary.
---
*Last updated: 2026-03-04*
*Last updated: 2026-05-19*


@ -7,7 +7,7 @@ tags:
- asus
- ssh
created: 2026-04-19
updated: 2026-04-29T22:45
updated: 2026-04-30T05:21
---
# Wake-on-LAN via Router SSH


@ -1,6 +1,6 @@
---
created: 2026-04-13T10:15
updated: 2026-04-29T22:45
updated: 2026-05-31
---
# 🏠 Self-Hosting & Homelab
@ -30,6 +30,19 @@ Guides for running your own services at home, including Docker, reverse proxies,
- [Tuning Netdata Docker Health Alarms](monitoring/netdata-docker-health-alarm-tuning.md)
- [Deploying Netdata to a New Server](monitoring/netdata-new-server-setup.md)
## Services
- [Mastodon Instance Tuning](services/mastodon-instance-tuning.md)
- [Mastodon Post-Install Hardening (Permissions + Account)](services/mastodon-post-install-hardening.md)
- [Mastodon DB Maintenance](services/mastodon-db-maintenance.md)
- [Mastodon Federation](services/mastodon-federation.md)
- [Mastodon `--prune-profiles` Trap](services/mastodon-prune-profiles-trap.md)
- [Mastodon on S3 — Silent Upload Failures](services/mastodon-s3-acl-upload-failures.md)
- [Mastodon — Triaging Crowdfunding / Mention-Spam Accounts](services/mastodon-mention-spam-crowdfunding.md)
- [Ghost SMTP via Mailgun](services/ghost-smtp-mailgun-setup.md)
- [Updating n8n Docker](services/updating-n8n-docker.md)
- [Claude Code Remote Control](services/claude-code-remote-control.md)
## Security
- [Linux Server Hardening Checklist](security/linux-server-hardening-checklist.md)


@ -0,0 +1,296 @@
---
title: Logwatch Fleet Setup — Surviving Package Upgrades
description: Configure logwatch on mixed Debian/Fedora fleets so settings survive package upgrades
tags:
- logwatch
- monitoring
- ansible
- fedora
- ubuntu
status: published
created: 2026-05-09
updated: 2026-05-13T10:35
---
# Logwatch Fleet Setup — Surviving Package Upgrades
Logwatch ships with a defaults file at `/usr/share/logwatch/default.conf/logwatch.conf`. On Fedora, package upgrades **silently reset** this file — wiping any customizations. The fix is to put all settings in the **local override file** at `/etc/logwatch/conf/logwatch.conf`, which is never touched by package managers.
## The Problem
Fedora 44's logwatch 7.14-1 upgrade (April 2026) reset `Output` from `mail` back to `stdout` in the defaults file. Servers that had been emailing daily reports for months went silent with zero errors. `rpm -V logwatch` shows the defaults file was modified (`S.5....T.`), but there's no warning during upgrade.
Ubuntu is less affected because its `/etc/cron.daily/00logwatch` script passes `--output mail` explicitly, overriding the config. Fedora's cron script does not.
## The Fix
Write all settings to the **override file** (`/etc/logwatch/conf/logwatch.conf`):
```ini
# Managed by Ansible — do not edit manually.
# Local overrides — survives package upgrades.
Output = mail
MailTo = marcus@majorshouse.com
MailFrom = Logwatch@hostname.majorshouse.com
Detail = Low
```
Key settings:
| Setting | Value | Why |
|---------|-------|-----|
| `Output` | `mail` | Must be `mail`, not `stdout`. Fedora's cron script doesn't pass `--output mail` like Ubuntu's does. |
| `MailTo` | recipient address | Where reports go. |
| `MailFrom` | per-host sender | Makes it easy to identify which server sent the report. |
| `Detail` | `Low` | Keeps emails scannable. Raise to `Med` or `High` for debugging. |
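Since the failure mode here is silent, it may be worth asserting the `Output` value rather than eyeballing it. A minimal sketch, assuming simple `Key = value` lines and that the last matching line is the one that counts (an assumption about precedence, not something verified against logwatch's parser); `logwatch_output` is an illustrative helper name.

```bash
# Sketch: print the Output value found in a logwatch config file.
# Assumes "Key = value" lines; if several match, the last one is printed.
logwatch_output() {
  conf="$1"
  sed -n 's/^[[:space:]]*[Oo]utput[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*/\1/p' "$conf" | tail -n 1
}

# Example: warn when the override file does not force mail output
#   [ "$(logwatch_output /etc/logwatch/conf/logwatch.conf)" = "mail" ] || echo "Output is not mail"
```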
## Ansible Playbook
The `logwatch.yml` playbook handles both OS families:
```yaml
- name: Install and configure logwatch
  hosts: all
  become: true
  gather_facts: true
  tasks:
    - name: Install logwatch (Debian/Ubuntu)
      ansible.builtin.apt:
        name: logwatch
        state: present
      when: ansible_facts['os_family'] == "Debian"
    - name: Install logwatch (Fedora)
      ansible.builtin.dnf:
        name: logwatch
        state: present
      when: ansible_facts['os_family'] == "RedHat"
    - name: Ensure logwatch override directory exists
      ansible.builtin.file:
        path: /etc/logwatch/conf
        state: directory
        mode: '0755'
    - name: Configure logwatch override (survives package upgrades)
      ansible.builtin.copy:
        dest: /etc/logwatch/conf/logwatch.conf
        mode: '0644'
        content: |
          # Managed by Ansible - do not edit manually.
          Output = mail
          MailTo = {{ logwatch_email }}
          MailFrom = Logwatch@{{ inventory_hostname }}.majorshouse.com
          Detail = Low
```
Include it in `harden.yml` so every new server gets logwatch as part of the baseline.
## Verifying
After deploying, test immediately:
```bash
# Verify crond is actually running — cronie can be "enabled" but not "active"
systemctl is-active crond # Fedora
systemctl is-active cron # Ubuntu
# If inactive, start it
sudo systemctl start crond
# Then test logwatch manually
sudo logwatch --output mail --range today
```
Check that the email arrives. If it doesn't, verify:
1. **crond is running** — if `inactive`, cron.daily never fires and logwatch never runs. No errors anywhere.
2. **Postfix is installed and relaying** — logwatch depends on a working local MTA.
3. **CA bundle exists (Fedora)** — missing `/etc/pki/tls/certs/ca-bundle.crt` breaks Postfix TLS relay. See [Fedora CA bundle fix](../../05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md).
## Diagnosing Silent Failures
```bash
# Check if the defaults file was modified by a package upgrade
rpm -V logwatch # Fedora
dpkg -V logwatch # Debian
# Look for S.5....T. on the defaults file — means it was replaced
# S = size, 5 = md5, T = timestamp changed
# Check if logwatch produces any output at all
logwatch --output stdout --range yesterday | wc -l
# If 0 lines — logwatch has no log data to report (see rsyslog section below)
```
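When a host has many verification lines, filtering for the size-plus-digest pattern makes the replaced files stand out. This is a sketch under the assumption that the flags are the first whitespace-separated field and the path is the last (paths containing spaces would break it); `replaced_files` is a hypothetical helper.

```bash
# Sketch: read `rpm -V` style lines on stdin and print paths whose
# size (S, position 1) and digest (5, position 3) both changed.
replaced_files() {
  awk 'substr($1, 1, 1) == "S" && substr($1, 3, 1) == "5" { print $NF }'
}

# Example usage on a Fedora host:
#   rpm -V logwatch | replaced_files
```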
## Fedora: rsyslog Missing — Logwatch Produces Zero Output
Fedora 44 cloud images (Hetzner, possibly others) ship with **journald only** — no rsyslog. This means `/var/log/messages`, `/var/log/secure`, and `/var/log/cron` do not exist. Logwatch scans those files, finds nothing, produces empty output, and sends no email. Exit code is still 0 — no error anywhere.
This is particularly insidious because everything else can be correct (crond running, postfix relaying, logwatch config pointing to the right recipient) and you'll still get silence.
```bash
# Diagnose
rpm -q rsyslog # "package rsyslog is not installed"
ls /var/log/messages # "No such file or directory"
# Fix
dnf install -y rsyslog
systemctl enable --now rsyslog
# Verify log files appear
ls /var/log/messages /var/log/secure /var/log/cron
# Test logwatch
logwatch --output stdout --range today | wc -l # should be >0
```
## Fedora CA Bundle Missing — Postfix TLS Engine Unavailable
If the Fedora half of your fleet is silent but the Debian/Ubuntu half is fine, and your relayhost requires TLS, suspect a missing CA bundle. Symptom on the sending host:
```
postfix/error: status=deferred (delivery temporarily suspended:
TLS is required, but our TLS engine is unavailable)
```
The tell that this is the CA bundle and not a postfix-internal problem: **dnf and curl are also broken on the box.** Run any `sudo dnf list` / `sudo curl https://...` and look for:
```
Curl error (77): Problem with the SSL CA cert (path? access rights?)
[error adding trust anchors from file: /etc/pki/tls/certs/ca-bundle.crt]
```
That's the same path postfix's `smtp_tls_CAfile` defaults to. Every TLS client on the box is failing because a single symlink is missing.
### Diagnosis
```bash
# Is the consumer-path symlink there?
ls -la /etc/pki/tls/certs/ca-bundle.crt
# Expected: lrwxrwxrwx ... -> /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
# Is the extracted bundle itself intact?
ls -la /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
sudo grep -c 'BEGIN CERTIFICATE' /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
# Expected: ~140-150 certs, ~220 KB
```
If the extracted bundle exists but the consumer-path symlink is gone, you've found it. `update-ca-trust extract` regenerates the `extracted/` paths but does **not** recreate the upstream-style symlink at `/etc/pki/tls/certs/ca-bundle.crt` — that symlink is shipped by the `ca-certificates` package and can be lost during a partial upgrade or a stray `rm`.
### Fix
```bash
sudo ln -sfn /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem \
/etc/pki/tls/certs/ca-bundle.crt
sudo systemctl reload postfix
sudo postqueue -f # drain deferred mail
```
Verify with `sudo grep -c 'BEGIN CERTIFICATE' /etc/pki/tls/certs/ca-bundle.crt` (should match the extracted bundle's count) and `sudo dnf list --installed postfix` (should no longer show the curl error).
### Audit the rest of the Fedora fleet
Once you find one host with this issue, check the others — package events that broke one box may have broken its siblings:
```bash
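# Replace the placeholder list with your own Fedora hosts (e.g. from your inventory)
FEDORA_HOSTS="fedora-host-1 fedora-host-2"
for host in $FEDORA_HOSTS; do
  echo "$host: $(ssh "$host" 'ls /etc/pki/tls/certs/ca-bundle.crt 2>&1' | tail -1)"
done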
```
Hosts returning "No such file or directory" are silently broken. They won't fail loudly until something asks them to do TLS — which on a small homelab might be never until logwatch tries to mail you weeks later.
### Methodology note: postfix logs differ between distros
Don't trust a single log source when surveying a mixed fleet. **Fedora and majormail log postfix to journald** (`journalctl -u postfix`); **Debian/Ubuntu log to `/var/log/mail.log`** (and rotated `mail.log.1` / `mail.log.*.gz`). Querying journalctl on Ubuntu returns "no entries" even when mail is flowing — easy way to declare a working host broken. Always run `tail /var/log/mail.log` on Debian-family hosts and `journalctl -u postfix` on Fedora-family hosts.
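One way to avoid querying the wrong source is to let the host tell you which one it uses. The sketch below assumes that the presence of `/var/log/mail.log` is a good enough signal for a Debian-family host; the function name `postfix_log_source` and its optional root-prefix argument are illustrative only.

```bash
# Sketch: decide where to look for postfix logs on this host.
# $1 is an optional path prefix, only useful when inspecting a copied tree.
postfix_log_source() {
  root="${1:-}"
  if [ -f "$root/var/log/mail.log" ]; then
    echo "mail.log"
  else
    echo "journald"
  fi
}

# Print the command that matches this host's log source
case "$(postfix_log_source)" in
  mail.log) echo "use: tail -n 50 /var/log/mail.log" ;;
  journald) echo "use: journalctl -u postfix -n 50 --no-pager" ;;
esac
```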
## Bounce-source addresses must be real mailboxes
A subtle related class of bug: services like Watchtower, fail2ban, cron, and Netdata default to sending notifications **from** an identity that doesn't exist as a recipient — `watchtower@majorshouse.com`, `fail2ban@<host>.majorshouse.com`, `root@<host>.localdomain`. While the relayhost is healthy, nobody notices. The moment any delivery fails (network blip, recipient typo, queue overflow, the CA bundle bug above), the local MTA tries to bounce the original message back to that sender — finds no mailbox — and the bounce itself bounces. You get MAILER-DAEMON queue churn and `5.7.1 Relay access denied` rejections in your mail server logs.
Fix it once at the source: set `WATCHTOWER_NOTIFICATION_EMAIL_FROM`, fail2ban's `sender =`, and similar to a **real mailbox** on your mail server (e.g., `marcus@majorshouse.com`). Bounces then land somewhere a human can read them, and the noise disappears.
## Per-host config drift on cloud-image-derived servers
When fleet hosts are spun up from images (DigitalOcean droplet snapshots, Packer artifacts, cloud-init templates), three specific config drift patterns silently break notification mail. Each one looks fine in isolation; the combination produces "mail leaves the host with `250 OK queued` and disappears."
### 1. Packer/snapshot-leftover `myhostname` in postfix
A host built from a Packer-baked image often has `postfix myhostname = packer-<uuid>` baked into `main.cf` from the build process. The system hostname might have been correctly set by terraform/cloud-init at first boot, but postfix's `myhostname` was hardcoded during image build and was never overridden. Result: every outbound message-id and EHLO carries the Packer artifact name (e.g., `<20260509120011.7EB6ABD83C@packer-641079bc-bc17-b5e1-1425-be745d012d0b>`), no SPF/DKIM matches that name, and remote spam filters score it as suspicious.
**Detect:**
```bash
postconf myhostname | grep -E 'packer-|builder-|<image-build-prefix>'
```
**Fix:**
```bash
hostnamectl set-hostname <real-fqdn>
postconf -e 'myhostname = <real-fqdn>'
sed -i '/^127\.0\.1\.1/d' /etc/hosts && \
echo "127.0.1.1 <real-fqdn> <short-name>" >> /etc/hosts
systemctl reload postfix
```
> [!tip] Same drift, different symptom: the Logwatch **title**
> Hetzner provisions boxes with `<host>-hetzner` as the *system* hostname. When that's never corrected, Logwatch (which reads the live hostname at runtime) mails reports titled `Logwatch for <host>-hetzner` — no postfix involvement needed. Same `hostnamectl set-hostname` + `/etc/hosts` fix as above. See [Logwatch wrong hostname after migration](../../05-troubleshooting/logwatch-wrong-hostname-after-migration.md).
### 2. Empty `relayhost` quietly forces public-MX delivery
If `postconf relayhost` returns an empty value, postfix doesn't fail — it just does an MX lookup for the destination domain and tries to deliver directly. For mail to your own mail server, that means going via the **public MX** (the domain's external MX record, e.g., `mail.majorshouse.com → 203.0.113.10:25`) instead of the **internal/Tailscale relay path** the rest of the fleet uses.
The public-MX path is subject to whatever spam filtering, content checks, and trust rules the receiving MX has configured for external traffic. Internal Tailscale-IP traffic typically gets a faster trust shortcut (e.g., bypass spamchk pipe). So this single configuration drift causes one host's mail to land in a different code path than its siblings — and then silently get filtered.
**Detect:** look for fleet hosts where `postconf relayhost` returns blank and compare to known-good siblings.
**Fix:** set `relayhost = [<mailserver-tailscale-ip>]:587` (or whatever port your fleet convention uses).
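The detection step can be scripted so that a blank value is flagged rather than overlooked. A minimal sketch, assuming the input is a single `name = value` line as printed by `postconf relayhost`; `relayhost_is_blank` is a hypothetical helper name.

```bash
# Sketch: succeed if a "relayhost = ..." line carries an empty value.
# $1 is expected to be one line in the form printed by `postconf relayhost`.
relayhost_is_blank() {
  # Strip everything up to the first "=", then trim surrounding whitespace
  value=$(printf '%s\n' "$1" | sed 's/^[^=]*=[[:space:]]*//; s/[[:space:]]*$//')
  [ -z "$value" ]
}

# Example usage on a fleet host:
#   relayhost_is_blank "$(postconf relayhost)" && echo "relayhost is blank: mail goes via public MX"
```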
### 3. Stale SASL passwd map referencing a missing file
Postfix configurations migrated from a previous setup often retain `smtp_sasl_auth_enable = yes` and `smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd` even when no SASL is needed for the current relay path. If the actual `sasl_passwd` file isn't there (because the migration didn't carry it, or the new relay doesn't require auth), every send attempt produces:
```
error: open database /etc/postfix/sasl_passwd.db: No such file or directory
warning: smtp_sasl_password_maps lookup error
status=deferred (local data error while talking to <relay>)
```
Especially common after migrating from external SMTP (SendGrid, Mailgun, etc., which use SASL) to an internal Tailscale relay (which doesn't).
**Detect:**
```bash
postconf -n | grep -E 'smtp_sasl_(auth_enable|password_maps)'
[ -f /etc/postfix/sasl_passwd ] || echo "sasl_passwd file missing"
```
**Fix — disable SASL if the new relay doesn't need it:**
```bash
postconf -e 'smtp_sasl_auth_enable = no'
postconf -e 'smtp_tls_wrappermode = no' # if switching from port 465 to 587
postconf -X 'smtp_sasl_password_maps'
systemctl reload postfix
```
### Audit shortcut
For a quick per-host comparison across the fleet:
```bash
# Replace the placeholder list with your own fleet hosts
FLEET_HOSTS="host-1 host-2"
for host in $FLEET_HOSTS; do
  echo "=== $host ==="
  ssh "$host" 'postconf myhostname relayhost smtp_sasl_auth_enable 2>&1' | head -3
done
```
Anomalies (Packer hostnames, blank relayhost, SASL enabled where siblings have it disabled) jump out immediately.
## Lesson Learned
Never customize `/usr/share/logwatch/default.conf/logwatch.conf`. Always use `/etc/logwatch/conf/logwatch.conf`. This applies to any software that has a "defaults" file and an "override" file — the override survives upgrades, the defaults file does not.
A second, broader lesson from the 2026-05-10 fleet outage: **silent fleet-wide email gaps are usually a stack of unrelated failures, not one cause.** That morning's investigation surfaced a missing CA bundle on two Fedora hosts, a postfix relayhost using a name that postfix's resolver couldn't handle, two services with non-mailbox sender addresses generating bounce churn, and a mistaken syslog-vs-journald assumption that hid working hosts. Each was minor in isolation. Together they made all seven hosts look broken when in fact only two were. Triage by ground-truth (what arrived in the destination mailbox) before assuming what's broken at the source.


@ -1,11 +1,17 @@
---
title: "Tuning Netdata Docker Health Alarms to Prevent Update Flapping"
title: Tuning Netdata Docker Health Alarms to Prevent Update Flapping
domain: selfhosting
category: monitoring
tags: [netdata, docker, nextcloud, alarms, health, monitoring]
tags:
- netdata
- docker
- nextcloud
- alarms
- health
- monitoring
status: published
created: 2026-03-18
updated: 2026-03-28
updated: 2026-05-02T11:04
---
# Tuning Netdata Docker Health Alarms to Prevent Update Flapping
@ -61,9 +67,9 @@ chart labels: container_name=!nextcloud-aio-nextcloud *
### Dedicated Nextcloud AIO Alarm
Added 2026-03-23, updated 2026-03-28. The `nextcloud-aio-nextcloud` container needs a more lenient window than other containers. Its healthcheck (`/healthcheck.sh`) verifies PostgreSQL connectivity (port 5432) and PHP-FPM (port 9000). PHP-FPM takes ~90 seconds to warm up after a normal restart — but during nightly AIO update cycles, the full startup (occ upgrade, app updates, migrations) can take 5+ minutes. On 2026-03-27, a startup hung and left the container unhealthy for 20 hours until the next nightly cycle replaced it.
Added 2026-03-23, updated 2026-05-02. The `nextcloud-aio-nextcloud` container needs a more lenient window than other containers. Its healthcheck (`/healthcheck.sh`) verifies PostgreSQL connectivity (port 5432) and PHP-FPM (port 9000). PHP-FPM takes ~90 seconds to warm up after a normal restart — but during nightly AIO update cycles, the full startup (occ upgrade, app updates, migrations) can take 5+ minutes. On 2026-03-27, a startup hung and left the container unhealthy for 20 hours until the next nightly cycle replaced it.
The dedicated alarm uses a 10-minute lookup window and 10-minute delay to absorb normal startup, while still catching sustained failures:
The dedicated alarm uses a 30-minute lookup window and 10-minute delay to absorb normal startup and update cycles (~40 minutes total grace), while still catching sustained failures:
```ini
# Dedicated alarm for nextcloud-aio-nextcloud — lenient window to absorb nightly update cycle
@ -76,15 +82,23 @@ template: docker_nextcloud_unhealthy
component: Docker
units: status
every: 30s
lookup: average -10m of unhealthy
lookup: average -30m of unhealthy
chart labels: container_name=nextcloud-aio-nextcloud
warn: $this > 0
warn: $this >= 1
delay: up 10m down 5m multiplier 1.5 max 30m
summary: Nextcloud container health sustained
info: nextcloud-aio-nextcloud has been unhealthy for a sustained period — not a transient update blip
info: nextcloud-aio-nextcloud has been continuously unhealthy for 30+ minutes — not a transient update blip
to: sysadmin
```
**Tuning history:**
| Date | Lookup | Delay | Trigger | Notes |
|---|---|---|---|---|
| 2026-03-23 | 35m | 35m | Initial split from general alarm | Absorbed PHP-FPM warm-up |
| 2026-04-29 | 15m | 5m | Backup blip (~6m) never triggered | Tightened after stability |
| 2026-05-02 | 30m | 10m | 15m still too aggressive for update cycles | ~40m total grace; catches real outages |
## Watchdog Cron: Auto-Restart on Sustained Unhealthy
If the Nextcloud container stays unhealthy for more than 1 hour (well past any normal startup window), a cron watchdog on majorlab auto-restarts it and logs the event. This was added 2026-03-28 after an incident where the container sat unhealthy for 20 hours until the next nightly backup cycle replaced it.


@ -0,0 +1,130 @@
---
title: "Migrating Flat Ansible Playbooks to Roles (Safely)"
domain: selfhosting
category: security
tags: [ansible, roles, refactor, fleet, migration, fail2ban, infrastructure]
status: published
created: 2026-06-18
updated: 2026-06-18
---
# Migrating Flat Ansible Playbooks to Roles (Safely)
## Overview
A fleet repo tends to grow a sprawl of flat `configure_*.yml` playbooks — one per subsystem, plus near-duplicates for variants (e.g. ~10 `configure_fail2ban_*` playbooks), all sharing a single overloaded top-level `templates/` directory. It works, but it resists reuse: there is no clean `defaults/` precedence, no encapsulation, and no way to compose a host's full configuration in one place.
Ansible **roles** fix this — but migrating a *live* fleet is where it gets dangerous. The risk is not the refactor itself; it's accidentally changing deployed behaviour while you "just reorganize." This article covers the incremental, regression-free approach used to migrate an 11-host fleet, including the two techniques that keep it safe: **byte-identical migration** and **capture-based reconciliation**.
> This is a process/pattern article. For the specific roles in this fleet, see the internal runbook. The techniques here generalize to any flat-playbook → role migration.
## Decide What Becomes a Role vs. What Stays a Playbook
Not everything should be a role. Draw the line by purpose:
| Becomes a role | Stays a playbook |
|---|---|
| Reusable host **configuration** (a subsystem you converge to a desired state) | **Ops / one-off** actions: `update`, `reboot`, `harden`, `bootstrap`, `provision`, `fix_*`, `verify_*` |
| Has templates/files, defaults, handlers | Orchestrators that just `import_playbook` other things |
| Applied repeatedly and idempotently | Run-once or run-as-needed remediation |
Roles get the standard `roles/<name>/` layout (`tasks/`, `defaults/`, `handlers/`, `templates/`, `files/`, `meta/`). Name them after the **subsystem noun** (`fail2ban`, `clamav`, `firewall`) — drop the `configure_` verb prefix.
## The Incremental Loop (one role per branch)
Migrate **one subsystem per branch** and validate before merging. This keeps every change small enough to diff by eye and roll back cleanly:
1. `git mv` the templates/files into `roles/<name>/` so **git tracks them as renames** (history preserved, 100% rename score).
2. Move task bodies into `roles/<name>/tasks/` (split by lifecycle: install → service → config → verify).
3. Lift tunables into `roles/<name>/defaults/main.yml`; keep per-host overrides in `group_vars`/`host_vars`.
4. Add a thin entry playbook `<name>.yml` (`hosts: <group>` + `roles: [<name>]`).
5. Validate with `--check --diff` against a single host **before** merging.
6. Merge, then move to the next subsystem.
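Step 4's entry playbook is small enough to generate. The sketch below prints one for a given role and group; `make_entry_playbook` is a hypothetical helper, and the `become: true` line is an assumption about what the fleet's plays need rather than something the steps above specify.

```bash
# Sketch: print a thin entry playbook for a role.
# $1 = role name, $2 = inventory group. "become: true" is an assumption.
make_entry_playbook() {
  role="$1"
  group="$2"
  printf '%s\n' \
    "- name: Apply the $role role" \
    "  hosts: $group" \
    "  become: true" \
    "  roles:" \
    "    - $role"
}

# Example usage (writes clamav.yml for the internet_facing group):
#   make_entry_playbook clamav internet_facing > clamav.yml
```

Keeping the entry playbook this thin means the host mapping lives in exactly one file per subsystem.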
## Technique 1: Byte-Identical Migration
When the goal is "reorganize without changing behaviour," **prove** it. After moving a playbook into a role, the rendered task bodies should be identical to the original. Verify with a normalized diff against `main`:
```bash
# Compare the role's task body against the original flat playbook,
# ignoring only comments/whitespace you intend to change.
git show main:configure_clamav.yml > /tmp/old.yml
# ...extract the task list from roles/clamav/tasks/*.yml and diff
diff <(yq '.[] | .tasks' /tmp/old.yml) <(cat roles/clamav/tasks/*.yml)
```
The acceptance bar: `--check --diff` against a real host returns **`changed=0`** (or only the diffs you explicitly intended, like a doc-comment line). If a "faithful" migration shows unexpected `changed=N`, you altered behaviour — stop and reconcile before merging. Templates moved via `git mv` show as **100% renames** in `git show --stat`, which is your proof the deployed content is unchanged.
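The `changed=0` acceptance bar can be enforced mechanically instead of read off the terminal. This is a sketch that assumes the usual `PLAY RECAP` line shape (`host : ok=N changed=N ... failed=N`); `recap_is_clean` is an illustrative name, and it deliberately fails when no recap line is seen at all.

```bash
# Sketch: read ansible-playbook output on stdin and succeed only if at least
# one recap line was seen and every recap line shows changed=0 and failed=0.
recap_is_clean() {
  awk '
    /changed=[0-9]+/ {
      seen = 1
      if ($0 !~ /changed=0([^0-9]|$)/ || $0 !~ /failed=0([^0-9]|$)/) dirty = 1
    }
    END {
      if (seen && !dirty) exit 0
      exit 1
    }
  '
}

# Example usage (the --limit target is a placeholder):
#   ansible-playbook clamav.yml --check --diff --limit some-host | recap_is_clean \
#     || echo "migration changed behaviour; reconcile before merging"
```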
## Technique 2: Consolidating Near-Duplicates with Feature Flags
The big win is collapsing a family of near-duplicate playbooks (the ~10 `configure_fail2ban_*`) into **one role with flag-gated task files**:
```yaml
# group_vars/<group>.yml — hosts self-select which jails/components they get
fail2ban_jail_sshd: true
fail2ban_jail_wordpress: true
fail2ban_jail_nginx_bad_request: false
```
```yaml
# roles/fail2ban/tasks/main.yml
- import_tasks: jail_wordpress.yml
  when: fail2ban_jail_wordpress | default(false)
```
> **Critical gotcha — key flags to inventory GROUPS, not `ansible_os_family`.** It is tempting to gate OS-specific task files on `ansible_os_family == 'Debian'`. Don't. Inventory groups frequently include hosts the *original playbooks deliberately excluded* (e.g. a LAN-only Debian box that should get the network-wait step but **not** the public SSH bind, or a WSL host in the `fedora` group that must be skipped). Keep the original curated host patterns and set the flag per play/group. Keying on `os_family` silently widens a play's host set and is exactly how a "refactor" pushes config to a host that never had it.
## Technique 3: Capture-Based Reconciliation (the safety net)
This is the one that prevents an outage. Sometimes a role gets written as a **fresh re-implementation** of a subsystem rather than a faithful move — a cleaner `jail.local`, new drop-ins, a different default set. It may even be merged into `site.yml`. The trap: that role has **never been rolled out**, and its config *diverges* from what's actually deployed.
Running it would push divergent config to a live, security-sensitive subsystem (intrusion protection, firewall) across the whole fleet on the next `harden.yml`.
The check that catches it:
```bash
ansible-playbook fail2ban.yml --check --diff --limit <host>
# Divergent role => changed=8-12 per host + failures (missing filters/timers)
# Faithful role => changed=0, failed=0
```
**Capture-based reconciliation** is the fix: instead of pushing the role's idea of "correct," bring the **role into parity with the live, working config** first. Capture what's actually deployed, fold it into the role's templates/defaults until `--check` is clean fleet-wide, *then* switch the orchestrator over and retire the old playbooks. Order of operations:
1. **Decide the source of truth** — the live config or the new role. For security subsystems, the live (working) config wins.
2. **Reconcile** the role to match live until `--check` shows `changed=0, failed=0` on every host.
3. **Roll out host-by-host** with real runs; verify the service restarts cleanly and (for fail2ban) jails are actually active.
4. **Only then** delete the old playbooks, rewire `harden.yml`/`bootstrap.yml`, and remove the orphaned top-level templates.
Never delete the old mechanism until the new one is proven converged everywhere. "It's in `site.yml`" is not the same as "it's been rolled out."
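The reconcile step boils down to comparing what was captured from a live host with what the role would render. A rough sketch, assuming you have already copied the live files into one directory and the role's rendered equivalents into another with matching file names; `parity_report` and both directory arguments are hypothetical.

```bash
# Sketch: compare captured live config files against the role's rendered files.
# $1 = directory of captured live files, $2 = directory of rendered files.
# Prints OK/DIFF per file and returns non-zero if any file differs or is missing.
parity_report() {
  live_dir="$1"
  rendered_dir="$2"
  status=0
  for live in "$live_dir"/*; do
    [ -e "$live" ] || continue          # skip when the directory is empty
    name=$(basename "$live")
    if cmp -s "$live" "$rendered_dir/$name"; then
      echo "OK   $name"
    else
      echo "DIFF $name"
      status=1
    fi
  done
  return "$status"
}

# Example usage (paths are placeholders):
#   parity_report captured/some-host rendered/some-host
```

A clean report here is a cheaper first signal than a fleet-wide `--check` run, though it does not replace it.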
## Composition: `site.yml`, `harden.yml`, `bootstrap.yml`
Once subsystems are roles, compose them with thin orchestrators that `import_playbook` the role entry points — so each subsystem keeps a **single source of truth** for its host mapping:
```yaml
# site.yml — day-to-day fleet convergence, in dependency order
- import_playbook: swap.yml
- import_playbook: tailscale.yml
- import_playbook: ssh_hardening.yml
- import_playbook: firewall.yml
- import_playbook: fail2ban.yml
- import_playbook: clamav.yml
```
Order matters: base layer (swap) → networking (tailscale) → access (ssh_hardening) → perimeter (firewall) → intrusion protection (fail2ban). Bootstrap-only roles (guest agent, root password, provisioning prerequisites) belong in `bootstrap.yml`, not `site.yml`.
## Verification Checklist
- [ ] Templates moved with `git mv` (show as 100% renames)
- [ ] `--check --diff` on a real host = `changed=0` (or only intended diffs)
- [ ] Consolidation flags keyed to **inventory groups**, not `ansible_os_family`
- [ ] Re-implemented roles reconciled to live parity **before** rollout (no surprise `changed=N`)
- [ ] Security subsystems rolled out host-by-host with service-active verification
- [ ] Old playbooks/templates deleted **only after** the role is converged fleet-wide
- [ ] Orchestrators (`site.yml`/`harden.yml`/`bootstrap.yml`) rewired; stale references swept
## Related
- [SSH Hardening Fleet-Wide with Ansible](ssh-hardening-ansible-fleet.md)
- [ClamAV Fleet Deployment with Ansible](clamav-fleet-deployment.md)
- [Firewall Hardening with firewalld on Fedora Fleet](firewalld-fleet-hardening.md)
- [Standardizing unattended-upgrades with Ansible](ansible-unattended-upgrades-fleet.md)


@ -11,7 +11,7 @@ tags:
- cron
status: published
created: 2026-04-18
updated: 2026-04-18T11:13
updated: 2026-05-15T03:00
---
# ClamAV Fleet Deployment with Ansible
@ -31,6 +31,10 @@ ClamAV is the standard open-source antivirus for Linux servers. For internet-fac
## Ansible Playbook
> On the MajorsHouse fleet this is packaged as the **`clamav` role** (`roles/clamav/`,
> tasks split install → service → scan → verify) and run via `clamav.yml` or `site.yml`.
> The standalone playbook below is the illustrative equivalent.
```yaml
- name: Deploy ClamAV to internet-facing hosts
hosts: internet_facing # dca, majorlinux, teelia, tttpod, majortoot, majormail
@ -147,6 +151,120 @@ clamscan /tmp/eicar-test.txt
rm /tmp/eicar-test.txt
```
## DigitalOcean Monitoring Caveat (1 vCPU droplets)
`nice -n 19 ionice -c 3` plus `MemoryMax`/`MemorySwapMax` cgroups make clamscan "polite" to the Linux scheduler: it yields to PHP-FPM, MySQL, etc. instantly. **But hypervisor-level CPU monitoring (DigitalOcean, Linode, Hetzner) doesn't know about niceness.** It sees raw CPU utilization. On a 1 vCPU droplet during quiet hours, a single-threaded clamscan can fill 100% of the vCPU on its own, tripping a default `>85%/5m` CPU alert every week, even though the scan is not actually starving real traffic.
**Symptoms:**
- Weekly `[ALERT] CPU is running high` email from DO at the same time/day every week
- The alert clears within 10–60 min (when scan finishes)
- No actual user-visible service degradation
- Netdata shows CPU 80–100% but PHP-FPM/MySQL response times barely move
**Fix: per-droplet alert scoping.** Two changes via the DO API:
1. **Scope the existing fleet-wide CPU alert to exclude affected 1 vCPU droplets** by setting `entities` to an explicit array of *all other* droplet IDs.
2. **Add a new alert scoped to just the affected droplet(s)** with a relaxed threshold:
- `value: 95`
- `window: "30m"`
- `entities: [<droplet_id>]`
The relaxed threshold still catches runaway PHP loops, mining trojans, and actual sustained saturation — but ignores the weekly polite scan.
### Apply via DO API
```bash
TOKEN="<your DigitalOcean PAT>"
# 1. Scope existing CPU alert (PUT requires the full alert spec)
curl -sS -X PUT \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"alerts": {"email": ["you@example.com"], "slack": []},
"compare": "GreaterThan",
"description": "CPU is running high (excludes 1vCPU clamscan boxes)",
"enabled": true,
"entities": ["<droplet_id_1>", "<droplet_id_2>"],
"tags": [],
"type": "v1/insights/droplet/cpu",
"value": 85,
"window": "5m"
}' \
"https://api.digitalocean.com/v2/monitoring/alerts/<existing_uuid>"
# 2. Create a relaxed alert for the small box
curl -sS -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"alerts": {"email": ["you@example.com"], "slack": []},
"compare": "GreaterThan",
"description": "<host> CPU sustained high (clamscan-aware)",
"enabled": true,
"entities": ["<small_droplet_id>"],
"tags": [],
"type": "v1/insights/droplet/cpu",
"value": 95,
"window": "30m"
}' \
"https://api.digitalocean.com/v2/monitoring/alerts"
```
To list current alerts (find UUIDs and current `entities`):
```bash
curl -sS -H "Authorization: Bearer $TOKEN" \
"https://api.digitalocean.com/v2/monitoring/alerts" | jq
```
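The `entities` array for step 1 is simply "every droplet ID except the small ones". A minimal sketch of building it in plain shell, assuming you already have the full ID list in hand; `build_entities` is a hypothetical helper name and the IDs are made-up sample values:

```bash
# build_entities EXCLUDE_ID ID... -> JSON array (of strings) with EXCLUDE_ID left out.
# Hypothetical helper; not part of any DigitalOcean tooling.
build_entities() {
  exclude="$1"; shift
  out=""
  for id in "$@"; do
    [ "$id" = "$exclude" ] && continue
    out="${out:+$out, }\"$id\""
  done
  printf '[%s]\n' "$out"
}

# Sample call: drop droplet 222 from a three-droplet fleet.
build_entities 222 111 222 333
```

Paste the printed array into the `entities` field of the PUT body. Excluding more than one droplet would need a small extension (for example, calling the function once per excluded ID over the previous result's inputs).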
**When *not* to do this:** If your droplet has 2+ vCPUs and clamscan only consumes ~50% of total, you probably won't trip an 85% alert in the first place. The per-droplet exemption is mainly for 1 vCPU boxes.
**When the per-droplet relaxed alert *also* trips (and what to do):** On a 1 vCPU droplet during low-traffic hours (e.g., the default Sunday-morning weekly cron window), clamscan has *nothing real to yield to*: `nice 19` only matters when something else wants the CPU. The kernel correctly schedules clamscan as nice/idle (`iostat` shows `%nice ~94, %idle 0`), but DO sees `100% - 0% idle = 100% CPU` and trips even the 95%/30m threshold for the duration of the scan (~30–50 min on small webserver boxes). At that point the realistic options are:
1. **Accept the weekly page** as expected noise — simplest, no further engineering
2. **Switch to `clamdscan`** (daemon-backed): scans finish ~3–5× faster and fit in a 30m window, but `clamd` adds ~250 MB resident memory continuously
3. **Disable the per-droplet CPU alert entirely** for that host and rely on Netdata for the real signal
The "polite CPU is invisible to DO" trick stops working once the box is small enough that the polite work fills the entire core unopposed. There is no DO threshold that distinguishes "polite scan filling idle CPU" from "runaway process pinning the vCPU" — that distinction lives in `iostat`'s `%nice` vs `%user` split, which DO doesn't expose.
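The `%nice` versus `%user` split can also be read straight from two samples of the aggregate `cpu` line in `/proc/stat`, without installing `sysstat`. This is a rough sketch, not what the fleet runs; `cpu_split` is a hypothetical helper and the two sample lines are invented to mirror the `%nice ~94, %idle 0` case above:

```bash
# cpu_split LINE1 LINE2 -> share of user, nice and idle time between two
# samples of the "cpu" line from /proc/stat (fields: user nice system idle
# iowait irq softirq steal).
cpu_split() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { for (i = 2; i <= 9; i++) a[i] = $i }
    NR == 2 {
      for (i = 2; i <= 9; i++) { d[i] = $i - a[i]; total += d[i] }
      if (total <= 0) { print "no ticks elapsed" > "/dev/stderr"; exit 1 }
      printf "user=%d nice=%d idle=%d\n", 100 * d[2] / total, 100 * d[3] / total, 100 * d[5] / total
    }'
}

# Invented samples: almost all elapsed ticks went to niced work.
cpu_split "cpu 100 200 50 1000 0 0 0 0" "cpu 103 294 53 1000 0 0 0 0"
```

On a live host, take the samples a few seconds apart: `s1=$(grep '^cpu ' /proc/stat); sleep 5; s2=$(grep '^cpu ' /proc/stat); cpu_split "$s1" "$s2"`. A high `nice` share with low `user` points at the polite scan; a high `user` share points at something that deserves the page.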
**Alternative considered: switch to `clamdscan`** — uses a resident `clamd` daemon, signatures stay loaded, scan finishes ~10× faster with much less CPU/RAM. Better long-term answer, but requires running `clamd` continuously (memory cost on small boxes is ~250 MB resident vs the cron approach which only holds RAM during scan). Trade-off, not strictly better.
## Daemonless Mode on Memory-Constrained Hosts
On hosts with ≤2 GB RAM, running `clamd` continuously is often counterproductive. The daemon loads its full signature database (~950 MB RSS) into memory and keeps it resident. On small VMs this crowds out MySQL, PHP-FPM, and other services — often pushing the whole system into swap rather than preventing anything.
**Affected hosts (fleet history):**
| Host | RAM | Incident | Resolution |
|------|-----|----------|------------|
| teelia | 1.9 GB | 2026-04-27 — clamd 728 MB RSS, 94% RAM alert | daemonless |
| dcaprod | 3.8 GB | 2026-04-30 — clamd OOM thrash after 512M cgroup cap | daemonless |
| majorlinux | 2.0 GB | 2026-05-15 — clamd 980 MB swap, mysqld swapping 293 MB | daemonless |
**The fix: `clamav_use_daemon: false` host_var**
The `clamav` role supports a per-host override. Add to the host's `host_vars/<hostname>/vars.yml`:
```yaml
clamav_use_daemon: false
```
Then re-run the role:
```bash
ansible-playbook clamav.yml --limit <hostname>
```
This will:
- Stop and disable `clamav-daemon.service` and `clamav-daemon.socket`
- Deploy the weekly scan template using `clamscan` (daemonless, loads DB per run)
- Leave `clamav-freshclam` active so definitions stay current
**Trade-off:** Each weekly scan loads the signature DB fresh (~950 MB peak RAM for the scan duration, then freed). The scan takes roughly 3–5× longer than `clamdscan` against a warm daemon, but this is acceptable for a weekly background job. The `systemd-run MemoryMax` cgroup wrapper in the scan template caps peak usage so the scan can't OOM the host.
**Rule of thumb:** Use daemon mode (`clamav_use_daemon: true` or unset) on hosts with ≥4 GB RAM where scan speed matters (mail servers, upload handlers). Use daemonless on webservers and small VMs where continuous memory residency is the bigger risk.
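The rule of thumb can be expressed as a tiny check against a host's `MemTotal`. This is only a sketch: `clamav_mode_for` is a hypothetical helper, and the 4096 MB default threshold is an assumption taken from the "≥4 GB" wording. It deliberately ignores the "where scan speed matters" half of the rule, which stays a human call.

```bash
# clamav_mode_for MEMTOTAL_KB [THRESHOLD_MB] -> the host_var line to use.
# THRESHOLD_MB defaults to 4096 (an assumption, adjust per fleet).
clamav_mode_for() {
  mem_mb=$(( $1 / 1024 ))
  threshold_mb="${2:-4096}"
  if [ "$mem_mb" -ge "$threshold_mb" ]; then
    echo "clamav_use_daemon: true"
  else
    echo "clamav_use_daemon: false"
  fi
}

# Sample value: roughly what a 2 GB VM reports as MemTotal (in kB).
clamav_mode_for 2014000
```

On a real host, feed it `awk '/^MemTotal:/ {print $2}' /proc/meminfo`. Nominal "4 GB" VMs usually report somewhat less than 4096 MB, so with the default threshold they land on daemonless, which matches how the 3.8 GB host in the table above was resolved.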
## See Also
- [clamscan-cpu-spike-nice-ionice](../../05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md) — troubleshooting CPU spikes from unthrottled scans

View file

@ -1,11 +1,18 @@
---
title: "Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts"
title: Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts
domain: selfhosting
category: security
tags: [fail2ban, security, email, ansible, fleet, cron, digest]
tags:
- fail2ban
- security
- email
- ansible
- fleet
- cron
- digest
status: published
created: 2026-04-22
updated: 2026-04-22
updated: 2026-05-02T14:56
---
# Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts
@ -21,11 +28,11 @@ Three tiers replace the firehose:
| Tier | Jails | Action | Why |
|------|-------|--------|-----|
| **Immediate email** | `sshd`, `recidive` | `action_mwl` | Security-critical — someone is actively targeting auth or is a repeat offender |
| **Immediate email** | `recidive` | `action_mwl` | Repeat offenders only — someone has been banned multiple times across jails |
| **Silent ban** | Everything else | `action_` (default) | Ban happens, firewall rule applied, no email sent |
| **Daily digest** | All jails | Cron script at 08:00 UTC | One summary email per host with ban counts across all jails |
This reduces email volume from hundreds per day to ~10 (one digest per host + occasional sshd/recidive alerts).
This reduces email volume from hundreds per day to ~10 (one digest per host + occasional recidive alerts).
## jail.local Configuration
@ -40,18 +47,20 @@ action = %(action_)s
This overrides the stock `action_mwl` for all jails. Bans still happen — the firewall rule is applied — but no email is sent.
### Keep immediate alerts for critical jails
### Keep immediate alerts for recidive only
```ini
[sshd]
enabled = true
action = %(action_mwl)s
action = %(action_)s
[recidive]
enabled = true
action = %(action_mwl)s
```
> **Updated 2026-05-02:** sshd was moved to silent (`action_`). Only recidive (repeat offenders) now triggers immediate email. sshd bans are captured in the daily digest.
### Clean up email subjects with fq-hostname
By default, fail2ban uses the system FQDN in email subjects. On Tailscale hosts, this produces ugly subjects like `[Fail2Ban] sshd: banned 1.2.3.4 on MajorToot.tail7f2d9.ts.net`. Override it in `[DEFAULT]`:
@ -91,8 +100,9 @@ The playbook `configure_fail2ban_digest.yml` deploys the full digest model fleet
### What it does
1. Deploys a Python helper script that performs **section-aware editing** of `jail.local` (see gotchas below)
2. Sets `action = %(action_)s` in `[DEFAULT]`
3. Sets `action = %(action_mwl)s` in `[sshd]` and `[recidive]`
2. Sets `action = %(action_)s` in `[DEFAULT]` and `[sshd]`
3. Sets `action = %(action_mwl)s` in `[recidive]`
4. Removes stale `action = %(action_mwl)s` from `defaults-debian.conf` if present
5. Sets `fq-hostname` per host using an override dict
6. Deploys the digest script from a Jinja2 template
7. Creates the cron job via `ansible.builtin.cron`
@ -143,6 +153,14 @@ option 'action' in section 'DEFAULT' already exists
The Python editor script handles this by replacing existing keys rather than appending.
### defaults-debian.conf overrides jail.local
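On Debian/Ubuntu, `/etc/fail2ban/jail.d/defaults-debian.conf` can reintroduce per-ban email. Note that fail2ban's documented load order is `jail.conf`, `jail.d/*.conf`, `jail.local`, `jail.d/*.local`, so this file is read *before* `jail.local`, not after it; an `action = %(action_mwl)s` set there inside a jail section still beats a `[DEFAULT]` action in `jail.local`, because jail-level options take precedence over defaults. The Ansible playbook now removes this line automatically. If you see per-ban emails after deploying digest mode, check this file first: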
```bash
grep action /etc/fail2ban/jail.d/defaults-debian.conf
```
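A plain `grep action` does not say *which section* a match belongs to, and that is what decides whether a jail emails. A small section-aware sketch (the helper name `mwl_sections` is hypothetical; it only handles single-line `action = ...` settings, not multi-line continuations):

```bash
# mwl_sections FILE... -> one line per section that sets action to action_mwl.
mwl_sections() {
  awk '
    /^[ \t]*\[/ {
      section = $0
      sub(/^[ \t]*\[/, "", section)
      sub(/\].*$/, "", section)
      next
    }
    /^[ \t]*action[ \t]*=.*action_mwl/ {
      print FILENAME ": [" section "]"
    }
  ' "$@"
}

# Demo on a throwaway file so the sketch runs anywhere.
demo=$(mktemp)
printf '%s\n' '[DEFAULT]' 'action = %(action_)s' '[sshd]' 'action = %(action_)s' \
  '[recidive]' 'action = %(action_mwl)s' > "$demo"
mwl_sections "$demo"
rm -f "$demo"
```

On a host, point it at the real files, for example `mwl_sections /etc/fail2ban/jail.local /etc/fail2ban/jail.d/*.conf`. In digest mode only `[recidive]` should show up.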
### fq-hostname scope
Setting `fq-hostname` in `[DEFAULT]` affects all action templates that use the `<fq-hostname>` tag — including both immediate emails and the digest subject. This is the desired behavior, but be aware that it overrides the system hostname globally within fail2ban.

View file

@ -31,6 +31,9 @@ Rather than editing `/etc/ssh/sshd_config` directly (which may be managed by the
## Ansible Playbook
> On the MajorsHouse fleet this is packaged as the **`ssh_hardening` role** (`roles/ssh_hardening/`)
> and run via `ssh_hardening.yml` or `site.yml`. The standalone playbook below is the illustrative equivalent.
```yaml
- name: Harden SSH daemon fleet-wide
hosts: all:!raspbian

View file

@ -10,7 +10,7 @@ tags:
- docker
status: published
created: 2026-04-02
updated: 2026-04-29T22:45
updated: 2026-04-30T05:21
---
# Mastodon Instance Tuning

View file

@ -0,0 +1,170 @@
---
title: "Mastodon — Triaging Crowdfunding / Mention-Spam Accounts"
description: How to tell broadcast fundraising solicitation from genuine mentions, investigate the account and its origin instance with SQL + nodeinfo, and pick a proportionate moderation action.
tags:
- mastodon
- moderation
- abuse
- federation
- self-hosting
created: 2026-06-22
updated: 2026-06-22
---
# Mastodon — Triaging Crowdfunding / Mention-Spam Accounts
If you run a Mastodon instance, sooner or later you (or your users) start getting tagged by accounts you've never interacted with, posting donation appeals with a link and a wall of hashtags. Some are real people in desperate situations; some are recycled-link scams. Either way, when an account is **broadcasting a solicitation at you** rather than replying to you, it's a moderation question, not a conversation.
This article is the runbook for telling the two apart, investigating both the **account** and its **origin instance**, and choosing an action that's proportionate instead of nuking eight years of legit federation over two bad actors.
## TL;DR
- A mention is **broadcast spam**, not engagement, when it's a *standalone post* (not a reply) that *tags a large fixed list* of accounts and carries a *donation link*, usually from a *throwaway profile* on an *open-registration instance*.
- Investigate before acting: pull the account's age/stats/bio and check whether the post is a reply or a 40-way blast (SQL below). Profile the origin instance via its public `nodeinfo`.
- **Default action is an account-level block**, which also federates and removes their follow of you. Escalate to domain-limit / domain-block only when *one instance* produces *repeat offenders*.
- Keep a log so single incidents that are actually a pattern become visible.
## Signals that a mention is broadcast solicitation
Score it on how many of these hold:
| Signal | Why it matters |
|---|---|
| **Standalone post, not a reply** (`in_reply_to_account_id IS NULL`) but still tags you | They're broadcasting, not responding |
| **Tags a large fixed recipient list** (e.g. 40+) | Mass distribution; the same list reused across senders = coordination |
| **Donation link** in post or bio (`chuffed.org`, `gofundme`, `paypal.me`, `ko-fi`) | The payload |
| **Throwaway profile** — days old, few followers, follows you but you don't follow back | Disposable, baiting a profile view |
| **Mass-follow ratio** — following thousands / few hundred followers | Engagement farming |
| **"I am not a scammer" disclaimer** in bio | Known red-flag phrase |
| **Origin instance: open registration, no approval** | Easy throwaway-account farm |
> [!warning] Judgment, not a purity test
> Many of these accounts are real people. The goal is not to adjudicate need — it's to stop *broadcast solicitation aimed at you* and track the *source instances*. Prefer the lightest action that stops it.
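
One of the signals, the donation link, is easy to check mechanically when you are sweeping post text or bios exported from the queries in this article. A minimal sketch, assuming the host list from the table above; `has_donation_link` is a hypothetical helper and a match is only one signal, never a verdict:

```bash
# has_donation_link TEXT -> exit 0 if TEXT mentions one of the listed fundraising hosts.
has_donation_link() {
  printf '%s' "$1" | grep -Eiq 'chuffed\.org|gofundme|paypal\.me|ko-fi'
}

# Sample text is invented for illustration.
if has_donation_link 'Please help my family https://chuffed.org/project/example #help'; then
  echo "donation link present"
fi
```

Extend the pattern as new fundraising hosts turn up in your own log.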
## Investigate the account
Connect to the DB on the instance:
```bash
ssh <your-mastodon-host>
sudo -u postgres psql mastodon_production
```
**Profile + stats for a suspect** (age, post count, follower ratio, bio):
```sql
SELECT a.username||'@'||a.domain,
to_char(a.created_at,'YYYY-MM-DD') AS first_seen_locally,
st.statuses_count, st.followers_count, st.following_count,
left(regexp_replace(COALESCE(a.note,''),'<[^>]+>','','g'),200) AS bio
FROM accounts a LEFT JOIN account_stats st ON st.account_id=a.id
WHERE a.domain='<INSTANCE>' AND a.username='<HANDLE>';
```
**Is the mention a reply or a blast?** `standalone=t` with a high `num_tagged` is the tell:
```sql
SELECT a.username, to_char(s.created_at,'YYYY-MM-DD HH24:MI') AS posted,
s.in_reply_to_account_id IS NULL AS standalone,
(SELECT count(*) FROM mentions mm WHERE mm.status_id=s.id) AS num_tagged
FROM mentions m JOIN statuses s ON s.id=m.status_id
JOIN accounts a ON a.id=s.account_id
JOIN accounts me ON me.id=m.account_id AND me.username='<YOU>' AND me.domain IS NULL
WHERE a.username='<HANDLE>' AND a.domain='<INSTANCE>'
ORDER BY s.created_at DESC;
```
**All recent direct mentions of you** (sweep for the wider pattern):
```sql
SELECT to_char(n.created_at,'YYYY-MM-DD HH24:MI') AS when,
a.username||COALESCE('@'||a.domain,'@local') AS who,
COALESCE(s.uri,'') AS uri,
left(regexp_replace(COALESCE(s.text,''),'<[^>]+>','','g'),200) AS body
FROM notifications n
JOIN accounts recip ON recip.id=n.account_id AND recip.username='<YOU>' AND recip.domain IS NULL
JOIN accounts a ON a.id=n.from_account_id
LEFT JOIN mentions m ON m.id=n.activity_id AND n.activity_type='Mention'
LEFT JOIN statuses s ON s.id=m.status_id
WHERE n.type='mention' ORDER BY n.created_at DESC LIMIT 40;
```
## Profile the origin instance
Don't judge an instance by one bad account. Pull its public metadata — no auth needed:
```bash
# Software, version, user counts, registration policy
NI=$(curl -s https://<INSTANCE>/.well-known/nodeinfo | python3 -c 'import sys,json;print(json.load(sys.stdin)["links"][-1]["href"])')
curl -s "$NI" | python3 -m json.tool # software, openRegistrations, usage.users
# Title, contact/admin, rules, registration approval flag
curl -s https://<INSTANCE>/api/v2/instance | python3 -m json.tool
```
What to read off it:
- **`openRegistrations: true` + `approval_required: false`** → throwaway-account farm; expect more of the same.
- **Total users vs monthly active users** (`usage.users.total` vs `usage.users.activeMonth` in nodeinfo) → a huge dormant base is typical of sign-up-and-leave farms.
- **Federation age on your side** — how long you've known the instance, how many of its accounts you cache. A long, broad relationship argues *against* a domain block.
- **The instance's own rules** — many ban "backlink accounts" / harassment, which the mass-tag fundraising violates. That makes **reporting to its admin a legitimate, in-policy path.**
```sql
-- What your instance already knows about the domain
SELECT (SELECT count(*) FROM accounts WHERE domain='<INSTANCE>') AS known_accounts,
(SELECT count(*) FROM statuses s JOIN accounts a ON a.id=s.account_id WHERE a.domain='<INSTANCE>') AS cached_statuses,
(SELECT to_char(min(created_at),'YYYY-MM-DD') FROM accounts WHERE domain='<INSTANCE>') AS first_seen,
(SELECT count(*) FROM domain_blocks WHERE domain='<INSTANCE>') AS is_domain_blocked;
```
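For the total-versus-active-month comparison, a single number is easier to log than two raw counts. A small sketch, assuming you have read `usage.users.total` and `usage.users.activeMonth` off the nodeinfo document; `dormant_pct` is a hypothetical helper and the sample counts are made up:

```bash
# dormant_pct TOTAL ACTIVE_MONTH -> integer percentage of accounts not active in the last month.
dormant_pct() {
  total="$1"; active="$2"
  if [ "$total" -le 0 ]; then
    echo "total must be positive" >&2
    return 1
  fi
  echo $(( (total - active) * 100 / total ))
}

# Made-up sample numbers: 12000 registered, 300 active this month.
dormant_pct 12000 300
```

There is no threshold in this runbook for what counts as "too dormant"; treat the figure as context to record next to the registration policy, not as a trigger on its own.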
## The escalation ladder
| Level | Action | Effect | When |
|---|---|---|---|
| 1 | **Mute** | You stop seeing them; silent | Borderline; you don't want to cut them off |
| 2 | **Block (account)** | Cuts mentions, removes their follow, federates to their instance | **Default first action** |
| 3 | **Report** to source admin | Forwards the offending posts to their moderators | Repeat or egregious; in-policy on most instances |
| 4 | **Domain-limit (silence)** | Their posts show only if you follow that account | One instance, multiple offenders |
| 5 | **Domain-block (suspend)** | Severs all known accounts + federation | Instance is predominantly abuse |
### Blocking from a user account (federates + removes follow)
There is no `tootctl accounts block`. Call `BlockService` from a Rails runner script instead, so the block tears down the follow relationship and federates correctly:
```ruby
# run as the mastodon user:
#   sudo -u mastodon bash -c 'cd /home/mastodon/live && RAILS_ENV=production bin/rails runner /tmp/block.rb'
me = Account.find_by(username: "<YOU>", domain: nil)
abort "local account not found" if me.nil?

%w[Handle1 Handle2].each do |u|
  t = Account.find_by(username: u, domain: "<INSTANCE>")
  next puts("NOTFOUND #{u}") if t.nil?

  # BlockService creates the block, removes their follow, and federates it.
  BlockService.new.call(me, t)
  puts "BLOCKED #{u} blocking=#{me.blocking?(t)} they_follow_me=#{t.following?(me)}"
end
```
`blocking=true` with `they_follow_me=false` confirms the block landed and the follow was severed.
### Instance-level actions
Domain-limit / domain-block live in the admin UI (**Moderation → Federation**) or via `tootctl`:
```bash
# Silence (limit) — posts hidden unless followed
RAILS_ENV=production bin/tootctl domains ... # or set severity=silence in the admin UI
# Suspend (block) the whole instance
RAILS_ENV=production bin/tootctl ... # admin UI "Add domain block" is the safe path
```
> [!tip] Reach for the lightest hammer
> A domain block is rarely the right first move against an established instance — you lose every legit account and years of federation to swat a couple of accounts. Block the accounts, report them to the source admin, and only escalate the *instance* when it demonstrates a sustained, multi-actor pattern.
## Keep a log
Track offenders and source instances over time so a "one-off" that's actually a campaign becomes visible, and so domain-level decisions are evidence-based. A simple table — date, account, instance, signals, action — plus an instance-watch table with each source's registration policy and offender count is enough.
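A tab-separated file is enough to get started. The sketch below is one possible shape, not an existing tool: `log_offender` and `offenders_per_instance` are hypothetical helpers, the columns follow the date/account/instance/signals/action list above, and the demo rows use placeholder accounts and domains.

```bash
# log_offender FILE DATE ACCOUNT INSTANCE SIGNALS ACTION -> append one tab-separated row.
log_offender() {
  file="$1"; shift
  printf '%s\t%s\t%s\t%s\t%s\n' "$@" >> "$file"
}

# offenders_per_instance FILE -> "COUNT<TAB>INSTANCE", busiest instance first.
offenders_per_instance() {
  awk -F '\t' '{ n[$3]++ } END { for (i in n) printf "%d\t%s\n", n[i], i }' "$1" | sort -rn
}

# Demo with placeholder data in a throwaway file.
log=$(mktemp)
log_offender "$log" 2026-06-20 '@acct1' spam.example 'blast,link' block
log_offender "$log" 2026-06-21 '@acct2' spam.example 'blast,link,new' block
log_offender "$log" 2026-06-22 '@acct3' other.example 'link' mute
offenders_per_instance "$log"
rm -f "$log"
```

The per-instance count is the number that feeds the escalation ladder: one offender stays at account level, several from the same source start to justify a domain-limit.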
## Related
- [Mastodon `--prune-profiles` Trap](mastodon-prune-profiles-trap.md)
- [Mastodon DB Maintenance](mastodon-db-maintenance.md)
- [Mastodon Federation](mastodon-federation.md)

View file

@ -0,0 +1,174 @@
---
title: Mastodon Post-Install Hardening (Permissions + Account)
domain: selfhosting
category: services
tags:
- mastodon
- fediverse
- self-hosting
- hardening
- ansible
- nginx
- rbenv
status: published
created: 2026-05-31
updated: 2026-05-31
---
# Mastodon Post-Install Hardening (Permissions + Account)
Four gaps that the upstream Mastodon install guide doesn't lock down — each silently breaks something or leaves a credential exposed. Found on majortoot-hetzner during its 2026-05-31 cutover; codified in MajorAnsible's `configure_mastodon_permissions.yml`.
---
## Gap 1: `/home/mastodon` is `0750` — nginx 403s every asset
### Symptom
Browser loads `https://<your-instance>/` and shows an unstyled **purple background with no content** (Mastodon's React entry HTML loaded, but every JS / CSS / manifest request 403'd). API endpoints like `/api/v1/instance` still return 200 because they fall through nginx's `try_files` to the puma proxy — but static assets need direct filesystem access.
### Cause
Debian/Ubuntu's `useradd` default umask creates `/home/<user>` as `0750` (owner+group only). nginx runs as `www-data`, which is in neither — it cannot **traverse** into `/home/mastodon/live/public/` to serve `packs/assets/*.js`, manifest.json, etc. The errors land in `/var/log/nginx/error.log`:
```
[crit] stat() "/home/mastodon/live/public/packs/assets/foo.js" failed (13: Permission denied)
```
### Fix
```bash
chmod 0751 /home/mastodon
```
`0751` gives `other` execute (traversal) only, **not read** — files inside that aren't world-readable stay private. Take the opportunity to lock `.env.production` in the next gap.
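To find which directory in the chain is blocking traversal, you can check the "other" execute bit on each path component. A minimal sketch, assuming nginx's user is neither owner nor group of these directories (so only the "other" bits apply) and ignoring ACLs; both function names are hypothetical:

```bash
# other_can_traverse DIR -> exit 0 if the "other" class has the execute (search) bit on DIR.
other_can_traverse() {
  case "$(ls -ld "$1" 2>/dev/null | cut -c10)" in
    x|t) return 0 ;;
    *)   return 1 ;;
  esac
}

# Report every directory between the given path and / that "other" cannot enter.
report_blockers() {
  dir="$1"
  while [ "$dir" != "/" ] && [ "$dir" != "." ]; do
    other_can_traverse "$dir" || echo "blocked at: $dir"
    dir=$(dirname "$dir")
  done
}

report_blockers /home/mastodon/live/public
```

A missing directory is reported as blocked too, since `ls` returns nothing for it. `namei -l /home/mastodon/live/public` from util-linux shows the same information in one listing if it is installed.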
---
## Gap 2: `.env.production` is `0644` — DB_PASS and SECRET_KEY_BASE are world-readable
### Symptom
Once Gap 1 is fixed and `/home/mastodon` is traversable, any local user (and any compromised process running as nginx, sidekiq under reduced privileges, a container escape, etc.) can `cat /home/mastodon/live/.env.production` and read every Mastodon secret.
### Cause
The `mastodon-setup` interactive wizard writes `.env.production` with default `0644` permissions. The file contains:
- `DB_PASS` — PostgreSQL password
- `SECRET_KEY_BASE` — session cookie signing key
- `OTP_SECRET` — 2FA encryption key
- SMTP credentials
- S3 / object-storage credentials if configured
### Fix
```bash
chmod 0600 /home/mastodon/live/.env.production
chown mastodon:mastodon /home/mastodon/live/.env.production
```
No service restart needed — Rails reads `.env.production` at process boot, not per-request. Existing `puma`, `sidekiq`, and `streaming` services keep running.
---
## Gap 3: `mastodon` user shell is `/usr/sbin/nologin`, so `su - mastodon` fails
### Symptom
```
root@majortoot:~# su - mastodon
This account is currently not available.
```
Blocks all `tootctl` and Rails console admin via SSH.
### Cause
If the user was created with `useradd --system mastodon`, the system-account default is shell `/usr/sbin/nologin`. Mastodon's own installer typically sets `/bin/bash` but a manual / Ansible / Packer build path may have used `--system`.
### Fix
```bash
usermod -s /bin/bash mastodon
```
Verify with `getent passwd mastodon | cut -d: -f7`, which should print `/bin/bash`.
---
## Gap 4: Login shells don't load rbenv — `tootctl` reports "ruby: command not found"
### Symptom
After fixing Gap 3, `su - mastodon` succeeds, but:
```
mastodon@majortoot:~$ which ruby
(no output, exit 1)
mastodon@majortoot:~$ cd /home/mastodon/live && bin/tootctl version
/usr/bin/env: 'ruby': No such file or directory
```
### Cause
A typical Mastodon install puts rbenv init in `~/.bashrc`. But bash **login** shells (which `su -` and `ssh user@host` open) source `.bash_profile`, `.bash_login`, or `.profile` in that order — **not** `.bashrc`. If `.bash_profile` doesn't exist and `.profile` doesn't init rbenv, the login shell never gets rbenv on PATH.
Even when `.bash_profile` chains `.bashrc`, Ubuntu's default `.bashrc` has a guard at the top:
```bash
case $- in
*i*) ;;
*) return;;
esac
```
This **returns early for non-interactive shells**, which is exactly what `su - mastodon -c "<command>"` opens — so the rbenv init lines later in `.bashrc` are never reached.
### Fix
Drop a `.bash_profile` that sets up rbenv **before** sourcing `.bashrc`, so it works for both interactive and non-interactive login shells:
```bash
# /home/mastodon/.bash_profile (mode 0644, owned by mastodon:mastodon)
export PATH="$HOME/.rbenv/bin:$HOME/.rbenv/shims:$PATH"
if command -v rbenv >/dev/null 2>&1; then
eval "$(rbenv init -)"
fi
# Then load POSIX login env + bash interactive config
[ -f ~/.profile ] && . ~/.profile
[ -f ~/.bashrc ] && . ~/.bashrc
```
Verify:
```bash
su - mastodon -c "ruby -v" # → ruby 3.x.x …
su - mastodon -c "cd /home/mastodon/live && RAILS_ENV=production bin/tootctl version"
```
---
## Codified
All four gaps are handled by `configure_mastodon_permissions.yml` in MajorAnsible. The playbook is idempotent, requires no service restart, and includes self-asserting verification steps:
| Assertion | What it catches |
|---|---|
| `sudo -u www-data stat /home/mastodon/live/public/packs` must succeed | Gap 1 regression |
| `sudo -u www-data cat .env.production` must fail | Gap 2 regression |
| `su - mastodon -c "ruby -v"` must succeed and output "ruby" | Gap 3 or 4 regression |
Apply to all Mastodon hosts:
```bash
ansible-playbook configure_mastodon_permissions.yml
```
## References
- [[majortoot#2026-05-31 — ssh.socket race post-reboot on majortoot-hetzner (during cutover night)]]
- [[majortoot#tootctl CLI Note]]
- MajorAnsible: `configure_mastodon_permissions.yml`
- Related: [[mastodon-instance-tuning|Mastodon Instance Tuning]] · [[mastodon-db-maintenance|Mastodon DB Maintenance]]

View file

@ -0,0 +1,220 @@
---
title: Mastodon — The `--prune-profiles` Trap and How to Recover
description: Why running `tootctl media remove --prune-profiles` blows away avatars that don't come back, and how to repopulate them on demand
tags:
- mastodon
- tootctl
- federation
- self-hosting
- troubleshooting
created: 2026-05-07
updated: 2026-06-01
---
# Mastodon — The `--prune-profiles` Trap and How to Recover
If you administer a Mastodon instance and run `tootctl media remove --prune-profiles` on a schedule, you're probably introducing a long-running cosmetic regression that no one will be able to explain when it happens.
This article documents what the flag actually does, why the missing avatars don't auto-recover, and the smallest tool you can ship to fix things on demand.
## TL;DR
- `tootctl media remove --prune-profiles` deletes cached **remote** avatars older than `--days=N` from your S3/local storage **and** clears `accounts.avatar_file_name` in the database.
- Mastodon does **not** re-fetch avatars when a client views a profile. Re-fetch happens only on incoming `Update` ActivityPub activities or via an explicit `tootctl accounts refresh`.
- Quiet remote accounts therefore stay broken — sometimes for weeks — after a prune.
- The disk savings are modest (≈250 KB per account on average) and the cosmetic damage hits exactly the accounts you care about most: your follows.
- Most admins should **drop `--prune-profiles` and `--remove-headers` from cron** and refresh on demand instead.
## What the flags actually do
`tootctl media remove` has three distinct modes:
| Invocation | Target | Default `--days` |
|---|---|---|
| `tootctl media remove` | remote media **attachments** (images/video in posts) | 7 |
| `tootctl media remove --prune-profiles` | remote **avatars** | 7 |
| `tootctl media remove --remove-headers` | remote **headers** | 7 |
Each mode deletes the file from your storage backend and clears the matching database reference; for the two profile modes that is the `accounts.avatar_file_name` / `header_file_name` column. The two profile flags are **mutually exclusive**; passing both at once produces:
```
--prune-profiles and --remove-headers should not be specified simultaneously
```
If your cron script combines them, **the avatar/header pruning silently never runs**, and the first time you correct the bug you'll suddenly nuke everything that's accumulated since the instance was created.
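A quick way to catch this in an existing cron script is to look for any single command line that carries both flags. A minimal sketch; `flags_conflict` is a hypothetical helper and the sample line is illustrative, not taken from a real crontab:

```bash
# flags_conflict COMMAND_LINE -> exit 0 if both profile flags appear on the same line.
flags_conflict() {
  case "$1" in
    *--prune-profiles*--remove-headers*|*--remove-headers*--prune-profiles*) return 0 ;;
    *) return 1 ;;
  esac
}

line='bin/tootctl media remove --days=30 --prune-profiles --remove-headers'
if flags_conflict "$line"; then
  echo "conflict: these flags cannot share one tootctl invocation"
fi
```

To sweep a whole script, feed it line by line, for example `grep 'tootctl media remove' your-script.sh | while IFS= read -r l; do flags_conflict "$l" && echo "$l"; done`. Before splitting the flags into separate invocations, weigh whether they should be in cron at all.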
## Why the pictures don't come back
Mastodon's media-recovery model is event-driven, not lazy. The triggers that cause a remote avatar to be re-fetched are:
1. The remote actor emits an `Update` ActivityPub activity — typically when they edit their profile, change avatar, change display name, etc.
2. Less reliably, certain `Create` activities on accounts whose actor state appears stale.
3. Manual: `tootctl accounts refresh user@instance.tld`, the web UI's "Refresh profile" button (gear menu on the profile page), or admin actions touching the actor record.
What does **not** trigger a re-fetch:
- Loading the profile in any client (web, iOS app, Ivory, Tusky, Toot!, etc.).
- Liking, replying to, boosting, or following toots from the user.
- Viewing the user in your followers/following list.
This is why you see **broken avatars consistently across every client and device** — the asset is missing on your server, and your clients are all faithfully fetching from the same broken URL.
Active accounts re-emit `Update` activities reasonably often, so they self-heal over hours/days. Quiet accounts, accounts on small or down instances, and accounts whose owners simply don't update their profiles can stay broken indefinitely.
## Recovery on demand
Single account:
```bash
sudo -u mastodon -H bash -c '
cd /home/mastodon/live
export RAILS_ENV=production
export PATH=/home/mastodon/.rbenv/bin:/home/mastodon/.rbenv/shims:$PATH
bin/tootctl accounts refresh user@instance.tld
'
```
For your local user's follows, a small wrapper that finds only accounts with broken avatars *whose origin actually advertises one*:
```bash
#!/bin/bash
# refresh-my-follows.sh — repopulate broken avatars for the local user's
# follows. Idempotent. Skips accounts whose origin has no avatar (e.g.,
# users who never set one) and headers entirely (most users have none).
set -euo pipefail
export PATH="/home/mastodon/.rbenv/bin:/home/mastodon/.rbenv/shims:$PATH"
export RAILS_ENV=production
cd /home/mastodon/live
USER_TO_REFRESH="${1:-yourusername}"
accts=$(bin/rails runner "
acct = Account.find_by(username: %q($USER_TO_REFRESH), domain: nil)
abort %q(no such local account) unless acct
acct.following
.where.not(domain: nil)
.where(avatar_file_name: nil)
.where.not(avatar_remote_url: [nil, ''])
.pluck(:username, :domain)
.each { |u, d| puts %Q(#{u}@#{d}) }
" | grep -E '^[^[:space:]@]+@[^[:space:]@]+$' || true)
count=$(printf '%s\n' "$accts" | grep -cv '^$' || true)
echo "Found $count remote follows with missing avatar"
i=0
while IFS= read -r a; do
  [ -z "$a" ] && continue
  i=$((i+1))
  printf '[%d/%d] refresh %s ... ' "$i" "$count" "$a"
  if bin/tootctl accounts refresh "$a" >/dev/null 2>&1; then
    echo OK
  else
    echo FAIL
  fi
done <<< "$accts"
```
Three things in that WHERE clause matter:
- `avatar_file_name: nil` — local cache is empty, so we need to fetch.
- `domain: not nil` — only remote accounts have cached avatars to repopulate.
- `avatar_remote_url: [nil, '']` excluded: if the origin actor object has no avatar, refresh will not populate anything. Including these accounts makes the script retry them on every run without ever making progress.
## Bulk restore at scale
When the breakage is large — a bad prune across the whole instance, or a storage-level deletion (see the next section) — refreshing follows one at a time isn't enough. The generalized procedure:
1. List the keys that actually exist in storage, so you only touch the broken ones.
2. For each account whose current `avatar`/`header` key is **absent**, null the `*_file_name` (the redownload workers skip accounts that still have a file name) and enqueue the worker.
3. Let Sidekiq's `pull` queue drain.
```ruby
require "aws-sdk-s3"
require "set"

c = Aws::S3::Client.new(
  region: ENV["S3_REGION"],
  access_key_id: ENV["AWS_ACCESS_KEY_ID"],
  secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"]
)
b = ENV["S3_BUCKET"]

# Collect every object key under a prefix, following pagination.
def keys(c, b, prefix)
  s = Set.new
  t = nil
  loop do
    r = c.list_objects_v2(bucket: b, prefix: prefix, continuation_token: t, max_keys: 1000)
    r.contents.each { |o| s << o.key }
    break unless r.is_truncated
    t = r.next_continuation_token
  end
  s
end

avset = keys(c, b, "cache/accounts/avatars/")
hdset = keys(c, b, "cache/accounts/headers/")

Account.where.not(domain: nil)
       .where("avatar_file_name IS NOT NULL OR header_file_name IS NOT NULL")
       .find_each(batch_size: 1000) do |a|
  # Avatar: DB says present, origin advertises one, but the object is gone.
  if a.avatar_file_name.present? && a.avatar_remote_url.present? &&
     !avset.include?(a.avatar.path.sub(%r{^/}, ""))
    a.update_column(:avatar_file_name, nil)
    RedownloadAvatarWorker.perform_async(a.id)
  end

  # Header: same check against the headers prefix.
  if a.header_file_name.present? && a.header_remote_url.present? &&
     !hdset.include?(a.header.path.sub(%r{^/}, ""))
    a.update_column(:header_file_name, nil)
    RedownloadHeaderWorker.perform_async(a.id)
  end
end
```
Notes:
- Listing existing keys first means you re-fetch only what's missing, instead of re-downloading every avatar — which would re-bloat a bucket you may have just trimmed.
- The workers return early if `*_file_name` is present, which is why you must `update_column(..., nil)` before enqueuing.
- Avatars are small (tens of KB each), so re-fetching the whole missing set typically adds a few GB and a few hours of Sidekiq `pull` work. Headers are larger but still modest.
- Origins that deleted the avatar after you cached it return 404 — the permanent, irrecoverable tail.
## Broader failure: storage-level deletion without DB de-ref
`--prune-profiles` is one way avatars vanish, but it at least nulls the database column, so the account re-fetches on its next `Update`. The **more dangerous** variant is deleting objects directly in your storage backend — a manual `aws s3 rm`, an S3 lifecycle expiration rule, a bucket migration that doesn't copy everything, or any "cost cleanup" done outside `tootctl`. Those delete the file but leave `accounts.avatar_file_name` **set**, pointing at an object that no longer exists.
Why it's worse:
- The DB still thinks the avatar is present, and the redownload workers skip the account (`*_file_name` is non-null) — so it never self-heals until an `Update` arrives.
- It can hit **every** remote account at once, not just quiet ones.
- It looks identical to the S3-ACL upload bug — see [Mastodon on S3 — Silent Upload Failures](mastodon-s3-acl-upload-failures.md). Tell them apart by checking whether new uploads succeed (ACL bug) versus only old objects being gone (a one-off deletion).
Recover with the [bulk restore](#bulk-restore-at-scale) procedure above. **Prevent** it by never deleting Mastodon media at the storage level: prune *attachments* through `tootctl media remove` (which derefs the DB and re-fetches on demand) and leave avatars/headers alone.
## Why `header_file_name IS NULL` is a bad signal
A naive script will treat both `avatar_file_name IS NULL` and `header_file_name IS NULL` as "broken." Don't.
Roughly 20% of Mastodon users never set a custom header — the default blank header isn't represented as a file, so `header_file_name` is legitimately `NULL` for them. After a `tootctl accounts refresh`, the field stays `NULL` because there is genuinely nothing to fetch. A script with `OR header_file_name IS NULL` will retry these accounts forever and never make progress.
Avatar is different — nearly all real users set one, so `avatar_file_name IS NULL AND avatar_remote_url IS NOT NULL` is a reliable "broken and fixable" signal.
## The cron decision
If your weekly media-prune cron currently looks like:
```bash
bin/tootctl media remove --days=7 --concurrency=5
bin/tootctl media remove --prune-profiles --days=7 --concurrency=5
bin/tootctl media remove --remove-headers --days=7 --concurrency=5
bin/tootctl preview_cards remove --days=30 --concurrency=5
```
Consider deleting the middle two lines. The attachment prune is the real disk-saver (gigabytes per week on a busy instance). The avatar prune is small (~250 KB per remote account) and damages your UX. The header prune is even smaller and rarely worth it.
**Stronger recommendation:** after being bitten more than once, the safest policy is to **disable automated profile/header pruning entirely** — and reconsider scheduled `tootctl accounts refresh --all`, which re-fetches every profile and is destructive when uploads are failing at the time. Keep only a deliberate, occasional **attachment** prune if bucket size demands it. Pair that with a synthetic upload monitor (see [Mastodon on S3 — Silent Upload Failures](mastodon-s3-acl-upload-failures.md)) so any future regression is caught in hours instead of by a user weeks later.
## Edge cases
- **Origin-side 404:** the actor object advertises an avatar URL, but the URL itself returns 404. Your local cache stays empty no matter how many times you refresh. Only the origin user can fix it (re-upload). The script above will keep retrying these on every run; if that bothers you, add a "tried within last N hours" filter.
- **Suspended accounts:** `tootctl accounts refresh` returns OK on suspended accounts but does not download media. They'll stay broken, which is correct behavior.
- **Sidekiq backlog:** the avatar fetch is queued as a Sidekiq job, not done synchronously. If your `pull` queue is deep, you'll see a delay between "OK" and the avatar actually appearing in the database.
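The "tried within last N hours" filter mentioned for origin-side 404s could live outside the database, as one stamp file per account. This is a minimal sketch under stated assumptions: `RETRY_STATE_DIR`, `mark_tried`, and `should_retry` are hypothetical names for illustration, not part of Mastodon or `tootctl`, and `stat -c %Y` assumes GNU coreutils.

```bash
# Hypothetical helper: throttle per-account refresh retries with stamp files.
# RETRY_STATE_DIR and both function names are assumptions for illustration.
RETRY_STATE_DIR="${RETRY_STATE_DIR:-/var/tmp/avatar-retry}"

# mark_tried ACCOUNT_ID: record that a refresh was attempted just now.
mark_tried() {
  mkdir -p "$RETRY_STATE_DIR" && touch "$RETRY_STATE_DIR/$1"
}

# should_retry ACCOUNT_ID HOURS: succeed if the account was never tried,
# or if its last attempt is at least HOURS hours old.
should_retry() {
  local stamp="$RETRY_STATE_DIR/$1" hours="$2" now last
  [ -e "$stamp" ] || return 0
  now=$(date +%s)
  last=$(stat -c %Y "$stamp")   # GNU stat: mtime as epoch seconds
  [ $(( now - last )) -ge $(( hours * 3600 )) ]
}
```

A retry loop would then call `should_retry "$id" 24` before refreshing an account and `mark_tried "$id"` afterwards, so accounts in the permanent 404 tail are attempted at most once a day.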
## Related
- [Mastodon Instance Tuning](mastodon-instance-tuning.md) — broader perf notes for self-hosters
- [Mastodon DB Maintenance](mastodon-db-maintenance.md) — what to run on a schedule and when
- [Mastodon Federation](mastodon-federation.md) — how the actor refresh fits into the larger federation model


@ -0,0 +1,138 @@
---
title: Mastodon on S3 — Silent Upload Failures When the Bucket Disables ACLs
description: Why a BucketOwnerEnforced S3 bucket plus a stale S3_PERMISSION/S3_ACL in .env.production makes every Mastodon media upload fail with AccessControlListNotSupported, how to diagnose it, and how to fix and monitor it.
domain: selfhosting
category: services
tags:
- mastodon
- fediverse
- self-hosting
- aws
- s3
- paperclip
- troubleshooting
status: published
created: 2026-06-01
updated: 2026-06-01
---
# Mastodon on S3 — Silent Upload Failures When the Bucket Disables ACLs
If your Mastodon instance stores media on S3 and you switch the bucket to **Object Ownership = `BucketOwnerEnforced`** (which AWS now recommends, and which the console nudges you toward), every media upload can start failing **silently** unless you also remove the object-ACL setting from `.env.production`. New avatars, headers, and attachments stop appearing; old ones keep working; nothing obvious is logged. This article is the diagnosis and fix.
## TL;DR
- `BucketOwnerEnforced` **disables ACLs entirely** on the bucket. Any request that carries an `x-amz-acl` header is rejected with `AccessControlListNotSupported: The bucket does not allow ACLs`.
- Mastodon (via Paperclip) attaches `x-amz-acl` to every upload **if** `S3_PERMISSION` (or `S3_ACL`) is set in `.env.production`. The common value `S3_PERMISSION=public-read` — or a migration leftover like `S3_PERMISSION=private` — triggers the rejection.
- Result: **every new upload fails**, but the database row is still updated, so Mastodon believes it has the file. The object never lands → broken image. Objects written *before* the bucket changed keep serving fine, which masks the problem.
- **Fix:** set `S3_PERMISSION=` (empty) and remove any `S3_ACL=` line, then restart `mastodon-web` + `mastodon-sidekiq`. Public read is now served by the **bucket policy**, not per-object ACLs.
## Symptoms
- Newly-changed avatars/headers show broken; attachments on new posts fail to display.
- Avatars that were cached **before** the bucket setting changed still work — so "some work, some don't."
- `tootctl` and the web UI report success; Sidekiq doesn't obviously error.
- Direct fetch of a broken object's URL returns **403 AccessDenied** (not 404 — see below).
## Why a missing object returns 403, not 404
A typical Mastodon S3 bucket policy grants public `s3:GetObject` but **not** `s3:ListBucket`. Without `ListBucket`, S3 hides whether a key exists: a `GET` on a **missing** key returns **403 AccessDenied**, identical to a permissions denial. So "403" here usually means *the object isn't there*, not *the object is forbidden*. This is why the failure reads like a permissions problem when it's really a failed write.
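The status-code reading above can be captured in a small helper for triage scripts. This is only a sketch: the function name `classify_s3_status` is an assumption, and the mapping applies to anonymous requests against a bucket whose policy grants `s3:GetObject` without `s3:ListBucket`.

```bash
# Interpret the HTTP status of an anonymous GET against a bucket whose policy
# grants s3:GetObject but not s3:ListBucket (function name is illustrative).
classify_s3_status() {
  case "$1" in
    200) echo "present" ;;
    403) echo "missing-or-forbidden" ;;  # S3 hides key existence without ListBucket
    404) echo "missing" ;;               # only seen when ListBucket is granted
    *)   echo "unexpected" ;;
  esac
}
```

Feed it the status of a real fetch, for example `classify_s3_status "$(curl -s -o /dev/null -w '%{http_code}' "$url")"`, where `$url` is the media URL under test.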
## Diagnosis
Run these with the instance's own S3 credentials (e.g. via `bin/rails runner`, which loads `.env.production`):
```ruby
require "aws-sdk-s3"
c = Aws::S3::Client.new(
  region: ENV["S3_REGION"],
  access_key_id: ENV["AWS_ACCESS_KEY_ID"],
  secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"]
)
b = ENV["S3_BUCKET"]
# 1. Is the bucket ACL-disabled?
puts c.get_bucket_ownership_controls(bucket: b).ownership_controls.rules.map(&:object_ownership).inspect
# => ["BucketOwnerEnforced"] <-- ACLs are OFF
# 2. Does an upload WITH an ACL fail, and WITHOUT one succeed?
begin
  c.put_object(bucket: b, key: "tmp/acltest", body: "x", acl: "public-read")
  puts "PUT+acl: OK"
  c.delete_object(bucket: b, key: "tmp/acltest") # clean up the probe if the PUT succeeded
rescue => e
  puts "PUT+acl FAILS: #{e.class} / #{e.message}" # AccessControlListNotSupported
end
c.put_object(bucket: b, key: "tmp/noacltest", body: "x") # succeeds
c.delete_object(bucket: b, key: "tmp/noacltest")
# 3. Confirm a "broken" avatar's object is actually missing
key = Account.find_by(username: "someuser", domain: "remote.tld").avatar.path.sub(%r{^/}, "")
begin
  c.head_object(bucket: b, key: key)
  puts "EXISTS"
rescue Aws::S3::Errors::NotFound
  puts "MISSING"
rescue Aws::S3::Errors::Forbidden
  # If these credentials lack s3:ListBucket, S3 answers 403 for a missing key too
  puts "MISSING or forbidden (403)"
end
```
If #1 shows `BucketOwnerEnforced` and #2 shows the ACL'd PUT failing while the plain PUT succeeds, you've confirmed it.
Check `.env.production` for the offending settings:
```bash
grep -E '^S3_(ACL|PERMISSION|NO_INHERIT)' /home/mastodon/live/.env.production
# S3_ACL=private <-- remove
# S3_PERMISSION=private <-- set empty
```
## The fix
1. Edit `.env.production`:
- `S3_PERMISSION=` (empty — Paperclip then sends no `x-amz-acl` header)
- remove/comment any `S3_ACL=` line
2. Restart so the env is reloaded: `systemctl restart mastodon-sidekiq mastodon-web`
3. Verify the previously-failing write path now works — reprocess any existing avatar and confirm it serves 200:
```ruby
a = Account.local.first
a.avatar.reprocess! # used to raise AccessControlListNotSupported; now succeeds
```
Public readability is now provided by the **bucket policy** (grant `s3:GetObject` on `arn:aws:s3:::your-bucket/*` to `Principal: "*"`), with the account-level **Block Public Access** "ACLs" toggles off and "policy" allowed. You do **not** need per-object ACLs at all.
### Recovering the avatars that broke while it was failing
Any media that failed to upload during the broken window is gone from S3 while the DB still references it. Because Mastodon's redownload workers **skip accounts whose `*_file_name` is already set**, you must null the dead reference first, then enqueue the worker. See [Mastodon — The `--prune-profiles` Trap and How to Recover](mastodon-prune-profiles-trap.md#bulk-restore-at-scale) for the bulk procedure.
## Don't let it happen silently again — monitor uploads
The worst part of this bug is the silence. Add a periodic **synthetic write check** that uploads a tiny object with the app's own credentials, confirms it, deletes it, and alerts on failure:
```ruby
s3.put_object(bucket: b, key: "health/upload-check", body: "ok") # no acl
s3.head_object(bucket: b, key: "health/upload-check")
s3.delete_object(bucket: b, key: "health/upload-check")
# any exception -> email an alert
```
Pair it with an HTTP check that your **local** account avatars all return 200 (they always should). Run both every few hours from cron. A regression then pages you in hours instead of being discovered by a user weeks later.
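For the HTTP half of that check, one possible shape is a filter that takes status/URL pairs and reports only the failures, so cron output stays empty on a healthy run. This is a sketch; the name `failed_urls` and the `STATUS URL` input format are assumptions for illustration.

```bash
# failed_urls: read "STATUS URL" lines on stdin and print the URLs whose
# status is not 200. Exit status is 1 if any URL failed, 0 otherwise.
# (Hypothetical helper; the name and input format are assumptions.)
failed_urls() {
  awk '$1 != 200 { print $2; bad = 1 } END { exit bad + 0 }'
}
```

The input lines can come from `curl -s -o /dev/null -w '%{http_code} %{url_effective}\n' "$url"` run once per local avatar URL. A non-zero exit from `failed_urls` is the signal to send the alert.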
## Ansible enforcement
If you manage the host with Ansible, enforce the safe values so a future template render can't reintroduce the ACL header:
```yaml
- name: Ensure S3_PERMISSION is empty (no x-amz-acl on uploads)
  ansible.builtin.lineinfile:
    path: /home/mastodon/live/.env.production
    regexp: '^S3_PERMISSION='
    line: 'S3_PERMISSION='
  notify: Restart Mastodon services
- name: Remove any active S3_ACL line (ACLs unsupported on this bucket)
  ansible.builtin.lineinfile:
    path: /home/mastodon/live/.env.production
    regexp: '^S3_ACL=.+'
    state: absent
  notify: Restart Mastodon services
```
## Related
- [Mastodon — The `--prune-profiles` Trap and How to Recover](mastodon-prune-profiles-trap.md) — the other way avatars go missing, plus the bulk-restore script
- [Mastodon Post-Install Hardening (Permissions + Account)](mastodon-post-install-hardening.md)
- [AWS S3 Cost Management](../cloud/aws-s3-cost-management.md) — pruning attachments to control bucket size (safely)


@ -0,0 +1,276 @@
---
title: "Inbound Spam Filtering: spamass-milter + SpamAssassin Bayes on Postfix/Dovecot (Fedora)"
domain: selfhosting
category: services
tags: [postfix, dovecot, spamassassin, spamass-milter, bayes, spam, sieve, fedora, email, selinux]
status: published
created: 2026-06-04
updated: 2026-06-05
---
# Inbound Spam Filtering: spamass-milter + SpamAssassin Bayes on Postfix/Dovecot
How to add inbound spam scanning to a Postfix/Dovecot virtual-mailbox server on Fedora: SpamAssassin scans every inbound message via `spamass-milter`, spam is **tagged (never rejected)**, Dovecot's Sieve files it into the user's `Junk` folder, and a **site-wide Bayes database** — shared between the scan path and manual `sa-learn` training — learns from your real mail.
This is a "tag and quarantine" design (not "reject at SMTP"), which is the safe default: a misfire lands a message in Junk for review rather than bouncing legitimate mail.
## Architecture
```
inbound SMTP (25) ─► Postfix smtpd
                       │ smtpd_milters:
                       │   1. OpenDKIM (verify/sign)
                       │   2. spamass-milter ─► spamc ─► spamd (SpamAssassin)
                       │      adds X-Spam-Flag / X-Spam-Status headers
Dovecot LMTP delivery ─► global Sieve
                           if X-Spam-Flag: YES ─► fileinto "Junk"
                           else                ─► INBOX
Bayes DB /var/lib/spamassassin/bayes/ (site-wide, shared)
  ├─ spamd auto-learns at scan time
  └─ sa-learn manual/scripted training from Maildir folders
```
## 1. Install
```bash
sudo dnf install spamassassin spamass-milter
sudo systemctl enable --now spamassassin # spamd
```
On Fedora the `spamass-milter` unit runs as the unprivileged **`sa-milt`** user and creates its socket at `/run/spamass-milter/spamass-milter.sock`. Remember that user — the Bayes DB ownership and the socket permissions both hinge on it.
## 2. Configure spamass-milter — tag-only
Edit `/etc/sysconfig/spamass-milter`:
```sh
EXTRA_FLAGS="-a -r 999999"
```
> [!warning] The `-r` flag is a footgun
> `-r nn` rejects mail scoring ≥ `nn` at SMTP time. **Omitting `-r` does NOT mean "never reject"** — this build still rejects flagged spam at a low default threshold (a GTUBE test will get `550 Blocked by SpamAssassin`). To get pure tag-only behaviour, set the threshold absurdly high (`-r 999999`) so nothing ever reaches it. Do **not** use `-r -1` — that means "reject anything tagged as spam."
- `-a` — skip messages on **authenticated** connections, so your own outbound/submission mail isn't scanned or tagged.
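Because a missing `-r` is not the same as "never reject", it can help to audit what the configured flags actually say. A rough sketch follows, assuming the flags are passed as one space-separated string such as the `EXTRA_FLAGS` value above; the function name `reject_threshold` is hypothetical, and it does not handle a joined form like `-r999999`.

```bash
# reject_threshold "FLAGS": print the value following -r, or "default" when
# -r is absent (which, per the warning above, still rejects on this build).
# The function name is an assumption for illustration.
reject_threshold() {
  local prev="" word
  for word in $1; do            # intentional word splitting on the flag string
    if [ "$prev" = "-r" ]; then
      echo "$word"
      return 0
    fi
    prev="$word"
  done
  echo "default"
}
```

For example, after sourcing `/etc/sysconfig/spamass-milter`, `reject_threshold "$EXTRA_FLAGS"` should print `999999` for the tag-only setup described here; `default` or a small number means SMTP-time rejection is still possible.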
## 3. Socket permissions (so Postfix can connect)
The socket is created `0770 sa-milt:sa-milt` only if you widen the unit's umask; by default it's `0755` and Postfix (running as `postfix`) can't write to it. Two steps:
```bash
# 1. Let the socket be group-accessible
sudo install -d /etc/systemd/system/spamass-milter.service.d
printf '[Service]\nUMask=0007\n' | sudo tee /etc/systemd/system/spamass-milter.service.d/socket-perms.conf
# 2. Put postfix in the sa-milt group (Postfix is restarted in §4; group membership is read at start)
sudo usermod -aG sa-milt postfix
sudo systemctl daemon-reload
sudo systemctl enable --now spamass-milter
```
Verify: `sudo -u postfix test -w /run/spamass-milter/spamass-milter.sock && echo OK`.
## 4. Wire into Postfix
Append the milter **alongside** OpenDKIM — don't replace it. Inbound (`smtpd`) gets both; local-injected mail (`non_smtpd`) stays DKIM-only.
```bash
postconf -e 'smtpd_milters = local:/run/opendkim/opendkim.sock unix:/run/spamass-milter/spamass-milter.sock'
postconf -e 'milter_default_action = accept' # if SA is down, accept the mail — never defer/bounce
sudo systemctl restart postfix # restart (not reload) to pick up the new group
```
`milter_default_action = accept` is important: if the milter ever hiccups, mail still flows.
## 5. Site-wide Bayes DB
Put the Bayes DB in one fixed location so the scan path and your training script share it. In `/etc/mail/spamassassin/local.cf`:
```
use_bayes 1
bayes_auto_learn 1
bayes_path /var/lib/spamassassin/bayes/bayes
bayes_file_mode 0660
```
Create the directory owned by the **scanning user** (`sa-milt`), under `/var/lib/spamassassin` so it inherits the correct SELinux type (`spamd_var_lib_t`):
```bash
sudo install -d -m 2770 -o sa-milt -g sa-milt /var/lib/spamassassin/bayes
sudo restorecon -Rv /var/lib/spamassassin/bayes
sudo systemctl restart spamassassin
```
The `2770` setgid + `bayes_file_mode 0660` means whether the DB is written by `spamd` (as `sa-milt`) or by `sa-learn` (as `root`, from a training script), all parties can read and write it.
## 6. File spam into Junk (Dovecot Sieve)
A global Sieve before-script files anything SpamAssassin flagged. `/etc/dovecot/sieve/global/spam-to-junk.sieve`:
```sieve
require ["fileinto", "mailbox"];
if anyof (header :contains "X-Spam-Flag" "YES", header :contains "X-Spam-Status" "Yes") {
fileinto :create "Junk";
stop;
}
```
Register it as a global before-script in `dovecot.conf` (NOT under `plugin {}` on Pigeonhole 2.4+ — see warning below), then compile and restart Dovecot:
```bash
sievec /etc/dovecot/sieve/global/spam-to-junk.sieve # produces .svbin
systemctl restart dovecot
```
> [!warning] Pigeonhole 2.4 dropped `plugin/sieve_before` — it silently does nothing
> Before Dovecot/Pigeonhole 2.4, the canonical way to register a global before-script was:
>
> ```
> plugin {
> sieve_before = /etc/dovecot/sieve/global/spam-to-junk.sieve
> }
> ```
>
> On **Dovecot 2.4+**, that setting is gone and **silently ignored**: there is no warning at start-up, the script never runs, and mail carrying `X-Spam-Flag: YES` just lands in INBOX, leaving you wondering why nothing files it. The 2.4 replacement is a top-level `sieve_script` block (not inside `plugin {}`):
>
> ```
> sieve_script spam_before {
> type = before
> path = /etc/dovecot/sieve/global/spam-to-junk.sieve
> }
> ```
>
> Verify with `doveconf -n | grep -A2 spam_before`. If it doesn't appear, dovecot.conf isn't reading your file — check that `!include conf.d/*.conf` exists in dovecot.conf (some Fedora rebuilds ship a flat dovecot.conf without it; the block has to live in dovecot.conf directly).
## 6b. (Optional) Route spam to a separate mailbox — silence iOS push notifications
`fileinto :create "Junk"` moves spam to the user's `.Junk` folder, but the user's IMAP session still sees a new-message event in INBOX (briefly, before sieve moves it) or in Junk (depending on client subscriptions). For clients with IMAP IDLE + push, that's a notification you don't want — e.g. Spark on iPhone/iPad fires APNS on any new message touching a subscribed folder.
To make spam **invisible to the user's mailbox entirely**, REDIRECT the envelope at Postfix `cleanup` (after the milter adds `X-Spam-Flag`, before LMTP delivery) so spam lands in a separate `junk@` mailbox the user doesn't subscribe to:
```bash
# /etc/postfix/cleanup_header_checks
/^X-Spam-Flag:[[:space:]]+YES/ REDIRECT junk@example.com
```
```bash
postconf -e 'header_checks = regexp:/etc/postfix/cleanup_header_checks'
systemctl reload postfix
```
> [!tip] Use `regexp:`, not `pcre:`, on stock Fedora
> `pcre:` requires the `postfix-pcre` package. `regexp:` is built into Postfix and supports POSIX extended regex: use `[[:space:]]+` for whitespace and `\\` for a literal backslash. The patterns in cleanup_header_checks are simple enough that regexp is plenty.
The Sieve from §6 still runs as a safety net for any tagged message that escapes the cleanup REDIRECT (e.g. a message addressed to the junk@ mailbox itself, or aliases not covered by the REDIRECT rule). Defense in depth.
Train Bayes from the `junk@` Maildir instead of (or in addition to) per-user Junk folders:
```bash
sa-learn --spam /var/vmail/example.com/junk/{cur,new}
```
## 7. Training the Bayes filter
SpamAssassin's Bayes only starts scoring once it has learned **≥ 200 spam AND ≥ 200 ham** (`bayes_min_spam_num` / `bayes_min_ham_num`). Train from your Maildir folders with `sa-learn`. **Run it as `root`** — root can read every user's Maildir *and* write the Bayes DB.
```bash
# Spam — your Junk folder(s) and any dedicated spam mailbox
sa-learn --spam /var/vmail/example.com/user/.Junk/{cur,new}
# Ham — Sent + Inbox (known-good)
sa-learn --ham /var/vmail/example.com/user/{cur,new}
sa-learn --ham /var/vmail/example.com/user/.Sent/{cur,new}
sa-learn --sync
sa-learn --dump magic | grep -E 'nspam|nham'
```
`bayes_path` is read from `local.cf`, so no `--dbpath` is needed.
> [!tip] Keep spam and ham roughly balanced
> Bayes accuracy drops when one corpus dwarfs the other (aim for within ~3:1). Don't dump a 90,000-message archive of ham against a few hundred spam — it biases everything toward "ham" and spam slips through. Use Sent + recent Inbox for ham, not your entire archive.
> [!warning] Train manually, not from cron — unless your folders are always clean
> `sa-learn` learns whatever is *in* the folder. If a spam slips into the Inbox, or you haven't yet rescued a false-positive out of Junk, an unattended cron run will mislearn it. Prefer a manual script you run **after** triaging Junk/Inbox. (`sa-learn` is idempotent and re-classifies on re-run, so a mistake is fixable: move the message to the right folder and run again.)
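The 200/200 floor and the rough 3:1 balance can be checked mechanically after each training run. The sketch below assumes that `sa-learn --dump magic` prints the counts as the third whitespace-separated field on lines ending in `non-token data: nspam` and `non-token data: nham`; verify that against your own output before relying on it. The name `bayes_ready` is hypothetical.

```bash
# bayes_ready: read `sa-learn --dump magic` output on stdin and report whether
# both corpora have reached the 200-message floor and stay within ~3:1.
# Assumes the count is the third field on the nspam/nham lines.
bayes_ready() {
  awk '
    /non-token data: nspam$/ { nspam = $3 }
    /non-token data: nham$/  { nham  = $3 }
    END {
      printf "nspam=%d nham=%d\n", nspam, nham
      if (nspam < 200 || nham < 200)            { print "below-floor"; exit 1 }
      if (nspam > 3 * nham || nham > 3 * nspam) { print "unbalanced";  exit 2 }
      print "ok"
    }'
}
```

Run it as `sa-learn --dump magic | bayes_ready`; the last line of output is `ok`, `below-floor`, or `unbalanced`, and the exit status is non-zero for the latter two.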
### 7a. Weekly systemd timer (safe when junk@ is dedicated and INBOX is curated)
The warning above is the safe default. If you use the §6b REDIRECT-to-junk@ pattern, **the junk mailbox is pure spam by design** (only `X-Spam-Flag:YES` envelopes reach it), and your INBOX is curated by hand — the misclassification risk drops to near zero, and a weekly timer becomes both safe and useful. Add `--force-expire` to age out stale tokens so the Bayes corpus doesn't drift.
```ini
# /etc/systemd/system/sa-learn-majormail.service
[Unit]
Description=SpamAssassin Bayes training from majorshouse.com Maildir
After=spamassassin.service
Wants=spamassassin.service
[Service]
Type=oneshot
Nice=10
IOSchedulingClass=idle
ExecStart=/usr/bin/sa-learn --spam --no-sync \
/var/vmail/example.com/junk/cur \
/var/vmail/example.com/junk/new
ExecStart=/usr/bin/sa-learn --ham --no-sync \
/var/vmail/example.com/user/cur \
/var/vmail/example.com/user/new \
/var/vmail/example.com/user/.Sent/cur \
/var/vmail/example.com/user/.Sent/new
ExecStart=/usr/bin/sa-learn --sync
ExecStart=/usr/bin/sa-learn --force-expire
```
```ini
# /etc/systemd/system/sa-learn-majormail.timer
[Unit]
Description=Weekly SpamAssassin Bayes training + expiry
[Timer]
OnCalendar=Sun 04:15
Persistent=true
RandomizedDelaySec=20min
[Install]
WantedBy=timers.target
```
```bash
systemctl daemon-reload
systemctl enable --now sa-learn-majormail.timer
systemctl list-timers sa-learn-majormail.timer
```
`Persistent=true` runs the missed job on next boot if the host was off at 04:15. `--force-expire` is a no-op until SA's expiry heuristic decides tokens are due (typically every few weeks for the default `bayes_expiry_max_db_size`).
## 8. Test
Send a [GTUBE](https://spamassassin.apache.org/gtube/) probe through port 25 (unauthenticated) and a normal message:
```bash
# from a host that can reach :25 — GTUBE scores ~1000
printf 'Subject: gtube\n\nXJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X\n' \
| sendmail -f test@example.org user@example.com
```
Confirm in `/var/log/maillog` that `spamd` scanned it (`result: Y …`), the message was **delivered** (no `milter-reject`), it landed in `.Junk`, and the stored message has `X-Spam-Flag: YES`.
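The last of those checks, that the stored message really carries the flag, can be scripted so a test run does not depend on eyeballing a Maildir file. This is a minimal sketch; `has_spam_flag` is a hypothetical name, and it inspects only the header block so that a body quoting the header does not count.

```bash
# has_spam_flag FILE: succeed if the message's header block (everything
# before the first blank line) contains "X-Spam-Flag: YES".
# The function name is illustrative.
has_spam_flag() {
  awk '
    /^\r?$/ { exit }                                   # blank line: end of headers
    tolower($0) ~ /^x-spam-flag:[ \t]*yes/ { found = 1; exit }
    END { exit !found }
  ' "$1"
}
```

Point it at the delivered file under the user's `.Junk/new/` or `.Junk/cur/` directory after sending the GTUBE probe.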
## Gotchas recap
| Symptom | Cause | Fix |
|---|---|---|
| Spam gets `550 Blocked by SpamAssassin` (you wanted Junk) | spamass-milter rejects at a default threshold | `-r 999999` for tag-only |
| Postfix can't reach the milter socket | socket `0755`, postfix not in `sa-milt` group | `UMask=0007` drop-in + `usermod -aG sa-milt postfix` + restart postfix |
| `sa-learn` trains but `spamd` doesn't use it | per-user vs site Bayes mismatch | set `bayes_path` in `local.cf` (site-wide) |
| Bayes never scores (`BAYES_*` absent) | below the 200/200 learn floor | train more, keep spam/ham balanced |
| Your own outbound mail gets tagged | scanning authenticated mail | `-a` flag |
| AVC denials on the Bayes DB (SELinux) | DB outside `/var/lib/spamassassin` | keep it under that path (`spamd_var_lib_t`) + `restorecon` |
| `plugin/sieve_before` does nothing — spam keeps reaching INBOX | Pigeonhole 2.4 silently dropped that setting | use the top-level `sieve_script <name> { type = before; path = ...; }` block instead |
| `postfix reload` fails: `unsupported dictionary type: pcre` | `pcre:` map requires `postfix-pcre` package | install it, OR use `regexp:` (built-in POSIX) |
| Sieve `fileinto Junk` still notifies Spark/iOS | client subscribes to Junk; LMTP delivery briefly hits INBOX | REDIRECT envelope at Postfix cleanup (§6b) so the message never reaches the user's mailbox at all |
| Local `sendmail` test doesn't trigger REDIRECT | `sendmail` bypasses smtpd milters → no `X-Spam-Flag` added | inject through SMTP :25 (e.g. swaks) OR pre-set the header in the test message |
## See also
- [[selinux-dovecot-vmail-context|SELinux: Fixing Dovecot Mail Spool Context (/var/vmail)]]
- [[linux-server-hardening-checklist|Linux Server Hardening Checklist]] (basic `sa-learn` section)


@ -0,0 +1,137 @@
---
title: "App-Consistent Fleet Backups with restic + Backblaze B2"
domain: selfhosting
category: storage-backup
tags: [restic, backblaze, b2, backup, ansible, systemd, postgresql, mysql, sqlite, docker, disaster-recovery]
status: published
created: 2026-06-19
updated: 2026-06-19
---
# App-Consistent Fleet Backups with restic + Backblaze B2
A repeatable pattern for backing up a mixed fleet (Ubuntu + Fedora, VPS + homelab, bare services + Docker) to Backblaze B2 with [restic](https://restic.net) — encrypted, deduplicated, and **app-consistent** (databases are dumped before the snapshot, not copied live). Driven by Ansible and a per-host `systemd` timer.
## The Short Answer
Per host, nightly: **dump every database to a staging dir → `restic backup` that staging dir plus the data paths → apply retention → wipe staging.** A monthly timer runs `restic prune`. Anything that fails emails the admin. One B2 bucket holds a separate repo per host at `b2:<bucket>:<hostname>`.
Retention is `--keep-daily 7 --keep-weekly 4 --keep-monthly 6` (~6 months of history).
## Why dump databases first
Copying a live database's files (`/var/lib/mysql`, a running SQLite file, a Postgres data dir) gives you a *crash-consistent* copy at best — restorable only if you're lucky. Logical dumps are guaranteed consistent:
- **MySQL / MariaDB:** `mysqldump --single-transaction --routines --triggers --databases <db>`
- **PostgreSQL:** `pg_dump -Fc <db>` (custom format) via the `postgres` system user (peer auth)
- **SQLite:** `sqlite3 <file> ".backup '<out>'"` — uses the online backup API, safe against a running writer
- **Dockerized DBs:** `docker exec <container> sh -c '<dump cmd>'`, letting the container's own shell expand its root-password env var
restic then backs up the dump files (which dedupe beautifully — only the changed blocks upload each night).
## Repository layout
- **One private B2 bucket** (e.g. `majorshouse-backups`).
- **One repo per host:** `b2:majorshouse-backups:<hostname>`.
- The application key needs **read + write + delete** for the bucket. restic deletes objects during `forget`/`prune`, so a pure *append-only* key will break retention. (True append-only requires splitting `forget`/`prune` onto a separate maintenance key — a worthwhile hardening step, but not the default.)
- Credentials live in an `EnvironmentFile` (`/etc/restic/restic-env`, mode `0600`, root): `RESTIC_REPOSITORY`, `RESTIC_PASSWORD`, `B2_ACCOUNT_ID`, `B2_ACCOUNT_KEY`.
## The backup script (shape)
```bash
set -uo pipefail
STAGING=/var/backups/restic-staging
rm -rf "$STAGING"; mkdir -p "$STAGING"; chmod 700 "$STAGING"
# per-engine dumps into $STAGING ...
mysqldump --single-transaction --routines --triggers --databases wordpress > "$STAGING/mysql-wordpress.sql"
sudo -u postgres pg_dump -Fc mastodon_production > "$STAGING/pg-mastodon_production.dump"
sqlite3 /opt/phantombot/config/phantombot.db ".backup '$STAGING/sqlite-phantombot.db'"
restic backup --tag fleet-backup --host "$(hostname -s)" \
"$STAGING" /var/www /etc/letsencrypt --exclude /path/to/already-offsite/media
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6
rm -rf "$STAGING"
```
Wrap each step so a failure mails the admin and aborts (don't silently back up a half-state). On hosts where the `mail` CLI is absent, pipe a message to `/usr/sbin/sendmail -t` instead.
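One way to do that wrapping is a small `run_step` function that runs a command, and on failure notifies the admin and hands the exit status back so the caller can abort. This is a sketch, not the deployed script: `run_step`, `notify_admin`, and `ADMIN_EMAIL` are assumed names, and the notifier simply prefers `mail` and falls back to `/usr/sbin/sendmail -t` as described above.

```bash
# notify_admin SUBJECT: default notifier; override it to suit the host.
# ADMIN_EMAIL and both function names are assumptions for illustration.
ADMIN_EMAIL="${ADMIN_EMAIL:-root}"
notify_admin() {
  if command -v mail >/dev/null 2>&1; then
    echo "$1" | mail -s "$1" "$ADMIN_EMAIL"
  else
    printf 'To: %s\nSubject: %s\n\n%s\n' "$ADMIN_EMAIL" "$1" "$1" | /usr/sbin/sendmail -t
  fi
}

# run_step NAME COMMAND...: run one backup step; on failure, notify and
# return the step's exit status so the caller can abort.
run_step() {
  local name="$1"; shift
  "$@" && return 0
  local rc=$?
  notify_admin "restic backup on $(hostname -s): step '$name' failed (exit $rc)"
  return "$rc"
}
```

In the script each step then becomes something like `run_step "restic backup" restic backup ... || exit 1`, so a failed dump stops the run before a half-state is snapshotted.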
## systemd units
A oneshot service + a timer. Stagger `OnCalendar` per host to spread B2 load, and **always set `RESTIC_CACHE_DIR`** (see Gotchas):
```ini
# restic-backup.service
[Service]
Type=oneshot
EnvironmentFile=/etc/restic/restic-env
Environment=RESTIC_CACHE_DIR=/var/cache/restic
ExecStart=/usr/local/sbin/restic-backup.sh
Nice=10
IOSchedulingClass=idle
```
```ini
# restic-backup.timer
[Timer]
OnCalendar=*-*-* 02:30:00
RandomizedDelaySec=20m
Persistent=true
[Install]
WantedBy=timers.target
```
A second `restic-prune.timer` runs `restic prune` monthly (`OnCalendar=*-*-01 04:00:00`).
## Restore procedure
The whole point. From the target host (or any host with the repo creds):
```bash
# load repo + B2 creds without echoing them
set -a; . /etc/restic/restic-env; set +a
restic snapshots # list; note the snapshot ID or use 'latest'
# restore specific paths to a scratch dir (never restore in place blindly)
restic restore latest --target /tmp/restore \
--include /var/backups/restic-staging \
--include /var/www/html/wp-config.php
# verify before doing anything with it
ls -la /tmp/restore/var/backups/restic-staging/
head -1 /tmp/restore/var/backups/restic-staging/mysql-wordpress.sql # "-- MySQL dump 10.13 ..."
```
To recover a database, restore the dump then load it: `mysql <db> < mysql-<db>.sql`, `pg_restore -d <db> pg-<db>.dump`, or copy the SQLite file back. **Test restores periodically** — a backup you've never restored is a hope, not a backup. Restore the highest-stakes data (password manager, mail) first in any drill.
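During a restore drill, a quick sanity check on each restored dump is to look at its leading bytes before loading it anywhere. The sketch below is an illustration with an assumed function name (`dump_kind`); it relies on the `-- MySQL dump` first line shown above, the `PGDMP` magic of `pg_dump -Fc` archives, and the `SQLite format 3` file header. Header text varies by tool version (newer MariaDB releases may prepend an extra comment line), so treat `unknown` as "look closer", not as proof of corruption.

```bash
# dump_kind FILE: guess the dump type from its leading bytes.
# Hypothetical helper; header strings vary by tool version.
dump_kind() {
  local first
  first=$(head -c 15 -- "$1" | tr -d '\000')   # strip NULs before capturing
  case "$first" in
    PGDMP*)                              echo "postgres-custom" ;;
    "SQLite format 3")                   echo "sqlite" ;;
    "-- MySQL dump"*|"-- MariaDB dump"*) echo "mysql" ;;
    *)                                   echo "unknown"; return 1 ;;
  esac
}
```

For example, `dump_kind /tmp/restore/var/backups/restic-staging/mysql-wordpress.sql` should report `mysql` for the dump restored in the procedure above.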
## Adding a host
1. Add it to the `backups` inventory group.
2. Give it a `host_vars` scope — which DBs to dump and which paths to back up:
```yaml
restic_backup_oncalendar: "*-*-* 02:40:00" # stagger
restic_mysql_dbs: [castopod_db]
restic_paths: [/var/www/html/castopod]
restic_excludes: [/var/www/html/castopod/public/media] # already offsite
```
3. Run the playbook against that host. The role installs restic, deploys the script + units, `restic init`s the repo if absent, and enables the timers.
## Gotchas & Notes
- **`RESTIC_CACHE_DIR` is mandatory under systemd.** systemd services run with no `$HOME`, so restic can't find its cache and warns *"unable to locate cache directory: neither $XDG_CACHE_HOME nor $HOME are defined"* — and re-reads **every file** each run (no incremental). Point it at `/var/cache/restic` in the unit.
- **`sqlite3` may not be installed.** A host that runs a SQLite-backed app (e.g. a bot) often lacks the `sqlite3`/`sqlite` CLI. Install it where `restic_sqlite_paths` is set, or the `.backup` step fails.
- **Docker DB password env-var names vary.** Don't assume: the MariaDB image may use `MYSQL_ROOT_PASSWORD` (not `MARIADB_ROOT_PASSWORD`), and a Postgres container's superuser is whatever `POSTGRES_USER` is set to — reference `"$POSTGRES_USER"` rather than hardcoding `postgres`. Check with `docker exec <c> sh -c 'env | grep -oE "^(MYSQL|MARIADB|POSTGRES)_[A-Z_]*"'` (name only).
- **B2 key needs delete capability.** Otherwise `forget`/`prune` fail. Scope the key to the bucket; reach for per-host `namePrefix`-restricted keys for blast-radius isolation.
- **Exclude data that's already offsite.** Media already synced to object storage (S3/B2 via the app or `rclone`) should be `--exclude`d so you don't pay to store it twice.
- **First upload is slow, the rest are fast.** The initial snapshot reads and uploads everything; subsequent runs only ship changed blocks. For a large first run, fire it detached and watch from a transient unit that emails you on completion.
- **Keep secrets out of git.** The repo password and B2 key belong in an Ansible vault (committed encrypted), referenced into the role — never in plaintext vars.
- **Changing a host's backup paths starts a new snapshot group.** `restic forget` groups snapshots by `host`+`paths` by default, so adding or removing a path on an existing host creates a *separate* lineage: the old path-set and the new one each retain their own 7d/4w/6m snapshots, and `restic snapshots` shows both. Expected, not a bug — but it means the old-path snapshots age out on their own schedule rather than being superseded. To collapse everything into one retention bucket, run `forget` with `--group-by host` (be deliberate: it then treats *any* path-set on that host as the same group).
## See Also
- [rsync Backup Patterns](rsync-backup-patterns.md)
- [SnapRAID & MergerFS Storage Setup](../../01-linux/storage/snapraid-mergerfs-setup.md)
- [restic documentation](https://restic.readthedocs.io)

0
04-streaming/audio/.keep Normal file

@ -0,0 +1,331 @@
---
title: "HEVC Batch Re-Encode for Plex Using VAAPI (AMD GPU)"
domain: streaming
category: plex
tags: [plex, ffmpeg, hevc, vaapi, amd, gpu, encode, storage, rx480]
status: published
created: 2026-05-15
updated: 2026-06-05
---
# HEVC Batch Re-Encode for Plex Using VAAPI (AMD GPU)
## Problem
Plex NVMe storage is filling up from a large library of H.264-encoded video files (YouTube downloads, stream archives, etc.). Re-encoding to HEVC (H.265) reclaims 30-50% of disk space. The catch: Plex tracks each file's "date added" in a SQLite database, and that order matters for playback queues. Naive re-encode-and-replace approaches can corrupt or reset that metadata.
## Solution
Use `ffmpeg` with `hevc_vaapi` (AMD GPU hardware encoder) to batch re-encode files in-place using an atomic rename swap that preserves the Plex database record — including `added_at` — without any Plex downtime or database editing.
---
## How Plex Stores "Date Added"
Plex does **not** use file modification time (`mtime`) for "date added." It stores a Unix timestamp in its SQLite database:
```sql
-- Plex DB location (override via systemd unit may differ — check):
-- /var/lib/plexmediaserver/Library/Application Support/Plex Media Server/
-- Plug-in Support/Databases/com.plexapp.plugins.library.db
-- (or wherever PLEX_MEDIA_SERVER_APPLICATION_SUPPORT_DIR points)
SELECT mi.added_at, datetime(mi.added_at, 'unixepoch'), mp.file
FROM metadata_items mi
JOIN media_items me ON me.metadata_item_id = mi.id
JOIN media_parts mp ON mp.media_item_id = me.id
WHERE mp.file LIKE '%your-file%';
```
> **Note:** If the default path returns 0 rows, check your actual data directory:
> ```bash
> systemctl cat plexmediaserver | grep APPLICATION_SUPPORT
> ```
The `added_at` field is keyed to the **file path** in `media_parts`. As long as the file path doesn't change, the database record — including `added_at` — is untouched even after the file's content is replaced.
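To record `added_at` for one exact path before and after a swap, something like the following could work. It is a sketch: `added_at_sql` is a hypothetical helper, and the table and column names are the ones from the query above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the added_at lookup for one exact file path.
added_at_sql() {
  local q="'"
  local path="${1//$q/$q$q}"   # double any single quotes for the SQL literal
  printf "SELECT mi.added_at FROM metadata_items mi \
JOIN media_items me ON me.metadata_item_id = mi.id \
JOIN media_parts mp ON mp.media_item_id = me.id \
WHERE mp.file = '%s';\n" "$path"
}
```

Usage, assuming `PLEX_DB` holds the database path: `sqlite3 -readonly "$PLEX_DB" "$(added_at_sql "$file")"`. Run it once before the encode and once after the swap; the two values should match.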
---
## Why VAAPI Instead of libx265
On a host with an AMD RX 480/580 (or similar Polaris GPU), hardware HEVC encoding via VAAPI is roughly **9× faster** than software libx265 at comparable quality:
| Encoder | Speed (1080p) | Notes |
|---|---|---|
| libx265 -preset medium | ~21 fps / 0.35× | Best quality/size ratio |
| hevc_vaapi QP 28 | ~186 fps / 3.1× | Sufficient for streaming content |
For 1080p streaming content (game streams, podcasts, YouTube archival), the quality difference is imperceptible. libx265 is preferable only for archival encodes where absolute quality matters.
### Verify VAAPI is working
```bash
vainfo 2>&1 | grep -E "vaapi|HEVC|hevc|Driver"
ls /dev/dri/renderD128
```
You need `VAProfileHEVCMain : VAEntrypointEncSlice` in the output. If missing, install `mesa-va-drivers-freeworld` (RPM Fusion) for AMD hardware.
---
## The Atomic Swap Strategy
The key insight: `mv file.tmp file` on the **same filesystem** is an atomic inode rename at the kernel level. Plex sees the same path still present — it never fires a "file removed" event, so the `metadata_items` record (including `added_at`) is preserved.
**Safe sequence:**
1. Encode source → `.hevc.tmp.mp4` alongside the original
2. Verify the output with `ffprobe`
3. `touch -r original.mp4 temp.mp4` — copy mtime (cosmetic, not required)
4. `mv temp.mp4 original.mp4` — atomic replace
**The one pitfall:** if the original file is deleted *before* the `mv`, Plex orphans the DB record (removes `metadata_items` entry on next scan) and re-indexes the new file with a fresh `added_at`. The original must still exist at swap time.
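The swap steps can be sketched as a small function. This is an illustration under assumptions, not the contents of `hevc_batch.sh`: `atomic_swap` is a hypothetical name, the `ffprobe` verification is left to the caller, and the device check relies on GNU `stat`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: swap an already-verified temp file over the original path.
atomic_swap() {
  local orig="$1" tmp="$2"
  # The original must still exist, or Plex orphans the record on its next scan.
  [[ -f "$orig" && -s "$tmp" ]] || return 1
  # rename(2) is only atomic within one filesystem; refuse cross-device swaps.
  [[ "$(stat -c %d -- "$orig")" == "$(stat -c %d -- "$tmp")" ]] || return 1
  touch -r "$orig" -- "$tmp"   # copy mtime (cosmetic)
  mv -f -- "$tmp" "$orig"      # atomic replace at the same path
}
```

Because the function refuses to run when the original is missing, the pitfall above turns into an explicit failure instead of a silent re-index.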
---
## The Batch Script
Script lives at `~/hevc_batch.sh` on majorhome.
```bash
# Dry run — scan and report what would be encoded, no changes
bash ~/hevc_batch.sh --dry-run
# Full run (default: files >1GB, QP 28)
tmux new-session -d -s hevc_batch 'bash ~/hevc_batch.sh'
# Custom options
bash ~/hevc_batch.sh --min-size-gb 2 --qp 26
```
### Queue and resume
The script writes a queue file at `~/hevc_queue.txt` on first run (scanning all files with ffprobe — takes ~10 min for a large library). On subsequent runs it resumes from where it left off. Completed files are logged to `~/hevc_done.txt`. Failed files go to `~/hevc_failed.txt`.
To restart from scratch: `rm ~/hevc_queue.txt ~/hevc_done.txt`
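The resume behaviour can be pictured roughly as below. This is a sketch, not the script's actual code: the `DONE` variable and `pending_files` function are hypothetical names, and the whole-line match (`grep -x`) is a design choice made here so that one path cannot match as a substring of another.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the resume logic; file names follow the ones above.
QUEUE="${QUEUE:-$HOME/hevc_queue.txt}"
DONE="${DONE:-$HOME/hevc_done.txt}"

# True when the exact path is already recorded (-x whole line, -F literal).
already_done() {
  [[ -f "$DONE" ]] && grep -qxF -- "$1" "$DONE"
}

# Print queue entries that still need work, preserving odd characters in paths.
pending_files() {
  local f
  while IFS= read -r f; do
    [[ -n "$f" ]] || continue
    already_done "$f" || printf '%s\n' "$f"
  done < "$QUEUE"
}
```

`pending_files | wc -l` then gives a quick count of what a restart would actually attempt.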
### Log output
```bash
# Structured log lines only (skip ffmpeg progress noise)
grep '^\[20' ~/hevc_batch.log
# Watch live progress
tail -f ~/hevc_batch.log | grep '^\[20'
```
Each file logs:
- Source size and codec
- `Plex added_at before: <unix timestamp>`
- ffmpeg exit code and elapsed time
- Output size and savings
- `DB check: added_at PRESERVED ✓` (or WARN if changed)
### Space guard
The script aborts if free space on the Plex volume drops below 10GB (`MIN_FREE_GB`). Worst-case headroom needed is `source_size + tmp_size` simultaneously — on a 4GB source file that's ~8GB peak. Note: the space check only runs at the **start** of each encode, not during — a large file can still consume significant disk mid-encode.
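One way to tighten the guard is to subtract the source size up front, so the check accounts for the temp file that is about to be written. A sketch, assuming GNU `df`; `free_gb` and `has_headroom` are hypothetical names, not functions from the script:

```bash
#!/usr/bin/env bash
# Hypothetical sketch; MIN_FREE_GB matches the 10GB threshold described above.
MIN_FREE_GB="${MIN_FREE_GB:-10}"

# Whole gigabytes available on the filesystem holding the given path (GNU df).
free_gb() {
  df --output=avail -BG -- "$1" | tail -n 1 | tr -dc '0-9'
}

# Succeeds only if the encode would leave the guard margin intact, assuming
# the temp file may grow as large as the source (the worst case noted above).
has_headroom() {
  local path="$1" source_gb="$2"
  (( $(free_gb "$path") - source_gb >= MIN_FREE_GB ))
}
```

For example, `has_headroom /plex 4 || exit 1` before encoding a 4GB source.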
---
## ffmpeg Command
```bash
ffmpeg \
  -vaapi_device /dev/dri/renderD128 \
  -i "input.mp4" \
  -vf 'format=nv12,hwupload' \
  -c:v hevc_vaapi -rc_mode CQP -qp 28 \
  -c:a copy \
  -movflags +faststart \
  -y "output.tmp.mp4"
```
- `-rc_mode CQP -qp 28` — constant quantizer; higher value = smaller file / lower quality. QP 24 is high quality, QP 28 is good for streaming content.
- `-vf 'format=nv12,hwupload'` — required to move frames to GPU memory for VAAPI encoding.
- `-c:a copy` — passes audio through untouched.
- `hevc_vaapi` does not support 10-bit output on Polaris (RX 480/580). For 10-bit HDR sources, fall back to `libx265` with color signaling flags.
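A sketch of routing 10-bit sources to the software encoder, as the last bullet suggests. The function names are hypothetical and the pixel-format patterns cover only the common 10-bit names (such as `yuv420p10le` and `p010le`); the `libx265` color signaling flags are not shown.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for routing 10-bit sources away from hevc_vaapi on Polaris.

# Print the first video stream's pixel format (requires ffprobe on PATH).
probe_pix_fmt() {
  ffprobe -v error -select_streams v:0 \
    -show_entries stream=pix_fmt -of csv=p=0 -i "$1"
}

# Map a pixel format to an encoder name: 10-bit formats go to libx265.
encoder_for_pix_fmt() {
  case "$1" in
    *p10le|*p10be|p010*) echo libx265 ;;
    *)                   echo hevc_vaapi ;;
  esac
}
```

Usage: `encoder_for_pix_fmt "$(probe_pix_fmt "$file")"`.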
---
## Plex Data Directory Override
On majorhome, the Plex data directory is overridden in the systemd unit — the default path `/var/lib/plexmediaserver/` is empty:
```bash
systemctl cat plexmediaserver | grep APPLICATION_SUPPORT
# Environment=PLEX_MEDIA_SERVER_APPLICATION_SUPPORT_DIR=/plex/plexdata/Library/Application Support
```
The actual DB path is therefore:
```
/plex/plexdata/Library/Application Support/Plex Media Server/Plug-in Support/Databases/com.plexapp.plugins.library.db
```
---
## Troubleshooting
### Encode keeps stopping after a few files
**Symptom:** The script runs, encodes a handful of files, then exits. Restarting it produces the same behavior — processes a few, then exits again.
**Cause:** `hevc_batch.sh` is a **one-shot batch processor**, not a daemon. It reads through the queue file once from top to bottom, encodes whatever hasn't been done, then exits cleanly with `Batch complete: N processed`. It does not loop or restart itself.
On subsequent restarts, the script reuses the existing `hevc_queue.txt` rather than rebuilding it — the rebuild only runs if the queue file is missing or empty:
```bash
if [[ ! -f "$QUEUE" ]] || [[ ! -s "$QUEUE" ]]; then
  build_queue
fi
```
This means restarts process only the few items left in the stale queue that haven't been marked done, then exit.
**Fix:** Delete the queue file before restarting so the script rescans the library and builds a fresh queue:
```bash
su - majorlinux -c 'rm ~/hevc_queue.txt && tmux new-session -d -s hevc_batch "bash ~/hevc_batch.sh"'
```
> Do **not** delete `hevc_done.txt` — that's the deduplication record. The rebuilt queue will skip anything already in `hevc_done.txt`.
---
### "Parse error, at least 3 arguments" in the log
**Symptom:** Log lines like `Parse error, at least 3 arguments were expected, only 1 given in string 'h.mp4'` scattered between encode entries.
**Cause:** ffmpeg prints its own internal parsing warnings to stderr for filenames containing the Unicode fullwidth punctuation used in Giant Bomb / YouTube-DL titles (fullwidth variants of ASCII characters, such as the fullwidth vertical bar U+FF5C). The bash script handles these correctly via `IFS= read -r`; the messages are cosmetic ffmpeg noise and do not affect the encode.
**Action:** None — these are safe to ignore.
---
### "SKIP (not found): uiem DLC & Far Far West.mp4" — truncated filenames
**Symptom:** "not found" skip entries in the log show what look like the *ends* of filenames (e.g., `uiem DLC & Far Far West.mp4` instead of `Resident Evil Requiem DLC & Far Far West.mp4`).
**Cause:** The queue file has corrupt/truncated entries — lines where the beginning of the path was lost, likely from a write error or interrupted pipe when the queue was originally built. The script can't find these truncated paths on disk and skips them.
**Fix:** Delete the queue file to force a full rebuild (see above). The rebuild uses `find` with a fresh scan — no truncation possible.
---
### Checking real progress
```bash
# Files done, failed, and remaining in queue
wc -l ~/hevc_done.txt ~/hevc_failed.txt ~/hevc_queue.txt
# Remaining = queue total - done - failed
# (some "remaining" may be not-found or parse-error skips)
# Last 10 log entries
grep '^\[20' ~/hevc_batch.log | tail -10
# Watch live
tail -f ~/hevc_batch.log | grep '^\[20'
# Disk free on /plex
df -h /plex | tail -1
```
---
### Script exits with `set -euo pipefail`
The script uses `set -euo pipefail` — any unhandled non-zero exit code kills it immediately. If the script exits with no "Batch complete" line in the log, look for the last log entry before the gap to identify the failing command. Most encode-path errors are handled with `|| echo ""` guards, but external tools (sqlite3, ffprobe) can still trip this under unusual conditions.
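To make a silent exit easier to trace, an `ERR` trap can log the failing command in the same bracketed timestamp format as the other log lines. A sketch, assuming bash; `log_failure` is a hypothetical name and the trap is an addition, not something the script is known to contain:

```bash
#!/usr/bin/env bash
# Hypothetical addition near the top of a script that already uses set -euo pipefail.
# -E makes functions and subshells inherit the ERR trap.
set -E

log_failure() {
  # $1 = line number, $2 = failing command, $3 = its exit status
  printf '[%s] FATAL: line %s: "%s" exited %s\n' \
    "$(date '+%F %T')" "$1" "$2" "$3" >&2
}
trap 'log_failure "$LINENO" "$BASH_COMMAND" "$?"' ERR
```

With stderr redirected into `~/hevc_batch.log`, the `grep '^\[20'` filters shown earlier would pick these lines up too.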
---
## Related
- [[plex-4k-codec-compatibility]] — Apple TV Direct Play compatibility, HEVC HDR notes
- [[plex-transcoding-troubleshooting]] — Playback stops, software transcode CPU limits, VAAPI setup
- [[snapraid-mergerfs-setup]] — MajorRAID storage pool setup
- [[SnapRAID-Majorhome]] — majorhome SnapRAID project
---
### ffmpeg "Error opening output file" / "Invalid argument" on specific files
**Symptom:** One or more files fail with this in the log:
```
Error opening output file /plex/plex/Giant Bomb's Sub-A-Thon Day 3 PART 4.hevc.tmp.mp4.
Error opening output files: Invalid argument
[YYYY-MM-DD HH:MM:SS] ffmpeg exited 234 in 0s
[YYYY-MM-DD HH:MM:SS] FAILED: ffmpeg error — keeping original, removing tmp
```
The file ends up in `hevc_failed.txt` and the original is untouched.
**Cause:** ffmpeg has its own URL/protocol parser that runs on all input and output path strings before any filesystem access. The ASCII pipe character `|` (U+007C) triggers ffmpeg's pipe protocol handler: it tries to interpret `output|file.mp4` as "pipe output to the process named `file.mp4`" and fails with EINVAL. This happens even though the shell variable is properly quoted and the Linux filesystem supports `|` in filenames. The fullwidth variant `｜` (U+FF5C) can also cause issues depending on ffmpeg's build.
Common in libraries with Giant Bomb, YouTube, or Twitch downloads: those titles frequently use `｜` as a visual separator.
**Fix:** Sanitize the `stem` used for the `.hevc.tmp.` output filename. The *source* file keeps its original name (the final `mv` writes back to the original path, which the filesystem handles fine); only the temp file needs a clean name for ffmpeg:
```bash
# In encode_file(), replace:
local tmp="${dir}/${stem}.hevc.tmp.${ext}"
# With (swap both the ASCII pipe U+007C and the fullwidth bar U+FF5C for "-"):
local safe_stem="${stem//|/-}"
safe_stem="${safe_stem//｜/-}"
local tmp="${dir}/${safe_stem}.hevc.tmp.${ext}"
```
After patching, delete the affected entries from `hevc_failed.txt` (or leave them — they'll be re-queued on the next run since they're not in `hevc_done.txt`) and restart the batch.
---
### Many files failing: output larger than source (streaming content)
**Symptom:** A large portion of the queue ends up in `hevc_failed.txt` with log lines like:
```
[2026-06-05 ...] Output: 4.7G savings=0 (output larger than source)
[2026-06-05 ...] WARN: output is larger than source — skipping swap, keeping original
```
**Cause:** These files are YouTube downloads or streaming archives (Giant Bomb, Twitch VODs, etc.) that were already encoded with an efficient H.264 encoder (typically YouTube's VP9-to-AVC pipeline or a broadcast H.264 encoder at a reasonable bitrate). VAAPI HEVC encoding at QP 28 on a Polaris GPU (RX 480/580) is a hardware encoder with limited rate control precision, so it cannot beat a well-tuned software H.264 encode on already-compressed talking-head/gaming content. The output reliably comes out 15-25% *larger* than the source.
The script handles this correctly: it detects output > source, deletes the tmp, keeps the original, and writes to `hevc_failed.txt`. The files are not corrupted. However, without the `already_failed()` guard, the script will re-attempt these files on every queue rebuild, wasting CPU time and briefly consuming 4-8 GB of disk per failed attempt.
**Fix — add `already_failed()` skip logic:**
Patch `~/hevc_batch.sh` to skip files already in `hevc_failed.txt`:
```bash
# After the existing already_done() function, add:
already_failed() {
  [[ -f "$FAILED" ]] && grep -qF "$1" "$FAILED"
}
# In build_queue(), after the already_done "$f" && continue line:
already_failed "$f" && continue
# In the main loop, after the already_done "$file" check:
already_failed "$file" && { log "SKIP (already failed): $file"; continue; }
```
After patching, the batch will skip all 132+ known-bad files on the next pass and only attempt fresh queue entries.
**Tuning options to improve savings on dense content:**
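- Raise QP: e.g. `--qp 30` or `--qp 32`. A higher quantizer produces smaller output and a better chance of beating source size. Trade-off: lower visual quality on files that do compress. (Lowering QP to 24 or 22 goes the other way: higher quality, larger files.)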
- Accept the failures: for streaming content archives, the source is already "good enough." Only files that are genuinely oversized H.264 (old stream captures at very high bitrate) will benefit from HEVC re-encode.
**Identifying which files are worth encoding:**
```bash
# Show source bitrate for all queued files — high-bitrate sources are candidates
while IFS= read -r f; do
  bitrate=$(ffprobe -v quiet -show_entries format=bit_rate -of csv=p=0 "$f" 2>/dev/null)
  echo "$bitrate $f"
done < ~/hevc_queue.txt | sort -rn | head -20
```
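Files above ~8,000 kbits/s are typically good encode candidates. Files at 3,000-5,000 kbits/s (typical YouTube/Twitch 1080p) will usually fail. Note that `ffprobe` reports `bit_rate` in bits per second, so 8,000 kbits/s shows up as roughly 8000000 in the listing above.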
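The threshold can be turned into a small filter. A sketch: `encode_candidate` is a hypothetical helper, the 8000 kbit/s default is the rule of thumb from this section, and the input is expected in bits per second as `ffprobe` prints it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when a source bitrate (bits/s) clears the threshold.
encode_candidate() {
  local bps="$1" min_kbps="${2:-8000}"
  # Reject empty or non-numeric values (ffprobe may not report a bitrate).
  [[ "$bps" =~ ^[0-9]+$ ]] || return 1
  (( bps / 1000 >= min_kbps ))
}
```

For example, `encode_candidate "$bitrate" && echo "$f"` inside the loop above would print only the files worth attempting.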

@ -0,0 +1,126 @@
---
title: "Plex Transcoding Troubleshooting"
domain: streaming
category: plex
tags: [plex, transcoding, hevc, h264, vaapi, troubleshooting, apple-tv]
status: published
created: 2026-05-22
updated: 2026-05-22
---
# Plex Transcoding Troubleshooting
Common issues when Plex is transcoding instead of direct playing, and how to fix them.
## Playback Stops After ~1 Minute
**Symptom:** Video starts normally, plays for 60-90 seconds, then freezes or stops. Hitting play again works briefly, then stops again.
**Cause:** The Plex server is software-transcoding the stream and the CPU can't keep up in real time. Plex delivers video as a series of short HLS segments (3 seconds each by default). When the transcoder falls behind real-time, the client exhausts its segment buffer and stops.
This is most common when:
- The client has an auto-quality or bandwidth-limit setting enabled, forcing a transcode even for natively supported codecs
- The source file is HEVC and the client is set to anything other than "Play Original"
- Multiple streams are transcoding concurrently and saturating the CPU
### How to Confirm
SSH into the Plex host and check for an active software transcode:
```bash
ps aux | grep 'Plex Transcoder' | grep -v grep
```
Look for `libx264` or `libx265` in the output; these are CPU software encoders. A CPU% above 30-40% per stream on an i7-7700K means it's at or near the real-time limit for 1080p60.
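The check can be scripted by classifying each transcoder command line. A sketch under assumptions: `transcode_kind` is a hypothetical helper, and it assumes hardware sessions show a `*_vaapi` encoder name in the process arguments, which is worth confirming on your own host.

```bash
#!/usr/bin/env bash
# Hypothetical helper: label a Plex Transcoder command line by encoder type.
transcode_kind() {
  case "$1" in
    *libx264*|*libx265*) echo software ;;
    *_vaapi*)            echo hardware ;;
    *)                   echo unknown ;;
  esac
}
```

Usage: `ps -eo args | grep '[P]lex Transcoder' | while IFS= read -r line; do transcode_kind "$line"; done` prints one label per active session.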
### Fix: Enable Direct Play
The correct fix is to eliminate the transcode entirely.
**On Apple TV:**
1. Open the Plex app → tap the user icon → **Settings**
2. Go to **Quality**
3. Set both **"Home Streaming"** and **"Remote Streaming"** to **"Play Original"** (or "Maximum")
4. Restart playback
Apple TV 4K supports direct play for H.264, HEVC (H.265), and most common containers (MP4, MKV). With "Play Original" set, Plex streams the file as-is with no server-side processing.
**On other clients:** Look for a Quality or Streaming Quality setting and set it to Original/Maximum. The specific label varies by app version.
### If Direct Play Isn't Possible
If the client genuinely can't decode the source codec (e.g., a browser playing HEVC), reduce the transcode quality to something the CPU can sustain in real time:
- **8 Mbps 1080p** is usually achievable for a single stream on an i7-7700K
- Avoid 1080p60 at high bitrates — the frame rate doubles the encoding work
Alternatively, enable hardware transcoding (see below).
---
## Understanding When Plex Transcodes
Plex will transcode (convert on the fly) when any of the following are true:
| Trigger | Example |
|---------|---------|
| Client can't decode the codec | Browser playing HEVC |
| Client quality is set below original | "8 Mbps 1080p" selected |
| Audio codec isn't supported by client | DTS-MA, TrueHD on some devices |
| Subtitles need burning in | Forced image-based subs (PGS) |
| Bandwidth limit set in Plex server settings | Server-side quality cap |
Direct play happens when the client supports the video codec, audio codec, container, and no quality downgrade is requested.
---
## Hardware Transcoding (VAAPI / RX 480)
majorhome has an XFX Radeon RX 480 8GB with VAAPI support. Hardware transcoding can offload video encoding from the CPU and allows more concurrent transcode streams.
**Enable in Plex:**
Settings → Transcoder → **"Use hardware acceleration when available"** (requires Plex Pass)
**Caveats:**
- The RX 480 VAAPI encoder (`hevc_vaapi`, `h264_vaapi`) is benchmarked ~3× slower than the i7-7700K CPU for single-stream x264 output on this workload. Hardware transcoding only wins when the CPU is already saturated (2+ concurrent streams).
- VAAPI hardware transcode on AMD requires the `radeonsi` Mesa driver and `libva-mesa-driver`. Both are present on majorhome.
**Check VAAPI is working:**
```bash
vainfo 2>/dev/null | grep -E "VAProfile|VAEntrypoint"
```
---
## CPU Transcoding Capacity (i7-7700K)
| Scenario | CPU Load | Sustainable? |
|----------|----------|-------------|
| 1× HEVC → H.264 1080p30 | ~20% | ✅ Yes |
| 1× HEVC → H.264 1080p60 | ~40% | ⚠️ Borderline — may drop behind |
| 2× HEVC → H.264 1080p60 | ~80% | ❌ Will fall behind in real time |
| 1× H.264 → H.264 1080p (remux only) | ~5% | ✅ Yes |
**Bottom line:** One software-transcode stream at 1080p60 is at the edge of what the i7-7700K can sustain. Two will fail. Direct play eliminates the problem entirely.
---
## Checking Active Transcode Sessions
```bash
# See all active Plex Transcoder processes and what they're encoding
ps aux | grep 'Plex Transcoder' | grep -v grep | grep -oP '\-i \S+' | sed 's/-i //'
# Full transcode command (codec, bitrate, resolution)
ps aux | grep 'Plex Transcoder' | grep -v grep
```
You can also see active sessions in Plex Web → Dashboard → Now Playing.
---
## Related
- [Plex 4K Codec Compatibility (Apple TV)](plex-4k-codec-compatibility.md)
- [[../../../MajorInfrastructure/Services/Plex|Plex — Infrastructure Doc]]
- [[../../../../30-Areas/MajorInfrastructure/Servers/majorhome|majorhome]]

@ -11,7 +11,7 @@ tags:
- troubleshooting
status: published
created: 2026-04-18
updated: 2026-04-29T22:45
updated: 2026-04-30T05:21
---
# Ansible Check Mode False Positives in Verify/Assert Tasks

@ -0,0 +1,103 @@
---
title: "Ansible reboot.yml: become Timeout on WSL2 Hosts (Exclude Them)"
domain: troubleshooting
category: ansible
tags: [ansible, wsl, wsl2, windows, reboot, become, privilege-escalation, openssh, inventory]
status: published
created: 2026-06-12
updated: 2026-06-12
---
# Ansible reboot.yml: become Timeout on WSL2 Hosts (Exclude Them)
## Problem
Running a reboot play across a Fedora fleet that includes a WSL2 "host" fails on the WSL2 box at privilege escalation — before the reboot command ever runs:
```console
$ ansible-playbook reboot.yml --limit fedora
TASK [Reboot the server] *******************************************************
changed: [majorhome]
changed: [majorlab]
changed: [majormail]
changed: [majordiscord]
[ERROR]: Task failed: Action failed: Timeout (62s) waiting for privilege
escalation prompt:
fatal: [majorrig-wsl]: FAILED! => {"changed": false,
"msg": "Timeout (62s) waiting for privilege escalation prompt:",
"reboot": false}
```
Every real server reboots fine. Only the WSL2 host fails, and `"reboot": false` confirms the shutdown command never executed.
## Cause
Two independent problems, either of which is enough to break a reboot play against WSL2:
1. **WSL2 has no real reboot semantics.** `ansible.builtin.reboot` issues a shutdown, then blocks up to `reboot_timeout` (e.g. 900s) waiting for SSH to come back. A WSL2 distro doesn't reboot — it just terminates, and nothing relaunches it automatically. The task would hang the full timeout and then fail.
2. **`become` times out over the Windows OpenSSH → WSL2 bridge.** When a WSL2 box is reached as `majorlinux@host` through Windows' built-in OpenSSH Server (which forwards into WSL via the default shell), Ansible's privilege-escalation handshake watches the SSH stream for the sudo prompt/success marker. Across the Windows-intercept pty, that marker detection stalls until the 62s `timeout`. This happens **even with passwordless sudo**: `NOPASSWD` is configured and correct; Ansible simply never sees the handshake complete.
The error surfaces as #2 (it fails at escalation first), but #1 is the deeper reason WSL2 doesn't belong in a reboot play at all.
## Solution
**Exclude the WSL group from the reboot play.** A WSL2 instance is a managed *workstation environment*, not a server — it belongs in package/update plays but not in server lifecycle operations like reboot.
Scope the play to exclude the `wsl` group so even a broad `--limit` skips it:
```yaml
# reboot.yml
- name: Reboot servers
  hosts: all:!wsl # was: hosts: all
  become: true
  tasks:
    - name: Reboot the server
      ansible.builtin.reboot:
        msg: "Reboot initiated by Ansible"
        reboot_timeout: 900
```
This assumes your WSL2 hosts are in a dedicated inventory group:
```yaml
wsl:
  hosts:
    majorrig-wsl:
      ansible_host: 100.98.47.29
```
Verify the targeting before running — the WSL host should be gone:
```console
$ ansible-playbook reboot.yml --limit fedora --list-hosts
play #1 (all:!wsl): Reboot servers
hosts (4):
majorhome
majorlab
majordiscord
majormail
```
### Rebooting the WSL2 instance itself
When you genuinely need to "reboot" WSL2, do it from the Windows side — not Ansible:
```powershell
wsl --shutdown
```
The distro relaunches on next access (next SSH login or `wsl` invocation). WSL2 stays in `update.yml` (dnf upgrades) and other package plays; it's only excluded from reboot and other server-specific roles.
## Why not just fix the become timeout?
You *could* raise `timeout` or tweak the become flow, but it doesn't address problem #1 — even a successful escalation would leave the reboot task hanging the full `reboot_timeout` because WSL2 never comes back the way the module expects. Excluding WSL from server lifecycle plays is the correct fix, not a workaround.
## Related
- [Ansible: ansible.cfg Ignored on WSL2 Windows Mounts](ansible-wsl2-world-writable-mount-ignores-cfg.md)
- [Windows OpenSSH: WSL Default Shell Breaks Remote Commands](networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md)
- [Ansible: SSH Timeout During dnf upgrade on Fedora Hosts](ansible-ssh-timeout-dnf-upgrade.md)

@ -0,0 +1,72 @@
---
title: "Ansible regex_search — capture-group argument doesn't work in set_fact"
domain: troubleshooting
category: general
tags: [ansible, jinja, regex, set_fact, gotcha]
status: published
created: 2026-05-06
updated: 2026-05-06
---
# Ansible `regex_search` — capture-group argument doesn't work in `set_fact`
## Problem
You want to extract a number from a registered command's stdout — e.g. the package count from a dnf or apt upgrade — and stash it in a fact. The natural-looking `regex_search('pattern', '\1')` form fails or produces an empty string when used inside `set_fact`:
```yaml
- name: Capture package count # ❌ does not behave as expected
  ansible.builtin.set_fact:
    pkg_count: "{{ apt_upgrade_result.stdout | regex_search('([0-9]+) upgraded', '\\1') }}"
```
You'll see one of:
- An empty `pkg_count` (the filter ran but the back-reference returned nothing in this context)
- A Jinja error about argument arity if the syntax is slightly off
- The whole matched substring instead of just the captured group
## Root cause
In `set_fact` templating, the second-positional-argument form of `regex_search` (the back-reference `'\1'` you've seen in tutorials) doesn't reliably select capture groups. The filter is happiest returning the full match. Capture-group selection works in some contexts (e.g. `vars:` blocks, certain Jinja invocations) but not consistently inside `set_fact`, which makes "copy this snippet from the docs" fail intermittently.
## Fix — match the broader pattern, then split
Stop fighting the back-reference. Use `regex_search` to grab a string that *contains* the value you want, then peel it apart with plain Python string ops:
```yaml
- name: Capture package count # ✅ works in set_fact
  ansible.builtin.set_fact:
    pkg_count: "{{ (apt_upgrade_result.stdout | regex_search('[0-9]+ upgraded') | default('0', true)).split()[0] }}"
```
What this does:
1. `regex_search('[0-9]+ upgraded')` returns the matching substring (e.g. `"7 upgraded"`) or `None` on no match.
2. `default('0', true)` turns the `None` case into the string `"0"` so the next step always has something to operate on. The second argument matters: without it, Jinja's `default` only replaces undefined values, and `None` is defined.
3. `.split()[0]` keeps just the number.
The result (`"7"`) is a string — cast with `| int` if you need arithmetic.
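To sanity-check the pattern against real command output outside Ansible, the same match-then-split logic can be mirrored in shell. This is only an analogue of the Jinja expression, and `pkg_count` is a hypothetical function name:

```bash
#!/usr/bin/env bash
# Hypothetical shell analogue of the Jinja expression: match broadly, then split.
pkg_count() {
  local match
  match=$(grep -oE '[0-9]+ upgraded' <<<"$1" | head -n 1) || true
  match="${match:-0}"            # same role as the default in the Jinja version
  printf '%s\n' "${match%% *}"   # keep only the leading number
}
```

Feeding it captured upgrade output, e.g. `pkg_count "$(cat upgrade.log)"`, shows what the fact would contain, including the zero fallback when nothing matches.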
## Where this comes up in MajorAnsible
The `update.yml` executive-summary task uses this pattern to pull package counts out of `apt_upgrade_result.stdout` and `dnf_upgrade_result.stdout` so each host can print one tidy line:
```
majorhome: 7 pkg(s) upgraded | No reboot needed | 2 active screen(s)
majormail: 14 pkg(s) upgraded | REBOOT REQUIRED | Snapshot taken
majorlab: 0 pkg(s) upgraded | No reboot needed
```
The summary line is built with a Jinja `parts` array joined with `' | '` so segments that don't apply (no snapshot, no screens) drop out cleanly without leaving trailing separators.
## Quick checks if this still misbehaves
- **Confirm the source variable.** A registered command result exposes both `result.stdout` (a string) and `result.stdout_lines` (a list); the `regex_search` filter wants a string, not a list. Use `.stdout` (or `.stdout_lines | join('\n')` if you only have the list).
- **Escape your backslashes.** In double-quoted YAML strings, `\d` needs to be written `\\d`; alternatively, wrap the pattern in single quotes: `'(\d+) upgraded'`.
- **Always provide a default.** `regex_search` returns `None` on a miss, which will explode `.split()[0]`. Jinja's `default` only replaces undefined values unless its second argument is true, so write `| default('0', true)`. The bridge is mandatory in production playbooks where some hosts will legitimately have zero upgrades.
## Related
- [[ansible-vault-password-file-missing]] — another set_fact / vault interaction quirk
- [[ansible-ssh-timeout-dnf-upgrade]] — companion gotcha when running `update.yml`

@ -0,0 +1,106 @@
---
title: "Ansible: Ubuntu Reboot Detection Misses Kernel Upgrades"
domain: troubleshooting
category: ansible
tags: [ansible, ubuntu, kernel, reboot, needrestart, apt]
status: published
created: 2026-05-19
updated: 2026-05-19
---
# Ansible: Ubuntu Reboot Detection Misses Kernel Upgrades
## Problem
`update.yml` runs across the Ubuntu fleet, a kernel package is upgraded, but the executive summary reports `No reboot needed` — even though a reboot is genuinely required. Running `uname -r` on the host confirms it's still on the old kernel.
Example: majortoot had `linux-image-6.8.0-117-generic` installed on May 16 after a Tailscale update triggered `needrestart`, but the playbook kept reporting clean.
## Root Cause
The standard check for Ubuntu reboot state is:
```yaml
- name: Check if a reboot is required for Ubuntu servers
  ansible.builtin.stat:
    path: /var/run/reboot-required
  register: ubuntu_reboot_flag
```
`/var/run/reboot-required` is written by `update-notifier-common`'s `notify-reboot-required` script, called by `/etc/kernel/postinst.d/update-notifier` when a kernel package is installed via `apt`.
The problem is `needrestart`. It runs after every `apt` invocation via a `DPkg::Post-Invoke` hook (`apt-pinvoke -m u`). In **unattended mode** (`-m u`), needrestart detects the pending kernel upgrade and calls `announce_ver()` in `NeedRestart::UI::Ubuntu` — but that function only prints to stdout. It does **not** call `_write_reboot_file()`. Only `announce_ucode()` (microcode upgrades) calls `_write_reboot_file()`.
So the sequence is:
1. `apt` installs kernel → `notify-reboot-required` creates `/run/reboot-required`
2. Some later `apt` run (e.g. Ansible installs Tailscale) → `needrestart -m u` runs → detects kernel mismatch → calls `announce_ver()` → prints to stdout (suppressed in Ansible) → **does not** recreate the sentinel file
3. Next Ansible run: stat check finds no file → reports `No reboot needed`
The `/run` filesystem is tmpfs and clears on reboot, but the sentinel file can disappear between reboots any time needrestart runs without recreating it.
## Fix — Dual Check in update.yml
Add a parallel kernel comparison task after the existing stat check:
```yaml
- name: Check running kernel vs installed kernel (Ubuntu)
  ansible.builtin.shell: |
    RUNNING=$(uname -r)
    INSTALLED=$(dpkg -l 'linux-image-[0-9]*-generic' 2>/dev/null \
      | awk '/^ii/{print $2}' \
      | sed 's/linux-image-//' \
      | sort -V | tail -1)
    if [ -n "$INSTALLED" ] && [ "$RUNNING" != "$INSTALLED" ]; then
      echo "KERNEL_MISMATCH"
    fi
  register: kernel_mismatch_check
  changed_when: false
  when: ansible_facts['os_family'] == "Debian"
```
Then update the `host_summary` Jinja2 template to OR both conditions:
```jinja2
{%- if ansible_facts['os_family'] == 'Debian' and (
(ubuntu_reboot_flag is defined and ubuntu_reboot_flag.stat is defined and ubuntu_reboot_flag.stat.exists)
or
(kernel_mismatch_check is defined and 'KERNEL_MISMATCH' in (kernel_mismatch_check.stdout | default('')))
) -%}
{%- set _ = parts.append('REBOOT REQUIRED') -%}
```
## Common Mistake — Comparing the Wrong dpkg Field
An initial version of this fix used `$3` (the package version) and `cut`:
```bash
# WRONG — version field never matches uname -r
INSTALLED=$(dpkg -l 'linux-image-*-generic' | awk '/^ii/{print $3}' | sort -V | tail -1 | cut -d- -f1-4)
```
| Field | Example value |
|-------|--------------|
| `dpkg $3` (version) after cut | `6.8.0-57.59` |
| `uname -r` | `6.8.0-57-generic` |
These formats never match. Every Ubuntu host permanently reports `KERNEL_MISMATCH`. Always use the **name column (`$2`)**, strip the `linux-image-` prefix, and compare directly to `uname -r`.
Also use `linux-image-[0-9]*-generic` (not `*-generic`) to exclude the `linux-image-generic` meta-package from the sort.
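The name-column approach can be exercised on its own before it goes into the playbook. A sketch: `latest_installed_kernel` is a hypothetical function that reads `dpkg -l` output on stdin, which makes it easy to test against saved output from any host.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `dpkg -l` output on stdin and print the newest
# installed kernel in the same form that `uname -r` reports.
latest_installed_kernel() {
  # Keep installed (ii) packages whose name carries a version, which skips
  # the linux-image-generic meta-package.
  awk '/^ii/ && $2 ~ /^linux-image-[0-9].*-generic$/ {print $2}' \
    | sed 's/^linux-image-//' \
    | sort -V | tail -n 1
}

# Usage on a host:
#   [ "$(uname -r)" = "$(dpkg -l | latest_installed_kernel)" ] || echo KERNEL_MISMATCH
```

Filtering on the name inside `awk` means the meta-package is excluded even when the `dpkg -l` call itself is given no pattern.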
## Verification
Run against a known-pending host before and after reboot:
```bash
ansible-playbook update.yml --limit majortoot
```
Before reboot: `majortoot: 0 pkg(s) upgraded | REBOOT REQUIRED`
After reboot: `majortoot: 0 pkg(s) upgraded | No reboot needed`
## Related
- [[ansible-regex-search-set-fact-capture-group]] — companion Jinja2 gotcha in the same `host_summary` task
- [[ansible-unattended-upgrades-fleet]] — managing the Ubuntu auto-upgrade stack
- [[ansible-check-mode-false-positives]] — another Ansible reporting quirk

@ -0,0 +1,73 @@
---
title: "Claude Code Keychain Prompt Keeps Reappearing on macOS (ACL Invalidation)"
domain: troubleshooting
category: claude-code
tags: [claude-code, authentication, oauth, keychain, macos, acl, security]
status: published
created: 2026-06-15
updated: 2026-06-15
---
# Claude Code Keychain Prompt Keeps Reappearing on macOS (ACL Invalidation)
## Symptom
A macOS dialog repeatedly pops up:
> **security wants to access key "Claude Code-credentials" in your keychain.**
> To allow this, enter the "login" keychain password. — `[Always Allow] [Deny] [Allow]`
The tell-tale sign: it **comes back even after clicking "Always Allow"** — the usual "trust forever" button doesn't make it stop. Login still works; it's the *permission prompt* that won't quiet down. This is **distinct** from [Claude Code won't log in](claude-code-warp-login-corrupt-keychain-credential.md), where the stored credential is corrupt and login itself fails.
## Cause
Claude Code stores its OAuth token in the macOS **login keychain** as `Claude Code-credentials`, read via `/usr/bin/security`. macOS binds an "Always Allow" grant (the keychain item's ACL) to the **code-signing identity** of the requesting binary. That grant is silently invalidated when:
- **Claude Code updates** — the new binary's signature no longer matches the saved ACL. This is the most common trigger (see claude-code issues #48162, #9403).
- **The credential item is recreated on token refresh** — wipes the ACL.
- **Post-reboot keychain churn** — right after boot, the just-unlocked login keychain plus a concurrent token refresh can race ahead of the ACL settling, producing a *burst* of prompts that stops once a clean refresh completes.
It is **not** a lock-timeout issue if `security show-keychain-info` reports `no-timeout` (below).
## Triage (non-destructive — these do not trigger a prompt)
```bash
# Confirm the item exists (metadata only; no secret read)
security find-generic-password -l "Claude Code-credentials" | grep -E "svce|acct"
# Confirm the login keychain isn't auto-locking
security show-keychain-info ~/Library/Keychains/login.keychain-db
# -> "no-timeout" means it won't relock; so recurring prompts = ACL invalidation, not locking
```
## Fixes
### One-off burst (e.g. right after a reboot)
Click **Always Allow** (not Allow) once a clean token refresh has completed. With a `no-timeout` keychain the grant then holds, and the post-boot prompt storm usually self-clears within a minute. *Observed exactly this on MajorAir 2026-06-15 — a reboot triggered a burst that stopped on its own.*
### Keeps returning after updates (durable) — reset the credential
Deleting and re-creating the item rebinds a fresh ACL to the current binary. Costs one re-login.
```bash
security delete-generic-password -s "Claude Code-credentials"
# then re-authenticate inside Claude Code: /login (or relaunch `claude`)
```
### Bypass the keychain entirely (workaround)
Claude Code falls back to `~/.claude/.credentials.json` in non-GUI contexts (SSH, tmux). On a local Mac this can be repurposed to stop keychain prompts for good:
```bash
# pipe straight to the file — never echo the token into a shared terminal
security find-generic-password -s "Claude Code-credentials" -w > ~/.claude/.credentials.json
chmod 600 ~/.claude/.credentials.json
security delete-generic-password -s "Claude Code-credentials"
```
**Caveats:**
- Token is then **plaintext at rest** (mode 600) instead of encrypted in the keychain.
- A future Claude Code update may rewrite the keychain item.
- GUI-session behaviour for the file fallback is **less documented** than the SSH/tmux case — **verify it holds for your setup before relying on it.**
- Do **not** substitute `CLAUDE_CODE_OAUTH_TOKEN` — it is known to delete credentials on exit (issue #37512).
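If the file fallback is used, a quick sanity check on the result may be worthwhile. The sketch below assumes the file should hold JSON (the companion login article treats a non-JSON keychain payload as corrupt) and that any group/other permission bit is a problem. `credentials_file_problems` is a hypothetical helper, not part of Claude Code.

```python
import json
import stat
from pathlib import Path
from typing import List


def credentials_file_problems(path: Path) -> List[str]:
    """List problems with a plaintext credentials file; an empty list means it looks sane."""
    if not path.is_file():
        return [f"{path} does not exist"]
    problems = []
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:
        # Any group/other bit means the token is readable beyond the owner.
        problems.append(f"mode is {oct(mode)}; expected 0o600")
    try:
        json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        problems.append(f"payload is not valid JSON: {exc.msg}")
    return problems
```

The function only reports on the file; it never prints the payload, which keeps the token out of terminal scrollback.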
## Notes
- Same keychain item as the corrupt-credential login failure; if login itself breaks, see the related article.
- Always redirect `-w` output straight to a file — never into a terminal whose scrollback feeds shared context.
## Related
- [Claude Code Won't Log In (Warp & iTerm2) — Corrupt Keychain Credential](claude-code-warp-login-corrupt-keychain-credential.md)
- Config: `~/.claude.json`, login keychain item `Claude Code-credentials`
- First observed: MajorAir, 2026-06-15 (post-reboot prompt burst; self-cleared)


@ -0,0 +1,66 @@
---
title: "Claude Code Won't Log In (Warp & iTerm2) — Corrupt Keychain Credential"
domain: troubleshooting
category: claude-code
tags: [claude-code, authentication, oauth, keychain, macos, warp, iterm2]
status: published
created: 2026-06-09
updated: 2026-06-09
---
# Claude Code Won't Log In (Warp & iTerm2) — Corrupt Keychain Credential
## Symptom
Claude Code (v2.1.169) would not log in from Warp. The login flow never completed.
The same failure occurred in iTerm2, which ruled out a terminal-specific cause.
## Investigation path
1. **Version**: `claude --version` = 2.1.169. Already well past the v2.1.105–2.1.107
   bracketed-paste regression (fixed in 2.1.108), so the known paste bug was not it.
2. **Environment / overrides**: none of `ANTHROPIC_API_KEY`, `CLAUDE_CODE_OAUTH_TOKEN`,
   `ANTHROPIC_AUTH_TOKEN`, `ANTHROPIC_BASE_URL`, or `CLAUDE_CODE_PATH` were set, so no stale
   key or shim was hijacking auth. System clock was correct (rules out token-time skew).
3. **Account record**: `~/.claude.json` had `oauthAccount` and `userID` populated
   (`maj.linux@gmail.com`), i.e. Claude Code believed it already had an account.
4. **Keychain**: a `Claude Code-credentials` generic-password item existed, but
   `security find-generic-password -s "Claude Code-credentials" -w` returned an
   **empty / non-JSON payload** (failed to parse). The credential entry was present but
   its secret was empty/corrupt.
## Root cause
A **corrupt (empty) Keychain credential** named `Claude Code-credentials`. Claude Code saw
an existing credential, tried to read/refresh it, failed to parse it, and wedged *before* it
could start a clean login. Because the account also existed in `~/.claude.json`, the CLI kept
trying to use the broken credential instead of prompting fresh auth. This is system-level
(Keychain), which is why it reproduced across both Warp and iTerm2.
## The fix
```bash
# 1. Remove the broken credential
security delete-generic-password -s "Claude Code-credentials"
# 2. Re-authenticate
claude # then /login, or:
claude /login
```
If `/login` still hangs after that, also clear the stale account record and retry:
```bash
cp ~/.claude.json ~/.claude.json.bak
python3 -c "import json,pathlib; f=pathlib.Path.home()/'.claude.json'; d=json.load(open(f)); d.pop('oauthAccount',None); json.dump(d,open(f,'w'),indent=2)"
claude /login
```
Resolved on step 1+2 — login succeeded after deleting the corrupt Keychain item.
## Notes
- On macOS, Claude Code credentials live in the **login Keychain** (`Claude Code-credentials`),
not in `~/.claude/.credentials.json` (that path is Linux/other).
- Quick triage command to spot the same failure again:
```bash
security find-generic-password -s "Claude Code-credentials" -w | python3 -m json.tool
```
If that errors with "Expecting value", the stored secret is empty/corrupt — delete and re-login.
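The same triage can be sketched as a small classifier that separates the "empty" and "non-JSON" cases instead of relying on the `json.tool` error text. This is an illustrative helper under the assumption that a healthy payload is JSON; the function name and return labels are hypothetical.

```python
import json


def classify_credential_payload(payload: str) -> str:
    """Return "empty", "corrupt", or "ok" for a keychain secret read as text."""
    text = payload.strip()
    if not text:
        return "empty"      # item exists but the secret is blank
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return "corrupt"    # something is stored, but it does not parse
    return "ok"
```

The input would be the captured stdout of `security find-generic-password -s "Claude Code-credentials" -w`; both "empty" and "corrupt" lead to the same fix (delete and re-login).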
## Related
- [Claude Code Keychain Prompt Keeps Reappearing on macOS (ACL Invalidation)](claude-code-keychain-prompt-recurring-macos.md) — different symptom: login works but the permission prompt won't stop
- Config: `~/.claude.json` (oauthAccount, userID), login Keychain item `Claude Code-credentials`
- Other Claude Code note: `claude-mem-setting-sources-empty-arg.md`


@ -0,0 +1,190 @@
---
title: "Claude Desktop MCP Mass-Disconnect After Blocking SSH Reboot"
domain: troubleshooting
category: troubleshooting
tags:
- claude-desktop
- mcp
- wsl
- wsl2
- ssh
- reboot
- troubleshooting
- hang
- transport
status: published
created: 2026-05-10
updated: 2026-05-10
---
# Claude Desktop MCP Mass-Disconnect After Blocking SSH Reboot
> **TL;DR** — Issuing a synchronous `ssh host reboot` through Claude Desktop's shell MCP can hang the MCP transport when the target dies mid-session. Eventually the MCP manager force-disconnects **every** MCP at once. Recovery is a full Claude Desktop restart. Prevention is a fire-and-forget reboot pattern that lets the SSH session close cleanly before the target goes down.
---
## Symptom
You're running Claude Desktop with several MCPs configured (shell, filesystem, mail, etc.), most launched via `wsl.exe` against your WSL2 distro. You ask Claude to reboot a remote host through the shell MCP — typically something like `ssh fleethost reboot` or `ssh fleethost sudo systemctl reboot`. Things appear to succeed. Then, anywhere from immediately to ~30 minutes later:
- **Every MCP disconnects within tens of milliseconds of each other** — not in the order you'd expect from independent failures
- Claude Desktop's main panel shows all MCP servers as failed/disconnected
- The app itself is still running but cannot reconnect MCPs cleanly until you fully restart it
- New chats can't use any MCP tools
The MCP server logs (`%APPDATA%\Claude\logs\mcp-server-*.log`) end with the standard *"Server transport closed unexpectedly, this is likely due to the process exiting early"* message — but they end at the **same instant** for every server.
---
## Why this happens
Claude Desktop launches each MCP server as a stdio child process (commonly `wsl.exe npx -y <server>` or `wsl.exe <binary>`). The MCP manager owns the stdio pipes and a transport per server. When you ask Claude to run a synchronous `ssh remote reboot` via the shell MCP:
1. The shell MCP calls SSH and waits for the remote process to exit so it can return stdout/stderr to Claude Desktop
2. The remote `reboot` (or `systemctl reboot`) executes on the target — but reboot is special: the target severs its own SSH session as part of going down, often **without** sending a clean TCP FIN
3. The local SSH client sits there waiting for a response that never comes
4. The shell MCP's stdio pipe stays open, blocked on the SSH child
5. Claude Desktop's MCP manager waits on the shell MCP's stdio pipe
6. After some watchdog/timeout interval, the manager force-tears-down — and because of how the manager is wired, it tears down **all** MCP transports together, not just the wedged one
The blast radius is "every MCP in the session," not just the one that issued the reboot.
---
## Diagnostic chain
Use this exact order — it lets you rule out each layer cleanly.
### 1. Are the disconnect timestamps clustered?
Open `%APPDATA%\Claude\logs\mcp.log` (or each per-server log) and find the *Server transport closed* lines for each MCP. Are they within tens or hundreds of milliseconds of each other?
```
2026-05-10T04:10:17.167Z [shell] Server transport closed unexpectedly
2026-05-10T04:10:17.175Z [mail] Server transport closed unexpectedly
2026-05-10T04:10:17.177Z [majorvault] Server transport closed unexpectedly
2026-05-10T04:10:17.202Z [filesystem] Server transport closed unexpectedly
```
If yes → a parent killed the children. This is **not** independent MCP failures.
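For longer logs, the clustering question can be answered mechanically. The sketch below parses the leading ISO timestamp of each *Server transport closed* line and reports the spread; the 500 ms window is an assumed threshold (the text only says "tens or hundreds of milliseconds"), and the helper names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List

CLUSTER_WINDOW_MS = 500  # assumed cut-off for "a parent killed everything at once"


def parse_ts(line: str) -> datetime:
    """Parse the leading timestamp; fromisoformat() rejects a bare "Z" before Python 3.11."""
    return datetime.fromisoformat(line.split()[0].replace("Z", "+00:00"))


def disconnect_spread_ms(lines: List[str]) -> float:
    """Milliseconds between the first and last 'Server transport closed' line."""
    times = [parse_ts(l) for l in lines if "Server transport closed" in l]
    if len(times) < 2:
        return 0.0
    return (max(times) - min(times)) / timedelta(milliseconds=1)


def is_clustered(lines: List[str]) -> bool:
    return disconnect_spread_ms(lines) <= CLUSTER_WINDOW_MS


LOG = [
    "2026-05-10T04:10:17.167Z [shell] Server transport closed unexpectedly",
    "2026-05-10T04:10:17.175Z [mail] Server transport closed unexpectedly",
    "2026-05-10T04:10:17.177Z [majorvault] Server transport closed unexpectedly",
    "2026-05-10T04:10:17.202Z [filesystem] Server transport closed unexpectedly",
]
```

On the four sample lines the spread is 35 ms, well inside the window.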
### 2. Is there a Crashpad minidump?
```powershell
dir "$env:APPDATA\Claude\Crashpad\reports"
dir "$env:APPDATA\Claude\Crashpad\pending"
```
Empty directories (or directories with no files newer than the disconnect time) = **Claude Desktop did not crash, it hung**. A real crash would have written a minidump.
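The "no files newer than the disconnect time" check can be scripted as well. This is a minimal sketch; `dumps_newer_than` is a hypothetical helper and the cutoff is supplied by the caller as a Unix timestamp.

```python
from pathlib import Path
from typing import List


def dumps_newer_than(directory: Path, cutoff_epoch: float) -> List[Path]:
    """Files in `directory` modified at or after `cutoff_epoch` (empty list if none)."""
    if not directory.is_dir():
        return []
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.stat().st_mtime >= cutoff_epoch
    )
```

An empty result for both the `reports` and `pending` directories points at a hang rather than a crash.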
### 3. Are the MCP child processes still alive in WSL?
```bash
ps -eo pid,etime,cmd | grep -E 'mcp|claude' | grep -v grep
```
If you see your MCP server processes still running with elapsed times spanning the disconnect (or fresh respawns from auto-recovery attempts), the WSL side is healthy. The damage is on the Claude Desktop ↔ MCP transport, not the MCP servers themselves.
### 4. What was the shell MCP doing right before the disconnect?
Check `%APPDATA%\Claude\logs\main.log` for the last `mcp__shell__shell_exec` permission grants and tool calls, and `%APPDATA%\Claude\logs\mcp-server-shell.log` for the last commands invoked. If you see an SSH command issued against a host that you also know to be currently rebooting / unreachable, you've found the trigger.
Confirm with a separate health probe of the remote host (do this in **WSL or a fresh terminal**, not through the wedged Claude Desktop):
```bash
ping -c 3 -W 2 <host-or-tailscale-ip>
ssh -o ConnectTimeout=5 -o BatchMode=yes <host> uptime
tailscale status | grep <host>
```
100% packet loss + missing tailnet entry + SSH timeout = the target is genuinely down or hung mid-reboot.
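Where `ping` is filtered or the probe needs to run from a script, a bounded TCP connect is a rough equivalent of the SSH check. This sketch assumes SSH listens on port 22 on the target, which may not hold for every host; `tcp_port_open` is a hypothetical helper.

```python
import socket


def tcp_port_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts alike.
        return False
```

A `False` here only says the port did not answer in time; it does not distinguish a rebooting host from a firewalled one.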
---
## Recovery
1. **Fully quit Claude Desktop** — system tray icon → *Quit*. Closing the window is not enough; you must terminate the main process so the MCP manager state is cleared.
2. *(Optional)* If you want a clean slate in WSL, kill orphaned MCP child processes:
```bash
pkill -f mcp-shell
pkill -f mail-mcp
pkill -f mcp-majorvault
# ...etc for any other MCP binaries you run
```
This is rarely necessary — fresh spawns will replace them on next launch.
3. **Reopen Claude Desktop**. Watch `mcp.log` and `main.log`:
```
[LocalMcpServerManager] Connected to shell (1 tools)
[LocalMcpServerManager] Connected to filesystem (14 tools)
[LocalMcpServerManager] Connected to mail (30 tools)
...
```
Tool counts should match your `claude_desktop_config.json`. The "UtilityProcess Check: Extension X not found in installed extensions" warnings are benign — Claude Desktop just notes that your MCPs aren't bundled built-in extensions (because they're WSL-launched).
---
## Prevention — fire-and-forget reboot patterns
Don't hand the MCP shell a command that intentionally severs its own SSH session and expects the shell to wait for clean closure. Instead, schedule the reboot to happen **after** SSH disconnects:
### Option A — `nohup` + background (most portable)
```bash
ssh host 'nohup shutdown -r +1 >/dev/null 2>&1 &'
```
Schedules a reboot 1 minute out, returns immediately, SSH closes cleanly. The minute delay gives you time to cancel (`ssh host 'sudo shutdown -c'`) if you change your mind.
### Option B — bounded keepalive timeout
```bash
ssh -o ServerAliveInterval=5 -o ServerAliveCountMax=2 host 'systemctl reboot'
```
If the remote drops without responding within 10 s of keepalives, the local SSH client hangs up — bounding the worst case to ~10 s instead of "until something kills the MCP." Less elegant than Option A but works for one-shot situations.
### Option C — schedule on the box itself
Use a cron `@reboot` reschedule, a `systemd` oneshot timer, or `at` on the box:
```bash
ssh host 'echo "systemctl reboot" | at now + 1 minute'
```
### Anti-pattern (don't do this)
```bash
# ❌ Synchronous reboot through MCP shell
ssh host reboot
ssh host sudo reboot
ssh host 'shutdown -r now'
```
These all hold the MCP stdio pipe open waiting for a session that is being severed at the kernel level on the remote side.
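When the reboot is issued from a script rather than typed by hand, Option A can be combined with a hard client-side timeout so that even a wedged SSH child cannot block forever. The sketch below is one possible shape, not a tested MCP integration: both helper names are hypothetical, and the `ConnectTimeout`/`BatchMode` options are borrowed from the health probe earlier in this article.

```python
import subprocess
from typing import List, Optional


def delayed_reboot_argv(host: str, delay_minutes: int = 1) -> List[str]:
    """Build the fire-and-forget reboot command (Option A) for `host`."""
    remote = f"nohup shutdown -r +{delay_minutes} >/dev/null 2>&1 &"
    return ["ssh", "-o", "ConnectTimeout=5", "-o", "BatchMode=yes", host, remote]


def run_bounded(argv: List[str], timeout_s: float) -> Optional[int]:
    """Run a command; return its exit code, or None if it exceeded `timeout_s`."""
    try:
        done = subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
        return done.returncode
    except subprocess.TimeoutExpired:
        return None
```

A call such as `run_bounded(delayed_reboot_argv("host"), 15)` returns within the timeout whether or not the remote side closes the session cleanly.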
---
## Worked example — 2026-05-10 majorhome reboot
| Time (EDT) | Event |
|---|---|
| 00:41:06 | Claude Desktop emits permission prompt for `mcp__shell__shell_exec` |
| 00:41:08 | Shell MCP disconnect+reconnect cycle (transient, recovered in 2 s) |
| 00:41:10 | `[LocalMcpServerManager] Connected to shell (1 tools)` |
| 00:41:26 | Permission granted — likely the `ssh majorhome reboot` call |
| 00:42:16 | `[Result] Turn succeeded` → session marked `running → idle` |
| 00:42 | `main.log` goes silent |
| 04:10:17 UTC (00:10:17 EDT *prior* — note timezone delta in mcp.log vs main.log) | All 5 MCPs disconnect within 35 ms |
| 01:00–01:10 | majorhome physically recovers, comes back up clean (`uptime` 19 min, `systemctl is-system-running` = `running`) |
| 01:13:42 | After full Claude Desktop restart, all 5 MCPs respawn |
| 01:15:22 | All 5 MCPs reconnected, tools registered |
majorhome itself was never the problem — the reboot succeeded. The damage was the SSH session that never closed cleanly, which poisoned the local Claude Desktop MCP transport.
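Because `mcp.log` timestamps are in UTC while the `main.log` entries in the table are local time, lining the two up needs a conversion. A minimal sketch, assuming a fixed UTC-4 offset for EDT (valid for this May date, not year-round); the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4), "EDT")  # assumed fixed offset; ignores DST changes


def utc_log_to_edt(stamp: str) -> datetime:
    """Convert an mcp.log style UTC timestamp (trailing "Z") to EDT."""
    return datetime.fromisoformat(stamp.replace("Z", "+00:00")).astimezone(EDT)
```

For example, `utc_log_to_edt("2026-05-10T04:10:17.167Z")` lands at 00:10:17 EDT on the same calendar day, matching the value quoted in the table.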
---
## See also
- [Claude Desktop MCP Server Started via wsl.exe Sees Empty Environment (WSLENV)](wsl-env-claude-desktop-mcp.md) — different failure mode (start-up env passing) on the same Claude Desktop + WSL stack
- [Pi-hole AI Blocklist Blocks Claude Desktop (ERR_CONNECTION_REFUSED)](networking/pihole-blocks-claude-desktop.md) — another Claude Desktop transport-layer failure
- [Windows OpenSSH: WSL as Default Shell Breaks Remote Commands](networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md) — related WSL/SSH stdio behavior


@ -0,0 +1,105 @@
---
title: "Forgejo: Account Recovery & CLI Admin When Locked Out of the GUI"
domain: troubleshooting
category: general
tags: [forgejo, gitea, smtp, docker, account-recovery, self-hosting]
status: published
created: 2026-06-12
updated: 2026-06-12
---
# Forgejo: Account Recovery & CLI Admin When Locked Out of the GUI
Two related problems on a single-admin self-hosted **Forgejo** (or Gitea): the GUI *"Forgot password"* is disabled, and you can't log in to fix it. Here's how to (1) enable account recovery properly, and (2) recover from the command line when you're already locked out.
## Symptoms
- The *Forgot password* page shows: **"Account recovery is only available when email is set up. Please set up email to enable account recovery."**
- You can't log in (wrong/forgotten password), so you can't add an SSH key or change settings in the GUI either.
## Part 1 — Enable account recovery (configure the mailer)
Account recovery needs SMTP. If you already run a mail server on your tailnet, relay through it — **no app password needed** when the Forgejo host is `mynetworks`-trusted by that mail server.
Edit `app.ini` (in the data volume, e.g. `/data/gitea/conf/app.ini`):
```ini
[mailer]
ENABLED = true
PROTOCOL = smtp+starttls
SMTP_ADDR = 100.x.y.z ; mail server's tailnet IP
SMTP_PORT = 587
FROM = forgejo@example.com
FORCE_TRUST_SERVER_CERT = true ; required when connecting by IP (cert CN won't match)
```
Notes:
- `FORCE_TRUST_SERVER_CERT = true` is needed when you target the relay by **IP** — the TLS cert is issued for a hostname, not the IP, so verification would otherwise fail. Acceptable on a trusted internal hop.
- Omit `USER`/`PASSWD` if the relay accepts your host via `mynetworks` (no SASL). Otherwise add SMTP auth.
- `app.ini` lives in the persistent volume, so the change **survives container re-creation** (e.g. Watchtower's nightly pull).
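Before restarting the container, the `[mailer]` section can be linted with Python's standard `configparser`. This is a rough sketch under assumptions: it treats `;` and `#` after whitespace as inline comments (so a value that legitimately contains ` ;` would be truncated), it parks any section-less keys at the top of the file in a made-up `[__top__]` section, and the set of "required" keys is simply the ones shown in the example above.

```python
import configparser
from typing import List

REQUIRED_KEYS = ("SMTP_ADDR", "SMTP_PORT", "FROM")  # keys from the example config


def mailer_problems(app_ini_text: str) -> List[str]:
    """Return problems found in the [mailer] section; an empty list means it looks fine."""
    parser = configparser.ConfigParser(
        inline_comment_prefixes=(";", "#"),  # strip trailing "; comment" text
        interpolation=None,                  # treat % literally
        strict=False,                        # tolerate repeated sections/keys
    )
    # Keys before the first section header would otherwise raise an error.
    parser.read_string("[__top__]\n" + app_ini_text)
    if not parser.has_section("mailer"):
        return ["no [mailer] section"]
    mailer = parser["mailer"]
    problems = []
    if mailer.get("ENABLED", "").strip().lower() != "true":
        problems.append("ENABLED is not true")
    problems += [f"{key} is missing" for key in REQUIRED_KEYS if not mailer.get(key, "")]
    return problems


SAMPLE_INI = """\
[mailer]
ENABLED = true
PROTOCOL = smtp+starttls
SMTP_ADDR = 100.x.y.z ; mail server's tailnet IP
SMTP_PORT = 587
FROM = forgejo@example.com
FORCE_TRUST_SERVER_CERT = true ; required when connecting by IP
"""
```

This checks only that the section is present and populated; whether the relay actually accepts mail still needs the SMTP test.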
Apply and verify:
```bash
docker restart forgejo
docker logs forgejo 2>&1 | grep -i "Mail Service Enabled" # confirms the mailer loaded
```
Test the SMTP path **before** trusting it (run from the host, mimicking Forgejo's connection):
```bash
python3 - <<'EOF'
import smtplib, ssl
ctx = ssl.create_default_context(); ctx.check_hostname = False; ctx.verify_mode = ssl.CERT_NONE
s = smtplib.SMTP("100.x.y.z", 587, timeout=15)
s.ehlo(); s.starttls(context=ctx); s.ehlo()
s.sendmail("forgejo@example.com", ["you@example.com"],
"Subject: test\r\n\r\nForgejo relay path test")
s.quit(); print("SENT_OK")
EOF
```
`SENT_OK` means the relay accepted the message. `/user/forgot_password` should now show the reset form instead of the email error.
> **Container can't reach the tailnet IP?** Docker bridge networks usually route to Tailscale via the host (SNAT to the host's tailnet IP). Confirm with:
> `docker exec forgejo nc -w5 100.x.y.z 587 </dev/null && echo REACHABLE`
## Part 2 — Recover from the CLI (already locked out)
Forgejo's admin CLI runs inside the container as the git user (UID 1000) and needs no login.
**Reset a password:**
```bash
docker exec -u 1000 forgejo forgejo admin user change-password -u <user> -p '<newpass>'
```
> ⚠️ **Gotcha:** `change-password` sets `must_change_password=true` by default. That **forces a change on next GUI login _and_ returns HTTP 403 on the API** (`"You must change your password"`). Clear it:
> ```bash
> docker exec -u 1000 forgejo forgejo admin user must-change-password --unset <user>
> ```
**Add an SSH key without the GUI** (basic-auth API — works only if 2FA is off):
```bash
curl -u <user>:'<pass>' -X POST -H 'Content-Type: application/json' \
-d '{"title":"laptop","key":"ssh-ed25519 AAAA... you@host"}' \
http://localhost:3004/api/v1/user/keys
# HTTP 201 = created
```
Forgejo regenerates the git user's `authorized_keys` from the database, so `ssh -p <port> git@host` authenticates immediately afterward — no restart needed.
## "The password keeps changing" — it (probably) isn't
A stock Forgejo container does **not** reset admin passwords, so if a self-hosted admin password *seems* to reset itself, rule out the server first:
- the compose has **no** admin/password env and no custom entrypoint;
- **no** cron, systemd timer, or script runs `forgejo admin user change-password`;
- the data volume is persistent (re-creation keeps the DB, password included).
If all three hold, nothing server-side is changing it — the "changing" password is a **client-side** artifact: a duplicate or stale entry in your password manager autofilling different values. Delete the duplicates and keep one.
## See also
- Forgejo — [Config Cheat Sheet → mailer](https://forgejo.org/docs/latest/admin/config-cheat-sheet/)


@ -0,0 +1,119 @@
---
title: "LoRA adapter — GGUF conversion fails with 'config.json not found'"
domain: troubleshooting
category: gpu-display
tags: [lora, qlora, gguf, llama.cpp, unsloth, fine-tuning, qwen]
status: published
created: 2026-04-30
updated: 2026-04-30
---
# LoRA adapter — GGUF conversion fails with 'config.json not found'
## Problem
After a QLoRA fine-tune, you point `llama.cpp/convert_hf_to_gguf.py` at the training output directory and it crashes immediately:
```
FileNotFoundError: [Errno 2] No such file or directory:
'/path/to/training-runs/<run>/final/config.json'
```
The output directory looks fine — it contains:
```
adapter_config.json
adapter_model.safetensors (~150 MB for a 7B base)
chat_template.jinja
tokenizer_config.json
tokenizer.json
```
But no `config.json`, and `adapter_model.safetensors` is 150 MB — way smaller than the ~14 GB you'd expect for a full Qwen2.5-7B 16-bit checkpoint.
## Root cause
`model.save_pretrained()` after a LoRA/QLoRA train saves **only the adapter weights**, not a merged full-precision model. `convert_hf_to_gguf.py` expects a full HuggingFace model directory — it reads `config.json` to identify the architecture. Adapter-only directories don't have one.
You need to merge the LoRA adapter into the base model first, then point the GGUF converter at the merged dir.
## Solution
### Quick fix — inline merge step
Insert this block between training completion and `convert_hf_to_gguf.py`:
```python
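from unsloth import FastLanguageModel

adapter = "/path/to/training-runs/<run>/final"   # adapter-only output from training
merged = "/path/to/training-runs/<run>/merged"   # destination for the full merged model

# Loading the adapter directory also loads the base model it was trained on
model, tok = FastLanguageModel.from_pretrained(
    model_name=adapter,
    max_seq_length=2048,
    load_in_4bit=True,
)
# Fold the LoRA weights into the base and write a 16-bit checkpoint (includes config.json)
model.save_pretrained_merged(merged, tok, save_method="merged_16bit")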
```
Then run the GGUF converter against the **merged** dir, not the adapter dir:
```bash
python3 llama.cpp/convert_hf_to_gguf.py /path/to/training-runs/<run>/merged \
--outfile model-f16.gguf --outtype f16
```
The merged dir will contain `config.json`, `model-00001-of-00004.safetensors` (multiple shards totaling the full base model size), `generation_config.json`, etc.
### Cleaner fix — use a wrapper
If you do this often, encapsulate it:
1. Wrapper Python script accepts `--adapter`, `--output`, `--skip-merge`, `--all-quants`
2. Step 1: load adapter via `FastLanguageModel.from_pretrained()`, call `save_pretrained_merged()`
3. Step 2: subprocess `convert_hf_to_gguf.py` on the merged dir
4. Step 3: subprocess `llama-quantize` for each requested quant
This is what `~/corpus/scripts/convert_gguf.py` does on MajorRig (rewritten 2026-04-09 for the MajorTwin v7b cycle).
## Why this trips people up
- Unsloth and PEFT both save adapter-only by default after `trainer.save_model()` or `model.save_pretrained()`. There's no warning that downstream tools expect a merged model.
- The training output **looks** complete — there's a `tokenizer.json`, a `chat_template.jinja`, and a non-trivial `.safetensors`. It feels like a checkpoint.
- If a pipeline originally calls `convert_gguf.py` (which merges) and someone later reimplements Step 4 inline without the wrapper, the merge step is silently lost. This is what happened in MajorTwin v8c (Apr 30, 2026); see [[majortwin-v8b-plan#Pipeline Bug + Fix (2026-04-30)]].
## Verification checklist
After training, before running the GGUF converter, verify the directory you're pointing at:
| File | Adapter-only dir | Merged dir |
|---|---|---|
| `adapter_config.json` | ✅ | ❌ |
| `adapter_model.safetensors` | ✅ (~150 MB / 7B) | ❌ |
| `config.json` | ❌ | ✅ |
| `model-*.safetensors` (sharded) | ❌ | ✅ (~14 GB / 7B) |
| `generation_config.json` | ❌ | ✅ |
| `tokenizer.json` | ✅ | ✅ |
If you see only the left column, you need to merge before converting.
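The checklist can be reduced to a pre-flight check that a pipeline runs before calling the converter. A minimal sketch that keys on just the two most telling files from the table; the function name and return labels are hypothetical.

```python
from pathlib import Path


def classify_model_dir(path: Path) -> str:
    """Return "merged", "adapter-only", or "unknown" based on which marker files exist."""
    has_adapter = (path / "adapter_config.json").is_file()
    has_config = (path / "config.json").is_file()
    if has_config and not has_adapter:
        return "merged"         # safe to hand to convert_hf_to_gguf.py
    if has_adapter and not has_config:
        return "adapter-only"   # merge first
    return "unknown"            # empty, missing, or a mix of both
```

Refusing to continue unless the result is `"merged"` turns the `FileNotFoundError` into an explicit, early failure.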
## Resuming a failed pipeline without re-training
The adapter is small and self-contained. If your pipeline crashes at the GGUF step, you do NOT need to retrain — the LoRA adapter at `<run>/final/` is intact. Write a resume wrapper that runs only:
1. Merge (`save_pretrained_merged`)
2. F16 conversion (`convert_hf_to_gguf.py`)
3. Quantization (`llama-quantize`)
4. Deploy
This saves the cost of however many GPU-hours the training took. See `~/corpus/scripts/resume_v8c_step4.sh` on MajorRig for an example.
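One way to structure such a resume wrapper is to compute which steps still need to run from the files already on disk. The sketch below is not the MajorRig script: `merge_adapter.py` is a hypothetical script holding the Unsloth merge shown earlier, the `model-<quant>.gguf` naming is an assumption, and the deploy step is left out because the source does not describe it.

```python
from pathlib import Path
from typing import List


def resume_plan(run_dir: Path, quants: List[str]) -> List[List[str]]:
    """Return the commands still needed, skipping steps whose outputs already exist."""
    adapter = run_dir / "final"
    merged = run_dir / "merged"
    f16 = run_dir / "model-f16.gguf"
    steps: List[List[str]] = []
    if not (merged / "config.json").is_file():
        # Hypothetical script wrapping save_pretrained_merged().
        steps.append(["python3", "merge_adapter.py", str(adapter), str(merged)])
    if not f16.is_file():
        steps.append(["python3", "llama.cpp/convert_hf_to_gguf.py", str(merged),
                      "--outfile", str(f16), "--outtype", "f16"])
    for quant in quants:
        out = run_dir / f"model-{quant}.gguf"
        if not out.is_file():
            steps.append(["llama-quantize", str(f16), str(out), quant])
    return steps
```

Each returned list can be passed to `subprocess.run(..., check=True)` in order; rerunning after a crash then picks up at the first missing output.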
## Related
- [[qwen-14b-oom-3080ti]] — base model size choice on a 12GB GPU
- [[majortwin-v8b-plan]] — v8c pipeline architecture and resume
## Maintenance
- 2026-04-30 — Created after MajorTwin v8c pipeline failed Step 4. Root-caused, patched, resumed.


@ -1,6 +1,6 @@
---
created: 2026-03-15T06:37
updated: 2026-04-29T23:55
updated: 2026-05-02T17:50
---
# 🔧 General Troubleshooting
@ -8,12 +8,18 @@ Practical fixes for common Linux, networking, and application problems.
## 🖥️ GPU & AI
- [Qwen2.5-14B OOM on RTX 3080 Ti (12GB)](gpu-display/qwen-14b-oom-3080ti.md)
- [LoRA adapter — GGUF conversion fails with 'config.json not found'](gpu-display/lora-adapter-gguf-conversion-fails.md)
## 🌐 Networking & Web
- [Wi-Fi Game Streaming Stutter: 160 MHz Channel Width Saturating the 5 GHz Radio](networking/wifi-160mhz-airtime-saturation-game-streaming.md)
- [Apache Outage: Fail2ban Self-Ban + Missing iptables Rules](networking/fail2ban-self-ban-apache-outage.md)
- [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](networking/fail2ban-imap-self-ban-mail-client.md)
- [firewalld: Mail Ports Wiped After Reload](networking/firewalld-mail-ports-reset.md)
- [Dovecot IMAP Clients Fail to Sync: vsz_limit OOM from a Bloated Index Log](networking/dovecot-imap-oom-vsz-limit-bloated-index.md)
- [Postfix header_checks Can't Act on Milter-Added Headers (Use Sieve)](networking/postfix-header-checks-vs-milter-headers.md)
- [Dovecot Phantom Mailboxes from .dovecot.lda-dupes (mail_home Overlapping the Maildir Root)](networking/dovecot-mail-home-maildir-root-phantom-mailboxes.md)
- [Tailscale SSH: Unexpected Re-Authentication Prompt](networking/tailscale-ssh-reauth-prompt.md)
- [SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure (No `Host` Block)](networking/ssh-missing-host-block-magicdns-host-key-failure.md)
- [iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators](networking/tailscale-status-json-hostname-localhost-ios.md)
- [rsync over Tailscale: Hung in TCP Teardown After Transfer Completes](networking/rsync-tailscale-teardown-stall.md)
- [Windows OpenSSH: WSL Default Shell Breaks Remote Commands](networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md)
@ -26,6 +32,8 @@ Practical fixes for common Linux, networking, and application problems.
- [SSH Timeout During dnf upgrade on Fedora Hosts](ansible-ssh-timeout-dnf-upgrade.md)
- [Vault Password File Missing](ansible-vault-password-file-missing.md)
- [ansible.cfg Ignored on WSL2 Windows Mounts](ansible-wsl2-world-writable-mount-ignores-cfg.md)
- [regex_search — capture-group argument doesn't work in set_fact](ansible-regex-search-set-fact-capture-group.md)
- [reboot.yml: become Timeout on WSL2 Hosts (Exclude Them)](ansible-reboot-become-timeout-wsl2.md)
## 📦 Docker & Systems
- [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](docker-caddy-selinux-post-reboot-recovery.md)
@ -36,6 +44,7 @@ Practical fixes for common Linux, networking, and application problems.
## 🔒 SELinux
- [SELinux: Fixing Dovecot Mail Spool Context (/var/vmail)](selinux-dovecot-vmail-context.md)
- [SELinux: Wrong /etc/localtime Label Silently Breaks Timezone Changes](selinux-localtime-label-breaks-timezone.md)
## 💾 Storage
- [mdadm RAID Recovery After USB Hub Disconnect](storage/mdadm-usb-hub-disconnect-recovery.md)
@ -43,9 +52,12 @@ Practical fixes for common Linux, networking, and application problems.
## 📝 Application Specific
- [Obsidian Vault Recovery — Loading Cache Hang](obsidian-cache-hang-recovery.md)
- [Gemini CLI Manual Update](gemini-cli-manual-update.md)
- [iPhone Mirroring Hangs on 'Connecting…' — AWDL Data Stall (27.0 Beta)](iphone-mirroring-connecting-hang-awdl-stall-beta.md)
## 🤖 AI / Local LLM
- [Ollama Drops Off Tailscale When Mac Sleeps](ollama-macos-sleep-tailscale-disconnect.md)
- [Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt](ollama-chat-template-pipe-stdin-bypass.md)
- [Windows OpenSSH Server (sshd) Stops After Reboot](networking/windows-sshd-stops-after-reboot.md)
- [claude-mem Silently Fails with Claude Code 2.1+ (Empty `--setting-sources`)](claude-mem-setting-sources-empty-arg.md)
- [Claude Code Won't Log In (Warp & iTerm2) — Corrupt Keychain Credential](claude-code-warp-login-corrupt-keychain-credential.md)
- [Claude Code Keychain Prompt Keeps Reappearing on macOS (ACL Invalidation)](claude-code-keychain-prompt-recurring-macos.md)


@ -0,0 +1,150 @@
---
title: "iPhone Mirroring Hangs on 'Connecting…' — AWDL Data Stall (27.0 Beta)"
domain: troubleshooting
category: macos
tags: [macos, iphone-mirroring, continuity, awdl, rapport, quic, tailscale, mullvad, beta, channel-validation, aimesh, quicktime, usb]
status: published
created: 2026-06-09
updated: 2026-06-15
---
# iPhone Mirroring Hangs on 'Connecting…' — AWDL Data Stall (27.0 Beta)
## Update 2026-06-15: REGRESSED; reproducibly stuck on "Connecting", and Tailscale was **not** the cure
> **Correction to the 2026-06-14 "it WORKS" update below.** On 2026-06-15 iPhone Mirroring is **reproducibly stuck on "Connecting to iPhone 16 Pro"** on MajorAir again, with Tailscale `accept-routes` *still* `false`. So the accept-routes change was **correlation, not the fix**: this is an **intermittent macOS 27.0 beta AWDL bug, independent of Tailscale**.
>
> **Tried this round (all failed to establish a session):** Tailscale `accept-routes=false` (already in place) · `sudo ifconfig awdl0 down/up` · **full Mac reboot** · cycling the iPhone's Wi-Fi + Bluetooth.
>
> **Log signature:** `rapportd` resolves the phone's `_asquic._udp.local` endpoint and `_companion-link` registers (discovery *succeeds*), but the QUIC-over-AWDL **datapath never completes into a live session**; `wifip2pd` loops on `AWDLDiscoveryTimeout (hasAdvertises=false)`. Each reset advanced the handshake one stage further (no-advertises → resolve-started → endpoint-resolved) yet none reached a streaming session. **`llw0` never went active (0 bytes)**, confirming no A/V ever flowed, regardless of what the 06-14 note measured.
>
> **Stance:** beta OS bug, **no reliable user-side fix**. Use the **QuickTime USB mirror** workaround (below) when you actually need the phone on screen. The 06-14 "it works on `llw0`" measurements were real *for that one session* but are **not reproducible** across seeds/sessions, so treat mirroring as intermittently broken on the 27.0 betas. This reconfirms the original **Root cause (conclusion)** section further down (a beta bug, "nothing in local config wrong"), which the 06-14 update had prematurely overridden.
## Update 2026-06-14 (evening): it WORKS; the "AWDL starvation" finding was the wrong interface
> iPhone Mirroring is now **working** on MajorAir (stable session, clean video, no missing icons) on **ch44/80** with Tailscale `accept-routes=false`. An earlier pass the same day blamed an "AWDL bulk-path starving at ~90 B/s"; that was **measuring the wrong interface** and is corrected here.
**The video transport is `llw0` (low-latency WLAN), not `awdl0`.**
Measured during an active session: **`llw0` ≈ 800 KB/s** (≈6 Mbps of real video), `en0` ~60 KB/s, **`awdl0` ~1 KB/s**. `awdl0` only ever carries AWDL *discovery/control* (~90 B/s), whether mirroring works or not. So "90 B/s on `awdl0` = starved bulk path" was a **red herring**: the A/V stream rides `llw0`, which the earlier pass never measured.
**What was actually broken was session *stability*.** The `XPC_ERROR_CONNECTION_INTERRUPTED` / `MediaContinuityKit.TaskTimeoutError` teardown loop kept the `llw0` stream from ever sustaining (→ glitchy / missing icons). When the session holds, `llw0` streams clean.
**What changed (not cleanly isolated):** three things differed between the broken and working states — (1) the network fully **settled on ch44** over ~15 h (the failing ch44 test was minutes after a chaotic AiMesh resync + reconnect scramble), (2) Tailscale **`accept-routes` was turned off** (it had been polluting IPv4 routing + the Continuity control plane), and (3) both devices slept/woke. Which one mattered is not yet proven.
**Open test (isolates Tailscale's role):** repro on **MajorMac** with *unaltered* Tailscale (`accept-routes` still **ON**). If mirroring breaks there but works on MajorAir (accept-routes OFF), that pins Tailscale's accepted routes as the trigger. See [[MajorAir#Known Issues]] for the `accept-routes=false` fix.
**Still valid from earlier today:** congestion ruled out (router `chanim_stats` ch36 = 90 % idle, 86 % txop); the AiMesh / router infra notes below; and iPhone Mirroring is **wireless-only, with no USB transport** (for a wired screen view, use QuickTime, below).
> ⚠️ The iPhone-radio `isValidChannel`/`awdl0` evidence cited in the original 2026-06-09 write-up below describes AWDL *discovery* health, **not** the video path. Read it in light of this correction.
**Wired workaround (works today, no AWDL):**
iPhone Mirroring is **wireless-only: there is no USB transport** (confirmed: cable connected throughout, every attempt still used `awdl0`). For a wired view of the screen:
> **QuickTime Player → File → New Movie Recording → ⌄ next to record → select the iPhone** = full-rate USB-C screen mirror (view + record). Does **not** give remote control (tap/type); that is unique to iPhone Mirroring.
**Infra notes (RT-AX82U, AiMesh controller):**
- Router SSH is on **port 1025** (not 22); creds in Ansible vault (`router_username` / `router_password`).
- The 5 GHz channel is **AiMesh-coordinated** and **resists CLI changes**: `wl chanspec` / nvram `wl1_chanspec` get reasserted by `acsd2` + AiMesh within seconds, even after `restart_wireless`. Only setting Control Channel to an **explicit value in the Web UI** holds mesh-wide. Left "Auto" → acsd2 picks **36** (the cleanest channel).
- Any channel change triggers a **mesh resync (~1 min) that drops all Wi-Fi**; during it MajorAir falls back to the iPhone's **USB Personal Hotspot** (`en7` / `172.20.10.x`) and won't auto-rejoin home Wi-Fi while the hotspot feeds it internet (manual Wi-Fi-menu join needed).
- **Current state: 5 GHz on ch44/80** (same clean UNII-1 spectrum as 36; left here to avoid another resync, since the Deck streams identically on 44).
**If it breaks again — troubleshooting checklist:**
1. **It's session stability, not bandwidth.** Look for teardown loops: `log show --last 3m --predicate 'process == "iPhone Mirroring"' | grep -iE "interrupt|timeout|endpoint"`.
2. **Measure the right interface** — video rides **`llw0`** (hundreds of KB/s when the screen is active), *not* `awdl0` (~90 B/s control is normal): `netstat -ib | awk '/<Link#/{print $1, $7}'` before/after a few seconds.
3. **Tailscale:** confirm `accept-routes=false` on the Mac (`tailscale debug prefs | grep RouteAll`) — see [[MajorAir#Known Issues]].
4. **Let the network settle** after any Wi-Fi/channel change: an AiMesh resync churns AWDL/Continuity state for a minute+; retry once stable.
5. iPhone: on home Wi-Fi, near the Mac, **Personal Hotspot off**, not in Low Power Mode.
6. **Wired fallback that always works:** QuickTime → New Movie Recording → select the iPhone (USB-C; view/record only, no control).
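
The before/after comparison in step 2 can be sketched in Python. This is a minimal sketch, not part of the original triage: `link_ibytes` and `rates` are hypothetical helper names, and it assumes the same `netstat -ib` layout the awk one-liner relies on (link-layer rows tagged `<Link#N>`, byte counter in field 7). Interfaces with no link-layer address shift the columns and will read wrong, exactly as they do with awk.

```python
def link_ibytes(netstat_text):
    """Map interface name -> input bytes, read from `netstat -ib` link-layer rows."""
    counters = {}
    for line in netstat_text.splitlines():
        fields = line.split()
        # Same filter and column as the awk one-liner: <Link#N> rows, field 7.
        if len(fields) >= 7 and fields[2].startswith("<Link#") and fields[6].isdigit():
            counters[fields[0]] = int(fields[6])
    return counters


def rates(before_text, after_text, seconds):
    """Bytes per second received on each interface between two snapshots."""
    before = link_ibytes(before_text)
    after = link_ibytes(after_text)
    return {name: (after[name] - before[name]) / seconds
            for name in after if name in before}
```

Capture `netstat -ib` twice a few seconds apart while the phone screen is active and pass both dumps in. Going by the measurements above, a healthy session shows `llw0` in the hundreds of KB/s while `awdl0` stays near 90 B/s.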
---
## Symptom
iPhone Mirroring on the Mac sits on **"Connecting…"** forever and never shows the iPhone screen.
- Mac: **macOS 27.0 dev beta** (build 26A5353q), MajorAir
- iPhone: **Major16Pro / iPhone17,1, iOS 27.0 dev beta**, same Apple ID (maj.linux@gmail.com)
## Root cause (conclusion)
A **bug in the iPhone Mirroring beta** (both devices on the `.0` developer seeds). The connection
authenticates, the AWDL peer-to-peer link comes up, the TLS handshake starts — then **bidirectional
data stalls ~2 seconds in** and the link is torn down. Deterministic, reproduces every attempt.
**Nothing in the local configuration was wrong.** Filed via Feedback Assistant; expected to clear in
a future seed.
Two *real but secondary* network-layer issues were found and fixed along the way (see below) — they
can block mirroring independently, but were not the cause of the final 2-second stall.
## The smoking gun (unified log)
Per connection attempt the sequence is always:
```
rapportd: Session start … linkType "AWDL", error "NoError" # link negotiated OK
iPhone Mirroring: Installing verify block for 1 authorized peer key(s)
boringssl: TLS client read_server_hello # iPhone DID respond
quic: path over awdl0 received event established / promoted to primary
nw_flow_connected: Transport protocol connected (socket)
… flow:connect_stalled @2.003s # stalls exactly ~2s in
quic_conn_log_summary: Connection attempts: 6, RETRY received: no, PTOs: 5 # packets sent, zero ACKs
[C1.1.1 … awdl0 … failed socket-flow (unsatisfied (No network route))] # link dropped (symptom, not cause)
```
The connection is pinned to AWDL (`allowed subtypes: wifi_awdl, prohibit fallback`), so once the
AWDL data path stalls there is no fallback and it fails. "No network route" is the *result* of the
teardown, not the trigger. The trigger is that after the initial handshake packets, **sustained
QUIC traffic over AWDL gets no ACKs** (PTOs).
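
To check a saved log capture for the same signature, a small scan like the one below may help. It is a sketch under assumptions: the regular expressions match the two lines exactly as quoted in the excerpt above (`connect_stalled @…s` and the `quic_conn_log_summary` line), real log output may be formatted differently, and `stall_signature` is a hypothetical helper name.

```python
import re

# Patterns modelled on the excerpt above; adjust them if your log lines differ.
STALL = re.compile(r"connect_stalled\s*@\s*([0-9.]+)s")
SUMMARY = re.compile(r"Connection attempts:\s*(\d+).*?PTOs:\s*(\d+)")


def stall_signature(log_text):
    """Return (stall times in seconds, [(attempts, PTOs), ...]) found in the text."""
    stalls = [float(m.group(1)) for m in STALL.finditer(log_text)]
    summaries = [(int(m.group(1)), int(m.group(2)))
                 for m in SUMMARY.finditer(log_text)]
    return stalls, summaries
```

Stall times clustered around 2 seconds together with non-zero PTO counts would match the pattern described here; an empty result suggests a different failure.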
## Investigation path (what was ruled out)
- **Discovery / proximity** — healthy throughout. BLE + Bonjour resolve the iPhone; `rapportd`
sees it with good RSSI, same iCloud (`DF < MyiCloud >`), `WiFiP2P`.
- **Tailscale (full-tunnel)** — with Tailscale connected, the attempt died at "No network route"
*before* even reaching AWDL. Cause: `RouteAll: true` (accept-routes / a `::/0` advertised route)
installs **IPv6 default routes via `utun`** (`default → fe80::%utun0..3`) that black-hole the
IPv6 path AWDL needs. **`tailscale down` is NOT enough** — it only sets `WantRunning=false`; the
macOS VPN *configuration* (`scutil --nc list` showed it still `Connected`) and the system
extension keep reasserting the routes across reboots. Must disable in **System Settings → VPN**.
- **Mullvad:** `mullvad-daemon` running; **"Local network sharing" was set to `block`**, which
blocks LAN/AWDL/multicast. Changed to **`allow`** (`mullvad lan set allow`). Kill-switch was off.
- **macOS firewall** — off. No Little Snitch/LuLu app installed.
- **Lockdown Mode** — off (iPhone).
- **OS-version mismatch** — ruled out; both Mac and iPhone on 27.0 dev beta.
- **Device trust / re-pairing** — there is **no local pairing record on the Mac** to reset.
`rapportd` lists the iPhone as **"PairedSys Conjectured"** = trust is *derived from the shared
Apple ID*, not a manual pairing. Forgetting the Mac on the iPhone does not force re-setup; the
Mac just re-derives the association from iCloud and reconnects. (App containers
`~/Library/Containers/com.apple.ScreenContinuity` and the rapport stores held no device record;
the "1 authorized peer key" lives in the protected system keychain.)
- **Reboots / airplane-mode toggle / Mac-side AWDL + rapportd reset** — no change.
## Secondary issues found & fixed (do these regardless)
1. **Mullvad** — set **Local network sharing = allow** (done). Required for any LAN/AWDL feature.
2. **Tailscale** — do not run **full-tunnel / accept a `::/0` route** while mirroring; it installs
IPv6 default routes via `utun` that kill the local link. Toggle the VPN off in System Settings
(not just `tailscale down`) if it ever needs to be fully out of the path.
3. **Orphaned Little Snitch network extension** — the app was uninstalled but its
`at.obdev.littlesnitch.networkextension` is still `[activated enabled]`
(`systemextensionsctl list`). Remove via **System Settings → General → Login Items &
Extensions → Network Extensions**. A zombie filter extension with no app behind it can
black-hole traffic.
## Status / next steps
- **No user-side fix.** Filed in Feedback Assistant.
- Debug capture saved: `~/Desktop/iPhoneMirroring-debug-20260609-0026.txt` (summary + log narrative).
For a full report, trigger a sysdiagnose (**⌃⌥⇧⌘ + .**) right after reproducing and attach it.
- VPNs restored after session: Tailscale back up; Mullvad left disconnected with LAN sharing = allow.
## Useful diagnostic commands (for next time)
```bash
# Connection narrative
log show --last 10m --style compact --predicate \
'(subsystem == "com.apple.MediaContinuityKit") OR (process == "iPhone Mirroring")' | tail -60
# rapport / AWDL negotiation
log show --last 5m --style compact --predicate 'process == "rapportd"' | grep -iE "AWDL|Pair|Session"
# VPN config really on? (CLI "down" lies)
scutil --nc list ; scutil --nc status "Tailscale"
# IPv6 default routes hijacked by utun?
netstat -rn -f inet6 | awk '$1=="default"{print}'
# Active system extensions (filters/VPNs)
systemextensionsctl list
# Mullvad LAN sharing
mullvad lan get
```
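
The "IPv6 default routes hijacked by utun" check can also be automated. The sketch below assumes the routing table is passed in as the text output of `netstat -rn -f inet6`, and that a hijacked route shows `utun` in either the gateway or the interface column (as in `default → fe80::%utun0..3` above). `utun_default_routes` is a hypothetical helper name.

```python
def utun_default_routes(netstat_text):
    """Return the IPv6 default-route lines that point at a utun interface."""
    hits = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Only default routes matter; flag any whose gateway or netif mentions utun.
        if fields and fields[0] == "default" and any("utun" in f for f in fields[1:]):
            hits.append(" ".join(fields))
    return hits
```

An empty list means no default route is riding a tunnel interface; any entry is a candidate for the black-hole described in the Tailscale bullet.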
## Related
- `macos-mirrored-notification-alert-loop.md` (other Continuity issue)
- Hosts/VPN context: MajorTwin project doc (Tailscale tailnet, 100.x addresses)

@ -1,11 +1,17 @@
---
title: "ISP SNI Filtering & Caddy Troubleshooting"
title: ISP SNI Filtering & Caddy Troubleshooting
domain: troubleshooting
category: general
tags: [isp, sni, caddy, tls, dns, cloudflare]
tags:
- isp
- sni
- caddy
- tls
- dns
- cloudflare
status: published
created: 2026-04-02
updated: 2026-04-02
updated: 2026-04-30T13:07
---
# ISP SNI Filtering & Caddy Troubleshooting
@ -29,3 +35,89 @@ notes.majorshouse.com {
```
Once the hostname was changed to one without the "wiki" keyword, the TLS handshake completed successfully.
---
## 🔁 2026-04-30 Update — Stale A Record + Cloudflare Proxy Fix
The hostname rename held for ~4 weeks. On 2026-04-30 the wiki went down with a TLS handshake failure on `notes.majorshouse.com`. The on-the-spot framing was "ISP filter expanded to include 'notes'", but a Cloudflare DNS audit showed a different (and arguably worse) root cause: **the `notes` A record was pointing at `136.54.3.248`, an IP that is not majorlab's current home IP.** Whichever host responds at that address either does not run Caddy or does not know about the `notes.majorshouse.com` SNI, so the TLS handshake was rejected with `internal_error 80`.
### Re-diagnosis
```bash
# Cert + Caddy + mkdocs all healthy on majorlab
$ ssh majorlab 'systemctl is-active caddy; ss -tlnp | grep :443'
active
LISTEN 0 4096 *:443 users:(("caddy",pid=1549,fd=7))
# Loopback-served TLS works fine — cert valid Mar 11 → Jun 9 2026
$ ssh majorlab 'curl -sS -o /dev/null -w "%{http_code}\n" --resolve notes.majorshouse.com:443:127.0.0.1 https://notes.majorshouse.com/'
200
# External TLS handshake gets rejected with internal_error
$ openssl s_client -servername notes.majorshouse.com -connect 136.54.3.248:443
… SSL alert number 80 (internal_error) …
```
### The smoking-gun comparison
Other `*.majorshouse.com` services worked because they were CNAMEs to the apex, which resolves to majorlab's actual home IP:
| Subdomain | DNS shape | Final IP | Status |
|---|---|---|---|
| `notes.majorshouse.com` | **A → `136.54.3.248`** (stale) | `136.54.3.248` (wrong host) | ❌ TLS rejected |
| `git.majorshouse.com` | CNAME → `majorshouse.com.` | `136.56.0.55` (majorlab) | ✅ |
| `n8n.majorshouse.com` | CNAME → `majorshouse.com.` | `136.56.0.55` (majorlab) | ✅ |
| `matrix.majorshouse.com` | CNAME → `majorshouse.com.` | `136.56.0.55` (majorlab) | ✅ |
None of the working subdomains were proxied through Cloudflare (`proxied=false` on all of them); they simply had the right IP. The `notes` A record was the only one pointing somewhere wrong — most likely a stale value from a prior ISP / IP change that never got cleaned up.
### ✅ Fix — switch `notes` to a Cloudflare-proxied CNAME
Rather than just correcting the A record (which would silently break again the next time the home IP changes), the fix is a CNAME to the apex with proxy on. That gives two protections in one move: it always tracks the apex (so home IP changes propagate automatically) and it puts the wiki behind Cloudflare's edge (so any future ISP-side weirdness like the original `wiki` SNI filter is also bypassed).
```bash
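# via Cloudflare API (token from ansible-vault: vault_cloudflare_api_token)
# Assumes ZONE_ID, NOTES_RECORD_ID, and CF_API_TOKEN are exported in the shell.
curl -sS -X PUT \
  "https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/dns_records/${NOTES_RECORD_ID}" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json" \
  --data '{
    "type": "CNAME",
    "name": "notes.majorshouse.com",
    "content": "majorshouse.com",
    "ttl": 1,
    "proxied": true,
    "comment": "switched A→CNAME proxied to bypass stale IP / ISP SNI filter"
  }'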
```
Or via the dashboard:
1. Cloudflare → `majorshouse.com` zone → DNS → Records
2. Edit the `notes` record: Type `CNAME`, Target `majorshouse.com`, Proxy `Proxied` (orange cloud)
3. Save
External clients now hit Cloudflare edge IPs (`104.21.x.x` / `172.67.x.x`) which TLS-terminate at the edge and tunnel back to majorlab's apex IP. ACME on majorlab keeps working — Cloudflare passes the HTTP-01 challenge through on port 80. Caddy's `notes.majorshouse.com {}` block needs no change.
Verify (response should show `server: cloudflare` and `via: 1.0 Caddy`):
```bash
curl -sSI https://notes.majorshouse.com/
```
### Why a Cloudflare-proxied CNAME is the durable shape
- **Apex follows the home IP automatically.** Update the apex A record once when the ISP changes; every subdomain inherits it without per-record fixes.
- **TLS handshake is offloaded to CF.** Any ISP-level SNI weirdness (the original `wiki` ban; theoretical future bans) becomes irrelevant: external clients present SNI `notes.majorshouse.com` to Cloudflare's edge, a path the ISP doesn't filter.
- **Free.** Cloudflare's free tier covers proxy + TLS termination.
### Audit checklist for any home-hosted `*.majorshouse.com` subdomain
- [ ] DNS record is a **CNAME** to `majorshouse.com.`, not an A record to a literal home IP.
- [ ] Cloudflare proxy (orange cloud, `proxied=true`) enabled on the record — at minimum for any subdomain where TLS reachability matters.
- [ ] Caddy entry on majorlab references the public hostname; `reverse_proxy` stays on the localhost port.
- [ ] HTTPS verified from outside the LAN (phone on cellular is sufficient) within the first hour after the change.
- [ ] If an A record is genuinely required (e.g. it must NOT go through CF), document why in the deploy notes for that service.
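
The first two checklist items are mechanical enough to script. The sketch below is an assumption-laden illustration, not an existing tool: `audit_record` is a hypothetical helper, and it expects each record as a dict with the same keys used in the API payload above (`type`, `name`, `content`, `proxied`).

```python
APEX = "majorshouse.com"


def audit_record(record, apex=APEX):
    """Return the checklist violations for one DNS record dict (empty list = OK)."""
    problems = []
    if record.get("type") != "CNAME":
        problems.append("not a CNAME (a literal IP will go stale)")
    elif record.get("content", "").rstrip(".") != apex:
        problems.append("CNAME does not target the apex")
    if not record.get("proxied", False):
        problems.append("Cloudflare proxy is off")
    return problems
```

Run it over every home-hosted record in the zone; a record that is deliberately an unproxied A record will be flagged, which is the cue to document why in that service's deploy notes.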
### Related
- [[majwiki-setup-and-pipeline]] — full wiki deploy pipeline; the DNS step there should reference this fix
- [[Network-Overview]] — fleet IP table

@ -0,0 +1,150 @@
---
title: "Logwatch Reports the Wrong Hostname (`<host>-hetzner`) After a Migration"
domain: troubleshooting
category: monitoring
tags: [logwatch, hostname, hetzner, migration, monitoring, provisioning, fail2ban]
status: published
created: 2026-06-12
updated: 2026-06-14
---
# Logwatch Reports the Wrong Hostname (`<host>-hetzner`) After a Migration
## Symptom
Daily Logwatch emails from a recently migrated server arrive titled with the
provisioning label instead of the real hostname:
```
Logwatch for tttpod-hetzner (Linux)
Logwatch for dcaprod-hetzner (Linux)
```
Everything else works — the report is generated, mailed, and delivered. Only the
**name in the title is wrong**, which makes reports harder to scan and breaks any
filter or rule that keys on the expected hostname.
## Cause
Logwatch titles each report with the box's **live system hostname**
(`hostnamectl --static` / `/etc/hostname`) read at runtime — it does *not* keep
its own copy of the name.
Hetzner Cloud servers are provisioned with a temporary node label as the system
hostname — `<host>-hetzner` (e.g. `tttpod-hetzner`). The migration runbook renames
the **Tailscale node** back to the bare name and sets Postfix `myhostname`, but the
**OS hostname** itself is easy to miss because nothing surfaces it day to day. It
stays `<host>-hetzner`, unnoticed, until something reads the hostname; Logwatch is
usually the first thing to do so, weeks later.
Confirm the box is actually mislabelled:
```bash
ssh root@<host> 'hostnamectl --static; cat /etc/hostname; grep 127.0.1.1 /etc/hosts'
# static: tttpod-hetzner
# /etc/hostname: tttpod-hetzner
# 127.0.1.1 tttpod-hetzner tttpod-hetzner
```
## Fix
Set the real hostname and fix the matching `/etc/hosts` loopback line:
```bash
ssh root@<host> '
hostnamectl set-hostname <host>
sed -i "s/127.0.1.1.*/127.0.1.1 <host> <host>/" /etc/hosts
hostnamectl --static # verify -> <host>
'
```
That's it. **Logwatch has no hardcoded hostname override** — verify with:
```bash
grep -ri hostname /etc/logwatch/ /etc/cron.daily/0logwatch /etc/cron.daily/logwatch 2>/dev/null
cat /etc/mailname 2>/dev/null
```
If those are empty (the normal case), Logwatch reads the live hostname on its next
run, so the **next daily report self-corrects** — no service restart, no logwatch
config change needed.
> [!note] If `grep` *does* find a hostname pinned in `/etc/logwatch/conf/logwatch.conf`
> (e.g. a `HostLimit`/`MailFrom` line baked in by Ansible), update it there too —
> the override file wins over the live hostname.
## Sweep the whole fleet
This is a per-box provisioning leftover, so check every migrated host at once —
more than one is usually affected:
```bash
for ip in 100.98.223.93 100.95.137.38 100.64.169.62 100.112.127.0 100.73.85.46; do
  echo -n "$ip -> "
  ssh -o ConnectTimeout=8 -o BatchMode=yes "root@$ip" 'hostnamectl --static' 2>/dev/null \
    || echo '(unreachable)'
done
```
Any value ending in `-hetzner` (or your provider's build label) needs the fix above.
In the 2026-06 sweep, `tttpod` and `dcaprod` were still `*-hetzner` at the OS
level; `majortoot`, `majormail`, and `majorlinux` had the correct system hostname
— but see the variant below: `majormail`'s *configs* were still stale even though
its hostname wasn't.
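
If the sweep needs to feed a report or a follow-up play, the same check can be sketched in Python. The function names are hypothetical, and the `-hetzner` suffix is only the label seen in this fleet; pass your provider's build label instead if it differs.

```python
import subprocess


def static_hostname(addr):
    """Return `hostnamectl --static` from addr, or None if the host is unreachable."""
    result = subprocess.run(
        ["ssh", "-o", "ConnectTimeout=8", "-o", "BatchMode=yes",
         f"root@{addr}", "hostnamectl --static"],
        capture_output=True, text=True)
    return result.stdout.strip() if result.returncode == 0 else None


def stale_hosts(sweep, suffix="-hetzner"):
    """Given {address: hostname or None}, list addresses still carrying the label."""
    return sorted(addr for addr, name in sweep.items()
                  if name is not None and name.strip().endswith(suffix))
```

Build the mapping with `{ip: static_hostname(ip) for ip in addresses}` and pass it to `stale_hosts`. Unreachable hosts come back as `None` and are skipped rather than reported as clean, so check them separately.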
## Variant: hostname is correct, but a config has the old name baked in
A second, sneakier form of this drift: the **system hostname is already right**, so
the sweep above passes and the Logwatch report *title* is correct — yet mail still
arrives **from** `<host>-hetzner` because the old label is hardcoded in a service's
`From`/`sender` field. These fields are static text, not derived from the live
hostname, so fixing `hostnamectl` does nothing for them.
Seen on `majormail` (2026-06-14): system hostname was `majormail`, but
`Logwatch@majormail-hetzner...` was still the sender. Two configs held it:
```bash
# sweep a box for the old provisioning label in any send-related config
ssh root@<host> 'grep -rsn "<host>-hetzner" /etc/logwatch/ /etc/fail2ban/ \
/etc/postfix/ /etc/aliases /etc/mailname 2>/dev/null'
# /etc/logwatch/conf/logwatch.conf:MailFrom = Logwatch@<host>-hetzner.majorshouse.com
# /etc/fail2ban/jail.local:sender = fail2ban@<host>-hetzner.majorshouse.com
```
Fix in place (no restart needed for Logwatch; reload fail2ban for its change):
```bash
ssh root@<host> '
sed -i "s/<host>-hetzner/<host>/g" /etc/logwatch/conf/logwatch.conf /etc/fail2ban/jail.local
systemctl reload fail2ban
'
```
> [!warning] Check the Ansible source, or it comes back
> A live `sed` is undone by the next playbook run if the repo still carries the old
> value. Distinguish two cases:
> - **Templated** (safe): e.g. `logwatch.yml` sets `MailFrom = Logwatch@{{ inventory_hostname }}...`. If the inventory host is named correctly, a run *regenerates* the right value — it even self-heals a stale box.
> - **Static file** (will regress): e.g. `roles/fail2ban/files/hosts/<host>/jail.local` with the literal `sender = ...@<host>-hetzner...`. Grep the repo (`grep -rn "<host>-hetzner" .`) and fix the file too, or every deploy re-pushes the stale sender.
Inert backups (`jail.local.bak*`, `*~`) may still contain the old string — they
don't send mail, so leave them.
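
The repo-side grep from the warning above can be sketched as a small Python walk that skips the inert backups. This is illustrative only: `find_stale_label` is a hypothetical helper, the backup patterns (`*~` and names containing `.bak`) are taken from the note above, and skipping `.git` is an added assumption.

```python
import os


def find_stale_label(root, host, label_suffix="-hetzner"):
    """Return (path, line number, line) for live files mentioning <host>-hetzner."""
    needle = f"{host}{label_suffix}"
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != ".git"]  # don't descend into .git
        for name in sorted(filenames):
            if name.endswith("~") or ".bak" in name:
                continue  # inert backups don't send mail, so they are ignored
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as handle:
                    for number, line in enumerate(handle, 1):
                        if needle in line:
                            hits.append((path, number, line.rstrip("\n")))
            except OSError:
                continue  # unreadable file: skip rather than abort the sweep
    return hits
```

Point it at the Ansible repo checkout. Any hit in a static file is one that the next deploy would push back out, so fix it at the source as well as on the box.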
## Prevention
Fold "set the system hostname" into the migration bootstrap so it never drifts:
```bash
hostnamectl set-hostname <host>
sed -i "s/127.0.1.1.*/127.0.1.1 <host> <host>/" /etc/hosts
```
Do this in the **same step** that renames the Tailscale node and sets Postfix
`myhostname` — all three read from the provisioning label and all three must be
corrected together. See the
[VPS Migration Baseline Checklist](../02-selfhosting/cloud/vps-migration-baseline-checklist.md).
## Related
- [Logwatch Fleet Setup — Surviving Package Upgrades](../02-selfhosting/monitoring/logwatch-fleet-setup.md) — the broader "logwatch went silent / wrong-source" class, including the Packer `myhostname` variant of this same drift
- [VPS Migration Baseline Checklist](../02-selfhosting/cloud/vps-migration-baseline-checklist.md) — the full post-migration verification list
- [Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration](networking/ansible-host-key-verification-failed-rebuilt-host.md) — another IP/identity-drift gotcha from the same Hetzner migration

@ -0,0 +1,154 @@
---
title: "Auditing & Cleaning macOS Background App Activity (sfltool dumpbtm)"
domain: troubleshooting
category: general
tags: [macos, background-tasks, btm, sfltool, login-items, system-extensions, uninstall, little-snitch]
status: published
created: 2026-06-21
updated: 2026-06-21
---
# Auditing & Cleaning macOS Background App Activity (`sfltool dumpbtm`)
## Overview
macOS tracks every login item, agent, daemon, helper, and extension that may run in the background in its **Background Task Management (BTM)** database. The GUI shows this under **System Settings → General → Login Items & Extensions** ("Allow in the Background"), but the GUI is summarised and hides paths, identifiers, and orphans.
`sfltool dumpbtm` prints the full BTM database from the command line — and the per-user records need **no `sudo`**. This is the fastest way to answer "what is allowed to run in the background, and does each entry still map to an installed app?"
## List what's registered
```bash
sfltool dumpbtm # per-user records, no sudo required
```
Each record looks like:
```
Name: CleanMyMac Menu
Type: login item (0x4)
Disposition: [enabled, allowed, notified] (0xb)
Identifier: 4.com.macpaw.CleanMyMac-mas.Menu
URL: Contents/Library/LoginItems/CleanMyMac_5_MAS_Menu.app
Bundle Identifier: com.macpaw.CleanMyMac-mas.Menu
Parent Identifier: 2.com.macpaw.CleanMyMac-mas
```
### Reading the fields
- **Disposition:** `enabled` = actively allowed to run in the background. `disabled` = present but off.
- **Type** — what kind of item it is:
| Type | Meaning |
|---|---|
| `app (0x2)` | A normal application entry |
| `login item (0x4)` | Launches at login (menu-bar apps, helpers) |
| `agent (0x8)` / `legacy agent` | Per-user background agent |
| `legacy daemon (0x10010)` | System-wide background daemon |
| `background tasks (0x2000)` | Abstract background-task registration owned by a parent app — **has no file path of its own** |
| `developer (0x20)` | A per-developer grouping header (the collapsible row in Settings), **not an app** |
| `quicklook` / `spotlight` / `dock tile` | Plugins/extensions — not really "background apps" |
## Map entries to installed apps (find orphans)
Two gotchas make naïve path-checking fail:
1. **Absolute paths are stored as `file://` URLs**, not plain `/…`. Strip the `file://` prefix and URL-decode (`%20` → space).
2. **Child items store a *relative* `URL`** (e.g. `Contents/Library/LoginItems/…`) that must be joined to the **parent record's** absolute path, found via `Parent Identifier`.
A small parser that resolves each record to a real path and flags true orphans:
```python
import sys, re, os, urllib.parse

items, cur = [], None

def push():
    # Store the record being built, if any.
    global cur
    if cur is not None: items.append(cur)

# Parse the dump: "#N:" starts a record, "Key: value" lines fill it.
for line in sys.stdin:
    s = line.strip()
    if re.match(r"^#\d+:$", s): push(); cur = {}; continue
    if cur is None: continue
    m = re.match(r"^([A-Za-z][A-Za-z /]+):\s*(.*)$", s)
    if m: cur[m.group(1).strip()] = m.group(2).strip()
push()

byid = {it["Identifier"]: it for it in items if it.get("Identifier")}

def abspath(it, d=0):
    # Resolve a record to an absolute path, following parents (depth-capped).
    if d > 8: return None
    u = it.get("URL", "")
    if u and u != "(null)":
        if u.startswith("file://"): return urllib.parse.unquote(u[7:]).rstrip("/")
        if u.startswith("/"): return u.rstrip("/")
        # Relative URL: join it to the parent record's absolute path.
        par = byid.get(it.get("Parent Identifier", ""))
        if par:
            b = abspath(par, d + 1)
            if b: return os.path.join(b, urllib.parse.unquote(u)).rstrip("/")
    return None

# Report records whose resolved path no longer exists on disk.
for it in items:
    if not it.get("Name"): continue
    p = abspath(it)
    if p and not os.path.exists(p):
        print("ORPHAN:", it["Name"], "->", p)
```
```bash
sfltool dumpbtm | python3 btm_check.py
```
> **Expected non-orphans:** `background tasks (0x2000)` and `developer (0x20)` rows legitimately store no path — they are not missing apps. Helpers/daemons that resolve *inside* a parent bundle (e.g. `/Applications/Foo.app/Contents/Library/LoginItems/…`) or in `/Library/…` are also fine; they just don't appear as a top-level `.app`. That is usually why an entry "has no application you can find."
## Disable background for an app
This **cannot be scripted** — Apple deliberately gates the toggle behind the GUI:
**System Settings → General → Login Items & Extensions → "Allow in the Background"** → switch the app off.
Disabling a `developer (0x20)` grouping header turns off all of that developer's sub-items at once.
## Uninstall cleanly — the system-extension trap
**Dragging an app to the Trash is not a full uninstall.** Apps that install a **network/system extension** plus a privileged daemon (firewalls and VPNs especially — Little Snitch, Mullvad, etc.) leave their `/Library` daemon **still loaded and running** after the app is trashed. The BTM entry persists and the background service keeps working.
### 1. Prefer the app's own uninstaller
- **Bundled uninstall script** (Mullvad): runs cleanly, deactivates the system extension, resets the firewall.
```bash
sudo "/Applications/Mullvad VPN.app/Contents/Resources/uninstall.sh"
```
- Some apps ship an uninstaller in their DMG or a CLI tool. **Note:** Little Snitch 6.x has **no DMG uninstaller and no `littlesnitch uninstall` subcommand** — manual removal is the supported route there.
### 2. Check whether a system extension is still active
```bash
systemextensionsctl list
```
If the app's extension is **not** listed (only unrelated ones like Tailscale/Canon remain), the extension is already deactivated and a manual file removal is now complete and safe.
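
To make that check scriptable, the listing can be filtered for a vendor's bundle ID. A minimal sketch, assuming the text output of `systemextensionsctl list` is passed in and that a live extension's row carries the `[activated enabled]` state shown in the Little Snitch example elsewhere in this wiki; `active_extensions` is a hypothetical helper name.

```python
def active_extensions(listing, bundle_prefix):
    """Rows of `systemextensionsctl list` for bundle_prefix that are still activated."""
    return [line.strip() for line in listing.splitlines()
            if bundle_prefix in line and "[activated enabled]" in line]
```

For example, `active_extensions(output, "at.obdev.littlesnitch")` returning an empty list suggests the extension is no longer active. Other state strings are not interpreted here, so read the raw listing if the result is surprising.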
### 3. Manual removal (when no uninstaller exists)
Find every component first:
```bash
ls /Library/LaunchDaemons/<id>* /Library/LaunchAgents/<id>* 2>/dev/null
ls -d "/Library/Application Support/<Vendor>" 2>/dev/null
ls ~/Library/Preferences/<id>* 2>/dev/null
```
Then boot out the daemon and remove the files:
```bash
sudo launchctl bootout system /Library/LaunchDaemons/<id>.daemon.plist 2>/dev/null
sudo rm -f /Library/LaunchDaemons/<id>.daemon.plist /Library/LaunchAgents/<id>.agent.plist
sudo rm -rf "/Library/Application Support/<Vendor>" "$HOME/.Trash/<App>.app"
rm -f ~/Library/Preferences/<id>*.plist # user-owned, no sudo
```
> **Shared-container caution:** before deleting `~/Library/Group Containers/*`, check it isn't shared. Microsoft apps share `UBF8T346G9.com.microsoft.oneauth`, `…entrabroker`, and `…teams` across Office/Teams/RDP — delete only the app-specific container (e.g. `…com.microsoft.rdc`), never the shared auth ones.
## Stale BTM "ghost" entries
After a manual uninstall, `sfltool dumpbtm` may still list the removed app, pointing at now-deleted paths. These are harmless orphans (nothing left to load). **BTM reconciles them on the next reboot / login cycle** — a reboot also finalises any system-extension teardown.
## Quick reference
```bash
sfltool dumpbtm # full per-user BTM dump (no sudo)
sfltool dumpbtm | grep -A6 'Name:' # browse records
systemextensionsctl list # active network/system extensions
# Verify a removal:
sfltool dumpbtm | grep -i <vendor> # should be empty after a reboot
```
## See also
- Apple gates "Allow in the Background" behind System Settings — there is no supported CLI toggle for BTM dispositions.
- For VPN/firewall apps, always reach for the vendor uninstaller first; manual `rm` alone can leave a registered system extension behind.

@ -0,0 +1,94 @@
---
title: "Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration"
domain: troubleshooting
category: networking
tags: [ansible, ssh, known-hosts, tailscale, host-key, migration]
status: published
created: 2026-06-12
updated: 2026-06-12
---
# Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration
## Symptom
A subset of hosts in an Ansible run fail at **Gathering Facts** while the rest succeed:
```
[ERROR]: Task failed: Data could not be sent to remote host "100.112.127.0".
Make sure this host can be reached over ssh: Host key verification failed.
fatal: [majormail]: UNREACHABLE! => {"unreachable": true, ...}
```
The failing hosts are exactly the ones that were recently **rebuilt or migrated** (new server, new OS install, or a cloud move that issued a new Tailscale IP). Hosts that were never rebuilt connect fine.
Confusingly, **interactive `ssh root@<host>` works perfectly** for the same boxes — only Ansible fails.
## Cause
SSH stores each accepted host key in `~/.ssh/known_hosts` keyed by the **exact address you connected with**. A key accepted for `ssh root@tttpod` is saved under the hostname `tttpod`; it is *not* indexed under that node's IP.
Ansible inventories almost always set `ansible_host` to a **literal IP** (here, the Tailscale `100.x.x.x` address). So Ansible's SSH lookup is by IP, finds no matching entry, and with `StrictHostKeyChecking=yes` (or `accept-new` already exhausted) it refuses the connection:
```
No ED25519 host key is known for 100.112.127.0 and you have requested strict checking.
Host key verification failed.
```
The hostname-form and IP-form entries are independent. Fixing interactive SSH (e.g. converting aliases to MagicDNS names and re-accepting keys) does **nothing** for Ansible, because Ansible never uses the hostname.
A rebuilt host also generates **brand-new host keys**, so any old IP-form entry would additionally be a mismatch — but the common case after a migration to a *new* IP is simply that no IP entry exists at all.
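
The hostname-form versus IP-form split can be illustrated with a short Python sketch that lists which inventory addresses have no `known_hosts` entry. It is a simplification under stated assumptions: names are compared literally, so hashed entries (`|1|…`), wildcard patterns, and `[host]:port` forms are not matched, and `ssh-keygen -F` remains the authoritative lookup. Both function names are hypothetical.

```python
def known_names(known_hosts_text):
    """Collect the plain-text host names/addresses that have a known_hosts entry."""
    names = set()
    for line in known_hosts_text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        if fields[0].startswith("@"):      # @cert-authority / @revoked marker
            fields = fields[1:]
        if not fields or fields[0].startswith("|"):
            continue                       # hashed entry: cannot be read back
        names.update(fields[0].split(","))
    return names


def missing_entries(known_hosts_text, inventory_addresses):
    """Inventory addresses (e.g. ansible_host IPs) with no literal entry."""
    known = known_names(known_hosts_text)
    return [addr for addr in inventory_addresses if addr not in known]
```

A key saved under `tttpod` does not satisfy a lookup for that node's `100.x.x.x` address, so the IP shows up in `missing_entries` even though interactive SSH by name works.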
## Diagnosis
```bash
# 1. Is there any known_hosts entry for the failing IP? (0 = none)
ssh-keygen -F 100.112.127.0
# 2. Reproduce the exact failure without an interactive prompt:
ssh -o BatchMode=yes -o StrictHostKeyChecking=yes root@100.112.127.0 true
# -> "Host key verification failed." confirms the gap
# 3. Confirm the inventory IP is actually the host's CURRENT address
# (guards against stale-IP drift, a separate problem):
tailscale status | grep majormail
ssh-keyscan -t ed25519 100.112.127.0 | ssh-keygen -lf - # fingerprint it
```
If step 3 shows the inventory IP matches the live Tailscale node and the box answers `ssh-keyscan`, the only problem is the missing IP-form key.
## Fix
Add the **IP-form** host keys to the `known_hosts` of the user that runs Ansible. Back up first, scan over the tailnet, de-dup:
```bash
cp ~/.ssh/known_hosts ~/.ssh/known_hosts.bak.$(date +%Y%m%d)
for ip in 100.98.223.93 100.112.127.0 100.73.85.46 100.95.137.38 100.76.51.16 100.64.169.62; do
  ssh-keyscan -T 5 -t rsa,ecdsa,ed25519 "$ip" >> ~/.ssh/known_hosts
done
sort -u ~/.ssh/known_hosts -o ~/.ssh/known_hosts
```
Verify before re-running the playbook:
```bash
ansible <hosts> -m ping # expect "pong" from each
```
### Why `ssh-keyscan` is safe here
`ssh-keyscan` trusts whatever answers on the wire — normally a MITM risk. Over **Tailscale**, the connection rides WireGuard, which cryptographically authenticates the peer by its tailnet identity: reaching `100.x.x.x` *guarantees* you are talking to the node that owns that tailnet address. Scanning and trusting the key over the tailnet is therefore as trustworthy as the tailnet itself. Always cross-check the IP against `tailscale status` first (step 3) so you scan the right node.
## Prevention
- **Per-workstation, not fleet-wide.** `known_hosts` is local to each machine + user. After a migration, *every* host that runs Ansible (each workstation, plus any control node like `majorlab`) needs the IP keys added independently. Adding them on one Mac does not help the others.
- **Sweep on every migration phase.** A rolling migration changes one node's IP at a time; fold the keyscan above into the post-cutover checklist so Ansible never breaks mid-rollout.
- **Alternative: disable host-key checking.** Setting `host_key_checking = False` in `ansible.cfg` (or `ANSIBLE_HOST_KEY_CHECKING=False`) sidesteps the prompt but trades away host-key verification entirely. OpenSSH's `StrictHostKeyChecking=accept-new` is a narrower middle ground: it records keys for unknown hosts automatically and still refuses a changed key. Prefer the explicit keyscan: it keeps strict checking on for every *future* run while accepting the new key exactly once, under your control.
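To fold the sweep into a post-cutover checklist, it helps to list which inventory IPs still lack an entry before scanning anything. A rough sketch, assuming plain-text (unhashed) `known_hosts` entries such as the ones `ssh-keyscan` writes without `-H`; `missing_known_hosts` is a hypothetical helper, and `ssh-keygen -F` remains the authoritative check:

```bash
# Print each IP (arguments 2..n) that has no plain-text entry in the
# known_hosts file given as argument 1. Hashed entries are not recognised.
missing_known_hosts() {
  local file="$1"; shift
  local ip
  for ip in "$@"; do
    awk -v ip="$ip" '
      /^#/ || NF < 3 { next }                     # skip comments and short lines
      {
        n = split($1, names, ",")                 # first field: comma-separated names
        for (i = 1; i <= n; i++) if (names[i] == ip) found = 1
      }
      END { exit (found ? 0 : 1) }
    ' "$file" || printf '%s\n' "$ip"
  done
}
```

`missing_known_hosts ~/.ssh/known_hosts 100.98.223.93 100.112.127.0` prints only the addresses that still need a keyscan.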
## Related
- SSH-Aliases — Fleet SSH access; the MagicDNS-vs-pinned-IP strategy and the Ansible-by-IP `known_hosts` note
- Network Overview — Tailscale fleet inventory and current IPs
- Hetzner-Migration-Status — the migration that triggered the fleet-wide IP churn
- [[ssh-socket-tailscale-race-condition]] — a different "SSH unreachable after reboot" failure mode

@ -0,0 +1,105 @@
---
title: "Dovecot IMAP Clients Fail to Sync: vsz_limit OOM from a Bloated Index Log"
domain: troubleshooting
category: networking
tags: [dovecot, imap, oom, vsz_limit, index, maildir, fedora, mail]
status: published
created: 2026-06-05
updated: 2026-06-05
---
# Dovecot IMAP Clients Fail to Sync: vsz_limit OOM from a Bloated Index Log
All IMAP clients fail to connect or hang while syncing a particular folder, even though the box has plenty of free RAM and disk. The cause is a corrupt/bloated per-folder `dovecot.index.log` that overflows Dovecot's **per-process** virtual-memory cap (`default_vsz_limit`, 256 MB by default) when it is `mmap`ed — so the IMAP child is killed on every sync attempt.
> First seen on **majormail** (Fedora 44, Dovecot 2.4.4), 2026-06-05. An empty `.Later` folder had a 152 MB `dovecot.index.log`.
## Symptoms
- Multiple/all IMAP clients can't connect, or connect but never finish syncing.
- Often only **one folder** is the trigger — the client hangs the moment it opens/syncs that folder.
- The server is otherwise healthy: Postfix delivering, Dovecot `active`, ports listening, TLS valid.
- `free -h` shows the host has plenty of RAM available — this is **not** a host-level OOM.
## Log Signature
`journalctl -u dovecot` shows, per affected user/folder:
```
imap(user@dom): Fatal: block_alloc(8388608): Out of memory
imap(user@dom): Fatal: master: service(imap): child NNN returned error 83
(Out of memory (service imap { vsz_limit=256 MB }, you may need to increase it) ...)
imap(user@dom): Error: Mailbox X: mmap(size=158769660) failed ...: Cannot allocate memory
imap(user@dom): Error: Mailbox X: Failed to map transaction log .../dovecot.index.log
at sync_offset=N after locking: Beginning of the log isn't available
```
The two tells: **`error 83` naming `vsz_limit`** (Dovecot literally suggests raising it), and an **`mmap(size=…)` value that is huge relative to the folder's real contents**.
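To judge the second tell quickly, the byte count can be converted to readable units. A small sketch that assumes the log lines have the shape shown above; `mmap_failures_mib` is a hypothetical name:

```bash
# Read Dovecot log lines on stdin; for each failed mmap, print the mailbox
# name and the attempted mapping size in whole MiB.
mmap_failures_mib() {
  sed -n 's/.*Mailbox \(.*\): mmap(size=\([0-9][0-9]*\)) failed.*/\2 \1/p' |
    while read -r bytes mailbox; do
      echo "$mailbox: $((bytes / 1048576)) MiB"
    done
}
```

Feed it the journal, for example `journalctl -u dovecot --since "-3h" | mmap_failures_mib`. For the sample line above it prints `X: 151 MiB`.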
## Why It Happens
Each Maildir folder has its own `dovecot.index.log` transaction log. If it grows or corrupts to tens/hundreds of MB (here: 152 MB on a folder with **zero** messages), Dovecot tries to `mmap` the whole thing into the IMAP worker. That worker runs under `default_vsz_limit` (compiled default **256 MB**). The mapping blows the cap, the kernel refuses the allocation, and the child dies with `error 83`. Because every client re-syncs that folder on connect, it fails for **all** of them at once.
Key point: the limit is **per-process virtual size**, not host memory. A box with 2.5 GB free RAM still hits it.
## Diagnosis
```bash
# 1. The smoking gun — OOM / error 83 mentioning vsz_limit
journalctl -u dovecot --since "-3h" | grep -iE "out of memory|error 83|vsz_limit"
# 2. Confirm it is NOT a host OOM (expect plenty free)
free -h ; df -h /var/vmail
# 3. Current per-process cap (256 M = compiled default, no explicit setting)
doveconf default_vsz_limit
# 4. Find the bloated index — size wildly out of proportion to message count
du -sh /var/vmail/<domain>/<user>/.<Folder>
ls -lh /var/vmail/<domain>/<user>/.<Folder>/dovecot.index*
ls -1 /var/vmail/<domain>/<user>/.<Folder>/{cur,new} | wc -l # real message count
```
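Step 4 checks one folder at a time. When the offending folder is not yet known, a sweep of the whole mail store can shortlist candidates. A sketch under the assumption that the store lives under a single root such as `/var/vmail`; the 50 MiB default threshold is an arbitrary choice, not a Dovecot value:

```bash
# List every dovecot.index.log under a mail store root that is larger than
# a threshold. $1 = root directory, $2 = threshold in MiB (default 50).
bloated_index_logs() {
  find "$1" -type f -name dovecot.index.log -size +"$(( ${2:-50} * 1024 ))"k
}
```

Each path it prints is worth comparing against the folder's real message count, as in step 4.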
## Fix
Two parts: raise the cap, and repair the bloated index.
```bash
# (1) Raise default_vsz_limit. Flat Fedora dovecot.conf has no !include conf.d/*,
# so add it at top-level scope (after `protocols = ...`):
# default_vsz_limit = 1G
doveconf -n >/dev/null && echo CONFIG_OK # validate
systemctl restart dovecot # required to apply the new vsz
doveconf default_vsz_limit # -> 1G
# (2a) Rebuild the index from the real messages
doveadm force-resync -u <user@dom> <Folder>
# (2b) If force-resync leaves a stale multi-MB index.log AND the folder has
# 0 message files, it is safe to delete the index files and let Dovecot
# regenerate them clean (152 M -> 24 K in the original case):
L='/var/vmail/<domain>/<user>/.<Folder>'   # quoted: folder names may contain spaces
rm -f "$L"/dovecot.index "$L"/dovecot.index.log "$L"/dovecot.index.cache "$L"/dovecot.index.backup
doveadm mailbox status -u <user@dom> "messages vsize" <Folder> # regenerates
```
Verify: `journalctl -u dovecot --since "-2m" | grep -ic "out of memory"` returns `0`, and the folder reads without error.
> **Only delete index files when the folder's `cur/` and `new/` are empty** (or you are certain the messages are intact). The index is rebuildable from the message files; deleting indexes never deletes mail, but verify the count first.
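
The empty-folder precondition can be turned into a guard that runs before any `rm`. A minimal sketch with hypothetical helper names; it only counts files in `cur/` and `new/` and treats a path without those directories as unsafe:

```bash
# Count message files in a Maildir folder's cur/ and new/ directories.
maildir_message_count() {
  find "$1/cur" "$1/new" -type f 2>/dev/null | wc -l | tr -d ' '
}

# Succeed only when the path is a Maildir folder that holds no messages.
index_delete_is_safe() {
  [ -d "$1/cur" ] && [ -d "$1/new" ] &&
    [ "$(maildir_message_count "$1")" -eq 0 ]
}
```

A call such as `index_delete_is_safe "$L" && rm -f "$L"/dovecot.index*` then refuses to touch a folder that still holds mail or a mistyped path.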
## Codified
majormail's role sets this permanently so the cap survives a config rebuild:
`roles/majormail/templates/dovecot.conf.j2` sets `default_vsz_limit = 1G` (MajorAnsible commit `a69ac5d`).
## Key Notes
- **`error 83` = vsz, not host RAM.** Don't go chasing free memory — read the parenthetical in the error; Dovecot names the exact setting.
- **A huge index on a tiny/empty folder is the corruption,** not the messages. Resync, and truncate the index if the folder is empty.
- **`tcpdump` may not be installed** on a minimal Fedora mail host — don't conclude "no packets arrived" from an empty capture without confirming the tool exists (`which tcpdump`).
- 1 G is comfortable headroom for large mailboxes; raise it further only if a genuinely large single mailbox needs it.
## Related
- [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](fail2ban-imap-self-ban-mail-client.md)
- [firewalld: Mail Ports Wiped After Reload](firewalld-mail-ports-reset.md)
- [SELinux: Dovecot vmail Context](../selinux-dovecot-vmail-context.md)

@ -0,0 +1,111 @@
---
title: "Dovecot Phantom Mailboxes from .dovecot.lda-dupes (mail_home Overlapping the Maildir Root)"
domain: troubleshooting
category: networking
tags: [dovecot, maildir, mail_home, sieve, lda-dupes, duplicate-database, pigeonhole, phantom-mailbox]
status: published
created: 2026-06-07
updated: 2026-06-07
---
# Dovecot Phantom Mailboxes from `.dovecot.lda-dupes` (mail_home Overlapping the Maildir Root)
Dovecot starts logging errors like this on mailbox LIST, and `doveadm mailbox list` grows phantom mailboxes named after Dovecot's own control files:
```
imap(user@example.com): Error: maildir: stat(/var/vmail/example.com/user/.dovecot.lda-dupes/tmp) failed: Not a directory
```
```
$ doveadm mailbox list -u user@example.com
INBOX
dovecot
dovecot.lda-dupes
dovecot.lda-dupes.locks
```
> Hit on **majormail** (2026-06-07), the day after switching the global spam Sieve to `redirect`. Mail delivery was unaffected — purely log noise plus phantom folders a client could see on `LIST "*"`.
## Why
The LDA/Sieve **duplicate database** (`.dovecot.lda-dupes`, plus a `.dovecot.lda-dupes.locks` lock dir) is created in the user's **home** directory. Per the Dovecot maintainer, its location strictly follows the user's home — it is *not* separately configurable.
If `mail_home` (the userdb `home` field) is set equal to the **maildir root** (`mail_path`), those control files get written *inside* the mail store:
```
mail_path = /var/vmail/%{user|domain}/%{user|username} # maildir root
userdb static { fields { home = /var/vmail/%{user|domain}/%{user|username} } } # SAME path — the bug
```
The maildir++ layout treats every `.`-prefixed entry in the root as a mailbox folder. So:
- `.dovecot.lda-dupes` (a **file**) → lister stats `.dovecot.lda-dupes/tmp` → **"Not a directory"** (cosmetic, logged every LIST).
- `.dovecot.lda-dupes.locks` (a **directory**) → opened as a maildir, auto-populated with `cur/`, `new/`, `tmp/`, `dovecot-uidlist`, and `dovecot.index.log`, and exposed as a real phantom mailbox.
The trigger is anything that exercises duplicate tracking — Sieve `redirect` (loop-guard), `vacation`, or the `duplicate` test. A pure `fileinto` setup never creates the db, which is why the error can appear suddenly after a Sieve change.
## How to confirm
```bash
# Phantom mailboxes named after the control files:
doveadm mailbox list -u user@example.com | grep -E '^dovecot'
# Is home the SAME as the maildir root? (the root cause)
doveadm user user@example.com | grep -E 'home|mail_path'
# home /var/vmail/example.com/user <- equals mail_path == bug
# mail_path /var/vmail/example.com/user
# The offending control files living inside the maildir root:
ls -la /var/vmail/example.com/user/.dovecot.lda-dupes*
# -rw------- … .dovecot.lda-dupes (regular file — the dedup db)
# drwx------ … .dovecot.lda-dupes.locks (dir — the lock dir, mis-listed)
```
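The home-versus-root comparison can be automated across users. A sketch that assumes `doveadm user` prints one `field value` pair per line, as in the output above, and that paths contain no spaces; `home_overlaps_maildir` is a hypothetical name:

```bash
# Read `doveadm user <user>` output on stdin and report whether the home
# directory is the same path as the maildir root.
# Prints OVERLAP (exit 1), OK (exit 0), or UNKNOWN (exit 2).
home_overlaps_maildir() {
  awk '
    $1 == "home"      { home = $2 }
    $1 == "mail_path" { mail = $2 }
    END {
      if (home == "" || mail == "") { print "UNKNOWN"; exit 2 }
      sub(/\/+$/, "", home); sub(/\/+$/, "", mail)   # ignore trailing slashes
      if (home == mail) { print "OVERLAP"; exit 1 }
      print "OK"
    }'
}
```

Run it as `doveadm user user@example.com | home_overlaps_maildir`; any user reported as `OVERLAP` will grow the phantom mailboxes once duplicate tracking is exercised.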
## Fix
Point `home` at a path **separate from the maildir root**. The cleanest low-risk option is a **non-dotted subdir** of the user dir, so `mail_path` stays put and **no mail migration** is needed (a dotted name would just become another phantom folder):
```diff
userdb static {
fields {
uid = vmail
gid = vmail
- home = /var/vmail/%{user|domain}/%{user|username}
+ home = /var/vmail/%{user|domain}/%{user|username}/home
}
}
```
Then deploy and clean up the stale artifacts:
```bash
# 1. Deploy the config change, restart/reload Dovecot.
# 2. Confirm home moved:
doveadm user user@example.com | grep home # -> /var/vmail/example.com/user/home
# 3. Remove the stale dupe-db + the cached list index from the maildir root
# (all regenerable):
cd /var/vmail/example.com/user/
rm -rf .dovecot.lda-dupes .dovecot.lda-dupes.locks dovecot.list.index dovecot.list.index.log
# 4. Pre-create the new home (so the first dupe-db write can't fail):
install -d -o vmail -g vmail -m 700 /var/vmail/example.com/user/home
# 5. Verify:
doveadm mailbox list -u user@example.com | grep -E '^dovecot' || echo CLEAN
```
The duplicate db now regenerates under `…/user/home/`, where the maildir lister never looks.
## Gotchas
- **`mail_home` follows userdb.** A userdb-returned `home` field overrides the global `mail_home` setting, so fix it where userdb defines it (here, `userdb static { fields { home = … } }`).
- **What else keys off `~`:** personal Sieve (`~/.dovecot.sieve`, `~/sieve`), `mail_attribute_dict`, and some quota backends. Before moving home, confirm none of those hold live data in the old location (`ls -a` the maildir root). A *global* spam Sieve at a fixed path (`/etc/dovecot/sieve/global/…`) is unaffected.
- **Indexes** default to `mail_path`, not home, so moving home doesn't touch `dovecot.index*`.
- **Don't trust a local-injection test** to exercise Sieve `redirect`: Postfix `cleanup` header_checks may intercept it first, and `dovecot-lda` may not apply the same before-script as LMTP. Verify the relocation at the authoritative level (`doveadm user` home), since the db location is home-relative by design.
## Related
- [[postfix-header-checks-vs-milter-headers]] — the spam-routing migration that introduced the Sieve `redirect` (and thus the dupe db) on majormail.
- Upstream: Dovecot mailing-list thread "Change location where .dovecot.lda-dupes* file/dir are created" — maintainer confirms the db follows the user's home.

@ -5,7 +5,7 @@ category: networking
tags: [fail2ban, imap, dovecot, email, self-ban]
status: published
created: 2026-04-02
updated: 2026-04-02
updated: 2026-06-05
---
# Mail Client Stops Receiving: Fail2ban IMAP Self-Ban
@ -79,6 +79,21 @@ fail2ban-client set dovecot-invalid unbanip <IP>
Mail should resume immediately without restarting any services.
### Permanent fix — whitelist the trusted IP (`ignoreip`)
Unbanning is temporary: if the client keeps failing auth (wrong password, stale token), the same IP gets re-banned within minutes. For a **known, trusted network** (e.g. your home egress IP) add it to Fail2ban's `ignoreip` so it can never be banned:
```bash
# /etc/fail2ban/jail.local, [DEFAULT] section (applies to ALL jails).
# This is a config line to add to that file, not a shell command:
#   ignoreip = 127.0.0.1/8 ::1 100.64.0.0/10 <home_ip>
fail2ban-client reload
fail2ban-client get postfix-sasl ignoreip # confirm the IP is listed
```
On majormail this is codified via `fail2ban_ignoreip` in `host_vars/majormail-hetzner/vars.yml` (MajorAnsible commit `fa91fe3`).
> ⚠️ The address whitelisted here is a **public egress** IP, which may be dynamic. If your ISP reassigns it, the whitelist points at a stale address and bans can return, so recheck the egress IP first. Use a subnet only if you trust the whole range.
---
## 🔁 Why This Happens

@ -5,7 +5,7 @@ category: networking
tags: [firewalld, mail, imap, fedora, ports]
status: published
created: 2026-04-02
updated: 2026-04-02
updated: 2026-06-05
---
# firewalld: Mail Ports Wiped After Reload (IMAP + Webmail Outage)
@ -66,8 +66,24 @@ Expected output:
dhcpv6-client http https imap imaps mdns smtp smtp-submission smtps ssh
```
## Variant: One port (587) fails while the rest work — service never added
A subtler version of this: IMAP (993) and implicit-TLS submission (465) work fine, but **only STARTTLS submission on 587 fails** — clients on 587 get "no route to host." This is **not** a reload wipe; the `submission` service was simply never added during initial setup (the box's mail ports were opened by hand and one was missed).
```bash
# Each mail service, individually — submission will be the odd one out
for s in smtp smtps submission imap imaps; do printf "%-12s " "$s"; firewall-cmd --query-service=$s; done
# Fix (Fedora 44 / firewalld names the 587 service `submission`, NOT `smtp-submission`)
firewall-cmd --permanent --zone=public --add-service=submission
firewall-cmd --reload
```
> On majormail the full mail-service set is now managed declaratively in `roles/majormail/tasks/postfix.yml` (smtp/smtps/**submission**/imap/imaps), so a hand-edit can't leave 587 behind again (MajorAnsible commit `b75f14a`). Seen 2026-06-05.
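
The per-service query loop can also be expressed as a diff between what the zone has and what mail needs. A small sketch; `missing_services` is a hypothetical helper, and the required list is whatever service names `firewall-cmd --get-services` reports on the host:

```bash
# Print every required service that is absent from a space-separated list,
# such as the output of `firewall-cmd --list-services --zone=public`.
# $1 = enabled services, remaining arguments = required services.
missing_services() {
  local enabled=" $1 "; shift
  local svc
  for svc in "$@"; do
    case "$enabled" in
      *" $svc "*) ;;                     # present
      *) printf '%s\n' "$svc" ;;         # absent
    esac
  done
}
```

For example, `missing_services "$(firewall-cmd --list-services --zone=public)" smtp smtps submission imap imaps` prints nothing when the full set is in place.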
## Key Notes
- **Service name differs by distro/version:** the 587 service is `submission` on current Fedora firewalld; older/other docs may say `smtp-submission`. Verify with `firewall-cmd --get-services | tr ' ' '\n' | grep submission`.
- **Always use `--permanent`** when adding services to firewalld on a server. Without it, the rule exists only until the next reload.
- **Fail2ban + firewalld**: Fail2ban uses firewalld as its ban backend (`firewallcmd-rich-rules`). When Fail2ban restarts or crashes, it may trigger a `firewall-cmd --reload`, resetting any runtime-only rules.
- **Verify after any firewall event**: After Fail2ban restarts, system reboots, or `firewall-cmd --reload`, always confirm mail services are still present with `firewall-cmd --list-services --zone=public`.
@ -77,3 +93,4 @@ dhcpv6-client http https imap imaps mdns smtp smtp-submission smtps ssh
- [Linux Server Hardening Checklist](../../02-selfhosting/security/linux-server-hardening-checklist.md)
- [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](fail2ban-imap-self-ban-mail-client.md)
- [Dovecot IMAP Clients Fail to Sync: vsz_limit OOM from a Bloated Index Log](dovecot-imap-oom-vsz-limit-bloated-index.md)

@ -0,0 +1,72 @@
---
title: "Postfix header_checks Can't Act on Milter-Added Headers (Use Sieve)"
domain: troubleshooting
category: networking
tags: [postfix, milter, header_checks, spamassassin, spamass-milter, dovecot, sieve, spam]
status: published
created: 2026-06-06
updated: 2026-06-06
---
# Postfix header_checks Can't Act on Milter-Added Headers (Use Sieve)
A Postfix `header_checks` rule that keys on a header added by a **milter** (e.g. `X-Spam-Flag: YES` from `spamass-milter`/`rspamd`/`opendkim`) appears correct, is wired up, and even fires for test mail — yet silently does nothing for real inbound mail. The cause: `header_checks` run in the `cleanup` daemon and **do not reliably see headers a milter adds**, so a rule like:
```
/^X-Spam-Flag:[[:space:]]+YES/ REDIRECT junk@example.com
```
never matches genuine inbound spam, even though the delivered message clearly contains `X-Spam-Flag: YES`.
> Hit on **majormail** (2026-06-06): spam-routing REDIRECT had been dead since it was deployed — spam kept reaching the inbox.
## Why
Milter header modifications and `header_checks` happen at different stages of `cleanup`, and `header_checks` evaluate the message as received from the network, **before** the milter's header additions are folded in. So for an `smtpd_milter`-tagged message, the flag header is not visible to `header_checks` at the time they run.
Confusingly, **locally-injected test mail can fire the rule** (timing/origin differences) — so a quick `swaks`/`smtplib` test to `localhost:25` "passes" while real inbound mail silently slips through. Don't trust a local-injection test for this; verify against real inbound mail (or with the method below).
## How to confirm
```bash
# A delivered message that SHOULD have matched — but wasn't acted on:
grep -iE '^(X-Spam-Flag|Delivered-To|Subject):' /var/vmail/<dom>/<user>/.Junk/cur/<msg>
# X-Spam-Flag: YES Delivered-To: <user>@… (i.e. NOT redirected)
# Is the spam scanner an smtpd milter? (then header_checks can't see its headers)
postconf smtpd_milters
# smtpd_milters = … unix:/run/spamass-milter/spamass-milter.sock
# maillog: the header_checks REDIRECT never appears for real inbound spam,
# only (if at all) for locally-submitted mail ("redirect: … from local").
grep -i 'redirect:' /var/log/maillog
```
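A body can quote `X-Spam-Flag: YES` without the message being flagged, so a check should look only at the header section. A small sketch; `has_spam_flag` is a hypothetical helper, not part of Postfix or Dovecot:

```bash
# Succeed if the message's header section (everything before the first
# blank line) carries "X-Spam-Flag: YES"; body text is ignored.
has_spam_flag() {
  awk '
    /^\r?$/ { exit }                                          # end of headers
    tolower($0) ~ /^x-spam-flag:[ \t]*yes/ { found = 1; exit }
    END { exit (found ? 0 : 1) }
  ' "$1"
}
```

Looping it over an inbox, for example `for m in /var/vmail/<dom>/<user>/cur/*; do has_spam_flag "$m" && echo "$m"; done`, lists flagged messages that were delivered without being redirected.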
## Fix — act at delivery time, in Sieve
Dovecot **Sieve** runs at LMTP delivery, *after* the milter, so it reliably sees milter-added headers. Do the routing there instead of in `header_checks`. To keep spam out of the real mailbox entirely (so a push client like Spark never sees it), `redirect` to a dedicated account rather than `fileinto Junk`:
```sieve
require ["envelope"];
if header :contains "X-Spam-Flag" "YES" {
# Loop guard: a global before-script also runs for junk@'s own delivery.
if envelope :is "to" "junk@example.com" {
keep;
stop;
}
redirect "junk@example.com";
stop;
}
```
On majormail this is the global before-script `roles/majormail/templates/spam-to-junk.sieve.j2` (MajorAnsible `07dab90`); `redirect` cancels the implicit keep so the real mailbox stays clean (INBOX *and* Junk). Verify deterministically with `sieve-test -u <user> -r <recipient> <script> <msg.eml>` — it prints the resulting actions.
## Key Notes
- **Don't use `header_checks` for milter-added headers.** Options that *do* see them: Sieve at delivery (simplest), or running the scanner as a re-injecting `content_filter` (the re-injected message has the flag as a real header). spamass-milter cannot rewrite the envelope recipient itself.
- **`redirect` re-injects via the MTA** — if a *global* before-script does the redirect, it also runs for the destination mailbox's delivery; guard with an `envelope :is "to"` check or you get a mail loop.
- **Local-injection tests lie here.** A `localhost:25` test may fire a header_checks rule that real inbound mail never triggers.
## Related
- [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](fail2ban-imap-self-ban-mail-client.md)
- [Dovecot IMAP Clients Fail to Sync: vsz_limit OOM from a Bloated Index Log](dovecot-imap-oom-vsz-limit-bloated-index.md)

@ -0,0 +1,85 @@
# Postfix + SendGrid: TLS Handshake Failure (Port 465 vs 587)
## Symptom
Outbound mail silently queues with no delivery. `postqueue -p` shows deferred messages:
```
(Cannot start TLS: handshake failure)
```
`/var/log/maillog` shows:
```
SSL_connect error to smtp.sendgrid.net[...]:465: -1
warning: TLS library problem: error:0A00010B:SSL routines::wrong version number
```
Or on port 587:
```
warning: TLS library problem: error:0A0000C1:SSL routines::no shared cipher
```
## Root Cause
Port **465** (SMTPS) uses **implicit TLS** — the connection starts encrypted immediately. Port **587** (submission) uses **STARTTLS** — the connection starts plaintext, then upgrades.
Postfix has two settings that must match the port:
| Port | `smtp_tls_wrappermode` | `smtp_tls_security_level` |
|------|------------------------|---------------------------|
| 465 | `yes` | `encrypt` |
| 587 | `no` | `encrypt` (or `may`) |
If `smtp_tls_wrappermode=yes` is set with port 587, Postfix sends a TLS ClientHello immediately but the server expects a plaintext SMTP greeting first — `wrong version number`.
If `smtp_tls_wrappermode=no` is set with port 465, Postfix waits for a plaintext SMTP greeting while the server waits for a TLS ClientHello, so the session stalls and ends in a timeout, a connection reset, or a handshake error such as `no shared cipher`.
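The pairing table can be checked mechanically before restarting Postfix. A minimal sketch covering only the two ports discussed here; `check_tls_pairing` is a hypothetical helper:

```bash
# Report whether a relayhost port and smtp_tls_wrappermode value agree.
# $1 = relayhost value, e.g. "[smtp.sendgrid.net]:587"; $2 = yes|no.
check_tls_pairing() {
  local port="${1##*:}" wrapper="$2"     # strip everything up to the last colon
  case "$port:$wrapper" in
    465:yes|587:no) echo "OK" ;;
    465:no|587:yes) echo "MISMATCH" ;;
    *)              echo "UNKNOWN" ;;
  esac
}
```

On a live host the inputs can come from `postconf -h`, as in `check_tls_pairing "$(postconf -h relayhost)" "$(postconf -h smtp_tls_wrappermode)"`.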
## Fix
Use port 587 + STARTTLS (recommended — more widely supported and debuggable):
```bash
postconf -e 'relayhost = [smtp.sendgrid.net]:587'
postconf -e 'smtp_tls_wrappermode = no'
postconf -e 'smtp_tls_security_level = encrypt'
systemctl restart postfix
postqueue -f # flush stuck messages
```
## Verify
```bash
# Check config
postconf relayhost smtp_tls_wrappermode smtp_tls_security_level
# Test TLS connection manually
openssl s_client -starttls smtp -connect smtp.sendgrid.net:587 -brief
# Watch delivery
tail -f /var/log/maillog | grep status=
```
Successful delivery looks like:
```
Untrusted TLS connection established to smtp.sendgrid.net[...]:587: TLSv1.3 with cipher TLS_AES_128_GCM_SHA256
status=sent (250 Ok: queued as ...)
```
## Why "Untrusted"?
If `smtp_tls_CAfile` and `smtp_tls_CApath` are both empty, Postfix can't verify the server certificate and logs "Untrusted TLS connection." The connection is still encrypted — just not authenticated. To fix, point to the system CA bundle:
```bash
postconf -e 'smtp_tls_CAfile = /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem' # Fedora
# or
postconf -e 'smtp_tls_CAfile = /etc/ssl/certs/ca-certificates.crt' # Ubuntu/Debian
```
## Notes
- OpenSSL 3.x is stricter about protocol mismatches than OpenSSL 1.1 — a config that worked on older distros may break after an OS upgrade.
- SendGrid supports both ports, but port 587 + STARTTLS is the documented recommendation.
- This applies to any SMTP relay (Mailgun, AWS SES, etc.), not just SendGrid — the port/wrappermode pairing is universal.

@ -0,0 +1,133 @@
---
title: "SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure (No `Host` Block)"
domain: selfhosting
category: troubleshooting
tags:
- ssh
- ssh-config
- tailscale
- magicdns
- known-hosts
- host-key
- troubleshooting
status: published
created: 2026-06-11
updated: 2026-06-12
---
# SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure (No `Host` Block)
## The Problem
You `ssh` to a host you've reached many times before, but now it dies before any
auth happens:
```
$ ssh MyMac
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Host key verification failed.
```
On a headless box (WSL, a server, a CI runner) there's no askpass binary, so the
prompt can't even be shown — SSH just aborts. Connecting **by Tailscale IP** works
fine:
```
$ ssh user@100.74.124.81 # works
$ ssh MyMac # Host key verification failed
```
## Why It Happens
There is **no `Host MyMac` block in `~/.ssh/config` at all** — and there never was.
The connection only ever worked by IP, or interactively (where you clicked through
the first-connect `yes` prompt without noticing).
When no `Host` block matches, SSH uses the literal argument as the hostname. With
Tailscale MagicDNS, `MyMac` (or `mymac`) resolves to the node — so the *connection*
succeeds — but the host key it presents is checked against `known_hosts` under the
name **`mymac`**, which has no entry. Meanwhile the key you actually trust is stored
under the **IP**:
```
$ ssh-keygen -F 100.74.124.81 # found — line 67
$ ssh-keygen -F mymac # nothing
```
So strict host-key checking has nothing to match, tries to prompt to accept the
"new" key, and on a headless host that prompt fails → `Host key verification failed`.
Confirm there's no block (and that `ssh -G` is just echoing defaults):
```
$ ssh -G MyMac | grep -E '^(hostname|user|port) '
hostname mymac # lowercased literal — NOT an explicit HostName
user youruser # your local username default — not from a block
port 22 # default
```
If `hostname` equals the arg you typed (just lowercased) and `user` is your local
login name, there is no matching `Host` block.
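The same check can be scripted for a list of aliases. A minimal sketch, assuming the
`ssh -G` output format shown above; `alias_falls_through` is a hypothetical name, and a
`Host` block that sets no `HostName` looks identical to no block at all:

```bash
# Read `ssh -G <alias>` output on stdin. Succeed if the resolved hostname is
# just the lowercased alias, which suggests nothing set an explicit HostName.
alias_falls_through() {
  local typed resolved
  typed=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  resolved=$(awk '$1 == "hostname" { print $2; exit }')
  [ "$resolved" = "$typed" ]
}
```

For example, `ssh -G MyMac | alias_falls_through MyMac && echo "MyMac: no HostName pinned"`.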
## The Fix
Add an explicit `Host` block that **pins the IP** that `known_hosts` already trusts.
This matches the convention every other host in a Tailscale fleet should follow —
pin the `100.x` address, not the MagicDNS name:
```sshconfig
Host MyMac mymac
HostName 100.74.124.81
User youruser
IdentityFile ~/.ssh/id_ed25519
```
> [!note] When pinning the IP is the *wrong* call
> Pinning the IP is right while the host is **stable**. If the box gets migrated or
> rebuilt — new Tailscale IP *and* new host key — the pin rots and `known_hosts`
> mismatches. At that point switch to **MagicDNS names** so the alias self-heals. See
> *[MagicDNS Names vs Pinned IPs for Tailscale SSH (After a Fleet Migration)](tailscale-ssh-magicdns-vs-pinned-ip-after-migration.md)*.
Now `ssh MyMac` resolves to `100.74.124.81`, whose key is in `known_hosts`, and the
check passes with no prompt. Verify non-interactively:
```
$ ssh -o BatchMode=yes MyMac 'hostname'
mymac.majorlan
```
`BatchMode=yes` disables every prompt — if it returns the hostname cleanly, the key
is trusted and a real key authenticated.
**Don't over-pin the identity.** Run `ssh -v user@<IP> true` and check the
`Will attempt key` / accepted-key lines first. A workstation often authenticates
with the *default* `id_ed25519`, not a fleet key — if `id_ed25519_fleet` isn't even
offered, don't put it in the block.
## Cleanup: Stale `known_hosts` Cruft
Drive-by `ssh` attempts leave junk entries like `mymac-2` (auto-suffixed names from
old keys). They never match anything once you pin the IP. Purge them:
```
$ ssh-keygen -R mymac-2
```
## How to Diagnose This
1. `ssh -o BatchMode=yes <alias> true` — if it fails with `Host key verification
failed` (not `Permission denied`), it's a host-key problem, not auth.
2. `ssh -G <alias> | grep -E '^(hostname|user|port) '` — if `hostname` is just your
typed arg and there's no real `HostName`, there's no `Host` block.
3. `ssh-keygen -F <name>` vs `ssh-keygen -F <ip>` — find which name actually holds
the trusted key. Pin whichever one `known_hosts` has (usually the IP).
## Why This Gotcha Is Invisible
It only surfaces on a host with **no askpass** (headless / WSL / cron). On a desktop,
the first-connect prompt appears, you hit `yes`, an entry gets written under the
MagicDNS name, and it "just works" — masking the fact that no `Host` block exists and
the IP-keyed entry is the only durable trust. Move the same config to a headless box
and the missing block becomes a hard failure. Related: SSH only applies `Host` blocks
by **literal pattern match**, so connecting by IP also skips them — see *Ansible Fails
with Permission Denied While `ssh <alias>` Works (Host Alias Bypass)*.

@ -0,0 +1,160 @@
---
title: "SSH `Permission denied (publickey)` After Rotating a Key — Backfill Every `authorized_keys`"
domain: selfhosting
category: troubleshooting
tags:
- ssh
- ssh-keys
- authorized-keys
- key-rotation
- publickey
- fleet
- troubleshooting
status: published
created: 2026-06-17
updated: 2026-06-17
---
# SSH `Permission denied (publickey)` After Rotating a Key — Backfill Every `authorized_keys`
## The Problem
A host you've SSH'd into for months suddenly rejects you — but **only some hosts**, not all:
```
$ ssh root@host-a
root@host-a: Permission denied (publickey).
$ ssh root@host-b # same key, same workstation — works fine
host-b $
```
Nothing changed on the servers. The thing that changed is on **your** side: at some
point the workstation's SSH key was **regenerated** (lost laptop, rebuild, a key file
clobbered by a botched copy, a routine rotation). The new public key was pushed to a
few hosts but never fanned out to the rest. Every host still holding only the *old*
public key now rejects the new private key with `Permission denied (publickey)`.
> The tell: it's `Permission denied (publickey)`, **not** `Host key verification
> failed`. The former is an **authentication** failure (the server doesn't trust your
> key); the latter is the server's key not matching your `known_hosts`. Different
> problem — see *[SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure](ssh-missing-host-block-magicdns-host-key-failure.md)*.
## Why It Happens
Public-key auth is **per-host**: the server only lets you in if your public key is a
line in that host's `~/.ssh/authorized_keys`. There is no central directory — each
host is its own island. So when you rotate a key, *every* host needs the new public
key appended independently.
It's easy to do this partially without noticing. You regenerate the key, then over the
next hour you happen to SSH into three boxes and (re-)deploy the key there as part of
other work. Those three now trust the new key. The other six don't — and you won't
find out until weeks later when you reach for one of them.
Confirm it's an authentication (key) failure and see which key is being offered:
```
$ ssh -v root@host-a 2>&1 | grep -E 'Offering|Authentications|Permission denied'
debug1: Offering public key: /home/you/.ssh/id_ed25519 ED25519 SHA256:XeY1/N9qwB…
debug1: Authentications that can continue: publickey
root@host-a: Permission denied (publickey).
```
The server offered you nothing but `publickey`, you offered your current key, and it
was refused → your key isn't in that host's `authorized_keys`.
## Scope It First — Don't Fix One Host at a Time
The host you noticed is rarely the only one. Sweep the whole fleet in one pass before
touching anything, so you fix the real set, not just the squeaky wheel:
```bash
for h in host-a host-b host-c host-d host-e host-f; do
r=$(ssh -o BatchMode=yes -o ConnectTimeout=8 root@"$h" 'echo OK' 2>&1 | tail -1)
echo "$h: $r"
done
```
`BatchMode=yes` suppresses password/passphrase prompts so a failure fails fast instead
of hanging. Anything that doesn't print `OK` needs the backfill.
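Because the sweep can surface both failure modes, it helps to label each result rather
than read raw messages. A small sketch that keys on the two OpenSSH messages quoted in
this article; `classify_ssh_result` is a hypothetical helper:

```bash
# Map the last line of an `ssh -o BatchMode=yes ... 'echo OK'` attempt to a
# failure class: ok, host-key, auth, or other (timeouts, DNS, refused, ...).
classify_ssh_result() {
  case "$1" in
    OK)                                echo "ok" ;;
    *"Host key verification failed"*)  echo "host-key" ;;
    *"Permission denied"*)             echo "auth" ;;
    *)                                 echo "other" ;;
  esac
}
```

Inside the sweep loop, `echo "$h: $(classify_ssh_result "$r")"` turns the output into a
list where only the `auth` hosts need the key backfill.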
## The Fix
You need a **second, still-trusted** way onto each failing host to append the new key.
Common transit options, best first:
- **Another of your keys that still works** (e.g. a config-management / automation
user whose key is authorized fleet-wide, ideally with `sudo`).
- **Another workstation** whose key those hosts still trust.
- **The provider's web console / serial console** as a last resort.
> [!warning] A jump host only helps if *it* can reach the target
> "Bounce through a box that still trusts me" only works if that box's own key is in
> the target's `authorized_keys`. A host can trust *your* key yet have no standing
> trust to a third host (and hit its own `Host key verification failed` on the way).
> Test the full two-hop path before relying on it.
Using a fleet-wide automation user (`deploy`) with passwordless `sudo` as the transit,
append the new key idempotently, with a backup, to every failing host:
```bash
PUBKEY=$(cat ~/.ssh/id_ed25519.pub)
STAMP=$(date +%Y%m%d-%H%M%S)
for h in host-a host-c host-e; do # only the hosts that failed the sweep
ssh deploy@"$h" "sudo bash -s" <<EOF
set -e
F=/root/.ssh/authorized_keys
mkdir -p /root/.ssh && touch "\$F"
cp "\$F" "\$F.bak-$STAMP" # backup before any change
grep -qF "$PUBKEY" "\$F" || printf '%s\n' "$PUBKEY" >> "\$F" # append only if absent
chmod 600 "\$F"
EOF
done
```
Three things that keep this safe:
- **Append, never overwrite.** `>> "$F"` and the `grep -qF … ||` guard mean you add
one line and only if it's missing. Re-running is a no-op — never clobber an
`authorized_keys` with `>` or you'll lock out every *other* key on the box.
- **Back up first.** The `.bak-<stamp>` copy is your undo.
- **`chmod 600`.** SSH silently ignores an `authorized_keys` that's group/world
writable, which looks exactly like "the key didn't take."
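The append-if-absent logic can be rehearsed against a scratch file before it goes anywhere near a real host. This is a minimal local sketch, not the fleet loop itself: the function name `append_key_once` is made up for illustration, and unlike the loop above it only takes the backup when it is actually about to change the file.

```shell
#!/usr/bin/env bash
# append_key_once FILE KEY   (hypothetical helper, illustration only)
append_key_once() {
    local file=$1 key=$2
    touch "$file"
    # -x matches the whole line, -F treats the key as a literal string
    if ! grep -qxF -- "$key" "$file"; then
        cp -p "$file" "$file.bak-$(date +%Y%m%d-%H%M%S)"   # undo copy
        # if the last line lacks a trailing newline, add one so keys never fuse
        [ -n "$(tail -c1 "$file")" ] && printf '\n' >> "$file"
        printf '%s\n' "$key" >> "$file"
    fi
    chmod 600 "$file"
}
```

Calling it twice with the same key leaves exactly one line in the file, which is the property that makes re-running the fleet loop safe.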
Then verify directly — not through the transit user:
```bash
for h in host-a host-c host-e; do
    echo "$h: $(ssh -o BatchMode=yes root@"$h" 'echo OK' 2>&1 | tail -1)"
done
```
All `OK` means the new key authenticates on its own.
## Prevention
- **Treat rotation as fleet-wide.** When a workstation key changes, the very next step
is to fan the new public key out to **every** host's `authorized_keys` in one pass —
not opportunistically as you happen to log in. A short `for` loop over the full host
list (or a config-management task — see below) closes the gap immediately.
- **Manage `authorized_keys` declaratively.** An Ansible `ansible.posix.authorized_key`
task (or equivalent) that lists the *current* set of keys makes "who can log in" a
reviewed, version-controlled fact instead of an append-only pile that drifts per host.
- **Keep the old key authorized until the new one is verified everywhere**, then remove
the stale line in a deliberate cleanup pass.
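The cleanup pass is the mirror image of the append: drop exactly one line, keep everything else, and leave an undo copy. A hedged sketch follows; `remove_key` is a hypothetical helper name, and it assumes the stale key is passed as the complete `authorized_keys` line.

```shell
#!/usr/bin/env bash
# remove_key FILE KEY   (hypothetical helper, illustration only)
remove_key() {
    local file=$1 key=$2 tmp
    tmp=$(mktemp "$file.XXXXXX") || return 1
    cp -p "$file" "$file.bak-$(date +%Y%m%d-%H%M%S)"   # undo copy
    # grep -v exits 1 when no lines remain; that is not an error here
    grep -vxF -- "$key" "$file" > "$tmp" || true
    chmod 600 "$tmp"
    mv "$tmp" "$file"                                   # swap in with one rename
}
```

Run it only after the direct `BatchMode` verification has passed on that host with the new key.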
## How to Diagnose This (Checklist)
1. `ssh -o BatchMode=yes <host> true` → `Permission denied (publickey)` (auth), not
`Host key verification failed` (host key). Confirms which problem you have.
2. `ssh -v <host> 2>&1 | grep Offering` → which private key is being offered, and its
fingerprint.
3. Sweep the whole fleet with the `BatchMode` loop → get the **full** list of affected
hosts before fixing.
4. Append the new public key (idempotent, backed up, `chmod 600`) via a still-trusted
transit path.
5. Re-verify each host with a direct `BatchMode` login.
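The first checklist step can be folded into the sweep so each host is labelled by failure type instead of eyeballed. This is a rough sketch: `classify_ssh_result` is an invented name, and the patterns only cover the OpenSSH messages quoted in this article plus a connect timeout.

```shell
#!/usr/bin/env bash
# classify_ssh_result LAST_LINE_OF_SSH_OUTPUT   (hypothetical helper)
classify_ssh_result() {
    case "$1" in
        OK)                               echo "ok" ;;
        *"Permission denied"*)            echo "auth" ;;      # key not in authorized_keys
        *"Host key verification failed"*) echo "host-key" ;;  # known_hosts problem instead
        *"timed out"*)                    echo "timeout" ;;
        *)                                echo "other" ;;
    esac
}
```

In the sweep loop it would be used as `classify_ssh_result "$r"`; only hosts labelled `auth` need the key backfill described here.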
Related: *[SSH Config & Key Management](../../01-linux/networking/ssh-config-key-management.md)*
and *[SSH Hardening Across a Fleet with Ansible](../../02-selfhosting/security/ssh-hardening-ansible-fleet.md)*.


@ -0,0 +1,157 @@
# Tailscale Boot Race Conditions (SSH Unreachable After Reboot)
Two related race conditions can make a host unreachable via Tailscale after reboot. Both stem from systemd services starting before Tailscale or the network is ready.
---
## Race 1: ssh.socket Binds Before Tailscale Is Up (Ubuntu)
### Symptom
SSH to a host via Tailscale IP times out. `tailscale ping` works, `tailscale status` shows `active; direct`, but SSH on port 22 refuses connections. No access via Hetzner console if root password is unset.
### Cause
Ubuntu 24.04 uses systemd **socket activation** for SSH (`ssh.socket` instead of persistent `ssh.service`). When the socket override binds to a Tailscale IP, it can start *before* `tailscaled.service` is ready. The bind may succeed initially (Tailscale state file caches the IP), but a later Tailscale reconnect or interface reset invalidates the bound address silently — SSH dies with no recovery path.
### Diagnosis
```bash
# From another host:
tailscale ping <IP> # succeeds — host is up
ssh root@<IP> # times out — sshd not listening
# After gaining console access or reboot:
systemctl status ssh.socket # check Listen: address
journalctl -b -1 -u ssh # likely empty — sshd never spawned
journalctl -b -1 -u ssh.socket # socket started before tailscaled
```
### Fix (current — 2026-05-31)
`After=tailscaled.service` orders against the service becoming `active`, **not** against the `tailscale0` interface actually having an IPv4 address. tailscaled flips to active within a second of starting, but the kernel doesn't have the address bound to the interface until DERP relays connect and the control plane confirms the node. If ssh.socket attempts `ListenStream=<TS IP>:22` in that window, the bind fails with `Cannot assign requested address`, the socket goes into a failed state, and there is no automatic retry.
The proper gate is a dedicated readiness service that **waits for the tailscale0 IPv4 address to exist** before letting ssh.socket bind:
```ini
# /etc/systemd/system/tailscale-wait-ready.service
[Unit]
Description=Wait until tailscale0 has an IPv4 address
After=tailscaled.service
Requires=tailscaled.service
ConditionPathExists=/usr/sbin/ip
[Service]
Type=oneshot
RemainAfterExit=yes
TimeoutStartSec=120
ExecStart=/usr/bin/bash -c 'for i in $(seq 1 120); do ip -4 -o addr show tailscale0 2>/dev/null | grep -q "inet " && exit 0; sleep 1; done; exit 1'
[Install]
WantedBy=multi-user.target
```
```ini
# /etc/systemd/system/ssh.socket.d/override.conf
[Unit]
After=tailscale-wait-ready.service
Requires=tailscale-wait-ready.service
[Socket]
ListenStream=
ListenStream=<TAILSCALE_IP>:22
```
Reload + restart:
```bash
systemctl daemon-reload
systemctl enable tailscale-wait-ready.service
systemctl restart ssh.socket
ss -tlnp | grep :22 # verify bound to Tailscale IP
```
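The `ss | grep :22` check matches any listener on port 22, including a wildcard bind. A slightly stricter sketch is shown here, with `listens_on` as a made-up helper name that reads `ss -tln` output on stdin and succeeds only if the exact address and port are listening.

```shell
#!/usr/bin/env bash
# listens_on ADDR PORT   (hypothetical helper; reads `ss -tln` output on stdin)
listens_on() {
    # column 4 of `ss -tln` is Local Address:Port
    awk -v want="$1:$2" '$1 == "LISTEN" && $4 == want { found = 1 } END { exit !found }'
}
```

Usage would look like `ss -tln | listens_on <TAILSCALE_IP> 22 && echo bound`, run after a real reboot rather than just after the restart.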
!!! note "Evolution of this fix"
    - **2026-05-19 v1:** `After=tailscaled.service` + `BindsTo=tailscaled.service`. Worked initially but caused a shutdown-time ordering cycle.
    - **2026-05-23 v2:** `BindsTo` swapped for `Requires` to break the cycle. Fixed the cycle but did **not** wait for `tailscale0` to actually have an IP, only for `tailscaled` to be active. Hosts continued losing SSH after some reboots (intermittent, depending on which side won the race).
    - **2026-05-31 v3:** Added `tailscale-wait-ready.service` to gate ssh.socket on the interface having an address. This is the current canonical fix.
!!! warning "Do NOT use BindsTo"
    `BindsTo=tailscaled.service` creates a **systemd ordering cycle** during shutdown: `basic.target → sockets.target → ssh.socket → tailscaled.service → basic.target`. Systemd breaks the cycle by deleting jobs unpredictably, which can prevent `ssh.socket` from starting on the next boot. Use `Requires=` for startup ordering without the bidirectional lifecycle coupling.
### Affected Hosts
Ubuntu hosts locked via the `tailscale` role (`ssh_only_ubuntu` task, formerly `configure_tailscale_ssh_only.yml`): majorlinux, dcaprod-hetzner, tttpod-hetzner, majortoot-hetzner.
> [!danger] The Ubuntu playbook shipped the cycle pattern until 2026-06-07
> Despite the 2026-06-04 resolution above, `configure_tailscale_ssh_only.yml` in the repo kept deploying the `[Unit] Requires=tailscale-wait-ready.service` gate on **ssh.socket** (the cycle-causer) and never added the ssh.service gate — so re-running it *re-armed* the ordering cycle. Caught 2026-06-07: it clobbered majorlinux's hand-fix, and **majortoot-hetzner was found already armed** with the latent cycle (would have lost SSH on its next reboot). Both restored/defused; playbook corrected in MajorAnsible `e0d35aa` (gate on ssh.service, dependency-free socket).
>
> **Fleet audited & reconciled 2026-06-07:** dcaprod-hetzner + tttpod-hetzner had the dependency-free socket already but were **missing `tailscale-wait-ready.service`** (their ssh.service gate referenced a non-existent unit → inert → latent *bind* race, not a cycle); the corrected playbook was applied to both, deploying the service and activating the gate. teelia uses **Tailscale SSH** (no sshd, ssh.socket/ssh.service disabled), so it is immune to both races. All Ubuntu hosts now run the same pattern: dependency-free `ssh.socket` bind + `ssh.service` readiness gate + `tailscale-wait-ready.service`.
> [!warning] Fedora hosts are NOT automatically immune (corrected 2026-06-07)
> The firewalld method (`configure_tailscale_ssh_only_fedora.yml`) binds sshd on `0.0.0.0:22` and enforces Tailscale-only via the firewall, so it has no dependency on the Tailscale address — **unless** a host also carries a leftover manual `ListenAddress <tailscale-ip>` drop-in (`/etc/ssh/sshd_config.d/tailscale-only.conf`) from the pre-firewall lockdown. Then sshd.service hits the same boot bind-race (`Bind to port 22 on <ts-ip> failed: Cannot assign requested address`) and flaps every reboot. Hit on **majordiscord 2026-06-07**; fixed by removing the redundant drop-in (firewall stays the enforcing layer). The Fedora playbook now removes it automatically (MajorAnsible `b4a9090`).
---
## Race 2: tailscaled Starts Before Network Is Online (All Hosts)
### Symptom
Host reboots but never appears on Tailscale. `tailscale ping` times out entirely. SSH is dead because Tailscale never connects. The host is up (accessible via provider console) but isolated from the Tailscale network.
### Cause
`tailscaled.service` ships with `After=network-pre.target`, which fires *before* the network interface has an IP. On VPS hosts (especially Hetzner), the interface can take several seconds to come online. Tailscale starts, sees no network (`SetNetworkUp(false)`, `link state: defaultRoute= ifs={} v4=false v6=false`), fails DNS bootstrap and DERP relay connections, and gets stuck — never retrying.
### Diagnosis
```bash
# From Hetzner console or another access method:
journalctl -b -u tailscaled | grep -E "SetNetworkUp|link state|error|DERP"
# Look for:
# magicsock: SetNetworkUp(false)
# link state: interfaces.State{defaultRoute= ifs={} v4=false v6=false}
# health: Tailscale could not connect to any relay server
```
### Fix
Deploy a systemd drop-in to wait for full network connectivity:
```ini
# /etc/systemd/system/tailscaled.service.d/override.conf
[Unit]
After=network-online.target
Wants=network-online.target
```
Then reload and restart:
```bash
systemctl daemon-reload
systemctl restart tailscaled
```
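After the restart it is worth confirming that systemd actually merged the drop-in. One way, sketched here with an invented helper name `orders_after`, is to check the unit's effective `After=` list; the helper reads the output of `systemctl show -p After --value tailscaled.service` on stdin.

```shell
#!/usr/bin/env bash
# orders_after UNIT   (hypothetical helper; stdin = space-separated unit list)
orders_after() {
    # one unit per line, then an exact whole-line match
    tr -s ' ' '\n' | grep -qxF -- "$1"
}
```

For example: `systemctl show -p After --value tailscaled.service | orders_after network-online.target && echo "drop-in active"`.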
### Affected Hosts
All hosts where Tailscale is the primary access path. Particularly impactful on VPS hosts with slow interface bringup. Both Fedora and Ubuntu hosts are affected.
---
## Prevention
- Set root passwords on all VPS hosts for emergency console access
- The `tailscale` role deploys all fixes automatically (run via `tailscale.yml` / `site.yml`):
- `network_wait` task — tailscaled network-online dependency (all hosts)
- `ssh_only_ubuntu` task — dependency-free ssh.socket bind + ssh.service readiness gate + `tailscale-wait-ready.service` (Ubuntu group)
- `ssh_only_fedora` task — firewalld Tailscale-only lockdown; removes any leftover `ListenAddress` drop-in (Fedora group)
## References
- [[dcaprod#2026-05-19 — SSH unreachable due to ssh.socket race condition with Tailscale]]
- [[majordiscord#2026-05-19 — Tailscale boot race: unreachable after Ansible reboot]]
- [[majorlinux#2026-05-19 — ssh.socket override patched: added Tailscale dependency]]
- [[dcaprod#2026-05-23 — SSH unreachable again: BindsTo ordering cycle in ssh.socket override]]
- [[majorlinux#2026-05-31 — ssh.socket race recurrence post-reboot (Requires= insufficient; added wait-ready gate)]]
- [[majortoot#2026-05-31 — ssh.socket race post-reboot on majortoot-hetzner (during cutover night)]]
- Ansible: the `tailscale` role (`tailscale.yml`) — `network_wait` + `ssh_only_ubuntu`/`ssh_only_fedora` tasks, consolidated from the former `configure_tailscale_*` playbooks (MajorAnsible `656302e`)


@ -0,0 +1,133 @@
---
title: "Steam Deck Wi-Fi Flapping: IWD Periodic Scan + rtw88 Power Save"
domain: troubleshooting
category: networking
tags: [wifi, steam-deck, steamos, iwd, networkmanager, rtw88, rtl8822ce, power-save, supplicant-disconnect, flapping]
status: published
created: 2026-06-19
updated: 2026-06-19
---
# Steam Deck Wi-Fi Flapping: IWD Periodic Scan + rtw88 Power Save
## 🛑 Problem
An OG Steam Deck (LCD model, Realtek **RTL8822CE** on the `rtw88_8822ce` driver) kept "losing" Wi-Fi — it would connect, hold for around a minute, drop, then reconnect a second later, over and over. From the router side the device looked like it was constantly coming and going; from the couch it felt like the network "wouldn't stay connected."
Crucially, **this was not a router problem.** The AP config was correct, RF was clean (strong signal, zero tx retries / beacon loss), and every other client on the network was rock-solid. The fault was entirely on the Deck.
## 🔍 Diagnosis
SteamOS uses **NetworkManager with the `iwd` backend** (not `wpa_supplicant`). That detail is the whole ballgame.
### Step 1 — Confirm the flap and its cadence
```bash
# how many disconnects this boot?
journalctl -b -u NetworkManager --no-pager | grep -c supplicant-disconnect
# 50
# when did they happen?
journalctl -b -u NetworkManager --no-pager | grep supplicant-disconnect \
| awk '{print $1,$2,$3}' | tail
# 10:20:52 · 10:21:54 · 10:22:57 · 10:24:00 · 10:25:03 · 10:26:05 · 10:27:08 ...
```
**~63 seconds between every drop.** A fixed, metronome-like interval is the tell — this is a *timer*, not RF noise. The NetworkManager log shows the pattern plainly:
```
activated -> failed (reason 'supplicant-disconnect')
... -> activated # reconnects ~1s later
```
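The "metronome" claim is easy to check numerically rather than by eye. The sketch below assumes all timestamps fall on the same day (no midnight wrap); `gaps` is an illustrative name, not an existing tool.

```shell
#!/usr/bin/env bash
# gaps: read HH:MM:SS timestamps (one per line) on stdin,
# print the number of seconds between consecutive entries
gaps() {
    awk -F: '{
        t = $1 * 3600 + $2 * 60 + $3
        if (NR > 1) print t - prev
        prev = t
    }'
}
```

Fed with the time column of the journal lines (`... | grep supplicant-disconnect | awk '{print $3}' | gaps`), a timer-driven fault shows up as a run of near-identical numbers, while RF trouble tends to produce scattered ones.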
### Step 2 — Prove the link is healthy *when it's up*
```bash
iw dev wlan0 station dump | grep -iE 'signal|bitrate|failed|retries|beacon loss'
# signal: -65 dBm
# tx retries: 0
# tx failed: 0
# beacon loss: 0
```
Strong signal, zero retries, zero beacon loss — the association is clean while it lasts. So the drop is being *commanded*, not caused by a bad radio link.
### Step 3 — Identify the chip and the backend
```bash
lspci -k | grep -A3 -iE 'network|wireless'
# Realtek RTL8822CE ... Kernel driver in use: rtw88_8822ce
```
The `~63s` interval is **IWD's default periodic background scan**. With no `/etc/iwd/main.conf` present, IWD scans on a timer even while connected, and on the `rtw88` driver that scan knocks the current association over — producing the `supplicant-disconnect` every minute.
A secondary annoyance: `iw dev wlan0 get power_save` reported `on`, which showed up as wildly jittery LAN latency (8–69 ms to the gateway over Wi-Fi, where a healthy 5 GHz link is 2–10 ms).
## ✅ Fix
Two independent changes — the first stops the flap, the second smooths latency.
### 1. Disable IWD's periodic scan (stops the flap)
```bash
sudo mkdir -p /etc/iwd
printf '[Scan]\nDisablePeriodicScan=true\n' | sudo tee /etc/iwd/main.conf
sudo systemctl restart iwd # briefly drops Wi-Fi; NetworkManager auto-reconnects
```
Trade-off: with periodic scanning off, the Deck roams to a different/stronger AP (e.g. another AiMesh node) more lazily. Fine for a device that mostly sits in one spot.
### 2. Disable Wi-Fi power save (kills the latency jitter)
The obvious `nmcli connection modify <name> 802-11-wireless.powersave 2` **does not work under the IWD backend** — NetworkManager doesn't enforce that property when `iwd` is managing the radio. Use a dispatcher script instead, with a retry loop because `rtw88` won't accept the setting in the first instant after association on a cold boot:
```bash
sudo tee /etc/NetworkManager/dispatcher.d/90-wifi-powersave >/dev/null <<'SCRIPT'
#!/bin/sh
# Disable Wi-Fi power save on the wireless iface (retry: rtw88 may not accept it instantly on boot)
case "$2" in
    up|dhcp4-change|connectivity-change)
        case "$1" in
            wl*)
                for n in 1 2 3 4 5; do
                    /usr/bin/iw dev "$1" set power_save off 2>/dev/null
                    [ "$(/usr/bin/iw dev "$1" get power_save 2>/dev/null)" = "Power save: off" ] && break
                    sleep 1
                done
                ;;
        esac
        ;;
esac
SCRIPT
sudo chmod +x /etc/NetworkManager/dispatcher.d/90-wifi-powersave
sudo iw dev wlan0 set power_save off # apply now without waiting for a reconnect
```
> 💡 A single-shot dispatcher (no retry) **silently fails on a cold boot**: it fires before the interface is ready, the `iw` call no-ops, and power save stays on. Verify with `iw dev wlan0 get power_save` *after a real reboot*, not just after a service restart.
## 🔁 Verification
```bash
# was 50/boot, ~once a minute:
journalctl -b -u NetworkManager --no-pager | grep -c supplicant-disconnect
# 0
iw dev wlan0 get power_save
# Power save: off
```
A 3-minute continuous `ping` showed **180/180 replies, 0 loss**, with latency tightened to **6–11 ms**. Confirmed across a full cold reboot: the Deck auto-rejoins Wi-Fi, both settings persist, and the disconnect counter stays at 0.
## 📌 Notes
- **Persistence:** `/etc/iwd/main.conf` and the dispatcher live in `/etc`, which survives reboots. A major SteamOS update *can* reset `/etc` — re-apply if the flapping returns after an OS update.
- **Fully reversible:**
```bash
sudo rm /etc/iwd/main.conf /etc/NetworkManager/dispatcher.d/90-wifi-powersave
sudo systemctl restart iwd
```
- **Interface name** is usually `wlan0`; confirm with `iw dev` if different.
- The same IWD-periodic-scan behavior can affect other `iwd`-based distros (Arch, some Fedora spins) on flaky/older Wi-Fi chips — the `DisablePeriodicScan` fix is general, not Deck-specific.
## 🔗 Related
- [Wi-Fi Game Streaming Stutter: 160 MHz Channel Width Saturating the 5 GHz Radio](wifi-160mhz-airtime-saturation-game-streaming.md) — the *other* Steam Deck Wi-Fi issue (airtime contention, router-side), distinct from this client-side flap.


@ -0,0 +1,163 @@
---
title: "MagicDNS Names vs Pinned IPs for Tailscale SSH (After a Fleet Migration)"
domain: troubleshooting
category: networking
tags:
- ssh
- ssh-config
- tailscale
- magicdns
- known-hosts
- host-key
- migration
- wsl2
status: published
created: 2026-06-12
updated: 2026-06-12
---
# MagicDNS Names vs Pinned IPs for Tailscale SSH (After a Fleet Migration)
You have SSH aliases for a Tailscale fleet (`alias tttpod='ssh root@100.84.42.102'`).
They worked for months. Then you migrate or rebuild some nodes — and now a third of
them hang on connect or refuse the host key. This is the failure mode that hardcoded
addresses hit, and why the durable answer is **MagicDNS names**, not pinned IPs.
> This is the sequel to *[SSH Alias Falls Through to MagicDNS — Host-Key Verification
> Failure (No `Host` Block)](ssh-missing-host-block-magicdns-host-key-failure.md)*.
> That article says **pin the IP** `known_hosts` already trusts — correct when the
> node is stable. This one covers what happens when a migration changes the IP *and*
> the host key, which is exactly when IP-pinning stops paying off.
## The Three Failure Modes
A migration/rebuild can trigger any of these — often several at once across a fleet,
which is what makes it confusing:
### 1. Stale hardcoded IP → connection times out
The node re-registered on the tailnet with a **new** Tailscale IP, but your alias
still names the old one:
```
$ tttpod
ssh: connect to host 100.84.42.102 port 22: Operation timed out
```
The old address is dead; SSH waits the full timeout and gives up. Confirm by asking
the tailnet for the node's *current* IP by name:
```
$ tailscale status | grep tttpod
100.95.137.38 tttpod ... # alias points at 100.84.42.102 — stale
```
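The same comparison can be scripted across every alias. This is a sketch under the assumption that `tailscale status` prints the IP in column 1 and the node name in column 2, as in the output above; `current_ip` and `is_stale` are made-up helper names, and both read the status output on stdin.

```shell
#!/usr/bin/env bash
# current_ip NAME   (hypothetical; stdin = `tailscale status` output)
current_ip() {
    awk -v n="$1" '$2 == n { print $1; exit }'
}

# is_stale ALIAS_IP NAME   (hypothetical; succeeds when the alias IP is outdated)
is_stale() {
    local now
    now=$(current_ip "$2")
    # an unknown node name is reported as "not stale"; check that case separately
    [ -n "$now" ] && [ "$now" != "$1" ]
}
```

Example: `tailscale status | is_stale 100.84.42.102 tttpod && echo "tttpod alias is stale"`.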
### 2. Cold-path teardown → first connect after idle times out
The IP is correct and the node is up (it answers `ping`), but TCP/22 still times out
on the *first* try after a quiet period, then works on retry. Tailscale 1.98.x is more
aggressive about tearing down **idle direct UDP paths**; the first SSH has to
re-establish NAT traversal, which can overrun SSH's default connect timeout.
```
$ tailscale status | grep tttpod
100.95.137.38 tttpod ... idle, tx 9360 rx 0 # cold path
$ tailscale ping tttpod
pong from tttpod (100.95.137.38) via 5.161.118.84:41641 in 48ms # warms instantly
```
### 3. Host-key verification failed → box was rebuilt
The node was reinstalled, so it presents a **new** SSH host key. Your `known_hosts`
still has the old one, so even `StrictHostKeyChecking=accept-new` aborts: `accept-new`
only adds *genuinely new* hosts and refuses a **mismatch**:
```
$ ssh root@tttpod hostname
Host key verification failed.
```
## The Fix
Three changes, applied on every **name-capable** machine (see the WSL2 caveat below):
### a. Switch aliases from IPs to MagicDNS names
```bash
# before — rots on every migration
alias tttpod='ssh root@100.84.42.102'
# after — always resolves the node's current IP
alias tttpod='ssh root@tttpod'
```
MagicDNS resolves the name to whatever IP the node currently has, so a future
migration needs **zero** alias edits. This is the whole point: the tailnet already
knows the mapping — stop duplicating (and stale-ing) it in your dotfiles.
> **Exception:** if there's no tailnet device with that exact name (e.g. an alias
> `teelia` pointing at a node actually named `temptedparadise`), MagicDNS can't
> resolve it — keep the IP for that one.
### b. Purge stale host keys, then re-accept
After a rebuild, clear the old entries under **both** the name and the current IP,
then reconnect with `accept-new` to record the fresh key. Over Tailscale's
authenticated WireGuard tunnel, a key change from a known rebuild is safe to accept.
```bash
for pair in "tttpod:100.95.137.38" "majortoot:100.64.169.62" "dcaprod:100.98.223.93"; do
    n="${pair%%:*}"; ip="${pair##*:}"
    ssh-keygen -R "$n"; ssh-keygen -R "$ip"
done
# repopulate
ssh -o StrictHostKeyChecking=accept-new root@tttpod hostname
```
### c. Add a cold-path cushion to `~/.ssh/config`
Give the first (cold) connection time to renegotiate instead of erroring:
```sshconfig
Host majorlinux tttpod majortoot majordiscord dcaprod majormail majorhome
ConnectTimeout 25
ServerAliveInterval 30
ServerAliveCountMax 4
```
`ConnectTimeout 25` turns the cold-path timeout into a ~12 s pause. The keepalives
hold the path open during an active session so it doesn't drop mid-command.
## Caveat: WSL2 Can't Use MagicDNS
A Linux box under **WSL2** typically has **no `tailscale` CLI and no MagicDNS
resolver** — it rides the Windows host's networking, and name lookups for tailnet
nodes fail:
```
$ getent hosts tttpod # (inside WSL2)
# nothing — no resolution
$ command -v tailscale # nothing — CLI lives on the Windows side
```
On those machines you **must** keep hardcoded IPs in `~/.ssh/config` (or use `Host`
blocks with explicit `HostName <ip>`), and refresh them by hand when a node migrates.
There's no self-healing option there — the trade is unavoidable.
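Since the WSL2 side has to be refreshed by hand, it helps to regenerate the `Host` blocks from one list instead of editing each entry. A minimal sketch follows: `emit_host_blocks` is a hypothetical helper, `User root` mirrors the aliases in this article, and the `name:ip` pairs would be copied from `tailscale status` on a machine that has the CLI.

```shell
#!/usr/bin/env bash
# emit_host_blocks NAME:IP [NAME:IP ...]   (hypothetical helper)
emit_host_blocks() {
    local pair
    for pair in "$@"; do
        # text before the first colon is the alias, text after it is the pinned IP
        printf 'Host %s\n    HostName %s\n    User root\n\n' "${pair%%:*}" "${pair##*:}"
    done
}
```

The output can be redirected into a separate file that `~/.ssh/config` pulls in with an `Include` line, so a migration means regenerating one file rather than hunting for stale addresses.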
## Diagnosis Checklist
1. `tailscale status | grep <host>` — does your alias's IP match the **current** one?
(Mode 1: stale IP.)
2. `ping`/`tailscale ping <host>` works but TCP/22 times out on first try, succeeds on
retry? (Mode 2: cold path.)
3. `ssh root@<host> true` → `Host key verification failed` (not `Permission denied`)?
(Mode 3: rebuilt box, stale `known_hosts`.)
4. Is the client a WSL2 box? `getent hosts <name>` returns nothing → MagicDNS
unavailable, stay on IPs.
## Takeaway
Pin the IP when a host is **stable** and the IP-keyed `known_hosts` entry is your
durable trust anchor. Switch to **MagicDNS names** when hosts **move** — migrations,
rebuilds, provider changes — so the tailnet's own name→IP mapping does the work your
dotfiles kept getting wrong. And on WSL2, you don't get the choice: hardcoded IPs,
refreshed by hand.


@ -0,0 +1,115 @@
---
title: "Wi-Fi Game Streaming Stutter: 160 MHz Channel Width Saturating the 5 GHz Radio"
domain: troubleshooting
category: networking
tags: [wifi, 5ghz, 160mhz, channel-width, dfs, steam-deck, game-streaming, asuswrt, airtime, chanim]
status: published
created: 2026-06-13
updated: 2026-06-13
---
# Wi-Fi Game Streaming Stutter: 160 MHz Channel Width Saturating the 5 GHz Radio
## 🛑 Problem
Streaming a game from a desktop (wired) to a Steam Deck over Wi-Fi was stuttering intermittently — fine for a while, then choppy, hard to reproduce on demand. Throughput tests "looked fine," which is exactly why it was hard to pin down: **game streaming fails on jitter and microbursts of contention, not on average bandwidth.**
The Wi-Fi was an Asus RT-AX82U (AsusWRT, stock firmware) with the 5 GHz radio set to **Auto channel at 160 MHz width**.
## 🔍 Diagnosis
The key insight: **signal was excellent, but latency was not.** That combination means the airwaves are busy, not weak.
### Step 1 — Measure jitter to the gateway from a Wi-Fi client
```bash
ping -c 20 -i 0.2 192.168.50.1
# round-trip min/avg/max/stddev = 7.5/27.0/61.0/16.5 ms
```
27 ms **average** and 16 ms of jitter to your *own router* over Wi-Fi is pathological. A healthy 5 GHz link sits at 2–5 ms. Yet the client's signal was **-43 dBm** (excellent) with a clean **-92 dBm** noise floor. Strong signal + high jitter = **airtime contention**, not range or interference at the receiver.
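For repeated before/after measurements, the summary line can be reduced to the two numbers that matter. This is a sketch only: `jitter_check` is an invented name, the threshold passed in is your own choice rather than a standard, and the pattern assumes the `min/avg/max/stddev` (BSD/macOS) or `min/avg/max/mdev` (Linux) summary format.

```shell
#!/usr/bin/env bash
# jitter_check LIMIT_MS   (hypothetical; stdin = ping output)
# prints avg and deviation, exits 1 when the deviation exceeds LIMIT_MS
jitter_check() {
    awk -v limit="$1" '
        /min\/avg\/max/ {
            split($0, halves, " = ")   # the numbers follow " = "
            split(halves[2], v, "/")   # v[1..4] = min, avg, max, deviation
            dev = v[4] + 0             # numeric coercion drops the trailing " ms"
            printf "avg=%s dev=%s\n", v[2], dev
            bad = (dev > limit)
        }
        END { exit bad }
    '
}
```

Example: `ping -c 20 -i 0.2 192.168.50.1 | jitter_check 5 || echo "jittery link"`. If no summary line is found the function prints nothing and exits 0, so check for empty output too.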
### Step 2 — Confirm channel utilization at the router
AsusWRT/Broadcom exposes per-channel airtime stats via `wl chanim_stats`. SSH into the router and run it against the 5 GHz interface:
```bash
# 5 GHz interface name varies (eth6/eth7); resolve it from nvram
IF=$(nvram get wl1_ifname)
wl -i "$IF" chanspec # e.g. 36/160 (0xe832) → channel 36, 160 MHz
wl -i "$IF" assoclist | wc -l # number of associated 5 GHz clients
wl -i "$IF" chanim_stats
```
The smoking gun (`chanim_stats`, version 3):
```
chanspec tx inbss obss nocat nopkt doze txop goodtx badtx glitch ... idle
0xe832 92 2 1 2 1 0 4 8 81 2 14
```
Read it as percentages of airtime:
| Field | Value | Meaning |
|-------|-------|---------|
| `tx` | **92** | Channel busy transmitting 92% of the time |
| `txop` | **4** | Transmit-opportunities available only 4% — the channel is starved |
| `idle` | **14** | Channel idle only 14% |
| `goodtx` / `badtx` | 8 / **81** | Failed/retried transmits vastly outnumber good ones |
Seventeen clients were associated to that one 5 GHz radio.
### Step 3 — Understand why 160 MHz makes it worse
A 160 MHz channel on the lower 5 GHz band spans channels **36–64**, which overlaps DFS sub-blocks. To stay clean it needs 160 MHz of *uncontended* spectrum, but in a dense RF environment (≈25 neighbor APs here, several on 5 GHz channels 48/52/100/132/153 that overlap or border the block), any one busy neighbor degrades the **entire** wide channel. 160 MHz also makes the radio **DFS-radar exposed**: a single radar detection forces a channel-switch with a 1 s+ blackout, which is a stream-killer.
So 160 MHz buys a higher *peak* PHY rate that game streaming doesn't need, at the cost of the *stability* it absolutely does.
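The channel arithmetic behind that can be spelled out. 5 GHz channel numbers step by 5 MHz, so each 20 MHz sub-channel is 4 numbers apart and a block of width W covers W/5 numbers. The sketch below uses a made-up function name and assumes the FCC-style DFS range (channels 52 to 144); other regulatory domains differ. The first argument must be the lowest 20 MHz channel of the block.

```shell
#!/usr/bin/env bash
# span_channels FIRST_CHANNEL WIDTH_MHZ   (hypothetical helper)
span_channels() {
    local first=$1 width=$2 ch
    for (( ch = first; ch < first + width / 5; ch += 4 )); do
        # assumption: 52-144 treated as DFS (U-NII-2A and U-NII-2C)
        if (( ch >= 52 && ch <= 144 )); then
            echo "$ch DFS"
        else
            echo "$ch"
        fi
    done
}
```

`span_channels 36 160` lists eight sub-channels, four of them flagged DFS, while `span_channels 36 80` stays entirely inside 36 to 48.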
## ✅ Fix
Drop the 5 GHz radio to **80 MHz** and pin it to a **non-DFS** channel (UNII-1: 36/40/44/48 — no radar, no DFS blackouts).
GUI: **Wireless → 5 GHz → Channel Bandwidth = 80 MHz**, **Control Channel = 36**, turn off "Auto."
Or over SSH (`nvram` + `restart_wireless`):
```bash
nvram set wl1_bw_cap=7 # cap at 80 MHz (bitmask: 1=20, 3=40, 7=80, 15=160)
nvram set wl1_chanspec=36/80 # channel 36 @ 80 MHz
nvram set wl1_channel=36
nvram commit
service restart_wireless # ~15-20s radio bounce, drops all clients briefly
```
> [!warning] `restart_wireless` drops every Wi-Fi client for 15–20 seconds. `nvram commit` runs *before* the restart, so the config persists even if your own SSH/Wi-Fi session drops.
## 📊 Result
Verified from both the router and a client after the radio came back:
| Metric | Before (36/160) | After (36/80) |
|--------|-----------------|---------------|
| Channel tx-busy | 92% | **9%** |
| Transmit-opportunity available | 4% | **79%** |
| Channel idle | 14% | **87%** |
| Failed tx (`badtx` vs `goodtx`) | 81 vs 8 | **1 vs 3** |
| Gateway ping (avg / floor) | 27 ms / 7.5 ms | **9 ms / 2.7 ms** |
| PHY peak rate | 1729 Mbps | 1200 Mbps |
The PHY peak dropped (narrower channel) but that is irrelevant: Steam Remote Play wants ~30–50 Mbps with *consistent* airtime, which it now has. The stutter resolved.
## 🧠 Takeaways
- **Diagnose Wi-Fi streaming problems with jitter, not throughput.** A speed test can pass while a stream stutters. Ping your gateway and watch the stddev.
- **Strong signal + high latency = airtime congestion.** Don't chase signal strength when RSSI is already good; look at channel utilization (`chanim_stats`).
- **160 MHz is a trap in a dense RF environment.** Use 80 MHz for reliability; reserve 160 MHz for clean spectrum and short range.
- **Prefer non-DFS channels (36–48) for anything latency-sensitive**, because DFS radar events cause silent multi-second dropouts.
- **Wire the *source*.** The streaming PC should be on Ethernet so the video only crosses the air once (AP → handheld). The handheld has to be Wi-Fi; the desktop doesn't.
- **Isolate IoT on 2.4 GHz** (separate SSID) so it never competes for 5 GHz airtime with latency-sensitive clients.
## Related
- [Steam Deck Wi-Fi Flapping: IWD Periodic Scan + rtw88 Power Save](steam-deck-wifi-flapping-iwd-periodic-scan-rtw88.md) — the *other* Steam Deck Wi-Fi issue (client-side flap), distinct from this router-side airtime problem.
- [Network Overview](../../02-selfhosting/dns-networking/network-overview.md)
- [Wake-on-LAN via Router SSH](../../02-selfhosting/dns-networking/wake-on-lan-router-ssh.md)
- [Pi-hole v6 Group Management — Per-Client DNS Rules](../../02-selfhosting/dns-networking/pihole-v6-group-management.md)


@ -11,7 +11,7 @@ tags:
- powershell
status: published
created: 2026-04-03
updated: 2026-04-22T09:20
updated: 2026-04-30T05:21
---
# Windows OpenSSH: WSL as Default Shell Breaks Remote Commands


@ -10,7 +10,7 @@ tags:
- majorrig
status: published
created: 2026-04-02
updated: 2026-04-22T09:20
updated: 2026-04-30T05:21
---
# Windows OpenSSH Server (sshd) Stops After Reboot


@ -0,0 +1,129 @@
---
title: "OBS Studio — \"Error opening file: (null)\" After Windows Profile Rename"
domain: troubleshooting
category: streaming
tags: [obs, streaming, windows, lua, profile-migration]
status: published
created: 2026-05-14
updated: 2026-05-14
---
# OBS Studio — "Error opening file: (null)" After Windows Profile Rename
## Symptom
Loading a scene collection in OBS Studio triggers a popup like:
```
[<ScriptName>.lua] Error opening file: (null)
```
The `(null)` is the giveaway: OBS resolved the registered script path to nothing — the file doesn't exist where the scene collection says it does. Most commonly this happens after a Windows profile was renamed or migrated and `C:\Users\<old>\...` paths were not updated.
## Why it happens
OBS stores per-scene-collection Lua/Python script registrations inside the scene collection JSON at:
```
%APPDATA%\obs-studio\basic\scenes\<Collection>.json
```
Each entry under `modules.scripts-tool[]` is an absolute Windows path. Renaming the Windows profile does not rewrite these — the JSON keeps pointing at the old `C:\Users\<old>\...` location, and OBS surfaces the resolution failure as a `(null)` popup on collection load.
## Diagnose
From WSL (or any shell with access to `%APPDATA%`):
```bash
OBS_DIR="/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio"
# 1. List scene collections
ls "$OBS_DIR/basic/scenes/"
# 2. Find collections referencing the missing script
grep -l -i "<script-name-substring>" "$OBS_DIR/basic/scenes/"*.json
# 3. Dump the scripts-tool paths from each suspect collection
python3 -c "
import json, sys
d = json.load(open(sys.argv[1]))
for s in d.get('modules', {}).get('scripts-tool', []):
    print(s.get('path'))
" "$OBS_DIR/basic/scenes/<Collection>.json"
```
If a printed path contains `C:/Users/<old-username>/...` and the file doesn't exist on disk, you've found it.
## Fix
> [!warning] Close OBS first
> OBS rewrites the scene collection JSON when it exits. Any edit made while OBS is running will be overwritten. Confirm with `tasklist.exe | grep obs64` (WSL) or Task Manager.
### 1. Make the missing script reachable
Either:
- **Re-extract / restore the script** to a path under the new profile (recommended — gives you a clean canonical home), or
- **Leave it in the rescue/migration folder** and point OBS there (fragile if the rescue folder is later deleted).
### 2. Back up the scene collection JSON
```bash
SCENES="/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio/basic/scenes"
STAMP="$(date +%Y%m%d-%H%M%S)"
cp -p "$SCENES/<Collection>.json" "$SCENES/<Collection>.json.$STAMP.bak"
```
### 3. Rewrite the paths atomically
Edit the JSON in place by parsing it, replacing the matched path strings, and writing through a temp file (so a crash mid-write can't corrupt the collection):
```bash
python3 <<'PY'
import json, os
scenes = "/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio/basic/scenes"
# old path -> new path; keep OBS's forward-slash style
mapping = {
    "C:/Users/<old>/Pictures/.../<script>.lua":
        "C:/Users/<new>/Pictures/.../<script>.lua",
}
for fn in ("<Collection>.json",):
    path = os.path.join(scenes, fn)
    with open(path, encoding="utf-8") as f:
        d = json.load(f)
    for entry in d.get("modules", {}).get("scripts-tool", []):
        if entry.get("path") in mapping:
            entry["path"] = mapping[entry["path"]]
    tmp = path + ".tmp"
    # write and close the temp file first, then swap it in with a single rename
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(d, f, indent=4)
    os.replace(tmp, path)
PY
```
OBS scene JSONs use forward slashes in Windows paths — preserve that style.
### 4. Verify
Re-run the diagnostic Python snippet and confirm every printed path resolves to a real file (translate `C:/` → `/mnt/c/` from WSL).
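The path translation in this check can be sketched in Python. This is a minimal sketch, assuming WSL's default automount layout (`/mnt/<drive-letter>/`) and UTF-8 scene files; the helper names `win_to_wsl` and `missing_scripts` are illustrative and not part of OBS.

```python
import json
import os
import re


def win_to_wsl(path):
    """Map 'C:/Users/x' or 'C:\\Users\\x' to '/mnt/c/Users/x'."""
    match = re.match(r"^([A-Za-z]):[/\\](.*)$", path)
    if not match:
        return path  # not a drive-letter path; return it unchanged
    drive, rest = match.groups()
    return "/mnt/" + drive.lower() + "/" + rest.replace("\\", "/")


def missing_scripts(collection_json):
    """Return scripts-tool paths from a scene collection that are not on disk."""
    with open(collection_json, encoding="utf-8") as fh:
        data = json.load(fh)
    entries = data.get("modules", {}).get("scripts-tool", [])
    paths = [entry.get("path") for entry in entries]
    # Keep only the registered paths whose translated location does not exist
    return [p for p in paths if p and not os.path.exists(win_to_wsl(p))]
```

An empty list from `missing_scripts("<path-to>/<Collection>.json")` would mean every registered script resolves; any path it returns is still stale.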
### 5. Reopen OBS
Load the scene collection. The popup should be gone.
## Why not just remove the script?
If the script is part of a third-party overlay pack (Twitch Pimpage, OWN3D, etc.), removing the registration also removes the overlay's source presets — fixing the path keeps the imported scenes intact. If you don't actually use the overlay anymore, removing the `scripts-tool` entry is fine; OBS will silently drop the broken reference on next save.
## Generalization
This same pattern applies to any OBS asset path stored in a scene collection or profile:
- Browser source local files
- Image / media source files
- Lua / Python script paths
- VST plugin paths
All of them are absolute, all of them survive a Windows profile rename in stale form, and all of them can be batch-rewritten with the same JSON-edit pattern above. Search for the old username substring across `%APPDATA%\obs-studio\` to catch them all in one pass.
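That one-pass search can be sketched as a small Python walker. It is a sketch only: the function name `files_mentioning` and the default extension filter are assumptions for illustration, and the match is a plain case-sensitive substring test.

```python
import os


def files_mentioning(root, needle, exts=(".json", ".ini")):
    """Return sorted (path, count) pairs for files under root containing needle."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(exts):
                continue  # only scan the config-style files named in exts
            full = os.path.join(dirpath, name)
            try:
                with open(full, encoding="utf-8", errors="replace") as fh:
                    count = fh.read().count(needle)
            except OSError:
                continue  # unreadable file; ignored in this sketch
            if count:
                hits.append((full, count))
    return sorted(hits)
```

From WSL, calling it with the `obs-studio` directory and `"C:/Users/<old>/"` would list each file alongside the number of stale references it holds.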
## Related
- [[../../MajorInfrastructure/Devices/MajorRig|MajorRig device note]] — Incident Log 2026-05-14 (TTT/MLS scene popups) and 2026-05-07 (`majli` profile retirement that left these references stranded)
- [[../04-streaming/obs/obs-studio-setup-encoding|OBS Studio Setup and Encoding Settings]]


@ -0,0 +1,225 @@
---
title: Patching PHP 8.4 Implicit-Nullable Deprecations in Vendor Packages
domain: troubleshooting
category: troubleshooting
tags:
- php
- php-8.4
- codeigniter
- castopod
- composer
- vendor
- deprecation
- troubleshooting
status: published
created: 2026-05-10
updated: 2026-05-10
---
# Patching PHP 8.4 Implicit-Nullable Deprecations in Vendor Packages
> **TL;DR** — PHP 8.4 deprecated implicit-nullable parameters (`function f(int $x = null)` without `?int`). Old vendor packages that haven't been updated will spam `E_DEPRECATED` warnings on every load. CodeIgniter (and similar frameworks) wrap each warning in a 23-frame stack trace, which on a per-minute cron multiplies into hundreds of MB/day of logs and a noticeable CPU floor on small VPS boxes. The fix is a four-line `sed` patch — but be very careful: a naive sed pattern can substring-match an *already-nullable* parameter and produce illegal `??type` syntax.
---
## Symptom
You're running a CodeIgniter 4 app (Castopod, BookStack, etc.) on PHP 8.4 with an older vendored library that hasn't been updated to declare nullable types properly. The combination produces:
- **Sustained CPU floor** on a 1 vCPU box (typically 15–25% baseline) when the framework's spark/cron scheduler runs every 60 seconds
- **Massive daily log volume** in `writable/logs/log-YYYY-MM-DD.log` — 50–100 MB per day is common
- Each WARNING line is followed by a **23-frame stack trace** through Composer's autoloader, the framework's autoloader, and the application's command/model entry point
- The actual scheduler task may report `Failed:` even though it logs no obvious error — the deprecation is fatal in some PHP/CodeIgniter combinations
The deprecation warnings look like:
```
WARNING - 2026-05-10 16:33:01 --> [DEPRECATED]
Vendor\Package\SomeModel::doFindAll(): Implicitly marking parameter $limit as nullable is deprecated,
the explicit nullable type must be used instead in
VENDORPATH/vendor/package/src/SomeModel.php on line 287.
```
## Why this matters more than a typical deprecation
Three multipliers turn "minor PHP deprecation" into "the box is on fire":
1. **Per-minute cron**: `php spark tasks:run` runs every 60 seconds. Each run loads the framework, hits the deprecation, and dumps a stack trace.
2. **CodeIgniter's error handler is verbose** — it catches `E_DEPRECATED` and writes a full backtrace to disk. There's no debug-vs-production split here.
3. **Small VPS boxes have a thin idle margin** — on a 1 vCPU droplet, sustained 22% from PHP startup overhead + log writes is enough to trip a default `>85% / 5min` DigitalOcean alert during traffic spikes.
## Diagnostic chain
### 1. Confirm the symptom is deprecation cascade, not autoload failure
The stack trace makes this look like an autoload error — it isn't. Check the WARNING line itself:
- **`[DEPRECATED] ... Implicitly marking parameter ... as nullable`** → vendor library + PHP 8.4 mismatch (this article applies)
- **`Class 'X' not found`** → actual autoload problem (different fix)
### 2. Identify the PHP version
```bash
php -v
```
If it's 8.4+, implicit-nullable is now `E_DEPRECATED`. (PHP 8.4.0 was released 2024-11-21; many distros bumped during 2025–26.)
### 3. List the offending lines
The log itself names them. Grep for the unique vendor-path pattern:
```bash
grep 'DEPRECATED' /var/www/<app>/writable/logs/log-$(date +%Y-%m-%d).log \
| awk -F'on line ' '{print $2}' | sort -u
```
You'll typically see three to six line numbers in one file — each parameter that needs `?` prefixing.
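The same extraction can be sketched in Python, which also captures the file name and drops the trailing period that the `awk` split leaves on the line number. This assumes each warning ends with `in <file>.php on line <N>` as in the sample log entry above; `deprecation_sites` is a hypothetical helper name.

```python
import re

# From the deprecation text to the nearest following "in <file>.php on line <N>".
# re.S lets the gap span a wrapped (multi-line) log entry.
SITE = re.compile(
    r"Implicitly marking parameter.*?\bin\s+(\S+\.php)\s+on\s+line\s+(\d+)",
    re.S,
)


def deprecation_sites(log_text):
    """Return sorted unique (file, line) pairs named by the warnings."""
    sites = set()
    for path, line in SITE.findall(log_text):
        sites.add((path, int(line)))
    return sorted(sites)
```

Feeding it the contents of the day's log should yield the short list of lines to inspect in the next step.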
### 4. Inspect each line before patching
```bash
F=/var/www/<app>/vendor/<package>/src/<File>.php
sed -n '287p;520p' "$F" # Show only the lines named by the warnings
```
Look for **already-prefixed** parameters in the same function or nearby — if `?type $foo = null` already exists in the file, your sed pattern must not match it.
## The fix — anchored sed
**Step 1: Backup.**
```bash
F=/var/www/<app>/vendor/<package>/src/<File>.php
sudo cp -p "$F" "$F.bak.$(date +%Y%m%d-%H%M%S)"
```
**Step 2: Apply patches with anchors.** Don't use bare patterns like `int \$limit = null` — they'll substring-match against `?int \$limit = null` (an already-nullable parameter elsewhere in the file) and produce `??int $limit = null`, which PHP rejects as a `ParseError: unexpected token "??"`.
Anchor on the function signature:
```bash
sudo sed -i \
-e 's|^\(\s*protected function doFindAll(\)int \$limit = null|\1?int $limit = null|' \
-e 's|^\(\s*protected function doUpdateBatch(\)array \$set = null, string \$index = null|\1?array $set = null, ?string $index = null|' \
"$F"
```
For constructors with reference operators (`&$db`), include the `&` in the anchor:
```bash
sudo sed -i 's|\(function __construct(\)ConnectionInterface &\$db = null|\1?ConnectionInterface \&$db = null|' "$F"
```
**Step 3: Lint immediately.**
```bash
sudo php -l "$F"
# Must print: No syntax errors detected in <path>
```
If lint fails, restore from the backup and try a tighter anchor — don't chain another sed onto a broken file.
**Step 4: Verify the runtime.**
```bash
sudo -u www-data php /var/www/<app>/spark tasks:run | grep -E '(Executed|Failed)'
```
The previously failing task should now show `Executed`.
**Step 5: Confirm the log bleed stops.** Wait 60s, then:
```bash
LOG=/var/www/<app>/writable/logs/log-$(date +%Y-%m-%d).log
SINCE=$(date -d '60 seconds ago' '+%H:%M:%S')
awk -v t="$SINCE" '/DEPRECATED/ && $4>=t' "$LOG" | wc -l
# Expect: 0
```
## The substring-match gotcha (the one that bit me)
This is the failure mode that turns a 30-second fix into a 30-minute incident:
```bash
# DANGEROUS
sed -i 's|array \$set = null|?array $set = null|' "$F"
```
That pattern matches both:
- `protected function doUpdateBatch(array $set = null, string $index = null, …)` — the line you want to fix
- `protected function doInsertBatch(?array $set = null, ?bool $escape = null, int $batchSize = 100)` — somewhere else in the file, where the same `array $set = null` substring sits **inside** an already-nullable signature you don't want to touch
After sed, the second line becomes `??array $set = null`, which is illegal in PHP. The first time the autoloader tries to load the file, you get:
```
ParseError: syntax error, unexpected token "??", expecting variable
at vendor/.../src/<File>.php:426
```
Recovery is restore-from-backup, then re-apply with anchored patterns. **Always lint before reload, before flush, before next anything.**
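The failure mode is easy to reproduce outside the vendor tree. The sketch below uses Python's `re` module on shortened versions of the two `$set` signatures (not the full vendor source). The guarded pattern relies on a negative lookbehind, which POSIX `sed` does not offer, so treat it as an illustration of why the anchor matters rather than a drop-in replacement for the anchored `sed` commands.

```python
import re

# Bare substring pattern: the Python equivalent of an unanchored sed expression.
NAIVE = re.compile(r"array \$set = null")
# Guarded pattern: refuses to match when '?' or an identifier character precedes the type.
GUARDED = re.compile(r"(?<![?\w])array \$set = null")


def add_nullable(pattern, line):
    """Prefix every match of pattern in line with '?'."""
    return pattern.sub(lambda m: "?" + m.group(0), line)


needs_fix = "protected function doUpdateBatch(array $set = null"
already_ok = "protected function doInsertBatch(?array $set = null"

print(add_nullable(NAIVE, already_ok))    # double-prefixed; PHP would not parse it
print(add_nullable(GUARDED, already_ok))  # left untouched
print(add_nullable(GUARDED, needs_fix))   # gains a single '?'
```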
## Are reference parameters tricky? Yes.
`&$db` (pass-by-reference) needs the ampersand preserved when adding the `?` prefix:
| Before | After |
|---|---|
| `ConnectionInterface &$db = null` | `?ConnectionInterface &$db = null` |
| `array &$rows` | `?array &$rows` |
In sed, escape the ampersand in the replacement (`\&`) because unescaped `&` in the replacement means "the matched text." Easy way to test the right escaping: run sed with `--debug` or do a dry-run with `-n` and `p`.
## Bonus: hunt for stray debug prints while you're in there
When you're already grepping the application source for one issue, scan for sloppy `log_message('critical', ...)` calls left in by upstream developers. Real-world finds include:
- `log_message('critical', 'ITS HEEEEEEEEEEEERE');` — left in Castopod's `modules/Fediverse/Filters/FediverseFilter.php` line 62, firing on every fediverse request, contributing 195 CRITICAL entries to one day's log
- `log_message('critical', 'TODO');`
- `log_message('critical', 'wtf');`
```bash
grep -rE "log_message\(['\"]critical['\"]" /var/www/<app>/modules/ /var/www/<app>/app/ \
| grep -v -E 'TODO|FIXME' \
| head -10
```
These are usually safe to remove (or downgrade to `debug` level) — they don't represent real failure conditions, just developer artifacts.
## Why not just upgrade the vendor package?
`composer update <package>` is the proper fix. But:
- Many PHP applications (Castopod especially) ship pre-built `vendor/` and don't expect composer to be installed at runtime
- A major version bump (`v1.x → v2.x`) implies API changes that the application may not handle
- `composer update` may pull in cascading dependency updates you don't want
Hot-patching is the right answer when:
- The application doesn't ship with `composer.json` referencing the package directly
- The fix is purely syntactic (parameter type declarations)
- A future application release will likely include the upgraded vendor anyway
Just **document the patch** and add a follow-up task to re-apply (or skip) after the next application upgrade. Without that note, the next time the box is rebuilt or upgraded, you'll spend another evening chasing the same stack trace.
## Specific examples observed in the MajorsHouse fleet
### Castopod 1.20+ on PHP 8.4
`vendor/michalsn/codeigniter4-uuid/src/UuidModel.php` v1.3.1 — four nullable-prefix corrections needed:
| Line | Original | Patched |
|---|---|---|
| 54 | `__construct(ConnectionInterface &$db = null, ValidationInterface $validation = null)` | `__construct(?ConnectionInterface &$db = null, ?ValidationInterface $validation = null)` |
| 287 | `doFindAll(int $limit = null, int $offset = 0)` | `doFindAll(?int $limit = null, int $offset = 0)` |
| 520 | `doUpdateBatch(array $set = null, string $index = null, …)` | `doUpdateBatch(?array $set = null, ?string $index = null, …)` |
Line 426 (`doInsertBatch(?array $set = null, ?bool $escape = null, …)`) was already correct — the substring-match gotcha above was triggered by it.
Upstream `michalsn/codeigniter4-uuid` v2.0.0 (released 2024) declares all parameters with explicit `?type` syntax and has no deprecation warnings. Castopod hadn't upgraded the dependency as of Castopod 1.20.
## See also
- [Castopod Posts Don't Appear on Mastodon — Diagnosing the Federation Path](security/castopod-broadcast-not-on-mastodon.md) — tttpod-specific diagnostic
- [PHP RFC: Deprecate implicitly nullable parameter types](https://wiki.php.net/rfc/deprecate-implicitly-nullable-types) — the canonical PHP 8.4 reference


@ -0,0 +1,154 @@
---
title: "Castopod Posts Don't Appear on Mastodon — Diagnosing the Federation Path"
domain: troubleshooting
category: security
tags: [castopod, mastodon, fediverse, activitypub, federation, notifications]
status: published
created: 2026-05-10
updated: 2026-05-10
---
# Castopod Posts Don't Appear on Mastodon — Diagnosing the Federation Path
## 🛑 Problem
You publish a podcast episode (or a standalone post) on Castopod. The Castopod admin shows it went out fine. But on the Mastodon account where you *expected* to see it — your own personal account, an account that follows your podcast, a colleague's — the post never shows up. Or it shows up in the home timeline but the notification bell never rings.
Three different failure modes hide behind "I didn't get the post." This article walks the diagnostic chain that distinguishes them.
---
## 🔬 The four checks, in order
Run these in sequence. The first one that fails tells you what's actually wrong.
### Check 1 — Did Castopod create the post?
On the Castopod host:
```sh
mysql -u $CP_DB_USER -p$CP_DB_PASS $CP_DB_NAME --binary-as-hex -e "
SELECT HEX(id), actor_id, LEFT(message,80), episode_id, published_at, created_at
FROM cp_fediverse_posts
ORDER BY created_at DESC LIMIT 5
"
```
If your post isn't here at all, Castopod didn't generate it. That's a Castopod-side bug — check `writable/logs/log-<date>.log`, verify the per-minute task scheduler is firing (`php spark tasks:list` should show `Last Run` for `fediverse-broadcast`), and confirm the cron exists:
```sh
sudo crontab -u www-data -l | grep tasks:run
# expect: * * * * * php /var/www/html/castopod/spark tasks:run >> /dev/null 2>&1
```
### Check 2 — Did Castopod queue and deliver the activity?
```sh
mysql -u $CP_DB_USER -p$CP_DB_PASS $CP_DB_NAME --binary-as-hex -e "
SELECT HEX(id), actor_id, type, status, scheduled_at, created_at
FROM cp_fediverse_activities
WHERE type='Create'
ORDER BY created_at DESC LIMIT 10
"
```
The `status` column tells you everything:
| Status | Meaning |
|---|---|
| `queued` | Sitting in the queue, broadcast task hasn't run yet (or is bogged down) |
| `processing` | In-flight |
| `delivered` | All follower inboxes returned 2xx |
| `failed` | One or more inbox POSTs returned non-2xx, gave up after retries |
If `status='delivered'`, Castopod has done its job — and yet someone says they didn't see the post. Move to Check 3.
### Check 3 — Are they actually a follower?
The single most common cause of "I didn't see it." Federation only delivers `Create` activities to **followers** (and to anyone explicitly mentioned). Interacting with a post (favourite, boost) does NOT establish a follow relationship.
On the Castopod host:
```sh
mysql -u $CP_DB_USER -p$CP_DB_PASS $CP_DB_NAME -e "
SELECT a.username, a.domain, f.created_at
FROM cp_fediverse_follows f
JOIN cp_fediverse_actors a ON a.id = f.actor_id
ORDER BY f.created_at DESC
"
```
`cp_fediverse_follows.actor_id` is the **follower** (remote actor); `target_actor_id` is your local podcast actor. If the user's `username@domain` isn't in this list, they don't follow your podcast, and the Create activity was never sent to their inbox.
Cross-check from the Mastodon side (if you control both):
```sh
sudo -u postgres psql mastodon_production -t -A -c "
SELECT a.username, a.domain
FROM follows f
JOIN accounts a ON a.id = f.target_account_id
WHERE f.account_id = <mastodon-account-id>
AND a.domain = '<your-castopod-domain>'
"
```
Empty result on both sides = they're not following. **Resolution: have them search `@yourpodcast@yourdomain.tld` in their Mastodon and click Follow.**
A subtler corner of this check: `accounts WHERE domain='<your-castopod-domain>'` returning 0 rows on the Mastodon side means Mastodon has never even webfingered your podcast actor. The user may have *thought* they followed at some point, but it never went through (e.g., they typed the handle wrong, or the follow request errored).
### Check 4 — Is "didn't see it" a notification problem, not a delivery problem?
Even after a successful follow, the post lands in the **home timeline** by default. Mastodon **notifications** (the bell icon, the unread badge) fire for a specific list of activity types — and "new post from someone I follow" isn't one of them. Notifications fire for:
- Mentions (`@you` in the post body)
- Follows (someone follows you)
- Favourites of your posts
- Boosts of your posts
- Polls ending
- Status edits (post you favourited was edited)
- Admin alerts
So even with delivery working perfectly and the follow in place, "I didn't get a notification on my account" is the expected state for a regular podcast post. Three ways to make notifications happen:
1. **Bell icon on the followed profile.** Mastodon UI: open the followed account's profile → click the bell. Enables per-account post notifications. Now every new post from that account raises a notification.
2. **`@`-mention in the post.** Have Castopod include `@you@yourdomain.tld` in the post text. Mention activities always raise notifications regardless of follow/bell state. (You may not control the post text on someone else's Castopod, but you control your own.)
3. **Cross-post via a different actor.** If you also run a Mastodon account for the show, post manually from there and `@`-mention the audience accounts you want to page.
---
## 🧪 Worked example
A real case: someone running a podcast on Castopod 2.0.0-next.4 expected a new episode's auto-post to appear on their personal Mastodon. It didn't.
- Check 1 → post present in `cp_fediverse_posts`, episode_id correct ✓
- Check 2 → matching `cp_fediverse_activities` row, `type='Create'`, `status='delivered'` ✓
- Check 3 → 8 followers in `cp_fediverse_follows`, none from the personal Mastodon's domain ✗
Outcome: the user wasn't following their own podcast. They had been favouriting and boosting its posts (which doesn't require following), and assumed those interactions implied a follow. Resolution: search-and-follow from the personal Mastodon. After the follow propagated, future broadcasts arrived as expected.
The post itself never raised a notification (only landed in home timeline). They later enabled the bell icon on the podcast profile and started getting notified on new episodes.
---
## 🧭 When this isn't the answer
If Check 3 shows the person IS a follower but they still didn't receive the post:
- **Check their inbox** if you have access to the Mastodon nginx access log:
```sh
sudo grep '<castopod-ip-or-domain>' /var/log/nginx/access.log | grep inbox
```
Expect a `POST /users/<them>/inbox HTTP/2.0 202` from the Castopod IP shortly after `published_at`. No POST = Castopod didn't deliver despite claiming `status='delivered'` (rare; check Castopod's HTTP signing config and any outbound firewall on the Castopod host).
- **Check Sidekiq** on Mastodon for `ActivityPub::ProcessingWorker` failures around the activity timestamp.
- **Check domain blocks**: `SELECT * FROM domain_blocks WHERE domain = '<castopod-domain>'` on Mastodon. A silenced or suspended domain on either end would explain everything.
---
## 📚 References
- ActivityPub spec — [Delivery semantics](https://www.w3.org/TR/activitypub/#delivery): `Create` activities go to actors in `to`/`cc`/`bcc`/`audience`; for public posts that resolves to the actor's followers collection.
- Mastodon notification types: `app/models/notification.rb` → `TYPES` constant
- Castopod fediverse module: `modules/Fediverse/Commands/Broadcast.php` (the per-minute task) and `modules/Fediverse/Models/ActivityModel.php` (queue model)
- Related: [Castopod: Stale Federated Avatar URLs After Remote Profile Updates](castopod-stale-federated-avatar.md) — sister article, also Castopod fediverse module


@ -0,0 +1,190 @@
---
title: "Castopod: Stale Federated Avatar URLs After Remote Profile Updates"
domain: troubleshooting
category: security
tags: [castopod, mastodon, fediverse, activitypub, s3, federation]
status: published
created: 2026-05-08
updated: 2026-05-08
---
# Castopod: Stale Federated Avatar URLs After Remote Profile Updates
## 🛑 Problem
Your Castopod admin pages — most visibly the notifications list (`/cp-admin/podcasts/<id>/notifications`) — show broken avatars for federated actors. The browser dev tools (or a direct `curl -I`) on the avatar URL returns:
```
HTTP/1.1 403 Forbidden
Server: AmazonS3
```
…with the response body:
```xml
<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
...
</Error>
```
The hostname is the remote instance's S3 bucket (e.g. `s3.amazonaws.com/<their-bucket>/accounts/avatars/...`). Other actors in the same notifications list — those with avatars on Mastodon's own CDN, or on instances using path-stable storage — render fine.
This article explains *why* the alarm code is misleading, *what's actually broken*, and how to fix it on Castopod.
---
## 🔬 Why "AccessDenied" is misleading
S3 returns `403 AccessDenied` to anonymous requesters for **any** missing object — by design, as anti-enumeration. Anonymous users typically don't have `s3:ListBucket` permission on the bucket, so S3 deliberately doesn't reveal whether the key is missing or merely forbidden. Both cases produce the same 403.
So when you see `403 AccessDenied` on a remote avatar URL, **the actual problem is almost always that the object no longer exists**. The bucket is fine; the file is gone.
### Verifying that interpretation
If you have access to the remote instance (or to S3 credentials for that bucket):
```sh
aws s3api head-object --bucket <bucket> --key accounts/avatars/.../<filename>.jpeg
```
If you see `An error occurred (404) when calling the HeadObject operation: Not Found`, the object is genuinely gone — and the upstream user has updated their avatar.
---
## 🔍 What's actually broken
Mastodon (and most ActivityPub servers using Paperclip-style storage) **deletes the old object** on avatar replacement and stores only the current filename in the DB. The remote instance is functioning normally — its current `<img>` URL points to a different filename and serves correctly.
Castopod 2.0.0 (verified up to `2.0.0-next.4`) **caches the avatar URL** of every federated actor in `cp_fediverse_actors.avatar_image_url` when it first sees activity from that actor — and never refetches. The admin templates (e.g. `themes/cp_admin/podcast/notifications.php`) emit that stored URL directly into `<img src>`. Once the upstream replaces the avatar:
- Old object deleted → S3 returns 403 to anonymous fetchers
- Castopod still renders the dead URL forever
- Every cached page using that template shows a broken image
The same pattern applies to `cover_image_url` (header).
---
## ✅ Fix
You have three options, in increasing order of "this stays fixed."
### Option 1 — Manual SQL update (one-shot)
Recommended for one or two stale actors. Get the current URL from the upstream instance.
If the upstream is your own Mastodon instance:
```sh
sudo -u postgres psql mastodon_production -t -A \
-c "SELECT id, avatar_file_name, header_file_name FROM accounts WHERE username='<their-username>'"
```
Construct the canonical URL using the standard Paperclip path scheme. For an account ID like `109326168175475699`, the path is built by chunking the ID three digits at a time:
```
accounts/avatars/109/326/168/175/475/699/original/<avatar_file_name>
accounts/headers/109/326/168/175/475/699/original/<header_file_name>
```
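The chunking rule can be sketched as a short Python helper. This sketch only covers IDs whose digit count is a multiple of three, like the 18-digit example above, and makes no claim about how shorter or longer IDs are partitioned; the function names are illustrative.

```python
def paperclip_id_path(account_id):
    """Split an account ID into three-digit path segments."""
    digits = str(account_id)
    if not digits.isdigit() or len(digits) % 3 != 0:
        # Other lengths are out of scope for this sketch
        raise ValueError("sketch only handles IDs whose length is a multiple of 3")
    return "/".join(digits[i:i + 3] for i in range(0, len(digits), 3))


def avatar_key(account_id, avatar_file_name):
    """Build the object key for the current avatar."""
    return "accounts/avatars/" + paperclip_id_path(account_id) + "/original/" + avatar_file_name
```

Prefix the returned key with the remote instance's bucket URL to get the value for `avatar_image_url`.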
Then UPDATE the Castopod row:
```sh
mysql -u $CP_DB_USER -p$CP_DB_PASS $CP_DB_NAME <<'SQL'
UPDATE cp_fediverse_actors
SET avatar_image_url = 'https://<s3-host>/<bucket>/accounts/avatars/109/326/168/175/475/699/original/<new>.jpeg',
cover_image_url = 'https://<s3-host>/<bucket>/accounts/headers/109/326/168/175/475/699/original/<new>.jpg',
updated_at = NOW()
WHERE username = '<their-username>'
AND domain = '<their-domain>';
SQL
```
Then clear the Castopod cache so any cached HTML rerenders:
```sh
cd /var/www/html/castopod
sudo -u www-data php spark cache:clear
```
Verify:
```sh
curl -sI 'https://<new-url>' | head -1 # expect HTTP/1.1 200 OK
```
### Option 2 — Delete and let Castopod refetch
For a one-shot self-healing fix, delete the actor row entirely:
```sql
DELETE FROM cp_fediverse_actors WHERE username='<u>' AND domain='<d>';
```
Castopod will repopulate the row from the next inbound activity from that actor (favourite, boost, mention, follow…). **Caveat — verify foreign-key cascades first:** `cp_fediverse_favourites`, `cp_fediverse_follows`, `cp_fediverse_posts`, and `cp_fediverse_notifications` all reference `actor_id`. Depending on the migration version, ON DELETE may cascade or restrict. Check with:
```sh
mysql -u $CP_DB_USER -p$CP_DB_PASS $CP_DB_NAME -e "
SELECT TABLE_NAME, CONSTRAINT_NAME, DELETE_RULE
FROM information_schema.REFERENTIAL_CONSTRAINTS
WHERE CONSTRAINT_SCHEMA = '$CP_DB_NAME'
AND REFERENCED_TABLE_NAME = 'cp_fediverse_actors';
"
```
If deletes cascade, you'll lose the activity history attributed to that actor. Use Option 1 instead.
### Option 3 — Bulk audit and update
If multiple federated actors have likely-stale avatars (any old enough that an upstream user might have refreshed their profile picture), audit them all:
```sh
mysql -u $CP_DB_USER -p$CP_DB_PASS $CP_DB_NAME -BNe "
SELECT id, username, domain, avatar_image_url
FROM cp_fediverse_actors
WHERE avatar_image_url IS NOT NULL
" | while IFS=$'\t' read -r id user dom url; do
code=$(curl -s -o /dev/null -w "%{http_code}" "$url")
[ "$code" != "200" ] && echo "BROKEN $code $id $user@$dom $url"
done
```
For each broken row, fetch the upstream's current actor JSON and update from `icon.url` / `image.url`:
```sh
curl -s -H 'Accept: application/activity+json' \
"https://<their-domain>/users/<their-username>" | jq '{icon, image}'
```
Then run the Option 1 SQL update with the fresh URLs.
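If you script the bulk update, pulling the two URLs out of the actor JSON might look like the sketch below. It assumes `icon` and `image` are either a single object with a string `url` or a list of such objects; other ActivityStreams shapes simply come back as `None`. Both function names are hypothetical.

```python
def _first_url(value):
    """Reduce an ActivityStreams image value to a URL string, or None."""
    if isinstance(value, list):
        value = value[0] if value else None
    if isinstance(value, dict):
        value = value.get("url")
    return value if isinstance(value, str) else None


def actor_image_urls(actor):
    """Map parsed actor JSON to the two Castopod columns updated in Option 1."""
    return {
        "avatar_image_url": _first_url(actor.get("icon")),
        "cover_image_url": _first_url(actor.get("image")),
    }
```

A `None` value means the actor document did not advertise that image, so the matching column is best left alone rather than overwritten.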
---
## 🧪 Why this isn't fixable on the upstream side
Once the old object is deleted, you can't restore the URL without re-uploading bytes to the **exact original key** — which Mastodon won't do, because its DB only knows about the new filename. Trying to "fix" it on the Mastodon side means resurrecting a file Mastodon has no record of and that no fresh ActivityPub request would emit a URL for. The fix has to live on the consumer (Castopod) because Castopod is the one holding the stale reference.
This applies to every federation consumer that caches URLs by reference rather than fetching bytes locally. Mastodon, Pleroma, Akkoma, and Misskey all cache the bytes; that's why they self-heal across remote avatar swaps. Castopod 2.0.0 currently does not.
---
## 🛠 Long-term mitigations
This is a Castopod design issue worth raising upstream:
- Add a `last_refreshed_at` to `cp_fediverse_actors` and a worker that refetches actor JSON on a schedule.
- Or fetch and store avatars locally on first sight, the way Mastodon does.
A `fediverse:refresh-actor` spark command would also let admins fix stale rows without writing SQL.
If you have a recurring case (you update your Mastodon avatar often, and you also operate a Castopod instance under your own control), keep the Option 1 SQL handy as a one-liner. After your own avatar update, run it within minutes and the dead-URL window closes before it spreads to many cached pages.
---
## 📚 References
- Castopod source (`themes/cp_admin/podcast/notifications.php`) — uses `avatar_image_url` directly in `<img src>`
- AWS S3 anti-enumeration: `403` vs `404` is bucket-policy-dependent; see [GetObject — Permissions Required](https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html#API_GetObject_RequestPermissions)
- Mastodon Paperclip storage layout: `accounts/avatars/<3-digit chunks of account id>/original/<file_name>`
- Related fix patterns: [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](netdata-web-log-successful-redirect-heavy-tuning.md) — shares the "the alarm is technically correct, but means something different than you think" theme


@ -1,11 +1,17 @@
---
title: "ClamAV Safe Scheduling on Live Servers"
title: ClamAV Safe Scheduling on Live Servers
domain: troubleshooting
category: security
tags: [clamav, cpu, nice, ionice, cron, vps]
tags:
- clamav
- cpu
- nice
- ionice
- cron
- vps
status: published
created: 2026-04-02
updated: 2026-04-02
updated: 2026-05-11T18:31
---
# ClamAV Safe Scheduling on Live Servers
@ -75,6 +81,7 @@ kill <PID>
- `ionice -c 3` (Idle) requires Linux kernel ≥ 2.6.13 and CFQ/BFQ I/O scheduler. Works on most Ubuntu/Debian/Fedora systems.
- On multi-core servers, consider also using `cpulimit` for a hard cap: `cpulimit -l 30 -- clamscan ...`
- Always keep `--exclude=/sys` (and optionally `--exclude=/proc`, `--exclude=/dev`) to avoid scanning virtual filesystems.
- **1 vCPU limitation:** `nice` and `ionice` only help when other processes compete for resources. On a single-core VPS, clamscan will still saturate the CPU at 57-100% even with `nice -n 19 ionice -c 3` — there's nothing to yield to. Accept the weekly spike as benign, or reduce scan scope to shorten the window.
## Related


@ -0,0 +1,116 @@
---
title: "Fedora CA Bundle Missing Symlink — TLS Breaks Fleet-Wide"
description: Hetzner-provisioned Fedora images may be missing the /etc/pki/tls/certs/ca-bundle.crt symlink, silently breaking Postfix TLS relay, curl, and dnf
tags:
- fedora
- tls
- postfix
- ca-certificates
- hetzner
- troubleshooting
status: published
created: 2026-05-11
updated: 2026-05-11
---
# Fedora CA Bundle Missing Symlink
On Fedora, many TLS clients (Postfix, curl, dnf) look for the CA bundle at `/etc/pki/tls/certs/ca-bundle.crt`. This path is normally a symlink to `/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem`, shipped by the `ca-certificates` package.
On Hetzner Cloud Fedora images (observed on Fedora 44, May 2026), this symlink can be missing despite `ca-certificates` being installed. The extracted bundle exists, but the consumer-facing symlink does not.
## Symptoms
Postfix relay to a TLS-required upstream fails:
```
postfix/smtp: cannot load Certification Authority data,
CAfile="/etc/pki/tls/certs/ca-bundle.crt",
CApath="/etc/pki/tls/certs": disabling TLS support
```
If your relay requires TLS (port 465 with `smtp_tls_wrappermode = yes`, or `smtp_tls_security_level = encrypt`), mail silently queues as deferred. No bounce, no alert — just silence.
Other symptoms on the same box:
```bash
# curl fails
curl https://example.com
# error: Problem with the SSL CA cert (path? access rights?)
# dnf fails
dnf list --installed
# Curl error (77): Problem with the SSL CA cert
```
## Diagnosis
```bash
# Check the symlink
ls -la /etc/pki/tls/certs/ca-bundle.crt
# Expected: symlink -> /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
# Broken: "No such file or directory"
# Verify the extracted bundle exists
ls -la /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
# Should exist (~220 KB, ~140-150 certs)
# Confirm the package is installed
rpm -q ca-certificates
# Should return a version string
```
If the extracted bundle exists but the symlink at `/etc/pki/tls/certs/ca-bundle.crt` is missing, that's the problem.
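The three checks can be folded into one classification, sketched here in Python. The state names are made up for this example; the two default paths are the ones used throughout this page.

```python
import os

LINK = "/etc/pki/tls/certs/ca-bundle.crt"
BUNDLE = "/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem"


def ca_bundle_state(link=LINK, bundle=BUNDLE):
    """Classify a host as 'ok', 'missing-link', 'dangling-link', or 'missing-bundle'."""
    if not os.path.exists(bundle):
        return "missing-bundle"  # the extracted bundle itself is absent; a different problem
    if not os.path.lexists(link):
        return "missing-link"    # the case this page describes
    if not os.path.exists(link):
        return "dangling-link"   # a symlink exists but its target does not
    return "ok"
```

Only `missing-link` and `dangling-link` are addressed by the symlink fix; `missing-bundle` points at the `ca-certificates` package or `update-ca-trust` instead.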
## Fix
```bash
sudo ln -sf /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem \
/etc/pki/tls/certs/ca-bundle.crt
sudo systemctl restart postfix
sudo postqueue -f # flush any deferred mail
```
Verify:
```bash
# Symlink exists
ls -la /etc/pki/tls/certs/ca-bundle.crt
# Postfix can relay
echo "Subject: TLS test" | sendmail -v marcus@majorshouse.com
# curl works
curl -sI https://example.com | head -1
```
## Fleet Audit
If one Hetzner-provisioned Fedora host has this issue, check the others:
```bash
for host in majordiscord majorlab majorhome majormail; do
echo "$host: $(ssh root@$host 'ls /etc/pki/tls/certs/ca-bundle.crt 2>&1' | tail -1)"
done
```
Hosts returning "No such file or directory" are silently broken for all TLS operations.
## Why This Happens
`update-ca-trust extract` regenerates the files under `/etc/pki/ca-trust/extracted/` but does not create the legacy consumer-path symlink at `/etc/pki/tls/certs/ca-bundle.crt`. That symlink is shipped by the `ca-certificates` RPM. On cloud images built from minimal installs or snapshot-based provisioning, the symlink can be lost during image creation or a partial upgrade.
## Prevention
Add to your provisioning checklist (see [VPS Migration Baseline Checklist](../../02-selfhosting/cloud/vps-migration-baseline-checklist.md)):
```bash
# Fedora provisioning — verify CA bundle symlink
ls /etc/pki/tls/certs/ca-bundle.crt || \
ln -sf /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem /etc/pki/tls/certs/ca-bundle.crt
```
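The one-liner above can also be wrapped in a small idempotent helper. This is a minimal sketch, not part of the original checklist: the function name `ensure_ca_symlink` and its three status words are hypothetical, and it assumes it runs as root with the paths passed in as arguments.

```bash
# ensure_ca_symlink LINK TARGET  (hypothetical helper name)
# Prints one of: ok | repaired | bundle-missing
ensure_ca_symlink() {
    local link="$1" target="$2"
    # Without the extracted bundle, a new symlink would only dangle.
    if [ ! -s "$target" ]; then
        echo "bundle-missing"
        return 1
    fi
    # -e follows symlinks, so a dangling link is treated as missing.
    if [ -e "$link" ]; then
        echo "ok"
        return 0
    fi
    ln -sf "$target" "$link" && echo "repaired"
}
```

On a Fedora host the call would be `ensure_ca_symlink /etc/pki/tls/certs/ca-bundle.crt /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem`. A `bundle-missing` result points at the `ca-certificates` package rather than the symlink.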
## Related
- [Logwatch Fleet Setup](../../02-selfhosting/monitoring/logwatch-fleet-setup.md) — logwatch depends on a working Postfix relay, which depends on TLS, which depends on this symlink
- [VPS Migration Baseline Checklist](../../02-selfhosting/cloud/vps-migration-baseline-checklist.md) — includes CA bundle verification step


@ -0,0 +1,80 @@
---
title: "Logwatch Falsely Reports 'No freshclam updates' in ClamAV Daemon Mode"
domain: troubleshooting
category: security
tags: [clamav, freshclam, logwatch, false-positive, fedora, ubuntu, ansible]
status: published
created: 2026-06-06
updated: 2026-06-06
---
# Logwatch Falsely Reports "No freshclam updates" in ClamAV Daemon Mode
Logwatch's daily `clam-update` section emails:
> No updates detected in the log for the freshclam daemon (the ClamAV update process). If the freshclam daemon is not running, you may need to restart it.
…even though freshclam **is** running and signatures **are** current. It's a parser quirk specific to running freshclam as a daemon. Don't act on the "restart it" suggestion — first confirm whether signatures are actually stale.
> Seen on **tttpod** (2026-06-06). All four freshclam hosts (majorlinux, majortoot-hetzner, teelia, tttpod) hit this on quiet days.
## First: is it real or false?
```bash
systemctl is-active clamav-freshclam # active?
ls -l /var/lib/clamav/daily.c[lv]d # mtime today/yesterday?
grep 'updated' /var/log/clamav/freshclam.log | tail # real download events
```
- **Fresh `daily.cld` + active service → false positive** (this page).
- **`daily.cld` weeks old / service disabled → real.** Re-enable freshclam and update (see Related). A daemonless box still needs freshclam enabled — `clamav_use_daemon: false` only disables the *scanner* daemon, not the updater.
## Why It False-Alarms
logwatch's `clam-update` script (`/usr/share/logwatch/scripts/services/clam-update`) decides "updated" by counting **`ClamAV update process started`** lines (`$UpdatedNum`) within its range (`Range = yesterday`). It does **not** count the actual `daily.cld updated (version: …)` download lines.
freshclam emits "update process started" **only when the daemon (re)starts** — not on its periodic in-daemon checks (`Checks 24`, `ExecStart=/usr/bin/freshclam -d`). So on any day the box doesn't reboot or restart freshclam, yesterday's log has zero "started" lines → `$UpdatedNum == 0` → the warning fires, regardless of whether signatures downloaded. (Conversely, on a day you *do* reboot, the warning won't fire.) The script was written for the old cron-driven freshclam, which started a fresh process each run.
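To see the mismatch on a given log, it can help to count both kinds of line side by side. The sketch below is illustrative only: the function name `freshclam_log_counts` is hypothetical, and it matches the two strings quoted above rather than reproducing logwatch's own Perl parser or its date-range filtering.

```bash
# freshclam_log_counts LOGFILE  (hypothetical helper name)
# "started" is what logwatch counts; "updated" is what actually matters.
freshclam_log_counts() {
    local log="$1" started updated
    # grep -c prints 0 but exits non-zero on no match, hence "|| true".
    started=$(grep -c 'ClamAV update process started' "$log" || true)
    updated=$(grep -c 'updated (version' "$log" || true)
    echo "started=$started updated=$updated"
}
```

A result such as `started=0 updated=1` is the false-positive signature: signatures downloaded, but nothing restarted the daemon.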
## Fix
Silence just that one message — real `ERROR` / `WARNING` / outdated alerts still report:
```bash
# /etc/logwatch/conf/services/clam-update.conf
$ignore_no_updates = 1
```
No service restart needed; logwatch picks it up on its next daily run. (The variable is read as `$ENV{'ignore_no_updates'}` by the script — note: **not** prefixed `clam_update_`, despite what the script's own self-help text suggests.)
## Codify (Ansible)
Deploy the drop-in wherever freshclam runs in daemon mode. On the fleet it's a task in the `clamav` role (`roles/clamav/tasks/install.yml`, group `clamav`), right after freshclam is enabled — originally added in MajorAnsible commit `cb27c93`:
```yaml
- name: Suppress logwatch clam-update false "no updates" alert (daemon-mode freshclam)
  ansible.builtin.copy:
    dest: /etc/logwatch/conf/services/clam-update.conf
    mode: '0644'
    content: |
      $ignore_no_updates = 1
  tags: [logwatch]
```
## Key Notes
- **Confirm freshness before suppressing.** If signatures really are stale (freshclam off / no update timer), suppressing hides a genuine security gap. On a daemonless host that disabled freshclam, the warning is *true*.
- The script's built-in options B/C (about syslog format) don't apply when freshclam logs to its own file (`LogSyslog false`); `$ignore_no_updates` is the right lever.
- **Don't alert with `mail`.** The `mail`/`mailx` CLI is absent on most fleet hosts (only Postfix's `/usr/sbin/sendmail` is guaranteed). A health script that ends in `mail -s … root` silently fails to send. Pipe a headered message to `/usr/sbin/sendmail -t` addressed to `admin_email` directly (don't rely on an `/etc/aliases` `root` rewrite either).
## Proactive monitoring (don't rely on logwatch for "is it updating?")
Since logwatch's heuristic is suppressed, a **direct daily watchdog** is what actually catches a dead freshclam. The `clamav` role deploys `/etc/cron.daily/clamav-freshness` (originally MajorAnsible `9d1a1a9`) to every `clamav`-group host: it emails the admin (via `sendmail`) if `clamav-freshclam` is inactive **or** `daily.cld` is older than `clamav_staleness_threshold_days` (default 3) — and stays silent otherwise. Test without emailing:
```bash
CLAMAV_STALE_DAYS=0 /etc/cron.daily/clamav-freshness # forces the stale branch
```
This is what would have caught dcaprod's 20-day drift immediately instead of it surfacing by accident.
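The decision the watchdog makes can be sketched as a single function. This is not the deployed `/etc/cron.daily/clamav-freshness` script, only an approximation of the logic described above: the name `clamav_freshness` and its argument order are hypothetical, it assumes GNU `stat`, and it prints a verdict instead of sending mail.

```bash
# clamav_freshness SERVICE_STATE DB_FILE MAX_DAYS  (hypothetical helper)
# Prints "OK" or an "ALERT: ..." line; returns non-zero on alert.
clamav_freshness() {
    local service_state="$1" db_file="$2" max_days="$3"
    local now mtime age
    if [ "$service_state" != "active" ]; then
        echo "ALERT: clamav-freshclam is $service_state"
        return 1
    fi
    if [ ! -e "$db_file" ]; then
        echo "ALERT: $db_file is missing"
        return 1
    fi
    now=$(date +%s)
    mtime=$(stat -c %Y "$db_file")    # GNU stat: mtime in epoch seconds
    age=$(( now - mtime ))
    if [ "$age" -gt $(( max_days * 86400 )) ]; then
        echo "ALERT: $db_file is $(( age / 86400 )) days old"
        return 1
    fi
    echo "OK"
}
```

A manual check might look like `clamav_freshness "$(systemctl is-active clamav-freshclam)" /var/lib/clamav/daily.cld 3`. Staying silent on `OK` and mailing only the `ALERT` line matches the behaviour described for the real script.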
## Related
- [ClamAV CPU Spike: Safe Scheduling with nice/ionice](clamscan-cpu-spike-nice-ionice.md)


@ -0,0 +1,112 @@
---
title: Netdata apps-group FD-utilisation false 100% (silenced fleet-wide)
domain: troubleshooting
category: security
tags:
- netdata
- apps.plugin
- file-descriptors
- tailscale
- false-positive
- ansible
- fleet
status: published
created: 2026-05-15
updated: 2026-05-15T02:40
---
# Netdata apps-group FD-utilisation false 100%
The Netdata stock alarm **`apps_group_file_descriptors_utilization`** (from
`/usr/lib/netdata/conf.d/health.d/file_descriptors.conf`) fires
`Raised to Warning — App group <X> file descriptors utilization = 100%`
emails for application groups that are perfectly healthy. First hit on
**MajorToot** (the `tailscaled` app group), 2026-05-15.
## The Problem
A Netdata email arrives: *"App group tailscaled file descriptors utilization
= 100% on MajorToot"*. The process is fine. On the host:
```
PID 1047 tailscaled (daemon) fds=35 soft_limit=524287 util=0.01%
PID 1984541 tailscaled (child) fds=10 soft_limit=524287 util=0.00%
PID 1984548 bash (tailscale hook) fds=5 soft_limit=1024 util=0.49%
```
No PID exceeds **0.5%**, yet `app.fds_open_limit` reads ~100%. Over 1h the raw
chart was min 0 / **mean 36.7** / max 100, with sustained multi-minute 100%
plateaus (not isolated spikes).
> This is **not** an `apps.plugin` privilege problem. apps.plugin already has
> `cap_dac_read_search,cap_sys_ptrace` and `sudo -u netdata cat
> /proc/<pid>/limits` succeeds. Verify before "fixing" privileges — it's a
> no-op.
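The per-PID figures above can be reproduced by hand. The following is a minimal sketch, assuming a Linux `/proc` and enough privilege to list another process's `fd` directory; the function names `fd_util_pct` and `pid_fd_util` are hypothetical, and this is not how apps.plugin itself collects the data.

```bash
# fd_util_pct FDS LIMIT -> utilisation as a percentage with two decimals
fd_util_pct() {
    awk -v fds="$1" -v limit="$2" 'BEGIN {
        if (limit + 0 <= 0) print "n/a"      # e.g. limit "unlimited"
        else printf "%.2f\n", fds * 100 / limit
    }'
}

# pid_fd_util PID -> one summary line for a live process
pid_fd_util() {
    local pid="$1" fds limit
    fds=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    # Field 4 of the "Max open files" row is the soft limit.
    limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
    echo "pid=$pid fds=$fds soft_limit=$limit util=$(fd_util_pct "$fds" "$limit")%"
}
```

Running `pid_fd_util` for each PID in the group (for example via `pgrep tailscaled`) gives the real utilisation to compare against what the chart claims.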
## Root Cause
The stock alarm does `lookup: max -10s` over **every PID in the app group**.
App groups whose processes fork short-lived children (tailscaled spawns
route/DNS helpers and bash hooks; `bash` children inherit the systemd default
soft limit of 1024) trip a false 100%: apps.plugin's per-PID FD-limit read
**races on transient/just-forked PIDs**, and because the group lookup uses
`max`, a single bad 10-second sample pegs the entire group to ~100%. The
signal carries no usable information for any forking/root app group.
A `lookup: average -5m` does **not** rescue it — the bogus reading sits at
~100% for sustained multi-minute stretches, so the 5-minute rolling average
itself still reaches 100.0% (empirically verified on MajorToot).
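A toy calculation shows why neither aggregation helps. This sketch only illustrates the arithmetic; the helper names `group_max` and `group_mean` and the sample values are hypothetical, and it does not model Netdata's lookup engine.

```bash
# Read one utilisation value per line on stdin.
group_max() {
    awk 'NR == 1 || $1 + 0 > max { max = $1 + 0 } END { print max }'
}
group_mean() {
    awk '{ sum += $1 } END { if (NR) printf "%.1f\n", sum / NR }'
}
```

With per-PID values of `0.01`, `0.00` and one bogus `100`, `group_max` reports 100 for the whole group. Feeding `group_mean` a window where four of five samples sit at 100 still gives 80.0, so a sustained plateau defeats averaging as well.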
## The Fix
Silence this template fleet-wide, keep the reliable system-wide FD alarm.
- **Codified in Ansible** (do not hand-edit hosts): `MajorAnsible/netdata.yml`
ships `templates/health_apps_fds_group.conf.j2` to
`/etc/netdata/health.d/apps_fds_group_override.conf` and reloads via
`netdatacli reload-health`.
- The override redefines `apps_group_file_descriptors_utilization` with
`to: silent`. Netdata loads `/etc/netdata/health.d/` *after* the stock
`conf.d` dir, so a same-name template deterministically supersedes the stock
one (same mechanism as the manual `tcp_resets.conf` override, 2026-04-30).
- **Safety net retained:** the companion stock template
`system_file_descriptors_utilization` (on `system.file_nr_utilization`,
`crit > 90`, `to: sysadmin`) is untouched and still catches genuine
system-wide FD exhaustion regardless of app grouping.
- The reload handler is restart-tolerant (`retries`/`until` + `failed_when`
ignoring a `netdata.pipe` socket-absent error) because on hosts where the
notify-config also drifts, `Restart Netdata` and `Reload Netdata health`
can race during the ~5s restart window.
## Verification
```bash
ssh <host> 'curl -s "http://localhost:19999/api/v1/alarms?all=true" \
| python3 -c "import sys,json;A=json.load(sys.stdin)[\"alarms\"]; \
print(A[\"app.tailscaled_fds_open_limit.apps_group_file_descriptors_utilization\"][\"recipient\"])"'
# expect: silent
```
After the fix the alarm still shows `status=WARNING` in the dashboard
(cosmetic — silencing suppresses the *notification*, not the computed state);
`recipient=silent` confirms no more emails. The system-wide alarm should read
`CLEAR recipient=sysadmin`.
## Notes
- Silenced fleet-wide on all 10 servers 2026-05-15 (workstations majorrig/
majormac were asleep — irrelevant, they are not fleet servers).
- Any future host running a forking/root daemon in a named app group would
have hit the same false positive; silencing is fleet-wide and pre-emptive.
- **Follow-up debt:** the manual `/etc/netdata/health.d/tcp_resets.conf`
override on MajorToot (2026-04-30) is still **not codified in
`netdata.yml`** — a per-host divergence the fleet play does not manage.
Worth folding into Ansible the same way.
## Related
- [[clamscan-cpu-spike-nice-ionice]]
- [[netdata-web-log-successful-redirect-heavy-tuning]]
- Server doc: `30-Areas/MajorInfrastructure/Servers/majortoot.md` (incident
2026-05-15)
- Playbook: `MajorAnsible/netdata.yml` +
`templates/health_apps_fds_group.conf.j2`


@ -0,0 +1,196 @@
---
title: "Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites"
domain: troubleshooting
category: security
tags: [netdata, monitoring, wordpress, apache, fail2ban, alerts]
status: published
created: 2026-05-08
updated: 2026-05-08
---
# Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites
## 🛑 Problem
Netdata's stock `web_log_1m_successful` alarm fires CRITICAL on a perfectly healthy WordPress site whenever a crawler hammers legacy URLs. Example email/notification:
```
[CRITICAL] web_log_1m_successful = 54.1%
Ratio of successful HTTP requests over the last minute (1xx, 2xx, 304, 401, 429)
```
Meanwhile the front page returns HTTP 200, no 5xx errors are logged, and only a handful of 4xx noise hits appear. So why the alert?
---
## 🔬 Root Cause
The metric counts as **"successful"** only the response code classes:
```
1xx, 2xx, 304, 401, 429
```
**301 redirects are NOT counted as successful.** They land in the `redirect` dimension and pull the success ratio down.
WordPress sites generate large volumes of 301s as a normal part of life:
| Redirect source | Why a 301 |
|---|---|
| `/?p=NNNN` legacy shortlinks | Canonical URL rewrite to slug |
| Stale post slugs after permalink edits | Old → new path |
| `/feed` → `/feed/` | Trailing-slash normalization |
| `http://` → `https://` | TLS upgrade |
| `domain.com` → `www.domain.com` | Host canonicalization |
| Proxy CONNECT probes (e.g. `www.instagram.com:443`) | Apache returns 301 to canonical host |
When a feed scraper or vulnerability crawler walks a long list of legacy `/?p=` URLs, **every single hit is a 301**. A short burst can push the ratio of `success / total_requests` below the stock thresholds (80% to raise the warning, 65% to raise critical) within a single minute, even though the server is functioning perfectly.
### Verifying the cause
Pull the last few thousand lines of the access log and split by status code:
```sh
sudo tail -5000 /var/log/apache2/access.log | awk '{print $9}' | sort | uniq -c | sort -rn
```
If you see something like:
```
196 200
162 301
1 405
1 404
1 400
```
…the math is `196 / (196+162+3) ≈ 54%`, which matches the alarm value almost exactly. **The alert is correct by its definition; the definition is wrong for this workload.**
Cross-check the source IPs:
```sh
sudo tail -2000 /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -10
```
If a single IP dominates (hundreds of requests in minutes) and most of its hits are 301 to legacy URLs, you have your culprit.
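The same ratio can be computed straight from the log using the alarm's own definition of success. This is a rough sketch, assuming the Apache combined log format (status code in field 9); the function name `success_ratio` is hypothetical and it is not the Netdata collector's code.

```sh
# success_ratio [FILE...]  (reads stdin when no file is given)
# Counts 1xx, 2xx, 304, 401 and 429 as successful, as the alarm text does.
success_ratio() {
    awk '{
        code = $9
        total++
        if (code ~ /^[12][0-9][0-9]$/ || code == 304 || code == 401 || code == 429)
            ok++
    }
    END {
        if (total) printf "%.1f\n", ok * 100 / total
        else print "n/a"
    }' "$@"
}
```

For example, `sudo tail -5000 /var/log/apache2/access.log | success_ratio` prints a single percentage to compare with the value in the alert email.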
---
## ✅ Solution
Two parts: **fix the alarm definition** so normal redirect bursts don't trip it, and **block the abusive scraper** so it stops generating noise.
### 1. Retune `web_log_1m_successful` thresholds
Edit `/etc/netdata/health.d/web_log.conf` (this is a local override of the stock template). Locate the `template: web_log_1m_successful` block and replace its `warn`/`crit` lines:
```diff
template: web_log_1m_successful
on: web_log.type_requests
class: Workload
type: Web Server
component: Web log
lookup: sum -1m unaligned of success
calc: $this * 100 / $web_log_1m_requests
units: %
every: 10s
- warn: ($web_log_1m_requests > 120) ? ($this < (($status >= $WARNING ) ? ( 90 ) : ( 80 )) ) : ( 0 )
- crit: ($web_log_1m_requests > 120) ? ($this < (($status == $CRITICAL) ? ( 75 ) : ( 65 )) ) : ( 0 )
+ warn: ($web_log_1m_requests > 120) ? ($this < (($status >= $WARNING ) ? ( 50 ) : ( 40 )) ) : ( 0 )
+ crit: ($web_log_1m_requests > 120) ? ($this < (($status == $CRITICAL) ? ( 30 ) : ( 20 )) ) : ( 0 )
delay: up 2m down 15m multiplier 1.5 max 1h
summary: Web log successful
info: Ratio of successful HTTP requests over the last minute (1xx, 2xx, 304, 401, 429)
to: webmaster
```
Then reload Netdata health:
```sh
sudo netdatacli reload-health
```
Confirm the new thresholds are active:
```sh
curl -s 'http://localhost:19999/api/v1/alarms?all' \
| jq -r '.alarms | to_entries[] | select(.value.name == "web_log_1m_successful") | .value.warn,.value.crit'
```
You should see the new `50/40` warn and `30/20` crit values.
### 2. Why the new thresholds make sense
The stock alarm assumes a low-redirect workload (typical SPA backend: lots of 200s, very few 301s). On a WP site with active permalink rewrites, expect routine ratios of 70-95% successful with occasional dips into the 50s during crawler bursts. The retuned alarm:
- **Warn at <40%** — not until *most* responses are non-2xx
- **Crit at <20%** — only when the site is genuinely melting down (e.g., backend down, Apache returning 5xx for everything)
You haven't disabled the safety net — you've moved it past the floor of normal redirect-heavy noise.
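Each `warn`/`crit` line carries two numbers because the expression applies hysteresis: a lower value to raise the alarm and a higher one to clear it. The sketch below mimics the retuned `warn` expression with integer inputs; the function name `warn_fires` and its yes/no arguments are hypothetical, and it is an illustration rather than Netdata's expression evaluator.

```sh
# warn_fires REQUESTS_1M RATIO_PCT ALREADY_WARNING(yes|no)
warn_fires() {
    local requests="$1" ratio="$2" in_warning="$3" threshold
    # Below 120 requests per minute the alarm is never evaluated.
    [ "$requests" -gt 120 ] || { echo no; return; }
    # Already raised: must recover past 50. Not raised: must drop below 40.
    if [ "$in_warning" = "yes" ]; then threshold=50; else threshold=40; fi
    if [ "$ratio" -lt "$threshold" ]; then echo yes; else echo no; fi
}
```

So a ratio of 45% does not raise a new warning, but it does keep an existing one raised until the ratio climbs back to 50% or more.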
### 3. Lean on the right alarms for real outages
Two other web_log alarms remain stock and **are** the correct outage signals:
| Alarm | Catches | Default thresholds |
|---|---|---|
| `web_log_1m_internal_errors` | 5xx ratio | warn 2% / crit 5% |
| `web_log_1m_bad_requests` | 4xx (excl. 401, 429) | warn 30% |
Verify both are active and CLEAR after your retune:
```sh
curl -s 'http://localhost:19999/api/v1/alarms?all' \
| jq -r '.alarms | to_entries[] | select(.value.name | test("web_log")) | "\(.value.status) | \(.value.name)"'
```
### 4. Block the abusive scraper
Identify the dominant offender from the source-IP cross-check under "Verifying the cause" and ban it permanently via the recidive jail (assuming `bantime = -1` is set in `jail.local`):
```sh
sudo fail2ban-client set recidive banip 74.7.242.61
sudo fail2ban-client status recidive
```
The recidive jail uses iptables/nftables, so the IP is dropped at the firewall — Apache no longer sees it, and the redirect-flood stops contributing to the ratio. If `bantime` is finite on your host, edit `/etc/fail2ban/jail.local`:
```ini
[recidive]
bantime = -1
findtime = 86400
maxretry = 3
```
---
## 🧪 Verification
After both changes:
```sh
# 1. Active alarms — should be empty (or only your real ones)
curl -s 'http://localhost:19999/api/v1/alarms?active' | jq '.alarms'
# 2. Recidive ban list includes the IP
sudo fail2ban-client status recidive
# 3. Live ratio — should climb above 50% within 1-2 minutes
watch -n 5 'curl -s http://localhost:19999/api/v1/data?chart=web_log_apache.requests_by_type\&after=-60\&points=1\&format=json | jq'
```
---
## 🧭 When NOT to apply this
- If your site is an API or SPA backend that should have a 200-dominated traffic mix, the stock thresholds are correct — diagnose what's actually returning 301 instead of relaxing the alarm.
- If 5xx errors are climbing in tandem with the success-ratio drop, retuning the 1m_successful alarm will mask a real outage. **Always check `web_log_1m_internal_errors` first.**
---
## 📚 References
- Netdata stock template: `/usr/lib/netdata/conf.d/health.d/web_log.conf`
- Local override: `/etc/netdata/health.d/web_log.conf`
- Netdata web_log Go module dimensions: `success`, `redirect`, `bad`, `error`, `other`
- Related: [Custom Fail2ban Jail: Apache Directory Scanning](apache-dirscan-fail2ban-jail.md)


@ -98,6 +98,29 @@ ausearch -m avc -ts recent | grep dovecot
No output = no new denials.
## Variant: a Freshly-Rebuilt Box Left in Permissive Mode
If a server was rebuilt or migrated and came up **Permissive** (check `getenforce`), the symptom flips: mail works fine, but `/var/log/audit/audit.log` quietly fills with thousands of `dovecot_t → var_t` denials that *would* break IMAP/LMTP the instant you switch to Enforcing. The mailstore was created or `rsync`'d onto `/var/vmail` with no fcontext rule, so it defaulted to `var_t`.
Apply the relabel above first, then flip to Enforcing **only after** verifying zero new denials:
```bash
MARK=$(date +%H:%M:%S)
# ...deliver a test message + do an IMAP login...
ausearch -m avc -ts "$MARK" | grep -c denied # expect 0
setenforce 1
sed -i 's/^SELINUX=permissive/SELINUX=enforcing/' /etc/selinux/config
```
**Companion denial:** a Postfix virtual-mailbox server that looks up recipients in MySQL also trips `postfix_cleanup_t` reading `/etc/my.cnf*` (`mysqld_etc_t`). Allow it with a small local module:
```bash
ausearch -m avc -c cleanup | audit2allow -M local_postfix_mysql
semodule -i local_postfix_mysql.pp
```
See also [[postfix-spamassassin-bayes-spam-filtering|Inbound Spam Filtering]] — the SpamAssassin Bayes DB belongs under `/var/lib/spamassassin` (`spamd_var_lib_t`) for the same labeling reason.
## Key Notes
- **One rule is enough:** `"/var/vmail(/.*)?"` with `mail_spool_t` covers every file and directory under `/var/vmail`, including all `tmp/` subdirectories.


@ -0,0 +1,92 @@
---
title: "SELinux: Wrong /etc/localtime Label Silently Breaks Timezone Changes"
domain: troubleshooting
category: general
tags: [selinux, timezone, timedatectl, localtime, fedora, ansible, hetzner]
status: published
created: 2026-06-05
updated: 2026-06-05
---
# SELinux: Wrong /etc/localtime Label Silently Breaks Timezone Changes
`timedatectl set-timezone` (and Ansible's `community.general.timezone`) report **success but the timezone never actually changes**: `date` keeps showing the old zone. The cause is an SELinux mislabel on `/etc/localtime`: it must be `locale_t`, but freshly-provisioned images sometimes ship it as `etc_t`, which makes SELinux deny `systemd-timedated` from rewriting the symlink.
> Hit on **majormail** (Fedora 44, SELinux Enforcing, Hetzner Cloud image), 2026-06-05. The box stayed on UTC for hours despite the timezone task "succeeding."
## Symptoms
- `timedatectl set-timezone America/New_York` exits **0**, but `date` still shows the old zone/offset.
- `timedatectl show -p Timezone --value` reports the **new** zone while `readlink /etc/localtime` still points at the **old** one — an inconsistent split state.
- Ansible `community.general.timezone` reports `changed=false` ("already set") because its idempotence check reads the stale in-memory value from `timedatectl`.
- `journalctl -u systemd-timedated` shows: `Failed to set time zone: Permission denied`.
- A direct `ln -sf … /etc/localtime` **works** — but a brand-new symlink may get the wrong label again, sending you in circles.
## Why It Happens
`systemd-timedated` changes the timezone by replacing the `/etc/localtime` symlink. Under SELinux Enforcing, that target must be labeled `locale_t`. If it is `etc_t` (or anything else), timedated is denied (`Permission denied`) and aborts — but `timedatectl`/the Ansible module surface this poorly, so the change looks like it took. The denial may be **dontaudit-suppressed**, so `ausearch -m avc` can come up empty, hiding the real cause.
## Diagnosis
```bash
# The split state — these two should agree but won't:
readlink /etc/localtime # e.g. .../Etc/UTC (the truth)
timedatectl show -p Timezone --value # e.g. America/New_York (stale)
date '+%Z %z' # confirms actual zone via the symlink
# The label — this is the smoking gun:
ls -Z /etc/localtime # WRONG: ...:etc_t:s0
matchpathcon /etc/localtime # EXPECTED: ...:locale_t:s0
# The failure as timedated logs it (the AVC itself may be dontaudit-suppressed):
journalctl -u systemd-timedated | grep -i 'permission denied'
```
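The two comparisons above lend themselves to a scripted check. This is a minimal sketch with hypothetical helper names (`zone_from_link`, `tz_state`, `selinux_type`); it only parses strings you pass in, so it makes no assumptions about the host beyond the usual `.../zoneinfo/<Zone>` symlink layout.

```bash
# zone_from_link LINK_TARGET -> zone name, e.g. Etc/UTC
zone_from_link() {
    printf '%s\n' "${1##*zoneinfo/}"
}

# tz_state LINK_TARGET REPORTED_ZONE -> "consistent: ..." or "split: ..."
tz_state() {
    local link_zone
    link_zone=$(zone_from_link "$1")
    if [ "$link_zone" = "$2" ]; then
        echo "consistent: $2"
    else
        echo "split: symlink=$link_zone timedatectl=$2"
    fi
}

# selinux_type CONTEXT -> third field of user:role:type:level
selinux_type() {
    printf '%s\n' "$1" | cut -d: -f3
}
```

On a live host the inputs would come from the commands already shown, for example `tz_state "$(readlink /etc/localtime)" "$(timedatectl show -p Timezone --value)"`. A `split:` result combined with a type other than `locale_t` is the signature of this issue.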
## Fix
Relabel first, *then* set the timezone the normal way:
```bash
restorecon -v /etc/localtime # etc_t -> locale_t
timedatectl set-timezone America/New_York
# verify all three agree now:
date '+%F %T %Z (%z)'; readlink /etc/localtime; ls -Z /etc/localtime
```
If you set the symlink by hand (`ln -sf`) as a stopgap, run `restorecon /etc/localtime` afterward — a manually created symlink can inherit `etc_t` and re-break the next `timedatectl` call.
Then restart anything that caches the zone at startup so logs/schedules switch over:
```bash
systemctl restart rsyslog crond
```
(`journalctl` renders in local time automatically; rsyslog-written logs like `/var/log/maillog` keep the old zone until rsyslog restarts.)
## Codify (Ansible)
Run `restorecon` on `/etc/localtime` **before** the timezone task, so a mislabeled symlink can't silently defeat it:
```yaml
- name: Ensure correct SELinux label on /etc/localtime
  ansible.builtin.command: restorecon -v /etc/localtime
  register: localtime_relabel
  changed_when: "'Relabeled' in localtime_relabel.stdout"
  when: ansible_selinux.status | default('disabled') == 'enabled'
- name: Set timezone
  community.general.timezone:
    name: America/New_York
On majormail this is in `roles/majormail/tasks/main.yml` (MajorAnsible commit `2ff566d`).
## Key Notes
- **`timedatectl`/the Ansible module lie here.** Always confirm with `readlink /etc/localtime` + `date`, not just `timedatectl show`.
- **The denial can be invisible.** dontaudit rules may hide the AVC; trust the label mismatch (`ls -Z` vs `matchpathcon`) over an empty `ausearch`.
- **Fresh cloud images are the usual offender** — a clean rebuild/provision is where the wrong label sneaks in.
## Related
- [SELinux: Fixing Dovecot Mail Spool Context (/var/vmail)](selinux-dovecot-vmail-context.md)
- [Dovecot IMAP Clients Fail to Sync: vsz_limit OOM from a Bloated Index Log](networking/dovecot-imap-oom-vsz-limit-bloated-index.md)


@ -0,0 +1,120 @@
---
title: "Time Machine: Orphaned APFS .previous Folder Blocks All Backups"
domain: troubleshooting
category: general
tags: [macos, time-machine, apfs, backup, fsck, disk-utility]
status: published
created: 2026-06-18
updated: 2026-06-18
---
# Time Machine: Orphaned APFS `.previous` Folder Blocks All Backups
## Overview
On an APFS Time Machine destination, an interrupted backup can leave behind an orphaned staging folder named `<timestamp>.previous` (plus a matching, uncatalogued APFS snapshot). Every subsequent backup reads that folder during *FindingChanges*, hits a metadata-type mismatch, and aborts — so backups silently stop running. macOS shows only a generic "**Time Machine couldn't complete the backup … An unknown error occurred.**"
The trap: because the orphan is **not in Time Machine's catalog** and the destination is OS-protected, every obvious removal tool (`rm`, `chmod`, `tmutil delete`, `diskutil deleteSnapshot`) refuses it. The clean fix is **First Aid (`fsck_apfs`)**, which has authority over the volume and clears the orphaned snapshot.
## Symptoms
- "Time Machine couldn't complete the backup to '<disk>' — An unknown error occurred."
- Backups haven't run since around the time of an interrupted/cancelled backup.
- The destination disk is mounted and has plenty of free space (not full, not disconnected).
- `tmutil status` cycles through `Starting` / `FindingChanges` and never reaches `Copying`.
## Root Cause
`backupd` logs the real error on a loop (every ~15 s):
```bash
log show --predicate 'subsystem == "com.apple.TimeMachine"' --last 10m --style compact \
| grep -iE 'previous|error'
```
```
[TMStructure] Expected SnapshotInProgressContainer metadata type but found APFSBackup
metadata type at URL '.../<disk>/2026-06-17-172230.previous/'
```
An earlier backup was interrupted mid-run. It left two orphans tied to that timestamp, **neither registered in Time Machine's backup catalog**:
1. A staging directory `<timestamp>.previous` on the destination volume.
2. A matching APFS snapshot `com.apple.TimeMachine.<timestamp>.backup`.
Time Machine expects the staging folder to be a `SnapshotInProgressContainer` but finds completed-backup (`APFSBackup`) metadata, so it bails before copying anything.
> **Ignore the surrounding log noise.** `com.apple.backupd.sandbox.xpc: connection invalid`, `Mountpoint '…' is still valid`, and `missingName` on `/System/Volumes/Data/home` are all normal on a healthy backup — flagged `E` but harmless. The only line that matters is the `SnapshotInProgressContainer` mismatch.
## Diagnosis
Confirm the disk is healthy (not the problem) and locate the orphan:
```bash
tmutil status # stuck in Starting/FindingChanges, never Copying
df -h | grep -i "<disk-name>" # mounted, plenty free
diskutil apfs listSnapshots <diskNsN> # note the highest/last snapshot timestamp
```
If `listSnapshots` shows a final snapshot whose timestamp matches the `.previous` folder in the error, that's the orphaned pair.
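Matching the timestamp by eye is error-prone, so it can be pulled out of the log mechanically. The helpers below are a sketch with hypothetical names (`previous_timestamps`, `snapshot_name_for`); they rely only on the `.previous` folder naming and the snapshot name pattern quoted in Root Cause.

```bash
# previous_timestamps: read log text on stdin, print each distinct
# <timestamp> that appears as "<timestamp>.previous".
previous_timestamps() {
    grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{6}\.previous' \
        | sed 's/\.previous$//' \
        | sort -u
}

# snapshot_name_for TIMESTAMP -> the APFS snapshot name to look for
snapshot_name_for() {
    printf 'com.apple.TimeMachine.%s.backup\n' "$1"
}
```

Piping the `log show … --style compact` command from Root Cause into `previous_timestamps` yields the timestamp. Passing that to `snapshot_name_for` gives the exact string to search for in the `diskutil apfs listSnapshots` output.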
## Why the Obvious Tools Fail
Do **not** burn time trying to force the folder out — here's what each tool does and why it refuses:
| Command | Result | Reason |
|---|---|---|
| `sudo rm -rf …/<ts>.previous` | `Operation not permitted` | TM applies a `group:everyone deny delete` ACL that overrides root. |
| `sudo chmod -RN …/<ts>.previous` | runs for minutes, then fails | A `.previous` folder is a **full copy of the entire Mac filesystem**; `-R` walks the whole tree and can't clear ACLs on the SIP-`restricted` system files inside (`/usr/bin/sh`, frameworks, keymaps). `rm` then hits the same wall. |
| `sudo tmutil delete -p …/<ts>.previous` | `Invalid deletion target (error 22)` | Not a registered backup. |
| `sudo tmutil delete -t <timestamp>` | `error 2 (No such file)` | No catalog entry for that timestamp. |
| `sudo diskutil apfs deleteSnapshot <diskNsN> -uuid <uuid>` | `Not a valid APFS Snapshot UUID` | TM-managed snapshot; diskutil won't remove it directly. |
> **If you started a `chmod -R` and killed it:** the live system is unaffected — `chmod -R` does not follow symlinks out of the backup tree. Verify with `ls -lde ~/Desktop` (normal ACLs = untouched). Stop a runaway with `sudo pkill -f '<timestamp>.previous'`.
## Fix — Run First Aid (`fsck_apfs`)
First Aid runs with full authority over the volume and clears the orphaned snapshot, which defuses the `.previous` folder's metadata mismatch.
```bash
# 1. Stop the looping backup
sudo tmutil stopbackup
# 2. Verify the destination volume (live mode is fine; read-only check)
sudo diskutil verifyVolume <diskNsN>
# or: Disk Utility → View → Show All Devices → select the TM volume → First Aid → Run
```
`verifyVolume` enumerates and validates every snapshot; the verify/remount cycle purges the orphaned in-progress snapshot. Expected result:
```
The volume <name> appears to be OK
File system check exit code is 0
```
Confirm the orphan snapshot is gone (count drops by one; the matching timestamp no longer appears):
```bash
diskutil apfs listSnapshots <diskNsN>
```
Then restart and watch it succeed:
```bash
sudo tmutil startbackup --auto
tmutil status # should reach BackupPhase = Copying with no SnapshotInProgressContainer errors
```
If `verifyVolume` reports problems rather than "appears to be OK", run the repair (it must unmount the volume):
```bash
sudo diskutil repairVolume <diskNsN>
```
## Notes
- The first backup after the fix is often a large catch-up (hundreds of GB) because the chain was broken — let it finish; it returns to quick hourly increments afterward.
- The inert `<timestamp>.previous` **folder** may still sit on the volume after the fix. Time Machine now ignores it, so it's not blocking — but it consumes space. Removing it cleanly requires booting to **Recovery Mode**, `csrutil disable`, `rm -rf` the folder, then `csrutil enable` — only worth it to reclaim the space.
- Time Machine identifies its destination by `DestinationID` (a UUID), not the volume name, so renaming the disk later is safe.
- Interrupted backups are more likely on flaky USB-SATA bridge enclosures (e.g. some WD My Passport units) whose slow sleep/wake transitions can drop the drive mid-backup.
## Tags
`macos` `time-machine` `apfs` `backup` `fsck-apfs` `disk-utility` `snapshot` `first-aid`
## See Also
- [SnapRAID & MergerFS Storage Setup](../01-linux/storage/snapraid-mergerfs-setup.md)
- MajorMac Incident Log (2026-06-18) — the originating incident


@ -0,0 +1,193 @@
---
title: "WordPress 6.7 _load_textdomain_just_in_time Notice (Theme/Plugin Loads Translations Too Early)"
domain: troubleshooting
category: troubleshooting
tags:
- wordpress
- wordpress-6.7
- php
- i18n
- textdomain
- theme
- mu-plugin
- deprecation
- troubleshooting
status: published
created: 2026-06-21
updated: 2026-06-21
---
# WordPress 6.7 `_load_textdomain_just_in_time` Notice
> **TL;DR** — WordPress 6.7 added a `doing_it_wrong` notice that fires when a translation function (`__()`, `_e()`, `esc_html__()`, …) is called for a text domain **before the `init` action**. It's almost always a theme or plugin registering nav menus / sidebars / labels on `after_setup_theme` (which runs before `init`). The notice is **debug-only and harmless** — translations still load via the just-in-time fallback. If the offending code is in your own (or an updatable) theme/plugin, fix it at the source by deferring to `init`. If it's a **non-updating or third-party** theme you don't want to hand-edit, suppress *only this one notice* with a `doing_it_wrong_trigger_error` filter in a tiny mu-plugin.
---
## Symptom
With `WP_DEBUG` on (or in Query Monitor's PHP panel), you see:
```
Function _load_textdomain_just_in_time was called incorrectly.
Translation loading for the <domain> domain was triggered too early.
This is usually an indicator for some code in the plugin or theme running too early.
Translations should be loaded at the init action or later.
(This message was added in version 6.7.0.)
_load_textdomain_just_in_time() wp-includes/l10n.php
get_translations_for_domain() wp-includes/l10n.php
translate() wp-includes/l10n.php
__() wp-includes/l10n.php
WordPress Core
```
The key fields are **the domain name** (e.g. `marstheme`, `woocommerce`, `astra`) and the fact that the stack bottoms out in **WordPress Core** via `__()` — that tells you *some* extension called a translation function, not that core is broken.
## Why it happens (the WP 6.7 change)
Before 6.7, WordPress silently "just-in-time" loaded a text domain the first time you translated a string in it. 6.7 kept the JIT loading but started **warning** when it's triggered before `init`, because:
- Translations loaded before `init` can't be filtered/overridden by other plugins that hook `init`.
- It signals the extension is doing setup work earlier than the WordPress lifecycle intends.
The usual culprit is code on **`after_setup_theme`** (which fires *before* `init`) that translates a label inline, e.g.:
```php
function mytheme_setup() {
    register_nav_menus( array(
        'primary' => __( 'Primary Menu', 'mytheme' ), // <-- translate call before init
    ) );
}
add_action( 'after_setup_theme', 'mytheme_setup' );
```
> **Important:** explicitly calling `load_theme_textdomain()` / `load_plugin_textdomain()` early does **not** fix the notice, and as of WP 4.6+ themes on wordpress.org don't even need to call it. The notice is about the *translate call*, not about whether the domain was loaded. Moving only the `load_*_textdomain()` call around is a common dead-end (see the gotcha below).
## Diagnostic chain
### 1. Identify the domain and what owns it
The notice names the domain. Find which theme/plugin uses it:
```bash
WPROOT=/var/www/html
grep -rlw '<domain>' "$WPROOT/wp-content/themes" "$WPROOT/wp-content/plugins" 2>/dev/null
# Which extension has the most references (i.e. owns the domain)?
grep -rl '<domain>' "$WPROOT/wp-content/" 2>/dev/null \
| sed -E "s#$WPROOT/wp-content/(themes|plugins|mu-plugins)/([^/]+)/.*#\1/\2#" \
| sort | uniq -c | sort -rn | head
```
> **Watch for renamed/forked themes.** The domain often does **not** match the theme's folder name. A theme bought as "Mars" and re-slugged to `kappa` keeps `marstheme` as its text domain in all 40+ template files. So `wp theme list` shows `kappa` active while the notice says `marstheme` — they're the same thing.
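
A quick way to spot a folder/domain mismatch is to read the `Text Domain:` header that each theme declares in `style.css`. This is a sketch under assumptions: `list_theme_domains` is a hypothetical helper, and the header is only a hint, since a fork can declare one domain and still use another in its templates.

```bash
# Hypothetical helper: print "<folder><TAB><declared text domain>" for every
# theme under a themes directory, read from each style.css header.
# The header is only a hint; confirm the real owner with grep on the domain.
list_theme_domains() {
  local themes_dir="$1" css domain
  for css in "$themes_dir"/*/style.css; do
    [ -f "$css" ] || continue
    domain=$(sed -n -E 's/^[[:space:]*]*Text Domain:[[:space:]]*([^[:space:]]+).*/\1/p' "$css" | head -n 1)
    printf '%s\t%s\n' "$(basename "$(dirname "$css")")" "${domain:-(none declared)}"
  done
}

# Usage (the path is an assumption):
# list_theme_domains /var/www/html/wp-content/themes
```

Any row where the two columns differ is a candidate for the renamed-fork situation described above.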
### 2. Confirm it's active and whether it can be updated
```bash
sudo -u www-data wp --path=$WPROOT theme list --fields=name,status,version,update
sudo -u www-data wp --path=$WPROOT plugin list --fields=name,status,version,update
```
- `update available` → **update it first** (the newest releases of most themes/plugins fixed this in late 2024/2025). That's the proper fix; the rest of this article is for when you can't.
- `update none` on a **renamed/custom fork** → no upstream exists, so updating is impossible. Go to the suppression fix.
### 3. Pin down the early call (optional)
```bash
grep -rn "__(\s*['\"].*['\"]\s*,\s*['\"]<domain>['\"]" \
"$WPROOT/wp-content/themes/<theme>" | head
```
Look for translate calls inside functions hooked to `after_setup_theme`, `setup_theme`, `plugins_loaded`, or run at file scope in `functions.php`.
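To narrow the search to callbacks on those early hooks, you can list the `add_action()` registrations themselves. A minimal sketch, with `find_early_hooks` as a hypothetical helper name; it does not catch translate calls made at file scope, which still need a manual look.

```bash
# Hypothetical helper: list add_action() calls in a theme/plugin directory
# that hook the early actions named above. Each hit names a callback to
# inspect for __() / _e() / esc_html__() calls.
find_early_hooks() {
  local dir="$1"
  grep -rnE --include='*.php' \
    "add_action\([[:space:]]*['\"](after_setup_theme|setup_theme|plugins_loaded)['\"]" \
    "$dir"
}

# Usage (the path is an assumption):
# find_early_hooks /var/www/html/wp-content/themes/kappa
```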
## The fix
### Option A — fix it at the source (own / updatable code)
Defer the translation. Either register the raw string and translate at render time, or move the registration to `init`:
```php
// Before: translated on after_setup_theme (too early)
add_action( 'after_setup_theme', function () {
register_nav_menus( array( 'primary' => __( 'Primary Menu', 'mytheme' ) ) );
} );
// After: register the menu location on init, where translation is allowed
add_action( 'init', function () {
register_nav_menus( array( 'primary' => __( 'Primary Menu', 'mytheme' ) ) );
} );
```
Don't do this by editing a theme/plugin that receives updates — your change is wiped on the next update. Use Option B for those.
### Option B — suppress just this notice (third-party / non-updating code)
When the early call lives in a theme you don't control and can't update (a renamed commercial fork, an abandoned plugin), the clean, update-safe move is to silence **only** the `_load_textdomain_just_in_time` notice — not all `doing_it_wrong` output — via a must-use plugin.
Create `wp-content/mu-plugins/fix-textdomain.php`:
```php
<?php
/**
* Suppress the WP 6.7 "_load_textdomain_just_in_time was called incorrectly"
* notice for a theme/plugin that translates before init.
*
* Scope is intentionally narrow: only this one function is silenced, so other
* doing_it_wrong notices still surface. Translations still load via the JIT
* fallback, so nothing visible changes for visitors.
*/
add_filter( 'doing_it_wrong_trigger_error', function ( $trigger, $function_name ) {
return '_load_textdomain_just_in_time' === $function_name ? false : $trigger;
}, 10, 2 );
```
`mu-plugins/` loads automatically (no activation, can't be deactivated from the admin), and runs early enough to register the filter before the notice fires.
#### Verify
```bash
WPROOT=/var/www/html
# 1. Syntax-check the mu-plugin
php -l "$WPROOT/wp-content/mu-plugins/fix-textdomain.php"
# -> No syntax errors detected
# 2. Confirm WP still boots and the filter is registered
sudo -u www-data wp --path=$WPROOT eval \
'echo has_filter("doing_it_wrong_trigger_error") ? "filter set\n" : "MISSING\n";'
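# 3. Clear the debug log, load WordPress via WP-CLI (which boots the theme and
#    runs its early translate call), and confirm 0 new notices
DBG="$WPROOT/wp-content/debug.log"
[ -f "$DBG" ] && : > "$DBG"
sudo -u www-data wp --path=$WPROOT eval '__("Primary Menu","<domain>");' >/dev/null 2>&1
# grep -c already prints 0 (and exits 1) when nothing matches, so only
# fall back to echo 0 when the log file does not exist
grep -c "<domain>" "$DBG" 2>/dev/null || [ -f "$DBG" ] || echo 0
# -> 0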
```
## Gotchas
### The "load the textdomain earlier/later" dead-end
A very common (wrong) first attempt is an mu-plugin that just calls `load_theme_textdomain()` on `plugins_loaded` or `after_setup_theme`:
```php
// DOES NOT FIX THE NOTICE
add_action( 'plugins_loaded', function () {
load_theme_textdomain( 'mytheme', get_template_directory() . '/languages' );
}, 0 );
```
`plugins_loaded` still runs **before `init`**, and — more importantly — the notice is triggered by the theme's own early `__()` call, not by whether you've loaded the domain. This code is dead weight. If you find one in place, replace it with the Option B filter rather than tweaking its hook/priority.
### Don't blanket-suppress all deprecations
Resist `error_reporting(E_ALL & ~E_DEPRECATED)` or returning `false` from `doing_it_wrong_trigger_error` unconditionally — that also hides genuinely useful warnings (a plugin breaking on a future PHP/WP bump). Scope the filter to the one `function_name`.
### Renamed theme ⇒ domain ≠ folder
Re-stating because it costs the most time: the domain in the notice can be the theme's *original* slug, not its current folder. Always `grep` for the domain to find the real owner before concluding "I don't even have that theme installed."
## See also
- [Patching PHP 8.4 Implicit-Nullable Deprecations in Vendor Packages](php-84-vendor-implicit-nullable-patch.md) — the other "harmless deprecation that floods logs" pattern on the WordPress fleet
- [WordPress developer note: i18n improvements in 6.7](https://make.wordpress.org/core/2024/10/21/i18n-improvements-in-6-7/) — the canonical reference for this change

@ -0,0 +1,125 @@
---
title: "WSL2: PyTorch Training Deadlocks on Windows Filesystem Checkpoint Saves"
domain: troubleshooting
category: wsl2
tags: [wsl2, pytorch, huggingface, training, llm, checkpoint, windows, ntfs, deadlock, majortwin]
status: published
created: 2026-05-23
updated: 2026-05-23
---
# WSL2: PyTorch Training Deadlocks on Windows Filesystem Checkpoint Saves
## Problem
A Hugging Face Trainer / Unsloth fine-tuning run starts successfully, logs training steps for a while, then freezes completely. The tqdm progress bar stops advancing, GPU utilization drops to near-zero, but the training process stays alive at 100% CPU with the full model loaded in VRAM. No new checkpoint directories appear.
**Confirming it's a checkpoint deadlock:**
```bash
# Check if training is frozen — same step count + elapsed time across checks
tmux capture-pane -t <session> -p | tail -5
sleep 60
tmux capture-pane -t <session> -p | tail -5
# GPU idle despite process alive
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader
# No new checkpoint directories written
ls -lt /mnt/d/your/training/output/ | head -10
```
If the tqdm step count is identical both times and the newest directory timestamp is from a previous run, the save is deadlocked.
---
## Root Cause
WSL2's `/mnt/d/` paths go through the **virtio-9p filesystem driver** to reach the host Windows NTFS volume. Large sequential writes — like saving a multi-GB PyTorch checkpoint (optimizer states, model weights, scheduler, RNG state) — can deadlock when:
- A Windows process (antivirus, VSS, Windows Search) holds a lock on the output directory
- The Windows virtual disk hits write pressure from concurrent activity
The Linux process blocks in a kernel `write()` syscall waiting for virtio-9p to acknowledge the write. The process is alive and spinning at 100% CPU in the kernel, but no userspace progress occurs. This is distinct from OOM kills (which log clearly) and out-of-disk errors (which exit cleanly).
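One way to see where the process is sitting is to read its scheduler state and wait channel from `/proc`. This is a sketch, not a definitive diagnosis: both helper names are hypothetical, and `/proc/<pid>/wchan` may report `0` or be unavailable depending on the kernel.

```bash
# Hypothetical helper: extract the one-letter scheduler state from a
# /proc/<pid>/status file (R = running, S = sleeping, D = uninterruptible).
proc_state_from_status() {
  awk '/^State:/ { print $2; exit }' "$1"
}

# Print the state and, where the kernel exposes it, the wait channel.
show_stuck_state() {
  local pid="$1"
  printf 'state=%s wchan=%s\n' \
    "$(proc_state_from_status "/proc/$pid/status")" \
    "$(cat "/proc/$pid/wchan" 2>/dev/null || echo unknown)"
}

# Usage (the script name is taken from the kill example in this article):
# show_stuck_state "$(pgrep -f 'train_v3.py' | head -n 1)"
```

If the same state and wait channel show up across several checks while the step counter stays frozen, that is consistent with a write stuck below userspace.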
---
## Fix: Train on Linux-Native Storage
Keep all training I/O on Linux ext4 (`~/`), and copy final artifacts to Windows only after training completes.
### Change output paths
```bash
# Before
TRAIN_OUT="/mnt/d/corpus/training-runs/v9"
GGUF_OUT="/mnt/d/corpus/models"
# After — Linux-native for training
TRAIN_OUT="/home/majorlinux/corpus/training-runs/v8i"
GGUF_OUT="/home/majorlinux/corpus/models"
```
The WSL2 home directory lives on a Linux ext4 `.vhdx` managed by WSL2 — writes here bypass virtio-9p entirely.
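A launch script can refuse to start when the output directory is host-backed. The sketch below rests on stated assumptions: the helper names are hypothetical, the list of filesystem types treated as Windows-backed (`9p`, `drvfs`, `virtiofs`) is a guess at how WSL reports host drives, and `df --output` requires GNU coreutils.

```bash
# Classify a filesystem type string. The set of "Windows-backed" types
# is an assumption about how WSL mounts host drives.
is_windows_backed_fstype() {
  case "$1" in
    9p|drvfs|virtiofs) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical guard: fail if the training output directory is host-backed.
# Uses GNU df's --output option to read the filesystem type.
check_train_dir() {
  local dir="$1" fstype
  fstype=$(df --output=fstype "$dir" | tail -n 1)
  if is_windows_backed_fstype "$fstype"; then
    echo "refusing: $dir is on $fstype (Windows-backed)" >&2
    return 1
  fi
  echo "ok: $dir is on $fstype"
}

# Usage: check_train_dir "$TRAIN_OUT" || exit 1
```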
### Copy to Windows after training finishes
```bash
cp "$GGUF_OUT/majortwin-v8i-q4-k-m.gguf" "/mnt/d/corpus/models/"
cp "$GGUF_OUT/majortwin-v8i-q4-k-m.gguf" "/mnt/d/MajorTwin/06-Models/"
```
Single large-file copies to `/mnt/d/` complete reliably — it's repeated checkpoint saves during training that deadlock.
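Since the copy crosses the same filesystem boundary, a checksum comparison afterwards is a cheap sanity check. A minimal sketch, with `copy_verified` as a hypothetical helper; it assumes `sha256sum` from coreutils, and reading the file back may be served partly from cache, so treat a match as reassurance rather than proof.

```bash
# Hypothetical helper: copy one file into a directory and compare SHA-256
# sums of source and destination. Returns non-zero on copy failure or mismatch.
copy_verified() {
  local src="$1" dest_dir="$2" dest src_sum dest_sum
  dest="$dest_dir/$(basename "$src")"
  cp "$src" "$dest" || return 1
  src_sum=$(sha256sum "$src" | awk '{print $1}')
  dest_sum=$(sha256sum "$dest" | awk '{print $1}')
  if [ "$src_sum" != "$dest_sum" ]; then
    echo "checksum mismatch: $dest" >&2
    return 1
  fi
  echo "verified: $dest"
}

# Usage: copy_verified "$GGUF_OUT/majortwin-v8i-q4-k-m.gguf" /mnt/d/corpus/models
```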
### Kill a stuck training process
```bash
kill $(pgrep -f 'train_v3.py')
sleep 2
tmux kill-session -t majortwin_v8i
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader
# Should show low utilization and <1GB memory used
```
The original checkpoint files from the previous run in `/mnt/d/` are untouched — the deadlock prevents writes, it does not corrupt existing data.
---
## Why Previous Runs May Have Worked
The deadlock is not guaranteed. It depends on Windows-side state at checkpoint save time. Factors:
- Antivirus scanning newly created checkpoint files
- Windows Search indexing the output directory
- VSS snapshot in progress
- Concurrent Windows desktop I/O
A run on a quiet machine may succeed; the same run during normal desktop use may deadlock.
---
## Confirming the Fix
```bash
# Watch for checkpoint directories appearing at each save_steps interval
watch -n 30 'ls -lt ~/corpus/training-runs/v8i/ | head -8'
# GPU should be active (85-99%) during training steps
nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader
```
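For unattended runs, the same check can be reduced to a number: how long ago the newest checkpoint directory changed. This is a sketch under assumptions: `newest_checkpoint_age` is a hypothetical helper, it expects Trainer-style `checkpoint-<step>` directory names, and it uses GNU `stat -c %Y`.

```bash
# Hypothetical helper: seconds since the newest checkpoint-* directory under
# a run directory was last modified. Prints nothing and returns 1 if none exist.
newest_checkpoint_age() {
  local run_dir="$1" newest
  newest=$(ls -1dt "$run_dir"/checkpoint-* 2>/dev/null | head -n 1)
  [ -n "$newest" ] || return 1
  echo $(( $(date +%s) - $(stat -c %Y "$newest") ))
}

# Usage: compare the age against your expected save interval
# age=$(newest_checkpoint_age ~/corpus/training-runs/v8i) && echo "last save ${age}s ago"
```

An age that keeps growing well past the time one `save_steps` interval normally takes suggests the run has stalled again.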
---
## Notes
- Setting `save_strategy="no"` in TrainingArguments eliminates checkpoint saves entirely — useful as a diagnostic to confirm this is the cause, at the cost of no crash recovery.
- `torch.compile()` / `torch._inductor` can add hours of CPU-bound kernel compilation before the first training step. Long startup + eventual freeze together can make a session look permanently stuck when they're actually two separate issues.
- This applies to any large sequential WSL2→Windows write, not just PyTorch — large `rsync` or `tar` to `/mnt/<drive>/` can also stall.
---
## Related
- [[wsl2-rebuild-fedora43-training-env]] — Full WSL2 training environment setup
- [[wsl2-backup-powershell]] — Backing up WSL2 virtual disks from PowerShell
- [[ansible-wsl2-world-writable-mount-ignores-cfg]] — Other WSL2 filesystem quirks

@ -10,7 +10,7 @@ tags:
- deno
status: published
created: 2026-04-02
updated: 2026-04-22T11:33
updated: 2026-06-16T18:35
---
# yt-dlp YouTube JS Challenge Fix (Fedora)
@ -84,12 +84,43 @@ echo '--remote-components ejs:github' > ~/.config/yt-dlp/config
## Maintenance
YouTube pushes extractor changes frequently. Keep yt-dlp current:
YouTube pushes extractor changes frequently. Keep yt-dlp current.
### Updating: the `-U` trap + avoid duplicate installs
`yt-dlp -U` **does not work** when yt-dlp was installed via pip/PyPI — the PyPI build deliberately disables the self-updater:
```
ERROR: You installed yt-dlp with pip or using the wheel from PyPi; Use that to update
```
Update through pip instead. **Pick one install method and stick to it** — running both a user install and a system install leaves two copies that drift out of sync (one updates, the other stays stale and shadows it depending on `$PATH` / sudo).
**Recommended — single user install (no sudo):**
```bash
pip3 install -U --user yt-dlp
```
This lives in `~/.local/bin/yt-dlp` and is first on a normal user's `$PATH`. Update it the same way; never use sudo.
**Alternative — system-wide (Fedora, PEP 668):**
```bash
sudo pip install -U yt-dlp --break-system-packages
```
> Only use `--break-system-packages` if you intentionally want a root-owned copy in `/usr/local`. Do **not** mix it with a `--user` install.
**Check for and remove a duplicate install:**
```bash
which -a yt-dlp # more than one path = duplicate installs
sudo pip3 uninstall -y yt-dlp # removes the /usr/local (system) copy + its wrapper
```
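To see which copy is stale, it helps to print the version each one reports, in `$PATH` lookup order. A minimal sketch: `list_ytdlp_copies` is a hypothetical helper, and it relies only on `yt-dlp --version`.

```bash
# Hypothetical helper: print every yt-dlp executable found on $PATH, in
# lookup order, with the version each one reports. The first line is the
# copy your shell actually runs.
list_ytdlp_copies() {
  local old_ifs="$IFS" dir
  IFS=:
  for dir in $PATH; do
    IFS="$old_ifs"
    [ -n "$dir" ] || dir=.
    if [ -x "$dir/yt-dlp" ]; then
      printf '%s\t%s\n' "$dir/yt-dlp" "$("$dir/yt-dlp" --version 2>/dev/null)"
    fi
  done
  IFS="$old_ifs"
}

# Usage: list_ytdlp_copies
```

Run it once as your user and once under `sudo`; differing first lines mean the two contexts resolve to different copies.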
> If installed via the standalone binary (not pip), `yt-dlp -U` is the correct updater.
---
## Known Limitations

@ -2,7 +2,7 @@
title: MajorWiki Deployment Status
status: deployed
project: MajorTwin
updated: 2026-04-07T10:48
updated: 2026-04-30T05:30
created: 2026-04-02T16:10
---

@ -1,6 +1,6 @@
---
created: 2026-04-06T09:52
updated: 2026-04-29T22:46
updated: 2026-04-30T05:21
---
# MajorLinux Tech Wiki — Index

@ -1,6 +1,6 @@
---
created: 2026-04-02T16:03
updated: 2026-04-29T23:55
updated: 2026-06-21T11:46
---
* [Home](index.md)
* [Linux & Sysadmin](01-linux/index.md)
@ -12,10 +12,12 @@ updated: 2026-04-29T23:55
* [Bash Scripting Patterns](01-linux/shell-scripting/bash-scripting-patterns.md)
* [SnapRAID & MergerFS Storage Setup](01-linux/storage/snapraid-mergerfs-setup.md)
* [mdadm — Rebuilding a RAID Array After Reinstall](01-linux/storage/mdadm-raid-rebuild.md)
* [Growing an LVM Volume by Absorbing Another Disk](01-linux/storage/lvm-grow-volume-absorb-disk.md)
* [Linux Distro Guide for Beginners](01-linux/distro-specific/linux-distro-guide-beginners.md)
* [WSL2 Instance Migration to Fedora 43](01-linux/distro-specific/wsl2-instance-migration-fedora43.md)
* [WSL2 Training Environment Rebuild](01-linux/distro-specific/wsl2-rebuild-fedora43-training-env.md)
* [WSL2 Backup via PowerShell](01-linux/distro-specific/wsl2-backup-powershell.md)
* [WSL2 In-Place Upgrade to Fedora 44](01-linux/distro-specific/wsl2-fedora44-inplace-upgrade.md)
* [Self-Hosting & Homelab](02-selfhosting/index.md)
* [Self-Hosting Starter Guide](02-selfhosting/docker/self-hosting-starter-guide.md)
* [Docker vs VMs for the Homelab](02-selfhosting/docker/docker-vs-vms-homelab.md)
@ -28,15 +30,23 @@ updated: 2026-04-29T23:55
* [Wake-on-LAN via Router SSH](02-selfhosting/dns-networking/wake-on-lan-router-ssh.md)
* [Pi-hole v6 Group Management — Per-Client DNS Rules](02-selfhosting/dns-networking/pihole-v6-group-management.md)
* [AWS S3 Cost Management](02-selfhosting/cloud/aws-s3-cost-management.md)
* [VPS Migration Baseline Checklist](02-selfhosting/cloud/vps-migration-baseline-checklist.md)
* [rsync Backup Patterns](02-selfhosting/storage-backup/rsync-backup-patterns.md)
* [Fleet Backups with restic + B2](02-selfhosting/storage-backup/restic-b2-fleet-backups.md)
* [Tuning Netdata Web Log Alerts](02-selfhosting/monitoring/tuning-netdata-web-log-alerts.md)
* [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md)
* [Deploying Netdata to a New Server](02-selfhosting/monitoring/netdata-new-server-setup.md)
* [Netdata SELinux AVC Denial Monitoring](02-selfhosting/monitoring/netdata-selinux-avc-chart.md)
* [Netdata n8n Enriched Alert Emails](02-selfhosting/monitoring/netdata-n8n-enriched-alerts.md)
* [Logwatch Fleet Setup — Surviving Package Upgrades](02-selfhosting/monitoring/logwatch-fleet-setup.md)
* [Updating n8n Running in Docker](02-selfhosting/services/updating-n8n-docker.md)
* [Mastodon Instance Tuning](02-selfhosting/services/mastodon-instance-tuning.md)
* [Mastodon Post-Install Hardening (Permissions + Account)](02-selfhosting/services/mastodon-post-install-hardening.md)
* [Mastodon — The `--prune-profiles` Trap and How to Recover](02-selfhosting/services/mastodon-prune-profiles-trap.md)
* [Mastodon on S3 — Silent Upload Failures (BucketOwnerEnforced/ACLs)](02-selfhosting/services/mastodon-s3-acl-upload-failures.md)
* [Mastodon — Triaging Crowdfunding / Mention-Spam Accounts](02-selfhosting/services/mastodon-mention-spam-crowdfunding.md)
* [Ghost Email Configuration with Mailgun](02-selfhosting/services/ghost-smtp-mailgun-setup.md)
* [Inbound Spam Filtering: spamass-milter + SpamAssassin Bayes](02-selfhosting/services/postfix-spamassassin-bayes-spam-filtering.md)
* [Claude Code Remote Control — Mobile Access to a Persistent Host Session](02-selfhosting/services/claude-code-remote-control.md)
* [Linux Server Hardening Checklist](02-selfhosting/security/linux-server-hardening-checklist.md)
* [Standardizing unattended-upgrades with Ansible](02-selfhosting/security/ansible-unattended-upgrades-fleet.md)
@ -50,7 +60,10 @@ updated: 2026-04-29T23:55
* [Fail2ban Custom Jail: Nginx Bad Request Detection](02-selfhosting/security/fail2ban-nginx-bad-request-jail.md)
* [Fail2ban Custom Jail: Apache Bad Request Detection](02-selfhosting/security/fail2ban-apache-bad-request-jail.md)
* [SSH Hardening Fleet-Wide with Ansible](02-selfhosting/security/ssh-hardening-ansible-fleet.md)
* [Migrating Flat Ansible Playbooks to Roles (Safely)](02-selfhosting/security/ansible-flat-playbooks-to-roles.md)
* [ClamAV Fleet Deployment with Ansible](02-selfhosting/security/clamav-fleet-deployment.md)
* [Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts](02-selfhosting/security/fail2ban-digest-mode-fleet.md)
* [Apache CVE-2026-23918 — HTTP/2 Double Free Mitigation](02-selfhosting/security/apache-cve-2026-23918-http2-mitigation.md)
* [Open Source & Alternatives](03-opensource/index.md)
* [SearXNG: Private Self-Hosted Search](03-opensource/alternatives/searxng.md)
* [FreshRSS: Self-Hosted RSS Reader](03-opensource/alternatives/freshrss.md)
@ -65,42 +78,79 @@ updated: 2026-04-29T23:55
* [Streaming & Podcasting](04-streaming/index.md)
* [OBS Studio Setup & Encoding](04-streaming/obs/obs-studio-setup-encoding.md)
* [Plex 4K Codec Compatibility (Apple TV)](04-streaming/plex/plex-4k-codec-compatibility.md)
* [HEVC Batch Re-Encode for Plex Using VAAPI (AMD GPU)](04-streaming/plex/hevc-vaapi-batch-encode.md)
* [Plex Transcoding Troubleshooting](04-streaming/plex/plex-transcoding-troubleshooting.md)
* [Troubleshooting](05-troubleshooting/index.md)
* [Wi-Fi Game Streaming Stutter: 160 MHz Channel Width Saturating the 5 GHz Radio](05-troubleshooting/networking/wifi-160mhz-airtime-saturation-game-streaming.md)
* [Steam Deck Wi-Fi Flapping: IWD Periodic Scan + rtw88 Power Save](05-troubleshooting/networking/steam-deck-wifi-flapping-iwd-periodic-scan-rtw88.md)
* [Apache Outage: Fail2ban Self-Ban + Missing iptables Rules](05-troubleshooting/networking/fail2ban-self-ban-apache-outage.md)
* [Postfix + SendGrid: TLS Handshake Failure (Port 465 vs 587)](05-troubleshooting/networking/postfix-sendgrid-tls-handshake-failure.md)
* [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](05-troubleshooting/networking/fail2ban-imap-self-ban-mail-client.md)
* [firewalld: Mail Ports Wiped After Reload](05-troubleshooting/networking/firewalld-mail-ports-reset.md)
* [Dovecot IMAP Clients Fail to Sync: vsz_limit OOM from a Bloated Index Log](05-troubleshooting/networking/dovecot-imap-oom-vsz-limit-bloated-index.md)
* [Postfix header_checks Can't Act on Milter-Added Headers (Use Sieve)](05-troubleshooting/networking/postfix-header-checks-vs-milter-headers.md)
* [Dovecot Phantom Mailboxes from .dovecot.lda-dupes (mail_home Overlapping the Maildir Root)](05-troubleshooting/networking/dovecot-mail-home-maildir-root-phantom-mailboxes.md)
* [Tailscale SSH: Unexpected Re-Authentication Prompt](05-troubleshooting/networking/tailscale-ssh-reauth-prompt.md)
* [ssh.socket Unreachable After Reboot (Tailscale Race Condition)](05-troubleshooting/networking/ssh-socket-tailscale-race-condition.md)
* [Fail2ban & UFW Rule Bloat Cleanup](05-troubleshooting/networking/fail2ban-ufw-rule-bloat-cleanup.md)
* [Custom Fail2ban Jail: Apache Directory Scanning](05-troubleshooting/security/apache-dirscan-fail2ban-jail.md)
* [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md)
* [Castopod: Stale Federated Avatar URLs After Remote Profile Updates](05-troubleshooting/security/castopod-stale-federated-avatar.md)
* [Castopod Posts Don't Appear on Mastodon — Diagnosing the Federation Path](05-troubleshooting/security/castopod-broadcast-not-on-mastodon.md)
* [Nextcloud AIO Unhealthy 20h After Nightly Update](05-troubleshooting/docker/nextcloud-aio-unhealthy-20h-stuck.md)
* [n8n Behind Reverse Proxy: X-Forwarded-For Trust Fix](05-troubleshooting/docker/n8n-proxy-trust-x-forwarded-for.md)
* [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](05-troubleshooting/docker-caddy-selinux-post-reboot-recovery.md)
* [ISP SNI Filtering with Caddy](05-troubleshooting/isp-sni-filtering-caddy.md)
* [Obsidian Vault Recovery — Loading Cache Hang](05-troubleshooting/obsidian-cache-hang-recovery.md)
* [Qwen2.5-14B OOM on RTX 3080 Ti (12GB)](05-troubleshooting/gpu-display/qwen-14b-oom-3080ti.md)
* [LoRA adapter — GGUF conversion fails with 'config.json not found'](05-troubleshooting/gpu-display/lora-adapter-gguf-conversion-fails.md)
* [yt-dlp YouTube JS Challenge Fix on Fedora](05-troubleshooting/yt-dlp-fedora-js-challenge.md)
* [Gemini CLI Manual Update](05-troubleshooting/gemini-cli-manual-update.md)
* [MajorWiki Setup & Publishing Pipeline](05-troubleshooting/majwiki-setup-and-pipeline.md)
* [Gitea Actions Runner: Boot Race Condition Fix](05-troubleshooting/gitea-runner-boot-race-network-target.md)
* [Forgejo: Account Recovery & CLI Admin When Locked Out of the GUI](05-troubleshooting/forgejo-mailer-and-cli-recovery.md)
* [Cron Heartbeat False Alarm: /var/run Cleared by Reboot](05-troubleshooting/cron-heartbeat-tmpfs-reboot-false-alarm.md)
* [SELinux: Fixing Dovecot Mail Spool Context (/var/vmail)](05-troubleshooting/selinux-dovecot-vmail-context.md)
* [SELinux: Wrong /etc/localtime Label Silently Breaks Timezone Changes](05-troubleshooting/selinux-localtime-label-breaks-timezone.md)
* [mdadm RAID Recovery After USB Hub Disconnect](05-troubleshooting/storage/mdadm-usb-hub-disconnect-recovery.md)
* [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md)
* [Windows OpenSSH: WSL Default Shell Breaks Remote Commands](05-troubleshooting/networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md)
* [Pi-hole AI Blocklist Blocks Claude Desktop (ERR_CONNECTION_REFUSED)](05-troubleshooting/networking/pihole-blocks-claude-desktop.md)
* [Claude Desktop MCP Server Started via wsl.exe Sees Empty Environment (WSLENV)](05-troubleshooting/wsl-env-claude-desktop-mcp.md)
* [Claude Desktop MCP Mass-Disconnect After Blocking SSH Reboot](05-troubleshooting/claude-desktop-mcp-mass-disconnect-blocking-reboot.md)
* [Patching PHP 8.4 Implicit-Nullable Deprecations in Vendor Packages](05-troubleshooting/php-84-vendor-implicit-nullable-patch.md)
* [WordPress 6.7 `_load_textdomain_just_in_time` Notice (Translations Loaded Too Early)](05-troubleshooting/wordpress-67-textdomain-just-in-time-notice.md)
* [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md)
* [Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt](05-troubleshooting/ollama-chat-template-pipe-stdin-bypass.md)
* [Claude Code Won't Log In (Warp & iTerm2) — Corrupt Keychain Credential](05-troubleshooting/claude-code-warp-login-corrupt-keychain-credential.md)
* [Claude Code Keychain Prompt Keeps Reappearing on macOS (ACL Invalidation)](05-troubleshooting/claude-code-keychain-prompt-recurring-macos.md)
* [iPhone Mirroring Hangs on 'Connecting…' — AWDL Data Stall (27.0 Beta)](05-troubleshooting/iphone-mirroring-connecting-hang-awdl-stall-beta.md)
* [rsync over Tailscale: Hung in TCP Teardown After Transfer Completes](05-troubleshooting/networking/rsync-tailscale-teardown-stall.md)
* [iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators](05-troubleshooting/networking/tailscale-status-json-hostname-localhost-ios.md)
* [macOS: Repeating Alert Tone from Mirrored iPhone Notification](05-troubleshooting/macos-mirrored-notification-alert-loop.md)
* [Auditing & Cleaning macOS Background App Activity (sfltool dumpbtm)](05-troubleshooting/macos-background-app-activity-audit-sfltool.md)
* [Time Machine: Orphaned APFS `.previous` Folder Blocks All Backups](05-troubleshooting/time-machine-apfs-orphaned-previous-blocks-backup.md)
* [OBS Studio: Stale Script Paths After Windows Profile Rename](05-troubleshooting/obs-stale-script-paths-after-windows-profile-rename.md)
* [ClamAV CPU Spike: Safe Scheduling with nice/ionice](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md)
* [Logwatch Falsely Reports 'No freshclam updates' in ClamAV Daemon Mode](05-troubleshooting/security/freshclam-logwatch-false-no-updates.md)
* [Fedora CA Bundle Missing Symlink — TLS Breaks Fleet-Wide](05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md)
* [Netdata apps-group FD-utilisation false 100% (silenced fleet-wide)](05-troubleshooting/security/netdata-apps-fds-group-false-positive.md)
* [Ansible: Vault Password File Not Found](05-troubleshooting/ansible-vault-password-file-missing.md)
* [Ansible: ansible.cfg Ignored on WSL2 Windows Mounts](05-troubleshooting/ansible-wsl2-world-writable-mount-ignores-cfg.md)
* [WSL2: PyTorch Training Deadlocks on Windows Filesystem Checkpoint Saves](05-troubleshooting/wsl2-pytorch-checkpoint-windows-filesystem-deadlock.md)
* [Ansible: SSH Timeout During dnf upgrade on Fedora Hosts](05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md)
* [Ansible: regex_search Capture-Group Argument Fails in set_fact](05-troubleshooting/ansible-regex-search-set-fact-capture-group.md)
* [Ansible: Ubuntu Reboot Detection Misses Kernel Upgrades](05-troubleshooting/ansible-ubuntu-reboot-detection-kernel-mismatch.md)
* [Ansible: reboot.yml become Timeout on WSL2 Hosts (Exclude Them)](05-troubleshooting/ansible-reboot-become-timeout-wsl2.md)
* [Fedora Networking & Kernel Troubleshooting](05-troubleshooting/fedora-networking-kernel-recovery.md)
* [Systemd Session Scope Fails at Login](05-troubleshooting/systemd/session-scope-failure-at-login.md)
* [wget/curl: URLs with Special Characters Fail in Bash](05-troubleshooting/wget-url-special-characters.md)
* [Ansible: Check Mode False Positives in Verify/Assert Tasks](05-troubleshooting/ansible-check-mode-false-positives.md)
* [Ansible Fails with Permission Denied While `ssh <alias>` Works (Host Alias Bypass)](05-troubleshooting/ansible-ssh-host-alias-bypass.md)
* [SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure (No `Host` Block)](05-troubleshooting/networking/ssh-missing-host-block-magicdns-host-key-failure.md)
* [MagicDNS Names vs Pinned IPs for Tailscale SSH (After a Fleet Migration)](05-troubleshooting/networking/tailscale-ssh-magicdns-vs-pinned-ip-after-migration.md)
* [`Permission denied (publickey)` After Rotating a Key — Backfill Every `authorized_keys`](05-troubleshooting/networking/ssh-rotated-key-not-backfilled-authorized-keys.md)
* [Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration](05-troubleshooting/networking/ansible-host-key-verification-failed-rebuilt-host.md)
* [Logwatch Reports the Wrong Hostname (`<host>-hetzner`) After a Migration](05-troubleshooting/logwatch-wrong-hostname-after-migration.md)
* [Ghost EmailAnalytics Lag Warning — What It Means and When to Worry](05-troubleshooting/ghost-emailanalytics-lag-warning.md)
* [claude-mem: --setting-sources Empty Arg Bug (Claude Code 2.1.x)](05-troubleshooting/claude-mem-setting-sources-empty-arg.md)

index.md

@ -1,179 +1,214 @@
---
created: 2026-04-06T09:52
updated: 2026-04-29T22:45
updated: 2026-05-10T01:30
---
# MajorLinux Tech Wiki — Index
> A growing reference of Linux, self-hosting, open source, streaming, and troubleshooting guides. Written by MajorLinux. Used by MajorTwin.
>
> **Last updated:** 2026-04-18
> **Article count:** 89
> **Last updated:** 2026-05-10
> **Article count:** 111
## Domains
| Domain | Folder | Articles |
|---|---|---|
| 🐧 Linux & Sysadmin | `01-linux/` | 12 |
| 🏠 Self-Hosting & Homelab | `02-selfhosting/` | 32 |
| 🏠 Self-Hosting & Homelab | `02-selfhosting/` | 39 |
| 🔓 Open Source Tools | `03-opensource/` | 10 |
| 🎙️ Streaming & Podcasting | `04-streaming/` | 2 |
| 🔧 General Troubleshooting | `05-troubleshooting/` | 34 |
| 🔧 General Troubleshooting | `05-troubleshooting/` | 48 |
---
## 🐧 Linux & Sysadmin
### Files & Permissions
- [Linux File Permissions](01-linux/files-permissions/linux-file-permissions.md) — chmod, chown, special bits, finding permission problems
### Distro-Specific
- [Linux Distro Guide for Beginners](01-linux/distro-specific/linux-distro-guide-beginners.md)
- [WSL2 Backup via PowerShell Scheduled Task](01-linux/distro-specific/wsl2-backup-powershell.md)
- [WSL2 Instance Migration (Fedora 43)](01-linux/distro-specific/wsl2-instance-migration-fedora43.md)
- [WSL2 Training Environment Rebuild (Fedora 43)](01-linux/distro-specific/wsl2-rebuild-fedora43-training-env.md)
### Process Management
- [Managing Linux Services with systemd](01-linux/process-management/managing-linux-services-systemd-ansible.md) — systemctl, journalctl, writing service files, Ansible service management
### Files & Permissions
- [Linux File Permissions and Ownership](01-linux/files-permissions/linux-file-permissions.md)
### Networking
- [SSH Config & Key Management](01-linux/networking/ssh-config-key-management.md) — key generation, ssh-copy-id, ~/.ssh/config, managing multiple keys, Windows OpenSSH admin key auth
- [SSH Config and Key Management](01-linux/networking/ssh-config-key-management.md)
### Package Management
- [Package Management Reference](01-linux/packages/package-management-reference.md) — apt, dnf, pacman side-by-side reference, Flatpak/Snap
- [Linux Package Management Reference: apt, dnf, pacman](01-linux/packages/package-management-reference.md)
### Process Management
- [Managing Linux Services: systemd and Ansible](01-linux/process-management/managing-linux-services-systemd-ansible.md)
### Shell & Scripting
- [Ansible Getting Started](01-linux/shell-scripting/ansible-getting-started.md) — inventory, ad-hoc commands, playbooks, handlers, roles
- [Bash Scripting Patterns](01-linux/shell-scripting/bash-scripting-patterns.md) — set -euo pipefail, logging, error handling, argument parsing, common patterns
- [Ansible Getting Started: Inventory, Playbooks, and Ad-Hoc Commands](01-linux/shell-scripting/ansible-getting-started.md)
- [Bash Scripting Patterns for Sysadmins](01-linux/shell-scripting/bash-scripting-patterns.md)
### Storage
- [SnapRAID & MergerFS Storage Setup](01-linux/storage/snapraid-mergerfs-setup.md) — Pooling mismatched drives and adding parity on Linux
- [mdadm — Rebuilding a RAID Array After Reinstall](01-linux/storage/mdadm-raid-rebuild.md) — reassembling and recovering mdadm arrays after OS reinstall
- [SnapRAID & MergerFS Storage Setup](01-linux/storage/snapraid-mergerfs-setup.md)
- [mdadm — Rebuilding a RAID Array After Reinstall](01-linux/storage/mdadm-raid-rebuild.md)
### Distro-Specific
- [Linux Distro Guide for Beginners](01-linux/distro-specific/linux-distro-guide-beginners.md) — Ubuntu recommendation, distro comparison, desktop environments
- [WSL2 Instance Migration to Fedora 43](01-linux/distro-specific/wsl2-instance-migration-fedora43.md) — moving WSL2 VHDX from C: to another drive
- [WSL2 Training Environment Rebuild (Fedora 43)](01-linux/distro-specific/wsl2-rebuild-fedora43-training-env.md) — rebuilding the MajorTwin training env in WSL2 from scratch
- [WSL2 Backup via PowerShell Scheduled Task](01-linux/distro-specific/wsl2-backup-powershell.md) — automating WSL2 exports on a schedule using PowerShell
---
## 🏠 Self-Hosting & Homelab
### Docker & Containers
- [Self-Hosting Starter Guide](02-selfhosting/docker/self-hosting-starter-guide.md) — hardware options, Docker install, first services, networking basics
- [Docker vs VMs for the Homelab](02-selfhosting/docker/docker-vs-vms-homelab.md) — when to use containers vs VMs, KVM setup, how to run both
- [Debugging Broken Docker Containers](02-selfhosting/docker/debugging-broken-docker-containers.md) — logs, inspect, exec, port conflicts, permission errors
- [Docker Healthchecks](02-selfhosting/docker/docker-healthchecks.md) — writing and debugging HEALTHCHECK instructions in Docker containers
- [Watchtower SMTP via Localhost Postfix Relay](02-selfhosting/docker/watchtower-smtp-localhost-relay.md) — credential-free container update notifications by routing through a local Postfix relay
### Reverse Proxies
- [Setting Up Caddy as a Reverse Proxy](02-selfhosting/reverse-proxy/setting-up-caddy-reverse-proxy.md) — Caddyfile basics, automatic HTTPS, local TLS, DNS challenge
### Cloud
- [AWS S3 Cost Management](02-selfhosting/cloud/aws-s3-cost-management.md)
### DNS & Networking
- [Tailscale for Homelab Remote Access](02-selfhosting/dns-networking/tailscale-homelab-remote-access.md) — installation, MagicDNS, making services accessible, subnet router, ACLs
- [Network Overview](02-selfhosting/dns-networking/network-overview.md) — MajorsHouse network topology, Tailscale IPs, and connectivity map
- [Wake-on-LAN via Router SSH](02-selfhosting/dns-networking/wake-on-lan-router-ssh.md) — send WOL magic packets through an Asus router over SSH, with Ansible vault integration
- [Network Overview](02-selfhosting/dns-networking/network-overview.md)
- [Pi-hole DoH / DoT Bypass Defense](02-selfhosting/dns-networking/pihole-doh-dot-bypass-defense.md)
- [Pi-hole v6 Adlist Management via SQL](02-selfhosting/dns-networking/pihole-v6-adlist-management.md)
- [Pi-hole v6 Group Management: Per-Client DNS Rules](02-selfhosting/dns-networking/pihole-v6-group-management.md)
- [Tailscale for Homelab Remote Access](02-selfhosting/dns-networking/tailscale-homelab-remote-access.md)
- [Wake-on-LAN via Router SSH](02-selfhosting/dns-networking/wake-on-lan-router-ssh.md)
### Cloud
- [AWS S3 Cost Management](02-selfhosting/cloud/aws-s3-cost-management.md) — identify and control S3 costs: lifecycle rules, storage class selection, bucket inventory, unexpected-growth investigation
### Storage & Backup
- [rsync Backup Patterns](02-selfhosting/storage-backup/rsync-backup-patterns.md) — flags reference, remote backup, incremental with hard links, cron/systemd
### Docker & Containers
- [Debugging Broken Docker Containers](02-selfhosting/docker/debugging-broken-docker-containers.md)
- [Docker Healthchecks](02-selfhosting/docker/docker-healthchecks.md)
- [Docker vs VMs in the Homelab: Why Not Both?](02-selfhosting/docker/docker-vs-vms-homelab.md)
- [Self-Hosting Starter Guide](02-selfhosting/docker/self-hosting-starter-guide.md)
- [Watchtower SMTP via Localhost Postfix Relay](02-selfhosting/docker/watchtower-smtp-localhost-relay.md)
### Monitoring
- [Tuning Netdata Web Log Alerts](02-selfhosting/monitoring/tuning-netdata-web-log-alerts.md) — tuning web_log_1m_redirects threshold for HTTPS-forcing servers
- [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md) — preventing false alerts during nightly Nextcloud AIO container update cycles
- [Deploying Netdata to a New Server](02-selfhosting/monitoring/netdata-new-server-setup.md) — install, email notifications, and Netdata Cloud claim for Ubuntu/Debian servers
- [Netdata + n8n Enriched Alert Emails](02-selfhosting/monitoring/netdata-n8n-enriched-alerts.md) — rich HTML alert emails with remediation steps and wiki links via n8n
- [Netdata SELinux AVC Denial Monitoring](02-selfhosting/monitoring/netdata-selinux-avc-chart.md) — custom Netdata chart for tracking SELinux AVC denials
- [Deploying Netdata to a New Server](02-selfhosting/monitoring/netdata-new-server-setup.md)
- [Netdata SELinux AVC Denial Monitoring](02-selfhosting/monitoring/netdata-selinux-avc-chart.md)
- [Netdata n8n Enriched Alert Emails](02-selfhosting/monitoring/netdata-n8n-enriched-alerts.md)
- [Tuning Netdata Docker Health Alarms to Prevent Update Flapping](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md)
- [Tuning Netdata Web Log Alerts](02-selfhosting/monitoring/tuning-netdata-web-log-alerts.md)
### Reverse Proxies
- [Setting Up a Reverse Proxy with Caddy](02-selfhosting/reverse-proxy/setting-up-caddy-reverse-proxy.md)
### Security
- [Linux Server Hardening Checklist](02-selfhosting/security/linux-server-hardening-checklist.md) — non-root user, SSH key auth, sshd_config, firewall, fail2ban, SpamAssassin
- [Standardizing unattended-upgrades with Ansible](02-selfhosting/security/ansible-unattended-upgrades-fleet.md) — fleet-wide automatic security updates across Ubuntu servers
- [Fail2ban Custom Jail: Apache 404 Scanner Detection](02-selfhosting/security/fail2ban-apache-404-scanner-jail.md) — custom filter and jail for blocking 404 scanners
- [Fail2ban Custom Jail: Apache PHP Webshell Probe Detection](02-selfhosting/security/fail2ban-apache-php-probe-jail.md) — catching PHP webshell/backdoor probes that return 301 on HTTPS-redirecting servers
- [Fail2ban Custom Jail: WordPress Login Brute Force](02-selfhosting/security/fail2ban-wordpress-login-jail.md) — access-log-based wp-login.php brute force detection without plugins
- [SELinux: Fixing Fail2ban grep execmem Denial](02-selfhosting/security/selinux-fail2ban-execmem-fix.md) — resolving execmem AVC denials from Fail2ban's grep on Fedora
- [UFW Firewall Management](02-selfhosting/security/ufw-firewall-management.md) — managing UFW rules, common patterns, troubleshooting
- [Firewall Hardening with firewalld on Fedora Fleet](02-selfhosting/security/firewalld-fleet-hardening.md) — audit-and-harden pattern for Fedora fleet hosts using Ansible; flush stale rules, rebuild minimal whitelists
- [Fail2ban Custom Jail: Nginx Bad Request Detection](02-selfhosting/security/fail2ban-nginx-bad-request-jail.md) — wiring the stock nginx-bad-request filter to a jail to catch malformed-request scanners
- [Fail2ban Custom Jail: Apache Bad Request Detection](02-selfhosting/security/fail2ban-apache-bad-request-jail.md) — custom filter for Apache 400 Bad Request responses (no stock equivalent exists)
- [SSH Hardening Fleet-Wide with Ansible](02-selfhosting/security/ssh-hardening-ansible-fleet.md) — drop-in sshd config hardening across mixed Ubuntu/Fedora fleets
- [ClamAV Fleet Deployment with Ansible](02-selfhosting/security/clamav-fleet-deployment.md) — deploy ClamAV with nice/ionice throttling, freshclam, and quarantine to internet-facing hosts
- [ClamAV Fleet Deployment with Ansible](02-selfhosting/security/clamav-fleet-deployment.md)
- [Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts](02-selfhosting/security/fail2ban-digest-mode-fleet.md)
- [Fail2ban Custom Jail: Apache 404 Scanner Detection](02-selfhosting/security/fail2ban-apache-404-scanner-jail.md)
- [Fail2ban Custom Jail: Apache Bad Request Detection](02-selfhosting/security/fail2ban-apache-bad-request-jail.md)
- [Fail2ban Custom Jail: Apache PHP Webshell Probe Detection](02-selfhosting/security/fail2ban-apache-php-probe-jail.md)
- [Fail2ban Custom Jail: WordPress Login Brute Force](02-selfhosting/security/fail2ban-wordpress-login-jail.md)
- [Fail2ban: Enable the nginx-bad-request Jail](02-selfhosting/security/fail2ban-nginx-bad-request-jail.md)
- [Firewall Hardening with firewalld on Fedora Fleet](02-selfhosting/security/firewalld-fleet-hardening.md)
- [Linux Server Hardening Checklist](02-selfhosting/security/linux-server-hardening-checklist.md)
- [SELinux: Fixing Fail2ban grep execmem Denial on Fedora](02-selfhosting/security/selinux-fail2ban-execmem-fix.md)
- [SSH Hardening Fleet-Wide with Ansible](02-selfhosting/security/ssh-hardening-ansible-fleet.md)
- [Standardizing unattended-upgrades Across Ubuntu Fleet with Ansible](02-selfhosting/security/ansible-unattended-upgrades-fleet.md)
- [UFW Firewall Management](02-selfhosting/security/ufw-firewall-management.md)
- [wp-fail2ban Plugin Logpath on Debian/Ubuntu (auth.log, not syslog)](02-selfhosting/security/wp-fail2ban-logpath-debian-ubuntu.md)
### Services
- [Updating n8n Running in Docker](02-selfhosting/services/updating-n8n-docker.md) — pinned version updates, password reset, Arcane timing gaps
- [Mastodon Instance Tuning](02-selfhosting/services/mastodon-instance-tuning.md) — character limit increase, media cache management for self-hosted Mastodon
- [Ghost Email Configuration with Mailgun](02-selfhosting/services/ghost-smtp-mailgun-setup.md) — configuring Ghost's two independent mail systems (newsletter API + transactional SMTP) with Mailgun
- [Claude Code Remote Control — Mobile Access to a Persistent Host Session](02-selfhosting/services/claude-code-remote-control.md) — running `claude remote-control` on a host so `claude.ai` and the Claude mobile app can drive the CLI, with vault + MCPs intact
- [Claude Code Remote Control — Mobile Access to a Persistent Host Session](02-selfhosting/services/claude-code-remote-control.md)
- [Ghost Email Configuration with Mailgun](02-selfhosting/services/ghost-smtp-mailgun-setup.md)
- [Mastodon DB Maintenance — Statuses, Accounts, and VACUUM](02-selfhosting/services/mastodon-db-maintenance.md)
- [Mastodon Federation — Domain Blocks, Silencing, and FediSeer](02-selfhosting/services/mastodon-federation.md)
- [Mastodon Instance Tuning](02-selfhosting/services/mastodon-instance-tuning.md)
- [Mastodon — The `--prune-profiles` Trap and How to Recover](02-selfhosting/services/mastodon-prune-profiles-trap.md)
- [Updating n8n Running in Docker](02-selfhosting/services/updating-n8n-docker.md)
### Storage & Backup
- [rsync Backup Patterns](02-selfhosting/storage-backup/rsync-backup-patterns.md)
---
## 🔓 Open Source Tools
### Alternatives
- [SearXNG: Private Self-Hosted Search](03-opensource/alternatives/searxng.md) — metasearch engine that queries multiple engines without exposing your identity
- [FreshRSS: Self-Hosted RSS Reader](03-opensource/alternatives/freshrss.md) — algorithm-free feed aggregator with mobile app sync
- [Gitea: Self-Hosted Git](03-opensource/alternatives/gitea.md) — lightweight GitHub alternative, webhooks, single Docker container
### Productivity
- [rmlint: Duplicate File Scanning](03-opensource/productivity/rmlint-duplicate-scanning.md) — extremely fast duplicate file finding and storage reclamation
- [FreshRSS — Self-Hosted RSS Reader](03-opensource/alternatives/freshrss.md)
- [Gitea — Self-Hosted Git](03-opensource/alternatives/gitea.md)
- [SearXNG — Private Self-Hosted Search](03-opensource/alternatives/searxng.md)
### Development Tools
- [tmux: Persistent Terminal Sessions](03-opensource/dev-tools/tmux.md) — detachable sessions for long-running jobs over SSH
- [screen: Simple Persistent Sessions](03-opensource/dev-tools/screen.md) — lightweight terminal multiplexer, universally available
- [rsync: Fast, Resumable File Transfers](03-opensource/dev-tools/rsync.md) — incremental file sync locally and over SSH, survives interruptions
- [Ventoy: Multi-Boot USB Tool](03-opensource/dev-tools/ventoy.md) — drop ISOs on a USB drive and boot any of them, no reflashing
### Privacy & Security
- [Vaultwarden: Self-Hosted Password Manager](03-opensource/privacy-security/vaultwarden.md) — Bitwarden-compatible server in a single Docker container, passwords stay on your hardware
- [Ventoy — Multi-Boot USB Tool](03-opensource/dev-tools/ventoy.md)
- [rsync — Fast, Resumable File Transfers](03-opensource/dev-tools/rsync.md)
- [screen — Simple Persistent Terminal Sessions](03-opensource/dev-tools/screen.md)
- [tmux — Persistent Terminal Sessions](03-opensource/dev-tools/tmux.md)
### Media & Creative
- [yt-dlp: Video Downloading](03-opensource/media-creative/yt-dlp.md) — download from YouTube and hundreds of other sites, Plex-optimized format selection
- [yt-dlp — Video Downloading](03-opensource/media-creative/yt-dlp.md)
### Privacy & Security
- [Vaultwarden — Self-Hosted Password Manager](03-opensource/privacy-security/vaultwarden.md)
### Productivity
- [rmlint — Extreme Duplicate File Scanning](03-opensource/productivity/rmlint-duplicate-scanning.md)
---
## 🎙️ Streaming & Podcasting
### OBS Studio
- [OBS Studio Setup & Encoding](04-streaming/obs/obs-studio-setup-encoding.md) — installation, NVENC/x264 settings, scene setup, audio filters, Linux Wayland notes
- [OBS Studio Setup and Encoding Settings](04-streaming/obs/obs-studio-setup-encoding.md)
### Plex
- [Plex 4K Codec Compatibility (Apple TV)](04-streaming/plex/plex-4k-codec-compatibility.md) — AV1/VP9 vs HEVC, batch conversion script, yt-dlp auto-convert hook
- [Plex 4K Codec Compatibility (Apple TV)](04-streaming/plex/plex-4k-codec-compatibility.md)
---
## 🔧 General Troubleshooting
- [Apache Outage: Fail2ban Self-Ban + Missing iptables Rules](05-troubleshooting/networking/fail2ban-self-ban-apache-outage.md) — diagnosing and fixing Apache outages caused by missing firewall rules and Fail2ban self-bans
- [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](05-troubleshooting/networking/fail2ban-imap-self-ban-mail-client.md) — diagnosing why one device stops receiving email when the mail server is healthy
- [firewalld: Mail Ports Wiped After Reload](05-troubleshooting/networking/firewalld-mail-ports-reset.md) — recovering IMAP and webmail after firewalld reload drops all mail service rules
- [Fail2ban & UFW Rule Bloat: 30k Rules Slowing Down a VPS](05-troubleshooting/networking/fail2ban-ufw-rule-bloat-cleanup.md) — diagnosing and cleaning up massive nftables/UFW rule accumulation
- [Tailscale SSH: Unexpected Re-Authentication Prompt](05-troubleshooting/networking/tailscale-ssh-reauth-prompt.md) — resolving unexpected re-auth prompts on Tailscale SSH connections
- [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](05-troubleshooting/docker-caddy-selinux-post-reboot-recovery.md) — fixing docker.socket, SELinux port blocks, and httpd_can_network_connect after reboot
- [n8n Behind Reverse Proxy: X-Forwarded-For Trust Fix](05-troubleshooting/docker/n8n-proxy-trust-x-forwarded-for.md) — fixing webhook failures caused by missing proxy trust configuration
- [Nextcloud AIO Container Unhealthy for 20 Hours](05-troubleshooting/docker/nextcloud-aio-unhealthy-20h-stuck.md) — diagnosing stuck Nextcloud AIO containers after nightly update cycles
- [ISP SNI Filtering with Caddy](05-troubleshooting/isp-sni-filtering-caddy.md) — troubleshooting why wiki.majorshouse.com was blocked by Google Fiber
- [Obsidian Cache Hang Recovery](05-troubleshooting/obsidian-cache-hang-recovery.md) — resolving "Loading cache" hang in Obsidian by cleaning Electron app data and ML artifacts
- [macOS Repeating Alert Tone from Mirrored Notification](05-troubleshooting/macos-mirrored-notification-alert-loop.md) — stopping alert tone loops from mirrored iPhone notifications on Mac
- [Qwen2.5-14B OOM on RTX 3080 Ti (12GB)](05-troubleshooting/gpu-display/qwen-14b-oom-3080ti.md) — fixes and alternatives when hitting VRAM limits during fine-tuning
- [yt-dlp YouTube JS Challenge Fix on Fedora](05-troubleshooting/yt-dlp-fedora-js-challenge.md) — fixing YouTube JS challenge solver errors and missing formats on Fedora
- [Gemini CLI Manual Update](05-troubleshooting/gemini-cli-manual-update.md) — how to manually update the Gemini CLI when automatic updates fail
- [MajorWiki Setup & Pipeline](05-troubleshooting/majwiki-setup-and-pipeline.md) — setting up MajorWiki and the Obsidian → Gitea → MkDocs publishing pipeline
- [Gitea Actions Runner: Boot Race Condition Fix](05-troubleshooting/gitea-runner-boot-race-network-target.md) — fixing act_runner crash loop on boot caused by DNS not ready at startup
- [Cron Heartbeat False Alarm: /var/run Cleared by Reboot](05-troubleshooting/cron-heartbeat-tmpfs-reboot-false-alarm.md) — why `/run` is tmpfs and how a reboot wipes cron heartbeat files, and where to put them instead
- [SELinux: Fixing Dovecot Mail Spool Context (/var/vmail)](05-troubleshooting/selinux-dovecot-vmail-context.md) — fixing thousands of AVC denials when /var/vmail has wrong SELinux context
- [mdadm RAID Recovery After USB Hub Disconnect](05-troubleshooting/storage/mdadm-usb-hub-disconnect-recovery.md) — diagnosing and recovering a failed mdadm array caused by a USB hub dropout
- [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md) — fixing sshd not running after reboot due to Manual startup type
- [Windows OpenSSH: WSL Default Shell Breaks Remote Commands](05-troubleshooting/networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md) — fixing remote SSH command failures when wsl.exe is the default shell
- [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md) — keeping Ollama reachable over Tailscale by disabling macOS sleep on AC power
- [Ansible: Vault Password File Not Found](05-troubleshooting/ansible-vault-password-file-missing.md) — fixing the missing vault_pass file error when running ansible-playbook
- [Ansible: ansible.cfg Ignored on WSL2 Windows Mounts](05-troubleshooting/ansible-wsl2-world-writable-mount-ignores-cfg.md) — fixing silent config ignore due to world-writable /mnt/d/ permissions
- [Ansible SSH Timeout During dnf upgrade](05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md) — preventing SSH timeouts during long-running dnf upgrades on Fedora
- [Fedora Networking & Kernel Troubleshooting](05-troubleshooting/fedora-networking-kernel-recovery.md) — nmcli quick fix, GRUB kernel rollback, and recovery for Fedora fleet
- [Custom Fail2ban Jail: Apache Directory Scanning](05-troubleshooting/security/apache-dirscan-fail2ban-jail.md) — blocking directory scanners and junk HTTP methods
- [ClamAV Safe Scheduling on Live Servers](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md) — preventing clamscan CPU spikes with nice and ionice
- [Systemd Session Scope Fails at Login](05-troubleshooting/systemd/session-scope-failure-at-login.md) — fixing session-cN.scope failures during login
- [wget/curl: URLs with Special Characters Fail in Bash](05-troubleshooting/wget-url-special-characters.md) — fixing broken downloads caused by unquoted URLs with &, ?, # characters
- [Ansible: Check Mode False Positives in Verify/Assert Tasks](05-troubleshooting/ansible-check-mode-false-positives.md) — guarding verify/assert tasks with `when: not ansible_check_mode` to prevent false failures in dry runs
- [Ansible Fails with Permission Denied While `ssh <alias>` Works (Host Alias Bypass)](05-troubleshooting/ansible-ssh-host-alias-bypass.md) — SSH Host blocks match on literal pattern; `ansible_host: <IP>` bypasses the alias and the IdentityFile never gets applied
- [Ghost EmailAnalytics Lag Warning — What It Means and When to Worry](05-troubleshooting/ghost-emailanalytics-lag-warning.md) — explaining the lag counter, `submitted` status, and `fetchMissing end == begin` skip
- [claude-mem: --setting-sources Empty Arg Bug (Claude Code 2.1.x)](05-troubleshooting/claude-mem-setting-sources-empty-arg.md) — fixing silent pipeline failure when claude-mem 12.1.x spawns Claude Code 2.1.112+
- [Ansible Check Mode False Positives in Verify/Assert Tasks](05-troubleshooting/ansible-check-mode-false-positives.md)
- [Ansible Fails with Permission Denied While `ssh <alias>` Works (Host Alias Bypass)](05-troubleshooting/ansible-ssh-host-alias-bypass.md)
- [Ansible SSH Timeout During dnf upgrade on Fedora Hosts](05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md)
- [Ansible: Vault Password File Not Found](05-troubleshooting/ansible-vault-password-file-missing.md)
- [Ansible Ignores ansible.cfg on WSL2 Windows Mounts](05-troubleshooting/ansible-wsl2-world-writable-mount-ignores-cfg.md)
- [claude-mem Silently Fails with Claude Code 2.1+ (Empty --setting-sources)](05-troubleshooting/claude-mem-setting-sources-empty-arg.md)
- [Cron Heartbeat False Alarm: /var/run Cleared by Reboot](05-troubleshooting/cron-heartbeat-tmpfs-reboot-false-alarm.md)
- [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](05-troubleshooting/docker-caddy-selinux-post-reboot-recovery.md)
- [Fantastical Google Sync Error Flood — Phantom Calendars Fixed via syncselect](05-troubleshooting/fantastical-google-phantom-calendar-syncselect.md)
- [Fantastical MCP Server: Permission Denied on Launch (macOS Quarantine)](05-troubleshooting/fantastical-mcp-permission-denied.md)
- [Fedora Networking & Kernel Troubleshooting](05-troubleshooting/fedora-networking-kernel-recovery.md)
- [Gemini CLI: Manual Update Guide](05-troubleshooting/gemini-cli-manual-update.md)
- [Ghost EmailAnalytics Lag Warning — What It Means and When to Worry](05-troubleshooting/ghost-emailanalytics-lag-warning.md)
- [Gitea Actions Runner: Boot Race Condition Fix](05-troubleshooting/gitea-runner-boot-race-network-target.md)
- [ISP SNI Filtering & Caddy Troubleshooting](05-troubleshooting/isp-sni-filtering-caddy.md)
- [macOS Repeating Alert Tone from Mirrored iPhone Notification](05-troubleshooting/macos-mirrored-notification-alert-loop.md)
- [MajorWiki Setup & Publishing Pipeline](05-troubleshooting/majwiki-setup-and-pipeline.md)
- [Obsidian Vault Recovery — Loading Cache Hang](05-troubleshooting/obsidian-cache-hang-recovery.md)
- [Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt](05-troubleshooting/ollama-chat-template-pipe-stdin-bypass.md)
- [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md)
- [Python smtplib: Missing Date/Message-ID Headers Break Mail Clients](05-troubleshooting/python-smtplib-missing-rfc-headers.md)
- [SELinux: Fixing Dovecot Mail Spool Context (/var/vmail)](05-troubleshooting/selinux-dovecot-vmail-context.md)
- [Ubuntu dist-upgrade Quarantines Third-Party Repos](05-troubleshooting/ubuntu-dist-upgrade-repo-quarantine.md)
- [wget/curl: URLs with Special Characters Fail in Bash](05-troubleshooting/wget-url-special-characters.md)
- [yt-dlp YouTube JS Challenge Fix (Fedora)](05-troubleshooting/yt-dlp-fedora-js-challenge.md)
### Docker & Containers
- [Nextcloud AIO Container Unhealthy for 20 Hours After Nightly Update](05-troubleshooting/docker/nextcloud-aio-unhealthy-20h-stuck.md)
- [n8n Behind Reverse Proxy: X-Forwarded-For Trust Fix](05-troubleshooting/docker/n8n-proxy-trust-x-forwarded-for.md)
### GPU & Display
- [LoRA adapter — GGUF conversion fails with 'config.json not found'](05-troubleshooting/gpu-display/lora-adapter-gguf-conversion-fails.md)
- [Qwen2.5-14B OOM on RTX 3080 Ti (12GB)](05-troubleshooting/gpu-display/qwen-14b-oom-3080ti.md)
### Networking
- [Apache Outage: Fail2ban Self-Ban + Missing iptables Rules](05-troubleshooting/networking/fail2ban-self-ban-apache-outage.md)
- [Fail2ban & UFW Rule Bloat: 30k Rules Slowing Down a VPS](05-troubleshooting/networking/fail2ban-ufw-rule-bloat-cleanup.md)
- [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](05-troubleshooting/networking/fail2ban-imap-self-ban-mail-client.md)
- [Pi-hole AI Blocklist Blocks Claude Desktop (ERR_CONNECTION_REFUSED)](05-troubleshooting/networking/pihole-blocks-claude-desktop.md)
- [Tailscale SSH: Unexpected Re-Authentication Prompt](05-troubleshooting/networking/tailscale-ssh-reauth-prompt.md)
- [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md)
- [Windows OpenSSH: WSL as Default Shell Breaks Remote Commands](05-troubleshooting/networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md)
- [firewalld: Mail Ports Wiped After Reload (IMAP + Webmail Outage)](05-troubleshooting/networking/firewalld-mail-ports-reset.md)
- [iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators](05-troubleshooting/networking/tailscale-status-json-hostname-localhost-ios.md)
- [rsync over Tailscale: Hung in TCP Teardown After Transfer Completes](05-troubleshooting/networking/rsync-tailscale-teardown-stall.md)
### Security
- [ClamAV Safe Scheduling on Live Servers](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md)
- [Custom Fail2ban Jail: Apache Directory Scanning & Junk Methods](05-troubleshooting/security/apache-dirscan-fail2ban-jail.md)
- [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md)
- [Castopod: Stale Federated Avatar URLs After Remote Profile Updates](05-troubleshooting/security/castopod-stale-federated-avatar.md)
- [Castopod Posts Don't Appear on Mastodon — Diagnosing the Federation Path](05-troubleshooting/security/castopod-broadcast-not-on-mastodon.md)
### Storage
- [mdadm RAID Recovery After USB Hub Disconnect](05-troubleshooting/storage/mdadm-usb-hub-disconnect-recovery.md)
### Systemd
- [Systemd Session Scope Fails at Login (session-cN.scope)](05-troubleshooting/systemd/session-scope-failure-at-login.md)
---
@@ -182,57 +217,45 @@ updated: 2026-04-29T22:45
| Date | Article | Domain |
|---|---|---|
| 2026-04-19 | [Wake-on-LAN via Router SSH](02-selfhosting/dns-networking/wake-on-lan-router-ssh.md) | Self-Hosting |
| 2026-04-18 | [Ghost Email Configuration with Mailgun](02-selfhosting/services/ghost-smtp-mailgun-setup.md) | Self-Hosting |
| 2026-04-18 | [Firewall Hardening with firewalld on Fedora Fleet](02-selfhosting/security/firewalld-fleet-hardening.md) | Self-Hosting |
| 2026-04-18 | [ClamAV Fleet Deployment with Ansible](02-selfhosting/security/clamav-fleet-deployment.md) | Self-Hosting |
| 2026-04-18 | [Ansible: Check Mode False Positives in Verify/Assert Tasks](05-troubleshooting/ansible-check-mode-false-positives.md) | Troubleshooting |
| 2026-04-18 | [Ghost EmailAnalytics Lag Warning](05-troubleshooting/ghost-emailanalytics-lag-warning.md) | Troubleshooting |
| 2026-04-17 | [Watchtower SMTP via Localhost Postfix Relay](02-selfhosting/docker/watchtower-smtp-localhost-relay.md) | Self-Hosting |
| 2026-04-17 | [Fail2ban Custom Jail: Nginx Bad Request Detection](02-selfhosting/security/fail2ban-nginx-bad-request-jail.md) | Self-Hosting |
| 2026-04-17 | [Fail2ban Custom Jail: Apache Bad Request Detection](02-selfhosting/security/fail2ban-apache-bad-request-jail.md) | Self-Hosting |
| 2026-04-17 | [SSH Hardening Fleet-Wide with Ansible](02-selfhosting/security/ssh-hardening-ansible-fleet.md) | Self-Hosting |
| 2026-04-17 | [claude-mem: --setting-sources Empty Arg Bug](05-troubleshooting/claude-mem-setting-sources-empty-arg.md) | Troubleshooting |
| 2026-04-13 | [Cron Heartbeat False Alarm: /var/run Cleared by Reboot](05-troubleshooting/cron-heartbeat-tmpfs-reboot-false-alarm.md) | Troubleshooting |
| 2026-04-09 | [Fail2ban Custom Jail: Apache PHP Webshell Probe Detection](02-selfhosting/security/fail2ban-apache-php-probe-jail.md) | Self-Hosting |
| 2026-04-08 | [wget/curl: URLs with Special Characters Fail in Bash](05-troubleshooting/wget-url-special-characters.md) | Troubleshooting |
| 2026-04-07 | [SSH Config & Key Management](01-linux/networking/ssh-config-key-management.md) | Linux |
| 2026-04-07 | [Windows OpenSSH: WSL Default Shell Breaks Remote Commands](05-troubleshooting/networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md) | Troubleshooting |
| 2026-04-07 | [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md) | Troubleshooting |
| 2026-04-03 | [Ansible: ansible.cfg Ignored on WSL2 Windows Mounts](05-troubleshooting/ansible-wsl2-world-writable-mount-ignores-cfg.md) | Troubleshooting |
| 2026-04-02 | [Fail2ban Custom Jail: WordPress Login Brute Force](02-selfhosting/security/fail2ban-wordpress-login-jail.md) | Self-Hosting |
| 2026-04-02 | [Mastodon Instance Tuning](02-selfhosting/services/mastodon-instance-tuning.md) | Self-Hosting |
| 2026-04-02 | [mdadm — Rebuilding a RAID Array After Reinstall](01-linux/storage/mdadm-raid-rebuild.md) | Linux |
| 2026-04-02 | [Fedora Networking & Kernel Troubleshooting](05-troubleshooting/fedora-networking-kernel-recovery.md) | Troubleshooting |
| 2026-04-02 | [Ventoy: Multi-Boot USB Tool](03-opensource/dev-tools/ventoy.md) | Open Source |
| 2026-04-02 | [rsync Backup Patterns](02-selfhosting/storage-backup/rsync-backup-patterns.md) (updated — Glacier Deep Archive) | Self-Hosting |
| 2026-04-02 | [yt-dlp: Video Downloading](03-opensource/media-creative/yt-dlp.md) (updated — subtitles, temp fix) | Open Source |
| 2026-04-02 | [OBS Studio Setup & Encoding](04-streaming/obs/obs-studio-setup-encoding.md) (updated — captions plugin, VLC capture) | Streaming |
| 2026-04-02 | [Linux Server Hardening Checklist](02-selfhosting/security/linux-server-hardening-checklist.md) (updated — SpamAssassin) | Self-Hosting |
| 2026-03-23 | [Ansible: Vault Password File Not Found](05-troubleshooting/ansible-vault-password-file-missing.md) | Troubleshooting |
| 2026-03-18 | [Deploying Netdata to a New Server](02-selfhosting/monitoring/netdata-new-server-setup.md) | Self-Hosting |
| 2026-03-18 | [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md) | Self-Hosting |
| 2026-03-17 | [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md) | Troubleshooting |
| 2026-03-17 | [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md) | Troubleshooting |
| 2026-03-16 | [Standardizing unattended-upgrades with Ansible](02-selfhosting/security/ansible-unattended-upgrades-fleet.md) | Self-Hosting |
| 2026-03-16 | [WSL2 Training Environment Rebuild (Fedora 43)](01-linux/distro-specific/wsl2-rebuild-fedora43-training-env.md) | Linux |
| 2026-03-16 | [WSL2 Backup via PowerShell Scheduled Task](01-linux/distro-specific/wsl2-backup-powershell.md) | Linux |
| 2026-03-15 | [firewalld: Mail Ports Wiped After Reload](05-troubleshooting/networking/firewalld-mail-ports-reset.md) | Troubleshooting |
| 2026-03-15 | [Plex 4K Codec Compatibility (Apple TV)](04-streaming/plex/plex-4k-codec-compatibility.md) | Streaming |
| 2026-03-15 | [mdadm RAID Recovery After USB Hub Disconnect](05-troubleshooting/storage/mdadm-usb-hub-disconnect-recovery.md) | Troubleshooting |
| 2026-03-15 | [yt-dlp: Video Downloading](03-opensource/media-creative/yt-dlp.md) | Open Source |
| 2026-03-14 | [SELinux: Fixing Dovecot Mail Spool Context (/var/vmail)](05-troubleshooting/selinux-dovecot-vmail-context.md) | Troubleshooting |
| 2026-03-14 | [Gitea Actions Runner: Boot Race Condition Fix](05-troubleshooting/gitea-runner-boot-race-network-target.md) | Troubleshooting |
| 2026-03-14 | [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](05-troubleshooting/networking/fail2ban-imap-self-ban-mail-client.md) | Troubleshooting |
| 2026-03-14 | [SearXNG: Private Self-Hosted Search](03-opensource/alternatives/searxng.md) | Open Source |
| 2026-03-14 | [FreshRSS: Self-Hosted RSS Reader](03-opensource/alternatives/freshrss.md) | Open Source |
| 2026-03-14 | [Gitea: Self-Hosted Git](03-opensource/alternatives/gitea.md) | Open Source |
| 2026-03-14 | [yt-dlp: Video Downloading](03-opensource/media-creative/yt-dlp.md) | Open Source |
| 2026-03-13 | [Vaultwarden: Self-Hosted Password Manager](03-opensource/privacy-security/vaultwarden.md) | Open Source |
| 2026-03-13 | [Gemini CLI Manual Update](05-troubleshooting/gemini-cli-manual-update.md) | Troubleshooting |
| 2026-03-13 | [rmlint: Duplicate File Scanning](03-opensource/productivity/rmlint-duplicate-scanning.md) | Open Source |
| 2026-03-13 | [SnapRAID & MergerFS Storage Setup](01-linux/storage/snapraid-mergerfs-setup.md) | Linux |
| 2026-03-13 | [Qwen2.5-14B OOM on RTX 3080 Ti (12GB)](05-troubleshooting/gpu-display/qwen-14b-oom-3080ti.md) | Troubleshooting |
| 2026-05-10 | [Logwatch Fleet Setup — Surviving Package Upgrades](02-selfhosting/monitoring/logwatch-fleet-setup.md) — added "Per-host config drift on cloud-image-derived servers" section: Packer-leftover myhostname, empty relayhost forcing public-MX path, stale SASL passwd maps from prior relays | Self-Hosting |
| 2026-05-10 | [Patching PHP 8.4 Implicit-Nullable Deprecations in Vendor Packages](05-troubleshooting/php-84-vendor-implicit-nullable-patch.md) — generalized from a Castopod/UuidModel incident; covers the substring-match gotcha that turns a 30-second fix into a 30-minute one | Troubleshooting |
| 2026-05-10 | [Logwatch Fleet Setup — Surviving Package Upgrades](02-selfhosting/monitoring/logwatch-fleet-setup.md) — added a diagnosis for the missing Fedora CA bundle, a journald-vs-mail.log methodology note, and a bounce-source-must-be-real-mailbox section | Self-Hosting |
| 2026-05-10 | [ClamAV Fleet Deployment with Ansible](02-selfhosting/security/clamav-fleet-deployment.md) — added DigitalOcean monitoring caveat for 1vCPU droplets (with follow-up note: per-droplet relaxed alert can still trip; accept-the-page decision) | Self-Hosting |
| 2026-05-10 | [Claude Desktop MCP Mass-Disconnect After Blocking SSH Reboot](05-troubleshooting/claude-desktop-mcp-mass-disconnect-blocking-reboot.md) | Troubleshooting |
| 2026-05-10 | [Castopod Posts Don't Appear on Mastodon — Diagnosing the Federation Path](05-troubleshooting/security/castopod-broadcast-not-on-mastodon.md) | Troubleshooting |
| 2026-05-08 | [Castopod: Stale Federated Avatar URLs After Remote Profile Updates](05-troubleshooting/security/castopod-stale-federated-avatar.md) | Troubleshooting |
| 2026-05-08 | [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md) | Troubleshooting |
| 2026-05-07 | [Mastodon — The `--prune-profiles` Trap and How to Recover](02-selfhosting/services/mastodon-prune-profiles-trap.md) | Self-Hosting |
| 2026-05-02 | [WSL2 Backup via PowerShell Scheduled Task](01-linux/distro-specific/wsl2-backup-powershell.md) | Linux |
| 2026-05-02 | [SSH Config and Key Management](01-linux/networking/ssh-config-key-management.md) | Linux |
| 2026-05-02 | [Wake-on-LAN via Router SSH](02-selfhosting/dns-networking/wake-on-lan-router-ssh.md) | Self-Hosting |
| 2026-05-02 | [Tuning Netdata Docker Health Alarms to Prevent Update Flapping](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md) | Self-Hosting |
| 2026-05-02 | [ClamAV Fleet Deployment with Ansible](02-selfhosting/security/clamav-fleet-deployment.md) | Self-Hosting |
| 2026-05-02 | [Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts](02-selfhosting/security/fail2ban-digest-mode-fleet.md) | Self-Hosting |
| 2026-05-02 | [Mastodon Instance Tuning](02-selfhosting/services/mastodon-instance-tuning.md) | Self-Hosting |
| 2026-05-02 | [Ansible Check Mode False Positives in Verify/Assert Tasks](05-troubleshooting/ansible-check-mode-false-positives.md) | Troubleshooting |
| 2026-05-02 | [ISP SNI Filtering & Caddy Troubleshooting](05-troubleshooting/isp-sni-filtering-caddy.md) | Troubleshooting |
| 2026-05-02 | [Windows OpenSSH: WSL as Default Shell Breaks Remote Commands](05-troubleshooting/networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md) | Troubleshooting |
| 2026-05-02 | [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md) | Troubleshooting |
| 2026-05-02 | [yt-dlp YouTube JS Challenge Fix (Fedora)](05-troubleshooting/yt-dlp-fedora-js-challenge.md) | Troubleshooting |
| 2026-04-30 | [wp-fail2ban Plugin Logpath on Debian/Ubuntu (auth.log, not syslog)](02-selfhosting/security/wp-fail2ban-logpath-debian-ubuntu.md) | Self-Hosting |
| 2026-04-30 | [LoRA adapter — GGUF conversion fails with 'config.json not found'](05-troubleshooting/gpu-display/lora-adapter-gguf-conversion-fails.md) | Troubleshooting |
| 2026-04-29 | [iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators](05-troubleshooting/networking/tailscale-status-json-hostname-localhost-ios.md) | Troubleshooting |
| 2026-04-29 | [Python smtplib: Missing Date/Message-ID Headers Break Mail Clients](05-troubleshooting/python-smtplib-missing-rfc-headers.md) | Troubleshooting |
| 2026-04-28 | [Ubuntu dist-upgrade Quarantines Third-Party Repos](05-troubleshooting/ubuntu-dist-upgrade-repo-quarantine.md) | Troubleshooting |
| 2026-04-26 | [Fantastical MCP Server: Permission Denied on Launch (macOS Quarantine)](05-troubleshooting/fantastical-mcp-permission-denied.md) | Troubleshooting |
| 2026-04-25 | [rsync over Tailscale: Hung in TCP Teardown After Transfer Completes](05-troubleshooting/networking/rsync-tailscale-teardown-stall.md) | Troubleshooting |
| 2026-04-25 | [Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt](05-troubleshooting/ollama-chat-template-pipe-stdin-bypass.md) | Troubleshooting |
| 2026-04-24 | [Fantastical Google Sync Error Flood — Phantom Calendars Fixed via syncselect](05-troubleshooting/fantastical-google-phantom-calendar-syncselect.md) | Troubleshooting |
| 2026-04-23 | [Pi-hole DoH / DoT Bypass Defense](02-selfhosting/dns-networking/pihole-doh-dot-bypass-defense.md) | Self-Hosting |
| 2026-04-22 | [Pi-hole v6 Adlist Management via SQL](02-selfhosting/dns-networking/pihole-v6-adlist-management.md) | Self-Hosting |
| 2026-04-22 | [Pi-hole v6 Group Management: Per-Client DNS Rules](02-selfhosting/dns-networking/pihole-v6-group-management.md) | Self-Hosting |
| 2026-04-22 | [Mastodon DB Maintenance — Statuses, Accounts, and VACUUM](02-selfhosting/services/mastodon-db-maintenance.md) | Self-Hosting |
| 2026-04-22 | [Mastodon Federation — Domain Blocks, Silencing, and FediSeer](02-selfhosting/services/mastodon-federation.md) | Self-Hosting |
| 2026-04-22 | [Pi-hole AI Blocklist Blocks Claude Desktop (ERR_CONNECTION_REFUSED)](05-troubleshooting/networking/pihole-blocks-claude-desktop.md) | Troubleshooting |
| 2026-04-21 | [Ansible Fails with Permission Denied While `ssh <alias>` Works (Host Alias Bypass)](05-troubleshooting/ansible-ssh-host-alias-bypass.md) | Troubleshooting |
| 2026-04-20 | [Claude Code Remote Control — Mobile Access to a Persistent Host Session](02-selfhosting/services/claude-code-remote-control.md) | Self-Hosting |
| 2026-04-19 | [AWS S3 Cost Management](02-selfhosting/cloud/aws-s3-cost-management.md) | Self-Hosting |
---