Commit graph

183 commits

Author SHA1 Message Date
852375ddf0 logwatch-hostname wiki: add hostname-correct-but-config-baked variant
majormail (2026-06-14) had the correct system hostname but still mailed
from majormail-hetzner — the old provisioning label was hardcoded in
logwatch.conf MailFrom and fail2ban jail.local sender. Add a variant
section covering the config grep sweep and the templated-vs-static
Ansible regression caveat.
2026-06-14 04:00:18 -04:00
9dd730fc29 Add nav entries for Warp keychain login + iPhone Mirroring AWDL articles 2026-06-13 09:58:26 -04:00
e0595c04fd Publish drafts: Warp keychain login + iPhone Mirroring AWDL stall 2026-06-13 09:57:37 -04:00
MajorLinux
27ea2dc62b Add troubleshooting article: Wi-Fi 160 MHz airtime saturation breaking game streaming 2026-06-13 09:48:43 -04:00
3f94ebb963 Merge branch 'code/majormac/wiki-forgejo-recovery' 2026-06-12 17:36:55 -04:00
14cc1ba4b8 wiki: Forgejo account recovery & CLI admin when locked out of the GUI
Covers enabling the [mailer] for password recovery (relay via a tailnet mail
server, no-auth/mynetworks, FORCE_TRUST_SERVER_CERT for IP targets), CLI password
reset + the must-change-password=true gotcha, adding an SSH key via the basic-auth
API when locked out, and ruling out a server-side cause for a 'changing' password.
2026-06-12 17:36:54 -04:00
fecae727d1 Merge branch 'code/majormac/logwatch-hostname-wiki' 2026-06-12 10:58:17 -04:00
0d1697c0d6 wiki: Logwatch wrong hostname (<host>-hetzner) after migration
New troubleshooting runbook for Logwatch reports titled with the Hetzner
provisioning label instead of the real hostname; cross-linked from the
logwatch fleet-setup and VPS migration baseline articles, plus a new
'set system hostname' step in the post-migration checklist.
2026-06-12 10:58:17 -04:00
4f6898eb6c Merge branch 'code/majormac/ansible-hostkey-wiki' 2026-06-12 09:32:00 -04:00
11b455a0e2 Add runbook: Ansible host-key verification failed after host rebuild/migration
Documents the Ansible-by-IP known_hosts gap: interactive ssh works (key
stored under hostname) but Ansible connects by inventory IP and fails with
UNREACHABLE/Host key verification failed. Includes tailnet-safe ssh-keyscan
fix and prevention notes. Surfaced by the Hetzner migration IP churn.
2026-06-12 09:30:09 -04:00
bc4ff144df wiki: add Ansible reboot.yml become-timeout-on-WSL2 troubleshooting article
Documents why WSL2 hosts fail an Ansible reboot play at privilege
escalation (Timeout waiting for privilege escalation prompt) — WSL2 has
no real reboot semantics + become stalls over the Windows OpenSSH->WSL2
bridge — and the fix: scope reboot.yml to hosts: all:!wsl. Registered
in SUMMARY.md and 05-troubleshooting/index.md.
2026-06-12 03:57:17 -04:00
950759da52 wiki: add MagicDNS-names-vs-pinned-IPs Tailscale SSH article
New troubleshooting/networking article covering the three SSH failure modes
after a fleet migration (stale hardcoded IP, Tailscale 1.98.x cold-path
teardown, rebuilt-box host-key mismatch) and the durable fix (MagicDNS names +
known_hosts purge + ConnectTimeout), with the WSL2 no-resolver caveat.
Cross-links the existing host-key article (adds a 'when pinning the IP is
wrong' callout) and adds the SUMMARY nav entry.
2026-06-12 01:33:31 -04:00
877c4b815f wiki: add WSL2 Fedora 44 in-place upgrade article (gcc14 blocker + CUDA repo swap) 2026-06-11 22:48:55 -04:00
27b1ae244c Merge branch 'code/majorrig/wiki-hevc-already-failed-skip' 2026-06-11 20:16:21 -04:00
ce2e761d33 hevc-vaapi-batch-encode: add already_failed() skip for streaming content
Document that VAAPI HEVC on Polaris can't beat already-efficient H.264 (YouTube/
Twitch/stream archives), so output comes out larger and lands in hevc_failed.txt.
Add already_failed() guard so the batch skips known-bad files on queue rebuilds
instead of re-attempting them. Also: MIN_FREE_GB note (start-only check) and a
source-bitrate triage snippet for picking real encode candidates.
2026-06-11 20:16:19 -04:00
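The already_failed() guard described in this commit can be sketched roughly as follows. This is a minimal illustration, not the article's actual script: it assumes hevc_failed.txt holds one source path per line, and load_failed and build_queue are hypothetical helper names.

```python
from pathlib import Path


def load_failed(failed_list: Path) -> set[str]:
    """Return the source paths recorded in the failure list (assumed one per line)."""
    if not failed_list.exists():
        return set()
    lines = failed_list.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def already_failed(source: str, failed: set[str]) -> bool:
    """True when a previous run already recorded this file as a failed encode."""
    return source in failed


def build_queue(candidates: list[str], failed_list: Path) -> list[str]:
    """Rebuild the encode queue, skipping files already known to be bad."""
    failed = load_failed(failed_list)
    return [c for c in candidates if not already_failed(c, failed)]
```

Loading the list once per queue rebuild keeps the skip check to a set lookup, so a long failure list does not slow the batch down.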
513d94aa84 Merge branch 'code/majorrig/wiki-ssh-magicdns-article' 2026-06-11 20:12:34 -04:00
9b066d0e54 Add troubleshooting article: SSH alias MagicDNS fall-through host-key failure
New 05-troubleshooting/networking article covering the case where ssh <alias>
fails host-key verification because no Host block exists and the alias resolves
via Tailscale MagicDNS to a name with no known_hosts entry (key stored under the
IP). Registered in SUMMARY.md and the troubleshooting index.
2026-06-11 20:12:22 -04:00
5ef0fdfad4 draft: WIP wiki articles (warp keychain credential, iPhone Mirroring AWDL stall)
Backing up two unpublished draft articles that existed only in a working-tree
stash. Drafts — NOT in SUMMARY.md nav and NOT merged to main, so not published
to notes.majorshouse.com. Pre-commit nav check bypassed intentionally (--no-verify).

- 05-troubleshooting/claude-code-warp-login-corrupt-keychain-credential.md
- 05-troubleshooting/iphone-mirroring-connecting-hang-awdl-stall-beta.md
2026-06-11 15:41:28 -04:00
a414e4cdbe Merge: Ansible role doc-ref updates across 5 wiki articles 2026-06-11 11:33:42 -04:00
06a794316b docs: point Ansible references at the new roles (clamav/ssh_hardening/tailscale)
Operational/how-to references updated to the role entry playbooks after the
ADR-0001 migration. Historical incident narrative (dated callouts, commit
refs) preserved.

- clamav-fleet-deployment: override + re-run -> clamav.yml; role note
- ssh-hardening-ansible-fleet: note this is now the ssh_hardening role
- vps-migration-baseline-checklist: table -> clamav.yml / ssh_hardening.yml
- ssh-socket-tailscale-race-condition: Affected Hosts + Prevention + References
  -> tailscale role tasks (network_wait/ssh_only_ubuntu/ssh_only_fedora)
- freshclam-logwatch-false-no-updates: codify refs -> clamav role
2026-06-11 11:33:42 -04:00
68bfb099ac Merge branch 'code/majorrig/wiki-ssh-fleet-reconciled' 2026-06-07 06:22:39 -04:00
c3045e33dd troubleshooting: ssh-race article — fleet audited & reconciled 2026-06-07
dcaprod-hetzner + tttpod-hetzner were missing tailscale-wait-ready.service
(inert ssh.service gate -> latent bind race); corrected playbook applied to
both. teelia uses Tailscale SSH (no sshd, immune). All Ubuntu hosts now on
the dependency-free-socket + ssh.service-gate pattern.
2026-06-07 06:22:35 -04:00
0cde19e064 Merge branch 'code/majorrig/wiki-ssh-race-fedora-and-cycle' 2026-06-07 05:56:29 -04:00
8d4dee5da3 troubleshooting: correct ssh tailscale-race article (Fedora ListenAddress variant + playbook cycle landmine)
- Fedora hosts are NOT automatically immune: a leftover manual
  `ListenAddress <tailscale-ip>` drop-in reintroduces the sshd boot bind-race
  even under firewalld (hit on majordiscord 2026-06-07; fix = remove it).
- The Ubuntu playbook kept shipping the cycle-causing [Unit] gate on
  ssh.socket despite the 2026-06-04 resolution; re-running it re-armed the
  ordering cycle (clobbered majorlinux; majortoot-hetzner found armed).
  Corrected in MajorAnsible e0d35aa. Fleet ssh-lockdown state is inconsistent
  (dcaprod/tttpod lack wait-ready; teelia no override) — needs a per-host audit.
2026-06-07 05:56:25 -04:00
fda2d35ea5 Merge branch 'code/majorrig/wiki-dovecot-lda-dupes' 2026-06-07 05:06:57 -04:00
01ae62e621 troubleshooting: Dovecot phantom mailboxes from .dovecot.lda-dupes (mail_home overlapping maildir root)
Document the majormail 2026-06-07 incident: when userdb home == maildir
root, the LDA/Sieve duplicate database (.dovecot.lda-dupes + .locks) lands
inside the mail store and the maildir lister exposes it as phantom
mailboxes ("dovecot.lda-dupes"), logging stat(.../tmp) "Not a directory".
Fix: point home at a non-dotted subdir. Wired into the troubleshooting
index and SUMMARY.
2026-06-07 05:06:43 -04:00
662741e7ad troubleshooting: Postfix header_checks can't act on milter-added headers
Document the majormail spam-routing failure (2026-06-06): a cleanup
header_checks REDIRECT keyed on the milter-added X-Spam-Flag never fired for
real inbound mail (only locally-injected), so spam kept reaching the inbox.
Fix is to route in Sieve at delivery (after the milter), with a redirect +
loop guard. Includes the 'local-injection tests lie' warning.
2026-06-06 10:38:04 -04:00
d8f07e8e2e wiki: add ClamAV freshness watchdog + sendmail (not mail) alert guidance
Document the daily /etc/cron.daily/clamav-freshness watchdog as the real
detector for stale signatures, and the key gotcha that 'mail' is absent on
most fleet hosts so alert scripts must use /usr/sbin/sendmail -t.
2026-06-06 07:17:56 -04:00
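The sendmail-instead-of-mail guidance could look something like the sketch below. It assumes a watchdog that builds its own headers and lets `sendmail -t` read the recipient from them. The addresses, subject and function names are placeholders, not taken from the article.

```python
import subprocess
from email.message import EmailMessage


def build_alert(sender: str, recipient: str, subject: str, body: str) -> bytes:
    """Build a complete message; sendmail -t takes recipients from the headers."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg.as_bytes()


def send_alert(raw: bytes, sendmail: str = "/usr/sbin/sendmail") -> None:
    """Pipe the message to sendmail -t; raises CalledProcessError on failure."""
    subprocess.run([sendmail, "-t"], input=raw, check=True)
```

A watchdog would call `send_alert(build_alert(...))` only when the signature age exceeds its threshold. The threshold itself is not specified here.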
5d7354e856 troubleshooting: freshclam daemon-mode logwatch false 'no updates' alert
logwatch's clam-update counts only 'process started' lines (emitted only at
daemon restart), so daemon-mode freshclam false-alarms on quiet days despite
signatures updating. Fix: $ignore_no_updates=1 drop-in. Includes the
real-vs-false check (a daemonless box with freshclam disabled is a TRUE alert).
2026-06-06 07:06:29 -04:00
d755b77126 troubleshooting: SELinux /etc/localtime mislabel silently breaks timezone
New page documenting the majormail (2026-06-05) issue: /etc/localtime
shipped labeled etc_t instead of locale_t on the Hetzner image, so SELinux
denied systemd-timedated and timedatectl/community.general.timezone reported
success while the symlink stayed at UTC. Fix: restorecon before setting TZ.
Indexed in index.md (SELinux) + SUMMARY.md.
2026-06-05 14:22:00 -04:00
26eb13ab2f troubleshooting: document majormail client-connectivity incident (2026-06-05)
- New page: Dovecot IMAP vsz_limit OOM from a bloated/corrupt index.log
  (152M index on an empty folder killed IMAP children with error 83).
- fail2ban IMAP self-ban: add permanent ignoreip-whitelist fix + dynamic-IP caveat.
- firewalld mail ports: add 'submission/587 never added' variant + correct
  Fedora service name; note Ansible now manages the full mail-service set.
- Index + SUMMARY updated with the new page.
2026-06-05 14:04:22 -04:00
5260548caa wiki: spam filtering — add Pigeonhole 2.4 syntax, REDIRECT-to-junk pattern, weekly timer
Three updates to the inbound spam filtering guide, all driven by the 2026-06-04
majormail-hetzner Phase 6 cutover and follow-up tuning:

1. Section 6 (Dovecot Sieve): warn explicitly that `plugin/sieve_before` was
   dropped in Pigeonhole 2.4 and silently does nothing — no startup warning,
   spam just keeps landing in INBOX. The 2.4 replacement is a top-level
   `sieve_script <name> { type = before; path = …; }` block. Also note the
   Fedora-flat-dovecot.conf pitfall (some packagings ship dovecot.conf
   without `!include conf.d/*.conf`, so the block has to live in the main
   file directly). Added a `sievec` compile step.

2. New §6b: route spam to a separate `junk@` mailbox via Postfix cleanup
   `header_checks` REDIRECT. This makes spam invisible to the user's
   mailbox entirely — Spark/IDLE-based clients don't push-notify because
   the message never reaches the subscribed mailbox at all. Includes the
   `regexp:` vs `pcre:` map-type tip (use regexp on stock Fedora to avoid
   the postfix-pcre package dependency).

3. New §7a: weekly systemd timer for sa-learn. The §7 warning about
   "don't run sa-learn from cron unless folders are clean" is correct as
   the safe default — but when you adopt the §6b REDIRECT-to-junk@
   pattern, the junk@ mailbox is pure spam by design and a weekly
   `--spam`/`--ham`/`--sync`/`--force-expire` chain becomes safe and
   useful. Full unit templates included.

Gotchas table gains four entries:
- Pigeonhole 2.4 silent breakage of plugin/sieve_before
- postfix-pcre vs regexp map type confusion
- Why sieve fileinto Junk still pushes a Spark notification
- Why local `sendmail` injection doesn't trigger the REDIRECT (smtpd
  milters skip sendmail-injected mail, so X-Spam-Flag isn't added)

All changes match what's now codified in the `majormail` Ansible role
(commit 7a8b9eb in MajorAnsible).
2026-06-04 20:48:01 -04:00
2e58c4625c wiki: remove deploy-pipeline test marker 2026-06-04 16:44:56 -04:00
b81362bb78 wiki: temporary deploy-pipeline test marker (will be reverted) 2026-06-04 16:43:57 -04:00
110a6d49e5 wiki: add inbound spam filtering guide (spamass-milter + SpamAssassin Bayes)
New 02-selfhosting/services article: the full Postfix/Dovecot inbound spam stack
on Fedora — spamass-milter tag-only wiring (the -r footgun), socket permissions
(sa-milt group + UMask), site-wide Bayes DB, Sieve-to-Junk, and sa-learn training
(folders, spam/ham balance, manual-not-cron). From the majormail setup.

Also extends selinux-dovecot-vmail-context with a Permissive-mode variant + a
postfix_cleanup->mysqld_etc companion-denial note. SUMMARY.md nav updated.
2026-06-04 16:31:14 -04:00
e6a249403c s3-cost-management: prune automation disabled; correct guidance
The weekly media-prune cron (and monthly accounts refresh --all) were
removed 2026-06-01 after repeatedly breaking avatars. Update the
majortoot sections: the 648->7GB shrink was a one-time safe attachment
cleanup; automation is now disabled; prune attachments manually if ever
needed, never profiles. Cross-link the two new troubleshooting articles.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 15:46:42 -04:00
4e63d8546c mastodon: document S3 ACL upload failures + bulk avatar restore
New article mastodon-s3-acl-upload-failures.md: a BucketOwnerEnforced S3
bucket plus a stale S3_PERMISSION/S3_ACL in .env.production makes every
Mastodon upload fail with AccessControlListNotSupported, silently. Covers
symptoms (incl. why a missing object returns 403 not 404), diagnosis,
the fix (S3_PERMISSION= empty, public read via bucket policy), recovery,
a synthetic-write health check, and Ansible enforcement.

Extend mastodon-prune-profiles-trap.md: add a "Bulk restore at scale"
procedure (list existing keys, null missing DB refs, enqueue
RedownloadAvatar/HeaderWorker), a "storage-level deletion without DB
de-ref" section, and a stronger recommendation to disable automated
profile pruning (and scheduled accounts refresh --all) entirely.

Link both from SUMMARY.md and the selfhosting index.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 15:45:23 -04:00
155651c373 wiki: ssh.socket wait-ready gate + mastodon post-install hardening
Two related additions covering the 2026-05-31 cutover-night incidents on
majorlinux and majortoot-hetzner.

ssh-socket-tailscale-race-condition.md (update Race 1 fix):
- After=tailscaled.service Requires=tailscaled.service orders against the
  service becoming active, not against tailscale0 having an IPv4 — hosts
  kept losing SSH intermittently after reboots (incident: majorlinux +
  majortoot-hetzner 2026-05-31, during cutover-night Ansible reboot).
- Canonical fix: a oneshot tailscale-wait-ready.service that polls
  `ip -4 -o addr show tailscale0` until an address is present, with
  ssh.socket After=/Requires= that service. Document the full evolution
  (2026-05-19 BindsTo → 2026-05-23 Requires → 2026-05-31 wait-ready) so
  future readers don't try the half-fixes thinking they're sufficient.
- Add majortoot-hetzner to affected hosts.
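The wait-ready idea (poll until tailscale0 actually has an IPv4 address, rather than trusting that tailscaled is active) can be sketched as below. This is an illustration under assumptions, not the unit shipped in MajorAnsible: the timeout, interval and function names are hypothetical, and only the `ip -4 -o addr show tailscale0` command comes from the commit message.

```python
import re
import subprocess
import time
from typing import Callable, Optional

# Matches the "inet A.B.C.D/NN" field in one-line (-o) output from ip.
INET_RE = re.compile(r"\binet\s+(\d{1,3}(?:\.\d{1,3}){3})/\d+")


def first_ipv4(ip_output: str) -> Optional[str]:
    """Return the first IPv4 address found in `ip -4 -o addr show` output."""
    match = INET_RE.search(ip_output)
    return match.group(1) if match else None


def query_interface(iface: str) -> str:
    """Run `ip -4 -o addr show <iface>`; stdout is empty if no address is set."""
    result = subprocess.run(
        ["ip", "-4", "-o", "addr", "show", iface],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def wait_for_ipv4(
    iface: str = "tailscale0",
    timeout: float = 60.0,   # assumed value, not from the article
    interval: float = 1.0,   # assumed value, not from the article
    query: Callable[[str], str] = query_interface,
) -> Optional[str]:
    """Poll until the interface has an IPv4 address, or return None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        addr = first_ipv4(query(iface))
        if addr is not None:
            return addr
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

A oneshot unit wrapping this kind of loop exits successfully only once an address exists. That is the condition ssh.socket needs, and ordering against tailscaled.service alone does not guarantee it.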

mastodon-post-install-hardening.md (new):
Four upstream-install gaps that bit during the majortoot-hetzner cutover:
1. /home/mastodon at 0750 (useradd default) → nginx www-data can't
   traverse → every static asset 403s → unstyled "purple screen" in the
   browser while API/HTML still work through the puma proxy.
2. .env.production at 0644 (mastodon-setup default) → DB_PASS,
   SECRET_KEY_BASE, OTP_SECRET world-readable once gap (1) is fixed.
3. mastodon user shell at /usr/sbin/nologin → `su - mastodon` blocked.
4. rbenv init in .bashrc only → login shells don't source .bashrc; even
   when chained, Ubuntu's .bashrc returns early for non-interactive
   shells. Fix: .bash_profile sets up rbenv BEFORE sourcing .profile +
   .bashrc, so it works for both interactive and non-interactive logins.

All four codified in MajorAnsible configure_mastodon_permissions.yml
with self-asserting verification steps.
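Gaps 1 and 2 come down to two permission bits, which a self-asserting check might test along these lines. This is a hedged sketch: it assumes the web server user falls under the "other" permission class for these paths, and the helper names are hypothetical rather than taken from configure_mastodon_permissions.yml.

```python
import os
import stat


def others_can_traverse(mode: int) -> bool:
    """True if the 'other' class has the execute (traverse) bit on a directory."""
    return bool(mode & stat.S_IXOTH)


def world_readable(mode: int) -> bool:
    """True if the 'other' class can read the file."""
    return bool(mode & stat.S_IROTH)


def check_paths(home_dir: str, env_file: str) -> list[str]:
    """Return a list of problems found; an empty list means both checks pass."""
    problems = []
    if not others_can_traverse(os.stat(home_dir).st_mode):
        problems.append(f"{home_dir}: not traversable by other users")
    if world_readable(os.stat(env_file).st_mode):
        problems.append(f"{env_file}: world-readable")
    return problems
```

With the defaults named above, 0750 on the home directory fails the traverse check and 0644 on .env.production fails the read check.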

02-selfhosting/index.md + SUMMARY.md:
Add a "Services" section to the selfhosting index linking the
mastodon-post-install-hardening article (and the other orphaned
services/ entries while there). SUMMARY.md gains one new entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 11:08:24 -04:00
73c10111e0 Merge branch 'cowork/majorair/wiki-batch-may25' 2026-05-25 13:56:23 -04:00
52ca8a0413 wiki: batch update — 4 new articles + 4 updates
New articles:
- Postfix SendGrid TLS handshake failure (port 465 vs 587)
- Plex transcoding troubleshooting
- Ansible Ubuntu reboot detection kernel mismatch
- WSL2 PyTorch checkpoint Windows filesystem deadlock

Updated:
- AWS S3 cost management (expanded)
- Network overview (IP updates)
- HEVC VAAPI batch encode (progress + fixes)
- SUMMARY.md (new entries)
2026-05-25 13:55:10 -04:00
dc897d4a67 Merge branch 'cowork/majorair/ssh-socket-bindsto-fix' 2026-05-23 02:40:45 -04:00
3b8c8b0597 ssh.socket wiki: correct BindsTo→Requires, add warning
BindsTo=tailscaled.service causes a systemd ordering cycle that
prevents ssh.socket from starting on reboot. Updated the recommended
fix to use Requires= and added a warning admonition explaining why
BindsTo must not be used. Added tttpod-hetzner to affected hosts
list and linked the 2026-05-23 dcaprod incident.
2026-05-23 02:40:04 -04:00
318f50c50b Merge branch 'cowork/majorair/tailscale-boot-race-wiki' 2026-05-19 20:39:19 -04:00
65b0aa4567 wiki: expand Tailscale race condition article with network-online race
Added Race 2: tailscaled starts before network-online.target, causing
Tailscale to get stuck with SetNetworkUp(false). Covers both Ubuntu
ssh.socket and cross-platform tailscaled ordering issues. Updated
references to include majordiscord incident and new Ansible playbook.
2026-05-19 20:39:18 -04:00
eb39da9a26 Merge cowork/majorair/ssh-socket-wiki: ssh.socket Tailscale race condition article 2026-05-19 19:36:19 -04:00
7dc591d257 wiki: add ssh.socket Tailscale race condition troubleshooting article
Documents the systemd socket activation race where ssh.socket binds
to the Tailscale IP before tailscaled is ready, causing SSH to become
unreachable after a Tailscale reconnect. Includes diagnosis steps and
the After=/BindsTo= fix.
2026-05-19 19:35:16 -04:00
64ac418a36 wiki: add ClamAV daemonless mode section + HEVC VAAPI article link 2026-05-15 09:02:24 -04:00
Marcus (via Claude Code)
28518e403e Add troubleshooting articles: Netdata apps-group FD false-positive + OBS stale script paths
- netdata-apps-fds-group-false-positive: the apps_group_file_descriptors_utilization
  false 100% on forking/root app groups (tailscaled on MajorToot 2026-05-15),
  the not-a-privilege gotcha, fleet-wide silence fix in MajorAnsible.
- obs-stale-script-paths: pending from prior session (not on remote).
- SUMMARY.md: link both (re-applied onto upstream after concurrent rebase).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 03:22:12 -04:00
a785e85821 Merge branch 'code/majorair/rsyslog-logwatch-fix' 2026-05-13 10:36:06 -04:00
4ec481c584 wiki: add rsyslog requirement to migration checklist and logwatch docs
Fedora 44 Hetzner images ship without rsyslog — logwatch produces
zero output because /var/log/messages doesn't exist. Added rsyslog
to baseline table and new diagnostic section to logwatch article.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 10:36:00 -04:00