Compare commits

48 commits

Author SHA1 Message Date
8d9bd34118 Merge branch 'code/majorrig/mastodon-mention-spam-wiki' 2026-06-22 13:50:21 -04:00
2def4c6f30 wiki: add Mastodon crowdfunding/mention-spam triage runbook
Runbook for telling broadcast fundraising solicitation from genuine
mentions: signal checklist, SQL to investigate the account and its
origin instance via nodeinfo, BlockService snippet, and a proportionate
escalation ladder (mute -> block -> report -> domain-limit -> domain-block).
Registered in SUMMARY.md and the self-hosting section index.
2026-06-22 13:49:35 -04:00
44c9d38b9f Merge branch 'code/MajorAir/macos-btm-audit-wiki' 2026-06-21 13:01:34 -04:00
623f04720c Add macOS guide: auditing & cleaning Background App Activity (sfltool dumpbtm) 2026-06-21 13:00:35 -04:00
69d60b7753 Merge branch 'code/MajorAir/restic-snapshot-group-gotcha' 2026-06-21 12:34:06 -04:00
c358e0dfea restic runbook: document the snapshot-group-per-path-set gotcha
Changing a host's restic_paths spawns a new snapshot group (restic
groups by host+paths), so old and new path-sets each keep their own
retention lineage. Surfaced while extending majorlab's backup scope.
2026-06-21 12:33:56 -04:00
a45ef55862 Merge branch 'code/MajorAir/wp-textdomain-wiki' 2026-06-21 11:44:52 -04:00
e767ebffcb Add runbook: WordPress 6.7 _load_textdomain_just_in_time notice
Covers the WP 6.7 doing_it_wrong notice fired when a theme/plugin
translates before init (e.g. nav-menu labels on after_setup_theme).
Documents source fix (defer to init) and the update-safe mu-plugin
suppression via doing_it_wrong_trigger_error, plus the renamed-theme
domain gotcha. Derived from the majorlinux.com kappa/marstheme triage.
2026-06-21 11:44:48 -04:00
96db073b78 Add LVM volume-grow guide; publish iPhone Mirroring + Claude Code login fixes 2026-06-19 15:00:19 -04:00
cf5e35da1d Merge branch 'code/majorair/steam-deck-wifi-flap-article' 2026-06-19 11:36:09 -04:00
cb90bb69a2 wiki: add Steam Deck Wi-Fi flapping runbook (IWD periodic scan + rtw88 power save)
Client-side fix for OG Steam Deck (RTL8822CE/rtw88) flapping ~once a minute on
SteamOS: disable IWD periodic scan + disable Wi-Fi power save via NM dispatcher.
Cross-linked with the 160MHz airtime article; registered in SUMMARY.md nav.
2026-06-19 11:36:06 -04:00
4599ed607c wiki: add restic + B2 fleet backups runbook
Architecture, per-engine DB dump patterns, restore procedure, add-a-host,
and gotchas (RESTIC_CACHE_DIR/$HOME, missing sqlite3, docker dump env vars,
delete-capable B2 key). Linked in SUMMARY under storage-backup.
2026-06-19 10:05:16 -04:00
2bed2cbae3 Merge branch 'code/majormac/ansible-roles-migration-article' 2026-06-18 14:32:02 -04:00
ebdb28e9e2 Add wiki article: migrating flat Ansible playbooks to roles (capture-based reconciliation) 2026-06-18 14:31:46 -04:00
4fa5e33d93 Merge branch 'code/majormac/tm-orphaned-previous-article' 2026-06-18 10:09:45 -04:00
cfff75af1c Add troubleshooting article: Time Machine orphaned APFS .previous blocks backups 2026-06-18 10:09:45 -04:00
06162273f7 Merge branch 'code/majormac/ssh-key-backfill-article' 2026-06-17 13:15:19 -04:00
e1767bc19e Add troubleshooting article: Permission denied (publickey) after key rotation
New 05-troubleshooting/networking article covering the per-host nature of
authorized_keys: rotating a workstation SSH key requires backfilling the new
pubkey to every host, or hosts holding only the old key reject it with
Permission denied (publickey). Includes fleet-sweep diagnosis, idempotent
backed-up backfill via a still-trusted transit user, and prevention. Wired
into SUMMARY.md nav.
2026-06-17 13:14:41 -04:00
0d08e21ee4 Merge branch 'code/majorair/yt-dlp-update-docs' 2026-06-16 19:12:21 -04:00
2121d3ff1b yt-dlp: document -U trap and avoid duplicate pip installs
Add a Maintenance subsection covering why 'yt-dlp -U' fails on PyPI
builds and how to update via pip, plus how to detect/remove a duplicate
user+system install (the issue hit on majorhome 2026-06-16).
2026-06-16 19:12:06 -04:00
1d73b2defa Merge branch 'code/majorair/keychain-prompt-wiki' 2026-06-15 20:12:21 -04:00
34d9ee42b1 Add wiki: Claude Code keychain prompt keeps reappearing on macOS
New troubleshooting article for the recurring 'security wants to access
Claude Code-credentials' prompt that persists even after Always Allow
(ACL invalidation on binary-signature change / token refresh / post-boot
churn). Covers triage, the reset-and-relogin fix, and the file-based
credentials workaround with its plaintext tradeoff. Registered in
SUMMARY + troubleshooting index; cross-linked with the corrupt-credential
login-failure article (distinct symptom).
2026-06-15 20:12:11 -04:00
700ca95158 Merge branch 'code/majorair/iphone-mirroring-regression' 2026-06-15 19:58:24 -04:00
a5df9e4873 Correct iPhone Mirroring article: regressed on 27.0 beta, not a Tailscale fix
2026-06-15: mirroring is reproducibly stuck on Connecting again with
Tailscale accept-routes still off, so the 06-14 it-works conclusion was
wrong. _asquic endpoint resolves but the QUIC/AWDL datapath never
completes; awdl0 bounce, full reboot, and phone radio cycle all failed.
Reframed as an intermittent macOS 27.0 beta AWDL bug; QuickTime USB
remains the workaround.
2026-06-15 19:58:20 -04:00
7703b963e1 Merge branch 'code/majorair/wiki-dummy-ip' 2026-06-15 19:26:58 -04:00
5050001909 Replace real majormail IP with documentation IP in logwatch example
The postfix MX-lookup example hard-coded majormail's real public IP
(stale DO address). Swap in an RFC 5737 documentation IP (203.0.113.10)
so the published wiki doesn't expose a real fleet IP.
2026-06-15 19:26:49 -04:00
9085740fa3 Merge branch 'code/majorair/iphone-mirroring-llw0-correction' 2026-06-14 19:10:33 -04:00
75154ff80c iPhone Mirroring: correct transport finding (video on llw0 not awdl0), it works on ch44, what-changed + MajorMac open test (2026-06-14 evening) 2026-06-14 19:10:06 -04:00
4c95f8a88a Merge branch 'code/majorair/iphone-mirroring-doc-update' 2026-06-14 04:31:55 -04:00
805c0f0a8f iPhone Mirroring AWDL article: refined root cause, Tailscale/congestion ruled out, ch36+ch44 both fail, QuickTime USB workaround, revisit checklist (2026-06-14) 2026-06-14 04:30:22 -04:00
e5d1e39af9 Merge branch 'code/majorair/wiki-stale-hostname-config-variant' 2026-06-14 04:00:25 -04:00
852375ddf0 logwatch-hostname wiki: add hostname-correct-but-config-baked variant
majormail (2026-06-14) had the correct system hostname but still mailed
from majormail-hetzner — the old provisioning label was hardcoded in
logwatch.conf MailFrom and fail2ban jail.local sender. Add a variant
section covering the config grep sweep and the templated-vs-static
Ansible regression caveat.
2026-06-14 04:00:18 -04:00
9dd730fc29 Add nav entries for Warp keychain login + iPhone Mirroring AWDL articles 2026-06-13 09:58:26 -04:00
e0595c04fd Publish drafts: Warp keychain login + iPhone Mirroring AWDL stall 2026-06-13 09:57:37 -04:00
MajorLinux
27ea2dc62b Add troubleshooting article: Wi-Fi 160 MHz airtime saturation breaking game streaming 2026-06-13 09:48:43 -04:00
3f94ebb963 Merge branch 'code/majormac/wiki-forgejo-recovery' 2026-06-12 17:36:55 -04:00
14cc1ba4b8 wiki: Forgejo account recovery & CLI admin when locked out of the GUI
Covers enabling the [mailer] for password recovery (relay via a tailnet mail
server, no-auth/mynetworks, FORCE_TRUST_SERVER_CERT for IP targets), CLI password
reset + the must-change-password=true gotcha, adding an SSH key via the basic-auth
API when locked out, and ruling out a server-side cause for a 'changing' password.
2026-06-12 17:36:54 -04:00
fecae727d1 Merge branch 'code/majormac/logwatch-hostname-wiki' 2026-06-12 10:58:17 -04:00
0d1697c0d6 wiki: Logwatch wrong hostname (<host>-hetzner) after migration
New troubleshooting runbook for Logwatch reports titled with the Hetzner
provisioning label instead of the real hostname; cross-linked from the
logwatch fleet-setup and VPS migration baseline articles, plus a new
'set system hostname' step in the post-migration checklist.
2026-06-12 10:58:17 -04:00
4f6898eb6c Merge branch 'code/majormac/ansible-hostkey-wiki' 2026-06-12 09:32:00 -04:00
11b455a0e2 Add runbook: Ansible host-key verification failed after host rebuild/migration
Documents the Ansible-by-IP known_hosts gap: interactive ssh works (key
stored under hostname) but Ansible connects by inventory IP and fails with
UNREACHABLE/Host key verification failed. Includes tailnet-safe ssh-keyscan
fix and prevention notes. Surfaced by the Hetzner migration IP churn.
2026-06-12 09:30:09 -04:00
bc4ff144df wiki: add Ansible reboot.yml become-timeout-on-WSL2 troubleshooting article
Documents why WSL2 hosts fail an Ansible reboot play at privilege
escalation (Timeout waiting for privilege escalation prompt) — WSL2 has
no real reboot semantics + become stalls over the Windows OpenSSH->WSL2
bridge — and the fix: scope reboot.yml to hosts: all:!wsl. Registered
in SUMMARY.md and 05-troubleshooting/index.md.
2026-06-12 03:57:17 -04:00
950759da52 wiki: add MagicDNS-names-vs-pinned-IPs Tailscale SSH article
New troubleshooting/networking article covering the three SSH failure modes
after a fleet migration (stale hardcoded IP, Tailscale 1.98.x cold-path
teardown, rebuilt-box host-key mismatch) and the durable fix (MagicDNS names +
known_hosts purge + ConnectTimeout), with the WSL2 no-resolver caveat.
Cross-links the existing host-key article (adds a 'when pinning the IP is
wrong' callout) and adds the SUMMARY nav entry.
2026-06-12 01:33:31 -04:00
877c4b815f wiki: add WSL2 Fedora 44 in-place upgrade article (gcc14 blocker + CUDA repo swap) 2026-06-11 22:48:55 -04:00
27b1ae244c Merge branch 'code/majorrig/wiki-hevc-already-failed-skip' 2026-06-11 20:16:21 -04:00
ce2e761d33 hevc-vaapi-batch-encode: add already_failed() skip for streaming content
Document that VAAPI HEVC on Polaris can't beat already-efficient H.264 (YouTube/
Twitch/stream archives), so output comes out larger and lands in hevc_failed.txt.
Add already_failed() guard so the batch skips known-bad files on queue rebuilds
instead of re-attempting them. Also: MIN_FREE_GB note (start-only check) and a
source-bitrate triage snippet for picking real encode candidates.
2026-06-11 20:16:19 -04:00
513d94aa84 Merge branch 'code/majorrig/wiki-ssh-magicdns-article' 2026-06-11 20:12:34 -04:00
9b066d0e54 Add troubleshooting article: SSH alias MagicDNS fall-through host-key failure
New 05-troubleshooting/networking article covering the case where ssh <alias>
fails host-key verification because no Host block exists and the alias resolves
via Tailscale MagicDNS to a name with no known_hosts entry (key stored under the
IP). Registered in SUMMARY.md and the troubleshooting index.
2026-06-11 20:12:22 -04:00
28 changed files with 2595 additions and 16 deletions

@@ -0,0 +1,119 @@
---
title: WSL2 In-Place Upgrade to Fedora 44 (with gcc14 Blocker + CUDA Repo Swap)
domain: linux
category: distro-specific
tags:
- wsl2
- fedora
- windows
- upgrade
- dnf
- cuda
- majorrig
status: published
created: 2026-06-11
updated: 2026-06-11
---
# WSL2 In-Place Upgrade to Fedora 44 (with gcc14 Blocker + CUDA Repo Swap)
In-place upgrade of the FedoraLinux-43 WSL2 instance on MajorRig to Fedora 44 using `dnf system-upgrade` + `dnf5 offline reboot`. Hit one transaction blocker (`gcc14` compat package retired in F44) and swapped the stale `cuda-fedora39` repo to `cuda-fedora44` afterward. Performed 2026-06-11.
## The Short Answer
```powershell
# PowerShell — backup first
wsl --shutdown
wsl --export FedoraLinux-43 D:\backups\fedora43.tar
```
```bash
# Inside Fedora
sudo dnf upgrade --refresh -y
sudo shutdown -h now
# relaunch, then:
sudo dnf remove gcc14-c++ gcc14 # F44 dropped gcc14 — blocks the transaction
sudo dnf system-upgrade download --releasever=44
sudo dnf5 offline reboot # applies offline upgrade, shuts distro down
# wait a few minutes, relaunch:
cat /etc/fedora-release # → Fedora release 44 (Forty Four)
```
```powershell
# PowerShell — keep WSL itself current
wsl --update
```
## Steps
1. **Back up the instance** (PowerShell). The export tar is roughly the size of the installed system — this one was 86 GB. The target directory must already exist or you get `Wsl/ERROR_PATH_NOT_FOUND`.
```powershell
wsl --shutdown
mkdir D:\backups
wsl --export FedoraLinux-43 D:\backups\fedora43.tar
```
2. **Fully update the current release, then restart the distro**
```bash
sudo dnf upgrade --refresh -y
sudo shutdown -h now
```
3. **Remove upgrade blockers.** `gcc14`/`gcc14-c++` (compat packages) were retired in Fedora 44, so the transaction fails with "does not belong to a distupgrade repository". Remove them (or use `--allowerasing` and review the summary):
```bash
sudo dnf remove gcc14-c++ gcc14
```
4. **Download and apply the upgrade**
```bash
sudo dnf system-upgrade download --releasever=44
sudo dnf5 offline reboot
```
The "reboot" applies the offline transaction and shuts the distro down — there's no real systemd reboot in WSL. Wait a couple of minutes, then relaunch. If it errors on `systemctl`, the fallback is:
```bash
export DNF_SYSTEM_UPGRADE_NO_REBOOT=1
sudo -E dnf system-upgrade reboot
```
5. **Verify and tidy up**
```bash
cat /etc/fedora-release # Fedora release 44 (Forty Four)
sudo dnf upgrade --refresh # catch post-upgrade updates
gcc --version # F44 ships gcc 16; reinstall with `dnf install gcc gcc-c++` if removed
```
```powershell
wsl --update # fixes the post-upgrade Wsl/Service/E_UNEXPECTED catastrophic failure some users hit
```
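
The release check in step 5 can also be scripted, for example as a guard at the top of a post-upgrade script. This is a minimal sketch, assuming the standard `VERSION_ID` field of `/etc/os-release`; the helper name `release_is` is hypothetical and not part of dnf or WSL.

```bash
#!/usr/bin/env bash
# release_is EXPECTED [FILE]: succeed only if VERSION_ID in an
# os-release style file equals EXPECTED. FILE defaults to /etc/os-release.
release_is() {
    local expected="$1" file="${2:-/etc/os-release}" actual
    # VERSION_ID may be quoted (VERSION_ID="44") or bare (VERSION_ID=44).
    actual="$(sed -n 's/^VERSION_ID="\{0,1\}\([^"]*\)"\{0,1\}$/\1/p' "$file")"
    [ "$actual" = "$expected" ]
}
```

Usage would look like `release_is 44 || echo "still on the old release"`. Comparing the parsed field avoids depending on the exact wording of `/etc/fedora-release`.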
## CUDA Repo Swap
`dnf repolist` still showed `cuda-fedora39-x86_64` — NVIDIA repos are pinned per Fedora release and don't follow distro upgrades. NVIDIA publishes a fedora44 repo:
```bash
sudo rm /etc/yum.repos.d/cuda-fedora39*.repo
sudo dnf config-manager addrepo --from-repofile=https://developer.download.nvidia.com/compute/cuda/repos/fedora44/x86_64/cuda-fedora44.repo
sudo dnf upgrade --refresh
sudo dnf repolist # confirm cuda-fedora44-x86_64
```
**WSL caveat:** never install the NVIDIA *driver* inside WSL — the Windows host driver provides the GPU. Only install toolkit packages (e.g. `cuda-toolkit`).
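
Because these repos do not follow the distro, a stale one is easy to miss on the next upgrade. The sketch below lists `cuda-fedoraNN*.repo` files whose release number differs from the one you pass in. It assumes the repo files keep the `cuda-fedoraNN` naming shown above; the function name `stale_cuda_repos` is hypothetical.

```bash
#!/usr/bin/env bash
# stale_cuda_repos RELEASE [DIR]: print cuda-fedoraNN*.repo files in DIR
# whose NN differs from RELEASE. DIR defaults to /etc/yum.repos.d.
stale_cuda_repos() {
    local release="$1" dir="${2:-/etc/yum.repos.d}" f base ver
    for f in "$dir"/cuda-fedora*.repo; do
        [ -e "$f" ] || continue      # glob matched nothing
        base="${f##*/}"              # e.g. cuda-fedora39.repo
        ver="${base#cuda-fedora}"    # e.g. 39.repo
        ver="${ver%%[!0-9]*}"        # keep the leading digits: 39
        [ "$ver" = "$release" ] || printf '%s\n' "$f"
    done
}
```

On a Fedora host, `stale_cuda_repos "$(rpm -E %fedora)"` should print nothing once the swap is done. The function only reports; removal stays a manual step.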
## Gotchas & Notes
- **Don't jump more than two releases at once** (for example 42 → 44); upgrade in stages otherwise.
- **The WSL distro name is just a Windows label** — it still says "FedoraLinux-43" after the upgrade. Cosmetic fixes: Windows Terminal profile name, Start Menu shortcut, and `DistributionName`/`ShortcutPath` under `HKCU\Software\Microsoft\Windows\CurrentVersion\Lxss\{uuid}`.
- **Keep the backup tar** until the upgraded instance has proven stable for a few days, then delete to reclaim the space.
- **Restore path if needed:** `wsl --import FedoraRestore C:\WSL\FedoraRestore D:\backups\fedora43.tar` — remember imports default to root; fix via `/etc/wsl.conf` `[user] default=majorlinux`.
## See Also
- [WSL2 Instance Migration (Fedora 43)](wsl2-instance-migration-fedora43.md)
- [WSL2 Backup via PowerShell](wsl2-backup-powershell.md)

@@ -23,7 +23,14 @@ A collection of guides covering Linux administration, shell scripting, networkin
- [Ansible Getting Started](shell-scripting/ansible-getting-started.md)
- [Bash Scripting Patterns](shell-scripting/bash-scripting-patterns.md)
## Storage
- [SnapRAID & MergerFS Storage Setup](storage/snapraid-mergerfs-setup.md)
- [mdadm — Rebuilding a RAID Array After Reinstall](storage/mdadm-raid-rebuild.md)
- [Growing an LVM Volume by Absorbing Another Disk](storage/lvm-grow-volume-absorb-disk.md)
## Distro-Specific
- [Linux Distro Guide for Beginners](distro-specific/linux-distro-guide-beginners.md)
- [WSL2 Instance Migration to Fedora 43](distro-specific/wsl2-instance-migration-fedora43.md)
- [WSL2 In-Place Upgrade to Fedora 44](distro-specific/wsl2-fedora44-inplace-upgrade.md)

@@ -0,0 +1,159 @@
---
title: "Growing an LVM Volume by Absorbing Another Disk"
domain: linux
category: storage
tags: [lvm, lvextend, vgextend, pvcreate, resize2fs, ext4, storage, disk, homelab]
status: published
created: 2026-06-17
updated: 2026-06-17
---
# Growing an LVM Volume by Absorbing Another Disk
When an LVM-backed filesystem fills up and its volume group (VG) has no free
extents, you can grow it by adding a second physical disk as a new physical
volume (PV), extending the VG onto it, then extending the logical volume (LV)
and its filesystem. With ext4 this can be done **online** — no unmount, no
downtime for the volume being grown.
This guide covers the common case where the disk you want to absorb is currently
in use by its own LVM volume (you must evacuate and tear that down first), and
the precautions that keep it safe.
> [!warning] This enlarges your failure domain
> A single LV spanning two disks linearly (the default — no RAID/mirror) means
> **losing either disk loses the entire volume.** ext4 has no parity. Only do
> this for data you can rebuild, or layer redundancy (mdadm/LVM RAID) underneath.
> Back up anything irreplaceable first.
## The Short Answer
If the target disk (`/dev/sdX`) is already empty and unused:
```bash
sudo pvcreate /dev/sdX
sudo vgextend myvg /dev/sdX
sudo lvextend -l +100%FREE /dev/myvg/mylv
sudo resize2fs /dev/mapper/myvg-mylv # ext4, online; use xfs_growfs for XFS
```
The rest of this article handles the harder case: the target disk is currently
holding its own LVM volume with data on it.
## Step-by-Step
### 1. Survey the current layout
```bash
sudo pvs # physical volumes → which VG each belongs to
sudo vgs # volume groups, free extents (VFree)
sudo lvs # logical volumes and sizes
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
df -h
```
Confirm:
- The VG you want to grow (`myvg`) has `0` `VFree` (that's why you're here).
- The disk you want to absorb (`/dev/sdX`) is a **standalone** PV — not a member
of an mdadm array, a mergerfs branch, or a SnapRAID parity disk. Repurposing a
disk that something else depends on will break that thing silently.
### 2. Evacuate the disk you're about to absorb
Anything on the target disk will be **destroyed**. Move it somewhere with room to
spare, then prove the copy is intact before you trust it.
```bash
# Copy preserving permissions/timestamps
sudo rsync -a /mnt/olddisk/important /destination/with/space/
# Verify byte-for-byte: no differences listed and a final OK means identical
sudo diff -rq /mnt/olddisk/important /destination/with/space/important && echo OK
```
For large trees the `diff -rq` (full byte comparison) is slow but is the
authoritative check — don't skip it before the destructive phase. If an
application tracks files by path (databases, media servers), update its path
references to the new location *now*, while the old copy still exists as a
fallback.
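
If you want a record of the verification that outlives the old disk, a checksum manifest is one alternative to `diff -rq`. This is a sketch of that different technique, not the method used above. It assumes GNU `find`, `sort`, `xargs`, and `sha256sum`; the function names `tree_manifest` and `trees_match` are hypothetical.

```bash
#!/usr/bin/env bash
# tree_manifest DIR: print "checksum  ./relative/path" for every regular
# file under DIR, sorted by path so two manifests are comparable.
tree_manifest() {
    ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum )
}

# trees_match SRC DST: succeed only if both trees hold the same regular
# files with the same contents. Permissions and timestamps are not compared.
trees_match() {
    local a b
    [ -d "$1" ] && [ -d "$2" ] || return 2
    a="$(tree_manifest "$1")"
    b="$(tree_manifest "$2")"
    [ "$a" = "$b" ]
}
```

For example, `trees_match /mnt/olddisk/important /destination/with/space/important && echo OK`. Redirecting `tree_manifest` to a file before the destructive phase gives you something to check the copy against later.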
### 3. Unmount and remove the old disk from fstab
```bash
sudo fuser -m /mnt/olddisk # confirm nothing holds it open
sudo umount /mnt/olddisk
mountpoint -q /mnt/olddisk && echo "STILL MOUNTED" || echo "unmounted"
sudo cp /etc/fstab /etc/fstab.bak-$(date +%Y%m%d) # always back up fstab
sudo sed -i '/olddisk/d' /etc/fstab # remove the stale entry
grep olddisk /etc/fstab || echo "fstab line gone"
```
> [!tip] Verify your `sed` pattern only matches the line you mean
> A too-broad pattern can delete the wrong fstab entry. Check the file before and
> after, and keep the backup until you've confirmed the system still boots.
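
One way to preview what a pattern would hit is to match on the mount-point field instead of a substring. The sketch below prints the fstab entries whose second field is exactly the mount point you name; the helper `fstab_entries_for` is hypothetical.

```bash
#!/usr/bin/env bash
# fstab_entries_for FILE MOUNTPOINT: print "lineno:line" for non-comment
# entries whose second field (the mount point) is exactly MOUNTPOINT.
fstab_entries_for() {
    awk -v mp="$2" '$1 !~ /^#/ && $2 == mp { print NR ":" $0 }' "$1"
}
```

Running `fstab_entries_for /etc/fstab /mnt/olddisk` before the `sed` should print exactly one line. A bare substring such as `olddisk` would also match a mount point like `/mnt/olddisk2` or a comment that mentions the name.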
### 4. Tear down the old disk's LVM
```bash
sudo lvremove -y /dev/oldvg/oldlv
sudo vgremove -y oldvg
sudo pvremove -y /dev/sdX # wipes the LVM label off the disk
```
This is the point of no return for the old disk's data, which is why steps 2–3
verified the copy first.
### 5. Add the disk to the target VG and extend
```bash
sudo pvcreate -y /dev/sdX
sudo vgextend myvg /dev/sdX
sudo lvextend -l +100%FREE /dev/myvg/mylv
```
`lvs`/`vgs` should now show the LV grown to span both disks and `0` free extents.
### 6. Grow the filesystem (online)
```bash
# ext4 — works while mounted
sudo resize2fs /dev/mapper/myvg-mylv
# XFS — grows online too, but takes the mountpoint, not the device
sudo xfs_growfs /mountpoint
```
`resize2fs` is idempotent — if it gets interrupted, just run it again; it reports
"Nothing to do!" once the filesystem already fills the LV.
### 7. Verify
```bash
df -h /mountpoint # should reflect the new larger size
sudo pvs # /dev/sdX now listed under myvg
sudo vgs myvg # two PVs, larger VSize
```
## Notes & Gotchas
- **Online resize works for the volume being grown, not the one being removed.**
The disk you absorb must be unmounted and torn down; the destination LV stays
mounted throughout.
- **`resize2fs` interruption is safe.** ext4 online resize is journaled; re-run it.
- **macOS cruft on evacuated disks.** Trees touched by macOS often carry
`._*` AppleDouble files and `.DS_Store` — harmless to drop, but they inflate
file counts in `diff`/`rsync` output. Don't mistake them for real data.
- **Check SMART on a disk you're promoting into a bigger role.** A disk with a
pending-sector history is riskier once it's in the critical path for a whole
multi-disk volume than it was holding a small isolated one.
- **Mountpoint cleanup.** After the old disk is gone, its former mountpoint
directory may reappear (it was shadowed by the mount). `rmdir` it if empty.
Note `ls -A` exits `0` on an empty directory, so don't gate cleanup on its exit
status — test contents explicitly.
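
The mountpoint cleanup in the last note can be sketched as follows. It tests the directory listing itself, not the exit status of `ls`; the function name `remove_if_empty` is hypothetical.

```bash
#!/usr/bin/env bash
# remove_if_empty DIR: remove DIR only when it has no entries at all
# (dotfiles included); otherwise leave it alone and return non-zero.
remove_if_empty() {
    [ -d "$1" ] || return 1
    # Check what ls prints, because ls -A exits 0 either way.
    if [ -z "$(ls -A "$1")" ]; then
        rmdir "$1"
    else
        return 1
    fi
}
```

`rmdir` refuses to remove a non-empty directory on its own, so the explicit test mainly keeps the intent visible and the return code predictable in scripts.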
## Related
- [SnapRAID & MergerFS Storage Setup](snapraid-mergerfs-setup.md) — add redundancy/parity instead of a linear span
- [mdadm — Rebuilding a RAID Array After Reinstall](mdadm-raid-rebuild.md)

@@ -66,14 +66,15 @@ Every server in the fleet should have these. Check each one after migration:
### After Migration
1. **Set the timezone**: `timedatectl set-timezone America/New_York` (US) or `Europe/London` (UK). Hetzner images default to UTC.
2. **Verify CA bundle (Fedora)**: `ls /etc/pki/tls/certs/ca-bundle.crt`. If missing, Postfix TLS, curl, and dnf will all fail silently. See [Fedora CA bundle fix](../../05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md).
3. **Run `harden.yml` against the new host** — catches most gaps in one pass
4. **Send a test email**: `echo test | mail -s "test" marcus@majorshouse.com` — if this fails, nothing else can alert you
5. **Verify crond is running**: `systemctl is-active crond` (Fedora) or `systemctl is-active cron` (Ubuntu). cronie can be `enabled` but not `active` after provisioning.
6. **Check Netdata Cloud** — verify the new node appears and alerts are flowing
7. **Compare fail2ban jails**: `fail2ban-client status` on both old and new
8. **Verify logwatch sends**: `sudo logwatch --output mail --range today`
9. **Keep the old box powered off but not destroyed** for at least 7 days after remediation
2. **Set the system hostname** — Hetzner provisions the box as `<host>-hetzner`. Run `hostnamectl set-hostname <host>` and fix the loopback line: `sed -i "s/127.0.1.1.*/127.0.1.1 <host> <host>/" /etc/hosts`. Skip this and **Logwatch emails arrive titled `Logwatch for <host>-hetzner`** weeks later. Do it alongside the Tailscale node rename and Postfix `myhostname` — all three read from the provisioning label. See [Logwatch wrong hostname after migration](../../05-troubleshooting/logwatch-wrong-hostname-after-migration.md).
3. **Verify CA bundle (Fedora)**: `ls /etc/pki/tls/certs/ca-bundle.crt`. If missing, Postfix TLS, curl, and dnf will all fail silently. See [Fedora CA bundle fix](../../05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md).
4. **Run `harden.yml` against the new host** — catches most gaps in one pass
5. **Send a test email**: `echo test | mail -s "test" marcus@majorshouse.com` — if this fails, nothing else can alert you
6. **Verify crond is running**: `systemctl is-active crond` (Fedora) or `systemctl is-active cron` (Ubuntu). cronie can be `enabled` but not `active` after provisioning.
7. **Check Netdata Cloud** — verify the new node appears and alerts are flowing
8. **Compare fail2ban jails**: `fail2ban-client status` on both old and new
9. **Verify logwatch sends**: `sudo logwatch --output mail --range today`
10. **Keep the old box powered off but not destroyed** for at least 7 days after remediation
### Using doctl to Manage Old Droplets

@@ -38,6 +38,7 @@ Guides for running your own services at home, including Docker, reverse proxies,
- [Mastodon Federation](services/mastodon-federation.md)
- [Mastodon `--prune-profiles` Trap](services/mastodon-prune-profiles-trap.md)
- [Mastodon on S3 — Silent Upload Failures](services/mastodon-s3-acl-upload-failures.md)
- [Mastodon — Triaging Crowdfunding / Mention-Spam Accounts](services/mastodon-mention-spam-crowdfunding.md)
- [Ghost SMTP via Mailgun](services/ghost-smtp-mailgun-setup.md)
- [Updating n8n Docker](services/updating-n8n-docker.md)
- [Claude Code Remote Control](services/claude-code-remote-control.md)

@@ -235,9 +235,12 @@ sed -i '/^127\.0\.1\.1/d' /etc/hosts && \
systemctl reload postfix
```
> [!tip] Same drift, different symptom: the Logwatch **title**
> Hetzner provisions boxes with `<host>-hetzner` as the *system* hostname. When that's never corrected, Logwatch (which reads the live hostname at runtime) mails reports titled `Logwatch for <host>-hetzner` — no postfix involvement needed. Same `hostnamectl set-hostname` + `/etc/hosts` fix as above. See [Logwatch wrong hostname after migration](../../05-troubleshooting/logwatch-wrong-hostname-after-migration.md).
### 2. Empty `relayhost` quietly forces public-MX delivery
If `postconf relayhost` returns an empty value, postfix doesn't fail — it just does an MX lookup for the destination domain and tries to deliver directly. For mail to your own mail server, that means going via the **public MX** (the domain's external MX record, e.g., `mail.majorshouse.com → 165.227.187.191:25`) instead of the **internal/Tailscale relay path** the rest of the fleet uses.
If `postconf relayhost` returns an empty value, postfix doesn't fail — it just does an MX lookup for the destination domain and tries to deliver directly. For mail to your own mail server, that means going via the **public MX** (the domain's external MX record, e.g., `mail.majorshouse.com → 203.0.113.10:25`) instead of the **internal/Tailscale relay path** the rest of the fleet uses.
The public-MX path is subject to whatever spam filtering, content checks, and trust rules the receiving MX has configured for external traffic. Internal Tailscale-IP traffic typically gets a faster trust shortcut (e.g., bypass spamchk pipe). So this single configuration drift causes one host's mail to land in a different code path than its siblings — and then silently get filtered.

@@ -0,0 +1,130 @@
---
title: "Migrating Flat Ansible Playbooks to Roles (Safely)"
domain: selfhosting
category: security
tags: [ansible, roles, refactor, fleet, migration, fail2ban, infrastructure]
status: published
created: 2026-06-18
updated: 2026-06-18
---
# Migrating Flat Ansible Playbooks to Roles (Safely)
## Overview
A fleet repo tends to grow a sprawl of flat `configure_*.yml` playbooks — one per subsystem, plus near-duplicates for variants (e.g. ~10 `configure_fail2ban_*` playbooks), all sharing a single overloaded top-level `templates/` directory. It works, but it resists reuse: there is no clean `defaults/` precedence, no encapsulation, and no way to compose a host's full configuration in one place.
Ansible **roles** fix this — but migrating a *live* fleet is where it gets dangerous. The risk is not the refactor itself; it's accidentally changing deployed behaviour while you "just reorganize." This article covers the incremental, regression-free approach used to migrate an 11-host fleet, including the two techniques that keep it safe: **byte-identical migration** and **capture-based reconciliation**.
> This is a process/pattern article. For the specific roles in this fleet, see the internal runbook. The techniques here generalize to any flat-playbook → role migration.
## Decide What Becomes a Role vs. What Stays a Playbook
Not everything should be a role. Draw the line by purpose:
| Becomes a role | Stays a playbook |
|---|---|
| Reusable host **configuration** (a subsystem you converge to a desired state) | **Ops / one-off** actions: `update`, `reboot`, `harden`, `bootstrap`, `provision`, `fix_*`, `verify_*` |
| Has templates/files, defaults, handlers | Orchestrators that just `import_playbook` other things |
| Applied repeatedly and idempotently | Run-once or run-as-needed remediation |
Roles get the standard `roles/<name>/` layout (`tasks/`, `defaults/`, `handlers/`, `templates/`, `files/`, `meta/`). Name them after the **subsystem noun** (`fail2ban`, `clamav`, `firewall`) — drop the `configure_` verb prefix.
## The Incremental Loop (one role per branch)
Migrate **one subsystem per branch** and validate before merging. This keeps every change small enough to diff by eye and roll back cleanly:
1. `git mv` the templates/files into `roles/<name>/` so **git tracks them as renames** (history preserved, 100% rename score).
2. Move task bodies into `roles/<name>/tasks/` (split by lifecycle: install → service → config → verify).
3. Lift tunables into `roles/<name>/defaults/main.yml`; keep per-host overrides in `group_vars`/`host_vars`.
4. Add a thin entry playbook `<name>.yml` (`hosts: <group>` + `roles: [<name>]`).
5. Validate with `--check --diff` against a single host **before** merging.
6. Merge, then move to the next subsystem.
## Technique 1: Byte-Identical Migration
When the goal is "reorganize without changing behaviour," **prove** it. After moving a playbook into a role, the rendered task bodies should be identical to the original. Verify with a normalized diff against `main`:
```bash
# Compare the role's task body against the original flat playbook,
# ignoring only comments/whitespace you intend to change.
git show main:configure_clamav.yml > /tmp/old.yml
# ...extract the task list from roles/clamav/tasks/*.yml and diff
diff <(yq '.[] | .tasks' /tmp/old.yml) <(cat roles/clamav/tasks/*.yml)
```
The acceptance bar: `--check --diff` against a real host returns **`changed=0`** (or only the diffs you explicitly intended, like a doc-comment line). If a "faithful" migration shows unexpected `changed=N`, you altered behaviour; stop and reconcile before merging. Templates moved via `git mv` show as **100% renames** in `git show --stat --summary` (the percentage appears on the `rename` summary lines), which is your proof the deployed content is unchanged.
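
The rename proof can be checked mechanically as well. The sketch below filters `git diff --name-status -M` output, where a pure rename is reported with status `R100`, and prints everything else for review. The helper name `non_pure_renames` is hypothetical.

```bash
#!/usr/bin/env bash
# non_pure_renames: read `git diff --name-status -M` output on stdin and
# print every entry that is not a 100% rename (status R100).
non_pure_renames() {
    awk -F '\t' '$1 != "R100"'
}
```

One possible invocation is `git diff --name-status -M main...HEAD -- templates roles | non_pure_renames`. New task files will legitimately appear as `A`; a template showing `M` or a rename score below 100 means its content changed during the move.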
## Technique 2: Consolidating Near-Duplicates with Feature Flags
The big win is collapsing a family of near-duplicate playbooks (the ~10 `configure_fail2ban_*`) into **one role with flag-gated task files**:
```yaml
# group_vars/<group>.yml — hosts self-select which jails/components they get
fail2ban_jail_sshd: true
fail2ban_jail_wordpress: true
fail2ban_jail_nginx_bad_request: false
```
```yaml
# roles/fail2ban/tasks/main.yml
- import_tasks: jail_wordpress.yml
when: fail2ban_jail_wordpress | default(false)
```
> **Critical gotcha: key flags to inventory GROUPS, not `ansible_os_family`.** It is tempting to gate OS-specific task files on `ansible_os_family == 'Debian'`. Don't. An OS-wide selector frequently matches hosts the *original playbooks deliberately excluded* (e.g. a LAN-only Debian box that should get the network-wait step but **not** the public SSH bind, or a WSL host in the `fedora` group that must be skipped). Keep the original curated host patterns and set the flag per play/group. Keying on `os_family` silently widens a play's host set and is exactly how a "refactor" pushes config to a host that never had it.
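
A widened host set can be caught before any run by comparing the hosts the old playbook targeted with the hosts the new entry playbook targets. This sketch assumes you have saved each list to a file, one hostname per line (for instance copied from `ansible-playbook <playbook> --list-hosts` on each branch); the function name `added_hosts` and the file layout are hypothetical.

```bash
#!/usr/bin/env bash
# added_hosts OLD NEW: print hosts that appear in NEW but not in OLD.
# Each file holds one hostname per line, in any order.
added_hosts() {
    local old_sorted new_sorted
    old_sorted="$(mktemp)"
    new_sorted="$(mktemp)"
    sort -u "$1" > "$old_sorted"
    sort -u "$2" > "$new_sorted"
    comm -13 "$old_sorted" "$new_sorted"   # lines unique to NEW
    rm -f "$old_sorted" "$new_sorted"
}
```

Any hostname printed is a host the refactor would newly configure, which deserves an explicit decision before merging.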
## Technique 3: Capture-Based Reconciliation (the safety net)
This is the one that prevents an outage. Sometimes a role gets written as a **fresh re-implementation** of a subsystem rather than a faithful move — a cleaner `jail.local`, new drop-ins, a different default set. It may even be merged into `site.yml`. The trap: that role has **never been rolled out**, and its config *diverges* from what's actually deployed.
Running it would push divergent config to a live, security-sensitive subsystem (intrusion protection, firewall) across the whole fleet on the next `harden.yml`.
The check that catches it:
```bash
ansible-playbook fail2ban.yml --check --diff --limit <host>
# Divergent role => changed=8-12 per host + failures (missing filters/timers)
# Faithful role => changed=0, failed=0
```
**Capture-based reconciliation** is the fix: instead of pushing the role's idea of "correct," bring the **role into parity with the live, working config** first. Capture what's actually deployed, fold it into the role's templates/defaults until `--check` is clean fleet-wide, *then* switch the orchestrator over and retire the old playbooks. Order of operations:
1. **Decide the source of truth** — the live config or the new role. For security subsystems, the live (working) config wins.
2. **Reconcile** the role to match live until `--check` shows `changed=0, failed=0` on every host.
3. **Roll out host-by-host** with real runs; verify the service restarts cleanly and (for fail2ban) jails are actually active.
4. **Only then** delete the old playbooks, rewire `harden.yml`/`bootstrap.yml`, and remove the orphaned top-level templates.
Never delete the old mechanism until the new one is proven converged everywhere. "It's in `site.yml`" is not the same as "it's been rolled out."
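The "clean on every host" gate in step 2 can be checked mechanically rather than by eye. This is a rough sketch, assuming the standard `PLAY RECAP` line format (`host : ok=N changed=N unreachable=N failed=N ...`); the function name `recap_is_clean` is hypothetical.

```bash
# recap_is_clean: read ansible-playbook output on stdin; succeed only if at
# least one PLAY RECAP line was seen and every such line reports changed=0,
# unreachable=0 and failed=0.
recap_is_clean() {
  awk '
    / ok=[0-9]+ / && /changed=[0-9]+/ {
      seen = 1
      if ($0 !~ /changed=0([^0-9]|$)/ ||
          $0 !~ /failed=0([^0-9]|$)/ ||
          $0 !~ /unreachable=0([^0-9]|$)/) bad = 1
    }
    END { exit (seen && !bad) ? 0 : 1 }
  '
}
```

A possible use is `ansible-playbook fail2ban.yml --check --diff --limit "$host" | recap_is_clean || echo "$host not converged"`, looped over the fleet before any orchestrator is rewired.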
## Composition: `site.yml`, `harden.yml`, `bootstrap.yml`
Once subsystems are roles, compose them with thin orchestrators that `import_playbook` the role entry points — so each subsystem keeps a **single source of truth** for its host mapping:
```yaml
# site.yml — day-to-day fleet convergence, in dependency order
- import_playbook: swap.yml
- import_playbook: tailscale.yml
- import_playbook: ssh_hardening.yml
- import_playbook: firewall.yml
- import_playbook: fail2ban.yml
- import_playbook: clamav.yml
```
Order matters: base layer (swap) → networking (tailscale) → access (ssh_hardening) → perimeter (firewall) → intrusion protection (fail2ban) → malware scanning (clamav). Bootstrap-only roles (guest agent, root password, provisioning prerequisites) belong in `bootstrap.yml`, not `site.yml`.
## Verification Checklist
- [ ] Templates moved with `git mv` (show as 100% renames)
- [ ] `--check --diff` on a real host = `changed=0` (or only intended diffs)
- [ ] Consolidation flags keyed to **inventory groups**, not `ansible_os_family`
- [ ] Re-implemented roles reconciled to live parity **before** rollout (no surprise `changed=N`)
- [ ] Security subsystems rolled out host-by-host with service-active verification
- [ ] Old playbooks/templates deleted **only after** the role is converged fleet-wide
- [ ] Orchestrators (`site.yml`/`harden.yml`/`bootstrap.yml`) rewired; stale references swept
## Related
- [SSH Hardening Fleet-Wide with Ansible](ssh-hardening-ansible-fleet.md)
- [ClamAV Fleet Deployment with Ansible](clamav-fleet-deployment.md)
- [Firewall Hardening with firewalld on Fedora Fleet](firewalld-fleet-hardening.md)
- [Standardizing unattended-upgrades with Ansible](ansible-unattended-upgrades-fleet.md)

@ -0,0 +1,170 @@
---
title: "Mastodon — Triaging Crowdfunding / Mention-Spam Accounts"
description: How to tell broadcast fundraising solicitation from genuine mentions, investigate the account and its origin instance with SQL + nodeinfo, and pick a proportionate moderation action.
tags:
- mastodon
- moderation
- abuse
- federation
- self-hosting
created: 2026-06-22
updated: 2026-06-22
---
# Mastodon — Triaging Crowdfunding / Mention-Spam Accounts
If you run a Mastodon instance, sooner or later you (or your users) start getting tagged by accounts you've never interacted with, posting donation appeals with a link and a wall of hashtags. Some are real people in desperate situations; some are recycled-link scams. Either way, when an account is **broadcasting a solicitation at you** rather than replying to you, it's a moderation question, not a conversation.
This article is the runbook for telling the two apart, investigating both the **account** and its **origin instance**, and choosing an action that's proportionate instead of nuking eight years of legit federation over two bad actors.
## TL;DR
- A mention is **broadcast spam**, not engagement, when it's a *standalone post* (not a reply) that *tags a large fixed list* of accounts and carries a *donation link*, usually from a *throwaway profile* on an *open-registration instance*.
- Investigate before acting: pull the account's age/stats/bio and check whether the post is a reply or a 40-way blast (SQL below). Profile the origin instance via its public `nodeinfo`.
- **Default action is an account-level block**, which also federates and removes their follow of you. Escalate to domain-limit / domain-block only when *one instance* produces *repeat offenders*.
- Keep a log so single incidents that are actually a pattern become visible.
## Signals that a mention is broadcast solicitation
Score it on how many of these hold:
| Signal | Why it matters |
|---|---|
| **Standalone post, not a reply** (`in_reply_to_account_id IS NULL`) but still tags you | They're broadcasting, not responding |
| **Tags a large fixed recipient list** (e.g. 40+) | Mass distribution; the same list reused across senders = coordination |
| **Donation link** in post or bio (`chuffed.org`, `gofundme`, `paypal.me`, `ko-fi`) | The payload |
| **Throwaway profile** — days old, few followers, follows you but you don't follow back | Disposable, baiting a profile view |
| **Mass-follow ratio** — following thousands / few hundred followers | Engagement farming |
| **"I am not a scammer" disclaimer** in bio | Known red-flag phrase |
| **Origin instance: open registration, no approval** | Easy throwaway-account farm |
> [!warning] Judgment, not a purity test
> Many of these accounts are real people. The goal is not to adjudicate need — it's to stop *broadcast solicitation aimed at you* and track the *source instances*. Prefer the lightest action that stops it.
## Investigate the account
Connect to the DB on the instance:
```bash
ssh <your-mastodon-host>
sudo -u postgres psql mastodon_production
```
**Profile + stats for a suspect** (age, post count, follower ratio, bio):
```sql
SELECT a.username||'@'||a.domain,
to_char(a.created_at,'YYYY-MM-DD') AS first_seen_locally,
st.statuses_count, st.followers_count, st.following_count,
left(regexp_replace(COALESCE(a.note,''),'<[^>]+>','','g'),200) AS bio
FROM accounts a LEFT JOIN account_stats st ON st.account_id=a.id
WHERE a.domain='<INSTANCE>' AND a.username='<HANDLE>';
```
**Is the mention a reply or a blast?** `standalone=t` with a high `num_tagged` is the tell:
```sql
SELECT a.username, to_char(s.created_at,'YYYY-MM-DD HH24:MI') AS posted,
s.in_reply_to_account_id IS NULL AS standalone,
(SELECT count(*) FROM mentions mm WHERE mm.status_id=s.id) AS num_tagged
FROM mentions m JOIN statuses s ON s.id=m.status_id
JOIN accounts a ON a.id=s.account_id
JOIN accounts me ON me.id=m.account_id AND me.username='<YOU>' AND me.domain IS NULL
WHERE a.username='<HANDLE>' AND a.domain='<INSTANCE>'
ORDER BY s.created_at DESC;
```
**All recent direct mentions of you** (sweep for the wider pattern):
```sql
SELECT to_char(n.created_at,'YYYY-MM-DD HH24:MI') AS when,
a.username||COALESCE('@'||a.domain,'@local') AS who,
COALESCE(s.uri,'') AS uri,
left(regexp_replace(COALESCE(s.text,''),'<[^>]+>','','g'),200) AS body
FROM notifications n
JOIN accounts recip ON recip.id=n.account_id AND recip.username='<YOU>' AND recip.domain IS NULL
JOIN accounts a ON a.id=n.from_account_id
LEFT JOIN mentions m ON m.id=n.activity_id AND n.activity_type='Mention'
LEFT JOIN statuses s ON s.id=m.status_id
WHERE n.type='mention' ORDER BY n.created_at DESC LIMIT 40;
```
## Profile the origin instance
Don't judge an instance by one bad account. Pull its public metadata — no auth needed:
```bash
# Software, version, user counts, registration policy
NI=$(curl -s https://<INSTANCE>/.well-known/nodeinfo | python3 -c 'import sys,json;print(json.load(sys.stdin)["links"][-1]["href"])')
curl -s "$NI" | python3 -m json.tool # software, openRegistrations, usage.users
# Title, contact/admin, rules, registration approval flag
curl -s https://<INSTANCE>/api/v2/instance | python3 -m json.tool
```
What to read off it:
- **`openRegistrations: true` + `approval_required: false`** → throwaway-account farm; expect more of the same.
- **`usage.users.total` vs `usage.users.activeMonth`** → a huge dormant base is typical of sign-up-and-leave farms.
- **Federation age on your side** — how long you've known the instance, how many of its accounts you cache. A long, broad relationship argues *against* a domain block.
- **The instance's own rules** — many ban "backlink accounts" / harassment, which the mass-tag fundraising violates. That makes **reporting to its admin a legitimate, in-policy path.**
```sql
-- What your instance already knows about the domain
SELECT (SELECT count(*) FROM accounts WHERE domain='<INSTANCE>') AS known_accounts,
(SELECT count(*) FROM statuses s JOIN accounts a ON a.id=s.account_id WHERE a.domain='<INSTANCE>') AS cached_statuses,
(SELECT to_char(min(created_at),'YYYY-MM-DD') FROM accounts WHERE domain='<INSTANCE>') AS first_seen,
(SELECT count(*) FROM domain_blocks WHERE domain='<INSTANCE>') AS is_domain_blocked;
```
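The nodeinfo fields discussed above can be pulled out in one line per instance, which is handy when filling in a watch list. A small sketch, assuming a NodeInfo 2.x document on stdin and `python3` on the host (the same tool the `curl` pipeline above relies on); `nodeinfo_summary` is a hypothetical helper name.

```bash
# nodeinfo_summary: read a NodeInfo 2.x JSON document on stdin and print the
# registration flag and user counts on a single line.
nodeinfo_summary() {
  python3 -c '
import json, sys
doc = json.load(sys.stdin)
users = doc.get("usage", {}).get("users", {})
print("open_registrations=%s total=%s active_month=%s" % (
    doc.get("openRegistrations"), users.get("total"), users.get("activeMonth")))
'
}
```

With `$NI` resolved as in the earlier snippet, `curl -s "$NI" | nodeinfo_summary` prints the three values side by side. The approval flag is not in nodeinfo, so it still has to be read from `/api/v2/instance`.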
## The escalation ladder
| Level | Action | Effect | When |
|---|---|---|---|
| 1 | **Mute** | You stop seeing them; silent | Borderline; you don't want to cut them off |
| 2 | **Block (account)** | Cuts mentions, removes their follow, federates to their instance | **Default first action** |
| 3 | **Report** to source admin | Forwards the offending posts to their moderators | Repeat or egregious; in-policy on most instances |
| 4 | **Domain-limit (silence)** | Their posts show only if you follow that account | One instance, multiple offenders |
| 5 | **Domain-block (suspend)** | Severs all known accounts + federation | Instance is predominantly abuse |
### Blocking from a user account (federates + removes follow)
There is no `tootctl accounts block`. Do it through Mastodon's `BlockService`, called from a Rails runner, so it tears down the relationship and federates correctly:
```ruby
# run as the mastodon user:
# sudo -u mastodon bash -c 'cd /home/mastodon/live && RAILS_ENV=production bin/rails runner /tmp/block.rb'
me = Account.find_by(username: "<YOU>", domain: nil)
%w[Handle1 Handle2].each do |u|
  t = Account.find_by(username: u, domain: "<INSTANCE>")
  next puts("NOTFOUND #{u}") if t.nil?
  BlockService.new.call(me, t)
  puts "BLOCKED #{u} blocking=#{me.blocking?(t)} they_follow_me=#{t.following?(me)}"
end
```
`blocking=true` with `they_follow_me=false` confirms the block landed and the follow was severed.
### Instance-level actions
Domain-limit / domain-block live in the admin UI (**Moderation → Federation**) or via `tootctl`:
```bash
# Silence (limit) — posts hidden unless followed
RAILS_ENV=production bin/tootctl domains ... # or set severity=silence in the admin UI
# Suspend (block) the whole instance
RAILS_ENV=production bin/tootctl ... # admin UI "Add domain block" is the safe path
```
> [!tip] Reach for the lightest hammer
> A domain block is rarely the right first move against an established instance — you lose every legit account and years of federation to swat a couple of accounts. Block the accounts, report them to the source admin, and only escalate the *instance* when it demonstrates a sustained, multi-actor pattern.
## Keep a log
Track offenders and source instances over time so a "one-off" that's actually a campaign becomes visible, and so domain-level decisions are evidence-based. A simple table — date, account, instance, signals, action — plus an instance-watch table with each source's registration policy and offender count is enough.
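A flat CSV is enough to start with. The sketch below assumes the column order named above (date, account, instance, signals, action); the function names and the example values in the usage note are hypothetical.

```bash
# log_incident FILE DATE ACCOUNT INSTANCE SIGNALS ACTION
# Appends one CSV row, writing the header first if the file is new or empty.
log_incident() {
  log_file=$1; shift
  [ -s "$log_file" ] || printf 'date,account,instance,signals,action\n' > "$log_file"
  printf '%s,%s,%s,"%s",%s\n' "$1" "$2" "$3" "$4" "$5" >> "$log_file"
}

# offenders_per_instance FILE: count logged rows per instance, busiest first.
offenders_per_instance() {
  tail -n +2 "$1" | cut -d, -f3 | sort | uniq -c | sort -rn
}
```

For example, `log_incident ~/mention-spam.csv 2026-06-22 somehandle example.social "standalone; 40 tags" block` records one incident, and `offenders_per_instance ~/mention-spam.csv` shows which source instances are accumulating offenders.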
## Related
- [Mastodon `--prune-profiles` Trap](mastodon-prune-profiles-trap.md)
- [Mastodon DB Maintenance](mastodon-db-maintenance.md)
- [Mastodon Federation](mastodon-federation.md)

@ -0,0 +1,137 @@
---
title: "App-Consistent Fleet Backups with restic + Backblaze B2"
domain: selfhosting
category: storage-backup
tags: [restic, backblaze, b2, backup, ansible, systemd, postgresql, mysql, sqlite, docker, disaster-recovery]
status: published
created: 2026-06-19
updated: 2026-06-19
---
# App-Consistent Fleet Backups with restic + Backblaze B2
A repeatable pattern for backing up a mixed fleet (Ubuntu + Fedora, VPS + homelab, bare services + Docker) to Backblaze B2 with [restic](https://restic.net) — encrypted, deduplicated, and **app-consistent** (databases are dumped before the snapshot, not copied live). Driven by Ansible and a per-host `systemd` timer.
## The Short Answer
Per host, nightly: **dump every database to a staging dir → `restic backup` that staging dir plus the data paths → apply retention → wipe staging.** A monthly timer runs `restic prune`. Anything that fails emails the admin. One B2 bucket holds a separate repo per host at `b2:<bucket>:<hostname>`.
Retention is `--keep-daily 7 --keep-weekly 4 --keep-monthly 6` (~6 months of history).
## Why dump databases first
Copying a live database's files (`/var/lib/mysql`, a running SQLite file, a Postgres data dir) gives you a *crash-consistent* copy at best — restorable only if you're lucky. Logical dumps are guaranteed consistent:
- **MySQL / MariaDB:** `mysqldump --single-transaction --routines --triggers --databases <db>`
- **PostgreSQL:** `pg_dump -Fc <db>` (custom format) via the `postgres` system user (peer auth)
- **SQLite:** `sqlite3 <file> ".backup '<out>'"` — uses the online backup API, safe against a running writer
- **Dockerized DBs:** `docker exec <container> sh -c '<dump cmd>'`, letting the container's own shell expand its root-password env var
restic then backs up the dump files (which dedupe beautifully — only the changed blocks upload each night).
## Repository layout
- **One private B2 bucket** (e.g. `majorshouse-backups`).
- **One repo per host:** `b2:majorshouse-backups:<hostname>`.
- The application key needs **read + write + delete** for the bucket. restic deletes objects during `forget`/`prune`, so a pure *append-only* key will break retention. (True append-only requires splitting `forget`/`prune` onto a separate maintenance key — a worthwhile hardening step, but not the default.)
- Credentials live in an `EnvironmentFile` (`/etc/restic/restic-env`, mode `0600`, root): `RESTIC_REPOSITORY`, `RESTIC_PASSWORD`, `B2_ACCOUNT_ID`, `B2_ACCOUNT_KEY`.
## The backup script (shape)
```bash
set -uo pipefail
STAGING=/var/backups/restic-staging
rm -rf "$STAGING"; mkdir -p "$STAGING"; chmod 700 "$STAGING"
# per-engine dumps into $STAGING ...
mysqldump --single-transaction --routines --triggers --databases wordpress > "$STAGING/mysql-wordpress.sql"
sudo -u postgres pg_dump -Fc mastodon_production > "$STAGING/pg-mastodon_production.dump"
sqlite3 /opt/phantombot/config/phantombot.db ".backup '$STAGING/sqlite-phantombot.db'"
restic backup --tag fleet-backup --host "$(hostname -s)" \
"$STAGING" /var/www /etc/letsencrypt --exclude /path/to/already-offsite/media
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6
rm -rf "$STAGING"
```
Wrap each step so a failure mails the admin and aborts (don't silently back up a half-state). On hosts where the `mail` CLI is absent, pipe a message to `/usr/sbin/sendmail -t` instead.
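One possible shape for that wrapper is sketched below. It is an illustration, not the deployed script: `ADMIN_EMAIL`, `notify_admin` and `run_step` are hypothetical names, and the mail fallback order simply follows the note above.

```bash
ADMIN_EMAIL="${ADMIN_EMAIL:-root}"   # assumption: where alerts should go

# notify_admin SUBJECT BODY: prefer mail(1), fall back to sendmail -t, else stderr.
notify_admin() {
  if command -v mail >/dev/null 2>&1; then
    printf '%s\n' "$2" | mail -s "$1" "$ADMIN_EMAIL"
  elif [ -x /usr/sbin/sendmail ]; then
    printf 'To: %s\nSubject: %s\n\n%s\n' "$ADMIN_EMAIL" "$1" "$2" | /usr/sbin/sendmail -t
  else
    printf '%s: %s\n' "$1" "$2" >&2
  fi
}

# run_step LABEL CMD [ARGS...]: run the command; on failure, alert and return 1.
run_step() {
  step_label=$1; shift
  if ! "$@"; then
    notify_admin "restic backup FAILED on $(uname -n): $step_label" \
      "Step '$step_label' exited non-zero; the backup was aborted."
    return 1
  fi
}
```

Steps that need a redirect can be wrapped in a small function first, then called as `run_step "mysqldump wordpress" dump_wordpress || exit 1`, so a failed dump stops the script before `restic backup` runs.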
## systemd units
A oneshot service + a timer. Stagger `OnCalendar` per host to spread B2 load, and **always set `RESTIC_CACHE_DIR`** (see Gotchas):
```ini
# restic-backup.service
[Service]
Type=oneshot
EnvironmentFile=/etc/restic/restic-env
Environment=RESTIC_CACHE_DIR=/var/cache/restic
ExecStart=/usr/local/sbin/restic-backup.sh
Nice=10
IOSchedulingClass=idle
```
```ini
# restic-backup.timer
[Timer]
OnCalendar=*-*-* 02:30:00
RandomizedDelaySec=20m
Persistent=true
[Install]
WantedBy=timers.target
```
A second `restic-prune.timer` runs `restic prune` monthly (`OnCalendar=*-*-01 04:00:00`).
## Restore procedure
The whole point. From the target host (or any host with the repo creds):
```bash
# load repo + B2 creds without echoing them
set -a; . /etc/restic/restic-env; set +a
restic snapshots # list; note the snapshot ID or use 'latest'
# restore specific paths to a scratch dir (never restore in place blindly)
restic restore latest --target /tmp/restore \
--include /var/backups/restic-staging \
--include /var/www/html/wp-config.php
# verify before doing anything with it
ls -la /tmp/restore/var/backups/restic-staging/
head -1 /tmp/restore/var/backups/restic-staging/mysql-wordpress.sql # "-- MySQL dump 10.13 ..."
```
To recover a database, restore the dump then load it: `mysql <db> < mysql-<db>.sql`, `pg_restore -d <db> pg-<db>.dump`, or copy the SQLite file back. **Test restores periodically** — a backup you've never restored is a hope, not a backup. Restore the highest-stakes data (password manager, mail) first in any drill.
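A restore drill can go one step past `ls` and confirm each restored dump at least looks like the right kind of file. This sketch assumes the naming used in the script above (`*.sql`, `*.dump`, `*.db`); `verify_dump` is a hypothetical helper, and a passing check only proves the header is plausible, not that the dump loads.

```bash
# verify_dump FILE: cheap sanity check on a restored dump, keyed on extension.
#   *.sql  -> a "-- MySQL dump" or "-- MariaDB dump" banner in the first lines
#   *.dump -> pg_dump custom-format magic bytes "PGDMP"
#   *.db   -> SQLite header string "SQLite format 3"
verify_dump() {
  case "$1" in
    *.sql)  head -n 5 "$1" | grep -Eq '^-- (MySQL|MariaDB) dump' ;;
    *.dump) [ "$(head -c 5 "$1")" = "PGDMP" ] ;;
    *.db)   [ "$(head -c 15 "$1")" = "SQLite format 3" ] ;;
    *)      return 2 ;;
  esac
}
```

Looping it over `/tmp/restore/var/backups/restic-staging/*` gives a quick pass/fail per file; the real test is still loading the dump into a scratch database.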
## Adding a host
1. Add it to the `backups` inventory group.
2. Give it a `host_vars` scope — which DBs to dump and which paths to back up:
```yaml
restic_backup_oncalendar: "*-*-* 02:40:00" # stagger
restic_mysql_dbs: [castopod_db]
restic_paths: [/var/www/html/castopod]
restic_excludes: [/var/www/html/castopod/public/media] # already offsite
```
3. Run the playbook against that host. The role installs restic, deploys the script + units, `restic init`s the repo if absent, and enables the timers.
## Gotchas & Notes
- **`RESTIC_CACHE_DIR` is mandatory under systemd.** systemd services run with no `$HOME`, so restic can't find its cache and warns *"unable to locate cache directory: neither $XDG_CACHE_HOME nor $HOME are defined"* — and re-reads **every file** each run (no incremental). Point it at `/var/cache/restic` in the unit.
- **`sqlite3` may not be installed.** A host that runs a SQLite-backed app (e.g. a bot) often lacks the `sqlite3`/`sqlite` CLI. Install it where `restic_sqlite_paths` is set, or the `.backup` step fails.
- **Docker DB password env-var names vary.** Don't assume: the MariaDB image may use `MYSQL_ROOT_PASSWORD` (not `MARIADB_ROOT_PASSWORD`), and a Postgres container's superuser is whatever `POSTGRES_USER` is set to — reference `"$POSTGRES_USER"` rather than hardcoding `postgres`. Check with `docker exec <c> sh -c 'env | grep -oE "^(MYSQL|MARIADB|POSTGRES)_[A-Z_]*"'` (name only).
- **B2 key needs delete capability.** Otherwise `forget`/`prune` fail. Scope the key to the bucket; reach for per-host `namePrefix`-restricted keys for blast-radius isolation.
- **Exclude data that's already offsite.** Media already synced to object storage (S3/B2 via the app or `rclone`) should be `--exclude`d so you don't pay to store it twice.
- **First upload is slow, the rest are fast.** The initial snapshot reads and uploads everything; subsequent runs only ship changed blocks. For a large first run, fire it detached and watch from a transient unit that emails you on completion.
- **Keep secrets out of git.** The repo password and B2 key belong in an Ansible vault (committed encrypted), referenced into the role — never in plaintext vars.
- **Changing a host's backup paths starts a new snapshot group.** `restic forget` groups snapshots by `host`+`paths` by default, so adding or removing a path on an existing host creates a *separate* lineage: the old path-set and the new one each retain their own 7d/4w/6m snapshots, and `restic snapshots` shows both. Expected, not a bug — but it means the old-path snapshots age out on their own schedule rather than being superseded. To collapse everything into one retention bucket, run `forget` with `--group-by host` (be deliberate: it then treats *any* path-set on that host as the same group).
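
To see the separate lineages from the last gotcha, and to preview what collapsing them would remove, something like the following can help. It is a sketch that assumes the repo credentials are already loaded from `/etc/restic/restic-env`; the two function names are hypothetical, while `--group-by` and `--dry-run` are standard restic flags.

```bash
# show_snapshot_groups: list snapshots grouped the way `forget` groups them
# by default, so each path-set appears as its own block.
show_snapshot_groups() {
  restic snapshots --group-by host,paths
}

# preview_collapsed_retention: dry-run the retention policy with every
# path-set on the host treated as one group. --dry-run deletes nothing.
preview_collapsed_retention() {
  restic forget --dry-run --group-by host \
    --keep-daily 7 --keep-weekly 4 --keep-monthly 6
}
```

Reviewing the dry-run output before dropping `--dry-run` keeps the "be deliberate" advice enforceable.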
## See Also
- [rsync Backup Patterns](rsync-backup-patterns.md)
- [SnapRAID & MergerFS Storage Setup](../../01-linux/storage/snapraid-mergerfs-setup.md)
- [restic documentation](https://restic.readthedocs.io)

@ -5,7 +5,7 @@ category: plex
tags: [plex, ffmpeg, hevc, vaapi, amd, gpu, encode, storage, rx480]
status: published
created: 2026-05-15
updated: 2026-05-22
updated: 2026-06-05
---
# HEVC Batch Re-Encode for Plex Using VAAPI (AMD GPU)
@ -121,7 +121,7 @@ Each file logs:
### Space guard
The script aborts if free space on the Plex volume drops below 20GB (`MIN_FREE_GB`). Worst-case headroom needed is `source_size + tmp_size` simultaneously — on a 4GB source file that's ~8GB peak.
The script aborts if free space on the Plex volume drops below 10GB (`MIN_FREE_GB`). Worst-case headroom needed is `source_size + tmp_size` simultaneously — on a 4GB source file that's ~8GB peak. Note: the space check only runs at the **start** of each encode, not during — a large file can still consume significant disk mid-encode.
---
@ -278,3 +278,54 @@ local tmp="${dir}/${safe_stem}.hevc.tmp.${ext}"
After patching, delete the affected entries from `hevc_failed.txt` (or leave them — they'll be re-queued on the next run since they're not in `hevc_done.txt`) and restart the batch.
---
### Many files failing: output larger than source (streaming content)
**Symptom:** A large portion of the queue ends up in `hevc_failed.txt` with log lines like:
```
[2026-06-05 ...] Output: 4.7G savings=0 (output larger than source)
[2026-06-05 ...] WARN: output is larger than source — skipping swap, keeping original
```
**Cause:** These files are YouTube downloads or streaming archives (Giant Bomb, Twitch VODs, etc.) that were already encoded with an efficient H.264 encoder (typically YouTube's VP9-to-AVC pipeline or a broadcast H.264 encoder at a reasonable bitrate). VAAPI HEVC at QP 28 on a Polaris GPU (RX 480/580) runs on a hardware encoder with limited rate-control precision, so it cannot beat a well-tuned software H.264 encode on already-compressed talking-head/gaming content. The output reliably comes out 15–25% *larger* than the source.
The script handles this correctly: it detects output > source, deletes the tmp, keeps the original, and writes to `hevc_failed.txt`. The files are not corrupted. However, without the `already_failed()` guard, the script will re-attempt these files on every queue rebuild, wasting encode time and briefly consuming 4–8 GB of disk per failed attempt.
**Fix — add `already_failed()` skip logic:**
Patch `~/hevc_batch.sh` to skip files already in `hevc_failed.txt`:
```bash
# After the existing already_done() function, add:
already_failed() {
  [[ -f "$FAILED" ]] && grep -qF "$1" "$FAILED"
}
# In build_queue(), after the already_done "$f" && continue line:
already_failed "$f" && continue
# In the main loop, after the already_done "$file" check:
already_failed "$file" && { log "SKIP (already failed): $file"; continue; }
```
After patching, the batch will skip all 132+ known-bad files on the next pass and only attempt fresh queue entries.
**Tuning options to improve savings on dense content:**
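- Raise QP, not lower it: a higher value such as `--qp 30` or `--qp 32` quantizes more coarsely, which shrinks the output and improves the chance of beating source size. Trade-off: lower visual quality. Lowering QP to `--qp 24` or `--qp 22` does the opposite, raising quality and producing *larger* output.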
- Accept the failures: for streaming content archives, the source is already "good enough." Only files that are genuinely oversized H.264 (old stream captures at very high bitrate) will benefit from HEVC re-encode.
**Identifying which files are worth encoding:**
```bash
# Show source bitrate for all queued files — high-bitrate sources are candidates
while IFS= read -r f; do
  bitrate=$(ffprobe -v quiet -show_entries format=bit_rate -of csv=p=0 "$f" 2>/dev/null)
  echo "$bitrate $f"
done < ~/hevc_queue.txt | sort -rn | head -20
```
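Files above ~8,000 kbit/s are typically good encode candidates. Files at 3,000–5,000 kbit/s (typical YouTube/Twitch 1080p) will usually fail. Note that `ffprobe` prints `bit_rate` in bits per second, so 8,000 kbit/s shows up as roughly `8000000` in the listing.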
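To turn that rule of thumb into a filter, the threshold can be applied per file. A minimal sketch, assuming the bitrate arrives in bits per second as `ffprobe` reports it; `is_encode_candidate` and its 8,000 kbit/s default are illustrative, not part of `hevc_batch.sh`.

```bash
# is_encode_candidate BITRATE_BPS [MIN_KBPS]
# Succeeds when the source bitrate (bits/s) meets the threshold (kbit/s,
# default 8000). Empty or non-numeric values such as "N/A" are rejected.
is_encode_candidate() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -ge $(( ${2:-8000} * 1000 )) ]
}
```

Inside the loop above, `is_encode_candidate "$bitrate" && echo "$f"` would list only the files likely to shrink.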

@ -0,0 +1,103 @@
---
title: "Ansible reboot.yml: become Timeout on WSL2 Hosts (Exclude Them)"
domain: troubleshooting
category: ansible
tags: [ansible, wsl, wsl2, windows, reboot, become, privilege-escalation, openssh, inventory]
status: published
created: 2026-06-12
updated: 2026-06-12
---
# Ansible reboot.yml: become Timeout on WSL2 Hosts (Exclude Them)
## Problem
Running a reboot play across a Fedora fleet that includes a WSL2 "host" fails on the WSL2 box at privilege escalation — before the reboot command ever runs:
```console
$ ansible-playbook reboot.yml --limit fedora
TASK [Reboot the server] *******************************************************
changed: [majorhome]
changed: [majorlab]
changed: [majormail]
changed: [majordiscord]
[ERROR]: Task failed: Action failed: Timeout (62s) waiting for privilege
escalation prompt:
fatal: [majorrig-wsl]: FAILED! => {"changed": false,
"msg": "Timeout (62s) waiting for privilege escalation prompt:",
"reboot": false}
```
Every real server reboots fine. Only the WSL2 host fails, and `"reboot": false` confirms the shutdown command never executed.
## Cause
Two independent problems, either of which is enough to break a reboot play against WSL2:
1. **WSL2 has no real reboot semantics.** `ansible.builtin.reboot` issues a shutdown, then blocks up to `reboot_timeout` (e.g. 900s) waiting for SSH to come back. A WSL2 distro doesn't reboot — it just terminates, and nothing relaunches it automatically. The task would hang the full timeout and then fail.
2. **`become` times out over the Windows OpenSSH → WSL2 bridge.** When a WSL2 box is reached as `majorlinux@host` through Windows' built-in OpenSSH Server (which forwards into WSL via the default shell), Ansible's privilege-escalation handshake watches the SSH stream for the sudo prompt/success marker. Across the Windows-intercept pty, that marker detection stalls until the 62s `timeout` expires. This happens **even with passwordless sudo**: `NOPASSWD` is configured and correct; Ansible simply never sees the handshake complete.
The error surfaces as #2 (it fails at escalation first), but #1 is the deeper reason WSL2 doesn't belong in a reboot play at all.
## Solution
**Exclude the WSL group from the reboot play.** A WSL2 instance is a managed *workstation environment*, not a server — it belongs in package/update plays but not in server lifecycle operations like reboot.
Scope the play to exclude the `wsl` group so even a broad `--limit` skips it:
```yaml
# reboot.yml
- name: Reboot servers
  hosts: all:!wsl   # was: hosts: all
  become: true
  tasks:
    - name: Reboot the server
      ansible.builtin.reboot:
        msg: "Reboot initiated by Ansible"
        reboot_timeout: 900
```
This assumes your WSL2 hosts are in a dedicated inventory group:
```yaml
wsl:
  hosts:
    majorrig-wsl:
      ansible_host: 100.98.47.29
```
Verify the targeting before running — the WSL host should be gone:
```console
$ ansible-playbook reboot.yml --limit fedora --list-hosts
play #1 (all:!wsl): Reboot servers
hosts (4):
majorhome
majorlab
majordiscord
majormail
```
### Rebooting the WSL2 instance itself
When you genuinely need to "reboot" WSL2, do it from the Windows side — not Ansible:
```powershell
wsl --shutdown
```
The distro relaunches on next access (next SSH login or `wsl` invocation). WSL2 stays in `update.yml` (dnf upgrades) and other package plays; it's only excluded from reboot and other server-specific roles.
## Why not just fix the become timeout?
You *could* raise `timeout` or tweak the become flow, but it doesn't address problem #1 — even a successful escalation would leave the reboot task hanging the full `reboot_timeout` because WSL2 never comes back the way the module expects. Excluding WSL from server lifecycle plays is the correct fix, not a workaround.
## Related
- [Ansible: ansible.cfg Ignored on WSL2 Windows Mounts](ansible-wsl2-world-writable-mount-ignores-cfg.md)
- [Windows OpenSSH: WSL Default Shell Breaks Remote Commands](networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md)
- [Ansible: SSH Timeout During dnf upgrade on Fedora Hosts](ansible-ssh-timeout-dnf-upgrade.md)

@ -0,0 +1,73 @@
---
title: "Claude Code Keychain Prompt Keeps Reappearing on macOS (ACL Invalidation)"
domain: troubleshooting
category: claude-code
tags: [claude-code, authentication, oauth, keychain, macos, acl, security]
status: published
created: 2026-06-15
updated: 2026-06-15
---
# Claude Code Keychain Prompt Keeps Reappearing on macOS (ACL Invalidation)
## Symptom
A macOS dialog repeatedly pops up:
> **security wants to access key "Claude Code-credentials" in your keychain.**
> To allow this, enter the "login" keychain password. — `[Always Allow] [Deny] [Allow]`
The tell-tale sign: it **comes back even after clicking "Always Allow"** — the usual "trust forever" button doesn't make it stop. Login still works; it's the *permission prompt* that won't quiet down. This is **distinct** from [Claude Code won't log in](claude-code-warp-login-corrupt-keychain-credential.md), where the stored credential is corrupt and login itself fails.
## Cause
Claude Code stores its OAuth token in the macOS **login keychain** as `Claude Code-credentials`, read via `/usr/bin/security`. macOS binds an "Always Allow" grant (the keychain item's ACL) to the **code-signing identity** of the requesting binary. That grant is silently invalidated when:
- **Claude Code updates** — the new binary's signature no longer matches the saved ACL. This is the most common trigger (see claude-code issues #48162, #9403).
- **The credential item is recreated on token refresh** — wipes the ACL.
- **Post-reboot keychain churn** — right after boot, the just-unlocked login keychain plus a concurrent token refresh can race ahead of the ACL settling, producing a *burst* of prompts that stops once a clean refresh completes.
It is **not** a lock-timeout issue if `security show-keychain-info` reports `no-timeout` (below).
## Triage (non-destructive — these do not trigger a prompt)
```bash
# Confirm the item exists (metadata only; no secret read)
security find-generic-password -l "Claude Code-credentials" | grep -E "svce|acct"
# Confirm the login keychain isn't auto-locking
security show-keychain-info ~/Library/Keychains/login.keychain-db
# -> "no-timeout" means it won't relock; so recurring prompts = ACL invalidation, not locking
```
## Fixes
### One-off burst (e.g. right after a reboot)
Click **Always Allow** (not Allow) once a clean token refresh has completed. With a `no-timeout` keychain the grant then holds, and the post-boot prompt storm usually self-clears within a minute. *Observed exactly this on MajorAir 2026-06-15 — a reboot triggered a burst that stopped on its own.*
### Keeps returning after updates (durable) — reset the credential
Deleting and re-creating the item rebinds a fresh ACL to the current binary. Costs one re-login.
```bash
security delete-generic-password -s "Claude Code-credentials"
# then re-authenticate inside Claude Code: /login (or relaunch `claude`)
```
### Bypass the keychain entirely (workaround)
Claude Code falls back to `~/.claude/.credentials.json` in non-GUI contexts (SSH, tmux). On a local Mac this can be repurposed to stop keychain prompts for good:
```bash
# pipe straight to the file — never echo the token into a shared terminal
security find-generic-password -s "Claude Code-credentials" -w > ~/.claude/.credentials.json
chmod 600 ~/.claude/.credentials.json
security delete-generic-password -s "Claude Code-credentials"
```
**Caveats:**
- Token is then **plaintext at rest** (mode 600) instead of encrypted in the keychain.
- A future Claude Code update may rewrite the keychain item.
- GUI-session behaviour for the file fallback is **less documented** than the SSH/tmux case — **verify it holds for your setup before relying on it.**
- Do **not** substitute `CLAUDE_CODE_OAUTH_TOKEN` — it is known to delete credentials on exit (issue #37512).
## Notes
- Same keychain item as the corrupt-credential login failure; if login itself breaks, see the related article.
- Always redirect `-w` output straight to a file — never into a terminal whose scrollback feeds shared context.
## Related
- [Claude Code Won't Log In (Warp & iTerm2) — Corrupt Keychain Credential](claude-code-warp-login-corrupt-keychain-credential.md)
- Config: `~/.claude.json`, login keychain item `Claude Code-credentials`
- First observed: MajorAir, 2026-06-15 (post-reboot prompt burst; self-cleared)

@ -61,5 +61,6 @@ Resolved on step 1+2 — login succeeded after deleting the corrupt Keychain ite
If that errors with "Expecting value", the stored secret is empty/corrupt — delete and re-login.
## Related
- [Claude Code Keychain Prompt Keeps Reappearing on macOS (ACL Invalidation)](claude-code-keychain-prompt-recurring-macos.md) — different symptom: login works but the permission prompt won't stop
- Config: `~/.claude.json` (oauthAccount, userID), login Keychain item `Claude Code-credentials`
- Other Claude Code note: `claude-mem-setting-sources-empty-arg.md`


@ -0,0 +1,105 @@
---
title: "Forgejo: Account Recovery & CLI Admin When Locked Out of the GUI"
domain: troubleshooting
category: general
tags: [forgejo, gitea, smtp, docker, account-recovery, self-hosting]
status: published
created: 2026-06-12
updated: 2026-06-12
---
# Forgejo: Account Recovery & CLI Admin When Locked Out of the GUI
Two related problems on a single-admin self-hosted **Forgejo** (or Gitea): the GUI *"Forgot password"* is disabled, and you can't log in to fix it. Here's how to (1) enable account recovery properly, and (2) recover from the command line when you're already locked out.
## Symptoms
- The *Forgot password* page shows: **"Account recovery is only available when email is set up. Please set up email to enable account recovery."**
- You can't log in (wrong/forgotten password), so you can't add an SSH key or change settings in the GUI either.
## Part 1 — Enable account recovery (configure the mailer)
Account recovery needs SMTP. If you already run a mail server on your tailnet, relay through it — **no app password needed** when the Forgejo host is `mynetworks`-trusted by that mail server.
Edit `app.ini` (in the data volume, e.g. `/data/gitea/conf/app.ini`):
```ini
[mailer]
ENABLED = true
PROTOCOL = smtp+starttls
SMTP_ADDR = 100.x.y.z ; mail server's tailnet IP
SMTP_PORT = 587
FROM = forgejo@example.com
FORCE_TRUST_SERVER_CERT = true ; required when connecting by IP (cert CN won't match)
```
Notes:
- `FORCE_TRUST_SERVER_CERT = true` is needed when you target the relay by **IP** — the TLS cert is issued for a hostname, not the IP, so verification would otherwise fail. Acceptable on a trusted internal hop.
- Omit `USER`/`PASSWD` if the relay accepts your host via `mynetworks` (no SASL). Otherwise add SMTP auth.
- `app.ini` lives in the persistent volume, so the change **survives container re-creation** (e.g. Watchtower's nightly pull).
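The notes above can be turned into a quick offline lint of the `[mailer]` section. This is a sketch only: it assumes `app.ini` is close enough to INI syntax for Python's `configparser`, that inline comments start with ` ;` as in the snippet above, and that an all-digits-and-dots `SMTP_ADDR` means "connecting by IP". The function names `read_mailer` and `mailer_warnings` are hypothetical.

```python
import configparser


def read_mailer(app_ini_text):
    """Parse the [mailer] section of an app.ini-style string into a dict."""
    parser = configparser.ConfigParser(
        interpolation=None,              # values may contain a literal '%'
        inline_comment_prefixes=(";",),  # the snippet above uses '; comment'
        strict=False,                    # tolerate repeated keys or sections
    )
    parser.optionxform = str             # keep key names upper-case
    # app.ini may begin with keys outside any section; give them a dummy header.
    parser.read_string("[__root__]\n" + app_ini_text)
    return dict(parser["mailer"]) if parser.has_section("mailer") else {}


def mailer_warnings(mailer):
    """Flag settings that contradict the relay setup described above."""
    warnings = []
    if mailer.get("ENABLED", "").lower() != "true":
        warnings.append("ENABLED is not true")
    if not mailer.get("FROM"):
        warnings.append("FROM is missing")
    addr = mailer.get("SMTP_ADDR", "")
    by_ip = addr.replace(".", "").isdigit()  # rough IPv4 test, no IPv6
    if by_ip and mailer.get("FORCE_TRUST_SERVER_CERT", "").lower() != "true":
        warnings.append("SMTP_ADDR is an IP but FORCE_TRUST_SERVER_CERT is not true")
    return warnings
```

An empty list from `mailer_warnings(read_mailer(open(path).read()))` only means the file matches this runbook's expectations; the log check and the SMTP test remain the real proof.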
Apply and verify:
```bash
docker restart forgejo
docker logs forgejo 2>&1 | grep -i "Mail Service Enabled" # confirms the mailer loaded
```
Test the SMTP path **before** trusting it (run from the host, mimicking Forgejo's connection):
```bash
python3 - <<'EOF'
import smtplib, ssl
ctx = ssl.create_default_context(); ctx.check_hostname = False; ctx.verify_mode = ssl.CERT_NONE
s = smtplib.SMTP("100.x.y.z", 587, timeout=15)
s.ehlo(); s.starttls(context=ctx); s.ehlo()
s.sendmail("forgejo@example.com", ["you@example.com"],
"Subject: test\r\n\r\nForgejo relay path test")
s.quit(); print("SENT_OK")
EOF
```
`SENT_OK` means the relay accepted the message. `/user/forgot_password` should now show the reset form instead of the email error.
> **Container can't reach the tailnet IP?** Docker bridge networks usually route to Tailscale via the host (SNAT to the host's tailnet IP). Confirm with:
> `docker exec forgejo nc -w5 100.x.y.z 587 </dev/null && echo REACHABLE`
## Part 2 — Recover from the CLI (already locked out)
Forgejo's admin CLI runs inside the container as the git user (UID 1000) and needs no login.
**Reset a password:**
```bash
docker exec -u 1000 forgejo forgejo admin user change-password -u <user> -p '<newpass>'
```
> ⚠️ **Gotcha:** `change-password` sets `must_change_password=true` by default. That **forces a change on next GUI login _and_ returns HTTP 403 on the API** (`"You must change your password"`). Clear it:
> ```bash
> docker exec -u 1000 forgejo forgejo admin user must-change-password --unset <user>
> ```
**Add an SSH key without the GUI** (basic-auth API — works only if 2FA is off):
```bash
curl -u <user>:'<pass>' -X POST -H 'Content-Type: application/json' \
-d '{"title":"laptop","key":"ssh-ed25519 AAAA... you@host"}' \
http://localhost:3004/api/v1/user/keys
# HTTP 201 = created
```
Forgejo regenerates the git user's `authorized_keys` from the database, so `ssh -p <port> git@host` authenticates immediately afterward — no restart needed.
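Where `curl` is unavailable, the same request can be assembled with the Python standard library. The sketch below only builds the request and leaves sending to the caller; the endpoint path and JSON fields are the ones from the `curl` call above, while the function name `build_add_key_request` and its parameters are hypothetical.

```python
import base64
import json
import urllib.request


def build_add_key_request(base_url, user, password, title, public_key):
    """Build (but do not send) the POST that registers an SSH public key."""
    body = json.dumps({"title": title, "key": public_key}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/user/keys",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",  # basic auth, so 2FA must be off
        },
    )
```

Passing the result to `urllib.request.urlopen(...)` sends it; a response status of 201 corresponds to the "created" case noted above.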
## "The password keeps changing" — it (probably) isn't
If a self-hosted Forgejo admin password *seems* to reset itself, note that a stock Forgejo container does **not** reset admin passwords. Rule out the server first:
- the compose has **no** admin/password env and no custom entrypoint;
- **no** cron, systemd timer, or script runs `forgejo admin user change-password`;
- the data volume is persistent (re-creation keeps the DB, password included).
If all three hold, nothing server-side is changing it — the "changing" password is a **client-side** artifact: a duplicate or stale entry in your password manager autofilling different values. Delete the duplicates and keep one.
## See also
- Forgejo — [Config Cheat Sheet → mailer](https://forgejo.org/docs/latest/admin/config-cheat-sheet/)


@ -11,6 +11,7 @@ Practical fixes for common Linux, networking, and application problems.
- [LoRA adapter — GGUF conversion fails with 'config.json not found'](gpu-display/lora-adapter-gguf-conversion-fails.md)
## 🌐 Networking & Web
- [Wi-Fi Game Streaming Stutter: 160 MHz Channel Width Saturating the 5 GHz Radio](networking/wifi-160mhz-airtime-saturation-game-streaming.md)
- [Apache Outage: Fail2ban Self-Ban + Missing iptables Rules](networking/fail2ban-self-ban-apache-outage.md)
- [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](networking/fail2ban-imap-self-ban-mail-client.md)
- [firewalld: Mail Ports Wiped After Reload](networking/firewalld-mail-ports-reset.md)
@ -18,6 +19,7 @@ Practical fixes for common Linux, networking, and application problems.
- [Postfix header_checks Can't Act on Milter-Added Headers (Use Sieve)](networking/postfix-header-checks-vs-milter-headers.md)
- [Dovecot Phantom Mailboxes from .dovecot.lda-dupes (mail_home Overlapping the Maildir Root)](networking/dovecot-mail-home-maildir-root-phantom-mailboxes.md)
- [Tailscale SSH: Unexpected Re-Authentication Prompt](networking/tailscale-ssh-reauth-prompt.md)
- [SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure (No `Host` Block)](networking/ssh-missing-host-block-magicdns-host-key-failure.md)
- [iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators](networking/tailscale-status-json-hostname-localhost-ios.md)
- [rsync over Tailscale: Hung in TCP Teardown After Transfer Completes](networking/rsync-tailscale-teardown-stall.md)
- [Windows OpenSSH: WSL Default Shell Breaks Remote Commands](networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md)
@ -31,6 +33,7 @@ Practical fixes for common Linux, networking, and application problems.
- [Vault Password File Missing](ansible-vault-password-file-missing.md)
- [ansible.cfg Ignored on WSL2 Windows Mounts](ansible-wsl2-world-writable-mount-ignores-cfg.md)
- [regex_search — capture-group argument doesn't work in set_fact](ansible-regex-search-set-fact-capture-group.md)
- [reboot.yml: become Timeout on WSL2 Hosts (Exclude Them)](ansible-reboot-become-timeout-wsl2.md)
## 📦 Docker & Systems
- [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](docker-caddy-selinux-post-reboot-recovery.md)
@ -49,9 +52,12 @@ Practical fixes for common Linux, networking, and application problems.
## 📝 Application Specific
- [Obsidian Vault Recovery — Loading Cache Hang](obsidian-cache-hang-recovery.md)
- [Gemini CLI Manual Update](gemini-cli-manual-update.md)
- [iPhone Mirroring Hangs on 'Connecting…' — AWDL Data Stall (27.0 Beta)](iphone-mirroring-connecting-hang-awdl-stall-beta.md)
## 🤖 AI / Local LLM
- [Ollama Drops Off Tailscale When Mac Sleeps](ollama-macos-sleep-tailscale-disconnect.md)
- [Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt](ollama-chat-template-pipe-stdin-bypass.md)
- [Windows OpenSSH Server (sshd) Stops After Reboot](networking/windows-sshd-stops-after-reboot.md)
- [claude-mem Silently Fails with Claude Code 2.1+ (Empty `--setting-sources`)](claude-mem-setting-sources-empty-arg.md)
- [Claude Code Won't Log In (Warp & iTerm2) — Corrupt Keychain Credential](claude-code-warp-login-corrupt-keychain-credential.md)
- [Claude Code Keychain Prompt Keeps Reappearing on macOS (ACL Invalidation)](claude-code-keychain-prompt-recurring-macos.md)


@ -2,14 +2,61 @@
title: "iPhone Mirroring Hangs on 'Connecting…' — AWDL Data Stall (27.0 Beta)"
domain: troubleshooting
category: macos
tags: [macos, iphone-mirroring, continuity, awdl, rapport, quic, tailscale, mullvad, beta]
tags: [macos, iphone-mirroring, continuity, awdl, rapport, quic, tailscale, mullvad, beta, channel-validation, aimesh, quicktime, usb]
status: published
created: 2026-06-09
updated: 2026-06-09
updated: 2026-06-15
---
# iPhone Mirroring Hangs on 'Connecting…' — AWDL Data Stall (27.0 Beta)
## Update 2026-06-15 — REGRESSED; reproducibly stuck on "Connecting", and Tailscale was **not** the cure
> **Correction to the 2026-06-14 "it WORKS" update below.** On 2026-06-15 iPhone Mirroring is **reproducibly stuck on "Connecting to iPhone 16 Pro"** on MajorAir again — with Tailscale `accept-routes` *still* `false`. So the accept-routes change was **correlation, not the fix**: this is an **intermittent macOS 27.0 beta AWDL bug, independent of Tailscale**.
>
> **Tried this round — all failed to establish a session:** Tailscale `accept-routes=false` (already in place) · `sudo ifconfig awdl0 down/up` · **full Mac reboot** · cycling the iPhone's Wi-Fi + Bluetooth.
>
> **Log signature:** `rapportd` resolves the phone's `_asquic._udp.local` endpoint and `_companion-link` registers (discovery *succeeds*), but the QUIC-over-AWDL **datapath never completes into a live session**: `wifip2pd` loops on `AWDLDiscoveryTimeout (hasAdvertises=false)`. Each reset advanced the handshake one stage further (no-advertises → resolve-started → endpoint-resolved) yet none reached a streaming session. **`llw0` never went active (0 bytes)** — confirming no A/V ever flowed, regardless of what the 06-14 note measured.
>
> **Stance:** beta OS bug, **no reliable user-side fix**. Use the **QuickTime USB mirror** workaround (below) when you actually need the phone on screen. The 06-14 "it works on `llw0`" measurements were real *for that one session* but are **not reproducible** across seeds/sessions — treat mirroring as intermittently broken on the 27.0 betas. This reconfirms the original **Root cause (conclusion)** section further down (a beta bug, "nothing in local config wrong"), which the 06-14 update had prematurely overridden.
## Update 2026-06-14 (evening) — it WORKS; the "AWDL starvation" finding was the wrong interface
> iPhone Mirroring is now **working** on MajorAir — stable session, clean video, no missing icons — on **ch44/80** with Tailscale `accept-routes=false`. An earlier pass the same day blamed an "AWDL bulk-path starving at ~90 B/s"; that was **measuring the wrong interface** and is corrected here.
**The video transport is `llw0` (low-latency WLAN), not `awdl0`.**
Measured during an active session: **`llw0` ≈ 800 KB/s** (≈6 Mbps of real video), `en0` ~60 KB/s, **`awdl0` ~1 KB/s**. `awdl0` only ever carries AWDL *discovery/control* (~90 B/s) — whether mirroring works or not. So "90 B/s on `awdl0` = starved bulk path" was a **red herring**: the A/V stream rides `llw0`, which the earlier pass never measured.
**What was actually broken was session *stability*.** The `XPC_ERROR_CONNECTION_INTERRUPTED` / `MediaContinuityKit.TaskTimeoutError` teardown loop kept the `llw0` stream from ever sustaining (→ glitchy / missing icons). When the session holds, `llw0` streams clean.
**What changed (not cleanly isolated):** three things differed between the broken and working states — (1) the network fully **settled on ch44** over ~15 h (the failing ch44 test was minutes after a chaotic AiMesh resync + reconnect scramble), (2) Tailscale **`accept-routes` was turned off** (it had been polluting IPv4 routing + the Continuity control plane), and (3) both devices slept/woke. Which one mattered is not yet proven.
**Open test — isolates Tailscale's role:** repro on **MajorMac** with *unaltered* Tailscale (`accept-routes` still **ON**). If mirroring breaks there but works on MajorAir (accept-routes OFF), that pins Tailscale's accepted routes as the trigger. See [[MajorAir#Known Issues]] for the `accept-routes=false` fix.
**Still valid from earlier today:** congestion ruled out (router `chanim_stats` ch36 = 90 % idle, 86 % txop); the AiMesh / router infra notes below; and iPhone Mirroring is **wireless-only — no USB transport** (for a wired screen view, use QuickTime, below).
> ⚠️ The iPhone-radio `isValidChannel`/`awdl0` evidence cited in the original 2026-06-09 write-up below describes AWDL *discovery* health, **not** the video path — read it in light of this correction.
**Wired workaround (works today, no AWDL):**
iPhone Mirroring is **wireless-only — there is no USB transport** (confirmed: cable connected throughout, every attempt still used `awdl0`). For a wired view of the screen:
> **QuickTime Player → File → New Movie Recording → ⌄ next to record → select the iPhone** = full-rate USB-C screen mirror (view + record). Does **not** give remote control (tap/type) — that's unique to iPhone Mirroring.
**Infra notes (RT-AX82U, AiMesh controller):**
- Router SSH is on **port 1025** (not 22); creds in Ansible vault (`router_username` / `router_password`).
- The 5 GHz channel is **AiMesh-coordinated** and **resists CLI changes**: `wl chanspec` / nvram `wl1_chanspec` get reasserted by `acsd2` + AiMesh within seconds, even after `restart_wireless`. Only setting Control Channel to an **explicit value in the Web UI** holds mesh-wide. Left "Auto" → acsd2 picks **36** (the cleanest channel).
- Any channel change triggers a **mesh resync (~1 min) that drops all Wi-Fi**; during it MajorAir falls back to the iPhone's **USB Personal Hotspot** (`en7` / `172.20.10.x`) and won't auto-rejoin home Wi-Fi while the hotspot feeds it internet (manual Wi-Fi-menu join needed).
- **Current state: 5 GHz on ch44/80** (same clean UNII-1 spectrum as 36; left here to avoid another resync — the Deck streams identically on 44).
**If it breaks again — troubleshooting checklist:**
1. **It's session stability, not bandwidth.** Look for teardown loops: `log show --last 3m --predicate 'process == "iPhone Mirroring"' | grep -iE "interrupt|timeout|endpoint"`.
2. **Measure the right interface** — video rides **`llw0`** (hundreds of KB/s when the screen is active), *not* `awdl0` (~90 B/s control is normal): `netstat -ib | awk '/<Link#/{print $1, $7}'` before/after a few seconds.
3. **Tailscale:** confirm `accept-routes=false` on the Mac (`tailscale debug prefs | grep RouteAll`) — see [[MajorAir#Known Issues]].
4. **Let the network settle** after any Wi-Fi/channel change — an AiMesh resync churns AWDL/Continuity state for a minute+; retry once stable.
5. iPhone: on home Wi-Fi, near the Mac, **Personal Hotspot off**, not in Low Power Mode.
6. **Wired fallback that always works:** QuickTime → New Movie Recording → select the iPhone (USB-C; view/record only, no control).
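Step 2 of the checklist (compare byte counters before and after a few seconds) can be sketched as a small parser over two `netstat -ib` snapshots. It assumes, as the `awk` one-liner in step 2 does, that the seventh whitespace-separated field of a `<Link#N>` row is the input byte counter; rows that lack an address column shift the fields, so treat the numbers for such interfaces with suspicion. The function names are hypothetical.

```python
def link_bytes(netstat_ib_text, column=7):
    """Map interface name -> counter taken from `netstat -ib` <Link#N> rows."""
    counters = {}
    for line in netstat_ib_text.splitlines():
        fields = line.split()
        if len(fields) >= column and any(f.startswith("<Link#") for f in fields):
            try:
                counters[fields[0]] = int(fields[column - 1])
            except ValueError:
                continue  # row without a numeric value in that column
    return counters


def rates(before_text, after_text, seconds):
    """Bytes per second per interface between two snapshots."""
    before = link_bytes(before_text)
    after = link_bytes(after_text)
    return {
        name: (after[name] - before[name]) / seconds
        for name in after
        if name in before
    }
```

One way to use it: capture `subprocess.run(["netstat", "-ib"], capture_output=True, text=True).stdout`, sleep a few seconds with the phone screen active, capture again, and compare the `llw0` and `awdl0` entries of `rates(...)`.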
---
## Symptom
iPhone Mirroring on the Mac sits on **"Connecting…"** forever and never shows the iPhone screen.
- Mac: **macOS 27.0 dev beta** (build 26A5353q), MajorAir


@ -0,0 +1,150 @@
---
title: "Logwatch Reports the Wrong Hostname (`<host>-hetzner`) After a Migration"
domain: troubleshooting
category: monitoring
tags: [logwatch, hostname, hetzner, migration, monitoring, provisioning, fail2ban]
status: published
created: 2026-06-12
updated: 2026-06-14
---
# Logwatch Reports the Wrong Hostname (`<host>-hetzner`) After a Migration
## Symptom
Daily Logwatch emails from a recently migrated server arrive titled with the
provisioning label instead of the real hostname:
```
Logwatch for tttpod-hetzner (Linux)
Logwatch for dcaprod-hetzner (Linux)
```
Everything else works — the report is generated, mailed, and delivered. Only the
**name in the title is wrong**, which makes reports harder to scan and breaks any
filter or rule that keys on the expected hostname.
## Cause
Logwatch titles each report with the box's **live system hostname**
(`hostnamectl --static` / `/etc/hostname`) read at runtime — it does *not* keep
its own copy of the name.
Hetzner Cloud servers are provisioned with a temporary node label as the system
hostname — `<host>-hetzner` (e.g. `tttpod-hetzner`). The migration runbook renames
the **Tailscale node** back to the bare name and sets Postfix `myhostname`, but the
**OS hostname** itself is easy to miss because nothing surfaces it day to day. It
stays `<host>-hetzner`, unnoticed, until something reports the live hostname;
Logwatch is usually the first thing to do so, weeks later.
Confirm the box is actually mislabelled:
```bash
ssh root@<host> 'hostnamectl --static; cat /etc/hostname; grep 127.0.1.1 /etc/hosts'
# static: tttpod-hetzner
# /etc/hostname: tttpod-hetzner
# 127.0.1.1 tttpod-hetzner tttpod-hetzner
```
## Fix
Set the real hostname and fix the matching `/etc/hosts` loopback line:
```bash
ssh root@<host> '
hostnamectl set-hostname <host>
sed -i "s/^127\.0\.1\.1.*/127.0.1.1 <host> <host>/" /etc/hosts
hostnamectl --static # verify -> <host>
'
```
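The loopback rewrite can also be expressed as a pure function, which makes it easy to dry-run against a copy of `/etc/hosts` before touching the real file. This is a sketch, not part of the runbook's tooling; the name `fix_hosts_loopback` is hypothetical. The pattern is anchored and its dots are escaped so that only the exact `127.0.1.1` entry is rewritten.

```python
import re


def fix_hosts_loopback(hosts_text, host):
    """Rewrite the 127.0.1.1 line of an /etc/hosts string to name `host`."""
    # [ \t] after the address keeps entries such as 127.0.1.10 untouched.
    pattern = re.compile(r"^127\.0\.1\.1[ \t].*$", flags=re.MULTILINE)
    # A callable replacement avoids backslash handling in the host name.
    return pattern.sub(lambda _match: f"127.0.1.1 {host} {host}", hosts_text)
```

If the file has no `127.0.1.1` line the text comes back unchanged, so comparing input and output shows whether anything would be rewritten.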
That's it. **Logwatch has no hardcoded hostname override** — verify with:
```bash
grep -ri hostname /etc/logwatch/ /etc/cron.daily/0logwatch /etc/cron.daily/logwatch 2>/dev/null
cat /etc/mailname 2>/dev/null
```
If those are empty (the normal case), Logwatch reads the live hostname on its next
run, so the **next daily report self-corrects** — no service restart, no logwatch
config change needed.
> [!note] If `grep` *does* find a hostname pinned in `/etc/logwatch/conf/logwatch.conf`
> (e.g. a `HostLimit`/`MailFrom` line baked in by Ansible), update it there too —
> the override file wins over the live hostname.
## Sweep the whole fleet
This is a per-box provisioning leftover, so check every migrated host at once —
more than one is usually affected:
```bash
for ip in 100.98.223.93 100.95.137.38 100.64.169.62 100.112.127.0 100.73.85.46; do
echo -n "$ip -> "
ssh -o ConnectTimeout=8 -o BatchMode=yes root@$ip 'hostnamectl --static' 2>/dev/null \
|| echo '(unreachable)'
done
```
Any value ending in `-hetzner` (or your provider's build label) needs the fix above.
In the 2026-06 sweep, `tttpod` and `dcaprod` were still `*-hetzner` at the OS
level; `majortoot`, `majormail`, and `majorlinux` had the correct system hostname
— but see the variant below: `majormail`'s *configs* were still stale even though
its hostname wasn't.
## Variant: hostname is correct, but a config has the old name baked in
A second, sneakier form of this drift: the **system hostname is already right**, so
the sweep above passes and the Logwatch report *title* is correct — yet mail still
arrives **from** `<host>-hetzner` because the old label is hardcoded in a service's
`From`/`sender` field. These fields are static text, not derived from the live
hostname, so fixing `hostnamectl` does nothing for them.
Seen on `majormail` (2026-06-14): system hostname was `majormail`, but
`Logwatch@majormail-hetzner...` was still the sender. Two configs held it:
```bash
# sweep a box for the old provisioning label in any send-related config
ssh root@<host> 'grep -rsn "<host>-hetzner" /etc/logwatch/ /etc/fail2ban/ \
/etc/postfix/ /etc/aliases /etc/mailname 2>/dev/null'
# /etc/logwatch/conf/logwatch.conf:MailFrom = Logwatch@<host>-hetzner.majorshouse.com
# /etc/fail2ban/jail.local:sender = fail2ban@<host>-hetzner.majorshouse.com
```
Fix in place (no restart needed for Logwatch; reload fail2ban for its change):
```bash
ssh root@<host> '
sed -i "s/<host>-hetzner/<host>/g" /etc/logwatch/conf/logwatch.conf /etc/fail2ban/jail.local
systemctl reload fail2ban
'
```
> [!warning] Check the Ansible source, or it comes back
> A live `sed` is undone by the next playbook run if the repo still carries the old
> value. Distinguish two cases:
> - **Templated** (safe): e.g. `logwatch.yml` sets `MailFrom = Logwatch@{{ inventory_hostname }}...`. If the inventory host is named correctly, a run *regenerates* the right value — it even self-heals a stale box.
> - **Static file** (will regress): e.g. `roles/fail2ban/files/hosts/<host>/jail.local` with the literal `sender = ...@<host>-hetzner...`. Grep the repo (`grep -rn "<host>-hetzner" .`) and fix the file too, or every deploy re-pushes the stale sender.
Inert backups (`jail.local.bak*`, `*~`) may still contain the old string — they
don't send mail, so leave them.
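The same sweep can be sketched in Python when the result needs to feed a report rather than a terminal. It walks the given paths, skips the inert backup patterns named above, and yields each live line that still carries the old label. The function name `find_stale_label` and the constant `BACKUP_PATTERNS` are hypothetical; the paths to scan are whatever the `grep` sweep above already uses.

```python
import fnmatch
import os

BACKUP_PATTERNS = ("*.bak*", "*~")  # inert copies that never send mail


def find_stale_label(roots, label):
    """Yield (path, line_number, line) for live config lines containing label."""
    for root in roots:
        if os.path.isfile(root):
            paths = [root]
        else:
            paths = [
                os.path.join(dirpath, name)
                for dirpath, _dirs, names in os.walk(root)
                for name in names
            ]
        for path in sorted(paths):
            base = os.path.basename(path)
            if any(fnmatch.fnmatch(base, pat) for pat in BACKUP_PATTERNS):
                continue
            try:
                with open(path, encoding="utf-8", errors="replace") as fh:
                    for number, line in enumerate(fh, start=1):
                        if label in line:
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # unreadable or vanished file, as grep -s would skip
```

Run it on the box itself, or against the Ansible repo checkout, to cover both the live files and the static sources that would re-push the stale value.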
## Prevention
Fold "set the system hostname" into the migration bootstrap so it never drifts:
```bash
hostnamectl set-hostname <host>
sed -i "s/^127\.0\.1\.1.*/127.0.1.1 <host> <host>/" /etc/hosts
```
Do this in the **same step** that renames the Tailscale node and sets Postfix
`myhostname` — all three read from the provisioning label and all three must be
corrected together. See the
[VPS Migration Baseline Checklist](../02-selfhosting/cloud/vps-migration-baseline-checklist.md).
## Related
- [Logwatch Fleet Setup — Surviving Package Upgrades](../02-selfhosting/monitoring/logwatch-fleet-setup.md) — the broader "logwatch went silent / wrong-source" class, including the Packer `myhostname` variant of this same drift
- [VPS Migration Baseline Checklist](../02-selfhosting/cloud/vps-migration-baseline-checklist.md) — the full post-migration verification list
- [Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration](networking/ansible-host-key-verification-failed-rebuilt-host.md) — another IP/identity-drift gotcha from the same Hetzner migration


@ -0,0 +1,154 @@
---
title: "Auditing & Cleaning macOS Background App Activity (sfltool dumpbtm)"
domain: troubleshooting
category: general
tags: [macos, background-tasks, btm, sfltool, login-items, system-extensions, uninstall, little-snitch]
status: published
created: 2026-06-21
updated: 2026-06-21
---
# Auditing & Cleaning macOS Background App Activity (`sfltool dumpbtm`)
## Overview
macOS tracks every login item, agent, daemon, helper, and extension that may run in the background in its **Background Task Management (BTM)** database. The GUI shows this under **System Settings → General → Login Items & Extensions** ("Allow in the Background"), but the GUI is summarised and hides paths, identifiers, and orphans.
`sfltool dumpbtm` prints the full BTM database from the command line — and the per-user records need **no `sudo`**. This is the fastest way to answer "what is allowed to run in the background, and does each entry still map to an installed app?"
## List what's registered
```bash
sfltool dumpbtm # per-user records, no sudo required
```
Each record looks like:
```
Name: CleanMyMac Menu
Type: login item (0x4)
Disposition: [enabled, allowed, notified] (0xb)
Identifier: 4.com.macpaw.CleanMyMac-mas.Menu
URL: Contents/Library/LoginItems/CleanMyMac_5_MAS_Menu.app
Bundle Identifier: com.macpaw.CleanMyMac-mas.Menu
Parent Identifier: 2.com.macpaw.CleanMyMac-mas
```
### Reading the fields
- **Disposition**: `enabled` = actively allowed to run in the background. `disabled` = present but off.
- **Type** — what kind of item it is:
| Type | Meaning |
|---|---|
| `app (0x2)` | A normal application entry |
| `login item (0x4)` | Launches at login (menu-bar apps, helpers) |
| `agent (0x8)` / `legacy agent` | Per-user background agent |
| `legacy daemon (0x10010)` | System-wide background daemon |
| `background tasks (0x2000)` | Abstract background-task registration owned by a parent app — **has no file path of its own** |
| `developer (0x20)` | A per-developer grouping header (the collapsible row in Settings), **not an app** |
| `quicklook` / `spotlight` / `dock tile` | Plugins/extensions — not really "background apps" |
## Map entries to installed apps (find orphans)
Two gotchas make naïve path-checking fail:
1. **Absolute paths are stored as `file://` URLs**, not plain `/…`. Strip the `file://` prefix and URL-decode (`%20` → space).
2. **Child items store a *relative* `URL`** (e.g. `Contents/Library/LoginItems/…`) that must be joined to the **parent record's** absolute path, found via `Parent Identifier`.
A small parser that resolves each record to a real path and flags true orphans:
```python
import sys, re, os, urllib.parse
items, cur = [], None
def push():
    global cur
    if cur is not None: items.append(cur)
for line in sys.stdin:
    s = line.strip()
    if re.match(r"^#\d+:$", s): push(); cur = {}; continue
    if cur is None: continue
    m = re.match(r"^([A-Za-z][A-Za-z /]+):\s*(.*)$", s)
    if m: cur[m.group(1).strip()] = m.group(2).strip()
push()
byid = {it["Identifier"]: it for it in items if it.get("Identifier")}
def abspath(it, d=0):
    if d > 8: return None
    u = it.get("URL", "")
    if u and u != "(null)":
        if u.startswith("file://"): return urllib.parse.unquote(u[7:]).rstrip("/")
        if u.startswith("/"): return u.rstrip("/")
        par = byid.get(it.get("Parent Identifier", ""))
        if par:
            b = abspath(par, d + 1)
            if b: return os.path.join(b, urllib.parse.unquote(u)).rstrip("/")
    return None
for it in items:
    if not it.get("Name"): continue
    p = abspath(it)
    if p and not os.path.exists(p):
        print("ORPHAN:", it["Name"], "->", p)
```
```bash
sfltool dumpbtm | python3 btm_check.py
```
> **Expected non-orphans:** `background tasks (0x2000)` and `developer (0x20)` rows legitimately store no path — they are not missing apps. Helpers/daemons that resolve *inside* a parent bundle (e.g. `/Applications/Foo.app/Contents/Library/LoginItems/…`) or in `/Library/…` are also fine; they just don't appear as a top-level `.app`. That is usually why an entry "has no application you can find."
## Disable background for an app
This **cannot be scripted** — Apple deliberately gates the toggle behind the GUI:
**System Settings → General → Login Items & Extensions → "Allow in the Background"** → switch the app off.
Disabling a `developer (0x20)` grouping header turns off all of that developer's sub-items at once.
## Uninstall cleanly — the system-extension trap
**Dragging an app to the Trash is not a full uninstall.** Apps that install a **network/system extension** plus a privileged daemon (firewalls and VPNs especially — Little Snitch, Mullvad, etc.) leave their `/Library` daemon **still loaded and running** after the app is trashed. The BTM entry persists and the background service keeps working.
### 1. Prefer the app's own uninstaller
- **Bundled uninstall script** (Mullvad): runs cleanly, deactivates the system extension, resets the firewall.
```bash
sudo "/Applications/Mullvad VPN.app/Contents/Resources/uninstall.sh"
```
- Some apps ship an uninstaller in their DMG or a CLI tool. **Note:** Little Snitch 6.x has **no DMG uninstaller and no `littlesnitch uninstall` subcommand** — manual removal is the supported route there.
### 2. Check whether a system extension is still active
```bash
systemextensionsctl list
```
If the app's extension is **not** listed (only unrelated ones like Tailscale/Canon remain), the extension is already deactivated, so removing the remaining files by hand is safe and completes the uninstall.
### 3. Manual removal (when no uninstaller exists)
Find every component first:
```bash
ls /Library/LaunchDaemons/<id>* /Library/LaunchAgents/<id>* 2>/dev/null
ls -d "/Library/Application Support/<Vendor>" 2>/dev/null
ls ~/Library/Preferences/<id>* 2>/dev/null
```
Then boot out the daemon and remove the files:
```bash
sudo launchctl bootout system /Library/LaunchDaemons/<id>.daemon.plist 2>/dev/null
sudo rm -f /Library/LaunchDaemons/<id>.daemon.plist /Library/LaunchAgents/<id>.agent.plist
sudo rm -rf "/Library/Application Support/<Vendor>" "$HOME/.Trash/<App>.app"
rm -f ~/Library/Preferences/<id>*.plist # user-owned, no sudo
```
> **Shared-container caution:** before deleting `~/Library/Group Containers/*`, check it isn't shared. Microsoft apps share `UBF8T346G9.com.microsoft.oneauth`, `…entrabroker`, and `…teams` across Office/Teams/RDP — delete only the app-specific container (e.g. `…com.microsoft.rdc`), never the shared auth ones.
## Stale BTM "ghost" entries
After a manual uninstall, `sfltool dumpbtm` may still list the removed app, pointing at now-deleted paths. These are harmless orphans (nothing left to load). **BTM reconciles them on the next reboot / login cycle** — a reboot also finalises any system-extension teardown.
## Quick reference
```bash
sfltool dumpbtm # full per-user BTM dump (no sudo)
sfltool dumpbtm | grep -A6 'Name:' # browse records
systemextensionsctl list # active network/system extensions
# Verify a removal:
sfltool dumpbtm | grep -i <vendor> # should be empty after a reboot
```
## See also
- Apple gates "Allow in the Background" behind System Settings — there is no supported CLI toggle for BTM dispositions.
- For VPN/firewall apps, always reach for the vendor uninstaller first; manual `rm` alone can leave a registered system extension behind.


@ -0,0 +1,94 @@
---
title: "Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration"
domain: troubleshooting
category: networking
tags: [ansible, ssh, known-hosts, tailscale, host-key, migration]
status: published
created: 2026-06-12
updated: 2026-06-12
---
# Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration
## Symptom
A subset of hosts in an Ansible run fail at **Gathering Facts** while the rest succeed:
```
[ERROR]: Task failed: Data could not be sent to remote host "100.112.127.0".
Make sure this host can be reached over ssh: Host key verification failed.
fatal: [majormail]: UNREACHABLE! => {"unreachable": true, ...}
```
The failing hosts are exactly the ones that were recently **rebuilt or migrated** (new server, new OS install, or a cloud move that issued a new Tailscale IP). Hosts that were never rebuilt connect fine.
Confusingly, **interactive `ssh root@<host>` works perfectly** for the same boxes — only Ansible fails.
## Cause
SSH stores each accepted host key in `~/.ssh/known_hosts` keyed by the **exact address you connected with**. A key accepted for `ssh root@tttpod` is saved under the hostname `tttpod`; it is *not* indexed under that node's IP.
Ansible inventories almost always set `ansible_host` to a **literal IP** (here, the Tailscale `100.x.x.x` address). So Ansible's SSH lookup is by IP, finds no matching entry, and with `StrictHostKeyChecking=yes` (or `accept-new` already exhausted) it refuses the connection:
```
No ED25519 host key is known for 100.112.127.0 and you have requested strict checking.
Host key verification failed.
```
The hostname-form and IP-form entries are independent. Fixing interactive SSH (e.g. converting aliases to MagicDNS names and re-accepting keys) does **nothing** for Ansible, because Ansible never uses the hostname.
A rebuilt host also generates **brand-new host keys**, so any old IP-form entry would additionally be a mismatch — but the common case after a migration to a *new* IP is simply that no IP entry exists at all.
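The independence of hostname-form and IP-form entries is easy to see by listing what a `known_hosts` file actually names. The sketch below is illustrative only: it reads plain-text entries and deliberately skips hashed ones (`|1|...`), which cannot be read back, so `ssh-keygen -F` remains the authoritative lookup. The function name `known_host_names` is hypothetical.

```python
def known_host_names(known_hosts_text):
    """Return the plain-text host names/addresses listed in a known_hosts string."""
    names = set()
    for line in known_hosts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if fields[0].startswith("@"):  # @cert-authority / @revoked marker
            fields = fields[1:]
        if not fields or fields[0].startswith("|1|"):
            continue  # hashed entry: the host name is not recoverable
        for name in fields[0].split(","):
            names.add(name)
    return names
```

A file that contains `tttpod` but not the node's `100.x.x.x` address matches the situation described above: interactive SSH by name succeeds while Ansible's lookup by IP finds nothing.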
## Diagnosis
```bash
# 1. Is there any known_hosts entry for the failing IP? (no output, exit status 1 = none)
ssh-keygen -F 100.112.127.0
# 2. Reproduce the exact failure without an interactive prompt:
ssh -o BatchMode=yes -o StrictHostKeyChecking=yes root@100.112.127.0 true
# -> "Host key verification failed." confirms the gap
# 3. Confirm the inventory IP is actually the host's CURRENT address
# (guards against stale-IP drift, a separate problem):
tailscale status | grep majormail
ssh-keyscan -t ed25519 100.112.127.0 | ssh-keygen -lf - # fingerprint it
```
If step 3 shows the inventory IP matches the live Tailscale node and the box answers `ssh-keyscan`, the only problem is the missing IP-form key.
## Fix
Add the **IP-form** host keys to the `known_hosts` of the user that runs Ansible. Back up first, scan over the tailnet, de-dup:
```bash
cp ~/.ssh/known_hosts ~/.ssh/known_hosts.bak.$(date +%Y%m%d)
for ip in 100.98.223.93 100.112.127.0 100.73.85.46 100.95.137.38 100.76.51.16 100.64.169.62; do
    ssh-keyscan -T 5 -t rsa,ecdsa,ed25519 "$ip" >> ~/.ssh/known_hosts
done
sort -u ~/.ssh/known_hosts -o ~/.ssh/known_hosts
```
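One side effect of `sort -u` is that it reorders the whole file, which makes a diff against the backup noisy. If you would rather keep the original order, a sketch of an order-preserving de-dup is below. The function name `dedup_known_hosts` is hypothetical, and it only removes byte-identical lines.

```bash
# Remove exact duplicate lines from a known_hosts file, keeping the first
# occurrence and the original line order. The body runs in a subshell so
# its variables stay local; the file is rewritten in place via a temp copy.
dedup_known_hosts() (
    file=$1
    tmp=$(mktemp) || return 1
    awk '!seen[$0]++' "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
)
```

Usage would be `dedup_known_hosts ~/.ssh/known_hosts` after the keyscan loop, in place of the `sort -u` line.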
Verify before re-running the playbook:
```bash
ansible <hosts> -m ping # expect "pong" from each
```
### Why `ssh-keyscan` is safe here
`ssh-keyscan` trusts whatever answers on the wire — normally a MITM risk. Over **Tailscale**, the connection rides WireGuard, which cryptographically authenticates the peer by its tailnet identity: reaching `100.x.x.x` *guarantees* you are talking to the node that owns that tailnet address. Scanning and trusting the key over the tailnet is therefore as trustworthy as the tailnet itself. Always cross-check the IP against `tailscale status` first (step 3) so you scan the right node.
## Prevention
- **Per-workstation, not fleet-wide.** `known_hosts` is local to each machine + user. After a migration, *every* host that runs Ansible (each workstation, plus any control node like `majorlab`) needs the IP keys added independently. Adding them on one Mac does not help the others.
- **Sweep on every migration phase.** A rolling migration changes one node's IP at a time; fold the keyscan above into the post-cutover checklist so Ansible never breaks mid-rollout.
- **Alternative: disabling host-key checking.** Setting `host_key_checking = False` in `ansible.cfg` (or `ANSIBLE_HOST_KEY_CHECKING=False`) sidesteps the prompt but trades away host-key verification entirely. Prefer the explicit keyscan: it keeps strict checking on for every *future* run while accepting the new key exactly once, under your control.
## Related
- SSH-Aliases — Fleet SSH access; the MagicDNS-vs-pinned-IP strategy and the Ansible-by-IP `known_hosts` note
- Network Overview — Tailscale fleet inventory and current IPs
- Hetzner-Migration-Status — the migration that triggered the fleet-wide IP churn
- [[ssh-socket-tailscale-race-condition]] — a different "SSH unreachable after reboot" failure mode

View file

@ -0,0 +1,133 @@
---
title: "SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure (No `Host` Block)"
domain: selfhosting
category: troubleshooting
tags:
- ssh
- ssh-config
- tailscale
- magicdns
- known-hosts
- host-key
- troubleshooting
status: published
created: 2026-06-11
updated: 2026-06-12
---
# SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure (No `Host` Block)
## The Problem
You `ssh` to a host you've reached many times before, but now it dies before any
auth happens:
```
$ ssh MyMac
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Host key verification failed.
```
On a headless box (WSL, a server, a CI runner) there's no askpass binary, so the
prompt can't even be shown — SSH just aborts. Connecting **by Tailscale IP** works
fine:
```
$ ssh user@100.74.124.81 # works
$ ssh MyMac # Host key verification failed
```
## Why It Happens
There is **no `Host MyMac` block in `~/.ssh/config` at all** — and there never was.
The connection only ever worked by IP, or interactively (where you clicked through
the first-connect `yes` prompt without noticing).
When no `Host` block matches, SSH uses the literal argument as the hostname. With
Tailscale MagicDNS, `MyMac` (or `mymac`) resolves to the node — so the *connection*
succeeds — but the host key it presents is checked against `known_hosts` under the
name **`mymac`**, which has no entry. Meanwhile the key you actually trust is stored
under the **IP**:
```
$ ssh-keygen -F 100.74.124.81 # found — line 67
$ ssh-keygen -F mymac # nothing
```
So strict host-key checking has nothing to match, tries to prompt to accept the
"new" key, and on a headless host that prompt fails → `Host key verification failed`.
Confirm there's no block (and that `ssh -G` is just echoing defaults):
```
$ ssh -G MyMac | grep -E '^(hostname|user|port) '
hostname mymac # lowercased literal — NOT an explicit HostName
user youruser # your local username default — not from a block
port 22 # default
```
If `hostname` equals the arg you typed (just lowercased) and `user` is your local
login name, there is no matching `Host` block.
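That comparison can be scripted. The sketch below reads `ssh -G` output on stdin and succeeds when the resolved hostname is just the lowercased alias. The name `falls_through` is invented here, and the check is a heuristic: a block that explicitly sets `HostName mymac` would look the same.

```bash
# Reads `ssh -G <alias>` output on stdin. Exit status 0 means the resolved
# hostname equals the lowercased alias, i.e. no HostName override applied.
# The body runs in a subshell so its variables stay local.
falls_through() (
    want=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    got=$(awk '$1 == "hostname" { print $2; exit }')
    [ "$got" = "$want" ]
)
```

For example: `ssh -G MyMac | falls_through MyMac && echo "no Host block applied"`.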
## The Fix
Add an explicit `Host` block that **pins the IP** that `known_hosts` already trusts.
This matches the convention every other host in a Tailscale fleet should follow —
pin the `100.x` address, not the MagicDNS name:
```sshconfig
Host MyMac mymac
    HostName 100.74.124.81
    User youruser
    IdentityFile ~/.ssh/id_ed25519
```
> [!note] When pinning the IP is the *wrong* call
> Pinning the IP is right while the host is **stable**. If the box gets migrated or
> rebuilt — new Tailscale IP *and* new host key — the pin rots and `known_hosts`
> mismatches. At that point switch to **MagicDNS names** so the alias self-heals. See
> *[MagicDNS Names vs Pinned IPs for Tailscale SSH (After a Fleet Migration)](tailscale-ssh-magicdns-vs-pinned-ip-after-migration.md)*.
Now `ssh MyMac` resolves to `100.74.124.81`, whose key is in `known_hosts`, and the
check passes with no prompt. Verify non-interactively:
```
$ ssh -o BatchMode=yes MyMac 'hostname'
mymac.majorlan
```
`BatchMode=yes` disables every prompt — if it returns the hostname cleanly, the key
is trusted and a real key authenticated.
**Don't over-pin the identity.** Run `ssh -v user@<IP> true` and check the
`Will attempt key` / accepted-key lines first. A workstation often authenticates
with the *default* `id_ed25519`, not a fleet key — if `id_ed25519_fleet` isn't even
offered, don't put it in the block.
## Cleanup: Stale `known_hosts` Cruft
Drive-by `ssh` attempts leave junk entries like `mymac-2` (auto-suffixed names from
old keys). They never match anything once you pin the IP. Purge them:
```
$ ssh-keygen -R mymac-2
```
## How to Diagnose This
1. `ssh -o BatchMode=yes <alias> true` — if it fails with `Host key verification
failed` (not `Permission denied`), it's a host-key problem, not auth.
2. `ssh -G <alias> | grep -E '^(hostname|user|port) '` — if `hostname` is just your
typed arg and there's no real `HostName`, there's no `Host` block.
3. `ssh-keygen -F <name>` vs `ssh-keygen -F <ip>` — find which name actually holds
the trusted key. Pin whichever one `known_hosts` has (usually the IP).
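Step 1 hinges on telling the two error strings apart. A small sketch that does the sorting for you is below; `classify_ssh_failure` is a hypothetical helper, and it only looks for the two messages quoted in this article.

```bash
# Classify ssh's stderr (read from stdin) into the two failure families.
# Prints "host-key", "auth", or "other". Subshell body keeps `err` local.
classify_ssh_failure() (
    err=$(cat)
    case $err in
        *"Host key verification failed"*) echo host-key ;;
        *"Permission denied"*)            echo auth ;;
        *)                                echo other ;;
    esac
)
```

Try it with `ssh -o BatchMode=yes <alias> true 2>&1 | classify_ssh_failure`. A successful login produces no output from `true`, so it lands in `other`.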
## Why This Gotcha Is Invisible
It only surfaces on a host with **no askpass** (headless / WSL / cron). On a desktop,
the first-connect prompt appears, you hit `yes`, an entry gets written under the
MagicDNS name, and it "just works" — masking the fact that no `Host` block exists and
the IP-keyed entry is the only durable trust. Move the same config to a headless box
and the missing block becomes a hard failure. Related: SSH only applies `Host` blocks
by **literal pattern match**, so connecting by IP also skips them — see *Ansible Fails
with Permission Denied While `ssh <alias>` Works (Host Alias Bypass)*.

View file

@ -0,0 +1,160 @@
---
title: "SSH `Permission denied (publickey)` After Rotating a Key — Backfill Every `authorized_keys`"
domain: selfhosting
category: troubleshooting
tags:
- ssh
- ssh-keys
- authorized-keys
- key-rotation
- publickey
- fleet
- troubleshooting
status: published
created: 2026-06-17
updated: 2026-06-17
---
# SSH `Permission denied (publickey)` After Rotating a Key — Backfill Every `authorized_keys`
## The Problem
A host you've SSH'd into for months suddenly rejects you — but **only some hosts**, not all:
```
$ ssh root@host-a
root@host-a: Permission denied (publickey).
$ ssh root@host-b # same key, same workstation — works fine
host-b $
```
Nothing changed on the servers. The thing that changed is on **your** side: at some
point the workstation's SSH key was **regenerated** (lost laptop, rebuild, a key file
clobbered by a botched copy, a routine rotation). The new public key was pushed to a
few hosts but never fanned out to the rest. Every host still holding only the *old*
public key now rejects the new private key with `Permission denied (publickey)`.
> The tell: it's `Permission denied (publickey)`, **not** `Host key verification
> failed`. The former is an **authorization** failure (the server doesn't trust your
> key); the latter is the server's key not matching your `known_hosts`. Different
> problem — see *[SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure](ssh-missing-host-block-magicdns-host-key-failure.md)*.
## Why It Happens
Public-key auth is **per-host**: the server only lets you in if your public key is a
line in that host's `~/.ssh/authorized_keys`. There is no central directory — each
host is its own island. So when you rotate a key, *every* host needs the new public
key appended independently.
It's easy to do this partially without noticing. You regenerate the key, then over the
next hour you happen to SSH into three boxes and (re-)deploy the key there as part of
other work. Those three now trust the new key. The other six don't — and you won't
find out until weeks later when you reach for one of them.
Confirm it's an authorization (key) failure and see which key is being offered:
```
$ ssh -v root@host-a 2>&1 | grep -E 'Offering|Authentications|Permission denied'
debug1: Offering public key: /home/you/.ssh/id_ed25519 ED25519 SHA256:XeY1/N9qwB…
debug1: Authentications that can continue: publickey
root@host-a: Permission denied (publickey).
```
The server offered you nothing but `publickey`, you offered your current key, and it
was refused → your key isn't in that host's `authorized_keys`.
## Scope It First — Don't Fix One Host at a Time
The host you noticed is rarely the only one. Sweep the whole fleet in one pass before
touching anything, so you fix the real set, not just the squeaky wheel:
```bash
for h in host-a host-b host-c host-d host-e host-f; do
    r=$(ssh -o BatchMode=yes -o ConnectTimeout=8 root@"$h" 'echo OK' 2>&1 | tail -1)
    echo "$h: $r"
done
```
`BatchMode=yes` suppresses password/passphrase prompts so a failure fails fast instead
of hanging. Anything that doesn't print `OK` needs the backfill.
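To turn the sweep's output into a host list you can feed straight into a fix loop, something like the following would work. It assumes the `host: result` line format printed by the sweep above; `failed_hosts` is a name chosen for this sketch.

```bash
# Reads "host: result" lines on stdin and prints, space-separated on one
# line, every host whose result is anything other than OK.
failed_hosts() {
    awk -F': ' '$2 != "OK" { printf "%s%s", sep, $1; sep = " " } END { print "" }'
}
```

Pipe the sweep loop into it (`done | failed_hosts`) and keep the result in a variable, so the backfill only touches hosts that actually failed.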
## The Fix
You need a **second, still-trusted** way onto each failing host to append the new key.
Common transit options, best first:
- **Another of your keys that still works** (e.g. a config-management / automation
user whose key is authorized fleet-wide, ideally with `sudo`).
- **Another workstation** whose key those hosts still trust.
- **The provider's web console / serial console** as a last resort.
> [!warning] A jump host only helps if *it* can reach the target
> "Bounce through a box that still trusts me" only works if that box's own key is in
> the target's `authorized_keys`. A host can trust *your* key yet have no standing
> trust to a third host (and hit its own `Host key verification failed` on the way).
> Test the full two-hop path before relying on it.
Using a fleet-wide automation user (`deploy`) with passwordless `sudo` as the transit,
append the new key idempotently, with a backup, to every failing host:
```bash
PUBKEY=$(cat ~/.ssh/id_ed25519.pub)
STAMP=$(date +%Y%m%d-%H%M%S)
for h in host-a host-c host-e; do # only the hosts that failed the sweep
ssh deploy@"$h" "sudo bash -s" <<EOF
set -e
F=/root/.ssh/authorized_keys
mkdir -p /root/.ssh && touch "\$F"
cp "\$F" "\$F.bak-$STAMP" # backup before any change
grep -qF "$PUBKEY" "\$F" || printf '%s\n' "$PUBKEY" >> "\$F" # append only if absent
chmod 600 "\$F"
EOF
done
```
Three things that keep this safe:
- **Append, never overwrite.** `>> "$F"` and the `grep -qF … ||` guard mean you add
one line and only if it's missing. Re-running is a no-op — never clobber an
`authorized_keys` with `>` or you'll lock out every *other* key on the box.
- **Back up first.** The `.bak-<stamp>` copy is your undo.
- **`chmod 600`.** SSH silently ignores an `authorized_keys` that's group/world
writable, which looks exactly like "the key didn't take."
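The append guard is easy to try on a scratch file before pointing it at a real host. The sketch below is a local variant, not the loop above: it matches on the base64 key body rather than the whole line, so a key that is already present under a different comment is not added twice. `append_key_once` is a hypothetical name, and it assumes the public key is in plain `type body comment` form.

```bash
# Append a public key line to an authorized_keys file only if its base64
# key body is not already present anywhere in the file.
# The body runs in a subshell so its variables stay local.
append_key_once() (
    file=$1
    pubkey=$2
    blob=$(printf '%s\n' "$pubkey" | awk '{ print $2 }')   # "type BODY comment" -> BODY
    [ -n "$blob" ] || return 1
    touch "$file"
    grep -qF -- "$blob" "$file" || printf '%s\n' "$pubkey" >> "$file"
    chmod 600 "$file"
)
```

Calling it twice with the same key leaves exactly one line, which is the no-op-on-rerun property the fix relies on.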
Then verify directly — not through the transit user:
```bash
for h in host-a host-c host-e; do
echo "$h: $(ssh -o BatchMode=yes root@"$h" 'echo OK' 2>&1 | tail -1)"
done
```
All `OK` means the new key authenticates on its own.
## Prevention
- **Treat rotation as fleet-wide.** When a workstation key changes, the very next step
is to fan the new public key out to **every** host's `authorized_keys` in one pass —
not opportunistically as you happen to log in. A short `for` loop over the full host
list (or a config-management task — see below) closes the gap immediately.
- **Manage `authorized_keys` declaratively.** An Ansible `ansible.posix.authorized_key`
task (or equivalent) that lists the *current* set of keys makes "who can log in" a
reviewed, version-controlled fact instead of an append-only pile that drifts per host.
- **Keep the old key authorized until the new one is verified everywhere**, then remove
the stale line in a deliberate cleanup pass.
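For the cleanup pass in the last bullet, a sketch of removing one stale key with a backup is below. `remove_key` is a hypothetical helper; it takes the base64 key body of the old key (the second field of the old `.pub` line) and drops every line that contains it.

```bash
# Remove every line containing the given key body from an authorized_keys
# file, leaving a timestamped backup next to it.
# The body runs in a subshell so its variables stay local.
remove_key() (
    file=$1
    blob=$2
    [ -n "$blob" ] || return 1            # an empty pattern would match every line
    cp "$file" "$file.bak-$(date +%Y%m%d-%H%M%S)" || return 1
    tmp=$(mktemp) || return 1
    # grep exits 1 when no lines survive; only treat other statuses as errors
    grep -vF -- "$blob" "$file" > "$tmp" || [ "$?" -eq 1 ] || { rm -f "$tmp"; return 1; }
    cat "$tmp" > "$file"
    rm -f "$tmp"
)
```

Run it only after the direct `BatchMode` verification passes with the new key on that host.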
## How to Diagnose This (Checklist)
1. `ssh -o BatchMode=yes <host> true` → `Permission denied (publickey)` (auth), not
`Host key verification failed` (host key). Confirms which problem you have.
2. `ssh -v <host> 2>&1 | grep Offering` → which private key is being offered, and its
fingerprint.
3. Sweep the whole fleet with the `BatchMode` loop → get the **full** list of affected
hosts before fixing.
4. Append the new public key (idempotent, backed up, `chmod 600`) via a still-trusted
transit path.
5. Re-verify each host with a direct `BatchMode` login.
Related: *[SSH Config & Key Management](../../01-linux/networking/ssh-config-key-management.md)*
and *[SSH Hardening Across a Fleet with Ansible](../../02-selfhosting/security/ssh-hardening-ansible-fleet.md)*.

View file

@ -0,0 +1,133 @@
---
title: "Steam Deck Wi-Fi Flapping: IWD Periodic Scan + rtw88 Power Save"
domain: troubleshooting
category: networking
tags: [wifi, steam-deck, steamos, iwd, networkmanager, rtw88, rtl8822ce, power-save, supplicant-disconnect, flapping]
status: published
created: 2026-06-19
updated: 2026-06-19
---
# Steam Deck Wi-Fi Flapping: IWD Periodic Scan + rtw88 Power Save
## 🛑 Problem
An OG Steam Deck (LCD model, Realtek **RTL8822CE** on the `rtw88_8822ce` driver) kept "losing" Wi-Fi — it would connect, hold for around a minute, drop, then reconnect a second later, over and over. From the router side the device looked like it was constantly coming and going; from the couch it felt like the network "wouldn't stay connected."
Crucially, **this was not a router problem.** The AP config was correct, RF was clean (strong signal, zero tx retries / beacon loss), and every other client on the network was rock-solid. The fault was entirely on the Deck.
## 🔍 Diagnosis
SteamOS uses **NetworkManager with the `iwd` backend** (not `wpa_supplicant`). That detail is the whole ballgame.
### Step 1 — Confirm the flap and its cadence
```bash
# how many disconnects this boot?
journalctl -b -u NetworkManager --no-pager | grep -c supplicant-disconnect
# 50
# when did they happen?
journalctl -b -u NetworkManager --no-pager | grep supplicant-disconnect \
| awk '{print $1,$2,$3}' | tail
# 10:20:52 · 10:21:54 · 10:22:57 · 10:24:00 · 10:25:03 · 10:26:05 · 10:27:08 ...
```
**~63 seconds between every drop.** A fixed, metronome-like interval is the tell — this is a *timer*, not RF noise. The NetworkManager log shows the pattern plainly:
```
activated -> failed (reason 'supplicant-disconnect')
... -> activated # reconnects ~1s later
```
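To measure the cadence rather than eyeball it, the gaps between consecutive timestamps can be computed directly. This is a small sketch; `drop_intervals` is a made-up name, and it assumes all events fall on the same day.

```bash
# Read HH:MM:SS timestamps (one per line) and print the gap in seconds
# between each consecutive pair.
drop_intervals() {
    awk -F: '{ t = $1 * 3600 + $2 * 60 + $3
               if (NR > 1) print t - prev
               prev = t }'
}
```

Feeding it the first four timestamps listed in Step 1 (`10:20:52`, `10:21:54`, `10:22:57`, `10:24:00`) prints 62, 63, 63. Gaps that tight and that regular point at a timer.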
### Step 2 — Prove the link is healthy *when it's up*
```bash
iw dev wlan0 station dump | grep -iE 'signal|bitrate|failed|retries|beacon loss'
# signal: -65 dBm
# tx retries: 0
# tx failed: 0
# beacon loss: 0
```
Strong signal, zero retries, zero beacon loss — the association is clean while it lasts. So the drop is being *commanded*, not caused by a bad radio link.
### Step 3 — Identify the chip and the backend
```bash
lspci -k | grep -A3 -iE 'network|wireless'
# Realtek RTL8822CE ... Kernel driver in use: rtw88_8822ce
```
The `~63s` interval is **IWD's default periodic background scan**. With no `/etc/iwd/main.conf` present, IWD scans on a timer even while connected, and on the `rtw88` driver that scan knocks the current association over — producing the `supplicant-disconnect` every minute.
A secondary annoyance: `iw dev wlan0 get power_save` reported `on`, which showed up as wildly jittery LAN latency (8-69 ms to the gateway over Wi-Fi, where a healthy 5 GHz link is 2-10 ms).
## ✅ Fix
Two independent changes — the first stops the flap, the second smooths latency.
### 1. Disable IWD's periodic scan (stops the flap)
```bash
sudo mkdir -p /etc/iwd
printf '[Scan]\nDisablePeriodicScan=true\n' | sudo tee /etc/iwd/main.conf
sudo systemctl restart iwd # briefly drops Wi-Fi; NetworkManager auto-reconnects
```
Trade-off: with periodic scanning off, the Deck roams to a different/stronger AP (e.g. another AiMesh node) more lazily. Fine for a device that mostly sits in one spot.
### 2. Disable Wi-Fi power save (kills the latency jitter)
The obvious `nmcli connection modify <name> 802-11-wireless.powersave 2` **does not work under the IWD backend** — NetworkManager doesn't enforce that property when `iwd` is managing the radio. Use a dispatcher script instead, with a retry loop because `rtw88` won't accept the setting in the first instant after association on a cold boot:
```bash
sudo tee /etc/NetworkManager/dispatcher.d/90-wifi-powersave >/dev/null <<'SCRIPT'
#!/bin/sh
# Disable Wi-Fi power save on the wireless iface (retry: rtw88 may not accept it instantly on boot)
case "$2" in
    up|dhcp4-change|connectivity-change)
        case "$1" in
            wl*)
                for n in 1 2 3 4 5; do
                    /usr/bin/iw dev "$1" set power_save off 2>/dev/null
                    [ "$(/usr/bin/iw dev "$1" get power_save 2>/dev/null)" = "Power save: off" ] && break
                    sleep 1
                done
                ;;
        esac
        ;;
esac
SCRIPT
sudo chmod +x /etc/NetworkManager/dispatcher.d/90-wifi-powersave
sudo iw dev wlan0 set power_save off # apply now without waiting for a reconnect
```
> 💡 A single-shot dispatcher (no retry) **silently fails on a cold boot**: it fires before the interface is ready, the `iw` call no-ops, and power save stays on. Verify with `iw dev wlan0 get power_save` *after a real reboot*, not just after a service restart.
## 🔁 Verification
```bash
# was 50/boot, ~once a minute:
journalctl -b -u NetworkManager --no-pager | grep -c supplicant-disconnect
# 0
iw dev wlan0 get power_save
# Power save: off
```
A 3-minute continuous `ping` showed **180/180 replies, 0 loss**, with latency tightened to **6-11 ms**. Confirmed across a full cold reboot: the Deck auto-rejoins Wi-Fi, both settings persist, and the disconnect counter stays at 0.
## 📌 Notes
- **Persistence:** `/etc/iwd/main.conf` and the dispatcher live in `/etc`, which survives reboots. A major SteamOS update *can* reset `/etc` — re-apply if the flapping returns after an OS update.
- **Fully reversible:**
```bash
sudo rm /etc/iwd/main.conf /etc/NetworkManager/dispatcher.d/90-wifi-powersave
sudo systemctl restart iwd
```
- **Interface name** is usually `wlan0`; confirm with `iw dev` if different.
- The same IWD-periodic-scan behavior can affect other `iwd`-based distros (Arch, some Fedora spins) on flaky/older Wi-Fi chips — the `DisablePeriodicScan` fix is general, not Deck-specific.
## 🔗 Related
- [Wi-Fi Game Streaming Stutter: 160 MHz Channel Width Saturating the 5 GHz Radio](wifi-160mhz-airtime-saturation-game-streaming.md) — the *other* Steam Deck Wi-Fi issue (airtime contention, router-side), distinct from this client-side flap.

View file

@ -0,0 +1,163 @@
---
title: "MagicDNS Names vs Pinned IPs for Tailscale SSH (After a Fleet Migration)"
domain: troubleshooting
category: networking
tags:
- ssh
- ssh-config
- tailscale
- magicdns
- known-hosts
- host-key
- migration
- wsl2
status: published
created: 2026-06-12
updated: 2026-06-12
---
# MagicDNS Names vs Pinned IPs for Tailscale SSH (After a Fleet Migration)
You have SSH aliases for a Tailscale fleet (`alias tttpod='ssh root@100.84.42.102'`).
They worked for months. Then you migrate or rebuild some nodes — and now a third of
them hang on connect or refuse the host key. This is the failure mode that hardcoded
addresses hit, and why the durable answer is **MagicDNS names**, not pinned IPs.
> This is the sequel to *[SSH Alias Falls Through to MagicDNS — Host-Key Verification
> Failure (No `Host` Block)](ssh-missing-host-block-magicdns-host-key-failure.md)*.
> That article says **pin the IP** `known_hosts` already trusts — correct when the
> node is stable. This one covers what happens when a migration changes the IP *and*
> the host key, which is exactly when IP-pinning stops paying off.
## The Three Failure Modes
A migration/rebuild can trigger any of these — often several at once across a fleet,
which is what makes it confusing:
### 1. Stale hardcoded IP → connection times out
The node re-registered on the tailnet with a **new** Tailscale IP, but your alias
still names the old one:
```
$ tttpod
ssh: connect to host 100.84.42.102 port 22: Operation timed out
```
The old address is dead; SSH waits the full timeout and gives up. Confirm by asking
the tailnet for the node's *current* IP by name:
```
$ tailscale status | grep tttpod
100.95.137.38 tttpod ... # alias points at 100.84.42.102 — stale
```
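A plain `grep tttpod` also matches names like `tttpod-2`, so for scripting the check an exact match on the name column is safer. The sketch below assumes the IP is the first column and the node name the second, as in the output above; `current_ip` and `alias_is_current` are hypothetical names.

```bash
# Reads `tailscale status` output on stdin and prints the tailnet IP of the
# node whose name (second column) matches exactly.
current_ip() {
    awk -v name="$1" '$2 == name { print $1; exit }'
}

# Succeeds when the IP an alias uses still matches the node's current IP.
# usage: tailscale status | alias_is_current <name> <alias-ip>
alias_is_current() {
    [ "$(current_ip "$1")" = "$2" ]
}
```

For the example above, `tailscale status | alias_is_current tttpod 100.84.42.102` would fail, flagging the alias as stale.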
### 2. Cold-path teardown → first connect after idle times out
The IP is correct and the node is up (it answers `ping`), but TCP/22 still times out
on the *first* try after a quiet period, then works on retry. Tailscale 1.98.x is more
aggressive about tearing down **idle direct UDP paths**; the first SSH has to
re-establish NAT traversal, which can overrun SSH's default connect timeout.
```
$ tailscale status | grep tttpod
100.95.137.38 tttpod ... idle, tx 9360 rx 0 # cold path
$ tailscale ping tttpod
pong from tttpod (100.95.137.38) via 5.161.118.84:41641 in 48ms # warms instantly
```
### 3. Host-key verification failed → box was rebuilt
The node was reinstalled, so it presents a **new** SSH host key. Your `known_hosts`
still has the old one, so even `StrictHostKeyChecking=accept-new` aborts — `accept-new`
only adds *genuinely new* hosts, it refuses a **mismatch**:
```
$ ssh root@tttpod hostname
Host key verification failed.
```
## The Fix
Three changes, applied on every **name-capable** machine (see the WSL2 caveat below):
### a. Switch aliases from IPs to MagicDNS names
```bash
# before — rots on every migration
alias tttpod='ssh root@100.84.42.102'
# after — always resolves the node's current IP
alias tttpod='ssh root@tttpod'
```
MagicDNS resolves the name to whatever IP the node currently has, so a future
migration needs **zero** alias edits. This is the whole point: the tailnet already
knows the mapping — stop duplicating (and stale-ing) it in your dotfiles.
> **Exception:** if there's no tailnet device with that exact name (e.g. an alias
> `teelia` pointing at a node actually named `temptedparadise`), MagicDNS can't
> resolve it — keep the IP for that one.
### b. Purge stale host keys, then re-accept
After a rebuild, clear the old entries under **both** the name and the current IP,
then reconnect with `accept-new` to record the fresh key. Over Tailscale's
authenticated WireGuard tunnel, a key change from a known rebuild is safe to accept.
```bash
for pair in "tttpod:100.95.137.38" "majortoot:100.64.169.62" "dcaprod:100.98.223.93"; do
n="${pair%%:*}"; ip="${pair##*:}"
ssh-keygen -R "$n"; ssh-keygen -R "$ip"
done
# repopulate
ssh -o StrictHostKeyChecking=accept-new root@tttpod hostname
```
### c. Add a cold-path cushion to `~/.ssh/config`
Give the first (cold) connection time to renegotiate instead of erroring:
```sshconfig
Host majorlinux tttpod majortoot majordiscord dcaprod majormail majorhome
    ConnectTimeout 25
    ServerAliveInterval 30
    ServerAliveCountMax 4
```
`ConnectTimeout 25` turns the cold-path timeout into a ~12 s pause. The keepalives
hold the path open during an active session so it doesn't drop mid-command.
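Since `tailscale ping` warms the path (mode 2 above), another option on machines that have the CLI is to warm it explicitly before connecting. This is only a sketch: `warm_ssh` is a made-up wrapper, it assumes `tailscale ping` accepts `--c` to limit the ping count, and it ignores ping failures so SSH still gets its own attempt.

```bash
# Warm the Tailscale path to a node, then run ssh with the remaining args.
# usage: warm_ssh <node-name> <ssh args...>
# The body runs in a subshell so `node` and the shifted args stay local.
warm_ssh() (
    node=$1
    shift
    tailscale ping --c 1 "$node" >/dev/null 2>&1 || true   # best effort only
    ssh "$@"
)
```

Example: `warm_ssh tttpod root@tttpod hostname`. It complements the `ConnectTimeout` cushion rather than replacing it.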
## Caveat: WSL2 Can't Use MagicDNS
A Linux box under **WSL2** typically has **no `tailscale` CLI and no MagicDNS
resolver** — it rides the Windows host's networking, and name lookups for tailnet
nodes fail:
```
$ getent hosts tttpod # (inside WSL2)
# nothing — no resolution
$ command -v tailscale # nothing — CLI lives on the Windows side
```
On those machines you **must** keep hardcoded IPs in `~/.ssh/config` (or use `Host`
blocks with explicit `HostName <ip>`), and refresh them by hand when a node migrates.
There's no self-healing option there — the trade is unavoidable.
## Diagnosis Checklist
1. `tailscale status | grep <host>` — does your alias's IP match the **current** one?
(Mode 1: stale IP.)
2. `ping`/`tailscale ping <host>` works but TCP/22 times out on first try, succeeds on
retry? (Mode 2: cold path.)
3. `ssh root@<host> true` → `Host key verification failed` (not `Permission denied`)?
(Mode 3: rebuilt box, stale `known_hosts`.)
4. Is the client a WSL2 box? `getent hosts <name>` returns nothing → MagicDNS
unavailable, stay on IPs.
## Takeaway
Pin the IP when a host is **stable** and the IP-keyed `known_hosts` entry is your
durable trust anchor. Switch to **MagicDNS names** when hosts **move** — migrations,
rebuilds, provider changes — so the tailnet's own name→IP mapping does the work your
dotfiles kept getting wrong. And on WSL2, you don't get the choice: hardcoded IPs,
refreshed by hand.

View file

@ -0,0 +1,115 @@
---
title: "Wi-Fi Game Streaming Stutter: 160 MHz Channel Width Saturating the 5 GHz Radio"
domain: troubleshooting
category: networking
tags: [wifi, 5ghz, 160mhz, channel-width, dfs, steam-deck, game-streaming, asuswrt, airtime, chanim]
status: published
created: 2026-06-13
updated: 2026-06-13
---
# Wi-Fi Game Streaming Stutter: 160 MHz Channel Width Saturating the 5 GHz Radio
## 🛑 Problem
Streaming a game from a desktop (wired) to a Steam Deck over Wi-Fi was stuttering intermittently — fine for a while, then choppy, hard to reproduce on demand. Throughput tests "looked fine," which is exactly why it was hard to pin down: **game streaming fails on jitter and microbursts of contention, not on average bandwidth.**
The Wi-Fi was an Asus RT-AX82U (AsusWRT, stock firmware) with the 5 GHz radio set to **Auto channel at 160 MHz width**.
## 🔍 Diagnosis
The key insight: **signal was excellent, but latency was not.** That combination means the airwaves are busy, not weak.
### Step 1 — Measure jitter to the gateway from a Wi-Fi client
```bash
ping -c 20 -i 0.2 192.168.50.1
# round-trip min/avg/max/stddev = 7.5/27.0/61.0/16.5 ms
```
27 ms **average** and 16 ms of jitter to your *own router* over Wi-Fi is pathological. A healthy 5 GHz link sits at 2-5 ms. Yet the client's signal was **-43 dBm** (excellent) with a clean **-92 dBm** noise floor. Strong signal + high jitter = **airtime contention**, not range or interference at the receiver.
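If you want to log these two numbers over time, the summary line is easy to parse. The sketch below uses a hypothetical `ping_jitter` helper. It accepts both the `min/avg/max/stddev` summary shown above and the Linux `min/avg/max/mdev` form; note that `mdev` is a related but not identical statistic.

```bash
# Reads ping output on stdin and prints "avg=<ms> jitter=<ms>" taken from
# the min/avg/max/(stddev|mdev) summary line.
ping_jitter() {
    awk -F' = ' '/min\/avg\/max/ { split($2, v, "/")
                                    sub(/ ms.*/, "", v[4])
                                    printf "avg=%s jitter=%s\n", v[2], v[4] }'
}
```

With the Step 1 output, `ping -c 20 -i 0.2 192.168.50.1 | ping_jitter` would print `avg=27.0 jitter=16.5`.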
### Step 2 — Confirm channel utilization at the router
AsusWRT/Broadcom exposes per-channel airtime stats via `wl chanim_stats`. SSH into the router and run it against the 5 GHz interface:
```bash
# 5 GHz interface name varies (eth6/eth7); resolve it from nvram
IF=$(nvram get wl1_ifname)
wl -i "$IF" chanspec # e.g. 36/160 (0xe832) → channel 36, 160 MHz
wl -i "$IF" assoclist | wc -l # number of associated 5 GHz clients
wl -i "$IF" chanim_stats
```
The smoking gun (`chanim_stats`, version 3):
```
chanspec tx inbss obss nocat nopkt doze txop goodtx badtx glitch ... idle
0xe832 92 2 1 2 1 0 4 8 81 2 14
```
Read it as percentages of airtime:
| Field | Value | Meaning |
|-------|-------|---------|
| `tx` | **92** | Channel busy transmitting 92% of the time |
| `txop` | **4** | Transmit-opportunities available only 4% — the channel is starved |
| `idle` | **14** | Channel idle only 14% |
| `goodtx` / `badtx` | 8 / **81** | Failed/retried transmits vastly outnumber good ones |
Seventeen clients were associated to that one 5 GHz radio.
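To pull just those fields out of `chanim_stats` without counting columns by hand, a sketch that looks them up by header name is below. It assumes the layout shown above (a header row starting with `chanspec`, data rows starting with a `0x…` chanspec, and all five columns present); the layout may differ between firmware versions, and `chanim_fields` is a name invented here.

```bash
# Reads `wl chanim_stats` output on stdin and prints selected columns by
# header name, so the result does not depend on column positions.
chanim_fields() {
    awk '$1 == "chanspec" { for (i = 1; i <= NF; i++) col[$i] = i; next }
         col["tx"] && $1 ~ /^0x/ {
             printf "tx=%s txop=%s goodtx=%s badtx=%s idle=%s\n",
                    $(col["tx"]), $(col["txop"]), $(col["goodtx"]),
                    $(col["badtx"]), $(col["idle"])
         }'
}
```

On the router this would be run as `wl -i "$IF" chanim_stats | chanim_fields`, assuming the router's shell ships an `awk` that supports this syntax.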
### Step 3 — Understand why 160 MHz makes it worse
A 160 MHz channel on the lower 5 GHz band spans channels **36-64**, which overlaps DFS sub-blocks. To stay clean it needs 160 MHz of *uncontended* spectrum, but in a dense RF environment (≈25 neighbor APs here, several on 5 GHz channels 48/52/100/132/153 that overlap or border the block), any one busy neighbor degrades the **entire** wide channel. 160 MHz also makes the radio **DFS-radar exposed**: a single radar detection forces a channel-switch with a 1 s+ blackout, which is a stream-killer.
So 160 MHz buys a higher *peak* PHY rate that game streaming doesn't need, at the cost of the *stability* it absolutely does.
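The channel arithmetic behind that is simple enough to sketch. 5 GHz channel numbers advance by 4 for every 20 MHz, so a bonded block covers `width / 20` consecutive channels starting at its lowest one. The helper names below are hypothetical, and the DFS range of 52-144 is an assumption that holds in most regulatory domains but not all.

```bash
# Print the 20 MHz channel numbers covered by a bonded 5 GHz block.
# usage: block_channels <lowest-channel> <width-mhz>
# The body runs in a subshell so its variables stay local.
block_channels() (
    first=$1
    width=$2
    n=$((width / 20))
    i=0
    out=""
    while [ "$i" -lt "$n" ]; do
        out="$out${out:+ }$((first + 4 * i))"
        i=$((i + 1))
    done
    echo "$out"
)

# Succeeds if any channel in the block falls in the assumed DFS range (52-144).
block_uses_dfs() (
    for ch in $(block_channels "$1" "$2"); do
        [ "$ch" -ge 52 ] && [ "$ch" -le 144 ] && return 0
    done
    return 1
)
```

`block_channels 36 160` prints `36 40 44 48 52 56 60 64`, half of which are DFS channels, while `block_channels 36 80` prints `36 40 44 48` and stays clear of DFS entirely.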
## ✅ Fix
Drop the 5 GHz radio to **80 MHz** and pin it to a **non-DFS** channel (UNII-1: 36/40/44/48 — no radar, no DFS blackouts).
GUI: **Wireless → 5 GHz → Channel Bandwidth = 80 MHz**, **Control Channel = 36**, turn off "Auto."
Or over SSH (`nvram` + `restart_wireless`):
```bash
nvram set wl1_bw_cap=7 # cap at 80 MHz (bitmask: 1=20, 3=40, 7=80, 15=160)
nvram set wl1_chanspec=36/80 # channel 36 @ 80 MHz
nvram set wl1_channel=36
nvram commit
service restart_wireless # ~15-20s radio bounce, drops all clients briefly
```
> [!warning] `restart_wireless` drops every Wi-Fi client for 15-20 seconds. `nvram commit` runs *before* the restart, so the config persists even if your own SSH/Wi-Fi session drops.
## 📊 Result
Verified from both the router and a client after the radio came back:
| Metric | Before (36/160) | After (36/80) |
|--------|-----------------|---------------|
| Channel tx-busy | 92% | **9%** |
| Transmit-opportunity available | 4% | **79%** |
| Channel idle | 14% | **87%** |
| Failed tx (`badtx` vs `goodtx`) | 81 vs 8 | **1 vs 3** |
| Gateway ping (avg / floor) | 27 ms / 7.5 ms | **9 ms / 2.7 ms** |
| PHY peak rate | 1729 Mbps | 1200 Mbps |
The PHY peak dropped (narrower channel) but that is irrelevant: Steam Remote Play wants ~30-50 Mbps with *consistent* airtime, which it now has. The stutter resolved.
## 🧠 Takeaways
- **Diagnose Wi-Fi streaming problems with jitter, not throughput.** A speed test can pass while a stream stutters. Ping your gateway and watch the stddev.
- **Strong signal + high latency = airtime congestion.** Don't chase signal strength when RSSI is already good; look at channel utilization (`chanim_stats`).
- **160 MHz is a trap in a dense RF environment.** Use 80 MHz for reliability; reserve 160 MHz for clean spectrum and short range.
- **Prefer non-DFS channels (36-48) for anything latency-sensitive**: DFS radar events cause silent multi-second dropouts.
- **Wire the *source*.** The streaming PC should be on Ethernet so the video only crosses the air once (AP → handheld). The handheld has to be Wi-Fi; the desktop doesn't.
- **Isolate IoT on 2.4 GHz** (separate SSID) so it never competes for 5 GHz airtime with latency-sensitive clients.
## Related
- [Steam Deck Wi-Fi Flapping: IWD Periodic Scan + rtw88 Power Save](steam-deck-wifi-flapping-iwd-periodic-scan-rtw88.md) — the *other* Steam Deck Wi-Fi issue (client-side flap), distinct from this router-side airtime problem.
- [Network Overview](../../02-selfhosting/dns-networking/network-overview.md)
- [Wake-on-LAN via Router SSH](../../02-selfhosting/dns-networking/wake-on-lan-router-ssh.md)
- [Pi-hole v6 Group Management — Per-Client DNS Rules](../../02-selfhosting/dns-networking/pihole-v6-group-management.md)

View file

@ -0,0 +1,120 @@
---
title: "Time Machine: Orphaned APFS .previous Folder Blocks All Backups"
domain: troubleshooting
category: general
tags: [macos, time-machine, apfs, backup, fsck, disk-utility]
status: published
created: 2026-06-18
updated: 2026-06-18
---
# Time Machine: Orphaned APFS `.previous` Folder Blocks All Backups
## Overview
On an APFS Time Machine destination, an interrupted backup can leave behind an orphaned staging folder named `<timestamp>.previous` (plus a matching, uncatalogued APFS snapshot). Every subsequent backup reads that folder during *FindingChanges*, hits a metadata-type mismatch, and aborts — so backups silently stop running. macOS shows only a generic "**Time Machine couldn't complete the backup … An unknown error occurred.**"
The trap: because the orphan is **not in Time Machine's catalog** and the destination is OS-protected, every obvious removal tool (`rm`, `chmod`, `tmutil delete`, `diskutil deleteSnapshot`) refuses it. The clean fix is **First Aid (`fsck_apfs`)**, which has authority over the volume and clears the orphaned snapshot.
## Symptoms
- "Time Machine couldn't complete the backup to '<disk>' — An unknown error occurred."
- Backups haven't run since around the time of an interrupted/cancelled backup.
- The destination disk is mounted and has plenty of free space (not full, not disconnected).
- `tmutil status` cycles through `Starting` / `FindingChanges` and never reaches `Copying`.
## Root Cause
`backupd` logs the real error on a loop (every ~15 s):
```bash
log show --predicate 'subsystem == "com.apple.TimeMachine"' --last 10m --style compact \
| grep -iE 'previous|error'
```
```
[TMStructure] Expected SnapshotInProgressContainer metadata type but found APFSBackup
metadata type at URL '.../<disk>/2026-06-17-172230.previous/'
```
An earlier backup was interrupted mid-run. It left two orphans tied to that timestamp, **neither registered in Time Machine's backup catalog**:
1. A staging directory `<timestamp>.previous` on the destination volume.
2. A matching APFS snapshot `com.apple.TimeMachine.<timestamp>.backup`.
Time Machine expects the staging folder to be a `SnapshotInProgressContainer` but finds completed-backup (`APFSBackup`) metadata, so it bails before copying anything.
> **Ignore the surrounding log noise.** `com.apple.backupd.sandbox.xpc: connection invalid`, `Mountpoint '…' is still valid`, and `missingName` on `/System/Volumes/Data/home` are all normal on a healthy backup — flagged `E` but harmless. The only line that matters is the `SnapshotInProgressContainer` mismatch.
## Diagnosis
Confirm the disk is healthy (not the problem) and locate the orphan:
```bash
tmutil status # stuck in Starting/FindingChanges, never Copying
df -h | grep -i "<disk-name>" # mounted, plenty free
diskutil apfs listSnapshots <diskNsN> # note the highest/last snapshot timestamp
```
If `listSnapshots` shows a final snapshot whose timestamp matches the `.previous` folder in the error, that's the orphaned pair.
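
The timestamp match can be done mechanically rather than by eye. The sketch below is one way to do it, assuming the log line and snapshot-name formats shown above; `previous_ts` and `snapshot_has_ts` are hypothetical helper names, and the sample text is the error line from this article rather than fresh output.

```bash
# Hypothetical helpers for matching the orphaned pair by timestamp.

# Read backupd log text on stdin; print each distinct <timestamp> that appears
# as "<timestamp>.previous" (format YYYY-MM-DD-HHMMSS, as in the error above).
previous_ts() {
  grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{6}\.previous' \
    | sed 's/\.previous$//' | sort -u
}

# Read snapshot-list text on stdin; succeed if it names the snapshot
# com.apple.TimeMachine.<timestamp>.backup for the given timestamp.
snapshot_has_ts() {
  grep -qF "com.apple.TimeMachine.$1.backup"
}

# Demo on the error line quoted in Root Cause.
sample="metadata type at URL '.../<disk>/2026-06-17-172230.previous/'"
printf '%s\n' "$sample" | previous_ts   # prints 2026-06-17-172230
```

In practice, pipe the `log show … | grep` output into `previous_ts`, then pipe `diskutil apfs listSnapshots <diskNsN>` into `snapshot_has_ts "$ts"`. A zero exit status means the snapshot list still contains the orphan's partner.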
## Why the Obvious Tools Fail
Do **not** burn time trying to force the folder out — here's what each tool does and why it refuses:
| Command | Result | Reason |
|---|---|---|
| `sudo rm -rf …/<ts>.previous` | `Operation not permitted` | TM applies a `group:everyone deny delete` ACL that overrides root. |
| `sudo chmod -RN …/<ts>.previous` | runs for minutes, then fails | A `.previous` folder is a **full copy of the entire Mac filesystem**; `-R` walks the whole tree and can't clear ACLs on the SIP-`restricted` system files inside (`/usr/bin/sh`, frameworks, keymaps). `rm` then hits the same wall. |
| `sudo tmutil delete -p …/<ts>.previous` | `Invalid deletion target (error 22)` | Not a registered backup. |
| `sudo tmutil delete -t <timestamp>` | `error 2 (No such file)` | No catalog entry for that timestamp. |
| `sudo diskutil apfs deleteSnapshot <diskNsN> -uuid <uuid>` | `Not a valid APFS Snapshot UUID` | TM-managed snapshot; diskutil won't remove it directly. |
> **If you started a `chmod -R` and killed it:** the live system is unaffected — `chmod -R` does not follow symlinks out of the backup tree. Verify with `ls -lde ~/Desktop` (normal ACLs = untouched). Stop a runaway with `sudo pkill -f '<timestamp>.previous'`.
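
To confirm the ACL is what blocks `rm` before trying anything else, you can inspect the folder's ACL text. This is a minimal sketch: `has_deny_delete` is a hypothetical helper, and the sample ACL line is illustrative, since this article does not show the exact permission list Time Machine applies.

```bash
# Hypothetical helper: read `ls -lde <path>` output on stdin and succeed if an
# ACL entry denies "delete" to group:everyone. Other permissions may be listed
# alongside it; "delete_child" alone does not count as a match.
has_deny_delete() {
  grep -Eq 'group:everyone deny ([a-z_]+,)*delete(,|$)'
}

# Demo on a made-up ACL line (illustrative text, not captured output).
printf ' 0: group:everyone deny add_file,delete,add_subdirectory\n' \
  | has_deny_delete && echo "deny-delete ACL present"
```

Against the real folder, the input would come from `ls -lde …/<ts>.previous`.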
## Fix — Run First Aid (`fsck_apfs`)
First Aid runs with full authority over the volume and clears the orphaned snapshot, which defuses the `.previous` folder's metadata mismatch.
```bash
# 1. Stop the looping backup
sudo tmutil stopbackup
# 2. Verify the destination volume (live mode is fine; read-only check)
sudo diskutil verifyVolume <diskNsN>
# or: Disk Utility → View → Show All Devices → select the TM volume → First Aid → Run
```
`verifyVolume` enumerates and validates every snapshot; the verify/remount cycle purges the orphaned in-progress snapshot. Expected result:
```
The volume <name> appears to be OK
File system check exit code is 0
```
Confirm the orphan snapshot is gone (count drops by one; the matching timestamp no longer appears):
```bash
diskutil apfs listSnapshots <diskNsN>
```
Then restart and watch it succeed:
```bash
sudo tmutil startbackup --auto
tmutil status # should reach BackupPhase = Copying with no SnapshotInProgressContainer errors
```
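
Rather than re-reading the full `tmutil status` dump, you can pull out just the phase. The helper below is a sketch that assumes the status text contains a line of the form `BackupPhase = Copying;` (optionally quoted); `backup_phase` is a hypothetical name and the sample input is illustrative, not captured output.

```bash
# Hypothetical helper: print the BackupPhase value from `tmutil status` text
# on stdin. Prints nothing if no BackupPhase line is present.
backup_phase() {
  sed -nE 's/^[[:space:]]*BackupPhase = "?([^";]+)"?;.*$/\1/p'
}

# Demo on illustrative text shaped like the status dump.
printf 'Backup session status:\n{\n    BackupPhase = FindingChanges;\n    Running = 1;\n}\n' \
  | backup_phase   # prints FindingChanges
```

Running `tmutil status | backup_phase` every few seconds shows whether the backup moves past `FindingChanges` to `Copying`.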
If `verifyVolume` reports problems rather than "appears to be OK", run the repair (it must unmount the volume):
```bash
sudo diskutil repairVolume <diskNsN>
```
## Notes
- The first backup after the fix is often a large catch-up (hundreds of GB) because the chain was broken — let it finish; it returns to quick hourly increments afterward.
- The inert `<timestamp>.previous` **folder** may still sit on the volume after the fix. Time Machine now ignores it, so it's not blocking — but it consumes space. Removing it cleanly requires booting to **Recovery Mode**, `csrutil disable`, `rm -rf` the folder, then `csrutil enable` — only worth it to reclaim the space.
- Time Machine identifies its destination by `DestinationID` (a UUID), not the volume name, so renaming the disk later is safe.
- Interrupted backups are more likely on flaky USB-SATA bridge enclosures (e.g. some WD My Passport units) whose slow sleep/wake transitions can drop the drive mid-backup.
## Tags
`macos` `time-machine` `apfs` `backup` `fsck-apfs` `disk-utility` `snapshot` `first-aid`
## See Also
- [SnapRAID & MergerFS Storage Setup](../01-linux/storage/snapraid-mergerfs-setup.md)
- MajorMac Incident Log (2026-06-18) — the originating incident


@ -0,0 +1,193 @@
---
title: "WordPress 6.7 _load_textdomain_just_in_time Notice (Theme/Plugin Loads Translations Too Early)"
domain: troubleshooting
category: troubleshooting
tags:
- wordpress
- wordpress-6.7
- php
- i18n
- textdomain
- theme
- mu-plugin
- deprecation
- troubleshooting
status: published
created: 2026-06-21
updated: 2026-06-21
---
# WordPress 6.7 `_load_textdomain_just_in_time` Notice
> **TL;DR** — WordPress 6.7 added a `doing_it_wrong` notice that fires when a translation function (`__()`, `_e()`, `esc_html__()`, …) is called for a text domain **before the `init` action**. It's almost always a theme or plugin registering nav menus / sidebars / labels on `after_setup_theme` (which runs before `init`). The notice is **debug-only and harmless** — translations still load via the just-in-time fallback. If the offending code is in your own (or an updatable) theme/plugin, fix it at the source by deferring to `init`. If it's a **non-updating or third-party** theme you don't want to hand-edit, suppress *only this one notice* with a `doing_it_wrong_trigger_error` filter in a tiny mu-plugin.
---
## Symptom
With `WP_DEBUG` on (or in Query Monitor's PHP panel), you see:
```
Function _load_textdomain_just_in_time was called incorrectly.
Translation loading for the <domain> domain was triggered too early.
This is usually an indicator for some code in the plugin or theme running too early.
Translations should be loaded at the init action or later.
(This message was added in version 6.7.0.)
_load_textdomain_just_in_time() wp-includes/l10n.php
get_translations_for_domain() wp-includes/l10n.php
translate() wp-includes/l10n.php
__() wp-includes/l10n.php
WordPress Core
```
The key fields are **the domain name** (e.g. `marstheme`, `woocommerce`, `astra`) and the fact that the stack bottoms out in **WordPress Core** via `__()` — that tells you *some* extension called a translation function, not that core is broken.
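
If the notice is flooding a log, it helps to list every distinct domain it names. The following is a minimal sketch that assumes the message wording shown above and tolerates the domain being wrapped in `<code>` tags, as it may be in raw log text. `notice_domains` is a hypothetical helper name.

```bash
# Hypothetical helper: read log text on stdin and print each distinct text
# domain named in a "triggered too early" notice, one per line.
notice_domains() {
  sed -nE 's#.*Translation loading for the (<code>)?([^< ]+)(</code>)? domain was triggered too early.*#\2#p' \
    | sort -u
}

# Demo on two illustrative lines (not captured output).
printf '%s\n' \
  'Translation loading for the <code>marstheme</code> domain was triggered too early.' \
  'Translation loading for the marstheme domain was triggered too early.' \
  | notice_domains   # prints marstheme
```

With `WP_DEBUG_LOG` enabled, the input would typically be `wp-content/debug.log`.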
## Why it happens (the WP 6.7 change)
Before 6.7, WordPress silently "just-in-time" loaded a text domain the first time you translated a string in it. 6.7 kept the JIT loading but started **warning** when it's triggered before `init`, because:
- Translations loaded before `init` can't be filtered/overridden by other plugins that hook `init`.
- It signals the extension is doing setup work earlier than the WordPress lifecycle intends.
The usual culprit is code on **`after_setup_theme`** (which fires *before* `init`) that translates a label inline, e.g.:
```php
function mytheme_setup() {
register_nav_menus( array(
'primary' => __( 'Primary Menu', 'mytheme' ), // <-- translate call before init
) );
}
add_action( 'after_setup_theme', 'mytheme_setup' );
```
> **Important:** explicitly calling `load_theme_textdomain()` / `load_plugin_textdomain()` early does **not** fix the notice, and since WP 4.6 themes hosted on wordpress.org don't even need to call it. The notice is about the *translate call*, not about whether the domain was loaded. Moving only the `load_*_textdomain()` call around is a common dead-end (see the gotcha below).
## Diagnostic chain
### 1. Identify the domain and what owns it
The notice names the domain. Find which theme/plugin uses it:
```bash
WPROOT=/var/www/html
grep -rlw '<domain>' "$WPROOT/wp-content/themes" "$WPROOT/wp-content/plugins" 2>/dev/null
# Which extension has the most references (i.e. owns the domain)?
grep -rl '<domain>' "$WPROOT/wp-content/" 2>/dev/null \
| sed -E "s#$WPROOT/wp-content/(themes|plugins|mu-plugins)/([^/]+)/.*#\1/\2#" \
| sort | uniq -c | sort -rn | head
```
> **Watch for renamed/forked themes.** The domain often does **not** match the theme's folder name. A theme bought as "Mars" and re-slugged to `kappa` keeps `marstheme` as its text domain in all 40+ template files. So `wp theme list` shows `kappa` active while the notice says `marstheme` — they're the same thing.
### 2. Confirm it's active and whether it can be updated
```bash
sudo -u www-data wp --path=$WPROOT theme list --fields=name,status,version,update
sudo -u www-data wp --path=$WPROOT plugin list --fields=name,status,version,update
```
- `update available` → **update it first** (newest releases of most themes/plugins fixed this in late 2024/2025). That's the proper fix; the rest of this article is for when you can't.
- `update none` on a **renamed/custom fork** → no upstream exists, so updating is impossible. Go to the suppression fix.
### 3. Pin down the early call (optional)
```bash
grep -rn "__(\s*['\"].*['\"]\s*,\s*['\"]<domain>['\"]" \
"$WPROOT/wp-content/themes/<theme>" | head
```
Look for translate calls inside functions hooked to `after_setup_theme`, `setup_theme`, `plugins_loaded`, or run at file scope in `functions.php`.
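
To narrow the search to the hooks named above, you can first list where the theme registers callbacks on them. This is a sketch, not a complete detector: `early_hooks` is a hypothetical helper, it only matches `add_action()` calls with a quoted hook name, and it cannot see translate calls made at file scope.

```bash
# Hypothetical helper: list add_action() registrations on hooks that fire
# before init, searching recursively under the given directory.
# Output format is grep's usual path:line:content.
early_hooks() {
  grep -rnE "add_action\([[:space:]]*['\"](after_setup_theme|setup_theme|plugins_loaded)['\"]" "$1"
}
```

Call it as `early_hooks "$WPROOT/wp-content/themes/<theme>"`, then read each listed callback for `__()`, `_e()`, or `esc_html__()` calls that use the domain from the notice.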
## The fix
### Option A — fix it at the source (own / updatable code)
Defer the translation. Either register the raw string and translate at render time, or move the registration to `init`:
```php
// Before: translated on after_setup_theme (too early)
add_action( 'after_setup_theme', function () {
register_nav_menus( array( 'primary' => __( 'Primary Menu', 'mytheme' ) ) );
} );
// After: register the menu location on init, where translation is allowed
add_action( 'init', function () {
register_nav_menus( array( 'primary' => __( 'Primary Menu', 'mytheme' ) ) );
} );
```
Only do this in code you maintain yourself. Hand-edits to a third-party theme/plugin that receives updates are wiped on the next update; use Option B for those.
### Option B — suppress just this notice (third-party / non-updating code)
When the early call lives in a theme you don't control and can't update (a renamed commercial fork, an abandoned plugin), the clean, update-safe move is to silence **only** the `_load_textdomain_just_in_time` notice — not all `doing_it_wrong` output — via a must-use plugin.
Create `wp-content/mu-plugins/fix-textdomain.php`:
```php
<?php
/**
* Suppress the WP 6.7 "_load_textdomain_just_in_time was called incorrectly"
* notice for a theme/plugin that translates before init.
*
* Scope is intentionally narrow: only this one function is silenced, so other
* doing_it_wrong notices still surface. Translations still load via the JIT
* fallback, so nothing visible changes for visitors.
*/
add_filter( 'doing_it_wrong_trigger_error', function ( $trigger, $function_name ) {
return '_load_textdomain_just_in_time' === $function_name ? false : $trigger;
}, 10, 2 );
```
`mu-plugins/` loads automatically (no activation, can't be deactivated from the admin), and runs early enough to register the filter before the notice fires.
#### Verify
```bash
WPROOT=/var/www/html
# 1. Syntax-check the mu-plugin
php -l "$WPROOT/wp-content/mu-plugins/fix-textdomain.php"
# -> No syntax errors detected
# 2. Confirm WP still boots and the filter is registered
sudo -u www-data wp --path=$WPROOT eval \
'echo has_filter("doing_it_wrong_trigger_error") ? "filter set\n" : "MISSING\n";'
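# 3. Clear the debug log, boot WP once, confirm 0 new notices
#    (the eval itself runs after init; the early call comes from the theme loading during boot)
DBG="$WPROOT/wp-content/debug.log"
[ -f "$DBG" ] && : > "$DBG"
sudo -u www-data wp --path=$WPROOT eval '__("Primary Menu","<domain>");' >/dev/null 2>&1
# grep -c prints 0 itself on no match (and exits 1), so only fall back to 0 when the log is missing
n=$(grep -c "<domain>" "$DBG" 2>/dev/null); echo "${n:-0}"
# -> 0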
```
## Gotchas
### The "load the textdomain earlier/later" dead-end
A very common (wrong) first attempt is an mu-plugin that just calls `load_theme_textdomain()` on `plugins_loaded` or `after_setup_theme`:
```php
// DOES NOT FIX THE NOTICE
add_action( 'plugins_loaded', function () {
load_theme_textdomain( 'mytheme', get_template_directory() . '/languages' );
}, 0 );
```
`plugins_loaded` still runs **before `init`**, and — more importantly — the notice is triggered by the theme's own early `__()` call, not by whether you've loaded the domain. This code is dead weight. If you find one in place, replace it with the Option B filter rather than tweaking its hook/priority.
### Don't blanket-suppress all deprecations
Resist `error_reporting(E_ALL & ~E_DEPRECATED)` or returning `false` from `doing_it_wrong_trigger_error` unconditionally — that also hides genuinely useful warnings (a plugin breaking on a future PHP/WP bump). Scope the filter to the one `function_name`.
### Renamed theme ⇒ domain ≠ folder
Re-stating because it costs the most time: the domain in the notice can be the theme's *original* slug, not its current folder. Always `grep` for the domain to find the real owner before concluding "I don't even have that theme installed."
## See also
- [Patching PHP 8.4 Implicit-Nullable Deprecations in Vendor Packages](php-84-vendor-implicit-nullable-patch.md) — the other "harmless deprecation that floods logs" pattern on the WordPress fleet
- [WordPress developer note: i18n improvements in 6.7](https://make.wordpress.org/core/2024/10/21/i18n-improvements-in-6-7/) — the canonical reference for this change


@ -10,7 +10,7 @@ tags:
- deno
status: published
created: 2026-04-02
updated: 2026-04-30T05:21
updated: 2026-06-16T18:35
---
# yt-dlp YouTube JS Challenge Fix (Fedora)
@ -84,12 +84,43 @@ echo '--remote-components ejs:github' > ~/.config/yt-dlp/config
## Maintenance
YouTube pushes extractor changes frequently. Keep yt-dlp current:
YouTube pushes extractor changes frequently. Keep yt-dlp current.
### Updating: the `-U` trap + avoid duplicate installs
`yt-dlp -U` **does not work** when yt-dlp was installed via pip/PyPI — the PyPI build deliberately disables the self-updater:
```
ERROR: You installed yt-dlp with pip or using the wheel from PyPi; Use that to update
```
Update through pip instead. **Pick one install method and stick to it** — running both a user install and a system install leaves two copies that drift out of sync (one updates, the other stays stale and shadows it depending on `$PATH` / sudo).
**Recommended — single user install (no sudo):**
```bash
pip3 install -U --user yt-dlp
```
This lives in `~/.local/bin/yt-dlp` and is first on a normal user's `$PATH`. Update it the same way; never use sudo.
**Alternative — system-wide (Fedora, PEP 668):**
```bash
sudo pip install -U yt-dlp --break-system-packages
```
> Only use `--break-system-packages` if you intentionally want a root-owned copy in `/usr/local`. Do **not** mix it with a `--user` install.
**Check for and remove a duplicate install:**
```bash
which -a yt-dlp # more than one path = duplicate installs
sudo pip3 uninstall -y yt-dlp # removes the /usr/local (system) copy + its wrapper
```
> If installed via the standalone binary (not pip), `yt-dlp -U` is the correct updater.
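
When `which` is unavailable or aliased, the same duplicate check can be done by walking `$PATH` directly. This is a minimal POSIX `sh` sketch; `list_installs` is a hypothetical helper name, and the first path it prints is the copy a plain `yt-dlp` invocation would run.

```bash
# Hypothetical helper: print every executable named "$1" found on $PATH,
# in lookup order. More than one line of output means duplicate installs.
list_installs() {
  old_ifs=$IFS
  IFS=:
  for dir in $PATH; do
    dir=${dir:-.}   # an empty PATH entry means the current directory
    if [ -x "$dir/$1" ] && [ ! -d "$dir/$1" ]; then
      printf '%s\n' "$dir/$1"
    fi
  done
  IFS=$old_ifs
}

list_installs yt-dlp
```

Because `sudo` usually resets `$PATH`, running the same check under `sudo sh` can reveal a different first match, which is how a stale copy ends up shadowing the updated one.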
---
## Known Limitations


@ -1,6 +1,6 @@
---
created: 2026-04-02T16:03
updated: 2026-05-15T09:00
updated: 2026-06-21T11:46
---
* [Home](index.md)
* [Linux & Sysadmin](01-linux/index.md)
@ -12,10 +12,12 @@ updated: 2026-05-15T09:00
* [Bash Scripting Patterns](01-linux/shell-scripting/bash-scripting-patterns.md)
* [SnapRAID & MergerFS Storage Setup](01-linux/storage/snapraid-mergerfs-setup.md)
* [mdadm — Rebuilding a RAID Array After Reinstall](01-linux/storage/mdadm-raid-rebuild.md)
* [Growing an LVM Volume by Absorbing Another Disk](01-linux/storage/lvm-grow-volume-absorb-disk.md)
* [Linux Distro Guide for Beginners](01-linux/distro-specific/linux-distro-guide-beginners.md)
* [WSL2 Instance Migration to Fedora 43](01-linux/distro-specific/wsl2-instance-migration-fedora43.md)
* [WSL2 Training Environment Rebuild](01-linux/distro-specific/wsl2-rebuild-fedora43-training-env.md)
* [WSL2 Backup via PowerShell](01-linux/distro-specific/wsl2-backup-powershell.md)
* [WSL2 In-Place Upgrade to Fedora 44](01-linux/distro-specific/wsl2-fedora44-inplace-upgrade.md)
* [Self-Hosting & Homelab](02-selfhosting/index.md)
* [Self-Hosting Starter Guide](02-selfhosting/docker/self-hosting-starter-guide.md)
* [Docker vs VMs for the Homelab](02-selfhosting/docker/docker-vs-vms-homelab.md)
@ -30,6 +32,7 @@ updated: 2026-05-15T09:00
* [AWS S3 Cost Management](02-selfhosting/cloud/aws-s3-cost-management.md)
* [VPS Migration Baseline Checklist](02-selfhosting/cloud/vps-migration-baseline-checklist.md)
* [rsync Backup Patterns](02-selfhosting/storage-backup/rsync-backup-patterns.md)
* [Fleet Backups with restic + B2](02-selfhosting/storage-backup/restic-b2-fleet-backups.md)
* [Tuning Netdata Web Log Alerts](02-selfhosting/monitoring/tuning-netdata-web-log-alerts.md)
* [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md)
* [Deploying Netdata to a New Server](02-selfhosting/monitoring/netdata-new-server-setup.md)
@ -41,6 +44,7 @@ updated: 2026-05-15T09:00
* [Mastodon Post-Install Hardening (Permissions + Account)](02-selfhosting/services/mastodon-post-install-hardening.md)
* [Mastodon — The `--prune-profiles` Trap and How to Recover](02-selfhosting/services/mastodon-prune-profiles-trap.md)
* [Mastodon on S3 — Silent Upload Failures (BucketOwnerEnforced/ACLs)](02-selfhosting/services/mastodon-s3-acl-upload-failures.md)
* [Mastodon — Triaging Crowdfunding / Mention-Spam Accounts](02-selfhosting/services/mastodon-mention-spam-crowdfunding.md)
* [Ghost Email Configuration with Mailgun](02-selfhosting/services/ghost-smtp-mailgun-setup.md)
* [Inbound Spam Filtering: spamass-milter + SpamAssassin Bayes](02-selfhosting/services/postfix-spamassassin-bayes-spam-filtering.md)
* [Claude Code Remote Control — Mobile Access to a Persistent Host Session](02-selfhosting/services/claude-code-remote-control.md)
@ -56,6 +60,7 @@ updated: 2026-05-15T09:00
* [Fail2ban Custom Jail: Nginx Bad Request Detection](02-selfhosting/security/fail2ban-nginx-bad-request-jail.md)
* [Fail2ban Custom Jail: Apache Bad Request Detection](02-selfhosting/security/fail2ban-apache-bad-request-jail.md)
* [SSH Hardening Fleet-Wide with Ansible](02-selfhosting/security/ssh-hardening-ansible-fleet.md)
* [Migrating Flat Ansible Playbooks to Roles (Safely)](02-selfhosting/security/ansible-flat-playbooks-to-roles.md)
* [ClamAV Fleet Deployment with Ansible](02-selfhosting/security/clamav-fleet-deployment.md)
* [Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts](02-selfhosting/security/fail2ban-digest-mode-fleet.md)
* [Apache CVE-2026-23918 — HTTP/2 Double Free Mitigation](02-selfhosting/security/apache-cve-2026-23918-http2-mitigation.md)
@ -76,6 +81,8 @@ updated: 2026-05-15T09:00
* [HEVC Batch Re-Encode for Plex Using VAAPI (AMD GPU)](04-streaming/plex/hevc-vaapi-batch-encode.md)
* [Plex Transcoding Troubleshooting](04-streaming/plex/plex-transcoding-troubleshooting.md)
* [Troubleshooting](05-troubleshooting/index.md)
* [Wi-Fi Game Streaming Stutter: 160 MHz Channel Width Saturating the 5 GHz Radio](05-troubleshooting/networking/wifi-160mhz-airtime-saturation-game-streaming.md)
* [Steam Deck Wi-Fi Flapping: IWD Periodic Scan + rtw88 Power Save](05-troubleshooting/networking/steam-deck-wifi-flapping-iwd-periodic-scan-rtw88.md)
* [Apache Outage: Fail2ban Self-Ban + Missing iptables Rules](05-troubleshooting/networking/fail2ban-self-ban-apache-outage.md)
* [Postfix + SendGrid: TLS Handshake Failure (Port 465 vs 587)](05-troubleshooting/networking/postfix-sendgrid-tls-handshake-failure.md)
* [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](05-troubleshooting/networking/fail2ban-imap-self-ban-mail-client.md)
@ -101,6 +108,7 @@ updated: 2026-05-15T09:00
* [Gemini CLI Manual Update](05-troubleshooting/gemini-cli-manual-update.md)
* [MajorWiki Setup & Publishing Pipeline](05-troubleshooting/majwiki-setup-and-pipeline.md)
* [Gitea Actions Runner: Boot Race Condition Fix](05-troubleshooting/gitea-runner-boot-race-network-target.md)
* [Forgejo: Account Recovery & CLI Admin When Locked Out of the GUI](05-troubleshooting/forgejo-mailer-and-cli-recovery.md)
* [Cron Heartbeat False Alarm: /var/run Cleared by Reboot](05-troubleshooting/cron-heartbeat-tmpfs-reboot-false-alarm.md)
* [SELinux: Fixing Dovecot Mail Spool Context (/var/vmail)](05-troubleshooting/selinux-dovecot-vmail-context.md)
* [SELinux: Wrong /etc/localtime Label Silently Breaks Timezone Changes](05-troubleshooting/selinux-localtime-label-breaks-timezone.md)
@ -111,11 +119,17 @@ updated: 2026-05-15T09:00
* [Claude Desktop MCP Server Started via wsl.exe Sees Empty Environment (WSLENV)](05-troubleshooting/wsl-env-claude-desktop-mcp.md)
* [Claude Desktop MCP Mass-Disconnect After Blocking SSH Reboot](05-troubleshooting/claude-desktop-mcp-mass-disconnect-blocking-reboot.md)
* [Patching PHP 8.4 Implicit-Nullable Deprecations in Vendor Packages](05-troubleshooting/php-84-vendor-implicit-nullable-patch.md)
* [WordPress 6.7 `_load_textdomain_just_in_time` Notice (Translations Loaded Too Early)](05-troubleshooting/wordpress-67-textdomain-just-in-time-notice.md)
* [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md)
* [Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt](05-troubleshooting/ollama-chat-template-pipe-stdin-bypass.md)
* [Claude Code Won't Log In (Warp & iTerm2) — Corrupt Keychain Credential](05-troubleshooting/claude-code-warp-login-corrupt-keychain-credential.md)
* [Claude Code Keychain Prompt Keeps Reappearing on macOS (ACL Invalidation)](05-troubleshooting/claude-code-keychain-prompt-recurring-macos.md)
* [iPhone Mirroring Hangs on 'Connecting…' — AWDL Data Stall (27.0 Beta)](05-troubleshooting/iphone-mirroring-connecting-hang-awdl-stall-beta.md)
* [rsync over Tailscale: Hung in TCP Teardown After Transfer Completes](05-troubleshooting/networking/rsync-tailscale-teardown-stall.md)
* [iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators](05-troubleshooting/networking/tailscale-status-json-hostname-localhost-ios.md)
* [macOS: Repeating Alert Tone from Mirrored iPhone Notification](05-troubleshooting/macos-mirrored-notification-alert-loop.md)
* [Auditing & Cleaning macOS Background App Activity (sfltool dumpbtm)](05-troubleshooting/macos-background-app-activity-audit-sfltool.md)
* [Time Machine: Orphaned APFS `.previous` Folder Blocks All Backups](05-troubleshooting/time-machine-apfs-orphaned-previous-blocks-backup.md)
* [OBS Studio: Stale Script Paths After Windows Profile Rename](05-troubleshooting/obs-stale-script-paths-after-windows-profile-rename.md)
* [ClamAV CPU Spike: Safe Scheduling with nice/ionice](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md)
* [Logwatch Falsely Reports 'No freshclam updates' in ClamAV Daemon Mode](05-troubleshooting/security/freshclam-logwatch-false-no-updates.md)
@ -127,10 +141,16 @@ updated: 2026-05-15T09:00
* [Ansible: SSH Timeout During dnf upgrade on Fedora Hosts](05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md)
* [Ansible: regex_search Capture-Group Argument Fails in set_fact](05-troubleshooting/ansible-regex-search-set-fact-capture-group.md)
* [Ansible: Ubuntu Reboot Detection Misses Kernel Upgrades](05-troubleshooting/ansible-ubuntu-reboot-detection-kernel-mismatch.md)
* [Ansible: reboot.yml become Timeout on WSL2 Hosts (Exclude Them)](05-troubleshooting/ansible-reboot-become-timeout-wsl2.md)
* [Fedora Networking & Kernel Troubleshooting](05-troubleshooting/fedora-networking-kernel-recovery.md)
* [Systemd Session Scope Fails at Login](05-troubleshooting/systemd/session-scope-failure-at-login.md)
* [wget/curl: URLs with Special Characters Fail in Bash](05-troubleshooting/wget-url-special-characters.md)
* [Ansible: Check Mode False Positives in Verify/Assert Tasks](05-troubleshooting/ansible-check-mode-false-positives.md)
* [Ansible Fails with Permission Denied While `ssh <alias>` Works (Host Alias Bypass)](05-troubleshooting/ansible-ssh-host-alias-bypass.md)
* [SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure (No `Host` Block)](05-troubleshooting/networking/ssh-missing-host-block-magicdns-host-key-failure.md)
* [MagicDNS Names vs Pinned IPs for Tailscale SSH (After a Fleet Migration)](05-troubleshooting/networking/tailscale-ssh-magicdns-vs-pinned-ip-after-migration.md)
* [`Permission denied (publickey)` After Rotating a Key — Backfill Every `authorized_keys`](05-troubleshooting/networking/ssh-rotated-key-not-backfilled-authorized-keys.md)
* [Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration](05-troubleshooting/networking/ansible-host-key-verification-failed-rebuilt-host.md)
* [Logwatch Reports the Wrong Hostname (`<host>-hetzner`) After a Migration](05-troubleshooting/logwatch-wrong-hostname-after-migration.md)
* [Ghost EmailAnalytics Lag Warning — What It Means and When to Worry](05-troubleshooting/ghost-emailanalytics-lag-warning.md)
* [claude-mem: --setting-sources Empty Arg Bug (Claude Code 2.1.x)](05-troubleshooting/claude-mem-setting-sources-empty-arg.md)