Compare commits
No commits in common. "main" and "code/majormac/wiki-draft-warp-iphonemirroring" have entirely different histories.
28 changed files with 16 additions and 2595 deletions
|
|
@ -1,119 +0,0 @@
|
|||
---
|
||||
title: WSL2 In-Place Upgrade to Fedora 44 (with gcc14 Blocker + CUDA Repo Swap)
|
||||
domain: linux
|
||||
category: distro-specific
|
||||
tags:
|
||||
- wsl2
|
||||
- fedora
|
||||
- windows
|
||||
- upgrade
|
||||
- dnf
|
||||
- cuda
|
||||
- majorrig
|
||||
status: published
|
||||
created: 2026-06-11
|
||||
updated: 2026-06-11
|
||||
---
|
||||
|
||||
# WSL2 In-Place Upgrade to Fedora 44 (with gcc14 Blocker + CUDA Repo Swap)
|
||||
|
||||
In-place upgrade of the FedoraLinux-43 WSL2 instance on MajorRig to Fedora 44 using `dnf system-upgrade` + `dnf5 offline reboot`. Hit one transaction blocker (`gcc14` compat package retired in F44) and swapped the stale `cuda-fedora39` repo to `cuda-fedora44` afterward. Performed 2026-06-11.
|
||||
|
||||
## The Short Answer
|
||||
|
||||
```powershell
|
||||
# PowerShell — backup first
|
||||
wsl --shutdown
|
||||
wsl --export FedoraLinux-43 D:\backups\fedora43.tar
|
||||
```
|
||||
|
||||
```bash
|
||||
# Inside Fedora
|
||||
sudo dnf upgrade --refresh -y
|
||||
sudo shutdown -h now
|
||||
# relaunch, then:
|
||||
sudo dnf remove gcc14-c++ gcc14 # F44 dropped gcc14 — blocks the transaction
|
||||
sudo dnf system-upgrade download --releasever=44
|
||||
sudo dnf5 offline reboot # applies offline upgrade, shuts distro down
|
||||
# wait a few minutes, relaunch:
|
||||
cat /etc/fedora-release # → Fedora release 44 (Forty Four)
|
||||
```
|
||||
|
||||
```powershell
|
||||
# PowerShell — keep WSL itself current
|
||||
wsl --update
|
||||
```
|
||||
|
||||
## Steps
|
||||
|
||||
1. **Back up the instance** (PowerShell). The export tar is roughly the size of the installed system — this one was 86 GB. The target directory must already exist or you get `Wsl/ERROR_PATH_NOT_FOUND`.
|
||||
|
||||
```powershell
|
||||
wsl --shutdown
|
||||
mkdir D:\backups
|
||||
wsl --export FedoraLinux-43 D:\backups\fedora43.tar
|
||||
```
|
||||
|
||||
2. **Fully update the current release, then restart the distro**
|
||||
|
||||
```bash
|
||||
sudo dnf upgrade --refresh -y
|
||||
sudo shutdown -h now
|
||||
```
|
||||
|
||||
3. **Remove upgrade blockers.** `gcc14`/`gcc14-c++` (compat packages) were retired in Fedora 44, so the transaction fails with "does not belong to a distupgrade repository". Remove them (or use `--allowerasing` and review the summary):
|
||||
|
||||
```bash
|
||||
sudo dnf remove gcc14-c++ gcc14
|
||||
```
|
||||
|
||||
4. **Download and apply the upgrade**
|
||||
|
||||
```bash
|
||||
sudo dnf system-upgrade download --releasever=44
|
||||
sudo dnf5 offline reboot
|
||||
```
|
||||
|
||||
The "reboot" applies the offline transaction and shuts the distro down — there's no real systemd reboot in WSL. Wait a couple of minutes, then relaunch. If it errors on `systemctl`, the fallback is:
|
||||
|
||||
```bash
|
||||
export DNF_SYSTEM_UPGRADE_NO_REBOOT=1
|
||||
sudo -E dnf system-upgrade reboot
|
||||
```
|
||||
|
||||
5. **Verify and tidy up**
|
||||
|
||||
```bash
|
||||
cat /etc/fedora-release # Fedora release 44 (Forty Four)
|
||||
sudo dnf upgrade --refresh # catch post-upgrade updates
|
||||
gcc --version # F44 ships gcc 16; reinstall with `dnf install gcc gcc-c++` if removed
|
||||
```
|
||||
|
||||
```powershell
|
||||
wsl --update # fixes the post-upgrade Wsl/Service/E_UNEXPECTED catastrophic failure some users hit
|
||||
```
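To script the step 5 verification instead of reading it by eye, a minimal sketch along these lines may help. It assumes the standard `VERSION_ID` key in `/etc/os-release`; the helper name `os_release_version` is hypothetical, not part of dnf or WSL.

```bash
# Hypothetical helper: print VERSION_ID from an os-release style file.
# Defaults to /etc/os-release; pass another path to check a copy.
os_release_version() {
  local file="${1:-/etc/os-release}"
  # Some distros quote the value, so strip double quotes.
  sed -n 's/^VERSION_ID=//p' "$file" | tr -d '"'
}

# Example gate (commented out so sourcing this file has no side effects):
# [ "$(os_release_version)" = "44" ] || echo "upgrade did not apply" >&2
```

Reading `VERSION_ID` rather than parsing `/etc/fedora-release` keeps the check independent of the human-readable release string.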
|
||||
|
||||
## CUDA Repo Swap
|
||||
|
||||
`dnf repolist` still showed `cuda-fedora39-x86_64` — NVIDIA repos are pinned per Fedora release and don't follow distro upgrades. NVIDIA publishes a fedora44 repo:
|
||||
|
||||
```bash
|
||||
sudo rm /etc/yum.repos.d/cuda-fedora39*.repo
|
||||
sudo dnf config-manager addrepo --from-repofile=https://developer.download.nvidia.com/compute/cuda/repos/fedora44/x86_64/cuda-fedora44.repo
|
||||
sudo dnf upgrade --refresh
|
||||
sudo dnf repolist # confirm cuda-fedora44-x86_64
|
||||
```
|
||||
|
||||
**WSL caveat:** never install the NVIDIA *driver* inside WSL — the Windows host driver provides the GPU. Only install toolkit packages (e.g. `cuda-toolkit`).
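Because the stale repo only shows up if you go looking for it, a small check can flag it after any release upgrade. This is a sketch under assumptions: it presumes repo files follow the `cuda-fedoraNN*.repo` naming seen above, and `stale_cuda_repos` is a hypothetical helper name.

```bash
# Hypothetical helper: list cuda-fedoraNN repo files that do not match
# the given Fedora release number.
#   $1 = repo directory (e.g. /etc/yum.repos.d), $2 = release (e.g. 44)
stale_cuda_repos() {
  local dir="$1" release="$2" f
  for f in "$dir"/cuda-fedora*.repo; do
    [ -e "$f" ] || continue                 # glob matched nothing
    case "$(basename "$f")" in
      "cuda-fedora${release}"[!0-9]*) ;;    # matches the current release
      *) printf '%s\n' "$f" ;;              # anything else is stale
    esac
  done
}

# Example: stale_cuda_repos /etc/yum.repos.d 44
```

Empty output means nothing needs swapping; any path printed is a candidate for the `rm` + `addrepo` sequence above.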
|
||||
|
||||
## Gotchas & Notes
|
||||
|
||||
- **Don't jump more than two releases** at once; for a larger gap, upgrade in stages.
|
||||
- **The WSL distro name is just a Windows label** — it still says "FedoraLinux-43" after the upgrade. Cosmetic fixes: Windows Terminal profile name, Start Menu shortcut, and `DistributionName`/`ShortcutPath` under `HKCU\Software\Microsoft\Windows\CurrentVersion\Lxss\{uuid}`.
|
||||
- **Keep the backup tar** until the upgraded instance has proven stable for a few days, then delete to reclaim the space.
|
||||
- **Restore path if needed:** `wsl --import FedoraRestore C:\WSL\FedoraRestore D:\backups\fedora43.tar` — remember imports default to root; fix via `/etc/wsl.conf` `[user] default=majorlinux`.
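The release-gap rule can be checked before downloading anything. A minimal sketch, assuming the limit is a span of at most two releases per jump (for example 42 to 44); `upgrade_gap_ok` is a hypothetical helper, not a dnf feature.

```bash
# Hypothetical helper: succeed only if the target release is one or two
# releases ahead of the current one.
#   $1 = current release number, $2 = target release number
upgrade_gap_ok() {
  local current="$1" target="$2" gap
  gap=$(( target - current ))
  [ "$gap" -ge 1 ] && [ "$gap" -le 2 ]
}

# Example:
# upgrade_gap_ok 43 44 && sudo dnf system-upgrade download --releasever=44
```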
|
||||
|
||||
## See Also
|
||||
|
||||
- [WSL2 Instance Migration (Fedora 43)](wsl2-instance-migration-fedora43.md)
|
||||
- [WSL2 Backup via PowerShell](wsl2-backup-powershell.md)
|
||||
|
|
@ -23,14 +23,7 @@ A collection of guides covering Linux administration, shell scripting, networkin
|
|||
- [Ansible Getting Started](shell-scripting/ansible-getting-started.md)
|
||||
- [Bash Scripting Patterns](shell-scripting/bash-scripting-patterns.md)
|
||||
|
||||
## Storage
|
||||
|
||||
- [SnapRAID & MergerFS Storage Setup](storage/snapraid-mergerfs-setup.md)
|
||||
- [mdadm — Rebuilding a RAID Array After Reinstall](storage/mdadm-raid-rebuild.md)
|
||||
- [Growing an LVM Volume by Absorbing Another Disk](storage/lvm-grow-volume-absorb-disk.md)
|
||||
|
||||
## Distro-Specific
|
||||
|
||||
- [Linux Distro Guide for Beginners](distro-specific/linux-distro-guide-beginners.md)
|
||||
- [WSL2 Instance Migration to Fedora 43](distro-specific/wsl2-instance-migration-fedora43.md)
|
||||
- [WSL2 In-Place Upgrade to Fedora 44](distro-specific/wsl2-fedora44-inplace-upgrade.md)
|
||||
|
|
|
|||
|
|
@ -1,159 +0,0 @@
|
|||
---
|
||||
title: "Growing an LVM Volume by Absorbing Another Disk"
|
||||
domain: linux
|
||||
category: storage
|
||||
tags: [lvm, lvextend, vgextend, pvcreate, resize2fs, ext4, storage, disk, homelab]
|
||||
status: published
|
||||
created: 2026-06-17
|
||||
updated: 2026-06-17
|
||||
---
|
||||
|
||||
# Growing an LVM Volume by Absorbing Another Disk
|
||||
|
||||
When an LVM-backed filesystem fills up and its volume group (VG) has no free
|
||||
extents, you can grow it by adding a second physical disk as a new physical
|
||||
volume (PV), extending the VG onto it, then extending the logical volume (LV)
|
||||
and its filesystem. With ext4 this can be done **online** — no unmount, no
|
||||
downtime for the volume being grown.
|
||||
|
||||
This guide covers the common case where the disk you want to absorb is currently
|
||||
in use by its own LVM volume (you must evacuate and tear that down first), and
|
||||
the precautions that keep it safe.
|
||||
|
||||
> [!warning] This enlarges your failure domain
|
||||
> A single LV spanning two disks linearly (the default — no RAID/mirror) means
|
||||
> **losing either disk loses the entire volume.** ext4 has no parity. Only do
|
||||
> this for data you can rebuild, or layer redundancy (mdadm/LVM RAID) underneath.
|
||||
> Back up anything irreplaceable first.
|
||||
|
||||
## The Short Answer
|
||||
|
||||
If the target disk (`/dev/sdX`) is already empty and unused:
|
||||
|
||||
```bash
|
||||
sudo pvcreate /dev/sdX
|
||||
sudo vgextend myvg /dev/sdX
|
||||
sudo lvextend -l +100%FREE /dev/myvg/mylv
|
||||
sudo resize2fs /dev/mapper/myvg-mylv # ext4, online; use xfs_growfs for XFS
|
||||
```
|
||||
|
||||
The rest of this article handles the harder case: the target disk is currently
|
||||
holding its own LVM volume with data on it.
|
||||
|
||||
## Step-by-Step
|
||||
|
||||
### 1. Survey the current layout
|
||||
|
||||
```bash
|
||||
sudo pvs # physical volumes → which VG each belongs to
|
||||
sudo vgs # volume groups, free extents (VFree)
|
||||
sudo lvs # logical volumes and sizes
|
||||
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
|
||||
df -h
|
||||
```
|
||||
|
||||
Confirm:
|
||||
|
||||
- The VG you want to grow (`myvg`) has `0` `VFree` (that's why you're here).
|
||||
- The disk you want to absorb (`/dev/sdX`) is a **standalone** PV — not a member
|
||||
of an mdadm array, a mergerfs branch, or a SnapRAID parity disk. Repurposing a
|
||||
disk that something else depends on will break that thing silently.
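
The free-extent check can be scripted rather than read off the `vgs` table. A minimal sketch, assuming the `vg_free_count` report field of `vgs`; the helper name `vg_free_extents` is hypothetical.

```bash
# Hypothetical helper: print the number of unallocated extents in a VG.
#   $1 = volume group name
vg_free_extents() {
  # --noheadings drops the column title; tr removes the padding vgs adds.
  sudo vgs --noheadings -o vg_free_count "$1" | tr -d '[:space:]'
}

# Example:
# [ "$(vg_free_extents myvg)" -eq 0 ] && echo "myvg has no free extents"
```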
|
||||
|
||||
### 2. Evacuate the disk you're about to absorb
|
||||
|
||||
Anything on the target disk will be **destroyed**. Move it somewhere with room to
|
||||
spare, then prove the copy is intact before you trust it.
|
||||
|
||||
```bash
|
||||
# Copy preserving permissions/timestamps
|
||||
sudo rsync -a /mnt/olddisk/important /destination/with/space/
|
||||
|
||||
# Verify byte-for-byte — empty output + exit code 0 means identical
|
||||
sudo diff -rq /mnt/olddisk/important /destination/with/space/important && echo OK
|
||||
```
|
||||
|
||||
For large trees the `diff -rq` (full byte comparison) is slow but is the
|
||||
authoritative check — don't skip it before the destructive phase. If an
|
||||
application tracks files by path (databases, media servers), update its path
|
||||
references to the new location *now*, while the old copy still exists as a
|
||||
fallback.
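
To make the verification a hard gate rather than something you read off the terminal, the comparison can be wrapped so a mismatch returns a failing status. This is a sketch with a hypothetical helper name (`verify_copy`); note that `diff -rq` compares file contents only, not ownership or permissions.

```bash
# Hypothetical helper: compare two trees byte-for-byte and fail loudly
# if they differ.
#   $1 = source directory, $2 = copied directory
verify_copy() {
  local src="$1" dst="$2"
  if diff -rq "$src" "$dst" >/dev/null; then
    echo "OK: $dst matches $src"
  else
    echo "MISMATCH: do not proceed to teardown" >&2
    return 1
  fi
}

# Example:
# verify_copy /mnt/olddisk/important /destination/with/space/important || exit 1
```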
|
||||
|
||||
### 3. Unmount and remove the old disk from fstab
|
||||
|
||||
```bash
|
||||
sudo fuser -m /mnt/olddisk # confirm nothing holds it open
|
||||
sudo umount /mnt/olddisk
|
||||
mountpoint -q /mnt/olddisk && echo "STILL MOUNTED" || echo "unmounted"
|
||||
|
||||
sudo cp /etc/fstab /etc/fstab.bak-$(date +%Y%m%d) # always back up fstab
|
||||
sudo sed -i '/olddisk/d' /etc/fstab # remove the stale entry
|
||||
grep olddisk /etc/fstab || echo "fstab line gone"
|
||||
```
|
||||
|
||||
> [!tip] Verify your `sed` pattern only matches the line you mean
|
||||
> A too-broad pattern can delete the wrong fstab entry. Check the file before and
|
||||
> after, and keep the backup until you've confirmed the system still boots.
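
One way to act on that tip is to count how many lines the pattern matches before letting `sed` delete anything. A minimal sketch; `fstab_match_count` is a hypothetical helper, and it relies on `grep` and `sed` both using basic regular expressions by default, so the same pattern matches the same lines.

```bash
# Hypothetical helper: print how many lines of a file match a pattern.
#   $1 = pattern (basic regex, same flavour sed uses), $2 = file (default /etc/fstab)
fstab_match_count() {
  local pattern="$1" file="${2:-/etc/fstab}"
  # grep -c prints 0 and exits non-zero when nothing matches; keep the count.
  grep -c -- "$pattern" "$file" || true
}

# Example: only delete when exactly one line matches.
# [ "$(fstab_match_count olddisk)" -eq 1 ] && sudo sed -i '/olddisk/d' /etc/fstab
```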
|
||||
|
||||
### 4. Tear down the old disk's LVM
|
||||
|
||||
```bash
|
||||
sudo lvremove -y /dev/oldvg/oldlv
|
||||
sudo vgremove -y oldvg
|
||||
sudo pvremove -y /dev/sdX # wipes the LVM label off the disk
|
||||
```
|
||||
|
||||
This is the point of no return for the old disk's data, which is why step 2
|
||||
verified the copy before anything was removed.
|
||||
|
||||
### 5. Add the disk to the target VG and extend
|
||||
|
||||
```bash
|
||||
sudo pvcreate -y /dev/sdX
|
||||
sudo vgextend myvg /dev/sdX
|
||||
sudo lvextend -l +100%FREE /dev/myvg/mylv
|
||||
```
|
||||
|
||||
`lvs`/`vgs` should now show the LV grown to span both disks and `0` free extents.
|
||||
|
||||
### 6. Grow the filesystem (online)
|
||||
|
||||
```bash
|
||||
# ext4 — works while mounted
|
||||
sudo resize2fs /dev/mapper/myvg-mylv
|
||||
|
||||
# XFS — grows online too, but takes the mountpoint, not the device
|
||||
sudo xfs_growfs /mountpoint
|
||||
```
|
||||
|
||||
`resize2fs` is idempotent — if it gets interrupted, just run it again; it reports
|
||||
"Nothing to do!" once the filesystem already fills the LV.
|
||||
|
||||
### 7. Verify
|
||||
|
||||
```bash
|
||||
df -h /mountpoint # should reflect the new larger size
|
||||
sudo pvs # /dev/sdX now listed under myvg
|
||||
sudo vgs myvg # two PVs, larger VSize
|
||||
```
|
||||
|
||||
## Notes & Gotchas
|
||||
|
||||
- **Online resize works for the volume being grown, not the one being removed.**
|
||||
The disk you absorb must be unmounted and torn down; the destination LV stays
|
||||
mounted throughout.
|
||||
- **`resize2fs` interruption is safe.** ext4 online resize is journaled; re-run it.
|
||||
- **macOS cruft on evacuated disks.** Trees touched by macOS often carry
|
||||
`._*` AppleDouble files and `.DS_Store` — harmless to drop, but they inflate
|
||||
file counts in `diff`/`rsync` output. Don't mistake them for real data.
|
||||
- **Check SMART on a disk you're promoting into a bigger role.** A disk with a
|
||||
pending-sector history is riskier once it's in the critical path for a whole
|
||||
multi-disk volume than it was holding a small isolated one.
|
||||
- **Mountpoint cleanup.** After the old disk is gone, its former mountpoint
|
||||
directory may reappear (it was shadowed by the mount). `rmdir` it if empty.
|
||||
Note `ls -A` exits `0` on an empty directory, so don't gate cleanup on its exit
|
||||
status — test contents explicitly.
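
For the mountpoint cleanup, testing the contents explicitly could look like the sketch below. `dir_is_empty` is a hypothetical helper; it checks the output of `ls -A` rather than its exit status.

```bash
# Hypothetical helper: succeed only if the path is a directory with no
# entries (hidden files count as entries). Returns 2 if it is not a directory.
dir_is_empty() {
  local dir="$1"
  [ -d "$dir" ] || return 2
  # ls -A lists everything except . and .. ; empty output means empty.
  [ -z "$(ls -A -- "$dir")" ]
}

# Example:
# dir_is_empty /mnt/olddisk && sudo rmdir /mnt/olddisk
```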
|
||||
|
||||
## Related
|
||||
|
||||
- [SnapRAID & MergerFS Storage Setup](snapraid-mergerfs-setup.md) — add redundancy/parity instead of a linear span
|
||||
- [mdadm — Rebuilding a RAID Array After Reinstall](mdadm-raid-rebuild.md)
|
||||
|
|
@ -66,15 +66,14 @@ Every server in the fleet should have these. Check each one after migration:
|
|||
### After Migration
|
||||
|
||||
1. **Set the timezone** — `timedatectl set-timezone America/New_York` (US) or `Europe/London` (UK). Hetzner images default to UTC.
|
||||
2. **Set the system hostname** — Hetzner provisions the box as `<host>-hetzner`. Run `hostnamectl set-hostname <host>` and fix the loopback line: `sed -i "s/127.0.1.1.*/127.0.1.1 <host> <host>/" /etc/hosts`. Skip this and **Logwatch emails arrive titled `Logwatch for <host>-hetzner`** weeks later. Do it alongside the Tailscale node rename and Postfix `myhostname` — all three read from the provisioning label. See [Logwatch wrong hostname after migration](../../05-troubleshooting/logwatch-wrong-hostname-after-migration.md).
|
||||
3. **Verify CA bundle (Fedora)** — `ls /etc/pki/tls/certs/ca-bundle.crt`. If missing, Postfix TLS, curl, and dnf will all fail silently. See [Fedora CA bundle fix](../../05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md).
|
||||
4. **Run `harden.yml` against the new host** — catches most gaps in one pass
|
||||
5. **Send a test email** — `echo test | mail -s "test" marcus@majorshouse.com` — if this fails, nothing else can alert you
|
||||
6. **Verify crond is running** — `systemctl is-active crond` (Fedora) or `systemctl is-active cron` (Ubuntu). cronie can be `enabled` but not `active` after provisioning.
|
||||
7. **Check Netdata Cloud** — verify the new node appears and alerts are flowing
|
||||
8. **Compare fail2ban jails** — `fail2ban-client status` on both old and new
|
||||
9. **Verify logwatch sends** — `sudo logwatch --output mail --range today`
|
||||
10. **Keep the old box powered off but not destroyed** for at least 7 days after remediation
|
||||
2. **Verify CA bundle (Fedora)** — `ls /etc/pki/tls/certs/ca-bundle.crt`. If missing, Postfix TLS, curl, and dnf will all fail silently. See [Fedora CA bundle fix](../../05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md).
|
||||
3. **Run `harden.yml` against the new host** — catches most gaps in one pass
|
||||
4. **Send a test email** — `echo test | mail -s "test" marcus@majorshouse.com` — if this fails, nothing else can alert you
|
||||
5. **Verify crond is running** — `systemctl is-active crond` (Fedora) or `systemctl is-active cron` (Ubuntu). cronie can be `enabled` but not `active` after provisioning.
|
||||
6. **Check Netdata Cloud** — verify the new node appears and alerts are flowing
|
||||
7. **Compare fail2ban jails** — `fail2ban-client status` on both old and new
|
||||
8. **Verify logwatch sends** — `sudo logwatch --output mail --range today`
|
||||
9. **Keep the old box powered off but not destroyed** for at least 7 days after remediation
|
||||
|
||||
### Using doctl to Manage Old Droplets
|
||||
|
||||
|
|
|
|||
|
|
@ -38,7 +38,6 @@ Guides for running your own services at home, including Docker, reverse proxies,
|
|||
- [Mastodon Federation](services/mastodon-federation.md)
|
||||
- [Mastodon `--prune-profiles` Trap](services/mastodon-prune-profiles-trap.md)
|
||||
- [Mastodon on S3 — Silent Upload Failures](services/mastodon-s3-acl-upload-failures.md)
|
||||
- [Mastodon — Triaging Crowdfunding / Mention-Spam Accounts](services/mastodon-mention-spam-crowdfunding.md)
|
||||
- [Ghost SMTP via Mailgun](services/ghost-smtp-mailgun-setup.md)
|
||||
- [Updating n8n Docker](services/updating-n8n-docker.md)
|
||||
- [Claude Code Remote Control](services/claude-code-remote-control.md)
|
||||
|
|
|
|||
|
|
@ -235,12 +235,9 @@ sed -i '/^127\.0\.1\.1/d' /etc/hosts && \
|
|||
systemctl reload postfix
|
||||
```
|
||||
|
||||
> [!tip] Same drift, different symptom: the Logwatch **title**
|
||||
> Hetzner provisions boxes with `<host>-hetzner` as the *system* hostname. When that's never corrected, Logwatch (which reads the live hostname at runtime) mails reports titled `Logwatch for <host>-hetzner` — no postfix involvement needed. Same `hostnamectl set-hostname` + `/etc/hosts` fix as above. See [Logwatch wrong hostname after migration](../../05-troubleshooting/logwatch-wrong-hostname-after-migration.md).
|
||||
|
||||
### 2. Empty `relayhost` quietly forces public-MX delivery
|
||||
|
||||
If `postconf relayhost` returns an empty value, postfix doesn't fail — it just does an MX lookup for the destination domain and tries to deliver directly. For mail to your own mail server, that means going via the **public MX** (the domain's external MX record, e.g., `mail.majorshouse.com → 203.0.113.10:25`) instead of the **internal/Tailscale relay path** the rest of the fleet uses.
|
||||
If `postconf relayhost` returns an empty value, postfix doesn't fail — it just does an MX lookup for the destination domain and tries to deliver directly. For mail to your own mail server, that means going via the **public MX** (the domain's external MX record, e.g., `mail.majorshouse.com → 165.227.187.191:25`) instead of the **internal/Tailscale relay path** the rest of the fleet uses.
|
||||
|
||||
The public-MX path is subject to whatever spam filtering, content checks, and trust rules the receiving MX has configured for external traffic. Internal Tailscale-IP traffic typically gets a faster trust shortcut (e.g., bypass spamchk pipe). So this single configuration drift causes one host's mail to land in a different code path than its siblings — and then silently get filtered.
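
Since an empty `relayhost` fails silently, it is a reasonable candidate for a fleet check. A minimal sketch, assuming `postconf -h` (print the value without the `name =` label) is available on the host; `relayhost_is_empty` is a hypothetical helper name.

```bash
# Hypothetical helper: succeed if Postfix has no relayhost configured.
# Returns 2 if postconf itself fails.
relayhost_is_empty() {
  local value
  value="$(postconf -h relayhost)" || return 2
  # Treat whitespace-only output as empty.
  [ -z "$(printf '%s' "$value" | tr -d '[:space:]')" ]
}

# Example:
# relayhost_is_empty && echo "WARNING: mail will go out via public MX" >&2
```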
|
||||
|
||||
|
|
|
|||
|
|
@ -1,130 +0,0 @@
|
|||
---
|
||||
title: "Migrating Flat Ansible Playbooks to Roles (Safely)"
|
||||
domain: selfhosting
|
||||
category: security
|
||||
tags: [ansible, roles, refactor, fleet, migration, fail2ban, infrastructure]
|
||||
status: published
|
||||
created: 2026-06-18
|
||||
updated: 2026-06-18
|
||||
---
|
||||
# Migrating Flat Ansible Playbooks to Roles (Safely)
|
||||
|
||||
## Overview
|
||||
|
||||
A fleet repo tends to grow a sprawl of flat `configure_*.yml` playbooks — one per subsystem, plus near-duplicates for variants (e.g. ~10 `configure_fail2ban_*` playbooks), all sharing a single overloaded top-level `templates/` directory. It works, but it resists reuse: there is no clean `defaults/` precedence, no encapsulation, and no way to compose a host's full configuration in one place.
|
||||
|
||||
Ansible **roles** fix this — but migrating a *live* fleet is where it gets dangerous. The risk is not the refactor itself; it's accidentally changing deployed behaviour while you "just reorganize." This article covers the incremental, regression-free approach used to migrate an 11-host fleet, including the two techniques that keep it safe: **byte-identical migration** and **capture-based reconciliation**.
|
||||
|
||||
> This is a process/pattern article. For the specific roles in this fleet, see the internal runbook. The techniques here generalize to any flat-playbook → role migration.
|
||||
|
||||
## Decide What Becomes a Role vs. What Stays a Playbook
|
||||
|
||||
Not everything should be a role. Draw the line by purpose:
|
||||
|
||||
| Becomes a role | Stays a playbook |
|
||||
|---|---|
|
||||
| Reusable host **configuration** (a subsystem you converge to a desired state) | **Ops / one-off** actions: `update`, `reboot`, `harden`, `bootstrap`, `provision`, `fix_*`, `verify_*` |
|
||||
| Has templates/files, defaults, handlers | Orchestrators that just `import_playbook` other things |
|
||||
| Applied repeatedly and idempotently | Run-once or run-as-needed remediation |
|
||||
|
||||
Roles get the standard `roles/<name>/` layout (`tasks/`, `defaults/`, `handlers/`, `templates/`, `files/`, `meta/`). Name them after the **subsystem noun** (`fail2ban`, `clamav`, `firewall`) — drop the `configure_` verb prefix.
|
||||
|
||||
## The Incremental Loop (one role per branch)
|
||||
|
||||
Migrate **one subsystem per branch** and validate before merging. This keeps every change small enough to diff by eye and roll back cleanly:
|
||||
|
||||
1. `git mv` the templates/files into `roles/<name>/` so **git tracks them as renames** (history preserved, 100% rename score).
|
||||
2. Move task bodies into `roles/<name>/tasks/` (split by lifecycle: install → service → config → verify).
|
||||
3. Lift tunables into `roles/<name>/defaults/main.yml`; keep per-host overrides in `group_vars`/`host_vars`.
|
||||
4. Add a thin entry playbook `<name>.yml` (`hosts: <group>` + `roles: [<name>]`).
|
||||
5. Validate with `--check --diff` against a single host **before** merging.
|
||||
6. Merge, then move to the next subsystem.
|
||||
|
||||
## Technique 1: Byte-Identical Migration
|
||||
|
||||
When the goal is "reorganize without changing behaviour," **prove** it. After moving a playbook into a role, the rendered task bodies should be identical to the original. Verify with a normalized diff against `main`:
|
||||
|
||||
```bash
|
||||
# Compare the role's task body against the original flat playbook,
|
||||
# ignoring only comments/whitespace you intend to change.
|
||||
git show main:configure_clamav.yml > /tmp/old.yml
|
||||
# ...extract the task list from roles/clamav/tasks/*.yml and diff
|
||||
diff <(yq '.[] | .tasks' /tmp/old.yml) <(cat roles/clamav/tasks/*.yml)
|
||||
```
|
||||
|
||||
The acceptance bar: `--check --diff` against a real host returns **`changed=0`** (or only the diffs you explicitly intended, like a doc-comment line). If a "faithful" migration shows unexpected `changed=N`, you altered behaviour — stop and reconcile before merging. Templates moved via `git mv` show as **100% renames** in `git show --stat`, which is your proof the deployed content is unchanged.
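
The rename proof can be checked mechanically instead of scanning `--stat` output. A sketch under assumptions: it reads the output of `git diff -M --name-status <base>` on stdin, where renames appear as `R<score>`, and `non_identical_renames` is a hypothetical helper name.

```bash
# Hypothetical helper: print renames whose similarity score is below 100.
# stdin: output of `git diff -M --name-status <base>` (tab-separated).
non_identical_renames() {
  awk -F'\t' '$1 ~ /^R/ && $1 != "R100" { print $2 " -> " $3 " (" $1 ")" }'
}

# Example: empty output means every moved file is byte-identical.
# git diff -M --name-status main | non_identical_renames
```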
|
||||
|
||||
## Technique 2: Consolidating Near-Duplicates with Feature Flags
|
||||
|
||||
The big win is collapsing a family of near-duplicate playbooks (the ~10 `configure_fail2ban_*`) into **one role with flag-gated task files**:
|
||||
|
||||
```yaml
|
||||
# group_vars/<group>.yml — hosts self-select which jails/components they get
|
||||
fail2ban_jail_sshd: true
|
||||
fail2ban_jail_wordpress: true
|
||||
fail2ban_jail_nginx_bad_request: false
|
||||
```
|
||||
|
||||
```yaml
|
||||
# roles/fail2ban/tasks/main.yml
|
||||
- import_tasks: jail_wordpress.yml
|
||||
when: fail2ban_jail_wordpress | default(false)
|
||||
```
|
||||
|
||||
> **Critical gotcha — key flags to inventory GROUPS, not `ansible_os_family`.** It is tempting to gate OS-specific task files on `ansible_os_family == 'Debian'`. Don't. Inventory groups frequently include hosts the *original playbooks deliberately excluded* (e.g. a LAN-only Debian box that should get the network-wait step but **not** the public SSH bind, or a WSL host in the `fedora` group that must be skipped). Keep the original curated host patterns and set the flag per play/group. Keying on `os_family` silently widens a play's host set and is exactly how a "refactor" pushes config to a host that never had it.
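
A widened host set is easy to detect if you compare the host lists before and after the refactor. A minimal sketch, assuming you have saved one hostname per line for the old playbook and the new role play (for instance extracted from `ansible-playbook --list-hosts` output); `widened_hosts` is a hypothetical helper.

```bash
# Hypothetical helper: print hosts that appear in the new list but not
# in the old one.
#   $1 = file with the old play's hosts, $2 = file with the new play's hosts
widened_hosts() {
  local before="$1" after="$2"
  # -F fixed strings, -x whole-line match, -v invert: lines only in "after".
  grep -Fvx -f "$before" "$after" || true
}

# Example: any output is a host the refactor would newly touch.
# widened_hosts hosts-before.txt hosts-after.txt
```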
|
||||
|
||||
## Technique 3: Capture-Based Reconciliation (the safety net)
|
||||
|
||||
This is the one that prevents an outage. Sometimes a role gets written as a **fresh re-implementation** of a subsystem rather than a faithful move — a cleaner `jail.local`, new drop-ins, a different default set. It may even be merged into `site.yml`. The trap: that role has **never been rolled out**, and its config *diverges* from what's actually deployed.
|
||||
|
||||
Running it would push divergent config to a live, security-sensitive subsystem (intrusion protection, firewall) across the whole fleet on the next `harden.yml`.
|
||||
|
||||
The check that catches it:
|
||||
|
||||
```bash
|
||||
ansible-playbook fail2ban.yml --check --diff --limit <host>
|
||||
# Divergent role => changed=8-12 per host + failures (missing filters/timers)
|
||||
# Faithful role => changed=0, failed=0
|
||||
```
|
||||
|
||||
**Capture-based reconciliation** is the fix: instead of pushing the role's idea of "correct," bring the **role into parity with the live, working config** first. Capture what's actually deployed, fold it into the role's templates/defaults until `--check` is clean fleet-wide, *then* switch the orchestrator over and retire the old playbooks. Order of operations:
|
||||
|
||||
1. **Decide the source of truth** — the live config or the new role. For security subsystems, the live (working) config wins.
|
||||
2. **Reconcile** the role to match live until `--check` shows `changed=0, failed=0` on every host.
|
||||
3. **Roll out host-by-host** with real runs; verify the service restarts cleanly and (for fail2ban) jails are actually active.
|
||||
4. **Only then** delete the old playbooks, rewire `harden.yml`/`bootstrap.yml`, and remove the orphaned top-level templates.
|
||||
|
||||
Never delete the old mechanism until the new one is proven converged everywhere. "It's in `site.yml`" is not the same as "it's been rolled out."
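
The `changed=0, failed=0` gate can be enforced by parsing the `PLAY RECAP` rather than reading it. This is a sketch, not part of Ansible: `recap_is_clean` is a hypothetical helper, and treating `unreachable` hosts as a failure is an added assumption.

```bash
# Hypothetical helper: succeed only if every host line in the PLAY RECAP
# reports changed=0, failed=0 and unreachable=0.
# stdin: full ansible-playbook output.
recap_is_clean() {
  awk '
    /^PLAY RECAP/ { in_recap = 1; next }
    in_recap && /changed=/ {
      seen = 1
      if ($0 !~ /changed=0( |$)/ || $0 !~ /failed=0( |$)/ || $0 !~ /unreachable=0( |$)/)
        bad = 1
    }
    # No recap lines at all counts as not clean.
    END { exit ((seen && !bad) ? 0 : 1) }
  '
}

# Example:
# ansible-playbook fail2ban.yml --check --diff --limit <host> | recap_is_clean
```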
|
||||
|
||||
## Composition: `site.yml`, `harden.yml`, `bootstrap.yml`
|
||||
|
||||
Once subsystems are roles, compose them with thin orchestrators that `import_playbook` the role entry points — so each subsystem keeps a **single source of truth** for its host mapping:
|
||||
|
||||
```yaml
|
||||
# site.yml — day-to-day fleet convergence, in dependency order
|
||||
- import_playbook: swap.yml
|
||||
- import_playbook: tailscale.yml
|
||||
- import_playbook: ssh_hardening.yml
|
||||
- import_playbook: firewall.yml
|
||||
- import_playbook: fail2ban.yml
|
||||
- import_playbook: clamav.yml
|
||||
```
|
||||
|
||||
Order matters: base layer (swap) → networking (tailscale) → access (ssh_hardening) → perimeter (firewall) → intrusion protection (fail2ban) → malware scanning (clamav). Bootstrap-only roles (guest agent, root password, provisioning prerequisites) belong in `bootstrap.yml`, not `site.yml`.
|
||||
|
||||
## Verification Checklist
|
||||
|
||||
- [ ] Templates moved with `git mv` (show as 100% renames)
|
||||
- [ ] `--check --diff` on a real host = `changed=0` (or only intended diffs)
|
||||
- [ ] Consolidation flags keyed to **inventory groups**, not `ansible_os_family`
|
||||
- [ ] Re-implemented roles reconciled to live parity **before** rollout (no surprise `changed=N`)
|
||||
- [ ] Security subsystems rolled out host-by-host with service-active verification
|
||||
- [ ] Old playbooks/templates deleted **only after** the role is converged fleet-wide
|
||||
- [ ] Orchestrators (`site.yml`/`harden.yml`/`bootstrap.yml`) rewired; stale references swept
|
||||
|
||||
## Related
|
||||
|
||||
- [SSH Hardening Fleet-Wide with Ansible](ssh-hardening-ansible-fleet.md)
|
||||
- [ClamAV Fleet Deployment with Ansible](clamav-fleet-deployment.md)
|
||||
- [Firewall Hardening with firewalld on Fedora Fleet](firewalld-fleet-hardening.md)
|
||||
- [Standardizing unattended-upgrades with Ansible](ansible-unattended-upgrades-fleet.md)
|
||||
|
|
@ -1,170 +0,0 @@
|
|||
---
|
||||
title: "Mastodon — Triaging Crowdfunding / Mention-Spam Accounts"
|
||||
description: How to tell broadcast fundraising solicitation from genuine mentions, investigate the account and its origin instance with SQL + nodeinfo, and pick a proportionate moderation action.
|
||||
tags:
|
||||
- mastodon
|
||||
- moderation
|
||||
- abuse
|
||||
- federation
|
||||
- self-hosting
|
||||
created: 2026-06-22
|
||||
updated: 2026-06-22
|
||||
---
|
||||
|
||||
# Mastodon — Triaging Crowdfunding / Mention-Spam Accounts
|
||||
|
||||
If you run a Mastodon instance, sooner or later you (or your users) start getting tagged by accounts you've never interacted with, posting donation appeals with a link and a wall of hashtags. Some are real people in desperate situations; some are recycled-link scams. Either way, when an account is **broadcasting a solicitation at you** rather than replying to you, it's a moderation question, not a conversation.
|
||||
|
||||
This article is the runbook for telling the two apart, investigating both the **account** and its **origin instance**, and choosing an action that's proportionate instead of nuking eight years of legit federation over two bad actors.
|
||||
|
||||
## TL;DR
|
||||
|
||||
- A mention is **broadcast spam**, not engagement, when it's a *standalone post* (not a reply) that *tags a large fixed list* of accounts and carries a *donation link*, usually from a *throwaway profile* on an *open-registration instance*.
|
||||
- Investigate before acting: pull the account's age/stats/bio and check whether the post is a reply or a 40-way blast (SQL below). Profile the origin instance via its public `nodeinfo`.
|
||||
- **Default action is an account-level block**, which also federates and removes their follow of you. Escalate to domain-limit / domain-block only when *one instance* produces *repeat offenders*.
|
||||
- Keep a log so single incidents that are actually a pattern become visible.
|
||||
|
||||
## Signals that a mention is broadcast solicitation
|
||||
|
||||
Score it on how many of these hold:
|
||||
|
||||
| Signal | Why it matters |
|
||||
|---|---|
|
||||
| **Standalone post, not a reply** (`in_reply_to_account_id IS NULL`) but still tags you | They're broadcasting, not responding |
|
||||
| **Tags a large fixed recipient list** (e.g. 40+) | Mass distribution; the same list reused across senders = coordination |
|
||||
| **Donation link** in post or bio (`chuffed.org`, `gofundme`, `paypal.me`, `ko-fi`) | The payload |
|
||||
| **Throwaway profile** — days old, few followers, follows you but you don't follow back | Disposable, baiting a profile view |
|
||||
| **Mass-follow ratio** — following thousands / few hundred followers | Engagement farming |
|
||||
| **"I am not a scammer" disclaimer** in bio | Known red-flag phrase |
|
||||
| **Origin instance: open registration, no approval** | Easy throwaway-account farm |
|
||||
|
||||
> [!warning] Judgment, not a purity test
|
||||
> Many of these accounts are real people. The goal is not to adjudicate need — it's to stop *broadcast solicitation aimed at you* and track the *source instances*. Prefer the lightest action that stops it.
|
||||
|
||||
## Investigate the account
|
||||
|
||||
Connect to the DB on the instance:
|
||||
|
||||
```bash
|
||||
ssh <your-mastodon-host>
|
||||
sudo -u postgres psql mastodon_production
|
||||
```
|
||||
|
||||
**Profile + stats for a suspect** (age, post count, follower ratio, bio):
|
||||
|
||||
```sql
|
||||
SELECT a.username||'@'||a.domain,
|
||||
to_char(a.created_at,'YYYY-MM-DD') AS first_seen_locally,
|
||||
st.statuses_count, st.followers_count, st.following_count,
|
||||
left(regexp_replace(COALESCE(a.note,''),'<[^>]+>','','g'),200) AS bio
|
||||
FROM accounts a LEFT JOIN account_stats st ON st.account_id=a.id
|
||||
WHERE a.domain='<INSTANCE>' AND a.username='<HANDLE>';
|
||||
```
|
||||
|
||||
**Is the mention a reply or a blast?** `standalone=t` with a high `num_tagged` is the tell:
|
||||
|
||||
```sql
SELECT a.username, to_char(s.created_at,'YYYY-MM-DD HH24:MI') AS posted,
       s.in_reply_to_account_id IS NULL AS standalone,
       (SELECT count(*) FROM mentions mm WHERE mm.status_id=s.id) AS num_tagged
FROM mentions m JOIN statuses s ON s.id=m.status_id
JOIN accounts a ON a.id=s.account_id
JOIN accounts me ON me.id=m.account_id AND me.username='<YOU>' AND me.domain IS NULL
WHERE a.username='<HANDLE>' AND a.domain='<INSTANCE>'
ORDER BY s.created_at DESC;
```
|
||||
|
||||
**All recent direct mentions of you** (sweep for the wider pattern):
|
||||
|
||||
```sql
SELECT to_char(n.created_at,'YYYY-MM-DD HH24:MI') AS when,
       a.username||COALESCE('@'||a.domain,'@local') AS who,
       COALESCE(s.uri,'') AS uri,
       left(regexp_replace(COALESCE(s.text,''),'<[^>]+>','','g'),200) AS body
FROM notifications n
JOIN accounts recip ON recip.id=n.account_id AND recip.username='<YOU>' AND recip.domain IS NULL
JOIN accounts a ON a.id=n.from_account_id
LEFT JOIN mentions m ON m.id=n.activity_id AND n.activity_type='Mention'
LEFT JOIN statuses s ON s.id=m.status_id
WHERE n.type='mention' ORDER BY n.created_at DESC LIMIT 40;
```
|
||||
|
||||
## Profile the origin instance
|
||||
|
||||
Don't judge an instance by one bad account. Pull its public metadata — no auth needed:
|
||||
|
||||
```bash
# Software, version, user counts, registration policy
NI=$(curl -s https://<INSTANCE>/.well-known/nodeinfo | python3 -c 'import sys,json;print(json.load(sys.stdin)["links"][-1]["href"])')
curl -s "$NI" | python3 -m json.tool      # software, openRegistrations, usage.users

# Title, contact/admin, rules, registration approval flag
curl -s https://<INSTANCE>/api/v2/instance | python3 -m json.tool
```
|
||||
|
||||
What to read off it:
|
||||
|
||||
- **`openRegistrations: true` (nodeinfo) + `registrations.approval_required: false` (instance API)** → throwaway-account farm; expect more of the same.
|
||||
- **`usage.users.total` vs `usage.users.activeMonth`** (nodeinfo) → a huge dormant base is typical of sign-up-and-leave farms.
|
||||
- **Federation age on your side** — how long you've known the instance, how many of its accounts you cache. A long, broad relationship argues *against* a domain block.
|
||||
- **The instance's own rules** — many ban "backlink accounts" / harassment, which the mass-tag fundraising violates. That makes **reporting to its admin a legitimate, in-policy path.**
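
The registration and user-count checks above can be scripted. This is a minimal Python sketch, assuming the instance serves a NodeInfo 2.x document and Mastodon's `/api/v2/instance`. The names `fetch_json` and `summarize_instance` and the 10% dormancy cut-off are illustrative assumptions, not part of any tool.

```python
import json
import urllib.request


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body (these endpoints need no auth)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize_instance(nodeinfo, instance, dormant_ratio=0.10):
    """Reduce the two public documents to the signals discussed above.

    nodeinfo:      parsed NodeInfo 2.x document
    instance:      parsed /api/v2/instance document
    dormant_ratio: hypothetical cut-off; below it the base counts as dormant
    """
    users = nodeinfo.get("usage", {}).get("users", {})
    total = users.get("total") or 0
    active = users.get("activeMonth") or 0
    regs = instance.get("registrations", {})
    open_reg = bool(nodeinfo.get("openRegistrations"))
    approval = bool(regs.get("approval_required"))
    return {
        "software": nodeinfo.get("software", {}).get("name"),
        "open_no_approval": open_reg and not approval,
        "active_share": round(active / total, 3) if total else None,
        "mostly_dormant": bool(total) and active / total < dormant_ratio,
    }
```

To use it against a live host, pass `fetch_json(...)` results for the NodeInfo `href` and for `https://<INSTANCE>/api/v2/instance` into `summarize_instance`. Treat the output as one input to a judgment call, not a verdict.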
|
||||
|
||||
```sql
-- What your instance already knows about the domain
SELECT (SELECT count(*) FROM accounts WHERE domain='<INSTANCE>') AS known_accounts,
       (SELECT count(*) FROM statuses s JOIN accounts a ON a.id=s.account_id WHERE a.domain='<INSTANCE>') AS cached_statuses,
       (SELECT to_char(min(created_at),'YYYY-MM-DD') FROM accounts WHERE domain='<INSTANCE>') AS first_seen,
       (SELECT count(*) FROM domain_blocks WHERE domain='<INSTANCE>') AS is_domain_blocked;
```
|
||||
|
||||
## The escalation ladder
|
||||
|
||||
| Level | Action | Effect | When |
|---|---|---|---|
| 1 | **Mute** | You stop seeing them; silent | Borderline; you don't want to cut them off |
| 2 | **Block (account)** | Cuts mentions, removes their follow, federates to their instance | **Default first action** |
| 3 | **Report** to source admin | Forwards the offending posts to their moderators | Repeat or egregious; in-policy on most instances |
| 4 | **Domain-limit (silence)** | Their posts show only if you follow that account | One instance, multiple offenders |
| 5 | **Domain-block (suspend)** | Severs all known accounts + federation | Instance is predominantly abuse |
|
||||
|
||||
### Blocking from a user account (federates + removes follow)
|
||||
|
||||
There is no `tootctl accounts block` command. Call Mastodon's `BlockService` from a Rails runner script instead, so the block tears down the follow relationship and federates correctly:
|
||||
|
||||
```ruby
# run as the mastodon user:
#   sudo -u mastodon bash -c 'cd /home/mastodon/live && RAILS_ENV=production bin/rails runner /tmp/block.rb'
me = Account.find_by(username: "<YOU>", domain: nil)
%w[Handle1 Handle2].each do |u|
  t = Account.find_by(username: u, domain: "<INSTANCE>")
  next puts("NOTFOUND #{u}") if t.nil?
  BlockService.new.call(me, t)
  puts "BLOCKED #{u} blocking=#{me.blocking?(t)} they_follow_me=#{t.following?(me)}"
end
```
|
||||
|
||||
`blocking=true` with `they_follow_me=false` confirms the block landed and the follow was severed.
|
||||
|
||||
### Instance-level actions
|
||||
|
||||
Domain-limit / domain-block live in the admin UI (**Moderation → Federation**) or via `tootctl`:
|
||||
|
||||
```bash
|
||||
# Silence (limit) — posts hidden unless followed
|
||||
RAILS_ENV=production bin/tootctl domains ... # or set severity=silence in the admin UI
|
||||
# Suspend (block) the whole instance
|
||||
RAILS_ENV=production bin/tootctl ... # admin UI "Add domain block" is the safe path
|
||||
```
|
||||
|
||||
> [!tip] Reach for the lightest hammer
|
||||
> A domain block is rarely the right first move against an established instance — you lose every legit account and years of federation to swat a couple of accounts. Block the accounts, report them to the source admin, and only escalate the *instance* when it demonstrates a sustained, multi-actor pattern.
|
||||
|
||||
## Keep a log
|
||||
|
||||
Track offenders and source instances over time so a "one-off" that's actually a campaign becomes visible, and so domain-level decisions are evidence-based. A simple table — date, account, instance, signals, action — plus an instance-watch table with each source's registration policy and offender count is enough.
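
One way to keep that log is a plain CSV file. This is a minimal Python sketch. The column names follow the table described above, while the helper names `log_incident` and `offenders_per_instance` are assumptions for illustration.

```python
import csv
from collections import Counter
from datetime import date
from pathlib import Path

FIELDS = ["date", "account", "instance", "signals", "action"]


def log_incident(path, account, instance, signals, action, when=None):
    """Append one row to the log, writing the header if the file is new."""
    path = Path(path)
    is_new = not path.exists()
    with path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "date": (when or date.today()).isoformat(),
            "account": account,
            "instance": instance,
            "signals": ";".join(signals),
            "action": action,
        })


def offenders_per_instance(path):
    """Count distinct logged accounts per source instance."""
    with Path(path).open(newline="") as fh:
        seen = {(row["instance"], row["account"]) for row in csv.DictReader(fh)}
    return Counter(instance for instance, _ in seen)
```

A rising count for one instance in `offenders_per_instance` is the kind of evidence that justifies moving up the ladder from account blocks to a domain-level action.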
|
||||
|
||||
## Related
|
||||
|
||||
- [Mastodon `--prune-profiles` Trap](mastodon-prune-profiles-trap.md)
|
||||
- [Mastodon DB Maintenance](mastodon-db-maintenance.md)
|
||||
- [Mastodon Federation](mastodon-federation.md)
|
||||
|
|
@ -1,137 +0,0 @@
|
|||
---
|
||||
title: "App-Consistent Fleet Backups with restic + Backblaze B2"
|
||||
domain: selfhosting
|
||||
category: storage-backup
|
||||
tags: [restic, backblaze, b2, backup, ansible, systemd, postgresql, mysql, sqlite, docker, disaster-recovery]
|
||||
status: published
|
||||
created: 2026-06-19
|
||||
updated: 2026-06-19
|
||||
---
|
||||
|
||||
# App-Consistent Fleet Backups with restic + Backblaze B2
|
||||
|
||||
A repeatable pattern for backing up a mixed fleet (Ubuntu + Fedora, VPS + homelab, bare services + Docker) to Backblaze B2 with [restic](https://restic.net) — encrypted, deduplicated, and **app-consistent** (databases are dumped before the snapshot, not copied live). Driven by Ansible and a per-host `systemd` timer.
|
||||
|
||||
## The Short Answer
|
||||
|
||||
Per host, nightly: **dump every database to a staging dir → `restic backup` that staging dir plus the data paths → apply retention → wipe staging.** A monthly timer runs `restic prune`. Anything that fails emails the admin. One B2 bucket holds a separate repo per host at `b2:<bucket>:<hostname>`.
|
||||
|
||||
Retention is `--keep-daily 7 --keep-weekly 4 --keep-monthly 6` (~6 months of history).
|
||||
|
||||
## Why dump databases first
|
||||
|
||||
Copying a live database's files (`/var/lib/mysql`, a running SQLite file, a Postgres data dir) gives you a *crash-consistent* copy at best — restorable only if you're lucky. Logical dumps are guaranteed consistent:
|
||||
|
||||
- **MySQL / MariaDB:** `mysqldump --single-transaction --routines --triggers --databases <db>`
|
||||
- **PostgreSQL:** `pg_dump -Fc <db>` (custom format) via the `postgres` system user (peer auth)
|
||||
- **SQLite:** `sqlite3 <file> ".backup '<out>'"` — uses the online backup API, safe against a running writer
|
||||
- **Dockerized DBs:** `docker exec <container> sh -c '<dump cmd>'`, letting the container's own shell expand its root-password env var
|
||||
|
||||
restic then backs up the dump files (which dedupe beautifully — only the changed blocks upload each night).
|
||||
|
||||
## Repository layout
|
||||
|
||||
- **One private B2 bucket** (e.g. `majorshouse-backups`).
|
||||
- **One repo per host:** `b2:majorshouse-backups:<hostname>`.
|
||||
- The application key needs **read + write + delete** for the bucket. restic deletes objects during `forget`/`prune`, so a pure *append-only* key will break retention. (True append-only requires splitting `forget`/`prune` onto a separate maintenance key — a worthwhile hardening step, but not the default.)
|
||||
- Credentials live in an `EnvironmentFile` (`/etc/restic/restic-env`, mode `0600`, root): `RESTIC_REPOSITORY`, `RESTIC_PASSWORD`, `B2_ACCOUNT_ID`, `B2_ACCOUNT_KEY`.
|
||||
|
||||
## The backup script (shape)
|
||||
|
||||
```bash
set -uo pipefail
STAGING=/var/backups/restic-staging
rm -rf "$STAGING"; mkdir -p "$STAGING"; chmod 700 "$STAGING"

# per-engine dumps into $STAGING ...
mysqldump --single-transaction --routines --triggers --databases wordpress > "$STAGING/mysql-wordpress.sql"
sudo -u postgres pg_dump -Fc mastodon_production > "$STAGING/pg-mastodon_production.dump"
sqlite3 /opt/phantombot/config/phantombot.db ".backup '$STAGING/sqlite-phantombot.db'"

restic backup --tag fleet-backup --host "$(hostname -s)" \
  "$STAGING" /var/www /etc/letsencrypt --exclude /path/to/already-offsite/media

restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6
rm -rf "$STAGING"
```
|
||||
|
||||
Wrap each step so a failure mails the admin and aborts (don't silently back up a half-state). On hosts where the `mail` CLI is absent, pipe a message to `/usr/sbin/sendmail -t` instead.
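
The wrap-and-abort pattern could look like the following. It is a Python sketch rather than the bash the script itself uses. `run_step`, `build_mail`, `send_mail`, and the `ADMIN` address are hypothetical names, and delivery assumes a local `/usr/sbin/sendmail`.

```python
import subprocess

ADMIN = "admin@example.com"  # hypothetical recipient


def build_mail(step, returncode, stderr, to=ADMIN):
    """Compose a minimal message that `sendmail -t` can deliver."""
    return (
        f"To: {to}\n"
        f"Subject: [backup] step '{step}' failed (exit {returncode})\n"
        "\n"
        f"{stderr.strip() or '(no stderr output)'}\n"
    )


def send_mail(message):
    """Pipe the message to sendmail; -t reads recipients from the headers."""
    subprocess.run(["/usr/sbin/sendmail", "-t"], input=message, text=True, check=False)


def run_step(name, argv, notify=send_mail):
    """Run one backup step; on failure notify the admin and abort the run."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode != 0:
        notify(build_mail(name, proc.returncode, proc.stderr))
        raise SystemExit(f"aborting: step '{name}' failed")
    return proc.stdout
```

Each stage then becomes one call, for example `run_step("restic backup", ["restic", "backup", "--tag", "fleet-backup", "/var/www"])`. Because a failed step raises `SystemExit`, later stages such as `forget` never run against a half-finished backup.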
|
||||
|
||||
## systemd units
|
||||
|
||||
A oneshot service + a timer. Stagger `OnCalendar` per host to spread B2 load, and **always set `RESTIC_CACHE_DIR`** (see Gotchas):
|
||||
|
||||
```ini
# restic-backup.service
[Service]
Type=oneshot
EnvironmentFile=/etc/restic/restic-env
Environment=RESTIC_CACHE_DIR=/var/cache/restic
ExecStart=/usr/local/sbin/restic-backup.sh
Nice=10
IOSchedulingClass=idle
```
|
||||
|
||||
```ini
# restic-backup.timer
[Timer]
OnCalendar=*-*-* 02:30:00
RandomizedDelaySec=20m
Persistent=true
[Install]
WantedBy=timers.target
```
|
||||
|
||||
A second `restic-prune.timer` runs `restic prune` monthly (`OnCalendar=*-*-01 04:00:00`).
|
||||
|
||||
## Restore procedure
|
||||
|
||||
The whole point. From the target host (or any host with the repo creds):
|
||||
|
||||
```bash
# load repo + B2 creds without echoing them
set -a; . /etc/restic/restic-env; set +a

restic snapshots                 # list; note the snapshot ID or use 'latest'

# restore specific paths to a scratch dir (never restore in place blindly)
restic restore latest --target /tmp/restore \
  --include /var/backups/restic-staging \
  --include /var/www/html/wp-config.php

# verify before doing anything with it
ls -la /tmp/restore/var/backups/restic-staging/
head -1 /tmp/restore/var/backups/restic-staging/mysql-wordpress.sql   # "-- MySQL dump 10.13 ..."
```
|
||||
|
||||
To recover a database, restore the dump then load it: `mysql <db> < mysql-<db>.sql`, `pg_restore -d <db> pg-<db>.dump`, or copy the SQLite file back. **Test restores periodically** — a backup you've never restored is a hope, not a backup. Restore the highest-stakes data (password manager, mail) first in any drill.
|
||||
|
||||
## Adding a host
|
||||
|
||||
1. Add it to the `backups` inventory group.
|
||||
2. Give it a `host_vars` scope — which DBs to dump and which paths to back up:
|
||||
|
||||
```yaml
|
||||
restic_backup_oncalendar: "*-*-* 02:40:00" # stagger
|
||||
restic_mysql_dbs: [castopod_db]
|
||||
restic_paths: [/var/www/html/castopod]
|
||||
restic_excludes: [/var/www/html/castopod/public/media] # already offsite
|
||||
```
|
||||
3. Run the playbook against that host. The role installs restic, deploys the script + units, `restic init`s the repo if absent, and enables the timers.
|
||||
|
||||
## Gotchas & Notes
|
||||
|
||||
- **`RESTIC_CACHE_DIR` is mandatory under systemd.** systemd services run with no `$HOME`, so restic can't find its cache and warns *"unable to locate cache directory: neither $XDG_CACHE_HOME nor $HOME are defined"* — and re-reads **every file** each run (no incremental). Point it at `/var/cache/restic` in the unit.
|
||||
- **`sqlite3` may not be installed.** A host that runs a SQLite-backed app (e.g. a bot) often lacks the `sqlite3`/`sqlite` CLI. Install it where `restic_sqlite_paths` is set, or the `.backup` step fails.
|
||||
- **Docker DB password env-var names vary.** Don't assume: the MariaDB image may use `MYSQL_ROOT_PASSWORD` (not `MARIADB_ROOT_PASSWORD`), and a Postgres container's superuser is whatever `POSTGRES_USER` is set to — reference `"$POSTGRES_USER"` rather than hardcoding `postgres`. Check with `docker exec <c> sh -c 'env | grep -oE "^(MYSQL|MARIADB|POSTGRES)_[A-Z_]*"'` (name only).
|
||||
- **B2 key needs delete capability.** Otherwise `forget`/`prune` fail. Scope the key to the bucket; reach for per-host `namePrefix`-restricted keys for blast-radius isolation.
|
||||
- **Exclude data that's already offsite.** Media already synced to object storage (S3/B2 via the app or `rclone`) should be `--exclude`d so you don't pay to store it twice.
|
||||
- **First upload is slow, the rest are fast.** The initial snapshot reads and uploads everything; subsequent runs only ship changed blocks. For a large first run, fire it detached and watch from a transient unit that emails you on completion.
|
||||
- **Keep secrets out of git.** The repo password and B2 key belong in an Ansible vault (committed encrypted), referenced into the role — never in plaintext vars.
|
||||
- **Changing a host's backup paths starts a new snapshot group.** `restic forget` groups snapshots by `host`+`paths` by default, so adding or removing a path on an existing host creates a *separate* lineage: the old path-set and the new one each retain their own 7d/4w/6m snapshots, and `restic snapshots` shows both. Expected, not a bug — but it means the old-path snapshots age out on their own schedule rather than being superseded. To collapse everything into one retention bucket, run `forget` with `--group-by host` (be deliberate: it then treats *any* path-set on that host as the same group).
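
To see why a path change splits the history, it helps to model the grouping. The following is a small Python sketch, not restic's actual code. The snapshot dicts and the `group_snapshots` helper are made up for illustration.

```python
from collections import defaultdict


def group_snapshots(snapshots, group_by=("host", "paths")):
    """Bucket snapshots the way a retention policy would see them.

    Each snapshot is a dict with "id", "host", and "paths" (a list).
    The keep-daily/weekly/monthly policy is then applied per bucket.
    """
    groups = defaultdict(list)
    for snap in snapshots:
        key = tuple(
            tuple(sorted(snap[field])) if field == "paths" else snap[field]
            for field in group_by
        )
        groups[key].append(snap["id"])
    return dict(groups)


snaps = [
    {"id": "a1", "host": "web", "paths": ["/var/www"]},
    {"id": "b2", "host": "web", "paths": ["/var/www"]},
    # a path was added, so this snapshot lands in a new bucket by default
    {"id": "c3", "host": "web", "paths": ["/var/www", "/etc/letsencrypt"]},
]

by_default = group_snapshots(snaps)                    # two buckets
by_host = group_snapshots(snaps, group_by=("host",))   # one bucket
```

With the default grouping the third snapshot starts its own lineage. Grouping by host alone puts all three under one retention policy, which is the effect `--group-by host` has.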
|
||||
|
||||
## See Also
|
||||
|
||||
- [rsync Backup Patterns](rsync-backup-patterns.md)
|
||||
- [SnapRAID & MergerFS Storage Setup](../../01-linux/storage/snapraid-mergerfs-setup.md)
|
||||
- [restic documentation](https://restic.readthedocs.io)
|
||||
|
|
@ -5,7 +5,7 @@ category: plex
|
|||
tags: [plex, ffmpeg, hevc, vaapi, amd, gpu, encode, storage, rx480]
|
||||
status: published
|
||||
created: 2026-05-15
|
||||
updated: 2026-06-05
|
||||
updated: 2026-05-22
|
||||
---
|
||||
# HEVC Batch Re-Encode for Plex Using VAAPI (AMD GPU)
|
||||
|
||||
|
|
@ -121,7 +121,7 @@ Each file logs:
|
|||
|
||||
### Space guard
|
||||
|
||||
The script aborts if free space on the Plex volume drops below 10GB (`MIN_FREE_GB`). Worst-case headroom needed is `source_size + tmp_size` simultaneously — on a 4GB source file that's ~8GB peak. Note: the space check only runs at the **start** of each encode, not during — a large file can still consume significant disk mid-encode.
|
||||
The script aborts if free space on the Plex volume drops below 20GB (`MIN_FREE_GB`). Worst-case headroom needed is `source_size + tmp_size` simultaneously — on a 4GB source file that's ~8GB peak.
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -278,54 +278,3 @@ local tmp="${dir}/${safe_stem}.hevc.tmp.${ext}"
|
|||
|
||||
After patching, delete the affected entries from `hevc_failed.txt` (or leave them — they'll be re-queued on the next run since they're not in `hevc_done.txt`) and restart the batch.
|
||||
|
||||
---
|
||||
|
||||
### Many files failing: output larger than source (streaming content)
|
||||
|
||||
**Symptom:** A large portion of the queue ends up in `hevc_failed.txt` with log lines like:
|
||||
|
||||
```
|
||||
[2026-06-05 ...] Output: 4.7G savings=0 (output larger than source)
|
||||
[2026-06-05 ...] WARN: output is larger than source — skipping swap, keeping original
|
||||
```
|
||||
|
||||
**Cause:** These files are YouTube downloads or streaming archives (Giant Bomb, Twitch VODs, etc.) that were already encoded with an efficient H.264 encoder (typically YouTube's VP9-to-AVC pipeline or a broadcast H.264 encoder at a reasonable bitrate). VAAPI HEVC encoding at QP 28 on a Polaris GPU (RX 480/580) is a hardware encoder with limited rate control precision — it cannot beat a well-tuned software H.264 encode on already-compressed talking-head/gaming content. The output reliably comes out 15–25% *larger* than the source.
|
||||
|
||||
The script handles this correctly: it detects output > source, deletes the tmp, keeps the original, and writes to `hevc_failed.txt`. The files are not corrupted. However, without the `already_failed()` guard, the script will re-attempt these files on every queue rebuild, wasting CPU time and briefly consuming 4–8 GB of disk per failed attempt.
|
||||
|
||||
**Fix — add `already_failed()` skip logic:**
|
||||
|
||||
Patch `~/hevc_batch.sh` to skip files already in `hevc_failed.txt`:
|
||||
|
||||
```bash
# After the existing already_done() function, add:
already_failed() {
    [[ -f "$FAILED" ]] && grep -qF "$1" "$FAILED"
}

# In build_queue(), after the already_done "$f" && continue line:
already_failed "$f" && continue

# In the main loop, after the already_done "$file" check:
already_failed "$file" && { log "SKIP (already failed): $file"; continue; }
```
|
||||
|
||||
After patching, the batch will skip all 132+ known-bad files on the next pass and only attempt fresh queue entries.
|
||||
|
||||
**Tuning options to improve savings on dense content:**
|
||||
|
||||
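- Raise QP: a value above the current 28 (for example `--qp 30` or `--qp 32`) compresses harder and gives a better chance of beating source size. Trade-off: lower visual quality. Lowering QP (`--qp 24`, `--qp 22`) does the opposite: it raises quality and makes the output larger.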
|
||||
- Accept the failures: for streaming content archives, the source is already "good enough." Only files that are genuinely oversized H.264 (old stream captures at very high bitrate) will benefit from HEVC re-encode.
|
||||
|
||||
**Identifying which files are worth encoding:**
|
||||
|
||||
```bash
# Show source bitrate for all queued files — high-bitrate sources are candidates
while IFS= read -r f; do
    bitrate=$(ffprobe -v quiet -show_entries format=bit_rate -of csv=p=0 "$f" 2>/dev/null)
    echo "$bitrate $f"
done < ~/hevc_queue.txt | sort -rn | head -20
```
|
||||
|
||||
`ffprobe` prints `bit_rate` in bits per second, so 8,000 kbit/s shows up as roughly `8000000`. Files above ~8,000 kbit/s are typically good encode candidates. Files at 3,000–5,000 kbit/s (typical YouTube/Twitch 1080p) will usually fail.
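
These thresholds can be turned into a quick triage filter. This is a hedged Python sketch: `probe_bitrate` and `classify_bitrate` are hypothetical helper names, and the cut-offs are the rough figures from this section rather than hard rules. Treating everything at or below 5,000 kbit/s as likely to fail is an extrapolation, and the band between 5,000 and 8,000 kbit/s is labelled uncertain because the text gives no guidance for it.

```python
import subprocess


def probe_bitrate(path):
    """Return the container bit rate in bits/s via ffprobe, or None if unknown."""
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-show_entries", "format=bit_rate",
         "-of", "csv=p=0", path],
        capture_output=True, text=True,
    ).stdout.strip()
    return int(out) if out.isdigit() else None


def classify_bitrate(bits_per_sec):
    """Map a bit rate (bits/s, as ffprobe prints it) to an encode verdict."""
    if bits_per_sec is None:
        return "unknown"
    kbps = bits_per_sec / 1000
    if kbps > 8000:
        return "candidate"      # oversized H.264, HEVC likely saves space
    if kbps <= 5000:
        return "likely-fail"    # already an efficient streaming-grade encode
    return "uncertain"          # between the two bands
```

Running `classify_bitrate(probe_bitrate(f))` over the queue before encoding lets the batch skip the streaming-grade files up front instead of discovering them through failed encodes.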
|
||||
|
||||
|
|
|
|||
|
|
@ -1,103 +0,0 @@
|
|||
---
|
||||
title: "Ansible reboot.yml: become Timeout on WSL2 Hosts (Exclude Them)"
|
||||
domain: troubleshooting
|
||||
category: ansible
|
||||
tags: [ansible, wsl, wsl2, windows, reboot, become, privilege-escalation, openssh, inventory]
|
||||
status: published
|
||||
created: 2026-06-12
|
||||
updated: 2026-06-12
|
||||
---
|
||||
|
||||
# Ansible reboot.yml: become Timeout on WSL2 Hosts (Exclude Them)
|
||||
|
||||
## Problem
|
||||
|
||||
Running a reboot play across a Fedora fleet that includes a WSL2 "host" fails on the WSL2 box at privilege escalation — before the reboot command ever runs:
|
||||
|
||||
```console
|
||||
$ ansible-playbook reboot.yml --limit fedora
|
||||
|
||||
TASK [Reboot the server] *******************************************************
|
||||
changed: [majorhome]
|
||||
changed: [majorlab]
|
||||
changed: [majormail]
|
||||
changed: [majordiscord]
|
||||
[ERROR]: Task failed: Action failed: Timeout (62s) waiting for privilege
|
||||
escalation prompt:
|
||||
fatal: [majorrig-wsl]: FAILED! => {"changed": false,
|
||||
"msg": "Timeout (62s) waiting for privilege escalation prompt:",
|
||||
"reboot": false}
|
||||
```
|
||||
|
||||
Every real server reboots fine. Only the WSL2 host fails, and `"reboot": false` confirms the shutdown command never executed.
|
||||
|
||||
## Cause
|
||||
|
||||
Two independent problems, either of which is enough to break a reboot play against WSL2:
|
||||
|
||||
1. **WSL2 has no real reboot semantics.** `ansible.builtin.reboot` issues a shutdown, then blocks up to `reboot_timeout` (e.g. 900s) waiting for SSH to come back. A WSL2 distro doesn't reboot — it just terminates, and nothing relaunches it automatically. The task would hang the full timeout and then fail.
|
||||
|
||||
2. **`become` times out over the Windows OpenSSH → WSL2 bridge.** When a WSL2 box is reached as `majorlinux@host` through Windows' built-in OpenSSH Server (which forwards into WSL via the default shell), Ansible's privilege-escalation handshake watches the SSH stream for the sudo prompt/success marker. Across the Windows-intercept pty, that marker detection stalls until the 62s `timeout`. This happens **even with passwordless sudo** — `NOPASSWD` is configured and correct; Ansible simply never sees the handshake complete.
|
||||
|
||||
The error surfaces as #2 (it fails at escalation first), but #1 is the deeper reason WSL2 doesn't belong in a reboot play at all.
|
||||
|
||||
## Solution
|
||||
|
||||
**Exclude the WSL group from the reboot play.** A WSL2 instance is a managed *workstation environment*, not a server — it belongs in package/update plays but not in server lifecycle operations like reboot.
|
||||
|
||||
Scope the play to exclude the `wsl` group so even a broad `--limit` skips it:
|
||||
|
||||
```yaml
# reboot.yml
- name: Reboot servers
  hosts: all:!wsl          # was: hosts: all
  become: true
  tasks:
    - name: Reboot the server
      ansible.builtin.reboot:
        msg: "Reboot initiated by Ansible"
        reboot_timeout: 900
```
|
||||
|
||||
This assumes your WSL2 hosts are in a dedicated inventory group:
|
||||
|
||||
```yaml
wsl:
  hosts:
    majorrig-wsl:
      ansible_host: 100.98.47.29
```
|
||||
|
||||
Verify the targeting before running — the WSL host should be gone:
|
||||
|
||||
```console
|
||||
$ ansible-playbook reboot.yml --limit fedora --list-hosts
|
||||
play #1 (all:!wsl): Reboot servers
|
||||
hosts (4):
|
||||
majorhome
|
||||
majorlab
|
||||
majordiscord
|
||||
majormail
|
||||
```
|
||||
|
||||
### Rebooting the WSL2 instance itself
|
||||
|
||||
When you genuinely need to "reboot" WSL2, do it from the Windows side — not Ansible:
|
||||
|
||||
```powershell
|
||||
wsl --shutdown
|
||||
```
|
||||
|
||||
The distro relaunches on next access (next SSH login or `wsl` invocation). WSL2 stays in `update.yml` (dnf upgrades) and other package plays; it's only excluded from reboot and other server-specific roles.
|
||||
|
||||
## Why not just fix the become timeout?
|
||||
|
||||
You *could* raise `timeout` or tweak the become flow, but it doesn't address problem #1 — even a successful escalation would leave the reboot task hanging the full `reboot_timeout` because WSL2 never comes back the way the module expects. Excluding WSL from server lifecycle plays is the correct fix, not a workaround.
|
||||
|
||||
## Related
|
||||
|
||||
- [Ansible: ansible.cfg Ignored on WSL2 Windows Mounts](ansible-wsl2-world-writable-mount-ignores-cfg.md)
|
||||
- [Windows OpenSSH: WSL Default Shell Breaks Remote Commands](networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md)
|
||||
- [Ansible: SSH Timeout During dnf upgrade on Fedora Hosts](ansible-ssh-timeout-dnf-upgrade.md)
|
||||
|
|
@ -1,73 +0,0 @@
|
|||
---
|
||||
title: "Claude Code Keychain Prompt Keeps Reappearing on macOS (ACL Invalidation)"
|
||||
domain: troubleshooting
|
||||
category: claude-code
|
||||
tags: [claude-code, authentication, oauth, keychain, macos, acl, security]
|
||||
status: published
|
||||
created: 2026-06-15
|
||||
updated: 2026-06-15
|
||||
---
|
||||
|
||||
# Claude Code Keychain Prompt Keeps Reappearing on macOS (ACL Invalidation)
|
||||
|
||||
## Symptom
|
||||
A macOS dialog repeatedly pops up:
|
||||
|
||||
> **security wants to access key "Claude Code-credentials" in your keychain.**
|
||||
> To allow this, enter the "login" keychain password. — `[Always Allow] [Deny] [Allow]`
|
||||
|
||||
The tell-tale sign: it **comes back even after clicking "Always Allow"** — the usual "trust forever" button doesn't make it stop. Login still works; it's the *permission prompt* that won't quiet down. This is **distinct** from [Claude Code won't log in](claude-code-warp-login-corrupt-keychain-credential.md), where the stored credential is corrupt and login itself fails.
|
||||
|
||||
## Cause
|
||||
Claude Code stores its OAuth token in the macOS **login keychain** as `Claude Code-credentials`, read via `/usr/bin/security`. macOS binds an "Always Allow" grant (the keychain item's ACL) to the **code-signing identity** of the requesting binary. That grant is silently invalidated when:
|
||||
|
||||
- **Claude Code updates** — the new binary's signature no longer matches the saved ACL. This is the most common trigger (see claude-code issues #48162, #9403).
|
||||
- **The credential item is recreated on token refresh** — wipes the ACL.
|
||||
- **Post-reboot keychain churn** — right after boot, the just-unlocked login keychain plus a concurrent token refresh can race ahead of the ACL settling, producing a *burst* of prompts that stops once a clean refresh completes.
|
||||
|
||||
It is **not** a lock-timeout issue if `security show-keychain-info` reports `no-timeout` (below).
|
||||
|
||||
## Triage (non-destructive — these do not trigger a prompt)
|
||||
```bash
|
||||
# Confirm the item exists (metadata only; no secret read)
|
||||
security find-generic-password -l "Claude Code-credentials" | grep -E "svce|acct"
|
||||
|
||||
# Confirm the login keychain isn't auto-locking
|
||||
security show-keychain-info ~/Library/Keychains/login.keychain-db
|
||||
# -> "no-timeout" means it won't relock; so recurring prompts = ACL invalidation, not locking
|
||||
```
|
||||
|
||||
## Fixes
|
||||
|
||||
### One-off burst (e.g. right after a reboot)
|
||||
Click **Always Allow** (not Allow) once a clean token refresh has completed. With a `no-timeout` keychain the grant then holds, and the post-boot prompt storm usually self-clears within a minute. *Observed exactly this on MajorAir 2026-06-15 — a reboot triggered a burst that stopped on its own.*
|
||||
|
||||
### Keeps returning after updates (durable) — reset the credential
|
||||
Deleting and re-creating the item rebinds a fresh ACL to the current binary. Costs one re-login.
|
||||
```bash
|
||||
security delete-generic-password -s "Claude Code-credentials"
|
||||
# then re-authenticate inside Claude Code: /login (or relaunch `claude`)
|
||||
```
|
||||
|
||||
### Bypass the keychain entirely (workaround)
|
||||
Claude Code falls back to `~/.claude/.credentials.json` in non-GUI contexts (SSH, tmux). On a local Mac this can be repurposed to stop keychain prompts for good:
|
||||
```bash
|
||||
# pipe straight to the file — never echo the token into a shared terminal
|
||||
security find-generic-password -s "Claude Code-credentials" -w > ~/.claude/.credentials.json
|
||||
chmod 600 ~/.claude/.credentials.json
|
||||
security delete-generic-password -s "Claude Code-credentials"
|
||||
```
|
||||
**Caveats:**
|
||||
- Token is then **plaintext at rest** (mode 600) instead of encrypted in the keychain.
|
||||
- A future Claude Code update may rewrite the keychain item.
|
||||
- GUI-session behaviour for the file fallback is **less documented** than the SSH/tmux case — **verify it holds for your setup before relying on it.**
|
||||
- Do **not** substitute `CLAUDE_CODE_OAUTH_TOKEN` — it is known to delete credentials on exit (issue #37512).
|
||||
|
||||
## Notes
|
||||
- Same keychain item as the corrupt-credential login failure; if login itself breaks, see the related article.
|
||||
- Always redirect `-w` output straight to a file — never into a terminal whose scrollback feeds shared context.
|
||||
|
||||
## Related
|
||||
- [Claude Code Won't Log In (Warp & iTerm2) — Corrupt Keychain Credential](claude-code-warp-login-corrupt-keychain-credential.md)
|
||||
- Config: `~/.claude.json`, login keychain item `Claude Code-credentials`
|
||||
- First observed: MajorAir, 2026-06-15 (post-reboot prompt burst; self-cleared)
|
||||
|
|
@ -61,6 +61,5 @@ Resolved on step 1+2 — login succeeded after deleting the corrupt Keychain ite
|
|||
If that errors with "Expecting value", the stored secret is empty/corrupt — delete and re-login.
|
||||
|
||||
## Related
|
||||
- [Claude Code Keychain Prompt Keeps Reappearing on macOS (ACL Invalidation)](claude-code-keychain-prompt-recurring-macos.md) — different symptom: login works but the permission prompt won't stop
|
||||
- Config: `~/.claude.json` (oauthAccount, userID), login Keychain item `Claude Code-credentials`
|
||||
- Other Claude Code note: `claude-mem-setting-sources-empty-arg.md`
|
||||
|
|
|
|||
|
|
@ -1,105 +0,0 @@
|
|||
---
|
||||
title: "Forgejo: Account Recovery & CLI Admin When Locked Out of the GUI"
|
||||
domain: troubleshooting
|
||||
category: general
|
||||
tags: [forgejo, gitea, smtp, docker, account-recovery, self-hosting]
|
||||
status: published
|
||||
created: 2026-06-12
|
||||
updated: 2026-06-12
|
||||
---
|
||||
# Forgejo: Account Recovery & CLI Admin When Locked Out of the GUI
|
||||
|
||||
Two related problems on a single-admin self-hosted **Forgejo** (or Gitea): the GUI *"Forgot password"* is disabled, and you can't log in to fix it. Here's how to (1) enable account recovery properly, and (2) recover from the command line when you're already locked out.
|
||||
|
||||
## Symptoms
|
||||
|
||||
- The *Forgot password* page shows: **"Account recovery is only available when email is set up. Please set up email to enable account recovery."**
|
||||
- You can't log in (wrong/forgotten password), so you can't add an SSH key or change settings in the GUI either.
|
||||
|
||||
## Part 1 — Enable account recovery (configure the mailer)
|
||||
|
||||
Account recovery needs SMTP. If you already run a mail server on your tailnet, relay through it — **no app password needed** when the Forgejo host is `mynetworks`-trusted by that mail server.
|
||||
|
||||
Edit `app.ini` (in the data volume, e.g. `/data/gitea/conf/app.ini`):
|
||||
|
||||
```ini
|
||||
[mailer]
|
||||
ENABLED = true
|
||||
PROTOCOL = smtp+starttls
|
||||
SMTP_ADDR = 100.x.y.z ; mail server's tailnet IP
|
||||
SMTP_PORT = 587
|
||||
FROM = forgejo@example.com
|
||||
FORCE_TRUST_SERVER_CERT = true ; required when connecting by IP (cert CN won't match)
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
- `FORCE_TRUST_SERVER_CERT = true` is needed when you target the relay by **IP** — the TLS cert is issued for a hostname, not the IP, so verification would otherwise fail. Acceptable on a trusted internal hop.
|
||||
- Omit `USER`/`PASSWD` if the relay accepts your host via `mynetworks` (no SASL). Otherwise add SMTP auth.
|
||||
- `app.ini` lives in the persistent volume, so the change **survives container re-creation** (e.g. Watchtower's nightly pull).
|
||||
|
||||
Apply and verify:
|
||||
|
||||
```bash
|
||||
docker restart forgejo
|
||||
docker logs forgejo 2>&1 | grep -i "Mail Service Enabled" # confirms the mailer loaded
|
||||
```
|
||||
|
||||
Test the SMTP path **before** trusting it (run from the host, mimicking Forgejo's connection):
|
||||
|
||||
```bash
python3 - <<'EOF'
import smtplib, ssl
ctx = ssl.create_default_context(); ctx.check_hostname = False; ctx.verify_mode = ssl.CERT_NONE
s = smtplib.SMTP("100.x.y.z", 587, timeout=15)
s.ehlo(); s.starttls(context=ctx); s.ehlo()
s.sendmail("forgejo@example.com", ["you@example.com"],
           "Subject: test\r\n\r\nForgejo relay path test")
s.quit(); print("SENT_OK")
EOF
```
|
||||
|
||||
`SENT_OK` means the relay accepted the message. `/user/forgot_password` should now show the reset form instead of the email error.
|
||||
|
||||
> **Container can't reach the tailnet IP?** Docker bridge networks usually route to Tailscale via the host (SNAT to the host's tailnet IP). Confirm with:
|
||||
> `docker exec forgejo nc -w5 100.x.y.z 587 </dev/null && echo REACHABLE`
|
||||
|
||||
## Part 2 — Recover from the CLI (already locked out)
|
||||
|
||||
Forgejo's admin CLI runs inside the container as the git user (UID 1000) and needs no login.
|
||||
|
||||
**Reset a password:**
|
||||
|
||||
```bash
|
||||
docker exec -u 1000 forgejo forgejo admin user change-password -u <user> -p '<newpass>'
|
||||
```
|
||||
|
||||
> ⚠️ **Gotcha:** `change-password` sets `must_change_password=true` by default. That **forces a change on next GUI login _and_ returns HTTP 403 on the API** (`"You must change your password"`). Clear it:
|
||||
> ```bash
|
||||
> docker exec -u 1000 forgejo forgejo admin user must-change-password --unset <user>
|
||||
> ```
|
||||
|
||||
**Add an SSH key without the GUI** (basic-auth API — works only if 2FA is off):
|
||||
|
||||
```bash
|
||||
curl -u <user>:'<pass>' -X POST -H 'Content-Type: application/json' \
|
||||
-d '{"title":"laptop","key":"ssh-ed25519 AAAA... you@host"}' \
|
||||
http://localhost:3004/api/v1/user/keys
|
||||
# HTTP 201 = created
|
||||
```
|
||||
|
||||
Forgejo regenerates the git user's `authorized_keys` from the database, so `ssh -p <port> git@host` authenticates immediately afterward — no restart needed.
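The same key upload can be scripted without `curl`. This is a minimal sketch using only the Python standard library; `build_key_request` and `add_ssh_key` are hypothetical helper names, and the endpoint and JSON fields are the ones from the `curl` call above. It assumes basic auth is usable (2FA off) and that `must_change_password` has already been cleared.

```python
import base64
import json
import urllib.request


def build_key_request(base_url, user, password, title, public_key):
    """Build the POST /api/v1/user/keys request with HTTP basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    body = json.dumps({"title": title, "key": public_key}).encode()
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v1/user/keys",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def add_ssh_key(base_url, user, password, title, public_key):
    """Send the request and return the HTTP status (201 = created).

    urlopen raises urllib.error.HTTPError on 4xx/5xx, including the 403
    returned while must_change_password is still set.
    """
    req = build_key_request(base_url, user, password, title, public_key)
    with urllib.request.urlopen(req, timeout=15) as resp:
        return resp.status
```

Usage: `add_ssh_key("http://localhost:3004", "<user>", "<pass>", "laptop", "ssh-ed25519 AAAA... you@host")`. Splitting the request construction from the network call keeps the first half checkable without a running server.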
|
||||
|
||||
## "The password keeps changing" — it (probably) isn't
|
||||
|
||||
A stock Forgejo container does **not** reset admin passwords. If a self-hosted Forgejo admin password *seems* to reset itself, rule out the server first:
|
||||
|
||||
- the compose has **no** admin/password env and no custom entrypoint;
|
||||
- **no** cron, systemd timer, or script runs `forgejo admin user change-password`;
|
||||
- the data volume is persistent (re-creation keeps the DB, password included).
|
||||
|
||||
If all three hold, nothing server-side is changing it — the "changing" password is a **client-side** artifact: a duplicate or stale entry in your password manager autofilling different values. Delete the duplicates and keep one.
|
||||
|
||||
## See also
|
||||
|
||||
- Forgejo — [Config Cheat Sheet → mailer](https://forgejo.org/docs/latest/admin/config-cheat-sheet/)
|
||||
|
|
@ -11,7 +11,6 @@ Practical fixes for common Linux, networking, and application problems.
|
|||
- [LoRA adapter — GGUF conversion fails with 'config.json not found'](gpu-display/lora-adapter-gguf-conversion-fails.md)
|
||||
|
||||
## 🌐 Networking & Web
|
||||
- [Wi-Fi Game Streaming Stutter: 160 MHz Channel Width Saturating the 5 GHz Radio](networking/wifi-160mhz-airtime-saturation-game-streaming.md)
|
||||
- [Apache Outage: Fail2ban Self-Ban + Missing iptables Rules](networking/fail2ban-self-ban-apache-outage.md)
|
||||
- [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](networking/fail2ban-imap-self-ban-mail-client.md)
|
||||
- [firewalld: Mail Ports Wiped After Reload](networking/firewalld-mail-ports-reset.md)
|
||||
|
|
@ -19,7 +18,6 @@ Practical fixes for common Linux, networking, and application problems.
|
|||
- [Postfix header_checks Can't Act on Milter-Added Headers (Use Sieve)](networking/postfix-header-checks-vs-milter-headers.md)
|
||||
- [Dovecot Phantom Mailboxes from .dovecot.lda-dupes (mail_home Overlapping the Maildir Root)](networking/dovecot-mail-home-maildir-root-phantom-mailboxes.md)
|
||||
- [Tailscale SSH: Unexpected Re-Authentication Prompt](networking/tailscale-ssh-reauth-prompt.md)
|
||||
- [SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure (No `Host` Block)](networking/ssh-missing-host-block-magicdns-host-key-failure.md)
|
||||
- [iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators](networking/tailscale-status-json-hostname-localhost-ios.md)
|
||||
- [rsync over Tailscale: Hung in TCP Teardown After Transfer Completes](networking/rsync-tailscale-teardown-stall.md)
|
||||
- [Windows OpenSSH: WSL Default Shell Breaks Remote Commands](networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md)
|
||||
|
|
@ -33,7 +31,6 @@ Practical fixes for common Linux, networking, and application problems.
|
|||
- [Vault Password File Missing](ansible-vault-password-file-missing.md)
|
||||
- [ansible.cfg Ignored on WSL2 Windows Mounts](ansible-wsl2-world-writable-mount-ignores-cfg.md)
|
||||
- [regex_search — capture-group argument doesn't work in set_fact](ansible-regex-search-set-fact-capture-group.md)
|
||||
- [reboot.yml: become Timeout on WSL2 Hosts (Exclude Them)](ansible-reboot-become-timeout-wsl2.md)
|
||||
|
||||
## 📦 Docker & Systems
|
||||
- [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](docker-caddy-selinux-post-reboot-recovery.md)
|
||||
|
|
@ -52,12 +49,9 @@ Practical fixes for common Linux, networking, and application problems.
|
|||
## 📝 Application Specific
|
||||
- [Obsidian Vault Recovery — Loading Cache Hang](obsidian-cache-hang-recovery.md)
|
||||
- [Gemini CLI Manual Update](gemini-cli-manual-update.md)
|
||||
- [iPhone Mirroring Hangs on 'Connecting…' — AWDL Data Stall (27.0 Beta)](iphone-mirroring-connecting-hang-awdl-stall-beta.md)
|
||||
|
||||
## 🤖 AI / Local LLM
|
||||
- [Ollama Drops Off Tailscale When Mac Sleeps](ollama-macos-sleep-tailscale-disconnect.md)
|
||||
- [Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt](ollama-chat-template-pipe-stdin-bypass.md)
|
||||
- [Windows OpenSSH Server (sshd) Stops After Reboot](networking/windows-sshd-stops-after-reboot.md)
|
||||
- [claude-mem Silently Fails with Claude Code 2.1+ (Empty `--setting-sources`)](claude-mem-setting-sources-empty-arg.md)
|
||||
- [Claude Code Won't Log In (Warp & iTerm2) — Corrupt Keychain Credential](claude-code-warp-login-corrupt-keychain-credential.md)
|
||||
- [Claude Code Keychain Prompt Keeps Reappearing on macOS (ACL Invalidation)](claude-code-keychain-prompt-recurring-macos.md)
|
||||
|
|
|
|||
|
|
@ -2,61 +2,14 @@
|
|||
title: "iPhone Mirroring Hangs on 'Connecting…' — AWDL Data Stall (27.0 Beta)"
|
||||
domain: troubleshooting
|
||||
category: macos
|
||||
tags: [macos, iphone-mirroring, continuity, awdl, rapport, quic, tailscale, mullvad, beta, channel-validation, aimesh, quicktime, usb]
|
||||
tags: [macos, iphone-mirroring, continuity, awdl, rapport, quic, tailscale, mullvad, beta]
|
||||
status: published
|
||||
created: 2026-06-09
|
||||
updated: 2026-06-15
|
||||
updated: 2026-06-09
|
||||
---
|
||||
|
||||
# iPhone Mirroring Hangs on 'Connecting…' — AWDL Data Stall (27.0 Beta)
|
||||
|
||||
## Update 2026‑06‑15 — REGRESSED; reproducibly stuck on "Connecting", and Tailscale was **not** the cure
|
||||
|
||||
> **Correction to the 2026‑06‑14 "it WORKS" update below.** On 2026‑06‑15 iPhone Mirroring is **reproducibly stuck on "Connecting to iPhone 16 Pro"** on MajorAir again — with Tailscale `accept-routes` *still* `false`. So the accept‑routes change was **correlation, not the fix**: this is an **intermittent macOS 27.0 beta AWDL bug, independent of Tailscale**.
|
||||
>
|
||||
> **Tried this round — all failed to establish a session:** Tailscale `accept-routes=false` (already in place) · `sudo ifconfig awdl0 down/up` · **full Mac reboot** · cycling the iPhone's Wi‑Fi + Bluetooth.
|
||||
>
|
||||
> **Log signature:** `rapportd` resolves the phone's `_asquic._udp.local` endpoint and `_companion-link` registers (discovery *succeeds*), but the QUIC‑over‑AWDL **datapath never completes into a live session** — `wifip2pd` loops on `AWDLDiscoveryTimeout (hasAdvertises=false)`. Each reset advanced the handshake one stage further (no‑advertises → resolve‑started → endpoint‑resolved) yet none reached a streaming session. **`llw0` never went active (0 bytes)** — confirming no A/V ever flowed, regardless of what the 06‑14 note measured.
|
||||
>
|
||||
> **Stance:** beta OS bug, **no reliable user‑side fix**. Use the **QuickTime USB mirror** workaround (below) when you actually need the phone on screen. The 06‑14 "it works on `llw0`" measurements were real *for that one session* but are **not reproducible** across seeds/sessions — treat mirroring as intermittently broken on the 27.0 betas. This re‑confirms the original **Root cause (conclusion)** section further down (a beta bug, "nothing in local config wrong"), which the 06‑14 update had prematurely overridden.
|
||||
|
||||
## Update 2026‑06‑14 (evening) — it WORKS; the "AWDL starvation" finding was the wrong interface
|
||||
|
||||
> iPhone Mirroring is now **working** on MajorAir — stable session, clean video, no missing icons — on **ch44/80** with Tailscale `accept-routes=false`. An earlier pass the same day blamed an "AWDL bulk‑path starving at ~90 B/s"; that was **measuring the wrong interface** and is corrected here.
|
||||
|
||||
**The video transport is `llw0` (low‑latency WLAN), not `awdl0`.**
|
||||
Measured during an active session: **`llw0` ≈ 800 KB/s** (≈6 Mbps of real video), `en0` ~60 KB/s, **`awdl0` ~1 KB/s**. `awdl0` only ever carries AWDL *discovery/control* (~90 B/s) — whether mirroring works or not. So "90 B/s on `awdl0` = starved bulk path" was a **red herring**: the A/V stream rides `llw0`, which the earlier pass never measured.
|
||||
|
||||
**What was actually broken was session *stability*.** The `XPC_ERROR_CONNECTION_INTERRUPTED` / `MediaContinuityKit.TaskTimeoutError` teardown loop kept the `llw0` stream from ever sustaining (→ glitchy / missing icons). When the session holds, `llw0` streams clean.
|
||||
|
||||
**What changed (not cleanly isolated):** three things differed between the broken and working states — (1) the network fully **settled on ch44** over ~15 h (the failing ch44 test was minutes after a chaotic AiMesh re‑sync + reconnect scramble), (2) Tailscale **`accept-routes` was turned off** (it had been polluting IPv4 routing + the Continuity control plane), and (3) both devices slept/woke. Which one mattered is not yet proven.
|
||||
|
||||
**Open test — isolates Tailscale's role:** repro on **MajorMac** with *unaltered* Tailscale (`accept-routes` still **ON**). If mirroring breaks there but works on MajorAir (accept‑routes OFF), that pins Tailscale's accepted routes as the trigger. See [[MajorAir#Known Issues]] for the `accept-routes=false` fix.
|
||||
|
||||
**Still valid from earlier today:** congestion ruled out (router `chanim_stats` ch36 = 90 % idle, 86 % txop); the AiMesh / router infra notes below; and iPhone Mirroring is **wireless‑only — no USB transport** (for a wired screen view, use QuickTime, below).
|
||||
|
||||
> ⚠️ The iPhone‑radio `isValidChannel`/`awdl0` evidence cited in the original 2026‑06‑09 write‑up below describes AWDL *discovery* health, **not** the video path — read it in light of this correction.
|
||||
|
||||
**Wired workaround (works today, no AWDL):**
|
||||
iPhone Mirroring is **wireless‑only — there is no USB transport** (confirmed: cable connected throughout, every attempt still used `awdl0`). For a wired view of the screen:
|
||||
> **QuickTime Player → File → New Movie Recording → ⌄ next to record → select the iPhone** = full‑rate USB‑C screen mirror (view + record). Does **not** give remote control (tap/type) — that's unique to iPhone Mirroring.
|
||||
|
||||
**Infra notes (RT‑AX82U, AiMesh controller):**
|
||||
- Router SSH is on **port 1025** (not 22); creds in Ansible vault (`router_username` / `router_password`).
|
||||
- The 5 GHz channel is **AiMesh‑coordinated** and **resists CLI changes** — `wl chanspec` / nvram `wl1_chanspec` get re‑asserted by `acsd2` + AiMesh within seconds, even after `restart_wireless`. Only setting Control Channel to an **explicit value in the Web UI** holds mesh‑wide. Left "Auto" → acsd2 picks **36** (the cleanest channel).
|
||||
- Any channel change triggers a **mesh re‑sync (~1 min) that drops all Wi‑Fi**; during it MajorAir falls back to the iPhone's **USB Personal Hotspot** (`en7` / `172.20.10.x`) and won't auto‑rejoin home Wi‑Fi while the hotspot feeds it internet (manual Wi‑Fi‑menu join needed).
|
||||
- **Current state: 5 GHz on ch44/80** (same clean UNII‑1 spectrum as 36; left here to avoid another re‑sync — the Deck streams identically on 44).
|
||||
|
||||
**If it breaks again — troubleshooting checklist:**
|
||||
1. **It's session stability, not bandwidth.** Look for teardown loops: `log show --last 3m --predicate 'process == "iPhone Mirroring"' | grep -iE "interrupt|timeout|endpoint"`.
|
||||
2. **Measure the right interface** — video rides **`llw0`** (hundreds of KB/s when the screen is active), *not* `awdl0` (~90 B/s control is normal): `netstat -ib | awk '/<Link#/{print $1, $7}'` before/after a few seconds.
|
||||
3. **Tailscale:** confirm `accept-routes=false` on the Mac (`tailscale debug prefs | grep RouteAll`) — see [[MajorAir#Known Issues]].
|
||||
4. **Let the network settle** after any Wi‑Fi/channel change — an AiMesh re‑sync churns AWDL/Continuity state for a minute+; retry once stable.
|
||||
5. iPhone: on home Wi‑Fi, near the Mac, **Personal Hotspot off**, not in Low Power Mode.
|
||||
6. **Wired fallback that always works:** QuickTime → New Movie Recording → select the iPhone (USB‑C; view/record only, no control).
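Item 2 of the checklist (compare byte counters before and after a few seconds) can be sketched as a small helper. This is an illustration under assumptions, not a tested macOS tool: `link_ibytes` and `rates` are hypothetical names, and the parser reads the same column as the `awk` one-liner (`$7`, Ibytes), which assumes the `<Link#N>` row has its Address column populated.

```python
def link_ibytes(netstat_text):
    """Map interface name -> Ibytes, read from the <Link#N> rows of `netstat -ib`."""
    counters = {}
    for line in netstat_text.splitlines():
        fields = line.split()
        # Link-layer rows look like: name mtu <Link#N> address ipkts ierrs ibytes ...
        if len(fields) >= 7 and fields[2].startswith("<Link#") and fields[6].isdigit():
            counters[fields[0]] = int(fields[6])
    return counters


def rates(before, after, seconds):
    """Return received bytes per second for each interface present in both samples."""
    first, second = link_ibytes(before), link_ibytes(after)
    return {
        name: (second[name] - first[name]) / seconds
        for name in first
        if name in second
    }
```

Usage: capture `netstat -ib` twice, a known number of seconds apart, and pass both outputs to `rates`. Going by the measurements in this article, an active session should show `llw0` in the hundreds of KB/s while `awdl0` stays near 90 B/s.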
|
||||
|
||||
---
|
||||
|
||||
## Symptom
|
||||
iPhone Mirroring on the Mac sits on **"Connecting…"** forever and never shows the iPhone screen.
|
||||
- Mac: **macOS 27.0 dev beta** (build 26A5353q), MajorAir
|
||||
|
|
|
|||
|
|
@ -1,150 +0,0 @@
|
|||
---
|
||||
title: "Logwatch Reports the Wrong Hostname (`<host>-hetzner`) After a Migration"
|
||||
domain: troubleshooting
|
||||
category: monitoring
|
||||
tags: [logwatch, hostname, hetzner, migration, monitoring, provisioning, fail2ban]
|
||||
status: published
|
||||
created: 2026-06-12
|
||||
updated: 2026-06-14
|
||||
---
|
||||
|
||||
# Logwatch Reports the Wrong Hostname (`<host>-hetzner`) After a Migration
|
||||
|
||||
## Symptom
|
||||
|
||||
Daily Logwatch emails from a recently migrated server arrive titled with the
|
||||
provisioning label instead of the real hostname:
|
||||
|
||||
```
|
||||
Logwatch for tttpod-hetzner (Linux)
|
||||
Logwatch for dcaprod-hetzner (Linux)
|
||||
```
|
||||
|
||||
Everything else works — the report is generated, mailed, and delivered. Only the
|
||||
**name in the title is wrong**, which makes reports harder to scan and breaks any
|
||||
filter or rule that keys on the expected hostname.
|
||||
|
||||
## Cause
|
||||
|
||||
Logwatch titles each report with the box's **live system hostname**
|
||||
(`hostnamectl --static` / `/etc/hostname`) read at runtime — it does *not* keep
|
||||
its own copy of the name.
|
||||
|
||||
Hetzner Cloud servers are provisioned with a temporary node label as the system
|
||||
hostname — `<host>-hetzner` (e.g. `tttpod-hetzner`). The migration runbook renames
|
||||
the **Tailscale node** back to the bare name and sets Postfix `myhostname`, but the
|
||||
**OS hostname** itself is easy to miss because nothing surfaces it day to day. It
|
||||
stays `<host>-hetzner` until something reads `hostname` — Logwatch is usually the
|
||||
first thing to do so, weeks later.
|
||||
|
||||
Confirm the box is actually mislabelled:
|
||||
|
||||
```bash
|
||||
ssh root@<host> 'hostnamectl --static; cat /etc/hostname; grep 127.0.1.1 /etc/hosts'
|
||||
# static: tttpod-hetzner
|
||||
# /etc/hostname: tttpod-hetzner
|
||||
# 127.0.1.1 tttpod-hetzner tttpod-hetzner
|
||||
```
|
||||
|
||||
## Fix
|
||||
|
||||
Set the real hostname and fix the matching `/etc/hosts` loopback line:
|
||||
|
||||
```bash
|
||||
ssh root@<host> '
|
||||
hostnamectl set-hostname <host>
|
||||
sed -i "s/^127\.0\.1\.1.*/127.0.1.1 <host> <host>/" /etc/hosts
|
||||
hostnamectl --static # verify -> <host>
|
||||
'
|
||||
```
|
||||
|
||||
That's it. **Logwatch has no hardcoded hostname override** — verify with:
|
||||
|
||||
```bash
|
||||
grep -ri hostname /etc/logwatch/ /etc/cron.daily/0logwatch /etc/cron.daily/logwatch 2>/dev/null
|
||||
cat /etc/mailname 2>/dev/null
|
||||
```
|
||||
|
||||
If those are empty (the normal case), Logwatch reads the live hostname on its next
|
||||
run, so the **next daily report self-corrects** — no service restart, no logwatch
|
||||
config change needed.
|
||||
|
||||
> [!note] If `grep` *does* find a hostname pinned in `/etc/logwatch/conf/logwatch.conf`
|
||||
> (e.g. a `HostLimit`/`MailFrom` line baked in by Ansible), update it there too —
|
||||
> the override file wins over the live hostname.
|
||||
|
||||
## Sweep the whole fleet
|
||||
|
||||
This is a per-box provisioning leftover, so check every migrated host at once —
|
||||
more than one is usually affected:
|
||||
|
||||
```bash
|
||||
for ip in 100.98.223.93 100.95.137.38 100.64.169.62 100.112.127.0 100.73.85.46; do
|
||||
echo -n "$ip -> "
|
||||
ssh -o ConnectTimeout=8 -o BatchMode=yes root@$ip 'hostnamectl --static' 2>/dev/null \
|
||||
|| echo '(unreachable)'
|
||||
done
|
||||
```
|
||||
|
||||
Any value ending in `-hetzner` (or your provider's build label) needs the fix above.
|
||||
In the 2026-06 sweep, `tttpod` and `dcaprod` were still `*-hetzner` at the OS
|
||||
level; `majortoot`, `majormail`, and `majorlinux` had the correct system hostname
|
||||
— but see the variant below: `majormail`'s *configs* were still stale even though
|
||||
its hostname wasn't.
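The sweep output can also be filtered mechanically rather than by eye. A minimal sketch, assuming the `ip -> hostname` line format printed by the loop above; `stale_hosts` is a hypothetical helper name and the suffix is a parameter so another provider's build label can be substituted.

```python
def stale_hosts(sweep_output, suffix="-hetzner"):
    """Return (ip, hostname) pairs whose hostname still ends with the build label."""
    stale = []
    for line in sweep_output.splitlines():
        ip, sep, name = line.partition(" -> ")
        name = name.strip()
        # Lines such as "(unreachable)" do not end with the suffix and are skipped.
        if sep and name.endswith(suffix):
            stale.append((ip.strip(), name))
    return stale
```

Usage: pipe the loop's output into a file or variable and call `stale_hosts(text)`; every pair returned needs the `hostnamectl set-hostname` fix.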
|
||||
|
||||
## Variant: hostname is correct, but a config has the old name baked in
|
||||
|
||||
A second, sneakier form of this drift: the **system hostname is already right**, so
|
||||
the sweep above passes and the Logwatch report *title* is correct — yet mail still
|
||||
arrives **from** `<host>-hetzner` because the old label is hardcoded in a service's
|
||||
`From`/`sender` field. These fields are static text, not derived from the live
|
||||
hostname, so fixing `hostnamectl` does nothing for them.
|
||||
|
||||
Seen on `majormail` (2026-06-14): system hostname was `majormail`, but
|
||||
`Logwatch@majormail-hetzner...` was still the sender. Two configs held it:
|
||||
|
||||
```bash
|
||||
# sweep a box for the old provisioning label in any send-related config
|
||||
ssh root@<host> 'grep -rsn "<host>-hetzner" /etc/logwatch/ /etc/fail2ban/ \
|
||||
/etc/postfix/ /etc/aliases /etc/mailname 2>/dev/null'
|
||||
# /etc/logwatch/conf/logwatch.conf:MailFrom = Logwatch@<host>-hetzner.majorshouse.com
|
||||
# /etc/fail2ban/jail.local:sender = fail2ban@<host>-hetzner.majorshouse.com
|
||||
```
|
||||
|
||||
Fix in place (no restart needed for Logwatch; reload fail2ban for its change):
|
||||
|
||||
```bash
|
||||
ssh root@<host> '
|
||||
sed -i "s/<host>-hetzner/<host>/g" /etc/logwatch/conf/logwatch.conf /etc/fail2ban/jail.local
|
||||
systemctl reload fail2ban
|
||||
'
|
||||
```
|
||||
|
||||
> [!warning] Check the Ansible source, or it comes back
|
||||
> A live `sed` is undone by the next playbook run if the repo still carries the old
|
||||
> value. Distinguish two cases:
|
||||
> - **Templated** (safe): e.g. `logwatch.yml` sets `MailFrom = Logwatch@{{ inventory_hostname }}...`. If the inventory host is named correctly, a run *regenerates* the right value — it even self-heals a stale box.
|
||||
> - **Static file** (will regress): e.g. `roles/fail2ban/files/hosts/<host>/jail.local` with the literal `sender = ...@<host>-hetzner...`. Grep the repo (`grep -rn "<host>-hetzner" .`) and fix the file too, or every deploy re-pushes the stale sender.
|
||||
|
||||
Inert backups (`jail.local.bak*`, `*~`) may still contain the old string — they
|
||||
don't send mail, so leave them.
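The repo-side check (find the old label in live files, ignore inert backups) can be sketched in Python as an alternative to `grep -rn`. `find_stale_label` is a hypothetical helper; the backup patterns it skips (`.bak` anywhere in the name, trailing `~`) follow the examples in this section and are an assumption about how backups are named.

```python
import os


def find_stale_label(root, label):
    """Return (path, line_number, line) for each live file under root containing label."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into version-control metadata.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            # Inert backups (jail.local.bak*, *~) never send mail, so skip them.
            if ".bak" in name or name.endswith("~"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    for lineno, line in enumerate(handle, 1):
                        if label in line:
                            hits.append((path, lineno, line.strip()))
            except OSError:
                continue  # unreadable file: ignore in this sketch
    return hits
```

Usage: `find_stale_label(".", "<host>-hetzner")` from the root of the Ansible repo. Any hit in a static file under `files/` is a value the next deploy would push back out.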
|
||||
|
||||
## Prevention
|
||||
|
||||
Fold "set the system hostname" into the migration bootstrap so it never drifts:
|
||||
|
||||
```bash
|
||||
hostnamectl set-hostname <host>
|
||||
sed -i "s/^127\.0\.1\.1.*/127.0.1.1 <host> <host>/" /etc/hosts
|
||||
```
|
||||
|
||||
Do this in the **same step** that renames the Tailscale node and sets Postfix
|
||||
`myhostname` — all three read from the provisioning label and all three must be
|
||||
corrected together. See the
|
||||
[VPS Migration Baseline Checklist](../02-selfhosting/cloud/vps-migration-baseline-checklist.md).
|
||||
|
||||
## Related
|
||||
|
||||
- [Logwatch Fleet Setup — Surviving Package Upgrades](../02-selfhosting/monitoring/logwatch-fleet-setup.md) — the broader "logwatch went silent / wrong-source" class, including the Packer `myhostname` variant of this same drift
|
||||
- [VPS Migration Baseline Checklist](../02-selfhosting/cloud/vps-migration-baseline-checklist.md) — the full post-migration verification list
|
||||
- [Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration](networking/ansible-host-key-verification-failed-rebuilt-host.md) — another IP/identity-drift gotcha from the same Hetzner migration
|
||||
|
|
@ -1,154 +0,0 @@
|
|||
---
|
||||
title: "Auditing & Cleaning macOS Background App Activity (sfltool dumpbtm)"
|
||||
domain: troubleshooting
|
||||
category: general
|
||||
tags: [macos, background-tasks, btm, sfltool, login-items, system-extensions, uninstall, little-snitch]
|
||||
status: published
|
||||
created: 2026-06-21
|
||||
updated: 2026-06-21
|
||||
---
|
||||
# Auditing & Cleaning macOS Background App Activity (`sfltool dumpbtm`)
|
||||
|
||||
## Overview
|
||||
macOS tracks every login item, agent, daemon, helper, and extension that may run in the background in its **Background Task Management (BTM)** database. The GUI shows this under **System Settings → General → Login Items & Extensions** ("Allow in the Background"), but the GUI is summarised and hides paths, identifiers, and orphans.
|
||||
|
||||
`sfltool dumpbtm` prints the full BTM database from the command line — and the per-user records need **no `sudo`**. This is the fastest way to answer "what is allowed to run in the background, and does each entry still map to an installed app?"
|
||||
|
||||
## List what's registered
|
||||
|
||||
```bash
|
||||
sfltool dumpbtm # per-user records, no sudo required
|
||||
```
|
||||
|
||||
Each record looks like:
|
||||
|
||||
```
|
||||
Name: CleanMyMac Menu
|
||||
Type: login item (0x4)
|
||||
Disposition: [enabled, allowed, notified] (0xb)
|
||||
Identifier: 4.com.macpaw.CleanMyMac-mas.Menu
|
||||
URL: Contents/Library/LoginItems/CleanMyMac_5_MAS_Menu.app
|
||||
Bundle Identifier: com.macpaw.CleanMyMac-mas.Menu
|
||||
Parent Identifier: 2.com.macpaw.CleanMyMac-mas
|
||||
```
|
||||
|
||||
### Reading the fields
|
||||
- **Disposition** — `enabled` = actively allowed to run in the background. `disabled` = present but off.
|
||||
- **Type** — what kind of item it is:
|
||||
|
||||
| Type | Meaning |
|
||||
|---|---|
|
||||
| `app (0x2)` | A normal application entry |
|
||||
| `login item (0x4)` | Launches at login (menu-bar apps, helpers) |
|
||||
| `agent (0x8)` / `legacy agent` | Per-user background agent |
|
||||
| `legacy daemon (0x10010)` | System-wide background daemon |
|
||||
| `background tasks (0x2000)` | Abstract background-task registration owned by a parent app — **has no file path of its own** |
|
||||
| `developer (0x20)` | A per-developer grouping header (the collapsible row in Settings), **not an app** |
|
||||
| `quicklook` / `spotlight` / `dock tile` | Plugins/extensions — not really "background apps" |
|
||||
|
||||
## Map entries to installed apps (find orphans)
|
||||
|
||||
Two gotchas make naïve path-checking fail:
|
||||
|
||||
1. **Absolute paths are stored as `file://` URLs**, not plain `/…`. Strip the `file://` prefix and URL-decode (`%20` → space).
|
||||
2. **Child items store a *relative* `URL`** (e.g. `Contents/Library/LoginItems/…`) that must be joined to the **parent record's** absolute path, found via `Parent Identifier`.
|
||||
|
||||
A small parser that resolves each record to a real path and flags true orphans:
|
||||
|
||||
```python
import sys, re, os, urllib.parse

# Parse `sfltool dumpbtm` output from stdin into one dict per "#N:" record.
items, cur = [], None

def push():
    global cur
    if cur is not None:
        items.append(cur)

for line in sys.stdin:
    s = line.strip()
    if re.match(r"^#\d+:$", s):          # start of a new record
        push(); cur = {}; continue
    if cur is None:
        continue
    m = re.match(r"^([A-Za-z][A-Za-z /]+):\s*(.*)$", s)
    if m:
        cur[m.group(1).strip()] = m.group(2).strip()
push()

byid = {it["Identifier"]: it for it in items if it.get("Identifier")}

def abspath(it, d=0):
    # Resolve a record's URL to an absolute path; d caps parent recursion.
    if d > 8:
        return None
    u = it.get("URL", "")
    if u and u != "(null)":
        if u.startswith("file://"):
            return urllib.parse.unquote(u[7:]).rstrip("/")
        if u.startswith("/"):
            return u.rstrip("/")
        # Relative URL: join it to the parent record's absolute path.
        par = byid.get(it.get("Parent Identifier", ""))
        if par:
            b = abspath(par, d + 1)
            if b:
                return os.path.join(b, urllib.parse.unquote(u)).rstrip("/")
    return None

for it in items:
    if not it.get("Name"):
        continue
    p = abspath(it)
    if p and not os.path.exists(p):
        print("ORPHAN:", it["Name"], "->", p)
```
|
||||
|
||||
```bash
|
||||
sfltool dumpbtm | python3 btm_check.py
|
||||
```
|
||||
|
||||
> **Expected non-orphans:** `background tasks (0x2000)` and `developer (0x20)` rows legitimately store no path — they are not missing apps. Helpers/daemons that resolve *inside* a parent bundle (e.g. `/Applications/Foo.app/Contents/Library/LoginItems/…`) or in `/Library/…` are also fine; they just don't appear as a top-level `.app`. That is usually why an entry "has no application you can find."
|
||||
|
||||
## Disable background for an app
|
||||
|
||||
This **cannot be scripted** — Apple deliberately gates the toggle behind the GUI:
|
||||
|
||||
**System Settings → General → Login Items & Extensions → "Allow in the Background"** → switch the app off.
|
||||
|
||||
Disabling a `developer (0x20)` grouping header turns off all of that developer's sub-items at once.
|
||||
|
||||
## Uninstall cleanly — the system-extension trap
|
||||
|
||||
**Dragging an app to the Trash is not a full uninstall.** Apps that install a **network/system extension** plus a privileged daemon (firewalls and VPNs especially — Little Snitch, Mullvad, etc.) leave their `/Library` daemon **still loaded and running** after the app is trashed. The BTM entry persists and the background service keeps working.
|
||||
|
||||
### 1. Prefer the app's own uninstaller
|
||||
- **Bundled uninstall script** (Mullvad): runs cleanly, deactivates the system extension, resets the firewall.
|
||||
```bash
|
||||
sudo "/Applications/Mullvad VPN.app/Contents/Resources/uninstall.sh"
|
||||
```
|
||||
- Some apps ship an uninstaller in their DMG or a CLI tool. **Note:** Little Snitch 6.x has **no DMG uninstaller and no `littlesnitch uninstall` subcommand** — manual removal is the supported route there.
|
||||
|
||||
### 2. Check whether a system extension is still active
|
||||
```bash
|
||||
systemextensionsctl list
|
||||
```
|
||||
If the app's extension is **not** listed (only unrelated ones such as Tailscale or Canon remain), the extension is already deactivated, and removing the remaining files manually is safe and completes the uninstall.
|
||||
|
||||
### 3. Manual removal (when no uninstaller exists)
|
||||
Find every component first:
|
||||
```bash
|
||||
ls /Library/LaunchDaemons/<id>* /Library/LaunchAgents/<id>* 2>/dev/null
|
||||
ls -d "/Library/Application Support/<Vendor>" 2>/dev/null
|
||||
ls ~/Library/Preferences/<id>* 2>/dev/null
|
||||
```
|
||||
Then boot out the daemon and remove the files:
|
||||
```bash
|
||||
sudo launchctl bootout system /Library/LaunchDaemons/<id>.daemon.plist 2>/dev/null
|
||||
sudo rm -f /Library/LaunchDaemons/<id>.daemon.plist /Library/LaunchAgents/<id>.agent.plist
|
||||
sudo rm -rf "/Library/Application Support/<Vendor>" "$HOME/.Trash/<App>.app"
|
||||
rm -f ~/Library/Preferences/<id>*.plist # user-owned, no sudo
|
||||
```
|
||||
|
||||
> **Shared-container caution:** before deleting `~/Library/Group Containers/*`, check it isn't shared. Microsoft apps share `UBF8T346G9.com.microsoft.oneauth`, `…entrabroker`, and `…teams` across Office/Teams/RDP — delete only the app-specific container (e.g. `…com.microsoft.rdc`), never the shared auth ones.
|
||||
|
||||
## Stale BTM "ghost" entries
|
||||
|
||||
After a manual uninstall, `sfltool dumpbtm` may still list the removed app, pointing at now-deleted paths. These are harmless orphans (nothing left to load). **BTM reconciles them on the next reboot / login cycle** — a reboot also finalises any system-extension teardown.
|
||||
|
||||
## Quick reference
|
||||
|
||||
```bash
|
||||
sfltool dumpbtm # full per-user BTM dump (no sudo)
|
||||
sfltool dumpbtm | grep -A6 'Name:' # browse records
|
||||
systemextensionsctl list # active network/system extensions
|
||||
# Verify a removal:
|
||||
sfltool dumpbtm | grep -i <vendor> # should be empty after a reboot
|
||||
```
|
||||
|
||||
## See also
|
||||
- Apple gates "Allow in the Background" behind System Settings — there is no supported CLI toggle for BTM dispositions.
|
||||
- For VPN/firewall apps, always reach for the vendor uninstaller first; manual `rm` alone can leave a registered system extension behind.
|
||||
|
|
@ -1,94 +0,0 @@
|
|||
---
|
||||
title: "Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration"
|
||||
domain: troubleshooting
|
||||
category: networking
|
||||
tags: [ansible, ssh, known-hosts, tailscale, host-key, migration]
|
||||
status: published
|
||||
created: 2026-06-12
|
||||
updated: 2026-06-12
|
||||
---
|
||||
|
||||
# Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration
|
||||
|
||||
## Symptom
|
||||
|
||||
A subset of hosts in an Ansible run fail at **Gathering Facts** while the rest succeed:
|
||||
|
||||
```
|
||||
[ERROR]: Task failed: Data could not be sent to remote host "100.112.127.0".
|
||||
Make sure this host can be reached over ssh: Host key verification failed.
|
||||
fatal: [majormail]: UNREACHABLE! => {"unreachable": true, ...}
|
||||
```
|
||||
|
||||
The failing hosts are exactly the ones that were recently **rebuilt or migrated** (new server, new OS install, or a cloud move that issued a new Tailscale IP). Hosts that were never rebuilt connect fine.
|
||||
|
||||
Confusingly, **interactive `ssh root@<host>` works perfectly** for the same boxes — only Ansible fails.
|
||||
|
||||
## Cause
|
||||
|
||||
SSH stores each accepted host key in `~/.ssh/known_hosts` keyed by the **exact address you connected with**. A key accepted for `ssh root@tttpod` is saved under the hostname `tttpod`; it is *not* indexed under that node's IP.
|
||||
|
||||
Ansible inventories almost always set `ansible_host` to a **literal IP** (here, the Tailscale `100.x.x.x` address). So Ansible's SSH lookup is by IP and finds no matching entry. Ansible runs SSH non-interactively, so there is no prompt to accept the unknown key, and with strict host-key checking in effect SSH refuses the connection:
|
||||
|
||||
```
|
||||
No ED25519 host key is known for 100.112.127.0 and you have requested strict checking.
|
||||
Host key verification failed.
|
||||
```
|
||||
|
||||
The hostname-form and IP-form entries are independent. Fixing interactive SSH (e.g. converting aliases to MagicDNS names and re-accepting keys) does **nothing** for Ansible, because Ansible never uses the hostname.
|
||||
|
||||
A rebuilt host also generates **brand-new host keys**, so any old IP-form entry would additionally be a mismatch — but the common case after a migration to a *new* IP is simply that no IP entry exists at all.
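The independence of hostname-form and IP-form entries can be seen by listing the names a `known_hosts` file actually covers. A minimal sketch: `known_hosts_names` is a hypothetical helper that reads only plain (unhashed) entries. Hashed entries (`|1|...`) cannot be matched this way, so `ssh-keygen -F` remains the authoritative check.

```python
def known_hosts_names(known_hosts_text):
    """Return the set of plain host names and addresses listed in known_hosts text."""
    names = set()
    for line in known_hosts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if fields[0].startswith("@"):      # @cert-authority / @revoked marker
            fields = fields[1:]
        if not fields or fields[0].startswith("|1|"):
            continue                       # hashed entry: not readable here
        names.update(fields[0].split(","))  # one line may list several names
    return names
```

Usage: `"100.112.127.0" in known_hosts_names(open(os.path.expanduser("~/.ssh/known_hosts")).read())` (with `import os`). A file that lists `tttpod` but not the node's IP reproduces exactly the gap described above.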
|
||||
|
||||
## Diagnosis
|
||||
|
||||
```bash
|
||||
# 1. Is there any known_hosts entry for the failing IP? (no output, exit status 1 = none)
|
||||
ssh-keygen -F 100.112.127.0
|
||||
|
||||
# 2. Reproduce the exact failure without an interactive prompt:
|
||||
ssh -o BatchMode=yes -o StrictHostKeyChecking=yes root@100.112.127.0 true
|
||||
# -> "Host key verification failed." confirms the gap
|
||||
|
||||
# 3. Confirm the inventory IP is actually the host's CURRENT address
|
||||
# (guards against stale-IP drift, a separate problem):
|
||||
tailscale status | grep majormail
|
||||
ssh-keyscan -t ed25519 100.112.127.0 | ssh-keygen -lf - # fingerprint it
|
||||
```
|
||||
|
||||
If step 3 shows the inventory IP matches the live Tailscale node and the box answers `ssh-keyscan`, the only problem is the missing IP-form key.
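
To check a whole inventory at once rather than one IP at a time, the step 1 lookup can be looped. This is a minimal sketch, not part of the original procedure: `missing_known_hosts` is a hypothetical helper, and it assumes plain-text (unhashed) `known_hosts` entries. With `HashKnownHosts yes`, swap the `awk` match for `ssh-keygen -F "$host" -f "$file"`.

```bash
#!/usr/bin/env bash
# missing_known_hosts FILE HOST...
# Print every HOST that has no entry in the known_hosts-style FILE.
# Assumption: entries are unhashed, so the first field is a plain
# comma-separated list of names and addresses.
missing_known_hosts() {
  local file=$1 host
  shift
  for host in "$@"; do
    awk -v h="$host" '
      /^#/ || NF < 3 { next }                      # skip comments and short lines
      { n = split($1, names, ",")
        for (i = 1; i <= n; i++) if (names[i] == h) found = 1 }
      END { exit !found }                          # exit 0 only when matched
    ' "$file" || printf '%s\n' "$host"
  done
}
```

For example, `missing_known_hosts ~/.ssh/known_hosts 100.98.223.93 100.112.127.0` would list only the addresses that still need a keyscan.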
|
||||
|
||||
## Fix
|
||||
|
||||
Add the **IP-form** host keys to the `known_hosts` of the user that runs Ansible. Back up first, scan over the tailnet, de-dup:
|
||||
|
||||
```bash
|
||||
cp ~/.ssh/known_hosts ~/.ssh/known_hosts.bak.$(date +%Y%m%d)
|
||||
|
||||
for ip in 100.98.223.93 100.112.127.0 100.73.85.46 100.95.137.38 100.76.51.16 100.64.169.62; do
|
||||
  ssh-keyscan -T 5 -t rsa,ecdsa,ed25519 "$ip" >> ~/.ssh/known_hosts
|
||||
done
|
||||
sort -u ~/.ssh/known_hosts -o ~/.ssh/known_hosts
|
||||
```
|
||||
|
||||
Verify before re-running the playbook:
|
||||
|
||||
```bash
|
||||
ansible <hosts> -m ping # expect "pong" from each
|
||||
```
|
||||
|
||||
### Why `ssh-keyscan` is safe here
|
||||
|
||||
`ssh-keyscan` trusts whatever answers on the wire — normally a MITM risk. Over **Tailscale**, the connection rides WireGuard, which cryptographically authenticates the peer by its tailnet identity: reaching `100.x.x.x` *guarantees* you are talking to the node that owns that tailnet address. Scanning and trusting the key over the tailnet is therefore as trustworthy as the tailnet itself. Always cross-check the IP against `tailscale status` first (step 3) so you scan the right node.
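
The cross-check against `tailscale status` can be scripted. The sketch below assumes the status output puts the tailnet IP in the first column and the node name in the second, as in the status lines quoted elsewhere in this wiki. `tailnet_ip` and `check_inventory_ip` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# tailnet_ip NAME
# Read `tailscale status` text on stdin and print the IP listed for NAME.
tailnet_ip() {
  awk -v name="$1" '$2 == name { print $1; exit }'
}

# check_inventory_ip NAME EXPECTED_IP
# Compare the inventory's IP with the live one from stdin; fail on drift.
check_inventory_ip() {
  local name=$1 expected=$2 live
  live=$(tailnet_ip "$name")
  if [ "$live" = "$expected" ]; then
    echo "$name: ok ($live)"
  else
    echo "$name: STALE inventory=$expected live=${live:-none}"
    return 1
  fi
}
```

Usage would look like `tailscale status | check_inventory_ip majormail 100.112.127.0`, run before the keyscan loop so a stale inventory address is caught first.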
|
||||
|
||||
## Prevention
|
||||
|
||||
- **Per-workstation, not fleet-wide.** `known_hosts` is local to each machine + user. After a migration, *every* host that runs Ansible (each workstation, plus any control node like `majorlab`) needs the IP keys added independently. Adding them on one Mac does not help the others.
|
||||
- **Sweep on every migration phase.** A rolling migration changes one node's IP at a time; fold the keyscan above into the post-cutover checklist so Ansible never breaks mid-rollout.
|
||||
- **Alternative: disable host-key checking.** Setting `host_key_checking = False` in `ansible.cfg` (or `ANSIBLE_HOST_KEY_CHECKING=False`) sidesteps the prompt but trades away host-key verification entirely. Prefer the explicit keyscan: it keeps strict checking on for every *future* run while accepting the new key exactly once, under your control.
|
||||
|
||||
## Related
|
||||
|
||||
- SSH-Aliases — Fleet SSH access; the MagicDNS-vs-pinned-IP strategy and the Ansible-by-IP `known_hosts` note
|
||||
- Network Overview — Tailscale fleet inventory and current IPs
|
||||
- Hetzner-Migration-Status — the migration that triggered the fleet-wide IP churn
|
||||
- [[ssh-socket-tailscale-race-condition]] — a different "SSH unreachable after reboot" failure mode
|
||||
|
|
@@ -1,133 +0,0 @@
|
|||
---
|
||||
title: "SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure (No `Host` Block)"
|
||||
domain: selfhosting
|
||||
category: troubleshooting
|
||||
tags:
|
||||
- ssh
|
||||
- ssh-config
|
||||
- tailscale
|
||||
- magicdns
|
||||
- known-hosts
|
||||
- host-key
|
||||
- troubleshooting
|
||||
status: published
|
||||
created: 2026-06-11
|
||||
updated: 2026-06-12
|
||||
---
|
||||
|
||||
# SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure (No `Host` Block)
|
||||
|
||||
## The Problem
|
||||
|
||||
You `ssh` to a host you've reached many times before, but now it dies before any
|
||||
auth happens:
|
||||
|
||||
```
|
||||
$ ssh MyMac
|
||||
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
|
||||
Host key verification failed.
|
||||
```
|
||||
|
||||
On a headless box (WSL, a server, a CI runner) there's no askpass binary, so the
|
||||
prompt can't even be shown — SSH just aborts. Connecting **by Tailscale IP** works
|
||||
fine:
|
||||
|
||||
```
|
||||
$ ssh user@100.74.124.81 # works
|
||||
$ ssh MyMac # Host key verification failed
|
||||
```
|
||||
|
||||
## Why It Happens
|
||||
|
||||
There is **no `Host MyMac` block in `~/.ssh/config` at all** — and there never was.
|
||||
The connection only ever worked by IP, or interactively (where you clicked through
|
||||
the first-connect `yes` prompt without noticing).
|
||||
|
||||
When no `Host` block matches, SSH uses the literal argument as the hostname. With
|
||||
Tailscale MagicDNS, `MyMac` (or `mymac`) resolves to the node — so the *connection*
|
||||
succeeds — but the host key it presents is checked against `known_hosts` under the
|
||||
name **`mymac`**, which has no entry. Meanwhile the key you actually trust is stored
|
||||
under the **IP**:
|
||||
|
||||
```
|
||||
$ ssh-keygen -F 100.74.124.81 # found — line 67
|
||||
$ ssh-keygen -F mymac # nothing
|
||||
```
|
||||
|
||||
So strict host-key checking has nothing to match, tries to prompt to accept the
|
||||
"new" key, and on a headless host that prompt fails → `Host key verification failed`.
|
||||
|
||||
Confirm there's no block (and that `ssh -G` is just echoing defaults):
|
||||
|
||||
```
|
||||
$ ssh -G MyMac | grep -E '^(hostname|user|port) '
|
||||
hostname mymac # lowercased literal — NOT an explicit HostName
|
||||
user youruser # your local username default — not from a block
|
||||
port 22 # default
|
||||
```
|
||||
|
||||
If `hostname` equals the arg you typed (just lowercased) and `user` is your local
|
||||
login name, there is no matching `Host` block.
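
The same test can be expressed as a small filter over `ssh -G` output. This is a sketch under one assumption: a `Host` block that sets no `HostName`, or sets it to the alias itself, looks identical to a fall-through, so a positive result means "no `HostName` override", not proof that no block exists. `falls_through` is a hypothetical name.

```bash
#!/usr/bin/env bash
# falls_through ALIAS
# Read `ssh -G ALIAS` output on stdin. Succeed when the resolved hostname
# is just the lowercased alias, i.e. nothing overrode HostName.
falls_through() {
  local alias_lc resolved
  alias_lc=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  resolved=$(awk '$1 == "hostname" { print $2; exit }')
  [ "$resolved" = "$alias_lc" ]
}
```

For example: `ssh -G MyMac | falls_through MyMac && echo "MyMac has no HostName override"`.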
|
||||
|
||||
## The Fix
|
||||
|
||||
Add an explicit `Host` block that **pins the IP** that `known_hosts` already trusts.
|
||||
This matches the convention every other host in a Tailscale fleet should follow —
|
||||
pin the `100.x` address, not the MagicDNS name:
|
||||
|
||||
```sshconfig
|
||||
Host MyMac mymac
|
||||
HostName 100.74.124.81
|
||||
User youruser
|
||||
IdentityFile ~/.ssh/id_ed25519
|
||||
```
|
||||
|
||||
> [!note] When pinning the IP is the *wrong* call
|
||||
> Pinning the IP is right while the host is **stable**. If the box gets migrated or
|
||||
> rebuilt — new Tailscale IP *and* new host key — the pin rots and `known_hosts`
|
||||
> mismatches. At that point switch to **MagicDNS names** so the alias self-heals. See
|
||||
> *[MagicDNS Names vs Pinned IPs for Tailscale SSH (After a Fleet Migration)](tailscale-ssh-magicdns-vs-pinned-ip-after-migration.md)*.
|
||||
|
||||
Now `ssh MyMac` resolves to `100.74.124.81`, whose key is in `known_hosts`, and the
|
||||
check passes with no prompt. Verify non-interactively:
|
||||
|
||||
```
|
||||
$ ssh -o BatchMode=yes MyMac 'hostname'
|
||||
mymac.majorlan
|
||||
```
|
||||
|
||||
`BatchMode=yes` disables every prompt — if it returns the hostname cleanly, the key
|
||||
is trusted and a real key authenticated.
|
||||
|
||||
**Don't over-pin the identity.** Run `ssh -v user@<IP> true` and check the
|
||||
`Will attempt key` / accepted-key lines first. A workstation often authenticates
|
||||
with the *default* `id_ed25519`, not a fleet key — if `id_ed25519_fleet` isn't even
|
||||
offered, don't put it in the block.
|
||||
|
||||
## Cleanup: Stale `known_hosts` Cruft
|
||||
|
||||
Drive-by `ssh` attempts leave junk entries like `mymac-2` (auto-suffixed names from
|
||||
old keys). They never match anything once you pin the IP. Purge them:
|
||||
|
||||
```
|
||||
$ ssh-keygen -R mymac-2
|
||||
```
|
||||
|
||||
## How to Diagnose This
|
||||
|
||||
1. `ssh -o BatchMode=yes <alias> true` — if it fails with `Host key verification
|
||||
failed` (not `Permission denied`), it's a host-key problem, not auth.
|
||||
2. `ssh -G <alias> | grep -E '^(hostname|user|port) '` — if `hostname` is just your
|
||||
typed arg and there's no real `HostName`, there's no `Host` block.
|
||||
3. `ssh-keygen -F <name>` vs `ssh-keygen -F <ip>` — find which name actually holds
|
||||
the trusted key. Pin whichever one `known_hosts` has (usually the IP).
|
||||
|
||||
## Why This Gotcha Is Invisible
|
||||
|
||||
It only surfaces on a host with **no askpass** (headless / WSL / cron). On a desktop,
|
||||
the first-connect prompt appears, you hit `yes`, an entry gets written under the
|
||||
MagicDNS name, and it "just works" — masking the fact that no `Host` block exists and
|
||||
the IP-keyed entry is the only durable trust. Move the same config to a headless box
|
||||
and the missing block becomes a hard failure. Related: SSH only applies `Host` blocks
|
||||
by **literal pattern match**, so connecting by IP also skips them — see *Ansible Fails
|
||||
with Permission Denied While `ssh <alias>` Works (Host Alias Bypass)*.
|
||||
|
|
@@ -1,160 +0,0 @@
|
|||
---
|
||||
title: "SSH `Permission denied (publickey)` After Rotating a Key — Backfill Every `authorized_keys`"
|
||||
domain: selfhosting
|
||||
category: troubleshooting
|
||||
tags:
|
||||
- ssh
|
||||
- ssh-keys
|
||||
- authorized-keys
|
||||
- key-rotation
|
||||
- publickey
|
||||
- fleet
|
||||
- troubleshooting
|
||||
status: published
|
||||
created: 2026-06-17
|
||||
updated: 2026-06-17
|
||||
---
|
||||
|
||||
# SSH `Permission denied (publickey)` After Rotating a Key — Backfill Every `authorized_keys`
|
||||
|
||||
## The Problem
|
||||
|
||||
A host you've SSH'd into for months suddenly rejects you — but **only some hosts**, not all:
|
||||
|
||||
```
|
||||
$ ssh root@host-a
|
||||
root@host-a: Permission denied (publickey).
|
||||
|
||||
$ ssh root@host-b # same key, same workstation — works fine
|
||||
host-b $
|
||||
```
|
||||
|
||||
Nothing changed on the servers. The thing that changed is on **your** side: at some
|
||||
point the workstation's SSH key was **regenerated** (lost laptop, rebuild, a key file
|
||||
clobbered by a botched copy, a routine rotation). The new public key was pushed to a
|
||||
few hosts but never fanned out to the rest. Every host still holding only the *old*
|
||||
public key now rejects the new private key with `Permission denied (publickey)`.
|
||||
|
||||
> The tell: it's `Permission denied (publickey)`, **not** `Host key verification
|
||||
> failed`. The former is an **authentication** failure (the server doesn't trust your
|
||||
> key); the latter is the server's key not matching your `known_hosts`. Different
|
||||
> problem — see *[SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure](ssh-missing-host-block-magicdns-host-key-failure.md)*.
|
||||
|
||||
## Why It Happens
|
||||
|
||||
Public-key auth is **per-host**: the server only lets you in if your public key is a
|
||||
line in that host's `~/.ssh/authorized_keys`. There is no central directory — each
|
||||
host is its own island. So when you rotate a key, *every* host needs the new public
|
||||
key appended independently.
|
||||
|
||||
It's easy to do this partially without noticing. You regenerate the key, then over the
|
||||
next hour you happen to SSH into three boxes and (re-)deploy the key there as part of
|
||||
other work. Those three now trust the new key. The other six don't — and you won't
|
||||
find out until weeks later when you reach for one of them.
|
||||
|
||||
Confirm it's an authentication (key) failure and see which key is being offered:
|
||||
|
||||
```
|
||||
$ ssh -v root@host-a 2>&1 | grep -E 'Offering|Authentications|Permission denied'
|
||||
debug1: Offering public key: /home/you/.ssh/id_ed25519 ED25519 SHA256:XeY1/N9qwB…
|
||||
debug1: Authentications that can continue: publickey
|
||||
root@host-a: Permission denied (publickey).
|
||||
```
|
||||
|
||||
The server offered you nothing but `publickey`, you offered your current key, and it
|
||||
was refused → your key isn't in that host's `authorized_keys`.
|
||||
|
||||
## Scope It First — Don't Fix One Host at a Time
|
||||
|
||||
The host you noticed is rarely the only one. Sweep the whole fleet in one pass before
|
||||
touching anything, so you fix the real set, not just the squeaky wheel:
|
||||
|
||||
```bash
|
||||
for h in host-a host-b host-c host-d host-e host-f; do
|
||||
  r=$(ssh -o BatchMode=yes -o ConnectTimeout=8 root@"$h" 'echo OK' 2>&1 | tail -1)
|
||||
  echo "$h: $r"
|
||||
done
|
||||
```
|
||||
|
||||
`BatchMode=yes` suppresses password/passphrase prompts so a failure fails fast instead
|
||||
of hanging. Anything that doesn't print `OK` needs the backfill.
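
Because the sweep captures the last line of each attempt, that line can be sorted into the failure classes this wiki distinguishes. A minimal sketch, assuming the standard OpenSSH client messages quoted in these articles. `classify_ssh_result` is a hypothetical helper, and the `network` bucket is a catch-all for anything that failed before authentication.

```bash
#!/usr/bin/env bash
# classify_ssh_result LINE
# Map the final line of an `ssh -o BatchMode=yes ... 'echo OK'` attempt
# to one of: ok, auth, hostkey, network, unknown.
classify_ssh_result() {
  case "$1" in
    OK)                                 echo ok ;;
    *"Permission denied"*)              echo auth ;;     # key not in authorized_keys
    *"Host key verification failed"*)   echo hostkey ;;  # known_hosts problem
    *"timed out"*|*"Connection refused"*|*"Could not resolve"*)
                                        echo network ;;
    *)                                  echo unknown ;;
  esac
}
```

Inside the sweep loop this would replace the plain `echo`, for example `echo "$h: $(classify_ssh_result "$r")"`, so that only the `auth` hosts go on the backfill list.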
|
||||
|
||||
## The Fix
|
||||
|
||||
You need a **second, still-trusted** way onto each failing host to append the new key.
|
||||
Common transit options, best first:
|
||||
|
||||
- **Another of your keys that still works** (e.g. a config-management / automation
|
||||
user whose key is authorized fleet-wide, ideally with `sudo`).
|
||||
- **Another workstation** whose key those hosts still trust.
|
||||
- **The provider's web console / serial console** as a last resort.
|
||||
|
||||
> [!warning] A jump host only helps if *it* can reach the target
|
||||
> "Bounce through a box that still trusts me" only works if that box's own key is in
|
||||
> the target's `authorized_keys`. A host can trust *your* key yet have no standing
|
||||
> trust to a third host (and hit its own `Host key verification failed` on the way).
|
||||
> Test the full two-hop path before relying on it.
|
||||
|
||||
Using a fleet-wide automation user (`deploy`) with passwordless `sudo` as the transit,
|
||||
append the new key idempotently, with a backup, to every failing host:
|
||||
|
||||
```bash
|
||||
PUBKEY=$(cat ~/.ssh/id_ed25519.pub)
|
||||
STAMP=$(date +%Y%m%d-%H%M%S)
|
||||
for h in host-a host-c host-e; do # only the hosts that failed the sweep
|
||||
  ssh deploy@"$h" "sudo bash -s" <<EOF
|
||||
set -e
|
||||
F=/root/.ssh/authorized_keys
|
||||
mkdir -p /root/.ssh && touch "\$F"
|
||||
cp "\$F" "\$F.bak-$STAMP" # backup before any change
|
||||
grep -qF "$PUBKEY" "\$F" || printf '%s\n' "$PUBKEY" >> "\$F" # append only if absent
|
||||
chmod 600 "\$F"
|
||||
EOF
|
||||
done
|
||||
```
|
||||
|
||||
Three things that keep this safe:
|
||||
|
||||
- **Append, never overwrite.** `>> "$F"` and the `grep -qF … ||` guard mean you add
|
||||
one line and only if it's missing. Re-running is a no-op — never clobber an
|
||||
`authorized_keys` with `>` or you'll lock out every *other* key on the box.
|
||||
- **Back up first.** The `.bak-<stamp>` copy is your undo.
|
||||
- **`chmod 600`.** With the default `StrictModes yes`, sshd silently ignores an `authorized_keys` that's group/world
|
||||
writable, which looks exactly like "the key didn't take."
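
The three safeguards can be tried locally before touching a real host. The function below is an illustrative sketch, not the remote script itself: `append_key_once` is a hypothetical name, and it matches whole lines (`grep -x`), which is slightly stricter than the substring match used in the loop above.

```bash
#!/usr/bin/env bash
# append_key_once FILE KEY
# Idempotently add KEY as one line of FILE: back up, append if absent, chmod.
append_key_once() {
  local file=$1 key=$2
  mkdir -p "$(dirname "$file")"
  touch "$file"
  cp -p "$file" "$file.bak-$(date +%Y%m%d-%H%M%S)"              # undo copy
  grep -qxF -- "$key" "$file" || printf '%s\n' "$key" >> "$file" # append only if absent
  chmod 600 "$file"
}
```

Running it twice against a scratch file, for example `append_key_once /tmp/ak-test "$(cat ~/.ssh/id_ed25519.pub)"`, should leave exactly one copy of the key in the file.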
|
||||
|
||||
Then verify directly — not through the transit user:
|
||||
|
||||
```bash
|
||||
for h in host-a host-c host-e; do
|
||||
  echo "$h: $(ssh -o BatchMode=yes root@"$h" 'echo OK' 2>&1 | tail -1)"
|
||||
done
|
||||
```
|
||||
|
||||
All `OK` means the new key authenticates on its own.
|
||||
|
||||
## Prevention
|
||||
|
||||
- **Treat rotation as fleet-wide.** When a workstation key changes, the very next step
|
||||
is to fan the new public key out to **every** host's `authorized_keys` in one pass —
|
||||
not opportunistically as you happen to log in. A short `for` loop over the full host
|
||||
list (or a config-management task — see below) closes the gap immediately.
|
||||
- **Manage `authorized_keys` declaratively.** An Ansible `ansible.posix.authorized_key`
|
||||
task (or equivalent) that lists the *current* set of keys makes "who can log in" a
|
||||
reviewed, version-controlled fact instead of an append-only pile that drifts per host.
|
||||
- **Keep the old key authorized until the new one is verified everywhere**, then remove
|
||||
the stale line in a deliberate cleanup pass.
|
||||
|
||||
## How to Diagnose This (Checklist)
|
||||
|
||||
1. `ssh -o BatchMode=yes <host> true` → `Permission denied (publickey)` (auth), not
|
||||
`Host key verification failed` (host key). Confirms which problem you have.
|
||||
2. `ssh -v <host> 2>&1 | grep Offering` → which private key is being offered, and its
|
||||
fingerprint.
|
||||
3. Sweep the whole fleet with the `BatchMode` loop → get the **full** list of affected
|
||||
hosts before fixing.
|
||||
4. Append the new public key (idempotent, backed up, `chmod 600`) via a still-trusted
|
||||
transit path.
|
||||
5. Re-verify each host with a direct `BatchMode` login.
|
||||
|
||||
Related: *[SSH Config & Key Management](../../01-linux/networking/ssh-config-key-management.md)*
|
||||
and *[SSH Hardening Across a Fleet with Ansible](../../02-selfhosting/security/ssh-hardening-ansible-fleet.md)*.
|
||||
|
|
@@ -1,133 +0,0 @@
|
|||
---
|
||||
title: "Steam Deck Wi-Fi Flapping: IWD Periodic Scan + rtw88 Power Save"
|
||||
domain: troubleshooting
|
||||
category: networking
|
||||
tags: [wifi, steam-deck, steamos, iwd, networkmanager, rtw88, rtl8822ce, power-save, supplicant-disconnect, flapping]
|
||||
status: published
|
||||
created: 2026-06-19
|
||||
updated: 2026-06-19
|
||||
---
|
||||
|
||||
# Steam Deck Wi-Fi Flapping: IWD Periodic Scan + rtw88 Power Save
|
||||
|
||||
## 🛑 Problem
|
||||
|
||||
An OG Steam Deck (LCD model, Realtek **RTL8822CE** on the `rtw88_8822ce` driver) kept "losing" Wi-Fi — it would connect, hold for around a minute, drop, then reconnect a second later, over and over. From the router side the device looked like it was constantly coming and going; from the couch it felt like the network "wouldn't stay connected."
|
||||
|
||||
Crucially, **this was not a router problem.** The AP config was correct, RF was clean (strong signal, zero tx retries / beacon loss), and every other client on the network was rock-solid. The fault was entirely on the Deck.
|
||||
|
||||
## 🔍 Diagnosis
|
||||
|
||||
SteamOS uses **NetworkManager with the `iwd` backend** (not `wpa_supplicant`). That detail is the whole ballgame.
|
||||
|
||||
### Step 1 — Confirm the flap and its cadence
|
||||
|
||||
```bash
|
||||
# how many disconnects this boot?
|
||||
journalctl -b -u NetworkManager --no-pager | grep -c supplicant-disconnect
|
||||
# 50
|
||||
|
||||
# when did they happen?
|
||||
journalctl -b -u NetworkManager --no-pager | grep supplicant-disconnect \
|
||||
| awk '{print $1,$2,$3}' | tail
|
||||
# 10:20:52 · 10:21:54 · 10:22:57 · 10:24:00 · 10:25:03 · 10:26:05 · 10:27:08 ...
|
||||
```
|
||||
|
||||
**~63 seconds between every drop.** A fixed, metronome-like interval is the tell — this is a *timer*, not RF noise. The NetworkManager log shows the pattern plainly:
|
||||
|
||||
```
|
||||
activated -> failed (reason 'supplicant-disconnect')
|
||||
... -> activated # reconnects ~1s later
|
||||
```
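
The cadence can be computed rather than eyeballed. A small sketch, assuming one `HH:MM:SS` timestamp per line on stdin and no midnight rollover within the sample. `intervals` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# intervals
# Read HH:MM:SS timestamps (one per line) and print the gap in seconds
# between each consecutive pair.
intervals() {
  awk -F: '{
    t = $1 * 3600 + $2 * 60 + $3     # seconds since midnight
    if (NR > 1) print t - prev
    prev = t
  }'
}
```

With the default `journalctl` line format, where the time is the third field, the pipeline would be `journalctl -b -u NetworkManager --no-pager | grep supplicant-disconnect | awk '{print $3}' | intervals`. A column of near-identical numbers points to a timer; widely scattered gaps point back toward RF.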
|
||||
|
||||
### Step 2 — Prove the link is healthy *when it's up*
|
||||
|
||||
```bash
|
||||
iw dev wlan0 station dump | grep -iE 'signal|bitrate|failed|retries|beacon loss'
|
||||
# signal: -65 dBm
|
||||
# tx retries: 0
|
||||
# tx failed: 0
|
||||
# beacon loss: 0
|
||||
```
|
||||
|
||||
Strong signal, zero retries, zero beacon loss — the association is clean while it lasts. So the drop is being *commanded*, not caused by a bad radio link.
|
||||
|
||||
### Step 3 — Identify the chip and the backend
|
||||
|
||||
```bash
|
||||
lspci -k | grep -A3 -iE 'network|wireless'
|
||||
# Realtek RTL8822CE ... Kernel driver in use: rtw88_8822ce
|
||||
```
|
||||
|
||||
The `~63s` interval is **IWD's default periodic background scan**. With no `/etc/iwd/main.conf` present, IWD scans on a timer even while connected, and on the `rtw88` driver that scan knocks the current association over — producing the `supplicant-disconnect` every minute.
|
||||
|
||||
A secondary annoyance: `iw dev wlan0 get power_save` reported `on`, which showed up as wildly jittery LAN latency (8–69 ms to the gateway over Wi-Fi, where a healthy 5 GHz link is 2–10 ms).
|
||||
|
||||
## ✅ Fix
|
||||
|
||||
Two independent changes — the first stops the flap, the second smooths latency.
|
||||
|
||||
### 1. Disable IWD's periodic scan (stops the flap)
|
||||
|
||||
```bash
|
||||
sudo mkdir -p /etc/iwd
|
||||
printf '[Scan]\nDisablePeriodicScan=true\n' | sudo tee /etc/iwd/main.conf
|
||||
sudo systemctl restart iwd # briefly drops Wi-Fi; NetworkManager auto-reconnects
|
||||
```
|
||||
|
||||
Trade-off: with periodic scanning off, the Deck roams to a different/stronger AP (e.g. another AiMesh node) more lazily. Fine for a device that mostly sits in one spot.
|
||||
|
||||
### 2. Disable Wi-Fi power save (kills the latency jitter)
|
||||
|
||||
The obvious `nmcli connection modify <name> 802-11-wireless.powersave 2` **does not work under the IWD backend** — NetworkManager doesn't enforce that property when `iwd` is managing the radio. Use a dispatcher script instead, with a retry loop because `rtw88` won't accept the setting in the first instant after association on a cold boot:
|
||||
|
||||
```bash
|
||||
sudo tee /etc/NetworkManager/dispatcher.d/90-wifi-powersave >/dev/null <<'SCRIPT'
|
||||
#!/bin/sh
|
||||
# Disable Wi-Fi power save on the wireless iface (retry: rtw88 may not accept it instantly on boot)
|
||||
case "$2" in
|
||||
up|dhcp4-change|connectivity-change)
|
||||
case "$1" in
|
||||
wl*)
|
||||
for n in 1 2 3 4 5; do
|
||||
/usr/bin/iw dev "$1" set power_save off 2>/dev/null
|
||||
[ "$(/usr/bin/iw dev "$1" get power_save 2>/dev/null)" = "Power save: off" ] && break
|
||||
sleep 1
|
||||
done
|
||||
;;
|
||||
esac
|
||||
;;
|
||||
esac
|
||||
SCRIPT
|
||||
sudo chmod +x /etc/NetworkManager/dispatcher.d/90-wifi-powersave
|
||||
sudo iw dev wlan0 set power_save off # apply now without waiting for a reconnect
|
||||
```
|
||||
|
||||
> 💡 A single-shot dispatcher (no retry) **silently fails on a cold boot**: it fires before the interface is ready, the `iw` call no-ops, and power save stays on. Verify with `iw dev wlan0 get power_save` *after a real reboot*, not just after a service restart.
|
||||
|
||||
## 🔁 Verification
|
||||
|
||||
```bash
|
||||
# was 50/boot, ~once a minute:
|
||||
journalctl -b -u NetworkManager --no-pager | grep -c supplicant-disconnect
|
||||
# 0
|
||||
iw dev wlan0 get power_save
|
||||
# Power save: off
|
||||
```
|
||||
|
||||
A 3-minute continuous `ping` showed **180/180 replies, 0 loss**, latency tightened to **6–11 ms**. Confirmed across a full cold reboot: the Deck auto-rejoins Wi-Fi, both settings persist, and the disconnect counter stays at 0.
|
||||
|
||||
## 📌 Notes
|
||||
|
||||
- **Persistence:** `/etc/iwd/main.conf` and the dispatcher live in `/etc`, which survives reboots. A major SteamOS update *can* reset `/etc` — re-apply if the flapping returns after an OS update.
|
||||
- **Fully reversible:**
|
||||
```bash
|
||||
sudo rm /etc/iwd/main.conf /etc/NetworkManager/dispatcher.d/90-wifi-powersave
|
||||
sudo systemctl restart iwd
|
||||
```
|
||||
- **Interface name** is usually `wlan0`; confirm with `iw dev` if different.
|
||||
- The same IWD-periodic-scan behavior can affect other `iwd`-based distros (Arch, some Fedora spins) on flaky/older Wi-Fi chips — the `DisablePeriodicScan` fix is general, not Deck-specific.
|
||||
|
||||
## 🔗 Related
|
||||
|
||||
- [Wi-Fi Game Streaming Stutter: 160 MHz Channel Width Saturating the 5 GHz Radio](wifi-160mhz-airtime-saturation-game-streaming.md) — the *other* Steam Deck Wi-Fi issue (airtime contention, router-side), distinct from this client-side flap.
|
||||
|
|
@@ -1,163 +0,0 @@
|
|||
---
|
||||
title: "MagicDNS Names vs Pinned IPs for Tailscale SSH (After a Fleet Migration)"
|
||||
domain: troubleshooting
|
||||
category: networking
|
||||
tags:
|
||||
- ssh
|
||||
- ssh-config
|
||||
- tailscale
|
||||
- magicdns
|
||||
- known-hosts
|
||||
- host-key
|
||||
- migration
|
||||
- wsl2
|
||||
status: published
|
||||
created: 2026-06-12
|
||||
updated: 2026-06-12
|
||||
---
|
||||
|
||||
# MagicDNS Names vs Pinned IPs for Tailscale SSH (After a Fleet Migration)
|
||||
|
||||
You have SSH aliases for a Tailscale fleet (`alias tttpod='ssh root@100.84.42.102'`).
|
||||
They worked for months. Then you migrate or rebuild some nodes — and now a third of
|
||||
them hang on connect or refuse the host key. This is the failure mode that hardcoded
|
||||
addresses hit, and why the durable answer is **MagicDNS names**, not pinned IPs.
|
||||
|
||||
> This is the sequel to *[SSH Alias Falls Through to MagicDNS — Host-Key Verification
|
||||
> Failure (No `Host` Block)](ssh-missing-host-block-magicdns-host-key-failure.md)*.
|
||||
> That article says **pin the IP** `known_hosts` already trusts — correct when the
|
||||
> node is stable. This one covers what happens when a migration changes the IP *and*
|
||||
> the host key, which is exactly when IP-pinning stops paying off.
|
||||
|
||||
## The Three Failure Modes
|
||||
|
||||
A migration/rebuild can trigger any of these — often several at once across a fleet,
|
||||
which is what makes it confusing:
|
||||
|
||||
### 1. Stale hardcoded IP → connection times out
|
||||
|
||||
The node re-registered on the tailnet with a **new** Tailscale IP, but your alias
|
||||
still names the old one:
|
||||
|
||||
```
|
||||
$ tttpod
|
||||
ssh: connect to host 100.84.42.102 port 22: Operation timed out
|
||||
```
|
||||
|
||||
The old address is dead; SSH waits the full timeout and gives up. Confirm by asking
|
||||
the tailnet for the node's *current* IP by name:
|
||||
|
||||
```
|
||||
$ tailscale status | grep tttpod
|
||||
100.95.137.38 tttpod ... # alias points at 100.84.42.102 — stale
|
||||
```
|
||||
|
||||
### 2. Cold-path teardown → first connect after idle times out
|
||||
|
||||
The IP is correct and the node is up (it answers `ping`), but TCP/22 still times out
|
||||
on the *first* try after a quiet period, then works on retry. Tailscale 1.98.x is more
|
||||
aggressive about tearing down **idle direct UDP paths**; the first SSH has to
|
||||
re-establish NAT traversal, which can overrun SSH's default connect timeout.
|
||||
|
||||
```
|
||||
$ tailscale status | grep tttpod
|
||||
100.95.137.38 tttpod ... idle, tx 9360 rx 0 # cold path
|
||||
$ tailscale ping tttpod
|
||||
pong from tttpod (100.95.137.38) via 5.161.118.84:41641 in 48ms # warms instantly
|
||||
```
|
||||
|
||||
### 3. Host-key verification failed → box was rebuilt
|
||||
|
||||
The node was reinstalled, so it presents a **new** SSH host key. Your `known_hosts`
|
||||
still has the old one, so even `StrictHostKeyChecking=accept-new` aborts — `accept-new`
|
||||
only adds *genuinely new* hosts; it refuses a **mismatch**:
|
||||
|
||||
```
|
||||
$ ssh root@tttpod hostname
|
||||
Host key verification failed.
|
||||
```
|
||||
|
||||
## The Fix
|
||||
|
||||
Three changes, applied on every **name-capable** machine (see the WSL2 caveat below):
|
||||
|
||||
### a. Switch aliases from IPs to MagicDNS names
|
||||
|
||||
```bash
|
||||
# before — rots on every migration
|
||||
alias tttpod='ssh root@100.84.42.102'
|
||||
# after — always resolves the node's current IP
|
||||
alias tttpod='ssh root@tttpod'
|
||||
```
|
||||
|
||||
MagicDNS resolves the name to whatever IP the node currently has, so a future
|
||||
migration needs **zero** alias edits. This is the whole point: the tailnet already
|
||||
knows the mapping — stop duplicating (and stale-ing) it in your dotfiles.
|
||||
|
||||
> **Exception:** if there's no tailnet device with that exact name (e.g. an alias
|
||||
> `teelia` pointing at a node actually named `temptedparadise`), MagicDNS can't
|
||||
> resolve it — keep the IP for that one.
|
||||
|
||||
### b. Purge stale host keys, then re-accept
|
||||
|
||||
After a rebuild, clear the old entries under **both** the name and the current IP,
|
||||
then reconnect with `accept-new` to record the fresh key. Over Tailscale's
|
||||
authenticated WireGuard tunnel, a key change from a known rebuild is safe to accept.
|
||||
|
||||
```bash
|
||||
for pair in "tttpod:100.95.137.38" "majortoot:100.64.169.62" "dcaprod:100.98.223.93"; do
|
||||
  n="${pair%%:*}"; ip="${pair##*:}"
|
||||
  ssh-keygen -R "$n"; ssh-keygen -R "$ip"
|
||||
done
|
||||
# repopulate
|
||||
ssh -o StrictHostKeyChecking=accept-new root@tttpod hostname
|
||||
```
|
||||
|
||||
### c. Add a cold-path cushion to `~/.ssh/config`
|
||||
|
||||
Give the first (cold) connection time to renegotiate instead of erroring:
|
||||
|
||||
```sshconfig
|
||||
Host majorlinux tttpod majortoot majordiscord dcaprod majormail majorhome
|
||||
ConnectTimeout 25
|
||||
ServerAliveInterval 30
|
||||
ServerAliveCountMax 4
|
||||
```
|
||||
|
||||
`ConnectTimeout 25` allows up to 25 s for the connection, so the cold-path renegotiation shows up as a ~1–2 s pause instead of an error. The keepalives
|
||||
hold the path open during an active session so it doesn't drop mid-command.
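
For scripts that cannot rely on `~/.ssh/config` (cron jobs, one-off loops), the "fails cold, works on retry" pattern can also be absorbed by a retry wrapper. This is a complementary sketch rather than part of the fix above. `retry` and the `RETRY_DELAY` variable are hypothetical names.

```bash
#!/usr/bin/env bash
# retry ATTEMPTS COMMAND [ARGS...]
# Run COMMAND up to ATTEMPTS times, pausing RETRY_DELAY seconds (default 1)
# between tries. Succeeds as soon as one attempt succeeds.
retry() {
  local attempts=$1 delay=${RETRY_DELAY:-1} i
  shift
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
  done
  return 1
}
```

For example, `retry 3 ssh -o BatchMode=yes -o ConnectTimeout=10 root@tttpod true` gives a cold path two extra chances before the script treats the host as down.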
|
||||
|
||||
## Caveat: WSL2 Can't Use MagicDNS
|
||||
|
||||
A Linux box under **WSL2** typically has **no `tailscale` CLI and no MagicDNS
|
||||
resolver** — it rides the Windows host's networking, and name lookups for tailnet
|
||||
nodes fail:
|
||||
|
||||
```
|
||||
$ getent hosts tttpod # (inside WSL2)
|
||||
# nothing — no resolution
|
||||
$ command -v tailscale # nothing — CLI lives on the Windows side
|
||||
```
|
||||
|
||||
On those machines you **must** keep hardcoded IPs in `~/.ssh/config` (or use `Host`
|
||||
blocks with explicit `HostName <ip>`), and refresh them by hand when a node migrates.
|
||||
There's no self-healing option there — the trade is unavoidable.
|
||||
|
||||
## Diagnosis Checklist
|
||||
|
||||
1. `tailscale status | grep <host>` — does your alias's IP match the **current** one?
|
||||
(Mode 1: stale IP.)
|
||||
2. `ping`/`tailscale ping <host>` works but TCP/22 times out on first try, succeeds on
|
||||
retry? (Mode 2: cold path.)
|
||||
3. `ssh root@<host> true` → `Host key verification failed` (not `Permission denied`)?
|
||||
(Mode 3: rebuilt box, stale `known_hosts`.)
|
||||
4. Is the client a WSL2 box? `getent hosts <name>` returns nothing → MagicDNS
|
||||
unavailable, stay on IPs.
|
||||
|
||||
## Takeaway
|
||||
|
||||
Pin the IP when a host is **stable** and the IP-keyed `known_hosts` entry is your
|
||||
durable trust anchor. Switch to **MagicDNS names** when hosts **move** — migrations,
|
||||
rebuilds, provider changes — so the tailnet's own name→IP mapping does the work your
|
||||
dotfiles kept getting wrong. And on WSL2, you don't get the choice: hardcoded IPs,
|
||||
refreshed by hand.
|
||||
|
|
@@ -1,115 +0,0 @@
|
|||
---
|
||||
title: "Wi-Fi Game Streaming Stutter: 160 MHz Channel Width Saturating the 5 GHz Radio"
|
||||
domain: troubleshooting
|
||||
category: networking
|
||||
tags: [wifi, 5ghz, 160mhz, channel-width, dfs, steam-deck, game-streaming, asuswrt, airtime, chanim]
|
||||
status: published
|
||||
created: 2026-06-13
|
||||
updated: 2026-06-13
|
||||
---
|
||||
|
||||
# Wi-Fi Game Streaming Stutter: 160 MHz Channel Width Saturating the 5 GHz Radio
|
||||
|
||||
## 🛑 Problem
|
||||
|
||||
Streaming a game from a desktop (wired) to a Steam Deck over Wi-Fi was stuttering intermittently — fine for a while, then choppy, hard to reproduce on demand. Throughput tests "looked fine," which is exactly why it was hard to pin down: **game streaming fails on jitter and microbursts of contention, not on average bandwidth.**
|
||||
|
||||
The Wi-Fi was an Asus RT-AX82U (AsusWRT, stock firmware) with the 5 GHz radio set to **Auto channel at 160 MHz width**.
|
||||
|
||||
## 🔍 Diagnosis
|
||||
|
||||
The key insight: **signal was excellent, but latency was not.** That combination means the airwaves are busy, not weak.
|
||||
|
||||
### Step 1 — Measure jitter to the gateway from a Wi-Fi client
|
||||
|
||||
```bash
|
||||
ping -c 20 -i 0.2 192.168.50.1
|
||||
# round-trip min/avg/max/stddev = 7.5/27.0/61.0/16.5 ms
|
||||
```
|
||||
|
||||
27 ms **average** and 16 ms of jitter to your *own router* over Wi-Fi is pathological. A healthy 5 GHz link sits at 2–5 ms. Yet the client's signal was **-43 dBm** (excellent) with a clean **-92 dBm** noise floor. Strong signal + high jitter = **airtime contention**, not range or interference at the receiver.
|
||||
|
||||
### Step 2 — Confirm channel utilization at the router
|
||||
|
||||
AsusWRT/Broadcom exposes per-channel airtime stats via `wl chanim_stats`. SSH into the router and run it against the 5 GHz interface:
|
||||
|
||||
```bash
|
||||
# 5 GHz interface name varies (eth6/eth7); resolve it from nvram
|
||||
IF=$(nvram get wl1_ifname)
|
||||
wl -i "$IF" chanspec # e.g. 36/160 (0xe832) → channel 36, 160 MHz
|
||||
wl -i "$IF" assoclist | wc -l # number of associated 5 GHz clients
|
||||
wl -i "$IF" chanim_stats
|
||||
```
|
||||
|
||||
The smoking gun (`chanim_stats`, version 3):
|
||||
|
||||
```
|
||||
chanspec tx inbss obss nocat nopkt doze txop goodtx badtx glitch ... idle
|
||||
0xe832 92 2 1 2 1 0 4 8 81 2 14
|
||||
```
|
||||
|
||||
Read it as percentages of airtime:
|
||||
|
||||
| Field | Value | Meaning |
|
||||
|-------|-------|---------|
|
||||
| `tx` | **92** | Channel busy transmitting 92% of the time |
|
||||
| `txop` | **4** | Transmit-opportunities available only 4% — the channel is starved |
|
||||
| `idle` | **14** | Channel idle only 14% |
|
||||
| `goodtx` / `badtx` | 8 / **81** | Failed/retried transmits vastly outnumber good ones |
|
||||
|
||||
Seventeen clients were associated to that one 5 GHz radio.
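
The `goodtx` / `badtx` pair is easier to compare across runs as a single failure share. A small sketch of that arithmetic follows; `failed_tx_pct` is a hypothetical helper, and treating the two columns as directly comparable counters is an assumption about how `chanim_stats` reports them.

```bash
#!/usr/bin/env bash
# failed_tx_pct GOODTX BADTX
# Prints the share of transmits that failed, as an integer percent.
failed_tx_pct() {
  local good=$1 bad=$2
  local total=$(( good + bad ))
  if (( total == 0 )); then
    echo 0            # no transmits recorded, nothing failed
    return
  fi
  echo $(( bad * 100 / total ))
}

echo "36/160: $(failed_tx_pct 8 81)% of transmits failed"   # values from the table above
# prints: 36/160: 91% of transmits failed
```

Re-running the same calculation after a configuration change gives one number to compare instead of two columns.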
|
||||
|
||||
### Step 3 — Understand why 160 MHz makes it worse
|
||||
|
||||
A 160 MHz channel on the lower 5 GHz band spans channels **36–64**, so its upper half (52–64) sits in DFS spectrum. To stay clean it needs 160 MHz of *uncontended* spectrum, but in a dense RF environment (≈25 neighbor APs here, several on 5 GHz channels 48/52/100/132/153, of which 48 and 52 fall inside the block) any one busy neighbor degrades the **entire** wide channel. 160 MHz also makes the radio **DFS-radar exposed**: a single radar detection forces a channel switch with a 1 s+ blackout, which kills a stream.
|
||||
|
||||
So 160 MHz buys a higher *peak* PHY rate that game streaming doesn't need, at the cost of the *stability* it absolutely does.
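
The exposure is easy to see by listing the 20 MHz channels a wide channel covers. The sketch below assumes the first argument is the lowest channel of the block (not necessarily the control channel) and marks channels 52 to 144 as DFS, which matches FCC-style rules; other regulatory domains differ. `block_channels` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# block_channels LOWEST_CHANNEL WIDTH_MHZ
# Lists the 20 MHz channels covered by a wide 5 GHz channel.
# Channels 52-144 get a trailing * as DFS (assumption: FCC-style rules).
block_channels() {
  local lowest=$1 width=$2 out=() ch i
  for (( i = 0; i < width / 20; i++ )); do
    ch=$(( lowest + 4 * i ))          # 5 GHz channel numbers step by 4 per 20 MHz
    if (( ch >= 52 && ch <= 144 )); then
      out+=("${ch}*")
    else
      out+=("$ch")
    fi
  done
  echo "${out[*]}"
}

block_channels 36 160   # prints: 36 40 44 48 52* 56* 60* 64*
block_channels 36 80    # prints: 36 40 44 48
```

Half of the 160 MHz block is DFS spectrum, while the 80 MHz block starting at 36 contains none.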
|
||||
|
||||
## ✅ Fix
|
||||
|
||||
Drop the 5 GHz radio to **80 MHz** and pin it to a **non-DFS** channel (UNII-1: 36/40/44/48 — no radar, no DFS blackouts).
|
||||
|
||||
GUI: **Wireless → 5 GHz → Channel Bandwidth = 80 MHz**, **Control Channel = 36**, turn off "Auto."
|
||||
|
||||
Or over SSH (`nvram` + `restart_wireless`):
|
||||
|
||||
```bash
|
||||
nvram set wl1_bw_cap=7 # cap at 80 MHz (bitmask: 1=20, 3=40, 7=80, 15=160)
|
||||
nvram set wl1_chanspec=36/80 # channel 36 @ 80 MHz
|
||||
nvram set wl1_channel=36
|
||||
nvram commit
|
||||
service restart_wireless # ~15-20s radio bounce, drops all clients briefly
|
||||
```
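
Once the radio is back, the applied width can be checked from the same `chanspec` string shown in Step 2. This is a sketch under two assumptions: the string has the `channel/width (0x....)` form of the sample, and the check runs in bash on a workstation (with the string fetched over SSH) rather than in the router's own shell. `chanspec_width` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# chanspec_width "36/160 (0xe832)"
# Prints the channel width in MHz from a channel/width chanspec string.
chanspec_width() {
  local spec=${1%% *}      # drop the "(0x....)" suffix
  echo "${spec##*/}"       # keep what follows the slash
}

width=$(chanspec_width "36/160 (0xe832)")   # sample string from Step 2
if (( width > 80 )); then
  echo "still ${width} MHz: the change did not apply"
else
  echo "${width} MHz: OK"
fi
# prints: still 160 MHz: the change did not apply
```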
|
||||
|
||||
> [!warning] `restart_wireless` drops every Wi-Fi client for 15–20 seconds. `nvram commit` runs *before* the restart, so the config persists even if your own SSH/Wi-Fi session drops.
|
||||
|
||||
## 📊 Result
|
||||
|
||||
Verified from both the router and a client after the radio came back:
|
||||
|
||||
| Metric | Before (36/160) | After (36/80) |
|
||||
|--------|-----------------|---------------|
|
||||
| Channel tx-busy | 92% | **9%** |
|
||||
| Transmit-opportunity available | 4% | **79%** |
|
||||
| Channel idle | 14% | **87%** |
|
||||
| Failed tx (`badtx` vs `goodtx`) | 81 vs 8 | **1 vs 3** |
|
||||
| Gateway ping (avg / floor) | 27 ms / 7.5 ms | **9 ms / 2.7 ms** |
|
||||
| PHY peak rate | 1729 Mbps | 1200 Mbps |
|
||||
|
||||
The PHY peak dropped (narrower channel) but that is irrelevant — Steam Remote Play wants ~30–50 Mbps with *consistent* airtime, which it now has. The stutter resolved.
|
||||
|
||||
## 🧠 Takeaways
|
||||
|
||||
- **Diagnose Wi-Fi streaming problems with jitter, not throughput.** A speed test can pass while a stream stutters. Ping your gateway and watch the stddev.
|
||||
- **Strong signal + high latency = airtime congestion.** Don't chase signal strength when RSSI is already good; look at channel utilization (`chanim_stats`).
|
||||
- **160 MHz is a trap in a dense RF environment.** Use 80 MHz for reliability; reserve 160 MHz for clean spectrum and short range.
|
||||
- **Prefer non-DFS channels (36–48) for anything latency-sensitive** — DFS radar events cause silent multi-second dropouts.
|
||||
- **Wire the *source*.** The streaming PC should be on Ethernet so the video only crosses the air once (AP → handheld). The handheld has to be Wi-Fi; the desktop doesn't.
|
||||
- **Isolate IoT on 2.4 GHz** (separate SSID) so it never competes for 5 GHz airtime with latency-sensitive clients.
|
||||
|
||||
## Related
|
||||
|
||||
- [Steam Deck Wi-Fi Flapping: IWD Periodic Scan + rtw88 Power Save](steam-deck-wifi-flapping-iwd-periodic-scan-rtw88.md) — the *other* Steam Deck Wi-Fi issue (client-side flap), distinct from this router-side airtime problem.
|
||||
- [Network Overview](../../02-selfhosting/dns-networking/network-overview.md)
|
||||
- [Wake-on-LAN via Router SSH](../../02-selfhosting/dns-networking/wake-on-lan-router-ssh.md)
|
||||
- [Pi-hole v6 Group Management — Per-Client DNS Rules](../../02-selfhosting/dns-networking/pihole-v6-group-management.md)
|
||||
|
|
@ -1,120 +0,0 @@
|
|||
---
|
||||
title: "Time Machine: Orphaned APFS .previous Folder Blocks All Backups"
|
||||
domain: troubleshooting
|
||||
category: general
|
||||
tags: [macos, time-machine, apfs, backup, fsck, disk-utility]
|
||||
status: published
|
||||
created: 2026-06-18
|
||||
updated: 2026-06-18
|
||||
---
|
||||
# Time Machine: Orphaned APFS `.previous` Folder Blocks All Backups
|
||||
|
||||
## Overview
|
||||
On an APFS Time Machine destination, an interrupted backup can leave behind an orphaned staging folder named `<timestamp>.previous` (plus a matching, uncatalogued APFS snapshot). Every subsequent backup reads that folder during *FindingChanges*, hits a metadata-type mismatch, and aborts — so backups silently stop running. macOS shows only a generic "**Time Machine couldn't complete the backup … An unknown error occurred.**"
|
||||
|
||||
The trap: because the orphan is **not in Time Machine's catalog** and the destination is OS-protected, every obvious removal tool (`rm`, `chmod`, `tmutil delete`, `diskutil deleteSnapshot`) refuses it. The clean fix is **First Aid (`fsck_apfs`)**, which has authority over the volume and clears the orphaned snapshot.
|
||||
|
||||
## Symptoms
|
||||
- "Time Machine couldn't complete the backup to '<disk>' — An unknown error occurred."
|
||||
- Backups haven't run since around the time of an interrupted/cancelled backup.
|
||||
- The destination disk is mounted and has plenty of free space (not full, not disconnected).
|
||||
- `tmutil status` cycles through `Starting` / `FindingChanges` and never reaches `Copying`.
|
||||
|
||||
## Root Cause
|
||||
`backupd` logs the real error on a loop (every ~15 s):
|
||||
|
||||
```bash
|
||||
log show --predicate 'subsystem == "com.apple.TimeMachine"' --last 10m --style compact \
|
||||
| grep -iE 'previous|error'
|
||||
```
|
||||
```
|
||||
[TMStructure] Expected SnapshotInProgressContainer metadata type but found APFSBackup
|
||||
metadata type at URL '.../<disk>/2026-06-17-172230.previous/'
|
||||
```
|
||||
|
||||
An earlier backup was interrupted mid-run. It left two orphans tied to that timestamp, **neither registered in Time Machine's backup catalog**:
|
||||
|
||||
1. A staging directory `<timestamp>.previous` on the destination volume.
|
||||
2. A matching APFS snapshot `com.apple.TimeMachine.<timestamp>.backup`.
|
||||
|
||||
Time Machine expects the staging folder to be a `SnapshotInProgressContainer` but finds completed-backup (`APFSBackup`) metadata, so it bails before copying anything.
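
The timestamp embedded in that message is what every later step keys on, so it helps to pull it out mechanically. A minimal sketch, assuming the folder name always has the `YYYY-MM-DD-HHMMSS.previous` shape seen in the excerpt; `previous_stamps` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# previous_stamps
# Reads backupd log text on stdin and prints each distinct <timestamp>
# that appears as "<timestamp>.previous".
previous_stamps() {
  grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{6}\.previous' \
    | sed 's/\.previous$//' \
    | sort -u
}

# Sample taken from the log excerpt above
printf '%s\n' \
  "[TMStructure] Expected SnapshotInProgressContainer metadata type but found APFSBackup" \
  "metadata type at URL '.../<disk>/2026-06-17-172230.previous/'" \
  | previous_stamps
# prints: 2026-06-17-172230
```

On a live system the `log show` command above can be piped straight into `previous_stamps`.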
|
||||
|
||||
> **Ignore the surrounding log noise.** `com.apple.backupd.sandbox.xpc: connection invalid`, `Mountpoint '…' is still valid`, and `missingName` on `/System/Volumes/Data/home` are all normal on a healthy backup — flagged `E` but harmless. The only line that matters is the `SnapshotInProgressContainer` mismatch.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
Confirm the disk is healthy (not the problem) and locate the orphan:
|
||||
|
||||
```bash
|
||||
tmutil status # stuck in Starting/FindingChanges, never Copying
|
||||
df -h | grep -i "<disk-name>" # mounted, plenty free
|
||||
diskutil apfs listSnapshots <diskNsN> # note the highest/last snapshot timestamp
|
||||
```
|
||||
|
||||
If `listSnapshots` shows a final snapshot whose timestamp matches the `.previous` folder in the error, that's the orphaned pair.
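
That comparison can be scripted by searching the snapshot listing for the expected name. The sketch assumes the snapshot follows the `com.apple.TimeMachine.<timestamp>.backup` naming described under Root Cause; `has_snapshot` is a hypothetical helper, and the listing line below is a stand-in, since real `diskutil apfs listSnapshots` output carries more fields.

```bash
#!/usr/bin/env bash
# has_snapshot TIMESTAMP
# Succeeds if the snapshot listing on stdin names the Time Machine
# snapshot for that timestamp.
has_snapshot() {
  grep -qF "com.apple.TimeMachine.${1}.backup"
}

# Stand-in for one line of the real listing
listing='Name: com.apple.TimeMachine.2026-06-17-172230.backup'

if has_snapshot 2026-06-17-172230 <<<"$listing"; then
  echo "snapshot matches the .previous folder: orphaned pair"
else
  echo "no matching snapshot"
fi
# prints: snapshot matches the .previous folder: orphaned pair
```

Live use would look like `diskutil apfs listSnapshots <diskNsN> | has_snapshot <timestamp>`.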
|
||||
|
||||
## Why the Obvious Tools Fail
|
||||
|
||||
Do **not** burn time trying to force the folder out — here's what each tool does and why it refuses:
|
||||
|
||||
| Command | Result | Reason |
|
||||
|---|---|---|
|
||||
| `sudo rm -rf …/<ts>.previous` | `Operation not permitted` | TM applies a `group:everyone deny delete` ACL that overrides root. |
|
||||
| `sudo chmod -RN …/<ts>.previous` | runs for minutes, then fails | A `.previous` folder is a **full copy of the entire Mac filesystem**; `-R` walks the whole tree and can't clear ACLs on the SIP-`restricted` system files inside (`/usr/bin/sh`, frameworks, keymaps). `rm` then hits the same wall. |
|
||||
| `sudo tmutil delete -p …/<ts>.previous` | `Invalid deletion target (error 22)` | Not a registered backup. |
|
||||
| `sudo tmutil delete -t <timestamp>` | `error 2 (No such file)` | No catalog entry for that timestamp. |
|
||||
| `sudo diskutil apfs deleteSnapshot <diskNsN> -uuid <uuid>` | `Not a valid APFS Snapshot UUID` | TM-managed snapshot; diskutil won't remove it directly. |
|
||||
|
||||
> **If you started a `chmod -R` and killed it:** the live system is unaffected — `chmod -R` does not follow symlinks out of the backup tree. Verify with `ls -lde ~/Desktop` (normal ACLs = untouched). Stop a runaway with `sudo pkill -f '<timestamp>.previous'`.
|
||||
|
||||
## Fix — Run First Aid (`fsck_apfs`)
|
||||
|
||||
First Aid runs with full authority over the volume and clears the orphaned snapshot, which defuses the `.previous` folder's metadata mismatch.
|
||||
|
||||
```bash
|
||||
# 1. Stop the looping backup
|
||||
sudo tmutil stopbackup
|
||||
|
||||
# 2. Verify the destination volume (live mode is fine; read-only check)
|
||||
sudo diskutil verifyVolume <diskNsN>
|
||||
# or: Disk Utility → View → Show All Devices → select the TM volume → First Aid → Run
|
||||
```
|
||||
|
||||
`verifyVolume` enumerates and validates every snapshot; the verify/remount cycle purges the orphaned in-progress snapshot. Expected result:
|
||||
|
||||
```
|
||||
The volume <name> appears to be OK
|
||||
File system check exit code is 0
|
||||
```
|
||||
|
||||
Confirm the orphan snapshot is gone (count drops by one; the matching timestamp no longer appears):
|
||||
|
||||
```bash
|
||||
diskutil apfs listSnapshots <diskNsN>
|
||||
```
|
||||
|
||||
Then restart and watch it succeed:
|
||||
|
||||
```bash
|
||||
sudo tmutil startbackup --auto
|
||||
tmutil status # should reach BackupPhase = Copying with no SnapshotInProgressContainer errors
|
||||
```
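
To watch for the phase change without reading the whole status dump, the `BackupPhase` value can be extracted on its own. A sketch, assuming `tmutil status` prints a `BackupPhase = <name>;` line as in the comment above; `backup_phase` is a hypothetical helper and the sample text is a stand-in for real output.

```bash
#!/usr/bin/env bash
# backup_phase
# Reads `tmutil status` text on stdin and prints the BackupPhase value.
backup_phase() {
  sed -nE 's/^[[:space:]]*BackupPhase = "?([A-Za-z]+)"?;.*$/\1/p'
}

# Stand-in for `tmutil status` output (layout assumed)
sample='Backup session status:
{
    BackupPhase = Copying;
    Running = 1;
}'

phase=$(backup_phase <<<"$sample")
if [ "$phase" = "Copying" ]; then
  echo "backup is copying"
else
  echo "phase: ${phase:-none}"
fi
# prints: backup is copying
```

Wrapping `tmutil status | backup_phase` in a loop with `sleep 15` gives a simple progress watch.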
|
||||
|
||||
If `verifyVolume` reports problems rather than "appears to be OK", run the repair (it must unmount the volume):
|
||||
|
||||
```bash
|
||||
sudo diskutil repairVolume <diskNsN>
|
||||
```
|
||||
|
||||
## Notes
|
||||
- The first backup after the fix is often a large catch-up (hundreds of GB) because the chain was broken — let it finish; it returns to quick hourly increments afterward.
|
||||
- The inert `<timestamp>.previous` **folder** may still sit on the volume after the fix. Time Machine now ignores it, so it's not blocking — but it consumes space. Removing it cleanly requires booting to **Recovery Mode**, `csrutil disable`, `rm -rf` the folder, then `csrutil enable` — only worth it to reclaim the space.
|
||||
- Time Machine identifies its destination by `DestinationID` (a UUID), not the volume name, so renaming the disk later is safe.
|
||||
- Interrupted backups are more likely on flaky USB-SATA bridge enclosures (e.g. some WD My Passport units) whose slow sleep/wake transitions can drop the drive mid-backup.
|
||||
|
||||
## Tags
|
||||
`macos` `time-machine` `apfs` `backup` `fsck-apfs` `disk-utility` `snapshot` `first-aid`
|
||||
|
||||
## See Also
|
||||
- [SnapRAID & MergerFS Storage Setup](../01-linux/storage/snapraid-mergerfs-setup.md)
|
||||
- MajorMac Incident Log (2026-06-18) — the originating incident
|
||||
|
|
@ -1,193 +0,0 @@
|
|||
---
|
||||
title: "WordPress 6.7 _load_textdomain_just_in_time Notice (Theme/Plugin Loads Translations Too Early)"
|
||||
domain: troubleshooting
|
||||
category: troubleshooting
|
||||
tags:
|
||||
- wordpress
|
||||
- wordpress-6.7
|
||||
- php
|
||||
- i18n
|
||||
- textdomain
|
||||
- theme
|
||||
- mu-plugin
|
||||
- deprecation
|
||||
- troubleshooting
|
||||
status: published
|
||||
created: 2026-06-21
|
||||
updated: 2026-06-21
|
||||
---
|
||||
|
||||
# WordPress 6.7 `_load_textdomain_just_in_time` Notice
|
||||
|
||||
> **TL;DR** — WordPress 6.7 added a `doing_it_wrong` notice that fires when a translation function (`__()`, `_e()`, `esc_html__()`, …) is called for a text domain **before the `init` action**. It's almost always a theme or plugin registering nav menus / sidebars / labels on `after_setup_theme` (which runs before `init`). The notice is **debug-only and harmless** — translations still load via the just-in-time fallback. If the offending code is in your own (or an updatable) theme/plugin, fix it at the source by deferring to `init`. If it's a **non-updating or third-party** theme you don't want to hand-edit, suppress *only this one notice* with a `doing_it_wrong_trigger_error` filter in a tiny mu-plugin.
|
||||
|
||||
---
|
||||
|
||||
## Symptom
|
||||
|
||||
With `WP_DEBUG` on (or in Query Monitor's PHP panel), you see:
|
||||
|
||||
```
|
||||
Function _load_textdomain_just_in_time was called incorrectly.
|
||||
Translation loading for the <domain> domain was triggered too early.
|
||||
This is usually an indicator for some code in the plugin or theme running too early.
|
||||
Translations should be loaded at the init action or later.
|
||||
(This message was added in version 6.7.0.)
|
||||
|
||||
_load_textdomain_just_in_time() wp-includes/l10n.php
|
||||
get_translations_for_domain() wp-includes/l10n.php
|
||||
translate() wp-includes/l10n.php
|
||||
__() wp-includes/l10n.php
|
||||
WordPress Core
|
||||
```
|
||||
|
||||
The key fields are **the domain name** (e.g. `marstheme`, `woocommerce`, `astra`) and the fact that the stack bottoms out in **WordPress Core** via `__()` — that tells you *some* extension called a translation function, not that core is broken.
|
||||
|
||||
## Why it happens (the WP 6.7 change)
|
||||
|
||||
Before 6.7, WordPress silently "just-in-time" loaded a text domain the first time you translated a string in it. 6.7 kept the JIT loading but started **warning** when it's triggered before `init`, because:
|
||||
|
||||
- Translations loaded before `init` can't be filtered/overridden by other plugins that hook `init`.
|
||||
- It signals the extension is doing setup work earlier than the WordPress lifecycle intends.
|
||||
|
||||
The usual culprit is code on **`after_setup_theme`** (which fires *before* `init`) that translates a label inline, e.g.:
|
||||
|
||||
```php
|
||||
function mytheme_setup() {
|
||||
register_nav_menus( array(
|
||||
'primary' => __( 'Primary Menu', 'mytheme' ), // <-- translate call before init
|
||||
) );
|
||||
}
|
||||
add_action( 'after_setup_theme', 'mytheme_setup' );
|
||||
```
|
||||
|
||||
> **Important:** explicitly calling `load_theme_textdomain()` / `load_plugin_textdomain()` early does **not** fix the notice, and as of WP 4.6+ themes on wordpress.org don't even need to call it. The notice is about the *translate call*, not about whether the domain was loaded. Moving only the `load_*_textdomain()` call around is a common dead-end (see the gotcha below).
|
||||
|
||||
## Diagnostic chain
|
||||
|
||||
### 1. Identify the domain and what owns it
|
||||
|
||||
The notice names the domain. Find which theme/plugin uses it:
|
||||
|
||||
```bash
|
||||
WPROOT=/var/www/html
|
||||
grep -rlw '<domain>' "$WPROOT/wp-content/themes" "$WPROOT/wp-content/plugins" 2>/dev/null
|
||||
|
||||
# Which extension has the most references (i.e. owns the domain)?
|
||||
grep -rl '<domain>' "$WPROOT/wp-content/" 2>/dev/null \
|
||||
| sed -E "s#$WPROOT/wp-content/(themes|plugins|mu-plugins)/([^/]+)/.*#\1/\2#" \
|
||||
| sort | uniq -c | sort -rn | head
|
||||
```
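
A theme normally declares its domain in the `Text Domain:` header of `style.css`, which gives a quick cross-check of folder name against domain. This is a minimal sketch: `theme_text_domain` is a hypothetical helper, and a heavily modified fork may lack the header, in which case the grep counts above remain the fallback.

```bash
#!/usr/bin/env bash
# theme_text_domain THEME_DIR
# Prints the "Text Domain" header from THEME_DIR/style.css,
# or nothing if the theme does not declare one.
theme_text_domain() {
  sed -nE 's/^[[:space:]*]*Text Domain:[[:space:]]*([^[:space:]]+).*$/\1/p' \
    "$1/style.css" | head -n 1
}

# Demo against a throwaway theme folder named "kappa"
demo=$(mktemp -d)
mkdir "$demo/kappa"
printf '/*\nTheme Name: Kappa\nText Domain: marstheme\n*/\n' > "$demo/kappa/style.css"

echo "folder=kappa domain=$(theme_text_domain "$demo/kappa")"
# prints: folder=kappa domain=marstheme
rm -rf "$demo"
```

On a real site, loop it over `"$WPROOT"/wp-content/themes/*/` to list every folder next to its declared domain.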
|
||||
|
||||
> **Watch for renamed/forked themes.** The domain often does **not** match the theme's folder name. A theme bought as "Mars" and re-slugged to `kappa` keeps `marstheme` as its text domain in all 40+ template files. So `wp theme list` shows `kappa` active while the notice says `marstheme` — they're the same thing.
|
||||
|
||||
### 2. Confirm it's active and whether it can be updated
|
||||
|
||||
```bash
|
||||
sudo -u www-data wp --path=$WPROOT theme list --fields=name,status,version,update
|
||||
sudo -u www-data wp --path=$WPROOT plugin list --fields=name,status,version,update
|
||||
```
|
||||
|
||||
- `update available` → **update it first** (newest releases of most themes/plugins fixed this in late 2024/2025). That's the proper fix; the rest of this article is for when you can't.
|
||||
- `update none` on a **renamed/custom fork** → no upstream exists, so updating is impossible. Go to the suppression fix.
|
||||
|
||||
### 3. Pin down the early call (optional)
|
||||
|
||||
```bash
|
||||
grep -rn "__(\s*['\"].*['\"]\s*,\s*['\"]<domain>['\"]" \
|
||||
"$WPROOT/wp-content/themes/<theme>" | head
|
||||
```
|
||||
|
||||
Look for translate calls inside functions hooked to `after_setup_theme`, `setup_theme`, `plugins_loaded`, or run at file scope in `functions.php`.
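
The hook side of that search can be narrowed with a second grep that lists only registrations on the early hooks. A sketch, assuming the code uses plain `add_action( 'hook', ... )` calls with a literal hook name; `early_hooks` is a hypothetical helper and it will miss hooks registered through variables or wrappers.

```bash
#!/usr/bin/env bash
# early_hooks DIR
# Lists add_action() calls registered on hooks that fire before init.
early_hooks() {
  grep -rnE --include='*.php' \
    "add_action\([[:space:]]*['\"](after_setup_theme|setup_theme|plugins_loaded)['\"]" "$1"
}

# Demo against a throwaway file
demo=$(mktemp -d)
cat > "$demo/functions.php" <<'PHP'
<?php
add_action( 'after_setup_theme', 'mytheme_setup' );
add_action( 'init', 'mytheme_late' );
PHP

early_hooks "$demo"    # matches only the after_setup_theme line
rm -rf "$demo"
```

Each hit names a callback; open that function and look for `__()`, `_e()`, or `esc_html__()` inside it.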
|
||||
|
||||
## The fix
|
||||
|
||||
### Option A — fix it at the source (own / updatable code)
|
||||
|
||||
Defer the translation. Either register the raw string and translate at render time, or move the registration to `init`:
|
||||
|
||||
```php
|
||||
// Before: translated on after_setup_theme (too early)
|
||||
add_action( 'after_setup_theme', function () {
|
||||
register_nav_menus( array( 'primary' => __( 'Primary Menu', 'mytheme' ) ) );
|
||||
} );
|
||||
|
||||
// After: register the menu location on init, where translation is allowed
|
||||
add_action( 'init', function () {
|
||||
register_nav_menus( array( 'primary' => __( 'Primary Menu', 'mytheme' ) ) );
|
||||
} );
|
||||
```
|
||||
|
||||
Only hand-edit code you maintain yourself. An edit to a theme/plugin that receives upstream updates is wiped on the next update; for those, update to a fixed release or use Option B.
|
||||
|
||||
### Option B — suppress just this notice (third-party / non-updating code)
|
||||
|
||||
When the early call lives in a theme you don't control and can't update (a renamed commercial fork, an abandoned plugin), the clean, update-safe move is to silence **only** the `_load_textdomain_just_in_time` notice — not all `doing_it_wrong` output — via a must-use plugin.
|
||||
|
||||
Create `wp-content/mu-plugins/fix-textdomain.php`:
|
||||
|
||||
```php
|
||||
<?php
|
||||
/**
|
||||
* Suppress the WP 6.7 "_load_textdomain_just_in_time was called incorrectly"
|
||||
* notice for a theme/plugin that translates before init.
|
||||
*
|
||||
* Scope is intentionally narrow: only this one function is silenced, so other
|
||||
* doing_it_wrong notices still surface. Translations still load via the JIT
|
||||
* fallback, so nothing visible changes for visitors.
|
||||
*/
|
||||
add_filter( 'doing_it_wrong_trigger_error', function ( $trigger, $function_name ) {
|
||||
return '_load_textdomain_just_in_time' === $function_name ? false : $trigger;
|
||||
}, 10, 2 );
|
||||
```
|
||||
|
||||
`mu-plugins/` loads automatically (no activation, can't be deactivated from the admin), and runs early enough to register the filter before the notice fires.
|
||||
|
||||
#### Verify
|
||||
|
||||
```bash
|
||||
WPROOT=/var/www/html
|
||||
|
||||
# 1. Syntax-check the mu-plugin
|
||||
php -l "$WPROOT/wp-content/mu-plugins/fix-textdomain.php"
|
||||
# -> No syntax errors detected
|
||||
|
||||
# 2. Confirm WP still boots and the filter is registered
|
||||
sudo -u www-data wp --path=$WPROOT eval \
|
||||
'echo has_filter("doing_it_wrong_trigger_error") ? "filter set\n" : "MISSING\n";'
|
||||
|
||||
# 3. Clear the debug log, boot WP once (the theme's early call runs during load), confirm 0 new notices
|
||||
DBG="$WPROOT/wp-content/debug.log"
|
||||
[ -f "$DBG" ] && : > "$DBG"
|
||||
sudo -u www-data wp --path=$WPROOT eval '__("Primary Menu","<domain>");' >/dev/null 2>&1
|
||||
n=$(grep -c "<domain>" "$DBG" 2>/dev/null); echo "${n:-0}"   # grep -c prints 0 itself but exits non-zero, so "|| echo 0" would print 0 twice
|
||||
# -> 0
|
||||
```
|
||||
|
||||
## Gotchas
|
||||
|
||||
### The "load the textdomain earlier/later" dead-end
|
||||
|
||||
A very common (wrong) first attempt is an mu-plugin that just calls `load_theme_textdomain()` on `plugins_loaded` or `after_setup_theme`:
|
||||
|
||||
```php
|
||||
// DOES NOT FIX THE NOTICE
|
||||
add_action( 'plugins_loaded', function () {
|
||||
load_theme_textdomain( 'mytheme', get_template_directory() . '/languages' );
|
||||
}, 0 );
|
||||
```
|
||||
|
||||
`plugins_loaded` still runs **before `init`**, and — more importantly — the notice is triggered by the theme's own early `__()` call, not by whether you've loaded the domain. This code is dead weight. If you find one in place, replace it with the Option B filter rather than tweaking its hook/priority.
|
||||
|
||||
### Don't blanket-suppress all deprecations
|
||||
|
||||
Resist `error_reporting(E_ALL & ~E_DEPRECATED)` or returning `false` from `doing_it_wrong_trigger_error` unconditionally — that also hides genuinely useful warnings (a plugin breaking on a future PHP/WP bump). Scope the filter to the one `function_name`.
|
||||
|
||||
### Renamed theme ⇒ domain ≠ folder
|
||||
|
||||
Re-stating because it costs the most time: the domain in the notice can be the theme's *original* slug, not its current folder. Always `grep` for the domain to find the real owner before concluding "I don't even have that theme installed."
|
||||
|
||||
## See also
|
||||
|
||||
- [Patching PHP 8.4 Implicit-Nullable Deprecations in Vendor Packages](php-84-vendor-implicit-nullable-patch.md) — the other "harmless deprecation that floods logs" pattern on the WordPress fleet
|
||||
- [WordPress developer note: i18n improvements in 6.7](https://make.wordpress.org/core/2024/10/21/i18n-improvements-in-6-7/) — the canonical reference for this change
|
||||
|
|
@ -10,7 +10,7 @@ tags:
|
|||
- deno
|
||||
status: published
|
||||
created: 2026-04-02
|
||||
updated: 2026-06-16T18:35
|
||||
updated: 2026-04-30T05:21
|
||||
---
|
||||
# yt-dlp YouTube JS Challenge Fix (Fedora)
|
||||
|
||||
|
|
@ -84,43 +84,12 @@ echo '--remote-components ejs:github' > ~/.config/yt-dlp/config
|
|||
|
||||
## Maintenance
|
||||
|
||||
YouTube pushes extractor changes frequently. Keep yt-dlp current.
|
||||
|
||||
### Updating: the `-U` trap + avoid duplicate installs
|
||||
|
||||
`yt-dlp -U` **does not work** when yt-dlp was installed via pip/PyPI — the PyPI build deliberately disables the self-updater:
|
||||
|
||||
```
|
||||
ERROR: You installed yt-dlp with pip or using the wheel from PyPi; Use that to update
|
||||
```
|
||||
|
||||
Update through pip instead. **Pick one install method and stick to it** — running both a user install and a system install leaves two copies that drift out of sync (one updates, the other stays stale and shadows it depending on `$PATH` / sudo).
|
||||
|
||||
**Recommended — single user install (no sudo):**
|
||||
|
||||
```bash
|
||||
pip3 install -U --user yt-dlp
|
||||
```
|
||||
|
||||
This lives in `~/.local/bin/yt-dlp` and is first on a normal user's `$PATH`. Update it the same way; never use sudo.
|
||||
|
||||
**Alternative — system-wide (Fedora, PEP 668):**
|
||||
YouTube pushes extractor changes frequently. Keep yt-dlp current:
|
||||
|
||||
```bash
|
||||
sudo pip install -U yt-dlp --break-system-packages
|
||||
```
|
||||
|
||||
> Only use `--break-system-packages` if you intentionally want a root-owned copy in `/usr/local`. Do **not** mix it with a `--user` install.
|
||||
|
||||
**Check for and remove a duplicate install:**
|
||||
|
||||
```bash
|
||||
which -a yt-dlp # more than one path = duplicate installs
|
||||
sudo pip3 uninstall -y yt-dlp # removes the /usr/local (system) copy + its wrapper
|
||||
```
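
The duplicate check can be turned into a one-line verdict. A minimal sketch using bash's `type -aP`, which lists every executable of that name on `$PATH`; `install_count` is a hypothetical helper, and symlinked directories that both appear on `$PATH` can inflate the count.

```bash
#!/usr/bin/env bash
# install_count NAME
# Prints the number of distinct executables called NAME found on $PATH.
install_count() {
  type -aP "$1" 2>/dev/null | sort -u | wc -l | tr -d ' '
}

n=$(install_count yt-dlp)
case "$n" in
  0) echo "yt-dlp not found on PATH" ;;
  1) echo "single install: OK" ;;
  *) echo "$n copies found: remove all but one" ;;
esac
```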
|
||||
|
||||
> If installed via the standalone binary (not pip), `yt-dlp -U` is the correct updater.
|
||||
|
||||
---
|
||||
|
||||
## Known Limitations
|
||||
|
|
|
|||
22
SUMMARY.md
|
|
@ -1,6 +1,6 @@
|
|||
---
|
||||
created: 2026-04-02T16:03
|
||||
updated: 2026-06-21T11:46
|
||||
updated: 2026-05-15T09:00
|
||||
---
|
||||
* [Home](index.md)
|
||||
* [Linux & Sysadmin](01-linux/index.md)
|
||||
|
|
@ -12,12 +12,10 @@ updated: 2026-06-21T11:46
|
|||
* [Bash Scripting Patterns](01-linux/shell-scripting/bash-scripting-patterns.md)
|
||||
* [SnapRAID & MergerFS Storage Setup](01-linux/storage/snapraid-mergerfs-setup.md)
|
||||
* [mdadm — Rebuilding a RAID Array After Reinstall](01-linux/storage/mdadm-raid-rebuild.md)
|
||||
* [Growing an LVM Volume by Absorbing Another Disk](01-linux/storage/lvm-grow-volume-absorb-disk.md)
|
||||
* [Linux Distro Guide for Beginners](01-linux/distro-specific/linux-distro-guide-beginners.md)
|
||||
* [WSL2 Instance Migration to Fedora 43](01-linux/distro-specific/wsl2-instance-migration-fedora43.md)
|
||||
* [WSL2 Training Environment Rebuild](01-linux/distro-specific/wsl2-rebuild-fedora43-training-env.md)
|
||||
* [WSL2 Backup via PowerShell](01-linux/distro-specific/wsl2-backup-powershell.md)
|
||||
* [WSL2 In-Place Upgrade to Fedora 44](01-linux/distro-specific/wsl2-fedora44-inplace-upgrade.md)
|
||||
* [Self-Hosting & Homelab](02-selfhosting/index.md)
|
||||
* [Self-Hosting Starter Guide](02-selfhosting/docker/self-hosting-starter-guide.md)
|
||||
* [Docker vs VMs for the Homelab](02-selfhosting/docker/docker-vs-vms-homelab.md)
|
||||
|
|
@ -32,7 +30,6 @@ updated: 2026-06-21T11:46
|
|||
* [AWS S3 Cost Management](02-selfhosting/cloud/aws-s3-cost-management.md)
|
||||
* [VPS Migration Baseline Checklist](02-selfhosting/cloud/vps-migration-baseline-checklist.md)
|
||||
* [rsync Backup Patterns](02-selfhosting/storage-backup/rsync-backup-patterns.md)
|
||||
* [Fleet Backups with restic + B2](02-selfhosting/storage-backup/restic-b2-fleet-backups.md)
|
||||
* [Tuning Netdata Web Log Alerts](02-selfhosting/monitoring/tuning-netdata-web-log-alerts.md)
|
||||
* [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md)
|
||||
* [Deploying Netdata to a New Server](02-selfhosting/monitoring/netdata-new-server-setup.md)
|
||||
|
|
@ -44,7 +41,6 @@ updated: 2026-06-21T11:46
|
|||
* [Mastodon Post-Install Hardening (Permissions + Account)](02-selfhosting/services/mastodon-post-install-hardening.md)
|
||||
* [Mastodon — The `--prune-profiles` Trap and How to Recover](02-selfhosting/services/mastodon-prune-profiles-trap.md)
|
||||
* [Mastodon on S3 — Silent Upload Failures (BucketOwnerEnforced/ACLs)](02-selfhosting/services/mastodon-s3-acl-upload-failures.md)
|
||||
* [Mastodon — Triaging Crowdfunding / Mention-Spam Accounts](02-selfhosting/services/mastodon-mention-spam-crowdfunding.md)
|
||||
* [Ghost Email Configuration with Mailgun](02-selfhosting/services/ghost-smtp-mailgun-setup.md)
|
||||
* [Inbound Spam Filtering: spamass-milter + SpamAssassin Bayes](02-selfhosting/services/postfix-spamassassin-bayes-spam-filtering.md)
|
||||
* [Claude Code Remote Control — Mobile Access to a Persistent Host Session](02-selfhosting/services/claude-code-remote-control.md)
|
||||
|
|
@ -60,7 +56,6 @@ updated: 2026-06-21T11:46
|
|||
* [Fail2ban Custom Jail: Nginx Bad Request Detection](02-selfhosting/security/fail2ban-nginx-bad-request-jail.md)
|
||||
* [Fail2ban Custom Jail: Apache Bad Request Detection](02-selfhosting/security/fail2ban-apache-bad-request-jail.md)
|
||||
* [SSH Hardening Fleet-Wide with Ansible](02-selfhosting/security/ssh-hardening-ansible-fleet.md)
|
||||
* [Migrating Flat Ansible Playbooks to Roles (Safely)](02-selfhosting/security/ansible-flat-playbooks-to-roles.md)
|
||||
* [ClamAV Fleet Deployment with Ansible](02-selfhosting/security/clamav-fleet-deployment.md)
|
||||
* [Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts](02-selfhosting/security/fail2ban-digest-mode-fleet.md)
|
||||
* [Apache CVE-2026-23918 — HTTP/2 Double Free Mitigation](02-selfhosting/security/apache-cve-2026-23918-http2-mitigation.md)
|
||||
|
|
@ -81,8 +76,6 @@ updated: 2026-06-21T11:46
|
|||
* [HEVC Batch Re-Encode for Plex Using VAAPI (AMD GPU)](04-streaming/plex/hevc-vaapi-batch-encode.md)
|
||||
* [Plex Transcoding Troubleshooting](04-streaming/plex/plex-transcoding-troubleshooting.md)
|
||||
* [Troubleshooting](05-troubleshooting/index.md)
|
||||
* [Wi-Fi Game Streaming Stutter: 160 MHz Channel Width Saturating the 5 GHz Radio](05-troubleshooting/networking/wifi-160mhz-airtime-saturation-game-streaming.md)
|
||||
* [Steam Deck Wi-Fi Flapping: IWD Periodic Scan + rtw88 Power Save](05-troubleshooting/networking/steam-deck-wifi-flapping-iwd-periodic-scan-rtw88.md)
|
||||
* [Apache Outage: Fail2ban Self-Ban + Missing iptables Rules](05-troubleshooting/networking/fail2ban-self-ban-apache-outage.md)
|
||||
* [Postfix + SendGrid: TLS Handshake Failure (Port 465 vs 587)](05-troubleshooting/networking/postfix-sendgrid-tls-handshake-failure.md)
|
||||
* [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](05-troubleshooting/networking/fail2ban-imap-self-ban-mail-client.md)
|
||||
|
|
@ -108,7 +101,6 @@ updated: 2026-06-21T11:46
|
|||
* [Gemini CLI Manual Update](05-troubleshooting/gemini-cli-manual-update.md)
|
||||
* [MajorWiki Setup & Publishing Pipeline](05-troubleshooting/majwiki-setup-and-pipeline.md)
|
||||
* [Gitea Actions Runner: Boot Race Condition Fix](05-troubleshooting/gitea-runner-boot-race-network-target.md)
|
||||
* [Forgejo: Account Recovery & CLI Admin When Locked Out of the GUI](05-troubleshooting/forgejo-mailer-and-cli-recovery.md)
|
||||
* [Cron Heartbeat False Alarm: /var/run Cleared by Reboot](05-troubleshooting/cron-heartbeat-tmpfs-reboot-false-alarm.md)
|
||||
* [SELinux: Fixing Dovecot Mail Spool Context (/var/vmail)](05-troubleshooting/selinux-dovecot-vmail-context.md)
|
||||
* [SELinux: Wrong /etc/localtime Label Silently Breaks Timezone Changes](05-troubleshooting/selinux-localtime-label-breaks-timezone.md)
|
||||
|
|
@ -119,17 +111,11 @@ updated: 2026-06-21T11:46
|
|||
* [Claude Desktop MCP Server Started via wsl.exe Sees Empty Environment (WSLENV)](05-troubleshooting/wsl-env-claude-desktop-mcp.md)
|
||||
* [Claude Desktop MCP Mass-Disconnect After Blocking SSH Reboot](05-troubleshooting/claude-desktop-mcp-mass-disconnect-blocking-reboot.md)
|
||||
* [Patching PHP 8.4 Implicit-Nullable Deprecations in Vendor Packages](05-troubleshooting/php-84-vendor-implicit-nullable-patch.md)
|
||||
* [WordPress 6.7 `_load_textdomain_just_in_time` Notice (Translations Loaded Too Early)](05-troubleshooting/wordpress-67-textdomain-just-in-time-notice.md)
|
||||
* [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md)
|
||||
* [Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt](05-troubleshooting/ollama-chat-template-pipe-stdin-bypass.md)
|
||||
* [Claude Code Won't Log In (Warp & iTerm2) — Corrupt Keychain Credential](05-troubleshooting/claude-code-warp-login-corrupt-keychain-credential.md)
|
||||
* [Claude Code Keychain Prompt Keeps Reappearing on macOS (ACL Invalidation)](05-troubleshooting/claude-code-keychain-prompt-recurring-macos.md)
|
||||
* [iPhone Mirroring Hangs on 'Connecting…' — AWDL Data Stall (27.0 Beta)](05-troubleshooting/iphone-mirroring-connecting-hang-awdl-stall-beta.md)
|
||||
* [rsync over Tailscale: Hung in TCP Teardown After Transfer Completes](05-troubleshooting/networking/rsync-tailscale-teardown-stall.md)
|
||||
* [iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators](05-troubleshooting/networking/tailscale-status-json-hostname-localhost-ios.md)
|
||||
* [macOS: Repeating Alert Tone from Mirrored iPhone Notification](05-troubleshooting/macos-mirrored-notification-alert-loop.md)
|
||||
* [Auditing & Cleaning macOS Background App Activity (sfltool dumpbtm)](05-troubleshooting/macos-background-app-activity-audit-sfltool.md)
|
||||
* [Time Machine: Orphaned APFS `.previous` Folder Blocks All Backups](05-troubleshooting/time-machine-apfs-orphaned-previous-blocks-backup.md)
|
||||
* [OBS Studio: Stale Script Paths After Windows Profile Rename](05-troubleshooting/obs-stale-script-paths-after-windows-profile-rename.md)
|
||||
* [ClamAV CPU Spike: Safe Scheduling with nice/ionice](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md)
|
||||
* [Logwatch Falsely Reports 'No freshclam updates' in ClamAV Daemon Mode](05-troubleshooting/security/freshclam-logwatch-false-no-updates.md)
|
||||
|
|
@@ -141,16 +127,10 @@ updated: 2026-06-21T11:46
|
|||
* [Ansible: SSH Timeout During dnf upgrade on Fedora Hosts](05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md)
|
||||
* [Ansible: regex_search Capture-Group Argument Fails in set_fact](05-troubleshooting/ansible-regex-search-set-fact-capture-group.md)
|
||||
* [Ansible: Ubuntu Reboot Detection Misses Kernel Upgrades](05-troubleshooting/ansible-ubuntu-reboot-detection-kernel-mismatch.md)
|
||||
* [Ansible: reboot.yml become Timeout on WSL2 Hosts (Exclude Them)](05-troubleshooting/ansible-reboot-become-timeout-wsl2.md)
|
||||
* [Fedora Networking & Kernel Troubleshooting](05-troubleshooting/fedora-networking-kernel-recovery.md)
|
||||
* [Systemd Session Scope Fails at Login](05-troubleshooting/systemd/session-scope-failure-at-login.md)
|
||||
* [wget/curl: URLs with Special Characters Fail in Bash](05-troubleshooting/wget-url-special-characters.md)
|
||||
* [Ansible: Check Mode False Positives in Verify/Assert Tasks](05-troubleshooting/ansible-check-mode-false-positives.md)
|
||||
* [Ansible Fails with Permission Denied While `ssh <alias>` Works (Host Alias Bypass)](05-troubleshooting/ansible-ssh-host-alias-bypass.md)
|
||||
* [SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure (No `Host` Block)](05-troubleshooting/networking/ssh-missing-host-block-magicdns-host-key-failure.md)
|
||||
* [MagicDNS Names vs Pinned IPs for Tailscale SSH (After a Fleet Migration)](05-troubleshooting/networking/tailscale-ssh-magicdns-vs-pinned-ip-after-migration.md)
|
||||
* [`Permission denied (publickey)` After Rotating a Key — Backfill Every `authorized_keys`](05-troubleshooting/networking/ssh-rotated-key-not-backfilled-authorized-keys.md)
|
||||
* [Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration](05-troubleshooting/networking/ansible-host-key-verification-failed-rebuilt-host.md)
|
||||
* [Logwatch Reports the Wrong Hostname (`<host>-hetzner`) After a Migration](05-troubleshooting/logwatch-wrong-hostname-after-migration.md)
|
||||
* [Ghost EmailAnalytics Lag Warning — What It Means and When to Worry](05-troubleshooting/ghost-emailanalytics-lag-warning.md)
|
||||
* [claude-mem: --setting-sources Empty Arg Bug (Claude Code 2.1.x)](05-troubleshooting/claude-mem-setting-sources-empty-arg.md)
|
||||
|
|
|
|||