Compare commits
112 commits
code/major
...
main
| Author | SHA1 | Date | |
|---|---|---|---|
| 8d9bd34118 | |||
| 2def4c6f30 | |||
| 44c9d38b9f | |||
| 623f04720c | |||
| 69d60b7753 | |||
| c358e0dfea | |||
| a45ef55862 | |||
| e767ebffcb | |||
| 96db073b78 | |||
| cf5e35da1d | |||
| cb90bb69a2 | |||
| 4599ed607c | |||
| 2bed2cbae3 | |||
| ebdb28e9e2 | |||
| 4fa5e33d93 | |||
| cfff75af1c | |||
| 06162273f7 | |||
| e1767bc19e | |||
| 0d08e21ee4 | |||
| 2121d3ff1b | |||
| 1d73b2defa | |||
| 34d9ee42b1 | |||
| 700ca95158 | |||
| a5df9e4873 | |||
| 7703b963e1 | |||
| 5050001909 | |||
| 9085740fa3 | |||
| 75154ff80c | |||
| 4c95f8a88a | |||
| 805c0f0a8f | |||
| e5d1e39af9 | |||
| 852375ddf0 | |||
| 9dd730fc29 | |||
| e0595c04fd | |||
| 27ea2dc62b | |||
| 3f94ebb963 | |||
| 14cc1ba4b8 | |||
| fecae727d1 | |||
| 0d1697c0d6 | |||
| 4f6898eb6c | |||
| 11b455a0e2 | |||
| bc4ff144df | |||
| 950759da52 | |||
| 877c4b815f | |||
| 27b1ae244c | |||
| ce2e761d33 | |||
| 513d94aa84 | |||
| 9b066d0e54 | |||
| 5ef0fdfad4 | |||
| a414e4cdbe | |||
| 06a794316b | |||
| 68bfb099ac | |||
| c3045e33dd | |||
| 0cde19e064 | |||
| 8d4dee5da3 | |||
| fda2d35ea5 | |||
| 01ae62e621 | |||
| 662741e7ad | |||
| d8f07e8e2e | |||
| 5d7354e856 | |||
| d755b77126 | |||
| 26eb13ab2f | |||
| 5260548caa | |||
| 2e58c4625c | |||
| b81362bb78 | |||
| 110a6d49e5 | |||
| e6a249403c | |||
| 4e63d8546c | |||
| 155651c373 | |||
| 73c10111e0 | |||
| 52ca8a0413 | |||
| dc897d4a67 | |||
| 3b8c8b0597 | |||
| 318f50c50b | |||
| 65b0aa4567 | |||
| eb39da9a26 | |||
| 7dc591d257 | |||
| 64ac418a36 | |||
| 28518e403e | |||
| a785e85821 | |||
| 4ec481c584 | |||
| c22457f1aa | |||
| ac84610380 | |||
| 3df0979786 | |||
| de9b661b9d | |||
| 9c62e7f804 | |||
| 724ae2a5e3 | |||
| 631d7e8bc5 | |||
| a852f7b7bd | |||
| af14e36caf | |||
| 545df9f5c6 | |||
| 7c566cda50 | |||
| 1c17bdb60a | |||
| 393df3cc45 | |||
| 306e5f1f16 | |||
| 3bcc58a805 | |||
| 5f31a57ae6 | |||
| 7e422ee332 | |||
| 3c4cc74aef | |||
| ca123b0312 | |||
| 488268ccd1 | |||
| 213a84ed79 | |||
| ae864452f8 | |||
| 49a1173dfc | |||
| c5b4de4184 | |||
| 021c7f6539 | |||
| 4126656c05 | |||
| 264f1f64c3 | |||
| 74c4ed9959 | |||
| 34cc5c3d0b | |||
| 6e7a0ca21f | |||
| 85f8a5df2d |
80 changed files with 7451 additions and 234 deletions
0
.githooks/pre-commit
Normal file → Executable file
|
|
@ -10,7 +10,7 @@ tags:
|
|||
- majorrig
|
||||
status: published
|
||||
created: 2026-03-16
|
||||
updated: 2026-04-29T22:45
|
||||
updated: 2026-04-30T05:21
|
||||
---
|
||||
|
||||
# WSL2 Backup via PowerShell Scheduled Task
|
||||
|
|
|
|||
119
01-linux/distro-specific/wsl2-fedora44-inplace-upgrade.md
Normal file
|
|
@ -0,0 +1,119 @@
|
|||
---
title: WSL2 In-Place Upgrade to Fedora 44 (with gcc14 Blocker + CUDA Repo Swap)
domain: linux
category: distro-specific
tags:
- wsl2
- fedora
- windows
- upgrade
- dnf
- cuda
- majorrig
status: published
created: 2026-06-11
updated: 2026-06-11
---

# WSL2 In-Place Upgrade to Fedora 44 (with gcc14 Blocker + CUDA Repo Swap)

In-place upgrade of the FedoraLinux-43 WSL2 instance on MajorRig to Fedora 44 using `dnf system-upgrade` + `dnf5 offline reboot`. Hit one transaction blocker (`gcc14` compat package retired in F44) and swapped the stale `cuda-fedora39` repo to `cuda-fedora44` afterward. Performed 2026-06-11.

## The Short Answer

```powershell
# PowerShell — backup first
wsl --shutdown
wsl --export FedoraLinux-43 D:\backups\fedora43.tar
```

```bash
# Inside Fedora
sudo dnf upgrade --refresh -y
sudo shutdown -h now
# relaunch, then:
sudo dnf remove gcc14-c++ gcc14 # F44 dropped gcc14 — blocks the transaction
sudo dnf system-upgrade download --releasever=44
sudo dnf5 offline reboot # applies offline upgrade, shuts distro down
# wait a few minutes, relaunch:
cat /etc/fedora-release # → Fedora release 44 (Forty Four)
```

```powershell
# PowerShell — keep WSL itself current
wsl --update
```

## Steps

1. **Back up the instance** (PowerShell). The export tar is roughly the size of the installed system — this one was 86 GB. The target directory must already exist or you get `Wsl/ERROR_PATH_NOT_FOUND`.

```powershell
wsl --shutdown
mkdir D:\backups
wsl --export FedoraLinux-43 D:\backups\fedora43.tar
```

2. **Fully update the current release, then restart the distro**

```bash
sudo dnf upgrade --refresh -y
sudo shutdown -h now
```

3. **Remove upgrade blockers.** `gcc14`/`gcc14-c++` (compat packages) were retired in Fedora 44, so the transaction fails with "does not belong to a distupgrade repository". Remove them (or use `--allowerasing` and review the summary):

```bash
sudo dnf remove gcc14-c++ gcc14
```

4. **Download and apply the upgrade**

```bash
sudo dnf system-upgrade download --releasever=44
sudo dnf5 offline reboot
```

The "reboot" applies the offline transaction and shuts the distro down — there's no real systemd reboot in WSL. Wait a couple of minutes, then relaunch. If it errors on `systemctl`, the fallback is:

```bash
export DNF_SYSTEM_UPGRADE_NO_REBOOT=1
sudo -E dnf system-upgrade reboot
```

5. **Verify and tidy up**

```bash
cat /etc/fedora-release # Fedora release 44 (Forty Four)
sudo dnf upgrade --refresh # catch post-upgrade updates
gcc --version # F44 ships gcc 16; reinstall with `dnf install gcc gcc-c++` if removed
```

```powershell
wsl --update # fixes the post-upgrade Wsl/Service/E_UNEXPECTED catastrophic failure some users hit
```

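The release check in step 5 can be scripted so a failed upgrade is reported instead of being read off the terminal. This is a minimal sketch, assuming `/etc/fedora-release` keeps the `Fedora release NN (...)` format shown above; `fedora_release_number` and `check_release` are hypothetical helper names, not part of dnf or WSL.

```bash
# Hypothetical helper: print the numeric release from a
# "Fedora release NN (Name)" line read on stdin.
fedora_release_number() {
    sed -n 's/^Fedora release \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: compare the release recorded in a file
# (default /etc/fedora-release) against the expected number.
check_release() {
    expected="$1"
    release_file="${2:-/etc/fedora-release}"
    actual="$(fedora_release_number < "$release_file")"
    if [ "$actual" = "$expected" ]; then
        echo "OK: Fedora $actual"
    else
        echo "MISMATCH: expected $expected, found ${actual:-unknown}"
        return 1
    fi
}
```

Inside the relaunched distro, `check_release 44` should print `OK: Fedora 44` once the offline transaction has been applied, and return a non-zero status otherwise.
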
## CUDA Repo Swap

`dnf repolist` still showed `cuda-fedora39-x86_64` — NVIDIA repos are pinned per Fedora release and don't follow distro upgrades. NVIDIA publishes a fedora44 repo:

```bash
sudo rm /etc/yum.repos.d/cuda-fedora39*.repo
sudo dnf config-manager addrepo --from-repofile=https://developer.download.nvidia.com/compute/cuda/repos/fedora44/x86_64/cuda-fedora44.repo
sudo dnf upgrade --refresh
sudo dnf repolist # confirm cuda-fedora44-x86_64
```

**WSL caveat:** never install the NVIDIA *driver* inside WSL — the Windows host driver provides the GPU. Only install toolkit packages (e.g. `cuda-toolkit`).

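Because the CUDA repo ID embeds the Fedora release it was added for, a stale one can be spotted mechanically after any upgrade. The sketch below assumes repo IDs follow the `cuda-fedoraNN-...` pattern seen in this article and that the repo ID is the first column of `dnf repolist` output; `stale_cuda_repos` is a hypothetical helper name.

```bash
# Hypothetical helper: read `dnf repolist` output on stdin and print
# every cuda-fedoraNN repo ID whose NN differs from the given release.
stale_cuda_repos() {
    current="$1"
    awk -v cur="$current" '
        $1 ~ /^cuda-fedora[0-9]+/ {
            rel = $1
            sub(/^cuda-fedora/, "", rel)   # strip the fixed prefix
            sub(/[^0-9].*$/, "", rel)      # keep only the leading digits
            if (rel != cur) print $1
        }'
}
```

A typical call would be `dnf repolist | stale_cuda_repos 44`; empty output means no mismatched CUDA repo is configured.
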
## Gotchas & Notes

- **Don't jump more than two releases at once** (for example, 42 → 44 is the limit). If the instance is further behind, upgrade in stages.
- **The WSL distro name is just a Windows label** — it still says "FedoraLinux-43" after the upgrade. Cosmetic fixes: Windows Terminal profile name, Start Menu shortcut, and `DistributionName`/`ShortcutPath` under `HKCU\Software\Microsoft\Windows\CurrentVersion\Lxss\{uuid}`.
- **Keep the backup tar** until the upgraded instance has proven stable for a few days, then delete to reclaim the space.
- **Restore path if needed:** `wsl --import FedoraRestore C:\WSL\FedoraRestore D:\backups\fedora43.tar` — remember imports default to root; fix via `/etc/wsl.conf` `[user] default=majorlinux`.

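The staged-upgrade rule in the first bullet can be turned into a list of `--releasever` targets. This is an illustrative sketch that assumes the limit means at most two release numbers per jump; `upgrade_path` is a hypothetical helper, and each printed target would still need its own download and offline reboot cycle.

```bash
# Hypothetical helper: print the sequence of --releasever targets needed
# to get from one Fedora release to another, never jumping more than
# two releases at a time.
upgrade_path() {
    from="$1"
    to="$2"
    path=""
    while [ "$from" -lt "$to" ]; do
        step=$((from + 2))
        if [ "$step" -gt "$to" ]; then
            step="$to"
        fi
        path="${path}${path:+ }${step}"
        from="$step"
    done
    echo "$path"
}
```

For the upgrade in this article, `upgrade_path 43 44` prints a single target, `44`; an instance still on 39 would get `41 43 44`.
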
## See Also

- [WSL2 Instance Migration (Fedora 43)](wsl2-instance-migration-fedora43.md)
- [WSL2 Backup via PowerShell](wsl2-backup-powershell.md)
|
|
@ -23,7 +23,14 @@ A collection of guides covering Linux administration, shell scripting, networkin
|
|||
- [Ansible Getting Started](shell-scripting/ansible-getting-started.md)
- [Bash Scripting Patterns](shell-scripting/bash-scripting-patterns.md)

## Storage

- [SnapRAID & MergerFS Storage Setup](storage/snapraid-mergerfs-setup.md)
- [mdadm — Rebuilding a RAID Array After Reinstall](storage/mdadm-raid-rebuild.md)
- [Growing an LVM Volume by Absorbing Another Disk](storage/lvm-grow-volume-absorb-disk.md)

## Distro-Specific

- [Linux Distro Guide for Beginners](distro-specific/linux-distro-guide-beginners.md)
- [WSL2 Instance Migration to Fedora 43](distro-specific/wsl2-instance-migration-fedora43.md)
- [WSL2 In-Place Upgrade to Fedora 44](distro-specific/wsl2-fedora44-inplace-upgrade.md)
|
|
|
|||
|
|
@ -10,7 +10,7 @@ tags:
|
|||
- remote-access
|
||||
status: published
|
||||
created: 2026-03-08
|
||||
updated: 2026-04-22T09:20
|
||||
updated: 2026-04-30T05:21
|
||||
---
|
||||
|
||||
# SSH Config and Key Management
|
||||
|
|
|
|||
159
01-linux/storage/lvm-grow-volume-absorb-disk.md
Normal file
|
|
@ -0,0 +1,159 @@
|
|||
---
title: "Growing an LVM Volume by Absorbing Another Disk"
domain: linux
category: storage
tags: [lvm, lvextend, vgextend, pvcreate, resize2fs, ext4, storage, disk, homelab]
status: published
created: 2026-06-17
updated: 2026-06-17
---

# Growing an LVM Volume by Absorbing Another Disk

When an LVM-backed filesystem fills up and its volume group (VG) has no free
extents, you can grow it by adding a second physical disk as a new physical
volume (PV), extending the VG onto it, then extending the logical volume (LV)
and its filesystem. With ext4 this can be done **online** — no unmount, no
downtime for the volume being grown.

This guide covers the common case where the disk you want to absorb is currently
in use by its own LVM volume (you must evacuate and tear that down first), and
the precautions that keep it safe.

> [!warning] This enlarges your failure domain
> A single LV spanning two disks linearly (the default — no RAID/mirror) means
> **losing either disk loses the entire volume.** ext4 has no parity. Only do
> this for data you can rebuild, or layer redundancy (mdadm/LVM RAID) underneath.
> Back up anything irreplaceable first.

## The Short Answer

If the target disk (`/dev/sdX`) is already empty and unused:

```bash
sudo pvcreate /dev/sdX
sudo vgextend myvg /dev/sdX
sudo lvextend -l +100%FREE /dev/myvg/mylv
sudo resize2fs /dev/mapper/myvg-mylv # ext4, online; use xfs_growfs for XFS
```

The rest of this article handles the harder case: the target disk is currently
holding its own LVM volume with data on it.

## Step-by-Step

### 1. Survey the current layout

```bash
sudo pvs # physical volumes → which VG each belongs to
sudo vgs # volume groups, free extents (VFree)
sudo lvs # logical volumes and sizes
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
df -h
```

Confirm:

- The VG you want to grow (`myvg`) has `0` `VFree` (that's why you're here).
- The disk you want to absorb (`/dev/sdX`) is a **standalone** PV — not a member
  of an mdadm array, a mergerfs branch, or a SnapRAID parity disk. Repurposing a
  disk that something else depends on will break that thing silently.

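The "no free extents" check can also be done without reading the table by eye. A minimal sketch, assuming the input comes from `vgs --noheadings -o vg_name,vg_free_count`; `full_vgs` is a hypothetical helper name.

```bash
# Hypothetical helper: read "vg_name free_extent_count" pairs on stdin
# and print the names of volume groups with zero free extents.
full_vgs() {
    awk 'NF >= 2 && $2 == 0 { print $1 }'
}
```

Running `sudo vgs --noheadings -o vg_name,vg_free_count | full_vgs` should list `myvg` if it really is out of space; a VG that does not appear still has extents that `lvextend` could use without adding a disk.
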
### 2. Evacuate the disk you're about to absorb

Anything on the target disk will be **destroyed**. Move it somewhere with room to
spare, then prove the copy is intact before you trust it.

```bash
# Copy preserving permissions/timestamps
sudo rsync -a /mnt/olddisk/important /destination/with/space/

# Verify byte-for-byte — empty output + exit code 0 means identical
sudo diff -rq /mnt/olddisk/important /destination/with/space/important && echo OK
```

For large trees the `diff -rq` (full byte comparison) is slow but is the
authoritative check — don't skip it before the destructive phase. If an
application tracks files by path (databases, media servers), update its path
references to the new location *now*, while the old copy still exists as a
fallback.

### 3. Unmount and remove the old disk from fstab

```bash
sudo fuser -m /mnt/olddisk # confirm nothing holds it open
sudo umount /mnt/olddisk
mountpoint -q /mnt/olddisk && echo "STILL MOUNTED" || echo "unmounted"

sudo cp /etc/fstab /etc/fstab.bak-$(date +%Y%m%d) # always back up fstab
sudo sed -i '/olddisk/d' /etc/fstab # remove the stale entry
grep olddisk /etc/fstab || echo "fstab line gone"
```

> [!tip] Verify your `sed` pattern only matches the line you mean
> A too-broad pattern can delete the wrong fstab entry. Check the file before and
> after, and keep the backup until you've confirmed the system still boots.

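One way to act on that tip is to preview exactly which lines the pattern matches before running `sed -i`. The sketch below is an illustration, not part of the original procedure: `fstab_preview` is a hypothetical helper that prints the matching lines with their line numbers and fails unless exactly one line matches.

```bash
# Hypothetical helper: show the lines a pattern would delete from a file
# (default /etc/fstab) and fail unless exactly one line matches.
fstab_preview() {
    pattern="$1"
    file="${2:-/etc/fstab}"
    # grep exits non-zero on zero matches, so guard both calls.
    matches="$(grep -c -- "$pattern" "$file" || true)"
    grep -n -- "$pattern" "$file" || true
    if [ "$matches" -ne 1 ]; then
        echo "expected exactly 1 matching line, found $matches" >&2
        return 1
    fi
}
```

Chaining it as `fstab_preview olddisk && sudo sed -i '/olddisk/d' /etc/fstab` runs the delete only when the pattern is unambiguous. `grep` and the `sed` address both treat the pattern as a basic regular expression, so the preview reflects what `sed` would remove, provided the pattern contains no `/`.
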
### 4. Tear down the old disk's LVM

```bash
sudo lvremove -y /dev/oldvg/oldlv
sudo vgremove -y oldvg
sudo pvremove -y /dev/sdX # wipes the LVM label off the disk
```

This is the point of no return for the old disk's data — which is why steps 2–3
verified the copy first.

### 5. Add the disk to the target VG and extend

```bash
sudo pvcreate -y /dev/sdX
sudo vgextend myvg /dev/sdX
sudo lvextend -l +100%FREE /dev/myvg/mylv
```

`lvs`/`vgs` should now show the LV grown to span both disks and `0` free extents.

### 6. Grow the filesystem (online)

```bash
# ext4 — works while mounted
sudo resize2fs /dev/mapper/myvg-mylv

# XFS — grows online too, but takes the mountpoint, not the device
sudo xfs_growfs /mountpoint
```

`resize2fs` is idempotent — if it gets interrupted, just run it again; it reports
"Nothing to do!" once the filesystem already fills the LV.

### 7. Verify

```bash
df -h /mountpoint # should reflect the new larger size
sudo pvs # /dev/sdX now listed under myvg
sudo vgs myvg # two PVs, larger VSize
```

## Notes & Gotchas

- **Online resize works for the volume being grown, not the one being removed.**
  The disk you absorb must be unmounted and torn down; the destination LV stays
  mounted throughout.
- **`resize2fs` interruption is safe.** ext4 online resize is journaled; re-run it.
- **macOS cruft on evacuated disks.** Trees touched by macOS often carry
  `._*` AppleDouble files and `.DS_Store` — harmless to drop, but they inflate
  file counts in `diff`/`rsync` output. Don't mistake them for real data.
- **Check SMART on a disk you're promoting into a bigger role.** A disk with a
  pending-sector history is riskier once it's in the critical path for a whole
  multi-disk volume than it was holding a small isolated one.
- **Mountpoint cleanup.** After the old disk is gone, its former mountpoint
  directory may reappear (it was shadowed by the mount). `rmdir` it if empty.
  Note `ls -A` exits `0` on an empty directory, so don't gate cleanup on its exit
  status — test contents explicitly.

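The last bullet's "test contents explicitly" can look like the following. It is a small sketch with hypothetical helper names (`dir_is_empty`, `cleanup_mountpoint`): it checks what `ls -A` prints instead of its exit status.

```bash
# Hypothetical helper: succeed only if the directory exists and has no
# entries at all (dotfiles included).
dir_is_empty() {
    [ -d "$1" ] && [ -z "$(ls -A -- "$1")" ]
}

# Hypothetical helper: remove a leftover mountpoint directory only when
# it is truly empty; otherwise leave it alone and say so.
cleanup_mountpoint() {
    if dir_is_empty "$1"; then
        rmdir -- "$1" && echo "removed $1"
    else
        echo "left $1 in place (missing or not empty)"
    fi
}
```

For example, `cleanup_mountpoint /mnt/olddisk` removes the directory only if nothing, not even a stray `.DS_Store`, is inside it.
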
## Related

- [SnapRAID & MergerFS Storage Setup](snapraid-mergerfs-setup.md) — add redundancy/parity instead of a linear span
- [mdadm — Rebuilding a RAID Array After Reinstall](mdadm-raid-rebuild.md)
|
|
@ -5,7 +5,7 @@ category: cloud
|
|||
tags: [aws, s3, cost, billing, mastodon, glacier]
|
||||
status: published
|
||||
created: 2026-04-19
|
||||
updated: 2026-04-19
|
||||
updated: 2026-06-01
|
||||
---
|
||||
|
||||
# AWS S3 Cost Management
|
||||
|
|
@ -17,24 +17,24 @@ The majorlinux AWS account is used exclusively for S3 object storage. This cover
|
|||
- **Account ID:** `408469496267`
|
||||
- **Account name:** majorlinux
|
||||
- **Services in use:** S3 (Standard + Glacier Deep Archive), AWS Config, Cost Explorer
|
||||
- **Monthly spend:** ~$32/mo (March 2026); expected ~$16/mo post-media-prune
|
||||
- **Monthly spend:** ~$24/mo (May 2026, post-media-prune, post-STANDARD_IA revert)
|
||||
|
||||
## Buckets and Cost Drivers
|
||||
|
||||
| Bucket | Size | Storage Class | Cost/mo | Purpose |
|
||||
|--------|------|---------------|---------|--------|
|
||||
| `majortoot` | 648 GB (mostly remote cache) | S3 Standard | ~$15/mo | Mastodon media |
|
||||
| `majorhomebackup` | 16 TiB | Glacier Deep Archive | ~$16/mo | MLS stream archives (sole copy) |
|
||||
| `majortoot` | ~7 GB (one-time prune; automation disabled) | S3 Standard | ~$0.16/mo | Mastodon media |
|
||||
| `majorhomebackup` | 16 TiB | Glacier Deep Archive | ~$11–12/mo | MLS stream archives (sole copy) |
|
||||
| `config-bucket-*` | ~185 KB | S3 Standard | ~$0.00 | AWS Config snapshots |
|
||||
|
||||
## CLI Setup
|
||||
|
||||
AWS CLI installed on MajorMac via Homebrew. Credentials configured at `~/.aws/credentials`.
|
||||
AWS CLI installed on MajorMac via Homebrew. Credentials for `MajorCLI` user at `~/.aws/credentials`.
|
||||
|
||||
```bash
|
||||
brew install awscli
|
||||
# Credentials pulled from Ansible vault:
|
||||
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY in group_vars/all/vault.yml
|
||||
# Credentials: MajorCLI IAM user (S3 + Billing read access)
|
||||
# Key ID: AKIAV6GVN4HF4Y6EV4NM — created 2026-05-23
|
||||
```
|
||||
|
||||
### Useful commands
|
||||
|
|
@ -42,18 +42,22 @@ brew install awscli
|
|||
```bash
|
||||
# Check current month spend by service
|
||||
aws ce get-cost-and-usage \
|
||||
--time-period Start=2026-04-01,End=2026-04-30 \
|
||||
--time-period Start=2026-05-01,End=2026-06-01 \
|
||||
--granularity MONTHLY \
|
||||
--metrics "UnblendedCost" \
|
||||
--group-by Type=DIMENSION,Key=SERVICE
|
||||
|
||||
# Daily cost breakdown with top usage types
|
||||
aws ce get-cost-and-usage \
|
||||
--time-period Start=2026-05-01,End=2026-05-23 \
|
||||
--granularity DAILY \
|
||||
--metrics "UnblendedCost" \
|
||||
--filter '{"Dimensions":{"Key":"SERVICE","Values":["Amazon Simple Storage Service"]}}' \
|
||||
--group-by Type=DIMENSION,Key=USAGE_TYPE
|
||||
|
||||
# View anomaly alerts
|
||||
aws ce get-anomalies \
|
||||
--date-interval StartDate=2026-04-01,EndDate=2026-04-30
|
||||
|
||||
# Check conformance pack compliance
|
||||
aws configservice get-conformance-pack-compliance-details \
|
||||
--conformance-pack-name MajorConformance
|
||||
--date-interval StartDate=2026-05-01,EndDate=2026-05-31
|
||||
|
||||
# List budgets
|
||||
aws budgets describe-budgets --account-id 408469496267
|
||||
|
|
@ -62,25 +66,52 @@ aws budgets describe-budgets --account-id 408469496267
|
|||
## Budget Alert
|
||||
|
||||
`MajorS3MonthlyAlert` configured 2026-04-19:
|
||||
- 80% threshold → email at $20 actual spend
|
||||
- 100% threshold → email at $25 actual spend
|
||||
- 80% threshold → email at $24 actual spend
|
||||
- 100% threshold → email at $30 actual spend
|
||||
- Recipient: maj.linux@gmail.com
|
||||
|
||||
> [!note] Thresholds updated 2026-05-23 to reflect actual ~$24/mo steady-state spend (was $20/$25, set when spend was higher due to large majortoot bucket before prune took effect).
|
||||
|
||||
## Cost Reduction Options
|
||||
|
||||
### majortoot — S3 Standard-IA
|
||||
### majortoot — S3 Standard-IA (⚠️ DO NOT USE — tried and reverted)
|
||||
|
||||
Switching `S3_STORAGE_CLASS=STANDARD_IA` in Mastodon's `.env.production` reduces storage cost from $0.023/GB to $0.0125/GB for new uploads. Expected saving: ~$4–5/mo after cache is pruned down to local-only content.
|
||||
**Attempted 2026-05 — reverted 2026-05-17. Do not retry without careful planning.**
|
||||
|
||||
See [[mastodon-instance-tuning]] for full instructions.
|
||||
The theory: switching `S3_STORAGE_CLASS=STANDARD_IA` saves ~$4–5/mo on storage. In practice, the bulk avatar restore operation (`restore-avatars.sh`, May 9–10) ran while STANDARD_IA was active. The ~5,223 account refreshes across 1,095 domains generated ~470,000 SIA Tier 1 PUT requests ($4.72) plus early-deletion fees ($1.21) when the objects were replaced after reverting to STANDARD on May 17.
|
||||
|
||||
### majortoot — Weekly media prune
|
||||
**STANDARD_IA is only economical if:**
|
||||
- The bucket has no large bulk-write operations (media cache rebuilds, avatar restores)
|
||||
- Objects are written and left for >30 days (early deletion incurs minimum 30-day fee)
|
||||
- The per-request cost ($0.01/1,000 for SIA vs $0.005/1,000 for Standard) doesn't offset storage savings
|
||||
|
||||
Weekly cron deployed (`0 3 * * 0`) via `configure_mastodon_media_prune.yml`. Removes remote federated cache older than 7 days. Expected to reduce bucket from 648 GB to ~7 GB over time.
|
||||
With the weekly prune now running correctly and the bucket shrinking toward ~7 GB, the storage savings of SIA are negligible (~$0.05/mo). **Leave at STANDARD.**
|
||||
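The trade-off described above is simple arithmetic, and it can be rerun before trying Standard-IA again. This is a rough sketch using only the prices quoted in this section (they are not fetched from AWS and may change); `request_cost` and `sia_storage_saving` are hypothetical helper names.

```bash
# Hypothetical helper: dollar cost of COUNT requests at RATE per 1,000.
request_cost() {
    awk -v n="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", n / 1000 * rate }'
}

# Hypothetical helper: monthly storage saving in dollars for GB of data
# held in Standard-IA ($0.0125/GB) instead of Standard ($0.023/GB).
sia_storage_saving() {
    awk -v gb="$1" 'BEGIN { printf "%.2f\n", gb * (0.023 - 0.0125) }'
}
```

`request_cost 470000 0.01` gives 4.70, in line with the billed $4.72 for roughly 470,000 PUTs, while the same writes at the Standard rate (`request_cost 470000 0.005`) would have cost 2.35. Comparing either figure with `sia_storage_saving` for the expected bucket size shows how many months of storage savings a single bulk write consumes.
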
|
||||
### majortoot — media pruning (automation DISABLED 2026-06-01)
|
||||
|
||||
A weekly prune cron (`0 3 * * 0`, via `configure_mastodon_media_prune.yml`) **used to** run `tootctl media remove --days=7`. It shrank the bucket from 648 GB to ~7 GB — a one-time cleanup of years of accumulated remote **attachment** cache, which is safe and accounts for the bulk of the savings above.
|
||||
|
||||
**That automation was removed 2026-06-01.** The same playbook also carried a monthly `tootctl accounts refresh --all`, and automated profile pruning (plus a storage-level deletion during the cost-cull/migration) repeatedly broke remote avatars. The playbook is now an *enforce-absent* guard, and a [synthetic upload health check](../services/mastodon-s3-acl-upload-failures.md) alerts if media serving/uploads regress. See [[mastodon-prune-profiles-trap]] and [[mastodon-s3-acl-upload-failures]].
|
||||
|
||||
**Going forward:** the bucket is already small (~7 GB) and attachment cache re-accumulates slowly. If it ever grows enough to matter, run an **attachment-only** prune **manually and deliberately** (`bin/tootctl media remove --days=30`) — never automate profile/header pruning or `accounts refresh --all`.
|
||||
|
||||
### majorhomebackup — Self-host consideration
|
||||
|
||||
Deep Archive at $0.00099/GB is the cheapest cloud tier — no cloud alternative is cheaper. If the MLS archives are no longer needed, deletion would save ~$16/mo. A 20TB HDD (~$300–400) would break even in ~2 years vs. continued cloud storage. **These are the sole copy — do not delete without a separate backup.**
|
||||
Deep Archive at $0.00099/GB is the cheapest cloud tier — no cloud alternative is cheaper. If the MLS archives are no longer needed, deletion would save ~$11–12/mo. A 20TB HDD (~$300–400) would break even in ~2.5 years vs. continued cloud storage. **These are the sole copy — do not delete without a separate backup.**
|
||||
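The break-even estimate is the drive price divided by the monthly bill. A minimal sketch, using the midpoints of the ranges quoted above as assumed inputs and ignoring power, redundancy, and drive failure; `breakeven_years` is a hypothetical helper name.

```bash
# Hypothetical helper: years until a one-off hardware cost equals the
# cumulative monthly cloud bill.
breakeven_years() {
    awk -v hw="$1" -v monthly="$2" 'BEGIN { printf "%.1f\n", hw / monthly / 12 }'
}
```

With a $350 drive and $11.50/mo, `breakeven_years 350 11.5` gives 2.5, matching the estimate in the text; at the earlier ~$16/mo figure the same drive paid for itself in about 1.8 years.
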
|
||||
## IAM Users
|
||||
|
||||
| User | Scope | Credentials location | Notes |
|------|-------|---------------------|-------|
| `MajorToot` | S3 full (MajorsHouse group) | `~/.aws/credentials` on majortoot | Key rotated 2026-05-23 |
| `MajorHome` | S3 full (MajorsHouse group) | `~/.aws/credentials` on majorhome | Key pending rotation (see below) |
| `MajorCLI` | S3 full + Billing read (MajorsHouse group + AWSBillingReadOnlyAccess) | `~/.aws/credentials` on MajorMac | Created 2026-05-23, replaces root key |
|
||||
> [!warning] Root access keys deleted 2026-05-23. Do NOT create new root access keys. Use `MajorCLI` for CLI work on MajorMac. The root account password (in Vaultwarden) is sufficient for console access.
|
||||
|
||||
> [!warning] MajorHome key (`AKIAV6GVN4HF7POCNW6D`) exposed in shell session 2026-05-23. Rotate via AWS Console → IAM → Users → MajorHome → Security credentials. Update `~/.aws/credentials` on majorhome afterward.
|
||||
|
||||
> [!note] `MajorCLI` does not have IAM permissions. Future key rotation requires AWS Console login or temporary IAM policy attachment. Consider adding a `SelfManageKeys` inline policy to `MajorCLI` via console.
|
||||
|
||||
## Conformance Pack
|
||||
|
||||
|
|
@ -92,15 +123,35 @@ Deep Archive at $0.00099/GB is the cheapest cloud tier — no cloud alternative
|
|||
|
||||
Evaluations cost $0.001 each and run on a periodic schedule. Safe to ignore; at current scale costs pennies per month.
|
||||
|
||||
## IAM Users
|
||||
|
||||
| User | Scope | Credentials location |
|
||||
|------|-------|---------------------|
|
||||
| `MajorToot` | S3 only — no billing/Cost Explorer | `~/.aws/credentials` on majortoot |
|
||||
| Root | Full access | `~/.aws/credentials` on MajorMac (configured 2026-04-19) |
|
||||
## CloudTrail Audit Logging

`MajorTrail` configured 2026-05-23:
- **S3 bucket:** `majorcloudtrail-408469496267`
- **Multi-region:** yes — captures API calls across all regions
- **Global service events:** yes — includes IAM, STS, S3 control plane
- **Log file validation:** enabled — tamper detection via digest files
- **Retention:** logs accumulate in S3; no automatic expiry configured

Use CloudTrail to investigate unexpected cost spikes, IAM key usage, and bucket write activity. Without it, historical API calls are unrecoverable (learned the hard way from the May 2026 SIA spike investigation).

```bash
# List recent CloudTrail events (last 1h, S3 writes only)
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=PutObject \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --query 'Events[].{Time:EventTime,User:Username,Resource:Resources[0].ResourceName}' \
  --output table

# Look up events by specific access key
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=AccessKeyId,AttributeValue=AKIAV6GVN4HF3BWAIAGC \
  --output table
```

## Related
|
||||
|
||||
- [[Services/AWS]] — infrastructure record
|
||||
- [[mastodon-instance-tuning]] — media cache management
|
||||
- [[mastodon-prune-profiles-trap]] — avatar restore incident + bulk-restore procedure
|
||||
- [[mastodon-s3-acl-upload-failures]] — silent upload failures on ACL-disabled buckets
|
||||
- [[majortoot]] — Mastodon host
|
||||
|
|
|
|||
98
02-selfhosting/cloud/vps-migration-baseline-checklist.md
Normal file
|
|
@ -0,0 +1,98 @@
|
|||
---
title: VPS Migration Baseline Checklist
description: What to verify after migrating a server to a new provider — the packages, services, and configs that must match the old box
tags:
- migration
- vps
- hetzner
- digitalocean
- ansible
- checklist
status: published
created: 2026-05-09
updated: 2026-05-13T10:35
---

# VPS Migration Baseline Checklist

When migrating a server from one VPS provider to another, it's easy to focus on the application (bots, web services, databases) and forget the infrastructure baseline. This checklist covers the common components that make a server operational beyond just running the app.

## Background

During the Hetzner migration (2026-05), `majordiscord` was migrated with only the application layer (PhantomBot, Red-DiscordBot) and core infrastructure (Netdata, Tailscale, fail2ban). Missing from the new box: Postfix (email relay), logwatch, ClamAV, and dnf-automatic. The gap went unnoticed for a week because all monitoring email depended on the missing Postfix.

## The Checklist

### Before Migration

Power on both old and new boxes. Run this comparison to find gaps:

```bash
# Fedora — list baseline packages on both hosts
ssh root@OLD_HOST 'rpm -qa --qf "%{NAME}\n" | sort | grep -iE "fail2ban|logwatch|postfix|netdata|clamav|dnf-auto|tailscale|cronie|firewalld"'
ssh root@NEW_HOST 'rpm -qa --qf "%{NAME}\n" | sort | grep -iE "fail2ban|logwatch|postfix|netdata|clamav|dnf-auto|tailscale|cronie|firewalld"'

# Ubuntu — list baseline packages on both hosts
ssh root@OLD_HOST 'dpkg -l | grep -iE "fail2ban|logwatch|postfix|netdata|clamav|unattended|tailscale" | awk "{print \$2}" | sort'
ssh root@NEW_HOST 'dpkg -l | grep -iE "fail2ban|logwatch|postfix|netdata|clamav|unattended|tailscale" | awk "{print \$2}" | sort'
```

Compare enabled services:

```bash
ssh root@HOST 'systemctl list-unit-files --state=enabled --no-pager | grep -iE "fail2ban|logwatch|postfix|netdata|clamav|dnf-auto|tailscale|cronie|firewalld|sshd"'
```

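Reading two package lists side by side is how the Postfix gap was missed, so it helps to compute the difference. A minimal sketch, assuming the output of the commands above has been saved to one file per host (`old.txt` and `new.txt` are placeholder names); `missing_on_new` is a hypothetical helper.

```bash
# Hypothetical helper: print package names listed in the first file
# (old host) that are absent from the second file (new host).
missing_on_new() {
    old_list="$1"
    new_list="$2"
    # -F fixed strings, -x whole-line match, -v invert: keep lines from
    # the old list that match no line of the new list.
    sort -u "$old_list" | grep -Fxv -f "$new_list" || true
}
```

After redirecting each `ssh ... rpm -qa ...` command into its own file, `missing_on_new old.txt new.txt` prints only the packages that still need installing; empty output means the baseline matches.
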
### Baseline Components

Every server in the fleet should have these. Check each one after migration:

| Component | Package (Fedora) | Package (Ubuntu) | Ansible Playbook | Notes |
|-----------|-----------------|------------------|------------------|-------|
| Monitoring | `netdata` | `netdata` | `netdata.yml` | Claim to Netdata Cloud if applicable |
| VPN | `tailscale` | `tailscale` | — (manual join) | Rename node in Tailscale admin |
| Intrusion prevention | `fail2ban` | `fail2ban` | `harden.yml` | Check jail.local, banaction matches firewall |
| Email relay | `postfix` | `postfix` | `configure_postfix_relay.yml` | Required by logwatch, Netdata, fail2ban |
| Log summaries | `logwatch` | `logwatch` | `logwatch.yml` | Override file, not defaults — see [logwatch fleet setup](../monitoring/logwatch-fleet-setup.md) |
| Firewall | `firewalld` | `ufw` | `configure_firewall_*.yml` | Verify fail2ban banaction matches |
| Cron | `cronie` | `cron` | — (usually pre-installed) | Required by logwatch |
| Auto-updates | `dnf-automatic` | `unattended-upgrades` | `ansible-unattended-upgrades-fleet` | Security patches only |
| Antivirus | `clamav` | `clamav` | `clamav.yml` (clamav role) | Internet-facing hosts only |
| SSH hardening | `openssh-server` | `openssh-server` | `ssh_hardening.yml` (ssh_hardening role) | Key-only, no root password |
| Timezone | — | — | — | US servers: `America/New_York`; UK: `Europe/London`. Hetzner defaults to UTC. |
| CA bundle (Fedora) | `ca-certificates` | `ca-certificates` | — | Verify `/etc/pki/tls/certs/ca-bundle.crt` symlink exists — see [Fedora CA bundle fix](../../05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md) |
| Syslog (Fedora) | `rsyslog` | — (pre-installed) | — | Fedora 44 Hetzner images have journald only. Logwatch needs `/var/log/messages` + `/var/log/secure`. |

### After Migration
|
||||
|
||||
1. **Set the timezone** — `timedatectl set-timezone America/New_York` (US) or `Europe/London` (UK). Hetzner images default to UTC.
|
||||
2. **Set the system hostname** — Hetzner provisions the box as `<host>-hetzner`. Run `hostnamectl set-hostname <host>` and fix the loopback line: `sed -i "s/127.0.1.1.*/127.0.1.1 <host> <host>/" /etc/hosts`. Skip this and **Logwatch emails arrive titled `Logwatch for <host>-hetzner`** weeks later. Do it alongside the Tailscale node rename and Postfix `myhostname` — all three read from the provisioning label. See [Logwatch wrong hostname after migration](../../05-troubleshooting/logwatch-wrong-hostname-after-migration.md).
|
||||
3. **Verify CA bundle (Fedora)** — `ls /etc/pki/tls/certs/ca-bundle.crt`. If missing, Postfix TLS, curl, and dnf will all fail silently. See [Fedora CA bundle fix](../../05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md).
|
||||
4. **Run `harden.yml` against the new host** — catches most gaps in one pass
|
||||
5. **Send a test email** — `echo test | mail -s "test" marcus@majorshouse.com` — if this fails, nothing else can alert you
|
||||
6. **Verify crond is running** — `systemctl is-active crond` (Fedora) or `systemctl is-active cron` (Ubuntu). cronie can be `enabled` but not `active` after provisioning.
|
||||
7. **Check Netdata Cloud** — verify the new node appears and alerts are flowing
|
||||
8. **Compare fail2ban jails** — `fail2ban-client status` on both old and new
|
||||
9. **Verify logwatch sends** — `sudo logwatch --output mail --range today`
|
||||
10. **Keep the old box powered off but not destroyed** for at least 7 days after remediation
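
The hostname step in the checklist above is easy to forget, so a small guard can help. This is a minimal sketch that only assumes the `-hetzner` suffix described in step 2; the function name `has_provisioning_suffix` is hypothetical and not part of the repo.

```bash
# Sketch: succeed (exit 0) when a hostname still carries the "-hetzner"
# provisioning suffix mentioned in the checklist. Hypothetical helper name.
has_provisioning_suffix() {
  case "$1" in
    *-hetzner) return 0 ;;   # still the provisioning label
    *)         return 1 ;;   # already renamed (or never had the suffix)
  esac
}
```

On a live host it could be called as `has_provisioning_suffix "$(uname -n)" && echo "hostname still needs fixing"`, for example at the end of a post-migration script.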
|
||||
|
||||
### Using doctl to Manage Old Droplets
|
||||
|
||||
```bash
|
||||
# Authenticate (token from Ansible vault)
|
||||
cd ~/MajorAnsible
|
||||
ansible-vault view group_vars/all/vault.yml | grep vault_do_oauth_token | awk '{print $2}' | xargs doctl auth init --access-token
|
||||
|
||||
# List droplets
|
||||
doctl compute droplet list --format Name,ID,Status,PublicIPv4
|
||||
|
||||
# Power on for comparison
|
||||
doctl compute droplet-action power-on DROPLET_ID
|
||||
|
||||
# Power off when done
|
||||
doctl compute droplet-action power-off DROPLET_ID
|
||||
```
|
||||
|
||||
## Lesson Learned
|
||||
|
||||
Application migration is not server migration. The app can work perfectly while the monitoring, alerting, and email infrastructure is completely broken. Always compare the full package baseline between old and new boxes before calling a migration complete.
|
||||
|
|
@ -5,7 +5,7 @@ category: dns-networking
|
|||
tags: [tailscale, networking, infrastructure, dns, vpn]
|
||||
status: published
|
||||
created: 2026-04-02
|
||||
updated: 2026-04-02
|
||||
updated: 2026-05-19
|
||||
---
|
||||
# 🌐 Network Overview
|
||||
|
||||
|
|
@ -19,12 +19,13 @@ The **MajorsHouse** infrastructure is connected via a private **Tailscale** mesh
|
|||
|
||||
## 🌍 Geographic Nodes
|
||||
|
||||
| Host | Location | IP | OS |
|
||||
|---|---|---|---|
|
||||
| `dcaprod` | 🇺🇸 US | 100.104.11.146 | Ubuntu 24.04 |
|
||||
| `majortoot` | 🇺🇸 US | 100.110.197.17 | Ubuntu 24.04 |
|
||||
| `majorhome` | 🇺🇸 US | 100.120.209.106 | Fedora 43 |
|
||||
| `teelia` | 🇬🇧 UK | 100.120.32.69 | Ubuntu 24.04 |
|
||||
| Host | Location | IP | OS | Notes |
|
||||
|---|---|---|---|---|
|
||||
| `dcaprod` | 🇺🇸 US | 100.104.11.146 | Ubuntu 24.04 | DO droplet — live until ~2026-05-22 |
|
||||
| `dcaprod-hetzner` | 🇺🇸 US | 100.98.223.93 | Ubuntu 24.04 | Hetzner CPX21 — migration target; DNS cutover ~May 22 |
|
||||
| `majortoot` | 🇺🇸 US | 100.110.197.17 | Ubuntu 24.04 | |
|
||||
| `majorhome` | 🇺🇸 US | 100.120.209.106 | Fedora 43 | |
|
||||
| `teelia` | 🇬🇧 UK | 100.120.32.69 | Ubuntu 24.04 | |
|
||||
|
||||
## 🔗 Tailscale Setup
|
||||
|
||||
|
|
@ -35,4 +36,4 @@ Tailscale is configured as a persistent service on all nodes. Key features used
|
|||
- **ACLs:** Managed via the Tailscale admin console to restrict cross-group communication where necessary.
|
||||
|
||||
---
|
||||
*Last updated: 2026-03-04*
|
||||
*Last updated: 2026-05-19*
|
||||
|
|
|
|||
|
|
@ -7,7 +7,7 @@ tags:
|
|||
- asus
|
||||
- ssh
|
||||
created: 2026-04-19
|
||||
updated: 2026-04-29T22:45
|
||||
updated: 2026-04-30T05:21
|
||||
---
|
||||
|
||||
# Wake-on-LAN via Router SSH
|
||||
|
|
|
|||
|
|
@ -1,6 +1,6 @@
|
|||
---
|
||||
created: 2026-04-13T10:15
|
||||
updated: 2026-04-29T22:45
|
||||
updated: 2026-05-31
|
||||
---
|
||||
# 🏠 Self-Hosting & Homelab
|
||||
|
||||
|
|
@ -30,6 +30,19 @@ Guides for running your own services at home, including Docker, reverse proxies,
|
|||
- [Tuning Netdata Docker Health Alarms](monitoring/netdata-docker-health-alarm-tuning.md)
|
||||
- [Deploying Netdata to a New Server](monitoring/netdata-new-server-setup.md)
|
||||
|
||||
## Services
|
||||
|
||||
- [Mastodon Instance Tuning](services/mastodon-instance-tuning.md)
|
||||
- [Mastodon Post-Install Hardening (Permissions + Account)](services/mastodon-post-install-hardening.md)
|
||||
- [Mastodon DB Maintenance](services/mastodon-db-maintenance.md)
|
||||
- [Mastodon Federation](services/mastodon-federation.md)
|
||||
- [Mastodon `--prune-profiles` Trap](services/mastodon-prune-profiles-trap.md)
|
||||
- [Mastodon on S3 — Silent Upload Failures](services/mastodon-s3-acl-upload-failures.md)
|
||||
- [Mastodon — Triaging Crowdfunding / Mention-Spam Accounts](services/mastodon-mention-spam-crowdfunding.md)
|
||||
- [Ghost SMTP via Mailgun](services/ghost-smtp-mailgun-setup.md)
|
||||
- [Updating n8n Docker](services/updating-n8n-docker.md)
|
||||
- [Claude Code Remote Control](services/claude-code-remote-control.md)
|
||||
|
||||
## Security
|
||||
|
||||
- [Linux Server Hardening Checklist](security/linux-server-hardening-checklist.md)
|
||||
|
|
|
|||
296
02-selfhosting/monitoring/logwatch-fleet-setup.md
Normal file
|
|
@ -0,0 +1,296 @@
|
|||
---
|
||||
title: Logwatch Fleet Setup — Surviving Package Upgrades
|
||||
description: Configure logwatch on mixed Debian/Fedora fleets so settings survive package upgrades
|
||||
tags:
|
||||
- logwatch
|
||||
- monitoring
|
||||
- ansible
|
||||
- fedora
|
||||
- ubuntu
|
||||
status: published
|
||||
created: 2026-05-09
|
||||
updated: 2026-05-13T10:35
|
||||
---
|
||||
|
||||
# Logwatch Fleet Setup — Surviving Package Upgrades
|
||||
|
||||
Logwatch ships with a defaults file at `/usr/share/logwatch/default.conf/logwatch.conf`. On Fedora, package upgrades **silently reset** this file — wiping any customizations. The fix is to put all settings in the **local override file** at `/etc/logwatch/conf/logwatch.conf`, which is never touched by package managers.
|
||||
|
||||
## The Problem
|
||||
|
||||
Fedora 44's logwatch 7.14-1 upgrade (April 2026) reset `Output` from `mail` back to `stdout` in the defaults file. Servers that had been emailing daily reports for months went silent with zero errors. `rpm -V logwatch` shows the defaults file was modified (`S.5....T.`), but there's no warning during upgrade.
|
||||
|
||||
Ubuntu is less affected because its `/etc/cron.daily/00logwatch` script passes `--output mail` explicitly, overriding the config. Fedora's cron script does not.
|
||||
|
||||
## The Fix
|
||||
|
||||
Write all settings to the **override file** (`/etc/logwatch/conf/logwatch.conf`):
|
||||
|
||||
```ini
|
||||
# Managed by Ansible — do not edit manually.
|
||||
# Local overrides — survives package upgrades.
|
||||
Output = mail
|
||||
MailTo = marcus@majorshouse.com
|
||||
MailFrom = Logwatch@hostname.majorshouse.com
|
||||
Detail = Low
|
||||
```
|
||||
|
||||
Key settings:
|
||||
|
||||
| Setting | Value | Why |
|
||||
|---------|-------|-----|
|
||||
| `Output` | `mail` | Must be `mail`, not `stdout`. Fedora's cron script doesn't pass `--output mail` like Ubuntu's does. |
|
||||
| `MailTo` | recipient address | Where reports go. |
|
||||
| `MailFrom` | per-host sender | Makes it easy to identify which server sent the report. |
|
||||
| `Detail` | `Low` | Keeps emails scannable. Raise to `Med` or `High` for debugging. |
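
To confirm that the override file really carries `Output = mail`, a small parser is one option. The sketch below is an assumption-laden helper (the name `logwatch_output_setting` is hypothetical): it ignores commented lines, accepts either capitalization of the key as a convenience, and prints the value of the last `Output` assignment it finds.

```bash
# Sketch: print the value of the last uncommented "Output = ..." line
# in a logwatch config file. Hypothetical helper, not shipped by logwatch.
logwatch_output_setting() {
  [ -r "$1" ] || return 1   # file missing or unreadable
  sed -n 's/^[[:space:]]*[Oo]utput[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*$/\1/p' "$1" \
    | tail -n 1
}
```

A typical call would be `logwatch_output_setting /etc/logwatch/conf/logwatch.conf`; anything other than `mail` (including empty output) suggests the override is not doing its job.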
|
||||
|
||||
## Ansible Playbook
|
||||
|
||||
The `logwatch.yml` playbook handles both OS families:
|
||||
|
||||
```yaml
- name: Install and configure logwatch
  hosts: all
  become: true
  gather_facts: true
  tasks:
    - name: Install logwatch (Debian/Ubuntu)
      ansible.builtin.apt:
        name: logwatch
        state: present
      when: ansible_facts['os_family'] == "Debian"

    - name: Install logwatch (Fedora)
      ansible.builtin.dnf:
        name: logwatch
        state: present
      when: ansible_facts['os_family'] == "RedHat"

    - name: Ensure logwatch override directory exists
      ansible.builtin.file:
        path: /etc/logwatch/conf
        state: directory
        mode: '0755'

    - name: Configure logwatch override (survives package upgrades)
      ansible.builtin.copy:
        dest: /etc/logwatch/conf/logwatch.conf
        mode: '0644'
        content: |
          # Managed by Ansible — do not edit manually.
          Output = mail
          MailTo = {{ logwatch_email }}
          MailFrom = Logwatch@{{ inventory_hostname }}.majorshouse.com
          Detail = Low
```
|
||||
|
||||
Include it in `harden.yml` so every new server gets logwatch as part of the baseline.
|
||||
|
||||
## Verifying
|
||||
|
||||
After deploying, test immediately:
|
||||
|
||||
```bash
|
||||
# Verify crond is actually running — cronie can be "enabled" but not "active"
|
||||
systemctl is-active crond # Fedora
|
||||
systemctl is-active cron # Ubuntu
|
||||
|
||||
# If inactive, start it
|
||||
sudo systemctl start crond
|
||||
|
||||
# Then test logwatch manually
|
||||
sudo logwatch --output mail --range today
|
||||
```
|
||||
|
||||
Check that the email arrives. If it doesn't, verify:
|
||||
|
||||
1. **crond is running** — if `inactive`, cron.daily never fires and logwatch never runs. No errors anywhere.
|
||||
2. **Postfix is installed and relaying** — logwatch depends on a working local MTA.
|
||||
3. **CA bundle exists (Fedora)** — missing `/etc/pki/tls/certs/ca-bundle.crt` breaks Postfix TLS relay. See [Fedora CA bundle fix](../../05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md).
|
||||
|
||||
## Diagnosing Silent Failures
|
||||
|
||||
```bash
|
||||
# Check if the defaults file was modified by a package upgrade
|
||||
rpm -V logwatch # Fedora
|
||||
dpkg -V logwatch # Debian
|
||||
|
||||
# In rpm output, S.5....T. on the defaults file means it no longer matches the packaged copy
# S = size differs, 5 = digest (checksum) differs, T = mtime differs
|
||||
|
||||
# Check if logwatch produces any output at all
|
||||
logwatch --output stdout --range yesterday | wc -l
|
||||
# If 0 lines — logwatch has no log data to report (see rsyslog section below)
|
||||
```
|
||||
|
||||
## Fedora: rsyslog Missing — Logwatch Produces Zero Output
|
||||
|
||||
Fedora 44 cloud images (Hetzner, possibly others) ship with **journald only** — no rsyslog. This means `/var/log/messages`, `/var/log/secure`, and `/var/log/cron` do not exist. Logwatch scans those files, finds nothing, produces empty output, and sends no email. Exit code is still 0 — no error anywhere.
|
||||
|
||||
This is particularly insidious because everything else can be correct (crond running, postfix relaying, logwatch config pointing to the right recipient) and you'll still get silence.
|
||||
|
||||
```bash
|
||||
# Diagnose
|
||||
rpm -q rsyslog # "package rsyslog is not installed"
|
||||
ls /var/log/messages # "No such file or directory"
|
||||
|
||||
# Fix
|
||||
dnf install -y rsyslog
|
||||
systemctl enable --now rsyslog
|
||||
|
||||
# Verify log files appear
|
||||
ls /var/log/messages /var/log/secure /var/log/cron
|
||||
|
||||
# Test logwatch
|
||||
logwatch --output stdout --range today | wc -l # should be >0
|
||||
```
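
The same check can be scripted so it reports exactly which files are absent. This is a sketch under the assumption that the three files named above are the ones that matter; the function name `missing_syslog_files` and the optional root-directory argument (useful for testing against a scratch directory) are hypothetical.

```bash
# Sketch: list the classic syslog files that do not exist.
# $1 is an optional root prefix (defaults to the real filesystem root).
missing_syslog_files() {
  for f in var/log/messages var/log/secure var/log/cron; do
    [ -e "${1:-}/$f" ] || printf '%s\n' "/$f"
  done
}
```

Empty output means all three files exist; any printed path points at the rsyslog gap described in this section.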
|
||||
|
||||
## Fedora CA Bundle Missing — Postfix TLS Engine Unavailable
|
||||
|
||||
If the Fedora half of your fleet is silent but the Debian/Ubuntu half is fine, and your relayhost requires TLS, suspect a missing CA bundle. Symptom on the sending host:
|
||||
|
||||
```
|
||||
postfix/error: status=deferred (delivery temporarily suspended:
|
||||
TLS is required, but our TLS engine is unavailable)
|
||||
```
|
||||
|
||||
The tell that this is the CA bundle and not a postfix-internal problem: **dnf and curl are also broken on the box.** Run any `sudo dnf list` / `sudo curl https://...` and look for:
|
||||
|
||||
```
|
||||
Curl error (77): Problem with the SSL CA cert (path? access rights?)
|
||||
[error adding trust anchors from file: /etc/pki/tls/certs/ca-bundle.crt]
|
||||
```
|
||||
|
||||
That's the same path Fedora's packaged Postfix configuration sets for `smtp_tls_CAfile` (upstream Postfix leaves the parameter empty). Every TLS client on the box is failing because a single symlink is missing.
|
||||
|
||||
### Diagnosis
|
||||
|
||||
```bash
|
||||
# Is the consumer-path symlink there?
|
||||
ls -la /etc/pki/tls/certs/ca-bundle.crt
|
||||
# Expected: lrwxrwxrwx ... -> /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
|
||||
|
||||
# Is the extracted bundle itself intact?
|
||||
ls -la /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
|
||||
sudo grep -c 'BEGIN CERTIFICATE' /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
|
||||
# Expected: ~140-150 certs, ~220 KB
|
||||
```
|
||||
|
||||
If the extracted bundle exists but the consumer-path symlink is gone, you've found it. `update-ca-trust extract` regenerates the `extracted/` paths but does **not** recreate the upstream-style symlink at `/etc/pki/tls/certs/ca-bundle.crt` — that symlink is shipped by the `ca-certificates` package and can be lost during a partial upgrade or a stray `rm`.
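
The three possible states of the consumer path (present, dangling, or gone) can be told apart with plain `test` operators. A minimal sketch, with a hypothetical function name `ca_bundle_state`; it makes no claim about how the link got lost, only about what is on disk now.

```bash
# Sketch: classify the state of a CA bundle path.
#   ok       - path exists (symlink resolves, or it is a regular file)
#   dangling - symlink exists but its target is gone
#   missing  - nothing at the path at all
ca_bundle_state() {
  if [ -L "$1" ] && [ ! -e "$1" ]; then
    echo dangling
  elif [ -e "$1" ]; then
    echo ok
  else
    echo missing
  fi
}
```

For the case described here, `ca_bundle_state /etc/pki/tls/certs/ca-bundle.crt` would be expected to print `missing`, while the same call on the extracted `tls-ca-bundle.pem` prints `ok`.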
|
||||
|
||||
### Fix
|
||||
|
||||
```bash
|
||||
sudo ln -sfn /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem \
|
||||
/etc/pki/tls/certs/ca-bundle.crt
|
||||
sudo systemctl reload postfix
|
||||
sudo postqueue -f # drain deferred mail
|
||||
```
|
||||
|
||||
Verify with `sudo grep -c 'BEGIN CERTIFICATE' /etc/pki/tls/certs/ca-bundle.crt` (should match the extracted bundle's count) and `sudo dnf list --installed postfix` (should no longer show the curl error).
|
||||
|
||||
### Audit the rest of the Fedora fleet
|
||||
|
||||
Once you find one host with this issue, check the others — package events that broke one box may have broken its siblings:
|
||||
|
||||
```bash
|
||||
for host in fedora-host-1 fedora-host-2; do  # placeholder names: list your Fedora hosts here
|
||||
  echo "$host: $(ssh "$host" 'ls /etc/pki/tls/certs/ca-bundle.crt 2>&1' | tail -1)"
|
||||
done
|
||||
```
|
||||
|
||||
Hosts returning "No such file or directory" are silently broken. They won't fail loudly until something asks them to do TLS — which on a small homelab might be never until logwatch tries to mail you weeks later.
|
||||
|
||||
### Methodology note: postfix logs differ between distros
|
||||
|
||||
Don't trust a single log source when surveying a mixed fleet. **Fedora and majormail log postfix to journald** (`journalctl -u postfix`); **Debian/Ubuntu log to `/var/log/mail.log`** (and rotated `mail.log.1` / `mail.log.*.gz`). Querying journalctl on Ubuntu returns "no entries" even when mail is flowing — easy way to declare a working host broken. Always run `tail /var/log/mail.log` on Debian-family hosts and `journalctl -u postfix` on Fedora-family hosts.
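
One way to avoid querying the wrong log source is to pick the command from `/etc/os-release`. The sketch below is an assumption, not fleet tooling: the function name `postfix_log_command` is hypothetical, it treats anything that is not Debian-like as journald-based, and it cannot know about per-host exceptions such as majormail, which would still need handling by name.

```bash
# Sketch: print the log command to use for postfix on this distro family.
# $1 is an optional os-release path (defaults to /etc/os-release).
postfix_log_command() {
  # Collect the values of ID= and ID_LIKE=, stripping any quotes.
  ids="$(sed -n 's/^ID\(_LIKE\)\{0,1\}=//p' "${1:-/etc/os-release}" | tr -d '"')"
  # Unquoted expansion on purpose: it flattens the values onto one line.
  case " $(echo $ids) " in
    *" debian "*|*" ubuntu "*) echo "tail -n 50 /var/log/mail.log" ;;
    *)                         echo "journalctl -u postfix -n 50" ;;
  esac
}
```

The printed command is meant to be read or passed to `ssh "$host" ...` by the operator; the helper itself never touches the logs.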
|
||||
|
||||
## Bounce-source addresses must be real mailboxes
|
||||
|
||||
A subtle related class of bug: services like Watchtower, fail2ban, cron, and Netdata default to sending notifications **from** an identity that doesn't exist as a recipient — `watchtower@majorshouse.com`, `fail2ban@<host>.majorshouse.com`, `root@<host>.localdomain`. While the relayhost is healthy, nobody notices. The moment any delivery fails (network blip, recipient typo, queue overflow, the CA bundle bug above), the local MTA tries to bounce the original message back to that sender — finds no mailbox — and the bounce itself bounces. You get MAILER-DAEMON queue churn and `5.7.1 Relay access denied` rejections in your mail server logs.
|
||||
|
||||
Fix it once at the source: set `WATCHTOWER_NOTIFICATION_EMAIL_FROM`, fail2ban's `sender =`, and similar to a **real mailbox** on your mail server (e.g., `marcus@majorshouse.com`). Bounces then land somewhere a human can read them, and the noise disappears.
|
||||
|
||||
## Per-host config drift on cloud-image-derived servers
|
||||
|
||||
When fleet hosts are spun up from images (DigitalOcean droplet snapshots, Packer artifacts, cloud-init templates), three specific config drift patterns silently break notification mail. Each one looks fine in isolation; the combination produces "mail leaves the host with `250 OK queued` and disappears."
|
||||
|
||||
### 1. Packer/snapshot-leftover `myhostname` in postfix
|
||||
|
||||
A host built from a Packer-baked image often has `postfix myhostname = packer-<uuid>` baked into `main.cf` from the build process. The system hostname might have been correctly set by terraform/cloud-init at first boot, but postfix's `myhostname` was hardcoded during image build and was never overridden. Result: every outbound message-id and EHLO carries the Packer artifact name (e.g., `<20260509120011.7EB6ABD83C@packer-641079bc-bc17-b5e1-1425-be745d012d0b>`), no SPF/DKIM matches that name, and remote spam filters score it as suspicious.
|
||||
|
||||
**Detect:**
|
||||
|
||||
```bash
|
||||
postconf myhostname | grep -E 'packer-|builder-|<image-build-prefix>'
|
||||
```
|
||||
|
||||
**Fix:**
|
||||
|
||||
```bash
|
||||
hostnamectl set-hostname <real-fqdn>
|
||||
postconf -e 'myhostname = <real-fqdn>'
|
||||
sed -i '/^127\.0\.1\.1/d' /etc/hosts && \
|
||||
echo "127.0.1.1 <real-fqdn> <short-name>" >> /etc/hosts
|
||||
systemctl reload postfix
|
||||
```
|
||||
|
||||
> [!tip] Same drift, different symptom: the Logwatch **title**
|
||||
> Hetzner provisions boxes with `<host>-hetzner` as the *system* hostname. When that's never corrected, Logwatch (which reads the live hostname at runtime) mails reports titled `Logwatch for <host>-hetzner` — no postfix involvement needed. Same `hostnamectl set-hostname` + `/etc/hosts` fix as above. See [Logwatch wrong hostname after migration](../../05-troubleshooting/logwatch-wrong-hostname-after-migration.md).
|
||||
|
||||
### 2. Empty `relayhost` quietly forces public-MX delivery
|
||||
|
||||
If `postconf relayhost` returns an empty value, postfix doesn't fail — it just does an MX lookup for the destination domain and tries to deliver directly. For mail to your own mail server, that means going via the **public MX** (the domain's external MX record, e.g., `mail.majorshouse.com → 203.0.113.10:25`) instead of the **internal/Tailscale relay path** the rest of the fleet uses.
|
||||
|
||||
The public-MX path is subject to whatever spam filtering, content checks, and trust rules the receiving MX has configured for external traffic. Internal Tailscale-IP traffic typically gets a faster trust shortcut (e.g., bypass spamchk pipe). So this single configuration drift causes one host's mail to land in a different code path than its siblings — and then silently get filtered.
|
||||
|
||||
**Detect:** look for fleet hosts where `postconf relayhost` returns blank and compare to known-good siblings.
|
||||
|
||||
**Fix:** set `relayhost = [<mailserver-tailscale-ip>]:587` (or whatever port your fleet convention uses).
|
||||
|
||||
### 3. Stale SASL passwd map referencing a missing file
|
||||
|
||||
Postfix configurations migrated from a previous setup often retain `smtp_sasl_auth_enable = yes` and `smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd` even when no SASL is needed for the current relay path. If the actual `sasl_passwd` file isn't there (because the migration didn't carry it, or the new relay doesn't require auth), every send attempt produces:
|
||||
|
||||
```
|
||||
error: open database /etc/postfix/sasl_passwd.db: No such file or directory
|
||||
warning: smtp_sasl_password_maps lookup error
|
||||
status=deferred (local data error while talking to <relay>)
|
||||
```
|
||||
|
||||
Especially common after migrating from external SMTP (SendGrid, Mailgun, etc., which use SASL) to an internal Tailscale relay (which doesn't).
|
||||
|
||||
**Detect:**
|
||||
|
||||
```bash
|
||||
postconf -n | grep -E 'smtp_sasl_(auth_enable|password_maps)'
|
||||
[ -f /etc/postfix/sasl_passwd ] || echo "sasl_passwd file missing"
|
||||
```
|
||||
|
||||
**Fix — disable SASL if the new relay doesn't need it:**
|
||||
|
||||
```bash
|
||||
postconf -e 'smtp_sasl_auth_enable = no'
|
||||
postconf -e 'smtp_tls_wrappermode = no' # if switching from port 465 to 587
|
||||
postconf -X 'smtp_sasl_password_maps'
|
||||
systemctl reload postfix
|
||||
```
|
||||
|
||||
### Audit shortcut
|
||||
|
||||
For a quick per-host comparison across the fleet:
|
||||
|
||||
```bash
|
||||
for host in host-1 host-2; do  # placeholder names: list your fleet hosts here
|
||||
echo "=== $host ==="
|
||||
ssh "$host" 'postconf myhostname relayhost smtp_sasl_auth_enable 2>&1' | head -3
|
||||
done
|
||||
```
|
||||
|
||||
Anomalies (Packer hostnames, blank relayhost, SASL enabled where siblings have it disabled) jump out immediately.
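
If eyeballing the output gets tedious, the three drift patterns can be flagged mechanically. This is a rough sketch with a hypothetical function name (`flag_postfix_drift`); it reads `postconf`-style `name = value` lines on stdin and only knows the `packer-` prefix, so other image-build prefixes would need adding. Enabled SASL is reported for review, since it is only an anomaly where siblings have it disabled.

```bash
# Sketch: read "postconf myhostname relayhost smtp_sasl_auth_enable"
# output on stdin and print one line per suspicious setting.
flag_postfix_drift() {
  while IFS= read -r line; do
    case "$line" in
      "myhostname = packer-"*)       echo "packer hostname: $line" ;;
      "relayhost ="|"relayhost = ")  echo "blank relayhost" ;;
      "smtp_sasl_auth_enable = yes") echo "SASL enabled (review)" ;;
    esac
  done
}
```

Inside the audit loop this could be used as `ssh "$host" 'postconf myhostname relayhost smtp_sasl_auth_enable' | flag_postfix_drift`; a host that prints nothing matches none of the three patterns.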
|
||||
|
||||
## Lesson Learned
|
||||
|
||||
Never customize `/usr/share/logwatch/default.conf/logwatch.conf`. Always use `/etc/logwatch/conf/logwatch.conf`. This applies to any software that has a "defaults" file and an "override" file — the override survives upgrades, the defaults file does not.
|
||||
|
||||
A second, broader lesson from the 2026-05-10 fleet outage: **silent fleet-wide email gaps are usually a stack of unrelated failures, not one cause.** That morning's investigation surfaced a missing CA bundle on two Fedora hosts, a postfix relayhost using a name that postfix's resolver couldn't handle, two services with non-mailbox sender addresses generating bounce churn, and a mistaken syslog-vs-journald assumption that made working hosts look silent. Each was minor in isolation. Together they made all seven hosts look broken when in fact only two were. Triage from ground truth (what actually arrived in the destination mailbox) before assuming what's broken at the source.
|
||||
|
|
@ -1,11 +1,17 @@
|
|||
---
|
||||
title: "Tuning Netdata Docker Health Alarms to Prevent Update Flapping"
|
||||
title: Tuning Netdata Docker Health Alarms to Prevent Update Flapping
|
||||
domain: selfhosting
|
||||
category: monitoring
|
||||
tags: [netdata, docker, nextcloud, alarms, health, monitoring]
|
||||
tags:
|
||||
- netdata
|
||||
- docker
|
||||
- nextcloud
|
||||
- alarms
|
||||
- health
|
||||
- monitoring
|
||||
status: published
|
||||
created: 2026-03-18
|
||||
updated: 2026-03-28
|
||||
updated: 2026-05-02T11:04
|
||||
---
|
||||
|
||||
# Tuning Netdata Docker Health Alarms to Prevent Update Flapping
|
||||
|
|
@ -61,9 +67,9 @@ chart labels: container_name=!nextcloud-aio-nextcloud *
|
|||
|
||||
### Dedicated Nextcloud AIO Alarm
|
||||
|
||||
Added 2026-03-23, updated 2026-03-28. The `nextcloud-aio-nextcloud` container needs a more lenient window than other containers. Its healthcheck (`/healthcheck.sh`) verifies PostgreSQL connectivity (port 5432) and PHP-FPM (port 9000). PHP-FPM takes ~90 seconds to warm up after a normal restart — but during nightly AIO update cycles, the full startup (occ upgrade, app updates, migrations) can take 5+ minutes. On 2026-03-27, a startup hung and left the container unhealthy for 20 hours until the next nightly cycle replaced it.
|
||||
Added 2026-03-23, updated 2026-05-02. The `nextcloud-aio-nextcloud` container needs a more lenient window than other containers. Its healthcheck (`/healthcheck.sh`) verifies PostgreSQL connectivity (port 5432) and PHP-FPM (port 9000). PHP-FPM takes ~90 seconds to warm up after a normal restart — but during nightly AIO update cycles, the full startup (occ upgrade, app updates, migrations) can take 5+ minutes. On 2026-03-27, a startup hung and left the container unhealthy for 20 hours until the next nightly cycle replaced it.
|
||||
|
||||
The dedicated alarm uses a 10-minute lookup window and 10-minute delay to absorb normal startup, while still catching sustained failures:
|
||||
The dedicated alarm uses a 30-minute lookup window and 10-minute delay to absorb normal startup and update cycles (~40 minutes total grace), while still catching sustained failures:
|
||||
|
||||
```ini
|
||||
# Dedicated alarm for nextcloud-aio-nextcloud — lenient window to absorb nightly update cycle
|
||||
|
|
@ -76,15 +82,23 @@ template: docker_nextcloud_unhealthy
|
|||
component: Docker
|
||||
units: status
|
||||
every: 30s
|
||||
lookup: average -10m of unhealthy
|
||||
lookup: average -30m of unhealthy
|
||||
chart labels: container_name=nextcloud-aio-nextcloud
|
||||
warn: $this > 0
|
||||
warn: $this >= 1
|
||||
delay: up 10m down 5m multiplier 1.5 max 30m
|
||||
summary: Nextcloud container health sustained
|
||||
info: nextcloud-aio-nextcloud has been unhealthy for a sustained period — not a transient update blip
|
||||
info: nextcloud-aio-nextcloud has been continuously unhealthy for 30+ minutes — not a transient update blip
|
||||
to: sysadmin
|
||||
```
|
||||
|
||||
**Tuning history:**
|
||||
|
||||
| Date | Lookup | Delay | Trigger | Notes |
|
||||
|---|---|---|---|---|
|
||||
| 2026-03-23 | 35m | 35m | Initial split from general alarm | Absorbed PHP-FPM warm-up |
|
||||
| 2026-04-29 | 15m | 5m | Backup blip (~6m) never triggered | Tightened after stability |
|
||||
| 2026-05-02 | 30m | 10m | 15m still too aggressive for update cycles | ~40m total grace; catches real outages |
|
||||
|
||||
## Watchdog Cron: Auto-Restart on Sustained Unhealthy
|
||||
|
||||
If the Nextcloud container stays unhealthy for more than 1 hour (well past any normal startup window), a cron watchdog on majorlab auto-restarts it and logs the event. This was added 2026-03-28 after an incident where the container sat unhealthy for 20 hours until the next nightly backup cycle replaced it.
|
||||
|
|
|
|||
130
02-selfhosting/security/ansible-flat-playbooks-to-roles.md
Normal file
|
|
@ -0,0 +1,130 @@
|
|||
---
|
||||
title: "Migrating Flat Ansible Playbooks to Roles (Safely)"
|
||||
domain: selfhosting
|
||||
category: security
|
||||
tags: [ansible, roles, refactor, fleet, migration, fail2ban, infrastructure]
|
||||
status: published
|
||||
created: 2026-06-18
|
||||
updated: 2026-06-18
|
||||
---
|
||||
# Migrating Flat Ansible Playbooks to Roles (Safely)
|
||||
|
||||
## Overview
|
||||
|
||||
A fleet repo tends to grow a sprawl of flat `configure_*.yml` playbooks — one per subsystem, plus near-duplicates for variants (e.g. ~10 `configure_fail2ban_*` playbooks), all sharing a single overloaded top-level `templates/` directory. It works, but it resists reuse: there is no clean `defaults/` precedence, no encapsulation, and no way to compose a host's full configuration in one place.
|
||||
|
||||
Ansible **roles** fix this — but migrating a *live* fleet is where it gets dangerous. The risk is not the refactor itself; it's accidentally changing deployed behaviour while you "just reorganize." This article covers the incremental, regression-free approach used to migrate an 11-host fleet, including the two techniques that keep it safe: **byte-identical migration** and **capture-based reconciliation**.
|
||||
|
||||
> This is a process/pattern article. For the specific roles in this fleet, see the internal runbook. The techniques here generalize to any flat-playbook → role migration.
|
||||
|
||||
## Decide What Becomes a Role vs. What Stays a Playbook
|
||||
|
||||
Not everything should be a role. Draw the line by purpose:
|
||||
|
||||
| Becomes a role | Stays a playbook |
|
||||
|---|---|
|
||||
| Reusable host **configuration** (a subsystem you converge to a desired state) | **Ops / one-off** actions: `update`, `reboot`, `harden`, `bootstrap`, `provision`, `fix_*`, `verify_*` |
|
||||
| Has templates/files, defaults, handlers | Orchestrators that just `import_playbook` other things |
|
||||
| Applied repeatedly and idempotently | Run-once or run-as-needed remediation |
|
||||
|
||||
Roles get the standard `roles/<name>/` layout (`tasks/`, `defaults/`, `handlers/`, `templates/`, `files/`, `meta/`). Name them after the **subsystem noun** (`fail2ban`, `clamav`, `firewall`) — drop the `configure_` verb prefix.
|
||||
|
||||
## The Incremental Loop (one role per branch)
|
||||
|
||||
Migrate **one subsystem per branch** and validate before merging. This keeps every change small enough to diff by eye and roll back cleanly:
|
||||
|
||||
1. `git mv` the templates/files into `roles/<name>/` so **git tracks them as renames** (history preserved, 100% rename score).
|
||||
2. Move task bodies into `roles/<name>/tasks/` (split by lifecycle: install → service → config → verify).
|
||||
3. Lift tunables into `roles/<name>/defaults/main.yml`; keep per-host overrides in `group_vars`/`host_vars`.
|
||||
4. Add a thin entry playbook `<name>.yml` (`hosts: <group>` + `roles: [<name>]`).
|
||||
5. Validate with `--check --diff` against a single host **before** merging.
|
||||
6. Merge, then move to the next subsystem.
|
||||
|
||||
## Technique 1: Byte-Identical Migration
|
||||
|
||||
When the goal is "reorganize without changing behaviour," **prove** it. After moving a playbook into a role, the rendered task bodies should be identical to the original. Verify with a normalized diff against `main`:
|
||||
|
||||
```bash
|
||||
# Compare the role's task body against the original flat playbook,
|
||||
# ignoring only comments/whitespace you intend to change.
|
||||
git show main:configure_clamav.yml > /tmp/old.yml
|
||||
# ...extract the task list from roles/clamav/tasks/*.yml and diff
|
||||
diff <(yq '.[] | .tasks' /tmp/old.yml) <(cat roles/clamav/tasks/*.yml)
|
||||
```
|
||||
|
||||
The acceptance bar: `--check --diff` against a real host returns **`changed=0`** (or only the diffs you explicitly intended, like a doc-comment line). If a "faithful" migration shows unexpected `changed=N`, you altered behaviour — stop and reconcile before merging. Templates moved via `git mv` show as **100% renames** in `git show --stat`, which is your proof the deployed content is unchanged.
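
The `changed=0` acceptance bar can be turned into an exit code so a branch cannot be merged by accident. A minimal sketch, assuming the usual `PLAY RECAP` line shape (`host : ok=N changed=N unreachable=N failed=N ...`); the function name `recap_is_clean` is hypothetical, and intended doc-comment diffs would still need a manual exception.

```bash
# Sketch: read ansible-playbook output on stdin and succeed only if at
# least one recap line was seen and every one shows changed=0 and failed=0.
recap_is_clean() {
  awk '
    /changed=[0-9]+/ && /failed=[0-9]+/ {
      seen = 1
      if ($0 !~ /changed=0([^0-9]|$)/ || $0 !~ /failed=0([^0-9]|$)/) bad = 1
    }
    END { exit (seen && !bad) ? 0 : 1 }
  '
}
```

Usage would look like `ansible-playbook clamav.yml --check --diff --limit <host> | recap_is_clean && echo "faithful migration"`. Output with no recap line at all counts as a failure, so a playbook that aborted early is not mistaken for a clean run.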
|
||||
|
||||
## Technique 2: Consolidating Near-Duplicates with Feature Flags
|
||||
|
||||
The big win is collapsing a family of near-duplicate playbooks (the ~10 `configure_fail2ban_*`) into **one role with flag-gated task files**:
|
||||
|
||||
```yaml
|
||||
# group_vars/<group>.yml — hosts self-select which jails/components they get
|
||||
fail2ban_jail_sshd: true
|
||||
fail2ban_jail_wordpress: true
|
||||
fail2ban_jail_nginx_bad_request: false
|
||||
```
|
||||
|
||||
```yaml
|
||||
# roles/fail2ban/tasks/main.yml
|
||||
- import_tasks: jail_wordpress.yml
  when: fail2ban_jail_wordpress | default(false)
|
||||
```
|
||||
|
||||
> **Critical gotcha — key flags to inventory GROUPS, not `ansible_os_family`.** It is tempting to gate OS-specific task files on `ansible_os_family == 'Debian'`. Don't. Inventory groups frequently include hosts the *original playbooks deliberately excluded* (e.g. a LAN-only Debian box that should get the network-wait step but **not** the public SSH bind, or a WSL host in the `fedora` group that must be skipped). Keep the original curated host patterns and set the flag per play/group. Keying on `os_family` silently widens a play's host set and is exactly how a "refactor" pushes config to a host that never had it.
|
||||
|
||||
## Technique 3: Capture-Based Reconciliation (the safety net)
|
||||
|
||||
This is the one that prevents an outage. Sometimes a role gets written as a **fresh re-implementation** of a subsystem rather than a faithful move — a cleaner `jail.local`, new drop-ins, a different default set. It may even be merged into `site.yml`. The trap: that role has **never been rolled out**, and its config *diverges* from what's actually deployed.
|
||||
|
||||
Running it would push divergent config to a live, security-sensitive subsystem (intrusion protection, firewall) across the whole fleet on the next `harden.yml`.
|
||||
|
||||
The check that catches it:
|
||||
|
||||
```bash
|
||||
ansible-playbook fail2ban.yml --check --diff --limit <host>
|
||||
# Divergent role => changed=8-12 per host + failures (missing filters/timers)
|
||||
# Faithful role => changed=0, failed=0
|
||||
```
|
||||
|
||||
**Capture-based reconciliation** is the fix: instead of pushing the role's idea of "correct," bring the **role into parity with the live, working config** first. Capture what's actually deployed, fold it into the role's templates/defaults until `--check` is clean fleet-wide, *then* switch the orchestrator over and retire the old playbooks. Order of operations:
|
||||
|
||||
1. **Decide the source of truth**: the live config or the new role. For security subsystems, the live (working) config wins.
2. **Reconcile** the role to match live until `--check` shows `changed=0, failed=0` on every host.
3. **Roll out host-by-host** with real runs; verify the service restarts cleanly and (for fail2ban) jails are actually active.
4. **Only then** delete the old playbooks, rewire `harden.yml`/`bootstrap.yml`, and remove the orphaned top-level templates.
|
||||
|
||||
Never delete the old mechanism until the new one is proven converged everywhere. "It's in `site.yml`" is not the same as "it's been rolled out."
|
||||
|
||||
## Composition: `site.yml`, `harden.yml`, `bootstrap.yml`
|
||||
|
||||
Once subsystems are roles, compose them with thin orchestrators that `import_playbook` the role entry points — so each subsystem keeps a **single source of truth** for its host mapping:
|
||||
|
||||
```yaml
# site.yml: day-to-day fleet convergence, in dependency order
- import_playbook: swap.yml
- import_playbook: tailscale.yml
- import_playbook: ssh_hardening.yml
- import_playbook: firewall.yml
- import_playbook: fail2ban.yml
- import_playbook: clamav.yml
```
|
||||
|
||||
Order matters: base layer (swap) → networking (tailscale) → access (ssh_hardening) → perimeter (firewall) → intrusion protection (fail2ban) → malware scanning (clamav). Bootstrap-only roles (guest agent, root password, provisioning prerequisites) belong in `bootstrap.yml`, not `site.yml`.
|
||||
|
||||
## Verification Checklist
|
||||
|
||||
- [ ] Templates moved with `git mv` (show as 100% renames)
- [ ] `--check --diff` on a real host = `changed=0` (or only intended diffs)
- [ ] Consolidation flags keyed to **inventory groups**, not `ansible_os_family`
- [ ] Re-implemented roles reconciled to live parity **before** rollout (no surprise `changed=N`)
- [ ] Security subsystems rolled out host-by-host with service-active verification
- [ ] Old playbooks/templates deleted **only after** the role is converged fleet-wide
- [ ] Orchestrators (`site.yml`/`harden.yml`/`bootstrap.yml`) rewired; stale references swept
|
||||
|
||||
## Related
|
||||
|
||||
- [SSH Hardening Fleet-Wide with Ansible](ssh-hardening-ansible-fleet.md)
|
||||
- [ClamAV Fleet Deployment with Ansible](clamav-fleet-deployment.md)
|
||||
- [Firewall Hardening with firewalld on Fedora Fleet](firewalld-fleet-hardening.md)
|
||||
- [Standardizing unattended-upgrades with Ansible](ansible-unattended-upgrades-fleet.md)
|
||||
|
|
@ -11,7 +11,7 @@ tags:
|
|||
- cron
|
||||
status: published
|
||||
created: 2026-04-18
|
||||
updated: 2026-04-18T11:13
|
||||
updated: 2026-05-15T03:00
|
||||
---
|
||||
# ClamAV Fleet Deployment with Ansible
|
||||
|
||||
|
|
@ -31,6 +31,10 @@ ClamAV is the standard open-source antivirus for Linux servers. For internet-fac
|
|||
|
||||
## Ansible Playbook
|
||||
|
||||
> On the MajorsHouse fleet this is packaged as the **`clamav` role** (`roles/clamav/`,
|
||||
> tasks split install → service → scan → verify) and run via `clamav.yml` or `site.yml`.
|
||||
> The standalone playbook below is the illustrative equivalent.
|
||||
|
||||
```yaml
|
||||
- name: Deploy ClamAV to internet-facing hosts
|
||||
hosts: internet_facing # dca, majorlinux, teelia, tttpod, majortoot, majormail
|
||||
|
|
@ -147,6 +151,120 @@ clamscan /tmp/eicar-test.txt
|
|||
rm /tmp/eicar-test.txt
|
||||
```
|
||||
|
||||
## DigitalOcean Monitoring Caveat (1 vCPU droplets)
|
||||
|
||||
`nice -n 19 ionice -c 3` plus `MemoryMax`/`MemorySwapMax` cgroups make clamscan "polite" to the Linux scheduler: it yields to PHP-FPM, MySQL, etc. instantly. **But hypervisor-level CPU monitoring (DigitalOcean, Linode, Hetzner) doesn't know about niceness.** It sees raw CPU utilization. On a 1 vCPU droplet during quiet hours, a single-threaded clamscan can fill 100% of the vCPU on its own, tripping a default `>85%/5m` CPU alert every week, even though the scan is niced and real traffic is unaffected.
|
||||
|
||||
**Symptoms:**
|
||||
- Weekly `[ALERT] CPU is running high` email from DO at the same time/day every week
|
||||
- The alert clears within 10–60 min (when scan finishes)
|
||||
- No actual user-visible service degradation
|
||||
- Netdata shows CPU 80–100% but PHP-FPM/MySQL response times barely move
|
||||
|
||||
**Fix: per-droplet alert scoping.** Two changes via the DO API:
|
||||
|
||||
1. **Scope the existing fleet-wide CPU alert to exclude affected 1 vCPU droplets** by setting `entities` to an explicit array of *all other* droplet IDs.
2. **Add a new alert scoped to just the affected droplet(s)** with a relaxed threshold:
   - `value: 95`
   - `window: "30m"`
   - `entities: [<droplet_id>]`
|
||||
|
||||
The relaxed threshold still catches runaway PHP loops, mining trojans, and actual sustained saturation — but ignores the weekly polite scan.
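The two-alert split described above is easy to get wrong by hand once the fleet has more than a few droplets, because the fleet-wide alert needs an explicit list of every *other* ID. A minimal sketch of building both request bodies follows. It only constructs JSON (no API call is made), the field names mirror the alert spec used in this article, and `build_cpu_alert` / `split_cpu_alerts` are hypothetical helper names.

```python
import json


def build_cpu_alert(entities, value, window, description, email):
    """Build one CPU alert body using the fields shown in this article."""
    return {
        "alerts": {"email": [email], "slack": []},
        "compare": "GreaterThan",
        "description": description,
        "enabled": True,
        "entities": [str(e) for e in entities],
        "tags": [],
        "type": "v1/insights/droplet/cpu",
        "value": value,
        "window": window,
    }


def split_cpu_alerts(all_droplet_ids, small_droplet_ids, email):
    """Return (fleet_alert, relaxed_alert) bodies for the two-alert scheme."""
    small = {str(d) for d in small_droplet_ids}
    others = [str(d) for d in all_droplet_ids if str(d) not in small]
    if not others:
        # Assumption: an empty entities list would no longer express an
        # explicit exclusion, so refuse to build it rather than guess.
        raise ValueError("no droplets left for the fleet-wide alert")
    fleet = build_cpu_alert(
        others, 85, "5m",
        "CPU is running high (excludes 1vCPU clamscan boxes)", email,
    )
    relaxed = build_cpu_alert(
        sorted(small), 95, "30m",
        "CPU sustained high (clamscan-aware)", email,
    )
    return fleet, relaxed


if __name__ == "__main__":
    # Placeholder IDs and address, for illustration only.
    fleet, relaxed = split_cpu_alerts(["101", "102", "103"], ["103"], "you@example.com")
    print(json.dumps(fleet, indent=2))
    print(json.dumps(relaxed, indent=2))
```

The printed bodies can be passed to `curl -d` in place of hand-edited JSON; the droplet IDs still have to come from your own account.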
|
||||
|
||||
### Apply via DO API
|
||||
|
||||
```bash
TOKEN="<your DigitalOcean PAT>"

# 1. Scope existing CPU alert (PUT requires the full alert spec)
curl -sS -X PUT \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "alerts": {"email": ["you@example.com"], "slack": []},
    "compare": "GreaterThan",
    "description": "CPU is running high (excludes 1vCPU clamscan boxes)",
    "enabled": true,
    "entities": ["<droplet_id_1>", "<droplet_id_2>"],
    "tags": [],
    "type": "v1/insights/droplet/cpu",
    "value": 85,
    "window": "5m"
  }' \
  "https://api.digitalocean.com/v2/monitoring/alerts/<existing_uuid>"

# 2. Create a relaxed alert for the small box
curl -sS -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "alerts": {"email": ["you@example.com"], "slack": []},
    "compare": "GreaterThan",
    "description": "<host> CPU sustained high (clamscan-aware)",
    "enabled": true,
    "entities": ["<small_droplet_id>"],
    "tags": [],
    "type": "v1/insights/droplet/cpu",
    "value": 95,
    "window": "30m"
  }' \
  "https://api.digitalocean.com/v2/monitoring/alerts"
```
|
||||
|
||||
To list current alerts (find UUIDs and current `entities`):
|
||||
|
||||
```bash
curl -sS -H "Authorization: Bearer $TOKEN" \
  "https://api.digitalocean.com/v2/monitoring/alerts" | jq
```
|
||||
|
||||
**When *not* to do this:** If your droplet has 2+ vCPUs and clamscan only consumes ~50% of total, you probably won't trip an 85% alert in the first place. The per-droplet exemption is mainly for 1 vCPU boxes.
|
||||
|
||||
**When the per-droplet relaxed alert *also* trips (and what to do):** On a 1 vCPU droplet during low-traffic hours (e.g., the default Sunday-morning weekly cron window), clamscan has *nothing real to yield to* — `nice 19` only matters when something else wants the CPU. The kernel correctly schedules clamscan as nice/idle (`iostat` shows `%nice ~94, %idle 0`) but DO sees `100% - 0% idle = 100% CPU` and trips even the 95%/30m threshold for the duration of the scan (~30–50 min on small webserver boxes). At that point the realistic options are:
|
||||
|
||||
1. **Accept the weekly page** as expected noise — simplest, no further engineering
|
||||
2. **Switch to `clamdscan`** (daemon-backed) — scans finish ~3–5× faster and fit in a 30m window, but `clamd` adds ~250 MB resident memory continuously
|
||||
3. **Disable the per-droplet CPU alert entirely** for that host and rely on Netdata for the real signal
|
||||
|
||||
The "polite CPU is invisible to DO" trick stops working once the box is small enough that the polite work fills the entire core unopposed. There is no DO threshold that distinguishes "polite scan filling idle CPU" from "runaway process pinning the vCPU" — that distinction lives in `iostat`'s `%nice` vs `%user` split, which DO doesn't expose.
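Since the `%nice` vs `%user` split is only visible from inside the guest, it can be useful to compute it directly when deciding whether a page was the polite scan or something real. The sketch below is an illustration under stated assumptions: it relies on the aggregate `cpu` line of Linux's `/proc/stat` (fields in the order user, nice, system, idle, iowait, irq, softirq, steal), and `cpu_split` / `read_cpu_line` are hypothetical helper names.

```python
import time


def read_cpu_line(path="/proc/stat"):
    """Return the aggregate 'cpu' line (the first line of /proc/stat)."""
    with open(path) as handle:
        return handle.readline()


def cpu_split(before: str, after: str) -> dict[str, float]:
    """Percent of CPU time per class between two aggregate 'cpu' samples."""
    def counters(line):
        parts = line.split()
        if not parts or parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        return [int(value) for value in parts[1:9]]

    delta = [b - a for a, b in zip(counters(before), counters(after))]
    total = sum(delta)
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    user, nice, system, idle = delta[0], delta[1], delta[2], delta[3]
    return {
        "user": 100.0 * user / total,
        "nice": 100.0 * nice / total,
        "system": 100.0 * system / total,
        "idle": 100.0 * idle / total,
        # What a utilization-only monitor effectively reports: 100 - idle.
        "busy": 100.0 - 100.0 * idle / total,
    }


if __name__ == "__main__":
    first = read_cpu_line()
    time.sleep(5)
    print(cpu_split(first, read_cpu_line()))
```

A result with `nice` near 94 and `busy` at 100 matches the polite-scan case described above, while a high `user` share with the same `busy` value points at a real runaway process.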
|
||||
|
||||
**Alternative considered: switch to `clamdscan`** — uses a resident `clamd` daemon, signatures stay loaded, scan finishes ~10× faster with much less CPU/RAM. Better long-term answer, but requires running `clamd` continuously (memory cost on small boxes is ~250 MB resident vs the cron approach which only holds RAM during scan). Trade-off, not strictly better.
|
||||
|
||||
## Daemonless Mode on Memory-Constrained Hosts
|
||||
|
||||
On hosts with ≤2 GB RAM, running `clamd` continuously is often counterproductive. The daemon loads its full signature database (~950 MB RSS) into memory and keeps it resident. On small VMs this crowds out MySQL, PHP-FPM, and other services — often pushing the whole system into swap rather than preventing anything.
|
||||
|
||||
**Affected hosts (fleet history):**
|
||||
|
||||
| Host | RAM | Incident | Resolution |
|------|-----|----------|------------|
| teelia | 1.9 GB | 2026-04-27: clamd 728 MB RSS, 94% RAM alert | daemonless |
| dcaprod | 3.8 GB | 2026-04-30: clamd OOM thrash after 512M cgroup cap | daemonless |
| majorlinux | 2.0 GB | 2026-05-15: clamd 980 MB swap, mysqld swapping 293 MB | daemonless |
|
||||
|
||||
**The fix: `clamav_use_daemon: false` host_var**
|
||||
|
||||
The `clamav` role supports a per-host override. Add to the host's `host_vars/<hostname>/vars.yml`:
|
||||
|
||||
```yaml
clamav_use_daemon: false
```
|
||||
|
||||
Then re-run the role:
|
||||
|
||||
```bash
ansible-playbook clamav.yml --limit <hostname>
```
|
||||
|
||||
This will:
|
||||
- Stop and disable `clamav-daemon.service` and `clamav-daemon.socket`
|
||||
- Deploy the weekly scan template using `clamscan` (daemonless, loads DB per run)
|
||||
- Leave `clamav-freshclam` active so definitions stay current
|
||||
|
||||
**Trade-off:** Each weekly scan loads the signature DB fresh (~950 MB peak RAM for the scan duration, then freed). The scan takes roughly 3–5× longer than `clamdscan` against a warm daemon, but that is acceptable for a weekly background job. The `systemd-run` `MemoryMax` cgroup wrapper in the scan template caps peak usage so the scan can't OOM the host.
|
||||
|
||||
**Rule of thumb:** Use daemon mode (`clamav_use_daemon: true` or unset) on hosts with ≥4 GB RAM where scan speed matters (mail servers, upload handlers). Use daemonless on webservers and small VMs where continuous memory residency is the bigger risk.
|
||||
|
||||
## See Also
|
||||
|
||||
- [clamscan-cpu-spike-nice-ionice](../../05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md) — troubleshooting CPU spikes from unthrottled scans
|
||||
|
|
|
|||
|
|
@ -1,11 +1,18 @@
|
|||
---
|
||||
title: "Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts"
|
||||
title: Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts
|
||||
domain: selfhosting
|
||||
category: security
|
||||
tags: [fail2ban, security, email, ansible, fleet, cron, digest]
|
||||
tags:
|
||||
- fail2ban
|
||||
- security
|
||||
- email
|
||||
- ansible
|
||||
- fleet
|
||||
- cron
|
||||
- digest
|
||||
status: published
|
||||
created: 2026-04-22
|
||||
updated: 2026-04-22
|
||||
updated: 2026-05-02T14:56
|
||||
---
|
||||
# Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts
|
||||
|
||||
|
|
@ -21,11 +28,11 @@ Three tiers replace the firehose:
|
|||
|
||||
| Tier | Jails | Action | Why |
|
||||
|------|-------|--------|-----|
|
||||
| **Immediate email** | `sshd`, `recidive` | `action_mwl` | Security-critical — someone is actively targeting auth or is a repeat offender |
|
||||
| **Immediate email** | `recidive` | `action_mwl` | Repeat offenders only — someone has been banned multiple times across jails |
|
||||
| **Silent ban** | Everything else | `action_` (default) | Ban happens, firewall rule applied, no email sent |
|
||||
| **Daily digest** | All jails | Cron script at 08:00 UTC | One summary email per host with ban counts across all jails |
|
||||
|
||||
This reduces email volume from hundreds per day to ~10 (one digest per host + occasional sshd/recidive alerts).
|
||||
This reduces email volume from hundreds per day to ~10 (one digest per host + occasional recidive alerts).
|
||||
|
||||
## jail.local Configuration
|
||||
|
||||
|
|
@ -40,18 +47,20 @@ action = %(action_)s
|
|||
|
||||
This overrides the stock `action_mwl` for all jails. Bans still happen — the firewall rule is applied — but no email is sent.
|
||||
|
||||
### Keep immediate alerts for critical jails
|
||||
### Keep immediate alerts for recidive only
|
||||
|
||||
```ini
|
||||
[sshd]
|
||||
enabled = true
|
||||
action = %(action_mwl)s
|
||||
action = %(action_)s
|
||||
|
||||
[recidive]
|
||||
enabled = true
|
||||
action = %(action_mwl)s
|
||||
```
|
||||
|
||||
> **Updated 2026-05-02:** sshd was moved to silent (`action_`). Only recidive (repeat offenders) now triggers immediate email. sshd bans are captured in the daily digest.
|
||||
|
||||
### Clean up email subjects with fq-hostname
|
||||
|
||||
By default, fail2ban uses the system FQDN in email subjects. On Tailscale hosts, this produces ugly subjects like `[Fail2Ban] sshd: banned 1.2.3.4 on MajorToot.tail7f2d9.ts.net`. Override it in `[DEFAULT]`:
|
||||
|
|
@ -91,8 +100,9 @@ The playbook `configure_fail2ban_digest.yml` deploys the full digest model fleet
|
|||
### What it does
|
||||
|
||||
1. Deploys a Python helper script that performs **section-aware editing** of `jail.local` (see gotchas below)
|
||||
2. Sets `action = %(action_)s` in `[DEFAULT]`
|
||||
3. Sets `action = %(action_mwl)s` in `[sshd]` and `[recidive]`
|
||||
2. Sets `action = %(action_)s` in `[DEFAULT]` and `[sshd]`
|
||||
3. Sets `action = %(action_mwl)s` in `[recidive]`
|
||||
4. Removes stale `action = %(action_mwl)s` from `defaults-debian.conf` if present
|
||||
4. Sets `fq-hostname` per host using an override dict
|
||||
5. Deploys the digest script from a Jinja2 template
|
||||
6. Creates the cron job via `ansible.builtin.cron`
|
||||
|
|
@ -143,6 +153,14 @@ option 'action' in section 'DEFAULT' already exists
|
|||
|
||||
The Python editor script handles this by replacing existing keys rather than appending.
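To make "replacing existing keys rather than appending" concrete, here is a minimal sketch of section-aware editing. It is not the fleet's actual helper script: `set_option` is a hypothetical function, and it edits text line by line rather than using `configparser`, on the assumption that comments, ordering, and `%(...)s` interpolation strings should be left untouched.

```python
import re

SECTION_RE = re.compile(r"^\s*\[([^\]]+)\]\s*$")


def set_option(text: str, section: str, key: str, value: str) -> str:
    """Set key = value inside [section], replacing any existing copies."""
    key_re = re.compile(r"^" + re.escape(key) + r"\s*=")
    new_line = f"{key} = {value}"
    out = []
    in_target = False     # currently inside the requested section
    seen_section = False  # the section header exists somewhere in the file
    written = False       # new_line has already been emitted
    skipping = False      # dropping continuation lines of a replaced value

    for line in text.splitlines():
        header = SECTION_RE.match(line)
        if header:
            if in_target and not written:
                out.append(new_line)  # key was missing: add before next header
                written = True
            in_target = header.group(1) == section
            seen_section = seen_section or in_target
            skipping = False
            out.append(line)
            continue
        if skipping and line[:1] in (" ", "\t") and line.strip():
            continue  # indented continuation of the value being replaced
        skipping = False
        if in_target and key_re.match(line):
            if not written:
                out.append(new_line)
                written = True
            skipping = True
            continue  # never keep a second copy of the key
        out.append(line)

    if in_target and not written:
        out.append(new_line)
    elif not seen_section:
        out.extend([f"[{section}]", new_line])
    return "\n".join(out) + "\n"
```

Running it twice with the same arguments yields the same file, which is the property that avoids the `already exists` error. Calling it once per section (`DEFAULT`, `sshd`, `recidive`) reproduces the three-tier layout described earlier.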
|
||||
|
||||
### defaults-debian.conf overrides jail.local
|
||||
|
||||
On Debian/Ubuntu, `/etc/fail2ban/jail.d/defaults-debian.conf` is loaded **after** `jail.local`. If it contains `action = %(action_mwl)s`, it silently overrides your silent default — every jail sends email on every ban. The Ansible playbook now removes this line automatically. If you see per-ban emails after deploying digest mode, check this file first:
|
||||
|
||||
```bash
grep action /etc/fail2ban/jail.d/defaults-debian.conf
```
|
||||
|
||||
### fq-hostname scope
|
||||
|
||||
Setting `fq-hostname` in `[DEFAULT]` affects all action templates that use the `<fq-hostname>` tag — including both immediate emails and the digest subject. This is the desired behavior, but be aware that it overrides the system hostname globally within fail2ban.
|
||||
|
|
|
|||
|
|
@ -31,6 +31,9 @@ Rather than editing `/etc/ssh/sshd_config` directly (which may be managed by the
|
|||
|
||||
## Ansible Playbook
|
||||
|
||||
> On the MajorsHouse fleet this is packaged as the **`ssh_hardening` role** (`roles/ssh_hardening/`)
|
||||
> and run via `ssh_hardening.yml` or `site.yml`. The standalone playbook below is the illustrative equivalent.
|
||||
|
||||
```yaml
|
||||
- name: Harden SSH daemon fleet-wide
|
||||
hosts: all:!raspbian
|
||||
|
|
|
|||
|
|
@ -10,7 +10,7 @@ tags:
|
|||
- docker
|
||||
status: published
|
||||
created: 2026-04-02
|
||||
updated: 2026-04-29T22:45
|
||||
updated: 2026-04-30T05:21
|
||||
---
|
||||
|
||||
# Mastodon Instance Tuning
|
||||
|
|
|
|||
170
02-selfhosting/services/mastodon-mention-spam-crowdfunding.md
Normal file
|
|
@ -0,0 +1,170 @@
|
|||
---
|
||||
title: "Mastodon — Triaging Crowdfunding / Mention-Spam Accounts"
|
||||
description: How to tell broadcast fundraising solicitation from genuine mentions, investigate the account and its origin instance with SQL + nodeinfo, and pick a proportionate moderation action.
|
||||
tags:
|
||||
- mastodon
|
||||
- moderation
|
||||
- abuse
|
||||
- federation
|
||||
- self-hosting
|
||||
created: 2026-06-22
|
||||
updated: 2026-06-22
|
||||
---
|
||||
|
||||
# Mastodon — Triaging Crowdfunding / Mention-Spam Accounts
|
||||
|
||||
If you run a Mastodon instance, sooner or later you (or your users) start getting tagged by accounts you've never interacted with, posting donation appeals with a link and a wall of hashtags. Some are real people in desperate situations; some are recycled-link scams. Either way, when an account is **broadcasting a solicitation at you** rather than replying to you, it's a moderation question, not a conversation.
|
||||
|
||||
This article is the runbook for telling the two apart, investigating both the **account** and its **origin instance**, and choosing an action that's proportionate instead of nuking eight years of legit federation over two bad actors.
|
||||
|
||||
## TL;DR
|
||||
|
||||
- A mention is **broadcast spam**, not engagement, when it's a *standalone post* (not a reply) that *tags a large fixed list* of accounts and carries a *donation link*, usually from a *throwaway profile* on an *open-registration instance*.
|
||||
- Investigate before acting: pull the account's age/stats/bio and check whether the post is a reply or a 40-way blast (SQL below). Profile the origin instance via its public `nodeinfo`.
|
||||
- **Default action is an account-level block**, which also federates and removes their follow of you. Escalate to domain-limit / domain-block only when *one instance* produces *repeat offenders*.
|
||||
- Keep a log so single incidents that are actually a pattern become visible.
|
||||
|
||||
## Signals that a mention is broadcast solicitation
|
||||
|
||||
Score it on how many of these hold:
|
||||
|
||||
| Signal | Why it matters |
|---|---|
| **Standalone post, not a reply** (`in_reply_to_account_id IS NULL`) but still tags you | They're broadcasting, not responding |
| **Tags a large fixed recipient list** (e.g. 40+) | Mass distribution; the same list reused across senders = coordination |
| **Donation link** in post or bio (`chuffed.org`, `gofundme`, `paypal.me`, `ko-fi`) | The payload |
| **Throwaway profile**: days old, few followers, follows you but you don't follow back | Disposable, baiting a profile view |
| **Mass-follow ratio**: following thousands / few hundred followers | Engagement farming |
| **"I am not a scammer" disclaimer** in bio | Known red-flag phrase |
| **Origin instance: open registration, no approval** | Easy throwaway-account farm |
|
||||
|
||||
> [!warning] Judgment, not a purity test
> Many of these accounts are real people. The goal is not to adjudicate need; it's to stop *broadcast solicitation aimed at you* and track the *source instances*. Prefer the lightest action that stops it.
|
||||
|
||||
## Investigate the account
|
||||
|
||||
Connect to the DB on the instance:
|
||||
|
||||
```bash
ssh <your-mastodon-host>
sudo -u postgres psql mastodon_production
```
|
||||
|
||||
**Profile + stats for a suspect** (age, post count, follower ratio, bio):
|
||||
|
||||
```sql
SELECT a.username||'@'||a.domain,
       to_char(a.created_at,'YYYY-MM-DD') AS first_seen_locally,
       st.statuses_count, st.followers_count, st.following_count,
       left(regexp_replace(COALESCE(a.note,''),'<[^>]+>','','g'),200) AS bio
FROM accounts a LEFT JOIN account_stats st ON st.account_id=a.id
WHERE a.domain='<INSTANCE>' AND a.username='<HANDLE>';
```
|
||||
|
||||
**Is the mention a reply or a blast?** `standalone=t` with a high `num_tagged` is the tell:
|
||||
|
||||
```sql
SELECT a.username, to_char(s.created_at,'YYYY-MM-DD HH24:MI') AS posted,
       s.in_reply_to_account_id IS NULL AS standalone,
       (SELECT count(*) FROM mentions mm WHERE mm.status_id=s.id) AS num_tagged
FROM mentions m JOIN statuses s ON s.id=m.status_id
JOIN accounts a ON a.id=s.account_id
JOIN accounts me ON me.id=m.account_id AND me.username='<YOU>' AND me.domain IS NULL
WHERE a.username='<HANDLE>' AND a.domain='<INSTANCE>'
ORDER BY s.created_at DESC;
```
|
||||
|
||||
**All recent direct mentions of you** (sweep for the wider pattern):
|
||||
|
||||
```sql
SELECT to_char(n.created_at,'YYYY-MM-DD HH24:MI') AS when,
       a.username||COALESCE('@'||a.domain,'@local') AS who,
       COALESCE(s.uri,'') AS uri,
       left(regexp_replace(COALESCE(s.text,''),'<[^>]+>','','g'),200) AS body
FROM notifications n
JOIN accounts recip ON recip.id=n.account_id AND recip.username='<YOU>' AND recip.domain IS NULL
JOIN accounts a ON a.id=n.from_account_id
LEFT JOIN mentions m ON m.id=n.activity_id AND n.activity_type='Mention'
LEFT JOIN statuses s ON s.id=m.status_id
WHERE n.type='mention' ORDER BY n.created_at DESC LIMIT 40;
```
|
||||
|
||||
## Profile the origin instance
|
||||
|
||||
Don't judge an instance by one bad account. Pull its public metadata — no auth needed:
|
||||
|
||||
```bash
# Software, version, user counts, registration policy
NI=$(curl -s https://<INSTANCE>/.well-known/nodeinfo | python3 -c 'import sys,json;print(json.load(sys.stdin)["links"][-1]["href"])')
curl -s "$NI" | python3 -m json.tool   # software, openRegistrations, usage.users

# Title, contact/admin, rules, registration approval flag
curl -s https://<INSTANCE>/api/v2/instance | python3 -m json.tool
```
|
||||
|
||||
What to read off it:
|
||||
|
||||
- **`openRegistrations: true` + `approval_required: false`** → throwaway-account farm; expect more of the same.
|
||||
- **`usage.users.total` vs `usage.users.activeMonth`** → a huge dormant base is typical of sign-up-and-leave farms.
|
||||
- **Federation age on your side** — how long you've known the instance, how many of its accounts you cache. A long, broad relationship argues *against* a domain block.
|
||||
- **The instance's own rules** — many ban "backlink accounts" / harassment, which the mass-tag fundraising violates. That makes **reporting to its admin a legitimate, in-policy path.**
|
||||
|
||||
```sql
-- What your instance already knows about the domain
SELECT (SELECT count(*) FROM accounts WHERE domain='<INSTANCE>') AS known_accounts,
       (SELECT count(*) FROM statuses s JOIN accounts a ON a.id=s.account_id WHERE a.domain='<INSTANCE>') AS cached_statuses,
       (SELECT to_char(min(created_at),'YYYY-MM-DD') FROM accounts WHERE domain='<INSTANCE>') AS first_seen,
       (SELECT count(*) FROM domain_blocks WHERE domain='<INSTANCE>') AS is_domain_blocked;
```
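The registration and dormancy signals from the public metadata can also be condensed into a few fields, which helps when comparing several source instances. This is a rough sketch that works on already-parsed JSON (the nodeinfo document and the `/api/v2/instance` response saved from the `curl` calls). It assumes the NodeInfo 2.x keys `openRegistrations` and `usage.users.{total,activeMonth}` plus Mastodon's `registrations.approval_required`; `summarize_instance` is a hypothetical helper name.

```python
def summarize_instance(nodeinfo: dict, instance: dict) -> dict:
    """Condense nodeinfo + /api/v2/instance JSON into triage signals."""
    users = nodeinfo.get("usage", {}).get("users", {})
    total = users.get("total")
    active = users.get("activeMonth")
    software = nodeinfo.get("software", {})
    open_registrations = bool(nodeinfo.get("openRegistrations"))
    approval_required = instance.get("registrations", {}).get("approval_required")

    active_ratio = None
    if total and active is not None:
        active_ratio = active / total  # low ratio = large dormant base

    return {
        "software": software.get("name"),
        "version": software.get("version"),
        "open_registrations": open_registrations,
        "approval_required": approval_required,
        # Open sign-up with no approval step: the "throwaway farm" signal.
        "ungated_signup": open_registrations and approval_required is False,
        "active_ratio": active_ratio,
    }
```

Missing keys come back as `None` rather than raising, since non-Mastodon software may not expose the same fields. What counts as a "low" `active_ratio` is a judgment call, not something this sketch decides.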
|
||||
|
||||
## The escalation ladder
|
||||
|
||||
| Level | Action | Effect | When |
|---|---|---|---|
| 1 | **Mute** | You stop seeing them; silent | Borderline; you don't want to cut them off |
| 2 | **Block (account)** | Cuts mentions, removes their follow, federates to their instance | **Default first action** |
| 3 | **Report** to source admin | Forwards the offending posts to their moderators | Repeat or egregious; in-policy on most instances |
| 4 | **Domain-limit (silence)** | Their posts show only if you follow that account | One instance, multiple offenders |
| 5 | **Domain-block (suspend)** | Severs all known accounts + federation | Instance is predominantly abuse |
|
||||
|
||||
### Blocking from a user account (federates + removes follow)
|
||||
|
||||
There is no `tootctl accounts block`. Do it through Mastodon's `BlockService` in a Rails runner script so it tears down the relationship and federates correctly:
|
||||
|
||||
```ruby
# run as the mastodon user:
#   sudo -u mastodon bash -c 'cd /home/mastodon/live && RAILS_ENV=production bin/rails runner /tmp/block.rb'
me = Account.find_by(username: "<YOU>", domain: nil)
%w[Handle1 Handle2].each do |u|
  t = Account.find_by(username: u, domain: "<INSTANCE>")
  next puts("NOTFOUND #{u}") if t.nil?
  BlockService.new.call(me, t)
  puts "BLOCKED #{u} blocking=#{me.blocking?(t)} they_follow_me=#{t.following?(me)}"
end
```
|
||||
|
||||
`blocking=true` with `they_follow_me=false` confirms the block landed and the follow was severed.
|
||||
|
||||
### Instance-level actions
|
||||
|
||||
Domain-limit / domain-block live in the admin UI (**Moderation → Federation**) or via `tootctl`:
|
||||
|
||||
```bash
# Silence (limit): posts hidden unless followed
RAILS_ENV=production bin/tootctl domains ...   # or set severity=silence in the admin UI
# Suspend (block) the whole instance
RAILS_ENV=production bin/tootctl ...           # admin UI "Add domain block" is the safe path
```
|
||||
|
||||
> [!tip] Reach for the lightest hammer
> A domain block is rarely the right first move against an established instance: you lose every legit account and years of federation to swat a couple of accounts. Block the accounts, report them to the source admin, and only escalate the *instance* when it demonstrates a sustained, multi-actor pattern.
|
||||
|
||||
## Keep a log
|
||||
|
||||
Track offenders and source instances over time so a "one-off" that's actually a campaign becomes visible, and so domain-level decisions are evidence-based. A simple table — date, account, instance, signals, action — plus an instance-watch table with each source's registration policy and offender count is enough.
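As a rough illustration of how such a log makes a pattern visible, the sketch below reads a CSV with the five columns named above and reports instances with more than one distinct offending account. The column order, the `min_accounts` threshold of 2, and the function names are assumptions for the example, not an established convention.

```python
import csv
import io
from collections import defaultdict

FIELDS = ["date", "account", "instance", "signals", "action"]


def load_incidents(csv_text: str) -> list[dict]:
    """Parse the incident log; the header row must match FIELDS exactly."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != FIELDS:
        raise ValueError(f"expected header {FIELDS}, got {reader.fieldnames}")
    return list(reader)


def repeat_offender_instances(rows: list[dict], min_accounts: int = 2) -> dict[str, int]:
    """Map instance -> number of distinct offending accounts, if >= min_accounts."""
    accounts_by_instance = defaultdict(set)
    for row in rows:
        accounts_by_instance[row["instance"]].add(row["account"])
    return {
        instance: len(accounts)
        for instance, accounts in accounts_by_instance.items()
        if len(accounts) >= min_accounts
    }
```

Counting distinct accounts rather than rows keeps one noisy account from making its whole instance look like a repeat source, which matches the "one instance, multiple offenders" condition on the escalation ladder.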
|
||||
|
||||
## Related
|
||||
|
||||
- [Mastodon `--prune-profiles` Trap](mastodon-prune-profiles-trap.md)
|
||||
- [Mastodon DB Maintenance](mastodon-db-maintenance.md)
|
||||
- [Mastodon Federation](mastodon-federation.md)
|
||||
174
02-selfhosting/services/mastodon-post-install-hardening.md
Normal file
|
|
@ -0,0 +1,174 @@
|
|||
---
|
||||
title: Mastodon Post-Install Hardening (Permissions + Account)
|
||||
domain: selfhosting
|
||||
category: services
|
||||
tags:
|
||||
- mastodon
|
||||
- fediverse
|
||||
- self-hosting
|
||||
- hardening
|
||||
- ansible
|
||||
- nginx
|
||||
- rbenv
|
||||
status: published
|
||||
created: 2026-05-31
|
||||
updated: 2026-05-31
|
||||
---
|
||||
|
||||
# Mastodon Post-Install Hardening (Permissions + Account)
|
||||
|
||||
Four gaps that the upstream Mastodon install guide doesn't lock down — each silently breaks something or leaves a credential exposed. Found on majortoot-hetzner during its 2026-05-31 cutover; codified in MajorAnsible's `configure_mastodon_permissions.yml`.
|
||||
|
||||
---
|
||||
|
||||
## Gap 1: `/home/mastodon` is `0750` — nginx 403s every asset
|
||||
|
||||
### Symptom
|
||||
|
||||
Browser loads `https://<your-instance>/` and shows an unstyled **purple background with no content** (Mastodon's React entry HTML loaded, but every JS / CSS / manifest request 403'd). API endpoints like `/api/v1/instance` still return 200 because they fall through nginx's `try_files` to the puma proxy — but static assets need direct filesystem access.
|
||||
|
||||
### Cause
|
||||
|
||||
Debian/Ubuntu's `useradd` default umask creates `/home/<user>` as `0750` (owner+group only). nginx runs as `www-data`, which is in neither — it cannot **traverse** into `/home/mastodon/live/public/` to serve `packs/assets/*.js`, manifest.json, etc. The errors land in `/var/log/nginx/error.log`:
|
||||
|
||||
```
[crit] stat() "/home/mastodon/live/public/packs/assets/foo.js" failed (13: Permission denied)
```
|
||||
|
||||
### Fix
|
||||
|
||||
```bash
chmod 0751 /home/mastodon
```
|
||||
|
||||
`0751` gives `other` execute (traversal) only, **not read**, so files inside that aren't world-readable stay private. `.env.production` is world-readable by default, which is why the next gap locks it down.
|
||||
|
||||
---
|
||||
|
||||
## Gap 2: `.env.production` is `0644` — DB_PASS and SECRET_KEY_BASE are world-readable
|
||||
|
||||
### Symptom
|
||||
|
||||
Once Gap 1 is fixed and `/home/mastodon` is traversable, any local user (and any compromised process running as nginx, sidekiq under reduced privileges, a container escape, etc.) can `cat /home/mastodon/live/.env.production` and read every Mastodon secret.
|
||||
|
||||
### Cause
|
||||
|
||||
The `mastodon-setup` interactive wizard writes `.env.production` with default `0644` permissions. The file contains:
|
||||
|
||||
- `DB_PASS` — PostgreSQL password
|
||||
- `SECRET_KEY_BASE` — session cookie signing key
|
||||
- `OTP_SECRET` — 2FA encryption key
|
||||
- SMTP credentials
|
||||
- S3 / object-storage credentials if configured
|
||||
|
||||
### Fix
|
||||
|
||||
```bash
chmod 0600 /home/mastodon/live/.env.production
chown mastodon:mastodon /home/mastodon/live/.env.production
```
|
||||
|
||||
No service restart needed — Rails reads `.env.production` at process boot, not per-request. Existing `puma`, `sidekiq`, and `streaming` services keep running.
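The reasoning in Gaps 1 and 2 comes down to a few permission bits: `other` needs execute on the home directory to traverse it, and nobody but the owner should have any access to the secrets file. A minimal sketch of checking both is shown below; the default paths are the ones used in this article, while the helper names and the `audit` function are hypothetical and not taken from the Ansible playbook.

```python
import os
import stat


def other_can_traverse(mode: int) -> bool:
    """True if 'other' has the execute (search) bit on a directory mode."""
    return bool(mode & stat.S_IXOTH)


def other_can_read(mode: int) -> bool:
    """True if 'other' has the read bit."""
    return bool(mode & stat.S_IROTH)


def group_or_other_access(mode: int) -> bool:
    """True if any group/other permission bit is set (i.e. not 0600-style)."""
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


def audit(home="/home/mastodon", env_file="/home/mastodon/live/.env.production"):
    """Return a list of human-readable problems; empty means both gaps are closed."""
    problems = []
    home_mode = stat.S_IMODE(os.stat(home).st_mode)
    env_mode = stat.S_IMODE(os.stat(env_file).st_mode)
    if not other_can_traverse(home_mode):
        problems.append(f"{home} is {home_mode:04o}: nginx cannot traverse it")
    if other_can_read(home_mode):
        problems.append(f"{home} is {home_mode:04o}: directory is world-listable")
    if group_or_other_access(env_mode):
        problems.append(f"{env_file} is {env_mode:04o}: secrets readable beyond owner")
    return problems
```

Run `audit()` as root on the Mastodon host. With `/home/mastodon` at `0751` and `.env.production` at `0600` it should return an empty list.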
|
||||
|
||||
---
|
||||
|
||||
## Gap 3: `mastodon` user shell is `/usr/sbin/nologin` — `su - mastodon` fails
|
||||
|
||||
### Symptom
|
||||
|
||||
```
root@majortoot:~# su - mastodon
This account is currently not available.
```
|
||||
|
||||
Blocks all `tootctl` and Rails console admin via SSH.
|
||||
|
||||
### Cause
|
||||
|
||||
If the user was created with `useradd --system mastodon`, the system-account default is shell `/usr/sbin/nologin`. Mastodon's own installer typically sets `/bin/bash` but a manual / Ansible / Packer build path may have used `--system`.
|
||||
|
||||
### Fix
|
||||
|
||||
```bash
usermod -s /bin/bash mastodon
```
|
||||
|
||||
Verify with `getent passwd mastodon | cut -d: -f7` → `/bin/bash`.
|
||||
|
||||
---
|
||||
|
||||
## Gap 4: Login shells don't load rbenv — `tootctl` reports "ruby: command not found"
|
||||
|
||||
### Symptom
|
||||
|
||||
After fixing Gap 3, `su - mastodon` succeeds, but:
|
||||
|
||||
```
mastodon@majortoot:~$ which ruby
(no output, exit 1)
mastodon@majortoot:~$ cd /home/mastodon/live && bin/tootctl version
/usr/bin/env: 'ruby': No such file or directory
```
|
||||
|
||||
### Cause
|
||||
|
||||
A typical Mastodon install puts rbenv init in `~/.bashrc`. But bash **login** shells (which `su -` and `ssh user@host` open) source `.bash_profile`, `.bash_login`, or `.profile` in that order — **not** `.bashrc`. If `.bash_profile` doesn't exist and `.profile` doesn't init rbenv, the login shell never gets rbenv on PATH.
|
||||
|
||||
Even when `.bash_profile` chains `.bashrc`, Ubuntu's default `.bashrc` has a guard at the top:
|
||||
|
||||
```bash
case $- in
    *i*) ;;
      *) return;;
esac
```
|
||||
|
||||
This **returns early for non-interactive shells**, which is exactly what `su - mastodon -c "<command>"` opens — so the rbenv init lines later in `.bashrc` are never reached.
|
||||
|
||||
### Fix
|
||||
|
||||
Drop a `.bash_profile` that sets up rbenv **before** sourcing `.bashrc`, so it works for both interactive and non-interactive login shells:
|
||||
|
||||
```bash
# /home/mastodon/.bash_profile (mode 0644, owned by mastodon:mastodon)
export PATH="$HOME/.rbenv/bin:$HOME/.rbenv/shims:$PATH"
if command -v rbenv >/dev/null 2>&1; then
  eval "$(rbenv init -)"
fi

# Then load POSIX login env + bash interactive config
[ -f ~/.profile ] && . ~/.profile
[ -f ~/.bashrc ] && . ~/.bashrc
```
|
||||
|
||||
Verify:
|
||||
|
||||
```bash
su - mastodon -c "ruby -v"   # → ruby 3.x.x …
su - mastodon -c "cd /home/mastodon/live && RAILS_ENV=production bin/tootctl version"
```
|
||||
|
||||
---
|
||||
|
||||
## Codified
|
||||
|
||||
All four gaps are handled by `configure_mastodon_permissions.yml` in MajorAnsible. The playbook is idempotent, requires no service restart, and includes self-asserting verification steps:
|
||||
|
||||
| Assertion | What it catches |
|---|---|
| `sudo -u www-data stat /home/mastodon/live/public/packs` must succeed | Gap 1 regression |
| `sudo -u www-data cat .env.production` must fail | Gap 2 regression |
| `su - mastodon -c "ruby -v"` must succeed and output "ruby" | Gap 3 or 4 regression |
|
||||
|
||||
Apply to all Mastodon hosts:
|
||||
|
||||
```bash
|
||||
ansible-playbook configure_mastodon_permissions.yml
|
||||
```
|
||||
|
||||
## References
|
||||
|
||||
- [[majortoot#2026-05-31 — ssh.socket race post-reboot on majortoot-hetzner (during cutover night)]]
|
||||
- [[majortoot#tootctl CLI Note]]
|
||||
- MajorAnsible: `configure_mastodon_permissions.yml`
|
||||
- Related: [[mastodon-instance-tuning|Mastodon Instance Tuning]] · [[mastodon-db-maintenance|Mastodon DB Maintenance]]
|
||||
220
02-selfhosting/services/mastodon-prune-profiles-trap.md
Normal file
|
|
@ -0,0 +1,220 @@
|
|||
---
|
||||
title: Mastodon — The `--prune-profiles` Trap and How to Recover
|
||||
description: Why running `tootctl media remove --prune-profiles` blows away avatars that don't come back, and how to repopulate them on demand
|
||||
tags:
|
||||
- mastodon
|
||||
- tootctl
|
||||
- federation
|
||||
- self-hosting
|
||||
- troubleshooting
|
||||
created: 2026-05-07
|
||||
updated: 2026-06-01
|
||||
---
|
||||
|
||||
# Mastodon — The `--prune-profiles` Trap and How to Recover
|
||||
|
||||
If you administer a Mastodon instance and run `tootctl media remove --prune-profiles` on a schedule, you're probably introducing a long-running cosmetic regression that no one will be able to explain when it happens.
|
||||
|
||||
This article documents what the flag actually does, why the missing avatars don't auto-recover, and the smallest tool you can ship to fix things on demand.
|
||||
|
||||
## TL;DR
|
||||
|
||||
- `tootctl media remove --prune-profiles` deletes cached **remote** avatars older than `--days=N` from your S3/local storage **and** clears `accounts.avatar_file_name` in the database.
|
||||
- Mastodon does **not** re-fetch avatars when a client views a profile. Re-fetch happens only on incoming `Update` ActivityPub activities or via an explicit `tootctl accounts refresh`.
|
||||
- Quiet remote accounts therefore stay broken — sometimes for weeks — after a prune.
|
||||
- The disk savings are modest (≈250 KB per account on average) and the cosmetic damage hits exactly the accounts you care about most: your follows.
|
||||
- Most admins should **drop `--prune-profiles` and `--remove-headers` from cron** and refresh on demand instead.
|
||||
|
||||
## What the flags actually do
|
||||
|
||||
`tootctl media remove` has three distinct modes:
|
||||
|
||||
| Invocation | Target | Default `--days` |
|---|---|---|
| `tootctl media remove` | remote media **attachments** (images/video in posts) | 7 |
| `tootctl media remove --prune-profiles` | remote **avatars** and headers | 7 |
| `tootctl media remove --remove-headers` | remote **headers** only | 7 |
|
||||
|
||||
Each mode deletes the file from your storage backend and nullifies the corresponding `accounts.avatar_file_name` / `header_file_name` column. The two profile flags are **mutually exclusive**: passing `--prune-profiles` and `--remove-headers` in the same invocation aborts with:
|
||||
|
||||
```
|
||||
--prune-profiles and --remove-headers should not be specified simultaneously
|
||||
```
|
||||
|
||||
If your cron script combines them, the command exits with that error and **prunes nothing**. In an unattended cron job that is easy to miss, and the first time you correct the bug you'll suddenly nuke everything that's accumulated since the instance was created.
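One way to keep a failed invocation from going unnoticed is to run each mode as its own step and report the exit status. The following is a minimal sketch, not the author's cron script: `run_step` is a hypothetical helper, and the `tootctl` lines are shown as comments because they only make sense on a Mastodon host.

```bash
# Run one command, print OK/FAIL with its label, and pass the exit status on.
# run_step is an illustrative helper name, not part of tootctl.
run_step() {
  label=$1
  shift
  if "$@"; then
    printf 'OK   %s\n' "$label"
  else
    status=$?
    printf 'FAIL %s (exit %s)\n' "$label" "$status" >&2
    return "$status"
  fi
}

# One mode per invocation, never two profile flags together:
# run_step "attachments"   bin/tootctl media remove --days=7 --concurrency=5
# run_step "preview cards" bin/tootctl preview_cards remove --days=30 --concurrency=5
```

Because failures go to stderr, cron's usual mail-on-output behaviour (when `MAILTO` is set) is enough to surface them.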
|
||||
|
||||
## Why the pictures don't come back
|
||||
|
||||
Mastodon's media-recovery model is event-driven, not lazy. The triggers that cause a remote avatar to be re-fetched are:
|
||||
|
||||
1. The remote actor emits an `Update` ActivityPub activity — typically when they edit their profile, change avatar, change display name, etc.
|
||||
2. Less reliably, certain `Create` activities on accounts whose actor state appears stale.
|
||||
3. Manual: `tootctl accounts refresh user@instance.tld`, the web UI's "Refresh profile" button (gear menu on the profile page), or admin actions touching the actor record.
|
||||
|
||||
What does **not** trigger a re-fetch:
|
||||
|
||||
- Loading the profile in any client (web, iOS app, Ivory, Tusky, Toot!, etc.).
|
||||
- Liking, replying to, boosting, or following toots from the user.
|
||||
- Viewing the user in your followers/following list.
|
||||
|
||||
This is why you see **broken avatars consistently across every client and device** — the asset is missing on your server, and your clients are all faithfully fetching from the same broken URL.
|
||||
|
||||
Active accounts re-emit `Update` activities reasonably often, so they self-heal over hours/days. Quiet accounts, accounts on small or down instances, and accounts whose owners simply don't update their profiles can stay broken indefinitely.
|
||||
|
||||
## Recovery on demand
|
||||
|
||||
Single account:
|
||||
|
||||
```bash
|
||||
sudo -u mastodon -H bash -c '
|
||||
cd /home/mastodon/live
|
||||
export RAILS_ENV=production
|
||||
export PATH=/home/mastodon/.rbenv/bin:/home/mastodon/.rbenv/shims:$PATH
|
||||
bin/tootctl accounts refresh user@instance.tld
|
||||
'
|
||||
```
|
||||
|
||||
For your local user's follows, a small wrapper that finds only accounts with broken avatars *whose origin actually advertises one*:
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# refresh-my-follows.sh: repopulate broken avatars for the local user's
# follows. Idempotent. Skips accounts whose origin has no avatar (e.g.,
# users who never set one) and ignores headers entirely (a missing header is often legitimate).
|
||||
set -euo pipefail
|
||||
|
||||
export PATH="/home/mastodon/.rbenv/bin:/home/mastodon/.rbenv/shims:$PATH"
|
||||
export RAILS_ENV=production
|
||||
cd /home/mastodon/live
|
||||
|
||||
USER_TO_REFRESH="${1:-yourusername}"
|
||||
|
||||
accts=$(bin/rails runner "
|
||||
acct = Account.find_by(username: %q($USER_TO_REFRESH), domain: nil)
|
||||
abort %q(no such local account) unless acct
|
||||
acct.following
|
||||
.where.not(domain: nil)
|
||||
.where(avatar_file_name: nil)
|
||||
.where.not(avatar_remote_url: [nil, ''])
|
||||
.pluck(:username, :domain)
|
||||
.each { |u, d| puts %Q(#{u}@#{d}) }
|
||||
" | grep -E '^[^[:space:]@]+@[^[:space:]@]+$' || true)
|
||||
|
||||
count=$(printf '%s\n' "$accts" | grep -cv '^$' || true)
|
||||
echo "Found $count remote follows with missing avatar"
|
||||
|
||||
i=0
|
||||
while IFS= read -r a; do
|
||||
[ -z "$a" ] && continue
|
||||
i=$((i+1))
|
||||
printf '[%d/%d] refresh %s ... ' "$i" "$count" "$a"
|
||||
# </dev/null keeps tootctl from reading the account list that feeds this loop
if bin/tootctl accounts refresh "$a" >/dev/null 2>&1 </dev/null; then
|
||||
echo OK
|
||||
else
|
||||
echo FAIL
|
||||
fi
|
||||
done <<< "$accts"
|
||||
```
|
||||
|
||||
Three things in that WHERE clause matter:
|
||||
|
||||
- `avatar_file_name: nil` — local cache is empty, so we need to fetch.
|
||||
- `domain: not nil` — only remote accounts have cached avatars to repopulate.
|
||||
- `avatar_remote_url: [nil, '']` excluded: if the origin actor object has no avatar, a refresh has nothing to populate. Including these accounts makes the script retry them on every run without ever making progress.
|
||||
|
||||
## Bulk restore at scale
|
||||
|
||||
When the breakage is large — a bad prune across the whole instance, or a storage-level deletion (see the next section) — refreshing follows one at a time isn't enough. The generalized procedure:
|
||||
|
||||
1. List the keys that actually exist in storage, so you only touch the broken ones.
|
||||
2. For each account whose current `avatar`/`header` key is **absent**, null the `*_file_name` (the redownload workers skip accounts that still have a file name) and enqueue the worker.
|
||||
3. Let Sidekiq's `pull` queue drain.
|
||||
|
||||
```ruby
|
||||
require "aws-sdk-s3"; require "set"
|
||||
c = Aws::S3::Client.new(region: ENV["S3_REGION"], access_key_id: ENV["AWS_ACCESS_KEY_ID"], secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"])
|
||||
b = ENV["S3_BUCKET"]
|
||||
|
||||
def keys(c, b, prefix)
|
||||
s = Set.new; t = nil
|
||||
loop do
|
||||
r = c.list_objects_v2(bucket: b, prefix: prefix, continuation_token: t, max_keys: 1000)
|
||||
r.contents.each { |o| s << o.key }
|
||||
break unless r.is_truncated
|
||||
t = r.next_continuation_token
|
||||
end
|
||||
s
|
||||
end
|
||||
|
||||
avset = keys(c, b, "cache/accounts/avatars/")
|
||||
hdset = keys(c, b, "cache/accounts/headers/")
|
||||
|
||||
Account.where.not(domain: nil)
|
||||
.where("avatar_file_name IS NOT NULL OR header_file_name IS NOT NULL")
|
||||
.find_each(batch_size: 1000) do |a|
|
||||
if a.avatar_file_name.present? && a.avatar_remote_url.present? &&
|
||||
!avset.include?(a.avatar.path.sub(%r{^/}, ""))
|
||||
a.update_column(:avatar_file_name, nil)
|
||||
RedownloadAvatarWorker.perform_async(a.id)
|
||||
end
|
||||
if a.header_file_name.present? && a.header_remote_url.present? &&
|
||||
!hdset.include?(a.header.path.sub(%r{^/}, ""))
|
||||
a.update_column(:header_file_name, nil)
|
||||
RedownloadHeaderWorker.perform_async(a.id)
|
||||
end
|
||||
end
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
- Listing existing keys first means you re-fetch only what's missing, instead of re-downloading every avatar — which would re-bloat a bucket you may have just trimmed.
|
||||
- The workers return early if `*_file_name` is present, which is why you must `update_column(..., nil)` before enqueuing.
|
||||
- Avatars are small (tens of KB each), so re-fetching the whole missing set typically adds a few GB and a few hours of Sidekiq `pull` work. Headers are larger but still modest.
|
||||
- Origins that deleted the avatar after you cached it return 404 — the permanent, irrecoverable tail.
|
||||
|
||||
## Broader failure: storage-level deletion without DB de-ref
|
||||
|
||||
`--prune-profiles` is one way avatars vanish, but it at least nulls the database column, so the account re-fetches on its next `Update`. The **more dangerous** variant is deleting objects directly in your storage backend — a manual `aws s3 rm`, an S3 lifecycle expiration rule, a bucket migration that doesn't copy everything, or any "cost cleanup" done outside `tootctl`. Those delete the file but leave `accounts.avatar_file_name` **set**, pointing at an object that no longer exists.
|
||||
|
||||
Why it's worse:
|
||||
|
||||
- The DB still thinks the avatar is present, and the redownload workers skip the account (`*_file_name` is non-null) — so it never self-heals until an `Update` arrives.
|
||||
- It can hit **every** remote account at once, not just quiet ones.
|
||||
- It looks identical to the S3-ACL upload bug — see [Mastodon on S3 — Silent Upload Failures](mastodon-s3-acl-upload-failures.md). Tell them apart by checking whether new uploads succeed (ACL bug) versus only old objects being gone (a one-off deletion).
|
||||
|
||||
Recover with the [bulk restore](#bulk-restore-at-scale) procedure above. **Prevent** it by never deleting Mastodon media at the storage level: prune *attachments* through `tootctl media remove` (which derefs the DB and re-fetches on demand) and leave avatars/headers alone.
|
||||
|
||||
## Why `header_file_name IS NULL` is a bad signal
|
||||
|
||||
A naive script will treat both `avatar_file_name IS NULL` and `header_file_name IS NULL` as "broken." Don't.
|
||||
|
||||
Roughly 20% of Mastodon users never set a custom header — the default blank header isn't represented as a file, so `header_file_name` is legitimately `NULL` for them. After a `tootctl accounts refresh`, the field stays `NULL` because there is genuinely nothing to fetch. A script with `OR header_file_name IS NULL` will retry these accounts forever and never make progress.
|
||||
|
||||
Avatar is different — nearly all real users set one, so `avatar_file_name IS NULL AND avatar_remote_url IS NOT NULL` is a reliable "broken and fixable" signal.
|
||||
|
||||
## The cron decision
|
||||
|
||||
If your weekly media-prune cron currently looks like:
|
||||
|
||||
```bash
|
||||
bin/tootctl media remove --days=7 --concurrency=5
|
||||
bin/tootctl media remove --prune-profiles --days=7 --concurrency=5
|
||||
bin/tootctl media remove --remove-headers --days=7 --concurrency=5
|
||||
bin/tootctl preview_cards remove --days=30 --concurrency=5
|
||||
```
|
||||
|
||||
Consider deleting the middle two lines. The attachment prune is the real disk-saver (gigabytes per week on a busy instance). The avatar prune is small (~250 KB per remote account) and damages your UX. The header prune is even smaller and rarely worth it.
|
||||
|
||||
**Stronger recommendation:** after being bitten more than once, the safest policy is to **disable automated profile/header pruning entirely** — and reconsider scheduled `tootctl accounts refresh --all`, which re-fetches every profile and is destructive when uploads are failing at the time. Keep only a deliberate, occasional **attachment** prune if bucket size demands it. Pair that with a synthetic upload monitor (see [Mastodon on S3 — Silent Upload Failures](mastodon-s3-acl-upload-failures.md)) so any future regression is caught in hours instead of by a user weeks later.
|
||||
|
||||
## Edge cases
|
||||
|
||||
- **Origin-side 404:** the actor object advertises an avatar URL, but the URL itself returns 404. Your local cache stays empty no matter how many times you refresh. Only the origin user can fix it (re-upload). The script above will keep retrying these on every run; if that bothers you, add a "tried within last N hours" filter.
|
||||
- **Suspended accounts:** `tootctl accounts refresh` returns OK on suspended accounts but does not download media. They'll stay broken, which is correct behavior.
|
||||
- **Sidekiq backlog:** the avatar fetch is queued as a Sidekiq job, not done synchronously. If your `pull` queue is deep, you'll see a delay between "OK" and the avatar actually appearing in the database.
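The "tried within last N hours" filter mentioned for origin-side 404s can be sketched with a small state file. This is an assumption-laden illustration rather than part of the script above: `STATE_FILE`, `mark_tried`, and `recently_tried` are hypothetical names, and the file is a plain tab-separated list of account and Unix timestamp.

```bash
# Where attempt timestamps are kept (hypothetical default path).
STATE_FILE="${STATE_FILE:-/tmp/refresh-attempts.tsv}"

# Record "now" as the last attempt time for an account, replacing any
# earlier entry for the same account.
mark_tried() {
  acct=$1
  touch "$STATE_FILE"
  awk -F '\t' -v a="$acct" '$1 != a' "$STATE_FILE" > "$STATE_FILE.tmp"
  printf '%s\t%s\n' "$acct" "$(date +%s)" >> "$STATE_FILE.tmp"
  mv "$STATE_FILE.tmp" "$STATE_FILE"
}

# Succeed (exit 0) if the account was attempted within the last N hours.
recently_tried() {
  acct=$1
  hours=$2
  [ -f "$STATE_FILE" ] || return 1
  cutoff=$(( $(date +%s) - hours * 3600 ))
  awk -F '\t' -v a="$acct" -v c="$cutoff" \
    '$1 == a && $2 >= c { found = 1 } END { exit found ? 0 : 1 }' "$STATE_FILE"
}

# Inside the refresh loop this would read roughly:
#   recently_tried "$a" 24 && continue
#   mark_tried "$a"
```

Accounts that never recover then cost one refresh attempt per day instead of one per run.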
|
||||
|
||||
## Related
|
||||
|
||||
- [Mastodon Instance Tuning](mastodon-instance-tuning.md) — broader perf notes for self-hosters
|
||||
- [Mastodon DB Maintenance](mastodon-db-maintenance.md) — what to run on a schedule and when
|
||||
- [Mastodon Federation](mastodon-federation.md) — how the actor refresh fits into the larger federation model
|
||||
138
02-selfhosting/services/mastodon-s3-acl-upload-failures.md
Normal file
|
|
@ -0,0 +1,138 @@
|
|||
---
|
||||
title: Mastodon on S3 — Silent Upload Failures When the Bucket Disables ACLs
|
||||
description: Why a BucketOwnerEnforced S3 bucket plus a stale S3_PERMISSION/S3_ACL in .env.production makes every Mastodon media upload fail with AccessControlListNotSupported, how to diagnose it, and how to fix and monitor it.
|
||||
domain: selfhosting
|
||||
category: services
|
||||
tags:
|
||||
- mastodon
|
||||
- fediverse
|
||||
- self-hosting
|
||||
- aws
|
||||
- s3
|
||||
- paperclip
|
||||
- troubleshooting
|
||||
status: published
|
||||
created: 2026-06-01
|
||||
updated: 2026-06-01
|
||||
---
|
||||
|
||||
# Mastodon on S3 — Silent Upload Failures When the Bucket Disables ACLs
|
||||
|
||||
If your Mastodon instance stores media on S3 and you switch the bucket to **Object Ownership = `BucketOwnerEnforced`** (which AWS now recommends, and which the console nudges you toward), every media upload can start failing **silently** unless you also remove the object-ACL setting from `.env.production`. New avatars, headers, and attachments stop appearing; old ones keep working; nothing obvious is logged. This article is the diagnosis and fix.
|
||||
|
||||
## TL;DR
|
||||
|
||||
- `BucketOwnerEnforced` **disables ACLs entirely** on the bucket. Any request that carries an `x-amz-acl` header is rejected with `AccessControlListNotSupported: The bucket does not allow ACLs`.
|
||||
- Mastodon (via Paperclip) attaches `x-amz-acl` to every upload **if** `S3_PERMISSION` (or `S3_ACL`) is set in `.env.production`. The common value `S3_PERMISSION=public-read` — or a migration leftover like `S3_PERMISSION=private` — triggers the rejection.
|
||||
- Result: **every new upload fails**, but the database row is still updated, so Mastodon believes it has the file. The object never lands → broken image. Objects written *before* the bucket changed keep serving fine, which masks the problem.
|
||||
- **Fix:** set `S3_PERMISSION=` (empty) and remove any `S3_ACL=` line, then restart `mastodon-web` + `mastodon-sidekiq`. Public read is now served by the **bucket policy**, not per-object ACLs.
|
||||
|
||||
## Symptoms
|
||||
|
||||
- Newly-changed avatars/headers show broken; attachments on new posts fail to display.
|
||||
- Avatars that were cached **before** the bucket setting changed still work — so "some work, some don't."
|
||||
- `tootctl` and the web UI report success; Sidekiq doesn't obviously error.
|
||||
- Direct fetch of a broken object's URL returns **403 AccessDenied** (not 404 — see below).
|
||||
|
||||
## Why a missing object returns 403, not 404
|
||||
|
||||
A typical Mastodon S3 bucket policy grants public `s3:GetObject` but **not** `s3:ListBucket`. Without `ListBucket`, S3 hides whether a key exists: a `GET` on a **missing** key returns **403 AccessDenied**, identical to a permissions denial. So "403" here usually means *the object isn't there*, not *the object is forbidden*. This is why the failure reads like a permissions problem when it's really a failed write.
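When checking URLs by hand, it helps to keep that ambiguity in front of you. A small sketch of the interpretation, assuming the bucket policy described above (public `s3:GetObject`, no `s3:ListBucket`); `explain_s3_status` is a hypothetical helper name:

```bash
# Translate the HTTP status of an anonymous GET into what it most likely means
# for a bucket that grants public s3:GetObject but not s3:ListBucket.
explain_s3_status() {
  case $1 in
    200) echo "object exists and is publicly readable" ;;
    403) echo "object missing or forbidden (indistinguishable without s3:ListBucket)" ;;
    404) echo "object missing (the caller is allowed to list the bucket)" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Usage against a real avatar URL (placeholder shown):
# explain_s3_status "$(curl -s -o /dev/null -w '%{http_code}' "$url")"
```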
|
||||
|
||||
## Diagnosis
|
||||
|
||||
Run these with the instance's own S3 credentials (e.g. via `bin/rails runner`, which loads `.env.production`):
|
||||
|
||||
```ruby
|
||||
require "aws-sdk-s3"
|
||||
c = Aws::S3::Client.new(region: ENV["S3_REGION"],
|
||||
access_key_id: ENV["AWS_ACCESS_KEY_ID"],
|
||||
secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"])
|
||||
b = ENV["S3_BUCKET"]
|
||||
|
||||
# 1. Is the bucket ACL-disabled?
|
||||
puts c.get_bucket_ownership_controls(bucket: b).ownership_controls.rules.map(&:object_ownership).inspect
|
||||
# => ["BucketOwnerEnforced"] <-- ACLs are OFF
|
||||
|
||||
# 2. Does an upload WITH an ACL fail, and WITHOUT one succeed?
|
||||
begin
  c.put_object(bucket: b, key: "tmp/acltest", body: "x", acl: "public-read")
rescue => e
  puts "PUT+acl FAILS: #{e.class} / #{e.message}"   # AccessControlListNotSupported
else
  puts "PUT+acl: OK"
  c.delete_object(bucket: b, key: "tmp/acltest")    # clean up if the bucket still accepts ACLs
end
c.put_object(bucket: b, key: "tmp/noacltest", body: "x")   # succeeds
c.delete_object(bucket: b, key: "tmp/noacltest")
|
||||
|
||||
# 3. Confirm a "broken" avatar's object is actually missing
|
||||
# find_by! raises a clear error if the account isn't in the database
key = Account.find_by!(username: "someuser", domain: "remote.tld").avatar.path.sub(%r{^/}, "")
begin
  c.head_object(bucket: b, key: key)
  puts "EXISTS"
rescue Aws::S3::Errors::NotFound
  puts "MISSING"
end
|
||||
```
|
||||
|
||||
If #1 shows `BucketOwnerEnforced` and #2 shows the ACL'd PUT failing while the plain PUT succeeds, you've confirmed it.
|
||||
|
||||
Check `.env.production` for the offending settings:
|
||||
|
||||
```bash
|
||||
grep -E '^S3_(ACL|PERMISSION|NO_INHERIT)' /home/mastodon/live/.env.production
|
||||
# S3_ACL=private <-- remove
|
||||
# S3_PERMISSION=private <-- set empty
|
||||
```
|
||||
|
||||
## The fix
|
||||
|
||||
1. Edit `.env.production`:
|
||||
- `S3_PERMISSION=` (empty — Paperclip then sends no `x-amz-acl` header)
|
||||
- remove/comment any `S3_ACL=` line
|
||||
2. Restart so the env is reloaded: `systemctl restart mastodon-sidekiq mastodon-web`
|
||||
3. Verify the previously-failing write path now works — reprocess any existing avatar and confirm it serves 200:
|
||||
|
||||
```ruby
|
||||
a = Account.local.first
|
||||
a.avatar.reprocess! # used to raise AccessControlListNotSupported; now succeeds
|
||||
```
|
||||
|
||||
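Public readability is now provided by the **bucket policy** (grant `s3:GetObject` on `arn:aws:s3:::your-bucket/*` to `Principal: "*"`). For that policy to take effect, **Block Public Access** must allow public bucket policies: `BlockPublicPolicy` and `RestrictPublicBuckets` off at both the account and the bucket level. The two ACL-related settings (`BlockPublicAcls`, `IgnorePublicAcls`) can stay on, because you do **not** need per-object ACLs at all.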
|
||||
|
||||
### Recovering the avatars that broke while it was failing
|
||||
|
||||
Any media that failed to upload during the broken window is gone from S3 while the DB still references it. Because Mastodon's redownload workers **skip accounts whose `*_file_name` is already set**, you must null the dead reference first, then enqueue the worker. See [Mastodon — The `--prune-profiles` Trap and How to Recover](mastodon-prune-profiles-trap.md#bulk-restore-at-scale) for the bulk procedure.
|
||||
|
||||
## Don't let it happen silently again — monitor uploads
|
||||
|
||||
The worst part of this bug is the silence. Add a periodic **synthetic write check** that uploads a tiny object with the app's own credentials, confirms it, deletes it, and alerts on failure:
|
||||
|
||||
```ruby
|
||||
# s3 is an Aws::S3::Client and b the bucket name, built as in the Diagnosis snippet
s3.put_object(bucket: b, key: "health/upload-check", body: "ok")   # no acl
s3.head_object(bucket: b, key: "health/upload-check")
s3.delete_object(bucket: b, key: "health/upload-check")
# any exception -> email an alert
|
||||
```
|
||||
|
||||
Pair it with an HTTP check that your **local** account avatars all return 200 (they always should). Run both every few hours from cron. A regression then pages you in hours instead of being discovered by a user weeks later.
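For a cron-friendly version of the synthetic write check, a shell wrapper around the AWS CLI is one option. This is a sketch under assumptions the article does not state: that the `aws` CLI is installed on the host and runs with the same credentials as Mastodon. `s3_write_check` is a hypothetical function name.

```bash
# Upload a tiny object without an ACL, confirm it exists, then delete it.
# Returns 0 if the write path works, 1 otherwise.
s3_write_check() {
  bucket=$1
  key="health/upload-check"
  body=$(mktemp) || return 1
  printf 'ok' > "$body"
  status=0
  # Deliberately no --acl: this exercises the same ACL-free write path as the app.
  if aws s3api put-object --bucket "$bucket" --key "$key" --body "$body" >/dev/null; then
    aws s3api head-object --bucket "$bucket" --key "$key" >/dev/null || status=1
    # Best-effort cleanup; a failed delete is not counted as an upload failure.
    aws s3api delete-object --bucket "$bucket" --key "$key" >/dev/null || true
  else
    status=1
  fi
  rm -f "$body"
  return "$status"
}

# From cron, with MAILTO set so stderr output is mailed:
# s3_write_check "$S3_BUCKET" || echo "Mastodon S3 upload check FAILED" >&2
```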
|
||||
|
||||
## Ansible enforcement
|
||||
|
||||
If you manage the host with Ansible, enforce the safe values so a future template render can't reintroduce the ACL header:
|
||||
|
||||
```yaml
|
||||
- name: Ensure S3_PERMISSION is empty (no x-amz-acl on uploads)
|
||||
ansible.builtin.lineinfile:
|
||||
path: /home/mastodon/live/.env.production
|
||||
regexp: '^S3_PERMISSION='
|
||||
line: 'S3_PERMISSION='
|
||||
notify: Restart Mastodon services
|
||||
|
||||
- name: Remove any active S3_ACL line (ACLs unsupported on this bucket)
|
||||
ansible.builtin.lineinfile:
|
||||
path: /home/mastodon/live/.env.production
|
||||
regexp: '^S3_ACL=.+'
|
||||
state: absent
|
||||
notify: Restart Mastodon services
|
||||
```
|
||||
|
||||
## Related
|
||||
|
||||
- [Mastodon — The `--prune-profiles` Trap and How to Recover](mastodon-prune-profiles-trap.md) — the other way avatars go missing, plus the bulk-restore script
|
||||
- [Mastodon Post-Install Hardening (Permissions + Account)](mastodon-post-install-hardening.md)
|
||||
- [AWS S3 Cost Management](../cloud/aws-s3-cost-management.md) — pruning attachments to control bucket size (safely)
|
||||
|
|
@ -0,0 +1,276 @@
|
|||
---
|
||||
title: "Inbound Spam Filtering: spamass-milter + SpamAssassin Bayes on Postfix/Dovecot (Fedora)"
|
||||
domain: selfhosting
|
||||
category: services
|
||||
tags: [postfix, dovecot, spamassassin, spamass-milter, bayes, spam, sieve, fedora, email, selinux]
|
||||
status: published
|
||||
created: 2026-06-04
|
||||
updated: 2026-06-05
|
||||
---
|
||||
# Inbound Spam Filtering: spamass-milter + SpamAssassin Bayes on Postfix/Dovecot
|
||||
|
||||
How to add inbound spam scanning to a Postfix/Dovecot virtual-mailbox server on Fedora: SpamAssassin scans every inbound message via `spamass-milter`, spam is **tagged (never rejected)**, Dovecot's Sieve files it into the user's `Junk` folder, and a **site-wide Bayes database** — shared between the scan path and manual `sa-learn` training — learns from your real mail.
|
||||
|
||||
This is a "tag and quarantine" design (not "reject at SMTP"), which is the safe default: a misfire lands a message in Junk for review rather than bouncing legitimate mail.
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
inbound SMTP (25) ─► Postfix smtpd
|
||||
│ smtpd_milters:
|
||||
│ 1. OpenDKIM (verify/sign)
|
||||
│ 2. spamass-milter ─► spamc ─► spamd (SpamAssassin)
|
||||
│ adds X-Spam-Flag / X-Spam-Status headers
|
||||
▼
|
||||
Dovecot LMTP delivery ─► global Sieve
|
||||
if X-Spam-Flag: YES ─► fileinto "Junk"
|
||||
else ─► INBOX
|
||||
|
||||
Bayes DB /var/lib/spamassassin/bayes/ (site-wide, shared)
|
||||
├─ spamd auto-learns at scan time
|
||||
└─ sa-learn manual/scripted training from Maildir folders
|
||||
```
|
||||
|
||||
## 1. Install
|
||||
|
||||
```bash
|
||||
sudo dnf install spamassassin spamass-milter
|
||||
sudo systemctl enable --now spamassassin # spamd
|
||||
```
|
||||
|
||||
On Fedora the `spamass-milter` unit runs as the unprivileged **`sa-milt`** user and creates its socket at `/run/spamass-milter/spamass-milter.sock`. Remember that user — the Bayes DB ownership and the socket permissions both hinge on it.
|
||||
|
||||
## 2. Configure spamass-milter — tag-only
|
||||
|
||||
Edit `/etc/sysconfig/spamass-milter`:
|
||||
|
||||
```sh
|
||||
EXTRA_FLAGS="-a -r 999999"
|
||||
```
|
||||
|
||||
> [!warning] The `-r` flag is a footgun
|
||||
> `-r nn` rejects mail scoring ≥ `nn` at SMTP time. **Omitting `-r` does NOT mean "never reject"** — this build still rejects flagged spam at a low default threshold (a GTUBE test will get `550 Blocked by SpamAssassin`). To get pure tag-only behaviour, set the threshold absurdly high (`-r 999999`) so nothing ever reaches it. Do **not** use `-r -1` — that means "reject anything tagged as spam."
|
||||
|
||||
- `-a` — skip messages on **authenticated** connections, so your own outbound/submission mail isn't scanned or tagged.
|
||||
|
||||
## 3. Socket permissions (so Postfix can connect)
|
||||
|
||||
By default the socket is created `0755`, so Postfix (running as `postfix`) can't write to it. Setting the unit's umask to `0007` makes it `0770 sa-milt:sa-milt`, and adding `postfix` to the `sa-milt` group then lets Postfix connect. Two steps:
|
||||
|
||||
```bash
|
||||
# 1. Let the socket be group-accessible
|
||||
sudo install -d /etc/systemd/system/spamass-milter.service.d
|
||||
printf '[Service]\nUMask=0007\n' | sudo tee /etc/systemd/system/spamass-milter.service.d/socket-perms.conf
|
||||
|
||||
# 2. Put postfix in the sa-milt group, then RESTART postfix (group is read at start)
|
||||
sudo usermod -aG sa-milt postfix
|
||||
|
||||
sudo systemctl daemon-reload
|
||||
sudo systemctl enable --now spamass-milter
|
||||
```
|
||||
|
||||
Verify: `sudo -u postfix test -w /run/spamass-milter/spamass-milter.sock && echo OK`.
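The umask arithmetic behind step 1 is easy to check directly: a new Unix socket gets mode `0777` with the umask bits cleared. A small sketch (`socket_mode_for_umask` is a hypothetical helper; pass the umask in octal with a leading `0`):

```bash
# Mode a freshly created Unix socket ends up with under a given umask:
# 0777 with the umask bits masked out.
socket_mode_for_umask() {
  printf '%04o\n' $(( 0777 & ~$1 ))
}

socket_mode_for_umask 0022   # prints 0755 (group cannot write)
socket_mode_for_umask 0007   # prints 0770 (group can read and write)
```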
|
||||
|
||||
## 4. Wire into Postfix
|
||||
|
||||
Append the milter **alongside** OpenDKIM — don't replace it. Inbound (`smtpd`) gets both; local-injected mail (`non_smtpd`) stays DKIM-only.
|
||||
|
||||
```bash
|
||||
postconf -e 'smtpd_milters = local:/run/opendkim/opendkim.sock unix:/run/spamass-milter/spamass-milter.sock'
|
||||
postconf -e 'milter_default_action = accept' # if SA is down, accept the mail — never defer/bounce
|
||||
sudo systemctl restart postfix # restart (not reload) to pick up the new group
|
||||
```
|
||||
|
||||
`milter_default_action = accept` is important: if the milter ever hiccups, mail still flows.
|
||||
|
||||
## 5. Site-wide Bayes DB
|
||||
|
||||
Put the Bayes DB in one fixed location so the scan path and your training script share it. In `/etc/mail/spamassassin/local.cf`:
|
||||
|
||||
```
|
||||
use_bayes 1
|
||||
bayes_auto_learn 1
|
||||
bayes_path /var/lib/spamassassin/bayes/bayes
|
||||
bayes_file_mode 0660
|
||||
```
|
||||
|
||||
Create the directory owned by the **scanning user** (`sa-milt`), under `/var/lib/spamassassin` so it inherits the correct SELinux type (`spamd_var_lib_t`):
|
||||
|
||||
```bash
|
||||
sudo install -d -m 2770 -o sa-milt -g sa-milt /var/lib/spamassassin/bayes
|
||||
sudo restorecon -Rv /var/lib/spamassassin/bayes
|
||||
sudo systemctl restart spamassassin
|
||||
```
|
||||
|
||||
The setgid `2770` directory plus `bayes_file_mode 0660` means that all parties can read and write the DB, whether it was written by `spamd` (as `sa-milt`) or by `sa-learn` (as `root`, from a training script).
|
||||
|
||||
## 6. File spam into Junk (Dovecot Sieve)
|
||||
|
||||
A global Sieve before-script files anything SpamAssassin flagged. `/etc/dovecot/sieve/global/spam-to-junk.sieve`:
|
||||
|
||||
```sieve
|
||||
require ["fileinto", "mailbox"];
|
||||
if anyof (header :contains "X-Spam-Flag" "YES", header :contains "X-Spam-Status" "Yes") {
|
||||
fileinto :create "Junk";
|
||||
stop;
|
||||
}
|
||||
```
|
||||
|
||||
Register it as a global before-script in `dovecot.conf` (NOT under `plugin {}` on Pigeonhole 2.4+ — see warning below), then compile and restart Dovecot:
|
||||
|
||||
```bash
|
||||
sievec /etc/dovecot/sieve/global/spam-to-junk.sieve # produces .svbin
|
||||
systemctl restart dovecot
|
||||
```
|
||||
|
||||
> [!warning] Pigeonhole 2.4 dropped `plugin/sieve_before` — it silently does nothing
|
||||
> Before Dovecot/Pigeonhole 2.4, the canonical way to register a global before-script was:
|
||||
>
|
||||
> ```
|
||||
> plugin {
|
||||
> sieve_before = /etc/dovecot/sieve/global/spam-to-junk.sieve
|
||||
> }
|
||||
> ```
|
||||
>
|
||||
> On **Dovecot 2.4+**, that setting is gone and **silently ignored**: there is no warning at start-up, the script never runs, and mail tagged `X-Spam-Flag: YES` lands in INBOX, leaving you wondering why nothing files it. The 2.4 replacement is a top-level `sieve_script` block (not inside `plugin {}`):
|
||||
>
|
||||
> ```
|
||||
> sieve_script spam_before {
|
||||
> type = before
|
||||
> path = /etc/dovecot/sieve/global/spam-to-junk.sieve
|
||||
> }
|
||||
> ```
|
||||
>
|
||||
> Verify with `doveconf -n | grep -A2 spam_before`. If it doesn't appear, dovecot.conf isn't reading your file — check that `!include conf.d/*.conf` exists in dovecot.conf (some Fedora rebuilds ship a flat dovecot.conf without it; the block has to live in dovecot.conf directly).
|
||||
|
||||
## 6b. (Optional) Route spam to a separate mailbox — silence iOS push notifications
|
||||
|
||||
`fileinto :create "Junk"` moves spam to the user's `.Junk` folder, but the user's IMAP session still sees a new-message event in INBOX (briefly, before sieve moves it) or in Junk (depending on client subscriptions). For clients with IMAP IDLE + push, that's a notification you don't want — e.g. Spark on iPhone/iPad fires APNS on any new message touching a subscribed folder.
|
||||
|
||||
To make spam **invisible to the user's mailbox entirely**, REDIRECT the envelope at Postfix `cleanup` (after the milter adds `X-Spam-Flag`, before LMTP delivery) so spam lands in a separate `junk@` mailbox the user doesn't subscribe to:
|
||||
|
||||
```bash
|
||||
# /etc/postfix/cleanup_header_checks
|
||||
/^X-Spam-Flag:[[:space:]]+YES/ REDIRECT junk@example.com
|
||||
```
|
||||
|
||||
```bash
|
||||
# Headers added by a milter are inspected by milter_header_checks, not header_checks
postconf -e 'milter_header_checks = regexp:/etc/postfix/cleanup_header_checks'
|
||||
systemctl reload postfix
|
||||
```
|
||||
|
||||
> [!tip] Use `regexp:`, not `pcre:`, on stock Fedora
|
||||
> `pcre:` requires the `postfix-pcre` package. `regexp:` is built into postfix and supports POSIX extended regex — use `[[:space:]]+` for whitespace and `\\\\` for backslash. The patterns in cleanup_header_checks are simple enough that regexp is plenty.
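Before reloading Postfix, the pattern itself can be sanity-checked with `grep -E`, which uses the same POSIX extended syntax. This is only an approximation of the Postfix table lookup: `matches_spam_flag` is a hypothetical helper, and `-i` is there to mirror the case-insensitive default of `regexp:` tables.

```bash
# Return 0 if a header line would match the REDIRECT pattern, 1 otherwise.
matches_spam_flag() {
  printf '%s\n' "$1" | grep -Eiq '^X-Spam-Flag:[[:space:]]+YES'
}

for h in 'X-Spam-Flag: YES' 'X-Spam-Flag: NO' 'X-Spam-Status: Yes, score=12.3'; do
  if matches_spam_flag "$h"; then
    echo "match:    $h"
  else
    echo "no match: $h"
  fi
done
```

On the mail host itself, `postmap -q 'X-Spam-Flag: YES' regexp:/etc/postfix/cleanup_header_checks` queries the real table and prints the action if the line matches.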
|
||||
|
||||
The Sieve from §6 still runs as a safety net for any tagged message that escapes the cleanup REDIRECT (e.g. a message addressed to the junk@ mailbox itself, or aliases not covered by the REDIRECT rule). Defense in depth.
|
||||
|
||||
Train Bayes from the `junk@` Maildir instead of (or in addition to) per-user Junk folders:
|
||||
|
||||
```bash
|
||||
sa-learn --spam /var/vmail/example.com/junk/{cur,new}
|
||||
```
|
||||
|
||||
## 7. Training the Bayes filter
|
||||
|
||||
SpamAssassin's Bayes only starts scoring once it has learned **≥ 200 spam AND ≥ 200 ham** (`bayes_min_spam_num` / `bayes_min_ham_num`). Train from your Maildir folders with `sa-learn`. **Run it as `root`** — root can read every user's Maildir *and* write the Bayes DB.
|
||||
|
||||
```bash
|
||||
# Spam — your Junk folder(s) and any dedicated spam mailbox
|
||||
sa-learn --spam /var/vmail/example.com/user/.Junk/{cur,new}
|
||||
|
||||
# Ham — Sent + Inbox (known-good)
|
||||
sa-learn --ham /var/vmail/example.com/user/{cur,new}
|
||||
sa-learn --ham /var/vmail/example.com/user/.Sent/{cur,new}
|
||||
|
||||
sa-learn --sync
|
||||
sa-learn --dump magic | grep -E 'nspam|nham'
|
||||
```
|
||||
|
||||
`bayes_path` is read from `local.cf`, so no `--dbpath` is needed.
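
The `nspam` / `nham` counters from the dump can be compared against the 200/200 floor automatically. This is a minimal sketch, assuming the count sits in the third column of the `non-token data:` lines printed by `sa-learn --dump magic`; the function name `bayes_status` is hypothetical.

```bash
#!/usr/bin/env bash
# Read `sa-learn --dump magic` output on stdin and report whether Bayes
# has passed the learn floor (default 200 spam and 200 ham).
# Assumption: the count is the third column of the nspam / nham lines.
bayes_status() {
    awk -v floor="${1:-200}" '
        /non-token data: nspam$/ { spam = $3 }
        /non-token data: nham$/  { ham  = $3 }
        END {
            printf "nspam=%d nham=%d ", spam, ham
            if (spam >= floor && ham >= floor) print "active"
            else print "below-floor"
        }'
}

# Usage on the mail host:
#   sa-learn --dump magic | bayes_status
```

If you have changed `bayes_min_spam_num` / `bayes_min_ham_num`, pass the new floor as the first argument.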
|
||||
|
||||
> [!tip] Keep spam and ham roughly balanced
|
||||
> Bayes accuracy drops when one corpus dwarfs the other (aim for within ~3:1). Don't dump a 90,000-message archive of ham against a few hundred spam — it biases everything toward "ham" and spam slips through. Use Sent + recent Inbox for ham, not your entire archive.
|
||||
|
||||
> [!warning] Train manually, not from cron — unless your folders are always clean
|
||||
> `sa-learn` learns whatever is *in* the folder. If a spam slips into the Inbox, or you haven't yet rescued a false-positive out of Junk, an unattended cron run will mislearn it. Prefer a manual script you run **after** triaging Junk/Inbox. (`sa-learn` is idempotent and re-classifies on re-run, so a mistake is fixable: move the message to the right folder and run again.)
|
||||
|
||||
### 7a. Weekly systemd timer (safe when junk@ is dedicated and INBOX is curated)
|
||||
|
||||
The warning above is the safe default. If you use the §6b REDIRECT-to-junk@ pattern, **the junk mailbox is pure spam by design** (only `X-Spam-Flag:YES` envelopes reach it), and your INBOX is curated by hand — the misclassification risk drops to near zero, and a weekly timer becomes both safe and useful. Add `--force-expire` to age out stale tokens so the Bayes corpus doesn't drift.
|
||||
|
||||
```ini
|
||||
# /etc/systemd/system/sa-learn-majormail.service
|
||||
[Unit]
|
||||
Description=SpamAssassin Bayes training from majorshouse.com Maildir
|
||||
After=spamassassin.service
|
||||
Wants=spamassassin.service
|
||||
|
||||
[Service]
|
||||
Type=oneshot
|
||||
Nice=10
|
||||
IOSchedulingClass=idle
|
||||
ExecStart=/usr/bin/sa-learn --spam --no-sync \
|
||||
/var/vmail/example.com/junk/cur \
|
||||
/var/vmail/example.com/junk/new
|
||||
ExecStart=/usr/bin/sa-learn --ham --no-sync \
|
||||
/var/vmail/example.com/user/cur \
|
||||
/var/vmail/example.com/user/new \
|
||||
/var/vmail/example.com/user/.Sent/cur \
|
||||
/var/vmail/example.com/user/.Sent/new
|
||||
ExecStart=/usr/bin/sa-learn --sync
|
||||
ExecStart=/usr/bin/sa-learn --force-expire
|
||||
```
|
||||
|
||||
```ini
|
||||
# /etc/systemd/system/sa-learn-majormail.timer
|
||||
[Unit]
|
||||
Description=Weekly SpamAssassin Bayes training + expiry
|
||||
|
||||
[Timer]
|
||||
OnCalendar=Sun 04:15
|
||||
Persistent=true
|
||||
RandomizedDelaySec=20min
|
||||
|
||||
[Install]
|
||||
WantedBy=timers.target
|
||||
```
|
||||
|
||||
```bash
|
||||
systemctl daemon-reload
|
||||
systemctl enable --now sa-learn-majormail.timer
|
||||
systemctl list-timers sa-learn-majormail.timer
|
||||
```
|
||||
|
||||
`Persistent=true` runs the missed job on next boot if the host was off at 04:15. `--force-expire` forces an expiry attempt on every run, but SpamAssassin generally only removes tokens once the database has grown close to `bayes_expiry_max_db_size` (150,000 tokens by default), so on a small corpus the step usually changes nothing.
|
||||
|
||||
## 8. Test
|
||||
|
||||
Send a [GTUBE](https://spamassassin.apache.org/gtube/) probe through port 25 (unauthenticated) and a normal message:
|
||||
|
||||
```bash
|
||||
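# run from ANOTHER host whose MTA delivers to this server over SMTP :25; GTUBE scores ~1000
# (a local sendmail injection on the mail host itself skips the smtpd milter)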
|
||||
printf 'Subject: gtube\n\nXJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X\n' \
|
||||
| sendmail -f test@example.org user@example.com
|
||||
```
|
||||
|
||||
Confirm in `/var/log/maillog` that `spamd` scanned it (`result: Y …`), the message was **delivered** (no `milter-reject`), it landed in `.Junk`, and the stored message has `X-Spam-Flag: YES`.
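
The last check (the stored message carries `X-Spam-Flag: YES`) can be scripted. A minimal sketch, assuming Maildir files with one message per file; the function name `is_flagged_spam` is hypothetical. It only inspects the header section, so a body that merely quotes the header does not count.

```bash
#!/usr/bin/env bash
# Succeeds if the header section of a stored message (everything before the
# first blank line) contains X-Spam-Flag: YES.
is_flagged_spam() {
    awk '/^\r?$/ { exit } { print }' "$1" \
        | grep -Eiq '^X-Spam-Flag:[[:space:]]+YES'
}

# Usage on the mail host, listing flagged messages in the Junk folder:
#   for f in /var/vmail/example.com/user/.Junk/new/*; do
#       is_flagged_spam "$f" && echo "$f"
#   done
```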
|
||||
|
||||
## Gotchas recap
|
||||
|
||||
| Symptom | Cause | Fix |
|
||||
|---|---|---|
|
||||
| Spam gets `550 Blocked by SpamAssassin` (you wanted Junk) | spamass-milter rejects at a default threshold | `-r 999999` for tag-only |
|
||||
| Postfix can't reach the milter socket | socket `0755`, postfix not in `sa-milt` group | `UMask=0007` drop-in + `usermod -aG sa-milt postfix` + restart postfix |
|
||||
| `sa-learn` trains but `spamd` doesn't use it | per-user vs site Bayes mismatch | set `bayes_path` in `local.cf` (site-wide) |
|
||||
| Bayes never scores (`BAYES_*` absent) | below the 200/200 learn floor | train more, keep spam/ham balanced |
|
||||
| Your own outbound mail gets tagged | scanning authenticated mail | `-a` flag |
|
||||
| AVC denials on the Bayes DB (SELinux) | DB outside `/var/lib/spamassassin` | keep it under that path (`spamd_var_lib_t`) + `restorecon` |
|
||||
| `plugin/sieve_before` does nothing — spam keeps reaching INBOX | Pigeonhole 2.4 silently dropped that setting | use the top-level `sieve_script <name> { type = before; path = ...; }` block instead |
|
||||
| `postfix reload` fails: `unsupported dictionary type: pcre` | `pcre:` map requires `postfix-pcre` package | install it, OR use `regexp:` (built-in POSIX) |
|
||||
| Sieve `fileinto Junk` still notifies Spark/iOS | client subscribes to Junk; LMTP delivery briefly hits INBOX | REDIRECT envelope at Postfix cleanup (§6b) so the message never reaches the user's mailbox at all |
|
||||
| Local `sendmail` test doesn't trigger REDIRECT | `sendmail` bypasses smtpd milters → no `X-Spam-Flag` added | inject through SMTP :25 (e.g. swaks) OR pre-set the header in the test message |
|
||||
|
||||
## See also
|
||||
|
||||
- [[selinux-dovecot-vmail-context|SELinux: Fixing Dovecot Mail Spool Context (/var/vmail)]]
|
||||
- [[linux-server-hardening-checklist|Linux Server Hardening Checklist]] (basic `sa-learn` section)
|
||||
137
02-selfhosting/storage-backup/restic-b2-fleet-backups.md
Normal file
|
|
@ -0,0 +1,137 @@
|
|||
---
|
||||
title: "App-Consistent Fleet Backups with restic + Backblaze B2"
|
||||
domain: selfhosting
|
||||
category: storage-backup
|
||||
tags: [restic, backblaze, b2, backup, ansible, systemd, postgresql, mysql, sqlite, docker, disaster-recovery]
|
||||
status: published
|
||||
created: 2026-06-19
|
||||
updated: 2026-06-19
|
||||
---
|
||||
|
||||
# App-Consistent Fleet Backups with restic + Backblaze B2
|
||||
|
||||
A repeatable pattern for backing up a mixed fleet (Ubuntu + Fedora, VPS + homelab, bare services + Docker) to Backblaze B2 with [restic](https://restic.net) — encrypted, deduplicated, and **app-consistent** (databases are dumped before the snapshot, not copied live). Driven by Ansible and a per-host `systemd` timer.
|
||||
|
||||
## The Short Answer
|
||||
|
||||
Per host, nightly: **dump every database to a staging dir → `restic backup` that staging dir plus the data paths → apply retention → wipe staging.** A monthly timer runs `restic prune`. Anything that fails emails the admin. One B2 bucket holds a separate repo per host at `b2:<bucket>:<hostname>`.
|
||||
|
||||
Retention is `--keep-daily 7 --keep-weekly 4 --keep-monthly 6` (~6 months of history).
|
||||
|
||||
## Why dump databases first
|
||||
|
||||
Copying a live database's files (`/var/lib/mysql`, a running SQLite file, a Postgres data dir) gives you a *crash-consistent* copy at best — restorable only if you're lucky. Logical dumps are guaranteed consistent:
|
||||
|
||||
- **MySQL / MariaDB:** `mysqldump --single-transaction --routines --triggers --databases <db>`
|
||||
- **PostgreSQL:** `pg_dump -Fc <db>` (custom format) via the `postgres` system user (peer auth)
|
||||
- **SQLite:** `sqlite3 <file> ".backup '<out>'"` — uses the online backup API, safe against a running writer
|
||||
- **Dockerized DBs:** `docker exec <container> sh -c '<dump cmd>'`, letting the container's own shell expand its root-password env var
|
||||
|
||||
restic then backs up the dump files. Plain-text SQL dumps dedupe well (mostly the changed blocks upload each night); `pg_dump -Fc` output is compressed by default, so it tends to dedupe less effectively.
|
||||
|
||||
## Repository layout
|
||||
|
||||
- **One private B2 bucket** (e.g. `majorshouse-backups`).
|
||||
- **One repo per host:** `b2:majorshouse-backups:<hostname>`.
|
||||
- The application key needs **read + write + delete** for the bucket. restic deletes objects during `forget`/`prune`, so a pure *append-only* key will break retention. (True append-only requires splitting `forget`/`prune` onto a separate maintenance key — a worthwhile hardening step, but not the default.)
|
||||
- Credentials live in an `EnvironmentFile` (`/etc/restic/restic-env`, mode `0600`, root): `RESTIC_REPOSITORY`, `RESTIC_PASSWORD`, `B2_ACCOUNT_ID`, `B2_ACCOUNT_KEY`.
|
||||
|
||||
## The backup script (shape)
|
||||
|
||||
```bash
|
||||
set -uo pipefail
|
||||
STAGING=/var/backups/restic-staging
|
||||
rm -rf "$STAGING"; mkdir -p "$STAGING"; chmod 700 "$STAGING"
|
||||
|
||||
# per-engine dumps into $STAGING ...
|
||||
mysqldump --single-transaction --routines --triggers --databases wordpress > "$STAGING/mysql-wordpress.sql"
|
||||
sudo -u postgres pg_dump -Fc mastodon_production > "$STAGING/pg-mastodon_production.dump"
|
||||
sqlite3 /opt/phantombot/config/phantombot.db ".backup '$STAGING/sqlite-phantombot.db'"
|
||||
|
||||
restic backup --tag fleet-backup --host "$(hostname -s)" \
|
||||
"$STAGING" /var/www /etc/letsencrypt --exclude /path/to/already-offsite/media
|
||||
|
||||
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6
|
||||
rm -rf "$STAGING"
|
||||
```
|
||||
|
||||
Wrap each step so a failure mails the admin and aborts (don't silently back up a half-state). On hosts where the `mail` CLI is absent, pipe a message to `/usr/sbin/sendmail -t` instead.
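
One way to wrap the steps is sketched below. `ADMIN_EMAIL`, `notify` and `run_step` are hypothetical names for illustration, not taken from the deployed script, and the mail format is an assumption. Each step is passed as a function so its own redirections (for example the `> "$STAGING/…"` of a dump) stay inside the step.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: ADMIN_EMAIL, notify and run_step are illustration
# names, not part of the original script.
ADMIN_EMAIL="${ADMIN_EMAIL:-root}"

# Mail the admin via sendmail -t, which reads the recipient from the headers.
notify() {
    printf 'To: %s\nSubject: [restic] %s failed on %s\n\n%s\n' \
        "$ADMIN_EMAIL" "$1" "$(hostname -s)" "$2" | /usr/sbin/sendmail -t
}

# Run one step; on failure, mail its stderr and return non-zero so the
# caller can abort before restic snapshots a half-finished staging dir.
run_step() {
    local name="$1" err
    shift
    if ! err="$("$@" 2>&1 >/dev/null)"; then
        notify "$name" "$err"
        return 1
    fi
}

# Usage inside the backup script:
#   dump_wordpress() {
#       mysqldump --single-transaction --routines --triggers \
#           --databases wordpress > "$STAGING/mysql-wordpress.sql"
#   }
#   run_step "mysqldump wordpress" dump_wordpress || exit 1
```

Returning instead of exiting inside `run_step` keeps the abort decision in the caller, which is convenient if some steps (such as the final staging cleanup) should still run after a failure.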
|
||||
|
||||
## systemd units
|
||||
|
||||
A oneshot service + a timer. Stagger `OnCalendar` per host to spread B2 load, and **always set `RESTIC_CACHE_DIR`** (see Gotchas):
|
||||
|
||||
```ini
|
||||
# restic-backup.service
|
||||
[Service]
|
||||
Type=oneshot
|
||||
EnvironmentFile=/etc/restic/restic-env
|
||||
Environment=RESTIC_CACHE_DIR=/var/cache/restic
|
||||
ExecStart=/usr/local/sbin/restic-backup.sh
|
||||
Nice=10
|
||||
IOSchedulingClass=idle
|
||||
```
|
||||
|
||||
```ini
|
||||
# restic-backup.timer
|
||||
[Timer]
|
||||
OnCalendar=*-*-* 02:30:00
|
||||
RandomizedDelaySec=20m
|
||||
Persistent=true
|
||||
[Install]
|
||||
WantedBy=timers.target
|
||||
```
|
||||
|
||||
A second `restic-prune.timer` runs `restic prune` monthly (`OnCalendar=*-*-01 04:00:00`).
|
||||
|
||||
## Restore procedure
|
||||
|
||||
The whole point. From the target host (or any host with the repo creds):
|
||||
|
||||
```bash
|
||||
# load repo + B2 creds without echoing them
|
||||
set -a; . /etc/restic/restic-env; set +a
|
||||
|
||||
restic snapshots # list; note the snapshot ID or use 'latest'
|
||||
|
||||
# restore specific paths to a scratch dir (never restore in place blindly)
|
||||
restic restore latest --target /tmp/restore \
|
||||
--include /var/backups/restic-staging \
|
||||
--include /var/www/html/wp-config.php
|
||||
|
||||
# verify before doing anything with it
|
||||
ls -la /tmp/restore/var/backups/restic-staging/
|
||||
head -1 /tmp/restore/var/backups/restic-staging/mysql-wordpress.sql # "-- MySQL dump 10.13 ..."
|
||||
```
|
||||
|
||||
To recover a database, restore the dump then load it: `mysql <db> < mysql-<db>.sql`, `pg_restore -d <db> pg-<db>.dump`, or copy the SQLite file back. **Test restores periodically** — a backup you've never restored is a hope, not a backup. Restore the highest-stakes data (password manager, mail) first in any drill.
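
During a restore drill, a quick signature check catches empty or truncated dumps before you try to load them. This is a sketch under stated assumptions: the leading bytes listed in the comments are the usual signatures for each format, and the function name `dump_kind` is hypothetical. It complements, rather than replaces, an actual test load.

```bash
#!/usr/bin/env bash
# Identify a restored dump by its leading bytes. Assumed signatures:
#   mysqldump / mariadb-dump: text starting "-- MySQL dump" or "-- MariaDB dump"
#   pg_dump -Fc (custom format): magic string "PGDMP"
#   SQLite database file: "SQLite format 3"
dump_kind() {
    local magic
    # Strip NUL bytes so binary headers can be held in a shell variable.
    magic="$(head -c 15 "$1" 2>/dev/null | tr -d '\0')"
    case "$magic" in
        "-- MySQL dump"*|"-- MariaDB dump"*) echo mysql ;;
        PGDMP*)                              echo postgres ;;
        "SQLite format 3"*)                  echo sqlite ;;
        *)                                   echo unknown ;;
    esac
}

# Usage after `restic restore`:
#   for f in /tmp/restore/var/backups/restic-staging/*; do
#       printf '%s\t%s\n' "$(dump_kind "$f")" "$f"
#   done
```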
|
||||
|
||||
## Adding a host
|
||||
|
||||
1. Add it to the `backups` inventory group.
|
||||
2. Give it a `host_vars` scope — which DBs to dump and which paths to back up:
|
||||
|
||||
```yaml
|
||||
restic_backup_oncalendar: "*-*-* 02:40:00" # stagger
|
||||
restic_mysql_dbs: [castopod_db]
|
||||
restic_paths: [/var/www/html/castopod]
|
||||
restic_excludes: [/var/www/html/castopod/public/media] # already offsite
|
||||
```
|
||||
3. Run the playbook against that host. The role installs restic, deploys the script + units, `restic init`s the repo if absent, and enables the timers.
|
||||
|
||||
## Gotchas & Notes
|
||||
|
||||
- **`RESTIC_CACHE_DIR` is mandatory under systemd.** systemd system services run with no `$HOME`, so restic can't find its cache and warns *"unable to locate cache directory: neither $XDG_CACHE_HOME nor $HOME are defined"*. Without a local cache it has to fetch repository metadata from B2 again on **every run**, which makes each backup far slower than it needs to be. Point it at `/var/cache/restic` in the unit.
|
||||
- **`sqlite3` may not be installed.** A host that runs a SQLite-backed app (e.g. a bot) often lacks the `sqlite3`/`sqlite` CLI. Install it where `restic_sqlite_paths` is set, or the `.backup` step fails.
|
||||
- **Docker DB password env-var names vary.** Don't assume: the MariaDB image may use `MYSQL_ROOT_PASSWORD` (not `MARIADB_ROOT_PASSWORD`), and a Postgres container's superuser is whatever `POSTGRES_USER` is set to — reference `"$POSTGRES_USER"` rather than hardcoding `postgres`. Check with `docker exec <c> sh -c 'env | grep -oE "^(MYSQL|MARIADB|POSTGRES)_[A-Z_]*"'` (name only).
|
||||
- **B2 key needs delete capability.** Otherwise `forget`/`prune` fail. Scope the key to the bucket; reach for per-host `namePrefix`-restricted keys for blast-radius isolation.
|
||||
- **Exclude data that's already offsite.** Media already synced to object storage (S3/B2 via the app or `rclone`) should be `--exclude`d so you don't pay to store it twice.
|
||||
- **First upload is slow, the rest are fast.** The initial snapshot reads and uploads everything; subsequent runs only ship changed blocks. For a large first run, fire it detached and watch from a transient unit that emails you on completion.
|
||||
- **Keep secrets out of git.** The repo password and B2 key belong in an Ansible vault (committed encrypted), referenced into the role — never in plaintext vars.
|
||||
- **Changing a host's backup paths starts a new snapshot group.** `restic forget` groups snapshots by `host`+`paths` by default, so adding or removing a path on an existing host creates a *separate* lineage: the old path-set and the new one each retain their own 7d/4w/6m snapshots, and `restic snapshots` shows both. Expected, not a bug — but it means the old-path snapshots age out on their own schedule rather than being superseded. To collapse everything into one retention bucket, run `forget` with `--group-by host` (be deliberate: it then treats *any* path-set on that host as the same group).
|
||||
|
||||
## See Also
|
||||
|
||||
- [rsync Backup Patterns](rsync-backup-patterns.md)
|
||||
- [SnapRAID & MergerFS Storage Setup](../../01-linux/storage/snapraid-mergerfs-setup.md)
|
||||
- [restic documentation](https://restic.readthedocs.io)
|
||||
0
04-streaming/audio/.keep
Normal file
0
04-streaming/hardware/.keep
Normal file
0
04-streaming/infrastructure/.keep
Normal file
331
04-streaming/plex/hevc-vaapi-batch-encode.md
Normal file
|
|
@ -0,0 +1,331 @@
|
|||
---
|
||||
title: "HEVC Batch Re-Encode for Plex Using VAAPI (AMD GPU)"
|
||||
domain: streaming
|
||||
category: plex
|
||||
tags: [plex, ffmpeg, hevc, vaapi, amd, gpu, encode, storage, rx480]
|
||||
status: published
|
||||
created: 2026-05-15
|
||||
updated: 2026-06-05
|
||||
---
|
||||
# HEVC Batch Re-Encode for Plex Using VAAPI (AMD GPU)
|
||||
|
||||
## Problem
|
||||
|
||||
Plex NVMe storage is filling up from a large library of H.264-encoded video files (YouTube downloads, stream archives, etc.). Re-encoding to HEVC (H.265) reclaims 30–50% of disk space. The catch: Plex tracks each file's "date added" in a SQLite database, and that order matters for playback queues. Naive re-encode-and-replace approaches can corrupt or reset that metadata.
|
||||
|
||||
## Solution
|
||||
|
||||
Use `ffmpeg` with `hevc_vaapi` (AMD GPU hardware encoder) to batch re-encode files in-place using an atomic rename swap that preserves the Plex database record — including `added_at` — without any Plex downtime or database editing.
|
||||
|
||||
---
|
||||
|
||||
## How Plex Stores "Date Added"
|
||||
|
||||
Plex does **not** use file modification time (`mtime`) for "date added." It stores a Unix timestamp in its SQLite database:
|
||||
|
||||
```sql
|
||||
-- Plex DB location (override via systemd unit may differ — check):
|
||||
-- /var/lib/plexmediaserver/Library/Application Support/Plex Media Server/
|
||||
-- Plug-in Support/Databases/com.plexapp.plugins.library.db
|
||||
-- (or wherever PLEX_MEDIA_SERVER_APPLICATION_SUPPORT_DIR points)
|
||||
|
||||
SELECT mi.added_at, datetime(mi.added_at, 'unixepoch'), mp.file
|
||||
FROM metadata_items mi
|
||||
JOIN media_items me ON me.metadata_item_id = mi.id
|
||||
JOIN media_parts mp ON mp.media_item_id = me.id
|
||||
WHERE mp.file LIKE '%your-file%';
|
||||
```
|
||||
|
||||
> **Note:** If the default path returns 0 rows, check your actual data directory:
|
||||
> ```bash
|
||||
> systemctl cat plexmediaserver | grep APPLICATION_SUPPORT
|
||||
> ```
|
||||
|
||||
The `added_at` field is keyed to the **file path** in `media_parts`. As long as the file path doesn't change, the database record — including `added_at` — is untouched even after the file's content is replaced.
|
||||
|
||||
---
|
||||
|
||||
## Why VAAPI Instead of libx265
|
||||
|
||||
On a host with an AMD RX 480/580 (or similar Polaris GPU), hardware HEVC encoding via VAAPI is roughly **9× faster** than software libx265 at comparable quality:
|
||||
|
||||
| Encoder | Speed (1080p, fps / × real time) | Notes |
|
||||
|---|---|---|
|
||||
| libx265 -preset medium | ~21 fps / 0.35× | Best quality/size ratio |
|
||||
| hevc_vaapi QP 28 | ~186 fps / 3.1× | Sufficient for streaming content |
|
||||
|
||||
For 1080p streaming content (game streams, podcasts, YouTube archival), the quality difference is imperceptible. libx265 is preferable only for archival encodes where absolute quality matters.
|
||||
|
||||
### Verify VAAPI is working
|
||||
|
||||
```bash
|
||||
vainfo 2>&1 | grep -E "vaapi|HEVC|hevc|Driver"
|
||||
ls /dev/dri/renderD128
|
||||
```
|
||||
|
||||
You need `VAProfileHEVCMain : VAEntrypointEncSlice` in the output. If missing, install `mesa-va-drivers-freeworld` (RPM Fusion) for AMD hardware.
|
||||
|
||||
---
|
||||
|
||||
## The Atomic Swap Strategy
|
||||
|
||||
The key insight: `mv file.tmp file` on the **same filesystem** is an atomic inode rename at the kernel level. Plex sees the same path still present — it never fires a "file removed" event, so the `metadata_items` record (including `added_at`) is preserved.
|
||||
|
||||
**Safe sequence:**
|
||||
1. Encode source → `.hevc.tmp.mp4` alongside the original
|
||||
2. Verify the output with `ffprobe`
|
||||
3. `touch -r original.mp4 temp.mp4` — copy mtime (cosmetic, not required)
|
||||
4. `mv temp.mp4 original.mp4` — atomic replace
|
||||
|
||||
**The one pitfall:** if the original file is deleted *before* the `mv`, Plex orphans the DB record (removes `metadata_items` entry on next scan) and re-indexes the new file with a fresh `added_at`. The original must still exist at swap time.
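
The swap step, with the two preconditions it depends on made explicit, can be sketched as follows. This is an illustration rather than the code in `hevc_batch.sh`: the function name `atomic_swap` is hypothetical, and `stat -c` assumes GNU coreutils.

```bash
#!/usr/bin/env bash
# Replace $1 (original) with $2 (verified re-encode) via a same-filesystem
# rename. Refuses to run unless the original still exists and both files
# live on the same filesystem.
atomic_swap() {
    local original="$1" tmp="$2"
    [[ -f "$original" && -s "$tmp" ]] \
        || { echo "missing original or empty tmp" >&2; return 1; }
    # stat -c %d prints the device number (GNU coreutils).
    [[ "$(stat -c %d "$original")" == "$(stat -c %d "$tmp")" ]] \
        || { echo "different filesystems: mv would copy, not rename" >&2; return 1; }
    touch -r "$original" "$tmp"    # carry over mtime (cosmetic)
    mv -f -- "$tmp" "$original"    # rename over the existing path
}

# Usage:
#   atomic_swap "/plex/plex/video.mp4" "/plex/plex/video.hevc.tmp.mp4"
```

The device-number check matters because `mv` across filesystems silently falls back to copy-and-delete, which is no longer atomic.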
|
||||
|
||||
---
|
||||
|
||||
## The Batch Script
|
||||
|
||||
Script lives at `~/hevc_batch.sh` on majorhome.
|
||||
|
||||
```bash
|
||||
# Dry run — scan and report what would be encoded, no changes
|
||||
bash ~/hevc_batch.sh --dry-run
|
||||
|
||||
# Full run (default: files >1GB, QP 28)
|
||||
tmux new-session -d -s hevc_batch 'bash ~/hevc_batch.sh'
|
||||
|
||||
# Custom options
|
||||
bash ~/hevc_batch.sh --min-size-gb 2 --qp 26
|
||||
```
|
||||
|
||||
### Queue and resume
|
||||
|
||||
The script writes a queue file at `~/hevc_queue.txt` on first run (scanning all files with ffprobe — takes ~10 min for a large library). On subsequent runs it resumes from where it left off. Completed files are logged to `~/hevc_done.txt`. Failed files go to `~/hevc_failed.txt`.
|
||||
|
||||
To restart from scratch: `rm ~/hevc_queue.txt ~/hevc_done.txt`
|
||||
|
||||
### Log output
|
||||
|
||||
```bash
|
||||
# Structured log lines only (skip ffmpeg progress noise)
|
||||
grep '^\[20' ~/hevc_batch.log
|
||||
|
||||
# Watch live progress
|
||||
tail -f ~/hevc_batch.log | grep '^\[20'
|
||||
```
|
||||
|
||||
Each file logs:
|
||||
- Source size and codec
|
||||
- `Plex added_at before: <unix timestamp>`
|
||||
- ffmpeg exit code and elapsed time
|
||||
- Output size and savings
|
||||
- `DB check: added_at PRESERVED ✓` (or WARN if changed)
|
||||
|
||||
### Space guard
|
||||
|
||||
The script aborts if free space on the Plex volume drops below 10GB (`MIN_FREE_GB`). Worst-case headroom needed is `source_size + tmp_size` simultaneously — on a 4GB source file that's ~8GB peak. Note: the space check only runs at the **start** of each encode, not during — a large file can still consume significant disk mid-encode.
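
A guard of this kind can be sketched in a few lines. The function name `has_free_space` is hypothetical and `df --output` assumes GNU coreutils; the real script's check may differ.

```bash
#!/usr/bin/env bash
# Succeeds when the filesystem holding $1 has at least $2 GiB available.
# `df -BG --output=avail` prints a header line, then a value such as "123G".
has_free_space() {
    local path="$1" min_gb="$2" avail
    avail="$(df -BG --output=avail "$path" | tail -n 1 | tr -dc '0-9')"
    (( avail >= min_gb ))
}

# Usage before each encode:
#   has_free_space /plex "${MIN_FREE_GB:-10}" || { echo "low disk, aborting"; exit 1; }
```

Because the check runs once per file, leave enough margin in the threshold to cover the largest file in the queue.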
|
||||
|
||||
---
|
||||
|
||||
## ffmpeg Command
|
||||
|
||||
```bash
|
||||
ffmpeg \
|
||||
-vaapi_device /dev/dri/renderD128 \
|
||||
-i "input.mp4" \
|
||||
-vf 'format=nv12,hwupload' \
|
||||
-c:v hevc_vaapi -rc_mode CQP -qp 28 \
|
||||
-c:a copy \
|
||||
-movflags +faststart \
|
||||
-y "output.tmp.mp4"
|
||||
```
|
||||
|
||||
- `-rc_mode CQP -qp 28` — constant quantizer; higher value = smaller file / lower quality. QP 24 is high quality, QP 28 is good for streaming content.
|
||||
- `-vf 'format=nv12,hwupload'` — required to move frames to GPU memory for VAAPI encoding.
|
||||
- `-c:a copy` — passes audio through untouched.
|
||||
- `hevc_vaapi` does not support 10-bit output on Polaris (RX 480/580). For 10-bit HDR sources, fall back to `libx265` with color signaling flags.
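
The 10-bit fallback decision can be driven by the source pixel format. A minimal sketch, assuming 10-bit sources report a pixel format such as `yuv420p10le` or `p010le`; the function name `pick_encoder` is hypothetical.

```bash
#!/usr/bin/env bash
# Pick an encoder from the source pixel format: 10-bit formats go to
# libx265, everything else goes to hevc_vaapi.
pick_encoder() {
    case "$1" in
        *p10le|*p10be|p010*) echo libx265 ;;
        *)                   echo hevc_vaapi ;;
    esac
}

# Reading the pixel format of the first video stream (requires ffprobe):
#   pix_fmt="$(ffprobe -v error -select_streams v:0 \
#       -show_entries stream=pix_fmt -of csv=p=0 "input.mp4")"
#   pick_encoder "$pix_fmt"
```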
|
||||
|
||||
---
|
||||
|
||||
## Plex Data Directory Override
|
||||
|
||||
On majorhome, the Plex data directory is overridden in the systemd unit — the default path `/var/lib/plexmediaserver/` is empty:
|
||||
|
||||
```bash
|
||||
systemctl cat plexmediaserver | grep APPLICATION_SUPPORT
|
||||
# Environment=PLEX_MEDIA_SERVER_APPLICATION_SUPPORT_DIR=/plex/plexdata/Library/Application Support
|
||||
```
|
||||
|
||||
The actual DB path is therefore:
|
||||
```
|
||||
/plex/plexdata/Library/Application Support/Plex Media Server/Plug-in Support/Databases/com.plexapp.plugins.library.db
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Encode keeps stopping after a few files
|
||||
|
||||
**Symptom:** The script runs, encodes a handful of files, then exits. Restarting it produces the same behavior — processes a few, then exits again.
|
||||
|
||||
**Cause:** `hevc_batch.sh` is a **one-shot batch processor**, not a daemon. It reads through the queue file once from top to bottom, encodes whatever hasn't been done, then exits cleanly with `Batch complete: N processed`. It does not loop or restart itself.
|
||||
|
||||
On subsequent restarts, the script reuses the existing `hevc_queue.txt` rather than rebuilding it — the rebuild only runs if the queue file is missing or empty:
|
||||
|
||||
```bash
|
||||
if [[ ! -f "$QUEUE" ]] || [[ ! -s "$QUEUE" ]]; then
|
||||
build_queue
|
||||
fi
|
||||
```
|
||||
|
||||
This means restarts process only the few items left in the stale queue that haven't been marked done, then exit.
|
||||
|
||||
**Fix:** Delete the queue file before restarting so the script rescans the library and builds a fresh queue:
|
||||
|
||||
```bash
|
||||
su - majorlinux -c 'rm ~/hevc_queue.txt && tmux new-session -d -s hevc_batch "bash ~/hevc_batch.sh"'
|
||||
```
|
||||
|
||||
> Do **not** delete `hevc_done.txt` — that's the deduplication record. The rebuilt queue will skip anything already in `hevc_done.txt`.
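
The skip-what-is-done behaviour can be illustrated with a whole-line filter. This is a sketch, not the script's own `already_done()` logic: it assumes `hevc_done.txt` holds one full path per line, and the function name `filter_done` is hypothetical.

```bash
#!/usr/bin/env bash
# Print candidate paths (stdin, one per line) that are not yet listed in the
# done file. -x matches whole lines, -F treats paths as fixed strings so
# characters like | or * in filenames are not interpreted as regex.
filter_done() {
    local done_file="$1"
    if [[ -s "$done_file" ]]; then
        grep -vxFf "$done_file" || true   # grep exits 1 when nothing is left
    else
        cat                               # empty or missing done list: keep all
    fi
}

# Usage when rebuilding the queue by hand:
#   find /plex/plex -type f -name '*.mp4' | filter_done ~/hevc_done.txt
```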
|
||||
|
||||
---
|
||||
|
||||
### "Parse error, at least 3 arguments" in the log
|
||||
|
||||
**Symptom:** Log lines like `Parse error, at least 3 arguments were expected, only 1 given in string 'h.mp4'` scattered between encode entries.
|
||||
|
||||
**Cause:** ffmpeg prints its own internal parsing warnings to stderr for filenames containing the Unicode special characters common in Giant Bomb / YouTube-DL titles (fullwidth variants of `|`, `:` and `*`). The bash script handles these correctly via `IFS= read -r`; the messages are cosmetic ffmpeg noise and do not affect the encode.
|
||||
|
||||
**Action:** None — these are safe to ignore.
|
||||
|
||||
---
|
||||
|
||||
### "SKIP (not found): uiem DLC & Far Far West.mp4" — truncated filenames
|
||||
|
||||
**Symptom:** "not found" skip entries in the log show what look like the *ends* of filenames (e.g., `uiem DLC & Far Far West.mp4` instead of `Resident Evil Requiem DLC & Far Far West.mp4`).
|
||||
|
||||
**Cause:** The queue file has corrupt/truncated entries — lines where the beginning of the path was lost, likely from a write error or interrupted pipe when the queue was originally built. The script can't find these truncated paths on disk and skips them.
|
||||
|
||||
**Fix:** Delete the queue file to force a full rebuild (see above). The rebuild uses `find` with a fresh scan — no truncation possible.
|
||||
|
||||
---
|
||||
|
||||
### Checking real progress
|
||||
|
||||
```bash
|
||||
# Files done, failed, and remaining in queue
|
||||
wc -l ~/hevc_done.txt ~/hevc_failed.txt ~/hevc_queue.txt
|
||||
|
||||
# Remaining = queue total - done - failed
|
||||
# (some "remaining" may be not-found or parse-error skips)
|
||||
|
||||
# Last 10 log entries
|
||||
grep '^\[20' ~/hevc_batch.log | tail -10
|
||||
|
||||
# Watch live
|
||||
tail -f ~/hevc_batch.log | grep '^\[20'
|
||||
|
||||
# Disk free on /plex
|
||||
df -h /plex | tail -1
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Script exits with `set -euo pipefail`
|
||||
|
||||
The script uses `set -euo pipefail` — any unhandled non-zero exit code kills it immediately. If the script exits with no "Batch complete" line in the log, look for the last log entry before the gap to identify the failing command. Most encode-path errors are handled with `|| echo ""` guards, but external tools (sqlite3, ffprobe) can still trip this under unusual conditions.
|
||||
|
||||
---
|
||||
|
||||
## Related
|
||||
|
||||
- [[plex-4k-codec-compatibility]] — Apple TV Direct Play compatibility, HEVC HDR notes
|
||||
- [[plex-transcoding-troubleshooting]] — Playback stops, software transcode CPU limits, VAAPI setup
|
||||
- [[snapraid-mergerfs-setup]] — MajorRAID storage pool setup
|
||||
- [[SnapRAID-Majorhome]] — majorhome SnapRAID project
|
||||
|
||||
---
|
||||
|
||||
### ffmpeg "Error opening output file" / "Invalid argument" on specific files
|
||||
|
||||
**Symptom:** One or more files fail with this in the log:
|
||||
|
||||
```
|
||||
Error opening output file /plex/plex/Giant Bomb's Sub-A-Thon | Day 3 PART 4.hevc.tmp.mp4.
|
||||
Error opening output files: Invalid argument
|
||||
[YYYY-MM-DD HH:MM:SS] ffmpeg exited 234 in 0s
|
||||
[YYYY-MM-DD HH:MM:SS] FAILED: ffmpeg error — keeping original, removing tmp
|
||||
```
|
||||
|
||||
The file ends up in `hevc_failed.txt` and the original is untouched.
|
||||
|
||||
**Cause:** ffmpeg has its own URL/protocol parser that runs on all input and output path strings before any filesystem access. The ASCII pipe character `|` (U+007C) triggers ffmpeg's pipe protocol handler: it tries to interpret `output|file.mp4` as "pipe output to the process named `file.mp4`" and fails with EINVAL. This happens even though the shell variable is properly quoted and the Linux filesystem supports `|` in filenames. The fullwidth variant `｜` (U+FF5C) can also cause issues depending on ffmpeg's build.
|
||||
|
||||
Common in libraries with Giant Bomb, YouTube, or Twitch downloads — those titles frequently use `|` as a visual separator.
|
||||
|
||||
**Fix:** Sanitize the `stem` used for the `.hevc.tmp.` output filename. The *source* file keeps its original name (the final `mv` writes back to the original path, which the filesystem handles fine); only the temp file needs a clean name for ffmpeg:
|
||||
|
||||
```bash
|
||||
# In encode_file(), replace:
|
||||
local tmp="${dir}/${stem}.hevc.tmp.${ext}"
|
||||
|
||||
# With:
|
||||
local safe_stem="${stem//|/-}"      # ASCII pipe, U+007C
safe_stem="${safe_stem//｜/-}"       # fullwidth pipe, U+FF5C
|
||||
local tmp="${dir}/${safe_stem}.hevc.tmp.${ext}"
|
||||
```
|
||||
|
||||
After patching, delete the affected entries from `hevc_failed.txt` (or leave them — they'll be re-queued on the next run since they're not in `hevc_done.txt`) and restart the batch.
|
||||
|
||||
---
|
||||
|
||||
### Many files failing: output larger than source (streaming content)
|
||||
|
||||
**Symptom:** A large portion of the queue ends up in `hevc_failed.txt` with log lines like:
|
||||
|
||||
```
|
||||
[2026-06-05 ...] Output: 4.7G savings=0 (output larger than source)
|
||||
[2026-06-05 ...] WARN: output is larger than source — skipping swap, keeping original
|
||||
```
|
||||
|
||||
**Cause:** These files are YouTube downloads or streaming archives (Giant Bomb, Twitch VODs, etc.) that were already encoded with an efficient H.264 encoder (typically YouTube's VP9-to-AVC pipeline or a broadcast H.264 encoder at a reasonable bitrate). VAAPI HEVC encoding at QP 28 on a Polaris GPU (RX 480/580) is a hardware encoder with limited rate control precision — it cannot beat a well-tuned software H.264 encode on already-compressed talking-head/gaming content. The output reliably comes out 15–25% *larger* than the source.
|
||||
|
||||
The script handles this correctly: it detects output > source, deletes the tmp, keeps the original, and writes to `hevc_failed.txt`. The files are not corrupted. However, without the `already_failed()` guard, the script will re-attempt these files on every queue rebuild, wasting CPU time and briefly consuming 4–8 GB of disk per failed attempt.
|
||||
|
||||
**Fix — add `already_failed()` skip logic:**
|
||||
|
||||
Patch `~/hevc_batch.sh` to skip files already in `hevc_failed.txt`:
|
||||
|
||||
```bash
|
||||
# After the existing already_done() function, add:
|
||||
already_failed() {
|
||||
[[ -f "$FAILED" ]] && grep -qF "$1" "$FAILED"
|
||||
}
|
||||
|
||||
# In build_queue(), after the already_done "$f" && continue line:
|
||||
already_failed "$f" && continue
|
||||
|
||||
# In the main loop, after the already_done "$file" check:
|
||||
already_failed "$file" && { log "SKIP (already failed): $file"; continue; }
|
||||
```
|
||||
|
||||
After patching, the batch will skip all 132+ known-bad files on the next pass and only attempt fresh queue entries.
|
||||
|
||||
**Tuning options to improve savings on dense content:**
|
||||
|
||||
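- Raise QP: `--qp 30` or `--qp 32` trades quality for size (higher QP = smaller file), which gives a better chance of beating the source size. Trade-off: lower picture quality, so spot-check a sample before committing the whole queue. Lowering QP to 24 or 22 does the opposite and makes the output larger.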
|
||||
- Accept the failures: for streaming content archives, the source is already "good enough." Only files that are genuinely oversized H.264 (old stream captures at very high bitrate) will benefit from HEVC re-encode.
|
||||
|
||||
**Identifying which files are worth encoding:**
|
||||
|
||||
```bash
|
||||
# Show source bitrate for all queued files — high-bitrate sources are candidates
|
||||
while IFS= read -r f; do
|
||||
bitrate=$(ffprobe -v quiet -show_entries format=bit_rate -of csv=p=0 "$f" 2>/dev/null)
|
||||
echo "$bitrate $f"
|
||||
done < ~/hevc_queue.txt | sort -rn | head -20
|
||||
```
|
||||
|
||||
Files above ~8,000 kbit/s (8000000 in the `bit_rate` output above, which ffprobe reports in bits per second) are typically good encode candidates. Files at 3,000–5,000 kbit/s (typical YouTube/Twitch 1080p) will usually fail.
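
The threshold can be applied to the bitrate listing directly. A minimal sketch, assuming input lines of the form `<bits_per_second> <path>` as produced by the loop above; the function name `encode_candidates` is hypothetical and the 8000 kbit/s default simply mirrors the rule of thumb.

```bash
#!/usr/bin/env bash
# Read "<bits_per_second> <path>" lines on stdin and print only the paths
# whose bitrate exceeds the threshold in kbit/s (default 8000).
# Lines with a non-numeric bitrate (e.g. "N/A") are skipped.
encode_candidates() {
    awk -v min_kbps="${1:-8000}" '
        $1 ~ /^[0-9]+$/ && $1 / 1000 > min_kbps {
            sub(/^[0-9]+ /, "")   # drop the bitrate column, keep the full path
            print
        }'
}

# Usage:
#   while IFS= read -r f; do
#       echo "$(ffprobe -v quiet -show_entries format=bit_rate -of csv=p=0 "$f") $f"
#   done < ~/hevc_queue.txt | encode_candidates 8000
```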
|
||||
|
||||
126
04-streaming/plex/plex-transcoding-troubleshooting.md
Normal file
|
|
@ -0,0 +1,126 @@
|
|||
---
|
||||
title: "Plex Transcoding Troubleshooting"
|
||||
domain: streaming
|
||||
category: plex
|
||||
tags: [plex, transcoding, hevc, h264, vaapi, troubleshooting, apple-tv]
|
||||
status: published
|
||||
created: 2026-05-22
|
||||
updated: 2026-05-22
|
||||
---
|
||||
|
||||
# Plex Transcoding Troubleshooting
|
||||
|
||||
Common issues when Plex is transcoding instead of direct playing, and how to fix them.
|
||||
|
||||
## Playback Stops After ~1 Minute
|
||||
|
||||
**Symptom:** Video starts normally, plays for 60–90 seconds, then freezes or stops. Hitting play again works briefly, then stops again.
|
||||
|
||||
**Cause:** The Plex server is software-transcoding the stream and the CPU can't keep up in real time. Plex delivers video as a series of short HLS segments (3 seconds each by default). When the transcoder falls behind real-time, the client exhausts its segment buffer and stops.
|
||||
|
||||
This is most common when:
|
||||
- The client has an auto-quality or bandwidth-limit setting enabled, forcing a transcode even for natively supported codecs
|
||||
- The source file is HEVC and the client is set to anything other than "Play Original"
|
||||
- Multiple streams are transcoding concurrently and saturating the CPU
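To see why a stall tends to show up after a minute or so rather than immediately, a back-of-the-envelope calculation may help. The sketch below assumes the client holds a fixed amount of buffered video and that the transcoder runs at a constant fraction of real time; `seconds_until_stall` and the sample numbers are illustrative, not measurements from this server.

```bash
# buffer_s:  seconds of video already buffered on the client (assumed value)
# speed_pct: transcode speed as a percentage of real time (90 means 0.9x)
seconds_until_stall() {
    sus_buffer_s=$1
    sus_speed_pct=$2
    if [ "$sus_speed_pct" -ge 100 ]; then
        echo "never"          # transcoder keeps up, so the buffer does not drain
        return
    fi
    # Playback consumes 1 s of video per second while the transcoder supplies
    # speed_pct/100 s, so the buffer drains at (100 - speed_pct)/100 s per second.
    echo $(( sus_buffer_s * 100 / (100 - sus_speed_pct) ))
}
```

With three 3-second segments buffered (9 s, an assumed figure), `seconds_until_stall 9 90` gives 90 and `seconds_until_stall 9 85` gives 60, which is in line with the 60–90 second symptom.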
|
||||
|
||||
### How to Confirm
|
||||
|
||||
SSH into the Plex host and check for an active software transcode:
|
||||
|
||||
```bash
|
||||
ps aux | grep 'Plex Transcoder' | grep -v grep
|
||||
```
|
||||
|
||||
Look for `libx264` or `libx265` in the output — these are CPU software encoders. A CPU% above 30–40% per stream on an i7-7700K means it's at or near the real-time limit for 1080p60.
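The check can be scripted roughly as follows. `classify_transcode` is a hypothetical helper that only inspects command-line text for the encoder names mentioned in this article (`libx264`, `libx265`, and the `*_vaapi` encoders); anything else is reported as unknown rather than guessed at.

```bash
# Hypothetical helper: label one transcoder command line by encoder type.
classify_transcode() {
    case "$1" in
        *libx264*|*libx265*) echo "software" ;;   # CPU encoders
        *_vaapi*)            echo "hardware" ;;   # e.g. h264_vaapi, hevc_vaapi
        *)                   echo "unknown"  ;;   # copy/remux or something else
    esac
}
```

Each line of the `ps` output can be passed to it in a `while IFS= read -r cmd` loop to get a one-word verdict per active stream.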
|
||||
|
||||
### Fix: Enable Direct Play
|
||||
|
||||
The correct fix is to eliminate the transcode entirely.
|
||||
|
||||
**On Apple TV:**
|
||||
1. Open the Plex app → tap the user icon → **Settings**
|
||||
2. Go to **Quality**
|
||||
3. Set both **"Home Streaming"** and **"Remote Streaming"** to **"Play Original"** (or "Maximum")
|
||||
4. Restart playback
|
||||
|
||||
Apple TV 4K supports direct play for H.264, HEVC (H.265), and most common containers (MP4, MKV). With "Play Original" set, Plex streams the file as-is with no server-side processing.
|
||||
|
||||
**On other clients:** Look for a Quality or Streaming Quality setting and set it to Original/Maximum. The specific label varies by app version.
|
||||
|
||||
### If Direct Play Isn't Possible
|
||||
|
||||
If the client genuinely can't decode the source codec (e.g., a browser playing HEVC), reduce the transcode quality to something the CPU can sustain in real time:
|
||||
|
||||
- **8 Mbps 1080p** is usually achievable for a single stream on an i7-7700K
|
||||
- Avoid 1080p60 at high bitrates: 60 fps means twice as many frames to encode per second as 1080p30
|
||||
|
||||
Alternatively, enable hardware transcoding (see below).
|
||||
|
||||
---
|
||||
|
||||
## Understanding When Plex Transcodes
|
||||
|
||||
Plex will transcode (convert on the fly) when any of the following are true:
|
||||
|
||||
| Trigger | Example |
|
||||
|---------|---------|
|
||||
| Client can't decode the codec | Browser playing HEVC |
|
||||
| Client quality is set below original | "8 Mbps 1080p" selected |
|
||||
| Audio codec isn't supported by client | DTS-MA, TrueHD on some devices |
|
||||
| Subtitles need burning in | Forced image-based subs (PGS) |
|
||||
| Bandwidth limit set in Plex server settings | Server-side quality cap |
|
||||
|
||||
Direct play happens when the client supports the video codec, audio codec, container, and no quality downgrade is requested.
|
||||
|
||||
---
|
||||
|
||||
## Hardware Transcoding (VAAPI / RX 480)
|
||||
|
||||
majorhome has an XFX Radeon RX 480 8GB with VAAPI support. Hardware transcoding can offload video encoding from the CPU and allows more concurrent transcode streams.
|
||||
|
||||
**Enable in Plex:**
|
||||
Settings → Transcoder → **"Use hardware acceleration when available"** (requires Plex Pass)
|
||||
|
||||
**Caveats:**
|
||||
- The RX 480 VAAPI encoder (`hevc_vaapi`, `h264_vaapi`) is benchmarked ~3× slower than the i7-7700K CPU for single-stream x264 output on this workload. Hardware transcoding only wins when the CPU is already saturated (2+ concurrent streams).
|
||||
- VAAPI hardware transcode on AMD requires the `radeonsi` Mesa driver and `libva-mesa-driver`. Both are present on majorhome.
|
||||
|
||||
**Check VAAPI is working:**
|
||||
```bash
|
||||
vainfo 2>/dev/null | grep -E "VAProfile|VAEntrypoint"
|
||||
```
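Decode support alone is not enough for transcoding; the driver also has to expose an encode entrypoint. A minimal sketch, assuming `vainfo` prints one `VAProfile : VAEntrypoint` pair per line and that `VAEntrypointEncSlice` marks encode support (`has_vaapi_h264_encoder` is a hypothetical name):

```bash
# Reads vainfo output on stdin; succeeds if any H.264 profile offers an
# encode entrypoint (VAEntrypointVLD alone would mean decode only).
has_vaapi_h264_encoder() {
    grep -Eq 'VAProfileH264[A-Za-z]*[[:space:]]*:[[:space:]]*VAEntrypointEncSlice'
}
```

Typical use would be `vainfo 2>/dev/null | has_vaapi_h264_encoder && echo "H.264 hardware encode available"`.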
|
||||
|
||||
---
|
||||
|
||||
## CPU Transcoding Capacity (i7-7700K)
|
||||
|
||||
| Scenario | CPU Load | Sustainable? |
|
||||
|----------|----------|-------------|
|
||||
| 1× HEVC → H.264 1080p30 | ~20% | ✅ Yes |
|
||||
| 1× HEVC → H.264 1080p60 | ~40% | ⚠️ Borderline — may drop behind |
|
||||
| 2× HEVC → H.264 1080p60 | ~80% | ❌ Will fall behind in real time |
|
||||
| 1× H.264 → H.264 1080p (remux only) | ~5% | ✅ Yes |
|
||||
|
||||
**Bottom line:** One software-transcode stream at 1080p60 is at the edge of what the i7-7700K can sustain. Two will fail. Direct play eliminates the problem entirely.
|
||||
|
||||
---
|
||||
|
||||
## Checking Active Transcode Sessions
|
||||
|
||||
```bash
|
||||
# See all active Plex Transcoder processes and what they're encoding
|
||||
ps aux | grep 'Plex Transcoder' | grep -v grep | grep -oP '\-i \S+' | sed 's/-i //'
|
||||
|
||||
# Full transcode command (codec, bitrate, resolution)
|
||||
ps aux | grep 'Plex Transcoder' | grep -v grep
|
||||
```
|
||||
|
||||
You can also see active sessions in Plex Web → Dashboard → Now Playing.
|
||||
|
||||
---
|
||||
|
||||
## Related
|
||||
|
||||
- [Plex 4K Codec Compatibility (Apple TV)](plex-4k-codec-compatibility.md)
|
||||
- [[../../../MajorInfrastructure/Services/Plex|Plex — Infrastructure Doc]]
|
||||
- [[../../../../30-Areas/MajorInfrastructure/Servers/majorhome|majorhome]]
|
||||
0
04-streaming/podcast/.keep
Normal file
|
|
@ -11,7 +11,7 @@ tags:
|
|||
- troubleshooting
|
||||
status: published
|
||||
created: 2026-04-18
|
||||
updated: 2026-04-29T22:45
|
||||
updated: 2026-04-30T05:21
|
||||
---
|
||||
# Ansible Check Mode False Positives in Verify/Assert Tasks
|
||||
|
||||
|
|
|
|||
103
05-troubleshooting/ansible-reboot-become-timeout-wsl2.md
Normal file
|
|
@ -0,0 +1,103 @@
|
|||
---
|
||||
title: "Ansible reboot.yml: become Timeout on WSL2 Hosts (Exclude Them)"
|
||||
domain: troubleshooting
|
||||
category: ansible
|
||||
tags: [ansible, wsl, wsl2, windows, reboot, become, privilege-escalation, openssh, inventory]
|
||||
status: published
|
||||
created: 2026-06-12
|
||||
updated: 2026-06-12
|
||||
---
|
||||
|
||||
# Ansible reboot.yml: become Timeout on WSL2 Hosts (Exclude Them)
|
||||
|
||||
## Problem
|
||||
|
||||
Running a reboot play across a Fedora fleet that includes a WSL2 "host" fails on the WSL2 box at privilege escalation — before the reboot command ever runs:
|
||||
|
||||
```console
|
||||
$ ansible-playbook reboot.yml --limit fedora
|
||||
|
||||
TASK [Reboot the server] *******************************************************
|
||||
changed: [majorhome]
|
||||
changed: [majorlab]
|
||||
changed: [majormail]
|
||||
changed: [majordiscord]
|
||||
[ERROR]: Task failed: Action failed: Timeout (62s) waiting for privilege
|
||||
escalation prompt:
|
||||
fatal: [majorrig-wsl]: FAILED! => {"changed": false,
|
||||
"msg": "Timeout (62s) waiting for privilege escalation prompt:",
|
||||
"reboot": false}
|
||||
```
|
||||
|
||||
Every real server reboots fine. Only the WSL2 host fails, and `"reboot": false` confirms the shutdown command never executed.
|
||||
|
||||
## Cause
|
||||
|
||||
Two independent problems, either of which is enough to break a reboot play against WSL2:
|
||||
|
||||
1. **WSL2 has no real reboot semantics.** `ansible.builtin.reboot` issues a shutdown, then blocks up to `reboot_timeout` (e.g. 900s) waiting for SSH to come back. A WSL2 distro doesn't reboot — it just terminates, and nothing relaunches it automatically. The task would hang the full timeout and then fail.
|
||||
|
||||
2. **`become` times out over the Windows OpenSSH → WSL2 bridge.** When a WSL2 box is reached as `majorlinux@host` through Windows' built-in OpenSSH Server (which forwards into WSL via the default shell), Ansible's privilege-escalation handshake watches the SSH stream for the sudo prompt/success marker. Across the Windows-intercept pty, that marker detection stalls until the 62s `timeout`. This happens **even with passwordless sudo** — `NOPASSWD` is configured and correct; Ansible simply never sees the handshake complete.
|
||||
|
||||
The error surfaces as #2 (it fails at escalation first), but #1 is the deeper reason WSL2 doesn't belong in a reboot play at all.
|
||||
|
||||
## Solution
|
||||
|
||||
**Exclude the WSL group from the reboot play.** A WSL2 instance is a managed *workstation environment*, not a server — it belongs in package/update plays but not in server lifecycle operations like reboot.
|
||||
|
||||
Scope the play to exclude the `wsl` group so even a broad `--limit` skips it:
|
||||
|
||||
```yaml
|
||||
# reboot.yml
- name: Reboot servers
  hosts: all:!wsl        # was: hosts: all
  become: true
  tasks:
    - name: Reboot the server
      ansible.builtin.reboot:
        msg: "Reboot initiated by Ansible"
        reboot_timeout: 900
|
||||
```
|
||||
|
||||
This assumes your WSL2 hosts are in a dedicated inventory group:
|
||||
|
||||
```yaml
|
||||
wsl:
  hosts:
    majorrig-wsl:
      ansible_host: 100.98.47.29
|
||||
```
|
||||
|
||||
Verify the targeting before running — the WSL host should be gone:
|
||||
|
||||
```console
|
||||
$ ansible-playbook reboot.yml --limit fedora --list-hosts
|
||||
play #1 (all:!wsl): Reboot servers
|
||||
hosts (4):
|
||||
majorhome
|
||||
majorlab
|
||||
majordiscord
|
||||
majormail
|
||||
```
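The same check can gate the real run. The sketch below assumes WSL hosts can be recognised by a `-wsl` suffix in their inventory name, as `majorrig-wsl` is here; `no_wsl_hosts` is a hypothetical helper, and a different naming scheme would need a different pattern.

```bash
# Reads `ansible-playbook ... --list-hosts` output on stdin.
# Succeeds only if no listed host name ends in "-wsl" (naming assumption).
no_wsl_hosts() {
    ! grep -Eq '^[[:space:]]*[^[:space:]]+-wsl[[:space:]]*$'
}
```

For example: `ansible-playbook reboot.yml --limit fedora --list-hosts | no_wsl_hosts && ansible-playbook reboot.yml --limit fedora`.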
|
||||
|
||||
### Rebooting the WSL2 instance itself
|
||||
|
||||
When you genuinely need to "reboot" WSL2, do it from the Windows side — not Ansible:
|
||||
|
||||
```powershell
|
||||
wsl --shutdown
|
||||
```
|
||||
|
||||
The distro relaunches on next access (next SSH login or `wsl` invocation). WSL2 stays in `update.yml` (dnf upgrades) and other package plays; it's only excluded from reboot and other server-specific roles.
|
||||
|
||||
## Why not just fix the become timeout?
|
||||
|
||||
You *could* raise `timeout` or tweak the become flow, but it doesn't address problem #1 — even a successful escalation would leave the reboot task hanging the full `reboot_timeout` because WSL2 never comes back the way the module expects. Excluding WSL from server lifecycle plays is the correct fix, not a workaround.
|
||||
|
||||
## Related
|
||||
|
||||
- [Ansible: ansible.cfg Ignored on WSL2 Windows Mounts](ansible-wsl2-world-writable-mount-ignores-cfg.md)
|
||||
- [Windows OpenSSH: WSL Default Shell Breaks Remote Commands](networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md)
|
||||
- [Ansible: SSH Timeout During dnf upgrade on Fedora Hosts](ansible-ssh-timeout-dnf-upgrade.md)
|
||||
|
||||
|
|
@ -0,0 +1,72 @@
|
|||
---
|
||||
title: "Ansible regex_search — capture-group argument doesn't work in set_fact"
|
||||
domain: troubleshooting
|
||||
category: general
|
||||
tags: [ansible, jinja, regex, set_fact, gotcha]
|
||||
status: published
|
||||
created: 2026-05-06
|
||||
updated: 2026-05-06
|
||||
---
|
||||
|
||||
# Ansible `regex_search` — capture-group argument doesn't work in `set_fact`
|
||||
|
||||
## Problem
|
||||
|
||||
You want to extract a number from a registered command's stdout — e.g. the package count from a dnf or apt upgrade — and stash it in a fact. The natural-looking `regex_search('pattern', '\1')` form fails or produces an empty string when used inside `set_fact`:
|
||||
|
||||
```yaml
|
||||
- name: Capture package count        # ❌ does not behave as expected
  ansible.builtin.set_fact:
    pkg_count: "{{ apt_upgrade_result.stdout | regex_search('([0-9]+) upgraded', '\\1') }}"
|
||||
```
|
||||
|
||||
You'll see one of:
|
||||
|
||||
- An empty `pkg_count` (the filter ran but the back-reference returned nothing in this context)
|
||||
- A Jinja error about argument arity if the syntax is slightly off
|
||||
- The whole matched substring instead of just the captured group
|
||||
|
||||
## Root cause
|
||||
|
||||
In `set_fact` templating, the second-positional-argument form of `regex_search` (the back-reference `'\1'` you've seen in tutorials) doesn't reliably select capture groups. The filter is happiest returning the full match. When back-references are supplied, it returns a list of the captured groups rather than a single string, which is one more way the result can differ from what the snippet suggests. Capture-group selection works in some contexts (e.g. `vars:` blocks, certain Jinja invocations) but not consistently inside `set_fact`, which makes "copy this snippet from the docs" fail intermittently.
|
||||
|
||||
## Fix — match the broader pattern, then split
|
||||
|
||||
Stop fighting the back-reference. Use `regex_search` to grab a string that *contains* the value you want, then peel it apart with plain Python string ops:
|
||||
|
||||
```yaml
|
||||
- name: Capture package count        # ✅ works in set_fact
  ansible.builtin.set_fact:
    pkg_count: "{{ (apt_upgrade_result.stdout | regex_search('[0-9]+ upgraded') | default('0', true)).split()[0] }}"
```

What this does:

1. `regex_search('[0-9]+ upgraded')` returns the matching substring (e.g. `"7 upgraded"`) or `None` on no match.
2. `default('0', true)` turns the `None` case into the string `"0"` so the next step always has something to operate on. The second argument matters: without it, Jinja's `default` only replaces undefined values and passes `None` through unchanged.
|
||||
3. `.split()[0]` keeps just the number.
|
||||
|
||||
The result (`"7"`) is a string — cast with `| int` if you need arithmetic.
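The match-then-split idea is easy to sanity-check outside Ansible against captured output. A shell sketch of the same logic, where `pkg_count` is a hypothetical helper and the text piped in only needs to contain an `N upgraded` fragment:

```bash
# Reads command output on stdin and prints the number before "upgraded",
# or 0 when the pattern does not occur at all.
pkg_count() {
    pc_match=$(grep -oE '[0-9]+ upgraded' | head -n 1)
    # Fall back to a synthetic "0 upgraded" so the split always has input.
    printf '%s\n' "${pc_match:-0 upgraded}" | awk '{print $1}'
}
```

For instance, `printf '7 upgraded, 0 newly installed\n' | pkg_count` prints `7`, and input without the pattern prints `0`.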
|
||||
|
||||
## Where this comes up in MajorAnsible
|
||||
|
||||
The `update.yml` executive-summary task uses this pattern to pull package counts out of `apt_upgrade_result.stdout` and `dnf_upgrade_result.stdout` so each host can print one tidy line:
|
||||
|
||||
```
|
||||
majorhome: 7 pkg(s) upgraded | No reboot needed | 2 active screen(s)
|
||||
majormail: 14 pkg(s) upgraded | REBOOT REQUIRED | Snapshot taken
|
||||
majorlab: 0 pkg(s) upgraded | No reboot needed
|
||||
```
|
||||
|
||||
The summary line is built with a Jinja `parts` array joined with `' | '` so segments that don't apply (no snapshot, no screens) drop out cleanly without leaving trailing separators.
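The joining behaviour can be illustrated in shell. This is not the playbook's Jinja code, only a sketch of the same idea with a hypothetical `join_parts` function: empty segments are skipped, so no stray separators appear.

```bash
# Joins all non-empty arguments with " | ".
join_parts() {
    jp_out=""
    for jp_part in "$@"; do
        [ -n "$jp_part" ] || continue       # segment does not apply: drop it
        if [ -n "$jp_out" ]; then
            jp_out="$jp_out | $jp_part"
        else
            jp_out="$jp_part"
        fi
    done
    printf '%s\n' "$jp_out"
}
```

`join_parts "7 pkg(s) upgraded" "No reboot needed" ""` yields `7 pkg(s) upgraded | No reboot needed`, with nothing trailing after the last real segment.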
|
||||
|
||||
## Quick checks if this still misbehaves
|
||||
|
||||
- **Confirm the source variable.** A registered command result exposes both `result.stdout` (a single string) and `result.stdout_lines` (a list); the `regex_search` filter wants a string, not a list. Use `.stdout` (or `.stdout_lines | join('\n')` if all you have is the list).
|
||||
- **Escape your backslashes.** In double-quoted YAML strings, `\d` needs to be written `\\d`; alternatively, wrap the pattern in single quotes: `'(\d+) upgraded'`.
|
||||
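- **Always provide a default that also covers `None`.** `regex_search` returns `None` on miss, which will explode `.split()[0]`. Jinja's `default` filter only replaces undefined values unless its second argument is `true`, so write `| default('0', true)`. This bridge is mandatory in production playbooks where some hosts will legitimately have zero upgrades.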
|
||||
|
||||
## Related
|
||||
|
||||
- [[ansible-vault-password-file-missing]] — another set_fact / vault interaction quirk
|
||||
- [[ansible-ssh-timeout-dnf-upgrade]] — companion gotcha when running `update.yml`
|
||||
|
|
@ -0,0 +1,106 @@
|
|||
---
|
||||
title: "Ansible: Ubuntu Reboot Detection Misses Kernel Upgrades"
|
||||
domain: troubleshooting
|
||||
category: ansible
|
||||
tags: [ansible, ubuntu, kernel, reboot, needrestart, apt]
|
||||
status: published
|
||||
created: 2026-05-19
|
||||
updated: 2026-05-19
|
||||
---
|
||||
|
||||
# Ansible: Ubuntu Reboot Detection Misses Kernel Upgrades
|
||||
|
||||
## Problem
|
||||
|
||||
`update.yml` runs across the Ubuntu fleet, a kernel package is upgraded, but the executive summary reports `No reboot needed` — even though a reboot is genuinely required. Running `uname -r` on the host confirms it's still on the old kernel.
|
||||
|
||||
Example: majortoot had `linux-image-6.8.0-117-generic` installed on May 16 after a Tailscale update triggered `needrestart`, but the playbook kept reporting clean.
|
||||
|
||||
## Root Cause
|
||||
|
||||
The standard check for Ubuntu reboot state is:
|
||||
|
||||
```yaml
|
||||
- name: Check if a reboot is required for Ubuntu servers
  ansible.builtin.stat:
    path: /var/run/reboot-required
  register: ubuntu_reboot_flag
|
||||
```
|
||||
|
||||
`/var/run/reboot-required` (the same file as `/run/reboot-required`, since `/var/run` is a symlink to `/run`) is written by `update-notifier-common`'s `notify-reboot-required` script, called by `/etc/kernel/postinst.d/update-notifier` when a kernel package is installed via `apt`.
|
||||
|
||||
The problem is `needrestart`. It runs after every `apt` invocation via a `DPkg::Post-Invoke` hook (`apt-pinvoke -m u`). In **unattended mode** (`-m u`), needrestart detects the pending kernel upgrade and calls `announce_ver()` in `NeedRestart::UI::Ubuntu` — but that function only prints to stdout. It does **not** call `_write_reboot_file()`. Only `announce_ucode()` (microcode upgrades) calls `_write_reboot_file()`.
|
||||
|
||||
So the sequence is:
|
||||
|
||||
1. `apt` installs kernel → `notify-reboot-required` creates `/run/reboot-required` ✅
|
||||
2. Some later `apt` run (e.g. Ansible installs Tailscale) → `needrestart -m u` runs → detects kernel mismatch → calls `announce_ver()` → prints to stdout (suppressed in Ansible) → **does not** recreate the sentinel file
|
||||
3. Next Ansible run: stat check finds no file → reports `No reboot needed` ❌
|
||||
|
||||
The `/run` filesystem is tmpfs and clears on reboot, but the sentinel file can disappear between reboots any time needrestart runs without recreating it.
|
||||
|
||||
## Fix — Dual Check in update.yml
|
||||
|
||||
Add a parallel kernel comparison task after the existing stat check:
|
||||
|
||||
```yaml
|
||||
- name: Check running kernel vs installed kernel (Ubuntu)
  ansible.builtin.shell: |
    RUNNING=$(uname -r)
    INSTALLED=$(dpkg -l 'linux-image-[0-9]*-generic' 2>/dev/null \
      | awk '/^ii/{print $2}' \
      | sed 's/linux-image-//' \
      | sort -V | tail -1)
    if [ -n "$INSTALLED" ] && [ "$RUNNING" != "$INSTALLED" ]; then
      echo "KERNEL_MISMATCH"
    fi
  register: kernel_mismatch_check
  changed_when: false
  when: ansible_facts['os_family'] == "Debian"
|
||||
```
|
||||
|
||||
Then update the `host_summary` Jinja2 template to OR both conditions:
|
||||
|
||||
```jinja2
|
||||
{%- if ansible_facts['os_family'] == 'Debian' and (
|
||||
(ubuntu_reboot_flag is defined and ubuntu_reboot_flag.stat is defined and ubuntu_reboot_flag.stat.exists)
|
||||
or
|
||||
(kernel_mismatch_check is defined and 'KERNEL_MISMATCH' in (kernel_mismatch_check.stdout | default('')))
|
||||
) -%}
|
||||
{%- set _ = parts.append('REBOOT REQUIRED') -%}
|
||||
```
|
||||
|
||||
## Common Mistake — Comparing the Wrong dpkg Field
|
||||
|
||||
An initial version of this fix used `$3` (the package version) and `cut`:
|
||||
|
||||
```bash
|
||||
# WRONG — version field never matches uname -r
|
||||
INSTALLED=$(dpkg -l 'linux-image-*-generic' | awk '/^ii/{print $3}' | sort -V | tail -1 | cut -d- -f1-4)
|
||||
```
|
||||
|
||||
| Field | Example value |
|
||||
|-------|--------------|
|
||||
| `dpkg $3` (version) after cut | `6.8.0-57.59` |
|
||||
| `uname -r` | `6.8.0-57-generic` |
|
||||
|
||||
These formats never match. Every Ubuntu host permanently reports `KERNEL_MISMATCH`. Always use the **name column (`$2`)**, strip the `linux-image-` prefix, and compare directly to `uname -r`.
|
||||
|
||||
Also use `linux-image-[0-9]*-generic` (not `*-generic`) to exclude the `linux-image-generic` meta-package from the sort.
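Both points can be checked without touching a host by feeding package names through the same pipeline stages. In this sketch `newest_installed_kernel` and `kernel_mismatch` are hypothetical helpers, and GNU `sort -V` is assumed to be available:

```bash
# Reads package names on stdin, keeps only versioned linux-image-*-generic
# packages (the linux-image-generic meta-package has no digit and is dropped),
# and prints the newest one in `uname -r` form.
newest_installed_kernel() {
    sed -n 's/^linux-image-\([0-9].*-generic\)$/\1/p' | sort -V | tail -n 1
}

# Succeeds when an installed kernel is known and differs from the running one.
kernel_mismatch() {
    km_running=$1
    km_installed=$2
    [ -n "$km_installed" ] && [ "$km_running" != "$km_installed" ]
}
```

On a real host the input would come from `dpkg -l 'linux-image-[0-9]*-generic' | awk '/^ii/{print $2}'`, and the running kernel from `uname -r`.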
|
||||
|
||||
## Verification
|
||||
|
||||
Run against a known-pending host before and after reboot:
|
||||
|
||||
```bash
|
||||
ansible-playbook update.yml --limit majortoot
|
||||
```
|
||||
|
||||
Before reboot: `majortoot: 0 pkg(s) upgraded | REBOOT REQUIRED`
|
||||
After reboot: `majortoot: 0 pkg(s) upgraded | No reboot needed`
|
||||
|
||||
## Related
|
||||
|
||||
- [[ansible-regex-search-set-fact-capture-group]] — companion Jinja2 gotcha in the same `host_summary` task
|
||||
- [[ansible-unattended-upgrades-fleet]] — managing the Ubuntu auto-upgrade stack
|
||||
- [[ansible-check-mode-false-positives]] — another Ansible reporting quirk
|
||||
0
05-troubleshooting/boot-system/.keep
Normal file
|
|
@ -0,0 +1,73 @@
|
|||
---
|
||||
title: "Claude Code Keychain Prompt Keeps Reappearing on macOS (ACL Invalidation)"
|
||||
domain: troubleshooting
|
||||
category: claude-code
|
||||
tags: [claude-code, authentication, oauth, keychain, macos, acl, security]
|
||||
status: published
|
||||
created: 2026-06-15
|
||||
updated: 2026-06-15
|
||||
---
|
||||
|
||||
# Claude Code Keychain Prompt Keeps Reappearing on macOS (ACL Invalidation)
|
||||
|
||||
## Symptom
|
||||
A macOS dialog repeatedly pops up:
|
||||
|
||||
> **security wants to access key "Claude Code-credentials" in your keychain.**
|
||||
> To allow this, enter the "login" keychain password. — `[Always Allow] [Deny] [Allow]`
|
||||
|
||||
The tell-tale sign: it **comes back even after clicking "Always Allow"** — the usual "trust forever" button doesn't make it stop. Login still works; it's the *permission prompt* that won't quiet down. This is **distinct** from [Claude Code won't log in](claude-code-warp-login-corrupt-keychain-credential.md), where the stored credential is corrupt and login itself fails.
|
||||
|
||||
## Cause
|
||||
Claude Code stores its OAuth token in the macOS **login keychain** as `Claude Code-credentials`, read via `/usr/bin/security`. macOS binds an "Always Allow" grant (the keychain item's ACL) to the **code-signing identity** of the requesting binary. That grant is silently invalidated when:
|
||||
|
||||
- **Claude Code updates** — the new binary's signature no longer matches the saved ACL. This is the most common trigger (see claude-code issues #48162, #9403).
|
||||
- **The credential item is recreated on token refresh** — wipes the ACL.
|
||||
- **Post-reboot keychain churn** — right after boot, the just-unlocked login keychain plus a concurrent token refresh can race ahead of the ACL settling, producing a *burst* of prompts that stops once a clean refresh completes.
|
||||
|
||||
It is **not** a lock-timeout issue if `security show-keychain-info` reports `no-timeout` (below).
|
||||
|
||||
## Triage (non-destructive — these do not trigger a prompt)
|
||||
```bash
|
||||
# Confirm the item exists (metadata only; no secret read)
|
||||
security find-generic-password -l "Claude Code-credentials" | grep -E "svce|acct"
|
||||
|
||||
# Confirm the login keychain isn't auto-locking
|
||||
security show-keychain-info ~/Library/Keychains/login.keychain-db
|
||||
# -> "no-timeout" means it won't relock; so recurring prompts = ACL invalidation, not locking
|
||||
```
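If the check is scripted, the interpretation can be captured in a small helper. `keychain_prompt_hint` is a hypothetical name, and the only assumption about the tool's output is the `no-timeout` marker described above:

```bash
# Takes the text printed by `security show-keychain-info` as its argument
# and prints which explanation for recurring prompts it points towards.
keychain_prompt_hint() {
    case "$1" in
        *no-timeout*) echo "acl-invalidation" ;;   # keychain never relocks
        *)            echo "check-auto-lock"  ;;   # a lock timeout may be involved
    esac
}
```

A possible invocation is `keychain_prompt_hint "$(security show-keychain-info ~/Library/Keychains/login.keychain-db 2>&1)"`, with stderr merged in case the tool reports there.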
|
||||
|
||||
## Fixes
|
||||
|
||||
### One-off burst (e.g. right after a reboot)
|
||||
Click **Always Allow** (not Allow) once a clean token refresh has completed. With a `no-timeout` keychain the grant then holds, and the post-boot prompt storm usually self-clears within a minute. *Observed exactly this on MajorAir 2026-06-15 — a reboot triggered a burst that stopped on its own.*
|
||||
|
||||
### Keeps returning after updates (durable) — reset the credential
|
||||
Deleting and re-creating the item rebinds a fresh ACL to the current binary. Costs one re-login.
|
||||
```bash
|
||||
security delete-generic-password -s "Claude Code-credentials"
|
||||
# then re-authenticate inside Claude Code: /login (or relaunch `claude`)
|
||||
```
|
||||
|
||||
### Bypass the keychain entirely (workaround)
|
||||
Claude Code falls back to `~/.claude/.credentials.json` in non-GUI contexts (SSH, tmux). On a local Mac this can be repurposed to stop keychain prompts for good:
|
||||
```bash
|
||||
# pipe straight to the file — never echo the token into a shared terminal
|
||||
security find-generic-password -s "Claude Code-credentials" -w > ~/.claude/.credentials.json
|
||||
chmod 600 ~/.claude/.credentials.json
|
||||
security delete-generic-password -s "Claude Code-credentials"
|
||||
```
|
||||
**Caveats:**
|
||||
- Token is then **plaintext at rest** (mode 600) instead of encrypted in the keychain.
|
||||
- A future Claude Code update may rewrite the keychain item.
|
||||
- GUI-session behaviour for the file fallback is **less documented** than the SSH/tmux case — **verify it holds for your setup before relying on it.**
|
||||
- Do **not** substitute `CLAUDE_CODE_OAUTH_TOKEN` — it is known to delete credentials on exit (issue #37512).
|
||||
|
||||
## Notes
|
||||
- Same keychain item as the corrupt-credential login failure; if login itself breaks, see the related article.
|
||||
- Always redirect `-w` output straight to a file — never into a terminal whose scrollback feeds shared context.
|
||||
|
||||
## Related
|
||||
- [Claude Code Won't Log In (Warp & iTerm2) — Corrupt Keychain Credential](claude-code-warp-login-corrupt-keychain-credential.md)
|
||||
- Config: `~/.claude.json`, login keychain item `Claude Code-credentials`
|
||||
- First observed: MajorAir, 2026-06-15 (post-reboot prompt burst; self-cleared)
|
||||
|
|
@ -0,0 +1,66 @@
|
|||
---
|
||||
title: "Claude Code Won't Log In (Warp & iTerm2) — Corrupt Keychain Credential"
|
||||
domain: troubleshooting
|
||||
category: claude-code
|
||||
tags: [claude-code, authentication, oauth, keychain, macos, warp, iterm2]
|
||||
status: published
|
||||
created: 2026-06-09
|
||||
updated: 2026-06-09
|
||||
---
|
||||
|
||||
# Claude Code Won't Log In (Warp & iTerm2) — Corrupt Keychain Credential
|
||||
|
||||
## Symptom
|
||||
Claude Code (v2.1.169) would not log in from Warp. The login flow never completed.
|
||||
The same failure occurred in iTerm2, which ruled out a terminal-specific cause.
|
||||
|
||||
## Investigation path
|
||||
1. **Version** — `claude --version` = 2.1.169. Already well past the v2.1.105–2.1.107
|
||||
bracketed-paste regression (fixed in 2.1.108), so the known paste bug was not it.
|
||||
2. **Environment / overrides** — none of `ANTHROPIC_API_KEY`, `CLAUDE_CODE_OAUTH_TOKEN`,
|
||||
`ANTHROPIC_AUTH_TOKEN`, `ANTHROPIC_BASE_URL`, or `CLAUDE_CODE_PATH` were set, so no stale
|
||||
key or shim was hijacking auth. System clock was correct (rules out token-time skew).
|
||||
3. **Account record** — `~/.claude.json` had `oauthAccount` and `userID` populated
|
||||
(`maj.linux@gmail.com`), i.e. Claude Code believed it already had an account.
|
||||
4. **Keychain** — a `Claude Code-credentials` generic-password item existed, but
|
||||
`security find-generic-password -s "Claude Code-credentials" -w` returned an
|
||||
**empty / non-JSON payload** (failed to parse). The credential entry was present but
|
||||
its secret was empty/corrupt.
|
||||
|
||||
## Root cause
|
||||
A **corrupt (empty) Keychain credential** named `Claude Code-credentials`. Claude Code saw
|
||||
an existing credential, tried to read/refresh it, failed to parse it, and wedged *before* it
|
||||
could start a clean login. Because the account also existed in `~/.claude.json`, the CLI kept
|
||||
trying to use the broken credential instead of prompting fresh auth. This is system-level
|
||||
(Keychain), which is why it reproduced across both Warp and iTerm2.
|
||||
|
||||
## The fix
|
||||
```bash
|
||||
# 1. Remove the broken credential
|
||||
security delete-generic-password -s "Claude Code-credentials"
|
||||
|
||||
# 2. Re-authenticate
|
||||
claude # then /login, or:
|
||||
claude /login
|
||||
```
|
||||
If `/login` still hangs after that, also clear the stale account record and retry:
|
||||
```bash
|
||||
cp ~/.claude.json ~/.claude.json.bak
|
||||
python3 -c "import json,pathlib; f=pathlib.Path.home()/'.claude.json'; d=json.load(open(f)); d.pop('oauthAccount',None); json.dump(d,open(f,'w'),indent=2)"
|
||||
claude /login
|
||||
```
|
||||
Resolved on step 1+2 — login succeeded after deleting the corrupt Keychain item.
|
||||
|
||||
## Notes
|
||||
- On macOS, Claude Code credentials live in the **login Keychain** (`Claude Code-credentials`),
|
||||
not in `~/.claude/.credentials.json` (that path is Linux/other).
|
||||
- Quick triage command to spot the same failure again:
|
||||
```bash
|
||||
security find-generic-password -s "Claude Code-credentials" -w | python3 -m json.tool > /dev/null
|
||||
```
|
||||
If that errors with "Expecting value", the stored secret is empty/corrupt — delete and re-login.
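A variant that classifies the payload without ever printing it may be safer in shared terminals. This is a sketch, assuming `python3` is on the PATH; `credential_state` is a hypothetical helper:

```bash
# Reads the credential payload on stdin and prints only a verdict,
# never the payload itself.
credential_state() {
    cs_payload=$(cat)
    if [ -z "$cs_payload" ]; then
        echo "empty"
    elif printf '%s' "$cs_payload" | python3 -m json.tool >/dev/null 2>&1; then
        echo "ok"
    else
        echo "invalid-json"
    fi
}
```

Usage would be `security find-generic-password -s "Claude Code-credentials" -w | credential_state`; anything other than `ok` points to the delete-and-re-login fix.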
|
||||
|
||||
## Related
|
||||
- [Claude Code Keychain Prompt Keeps Reappearing on macOS (ACL Invalidation)](claude-code-keychain-prompt-recurring-macos.md) — different symptom: login works but the permission prompt won't stop
|
||||
- Config: `~/.claude.json` (oauthAccount, userID), login Keychain item `Claude Code-credentials`
|
||||
- Other Claude Code note: `claude-mem-setting-sources-empty-arg.md`
|
||||
|
|
@ -0,0 +1,190 @@
|
|||
---
|
||||
title: "Claude Desktop MCP Mass-Disconnect After Blocking SSH Reboot"
|
||||
domain: troubleshooting
|
||||
category: troubleshooting
|
||||
tags:
|
||||
- claude-desktop
|
||||
- mcp
|
||||
- wsl
|
||||
- wsl2
|
||||
- ssh
|
||||
- reboot
|
||||
- troubleshooting
|
||||
- hang
|
||||
- transport
|
||||
status: published
|
||||
created: 2026-05-10
|
||||
updated: 2026-05-10
|
||||
---
|
||||
|
||||
# Claude Desktop MCP Mass-Disconnect After Blocking SSH Reboot
|
||||
|
||||
> **TL;DR** — Issuing a synchronous `ssh host reboot` through Claude Desktop's shell MCP can hang the MCP transport when the target dies mid-session. Eventually the MCP manager force-disconnects **every** MCP at once. Recovery is a full Claude Desktop restart. Prevention is a fire-and-forget reboot pattern that lets the SSH session close cleanly before the target goes down.
|
||||
|
||||
---
|
||||
|
||||
## Symptom
|
||||
|
||||
You're running Claude Desktop with several MCPs configured (shell, filesystem, mail, etc.), most launched via `wsl.exe` against your WSL2 distro. You ask Claude to reboot a remote host through the shell MCP — typically something like `ssh fleethost reboot` or `ssh fleethost sudo systemctl reboot`. Things appear to succeed. Then, anywhere from immediately to ~30 minutes later:
|
||||
|
||||
- **Every MCP disconnects within tens of milliseconds of each other** — not in the order you'd expect from independent failures
|
||||
- Claude Desktop's main panel shows all MCP servers as failed/disconnected
|
||||
- The app itself is still running but cannot reconnect MCPs cleanly until you fully restart it
|
||||
- New chats can't use any MCP tools
|
||||
|
||||
The MCP server logs (`%APPDATA%\Claude\logs\mcp-server-*.log`) end with the standard *"Server transport closed unexpectedly, this is likely due to the process exiting early"* message — but they end at the **same instant** for every server.
|
||||
|
||||
---
|
||||
|
||||
## Why this happens
|
||||
|
||||
Claude Desktop launches each MCP server as a stdio child process (commonly `wsl.exe npx -y <server>` or `wsl.exe <binary>`). The MCP manager owns the stdio pipes and a transport per server. When you ask Claude to run a synchronous `ssh remote reboot` via the shell MCP:
|
||||
|
||||
1. The shell MCP calls SSH and waits for the remote process to exit so it can return stdout/stderr to Claude Desktop
|
||||
2. The remote `reboot` (or `systemctl reboot`) executes on the target — but reboot is special: the target severs its own SSH session as part of going down, often **without** sending a clean TCP FIN
|
||||
3. The local SSH client sits there waiting for a response that never comes
|
||||
4. The shell MCP's stdio pipe stays open, blocked on the SSH child
|
||||
5. Claude Desktop's MCP manager waits on the shell MCP's stdio pipe
|
||||
6. After some watchdog/timeout interval, the manager force-tears-down — and because of how the manager is wired, it tears down **all** MCP transports together, not just the wedged one
|
||||
|
||||
The blast radius is "every MCP in the session," not just the one that issued the reboot.
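One way to limit how long a wedged SSH child can block its caller (steps 3 and 4) is to bound the wait. The sketch below assumes GNU `timeout` is available, as it is on typical Linux and WSL2 installs; `run_bounded` is a hypothetical wrapper, and whether a bounded call is enough to keep Claude Desktop's MCP manager from tearing everything down has not been verified here.

```bash
# Runs a command with a hard time limit in seconds. `timeout` exits with
# status 124 when it had to kill the command, which is reported on stderr.
run_bounded() {
    rb_limit=$1
    shift
    timeout "$rb_limit" "$@"
    rb_status=$?
    if [ "$rb_status" -eq 124 ]; then
        echo "timed out after ${rb_limit}s: $*" >&2
    fi
    return "$rb_status"
}
```

An example call, using standard OpenSSH client options to detect a dead peer sooner: `run_bounded 20 ssh -o ConnectTimeout=5 -o ServerAliveInterval=5 -o ServerAliveCountMax=2 fleethost sudo systemctl reboot`. A 124 exit status then means "the host went away mid-command", which is the expected outcome for a reboot.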
|
||||
|
||||
---
|
||||
|
||||
## Diagnostic chain
|
||||
|
||||
Use this exact order — it lets you rule out each layer cleanly.
|
||||
|
||||
### 1. Are the disconnect timestamps clustered?
|
||||
|
||||
Open `%APPDATA%\Claude\logs\mcp.log` (or each per-server log) and find the *Server transport closed* lines for each MCP. Are they within tens or hundreds of milliseconds of each other?
|
||||
|
||||
```
2026-05-10T04:10:17.167Z [shell] Server transport closed unexpectedly
2026-05-10T04:10:17.175Z [mail] Server transport closed unexpectedly
2026-05-10T04:10:17.177Z [majorvault] Server transport closed unexpectedly
2026-05-10T04:10:17.202Z [filesystem] Server transport closed unexpectedly
```
|
||||
|
||||
If yes → a parent killed the children. This is **not** independent MCP failures.
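The clustering check can also be scripted. What follows is a minimal sketch, assuming each log line has the shape shown in the excerpt above (ISO timestamp, bracketed server name, message). The function names are illustrative, and real log lines may carry extra fields that need different parsing.

```python
from datetime import datetime


def parse_close_events(lines):
    """Return (timestamp, server) pairs for 'transport closed' log lines."""
    events = []
    for line in lines:
        if "Server transport closed" not in line:
            continue
        stamp, rest = line.split(" ", 1)
        server = rest[rest.index("[") + 1 : rest.index("]")]
        # Older Pythons reject a trailing "Z" in fromisoformat(), so normalise it.
        events.append((datetime.fromisoformat(stamp.replace("Z", "+00:00")), server))
    return events


def disconnect_spread_ms(events):
    """Milliseconds between the first and the last close event."""
    times = [t for t, _ in events]
    return (max(times) - min(times)).total_seconds() * 1000


sample = [
    "2026-05-10T04:10:17.167Z [shell] Server transport closed unexpectedly",
    "2026-05-10T04:10:17.175Z [mail] Server transport closed unexpectedly",
    "2026-05-10T04:10:17.177Z [majorvault] Server transport closed unexpectedly",
    "2026-05-10T04:10:17.202Z [filesystem] Server transport closed unexpectedly",
]
events = parse_close_events(sample)
spread = disconnect_spread_ms(events)
print(f"{len(events)} servers closed within {spread:.0f} ms")
```

A spread of tens of milliseconds across every server points at a shared parent; independent failures would normally be scattered over seconds or minutes.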
|
||||
|
||||
### 2. Is there a Crashpad minidump?
|
||||
|
||||
```powershell
dir "$env:APPDATA\Claude\Crashpad\reports"
dir "$env:APPDATA\Claude\Crashpad\pending"
```
|
||||
|
||||
Empty directories (or directories with no files newer than the disconnect time) = **Claude Desktop did not crash, it hung**. A real crash would have written a minidump.
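The "no files newer than the disconnect time" check can be automated. This is a rough sketch, not part of Claude Desktop: `files_newer_than` is a hypothetical helper, and the demo runs against a temporary directory with fake dump files rather than the real Crashpad folders.

```python
import os
import tempfile
from pathlib import Path


def files_newer_than(directory, cutoff_epoch):
    """Return files under `directory` modified at or after `cutoff_epoch` (seconds)."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.stat().st_mtime >= cutoff_epoch
    )


# Self-contained demo: two empty files with hand-set modification times.
with tempfile.TemporaryDirectory() as tmp:
    old_dump = Path(tmp) / "old.dmp"
    new_dump = Path(tmp) / "new.dmp"
    old_dump.write_bytes(b"")
    new_dump.write_bytes(b"")
    os.utime(old_dump, (1000, 1000))  # "before the disconnect"
    os.utime(new_dump, (2000, 2000))  # "after the disconnect"
    recent = [p.name for p in files_newer_than(tmp, cutoff_epoch=1500)]

print(recent)
```

On Windows the directory argument would be something like `os.path.expandvars(r"%APPDATA%\Claude\Crashpad\reports")`, with the cutoff set to the epoch time of the disconnect. An empty result supports the "hung, not crashed" reading.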
|
||||
|
||||
### 3. Are the MCP child processes still alive in WSL?
|
||||
|
||||
```bash
|
||||
ps -eo pid,etime,cmd | grep -E 'mcp|claude' | grep -v grep
|
||||
```
|
||||
|
||||
If you see your MCP server processes still running with elapsed times spanning the disconnect (or fresh respawns from auto-recovery attempts), the WSL side is healthy. The damage is on the Claude Desktop ↔ MCP transport, not the MCP servers themselves.
|
||||
|
||||
### 4. What was the shell MCP doing right before the disconnect?
|
||||
|
||||
Check `%APPDATA%\Claude\logs\main.log` for the last `mcp__shell__shell_exec` permission grants and tool calls, and `%APPDATA%\Claude\logs\mcp-server-shell.log` for the last commands invoked. If you see an SSH command issued against a host that you also know to be currently rebooting / unreachable, you've found the trigger.
|
||||
|
||||
Confirm with a separate health probe of the remote host (do this in **WSL or a fresh terminal**, not through the wedged Claude Desktop):
|
||||
|
||||
```bash
ping -c 3 -W 2 <host-or-tailscale-ip>
ssh -o ConnectTimeout=5 -o BatchMode=yes <host> uptime
tailscale status | grep <host>
```
|
||||
|
||||
100% packet loss + missing tailnet entry + SSH timeout = the target is genuinely down or hung mid-reboot.
|
||||
|
||||
---
|
||||
|
||||
## Recovery
|
||||
|
||||
1. **Fully quit Claude Desktop** — system tray icon → *Quit*. Closing the window is not enough; you must terminate the main process so the MCP manager state is cleared.
|
||||
2. *(Optional)* If you want a clean slate in WSL, kill orphaned MCP child processes:
|
||||
```bash
pkill -f mcp-shell
pkill -f mail-mcp
pkill -f mcp-majorvault
# ...etc for any other MCP binaries you run
```
|
||||
This is rarely necessary — fresh spawns will replace them on next launch.
|
||||
3. **Reopen Claude Desktop**. Watch `mcp.log` and `main.log`:
|
||||
```
[LocalMcpServerManager] Connected to shell (1 tools)
[LocalMcpServerManager] Connected to filesystem (14 tools)
[LocalMcpServerManager] Connected to mail (30 tools)
...
```
|
||||
Tool counts should match your `claude_desktop_config.json`. The "UtilityProcess Check: Extension X not found in installed extensions" warnings are benign — Claude Desktop just notes that your MCPs aren't bundled built-in extensions (because they're WSL-launched).
|
||||
|
||||
---
|
||||
|
||||
## Prevention — fire-and-forget reboot patterns
|
||||
|
||||
Don't hand the MCP shell a command that intentionally severs its own SSH session and expects the shell to wait for clean closure. Instead, schedule the reboot to happen **after** SSH disconnects:
|
||||
|
||||
### Option A — `nohup` + background (most portable)
|
||||
|
||||
```bash
|
||||
ssh host 'nohup shutdown -r +1 >/dev/null 2>&1 &'
|
||||
```
|
||||
|
||||
Schedules a reboot 1 minute out, returns immediately, SSH closes cleanly. The minute delay gives you time to cancel (`ssh host 'sudo shutdown -c'`) if you change your mind.
|
||||
|
||||
### Option B — bounded keepalive timeout
|
||||
|
||||
```bash
|
||||
ssh -o ServerAliveInterval=5 -o ServerAliveCountMax=2 host 'systemctl reboot'
|
||||
```
|
||||
|
||||
If the remote drops without responding within 10 s of keepalives, the local SSH client hangs up — bounding the worst case to ~10 s instead of "until something kills the MCP." Less elegant than Option A but works for one-shot situations.
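The same bounding idea can be applied on the calling side. The sketch below is a generic Python pattern, not something the shell MCP is known to expose: `run_bounded` is a hypothetical helper, and a sleeping child process stands in for a wedged `ssh` call so the example runs anywhere.

```python
import subprocess
import sys


def run_bounded(cmd, timeout_s):
    """Run `cmd`, but give up after `timeout_s` seconds instead of blocking forever.

    Returns (returncode, timed_out); returncode is None when the timeout fired.
    """
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        return done.returncode, False
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before re-raising TimeoutExpired.
        return None, True


# Stand-in for a wedged ssh call: a child that sleeps far longer than the budget.
slow_child = [sys.executable, "-c", "import time; time.sleep(30)"]
code, timed_out = run_bounded(slow_child, timeout_s=1)

# A well-behaved command returns normally.
quick_code, quick_timed_out = run_bounded([sys.executable, "-c", "pass"], timeout_s=10)
```

In real use the command list would be the `ssh` invocation itself, for example `["ssh", "-o", "ServerAliveInterval=5", "-o", "ServerAliveCountMax=2", "host", "systemctl reboot"]`, so the keepalive options and the caller-side timeout back each other up.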
|
||||
|
||||
### Option C — schedule on the box itself
|
||||
|
||||
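Hand the reboot to a scheduler on the target itself (cron, a `systemd` timer, or `at`) so that it runs independently of the SSH session. With `at`, which needs the `atd` service running on the target: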
|
||||
|
||||
```bash
|
||||
ssh host 'echo "systemctl reboot" | at now + 1 minute'
|
||||
```
|
||||
|
||||
### Anti-pattern (don't do this)
|
||||
|
||||
```bash
# ❌ Synchronous reboot through MCP shell
ssh host reboot
ssh host sudo reboot
ssh host 'shutdown -r now'
```
|
||||
|
||||
These all hold the MCP stdio pipe open waiting for a session that is being severed at the kernel level on the remote side.
|
||||
|
||||
---
|
||||
|
||||
## Worked example — 2026-05-10 majorhome reboot
|
||||
|
||||
| Time (EDT) | Event |
|---|---|
| 00:41:06 | Claude Desktop emits permission prompt for `mcp__shell__shell_exec` |
| 00:41:08 | Shell MCP disconnect+reconnect cycle (transient, recovered in 2 s) |
| 00:41:10 | `[LocalMcpServerManager] Connected to shell (1 tools)` |
| 00:41:26 | Permission granted (likely the `ssh majorhome reboot` call) |
| 00:42:16 | `[Result] Turn succeeded` → session marked `running → idle` |
| 00:42 | `main.log` goes silent |
| 04:10:17 UTC (00:10:17 EDT *prior*; note the timezone delta between `mcp.log` and `main.log`) | All 5 MCPs disconnect within 35 ms |
| 01:00–01:10 | majorhome physically recovers, comes back up clean (`uptime` 19 min, `systemctl is-system-running` = `running`) |
| 01:13:42 | After full Claude Desktop restart, all 5 MCPs respawn |
| 01:15:22 | All 5 MCPs reconnected, tools registered |
|
||||
|
||||
majorhome itself was never the problem — the reboot succeeded. The damage was the SSH session that never closed cleanly, which poisoned the local Claude Desktop MCP transport.
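Because `mcp.log` stamps are in UTC while the timeline above is in local time, it helps to convert before comparing the two logs. A small sketch, assuming the local zone is EDT (UTC-4) on the date in question; the helper name is illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: main.log uses local time, and local time is EDT (UTC-4) on this date.
EDT = timezone(timedelta(hours=-4), "EDT")


def utc_log_stamp_to_local(stamp, local_tz=EDT):
    """Convert an mcp.log-style UTC stamp (trailing 'Z') to local wall-clock time."""
    utc_time = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    return utc_time.astimezone(local_tz)


local = utc_log_stamp_to_local("2026-05-10T04:10:17.167Z")
print(local.strftime("%H:%M:%S %Z"))  # 00:10:17 EDT
```

A fixed offset is used here to keep the example dependency-free; a named zone from `zoneinfo` would handle daylight-saving transitions automatically.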
|
||||
|
||||
---
|
||||
|
||||
## See also
|
||||
|
||||
- [Claude Desktop MCP Server Started via wsl.exe Sees Empty Environment (WSLENV)](wsl-env-claude-desktop-mcp.md) — different failure mode (start-up env passing) on the same Claude Desktop + WSL stack
|
||||
- [Pi-hole AI Blocklist Blocks Claude Desktop (ERR_CONNECTION_REFUSED)](networking/pihole-blocks-claude-desktop.md) — another Claude Desktop transport-layer failure
|
||||
- [Windows OpenSSH: WSL as Default Shell Breaks Remote Commands](networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md) — related WSL/SSH stdio behavior
|
||||
105
05-troubleshooting/forgejo-mailer-and-cli-recovery.md
Normal file
|
|
@ -0,0 +1,105 @@
|
|||
---
title: "Forgejo: Account Recovery & CLI Admin When Locked Out of the GUI"
domain: troubleshooting
category: general
tags: [forgejo, gitea, smtp, docker, account-recovery, self-hosting]
status: published
created: 2026-06-12
updated: 2026-06-12
---
|
||||
# Forgejo: Account Recovery & CLI Admin When Locked Out of the GUI
|
||||
|
||||
Two related problems on a single-admin self-hosted **Forgejo** (or Gitea): the GUI *"Forgot password"* is disabled, and you can't log in to fix it. Here's how to (1) enable account recovery properly, and (2) recover from the command line when you're already locked out.
|
||||
|
||||
## Symptoms
|
||||
|
||||
- The *Forgot password* page shows: **"Account recovery is only available when email is set up. Please set up email to enable account recovery."**
|
||||
- You can't log in (wrong/forgotten password), so you can't add an SSH key or change settings in the GUI either.
|
||||
|
||||
## Part 1 — Enable account recovery (configure the mailer)
|
||||
|
||||
Account recovery needs SMTP. If you already run a mail server on your tailnet, relay through it — **no app password needed** when the Forgejo host is `mynetworks`-trusted by that mail server.
|
||||
|
||||
Edit `app.ini` (in the data volume, e.g. `/data/gitea/conf/app.ini`):
|
||||
|
||||
```ini
[mailer]
ENABLED = true
PROTOCOL = smtp+starttls
SMTP_ADDR = 100.x.y.z ; mail server's tailnet IP
SMTP_PORT = 587
FROM = forgejo@example.com
FORCE_TRUST_SERVER_CERT = true ; required when connecting by IP (cert CN won't match)
```
|
||||
|
||||
Notes:
|
||||
|
||||
- `FORCE_TRUST_SERVER_CERT = true` is needed when you target the relay by **IP** — the TLS cert is issued for a hostname, not the IP, so verification would otherwise fail. Acceptable on a trusted internal hop.
|
||||
- Omit `USER`/`PASSWD` if the relay accepts your host via `mynetworks` (no SASL). Otherwise add SMTP auth.
|
||||
- `app.ini` lives in the persistent volume, so the change **survives container re-creation** (e.g. Watchtower's nightly pull).
|
||||
|
||||
Apply and verify:
|
||||
|
||||
```bash
docker restart forgejo
docker logs forgejo 2>&1 | grep -i "Mail Service Enabled" # confirms the mailer loaded
```
|
||||
|
||||
Test the SMTP path **before** trusting it (run from the host, mimicking Forgejo's connection):
|
||||
|
||||
```bash
python3 - <<'EOF'
import smtplib, ssl
ctx = ssl.create_default_context(); ctx.check_hostname = False; ctx.verify_mode = ssl.CERT_NONE
s = smtplib.SMTP("100.x.y.z", 587, timeout=15)
s.ehlo(); s.starttls(context=ctx); s.ehlo()
s.sendmail("forgejo@example.com", ["you@example.com"],
           "Subject: test\r\n\r\nForgejo relay path test")
s.quit(); print("SENT_OK")
EOF
```
|
||||
|
||||
`SENT_OK` means the relay accepted the message. `/user/forgot_password` should now show the reset form instead of the email error.
|
||||
|
||||
> **Container can't reach the tailnet IP?** Docker bridge networks usually route to Tailscale via the host (SNAT to the host's tailnet IP). Confirm with:
|
||||
> `docker exec forgejo nc -w5 100.x.y.z 587 </dev/null && echo REACHABLE`
|
||||
|
||||
## Part 2 — Recover from the CLI (already locked out)
|
||||
|
||||
Forgejo's admin CLI runs inside the container as the git user (UID 1000) and needs no login.
|
||||
|
||||
**Reset a password:**
|
||||
|
||||
```bash
|
||||
docker exec -u 1000 forgejo forgejo admin user change-password -u <user> -p '<newpass>'
|
||||
```
|
||||
|
||||
> ⚠️ **Gotcha:** `change-password` sets `must_change_password=true` by default. That **forces a change on next GUI login _and_ returns HTTP 403 on the API** (`"You must change your password"`). Clear it:
|
||||
> ```bash
|
||||
> docker exec -u 1000 forgejo forgejo admin user must-change-password --unset <user>
|
||||
> ```
|
||||
|
||||
**Add an SSH key without the GUI** (basic-auth API — works only if 2FA is off):
|
||||
|
||||
```bash
curl -u <user>:'<pass>' -X POST -H 'Content-Type: application/json' \
  -d '{"title":"laptop","key":"ssh-ed25519 AAAA... you@host"}' \
  http://localhost:3004/api/v1/user/keys
# HTTP 201 = created
```
|
||||
|
||||
Forgejo regenerates the git user's `authorized_keys` from the database, so `ssh -p <port> git@host` authenticates immediately afterward — no restart needed.
|
||||
|
||||
## "The password keeps changing" — it (probably) isn't
|
||||
|
||||
If a self-hosted Forgejo admin password *seems* to reset itself, look elsewhere before blaming the server: a stock Forgejo container does **not** reset admin passwords. Rule out the server side first:
|
||||
|
||||
- the compose has **no** admin/password env and no custom entrypoint;
|
||||
- **no** cron, systemd timer, or script runs `forgejo admin user change-password`;
|
||||
- the data volume is persistent (re-creation keeps the DB, password included).
|
||||
|
||||
If all three hold, nothing server-side is changing it — the "changing" password is a **client-side** artifact: a duplicate or stale entry in your password manager autofilling different values. Delete the duplicates and keep one.
|
||||
|
||||
## See also
|
||||
|
||||
- Forgejo — [Config Cheat Sheet → mailer](https://forgejo.org/docs/latest/admin/config-cheat-sheet/)
|
||||
|
|
@ -0,0 +1,119 @@
|
|||
---
title: "LoRA adapter — GGUF conversion fails with 'config.json not found'"
domain: troubleshooting
category: gpu-display
tags: [lora, qlora, gguf, llama.cpp, unsloth, fine-tuning, qwen]
status: published
created: 2026-04-30
updated: 2026-04-30
---
|
||||
|
||||
# LoRA adapter — GGUF conversion fails with 'config.json not found'
|
||||
|
||||
## Problem
|
||||
|
||||
After a QLoRA fine-tune, you point `llama.cpp/convert_hf_to_gguf.py` at the training output directory and it crashes immediately:
|
||||
|
||||
```
FileNotFoundError: [Errno 2] No such file or directory:
  '/path/to/training-runs/<run>/final/config.json'
```
|
||||
|
||||
The output directory looks fine — it contains:
|
||||
|
||||
```
adapter_config.json
adapter_model.safetensors (~150 MB for a 7B base)
chat_template.jinja
tokenizer_config.json
tokenizer.json
```
|
||||
|
||||
But no `config.json`, and `adapter_model.safetensors` is 150 MB — way smaller than the ~14 GB you'd expect for a full Qwen2.5-7B 16-bit checkpoint.
|
||||
|
||||
## Root cause
|
||||
|
||||
`model.save_pretrained()` after a LoRA/QLoRA train saves **only the adapter weights**, not a merged full-precision model. `convert_hf_to_gguf.py` expects a full HuggingFace model directory — it reads `config.json` to identify the architecture. Adapter-only directories don't have one.
|
||||
|
||||
You need to merge the LoRA adapter into the base model first, then point the GGUF converter at the merged dir.
|
||||
|
||||
## Solution
|
||||
|
||||
### Quick fix — inline merge step
|
||||
|
||||
Insert this block between training completion and `convert_hf_to_gguf.py`:
|
||||
|
||||
```python
from unsloth import FastLanguageModel

adapter = "/path/to/training-runs/<run>/final"
merged = "/path/to/training-runs/<run>/merged"

model, tok = FastLanguageModel.from_pretrained(
    model_name=adapter,
    max_seq_length=2048,
    load_in_4bit=True,
)
model.save_pretrained_merged(merged, tok, save_method="merged_16bit")
```
|
||||
|
||||
Then run the GGUF converter against the **merged** dir, not the adapter dir:
|
||||
|
||||
```bash
python3 llama.cpp/convert_hf_to_gguf.py /path/to/training-runs/<run>/merged \
  --outfile model-f16.gguf --outtype f16
```
|
||||
|
||||
The merged dir will contain `config.json`, `model-00001-of-00004.safetensors` (multiple shards totaling the full base model size), `generation_config.json`, etc.
|
||||
|
||||
### Cleaner fix — use a wrapper
|
||||
|
||||
If you do this often, encapsulate it:
|
||||
|
||||
1. Wrapper Python script accepts `--adapter`, `--output`, `--skip-merge`, `--all-quants`
|
||||
2. Step 1: load adapter via `FastLanguageModel.from_pretrained()`, call `save_pretrained_merged()`
|
||||
3. Step 2: subprocess `convert_hf_to_gguf.py` on the merged dir
|
||||
4. Step 3: subprocess `llama-quantize` for each requested quant
|
||||
|
||||
This is what `~/corpus/scripts/convert_gguf.py` does on MajorRig (rewritten 2026-04-09 for the MajorTwin v7b cycle).
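The shape of such a wrapper might look like the sketch below. It is not the actual `convert_gguf.py`: the flag names come from the list above, but the output file names and the quant lists are assumptions made for illustration. It only plans the conversion and quantization commands; the merge (step 1) would be the `save_pretrained_merged()` call from the quick fix.

```python
import argparse
from pathlib import Path

# Assumptions for illustration: which quant types are requested in each mode.
DEFAULT_QUANTS = ["Q4_K_M"]
ALL_QUANTS = ["Q4_K_M", "Q5_K_M", "Q8_0"]


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Merge a LoRA adapter, then convert to GGUF")
    parser.add_argument("--adapter", required=True, type=Path, help="adapter-only directory")
    parser.add_argument("--output", required=True, type=Path, help="directory for merged dir and GGUFs")
    parser.add_argument("--skip-merge", action="store_true", help="reuse an existing merged dir")
    parser.add_argument("--all-quants", action="store_true", help="produce every quant in ALL_QUANTS")
    return parser.parse_args(argv)


def plan_commands(args):
    """Return (merged_dir, commands) for the convert and quantize steps."""
    merged = args.output / "merged"
    f16 = args.output / "model-f16.gguf"
    commands = [[
        "python3", "llama.cpp/convert_hf_to_gguf.py", str(merged),
        "--outfile", str(f16), "--outtype", "f16",
    ]]
    for quant in (ALL_QUANTS if args.all_quants else DEFAULT_QUANTS):
        # llama-quantize takes: <input gguf> <output gguf> <type>
        commands.append(["llama-quantize", str(f16), str(args.output / f"model-{quant}.gguf"), quant])
    return merged, commands


args = parse_args(["--adapter", "run/final", "--output", "out", "--all-quants"])
merged_dir, commands = plan_commands(args)
for cmd in commands:
    print(" ".join(cmd))
```

A real wrapper would run the merge unless `--skip-merge` is set, then pass each planned command to `subprocess.run(cmd, check=True)` so that a failed step stops the pipeline.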
|
||||
|
||||
## Why this trips people up
|
||||
|
||||
- Unsloth and PEFT both save adapter-only by default after `trainer.save_model()` or `model.save_pretrained()`. There's no warning that downstream tools expect a merged model.
|
||||
- The training output **looks** complete — there's a `tokenizer.json`, a `chat_template.jinja`, and a non-trivial `.safetensors`. It feels like a checkpoint.
|
||||
- If a pipeline runs `convert_gguf.py` (which merges) and someone later reimplements Step 4 inline, bypassing the wrapper, the merge step is silently lost. This is what happened in MajorTwin v8c (Apr 30, 2026); see [[majortwin-v8b-plan#Pipeline Bug + Fix (2026-04-30)]].
|
||||
|
||||
## Verification checklist
|
||||
|
||||
After training, before running the GGUF converter, verify the directory you're pointing at:
|
||||
|
||||
| File | Adapter-only dir | Merged dir |
|---|---|---|
| `adapter_config.json` | ✅ | ❌ |
| `adapter_model.safetensors` | ✅ (~150 MB / 7B) | ❌ |
| `config.json` | ❌ | ✅ |
| `model-*.safetensors` (sharded) | ❌ | ✅ (~14 GB / 7B) |
| `generation_config.json` | ❌ | ✅ |
| `tokenizer.json` | ✅ | ✅ |
|
||||
|
||||
If you see only the left column, you need to merge before converting.
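The checklist lends itself to a pre-flight check before the converter runs. A minimal sketch, assuming the file names in the table are the only signals needed; `classify_model_dir` is a hypothetical helper, and the demo uses empty placeholder files in a temporary directory.

```python
import tempfile
from pathlib import Path


def classify_model_dir(path):
    """Classify a directory as 'merged', 'adapter-only', or 'unknown' by its file names."""
    path = Path(path)
    has_adapter = (path / "adapter_config.json").is_file()
    has_config = (path / "config.json").is_file()
    # Matches both sharded (model-00001-of-0000N) and single-file weights.
    has_weights = any(path.glob("model*.safetensors"))
    if has_config and has_weights:
        return "merged"
    if has_adapter and not has_config:
        return "adapter-only"
    return "unknown"


# Demo with placeholder files mirroring the two columns of the table.
with tempfile.TemporaryDirectory() as tmp:
    adapter_dir = Path(tmp) / "final"
    merged_dir = Path(tmp) / "merged"
    adapter_dir.mkdir()
    merged_dir.mkdir()
    for name in ("adapter_config.json", "adapter_model.safetensors", "tokenizer.json"):
        (adapter_dir / name).touch()
    for name in ("config.json", "model-00001-of-00004.safetensors", "tokenizer.json"):
        (merged_dir / name).touch()
    results = (classify_model_dir(adapter_dir), classify_model_dir(merged_dir))

print(results)
```

Failing fast when the result is `"adapter-only"` turns the converter's `FileNotFoundError` into an explicit "merge first" message.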
|
||||
|
||||
## Resuming a failed pipeline without re-training
|
||||
|
||||
The adapter is small and self-contained. If your pipeline crashes at the GGUF step, you do NOT need to retrain — the LoRA adapter at `<run>/final/` is intact. Write a resume wrapper that runs only:
|
||||
|
||||
1. Merge (`save_pretrained_merged`)
|
||||
2. F16 conversion (`convert_hf_to_gguf.py`)
|
||||
3. Quantization (`llama-quantize`)
|
||||
4. Deploy
|
||||
|
||||
This saves the cost of however many GPU-hours the training took. See `~/corpus/scripts/resume_v8c_step4.sh` on MajorRig for an example.
|
||||
|
||||
## Related
|
||||
|
||||
- [[qwen-14b-oom-3080ti]] — base model size choice on a 12GB GPU
|
||||
- [[majortwin-v8b-plan]] — v8c pipeline architecture and resume
|
||||
|
||||
## Maintenance
|
||||
|
||||
- 2026-04-30 — Created after MajorTwin v8c pipeline failed Step 4. Root-caused, patched, resumed.
|
||||
|
|
@ -1,6 +1,6 @@
|
|||
---
|
||||
created: 2026-03-15T06:37
|
||||
updated: 2026-04-29T23:55
|
||||
updated: 2026-05-02T17:50
|
||||
---
|
||||
# 🔧 General Troubleshooting
|
||||
|
||||
|
|
@ -8,12 +8,18 @@ Practical fixes for common Linux, networking, and application problems.
|
|||
|
||||
## 🖥️ GPU & AI
|
||||
- [Qwen2.5-14B OOM on RTX 3080 Ti (12GB)](gpu-display/qwen-14b-oom-3080ti.md)
|
||||
- [LoRA adapter — GGUF conversion fails with 'config.json not found'](gpu-display/lora-adapter-gguf-conversion-fails.md)
|
||||
|
||||
## 🌐 Networking & Web
|
||||
- [Wi-Fi Game Streaming Stutter: 160 MHz Channel Width Saturating the 5 GHz Radio](networking/wifi-160mhz-airtime-saturation-game-streaming.md)
|
||||
- [Apache Outage: Fail2ban Self-Ban + Missing iptables Rules](networking/fail2ban-self-ban-apache-outage.md)
|
||||
- [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](networking/fail2ban-imap-self-ban-mail-client.md)
|
||||
- [firewalld: Mail Ports Wiped After Reload](networking/firewalld-mail-ports-reset.md)
|
||||
- [Dovecot IMAP Clients Fail to Sync: vsz_limit OOM from a Bloated Index Log](networking/dovecot-imap-oom-vsz-limit-bloated-index.md)
|
||||
- [Postfix header_checks Can't Act on Milter-Added Headers (Use Sieve)](networking/postfix-header-checks-vs-milter-headers.md)
|
||||
- [Dovecot Phantom Mailboxes from .dovecot.lda-dupes (mail_home Overlapping the Maildir Root)](networking/dovecot-mail-home-maildir-root-phantom-mailboxes.md)
|
||||
- [Tailscale SSH: Unexpected Re-Authentication Prompt](networking/tailscale-ssh-reauth-prompt.md)
|
||||
- [SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure (No `Host` Block)](networking/ssh-missing-host-block-magicdns-host-key-failure.md)
|
||||
- [iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators](networking/tailscale-status-json-hostname-localhost-ios.md)
|
||||
- [rsync over Tailscale: Hung in TCP Teardown After Transfer Completes](networking/rsync-tailscale-teardown-stall.md)
|
||||
- [Windows OpenSSH: WSL Default Shell Breaks Remote Commands](networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md)
|
||||
|
|
@ -26,6 +32,8 @@ Practical fixes for common Linux, networking, and application problems.
|
|||
- [SSH Timeout During dnf upgrade on Fedora Hosts](ansible-ssh-timeout-dnf-upgrade.md)
|
||||
- [Vault Password File Missing](ansible-vault-password-file-missing.md)
|
||||
- [ansible.cfg Ignored on WSL2 Windows Mounts](ansible-wsl2-world-writable-mount-ignores-cfg.md)
|
||||
- [regex_search — capture-group argument doesn't work in set_fact](ansible-regex-search-set-fact-capture-group.md)
|
||||
- [reboot.yml: become Timeout on WSL2 Hosts (Exclude Them)](ansible-reboot-become-timeout-wsl2.md)
|
||||
|
||||
## 📦 Docker & Systems
|
||||
- [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](docker-caddy-selinux-post-reboot-recovery.md)
|
||||
|
|
@ -36,6 +44,7 @@ Practical fixes for common Linux, networking, and application problems.
|
|||
|
||||
## 🔒 SELinux
|
||||
- [SELinux: Fixing Dovecot Mail Spool Context (/var/vmail)](selinux-dovecot-vmail-context.md)
|
||||
- [SELinux: Wrong /etc/localtime Label Silently Breaks Timezone Changes](selinux-localtime-label-breaks-timezone.md)
|
||||
|
||||
## 💾 Storage
|
||||
- [mdadm RAID Recovery After USB Hub Disconnect](storage/mdadm-usb-hub-disconnect-recovery.md)
|
||||
|
|
@ -43,9 +52,12 @@ Practical fixes for common Linux, networking, and application problems.
|
|||
## 📝 Application Specific
|
||||
- [Obsidian Vault Recovery — Loading Cache Hang](obsidian-cache-hang-recovery.md)
|
||||
- [Gemini CLI Manual Update](gemini-cli-manual-update.md)
|
||||
- [iPhone Mirroring Hangs on 'Connecting…' — AWDL Data Stall (27.0 Beta)](iphone-mirroring-connecting-hang-awdl-stall-beta.md)
|
||||
|
||||
## 🤖 AI / Local LLM
|
||||
- [Ollama Drops Off Tailscale When Mac Sleeps](ollama-macos-sleep-tailscale-disconnect.md)
|
||||
- [Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt](ollama-chat-template-pipe-stdin-bypass.md)
|
||||
- [Windows OpenSSH Server (sshd) Stops After Reboot](networking/windows-sshd-stops-after-reboot.md)
|
||||
- [claude-mem Silently Fails with Claude Code 2.1+ (Empty `--setting-sources`)](claude-mem-setting-sources-empty-arg.md)
|
||||
- [Claude Code Won't Log In (Warp & iTerm2) — Corrupt Keychain Credential](claude-code-warp-login-corrupt-keychain-credential.md)
|
||||
- [Claude Code Keychain Prompt Keeps Reappearing on macOS (ACL Invalidation)](claude-code-keychain-prompt-recurring-macos.md)
|
||||
|
|
|
|||
|
|
@ -0,0 +1,150 @@
|
|||
---
title: "iPhone Mirroring Hangs on 'Connecting…' — AWDL Data Stall (27.0 Beta)"
domain: troubleshooting
category: macos
tags: [macos, iphone-mirroring, continuity, awdl, rapport, quic, tailscale, mullvad, beta, channel-validation, aimesh, quicktime, usb]
status: published
created: 2026-06-09
updated: 2026-06-15
---
|
||||
|
||||
# iPhone Mirroring Hangs on 'Connecting…' — AWDL Data Stall (27.0 Beta)
|
||||
|
||||
## Update 2026‑06‑15 — REGRESSED; reproducibly stuck on "Connecting", and Tailscale was **not** the cure
|
||||
|
||||
> **Correction to the 2026‑06‑14 "it WORKS" update below.** On 2026‑06‑15 iPhone Mirroring is **reproducibly stuck on "Connecting to iPhone 16 Pro"** on MajorAir again — with Tailscale `accept-routes` *still* `false`. So the accept‑routes change was **correlation, not the fix**: this is an **intermittent macOS 27.0 beta AWDL bug, independent of Tailscale**.
|
||||
>
|
||||
> **Tried this round — all failed to establish a session:** Tailscale `accept-routes=false` (already in place) · `sudo ifconfig awdl0 down/up` · **full Mac reboot** · cycling the iPhone's Wi‑Fi + Bluetooth.
|
||||
>
|
||||
> **Log signature:** `rapportd` resolves the phone's `_asquic._udp.local` endpoint and `_companion-link` registers (discovery *succeeds*), but the QUIC‑over‑AWDL **datapath never completes into a live session** — `wifip2pd` loops on `AWDLDiscoveryTimeout (hasAdvertises=false)`. Each reset advanced the handshake one stage further (no‑advertises → resolve‑started → endpoint‑resolved) yet none reached a streaming session. **`llw0` never went active (0 bytes)** — confirming no A/V ever flowed, regardless of what the 06‑14 note measured.
|
||||
>
|
||||
> **Stance:** beta OS bug, **no reliable user‑side fix**. Use the **QuickTime USB mirror** workaround (below) when you actually need the phone on screen. The 06‑14 "it works on `llw0`" measurements were real *for that one session* but are **not reproducible** across seeds/sessions — treat mirroring as intermittently broken on the 27.0 betas. This re‑confirms the original **Root cause (conclusion)** section further down (a beta bug, "nothing in local config wrong"), which the 06‑14 update had prematurely overridden.
|
||||
|
||||
## Update 2026‑06‑14 (evening) — it WORKS; the "AWDL starvation" finding was the wrong interface
|
||||
|
||||
> iPhone Mirroring is now **working** on MajorAir — stable session, clean video, no missing icons — on **ch44/80** with Tailscale `accept-routes=false`. An earlier pass the same day blamed an "AWDL bulk‑path starving at ~90 B/s"; that was **measuring the wrong interface** and is corrected here.
|
||||
|
||||
**The video transport is `llw0` (low‑latency WLAN), not `awdl0`.**
|
||||
Measured during an active session: **`llw0` ≈ 800 KB/s** (≈6 Mbps of real video), `en0` ~60 KB/s, **`awdl0` ~1 KB/s**. `awdl0` only ever carries AWDL *discovery/control* (~90 B/s) — whether mirroring works or not. So "90 B/s on `awdl0` = starved bulk path" was a **red herring**: the A/V stream rides `llw0`, which the earlier pass never measured.
|
||||
|
||||
**What was actually broken was session *stability*.** The `XPC_ERROR_CONNECTION_INTERRUPTED` / `MediaContinuityKit.TaskTimeoutError` teardown loop kept the `llw0` stream from ever sustaining (→ glitchy / missing icons). When the session holds, `llw0` streams clean.
|
||||
|
||||
**What changed (not cleanly isolated):** three things differed between the broken and working states — (1) the network fully **settled on ch44** over ~15 h (the failing ch44 test was minutes after a chaotic AiMesh re‑sync + reconnect scramble), (2) Tailscale **`accept-routes` was turned off** (it had been polluting IPv4 routing + the Continuity control plane), and (3) both devices slept/woke. Which one mattered is not yet proven.
|
||||
|
||||
**Open test — isolates Tailscale's role:** repro on **MajorMac** with *unaltered* Tailscale (`accept-routes` still **ON**). If mirroring breaks there but works on MajorAir (accept‑routes OFF), that pins Tailscale's accepted routes as the trigger. See [[MajorAir#Known Issues]] for the `accept-routes=false` fix.
|
||||
|
||||
**Still valid from earlier today:** congestion ruled out (router `chanim_stats` ch36 = 90 % idle, 86 % txop); the AiMesh / router infra notes below; and iPhone Mirroring is **wireless‑only — no USB transport** (for a wired screen view, use QuickTime, below).
|
||||
|
||||
> ⚠️ The iPhone‑radio `isValidChannel`/`awdl0` evidence cited in the original 2026‑06‑09 write‑up below describes AWDL *discovery* health, **not** the video path — read it in light of this correction.
|
||||
|
||||
**Wired workaround (works today, no AWDL):**
|
||||
iPhone Mirroring is **wireless‑only — there is no USB transport** (confirmed: cable connected throughout, every attempt still used `awdl0`). For a wired view of the screen:
|
||||
> **QuickTime Player → File → New Movie Recording → ⌄ next to record → select the iPhone** = full‑rate USB‑C screen mirror (view + record). Does **not** give remote control (tap/type) — that's unique to iPhone Mirroring.
|
||||
|
||||
**Infra notes (RT‑AX82U, AiMesh controller):**
|
||||
- Router SSH is on **port 1025** (not 22); creds in Ansible vault (`router_username` / `router_password`).
|
||||
- The 5 GHz channel is **AiMesh‑coordinated** and **resists CLI changes** — `wl chanspec` / nvram `wl1_chanspec` get re‑asserted by `acsd2` + AiMesh within seconds, even after `restart_wireless`. Only setting Control Channel to an **explicit value in the Web UI** holds mesh‑wide. Left "Auto" → acsd2 picks **36** (the cleanest channel).
|
||||
- Any channel change triggers a **mesh re‑sync (~1 min) that drops all Wi‑Fi**; during it MajorAir falls back to the iPhone's **USB Personal Hotspot** (`en7` / `172.20.10.x`) and won't auto‑rejoin home Wi‑Fi while the hotspot feeds it internet (manual Wi‑Fi‑menu join needed).
|
||||
- **Current state: 5 GHz on ch44/80** (same clean UNII‑1 spectrum as 36; left here to avoid another re‑sync — the Deck streams identically on 44).
|
||||
|
||||
**If it breaks again — troubleshooting checklist:**
|
||||
1. **It's session stability, not bandwidth.** Look for teardown loops: `log show --last 3m --predicate 'process == "iPhone Mirroring"' | grep -iE "interrupt|timeout|endpoint"`.
|
||||
2. **Measure the right interface** — video rides **`llw0`** (hundreds of KB/s when the screen is active), *not* `awdl0` (~90 B/s control is normal): `netstat -ib | awk '/<Link#/{print $1, $7}'` before/after a few seconds.
|
||||
3. **Tailscale:** confirm `accept-routes=false` on the Mac (`tailscale debug prefs | grep RouteAll`) — see [[MajorAir#Known Issues]].
|
||||
4. **Let the network settle** after any Wi‑Fi/channel change — an AiMesh re‑sync churns AWDL/Continuity state for a minute+; retry once stable.
|
||||
5. iPhone: on home Wi‑Fi, near the Mac, **Personal Hotspot off**, not in Low Power Mode.
|
||||
6. **Wired fallback that always works:** QuickTime → New Movie Recording → select the iPhone (USB‑C; view/record only, no control).
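Step 2's before/after comparison can be scripted. The sketch below mirrors the `awk` one-liner's assumption that field 7 of a `<Link#…>` row in `netstat -ib` output is the received-bytes counter. The sample text and counter values are synthetic, chosen only to echo the rates quoted earlier, and both helper names are illustrative.

```python
def parse_link_counters(netstat_output):
    """Extract {interface: Ibytes} from `netstat -ib` text, using <Link#...> rows only."""
    counters = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Same assumption as the awk one-liner: Ibytes is the 7th field on Link rows.
        if len(fields) >= 7 and fields[2].startswith("<Link#"):
            counters[fields[0]] = int(fields[6])
    return counters


def byte_rates(before, after, interval_s):
    """Receive rate in bytes/second per interface between two counter samples."""
    return {
        iface: (after[iface] - before[iface]) / interval_s
        for iface in before
        if iface in after
    }


# Synthetic samples taken 5 seconds apart (not real measurements).
SAMPLE_BEFORE = """\
Name  Mtu   Network    Address            Ipkts Ierrs   Ibytes Opkts Oerrs  Obytes Coll
awdl0 1500  <Link#12>  aa:bb:cc:dd:ee:01    100     0    10000   100     0   10000    0
llw0  1500  <Link#13>  aa:bb:cc:dd:ee:02   9000     0  1000000  9000     0 1000000    0
"""
SAMPLE_AFTER = """\
Name  Mtu   Network    Address            Ipkts Ierrs   Ibytes Opkts Oerrs  Obytes Coll
awdl0 1500  <Link#12>  aa:bb:cc:dd:ee:01    105     0    10450   105     0   10450    0
llw0  1500  <Link#13>  aa:bb:cc:dd:ee:02  12000     0  5000000 12000     0 5000000    0
"""

rates = byte_rates(parse_link_counters(SAMPLE_BEFORE), parse_link_counters(SAMPLE_AFTER), 5)
for iface, rate in sorted(rates.items()):
    print(f"{iface}: {rate:.0f} B/s")
```

Rows for interfaces without a link-layer address have fewer columns, so the field position shifts; that does not matter for `awdl0` and `llw0`, which is all this check needs.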
|
||||
|
||||
---
|
||||
|
||||
## Symptom
|
||||
iPhone Mirroring on the Mac sits on **"Connecting…"** forever and never shows the iPhone screen.
|
||||
- Mac: **macOS 27.0 dev beta** (build 26A5353q), MajorAir
|
||||
- iPhone: **Major16Pro / iPhone17,1, iOS 27.0 dev beta**, same Apple ID (maj.linux@gmail.com)
|
||||
|
||||
## Root cause (conclusion)
|
||||
A **bug in the iPhone Mirroring beta** (both devices on the `.0` developer seeds). The connection authenticates, the AWDL peer-to-peer link comes up, the TLS handshake starts, and then **bidirectional data stalls ~2 seconds in** and the link is torn down. Deterministic, reproduces every attempt. **Nothing in the local configuration was wrong.** Filed via Feedback Assistant; expected to clear in a future seed.

Two *real but secondary* network-layer issues were found and fixed along the way (see below). They can block mirroring independently, but were not the cause of the final 2-second stall.
|
||||
|
||||
## The smoking gun (unified log)
|
||||
Per connection attempt the sequence is always:
|
||||
```
rapportd: Session start … linkType "AWDL", error "NoError" # link negotiated OK
iPhone Mirroring: Installing verify block for 1 authorized peer key(s)
boringssl: TLS client read_server_hello # iPhone DID respond
quic: path over awdl0 received event established / promoted to primary
nw_flow_connected: Transport protocol connected (socket)
… flow:connect_stalled @2.003s # stalls exactly ~2s in
quic_conn_log_summary: Connection attempts: 6, RETRY received: no, PTOs: 5 # packets sent, zero ACKs
[C1.1.1 … awdl0 … failed socket-flow (unsatisfied (No network route))] # link dropped (symptom, not cause)
```
|
||||
The connection is pinned to AWDL (`allowed subtypes: wifi_awdl, prohibit fallback`), so once the AWDL data path stalls there is no fallback and it fails. "No network route" is the *result* of the teardown, not the trigger. The trigger is that after the initial handshake packets, **sustained QUIC traffic over AWDL gets no ACKs** (PTOs).
|
||||
|
||||
## Investigation path (what was ruled out)
|
||||
- **Discovery / proximity**: healthy throughout. BLE + Bonjour resolve the iPhone; `rapportd` sees it with good RSSI, same iCloud (`DF < MyiCloud >`), `WiFiP2P`.
- **Tailscale (full-tunnel)**: with Tailscale connected, the attempt died at "No network route" *before* even reaching AWDL. Cause: `RouteAll: true` (accept-routes / a `::/0` advertised route) installs **IPv6 default routes via `utun`** (`default → fe80::%utun0..3`) that black-hole the IPv6 path AWDL needs. **`tailscale down` is NOT enough**: it only sets `WantRunning=false`; the macOS VPN *configuration* (`scutil --nc list` showed it still `Connected`) and the system extension keep reasserting the routes across reboots. Must disable in **System Settings → VPN**.
- **Mullvad**: `mullvad-daemon` running; **"Local network sharing" was set to `block`**, which blocks LAN/AWDL/multicast. Changed to **`allow`** (`mullvad lan set allow`). Kill-switch was off.
- **macOS firewall**: off. No Little Snitch/LuLu app installed.
- **Lockdown Mode**: off (iPhone).
- **OS-version mismatch**: ruled out; both Mac and iPhone on 27.0 dev beta.
- **Device trust / re-pairing**: there is **no local pairing record on the Mac** to reset. `rapportd` lists the iPhone as **"PairedSys Conjectured"** = trust is *derived from the shared Apple ID*, not a manual pairing. Forgetting the Mac on the iPhone does not force re-setup; the Mac just re-derives the association from iCloud and reconnects. (App containers `~/Library/Containers/com.apple.ScreenContinuity` and the rapport stores held no device record; the "1 authorized peer key" lives in the protected system keychain.)
- **Reboots / airplane-mode toggle / Mac-side AWDL + rapportd reset**: no change.
|
||||
|
||||
## Secondary issues found & fixed (do these regardless)
|
||||
1. **Mullvad** — set **Local network sharing = allow** (done). Required for any LAN/AWDL feature.
|
||||
2. **Tailscale** — do not run **full-tunnel / accept a `::/0` route** while mirroring; it installs
|
||||
IPv6 default routes via `utun` that kill the local link. Toggle the VPN off in System Settings
|
||||
(not just `tailscale down`) if it ever needs to be fully out of the path.
|
||||
3. **Orphaned Little Snitch network extension** — the app was uninstalled but its
|
||||
`at.obdev.littlesnitch.networkextension` is still `[activated enabled]`
|
||||
(`systemextensionsctl list`). Remove via **System Settings → General → Login Items &
|
||||
Extensions → Network Extensions**. A zombie filter extension with no app behind it can
|
||||
black-hole traffic.
|
||||
|
||||
## Status / next steps
|
||||
- **No user-side fix.** Filed in Feedback Assistant.
|
||||
- Debug capture saved: `~/Desktop/iPhoneMirroring-debug-20260609-0026.txt` (summary + log narrative).
|
||||
For a full report, trigger a sysdiagnose (**⌃⌥⇧⌘ + .**) right after reproducing and attach it.
|
||||
- VPNs restored after session: Tailscale back up; Mullvad left disconnected with LAN sharing = allow.
|
||||
|
||||
## Useful diagnostic commands (for next time)
|
||||
```bash
# Connection narrative
log show --last 10m --style compact --predicate \
  '(subsystem == "com.apple.MediaContinuityKit") OR (process == "iPhone Mirroring")' | tail -60

# rapport / AWDL negotiation
log show --last 5m --style compact --predicate 'process == "rapportd"' | grep -iE "AWDL|Pair|Session"

# VPN config really on? (CLI "down" lies)
scutil --nc list ; scutil --nc status "Tailscale"

# IPv6 default routes hijacked by utun?
netstat -rn -f inet6 | awk '$1=="default"{print}'

# Active system extensions (filters/VPNs)
systemextensionsctl list

# Mullvad LAN sharing
mullvad lan get
```
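
The "IPv6 default routes hijacked by utun?" check can also be sketched in Python. This is a minimal sketch that assumes the usual macOS `netstat -rn` column order (Destination, Gateway, Flags, Netif); the sample text is illustrative rather than captured output, and `utun_default_routes` is a hypothetical helper name.

```python
def utun_default_routes(netstat_output):
    """Return (gateway, interface) pairs for default routes bound to a utun interface."""
    hits = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: Destination Gateway Flags Netif [Expire]
        if len(fields) >= 4 and fields[0] == "default" and fields[3].startswith("utun"):
            hits.append((fields[1], fields[3]))
    return hits

# Illustrative sample, not real captured output.
sample = """\
Destination        Gateway            Flags   Netif Expire
default            fe80::%utun0       UGcIg   utun0
default            fe80::1%en0        UGcg    en0
fe80::%lo0/64      fe80::1%lo0        UcI     lo0
"""
print(utun_default_routes(sample))
```

A non-empty result while the VPN is supposedly "down" is the condition described in the Tailscale bullet above.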
|
||||
|
||||
## Related
|
||||
- `macos-mirrored-notification-alert-loop.md` (other Continuity issue)
|
||||
- Hosts/VPN context: MajorTwin project doc (Tailscale tailnet, 100.x addresses)
|
||||
|
|
@ -1,11 +1,17 @@
|
|||
---
|
||||
title: "ISP SNI Filtering & Caddy Troubleshooting"
|
||||
title: ISP SNI Filtering & Caddy Troubleshooting
|
||||
domain: troubleshooting
|
||||
category: general
|
||||
tags: [isp, sni, caddy, tls, dns, cloudflare]
|
||||
tags:
|
||||
- isp
|
||||
- sni
|
||||
- caddy
|
||||
- tls
|
||||
- dns
|
||||
- cloudflare
|
||||
status: published
|
||||
created: 2026-04-02
|
||||
updated: 2026-04-02
|
||||
updated: 2026-04-30T13:07
|
||||
---
|
||||
# ISP SNI Filtering & Caddy Troubleshooting
|
||||
|
||||
|
|
@ -29,3 +35,89 @@ notes.majorshouse.com {
|
|||
```
|
||||
|
||||
Once the hostname was changed to one without the "wiki" keyword, the TLS handshake completed successfully.
|
||||
|
||||
---
|
||||
|
||||
## 🔁 2026-04-30 Update — Stale A Record + Cloudflare Proxy Fix
|
||||
|
||||
The hostname rename held for ~4 weeks. On 2026-04-30 the wiki went down with a TLS handshake failure on `notes.majorshouse.com`. The on-the-spot framing was "ISP filter expanded to include 'notes'" — but Cloudflare DNS audit showed a different (and arguably worse) root cause: **the `notes` A record was pointing at `136.54.3.248`, an IP that is not majorlab's current home IP.** Whichever host responds at that address either does not run Caddy or does not know about the `notes.majorshouse.com` SNI, so the TLS handshake was rejected with `internal_error 80`.
|
||||
|
||||
### Re-diagnosis
|
||||
|
||||
```bash
|
||||
# Cert + Caddy + mkdocs all healthy on majorlab
|
||||
$ ssh majorlab 'systemctl is-active caddy; ss -tlnp | grep :443'
|
||||
active
|
||||
LISTEN 0 4096 *:443 users:(("caddy",pid=1549,fd=7))
|
||||
|
||||
# Loopback-served TLS works fine — cert valid Mar 11 → Jun 9 2026
|
||||
$ ssh majorlab 'curl -sS -o /dev/null -w "%{http_code}\n" --resolve notes.majorshouse.com:443:127.0.0.1 https://notes.majorshouse.com/'
|
||||
200
|
||||
|
||||
# External TLS handshake gets rejected with internal_error
|
||||
$ openssl s_client -servername notes.majorshouse.com -connect 136.54.3.248:443
|
||||
… SSL alert number 80 (internal_error) …
|
||||
```
|
||||
|
||||
### The smoking-gun comparison
|
||||
|
||||
Other `*.majorshouse.com` services worked because they were CNAMEs to the apex, which resolves to majorlab's actual home IP:
|
||||
|
||||
| Subdomain | DNS shape | Final IP | Status |
|---|---|---|---|
| `notes.majorshouse.com` | **A → `136.54.3.248`** (stale) | `136.54.3.248` (wrong host) | ❌ TLS rejected |
| `git.majorshouse.com` | CNAME → `majorshouse.com.` | `136.56.0.55` (majorlab) | ✅ |
| `n8n.majorshouse.com` | CNAME → `majorshouse.com.` | `136.56.0.55` (majorlab) | ✅ |
| `matrix.majorshouse.com` | CNAME → `majorshouse.com.` | `136.56.0.55` (majorlab) | ✅ |
|
||||
|
||||
None of the working subdomains were proxied through Cloudflare (`proxied=false` on all of them); they simply had the right IP. The `notes` A record was the only one pointing somewhere wrong — most likely a stale value from a prior ISP / IP change that never got cleaned up.
|
||||
|
||||
### ✅ Fix — switch `notes` to a Cloudflare-proxied CNAME
|
||||
|
||||
Rather than just correcting the A record (which would silently break again the next time the home IP changes), the fix is a CNAME to the apex with proxy on. That gives two protections in one move: it always tracks the apex (so home IP changes propagate automatically) and it puts the wiki behind Cloudflare's edge (so any future ISP-side weirdness like the original `wiki` SNI filter is also bypassed).
|
||||
|
||||
```bash
|
||||
# via Cloudflare API (token from ansible-vault: vault_cloudflare_api_token)
|
||||
PUT /zones/{ZONE_ID}/dns_records/{NOTES_RECORD_ID}
|
||||
{
|
||||
"type": "CNAME",
|
||||
"name": "notes.majorshouse.com",
|
||||
"content": "majorshouse.com",
|
||||
"ttl": 1,
|
||||
"proxied": true,
|
||||
"comment": "switched A→CNAME proxied to bypass stale IP / ISP SNI filter"
|
||||
}
|
||||
```
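
For reference, the same update can be sketched with Python's standard library. This is a sketch under assumptions: it builds the request against Cloudflare's v4 API base URL with a bearer token and does not send it; `ZONE_ID`, `NOTES_RECORD_ID`, and `TOKEN` are placeholders, and `build_update_request` is a hypothetical helper name.

```python
import json
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"

def build_update_request(zone_id, record_id, token):
    """Build (but do not send) the PUT that turns the record into a proxied CNAME."""
    body = {
        "type": "CNAME",
        "name": "notes.majorshouse.com",
        "content": "majorshouse.com",
        "ttl": 1,  # 1 means "automatic" TTL
        "proxied": True,
        "comment": "switched A→CNAME proxied to bypass stale IP / ISP SNI filter",
    }
    return urllib.request.Request(
        f"{API_BASE}/zones/{zone_id}/dns_records/{record_id}",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

req = build_update_request("ZONE_ID", "NOTES_RECORD_ID", "TOKEN")
print(req.get_method(), req.full_url)
# To actually apply the change: urllib.request.urlopen(req)
```

Keeping the request construction separate from the send makes it easy to inspect the payload before touching live DNS.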
|
||||
|
||||
Or via the dashboard:
|
||||
|
||||
1. Cloudflare → `majorshouse.com` zone → DNS → Records
|
||||
2. Edit the `notes` record: Type `CNAME`, Target `majorshouse.com`, Proxy `Proxied` (orange cloud)
|
||||
3. Save
|
||||
|
||||
External clients now hit Cloudflare edge IPs (`104.21.x.x` / `172.67.x.x`), which terminate TLS at the edge and proxy requests back to majorlab's apex IP. ACME on majorlab keeps working because Cloudflare passes the HTTP-01 challenge through on port 80. Caddy's `notes.majorshouse.com {}` block needs no change.
|
||||
|
||||
Verify (response should show `server: cloudflare` and `via: 1.0 Caddy`):
|
||||
|
||||
```bash
|
||||
curl -sSI https://notes.majorshouse.com/
|
||||
```
|
||||
|
||||
### Why a Cloudflare-proxied CNAME is the durable shape
|
||||
|
||||
- **Apex follows the home IP automatically.** Update the apex A record once when the ISP changes; every subdomain inherits it without per-record fixes.
|
||||
- **TLS handshake is offloaded to CF.** Any ISP-level SNI weirdness (the original `wiki` ban; theoretical future bans) becomes irrelevant: external clients present SNI `notes.majorshouse.com` to Cloudflare, which the ISP doesn't filter.
|
||||
- **Free.** Cloudflare's free tier covers proxy + TLS termination.
|
||||
|
||||
### Audit checklist for any home-hosted `*.majorshouse.com` subdomain
|
||||
|
||||
- [ ] DNS record is a **CNAME** to `majorshouse.com.`, not an A record to a literal home IP.
|
||||
- [ ] Cloudflare proxy (orange cloud, `proxied=true`) enabled on the record — at minimum for any subdomain where TLS reachability matters.
|
||||
- [ ] Caddy entry on majorlab references the public hostname; `reverse_proxy` stays on the localhost port.
|
||||
- [ ] HTTPS verified from outside the LAN (phone on cellular is sufficient) within the first hour after the change.
|
||||
- [ ] If an A record is genuinely required (e.g. it must NOT go through CF), document why in the deploy notes for that service.
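
The first two checklist items lend themselves to a mechanical check. The sketch below assumes each record is a dict with the `type`, `name`, `content`, and `proxied` fields used in the API payload earlier in this note; `audit_record` is a hypothetical helper, and the sample records simply restate rows from the comparison table.

```python
def audit_record(record, apex="majorshouse.com"):
    """Return a list of checklist violations for one DNS record dict."""
    problems = []
    if record.get("type") != "CNAME":
        problems.append("not a CNAME (literal address can go stale)")
    elif record.get("content", "").rstrip(".") != apex:
        problems.append("CNAME does not target the apex")
    if not record.get("proxied", False):
        problems.append("Cloudflare proxy is off")
    return problems

records = [
    {"type": "A", "name": "notes.majorshouse.com", "content": "136.54.3.248", "proxied": False},
    {"type": "CNAME", "name": "git.majorshouse.com", "content": "majorshouse.com.", "proxied": False},
    {"type": "CNAME", "name": "notes.majorshouse.com", "content": "majorshouse.com", "proxied": True},
]
for r in records:
    print(r["name"], audit_record(r) or "ok")
```

Records that legitimately need an A record would show up here as violations, which is the cue to document the exception as the last checklist item asks.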
|
||||
|
||||
### Related
|
||||
|
||||
- [[majwiki-setup-and-pipeline]] — full wiki deploy pipeline; the DNS step there should reference this fix
|
||||
- [[Network-Overview]] — fleet IP table
|
||||
|
|
|
|||
150
05-troubleshooting/logwatch-wrong-hostname-after-migration.md
Normal file
|
|
@ -0,0 +1,150 @@
|
|||
---
|
||||
title: "Logwatch Reports the Wrong Hostname (`<host>-hetzner`) After a Migration"
|
||||
domain: troubleshooting
|
||||
category: monitoring
|
||||
tags: [logwatch, hostname, hetzner, migration, monitoring, provisioning, fail2ban]
|
||||
status: published
|
||||
created: 2026-06-12
|
||||
updated: 2026-06-14
|
||||
---
|
||||
|
||||
# Logwatch Reports the Wrong Hostname (`<host>-hetzner`) After a Migration
|
||||
|
||||
## Symptom
|
||||
|
||||
Daily Logwatch emails from a recently migrated server arrive titled with the
|
||||
provisioning label instead of the real hostname:
|
||||
|
||||
```
|
||||
Logwatch for tttpod-hetzner (Linux)
|
||||
Logwatch for dcaprod-hetzner (Linux)
|
||||
```
|
||||
|
||||
Everything else works — the report is generated, mailed, and delivered. Only the
|
||||
**name in the title is wrong**, which makes reports harder to scan and breaks any
|
||||
filter or rule that keys on the expected hostname.
|
||||
|
||||
## Cause
|
||||
|
||||
Logwatch titles each report with the box's **live system hostname**
|
||||
(`hostnamectl --static` / `/etc/hostname`) read at runtime — it does *not* keep
|
||||
its own copy of the name.
|
||||
|
||||
Hetzner Cloud servers are provisioned with a temporary node label as the system
|
||||
hostname — `<host>-hetzner` (e.g. `tttpod-hetzner`). The migration runbook renames
|
||||
the **Tailscale node** back to the bare name and sets Postfix `myhostname`, but the
|
||||
**OS hostname** itself is easy to miss because nothing surfaces it day to day. It
|
||||
stays `<host>-hetzner` until something reads `hostname` — Logwatch is usually the
|
||||
first thing to do so, weeks later.
|
||||
|
||||
Confirm the box is actually mislabelled:
|
||||
|
||||
```bash
|
||||
ssh root@<host> 'hostnamectl --static; cat /etc/hostname; grep 127.0.1.1 /etc/hosts'
|
||||
# static: tttpod-hetzner
|
||||
# /etc/hostname: tttpod-hetzner
|
||||
# 127.0.1.1 tttpod-hetzner tttpod-hetzner
|
||||
```
|
||||
|
||||
## Fix
|
||||
|
||||
Set the real hostname and fix the matching `/etc/hosts` loopback line:
|
||||
|
||||
```bash
ssh root@<host> '
  hostnamectl set-hostname <host>
  # Anchor the match and escape the dots so only the 127.0.1.1 line is rewritten
  sed -i "s/^127\.0\.1\.1.*/127.0.1.1 <host> <host>/" /etc/hosts
  hostnamectl --static   # verify -> <host>
'
```
|
||||
|
||||
That's it. **Logwatch has no hardcoded hostname override** — verify with:
|
||||
|
||||
```bash
|
||||
grep -ri hostname /etc/logwatch/ /etc/cron.daily/0logwatch /etc/cron.daily/logwatch 2>/dev/null
|
||||
cat /etc/mailname 2>/dev/null
|
||||
```
|
||||
|
||||
If those are empty (the normal case), Logwatch reads the live hostname on its next
|
||||
run, so the **next daily report self-corrects** — no service restart, no logwatch
|
||||
config change needed.
|
||||
|
||||
> [!note] If `grep` *does* find a hostname pinned in `/etc/logwatch/conf/logwatch.conf`
|
||||
> (e.g. a `HostLimit`/`MailFrom` line baked in by Ansible), update it there too —
|
||||
> the override file wins over the live hostname.
|
||||
|
||||
## Sweep the whole fleet
|
||||
|
||||
This is a per-box provisioning leftover, so check every migrated host at once —
|
||||
more than one is usually affected:
|
||||
|
||||
```bash
for ip in 100.98.223.93 100.95.137.38 100.64.169.62 100.112.127.0 100.73.85.46; do
  echo -n "$ip -> "
  ssh -o ConnectTimeout=8 -o BatchMode=yes "root@$ip" 'hostnamectl --static' 2>/dev/null \
    || echo '(unreachable)'
done
```
|
||||
|
||||
Any value ending in `-hetzner` (or your provider's build label) needs the fix above.
|
||||
In the 2026-06 sweep, `tttpod` and `dcaprod` were still `*-hetzner` at the OS
|
||||
level; `majortoot`, `majormail`, and `majorlinux` had the correct system hostname
|
||||
— but see the variant below: `majormail`'s *configs* were still stale even though
|
||||
its hostname wasn't.
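
If the sweep output is collected into a mapping, picking out the boxes that still need the fix is a one-liner. A minimal sketch: the `host-1`..`host-3` keys are hypothetical placeholders for addresses, and `needs_fix` / `bare_hostname` are illustrative helper names.

```python
def bare_hostname(name, label="-hetzner"):
    """Strip a trailing provisioning label; return the name unchanged if it is absent."""
    return name[: -len(label)] if name.endswith(label) else name

def needs_fix(sweep, label="-hetzner"):
    """Given {address: reported_hostname}, return the entries still carrying the label."""
    return {addr: host for addr, host in sweep.items() if host.endswith(label)}

# Placeholder keys; the hostnames are the ones named in this article.
sweep = {
    "host-1": "tttpod-hetzner",
    "host-2": "dcaprod-hetzner",
    "host-3": "majortoot",
}
for addr, host in needs_fix(sweep).items():
    print(addr, host, "->", bare_hostname(host))
```

This only covers the OS hostname; the config-level variant described next needs its own grep.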
|
||||
|
||||
## Variant: hostname is correct, but a config has the old name baked in
|
||||
|
||||
A second, sneakier form of this drift: the **system hostname is already right**, so
|
||||
the sweep above passes and the Logwatch report *title* is correct — yet mail still
|
||||
arrives **from** `<host>-hetzner` because the old label is hardcoded in a service's
|
||||
`From`/`sender` field. These fields are static text, not derived from the live
|
||||
hostname, so fixing `hostnamectl` does nothing for them.
|
||||
|
||||
Seen on `majormail` (2026-06-14): system hostname was `majormail`, but
|
||||
`Logwatch@majormail-hetzner...` was still the sender. Two configs held it:
|
||||
|
||||
```bash
|
||||
# sweep a box for the old provisioning label in any send-related config
|
||||
ssh root@<host> 'grep -rsn "<host>-hetzner" /etc/logwatch/ /etc/fail2ban/ \
|
||||
/etc/postfix/ /etc/aliases /etc/mailname 2>/dev/null'
|
||||
# /etc/logwatch/conf/logwatch.conf:MailFrom = Logwatch@<host>-hetzner.majorshouse.com
|
||||
# /etc/fail2ban/jail.local:sender = fail2ban@<host>-hetzner.majorshouse.com
|
||||
```
|
||||
|
||||
Fix in place (no restart needed for Logwatch; reload fail2ban for its change):
|
||||
|
||||
```bash
|
||||
ssh root@<host> '
|
||||
sed -i "s/<host>-hetzner/<host>/g" /etc/logwatch/conf/logwatch.conf /etc/fail2ban/jail.local
|
||||
systemctl reload fail2ban
|
||||
'
|
||||
```
|
||||
|
||||
> [!warning] Check the Ansible source, or it comes back
|
||||
> A live `sed` is undone by the next playbook run if the repo still carries the old
|
||||
> value. Distinguish two cases:
|
||||
> - **Templated** (safe): e.g. `logwatch.yml` sets `MailFrom = Logwatch@{{ inventory_hostname }}...`. If the inventory host is named correctly, a run *regenerates* the right value — it even self-heals a stale box.
|
||||
> - **Static file** (will regress): e.g. `roles/fail2ban/files/hosts/<host>/jail.local` with the literal `sender = ...@<host>-hetzner...`. Grep the repo (`grep -rn "<host>-hetzner" .`) and fix the file too, or every deploy re-pushes the stale sender.
|
||||
|
||||
Inert backups (`jail.local.bak*`, `*~`) may still contain the old string — they
|
||||
don't send mail, so leave them.
|
||||
|
||||
## Prevention
|
||||
|
||||
Fold "set the system hostname" into the migration bootstrap so it never drifts:
|
||||
|
||||
```bash
hostnamectl set-hostname <host>
# Anchor the match and escape the dots so only the 127.0.1.1 line is rewritten
sed -i "s/^127\.0\.1\.1.*/127.0.1.1 <host> <host>/" /etc/hosts
```
|
||||
|
||||
Do this in the **same step** that renames the Tailscale node and sets Postfix
|
||||
`myhostname` — all three read from the provisioning label and all three must be
|
||||
corrected together. See the
|
||||
[VPS Migration Baseline Checklist](../02-selfhosting/cloud/vps-migration-baseline-checklist.md).
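
For a bootstrap written in Python instead of shell, the `/etc/hosts` rewrite can be sketched as a pure function. This is an illustration only: `set_loopback_name` is a hypothetical helper that mirrors the `sed` command above and operates on text, leaving the actual file read/write to the caller.

```python
import re

def set_loopback_name(hosts_text, host):
    """Rewrite the 127.0.1.1 line to '127.0.1.1 <host> <host>'; leave other lines alone."""
    new_line = f"127.0.1.1 {host} {host}"
    # (?m) makes ^ and $ match per line; \b stops 127.0.1.10 and similar from matching.
    return re.sub(r"(?m)^127\.0\.1\.1\b.*$", lambda _m: new_line, hosts_text)

before = "127.0.0.1 localhost\n127.0.1.1 tttpod-hetzner tttpod-hetzner\n"
print(set_loopback_name(before, "tttpod"), end="")
```

Because the function is idempotent, running it on an already-correct file changes nothing.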
|
||||
|
||||
## Related
|
||||
|
||||
- [Logwatch Fleet Setup — Surviving Package Upgrades](../02-selfhosting/monitoring/logwatch-fleet-setup.md) — the broader "logwatch went silent / wrong-source" class, including the Packer `myhostname` variant of this same drift
|
||||
- [VPS Migration Baseline Checklist](../02-selfhosting/cloud/vps-migration-baseline-checklist.md) — the full post-migration verification list
|
||||
- [Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration](networking/ansible-host-key-verification-failed-rebuilt-host.md) — another IP/identity-drift gotcha from the same Hetzner migration
|
||||
|
|
@ -0,0 +1,154 @@
|
|||
---
|
||||
title: "Auditing & Cleaning macOS Background App Activity (sfltool dumpbtm)"
|
||||
domain: troubleshooting
|
||||
category: general
|
||||
tags: [macos, background-tasks, btm, sfltool, login-items, system-extensions, uninstall, little-snitch]
|
||||
status: published
|
||||
created: 2026-06-21
|
||||
updated: 2026-06-21
|
||||
---
|
||||
# Auditing & Cleaning macOS Background App Activity (`sfltool dumpbtm`)
|
||||
|
||||
## Overview
|
||||
macOS tracks every login item, agent, daemon, helper, and extension that may run in the background in its **Background Task Management (BTM)** database. The GUI shows this under **System Settings → General → Login Items & Extensions** ("Allow in the Background"), but the GUI is summarised and hides paths, identifiers, and orphans.
|
||||
|
||||
`sfltool dumpbtm` prints the full BTM database from the command line — and the per-user records need **no `sudo`**. This is the fastest way to answer "what is allowed to run in the background, and does each entry still map to an installed app?"
|
||||
|
||||
## List what's registered
|
||||
|
||||
```bash
|
||||
sfltool dumpbtm # per-user records, no sudo required
|
||||
```
|
||||
|
||||
Each record looks like:
|
||||
|
||||
```
|
||||
Name: CleanMyMac Menu
|
||||
Type: login item (0x4)
|
||||
Disposition: [enabled, allowed, notified] (0xb)
|
||||
Identifier: 4.com.macpaw.CleanMyMac-mas.Menu
|
||||
URL: Contents/Library/LoginItems/CleanMyMac_5_MAS_Menu.app
|
||||
Bundle Identifier: com.macpaw.CleanMyMac-mas.Menu
|
||||
Parent Identifier: 2.com.macpaw.CleanMyMac-mas
|
||||
```
|
||||
|
||||
### Reading the fields
|
||||
- **Disposition** — `enabled` = actively allowed to run in the background. `disabled` = present but off.
|
||||
- **Type** — what kind of item it is:
|
||||
|
||||
| Type | Meaning |
|---|---|
| `app (0x2)` | A normal application entry |
| `login item (0x4)` | Launches at login (menu-bar apps, helpers) |
| `agent (0x8)` / `legacy agent` | Per-user background agent |
| `legacy daemon (0x10010)` | System-wide background daemon |
| `background tasks (0x2000)` | Abstract background-task registration owned by a parent app — **has no file path of its own** |
| `developer (0x20)` | A per-developer grouping header (the collapsible row in Settings), **not an app** |
| `quicklook` / `spotlight` / `dock tile` | Plugins/extensions — not really "background apps" |
|
||||
|
||||
## Map entries to installed apps (find orphans)
|
||||
|
||||
Two gotchas make naïve path-checking fail:
|
||||
|
||||
1. **Absolute paths are stored as `file://` URLs**, not plain `/…`. Strip the `file://` prefix and URL-decode (`%20` → space).
|
||||
2. **Child items store a *relative* `URL`** (e.g. `Contents/Library/LoginItems/…`) that must be joined to the **parent record's** absolute path, found via `Parent Identifier`.
|
||||
|
||||
A small parser that resolves each record to a real path and flags true orphans:
|
||||
|
||||
```python
import os
import re
import sys
import urllib.parse

# Parse `sfltool dumpbtm` output from stdin into one dict per "#N:" record.
items, cur = [], None

def push():
    if cur is not None:
        items.append(cur)

for line in sys.stdin:
    s = line.strip()
    if re.match(r"^#\d+:$", s):      # start of a new record
        push()
        cur = {}
        continue
    if cur is None:                  # preamble before the first record
        continue
    m = re.match(r"^([A-Za-z][A-Za-z /]+):\s*(.*)$", s)
    if m:
        cur[m.group(1).strip()] = m.group(2).strip()
push()

byid = {it["Identifier"]: it for it in items if it.get("Identifier")}

def abspath(it, d=0):
    """Resolve a record to an absolute path, following Parent Identifier links."""
    if d > 8:                        # guard against parent cycles
        return None
    u = it.get("URL", "")
    if not u or u == "(null)":       # no path stored (background-task / developer rows)
        return None
    if u.startswith("file://"):
        return urllib.parse.unquote(u[7:]).rstrip("/")
    if u.startswith("/"):
        return u.rstrip("/")
    # Relative URL: join it to the parent record's absolute path.
    par = byid.get(it.get("Parent Identifier", ""))
    if par:
        b = abspath(par, d + 1)
        if b:
            return os.path.join(b, urllib.parse.unquote(u)).rstrip("/")
    return None

for it in items:
    if not it.get("Name"):
        continue
    p = abspath(it)
    if p and not os.path.exists(p):
        print("ORPHAN:", it["Name"], "->", p)
```
|
||||
|
||||
```bash
|
||||
sfltool dumpbtm | python3 btm_check.py
|
||||
```
|
||||
|
||||
> **Expected non-orphans:** `background tasks (0x2000)` and `developer (0x20)` rows legitimately store no path — they are not missing apps. Helpers/daemons that resolve *inside* a parent bundle (e.g. `/Applications/Foo.app/Contents/Library/LoginItems/…`) or in `/Library/…` are also fine; they just don't appear as a top-level `.app`. That is usually why an entry "has no application you can find."
|
||||
|
||||
## Disable background for an app
|
||||
|
||||
This **cannot be scripted** — Apple deliberately gates the toggle behind the GUI:
|
||||
|
||||
**System Settings → General → Login Items & Extensions → "Allow in the Background"** → switch the app off.
|
||||
|
||||
Disabling a `developer (0x20)` grouping header turns off all of that developer's sub-items at once.
|
||||
|
||||
## Uninstall cleanly — the system-extension trap
|
||||
|
||||
**Dragging an app to the Trash is not a full uninstall.** Apps that install a **network/system extension** plus a privileged daemon (firewalls and VPNs especially — Little Snitch, Mullvad, etc.) leave their `/Library` daemon **still loaded and running** after the app is trashed. The BTM entry persists and the background service keeps working.
|
||||
|
||||
### 1. Prefer the app's own uninstaller
|
||||
- **Bundled uninstall script** (Mullvad): runs cleanly, deactivates the system extension, resets the firewall.
|
||||
```bash
|
||||
sudo "/Applications/Mullvad VPN.app/Contents/Resources/uninstall.sh"
|
||||
```
|
||||
- Some apps ship an uninstaller in their DMG or a CLI tool. **Note:** Little Snitch 6.x has **no DMG uninstaller and no `littlesnitch uninstall` subcommand** — manual removal is the supported route there.
|
||||
|
||||
### 2. Check whether a system extension is still active
|
||||
```bash
|
||||
systemextensionsctl list
|
||||
```
|
||||
If the app's extension is **not** listed (only unrelated ones like Tailscale/Canon remain), the extension is already deactivated, and manually removing the remaining files is safe and completes the uninstall.
|
||||
|
||||
### 3. Manual removal (when no uninstaller exists)
|
||||
Find every component first:
|
||||
```bash
|
||||
ls /Library/LaunchDaemons/<id>* /Library/LaunchAgents/<id>* 2>/dev/null
|
||||
ls -d "/Library/Application Support/<Vendor>" 2>/dev/null
|
||||
ls ~/Library/Preferences/<id>* 2>/dev/null
|
||||
```
|
||||
Then boot out the daemon and remove the files:
|
||||
```bash
|
||||
sudo launchctl bootout system /Library/LaunchDaemons/<id>.daemon.plist 2>/dev/null
|
||||
sudo rm -f /Library/LaunchDaemons/<id>.daemon.plist /Library/LaunchAgents/<id>.agent.plist
|
||||
sudo rm -rf "/Library/Application Support/<Vendor>" "$HOME/.Trash/<App>.app"
|
||||
rm -f ~/Library/Preferences/<id>*.plist # user-owned, no sudo
|
||||
```
|
||||
|
||||
> **Shared-container caution:** before deleting `~/Library/Group Containers/*`, check it isn't shared. Microsoft apps share `UBF8T346G9.com.microsoft.oneauth`, `…entrabroker`, and `…teams` across Office/Teams/RDP — delete only the app-specific container (e.g. `…com.microsoft.rdc`), never the shared auth ones.
|
||||
|
||||
## Stale BTM "ghost" entries
|
||||
|
||||
After a manual uninstall, `sfltool dumpbtm` may still list the removed app, pointing at now-deleted paths. These are harmless orphans (nothing left to load). **BTM reconciles them on the next reboot / login cycle** — a reboot also finalises any system-extension teardown.
|
||||
|
||||
## Quick reference
|
||||
|
||||
```bash
|
||||
sfltool dumpbtm # full per-user BTM dump (no sudo)
|
||||
sfltool dumpbtm | grep -A6 'Name:' # browse records
|
||||
systemextensionsctl list # active network/system extensions
|
||||
# Verify a removal:
|
||||
sfltool dumpbtm | grep -i <vendor> # should be empty after a reboot
|
||||
```
|
||||
|
||||
## See also
|
||||
- Apple gates "Allow in the Background" behind System Settings — there is no supported CLI toggle for BTM dispositions.
|
||||
- For VPN/firewall apps, always reach for the vendor uninstaller first; manual `rm` alone can leave a registered system extension behind.
|
||||
|
|
@ -0,0 +1,94 @@
|
|||
---
|
||||
title: "Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration"
|
||||
domain: troubleshooting
|
||||
category: networking
|
||||
tags: [ansible, ssh, known-hosts, tailscale, host-key, migration]
|
||||
status: published
|
||||
created: 2026-06-12
|
||||
updated: 2026-06-12
|
||||
---
|
||||
|
||||
# Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration
|
||||
|
||||
## Symptom
|
||||
|
||||
A subset of hosts in an Ansible run fail at **Gathering Facts** while the rest succeed:
|
||||
|
||||
```
|
||||
[ERROR]: Task failed: Data could not be sent to remote host "100.112.127.0".
|
||||
Make sure this host can be reached over ssh: Host key verification failed.
|
||||
fatal: [majormail]: UNREACHABLE! => {"unreachable": true, ...}
|
||||
```
|
||||
|
||||
The failing hosts are exactly the ones that were recently **rebuilt or migrated** (new server, new OS install, or a cloud move that issued a new Tailscale IP). Hosts that were never rebuilt connect fine.
|
||||
|
||||
Confusingly, **interactive `ssh root@<host>` works perfectly** for the same boxes — only Ansible fails.
|
||||
|
||||
## Cause
|
||||
|
||||
SSH stores each accepted host key in `~/.ssh/known_hosts` keyed by the **exact address you connected with**. A key accepted for `ssh root@tttpod` is saved under the hostname `tttpod`; it is *not* indexed under that node's IP.
|
||||
|
||||
Ansible inventories almost always set `ansible_host` to a **literal IP** (here, the Tailscale `100.x.x.x` address). So Ansible's SSH lookup is by IP, finds no matching entry, and with `StrictHostKeyChecking=yes` (or `accept-new` already exhausted) it refuses the connection:
|
||||
|
||||
```
|
||||
No ED25519 host key is known for 100.112.127.0 and you have requested strict checking.
|
||||
Host key verification failed.
|
||||
```
|
||||
|
||||
The hostname-form and IP-form entries are independent. Fixing interactive SSH (e.g. converting aliases to MagicDNS names and re-accepting keys) does **nothing** for Ansible, because Ansible never uses the hostname.
|
||||
|
||||
A rebuilt host also generates **brand-new host keys**, so any old IP-form entry would additionally be a mismatch — but the common case after a migration to a *new* IP is simply that no IP entry exists at all.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
```bash
# 1. Is there any known_hosts entry for the failing IP? (no output = none)
ssh-keygen -F 100.112.127.0

# 2. Reproduce the exact failure without an interactive prompt:
ssh -o BatchMode=yes -o StrictHostKeyChecking=yes root@100.112.127.0 true
#    -> "Host key verification failed." confirms the gap

# 3. Confirm the inventory IP is actually the host's CURRENT address
#    (guards against stale-IP drift, a separate problem):
tailscale status | grep majormail
ssh-keyscan -t ed25519 100.112.127.0 | ssh-keygen -lf -   # fingerprint it
```
|
||||
|
||||
If step 3 shows the inventory IP matches the live Tailscale node and the box answers `ssh-keyscan`, the only problem is the missing IP-form key.
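
To make the "hostname-form and IP-form entries are independent" point concrete, here is a small Python sketch of a `known_hosts` lookup. It handles plain and hashed (`|1|salt|hash`, HMAC-SHA1 keyed by the salt) host fields only; wildcard patterns, `[host]:port` forms, and marker lines are deliberately ignored, and the key blob in the sample is a placeholder. `known_key_types` is a hypothetical helper, not an OpenSSH API.

```python
import base64
import hashlib
import hmac

def _matches(pattern, addr):
    """Compare one host-field entry against addr (plain or hashed form)."""
    if pattern.startswith("|1|"):
        # Hashed entry: |1|<base64 salt>|<base64 HMAC-SHA1(salt, hostname)>
        _, _, salt_b64, digest_b64 = pattern.split("|")
        salt = base64.b64decode(salt_b64)
        expected = base64.b64decode(digest_b64)
        actual = hmac.new(salt, addr.encode(), hashlib.sha1).digest()
        return hmac.compare_digest(actual, expected)
    return pattern == addr

def known_key_types(known_hosts_text, addr):
    """Return the key types recorded for exactly this address string."""
    types = []
    for line in known_hosts_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and @cert-authority / @revoked marker lines.
        if not line or line.startswith(("#", "@")):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue
        if any(_matches(p, addr) for p in fields[0].split(",")):
            types.append(fields[1])
    return types

sample = "tttpod ssh-ed25519 AAAA_PLACEHOLDER_KEY\n"
print(known_key_types(sample, "tttpod"))         # entry saved under the hostname
print(known_key_types(sample, "100.112.127.0"))  # nothing under the IP form
```

An empty result for the inventory IP, alongside a hit for the hostname, is exactly the gap that makes Ansible fail while interactive SSH works.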
|
||||
|
||||
## Fix
|
||||
|
||||
Add the **IP-form** host keys to the `known_hosts` of the user that runs Ansible. Back up first, scan over the tailnet, de-dup:
|
||||
|
||||
```bash
cp ~/.ssh/known_hosts ~/.ssh/known_hosts.bak.$(date +%Y%m%d)

for ip in 100.98.223.93 100.112.127.0 100.73.85.46 100.95.137.38 100.76.51.16 100.64.169.62; do
  ssh-keyscan -T 5 -t rsa,ecdsa,ed25519 "$ip" >> ~/.ssh/known_hosts
done
sort -u ~/.ssh/known_hosts -o ~/.ssh/known_hosts
```
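
Note that `sort -u` also reorders the file. If the existing order matters (for example, grouped entries or comments), an order-preserving de-dup is a possible alternative. The sketch below works on the file's text; `dedupe_lines` is a hypothetical helper, and reading and writing `~/.ssh/known_hosts` is left to the caller.

```python
def dedupe_lines(text):
    """Drop exact duplicate lines, keeping the first occurrence and the original order."""
    seen = set()
    kept = []
    for line in text.splitlines():
        if line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")

sample = "b-host ssh-ed25519 KEY1\na-host ssh-ed25519 KEY2\nb-host ssh-ed25519 KEY1\n"
print(dedupe_lines(sample), end="")
```

Either approach removes only byte-identical lines, so re-running the keyscan loop stays harmless.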
|
||||
|
||||
Verify before re-running the playbook:
|
||||
|
||||
```bash
|
||||
ansible <hosts> -m ping # expect "pong" from each
|
||||
```
|
||||
|
||||
### Why `ssh-keyscan` is safe here
|
||||
|
||||
`ssh-keyscan` trusts whatever answers on the wire — normally a MITM risk. Over **Tailscale**, the connection rides WireGuard, which cryptographically authenticates the peer by its tailnet identity: reaching `100.x.x.x` *guarantees* you are talking to the node that owns that tailnet address. Scanning and trusting the key over the tailnet is therefore as trustworthy as the tailnet itself. Always cross-check the IP against `tailscale status` first (step 3) so you scan the right node.
|
||||
|
||||
## Prevention
|
||||
|
||||
- **Per-workstation, not fleet-wide.** `known_hosts` is local to each machine + user. After a migration, *every* host that runs Ansible (each workstation, plus any control node like `majorlab`) needs the IP keys added independently. Adding them on one Mac does not help the others.
|
||||
- **Sweep on every migration phase.** A rolling migration changes one node's IP at a time; fold the keyscan above into the post-cutover checklist so Ansible never breaks mid-rollout.
|
||||
- **Alternative: disabling host-key checking.** Setting `host_key_checking = False` in `ansible.cfg` (or `ANSIBLE_HOST_KEY_CHECKING=False`) sidesteps the failure but trades away host-key verification entirely. SSH's `StrictHostKeyChecking=accept-new` is a milder option: it auto-accepts keys for unknown hosts while still rejecting changed keys. Prefer the explicit keyscan: it keeps strict checking on for every *future* run while accepting the new key exactly once, under your control.
|
||||
|
||||
## Related
|
||||
|
||||
- SSH-Aliases — Fleet SSH access; the MagicDNS-vs-pinned-IP strategy and the Ansible-by-IP `known_hosts` note
|
||||
- Network Overview — Tailscale fleet inventory and current IPs
|
||||
- Hetzner-Migration-Status — the migration that triggered the fleet-wide IP churn
|
||||
- [[ssh-socket-tailscale-race-condition]] — a different "SSH unreachable after reboot" failure mode
|
||||
|
|
@ -0,0 +1,105 @@
|
|||
---
|
||||
title: "Dovecot IMAP Clients Fail to Sync: vsz_limit OOM from a Bloated Index Log"
|
||||
domain: troubleshooting
|
||||
category: networking
|
||||
tags: [dovecot, imap, oom, vsz_limit, index, maildir, fedora, mail]
|
||||
status: published
|
||||
created: 2026-06-05
|
||||
updated: 2026-06-05
|
||||
---
|
||||
# Dovecot IMAP Clients Fail to Sync: vsz_limit OOM from a Bloated Index Log
|
||||
|
||||
All IMAP clients fail to connect or hang while syncing a particular folder, even though the box has plenty of free RAM and disk. The cause is a corrupt/bloated per-folder `dovecot.index.log` that overflows Dovecot's **per-process** virtual-memory cap (`default_vsz_limit`, 256 MB by default) when it is `mmap`ed — so the IMAP child is killed on every sync attempt.
|
||||
|
||||
> First seen on **majormail** (Fedora 44, Dovecot 2.4.4), 2026-06-05. An empty `.Later` folder had a 152 MB `dovecot.index.log`.
|
||||
|
||||
## Symptoms
|
||||
|
||||
- Multiple/all IMAP clients can't connect, or connect but never finish syncing.
|
||||
- Often only **one folder** is the trigger — the client hangs the moment it opens/syncs that folder.
|
||||
- The server is otherwise healthy: Postfix delivering, Dovecot `active`, ports listening, TLS valid.
|
||||
- `free -h` shows the host has plenty of RAM available — this is **not** a host-level OOM.
|
||||
|
||||
## Log Signature
|
||||
|
||||
`journalctl -u dovecot` shows, per affected user/folder:
|
||||
|
||||
```
|
||||
imap(user@dom): Fatal: block_alloc(8388608): Out of memory
|
||||
imap(user@dom): Fatal: master: service(imap): child NNN returned error 83
|
||||
(Out of memory (service imap { vsz_limit=256 MB }, you may need to increase it) ...)
|
||||
imap(user@dom): Error: Mailbox X: mmap(size=158769660) failed ...: Cannot allocate memory
|
||||
imap(user@dom): Error: Mailbox X: Failed to map transaction log .../dovecot.index.log
|
||||
at sync_offset=N after locking: Beginning of the log isn't available
|
||||
```
|
||||
|
||||
The two tells: **`error 83` naming `vsz_limit`** (Dovecot literally suggests raising it), and an **`mmap(size=…)` value that is huge relative to the folder's real contents**.
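To put a number on the second tell, the `mmap(size=…)` value can be pulled out of the journal and converted to MiB. A minimal sketch, assuming the log line format shown above; `mmap_sizes_mib` is a hypothetical helper name, and it reads log text on stdin so it can be fed from `journalctl`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each mmap(size=N) value found on stdin,
# converted to whole MiB (integer division).
mmap_sizes_mib() {
    grep -oE 'mmap\(size=[0-9]+\)' \
        | grep -oE '[0-9]+' \
        | while read -r bytes; do
              echo "$(( bytes / 1048576 ))"
          done
}

# Demo on the sample line from the log signature above.
sample='imap(user@dom): Error: Mailbox X: mmap(size=158769660) failed ...: Cannot allocate memory'
printf '%s\n' "$sample" | mmap_sizes_mib
```

On a live host the same function can be fed with `journalctl -u dovecot --since "-3h" | mmap_sizes_mib`. A single mapping of roughly 151 MiB leaves little of a 256 MB cap for everything else the process maps.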
|
||||
|
||||
## Why It Happens
|
||||
|
||||
Each Maildir folder has its own `dovecot.index.log` transaction log. If it grows or corrupts to tens/hundreds of MB (here: 152 MB on a folder with **zero** messages), Dovecot tries to `mmap` the whole thing into the IMAP worker. That worker runs under `default_vsz_limit` (compiled default **256 MB**). The mapping blows the cap, the kernel refuses the allocation, and the child dies with `error 83`. Because every client re-syncs that folder on connect, it fails for **all** of them at once.
|
||||
|
||||
Key point: the limit is **per-process virtual size**, not host memory. A box with 2.5 GB free RAM still hits it.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
```bash
|
||||
# 1. The smoking gun — OOM / error 83 mentioning vsz_limit
|
||||
journalctl -u dovecot --since "-3h" | grep -iE "out of memory|error 83|vsz_limit"
|
||||
|
||||
# 2. Confirm it is NOT a host OOM (expect plenty free)
|
||||
free -h ; df -h /var/vmail
|
||||
|
||||
# 3. Current per-process cap (256 M = compiled default, no explicit setting)
|
||||
doveconf default_vsz_limit
|
||||
|
||||
# 4. Find the bloated index — size wildly out of proportion to message count
|
||||
du -sh /var/vmail/<domain>/<user>/.<Folder>
|
||||
ls -lh /var/vmail/<domain>/<user>/.<Folder>/dovecot.index*
|
||||
find /var/vmail/<domain>/<user>/.<Folder>/{cur,new} -type f | wc -l   # real message count (ls on two dirs would also count its "cur:"/"new:" header lines)
|
||||
```
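Step 4 checks one folder at a time. When the offending folder is not yet known, the same comparison can be swept across a whole mail store. A minimal sketch, assuming a Maildir tree under the given root; `find_bloated_index_logs` and the 10 MiB default threshold are illustrative choices, not Dovecot settings.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list dovecot.index.log files larger than a threshold,
# together with the number of message files in the same folder's cur/ and new/.
#   $1 = root of the mail store, $2 = threshold in MiB (default 10)
find_bloated_index_logs() {
    local root=$1 min_mib=${2:-10}
    find "$root" -type f -name dovecot.index.log -size +"${min_mib}M" |
    while IFS= read -r log; do
        local dir count
        dir=$(dirname "$log")
        # Count real message files; a missing cur/ or new/ just counts as zero.
        count=$(find "$dir/cur" "$dir/new" -type f 2>/dev/null | wc -l | tr -d ' ')
        printf '%s messages=%s\n' "$log" "$count"
    done
}
```

Example: `find_bloated_index_logs /var/vmail 10`. A line ending in `messages=0` next to a large log is the pattern described in this article.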
|
||||
|
||||
## Fix
|
||||
|
||||
Two parts: raise the cap, and repair the bloated index.
|
||||
|
||||
```bash
|
||||
# (1) Raise default_vsz_limit. Flat Fedora dovecot.conf has no !include conf.d/*,
|
||||
# so add it at top-level scope (after `protocols = ...`):
|
||||
# default_vsz_limit = 1G
|
||||
doveconf -n >/dev/null && echo CONFIG_OK # validate
|
||||
systemctl restart dovecot # required to apply the new vsz
|
||||
doveconf default_vsz_limit # -> 1G
|
||||
|
||||
# (2a) Rebuild the index from the real messages
|
||||
doveadm force-resync -u <user@dom> <Folder>
|
||||
|
||||
# (2b) If force-resync leaves a stale multi-MB index.log AND the folder has
|
||||
# 0 message files, it is safe to delete the index files and let Dovecot
|
||||
# regenerate them clean (152 M -> 24 K in the original case):
|
||||
L=/var/vmail/<domain>/<user>/.<Folder>
|
||||
rm -f "$L"/dovecot.index "$L"/dovecot.index.log "$L"/dovecot.index.cache "$L"/dovecot.index.backup
|
||||
doveadm mailbox status -u <user@dom> "messages vsize" <Folder> # regenerates
|
||||
```
|
||||
|
||||
Verify: `journalctl -u dovecot --since "-2m" | grep -ic "out of memory"` returns `0`, and the folder reads without error.
|
||||
|
||||
> **Only delete index files when the folder's `cur/` and `new/` are empty** (or you are certain the messages are intact). The index is rebuildable from the message files; deleting indexes never deletes mail, but verify the count first.
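The guard in that warning can be enforced in the script itself, so the delete never runs against a folder that still holds mail. A minimal sketch under that assumption; `remove_indexes_if_empty` is a hypothetical helper, and the file list simply mirrors the `rm` in step (2b).

```bash
#!/usr/bin/env bash
# Hypothetical helper: delete a folder's Dovecot index files only when the
# folder holds no message files; otherwise refuse and change nothing.
#   $1 = path to one Maildir folder (the directory containing cur/ and new/)
remove_indexes_if_empty() {
    local dir=$1 count
    if [ ! -d "$dir/cur" ] || [ ! -d "$dir/new" ]; then
        echo "refusing: $dir does not look like a Maildir folder" >&2
        return 2
    fi
    count=$(find "$dir/cur" "$dir/new" -type f | wc -l | tr -d ' ')
    if [ "$count" -ne 0 ]; then
        echo "refusing: $count message file(s) in $dir" >&2
        return 1
    fi
    rm -f "$dir"/dovecot.index "$dir"/dovecot.index.log \
          "$dir"/dovecot.index.cache "$dir"/dovecot.index.backup
}
```

Run it with the folder path in place of the bare `rm`, then let `doveadm mailbox status` regenerate the indexes as before.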
|
||||
|
||||
## Codified
|
||||
|
||||
majormail's role sets this permanently so the cap survives a config rebuild:
|
||||
`roles/majormail/templates/dovecot.conf.j2` → `default_vsz_limit = 1G` (MajorAnsible commit `a69ac5d`).
|
||||
|
||||
## Key Notes
|
||||
|
||||
- **`error 83` = vsz, not host RAM.** Don't go chasing free memory — read the parenthetical in the error; Dovecot names the exact setting.
|
||||
- **A huge index on a tiny/empty folder is the corruption,** not the messages. Resync, and delete the index files if the folder is empty.
|
||||
- **`tcpdump` may not be installed** on a minimal Fedora mail host — don't conclude "no packets arrived" from an empty capture without confirming the tool exists (`which tcpdump`).
|
||||
- 1 G is comfortable headroom for large mailboxes; raise further only if a genuinely large single mailbox needs it.
|
||||
|
||||
## Related
|
||||
|
||||
- [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](fail2ban-imap-self-ban-mail-client.md)
|
||||
- [firewalld: Mail Ports Wiped After Reload](firewalld-mail-ports-reset.md)
|
||||
- [SELinux: Dovecot vmail Context](../selinux-dovecot-vmail-context.md)
|
||||
|
|
@ -0,0 +1,111 @@
|
|||
---
|
||||
title: "Dovecot Phantom Mailboxes from .dovecot.lda-dupes (mail_home Overlapping the Maildir Root)"
|
||||
domain: troubleshooting
|
||||
category: networking
|
||||
tags: [dovecot, maildir, mail_home, sieve, lda-dupes, duplicate-database, pigeonhole, phantom-mailbox]
|
||||
status: published
|
||||
created: 2026-06-07
|
||||
updated: 2026-06-07
|
||||
---
|
||||
# Dovecot Phantom Mailboxes from `.dovecot.lda-dupes` (mail_home Overlapping the Maildir Root)
|
||||
|
||||
Dovecot starts logging errors like this on mailbox LIST, and `doveadm mailbox list` grows phantom mailboxes named after Dovecot's own control files:
|
||||
|
||||
```
|
||||
imap(user@example.com): Error: maildir: stat(/var/vmail/example.com/user/.dovecot.lda-dupes/tmp) failed: Not a directory
|
||||
```
|
||||
|
||||
```
|
||||
$ doveadm mailbox list -u user@example.com
|
||||
INBOX
|
||||
…
|
||||
dovecot
|
||||
dovecot.lda-dupes
|
||||
dovecot.lda-dupes.locks
|
||||
```
|
||||
|
||||
> Hit on **majormail** (2026-06-07), the day after switching the global spam Sieve to `redirect`. Mail delivery was unaffected — purely log noise plus phantom folders a client could see on `LIST "*"`.
|
||||
|
||||
## Why
|
||||
|
||||
The LDA/Sieve **duplicate database** (`.dovecot.lda-dupes`, plus a `.dovecot.lda-dupes.locks` lock dir) is created in the user's **home** directory. Per the Dovecot maintainer, its location strictly follows the user's home — it is *not* separately configurable.
|
||||
|
||||
If `mail_home` (the userdb `home` field) is set equal to the **maildir root** (`mail_path`), those control files get written *inside* the mail store:
|
||||
|
||||
```
|
||||
mail_path = /var/vmail/%{user|domain}/%{user|username} # maildir root
|
||||
userdb static { fields { home = /var/vmail/%{user|domain}/%{user|username} } } # SAME path — the bug
|
||||
```
|
||||
|
||||
The Maildir++ layout treats every `.`-prefixed entry in the root as a mailbox folder. So:
|
||||
|
||||
- `.dovecot.lda-dupes` (a **file**) → lister stats `.dovecot.lda-dupes/tmp` → **"Not a directory"** (cosmetic, logged every LIST).
|
||||
- `.dovecot.lda-dupes.locks` (a **directory**) → opened as a maildir, auto-populated with `cur/new/tmp/dovecot-uidlist/dovecot.index.log`, and exposed as a real phantom mailbox.
|
||||
|
||||
The trigger is anything that exercises duplicate tracking — Sieve `redirect` (loop-guard), `vacation`, or the `duplicate` test. A pure `fileinto` setup never creates the db, which is why the error can appear suddenly after a Sieve change.
|
||||
|
||||
## How to confirm
|
||||
|
||||
```bash
|
||||
# Phantom mailboxes named after the control files:
|
||||
doveadm mailbox list -u user@example.com | grep -E '^dovecot'
|
||||
|
||||
# Is home the SAME as the maildir root? (the root cause)
|
||||
doveadm user user@example.com | grep -E 'home|mail_path'
|
||||
# home /var/vmail/example.com/user <- equals mail_path == bug
|
||||
# mail_path /var/vmail/example.com/user
|
||||
|
||||
# The offending control files living inside the maildir root:
|
||||
ls -la /var/vmail/example.com/user/.dovecot.lda-dupes*
|
||||
# -rw------- … .dovecot.lda-dupes (regular file — the dedup db)
|
||||
# drwx------ … .dovecot.lda-dupes.locks (dir — the lock dir, mis-listed)
|
||||
```
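The home-versus-root comparison can also be scripted, which helps when checking many accounts. A minimal sketch, assuming `doveadm user` prints one `field value` pair per line as in the output above; `home_equals_mail_path` is a hypothetical helper that reads that output on stdin.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `doveadm user <user>` output on stdin and
# succeed (exit 0) when the home field equals the mail_path field.
home_equals_mail_path() {
    awk '$1 == "home"      { home = $2 }
         $1 == "mail_path" { path = $2 }
         END { exit !(home != "" && home == path) }'
}

# Demo with the buggy layout from this article.
if home_equals_mail_path <<'EOF'
home	/var/vmail/example.com/user
mail_path	/var/vmail/example.com/user
EOF
then
    echo "OVERLAP: home is the maildir root"
fi
```

Live use would be `doveadm user user@example.com | home_equals_mail_path && echo OVERLAP`. The comparison is a plain string match, so a trailing slash on one side would hide an overlap.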
|
||||
|
||||
## Fix
|
||||
|
||||
Point `home` at a path **separate from the maildir root**. The cleanest low-risk option is a **non-dotted subdir** of the user dir, so `mail_path` stays put and **no mail migration** is needed (a dotted name would just become another phantom folder):
|
||||
|
||||
```diff
|
||||
userdb static {
|
||||
fields {
|
||||
uid = vmail
|
||||
gid = vmail
|
||||
- home = /var/vmail/%{user|domain}/%{user|username}
|
||||
+ home = /var/vmail/%{user|domain}/%{user|username}/home
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Then deploy and clean up the stale artifacts:
|
||||
|
||||
```bash
|
||||
# 1. Deploy the config change, restart/reload Dovecot.
|
||||
|
||||
# 2. Confirm home moved:
|
||||
doveadm user user@example.com | grep home # -> /var/vmail/example.com/user/home
|
||||
|
||||
# 3. Remove the stale dupe-db + the cached list index from the maildir root
|
||||
# (all regenerable):
|
||||
cd /var/vmail/example.com/user/
|
||||
rm -rf .dovecot.lda-dupes .dovecot.lda-dupes.locks dovecot.list.index dovecot.list.index.log
|
||||
|
||||
# 4. Pre-create the new home (so the first dupe-db write can't fail):
|
||||
install -d -o vmail -g vmail -m 700 /var/vmail/example.com/user/home
|
||||
|
||||
# 5. Verify:
|
||||
doveadm mailbox list -u user@example.com | grep -E '^dovecot' || echo CLEAN
|
||||
```
|
||||
|
||||
The duplicate db now regenerates under `…/user/home/`, where the maildir lister never looks.
|
||||
|
||||
## Gotchas
|
||||
|
||||
- **`mail_home` follows userdb.** A userdb-returned `home` field overrides the global `mail_home` setting, so fix it where userdb defines it (here, `userdb static { fields { home = … } }`).
|
||||
- **What else keys off `~`:** personal Sieve (`~/.dovecot.sieve`, `~/sieve`), `mail_attribute_dict`, and some quota backends. Before moving home, confirm none of those hold live data in the old location (`ls -a` the maildir root). A *global* spam Sieve at a fixed path (`/etc/dovecot/sieve/global/…`) is unaffected.
|
||||
- **Indexes** default to `mail_path`, not home, so moving home doesn't touch `dovecot.index*`.
|
||||
- **Don't trust a local-injection test** to exercise Sieve `redirect`: Postfix `cleanup` header_checks may intercept it first, and `dovecot-lda` may not apply the same before-script as LMTP. Verify the relocation at the authoritative level (`doveadm user` home), since the db location is home-relative by design.
|
||||
|
||||
## Related
|
||||
|
||||
- [[postfix-header-checks-vs-milter-headers]] — the spam-routing migration that introduced the Sieve `redirect` (and thus the dupe db) on majormail.
|
||||
- Upstream: Dovecot mailing-list thread "Change location where .dovecot.lda-dupes* file/dir are created" — maintainer confirms the db follows the user's home.
|
||||
|
|
@ -5,7 +5,7 @@ category: networking
|
|||
tags: [fail2ban, imap, dovecot, email, self-ban]
|
||||
status: published
|
||||
created: 2026-04-02
|
||||
-updated: 2026-04-02
+updated: 2026-06-05
|
||||
---
|
||||
# Mail Client Stops Receiving: Fail2ban IMAP Self-Ban
|
||||
|
||||
|
|
@ -79,6 +79,21 @@ fail2ban-client set dovecot-invalid unbanip <IP>
|
|||
|
||||
Mail should resume immediately without restarting any services.
|
||||
|
||||
### Permanent fix — whitelist the trusted IP (`ignoreip`)
|
||||
|
||||
Unbanning is temporary: if the client keeps failing auth (wrong password, stale token), the same IP gets re-banned within minutes. For a **known, trusted network** (e.g. your home egress IP) add it to Fail2ban's `ignoreip` so it can never be banned:
|
||||
|
||||
```bash
|
||||
# /etc/fail2ban/jail.local, [DEFAULT] section (applies to ALL jails).
# This is a config line, not a shell command:
#   ignoreip = 127.0.0.1/8 ::1 100.64.0.0/10 <home_ip>
|
||||
fail2ban-client reload
|
||||
fail2ban-client get postfix-sasl ignoreip # confirm the IP is listed
|
||||
```
|
||||
|
||||
On majormail this is codified via `fail2ban_ignoreip` in `host_vars/majormail-hetzner/vars.yml` (MajorAnsible commit `fa91fe3`).
|
||||
|
||||
> ⚠️ `ignoreip` takes a **public egress** IP, which may be dynamic. If your ISP reassigns it, the whitelist points at a stale address and bans can return — recheck the egress IP first. Use a subnet only if you trust the whole range.
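A stale whitelist entry can be caught by comparing the current egress IP against the configured list. A minimal sketch; `ip_in_list` is a hypothetical helper that matches literal addresses only (no CIDR arithmetic), and it takes the IP as an argument because how the egress IP is discovered is site-specific. `203.0.113.7` is a documentation-range placeholder.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when the first argument appears verbatim
# among the remaining arguments. Literal match only; CIDR ranges are not expanded.
ip_in_list() {
    local ip=$1 entry
    shift
    for entry in "$@"; do
        [ "$entry" = "$ip" ] && return 0
    done
    return 1
}

# Demo: the placeholder egress IP is present in this example list.
ip_in_list 203.0.113.7 127.0.0.1/8 ::1 100.64.0.0/10 203.0.113.7 && echo "listed"
```

If the function fails for today's egress IP, the `ignoreip` entry has gone stale and needs updating before the next ban.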
|
||||
|
||||
---
|
||||
|
||||
## 🔁 Why This Happens
|
||||
|
|
|
|||
|
|
@ -5,7 +5,7 @@ category: networking
|
|||
tags: [firewalld, mail, imap, fedora, ports]
|
||||
status: published
|
||||
created: 2026-04-02
|
||||
-updated: 2026-04-02
+updated: 2026-06-05
|
||||
---
|
||||
# firewalld: Mail Ports Wiped After Reload (IMAP + Webmail Outage)
|
||||
|
||||
|
|
@ -66,8 +66,24 @@ Expected output:
|
|||
dhcpv6-client http https imap imaps mdns smtp smtp-submission smtps ssh
|
||||
```
|
||||
|
||||
## Variant: One port (587) fails while the rest work — service never added
|
||||
|
||||
A subtler version of this: IMAP (993) and implicit-TLS submission (465) work fine, but **only STARTTLS submission on 587 fails** — clients on 587 get "no route to host." This is **not** a reload wipe; the `submission` service was simply never added during initial setup (the box's mail ports were opened by hand and one was missed).
|
||||
|
||||
```bash
|
||||
# Each mail service, individually — submission will be the odd one out
|
||||
for s in smtp smtps submission imap imaps; do printf "%-12s " "$s"; firewall-cmd --query-service=$s; done
|
||||
|
||||
# Fix (Fedora 44 / firewalld names the 587 service `submission`, NOT `smtp-submission`)
|
||||
firewall-cmd --permanent --zone=public --add-service=submission
|
||||
firewall-cmd --reload
|
||||
```
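The same check can be run against the one-line output of `firewall-cmd --list-services`, which is handy in a post-change verification script. A minimal sketch; `missing_services` is a hypothetical helper, and the service list in the demo is hand-written to reproduce the variant above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every wanted service that is absent from a
# space-separated list of enabled services.
#   $1 = enabled services as one string, remaining args = wanted services
missing_services() {
    local have=" $1 " svc
    shift
    for svc in "$@"; do
        case "$have" in
            *" $svc "*) ;;            # present as a whole word
            *) echo "$svc" ;;         # absent
        esac
    done
}

# Demo: a list that lacks `submission` (the 587 service).
missing_services "dhcpv6-client http https imap imaps smtp smtps ssh" \
    smtp smtps submission imap imaps
```

Live use: `missing_services "$(firewall-cmd --list-services --zone=public)" smtp smtps submission imap imaps`. Empty output means nothing is missing.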
|
||||
|
||||
> On majormail the full mail-service set is now managed declaratively in `roles/majormail/tasks/postfix.yml` (smtp/smtps/**submission**/imap/imaps), so a hand-edit can't leave 587 behind again (MajorAnsible commit `b75f14a`). Seen 2026-06-05.
|
||||
|
||||
## Key Notes
|
||||
|
||||
- **Service name differs by distro/version:** the 587 service is `submission` on current Fedora firewalld; older/other docs may say `smtp-submission`. Verify with `firewall-cmd --get-services | tr ' ' '\n' | grep submission`.
|
||||
- **Always use `--permanent`** when adding services to firewalld on a server. Without it, the rule exists only until the next reload.
|
||||
- **Fail2ban + firewalld**: Fail2ban uses firewalld as its ban backend (`firewallcmd-rich-rules`). When Fail2ban restarts or crashes, it may trigger a `firewall-cmd --reload`, resetting any runtime-only rules.
|
||||
- **Verify after any firewall event**: After Fail2ban restarts, system reboots, or `firewall-cmd --reload`, always confirm mail services are still present with `firewall-cmd --list-services --zone=public`.
|
||||
|
|
@ -77,3 +93,4 @@ dhcpv6-client http https imap imaps mdns smtp smtp-submission smtps ssh
|
|||
|
||||
- [Linux Server Hardening Checklist](../../02-selfhosting/security/linux-server-hardening-checklist.md)
|
||||
- [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](fail2ban-imap-self-ban-mail-client.md)
|
||||
- [Dovecot IMAP Clients Fail to Sync: vsz_limit OOM from a Bloated Index Log](dovecot-imap-oom-vsz-limit-bloated-index.md)
|
||||
|
|
|
|||
|
|
@ -0,0 +1,72 @@
|
|||
---
|
||||
title: "Postfix header_checks Can't Act on Milter-Added Headers (Use Sieve)"
|
||||
domain: troubleshooting
|
||||
category: networking
|
||||
tags: [postfix, milter, header_checks, spamassassin, spamass-milter, dovecot, sieve, spam]
|
||||
status: published
|
||||
created: 2026-06-06
|
||||
updated: 2026-06-06
|
||||
---
|
||||
# Postfix header_checks Can't Act on Milter-Added Headers (Use Sieve)
|
||||
|
||||
A Postfix `header_checks` rule that keys on a header added by a **milter** (e.g. `X-Spam-Flag: YES` from `spamass-milter`/`rspamd`/`opendkim`) appears correct, is wired up, and even fires for test mail — yet silently does nothing for real inbound mail. The cause: `header_checks` run in the `cleanup` daemon and **do not reliably see headers a milter adds**, so a rule like:
|
||||
|
||||
```
|
||||
/^X-Spam-Flag:[[:space:]]+YES/ REDIRECT junk@example.com
|
||||
```
|
||||
|
||||
never matches genuine inbound spam, even though the delivered message clearly contains `X-Spam-Flag: YES`.
|
||||
|
||||
> Hit on **majormail** (2026-06-06): spam-routing REDIRECT had been dead since it was deployed — spam kept reaching the inbox.
|
||||
|
||||
## Why
|
||||
|
||||
Milter header modifications and `header_checks` happen at different stages of `cleanup`, and `header_checks` evaluate the message as received from the network, **before** the milter's header additions are folded in. So for an `smtpd_milter`-tagged message, the flag header is not visible to `header_checks` at the time they run.
|
||||
|
||||
Confusingly, **locally-injected test mail can fire the rule** (timing/origin differences) — so a quick `swaks`/`smtplib` test to `localhost:25` "passes" while real inbound mail silently slips through. Don't trust a local-injection test for this; verify against real inbound mail (or with the method below).
|
||||
|
||||
## How to confirm
|
||||
|
||||
```bash
|
||||
# A delivered message that SHOULD have matched — but wasn't acted on:
|
||||
grep -iE '^(X-Spam-Flag|Delivered-To|Subject):' /var/vmail/<dom>/<user>/.Junk/cur/<msg>
|
||||
# X-Spam-Flag: YES Delivered-To: <user>@… (i.e. NOT redirected)
|
||||
|
||||
# Is the spam scanner an smtpd milter? (then header_checks can't see its headers)
|
||||
postconf smtpd_milters
|
||||
# smtpd_milters = … unix:/run/spamass-milter/spamass-milter.sock
|
||||
|
||||
# maillog: the header_checks REDIRECT never appears for real inbound spam,
|
||||
# only (if at all) for locally-submitted mail ("redirect: … from local").
|
||||
grep -i 'redirect:' /var/log/maillog
|
||||
```
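A plain `grep` over the message file also matches body text, which can mislead when a message quotes another message's headers. A minimal sketch that restricts the search to the header section; `show_routing_headers` is a hypothetical helper and assumes LF line endings, as Maildir files normally have.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print selected headers from one message file,
# stopping at the first blank line so body text cannot match.
#   $1 = path to a message file
show_routing_headers() {
    sed '/^$/q' "$1" | grep -iE '^(X-Spam-Flag|Delivered-To|Subject):'
}
```

If the output shows `X-Spam-Flag: YES` alongside the original recipient in `Delivered-To`, the message was flagged but not redirected.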
|
||||
|
||||
## Fix — act at delivery time, in Sieve
|
||||
|
||||
Dovecot **Sieve** runs at LMTP delivery, *after* the milter, so it reliably sees milter-added headers. Do the routing there instead of in `header_checks`. To keep spam out of the real mailbox entirely (so a push client like Spark never sees it), `redirect` to a dedicated account rather than `fileinto Junk`:
|
||||
|
||||
```sieve
|
||||
require ["envelope"];
|
||||
if header :contains "X-Spam-Flag" "YES" {
|
||||
# Loop guard: a global before-script also runs for junk@'s own delivery.
|
||||
if envelope :is "to" "junk@example.com" {
|
||||
keep;
|
||||
stop;
|
||||
}
|
||||
redirect "junk@example.com";
|
||||
stop;
|
||||
}
|
||||
```
|
||||
|
||||
On majormail this is the global before-script `roles/majormail/templates/spam-to-junk.sieve.j2` (MajorAnsible `07dab90`); `redirect` cancels the implicit keep so the real mailbox stays clean (INBOX *and* Junk). Verify deterministically with `sieve-test -u <user> -r <recipient> <script> <msg.eml>` — it prints the resulting actions.
|
||||
|
||||
## Key Notes
|
||||
|
||||
- **Don't use `header_checks` for milter-added headers.** Options that *do* see them: Sieve at delivery (simplest), Postfix's separate `milter_header_checks` tables (which inspect headers produced by milters; check `postconf(5)` for the supported actions), or running the scanner as a re-injecting `content_filter` (the re-injected message has the flag as a real header). spamass-milter cannot rewrite the envelope recipient itself.
|
||||
- **`redirect` re-injects via the MTA** — if a *global* before-script does the redirect, it also runs for the destination mailbox's delivery; guard with an `envelope :is "to"` check or you get a mail loop.
|
||||
- **Local-injection tests lie here.** A `localhost:25` test may fire a header_checks rule that real inbound mail never triggers.
|
||||
|
||||
## Related
|
||||
|
||||
- [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](fail2ban-imap-self-ban-mail-client.md)
|
||||
- [Dovecot IMAP Clients Fail to Sync: vsz_limit OOM from a Bloated Index Log](dovecot-imap-oom-vsz-limit-bloated-index.md)
|
||||
|
|
@ -0,0 +1,85 @@
|
|||
# Postfix + SendGrid: TLS Handshake Failure (Port 465 vs 587)
|
||||
|
||||
## Symptom
|
||||
|
||||
Outbound mail silently queues with no delivery. `postqueue -p` shows deferred messages:
|
||||
|
||||
```
|
||||
(Cannot start TLS: handshake failure)
|
||||
```
|
||||
|
||||
`/var/log/maillog` shows:
|
||||
|
||||
```
|
||||
SSL_connect error to smtp.sendgrid.net[...]:465: -1
|
||||
warning: TLS library problem: error:0A00010B:SSL routines::wrong version number
|
||||
```
|
||||
|
||||
Or on port 587:
|
||||
|
||||
```
|
||||
warning: TLS library problem: error:0A0000C1:SSL routines::no shared cipher
|
||||
```
|
||||
|
||||
## Root Cause
|
||||
|
||||
Port **465** (SMTPS) uses **implicit TLS** — the connection starts encrypted immediately. Port **587** (submission) uses **STARTTLS** — the connection starts plaintext, then upgrades.
|
||||
|
||||
Postfix has two settings that must match the port:
|
||||
|
||||
| Port | `smtp_tls_wrappermode` | `smtp_tls_security_level` |
|
||||
|------|------------------------|---------------------------|
|
||||
| 465 | `yes` | `encrypt` |
|
||||
| 587 | `no` | `encrypt` (or `may`) |
|
||||
|
||||
If `smtp_tls_wrappermode=yes` is set with port 587, Postfix sends a TLS ClientHello immediately, but the server opens with a plaintext SMTP greeting that the TLS library cannot parse as a TLS record: `wrong version number`.
|
||||
|
||||
If `smtp_tls_wrappermode=no` is set with port 465, Postfix waits for a plaintext SMTP greeting while the server waits for a TLS ClientHello, so the session stalls and fails with a timeout, a connection reset, or `no shared cipher`.
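The pairing in the table can be checked mechanically before restarting Postfix. A minimal sketch; `check_wrappermode` is a hypothetical helper that only knows the two ports discussed here and takes the two values as arguments.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when the relayhost port and wrappermode agree.
#   $1 = relayhost value, e.g. "[smtp.sendgrid.net]:587"
#   $2 = smtp_tls_wrappermode value ("yes" or "no")
check_wrappermode() {
    local port=${1##*:} mode=$2 want
    case "$port" in
        465) want=yes ;;   # implicit TLS
        587) want=no ;;    # STARTTLS
        *)   echo "port $port: no rule in this sketch"; return 2 ;;
    esac
    if [ "$mode" = "$want" ]; then
        echo "port $port: wrappermode=$mode OK"
    else
        echo "port $port: wrappermode=$mode MISMATCH (want $want)"
        return 1
    fi
}

# Demo with the recommended pairing.
check_wrappermode '[smtp.sendgrid.net]:587' no
```

Against a live config, the arguments can come from `postconf -h relayhost` and `postconf -h smtp_tls_wrappermode`.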
|
||||
|
||||
## Fix
|
||||
|
||||
Use port 587 + STARTTLS (recommended — more widely supported and debuggable):
|
||||
|
||||
```bash
|
||||
postconf -e 'relayhost = [smtp.sendgrid.net]:587'
|
||||
postconf -e 'smtp_tls_wrappermode = no'
|
||||
postconf -e 'smtp_tls_security_level = encrypt'
|
||||
systemctl restart postfix
|
||||
postqueue -f # flush stuck messages
|
||||
```
|
||||
|
||||
## Verify
|
||||
|
||||
```bash
|
||||
# Check config
|
||||
postconf relayhost smtp_tls_wrappermode smtp_tls_security_level
|
||||
|
||||
# Test TLS connection manually
|
||||
openssl s_client -starttls smtp -connect smtp.sendgrid.net:587 -brief
|
||||
|
||||
# Watch delivery
|
||||
tail -f /var/log/maillog | grep status=
|
||||
```
|
||||
|
||||
Successful delivery looks like:
|
||||
|
||||
```
|
||||
Untrusted TLS connection established to smtp.sendgrid.net[...]:587: TLSv1.3 with cipher TLS_AES_128_GCM_SHA256
|
||||
status=sent (250 Ok: queued as ...)
|
||||
```
|
||||
|
||||
## Why "Untrusted"?
|
||||
|
||||
If `smtp_tls_CAfile` and `smtp_tls_CApath` are both empty, Postfix can't verify the server certificate and logs "Untrusted TLS connection." The connection is still encrypted — just not authenticated. To fix, point to the system CA bundle:
|
||||
|
||||
```bash
|
||||
postconf -e 'smtp_tls_CAfile = /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem' # Fedora
|
||||
# or
|
||||
postconf -e 'smtp_tls_CAfile = /etc/ssl/certs/ca-certificates.crt' # Ubuntu/Debian
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- OpenSSL 3.x is stricter about protocol mismatches than OpenSSL 1.1 — a config that worked on older distros may break after an OS upgrade.
|
||||
- SendGrid supports both ports, but port 587 + STARTTLS is the documented recommendation.
|
||||
- This applies to any SMTP relay (Mailgun, AWS SES, etc.), not just SendGrid — the port/wrappermode pairing is universal.
|
||||
|
|
@ -0,0 +1,133 @@
|
|||
---
|
||||
title: "SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure (No `Host` Block)"
|
||||
domain: selfhosting
|
||||
category: troubleshooting
|
||||
tags:
|
||||
- ssh
|
||||
- ssh-config
|
||||
- tailscale
|
||||
- magicdns
|
||||
- known-hosts
|
||||
- host-key
|
||||
- troubleshooting
|
||||
status: published
|
||||
created: 2026-06-11
|
||||
updated: 2026-06-12
|
||||
---
|
||||
|
||||
# SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure (No `Host` Block)
|
||||
|
||||
## The Problem
|
||||
|
||||
You `ssh` to a host you've reached many times before, but now it dies before any
|
||||
auth happens:
|
||||
|
||||
```
|
||||
$ ssh MyMac
|
||||
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
|
||||
Host key verification failed.
|
||||
```
|
||||
|
||||
On a headless box (WSL, a server, a CI runner) there's no askpass binary, so the
|
||||
prompt can't even be shown — SSH just aborts. Connecting **by Tailscale IP** works
|
||||
fine:
|
||||
|
||||
```
|
||||
$ ssh user@100.74.124.81 # works
|
||||
$ ssh MyMac # Host key verification failed
|
||||
```
|
||||
|
||||
## Why It Happens
|
||||
|
||||
There is **no `Host MyMac` block in `~/.ssh/config` at all** — and there never was.
|
||||
The connection only ever worked by IP, or interactively (where you clicked through
|
||||
the first-connect `yes` prompt without noticing).
|
||||
|
||||
When no `Host` block matches, SSH uses the literal argument as the hostname. With
|
||||
Tailscale MagicDNS, `MyMac` (or `mymac`) resolves to the node — so the *connection*
|
||||
succeeds — but the host key it presents is checked against `known_hosts` under the
|
||||
name **`mymac`**, which has no entry. Meanwhile the key you actually trust is stored
|
||||
under the **IP**:
|
||||
|
||||
```
|
||||
$ ssh-keygen -F 100.74.124.81 # found — line 67
|
||||
$ ssh-keygen -F mymac # nothing
|
||||
```
|
||||
|
||||
So strict host-key checking has nothing to match, tries to prompt to accept the
|
||||
"new" key, and on a headless host that prompt fails → `Host key verification failed`.
|
||||
|
||||
Confirm there's no block (and that `ssh -G` is just echoing defaults):
|
||||
|
||||
```
|
||||
$ ssh -G MyMac | grep -E '^(hostname|user|port) '
|
||||
hostname mymac # lowercased literal — NOT an explicit HostName
|
||||
user youruser # your local username default — not from a block
|
||||
port 22 # default
|
||||
```
|
||||
|
||||
If `hostname` equals the arg you typed (just lowercased) and `user` is your local
|
||||
login name, there is no matching `Host` block.
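That rule of thumb can be turned into a small check. A minimal sketch; `has_explicit_hostname` is a hypothetical helper that reads `ssh -G <alias>` output on stdin and tests only the `hostname` field, so a `Host` block that sets `User` but no `HostName` still counts as falling through.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `ssh -G <alias>` output on stdin and succeed when
# the resolved hostname differs from the lowercased alias, i.e. some Host
# block supplied an explicit HostName.
#   $1 = the alias exactly as typed on the ssh command line
has_explicit_hostname() {
    local alias_lc resolved
    alias_lc=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    resolved=$(awk '$1 == "hostname" { print $2; exit }')
    [ -n "$resolved" ] && [ "$resolved" != "$alias_lc" ]
}

# Demo with the defaults-only output shown above.
if printf 'hostname mymac\nuser youruser\nport 22\n' | has_explicit_hostname MyMac; then
    echo "Host block with HostName found"
else
    echo "no explicit HostName: alias falls through to DNS"
fi
```

Live use: `ssh -G MyMac | has_explicit_hostname MyMac`.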
|
||||
|
||||
## The Fix
|
||||
|
||||
Add an explicit `Host` block that **pins the IP** that `known_hosts` already trusts.
|
||||
This matches the convention every other host in a Tailscale fleet should follow —
|
||||
pin the `100.x` address, not the MagicDNS name:
|
||||
|
||||
```sshconfig
|
||||
Host MyMac mymac
|
||||
HostName 100.74.124.81
|
||||
User youruser
|
||||
IdentityFile ~/.ssh/id_ed25519
|
||||
```
|
||||
|
||||
> [!note] When pinning the IP is the *wrong* call
|
||||
> Pinning the IP is right while the host is **stable**. If the box gets migrated or
|
||||
> rebuilt — new Tailscale IP *and* new host key — the pin rots and `known_hosts`
|
||||
> mismatches. At that point switch to **MagicDNS names** so the alias self-heals. See
|
||||
> *[MagicDNS Names vs Pinned IPs for Tailscale SSH (After a Fleet Migration)](tailscale-ssh-magicdns-vs-pinned-ip-after-migration.md)*.
|
||||
|
||||
Now `ssh MyMac` resolves to `100.74.124.81`, whose key is in `known_hosts`, and the
|
||||
check passes with no prompt. Verify non-interactively:
|
||||
|
||||
```
|
||||
$ ssh -o BatchMode=yes MyMac 'hostname'
|
||||
mymac.majorlan
|
||||
```
|
||||
|
||||
`BatchMode=yes` disables every prompt — if it returns the hostname cleanly, the key
|
||||
is trusted and a real key authenticated.
|
||||
|
||||
**Don't over-pin the identity.** Run `ssh -v user@<IP> true` and check the
|
||||
`Will attempt key` / accepted-key lines first. A workstation often authenticates
|
||||
with the *default* `id_ed25519`, not a fleet key — if `id_ed25519_fleet` isn't even
|
||||
offered, don't put it in the block.
|
||||
|
||||
## Cleanup: Stale `known_hosts` Cruft
|
||||
|
||||
Drive-by `ssh` attempts leave junk entries like `mymac-2` (auto-suffixed names from
|
||||
old keys). They never match anything once you pin the IP. Purge them:
|
||||
|
||||
```
|
||||
$ ssh-keygen -R mymac-2
|
||||
```
|
||||
|
||||
## How to Diagnose This
|
||||
|
||||
1. `ssh -o BatchMode=yes <alias> true` — if it fails with `Host key verification
|
||||
failed` (not `Permission denied`), it's a host-key problem, not auth.
|
||||
2. `ssh -G <alias> | grep -E '^(hostname|user|port) '` — if `hostname` is just your
|
||||
typed arg and there's no real `HostName`, there's no `Host` block.
|
||||
3. `ssh-keygen -F <name>` vs `ssh-keygen -F <ip>` — find which name actually holds
|
||||
the trusted key. Pin whichever one `known_hosts` has (usually the IP).
|
||||
|
||||
## Why This Gotcha Is Invisible
|
||||
|
||||
It only surfaces on a host with **no askpass** (headless / WSL / cron). On a desktop,
|
||||
the first-connect prompt appears, you hit `yes`, an entry gets written under the
|
||||
MagicDNS name, and it "just works" — masking the fact that no `Host` block exists and
|
||||
the IP-keyed entry is the only durable trust. Move the same config to a headless box
|
||||
and the missing block becomes a hard failure. Related: SSH only applies `Host` blocks
|
||||
by **literal pattern match**, so connecting by IP also skips them — see *Ansible Fails
|
||||
with Permission Denied While `ssh <alias>` Works (Host Alias Bypass)*.
|
||||
|
|
@ -0,0 +1,160 @@
|
|||
---
|
||||
title: "SSH `Permission denied (publickey)` After Rotating a Key — Backfill Every `authorized_keys`"
|
||||
domain: selfhosting
|
||||
category: troubleshooting
|
||||
tags:
|
||||
- ssh
|
||||
- ssh-keys
|
||||
- authorized-keys
|
||||
- key-rotation
|
||||
- publickey
|
||||
- fleet
|
||||
- troubleshooting
|
||||
status: published
|
||||
created: 2026-06-17
|
||||
updated: 2026-06-17
|
||||
---
|
||||
|
||||
# SSH `Permission denied (publickey)` After Rotating a Key — Backfill Every `authorized_keys`
|
||||
|
||||
## The Problem
|
||||
|
||||
A host you've SSH'd into for months suddenly rejects you — but **only some hosts**, not all:
|
||||
|
||||
```
|
||||
$ ssh root@host-a
|
||||
root@host-a: Permission denied (publickey).
|
||||
|
||||
$ ssh root@host-b # same key, same workstation — works fine
|
||||
host-b $
|
||||
```
|
||||
|
||||
Nothing changed on the servers. The thing that changed is on **your** side: at some
|
||||
point the workstation's SSH key was **regenerated** (lost laptop, rebuild, a key file
|
||||
clobbered by a botched copy, a routine rotation). The new public key was pushed to a
|
||||
few hosts but never fanned out to the rest. Every host still holding only the *old*
|
||||
public key now rejects the new private key with `Permission denied (publickey)`.
|
||||
|
||||
> The tell: it's `Permission denied (publickey)`, **not** `Host key verification
|
||||
> failed`. The former is an **authorization** failure (the server doesn't trust your
|
||||
> key); the latter is the server's key not matching your `known_hosts`. Different
|
||||
> problem — see *[SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure](ssh-missing-host-block-magicdns-host-key-failure.md)*.
|
||||
|
||||
## Why It Happens
|
||||
|
||||
Public-key auth is **per-host**: the server only lets you in if your public key is a
|
||||
line in that host's `~/.ssh/authorized_keys`. There is no central directory — each
|
||||
host is its own island. So when you rotate a key, *every* host needs the new public
|
||||
key appended independently.
|
||||
|
||||
It's easy to do this partially without noticing. You regenerate the key, then over the
|
||||
next hour you happen to SSH into three boxes and (re-)deploy the key there as part of
|
||||
other work. Those three now trust the new key. The other six don't — and you won't
|
||||
find out until weeks later when you reach for one of them.
|
||||
|
||||
Confirm it's an authentication (key) failure and see which key is being offered:
|
||||
|
||||
```
|
||||
$ ssh -v root@host-a 2>&1 | grep -E 'Offering|Authentications|Permission denied'
|
||||
debug1: Offering public key: /home/you/.ssh/id_ed25519 ED25519 SHA256:XeY1/N9qwB…
|
||||
debug1: Authentications that can continue: publickey
|
||||
root@host-a: Permission denied (publickey).
|
||||
```
|
||||
|
||||
The server offered you nothing but `publickey`, you offered your current key, and it
|
||||
was refused → your key isn't in that host's `authorized_keys`.
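
Once you have a transit path onto the host, the same conclusion can be checked from the server side. A minimal sketch, assuming bash; `key_in_file` is a hypothetical helper, and it compares only the base64 key blob (the second field of the `.pub` file) so a differing comment does not hide a match:

```bash
#!/usr/bin/env bash
# key_in_file PUBKEY_FILE AUTHORIZED_KEYS_FILE
# Hypothetical helper (not from the original article).
# Succeeds if the key blob from PUBKEY_FILE appears on any line of
# AUTHORIZED_KEYS_FILE, regardless of the trailing comment.
key_in_file() {
  local blob
  # A .pub file is "<type> <base64-blob> [comment]"; take the blob.
  blob=$(awk 'NF >= 2 {print $2; exit}' "$1")
  [ -n "$blob" ] && grep -qF -- "$blob" "$2"
}

# Example (run where both files are readable):
#   key_in_file ./id_ed25519.pub /root/.ssh/authorized_keys && echo present
```

A non-zero exit here, combined with the `-v` trace above, points at a missing key rather than at file permissions or the wrong identity being offered.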
|
||||
|
||||
## Scope It First — Don't Fix One Host at a Time
|
||||
|
||||
The host you noticed is rarely the only one. Sweep the whole fleet in one pass before
|
||||
touching anything, so you fix the real set, not just the squeaky wheel:
|
||||
|
||||
```bash
|
||||
for h in host-a host-b host-c host-d host-e host-f; do
|
||||
r=$(ssh -o BatchMode=yes -o ConnectTimeout=8 root@"$h" 'echo OK' 2>&1 | tail -1)
|
||||
echo "$h: $r"
|
||||
done
|
||||
```
|
||||
|
||||
`BatchMode=yes` suppresses password/passphrase prompts so a failure fails fast instead
|
||||
of hanging. Anything that doesn't print `OK` needs the backfill.
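
The sweep's raw output can be bucketed so that key failures stand apart from host-key or reachability problems. A minimal sketch, assuming bash; `classify_ssh_result` and `sweep_hosts` are hypothetical helper names, and the matched strings are the usual OpenSSH client messages:

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not from the original article).

# Map the last line of ssh's output to a short failure class.
classify_ssh_result() {
  case "$1" in
    OK)                                 echo ok ;;
    *"Permission denied (publickey"*)   echo auth ;;
    *"Host key verification failed"*)   echo hostkey ;;
    *"timed out"*|*"Connection refused"*|*"Could not resolve hostname"*)
                                        echo unreachable ;;
    *)                                  echo other ;;
  esac
}

# Sweep the given hosts and print "host<TAB>class" for each one.
sweep_hosts() {
  local h r
  for h in "$@"; do
    r=$(ssh -n -o BatchMode=yes -o ConnectTimeout=8 root@"$h" 'echo OK' 2>&1 | tail -1)
    printf '%s\t%s\n' "$h" "$(classify_ssh_result "$r")"
  done
}

# Example: sweep_hosts host-a host-b host-c | awk -F'\t' '$2 == "auth" {print $1}'
```

Only the hosts classed `auth` need the key backfill; `hostkey` and `unreachable` are different problems with different fixes.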
|
||||
|
||||
## The Fix
|
||||
|
||||
You need a **second, still-trusted** way onto each failing host to append the new key.
|
||||
Common transit options, best first:
|
||||
|
||||
- **Another of your keys that still works** (e.g. a config-management / automation
|
||||
user whose key is authorized fleet-wide, ideally with `sudo`).
|
||||
- **Another workstation** whose key those hosts still trust.
|
||||
- **The provider's web console / serial console** as a last resort.
|
||||
|
||||
> [!warning] A jump host only helps if *it* can reach the target
|
||||
> "Bounce through a box that still trusts me" only works if that box's own key is in
|
||||
> the target's `authorized_keys`. A host can trust *your* key yet have no standing
|
||||
> trust to a third host (and hit its own `Host key verification failed` on the way).
|
||||
> Test the full two-hop path before relying on it.
|
||||
|
||||
Using a fleet-wide automation user (`deploy`) with passwordless `sudo` as the transit,
|
||||
append the new key idempotently, with a backup, to every failing host:
|
||||
|
||||
```bash
|
||||
PUBKEY=$(cat ~/.ssh/id_ed25519.pub)
|
||||
STAMP=$(date +%Y%m%d-%H%M%S)
|
||||
for h in host-a host-c host-e; do # only the hosts that failed the sweep
|
||||
ssh deploy@"$h" "sudo bash -s" <<EOF
|
||||
set -e
|
||||
F=/root/.ssh/authorized_keys
|
||||
mkdir -p /root/.ssh && touch "\$F"
|
||||
cp "\$F" "\$F.bak-$STAMP" # backup before any change
|
||||
grep -qF "$PUBKEY" "\$F" || printf '%s\n' "$PUBKEY" >> "\$F" # append only if absent
|
||||
chmod 600 "\$F"
|
||||
EOF
|
||||
done
|
||||
```
|
||||
|
||||
Three things that keep this safe:
|
||||
|
||||
- **Append, never overwrite.** `>> "$F"` and the `grep -qF … ||` guard mean you add
one line, and only if it's missing, so re-running adds nothing. Never clobber an
`authorized_keys` with `>`, or you'll lock out every *other* key on the box.
|
||||
- **Back up first.** The `.bak-<stamp>` copy is your undo.
|
||||
- **`chmod 600`.** With the default `StrictModes yes`, sshd silently ignores an
`authorized_keys` that is group- or world-writable, which looks exactly like "the key didn't take."
|
||||
|
||||
Then verify directly — not through the transit user:
|
||||
|
||||
```bash
|
||||
for h in host-a host-c host-e; do
|
||||
echo "$h: $(ssh -o BatchMode=yes root@"$h" 'echo OK' 2>&1 | tail -1)"
|
||||
done
|
||||
```
|
||||
|
||||
All `OK` means the new key authenticates on its own.
|
||||
|
||||
## Prevention
|
||||
|
||||
- **Treat rotation as fleet-wide.** When a workstation key changes, the very next step
|
||||
is to fan the new public key out to **every** host's `authorized_keys` in one pass —
|
||||
not opportunistically as you happen to log in. A short `for` loop over the full host
|
||||
list (or a config-management task — see below) closes the gap immediately.
|
||||
- **Manage `authorized_keys` declaratively.** An Ansible `ansible.posix.authorized_key`
|
||||
task (or equivalent) that lists the *current* set of keys makes "who can log in" a
|
||||
reviewed, version-controlled fact instead of an append-only pile that drifts per host.
|
||||
- **Keep the old key authorized until the new one is verified everywhere**, then remove
|
||||
the stale line in a deliberate cleanup pass.
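
The cleanup pass from the last bullet can follow the same backup-first pattern as the backfill. A minimal sketch, assuming bash and that you know the old key's base64 blob; `remove_key_blob` is a hypothetical helper, not part of the original procedure:

```bash
#!/usr/bin/env bash
# remove_key_blob AUTHORIZED_KEYS_FILE OLD_BLOB
# Hypothetical helper: back up the file, then rewrite it without any line
# that contains OLD_BLOB. Refuses to run with an empty blob, because an
# empty fixed-string pattern would match (and remove) every line.
remove_key_blob() {
  local f=$1 blob=$2 tmp
  [ -n "$blob" ] || return 2
  cp -p "$f" "$f.bak-$(date +%Y%m%d-%H%M%S)" || return 1
  tmp=$(mktemp "$f.XXXXXX") || return 1
  # grep exits 1 when no lines survive; only exit codes above 1 are real errors.
  grep -vF -- "$blob" "$f" > "$tmp" || [ $? -eq 1 ] || { rm -f "$tmp"; return 1; }
  chmod 600 "$tmp" && mv "$tmp" "$f"
}

# Example (as root on the target host):
#   remove_key_blob /root/.ssh/authorized_keys "$OLD_BLOB"
```

Run it only after the direct `BatchMode` verification passes everywhere, and check that at least one working key remains in the file afterwards.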
|
||||
|
||||
## How to Diagnose This (Checklist)
|
||||
|
||||
1. `ssh -o BatchMode=yes <host> true` → `Permission denied (publickey)` (auth), not
|
||||
`Host key verification failed` (host key). Confirms which problem you have.
|
||||
2. `ssh -v <host> 2>&1 | grep Offering` → which private key is being offered, and its
|
||||
fingerprint.
|
||||
3. Sweep the whole fleet with the `BatchMode` loop → get the **full** list of affected
|
||||
hosts before fixing.
|
||||
4. Append the new public key (idempotent, backed up, `chmod 600`) via a still-trusted
|
||||
transit path.
|
||||
5. Re-verify each host with a direct `BatchMode` login.
|
||||
|
||||
Related: *[SSH Config & Key Management](../../01-linux/networking/ssh-config-key-management.md)*
|
||||
and *[SSH Hardening Across a Fleet with Ansible](../../02-selfhosting/security/ssh-hardening-ansible-fleet.md)*.
|
||||
|
|
@ -0,0 +1,157 @@
|
|||
# Tailscale Boot Race Conditions (SSH Unreachable After Reboot)
|
||||
|
||||
Two related race conditions can make a host unreachable via Tailscale after reboot. Both stem from systemd services starting before Tailscale or the network is ready.
|
||||
|
||||
---
|
||||
|
||||
## Race 1: ssh.socket Binds Before Tailscale Is Up (Ubuntu)
|
||||
|
||||
### Symptom
|
||||
|
||||
SSH to a host via its Tailscale IP times out. `tailscale ping` works and `tailscale status` shows `active; direct`, but nothing answers on port 22. If the root password is unset, the Hetzner console offers no way in either.
|
||||
|
||||
### Cause
|
||||
|
||||
Ubuntu 24.04 uses systemd **socket activation** for SSH (`ssh.socket` instead of persistent `ssh.service`). When the socket override binds to a Tailscale IP, it can start *before* `tailscaled.service` is ready. The bind may succeed initially (Tailscale state file caches the IP), but a later Tailscale reconnect or interface reset invalidates the bound address silently — SSH dies with no recovery path.
|
||||
|
||||
### Diagnosis
|
||||
|
||||
```bash
|
||||
# From another host:
|
||||
tailscale ping <IP> # succeeds — host is up
|
||||
ssh root@<IP> # times out — sshd not listening
|
||||
|
||||
# After gaining console access or reboot:
|
||||
systemctl status ssh.socket # check Listen: address
|
||||
journalctl -b -1 -u ssh # likely empty — sshd never spawned
|
||||
journalctl -b -1 -u ssh.socket # socket started before tailscaled
|
||||
```
|
||||
|
||||
### Fix (current — 2026-05-31)
|
||||
|
||||
`After=tailscaled.service` orders against the service becoming `active` — **not** against the `tailscale0` interface actually having an IPv4 address. tailscaled flips to active within a second of starting, but the kernel doesn't have the address bound to the interface until DERP relays connect and the control plane confirms the node. ssh.socket attempting `ListenStream=<TS IP>:22` in that window fails with `Cannot assign requested address`, the socket goes into a failed state, and there is no automatic retry.
|
||||
|
||||
The proper gate is a dedicated readiness service that **waits for the tailscale0 IPv4 address to exist** before letting ssh.socket bind:
|
||||
|
||||
```ini
|
||||
# /etc/systemd/system/tailscale-wait-ready.service
|
||||
[Unit]
|
||||
Description=Wait until tailscale0 has an IPv4 address
|
||||
After=tailscaled.service
|
||||
Requires=tailscaled.service
|
||||
ConditionPathExists=/usr/sbin/ip
|
||||
|
||||
[Service]
|
||||
Type=oneshot
|
||||
RemainAfterExit=yes
|
||||
TimeoutStartSec=120
|
||||
ExecStart=/usr/bin/bash -c 'for i in $(seq 1 120); do ip -4 -o addr show tailscale0 2>/dev/null | grep -q "inet " && exit 0; sleep 1; done; exit 1'
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
```
|
||||
|
||||
```ini
|
||||
# /etc/systemd/system/ssh.socket.d/override.conf
|
||||
[Unit]
|
||||
After=tailscale-wait-ready.service
|
||||
Requires=tailscale-wait-ready.service
|
||||
|
||||
[Socket]
|
||||
ListenStream=
|
||||
ListenStream=<TAILSCALE_IP>:22
|
||||
```
|
||||
|
||||
Reload + restart:
|
||||
|
||||
```bash
|
||||
systemctl daemon-reload
|
||||
systemctl enable tailscale-wait-ready.service
|
||||
systemctl restart ssh.socket
|
||||
ss -tlnp | grep :22 # verify bound to Tailscale IP
|
||||
```
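
The `ss` check above can be turned into a pass/fail test for use after a reboot. A minimal sketch, assuming bash and iproute2's `ss`; `ssh_listens_on` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# ssh_listens_on IP [PORT]
# Hypothetical helper: succeed if a TCP listener is bound to exactly IP:PORT.
# Relies on `ss -Htln` (no header, TCP, listening, numeric), where the fourth
# column is the local address:port.
ssh_listens_on() {
  local want="$1:${2:-22}"
  ss -Htln | awk -v want="$want" '$4 == want {found = 1} END {exit !found}'
}

# Example, on the host itself:
#   ssh_listens_on "$(tailscale ip -4)" || echo "ssh.socket is NOT bound to the Tailscale IP"
```

Running this a minute or so after boot distinguishes a socket that bound correctly from one that failed with `Cannot assign requested address` and never retried.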
|
||||
|
||||
!!! note "Evolution of this fix"
|
||||
    - **2026-05-19 v1:** `After=tailscaled.service` + `BindsTo=tailscaled.service`. Worked initially but caused a shutdown-time ordering cycle.
    - **2026-05-23 v2:** `BindsTo` swapped for `Requires` to break the cycle. Fixed the cycle but did **not** wait for `tailscale0` to actually have an IP, just for `tailscaled` to be active. Hosts continued losing SSH after some reboots (intermittent, depending on whether the race won).
    - **2026-05-31 v3:** Added `tailscale-wait-ready.service` to gate ssh.socket on the interface having an address. This is the current canonical fix.
|
||||
|
||||
!!! warning "Do NOT use BindsTo"
|
||||
    `BindsTo=tailscaled.service` creates a **systemd ordering cycle** during shutdown: `basic.target → sockets.target → ssh.socket → tailscaled.service → basic.target`. Systemd breaks the cycle by deleting jobs unpredictably, which can prevent `ssh.socket` from starting on the next boot. Use `Requires=` for startup ordering without the bidirectional lifecycle coupling.
|
||||
|
||||
### Affected Hosts
|
||||
|
||||
Ubuntu hosts locked via the `tailscale` role (`ssh_only_ubuntu` task, formerly `configure_tailscale_ssh_only.yml`): majorlinux, dcaprod-hetzner, tttpod-hetzner, majortoot-hetzner.
|
||||
|
||||
> [!danger] The Ubuntu playbook shipped the cycle pattern until 2026-06-07
|
||||
> Despite the 2026-06-04 resolution above, `configure_tailscale_ssh_only.yml` in the repo kept deploying the `[Unit] Requires=tailscale-wait-ready.service` gate on **ssh.socket** (the cycle-causer) and never added the ssh.service gate — so re-running it *re-armed* the ordering cycle. Caught 2026-06-07: it clobbered majorlinux's hand-fix, and **majortoot-hetzner was found already armed** with the latent cycle (would have lost SSH on its next reboot). Both restored/defused; playbook corrected in MajorAnsible `e0d35aa` (gate on ssh.service, dependency-free socket).
|
||||
>
|
||||
> **Fleet audited & reconciled 2026-06-07:** dcaprod-hetzner + tttpod-hetzner had the dependency-free socket already but were **missing `tailscale-wait-ready.service`** (their ssh.service gate referenced a non-existent unit → inert → latent *bind* race, not a cycle); the corrected playbook was applied to both, deploying the service and activating the gate. teelia uses **Tailscale SSH** (no sshd, ssh.socket/ssh.service disabled), so it is immune to both races. All Ubuntu hosts now run the same pattern: dependency-free `ssh.socket` bind + `ssh.service` readiness gate + `tailscale-wait-ready.service`.
|
||||
|
||||
> [!warning] Fedora hosts are NOT automatically immune (corrected 2026-06-07)
|
||||
> The firewalld method (`configure_tailscale_ssh_only_fedora.yml`) binds sshd on `0.0.0.0:22` and enforces Tailscale-only via the firewall, so it has no dependency on the Tailscale address — **unless** a host also carries a leftover manual `ListenAddress <tailscale-ip>` drop-in (`/etc/ssh/sshd_config.d/tailscale-only.conf`) from the pre-firewall lockdown. Then sshd.service hits the same boot bind-race (`Bind to port 22 on <ts-ip> failed: Cannot assign requested address`) and flaps every reboot. Hit on **majordiscord 2026-06-07**; fixed by removing the redundant drop-in (firewall stays the enforcing layer). The Fedora playbook now removes it automatically (MajorAnsible `b4a9090`).
|
||||
|
||||
---
|
||||
|
||||
## Race 2: tailscaled Starts Before Network Is Online (All Hosts)
|
||||
|
||||
### Symptom
|
||||
|
||||
Host reboots but never appears on Tailscale. `tailscale ping` times out entirely. SSH is dead because Tailscale never connects. The host is up (accessible via provider console) but isolated from the Tailscale network.
|
||||
|
||||
### Cause
|
||||
|
||||
`tailscaled.service` ships with `After=network-pre.target`, which fires *before* the network interface has an IP. On VPS hosts (especially Hetzner), the interface can take several seconds to come online. Tailscale starts, sees no network (`SetNetworkUp(false)`, `link state: defaultRoute= ifs={} v4=false v6=false`), fails DNS bootstrap and DERP relay connections, and gets stuck — never retrying.
|
||||
|
||||
### Diagnosis
|
||||
|
||||
```bash
|
||||
# From Hetzner console or another access method:
|
||||
journalctl -b -u tailscaled | grep -E "SetNetworkUp|link state|error|DERP"
|
||||
# Look for:
|
||||
# magicsock: SetNetworkUp(false)
|
||||
# link state: interfaces.State{defaultRoute= ifs={} v4=false v6=false}
|
||||
# health: Tailscale could not connect to any relay server
|
||||
```
|
||||
|
||||
### Fix
|
||||
|
||||
Deploy a systemd drop-in to wait for full network connectivity:
|
||||
|
||||
```ini
|
||||
# /etc/systemd/system/tailscaled.service.d/override.conf
|
||||
[Unit]
|
||||
After=network-online.target
|
||||
Wants=network-online.target
|
||||
```
|
||||
|
||||
Then reload and restart:
|
||||
|
||||
```bash
|
||||
systemctl daemon-reload
|
||||
systemctl restart tailscaled
|
||||
```
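
To confirm the drop-in actually took effect, the unit's resolved ordering can be inspected. A minimal sketch, assuming bash and a systemd new enough to support `systemctl show --value`; `unit_ordered_after` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# unit_ordered_after UNIT TARGET
# Hypothetical helper: succeed if TARGET appears in UNIT's effective After= list.
unit_ordered_after() {
  systemctl show -p After --value "$1" | tr ' ' '\n' | grep -qxF -- "$2"
}

# Example:
#   unit_ordered_after tailscaled.service network-online.target \
#     || echo "drop-in missing or daemon-reload not run"
```

Note that `network-online.target` only delays anything if a wait-online service (for example `systemd-networkd-wait-online.service` or `NetworkManager-wait-online.service`) is enabled on the host; which one applies depends on the distribution, so treat that part as something to verify per host.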
|
||||
|
||||
### Affected Hosts
|
||||
|
||||
All hosts where Tailscale is the primary access path. Particularly impactful on VPS hosts with slow interface bringup. Both Fedora and Ubuntu hosts are affected.
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- Set root passwords on all VPS hosts for emergency console access
|
||||
- The `tailscale` role deploys all fixes automatically (run via `tailscale.yml` / `site.yml`):
|
||||
  - `network_wait` task: tailscaled network-online dependency (all hosts)
  - `ssh_only_ubuntu` task: dependency-free ssh.socket bind + ssh.service readiness gate + `tailscale-wait-ready.service` (Ubuntu group)
  - `ssh_only_fedora` task: firewalld Tailscale-only lockdown; removes any leftover `ListenAddress` drop-in (Fedora group)
|
||||
|
||||
## References
|
||||
|
||||
- [[dcaprod#2026-05-19 — SSH unreachable due to ssh.socket race condition with Tailscale]]
|
||||
- [[majordiscord#2026-05-19 — Tailscale boot race: unreachable after Ansible reboot]]
|
||||
- [[majorlinux#2026-05-19 — ssh.socket override patched: added Tailscale dependency]]
|
||||
- [[dcaprod#2026-05-23 — SSH unreachable again: BindsTo ordering cycle in ssh.socket override]]
|
||||
- [[majorlinux#2026-05-31 — ssh.socket race recurrence post-reboot (Requires= insufficient; added wait-ready gate)]]
|
||||
- [[majortoot#2026-05-31 — ssh.socket race post-reboot on majortoot-hetzner (during cutover night)]]
|
||||
- Ansible: the `tailscale` role (`tailscale.yml`) — `network_wait` + `ssh_only_ubuntu`/`ssh_only_fedora` tasks, consolidated from the former `configure_tailscale_*` playbooks (MajorAnsible `656302e`)
|
||||
|
|
@ -0,0 +1,133 @@
|
|||
---
|
||||
title: "Steam Deck Wi-Fi Flapping: IWD Periodic Scan + rtw88 Power Save"
|
||||
domain: troubleshooting
|
||||
category: networking
|
||||
tags: [wifi, steam-deck, steamos, iwd, networkmanager, rtw88, rtl8822ce, power-save, supplicant-disconnect, flapping]
|
||||
status: published
|
||||
created: 2026-06-19
|
||||
updated: 2026-06-19
|
||||
---
|
||||
|
||||
# Steam Deck Wi-Fi Flapping: IWD Periodic Scan + rtw88 Power Save
|
||||
|
||||
## 🛑 Problem
|
||||
|
||||
An OG Steam Deck (LCD model, Realtek **RTL8822CE** on the `rtw88_8822ce` driver) kept "losing" Wi-Fi — it would connect, hold for around a minute, drop, then reconnect a second later, over and over. From the router side the device looked like it was constantly coming and going; from the couch it felt like the network "wouldn't stay connected."
|
||||
|
||||
Crucially, **this was not a router problem.** The AP config was correct, RF was clean (strong signal, zero tx retries / beacon loss), and every other client on the network was rock-solid. The fault was entirely on the Deck.
|
||||
|
||||
## 🔍 Diagnosis
|
||||
|
||||
SteamOS uses **NetworkManager with the `iwd` backend** (not `wpa_supplicant`). That detail is the whole ballgame.
|
||||
|
||||
### Step 1 — Confirm the flap and its cadence
|
||||
|
||||
```bash
|
||||
# how many disconnects this boot?
|
||||
journalctl -b -u NetworkManager --no-pager | grep -c supplicant-disconnect
|
||||
# 50
|
||||
|
||||
# when did they happen?
|
||||
journalctl -b -u NetworkManager --no-pager | grep supplicant-disconnect \
|
||||
| awk '{print $1,$2,$3}' | tail
|
||||
# 10:20:52 · 10:21:54 · 10:22:57 · 10:24:00 · 10:25:03 · 10:26:05 · 10:27:08 ...
|
||||
```
|
||||
|
||||
**~63 seconds between every drop.** A fixed, metronome-like interval is the tell — this is a *timer*, not RF noise. The NetworkManager log shows the pattern plainly:
|
||||
|
||||
```
|
||||
activated -> failed (reason 'supplicant-disconnect')
|
||||
... -> activated # reconnects ~1s later
|
||||
```
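
The cadence can be computed rather than eyeballed. A minimal sketch, assuming bash and awk; `gaps_between` is a hypothetical helper that reads `HH:MM:SS` timestamps, one per line, and prints the gap in seconds between neighbours (it does not handle a midnight wrap):

```bash
#!/usr/bin/env bash
# gaps_between: read HH:MM:SS timestamps on stdin, print seconds between
# consecutive entries. Hypothetical helper, not from the original article.
gaps_between() {
  awk -F: '
    { t = $1 * 3600 + $2 * 60 + $3 }   # seconds since midnight
    NR > 1 { print t - prev }          # gap to the previous timestamp
    { prev = t }
  '
}

# Example (default journalctl format puts the time in field 3):
#   journalctl -b -u NetworkManager --no-pager | grep supplicant-disconnect \
#     | awk '{print $3}' | gaps_between | sort -n | uniq -c
```

A tight cluster of identical gaps points at a timer; RF trouble would be expected to scatter the gaps widely.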
|
||||
|
||||
### Step 2 — Prove the link is healthy *when it's up*
|
||||
|
||||
```bash
|
||||
iw dev wlan0 station dump | grep -iE 'signal|bitrate|failed|retries|beacon loss'
|
||||
# signal: -65 dBm
|
||||
# tx retries: 0
|
||||
# tx failed: 0
|
||||
# beacon loss: 0
|
||||
```
|
||||
|
||||
Strong signal, zero retries, zero beacon loss — the association is clean while it lasts. So the drop is being *commanded*, not caused by a bad radio link.
|
||||
|
||||
### Step 3 — Identify the chip and the backend
|
||||
|
||||
```bash
|
||||
lspci -k | grep -A3 -iE 'network|wireless'
|
||||
# Realtek RTL8822CE ... Kernel driver in use: rtw88_8822ce
|
||||
```
|
||||
|
||||
The `~63s` interval is **IWD's default periodic background scan**. With no `/etc/iwd/main.conf` present, IWD scans on a timer even while connected, and on the `rtw88` driver that scan knocks the current association over — producing the `supplicant-disconnect` every minute.
|
||||
|
||||
A secondary annoyance: `iw dev wlan0 get power_save` reported `on`, which showed up as wildly jittery LAN latency (8–69 ms to the gateway over Wi-Fi, where a healthy 5 GHz link is 2–10 ms).
|
||||
|
||||
## ✅ Fix
|
||||
|
||||
Two independent changes — the first stops the flap, the second smooths latency.
|
||||
|
||||
### 1. Disable IWD's periodic scan (stops the flap)
|
||||
|
||||
```bash
|
||||
sudo mkdir -p /etc/iwd
|
||||
printf '[Scan]\nDisablePeriodicScan=true\n' | sudo tee /etc/iwd/main.conf
|
||||
sudo systemctl restart iwd # briefly drops Wi-Fi; NetworkManager auto-reconnects
|
||||
```
|
||||
|
||||
Trade-off: with periodic scanning off, the Deck roams to a different/stronger AP (e.g. another AiMesh node) more lazily. Fine for a device that mostly sits in one spot.
|
||||
|
||||
### 2. Disable Wi-Fi power save (kills the latency jitter)
|
||||
|
||||
The obvious `nmcli connection modify <name> 802-11-wireless.powersave 2` **does not work under the IWD backend** — NetworkManager doesn't enforce that property when `iwd` is managing the radio. Use a dispatcher script instead, with a retry loop because `rtw88` won't accept the setting in the first instant after association on a cold boot:
|
||||
|
||||
```bash
|
||||
sudo tee /etc/NetworkManager/dispatcher.d/90-wifi-powersave >/dev/null <<'SCRIPT'
|
||||
#!/bin/sh
|
||||
# Disable Wi-Fi power save on the wireless iface (retry: rtw88 may not accept it instantly on boot)
|
||||
case "$2" in
|
||||
up|dhcp4-change|connectivity-change)
|
||||
case "$1" in
|
||||
wl*)
|
||||
for n in 1 2 3 4 5; do
|
||||
/usr/bin/iw dev "$1" set power_save off 2>/dev/null
|
||||
[ "$(/usr/bin/iw dev "$1" get power_save 2>/dev/null)" = "Power save: off" ] && break
|
||||
sleep 1
|
||||
done
|
||||
;;
|
||||
esac
|
||||
;;
|
||||
esac
|
||||
SCRIPT
|
||||
sudo chmod +x /etc/NetworkManager/dispatcher.d/90-wifi-powersave
|
||||
sudo iw dev wlan0 set power_save off # apply now without waiting for a reconnect
|
||||
```
|
||||
|
||||
> 💡 A single-shot dispatcher (no retry) **silently fails on a cold boot**: it fires before the interface is ready, the `iw` call no-ops, and power save stays on. Verify with `iw dev wlan0 get power_save` *after a real reboot*, not just after a service restart.
|
||||
|
||||
## 🔁 Verification
|
||||
|
||||
```bash
|
||||
# was 50/boot, ~once a minute:
|
||||
journalctl -b -u NetworkManager --no-pager | grep -c supplicant-disconnect
|
||||
# 0
|
||||
iw dev wlan0 get power_save
|
||||
# Power save: off
|
||||
```
|
||||
|
||||
A 3-minute continuous `ping` showed **180/180 replies, 0 loss**, latency tightened to **6–11 ms**. Confirmed across a full cold reboot: the Deck auto-rejoins Wi-Fi, both settings persist, and the disconnect counter stays at 0.
|
||||
|
||||
## 📌 Notes
|
||||
|
||||
- **Persistence:** `/etc/iwd/main.conf` and the dispatcher live in `/etc`, which survives reboots. A major SteamOS update *can* reset `/etc` — re-apply if the flapping returns after an OS update.
|
||||
- **Fully reversible:**
|
||||
```bash
|
||||
sudo rm /etc/iwd/main.conf /etc/NetworkManager/dispatcher.d/90-wifi-powersave
|
||||
sudo systemctl restart iwd
|
||||
```
|
||||
- **Interface name** is usually `wlan0`; confirm with `iw dev` if different.
|
||||
- The same IWD-periodic-scan behavior can affect other `iwd`-based distros (Arch, some Fedora spins) on flaky/older Wi-Fi chips — the `DisablePeriodicScan` fix is general, not Deck-specific.
|
||||
|
||||
## 🔗 Related
|
||||
|
||||
- [Wi-Fi Game Streaming Stutter: 160 MHz Channel Width Saturating the 5 GHz Radio](wifi-160mhz-airtime-saturation-game-streaming.md) — the *other* Steam Deck Wi-Fi issue (airtime contention, router-side), distinct from this client-side flap.
|
||||
|
|
@ -0,0 +1,163 @@
|
|||
---
|
||||
title: "MagicDNS Names vs Pinned IPs for Tailscale SSH (After a Fleet Migration)"
|
||||
domain: troubleshooting
|
||||
category: networking
|
||||
tags:
|
||||
- ssh
|
||||
- ssh-config
|
||||
- tailscale
|
||||
- magicdns
|
||||
- known-hosts
|
||||
- host-key
|
||||
- migration
|
||||
- wsl2
|
||||
status: published
|
||||
created: 2026-06-12
|
||||
updated: 2026-06-12
|
||||
---
|
||||
|
||||
# MagicDNS Names vs Pinned IPs for Tailscale SSH (After a Fleet Migration)
|
||||
|
||||
You have SSH aliases for a Tailscale fleet (`alias tttpod='ssh root@100.84.42.102'`).
|
||||
They worked for months. Then you migrate or rebuild some nodes — and now a third of
|
||||
them hang on connect or refuse the host key. This is the failure mode that hardcoded
|
||||
addresses hit, and why the durable answer is **MagicDNS names**, not pinned IPs.
|
||||
|
||||
> This is the sequel to *[SSH Alias Falls Through to MagicDNS — Host-Key Verification
|
||||
> Failure (No `Host` Block)](ssh-missing-host-block-magicdns-host-key-failure.md)*.
|
||||
> That article says **pin the IP** `known_hosts` already trusts — correct when the
|
||||
> node is stable. This one covers what happens when a migration changes the IP *and*
|
||||
> the host key, which is exactly when IP-pinning stops paying off.
|
||||
|
||||
## The Three Failure Modes
|
||||
|
||||
A migration/rebuild can trigger any of these — often several at once across a fleet,
|
||||
which is what makes it confusing:
|
||||
|
||||
### 1. Stale hardcoded IP → connection times out
|
||||
|
||||
The node re-registered on the tailnet with a **new** Tailscale IP, but your alias
|
||||
still names the old one:
|
||||
|
||||
```
|
||||
$ tttpod
|
||||
ssh: connect to host 100.84.42.102 port 22: Operation timed out
|
||||
```
|
||||
|
||||
The old address is dead; SSH waits the full timeout and gives up. Confirm by asking
|
||||
the tailnet for the node's *current* IP by name:
|
||||
|
||||
```
|
||||
$ tailscale status | grep tttpod
|
||||
100.95.137.38 tttpod ... # alias points at 100.84.42.102 — stale
|
||||
```
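
The same comparison can be scripted across every pinned alias. A minimal sketch, assuming bash and that `tailscale status` prints the IP in the first column and the node name in the second, as in the output above; `current_ts_ip` and `alias_is_stale` are hypothetical helper names:

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not from the original article).

# current_ts_ip NAME: print the tailnet IP that `tailscale status` lists for NAME.
current_ts_ip() {
  tailscale status | awk -v n="$1" '$2 == n {print $1; exit}'
}

# alias_is_stale NAME PINNED_IP: succeed if NAME is on the tailnet and its
# current IP differs from the one pinned in the alias.
alias_is_stale() {
  local cur
  cur=$(current_ts_ip "$1")
  [ -n "$cur" ] && [ "$cur" != "$2" ]
}

# Example:
#   alias_is_stale tttpod 100.84.42.102 && echo "tttpod alias is stale"
```

A node that is not listed at all returns non-zero as well, so "not stale" should not be read as "reachable".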
|
||||
|
||||
### 2. Cold-path teardown → first connect after idle times out
|
||||
|
||||
The IP is correct and the node is up (it answers `ping`), but TCP/22 still times out
|
||||
on the *first* try after a quiet period, then works on retry. Tailscale 1.98.x is more
|
||||
aggressive about tearing down **idle direct UDP paths**; the first SSH has to
|
||||
re-establish NAT traversal, which can overrun SSH's default connect timeout.
|
||||
|
||||
```
|
||||
$ tailscale status | grep tttpod
|
||||
100.95.137.38 tttpod ... idle, tx 9360 rx 0 # cold path
|
||||
$ tailscale ping tttpod
|
||||
pong from tttpod (100.95.137.38) via 5.161.118.84:41641 in 48ms # warms instantly
|
||||
```
|
||||
|
||||
### 3. Host-key verification failed → box was rebuilt
|
||||
|
||||
The node was reinstalled, so it presents a **new** SSH host key. Your `known_hosts`
|
||||
still has the old one, so even `StrictHostKeyChecking=accept-new` aborts — `accept-new`
|
||||
only adds *genuinely new* hosts, it refuses a **mismatch**:
|
||||
|
||||
```
|
||||
$ ssh root@tttpod hostname
|
||||
Host key verification failed.
|
||||
```
|
||||
|
||||
## The Fix
|
||||
|
||||
Three changes, applied on every **name-capable** machine (see the WSL2 caveat below):
|
||||
|
||||
### a. Switch aliases from IPs to MagicDNS names
|
||||
|
||||
```bash
|
||||
# before — rots on every migration
|
||||
alias tttpod='ssh root@100.84.42.102'
|
||||
# after — always resolves the node's current IP
|
||||
alias tttpod='ssh root@tttpod'
|
||||
```
|
||||
|
||||
MagicDNS resolves the name to whatever IP the node currently has, so a future
|
||||
migration needs **zero** alias edits. This is the whole point: the tailnet already
|
||||
knows the mapping — stop duplicating (and stale-ing) it in your dotfiles.
|
||||
|
||||
> **Exception:** if there's no tailnet device with that exact name (e.g. an alias
|
||||
> `teelia` pointing at a node actually named `temptedparadise`), MagicDNS can't
|
||||
> resolve it — keep the IP for that one.
|
||||
|
||||
### b. Purge stale host keys, then re-accept
|
||||
|
||||
After a rebuild, clear the old entries under **both** the name and the current IP,
|
||||
then reconnect with `accept-new` to record the fresh key. Over Tailscale's
|
||||
authenticated WireGuard tunnel, a key change from a known rebuild is safe to accept.
|
||||
|
||||
```bash
|
||||
for pair in "tttpod:100.95.137.38" "majortoot:100.64.169.62" "dcaprod:100.98.223.93"; do
|
||||
n="${pair%%:*}"; ip="${pair##*:}"
|
||||
ssh-keygen -R "$n"; ssh-keygen -R "$ip"
|
||||
done
|
||||
# repopulate
|
||||
ssh -o StrictHostKeyChecking=accept-new root@tttpod hostname
|
||||
```
|
||||
|
||||
### c. Add a cold-path cushion to `~/.ssh/config`
|
||||
|
||||
Give the first (cold) connection time to renegotiate instead of erroring:
|
||||
|
||||
```sshconfig
|
||||
Host majorlinux tttpod majortoot majordiscord dcaprod majormail majorhome
|
||||
ConnectTimeout 25
|
||||
ServerAliveInterval 30
|
||||
ServerAliveCountMax 4
|
||||
```
|
||||
|
||||
`ConnectTimeout 25` gives a cold path room to re-establish, so the first connect becomes a ~1–2 s pause instead of a timeout. The keepalives
|
||||
hold the path open during an active session so it doesn't drop mid-command.
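
Since `tailscale ping` warms a cold path almost instantly, an alternative to waiting out the timeout is to warm the path first. A minimal sketch, assuming bash and a local `tailscale` CLI; `warm_then_ssh` is a hypothetical wrapper, and the warm-up is best effort only:

```bash
#!/usr/bin/env bash
# warm_then_ssh HOST [SSH_ARGS...]
# Hypothetical wrapper: send one tailscale ping to re-establish the path,
# ignore its result, then run ssh as usual.
warm_then_ssh() {
  local host=$1
  shift
  tailscale ping -c 1 "$host" >/dev/null 2>&1 || true
  ssh "$host" "$@"
}

# Example: warm_then_ssh tttpod hostname
```

This complements the `ConnectTimeout` cushion rather than replacing it, since the wrapper is of no use on machines without the `tailscale` CLI.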
|
||||
|
||||
## Caveat: WSL2 Can't Use MagicDNS
|
||||
|
||||
A Linux box under **WSL2** typically has **no `tailscale` CLI and no MagicDNS
|
||||
resolver** — it rides the Windows host's networking, and name lookups for tailnet
|
||||
nodes fail:
|
||||
|
||||
```
|
||||
$ getent hosts tttpod # (inside WSL2)
|
||||
# nothing — no resolution
|
||||
$ command -v tailscale # nothing — CLI lives on the Windows side
|
||||
```
|
||||
|
||||
On those machines you **must** keep hardcoded IPs in `~/.ssh/config` (or use `Host`
|
||||
blocks with explicit `HostName <ip>`), and refresh them by hand when a node migrates.
|
||||
There's no self-healing option there — the trade is unavoidable.
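
Refreshing those pinned entries by hand is less error-prone if the `Host` blocks are generated from one list. A minimal sketch, assuming bash; `emit_host_blocks` is a hypothetical helper, the `name:ip` pairs are examples, and `User root` mirrors the `root@` logins used throughout this article:

```bash
#!/usr/bin/env bash
# emit_host_blocks NAME:IP [NAME:IP ...]
# Hypothetical helper: print one ssh_config Host block per pair, with an
# explicit HostName so no name resolution is needed.
emit_host_blocks() {
  local pair
  for pair in "$@"; do
    printf 'Host %s\n    HostName %s\n    User root\n\n' "${pair%%:*}" "${pair##*:}"
  done
}

# Example: review the output, then merge it into ~/.ssh/config by hand.
#   emit_host_blocks tttpod:100.95.137.38 majortoot:100.64.169.62
```

The current IPs still have to come from a machine that can run `tailscale status`; the sketch only removes the hand-editing step.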
|
||||
|
||||
## Diagnosis Checklist
|
||||
|
||||
1. `tailscale status | grep <host>` — does your alias's IP match the **current** one?
|
||||
(Mode 1: stale IP.)
|
||||
2. `ping`/`tailscale ping <host>` works but TCP/22 times out on first try, succeeds on
|
||||
retry? (Mode 2: cold path.)
|
||||
3. `ssh root@<host> true` → `Host key verification failed` (not `Permission denied`)?
|
||||
(Mode 3: rebuilt box, stale `known_hosts`.)
|
||||
4. Is the client a WSL2 box? `getent hosts <name>` returns nothing → MagicDNS
|
||||
unavailable, stay on IPs.
|
||||
|
||||
## Takeaway
|
||||
|
||||
Pin the IP when a host is **stable** and the IP-keyed `known_hosts` entry is your
|
||||
durable trust anchor. Switch to **MagicDNS names** when hosts **move** — migrations,
|
||||
rebuilds, provider changes — so the tailnet's own name→IP mapping does the work your
|
||||
dotfiles kept getting wrong. And on WSL2, you don't get the choice: hardcoded IPs,
|
||||
refreshed by hand.
|
||||
|
|
@ -0,0 +1,115 @@
|
|||
---
|
||||
title: "Wi-Fi Game Streaming Stutter: 160 MHz Channel Width Saturating the 5 GHz Radio"
|
||||
domain: troubleshooting
|
||||
category: networking
|
||||
tags: [wifi, 5ghz, 160mhz, channel-width, dfs, steam-deck, game-streaming, asuswrt, airtime, chanim]
|
||||
status: published
|
||||
created: 2026-06-13
|
||||
updated: 2026-06-13
|
||||
---
|
||||
|
||||
# Wi-Fi Game Streaming Stutter: 160 MHz Channel Width Saturating the 5 GHz Radio
|
||||
|
||||
## 🛑 Problem
|
||||
|
||||
Streaming a game from a desktop (wired) to a Steam Deck over Wi-Fi was stuttering intermittently — fine for a while, then choppy, hard to reproduce on demand. Throughput tests "looked fine," which is exactly why it was hard to pin down: **game streaming fails on jitter and microbursts of contention, not on average bandwidth.**
|
||||
|
||||
The Wi-Fi was an Asus RT-AX82U (AsusWRT, stock firmware) with the 5 GHz radio set to **Auto channel at 160 MHz width**.
|
||||
|
||||
## 🔍 Diagnosis
|
||||
|
||||
The key insight: **signal was excellent, but latency was not.** That combination means the airwaves are busy, not weak.
|
||||
|
||||
### Step 1 — Measure jitter to the gateway from a Wi-Fi client
|
||||
|
||||
```bash
|
||||
ping -c 20 -i 0.2 192.168.50.1
|
||||
# round-trip min/avg/max/stddev = 7.5/27.0/61.0/16.5 ms
|
||||
```
|
||||
|
||||
27 ms **average** and 16 ms of jitter to your *own router* over Wi-Fi is pathological. A healthy 5 GHz link sits at 2–5 ms. Yet the client's signal was **-43 dBm** (excellent) with a clean **-92 dBm** noise floor. Strong signal + high jitter = **airtime contention**, not range or interference at the receiver.
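
For repeated before/after measurements, the average and jitter can be pulled out of the ping summary line automatically. A minimal sketch, assuming bash and awk; `ping_avg_jitter` is a hypothetical helper that works on both the `round-trip min/avg/max/stddev` (BSD/macOS) and `rtt min/avg/max/mdev` (Linux) summary formats:

```bash
#!/usr/bin/env bash
# ping_avg_jitter: read ping output on stdin, print "<avg> <jitter>" in ms.
# Hypothetical helper, not from the original article.
ping_avg_jitter() {
  awk '/min\/avg\/max/ {
    split($0, kv, " = ")      # kv[2] holds the four slash-separated values
    split(kv[2], v, "/")      # v[1]=min v[2]=avg v[3]=max v[4]=stddev + unit
    sub(/ ms.*/, "", v[4])    # drop the trailing unit
    print v[2], v[4]
  }'
}

# Example: ping -c 20 -i 0.2 192.168.50.1 | ping_avg_jitter
```

Logging these two numbers at different times of day makes an intermittent stutter easier to correlate with channel load.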
|
||||
|
||||
### Step 2 — Confirm channel utilization at the router
|
||||
|
||||
AsusWRT/Broadcom exposes per-channel airtime stats via `wl chanim_stats`. SSH into the router and run it against the 5 GHz interface:
|
||||
|
||||
```bash
|
||||
# 5 GHz interface name varies (eth6/eth7); resolve it from nvram
|
||||
IF=$(nvram get wl1_ifname)
|
||||
wl -i "$IF" chanspec # e.g. 36/160 (0xe832) → channel 36, 160 MHz
|
||||
wl -i "$IF" assoclist | wc -l # number of associated 5 GHz clients
|
||||
wl -i "$IF" chanim_stats
|
||||
```
|
||||
|
||||
The smoking gun (`chanim_stats`, version 3):
|
||||
|
||||
```
|
||||
chanspec tx inbss obss nocat nopkt doze txop goodtx badtx glitch ... idle
|
||||
0xe832 92 2 1 2 1 0 4 8 81 2 14
|
||||
```
|
||||
|
||||
Read it as percentages of airtime:
|
||||
|
||||
| Field | Value | Meaning |
|
||||
|-------|-------|---------|
|
||||
| `tx` | **92** | Channel busy transmitting 92% of the time |
|
||||
| `txop` | **4** | Transmit-opportunities available only 4% — the channel is starved |
|
||||
| `idle` | **14** | Channel idle only 14% |
|
||||
| `goodtx` / `badtx` | 8 / **81** | Failed/retried transmits vastly outnumber good ones |
|
||||
|
||||
Seventeen clients were associated to that one 5 GHz radio.
|
||||
|
||||
### Step 3 — Understand why 160 MHz makes it worse
|
||||
|
||||
A 160 MHz channel on the lower 5 GHz band spans channels **36–64**, and the upper half of that block (52–64) is DFS spectrum. To stay clean it needs 160 MHz of *uncontended* spectrum, but in a dense RF environment (≈25 neighbor APs here, several on 5 GHz channels 48/52/100/132/153, some of which overlap or border the block), any one busy neighbor degrades the **entire** wide channel. 160 MHz also makes the radio **DFS-radar exposed**: a single radar detection forces a channel switch with a 1 s+ blackout, which is a stream-killer.
|
||||
|
||||
So 160 MHz buys a higher *peak* PHY rate that game streaming doesn't need, at the cost of the *stability* it absolutely does.
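
To see how much of a bonded channel sits in DFS spectrum, the block layout can be enumerated. This is a small sketch under stated assumptions: the block start channels are the standard 5 GHz bonding groups, and the DFS range (52–144) reflects FCC rules; other regulatory domains differ. The function names are made up for illustration.

```python
# Lowest 20 MHz channel of each bonded block in the 5 GHz band.
BLOCK_STARTS = {80: (36, 52, 100, 116, 132, 149), 160: (36, 100)}


def channels_in_block(primary: int, width_mhz: int) -> list:
    """Return every 20 MHz channel covered by the block containing `primary`."""
    per_block = width_mhz // 20
    for start in BLOCK_STARTS[width_mhz]:
        block = [start + 4 * i for i in range(per_block)]
        if primary in block:
            return block
    raise ValueError(f"channel {primary} is not in any {width_mhz} MHz block")


def dfs_channels(block: list) -> list:
    """Channels in the block that require DFS (assumption: FCC domain, 52-144)."""
    return [ch for ch in block if 52 <= ch <= 144]


wide = channels_in_block(36, 160)
narrow = channels_in_block(36, 80)
print(wide, dfs_channels(wide))
print(narrow, dfs_channels(narrow))
```

At 160 MHz half of the block (52, 56, 60, 64) is radar-exposed; at 80 MHz on channel 36 none of it is.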
|
||||
|
||||
## ✅ Fix
|
||||
|
||||
Drop the 5 GHz radio to **80 MHz** and pin it to a **non-DFS** channel (UNII-1: 36/40/44/48 — no radar, no DFS blackouts).
|
||||
|
||||
GUI: **Wireless → 5 GHz → Channel Bandwidth = 80 MHz**, **Control Channel = 36**, turn off "Auto."
|
||||
|
||||
Or over SSH (`nvram` + `restart_wireless`):
|
||||
|
||||
```bash
|
||||
nvram set wl1_bw_cap=7 # cap at 80 MHz (bitmask: 1=20, 3=40, 7=80, 15=160)
|
||||
nvram set wl1_chanspec=36/80 # channel 36 @ 80 MHz
|
||||
nvram set wl1_channel=36
|
||||
nvram commit
|
||||
service restart_wireless # ~15-20s radio bounce, drops all clients briefly
|
||||
```
|
||||
|
||||
> [!warning] `restart_wireless` drops every Wi-Fi client for 15–20 seconds. `nvram commit` runs *before* the restart, so the config persists even if your own SSH/Wi-Fi session drops.
|
||||
|
||||
## 📊 Result
|
||||
|
||||
Verified from both the router and a client after the radio came back:
|
||||
|
||||
| Metric | Before (36/160) | After (36/80) |
|
||||
|--------|-----------------|---------------|
|
||||
| Channel tx-busy | 92% | **9%** |
|
||||
| Transmit-opportunity available | 4% | **79%** |
|
||||
| Channel idle | 14% | **87%** |
|
||||
| Failed tx (`badtx` vs `goodtx`) | 81 vs 8 | **1 vs 3** |
|
||||
| Gateway ping (avg / floor) | 27 ms / 7.5 ms | **9 ms / 2.7 ms** |
|
||||
| PHY peak rate | 1729 Mbps | 1200 Mbps |
|
||||
|
||||
The PHY peak dropped (narrower channel) but that is irrelevant — Steam Remote Play wants ~30–50 Mbps with *consistent* airtime, which it now has. The stutter resolved.
|
||||
|
||||
## 🧠 Takeaways
|
||||
|
||||
- **Diagnose Wi-Fi streaming problems with jitter, not throughput.** A speed test can pass while a stream stutters. Ping your gateway and watch the stddev.
|
||||
- **Strong signal + high latency = airtime congestion.** Don't chase signal strength when RSSI is already good; look at channel utilization (`chanim_stats`).
|
||||
- **160 MHz is a trap in a dense RF environment.** Use 80 MHz for reliability; reserve 160 MHz for clean spectrum and short range.
|
||||
- **Prefer non-DFS channels (36–48) for anything latency-sensitive** — DFS radar events cause silent multi-second dropouts.
|
||||
- **Wire the *source*.** The streaming PC should be on Ethernet so the video only crosses the air once (AP → handheld). The handheld has to be Wi-Fi; the desktop doesn't.
|
||||
- **Isolate IoT on 2.4 GHz** (separate SSID) so it never competes for 5 GHz airtime with latency-sensitive clients.
|
||||
|
||||
## Related
|
||||
|
||||
- [Steam Deck Wi-Fi Flapping: IWD Periodic Scan + rtw88 Power Save](steam-deck-wifi-flapping-iwd-periodic-scan-rtw88.md) — the *other* Steam Deck Wi-Fi issue (client-side flap), distinct from this router-side airtime problem.
|
||||
- [Network Overview](../../02-selfhosting/dns-networking/network-overview.md)
|
||||
- [Wake-on-LAN via Router SSH](../../02-selfhosting/dns-networking/wake-on-lan-router-ssh.md)
|
||||
- [Pi-hole v6 Group Management — Per-Client DNS Rules](../../02-selfhosting/dns-networking/pihole-v6-group-management.md)
|
||||
|
|
@ -11,7 +11,7 @@ tags:
|
|||
- powershell
|
||||
status: published
|
||||
created: 2026-04-03
|
||||
updated: 2026-04-22T09:20
|
||||
updated: 2026-04-30T05:21
|
||||
---
|
||||
|
||||
# Windows OpenSSH: WSL as Default Shell Breaks Remote Commands
|
||||
|
|
|
|||
|
|
@ -10,7 +10,7 @@ tags:
|
|||
- majorrig
|
||||
status: published
|
||||
created: 2026-04-02
|
||||
updated: 2026-04-22T09:20
|
||||
updated: 2026-04-30T05:21
|
||||
---
|
||||
# Windows OpenSSH Server (sshd) Stops After Reboot
|
||||
|
||||
|
|
|
|||
|
|
@ -0,0 +1,129 @@
|
|||
---
|
||||
title: "OBS Studio — \"Error opening file: (null)\" After Windows Profile Rename"
|
||||
domain: troubleshooting
|
||||
category: streaming
|
||||
tags: [obs, streaming, windows, lua, profile-migration]
|
||||
status: published
|
||||
created: 2026-05-14
|
||||
updated: 2026-05-14
|
||||
---
|
||||
|
||||
# OBS Studio — "Error opening file: (null)" After Windows Profile Rename
|
||||
|
||||
## Symptom
|
||||
|
||||
Loading a scene collection in OBS Studio triggers a popup like:
|
||||
|
||||
```
|
||||
[<ScriptName>.lua] Error opening file: (null)
|
||||
```
|
||||
|
||||
The `(null)` is the giveaway: OBS resolved the registered script path to nothing — the file doesn't exist where the scene collection says it does. Most commonly this happens after a Windows profile was renamed or migrated and `C:\Users\<old>\...` paths were not updated.
|
||||
|
||||
## Why it happens
|
||||
|
||||
OBS stores per-scene-collection Lua/Python script registrations inside the scene collection JSON at:
|
||||
|
||||
```
|
||||
%APPDATA%\obs-studio\basic\scenes\<Collection>.json
|
||||
```
|
||||
|
||||
Each entry under `modules.scripts-tool[]` is an absolute Windows path. Renaming the Windows profile does not rewrite these — the JSON keeps pointing at the old `C:\Users\<old>\...` location, and OBS surfaces the resolution failure as a `(null)` popup on collection load.
|
||||
|
||||
## Diagnose
|
||||
|
||||
From WSL (or any shell with access to `%APPDATA%`):
|
||||
|
||||
```bash
|
||||
OBS_DIR="/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio"
|
||||
|
||||
# 1. List scene collections
|
||||
ls "$OBS_DIR/basic/scenes/"
|
||||
|
||||
# 2. Find collections referencing the missing script
|
||||
grep -l -i "<script-name-substring>" "$OBS_DIR/basic/scenes/"*.json
|
||||
|
||||
# 3. Dump the scripts-tool paths from each suspect collection
|
||||
python3 -c "
import json, sys
d = json.load(open(sys.argv[1]))
for s in d.get('modules', {}).get('scripts-tool', []):
    print(s.get('path'))
" "$OBS_DIR/basic/scenes/<Collection>.json"
|
||||
```
|
||||
|
||||
If a printed path contains `C:/Users/<old-username>/...` and the file doesn't exist on disk, you've found it.
|
||||
|
||||
## Fix
|
||||
|
||||
> [!warning] Close OBS first
|
||||
> OBS rewrites the scene collection JSON when it exits. Any edit made while OBS is running will be overwritten. Confirm with `tasklist.exe | grep obs64` (WSL) or Task Manager.
|
||||
|
||||
### 1. Make the missing script reachable
|
||||
|
||||
Either:
|
||||
|
||||
- **Re-extract / restore the script** to a path under the new profile (recommended — gives you a clean canonical home), or
|
||||
- **Leave it in the rescue/migration folder** and point OBS there (fragile if the rescue folder is later deleted).
|
||||
|
||||
### 2. Back up the scene collection JSON
|
||||
|
||||
```bash
|
||||
SCENES="/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio/basic/scenes"
|
||||
STAMP="$(date +%Y%m%d-%H%M%S)"
|
||||
cp -p "$SCENES/<Collection>.json" "$SCENES/<Collection>.json.$STAMP.bak"
|
||||
```
|
||||
|
||||
### 3. Rewrite the paths atomically
|
||||
|
||||
Edit the JSON in place by parsing it, replacing the matched path strings, and writing through a temp file (so a crash mid-write can't corrupt the collection):
|
||||
|
||||
```bash
|
||||
python3 <<'PY'
import json, os

scenes = "/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio/basic/scenes"
# Old path -> new path; keep OBS's forward-slash style.
mapping = {
    "C:/Users/<old>/Pictures/.../<script>.lua":
        "C:/Users/<new>/Pictures/.../<script>.lua",
}
for fn in ("<Collection>.json",):
    path = os.path.join(scenes, fn)
    with open(path, encoding="utf-8") as fh:
        d = json.load(fh)
    for entry in d.get("modules", {}).get("scripts-tool", []):
        if entry.get("path") in mapping:
            entry["path"] = mapping[entry["path"]]
    # Write to a temp file and close it before renaming over the original.
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(d, fh, indent=4)
    os.replace(tmp, path)
PY
|
||||
```
|
||||
|
||||
OBS scene JSONs use forward slashes in Windows paths — preserve that style.
|
||||
|
||||
### 4. Verify
|
||||
|
||||
Re-run the diagnostic Python snippet and confirm every printed path resolves to a real file (translate `C:/` → `/mnt/c/` from WSL).
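
The path translation and existence check can be sketched as follows. It assumes the default WSL mount layout (`/mnt/<drive-letter>/`); `win_to_wsl` and `missing_scripts` are hypothetical helper names, not part of OBS or WSL.

```python
import os
import re


def win_to_wsl(path: str) -> str:
    """Translate 'C:/Users/x' (either slash style) to '/mnt/c/Users/x'."""
    match = re.match(r"^([A-Za-z]):[\\/](.*)$", path)
    if match is None:
        raise ValueError(f"not an absolute Windows path: {path!r}")
    drive, rest = match.groups()
    return f"/mnt/{drive.lower()}/" + rest.replace("\\", "/")


def missing_scripts(paths):
    """Return the registered paths that do not exist on disk (run from WSL)."""
    return [p for p in paths if not os.path.exists(win_to_wsl(p))]


print(win_to_wsl("C:/Users/me/Pictures/overlay.lua"))
```

Feed `missing_scripts` the paths printed by the diagnostic snippet; an empty list means every registration resolves.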
|
||||
|
||||
### 5. Reopen OBS
|
||||
|
||||
Load the scene collection. The popup should be gone.
|
||||
|
||||
## Why not just remove the script?
|
||||
|
||||
If the script is part of a third-party overlay pack (Twitch Pimpage, OWN3D, etc.), removing the registration also removes the overlay's source presets — fixing the path keeps the imported scenes intact. If you don't actually use the overlay anymore, removing the `scripts-tool` entry is fine; OBS will silently drop the broken reference on next save.
|
||||
|
||||
## Generalization
|
||||
|
||||
This same pattern applies to any OBS asset path stored in a scene collection or profile:
|
||||
|
||||
- Browser source local files
|
||||
- Image / media source files
|
||||
- Lua / Python script paths
|
||||
- VST plugin paths
|
||||
|
||||
All of them are absolute, all of them survive a Windows profile rename in stale form, and all of them can be batch-rewritten with the same JSON-edit pattern above. Search for the old username substring across `%APPDATA%\obs-studio\` to catch them all in one pass.
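
A one-pass search could look like the sketch below: it walks a parsed JSON document and reports every string value that still mentions the old profile. The sample document is invented for illustration; only `modules.scripts-tool[].path` is taken from this article, and the `sources` entry is a hypothetical stand-in for other asset paths.

```python
def find_stale_strings(node, needle, trail="$"):
    """Yield (json_path, value) for every string value containing `needle`."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from find_stale_strings(child, needle, f"{trail}.{key}")
    elif isinstance(node, list):
        for i, child in enumerate(node):
            yield from find_stale_strings(child, needle, f"{trail}[{i}]")
    elif isinstance(node, str) and needle.lower() in node.lower():
        yield trail, node


# Invented sample; load the real collection with json.load() instead.
sample = {
    "modules": {"scripts-tool": [{"path": "C:/Users/olduser/Pictures/overlay.lua"}]},
    "sources": [{"settings": {"file": "C:/Users/newuser/Pictures/logo.png"}}],
}
hits = list(find_stale_strings(sample, "/Users/olduser/"))
for json_path, value in hits:
    print(json_path, "->", value)
```

Matching is case-insensitive because Windows paths are; the reported JSON path shows which key to rewrite.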
|
||||
|
||||
## Related
|
||||
|
||||
- [[../../MajorInfrastructure/Devices/MajorRig|MajorRig device note]] — Incident Log 2026-05-14 (TTT/MLS scene popups) and 2026-05-07 (`majli` profile retirement that left these references stranded)
|
||||
- [[../04-streaming/obs/obs-studio-setup-encoding|OBS Studio Setup and Encoding Settings]]
|
||||
0
05-troubleshooting/performance/.keep
Normal file
225
05-troubleshooting/php-84-vendor-implicit-nullable-patch.md
Normal file
|
|
@ -0,0 +1,225 @@
|
|||
---
|
||||
title: Patching PHP 8.4 Implicit-Nullable Deprecations in Vendor Packages
|
||||
domain: troubleshooting
|
||||
category: troubleshooting
|
||||
tags:
|
||||
- php
|
||||
- php-8.4
|
||||
- codeigniter
|
||||
- castopod
|
||||
- composer
|
||||
- vendor
|
||||
- deprecation
|
||||
- troubleshooting
|
||||
status: published
|
||||
created: 2026-05-10
|
||||
updated: 2026-05-10
|
||||
---
|
||||
|
||||
# Patching PHP 8.4 Implicit-Nullable Deprecations in Vendor Packages
|
||||
|
||||
> **TL;DR** — PHP 8.4 deprecated implicit-nullable parameters (`function f(int $x = null)` without `?int`). Old vendor packages that haven't been updated will spam `E_DEPRECATED` warnings on every load. CodeIgniter (and similar frameworks) wrap each warning in a 23-frame stack trace, which on a per-minute cron multiplies into hundreds of MB/day of logs and a noticeable CPU floor on small VPS boxes. The fix is a four-line `sed` patch — but be very careful: a naive sed pattern can substring-match an *already-nullable* parameter and produce illegal `??type` syntax.
|
||||
|
||||
---
|
||||
|
||||
## Symptom
|
||||
|
||||
You're running a CodeIgniter 4 app (Castopod, for example) on PHP 8.4 with an older vendored library that hasn't been updated to declare nullable types properly. The combination produces:
|
||||
|
||||
- **Sustained CPU floor** on a 1 vCPU box (typically 15–25% baseline) when the framework's spark/cron scheduler runs every 60 seconds
|
||||
- **Massive daily log volume** in `writable/logs/log-YYYY-MM-DD.log` — 50–100 MB per day is common
|
||||
- Each WARNING line is followed by a **23-frame stack trace** through Composer's autoloader, the framework's autoloader, and the application's command/model entry point
|
||||
- The actual scheduler task may report `Failed:` even though it logs no obvious error — the deprecation is fatal in some PHP/CodeIgniter combinations
|
||||
|
||||
The deprecation warnings look like:
|
||||
|
||||
```
|
||||
WARNING - 2026-05-10 16:33:01 --> [DEPRECATED]
|
||||
Vendor\Package\SomeModel::doFindAll(): Implicitly marking parameter $limit as nullable is deprecated,
|
||||
the explicit nullable type must be used instead in
|
||||
VENDORPATH/vendor/package/src/SomeModel.php on line 287.
|
||||
```
|
||||
|
||||
## Why this matters more than a typical deprecation
|
||||
|
||||
Three multipliers turn "minor PHP deprecation" into "the box is on fire":
|
||||
|
||||
1. **Per-minute cron** — `php spark tasks:run` runs every 60 seconds. Each run loads the framework, hits the deprecation, dumps a stack trace.
|
||||
2. **CodeIgniter's error handler is verbose** — it catches `E_DEPRECATED` and writes a full backtrace to disk. There's no debug-vs-production split here.
|
||||
3. **Small VPS boxes have a thin idle margin** — on a 1 vCPU droplet, sustained 22% from PHP startup overhead + log writes is enough to trip a default `>85% / 5min` DigitalOcean alert during traffic spikes.
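
The multiplication is easy to underestimate, so here is the arithmetic as a sketch. The inputs are hypothetical round numbers (10 warnings per run, about 5 KB per warning including its stack trace), not measurements from the affected host; substitute your own.

```python
def daily_log_mb(warnings_per_run: int, bytes_per_warning: int,
                 runs_per_day: int = 24 * 60) -> float:
    """Log volume per day in MB for a task that runs once a minute by default."""
    return warnings_per_run * bytes_per_warning * runs_per_day / 1_000_000


estimate = daily_log_mb(warnings_per_run=10, bytes_per_warning=5_000)
print(estimate)
```

Under those assumptions a per-minute task alone writes tens of megabytes a day, before any web traffic triggers the same warnings.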
|
||||
|
||||
## Diagnostic chain
|
||||
|
||||
### 1. Confirm the symptom is deprecation cascade, not autoload failure
|
||||
|
||||
The stack trace makes this look like an autoload error — it isn't. Check the WARNING line itself:
|
||||
|
||||
- **`[DEPRECATED] ... Implicitly marking parameter ... as nullable`** → vendor library + PHP 8.4 mismatch (this article applies)
|
||||
- **`Class 'X' not found`** → actual autoload problem (different fix)
|
||||
|
||||
### 2. Identify the PHP version
|
||||
|
||||
```bash
|
||||
php -v
|
||||
```
|
||||
|
||||
If it's 8.4+, implicit-nullable is now `E_DEPRECATED`. (PHP 8.4.0 was released 2024-11-21; many distros bumped during 2025–26.)
|
||||
|
||||
### 3. List the offending lines
|
||||
|
||||
The log itself names them. Grep for the unique vendor-path pattern:
|
||||
|
||||
```bash
|
||||
grep 'DEPRECATED' /var/www/<app>/writable/logs/log-$(date +%Y-%m-%d).log \
|
||||
| awk -F'on line ' '{print $2}' | sort -u
|
||||
```
|
||||
|
||||
You'll typically see three to six line numbers in one file — each parameter that needs `?` prefixing.
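
If you also want the file and parameter names, not just line numbers, a regular expression over the log works. This sketch assumes the message text matches the sample shown earlier; whether your log wraps the message across lines or keeps it on one line is left open, and the pattern tolerates both. `offending_params` is a hypothetical helper name.

```python
import re

DEPRECATION = re.compile(
    r"(?P<function>[\w\\]+::\w+)\(\): Implicitly marking parameter "
    r"(?P<param>\$\w+) as nullable is deprecated"
    r".*?\bin\s+(?P<file>\S+)\s+on\s+line\s+(?P<line>\d+)",
    re.DOTALL,
)


def offending_params(log_text: str) -> list:
    """Return sorted unique (file, line, parameter) tuples named by the warnings."""
    return sorted(
        {(m["file"], int(m["line"]), m["param"]) for m in DEPRECATION.finditer(log_text)}
    )


SAMPLE = r"""WARNING - 2026-05-10 16:33:01 --> [DEPRECATED]
Vendor\Package\SomeModel::doFindAll(): Implicitly marking parameter $limit as nullable is deprecated,
the explicit nullable type must be used instead in
VENDORPATH/vendor/package/src/SomeModel.php on line 287.
"""
found = offending_params(SAMPLE)
print(found)
```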
|
||||
|
||||
### 4. Inspect each line before patching
|
||||
|
||||
```bash
|
||||
F=/var/www/<app>/vendor/<package>/src/<File>.php
|
||||
sed -n '287p;520p' "$F" # Show only the lines named by the warnings
|
||||
```
|
||||
|
||||
Look for **already-prefixed** parameters in the same function or nearby — if `?type $foo = null` already exists in the file, your sed pattern must not match it.
|
||||
|
||||
## The fix — anchored sed
|
||||
|
||||
**Step 1: Backup.**
|
||||
|
||||
```bash
|
||||
F=/var/www/<app>/vendor/<package>/src/<File>.php
|
||||
sudo cp -p "$F" "$F.bak.$(date +%Y%m%d-%H%M%S)"
|
||||
```
|
||||
|
||||
**Step 2: Apply patches with anchors.** Don't use bare patterns like `int \$limit = null` — they'll substring-match against `?int \$limit = null` (an already-nullable parameter elsewhere in the file) and produce `??int $limit = null`, which PHP rejects as a `ParseError: unexpected token "??"`.
|
||||
|
||||
Anchor on the function signature:
|
||||
|
||||
```bash
|
||||
sudo sed -i \
|
||||
-e 's|^\(\s*protected function doFindAll(\)int \$limit = null|\1?int $limit = null|' \
|
||||
-e 's|^\(\s*protected function doUpdateBatch(\)array \$set = null, string \$index = null|\1?array $set = null, ?string $index = null|' \
|
||||
"$F"
|
||||
```
|
||||
|
||||
For constructors with reference operators (`&$db`), include the `&` in the anchor:
|
||||
|
||||
```bash
|
||||
sudo sed -i 's|\(__construct(\)ConnectionInterface &\$db = null|\1?ConnectionInterface \&$db = null|' "$F"
|
||||
```
|
||||
|
||||
**Step 3: Lint immediately.**
|
||||
|
||||
```bash
|
||||
sudo php -l "$F"
|
||||
# Must print: No syntax errors detected in <path>
|
||||
```
|
||||
|
||||
If lint fails, restore from the backup and try a tighter anchor — don't chain another sed onto a broken file.
|
||||
|
||||
**Step 4: Verify the runtime.**
|
||||
|
||||
```bash
|
||||
sudo -u www-data php /var/www/<app>/spark tasks:run | grep -E '(Executed|Failed)'
|
||||
```
|
||||
|
||||
The previously failing task should now show `Executed`.
|
||||
|
||||
**Step 5: Confirm the log bleed stops.** Wait 60s, then:
|
||||
|
||||
```bash
|
||||
LOG=/var/www/<app>/writable/logs/log-$(date +%Y-%m-%d).log
|
||||
SINCE=$(date -d '60 seconds ago' '+%H:%M:%S')
|
||||
awk -v t="$SINCE" '/DEPRECATED/ && $4>=t' "$LOG" | wc -l
|
||||
# Expect: 0
|
||||
```
|
||||
|
||||
## The substring-match gotcha (the one that bit me)
|
||||
|
||||
This is the failure mode that turns a 30-second fix into a 30-minute incident:
|
||||
|
||||
```bash
# DANGEROUS
sed -i 's|array \$set = null|?array $set = null|' "$F"
```

That pattern matches both:

- `protected function doUpdateBatch(array $set = null, …)`: the line you want to fix
- `protected function doInsertBatch(?array $set = null, ?bool $escape = null, int $batchSize = 100)`: somewhere else in the file, where `array $set = null` appears as a substring **inside** an already-nullable `?array $set = null` that you don't want to touch

After sed, the second line becomes `??array $set = null`, which is illegal in PHP. The first time the autoloader tries to load the file, you get:
|
||||
|
||||
```
|
||||
ParseError: syntax error, unexpected token "??", expecting variable
|
||||
at vendor/.../src/<File>.php:426
|
||||
```
|
||||
|
||||
Recovery is restore-from-backup, then re-apply with anchored patterns. **Always lint before reload, before flush, before next anything.**
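
An alternative to anchoring on the function name is to refuse any match that is already preceded by `?`. The sketch below shows that idea in Python with a negative lookbehind; it is an illustration of the technique, not the method used in the original incident, and `make_nullable` is a hypothetical helper. It deliberately skips namespaced and union types and assumes a lowercase `null` default.

```python
import re


def make_nullable(source: str, type_name: str, param: str) -> str:
    """Prefix `type_name $param = null` with '?' unless it is already nullable."""
    pattern = re.compile(
        r"(?<![?\w\\|])"              # not preceded by '?', a word char, '\' or '|'
        + re.escape(type_name)
        + r"(\s+&?)"                  # whitespace, optionally a by-reference '&'
        + re.escape("$" + param)
        + r"(\s*=\s*null\b)"
    )
    return pattern.sub(
        lambda m: "?" + type_name + m.group(1) + "$" + param + m.group(2), source
    )


already = "protected function doInsertBatch(?array $set = null, ?bool $escape = null, int $batchSize = 100)"
broken = "protected function doUpdateBatch(array $set = null, string $index = null)"
fixed = make_nullable(broken, "array", "set")
print(fixed)
print(make_nullable(already, "array", "set") == already)
```

Because the lookbehind rejects `?array`, running the function twice is a no-op, which is the property the bare sed pattern lacks. Linting with `php -l` afterwards is still required.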
|
||||
|
||||
## Are reference parameters tricky? Yes.
|
||||
|
||||
`&$db` (pass-by-reference) needs the ampersand preserved when adding the `?` prefix:
|
||||
|
||||
| Before | After |
|
||||
|---|---|
|
||||
| `ConnectionInterface &$db = null` | `?ConnectionInterface &$db = null` |
|
||||
| `array &$rows = null` | `?array &$rows = null` |
|
||||
|
||||
In sed, escape the ampersand in the replacement (`\&`) because unescaped `&` in the replacement means "the matched text." Easy way to test the right escaping: run sed with `--debug` or do a dry-run with `-n` and `p`.
|
||||
|
||||
## Bonus: hunt for stray debug prints while you're in there
|
||||
|
||||
When you're already grepping the application source for one issue, scan for sloppy `log_message('critical', ...)` calls left in by upstream developers. Real-world finds include:
|
||||
|
||||
- `log_message('critical', 'ITS HEEEEEEEEEEEERE');` — left in Castopod's `modules/Fediverse/Filters/FediverseFilter.php` line 62, firing on every fediverse request, contributing 195 CRITICAL entries to one day's log
|
||||
- `log_message('critical', 'TODO');`
|
||||
- `log_message('critical', 'wtf');`
|
||||
|
||||
```bash
|
||||
grep -rE "log_message\(['\"]critical['\"]" /var/www/<app>/modules/ /var/www/<app>/app/ \
|
||||
| grep -v -E 'TODO|FIXME' \
|
||||
| head -10
|
||||
```
|
||||
|
||||
These are usually safe to remove (or downgrade to `debug` level) — they don't represent real failure conditions, just developer artifacts.
|
||||
|
||||
## Why not just upgrade the vendor package?
|
||||
|
||||
`composer update <package>` is the proper fix. But:
|
||||
|
||||
- Many PHP applications (Castopod especially) ship pre-built `vendor/` and don't expect composer to be installed at runtime
|
||||
- A major version bump (`v1.x → v2.x`) implies API changes that the application may not handle
|
||||
- `composer update` may pull in cascading dependency updates you don't want
|
||||
|
||||
Hot-patching is the right answer when:
|
||||
|
||||
- The application doesn't ship with `composer.json` referencing the package directly
|
||||
- The fix is purely syntactic (parameter type declarations)
|
||||
- A future application release will likely include the upgraded vendor anyway
|
||||
|
||||
Just **document the patch** and add a follow-up task to re-apply (or skip) after the next application upgrade. Without that note, the next time the box is rebuilt or upgraded, you'll spend another evening chasing the same stack trace.
|
||||
|
||||
## Specific examples observed in the MajorsHouse fleet
|
||||
|
||||
### Castopod 1.20+ on PHP 8.4
|
||||
|
||||
`vendor/michalsn/codeigniter4-uuid/src/UuidModel.php` v1.3.1 — four nullable-prefix corrections needed:
|
||||
|
||||
| Line | Original | Patched |
|
||||
|---|---|---|
|
||||
| 54 | `__construct(ConnectionInterface &$db = null, ValidationInterface $validation = null)` | `__construct(?ConnectionInterface &$db = null, ?ValidationInterface $validation = null)` |
|
||||
| 287 | `doFindAll(int $limit = null, int $offset = 0)` | `doFindAll(?int $limit = null, int $offset = 0)` |
|
||||
| 520 | `doUpdateBatch(array $set = null, string $index = null, …)` | `doUpdateBatch(?array $set = null, ?string $index = null, …)` |
|
||||
|
||||
Line 426 (`doInsertBatch(?array $set = null, ?bool $escape = null, …)`) was already correct — the substring-match gotcha above was triggered by it.
|
||||
|
||||
Upstream `michalsn/codeigniter4-uuid` v2.0.0 (released 2024) declares all parameters with explicit `?type` syntax and has no deprecation warnings. Castopod hadn't upgraded the dependency as of Castopod 1.20.
|
||||
|
||||
## See also
|
||||
|
||||
- [Castopod Posts Don't Appear on Mastodon — Diagnosing the Federation Path](security/castopod-broadcast-not-on-mastodon.md) — tttpod-specific diagnostic
|
||||
- [PHP RFC: Deprecate implicitly nullable parameter types](https://wiki.php.net/rfc/deprecate-implicitly-nullable-types) — the canonical PHP 8.4 reference
|
||||
|
|
@ -0,0 +1,154 @@
|
|||
---
|
||||
title: "Castopod Posts Don't Appear on Mastodon — Diagnosing the Federation Path"
|
||||
domain: troubleshooting
|
||||
category: security
|
||||
tags: [castopod, mastodon, fediverse, activitypub, federation, notifications]
|
||||
status: published
|
||||
created: 2026-05-10
|
||||
updated: 2026-05-10
|
||||
---
|
||||
|
||||
# Castopod Posts Don't Appear on Mastodon — Diagnosing the Federation Path
|
||||
|
||||
## 🛑 Problem
|
||||
|
||||
You publish a podcast episode (or a standalone post) on Castopod. The Castopod admin shows it went out fine. But on the Mastodon account that you *expected* to see it from — your own personal account, an account that follows your podcast, a colleague's — the post never shows up. Or it shows up in the home timeline but the notification bell never rings.
|
||||
|
||||
Three different failure modes hide behind "I didn't get the post." This article walks the diagnostic chain that distinguishes them.
|
||||
|
||||
---
|
||||
|
||||
## 🔬 The four checks, in order
|
||||
|
||||
Run these in sequence. The first one that fails tells you what's actually wrong.
|
||||
|
||||
### Check 1 — Did Castopod create the post?
|
||||
|
||||
On the Castopod host:
|
||||
|
||||
```sh
|
||||
mysql -u $CP_DB_USER -p$CP_DB_PASS $CP_DB_NAME --binary-as-hex -e "
|
||||
SELECT HEX(id), actor_id, LEFT(message,80), episode_id, published_at, created_at
|
||||
FROM cp_fediverse_posts
|
||||
ORDER BY created_at DESC LIMIT 5
|
||||
"
|
||||
```
|
||||
|
||||
If your post isn't here at all, Castopod didn't generate it. That's a Castopod-side bug — check `writable/logs/log-<date>.log`, verify the per-minute task scheduler is firing (`php spark tasks:list` should show `Last Run` for `fediverse-broadcast`), and confirm the cron exists:
|
||||
|
||||
```sh
|
||||
sudo crontab -u www-data -l | grep tasks:run
|
||||
# expect: * * * * * php /var/www/html/castopod/spark tasks:run >> /dev/null 2>&1
|
||||
```
|
||||
|
||||
### Check 2 — Did Castopod queue and deliver the activity?
|
||||
|
||||
```sh
|
||||
mysql -u $CP_DB_USER -p$CP_DB_PASS $CP_DB_NAME --binary-as-hex -e "
|
||||
SELECT HEX(id), actor_id, type, status, scheduled_at, created_at
|
||||
FROM cp_fediverse_activities
|
||||
WHERE type='Create'
|
||||
ORDER BY created_at DESC LIMIT 10
|
||||
"
|
||||
```
|
||||
|
||||
The `status` column tells you everything:
|
||||
|
||||
| Status | Meaning |
|
||||
|---|---|
|
||||
| `queued` | Sitting in the queue, broadcast task hasn't run yet (or is bogged down) |
|
||||
| `processing` | In-flight |
|
||||
| `delivered` | All follower inboxes returned 2xx |
|
||||
| `failed` | One or more inbox POSTs returned non-2xx, gave up after retries |
|
||||
|
||||
If `status='delivered'`, Castopod has done its job — and yet someone says they didn't see the post. Move to Check 3.
|
||||
|
||||
### Check 3 — Are they actually a follower?
|
||||
|
||||
The single most common cause of "I didn't see it." Federation only delivers `Create` activities to **followers** (and to anyone explicitly mentioned). Interacting with a post (favourite, boost) does NOT establish a follow relationship.
|
||||
|
||||
On the Castopod host:
|
||||
|
||||
```sh
|
||||
mysql -u $CP_DB_USER -p$CP_DB_PASS $CP_DB_NAME -e "
|
||||
SELECT a.username, a.domain, f.created_at
|
||||
FROM cp_fediverse_follows f
|
||||
JOIN cp_fediverse_actors a ON a.id = f.actor_id
|
||||
ORDER BY f.created_at DESC
|
||||
"
|
||||
```
|
||||
|
||||
`cp_fediverse_follows.actor_id` is the **follower** (remote actor); `target_actor_id` is your local podcast actor. If the user's `username@domain` isn't in this list, they don't follow your podcast, and the Create activity was never sent to their inbox.
|
||||
|
||||
Cross-check from the Mastodon side (if you control both):
|
||||
|
||||
```sh
|
||||
sudo -u postgres psql mastodon_production -t -A -c "
|
||||
SELECT a.username, a.domain
|
||||
FROM follows f
|
||||
JOIN accounts a ON a.id = f.target_account_id
|
||||
WHERE f.account_id = <mastodon-account-id>
|
||||
AND a.domain = '<your-castopod-domain>'
|
||||
"
|
||||
```
|
||||
|
||||
Empty result on both sides = they're not following. **Resolution: have them search `@yourpodcast@yourdomain.tld` in their Mastodon and click Follow.**
|
||||
|
||||
A subtler corner of this check: `accounts WHERE domain='<your-castopod-domain>'` returning 0 rows on the Mastodon side means Mastodon has never even webfingered your podcast actor. The user may have *thought* they followed at some point, but it never went through (e.g., they typed the handle wrong, or the follow request errored).
|
||||
|
||||
### Check 4 — Is "didn't see it" a notification problem, not a delivery problem?
|
||||
|
||||
Even after a successful follow, the post lands in the **home timeline** by default. Mastodon **notifications** (the bell icon, the unread badge) fire for a specific list of activity types — and "new post from someone I follow" isn't one of them. Notifications fire for:
|
||||
|
||||
- Mentions (`@you` in the post body)
|
||||
- Follows (someone follows you)
|
||||
- Favourites of your posts
|
||||
- Boosts of your posts
|
||||
- Polls ending
|
||||
- Status edits (post you favourited was edited)
|
||||
- Admin alerts
|
||||
|
||||
So even with delivery working perfectly and the follow in place, "I didn't get a notification on my account" is the expected state for a regular podcast post. Three ways to make notifications happen:
|
||||
|
||||
1. **Bell icon on the followed profile.** Mastodon UI: open the followed account's profile → click the bell. Enables per-account post notifications. Now every new post from that account raises a notification.
|
||||
2. **`@`-mention in the post.** Have Castopod include `@you@yourdomain.tld` in the post text. Mention activities always raise notifications regardless of follow/bell state. (You may not control the post text on someone else's Castopod, but you control your own.)
|
||||
3. **Cross-post via a different actor.** If you also run a Mastodon account for the show, post manually from there and `@`-mention the audience accounts you want to page.
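
The four checks form a short decision chain, which can be sketched as a function. The argument names and return strings are illustrative only; the status values are the ones from the table in Check 2.

```python
def diagnose(post_exists: bool, activity_status: str,
             is_follower: bool, bell_enabled: bool) -> str:
    """Return the first failing check, in the order the article walks them."""
    if not post_exists:
        return "Check 1: Castopod never created the post (logs, scheduler, cron)"
    if activity_status != "delivered":
        return f"Check 2: activity status is {activity_status!r}, not 'delivered'"
    if not is_follower:
        return "Check 3: the account does not follow the podcast actor"
    if not bell_enabled:
        return "Check 4: post reached the home timeline; no notification is expected"
    return "All four checks pass; look at inbox logs, Sidekiq, and domain blocks"


verdict = diagnose(post_exists=True, activity_status="delivered",
                   is_follower=False, bell_enabled=False)
print(verdict)
```

The ordering matters: a missing follow makes the notification question moot, so it is reported first.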
|
||||
|
||||
---
|
||||
|
||||
## 🧪 Worked example
|
||||
|
||||
A real case: someone running a podcast on Castopod 2.0.0-next.4 expected a new episode's auto-post to appear on their personal Mastodon. It didn't.
|
||||
|
||||
- Check 1 → post present in `cp_fediverse_posts`, episode_id correct ✓
|
||||
- Check 2 → matching `cp_fediverse_activities` row, `type='Create'`, `status='delivered'` ✓
|
||||
- Check 3 → 8 followers in `cp_fediverse_follows`, none from the personal Mastodon's domain ✗
|
||||
|
||||
Outcome: the user wasn't following their own podcast. They had been favouriting and boosting its posts (which doesn't require following), and assumed those interactions implied a follow. Resolution: search-and-follow from the personal Mastodon. After the follow propagated, future broadcasts arrived as expected.
|
||||
|
||||
The post itself never raised a notification (only landed in home timeline). They later enabled the bell icon on the podcast profile and started getting notified on new episodes.
|
||||
|
||||
---
|
||||
|
||||
## 🧭 When this isn't the answer
|
||||
|
||||
If Check 3 shows the person IS a follower but they still didn't receive the post:
|
||||
|
||||
- **Check their inbox** if you have access: Mastodon nginx access log:
|
||||
```sh
|
||||
sudo grep '<castopod-ip-or-domain>' /var/log/nginx/access.log | grep inbox
|
||||
```
|
||||
Expect a `POST /users/<them>/inbox HTTP/2.0 202` from the Castopod IP shortly after `published_at`. No POST = Castopod didn't deliver despite claiming `status='delivered'` (rare; check Castopod's HTTP signing config and any outbound firewall on the Castopod host).
|
||||
|
||||
- **Check Sidekiq** on Mastodon for `ActivityPub::ProcessingWorker` failures around the activity timestamp.
|
||||
|
||||
- **Check domain blocks**: `SELECT * FROM domain_blocks WHERE domain = '<castopod-domain>'` on Mastodon. A silenced or suspended domain on either end would explain everything.
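
For the inbox check, the access log can be filtered programmatically instead of by eye. This sketch assumes nginx's default `combined` log format; the sample line, IP address, and user agent are invented for illustration, and `inbox_deliveries` is a hypothetical helper.

```python
import re

# Request and status fields as they appear in the default "combined" format.
INBOX_POST = re.compile(r'"POST (?P<path>\S*/inbox) HTTP/[\d.]+" (?P<status>\d{3})')


def inbox_deliveries(log_lines, source: str) -> list:
    """Return (path, status) for inbox POSTs on lines that mention `source`."""
    hits = []
    for line in log_lines:
        if source not in line:
            continue
        match = INBOX_POST.search(line)
        if match:
            hits.append((match["path"], int(match["status"])))
    return hits


# Invented example line; read the real file with open("/var/log/nginx/access.log").
line = ('203.0.113.10 - - [10/May/2026:16:34:02 +0000] '
        '"POST /users/alice/inbox HTTP/2.0" 202 0 "-" "Castopod"')
deliveries = inbox_deliveries([line], "203.0.113.10")
print(deliveries)
```

An empty result for the window after `published_at` means no POST reached this server from that source.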
|
||||
|
||||
---
|
||||
|
||||
## 📚 References
|
||||
|
||||
- ActivityPub spec — [Delivery semantics](https://www.w3.org/TR/activitypub/#delivery): `Create` activities go to actors in `to`/`cc`/`bcc`/`audience`; for public posts that resolves to the actor's followers collection.
|
||||
- Mastodon notification types: `app/models/notification.rb` — `TYPES` constant
|
||||
- Castopod fediverse module: `modules/Fediverse/Commands/Broadcast.php` (the per-minute task) and `modules/Fediverse/Models/ActivityModel.php` (queue model)
|
||||
- Related: [Castopod: Stale Federated Avatar URLs After Remote Profile Updates](castopod-stale-federated-avatar.md) — sister article, also Castopod fediverse module
|
||||
190
05-troubleshooting/security/castopod-stale-federated-avatar.md
Normal file
|
|
@ -0,0 +1,190 @@
|
|||
---
|
||||
title: "Castopod: Stale Federated Avatar URLs After Remote Profile Updates"
|
||||
domain: troubleshooting
|
||||
category: security
|
||||
tags: [castopod, mastodon, fediverse, activitypub, s3, federation]
|
||||
status: published
|
||||
created: 2026-05-08
|
||||
updated: 2026-05-08
|
||||
---
|
||||
|
||||
# Castopod: Stale Federated Avatar URLs After Remote Profile Updates
|
||||
|
||||
## 🛑 Problem
|
||||
|
||||
Your Castopod admin pages — most visibly the notifications list (`/cp-admin/podcasts/<id>/notifications`) — show broken avatars for federated actors. The browser dev tools (or a direct `curl -I`) on the avatar URL returns:
|
||||
|
||||
```
|
||||
HTTP/1.1 403 Forbidden
|
||||
Server: AmazonS3
|
||||
```
|
||||
|
||||
…with the response body:
|
||||
|
||||
```xml
|
||||
<Error>
|
||||
<Code>AccessDenied</Code>
|
||||
<Message>Access Denied</Message>
|
||||
...
|
||||
</Error>
|
||||
```
|
||||
|
||||
The hostname is the remote instance's S3 bucket (e.g. `s3.amazonaws.com/<their-bucket>/accounts/avatars/...`). Other actors in the same notifications list — those with avatars on Mastodon's own CDN, or on instances using path-stable storage — render fine.
|
||||
|
||||
This article explains *why* the error code is misleading, *what's actually broken*, and how to fix it on Castopod.
|
||||
|
||||
---
|
||||
|
||||
## 🔬 Why "AccessDenied" is misleading
|
||||
|
||||
S3 returns `403 AccessDenied` instead of `404` for a missing object whenever the requester lacks `s3:ListBucket` on the bucket, which is the usual case for anonymous requests. This is by design, as anti-enumeration: without list permission, S3 will not reveal whether the key is missing or merely forbidden. Both cases produce the same 403.
|
||||
|
||||
So when you see `403 AccessDenied` on a remote avatar URL, **the actual problem is almost always that the object no longer exists**. The bucket is fine; the file is gone.
|
||||
|
||||
### Verifying that interpretation
|
||||
|
||||
If you have access to the remote instance (or to S3 credentials for that bucket):
|
||||
|
||||
```sh
|
||||
aws s3api head-object --bucket <bucket> --key accounts/avatars/.../<filename>.jpeg
|
||||
```
|
||||
|
||||
If you see `An error occurred (404) when calling the HeadObject operation: Not Found`, the object is genuinely gone, most likely because the upstream user has updated their avatar.
|
||||
|
||||
---
|
||||
|
||||
## 🔍 What's actually broken
|
||||
|
||||
Mastodon (and most ActivityPub servers using Paperclip-style storage) **deletes the old object** on avatar replacement and stores only the current filename in the DB. The remote instance is functioning normally — its current `<img>` URL points to a different filename and serves correctly.
|
||||
|
||||
Castopod 2.0.0 (verified up to `2.0.0-next.4`) **caches the avatar URL** of every federated actor in `cp_fediverse_actors.avatar_image_url` when it first sees activity from that actor — and never refetches. The admin templates (e.g. `themes/cp_admin/podcast/notifications.php`) emit that stored URL directly into `<img src>`. Once the upstream replaces the avatar:
|
||||
|
||||
- Old object deleted → S3 returns 403 to anonymous fetchers
|
||||
- Castopod still renders the dead URL forever
|
||||
- Every cached page using that template shows a broken image
|
||||
|
||||
The same pattern applies to `cover_image_url` (header).
|
||||
|
||||
---
|
||||
|
||||
## ✅ Fix
|
||||
|
||||
You have three options, ranging from a single-row fix to a bulk audit. None of them is permanent: each repairs the rows that are stale today.
|
||||
|
||||
### Option 1 — Manual SQL update (one-shot)
|
||||
|
||||
Recommended for one or two stale actors. Get the current URL from the upstream instance.
|
||||
|
||||
If the upstream is your own Mastodon instance:
|
||||
|
||||
```sh
|
||||
sudo -u postgres psql mastodon_production -t -A \
|
||||
-c "SELECT id, avatar_file_name, header_file_name FROM accounts WHERE username='<their-username>'"
|
||||
```
|
||||
|
||||
Construct the canonical URL using the standard Paperclip path scheme. For an account ID like `109326168175475699`, the path is built by chunking the ID three digits at a time:
|
||||
|
||||
```
|
||||
accounts/avatars/109/326/168/175/475/699/original/<avatar_file_name>
|
||||
accounts/headers/109/326/168/175/475/699/original/<header_file_name>
|
||||
```
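The chunking rule above can be sketched in a few lines of Python. This is a minimal illustration, not Mastodon's or Paperclip's own code: the function name `paperclip_key` is hypothetical, and the sketch only handles IDs whose digit count is a multiple of three, as in the article's 18-digit example.

```python
def paperclip_key(kind: str, account_id: int, file_name: str) -> str:
    """Build the object key by splitting the account ID into 3-digit chunks."""
    digits = str(account_id)
    if len(digits) % 3 != 0:
        # Assumption: only lengths like the article's 18-digit example are covered.
        raise ValueError("expected an ID whose digit count is a multiple of three")
    chunks = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    return "/".join(["accounts", kind, *chunks, "original", file_name])


# "example.jpeg" stands in for the avatar_file_name returned by the query above.
print(paperclip_key("avatars", 109326168175475699, "example.jpeg"))
print(paperclip_key("headers", 109326168175475699, "example.jpg"))
```

Prefix the result with the S3 host and bucket to get the full URL for the `UPDATE` statement.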
|
||||
|
||||
Then UPDATE the Castopod row:
|
||||
|
||||
```sh
|
||||
mysql -u $CP_DB_USER -p$CP_DB_PASS $CP_DB_NAME <<'SQL'
|
||||
UPDATE cp_fediverse_actors
|
||||
SET avatar_image_url = 'https://<s3-host>/<bucket>/accounts/avatars/109/326/168/175/475/699/original/<new>.jpeg',
|
||||
cover_image_url = 'https://<s3-host>/<bucket>/accounts/headers/109/326/168/175/475/699/original/<new>.jpg',
|
||||
updated_at = NOW()
|
||||
WHERE username = '<their-username>'
|
||||
AND domain = '<their-domain>';
|
||||
SQL
|
||||
```
|
||||
|
||||
Then clear the Castopod cache so any cached HTML rerenders:
|
||||
|
||||
```sh
|
||||
cd /var/www/html/castopod
|
||||
sudo -u www-data php spark cache:clear
|
||||
```
|
||||
|
||||
Verify:
|
||||
|
||||
```sh
|
||||
curl -sI 'https://<new-url>' | head -1 # expect HTTP/1.1 200 OK
|
||||
```
|
||||
|
||||
### Option 2 — Delete and let Castopod refetch
|
||||
|
||||
For a one-shot self-healing fix, delete the actor row entirely:
|
||||
|
||||
```sql
|
||||
DELETE FROM cp_fediverse_actors WHERE username='<u>' AND domain='<d>';
|
||||
```
|
||||
|
||||
Castopod will repopulate the row from the next inbound activity from that actor (favourite, boost, mention, follow…). **Caveat — verify foreign-key cascades first:** `cp_fediverse_favourites`, `cp_fediverse_follows`, `cp_fediverse_posts`, and `cp_fediverse_notifications` all reference `actor_id`. Depending on the migration version, ON DELETE may cascade or restrict. Check with:
|
||||
|
||||
```sh
|
||||
mysql -u $CP_DB_USER -p$CP_DB_PASS $CP_DB_NAME -e "
|
||||
SELECT TABLE_NAME, CONSTRAINT_NAME, DELETE_RULE
|
||||
FROM information_schema.REFERENTIAL_CONSTRAINTS
|
||||
WHERE CONSTRAINT_SCHEMA = '$CP_DB_NAME'
|
||||
AND REFERENCED_TABLE_NAME = 'cp_fediverse_actors';
|
||||
"
|
||||
```
|
||||
|
||||
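If deletes cascade, you'll lose the activity history attributed to that actor; use Option 1 instead. If the rule is `RESTRICT` or `NO ACTION`, the `DELETE` fails while referencing rows exist, so Option 1 is again the safer route.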
|
||||
|
||||
### Option 3 — Bulk audit and update
|
||||
|
||||
If multiple federated actors have likely-stale avatars (any old enough that an upstream user might have refreshed their profile picture), audit them all:
|
||||
|
||||
```sh
|
||||
mysql -u $CP_DB_USER -p$CP_DB_PASS $CP_DB_NAME -BNe "
|
||||
SELECT id, username, domain, avatar_image_url
|
||||
FROM cp_fediverse_actors
|
||||
WHERE avatar_image_url IS NOT NULL
|
||||
" | while IFS=$'\t' read -r id user dom url; do
|
||||
code=$(curl -s -o /dev/null -w "%{http_code}" "$url")
|
||||
[ "$code" != "200" ] && echo "BROKEN $code $id $user@$dom $url"
|
||||
done
|
||||
```
|
||||
|
||||
For each broken row, fetch the upstream's current actor JSON and update from `icon.url` / `image.url`:
|
||||
|
||||
```sh
|
||||
curl -s -H 'Accept: application/activity+json' \
|
||||
"https://<their-domain>/users/<their-username>" | jq '{icon, image}'
|
||||
```
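If you would rather script the extraction than read the `jq` output by eye, the mapping from actor JSON to the two Castopod columns might look like the sketch below. The helper name `image_urls` and the sample actor (including its URLs) are hypothetical; the sketch assumes a Mastodon-style actor where `icon` and `image` are objects with a string `url`, and tolerates the list and bare-string forms that ActivityPub also allows.

```python
import json


def image_urls(actor: dict) -> dict:
    """Map an actor document to the column names used in cp_fediverse_actors."""

    def url_of(value):
        # ActivityPub permits an object, a list of objects, or a bare URL string.
        if isinstance(value, list):
            value = value[0] if value else None
        if isinstance(value, dict):
            return value.get("url")
        return value if isinstance(value, str) else None

    return {
        "avatar_image_url": url_of(actor.get("icon")),
        "cover_image_url": url_of(actor.get("image")),
    }


# Hypothetical actor document, trimmed to the two fields of interest.
sample = json.loads("""
{
  "type": "Person",
  "icon":  {"type": "Image", "url": "https://files.example.social/avatars/new.jpeg"},
  "image": {"type": "Image", "url": "https://files.example.social/headers/new.jpg"}
}
""")

print(image_urls(sample))
```

An actor with no avatar or header simply yields `None` for that column, which is a useful signal to leave the existing value alone.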
|
||||
|
||||
Then run the Option 1 SQL update with the fresh URLs.
|
||||
|
||||
---
|
||||
|
||||
## 🧪 Why this isn't fixable on the upstream side
|
||||
|
||||
Once the old object is deleted, you can't restore the URL without re-uploading bytes to the **exact original key** — which Mastodon won't do, because its DB only knows about the new filename. Trying to "fix" it on the Mastodon side means resurrecting a file Mastodon has no record of and that no fresh ActivityPub request would emit a URL for. The fix has to live on the consumer (Castopod) because Castopod is the one holding the stale reference.
|
||||
|
||||
This applies to every federation consumer that caches URLs by reference rather than fetching bytes locally. Mastodon, Pleroma, Akkoma, and Misskey all cache the bytes; that's why they self-heal across remote avatar swaps. Castopod 2.0.0 currently does not.
|
||||
|
||||
---
|
||||
|
||||
## 🛠 Long-term mitigations
|
||||
|
||||
This is a Castopod design issue worth raising upstream:
|
||||
- Add a `last_refreshed_at` to `cp_fediverse_actors` and a worker that refetches actor JSON on a schedule.
|
||||
- Or fetch and store avatars locally on first sight, the way Mastodon does.
|
||||
|
||||
A `fediverse:refresh-actor` spark command would also let admins fix stale rows without writing SQL.
|
||||
|
||||
If you have a recurring case (you update your Mastodon avatar often, and you also operate a Castopod instance under your own control), keep the Option 1 SQL handy as a one-liner. After your own avatar update, run it within minutes and the dead-URL window closes before it spreads to many cached pages.
|
||||
|
||||
---
|
||||
|
||||
## 📚 References
|
||||
|
||||
- Castopod source (`themes/cp_admin/podcast/notifications.php`) — uses `avatar_image_url` directly in `<img src>`
|
||||
- AWS S3 anti-enumeration: `403` vs `404` is bucket-policy-dependent; see [GetObject — Permissions Required](https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html#API_GetObject_RequestPermissions)
|
||||
- Mastodon Paperclip storage layout: `accounts/avatars/<3-digit chunks of account id>/original/<file_name>`
|
||||
- Related fix patterns: [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](netdata-web-log-successful-redirect-heavy-tuning.md) — shares the "the alarm is technically correct, but means something different than you think" theme
|
||||
|
|
@ -1,11 +1,17 @@
|
|||
---
|
||||
title: "ClamAV Safe Scheduling on Live Servers"
|
||||
title: ClamAV Safe Scheduling on Live Servers
|
||||
domain: troubleshooting
|
||||
category: security
|
||||
tags: [clamav, cpu, nice, ionice, cron, vps]
|
||||
tags:
|
||||
- clamav
|
||||
- cpu
|
||||
- nice
|
||||
- ionice
|
||||
- cron
|
||||
- vps
|
||||
status: published
|
||||
created: 2026-04-02
|
||||
updated: 2026-04-02
|
||||
updated: 2026-05-11T18:31
|
||||
---
|
||||
# ClamAV Safe Scheduling on Live Servers
|
||||
|
||||
|
|
@ -75,6 +81,7 @@ kill <PID>
|
|||
- `ionice -c 3` (Idle) requires Linux kernel ≥ 2.6.13 and CFQ/BFQ I/O scheduler. Works on most Ubuntu/Debian/Fedora systems.
|
||||
- On multi-core servers, consider also using `cpulimit` for a hard cap: `cpulimit -l 30 -- clamscan ...`
|
||||
- Always keep `--exclude=/sys` (and optionally `--exclude=/proc`, `--exclude=/dev`) to avoid scanning virtual filesystems.
|
||||
- **1 vCPU limitation:** `nice` and `ionice` only help when other processes compete for resources. On a single-core VPS, clamscan will still saturate the CPU at 57-100% even with `nice -n 19 ionice -c 3` — there's nothing to yield to. Accept the weekly spike as benign, or reduce scan scope to shorten the window.
|
||||
|
||||
## Related
|
||||
|
||||
|
|
|
|||
116
05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md
Normal file
116
05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md
Normal file
|
|
@ -0,0 +1,116 @@
|
|||
---
|
||||
title: "Fedora CA Bundle Missing Symlink — TLS Breaks Fleet-Wide"
|
||||
description: Hetzner-provisioned Fedora images may be missing the /etc/pki/tls/certs/ca-bundle.crt symlink, silently breaking Postfix TLS relay, curl, and dnf
|
||||
tags:
|
||||
- fedora
|
||||
- tls
|
||||
- postfix
|
||||
- ca-certificates
|
||||
- hetzner
|
||||
- troubleshooting
|
||||
status: published
|
||||
created: 2026-05-11
|
||||
updated: 2026-05-11
|
||||
---
|
||||
|
||||
# Fedora CA Bundle Missing Symlink
|
||||
|
||||
On Fedora, many TLS clients (Postfix, curl, dnf) look for the CA bundle at `/etc/pki/tls/certs/ca-bundle.crt`. This path is normally a symlink to `/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem`, shipped by the `ca-certificates` package.
|
||||
|
||||
On Hetzner Cloud Fedora images (observed on Fedora 44, May 2026), this symlink can be missing despite `ca-certificates` being installed. The extracted bundle exists, but the consumer-facing symlink does not.
|
||||
|
||||
## Symptoms
|
||||
|
||||
Postfix relay to a TLS-required upstream fails:
|
||||
|
||||
```
|
||||
postfix/smtp: cannot load Certification Authority data,
|
||||
CAfile="/etc/pki/tls/certs/ca-bundle.crt",
|
||||
CApath="/etc/pki/tls/certs": disabling TLS support
|
||||
```
|
||||
|
||||
If your relay requires TLS (port 465 with `smtp_tls_wrappermode = yes`, or `smtp_tls_security_level = encrypt`), mail silently queues as deferred. No bounce, no alert — just silence.
|
||||
|
||||
Other symptoms on the same box:
|
||||
|
||||
```bash
|
||||
# curl fails
|
||||
curl https://example.com
|
||||
# error: Problem with the SSL CA cert (path? access rights?)
|
||||
|
||||
# dnf fails
|
||||
dnf list --installed
|
||||
# Curl error (77): Problem with the SSL CA cert
|
||||
```
|
||||
|
||||
## Diagnosis
|
||||
|
||||
```bash
|
||||
# Check the symlink
|
||||
ls -la /etc/pki/tls/certs/ca-bundle.crt
|
||||
# Expected: symlink -> /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
|
||||
# Broken: "No such file or directory"
|
||||
|
||||
# Verify the extracted bundle exists
|
||||
ls -la /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
|
||||
# Should exist (~220 KB, ~140-150 certs)
|
||||
|
||||
# Confirm the package is installed
|
||||
rpm -q ca-certificates
|
||||
# Should return a version string
|
||||
```
|
||||
|
||||
If the extracted bundle exists but the symlink at `/etc/pki/tls/certs/ca-bundle.crt` is missing, that's the problem.
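The same three-way decision can be expressed as a small check, which is handy if you want a health script to report the state rather than eyeball `ls` output. This is a sketch: the function name and the three return labels are assumptions, not part of any Fedora tooling.

```python
import os

LINK = "/etc/pki/tls/certs/ca-bundle.crt"
BUNDLE = "/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem"


def diagnose_ca_bundle(link_path: str = LINK, bundle_path: str = BUNDLE) -> str:
    """Classify the host into one of three (hypothetical) states."""
    # os.path.exists follows symlinks, so a dangling link is treated as missing.
    if os.path.exists(link_path):
        return "ok"
    if os.path.exists(bundle_path):
        return "symlink-missing"  # the case this article describes
    return "bundle-missing"       # a different problem: check ca-certificates


print(diagnose_ca_bundle())
```

Only the `symlink-missing` result matches this article; `bundle-missing` points at the package or at `update-ca-trust` instead.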
|
||||
|
||||
## Fix
|
||||
|
||||
```bash
|
||||
sudo ln -sf /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem \
|
||||
/etc/pki/tls/certs/ca-bundle.crt
|
||||
sudo systemctl restart postfix
|
||||
sudo postqueue -f # flush any deferred mail
|
||||
```
|
||||
|
||||
Verify:
|
||||
|
||||
```bash
|
||||
# Symlink exists
|
||||
ls -la /etc/pki/tls/certs/ca-bundle.crt
|
||||
|
||||
# Postfix can relay
|
||||
echo "Subject: TLS test" | sendmail -v marcus@majorshouse.com
|
||||
|
||||
# curl works
|
||||
curl -sI https://example.com | head -1
|
||||
```
|
||||
|
||||
## Fleet Audit
|
||||
|
||||
If one Hetzner-provisioned Fedora host has this issue, check the others:
|
||||
|
||||
```bash
|
||||
for host in majordiscord majorlab majorhome majormail; do
|
||||
echo "$host: $(ssh root@$host 'ls /etc/pki/tls/certs/ca-bundle.crt 2>&1' | tail -1)"
|
||||
done
|
||||
```
|
||||
|
||||
Hosts returning "No such file or directory" are silently broken for every TLS client that reads the bundle from this path (Postfix, curl, and dnf among them).
|
||||
|
||||
## Why This Happens
|
||||
|
||||
`update-ca-trust extract` regenerates the files under `/etc/pki/ca-trust/extracted/` but does not create the legacy consumer-path symlink at `/etc/pki/tls/certs/ca-bundle.crt`. That symlink is shipped by the `ca-certificates` RPM. On cloud images built from minimal installs or snapshot-based provisioning, the symlink can be lost during image creation or a partial upgrade.
|
||||
|
||||
## Prevention
|
||||
|
||||
Add to your provisioning checklist (see [VPS Migration Baseline Checklist](../../02-selfhosting/cloud/vps-migration-baseline-checklist.md)):
|
||||
|
||||
```bash
|
||||
# Fedora provisioning — verify CA bundle symlink
|
||||
[ -e /etc/pki/tls/certs/ca-bundle.crt ] || \
|
||||
ln -sf /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem /etc/pki/tls/certs/ca-bundle.crt
|
||||
```
|
||||
|
||||
## Related
|
||||
|
||||
- [Logwatch Fleet Setup](../../02-selfhosting/monitoring/logwatch-fleet-setup.md) — logwatch depends on a working Postfix relay, which depends on TLS, which depends on this symlink
|
||||
- [VPS Migration Baseline Checklist](../../02-selfhosting/cloud/vps-migration-baseline-checklist.md) — includes CA bundle verification step
|
||||
|
|
@ -0,0 +1,80 @@
|
|||
---
|
||||
title: "Logwatch Falsely Reports 'No freshclam updates' in ClamAV Daemon Mode"
|
||||
domain: troubleshooting
|
||||
category: security
|
||||
tags: [clamav, freshclam, logwatch, false-positive, fedora, ubuntu, ansible]
|
||||
status: published
|
||||
created: 2026-06-06
|
||||
updated: 2026-06-06
|
||||
---
|
||||
# Logwatch Falsely Reports "No freshclam updates" in ClamAV Daemon Mode
|
||||
|
||||
Logwatch's daily `clam-update` section emails:
|
||||
|
||||
> No updates detected in the log for the freshclam daemon (the ClamAV update process). If the freshclam daemon is not running, you may need to restart it.
|
||||
|
||||
…even though freshclam **is** running and signatures **are** current. It's a parser quirk specific to running freshclam as a daemon. Don't act on the "restart it" suggestion — first confirm whether signatures are actually stale.
|
||||
|
||||
> Seen on **tttpod** (2026-06-06). All four freshclam hosts (majorlinux, majortoot-hetzner, teelia, tttpod) hit this on quiet days.
|
||||
|
||||
## First: is it real or false?
|
||||
|
||||
```bash
|
||||
systemctl is-active clamav-freshclam # active?
|
||||
ls -l /var/lib/clamav/daily.c[lv]d # mtime today/yesterday?
|
||||
grep 'updated' /var/log/clamav/freshclam.log | tail # real download events
|
||||
```
|
||||
|
||||
- **Fresh `daily.cld` + active service → false positive** (this page).
|
||||
- **`daily.cld` weeks old / service disabled → real.** Re-enable freshclam and update (see Related). A daemonless box still needs freshclam enabled — `clamav_use_daemon: false` only disables the *scanner* daemon, not the updater.
|
||||
|
||||
## Why It False-Alarms
|
||||
|
||||
logwatch's `clam-update` script (`/usr/share/logwatch/scripts/services/clam-update`) decides "updated" by counting **`ClamAV update process started`** lines (`$UpdatedNum`) within its range (`Range = yesterday`). It does **not** count the actual `daily.cld updated (version: …)` download lines.
|
||||
|
||||
freshclam emits "update process started" **only when the daemon (re)starts** — not on its periodic in-daemon checks (`Checks 24`, `ExecStart=/usr/bin/freshclam -d`). So on any day the box doesn't reboot or restart freshclam, yesterday's log has zero "started" lines → `$UpdatedNum == 0` → the warning fires, regardless of whether signatures downloaded. (Conversely, on a day you *do* reboot, the warning won't fire.) The script was written for the old cron-driven freshclam, which started a fresh process each run.
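The mismatch is easy to see on a toy log. The sketch below is not logwatch's Perl; it only mimics the counting rule described above. The sample lines are illustrative (version numbers and timestamps are made up), and the behaviour of freshclam in daemon mode is taken from this article's description.

```python
STARTED = "ClamAV update process started"  # what the logwatch script counts
DOWNLOADED = "updated (version:"           # what actually proves a download

# Illustrative "yesterday" slice of freshclam.log on a day with no restart.
SAMPLE = [
    "Fri Jun  5 02:14:07 2026 -> Received signal: wake up",
    "Fri Jun  5 02:14:09 2026 -> daily.cld updated (version: 27999, sigs: 2070000)",
    "Fri Jun  5 02:14:09 2026 -> main.cvd database is up-to-date (version: 62)",
]


def summarize(lines):
    """Return (started_count, download_count) for a list of log lines."""
    started = sum(STARTED in line for line in lines)
    downloaded = sum(DOWNLOADED in line for line in lines)
    return started, downloaded


started, downloaded = summarize(SAMPLE)
warning_fires = started == 0  # the heuristic ignores `downloaded` entirely
print(started, downloaded, warning_fires)
```

Here a download happened, yet the started-line count is zero, so the warning would fire.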
|
||||
|
||||
## Fix
|
||||
|
||||
Silence just that one message — real `ERROR` / `WARNING` / outdated alerts still report:
|
||||
|
||||
```bash
|
||||
# /etc/logwatch/conf/services/clam-update.conf
|
||||
$ignore_no_updates = 1
|
||||
```
|
||||
|
||||
No service restart needed; logwatch picks it up on its next daily run. (The variable is read as `$ENV{'ignore_no_updates'}` by the script — note: **not** prefixed `clam_update_`, despite what the script's own self-help text suggests.)
|
||||
|
||||
## Codify (Ansible)
|
||||
|
||||
Deploy the drop-in wherever freshclam runs in daemon mode. On the fleet it's a task in the `clamav` role (`roles/clamav/tasks/install.yml`, group `clamav`), right after freshclam is enabled — originally added in MajorAnsible commit `cb27c93`:
|
||||
|
||||
```yaml
|
||||
- name: Suppress logwatch clam-update false "no updates" alert (daemon-mode freshclam)
|
||||
ansible.builtin.copy:
|
||||
dest: /etc/logwatch/conf/services/clam-update.conf
|
||||
mode: '0644'
|
||||
content: |
|
||||
$ignore_no_updates = 1
|
||||
tags: [logwatch]
|
||||
```
|
||||
|
||||
## Key Notes
|
||||
|
||||
- **Confirm freshness before suppressing.** If signatures really are stale (freshclam off / no update timer), suppressing hides a genuine security gap. On a daemonless host that disabled freshclam, the warning is *true*.
|
||||
- The script's built-in options B/C (about syslog format) don't apply when freshclam logs to its own file (`LogSyslog false`); `$ignore_no_updates` is the right lever.
|
||||
- **Don't alert with `mail`.** The `mail`/`mailx` CLI is absent on most fleet hosts (only Postfix's `/usr/sbin/sendmail` is guaranteed). A health script that ends in `mail -s … root` silently fails to send. Pipe a headered message to `/usr/sbin/sendmail -t` addressed to `admin_email` directly (don't rely on an `/etc/aliases` `root` rewrite either).
|
||||
|
||||
## Proactive monitoring (don't rely on logwatch for "is it updating?")
|
||||
|
||||
Since logwatch's heuristic is suppressed, a **direct daily watchdog** is what actually catches a dead freshclam. The `clamav` role deploys `/etc/cron.daily/clamav-freshness` (originally MajorAnsible `9d1a1a9`) to every `clamav`-group host: it emails the admin (via `sendmail`) if `clamav-freshclam` is inactive **or** `daily.cld` is older than `clamav_staleness_threshold_days` (default 3) — and stays silent otherwise. Test without emailing:
|
||||
|
||||
```bash
|
||||
CLAMAV_STALE_DAYS=0 /etc/cron.daily/clamav-freshness # forces the stale branch
|
||||
```
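The watchdog script itself is not reproduced here, so the following is only a sketch of the decision it is described as making: alert when the service is inactive or the signature file is older than the threshold. The function names are hypothetical, and the real script may compute age differently.

```python
from datetime import datetime, timedelta


def is_stale(mtime: datetime, now: datetime, threshold_days: int = 3) -> bool:
    """True when the signature file is older than the threshold."""
    return now - mtime > timedelta(days=threshold_days)


def should_alert(service_active: bool, mtime: datetime, now: datetime,
                 threshold_days: int = 3) -> bool:
    """Alert on either condition; stay silent otherwise."""
    return (not service_active) or is_stale(mtime, now, threshold_days)


now = datetime(2026, 6, 6, 6, 0)
fresh = now - timedelta(hours=20)
old = now - timedelta(days=20)

print(should_alert(True, fresh, now))      # healthy host
print(should_alert(True, old, now))        # drifted signatures
print(should_alert(False, fresh, now))     # updater stopped
print(should_alert(True, fresh, now, 0))   # threshold 0 forces the stale branch
```

A threshold of zero makes any file with a past mtime count as stale, which is why setting it to `0` exercises the alert path on a healthy host.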
|
||||
|
||||
This is what would have caught dcaprod's 20-day drift immediately instead of it surfacing by accident.
|
||||
|
||||
## Related
|
||||
|
||||
- [ClamAV CPU Spike: Safe Scheduling with nice/ionice](clamscan-cpu-spike-nice-ionice.md)
|
||||
|
|
@ -0,0 +1,112 @@
|
|||
---
|
||||
title: Netdata apps-group FD-utilisation false 100% (silenced fleet-wide)
|
||||
domain: troubleshooting
|
||||
category: security
|
||||
tags:
|
||||
- netdata
|
||||
- apps.plugin
|
||||
- file-descriptors
|
||||
- tailscale
|
||||
- false-positive
|
||||
- ansible
|
||||
- fleet
|
||||
status: published
|
||||
created: 2026-05-15
|
||||
updated: 2026-05-15T02:40
|
||||
---
|
||||
# Netdata apps-group FD-utilisation false 100%
|
||||
|
||||
The Netdata stock alarm **`apps_group_file_descriptors_utilization`** (from
|
||||
`/usr/lib/netdata/conf.d/health.d/file_descriptors.conf`) fires
|
||||
`Raised to Warning — App group <X> file descriptors utilization = 100%`
|
||||
emails for application groups that are perfectly healthy. First hit on
|
||||
**MajorToot** (the `tailscaled` app group), 2026-05-15.
|
||||
|
||||
## The Problem
|
||||
|
||||
A Netdata email arrives: *"App group tailscaled file descriptors utilization
|
||||
= 100% on MajorToot"*. The process is fine. On the host:
|
||||
|
||||
```
|
||||
PID 1047 tailscaled (daemon) fds=35 soft_limit=524287 util=0.01%
|
||||
PID 1984541 tailscaled (child) fds=10 soft_limit=524287 util=0.00%
|
||||
PID 1984548 bash (tailscale hook) fds=5 soft_limit=1024 util=0.49%
|
||||
```
|
||||
|
||||
No PID exceeds **0.5%**, yet `app.fds_open_limit` reads ~100%. Over 1h the raw
|
||||
chart was min 0 / **mean 36.7** / max 100, with sustained multi-minute 100%
|
||||
plateaus (not isolated spikes).
|
||||
|
||||
> This is **not** an `apps.plugin` privilege problem. apps.plugin already has
|
||||
> `cap_dac_read_search,cap_sys_ptrace` and `sudo -u netdata cat
|
||||
> /proc/<pid>/limits` succeeds. Verify before "fixing" privileges — it's a
|
||||
> no-op.
|
||||
|
||||
## Root Cause
|
||||
|
||||
The stock alarm does `lookup: max -10s` over **every PID in the app group**.
|
||||
App groups whose processes fork short-lived children (tailscaled spawns
|
||||
route/DNS helpers and bash hooks; `bash` children inherit the systemd default
|
||||
soft limit of 1024) trip a false 100%: apps.plugin's per-PID FD-limit read
|
||||
**races on transient/just-forked PIDs**, and because the group lookup uses
|
||||
`max`, a single bad 10-second sample pegs the entire group to ~100%. The
|
||||
signal carries no usable information for any forking/root app group.
|
||||
|
||||
A `lookup: average -5m` does **not** rescue it — the bogus reading sits at
|
||||
~100% for sustained multi-minute stretches, so the 5-minute rolling average
|
||||
itself still reaches 100.0% (empirically verified on MajorToot).
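A toy model makes both effects concrete. It is not apps.plugin's code: the per-PID numbers come from the table above, while the bogus `100.0` reading and the 30-sample window (5 minutes at one sample per 10 seconds) are assumptions chosen to mirror the described behaviour.

```python
# (open fds, soft limit) per PID, copied from the article's table.
PIDS = {
    "tailscaled (daemon)": (35, 524287),
    "tailscaled (child)": (10, 524287),
    "bash (tailscale hook)": (5, 1024),
}


def utilisation(fds: int, soft_limit: int) -> float:
    return 100.0 * fds / soft_limit


def group_value(per_pid):
    """A max-style lookup: the group reads as its worst member."""
    return max(per_pid)


honest = [utilisation(fds, limit) for fds, limit in PIDS.values()]
# Hypothetical bogus reading for a just-forked PID whose limit was misread.
with_bogus = honest + [100.0]

# If the bogus reading persists for a whole 5-minute window, averaging the
# group value over that window cannot pull it below 100.
window = [group_value(with_bogus)] * 30
average = sum(window) / len(window)

print(round(group_value(honest), 2), group_value(with_bogus), average)
```

The honest maximum stays under half a percent, while a single misread PID is enough to pin the group at 100 for as long as the misread lasts.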
|
||||
|
||||
## The Fix
|
||||
|
||||
Silence this template fleet-wide, keep the reliable system-wide FD alarm.
|
||||
|
||||
- **Codified in Ansible** (do not hand-edit hosts): `MajorAnsible/netdata.yml`
|
||||
ships `templates/health_apps_fds_group.conf.j2` to
|
||||
`/etc/netdata/health.d/apps_fds_group_override.conf` and reloads via
|
||||
`netdatacli reload-health`.
|
||||
- The override redefines `apps_group_file_descriptors_utilization` with
|
||||
`to: silent`. Netdata loads `/etc/netdata/health.d/` *after* the stock
|
||||
`conf.d` dir, so a same-name template deterministically supersedes the stock
|
||||
one (same mechanism as the manual `tcp_resets.conf` override, 2026-04-30).
|
||||
- **Safety net retained:** the companion stock template
|
||||
`system_file_descriptors_utilization` (on `system.file_nr_utilization`,
|
||||
`crit > 90`, `to: sysadmin`) is untouched and still catches genuine
|
||||
system-wide FD exhaustion regardless of app grouping.
|
||||
- The reload handler is restart-tolerant (`retries`/`until` + `failed_when`
|
||||
ignoring a `netdata.pipe` socket-absent error) because on hosts where the
|
||||
notify-config also drifts, `Restart Netdata` and `Reload Netdata health`
|
||||
can race during the ~5s restart window.
|
||||
|
||||
## Verification
|
||||
|
||||
```bash
|
||||
ssh <host> 'curl -s "http://localhost:19999/api/v1/alarms?all=true" \
|
||||
| python3 -c "import sys,json;A=json.load(sys.stdin)[\"alarms\"]; \
|
||||
print(A[\"app.tailscaled_fds_open_limit.apps_group_file_descriptors_utilization\"][\"recipient\"])"'
|
||||
# expect: silent
|
||||
```
|
||||
|
||||
After the fix the alarm still shows `status=WARNING` in the dashboard
|
||||
(cosmetic — silencing suppresses the *notification*, not the computed state);
|
||||
`recipient=silent` confirms no more emails. The system-wide alarm should read
|
||||
`CLEAR recipient=sysadmin`.
|
||||
|
||||
## Notes
|
||||
|
||||
- Silenced fleet-wide on all 10 servers 2026-05-15 (workstations majorrig/
|
||||
majormac were asleep — irrelevant, they are not fleet servers).
|
||||
- Any future host running a forking/root daemon in a named app group would
|
||||
hit the same false positive, so the fleet-wide silencing is pre-emptive.
|
||||
- **Follow-up debt:** the manual `/etc/netdata/health.d/tcp_resets.conf`
|
||||
override on MajorToot (2026-04-30) is still **not codified in
|
||||
`netdata.yml`** — a per-host divergence the fleet play does not manage.
|
||||
Worth folding into Ansible the same way.
|
||||
|
||||
## Related
|
||||
|
||||
- [[clamscan-cpu-spike-nice-ionice]]
|
||||
- [[netdata-web-log-successful-redirect-heavy-tuning]]
|
||||
- Server doc: `30-Areas/MajorInfrastructure/Servers/majortoot.md` (incident
|
||||
2026-05-15)
|
||||
- Playbook: `MajorAnsible/netdata.yml` +
|
||||
`templates/health_apps_fds_group.conf.j2`
|
||||
|
|
@ -0,0 +1,196 @@
|
|||
---
|
||||
title: "Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites"
|
||||
domain: troubleshooting
|
||||
category: security
|
||||
tags: [netdata, monitoring, wordpress, apache, fail2ban, alerts]
|
||||
status: published
|
||||
created: 2026-05-08
|
||||
updated: 2026-05-08
|
||||
---
|
||||
|
||||
# Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites
|
||||
|
||||
## 🛑 Problem
|
||||
|
||||
Netdata's stock `web_log_1m_successful` alarm fires CRITICAL on a perfectly healthy WordPress site whenever a crawler hammers legacy URLs. Example email/notification:
|
||||
|
||||
```
|
||||
[CRITICAL] web_log_1m_successful = 54.1%
|
||||
Ratio of successful HTTP requests over the last minute (1xx, 2xx, 304, 401, 429)
|
||||
```
|
||||
|
||||
Meanwhile the front page returns HTTP 200, no 5xx errors are logged, and only a handful of 4xx noise hits appear. So why the alert?
|
||||
|
||||
---
|
||||
|
||||
## 🔬 Root Cause
|
||||
|
||||
The metric counts as **"successful"** only the response code classes:
|
||||
|
||||
```
|
||||
1xx, 2xx, 304, 401, 429
|
||||
```
|
||||
|
||||
**301 redirects are NOT counted as successful.** They land in the `redirect` dimension and pull the success ratio down.
|
||||
|
||||
WordPress sites generate large volumes of 301s as a normal part of life:
|
||||
|
||||
| Redirect source | Why a 301 |
|
||||
|---|---|
|
||||
| `/?p=NNNN` legacy shortlinks | Canonical URL rewrite to slug |
|
||||
| Stale post slugs after permalink edits | Old → new path |
|
||||
| `/feed` → `/feed/` | Trailing-slash normalization |
|
||||
| `http://` → `https://` | TLS upgrade |
|
||||
| `domain.com` ↔ `www.domain.com` | Host canonicalization |
|
||||
| Proxy CONNECT probes (e.g. `www.instagram.com:443`) | Apache returns 301 to canonical host |
|
||||
|
||||
When a feed scraper or vulnerability crawler walks a long list of legacy `/?p=` URLs, **every single hit is a 301**. A short burst can push the ratio of `success / total_requests` below 80% (stock warn) or 65% (stock crit) within a single minute, even though the server is functioning perfectly.
|
||||
|
||||
### Verifying the cause
|
||||
|
||||
Pull the last few thousand lines of the access log and split by status code:
|
||||
|
||||
```sh
|
||||
sudo tail -5000 /var/log/apache2/access.log | awk '{print $9}' | sort | uniq -c | sort -rn
|
||||
```
|
||||
|
||||
If you see something like:
|
||||
|
||||
```
|
||||
196 200
|
||||
162 301
|
||||
1 405
|
||||
1 404
|
||||
1 400
|
||||
```
|
||||
|
||||
…the math is `196 / (196+162+3) ≈ 54%`, which matches the alarm value almost exactly. **The alert is correct by its definition; the definition is wrong for this workload.**
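To reproduce the ratio from any status-code breakdown, the classification from the alarm's description (1xx, 2xx, 304, 401, 429 count as successful) can be applied directly. This is a sketch of the arithmetic only, using the sample counts listed above; the function name is hypothetical and it is not Netdata's collector code.

```python
def is_success(code: int) -> bool:
    """Success classes as listed in the alarm's info line."""
    return 100 <= code < 300 or code in (304, 401, 429)


# Status code -> hit count, from the sample breakdown above.
counts = {200: 196, 301: 162, 405: 1, 404: 1, 400: 1}

total = sum(counts.values())
success = sum(n for code, n in counts.items() if is_success(code))
ratio = 100 * success / total

print(f"{success}/{total} = {ratio:.1f}%")
```

Every 301 lands in the denominator but not the numerator, which is the whole effect.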
|
||||
|
||||
Cross-check the source IPs:
|
||||
|
||||
```sh
|
||||
sudo tail -2000 /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -10
|
||||
```
|
||||
|
||||
If a single IP dominates (hundreds of requests in minutes) and most of its hits are 301 to legacy URLs, you have your culprit.
|
||||
|
||||
---
|
||||
|
||||
## ✅ Solution
|
||||
|
||||
Two parts: **fix the alarm definition** so normal redirect bursts don't trip it, and **block the abusive scraper** so it stops generating noise.
|
||||
|
||||
### 1. Retune `web_log_1m_successful` thresholds
|
||||
|
||||
Edit `/etc/netdata/health.d/web_log.conf` (this is a local override of the stock template). Locate the `template: web_log_1m_successful` block and replace its `warn`/`crit` lines:
|
||||
|
||||
```diff
|
||||
template: web_log_1m_successful
|
||||
on: web_log.type_requests
|
||||
class: Workload
|
||||
type: Web Server
|
||||
component: Web log
|
||||
lookup: sum -1m unaligned of success
|
||||
calc: $this * 100 / $web_log_1m_requests
|
||||
units: %
|
||||
every: 10s
|
||||
- warn: ($web_log_1m_requests > 120) ? ($this < (($status >= $WARNING ) ? ( 90 ) : ( 80 )) ) : ( 0 )
|
||||
- crit: ($web_log_1m_requests > 120) ? ($this < (($status == $CRITICAL) ? ( 75 ) : ( 65 )) ) : ( 0 )
|
||||
+ warn: ($web_log_1m_requests > 120) ? ($this < (($status >= $WARNING ) ? ( 50 ) : ( 40 )) ) : ( 0 )
|
||||
+ crit: ($web_log_1m_requests > 120) ? ($this < (($status == $CRITICAL) ? ( 30 ) : ( 20 )) ) : ( 0 )
|
||||
delay: up 2m down 15m multiplier 1.5 max 1h
|
||||
summary: Web log successful
|
||||
info: Ratio of successful HTTP requests over the last minute (1xx, 2xx, 304, 401, 429)
|
||||
to: webmaster
|
||||
```
|
||||
|
||||
Then reload Netdata health:
|
||||
|
||||
```sh
|
||||
sudo netdatacli reload-health
|
||||
```
|
||||
|
||||
Confirm the new thresholds are active:
|
||||
|
||||
```sh
|
||||
curl -s 'http://localhost:19999/api/v1/alarms?all' \
|
||||
| jq -r '.alarms | to_entries[] | select(.value.name == "web_log_1m_successful") | .value.warn,.value.crit'
|
||||
```
|
||||
|
||||
You should see the new `50/40` warn and `30/20` crit values.
|
||||
|
||||
### 2. Why the new thresholds make sense
|
||||
|
||||
The stock alarm assumes a low-redirect workload (typical SPA backend: lots of 200s, very few 301s). On a WP site with active permalink rewrites, expect routine ratios of 70–95% successful with occasional dips into the 50s during crawler bursts. The retuned alarm:
|
||||
|
||||
- **Warn at <40%** — not until *most* responses are non-2xx
|
||||
- **Crit at <20%** — only when the site is genuinely melting down (e.g., backend down, Apache returning 5xx for everything)
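The two numbers in each `warn`/`crit` expression form a hysteresis band: the alarm raises below the lower value and only clears once the ratio recovers above the higher one. A sketch of that logic for the warning level follows; the function and parameter names are hypothetical, and the request floor of 120 is taken from the expression in the override.

```python
def warning_active(ratio: float, requests: int, currently_warning: bool,
                   raise_below: float = 40, clear_above: float = 50,
                   min_requests: int = 120) -> bool:
    """Mirror of: (requests > 120) ? (ratio < (warning ? 50 : 40)) : 0"""
    if requests <= min_requests:
        return False  # too little traffic for the ratio to mean anything
    threshold = clear_above if currently_warning else raise_below
    return ratio < threshold


# The 54.1% reading from the alert email, at an assumed 300 requests/minute.
print(warning_active(54.1, 300, False))            # retuned thresholds
print(warning_active(54.1, 300, False, 80, 90))    # stock thresholds
print(warning_active(45.0, 300, True))             # already warning: stays raised
```

The band is what keeps the alarm from flapping when the ratio hovers near the trigger point.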
|
||||
|
||||
You haven't disabled the safety net — you've moved it past the floor of normal redirect-heavy noise.
|
||||
|
||||
### 3. Lean on the right alarms for real outages
|
||||
|
||||
Two other web_log alarms remain stock and **are** the correct outage signals:
|
||||
|
||||
| Alarm | Catches | Default thresholds |
|
||||
|---|---|---|
|
||||
| `web_log_1m_internal_errors` | 5xx ratio | warn 2% / crit 5% |
|
||||
| `web_log_1m_bad_requests` | 4xx (excl. 401, 429) | warn 30% |
|
||||
|
||||
Verify both are active and CLEAR after your retune:
|
||||
|
||||
```sh
|
||||
curl -s 'http://localhost:19999/api/v1/alarms?all' \
|
||||
| jq -r '.alarms | to_entries[] | select(.value.name | test("web_log")) | "\(.value.status) | \(.value.name)"'
|
||||
```
|
||||
|
||||
### 4. Block the abusive scraper
|
||||
|
||||
Identify the dominant offender from the source-IP list under *Verifying the cause* and ban it permanently via the recidive jail (assuming `bantime = -1` is set in `jail.local`):
|
||||
|
||||
```sh
|
||||
sudo fail2ban-client set recidive banip 74.7.242.61
|
||||
sudo fail2ban-client status recidive
|
||||
```
|
||||
|
||||
The recidive jail uses iptables/nftables, so the IP is dropped at the firewall — Apache no longer sees it, and the redirect-flood stops contributing to the ratio. If `bantime` is finite on your host, edit `/etc/fail2ban/jail.local`:
|
||||
|
||||
```ini
|
||||
[recidive]
|
||||
bantime = -1
|
||||
findtime = 86400
|
||||
maxretry = 3
|
||||
```
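To check what a given jail file currently sets before editing it, a rough sketch like the following can help. `recidive_bantime` is a hypothetical helper. It reads only the one file you pass it, whereas fail2ban merges several config files, so treat the result as a hint rather than the effective value.

```bash
# Print the bantime found in the [recidive] section of one jail file.
# Prints nothing if the section or the key is absent from that file.
recidive_bantime() {
    awk '
        /^\[/ { in_section = ($0 == "[recidive]") }      # track the current section
        in_section && /^[[:space:]]*bantime[[:space:]]*=/ {
            sub(/^[^=]*=[[:space:]]*/, "")                # drop "bantime = "
            sub(/[[:space:]]+$/, "")                      # drop trailing whitespace
            print
        }
    ' "$1"
}
```

Usage: `recidive_bantime /etc/fail2ban/jail.local` should print `-1` once the permanent ban is configured.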
|
||||
|
||||
---
|
||||
|
||||
## 🧪 Verification
|
||||
|
||||
After both changes:
|
||||
|
||||
```sh
|
||||
# 1. Active alarms — should be empty (or only your real ones)
|
||||
curl -s 'http://localhost:19999/api/v1/alarms?active' | jq '.alarms'
|
||||
|
||||
# 2. Recidive ban list includes the IP
|
||||
sudo fail2ban-client status recidive
|
||||
|
||||
# 3. Live ratio — should climb above 50% within 1–2 minutes
|
||||
watch -n 5 'curl -s http://localhost:19999/api/v1/data?chart=web_log_apache.requests_by_type\&after=-60\&points=1\&format=json | jq'
|
||||
```
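The chart returns per-type counts rather than a ready-made percentage. As a rough sketch, the ratio can be derived from the five `requests_by_type` dimensions. `success_pct` is a hypothetical helper, and it assumes integer counts and that the alarm's ratio is simply `success` over the sum of all dimensions, which may not match Netdata's exact calculation.

```bash
# Integer percentage of "success" responses among all classified requests.
# Arguments, in order: success redirect bad error other
success_pct() {
    total=$(( $1 + $2 + $3 + $4 + $5 ))
    if [ "$total" -eq 0 ]; then
        echo 0          # no traffic in the window; avoid dividing by zero
        return
    fi
    echo $(( 100 * $1 / total ))
}
```

For instance, 30 successes against 60 redirects, 5 bad requests, and 5 errors gives `success_pct 30 60 5 5 0`, which prints `30`.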
|
||||
|
||||
---
|
||||
|
||||
## 🧭 When NOT to apply this
|
||||
|
||||
- If your site is an API or SPA backend that should have a 200-dominated traffic mix, the stock thresholds are correct — diagnose what's actually returning 301 instead of relaxing the alarm.
|
||||
- If 5xx errors are climbing in tandem with the success-ratio drop, retuning the 1m_successful alarm will mask a real outage. **Always check `web_log_1m_internal_errors` first.**
|
||||
|
||||
---
|
||||
|
||||
## 📚 References
|
||||
|
||||
- Netdata stock template: `/usr/lib/netdata/conf.d/health.d/web_log.conf`
|
||||
- Local override: `/etc/netdata/health.d/web_log.conf`
|
||||
- Netdata web_log Go module dimensions: `success`, `redirect`, `bad`, `error`, `other`
|
||||
- Related: [Custom Fail2ban Jail: Apache Directory Scanning](apache-dirscan-fail2ban-jail.md)
|
||||
|
|
@ -98,6 +98,29 @@ ausearch -m avc -ts recent | grep dovecot
|
|||
|
||||
No output = no new denials.
|
||||
|
||||
## Variant: a Freshly-Rebuilt Box Left in Permissive Mode
|
||||
|
||||
If a server was rebuilt or migrated and came up **Permissive** (check `getenforce`), the symptom flips: mail works fine, but `/var/log/audit/audit.log` quietly fills with thousands of `dovecot_t → var_t` denials that *would* break IMAP/LMTP the instant you switch to Enforcing. The mailstore was created or `rsync`'d onto `/var/vmail` with no fcontext rule, so it defaulted to `var_t`.
|
||||
|
||||
Apply the relabel above first, then flip to Enforcing **only after** verifying zero new denials:
|
||||
|
||||
```bash
|
||||
MARK=$(date +%H:%M:%S)
|
||||
# ...deliver a test message + do an IMAP login...
|
||||
ausearch -m avc -ts "$MARK" | grep -c denied # expect 0
|
||||
setenforce 1
|
||||
sed -i 's/^SELINUX=permissive/SELINUX=enforcing/' /etc/selinux/config
|
||||
```
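If other services on the box also log denials, it can help to count only the ones for the domain you care about before flipping the switch. A minimal sketch, assuming standard `type=AVC` audit records on stdin; `count_denials` is a hypothetical helper name.

```bash
# Count AVC denials on stdin whose source context has the given type
# (e.g. dovecot_t). Always prints a number, including 0.
count_denials() {
    grep -c "denied.*scontext=[^ ]*:$1:" || true
}
```

Usage: `ausearch -m avc -ts "$MARK" 2>/dev/null | count_denials dovecot_t` should print `0` before you run `setenforce 1`.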
|
||||
|
||||
**Companion denial:** a Postfix virtual-mailbox server that looks up recipients in MySQL also trips `postfix_cleanup_t` reading `/etc/my.cnf*` (`mysqld_etc_t`). Allow it with a small local module:
|
||||
|
||||
```bash
|
||||
ausearch -m avc -c cleanup | audit2allow -M local_postfix_mysql
|
||||
semodule -i local_postfix_mysql.pp
|
||||
```
|
||||
|
||||
See also [[postfix-spamassassin-bayes-spam-filtering|Inbound Spam Filtering]] — the SpamAssassin Bayes DB belongs under `/var/lib/spamassassin` (`spamd_var_lib_t`) for the same labeling reason.
|
||||
|
||||
## Key Notes
|
||||
|
||||
- **One rule is enough** — `"/var/vmail(/.*)?"` with `mail_spool_t` covers every file and directory under `/var/vmail`, including all `tmp/` subdirectories.
|
||||
|
|
|
|||
|
|
@ -0,0 +1,92 @@
|
|||
---
|
||||
title: "SELinux: Wrong /etc/localtime Label Silently Breaks Timezone Changes"
|
||||
domain: troubleshooting
|
||||
category: general
|
||||
tags: [selinux, timezone, timedatectl, localtime, fedora, ansible, hetzner]
|
||||
status: published
|
||||
created: 2026-06-05
|
||||
updated: 2026-06-05
|
||||
---
|
||||
# SELinux: Wrong /etc/localtime Label Silently Breaks Timezone Changes
|
||||
|
||||
`timedatectl set-timezone` (and Ansible's `community.general.timezone`) report **success but the timezone never actually changes**: `date` keeps showing the old zone. The cause is an SELinux mislabel on `/etc/localtime`. It must be `locale_t`, but freshly provisioned images sometimes ship it as `etc_t`, and SELinux then blocks `systemd-timedated` from rewriting the symlink.
|
||||
|
||||
> Hit on **majormail** (Fedora 44, SELinux Enforcing, Hetzner Cloud image), 2026-06-05. The box stayed on UTC for hours despite the timezone task "succeeding."
|
||||
|
||||
## Symptoms
|
||||
|
||||
- `timedatectl set-timezone America/New_York` exits **0**, but `date` still shows the old zone/offset.
|
||||
- `timedatectl show -p Timezone --value` reports the **new** zone while `readlink /etc/localtime` still points at the **old** one — an inconsistent split state.
|
||||
- Ansible `community.general.timezone` reports `changed=false` ("already set") because its idempotence check reads the stale in-memory value from `timedatectl`.
|
||||
- `journalctl -u systemd-timedated` shows: `Failed to set time zone: Permission denied`.
|
||||
- A direct `ln -sf … /etc/localtime` **works** — but a brand-new symlink may get the wrong label again, sending you in circles.
|
||||
|
||||
## Why It Happens
|
||||
|
||||
`systemd-timedated` changes the timezone by replacing the `/etc/localtime` symlink. Under SELinux Enforcing, that symlink must be labeled `locale_t`. If it is `etc_t` (or anything else), timedated is denied (`Permission denied`) and aborts, but `timedatectl` and the Ansible module surface the failure poorly, so the change looks like it took. The denial may be **dontaudit-suppressed**, so `ausearch -m avc` can come up empty, hiding the real cause.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
```bash
|
||||
# The split state — these two should agree but won't:
|
||||
readlink /etc/localtime # e.g. .../Etc/UTC (the truth)
|
||||
timedatectl show -p Timezone --value # e.g. America/New_York (stale)
|
||||
date '+%Z %z' # confirms actual zone via the symlink
|
||||
|
||||
# The label — this is the smoking gun:
|
||||
ls -Z /etc/localtime # WRONG: ...:etc_t:s0
|
||||
matchpathcon /etc/localtime # EXPECTED: ...:locale_t:s0
|
||||
|
||||
# timedated's own error message (the AVC itself may be hidden by dontaudit):
|
||||
journalctl -u systemd-timedated | grep -i 'permission denied'
|
||||
```
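The split-state check above can be scripted for fleet-wide use. This is a sketch under the assumption that the symlink target contains a `zoneinfo/` path component; both function names are hypothetical.

```bash
# Zone name implied by an /etc/localtime symlink target,
# e.g. "../usr/share/zoneinfo/Etc/UTC" -> "Etc/UTC".
zone_from_link() {
    printf '%s\n' "${1##*zoneinfo/}"
}

# Succeeds (exit 0) when the symlink and the reported zone disagree.
is_split_state() {
    [ "$(zone_from_link "$1")" != "$2" ]
}
```

Usage: `is_split_state "$(readlink /etc/localtime)" "$(timedatectl show -p Timezone --value)" && echo "split state: check the label"`.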
|
||||
|
||||
## Fix
|
||||
|
||||
Relabel first, *then* set the timezone the normal way:
|
||||
|
||||
```bash
|
||||
restorecon -v /etc/localtime # etc_t -> locale_t
|
||||
timedatectl set-timezone America/New_York
|
||||
# verify all three agree now:
|
||||
date '+%F %T %Z (%z)'; readlink /etc/localtime; ls -Z /etc/localtime
|
||||
```
|
||||
|
||||
If you set the symlink by hand (`ln -sf`) as a stopgap, run `restorecon /etc/localtime` afterward — a manually created symlink can inherit `etc_t` and re-break the next `timedatectl` call.
|
||||
|
||||
Then restart anything that caches the zone at startup so logs/schedules switch over:
|
||||
|
||||
```bash
|
||||
systemctl restart rsyslog crond
|
||||
```
|
||||
|
||||
(`journalctl` renders in local time automatically; rsyslog-written logs like `/var/log/maillog` keep the old zone until rsyslog restarts.)
|
||||
|
||||
## Codify (Ansible)
|
||||
|
||||
Run `restorecon` on `/etc/localtime` **before** the timezone task, so a mislabeled symlink can't silently defeat it:
|
||||
|
||||
```yaml
|
||||
- name: Ensure correct SELinux label on /etc/localtime
|
||||
ansible.builtin.command: restorecon -v /etc/localtime
|
||||
register: localtime_relabel
|
||||
changed_when: "'Relabeled' in localtime_relabel.stdout"
|
||||
when: ansible_selinux.status | default('disabled') == 'enabled'
|
||||
|
||||
- name: Set timezone
|
||||
community.general.timezone:
|
||||
name: America/New_York
|
||||
```
|
||||
|
||||
On majormail this is in `roles/majormail/tasks/main.yml` (MajorAnsible commit `2ff566d`).
|
||||
|
||||
## Key Notes
|
||||
|
||||
- **`timedatectl`/the Ansible module lie here.** Always confirm with `readlink /etc/localtime` + `date`, not just `timedatectl show`.
|
||||
- **The denial can be invisible.** dontaudit rules may hide the AVC; trust the label mismatch (`ls -Z` vs `matchpathcon`) over an empty `ausearch`.
|
||||
- **Fresh cloud images are the usual offender** — a clean rebuild/provision is where the wrong label sneaks in.
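
Because the label mismatch is the most reliable signal, it may be worth checking it mechanically. The following is a minimal sketch that compares only the type field of two SELinux contexts; the helper names are hypothetical.

```bash
# Extract the type (third colon-separated field) from an SELinux context,
# e.g. "system_u:object_r:etc_t:s0" -> "etc_t".
selinux_type() {
    printf '%s\n' "$1" | cut -d: -f3
}

# Succeeds (exit 0) when the actual type differs from the expected type.
label_mismatch() {
    [ "$(selinux_type "$1")" != "$(selinux_type "$2")" ]
}
```

Usage, assuming GNU `stat` and `matchpathcon -n` are available on the host: `label_mismatch "$(stat -c %C /etc/localtime)" "$(matchpathcon -n /etc/localtime)" && restorecon -v /etc/localtime`.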
|
||||
|
||||
## Related
|
||||
|
||||
- [SELinux: Fixing Dovecot Mail Spool Context (/var/vmail)](selinux-dovecot-vmail-context.md)
|
||||
- [Dovecot IMAP Clients Fail to Sync: vsz_limit OOM from a Bloated Index Log](networking/dovecot-imap-oom-vsz-limit-bloated-index.md)
|
||||
|
|
@ -0,0 +1,120 @@
|
|||
---
|
||||
title: "Time Machine: Orphaned APFS .previous Folder Blocks All Backups"
|
||||
domain: troubleshooting
|
||||
category: general
|
||||
tags: [macos, time-machine, apfs, backup, fsck, disk-utility]
|
||||
status: published
|
||||
created: 2026-06-18
|
||||
updated: 2026-06-18
|
||||
---
|
||||
# Time Machine: Orphaned APFS `.previous` Folder Blocks All Backups
|
||||
|
||||
## Overview
|
||||
On an APFS Time Machine destination, an interrupted backup can leave behind an orphaned staging folder named `<timestamp>.previous` (plus a matching, uncatalogued APFS snapshot). Every subsequent backup reads that folder during *FindingChanges*, hits a metadata-type mismatch, and aborts — so backups silently stop running. macOS shows only a generic "**Time Machine couldn't complete the backup … An unknown error occurred.**"
|
||||
|
||||
The trap: because the orphan is **not in Time Machine's catalog** and the destination is OS-protected, every obvious removal tool (`rm`, `chmod`, `tmutil delete`, `diskutil deleteSnapshot`) refuses it. The clean fix is **First Aid (`fsck_apfs`)**, which has authority over the volume and clears the orphaned snapshot.
|
||||
|
||||
## Symptoms
|
||||
- "Time Machine couldn't complete the backup to '<disk>' — An unknown error occurred."
|
||||
- Backups haven't run since around the time of an interrupted/cancelled backup.
|
||||
- The destination disk is mounted and has plenty of free space (not full, not disconnected).
|
||||
- `tmutil status` cycles through `Starting` / `FindingChanges` and never reaches `Copying`.
|
||||
|
||||
## Root Cause
|
||||
`backupd` logs the real error on a loop (every ~15 s):
|
||||
|
||||
```bash
|
||||
log show --predicate 'subsystem == "com.apple.TimeMachine"' --last 10m --style compact \
|
||||
| grep -iE 'previous|error'
|
||||
```
|
||||
```
|
||||
[TMStructure] Expected SnapshotInProgressContainer metadata type but found APFSBackup
|
||||
metadata type at URL '.../<disk>/2026-06-17-172230.previous/'
|
||||
```
|
||||
|
||||
An earlier backup was interrupted mid-run. It left two orphans tied to that timestamp, **neither registered in Time Machine's backup catalog**:
|
||||
|
||||
1. A staging directory `<timestamp>.previous` on the destination volume.
|
||||
2. A matching APFS snapshot `com.apple.TimeMachine.<timestamp>.backup`.
|
||||
|
||||
Time Machine expects the staging folder to be a `SnapshotInProgressContainer` but finds completed-backup (`APFSBackup`) metadata, so it bails before copying anything.
|
||||
|
||||
> **Ignore the surrounding log noise.** `com.apple.backupd.sandbox.xpc: connection invalid`, `Mountpoint '…' is still valid`, and `missingName` on `/System/Volumes/Data/home` are all normal on a healthy backup — flagged `E` but harmless. The only line that matters is the `SnapshotInProgressContainer` mismatch.
|
||||
|
||||
## Diagnosis
|
||||
|
||||
Confirm the disk is healthy (not the problem) and locate the orphan:
|
||||
|
||||
```bash
|
||||
tmutil status # stuck in Starting/FindingChanges, never Copying
|
||||
df -h | grep -i "<disk-name>" # mounted, plenty free
|
||||
diskutil apfs listSnapshots <diskNsN> # note the highest/last snapshot timestamp
|
||||
```
|
||||
|
||||
If `listSnapshots` shows a final snapshot whose timestamp matches the `.previous` folder in the error, that's the orphaned pair.
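
To avoid matching timestamps by eye, the orphan's timestamp can be pulled from the `backupd` log line and turned into the snapshot name to look for. A sketch only: both helper names are hypothetical, and the timestamp pattern assumes the `YYYY-MM-DD-HHMMSS` form seen in the error above.

```bash
# Pull "<timestamp>" out of a log line that mentions "<timestamp>.previous".
orphan_timestamp() {
    grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{6}\.previous' | head -n 1 | sed 's/\.previous$//'
}

# Snapshot name expected to pair with that staging folder.
orphan_snapshot_name() {
    printf 'com.apple.TimeMachine.%s.backup\n' "$1"
}
```

Pipe the `log show` output into `orphan_timestamp`, then grep the `listSnapshots` output for the name that `orphan_snapshot_name` prints.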
|
||||
|
||||
## Why the Obvious Tools Fail
|
||||
|
||||
Do **not** burn time trying to force the folder out — here's what each tool does and why it refuses:
|
||||
|
||||
| Command | Result | Reason |
|
||||
|---|---|---|
|
||||
| `sudo rm -rf …/<ts>.previous` | `Operation not permitted` | TM applies a `group:everyone deny delete` ACL that overrides root. |
|
||||
| `sudo chmod -RN …/<ts>.previous` | runs for minutes, then fails | A `.previous` folder is a **full copy of the entire Mac filesystem**; `-R` walks the whole tree and can't clear ACLs on the SIP-`restricted` system files inside (`/usr/bin/sh`, frameworks, keymaps). `rm` then hits the same wall. |
|
||||
| `sudo tmutil delete -p …/<ts>.previous` | `Invalid deletion target (error 22)` | Not a registered backup. |
|
||||
| `sudo tmutil delete -t <timestamp>` | `error 2 (No such file)` | No catalog entry for that timestamp. |
|
||||
| `sudo diskutil apfs deleteSnapshot <diskNsN> -uuid <uuid>` | `Not a valid APFS Snapshot UUID` | TM-managed snapshot; diskutil won't remove it directly. |
|
||||
|
||||
> **If you started a `chmod -R` and killed it:** the live system is unaffected — `chmod -R` does not follow symlinks out of the backup tree. Verify with `ls -lde ~/Desktop` (normal ACLs = untouched). Stop a runaway with `sudo pkill -f '<timestamp>.previous'`.
|
||||
|
||||
## Fix — Run First Aid (`fsck_apfs`)
|
||||
|
||||
First Aid runs with full authority over the volume and clears the orphaned snapshot, which defuses the `.previous` folder's metadata mismatch.
|
||||
|
||||
```bash
|
||||
# 1. Stop the looping backup
|
||||
sudo tmutil stopbackup
|
||||
|
||||
# 2. Verify the destination volume (live mode is fine; read-only check)
|
||||
sudo diskutil verifyVolume <diskNsN>
|
||||
# or: Disk Utility → View → Show All Devices → select the TM volume → First Aid → Run
|
||||
```
|
||||
|
||||
`verifyVolume` enumerates and validates every snapshot; the verify/remount cycle purges the orphaned in-progress snapshot. Expected result:
|
||||
|
||||
```
|
||||
The volume <name> appears to be OK
|
||||
File system check exit code is 0
|
||||
```
|
||||
|
||||
Confirm the orphan snapshot is gone (count drops by one; the matching timestamp no longer appears):
|
||||
|
||||
```bash
|
||||
diskutil apfs listSnapshots <diskNsN>
|
||||
```
|
||||
|
||||
Then restart and watch it succeed:
|
||||
|
||||
```bash
|
||||
sudo tmutil startbackup --auto
|
||||
tmutil status # should reach BackupPhase = Copying with no SnapshotInProgressContainer errors
|
||||
```
|
||||
|
||||
If `verifyVolume` reports problems rather than "appears to be OK", run the repair (it must unmount the volume):
|
||||
|
||||
```bash
|
||||
sudo diskutil repairVolume <diskNsN>
|
||||
```
|
||||
|
||||
## Notes
|
||||
- The first backup after the fix is often a large catch-up (hundreds of GB) because the chain was broken — let it finish; it returns to quick hourly increments afterward.
|
||||
- The inert `<timestamp>.previous` **folder** may still sit on the volume after the fix. Time Machine now ignores it, so it's not blocking — but it consumes space. Removing it cleanly requires booting to **Recovery Mode**, `csrutil disable`, `rm -rf` the folder, then `csrutil enable` — only worth it to reclaim the space.
|
||||
- Time Machine identifies its destination by `DestinationID` (a UUID), not the volume name, so renaming the disk later is safe.
|
||||
- Interrupted backups are more likely on flaky USB-SATA bridge enclosures (e.g. some WD My Passport units) whose slow sleep/wake transitions can drop the drive mid-backup.
|
||||
|
||||
## Tags
|
||||
`macos` `time-machine` `apfs` `backup` `fsck-apfs` `disk-utility` `snapshot` `first-aid`
|
||||
|
||||
## See Also
|
||||
- [SnapRAID & MergerFS Storage Setup](../01-linux/storage/snapraid-mergerfs-setup.md)
|
||||
- MajorMac Incident Log (2026-06-18) — the originating incident
|
||||
|
|
@ -0,0 +1,193 @@
|
|||
---
|
||||
title: "WordPress 6.7 _load_textdomain_just_in_time Notice (Theme/Plugin Loads Translations Too Early)"
|
||||
domain: troubleshooting
|
||||
category: troubleshooting
|
||||
tags:
|
||||
- wordpress
|
||||
- wordpress-6.7
|
||||
- php
|
||||
- i18n
|
||||
- textdomain
|
||||
- theme
|
||||
- mu-plugin
|
||||
- deprecation
|
||||
- troubleshooting
|
||||
status: published
|
||||
created: 2026-06-21
|
||||
updated: 2026-06-21
|
||||
---
|
||||
|
||||
# WordPress 6.7 `_load_textdomain_just_in_time` Notice
|
||||
|
||||
> **TL;DR** — WordPress 6.7 added a `doing_it_wrong` notice that fires when a translation function (`__()`, `_e()`, `esc_html__()`, …) is called for a text domain **before the `init` action**. It's almost always a theme or plugin registering nav menus / sidebars / labels on `after_setup_theme` (which runs before `init`). The notice is **debug-only and harmless** — translations still load via the just-in-time fallback. If the offending code is in your own (or an updatable) theme/plugin, fix it at the source by deferring to `init`. If it's a **non-updating or third-party** theme you don't want to hand-edit, suppress *only this one notice* with a `doing_it_wrong_trigger_error` filter in a tiny mu-plugin.
|
||||
|
||||
---
|
||||
|
||||
## Symptom
|
||||
|
||||
With `WP_DEBUG` on (or in Query Monitor's PHP panel), you see:
|
||||
|
||||
```
|
||||
Function _load_textdomain_just_in_time was called incorrectly.
|
||||
Translation loading for the <domain> domain was triggered too early.
|
||||
This is usually an indicator for some code in the plugin or theme running too early.
|
||||
Translations should be loaded at the init action or later.
|
||||
(This message was added in version 6.7.0.)
|
||||
|
||||
_load_textdomain_just_in_time() wp-includes/l10n.php
|
||||
get_translations_for_domain() wp-includes/l10n.php
|
||||
translate() wp-includes/l10n.php
|
||||
__() wp-includes/l10n.php
|
||||
WordPress Core
|
||||
```
|
||||
|
||||
The key fields are **the domain name** (e.g. `marstheme`, `woocommerce`, `astra`) and the fact that the stack bottoms out in **WordPress Core** via `__()` — that tells you *some* extension called a translation function, not that core is broken.
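
When the notice is buried in a busy `debug.log`, the domain can be extracted mechanically. This is a rough sketch: `notice_domain` is a hypothetical helper, and it assumes the log line keeps the wording shown above, with or without `<code>` tags around the domain.

```bash
# Print the text domain named in each "_load_textdomain_just_in_time" notice on stdin.
notice_domain() {
    sed -nE 's/.*Translation loading for the (<code>)?([^< ]+)(<\/code>)? domain was triggered too early.*/\2/p'
}
```

Usage: `notice_domain < wp-content/debug.log | sort | uniq -c` lists each offending domain with a count.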
|
||||
|
||||
## Why it happens (the WP 6.7 change)
|
||||
|
||||
Before 6.7, WordPress silently "just-in-time" loaded a text domain the first time you translated a string in it. 6.7 kept the JIT loading but started **warning** when it's triggered before `init`, because:
|
||||
|
||||
- Translations loaded before `init` can't be filtered/overridden by other plugins that hook `init`.
|
||||
- It signals the extension is doing setup work earlier than the WordPress lifecycle intends.
|
||||
|
||||
The usual culprit is code on **`after_setup_theme`** (which fires *before* `init`) that translates a label inline, e.g.:
|
||||
|
||||
```php
|
||||
function mytheme_setup() {
|
||||
register_nav_menus( array(
|
||||
'primary' => __( 'Primary Menu', 'mytheme' ), // <-- translate call before init
|
||||
) );
|
||||
}
|
||||
add_action( 'after_setup_theme', 'mytheme_setup' );
|
||||
```
|
||||
|
||||
> **Important:** explicitly calling `load_theme_textdomain()` / `load_plugin_textdomain()` early does **not** fix the notice, and as of WP 4.6+ themes on wordpress.org don't even need to call it. The notice is about the *translate call*, not about whether the domain was loaded. Moving only the `load_*_textdomain()` call around is a common dead-end (see the gotcha below).
|
||||
|
||||
## Diagnostic chain
|
||||
|
||||
### 1. Identify the domain and what owns it
|
||||
|
||||
The notice names the domain. Find which theme/plugin uses it:
|
||||
|
||||
```bash
|
||||
WPROOT=/var/www/html
|
||||
grep -rlw '<domain>' "$WPROOT/wp-content/themes" "$WPROOT/wp-content/plugins" 2>/dev/null
|
||||
|
||||
# Which extension has the most references (i.e. owns the domain)?
|
||||
grep -rl '<domain>' "$WPROOT/wp-content/" 2>/dev/null \
|
||||
| sed -E "s#$WPROOT/wp-content/(themes|plugins|mu-plugins)/([^/]+)/.*#\1/\2#" \
|
||||
| sort | uniq -c | sort -rn | head
|
||||
```
|
||||
|
||||
> **Watch for renamed/forked themes.** The domain often does **not** match the theme's folder name. A theme bought as "Mars" and re-slugged to `kappa` keeps `marstheme` as its text domain in all 40+ template files. So `wp theme list` shows `kappa` active while the notice says `marstheme` — they're the same thing.
|
||||
|
||||
### 2. Confirm it's active and whether it can be updated
|
||||
|
||||
```bash
|
||||
sudo -u www-data wp --path=$WPROOT theme list --fields=name,status,version,update
|
||||
sudo -u www-data wp --path=$WPROOT plugin list --fields=name,status,version,update
|
||||
```
|
||||
|
||||
- `update available` → **update it first** (newest releases of most themes/plugins fixed this in late 2024/2025). That's the proper fix; the rest of this article is for when you can't.
|
||||
- `update none` on a **renamed/custom fork** → no upstream exists, so updating is impossible. Go to the suppression fix.
|
||||
|
||||
### 3. Pin down the early call (optional)
|
||||
|
||||
```bash
|
||||
grep -rn "__(\s*['\"].*['\"]\s*,\s*['\"]<domain>['\"]" \
|
||||
"$WPROOT/wp-content/themes/<theme>" | head
|
||||
```
|
||||
|
||||
Look for translate calls inside functions hooked to `after_setup_theme`, `setup_theme`, `plugins_loaded`, or run at file scope in `functions.php`.
|
||||
|
||||
## The fix
|
||||
|
||||
### Option A — fix it at the source (own / updatable code)
|
||||
|
||||
Defer the translation. Either register the raw string and translate at render time, or move the registration to `init`:
|
||||
|
||||
```php
|
||||
// Before: translated on after_setup_theme (too early)
|
||||
add_action( 'after_setup_theme', function () {
|
||||
register_nav_menus( array( 'primary' => __( 'Primary Menu', 'mytheme' ) ) );
|
||||
} );
|
||||
|
||||
// After: register the menu location on init, where translation is allowed
|
||||
add_action( 'init', function () {
|
||||
register_nav_menus( array( 'primary' => __( 'Primary Menu', 'mytheme' ) ) );
|
||||
} );
|
||||
```
|
||||
|
||||
Don't do this by editing a theme/plugin that receives updates — your change is wiped on the next update. Use Option B for those.
|
||||
|
||||
### Option B — suppress just this notice (third-party / non-updating code)
|
||||
|
||||
When the early call lives in a theme you don't control and can't update (a renamed commercial fork, an abandoned plugin), the clean, update-safe move is to silence **only** the `_load_textdomain_just_in_time` notice — not all `doing_it_wrong` output — via a must-use plugin.
|
||||
|
||||
Create `wp-content/mu-plugins/fix-textdomain.php`:
|
||||
|
||||
```php
|
||||
<?php
|
||||
/**
|
||||
* Suppress the WP 6.7 "_load_textdomain_just_in_time was called incorrectly"
|
||||
* notice for a theme/plugin that translates before init.
|
||||
*
|
||||
* Scope is intentionally narrow: only this one function is silenced, so other
|
||||
* doing_it_wrong notices still surface. Translations still load via the JIT
|
||||
* fallback, so nothing visible changes for visitors.
|
||||
*/
|
||||
add_filter( 'doing_it_wrong_trigger_error', function ( $trigger, $function_name ) {
|
||||
return '_load_textdomain_just_in_time' === $function_name ? false : $trigger;
|
||||
}, 10, 2 );
|
||||
```
|
||||
|
||||
`mu-plugins/` loads automatically (no activation, can't be deactivated from the admin), and runs early enough to register the filter before the notice fires.
|
||||
|
||||
#### Verify
|
||||
|
||||
```bash
|
||||
WPROOT=/var/www/html
|
||||
|
||||
# 1. Syntax-check the mu-plugin
|
||||
php -l "$WPROOT/wp-content/mu-plugins/fix-textdomain.php"
|
||||
# -> No syntax errors detected
|
||||
|
||||
# 2. Confirm WP still boots and the filter is registered
|
||||
sudo -u www-data wp --path=$WPROOT eval \
|
||||
'echo has_filter("doing_it_wrong_trigger_error") ? "filter set\n" : "MISSING\n";'
|
||||
|
||||
# 3. Clear the debug log, boot WordPress again (which re-runs the theme's early translate), confirm 0 new notices
|
||||
DBG="$WPROOT/wp-content/debug.log"
|
||||
[ -f "$DBG" ] && : > "$DBG"
|
||||
sudo -u www-data wp --path=$WPROOT eval '__("Primary Menu","<domain>");' >/dev/null 2>&1
|
||||
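# grep -c already prints 0 on no match; only fall back to echo when the log is missing
grep -c "<domain>" "$DBG" 2>/dev/null || [ -f "$DBG" ] || echo 0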
|
||||
# -> 0
|
||||
```
|
||||
|
||||
## Gotchas
|
||||
|
||||
### The "load the textdomain earlier/later" dead-end
|
||||
|
||||
A very common (wrong) first attempt is an mu-plugin that just calls `load_theme_textdomain()` on `plugins_loaded` or `after_setup_theme`:
|
||||
|
||||
```php
|
||||
// DOES NOT FIX THE NOTICE
|
||||
add_action( 'plugins_loaded', function () {
|
||||
load_theme_textdomain( 'mytheme', get_template_directory() . '/languages' );
|
||||
}, 0 );
|
||||
```
|
||||
|
||||
`plugins_loaded` still runs **before `init`**, and — more importantly — the notice is triggered by the theme's own early `__()` call, not by whether you've loaded the domain. This code is dead weight. If you find one in place, replace it with the Option B filter rather than tweaking its hook/priority.
|
||||
|
||||
### Don't blanket-suppress all deprecations
|
||||
|
||||
Resist `error_reporting(E_ALL & ~E_DEPRECATED)` or returning `false` from `doing_it_wrong_trigger_error` unconditionally — that also hides genuinely useful warnings (a plugin breaking on a future PHP/WP bump). Scope the filter to the one `function_name`.
|
||||
|
||||
### Renamed theme ⇒ domain ≠ folder
|
||||
|
||||
Re-stating because it costs the most time: the domain in the notice can be the theme's *original* slug, not its current folder. Always `grep` for the domain to find the real owner before concluding "I don't even have that theme installed."
|
||||
|
||||
## See also
|
||||
|
||||
- [Patching PHP 8.4 Implicit-Nullable Deprecations in Vendor Packages](php-84-vendor-implicit-nullable-patch.md) — the other "harmless deprecation that floods logs" pattern on the WordPress fleet
|
||||
- [WordPress developer note: i18n improvements in 6.7](https://make.wordpress.org/core/2024/10/21/i18n-improvements-in-6-7/) — the canonical reference for this change
|
||||
|
|
@ -0,0 +1,125 @@
|
|||
---
|
||||
title: "WSL2: PyTorch Training Deadlocks on Windows Filesystem Checkpoint Saves"
|
||||
domain: troubleshooting
|
||||
category: wsl2
|
||||
tags: [wsl2, pytorch, huggingface, training, llm, checkpoint, windows, ntfs, deadlock, majortwin]
|
||||
status: published
|
||||
created: 2026-05-23
|
||||
updated: 2026-05-23
|
||||
---
|
||||
|
||||
# WSL2: PyTorch Training Deadlocks on Windows Filesystem Checkpoint Saves
|
||||
|
||||
## Problem
|
||||
|
||||
A Hugging Face Trainer / Unsloth fine-tuning run starts successfully, logs training steps for a while, then freezes completely. The tqdm progress bar stops advancing, GPU utilization drops to near-zero, but the training process stays alive at 100% CPU with the full model loaded in VRAM. No new checkpoint directories appear.
|
||||
|
||||
**Confirming it's a checkpoint deadlock:**
|
||||
|
||||
```bash
|
||||
# Check if training is frozen — same step count + elapsed time across checks
|
||||
tmux capture-pane -t <session> -p | tail -5
|
||||
sleep 60
|
||||
tmux capture-pane -t <session> -p | tail -5
|
||||
|
||||
# GPU idle despite process alive
|
||||
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader
|
||||
|
||||
# No new checkpoint directories written
|
||||
ls -lt /mnt/d/your/training/output/ | head -10
|
||||
```
|
||||
|
||||
If the tqdm step count is identical both times and the newest directory timestamp is from a previous run, the save is deadlocked.
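
The two-capture comparison can be automated. A minimal sketch, assuming the pane shows a tqdm-style `current/total` counter and that the last such counter on screen is the training progress bar; `tqdm_step` is a hypothetical helper.

```bash
# Extract the current step from tqdm-style output on stdin,
# e.g. " 12%|#   | 120/1000 [10:03<1:13:45, 5.03s/it]" -> 120.
tqdm_step() {
    grep -oE '[0-9]+/[0-9]+' | tail -n 1 | cut -d/ -f1
}
```

Usage: capture `before=$(tmux capture-pane -t <session> -p | tqdm_step)`, sleep 60, capture `after` the same way, and treat `[ "$before" = "$after" ]` as a sign the run is frozen.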
|
||||
|
||||
---
|
||||
|
||||
## Root Cause
|
||||
|
||||
WSL2's `/mnt/d/` paths go through the **virtio-9p filesystem driver** to reach the host Windows NTFS volume. Large sequential writes — like saving a multi-GB PyTorch checkpoint (optimizer states, model weights, scheduler, RNG state) — can deadlock when:
|
||||
|
||||
- A Windows process (antivirus, VSS, Windows Search) holds a lock on the output directory
|
||||
- The Windows virtual disk hits write pressure from concurrent activity
|
||||
|
||||
The Linux process blocks in a kernel `write()` syscall waiting for virtio-9p to acknowledge the write. The process is alive and spinning at 100% CPU in the kernel, but no userspace progress occurs. This is distinct from OOM kills (which log clearly) and out-of-disk errors (which exit cleanly).
|
||||
|
||||
---
|
||||
|
||||
## Fix: Train on Linux-Native Storage
|
||||
|
||||
Keep all training I/O on Linux ext4 (`~/`), and copy final artifacts to Windows only after training completes.
|
||||
|
||||
### Change output paths
|
||||
|
||||
```bash
|
||||
# Before
|
||||
TRAIN_OUT="/mnt/d/corpus/training-runs/v8i"
|
||||
GGUF_OUT="/mnt/d/corpus/models"
|
||||
|
||||
# After — Linux-native for training
|
||||
TRAIN_OUT="/home/majorlinux/corpus/training-runs/v8i"
|
||||
GGUF_OUT="/home/majorlinux/corpus/models"
|
||||
```
|
||||
|
||||
The WSL2 home directory lives on a Linux ext4 `.vhdx` managed by WSL2 — writes here bypass virtio-9p entirely.
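
A training launcher can refuse to start when its output path points back at a Windows drive. The guard below is a sketch that assumes the default WSL2 automount root of `/mnt`; `is_windows_mount` is a hypothetical helper.

```bash
# Succeeds (exit 0) when a path lives under a WSL2 Windows drive mount,
# i.e. /mnt/<single letter> or anything beneath it.
is_windows_mount() {
    case "$1" in
        /mnt/[a-zA-Z] | /mnt/[a-zA-Z]/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Usage: `is_windows_mount "$TRAIN_OUT" && { echo "refusing to checkpoint to a Windows mount" >&2; exit 1; }` near the top of the launch script.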
|
||||
|
||||
### Copy to Windows after training finishes
|
||||
|
||||
```bash
|
||||
cp "$GGUF_OUT/majortwin-v8i-q4-k-m.gguf" "/mnt/d/corpus/models/"
|
||||
cp "$GGUF_OUT/majortwin-v8i-q4-k-m.gguf" "/mnt/d/MajorTwin/06-Models/"
|
||||
```
|
||||
|
||||
Single large-file copies to `/mnt/d/` complete reliably — it's repeated checkpoint saves during training that deadlock.
|
||||
|
||||
### Kill a stuck training process
|
||||
|
||||
```bash
|
||||
pkill -f 'train_v3.py'
|
||||
sleep 2
|
||||
tmux kill-session -t majortwin_v8i
|
||||
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader
|
||||
# Should show low utilization and <1GB memory used
|
||||
```
|
||||
|
||||
The original checkpoint files from the previous run in `/mnt/d/` are untouched — the deadlock prevents writes, it does not corrupt existing data.
|
||||
|
||||
---
|
||||
|
||||
## Why Previous Runs May Have Worked
|
||||
|
||||
The deadlock is not guaranteed. It depends on Windows-side state at checkpoint save time. Factors:
|
||||
|
||||
- Antivirus scanning newly created checkpoint files
|
||||
- Windows Search indexing the output directory
|
||||
- VSS snapshot in progress
|
||||
- Concurrent Windows desktop I/O
|
||||
|
||||
A run on a quiet machine may succeed; the same run during normal desktop use may deadlock.
|
||||
|
||||
---
|
||||
|
||||
## Confirming the Fix
|
||||
|
||||
```bash
|
||||
# Watch for checkpoint directories appearing at each save_steps interval
|
||||
watch -n 30 'ls -lt ~/corpus/training-runs/v8i/ | head -8'
|
||||
|
||||
# GPU should be active (85–99%) during training steps
|
||||
nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Notes
|
||||
|
||||
- Setting `save_strategy="no"` in TrainingArguments eliminates checkpoint saves entirely — useful as a diagnostic to confirm this is the cause, at the cost of no crash recovery.
|
||||
- `torch.compile()` / `torch._inductor` can add hours of CPU-bound kernel compilation before the first training step. Long startup + eventual freeze together can make a session look permanently stuck when they're actually two separate issues.
|
||||
- This applies to any large sequential WSL2→Windows write, not just PyTorch — large `rsync` or `tar` to `/mnt/<drive>/` can also stall.
|
||||
|
||||
---
|
||||
|
||||
## Related
|
||||
|
||||
- [[wsl2-rebuild-fedora43-training-env]] — Full WSL2 training environment setup
|
||||
- [[wsl2-backup-powershell]] — Backing up WSL2 virtual disks from PowerShell
|
||||
- [[ansible-wsl2-world-writable-mount-ignores-cfg]] — Other WSL2 filesystem quirks
|
||||
|
|
@ -10,7 +10,7 @@ tags:
|
|||
- deno
|
||||
status: published
|
||||
created: 2026-04-02
|
||||
updated: 2026-04-22T11:33
|
||||
updated: 2026-06-16T18:35
|
||||
---
|
||||
# yt-dlp YouTube JS Challenge Fix (Fedora)
|
||||
|
||||
|
|
@@ -84,12 +84,43 @@ echo '--remote-components ejs:github' > ~/.config/yt-dlp/config

## Maintenance

YouTube pushes extractor changes frequently. Keep yt-dlp current:
YouTube pushes extractor changes frequently. Keep yt-dlp current.

### Updating: the `-U` trap + avoid duplicate installs

`yt-dlp -U` **does not work** when yt-dlp was installed via pip/PyPI, because the PyPI build deliberately disables the self-updater:

```
ERROR: You installed yt-dlp with pip or using the wheel from PyPi; Use that to update
```

Update through pip instead. **Pick one install method and stick to it.** Running both a user install and a system install leaves two copies that drift out of sync: one gets updated while the other stays stale and may shadow it, depending on `$PATH` and whether sudo is used.

**Recommended — single user install (no sudo):**

```bash
pip3 install -U --user yt-dlp
```

This installs to `~/.local/bin/yt-dlp`, which normally comes first on a regular user's `$PATH`. Update it the same way; never use sudo.
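
Whether the user copy actually wins depends on directory order in `$PATH`. A minimal sketch in POSIX shell for checking that order; `path_precedes` is a hypothetical helper, and `/usr/local/bin` as the location of a system-wide pip copy is an assumption based on the `/usr/local` prefix mentioned in this article:

```bash
#!/bin/sh
# Hypothetical helper: return 0 if directory $1 appears before directory $2
# in a PATH-style string ($3, defaulting to the current $PATH).
# Returns 1 if $2 comes first or if $1 is not listed at all.
path_precedes() {
    first=$1
    second=$2
    search=${3:-$PATH}
    old_ifs=$IFS
    IFS=:
    for dir in $search; do
        if [ "$dir" = "$first" ]; then IFS=$old_ifs; return 0; fi
        if [ "$dir" = "$second" ]; then IFS=$old_ifs; return 1; fi
    done
    IFS=$old_ifs
    return 1
}

# Example:
#   path_precedes "$HOME/.local/bin" /usr/local/bin && echo "user install takes precedence"
```

The optional third argument exists so the check can be run against a saved or hypothetical `PATH` value, for example the one seen under sudo.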

**Alternative — system-wide (Fedora, PEP 668):**

```bash
sudo pip install -U yt-dlp --break-system-packages
```

> Only use `--break-system-packages` if you intentionally want a root-owned copy in `/usr/local`. Do **not** mix it with a `--user` install.

**Check for and remove a duplicate install:**

```bash
which -a yt-dlp  # more than one path = duplicate installs
sudo pip3 uninstall -y yt-dlp  # removes the /usr/local (system) copy + its wrapper
```
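
The duplicate check can also be done without depending on `which` being installed. A minimal sketch in POSIX shell; `list_copies` is a hypothetical helper, and a directory that is listed twice on `$PATH` will be reported twice:

```bash
#!/bin/sh
# Hypothetical helper: print every executable named $1 found on a PATH-style
# string ($2, defaulting to the current $PATH), one per line, in lookup order.
list_copies() {
    name=$1
    search=${2:-$PATH}
    old_ifs=$IFS
    IFS=:
    for dir in $search; do
        [ -n "$dir" ] || continue              # ignore empty PATH entries
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
    IFS=$old_ifs
}

# Example:
#   list_copies yt-dlp
```

More than one line of output corresponds to the "more than one path" case; the first line is the copy the shell will run.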

> If installed via the standalone binary (not pip), `yt-dlp -U` is the correct updater.

---

## Known Limitations

@@ -2,7 +2,7 @@
title: MajorWiki Deployment Status
status: deployed
project: MajorTwin
updated: 2026-04-07T10:48
updated: 2026-04-30T05:30
created: 2026-04-02T16:10
---

@@ -1,6 +1,6 @@
---
created: 2026-04-06T09:52
updated: 2026-04-29T22:46
updated: 2026-04-30T05:21
---
# MajorLinux Tech Wiki — Index

52 SUMMARY.md

@@ -1,6 +1,6 @@
---
created: 2026-04-02T16:03
updated: 2026-04-29T23:55
updated: 2026-06-21T11:46
---
* [Home](index.md)
* [Linux & Sysadmin](01-linux/index.md)

@@ -12,10 +12,12 @@ updated: 2026-04-29T23:55
* [Bash Scripting Patterns](01-linux/shell-scripting/bash-scripting-patterns.md)
* [SnapRAID & MergerFS Storage Setup](01-linux/storage/snapraid-mergerfs-setup.md)
* [mdadm — Rebuilding a RAID Array After Reinstall](01-linux/storage/mdadm-raid-rebuild.md)
* [Growing an LVM Volume by Absorbing Another Disk](01-linux/storage/lvm-grow-volume-absorb-disk.md)
* [Linux Distro Guide for Beginners](01-linux/distro-specific/linux-distro-guide-beginners.md)
* [WSL2 Instance Migration to Fedora 43](01-linux/distro-specific/wsl2-instance-migration-fedora43.md)
* [WSL2 Training Environment Rebuild](01-linux/distro-specific/wsl2-rebuild-fedora43-training-env.md)
* [WSL2 Backup via PowerShell](01-linux/distro-specific/wsl2-backup-powershell.md)
* [WSL2 In-Place Upgrade to Fedora 44](01-linux/distro-specific/wsl2-fedora44-inplace-upgrade.md)
* [Self-Hosting & Homelab](02-selfhosting/index.md)
* [Self-Hosting Starter Guide](02-selfhosting/docker/self-hosting-starter-guide.md)
* [Docker vs VMs for the Homelab](02-selfhosting/docker/docker-vs-vms-homelab.md)

@@ -28,15 +30,23 @@ updated: 2026-04-29T23:55
* [Wake-on-LAN via Router SSH](02-selfhosting/dns-networking/wake-on-lan-router-ssh.md)
* [Pi-hole v6 Group Management — Per-Client DNS Rules](02-selfhosting/dns-networking/pihole-v6-group-management.md)
* [AWS S3 Cost Management](02-selfhosting/cloud/aws-s3-cost-management.md)
* [VPS Migration Baseline Checklist](02-selfhosting/cloud/vps-migration-baseline-checklist.md)
* [rsync Backup Patterns](02-selfhosting/storage-backup/rsync-backup-patterns.md)
* [Fleet Backups with restic + B2](02-selfhosting/storage-backup/restic-b2-fleet-backups.md)
* [Tuning Netdata Web Log Alerts](02-selfhosting/monitoring/tuning-netdata-web-log-alerts.md)
* [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md)
* [Deploying Netdata to a New Server](02-selfhosting/monitoring/netdata-new-server-setup.md)
* [Netdata SELinux AVC Denial Monitoring](02-selfhosting/monitoring/netdata-selinux-avc-chart.md)
* [Netdata n8n Enriched Alert Emails](02-selfhosting/monitoring/netdata-n8n-enriched-alerts.md)
* [Logwatch Fleet Setup — Surviving Package Upgrades](02-selfhosting/monitoring/logwatch-fleet-setup.md)
* [Updating n8n Running in Docker](02-selfhosting/services/updating-n8n-docker.md)
* [Mastodon Instance Tuning](02-selfhosting/services/mastodon-instance-tuning.md)
* [Mastodon Post-Install Hardening (Permissions + Account)](02-selfhosting/services/mastodon-post-install-hardening.md)
* [Mastodon — The `--prune-profiles` Trap and How to Recover](02-selfhosting/services/mastodon-prune-profiles-trap.md)
* [Mastodon on S3 — Silent Upload Failures (BucketOwnerEnforced/ACLs)](02-selfhosting/services/mastodon-s3-acl-upload-failures.md)
* [Mastodon — Triaging Crowdfunding / Mention-Spam Accounts](02-selfhosting/services/mastodon-mention-spam-crowdfunding.md)
* [Ghost Email Configuration with Mailgun](02-selfhosting/services/ghost-smtp-mailgun-setup.md)
* [Inbound Spam Filtering: spamass-milter + SpamAssassin Bayes](02-selfhosting/services/postfix-spamassassin-bayes-spam-filtering.md)
* [Claude Code Remote Control — Mobile Access to a Persistent Host Session](02-selfhosting/services/claude-code-remote-control.md)
* [Linux Server Hardening Checklist](02-selfhosting/security/linux-server-hardening-checklist.md)
* [Standardizing unattended-upgrades with Ansible](02-selfhosting/security/ansible-unattended-upgrades-fleet.md)

@@ -50,7 +60,10 @@ updated: 2026-04-29T23:55
* [Fail2ban Custom Jail: Nginx Bad Request Detection](02-selfhosting/security/fail2ban-nginx-bad-request-jail.md)
* [Fail2ban Custom Jail: Apache Bad Request Detection](02-selfhosting/security/fail2ban-apache-bad-request-jail.md)
* [SSH Hardening Fleet-Wide with Ansible](02-selfhosting/security/ssh-hardening-ansible-fleet.md)
* [Migrating Flat Ansible Playbooks to Roles (Safely)](02-selfhosting/security/ansible-flat-playbooks-to-roles.md)
* [ClamAV Fleet Deployment with Ansible](02-selfhosting/security/clamav-fleet-deployment.md)
* [Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts](02-selfhosting/security/fail2ban-digest-mode-fleet.md)
* [Apache CVE-2026-23918 — HTTP/2 Double Free Mitigation](02-selfhosting/security/apache-cve-2026-23918-http2-mitigation.md)
* [Open Source & Alternatives](03-opensource/index.md)
* [SearXNG: Private Self-Hosted Search](03-opensource/alternatives/searxng.md)
* [FreshRSS: Self-Hosted RSS Reader](03-opensource/alternatives/freshrss.md)

@@ -65,42 +78,79 @@ updated: 2026-04-29T23:55
* [Streaming & Podcasting](04-streaming/index.md)
* [OBS Studio Setup & Encoding](04-streaming/obs/obs-studio-setup-encoding.md)
* [Plex 4K Codec Compatibility (Apple TV)](04-streaming/plex/plex-4k-codec-compatibility.md)
* [HEVC Batch Re-Encode for Plex Using VAAPI (AMD GPU)](04-streaming/plex/hevc-vaapi-batch-encode.md)
* [Plex Transcoding Troubleshooting](04-streaming/plex/plex-transcoding-troubleshooting.md)
* [Troubleshooting](05-troubleshooting/index.md)
* [Wi-Fi Game Streaming Stutter: 160 MHz Channel Width Saturating the 5 GHz Radio](05-troubleshooting/networking/wifi-160mhz-airtime-saturation-game-streaming.md)
* [Steam Deck Wi-Fi Flapping: IWD Periodic Scan + rtw88 Power Save](05-troubleshooting/networking/steam-deck-wifi-flapping-iwd-periodic-scan-rtw88.md)
* [Apache Outage: Fail2ban Self-Ban + Missing iptables Rules](05-troubleshooting/networking/fail2ban-self-ban-apache-outage.md)
* [Postfix + SendGrid: TLS Handshake Failure (Port 465 vs 587)](05-troubleshooting/networking/postfix-sendgrid-tls-handshake-failure.md)
* [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](05-troubleshooting/networking/fail2ban-imap-self-ban-mail-client.md)
* [firewalld: Mail Ports Wiped After Reload](05-troubleshooting/networking/firewalld-mail-ports-reset.md)
* [Dovecot IMAP Clients Fail to Sync: vsz_limit OOM from a Bloated Index Log](05-troubleshooting/networking/dovecot-imap-oom-vsz-limit-bloated-index.md)
* [Postfix header_checks Can't Act on Milter-Added Headers (Use Sieve)](05-troubleshooting/networking/postfix-header-checks-vs-milter-headers.md)
* [Dovecot Phantom Mailboxes from .dovecot.lda-dupes (mail_home Overlapping the Maildir Root)](05-troubleshooting/networking/dovecot-mail-home-maildir-root-phantom-mailboxes.md)
* [Tailscale SSH: Unexpected Re-Authentication Prompt](05-troubleshooting/networking/tailscale-ssh-reauth-prompt.md)
* [ssh.socket Unreachable After Reboot (Tailscale Race Condition)](05-troubleshooting/networking/ssh-socket-tailscale-race-condition.md)
* [Fail2ban & UFW Rule Bloat Cleanup](05-troubleshooting/networking/fail2ban-ufw-rule-bloat-cleanup.md)
* [Custom Fail2ban Jail: Apache Directory Scanning](05-troubleshooting/security/apache-dirscan-fail2ban-jail.md)
* [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md)
* [Castopod: Stale Federated Avatar URLs After Remote Profile Updates](05-troubleshooting/security/castopod-stale-federated-avatar.md)
* [Castopod Posts Don't Appear on Mastodon — Diagnosing the Federation Path](05-troubleshooting/security/castopod-broadcast-not-on-mastodon.md)
* [Nextcloud AIO Unhealthy 20h After Nightly Update](05-troubleshooting/docker/nextcloud-aio-unhealthy-20h-stuck.md)
* [n8n Behind Reverse Proxy: X-Forwarded-For Trust Fix](05-troubleshooting/docker/n8n-proxy-trust-x-forwarded-for.md)
* [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](05-troubleshooting/docker-caddy-selinux-post-reboot-recovery.md)
* [ISP SNI Filtering with Caddy](05-troubleshooting/isp-sni-filtering-caddy.md)
* [Obsidian Vault Recovery — Loading Cache Hang](05-troubleshooting/obsidian-cache-hang-recovery.md)
* [Qwen2.5-14B OOM on RTX 3080 Ti (12GB)](05-troubleshooting/gpu-display/qwen-14b-oom-3080ti.md)
* [LoRA adapter — GGUF conversion fails with 'config.json not found'](05-troubleshooting/gpu-display/lora-adapter-gguf-conversion-fails.md)
* [yt-dlp YouTube JS Challenge Fix on Fedora](05-troubleshooting/yt-dlp-fedora-js-challenge.md)
* [Gemini CLI Manual Update](05-troubleshooting/gemini-cli-manual-update.md)
* [MajorWiki Setup & Publishing Pipeline](05-troubleshooting/majwiki-setup-and-pipeline.md)
* [Gitea Actions Runner: Boot Race Condition Fix](05-troubleshooting/gitea-runner-boot-race-network-target.md)
* [Forgejo: Account Recovery & CLI Admin When Locked Out of the GUI](05-troubleshooting/forgejo-mailer-and-cli-recovery.md)
* [Cron Heartbeat False Alarm: /var/run Cleared by Reboot](05-troubleshooting/cron-heartbeat-tmpfs-reboot-false-alarm.md)
* [SELinux: Fixing Dovecot Mail Spool Context (/var/vmail)](05-troubleshooting/selinux-dovecot-vmail-context.md)
* [SELinux: Wrong /etc/localtime Label Silently Breaks Timezone Changes](05-troubleshooting/selinux-localtime-label-breaks-timezone.md)
* [mdadm RAID Recovery After USB Hub Disconnect](05-troubleshooting/storage/mdadm-usb-hub-disconnect-recovery.md)
* [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md)
* [Windows OpenSSH: WSL Default Shell Breaks Remote Commands](05-troubleshooting/networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md)
* [Pi-hole AI Blocklist Blocks Claude Desktop (ERR_CONNECTION_REFUSED)](05-troubleshooting/networking/pihole-blocks-claude-desktop.md)
* [Claude Desktop MCP Server Started via wsl.exe Sees Empty Environment (WSLENV)](05-troubleshooting/wsl-env-claude-desktop-mcp.md)
* [Claude Desktop MCP Mass-Disconnect After Blocking SSH Reboot](05-troubleshooting/claude-desktop-mcp-mass-disconnect-blocking-reboot.md)
* [Patching PHP 8.4 Implicit-Nullable Deprecations in Vendor Packages](05-troubleshooting/php-84-vendor-implicit-nullable-patch.md)
* [WordPress 6.7 `_load_textdomain_just_in_time` Notice (Translations Loaded Too Early)](05-troubleshooting/wordpress-67-textdomain-just-in-time-notice.md)
* [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md)
* [Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt](05-troubleshooting/ollama-chat-template-pipe-stdin-bypass.md)
* [Claude Code Won't Log In (Warp & iTerm2) — Corrupt Keychain Credential](05-troubleshooting/claude-code-warp-login-corrupt-keychain-credential.md)
* [Claude Code Keychain Prompt Keeps Reappearing on macOS (ACL Invalidation)](05-troubleshooting/claude-code-keychain-prompt-recurring-macos.md)
* [iPhone Mirroring Hangs on 'Connecting…' — AWDL Data Stall (27.0 Beta)](05-troubleshooting/iphone-mirroring-connecting-hang-awdl-stall-beta.md)
* [rsync over Tailscale: Hung in TCP Teardown After Transfer Completes](05-troubleshooting/networking/rsync-tailscale-teardown-stall.md)
* [iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators](05-troubleshooting/networking/tailscale-status-json-hostname-localhost-ios.md)
* [macOS: Repeating Alert Tone from Mirrored iPhone Notification](05-troubleshooting/macos-mirrored-notification-alert-loop.md)
* [Auditing & Cleaning macOS Background App Activity (sfltool dumpbtm)](05-troubleshooting/macos-background-app-activity-audit-sfltool.md)
* [Time Machine: Orphaned APFS `.previous` Folder Blocks All Backups](05-troubleshooting/time-machine-apfs-orphaned-previous-blocks-backup.md)
* [OBS Studio: Stale Script Paths After Windows Profile Rename](05-troubleshooting/obs-stale-script-paths-after-windows-profile-rename.md)
* [ClamAV CPU Spike: Safe Scheduling with nice/ionice](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md)
* [Logwatch Falsely Reports 'No freshclam updates' in ClamAV Daemon Mode](05-troubleshooting/security/freshclam-logwatch-false-no-updates.md)
* [Fedora CA Bundle Missing Symlink — TLS Breaks Fleet-Wide](05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md)
* [Netdata apps-group FD-utilisation false 100% (silenced fleet-wide)](05-troubleshooting/security/netdata-apps-fds-group-false-positive.md)
* [Ansible: Vault Password File Not Found](05-troubleshooting/ansible-vault-password-file-missing.md)
* [Ansible: ansible.cfg Ignored on WSL2 Windows Mounts](05-troubleshooting/ansible-wsl2-world-writable-mount-ignores-cfg.md)
* [WSL2: PyTorch Training Deadlocks on Windows Filesystem Checkpoint Saves](05-troubleshooting/wsl2-pytorch-checkpoint-windows-filesystem-deadlock.md)
* [Ansible: SSH Timeout During dnf upgrade on Fedora Hosts](05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md)
* [Ansible: regex_search Capture-Group Argument Fails in set_fact](05-troubleshooting/ansible-regex-search-set-fact-capture-group.md)
* [Ansible: Ubuntu Reboot Detection Misses Kernel Upgrades](05-troubleshooting/ansible-ubuntu-reboot-detection-kernel-mismatch.md)
* [Ansible: reboot.yml become Timeout on WSL2 Hosts (Exclude Them)](05-troubleshooting/ansible-reboot-become-timeout-wsl2.md)
* [Fedora Networking & Kernel Troubleshooting](05-troubleshooting/fedora-networking-kernel-recovery.md)
* [Systemd Session Scope Fails at Login](05-troubleshooting/systemd/session-scope-failure-at-login.md)
* [wget/curl: URLs with Special Characters Fail in Bash](05-troubleshooting/wget-url-special-characters.md)
* [Ansible: Check Mode False Positives in Verify/Assert Tasks](05-troubleshooting/ansible-check-mode-false-positives.md)
* [Ansible Fails with Permission Denied While `ssh <alias>` Works (Host Alias Bypass)](05-troubleshooting/ansible-ssh-host-alias-bypass.md)
* [SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure (No `Host` Block)](05-troubleshooting/networking/ssh-missing-host-block-magicdns-host-key-failure.md)
* [MagicDNS Names vs Pinned IPs for Tailscale SSH (After a Fleet Migration)](05-troubleshooting/networking/tailscale-ssh-magicdns-vs-pinned-ip-after-migration.md)
* [`Permission denied (publickey)` After Rotating a Key — Backfill Every `authorized_keys`](05-troubleshooting/networking/ssh-rotated-key-not-backfilled-authorized-keys.md)
* [Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration](05-troubleshooting/networking/ansible-host-key-verification-failed-rebuilt-host.md)
* [Logwatch Reports the Wrong Hostname (`<host>-hetzner`) After a Migration](05-troubleshooting/logwatch-wrong-hostname-after-migration.md)
* [Ghost EmailAnalytics Lag Warning — What It Means and When to Worry](05-troubleshooting/ghost-emailanalytics-lag-warning.md)
* [claude-mem: --setting-sources Empty Arg Bug (Claude Code 2.1.x)](05-troubleshooting/claude-mem-setting-sources-empty-arg.md)

341 index.md

@@ -1,179 +1,214 @@
---
created: 2026-04-06T09:52
updated: 2026-04-29T22:45
updated: 2026-05-10T01:30
---
# MajorLinux Tech Wiki — Index

> A growing reference of Linux, self-hosting, open source, streaming, and troubleshooting guides. Written by MajorLinux. Used by MajorTwin.
>
> **Last updated:** 2026-04-18
> **Article count:** 89
> **Last updated:** 2026-05-10
> **Article count:** 111

## Domains

| Domain | Folder | Articles |
|---|---|---|
| 🐧 Linux & Sysadmin | `01-linux/` | 12 |
| 🏠 Self-Hosting & Homelab | `02-selfhosting/` | 32 |
| 🏠 Self-Hosting & Homelab | `02-selfhosting/` | 39 |
| 🔓 Open Source Tools | `03-opensource/` | 10 |
| 🎙️ Streaming & Podcasting | `04-streaming/` | 2 |
| 🔧 General Troubleshooting | `05-troubleshooting/` | 34 |
| 🔧 General Troubleshooting | `05-troubleshooting/` | 48 |

---

## 🐧 Linux & Sysadmin

### Files & Permissions
- [Linux File Permissions](01-linux/files-permissions/linux-file-permissions.md) — chmod, chown, special bits, finding permission problems
### Distro-Specific
- [Linux Distro Guide for Beginners](01-linux/distro-specific/linux-distro-guide-beginners.md)
- [WSL2 Backup via PowerShell Scheduled Task](01-linux/distro-specific/wsl2-backup-powershell.md)
- [WSL2 Instance Migration (Fedora 43)](01-linux/distro-specific/wsl2-instance-migration-fedora43.md)
- [Wsl2 Rebuild Fedora43 Training Env](01-linux/distro-specific/wsl2-rebuild-fedora43-training-env.md)

### Process Management
- [Managing Linux Services with systemd](01-linux/process-management/managing-linux-services-systemd-ansible.md) — systemctl, journalctl, writing service files, Ansible service management
### Files & Permissions
- [Linux File Permissions and Ownership](01-linux/files-permissions/linux-file-permissions.md)

### Networking
- [SSH Config & Key Management](01-linux/networking/ssh-config-key-management.md) — key generation, ssh-copy-id, ~/.ssh/config, managing multiple keys, Windows OpenSSH admin key auth
- [SSH Config and Key Management](01-linux/networking/ssh-config-key-management.md)

### Package Management
- [Package Management Reference](01-linux/packages/package-management-reference.md) — apt, dnf, pacman side-by-side reference, Flatpak/Snap
- [Linux Package Management Reference: apt, dnf, pacman](01-linux/packages/package-management-reference.md)

### Process Management
- [Managing Linux Services: systemd and Ansible](01-linux/process-management/managing-linux-services-systemd-ansible.md)

### Shell & Scripting
- [Ansible Getting Started](01-linux/shell-scripting/ansible-getting-started.md) — inventory, ad-hoc commands, playbooks, handlers, roles
- [Bash Scripting Patterns](01-linux/shell-scripting/bash-scripting-patterns.md) — set -euo pipefail, logging, error handling, argument parsing, common patterns
- [Ansible Getting Started: Inventory, Playbooks, and Ad-Hoc Commands](01-linux/shell-scripting/ansible-getting-started.md)
- [Bash Scripting Patterns for Sysadmins](01-linux/shell-scripting/bash-scripting-patterns.md)

### Storage
- [SnapRAID & MergerFS Storage Setup](01-linux/storage/snapraid-mergerfs-setup.md) — Pooling mismatched drives and adding parity on Linux
- [mdadm — Rebuilding a RAID Array After Reinstall](01-linux/storage/mdadm-raid-rebuild.md) — reassembling and recovering mdadm arrays after OS reinstall
- [SnapRAID & MergerFS Storage Setup](01-linux/storage/snapraid-mergerfs-setup.md)
- [mdadm — Rebuilding a RAID Array After Reinstall](01-linux/storage/mdadm-raid-rebuild.md)

### Distro-Specific
- [Linux Distro Guide for Beginners](01-linux/distro-specific/linux-distro-guide-beginners.md) — Ubuntu recommendation, distro comparison, desktop environments
- [WSL2 Instance Migration to Fedora 43](01-linux/distro-specific/wsl2-instance-migration-fedora43.md) — moving WSL2 VHDX from C: to another drive
- [WSL2 Training Environment Rebuild (Fedora 43)](01-linux/distro-specific/wsl2-rebuild-fedora43-training-env.md) — rebuilding the MajorTwin training env in WSL2 from scratch
- [WSL2 Backup via PowerShell Scheduled Task](01-linux/distro-specific/wsl2-backup-powershell.md) — automating WSL2 exports on a schedule using PowerShell

---

## 🏠 Self-Hosting & Homelab

### Docker & Containers
- [Self-Hosting Starter Guide](02-selfhosting/docker/self-hosting-starter-guide.md) — hardware options, Docker install, first services, networking basics
- [Docker vs VMs for the Homelab](02-selfhosting/docker/docker-vs-vms-homelab.md) — when to use containers vs VMs, KVM setup, how to run both
- [Debugging Broken Docker Containers](02-selfhosting/docker/debugging-broken-docker-containers.md) — logs, inspect, exec, port conflicts, permission errors
- [Docker Healthchecks](02-selfhosting/docker/docker-healthchecks.md) — writing and debugging HEALTHCHECK instructions in Docker containers
- [Watchtower SMTP via Localhost Postfix Relay](02-selfhosting/docker/watchtower-smtp-localhost-relay.md) — credential-free container update notifications by routing through a local Postfix relay

### Reverse Proxies
- [Setting Up Caddy as a Reverse Proxy](02-selfhosting/reverse-proxy/setting-up-caddy-reverse-proxy.md) — Caddyfile basics, automatic HTTPS, local TLS, DNS challenge
### Cloud
- [AWS S3 Cost Management](02-selfhosting/cloud/aws-s3-cost-management.md)

### DNS & Networking
- [Tailscale for Homelab Remote Access](02-selfhosting/dns-networking/tailscale-homelab-remote-access.md) — installation, MagicDNS, making services accessible, subnet router, ACLs
- [Network Overview](02-selfhosting/dns-networking/network-overview.md) — MajorsHouse network topology, Tailscale IPs, and connectivity map
- [Wake-on-LAN via Router SSH](02-selfhosting/dns-networking/wake-on-lan-router-ssh.md) — send WOL magic packets through an Asus router over SSH, with Ansible vault integration
- [Network Overview](02-selfhosting/dns-networking/network-overview.md)
- [Pi-hole DoH / DoT Bypass Defense](02-selfhosting/dns-networking/pihole-doh-dot-bypass-defense.md)
- [Pi-hole v6 Adlist Management via SQL](02-selfhosting/dns-networking/pihole-v6-adlist-management.md)
- [Pi-hole v6 Group Management: Per-Client DNS Rules](02-selfhosting/dns-networking/pihole-v6-group-management.md)
- [Tailscale for Homelab Remote Access](02-selfhosting/dns-networking/tailscale-homelab-remote-access.md)
- [Wake-on-LAN via Router SSH](02-selfhosting/dns-networking/wake-on-lan-router-ssh.md)

### Cloud
- [AWS S3 Cost Management](02-selfhosting/cloud/aws-s3-cost-management.md) — identify and control S3 costs: lifecycle rules, storage class selection, bucket inventory, unexpected-growth investigation

### Storage & Backup
- [rsync Backup Patterns](02-selfhosting/storage-backup/rsync-backup-patterns.md) — flags reference, remote backup, incremental with hard links, cron/systemd
### Docker & Containers
- [Debugging Broken Docker Containers](02-selfhosting/docker/debugging-broken-docker-containers.md)
- [Docker Healthchecks](02-selfhosting/docker/docker-healthchecks.md)
- [Docker vs VMs in the Homelab: Why Not Both?](02-selfhosting/docker/docker-vs-vms-homelab.md)
- [Self-Hosting Starter Guide](02-selfhosting/docker/self-hosting-starter-guide.md)
- [Watchtower SMTP via Localhost Postfix Relay](02-selfhosting/docker/watchtower-smtp-localhost-relay.md)

### Monitoring
- [Tuning Netdata Web Log Alerts](02-selfhosting/monitoring/tuning-netdata-web-log-alerts.md) — tuning web_log_1m_redirects threshold for HTTPS-forcing servers
- [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md) — preventing false alerts during nightly Nextcloud AIO container update cycles
- [Deploying Netdata to a New Server](02-selfhosting/monitoring/netdata-new-server-setup.md) — install, email notifications, and Netdata Cloud claim for Ubuntu/Debian servers
- [Netdata + n8n Enriched Alert Emails](02-selfhosting/monitoring/netdata-n8n-enriched-alerts.md) — rich HTML alert emails with remediation steps and wiki links via n8n
- [Netdata SELinux AVC Denial Monitoring](02-selfhosting/monitoring/netdata-selinux-avc-chart.md) — custom Netdata chart for tracking SELinux AVC denials
- [Deploying Netdata to a New Server](02-selfhosting/monitoring/netdata-new-server-setup.md)
- [Netdata SELinux AVC Denial Monitoring](02-selfhosting/monitoring/netdata-selinux-avc-chart.md)
- [Netdata n8n Enriched Alert Emails](02-selfhosting/monitoring/netdata-n8n-enriched-alerts.md)
- [Tuning Netdata Docker Health Alarms to Prevent Update Flapping](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md)
- [Tuning Netdata Web Log Alerts](02-selfhosting/monitoring/tuning-netdata-web-log-alerts.md)

### Reverse Proxies
- [Setting Up a Reverse Proxy with Caddy](02-selfhosting/reverse-proxy/setting-up-caddy-reverse-proxy.md)

### Security
- [Linux Server Hardening Checklist](02-selfhosting/security/linux-server-hardening-checklist.md) — non-root user, SSH key auth, sshd_config, firewall, fail2ban, SpamAssassin
- [Standardizing unattended-upgrades with Ansible](02-selfhosting/security/ansible-unattended-upgrades-fleet.md) — fleet-wide automatic security updates across Ubuntu servers
- [Fail2ban Custom Jail: Apache 404 Scanner Detection](02-selfhosting/security/fail2ban-apache-404-scanner-jail.md) — custom filter and jail for blocking 404 scanners
- [Fail2ban Custom Jail: Apache PHP Webshell Probe Detection](02-selfhosting/security/fail2ban-apache-php-probe-jail.md) — catching PHP webshell/backdoor probes that return 301 on HTTPS-redirecting servers
- [Fail2ban Custom Jail: WordPress Login Brute Force](02-selfhosting/security/fail2ban-wordpress-login-jail.md) — access-log-based wp-login.php brute force detection without plugins
- [SELinux: Fixing Fail2ban grep execmem Denial](02-selfhosting/security/selinux-fail2ban-execmem-fix.md) — resolving execmem AVC denials from Fail2ban's grep on Fedora
- [UFW Firewall Management](02-selfhosting/security/ufw-firewall-management.md) — managing UFW rules, common patterns, troubleshooting
- [Firewall Hardening with firewalld on Fedora Fleet](02-selfhosting/security/firewalld-fleet-hardening.md) — audit-and-harden pattern for Fedora fleet hosts using Ansible; flush stale rules, rebuild minimal whitelists
- [Fail2ban Custom Jail: Nginx Bad Request Detection](02-selfhosting/security/fail2ban-nginx-bad-request-jail.md) — wiring the stock nginx-bad-request filter to a jail to catch malformed-request scanners
- [Fail2ban Custom Jail: Apache Bad Request Detection](02-selfhosting/security/fail2ban-apache-bad-request-jail.md) — custom filter for Apache 400 Bad Request responses (no stock equivalent exists)
- [SSH Hardening Fleet-Wide with Ansible](02-selfhosting/security/ssh-hardening-ansible-fleet.md) — drop-in sshd config hardening across mixed Ubuntu/Fedora fleets
- [ClamAV Fleet Deployment with Ansible](02-selfhosting/security/clamav-fleet-deployment.md) — deploy ClamAV with nice/ionice throttling, freshclam, and quarantine to internet-facing hosts
- [ClamAV Fleet Deployment with Ansible](02-selfhosting/security/clamav-fleet-deployment.md)
- [Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts](02-selfhosting/security/fail2ban-digest-mode-fleet.md)
- [Fail2ban Custom Jail: Apache 404 Scanner Detection](02-selfhosting/security/fail2ban-apache-404-scanner-jail.md)
- [Fail2ban Custom Jail: Apache Bad Request Detection](02-selfhosting/security/fail2ban-apache-bad-request-jail.md)
- [Fail2ban Custom Jail: Apache PHP Webshell Probe Detection](02-selfhosting/security/fail2ban-apache-php-probe-jail.md)
- [Fail2ban Custom Jail: WordPress Login Brute Force](02-selfhosting/security/fail2ban-wordpress-login-jail.md)
- [Fail2ban: Enable the nginx-bad-request Jail](02-selfhosting/security/fail2ban-nginx-bad-request-jail.md)
- [Firewall Hardening with firewalld on Fedora Fleet](02-selfhosting/security/firewalld-fleet-hardening.md)
- [Linux Server Hardening Checklist](02-selfhosting/security/linux-server-hardening-checklist.md)
- [SELinux: Fixing Fail2ban grep execmem Denial on Fedora](02-selfhosting/security/selinux-fail2ban-execmem-fix.md)
- [SSH Hardening Fleet-Wide with Ansible](02-selfhosting/security/ssh-hardening-ansible-fleet.md)
- [Standardizing unattended-upgrades Across Ubuntu Fleet with Ansible](02-selfhosting/security/ansible-unattended-upgrades-fleet.md)
- [UFW Firewall Management](02-selfhosting/security/ufw-firewall-management.md)
- [wp-fail2ban Plugin Logpath on Debian/Ubuntu (auth.log, not syslog)](02-selfhosting/security/wp-fail2ban-logpath-debian-ubuntu.md)

### Services
- [Updating n8n Running in Docker](02-selfhosting/services/updating-n8n-docker.md) — pinned version updates, password reset, Arcane timing gaps
- [Mastodon Instance Tuning](02-selfhosting/services/mastodon-instance-tuning.md) — character limit increase, media cache management for self-hosted Mastodon
- [Ghost Email Configuration with Mailgun](02-selfhosting/services/ghost-smtp-mailgun-setup.md) — configuring Ghost's two independent mail systems (newsletter API + transactional SMTP) with Mailgun
- [Claude Code Remote Control — Mobile Access to a Persistent Host Session](02-selfhosting/services/claude-code-remote-control.md) — running `claude remote-control` on a host so `claude.ai` and the Claude mobile app can drive the CLI, with vault + MCPs intact
- [Claude Code Remote Control — Mobile Access to a Persistent Host Session](02-selfhosting/services/claude-code-remote-control.md)
- [Ghost Email Configuration with Mailgun](02-selfhosting/services/ghost-smtp-mailgun-setup.md)
- [Mastodon DB Maintenance — Statuses, Accounts, and VACUUM](02-selfhosting/services/mastodon-db-maintenance.md)
- [Mastodon Federation — Domain Blocks, Silencing, and FediSeer](02-selfhosting/services/mastodon-federation.md)
- [Mastodon Instance Tuning](02-selfhosting/services/mastodon-instance-tuning.md)
- [Mastodon — The `--prune-profiles` Trap and How to Recover](02-selfhosting/services/mastodon-prune-profiles-trap.md)
- [Updating n8n Running in Docker](02-selfhosting/services/updating-n8n-docker.md)

### Storage & Backup
- [rsync Backup Patterns](02-selfhosting/storage-backup/rsync-backup-patterns.md)

---

## 🔓 Open Source Tools
|
||||
|
||||
### Alternatives
|
||||
- [SearXNG: Private Self-Hosted Search](03-opensource/alternatives/searxng.md) — metasearch engine that queries multiple engines without exposing your identity
|
||||
- [FreshRSS: Self-Hosted RSS Reader](03-opensource/alternatives/freshrss.md) — algorithm-free feed aggregator with mobile app sync
|
||||
- [Gitea: Self-Hosted Git](03-opensource/alternatives/gitea.md) — lightweight GitHub alternative, webhooks, single Docker container
|
||||
|
||||
### Productivity
|
||||
- [rmlint: Duplicate File Scanning](03-opensource/productivity/rmlint-duplicate-scanning.md) — extremely fast duplicate file finding and storage reclamation
|
||||
- [FreshRSS — Self-Hosted RSS Reader](03-opensource/alternatives/freshrss.md)
|
||||
- [Gitea — Self-Hosted Git](03-opensource/alternatives/gitea.md)
|
||||
- [SearXNG — Private Self-Hosted Search](03-opensource/alternatives/searxng.md)
|
||||
|
||||
### Development Tools
|
||||
- [tmux: Persistent Terminal Sessions](03-opensource/dev-tools/tmux.md) — detachable sessions for long-running jobs over SSH
|
||||
- [screen: Simple Persistent Sessions](03-opensource/dev-tools/screen.md) — lightweight terminal multiplexer, universally available
|
||||
- [rsync: Fast, Resumable File Transfers](03-opensource/dev-tools/rsync.md) — incremental file sync locally and over SSH, survives interruptions
|
||||
- [Ventoy: Multi-Boot USB Tool](03-opensource/dev-tools/ventoy.md) — drop ISOs on a USB drive and boot any of them, no reflashing
|
||||
|
||||
### Privacy & Security
|
||||
- [Vaultwarden: Self-Hosted Password Manager](03-opensource/privacy-security/vaultwarden.md) — Bitwarden-compatible server in a single Docker container, passwords stay on your hardware
|
||||
- [Ventoy — Multi-Boot USB Tool](03-opensource/dev-tools/ventoy.md)
|
||||
- [rsync — Fast, Resumable File Transfers](03-opensource/dev-tools/rsync.md)
|
||||
- [screen — Simple Persistent Terminal Sessions](03-opensource/dev-tools/screen.md)
|
||||
- [tmux — Persistent Terminal Sessions](03-opensource/dev-tools/tmux.md)
|
||||
|
||||
### Media & Creative
|
||||
- [yt-dlp: Video Downloading](03-opensource/media-creative/yt-dlp.md) — download from YouTube and hundreds of other sites, Plex-optimized format selection
|
||||
- [yt-dlp — Video Downloading](03-opensource/media-creative/yt-dlp.md)
|
||||
|
||||
### Privacy & Security
|
||||
- [Vaultwarden — Self-Hosted Password Manager](03-opensource/privacy-security/vaultwarden.md)
|
||||
|
||||
### Productivity
|
||||
- [rmlint — Extreme Duplicate File Scanning](03-opensource/productivity/rmlint-duplicate-scanning.md)
|
||||
|
||||
|
||||
---
|
||||
|
||||
## 🎙️ Streaming & Podcasting
|
||||
|
||||
### OBS Studio
|
||||
- [OBS Studio Setup & Encoding](04-streaming/obs/obs-studio-setup-encoding.md) — installation, NVENC/x264 settings, scene setup, audio filters, Linux Wayland notes
|
||||
- [OBS Studio Setup and Encoding Settings](04-streaming/obs/obs-studio-setup-encoding.md)
|
||||
|
||||
### Plex
|
||||
- [Plex 4K Codec Compatibility (Apple TV)](04-streaming/plex/plex-4k-codec-compatibility.md) — AV1/VP9 vs HEVC, batch conversion script, yt-dlp auto-convert hook
|
||||
- [Plex 4K Codec Compatibility (Apple TV)](04-streaming/plex/plex-4k-codec-compatibility.md)
|
||||
|
||||
|
||||
---
|
||||
|
||||
## 🔧 General Troubleshooting
|
||||
|
||||
- [Apache Outage: Fail2ban Self-Ban + Missing iptables Rules](05-troubleshooting/networking/fail2ban-self-ban-apache-outage.md) — diagnosing and fixing Apache outages caused by missing firewall rules and Fail2ban self-bans
|
||||
- [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](05-troubleshooting/networking/fail2ban-imap-self-ban-mail-client.md) — diagnosing why one device stops receiving email when the mail server is healthy
|
||||
- [firewalld: Mail Ports Wiped After Reload](05-troubleshooting/networking/firewalld-mail-ports-reset.md) — recovering IMAP and webmail after firewalld reload drops all mail service rules
|
||||
- [Fail2ban & UFW Rule Bloat: 30k Rules Slowing Down a VPS](05-troubleshooting/networking/fail2ban-ufw-rule-bloat-cleanup.md) — diagnosing and cleaning up massive nftables/UFW rule accumulation
|
||||
- [Tailscale SSH: Unexpected Re-Authentication Prompt](05-troubleshooting/networking/tailscale-ssh-reauth-prompt.md) — resolving unexpected re-auth prompts on Tailscale SSH connections
|
||||
- [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](05-troubleshooting/docker-caddy-selinux-post-reboot-recovery.md) — fixing docker.socket, SELinux port blocks, and httpd_can_network_connect after reboot
|
||||
- [n8n Behind Reverse Proxy: X-Forwarded-For Trust Fix](05-troubleshooting/docker/n8n-proxy-trust-x-forwarded-for.md) — fixing webhook failures caused by missing proxy trust configuration
|
||||
- [Nextcloud AIO Container Unhealthy for 20 Hours](05-troubleshooting/docker/nextcloud-aio-unhealthy-20h-stuck.md) — diagnosing stuck Nextcloud AIO containers after nightly update cycles
|
||||
- [ISP SNI Filtering with Caddy](05-troubleshooting/isp-sni-filtering-caddy.md) — troubleshooting why wiki.majorshouse.com was blocked by Google Fiber
|
||||
- [Obsidian Cache Hang Recovery](05-troubleshooting/obsidian-cache-hang-recovery.md) — resolving "Loading cache" hang in Obsidian by cleaning Electron app data and ML artifacts
|
||||
- [macOS Repeating Alert Tone from Mirrored Notification](05-troubleshooting/macos-mirrored-notification-alert-loop.md) — stopping alert tone loops from mirrored iPhone notifications on Mac
|
||||
- [Qwen2.5-14B OOM on RTX 3080 Ti (12GB)](05-troubleshooting/gpu-display/qwen-14b-oom-3080ti.md) — fixes and alternatives when hitting VRAM limits during fine-tuning
|
||||
- [yt-dlp YouTube JS Challenge Fix on Fedora](05-troubleshooting/yt-dlp-fedora-js-challenge.md) — fixing YouTube JS challenge solver errors and missing formats on Fedora
|
||||
- [Gemini CLI Manual Update](05-troubleshooting/gemini-cli-manual-update.md) — how to manually update the Gemini CLI when automatic updates fail
|
||||
- [MajorWiki Setup & Pipeline](05-troubleshooting/majwiki-setup-and-pipeline.md) — setting up MajorWiki and the Obsidian → Gitea → MkDocs publishing pipeline
|
||||
- [Gitea Actions Runner: Boot Race Condition Fix](05-troubleshooting/gitea-runner-boot-race-network-target.md) — fixing act_runner crash loop on boot caused by DNS not ready at startup
|
||||
- [Cron Heartbeat False Alarm: /var/run Cleared by Reboot](05-troubleshooting/cron-heartbeat-tmpfs-reboot-false-alarm.md) — why `/run` is tmpfs and how a reboot wipes cron heartbeat files, and where to put them instead
|
||||
- [SELinux: Fixing Dovecot Mail Spool Context (/var/vmail)](05-troubleshooting/selinux-dovecot-vmail-context.md) — fixing thousands of AVC denials when /var/vmail has wrong SELinux context
|
||||
- [mdadm RAID Recovery After USB Hub Disconnect](05-troubleshooting/storage/mdadm-usb-hub-disconnect-recovery.md) — diagnosing and recovering a failed mdadm array caused by a USB hub dropout
|
||||
- [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md) — fixing sshd not running after reboot due to Manual startup type
|
||||
- [Windows OpenSSH: WSL Default Shell Breaks Remote Commands](05-troubleshooting/networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md) — fixing remote SSH command failures when wsl.exe is the default shell
|
||||
- [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md) — keeping Ollama reachable over Tailscale by disabling macOS sleep on AC power
|
||||
- [Ansible: Vault Password File Not Found](05-troubleshooting/ansible-vault-password-file-missing.md) — fixing the missing vault_pass file error when running ansible-playbook
|
||||
- [Ansible: ansible.cfg Ignored on WSL2 Windows Mounts](05-troubleshooting/ansible-wsl2-world-writable-mount-ignores-cfg.md) — fixing silent config ignore due to world-writable /mnt/d/ permissions
|
||||
- [Ansible SSH Timeout During dnf upgrade](05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md) — preventing SSH timeouts during long-running dnf upgrades on Fedora
|
||||
- [Fedora Networking & Kernel Troubleshooting](05-troubleshooting/fedora-networking-kernel-recovery.md) — nmcli quick fix, GRUB kernel rollback, and recovery for Fedora fleet
|
||||
- [Custom Fail2ban Jail: Apache Directory Scanning](05-troubleshooting/security/apache-dirscan-fail2ban-jail.md) — blocking directory scanners and junk HTTP methods
|
||||
- [ClamAV Safe Scheduling on Live Servers](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md) — preventing clamscan CPU spikes with nice and ionice
|
||||
- [Systemd Session Scope Fails at Login](05-troubleshooting/systemd/session-scope-failure-at-login.md) — fixing session-cN.scope failures during login
|
||||
- [wget/curl: URLs with Special Characters Fail in Bash](05-troubleshooting/wget-url-special-characters.md) — fixing broken downloads caused by unquoted URLs with &, ?, # characters
|
||||
- [Ansible: Check Mode False Positives in Verify/Assert Tasks](05-troubleshooting/ansible-check-mode-false-positives.md) — guarding verify/assert tasks with `when: not ansible_check_mode` to prevent false failures in dry runs
|
||||
- [Ansible Fails with Permission Denied While `ssh <alias>` Works (Host Alias Bypass)](05-troubleshooting/ansible-ssh-host-alias-bypass.md) — SSH Host blocks match on literal pattern; `ansible_host: <IP>` bypasses the alias and the IdentityFile never gets applied
|
||||
- [Ghost EmailAnalytics Lag Warning — What It Means and When to Worry](05-troubleshooting/ghost-emailanalytics-lag-warning.md) — explaining the lag counter, `submitted` status, and `fetchMissing end == begin` skip
|
||||
- [claude-mem: --setting-sources Empty Arg Bug (Claude Code 2.1.x)](05-troubleshooting/claude-mem-setting-sources-empty-arg.md) — fixing silent pipeline failure when claude-mem 12.1.x spawns Claude Code 2.1.112+
|
||||
- [Ansible Check Mode False Positives in Verify/Assert Tasks](05-troubleshooting/ansible-check-mode-false-positives.md)
|
||||
- [Ansible Fails with Permission Denied While `ssh <alias>` Works (Host Alias Bypass)](05-troubleshooting/ansible-ssh-host-alias-bypass.md)
|
||||
- [Ansible SSH Timeout During dnf upgrade on Fedora Hosts](05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md)
|
||||
- [Ansible: Vault Password File Not Found](05-troubleshooting/ansible-vault-password-file-missing.md)
|
||||
- [Ansible Ignores ansible.cfg on WSL2 Windows Mounts](05-troubleshooting/ansible-wsl2-world-writable-mount-ignores-cfg.md)
|
||||
- [claude-mem Silently Fails with Claude Code 2.1+ (Empty --setting-sources)](05-troubleshooting/claude-mem-setting-sources-empty-arg.md)
|
||||
- [Cron Heartbeat False Alarm: /var/run Cleared by Reboot](05-troubleshooting/cron-heartbeat-tmpfs-reboot-false-alarm.md)
|
||||
- [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](05-troubleshooting/docker-caddy-selinux-post-reboot-recovery.md)
|
||||
- [Fantastical Google Sync Error Flood — Phantom Calendars Fixed via syncselect](05-troubleshooting/fantastical-google-phantom-calendar-syncselect.md)
|
||||
- [Fantastical MCP Server: Permission Denied on Launch (macOS Quarantine)](05-troubleshooting/fantastical-mcp-permission-denied.md)
|
||||
- [Fedora Networking & Kernel Troubleshooting](05-troubleshooting/fedora-networking-kernel-recovery.md)
|
||||
- [Gemini CLI: Manual Update Guide](05-troubleshooting/gemini-cli-manual-update.md)
|
||||
- [Ghost EmailAnalytics Lag Warning — What It Means and When to Worry](05-troubleshooting/ghost-emailanalytics-lag-warning.md)
|
||||
- [Gitea Actions Runner: Boot Race Condition Fix](05-troubleshooting/gitea-runner-boot-race-network-target.md)
|
||||
- [ISP SNI Filtering & Caddy Troubleshooting](05-troubleshooting/isp-sni-filtering-caddy.md)
|
||||
- [macOS Repeating Alert Tone from Mirrored iPhone Notification](05-troubleshooting/macos-mirrored-notification-alert-loop.md)
|
||||
- [MajorWiki Setup & Publishing Pipeline](05-troubleshooting/majwiki-setup-and-pipeline.md)
|
||||
- [Obsidian Vault Recovery — Loading Cache Hang](05-troubleshooting/obsidian-cache-hang-recovery.md)
|
||||
- [Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt](05-troubleshooting/ollama-chat-template-pipe-stdin-bypass.md)
|
||||
- [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md)
|
||||
- [Python smtplib: Missing Date/Message-ID Headers Break Mail Clients](05-troubleshooting/python-smtplib-missing-rfc-headers.md)
|
||||
- [SELinux: Fixing Dovecot Mail Spool Context (/var/vmail)](05-troubleshooting/selinux-dovecot-vmail-context.md)
|
||||
- [Ubuntu dist-upgrade Quarantines Third-Party Repos](05-troubleshooting/ubuntu-dist-upgrade-repo-quarantine.md)
|
||||
- [wget/curl: URLs with Special Characters Fail in Bash](05-troubleshooting/wget-url-special-characters.md)
|
||||
- [yt-dlp YouTube JS Challenge Fix (Fedora)](05-troubleshooting/yt-dlp-fedora-js-challenge.md)
|
||||
### Docker & Containers
|
||||
- [Nextcloud AIO Container Unhealthy for 20 Hours After Nightly Update](05-troubleshooting/docker/nextcloud-aio-unhealthy-20h-stuck.md)
|
||||
- [n8n Behind Reverse Proxy: X-Forwarded-For Trust Fix](05-troubleshooting/docker/n8n-proxy-trust-x-forwarded-for.md)
|
||||
|
||||
### GPU & Display
|
||||
- [LoRA adapter — GGUF conversion fails with 'config.json not found'](05-troubleshooting/gpu-display/lora-adapter-gguf-conversion-fails.md)
|
||||
- [Qwen2.5-14B OOM on RTX 3080 Ti (12GB)](05-troubleshooting/gpu-display/qwen-14b-oom-3080ti.md)
|
||||
|
||||
### Networking
|
||||
- [Apache Outage: Fail2ban Self-Ban + Missing iptables Rules](05-troubleshooting/networking/fail2ban-self-ban-apache-outage.md)
|
||||
- [Fail2ban & UFW Rule Bloat: 30k Rules Slowing Down a VPS](05-troubleshooting/networking/fail2ban-ufw-rule-bloat-cleanup.md)
|
||||
- [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](05-troubleshooting/networking/fail2ban-imap-self-ban-mail-client.md)
|
||||
- [Pi-hole AI Blocklist Blocks Claude Desktop (ERR_CONNECTION_REFUSED)](05-troubleshooting/networking/pihole-blocks-claude-desktop.md)
|
||||
- [Tailscale SSH: Unexpected Re-Authentication Prompt](05-troubleshooting/networking/tailscale-ssh-reauth-prompt.md)
|
||||
- [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md)
|
||||
- [Windows OpenSSH: WSL as Default Shell Breaks Remote Commands](05-troubleshooting/networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md)
|
||||
- [firewalld: Mail Ports Wiped After Reload (IMAP + Webmail Outage)](05-troubleshooting/networking/firewalld-mail-ports-reset.md)
|
||||
- [iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators](05-troubleshooting/networking/tailscale-status-json-hostname-localhost-ios.md)
|
||||
- [rsync over Tailscale: Hung in TCP Teardown After Transfer Completes](05-troubleshooting/networking/rsync-tailscale-teardown-stall.md)
|
||||
|
||||
### Security
|
||||
- [ClamAV Safe Scheduling on Live Servers](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md)
|
||||
- [Custom Fail2ban Jail: Apache Directory Scanning & Junk Methods](05-troubleshooting/security/apache-dirscan-fail2ban-jail.md)
|
||||
- [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md)
|
||||
- [Castopod: Stale Federated Avatar URLs After Remote Profile Updates](05-troubleshooting/security/castopod-stale-federated-avatar.md)
|
||||
- [Castopod Posts Don't Appear on Mastodon — Diagnosing the Federation Path](05-troubleshooting/security/castopod-broadcast-not-on-mastodon.md)
|
||||
|
||||
### Storage
|
||||
- [mdadm RAID Recovery After USB Hub Disconnect](05-troubleshooting/storage/mdadm-usb-hub-disconnect-recovery.md)
|
||||
|
||||
### Systemd
|
||||
- [Systemd Session Scope Fails at Login (session-cN.scope)](05-troubleshooting/systemd/session-scope-failure-at-login.md)
|
||||
|
||||
|
||||
---
|
||||
|
|
@@ -182,57 +217,45 @@ updated: 2026-04-29T22:45
|
|||
|
||||
| Date | Article | Domain |
|
||||
|---|---|---|
|
||||
| 2026-04-19 | [Wake-on-LAN via Router SSH](02-selfhosting/dns-networking/wake-on-lan-router-ssh.md) | Self-Hosting |
|
||||
| 2026-04-18 | [Ghost Email Configuration with Mailgun](02-selfhosting/services/ghost-smtp-mailgun-setup.md) | Self-Hosting |
|
||||
| 2026-04-18 | [Firewall Hardening with firewalld on Fedora Fleet](02-selfhosting/security/firewalld-fleet-hardening.md) | Self-Hosting |
|
||||
| 2026-04-18 | [ClamAV Fleet Deployment with Ansible](02-selfhosting/security/clamav-fleet-deployment.md) | Self-Hosting |
|
||||
| 2026-04-18 | [Ansible: Check Mode False Positives in Verify/Assert Tasks](05-troubleshooting/ansible-check-mode-false-positives.md) | Troubleshooting |
|
||||
| 2026-04-18 | [Ghost EmailAnalytics Lag Warning](05-troubleshooting/ghost-emailanalytics-lag-warning.md) | Troubleshooting |
|
||||
| 2026-04-17 | [Watchtower SMTP via Localhost Postfix Relay](02-selfhosting/docker/watchtower-smtp-localhost-relay.md) | Self-Hosting |
|
||||
| 2026-04-17 | [Fail2ban Custom Jail: Nginx Bad Request Detection](02-selfhosting/security/fail2ban-nginx-bad-request-jail.md) | Self-Hosting |
|
||||
| 2026-04-17 | [Fail2ban Custom Jail: Apache Bad Request Detection](02-selfhosting/security/fail2ban-apache-bad-request-jail.md) | Self-Hosting |
|
||||
| 2026-04-17 | [SSH Hardening Fleet-Wide with Ansible](02-selfhosting/security/ssh-hardening-ansible-fleet.md) | Self-Hosting |
|
||||
| 2026-04-17 | [claude-mem: --setting-sources Empty Arg Bug](05-troubleshooting/claude-mem-setting-sources-empty-arg.md) | Troubleshooting |
|
||||
| 2026-04-13 | [Cron Heartbeat False Alarm: /var/run Cleared by Reboot](05-troubleshooting/cron-heartbeat-tmpfs-reboot-false-alarm.md) | Troubleshooting |
|
||||
| 2026-04-09 | [Fail2ban Custom Jail: Apache PHP Webshell Probe Detection](02-selfhosting/security/fail2ban-apache-php-probe-jail.md) | Self-Hosting |
|
||||
| 2026-04-08 | [wget/curl: URLs with Special Characters Fail in Bash](05-troubleshooting/wget-url-special-characters.md) | Troubleshooting |
|
||||
| 2026-04-07 | [SSH Config & Key Management](01-linux/networking/ssh-config-key-management.md) | Linux |
|
||||
| 2026-04-07 | [Windows OpenSSH: WSL Default Shell Breaks Remote Commands](05-troubleshooting/networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md) | Troubleshooting |
|
||||
| 2026-04-07 | [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md) | Troubleshooting |
|
||||
| 2026-04-03 | [Ansible: ansible.cfg Ignored on WSL2 Windows Mounts](05-troubleshooting/ansible-wsl2-world-writable-mount-ignores-cfg.md) | Troubleshooting |
|
||||
| 2026-04-02 | [Fail2ban Custom Jail: WordPress Login Brute Force](02-selfhosting/security/fail2ban-wordpress-login-jail.md) | Self-Hosting |
|
||||
| 2026-04-02 | [Mastodon Instance Tuning](02-selfhosting/services/mastodon-instance-tuning.md) | Self-Hosting |
|
||||
| 2026-04-02 | [mdadm — Rebuilding a RAID Array After Reinstall](01-linux/storage/mdadm-raid-rebuild.md) | Linux |
|
||||
| 2026-04-02 | [Fedora Networking & Kernel Troubleshooting](05-troubleshooting/fedora-networking-kernel-recovery.md) | Troubleshooting |
|
||||
| 2026-04-02 | [Ventoy: Multi-Boot USB Tool](03-opensource/dev-tools/ventoy.md) | Open Source |
|
||||
| 2026-04-02 | [rsync Backup Patterns](02-selfhosting/storage-backup/rsync-backup-patterns.md) (updated — Glacier Deep Archive) | Self-Hosting |
|
||||
| 2026-04-02 | [yt-dlp: Video Downloading](03-opensource/media-creative/yt-dlp.md) (updated — subtitles, temp fix) | Open Source |
|
||||
| 2026-04-02 | [OBS Studio Setup & Encoding](04-streaming/obs/obs-studio-setup-encoding.md) (updated — captions plugin, VLC capture) | Streaming |
|
||||
| 2026-04-02 | [Linux Server Hardening Checklist](02-selfhosting/security/linux-server-hardening-checklist.md) (updated — SpamAssassin) | Self-Hosting |
|
||||
| 2026-03-23 | [Ansible: Vault Password File Not Found](05-troubleshooting/ansible-vault-password-file-missing.md) | Troubleshooting |
|
||||
| 2026-03-18 | [Deploying Netdata to a New Server](02-selfhosting/monitoring/netdata-new-server-setup.md) | Self-Hosting |
|
||||
| 2026-03-18 | [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md) | Self-Hosting |
|
||||
| 2026-03-17 | [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md) | Troubleshooting |
|
||||
| 2026-03-17 | [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md) | Troubleshooting |
|
||||
| 2026-03-16 | [Standardizing unattended-upgrades with Ansible](02-selfhosting/security/ansible-unattended-upgrades-fleet.md) | Self-Hosting |
|
||||
| 2026-03-16 | [WSL2 Training Environment Rebuild (Fedora 43)](01-linux/distro-specific/wsl2-rebuild-fedora43-training-env.md) | Linux |
|
||||
| 2026-03-16 | [WSL2 Backup via PowerShell Scheduled Task](01-linux/distro-specific/wsl2-backup-powershell.md) | Linux |
|
||||
| 2026-03-15 | [firewalld: Mail Ports Wiped After Reload](05-troubleshooting/networking/firewalld-mail-ports-reset.md) | Troubleshooting |
|
||||
| 2026-03-15 | [Plex 4K Codec Compatibility (Apple TV)](04-streaming/plex/plex-4k-codec-compatibility.md) | Streaming |
|
||||
| 2026-03-15 | [mdadm RAID Recovery After USB Hub Disconnect](05-troubleshooting/storage/mdadm-usb-hub-disconnect-recovery.md) | Troubleshooting |
|
||||
| 2026-03-15 | [yt-dlp: Video Downloading](03-opensource/media-creative/yt-dlp.md) | Open Source |
|
||||
| 2026-03-14 | [SELinux: Fixing Dovecot Mail Spool Context (/var/vmail)](05-troubleshooting/selinux-dovecot-vmail-context.md) | Troubleshooting |
|
||||
| 2026-03-14 | [Gitea Actions Runner: Boot Race Condition Fix](05-troubleshooting/gitea-runner-boot-race-network-target.md) | Troubleshooting |
|
||||
| 2026-03-14 | [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](05-troubleshooting/networking/fail2ban-imap-self-ban-mail-client.md) | Troubleshooting |
|
||||
| 2026-03-14 | [SearXNG: Private Self-Hosted Search](03-opensource/alternatives/searxng.md) | Open Source |
|
||||
| 2026-03-14 | [FreshRSS: Self-Hosted RSS Reader](03-opensource/alternatives/freshrss.md) | Open Source |
|
||||
| 2026-03-14 | [Gitea: Self-Hosted Git](03-opensource/alternatives/gitea.md) | Open Source |
|
||||
| 2026-03-14 | [yt-dlp: Video Downloading](03-opensource/media-creative/yt-dlp.md) | Open Source |
|
||||
| 2026-03-13 | [Vaultwarden: Self-Hosted Password Manager](03-opensource/privacy-security/vaultwarden.md) | Open Source |
|
||||
| 2026-03-13 | [Gemini CLI Manual Update](05-troubleshooting/gemini-cli-manual-update.md) | Troubleshooting |
|
||||
| 2026-03-13 | [rmlint: Duplicate File Scanning](03-opensource/productivity/rmlint-duplicate-scanning.md) | Open Source |
|
||||
| 2026-03-13 | [SnapRAID & MergerFS Storage Setup](01-linux/storage/snapraid-mergerfs-setup.md) | Linux |
|
||||
| 2026-03-13 | [Qwen2.5-14B OOM on RTX 3080 Ti (12GB)](05-troubleshooting/gpu-display/qwen-14b-oom-3080ti.md) | Troubleshooting |
|
||||
| 2026-05-10 | [Logwatch Fleet Setup — Surviving Package Upgrades](02-selfhosting/monitoring/logwatch-fleet-setup.md) — added "Per-host config drift on cloud-image-derived servers" section: Packer-leftover myhostname, empty relayhost forcing public-MX path, stale SASL passwd maps from prior relays | Self-Hosting |
|
||||
| 2026-05-10 | [Patching PHP 8.4 Implicit-Nullable Deprecations in Vendor Packages](05-troubleshooting/php-84-vendor-implicit-nullable-patch.md) — generalized from a Castopod/UuidModel incident; covers the substring-match gotcha that turns a 30-second fix into a 30-minute one | Troubleshooting |
|
||||
| 2026-05-10 | [Logwatch Fleet Setup — Surviving Package Upgrades](02-selfhosting/monitoring/logwatch-fleet-setup.md) — added Fedora CA bundle missing diagnosis, journald-vs-mail.log methodology note, and bounce-source-must-be-real-mailbox section | Self-Hosting |
|
||||
| 2026-05-10 | [ClamAV Fleet Deployment with Ansible](02-selfhosting/security/clamav-fleet-deployment.md) — added DigitalOcean monitoring caveat for 1vCPU droplets (with follow-up note: per-droplet relaxed alert can still trip; accept-the-page decision) | Self-Hosting |
|
||||
| 2026-05-10 | [Claude Desktop MCP Mass-Disconnect After Blocking SSH Reboot](05-troubleshooting/claude-desktop-mcp-mass-disconnect-blocking-reboot.md) | Troubleshooting |
|
||||
| 2026-05-10 | [Castopod Posts Don't Appear on Mastodon — Diagnosing the Federation Path](05-troubleshooting/security/castopod-broadcast-not-on-mastodon.md) | Troubleshooting |
|
||||
| 2026-05-08 | [Castopod: Stale Federated Avatar URLs After Remote Profile Updates](05-troubleshooting/security/castopod-stale-federated-avatar.md) | Troubleshooting |
|
||||
| 2026-05-08 | [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md) | Troubleshooting |
|
||||
| 2026-05-07 | [Mastodon — The `--prune-profiles` Trap and How to Recover](02-selfhosting/services/mastodon-prune-profiles-trap.md) | Self-Hosting |
|
||||
| 2026-05-02 | [WSL2 Backup via PowerShell Scheduled Task](01-linux/distro-specific/wsl2-backup-powershell.md) | Linux |
|
||||
| 2026-05-02 | [SSH Config and Key Management](01-linux/networking/ssh-config-key-management.md) | Linux |
|
||||
| 2026-05-02 | [Wake-on-LAN via Router SSH](02-selfhosting/dns-networking/wake-on-lan-router-ssh.md) | Self-Hosting |
|
||||
| 2026-05-02 | [Tuning Netdata Docker Health Alarms to Prevent Update Flapping](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md) | Self-Hosting |
|
||||
| 2026-05-02 | [ClamAV Fleet Deployment with Ansible](02-selfhosting/security/clamav-fleet-deployment.md) | Self-Hosting |
|
||||
| 2026-05-02 | [Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts](02-selfhosting/security/fail2ban-digest-mode-fleet.md) | Self-Hosting |
|
||||
| 2026-05-02 | [Mastodon Instance Tuning](02-selfhosting/services/mastodon-instance-tuning.md) | Self-Hosting |
|
||||
| 2026-05-02 | [Ansible Check Mode False Positives in Verify/Assert Tasks](05-troubleshooting/ansible-check-mode-false-positives.md) | Troubleshooting |
|
||||
| 2026-05-02 | [ISP SNI Filtering & Caddy Troubleshooting](05-troubleshooting/isp-sni-filtering-caddy.md) | Troubleshooting |
|
||||
| 2026-05-02 | [Windows OpenSSH: WSL as Default Shell Breaks Remote Commands](05-troubleshooting/networking/windows-openssh-wsl-default-shell-breaks-remote-commands.md) | Troubleshooting |
|
||||
| 2026-05-02 | [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md) | Troubleshooting |
|
||||
| 2026-05-02 | [yt-dlp YouTube JS Challenge Fix (Fedora)](05-troubleshooting/yt-dlp-fedora-js-challenge.md) | Troubleshooting |
|
||||
| 2026-04-30 | [wp-fail2ban Plugin Logpath on Debian/Ubuntu (auth.log, not syslog)](02-selfhosting/security/wp-fail2ban-logpath-debian-ubuntu.md) | Self-Hosting |
|
||||
| 2026-04-30 | [LoRA adapter — GGUF conversion fails with 'config.json not found'](05-troubleshooting/gpu-display/lora-adapter-gguf-conversion-fails.md) | Troubleshooting |
|
||||
| 2026-04-29 | [iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators](05-troubleshooting/networking/tailscale-status-json-hostname-localhost-ios.md) | Troubleshooting |
|
||||
| 2026-04-29 | [Python smtplib: Missing Date/Message-ID Headers Break Mail Clients](05-troubleshooting/python-smtplib-missing-rfc-headers.md) | Troubleshooting |
|
||||
| 2026-04-28 | [Ubuntu dist-upgrade Quarantines Third-Party Repos](05-troubleshooting/ubuntu-dist-upgrade-repo-quarantine.md) | Troubleshooting |
|
||||
| 2026-04-26 | [Fantastical MCP Server: Permission Denied on Launch (macOS Quarantine)](05-troubleshooting/fantastical-mcp-permission-denied.md) | Troubleshooting |
|
||||
| 2026-04-25 | [rsync over Tailscale: Hung in TCP Teardown After Transfer Completes](05-troubleshooting/networking/rsync-tailscale-teardown-stall.md) | Troubleshooting |
|
||||
| 2026-04-25 | [Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt](05-troubleshooting/ollama-chat-template-pipe-stdin-bypass.md) | Troubleshooting |
|
||||
| 2026-04-24 | [Fantastical Google Sync Error Flood — Phantom Calendars Fixed via syncselect](05-troubleshooting/fantastical-google-phantom-calendar-syncselect.md) | Troubleshooting |
|
||||
| 2026-04-23 | [Pi-hole DoH / DoT Bypass Defense](02-selfhosting/dns-networking/pihole-doh-dot-bypass-defense.md) | Self-Hosting |
|
||||
| 2026-04-22 | [Pi-hole v6 Adlist Management via SQL](02-selfhosting/dns-networking/pihole-v6-adlist-management.md) | Self-Hosting |
|
||||
| 2026-04-22 | [Pi-hole v6 Group Management: Per-Client DNS Rules](02-selfhosting/dns-networking/pihole-v6-group-management.md) | Self-Hosting |
|
||||
| 2026-04-22 | [Mastodon DB Maintenance — Statuses, Accounts, and VACUUM](02-selfhosting/services/mastodon-db-maintenance.md) | Self-Hosting |
|
||||
| 2026-04-22 | [Mastodon Federation — Domain Blocks, Silencing, and FediSeer](02-selfhosting/services/mastodon-federation.md) | Self-Hosting |
|
||||
| 2026-04-22 | [Pi-hole AI Blocklist Blocks Claude Desktop (ERR_CONNECTION_REFUSED)](05-troubleshooting/networking/pihole-blocks-claude-desktop.md) | Troubleshooting |
|
||||
| 2026-04-21 | [Ansible Fails with Permission Denied While `ssh <alias>` Works (Host Alias Bypass)](05-troubleshooting/ansible-ssh-host-alias-bypass.md) | Troubleshooting |
|
||||
| 2026-04-20 | [Claude Code Remote Control — Mobile Access to a Persistent Host Session](02-selfhosting/services/claude-code-remote-control.md) | Self-Hosting |
|
||||
| 2026-04-19 | [AWS S3 Cost Management](02-selfhosting/cloud/aws-s3-cost-management.md) | Self-Hosting |
|
||||
|
||||
---
|
||||
|
||||
|
|
|
|||