Compare commits

...

11 commits

Author SHA1 Message Date
65b0aa4567 wiki: expand Tailscale race condition article with network-online race
Added Race 2: tailscaled starts before network-online.target, causing
Tailscale to get stuck with SetNetworkUp(false). Covers both Ubuntu
ssh.socket and cross-platform tailscaled ordering issues. Updated
references to include majordiscord incident and new Ansible playbook.
2026-05-19 20:39:18 -04:00
eb39da9a26 Merge cowork/majorair/ssh-socket-wiki: ssh.socket Tailscale race condition article 2026-05-19 19:36:19 -04:00
7dc591d257 wiki: add ssh.socket Tailscale race condition troubleshooting article
Documents the systemd socket activation race where ssh.socket binds
to the Tailscale IP before tailscaled is ready, causing SSH to become
unreachable after a Tailscale reconnect. Includes diagnosis steps and
the After=/BindsTo= fix.
2026-05-19 19:35:16 -04:00
64ac418a36 wiki: add ClamAV daemonless mode section + HEVC VAAPI article link 2026-05-15 09:02:24 -04:00
Marcus (via Claude Code)
28518e403e Add troubleshooting articles: Netdata apps-group FD false-positive + OBS stale script paths
- netdata-apps-fds-group-false-positive: the apps_group_file_descriptors_utilization
  false 100% on forking/root app groups (tailscaled on MajorToot 2026-05-15),
  the not-a-privilege gotcha, fleet-wide silence fix in MajorAnsible.
- obs-stale-script-paths: pending from prior session (not on remote).
- SUMMARY.md: link both (re-applied onto upstream after concurrent rebase).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 03:22:12 -04:00
a785e85821 Merge branch 'code/majorair/rsyslog-logwatch-fix' 2026-05-13 10:36:06 -04:00
4ec481c584 wiki: add rsyslog requirement to migration checklist and logwatch docs
Fedora 44 Hetzner images ship without rsyslog — logwatch produces
zero output because /var/log/messages doesn't exist. Added rsyslog
to baseline table and new diagnostic section to logwatch article.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 10:36:00 -04:00
c22457f1aa Merge branch 'code/majorair/teelia-cpu-docs' 2026-05-11 18:32:18 -04:00
ac84610380 wiki: add 1 vCPU nice/ionice limitation note to ClamAV article
nice -n 19 only yields when other processes compete; on single-core
VPS boxes the scan still saturates CPU. Document the expectation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 18:32:01 -04:00
3df0979786 Merge branch 'code/majorair/logwatch-ca-bundle-docs'
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 07:37:48 -04:00
de9b661b9d wiki: add Fedora CA bundle article, update migration checklist and logwatch docs
New article documenting missing /etc/pki/tls/certs/ca-bundle.crt symlink
on Hetzner Fedora images breaking Postfix TLS, curl, and dnf. Updated
VPS migration baseline checklist with timezone, CA bundle, and crond
verification steps. Updated logwatch fleet setup with crond check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 07:35:42 -04:00
10 changed files with 834 additions and 7 deletions


@ -0,0 +1,97 @@
---
title: VPS Migration Baseline Checklist
description: What to verify after migrating a server to a new provider — the packages, services, and configs that must match the old box
tags:
- migration
- vps
- hetzner
- digitalocean
- ansible
- checklist
status: published
created: 2026-05-09
updated: 2026-05-13T10:35
---
# VPS Migration Baseline Checklist
When migrating a server from one VPS provider to another, it's easy to focus on the application (bots, web services, databases) and forget the infrastructure baseline. This checklist covers the common components that make a server operational beyond just running the app.
## Background
During the Hetzner migration (2026-05), `majordiscord` was migrated with only the application layer (PhantomBot, Red-DiscordBot) and core infrastructure (Netdata, Tailscale, fail2ban). Missing from the new box: Postfix (email relay), logwatch, ClamAV, and dnf-automatic. The gap went unnoticed for a week because all monitoring email depended on the missing Postfix.
## The Checklist
### Before Migration
Power on both old and new boxes. Run this comparison to find gaps:
```bash
# Fedora — list baseline packages on both hosts
ssh root@OLD_HOST 'rpm -qa --qf "%{NAME}\n" | sort | grep -iE "fail2ban|logwatch|postfix|netdata|clamav|dnf-auto|tailscale|cronie|firewalld"'
ssh root@NEW_HOST 'rpm -qa --qf "%{NAME}\n" | sort | grep -iE "fail2ban|logwatch|postfix|netdata|clamav|dnf-auto|tailscale|cronie|firewalld"'
# Ubuntu — list baseline packages on both hosts
ssh root@OLD_HOST 'dpkg -l | grep -iE "fail2ban|logwatch|postfix|netdata|clamav|unattended|tailscale" | awk "{print \$2}" | sort'
ssh root@NEW_HOST 'dpkg -l | grep -iE "fail2ban|logwatch|postfix|netdata|clamav|unattended|tailscale" | awk "{print \$2}" | sort'
```
Compare enabled services:
```bash
ssh root@HOST 'systemctl list-unit-files --state=enabled --no-pager | grep -iE "fail2ban|logwatch|postfix|netdata|clamav|dnf-auto|tailscale|cronie|firewalld|sshd"'
```
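The two package listings above can be diffed mechanically rather than by eye. A minimal sketch, assuming the output of each `ssh ... | sort` command has been captured as text; the function name `missing_packages` and the sample listings are illustrative and not part of the fleet tooling:

```python
def missing_packages(old_listing: str, new_listing: str) -> list[str]:
    """Return package names present on the old host but absent on the new one."""
    old = {line.strip() for line in old_listing.splitlines() if line.strip()}
    new = {line.strip() for line in new_listing.splitlines() if line.strip()}
    return sorted(old - new)


# Illustrative sample data, not real host output.
old_host = "fail2ban\nlogwatch\nnetdata\npostfix\ntailscale\n"
new_host = "fail2ban\nnetdata\ntailscale\n"

print(missing_packages(old_host, new_host))
```

Anything the function returns is a candidate gap to check against the baseline table before the migration is called complete.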
### Baseline Components
Every server in the fleet should have these. Check each one after migration:
| Component | Package (Fedora) | Package (Ubuntu) | Ansible Playbook | Notes |
|-----------|-----------------|------------------|------------------|-------|
| Monitoring | `netdata` | `netdata` | `netdata.yml` | Claim to Netdata Cloud if applicable |
| VPN | `tailscale` | `tailscale` | — (manual join) | Rename node in Tailscale admin |
| Intrusion prevention | `fail2ban` | `fail2ban` | `harden.yml` | Check jail.local, banaction matches firewall |
| Email relay | `postfix` | `postfix` | `configure_postfix_relay.yml` | Required by logwatch, Netdata, fail2ban |
| Log summaries | `logwatch` | `logwatch` | `logwatch.yml` | Override file, not defaults — see [logwatch fleet setup](../monitoring/logwatch-fleet-setup.md) |
| Firewall | `firewalld` | `ufw` | `configure_firewall_*.yml` | Verify fail2ban banaction matches |
| Cron | `cronie` | `cron` | — (usually pre-installed) | Required by logwatch |
| Auto-updates | `dnf-automatic` | `unattended-upgrades` | `ansible-unattended-upgrades-fleet` | Security patches only |
| Antivirus | `clamav` | `clamav` | `configure_clamav.yml` | Internet-facing hosts only |
| SSH hardening | `openssh-server` | `openssh-server` | `configure_ssh_hardening.yml` | Key-only, no root password |
| Timezone | — | — | — | US servers: `America/New_York`; UK: `Europe/London`. Hetzner defaults to UTC. |
| CA bundle (Fedora) | `ca-certificates` | `ca-certificates` | — | Verify `/etc/pki/tls/certs/ca-bundle.crt` symlink exists — see [Fedora CA bundle fix](../../05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md) |
| Syslog (Fedora) | `rsyslog` | — (pre-installed) | — | Fedora 44 Hetzner images have journald only. Logwatch needs `/var/log/messages` + `/var/log/secure`. |
### After Migration
1. **Set the timezone:** `timedatectl set-timezone America/New_York` (US) or `Europe/London` (UK). Hetzner images default to UTC.
2. **Verify CA bundle (Fedora):** `ls /etc/pki/tls/certs/ca-bundle.crt`. If missing, Postfix TLS, curl, and dnf will all fail silently. See [Fedora CA bundle fix](../../05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md).
3. **Run `harden.yml` against the new host:** catches most gaps in one pass.
4. **Send a test email:** `echo test | mail -s "test" marcus@majorshouse.com`. If this fails, nothing else can alert you.
5. **Verify crond is running:** `systemctl is-active crond` (Fedora) or `systemctl is-active cron` (Ubuntu). cronie can be `enabled` but not `active` after provisioning.
6. **Check Netdata Cloud:** verify the new node appears and alerts are flowing.
7. **Compare fail2ban jails:** run `fail2ban-client status` on both old and new.
8. **Verify logwatch sends:** `sudo logwatch --output mail --range today`
9. **Keep the old box powered off but not destroyed** for at least 7 days after remediation
### Using doctl to Manage Old Droplets
```bash
# Authenticate (token from Ansible vault)
cd ~/MajorAnsible
ansible-vault view group_vars/all/vault.yml | grep vault_do_oauth_token | awk '{print $2}' | xargs doctl auth init --access-token
# List droplets
doctl compute droplet list --format Name,ID,Status,PublicIPv4
# Power on for comparison
doctl compute droplet-action power-on DROPLET_ID
# Power off when done
doctl compute droplet-action power-off DROPLET_ID
```
## Lesson Learned
Application migration is not server migration. The app can work perfectly while the monitoring, alerting, and email infrastructure is completely broken. Always compare the full package baseline between old and new boxes before calling a migration complete.


@ -9,7 +9,7 @@ tags:
- ubuntu
status: published
created: 2026-05-09
updated: 2026-05-10T13:00
updated: 2026-05-13T10:35
---
# Logwatch Fleet Setup — Surviving Package Upgrades
@ -91,10 +91,22 @@ Include it in `harden.yml` so every new server gets logwatch as part of the base
After deploying, test immediately:
```bash
# Verify crond is actually running — cronie can be "enabled" but not "active"
systemctl is-active crond # Fedora
systemctl is-active cron # Ubuntu
# If inactive, start it
sudo systemctl start crond
# Then test logwatch manually
sudo logwatch --output mail --range today
```
Check that the email arrives. If it doesn't, verify Postfix is installed and relaying correctly — logwatch depends on a working local MTA.
Check that the email arrives. If it doesn't, verify:
1. **crond is running** — if `inactive`, cron.daily never fires and logwatch never runs. No errors anywhere.
2. **Postfix is installed and relaying** — logwatch depends on a working local MTA.
3. **CA bundle exists (Fedora)** — missing `/etc/pki/tls/certs/ca-bundle.crt` breaks Postfix TLS relay. See [Fedora CA bundle fix](../../05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md).
## Diagnosing Silent Failures
@ -105,6 +117,32 @@ dpkg -V logwatch # Debian
# Look for S.5....T. on the defaults file — means it was replaced
# S = size, 5 = md5, T = timestamp changed
# Check if logwatch produces any output at all
logwatch --output stdout --range yesterday | wc -l
# If 0 lines — logwatch has no log data to report (see rsyslog section below)
```
## Fedora: rsyslog Missing — Logwatch Produces Zero Output
Fedora 44 cloud images (Hetzner, possibly others) ship with **journald only** — no rsyslog. This means `/var/log/messages`, `/var/log/secure`, and `/var/log/cron` do not exist. Logwatch scans those files, finds nothing, produces empty output, and sends no email. Exit code is still 0 — no error anywhere.
This is particularly insidious because everything else can be correct (crond running, postfix relaying, logwatch config pointing to the right recipient) and you'll still get silence.
```bash
# Diagnose
rpm -q rsyslog # "package rsyslog is not installed"
ls /var/log/messages # "No such file or directory"
# Fix
dnf install -y rsyslog
systemctl enable --now rsyslog
# Verify log files appear
ls /var/log/messages /var/log/secure /var/log/cron
# Test logwatch
logwatch --output stdout --range today | wc -l # should be >0
```
## Fedora CA Bundle Missing — Postfix TLS Engine Unavailable


@ -11,7 +11,7 @@ tags:
- cron
status: published
created: 2026-04-18
updated: 2026-05-10T01:50
updated: 2026-05-15T03:00
---
# ClamAV Fleet Deployment with Ansible
@ -226,6 +226,41 @@ The "polite CPU is invisible to DO" trick stops working once the box is small en
**Alternative considered: switch to `clamdscan`** — uses a resident `clamd` daemon, signatures stay loaded, scan finishes ~10× faster with much less CPU/RAM. Better long-term answer, but requires running `clamd` continuously (memory cost on small boxes is ~250 MB resident vs the cron approach which only holds RAM during scan). Trade-off, not strictly better.
## Daemonless Mode on Memory-Constrained Hosts
On hosts with ≤2 GB RAM, running `clamd` continuously is often counterproductive. The daemon loads its full signature database (~950 MB RSS) into memory and keeps it resident. On small VMs this crowds out MySQL, PHP-FPM, and other services — often pushing the whole system into swap rather than preventing anything.
**Affected hosts (fleet history):**
| Host | RAM | Incident | Resolution |
|------|-----|----------|------------|
| teelia | 1.9 GB | 2026-04-27 — clamd 728 MB RSS, 94% RAM alert | daemonless |
| dcaprod | 3.8 GB | 2026-04-30 — clamd OOM thrash after 512M cgroup cap | daemonless |
| majorlinux | 2.0 GB | 2026-05-15 — clamd 980 MB swap, mysqld swapping 293 MB | daemonless |
**The fix: `clamav_use_daemon: false` host_var**
`configure_clamav.yml` supports a per-host override. Add to the host's `host_vars/<hostname>/vars.yml`:
```yaml
clamav_use_daemon: false
```
Then re-run the playbook:
```bash
ansible-playbook configure_clamav.yml --limit <hostname>
```
This will:
- Stop and disable `clamav-daemon.service` and `clamav-daemon.socket`
- Deploy the weekly scan template using `clamscan` (daemonless, loads DB per run)
- Leave `clamav-freshclam` active so definitions stay current
**Trade-off:** Each weekly scan loads the signature DB fresh (~950 MB peak RAM for the scan duration, then freed). The scan takes longer than `clamdscan` (~35× on a warm daemon), but this is acceptable for a weekly background job. The `systemd-run MemoryMax` cgroup wrapper in the scan template caps peak usage so the scan can't OOM the host.
**Rule of thumb:** Use daemon mode (`clamav_use_daemon: true` or unset) on hosts with ≥4 GB RAM where scan speed matters (mail servers, upload handlers). Use daemonless on webservers and small VMs where continuous memory residency is the bigger risk.
## See Also
- [clamscan-cpu-spike-nice-ionice](../../05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md) — troubleshooting CPU spikes from unthrottled scans


@ -0,0 +1,168 @@
---
title: "HEVC Batch Re-Encode for Plex Using VAAPI (AMD GPU)"
domain: streaming
category: plex
tags: [plex, ffmpeg, hevc, vaapi, amd, gpu, encode, storage, rx480]
status: published
created: 2026-05-15
updated: 2026-05-15
---
# HEVC Batch Re-Encode for Plex Using VAAPI (AMD GPU)
## Problem
Plex NVMe storage is filling up from a large library of H.264-encoded video files (YouTube downloads, stream archives, etc.). Re-encoding to HEVC (H.265) reclaims 30-50% of disk space. The catch: Plex tracks each file's "date added" in a SQLite database, and that order matters for playback queues. Naive re-encode-and-replace approaches can corrupt or reset that metadata.
## Solution
Use `ffmpeg` with `hevc_vaapi` (AMD GPU hardware encoder) to batch re-encode files in-place using an atomic rename swap that preserves the Plex database record — including `added_at` — without any Plex downtime or database editing.
---
## How Plex Stores "Date Added"
Plex does **not** use file modification time (`mtime`) for "date added." It stores a Unix timestamp in its SQLite database:
```sql
-- Plex DB location (override via systemd unit may differ — check):
-- /var/lib/plexmediaserver/Library/Application Support/Plex Media Server/
-- Plug-in Support/Databases/com.plexapp.plugins.library.db
-- (or wherever PLEX_MEDIA_SERVER_APPLICATION_SUPPORT_DIR points)
SELECT mi.added_at, datetime(mi.added_at, 'unixepoch'), mp.file
FROM metadata_items mi
JOIN media_items me ON me.metadata_item_id = mi.id
JOIN media_parts mp ON mp.media_item_id = me.id
WHERE mp.file LIKE '%your-file%';
```
> **Note:** If the default path returns 0 rows, check your actual data directory:
> ```bash
> systemctl cat plexmediaserver | grep APPLICATION_SUPPORT
> ```
The `added_at` field is keyed to the **file path** in `media_parts`. As long as the file path doesn't change, the database record — including `added_at` — is untouched even after the file's content is replaced.
---
## Why VAAPI Instead of libx265
On a host with an AMD RX 480/580 (or similar Polaris GPU), hardware HEVC encoding via VAAPI is roughly **9× faster** than software libx265 at comparable quality:
| Encoder | Speed (1080p) | Notes |
|---|---|---|
| libx265 -preset medium | ~21 fps / 0.35× | Best quality/size ratio |
| hevc_vaapi QP 28 | ~186 fps / 3.1× | Sufficient for streaming content |
For 1080p streaming content (game streams, podcasts, YouTube archival), the quality difference is imperceptible. libx265 is preferable only for archival encodes where absolute quality matters.
### Verify VAAPI is working
```bash
vainfo 2>&1 | grep -E "vaapi|HEVC|hevc|Driver"
ls /dev/dri/renderD128
```
You need `VAProfileHEVCMain : VAEntrypointEncSlice` in the output. If missing, install `mesa-va-drivers-freeworld` (RPM Fusion) for AMD hardware.
---
## The Atomic Swap Strategy
The key insight: `mv file.tmp file` on the **same filesystem** is an atomic inode rename at the kernel level. Plex sees the same path still present — it never fires a "file removed" event, so the `metadata_items` record (including `added_at`) is preserved.
**Safe sequence:**
1. Encode source → `.hevc.tmp.mp4` alongside the original
2. Verify the output with `ffprobe`
3. `touch -r original.mp4 temp.mp4` — copy mtime (cosmetic, not required)
4. `mv temp.mp4 original.mp4` — atomic replace
**The one pitfall:** if the original file is deleted *before* the `mv`, Plex orphans the DB record (removes `metadata_items` entry on next scan) and re-indexes the new file with a fresh `added_at`. The original must still exist at swap time.
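The swap sequence can be sketched in Python. This is a minimal illustration, not the contents of `hevc_batch.sh`: the function name `atomic_swap` and the placeholder files are assumptions, and the `ffprobe` verification step is left out. `os.replace` performs the same same-filesystem rename as `mv`, and `os.utime` stands in for `touch -r`.

```python
import os
import tempfile


def atomic_swap(original: str, encoded_tmp: str) -> None:
    """Replace `original` with `encoded_tmp` without the path ever disappearing."""
    if not os.path.exists(original):
        # Swapping onto a missing path would look like a brand-new file to a scanner.
        raise FileNotFoundError(f"original missing, refusing to swap: {original}")
    st = os.stat(original)
    # Equivalent of `touch -r original temp`: carry over atime/mtime (cosmetic).
    os.utime(encoded_tmp, ns=(st.st_atime_ns, st.st_mtime_ns))
    # os.replace is an atomic rename when both paths are on the same filesystem.
    os.replace(encoded_tmp, original)


# Demonstration in a throwaway directory with placeholder contents.
with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "video.mp4")
    encoded = os.path.join(workdir, "video.hevc.tmp.mp4")
    with open(original, "w") as f:
        f.write("h264 placeholder")
    with open(encoded, "w") as f:
        f.write("hevc placeholder")
    atomic_swap(original, encoded)
    with open(original) as f:
        swapped_content = f.read()
    tmp_still_exists = os.path.exists(encoded)
```

The explicit existence check mirrors the pitfall above: the sketch refuses to proceed if the original has already been removed, since at that point the rename can no longer preserve the path's continuity.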
---
## The Batch Script
Script lives at `~/hevc_batch.sh` on majorhome.
```bash
# Dry run — scan and report what would be encoded, no changes
bash ~/hevc_batch.sh --dry-run
# Full run (default: files >1GB, QP 28)
tmux new-session -d -s hevc_batch 'bash ~/hevc_batch.sh'
# Custom options
bash ~/hevc_batch.sh --min-size-gb 2 --qp 26
```
### Queue and resume
The script writes a queue file at `~/hevc_queue.txt` on first run (scanning all files with ffprobe — takes ~10 min for a large library). On subsequent runs it resumes from where it left off. Completed files are logged to `~/hevc_done.txt`. Failed files go to `~/hevc_failed.txt`.
To restart from scratch: `rm ~/hevc_queue.txt ~/hevc_done.txt`
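The resume behaviour amounts to "queue minus done". The script itself is not reproduced here, so the following is only a hedged sketch of that idea; `remaining_files` is a hypothetical helper, and how the real script treats entries in `~/hevc_failed.txt` is not modelled.

```python
def remaining_files(queue_lines, done_lines):
    """Return queued paths that have not been completed yet, preserving queue order."""
    done = {line.strip() for line in done_lines if line.strip()}
    return [
        line.strip()
        for line in queue_lines
        if line.strip() and line.strip() not in done
    ]


# Illustrative data standing in for the contents of the queue and done files.
queue = ["/plex/a.mp4\n", "/plex/b.mp4\n", "/plex/c.mp4\n"]
done = ["/plex/a.mp4\n"]
print(remaining_files(queue, done))
```

Deleting both files, as in the restart command above, empties the done set and forces a fresh scan.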
### Log output
```bash
# Structured log lines only (skip ffmpeg progress noise)
grep '^\[20' ~/hevc_batch.log
# Watch live progress
tail -f ~/hevc_batch.log | grep '^\[20'
```
Each file logs:
- Source size and codec
- `Plex added_at before: <unix timestamp>`
- ffmpeg exit code and elapsed time
- Output size and savings
- `DB check: added_at PRESERVED ✓` (or WARN if changed)
### Space guard
The script aborts if free space on the Plex volume drops below 20GB (`MIN_FREE_GB`). Worst-case headroom needed is `source_size + tmp_size` simultaneously — on a 4GB source file that's ~8GB peak.
---
## ffmpeg Command
```bash
ffmpeg \
-vaapi_device /dev/dri/renderD128 \
-i "input.mp4" \
-vf 'format=nv12,hwupload' \
-c:v hevc_vaapi -rc_mode CQP -qp 28 \
-c:a copy \
-movflags +faststart \
-y "output.tmp.mp4"
```
- `-rc_mode CQP -qp 28` — constant quantizer; higher value = smaller file / lower quality. QP 24 is high quality, QP 28 is good for streaming content.
- `-vf 'format=nv12,hwupload'` — required to move frames to GPU memory for VAAPI encoding.
- `-c:a copy` — passes audio through untouched.
- `hevc_vaapi` does not support 10-bit output on Polaris (RX 480/580). For 10-bit HDR sources, fall back to `libx265` with color signaling flags.
---
## Plex Data Directory Override
On majorhome, the Plex data directory is overridden in the systemd unit — the default path `/var/lib/plexmediaserver/` is empty:
```bash
systemctl cat plexmediaserver | grep APPLICATION_SUPPORT
# Environment=PLEX_MEDIA_SERVER_APPLICATION_SUPPORT_DIR=/plex/plexdata/Library/Application Support
```
The actual DB path is therefore:
```
/plex/plexdata/Library/Application Support/Plex Media Server/Plug-in Support/Databases/com.plexapp.plugins.library.db
```
---
## Related
- [[plex-4k-codec-compatibility]] — Apple TV Direct Play compatibility, HEVC HDR notes
- [[snapraid-mergerfs-setup]] — MajorRAID storage pool setup
- [[SnapRAID-Majorhome]] — majorhome SnapRAID project


@ -0,0 +1,119 @@
# Tailscale Boot Race Conditions (SSH Unreachable After Reboot)
Two related race conditions can make a host unreachable via Tailscale after reboot. Both stem from systemd services starting before Tailscale or the network is ready.
---
## Race 1: ssh.socket Binds Before Tailscale Is Up (Ubuntu)
### Symptom
SSH to a host via its Tailscale IP times out. `tailscale ping` works and `tailscale status` shows `active; direct`, but SSH on port 22 refuses connections. With no root password set, the Hetzner console offers no way in either.
### Cause
Ubuntu 24.04 uses systemd **socket activation** for SSH (`ssh.socket` instead of persistent `ssh.service`). When the socket override binds to a Tailscale IP, it can start *before* `tailscaled.service` is ready. The bind may succeed initially (Tailscale state file caches the IP), but a later Tailscale reconnect or interface reset invalidates the bound address silently — SSH dies with no recovery path.
### Diagnosis
```bash
# From another host:
tailscale ping <IP> # succeeds — host is up
ssh root@<IP> # times out — sshd not listening
# After gaining console access or reboot:
systemctl status ssh.socket # check Listen: address
journalctl -b -1 -u ssh # likely empty — sshd never spawned
journalctl -b -1 -u ssh.socket # socket started before tailscaled
```
### Fix
Add Tailscale dependency to the socket override:
```ini
# /etc/systemd/system/ssh.socket.d/override.conf
[Unit]
After=tailscaled.service
BindsTo=tailscaled.service
[Socket]
ListenStream=
ListenStream=<TAILSCALE_IP>:22
```
Then reload and restart:
```bash
systemctl daemon-reload
systemctl restart ssh.socket
systemctl status ssh.socket # verify Listen: shows correct IP
```
- `After=` ensures the socket waits for Tailscale to start
- `BindsTo=` restarts the socket if Tailscale restarts, preventing stale binds
### Affected Hosts
Ubuntu hosts using `configure_tailscale_ssh_only.yml`: majorlinux, dcaprod-hetzner. Fedora hosts (majordiscord) use firewall rules for SSH restriction — not affected by this race.
---
## Race 2: tailscaled Starts Before Network Is Online (All Hosts)
### Symptom
Host reboots but never appears on Tailscale. `tailscale ping` times out entirely. SSH is dead because Tailscale never connects. The host is up (accessible via provider console) but isolated from the Tailscale network.
### Cause
`tailscaled.service` ships with `After=network-pre.target`, which fires *before* the network interface has an IP. On VPS hosts (especially Hetzner), the interface can take several seconds to come online. Tailscale starts, sees no network (`SetNetworkUp(false)`, `link state: defaultRoute= ifs={} v4=false v6=false`), fails DNS bootstrap and DERP relay connections, and gets stuck — never retrying.
### Diagnosis
```bash
# From Hetzner console or another access method:
journalctl -b -u tailscaled | grep -E "SetNetworkUp|link state|error|DERP"
# Look for:
# magicsock: SetNetworkUp(false)
# link state: interfaces.State{defaultRoute= ifs={} v4=false v6=false}
# health: Tailscale could not connect to any relay server
```
### Fix
Deploy a systemd drop-in to wait for full network connectivity:
```ini
# /etc/systemd/system/tailscaled.service.d/override.conf
[Unit]
After=network-online.target
Wants=network-online.target
```
Then reload and restart:
```bash
systemctl daemon-reload
systemctl restart tailscaled
```
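To audit whether a host already carries the drop-in, the override text can be checked programmatically. A minimal sketch, assuming the file contents have been read into a string; `unit_directives` and `waits_for_network_online` are illustrative helpers, and the parser is deliberately simplified (it does not model systemd's rule that an empty assignment resets a list).

```python
def unit_directives(unit_text: str, section: str) -> dict[str, list[str]]:
    """Collect directives of one section; systemd keys may repeat, so values are lists."""
    found: dict[str, list[str]] = {}
    current = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            continue
        if current == section and "=" in line:
            key, value = line.split("=", 1)
            found.setdefault(key.strip(), []).extend(value.split())
    return found


def waits_for_network_online(unit_text: str) -> bool:
    """True when the [Unit] section both orders after and pulls in network-online.target."""
    unit = unit_directives(unit_text, "Unit")
    return (
        "network-online.target" in unit.get("After", [])
        and "network-online.target" in unit.get("Wants", [])
    )


override = """\
# /etc/systemd/system/tailscaled.service.d/override.conf
[Unit]
After=network-online.target
Wants=network-online.target
"""
print(waits_for_network_online(override))
```

Checking for both directives matters: `After=` alone only orders the units, while `Wants=` is what pulls `network-online.target` into the boot transaction.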
### Affected Hosts
All hosts where Tailscale is the primary access path. Particularly impactful on VPS hosts with slow interface bringup. Both Fedora and Ubuntu hosts are affected.
---
## Prevention
- Set root passwords on all VPS hosts for emergency console access
- Ansible playbooks deploy both fixes automatically:
  - `configure_tailscale_network_wait.yml`: tailscaled network-online dependency (all hosts)
  - `configure_tailscale_ssh_only.yml`: ssh.socket Tailscale dependency (Ubuntu only)
## References
- [[dcaprod#2026-05-19 — SSH unreachable due to ssh.socket race condition with Tailscale]]
- [[majordiscord#2026-05-19 — Tailscale boot race: unreachable after Ansible reboot]]
- [[majorlinux#2026-05-19 — ssh.socket override patched: added Tailscale dependency]]
- Ansible: `configure_tailscale_ssh_only.yml`, `configure_tailscale_network_wait.yml`


@ -0,0 +1,129 @@
---
title: "OBS Studio — \"Error opening file: (null)\" After Windows Profile Rename"
domain: troubleshooting
category: streaming
tags: [obs, streaming, windows, lua, profile-migration]
status: published
created: 2026-05-14
updated: 2026-05-14
---
# OBS Studio — "Error opening file: (null)" After Windows Profile Rename
## Symptom
Loading a scene collection in OBS Studio triggers a popup like:
```
[<ScriptName>.lua] Error opening file: (null)
```
The `(null)` is the giveaway: OBS resolved the registered script path to nothing — the file doesn't exist where the scene collection says it does. Most commonly this happens after a Windows profile was renamed or migrated and `C:\Users\<old>\...` paths were not updated.
## Why it happens
OBS stores per-scene-collection Lua/Python script registrations inside the scene collection JSON at:
```
%APPDATA%\obs-studio\basic\scenes\<Collection>.json
```
Each entry under `modules.scripts-tool[]` is an absolute Windows path. Renaming the Windows profile does not rewrite these — the JSON keeps pointing at the old `C:\Users\<old>\...` location, and OBS surfaces the resolution failure as a `(null)` popup on collection load.
## Diagnose
From WSL (or any shell with access to `%APPDATA%`):
```bash
OBS_DIR="/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio"
# 1. List scene collections
ls "$OBS_DIR/basic/scenes/"
# 2. Find collections referencing the missing script
grep -l -i "<script-name-substring>" "$OBS_DIR/basic/scenes/"*.json
# 3. Dump the scripts-tool paths from each suspect collection
python3 -c "
import json, sys
with open(sys.argv[1], encoding='utf-8') as f:
    d = json.load(f)
for s in d.get('modules', {}).get('scripts-tool', []):
    print(s.get('path'))
" "$OBS_DIR/basic/scenes/<Collection>.json"
```
If a printed path contains `C:/Users/<old-username>/...` and the file doesn't exist on disk, you've found it.
## Fix
> [!warning] Close OBS first
> OBS rewrites the scene collection JSON when it exits. Any edit made while OBS is running will be overwritten. Confirm with `tasklist.exe | grep obs64` (WSL) or Task Manager.
### 1. Make the missing script reachable
Either:
- **Re-extract / restore the script** to a path under the new profile (recommended — gives you a clean canonical home), or
- **Leave it in the rescue/migration folder** and point OBS there (fragile if the rescue folder is later deleted).
### 2. Back up the scene collection JSON
```bash
SCENES="/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio/basic/scenes"
STAMP="$(date +%Y%m%d-%H%M%S)"
cp -p "$SCENES/<Collection>.json" "$SCENES/<Collection>.json.$STAMP.bak"
```
### 3. Rewrite the paths atomically
Edit the JSON in place by parsing it, replacing the matched path strings, and writing through a temp file (so a crash mid-write can't corrupt the collection):
```bash
python3 <<'PY'
import json, os

scenes = "/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio/basic/scenes"
# Old path -> new path; fill in the real locations.
mapping = {
    "C:/Users/<old>/Pictures/.../<script>.lua":
        "C:/Users/<new>/Pictures/.../<script>.lua",
}
for fn in ("<Collection>.json",):
    path = os.path.join(scenes, fn)
    with open(path, encoding="utf-8") as f:
        d = json.load(f)
    for entry in d.get("modules", {}).get("scripts-tool", []):
        if entry.get("path") in mapping:
            entry["path"] = mapping[entry["path"]]
    # Write to a temp file first, then rename over the original.
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(d, f, indent=4)
    os.replace(tmp, path)
PY
```
OBS scene JSONs use forward slashes in Windows paths — preserve that style.
### 4. Verify
Re-run the diagnostic Python snippet and confirm every printed path resolves to a real file (when checking from WSL, translate `C:/` to `/mnt/c/`).
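The path translation for the verification step can be scripted. A minimal sketch, assuming WSL's default automount location under `/mnt/<drive>`; `windows_to_wsl` is an illustrative helper and the sample path is made up.

```python
import re


def windows_to_wsl(path: str) -> str:
    """Map an absolute Windows path (C:/... or C:\\...) to its default WSL mount path."""
    normalized = path.replace("\\", "/")
    match = re.match(r"^([A-Za-z]):/(.*)$", normalized)
    if not match:
        raise ValueError(f"not an absolute Windows path: {path}")
    drive, rest = match.groups()
    return f"/mnt/{drive.lower()}/{rest}"


print(windows_to_wsl("C:/Users/example/Pictures/overlay.lua"))
```

Passing each translated path to `os.path.isfile` then shows which script registrations still point at nothing.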
### 5. Reopen OBS
Load the scene collection. The popup should be gone.
## Why not just remove the script?
If the script is part of a third-party overlay pack (Twitch Pimpage, OWN3D, etc.), removing the registration also removes the overlay's source presets — fixing the path keeps the imported scenes intact. If you don't actually use the overlay anymore, removing the `scripts-tool` entry is fine; OBS will silently drop the broken reference on next save.
## Generalization
This same pattern applies to any OBS asset path stored in a scene collection or profile:
- Browser source local files
- Image / media source files
- Lua / Python script paths
- VST plugin paths
All of them are absolute, all of them survive a Windows profile rename in stale form, and all of them can be batch-rewritten with the same JSON-edit pattern above. Search for the old username substring across `%APPDATA%\obs-studio\` to catch them all in one pass.
## Related
- [[../../MajorInfrastructure/Devices/MajorRig|MajorRig device note]] — Incident Log 2026-05-14 (TTT/MLS scene popups) and 2026-05-07 (`majli` profile retirement that left these references stranded)
- [[../04-streaming/obs/obs-studio-setup-encoding|OBS Studio Setup and Encoding Settings]]


@ -1,11 +1,17 @@
---
title: "ClamAV Safe Scheduling on Live Servers"
title: ClamAV Safe Scheduling on Live Servers
domain: troubleshooting
category: security
tags: [clamav, cpu, nice, ionice, cron, vps]
tags:
- clamav
- cpu
- nice
- ionice
- cron
- vps
status: published
created: 2026-04-02
updated: 2026-04-02
updated: 2026-05-11T18:31
---
# ClamAV Safe Scheduling on Live Servers
@ -75,6 +81,7 @@ kill <PID>
- `ionice -c 3` (Idle) requires Linux kernel ≥ 2.6.13 and CFQ/BFQ I/O scheduler. Works on most Ubuntu/Debian/Fedora systems.
- On multi-core servers, consider also using `cpulimit` for a hard cap: `cpulimit -l 30 -- clamscan ...`
- Always keep `--exclude=/sys` (and optionally `--exclude=/proc`, `--exclude=/dev`) to avoid scanning virtual filesystems.
- **1 vCPU limitation:** `nice` and `ionice` only help when other processes compete for resources. On a single-core VPS, clamscan will still saturate the CPU at 57-100% even with `nice -n 19 ionice -c 3` — there's nothing to yield to. Accept the weekly spike as benign, or reduce scan scope to shorten the window.
## Related


@ -0,0 +1,116 @@
---
title: "Fedora CA Bundle Missing Symlink — TLS Breaks Fleet-Wide"
description: Hetzner-provisioned Fedora images may be missing the /etc/pki/tls/certs/ca-bundle.crt symlink, silently breaking Postfix TLS relay, curl, and dnf
tags:
- fedora
- tls
- postfix
- ca-certificates
- hetzner
- troubleshooting
status: published
created: 2026-05-11
updated: 2026-05-11
---
# Fedora CA Bundle Missing Symlink
On Fedora, many TLS clients (Postfix, curl, dnf) look for the CA bundle at `/etc/pki/tls/certs/ca-bundle.crt`. This path is normally a symlink to `/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem`, shipped by the `ca-certificates` package.
On Hetzner Cloud Fedora images (observed on Fedora 44, May 2026), this symlink can be missing despite `ca-certificates` being installed. The extracted bundle exists, but the consumer-facing symlink does not.
## Symptoms
Postfix relay to a TLS-required upstream fails:
```
postfix/smtp: cannot load Certification Authority data,
CAfile="/etc/pki/tls/certs/ca-bundle.crt",
CApath="/etc/pki/tls/certs": disabling TLS support
```
If your relay requires TLS (port 465 with `smtp_tls_wrappermode = yes`, or `smtp_tls_security_level = encrypt`), mail silently queues as deferred. No bounce, no alert — just silence.
Other symptoms on the same box:
```bash
# curl fails
curl https://example.com
# error: Problem with the SSL CA cert (path? access rights?)
# dnf fails
dnf list --installed
# Curl error (77): Problem with the SSL CA cert
```
## Diagnosis
```bash
# Check the symlink
ls -la /etc/pki/tls/certs/ca-bundle.crt
# Expected: symlink -> /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
# Broken: "No such file or directory"
# Verify the extracted bundle exists
ls -la /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
# Should exist (~220 KB, ~140-150 certs)
# Confirm the package is installed
rpm -q ca-certificates
# Should return a version string
```
If the extracted bundle exists but the symlink at `/etc/pki/tls/certs/ca-bundle.crt` is missing, that's the problem.
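The three checks above can be folded into one classification, which is handy when auditing several hosts. This is a minimal sketch: `ca_bundle_state` and its state names are hypothetical, and both paths are parameters that default to the locations named in this article.

```bash
# Hypothetical helper: classify the state of the CA bundle symlink.
# Both paths are parameters so the check can be pointed at a test tree.
ca_bundle_state() {
    link="${1:-/etc/pki/tls/certs/ca-bundle.crt}"
    target="${2:-/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem}"

    if [ ! -s "$target" ]; then
        echo "bundle-missing"   # extracted bundle absent or empty
    elif [ -L "$link" ] && [ ! -e "$link" ]; then
        echo "link-dangling"    # symlink exists but points at nothing
    elif [ ! -e "$link" ]; then
        echo "link-missing"     # the failure described in this article
    else
        echo "ok"
    fi
}

# Usage: ca_bundle_state
```

Only `link-missing` and `link-dangling` are addressed by the fix below; `bundle-missing` means the extracted bundle itself needs attention first.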
## Fix
```bash
sudo ln -sf /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem \
/etc/pki/tls/certs/ca-bundle.crt
sudo systemctl restart postfix
sudo postqueue -f # flush any deferred mail
```
Verify:
```bash
# Symlink exists
ls -la /etc/pki/tls/certs/ca-bundle.crt
# Postfix can relay
echo "Subject: TLS test" | sendmail -v marcus@majorshouse.com
# curl works
curl -sI https://example.com | head -1
```
## Fleet Audit
If one Hetzner-provisioned Fedora host has this issue, check the others:
```bash
for host in majordiscord majorlab majorhome majormail; do
echo "$host: $(ssh root@$host 'ls /etc/pki/tls/certs/ca-bundle.crt 2>&1' | tail -1)"
done
```
Hosts returning "No such file or directory" are silently broken for every TLS client that reads the bundle from this path (Postfix, curl, dnf).
## Why This Happens
`update-ca-trust extract` regenerates the files under `/etc/pki/ca-trust/extracted/` but does not create the legacy consumer-path symlink at `/etc/pki/tls/certs/ca-bundle.crt`. That symlink is shipped by the `ca-certificates` RPM. On cloud images built from minimal installs or snapshot-based provisioning, the symlink can be lost during image creation or a partial upgrade.
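Since the symlink is a packaged file, RPM's own verification should be able to flag its absence. The sketch below filters `rpm -V` output down to files the package ships but that are absent on disk. It assumes `rpm -V` marks such files with a leading `missing` column; `missing_packaged_files` is a hypothetical helper name.

```bash
# Hypothetical helper: filter `rpm -V` output (stdin) down to the paths of
# files the package ships but that are absent on disk.
# Assumption: such lines start with the word "missing" and end with the path.
missing_packaged_files() {
    awk '$1 == "missing" { print $NF }'
}

# Usage on a Fedora host:
#   rpm -V ca-certificates | missing_packaged_files
```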
## Prevention
Add to your provisioning checklist (see [VPS Migration Baseline Checklist](../../02-selfhosting/cloud/vps-migration-baseline-checklist.md)):
```bash
# Fedora provisioning — verify CA bundle symlink
ls /etc/pki/tls/certs/ca-bundle.crt || \
ln -sf /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem /etc/pki/tls/certs/ca-bundle.crt
```
## Related
- [Logwatch Fleet Setup](../../02-selfhosting/monitoring/logwatch-fleet-setup.md) — logwatch depends on a working Postfix relay, which depends on TLS, which depends on this symlink
- [VPS Migration Baseline Checklist](../../02-selfhosting/cloud/vps-migration-baseline-checklist.md) — includes CA bundle verification step


@ -0,0 +1,112 @@
---
title: Netdata apps-group FD-utilisation false 100% (silenced fleet-wide)
domain: troubleshooting
category: security
tags:
- netdata
- apps.plugin
- file-descriptors
- tailscale
- false-positive
- ansible
- fleet
status: published
created: 2026-05-15
updated: 2026-05-15T02:40
---
# Netdata apps-group FD-utilisation false 100%
The Netdata stock alarm **`apps_group_file_descriptors_utilization`** (from
`/usr/lib/netdata/conf.d/health.d/file_descriptors.conf`) fires
`Raised to Warning — App group <X> file descriptors utilization = 100%`
emails for application groups that are perfectly healthy. First seen on
**MajorToot** (the `tailscaled` app group), 2026-05-15.
## The Problem
A Netdata email arrives: *"App group tailscaled file descriptors utilization
= 100% on MajorToot"*. The process is fine. On the host:
```
PID 1047 tailscaled (daemon) fds=35 soft_limit=524287 util=0.01%
PID 1984541 tailscaled (child) fds=10 soft_limit=524287 util=0.00%
PID 1984548 bash (tailscale hook) fds=5 soft_limit=1024 util=0.49%
```
No PID exceeds **0.5%**, yet `app.fds_open_limit` reads ~100%. Over 1h the raw
chart was min 0 / **mean 36.7** / max 100, with sustained multi-minute 100%
plateaus (not isolated spikes).
> This is **not** an `apps.plugin` privilege problem. apps.plugin already has
> `cap_dac_read_search,cap_sys_ptrace` and `sudo -u netdata cat
> /proc/<pid>/limits` succeeds. Verify before "fixing" privileges — it's a
> no-op.
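
The per-PID figures in the listing above can be reproduced from `/proc` directly. A minimal sketch, assuming a Linux `/proc` layout where the `Max open files` row of `limits` carries the soft limit in its fourth column; `fd_util` and its optional proc-root argument are hypothetical.

```bash
# Hypothetical helper: print open-FD count, soft limit, and utilisation
# for one PID. The proc root is a parameter (default /proc) so the function
# can be exercised against a fake tree.
fd_util() {
    pid="$1"
    proc_root="${2:-/proc}"
    # Count entries in the fd directory (0 if unreadable).
    fds=$(ls "$proc_root/$pid/fd" 2>/dev/null | wc -l)
    # Soft limit is the 4th field of the "Max open files" row.
    soft=$(awk '/^Max open files/ { print $4 }' "$proc_root/$pid/limits" 2>/dev/null) || soft=""
    case "$soft" in
        ''|*[!0-9]*) echo "pid=$pid unreadable" >&2; return 1 ;;
    esac
    awk -v p="$pid" -v f="$fds" -v s="$soft" \
        'BEGIN { printf "pid=%s fds=%d soft_limit=%s util=%.2f%%\n", p, f, s, (f / s) * 100 }'
}

# Usage (as root, to read other users' processes):
#   for p in $(pgrep tailscaled); do fd_util "$p"; done
```

If every PID in the group reports a fraction of a percent while the chart reads 100%, the alarm is not describing the processes.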
## Root Cause
The stock alarm does `lookup: max -10s` over **every PID in the app group**.
App groups whose processes fork short-lived children (tailscaled spawns
route/DNS helpers and bash hooks; `bash` children inherit the systemd default
soft limit of 1024) trip a false 100%: apps.plugin's per-PID FD-limit read
**races on transient/just-forked PIDs**, and because the group lookup uses
`max`, a single bad 10-second sample pegs the entire group to ~100%. The
signal carries no usable information for any forking/root app group.
A `lookup: average -5m` does **not** rescue it — the bogus reading sits at
~100% for sustained multi-minute stretches, so the 5-minute rolling average
itself still reaches 100.0% (empirically verified on MajorToot).
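The arithmetic behind both observations is easy to reproduce. This is toy arithmetic only, not Netdata's implementation: `window_stats` is a hypothetical helper that takes one utilisation sample per line and prints the max and the mean, standing in for `lookup: max` and `lookup: average`.

```bash
# Hypothetical helper: read one utilisation sample (percent) per line on
# stdin and print the max and the mean of the window.
window_stats() {
    awk '
        { v = $1 + 0; sum += v }
        NR == 1 || v > max { max = v }
        END { if (NR) printf "max=%.2f mean=%.2f\n", max, sum / NR }
    '
}

# One bad read among healthy PIDs pegs the max:
#   printf '%s\n' 0.01 0.00 100 | window_stats    # max=100.00 mean=33.34
# A sustained plateau pegs the mean as well:
#   awk 'BEGIN { for (i = 0; i < 30; i++) print 100 }' | window_stats
#                                                 # max=100.00 mean=100.00
```

Averaging only helps against isolated spikes; once the bogus value persists for the whole window there is nothing left to average it against.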
## The Fix
Silence this template fleet-wide and keep the reliable system-wide FD alarm.
- **Codified in Ansible** (do not hand-edit hosts): `MajorAnsible/netdata.yml`
ships `templates/health_apps_fds_group.conf.j2` to
`/etc/netdata/health.d/apps_fds_group_override.conf` and reloads via
`netdatacli reload-health`.
- The override redefines `apps_group_file_descriptors_utilization` with
`to: silent`. Netdata loads `/etc/netdata/health.d/` *after* the stock
`conf.d` dir, so a same-name template deterministically supersedes the stock
one (same mechanism as the manual `tcp_resets.conf` override, 2026-04-30).
- **Safety net retained:** the companion stock template
`system_file_descriptors_utilization` (on `system.file_nr_utilization`,
`crit > 90`, `to: sysadmin`) is untouched and still catches genuine
system-wide FD exhaustion regardless of app grouping.
- The reload handler is restart-tolerant (`retries`/`until` plus a `failed_when`
that ignores a `netdata.pipe` socket-absent error): on hosts where the notify
config has also drifted, the `Restart Netdata` and `Reload Netdata health`
handlers can race during the ~5s restart window.
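Before querying the API, it can be worth confirming that the deployed override file really routes the template to `silent`. A minimal sketch, assuming the usual `key: value` layout of Netdata health files; `template_is_silent` is a hypothetical helper, and the file path in the usage line is the one named above.

```bash
# Hypothetical helper: succeed only if FILE defines the named health
# template and that stanza routes notifications to `silent`.
template_is_silent() {
    file="$1"
    name="${2:-apps_group_file_descriptors_utilization}"
    awk -v want="$name" '
        # Track which stanza we are in.
        $1 == "template:" || $1 == "alarm:" { current = $2 }
        # Record the recipient only for the stanza of interest.
        $1 == "to:" && current == want { found = ($2 == "silent") }
        END { exit(found ? 0 : 1) }
    ' "$file"
}

# Usage:
#   template_is_silent /etc/netdata/health.d/apps_fds_group_override.conf \
#       && echo "override in place"
```

This only inspects the file on disk; whether the running agent has loaded it is what the Verification step checks.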
## Verification
```bash
ssh <host> 'curl -s "http://localhost:19999/api/v1/alarms?all=true" \
| python3 -c "import sys,json;A=json.load(sys.stdin)[\"alarms\"]; \
print(A[\"app.tailscaled_fds_open_limit.apps_group_file_descriptors_utilization\"][\"recipient\"])"'
# expect: silent
```
After the fix the alarm still shows `status=WARNING` in the dashboard
(cosmetic — silencing suppresses the *notification*, not the computed state);
`recipient=silent` confirms no more emails. The system-wide alarm should read
`CLEAR recipient=sysadmin`.
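The one-liner above checks a single app group. The sketch below lists every per-app-group FD alarm with its status and recipient, and it runs the JSON parsing locally so the nested SSH quoting goes away. It assumes the same `/api/v1/alarms?all=true` response shape used above (an `alarms` object whose entries carry `status` and `recipient`) and a local `python3`; `fds_alarm_recipients` is a hypothetical helper name.

```bash
# Hypothetical helper: read /api/v1/alarms?all=true JSON on stdin and print
# "<alarm key> <status> <recipient>" for every per-app-group FD alarm.
fds_alarm_recipients() {
    python3 -c '
import json, sys

alarms = json.load(sys.stdin)["alarms"]
suffix = ".apps_group_file_descriptors_utilization"
for key in sorted(alarms):
    if key.endswith(suffix):
        a = alarms[key]
        print(key, a.get("status"), a.get("recipient"))
'
}

# Usage:
#   ssh <host> 'curl -s "http://localhost:19999/api/v1/alarms?all=true"' \
#       | fds_alarm_recipients
```

Any line whose last column is not `silent` means the override has not taken effect for that app group.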
## Notes
- Silenced fleet-wide on all 10 servers 2026-05-15 (workstations majorrig/
majormac were asleep — irrelevant, they are not fleet servers).
- Any future host running a forking/root daemon in a named app group would
otherwise hit the same false positive, so the silence is applied fleet-wide and pre-emptively.
- **Follow-up debt:** the manual `/etc/netdata/health.d/tcp_resets.conf`
override on MajorToot (2026-04-30) is still **not codified in
`netdata.yml`** — a per-host divergence the fleet play does not manage.
Worth folding into Ansible the same way.
## Related
- [[clamscan-cpu-spike-nice-ionice]]
- [[netdata-web-log-successful-redirect-heavy-tuning]]
- Server doc: `30-Areas/MajorInfrastructure/Servers/majortoot.md` (incident
2026-05-15)
- Playbook: `MajorAnsible/netdata.yml` +
`templates/health_apps_fds_group.conf.j2`


@ -1,6 +1,6 @@
---
created: 2026-04-02T16:03
updated: 2026-05-10T00:10
updated: 2026-05-15T09:00
---
* [Home](index.md)
* [Linux & Sysadmin](01-linux/index.md)
@ -28,6 +28,7 @@ updated: 2026-05-10T00:10
* [Wake-on-LAN via Router SSH](02-selfhosting/dns-networking/wake-on-lan-router-ssh.md)
* [Pi-hole v6 Group Management — Per-Client DNS Rules](02-selfhosting/dns-networking/pihole-v6-group-management.md)
* [AWS S3 Cost Management](02-selfhosting/cloud/aws-s3-cost-management.md)
* [VPS Migration Baseline Checklist](02-selfhosting/cloud/vps-migration-baseline-checklist.md)
* [rsync Backup Patterns](02-selfhosting/storage-backup/rsync-backup-patterns.md)
* [Tuning Netdata Web Log Alerts](02-selfhosting/monitoring/tuning-netdata-web-log-alerts.md)
* [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md)
@ -69,11 +70,13 @@ updated: 2026-05-10T00:10
* [Streaming & Podcasting](04-streaming/index.md)
* [OBS Studio Setup & Encoding](04-streaming/obs/obs-studio-setup-encoding.md)
* [Plex 4K Codec Compatibility (Apple TV)](04-streaming/plex/plex-4k-codec-compatibility.md)
* [HEVC Batch Re-Encode for Plex Using VAAPI (AMD GPU)](04-streaming/plex/hevc-vaapi-batch-encode.md)
* [Troubleshooting](05-troubleshooting/index.md)
* [Apache Outage: Fail2ban Self-Ban + Missing iptables Rules](05-troubleshooting/networking/fail2ban-self-ban-apache-outage.md)
* [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](05-troubleshooting/networking/fail2ban-imap-self-ban-mail-client.md)
* [firewalld: Mail Ports Wiped After Reload](05-troubleshooting/networking/firewalld-mail-ports-reset.md)
* [Tailscale SSH: Unexpected Re-Authentication Prompt](05-troubleshooting/networking/tailscale-ssh-reauth-prompt.md)
* [ssh.socket Unreachable After Reboot (Tailscale Race Condition)](05-troubleshooting/networking/ssh-socket-tailscale-race-condition.md)
* [Fail2ban & UFW Rule Bloat Cleanup](05-troubleshooting/networking/fail2ban-ufw-rule-bloat-cleanup.md)
* [Custom Fail2ban Jail: Apache Directory Scanning](05-troubleshooting/security/apache-dirscan-fail2ban-jail.md)
* [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md)
@ -104,7 +107,10 @@ updated: 2026-05-10T00:10
* [rsync over Tailscale: Hung in TCP Teardown After Transfer Completes](05-troubleshooting/networking/rsync-tailscale-teardown-stall.md)
* [iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators](05-troubleshooting/networking/tailscale-status-json-hostname-localhost-ios.md)
* [macOS: Repeating Alert Tone from Mirrored iPhone Notification](05-troubleshooting/macos-mirrored-notification-alert-loop.md)
* [OBS Studio: Stale Script Paths After Windows Profile Rename](05-troubleshooting/obs-stale-script-paths-after-windows-profile-rename.md)
* [ClamAV CPU Spike: Safe Scheduling with nice/ionice](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md)
* [Fedora CA Bundle Missing Symlink — TLS Breaks Fleet-Wide](05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md)
* [Netdata apps-group FD-utilisation false 100% (silenced fleet-wide)](05-troubleshooting/security/netdata-apps-fds-group-false-positive.md)
* [Ansible: Vault Password File Not Found](05-troubleshooting/ansible-vault-password-file-missing.md)
* [Ansible: ansible.cfg Ignored on WSL2 Windows Mounts](05-troubleshooting/ansible-wsl2-world-writable-mount-ignores-cfg.md)
* [Ansible: SSH Timeout During dnf upgrade on Fedora Hosts](05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md)