Compare commits
11 commits
9c62e7f804
...
65b0aa4567
| Author | SHA1 | Date | |
|---|---|---|---|
| 65b0aa4567 | |||
| eb39da9a26 | |||
| 7dc591d257 | |||
| 64ac418a36 | |||
| 28518e403e | |||
| a785e85821 | |||
| 4ec481c584 | |||
| c22457f1aa | |||
| ac84610380 | |||
| 3df0979786 | |||
| de9b661b9d |
10 changed files with 834 additions and 7 deletions
97
02-selfhosting/cloud/vps-migration-baseline-checklist.md
Normal file
|
|
@ -0,0 +1,97 @@
|
||||||
|
---
|
||||||
|
title: VPS Migration Baseline Checklist
|
||||||
|
description: What to verify after migrating a server to a new provider — the packages, services, and configs that must match the old box
|
||||||
|
tags:
|
||||||
|
- migration
|
||||||
|
- vps
|
||||||
|
- hetzner
|
||||||
|
- digitalocean
|
||||||
|
- ansible
|
||||||
|
- checklist
|
||||||
|
status: published
|
||||||
|
created: 2026-05-09
|
||||||
|
updated: 2026-05-13T10:35
|
||||||
|
---
|
||||||
|
|
||||||
|
# VPS Migration Baseline Checklist
|
||||||
|
|
||||||
|
When migrating a server from one VPS provider to another, it's easy to focus on the application (bots, web services, databases) and forget the infrastructure baseline. This checklist covers the common components that make a server operational beyond just running the app.
|
||||||
|
|
||||||
|
## Background
|
||||||
|
|
||||||
|
During the Hetzner migration (2026-05), `majordiscord` was migrated with only the application layer (PhantomBot, Red-DiscordBot) and core infrastructure (Netdata, Tailscale, fail2ban). Missing from the new box: Postfix (email relay), logwatch, ClamAV, and dnf-automatic. The gap went unnoticed for a week because all monitoring email depended on the missing Postfix.
|
||||||
|
|
||||||
|
## The Checklist
|
||||||
|
|
||||||
|
### Before Migration
|
||||||
|
|
||||||
|
Power on both old and new boxes. Run this comparison to find gaps:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Fedora — list baseline packages on both hosts
|
||||||
|
ssh root@OLD_HOST 'rpm -qa --qf "%{NAME}\n" | sort | grep -iE "fail2ban|logwatch|postfix|netdata|clamav|dnf-auto|tailscale|cronie|firewalld"'
|
||||||
|
ssh root@NEW_HOST 'rpm -qa --qf "%{NAME}\n" | sort | grep -iE "fail2ban|logwatch|postfix|netdata|clamav|dnf-auto|tailscale|cronie|firewalld"'
|
||||||
|
|
||||||
|
# Ubuntu — list baseline packages on both hosts
|
||||||
|
ssh root@OLD_HOST 'dpkg -l | grep -iE "fail2ban|logwatch|postfix|netdata|clamav|unattended|tailscale" | awk "{print \$2}" | sort'
|
||||||
|
ssh root@NEW_HOST 'dpkg -l | grep -iE "fail2ban|logwatch|postfix|netdata|clamav|unattended|tailscale" | awk "{print \$2}" | sort'
|
||||||
|
```
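Comparing two lists by eye is how gaps get missed. As a minimal sketch, assuming each host's package list has been saved to a local file with one name per line, `comm` can print only what the new box lacks. The function name `missing_on_new` and the file names are illustrative, not part of the fleet tooling.

```bash
# Print packages that appear in the old host's list but not in the new one.
# $1 = file with the old host's package names, $2 = same for the new host.
missing_on_new() {
    old_sorted=$(mktemp)
    new_sorted=$(mktemp)
    sort -u "$1" > "$old_sorted"
    sort -u "$2" > "$new_sorted"
    # comm -23 keeps lines unique to the first file (old host only).
    comm -23 "$old_sorted" "$new_sorted"
    rm -f "$old_sorted" "$new_sorted"
}
```

Redirect each of the ssh commands above into a file (for example `old.txt` and `new.txt`), then run `missing_on_new old.txt new.txt`. Empty output means the baseline matches for the packages in the grep pattern.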
|
||||||
|
|
||||||
|
Compare enabled services:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ssh root@HOST 'systemctl list-unit-files --state=enabled --no-pager | grep -iE "fail2ban|logwatch|postfix|netdata|clamav|dnf-auto|tailscale|cronie|firewalld|sshd"'
|
||||||
|
```
|
||||||
|
|
||||||
|
### Baseline Components
|
||||||
|
|
||||||
|
Every server in the fleet should have these. Check each one after migration:
|
||||||
|
|
||||||
|
| Component | Package (Fedora) | Package (Ubuntu) | Ansible Playbook | Notes |
|
||||||
|
|-----------|-----------------|------------------|------------------|-------|
|
||||||
|
| Monitoring | `netdata` | `netdata` | `netdata.yml` | Claim to Netdata Cloud if applicable |
|
||||||
|
| VPN | `tailscale` | `tailscale` | — (manual join) | Rename node in Tailscale admin |
|
||||||
|
| Intrusion prevention | `fail2ban` | `fail2ban` | `harden.yml` | Check jail.local, banaction matches firewall |
|
||||||
|
| Email relay | `postfix` | `postfix` | `configure_postfix_relay.yml` | Required by logwatch, Netdata, fail2ban |
|
||||||
|
| Log summaries | `logwatch` | `logwatch` | `logwatch.yml` | Override file, not defaults — see [logwatch fleet setup](../monitoring/logwatch-fleet-setup.md) |
|
||||||
|
| Firewall | `firewalld` | `ufw` | `configure_firewall_*.yml` | Verify fail2ban banaction matches |
|
||||||
|
| Cron | `cronie` | `cron` | — (usually pre-installed) | Required by logwatch |
|
||||||
|
| Auto-updates | `dnf-automatic` | `unattended-upgrades` | `ansible-unattended-upgrades-fleet` | Security patches only |
|
||||||
|
| Antivirus | `clamav` | `clamav` | `configure_clamav.yml` | Internet-facing hosts only |
|
||||||
|
| SSH hardening | `openssh-server` | `openssh-server` | `configure_ssh_hardening.yml` | Key-only, no root password |
|
||||||
|
| Timezone | — | — | — | US servers: `America/New_York`; UK: `Europe/London`. Hetzner defaults to UTC. |
|
||||||
|
| CA bundle (Fedora) | `ca-certificates` | `ca-certificates` | — | Verify `/etc/pki/tls/certs/ca-bundle.crt` symlink exists — see [Fedora CA bundle fix](../../05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md) |
|
||||||
|
| Syslog (Fedora) | `rsyslog` | — (pre-installed) | — | Fedora 44 Hetzner images have journald only. Logwatch needs `/var/log/messages` + `/var/log/secure`. |
|
||||||
|
|
||||||
|
### After Migration
|
||||||
|
|
||||||
|
1. **Set the timezone** — `timedatectl set-timezone America/New_York` (US) or `Europe/London` (UK). Hetzner images default to UTC.
|
||||||
|
2. **Verify CA bundle (Fedora)** — `ls /etc/pki/tls/certs/ca-bundle.crt`. If missing, Postfix TLS, curl, and dnf will all fail silently. See [Fedora CA bundle fix](../../05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md).
|
||||||
|
3. **Run `harden.yml` against the new host** — catches most gaps in one pass
|
||||||
|
4. **Send a test email** — `echo test | mail -s "test" marcus@majorshouse.com` — if this fails, nothing else can alert you
|
||||||
|
5. **Verify crond is running** — `systemctl is-active crond` (Fedora) or `systemctl is-active cron` (Ubuntu). cronie can be `enabled` but not `active` after provisioning.
|
||||||
|
6. **Check Netdata Cloud** — verify the new node appears and alerts are flowing
|
||||||
|
7. **Compare fail2ban jails** — `fail2ban-client status` on both old and new
|
||||||
|
8. **Verify logwatch sends** — `sudo logwatch --output mail --range today`
|
||||||
|
9. **Keep the old box powered off but not destroyed** for at least 7 days after remediation
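The "enabled but not active" state from step 5 can affect any unit, not only cron. A minimal sketch of a check for it, assuming it runs on the new host; the function names `unit_states` and `enabled_but_inactive` are hypothetical helpers, and the unit names in the example are placeholders for whatever the baseline requires.

```bash
# Print "unit enabled-state active-state" for each unit name given.
# Calls systemctl, so run it on the host being checked.
unit_states() {
    for unit in "$@"; do
        printf '%s %s %s\n' "$unit" \
            "$(systemctl is-enabled "$unit" 2>/dev/null)" \
            "$(systemctl is-active "$unit" 2>/dev/null)"
    done
}

# Filter: keep only units that are enabled but not currently active.
enabled_but_inactive() {
    awk '$2 == "enabled" && $3 != "active" { print $1 }'
}

# Example: unit_states crond postfix fail2ban | enabled_but_inactive
```

Any unit name this prints will come up on the next boot at best, and is doing nothing right now.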
|
||||||
|
|
||||||
|
### Using doctl to Manage Old Droplets
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Authenticate (token from Ansible vault)
|
||||||
|
cd ~/MajorAnsible
|
||||||
|
ansible-vault view group_vars/all/vault.yml | grep vault_do_oauth_token | awk '{print $2}' | xargs doctl auth init --access-token
|
||||||
|
|
||||||
|
# List droplets
|
||||||
|
doctl compute droplet list --format Name,ID,Status,PublicIPv4
|
||||||
|
|
||||||
|
# Power on for comparison
|
||||||
|
doctl compute droplet-action power-on DROPLET_ID
|
||||||
|
|
||||||
|
# Power off when done
|
||||||
|
doctl compute droplet-action power-off DROPLET_ID
|
||||||
|
```
|
||||||
|
|
||||||
|
## Lesson Learned
|
||||||
|
|
||||||
|
Application migration is not server migration. The app can work perfectly while the monitoring, alerting, and email infrastructure is completely broken. Always compare the full package baseline between old and new boxes before calling a migration complete.
|
||||||
|
|
@ -9,7 +9,7 @@ tags:
|
||||||
- ubuntu
|
- ubuntu
|
||||||
status: published
|
status: published
|
||||||
created: 2026-05-09
|
created: 2026-05-09
|
||||||
updated: 2026-05-10T13:00
|
updated: 2026-05-13T10:35
|
||||||
---
|
---
|
||||||
|
|
||||||
# Logwatch Fleet Setup — Surviving Package Upgrades
|
# Logwatch Fleet Setup — Surviving Package Upgrades
|
||||||
|
|
@ -91,10 +91,22 @@ Include it in `harden.yml` so every new server gets logwatch as part of the base
|
||||||
After deploying, test immediately:
|
After deploying, test immediately:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
# Verify crond is actually running — cronie can be "enabled" but not "active"
|
||||||
|
systemctl is-active crond # Fedora
|
||||||
|
systemctl is-active cron # Ubuntu
|
||||||
|
|
||||||
|
# If inactive, start it
|
||||||
|
sudo systemctl start crond   # Fedora; on Ubuntu: sudo systemctl start cron
|
||||||
|
|
||||||
|
# Then test logwatch manually
|
||||||
sudo logwatch --output mail --range today
|
sudo logwatch --output mail --range today
|
||||||
```
|
```
|
||||||
|
|
||||||
Check that the email arrives. If it doesn't, verify Postfix is installed and relaying correctly — logwatch depends on a working local MTA.
|
Check that the email arrives. If it doesn't, verify:
|
||||||
|
|
||||||
|
1. **crond is running** — if `inactive`, cron.daily never fires and logwatch never runs. No errors anywhere.
|
||||||
|
2. **Postfix is installed and relaying** — logwatch depends on a working local MTA.
|
||||||
|
3. **CA bundle exists (Fedora)** — missing `/etc/pki/tls/certs/ca-bundle.crt` breaks Postfix TLS relay. See [Fedora CA bundle fix](../../05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md).
|
||||||
|
|
||||||
## Diagnosing Silent Failures
|
## Diagnosing Silent Failures
|
||||||
|
|
||||||
|
|
@ -105,6 +117,32 @@ dpkg -V logwatch # Debian
|
||||||
|
|
||||||
# Look for S.5....T. on the defaults file — means it was replaced
|
# Look for S.5....T. on the defaults file — means it was replaced
|
||||||
# S = size, 5 = md5, T = timestamp changed
|
# S = size, 5 = md5, T = timestamp changed
|
||||||
|
|
||||||
|
# Check if logwatch produces any output at all
|
||||||
|
logwatch --output stdout --range yesterday | wc -l
|
||||||
|
# If 0 lines — logwatch has no log data to report (see rsyslog section below)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Fedora: rsyslog Missing — Logwatch Produces Zero Output
|
||||||
|
|
||||||
|
Fedora 44 cloud images (Hetzner, possibly others) ship with **journald only** — no rsyslog. This means `/var/log/messages`, `/var/log/secure`, and `/var/log/cron` do not exist. Logwatch scans those files, finds nothing, produces empty output, and sends no email. Exit code is still 0 — no error anywhere.
|
||||||
|
|
||||||
|
This is particularly insidious because everything else can be correct (crond running, postfix relaying, logwatch config pointing to the right recipient) and you'll still get silence.
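Because logwatch exits 0 either way, the only reliable signal is whether the log files exist. A small sketch of that check, assuming the three paths named above are the ones logwatch needs on the host; `missing_logs` is a hypothetical helper name.

```bash
# Print every path given as an argument that does not exist.
missing_logs() {
    for f in "$@"; do
        [ -e "$f" ] || printf '%s\n' "$f"
    done
}

# Example: missing_logs /var/log/messages /var/log/secure /var/log/cron
```

Any output here on a Fedora host points at the missing rsyslog package rather than at logwatch, cron, or Postfix.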
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Diagnose
|
||||||
|
rpm -q rsyslog # "package rsyslog is not installed"
|
||||||
|
ls /var/log/messages # "No such file or directory"
|
||||||
|
|
||||||
|
# Fix
|
||||||
|
dnf install -y rsyslog
|
||||||
|
systemctl enable --now rsyslog
|
||||||
|
|
||||||
|
# Verify log files appear
|
||||||
|
ls /var/log/messages /var/log/secure /var/log/cron
|
||||||
|
|
||||||
|
# Test logwatch
|
||||||
|
logwatch --output stdout --range today | wc -l # should be >0
|
||||||
```
|
```
|
||||||
|
|
||||||
## Fedora CA Bundle Missing — Postfix TLS Engine Unavailable
|
## Fedora CA Bundle Missing — Postfix TLS Engine Unavailable
|
||||||
|
|
|
||||||
|
|
@ -11,7 +11,7 @@ tags:
|
||||||
- cron
|
- cron
|
||||||
status: published
|
status: published
|
||||||
created: 2026-04-18
|
created: 2026-04-18
|
||||||
updated: 2026-05-10T01:50
|
updated: 2026-05-15T03:00
|
||||||
---
|
---
|
||||||
# ClamAV Fleet Deployment with Ansible
|
# ClamAV Fleet Deployment with Ansible
|
||||||
|
|
||||||
|
|
@ -226,6 +226,41 @@ The "polite CPU is invisible to DO" trick stops working once the box is small en
|
||||||
|
|
||||||
**Alternative considered: switch to `clamdscan`** — uses a resident `clamd` daemon, signatures stay loaded, scan finishes ~10× faster with much less CPU/RAM. Better long-term answer, but requires running `clamd` continuously (memory cost on small boxes is ~250 MB resident vs the cron approach which only holds RAM during scan). Trade-off, not strictly better.
|
**Alternative considered: switch to `clamdscan`** — uses a resident `clamd` daemon, signatures stay loaded, scan finishes ~10× faster with much less CPU/RAM. Better long-term answer, but requires running `clamd` continuously (memory cost on small boxes is ~250 MB resident vs the cron approach which only holds RAM during scan). Trade-off, not strictly better.
|
||||||
|
|
||||||
|
## Daemonless Mode on Memory-Constrained Hosts
|
||||||
|
|
||||||
|
On hosts with ≤2 GB RAM, running `clamd` continuously is often counterproductive. The daemon loads its full signature database (~950 MB RSS) into memory and keeps it resident. On small VMs this crowds out MySQL, PHP-FPM, and other services — often pushing the whole system into swap rather than preventing anything.
|
||||||
|
|
||||||
|
**Affected hosts (fleet history):**
|
||||||
|
|
||||||
|
| Host | RAM | Incident | Resolution |
|
||||||
|
|------|-----|----------|------------|
|
||||||
|
| teelia | 1.9 GB | 2026-04-27 — clamd 728 MB RSS, 94% RAM alert | daemonless |
|
||||||
|
| dcaprod | 3.8 GB | 2026-04-30 — clamd OOM thrash after 512M cgroup cap | daemonless |
|
||||||
|
| majorlinux | 2.0 GB | 2026-05-15 — clamd 980 MB swap, mysqld swapping 293 MB | daemonless |
|
||||||
|
|
||||||
|
**The fix: `clamav_use_daemon: false` host_var**
|
||||||
|
|
||||||
|
`configure_clamav.yml` supports a per-host override. Add to the host's `host_vars/<hostname>/vars.yml`:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
clamav_use_daemon: false
|
||||||
|
```
|
||||||
|
|
||||||
|
Then re-run the playbook:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook configure_clamav.yml --limit <hostname>
|
||||||
|
```
|
||||||
|
|
||||||
|
This will:
|
||||||
|
- Stop and disable `clamav-daemon.service` and `clamav-daemon.socket`
|
||||||
|
- Deploy the weekly scan template using `clamscan` (daemonless, loads DB per run)
|
||||||
|
- Leave `clamav-freshclam` active so definitions stay current
|
||||||
|
|
||||||
|
**Trade-off:** Each weekly scan loads the signature DB fresh (~950 MB peak RAM for the scan duration, then freed). A `clamscan` run takes roughly 3–5× longer than `clamdscan` against a warm daemon, which is acceptable for a weekly background job. The `systemd-run MemoryMax` cgroup wrapper in the scan template caps peak usage so the scan can't OOM the host.
|
||||||
|
|
||||||
|
**Rule of thumb:** Use daemon mode (`clamav_use_daemon: true` or unset) on hosts with ≥4 GB RAM where scan speed matters (mail servers, upload handlers). Use daemonless on webservers and small VMs where continuous memory residency is the bigger risk.
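The memory half of that rule can be sketched as a one-line decision. This is an illustration only: the function name `suggest_clamav_use_daemon` is hypothetical, the 4 GB threshold is read literally as 4,000,000 kB (an assumption), and the "scan speed matters" half of the rule still needs a human.

```bash
# Suggest a value for clamav_use_daemon from total RAM in kB.
# Prints "true" at or above the threshold, "false" below it.
suggest_clamav_use_daemon() {
    mem_kb=$1
    if [ "$mem_kb" -ge 4000000 ]; then
        echo true
    else
        echo false
    fi
}

# Example, on the host itself:
#   suggest_clamav_use_daemon "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
```

A VM sold as "4 GB" usually reports slightly less than that, so it lands on `false`, which matches how dcaprod (3.8 GB) was resolved.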
|
||||||
|
|
||||||
## See Also
|
## See Also
|
||||||
|
|
||||||
- [clamscan-cpu-spike-nice-ionice](../../05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md) — troubleshooting CPU spikes from unthrottled scans
|
- [clamscan-cpu-spike-nice-ionice](../../05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md) — troubleshooting CPU spikes from unthrottled scans
|
||||||
|
|
|
||||||
168
04-streaming/plex/hevc-vaapi-batch-encode.md
Normal file
|
|
@ -0,0 +1,168 @@
|
||||||
|
---
|
||||||
|
title: "HEVC Batch Re-Encode for Plex Using VAAPI (AMD GPU)"
|
||||||
|
domain: streaming
|
||||||
|
category: plex
|
||||||
|
tags: [plex, ffmpeg, hevc, vaapi, amd, gpu, encode, storage, rx480]
|
||||||
|
status: published
|
||||||
|
created: 2026-05-15
|
||||||
|
updated: 2026-05-15
|
||||||
|
---
|
||||||
|
# HEVC Batch Re-Encode for Plex Using VAAPI (AMD GPU)
|
||||||
|
|
||||||
|
## Problem
|
||||||
|
|
||||||
|
Plex NVMe storage is filling up from a large library of H.264-encoded video files (YouTube downloads, stream archives, etc.). Re-encoding to HEVC (H.265) reclaims 30–50% of disk space. The catch: Plex tracks each file's "date added" in a SQLite database, and that order matters for playback queues. Naive re-encode-and-replace approaches can corrupt or reset that metadata.
|
||||||
|
|
||||||
|
## Solution
|
||||||
|
|
||||||
|
Use `ffmpeg` with `hevc_vaapi` (AMD GPU hardware encoder) to batch re-encode files in-place using an atomic rename swap that preserves the Plex database record — including `added_at` — without any Plex downtime or database editing.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## How Plex Stores "Date Added"
|
||||||
|
|
||||||
|
Plex does **not** use file modification time (`mtime`) for "date added." It stores a Unix timestamp in its SQLite database:
|
||||||
|
|
||||||
|
```sql
|
||||||
|
-- Plex DB location (override via systemd unit may differ — check):
|
||||||
|
-- /var/lib/plexmediaserver/Library/Application Support/Plex Media Server/
|
||||||
|
-- Plug-in Support/Databases/com.plexapp.plugins.library.db
|
||||||
|
-- (or wherever PLEX_MEDIA_SERVER_APPLICATION_SUPPORT_DIR points)
|
||||||
|
|
||||||
|
SELECT mi.added_at, datetime(mi.added_at, 'unixepoch'), mp.file
|
||||||
|
FROM metadata_items mi
|
||||||
|
JOIN media_items me ON me.metadata_item_id = mi.id
|
||||||
|
JOIN media_parts mp ON mp.media_item_id = me.id
|
||||||
|
WHERE mp.file LIKE '%your-file%';
|
||||||
|
```
|
||||||
|
|
||||||
|
> **Note:** If the query returns 0 rows against the database at the default path, check your actual data directory:
|
||||||
|
> ```bash
|
||||||
|
> systemctl cat plexmediaserver | grep APPLICATION_SUPPORT
|
||||||
|
> ```
|
||||||
|
|
||||||
|
The `added_at` field is keyed to the **file path** in `media_parts`. As long as the file path doesn't change, the database record — including `added_at` — is untouched even after the file's content is replaced.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Why VAAPI Instead of libx265
|
||||||
|
|
||||||
|
On a host with an AMD RX 480/580 (or similar Polaris GPU), hardware HEVC encoding via VAAPI is roughly **9× faster** than software libx265 at comparable quality:
|
||||||
|
|
||||||
|
| Encoder | Speed at 1080p (fps / × real time) | Notes |
|
||||||
|
|---|---|---|
|
||||||
|
| libx265 -preset medium | ~21 fps / 0.35× | Best quality/size ratio |
|
||||||
|
| hevc_vaapi QP 28 | ~186 fps / 3.1× | Sufficient for streaming content |
|
||||||
|
|
||||||
|
For 1080p streaming content (game streams, podcasts, YouTube archival), the quality difference is imperceptible. libx265 is preferable only for archival encodes where absolute quality matters.
|
||||||
|
|
||||||
|
### Verify VAAPI is working
|
||||||
|
|
||||||
|
```bash
|
||||||
|
vainfo 2>&1 | grep -E "vaapi|HEVC|hevc|Driver"
|
||||||
|
ls /dev/dri/renderD128
|
||||||
|
```
|
||||||
|
|
||||||
|
You need `VAProfileHEVCMain : VAEntrypointEncSlice` in the output. If missing, install `mesa-va-drivers-freeworld` (RPM Fusion) for AMD hardware.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The Atomic Swap Strategy
|
||||||
|
|
||||||
|
The key insight: `mv file.tmp file` on the **same filesystem** is an atomic inode rename at the kernel level. Plex sees the same path still present — it never fires a "file removed" event, so the `metadata_items` record (including `added_at`) is preserved.
|
||||||
|
|
||||||
|
**Safe sequence:**
|
||||||
|
1. Encode source → `.hevc.tmp.mp4` alongside the original
|
||||||
|
2. Verify the output with `ffprobe`
|
||||||
|
3. `touch -r original.mp4 temp.mp4` — copy mtime (cosmetic, not required)
|
||||||
|
4. `mv temp.mp4 original.mp4` — atomic replace
|
||||||
|
|
||||||
|
**The one pitfall:** if the original file is deleted *before* the `mv`, Plex orphans the DB record (removes `metadata_items` entry on next scan) and re-indexes the new file with a fresh `added_at`. The original must still exist at swap time.
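Steps 3 and 4, plus the guard against the pitfall, can be sketched as a single function. This is not the contents of `hevc_batch.sh`; the name `atomic_swap` is hypothetical, and the sketch assumes the `ffprobe` verification from step 2 has already passed.

```bash
# Replace ORIGINAL with TMP via rename, but only when the swap is safe.
# $1 = original file path, $2 = freshly encoded temp file.
atomic_swap() {
    original=$1
    tmp=$2
    # The original must still exist, or Plex will re-index with a new added_at.
    [ -f "$original" ] || { echo "original missing: $original" >&2; return 1; }
    [ -s "$tmp" ] || { echo "temp file missing or empty: $tmp" >&2; return 1; }
    # A rename is only atomic within one filesystem, so compare backing devices.
    dev_orig=$(df -P "$original" | awk 'NR==2 { print $1 }')
    dev_tmp=$(df -P "$tmp" | awk 'NR==2 { print $1 }')
    [ "$dev_orig" = "$dev_tmp" ] || { echo "not on the same filesystem" >&2; return 1; }
    touch -r "$original" "$tmp"   # copy mtime (cosmetic)
    mv -f "$tmp" "$original"      # atomic replace
}
```

Refusing to run when the original is gone is the important part: it turns the one destructive mistake into a logged failure instead of a silent metadata reset.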
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The Batch Script
|
||||||
|
|
||||||
|
Script lives at `~/hevc_batch.sh` on majorhome.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Dry run — scan and report what would be encoded, no changes
|
||||||
|
bash ~/hevc_batch.sh --dry-run
|
||||||
|
|
||||||
|
# Full run (default: files >1GB, QP 28)
|
||||||
|
tmux new-session -d -s hevc_batch 'bash ~/hevc_batch.sh'
|
||||||
|
|
||||||
|
# Custom options
|
||||||
|
bash ~/hevc_batch.sh --min-size-gb 2 --qp 26
|
||||||
|
```
|
||||||
|
|
||||||
|
### Queue and resume
|
||||||
|
|
||||||
|
The script writes a queue file at `~/hevc_queue.txt` on first run (scanning all files with ffprobe — takes ~10 min for a large library). On subsequent runs it resumes from where it left off. Completed files are logged to `~/hevc_done.txt`. Failed files go to `~/hevc_failed.txt`.
|
||||||
|
|
||||||
|
To restart from scratch: `rm ~/hevc_queue.txt ~/hevc_done.txt`
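The resume behaviour amounts to "queue minus done". How the script implements it is not shown here, so the following is only a sketch of the idea, assuming both files hold one path per line; `pending_files` is a hypothetical name.

```bash
# Print queue entries that are not yet listed in the done file.
# $1 = queue file, $2 = done file (may not exist on a first run).
pending_files() {
    queue=$1
    done_list=$2
    if [ ! -f "$done_list" ]; then
        cat "$queue"
        return 0
    fi
    # -F fixed strings, -x whole-line match, -v invert: keep unfinished entries.
    grep -Fxv -f "$done_list" "$queue" || true
}

# Example: pending_files ~/hevc_queue.txt ~/hevc_done.txt | wc -l
```

Whole-line fixed-string matching matters because media paths often contain characters that grep would otherwise treat as regex syntax.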
|
||||||
|
|
||||||
|
### Log output
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Structured log lines only (skip ffmpeg progress noise)
|
||||||
|
grep '^\[20' ~/hevc_batch.log
|
||||||
|
|
||||||
|
# Watch live progress
|
||||||
|
tail -f ~/hevc_batch.log | grep '^\[20'
|
||||||
|
```
|
||||||
|
|
||||||
|
Each file logs:
|
||||||
|
- Source size and codec
|
||||||
|
- `Plex added_at before: <unix timestamp>`
|
||||||
|
- ffmpeg exit code and elapsed time
|
||||||
|
- Output size and savings
|
||||||
|
- `DB check: added_at PRESERVED ✓` (or WARN if changed)
|
||||||
|
|
||||||
|
### Space guard
|
||||||
|
|
||||||
|
The script aborts if free space on the Plex volume drops below 20 GB (`MIN_FREE_GB`). The source and its temp output coexist until the swap, so the pair occupies up to `source_size + tmp_size` at peak: roughly 8 GB for a 4 GB source in the worst case, where the re-encode saves nothing.
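A guard like this takes only a few lines. The sketch below is not lifted from `hevc_batch.sh`; it assumes the floor is expressed in whole GiB, and the helper names `free_gb` and `space_ok` are made up for illustration.

```bash
MIN_FREE_GB=20

# Free space on the filesystem holding $1, in whole GiB.
free_gb() {
    df -Pk "$1" | awk 'NR==2 { print int($4 / 1048576) }'
}

# Succeed only if free space is at or above the floor.
# $1 = path on the volume, $2 = optional floor in GiB (default MIN_FREE_GB).
space_ok() {
    [ "$(free_gb "$1")" -ge "${2:-$MIN_FREE_GB}" ]
}

# Example: space_ok /plex || { echo "low disk, aborting" >&2; exit 1; }
```

Checking before each file rather than once at startup is what makes the guard useful on a long batch run.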
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## ffmpeg Command
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ffmpeg \
  -vaapi_device /dev/dri/renderD128 \
  -i "input.mp4" \
  -vf 'format=nv12,hwupload' \
  -c:v hevc_vaapi -rc_mode CQP -qp 28 \
  -c:a copy \
  -movflags +faststart \
  -y "output.tmp.mp4"
|
||||||
|
```
|
||||||
|
|
||||||
|
- `-rc_mode CQP -qp 28` — constant quantizer; higher value = smaller file / lower quality. QP 24 is high quality, QP 28 is good for streaming content.
|
||||||
|
- `-vf 'format=nv12,hwupload'` — required to move frames to GPU memory for VAAPI encoding.
|
||||||
|
- `-c:a copy` — passes audio through untouched.
|
||||||
|
- `hevc_vaapi` does not support 10-bit output on Polaris (RX 480/580). For 10-bit HDR sources, fall back to `libx265` with color signaling flags.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Plex Data Directory Override
|
||||||
|
|
||||||
|
On majorhome, the Plex data directory is overridden in the systemd unit — the default path `/var/lib/plexmediaserver/` is empty:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
systemctl cat plexmediaserver | grep APPLICATION_SUPPORT
|
||||||
|
# Environment=PLEX_MEDIA_SERVER_APPLICATION_SUPPORT_DIR=/plex/plexdata/Library/Application Support
|
||||||
|
```
|
||||||
|
|
||||||
|
The actual DB path is therefore:
|
||||||
|
```
|
||||||
|
/plex/plexdata/Library/Application Support/Plex Media Server/Plug-in Support/Databases/com.plexapp.plugins.library.db
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Related
|
||||||
|
|
||||||
|
- [[plex-4k-codec-compatibility]] — Apple TV Direct Play compatibility, HEVC HDR notes
|
||||||
|
- [[snapraid-mergerfs-setup]] — MajorRAID storage pool setup
|
||||||
|
- [[SnapRAID-Majorhome]] — majorhome SnapRAID project
|
||||||
|
|
@ -0,0 +1,119 @@
|
||||||
|
# Tailscale Boot Race Conditions (SSH Unreachable After Reboot)
|
||||||
|
|
||||||
|
Two related race conditions can make a host unreachable via Tailscale after reboot. Both stem from systemd services starting before Tailscale or the network is ready.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Race 1: ssh.socket Binds Before Tailscale Is Up (Ubuntu)
|
||||||
|
|
||||||
|
### Symptom
|
||||||
|
|
||||||
|
SSH to a host via its Tailscale IP times out. `tailscale ping` works and `tailscale status` shows `active; direct`, but nothing answers on port 22. The Hetzner console is no help if the root password is unset.
|
||||||
|
|
||||||
|
### Cause
|
||||||
|
|
||||||
|
Ubuntu 24.04 uses systemd **socket activation** for SSH (`ssh.socket` instead of persistent `ssh.service`). When the socket override binds to a Tailscale IP, it can start *before* `tailscaled.service` is ready. The bind may succeed initially (Tailscale state file caches the IP), but a later Tailscale reconnect or interface reset invalidates the bound address silently — SSH dies with no recovery path.
|
||||||
|
|
||||||
|
### Diagnosis
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# From another host:
|
||||||
|
tailscale ping <IP> # succeeds — host is up
|
||||||
|
ssh root@<IP> # times out — sshd not listening
|
||||||
|
|
||||||
|
# After gaining console access or reboot:
|
||||||
|
systemctl status ssh.socket # check Listen: address
|
||||||
|
journalctl -b -1 -u ssh # likely empty — sshd never spawned
|
||||||
|
journalctl -b -1 -u ssh.socket # socket started before tailscaled
|
||||||
|
```
|
||||||
|
|
||||||
|
### Fix
|
||||||
|
|
||||||
|
Add a Tailscale dependency to the socket override:
|
||||||
|
|
||||||
|
```ini
|
||||||
|
# /etc/systemd/system/ssh.socket.d/override.conf
|
||||||
|
[Unit]
|
||||||
|
After=tailscaled.service
|
||||||
|
BindsTo=tailscaled.service
|
||||||
|
|
||||||
|
[Socket]
|
||||||
|
ListenStream=
|
||||||
|
ListenStream=<TAILSCALE_IP>:22
|
||||||
|
```
|
||||||
|
|
||||||
|
Then reload and restart:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
systemctl daemon-reload
|
||||||
|
systemctl restart ssh.socket
|
||||||
|
systemctl status ssh.socket # verify Listen: shows correct IP
|
||||||
|
```
|
||||||
|
|
||||||
|
- `After=` ensures the socket waits for Tailscale to start
|
||||||
|
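- `BindsTo=` ties the socket to `tailscaled.service`: restarting Tailscale through systemd restarts the socket too, and the socket is stopped whenever Tailscale stops, which prevents stale binds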
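Hard-coding the address into the drop-in is easy to get wrong when rolling this out by hand. A sketch that renders the same override from an address passed in, for example the output of `tailscale ip -4`; the function name `render_ssh_socket_override` is hypothetical, and the `100.*` check is only a loose sanity test for Tailscale's address range.

```bash
# Print the ssh.socket drop-in for the given Tailscale IPv4 address.
render_ssh_socket_override() {
    ip=$1
    case $ip in
        100.*) ;;  # loose check: Tailscale IPv4 addresses start with 100.
        *) echo "not a Tailscale IPv4 address: $ip" >&2; return 1 ;;
    esac
    cat <<EOF
[Unit]
After=tailscaled.service
BindsTo=tailscaled.service

[Socket]
ListenStream=
ListenStream=$ip:22
EOF
}

# Example (as root, on the host):
#   render_ssh_socket_override "$(tailscale ip -4)" \
#     > /etc/systemd/system/ssh.socket.d/override.conf
```

Refusing anything outside `100.*` guards against binding SSH to a public address by accident.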
|
||||||
|
|
||||||
|
### Affected Hosts
|
||||||
|
|
||||||
|
Ubuntu hosts using `configure_tailscale_ssh_only.yml`: majorlinux, dcaprod-hetzner. Fedora hosts (majordiscord) use firewall rules for SSH restriction — not affected by this race.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Race 2: tailscaled Starts Before Network Is Online (All Hosts)
|
||||||
|
|
||||||
|
### Symptom
|
||||||
|
|
||||||
|
Host reboots but never appears on Tailscale. `tailscale ping` times out entirely. SSH is dead because Tailscale never connects. The host is up (accessible via provider console) but isolated from the Tailscale network.
|
||||||
|
|
||||||
|
### Cause
|
||||||
|
|
||||||
|
`tailscaled.service` ships with `After=network-pre.target`, which fires *before* the network interface has an IP. On VPS hosts (especially Hetzner), the interface can take several seconds to come online. Tailscale starts, sees no network (`SetNetworkUp(false)`, `link state: defaultRoute= ifs={} v4=false v6=false`), fails DNS bootstrap and DERP relay connections, and gets stuck — never retrying.
|
||||||
|
|
||||||
|
### Diagnosis
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# From Hetzner console or another access method:
|
||||||
|
journalctl -b -u tailscaled | grep -E "SetNetworkUp|link state|error|DERP"
|
||||||
|
# Look for:
|
||||||
|
# magicsock: SetNetworkUp(false)
|
||||||
|
# link state: interfaces.State{defaultRoute= ifs={} v4=false v6=false}
|
||||||
|
# health: Tailscale could not connect to any relay server
|
||||||
|
```
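The first of those signatures is enough to tell this race apart from other Tailscale failures. A tiny sketch of a check that could run from a post-boot health script, assuming the log text is piped in; `saw_network_down` is a hypothetical name.

```bash
# Succeed if the log text on stdin contains the "no network at start" signature.
saw_network_down() {
    grep -qF 'SetNetworkUp(false)'
}

# Example:
#   journalctl -b -u tailscaled --no-pager | saw_network_down \
#     && echo "tailscaled started before the network was up"
```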
|
||||||
|
|
||||||
|
### Fix
|
||||||
|
|
||||||
|
Deploy a systemd drop-in to wait for full network connectivity:
|
||||||
|
|
||||||
|
```ini
|
||||||
|
# /etc/systemd/system/tailscaled.service.d/override.conf
|
||||||
|
[Unit]
|
||||||
|
After=network-online.target
|
||||||
|
Wants=network-online.target
|
||||||
|
```
|
||||||
|
|
||||||
|
Then reload and restart:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
systemctl daemon-reload
|
||||||
|
systemctl restart tailscaled
|
||||||
|
```
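One caveat worth checking: `network-online.target` only delays anything when a wait-online service is enabled to hold it back. The sketch below looks for the two usual candidates; which one applies depends on whether the host uses NetworkManager or systemd-networkd, and the function name `wait_online_enabled` is hypothetical.

```bash
# Print the first enabled wait-online service and succeed, or fail if none is.
wait_online_enabled() {
    for svc in NetworkManager-wait-online.service systemd-networkd-wait-online.service; do
        if [ "$(systemctl is-enabled "$svc" 2>/dev/null)" = "enabled" ]; then
            echo "$svc"
            return 0
        fi
    done
    return 1
}

# Example: wait_online_enabled || echo "no wait-online service enabled" >&2
```

If neither is enabled, the drop-in is harmless but will not close the race on its own.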
|
||||||
|
|
||||||
|
### Affected Hosts
|
||||||
|
|
||||||
|
All hosts where Tailscale is the primary access path. Particularly impactful on VPS hosts with slow interface bringup. Both Fedora and Ubuntu hosts are affected.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Prevention
|
||||||
|
|
||||||
|
- Set root passwords on all VPS hosts for emergency console access
|
||||||
|
- Ansible playbooks deploy both fixes automatically:
  - `configure_tailscale_network_wait.yml` — tailscaled network-online dependency (all hosts)
  - `configure_tailscale_ssh_only.yml` — ssh.socket Tailscale dependency (Ubuntu only)
|
||||||
|
|
||||||
|
## References
|
||||||
|
|
||||||
|
- [[dcaprod#2026-05-19 — SSH unreachable due to ssh.socket race condition with Tailscale]]
|
||||||
|
- [[majordiscord#2026-05-19 — Tailscale boot race: unreachable after Ansible reboot]]
|
||||||
|
- [[majorlinux#2026-05-19 — ssh.socket override patched: added Tailscale dependency]]
|
||||||
|
- Ansible: `configure_tailscale_ssh_only.yml`, `configure_tailscale_network_wait.yml`
|
||||||
|
|
@ -0,0 +1,129 @@
|
||||||
|
---
|
||||||
|
title: "OBS Studio — \"Error opening file: (null)\" After Windows Profile Rename"
|
||||||
|
domain: troubleshooting
|
||||||
|
category: streaming
|
||||||
|
tags: [obs, streaming, windows, lua, profile-migration]
|
||||||
|
status: published
|
||||||
|
created: 2026-05-14
|
||||||
|
updated: 2026-05-14
|
||||||
|
---
|
||||||
|
|
||||||
|
# OBS Studio — "Error opening file: (null)" After Windows Profile Rename
|
||||||
|
|
||||||
|
## Symptom
|
||||||
|
|
||||||
|
Loading a scene collection in OBS Studio triggers a popup like:
|
||||||
|
|
||||||
|
```
|
||||||
|
[<ScriptName>.lua] Error opening file: (null)
|
||||||
|
```
|
||||||
|
|
||||||
|
The `(null)` is the giveaway: OBS resolved the registered script path to nothing — the file doesn't exist where the scene collection says it does. Most commonly this happens after a Windows profile was renamed or migrated and `C:\Users\<old>\...` paths were not updated.
|
||||||
|
|
||||||
|
## Why it happens
|
||||||
|
|
||||||
|
OBS stores per-scene-collection Lua/Python script registrations inside the scene collection JSON at:
|
||||||
|
|
||||||
|
```
|
||||||
|
%APPDATA%\obs-studio\basic\scenes\<Collection>.json
|
||||||
|
```
|
||||||
|
|
||||||
|
Each entry under `modules.scripts-tool[]` is an absolute Windows path. Renaming the Windows profile does not rewrite these — the JSON keeps pointing at the old `C:\Users\<old>\...` location, and OBS surfaces the resolution failure as a `(null)` popup on collection load.
|
||||||
|
|
||||||
|
## Diagnose
|
||||||
|
|
||||||
|
From WSL (or any shell with access to `%APPDATA%`):
|
||||||
|
|
||||||
|
```bash
|
||||||
|
OBS_DIR="/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio"
|
||||||
|
|
||||||
|
# 1. List scene collections
|
||||||
|
ls "$OBS_DIR/basic/scenes/"
|
||||||
|
|
||||||
|
# 2. Find collections referencing the missing script
|
||||||
|
grep -l -i "<script-name-substring>" "$OBS_DIR/basic/scenes/"*.json
|
||||||
|
|
||||||
|
# 3. Dump the scripts-tool paths from each suspect collection
|
||||||
|
python3 -c "
import json, sys
d = json.load(open(sys.argv[1]))
for s in d.get('modules', {}).get('scripts-tool', []):
    print(s.get('path'))
" "$OBS_DIR/basic/scenes/<Collection>.json"
|
||||||
|
```
|
||||||
|
|
||||||
|
If a printed path contains `C:/Users/<old-username>/...` and the file doesn't exist on disk, you've found it.
|
||||||
|
|
||||||
|
## Fix
|
||||||
|
|
||||||
|
> [!warning] Close OBS first
|
||||||
|
> OBS rewrites the scene collection JSON when it exits, so any edit made while OBS is running will be overwritten. Confirm that OBS is closed with `tasklist.exe | grep -i obs64` (WSL; no output means it is not running) or Task Manager.
|
||||||
|
|
||||||
|
### 1. Make the missing script reachable
|
||||||
|
|
||||||
|
Either:
|
||||||
|
|
||||||
|
- **Re-extract / restore the script** to a path under the new profile (recommended — gives you a clean canonical home), or
|
||||||
|
- **Leave it in the rescue/migration folder** and point OBS there (fragile if the rescue folder is later deleted).
|
||||||
|
|
||||||
|
### 2. Back up the scene collection JSON
|
||||||
|
|
||||||
|
```bash
|
||||||
|
SCENES="/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio/basic/scenes"
|
||||||
|
STAMP="$(date +%Y%m%d-%H%M%S)"
|
||||||
|
cp -p "$SCENES/<Collection>.json" "$SCENES/<Collection>.json.$STAMP.bak"
|
||||||
|
```
|
||||||
|
|
||||||
|
### 3. Rewrite the paths atomically
|
||||||
|
|
||||||
|
Edit the JSON in place by parsing it, replacing the matched path strings, and writing through a temp file (so a crash mid-write can't corrupt the collection):
|
||||||
|
|
||||||
|
```bash
python3 <<'PY'
import json, os

scenes = "/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio/basic/scenes"
# Old path -> new path; forward slashes, matching what OBS writes
mapping = {
    "C:/Users/<old>/Pictures/.../<script>.lua":
        "C:/Users/<new>/Pictures/.../<script>.lua",
}
for fn in ("<Collection>.json",):
    path = os.path.join(scenes, fn)
    with open(path, encoding="utf-8") as f:
        d = json.load(f)
    for entry in d.get("modules", {}).get("scripts-tool", []):
        if entry.get("path") in mapping:
            entry["path"] = mapping[entry["path"]]
    # Write to a temp file and close it before the swap, so the rename
    # never exposes a half-written collection
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(d, f, indent=4)
    os.replace(tmp, path)
PY
```
|
||||||
|
|
||||||
|
OBS scene JSONs use forward slashes in Windows paths — preserve that style.
|
||||||
|
|
||||||
|
### 4. Verify
|
||||||
|
|
||||||
|
Re-run the diagnostic Python snippet and confirm every printed path resolves to a real file (translate `C:/` → `/mnt/c/` from WSL).
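The `C:/` to `/mnt/c/` translation in this step can be scripted rather than done by eye. A minimal sketch, assuming the default WSL automount root `/mnt`; the helper names `win_to_wsl` and `missing_paths` are illustrative and not part of OBS:

```python
import os
import re

def win_to_wsl(win_path, mount_root="/mnt"):
    """Map a drive-letter Windows path (C:/Users/...) to its WSL mount path."""
    m = re.match(r"^([A-Za-z]):[/\\](.*)$", win_path)
    if not m:
        return win_path  # not a drive-letter path; leave it untouched
    drive = m.group(1).lower()
    rest = m.group(2).replace("\\", "/")
    return f"{mount_root}/{drive}/{rest}"

def missing_paths(win_paths):
    """Return the Windows paths whose WSL translation does not exist on disk."""
    return [p for p in win_paths if not os.path.exists(win_to_wsl(p))]
```

Feeding the paths printed by the diagnostic snippet into `missing_paths` should return an empty list once every script reference has been repaired.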
|
||||||
|
|
||||||
|
### 5. Reopen OBS
|
||||||
|
|
||||||
|
Load the scene collection. The popup should be gone.
|
||||||
|
|
||||||
|
## Why not just remove the script?
|
||||||
|
|
||||||
|
If the script is part of a third-party overlay pack (Twitch Pimpage, OWN3D, etc.), removing the registration also removes the overlay's source presets — fixing the path keeps the imported scenes intact. If you don't actually use the overlay anymore, removing the `scripts-tool` entry is fine; OBS will silently drop the broken reference on next save.
|
||||||
|
|
||||||
|
## Generalization
|
||||||
|
|
||||||
|
This same pattern applies to any OBS asset path stored in a scene collection or profile:
|
||||||
|
|
||||||
|
- Browser source local files
|
||||||
|
- Image / media source files
|
||||||
|
- Lua / Python script paths
|
||||||
|
- VST plugin paths
|
||||||
|
|
||||||
|
All of them are absolute, all of them survive a Windows profile rename in stale form, and all of them can be batch-rewritten with the same JSON-edit pattern above. Search for the old username substring across `%APPDATA%\obs-studio\` to catch them all in one pass.
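The one-pass search can be sketched as a recursive walk over every JSON file under the OBS config directory. This is a hedged example: the function names are hypothetical, it only covers files that parse as JSON (profile `.ini` files would need a plain text search instead), and it reports matches without modifying anything:

```python
import json
import os

def find_stale_strings(node, needle, trail=""):
    """Yield (json_trail, value) for every string value containing needle."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_stale_strings(value, needle, f"{trail}/{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from find_stale_strings(value, needle, f"{trail}/{i}")
    elif isinstance(node, str) and needle.lower() in node.lower():
        yield trail, node

def scan_obs_dir(obs_dir, needle):
    """Walk obs_dir and collect stale strings from every parseable .json file."""
    hits = []
    for root, _dirs, files in os.walk(obs_dir):
        for name in files:
            if not name.endswith(".json"):
                continue
            path = os.path.join(root, name)
            try:
                with open(path, encoding="utf-8") as f:
                    data = json.load(f)
            except (OSError, ValueError):
                continue  # unreadable or not valid JSON; skip it
            hits.extend((path, t, v) for t, v in find_stale_strings(data, needle))
    return hits
```

Calling `scan_obs_dir(obs_dir, "Users/<old-username>")` with the WSL path to the `obs-studio` folder lists each file, the JSON location, and the stale value, which gives the keys to add to the `mapping` dictionary in the rewrite step.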
|
||||||
|
|
||||||
|
## Related
|
||||||
|
|
||||||
|
- [[../../MajorInfrastructure/Devices/MajorRig|MajorRig device note]] — Incident Log 2026-05-14 (TTT/MLS scene popups) and 2026-05-07 (`majli` profile retirement that left these references stranded)
|
||||||
|
- [[../04-streaming/obs/obs-studio-setup-encoding|OBS Studio Setup and Encoding Settings]]
|
||||||
|
|
@@ -1,11 +1,17 @@
|
||||||
---
|
---
|
||||||
title: "ClamAV Safe Scheduling on Live Servers"
|
title: ClamAV Safe Scheduling on Live Servers
|
||||||
domain: troubleshooting
|
domain: troubleshooting
|
||||||
category: security
|
category: security
|
||||||
tags: [clamav, cpu, nice, ionice, cron, vps]
|
tags:
|
||||||
|
- clamav
|
||||||
|
- cpu
|
||||||
|
- nice
|
||||||
|
- ionice
|
||||||
|
- cron
|
||||||
|
- vps
|
||||||
status: published
|
status: published
|
||||||
created: 2026-04-02
|
created: 2026-04-02
|
||||||
updated: 2026-04-02
|
updated: 2026-05-11T18:31
|
||||||
---
|
---
|
||||||
# ClamAV Safe Scheduling on Live Servers
|
# ClamAV Safe Scheduling on Live Servers
|
||||||
|
|
||||||
|
|
@@ -75,6 +81,7 @@ kill <PID>
|
||||||
- `ionice -c 3` (Idle) requires Linux kernel ≥ 2.6.13 and CFQ/BFQ I/O scheduler. Works on most Ubuntu/Debian/Fedora systems.
|
- `ionice -c 3` (Idle) requires Linux kernel ≥ 2.6.13 and CFQ/BFQ I/O scheduler. Works on most Ubuntu/Debian/Fedora systems.
|
||||||
- On multi-core servers, consider also using `cpulimit` for a hard cap: `cpulimit -l 30 -- clamscan ...`
|
- On multi-core servers, consider also using `cpulimit` for a hard cap: `cpulimit -l 30 -- clamscan ...`
|
||||||
- Always keep `--exclude=/sys` (and optionally `--exclude=/proc`, `--exclude=/dev`) to avoid scanning virtual filesystems.
|
- Always keep `--exclude=/sys` (and optionally `--exclude=/proc`, `--exclude=/dev`) to avoid scanning virtual filesystems.
|
||||||
|
- **1 vCPU limitation:** `nice` and `ionice` only help when other processes compete for resources. On a single-core VPS, clamscan will still saturate the CPU at 57-100% even with `nice -n 19 ionice -c 3` — there's nothing to yield to. Accept the weekly spike as benign, or reduce scan scope to shorten the window.
|
||||||
|
|
||||||
## Related
|
## Related
|
||||||
|
|
||||||
|
|
|
||||||
116
05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md
Normal file
|
|
@@ -0,0 +1,116 @@
|
||||||
|
---
|
||||||
|
title: "Fedora CA Bundle Missing Symlink — TLS Breaks Fleet-Wide"
|
||||||
|
description: Hetzner-provisioned Fedora images may be missing the /etc/pki/tls/certs/ca-bundle.crt symlink, silently breaking Postfix TLS relay, curl, and dnf
|
||||||
|
tags:
|
||||||
|
- fedora
|
||||||
|
- tls
|
||||||
|
- postfix
|
||||||
|
- ca-certificates
|
||||||
|
- hetzner
|
||||||
|
- troubleshooting
|
||||||
|
status: published
|
||||||
|
created: 2026-05-11
|
||||||
|
updated: 2026-05-11
|
||||||
|
---
|
||||||
|
|
||||||
|
# Fedora CA Bundle Missing Symlink
|
||||||
|
|
||||||
|
On Fedora, many TLS clients (Postfix, curl, dnf) look for the CA bundle at `/etc/pki/tls/certs/ca-bundle.crt`. This path is normally a symlink to `/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem`, shipped by the `ca-certificates` package.
|
||||||
|
|
||||||
|
On Hetzner Cloud Fedora images (observed on Fedora 44, May 2026), this symlink can be missing despite `ca-certificates` being installed. The extracted bundle exists, but the consumer-facing symlink does not.
|
||||||
|
|
||||||
|
## Symptoms
|
||||||
|
|
||||||
|
Postfix relay to a TLS-required upstream fails:
|
||||||
|
|
||||||
|
```
|
||||||
|
postfix/smtp: cannot load Certification Authority data,
|
||||||
|
CAfile="/etc/pki/tls/certs/ca-bundle.crt",
|
||||||
|
CApath="/etc/pki/tls/certs": disabling TLS support
|
||||||
|
```
|
||||||
|
|
||||||
|
If your relay requires TLS (port 465 with `smtp_tls_wrappermode = yes`, or `smtp_tls_security_level = encrypt`), mail silently queues as deferred. No bounce, no alert — just silence.
|
||||||
|
|
||||||
|
Other symptoms on the same box:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# curl fails
|
||||||
|
curl https://example.com
|
||||||
|
# error: Problem with the SSL CA cert (path? access rights?)
|
||||||
|
|
||||||
|
# dnf fails
|
||||||
|
dnf list --installed
|
||||||
|
# Curl error (77): Problem with the SSL CA cert
|
||||||
|
```
|
||||||
|
|
||||||
|
## Diagnosis
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Check the symlink
|
||||||
|
ls -la /etc/pki/tls/certs/ca-bundle.crt
|
||||||
|
# Expected: symlink -> /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
|
||||||
|
# Broken: "No such file or directory"
|
||||||
|
|
||||||
|
# Verify the extracted bundle exists
|
||||||
|
ls -la /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
|
||||||
|
# Should exist (~220 KB, ~140-150 certs)
|
||||||
|
|
||||||
|
# Confirm the package is installed
|
||||||
|
rpm -q ca-certificates
|
||||||
|
# Should return a version string
|
||||||
|
```
|
||||||
|
|
||||||
|
If the extracted bundle exists but the symlink at `/etc/pki/tls/certs/ca-bundle.crt` is missing, that's the problem.
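The three checks above reduce to a small decision, sketched here in Python. The function name and the state labels are illustrative assumptions; the two default paths are the ones named in this note:

```python
import os

LINK = "/etc/pki/tls/certs/ca-bundle.crt"
TARGET = "/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem"

def ca_bundle_state(link=LINK, target=TARGET):
    """Classify the consumer-facing CA bundle path.

    Returns "ok", "missing-link", "dangling-link", or "missing-bundle".
    """
    if not os.path.exists(target):
        return "missing-bundle"   # the extracted bundle itself is absent
    if not os.path.lexists(link):
        return "missing-link"     # the case described in this note
    if not os.path.exists(link):
        return "dangling-link"    # link is present but resolves to nothing
    return "ok"
```

Only `missing-link` and `dangling-link` are repaired by re-creating the symlink; `missing-bundle` points at a different problem with the `ca-certificates` installation.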
|
||||||
|
|
||||||
|
## Fix
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo ln -sf /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem \
|
||||||
|
/etc/pki/tls/certs/ca-bundle.crt
|
||||||
|
sudo systemctl restart postfix
|
||||||
|
sudo postqueue -f # flush any deferred mail
|
||||||
|
```
|
||||||
|
|
||||||
|
Verify:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Symlink exists
|
||||||
|
ls -la /etc/pki/tls/certs/ca-bundle.crt
|
||||||
|
|
||||||
|
# Postfix can relay
|
||||||
|
echo "Subject: TLS test" | sendmail -v marcus@majorshouse.com
|
||||||
|
|
||||||
|
# curl works
|
||||||
|
curl -sI https://example.com | head -1
|
||||||
|
```
|
||||||
|
|
||||||
|
## Fleet Audit
|
||||||
|
|
||||||
|
If one Hetzner-provisioned Fedora host has this issue, check the others:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
for host in majordiscord majorlab majorhome majormail; do
|
||||||
|
echo "$host: $(ssh root@$host 'ls /etc/pki/tls/certs/ca-bundle.crt 2>&1' | tail -1)"
|
||||||
|
done
|
||||||
|
```
|
||||||
|
|
||||||
|
Hosts returning "No such file or directory" are silently broken for every TLS client that reads this path (Postfix, curl, dnf).
|
||||||
|
|
||||||
|
## Why This Happens
|
||||||
|
|
||||||
|
`update-ca-trust extract` regenerates the files under `/etc/pki/ca-trust/extracted/` but does not create the legacy consumer-path symlink at `/etc/pki/tls/certs/ca-bundle.crt`. That symlink is shipped by the `ca-certificates` RPM. On cloud images built from minimal installs or snapshot-based provisioning, the symlink can be lost during image creation or a partial upgrade.
|
||||||
|
|
||||||
|
## Prevention
|
||||||
|
|
||||||
|
Add to your provisioning checklist (see [VPS Migration Baseline Checklist](../../02-selfhosting/cloud/vps-migration-baseline-checklist.md)):
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Fedora provisioning — verify CA bundle symlink
|
||||||
|
[ -e /etc/pki/tls/certs/ca-bundle.crt ] || \
|
||||||
|
ln -sf /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem /etc/pki/tls/certs/ca-bundle.crt
|
||||||
|
```
|
||||||
|
|
||||||
|
## Related
|
||||||
|
|
||||||
|
- [Logwatch Fleet Setup](../../02-selfhosting/monitoring/logwatch-fleet-setup.md) — logwatch depends on a working Postfix relay, which depends on TLS, which depends on this symlink
|
||||||
|
- [VPS Migration Baseline Checklist](../../02-selfhosting/cloud/vps-migration-baseline-checklist.md) — includes CA bundle verification step
|
||||||
|
|
@@ -0,0 +1,112 @@
|
||||||
|
---
|
||||||
|
title: Netdata apps-group FD-utilisation false 100% (silenced fleet-wide)
|
||||||
|
domain: troubleshooting
|
||||||
|
category: security
|
||||||
|
tags:
|
||||||
|
- netdata
|
||||||
|
- apps.plugin
|
||||||
|
- file-descriptors
|
||||||
|
- tailscale
|
||||||
|
- false-positive
|
||||||
|
- ansible
|
||||||
|
- fleet
|
||||||
|
status: published
|
||||||
|
created: 2026-05-15
|
||||||
|
updated: 2026-05-15T02:40
|
||||||
|
---
|
||||||
|
# Netdata apps-group FD-utilisation false 100%
|
||||||
|
|
||||||
|
The Netdata stock alarm **`apps_group_file_descriptors_utilization`** (from
|
||||||
|
`/usr/lib/netdata/conf.d/health.d/file_descriptors.conf`) fires
|
||||||
|
`Raised to Warning — App group <X> file descriptors utilization = 100%`
|
||||||
|
emails for application groups that are perfectly healthy. First hit on
|
||||||
|
**MajorToot** (the `tailscaled` app group), 2026-05-15.
|
||||||
|
|
||||||
|
## The Problem
|
||||||
|
|
||||||
|
A Netdata email arrives: *"App group tailscaled file descriptors utilization
|
||||||
|
= 100% on MajorToot"*. The process is fine. On the host:
|
||||||
|
|
||||||
|
```
|
||||||
|
PID 1047 tailscaled (daemon) fds=35 soft_limit=524287 util=0.01%
|
||||||
|
PID 1984541 tailscaled (child) fds=10 soft_limit=524287 util=0.00%
|
||||||
|
PID 1984548 bash (tailscale hook) fds=5 soft_limit=1024 util=0.49%
|
||||||
|
```
|
||||||
|
|
||||||
|
No PID exceeds **0.5%**, yet `app.fds_open_limit` reads ~100%. Over 1h the raw
|
||||||
|
chart was min 0 / **mean 36.7** / max 100, with sustained multi-minute 100%
|
||||||
|
plateaus (not isolated spikes).
|
||||||
|
|
||||||
|
> This is **not** an `apps.plugin` privilege problem. apps.plugin already has
|
||||||
|
> `cap_dac_read_search,cap_sys_ptrace` and `sudo -u netdata cat
|
||||||
|
> /proc/<pid>/limits` succeeds. Verify before "fixing" privileges — it's a
|
||||||
|
> no-op.
|
||||||
|
|
||||||
|
## Root Cause
|
||||||
|
|
||||||
|
The stock alarm does `lookup: max -10s` over **every PID in the app group**.
|
||||||
|
App groups whose processes fork short-lived children (tailscaled spawns
|
||||||
|
route/DNS helpers and bash hooks; `bash` children inherit the systemd default
|
||||||
|
soft limit of 1024) trip a false 100%: apps.plugin's per-PID FD-limit read
|
||||||
|
**races on transient/just-forked PIDs**, and because the group lookup uses
|
||||||
|
`max`, a single bad 10-second sample pegs the entire group to ~100%. The
|
||||||
|
signal carries no usable information for any forking/root app group.
|
||||||
|
|
||||||
|
A `lookup: average -5m` does **not** rescue it — the bogus reading sits at
|
||||||
|
~100% for sustained multi-minute stretches, so the 5-minute rolling average
|
||||||
|
itself still reaches 100.0% (empirically verified on MajorToot).
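The difference between the two lookups can be illustrated with synthetic data. This is a sketch only: the series below are invented, one sample per second is an assumption, and the helpers mimic the idea of `lookup: max` and `lookup: average` rather than Netdata's actual implementation:

```python
def max_lookup(samples, window):
    """Maximum over the trailing window, in the spirit of `lookup: max`."""
    return max(samples[-window:])

def average_lookup(samples, window):
    """Mean over the trailing window, in the spirit of `lookup: average`."""
    tail = samples[-window:]
    return sum(tail) / len(tail)

# Synthetic utilisation values (%), not measurements from MajorToot.
healthy = [0.5] * 600
one_bad_sample = healthy[:-5] + [100.0] + healthy[-4:]  # one bogus reading
plateau = healthy[:300] + [100.0] * 300                 # sustained bogus plateau

print(max_lookup(one_bad_sample, 10))       # a single bad sample pegs the max
print(average_lookup(one_bad_sample, 300))  # averaging absorbs an isolated spike
print(average_lookup(plateau, 300))         # but not a multi-minute plateau
```

Averaging would have been enough if the bogus readings were isolated spikes; because they arrive as plateaus, neither lookup yields a usable signal, which is why the alarm is silenced rather than retuned.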
|
||||||
|
|
||||||
|
## The Fix
|
||||||
|
|
||||||
|
Silence this template fleet-wide, keep the reliable system-wide FD alarm.
|
||||||
|
|
||||||
|
- **Codified in Ansible** (do not hand-edit hosts): `MajorAnsible/netdata.yml`
|
||||||
|
ships `templates/health_apps_fds_group.conf.j2` to
|
||||||
|
`/etc/netdata/health.d/apps_fds_group_override.conf` and reloads via
|
||||||
|
`netdatacli reload-health`.
|
||||||
|
- The override redefines `apps_group_file_descriptors_utilization` with
|
||||||
|
`to: silent`. Netdata loads `/etc/netdata/health.d/` *after* the stock
|
||||||
|
`conf.d` dir, so a same-name template deterministically supersedes the stock
|
||||||
|
one (same mechanism as the manual `tcp_resets.conf` override, 2026-04-30).
|
||||||
|
- **Safety net retained:** the companion stock template
|
||||||
|
`system_file_descriptors_utilization` (on `system.file_nr_utilization`,
|
||||||
|
`crit > 90`, `to: sysadmin`) is untouched and still catches genuine
|
||||||
|
system-wide FD exhaustion regardless of app grouping.
|
||||||
|
- The reload handler is restart-tolerant (`retries`/`until` + `failed_when`
|
||||||
|
ignoring a `netdata.pipe` socket-absent error) because on hosts where the
|
||||||
|
notify-config also drifts, `Restart Netdata` and `Reload Netdata health`
|
||||||
|
can race during the ~5s restart window.
|
||||||
|
|
||||||
|
## Verification
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ssh <host> 'curl -s "http://localhost:19999/api/v1/alarms?all=true" \
|
||||||
|
| python3 -c "import sys,json;A=json.load(sys.stdin)[\"alarms\"]; \
|
||||||
|
print(A[\"app.tailscaled_fds_open_limit.apps_group_file_descriptors_utilization\"][\"recipient\"])"'
|
||||||
|
# expect: silent
|
||||||
|
```
|
||||||
|
|
||||||
|
After the fix the alarm still shows `status=WARNING` in the dashboard
|
||||||
|
(cosmetic — silencing suppresses the *notification*, not the computed state);
|
||||||
|
`recipient=silent` confirms no more emails. The system-wide alarm should read
|
||||||
|
`CLEAR recipient=sysadmin`.
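To check every app group instead of only `tailscaled`, the same API response can be filtered by alarm name. A minimal sketch, assuming the response shape used in the command above (a top-level `alarms` object keyed by `<chart>.<alarm>`); the `status` field name is taken from the dashboard wording and is an assumption, and `recipients_for` is a hypothetical helper:

```python
def recipients_for(alarms_json, alarm_suffix):
    """Map alarm key -> (status, recipient) for keys ending in alarm_suffix."""
    out = {}
    for key, alarm in alarms_json.get("alarms", {}).items():
        if key.endswith("." + alarm_suffix):
            out[key] = (alarm.get("status"), alarm.get("recipient"))
    return out
```

Loading the `curl` output with `json.load(sys.stdin)` and passing it to `recipients_for(data, "apps_group_file_descriptors_utilization")` should show `silent` as the recipient for every app group on a correctly configured host.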
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
- Silenced fleet-wide on all 10 servers 2026-05-15 (workstations majorrig/
|
||||||
|
majormac were asleep — irrelevant, they are not fleet servers).
|
||||||
|
- Any host that runs a forking/root daemon in a named app group will hit the
  same false positive, so the silencing is applied fleet-wide and pre-emptively.
|
||||||
|
- **Follow-up debt:** the manual `/etc/netdata/health.d/tcp_resets.conf`
|
||||||
|
override on MajorToot (2026-04-30) is still **not codified in
|
||||||
|
`netdata.yml`** — a per-host divergence the fleet play does not manage.
|
||||||
|
Worth folding into Ansible the same way.
|
||||||
|
|
||||||
|
## Related
|
||||||
|
|
||||||
|
- [[clamscan-cpu-spike-nice-ionice]]
|
||||||
|
- [[netdata-web-log-successful-redirect-heavy-tuning]]
|
||||||
|
- Server doc: `30-Areas/MajorInfrastructure/Servers/majortoot.md` (incident
|
||||||
|
2026-05-15)
|
||||||
|
- Playbook: `MajorAnsible/netdata.yml` +
|
||||||
|
`templates/health_apps_fds_group.conf.j2`
|
||||||
|
|
@@ -1,6 +1,6 @@
|
||||||
---
|
---
|
||||||
created: 2026-04-02T16:03
|
created: 2026-04-02T16:03
|
||||||
updated: 2026-05-10T00:10
|
updated: 2026-05-15T09:00
|
||||||
---
|
---
|
||||||
* [Home](index.md)
|
* [Home](index.md)
|
||||||
* [Linux & Sysadmin](01-linux/index.md)
|
* [Linux & Sysadmin](01-linux/index.md)
|
||||||
|
|
@@ -28,6 +28,7 @@ updated: 2026-05-10T00:10
|
||||||
* [Wake-on-LAN via Router SSH](02-selfhosting/dns-networking/wake-on-lan-router-ssh.md)
|
* [Wake-on-LAN via Router SSH](02-selfhosting/dns-networking/wake-on-lan-router-ssh.md)
|
||||||
* [Pi-hole v6 Group Management — Per-Client DNS Rules](02-selfhosting/dns-networking/pihole-v6-group-management.md)
|
* [Pi-hole v6 Group Management — Per-Client DNS Rules](02-selfhosting/dns-networking/pihole-v6-group-management.md)
|
||||||
* [AWS S3 Cost Management](02-selfhosting/cloud/aws-s3-cost-management.md)
|
* [AWS S3 Cost Management](02-selfhosting/cloud/aws-s3-cost-management.md)
|
||||||
|
* [VPS Migration Baseline Checklist](02-selfhosting/cloud/vps-migration-baseline-checklist.md)
|
||||||
* [rsync Backup Patterns](02-selfhosting/storage-backup/rsync-backup-patterns.md)
|
* [rsync Backup Patterns](02-selfhosting/storage-backup/rsync-backup-patterns.md)
|
||||||
* [Tuning Netdata Web Log Alerts](02-selfhosting/monitoring/tuning-netdata-web-log-alerts.md)
|
* [Tuning Netdata Web Log Alerts](02-selfhosting/monitoring/tuning-netdata-web-log-alerts.md)
|
||||||
* [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md)
|
* [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md)
|
||||||
|
|
@@ -69,11 +70,13 @@ updated: 2026-05-10T00:10
|
||||||
* [Streaming & Podcasting](04-streaming/index.md)
|
* [Streaming & Podcasting](04-streaming/index.md)
|
||||||
* [OBS Studio Setup & Encoding](04-streaming/obs/obs-studio-setup-encoding.md)
|
* [OBS Studio Setup & Encoding](04-streaming/obs/obs-studio-setup-encoding.md)
|
||||||
* [Plex 4K Codec Compatibility (Apple TV)](04-streaming/plex/plex-4k-codec-compatibility.md)
|
* [Plex 4K Codec Compatibility (Apple TV)](04-streaming/plex/plex-4k-codec-compatibility.md)
|
||||||
|
* [HEVC Batch Re-Encode for Plex Using VAAPI (AMD GPU)](04-streaming/plex/hevc-vaapi-batch-encode.md)
|
||||||
* [Troubleshooting](05-troubleshooting/index.md)
|
* [Troubleshooting](05-troubleshooting/index.md)
|
||||||
* [Apache Outage: Fail2ban Self-Ban + Missing iptables Rules](05-troubleshooting/networking/fail2ban-self-ban-apache-outage.md)
|
* [Apache Outage: Fail2ban Self-Ban + Missing iptables Rules](05-troubleshooting/networking/fail2ban-self-ban-apache-outage.md)
|
||||||
* [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](05-troubleshooting/networking/fail2ban-imap-self-ban-mail-client.md)
|
* [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](05-troubleshooting/networking/fail2ban-imap-self-ban-mail-client.md)
|
||||||
* [firewalld: Mail Ports Wiped After Reload](05-troubleshooting/networking/firewalld-mail-ports-reset.md)
|
* [firewalld: Mail Ports Wiped After Reload](05-troubleshooting/networking/firewalld-mail-ports-reset.md)
|
||||||
* [Tailscale SSH: Unexpected Re-Authentication Prompt](05-troubleshooting/networking/tailscale-ssh-reauth-prompt.md)
|
* [Tailscale SSH: Unexpected Re-Authentication Prompt](05-troubleshooting/networking/tailscale-ssh-reauth-prompt.md)
|
||||||
|
* [ssh.socket Unreachable After Reboot (Tailscale Race Condition)](05-troubleshooting/networking/ssh-socket-tailscale-race-condition.md)
|
||||||
* [Fail2ban & UFW Rule Bloat Cleanup](05-troubleshooting/networking/fail2ban-ufw-rule-bloat-cleanup.md)
|
* [Fail2ban & UFW Rule Bloat Cleanup](05-troubleshooting/networking/fail2ban-ufw-rule-bloat-cleanup.md)
|
||||||
* [Custom Fail2ban Jail: Apache Directory Scanning](05-troubleshooting/security/apache-dirscan-fail2ban-jail.md)
|
* [Custom Fail2ban Jail: Apache Directory Scanning](05-troubleshooting/security/apache-dirscan-fail2ban-jail.md)
|
||||||
* [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md)
|
* [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md)
|
||||||
|
|
@@ -104,7 +107,10 @@ updated: 2026-05-10T00:10
|
||||||
* [rsync over Tailscale: Hung in TCP Teardown After Transfer Completes](05-troubleshooting/networking/rsync-tailscale-teardown-stall.md)
|
* [rsync over Tailscale: Hung in TCP Teardown After Transfer Completes](05-troubleshooting/networking/rsync-tailscale-teardown-stall.md)
|
||||||
* [iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators](05-troubleshooting/networking/tailscale-status-json-hostname-localhost-ios.md)
|
* [iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators](05-troubleshooting/networking/tailscale-status-json-hostname-localhost-ios.md)
|
||||||
* [macOS: Repeating Alert Tone from Mirrored iPhone Notification](05-troubleshooting/macos-mirrored-notification-alert-loop.md)
|
* [macOS: Repeating Alert Tone from Mirrored iPhone Notification](05-troubleshooting/macos-mirrored-notification-alert-loop.md)
|
||||||
|
* [OBS Studio: Stale Script Paths After Windows Profile Rename](05-troubleshooting/obs-stale-script-paths-after-windows-profile-rename.md)
|
||||||
* [ClamAV CPU Spike: Safe Scheduling with nice/ionice](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md)
|
* [ClamAV CPU Spike: Safe Scheduling with nice/ionice](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md)
|
||||||
|
* [Fedora CA Bundle Missing Symlink — TLS Breaks Fleet-Wide](05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md)
|
||||||
|
* [Netdata apps-group FD-utilisation false 100% (silenced fleet-wide)](05-troubleshooting/security/netdata-apps-fds-group-false-positive.md)
|
||||||
* [Ansible: Vault Password File Not Found](05-troubleshooting/ansible-vault-password-file-missing.md)
|
* [Ansible: Vault Password File Not Found](05-troubleshooting/ansible-vault-password-file-missing.md)
|
||||||
* [Ansible: ansible.cfg Ignored on WSL2 Windows Mounts](05-troubleshooting/ansible-wsl2-world-writable-mount-ignores-cfg.md)
|
* [Ansible: ansible.cfg Ignored on WSL2 Windows Mounts](05-troubleshooting/ansible-wsl2-world-writable-mount-ignores-cfg.md)
|
||||||
* [Ansible: SSH Timeout During dnf upgrade on Fedora Hosts](05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md)
|
* [Ansible: SSH Timeout During dnf upgrade on Fedora Hosts](05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md)
|
||||||
|
|
|
||||||