wiki: batch update — 4 new articles + 4 updates

New articles:
- Postfix SendGrid TLS handshake failure (port 465 vs 587)
- Plex transcoding troubleshooting
- Ansible Ubuntu reboot detection kernel mismatch
- WSL2 PyTorch checkpoint Windows filesystem deadlock

Updated:
- AWS S3 cost management (expanded)
- Network overview (IP updates)
- HEVC VAAPI batch encode (progress + fixes)
- SUMMARY.md (new entries)
Marcus Summers 2026-05-25 13:55:10 -04:00
parent dc897d4a67
commit 52ca8a0413
8 changed files with 640 additions and 35 deletions


@ -5,7 +5,7 @@ category: cloud
tags: [aws, s3, cost, billing, mastodon, glacier]
status: published
created: 2026-04-19
updated: 2026-04-19
updated: 2026-05-23
---
# AWS S3 Cost Management
@ -17,24 +17,24 @@ The majorlinux AWS account is used exclusively for S3 object storage. This cover
- **Account ID:** `408469496267`
- **Account name:** majorlinux
- **Services in use:** S3 (Standard + Glacier Deep Archive), AWS Config, Cost Explorer
- **Monthly spend:** ~$32/mo (March 2026); expected ~$16/mo post-media-prune
- **Monthly spend:** ~$24/mo (May 2026, post-media-prune, post-STANDARD_IA revert)
## Buckets and Cost Drivers
| Bucket | Size | Storage Class | Cost/mo | Purpose |
|--------|------|---------------|---------|--------|
| `majortoot` | 648 GB (mostly remote cache) | S3 Standard | ~$15/mo | Mastodon media |
| `majorhomebackup` | 16 TiB | Glacier Deep Archive | ~$16/mo | MLS stream archives (sole copy) |
| `majortoot` | ~7 GB (with weekly prune running) | S3 Standard | ~$0.16/mo | Mastodon media |
| `majorhomebackup` | 16 TiB | Glacier Deep Archive | ~$11-12/mo | MLS stream archives (sole copy) |
| `config-bucket-*` | ~185 KB | S3 Standard | ~$0.00 | AWS Config snapshots |
## CLI Setup
AWS CLI installed on MajorMac via Homebrew. Credentials configured at `~/.aws/credentials`.
AWS CLI installed on MajorMac via Homebrew. Credentials for `MajorCLI` user at `~/.aws/credentials`.
```bash
brew install awscli
# Credentials pulled from Ansible vault:
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY in group_vars/all/vault.yml
# Credentials: MajorCLI IAM user (S3 + Billing read access)
# Key ID: AKIAV6GVN4HF4Y6EV4NM — created 2026-05-23
```
### Useful commands
@ -42,18 +42,22 @@ brew install awscli
```bash
# Check current month spend by service
aws ce get-cost-and-usage \
--time-period Start=2026-04-01,End=2026-04-30 \
--time-period Start=2026-05-01,End=2026-06-01 \
--granularity MONTHLY \
--metrics "UnblendedCost" \
--group-by Type=DIMENSION,Key=SERVICE
# Daily cost breakdown with top usage types
aws ce get-cost-and-usage \
--time-period Start=2026-05-01,End=2026-05-23 \
--granularity DAILY \
--metrics "UnblendedCost" \
--filter '{"Dimensions":{"Key":"SERVICE","Values":["Amazon Simple Storage Service"]}}' \
--group-by Type=DIMENSION,Key=USAGE_TYPE
# View anomaly alerts
aws ce get-anomalies \
--date-interval StartDate=2026-04-01,EndDate=2026-04-30
# Check conformance pack compliance
aws configservice get-conformance-pack-compliance-details \
--conformance-pack-name MajorConformance
--date-interval StartDate=2026-05-01,EndDate=2026-05-31
# List budgets
aws budgets describe-budgets --account-id 408469496267
@ -62,25 +66,48 @@ aws budgets describe-budgets --account-id 408469496267
## Budget Alert
`MajorS3MonthlyAlert` configured 2026-04-19:
- 80% threshold → email at $20 actual spend
- 100% threshold → email at $25 actual spend
- 80% threshold → email at $24 actual spend
- 100% threshold → email at $30 actual spend
- Recipient: maj.linux@gmail.com
> [!note] Thresholds updated 2026-05-23 to reflect actual ~$24/mo steady-state spend (was $20/$25, set when spend was higher due to large majortoot bucket before prune took effect).
## Cost Reduction Options
### majortoot — S3 Standard-IA
### majortoot — S3 Standard-IA (⚠️ DO NOT USE — tried and reverted)
Switching `S3_STORAGE_CLASS=STANDARD_IA` in Mastodon's `.env.production` reduces storage cost from $0.023/GB to $0.0125/GB for new uploads. Expected saving: ~$4-5/mo after cache is pruned down to local-only content.
**Attempted 2026-05 — reverted 2026-05-17. Do not retry without careful planning.**
See [[mastodon-instance-tuning]] for full instructions.
The theory: switching `S3_STORAGE_CLASS=STANDARD_IA` saves ~$4-5/mo on storage. In practice, the bulk avatar restore operation (`restore-avatars.sh`, May 9-10) ran while STANDARD_IA was active. The ~5,223 account refreshes across 1,095 domains generated ~470,000 SIA Tier 1 PUT requests ($4.72) plus early-deletion fees ($1.21) when the objects were replaced after reverting to STANDARD on May 17.
**STANDARD_IA is only economical if:**
- The bucket has no large bulk-write operations (media cache rebuilds, avatar restores)
- Objects are written and left for >30 days (early deletion incurs minimum 30-day fee)
- The per-request cost ($0.01/1,000 for SIA vs $0.005/1,000 for Standard) doesn't offset storage savings
With the weekly prune now running correctly and the bucket shrinking toward ~7 GB, the storage savings of SIA are negligible (~$0.05/mo). **Leave at STANDARD.**
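The trade-off above can be sanity-checked with a quick calculation. The sketch below is a rough upper bound that assumes the list prices quoted in this section. `sia_net_saving` is a hypothetical helper name, and the model ignores early-deletion fees and Standard-IA's 128 KB minimum billable object size, both of which push the real saving lower.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimated monthly saving (USD) from STANDARD_IA vs STANDARD.
# Usage: sia_net_saving <stored_gb> <put_requests_per_month>
# Prices are the ones quoted in this article; a negative result means SIA costs more.
sia_net_saving() {
  awk -v gb="$1" -v puts="$2" 'BEGIN {
    storage_saving  = gb * (0.023 - 0.0125)           # per-GB storage price difference
    request_penalty = (puts / 1000) * (0.01 - 0.005)  # extra cost per 1,000 PUTs on SIA
    printf "%.2f\n", storage_saving - request_penalty
  }'
}

sia_net_saving 7 0        # steady-state bucket, no bulk writes
sia_net_saving 7 470000   # same bucket during a bulk avatar restore
```

A negative figure means the request surcharge outweighs the storage discount for that month, which is the situation the avatar restore created.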
### majortoot — Weekly media prune
Weekly cron deployed (`0 3 * * 0`) via `configure_mastodon_media_prune.yml`. Removes remote federated cache older than 7 days. Expected to reduce bucket from 648 GB to ~7 GB over time.
Weekly cron deployed (`0 3 * * 0`) via `configure_mastodon_media_prune.yml`. It removes remote federated cache older than 7 days, shrinking the bucket from 648 GB toward ~7 GB over time. **This is where the real cost reduction comes from; let it run.**
### majorhomebackup — Self-host consideration
Deep Archive at $0.00099/GB is the cheapest cloud tier; no cloud alternative is cheaper. If the MLS archives are no longer needed, deletion would save ~$16/mo. A 20TB HDD (~$300-400) would break even in ~2 years vs. continued cloud storage. **These are the sole copy. Do not delete without a separate backup.**
Deep Archive at $0.00099/GB is the cheapest cloud tier; no cloud alternative is cheaper. If the MLS archives are no longer needed, deletion would save ~$11-12/mo. A 20TB HDD (~$300-400) would break even in ~2.5 years vs. continued cloud storage. **These are the sole copy. Do not delete without a separate backup.**
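The break-even estimate is simple division, sketched here for reference. It assumes a midpoint cloud cost of $11.50/mo and ignores electricity, drive failure, and Deep Archive retrieval fees. `break_even_months` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: months until a one-time drive purchase pays for itself.
# Usage: break_even_months <drive_cost_usd> <cloud_cost_usd_per_month>
break_even_months() {
  awk -v drive="$1" -v monthly="$2" 'BEGIN {
    if (monthly <= 0) { print "never"; exit }   # nothing to save against
    printf "%.1f\n", drive / monthly
  }'
}

break_even_months 300 11.5   # low end of the drive price range
break_even_months 400 11.5   # high end of the drive price range
```

The two results bracket the "~2.5 years" figure quoted above.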
## IAM Users
| User | Scope | Credentials location | Notes |
|------|-------|---------------------|-------|
| `MajorToot` | S3 full (MajorsHouse group) | `~/.aws/credentials` on majortoot | Key rotated 2026-05-23 |
| `MajorHome` | S3 full (MajorsHouse group) | `~/.aws/credentials` on majorhome | Key pending rotation (see below) |
| `MajorCLI` | S3 full + Billing read (MajorsHouse group + AWSBillingReadOnlyAccess) | `~/.aws/credentials` on MajorMac | Created 2026-05-23, replaces root key |
> [!warning] Root access keys deleted 2026-05-23. Do NOT create new root access keys. Use `MajorCLI` for CLI work on MajorMac. The root account password (in Vaultwarden) is sufficient for console access.
> [!warning] MajorHome key (`AKIAV6GVN4HF7POCNW6D`) exposed in shell session 2026-05-23. Rotate via AWS Console → IAM → Users → MajorHome → Security credentials. Update `~/.aws/credentials` on majorhome afterward.
> [!note] `MajorCLI` does not have IAM permissions. Future key rotation requires AWS Console login or temporary IAM policy attachment. Consider adding a `SelfManageKeys` inline policy to `MajorCLI` via console.
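
One possible shape for the `SelfManageKeys` inline policy, as a sketch only. The action list and the `${aws:username}` resource pattern follow the common AWS pattern for letting a user rotate its own keys. The output path `/tmp/self-manage-keys.json` is an arbitrary choice, and nothing here has been applied to this account.

```bash
#!/usr/bin/env bash
# Write a candidate SelfManageKeys policy document to a file (path is an assumption).
policy_file="${POLICY_FILE:-/tmp/self-manage-keys.json}"

# The quoted heredoc delimiter keeps ${aws:username} literal instead of shell-expanding it.
cat > "$policy_file" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ManageOwnAccessKeys",
      "Effect": "Allow",
      "Action": [
        "iam:CreateAccessKey",
        "iam:DeleteAccessKey",
        "iam:ListAccessKeys",
        "iam:UpdateAccessKey"
      ],
      "Resource": "arn:aws:iam::*:user/${aws:username}"
    }
  ]
}
EOF

echo "Wrote $policy_file"
# To attach it, run as a principal that already has IAM permissions:
#   aws iam put-user-policy --user-name MajorCLI \
#     --policy-name SelfManageKeys --policy-document "file://$policy_file"
```

Scoping the resource to `${aws:username}` means the user can only manage keys on its own IAM user, not on `MajorToot` or `MajorHome`.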
## Conformance Pack
@ -92,15 +119,34 @@ Deep Archive at $0.00099/GB is the cheapest cloud tier — no cloud alternative
Evaluations cost $0.001 each and run on a periodic schedule. Safe to ignore; at current scale costs pennies per month.
## IAM Users
| User | Scope | Credentials location |
|------|-------|---------------------|
| `MajorToot` | S3 only — no billing/Cost Explorer | `~/.aws/credentials` on majortoot |
| Root | Full access | `~/.aws/credentials` on MajorMac (configured 2026-04-19) |
## CloudTrail Audit Logging
`MajorTrail` configured 2026-05-23:
- **S3 bucket:** `majorcloudtrail-408469496267`
- **Multi-region:** yes — captures API calls across all regions
- **Global service events:** yes — includes IAM, STS, S3 control plane
- **Log file validation:** enabled — tamper detection via digest files
- **Retention:** logs accumulate in S3; no automatic expiry configured
Use CloudTrail to investigate unexpected cost spikes, IAM key usage, and bucket write activity. Without it, historical API calls are unrecoverable (learned the hard way from the May 2026 SIA spike investigation).
```bash
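# List recent CloudTrail events (last 1h), filtered by event name
# Note: lookup-events returns management events only. Object-level calls such as
# PutObject are data events and appear only in the trail's S3 logs, and only if
# data event logging is enabled on the trail.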
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=EventName,AttributeValue=PutObject \
--start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
--query 'Events[].{Time:EventTime,User:Username,Resource:Resources[0].ResourceName}' \
--output table
# Look up events by specific access key
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=AccessKeyId,AttributeValue=AKIAV6GVN4HF3BWAIAGC \
--output table
```
## Related
- [[Services/AWS]] — infrastructure record
- [[mastodon-instance-tuning]] — media cache management
- [[mastodon-prune-profiles-trap]] — avatar restore incident (May 2026)
- [[majortoot]] — Mastodon host


@ -5,7 +5,7 @@ category: dns-networking
tags: [tailscale, networking, infrastructure, dns, vpn]
status: published
created: 2026-04-02
updated: 2026-04-02
updated: 2026-05-19
---
# 🌐 Network Overview
@ -19,12 +19,13 @@ The **MajorsHouse** infrastructure is connected via a private **Tailscale** mesh
## 🌍 Geographic Nodes
| Host | Location | IP | OS |
|---|---|---|---|
| `dcaprod` | 🇺🇸 US | 100.104.11.146 | Ubuntu 24.04 |
| `majortoot` | 🇺🇸 US | 100.110.197.17 | Ubuntu 24.04 |
| `majorhome` | 🇺🇸 US | 100.120.209.106 | Fedora 43 |
| `teelia` | 🇬🇧 UK | 100.120.32.69 | Ubuntu 24.04 |
| Host | Location | IP | OS | Notes |
|---|---|---|---|---|
| `dcaprod` | 🇺🇸 US | 100.104.11.146 | Ubuntu 24.04 | DO droplet — live until ~2026-05-22 |
| `dcaprod-hetzner` | 🇺🇸 US | 100.98.223.93 | Ubuntu 24.04 | Hetzner CPX21 — migration target; DNS cutover ~May 22 |
| `majortoot` | 🇺🇸 US | 100.110.197.17 | Ubuntu 24.04 | |
| `majorhome` | 🇺🇸 US | 100.120.209.106 | Fedora 43 | |
| `teelia` | 🇬🇧 UK | 100.120.32.69 | Ubuntu 24.04 | |
## 🔗 Tailscale Setup
@ -35,4 +36,4 @@ Tailscale is configured as a persistent service on all nodes. Key features used
- **ACLs:** Managed via the Tailscale admin console to restrict cross-group communication where necessary.
---
*Last updated: 2026-03-04*
*Last updated: 2026-05-19*


@ -5,7 +5,7 @@ category: plex
tags: [plex, ffmpeg, hevc, vaapi, amd, gpu, encode, storage, rx480]
status: published
created: 2026-05-15
updated: 2026-05-15
updated: 2026-05-22
---
# HEVC Batch Re-Encode for Plex Using VAAPI (AMD GPU)
@ -161,8 +161,120 @@ The actual DB path is therefore:
---
---
## Troubleshooting
### Encode keeps stopping after a few files
**Symptom:** The script runs, encodes a handful of files, then exits. Restarting it produces the same behavior — processes a few, then exits again.
**Cause:** `hevc_batch.sh` is a **one-shot batch processor**, not a daemon. It reads through the queue file once from top to bottom, encodes whatever hasn't been done, then exits cleanly with `Batch complete: N processed`. It does not loop or restart itself.
On subsequent restarts, the script reuses the existing `hevc_queue.txt` rather than rebuilding it — the rebuild only runs if the queue file is missing or empty:
```bash
if [[ ! -f "$QUEUE" ]] || [[ ! -s "$QUEUE" ]]; then
build_queue
fi
```
This means restarts process only the few items left in the stale queue that haven't been marked done, then exit.
**Fix:** Delete the queue file before restarting so the script rescans the library and builds a fresh queue:
```bash
su - majorlinux -c 'rm ~/hevc_queue.txt && tmux new-session -d -s hevc_batch "bash ~/hevc_batch.sh"'
```
> Do **not** delete `hevc_done.txt` — that's the deduplication record. The rebuilt queue will skip anything already in `hevc_done.txt`.
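How a rebuilt queue can skip finished files, as a minimal sketch. This is not the code from `hevc_batch.sh`. `pending_entries` is a hypothetical helper that assumes both files hold one full path per line.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print queue entries that are not yet recorded as done.
# Usage: pending_entries <queue_file> <done_file>
pending_entries() {
  local queue="$1" done_file="$2"
  if [[ -s "$done_file" ]]; then
    # -F fixed strings, -x whole-line match, -v invert: keep lines absent from done_file.
    grep -Fxv -f "$done_file" -- "$queue" || true
  else
    # Nothing recorded as done yet, so every queue entry is pending.
    cat -- "$queue"
  fi
}

# Example (paths as used elsewhere in this article):
#   pending_entries ~/hevc_queue.txt ~/hevc_done.txt | wc -l
```

Fixed-string, whole-line matching matters here because titles often contain characters such as `[`, `.` or `|` that a regex match would misinterpret.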
---
### "Parse error, at least 3 arguments" in the log
**Symptom:** Log lines like `Parse error, at least 3 arguments were expected, only 1 given in string 'h.mp4'` scattered between encode entries.
**Cause:** ffmpeg prints its own internal parsing warnings to stderr for filenames containing the fullwidth Unicode punctuation common in Giant Bomb / YouTube-DL titles. The bash script handles these names correctly via `IFS= read -r`; the messages are cosmetic ffmpeg noise and do not affect the encode.
**Action:** None — these are safe to ignore.
---
### "SKIP (not found): uiem DLC & Far Far West.mp4" — truncated filenames
**Symptom:** "not found" skip entries in the log show what look like the *ends* of filenames (e.g., `uiem DLC & Far Far West.mp4` instead of `Resident Evil Requiem DLC & Far Far West.mp4`).
**Cause:** The queue file has corrupt/truncated entries — lines where the beginning of the path was lost, likely from a write error or interrupted pipe when the queue was originally built. The script can't find these truncated paths on disk and skips them.
**Fix:** Delete the queue file to force a full rebuild (see above). The rebuild uses `find` with a fresh scan — no truncation possible.
---
### Checking real progress
```bash
# Files done, failed, and remaining in queue
wc -l ~/hevc_done.txt ~/hevc_failed.txt ~/hevc_queue.txt
# Remaining = queue total - done - failed
# (some "remaining" may be not-found or parse-error skips)
# Last 10 log entries
grep '^\[20' ~/hevc_batch.log | tail -10
# Watch live
tail -f ~/hevc_batch.log | grep '^\[20'
# Disk free on /plex
df -h /plex | tail -1
```
---
### Script exits with `set -euo pipefail`
The script uses `set -euo pipefail` — any unhandled non-zero exit code kills it immediately. If the script exits with no "Batch complete" line in the log, look for the last log entry before the gap to identify the failing command. Most encode-path errors are handled with `|| echo ""` guards, but external tools (sqlite3, ffprobe) can still trip this under unusual conditions.
---
## Related
- [[plex-4k-codec-compatibility]] — Apple TV Direct Play compatibility, HEVC HDR notes
- [[plex-transcoding-troubleshooting]] — Playback stops, software transcode CPU limits, VAAPI setup
- [[snapraid-mergerfs-setup]] — MajorRAID storage pool setup
- [[SnapRAID-Majorhome]] — majorhome SnapRAID project
---
### ffmpeg "Error opening output file" / "Invalid argument" on specific files
**Symptom:** One or more files fail with this in the log:
```
Error opening output file /plex/plex/Giant Bomb's Sub-A-Thon Day 3 PART 4.hevc.tmp.mp4.
Error opening output files: Invalid argument
[YYYY-MM-DD HH:MM:SS] ffmpeg exited 234 in 0s
[YYYY-MM-DD HH:MM:SS] FAILED: ffmpeg error — keeping original, removing tmp
```
The file ends up in `hevc_failed.txt` and the original is untouched.
**Cause:** ffmpeg has its own URL/protocol parser that runs on all input and output path strings before any filesystem access. The ASCII pipe character `|` (U+007C) triggers ffmpeg's pipe protocol handler: it tries to interpret `output|file.mp4` as "pipe output to the process named `file.mp4`" and fails with EINVAL. This happens even though the shell variable is properly quoted and the Linux filesystem supports `|` in filenames. The fullwidth variant `｜` (U+FF5C) can also cause issues depending on ffmpeg's build.
Common in libraries with Giant Bomb, YouTube, or Twitch downloads, whose titles frequently use `｜` as a visual separator.
**Fix:** Sanitize the `stem` used for the `.hevc.tmp.` output filename. The *source* file keeps its original name (the final `mv` writes back to the original path, which the filesystem handles fine); only the temp file needs a clean name for ffmpeg:
```bash
# In encode_file(), replace:
local tmp="${dir}/${stem}.hevc.tmp.${ext}"
# With:
local safe_stem="${stem//|/-}"
safe_stem="${safe_stem//｜/-}"
local tmp="${dir}/${safe_stem}.hevc.tmp.${ext}"
```
After patching, delete the affected entries from `hevc_failed.txt` (or leave them — they'll be re-queued on the next run since they're not in `hevc_done.txt`) and restart the batch.
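To see which queued paths would hit this error before patching, a quick filter can help. This is a sketch that assumes the queue holds one path per line. `pipe_named_entries` is a hypothetical helper, not part of `hevc_batch.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list queue entries whose path contains an ASCII pipe (U+007C)
# or a fullwidth vertical line (U+FF5C).
# Usage: pipe_named_entries <queue_file>
pipe_named_entries() {
  # -F treats both patterns as literal strings, so "|" is not regex alternation.
  grep -F -e '|' -e '｜' -- "$1" || true
}

# Example:
#   pipe_named_entries ~/hevc_queue.txt
```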


@ -0,0 +1,126 @@
---
title: "Plex Transcoding Troubleshooting"
domain: streaming
category: plex
tags: [plex, transcoding, hevc, h264, vaapi, troubleshooting, apple-tv]
status: published
created: 2026-05-22
updated: 2026-05-22
---
# Plex Transcoding Troubleshooting
Common issues when Plex is transcoding instead of direct playing, and how to fix them.
## Playback Stops After ~1 Minute
**Symptom:** Video starts normally, plays for 60-90 seconds, then freezes or stops. Hitting play again works briefly, then stops again.
**Cause:** The Plex server is software-transcoding the stream and the CPU can't keep up in real time. Plex delivers video as a series of short HLS segments (3 seconds each by default). When the transcoder falls behind real-time, the client exhausts its segment buffer and stops.
This is most common when:
- The client has an auto-quality or bandwidth-limit setting enabled, forcing a transcode even for natively supported codecs
- The source file is HEVC and the client is set to anything other than "Play Original"
- Multiple streams are transcoding concurrently and saturating the CPU
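A back-of-the-envelope model shows why the stall arrives after a minute or so rather than immediately. This is an illustrative sketch, not Plex's actual buffering logic: it assumes the client starts with a fixed buffer and the transcoder runs at a constant fraction of real time. `stall_after` and both example numbers are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical model: seconds of playback before the client buffer runs dry.
# Usage: stall_after <buffer_seconds> <transcode_speed>
#   transcode_speed is media seconds produced per wall-clock second (1.0 = real time).
stall_after() {
  awk -v buf="$1" -v speed="$2" 'BEGIN {
    if (speed >= 1) { print "never"; exit }   # transcoder keeps up, buffer never drains
    # The buffer drains at (1 - speed) seconds of media per second of playback.
    printf "%.1f\n", buf / (1 - speed)
  }'
}

stall_after 10 0.85   # 10 s buffered, transcoder at 85% of real time
```

Under these assumptions, even a transcoder that is only slightly too slow produces a stall within about a minute, which is consistent with the symptom described above.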
### How to Confirm
SSH into the Plex host and check for an active software transcode:
```bash
ps aux | grep 'Plex Transcoder' | grep -v grep
```
Look for `libx264` or `libx265` in the output; these are CPU software encoders. A CPU% above 30-40% per stream on an i7-7700K means it's at or near the real-time limit for 1080p60.
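The same check can be scripted. The sketch below only pattern-matches encoder names mentioned in this article. `transcode_kind` is a hypothetical helper, and the exact arguments Plex passes to its transcoder may differ between versions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify a transcoder command line by the encoder it names.
# Usage: transcode_kind "<full command line>"
transcode_kind() {
  case "$1" in
    *libx264*|*libx265*)       echo "software" ;;  # CPU encoders
    *h264_vaapi*|*hevc_vaapi*) echo "hardware" ;;  # VAAPI GPU encoders
    *)                         echo "unknown"  ;;
  esac
}

# Example against live processes (prints one label per transcoder):
#   pgrep -af 'Plex Transcoder' | while IFS= read -r line; do transcode_kind "$line"; done
```

Software encoders are checked first so that a session which decodes on the GPU but encodes with `libx264` is still reported as CPU-bound.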
### Fix: Enable Direct Play
The correct fix is to eliminate the transcode entirely.
**On Apple TV:**
1. Open the Plex app → tap the user icon → **Settings**
2. Go to **Quality**
3. Set both **"Home Streaming"** and **"Remote Streaming"** to **"Play Original"** (or "Maximum")
4. Restart playback
Apple TV 4K supports direct play for H.264, HEVC (H.265), and most common containers (MP4, MKV). With "Play Original" set, Plex streams the file as-is with no server-side processing.
**On other clients:** Look for a Quality or Streaming Quality setting and set it to Original/Maximum. The specific label varies by app version.
### If Direct Play Isn't Possible
If the client genuinely can't decode the source codec (e.g., a browser playing HEVC), reduce the transcode quality to something the CPU can sustain in real time:
- **8 Mbps 1080p** is usually achievable for a single stream on an i7-7700K
- Avoid 1080p60 at high bitrates — the frame rate doubles the encoding work
Alternatively, enable hardware transcoding (see below).
---
## Understanding When Plex Transcodes
Plex will transcode (convert on the fly) when any of the following are true:
| Trigger | Example |
|---------|---------|
| Client can't decode the codec | Browser playing HEVC |
| Client quality is set below original | "8 Mbps 1080p" selected |
| Audio codec isn't supported by client | DTS-MA, TrueHD on some devices |
| Subtitles need burning in | Forced image-based subs (PGS) |
| Bandwidth limit set in Plex server settings | Server-side quality cap |
Direct play happens when the client supports the video codec, audio codec, container, and no quality downgrade is requested.
---
## Hardware Transcoding (VAAPI / RX 480)
majorhome has an XFX Radeon RX 480 8GB with VAAPI support. Hardware transcoding can offload video encoding from the CPU and allows more concurrent transcode streams.
**Enable in Plex:**
Settings → Transcoder → **"Use hardware acceleration when available"** (requires Plex Pass)
**Caveats:**
- The RX 480 VAAPI encoder (`hevc_vaapi`, `h264_vaapi`) is benchmarked ~3× slower than the i7-7700K CPU for single-stream x264 output on this workload. Hardware transcoding only wins when the CPU is already saturated (2+ concurrent streams).
- VAAPI hardware transcode on AMD requires the `radeonsi` Mesa driver and `libva-mesa-driver`. Both are present on majorhome.
**Check VAAPI is working:**
```bash
vainfo 2>/dev/null | grep -E "VAProfile|VAEntrypoint"
```
---
## CPU Transcoding Capacity (i7-7700K)
| Scenario | CPU Load | Sustainable? |
|----------|----------|-------------|
| 1× HEVC → H.264 1080p30 | ~20% | ✅ Yes |
| 1× HEVC → H.264 1080p60 | ~40% | ⚠️ Borderline — may drop behind |
| 2× HEVC → H.264 1080p60 | ~80% | ❌ Will fall behind in real time |
| 1× H.264 → H.264 1080p (remux only) | ~5% | ✅ Yes |
**Bottom line:** One software-transcode stream at 1080p60 is at the edge of what the i7-7700K can sustain. Two will fail. Direct play eliminates the problem entirely.
---
## Checking Active Transcode Sessions
```bash
# See all active Plex Transcoder processes and what they're encoding
ps aux | grep 'Plex Transcoder' | grep -v grep | grep -oP '\-i \S+' | sed 's/-i //'
# Full transcode command (codec, bitrate, resolution)
ps aux | grep 'Plex Transcoder' | grep -v grep
```
You can also see active sessions in Plex Web → Dashboard → Now Playing.
---
## Related
- [Plex 4K Codec Compatibility (Apple TV)](plex-4k-codec-compatibility.md)
- [[../../../MajorInfrastructure/Services/Plex|Plex — Infrastructure Doc]]
- [[../../../../30-Areas/MajorInfrastructure/Servers/majorhome|majorhome]]


@ -0,0 +1,106 @@
---
title: "Ansible: Ubuntu Reboot Detection Misses Kernel Upgrades"
domain: troubleshooting
category: ansible
tags: [ansible, ubuntu, kernel, reboot, needrestart, apt]
status: published
created: 2026-05-19
updated: 2026-05-19
---
# Ansible: Ubuntu Reboot Detection Misses Kernel Upgrades
## Problem
`update.yml` runs across the Ubuntu fleet, a kernel package is upgraded, but the executive summary reports `No reboot needed` — even though a reboot is genuinely required. Running `uname -r` on the host confirms it's still on the old kernel.
Example: majortoot had `linux-image-6.8.0-117-generic` installed on May 16 after a Tailscale update triggered `needrestart`, but the playbook kept reporting clean.
## Root Cause
The standard check for Ubuntu reboot state is:
```yaml
- name: Check if a reboot is required for Ubuntu servers
ansible.builtin.stat:
path: /var/run/reboot-required
register: ubuntu_reboot_flag
```
`/var/run/reboot-required` is written by `update-notifier-common`'s `notify-reboot-required` script, called by `/etc/kernel/postinst.d/update-notifier` when a kernel package is installed via `apt`.
The problem is `needrestart`. It runs after every `apt` invocation via a `DPkg::Post-Invoke` hook (`apt-pinvoke -m u`). In **unattended mode** (`-m u`), needrestart detects the pending kernel upgrade and calls `announce_ver()` in `NeedRestart::UI::Ubuntu` — but that function only prints to stdout. It does **not** call `_write_reboot_file()`. Only `announce_ucode()` (microcode upgrades) calls `_write_reboot_file()`.
So the sequence is:
1. `apt` installs kernel → `notify-reboot-required` creates `/run/reboot-required`
2. Some later `apt` run (e.g. Ansible installs Tailscale) → `needrestart -m u` runs → detects kernel mismatch → calls `announce_ver()` → prints to stdout (suppressed in Ansible) → **does not** recreate the sentinel file
3. Next Ansible run: stat check finds no file → reports `No reboot needed`
The `/run` filesystem is tmpfs and clears on reboot, but the sentinel file can disappear between reboots any time needrestart runs without recreating it.
## Fix — Dual Check in update.yml
Add a parallel kernel comparison task after the existing stat check:
```yaml
- name: Check running kernel vs installed kernel (Ubuntu)
ansible.builtin.shell: |
RUNNING=$(uname -r)
INSTALLED=$(dpkg -l 'linux-image-[0-9]*-generic' 2>/dev/null \
| awk '/^ii/{print $2}' \
| sed 's/linux-image-//' \
| sort -V | tail -1)
if [ -n "$INSTALLED" ] && [ "$RUNNING" != "$INSTALLED" ]; then
echo "KERNEL_MISMATCH"
fi
register: kernel_mismatch_check
changed_when: false
when: ansible_facts['os_family'] == "Debian"
```
Then update the `host_summary` Jinja2 template to OR both conditions:
```jinja2
{%- if ansible_facts['os_family'] == 'Debian' and (
(ubuntu_reboot_flag is defined and ubuntu_reboot_flag.stat is defined and ubuntu_reboot_flag.stat.exists)
or
(kernel_mismatch_check is defined and 'KERNEL_MISMATCH' in (kernel_mismatch_check.stdout | default('')))
) -%}
{%- set _ = parts.append('REBOOT REQUIRED') -%}
```
## Common Mistake — Comparing the Wrong dpkg Field
An initial version of this fix used `$3` (the package version) and `cut`:
```bash
# WRONG — version field never matches uname -r
INSTALLED=$(dpkg -l 'linux-image-*-generic' | awk '/^ii/{print $3}' | sort -V | tail -1 | cut -d- -f1-4)
```
| Field | Example value |
|-------|--------------|
| `dpkg $3` (version) after cut | `6.8.0-57.59` |
| `uname -r` | `6.8.0-57-generic` |
These formats never match. Every Ubuntu host permanently reports `KERNEL_MISMATCH`. Always use the **name column (`$2`)**, strip the `linux-image-` prefix, and compare directly to `uname -r`.
Also use `linux-image-[0-9]*-generic` (not `*-generic`) to exclude the `linux-image-generic` meta-package from the sort.
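The difference between the two columns is easy to check offline. The sketch below feeds sample `dpkg -l` style lines (made up for illustration) through the same name-column logic. `newest_installed_kernel` is a hypothetical wrapper, and `sort -V` assumes GNU coreutils.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: read `dpkg -l` style lines on stdin and print the newest
# installed kernel in the same format as `uname -r`.
newest_installed_kernel() {
  awk '$1 == "ii" && $2 ~ /^linux-image-[0-9].*-generic$/ {
         sub(/^linux-image-/, "", $2)   # name column minus prefix matches uname -r
         print $2
       }' | sort -V | tail -1
}

# Sample input (illustrative, not captured from a real host).
sample='ii  linux-image-6.8.0-57-generic   6.8.0-57.59    amd64  Signed kernel image generic
ii  linux-image-6.8.0-117-generic  6.8.0-117.117  amd64  Signed kernel image generic
ii  linux-image-generic            6.8.0-117.117  amd64  Generic Linux kernel image
rc  linux-image-6.8.0-40-generic   6.8.0-40.40    amd64  Signed kernel image generic'

printf '%s\n' "$sample" | newest_installed_kernel
```

The `linux-image-generic` meta-package and the removed (`rc`) entry are both filtered out, so only real installed kernels reach the version sort.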
## Verification
Run against a known-pending host before and after reboot:
```bash
ansible-playbook update.yml --limit majortoot
```
Before reboot: `majortoot: 0 pkg(s) upgraded | REBOOT REQUIRED`
After reboot: `majortoot: 0 pkg(s) upgraded | No reboot needed`
## Related
- [[ansible-regex-search-set-fact-capture-group]] — companion Jinja2 gotcha in the same `host_summary` task
- [[ansible-unattended-upgrades-fleet]] — managing the Ubuntu auto-upgrade stack
- [[ansible-check-mode-false-positives]] — another Ansible reporting quirk


@ -0,0 +1,85 @@
# Postfix + SendGrid: TLS Handshake Failure (Port 465 vs 587)
## Symptom
Outbound mail silently queues with no delivery. `postqueue -p` shows deferred messages:
```
(Cannot start TLS: handshake failure)
```
`/var/log/maillog` shows:
```
SSL_connect error to smtp.sendgrid.net[...]:465: -1
warning: TLS library problem: error:0A00010B:SSL routines::wrong version number
```
Or on port 587:
```
warning: TLS library problem: error:0A0000C1:SSL routines::no shared cipher
```
## Root Cause
Port **465** (SMTPS) uses **implicit TLS** — the connection starts encrypted immediately. Port **587** (submission) uses **STARTTLS** — the connection starts plaintext, then upgrades.
Postfix has two settings that must match the port:
| Port | `smtp_tls_wrappermode` | `smtp_tls_security_level` |
|------|------------------------|---------------------------|
| 465 | `yes` | `encrypt` |
| 587 | `no` | `encrypt` (or `may`) |
If `smtp_tls_wrappermode=yes` is set with port 587, Postfix sends a TLS ClientHello immediately but the server expects a plaintext SMTP greeting first — `wrong version number`.
If `smtp_tls_wrappermode=no` is set with port 465, Postfix waits for a plaintext SMTP greeting but the server expects a TLS ClientHello first, so the session stalls and fails with a timeout, `no shared cipher`, or a connection reset.
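The pairing rule from the table can be expressed as a small check. This is a sketch: `tls_pairing` is a hypothetical helper, and the commented example assumes `relayhost` has the `[host]:port` form used in this article.

```bash
#!/usr/bin/env bash
# Hypothetical helper: check that smtp_tls_wrappermode matches the relay port.
# Usage: tls_pairing <port> <wrappermode yes|no>
tls_pairing() {
  case "$1:$2" in
    465:yes|587:no) echo "ok" ;;
    465:no)         echo "mismatch: port 465 needs smtp_tls_wrappermode=yes" ;;
    587:yes)        echo "mismatch: port 587 needs smtp_tls_wrappermode=no" ;;
    *)              echo "unknown port/mode combination" ;;
  esac
}

# Example with live settings:
#   port=$(postconf -h relayhost | sed 's/.*:\([0-9]*\)$/\1/')
#   tls_pairing "$port" "$(postconf -h smtp_tls_wrappermode)"
```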
## Fix
Use port 587 + STARTTLS (recommended — more widely supported and debuggable):
```bash
postconf -e 'relayhost = [smtp.sendgrid.net]:587'
postconf -e 'smtp_tls_wrappermode = no'
postconf -e 'smtp_tls_security_level = encrypt'
systemctl restart postfix
postqueue -f # flush stuck messages
```
## Verify
```bash
# Check config
postconf relayhost smtp_tls_wrappermode smtp_tls_security_level
# Test TLS connection manually
openssl s_client -starttls smtp -connect smtp.sendgrid.net:587 -brief
# Watch delivery
tail -f /var/log/maillog | grep status=
```
Successful delivery looks like:
```
Untrusted TLS connection established to smtp.sendgrid.net[...]:587: TLSv1.3 with cipher TLS_AES_128_GCM_SHA256
status=sent (250 Ok: queued as ...)
```
## Why "Untrusted"?
If `smtp_tls_CAfile` and `smtp_tls_CApath` are both empty, Postfix can't verify the server certificate and logs "Untrusted TLS connection." The connection is still encrypted — just not authenticated. To fix, point to the system CA bundle:
```bash
postconf -e 'smtp_tls_CAfile = /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem' # Fedora
# or
postconf -e 'smtp_tls_CAfile = /etc/ssl/certs/ca-certificates.crt' # Ubuntu/Debian
```
## Notes
- OpenSSL 3.x is stricter about protocol mismatches than OpenSSL 1.1 — a config that worked on older distros may break after an OS upgrade.
- SendGrid supports both ports, but port 587 + STARTTLS is the documented recommendation.
- This applies to any SMTP relay (Mailgun, AWS SES, etc.), not just SendGrid — the port/wrappermode pairing is universal.


@ -0,0 +1,125 @@
---
title: "WSL2: PyTorch Training Deadlocks on Windows Filesystem Checkpoint Saves"
domain: troubleshooting
category: wsl2
tags: [wsl2, pytorch, huggingface, training, llm, checkpoint, windows, ntfs, deadlock, majortwin]
status: published
created: 2026-05-23
updated: 2026-05-23
---
# WSL2: PyTorch Training Deadlocks on Windows Filesystem Checkpoint Saves
## Problem
A Hugging Face Trainer / Unsloth fine-tuning run starts successfully, logs training steps for a while, then freezes completely. The tqdm progress bar stops advancing, GPU utilization drops to near-zero, but the training process stays alive at 100% CPU with the full model loaded in VRAM. No new checkpoint directories appear.
**Confirming it's a checkpoint deadlock:**
```bash
# Check if training is frozen — same step count + elapsed time across checks
tmux capture-pane -t <session> -p | tail -5
sleep 60
tmux capture-pane -t <session> -p | tail -5
# GPU idle despite process alive
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader
# No new checkpoint directories written
ls -lt /mnt/d/your/training/output/ | head -10
```
If the tqdm step count is identical both times and the newest directory timestamp is from a previous run, the save is deadlocked.
---
## Root Cause
WSL2's `/mnt/d/` paths go through the **virtio-9p filesystem driver** to reach the host Windows NTFS volume. Large sequential writes — like saving a multi-GB PyTorch checkpoint (optimizer states, model weights, scheduler, RNG state) — can deadlock when:
- A Windows process (antivirus, VSS, Windows Search) holds a lock on the output directory
- The Windows virtual disk hits write pressure from concurrent activity
The Linux process blocks in a kernel `write()` syscall waiting for virtio-9p to acknowledge the write. The process is alive and spinning at 100% CPU in the kernel, but no userspace progress occurs. This is distinct from OOM kills (which log clearly) and out-of-disk errors (which exit cleanly).
---
## Fix: Train on Linux-Native Storage
Keep all training I/O on Linux ext4 (`~/`), and copy final artifacts to Windows only after training completes.
### Change output paths
```bash
# Before
TRAIN_OUT="/mnt/d/corpus/training-runs/v9"
GGUF_OUT="/mnt/d/corpus/models"
# After — Linux-native for training
TRAIN_OUT="/home/majorlinux/corpus/training-runs/v8i"
GGUF_OUT="/home/majorlinux/corpus/models"
```
The WSL2 home directory lives on a Linux ext4 `.vhdx` managed by WSL2 — writes here bypass virtio-9p entirely.
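A launcher script could refuse Windows-mounted output paths up front. The guard below is a sketch that assumes WSL2's default automount root (`/mnt/<drive letter>`). `is_windows_mount` and `check_train_out` are hypothetical names, not part of the existing training scripts.

```bash
#!/usr/bin/env bash
# Hypothetical guard: succeed (exit 0) when a path sits under a Windows drive mount.
# Assumes the default automount root, /mnt/<single drive letter>.
is_windows_mount() {
  [[ "$1" =~ ^/mnt/[a-zA-Z](/|$) ]]
}

# Hypothetical pre-flight check for a training launcher.
check_train_out() {
  if is_windows_mount "$1"; then
    echo "refusing: $1 is on a Windows drive" >&2
    return 1
  fi
  echo "ok: $1"
}

check_train_out "/home/majorlinux/corpus/training-runs/v8i"
```

A prefix check is deliberately simple. If the automount root has been changed in `/etc/wsl.conf`, the pattern would need adjusting.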
### Copy to Windows after training finishes
```bash
cp "$GGUF_OUT/majortwin-v8i-q4-k-m.gguf" "/mnt/d/corpus/models/"
cp "$GGUF_OUT/majortwin-v8i-q4-k-m.gguf" "/mnt/d/MajorTwin/06-Models/"
```
Single large-file copies to `/mnt/d/` complete reliably — it's repeated checkpoint saves during training that deadlock.
### Kill a stuck training process
```bash
kill $(pgrep -f 'train_v3.py')
sleep 2
tmux kill-session -t majortwin_v8i
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader
# Should show low utilization and <1GB memory used
```
The original checkpoint files from the previous run in `/mnt/d/` are untouched — the deadlock prevents writes, it does not corrupt existing data.
---
## Why Previous Runs May Have Worked
The deadlock is not guaranteed. It depends on Windows-side state at checkpoint save time. Factors:
- Antivirus scanning newly created checkpoint files
- Windows Search indexing the output directory
- VSS snapshot in progress
- Concurrent Windows desktop I/O
A run on a quiet machine may succeed; the same run during normal desktop use may deadlock.
---
## Confirming the Fix
```bash
# Watch for checkpoint directories appearing at each save_steps interval
watch -n 30 'ls -lt ~/corpus/training-runs/v8i/ | head -8'
# GPU should be active (85-99%) during training steps
nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader
```
---
## Notes
- Setting `save_strategy="no"` in TrainingArguments eliminates checkpoint saves entirely — useful as a diagnostic to confirm this is the cause, at the cost of no crash recovery.
- `torch.compile()` / `torch._inductor` can add hours of CPU-bound kernel compilation before the first training step. Long startup + eventual freeze together can make a session look permanently stuck when they're actually two separate issues.
- This applies to any large sequential WSL2→Windows write, not just PyTorch — large `rsync` or `tar` to `/mnt/<drive>/` can also stall.
---
## Related
- [[wsl2-rebuild-fedora43-training-env]] — Full WSL2 training environment setup
- [[wsl2-backup-powershell]] — Backing up WSL2 virtual disks from PowerShell
- [[ansible-wsl2-world-writable-mount-ignores-cfg]] — Other WSL2 filesystem quirks


@ -71,8 +71,10 @@ updated: 2026-05-15T09:00
* [OBS Studio Setup & Encoding](04-streaming/obs/obs-studio-setup-encoding.md)
* [Plex 4K Codec Compatibility (Apple TV)](04-streaming/plex/plex-4k-codec-compatibility.md)
* [HEVC Batch Re-Encode for Plex Using VAAPI (AMD GPU)](04-streaming/plex/hevc-vaapi-batch-encode.md)
* [Plex Transcoding Troubleshooting](04-streaming/plex/plex-transcoding-troubleshooting.md)
* [Troubleshooting](05-troubleshooting/index.md)
* [Apache Outage: Fail2ban Self-Ban + Missing iptables Rules](05-troubleshooting/networking/fail2ban-self-ban-apache-outage.md)
* [Postfix + SendGrid: TLS Handshake Failure (Port 465 vs 587)](05-troubleshooting/networking/postfix-sendgrid-tls-handshake-failure.md)
* [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](05-troubleshooting/networking/fail2ban-imap-self-ban-mail-client.md)
* [firewalld: Mail Ports Wiped After Reload](05-troubleshooting/networking/firewalld-mail-ports-reset.md)
* [Tailscale SSH: Unexpected Re-Authentication Prompt](05-troubleshooting/networking/tailscale-ssh-reauth-prompt.md)
@ -113,8 +115,10 @@ updated: 2026-05-15T09:00
* [Netdata apps-group FD-utilisation false 100% (silenced fleet-wide)](05-troubleshooting/security/netdata-apps-fds-group-false-positive.md)
* [Ansible: Vault Password File Not Found](05-troubleshooting/ansible-vault-password-file-missing.md)
* [Ansible: ansible.cfg Ignored on WSL2 Windows Mounts](05-troubleshooting/ansible-wsl2-world-writable-mount-ignores-cfg.md)
* [WSL2: PyTorch Training Deadlocks on Windows Filesystem Checkpoint Saves](05-troubleshooting/wsl2-pytorch-checkpoint-windows-filesystem-deadlock.md)
* [Ansible: SSH Timeout During dnf upgrade on Fedora Hosts](05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md)
* [Ansible: regex_search Capture-Group Argument Fails in set_fact](05-troubleshooting/ansible-regex-search-set-fact-capture-group.md)
* [Ansible: Ubuntu Reboot Detection Misses Kernel Upgrades](05-troubleshooting/ansible-ubuntu-reboot-detection-kernel-mismatch.md)
* [Fedora Networking & Kernel Troubleshooting](05-troubleshooting/fedora-networking-kernel-recovery.md)
* [Systemd Session Scope Fails at Login](05-troubleshooting/systemd/session-scope-failure-at-login.md)
* [wget/curl: URLs with Special Characters Fail in Bash](05-troubleshooting/wget-url-special-characters.md)