Add troubleshooting articles: Netdata apps-group FD false-positive + OBS stale script paths

- netdata-apps-fds-group-false-positive: the apps_group_file_descriptors_utilization false 100% on forking/root app groups (tailscaled on MajorToot 2026-05-15), the not-a-privilege gotcha, fleet-wide silence fix in MajorAnsible.
- obs-stale-script-paths: pending from prior session (not on remote).
- SUMMARY.md: link both (re-applied onto upstream after concurrent rebase).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 03:22:12 -04:00 · 28518e403e
commit 28518e403e
parent a785e85821
3 changed files with 243 additions and 0 deletions
--- a/05-troubleshooting/obs-stale-script-paths-after-windows-profile-rename.md
+++ b/05-troubleshooting/obs-stale-script-paths-after-windows-profile-rename.md
@@ -0,0 +1,129 @@
+---
+title: "OBS Studio — \"Error opening file: (null)\" After Windows Profile Rename"
+domain: troubleshooting
+category: streaming
+tags: [obs, streaming, windows, lua, profile-migration]
+status: published
+created: 2026-05-14
+updated: 2026-05-14
+---
+
+# OBS Studio — "Error opening file: (null)" After Windows Profile Rename
+
+## Symptom
+
+Loading a scene collection in OBS Studio triggers a popup like:
+
+```
+[<ScriptName>.lua] Error opening file: (null)
+```
+
+The `(null)` is the giveaway: OBS resolved the registered script path to nothing — the file doesn't exist where the scene collection says it does. Most commonly this happens after a Windows profile was renamed or migrated and `C:\Users\<old>\...` paths were not updated.
+
+## Why it happens
+
+OBS stores per-scene-collection Lua/Python script registrations inside the scene collection JSON at:
+
+```
+%APPDATA%\obs-studio\basic\scenes\<Collection>.json
+```
+
+Each entry under `modules.scripts-tool[]` is an absolute Windows path. Renaming the Windows profile does not rewrite these — the JSON keeps pointing at the old `C:\Users\<old>\...` location, and OBS surfaces the resolution failure as a `(null)` popup on collection load.
+
+## Diagnose
+
+From WSL (or any shell with access to `%APPDATA%`):
+
+```bash
+OBS_DIR="/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio"
+
+# 1. List scene collections
+ls "$OBS_DIR/basic/scenes/"
+
+# 2. Find collections referencing the missing script
+grep -l -i "<script-name-substring>" "$OBS_DIR/basic/scenes/"*.json
+
+# 3. Dump the scripts-tool paths from each suspect collection
+python3 -c "
+import json, sys
+d = json.load(open(sys.argv[1]))
+for s in d.get('modules', {}).get('scripts-tool', []):
+    print(s.get('path'))
+" "$OBS_DIR/basic/scenes/<Collection>.json"
+```
+
+If a printed path contains `C:/Users/<old-username>/...` and the file doesn't exist on disk, you've found it.
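
The dump-and-check step above can be sketched as a single pass. This is a minimal sketch, assuming WSL with Windows drives mounted under the default `/mnt/<drive>` root; the helper names `to_wsl_path` and `missing_scripts` are illustrative and not part of OBS.

```python
import json
import os
import re


def to_wsl_path(win_path):
    """Translate 'C:/Users/x' (either slash style) to '/mnt/c/Users/x'.

    Assumes the default WSL automount root /mnt. Paths without a drive
    letter are returned unchanged.
    """
    m = re.match(r"^([A-Za-z]):[\\/](.*)$", win_path)
    if not m:
        return win_path
    drive = m.group(1).lower()
    rest = m.group(2).replace("\\", "/")
    return f"/mnt/{drive}/{rest}"


def missing_scripts(collection_json):
    """Return registered script paths whose files do not exist on disk."""
    with open(collection_json, encoding="utf-8") as f:
        data = json.load(f)
    scripts = data.get("modules", {}).get("scripts-tool", [])
    paths = [s.get("path") for s in scripts if s.get("path")]
    return [p for p in paths if not os.path.exists(to_wsl_path(p))]
```

Calling `missing_scripts` on a collection file returns an empty list when every registered script resolves, so the same helper can be reused for the verification step after the fix.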
+
+## Fix
+
+> [!warning] Close OBS first
+> OBS rewrites the scene collection JSON when it exits. Any edit made while OBS is running will be overwritten. Confirm with `tasklist.exe | grep obs64` (WSL) or Task Manager.
+
+### 1. Make the missing script reachable
+
+Either:
+
+- **Re-extract / restore the script** to a path under the new profile (recommended — gives you a clean canonical home), or
+- **Leave it in the rescue/migration folder** and point OBS there (fragile if the rescue folder is later deleted).
+
+### 2. Back up the scene collection JSON
+
+```bash
+SCENES="/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio/basic/scenes"
+STAMP="$(date +%Y%m%d-%H%M%S)"
+cp -p "$SCENES/<Collection>.json" "$SCENES/<Collection>.json.$STAMP.bak"
+```
+
+### 3. Rewrite the paths atomically
+
+Edit the JSON in place by parsing it, replacing the matched path strings, and writing through a temp file (so a crash mid-write can't corrupt the collection):
+
+```bash
+python3 <<'PY'
+import json, os, pathlib
+scenes  = "/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio/basic/scenes"
+mapping = {  # old absolute path -> new absolute path
+    "C:/Users/<old>/Pictures/.../<script>.lua":
+    "C:/Users/<new>/Pictures/.../<script>.lua",
+}
+for fn in ("<Collection>.json",):
+    path = os.path.join(scenes, fn)
+    d = json.loads(pathlib.Path(path).read_text(encoding="utf-8"))
+    for entry in d.get("modules", {}).get("scripts-tool", []):
+        if entry.get("path") in mapping:  # only touch exact matches
+            entry["path"] = mapping[entry["path"]]
+    tmp = path + ".tmp"
+    pathlib.Path(tmp).write_text(json.dumps(d, indent=4), encoding="utf-8")  # file is closed before the swap
+    os.replace(tmp, path)  # swap the finished temp file into place
+PY
+```
+
+OBS scene JSONs use forward slashes in Windows paths — preserve that style.
+
+### 4. Verify
+
+Re-run the diagnostic Python snippet and confirm every printed path resolves to a real file (translate `C:/` → `/mnt/c/` from WSL).
+
+### 5. Reopen OBS
+
+Load the scene collection. The popup should be gone.
+
+## Why not just remove the script?
+
+If the script is part of a third-party overlay pack (Twitch Pimpage, OWN3D, etc.), removing the registration also removes the overlay's source presets — fixing the path keeps the imported scenes intact. If you don't actually use the overlay anymore, removing the `scripts-tool` entry is fine; OBS will silently drop the broken reference on next save.
+
+## Generalization
+
+This same pattern applies to any OBS asset path stored in a scene collection or profile:
+
+- Browser source local files
+- Image / media source files
+- Lua / Python script paths
+- VST plugin paths
+
+All of them are absolute, all of them survive a Windows profile rename in stale form, and all of them can be batch-rewritten with the same JSON-edit pattern above. Search for the old username substring across `%APPDATA%\obs-studio\` to catch them all in one pass.
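
A batch rewrite of that kind could look like the sketch below, which walks a parsed JSON value and replaces the old profile prefix in every string it finds. The function name `rewrite_prefix`, the usernames, and the `sources`/`settings`/`file` keys in the sample are hypothetical and chosen for illustration only; the actual key names in a given collection may differ. Back up the file and write through a temp file exactly as in the Fix steps.

```python
def rewrite_prefix(node, old, new):
    """Return a copy of a parsed JSON value with `old` replaced by `new`
    in every string value, at any nesting depth. Keys are left untouched."""
    if isinstance(node, dict):
        return {k: rewrite_prefix(v, old, new) for k, v in node.items()}
    if isinstance(node, list):
        return [rewrite_prefix(v, old, new) for v in node]
    if isinstance(node, str):
        return node.replace(old, new)
    return node  # numbers, booleans, None


# Hypothetical usernames and keys, for illustration only.
sample = {
    "modules": {"scripts-tool": [{"path": "C:/Users/old/Pictures/a.lua"}]},
    "sources": [{"settings": {"file": "C:/Users/old/Videos/b.mp4"}}],
}
fixed = rewrite_prefix(sample, "C:/Users/old/", "C:/Users/new/")
print(fixed["sources"][0]["settings"]["file"])  # C:/Users/new/Videos/b.mp4
```

Matching on the full `C:/Users/<old>/` prefix, including the trailing slash, avoids touching unrelated strings that merely contain the old username.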
+
+## Related
+
+- [[../../MajorInfrastructure/Devices/MajorRig|MajorRig device note]] — Incident Log 2026-05-14 (TTT/MLS scene popups) and 2026-05-07 (`majli` profile retirement that left these references stranded)
+- [[../04-streaming/obs/obs-studio-setup-encoding|OBS Studio Setup and Encoding Settings]]
--- a/05-troubleshooting/security/netdata-apps-fds-group-false-positive.md
+++ b/05-troubleshooting/security/netdata-apps-fds-group-false-positive.md
@@ -0,0 +1,112 @@
+---
+title: Netdata apps-group FD-utilisation false 100% (silenced fleet-wide)
+domain: troubleshooting
+category: security
+tags:
+  - netdata
+  - apps.plugin
+  - file-descriptors
+  - tailscale
+  - false-positive
+  - ansible
+  - fleet
+status: published
+created: 2026-05-15
+updated: 2026-05-15T02:40
+---
+# Netdata apps-group FD-utilisation false 100%
+
+The Netdata stock alarm **`apps_group_file_descriptors_utilization`** (from
+`/usr/lib/netdata/conf.d/health.d/file_descriptors.conf`) fires
+`Raised to Warning — App group <X> file descriptors utilization = 100%`
+emails for application groups that are perfectly healthy. First hit on
+**MajorToot** (the `tailscaled` app group), 2026-05-15.
+
+## The Problem
+
+A Netdata email arrives: *"App group tailscaled file descriptors utilization
+= 100% on MajorToot"*. The process is fine. On the host:
+
+```
+PID 1047    tailscaled (daemon)   fds=35  soft_limit=524287  util=0.01%
+PID 1984541 tailscaled (child)    fds=10  soft_limit=524287  util=0.00%
+PID 1984548 bash (tailscale hook) fds=5   soft_limit=1024    util=0.49%
+```
+
+No PID exceeds **0.5%**, yet `app.fds_open_limit` reads ~100%. Over 1h the raw
+chart was min 0 / **mean 36.7** / max 100, with sustained multi-minute 100%
+plateaus (not isolated spikes).
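
The per-PID percentages in the listing are plain `open fds / soft limit` arithmetic. The sketch below recomputes them from the three rows shown; the function name `utilization_pct` is illustrative.

```python
# (label, open fds, soft limit) copied from the listing above
pids = [
    ("tailscaled (daemon)", 35, 524287),
    ("tailscaled (child)", 10, 524287),
    ("bash (tailscale hook)", 5, 1024),
]


def utilization_pct(open_fds, soft_limit):
    """Share of the soft FD limit in use, as a percentage."""
    return 100.0 * open_fds / soft_limit


per_pid = {label: utilization_pct(fds, limit) for label, fds, limit in pids}
for label, pct in per_pid.items():
    print(f"{label:24s} {pct:.2f}%")

group_max = max(per_pid.values())
print(f"max across PIDs: {group_max:.2f}%")  # max across PIDs: 0.49%
```

Even the worst PID, the `bash` hook with its 1024 soft limit, stays under half a percent, so none of these readings can account for a group value near 100%.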
+
+> This is **not** an `apps.plugin` privilege problem. apps.plugin already has
+> `cap_dac_read_search,cap_sys_ptrace` and `sudo -u netdata cat
+> /proc/<pid>/limits` succeeds. Verify before "fixing" privileges — it's a
+> no-op.
+
+## Root Cause
+
+The stock alarm does `lookup: max -10s` over **every PID in the app group**.
+App groups whose processes fork short-lived children (tailscaled spawns
+route/DNS helpers and bash hooks; `bash` children inherit the systemd default
+soft limit of 1024) trip a false 100%: apps.plugin's per-PID FD-limit read
+**races on transient/just-forked PIDs**, and because the group lookup uses
+`max`, a single bad sample inside the 10-second window pegs the entire group
+to ~100%. The signal carries no usable information for any forking/root app group.
+
+A `lookup: average -5m` does **not** rescue it — the bogus reading sits at
+~100% for sustained multi-minute stretches, so the 5-minute rolling average
+itself still reaches 100.0% (empirically verified on MajorToot).
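
The difference between the two lookups can be illustrated on synthetic data. This is a sketch only: the series below are invented, assume one sample per second, and are not the MajorToot measurements. The helpers `window_max` and `window_mean` are illustrative stand-ins for the `max` and `average` lookups.

```python
def window_max(samples, n):
    """Largest of the last n samples (stand-in for a `max` lookup)."""
    return max(samples[-n:])


def window_mean(samples, n):
    """Mean of the last n samples (stand-in for an `average` lookup)."""
    tail = samples[-n:]
    return sum(tail) / len(tail)


# Synthetic series, one sample per second, values invented for illustration.
spike = [0.5] * 299 + [100.0]          # a single bogus sample
plateau = [0.5] * 60 + [100.0] * 300   # bogus reading held for five minutes

print(window_max(spike, 10))                # 100.0
print(round(window_mean(spike, 300), 2))    # 0.83
print(window_mean(plateau, 300))            # 100.0
```

Averaging does absorb an isolated spike, but once the bogus reading is held for as long as the averaging window, the average is 100% as well, which matches the behaviour described above.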
+
+## The Fix
+
+Silence this template fleet-wide and keep the reliable system-wide FD alarm.
+
+- **Codified in Ansible** (do not hand-edit hosts): `MajorAnsible/netdata.yml`
+  ships `templates/health_apps_fds_group.conf.j2` to
+  `/etc/netdata/health.d/apps_fds_group_override.conf` and reloads via
+  `netdatacli reload-health`.
+- The override redefines `apps_group_file_descriptors_utilization` with
+  `to: silent`. Netdata loads `/etc/netdata/health.d/` *after* the stock
+  `conf.d` dir, so a same-name template deterministically supersedes the stock
+  one (same mechanism as the manual `tcp_resets.conf` override, 2026-04-30).
+- **Safety net retained:** the companion stock template
+  `system_file_descriptors_utilization` (on `system.file_nr_utilization`,
+  `crit > 90`, `to: sysadmin`) is untouched and still catches genuine
+  system-wide FD exhaustion regardless of app grouping.
+- The reload handler is restart-tolerant (`retries`/`until` + `failed_when`
+  ignoring a `netdata.pipe` socket-absent error) because on hosts where the
+  notify-config also drifts, `Restart Netdata` and `Reload Netdata health`
+  can race during the ~5s restart window.
+
+## Verification
+
+```bash
+ssh <host> 'curl -s "http://localhost:19999/api/v1/alarms?all=true" \
+ | python3 -c "import sys,json;A=json.load(sys.stdin)[\"alarms\"]; \
+ print(A[\"app.tailscaled_fds_open_limit.apps_group_file_descriptors_utilization\"][\"recipient\"])"'
+# expect: silent
+```
+
+After the fix the alarm still shows `status=WARNING` in the dashboard
+(cosmetic — silencing suppresses the *notification*, not the computed state);
+`recipient=silent` confirms no more emails. The system-wide alarm should read
+`CLEAR recipient=sysadmin`.
+
+## Notes
+
+- Silenced fleet-wide on all 10 servers on 2026-05-15 (workstations majorrig
+  and majormac were asleep, which is irrelevant: they are not fleet servers).
+- Any future host running a forking/root daemon in a named app group would
+  hit the same false positive, so the silencing is fleet-wide and pre-emptive.
+- **Follow-up debt:** the manual `/etc/netdata/health.d/tcp_resets.conf`
+  override on MajorToot (2026-04-30) is still **not codified in
+  `netdata.yml`** — a per-host divergence the fleet play does not manage.
+  Worth folding into Ansible the same way.
+
+## Related
+
+- [[clamscan-cpu-spike-nice-ionice]]
+- [[netdata-web-log-successful-redirect-heavy-tuning]]
+- Server doc: `30-Areas/MajorInfrastructure/Servers/majortoot.md` (incident
+  2026-05-15)
+- Playbook: `MajorAnsible/netdata.yml` +
+  `templates/health_apps_fds_group.conf.j2`
--- a/SUMMARY.md
+++ b/SUMMARY.md
@@ -105,8 +105,10 @@ updated: 2026-05-11T07:35
    * [rsync over Tailscale: Hung in TCP Teardown After Transfer Completes](05-troubleshooting/networking/rsync-tailscale-teardown-stall.md)
    * [iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators](05-troubleshooting/networking/tailscale-status-json-hostname-localhost-ios.md)
    * [macOS: Repeating Alert Tone from Mirrored iPhone Notification](05-troubleshooting/macos-mirrored-notification-alert-loop.md)
+    * [OBS Studio: Stale Script Paths After Windows Profile Rename](05-troubleshooting/obs-stale-script-paths-after-windows-profile-rename.md)
    * [ClamAV CPU Spike: Safe Scheduling with nice/ionice](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md)
    * [Fedora CA Bundle Missing Symlink — TLS Breaks Fleet-Wide](05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md)
+    * [Netdata apps-group FD-utilisation false 100% (silenced fleet-wide)](05-troubleshooting/security/netdata-apps-fds-group-false-positive.md)
    * [Ansible: Vault Password File Not Found](05-troubleshooting/ansible-vault-password-file-missing.md)
    * [Ansible: ansible.cfg Ignored on WSL2 Windows Mounts](05-troubleshooting/ansible-wsl2-world-writable-mount-ignores-cfg.md)
    * [Ansible: SSH Timeout During dnf upgrade on Fedora Hosts](05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md)