Add troubleshooting articles: Netdata apps-group FD false-positive + OBS stale script paths

- netdata-apps-fds-group-false-positive: the apps_group_file_descriptors_utilization
  false 100% on forking/root app groups (tailscaled on MajorToot 2026-05-15),
  the not-a-privilege gotcha, fleet-wide silence fix in MajorAnsible.
- obs-stale-script-paths: pending from prior session (not on remote).
- SUMMARY.md: link both (re-applied onto upstream after concurrent rebase).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Marcus (via Claude Code) 2026-05-15 03:22:12 -04:00
parent a785e85821
commit 28518e403e
3 changed files with 243 additions and 0 deletions

@ -0,0 +1,129 @@
---
title: "OBS Studio — \"Error opening file: (null)\" After Windows Profile Rename"
domain: troubleshooting
category: streaming
tags: [obs, streaming, windows, lua, profile-migration]
status: published
created: 2026-05-14
updated: 2026-05-14
---
# OBS Studio — "Error opening file: (null)" After Windows Profile Rename
## Symptom
Loading a scene collection in OBS Studio triggers a popup like:
```
[<ScriptName>.lua] Error opening file: (null)
```
The `(null)` is the giveaway: OBS resolved the registered script path to nothing — the file doesn't exist where the scene collection says it does. Most commonly this happens after a Windows profile was renamed or migrated and `C:\Users\<old>\...` paths were not updated.
## Why it happens
OBS stores per-scene-collection Lua/Python script registrations inside the scene collection JSON at:
```
%APPDATA%\obs-studio\basic\scenes\<Collection>.json
```
Each entry under `modules.scripts-tool[]` is an absolute Windows path. Renaming the Windows profile does not rewrite these — the JSON keeps pointing at the old `C:\Users\<old>\...` location, and OBS surfaces the resolution failure as a `(null)` popup on collection load.
## Diagnose
From WSL (or any shell with access to `%APPDATA%`):
```bash
OBS_DIR="/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio"
# 1. List scene collections
ls "$OBS_DIR/basic/scenes/"
# 2. Find collections referencing the missing script
grep -l -i "<script-name-substring>" "$OBS_DIR/basic/scenes/"*.json
# 3. Dump the scripts-tool paths from each suspect collection
python3 -c "
import json, sys
# Load the scene collection and print every registered script path
with open(sys.argv[1], encoding='utf-8') as f:
    d = json.load(f)
for s in d.get('modules', {}).get('scripts-tool', []):
    print(s.get('path'))
" "$OBS_DIR/basic/scenes/<Collection>.json"
```
If a printed path contains `C:/Users/<old-username>/...` and the file doesn't exist on disk, you've found it.
## Fix
> [!warning] Close OBS first
> OBS rewrites the scene collection JSON when it exits. Any edit made while OBS is running will be overwritten. Confirm with `tasklist.exe | grep obs64` (WSL) or Task Manager.
### 1. Make the missing script reachable
Either:
- **Re-extract / restore the script** to a path under the new profile (recommended — gives you a clean canonical home), or
- **Leave it in the rescue/migration folder** and point OBS there (fragile if the rescue folder is later deleted).
### 2. Back up the scene collection JSON
```bash
SCENES="/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio/basic/scenes"
STAMP="$(date +%Y%m%d-%H%M%S)"
cp -p "$SCENES/<Collection>.json" "$SCENES/<Collection>.json.$STAMP.bak"
```
### 3. Rewrite the paths atomically
Edit the JSON in place by parsing it, replacing the matched path strings, and writing through a temp file (so a crash mid-write can't corrupt the collection):
```bash
python3 <<'PY'
import json, os

scenes = "/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio/basic/scenes"
# Old path -> new path; OBS stores these with forward slashes
mapping = {
    "C:/Users/<old>/Pictures/.../<script>.lua":
        "C:/Users/<new>/Pictures/.../<script>.lua",
}
for fn in ("<Collection>.json",):
    path = os.path.join(scenes, fn)
    with open(path, encoding="utf-8") as f:
        d = json.load(f)
    for entry in d.get("modules", {}).get("scripts-tool", []):
        if entry.get("path") in mapping:
            entry["path"] = mapping[entry["path"]]
    # Write to a temp file, then swap it in, so a crash cannot leave a half-written collection
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(d, f, indent=4)
    os.replace(tmp, path)
PY
```
OBS scene JSONs use forward slashes in Windows paths — preserve that style.
### 4. Verify
Re-run the diagnostic Python snippet and confirm every printed path resolves to a real file (when checking from WSL, translate the `C:/` prefix to `/mnt/c/`).
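The existence check can also be scripted. The following is a minimal sketch, not part of the original fix: the helper names `windows_to_wsl` and `missing_scripts` are hypothetical, and it assumes the default WSL automount layout where drive `C:` appears under `/mnt/c/`.

```python
import os
import re


def windows_to_wsl(path: str) -> str:
    """Translate a drive-letter path such as C:/Users/x to /mnt/c/Users/x.

    Assumes the default WSL automount root (/mnt/<drive>).
    """
    m = re.match(r"^([A-Za-z]):[/\\](.*)$", path)
    if not m:
        return path  # not a drive-letter path; leave it unchanged
    drive, rest = m.groups()
    return f"/mnt/{drive.lower()}/" + rest.replace("\\", "/")


def missing_scripts(paths):
    """Return the paths whose WSL translation is not an existing file."""
    return [p for p in paths if not os.path.isfile(windows_to_wsl(p))]
```

Feed `missing_scripts` the list of paths printed by the diagnostic snippet; an empty result means every registered script resolves.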
### 5. Reopen OBS
Load the scene collection. The popup should be gone.
## Why not just remove the script?
If the script is part of a third-party overlay pack (Twitch Pimpage, OWN3D, etc.), removing the registration also removes the overlay's source presets — fixing the path keeps the imported scenes intact. If you don't actually use the overlay anymore, removing the `scripts-tool` entry is fine; OBS will silently drop the broken reference on next save.
## Generalization
This same pattern applies to any OBS asset path stored in a scene collection or profile:
- Browser source local files
- Image / media source files
- Lua / Python script paths
- VST plugin paths
All of them are absolute, all of them survive a Windows profile rename in stale form, and all of them can be batch-rewritten with the same JSON-edit pattern above. Search for the old username substring across `%APPDATA%\obs-studio\` to catch them all in one pass.
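One way to run that search from WSL is sketched below. It is an illustration under assumptions: the function name `find_stale_references` is hypothetical, and limiting the scan to `.json` and `.ini` files is a guess at where OBS keeps path strings, so widen `exts` if your install stores them elsewhere.

```python
import os
import re


def find_stale_references(root, old_user, exts=(".json", ".ini")):
    """Return (file, line_number, line) for every line mentioning Users/<old_user>/.

    Slashes and backslashes (including JSON-escaped ones) are normalised
    before matching, and the comparison is case-insensitive.
    """
    needle = f"users/{old_user.lower()}/"
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith(exts):
                continue  # skips backups such as *.json.<stamp>.bak
            full = os.path.join(dirpath, name)
            with open(full, encoding="utf-8", errors="replace") as f:
                for lineno, line in enumerate(f, 1):
                    norm = re.sub(r"[\\/]+", "/", line.lower())
                    if needle in norm:
                        hits.append((full, lineno, line.strip()))
    return hits


# Example call (placeholders as used elsewhere in this article):
# find_stale_references(
#     "/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio", "<old>")
```

Each hit names the file and line to fix; the rewrite itself can reuse the parse, replace, and temp-file swap shown in step 3.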
## Related
- [[../../MajorInfrastructure/Devices/MajorRig|MajorRig device note]] — Incident Log 2026-05-14 (TTT/MLS scene popups) and 2026-05-07 (`majli` profile retirement that left these references stranded)
- [[../04-streaming/obs/obs-studio-setup-encoding|OBS Studio Setup and Encoding Settings]]

@ -0,0 +1,112 @@
---
title: Netdata apps-group FD-utilisation false 100% (silenced fleet-wide)
domain: troubleshooting
category: security
tags:
- netdata
- apps.plugin
- file-descriptors
- tailscale
- false-positive
- ansible
- fleet
status: published
created: 2026-05-15
updated: 2026-05-15T02:40
---
# Netdata apps-group FD-utilisation false 100%
The Netdata stock alarm **`apps_group_file_descriptors_utilization`** (from
`/usr/lib/netdata/conf.d/health.d/file_descriptors.conf`) fires
`Raised to Warning — App group <X> file descriptors utilization = 100%`
emails for application groups that are perfectly healthy. First hit on
**MajorToot** (the `tailscaled` app group), 2026-05-15.
## The Problem
A Netdata email arrives: *"App group tailscaled file descriptors utilization
= 100% on MajorToot"*. The process is fine. On the host:
```
PID 1047 tailscaled (daemon) fds=35 soft_limit=524287 util=0.01%
PID 1984541 tailscaled (child) fds=10 soft_limit=524287 util=0.00%
PID 1984548 bash (tailscale hook) fds=5 soft_limit=1024 util=0.49%
```
No PID exceeds **0.5%**, yet `app.fds_open_limit` reads ~100%. Over 1h the raw
chart was min 0 / **mean 36.7** / max 100, with sustained multi-minute 100%
plateaus (not isolated spikes).
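The per-PID percentages above are simply open FDs divided by the soft limit.
A small sketch that reproduces them from the figures in the listing (the
function name `fd_utilization` is illustrative, not a Netdata API):

```python
def fd_utilization(open_fds: int, soft_limit: int) -> float:
    """Percentage of the soft open-files limit used by one process."""
    return 100.0 * open_fds / soft_limit


# (label, open fds, soft limit) taken from the host listing above
samples = [
    ("tailscaled (daemon)", 35, 524287),
    ("tailscaled (child)", 10, 524287),
    ("bash (tailscale hook)", 5, 1024),
]

rows = [(name, round(fd_utilization(fds, limit), 2)) for name, fds, limit in samples]
for name, util in rows:
    print(f"{name:24s} util={util:.2f}%")
```

The largest value is the bash hook at 0.49%, so nothing measured on the host
comes anywhere near the 100% the alarm reports.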
> This is **not** an `apps.plugin` privilege problem. apps.plugin already has
> `cap_dac_read_search,cap_sys_ptrace` and `sudo -u netdata cat
> /proc/<pid>/limits` succeeds. Verify before "fixing" privileges — it's a
> no-op.
## Root Cause
The stock alarm does `lookup: max -10s` over **every PID in the app group**.
App groups whose processes fork short-lived children (tailscaled spawns
route/DNS helpers and bash hooks; `bash` children inherit the systemd default
soft limit of 1024) trip a false 100%: apps.plugin's per-PID FD-limit read
**races on transient/just-forked PIDs**, and because the group lookup uses
`max`, a single bad 10-second sample pegs the entire group to ~100%. The
signal carries no usable information for any forking/root app group.
A `lookup: average -5m` does **not** rescue it — the bogus reading sits at
~100% for sustained multi-minute stretches, so the 5-minute rolling average
itself still reaches 100.0% (empirically verified on MajorToot).
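The arithmetic behind both observations can be shown with a toy model. This
is a sketch only: it does not reproduce Netdata's lookup engine, the sample
series is invented for illustration, and the 6-minute plateau length is an
assumption chosen to stand in for the "sustained multi-minute" stretches seen
on MajorToot.

```python
def group_value(per_pid_readings):
    """Model of a max-based group lookup: the worst PID defines the group."""
    return max(per_pid_readings)


def rolling_average(series, window):
    """Simple trailing mean over `window` consecutive samples."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


# One 10-second sample: three healthy PIDs plus one transient PID misread as 100%
healthy = [0.01, 0.00, 0.49]
one_bad_sample = healthy + [100.0]
pegged = group_value(one_bad_sample)

# Hypothetical series of 10-second group values:
# 2 min healthy, 6 min bogus plateau, 2 min healthy
series = [0.49] * 12 + [100.0] * 36 + [0.49] * 12
five_minute_avg = rolling_average(series, window=30)  # 30 samples = 5 minutes
worst_average = max(five_minute_avg)

print(pegged, worst_average)
```

Because the plateau is longer than the averaging window, at least one window
contains nothing but 100% samples, which is why switching from `max -10s` to
`average -5m` does not lower the peak.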
## The Fix
Silence this template fleet-wide, keep the reliable system-wide FD alarm.
- **Codified in Ansible** (do not hand-edit hosts): `MajorAnsible/netdata.yml`
ships `templates/health_apps_fds_group.conf.j2` to
`/etc/netdata/health.d/apps_fds_group_override.conf` and reloads via
`netdatacli reload-health`.
- The override redefines `apps_group_file_descriptors_utilization` with
`to: silent`. Netdata loads `/etc/netdata/health.d/` *after* the stock
`conf.d` dir, so a same-name template deterministically supersedes the stock
one (same mechanism as the manual `tcp_resets.conf` override, 2026-04-30).
- **Safety net retained:** the companion stock template
`system_file_descriptors_utilization` (on `system.file_nr_utilization`,
`crit > 90`, `to: sysadmin`) is untouched and still catches genuine
system-wide FD exhaustion regardless of app grouping.
- The reload handler is restart-tolerant (`retries`/`until` + `failed_when`
ignoring a `netdata.pipe` socket-absent error) because on hosts where the
notify-config also drifts, `Restart Netdata` and `Reload Netdata health`
can race during the ~5s restart window.
## Verification
```bash
ssh <host> 'curl -s "http://localhost:19999/api/v1/alarms?all=true" \
| python3 -c "import sys,json;A=json.load(sys.stdin)[\"alarms\"]; \
print(A[\"app.tailscaled_fds_open_limit.apps_group_file_descriptors_utilization\"][\"recipient\"])"'
# expect: silent
```
After the fix the alarm still shows `status=WARNING` in the dashboard
(cosmetic — silencing suppresses the *notification*, not the computed state);
`recipient=silent` confirms no more emails. The system-wide alarm should read
`CLEAR recipient=sysadmin`.
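To check every app group on a host at once rather than only `tailscaled`,
the same API response can be filtered in Python. This is a sketch under
assumptions: the helper names are hypothetical, and it relies on each alarm
object carrying `status` and `recipient` fields, as the output described
above suggests.

```python
import json
import urllib.request

SUFFIX = ".apps_group_file_descriptors_utilization"


def app_fd_alarm_recipients(alarms_doc):
    """Map each apps-group FD alarm key to its (status, recipient) pair."""
    return {
        name: (alarm.get("status"), alarm.get("recipient"))
        for name, alarm in alarms_doc.get("alarms", {}).items()
        if name.endswith(SUFFIX)
    }


def fetch_alarms(base_url="http://localhost:19999"):
    """Fetch the full alarm list from a local Netdata agent."""
    url = f"{base_url}/api/v1/alarms?all=true"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


# On a host: every value should have "silent" as its recipient.
# for name, (status, recipient) in app_fd_alarm_recipients(fetch_alarms()).items():
#     print(name, status, recipient)
```

Any entry whose recipient is not `silent` indicates the override did not
load on that host.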
## Notes
- Silenced fleet-wide on all 10 servers 2026-05-15 (workstations majorrig/
majormac were asleep — irrelevant, they are not fleet servers).
- Any future host running a forking/root daemon in a named app group would
have hit the same false positive; silencing is fleet-wide and pre-emptive.
- **Follow-up debt:** the manual `/etc/netdata/health.d/tcp_resets.conf`
override on MajorToot (2026-04-30) is still **not codified in
`netdata.yml`** — a per-host divergence the fleet play does not manage.
Worth folding into Ansible the same way.
## Related
- [[clamscan-cpu-spike-nice-ionice]]
- [[netdata-web-log-successful-redirect-heavy-tuning]]
- Server doc: `30-Areas/MajorInfrastructure/Servers/majortoot.md` (incident
2026-05-15)
- Playbook: `MajorAnsible/netdata.yml` +
`templates/health_apps_fds_group.conf.j2`

@ -105,8 +105,10 @@ updated: 2026-05-11T07:35
* [rsync over Tailscale: Hung in TCP Teardown After Transfer Completes](05-troubleshooting/networking/rsync-tailscale-teardown-stall.md)
* [iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators](05-troubleshooting/networking/tailscale-status-json-hostname-localhost-ios.md)
* [macOS: Repeating Alert Tone from Mirrored iPhone Notification](05-troubleshooting/macos-mirrored-notification-alert-loop.md)
* [OBS Studio: Stale Script Paths After Windows Profile Rename](05-troubleshooting/obs-stale-script-paths-after-windows-profile-rename.md)
* [ClamAV CPU Spike: Safe Scheduling with nice/ionice](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md)
* [Fedora CA Bundle Missing Symlink — TLS Breaks Fleet-Wide](05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md)
* [Netdata apps-group FD-utilisation false 100% (silenced fleet-wide)](05-troubleshooting/security/netdata-apps-fds-group-false-positive.md)
* [Ansible: Vault Password File Not Found](05-troubleshooting/ansible-vault-password-file-missing.md)
* [Ansible: ansible.cfg Ignored on WSL2 Windows Mounts](05-troubleshooting/ansible-wsl2-world-writable-mount-ignores-cfg.md)
* [Ansible: SSH Timeout During dnf upgrade on Fedora Hosts](05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md)