Add troubleshooting articles: Netdata apps-group FD false-positive + OBS stale script paths
- netdata-apps-fds-group-false-positive: the apps_group_file_descriptors_utilization false 100% on forking/root app groups (tailscaled on MajorToot 2026-05-15), the not-a-privilege gotcha, fleet-wide silence fix in MajorAnsible.
- obs-stale-script-paths: pending from prior session (not on remote).
- SUMMARY.md: link both (re-applied onto upstream after concurrent rebase).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent a785e85821
commit 28518e403e
3 changed files with 243 additions and 0 deletions
@ -0,0 +1,129 @@
---
title: "OBS Studio — \"Error opening file: (null)\" After Windows Profile Rename"
domain: troubleshooting
category: streaming
tags: [obs, streaming, windows, lua, profile-migration]
status: published
created: 2026-05-14
updated: 2026-05-14
---

# OBS Studio — "Error opening file: (null)" After Windows Profile Rename

## Symptom

Loading a scene collection in OBS Studio triggers a popup like:

```
[<ScriptName>.lua] Error opening file: (null)
```

The `(null)` is the giveaway: OBS resolved the registered script path to nothing, because the file doesn't exist where the scene collection says it does. Most commonly this happens after a Windows profile was renamed or migrated and the `C:\Users\<old>\...` paths were not updated.

## Why it happens

OBS stores per-scene-collection Lua/Python script registrations inside the scene collection JSON at:

```
%APPDATA%\obs-studio\basic\scenes\<Collection>.json
```

Each entry under `modules.scripts-tool[]` is an object whose `path` field holds an absolute Windows path. Renaming the Windows profile does not rewrite these. The JSON keeps pointing at the old `C:\Users\<old>\...` location, and OBS surfaces the resolution failure as a `(null)` popup on collection load.

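A minimal Python sketch of where those paths live. The sample is a hypothetical, heavily trimmed collection. Real files carry many more keys, and script entries typically also hold a `settings` object. The only point here is the `modules` → `scripts-tool` → `path` nesting.

```python
import json

# Hypothetical, heavily trimmed scene collection: only the keys that matter here.
SAMPLE_COLLECTION = """
{
    "name": "Example",
    "modules": {
        "scripts-tool": [
            {"path": "C:/Users/olduser/Pictures/overlay/alerts.lua", "settings": {}},
            {"path": "C:/Users/newuser/obs-scripts/clock.lua", "settings": {}}
        ]
    }
}
"""


def script_paths(collection: dict) -> list[str]:
    """Return the registered script paths from a parsed scene collection."""
    scripts = collection.get("modules", {}).get("scripts-tool", [])
    return [entry["path"] for entry in scripts if entry.get("path")]


if __name__ == "__main__":
    for path in script_paths(json.loads(SAMPLE_COLLECTION)):
        print(path)
```

A collection without a `modules` or `scripts-tool` key simply yields an empty list.
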
## Diagnose

From WSL (or any shell with access to `%APPDATA%`):

```bash
OBS_DIR="/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio"

# 1. List scene collections
ls "$OBS_DIR/basic/scenes/"

# 2. Find collections referencing the missing script
grep -l -i "<script-name-substring>" "$OBS_DIR/basic/scenes/"*.json

# 3. Dump the scripts-tool paths from each suspect collection
python3 -c "
import json, sys
with open(sys.argv[1], encoding='utf-8') as f:
    d = json.load(f)
for s in d.get('modules', {}).get('scripts-tool', []):
    print(s.get('path'))
" "$OBS_DIR/basic/scenes/<Collection>.json"
```

If a printed path contains `C:/Users/<old-username>/...` and the file doesn't exist on disk, you've found it.

## Fix

> [!warning] Close OBS first
> OBS rewrites the scene collection JSON when it exits, so any edit made while OBS is running will be overwritten. Confirm it is closed with `tasklist.exe | grep obs64` (WSL) or Task Manager.

### 1. Make the missing script reachable

Either:

- **Re-extract / restore the script** to a path under the new profile (recommended, because it gives the script a clean canonical home), or
- **Leave it in the rescue/migration folder** and point OBS there (fragile if the rescue folder is later deleted).

### 2. Back up the scene collection JSON

```bash
SCENES="/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio/basic/scenes"
STAMP="$(date +%Y%m%d-%H%M%S)"
cp -p "$SCENES/<Collection>.json" "$SCENES/<Collection>.json.$STAMP.bak"
```

### 3. Rewrite the paths atomically

Edit the JSON in place by parsing it, replacing the matched path strings, and writing through a temp file that is then swapped in with `os.replace` (so a crash mid-write can't corrupt the collection):

```bash
python3 <<'PY'
import json, os

scenes = "/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio/basic/scenes"

# old path -> new path, exactly as stored in the JSON (forward slashes)
mapping = {
    "C:/Users/<old>/Pictures/.../<script>.lua":
        "C:/Users/<new>/Pictures/.../<script>.lua",
}

for fn in ("<Collection>.json",):
    path = os.path.join(scenes, fn)
    with open(path, encoding="utf-8") as f:
        d = json.load(f)

    for entry in d.get("modules", {}).get("scripts-tool", []):
        if entry.get("path") in mapping:
            entry["path"] = mapping[entry["path"]]

    # Write the temp file and close it before swapping it into place.
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(d, f, indent=4, ensure_ascii=False)
    os.replace(tmp, path)
PY
```

OBS scene collection JSON uses forward slashes in Windows paths, so preserve that style in the mapping.

### 4. Verify

Re-run the diagnostic Python snippet and confirm every printed path resolves to a real file (translate `C:/` → `/mnt/c/` from WSL).

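The path translation and existence check can be automated across every collection at once. The following is a minimal Python sketch. It assumes WSL's default `/mnt/<drive>/` automount root and that registered paths use a drive letter (UNC paths are reported rather than checked). The script name in the usage line is hypothetical.

```python
import json
import sys
from pathlib import Path, PureWindowsPath


def windows_to_wsl(win_path: str) -> Path:
    """Map 'C:/Users/x/...' (or the backslash form) to '/mnt/c/Users/x/...'.

    Assumes WSL's default automount root (/mnt); raises ValueError for
    paths without a drive letter, e.g. UNC shares.
    """
    p = PureWindowsPath(win_path)
    if not p.drive.endswith(":"):
        raise ValueError(f"not a drive-letter path: {win_path!r}")
    return Path("/mnt", p.drive[0].lower(), *p.parts[1:])


def missing_scripts(collection_file: Path) -> list[tuple[str, str]]:
    """Return (stored_path, reason) for each registered script not found on disk."""
    with open(collection_file, encoding="utf-8") as f:
        data = json.load(f)
    missing = []
    for entry in data.get("modules", {}).get("scripts-tool", []):
        stored = entry.get("path")
        if not stored:
            continue
        try:
            local = windows_to_wsl(stored)
        except ValueError as exc:
            missing.append((stored, str(exc)))
            continue
        if not local.is_file():
            missing.append((stored, f"no file at {local}"))
    return missing


def main(scenes_dir: str) -> int:
    bad = 0
    # *.json skips the .bak / .tmp siblings created by the backup and rewrite steps.
    for collection in sorted(Path(scenes_dir).glob("*.json")):
        for stored, reason in missing_scripts(collection):
            print(f"{collection.name}: {stored} ({reason})")
            bad += 1
    print("all script paths resolve" if bad == 0 else f"{bad} unresolved script path(s)")
    return 1 if bad else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```

Usage: `python3 check_obs_scripts.py "/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio/basic/scenes"`. A zero exit status means every registration points at a real file.
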
### 5. Reopen OBS

Load the scene collection. The popup should be gone.

## Why not just remove the script?

If the script is part of a third-party overlay pack (Twitch Pimpage, OWN3D, etc.), removing the registration also removes the overlay's source presets; fixing the path keeps the imported scenes intact. If you don't actually use the overlay anymore, removing the `scripts-tool` entry is fine; OBS will silently drop the broken reference on next save.

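If you take the removal route, the same parse-and-rewrite pattern can prune only the dead registrations. This is a hedged Python sketch. It assumes WSL's `/mnt/<drive>/` automount, that OBS is closed, and that the collection was backed up as in step 2. It drops `scripts-tool` entries whose file cannot be found and leaves every other key in the collection untouched.

```python
import json
import os
import sys
from pathlib import Path, PureWindowsPath


def resolves(win_path: str) -> bool:
    """True if a drive-letter Windows path exists as a file under WSL's /mnt/<drive>/."""
    p = PureWindowsPath(win_path)
    if not p.drive.endswith(":"):
        return False
    return Path("/mnt", p.drive[0].lower(), *p.parts[1:]).is_file()


def prune_missing(data: dict, exists=resolves) -> list[str]:
    """Drop scripts-tool entries whose file is missing; return the removed paths."""
    scripts = data.get("modules", {}).get("scripts-tool", [])
    keep, removed = [], []
    for entry in scripts:
        (keep if exists(entry.get("path", "")) else removed).append(entry)
    if removed:
        data["modules"]["scripts-tool"] = keep
    return [entry.get("path", "") for entry in removed]


def main(collection_file: str) -> None:
    with open(collection_file, encoding="utf-8") as f:
        data = json.load(f)
    removed = prune_missing(data)
    for path in removed:
        print(f"dropping stale script registration: {path}")
    if removed:
        # Same temp-file + os.replace pattern as step 3.
        tmp = collection_file + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=4, ensure_ascii=False)
        os.replace(tmp, collection_file)


if __name__ == "__main__":
    main(sys.argv[1])
```

The `exists` parameter is only there so the pruning logic can be exercised without touching the filesystem.
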
## Generalization

This same pattern applies to any OBS asset path stored in a scene collection or profile:

- Browser source local files
- Image / media source files
- Lua / Python script paths
- VST plugin paths

All of them are absolute, all of them survive a Windows profile rename in stale form, and all of them can be batch-rewritten with the same JSON-edit pattern above. Search for the old username substring across `%APPDATA%\obs-studio\` to catch them all in one pass.

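A grep finds the files, but it does not tell you where in the JSON a hit sits. The sketch below walks every `*.json` under a directory. For each string value containing the old username (case-insensitive), it prints a JSON location such as `$.sources[0].settings.file`. This is a hedged sketch: it only parses JSON. Profile settings live in `.ini` files (for example `basic/profiles/<Profile>/basic.ini`), so a plain `grep -ri` over the directory still complements it.

```python
import json
import sys
from pathlib import Path
from typing import Iterator


def find_strings(node, needle: str, where: str = "$") -> Iterator[tuple[str, str]]:
    """Yield (json_location, value) for every string value containing needle."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_strings(value, needle, f"{where}.{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from find_strings(value, needle, f"{where}[{i}]")
    elif isinstance(node, str) and needle.lower() in node.lower():
        yield where, node


def scan(obs_dir: str, old_user: str) -> None:
    for json_file in sorted(Path(obs_dir).rglob("*.json")):
        try:
            data = json.loads(json_file.read_text(encoding="utf-8"))
        except (OSError, UnicodeDecodeError, json.JSONDecodeError):
            continue  # skip unreadable or non-JSON files
        for where, value in find_strings(data, old_user):
            print(f"{json_file.relative_to(obs_dir)}  {where} = {value}")


if __name__ == "__main__":
    # e.g. scan("/mnt/c/Users/<current-windows-user>/AppData/Roaming/obs-studio", "<old>")
    scan(sys.argv[1], sys.argv[2])
```
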
## Related

- [[../../MajorInfrastructure/Devices/MajorRig|MajorRig device note]]: Incident Log 2026-05-14 (TTT/MLS scene popups) and 2026-05-07 (`majli` profile retirement that left these references stranded)
- [[../04-streaming/obs/obs-studio-setup-encoding|OBS Studio Setup and Encoding Settings]]

@ -0,0 +1,112 @@
---
title: Netdata apps-group FD-utilisation false 100% (silenced fleet-wide)
domain: troubleshooting
category: security
tags:
  - netdata
  - apps.plugin
  - file-descriptors
  - tailscale
  - false-positive
  - ansible
  - fleet
status: published
created: 2026-05-15
updated: 2026-05-15T02:40
---

# Netdata apps-group FD-utilisation false 100%

The Netdata stock alarm **`apps_group_file_descriptors_utilization`** (from
`/usr/lib/netdata/conf.d/health.d/file_descriptors.conf`) fires
`Raised to Warning — App group <X> file descriptors utilization = 100%`
emails for application groups that are perfectly healthy. First hit on
**MajorToot** (the `tailscaled` app group), 2026-05-15.

## The Problem

A Netdata email arrives: *"App group tailscaled file descriptors utilization
= 100% on MajorToot"*. The process is fine. On the host:

```
PID 1047     tailscaled (daemon)     fds=35  soft_limit=524287  util=0.01%
PID 1984541  tailscaled (child)      fds=10  soft_limit=524287  util=0.00%
PID 1984548  bash (tailscale hook)   fds=5   soft_limit=1024    util=0.49%
```

No PID exceeds **0.5%**, yet `app.fds_open_limit` reads ~100%. Over 1h the raw
chart was min 0 / **mean 36.7** / max 100, with sustained multi-minute 100%
plateaus (not isolated spikes).

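A per-PID table like the one above can be reproduced straight from `/proc`. For each process, count the entries in `/proc/<pid>/fd` and parse the soft `Max open files` value from `/proc/<pid>/limits`. The sketch below assumes a Linux host and root, since listing another user's `fd` directory needs privileges. It approximates the "app group" as every process whose `comm` matches the daemon name, plus all their descendants. That is an assumption made for illustration: apps.plugin itself groups processes via `apps_groups.conf`.

```python
import os
import sys
from typing import Optional


def soft_nofile_limit(limits_text: str) -> Optional[int]:
    """Parse the soft 'Max open files' value out of /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft = line[len("Max open files"):].split()[0]
            return None if soft == "unlimited" else int(soft)
    return None


def read(path: str) -> str:
    with open(path) as f:
        return f.read()


def parent_pid(pid: int) -> int:
    # comm (field 2) can contain spaces or ')', so split after the LAST ')'.
    return int(read(f"/proc/{pid}/stat").rsplit(")", 1)[1].split()[1])


def family(root_name: str) -> list[int]:
    """PIDs whose comm equals root_name, plus all of their descendants."""
    comms, parents = {}, {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        pid = int(entry)
        try:
            comms[pid] = read(f"/proc/{pid}/comm").strip()
            parents[pid] = parent_pid(pid)
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # exited while we were scanning
    members = {pid for pid, comm in comms.items() if comm == root_name}
    while True:  # keep adding children of members until nothing new appears
        new = {pid for pid, ppid in parents.items() if ppid in members} - members
        if not new:
            return sorted(members)
        members |= new


def report(root_name: str) -> None:
    for pid in family(root_name):
        try:
            fds = len(os.listdir(f"/proc/{pid}/fd"))
            soft = soft_nofile_limit(read(f"/proc/{pid}/limits"))
            comm = read(f"/proc/{pid}/comm").strip()
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # exited mid-scan, or not allowed to list its fd directory
        util = f"{100 * fds / soft:.2f}%" if soft else "n/a"
        print(f"PID {pid:<8} {comm:<16} fds={fds:<4} soft_limit={soft}  util={util}")


if __name__ == "__main__":
    report(sys.argv[1] if len(sys.argv) > 1 else "tailscaled")
```

Short-lived children can vanish between listing `/proc` and reading their files. The sketch just skips those; it does not try to model how apps.plugin handles the same situation.
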
> This is **not** an `apps.plugin` privilege problem. apps.plugin already has
> `cap_dac_read_search,cap_sys_ptrace` and `sudo -u netdata cat /proc/<pid>/limits`
> succeeds. Verify before "fixing" privileges: it's a no-op.

## Root Cause

The stock alarm does `lookup: max -10s` over **every PID in the app group**.
App groups whose processes fork short-lived children (tailscaled spawns
route/DNS helpers and bash hooks; `bash` children inherit the systemd default
soft limit of 1024) trip a false 100%: apps.plugin's per-PID FD-limit read
**races on transient/just-forked PIDs**, and because the group lookup uses
`max`, a single bad 10-second sample pegs the entire group to ~100%. The
signal carries no usable information for any forking/root app group.

A `lookup: average -5m` does **not** rescue it. The bogus reading sits at
~100% for sustained multi-minute stretches, so the 5-minute rolling average
itself still reaches 100.0% (empirically verified on MajorToot).

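The difference between the two lookups can be shown with a small simulation. This is a hedged sketch that abstracts Netdata away entirely. It assumes one sample per second and a healthy baseline of 0.5%, and compares one isolated bogus 100% reading against a hypothetical 7-minute bogus plateau. A trailing-window `max` over 10 samples stands in for `lookup: max -10s`, and a trailing-window mean over 300 samples stands in for `lookup: average -5m`.

```python
from collections import deque
from statistics import fmean


def rolling(samples, window, agg):
    """Apply agg() to each full trailing window of `window` samples."""
    buf = deque(maxlen=window)
    out = []
    for sample in samples:
        buf.append(sample)
        if len(buf) == window:
            out.append(agg(buf))
    return out


if __name__ == "__main__":
    baseline = 0.5  # % utilisation a healthy group actually shows
    spike = [baseline] * 600
    spike[100] = 100.0  # one bogus 1 s reading
    plateau = [baseline] * 300 + [100.0] * 420 + [baseline] * 300  # 7 min bogus stretch

    for name, series in (("spike", spike), ("plateau", plateau)):
        peak_max = max(rolling(series, 10, max))      # stands in for lookup: max -10s
        peak_avg = max(rolling(series, 300, fmean))   # stands in for lookup: average -5m
        print(f"{name:<8} max -10s peak={peak_max:.1f}%  average -5m peak={peak_avg:.2f}%")
```

Both series peg the 10-second `max` at 100%. The 5-minute average flattens the lone spike to well under 1%, but it still reaches 100% once a window sits entirely inside the plateau. That is the behaviour described above for sustained bogus stretches.
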
## The Fix

Silence this template fleet-wide and keep the reliable system-wide FD alarm.

- **Codified in Ansible** (do not hand-edit hosts): `MajorAnsible/netdata.yml`
  ships `templates/health_apps_fds_group.conf.j2` to
  `/etc/netdata/health.d/apps_fds_group_override.conf` and reloads via
  `netdatacli reload-health`.
- The override redefines `apps_group_file_descriptors_utilization` with
  `to: silent`. Netdata loads `/etc/netdata/health.d/` *after* the stock
  `conf.d` dir, so a same-name template deterministically supersedes the stock
  one (same mechanism as the manual `tcp_resets.conf` override, 2026-04-30).
- **Safety net retained:** the companion stock template
  `system_file_descriptors_utilization` (on `system.file_nr_utilization`,
  `crit > 90`, `to: sysadmin`) is untouched and still catches genuine
  system-wide FD exhaustion regardless of app grouping.
- The reload handler is restart-tolerant (`retries`/`until` + `failed_when`
  ignoring a `netdata.pipe` socket-absent error) because on hosts where the
  notify-config also drifts, `Restart Netdata` and `Reload Netdata health`
  can race during the ~5s restart window.

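The contents of `health_apps_fds_group.conf.j2` are not reproduced here. To spot-check the rendered override on a host without waiting for the API, a lightweight parser is enough. This is a hedged sketch that assumes the usual Netdata health-config layout: one `key: value` per line, with a new stanza starting at each `alarm:` or `template:` line. It is not Netdata's own parser and ignores line continuations. The default path is the one named above.

```python
import sys

TEMPLATE = "apps_group_file_descriptors_utilization"


def parse_health_conf(text: str) -> list[dict[str, str]]:
    """Split health config text into stanzas of {key: value}.

    A new stanza starts at each 'alarm:' or 'template:' line; '#' comments
    and blank lines are ignored. Line continuations are not handled.
    """
    stanzas = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key in ("alarm", "template"):
            stanzas.append({})
        if stanzas:
            stanzas[-1][key] = value
    return stanzas


def is_silenced(text: str, name: str = TEMPLATE) -> bool:
    """True if some stanza defines `name` and routes it to `to: silent`."""
    for stanza in parse_health_conf(text):
        if name in (stanza.get("template"), stanza.get("alarm")):
            if stanza.get("to", "").split() == ["silent"]:
                return True
    return False


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/etc/netdata/health.d/apps_fds_group_override.conf"
    with open(path) as f:
        ok = is_silenced(f.read())
    print(f"{path}: {'silenced' if ok else 'NOT silenced'}")
    sys.exit(0 if ok else 1)
```
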
## Verification

```bash
ssh <host> 'curl -s "http://localhost:19999/api/v1/alarms?all=true" \
  | python3 -c "import sys,json;A=json.load(sys.stdin)[\"alarms\"]; \
print(A[\"app.tailscaled_fds_open_limit.apps_group_file_descriptors_utilization\"][\"recipient\"])"'
# expect: silent
```

After the fix the alarm still shows `status=WARNING` in the dashboard
(cosmetic: silencing suppresses the *notification*, not the computed state);
`recipient=silent` confirms no more emails. The system-wide alarm should read
`CLEAR recipient=sysadmin`.

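The one-liner above checks a single hard-coded alarm key. A sketch that checks both expectations across every app group follows. It assumes the same endpoint and port as the one-liner, and that `alarms` is keyed as `<chart>.<alarm-name>` with `status` and `recipient` fields, which is what the one-liner already relies on. It flags any app-group FD alarm that would still notify, and it also flags a system-wide FD alarm that has been silenced.

```python
import json
import sys
import urllib.request

API = "http://localhost:19999/api/v1/alarms?all=true"
WATCH = ("apps_group_file_descriptors_utilization", "system_file_descriptors_utilization")


def summarize(alarms_doc: dict) -> list[tuple[str, str, str]]:
    """Return (key, status, recipient) for every alarm whose name is in WATCH."""
    rows = []
    for key, alarm in sorted(alarms_doc.get("alarms", {}).items()):
        if key.rsplit(".", 1)[-1] in WATCH:  # keys look like <chart>.<alarm-name>
            rows.append((key, alarm.get("status", "?"), alarm.get("recipient", "?")))
    return rows


def problems(rows) -> list[str]:
    """Flag app-group FD alarms that still notify, or a silenced system-wide FD alarm."""
    out = []
    for key, _status, recipient in rows:
        if key.endswith(".apps_group_file_descriptors_utilization") and recipient != "silent":
            out.append(f"{key}: expected recipient=silent, got {recipient}")
        if key.endswith(".system_file_descriptors_utilization") and recipient == "silent":
            out.append(f"{key}: system-wide safety net is silenced")
    return out


if __name__ == "__main__":
    url = sys.argv[1] if len(sys.argv) > 1 else API
    with urllib.request.urlopen(url, timeout=10) as resp:
        doc = json.load(resp)
    rows = summarize(doc)
    for key, status, recipient in rows:
        print(f"{status:<8} recipient={recipient:<10} {key}")
    issues = problems(rows)
    for issue in issues:
        print("PROBLEM:", issue)
    sys.exit(1 if issues else 0)
```

Because the script reads from stdin when piped, it can run remotely without copying it over, e.g. `ssh <host> python3 - < check_fd_alarms.py` (the file name is hypothetical).
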
## Notes

- Silenced fleet-wide on all 10 servers 2026-05-15 (workstations
  majorrig/majormac were asleep; irrelevant, since they are not fleet servers).
- Any future host running a forking/root daemon in a named app group would
  have hit the same false positive; silencing is fleet-wide and pre-emptive.
- **Follow-up debt:** the manual `/etc/netdata/health.d/tcp_resets.conf`
  override on MajorToot (2026-04-30) is still **not codified in
  `netdata.yml`**, a per-host divergence the fleet play does not manage.
  Worth folding into Ansible the same way.

## Related

- [[clamscan-cpu-spike-nice-ionice]]
- [[netdata-web-log-successful-redirect-heavy-tuning]]
- Server doc: `30-Areas/MajorInfrastructure/Servers/majortoot.md` (incident
  2026-05-15)
- Playbook: `MajorAnsible/netdata.yml` +
  `templates/health_apps_fds_group.conf.j2`

@ -105,8 +105,10 @@ updated: 2026-05-11T07:35
* [rsync over Tailscale: Hung in TCP Teardown After Transfer Completes](05-troubleshooting/networking/rsync-tailscale-teardown-stall.md)
* [iOS Tailscale Clients Report HostName="localhost" — Breaks /etc/hosts Generators](05-troubleshooting/networking/tailscale-status-json-hostname-localhost-ios.md)
* [macOS: Repeating Alert Tone from Mirrored iPhone Notification](05-troubleshooting/macos-mirrored-notification-alert-loop.md)
* [OBS Studio: Stale Script Paths After Windows Profile Rename](05-troubleshooting/obs-stale-script-paths-after-windows-profile-rename.md)
* [ClamAV CPU Spike: Safe Scheduling with nice/ionice](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md)
* [Fedora CA Bundle Missing Symlink — TLS Breaks Fleet-Wide](05-troubleshooting/security/fedora-ca-bundle-missing-symlink.md)
* [Netdata apps-group FD-utilisation false 100% (silenced fleet-wide)](05-troubleshooting/security/netdata-apps-fds-group-false-positive.md)
* [Ansible: Vault Password File Not Found](05-troubleshooting/ansible-vault-password-file-missing.md)
* [Ansible: ansible.cfg Ignored on WSL2 Windows Mounts](05-troubleshooting/ansible-wsl2-world-writable-mount-ignores-cfg.md)
* [Ansible: SSH Timeout During dnf upgrade on Fedora Hosts](05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md)