| title | domain | category | tags | status | created | updated |
|---|---|---|---|---|---|---|
| Netdata apps-group FD-utilisation false 100% (silenced fleet-wide) | troubleshooting | security | | published | 2026-05-15 | 2026-05-15T02:40 |
Netdata apps-group FD-utilisation false 100%
The Netdata stock alarm apps_group_file_descriptors_utilization (from
/usr/lib/netdata/conf.d/health.d/file_descriptors.conf) fires
Raised to Warning — App group <X> file descriptors utilization = 100%
emails for application groups that are perfectly healthy. First hit on
MajorToot (the tailscaled app group), 2026-05-15.
The Problem
A Netdata email arrives: "App group tailscaled file descriptors utilization = 100% on MajorToot". The process is fine. On the host:
PID 1047 tailscaled (daemon) fds=35 soft_limit=524287 util=0.01%
PID 1984541 tailscaled (child) fds=10 soft_limit=524287 util=0.00%
PID 1984548 bash (tailscale hook) fds=5 soft_limit=1024 util=0.49%
No PID exceeds 0.5%, yet app.fds_open_limit reads ~100%. Over 1h the raw
chart was min 0 / mean 36.7 / max 100, with sustained multi-minute 100%
plateaus (not isolated spikes).
This is not an apps.plugin privilege problem. apps.plugin already has
cap_dac_read_search,cap_sys_ptrace, and sudo -u netdata cat /proc/<pid>/limits
succeeds. Verify this before "fixing" privileges; changing them is a no-op.
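The per-PID figures above can be reproduced by hand from /proc. The following is a minimal sketch, assuming the usual Linux layout of /proc/<pid>/limits (a "Max open files" row with soft and hard columns); the function names are illustrative and this is not how apps.plugin itself computes the value.

```python
import os


def parse_soft_nofile(limits_text):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None if the row is missing or the limit is 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: name (three words), soft limit, hard limit, units.
            soft = line.split()[3]
            return None if soft == "unlimited" else int(soft)
    return None


def fd_utilization(open_fds, soft_limit):
    """Open descriptors as a percentage of the soft limit."""
    return 100.0 * open_fds / soft_limit


def inspect_pid(pid):
    """Read one PID's FD count and soft limit (needs read access to /proc/<pid>)."""
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as fh:
        soft = parse_soft_nofile(fh.read())
    return open_fds, soft


# Figures quoted above for the three PIDs in the tailscaled group.
for name, fds, soft in [
    ("tailscaled (daemon)", 35, 524287),
    ("tailscaled (child)", 10, 524287),
    ("bash (tailscale hook)", 5, 1024),
]:
    print(f"{name}: fds={fds} soft_limit={soft} util={fd_utilization(fds, soft):.2f}%")
```

On a live host, inspect_pid(<pid>) run as the netdata user gives the inputs for fd_utilization; if that call succeeds, privileges are not the issue.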
Root Cause
The stock alarm does lookup: max -10s over every PID in the app group.
App groups whose processes fork short-lived children (tailscaled spawns
route/DNS helpers and bash hooks; bash children inherit the systemd default
soft limit of 1024) trip a false 100%: apps.plugin's per-PID FD-limit read
races on transient/just-forked PIDs, and because the group lookup uses
max, a single bad 10-second sample pegs the entire group to ~100%. The
signal carries no usable information for any forking/root app group.
A lookup: average -5m does not rescue it — the bogus reading sits at
~100% for sustained multi-minute stretches, so the 5-minute rolling average
itself still reaches 100.0% (empirically verified on MajorToot).
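The arithmetic behind both observations can be illustrated with a toy model. This is only a sketch under stated assumptions: the group value is modelled as the maximum over per-PID readings, samples arrive every 10 seconds, and a mis-read transient PID contributes a bogus 100%. It does not reproduce Netdata's actual lookup engine.

```python
def group_reading(per_pid_utilization):
    """Toy model: the group value is the maximum over its PIDs."""
    return max(per_pid_utilization)


def mean(values):
    return sum(values) / len(values)


HEALTHY = [0.01, 0.00, 0.49]  # per-PID figures from the incident
BOGUS = 100.0                 # assumed reading for a mis-read transient PID
SAMPLES_PER_5M = 30           # one sample every 10 s

good = group_reading(HEALTHY)
bad = group_reading(HEALTHY + [BOGUS])

# Case 1: a single bad sample inside an otherwise healthy 5-minute window.
isolated = [bad] + [good] * (SAMPLES_PER_5M - 1)
# Case 2: the bogus reading persists for the whole window (what was observed).
plateau = [bad] * SAMPLES_PER_5M

print(f"one bad PID, group max: {bad:.1f}%")
print(f"isolated spike, 5m mean: {mean(isolated):.1f}%")
print(f"plateau, 5m mean: {mean(plateau):.1f}%")
```

Averaging would only help if the bad readings were isolated spikes; with a plateau spanning the whole window, the mean equals the bogus value.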
The Fix
Silence this template fleet-wide and keep the reliable system-wide FD alarm.
- Codified in Ansible (do not hand-edit hosts): MajorAnsible/netdata.yml ships templates/health_apps_fds_group.conf.j2 to /etc/netdata/health.d/apps_fds_group_override.conf and reloads via netdatacli reload-health.
- The override redefines apps_group_file_descriptors_utilization with to: silent. Netdata loads /etc/netdata/health.d/ after the stock conf.d dir, so a same-name template deterministically supersedes the stock one (same mechanism as the manual tcp_resets.conf override, 2026-04-30).
- Safety net retained: the companion stock template system_file_descriptors_utilization (on system.file_nr_utilization, crit > 90, to: sysadmin) is untouched and still catches genuine system-wide FD exhaustion regardless of app grouping.
- The reload handler is restart-tolerant (retries/until + failed_when ignoring a netdata.pipe socket-absent error) because on hosts where the notify-config also drifts, Restart Netdata and Reload Netdata health can race during the ~5s restart window.
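The restart-tolerant reload can be sketched outside Ansible as a plain retry loop. This is one possible reading of the handler's intent, not the playbook's actual logic: the retry count, the delay, and the "netdata.pipe" substring check used to recognise the socket-absent error are all assumptions made for illustration.

```python
import subprocess
import time


def reload_health(run=subprocess.run, retries=5, delay=2.0, sleep=time.sleep):
    """Run `netdatacli reload-health`, retrying while the agent is restarting.

    Returns the attempt number that succeeded. Matching on the substring
    "netdata.pipe" is an assumed way to detect the socket-absent error.
    """
    last_error = ""
    for attempt in range(1, retries + 1):
        proc = run(["netdatacli", "reload-health"], capture_output=True, text=True)
        if proc.returncode == 0:
            return attempt
        last_error = (proc.stderr or "") + (proc.stdout or "")
        if "netdata.pipe" not in last_error:
            # Not the restart race: surface the failure immediately.
            raise RuntimeError(f"reload-health failed: {last_error.strip()}")
        sleep(delay)  # agent is probably mid-restart; wait and try again
    raise TimeoutError(f"netdata.pipe still absent after {retries} attempts")
```

The run and sleep parameters exist only so the loop can be exercised without a Netdata agent; with the illustrative defaults, the loop makes five attempts two seconds apart, which outlasts the ~5s restart window.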
Verification
ssh <host> 'curl -s "http://localhost:19999/api/v1/alarms?all=true" \
| python3 -c "import sys,json;A=json.load(sys.stdin)[\"alarms\"]; \
print(A[\"app.tailscaled_fds_open_limit.apps_group_file_descriptors_utilization\"][\"recipient\"])"'
# expect: silent
After the fix the alarm still shows status=WARNING in the dashboard
(cosmetic — silencing suppresses the notification, not the computed state);
recipient=silent confirms no more emails. The system-wide alarm should read
CLEAR recipient=sysadmin.
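To check every app group on a host rather than only tailscaled, the same API response can be filtered in a short script. This is a sketch that assumes all apps-group alarm keys end in the template name, as the tailscaled key in the one-liner does; the example group name and the helper names are hypothetical.

```python
import json
import urllib.request

SUFFIX = ".apps_group_file_descriptors_utilization"


def unsilenced_group_alarms(alarms):
    """Return names of apps-group FD alarms whose recipient is not 'silent'."""
    return sorted(
        name
        for name, alarm in alarms.items()
        if name.endswith(SUFFIX) and alarm.get("recipient") != "silent"
    )


def fetch_alarms(base_url="http://localhost:19999"):
    """Fetch the 'alarms' map from the same endpoint the one-liner queries."""
    url = f"{base_url}/api/v1/alarms?all=true"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["alarms"]


# Hand-written payload for illustration; on a host, pass fetch_alarms() instead.
sample = {
    "app.tailscaled_fds_open_limit" + SUFFIX: {
        "status": "WARNING",
        "recipient": "silent",
    },
    # Hypothetical group that the override has not reached yet.
    "app.example_fds_open_limit" + SUFFIX: {
        "status": "CLEAR",
        "recipient": "sysadmin",
    },
}
print(unsilenced_group_alarms(sample))
```

An empty list means every apps-group FD alarm on the host is silenced; the status field is deliberately ignored because it stays WARNING after the fix.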
Notes
- Silenced fleet-wide on all 10 servers 2026-05-15 (workstations majorrig/majormac were asleep, which is irrelevant because they are not fleet servers).
- Any future host running a forking/root daemon in a named app group would have hit the same false positive; silencing is fleet-wide and pre-emptive.
- Follow-up debt: the manual /etc/netdata/health.d/tcp_resets.conf override on MajorToot (2026-04-30) is still not codified in netdata.yml, a per-host divergence the fleet play does not manage. Worth folding into Ansible the same way.
Related
- clamscan-cpu-spike-nice-ionice
- netdata-web-log-successful-redirect-heavy-tuning
- Server doc: 30-Areas/MajorInfrastructure/Servers/majortoot.md (incident 2026-05-15)
- Playbook: MajorAnsible/netdata.yml + templates/health_apps_fds_group.conf.j2