---
title: Netdata apps-group FD-utilisation false 100% (silenced fleet-wide)
domain: troubleshooting
category: security
tags:
- netdata
- apps.plugin
- file-descriptors
- tailscale
- false-positive
- ansible
- fleet
status: published
created: 2026-05-15
updated: 2026-05-15T02:40
---
# Netdata apps-group FD-utilisation false 100%

The Netdata stock alarm **`apps_group_file_descriptors_utilization`** (from
`/usr/lib/netdata/conf.d/health.d/file_descriptors.conf`) fires
`Raised to Warning — App group <X> file descriptors utilization = 100%`
emails for application groups that are perfectly healthy. First hit on
**MajorToot** (the `tailscaled` app group), 2026-05-15.

## The Problem

A Netdata email arrives: *"App group tailscaled file descriptors utilization
= 100% on MajorToot"*. The process is fine. On the host:

```
PID 1047     tailscaled (daemon)     fds=35  soft_limit=524287  util=0.01%
PID 1984541  tailscaled (child)      fds=10  soft_limit=524287  util=0.00%
PID 1984548  bash (tailscale hook)   fds=5   soft_limit=1024    util=0.49%
```

No PID exceeds **0.5%**, yet `app.fds_open_limit` reads ~100%. Over 1h the raw
chart was min 0 / **mean 36.7** / max 100, with sustained multi-minute 100%
plateaus (not isolated spikes).

> This is **not** an `apps.plugin` privilege problem. apps.plugin already has
> `cap_dac_read_search,cap_sys_ptrace` and `sudo -u netdata cat
> /proc/<pid>/limits` succeeds. Verify this before "fixing" privileges:
> changing them is a no-op.
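
The per-PID figures above can be reproduced straight from `/proc`. The following is a minimal sketch, assuming a Linux host; the function names `parse_soft_nofile` and `fd_utilisation` are hypothetical helpers for illustration and are not part of Netdata or apps.plugin.

```python
import os
from typing import Optional


def parse_soft_nofile(limits_text: str) -> Optional[int]:
    """Return the soft 'Max open files' limit from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Remaining columns: soft limit, hard limit, units
            fields = line[len("Max open files"):].split()
            if not fields or fields[0] == "unlimited":
                return None
            return int(fields[0])
    return None


def fd_utilisation(pid: int, proc_root: str = "/proc") -> Optional[float]:
    """Open FDs as a percentage of the soft limit, or None if unreadable."""
    try:
        with open(f"{proc_root}/{pid}/limits") as fh:
            soft = parse_soft_nofile(fh.read())
        open_fds = len(os.listdir(f"{proc_root}/{pid}/fd"))
    except OSError:
        return None  # PID already exited, or we lack permission to read it
    if not soft:
        return None
    return 100.0 * open_fds / soft
```

Listing another user's `/proc/<pid>/fd` generally needs root, so run it as root when checking daemons such as `tailscaled`. A short-lived child that exits between the two reads simply yields `None`.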

## Root Cause

The stock alarm does `lookup: max -10s` over **every PID in the app group**.
App groups whose processes fork short-lived children (tailscaled spawns
route/DNS helpers and bash hooks; `bash` children inherit the systemd default
soft limit of 1024) trip a false 100%: apps.plugin's per-PID FD-limit read
**races on transient/just-forked PIDs**, and because the group lookup uses
`max`, a single bad 10-second sample pegs the entire group to ~100%. The
signal carries no usable information for any forking/root app group.

A `lookup: average -5m` does **not** rescue it: the bogus reading sits at
~100% for sustained multi-minute stretches, so the 5-minute rolling average
itself still reaches 100.0% (empirically verified on MajorToot).
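
The arithmetic behind both lookups can be illustrated with a toy simulation. This is a sketch under stated assumptions: one sample per second, made-up values (0.5% baseline, 100% bogus readings), and a hypothetical `rolling` helper. It does not reproduce Netdata's query engine or the data measured on MajorToot.

```python
def rolling(samples, window, reducer):
    """Apply `reducer` to the trailing `window` samples at each position."""
    return [
        reducer(samples[max(0, i - window + 1): i + 1])
        for i in range(len(samples))
    ]


def mean(values):
    return sum(values) / len(values)


# Illustrative series, one value per second (assumed, not measured).
isolated_spike = [0.5] * 20 + [100.0] + [0.5] * 20
plateau = [0.5] * 60 + [100.0] * 360 + [0.5] * 60  # stuck at 100 for 6 minutes

max_10s = rolling(isolated_spike, 10, max)   # analogue of `lookup: max -10s`
avg_5m = rolling(plateau, 300, mean)         # analogue of `lookup: average -5m`

print("peak of max -10s:", max(max_10s))
print("seconds held at 100 by one bad sample:", max_10s.count(100.0))
print("peak of average -5m:", max(avg_5m))
```

With `max`, one bad sample dominates its whole window. With a mean, any plateau longer than the window still saturates it, which is why switching the lookup does not help when the bogus reading persists for minutes.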

## The Fix

Silence this template fleet-wide and keep the reliable system-wide FD alarm.

- **Codified in Ansible** (do not hand-edit hosts): `MajorAnsible/netdata.yml`
  ships `templates/health_apps_fds_group.conf.j2` to
  `/etc/netdata/health.d/apps_fds_group_override.conf` and reloads via
  `netdatacli reload-health`.
- The override redefines `apps_group_file_descriptors_utilization` with
  `to: silent`. Netdata loads `/etc/netdata/health.d/` *after* the stock
  `conf.d` dir, so a same-name template deterministically supersedes the stock
  one (same mechanism as the manual `tcp_resets.conf` override, 2026-04-30).
- **Safety net retained:** the companion stock template
  `system_file_descriptors_utilization` (on `system.file_nr_utilization`,
  `crit > 90`, `to: sysadmin`) is untouched and still catches genuine
  system-wide FD exhaustion regardless of app grouping.
- The reload handler is restart-tolerant (`retries`/`until` + `failed_when`
  ignoring a `netdata.pipe` socket-absent error) because on hosts where the
  notify-config also drifts, `Restart Netdata` and `Reload Netdata health`
  can race during the ~5s restart window.
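
The restart-tolerant reload described in the last bullet can be sketched outside Ansible as a plain retry loop. This is a Python illustration of the logic only, not the handler in `netdata.yml`: the retry count, the delay, the helper names, and the substring match on `netdata.pipe` in the command output are all assumptions.

```python
import subprocess
import time


def run_netdatacli():
    """Run `netdatacli reload-health`; return (exit code, combined output)."""
    proc = subprocess.run(
        ["netdatacli", "reload-health"],
        capture_output=True, text=True,
    )
    return proc.returncode, proc.stdout + proc.stderr


def reload_health(run=run_netdatacli, retries=5, delay=2.0, sleep=time.sleep):
    """Retry the reload; tolerate a missing control socket, fail on anything else.

    Returns True once a reload succeeds, or False if the socket was still
    absent after every attempt (treated as non-fatal: a restart is under way).
    """
    for attempt in range(1, retries + 1):
        rc, output = run()
        if rc == 0:
            return True
        if "netdata.pipe" not in output:  # assumed marker of the socket-absent error
            raise RuntimeError(f"reload-health failed: {output.strip()}")
        if attempt < retries:
            sleep(delay)
    return False
```

Treating a persistently absent socket as non-fatal mirrors the `failed_when` exception: if Netdata is mid-restart, it reads the new health configuration on startup anyway.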

## Verification

```bash
ssh <host> 'curl -s "http://localhost:19999/api/v1/alarms?all=true" \
  | python3 -c "import sys,json;A=json.load(sys.stdin)[\"alarms\"]; \
  print(A[\"app.tailscaled_fds_open_limit.apps_group_file_descriptors_utilization\"][\"recipient\"])"'
# expect: silent
```

After the fix the alarm still shows `status=WARNING` in the dashboard
(cosmetic: silencing suppresses the *notification*, not the computed state);
`recipient=silent` confirms no more emails. The system-wide alarm should read
`CLEAR recipient=sysadmin`.
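
To check every app group on a host at once rather than one alarm key, the same API response can be filtered by template name. This is a minimal sketch: the helper names are hypothetical, and it assumes each alarm entry carries `status` and `recipient` fields and that alarm keys end in `.<template name>`, as the `tailscaled` key above does.

```python
import json
from urllib.request import urlopen


def fetch_alarms(base_url="http://localhost:19999"):
    """Fetch the full alarm list from a local Netdata agent."""
    with urlopen(f"{base_url}/api/v1/alarms?all=true", timeout=10) as resp:
        return json.load(resp)


def alarms_from_template(payload, template_name):
    """Map alarm key -> (status, recipient) for alarms created from one template."""
    suffix = "." + template_name
    return {
        key: (alarm.get("status"), alarm.get("recipient"))
        for key, alarm in payload.get("alarms", {}).items()
        if key.endswith(suffix)
    }


def still_notifying(payload, template_name):
    """Return keys of matching alarms whose recipient is not 'silent'."""
    found = alarms_from_template(payload, template_name)
    return sorted(k for k, (_, recipient) in found.items() if recipient != "silent")
```

On a host, `still_notifying(fetch_alarms(), "apps_group_file_descriptors_utilization")` should return an empty list once the override is loaded.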

## Notes

- Silenced fleet-wide on all 10 servers 2026-05-15 (the workstations
  majorrig/majormac were asleep, which is irrelevant because they are not
  fleet servers).
- Any future host running a forking/root daemon in a named app group would
  hit the same false positive, so the silencing is fleet-wide and pre-emptive.
- **Follow-up debt:** the manual `/etc/netdata/health.d/tcp_resets.conf`
  override on MajorToot (2026-04-30) is still **not codified in
  `netdata.yml`**, a per-host divergence the fleet play does not manage.
  Worth folding into Ansible the same way.

## Related

- [[clamscan-cpu-spike-nice-ionice]]
- [[netdata-web-log-successful-redirect-heavy-tuning]]
- Server doc: `30-Areas/MajorInfrastructure/Servers/majortoot.md` (incident
  2026-05-15)
- Playbook: `MajorAnsible/netdata.yml` +
  `templates/health_apps_fds_group.conf.j2`