---
title: Netdata apps-group FD-utilisation false 100% (silenced fleet-wide)
domain: troubleshooting
category: security
tags:
  - netdata
  - apps.plugin
  - file-descriptors
  - tailscale
  - false-positive
  - ansible
  - fleet
status: published
created: 2026-05-15
updated: 2026-05-15T02:40
---

# Netdata apps-group FD-utilisation false 100%

The Netdata stock alarm **`apps_group_file_descriptors_utilization`** (from `/usr/lib/netdata/conf.d/health.d/file_descriptors.conf`) sends `Raised to Warning — App group file descriptors utilization = 100%` emails for application groups that are perfectly healthy. It first fired on **MajorToot** (the `tailscaled` app group) on 2026-05-15.

## The Problem

A Netdata email arrives: *"App group tailscaled file descriptors utilization = 100% on MajorToot"*. The process is fine. On the host:

```
PID 1047     tailscaled (daemon)    fds=35  soft_limit=524287  util=0.01%
PID 1984541  tailscaled (child)     fds=10  soft_limit=524287  util=0.00%
PID 1984548  bash (tailscale hook)  fds=5   soft_limit=1024    util=0.49%
```

No PID exceeds **0.5%**, yet `app.fds_open_limit` reads ~100%. Over 1h the raw chart was min 0 / **mean 36.7** / max 100, with sustained multi-minute 100% plateaus (not isolated spikes).

> This is **not** an `apps.plugin` privilege problem. `apps.plugin` already has
> `cap_dac_read_search,cap_sys_ptrace`, and `sudo -u netdata cat /proc/<pid>/limits`
> succeeds. Verify this before "fixing" privileges; changing them is a no-op.

## Root Cause

The stock alarm does `lookup: max -10s` over **every PID in the app group**. App groups whose processes fork short-lived children trip a false 100%. `tailscaled` is one such group: it spawns route/DNS helpers and bash hooks, and the `bash` children inherit the systemd default soft limit of 1024. apps.plugin's per-PID FD-limit read **races on transient/just-forked PIDs**, and because the group lookup uses `max`, a single bad 10-second sample pegs the entire group to ~100%. The signal carries no usable information for any forking/root app group.
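
The effect of the `max` aggregation can be sketched in a few lines of Python. This is an illustration only, not apps.plugin's actual code: the per-PID figures are copied from the host output above, and the extra `100.0` sample is a hypothetical stand-in for one misread transient PID.

```python
# Per-PID figures copied from the host output above: (name, open fds, soft limit).
pids = [
    ("tailscaled (daemon)", 35, 524287),
    ("tailscaled (child)", 10, 524287),
    ("bash (tailscale hook)", 5, 1024),
]


def utilisation(fds, soft_limit):
    """Open FDs as a percentage of the soft limit."""
    return 100.0 * fds / soft_limit


def group_max(samples):
    """Mimic a max-style group aggregation: the worst PID wins."""
    return max(samples)


healthy = [utilisation(fds, limit) for _, fds, limit in pids]
print(f"healthy group max: {group_max(healthy):.2f}%")

# Hypothetical: one just-forked PID is misread as fully utilised.
racy = healthy + [100.0]
print(f"group max with one bad sample: {group_max(racy):.2f}%")
```

With the real figures the group maximum is the bash hook's 0.49%; adding a single bogus reading moves it straight to 100.00%, no matter how healthy every other PID is.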

A `lookup: average -5m` does **not** rescue it. The bogus reading sits at ~100% for sustained multi-minute stretches, so the 5-minute rolling average itself still reaches 100.0% (empirically verified on MajorToot).

## The Fix

Silence this template fleet-wide and keep the reliable system-wide FD alarm.

- **Codified in Ansible** (do not hand-edit hosts): `MajorAnsible/netdata.yml` ships `templates/health_apps_fds_group.conf.j2` to `/etc/netdata/health.d/apps_fds_group_override.conf` and reloads via `netdatacli reload-health`.
- The override redefines `apps_group_file_descriptors_utilization` with `to: silent`. Netdata loads `/etc/netdata/health.d/` *after* the stock `conf.d` dir, so a same-name template deterministically supersedes the stock one (same mechanism as the manual `tcp_resets.conf` override, 2026-04-30).
- **Safety net retained:** the companion stock template `system_file_descriptors_utilization` (on `system.file_nr_utilization`, `crit > 90`, `to: sysadmin`) is untouched and still catches genuine system-wide FD exhaustion regardless of app grouping.
- The reload handler is restart-tolerant (`retries`/`until` plus a `failed_when` that ignores a `netdata.pipe` socket-absent error), because on hosts where the notify-config also drifts, `Restart Netdata` and `Reload Netdata health` can race during the ~5s restart window.

## Verification

```bash
ssh <host> 'curl -s "http://localhost:19999/api/v1/alarms?all=true" \
  | python3 -c "import sys,json;A=json.load(sys.stdin)[\"alarms\"]; \
print(A[\"app.tailscaled_fds_open_limit.apps_group_file_descriptors_utilization\"][\"recipient\"])"'
# expect: silent
```

After the fix the alarm still shows `status=WARNING` in the dashboard. This is cosmetic: silencing suppresses the *notification*, not the computed state. `recipient=silent` confirms that no more emails will be sent. The system-wide alarm should read `CLEAR recipient=sysadmin`.
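
The one-liner above checks a single alarm key. To check every app group on a host at once, a small filter over the same `alarms` dictionary may help. The sketch below is a suggestion, not part of the playbook: `not_silenced` is a hypothetical helper, and the `sample` payload (including the `app.example_...` and `system.file_nr_utilization...` keys) is made up for illustration. On a real host you would pass in the `alarms` object from the API response instead.

```python
TEMPLATE = "apps_group_file_descriptors_utilization"


def not_silenced(alarms, template=TEMPLATE):
    """Return the sorted keys of `template` alarms whose recipient is not 'silent'."""
    suffix = "." + template
    return sorted(
        key
        for key, alarm in alarms.items()
        if key.endswith(suffix) and alarm.get("recipient") != "silent"
    )


# Hypothetical sample payload for illustration only; on a real host, pass
# json.load(...)["alarms"] from the /api/v1/alarms?all=true response instead.
sample = {
    "app.tailscaled_fds_open_limit.apps_group_file_descriptors_utilization": {
        "status": "WARNING", "recipient": "silent"},
    "app.example_fds_open_limit.apps_group_file_descriptors_utilization": {
        "status": "CLEAR", "recipient": "sysadmin"},
    "system.file_nr_utilization.system_file_descriptors_utilization": {
        "status": "CLEAR", "recipient": "sysadmin"},
}

print(not_silenced(sample))
```

An empty list means every app-group FD alarm on that host is silenced. The system-wide alarm is deliberately ignored by the filter, since it is supposed to keep `recipient=sysadmin`.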

## Notes

- Silenced fleet-wide on all 10 servers on 2026-05-15. The workstations majorrig/majormac were asleep at the time, which is irrelevant because they are not fleet servers.
- Any future host running a forking/root daemon in a named app group would hit the same false positive, so the silencing is fleet-wide and pre-emptive.
- **Follow-up debt:** the manual `/etc/netdata/health.d/tcp_resets.conf` override on MajorToot (2026-04-30) is still **not codified in `netdata.yml`**. It is a per-host divergence the fleet play does not manage, and is worth folding into Ansible the same way.

## Related

- [[clamscan-cpu-spike-nice-ionice]]
- [[netdata-web-log-successful-redirect-heavy-tuning]]
- Server doc: `30-Areas/MajorInfrastructure/Servers/majortoot.md` (incident 2026-05-15)
- Playbook: `MajorAnsible/netdata.yml` + `templates/health_apps_fds_group.conf.j2`