diff --git a/02-selfhosting/monitoring/netdata-new-server-setup.md b/02-selfhosting/monitoring/netdata-new-server-setup.md
index c086f6d..c91d6f9 100644
--- a/02-selfhosting/monitoring/netdata-new-server-setup.md
+++ b/02-selfhosting/monitoring/netdata-new-server-setup.md
@@ -2,17 +2,31 @@
 title: "Deploying Netdata to a New Server"
 domain: selfhosting
 category: monitoring
-tags: [netdata, monitoring, email, notifications, netdata-cloud, ubuntu, debian]
+tags: [netdata, monitoring, email, notifications, netdata-cloud, ubuntu, debian, n8n]
 status: published
 created: 2026-03-18
-updated: 2026-03-18
+updated: 2026-03-22
 ---
 
 # Deploying Netdata to a New Server
 
-This covers the full Netdata setup for a new server in the fleet: install, email notification config, and Netdata Cloud claim. Applies to Ubuntu/Debian servers.
+This covers the full Netdata setup for a new server in the fleet: install, email notification config, n8n webhook integration, and Netdata Cloud claim. Applies to Ubuntu/Debian servers.
 
-## 1. Install
+## 1. Install Prerequisites
+
+Install `jq` before anything else. It is required by the `custom_sender()` function in `health_alarm_notify.conf` to build the JSON payload sent to the n8n webhook. **If `jq` is missing, the webhook will fire with an empty body and n8n alert emails will have no information in them.**
+
+```bash
+apt install -y jq
+```
+
+Verify:
+
+```bash
+jq --version
+```
+
+## 2. Install Netdata
 
 Use the official kickstart script:
 
@@ -28,7 +42,7 @@ systemctl is-active netdata
 curl -s http://localhost:19999/api/v1/info | python3 -c "import sys,json; d=json.load(sys.stdin); print('Netdata', d['version'])"
 ```
 
-## 2. Configure Email Notifications
+## 3. Configure Email Notifications
 
 Copy the default config and set the three required values:
 
@@ -64,7 +78,23 @@ You should see three `# OK` lines (WARNING → CRITICAL → CLEAR test cycle) an
 > [!note] Delivery via local Postfix
 > Email is relayed through the server's local Postfix instance. Ensure Postfix is installed and `/usr/sbin/sendmail` resolves.
 
-## 3. Claim to Netdata Cloud
+## 4. Configure n8n Webhook Notifications
+
+Copy the `health_alarm_notify.conf` from an existing server (e.g. majormail) which contains the `custom_sender()` function. This sends enriched JSON payloads to the n8n webhook at `https://n8n.majorshouse.com/webhook/netdata-alert`.
+
+> [!warning] jq required
+> The `custom_sender()` function uses `jq` to build the JSON payload. If `jq` is not installed, `payload` will be empty, curl will send `Content-Length: 0`, and n8n will produce alert emails with `Host: unknown`, blank alert/value fields, and `Status: UNKNOWN`. Always install `jq` first (Step 1).
+
+After deploying the config, run a test to confirm the webhook fires correctly:
+
+```bash
+systemctl restart netdata
+/usr/libexec/netdata/plugins.d/alarm-notify.sh test 2>&1 | grep -E '(custom|n8n|OK|FAILED)'
+```
+
+Verify in n8n that the latest execution shows a non-empty body with `hostname`, `alarm`, and `status` fields populated.
+
+## 5. Claim to Netdata Cloud
 
 Get the claim command from **Netdata Cloud → Space Settings → Nodes → Add Nodes**. It will look like:
 
@@ -84,7 +114,7 @@ cat /var/lib/netdata/cloud.d/claimed_id
 
 A UUID will be present if claimed successfully. The node should appear in Netdata Cloud within ~60 seconds.
 
-## 4. Verify Alerts
+## 6. Verify Alerts
 
 Check that no unexpected alerts are active after setup:
 
@@ -111,6 +141,20 @@ for host in majorlab majorhome majormail majordiscord majortoot majorlinux tttpo
 done
 ```
 
+## Fleet-wide jq Audit
+
+To check that all servers with `custom_sender` have `jq` installed:
+
+```bash
+for host in majorlab majorhome majormail majordiscord majortoot majorlinux tttpod dca teelia; do
+  echo -n "=== $host: "
+  ssh -o ConnectTimeout=5 root@$host \
+    'has_cs=$(grep -l "custom_sender\|n8n.majorshouse.com" /etc/netdata/health_alarm_notify.conf 2>/dev/null | wc -l); has_jq=$(command -v jq >/dev/null 2>&1 && echo yes || echo NO); echo "custom_sender=$has_cs jq=$has_jq"'
+done
+```
+
+Any server showing `custom_sender=1 jq=NO` needs `apt install -y jq` immediately.
+
 ## Related
 
 - [Tuning Netdata Web Log Alerts](tuning-netdata-web-log-alerts.md)
diff --git a/05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md b/05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md
new file mode 100644
index 0000000..a0480f0
--- /dev/null
+++ b/05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md
@@ -0,0 +1,73 @@
+# ClamAV Safe Scheduling on Live Servers
+
+Running `clamscan` unthrottled on a live server will peg CPU until completion. On a small VPS (1 vCPU), a full recursive scan can sustain 70–100% CPU for an hour or more, degrading or taking down hosted services.
+
+## The Problem
+
+A common out-of-the-box ClamAV cron setup looks like this:
+
+```cron
+0 1 * * 0 clamscan --infected --recursive / --exclude=/sys
+```
+
+This runs at Linux's default scheduling priority (`nice 0`) with normal I/O priority. On a live server it will:
+
+- Monopolize the CPU for the scan duration
+- Cause high I/O wait, degrading web serving, databases, and other services
+- Trigger monitoring alerts (e.g., Netdata `10min_cpu_usage`)
+
+## The Fix
+
+Throttle the scan with `nice` and `ionice`:
+
+```cron
+0 1 * * 0 nice -n 19 ionice -c 3 clamscan --infected --recursive / --exclude=/sys
+```
+
+| Flag | Meaning |
+|------|---------|
+| `nice -n 19` | Lowest CPU scheduling priority (range: -20 to 19) |
+| `ionice -c 3` | Idle I/O class: only uses disk when no other process needs it |
+
+The scan will take longer, but it yields CPU and disk to other workloads instead of starving them.
+
+## Applying the Fix
+
+Edit root's crontab:
+
+```bash
+crontab -e
+```
+
+Or apply non-interactively:
+
+```bash
+crontab -l | sed 's|clamscan|nice -n 19 ionice -c 3 clamscan|' | crontab -
+```
+
+Verify:
+
+```bash
+crontab -l | grep clam
+```
+
+## Diagnosing a Runaway Scan
+
+If CPU is already pegged, identify and kill the process:
+
+```bash
+ps aux --sort=-%cpu | head -15
+# Look for clamscan
+kill <PID>
+```
+
+## Notes
+
+- `ionice -c 3` (Idle) requires Linux kernel ≥ 2.6.13 and the CFQ or BFQ I/O scheduler (check `/sys/block/<device>/queue/scheduler`). Works on most Ubuntu/Debian/Fedora systems.
+- On multi-core servers, consider also using `cpulimit` for a hard cap: `cpulimit -l 30 -- clamscan ...`
+- Always keep `--exclude=/sys` (and optionally `--exclude=/proc`, `--exclude=/dev`) to avoid scanning virtual filesystems.
+
+## Related
+
+- [ClamAV Documentation](https://docs.clamav.net/)
+- [[02-selfhosting/security/linux-server-hardening-checklist|Linux Server Hardening Checklist]]
diff --git a/SUMMARY.md b/SUMMARY.md
index 5ccb815..2074268 100644
--- a/SUMMARY.md
+++ b/SUMMARY.md
@@ -53,3 +53,4 @@
   * [mdadm RAID Recovery After USB Hub Disconnect](05-troubleshooting/storage/mdadm-usb-hub-disconnect-recovery.md)
   * [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md)
   * [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md)
+  * [ClamAV CPU Spike: Safe Scheduling with nice/ionice](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md)
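
Note on the non-interactive crontab rewrite in `clamscan-cpu-spike-nice-ionice.md`: the `sed` one-liner prefixes every line that mentions `clamscan`, so running it a second time would stack the `nice`/`ionice` prefix, and commented-out entries get rewritten too. A minimal sketch of a re-runnable variant follows; the function name `throttle_clamscan` is hypothetical and not part of this change, and it assumes the cron entries call `clamscan` directly rather than through a wrapper script.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the original change): read crontab lines on
# stdin and prefix clamscan entries with nice/ionice, exactly once.
throttle_clamscan() {
  sed \
    -e '/^[[:space:]]*#/b' \
    -e '/nice -n 19 ionice -c 3 clamscan/b' \
    -e 's|clamscan|nice -n 19 ionice -c 3 clamscan|'
  # 1st expression: leave comment lines untouched.
  # 2nd expression: leave already-throttled entries untouched.
  # 3rd expression: prefix the first clamscan on any remaining line.
}

# Preview the rewritten crontab before installing it:
#   crontab -l | throttle_clamscan
# Install it once the preview looks right:
#   crontab -l | throttle_clamscan | crontab -
```

Previewing before piping into `crontab -` keeps a bad substitution from replacing root's crontab.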