ClamAV fleet article: add DigitalOcean monitoring caveat for 1vCPU droplets

DO's hypervisor-level CPU metric doesn't know about nice/ionice — a
"polite" weekly clamscan on a 1 vCPU droplet still reads 100% utilization
and trips a default >85%/5m alert. Adds a new section explaining the
trade-off and providing the DO API recipe (PUT existing alert with
explicit entities, POST a new relaxed alert scoped to the small
droplet) plus when not to bother (2+ vCPU boxes won't trip).

Triggered by the 2026-05-10 teelia incident where the weekly cron fired
the fleet-wide CPU alert despite the cron script already wrapping
clamscan in nice 19 + ionice idle + cgroup memory limits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Marcus Summers 2026-05-10 02:24:17 -04:00
parent 545df9f5c6
commit af14e36caf
2 changed files with 73 additions and 1 deletion


@@ -11,7 +11,7 @@ tags:
- cron
status: published
created: 2026-04-18
updated: 2026-04-30T05:21
updated: 2026-05-10T01:50
---
# ClamAV Fleet Deployment with Ansible
@@ -147,6 +147,77 @@ clamscan /tmp/eicar-test.txt
rm /tmp/eicar-test.txt
```
## DigitalOcean Monitoring Caveat (1 vCPU droplets)
`nice -n 19 ionice -c 3` plus `MemoryMax`/`MemorySwapMax` cgroup limits make clamscan "polite" to the Linux scheduler. It gets the smallest possible CPU share whenever PHP-FPM, MySQL, etc. want to run, and its disk I/O is served only when nothing else needs the disk. **But hypervisor-level CPU monitoring (DigitalOcean, Linode, Hetzner) doesn't know about niceness.** It sees raw CPU utilization, and niced CPU time counts as busy like any other. On a 1 vCPU droplet during quiet hours, a single-threaded clamscan can fill 100% of the vCPU on its own and trip a default `>85%/5m` CPU alert every week, even though the throttling is genuinely protecting real traffic.
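Niced time is not hidden from in-guest accounting either. The kernel's `/proc/stat` keeps a separate `nice` column, but a utilization figure derived from it normally counts that column as busy. The function below is a minimal sketch of that calculation. It assumes the field order documented in proc(5). `cpu_busy_pct` is a hypothetical helper, not DigitalOcean's actual collector.

```bash
#!/usr/bin/env bash
# cpu_busy_pct: hypothetical helper (not part of any DO tooling).
# Takes two samples of the aggregate "cpu" line from /proc/stat and prints
# integer busy% over the interval. Fields per proc(5):
#   cpu user nice system idle iowait irq softirq steal guest guest_nice
# Only idle and iowait count as "not busy"; nice (where a nice-19 clamscan
# lands) is summed into busy exactly like user time.
cpu_busy_pct() {
  local -a a b
  read -r -a a <<< "$1"
  read -r -a b <<< "$2"
  local total=0 idle i
  # Index 0 is the "cpu" label; sum user..steal (indices 1-8).
  # guest/guest_nice are skipped because the kernel already folds them into user/nice.
  for i in 1 2 3 4 5 6 7 8; do
    total=$(( total + b[i] - a[i] ))
  done
  idle=$(( (b[4] - a[4]) + (b[5] - a[5]) ))
  if (( total <= 0 )); then
    echo 0
    return
  fi
  echo $(( 100 * (total - idle) / total ))
}

# Example: sample the real counters five seconds apart.
#   s1=$(grep '^cpu ' /proc/stat); sleep 5; s2=$(grep '^cpu ' /proc/stat)
#   cpu_busy_pct "$s1" "$s2"
```

Consider synthetic counters where every busy tick in the interval is `nice` time: `cpu 100 0 50 850 0 0 0 0 0 0`, then `cpu 100 900 50 950 0 0 0 0 0 0`. For that pair the function reports 90. A load that is entirely nice-19 still reads as 90% utilization.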
**Symptoms:**
- An `[ALERT] CPU is running high` email from DO at the same time on the same day every week
- The alert clears within 10–60 min (when the scan finishes)
- No actual user-visible service degradation
- Netdata shows CPU at 80–100% but PHP-FPM/MySQL response times barely move
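To confirm the diagnosis while the alert is firing, look for a process that is both heavily niced and near the top of the CPU list. The sketch below assumes procps `ps`; in the `ni=,pcpu=,comm=` form, the trailing `=` suppresses the header line. `niced_hogs` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# niced_hogs MIN_PCT: hypothetical helper. Reads "NI %CPU COMMAND" lines on
# stdin (e.g. from `ps -eo ni=,pcpu=,comm=`) and prints processes running at
# nice >= 10 whose %CPU is at least MIN_PCT (default 50).
niced_hogs() {
  local min_pct=${1:-50} ni pcpu comm
  local int_re='^-?[0-9]+$' num_re='^[0-9]+(\.[0-9]+)?$'
  while read -r ni pcpu comm; do
    # ps prints "-" in the NI column for real-time tasks; skip those and any
    # line whose %CPU field is not a plain number.
    [[ $ni =~ $int_re && $pcpu =~ $num_re ]] || continue
    # %CPU is a decimal such as 98.7; compare on its integer part.
    if (( ni >= 10 && ${pcpu%%.*} >= min_pct )); then
      printf '%s (nice %s) at %s%%\n' "$comm" "$ni" "$pcpu"
    fi
  done
}

# Usage during the alert window:
#   ps -eo ni=,pcpu=,comm= | niced_hogs 50
```

`ps` reports `%CPU` as CPU time divided by the process's elapsed lifetime, not as an instantaneous reading. A scan that has been running flat out since it started will therefore show close to 100.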
**Fix: per-droplet alert scoping.** Two changes via the DO API:
1. **Scope the existing fleet-wide CPU alert to exclude affected 1 vCPU droplets** by setting `entities` to an explicit array of *all other* droplet IDs. Droplets created later are not covered by this alert until you add their IDs.
2. **Add a new alert scoped to just the affected droplet(s)** with a relaxed threshold:
   - `value: 95`
   - `window: "30m"`
   - `entities: [<droplet_id>]`
The relaxed threshold still catches runaway PHP loops, mining trojans, and actual sustained saturation — but ignores the weekly polite scan.
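Building the "all other droplets" array by hand is error-prone on a larger fleet. The sketch below assumes you feed it one droplet ID per line, for example from `GET /v2/droplets` piped through `jq -r '.droplets[].id'`. That endpoint pages its results, so raise `per_page` or follow `links` on big accounts. `build_entities` is a hypothetical helper that prints the JSON array of quoted IDs that `entities` expects.

```bash
#!/usr/bin/env bash
# build_entities EXCLUDED_ID...: hypothetical helper (bash 4+ for associative
# arrays). Reads droplet IDs, one per line, on stdin and prints a JSON array of
# the IDs *not* listed as arguments, quoted as strings for the "entities" field.
build_entities() {
  local -A skip=()
  local id out=""
  for id in "$@"; do
    skip["$id"]=1
  done
  while IFS= read -r id; do
    # Ignore blank lines and the excluded (1 vCPU) droplets.
    [[ -z $id || -n ${skip[$id]:-} ]] && continue
    out+="${out:+,}\"$id\""
  done
  printf '[%s]\n' "$out"
}

# Usage (SMALL_ID is the 1 vCPU droplet to exclude):
#   curl -sS -H "Authorization: Bearer $TOKEN" \
#     "https://api.digitalocean.com/v2/droplets?per_page=200" \
#     | jq -r '.droplets[].id' | build_entities "$SMALL_ID"
```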
### Apply via DO API
```bash
TOKEN="<your DigitalOcean PAT>"
# 1. Scope existing CPU alert (PUT requires the full alert spec)
curl -sS -X PUT \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"alerts": {"email": ["you@example.com"], "slack": []},
"compare": "GreaterThan",
"description": "CPU is running high (excludes 1vCPU clamscan boxes)",
"enabled": true,
"entities": ["<droplet_id_1>", "<droplet_id_2>"],
"tags": [],
"type": "v1/insights/droplet/cpu",
"value": 85,
"window": "5m"
}' \
"https://api.digitalocean.com/v2/monitoring/alerts/<existing_uuid>"
# 2. Create a relaxed alert for the small box
curl -sS -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"alerts": {"email": ["you@example.com"], "slack": []},
"compare": "GreaterThan",
"description": "<host> CPU sustained high (clamscan-aware)",
"enabled": true,
"entities": ["<small_droplet_id>"],
"tags": [],
"type": "v1/insights/droplet/cpu",
"value": 95,
"window": "30m"
}' \
"https://api.digitalocean.com/v2/monitoring/alerts"
```
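The two requests send the same nine-field alert spec, and only `description`, `value`, `window`, and `entities` differ. A small body builder keeps the PUT and POST bodies from drifting apart. The sketch below assumes the field set used in the curl calls above. `alert_spec` is hypothetical and does no JSON escaping, so keep descriptions free of quotes and backslashes.

```bash
#!/usr/bin/env bash
# alert_spec DESCRIPTION VALUE WINDOW ENTITIES_JSON EMAIL: hypothetical helper.
# Prints a CPU alert body with the same fields as the curl examples above.
# ENTITIES_JSON must already be a JSON array, e.g. '["123","456"]'.
# No escaping is done: DESCRIPTION and EMAIL must not contain " or \.
alert_spec() {
  local description=$1 value=$2 window=$3 entities=$4 email=$5
  printf '{"alerts":{"email":["%s"],"slack":[]},"compare":"GreaterThan",' "$email"
  printf '"description":"%s","enabled":true,"entities":%s,"tags":[],' \
    "$description" "$entities"
  printf '"type":"v1/insights/droplet/cpu","value":%s,"window":"%s"}\n' \
    "$value" "$window"
}

# Usage: POST the relaxed alert for the small droplet.
#   curl -sS -X POST \
#     -H "Authorization: Bearer $TOKEN" \
#     -H "Content-Type: application/json" \
#     -d "$(alert_spec "<host> CPU sustained high (clamscan-aware)" 95 30m \
#           '["<small_droplet_id>"]' you@example.com)" \
#     "https://api.digitalocean.com/v2/monitoring/alerts"
```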
To list current alerts (find UUIDs and current `entities`):
```bash
curl -sS -H "Authorization: Bearer $TOKEN" \
"https://api.digitalocean.com/v2/monitoring/alerts" | jq
```
**When *not* to do this:** clamscan is single-threaded, so on a droplet with 2+ vCPUs it tops out around 50% of total CPU (25% on 4 vCPUs), and you probably won't trip an 85% alert in the first place. The per-droplet exemption is mainly for 1 vCPU boxes.
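The 2+ vCPU rule of thumb is plain division. One single-threaded scanner can occupy at most one vCPU, so its best-case share of the whole droplet is 100 / vCPUs percent. The sketch below assumes the rest of the box is idle (real traffic adds on top) and that the alert compares with `GreaterThan`, as in the API calls above. Both helper names are hypothetical.

```bash
#!/usr/bin/env bash
# scan_peak_pct VCPUS: hypothetical helper. Peak whole-droplet CPU% that one
# single-threaded scanner can cause on an otherwise idle box (integer math).
scan_peak_pct() {
  echo $(( 100 / $1 ))
}

# scan_trips_alert VCPUS THRESHOLD: hypothetical helper. Succeeds (exit 0) if
# that peak alone exceeds a "GreaterThan THRESHOLD" CPU alert.
scan_trips_alert() {
  (( $(scan_peak_pct "$1") > $2 ))
}

# Usage on the droplet itself:
#   scan_trips_alert "$(nproc)" 85 && echo "consider a per-droplet alert"
```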
**Alternative considered: switch to `clamdscan`.** It hands files to a resident `clamd` daemon that keeps signatures loaded, so scans finish ~10× faster and each run needs much less CPU/RAM of its own. It's the better long-term answer, but it requires running `clamd` continuously. The memory cost on small boxes is ~250 MB resident, versus the cron approach, which only holds RAM during the scan. A trade-off, not strictly better.
## See Also
- [clamscan-cpu-spike-nice-ionice](../../05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md) — troubleshooting CPU spikes from unthrottled scans


@@ -217,6 +217,7 @@ updated: 2026-05-10T01:30
| Date | Article | Domain |
|---|---|---|
| 2026-05-10 | [ClamAV Fleet Deployment with Ansible](02-selfhosting/security/clamav-fleet-deployment.md) — added DigitalOcean monitoring caveat for 1vCPU droplets | Self-Hosting |
| 2026-05-10 | [Claude Desktop MCP Mass-Disconnect After Blocking SSH Reboot](05-troubleshooting/claude-desktop-mcp-mass-disconnect-blocking-reboot.md) | Troubleshooting |
| 2026-05-10 | [Castopod Posts Don't Appear on Mastodon — Diagnosing the Federation Path](05-troubleshooting/security/castopod-broadcast-not-on-mastodon.md) | Troubleshooting |
| 2026-05-08 | [Castopod: Stale Federated Avatar URLs After Remote Profile Updates](05-troubleshooting/security/castopod-stale-federated-avatar.md) | Troubleshooting |