| title | domain | category | tags | status | created | updated |
|---|---|---|---|---|---|---|
| ClamAV Fleet Deployment with Ansible | selfhosting | security | | published | 2026-04-18 | 2026-05-15T03:00 |
ClamAV Fleet Deployment with Ansible
Overview
ClamAV is the standard open-source antivirus for Linux servers. For internet-facing hosts, a weekly scan with fresh definitions catches known malware, web shells, and suspicious files before they cause damage. The key operational concern is CPU impact — an unthrottled clamscan will saturate a core for hours on a busy host. The solution is nice and ionice wrappers.
This guide covers deployment to internet-facing hosts. Internal-only hosts (storage, inference, gaming) are lower priority and can be skipped.
What Gets Deployed
- clamav + clamav-update packages (provides clamscan + freshclam)
- freshclam service enabled for automatic definition updates
- A quarantine directory at /var/lib/clamav/quarantine/
- A weekly clamscan cron job, niced to background priority
- SELinux context set on the quarantine directory (Fedora hosts)
Ansible Playbook
On the MajorsHouse fleet this is packaged as the clamav role (roles/clamav/, tasks split install → service → scan → verify) and run via clamav.yml or site.yml. The standalone playbook below is an illustrative equivalent.
- name: Deploy ClamAV to internet-facing hosts
  hosts: internet_facing  # dca, majorlinux, teelia, tttpod, majortoot, majormail
  become: true
  tasks:
    - name: Install ClamAV packages
      ansible.builtin.package:
        name:
          - clamav
          - clamav-update
        state: present

    - name: Enable and start freshclam
      ansible.builtin.service:
        name: clamav-freshclam
        enabled: true
        state: started

    - name: Create quarantine directory
      ansible.builtin.file:
        path: /var/lib/clamav/quarantine
        state: directory
        owner: root
        group: root
        mode: '0700'

    - name: Set SELinux context on quarantine dir (Fedora/RHEL)
      ansible.builtin.command:
        cmd: chcon -t var_t /var/lib/clamav/quarantine
      when: ansible_os_family == "RedHat"
      changed_when: false

    - name: Deploy weekly clamscan cron job
      ansible.builtin.cron:
        name: "Weekly ClamAV scan"
        user: root
        weekday: "0"  # Sunday
        hour: "3"
        minute: "0"
        job: >-
          nice -n 19 ionice -c 3
          clamscan -r /
          --exclude-dir=^/proc
          --exclude-dir=^/sys
          --exclude-dir=^/dev
          --exclude-dir=^/run
          --move=/var/lib/clamav/quarantine
          --log=/var/log/clamav/scan.log
          --quiet
          2>&1 | logger -t clamscan
The nice/ionice Flags
Without throttling, clamscan -r / will peg a CPU core for 30–90 minutes depending on disk size and file count. On production hosts this causes Netdata alerts and visible service degradation.
| Flag | Value | Meaning |
|---|---|---|
| nice -n 19 | Lowest CPU priority | Kernel will preempt this process for anything else |
| ionice -c 3 | Idle I/O class | Disk I/O only runs when no other process needs the disk |
With both flags set, clamscan becomes essentially invisible under normal load. The scan takes longer (possibly 2–4× on busy disks), but this is acceptable for a weekly background job.
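To confirm that a running scan really picked up both settings, the priority can be read back from the process. The following is a minimal sketch, assuming procps `ps`, `pgrep`, and util-linux `ionice` are installed; the function names `is_polite` and `report_clamscan_priority` are hypothetical helpers, not part of the clamav role.

```shell
# is_polite NICE IOCLASS: succeed only for nice 19 plus the idle I/O class.
is_polite() {
    [ "$1" -eq 19 ] && [ "$2" = "idle" ]
}

# Print the scheduling state of every running clamscan process.
report_clamscan_priority() {
    for pid in $(pgrep -x clamscan); do
        nice_value=$(ps -o ni= -p "$pid" | tr -d ' ')
        # ionice prints "idle" for class 3, or e.g. "best-effort: prio 4";
        # keep only the class name before the colon.
        io_class=$(ionice -p "$pid" | cut -d: -f1)
        if is_polite "$nice_value" "$io_class"; then
            echo "pid $pid: throttled (nice=$nice_value, io=$io_class)"
        else
            echo "pid $pid: NOT throttled (nice=$nice_value, io=$io_class)"
        fi
    done
}
```

Run `report_clamscan_priority` while the weekly job is active; it prints nothing if no clamscan process exists.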
SELinux on Fedora/RHEL: ionice may trigger AVC denials under SELinux Enforcing. If scans silently fail on Fedora hosts, check ausearch -m avc -ts recent for clamscan denials. See selinux-fail2ban-execmem-fix for the pattern.
Excluded Paths
Always exclude virtual/pseudo filesystems — scanning them wastes time and can trigger false positives or kernel errors:
--exclude-dir=^/proc # Process info (not real files)
--exclude-dir=^/sys # Kernel interfaces
--exclude-dir=^/dev # Device nodes
--exclude-dir=^/run # Runtime tmpfs
You may also want to exclude large data directories (/var/lib/docker, backup volumes, media stores) if scan time is a concern. These are lower-risk targets anyway.
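When the exclusion list grows beyond the four pseudo filesystems, generating the flags from a list keeps the cron line readable. This is a small sketch; `build_excludes` is a hypothetical helper, and /var/lib/docker in the usage note is only one of the optional directories mentioned above.

```shell
# build_excludes DIR...: print one --exclude-dir flag per directory,
# anchored with ^ so the pattern only matches paths that start with DIR.
build_excludes() {
    flags=""
    for dir in "$@"; do
        flags="$flags --exclude-dir=^$dir"
    done
    # Strip the leading space left by the first append.
    printf '%s\n' "${flags# }"
}
```

Usage: `clamscan -r / $(build_excludes /proc /sys /dev /run /var/lib/docker)`. The command substitution is deliberately unquoted so each flag becomes its own argument.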
Quarantine vs Delete
--move=/var/lib/clamav/quarantine moves detected files rather than deleting them. This is safer than --remove — you can inspect and restore false positives. Review the quarantine directory periodically:
ls -la /var/lib/clamav/quarantine/
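If a file is a confirmed false positive, restore it and add the name of the signature that matched to /etc/clamav/whitelist.ign2. An .ign2 file lists signature names rather than file paths, and ClamAV loads it only from its database directory, so confirm that this path is one your installation actually reads.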
Checking Scan Results
# View last scan log
cat /var/log/clamav/scan.log
# Summary line from the log
grep -E "^Infected|^Scanned" /var/log/clamav/scan.log | tail -5
# Check freshclam is keeping definitions current
systemctl status clamav-freshclam
freshclam --version
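For unattended checks, the same log can be reduced to a single number. A minimal sketch, assuming the log contains clamscan's usual "Infected files: N" summary line and "path: Signature FOUND" detection lines; `infected_count` and `list_detections` are hypothetical helper names.

```shell
# infected_count LOGFILE: print the count from the most recent
# "Infected files: N" summary line in the log.
infected_count() {
    grep '^Infected files:' "$1" | tail -n 1 | awk '{print $3}'
}

# list_detections LOGFILE: print every "path: Signature FOUND" line.
list_detections() {
    grep ' FOUND$' "$1"
}
```

Usage: `[ "$(infected_count /var/log/clamav/scan.log)" = "0" ] || list_detections /var/log/clamav/scan.log`.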
Verifying Deployment
Test that ClamAV can detect malware using the EICAR test file (a harmless string that all AV tools recognize as test malware):
echo 'X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*' \
> /tmp/eicar-test.txt
clamscan /tmp/eicar-test.txt
# Expected: /tmp/eicar-test.txt: Eicar-Signature FOUND
rm /tmp/eicar-test.txt
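The EICAR check can also be scripted by reading clamscan's exit status instead of its output: 0 means nothing was found, 1 means at least one detection, and 2 means an error occurred. A small sketch; `describe_clamscan_rc` is a hypothetical helper name.

```shell
# describe_clamscan_rc CODE: translate a clamscan exit status into a word.
describe_clamscan_rc() {
    case "$1" in
        0) echo "clean" ;;
        1) echo "infected" ;;
        *) echo "error" ;;
    esac
}
```

Usage: `clamscan --no-summary /tmp/eicar-test.txt; describe_clamscan_rc $?` should report `infected` for the test file on a working install.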
DigitalOcean Monitoring Caveat (1 vCPU droplets)
nice -n 19 ionice -c 3 plus MemoryMax/MemorySwapMax cgroups make clamscan "polite" to the Linux scheduler: it yields to PHP-FPM, MySQL, etc. instantly. But hypervisor-level CPU monitoring (DigitalOcean, Linode, Hetzner) doesn't know about niceness. It sees raw CPU utilization. On a 1 vCPU droplet during quiet hours, a single-threaded clamscan can fill 100% of the vCPU on its own, tripping a default >85%/5m CPU alert every week, even though the scan is genuinely yielding to real traffic.
Symptoms:
- Weekly [ALERT] CPU is running high email from DO at the same time/day every week
- The alert clears within 10–60 min (when scan finishes)
- No actual user-visible service degradation
- Netdata shows CPU 80–100% but PHP-FPM/MySQL response times barely move
Fix: per-droplet alert scoping. Two changes via the DO API:
- Scope the existing fleet-wide CPU alert to exclude affected 1 vCPU droplets by setting entities to an explicit array of all other droplet IDs.
- Add a new alert scoped to just the affected droplet(s) with a relaxed threshold:
  - value: 95
  - window: "30m"
  - entities: [<droplet_id>]
The relaxed threshold still catches runaway PHP loops, mining trojans, and actual sustained saturation — but ignores the weekly polite scan.
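The first change needs the ID of every droplet except the small ones, written as a JSON array of strings. A minimal sketch of building that array in plain shell; `entities_excluding` is a hypothetical helper, and the IDs in the usage note are placeholders.

```shell
# entities_excluding "ALL_IDS" "SKIP_IDS": print a JSON array holding the
# IDs from ALL_IDS (space-separated) that do not appear in SKIP_IDS.
entities_excluding() {
    out=""
    for id in $1; do
        # Pad both sides with spaces so "10" never matches inside "102".
        case " $2 " in
            *" $id "*) continue ;;
        esac
        out="$out\"$id\","
    done
    # Drop the trailing comma and wrap the list in brackets.
    printf '%s\n' "[${out%,}]"
}
```

Usage: `entities_excluding "101 102 103" "102"` prints `["101","103"]`, which can be pasted into the `entities` field of the fleet-wide alert.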
Apply via DO API
TOKEN="<your DigitalOcean PAT>"

# 1. Scope existing CPU alert (PUT requires the full alert spec)
curl -sS -X PUT \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "alerts": {"email": ["you@example.com"], "slack": []},
    "compare": "GreaterThan",
    "description": "CPU is running high (excludes 1vCPU clamscan boxes)",
    "enabled": true,
    "entities": ["<droplet_id_1>", "<droplet_id_2>"],
    "tags": [],
    "type": "v1/insights/droplet/cpu",
    "value": 85,
    "window": "5m"
  }' \
  "https://api.digitalocean.com/v2/monitoring/alerts/<existing_uuid>"

# 2. Create a relaxed alert for the small box
curl -sS -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "alerts": {"email": ["you@example.com"], "slack": []},
    "compare": "GreaterThan",
    "description": "<host> CPU sustained high (clamscan-aware)",
    "enabled": true,
    "entities": ["<small_droplet_id>"],
    "tags": [],
    "type": "v1/insights/droplet/cpu",
    "value": 95,
    "window": "30m"
  }' \
  "https://api.digitalocean.com/v2/monitoring/alerts"
To list current alerts (find UUIDs and current entities):
curl -sS -H "Authorization: Bearer $TOKEN" \
"https://api.digitalocean.com/v2/monitoring/alerts" | jq
When not to do this: If your droplet has 2+ vCPUs and clamscan only consumes ~50% of total, you probably won't trip an 85% alert in the first place. The per-droplet exemption is mainly for 1 vCPU boxes.
When the per-droplet relaxed alert also trips (and what to do): On a 1 vCPU droplet during low-traffic hours (e.g., the default Sunday-morning weekly cron window), clamscan has nothing real to yield to — nice 19 only matters when something else wants the CPU. The kernel correctly schedules clamscan as nice/idle (iostat shows %nice ~94, %idle 0) but DO sees 100% - 0% idle = 100% CPU and trips even the 95%/30m threshold for the duration of the scan (~30–50 min on small webserver boxes). At that point the realistic options are:
- Accept the weekly page as expected noise — simplest, no further engineering
- Switch to clamdscan (daemon-backed): scans finish ~3–5× faster and fit in a 30m window, but clamd adds ~250 MB resident memory continuously
- Disable the per-droplet CPU alert entirely for that host and rely on Netdata for the real signal
The "polite CPU is invisible to DO" trick stops working once the box is small enough that the polite work fills the entire core unopposed. There is no DO threshold that distinguishes "polite scan filling idle CPU" from "runaway process pinning the vCPU" — that distinction lives in iostat's %nice vs %user split, which DO doesn't expose.
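The %nice share can be measured on the host itself without iostat by diffing two samples of the first line of /proc/stat. This is a sketch under that assumption; `nice_share` is a hypothetical helper, and it counts only the user through steal columns so guest time is not counted twice.

```shell
# nice_share BEFORE AFTER: given two "cpu ..." lines from /proc/stat, print
# the integer percentage of elapsed CPU time spent in niced user processes.
nice_share() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        # Fields 2-9 are user, nice, system, idle, iowait, irq, softirq, steal.
        NR == 1 { for (i = 2; i <= 9 && i <= NF; i++) before[i] = $i }
        NR == 2 {
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i - before[i]
            # Field 3 is the "nice" counter.
            if (total > 0) printf "%d\n", ($3 - before[3]) * 100 / total
        }'
}

# Example sampling window (run during a scan):
#   before=$(head -n 1 /proc/stat); sleep 5; after=$(head -n 1 /proc/stat)
#   nice_share "$before" "$after"
```

A high value during an alert suggests the polite scan is the one filling the core, which is the distinction the DO metric cannot make.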
Alternative considered: switch to clamdscan. It uses a resident clamd daemon, so signatures stay loaded and a scan finishes ~3–10× faster with much less CPU per scan. It is the better long-term answer where RAM allows, but it requires running clamd continuously. The memory cost on small boxes is hundreds of MB resident: ~250 MB was the working figure here, while the incidents recorded under Daemonless Mode measured 728–950 MB RSS. The cron approach only holds that RAM during the scan. Trade-off, not strictly better.
Daemonless Mode on Memory-Constrained Hosts
On hosts with ≤2 GB RAM, running clamd continuously is often counterproductive. The daemon loads its full signature database (~950 MB RSS) into memory and keeps it resident. On small VMs this crowds out MySQL, PHP-FPM, and other services — often pushing the whole system into swap rather than preventing anything.
Affected hosts (fleet history):
| Host | RAM | Incident | Resolution |
|---|---|---|---|
| teelia | 1.9 GB | 2026-04-27 — clamd 728 MB RSS, 94% RAM alert | daemonless |
| dcaprod | 3.8 GB | 2026-04-30 — clamd OOM thrash after 512M cgroup cap | daemonless |
| majorlinux | 2.0 GB | 2026-05-15 — clamd 980 MB swap, mysqld swapping 293 MB | daemonless |
The fix: clamav_use_daemon: false host_var
The clamav role supports a per-host override. Add to the host's host_vars/<hostname>/vars.yml:
clamav_use_daemon: false
Then re-run the role:
ansible-playbook clamav.yml --limit <hostname>
This will:
- Stop and disable clamav-daemon.service and clamav-daemon.socket
- Deploy the weekly scan template using clamscan (daemonless, loads DB per run)
- Leave clamav-freshclam active so definitions stay current
Trade-off: Each weekly scan loads the signature DB fresh (~950 MB peak RAM for the scan duration, then freed). The scan takes longer than clamdscan (~3–5× on a warm daemon), but this is acceptable for a weekly background job. The systemd-run MemoryMax cgroup wrapper in the scan template caps peak usage so the scan can't OOM the host.
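The role's scan template is not reproduced here, so the following is only a sketch of what a systemd-run memory cap can look like. It assumes cgroup v2 and a systemd new enough to support MemoryMax and MemorySwapMax; `run_capped`, the `DRY_RUN` switch, and the 1G limit in the usage note are illustrative assumptions, not the role's actual values.

```shell
# run_capped LIMIT COMMAND...: run COMMAND inside a transient systemd scope
# whose memory is capped at LIMIT (for example 1G) with swap disallowed.
# With DRY_RUN=1 the final command line is printed instead of executed.
run_capped() {
    limit="$1"
    shift
    # Prepend the systemd-run wrapper to the caller's command.
    set -- systemd-run --scope --quiet \
        -p "MemoryMax=$limit" -p MemorySwapMax=0 "$@"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Usage: `run_capped 1G nice -n 19 ionice -c 3 clamscan -r /srv`. If the scan exceeds the cap, the kernel kills the scan inside its scope rather than pushing the rest of the host into swap.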
Rule of thumb: Use daemon mode (clamav_use_daemon: true or unset) on hosts with ≥4 GB RAM where scan speed matters (mail servers, upload handlers). Use daemonless on webservers and small VMs where continuous memory residency is the bigger risk.
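The rule of thumb can be expressed as a quick check to run on a host before setting the variable. A minimal sketch: `recommend_mode` and `mem_total_mib` are hypothetical helpers, and the 4096 MiB default is an assumption standing in for "≥4 GB". A nominal 4 GB VM reports slightly less than that, so it lands on daemonless, which matches the dcaprod entry in the table above.

```shell
# recommend_mode MEM_MIB [THRESHOLD_MIB]: print "daemon" when the host has at
# least THRESHOLD_MIB of RAM (default 4096, an assumed cutoff), otherwise
# print "daemonless".
recommend_mode() {
    threshold="${2:-4096}"
    if [ "$1" -ge "$threshold" ]; then
        echo "daemon"
    else
        echo "daemonless"
    fi
}

# mem_total_mib: read MemTotal (reported in kB) from /proc/meminfo as MiB.
mem_total_mib() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo
}
```

Usage: `recommend_mode "$(mem_total_mib)"`. Treat the result as a starting point, since the rule also depends on whether scan speed matters for that host.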
See Also
- clamscan-cpu-spike-nice-ionice — troubleshooting CPU spikes from unthrottled scans
- linux-server-hardening-checklist
- ssh-hardening-ansible-fleet