MajorLinux a852f7b7bd ClamAV fleet caveat: add follow-up on the polite-CPU-on-1vCPU edge case

Same-day correction. The proposed per-droplet relaxed alert (>95%/30m)
turned out to also trip on a 1 vCPU box during low-traffic weekly scans,
because there's literally no real load for nice 19 to yield to —
clamscan opportunistically fills the vCPU and DO sees 100% utilization
regardless of `%nice` vs `%user` split. Documents the three realistic
options (accept page / switch to clamdscan / disable alert) and the
underlying limit (no DO threshold can distinguish polite from impolite
CPU when the box is fully utilized).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-10 02:32:35 -04:00



ClamAV Fleet Deployment with Ansible

Overview

ClamAV is the standard open-source antivirus for Linux servers. For internet-facing hosts, a weekly scan with fresh definitions catches known malware, web shells, and suspicious files before they cause damage. The key operational concern is CPU impact: an unthrottled clamscan can saturate a core for 30–90 minutes or longer, which is noticeable on a busy host. The fix is to wrap the scan in nice and ionice.

This guide covers deployment to internet-facing hosts. Internal-only hosts (storage, inference, gaming) are lower priority and can be skipped.

What Gets Deployed

- clamav + clamav-update packages (provides clamscan + freshclam)
- freshclam service enabled for automatic definition updates
- A quarantine directory at /var/lib/clamav/quarantine/
- A weekly clamscan cron job, niced to background priority
- SELinux context set on the quarantine directory (Fedora hosts)

Ansible Playbook

- name: Deploy ClamAV to internet-facing hosts
  hosts: internet_facing  # dca, majorlinux, teelia, tttpod, majortoot, majormail
  become: true

  tasks:

    - name: Install ClamAV packages
      ansible.builtin.package:
        name:
          - clamav
          - clamav-update
        state: present

    - name: Enable and start freshclam
      ansible.builtin.service:
        name: clamav-freshclam
        enabled: true
        state: started

    - name: Create quarantine directory
      ansible.builtin.file:
        path: /var/lib/clamav/quarantine
        state: directory
        owner: root
        group: root
        mode: '0700'

    - name: Set SELinux context on quarantine dir (Fedora/RHEL)
      ansible.builtin.command:
        cmd: chcon -t var_t /var/lib/clamav/quarantine
      when: ansible_os_family == "RedHat"
      changed_when: false

    - name: Deploy weekly clamscan cron job
      ansible.builtin.cron:
        name: "Weekly ClamAV scan"
        user: root
        weekday: "0"   # Sunday
        hour: "3"
        minute: "0"
        job: >-
          nice -n 19 ionice -c 3
          clamscan -r /
          --exclude-dir=^/proc
          --exclude-dir=^/sys
          --exclude-dir=^/dev
          --exclude-dir=^/run
          --move=/var/lib/clamav/quarantine
          --log=/var/log/clamav/scan.log
          --quiet
          2>&1 | logger -t clamscan

The nice/ionice Flags

Without throttling, clamscan -r / will peg a CPU core for 30–90 minutes depending on disk size and file count. On production hosts this causes Netdata alerts and visible service degradation.

| Flag | Value | Meaning |
|------|-------|---------|
| `nice -n 19` | Lowest CPU priority | Kernel will preempt this process for anything else |
| `ionice -c 3` | Idle I/O class | Disk I/O only runs when no other process needs the disk |

With both flags set, clamscan becomes essentially invisible under normal load. The scan takes longer (possibly 2–4× on busy disks), but this is acceptable for a weekly background job.
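To confirm that a running scan actually inherited the low CPU priority, one option is to read its nice value straight from /proc. This is a minimal sketch, assuming a Linux host; `proc_nice` is a hypothetical helper name, not part of ClamAV.

```shell
#!/bin/sh
# proc_nice PID: print the nice value of a process (Linux only).
# Hypothetical helper for illustration; reads field 19 of /proc/PID/stat.
proc_nice() (
    stat=$(cat "/proc/$1/stat") || exit 1
    # Drop "pid (comm) " so a command name containing spaces cannot shift fields.
    rest=${stat##*\) }
    # rest now starts at field 3 (state); nice is field 19, i.e. position 17.
    set -- $rest
    echo "${17}"
)

# Example: report the nice value of every running clamscan, if any.
# If pgrep is missing or nothing matches, the loop body is simply skipped.
for pid in $(pgrep -x clamscan 2>/dev/null); do
    echo "clamscan pid $pid nice=$(proc_nice "$pid")"
done
```

A scan started by the cron job should report `nice=19`. The I/O class can be checked separately with `ionice -p <pid>`.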

SELinux on Fedora/RHEL: ionice may trigger AVC denials when SELinux is enforcing. If scans silently fail on these hosts, check ausearch -m avc -ts recent for clamscan denials. See selinux-fail2ban-execmem-fix for the pattern.

Excluded Paths

Always exclude virtual/pseudo filesystems — scanning them wastes time and can trigger false positives or kernel errors:

--exclude-dir=^/proc   # Process info (not real files)
--exclude-dir=^/sys    # Kernel interfaces
--exclude-dir=^/dev    # Device nodes
--exclude-dir=^/run    # Runtime tmpfs

You may also want to exclude large data directories (/var/lib/docker, backup volumes, media stores) if scan time is a concern. These are lower-risk targets anyway.
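If the exclude list grows, it can be easier to generate the flags from a list of directories than to maintain them by hand. The sketch below assumes nothing beyond a POSIX shell; `exclude_flags` is a hypothetical helper and `/var/lib/docker` is only an example of an extra exclusion.

```shell
#!/bin/sh
# exclude_flags DIR...: print one --exclude-dir flag per directory.
# Each pattern is anchored with ^ so it only matches at the start of a path.
# Hypothetical helper for illustration.
exclude_flags() {
    out=
    for dir in "$@"; do
        out="${out:+$out }--exclude-dir=^$dir"
    done
    printf '%s\n' "$out"
}

# Example: the four pseudo-filesystems plus one large data directory.
exclude_flags /proc /sys /dev /run /var/lib/docker
```

The printed string can be pasted into the cron job's command line. Note that these are regular expressions, so `^/run` also matches any other top-level path that starts with `/run`.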

Quarantine vs Delete

--move=/var/lib/clamav/quarantine moves detected files rather than deleting them. This is safer than --remove — you can inspect and restore false positives. Review the quarantine directory periodically:

ls -la /var/lib/clamav/quarantine/

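If a file is a confirmed false positive, restore it and add the signature name (not the file path) to a .ign2 file such as whitelist.ign2. ClamAV only loads .ign2 files from its database directory (typically /var/lib/clamav/), so /etc/clamav/whitelist.ign2 only takes effect if that is where DatabaseDirectory points.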

Checking Scan Results

# View last scan log
cat /var/log/clamav/scan.log

# Summary line from the log
grep -E "^Infected|^Scanned" /var/log/clamav/scan.log | tail -5

# Check freshclam is keeping definitions current
systemctl status clamav-freshclam
freshclam --version
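
For unattended checks, the summary line can be turned into a number that a script can act on. This is a minimal sketch, assuming the log contains clamscan's usual `Infected files: N` summary line; `infected_count` is a hypothetical helper name and the log path is the one used by the cron job above.

```shell
#!/bin/sh
# infected_count LOGFILE: print N from the most recent "Infected files: N"
# summary line, or 0 if the log is missing or has no summary yet.
# Hypothetical helper for illustration.
infected_count() {
    n=$(sed -n 's/^Infected files: \([0-9][0-9]*\).*/\1/p' "$1" 2>/dev/null | tail -n 1)
    echo "${n:-0}"
}

# Example wiring: warn on stderr when the last scan found something.
log=/var/log/clamav/scan.log
if [ "$(infected_count "$log")" -gt 0 ]; then
    echo "ClamAV reported infected files; review the quarantine directory" >&2
fi
```

Treating a missing log as 0 keeps the check quiet on freshly deployed hosts. A stricter variant could treat a missing or stale log as a failure instead.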

Verifying Deployment

Test that ClamAV can detect malware using the EICAR test file (a harmless string that all AV tools recognize as test malware):

echo 'X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*' \
  > /tmp/eicar-test.txt
clamscan /tmp/eicar-test.txt
# Expected: /tmp/eicar-test.txt: Eicar-Signature FOUND (the signature name varies by database version)
rm /tmp/eicar-test.txt
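
The same check can be scripted by relying on clamscan's exit status rather than its output: 0 means clean, 1 means a detection, 2 means an error. The sketch below is one way to wrap that. `eicar_check` is a hypothetical helper, and the optional scanner argument is an assumption added so the check can be pointed at a different binary.

```shell
#!/bin/sh
# eicar_check [SCANNER]: write the EICAR string to a temp file, scan it,
# and succeed only if the scanner exits 1 (clamscan's "virus found" code).
# SCANNER defaults to clamscan. Hypothetical helper for illustration.
eicar_check() (
    scanner=${1:-clamscan}
    tmp=$(mktemp) || exit 2
    printf '%s' 'X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*' > "$tmp"
    "$scanner" --no-summary "$tmp" >/dev/null 2>&1
    rc=$?
    rm -f "$tmp"
    [ "$rc" -eq 1 ]
)

if eicar_check; then
    echo "clamscan detected the EICAR test file"
else
    echo "clamscan did NOT flag the EICAR test file (or is not installed)" >&2
fi
```

Checking the exit code avoids depending on the exact signature name printed by the scanner.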

DigitalOcean Monitoring Caveat (1 vCPU droplets)

nice -n 19 ionice -c 3 (plus MemoryMax/MemorySwapMax cgroup limits, where used) makes clamscan "polite" to the Linux scheduler: it yields to PHP-FPM, MySQL, etc. instantly. But hypervisor-level CPU monitoring (DigitalOcean, Linode, Hetzner) doesn't know about niceness. It sees raw CPU utilization. On a 1 vCPU droplet during quiet hours, a single-threaded clamscan can fill 100% of the vCPU on its own, tripping a default >85%/5m CPU alert every week, even though real traffic is genuinely insulated from the scan.

Symptoms:

- Weekly [ALERT] CPU is running high email from DO at the same time/day every week
- The alert clears within 10–60 min (when scan finishes)
- No actual user-visible service degradation
- Netdata shows CPU 80–100% but PHP-FPM/MySQL response times barely move

Fix: per-droplet alert scoping. Two changes via the DO API:

1. Scope the existing fleet-wide CPU alert to exclude affected 1 vCPU droplets by setting entities to an explicit array of all other droplet IDs.
2. Add a new alert scoped to just the affected droplet(s) with a relaxed threshold:
   - value: 95
   - window: "30m"
   - entities: [<droplet_id>]

The relaxed threshold still catches runaway PHP loops, mining trojans, and actual sustained saturation — but ignores the weekly polite scan.
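The first change needs an explicit list of every droplet ID except the small one. A minimal sketch of building that JSON array in plain shell follows. `entities_json` is a hypothetical helper, the IDs are made up, and the IDs are emitted as strings to match the `entities` arrays in the curl payloads of this guide.

```shell
#!/bin/sh
# entities_json SKIP_ID ID...: print a JSON array of droplet IDs (as strings)
# with SKIP_ID left out. Hypothetical helper for illustration.
entities_json() {
    skip=$1; shift
    out=
    for id in "$@"; do
        # Leave out the droplet that gets its own relaxed alert.
        [ "$id" = "$skip" ] && continue
        out="${out:+$out,}\"$id\""
    done
    printf '[%s]\n' "$out"
}

# Example with made-up IDs: exclude droplet 1003 from the fleet-wide alert.
entities_json 1003 1001 1002 1003 1004
```

The output can be substituted for the `entities` value when updating the fleet-wide alert. The list has to be regenerated whenever droplets are added, since an explicit array will not pick up new hosts on its own.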

Apply via DO API

TOKEN="<your DigitalOcean PAT>"

# 1. Scope existing CPU alert (PUT requires the full alert spec)
curl -sS -X PUT \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "alerts": {"email": ["you@example.com"], "slack": []},
    "compare": "GreaterThan",
    "description": "CPU is running high (excludes 1vCPU clamscan boxes)",
    "enabled": true,
    "entities": ["<droplet_id_1>", "<droplet_id_2>"],
    "tags": [],
    "type": "v1/insights/droplet/cpu",
    "value": 85,
    "window": "5m"
  }' \
  "https://api.digitalocean.com/v2/monitoring/alerts/<existing_uuid>"

# 2. Create a relaxed alert for the small box
curl -sS -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "alerts": {"email": ["you@example.com"], "slack": []},
    "compare": "GreaterThan",
    "description": "<host> CPU sustained high (clamscan-aware)",
    "enabled": true,
    "entities": ["<small_droplet_id>"],
    "tags": [],
    "type": "v1/insights/droplet/cpu",
    "value": 95,
    "window": "30m"
  }' \
  "https://api.digitalocean.com/v2/monitoring/alerts"

To list current alerts (find UUIDs and current entities):

curl -sS -H "Authorization: Bearer $TOKEN" \
  "https://api.digitalocean.com/v2/monitoring/alerts" | jq

When not to do this: If your droplet has 2+ vCPUs and clamscan only consumes ~50% of total, you probably won't trip an 85% alert in the first place. The per-droplet exemption is mainly for 1 vCPU boxes.

When the per-droplet relaxed alert also trips (and what to do): On a 1 vCPU droplet during low-traffic hours (e.g., the default Sunday-morning weekly cron window), clamscan has nothing real to yield to — nice 19 only matters when something else wants the CPU. The kernel correctly schedules clamscan as nice/idle (iostat shows %nice ~94, %idle 0) but DO sees 100% - 0% idle = 100% CPU and trips even the 95%/30m threshold for the duration of the scan (~30–50 min on small webserver boxes). At that point the realistic options are:

- Accept the weekly page as expected noise: simplest, no further engineering
- Switch to clamdscan (daemon-backed): scans finish ~3–5× faster and fit in a 30m window, but clamd adds ~250 MB resident memory continuously
- Disable the per-droplet CPU alert entirely for that host and rely on Netdata for the real signal

The "polite CPU is invisible to DO" trick stops working once the box is small enough that the polite work fills the entire core unopposed. There is no DO threshold that distinguishes "polite scan filling idle CPU" from "runaway process pinning the vCPU" — that distinction lives in iostat's %nice vs %user split, which DO doesn't expose.
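The %nice vs %user split can be read on the host itself without extra tooling, since the kernel's counters are in /proc/stat. The following is a rough sketch, assuming a Linux host; `cpu_split` is a hypothetical helper that compares two samples of the aggregate `cpu` line and sums only the user through steal columns.

```shell
#!/bin/sh
# cpu_split "BEFORE" "AFTER": given two aggregate "cpu ..." lines from
# /proc/stat, print the user, nice and idle share (whole percent, truncated)
# of the jiffies that elapsed between them. Hypothetical helper.
cpu_split() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) prev[i] = $i }
        NR == 2 {
            # Columns 2..9: user nice system idle iowait irq softirq steal.
            # The guest columns are skipped; they are already counted in user/nice.
            for (i = 2; i <= 9; i++) { d[i] = $i - prev[i]; total += d[i] }
            if (total <= 0) exit 1
            printf "user=%d nice=%d idle=%d\n", 100*d[2]/total, 100*d[3]/total, 100*d[5]/total
        }'
}

# Sample the live counters one second apart (skipped where /proc/stat is absent).
if [ -r /proc/stat ]; then
    before=$(head -n 1 /proc/stat); sleep 1; after=$(head -n 1 /proc/stat)
    cpu_split "$before" "$after"
fi
```

During a polite scan on an otherwise quiet box, nice should dominate and idle should sit near zero. A runaway process would show up under user instead, which is the distinction a utilization-only alert cannot make.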

Alternative considered: switch to clamdscan. It uses a resident clamd daemon, so signatures stay loaded between scans and each scan finishes several times faster (estimates here range from ~3–5× to ~10×) with less CPU per run. It is the better long-term answer, but it requires running clamd continuously: about 250 MB resident on small boxes, whereas the cron approach only holds RAM during the scan. A trade-off, not strictly better.
