MajorLinux af14e36caf ClamAV fleet article: add DigitalOcean monitoring caveat for 1vCPU droplets

DO's hypervisor-level CPU metric doesn't know about nice/ionice — a
"polite" weekly clamscan on a 1 vCPU droplet still reads 100% utilization
and trips a default >85%/5m alert. Adds a new section explaining the
trade-off and providing the DO API recipe (PUT existing alert with
explicit entities, POST a new relaxed alert scoped to the small
droplet) plus when not to bother (2+ vCPU boxes won't trip).

Triggered by the 2026-05-10 teelia incident where the weekly cron fired
the fleet-wide CPU alert despite the cron script already wrapping
clamscan in nice 19 + ionice idle + cgroup memory limits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-10 02:24:17 -04:00

8.5 KiB

Raw Blame History

title

domain

ClamAV Fleet Deployment with Ansible

Overview

ClamAV is the standard open-source antivirus for Linux servers. For internet-facing hosts, a weekly scan with fresh definitions catches known malware, web shells, and suspicious files before they cause damage. The key operational concern is CPU impact — an unthrottled clamscan will saturate a core for hours on a busy host. The solution is nice and ionice wrappers.

This guide covers deployment to internet-facing hosts. Internal-only hosts (storage, inference, gaming) are lower priority and can be skipped.

What Gets Deployed

clamav + clamav-update packages (provides clamscan + freshclam)
freshclam service enabled for automatic definition updates
A quarantine directory at /var/lib/clamav/quarantine/
A weekly clamscan cron job, niced to background priority
SELinux context set on the quarantine directory (Fedora hosts)

Ansible Playbook

- name: Deploy ClamAV to internet-facing hosts
  hosts: internet_facing  # dca, majorlinux, teelia, tttpod, majortoot, majormail
  become: true

  tasks:

    - name: Install ClamAV packages
      ansible.builtin.package:
        name:
          - clamav
          - clamav-update
        state: present

    - name: Enable and start freshclam
      ansible.builtin.service:
        name: clamav-freshclam
        enabled: true
        state: started

    - name: Create quarantine directory
      ansible.builtin.file:
        path: /var/lib/clamav/quarantine
        state: directory
        owner: root
        group: root
        mode: '0700'

    - name: Set SELinux context on quarantine dir (Fedora/RHEL)
      ansible.builtin.command:
        cmd: chcon -t var_t /var/lib/clamav/quarantine
      when: ansible_os_family == "RedHat"
      changed_when: false

    - name: Deploy weekly clamscan cron job
      ansible.builtin.cron:
        name: "Weekly ClamAV scan"
        user: root
        weekday: "0"   # Sunday
        hour: "3"
        minute: "0"
        job: >-
          nice -n 19 ionice -c 3
          clamscan -r /
          --exclude-dir=^/proc
          --exclude-dir=^/sys
          --exclude-dir=^/dev
          --exclude-dir=^/run
          --move=/var/lib/clamav/quarantine
          --log=/var/log/clamav/scan.log
          --quiet
          2>&1 | logger -t clamscan

The nice/ionice Flags

Without throttling, clamscan -r / will peg a CPU core for 30–90 minutes depending on disk size and file count. On production hosts this causes Netdata alerts and visible service degradation.

Flag	Value	Meaning
`nice -n 19`	Lowest CPU priority	Kernel will preempt this process for anything else
`ionice -c 3`	Idle I/O class	Disk I/O only runs when no other process needs the disk

With both flags set, clamscan becomes essentially invisible under normal load. The scan takes longer (possibly 2–4× on busy disks), but this is acceptable for a weekly background job.

SELinux on Fedora/Fedora: ionice may trigger AVC denials under SELinux Enforcing. If scans silently fail on Fedora hosts, check ausearch -m avc -ts recent for clamscan denials. See selinux-fail2ban-execmem-fix for the pattern.

Excluded Paths

Always exclude virtual/pseudo filesystems — scanning them wastes time and can trigger false positives or kernel errors:

--exclude-dir=^/proc   # Process info (not real files)
--exclude-dir=^/sys    # Kernel interfaces
--exclude-dir=^/dev    # Device nodes
--exclude-dir=^/run    # Runtime tmpfs

You may also want to exclude large data directories (/var/lib/docker, backup volumes, media stores) if scan time is a concern. These are lower-risk targets anyway.

Quarantine vs Delete

--move=/var/lib/clamav/quarantine moves detected files rather than deleting them. This is safer than --remove — you can inspect and restore false positives. Review the quarantine directory periodically:

ls -la /var/lib/clamav/quarantine/

If a file is a confirmed false positive, restore it and add it to /etc/clamav/whitelist.ign2.

Checking Scan Results

# View last scan log
cat /var/log/clamav/scan.log

# Summary line from the log
grep -E "^Infected|^Scanned" /var/log/clamav/scan.log | tail -5

# Check freshclam is keeping definitions current
systemctl status clamav-freshclam
freshclam --version

Verifying Deployment

Test that ClamAV can detect malware using the EICAR test file (a harmless string that all AV tools recognize as test malware):

echo 'X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*' \
  > /tmp/eicar-test.txt
clamscan /tmp/eicar-test.txt
# Expected: /tmp/eicar-test.txt: Eicar-Signature FOUND
rm /tmp/eicar-test.txt

DigitalOcean Monitoring Caveat (1 vCPU droplets)

nice -n 19 ionice -c 3 plus MemoryMax/MemorySwapMax cgroups make clamscan "polite" to the Linux scheduler — it yields to PHP-FPM, MySQL, etc. instantly. But hypervisor-level CPU monitoring (DigitalOcean, Linode, Hetzner) doesn't know about niceness. It sees raw CPU utilization. On a 1 vCPU droplet during quiet hours, a single-threaded clamscan can fill 100% of the vCPU on its own, tripping a default >85%/5m CPU alert every week — even though the workload is genuinely insulating real traffic.

Symptoms:

Weekly [ALERT] CPU is running high email from DO at the same time/day every week
The alert clears within 10–60 min (when scan finishes)
No actual user-visible service degradation
Netdata shows CPU 80–100% but PHP-FPM/MySQL response times barely move

Fix: per-droplet alert scoping. Two changes via the DO API:

Scope the existing fleet-wide CPU alert to exclude affected 1 vCPU droplets by setting entities to an explicit array of all other droplet IDs.
Add a new alert scoped to just the affected droplet(s) with a relaxed threshold:
- value: 95
- window: "30m"
- entities: [<droplet_id>]

The relaxed threshold still catches runaway PHP loops, mining trojans, and actual sustained saturation — but ignores the weekly polite scan.

Apply via DO API

TOKEN="<your DigitalOcean PAT>"

# 1. Scope existing CPU alert (PUT requires the full alert spec)
curl -sS -X PUT \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "alerts": {"email": ["you@example.com"], "slack": []},
    "compare": "GreaterThan",
    "description": "CPU is running high (excludes 1vCPU clamscan boxes)",
    "enabled": true,
    "entities": ["<droplet_id_1>", "<droplet_id_2>"],
    "tags": [],
    "type": "v1/insights/droplet/cpu",
    "value": 85,
    "window": "5m"
  }' \
  "https://api.digitalocean.com/v2/monitoring/alerts/<existing_uuid>"

# 2. Create a relaxed alert for the small box
curl -sS -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "alerts": {"email": ["you@example.com"], "slack": []},
    "compare": "GreaterThan",
    "description": "<host> CPU sustained high (clamscan-aware)",
    "enabled": true,
    "entities": ["<small_droplet_id>"],
    "tags": [],
    "type": "v1/insights/droplet/cpu",
    "value": 95,
    "window": "30m"
  }' \
  "https://api.digitalocean.com/v2/monitoring/alerts"

To list current alerts (find UUIDs and current entities):

curl -sS -H "Authorization: Bearer $TOKEN" \
  "https://api.digitalocean.com/v2/monitoring/alerts" | jq

When not to do this: If your droplet has 2+ vCPUs and clamscan only consumes ~50% of total, you probably won't trip an 85% alert in the first place. The per-droplet exemption is mainly for 1 vCPU boxes.

Alternative considered: switch to clamdscan — uses a resident clamd daemon, signatures stay loaded, scan finishes ~10× faster with much less CPU/RAM. Better long-term answer, but requires running clamd continuously (memory cost on small boxes is ~250 MB resident vs the cron approach which only holds RAM during scan). Trade-off, not strictly better.

8.5 KiB Raw Blame History Unescape Escape