Marcus Summers 06a794316b docs: point Ansible references at the new roles (clamav/ssh_hardening/tailscale)

Operational/how-to references updated to the role entry playbooks after the
ADR-0001 migration. Historical incident narrative (dated callouts, commit
refs) preserved.

- clamav-fleet-deployment: override + re-run -> clamav.yml; role note
- ssh-hardening-ansible-fleet: note this is now the ssh_hardening role
- vps-migration-baseline-checklist: table -> clamav.yml / ssh_hardening.yml
- ssh-socket-tailscale-race-condition: Affected Hosts + Prevention + References
  -> tailscale role tasks (network_wait/ssh_only_ubuntu/ssh_only_fedora)
- freshclam-logwatch-false-no-updates: codify refs -> clamav role

2026-06-11 11:33:42 -04:00


ClamAV Fleet Deployment with Ansible

Overview

ClamAV is the standard open-source antivirus for Linux servers. For internet-facing hosts, a weekly scan with fresh definitions catches known malware, web shells, and suspicious files before they cause damage. The key operational concern is CPU impact: an unthrottled clamscan can saturate a core for the length of the scan (30–90 minutes or more on a busy host). The fix is to wrap the scan in nice and ionice.

This guide covers deployment to internet-facing hosts. Internal-only hosts (storage, inference, gaming) are lower priority and can be skipped.

What Gets Deployed

clamav + clamav-update packages (provides clamscan + freshclam)
freshclam service enabled for automatic definition updates
A quarantine directory at /var/lib/clamav/quarantine/
A weekly clamscan cron job, niced to background priority
SELinux context set on the quarantine directory (Fedora hosts)

Ansible Playbook

On the MajorsHouse fleet this is packaged as the clamav role (roles/clamav/, tasks split install → service → scan → verify) and run via clamav.yml or site.yml. The standalone playbook below is the illustrative equivalent.

- name: Deploy ClamAV to internet-facing hosts
  hosts: internet_facing  # dca, majorlinux, teelia, tttpod, majortoot, majormail
  become: true

  tasks:

    - name: Install ClamAV packages
      ansible.builtin.package:
        name:
          - clamav
          - clamav-update
        state: present

    - name: Enable and start freshclam
      ansible.builtin.service:
        name: clamav-freshclam
        enabled: true
        state: started

    - name: Create quarantine directory
      ansible.builtin.file:
        path: /var/lib/clamav/quarantine
        state: directory
        owner: root
        group: root
        mode: '0700'

    - name: Set SELinux context on quarantine dir (Fedora/RHEL)
      ansible.builtin.command:
        cmd: chcon -t var_t /var/lib/clamav/quarantine
      when: ansible_os_family == "RedHat"
      changed_when: false

    - name: Deploy weekly clamscan cron job
      ansible.builtin.cron:
        name: "Weekly ClamAV scan"
        user: root
        weekday: "0"   # Sunday
        hour: "3"
        minute: "0"
        job: >-
          nice -n 19 ionice -c 3
          clamscan -r /
          --exclude-dir=^/proc
          --exclude-dir=^/sys
          --exclude-dir=^/dev
          --exclude-dir=^/run
          --move=/var/lib/clamav/quarantine
          --log=/var/log/clamav/scan.log
          --quiet
          2>&1 | logger -t clamscan

The nice/ionice Flags

Without throttling, clamscan -r / will peg a CPU core for 30–90 minutes depending on disk size and file count. On production hosts this causes Netdata alerts and visible service degradation.

Flag	Value	Meaning
`nice -n 19`	Lowest CPU priority	Kernel will preempt this process for anything else
`ionice -c 3`	Idle I/O class	Disk I/O only runs when no other process needs the disk

With both flags set, clamscan becomes essentially invisible under normal load. The scan takes longer (possibly 2–4× on busy disks), but this is acceptable for a weekly background job.

SELinux on Fedora/RHEL: ionice may trigger AVC denials under SELinux Enforcing. If scans silently fail on Fedora hosts, check ausearch -m avc -ts recent for clamscan denials. See selinux-fail2ban-execmem-fix for the pattern.

Excluded Paths

Always exclude virtual/pseudo filesystems — scanning them wastes time and can trigger false positives or kernel errors:

--exclude-dir=^/proc   # Process info (not real files)
--exclude-dir=^/sys    # Kernel interfaces
--exclude-dir=^/dev    # Device nodes
--exclude-dir=^/run    # Runtime tmpfs

You may also want to exclude large data directories (/var/lib/docker, backup volumes, media stores) if scan time is a concern. These are lower-risk targets anyway.
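When the exclusion list grows beyond the four pseudo-filesystems, it can be easier to generate the flags than to hand-edit the cron line. A minimal sketch, assuming absolute paths with no regex metacharacters in them; build_excludes is a hypothetical helper, not part of the clamav role:

```shell
#!/bin/sh
# build_excludes DIR...
# Prints one --exclude-dir flag per directory, each anchored with ^ so the
# pattern only matches from the filesystem root.
build_excludes() {
  flags=""
  for dir in "$@"; do
    flags="${flags}${flags:+ }--exclude-dir=^${dir}"
  done
  printf '%s\n' "$flags"
}

# The four pseudo-filesystems from this guide plus one optional data directory.
build_excludes /proc /sys /dev /run /var/lib/docker
```

The output can be pasted into the clamscan command line or fed into a template variable.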

Quarantine vs Delete

--move=/var/lib/clamav/quarantine moves detected files rather than deleting them. This is safer than --remove — you can inspect and restore false positives. Review the quarantine directory periodically:

ls -la /var/lib/clamav/quarantine/

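If a file is a confirmed false positive, restore it and add the triggering signature name to /etc/clamav/whitelist.ign2. Note that .ign2 files list signature names, not file paths, and ClamAV reads them from its database directory, so confirm that this path is actually loaded on your hosts.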

Checking Scan Results

# View last scan log
cat /var/log/clamav/scan.log

# Summary line from the log
grep -E "^Infected|^Scanned" /var/log/clamav/scan.log | tail -5

# Check freshclam is keeping definitions current
systemctl status clamav-freshclam
freshclam --version
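For unattended checks, the summary grep above can be turned into a pass/fail test. This is a sketch only, assuming the log contains clamscan's standard "Infected files: N" summary line and that the most recent run is the last one in the file; infected_count and scan_is_clean are hypothetical helper names:

```shell
#!/bin/sh
# infected_count LOGFILE
# Prints the "Infected files" value from the last summary in the log.
# Exits with status 2 if the log has no summary line at all.
infected_count() {
  awk -F': *' '
    /^Infected files:/ { n = $2 }
    END { if (n == "") exit 2; print n + 0 }
  ' "$1"
}

# scan_is_clean LOGFILE
# Returns 0 when the latest summary reports zero infected files,
# 1 when it reports infections, 2 when no summary was found.
scan_is_clean() {
  count=$(infected_count "$1") || return 2
  [ "$count" -eq 0 ]
}

# Example (run on a host that has a scan log):
#   scan_is_clean /var/log/clamav/scan.log || echo "review the quarantine"
```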

Verifying Deployment

Test that ClamAV can detect malware using the EICAR test file (a harmless string that all AV tools recognize as test malware):

echo 'X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*' \
  > /tmp/eicar-test.txt
clamscan /tmp/eicar-test.txt
# Expected: /tmp/eicar-test.txt: Eicar-Signature FOUND
rm /tmp/eicar-test.txt

DigitalOcean Monitoring Caveat (1 vCPU droplets)

nice -n 19 ionice -c 3 makes clamscan "polite" to the Linux scheduler: it yields to PHP-FPM, MySQL, etc. instantly, while the MemoryMax/MemorySwapMax cgroup limits cap its memory. But hypervisor-level CPU monitoring (DigitalOcean, Linode, Hetzner) doesn't know about niceness. It sees raw CPU utilization. On a 1 vCPU droplet during quiet hours, a single-threaded clamscan can fill 100% of the vCPU on its own, tripping a default >85%/5m CPU alert every week, even though the workload is genuinely yielding to real traffic.

Symptoms:

Weekly [ALERT] CPU is running high email from DO at the same time/day every week
The alert clears within 10–60 min (when scan finishes)
No actual user-visible service degradation
Netdata shows CPU 80–100% but PHP-FPM/MySQL response times barely move

Fix: per-droplet alert scoping. Two changes via the DO API:

1. Scope the existing fleet-wide CPU alert to exclude affected 1 vCPU droplets by setting entities to an explicit array of all other droplet IDs.
2. Add a new alert scoped to just the affected droplet(s) with a relaxed threshold:
   - value: 95
   - window: "30m"
   - entities: [<droplet_id>]

The relaxed threshold still catches runaway PHP loops, mining trojans, and actual sustained saturation — but ignores the weekly polite scan.
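The "explicit array of all other droplet IDs" is easy to get wrong by hand on a larger fleet. A small sketch of assembling it, assuming you already have the full list of droplet IDs; entities_excluding is a hypothetical helper and the IDs shown are made up:

```shell
#!/bin/sh
# entities_excluding EXCLUDE_ID ID...
# Prints a JSON array of every ID except EXCLUDE_ID, in the string form
# used by the "entities" field of the alert policy.
entities_excluding() {
  skip=$1
  shift
  out=""
  for id in "$@"; do
    if [ "$id" = "$skip" ]; then
      continue
    fi
    out="${out}${out:+,}\"${id}\""
  done
  printf '[%s]\n' "$out"
}

# Hypothetical droplet IDs; 222 stands in for the 1 vCPU box.
entities_excluding 222 111 222 333
```

This handles one excluded droplet; with several small boxes, run it once per exclusion or extend the comparison.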

Apply via DO API

TOKEN="<your DigitalOcean PAT>"

# 1. Scope existing CPU alert (PUT requires the full alert spec)
curl -sS -X PUT \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "alerts": {"email": ["you@example.com"], "slack": []},
    "compare": "GreaterThan",
    "description": "CPU is running high (excludes 1vCPU clamscan boxes)",
    "enabled": true,
    "entities": ["<droplet_id_1>", "<droplet_id_2>"],
    "tags": [],
    "type": "v1/insights/droplet/cpu",
    "value": 85,
    "window": "5m"
  }' \
  "https://api.digitalocean.com/v2/monitoring/alerts/<existing_uuid>"

# 2. Create a relaxed alert for the small box
curl -sS -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "alerts": {"email": ["you@example.com"], "slack": []},
    "compare": "GreaterThan",
    "description": "<host> CPU sustained high (clamscan-aware)",
    "enabled": true,
    "entities": ["<small_droplet_id>"],
    "tags": [],
    "type": "v1/insights/droplet/cpu",
    "value": 95,
    "window": "30m"
  }' \
  "https://api.digitalocean.com/v2/monitoring/alerts"

To list current alerts (find UUIDs and current entities):

curl -sS -H "Authorization: Bearer $TOKEN" \
  "https://api.digitalocean.com/v2/monitoring/alerts" | jq

When not to do this: If your droplet has 2+ vCPUs and clamscan only consumes ~50% of total, you probably won't trip an 85% alert in the first place. The per-droplet exemption is mainly for 1 vCPU boxes.

When the per-droplet relaxed alert also trips (and what to do): On a 1 vCPU droplet during low-traffic hours (e.g., the default Sunday-morning weekly cron window), clamscan has nothing real to yield to — nice 19 only matters when something else wants the CPU. The kernel correctly schedules clamscan as nice/idle (iostat shows %nice ~94, %idle 0) but DO sees 100% - 0% idle = 100% CPU and trips even the 95%/30m threshold for the duration of the scan (~30–50 min on small webserver boxes). At that point the realistic options are:

Accept the weekly page as expected noise — simplest, no further engineering
Switch to clamdscan (daemon-backed): scans finish ~3–5× faster and fit in a 30m window, but clamd keeps its signature database resident continuously (roughly 700 MB to 1 GB on this fleet; see Daemonless Mode on Memory-Constrained Hosts)
Disable the per-droplet CPU alert entirely for that host and rely on Netdata for the real signal

The "polite CPU is invisible to DO" trick stops working once the box is small enough that the polite work fills the entire core unopposed. There is no DO threshold that distinguishes "polite scan filling idle CPU" from "runaway process pinning the vCPU" — that distinction lives in iostat's %nice vs %user split, which DO doesn't expose.
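To see the split that the hypervisor cannot, you can diff two samples of the aggregate cpu line from /proc/stat. A minimal sketch, assuming the standard field order (user, nice, system, idle, iowait, irq, softirq, steal); cpu_split is a hypothetical helper, and "busy" here is simply 100 minus idle, which approximates the hypervisor-style view described above:

```shell
#!/bin/sh
# cpu_split EARLIER_LINE LATER_LINE
# Takes two samples of the first line of /proc/stat and prints the share of
# elapsed ticks spent in user, nice, and idle, plus busy = 100 - idle.
cpu_split() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { for (i = 2; i <= 9; i++) first[i] = $i }
    NR == 2 {
      total = 0
      for (i = 2; i <= 9; i++) { d[i] = $i - first[i]; total += d[i] }
      if (total == 0) { print "no ticks elapsed"; exit 1 }
      idle = 100 * d[5] / total
      printf "user=%d%% nice=%d%% idle=%d%% busy=%d%%\n",
        100 * d[2] / total, 100 * d[3] / total, idle, 100 - idle
    }'
}

# Made-up samples resembling a niced scan on an otherwise idle 1 vCPU box.
cpu_split "cpu 100 0 50 1000 0 0 0 0" "cpu 105 94 51 1000 0 0 0 0"

# Live use on a host:
#   a=$(head -n 1 /proc/stat); sleep 5; b=$(head -n 1 /proc/stat)
#   cpu_split "$a" "$b"
```

A high nice share with near-zero idle points at the polite scan; a high user share with the same busy figure points at something that deserves a page.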

Alternative considered: switch to clamdscan. It uses a resident clamd daemon, so signatures stay loaded and each scan finishes ~3–5× faster with much less per-scan CPU. It is the better long-term answer where memory allows, but it requires running clamd continuously, which has cost roughly 700 MB to 1 GB resident on small boxes in this fleet, whereas the cron approach only holds RAM during the scan. It is a trade-off, not strictly better.

Daemonless Mode on Memory-Constrained Hosts

On hosts with ≤2 GB RAM, running clamd continuously is often counterproductive. The daemon loads its full signature database (~950 MB RSS) into memory and keeps it resident. On small VMs this crowds out MySQL, PHP-FPM, and other services, often pushing the whole system into swap, so the daemon causes more harm than it prevents.

Affected hosts (fleet history):

Host	RAM	Incident	Resolution
teelia	1.9 GB	2026-04-27 — clamd 728 MB RSS, 94% RAM alert	daemonless
dcaprod	3.8 GB	2026-04-30 — clamd OOM thrash after 512M cgroup cap	daemonless
majorlinux	2.0 GB	2026-05-15 — clamd 980 MB swap, mysqld swapping 293 MB	daemonless

The fix: clamav_use_daemon: false host_var

The clamav role supports a per-host override. Add to the host's host_vars/<hostname>/vars.yml:

clamav_use_daemon: false

Then re-run the role:

ansible-playbook clamav.yml --limit <hostname>

This will:

Stop and disable clamav-daemon.service and clamav-daemon.socket
Deploy the weekly scan template using clamscan (daemonless, loads DB per run)
Leave clamav-freshclam active so definitions stay current

Trade-off: Each weekly scan loads the signature DB fresh (~950 MB peak RAM for the scan duration, then freed). The scan takes roughly 3–5× longer than clamdscan against a warm daemon, but this is acceptable for a weekly background job. The systemd-run MemoryMax cgroup wrapper in the scan template caps peak usage so the scan can't OOM the host.
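The role's scan template is not reproduced in this guide, so the following is only a guess at the general shape of such a wrapper, not the role's actual contents. It assumes a cgroup v2 host where systemd-run --scope accepts MemoryMax and MemorySwapMax properties; run_capped, the DRY_RUN switch, and the 1G/0 limits are all hypothetical:

```shell
#!/bin/sh
# run_capped MEM_MAX SWAP_MAX COMMAND...
# Runs COMMAND in a transient systemd scope with memory limits, at the
# lowest CPU priority and idle I/O class. With DRY_RUN=1 it only prints
# the command it would run.
run_capped() {
  mem_max=$1
  swap_max=$2
  shift 2
  set -- systemd-run --scope --quiet \
    -p "MemoryMax=${mem_max}" -p "MemorySwapMax=${swap_max}" \
    nice -n 19 ionice -c 3 "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Print the wrapped command without running it (limits are placeholders).
DRY_RUN=1 run_capped 1G 0 clamscan -r /srv
```

Pick the MemoryMax value from the host's real headroom; a cap set below what the signature load needs will trade a host-wide OOM for a failed scan.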

Rule of thumb: Use daemon mode (clamav_use_daemon: true or unset) on hosts with ≥4 GB RAM where scan speed matters (mail servers, upload handlers). Use daemonless on webservers and small VMs where continuous memory residency is the bigger risk.
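The memory half of this rule of thumb can be checked mechanically when onboarding a host. A rough sketch, assuming the 4 GB threshold is compared against MemTotal as the kernel reports it (so a nominal "4 GB" VM that reports about 3.8 GB lands on the daemonless side, matching the fleet history above); both function names are hypothetical, and whether scan speed matters is still a human call:

```shell
#!/bin/sh
# recommend_daemon_mode MEM_TOTAL_MB
# Prints the host_var line suggested by the memory threshold alone.
recommend_daemon_mode() {
  if [ "$1" -ge 4096 ]; then
    echo "clamav_use_daemon: true"
  else
    echo "clamav_use_daemon: false"
  fi
}

# mem_total_mb
# Reads MemTotal (reported in kB) from /proc/meminfo and prints whole MB.
mem_total_mb() {
  awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo
}

# Suggest a value for the current host when /proc/meminfo is available.
if [ -r /proc/meminfo ]; then
  recommend_daemon_mode "$(mem_total_mb)"
fi
```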
