Logwatch fleet article: add Fedora CA bundle diagnosis + bounce-source guidance
Documents three lessons from the 2026-05-10 fleet outage where the Fedora half (majorhome, majorlab) had been silently failing to send notification mail for days:

- Missing /etc/pki/tls/certs/ca-bundle.crt symlink (extracted bundle exists at /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem but the consumer-path symlink was lost during a ca-certificates package event). Diagnosis includes the cross-tool tell: dnf and curl break with the same path. Fix is a single ln -sfn.
- Methodology: Fedora and majormail log postfix to journald; Debian and Ubuntu log to /var/log/mail.log. Querying the wrong source returns false negatives for healthy hosts.
- Bounce-source addresses (Watchtower NOTIFICATION_EMAIL_FROM, fail2ban sender, root@<host>.localdomain) must resolve to real mailboxes; otherwise the first failed delivery generates bounce-of-bounce churn.

Also promoting the article from untracked to committed; it had been authored on 2026-05-09 and not yet added to the repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent a852f7b7bd
commit 631d7e8bc5
3 changed files with 182 additions and 0 deletions
180
02-selfhosting/monitoring/logwatch-fleet-setup.md
Normal file
@ -0,0 +1,180 @@
---
title: Logwatch Fleet Setup — Surviving Package Upgrades
description: Configure logwatch on mixed Debian/Fedora fleets so settings survive package upgrades
tags:
  - logwatch
  - monitoring
  - ansible
  - fedora
  - ubuntu
status: published
created: 2026-05-09
updated: 2026-05-10
---

# Logwatch Fleet Setup — Surviving Package Upgrades

Logwatch ships with a defaults file at `/usr/share/logwatch/default.conf/logwatch.conf`. On Fedora, package upgrades **silently reset** this file, wiping any customizations. The fix is to put all settings in the **local override file** at `/etc/logwatch/conf/logwatch.conf`, which is never touched by package managers.

## The Problem

Fedora 44's logwatch 7.14-1 upgrade (April 2026) reset `Output` from `mail` back to `stdout` in the defaults file. Servers that had been emailing daily reports for months went silent with zero errors. `rpm -V logwatch` flags the defaults file (`S.5....T.`) whenever the on-disk copy differs from the packaged one, but the upgrade itself prints no warning.

Ubuntu is less affected because its `/etc/cron.daily/00logwatch` script passes `--output mail` explicitly, overriding the config. Fedora's cron script does not.

## The Fix

Write all settings to the **override file** (`/etc/logwatch/conf/logwatch.conf`):

```ini
# Managed by Ansible. Do not edit manually.
# Local overrides; these survive package upgrades.
Output = mail
MailTo = marcus@majorshouse.com
MailFrom = Logwatch@hostname.majorshouse.com
Detail = Low
```

Key settings:

| Setting | Value | Why |
|---------|-------|-----|
| `Output` | `mail` | Must be `mail`, not `stdout`. Fedora's cron script doesn't pass `--output mail` like Ubuntu's does. |
| `MailTo` | recipient address | Where reports go. |
| `MailFrom` | per-host sender | Makes it easy to identify which server sent the report. |
| `Detail` | `Low` | Keeps emails scannable. Raise to `Med` or `High` for debugging. |

## Ansible Playbook

The `logwatch.yml` playbook handles both OS families:

```yaml
- name: Install and configure logwatch
  hosts: all
  become: true
  gather_facts: true
  tasks:
    - name: Install logwatch (Debian/Ubuntu)
      ansible.builtin.apt:
        name: logwatch
        state: present
      when: ansible_facts['os_family'] == "Debian"

    - name: Install logwatch (Fedora)
      ansible.builtin.dnf:
        name: logwatch
        state: present
      when: ansible_facts['os_family'] == "RedHat"

    - name: Ensure logwatch override directory exists
      ansible.builtin.file:
        path: /etc/logwatch/conf
        state: directory
        mode: '0755'

    # Assumes logwatch_email is defined elsewhere (inventory or group_vars).
    - name: Configure logwatch override (survives package upgrades)
      ansible.builtin.copy:
        dest: /etc/logwatch/conf/logwatch.conf
        mode: '0644'
        content: |
          # Managed by Ansible. Do not edit manually.
          Output = mail
          MailTo = {{ logwatch_email }}
          MailFrom = Logwatch@{{ inventory_hostname }}.majorshouse.com
          Detail = Low
```
Include it in `harden.yml` so every new server gets logwatch as part of the baseline.
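
One way to confirm which value a host will actually use is to resolve a setting in the order this article relies on: the override file first, the packaged defaults second. The sketch below assumes plain `Key = Value` lines and exact-case keys. The function name `logwatch_setting` is hypothetical, and logwatch's own parser (which also honors command-line flags such as `--output`) remains the final authority.

```bash
# Sketch: print the value a logwatch setting resolves to, preferring the
# local override file over the packaged defaults. The default paths are the
# two named in this article; the function name and parsing are assumptions.
logwatch_setting() (
    key=$1
    override=${2:-/etc/logwatch/conf/logwatch.conf}
    defaults=${3:-/usr/share/logwatch/default.conf/logwatch.conf}
    for file in "$override" "$defaults"; do
        [ -r "$file" ] || continue
        # Last uncommented "key = value" line wins within a file.
        value=$(sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*\$/\1/p" "$file" | tail -n 1)
        if [ -n "$value" ]; then
            printf '%s\n' "$value"
            exit 0   # the function body is a subshell, so this only ends the function
        fi
    done
    exit 1           # setting not found in either file
)
```

On a host configured by the playbook above, `logwatch_setting Output` prints `mail`. If it prints `stdout`, the override file is missing or does not set `Output`.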

## Verifying

After deploying, test immediately:

```bash
sudo logwatch --output mail --range today
```

Check that the email arrives. If it doesn't, verify Postfix is installed and relaying correctly, because logwatch depends on a working local MTA.

## Diagnosing Silent Failures

```bash
# Check whether the defaults file still matches what the package shipped
rpm -V logwatch     # Fedora
dpkg -V logwatch    # Debian

# In rpm's output, S.5....T. next to the defaults file means the on-disk copy
# differs from the packaged one: S = size, 5 = digest (MD5 on older rpm),
# T = modification time. A clean result on a file you had customized means
# an upgrade has already overwritten your edits.
```
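
To pull only the content-changed paths out of `rpm -V` output, a small filter can help. This is a sketch that assumes the usual layout (nine flag characters, an optional attribute marker such as `c`, then the path) and paths without spaces. The function name is made up for illustration.

```bash
# Sketch: read `rpm -V` output on stdin and print the paths whose content
# digest differs from the packaged copy (a "5" in the third flag column).
# Lines such as "missing ..." have a shorter first field and are skipped.
changed_content_files() {
    awk 'length($1) >= 9 && substr($1, 3, 1) == "5" { print $NF }'
}
```

Typical use would be `rpm -V logwatch | changed_content_files`. Any path it prints under `/usr/share/logwatch/` is a customization sitting in a file the next upgrade may replace.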

## Fedora CA Bundle Missing: Postfix TLS Engine Unavailable

If the Fedora half of your fleet is silent but the Debian/Ubuntu half is fine, and your relayhost requires TLS, suspect a missing CA bundle. Symptom on the sending host:

```
postfix/error: status=deferred (delivery temporarily suspended:
  TLS is required, but our TLS engine is unavailable)
```

The tell that this is the CA bundle and not a postfix-internal problem: **dnf and curl are also broken on the box.** Run any `sudo dnf list` / `sudo curl https://...` and look for:

```
Curl error (77): Problem with the SSL CA cert (path? access rights?)
[error adding trust anchors from file: /etc/pki/tls/certs/ca-bundle.crt]
```

That's the same path postfix's `smtp_tls_CAfile` points to on these hosts. Upstream postfix leaves that parameter empty, so the value comes from `main.cf`; `postconf smtp_tls_CAfile` shows what is in effect. Every TLS client on the box is failing because a single symlink is missing.
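
To see how long a host has been affected, it can help to count the deferrals in its postfix log. The sketch below only matches the message text quoted above and reads log text on stdin, so it works with either log source. The function name is hypothetical.

```bash
# Sketch: count log lines that carry the "TLS engine is unavailable"
# deferral. Prints 0 when nothing matches instead of failing.
count_tls_engine_deferrals() {
    grep -c 'TLS is required, but our TLS engine is unavailable' || true
}
```

For example, `sudo journalctl -u postfix --since "7 days ago" | count_tls_engine_deferrals` on a Fedora host. A non-zero count alongside the curl error above points at the CA bundle rather than the relayhost.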

### Diagnosis

```bash
# Is the consumer-path symlink there?
ls -la /etc/pki/tls/certs/ca-bundle.crt
# Expected: lrwxrwxrwx ... -> /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem

# Is the extracted bundle itself intact?
ls -la /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
sudo grep -c 'BEGIN CERTIFICATE' /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
# Expected: ~140-150 certs, ~220 KB
```

If the extracted bundle exists but the consumer-path symlink is gone, you've found it. `update-ca-trust extract` regenerates the `extracted/` paths but does **not** recreate the upstream-style symlink at `/etc/pki/tls/certs/ca-bundle.crt`. That symlink is shipped by the `ca-certificates` package and can be lost during a partial upgrade or a stray `rm`.
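
The manual checks above can be folded into one function that names the failure mode. This is a sketch: the default paths are the two from this article, while the function name and the status words it prints are hypothetical. Run it as a user that can read the bundle.

```bash
# Sketch: classify the state of the consumer-path CA bundle.
#   $1: consumer path (symlink), $2: extracted bundle it should point to.
ca_bundle_status() (
    link=${1:-/etc/pki/tls/certs/ca-bundle.crt}
    bundle=${2:-/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem}
    if [ ! -s "$bundle" ]; then
        echo "bundle-missing"   # extracted bundle absent or empty
    elif [ ! -e "$link" ] && [ ! -L "$link" ]; then
        echo "link-missing"     # the failure described in this section
    elif [ ! -e "$link" ]; then
        echo "link-dangling"    # symlink exists but its target does not
    elif ! grep -q 'BEGIN CERTIFICATE' "$link"; then
        echo "no-certs"         # path resolves but holds no PEM certificates
    else
        echo "ok"
    fi
)
```

A result of `link-missing` with the bundle intact is the case this section describes. `bundle-missing` suggests a different problem with the extracted trust store itself.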

### Fix

```bash
sudo ln -sfn /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem \
  /etc/pki/tls/certs/ca-bundle.crt
sudo systemctl reload postfix
sudo postqueue -f    # drain deferred mail
```

Verify with `sudo grep -c 'BEGIN CERTIFICATE' /etc/pki/tls/certs/ca-bundle.crt` (should match the extracted bundle's count) and `sudo dnf list --installed postfix` (should no longer show the curl error).

### Audit the rest of the Fedora fleet

Once you find one host with this issue, check the others. Package events that broke one box may have broken its siblings:

```bash
# Replace the host list with your own Fedora hosts
# (majorhome and majorlab are the two Fedora hosts in this fleet).
for host in majorhome majorlab; do
  echo "$host: $(ssh "$host" 'ls /etc/pki/tls/certs/ca-bundle.crt 2>&1' | tail -1)"
done
```

Hosts returning "No such file or directory" are silently broken. They won't fail loudly until something asks them to do TLS, which on a small homelab might not happen until logwatch tries to mail you weeks later.

### Methodology note: postfix logs differ between distros

Don't trust a single log source when surveying a mixed fleet. **Fedora and majormail log postfix to journald** (`journalctl -u postfix`); **Debian/Ubuntu log to `/var/log/mail.log`** (and rotated `mail.log.1` / `mail.log.*.gz`). Querying journalctl on Ubuntu returns "no entries" even when mail is flowing, which makes it easy to declare a working host broken. Always run `tail /var/log/mail.log` on Debian-family hosts and `journalctl -u postfix` on Fedora-family hosts.
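
A survey script can make that choice per host instead of relying on memory. The sketch below reads `ID` and `ID_LIKE` from `/etc/os-release`, which is an assumption about how to detect the distro family. The function name and the `file:` / `journald:` labels are hypothetical.

```bash
# Sketch: report which postfix log source to query on this host, following
# the split described above (Debian family -> mail.log, Fedora family -> journald).
postfix_log_source() (
    os_release=${1:-/etc/os-release}
    # Source the file inside a command substitution so its variables do not leak.
    ids=$( . "$os_release" 2>/dev/null; echo "${ID:-} ${ID_LIKE:-}" )
    case " $ids " in
        *" debian "*|*" ubuntu "*) echo "file:/var/log/mail.log" ;;
        *" fedora "*|*" rhel "*)   echo "journald:postfix" ;;
        *)                         echo "unknown" ;;
    esac
)
```

A host that reports `unknown` is worth checking by hand in both places before it is marked as broken.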

## Bounce-source addresses must be real mailboxes

A subtle related class of bug: services like Watchtower, fail2ban, cron, and Netdata default to sending notifications **from** an identity that doesn't exist as a recipient, such as `watchtower@majorshouse.com`, `fail2ban@<host>.majorshouse.com`, or `root@<host>.localdomain`. While the relayhost is healthy, nobody notices. The moment any delivery fails (network blip, recipient typo, queue overflow, the CA bundle bug above), the local MTA tries to bounce the original message back to that sender, finds no mailbox, and the bounce itself bounces. You get MAILER-DAEMON queue churn and `5.7.1 Relay access denied` rejections in your mail server logs.

Fix it once at the source: set `WATCHTOWER_NOTIFICATION_EMAIL_FROM`, fail2ban's `sender =`, and similar to a **real mailbox** on your mail server (e.g., `marcus@majorshouse.com`). Bounces then land somewhere a human can read them, and the noise disappears.
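
If you keep a list of the sender addresses configured across services, a quick comparison against the mailboxes that really exist will surface the risky ones. Both input files in this sketch are hypothetical (one address per line), as is the function name. How you collect the two lists depends on your services and mail server.

```bash
# Sketch: print sender addresses that are not known-real mailboxes.
#   $1: file of configured sender addresses, one per line
#   $2: file of mailboxes that exist on the mail server, one per line
# Matching is whole-line, literal, and case-insensitive.
unroutable_senders() {
    grep -v -i -x -F -f "$2" "$1" || true
}
```

Every address it prints is a sender whose bounces would have nowhere to land.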

## Lesson Learned

Never customize `/usr/share/logwatch/default.conf/logwatch.conf`. Always use `/etc/logwatch/conf/logwatch.conf`. This applies to any software that has a "defaults" file and an "override" file: the override survives upgrades, the defaults file does not.

A second, broader lesson from the 2026-05-10 fleet outage: **silent fleet-wide email gaps are usually a stack of unrelated failures, not one cause.** That morning's investigation surfaced a missing CA bundle on two Fedora hosts, a postfix relayhost using a name that postfix's resolver couldn't handle, two services with non-mailbox sender addresses generating bounce churn, and a mistaken syslog-vs-journald assumption that hid working hosts. Each was minor in isolation. Together they made all seven hosts look broken when in fact only two were. Triage by ground truth (what arrived in the destination mailbox) before assuming what's broken at the source.

@ -34,6 +34,7 @@ updated: 2026-05-10T00:10
* [Deploying Netdata to a New Server](02-selfhosting/monitoring/netdata-new-server-setup.md)
* [Netdata SELinux AVC Denial Monitoring](02-selfhosting/monitoring/netdata-selinux-avc-chart.md)
* [Netdata n8n Enriched Alert Emails](02-selfhosting/monitoring/netdata-n8n-enriched-alerts.md)
* [Logwatch Fleet Setup — Surviving Package Upgrades](02-selfhosting/monitoring/logwatch-fleet-setup.md)
* [Updating n8n Running in Docker](02-selfhosting/services/updating-n8n-docker.md)
* [Mastodon Instance Tuning](02-selfhosting/services/mastodon-instance-tuning.md)
* [Mastodon — The `--prune-profiles` Trap and How to Recover](02-selfhosting/services/mastodon-prune-profiles-trap.md)
1
index.md
@ -217,6 +217,7 @@ updated: 2026-05-10T01:30
| Date | Article | Domain |
|---|---|---|
| 2026-05-10 | [Logwatch Fleet Setup — Surviving Package Upgrades](02-selfhosting/monitoring/logwatch-fleet-setup.md) — added Fedora CA bundle missing diagnosis, journald-vs-mail.log methodology note, and bounce-source-must-be-real-mailbox section | Self-Hosting |
| 2026-05-10 | [ClamAV Fleet Deployment with Ansible](02-selfhosting/security/clamav-fleet-deployment.md) — added DigitalOcean monitoring caveat for 1vCPU droplets (with follow-up note: per-droplet relaxed alert can still trip; accept-the-page decision) | Self-Hosting |
| 2026-05-10 | [Claude Desktop MCP Mass-Disconnect After Blocking SSH Reboot](05-troubleshooting/claude-desktop-mcp-mass-disconnect-blocking-reboot.md) | Troubleshooting |
| 2026-05-10 | [Castopod Posts Don't Appear on Mastodon — Diagnosing the Federation Path](05-troubleshooting/security/castopod-broadcast-not-on-mastodon.md) | Troubleshooting |