Add: tuning Netdata web_log_1m_successful for redirect-heavy WordPress

The stock alarm definition counts only 1xx/2xx/304/401/429 as successful,
which causes false CRITICALs on WP sites where 301 canonicalization is
normal traffic (legacy /?p=NNNN, slug edits, host/TLS upgrades, etc.).
Article documents the root cause, verification steps via the access log,
and an in-place threshold retune that keeps the alarm useful as an
"obvious meltdown" floor while delegating real outage detection to the
5xx and 4xx alarms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Marcus Summers 2026-05-08 01:10:25 -04:00
parent 306e5f1f16
commit 393df3cc45
3 changed files with 204 additions and 5 deletions

@ -0,0 +1,196 @@
---
title: "Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites"
domain: troubleshooting
category: security
tags: [netdata, monitoring, wordpress, apache, fail2ban, alerts]
status: published
created: 2026-05-08
updated: 2026-05-08
---
# Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites
## 🛑 Problem
Netdata's stock `web_log_1m_successful` alarm fires CRITICAL on a perfectly healthy WordPress site whenever a crawler hammers legacy URLs. Example email/notification:
```
[CRITICAL] web_log_1m_successful = 54.1%
Ratio of successful HTTP requests over the last minute (1xx, 2xx, 304, 401, 429)
```
Meanwhile the front page returns HTTP 200, no 5xx errors are logged, and only a handful of 4xx noise hits appear. So why the alert?
---
## 🔬 Root Cause
The metric counts only these response codes as **"successful"**:
```
1xx, 2xx, 304, 401, 429
```
**301 redirects are NOT counted as successful.** They land in the `redirect` dimension and pull the success ratio down.
WordPress sites generate large volumes of 301s as a normal part of life:
| Redirect source | Why a 301 |
|---|---|
| `/?p=NNNN` legacy shortlinks | Canonical URL rewrite to slug |
| Stale post slugs after permalink edits | Old → new path |
| `/feed` → `/feed/` | Trailing-slash normalization |
| `http://` → `https://` | TLS upgrade |
| `domain.com` → `www.domain.com` | Host canonicalization |
| Proxy CONNECT probes (e.g. `www.instagram.com:443`) | Apache returns 301 to canonical host |
When a feed scraper or vulnerability crawler walks a long list of legacy `/?p=` URLs, **every single hit is a 301**. A short burst can push the `success / total_requests` ratio below the stock trigger thresholds of 80% (warn) or 65% (crit) within a single minute, even though the server is functioning perfectly.
### Verifying the cause
Pull the last few thousand lines of the access log and split by status code:
```sh
sudo tail -5000 /var/log/apache2/access.log | awk '{print $9}' | sort | uniq -c | sort -rn
```
If you see something like:
```
196 200
162 301
1 405
1 404
1 400
```
…the math is `196 / (196+162+3) ≈ 54%`, which matches the alarm value almost exactly. **The alert is correct by its definition; the definition is wrong for this workload.**
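To reproduce the alarm's arithmetic directly from the log, the ratio can be computed with the same definition of success (1xx, 2xx, 304, 401, 429). This is a minimal sketch, assuming the combined log format where the status code is the 9th field; `success_ratio` is a hypothetical helper name, not part of Netdata:

```sh
# Hypothetical helper: reads access-log lines on stdin and prints the
# percentage that count as "successful" under the alarm's definition.
# Assumes the status code is the 9th whitespace-separated field.
success_ratio() {
  awk '
    { total++ }
    $9 ~ /^[12][0-9][0-9]$/ || $9 == 304 || $9 == 401 || $9 == 429 { ok++ }
    END {
      if (total > 0) { printf "%.1f\n", ok * 100 / total }
      else { print "no requests" }
    }
  '
}
```

Usage: `sudo tail -5000 /var/log/apache2/access.log | success_ratio`. The sample window differs from Netdata's rolling one-minute lookup, so expect a close match rather than an identical number.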
Cross-check the source IPs:
```sh
sudo tail -2000 /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -10
```
If a single IP dominates (hundreds of requests in minutes) and most of its hits are 301 to legacy URLs, you have your culprit.
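To see which clients are responsible for the redirects specifically, the status and IP views can be combined. This is a rough sketch under the same field-position assumptions (field 1 is the client IP, field 9 the status code); `redirects_by_ip` is a hypothetical helper name:

```sh
# Hypothetical helper: counts 301 responses per client IP from
# access-log lines on stdin and prints the top 10 as "count ip".
redirects_by_ip() {
  awk '$9 == 301 { hits[$1]++ } END { for (ip in hits) print hits[ip], ip }' \
    | sort -rn | head -10
}
```

Usage: `sudo tail -2000 /var/log/apache2/access.log | redirects_by_ip`. An IP that tops both this list and the overall request list is the likely source of the ratio drop.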
---
## ✅ Solution
Two parts: **fix the alarm definition** so normal redirect bursts don't trip it, and **block the abusive scraper** so it stops generating noise.
### 1. Retune `web_log_1m_successful` thresholds
Edit `/etc/netdata/health.d/web_log.conf` (this is a local override of the stock template). Locate the `template: web_log_1m_successful` block and replace its `warn`/`crit` lines:
```diff
template: web_log_1m_successful
on: web_log.type_requests
class: Workload
type: Web Server
component: Web log
lookup: sum -1m unaligned of success
calc: $this * 100 / $web_log_1m_requests
units: %
every: 10s
- warn: ($web_log_1m_requests > 120) ? ($this < (($status >= $WARNING ) ? ( 90 ) : ( 80 )) ) : ( 0 )
- crit: ($web_log_1m_requests > 120) ? ($this < (($status == $CRITICAL) ? ( 75 ) : ( 65 )) ) : ( 0 )
+ warn: ($web_log_1m_requests > 120) ? ($this < (($status >= $WARNING ) ? ( 50 ) : ( 40 )) ) : ( 0 )
+ crit: ($web_log_1m_requests > 120) ? ($this < (($status == $CRITICAL) ? ( 30 ) : ( 20 )) ) : ( 0 )
delay: up 2m down 15m multiplier 1.5 max 1h
summary: Web log successful
info: Ratio of successful HTTP requests over the last minute (1xx, 2xx, 304, 401, 429)
to: webmaster
```
Then reload Netdata health:
```sh
sudo netdatacli reload-health
```
Confirm the new thresholds are active:
```sh
curl -s 'http://localhost:19999/api/v1/alarms?all' \
| jq -r '.alarms | to_entries[] | select(.value.name == "web_log_1m_successful") | .value.warn,.value.crit'
```
You should see the new `50/40` warn and `30/20` crit values.
### 2. Why the new thresholds make sense
The stock alarm assumes a low-redirect workload (typical SPA backend: lots of 200s, very few 301s). On a WP site with active permalink rewrites, expect routine ratios of 70–95% successful with occasional dips into the 50s during crawler bursts. The retuned alarm:
- **Warn at <40%** — not until *most* responses are non-2xx
- **Crit at <20%** — only when the site is genuinely melting down (e.g., backend down, Apache returning 5xx for everything)
You haven't disabled the safety net — you've moved it past the floor of normal redirect-heavy noise.
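Each threshold line carries two numbers because the expression applies hysteresis: the lower value raises the alarm, and the higher value is what the ratio must recover to before the alarm clears. The sketch below re-implements the retuned `warn` expression in shell purely for illustration; `warn_fires` and its arguments are hypothetical, and Netdata evaluates the real expression itself:

```sh
# Illustration only: mirrors
#   ($web_log_1m_requests > 120) ? ($this < (($status >= $WARNING) ? 50 : 40)) : 0
# Arguments: current status (CLEAR|WARNING|CRITICAL), success ratio in %,
# and requests seen in the last minute. Prints 1 if warn would be raised.
warn_fires() {
  awk -v status="$1" -v ratio="$2" -v requests="$3" 'BEGIN {
    # Below the traffic floor the ratio is ignored entirely.
    if (requests <= 120) { print 0; exit }
    # Already alarming: must recover to 50. Otherwise: must drop below 40.
    limit = (status == "WARNING" || status == "CRITICAL") ? 50 : 40
    print ((ratio < limit) ? 1 : 0)
  }'
}
```

For example, `warn_fires CLEAR 45 500` prints `0` (the ratio has not yet dropped below 40), while `warn_fires WARNING 45 500` prints `1` (an alarm already raised holds until the ratio is back at 50 or higher). `warn_fires CLEAR 10 50` prints `0` because 50 requests is under the 120-request floor.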
### 3. Lean on the right alarms for real outages
Two other web_log alarms remain stock and **are** the correct outage signals:
| Alarm | Catches | Default thresholds |
|---|---|---|
| `web_log_1m_internal_errors` | 5xx ratio | warn 2% / crit 5% |
| `web_log_1m_bad_requests` | 4xx (excl. 401, 429) | warn 30% |
Verify both are active and CLEAR after your retune:
```sh
curl -s 'http://localhost:19999/api/v1/alarms?all' \
| jq -r '.alarms | to_entries[] | select(.value.name | test("web_log")) | "\(.value.status) | \(.value.name)"'
```
### 4. Block the abusive scraper
Identify the dominant offender from the IP list in "Verifying the cause" and ban it permanently via the recidive jail (assuming `bantime = -1` is set in `jail.local`):
```sh
sudo fail2ban-client set recidive banip 74.7.242.61
sudo fail2ban-client status recidive
```
The recidive jail uses iptables/nftables, so the IP is dropped at the firewall — Apache no longer sees it, and the redirect-flood stops contributing to the ratio. If `bantime` is finite on your host, edit `/etc/fail2ban/jail.local`:
```ini
[recidive]
bantime = -1
findtime = 86400
maxretry = 3
```
---
## 🧪 Verification
After both changes:
```sh
# 1. Active alarms — should be empty (or only your real ones)
curl -s 'http://localhost:19999/api/v1/alarms?active' | jq '.alarms'
# 2. Recidive ban list includes the IP
sudo fail2ban-client status recidive
# 3. Live ratio: should climb above 50% within 1–2 minutes
watch -n 5 'curl -s http://localhost:19999/api/v1/data?chart=web_log_apache.requests_by_type\&after=-60\&points=1\&format=json | jq'
```
---
## 🧭 When NOT to apply this
- If your site is an API or SPA backend that should have a 200-dominated traffic mix, the stock thresholds are correct — diagnose what's actually returning 301 instead of relaxing the alarm.
- If 5xx errors are climbing in tandem with the success-ratio drop, retuning the 1m_successful alarm will mask a real outage. **Always check `web_log_1m_internal_errors` first.**
---
## 📚 References
- Netdata stock template: `/usr/lib/netdata/conf.d/health.d/web_log.conf`
- Local override: `/etc/netdata/health.d/web_log.conf`
- Netdata web_log Go module dimensions: `success`, `redirect`, `bad`, `error`, `other`
- Related: [Custom Fail2ban Jail: Apache Directory Scanning](apache-dirscan-fail2ban-jail.md)

@ -1,6 +1,6 @@
---
created: 2026-04-02T16:03
updated: 2026-05-05T23:39
updated: 2026-05-08T01:08
---
* [Home](index.md)
* [Linux & Sysadmin](01-linux/index.md)
@ -75,6 +75,7 @@ updated: 2026-05-05T23:39
* [Tailscale SSH: Unexpected Re-Authentication Prompt](05-troubleshooting/networking/tailscale-ssh-reauth-prompt.md)
* [Fail2ban & UFW Rule Bloat Cleanup](05-troubleshooting/networking/fail2ban-ufw-rule-bloat-cleanup.md)
* [Custom Fail2ban Jail: Apache Directory Scanning](05-troubleshooting/security/apache-dirscan-fail2ban-jail.md)
* [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md)
* [Nextcloud AIO Unhealthy 20h After Nightly Update](05-troubleshooting/docker/nextcloud-aio-unhealthy-20h-stuck.md)
* [n8n Behind Reverse Proxy: X-Forwarded-For Trust Fix](05-troubleshooting/docker/n8n-proxy-trust-x-forwarded-for.md)
* [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](05-troubleshooting/docker-caddy-selinux-post-reboot-recovery.md)

@ -1,13 +1,13 @@
---
created: 2026-04-06T09:52
updated: 2026-05-02T17:50
updated: 2026-05-08T01:08
---
# MajorLinux Tech Wiki — Index
> A growing reference of Linux, self-hosting, open source, streaming, and troubleshooting guides. Written by MajorLinux. Used by MajorTwin.
>
> **Last updated:** 2026-05-02
> **Article count:** 106
> **Last updated:** 2026-05-08
> **Article count:** 107
## Domains
@ -17,7 +17,7 @@ updated: 2026-05-02T17:50
| 🏠 Self-Hosting & Homelab | `02-selfhosting/` | 39 |
| 🔓 Open Source Tools | `03-opensource/` | 10 |
| 🎙️ Streaming & Podcasting | `04-streaming/` | 2 |
| 🔧 General Troubleshooting | `05-troubleshooting/` | 43 |
| 🔧 General Troubleshooting | `05-troubleshooting/` | 44 |
---
@ -200,6 +200,7 @@ updated: 2026-05-02T17:50
### Security
- [ClamAV Safe Scheduling on Live Servers](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md)
- [Custom Fail2ban Jail: Apache Directory Scanning & Junk Methods](05-troubleshooting/security/apache-dirscan-fail2ban-jail.md)
- [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md)
### Storage
- [mdadm RAID Recovery After USB Hub Disconnect](05-troubleshooting/storage/mdadm-usb-hub-disconnect-recovery.md)
@ -214,6 +215,7 @@ updated: 2026-05-02T17:50
| Date | Article | Domain |
|---|---|---|
| 2026-05-08 | [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md) | Troubleshooting |
| 2026-05-07 | [Mastodon — The `--prune-profiles` Trap and How to Recover](02-selfhosting/services/mastodon-prune-profiles-trap.md) | Self-Hosting |
| 2026-05-02 | [WSL2 Backup via PowerShell Scheduled Task](01-linux/distro-specific/wsl2-backup-powershell.md) | Linux |
| 2026-05-02 | [SSH Config and Key Management](01-linux/networking/ssh-config-key-management.md) | Linux |