Add: tuning Netdata web_log_1m_successful for redirect-heavy WordPress

The stock alarm definition counts only 1xx/2xx/304/401/429 as successful, which causes false CRITICALs on WP sites where 301 canonicalization is normal traffic (legacy /?p=NNNN, slug edits, host/TLS upgrades, etc.). The article documents the root cause, verification steps via the access log, and an in-place threshold retune that keeps the alarm useful as an "obvious meltdown" floor while delegating real outage detection to the 5xx and 4xx alarms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 01:10:25 -04:00 · 393df3cc45
commit 393df3cc45
parent 306e5f1f16
3 changed files with 204 additions and 5 deletions
--- a/05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md
+++ b/05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md
@@ -0,0 +1,196 @@
+---
+title: "Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites"
+domain: troubleshooting
+category: security
+tags: [netdata, monitoring, wordpress, apache, fail2ban, alerts]
+status: published
+created: 2026-05-08
+updated: 2026-05-08
+---
+
+# Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites
+
+## 🛑 Problem
+
+Netdata's stock `web_log_1m_successful` alarm fires CRITICAL on a perfectly healthy WordPress site whenever a crawler hammers legacy URLs. Example email/notification:
+
+```
+[CRITICAL] web_log_1m_successful = 54.1%
+Ratio of successful HTTP requests over the last minute (1xx, 2xx, 304, 401, 429)
+```
+
+Meanwhile the front page returns HTTP 200, no 5xx errors are logged, and only a handful of 4xx noise hits appear. So why the alert?
+
+---
+
+## 🔬 Root Cause
+
+The metric counts only these response codes as **"successful"**:
+
+```
+1xx, 2xx, 304, 401, 429
+```
+
+**301 redirects are NOT counted as successful.** Like every other 3xx except 304, they land in the `redirect` dimension and pull the success ratio down.
+
+WordPress sites generate large volumes of 301s as a normal part of life:
+
+| Redirect source | Why a 301 |
+|---|---|
+| `/?p=NNNN` legacy shortlinks | Canonical URL rewrite to slug |
+| Stale post slugs after permalink edits | Old → new path |
+| `/feed` → `/feed/` | Trailing-slash normalization |
+| `http://` → `https://` | TLS upgrade |
+| `domain.com` ↔ `www.domain.com` | Host canonicalization |
+| Proxy CONNECT probes (e.g. `www.instagram.com:443`) | Apache returns 301 to canonical host |
+
+When a feed scraper or vulnerability crawler walks a long list of legacy `/?p=` URLs, **every single hit is a 301**. Once traffic exceeds the alarm's floor of 120 requests per minute, a short burst can push `success / total_requests` below the stock trigger points of 80% (warn) or 65% (crit) within a single minute, even though the server is functioning perfectly.
+
+### Verifying the cause
+
+Pull the last few thousand lines of the access log and count requests per status code (field 9 in Apache's combined log format):
+
+```sh
+sudo tail -5000 /var/log/apache2/access.log | awk '{print $9}' | sort | uniq -c | sort -rn
+```
+
+If you see something like:
+
+```
+    196 200
+    162 301
+      1 405
+      1 404
+      1 400
+```
+
+…the math is `196 / (196+162+3) ≈ 54%`, which matches the alarm value almost exactly. **The alert is correct by its definition; the definition is wrong for this workload.**
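+
+The same calculation can be scripted so the log is measured against the alarm's own definition of success. The following is a minimal sketch, not part of Netdata: `success_ratio` is a hypothetical helper name, and it assumes Apache's combined log format, where the status code is field 9.
+
+```sh
+# success_ratio: read access-log lines on stdin and print the percentage of
+# responses that the alarm counts as successful (1xx, 2xx, 304, 401, 429).
+# Assumes the combined log format (status code in field 9).
+success_ratio() {
+  awk '
+    $9 ~ /^[0-9][0-9][0-9]$/ {
+      total++
+      if ($9 ~ /^[12]/ || $9 == "304" || $9 == "401" || $9 == "429") ok++
+    }
+    END {
+      if (total) printf "%.1f\n", ok * 100 / total
+      else print "n/a"
+    }
+  '
+}
+```
+
+Usage: `sudo tail -5000 /var/log/apache2/access.log | success_ratio`. For the sample counts above it prints `54.3`. The figure will rarely match the alarm to the decimal, because the log tail and the alarm's one-minute window do not cover exactly the same requests.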
+
+Cross-check the source IPs:
+
+```sh
+sudo tail -2000 /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -10
+```
+
+If a single IP dominates (hundreds of requests in minutes) and most of its hits are 301 to legacy URLs, you have your culprit.
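+
+To see directly which clients are responsible for the redirects, the two checks can be combined into one pass. This is a sketch under the same combined-log-format assumption; `redirects_by_ip` is a hypothetical helper name.
+
+```sh
+# redirects_by_ip: read access-log lines on stdin and print "COUNT IP" for
+# every client that received at least one 301, busiest client first.
+# Assumes the combined log format (client in field 1, status in field 9).
+redirects_by_ip() {
+  awk '$9 == "301" { hits[$1]++ } END { for (ip in hits) print hits[ip], ip }' \
+    | sort -rn
+}
+```
+
+Usage: `sudo tail -2000 /var/log/apache2/access.log | redirects_by_ip | head -10`. If the top entry accounts for most of the 301s counted earlier, that client is the one driving the ratio down.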
+
+---
+
+## ✅ Solution
+
+Two parts: **retune the alarm thresholds** so normal redirect bursts don't trip the alarm, and **block the abusive scraper** so it stops generating noise.
+
+### 1. Retune `web_log_1m_successful` thresholds
+
+Edit `/etc/netdata/health.d/web_log.conf`, the local override of the stock template (if it does not exist yet, copy the stock file listed under References into place first). Locate the `template: web_log_1m_successful` block and replace its `warn`/`crit` lines:
+
+```diff
+ template: web_log_1m_successful
+       on: web_log.type_requests
+    class: Workload
+     type: Web Server
+component: Web log
+   lookup: sum -1m unaligned of success
+     calc: $this * 100 / $web_log_1m_requests
+    units: %
+    every: 10s
+-    warn: ($web_log_1m_requests > 120) ? ($this < (($status >= $WARNING ) ? ( 90 ) : ( 80 )) ) : ( 0 )
+-    crit: ($web_log_1m_requests > 120) ? ($this < (($status == $CRITICAL) ? ( 75 ) : ( 65 )) ) : ( 0 )
++    warn: ($web_log_1m_requests > 120) ? ($this < (($status >= $WARNING ) ? ( 50 ) : ( 40 )) ) : ( 0 )
++    crit: ($web_log_1m_requests > 120) ? ($this < (($status == $CRITICAL) ? ( 30 ) : ( 20 )) ) : ( 0 )
+    delay: up 2m down 15m multiplier 1.5 max 1h
+  summary: Web log successful
+     info: Ratio of successful HTTP requests over the last minute (1xx, 2xx, 304, 401, 429)
+       to: webmaster
+```
+
+Then reload Netdata health:
+
+```sh
+sudo netdatacli reload-health
+```
+
+Confirm the new thresholds are active:
+
+```sh
+curl -s 'http://localhost:19999/api/v1/alarms?all' \
+  | jq -r '.alarms | to_entries[] | select(.value.name == "web_log_1m_successful") | .value.warn,.value.crit'
+```
+
+The output should be the two expressions, now containing the `50`/`40` warn and `30`/`20` crit values.
+
+### 2. Why the new thresholds make sense
+
+The stock alarm assumes a low-redirect workload (typical SPA backend: lots of 200s, very few 301s). On a WP site with active permalink rewrites, expect routine ratios of 70–95% successful with occasional dips into the 50s during crawler bursts. The retuned alarm:
+
+- **Warn below 40%** (clears at 50%): fires only once most responses fall outside the successful set
+- **Crit below 20%** (clears at 30%): fires only when the site is genuinely melting down (e.g., backend down, Apache returning 5xx for everything)
+
+You haven't disabled the safety net — you've moved it past the floor of normal redirect-heavy noise.
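+
+Each expression carries two numbers because the `$status` test gives the threshold hysteresis: the lower number raises the alert and the higher one is what the ratio must climb back above before the alert clears. The function below is an illustrative model of the retuned expressions only. `next_status` is a hypothetical name, the ratio is taken as an integer percentage, and the `delay` line is ignored, so it is a simplification rather than how Netdata evaluates alarms internally.
+
+```sh
+# next_status RATIO REQUESTS PREV
+# Prints CLEAR, WARNING or CRITICAL for the retuned warn/crit expressions.
+#   RATIO    integer success percentage for the last minute
+#   REQUESTS total requests in the last minute
+#   PREV     the alarm's current status (CLEAR, WARNING or CRITICAL)
+next_status() {
+  ratio=$1 requests=$2 prev=$3
+  # At or below 120 requests per minute both expressions evaluate to 0.
+  if [ "$requests" -le 120 ]; then echo CLEAR; return; fi
+  # crit: 30 if already CRITICAL, otherwise 20
+  if [ "$prev" = CRITICAL ]; then crit_at=30; else crit_at=20; fi
+  # warn: 50 if already WARNING or CRITICAL, otherwise 40
+  if [ "$prev" = CLEAR ]; then warn_at=40; else warn_at=50; fi
+  if [ "$ratio" -lt "$crit_at" ]; then echo CRITICAL
+  elif [ "$ratio" -lt "$warn_at" ]; then echo WARNING
+  else echo CLEAR
+  fi
+}
+```
+
+For example, `next_status 54 363 CLEAR` prints `CLEAR`, so the burst from the Problem section no longer alerts. `next_status 45 363 CLEAR` also prints `CLEAR`, while `next_status 45 363 WARNING` prints `WARNING`: a ratio of 45% does not raise a warning, but it does keep an existing one from clearing.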
+
+### 3. Lean on the right alarms for real outages
+
+Two other web_log alarms remain stock and **are** the correct outage signals:
+
+| Alarm | Catches | Default thresholds |
+|---|---|---|
+| `web_log_1m_internal_errors` | 5xx ratio | warn 2% / crit 5% |
+| `web_log_1m_bad_requests` | 4xx (excl. 401, 429) | warn 30% |
+
+Verify both are active and CLEAR after your retune:
+
+```sh
+curl -s 'http://localhost:19999/api/v1/alarms?all' \
+  | jq -r '.alarms | to_entries[] | select(.value.name | test("web_log")) | "\(.value.status) | \(.value.name)"'
+```
+
+### 4. Block the abusive scraper
+
+Identify the dominant offender from the IP list in *Verifying the cause* and ban it permanently via the recidive jail (assuming `bantime = -1` is set in `jail.local`). Substitute the offending address for the example IP:
+
+```sh
+sudo fail2ban-client set recidive banip 74.7.242.61
+sudo fail2ban-client status recidive
+```
+
+With a firewall-based ban action (iptables/nftables), the IP is blocked before it reaches Apache, so the redirect flood stops contributing to the ratio. If `bantime` is finite on your host, edit `/etc/fail2ban/jail.local` as follows, then apply the change with `sudo fail2ban-client reload`:
+
+```ini
+[recidive]
+bantime = -1
+findtime = 86400
+maxretry = 3
+```
+
+---
+
+## 🧪 Verification
+
+After both changes:
+
+```sh
+# 1. Active alarms — should be empty (or only your real ones)
+curl -s 'http://localhost:19999/api/v1/alarms?active' | jq '.alarms'
+
+# 2. Recidive ban list includes the IP
+sudo fail2ban-client status recidive
+
+# 3. Live request counts by type: the success share should climb above 50% within 1–2 minutes
+watch -n 5 'curl -s "http://localhost:19999/api/v1/data?chart=web_log_apache.requests_by_type&after=-60&points=1&format=json" | jq .'
+```
+
+---
+
+## 🧭 When NOT to apply this
+
+- If your site is an API or SPA backend that should have a 200-dominated traffic mix, the stock thresholds are correct — diagnose what's actually returning 301 instead of relaxing the alarm.
+- If 5xx errors are climbing in tandem with the success-ratio drop, retuning `web_log_1m_successful` will mask a real outage. **Always check `web_log_1m_internal_errors` first.**
+
+---
+
+## 📚 References
+
+- Netdata stock template: `/usr/lib/netdata/conf.d/health.d/web_log.conf`
+- Local override: `/etc/netdata/health.d/web_log.conf`
+- Netdata web_log Go module dimensions: `success`, `redirect`, `bad`, `error`, `other`
+- Related: [Custom Fail2ban Jail: Apache Directory Scanning](apache-dirscan-fail2ban-jail.md)
--- a/SUMMARY.md
+++ b/SUMMARY.md
@@ -1,6 +1,6 @@
 ---
 created: 2026-04-02T16:03
-updated: 2026-05-05T23:39
+updated: 2026-05-08T01:08
 ---
 * [Home](index.md)
 * [Linux & Sysadmin](01-linux/index.md)
@@ -75,6 +75,7 @@ updated: 2026-05-05T23:39
    * [Tailscale SSH: Unexpected Re-Authentication Prompt](05-troubleshooting/networking/tailscale-ssh-reauth-prompt.md)
    * [Fail2ban & UFW Rule Bloat Cleanup](05-troubleshooting/networking/fail2ban-ufw-rule-bloat-cleanup.md)
    * [Custom Fail2ban Jail: Apache Directory Scanning](05-troubleshooting/security/apache-dirscan-fail2ban-jail.md)
+    * [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md)
    * [Nextcloud AIO Unhealthy 20h After Nightly Update](05-troubleshooting/docker/nextcloud-aio-unhealthy-20h-stuck.md)
    * [n8n Behind Reverse Proxy: X-Forwarded-For Trust Fix](05-troubleshooting/docker/n8n-proxy-trust-x-forwarded-for.md)
    * [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](05-troubleshooting/docker-caddy-selinux-post-reboot-recovery.md)
--- a/index.md
+++ b/index.md
@@ -1,13 +1,13 @@
 ---
 created: 2026-04-06T09:52
-updated: 2026-05-02T17:50
+updated: 2026-05-08T01:08
 ---
 # MajorLinux Tech Wiki — Index

 > A growing reference of Linux, self-hosting, open source, streaming, and troubleshooting guides. Written by MajorLinux. Used by MajorTwin.
 >
-> **Last updated:** 2026-05-02
-> **Article count:** 106
+> **Last updated:** 2026-05-08
+> **Article count:** 107

 ## Domains

@@ -17,7 +17,7 @@ updated: 2026-05-02T17:50
 | 🏠 Self-Hosting & Homelab | `02-selfhosting/` | 39 |
 | 🔓 Open Source Tools | `03-opensource/` | 10 |
 | 🎙️ Streaming & Podcasting | `04-streaming/` | 2 |
-| 🔧 General Troubleshooting | `05-troubleshooting/` | 43 |
+| 🔧 General Troubleshooting | `05-troubleshooting/` | 44 |


 ---
@@ -200,6 +200,7 @@ updated: 2026-05-02T17:50
 ### Security
 - [ClamAV Safe Scheduling on Live Servers](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md)
 - [Custom Fail2ban Jail: Apache Directory Scanning & Junk Methods](05-troubleshooting/security/apache-dirscan-fail2ban-jail.md)
+- [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md)

 ### Storage
 - [mdadm RAID Recovery After USB Hub Disconnect](05-troubleshooting/storage/mdadm-usb-hub-disconnect-recovery.md)
@@ -214,6 +215,7 @@ updated: 2026-05-02T17:50

 | Date | Article | Domain |
 |---|---|---|
+| 2026-05-08 | [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md) | Troubleshooting |
 | 2026-05-07 | [Mastodon — The `--prune-profiles` Trap and How to Recover](02-selfhosting/services/mastodon-prune-profiles-trap.md) | Self-Hosting |
 | 2026-05-02 | [WSL2 Backup via PowerShell Scheduled Task](01-linux/distro-specific/wsl2-backup-powershell.md) | Linux |
 | 2026-05-02 | [SSH Config and Key Management](01-linux/networking/ssh-config-key-management.md) | Linux |