diff --git a/05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md b/05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md
new file mode 100644
index 0000000..a553c74
--- /dev/null
+++ b/05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md
@@ -0,0 +1,196 @@
+---
+title: "Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites"
+domain: troubleshooting
+category: security
+tags: [netdata, monitoring, wordpress, apache, fail2ban, alerts]
+status: published
+created: 2026-05-08
+updated: 2026-05-08
+---
+
+# Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites
+
+## 🛑 Problem
+
+Netdata's stock `web_log_1m_successful` alarm fires CRITICAL on a perfectly healthy WordPress site whenever a crawler hammers legacy URLs. Example email/notification:
+
+```
+[CRITICAL] web_log_1m_successful = 54.1%
+Ratio of successful HTTP requests over the last minute (1xx, 2xx, 304, 401, 429)
+```
+
+Meanwhile the front page returns HTTP 200, no 5xx errors are logged, and only a handful of 4xx noise hits appear. So why the alert?
+
+---
+
+## 🔬 Root Cause
+
+The metric counts only these response codes as **"successful"**:
+
+```
+1xx, 2xx, 304, 401, 429
+```
+
+**301 redirects are NOT counted as successful.** They land in the `redirect` dimension and pull the success ratio down.
+
+WordPress sites generate large volumes of 301s as a normal part of life:
+
+| Redirect source | Why a 301 |
+|---|---|
+| `/?p=NNNN` legacy shortlinks | Canonical URL rewrite to slug |
+| Stale post slugs after permalink edits | Old → new path |
+| `/feed` → `/feed/` | Trailing-slash normalization |
+| `http://` → `https://` | TLS upgrade |
+| `domain.com` ↔ `www.domain.com` | Host canonicalization |
+| Proxy CONNECT probes (e.g. `www.instagram.com:443`) | Apache returns 301 to canonical host |
+
+When a feed scraper or vulnerability crawler walks a long list of legacy `/?p=` URLs, **every single hit is a 301**. A short burst can push the `success / total_requests` ratio below the stock 80% warn trigger or 65% crit trigger within a single minute, even though the server is functioning perfectly.
+
+### Verifying the cause
+
+Pull the last few thousand lines of the access log and split by status code:
+
+```sh
+sudo tail -5000 /var/log/apache2/access.log | awk '{print $9}' | sort | uniq -c | sort -rn
+```
+
+If you see something like:
+
+```
+ 196 200
+ 162 301
+ 1 405
+ 1 404
+ 1 400
+```
+
+…the math is `196 / (196+162+3) ≈ 54%`, which matches the alarm value almost exactly. **The alert is correct by its definition; the definition is wrong for this workload.**
+
+Cross-check the source IPs:
+
+```sh
+sudo tail -2000 /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -10
+```
+
+If a single IP dominates (hundreds of requests in minutes) and most of its hits are 301s to legacy URLs, you have your culprit.
+
+---
+
+## ✅ Solution
+
+Two parts: **fix the alarm definition** so normal redirect bursts don't trip it, and **block the abusive scraper** so it stops generating noise.
+
+### 1. Retune `web_log_1m_successful` thresholds
+
+Edit `/etc/netdata/health.d/web_log.conf` (this is a local override of the stock template). Locate the `template: web_log_1m_successful` block and replace its `warn`/`crit` lines:
+
+```diff
+ template: web_log_1m_successful
+       on: web_log.type_requests
+    class: Workload
+     type: Web Server
+component: Web log
+   lookup: sum -1m unaligned of success
+     calc: $this * 100 / $web_log_1m_requests
+    units: %
+    every: 10s
+-    warn: ($web_log_1m_requests > 120) ? ($this < (($status >= $WARNING ) ? ( 90 ) : ( 80 )) ) : ( 0 )
+-    crit: ($web_log_1m_requests > 120) ? ($this < (($status == $CRITICAL) ? ( 75 ) : ( 65 )) ) : ( 0 )
++    warn: ($web_log_1m_requests > 120) ? ($this < (($status >= $WARNING ) ? ( 50 ) : ( 40 )) ) : ( 0 )
++    crit: ($web_log_1m_requests > 120) ? ($this < (($status == $CRITICAL) ? ( 30 ) : ( 20 )) ) : ( 0 )
+    delay: up 2m down 15m multiplier 1.5 max 1h
+  summary: Web log successful
+     info: Ratio of successful HTTP requests over the last minute (1xx, 2xx, 304, 401, 429)
+       to: webmaster
+```
+
+Then reload Netdata health:
+
+```sh
+sudo netdatacli reload-health
+```
+
+Confirm the new thresholds are active:
+
+```sh
+curl -s 'http://localhost:19999/api/v1/alarms?all' \
+ | jq -r '.alarms | to_entries[] | select(.value.name == "web_log_1m_successful") | .value.warn,.value.crit'
+```
+
+You should see the new `50/40` warn and `30/20` crit values. In each pair, the second number is the trigger and the first is the level the ratio must recover to before the alarm clears.
+
+### 2. Why the new thresholds make sense
+
+The stock alarm assumes a low-redirect workload (typical SPA backend: lots of 200s, very few 301s). On a WP site with active permalink rewrites, expect routine ratios of 70–95% successful with occasional dips into the 50s during crawler bursts. The retuned alarm:
+
+- **Warn at <40%**: not until *most* responses fall outside the success set
+- **Crit at <20%**: only when the site is genuinely melting down (e.g., backend down, Apache returning 5xx for everything)
+
+You haven't disabled the safety net; you've moved it below the floor of normal redirect-heavy noise.
+
+### 3. Lean on the right alarms for real outages
+
+Two other web_log alarms remain stock and **are** the correct outage signals:
+
+| Alarm | Catches | Default thresholds |
+|---|---|---|
+| `web_log_1m_internal_errors` | 5xx ratio | warn 2% / crit 5% |
+| `web_log_1m_bad_requests` | 4xx (excl. 401, 429) | warn 30% |
+
+Verify both are active and CLEAR after your retune:
+
+```sh
+curl -s 'http://localhost:19999/api/v1/alarms?all' \
+ | jq -r '.alarms | to_entries[] | select(.value.name | test("web_log")) | "\(.value.status) | \(.value.name)"'
+```
+
+### 4. Block the abusive scraper
+
+Identify the dominant offender from the source-IP list under "Verifying the cause" and ban it permanently via the recidive jail (assuming `bantime = -1` is set in `jail.local`):
+
+```sh
+sudo fail2ban-client set recidive banip 74.7.242.61
+sudo fail2ban-client status recidive
+```
+
+The recidive jail uses iptables/nftables, so the IP is dropped at the firewall: Apache no longer sees it, and the redirect flood stops contributing to the ratio. If `bantime` is finite on your host, set it in `/etc/fail2ban/jail.local` first and reload Fail2ban so the manual ban does not expire:
+
+```ini
+[recidive]
+bantime = -1
+findtime = 86400
+maxretry = 3
+```
+
+---
+
+## 🧪 Verification
+
+After both changes:
+
+```sh
+# 1. Active alarms: should be empty (or only your real ones)
+curl -s 'http://localhost:19999/api/v1/alarms?active' | jq '.alarms'
+
+# 2. Recidive ban list includes the IP
+sudo fail2ban-client status recidive
+
+# 3. Live request mix: the success share should climb above 50% within 1–2 minutes
+watch -n 5 'curl -s "http://localhost:19999/api/v1/data?chart=web_log_apache.requests_by_type&after=-60&points=1&format=json" | jq .'
+```
+
+---
+
+## 🧭 When NOT to apply this
+
+- If your site is an API or SPA backend that should have a 200-dominated traffic mix, the stock thresholds are correct; diagnose what's actually returning 301 instead of relaxing the alarm.
+- If 5xx errors are climbing in tandem with the success-ratio drop, retuning the `web_log_1m_successful` alarm will mask a real outage. **Always check `web_log_1m_internal_errors` first.**
+
+---
+
+## 📚 References
+
+- Netdata stock template: `/usr/lib/netdata/conf.d/health.d/web_log.conf`
+- Local override: `/etc/netdata/health.d/web_log.conf`
+- Netdata web_log Go module dimensions: `success`, `redirect`, `bad`, `error`, `other`
+- Related: [Custom Fail2ban Jail: Apache Directory Scanning](apache-dirscan-fail2ban-jail.md)
diff --git a/SUMMARY.md b/SUMMARY.md
index e1fafa9..a33e956 100644
--- a/SUMMARY.md
+++ b/SUMMARY.md
@@ -1,6 +1,6 @@
 ---
 created: 2026-04-02T16:03
-updated: 2026-05-05T23:39
+updated: 2026-05-08T01:08
 ---
 * [Home](index.md)
 * [Linux & Sysadmin](01-linux/index.md)
@@ -75,6 +75,7 @@ updated: 2026-05-05T23:39
  * [Tailscale SSH: Unexpected Re-Authentication Prompt](05-troubleshooting/networking/tailscale-ssh-reauth-prompt.md)
  * [Fail2ban & UFW Rule Bloat Cleanup](05-troubleshooting/networking/fail2ban-ufw-rule-bloat-cleanup.md)
  * [Custom Fail2ban Jail: Apache Directory Scanning](05-troubleshooting/security/apache-dirscan-fail2ban-jail.md)
+ * [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md)
  * [Nextcloud AIO Unhealthy 20h After Nightly Update](05-troubleshooting/docker/nextcloud-aio-unhealthy-20h-stuck.md)
  * [n8n Behind Reverse Proxy: X-Forwarded-For Trust Fix](05-troubleshooting/docker/n8n-proxy-trust-x-forwarded-for.md)
  * [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](05-troubleshooting/docker-caddy-selinux-post-reboot-recovery.md)
diff --git a/index.md b/index.md
index 8a080ef..da88f8e 100644
--- a/index.md
+++ b/index.md
@@ -1,13 +1,13 @@
 ---
 created: 2026-04-06T09:52
-updated: 2026-05-02T17:50
+updated: 2026-05-08T01:08
 ---
 # MajorLinux Tech Wiki — Index

 > A growing reference of Linux, self-hosting, open source, streaming, and troubleshooting guides. Written by MajorLinux. Used by MajorTwin.
 >
-> **Last updated:** 2026-05-02
-> **Article count:** 106
+> **Last updated:** 2026-05-08
+> **Article count:** 107

 ## Domains

@@ -17,7 +17,7 @@ updated: 2026-05-02T17:50
 | 🏠 Self-Hosting & Homelab | `02-selfhosting/` | 39 |
 | 🔓 Open Source Tools | `03-opensource/` | 10 |
 | 🎙️ Streaming & Podcasting | `04-streaming/` | 2 |
-| 🔧 General Troubleshooting | `05-troubleshooting/` | 43 |
+| 🔧 General Troubleshooting | `05-troubleshooting/` | 44 |

 ---

@@ -200,6 +200,7 @@ updated: 2026-05-02T17:50
 ### Security
 - [ClamAV Safe Scheduling on Live Servers](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md)
 - [Custom Fail2ban Jail: Apache Directory Scanning & Junk Methods](05-troubleshooting/security/apache-dirscan-fail2ban-jail.md)
+- [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md)

 ### Storage
 - [mdadm RAID Recovery After USB Hub Disconnect](05-troubleshooting/storage/mdadm-usb-hub-disconnect-recovery.md)
@@ -214,6 +215,7 @@ updated: 2026-05-02T17:50

 | Date | Article | Domain |
 |---|---|---|
+| 2026-05-08 | [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md) | Troubleshooting |
 | 2026-05-07 | [Mastodon — The `--prune-profiles` Trap and How to Recover](02-selfhosting/services/mastodon-prune-profiles-trap.md) | Self-Hosting |
 | 2026-05-02 | [WSL2 Backup via PowerShell Scheduled Task](01-linux/distro-specific/wsl2-backup-powershell.md) | Linux |
 | 2026-05-02 | [SSH Config and Key Management](01-linux/networking/ssh-config-key-management.md) | Linux |