Add: tuning Netdata web_log_1m_successful for redirect-heavy WordPress
The stock alarm definition counts only 1xx/2xx/304/401/429 as successful, which causes false CRITICALs on WP sites where 301 canonicalization is normal traffic (legacy /?p=NNNN, slug edits, host/TLS upgrades, etc.). Article documents the root cause, verification steps via the access log, and an in-place threshold retune that keeps the alarm useful as an "obvious meltdown" floor while delegating real outage detection to the 5xx and 4xx alarms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 306e5f1f16
commit 393df3cc45
3 changed files with 204 additions and 5 deletions

@@ -0,0 +1,196 @@
---
title: "Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites"
domain: troubleshooting
category: security
tags: [netdata, monitoring, wordpress, apache, fail2ban, alerts]
status: published
created: 2026-05-08
updated: 2026-05-08
---
# Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites

## 🛑 Problem

Netdata's stock `web_log_1m_successful` alarm fires CRITICAL on a perfectly healthy WordPress site whenever a crawler hammers legacy URLs. Example email/notification:

```
[CRITICAL] web_log_1m_successful = 54.1%
Ratio of successful HTTP requests over the last minute (1xx, 2xx, 304, 401, 429)
```

Meanwhile the front page returns HTTP 200, no 5xx errors are logged, and only a handful of 4xx noise hits appear. So why the alert?

---
## 🔬 Root Cause

The metric counts only these response codes and classes as **"successful"**:

```
1xx, 2xx, 304, 401, 429
```

**301 redirects are NOT counted as successful.** They land in the `redirect` dimension and pull the success ratio down.

WordPress sites generate large volumes of 301s as a normal part of life:

| Redirect source | Why a 301 |
|---|---|
| `/?p=NNNN` legacy shortlinks | Canonical URL rewrite to slug |
| Stale post slugs after permalink edits | Old → new path |
| `/feed` → `/feed/` | Trailing-slash normalization |
| `http://` → `https://` | TLS upgrade |
| `domain.com` ↔ `www.domain.com` | Host canonicalization |
| Proxy CONNECT probes (e.g. `www.instagram.com:443`) | Apache returns 301 to canonical host |

When a feed scraper or vulnerability crawler walks a long list of legacy `/?p=` URLs, **every single hit is a 301**. A short burst can push `success / total_requests` below the stock trigger levels of 80% (warn) or 65% (crit) within a single minute, even though the server is functioning perfectly.
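To make the ratio concrete, here is a minimal sketch that classifies status codes the way the alarm's info line describes (1xx, 2xx, 304, 401 and 429 count as success, everything else does not) and prints the resulting percentage. The helper name `success_ratio` is hypothetical, and this only approximates the documented definition; it is not Netdata's own code.

```sh
# success_ratio: read one HTTP status code per line on stdin and print the
# percentage that the alarm's info text would count as successful.
# Hypothetical helper; approximates the documented definition only.
success_ratio() {
  awk '
    /^[1-5][0-9][0-9]$/ {
      total++
      if ($1 < 300 || $1 == 304 || $1 == 401 || $1 == 429) ok++
    }
    END {
      if (total == 0) { print "no requests"; exit }
      printf "%.1f\n", ok * 100 / total
    }
  '
}

# Example: 100 normal page views plus a burst of 60 legacy-URL redirects.
awk 'BEGIN { for (i = 0; i < 100; i++) print 200; for (i = 0; i < 60; i++) print 301 }' \
  | success_ratio    # prints 62.5
```

In this made-up mix the server answers every request correctly, yet the ratio lands under the stock 65% level purely because of the redirects.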
### Verifying the cause

Pull the last few thousand lines of the access log and split by status code:

```sh
sudo tail -5000 /var/log/apache2/access.log | awk '{print $9}' | sort | uniq -c | sort -rn
```

If you see something like:

```
    196 200
    162 301
      1 405
      1 404
      1 400
```
…the math is `196 / (196+162+3) ≈ 54%`, which matches the alarm value almost exactly. **The alert is correct by its definition; the definition is wrong for this workload.**

Cross-check the source IPs:

```sh
sudo tail -2000 /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -10
```
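To see whether a busy client's traffic really is mostly redirects, a small sketch along these lines can tally 301s per IP. It assumes the Apache combined log layout that the commands above already rely on (client IP in field 1, status code in field 9); `redirects_by_ip` is a hypothetical helper name.

```sh
# redirects_by_ip: read access-log lines on stdin and print, per client IP,
# "<301 count> <total requests> <ip>", highest 301 count first.
# Assumes combined log format: field 1 = client IP, field 9 = status code.
redirects_by_ip() {
  awk '
    { total[$1]++ }
    $9 == 301 { redirects[$1]++ }
    END {
      for (ip in total) printf "%d %d %s\n", redirects[ip], total[ip], ip
    }
  ' | sort -k1,1nr -k2,2nr
}

# Usage (same log path as above):
#   sudo tail -2000 /var/log/apache2/access.log | redirects_by_ip | head -10
```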
If a single IP dominates (hundreds of requests in minutes) and most of its hits are 301 to legacy URLs, you have your culprit.

---
## ✅ Solution

Two parts: **retune the alarm thresholds** so normal redirect bursts don't trip it, and **block the abusive scraper** so it stops generating noise.

### 1. Retune `web_log_1m_successful` thresholds

Edit `/etc/netdata/health.d/web_log.conf` (this is a local override of the stock template). Locate the `template: web_log_1m_successful` block and replace its `warn`/`crit` lines:

```diff
template: web_log_1m_successful
on: web_log.type_requests
class: Workload
type: Web Server
component: Web log
lookup: sum -1m unaligned of success
calc: $this * 100 / $web_log_1m_requests
units: %
every: 10s
- warn: ($web_log_1m_requests > 120) ? ($this < (($status >= $WARNING ) ? ( 90 ) : ( 80 )) ) : ( 0 )
- crit: ($web_log_1m_requests > 120) ? ($this < (($status == $CRITICAL) ? ( 75 ) : ( 65 )) ) : ( 0 )
+ warn: ($web_log_1m_requests > 120) ? ($this < (($status >= $WARNING ) ? ( 50 ) : ( 40 )) ) : ( 0 )
+ crit: ($web_log_1m_requests > 120) ? ($this < (($status == $CRITICAL) ? ( 30 ) : ( 20 )) ) : ( 0 )
delay: up 2m down 15m multiplier 1.5 max 1h
summary: Web log successful
info: Ratio of successful HTTP requests over the last minute (1xx, 2xx, 304, 401, 429)
to: webmaster
```
Then reload Netdata health:

```sh
sudo netdatacli reload-health
```

Confirm the new thresholds are active (the URL is quoted so the shell does not treat `?` as a glob character):

```sh
curl -s 'http://localhost:19999/api/v1/alarms?all' \
  | jq -r '.alarms | to_entries[] | select(.value.name == "web_log_1m_successful") | .value.warn,.value.crit'
```

You should see the new `50/40` warn and `30/20` crit values.
### 2. Why the new thresholds make sense

The stock alarm assumes a low-redirect workload (typical SPA backend: lots of 200s, very few 301s). On a WP site with active permalink rewrites, expect routine ratios of 70–95% successful with occasional dips into the 50s during crawler bursts. The retuned alarm:

- **Warn at <40%** (clears at ≥50%): fires only once *most* responses fall outside the success set
- **Crit at <20%** (clears at ≥30%): fires only when the site is genuinely melting down (e.g., backend down, Apache returning 5xx for everything)

You haven't disabled the safety net; you've moved it below the floor of normal redirect-heavy noise.
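The two numbers in each expression act as a raise level and a clear level. As a rough illustration of how the retuned `warn` line behaves, the sketch below mimics one evaluation of it in plain shell. The function name `warn_raised` and its integer-only arguments are assumptions made for the example; Netdata evaluates the real expression itself.

```sh
# warn_raised: mimic the retuned warn expression for a single evaluation.
#   $1 = success ratio in percent (integer)
#   $2 = requests seen in the last minute
#   $3 = 1 if the alarm is already WARNING or worse, else 0
# Prints 1 if the warning condition holds, 0 otherwise.
# Hypothetical helper; note that the variables are global in POSIX sh.
warn_raised() {
  ratio=$1 requests=$2 already=$3
  if [ "$requests" -le 120 ]; then echo 0; return; fi
  if [ "$already" -eq 1 ]; then threshold=50; else threshold=40; fi
  if [ "$ratio" -lt "$threshold" ]; then echo 1; else echo 0; fi
}

# A crawler burst that dips to 45% does not raise the warning...
warn_raised 45 400 0    # prints 0
# ...but an alarm that is already raised stays raised until the ratio is back at 50% or more.
warn_raised 45 400 1    # prints 1
# Minutes with 120 requests or fewer never raise it.
warn_raised 10 100 1    # prints 0
```

The gap between the two levels is what keeps the alarm from flapping when the ratio hovers around a single cut-off.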
### 3. Lean on the right alarms for real outages

Two other web_log alarms remain stock and **are** the correct outage signals:

| Alarm | Catches | Default thresholds |
|---|---|---|
| `web_log_1m_internal_errors` | 5xx ratio | warn 2% / crit 5% |
| `web_log_1m_bad_requests` | 4xx (excl. 401, 429) | warn 30% |

Verify both are active and CLEAR after your retune:

```sh
curl -s 'http://localhost:19999/api/v1/alarms?all' \
  | jq -r '.alarms | to_entries[] | select(.value.name | test("web_log")) | "\(.value.status) | \(.value.name)"'
```
### 4. Block the abusive scraper

Identify the dominant offender from the IP list in *Verifying the cause* and ban it permanently via the recidive jail (assuming `bantime = -1` is set in `jail.local`):

```sh
sudo fail2ban-client set recidive banip 74.7.242.61
sudo fail2ban-client status recidive
```

With fail2ban's usual iptables/nftables ban actions, the IP is dropped at the firewall, so Apache no longer sees it and the redirect flood stops contributing to the ratio. If `bantime` is finite on your host, edit `/etc/fail2ban/jail.local`:

```ini
[recidive]
bantime = -1
findtime = 86400
maxretry = 3
```
---

## 🧪 Verification

After both changes:

```sh
# 1. Active alarms: should be empty (or only your real ones)
curl -s 'http://localhost:19999/api/v1/alarms?active' | jq '.alarms'

# 2. Recidive ban list includes the IP
sudo fail2ban-client status recidive

# 3. Live request mix: the success share should climb above 50% within 1–2 minutes
watch -n 5 "curl -s 'http://localhost:19999/api/v1/data?chart=web_log_apache.requests_by_type&after=-60&points=1&format=json' | jq ."
```
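The data query above returns raw per-dimension values rather than a percentage. If a single number is handier, a sketch like the following could compute the success share from the same endpoint. It assumes the API's `format=csv` output (a header row of dimension names followed by data rows) and a dimension literally named `success`; `live_success_ratio` is a hypothetical helper name. Pipe `curl -s` output for the same chart URL, with `format=csv` in place of `format=json`, into the function.

```sh
# live_success_ratio: read Netdata CSV output on stdin (header row of dimension
# names, then data rows) and print "success" as a percentage of all dimensions.
# Hypothetical helper; assumes a dimension named "success" in the header.
# Carriage returns and double quotes are stripped before parsing.
live_success_ratio() {
  awk -F',' '
    { sub(/\r$/, ""); gsub(/"/, "") }
    NR == 1 { for (i = 2; i <= NF; i++) if ($i == "success") col = i; next }
    NF > 1 {
      total = 0
      for (i = 2; i <= NF; i++) total += $i
      if (col && total > 0) printf "%.1f\n", $col * 100 / total
      else print "n/a"
    }
  '
}

# Example with made-up values for one sampled minute:
printf 'time,success,bad,redirect,error\n2026-05-08 01:08:00,6,1,3,0\n' \
  | live_success_ratio    # prints 60.0
```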
---

## 🧭 When NOT to apply this

- If your site is an API or SPA backend that should have a 200-dominated traffic mix, the stock thresholds are correct; diagnose what's actually returning 301 instead of relaxing the alarm.
- If 5xx errors are climbing in tandem with the success-ratio drop, retuning the `web_log_1m_successful` alarm will mask a real outage. **Always check `web_log_1m_internal_errors` first.**

---

## 📚 References

- Netdata stock template: `/usr/lib/netdata/conf.d/health.d/web_log.conf`
- Local override: `/etc/netdata/health.d/web_log.conf`
- Netdata web_log Go module dimensions: `success`, `redirect`, `bad`, `error`, `other`
- Related: [Custom Fail2ban Jail: Apache Directory Scanning](apache-dirscan-fail2ban-jail.md)

@@ -1,6 +1,6 @@
---
created: 2026-04-02T16:03
- updated: 2026-05-05T23:39
+ updated: 2026-05-08T01:08
---
* [Home](index.md)
* [Linux & Sysadmin](01-linux/index.md)
@@ -75,6 +75,7 @@ updated: 2026-05-05T23:39
* [Tailscale SSH: Unexpected Re-Authentication Prompt](05-troubleshooting/networking/tailscale-ssh-reauth-prompt.md)
* [Fail2ban & UFW Rule Bloat Cleanup](05-troubleshooting/networking/fail2ban-ufw-rule-bloat-cleanup.md)
* [Custom Fail2ban Jail: Apache Directory Scanning](05-troubleshooting/security/apache-dirscan-fail2ban-jail.md)
+ * [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md)
* [Nextcloud AIO Unhealthy 20h After Nightly Update](05-troubleshooting/docker/nextcloud-aio-unhealthy-20h-stuck.md)
* [n8n Behind Reverse Proxy: X-Forwarded-For Trust Fix](05-troubleshooting/docker/n8n-proxy-trust-x-forwarded-for.md)
* [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](05-troubleshooting/docker-caddy-selinux-post-reboot-recovery.md)

index.md (10 lines changed)
@@ -1,13 +1,13 @@
---
created: 2026-04-06T09:52
- updated: 2026-05-02T17:50
+ updated: 2026-05-08T01:08
---
# MajorLinux Tech Wiki — Index

> A growing reference of Linux, self-hosting, open source, streaming, and troubleshooting guides. Written by MajorLinux. Used by MajorTwin.
>
- > **Last updated:** 2026-05-02
- > **Article count:** 106
+ > **Last updated:** 2026-05-08
+ > **Article count:** 107

## Domains
@@ -17,7 +17,7 @@ updated: 2026-05-02T17:50
| 🏠 Self-Hosting & Homelab | `02-selfhosting/` | 39 |
| 🔓 Open Source Tools | `03-opensource/` | 10 |
| 🎙️ Streaming & Podcasting | `04-streaming/` | 2 |
- | 🔧 General Troubleshooting | `05-troubleshooting/` | 43 |
+ | 🔧 General Troubleshooting | `05-troubleshooting/` | 44 |

---
@@ -200,6 +200,7 @@ updated: 2026-05-02T17:50
### Security
- [ClamAV Safe Scheduling on Live Servers](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md)
- [Custom Fail2ban Jail: Apache Directory Scanning & Junk Methods](05-troubleshooting/security/apache-dirscan-fail2ban-jail.md)
+ - [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md)

### Storage
- [mdadm RAID Recovery After USB Hub Disconnect](05-troubleshooting/storage/mdadm-usb-hub-disconnect-recovery.md)
@@ -214,6 +215,7 @@ updated: 2026-05-02T17:50

| Date | Article | Domain |
|---|---|---|
+ | 2026-05-08 | [Tuning Netdata `web_log_1m_successful` for Redirect-Heavy WordPress Sites](05-troubleshooting/security/netdata-web-log-successful-redirect-heavy-tuning.md) | Troubleshooting |
| 2026-05-07 | [Mastodon — The `--prune-profiles` Trap and How to Recover](02-selfhosting/services/mastodon-prune-profiles-trap.md) | Self-Hosting |
| 2026-05-02 | [WSL2 Backup via PowerShell Scheduled Task](01-linux/distro-specific/wsl2-backup-powershell.md) | Linux |
| 2026-05-02 | [SSH Config and Key Management](01-linux/networking/ssh-config-key-management.md) | Linux |