diff --git a/_drafts/.keep b/0 similarity index 100% rename from _drafts/.keep rename to 0 diff --git a/01-linux/distro-specific/wsl2-rebuild-fedora43-training-env.md b/01-linux/distro-specific/wsl2-rebuild-fedora43-training-env.md index 601f840..b073d14 100644 --- a/01-linux/distro-specific/wsl2-rebuild-fedora43-training-env.md +++ b/01-linux/distro-specific/wsl2-rebuild-fedora43-training-env.md @@ -154,7 +154,7 @@ alias majorlab='ssh root@100.86.14.126' alias majormail='ssh root@100.84.165.52' alias teelia='ssh root@100.120.32.69' alias tttpod='ssh root@100.84.42.102' -alias majorrig='ssh -p 2222 majorlinux@100.98.47.29' +alias majorrig='ssh majorlinux@100.98.47.29' # port 2222 retired 2026-03-25, fleet uses port 22 # DNF5 alias update='sudo dnf upgrade --refresh' diff --git a/Network/overview.md b/02-selfhosting/dns-networking/network-overview.md similarity index 100% rename from Network/overview.md rename to 02-selfhosting/dns-networking/network-overview.md diff --git a/02-selfhosting/docker/docker-healthchecks.md b/02-selfhosting/docker/docker-healthchecks.md new file mode 100644 index 0000000..0db5929 --- /dev/null +++ b/02-selfhosting/docker/docker-healthchecks.md @@ -0,0 +1,157 @@ +--- +title: "Docker Healthchecks" +domain: selfhosting +category: docker +tags: [docker, healthcheck, monitoring, uptime-kuma, compose] +status: published +created: 2026-03-23 +updated: 2026-03-23 +--- + +# Docker Healthchecks + +A Docker healthcheck tells the daemon (and any monitoring tool) whether a container is actually working — not just running. Without one, a container shows as `Up` even if the app inside is crashed, deadlocked, or waiting on a dependency. + +## Why It Matters + +Tools like Uptime Kuma report containers without healthchecks as: + +> Container has not reported health and is currently running. As it is running, it is considered UP. Consider adding a health check for better service visibility. 
+ +A healthcheck upgrades that to a real `(healthy)` or `(unhealthy)` status, making monitoring meaningful. + +## Basic Syntax (docker-compose) + +```yaml +healthcheck: + test: ["CMD", "wget", "-q", "--spider", "http://localhost:8080/health"] + interval: 30s + timeout: 10s + retries: 3 + start_period: 30s +``` + +| Field | Description | +|---|---| +| `test` | Command to run. Exit 0 = healthy, non-zero = unhealthy. | +| `interval` | How often to run the check. | +| `timeout` | How long to wait before marking as failed. | +| `retries` | Failures before marking `unhealthy`. | +| `start_period` | Grace period on startup before failures count. | + +## Common Patterns + +### HTTP service (wget — available in Alpine) +```yaml +healthcheck: + test: ["CMD", "wget", "-q", "--spider", "http://localhost:2368/"] + interval: 30s + timeout: 10s + retries: 3 + start_period: 30s +``` + +### HTTP service (curl) +```yaml +healthcheck: + test: ["CMD", "curl", "-f", "http://localhost:8080/health"] + interval: 30s + timeout: 10s + retries: 3 + start_period: 30s +``` + +### MySQL / MariaDB +```yaml +healthcheck: + test: ["CMD", "mysqladmin", "ping", "-h", "localhost", "-u", "root", "-psecret"] + interval: 10s + timeout: 5s + retries: 3 + start_period: 20s +``` + +### PostgreSQL +```yaml +healthcheck: + test: ["CMD-SHELL", "pg_isready -U postgres"] + interval: 10s + timeout: 5s + retries: 5 +``` + +### Redis +```yaml +healthcheck: + test: ["CMD", "redis-cli", "ping"] + interval: 10s + timeout: 5s + retries: 3 +``` + +### TCP port check (no curl/wget available) +```yaml +healthcheck: + test: ["CMD-SHELL", "nc -z localhost 8080 || exit 1"] + interval: 30s + timeout: 5s + retries: 3 +``` + +## Using Healthchecks with `depends_on` + +Healthchecks enable proper startup ordering. 
Instead of a fixed sleep, a dependent container waits until its dependency is actually ready:

```yaml
services:
  app:
    depends_on:
      db:
        condition: service_healthy

  db:
    image: mysql:8.0
    healthcheck:
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 20s
```

This prevents the classic race condition where the app starts before the database is ready to accept connections.

## Checking Health Status

```bash
# See health status in container list
docker ps

# Get detailed health info including last check output
docker inspect --format='{{json .State.Health}}' <container_name> | jq
```

## Ghost Example

Ghost (Alpine-based) uses `wget` rather than `curl`:

```yaml
healthcheck:
  test: ["CMD", "wget", "-q", "--spider", "http://localhost:2368/ghost/api/v4/admin/site/"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 30s
```

## Gotchas & Notes

- **Alpine images** don't have `curl` by default — use `wget` or install curl in the image.
- **`start_period`** is critical for slow-starting apps (databases, JVM services). Failures during this window don't count toward `retries`.
- **`CMD` vs `CMD-SHELL`** — use `CMD` for direct exec (no shell needed), `CMD-SHELL` when you need pipes, `&&`, or shell builtins.
- **Uptime Kuma** will pick up Docker healthcheck status automatically when monitoring via the Docker socket — no extra config needed.
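Everything in the examples above reduces to one contract: the exit code of the `test` command. Exit 0 means healthy, anything else means unhealthy. A minimal sketch of that verdict logic in plain shell (no Docker required; `check` is a hypothetical helper, not a Docker command):

```shell
#!/bin/sh
# Docker's healthcheck verdict in one function: run the test command,
# report "healthy" on exit 0 and "unhealthy" on any other exit code.
check() {
  if "$@" >/dev/null 2>&1; then
    echo healthy
  else
    echo unhealthy
  fi
}

check true                             # prints "healthy"
check false                            # prints "unhealthy"
# CMD-SHELL equivalent: wrap the string in a shell so pipes and && work
check sh -c 'echo up | grep -q up'     # prints "healthy"
```

The `CMD-SHELL` line mirrors why Docker offers both forms: `sh -c` gets you pipes and builtins at the cost of needing a shell in the image.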
+ +## See Also + +- [[debugging-broken-docker-containers]] +- [[netdata-docker-health-alarm-tuning]] diff --git a/02-selfhosting/index.md b/02-selfhosting/index.md index 48300bf..282ff3e 100644 --- a/02-selfhosting/index.md +++ b/02-selfhosting/index.md @@ -24,6 +24,7 @@ Guides for running your own services at home, including Docker, reverse proxies, - [Tuning Netdata Web Log Alerts](monitoring/tuning-netdata-web-log-alerts.md) - [Tuning Netdata Docker Health Alarms](monitoring/netdata-docker-health-alarm-tuning.md) +- [Deploying Netdata to a New Server](monitoring/netdata-new-server-setup.md) ## Security diff --git a/02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md b/02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md index fe116ac..01a9e2f 100644 --- a/02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md +++ b/02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md @@ -5,7 +5,7 @@ category: monitoring tags: [netdata, docker, nextcloud, alarms, health, monitoring] status: published created: 2026-03-18 -updated: 2026-03-18 +updated: 2026-03-22 --- # Tuning Netdata Docker Health Alarms to Prevent Update Flapping @@ -40,7 +40,7 @@ component: Docker every: 30s lookup: average -5m of unhealthy warn: $this > 0 - delay: down 5m multiplier 1.5 max 30m + delay: up 3m down 5m multiplier 1.5 max 30m summary: Docker container ${label:container_name} health info: ${label:container_name} docker container health status is unhealthy to: sysadmin @@ -49,10 +49,38 @@ component: Docker | Setting | Default | Tuned | Effect | |---|---|---|---| | `every` | 10s | 30s | Check less frequently | -| `lookup` | average -10s | average -5m | Must be unhealthy for sustained 5 minutes | -| `delay` | none | down 5m (max 30m) | Grace period after recovery before clearing | +| `lookup` | average -10s | average -5m | Smooths transient unhealthy samples over 5 minutes | +| `delay: up 3m` | none | 3m | Won't fire until unhealthy condition persists for 3 
continuous minutes | +| `delay: down 5m` | none | 5m (max 30m) | Grace period after recovery before clearing | -A typical Nextcloud AIO update cycle (30–90 seconds of container restarts) won't sustain 5 minutes of unhealthy status, so no alert fires. A genuinely broken container will still be caught. +The `up` delay is the critical addition. Nextcloud AIO's `nextcloud-aio-nextcloud` container checks both PostgreSQL (port 5432) and PHP-FPM (port 9000). PHP-FPM takes ~90 seconds to warm up after a restart, causing 2–3 failing health checks before the container becomes healthy. With `delay: up 3m`, Netdata waits for 3 continuous minutes of unhealthy status before firing — absorbing the ~90 second startup window with margin to spare. A genuinely broken container will still trigger the alert. + +## Also: Suppress `docker_container_down` for Normally-Exiting Containers + +Nextcloud AIO runs `borgbackup` (scheduled backups) and `watchtower` (auto-updates) as containers that exit with code 0 after completing their work. The stock `docker_container_down` alarm fires on any exited container, generating false alerts after every nightly cycle. + +Add a second override to the same file using `chart labels` to exclude them: + +```ini +# Suppress docker_container_down for Nextcloud AIO containers that exit normally +# (borgbackup runs on schedule then exits; watchtower does updates then exits) +template: docker_container_down + on: docker.container_running_state + class: Errors + type: Containers +component: Docker + units: status + every: 30s + lookup: average -5m of down +chart labels: container_name=!nextcloud-aio-borgbackup !nextcloud-aio-watchtower * + warn: $this > 0 + delay: up 3m down 5m multiplier 1.5 max 30m + summary: Docker container ${label:container_name} down + info: ${label:container_name} docker container is down + to: sysadmin +``` + +The `chart labels` line uses Netdata's simple pattern syntax — `!` prefix excludes a container, `*` matches everything else. 
All other exited containers still alert normally. ## Applying the Config @@ -74,7 +102,7 @@ In the Netdata UI, navigate to **Alerts → Manage Alerts** and search for `dock ## Notes -- This only overrides the `docker_container_unhealthy` alarm. The `docker_container_down` alarm (for exited containers) is left at its default — it already has a `delay: down 1m` and is disabled by default (`chart labels: container_name=!*`). +- Both `docker_container_unhealthy` and `docker_container_down` are overridden in this config. Any container not explicitly excluded in the `chart labels` filter will still alert normally. - If you want per-container silencing instead of a blanket delay, use the `host labels` or `chart labels` filter to scope the alarm to specific containers. - Config volume path on majorlab: `/var/lib/docker/volumes/netdata_netdataconfig/_data/` diff --git a/02-selfhosting/monitoring/netdata-n8n-enriched-alerts.md b/02-selfhosting/monitoring/netdata-n8n-enriched-alerts.md new file mode 100644 index 0000000..f25536d --- /dev/null +++ b/02-selfhosting/monitoring/netdata-n8n-enriched-alerts.md @@ -0,0 +1,159 @@ +# Netdata → n8n Enriched Alert Emails + +**Status:** Live across all MajorsHouse fleet servers as of 2026-03-21 + +Replaces Netdata's plain-text alert emails with rich HTML emails that include a plain-English explanation, a suggested remediation command, and a direct link to the relevant MajorWiki article. + +--- + +## How It Works + +``` +Netdata alarm fires + → custom_sender() in health_alarm_notify.conf + → POST JSON payload to n8n webhook + → Code node enriches with suggestion + wiki link + → Send Email node sends HTML email via SMTP + → Respond node returns 200 OK +``` + +--- + +## n8n Workflow + +**Name:** Netdata Enriched Alerts +**URL:** https://n8n.majorshouse.com +**Webhook endpoint:** `POST https://n8n.majorshouse.com/webhook/netdata-alert` +**Workflow ID:** `a1b2c3d4-aaaa-bbbb-cccc-000000000001` + +### Nodes + +1. 
**Netdata Webhook** — receives POST from Netdata's `custom_sender()` +2. **Enrich Alert** — Code node; matches alarm/chart/family to enrichment table, builds HTML email body in `$json.emailBody` +3. **Send Enriched Email** — sends via SMTP port 465 (SMTP account 2), from `netdata@majorshouse.com` to `marcus@majorshouse.com` +4. **Respond OK** — returns `ok` with HTTP 200 to Netdata + +### Enrichment Keys + +The Code node matches on `alarm`, `chart`, or `family` field (case-insensitive substring): + +| Key | Title | Wiki Article | Notes | +|-----|-------|-------------|-------| +| `disk_space` | Disk Space Alert | snapraid-mergerfs-setup | | +| `ram` | Memory Alert | managing-linux-services-systemd-ansible | | +| `cpu` | CPU Alert | managing-linux-services-systemd-ansible | | +| `load` | Load Average Alert | managing-linux-services-systemd-ansible | | +| `net` | Network Alert | tailscale-homelab-remote-access | | +| `docker` | Docker Container Alert | debugging-broken-docker-containers | | +| `web_log` | Web Log Alert | tuning-netdata-web-log-alerts | Hostname-aware suggestion (see below) | +| `health` | Docker Health Alarm | netdata-docker-health-alarm-tuning | | +| `mdstat` | RAID Array Alert | mdadm-usb-hub-disconnect-recovery | | +| `systemd` | Systemd Service Alert | docker-caddy-selinux-post-reboot-recovery | | +| _(no match)_ | Server Alert | netdata-new-server-setup | | + +> [!info] web_log hostname-aware suggestion (updated 2026-03-24) +> The `web_log` suggestion branches on `hostname` in the Code node: +> - **`majorlab`** → Check `docker logs caddy` (Caddy reverse proxy) +> - **`teelia`, `majorlinux`, `dca`** → Check Apache logs + Fail2ban jail status +> - **other** → Generic web server log guidance + +--- + +## Netdata Configuration + +### Config File Locations + +| Server | Path | +|--------|------| +| majorhome, majormail, majordiscord, tttpod, teelia | `/etc/netdata/health_alarm_notify.conf` | +| majorlinux, majortoot, dca | 
`/usr/lib/netdata/conf.d/health_alarm_notify.conf` | + +### Required Settings + +```bash +DEFAULT_RECIPIENT_CUSTOM="n8n" +role_recipients_custom[sysadmin]="${DEFAULT_RECIPIENT_CUSTOM}" +``` + +### custom_sender() Function + +```bash +custom_sender() { + local to="${1}" + local payload + payload=$(jq -n \ + --arg hostname "${host}" \ + --arg alarm "${name}" \ + --arg chart "${chart}" \ + --arg family "${family}" \ + --arg status "${status}" \ + --arg old_status "${old_status}" \ + --arg value "${value_string}" \ + --arg units "${units}" \ + --arg info "${info}" \ + --arg alert_url "${goto_url}" \ + --arg severity "${severity}" \ + --arg raised_for "${raised_for}" \ + --arg total_warnings "${total_warnings}" \ + --arg total_critical "${total_critical}" \ + '{hostname:$hostname,alarm:$alarm,chart:$chart,family:$family,status:$status,old_status:$old_status,value:$value,units:$units,info:$info,alert_url:$alert_url,severity:$severity,raised_for:$raised_for,total_warnings:$total_warnings,total_critical:$total_critical}') + local httpcode + httpcode=$(docurl -s -o /dev/null -w "%{http_code}" \ + -X POST \ + -H "Content-Type: application/json" \ + -d "${payload}" \ + "https://n8n.majorshouse.com/webhook/netdata-alert") + if [ "${httpcode}" = "200" ]; then + info "sent enriched notification to n8n for ${status} of ${host}.${name}" + sent=$((sent + 1)) + else + error "failed to send notification to n8n, HTTP code: ${httpcode}" + fi +} +``` + +!!! note "jq required" + The `custom_sender()` function requires `jq` to be installed. Verify with `which jq` on each server. + +--- + +## Deploying to a New Server + +```bash +# 1. Find the config file +find /etc/netdata /usr/lib/netdata -name health_alarm_notify.conf 2>/dev/null + +# 2. Edit it — add the two lines and the custom_sender() function above + +# 3. 
Test connectivity from the server +curl -s -o /dev/null -w "%{http_code}" \ + -X POST https://n8n.majorshouse.com/webhook/netdata-alert \ + -H "Content-Type: application/json" \ + -d '{"hostname":"test","alarm":"disk_space._","status":"WARNING"}' +# Expected: 200 + +# 4. Restart Netdata +systemctl restart netdata + +# 5. Send a test alarm +/usr/libexec/netdata/plugins.d/alarm-notify.sh test custom +``` + +--- + +## Troubleshooting + +**Emails not arriving — check n8n execution log:** +Go to https://n8n.majorshouse.com → open "Netdata Enriched Alerts" → Executions tab. Look for `error` status entries. + +**Email body empty:** +The Send Email node's HTML field must be `={{ $json.emailBody }}`. Shell variable expansion can silently strip `$json` if the workflow is patched via inline SSH commands — always use a Python script file. + +**`000` curl response from a server:** +Usually a timeout, not a DNS or connection failure. Re-test with `--max-time 30`. + +**`custom_sender()` syntax error in Netdata logs:** +Bash heredocs don't work inside sourced config files. Use `jq -n --arg ...` as shown above — no heredocs. + +**n8n `N8N_TRUST_PROXY` must be set:** +Without `N8N_TRUST_PROXY=true` in the Docker environment, Caddy's `X-Forwarded-For` header causes n8n's rate limiter to abort requests before parsing the body. Set in `/opt/n8n/compose.yml`. 
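The `jq -n --arg` pattern inside `custom_sender()` can be exercised on any machine with `jq` installed, without touching Netdata — useful when chasing an empty-payload problem. The values below are dummies standing in for Netdata's `${host}`, `${name}`, and `${status}` variables:

```shell
#!/bin/sh
# Build the same JSON shape custom_sender() posts to the n8n webhook,
# using only jq. Dummy values stand in for Netdata's runtime variables.
payload=$(jq -n \
  --arg hostname "testhost" \
  --arg alarm "disk_space._" \
  --arg status "WARNING" \
  '{hostname:$hostname,alarm:$alarm,status:$status}')
echo "$payload"
# Round-trip one field to confirm the payload is well-formed JSON
echo "$payload" | jq -r '.hostname'    # prints "testhost"
```

If this prints valid JSON locally but n8n still receives an empty body, the problem is on the sending server (missing `jq`, or the config file was not the one Netdata actually loads).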
diff --git a/02-selfhosting/monitoring/netdata-new-server-setup.md b/02-selfhosting/monitoring/netdata-new-server-setup.md new file mode 100644 index 0000000..c91d6f9 --- /dev/null +++ b/02-selfhosting/monitoring/netdata-new-server-setup.md @@ -0,0 +1,161 @@ +--- +title: "Deploying Netdata to a New Server" +domain: selfhosting +category: monitoring +tags: [netdata, monitoring, email, notifications, netdata-cloud, ubuntu, debian, n8n] +status: published +created: 2026-03-18 +updated: 2026-03-22 +--- + +# Deploying Netdata to a New Server + +This covers the full Netdata setup for a new server in the fleet: install, email notification config, n8n webhook integration, and Netdata Cloud claim. Applies to Ubuntu/Debian servers. + +## 1. Install Prerequisites + +Install `jq` before anything else. It is required by the `custom_sender()` function in `health_alarm_notify.conf` to build the JSON payload sent to the n8n webhook. **If `jq` is missing, the webhook will fire with an empty body and n8n alert emails will have no information in them.** + +```bash +apt install -y jq +``` + +Verify: + +```bash +jq --version +``` + +## 2. Install Netdata + +Use the official kickstart script: + +```bash +wget -O /tmp/netdata-install.sh https://get.netdata.cloud/kickstart.sh +sh /tmp/netdata-install.sh --non-interactive --stable-channel --disable-telemetry +``` + +Verify it's running: + +```bash +systemctl is-active netdata +curl -s http://localhost:19999/api/v1/info | python3 -c "import sys,json; d=json.load(sys.stdin); print('Netdata', d['version'])" +``` + +## 3. 
Configure Email Notifications + +Copy the default config and set the three required values: + +```bash +cp /usr/lib/netdata/conf.d/health_alarm_notify.conf /etc/netdata/health_alarm_notify.conf +``` + +Edit `/etc/netdata/health_alarm_notify.conf`: + +```ini +EMAIL_SENDER="netdata@majorshouse.com" +SEND_EMAIL="YES" +DEFAULT_RECIPIENT_EMAIL="marcus@majorshouse.com" +``` + +Or apply with `sed` in one shot: + +```bash +sed -i 's/^#\?EMAIL_SENDER=.*/EMAIL_SENDER="netdata@majorshouse.com"/' /etc/netdata/health_alarm_notify.conf +sed -i 's/^#\?SEND_EMAIL=.*/SEND_EMAIL="YES"/' /etc/netdata/health_alarm_notify.conf +sed -i 's/^#\?DEFAULT_RECIPIENT_EMAIL=.*/DEFAULT_RECIPIENT_EMAIL="marcus@majorshouse.com"/' /etc/netdata/health_alarm_notify.conf +``` + +Restart and test: + +```bash +systemctl restart netdata +/usr/libexec/netdata/plugins.d/alarm-notify.sh test 2>&1 | grep -E '(OK|FAILED|email)' +``` + +You should see three `# OK` lines (WARNING → CRITICAL → CLEAR test cycle) and confirmation that email was sent to `marcus@majorshouse.com`. + +> [!note] Delivery via local Postfix +> Email is relayed through the server's local Postfix instance. Ensure Postfix is installed and `/usr/sbin/sendmail` resolves. + +## 4. Configure n8n Webhook Notifications + +Copy the `health_alarm_notify.conf` from an existing server (e.g. majormail) which contains the `custom_sender()` function. This sends enriched JSON payloads to the n8n webhook at `https://n8n.majorshouse.com/webhook/netdata-alert`. + +> [!warning] jq required +> The `custom_sender()` function uses `jq` to build the JSON payload. If `jq` is not installed, `payload` will be empty, curl will send `Content-Length: 0`, and n8n will produce alert emails with `Host: unknown`, blank alert/value fields, and `Status: UNKNOWN`. Always install `jq` first (Step 1). 
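The empty-payload failure mode described in the warning is easy to reproduce in plain shell: when the binary behind a command substitution is missing, the variable silently ends up empty. Here `fake-jq` is a deliberately nonexistent command standing in for a missing `jq`:

```shell
#!/bin/sh
# Simulate custom_sender() on a host without jq: the substitution fails
# quietly (exit 127, stderr suppressed) and $payload is left empty, so
# curl would POST a body with Content-Length: 0.
payload=$(fake-jq -n --arg h "majorhome" '{hostname:$h}' 2>/dev/null)
if [ -z "$payload" ]; then
  echo "empty payload - curl would send Content-Length: 0"
fi
```

This is exactly why the `Host: unknown` / `Status: UNKNOWN` emails appear: n8n receives a request with no body to enrich.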
After deploying the config, run a test to confirm the webhook fires correctly:

```bash
systemctl restart netdata
/usr/libexec/netdata/plugins.d/alarm-notify.sh test 2>&1 | grep -E '(custom|n8n|OK|FAILED)'
```

Verify in n8n that the latest execution shows a non-empty body with `hostname`, `alarm`, and `status` fields populated.

## 5. Claim to Netdata Cloud

Get the claim command from **Netdata Cloud → Space Settings → Nodes → Add Nodes**. It will look like:

```bash
wget -O /tmp/netdata-kickstart.sh https://get.netdata.cloud/kickstart.sh
sh /tmp/netdata-kickstart.sh --stable-channel \
  --claim-token <token> \
  --claim-rooms <room-id> \
  --claim-url https://app.netdata.cloud
```

Verify the claim was accepted:

```bash
cat /var/lib/netdata/cloud.d/claimed_id
```

A UUID will be present if claimed successfully. The node should appear in Netdata Cloud within ~60 seconds.

## 6. Verify Alerts

Check that no unexpected alerts are active after setup:

```bash
curl -s 'http://localhost:19999/api/v1/alarms?active' | python3 -c "
import sys, json
d = json.load(sys.stdin)
active = [v for v in d.get('alarms', {}).values() if v.get('status') not in ('CLEAR', 'UNINITIALIZED', 'UNDEFINED')]
print(f'{len(active)} active alert(s)')
for v in active:
    print(f'  [{v[\"status\"]}] {v[\"name\"]} on {v[\"chart\"]}')
"
```

## Fleet-wide Alert Check

To audit all servers at once (requires Tailscale SSH access):

```bash
for host in majorlab majorhome majormail majordiscord majortoot majorlinux tttpod dca teelia; do
  echo "=== $host ==="
  ssh root@$host "curl -s 'http://localhost:19999/api/v1/alarms?active' | python3 -c \
    \"import sys,json; d=json.load(sys.stdin); active=[v for v in d.get('alarms',{}).values() if v.get('status') not in ('CLEAR','UNINITIALIZED','UNDEFINED')]; print(str(len(active))+' active')\""
done
```

## Fleet-wide jq Audit

To check that all servers with `custom_sender` have `jq` installed:

```bash
for host in majorlab majorhome majormail majordiscord majortoot majorlinux tttpod dca teelia; do
  echo -n "=== $host: "
  ssh -o ConnectTimeout=5 root@$host \
    'has_cs=$(grep -l "custom_sender\|n8n.majorshouse.com" /etc/netdata/health_alarm_notify.conf 2>/dev/null | wc -l); has_jq=$(which jq >/dev/null 2>&1 && echo yes || echo NO); echo "custom_sender=$has_cs jq=$has_jq"'
done
```

Any server showing `custom_sender=1 jq=NO` needs `apt install -y jq` immediately.

## Related

- [Tuning Netdata Web Log Alerts](tuning-netdata-web-log-alerts.md)
- [Tuning Netdata Docker Health Alarms](netdata-docker-health-alarm-tuning.md)

diff --git a/02-selfhosting/monitoring/netdata-selinux-avc-chart.md b/02-selfhosting/monitoring/netdata-selinux-avc-chart.md
new file mode 100644
index 0000000..5f65823
--- /dev/null
+++ b/02-selfhosting/monitoring/netdata-selinux-avc-chart.md
@@ -0,0 +1,137 @@
---
title: "Netdata SELinux AVC Denial Monitoring"
domain: selfhosting
category: monitoring
tags: [netdata, selinux, fedora, monitoring, ausearch, charts.d]
status: published
created: 2026-03-27
updated: 2026-03-27
---

# Netdata SELinux AVC Denial Monitoring

A custom `charts.d` plugin that tracks SELinux AVC denials over time via Netdata. Deployed on all Fedora boxes in the fleet where SELinux is Enforcing.

## What It Does

The plugin runs `ausearch -m avc` every 60 seconds and reports the count of AVC denial events from the last 10 minutes. This gives a real-time chart in Netdata Cloud showing SELinux denial spikes — useful for catching misconfigurations after service changes or package updates.
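The measurement itself is just `grep -c` over `ausearch` output. A self-contained sketch of the counting step, with a two-denial sample standing in for real `ausearch -m avc -ts recent` output:

```shell
#!/bin/sh
# Count AVC denial records the same way the plugin does: one matching
# line per "type=AVC" event in the ausearch output.
sample='type=AVC msg=audit(1742970000.123:456): avc:  denied  { read }
----
type=AVC msg=audit(1742970060.456:789): avc:  denied  { write }'
count=$(printf '%s\n' "$sample" | grep -c "type=AVC")
echo "denials=${count}"    # prints "denials=2"
```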
## Where It's Deployed

| Host | OS | SELinux | Chart Installed |
|------|----|---------|-----------------|
| majorhome | Fedora 43 | Enforcing | Yes |
| majorlab | Fedora 43 | Enforcing | Yes |
| majormail | Fedora 43 | Enforcing | Yes |
| majordiscord | Fedora 43 | Enforcing | Yes |

Ubuntu hosts (dca, teelia, tttpod, majortoot, majorlinux) do not run SELinux and do not have this chart.

## Installation

### 1. Create the Chart Plugin

Create `/etc/netdata/charts.d/selinux.chart.sh`. The chart definition emitted by `selinux_create` follows the standard charts.d `CHART`/`DIMENSION` format; the `ausearch` invocation matches the sudoers entry in step 2:

```bash
cat > /etc/netdata/charts.d/selinux.chart.sh << 'EOF'
# SELinux AVC denial counter for Netdata charts.d
selinux_update_every=60
selinux_priority=90000

selinux_check() {
  which ausearch >/dev/null 2>&1 || return 1
  return 0
}

selinux_create() {
  # Emit the chart definition in the charts.d CHART/DIMENSION format
  cat << CHARTEOF
CHART selinux.avc_denials '' "SELinux AVC Denials (last 10 min)" "denials" selinux selinux.avc_denials line ${selinux_priority} ${selinux_update_every}
DIMENSION denials '' absolute 1 1
CHARTEOF
  return 0
}

selinux_update() {
  local count
  # Count AVC records from the last 10 minutes (same command as the sudoers entry)
  count=$(sudo /usr/bin/ausearch -m avc -if /var/log/audit/audit.log -ts recent 2>/dev/null | grep -c "type=AVC")
  echo "BEGIN selinux.avc_denials $1"
  echo "SET denials = ${count}"
  echo "END"
  return 0
}
EOF
```

### 2. Grant Netdata Sudo Access to ausearch

`ausearch` requires root to read the audit log. Add a sudoers entry for the `netdata` user:

```bash
echo 'netdata ALL=(root) NOPASSWD: /usr/bin/ausearch -m avc -if /var/log/audit/audit.log -ts recent' > /etc/sudoers.d/netdata-selinux
chmod 440 /etc/sudoers.d/netdata-selinux
visudo -c
```

The `visudo -c` validates syntax. If it reports errors, fix the file before proceeding — a broken sudoers file can lock out sudo entirely.

### 3. Restart Netdata

```bash
systemctl restart netdata
```

### 4. Verify

Check that the chart is collecting data:

```bash
curl -s 'http://localhost:19999/api/v1/chart?chart=selinux.avc_denials' | python3 -c "
import sys, json
d = json.load(sys.stdin)
print(f'Chart: {d[\"id\"]}')
print(f'Update every: {d[\"update_every\"]}s')
print(f'Type: {d[\"chart_type\"]}')
"
```

If the chart doesn't appear, check that `charts.d` is enabled in `/etc/netdata/netdata.conf` and that the plugin file is readable by the `netdata` user.
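One gotcha worth knowing here: a sudoers `NOPASSWD` rule that lists arguments matches the invoked command line literally, so the plugin's `ausearch` call must stay byte-identical to the sudoers entry or `sudo` will prompt and the chart will silently collect zeros. A string-level sanity check, with both command lines stubbed into shell variables:

```shell
#!/bin/sh
# Sudoers command specs with arguments are literal matches. Compare the
# command the plugin runs against the one the sudoers entry permits.
allowed='/usr/bin/ausearch -m avc -if /var/log/audit/audit.log -ts recent'
invoked='/usr/bin/ausearch -m avc -if /var/log/audit/audit.log -ts recent'
if [ "$allowed" = "$invoked" ]; then
  echo "match - sudo will not prompt"
else
  echo "mismatch - sudo would prompt or deny"
fi
```

If you ever change the plugin's flags (say, a different `-ts` window), update `/etc/sudoers.d/netdata-selinux` in the same commit.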
## Known Side Effect: pam_systemd Log Noise

Because the `netdata` user calls `sudo ausearch` every 60 seconds, `pam_systemd` logs a warning each time:

```
pam_systemd(sudo:session): Failed to check if /run/user/0/bus exists, ignoring: Permission denied
```

This is cosmetic. The `sudo` command succeeds — `pam_systemd` just can't find a D-Bus user session for the `netdata` service account, which is expected. The message volume scales with the collection interval (1,440/day at 60-second intervals).

**To suppress it**, no PAM changes are needed: the `system-auth` PAM config on Fedora already marks `pam_systemd.so` as `-session optional` (the `-` prefix means "don't fail if the module errors"). The messages are informational log noise, not actual failures.

If the log volume is a concern for log analysis or monitoring, filter it at the syslog level with an rsyslog rule:

```ini
# /etc/rsyslog.d/suppress-pam-systemd.conf
:msg, contains, "pam_systemd(sudo:session): Failed to check" stop
```

Or in Netdata's log alert config, exclude the pattern from any log-based alerts.
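The rsyslog `:msg, contains, ...` filter is a plain substring match, so the pattern can be checked offline with `grep -F` before deploying the rule:

```shell
#!/bin/sh
# Verify the suppression pattern actually matches the noisy log line.
msg='pam_systemd(sudo:session): Failed to check if /run/user/0/bus exists, ignoring: Permission denied'
pattern='pam_systemd(sudo:session): Failed to check'
if printf '%s\n' "$msg" | grep -qF "$pattern"; then
  echo "suppressed"    # rsyslog would drop this message
else
  echo "kept"
fi
```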
+ +## Fleet Audit + +To verify the chart is deployed and functioning on all Fedora hosts: + +```bash +for host in majorhome majorlab majormail majordiscord; do + echo -n "=== $host: " + ssh root@$host "curl -s 'http://localhost:19999/api/v1/chart?chart=selinux.avc_denials' 2>/dev/null | python3 -c 'import sys,json; d=json.load(sys.stdin); print(d[\"id\"], \"every\", str(d[\"update_every\"])+\"s\")' 2>/dev/null || echo 'NOT FOUND'" +done +``` + +## Related + +- [Deploying Netdata to a New Server](netdata-new-server-setup.md) +- [Tuning Netdata Web Log Alerts](tuning-netdata-web-log-alerts.md) +- [Tuning Netdata Docker Health Alarms](netdata-docker-health-alarm-tuning.md) +- [SELinux: Fixing Dovecot Mail Spool Context](/05-troubleshooting/selinux-dovecot-vmail-context.md) diff --git a/05-troubleshooting/ansible-vault-password-file-missing.md b/05-troubleshooting/ansible-vault-password-file-missing.md new file mode 100644 index 0000000..2686506 --- /dev/null +++ b/05-troubleshooting/ansible-vault-password-file-missing.md @@ -0,0 +1,59 @@ +# Ansible: Vault Password File Not Found + +## Error + +``` +[WARNING]: Error getting vault password file (default): The vault password file /Users/majorlinux/.ansible/vault_pass was not found +[ERROR]: The vault password file /Users/majorlinux/.ansible/vault_pass was not found +``` + +## Cause + +Ansible is configured to look for a vault password file at `~/.ansible/vault_pass`, but the file does not exist. This is typically set in `ansible.cfg` via the `vault_password_file` directive. 
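The check Ansible performs here is a simple file-existence test, which you can reproduce before picking a fix. A sketch using a throwaway path (`mktemp -u` generates a name without creating the file, so the first check reports missing; `check_vault` is a hypothetical helper):

```shell
#!/bin/sh
# Reproduce Ansible's vault_password_file lookup as a plain existence check.
check_vault() {
  if [ -f "$1" ]; then echo present; else echo missing; fi
}

vault_file=$(mktemp -u)
check_vault "$vault_file"        # prints "missing" - Ansible would error here
echo 'demo-password' > "$vault_file"
chmod 600 "$vault_file"          # keep it readable by your user only
check_vault "$vault_file"        # prints "present"
rm -f "$vault_file"
```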
+ +## Solutions + +### Option 1: Remove the vault config (if you're not using Vault) + +Check your `ansible.cfg` for this line and remove it if Vault is not needed: + +```ini +[defaults] +vault_password_file = ~/.ansible/vault_pass +``` + +### Option 2: Create the vault password file + +```bash +echo 'your_vault_password' > ~/.ansible/vault_pass +chmod 600 ~/.ansible/vault_pass +``` + +> **Security note:** Keep permissions tight (`600`) so only your user can read the file. The actual vault password is stored in Bitwarden under the "Ansible Vault Password" entry. + +### Option 3: Pass the password at runtime (no file needed) + +```bash +ansible-playbook test.yml --ask-vault-pass +``` + +## Diagnosing the Source of the Config + +To find which config file is setting `vault_password_file`, run: + +```bash +ansible-config dump --only-changed +``` + +This shows all non-default config values and their source files. Config is loaded in this order of precedence: + +1. `ANSIBLE_CONFIG` environment variable +2. `./ansible.cfg` (current directory) +3. `~/.ansible.cfg` +4. `/etc/ansible/ansible.cfg` + +## Related + +- [Ansible Getting Started](../01-linux/shell-scripting/ansible-getting-started.md) +- Vault password is stored in Bitwarden under **"Ansible Vault Password"** +- Ansible playbooks live at `~/MajorAnsible` on MajorAir/MajorMac diff --git a/05-troubleshooting/index.md b/05-troubleshooting/index.md index 8de9d8e..d58a20a 100644 --- a/05-troubleshooting/index.md +++ b/05-troubleshooting/index.md @@ -9,6 +9,7 @@ Practical fixes for common Linux, networking, and application problems. 
- [Apache Outage: Fail2ban Self-Ban + Missing iptables Rules](networking/fail2ban-self-ban-apache-outage.md) - [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](networking/fail2ban-imap-self-ban-mail-client.md) - [firewalld: Mail Ports Wiped After Reload](networking/firewalld-mail-ports-reset.md) +- [Tailscale SSH: Unexpected Re-Authentication Prompt](networking/tailscale-ssh-reauth-prompt.md) - [ISP SNI Filtering & Caddy](isp-sni-filtering-caddy.md) - [yt-dlp YouTube JS Challenge Fix](yt-dlp-fedora-js-challenge.md) diff --git a/05-troubleshooting/networking/tailscale-ssh-reauth-prompt.md b/05-troubleshooting/networking/tailscale-ssh-reauth-prompt.md new file mode 100644 index 0000000..bc8638d --- /dev/null +++ b/05-troubleshooting/networking/tailscale-ssh-reauth-prompt.md @@ -0,0 +1,66 @@ +# Tailscale SSH: Unexpected Re-Authentication Prompt + +If a Tailscale SSH connection unexpectedly presents a browser authentication URL mid-session, the first instinct is to check the ACL policy. However, this is often a one-off Tailscale hiccup rather than a misconfiguration. + +## Symptoms + +- SSH connection to a fleet node displays a Tailscale auth URL: + ``` + To authenticate, visit: https://login.tailscale.com/a/xxxxxxxx + ``` +- The prompt appears even though the node worked fine previously +- Other nodes in the fleet connect without prompting + +## What Causes It + +Tailscale SSH supports two ACL `action` values: + +| Action | Behavior | +|---|---| +| `accept` | Trusts Tailscale identity — no additional auth required | +| `check` | Requires periodic browser-based re-authentication | + +If `action: "check"` is set, every session (or after token expiry) will prompt for browser auth. However, even with `action: "accept"`, a one-off prompt can appear due to a Tailscale daemon glitch or key refresh event. + +## How to Diagnose + +### 1. Verify the ACL policy + +In the Tailscale admin console (or via `tailscale debug acl`), inspect the SSH rules. 
For a trusted homelab fleet, the rule should use `accept`:

```json
{
  "src": ["autogroup:member"],
  "dst": ["autogroup:self"],
  "users": ["autogroup:nonroot", "root"],
  "action": "accept"
}
```

If `action` is `check`, that is the root cause — change it to `accept` for trusted source/destination pairs.

### 2. Confirm it was a one-off

If the ACL already shows `accept`, the prompt was transient. Test with:

```bash
ssh <host> "echo ok"
```

No auth prompt + `ok` output = resolved. Note that this test is only meaningful if the previous session's auth token has expired, or you test from a different device that hasn't recently authenticated.

## Fix

**If ACL shows `check`:** Change to `accept` in the Tailscale admin console under Access Controls. Takes effect immediately — no server changes needed.

**If ACL already shows `accept`:** No action required. The prompt was a one-off Tailscale event (daemon restart, key refresh, etc.). Monitor for recurrence.

## Notes

- ~~Port 2222 on **MajorRig** previously existed as a hard bypass for Tailscale SSH browser auth. This workaround was retired on 2026-03-25 after the Tailscale SSH authentication issue was resolved. The entire fleet now uses port 22 uniformly.~~
- The `autogroup:self` destination means the rule applies when connecting from your own devices to your own devices — appropriate for a personal homelab fleet.
+ +## Related + +- [[Network Overview]] — Tailscale fleet inventory and SSH access model +- [[SSH-Aliases]] — Fleet SSH access shortcuts diff --git a/05-troubleshooting/networking/windows-sshd-stops-after-reboot.md b/05-troubleshooting/networking/windows-sshd-stops-after-reboot.md index 0227238..9c02095 100644 --- a/05-troubleshooting/networking/windows-sshd-stops-after-reboot.md +++ b/05-troubleshooting/networking/windows-sshd-stops-after-reboot.md @@ -48,7 +48,7 @@ The Windows OpenSSH Server is installed as a Windows Feature (`Add-WindowsCapabi - **This is a Windows-side issue** — WSL2 itself is unaffected. The service must be started and configured from Windows, not from within WSL2. - **Elevated PowerShell required** — `Start-Service` and `Set-Service` for sshd will return "Access is denied" if run without Administrator privileges. -- **Port 2222 is also affected** — both the standard port 22 and the bypass port 2222 on MajorRig are served by the same `sshd` service. +- **Port 2222 was retired (2026-03-25)** — the bypass port 2222 on MajorRig is no longer in use. The entire fleet now uses port 22 uniformly after the Tailscale SSH auth fix. Only port 22 needs to be verified when troubleshooting sshd. - **Default shell still works once fixed** — MajorRig's sshd is configured to use `C:\Windows\System32\wsl.exe` as the default shell, dropping SSH sessions directly into WSL2/Bash. This config is preserved across service restarts. --- diff --git a/05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md b/05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md new file mode 100644 index 0000000..a0480f0 --- /dev/null +++ b/05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md @@ -0,0 +1,73 @@ +# ClamAV Safe Scheduling on Live Servers + +Running `clamscan` unthrottled on a live server will peg CPU until completion. On a small VPS (1 vCPU), a full recursive scan can sustain 70–100% CPU for an hour or more, degrading or taking down hosted services. 
+
+## The Problem
+
+A common out-of-the-box ClamAV cron setup looks like this:
+
+```cron
+0 1 * * 0 clamscan --infected --recursive / --exclude=/sys
+```
+
+This runs at Linux's default scheduling priority (`nice 0`) with normal I/O priority. On a live server it will:
+
+- Monopolize the CPU for the scan duration
+- Cause high I/O wait, degrading web serving, databases, and other services
+- Trigger monitoring alerts (e.g., Netdata `10min_cpu_usage`)
+
+## The Fix
+
+Throttle the scan with `nice` and `ionice`:
+
+```cron
+0 1 * * 0 nice -n 19 ionice -c 3 clamscan --infected --recursive / --exclude=/sys
+```
+
+| Flag | Meaning |
+|------|---------|
+| `nice -n 19` | Lowest CPU scheduling priority (range: -20 to 19) |
+| `ionice -c 3` | Idle I/O class — only uses disk when no other process needs it |
+
+The scan takes longer, but CPU and disk are yielded to other workloads whenever they need them.
+
+## Applying the Fix
+
+Edit root's crontab:
+
+```bash
+crontab -e
+```
+
+Or apply non-interactively:
+
+```bash
+crontab -l | sed 's|clamscan|nice -n 19 ionice -c 3 clamscan|' | crontab -
+```
+
+Verify:
+
+```bash
+crontab -l | grep clam
+```
+
+## Diagnosing a Runaway Scan
+
+If CPU is already pegged, identify and kill the process:
+
+```bash
+ps aux --sort=-%cpu | head -15
+# Note the clamscan PID from the output, then:
+kill <PID>
+```
+
+## Notes
+
+- `ionice -c 3` (Idle) requires Linux kernel ≥ 2.6.13. The idle class is only fully enforced by I/O schedulers that support priorities (CFQ, or BFQ on modern kernels); under `mq-deadline` or `none` it has little effect, though `nice` still applies.
+- On multi-core servers, consider also using `cpulimit` for a hard cap: `cpulimit -l 30 -- clamscan ...`
+- Always keep `--exclude=/sys` (and optionally `--exclude=/proc`, `--exclude=/dev`) to avoid scanning virtual filesystems.
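After updating the crontab, it is worth confirming the wrapper actually lowers priority. A minimal sketch using `sleep` as a harmless stand-in for `clamscan` (safe to run anywhere; no scan is started):

```shell
# Sketch: verify that a nice-wrapped command really runs at the lowest
# CPU priority. sleep stands in for clamscan, so this touches nothing.
nice -n 19 sleep 30 &
pid=$!

# ps reports the nice value of the running process; expect 19.
ps -o ni= -p "$pid" | tr -d ' '

kill "$pid" 2>/dev/null
```

For the I/O side, `ionice -p <pid>` reports `idle` for a process started under `ionice -c 3`.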
+ +## Related + +- [ClamAV Documentation](https://docs.clamav.net/) +- [[02-selfhosting/security/linux-server-hardening-checklist|Linux Server Hardening Checklist]] diff --git a/MajorWiki-Deploy-Status.md b/MajorWiki-Deploy-Status.md index 6929b5d..407453f 100644 --- a/MajorWiki-Deploy-Status.md +++ b/MajorWiki-Deploy-Status.md @@ -128,7 +128,7 @@ Every time a new article is added, the following **MUST** be updated to maintain **Updated:** `updated: 2026-03-17` -## Session Update — 2026-03-18 +## Session Update — 2026-03-18 (morning) **Article count:** 48 (was 47) @@ -136,3 +136,12 @@ Every time a new article is added, the following **MUST** be updated to maintain - `02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md` — tuning docker_container_unhealthy alarm to prevent flapping during Nextcloud AIO updates **Updated:** `updated: 2026-03-18` + +## Session Update — 2026-03-18 (afternoon) + +**Article count:** 49 (was 48) + +**New articles added:** +- `02-selfhosting/monitoring/netdata-new-server-setup.md` — full Netdata deployment guide: install via kickstart.sh, email notification config, Netdata Cloud claim + +**Updated:** `updated: 2026-03-18` diff --git a/README.md b/README.md index 32da4a2..68101e3 100644 --- a/README.md +++ b/README.md @@ -3,14 +3,14 @@ > A growing reference of Linux, self-hosting, open source, streaming, and troubleshooting guides. Written by MajorLinux. Used by MajorTwin. 
> **Last updated:** 2026-03-18 -**Article count:** 48 +**Article count:** 49 ## Domains | Domain | Folder | Articles | |---|---|---| | 🐧 Linux & Sysadmin | `01-linux/` | 11 | -| 🏠 Self-Hosting & Homelab | `02-selfhosting/` | 10 | +| 🏠 Self-Hosting & Homelab | `02-selfhosting/` | 11 | | 🔓 Open Source Tools | `03-opensource/` | 9 | | 🎙️ Streaming & Podcasting | `04-streaming/` | 2 | | 🔧 General Troubleshooting | `05-troubleshooting/` | 16 | @@ -65,6 +65,7 @@ ### Monitoring - [Tuning Netdata Web Log Alerts](02-selfhosting/monitoring/tuning-netdata-web-log-alerts.md) — tuning web_log_1m_redirects threshold for HTTPS-forcing servers - [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md) — preventing false alerts during nightly Nextcloud AIO container update cycles +- [Deploying Netdata to a New Server](02-selfhosting/monitoring/netdata-new-server-setup.md) — install, email notifications, and Netdata Cloud claim for Ubuntu/Debian servers ### Security - [Linux Server Hardening Checklist](02-selfhosting/security/linux-server-hardening-checklist.md) — non-root user, SSH key auth, sshd_config, firewall, fail2ban @@ -129,6 +130,7 @@ | Date | Article | Domain | |---|---|---| +| 2026-03-18 | [Deploying Netdata to a New Server](02-selfhosting/monitoring/netdata-new-server-setup.md) | Self-Hosting | | 2026-03-18 | [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md) | Self-Hosting | | 2026-03-17 | [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md) | Troubleshooting | | 2026-03-17 | [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md) | Troubleshooting | diff --git a/SUMMARY.md b/SUMMARY.md index 3590afc..0b709a5 100644 --- a/SUMMARY.md +++ b/SUMMARY.md @@ -15,11 +15,13 @@ * [Self-Hosting Starter Guide](02-selfhosting/docker/self-hosting-starter-guide.md) * 
[Docker vs VMs for the Homelab](02-selfhosting/docker/docker-vs-vms-homelab.md) * [Debugging Broken Docker Containers](02-selfhosting/docker/debugging-broken-docker-containers.md) + * [Docker Healthchecks](02-selfhosting/docker/docker-healthchecks.md) * [Setting Up Caddy as a Reverse Proxy](02-selfhosting/reverse-proxy/setting-up-caddy-reverse-proxy.md) * [Tailscale for Homelab Remote Access](02-selfhosting/dns-networking/tailscale-homelab-remote-access.md) * [rsync Backup Patterns](02-selfhosting/storage-backup/rsync-backup-patterns.md) * [Tuning Netdata Web Log Alerts](02-selfhosting/monitoring/tuning-netdata-web-log-alerts.md) * [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md) + * [Deploying Netdata to a New Server](02-selfhosting/monitoring/netdata-new-server-setup.md) * [Linux Server Hardening Checklist](02-selfhosting/security/linux-server-hardening-checklist.md) * [Standardizing unattended-upgrades with Ansible](02-selfhosting/security/ansible-unattended-upgrades-fleet.md) * [Open Source & Alternatives](03-opensource/index.md) @@ -39,6 +41,7 @@ * [Apache Outage: Fail2ban Self-Ban + Missing iptables Rules](05-troubleshooting/networking/fail2ban-self-ban-apache-outage.md) * [Mail Client Stops Receiving: Fail2ban IMAP Self-Ban](05-troubleshooting/networking/fail2ban-imap-self-ban-mail-client.md) * [firewalld: Mail Ports Wiped After Reload](05-troubleshooting/networking/firewalld-mail-ports-reset.md) + * [Tailscale SSH: Unexpected Re-Authentication Prompt](05-troubleshooting/networking/tailscale-ssh-reauth-prompt.md) * [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](05-troubleshooting/docker-caddy-selinux-post-reboot-recovery.md) * [ISP SNI Filtering with Caddy](05-troubleshooting/isp-sni-filtering-caddy.md) * [Obsidian Vault Recovery — Loading Cache Hang](05-troubleshooting/obsidian-cache-hang-recovery.md) @@ -51,3 +54,6 @@ * [mdadm RAID Recovery After USB Hub 
Disconnect](05-troubleshooting/storage/mdadm-usb-hub-disconnect-recovery.md) * [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md) * [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md) + * [ClamAV CPU Spike: Safe Scheduling with nice/ionice](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md) + * [Ansible: Vault Password File Not Found](05-troubleshooting/ansible-vault-password-file-missing.md) + diff --git a/index.md b/index.md index 30a2d1d..6378d93 100644 --- a/index.md +++ b/index.md @@ -2,18 +2,20 @@ > A growing reference of Linux, self-hosting, open source, streaming, and troubleshooting guides. Written by MajorLinux. Used by MajorTwin. > -> **Last updated:** 2026-03-18 -> **Article count:** 48 +> **Last updated:** 2026-03-23 +> **Article count:** 50 + ## Domains | Domain | Folder | Articles | |---|---|---| | 🐧 Linux & Sysadmin | `01-linux/` | 11 | -| 🏠 Self-Hosting & Homelab | `02-selfhosting/` | 10 | +| 🏠 Self-Hosting & Homelab | `02-selfhosting/` | 11 | | 🔓 Open Source Tools | `03-opensource/` | 9 | | 🎙️ Streaming & Podcasting | `04-streaming/` | 2 | -| 🔧 General Troubleshooting | `05-troubleshooting/` | 16 | +| 🔧 General Troubleshooting | `05-troubleshooting/` | 17 | + --- @@ -65,6 +67,7 @@ ### Monitoring - [Tuning Netdata Web Log Alerts](02-selfhosting/monitoring/tuning-netdata-web-log-alerts.md) — tuning web_log_1m_redirects threshold for HTTPS-forcing servers - [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md) — preventing false alerts during nightly Nextcloud AIO container update cycles +- [Deploying Netdata to a New Server](02-selfhosting/monitoring/netdata-new-server-setup.md) — install, email notifications, and Netdata Cloud claim for Ubuntu/Debian servers ### Security - [Linux Server Hardening Checklist](02-selfhosting/security/linux-server-hardening-checklist.md) — 
non-root user, SSH key auth, sshd_config, firewall, fail2ban @@ -122,6 +125,8 @@ - [mdadm RAID Recovery After USB Hub Disconnect](05-troubleshooting/storage/mdadm-usb-hub-disconnect-recovery.md) — diagnosing and recovering a failed mdadm array caused by a USB hub dropout - [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md) — fixing sshd not running after reboot due to Manual startup type - [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md) — keeping Ollama reachable over Tailscale by disabling macOS sleep on AC power +- [Ansible: Vault Password File Not Found](05-troubleshooting/ansible-vault-password-file-missing.md) — fixing the missing vault_pass file error when running ansible-playbook + --- @@ -129,6 +134,8 @@ | Date | Article | Domain | |---|---|---| +| 2026-03-23 | [Ansible: Vault Password File Not Found](05-troubleshooting/ansible-vault-password-file-missing.md) | Troubleshooting | +| 2026-03-18 | [Deploying Netdata to a New Server](02-selfhosting/monitoring/netdata-new-server-setup.md) | Self-Hosting | | 2026-03-18 | [Tuning Netdata Docker Health Alarms](02-selfhosting/monitoring/netdata-docker-health-alarm-tuning.md) | Self-Hosting | | 2026-03-17 | [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md) | Troubleshooting | | 2026-03-17 | [Windows OpenSSH Server (sshd) Stops After Reboot](05-troubleshooting/networking/windows-sshd-stops-after-reboot.md) | Troubleshooting |