Add Ansible SSH timeout troubleshooting article

Documents the SSH keepalive fix for dnf upgrade timeouts on Fedora hosts, plus the do-agent task guard fix. Also adds an Ansible & Fleet Management section to the troubleshooting index.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 11:21:18 -04:00
parent 23a35e021b
commit 1bb872ef75
3 changed files with 77 additions and 0 deletions
--- a/05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md
+++ b/05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md
@@ -0,0 +1,72 @@
+---
+title: Ansible SSH Timeout During dnf upgrade on Fedora Hosts
+domain: troubleshooting
+category: ansible
+tags:
+  - ansible
+  - ssh
+  - fedora
+  - dnf
+  - timeout
+  - fleet-management
+status: published
+created: '2026-03-28'
+updated: '2026-03-28'
+---
+
+# Ansible SSH Timeout During dnf upgrade on Fedora Hosts
+
+## Symptom
+
+Running `ansible-playbook update.yml` against Fedora/CentOS hosts fails with:
+
+```
+fatal: [hostname]: UNREACHABLE! => {"changed": false,
+  "msg": "Failed to connect to the host via ssh: Shared connection to <IP> closed."}
+```
+
+The failure occurs specifically during `ansible.builtin.dnf` tasks that upgrade all packages (`name: '*'`, `state: latest`), because the operation takes long enough for the SSH connection to drop.
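+
+For reference, a task of the kind described above might look like the following. This is a minimal sketch, not the actual contents of `update.yml`; the task name and the use of `become` are assumptions.
+
+```yaml
+# Hypothetical task; the real update.yml is not shown in this article.
+- name: Upgrade all packages
+  ansible.builtin.dnf:
+    name: '*'        # match every installed package
+    state: latest    # upgrade each one to the newest available version
+  become: true       # assumed: package upgrades need root
+```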
+
+## Root Cause
+
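+Without explicit SSH keepalive settings in `ansible.cfg`, the OpenSSH client defaults apply: `ServerAliveInterval` is `0`, so the client sends no keepalive messages. A long-running task such as a full `dnf upgrade` can leave the connection quiet long enough to exceed idle timeouts, and the control connection closes mid-task.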
+
+## Fix
+
+Add a `[ssh_connection]` section to `ansible.cfg`:
+
+```ini
+[ssh_connection]
+ssh_args = -o ServerAliveInterval=30 -o ServerAliveCountMax=10 -o ControlMaster=auto -o ControlPersist=60s
+```
+
+| Setting | Purpose |
+|---------|---------|
+| `ServerAliveInterval=30` | Send a keepalive every 30 seconds |
+| `ServerAliveCountMax=10` | Allow 10 unanswered keepalives before disconnecting (10 × 30 s ≈ 5 min tolerance) |
+| `ControlMaster=auto` | Reuse SSH connections across tasks |
+| `ControlPersist=60s` | Keep the master connection open 60s after last use |
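+
+The same keepalive options can also be scoped to an inventory group rather than set globally. The sketch below assumes a `group_vars/all.yml` file, which is a hypothetical location and not part of the original fix. `ansible_ssh_common_args` is appended to the SSH command line in addition to `ssh_args`, so the connection-sharing options do not need to be repeated here.
+
+```yaml
+# group_vars/all.yml (hypothetical path; adjust to your inventory layout)
+# Extra options added to the SSH command line for hosts in this group.
+ansible_ssh_common_args: '-o ServerAliveInterval=30 -o ServerAliveCountMax=10'
+```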
+
+## Related Fix: do-agent Task Guard
+
+In the same playbook run, a second failure surfaced on hosts where the `ansible.builtin.uri` task to fetch the latest `do-agent` release was **skipped** (non-RedHat hosts or hosts without do-agent installed). The registered variable existed but contained a skipped result with no `.json` attribute, causing:
+
+```
+object of type 'dict' has no attribute 'json'
+```
+
+Fix: add guards to downstream tasks that reference the URI result:
+
+```yaml
+when:
+  - do_agent_release is defined
+  - do_agent_release is not skipped
+  - do_agent_release.json is defined
+```
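+
+In context, the guard sits on any task that reads `do_agent_release.json`. The following is a sketch under stated assumptions, not the playbook's actual tasks: the `do_agent_release_url` variable, the `os_family` condition, and the `tag_name` field are illustrative placeholders.
+
+```yaml
+# Hypothetical fetch task; when its condition is false the task is skipped,
+# but do_agent_release is still registered (as a skipped result without .json).
+- name: Fetch latest do-agent release metadata
+  ansible.builtin.uri:
+    url: "{{ do_agent_release_url }}"   # assumed variable, defined elsewhere
+  register: do_agent_release
+  when: ansible_facts['os_family'] == 'RedHat'
+
+# Downstream task guarded so it only runs when a JSON body is present.
+- name: Show latest do-agent release
+  ansible.builtin.debug:
+    msg: "Latest do-agent release: {{ do_agent_release.json.tag_name }}"   # tag_name is an assumed field
+  when:
+    - do_agent_release is defined
+    - do_agent_release is not skipped
+    - do_agent_release.json is defined
+```
+
+The conditions in a `when` list are combined with a logical AND, so the task runs only if all three hold.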
+
+## Environment
+
+- **Controller:** macOS (MajorAir)
+- **Targets:** Fedora 43 (majorlab, majormail, majorhome, majordiscord)
+- **Ansible:** community edition via Homebrew
+- **Committed:** `d9c6bdb` in MajorAnsible repo
--- a/05-troubleshooting/index.md
+++ b/05-troubleshooting/index.md
@@ -13,6 +13,10 @@ Practical fixes for common Linux, networking, and application problems.
 - [ISP SNI Filtering & Caddy](isp-sni-filtering-caddy.md)
 - [yt-dlp YouTube JS Challenge Fix](yt-dlp-fedora-js-challenge.md)

+## ⚙️ Ansible & Fleet Management
+- [SSH Timeout During dnf upgrade on Fedora Hosts](ansible-ssh-timeout-dnf-upgrade.md)
+- [Vault Password File Missing](ansible-vault-password-file-missing.md)
+
 ## 📦 Docker & Systems
 - [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](docker-caddy-selinux-post-reboot-recovery.md)
 - [Gitea Actions Runner: Boot Race Condition Fix](gitea-runner-boot-race-network-target.md)