Add Ansible SSH timeout troubleshooting article
Documents the SSH keepalive fix for dnf upgrade timeouts on Fedora hosts, plus the do-agent task guard fix. Also adds Ansible & Fleet Management section to the troubleshooting index.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
72
05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md
Normal file
@@ -0,0 +1,72 @@
---
title: Ansible SSH Timeout During dnf upgrade on Fedora Hosts
domain: troubleshooting
category: ansible
tags:
- ansible
- ssh
- fedora
- dnf
- timeout
- fleet-management
status: published
created: '2026-03-28'
updated: '2026-03-28'
---

# Ansible SSH Timeout During dnf upgrade on Fedora Hosts

## Symptom

Running `ansible-playbook update.yml` against Fedora/CentOS hosts fails with:

```
fatal: [hostname]: UNREACHABLE! => {"changed": false,
"msg": "Failed to connect to the host via ssh: Shared connection to <IP> closed."}
```

The failure occurs specifically during `ansible.builtin.dnf` tasks that upgrade all packages (`name: '*'`, `state: latest`), because the operation takes long enough for the SSH connection to drop.
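For reference, a play of the shape described above might look like the following minimal sketch. Only the module name and its `name`/`state` arguments come from this article; the play name, the `fedora` host group, and the `become` setting are assumptions for illustration.

```yaml
# update.yml (sketch): play name, host group, and become are assumptions
- name: Upgrade all packages on Fedora hosts
  hosts: fedora            # hypothetical inventory group
  become: true
  tasks:
    - name: Upgrade all packages
      ansible.builtin.dnf:
        name: '*'          # match every installed package
        state: latest      # upgrade each one to the newest available version
```

A task like this can run for many minutes on a host with a large backlog of updates, during which the SSH session carries little or no traffic.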

## Root Cause

Without explicit SSH keepalive settings in `ansible.cfg`, the OpenSSH client defaults apply, and by default the client sends no keepalive messages (`ServerAliveInterval` is 0). During a long-running task such as a full `dnf upgrade` across a fleet, the connection can sit idle long enough to exceed an idle timeout, and the control connection closes mid-task.

## Fix

Add a `[ssh_connection]` section to `ansible.cfg`. Setting `ssh_args` replaces Ansible's default value, so the `ControlMaster` and `ControlPersist` options are restated alongside the keepalive options:

```ini
[ssh_connection]
ssh_args = -o ServerAliveInterval=30 -o ServerAliveCountMax=10 -o ControlMaster=auto -o ControlPersist=60s
```

| Setting | Purpose |
|---------|---------|
| `ServerAliveInterval=30` | Send a keepalive after 30 seconds without data from the server |
| `ServerAliveCountMax=10` | Disconnect only after 10 consecutive unanswered keepalives (30 s × 10 ≈ 5 min tolerance) |
| `ControlMaster=auto` | Reuse SSH connections across tasks |
| `ControlPersist=60s` | Keep the master connection open 60s after last use |
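The same keepalive options can also be scoped to a subset of hosts rather than set for the whole controller. This is an alternative to the `ansible.cfg` change, not what was committed. A minimal sketch, assuming a hypothetical inventory group named `fedora` with a matching `group_vars/fedora.yml` file; `ansible_ssh_common_args` is the standard inventory variable for extra SSH options:

```yaml
# group_vars/fedora.yml (hypothetical file and group name)
# Extra options passed to ssh for hosts in this group only.
ansible_ssh_common_args: "-o ServerAliveInterval=30 -o ServerAliveCountMax=10"
```

Because this variable adds options instead of replacing `ssh_args`, the connection-sharing settings do not need to be repeated here.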

## Related Fix: do-agent Task Guard

In the same playbook run, a second failure surfaced on hosts where the `ansible.builtin.uri` task that fetches the latest `do-agent` release was **skipped** (non-RedHat hosts or hosts without do-agent installed). The registered variable still existed, but it held a skipped result with no `.json` attribute, causing:

```
object of type 'dict' has no attribute 'json'
```

Fix: add guards to downstream tasks that reference the URI result:

```yaml
when:
  - do_agent_release is defined
  - do_agent_release is not skipped
  - do_agent_release.json is defined
```
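To show where the guard sits, here is a hedged sketch of a fetch task and one downstream consumer. Only the `ansible.builtin.uri` module, the `do_agent_release` variable, and the three guard conditions come from this article. The URL, the `os_family` condition, the `debug` task, and the `tag_name` field are assumptions and may not match the real playbook.

```yaml
# Sketch only: URL, skip condition, and the debug task are assumptions.
- name: Fetch latest do-agent release metadata
  ansible.builtin.uri:
    url: https://api.github.com/repos/digitalocean/do-agent/releases/latest  # assumed endpoint
  register: do_agent_release                      # registered even when the task is skipped
  when: ansible_facts['os_family'] == 'RedHat'    # assumed condition; skipped on other hosts

- name: Show the latest do-agent release tag
  ansible.builtin.debug:
    msg: "Latest do-agent release: {{ do_agent_release.json.tag_name }}"
  when:
    - do_agent_release is defined        # variable exists at all
    - do_agent_release is not skipped    # the fetch actually ran on this host
    - do_agent_release.json is defined   # the response was parsed as JSON
```

Ansible combines the items of a `when` list with a logical AND, so the downstream task runs only if all three checks pass.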

## Environment

- **Controller:** macOS (MajorAir)
- **Targets:** Fedora 43 (majorlab, majormail, majorhome, majordiscord)
- **Ansible:** community edition via Homebrew
- **Committed:** `d9c6bdb` in MajorAnsible repo
@@ -13,6 +13,10 @@ Practical fixes for common Linux, networking, and application problems.
- [ISP SNI Filtering & Caddy](isp-sni-filtering-caddy.md)
- [yt-dlp YouTube JS Challenge Fix](yt-dlp-fedora-js-challenge.md)

## ⚙️ Ansible & Fleet Management
- [SSH Timeout During dnf upgrade on Fedora Hosts](ansible-ssh-timeout-dnf-upgrade.md)
- [Vault Password File Missing](ansible-vault-password-file-missing.md)

## 📦 Docker & Systems
- [Docker & Caddy Recovery After Reboot (Fedora + SELinux)](docker-caddy-selinux-post-reboot-recovery.md)
- [Gitea Actions Runner: Boot Race Condition Fix](gitea-runner-boot-race-network-target.md)
@@ -62,3 +62,4 @@
* [Ollama Drops Off Tailscale When Mac Sleeps](05-troubleshooting/ollama-macos-sleep-tailscale-disconnect.md)
* [ClamAV CPU Spike: Safe Scheduling with nice/ionice](05-troubleshooting/security/clamscan-cpu-spike-nice-ionice.md)
* [Ansible: Vault Password File Not Found](05-troubleshooting/ansible-vault-password-file-missing.md)
* [Ansible: SSH Timeout During dnf upgrade on Fedora Hosts](05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md)