majorwiki/05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md
majorlinux 1bb872ef75 Add Ansible SSH timeout troubleshooting article
Documents the SSH keepalive fix for dnf upgrade timeouts on Fedora hosts,
plus the do-agent task guard fix. Also adds Ansible & Fleet Management
section to the troubleshooting index.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 11:22:48 -04:00

title: Ansible SSH Timeout During dnf upgrade on Fedora Hosts
domain: troubleshooting
category: ansible
tags: ansible, ssh, fedora, dnf, timeout, fleet-management
status: published
created: 2026-03-28
updated: 2026-03-28

Ansible SSH Timeout During dnf upgrade on Fedora Hosts

Symptom

Running ansible-playbook update.yml against Fedora/CentOS hosts fails with:

fatal: [hostname]: UNREACHABLE! => {"changed": false,
  "msg": "Failed to connect to the host via ssh: Shared connection to <IP> closed."}

The failure occurs specifically during ansible.builtin.dnf tasks that upgrade all packages (name: '*', state: latest), because the operation takes long enough for the SSH connection to drop.
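For reference, a task of the kind described above might look like the following. This is a minimal sketch, not the actual contents of update.yml: the play name, the hosts pattern, and the use of become are assumptions; only the module name and the name/state values come from the description above.

```yaml
# Hypothetical excerpt of update.yml (play name and hosts pattern are assumptions).
- name: Upgrade all packages
  hosts: all
  become: true
  tasks:
    # A full upgrade can run for many minutes with little or no traffic on the SSH channel.
    - name: Upgrade every installed package
      ansible.builtin.dnf:
        name: '*'
        state: latest
```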

Root Cause

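Without explicit SSH keepalive settings in ansible.cfg, OpenSSH defaults apply, and by default the client sends no keepalive messages (ServerAliveInterval defaults to 0). A long-running task such as a full dnf upgrade across a fleet can leave the connection quiet for longer than an idle timeout allows, so the control connection closes mid-task.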

Fix

Add a [ssh_connection] section to ansible.cfg:

[ssh_connection]
ssh_args = -o ServerAliveInterval=30 -o ServerAliveCountMax=10 -o ControlMaster=auto -o ControlPersist=60s
| Setting | Purpose |
|---|---|
| ServerAliveInterval=30 | Send a keepalive every 30 seconds |
| ServerAliveCountMax=10 | Allow 10 missed keepalives before disconnect (30 s × 10 = ~5 min tolerance) |
| ControlMaster=auto | Reuse SSH connections across tasks |
| ControlPersist=60s | Keep the master connection open 60s after last use |
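If editing ansible.cfg is not convenient, the same keepalive options can likely be supplied as an inventory variable instead. The sketch below uses Ansible's ansible_ssh_common_args variable, which is appended to the ssh command line. The file location is an assumption, and this variant was not part of the fix applied here.

```yaml
# group_vars/all.yml (hypothetical location; any inventory vars file should work)
# Appended to the ssh command line, so ssh_args can stay at its defaults.
ansible_ssh_common_args: >-
  -o ServerAliveInterval=30
  -o ServerAliveCountMax=10
```

Because these arguments are appended rather than substituted, ControlMaster and ControlPersist do not need to be restated in this form.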

In the same playbook run, a second failure surfaced on hosts where the ansible.builtin.uri task to fetch the latest do-agent release was skipped (non-RedHat hosts or hosts without do-agent installed). The registered variable existed but contained a skipped result with no .json attribute, causing:

object of type 'dict' has no attribute 'json'

Fix: add guards to downstream tasks that reference the URI result:

when:
  - do_agent_release is defined
  - do_agent_release is not skipped
  - do_agent_release.json is defined
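To show where the guard sits, here is a minimal sketch of a conditionally skipped fetch followed by a guarded consumer. The variable do_agent_release_url, the task names, and the os_family condition are assumptions for illustration; only the registered name do_agent_release and the three guard conditions come from the fix above.

```yaml
# Sketch only: do_agent_release_url is a hypothetical variable, and the
# os_family condition is an assumed reason for the task being skipped.
- name: Fetch latest do-agent release metadata
  ansible.builtin.uri:
    url: "{{ do_agent_release_url }}"
  register: do_agent_release
  when: ansible_facts['os_family'] == "RedHat"

# When the fetch is skipped, do_agent_release is still registered,
# but it holds a skipped result with no .json key.
- name: Show release metadata only when the fetch actually ran
  ansible.builtin.debug:
    var: do_agent_release.json
  when:
    - do_agent_release is defined
    - do_agent_release is not skipped
    - do_agent_release.json is defined
```

The conditions in a when list are combined with a logical AND, so the downstream task is skipped on any host where the fetch did not run or returned no JSON.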

Environment

  • Controller: macOS (MajorAir)
  • Targets: Fedora 43 (majorlab, majormail, majorhome, majordiscord)
  • Ansible: community edition via Homebrew
  • Committed: d9c6bdb in MajorAnsible repo