MajorWiki/05-troubleshooting/ansible-ssh-timeout-dnf-upgrade.md
majorlinux 1bb872ef75 Add Ansible SSH timeout troubleshooting article
Documents the SSH keepalive fix for dnf upgrade timeouts on Fedora hosts,
plus the do-agent task guard fix. Also adds Ansible & Fleet Management
section to the troubleshooting index.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 11:22:48 -04:00


title: Ansible SSH Timeout During dnf upgrade on Fedora Hosts
domain: troubleshooting
category: ansible
tags: ansible, ssh, fedora, dnf, timeout, fleet-management
status: published
created: 2026-03-28
updated: 2026-03-28

Ansible SSH Timeout During dnf upgrade on Fedora Hosts

Symptom

Running ansible-playbook update.yml against Fedora/CentOS hosts fails with:

fatal: [hostname]: UNREACHABLE! => {"changed": false,
  "msg": "Failed to connect to the host via ssh: Shared connection to <IP> closed."}

The failure occurs specifically during ansible.builtin.dnf tasks that upgrade all packages (name: '*', state: latest), because the operation takes long enough for the SSH connection to drop.
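For reference, a task of the kind described above might look like the following. This is a minimal sketch, not the actual contents of update.yml: the task name and the become setting are assumptions, and only the module and its name/state arguments come from the description above.

```yaml
# Hypothetical task; only the module, name: '*' and state: latest are taken from the article.
- name: Upgrade all packages
  ansible.builtin.dnf:
    name: '*'        # match every installed package
    state: latest    # upgrade each one to the newest available version
  become: true       # assumption: package upgrades need root on the target
```

A full upgrade like this can run for many minutes with little or no traffic on the SSH channel, which is the condition that exposes the timeout.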

Root Cause

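Without explicit SSH keepalive settings in ansible.cfg, OpenSSH client defaults apply, and the default ServerAliveInterval is 0, which means the client sends no keepalive messages at all. During a long-running task such as a full dnf upgrade, the connection can stay silent long enough to exceed an idle timeout somewhere on the path (for example on a firewall or NAT device), and the control connection closes mid-task.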

Fix

Add a [ssh_connection] section to ansible.cfg:

[ssh_connection]
ssh_args = -o ServerAliveInterval=30 -o ServerAliveCountMax=10 -o ControlMaster=auto -o ControlPersist=60s

| Setting | Purpose |
|---|---|
| ServerAliveInterval=30 | Send a keepalive every 30 seconds |
| ServerAliveCountMax=10 | Allow 10 missed keepalives before disconnect (~5 min tolerance) |
| ControlMaster=auto | Reuse SSH connections across tasks |
| ControlPersist=60s | Keep the master connection open 60s after last use |
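If editing ansible.cfg is not convenient, the same keepalive options can probably be applied through inventory variables instead. The sketch below is an alternative that was not part of the original fix. It assumes a group_vars/all.yml file (a hypothetical location) and uses the ansible_ssh_common_args variable, which Ansible appends to the ssh, scp, and sftp command lines.

```yaml
# group_vars/all.yml (hypothetical path; adjust to your inventory layout)
# Appended to the ssh command line in addition to ssh_args, so only the
# keepalive options are needed here.
ansible_ssh_common_args: "-o ServerAliveInterval=30 -o ServerAliveCountMax=10"
```

Because this variable adds to the existing arguments rather than replacing them, the ControlMaster and ControlPersist options do not need to be repeated. Setting ssh_args in ansible.cfg, by contrast, replaces the built-in value, which is why the fix above lists all four options.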

In the same playbook run, a second failure surfaced on hosts where the ansible.builtin.uri task that fetches the latest do-agent release was skipped (non-RedHat hosts or hosts without do-agent installed). Ansible still registers the variable when a task is skipped, but it holds a skipped result with no .json attribute, so downstream tasks that referenced it failed with:

object of type 'dict' has no attribute 'json'

Fix: add guards to downstream tasks that reference the URI result:

when:
  - do_agent_release is defined
  - do_agent_release is not skipped
  - do_agent_release.json is defined
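To show where the guard sits, here is a minimal sketch of a fetch task followed by a guarded consumer. The release URL, the tag_name field, the task names, and the skip condition are assumptions for illustration; only the module, the do_agent_release variable, and the three guard conditions come from the article.

```yaml
# Assumes facts have been gathered so ansible_os_family is available.
- name: Fetch latest do-agent release metadata
  ansible.builtin.uri:
    # Assumed endpoint; substitute whatever URL the real playbook queries.
    url: https://api.github.com/repos/digitalocean/do-agent/releases/latest
    return_content: true
  register: do_agent_release
  when: ansible_os_family == "RedHat"   # assumed skip condition

- name: Report latest do-agent version
  ansible.builtin.debug:
    # tag_name is an assumed field of the JSON response.
    msg: "Latest do-agent release: {{ do_agent_release.json.tag_name }}"
  when:
    - do_agent_release is defined        # the variable was registered at all
    - do_agent_release is not skipped    # the fetch task actually ran
    - do_agent_release.json is defined   # the response was parsed as JSON
```

The conditions in a when list are combined with a logical AND, so the debug task is skipped unless all three hold, and on skipped hosts the .json lookup in the message is never rendered.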

Environment

  • Controller: macOS (MajorAir)
  • Targets: Fedora 43 (majorlab, majormail, majorhome, majordiscord)
  • Ansible: community edition via Homebrew
  • Committed: d9c6bdb in MajorAnsible repo