Marcus Summers 11b455a0e2 Add runbook: Ansible host-key verification failed after host rebuild/migration

Documents the Ansible-by-IP known_hosts gap: interactive ssh works (key
stored under hostname) but Ansible connects by inventory IP and fails with
UNREACHABLE/Host key verification failed. Includes tailnet-safe ssh-keyscan
fix and prevention notes. Surfaced by the Hetzner migration IP churn.

2026-06-12 09:30:09 -04:00

Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration

Symptom

A subset of hosts in an Ansible run fail at Gathering Facts while the rest succeed:

[ERROR]: Task failed: Data could not be sent to remote host "100.112.127.0".
Make sure this host can be reached over ssh: Host key verification failed.
fatal: [majormail]: UNREACHABLE! => {"unreachable": true, ...}

The failing hosts are exactly the ones that were recently rebuilt or migrated (new server, new OS install, or a cloud move that issued a new Tailscale IP). Hosts that were never rebuilt connect fine.

Confusingly, interactive ssh root@<host> works perfectly for the same boxes — only Ansible fails.

Cause

SSH stores each accepted host key in ~/.ssh/known_hosts keyed by the exact address you connected with. A key accepted for ssh root@tttpod is saved under the hostname tttpod; it is not indexed under that node's IP.

Ansible inventories commonly set ansible_host to a literal IP (here, the Tailscale 100.x.x.x address). Ansible's SSH lookup is therefore by IP and finds no matching entry. With StrictHostKeyChecking=yes, ssh refuses the connection instead of prompting (accept-new would add an unknown key on its own, but it still refuses an address whose stored key has changed):

No ED25519 host key is known for 100.112.127.0 and you have requested strict checking.
Host key verification failed.

The hostname-form and IP-form entries are independent. Fixing interactive SSH (e.g. converting aliases to MagicDNS names and re-accepting keys) does nothing for Ansible, because Ansible never uses the hostname.

A rebuilt host also generates brand-new host keys, so any old IP-form entry would additionally be a mismatch — but the common case after a migration to a new IP is simply that no IP entry exists at all.
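
To make the keying concrete, here is a minimal sketch that checks a plain-text known_hosts file for an exact host token. It is a simplified stand-in for ssh-keygen -F, not a replacement: it assumes unhashed entries and ignores wildcard patterns and marker lines. The function name has_known_host is hypothetical.

```shell
# has_known_host FILE HOST
# Succeeds if HOST appears verbatim in the comma-separated host field
# (first column) of any line in FILE.
# Simplification (assumption): entries are not hashed, and wildcard
# patterns and @marker lines are not interpreted the way ssh would.
has_known_host() {
  awk -v want="$2" '
    /^[ \t]*(#|$)/ { next }              # skip comments and blank lines
    {
      n = split($1, hosts, ",")          # one line may list several names
      for (i = 1; i <= n; i++)
        if (hosts[i] == want) found = 1
    }
    END { exit !found }                  # exit 0 only if a token matched
  ' "$1"
}

# Example: a line saved as "tttpod ssh-ed25519 AAAA..." satisfies
#   has_known_host ~/.ssh/known_hosts tttpod
# but a lookup by that node's 100.x.x.x address fails, because the
# address never appears in the host field.
```

The lookup is a string match on the host field, which is why an entry saved under a name says nothing about the same machine's IP.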

Diagnosis

# 1. Is there any known_hosts entry for the failing IP? (no output and exit status 1 = none)
ssh-keygen -F 100.112.127.0

# 2. Reproduce the exact failure without an interactive prompt:
ssh -o BatchMode=yes -o StrictHostKeyChecking=yes root@100.112.127.0 true
# -> "Host key verification failed."  confirms the gap

# 3. Confirm the inventory IP is actually the host's CURRENT address
#    (guards against stale-IP drift, a separate problem):
tailscale status | grep majormail
ssh-keyscan -t ed25519 100.112.127.0 | ssh-keygen -lf -   # fingerprint it

If step 3 shows the inventory IP matches the live Tailscale node and the box answers ssh-keyscan, the only problem is the missing IP-form key.
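
Step 2's output also tells a missing entry apart from a changed key. As a rough sketch, the helper below classifies ssh's stderr text by the messages OpenSSH prints. The function name classify_hostkey_error and its output labels are hypothetical, and the matching is by substring only.

```shell
# classify_hostkey_error: read ssh's stderr text on stdin and print one of
#   changed  - a stored key exists but no longer matches (rebuilt host)
#   missing  - no key is stored for this address (the gap described above)
#   hostkey  - host-key verification failed without a more specific message
#   other    - the failure is not a host-key problem
classify_hostkey_error() {
  msg=$(cat)
  case "$msg" in
    *"REMOTE HOST IDENTIFICATION HAS CHANGED"*) echo changed ;;
    *"host key is known for"*)                  echo missing ;;
    *"Host key verification failed"*)           echo hostkey ;;
    *)                                          echo other ;;
  esac
}

# Usage: send only stderr down the pipe.
#   ssh -o BatchMode=yes -o StrictHostKeyChecking=yes root@100.112.127.0 true \
#     2>&1 >/dev/null | classify_hostkey_error
```

A result of changed means a stale entry has to be removed (for example with ssh-keygen -R) before the new key is added; missing means the Fix below is sufficient on its own.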

Fix

Add the IP-form host keys to the known_hosts of the user that runs Ansible. Back up first, scan over the tailnet, de-dup:

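# Back up the current file first.
cp ~/.ssh/known_hosts ~/.ssh/known_hosts.bak.$(date +%Y%m%d)

# Scan each migrated node's tailnet IP (5-second timeout per host) and append its keys.
for ip in 100.98.223.93 100.112.127.0 100.73.85.46 100.95.137.38 100.76.51.16 100.64.169.62; do
  ssh-keyscan -T 5 -t rsa,ecdsa,ed25519 "$ip" >> ~/.ssh/known_hosts
done
# Drop duplicate lines; note that this also sorts the file.
sort -u ~/.ssh/known_hosts -o ~/.ssh/known_hosts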

Verify before re-running the playbook:

ansible <hosts> -m ping        # expect "pong" from each
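
The append-and-sort step in the Fix reorders the whole file, which makes a diff against the backup noisy. One possible variant, sketched here under the assumption that scan output is first written to a temporary file, merges new lines while keeping the existing order. The function name merge_known_hosts is hypothetical.

```shell
# merge_known_hosts SCANNED KNOWN_HOSTS
# Appends the lines of SCANNED to KNOWN_HOSTS, dropping exact duplicate
# lines while preserving the existing line order.
# Refuses to touch KNOWN_HOSTS if the scan produced nothing.
merge_known_hosts() {
  scanned=$1
  target=$2
  [ -s "$scanned" ] || { echo "scan is empty; $target unchanged" >&2; return 1; }
  merged=$(mktemp) || return 1
  touch "$target"
  # awk prints a line only the first time it is seen across both files.
  if awk '!seen[$0]++' "$target" "$scanned" > "$merged"; then
    cat "$merged" > "$target"          # rewrite in place, keeping permissions
    rc=$?
  else
    rc=1
  fi
  rm -f "$merged"
  return "$rc"
}

# Usage:
#   scan=$(mktemp)
#   ssh-keyscan -T 5 -t ed25519 100.112.127.0 > "$scan"
#   merge_known_hosts "$scan" ~/.ssh/known_hosts
```

The empty-scan check matters in practice: a host that does not answer within the timeout produces no output, and silently "succeeding" would hide that the gap is still there.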

Why `ssh-keyscan` is safe here

ssh-keyscan trusts whatever answers on the wire — normally a MITM risk. Over Tailscale, the connection rides WireGuard, which cryptographically authenticates the peer by its tailnet identity: reaching 100.x.x.x guarantees you are talking to the node that owns that tailnet address. Scanning and trusting the key over the tailnet is therefore as trustworthy as the tailnet itself. Always cross-check the IP against tailscale status first (step 3) so you scan the right node.
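
The cross-check can be scripted. The sketch below assumes that tailscale status prints one node per line with the tailnet IP in the first column and the machine name in the second; both function names are hypothetical.

```shell
# tailnet_ip_for NAME
# Reads `tailscale status` text on stdin and prints the address listed for
# the node called NAME. Fails if no such node is listed.
# Assumption: column 1 is the tailnet IP, column 2 is the machine name.
tailnet_ip_for() {
  awk -v name="$1" '
    $2 == name { print $1; found = 1; exit }
    END { exit !found }
  '
}

# inventory_ip_is_current NAME IP
# Succeeds only if the status text on stdin lists IP for NAME.
inventory_ip_is_current() {
  [ "$(tailnet_ip_for "$1")" = "$2" ]
}

# Usage: scan only when the inventory IP matches the live node.
#   tailscale status | inventory_ip_is_current majormail 100.112.127.0 \
#     && ssh-keyscan -T 5 -t ed25519 100.112.127.0
```

Gating the scan on this check keeps a stale inventory IP from being trusted under the wrong host's name.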

Prevention

Per-workstation, not fleet-wide. known_hosts is local to each machine + user. After a migration, every host that runs Ansible (each workstation, plus any control node like majorlab) needs the IP keys added independently. Adding them on one Mac does not help the others.
Sweep on every migration phase. A rolling migration changes one node's IP at a time; fold the keyscan above into the post-cutover checklist so Ansible never breaks mid-rollout.
Alternative: disable host-key checking. Setting host_key_checking = False in ansible.cfg (or ANSIBLE_HOST_KEY_CHECKING=False) sidesteps the failure but trades away host-key verification entirely. Prefer the explicit keyscan: it keeps strict checking on for every future run while accepting the new key exactly once, under your control.
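
For the post-cutover checklist, a sweep can start from the inventory itself rather than a hand-maintained IP list. This sketch assumes an INI-style inventory with entries such as majormail ansible_host=100.112.127.0; the function name inventory_ips and the path hosts.ini are hypothetical.

```shell
# inventory_ips FILE
# Prints each distinct ansible_host value found in FILE, one per line.
# Assumption: INI-style inventory; lines starting with # or ; are comments.
inventory_ips() {
  grep -v '^[[:space:]]*[#;]' "$1" |
    sed -n 's/.*ansible_host=\([^[:space:]]*\).*/\1/p' |
    sort -u
}

# Post-cutover sweep: report addresses with no known_hosts entry.
#   for ip in $(inventory_ips hosts.ini); do
#     ssh-keygen -F "$ip" >/dev/null || echo "no known_hosts entry for $ip"
#   done
```

The sweep only reports gaps. Adding the keys remains a deliberate step, done after the IP has been checked against tailscale status.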

Related

SSH-Aliases: Fleet SSH access; the MagicDNS-vs-pinned-IP strategy and the Ansible-by-IP known_hosts note
Network Overview: Tailscale fleet inventory and current IPs
Hetzner-Migration-Status: the migration that triggered the fleet-wide IP churn
ssh-socket-tailscale-race-condition: a different "SSH unreachable after reboot" failure mode
