Add runbook: Ansible host-key verification failed after host rebuild/migration

Documents the Ansible-by-IP known_hosts gap: interactive ssh works (key
stored under hostname) but Ansible connects by inventory IP and fails with
UNREACHABLE/Host key verification failed. Includes tailnet-safe ssh-keyscan
fix and prevention notes. Surfaced by the Hetzner migration IP churn.
Marcus Summers 2026-06-12 09:30:09 -04:00
parent bc4ff144df
commit 11b455a0e2
2 changed files with 95 additions and 0 deletions

@@ -0,0 +1,94 @@
---
title: "Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration"
domain: troubleshooting
category: networking
tags: [ansible, ssh, known-hosts, tailscale, host-key, migration]
status: published
created: 2026-06-12
updated: 2026-06-12
---
# Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration
## Symptom
A subset of hosts in an Ansible run fail at **Gathering Facts** while the rest succeed:
```
[ERROR]: Task failed: Data could not be sent to remote host "100.112.127.0".
Make sure this host can be reached over ssh: Host key verification failed.
fatal: [majormail]: UNREACHABLE! => {"unreachable": true, ...}
```
The failing hosts are exactly the ones that were recently **rebuilt or migrated** (new server, new OS install, or a cloud move that issued a new Tailscale IP). Hosts that were never rebuilt connect fine.
Confusingly, **interactive `ssh root@<host>` works perfectly** for the same boxes — only Ansible fails.
## Cause
SSH stores each accepted host key in `~/.ssh/known_hosts` keyed by the **exact address you connected with**. A key accepted for `ssh root@tttpod` is saved under the hostname `tttpod`; it is *not* indexed under that node's IP.
Ansible inventories almost always set `ansible_host` to a **literal IP** (here, the Tailscale `100.x.x.x` address). Ansible's SSH lookup is therefore by IP. It finds no matching entry, and with `StrictHostKeyChecking=yes` in effect (unlike `accept-new`, which would record a first-seen key automatically) it refuses the connection:
```
No ED25519 host key is known for 100.112.127.0 and you have requested strict checking.
Host key verification failed.
```
The hostname-form and IP-form entries are independent. Fixing interactive SSH (e.g. converting aliases to MagicDNS names and re-accepting keys) does **nothing** for Ansible, because Ansible never uses the hostname.
A rebuilt host also generates **brand-new host keys**, so any old IP-form entry would additionally be a mismatch — but the common case after a migration to a *new* IP is simply that no IP entry exists at all.
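The independence of the two entry forms can be shown on a throwaway file. The sketch below is an illustration only: `kh_has_entry` is a hypothetical helper, it understands plain (unhashed) entries only, and the key blob is a placeholder rather than a real key. `ssh-keygen -F` remains the authoritative lookup.

```bash
#!/usr/bin/env bash
# kh_has_entry FILE NAME
# Succeeds (exit 0) if FILE contains an unhashed known_hosts line whose host
# field lists NAME exactly. Hypothetical helper, for illustration only.
kh_has_entry() {
  awk -v name="$2" '
    /^[[:space:]]*(#|$)/ { next }        # skip comments and blank lines
    {
      field = ($1 ~ /^@/) ? $2 : $1      # step over @cert-authority/@revoked markers
      n = split(field, hosts, ",")       # one line may list several names
      for (i = 1; i <= n; i++) if (hosts[i] == name) found = 1
    }
    END { exit(found ? 0 : 1) }
  ' "$1"
}

demo_kh=$(mktemp)
# Placeholder entry, as if the key had been accepted via `ssh root@majormail`.
printf '%s\n' 'majormail ssh-ed25519 AAAAPLACEHOLDERKEYBLOB' > "$demo_kh"

kh_has_entry "$demo_kh" majormail     && echo "hostname form: found"
kh_has_entry "$demo_kh" 100.112.127.0 || echo "IP form: missing"
rm -f "$demo_kh"
```

The hostname lookup succeeds and the IP lookup fails against the same file, which is the situation Ansible runs into.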
## Diagnosis
```bash
# 1. Is there any known_hosts entry for the failing IP? (no output = none)
ssh-keygen -F 100.112.127.0
# 2. Reproduce the exact failure without an interactive prompt:
ssh -o BatchMode=yes -o StrictHostKeyChecking=yes root@100.112.127.0 true
# -> "Host key verification failed." confirms the gap
# 3. Confirm the inventory IP is actually the host's CURRENT address
# (guards against stale-IP drift, a separate problem):
tailscale status | grep majormail
ssh-keyscan -t ed25519 100.112.127.0 | ssh-keygen -lf - # fingerprint it
```
If step 3 shows the inventory IP matches the live Tailscale node and the box answers `ssh-keyscan`, the only problem is the missing IP-form key.
## Fix
Add the **IP-form** host keys to the `known_hosts` of the user that runs Ansible. Back up first, scan over the tailnet, de-dup:
```bash
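# Back up the current file before touching it.
cp ~/.ssh/known_hosts ~/.ssh/known_hosts.bak."$(date +%Y%m%d)"
for ip in 100.98.223.93 100.112.127.0 100.73.85.46 100.95.137.38 100.76.51.16 100.64.169.62; do
  # Drop any stale IP-form entry first; a rebuilt host has new keys.
  ssh-keygen -R "$ip" >/dev/null 2>&1
  ssh-keyscan -T 5 -t rsa,ecdsa,ed25519 "$ip" >> ~/.ssh/known_hosts
done
# De-duplicate (note: this also reorders the file).
sort -u ~/.ssh/known_hosts -o ~/.ssh/known_hosts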
```
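Rather than hard-coding the address list, the loop can be fed from the inventory. This is a minimal sketch, assuming an INI-style inventory in which each host line carries `ansible_host=<ipv4>`. The function name `inventory_ips` and the path `inventory/hosts.ini` are hypothetical, and a YAML or dynamic inventory would need `ansible-inventory --list` instead.

```bash
#!/usr/bin/env bash
# inventory_ips FILE
# Print each unique literal IPv4 ansible_host value found in an INI inventory.
# Hypothetical helper; comment lines (# or ;) are ignored.
inventory_ips() {
  grep -v '^[[:space:]]*[#;]' "$1" \
    | grep -oE 'ansible_host=[0-9]{1,3}(\.[0-9]{1,3}){3}' \
    | cut -d= -f2 \
    | sort -u
}

# Usage (the inventory path is an assumption):
#   for ip in $(inventory_ips inventory/hosts.ini); do
#     ssh-keyscan -T 5 -t rsa,ecdsa,ed25519 "$ip" >> ~/.ssh/known_hosts
#   done
```

Driving the scan from the inventory keeps the scanned addresses and the addresses Ansible actually connects to from drifting apart.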
Verify before re-running the playbook:
```bash
ansible <hosts> -m ping # expect "pong" from each
```
### Why `ssh-keyscan` is safe here
`ssh-keyscan` trusts whatever answers on the wire — normally a MITM risk. Over **Tailscale**, the connection rides WireGuard, which cryptographically authenticates the peer by its tailnet identity: reaching `100.x.x.x` *guarantees* you are talking to the node that owns that tailnet address. Scanning and trusting the key over the tailnet is therefore as trustworthy as the tailnet itself. Always cross-check the IP against `tailscale status` first (step 3) so you scan the right node.
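The cross-check against `tailscale status` can be scripted. The sketch below assumes the plain-text output has one peer per line with the tailnet IP in the first column and the machine name in the second. That layout is an assumption rather than a stable interface, and `tailscale status --json` is the sturdier source. `tailnet_ip_of` and `check_inventory_ip` are hypothetical helpers.

```bash
#!/usr/bin/env bash
# tailnet_ip_of NAME
# Read `tailscale status` text on stdin; print the IP of the peer named NAME.
tailnet_ip_of() {
  awk -v name="$1" '$2 == name { print $1; exit }'
}

# check_inventory_ip NAME EXPECTED_IP   (status text on stdin)
# Returns non-zero when the inventory IP is not the node's current address.
check_inventory_ip() {
  local live
  live=$(tailnet_ip_of "$1")
  if [ "$live" = "$2" ]; then
    echo "ok: $1 is $2"
  else
    echo "MISMATCH: $1 inventory=$2 tailnet=${live:-none}"
    return 1
  fi
}

# Usage:
#   tailscale status | check_inventory_ip majormail 100.112.127.0
```

Running the check before `ssh-keyscan` means a stale inventory IP is caught before a key from the wrong node is trusted.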
## Prevention
- **Per-workstation, not fleet-wide.** `known_hosts` is local to each machine + user. After a migration, *every* host that runs Ansible (each workstation, plus any control node like `majorlab`) needs the IP keys added independently. Adding them on one Mac does not help the others.
- **Sweep on every migration phase.** A rolling migration changes one node's IP at a time; fold the keyscan above into the post-cutover checklist so Ansible never breaks mid-rollout.
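- **Alternative: disabling host-key checking.** Setting `host_key_checking = False` in `ansible.cfg` (or `ANSIBLE_HOST_KEY_CHECKING=False`) sidesteps the error but trades away host-key verification entirely. This is not the same as OpenSSH's `StrictHostKeyChecking=accept-new`, a milder setting that records first-seen keys automatically but still rejects changed ones. Prefer the explicit keyscan: it keeps strict checking on for every *future* run while accepting the new key exactly once, under your control.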
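For a sweep that is repeated on every migration phase, it helps if re-running it never duplicates entries. One possible shape is sketched below: `append_new_keys` is a hypothetical helper that reads `ssh-keyscan`-style lines on stdin and appends only those not already present verbatim, leaving the existing order of the file untouched. It compares whole lines, so it does not recognise hashed entries as duplicates.

```bash
#!/usr/bin/env bash
# append_new_keys FILE
# Append stdin lines to FILE unless an identical line is already there.
# Comment lines (starting with #) are dropped. Hypothetical helper.
append_new_keys() {
  local file=$1 tmp
  tmp=$(mktemp) || return 1
  # Load the existing lines in BEGIN, then print only unseen stdin lines.
  awk -v kh="$file" '
    BEGIN { while ((getline line < kh) > 0) seen[line] = 1 }
    !/^#/ && !seen[$0]++
  ' > "$tmp"
  cat "$tmp" >> "$file"
  rm -f "$tmp"
}

# Usage in a post-cutover step (the IP is the example address from above):
#   ssh-keyscan -T 5 -t ed25519 100.112.127.0 | append_new_keys ~/.ssh/known_hosts
```

Because the merge is idempotent, the same command can sit in the checklist for every phase without growing `known_hosts`.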
## Related
- SSH-Aliases — Fleet SSH access; the MagicDNS-vs-pinned-IP strategy and the Ansible-by-IP `known_hosts` note
- Network Overview — Tailscale fleet inventory and current IPs
- Hetzner-Migration-Status — the migration that triggered the fleet-wide IP churn
- [[ssh-socket-tailscale-race-condition]] — a different "SSH unreachable after reboot" failure mode

@@ -136,5 +136,6 @@ updated: 2026-05-15T09:00
* [Ansible Fails with Permission Denied While `ssh <alias>` Works (Host Alias Bypass)](05-troubleshooting/ansible-ssh-host-alias-bypass.md)
* [SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure (No `Host` Block)](05-troubleshooting/networking/ssh-missing-host-block-magicdns-host-key-failure.md)
* [MagicDNS Names vs Pinned IPs for Tailscale SSH (After a Fleet Migration)](05-troubleshooting/networking/tailscale-ssh-magicdns-vs-pinned-ip-after-migration.md)
* [Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration](05-troubleshooting/networking/ansible-host-key-verification-failed-rebuilt-host.md)
* [Ghost EmailAnalytics Lag Warning — What It Means and When to Worry](05-troubleshooting/ghost-emailanalytics-lag-warning.md)
* [claude-mem: --setting-sources Empty Arg Bug (Claude Code 2.1.x)](05-troubleshooting/claude-mem-setting-sources-empty-arg.md)