diff --git a/05-troubleshooting/networking/ansible-host-key-verification-failed-rebuilt-host.md b/05-troubleshooting/networking/ansible-host-key-verification-failed-rebuilt-host.md
new file mode 100644
index 0000000..fca0066
--- /dev/null
+++ b/05-troubleshooting/networking/ansible-host-key-verification-failed-rebuilt-host.md
@@ -0,0 +1,94 @@
+---
+title: "Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration"
+domain: troubleshooting
+category: networking
+tags: [ansible, ssh, known-hosts, tailscale, host-key, migration]
+status: published
+created: 2026-06-12
+updated: 2026-06-12
+---
+
+# Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration
+
+## Symptom
+
+A subset of hosts in an Ansible run fail at **Gathering Facts** while the rest succeed:
+
+```
+[ERROR]: Task failed: Data could not be sent to remote host "100.112.127.0".
+Make sure this host can be reached over ssh: Host key verification failed.
+fatal: [majormail]: UNREACHABLE! => {"unreachable": true, ...}
+```
+
+The failing hosts are exactly the ones that were recently **rebuilt or migrated** (new server, new OS install, or a cloud move that issued a new Tailscale IP). Hosts that were never rebuilt connect fine.
+
+Confusingly, **interactive `ssh root@` works perfectly** for the same boxes — only Ansible fails.
+
+## Cause
+
+SSH stores each accepted host key in `~/.ssh/known_hosts` keyed by the **exact address you connected with**. A key accepted for `ssh root@tttpod` is saved under the hostname `tttpod`; it is *not* indexed under that node's IP.
+
+Ansible inventories almost always set `ansible_host` to a **literal IP** (here, the Tailscale `100.x.x.x` address). So Ansible's SSH lookup is by IP, finds no matching entry, and with `StrictHostKeyChecking=yes` (or `accept-new` already exhausted) it refuses the connection:
+
+```
+No ED25519 host key is known for 100.112.127.0 and you have requested strict checking.
+Host key verification failed.
+```
+
+The hostname-form and IP-form entries are independent. Fixing interactive SSH (e.g. converting aliases to MagicDNS names and re-accepting keys) does **nothing** for Ansible, because Ansible never uses the hostname.
+
+A rebuilt host also generates **brand-new host keys**, so any old IP-form entry would additionally be a mismatch — but the common case after a migration to a *new* IP is simply that no IP entry exists at all.
+
+## Diagnosis
+
+```bash
+# 1. Is there any known_hosts entry for the failing IP?
+#    (no output and exit status 1 = none)
+ssh-keygen -F 100.112.127.0
+
+# 2. Reproduce the exact failure without an interactive prompt:
+ssh -o BatchMode=yes -o StrictHostKeyChecking=yes root@100.112.127.0 true
+# -> "Host key verification failed." confirms the gap
+
+# 3. Confirm the inventory IP is actually the host's CURRENT address
+#    (guards against stale-IP drift, a separate problem):
+tailscale status | grep majormail
+ssh-keyscan -t ed25519 100.112.127.0 | ssh-keygen -lf -   # fingerprint it
+```
+
+If step 3 shows the inventory IP matches the live Tailscale node and the box answers `ssh-keyscan`, the only problem is the missing IP-form key.
+
+## Fix
+
+Add the **IP-form** host keys to the `known_hosts` of the user that runs Ansible. Back up first, scan over the tailnet, de-dup:
+
+```bash
+cp ~/.ssh/known_hosts ~/.ssh/known_hosts.bak.$(date +%Y%m%d)
+
+for ip in 100.98.223.93 100.112.127.0 100.73.85.46 100.95.137.38 100.76.51.16 100.64.169.62; do
+  ssh-keyscan -T 5 -t rsa,ecdsa,ed25519 "$ip" >> ~/.ssh/known_hosts
+done
+sort -u ~/.ssh/known_hosts -o ~/.ssh/known_hosts
+```
+
+Verify before re-running the playbook:
+
+```bash
+# "all" targets every inventory host; narrow it to a group or host pattern as needed
+ansible all -m ping   # expect "pong" from each
+```
+
+### Why `ssh-keyscan` is safe here
+
+`ssh-keyscan` trusts whatever answers on the wire — normally a MITM risk.
+Over **Tailscale**, the connection rides WireGuard, which cryptographically authenticates the peer by its tailnet identity: reaching `100.x.x.x` *guarantees* you are talking to the node that owns that tailnet address. Scanning and trusting the key over the tailnet is therefore as trustworthy as the tailnet itself. Always cross-check the IP against `tailscale status` first (step 3) so you scan the right node.
+
+## Prevention
+
+- **Per-workstation, not fleet-wide.** `known_hosts` is local to each machine + user. After a migration, *every* host that runs Ansible (each workstation, plus any control node like `majorlab`) needs the IP keys added independently. Adding them on one Mac does not help the others.
+- **Sweep on every migration phase.** A rolling migration changes one node's IP at a time; fold the keyscan above into the post-cutover checklist so Ansible never breaks mid-rollout.
+- **Alternative — disabling host-key checking.** Setting `host_key_checking = False` in `ansible.cfg` (or `ANSIBLE_HOST_KEY_CHECKING=False`) sidesteps the prompt but trades away host-key verification entirely. Prefer the explicit keyscan: it keeps strict checking on for every *future* run while accepting the new key exactly once, under your control.
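
The post-cutover sweep described in the bullets above could be scripted so that it scans only the IPs that have no entry yet, which keeps repeated runs from appending duplicates. The following is a minimal sketch, not the fleet's actual tooling: the function names `missing_ips` and `sweep_known_hosts` and the `KNOWN_HOSTS` variable are hypothetical, and it assumes the OpenSSH tools `ssh-keygen` and `ssh-keyscan` are on the PATH.

```bash
# Hypothetical setting: the known_hosts file to maintain (override via the environment).
KNOWN_HOSTS="${KNOWN_HOSTS:-$HOME/.ssh/known_hosts}"

# Print every IP argument that has no entry in $KNOWN_HOSTS.
# ssh-keygen -F exits non-zero when nothing matches, and it also
# handles hashed known_hosts files, which a plain grep would miss.
missing_ips() {
  local ip
  for ip in "$@"; do
    ssh-keygen -F "$ip" -f "$KNOWN_HOSTS" >/dev/null 2>&1 || printf '%s\n' "$ip"
  done
}

# Scan only the missing IPs and append their keys to $KNOWN_HOSTS.
sweep_known_hosts() {
  local ip
  for ip in $(missing_ips "$@"); do
    ssh-keyscan -T 5 -t rsa,ecdsa,ed25519 "$ip" >> "$KNOWN_HOSTS" 2>/dev/null
  done
}

# Example (IPs taken from the Fix section), commented out so nothing is scanned on load:
# sweep_known_hosts 100.112.127.0 100.98.223.93
```

This covers only the "no entry" case. An IP that still carries a stale key from before a rebuild is not reported as missing; that old entry would need to be removed first, for example with `ssh-keygen -R 100.112.127.0`. As with the manual keyscan, cross-check each IP against `tailscale status` before sweeping.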
+
+## Related
+
+- SSH-Aliases — Fleet SSH access; the MagicDNS-vs-pinned-IP strategy and the Ansible-by-IP `known_hosts` note
+- Network Overview — Tailscale fleet inventory and current IPs
+- Hetzner-Migration-Status — the migration that triggered the fleet-wide IP churn
+- [[ssh-socket-tailscale-race-condition]] — a different "SSH unreachable after reboot" failure mode
diff --git a/SUMMARY.md b/SUMMARY.md
index 76bc1a0..bc5d2c8 100644
--- a/SUMMARY.md
+++ b/SUMMARY.md
@@ -136,5 +136,6 @@ updated: 2026-05-15T09:00
  * [Ansible Fails with Permission Denied While `ssh ` Works (Host Alias Bypass)](05-troubleshooting/ansible-ssh-host-alias-bypass.md)
  * [SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure (No `Host` Block)](05-troubleshooting/networking/ssh-missing-host-block-magicdns-host-key-failure.md)
  * [MagicDNS Names vs Pinned IPs for Tailscale SSH (After a Fleet Migration)](05-troubleshooting/networking/tailscale-ssh-magicdns-vs-pinned-ip-after-migration.md)
+ * [Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration](05-troubleshooting/networking/ansible-host-key-verification-failed-rebuilt-host.md)
  * [Ghost EmailAnalytics Lag Warning — What It Means and When to Worry](05-troubleshooting/ghost-emailanalytics-lag-warning.md)
  * [claude-mem: --setting-sources Empty Arg Bug (Claude Code 2.1.x)](05-troubleshooting/claude-mem-setting-sources-empty-arg.md)