Add runbook: Ansible host-key verification failed after host rebuild/migration
Documents the Ansible-by-IP known_hosts gap: interactive ssh works (key stored under hostname) but Ansible connects by inventory IP and fails with UNREACHABLE/Host key verification failed. Includes tailnet-safe ssh-keyscan fix and prevention notes. Surfaced by the Hetzner migration IP churn.
Parent: bc4ff144df
Commit: 11b455a0e2
2 changed files with 95 additions and 0 deletions

@@ -0,0 +1,94 @@
---
title: "Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration"
domain: troubleshooting
category: networking
tags: [ansible, ssh, known-hosts, tailscale, host-key, migration]
status: published
created: 2026-06-12
updated: 2026-06-12
---

# Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration

## Symptom

A subset of hosts in an Ansible run fail at **Gathering Facts** while the rest succeed:

```
[ERROR]: Task failed: Data could not be sent to remote host "100.112.127.0".
Make sure this host can be reached over ssh: Host key verification failed.
fatal: [majormail]: UNREACHABLE! => {"unreachable": true, ...}
```

The failing hosts are exactly the ones that were recently **rebuilt or migrated** (new server, new OS install, or a cloud move that issued a new Tailscale IP). Hosts that were never rebuilt connect fine.

Confusingly, **interactive `ssh root@<host>` works perfectly** for the same boxes; only Ansible fails.

## Cause

SSH stores each accepted host key in `~/.ssh/known_hosts` keyed by the **exact address you connected with**. A key accepted for `ssh root@tttpod` is saved under the hostname `tttpod`; it is *not* indexed under that node's IP (assuming `CheckHostIP no`, the default since OpenSSH 8.5).

Ansible inventories often set `ansible_host` to a **literal IP** (here, the Tailscale `100.x.x.x` address). Ansible's SSH lookup is therefore by IP. It finds no matching entry, and with `StrictHostKeyChecking=yes` it refuses the connection (under `accept-new` a missing entry would be added automatically; only a stale entry holding a changed key is rejected):

```
No ED25519 host key is known for 100.112.127.0 and you have requested strict checking.
Host key verification failed.
```

The hostname-form and IP-form entries are independent. Fixing interactive SSH (e.g. converting aliases to MagicDNS names and re-accepting keys) does **nothing** for Ansible, because Ansible never uses the hostname.

A rebuilt host also generates **brand-new host keys**, so any old IP-form entry would additionally be a mismatch, but the common case after a migration to a *new* IP is simply that no IP entry exists at all.

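
The exact-name matching described above can be illustrated with a small lookup helper. This is a minimal sketch, not how OpenSSH itself is implemented: `kh_has_entry` is a hypothetical name, and it assumes unhashed entries with plain host names (no `HashKnownHosts`, wildcards, or `[host]:port` forms). For real checks, `ssh-keygen -F` remains the tool to use.

```bash
# kh_has_entry FILE NAME   (hypothetical helper, for illustration only)
# Succeeds if FILE has an unhashed known_hosts line whose host field
# lists NAME exactly. Hashed entries and patterns are not handled.
kh_has_entry() {
  awk -v name="$2" '
    /^[[:space:]]*(#|$)/ { next }        # skip comments and blank lines
    {
      field = $1
      if (field ~ /^@/) field = $2       # step over a leading @marker
      n = split(field, hosts, ",")       # host field may list several names
      for (i = 1; i <= n; i++) if (hosts[i] == name) found = 1
    }
    END { exit found ? 0 : 1 }
  ' "$1"
}
```

Run against a file that holds only a `tttpod` entry, `kh_has_entry FILE tttpod` succeeds, while the same call with that node's `100.x.x.x` address fails. That is the gap Ansible falls into.
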
## Diagnosis

```bash
# 1. Is there any known_hosts entry for the failing IP? (no output = none)
ssh-keygen -F 100.112.127.0

# 2. Reproduce the exact failure without an interactive prompt:
ssh -o BatchMode=yes -o StrictHostKeyChecking=yes root@100.112.127.0 true
#    -> "Host key verification failed." confirms the gap

# 3. Confirm the inventory IP is actually the host's CURRENT address
#    (guards against stale-IP drift, a separate problem):
tailscale status | grep majormail
ssh-keyscan -t ed25519 100.112.127.0 | ssh-keygen -lf -   # fingerprint it
```

If step 3 shows the inventory IP matches the live Tailscale node and the box answers `ssh-keyscan`, the only problem is the missing IP-form key.

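
When several hosts fail at once, step 1 can be run over the whole list. A minimal sketch, assuming `ssh-keygen` is on the `PATH`; `missing_host_keys` is a hypothetical helper name, and it relies only on `ssh-keygen -F` printing nothing when no entry matches.

```bash
# missing_host_keys KNOWN_HOSTS IP...   (hypothetical helper)
# Print every IP that has no entry in the given known_hosts file.
missing_host_keys() {
  kh=$1
  shift
  for ip in "$@"; do
    # ssh-keygen -F prints the matching line(s); empty output means no entry.
    if [ -z "$(ssh-keygen -F "$ip" -f "$kh" 2>/dev/null)" ]; then
      printf '%s\n' "$ip"
    fi
  done
}
```

For example, `missing_host_keys ~/.ssh/known_hosts 100.112.127.0 100.98.223.93` prints only the addresses that still need an IP-form key.
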
## Fix

Add the **IP-form** host keys to the `known_hosts` of the user that runs Ansible. Back up first, scan over the tailnet, then de-duplicate. If a rebuilt host kept its old IP, `ssh-keygen -R <ip>` removes the stale entry beforehand.

```bash
# Back up the current file before touching it.
cp ~/.ssh/known_hosts ~/.ssh/known_hosts.bak.$(date +%Y%m%d)

for ip in 100.98.223.93 100.112.127.0 100.73.85.46 100.95.137.38 100.76.51.16 100.64.169.62; do
  # Append the host's current public keys under its IP (5 s timeout per host).
  ssh-keyscan -T 5 -t rsa,ecdsa,ed25519 "$ip" >> ~/.ssh/known_hosts
done

# Drop exact duplicate lines (this also reorders the file).
sort -u ~/.ssh/known_hosts -o ~/.ssh/known_hosts
```

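
`sort -u` removes duplicates but also reorders the file. Where the existing order matters (for example, to keep hand-grouped entries together), an order-preserving de-dup is one alternative. This is a sketch only; `dedup_keep_order` is a hypothetical helper, not part of the original runbook.

```bash
# dedup_keep_order FILE   (hypothetical helper)
# Remove exact duplicate lines from FILE, keeping the first occurrence of each.
dedup_keep_order() {
  tmp=$(mktemp) || return 1
  # awk prints a line only the first time it is seen; the result is written
  # back with cat so the file keeps its inode and permissions.
  awk '!seen[$0]++' "$1" > "$tmp" && cat "$tmp" > "$1"
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

Calling `dedup_keep_order ~/.ssh/known_hosts` in place of the `sort -u` line leaves the surviving entries in their original order.
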
Verify before re-running the playbook:

```bash
ansible <hosts> -m ping      # expect "pong" from each
```

### Why `ssh-keyscan` is safe here

`ssh-keyscan` trusts whatever answers on the wire, which is normally a MITM risk. Over **Tailscale**, the connection rides WireGuard, which cryptographically authenticates the peer by its tailnet identity: reaching `100.x.x.x` *guarantees* you are talking to the node that owns that tailnet address. Scanning and trusting the key over the tailnet is therefore as trustworthy as the tailnet itself. Always cross-check the IP against `tailscale status` first (step 3) so you scan the right node.

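
The cross-check against `tailscale status` can be made mechanical before scanning. A minimal sketch, assuming the command's default table layout with the tailnet IP in the first column and the node name in the second; `node_owns_ip` is a hypothetical helper name.

```bash
# node_owns_ip IP NAME   (hypothetical helper)
# Reads `tailscale status` output on stdin and succeeds only if some row
# lists IP in column 1 and NAME in column 2 (assumed column layout).
node_owns_ip() {
  awk -v ip="$1" -v name="$2" '
    $1 == ip && $2 == name { ok = 1 }
    END { exit ok ? 0 : 1 }
  '
}
```

Used as `tailscale status | node_owns_ip 100.112.127.0 majormail && ssh-keyscan -T 5 -t ed25519 100.112.127.0`, the scan runs only when the address still belongs to the expected node.
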
## Prevention

- **Per-workstation, not fleet-wide.** `known_hosts` is local to each machine and user. After a migration, *every* host that runs Ansible (each workstation, plus any control node like `majorlab`) needs the IP keys added independently. Adding them on one Mac does not help the others.
- **Sweep on every migration phase.** A rolling migration changes one node's IP at a time; fold the keyscan above into the post-cutover checklist so Ansible never breaks mid-rollout.
- **Alternative: disabling host-key checking.** Setting `host_key_checking = False` in `ansible.cfg` (or `ANSIBLE_HOST_KEY_CHECKING=False`) sidesteps the error but trades away host-key verification entirely. This is not the same as OpenSSH's `StrictHostKeyChecking=accept-new`, which trusts unknown hosts on first contact but still rejects changed keys. Prefer the explicit keyscan: it keeps strict checking on for every *future* run while accepting the new key exactly once, under your control.

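
One way to fold the keyscan into a post-cutover checklist is an idempotent sweep that appends only lines not already present, so it can be re-run after every phase. This is a sketch under assumptions: `sweep_host_keys` and the `KEYSCAN` override variable are hypothetical names introduced here (the override exists so the function can be dry-run with a stub), and only `ed25519` keys are scanned.

```bash
# KEYSCAN can be pointed at a stub for dry runs; it defaults to the real tool.
KEYSCAN=${KEYSCAN:-ssh-keyscan}

# sweep_host_keys KNOWN_HOSTS IP...   (hypothetical helper)
# Scan each IP and append only the key lines KNOWN_HOSTS does not yet contain.
sweep_host_keys() {
  kh=$1
  shift
  for ip in "$@"; do
    "$KEYSCAN" -T 5 -t ed25519 "$ip" 2>/dev/null |
      while IFS= read -r line; do
        # -F fixed string, -x whole line: skip lines that are already there.
        grep -Fxq -- "$line" "$kh" 2>/dev/null || printf '%s\n' "$line" >> "$kh"
      done
  done
}
```

Because existing lines are skipped, running `sweep_host_keys ~/.ssh/known_hosts 100.112.127.0 100.98.223.93` twice leaves the file unchanged the second time. Back up the file and confirm each IP against `tailscale status` first, as in the manual fix.
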
## Related

- SSH-Aliases: fleet SSH access; the MagicDNS-vs-pinned-IP strategy and the Ansible-by-IP `known_hosts` note
- Network Overview: Tailscale fleet inventory and current IPs
- Hetzner-Migration-Status: the migration that triggered the fleet-wide IP churn
- [[ssh-socket-tailscale-race-condition]]: a different "SSH unreachable after reboot" failure mode

@@ -136,5 +136,6 @@ updated: 2026-05-15T09:00
* [Ansible Fails with Permission Denied While `ssh <alias>` Works (Host Alias Bypass)](05-troubleshooting/ansible-ssh-host-alias-bypass.md)
* [SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure (No `Host` Block)](05-troubleshooting/networking/ssh-missing-host-block-magicdns-host-key-failure.md)
* [MagicDNS Names vs Pinned IPs for Tailscale SSH (After a Fleet Migration)](05-troubleshooting/networking/tailscale-ssh-magicdns-vs-pinned-ip-after-migration.md)
* [Ansible UNREACHABLE: Host Key Verification Failed After a Host Rebuild or Migration](05-troubleshooting/networking/ansible-host-key-verification-failed-rebuilt-host.md)
* [Ghost EmailAnalytics Lag Warning — What It Means and When to Worry](05-troubleshooting/ghost-emailanalytics-lag-warning.md)
* [claude-mem: --setting-sources Empty Arg Bug (Claude Code 2.1.x)](05-troubleshooting/claude-mem-setting-sources-empty-arg.md)