majorlinux 950759da52 wiki: add MagicDNS-names-vs-pinned-IPs Tailscale SSH article

New troubleshooting/networking article covering the three SSH failure modes
after a fleet migration (stale hardcoded IP, Tailscale 1.98.x cold-path
teardown, rebuilt-box host-key mismatch) and the durable fix (MagicDNS names +
known_hosts purge + ConnectTimeout), with the WSL2 no-resolver caveat.
Cross-links the existing host-key article (adds a 'when pinning the IP is
wrong' callout) and adds the SUMMARY nav entry.

2026-06-12 01:33:31 -04:00

MagicDNS Names vs Pinned IPs for Tailscale SSH (After a Fleet Migration)

You have SSH aliases for a Tailscale fleet (alias tttpod='ssh root@100.84.42.102'). They worked for months. Then you migrate or rebuild some nodes, and now a third of them hang on connect or refuse the host key. These are the failures that hardcoded addresses run into, and the reason the durable answer is MagicDNS names, not pinned IPs.

This is the sequel to SSH Alias Falls Through to MagicDNS — Host-Key Verification Failure (No Host Block). That article recommends pinning the IP that known_hosts already trusts, which is correct when the node is stable. This one covers what happens when a migration changes both the IP and the host key, which is exactly when IP-pinning stops paying off.

The Three Failure Modes

A migration/rebuild can trigger any of these — often several at once across a fleet, which is what makes it confusing:

1. Stale hardcoded IP → connection times out

The node re-registered on the tailnet with a new Tailscale IP, but your alias still names the old one:

$ tttpod
ssh: connect to host 100.84.42.102 port 22: Operation timed out

The old address is dead; SSH waits the full timeout and gives up. Confirm by asking the tailnet for the node's current IP by name:

$ tailscale status | grep tttpod
100.95.137.38   tttpod   ...     # alias points at 100.84.42.102 — stale
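The comparison above can be scripted. A minimal sketch, assuming a hypothetical helper named `check_alias_ip` (not part of Tailscale) that is handed the pinned IP and the current IP as plain strings:

```shell
# check_alias_ip NAME PINNED_IP CURRENT_IP   (hypothetical helper)
# Reports whether the IP pinned in an alias still matches the node's current IP.
check_alias_ip() {
  name=$1
  pinned=$2
  current=$3
  if [ -z "$current" ]; then
    echo "$name: unknown (no current IP supplied)"
  elif [ "$pinned" = "$current" ]; then
    echo "$name: ok ($pinned)"
  else
    echo "$name: STALE (alias has $pinned, tailnet has $current)"
  fi
}
```

On a machine with the tailscale CLI, the current IP could be supplied with `check_alias_ip tttpod 100.84.42.102 "$(tailscale ip -4 tttpod)"`, assuming `tailscale ip -4 <name>` resolves the peer on your client.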

2. Cold-path teardown → first connect after idle times out

The IP is correct and the node is up (it answers ping), but TCP/22 still times out on the first try after a quiet period, then works on retry. Tailscale 1.98.x is more aggressive about tearing down idle direct UDP paths; the first SSH connection has to re-establish NAT traversal, and that can outlast the connect timeout in effect (a short ConnectTimeout set in a script or config, or the system TCP timeout when none is set).

$ tailscale status | grep tttpod
100.95.137.38   tttpod   ...   idle, tx 9360 rx 0      # cold path
$ tailscale ping tttpod
pong from tttpod (100.95.137.38) via 5.161.118.84:41641 in 48ms   # warms instantly
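Since a tailscale ping warms the path, one option is to ping before connecting. A rough sketch, assuming a hypothetical wrapper named `warm_ssh`; the `TAILSCALE_BIN` and `SSH_BIN` variables are illustrative override hooks for dry runs, not real Tailscale or OpenSSH settings:

```shell
# warm_ssh HOST [COMMAND...]   (hypothetical wrapper)
# Pings the node over the tailnet first, then opens the SSH session as root.
# TAILSCALE_BIN and SSH_BIN default to the real binaries and exist only so the
# wrapper can be exercised without touching the network.
warm_ssh() {
  host=$1
  shift
  # Up to three pings; a failed warm-up is only a warning, ssh is still tried.
  "${TAILSCALE_BIN:-tailscale}" ping -c 3 "$host" >/dev/null 2>&1 ||
    echo "warm_ssh: could not warm path to $host, trying ssh anyway" >&2
  "${SSH_BIN:-ssh}" "root@$host" "$@"
}
```

Usage would be `warm_ssh tttpod hostname`. This only masks the cold path on machines that have the tailscale CLI; the ConnectTimeout setting in the fix section does not depend on it.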

3. Host-key verification failed → box was rebuilt

The node was reinstalled, so it presents a new SSH host key. Your known_hosts still has the old one, so even StrictHostKeyChecking=accept-new aborts: accept-new records keys only for hosts it has never seen and refuses to connect when a stored key no longer matches.

$ ssh root@tttpod hostname
Host key verification failed.

The Fix

Three changes, applied on every name-capable machine (see the WSL2 caveat below):

a. Switch aliases from IPs to MagicDNS names

# before — rots on every migration
alias tttpod='ssh root@100.84.42.102'
# after — always resolves the node's current IP
alias tttpod='ssh root@tttpod'

MagicDNS resolves the name to whatever IP the node currently has, so a future migration needs zero alias edits. This is the whole point: the tailnet already knows the mapping — stop duplicating (and stale-ing) it in your dotfiles.

Exception: if there's no tailnet device with that exact name (e.g. an alias teelia pointing at a node actually named temptedparadise), MagicDNS can't resolve it — keep the IP for that one.
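For a dotfile with many aliases, the rewrite can be mechanical. A minimal sketch, assuming aliases are written exactly in the single-quoted `alias NAME='ssh USER@IP'` form shown above; `ip_alias_to_name` is a hypothetical helper, and its optional second argument covers an alias whose name differs from the node name, if you would rather point it at the real node than keep the IP:

```shell
# ip_alias_to_name ALIAS [NODE]   (hypothetical helper)
# Reads alias definitions on stdin and rewrites "ssh USER@<ipv4>" for ALIAS
# into "ssh USER@NODE". NODE defaults to ALIAS. Other lines pass through as-is.
ip_alias_to_name() {
  alias_name=$1
  node=${2:-$1}
  sed -E "s/^(alias ${alias_name}='ssh [^@']+@)[0-9]{1,3}(\.[0-9]{1,3}){3}'/\1${node}'/"
}
```

The helper writes to stdout rather than editing in place, so the result can be reviewed first, for example by piping your aliases file through `ip_alias_to_name tttpod` and diffing the output against the original.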

b. Purge stale host keys, then re-accept

After a rebuild, clear the old entries under both the name and the current IP, then reconnect with accept-new to record the fresh key. Over Tailscale's authenticated WireGuard tunnel, a key change from a known rebuild is safe to accept.

for pair in "tttpod:100.95.137.38" "majortoot:100.64.169.62" "dcaprod:100.98.223.93"; do
  n="${pair%%:*}"; ip="${pair##*:}"
  ssh-keygen -R "$n"; ssh-keygen -R "$ip"
done
# repopulate
ssh -o StrictHostKeyChecking=accept-new root@tttpod hostname

c. Add a cold-path cushion to `~/.ssh/config`

Give the first (cold) connection time to renegotiate instead of erroring:

Host majorlinux tttpod majortoot majordiscord dcaprod majormail majorhome
    ConnectTimeout 25
    ServerAliveInterval 30
    ServerAliveCountMax 4

ConnectTimeout 25 gives the first connection up to 25 seconds to complete, so re-establishing a cold path shows up as a ~1–2 s pause instead of an error. The keepalives hold the path open during an active session so it doesn't drop mid-command.
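To confirm the block actually applies to a given host, `ssh -G <host>` prints the effective client configuration without connecting, one lowercase keyword per line. A small sketch, assuming a hypothetical filter named `effective_opt` that picks one option out of that output:

```shell
# effective_opt OPTION   (hypothetical helper)
# Reads `ssh -G <host>` output on stdin and prints the value of OPTION.
# ssh -G emits keywords in lowercase, so the option name is lowercased first.
effective_opt() {
  key=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  awk -v key="$key" '$1 == key { print $2; exit }'
}
```

With the Host block in place, `ssh -G tttpod | effective_opt ConnectTimeout` would be expected to print 25; any other value suggests an earlier Host or Match block is setting the option first.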

Caveat: WSL2 Can't Use MagicDNS

A Linux box under WSL2 typically has no tailscale CLI and no MagicDNS resolver — it rides the Windows host's networking, and name lookups for tailnet nodes fail:

$ getent hosts tttpod        # (inside WSL2)
                             # nothing — no resolution
$ command -v tailscale       # nothing — CLI lives on the Windows side

On those machines you must keep hardcoded IPs in ~/.ssh/config (or use Host blocks with explicit HostName <ip>), and refresh them by hand when a node migrates. There's no self-healing option there — the trade is unavoidable.
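Refreshing by hand is less error-prone if the Host blocks are generated from one list of name:ip pairs. A minimal sketch, assuming a hypothetical helper named `host_blocks` and the same pair format as the purge loop earlier; the ConnectTimeout line simply reuses the value from the fix above:

```shell
# host_blocks NAME:IP [NAME:IP...]   (hypothetical helper)
# Prints one ~/.ssh/config Host block per pair, with an explicit HostName.
host_blocks() {
  for pair in "$@"; do
    printf 'Host %s\n    HostName %s\n    ConnectTimeout 25\n\n' \
      "${pair%%:*}" "${pair##*:}"
  done
}
```

Running `host_blocks "tttpod:100.95.137.38" "majortoot:100.64.169.62"` prints blocks that can be pasted into ~/.ssh/config on the WSL2 side. The IPs still have to come from a machine that can run tailscale status.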

Diagnosis Checklist

tailscale status | grep <host> — does your alias's IP match the current one? (Mode 1: stale IP.)
ping/tailscale ping <host> works but TCP/22 times out on first try, succeeds on retry? (Mode 2: cold path.)
ssh root@<host> true → Host key verification failed (not Permission denied)? (Mode 3: rebuilt box, stale known_hosts.)
Is the client a WSL2 box? getent hosts <name> returns nothing → MagicDNS unavailable, stay on IPs.
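The checklist keys off the exact error text, so a first-pass triage can be scripted. A rough sketch, assuming a hypothetical function named `classify_ssh_error` that is given the last line of ssh's stderr; the labels are this article's mode numbers, and a timeout alone cannot tell mode 1 from mode 2:

```shell
# classify_ssh_error MESSAGE   (hypothetical helper)
# Maps an ssh error line to the failure modes described in this article.
classify_ssh_error() {
  case $1 in
    *"Host key verification failed"*)
      echo "mode 3: rebuilt box, stale known_hosts" ;;
    *"timed out"*)
      echo "mode 1 or 2: check for a stale IP, then retry for a cold path" ;;
    *"Permission denied"*)
      echo "authentication problem, not one of the three modes" ;;
    *)
      echo "unrecognized" ;;
  esac
}
```

For example, `classify_ssh_error "$(ssh root@tttpod true 2>&1 | tail -n 1)"` classifies a live attempt; the match on "timed out" is meant to cover both the "Operation timed out" and "Connection timed out" wordings.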

Takeaway

Pin the IP when a host is stable and the IP-keyed known_hosts entry is your durable trust anchor. Switch to MagicDNS names when hosts move — migrations, rebuilds, provider changes — so the tailnet's own name→IP mapping does the work your dotfiles kept getting wrong. And on WSL2, you don't get the choice: hardcoded IPs, refreshed by hand.
