diff --git a/05-troubleshooting/networking/ssh-socket-tailscale-race-condition.md b/05-troubleshooting/networking/ssh-socket-tailscale-race-condition.md
index 22e5b25..817958d 100644
--- a/05-troubleshooting/networking/ssh-socket-tailscale-race-condition.md
+++ b/05-troubleshooting/networking/ssh-socket-tailscale-race-condition.md
@@ -1,14 +1,20 @@
-# ssh.socket Unreachable After Reboot (Tailscale Race Condition)
+# Tailscale Boot Race Conditions (SSH Unreachable After Reboot)
 
-## Symptom
+Two related race conditions can make a host unreachable via Tailscale after reboot. Both stem from systemd services starting before Tailscale or the network is ready.
+
+---
+
+## Race 1: ssh.socket Binds Before Tailscale Is Up (Ubuntu)
+
+### Symptom
 
 SSH to a host via Tailscale IP times out. `tailscale ping` works, `tailscale status` shows `active; direct`, but SSH on port 22 refuses connections. No access via Hetzner console if root password is unset.
 
-## Cause
+### Cause
 
 Ubuntu 24.04 uses systemd **socket activation** for SSH (`ssh.socket` instead of persistent `ssh.service`). When the socket override binds to a Tailscale IP, it can start *before* `tailscaled.service` is ready. The bind may succeed initially (Tailscale state file caches the IP), but a later Tailscale reconnect or interface reset invalidates the bound address silently — SSH dies with no recovery path.
 
-## Diagnosis
+### Diagnosis
 
 ```bash
 # From another host:
@@ -21,7 +27,7 @@ journalctl -b -1 -u ssh # likely empty — sshd never spawned
 journalctl -b -1 -u ssh.socket # socket started before tailscaled
 ```
 
-## Fix
+### Fix
 
 Add Tailscale dependency to the socket override:
 
@@ -47,17 +53,67 @@ systemctl status ssh.socket # verify Listen: shows correct IP
 - `After=` ensures the socket waits for Tailscale to start
 - `BindsTo=` restarts the socket if Tailscale restarts, preventing stale binds
 
+### Affected Hosts
+
+Ubuntu hosts using `configure_tailscale_ssh_only.yml`: majorlinux, dcaprod-hetzner. Fedora hosts (majordiscord) use firewall rules for SSH restriction — not affected by this race.
+
+---
+
+## Race 2: tailscaled Starts Before Network Is Online (All Hosts)
+
+### Symptom
+
+Host reboots but never appears on Tailscale. `tailscale ping` times out entirely. SSH is dead because Tailscale never connects. The host is up (accessible via provider console) but isolated from the Tailscale network.
+
+### Cause
+
+`tailscaled.service` ships with `After=network-pre.target`, which fires *before* the network interface has an IP. On VPS hosts (especially Hetzner), the interface can take several seconds to come online. Tailscale starts, sees no network (`SetNetworkUp(false)`, `link state: defaultRoute= ifs={} v4=false v6=false`), fails DNS bootstrap and DERP relay connections, and gets stuck — never retrying.
+
+### Diagnosis
+
+```bash
+# From Hetzner console or another access method:
+journalctl -b -u tailscaled | grep -E "SetNetworkUp|link state|error|DERP"
+# Look for:
+# magicsock: SetNetworkUp(false)
+# link state: interfaces.State{defaultRoute= ifs={} v4=false v6=false}
+# health: Tailscale could not connect to any relay server
+```
+
+### Fix
+
+Deploy a systemd drop-in to wait for full network connectivity:
+
+```ini
+# /etc/systemd/system/tailscaled.service.d/override.conf
+[Unit]
+After=network-online.target
+Wants=network-online.target
+```
+
+Then reload and restart:
+
+```bash
+systemctl daemon-reload
+systemctl restart tailscaled
+```
+
+### Affected Hosts
+
+All hosts where Tailscale is the primary access path. Particularly impactful on VPS hosts with slow interface bringup. Both Fedora and Ubuntu hosts are affected.
+
+---
+
 ## Prevention
 
-- Set root passwords on all Hetzner hosts for emergency console access
-- Ansible playbook `configure_tailscale_ssh_only.yml` includes both directives as of commit `7ef182b`
-
-## Affected Hosts
-
-Ubuntu hosts using `configure_tailscale_ssh_only.yml`: majorlinux, dcaprod-hetzner. Fedora hosts (majordiscord) use firewall rules for SSH restriction — not affected.
+- Set root passwords on all VPS hosts for emergency console access
+- Ansible playbooks deploy both fixes automatically:
+  - `configure_tailscale_network_wait.yml` — tailscaled network-online dependency (all hosts)
+  - `configure_tailscale_ssh_only.yml` — ssh.socket Tailscale dependency (Ubuntu only)
 
 ## References
 
 - [[dcaprod#2026-05-19 — SSH unreachable due to ssh.socket race condition with Tailscale]]
+- [[majordiscord#2026-05-19 — Tailscale boot race: unreachable after Ansible reboot]]
 - [[majorlinux#2026-05-19 — ssh.socket override patched: added Tailscale dependency]]
-- Ansible: `configure_tailscale_ssh_only.yml`
+- Ansible: `configure_tailscale_ssh_only.yml`, `configure_tailscale_network_wait.yml`
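
Review note: after deploying both playbooks, it may be worth confirming that systemd actually picked up the dependencies before the next reboot. The sketch below is one way to do that, not part of either playbook. It reads unit properties with `systemctl show -p PROPERTY --value UNIT`. The helper names `unit_lists` and `check_dep` are hypothetical, and the `ssh.socket` checks assume the override names `tailscaled.service` in both `After=` and `BindsTo=`, which this diff does not show.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the space-separated property value ($1)
# contains the unit name ($2) as a whole word.
unit_lists() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: report whether a unit's property lists a dependency.
# $1 = unit, $2 = property (After, Wants, BindsTo), $3 = expected dependency
check_dep() {
    value=$(systemctl show -p "$2" --value "$1" 2>/dev/null) || value=""
    if unit_lists "$value" "$3"; then
        echo "ok: $1 $2 includes $3"
    else
        echo "MISSING: $1 $2 does not include $3"
    fi
}

# Run the live checks only where systemctl is available.
if command -v systemctl >/dev/null 2>&1; then
    # Race 2 fix: tailscaled waits for full network connectivity.
    check_dep tailscaled.service After network-online.target
    check_dep tailscaled.service Wants network-online.target
    # Race 1 fix (Ubuntu only; dependency name is an assumption, see above).
    check_dep ssh.socket After tailscaled.service
    check_dep ssh.socket BindsTo tailscaled.service
fi
```

On Fedora hosts the two `ssh.socket` lines are expected to report `MISSING`, since those hosts restrict SSH with firewall rules rather than a socket override.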