MajorLinux 8d4dee5da3 troubleshooting: correct ssh tailscale-race article (Fedora ListenAddress variant + playbook cycle landmine)

- Fedora hosts are NOT automatically immune: a leftover manual
  `ListenAddress <tailscale-ip>` drop-in reintroduces the sshd boot bind-race
  even under firewalld (hit on majordiscord 2026-06-07; fix = remove it).
- The Ubuntu playbook kept shipping the cycle-causing [Unit] gate on
  ssh.socket despite the 2026-06-04 resolution; re-running it re-armed the
  ordering cycle (clobbered majorlinux; majortoot-hetzner found armed).
  Corrected in MajorAnsible e0d35aa. Fleet ssh-lockdown state is inconsistent
  (dcaprod/tttpod lack wait-ready; teelia no override) — needs a per-host audit.

2026-06-07 05:56:25 -04:00


Tailscale Boot Race Conditions (SSH Unreachable After Reboot)

Two related race conditions can make a host unreachable via Tailscale after reboot. Both stem from systemd services starting before Tailscale or the network is ready.

Race 1: ssh.socket Binds Before Tailscale Is Up (Ubuntu)

Symptom

SSH to a host via its Tailscale IP times out. tailscale ping works and tailscale status shows active; direct, but nothing accepts connections on port 22. If the root password is unset, the Hetzner console offers no way in either.

Cause

Ubuntu 24.04 uses systemd socket activation for SSH (ssh.socket instead of persistent ssh.service). When the socket override binds to a Tailscale IP, it can start before tailscaled.service is ready. The bind may succeed initially (Tailscale state file caches the IP), but a later Tailscale reconnect or interface reset invalidates the bound address silently — SSH dies with no recovery path.

Diagnosis

# From another host:
tailscale ping <IP>          # succeeds — host is up
ssh root@<IP>                # times out — sshd not listening

# After gaining console access or reboot:
systemctl status ssh.socket  # check Listen: address
journalctl -b -1 -u ssh     # likely empty — sshd never spawned
journalctl -b -1 -u ssh.socket  # socket started before tailscaled
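Once console access is available, the "sshd not listening" case can be confirmed against the listener table directly. The following is a minimal sketch, assuming iproute2's ss is installed; listening_on is a hypothetical helper name, and the address in the usage comment is a placeholder.

```shell
# Sketch: succeed if a TCP listener is bound to exactly <addr>:<port>.
# listening_on is a hypothetical helper, not part of any playbook.
listening_on() (
    addr="$1"
    port="${2:-22}"
    # -H: no header, -t: TCP only, -l: listening sockets, -n: numeric output.
    # With -t alone, column 4 of the output is "Local Address:Port".
    ss -Htln | awk -v want="$addr:$port" '$4 == want { found = 1 } END { exit !found }'
)

# Usage on the affected host (placeholder address shown):
#   listening_on 100.64.0.10 22 && echo "listener present" || echo "nothing bound"
#   listening_on "$(tailscale ip -4)" 22
```

A non-zero exit while tailscale ping still succeeds points at the socket bind rather than at Tailscale connectivity.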

Fix (current — 2026-05-31)

After=tailscaled.service orders against the service becoming active — not against the tailscale0 interface actually having an IPv4 address. tailscaled flips to active within a second of starting, but the kernel doesn't have the address bound to the interface until DERP relays connect and the control plane confirms the node. ssh.socket attempting ListenStream=<TS IP>:22 in that window fails with Cannot assign requested address, the socket goes into a failed state, and there is no automatic retry.

The proper gate is a dedicated readiness service that waits for the tailscale0 IPv4 address to exist before letting ssh.socket bind:

# /etc/systemd/system/tailscale-wait-ready.service
[Unit]
Description=Wait until tailscale0 has an IPv4 address
After=tailscaled.service
Requires=tailscaled.service
ConditionPathExists=/usr/sbin/ip

[Service]
Type=oneshot
RemainAfterExit=yes
TimeoutStartSec=120
ExecStart=/usr/bin/bash -c 'for i in $(seq 1 120); do ip -4 -o addr show tailscale0 2>/dev/null | grep -q "inet " && exit 0; sleep 1; done; exit 1'

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/ssh.socket.d/override.conf
[Unit]
After=tailscale-wait-ready.service
Requires=tailscale-wait-ready.service

[Socket]
ListenStream=
ListenStream=<TAILSCALE_IP>:22

Reload + restart:

systemctl daemon-reload
systemctl enable tailscale-wait-ready.service
systemctl restart ssh.socket
ss -tlnp | grep :22   # verify bound to Tailscale IP
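For manual checks, the polling loop from tailscale-wait-ready.service can be written out as a readable function. This is only a sketch of the same logic: wait_for_ipv4 is a hypothetical name, and the interval argument is an addition that the unit file does not have.

```shell
# Sketch: poll until an interface has an IPv4 address, or give up.
# Mirrors the ExecStart loop in tailscale-wait-ready.service.
wait_for_ipv4() (
    iface="${1:-tailscale0}"
    tries="${2:-120}"      # same attempt count as the unit file
    interval="${3:-1}"     # seconds between attempts (assumed parameter)
    i=0
    while [ "$i" -lt "$tries" ]; do
        # `ip -4 -o addr show` prints one line per IPv4 address; no line means not ready.
        if ip -4 -o addr show "$iface" 2>/dev/null | grep -q 'inet '; then
            exit 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    exit 1
)

# Usage: wait_for_ipv4 tailscale0 30 && ss -tlnp | grep :22
```

Because the function body runs in a subshell, exit only ends the function and returns its status to the caller.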

!!! note "Evolution of this fix"
    - 2026-05-19 v1: After=tailscaled.service + BindsTo=tailscaled.service. Worked initially but caused a shutdown-time ordering cycle.
    - 2026-05-23 v2: BindsTo swapped for Requires to break the cycle. Fixed the cycle but did not wait for tailscale0 to actually have an IP, only for tailscaled to be active. Hosts continued losing SSH after some reboots (intermittent, depending on whether the race won).
    - 2026-05-31 v3: Added tailscale-wait-ready.service to gate ssh.socket on the interface having an address. This is the current canonical fix.

!!! warning "Do NOT use BindsTo"
    BindsTo=tailscaled.service creates a systemd ordering cycle during shutdown: basic.target → sockets.target → ssh.socket → tailscaled.service → basic.target. Systemd breaks the cycle by deleting jobs unpredictably, which can prevent ssh.socket from starting on the next boot. Use Requires= for startup ordering without the bidirectional lifecycle coupling.
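When systemd breaks a cycle it logs the fact, so a host can be checked for this after the event. The filter below is a small sketch that reads journal text on stdin; has_ordering_cycle is a hypothetical helper, and the two patterns are the phrases systemd uses when it finds a cycle and when it deletes a job to break one.

```shell
# Sketch: print (and succeed on) journal lines that report an ordering cycle.
has_ordering_cycle() {
    grep -E 'Found ordering cycle|deleted to break ordering cycle'
}

# Usage (previous boot, which includes its shutdown phase):
#   journalctl -b -1 | has_ordering_cycle && echo "cycle was hit on the previous boot"
```

The lines it prints name the units involved, which shows whether ssh.socket was the job that got dropped.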

Affected Hosts

Ubuntu hosts using configure_tailscale_ssh_only.yml: majorlinux, dcaprod-hetzner, tttpod-hetzner, majortoot-hetzner.

[!danger] The Ubuntu playbook shipped the cycle pattern until 2026-06-07
Despite the 2026-06-04 resolution, configure_tailscale_ssh_only.yml in the repo kept deploying the [Unit] Requires=tailscale-wait-ready.service gate on ssh.socket (the cycle-causer) and never added the ssh.service gate, so re-running it re-armed the ordering cycle. Caught 2026-06-07: it clobbered majorlinux's hand-fix, and majortoot-hetzner was found already armed with the latent cycle (it would have lost SSH on its next reboot). Both were restored/defused, and the playbook was corrected in MajorAnsible e0d35aa (gate on ssh.service, dependency-free socket).
⚠️ dcaprod-hetzner and tttpod-hetzner lack tailscale-wait-ready.service, and teelia has no socket override. The Ubuntu SSH-lockdown state is inconsistent across the fleet and needs a deliberate per-host audit.
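A per-host audit could start from a file-level check like the one sketched here. It is a heuristic built only on the paths named in this article: audit_ssh_lockdown is a hypothetical helper, the optional root prefix exists so it can be pointed at a copied tree, and treating any After=, Requires= or BindsTo= line in the socket override as a cycle risk is an assumption drawn from the "dependency-free socket" wording above.

```shell
# Sketch: report the SSH-lockdown state of one host from its unit files.
audit_ssh_lockdown() (
    root="${1:-}"                      # optional prefix, e.g. a copied /etc tree
    unit_dir="$root/etc/systemd/system"

    if [ -f "$unit_dir/tailscale-wait-ready.service" ]; then
        echo "wait-ready: present"
    else
        echo "wait-ready: MISSING"
    fi

    override="$unit_dir/ssh.socket.d/override.conf"
    if [ ! -f "$override" ]; then
        echo "socket-override: MISSING"
    elif grep -Eq '^(After|Requires|BindsTo)=' "$override"; then
        # Assumption: unit-level dependencies on ssh.socket are the armed pattern.
        echo "socket-override: has unit dependencies (cycle risk)"
    else
        echo "socket-override: dependency-free"
    fi
)

# Usage on a host: audit_ssh_lockdown
```

The output is two lines per host, which makes it easy to collect across the fleet and compare.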

[!warning] Fedora hosts are NOT automatically immune (corrected 2026-06-07)
The firewalld method (configure_tailscale_ssh_only_fedora.yml) binds sshd on 0.0.0.0:22 and enforces Tailscale-only via the firewall, so it has no dependency on the Tailscale address, unless a host also carries a leftover manual ListenAddress <tailscale-ip> drop-in (/etc/ssh/sshd_config.d/tailscale-only.conf) from the pre-firewall lockdown. Then sshd.service hits the same boot bind-race (Bind to port 22 on <ts-ip> failed: Cannot assign requested address) and flaps every reboot. Hit on majordiscord 2026-06-07; fixed by removing the redundant drop-in (firewall stays the enforcing layer). The Fedora playbook now removes it automatically (MajorAnsible b4a9090).
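To spot a leftover drop-in of this kind before the next reboot, the sshd_config.d directory can be scanned for active ListenAddress lines. This is a minimal sketch: find_listenaddress_dropins is a hypothetical helper, and it only inspects *.conf files in the one directory it is given.

```shell
# Sketch: list drop-in files that contain an uncommented ListenAddress directive.
find_listenaddress_dropins() (
    dir="${1:-/etc/ssh/sshd_config.d}"
    # -l: print matching file names only, -i: sshd keywords are case-insensitive.
    # Commented-out lines do not match because only leading whitespace is allowed.
    grep -lEi '^[[:space:]]*ListenAddress[[:space:]]' "$dir"/*.conf 2>/dev/null
)

# Usage:
#   find_listenaddress_dropins || echo "no ListenAddress drop-ins"
#   sshd -T | grep -i listenaddress    # effective values after all includes
```

An empty result with a non-zero exit means no drop-in sets ListenAddress, which is the expected state on a host that relies on firewalld alone.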

Race 2: tailscaled Starts Before Network Is Online (All Hosts)

Symptom

Host reboots but never appears on Tailscale. tailscale ping times out entirely. SSH is dead because Tailscale never connects. The host is up (accessible via provider console) but isolated from the Tailscale network.

Cause

tailscaled.service ships with After=network-pre.target, which fires before the network interface has an IP. On VPS hosts (especially Hetzner), the interface can take several seconds to come online. Tailscale starts, sees no network (SetNetworkUp(false), link state: defaultRoute= ifs={} v4=false v6=false), fails DNS bootstrap and DERP relay connections, and gets stuck — never retrying.

Diagnosis

# From Hetzner console or another access method:
journalctl -b -u tailscaled | grep -E "SetNetworkUp|link state|error|DERP"
# Look for:
#   magicsock: SetNetworkUp(false)
#   link state: interfaces.State{defaultRoute= ifs={} v4=false v6=false}
#   health: Tailscale could not connect to any relay server

Fix

Deploy a systemd drop-in to wait for full network connectivity:

# /etc/systemd/system/tailscaled.service.d/override.conf
[Unit]
After=network-online.target
Wants=network-online.target

Then reload and restart:

systemctl daemon-reload
systemctl restart tailscaled
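After the reload it is worth confirming that the drop-in was actually merged into the unit. The check below is a sketch; orders_after_network_online is a hypothetical helper name, and it relies only on systemctl show with the -p and --value options.

```shell
# Sketch: succeed if the unit's effective After= list contains network-online.target.
orders_after_network_online() (
    unit="${1:-tailscaled.service}"
    # --value prints just the space-separated property value; split it to one unit per line.
    systemctl show -p After --value "$unit" | tr ' ' '\n' | grep -qx 'network-online.target'
)

# Usage:
#   orders_after_network_online && echo "drop-in active" || echo "drop-in not applied"
```

Note that network-online.target only delays boot when a wait-online service is enabled (systemd-networkd-wait-online.service or NetworkManager-wait-online.service, depending on the network stack), so that is worth checking with systemctl is-enabled as well.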

Affected Hosts

All hosts where Tailscale is the primary access path. Particularly impactful on VPS hosts with slow interface bringup. Both Fedora and Ubuntu hosts are affected.

Prevention

- Set root passwords on all VPS hosts for emergency console access.
- Ansible playbooks deploy both fixes automatically:
    - configure_tailscale_network_wait.yml: tailscaled network-online dependency (all hosts)
    - configure_tailscale_ssh_only.yml: ssh.socket Tailscale dependency (Ubuntu only)

References

dcaprod#2026-05-19 — SSH unreachable due to ssh.socket race condition with Tailscale
majordiscord#2026-05-19 — Tailscale boot race: unreachable after Ansible reboot
majorlinux#2026-05-19 — ssh.socket override patched: added Tailscale dependency
dcaprod#2026-05-23 — SSH unreachable again: BindsTo ordering cycle in ssh.socket override
majortoot#2026-05-31 — ssh.socket race post-reboot on majortoot-hetzner (during cutover night)
Ansible: configure_tailscale_ssh_only.yml, configure_tailscale_network_wait.yml
