majorwiki/05-troubleshooting/networking/ssh-socket-tailscale-race-condition.md
majorlinux 3b8c8b0597 ssh.socket wiki: correct BindsTo→Requires, add warning
BindsTo=tailscaled.service causes a systemd ordering cycle that
prevents ssh.socket from starting on reboot. Updated the recommended
fix to use Requires= and added a warning admonition explaining why
BindsTo must not be used. Added tttpod-hetzner to affected hosts
list and linked the 2026-05-23 dcaprod incident.
2026-05-23 02:40:04 -04:00


Tailscale Boot Race Conditions (SSH Unreachable After Reboot)

Two related race conditions can make a host unreachable via Tailscale after reboot. Both stem from systemd units starting before Tailscale or the network is ready.


Race 1: ssh.socket Binds Before Tailscale Is Up (Ubuntu)

Symptom

SSH to the host's Tailscale IP times out. tailscale ping works and tailscale status shows active; direct, but nothing accepts connections on port 22. If the root password is unset, the Hetzner console offers no way in either.

Cause

Ubuntu 24.04 uses systemd socket activation for SSH (ssh.socket instead of a persistent ssh.service). When the socket override binds to a Tailscale IP, the socket can start before tailscaled.service is ready. The bind may succeed initially (the Tailscale state file caches the IP), but a later Tailscale reconnect or interface reset silently invalidates the bound address. SSH then dies with no recovery path.

Diagnosis

# From another host:
tailscale ping <IP>          # succeeds — host is up
ssh root@<IP>                # times out — sshd not listening

# After gaining console access or reboot:
systemctl status ssh.socket  # check Listen: address
journalctl -b -1 -u ssh     # likely empty — sshd never spawned
journalctl -b -1 -u ssh.socket  # socket started before tailscaled
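
To confirm that nothing is listening on the Tailscale address, the listener table can be compared against the current Tailscale IP. This is a minimal sketch that assumes console access on the affected host; listening_on is a hypothetical helper name, not part of any tool:

```shell
# listening_on reads `ss -tln` output on stdin and succeeds only if a
# TCP listener exists on exactly <ip>:<port>. Helper name is hypothetical.
listening_on() {
    # $1 = IP address, $2 = port
    awk -v want="$1:$2" '
        $1 == "LISTEN" && $4 == want { found = 1 }
        END { exit !found }'
}

# Usage on the affected host:
#   ss -tln | listening_on "$(tailscale ip -4)" 22 \
#       && echo "ssh.socket is listening" \
#       || echo "nothing listening on the Tailscale IP"
```

An exact match on the address is deliberate: a listener on a stale or different IP should count as a failure here.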

Fix

Add a Tailscale dependency to the socket override:

# /etc/systemd/system/ssh.socket.d/override.conf
[Unit]
After=tailscaled.service
Requires=tailscaled.service

[Socket]
# An empty ListenStream= clears the default listeners before adding the new one
ListenStream=
ListenStream=<TAILSCALE_IP>:22

Then reload and restart:

systemctl daemon-reload
systemctl restart ssh.socket
systemctl status ssh.socket   # verify Listen: shows correct IP

  • After= orders the socket after tailscaled.service, so it waits for Tailscale to start
  • Requires= pulls in tailscaled.service when the socket starts and keeps the socket from activating if tailscaled fails to start
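
To check that the drop-in was actually merged into the unit, the effective dependencies can be read back with systemctl show. The sketch below is one way to do that; has_dep is a hypothetical helper that parses the Property=value output:

```shell
# has_dep reads `systemctl show <unit> -p <Property>` output on stdin and
# succeeds if the given unit is listed in that property. Name is hypothetical.
has_dep() {
    # $1 = property name (e.g. Requires), $2 = unit to look for
    awk -F= -v prop="$1" -v unit="$2" '
        $1 == prop {
            n = split($2, deps, " ")
            for (i = 1; i <= n; i++) if (deps[i] == unit) found = 1
        }
        END { exit !found }'
}

# Usage:
#   systemctl show ssh.socket -p Requires | has_dep Requires tailscaled.service
#   systemctl show ssh.socket -p After    | has_dep After    tailscaled.service
#   systemctl show ssh.socket -p BindsTo  | has_dep BindsTo  tailscaled.service \
#       && echo "BindsTo is still set: remove it"
```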

!!! warning "Do NOT use BindsTo"
    BindsTo=tailscaled.service creates a systemd ordering cycle during shutdown: basic.target → sockets.target → ssh.socket → tailscaled.service → basic.target. Systemd breaks the cycle by deleting jobs unpredictably, which can prevent ssh.socket from starting on the next boot and leave SSH dead until manual intervention. This was discovered on 2026-05-23, after the original fix (2026-05-19) used BindsTo and caused a second outage on dcaprod-hetzner. Requires provides the startup dependency without the dangerous bidirectional lifecycle coupling.
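
When systemd breaks an ordering cycle it normally logs messages containing the phrase "ordering cycle", so the journal of the affected boot can be searched for them. This is a minimal sketch; cycle_lines is a hypothetical helper name, and the exact message wording may vary between systemd versions:

```shell
# cycle_lines filters journal text on stdin down to systemd's
# ordering-cycle messages. Helper name is hypothetical.
cycle_lines() {
    grep -iE 'ordering cycle'
}

# Usage (previous boot, then current boot):
#   journalctl -b -1 | cycle_lines
#   journalctl -b    | cycle_lines
```

No output suggests no cycle was logged for that boot. Any hit naming ssh.socket or tailscaled.service is worth checking against the override.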

Affected Hosts

Ubuntu hosts using configure_tailscale_ssh_only.yml: majorlinux, dcaprod-hetzner, tttpod-hetzner. Fedora hosts (majordiscord) restrict SSH with firewall rules and are not affected by this race.


Race 2: tailscaled Starts Before Network Is Online (All Hosts)

Symptom

Host reboots but never appears on Tailscale. tailscale ping times out entirely. SSH is dead because Tailscale never connects. The host is up (accessible via provider console) but isolated from the Tailscale network.

Cause

tailscaled.service ships with After=network-pre.target, which fires before the network interface has an IP. On VPS hosts (especially Hetzner), the interface can take several seconds to come online. Tailscale starts, sees no network (SetNetworkUp(false), link state: defaultRoute= ifs={} v4=false v6=false), fails DNS bootstrap and DERP relay connections, and gets stuck without ever retrying.

Diagnosis

# From Hetzner console or another access method:
journalctl -b -u tailscaled | grep -E "SetNetworkUp|link state|error|DERP"
# Look for:
#   magicsock: SetNetworkUp(false)
#   link state: interfaces.State{defaultRoute= ifs={} v4=false v6=false}
#   health: Tailscale could not connect to any relay server

Fix

Deploy a systemd drop-in to wait for full network connectivity:

# /etc/systemd/system/tailscaled.service.d/override.conf
[Unit]
After=network-online.target
Wants=network-online.target

Then reload and restart:

systemctl daemon-reload
systemctl restart tailscaled
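
network-online.target only delays startup if a wait-online service is enabled (systemd-networkd-wait-online.service or NetworkManager-wait-online.service, depending on the network stack). Otherwise the target is reached immediately and the drop-in has no effect. The sketch below checks for that; any_enabled is a hypothetical helper name:

```shell
# any_enabled reads `systemctl is-enabled` output (one state per line) on
# stdin and succeeds if at least one unit is enabled. Name is hypothetical.
any_enabled() {
    grep -qxE 'enabled|enabled-runtime'
}

# Usage:
#   systemctl is-enabled systemd-networkd-wait-online.service \
#                        NetworkManager-wait-online.service 2>/dev/null \
#       | any_enabled \
#       || echo "no wait-online service enabled: network-online.target will not wait"
```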

Affected Hosts

All hosts where Tailscale is the primary access path, both Fedora and Ubuntu. The impact is greatest on VPS hosts with slow interface bring-up.


Prevention

  • Set root passwords on all VPS hosts for emergency console access
  • Ansible playbooks deploy both fixes automatically:
    • configure_tailscale_network_wait.yml — tailscaled network-online dependency (all hosts)
    • configure_tailscale_ssh_only.yml — ssh.socket Tailscale dependency (Ubuntu only)

References