
# Tailscale Boot Race Conditions (SSH Unreachable After Reboot)

Two related race conditions can make a host unreachable via Tailscale after reboot. Both stem from systemd services starting before Tailscale or the network is ready.

---

## Race 1: ssh.socket Binds Before Tailscale Is Up (Ubuntu)

### Symptom

SSH to a host via its Tailscale IP times out. `tailscale ping` works and `tailscale status` shows `active; direct`, but nothing answers on port 22. If the root password is unset, the Hetzner console offers no way in either.

### Cause

Ubuntu 24.04 uses systemd **socket activation** for SSH (`ssh.socket` instead of a persistent `ssh.service`). When the socket override binds to a Tailscale IP, the socket can start *before* `tailscaled.service` is ready. The bind may succeed initially (the Tailscale state file caches the IP), but a later Tailscale reconnect or interface reset silently invalidates the bound address, and SSH stays dead with no recovery path.

### Diagnosis

```bash
# From another host:
tailscale ping <IP>          # succeeds — host is up
ssh root@<IP>                # times out — sshd not listening

# After gaining console access or reboot:
systemctl status ssh.socket  # check Listen: address
journalctl -b -1 -u ssh     # likely empty — sshd never spawned
journalctl -b -1 -u ssh.socket  # socket started before tailscaled
```
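
Once console access is available, the `Listen:` check above can be scripted. The following is a minimal sketch, not part of the playbooks: the function name `socket_listens_on` is hypothetical, and it assumes `systemctl show ssh.socket -p Listen` prints lines of the form `Listen=<addr>:<port> (Stream)`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `Listen=` lines on stdin and succeeds only if
# one of them names <ip>:22 exactly.
socket_listens_on() {
    local ip="$1"
    # -F: literal match; -w: whole token, so <ip>:22 does not match <ip>:2222
    grep -qFw -- "${ip}:22"
}

# Intended use on the host (assumes systemctl and the tailscale CLI exist):
#   systemctl show ssh.socket -p Listen \
#       | socket_listens_on "$(tailscale ip -4)" \
#       || echo "ssh.socket is not bound to the current Tailscale IP"
```

A mismatch here only shows that the socket is bound to something other than the current Tailscale IPv4 address. It does not by itself prove that the race occurred.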

### Fix

Add a Tailscale dependency to the socket override:

```ini
# /etc/systemd/system/ssh.socket.d/override.conf
[Unit]
After=tailscaled.service
Requires=tailscaled.service

[Socket]
# An empty assignment clears the ListenStream= values inherited from the packaged unit
ListenStream=
ListenStream=<TAILSCALE_IP>:22
```

Then reload and restart:

```bash
systemctl daemon-reload
systemctl restart ssh.socket
systemctl status ssh.socket   # verify Listen: shows correct IP
```

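- `After=` orders the socket after `tailscaled.service`, so the socket is not started until tailscaled has started
- `Requires=` pulls in `tailscaled.service` whenever the socket is started; combined with `After=`, the socket is not started if tailscaled fails to start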

!!! warning "Do NOT use BindsTo"
    `BindsTo=tailscaled.service` creates a **systemd ordering cycle** during shutdown: `basic.target → sockets.target → ssh.socket → tailscaled.service → basic.target`. Systemd breaks the cycle by deleting jobs unpredictably, which can prevent `ssh.socket` from starting on the next boot — leaving SSH dead until manual intervention. This was discovered on 2026-05-23 after the original fix (2026-05-19) used `BindsTo` and caused a second outage on dcaprod-hetzner. `Requires` provides the startup dependency without the dangerous bidirectional lifecycle coupling.
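
Because systemd breaks ordering cycles silently from the operator's point of view, it may be worth checking the journal for cycle messages after any change to the override. This is a rough sketch: the function name `ordering_cycle_lines` is hypothetical, and the patterns assume the usual systemd wording ("Found ordering cycle on ..." and "... deleted to break ordering cycle starting with ...").

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads journal text on stdin and prints the lines in
# which systemd reports an ordering cycle. Exits non-zero if there are none.
ordering_cycle_lines() {
    grep -E 'Found ordering cycle|to break ordering cycle'
}

# Intended use after a test reboot:
#   journalctl -b | ordering_cycle_lines && echo "ordering cycle detected"
```

An empty result for the current boot suggests the override did not introduce a cycle on startup. Checking the previous boot with `journalctl -b -1` covers the shutdown path as well.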

### Affected Hosts

Ubuntu hosts using `configure_tailscale_ssh_only.yml`: majorlinux, dcaprod-hetzner, tttpod-hetzner. Fedora hosts (majordiscord) use firewall rules for SSH restriction — not affected by this race.

---

## Race 2: tailscaled Starts Before Network Is Online (All Hosts)

### Symptom

Host reboots but never appears on Tailscale. `tailscale ping` times out entirely. SSH is dead because Tailscale never connects. The host is up (accessible via provider console) but isolated from the Tailscale network.

### Cause

`tailscaled.service` ships with `After=network-pre.target`, which fires *before* the network interface has an IP. On VPS hosts (especially Hetzner), the interface can take several seconds to come online. Tailscale starts, sees no network (`SetNetworkUp(false)`, `link state: defaultRoute= ifs={} v4=false v6=false`), fails DNS bootstrap and DERP relay connections, and gets stuck — never retrying.

### Diagnosis

```bash
# From Hetzner console or another access method:
journalctl -b -u tailscaled | grep -E "SetNetworkUp|link state|error|DERP"
# Look for:
#   magicsock: SetNetworkUp(false)
#   link state: interfaces.State{defaultRoute= ifs={} v4=false v6=false}
#   health: Tailscale could not connect to any relay server
```

### Fix

Deploy a systemd drop-in to wait for full network connectivity:

```ini
# /etc/systemd/system/tailscaled.service.d/override.conf
[Unit]
After=network-online.target
Wants=network-online.target
```

Then reload and restart:

```bash
systemctl daemon-reload
systemctl restart tailscaled
```
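
To confirm that the drop-in took effect, the unit's effective ordering can be inspected. The sketch below uses a hypothetical function name, `unit_ordered_after`, and assumes `systemctl show tailscaled.service -p After` prints a single `After=unit1 unit2 ...` line.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads one `After=...` line on stdin and succeeds if
# the given unit name appears in it as a whole word.
unit_ordered_after() {
    local target="$1" line unit
    read -r line || true          # tolerate a missing trailing newline
    # Unquoted on purpose: split the space-separated unit list into words
    for unit in ${line#After=}; do
        if [ "$unit" = "$target" ]; then
            return 0
        fi
    done
    return 1
}

# Intended use on the host:
#   systemctl show tailscaled.service -p After \
#       | unit_ordered_after network-online.target || echo "drop-in not active"
#
# network-online.target only delays boot if a wait-online service is enabled:
#   systemctl is-enabled systemd-networkd-wait-online.service   # networkd hosts
#   systemctl is-enabled NetworkManager-wait-online.service     # NetworkManager hosts
```

Which wait-online service applies depends on the network stack in use. The two names above are the standard systemd-networkd and NetworkManager units and are listed as likely candidates, not as a statement about how these hosts are configured.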

### Affected Hosts

All hosts where Tailscale is the primary access path, both Fedora and Ubuntu. The impact is greatest on VPS hosts with slow interface bring-up.

---

## Prevention

- Set root passwords on all VPS hosts for emergency console access
- Ansible playbooks deploy both fixes automatically:
  - `configure_tailscale_network_wait.yml` — tailscaled network-online dependency (all hosts)
  - `configure_tailscale_ssh_only.yml` — ssh.socket Tailscale dependency (Ubuntu only)

## References

- [[dcaprod#2026-05-19 — SSH unreachable due to ssh.socket race condition with Tailscale]]
- [[majordiscord#2026-05-19 — Tailscale boot race: unreachable after Ansible reboot]]
- [[majorlinux#2026-05-19 — ssh.socket override patched: added Tailscale dependency]]
- [[dcaprod#2026-05-23 — SSH unreachable again: BindsTo ordering cycle in ssh.socket override]]
- Ansible: `configure_tailscale_ssh_only.yml`, `configure_tailscale_network_wait.yml`