Merge branch 'cowork/majorair/tailscale-boot-race-wiki'
commit 318f50c50b
1 changed files with 68 additions and 12 deletions
|
|
@ -1,14 +1,20 @@
|
||||||
# ssh.socket Unreachable After Reboot (Tailscale Race Condition)
|
# Tailscale Boot Race Conditions (SSH Unreachable After Reboot)
|
||||||
|
|
||||||
## Symptom
|
Two related race conditions can make a host unreachable via Tailscale after reboot. Both stem from systemd services starting before Tailscale or the network is ready.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Race 1: ssh.socket Binds Before Tailscale Is Up (Ubuntu)
|
||||||
|
|
||||||
|
### Symptom
|
||||||
|
|
||||||
SSH to a host via Tailscale IP times out. `tailscale ping` works, `tailscale status` shows `active; direct`, but SSH on port 22 refuses connections. No access via Hetzner console if root password is unset.
|
||||||
|
|
||||||
## Cause
|
### Cause
|
||||||
|
|
||||||
Ubuntu 24.04 uses systemd **socket activation** for SSH (`ssh.socket` instead of persistent `ssh.service`). When the socket override binds to a Tailscale IP, it can start *before* `tailscaled.service` is ready. The bind may succeed initially (Tailscale state file caches the IP), but a later Tailscale reconnect or interface reset invalidates the bound address silently — SSH dies with no recovery path.
|
||||||
|
|
||||||
## Diagnosis
|
### Diagnosis
|
||||||
|
|
||||||
```bash
# From another host:
@ -21,7 +27,7 @@ journalctl -b -1 -u ssh # likely empty — sshd never spawned
journalctl -b -1 -u ssh.socket # socket started before tailscaled
```
|
||||||
|
|
||||||
## Fix
|
### Fix
|
||||||
|
|
||||||
Add Tailscale dependency to the socket override:
|
Add Tailscale dependency to the socket override:
|
||||||
|
|
||||||
|
|
@ -47,17 +53,67 @@ systemctl status ssh.socket # verify Listen: shows correct IP
|
||||||
- `After=` ensures the socket waits for Tailscale to start
|
- `After=` ensures the socket waits for Tailscale to start
|
||||||
- `BindsTo=` restarts the socket if Tailscale restarts, preventing stale binds
|
- `BindsTo=` restarts the socket if Tailscale restarts, preventing stale binds
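
One way to confirm after a reboot that the ordering took effect is to compare when each unit became active. The sketch below assumes systemd's `ActiveEnterTimestampMonotonic` unit property (microseconds since boot, `0` if the unit never activated). The helper name `started_after` is hypothetical and is not part of the playbook.

```shell
# Hypothetical helper: succeeds if the first monotonic timestamp is later
# than the second, i.e. the dependent unit activated after its dependency.
started_after() {
    dependent_ts="$1"
    dependency_ts="$2"
    [ "$dependent_ts" -gt "$dependency_ts" ]
}

# On a live host (requires systemd):
#   sock=$(systemctl show ssh.socket -p ActiveEnterTimestampMonotonic --value)
#   ts=$(systemctl show tailscaled.service -p ActiveEnterTimestampMonotonic --value)
#   started_after "$sock" "$ts" && echo "ssh.socket came up after tailscaled"
```

If the check fails right after a reboot, the socket still activated first, which suggests the override is not being applied.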
|
||||||
|
|
||||||
|
### Affected Hosts
|
||||||
|
|
||||||
|
Ubuntu hosts using `configure_tailscale_ssh_only.yml`: majorlinux, dcaprod-hetzner. Fedora hosts (majordiscord) use firewall rules for SSH restriction — not affected by this race.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Race 2: tailscaled Starts Before Network Is Online (All Hosts)
|
||||||
|
|
||||||
|
### Symptom
|
||||||
|
|
||||||
|
Host reboots but never appears on Tailscale. `tailscale ping` times out entirely. SSH is dead because Tailscale never connects. The host is up (accessible via provider console) but isolated from the Tailscale network.
|
||||||
|
|
||||||
|
### Cause
|
||||||
|
|
||||||
|
`tailscaled.service` ships with `After=network-pre.target`, which is reached *before* any network interface is configured, so the interface may not have an IP yet. On VPS hosts (especially Hetzner), the interface can take several seconds to come online. Tailscale starts, sees no network (`SetNetworkUp(false)`, `link state: defaultRoute= ifs={} v4=false v6=false`), fails DNS bootstrap and DERP relay connections, and gets stuck without ever retrying.
|
||||||
|
|
||||||
|
### Diagnosis
|
||||||
|
|
||||||
|
```bash
# From Hetzner console or another access method:
journalctl -b -u tailscaled | grep -E "SetNetworkUp|link state|error|DERP"
# Look for:
#   magicsock: SetNetworkUp(false)
#   link state: interfaces.State{defaultRoute= ifs={} v4=false v6=false}
#   health: Tailscale could not connect to any relay server
```
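
For an unattended post-boot check, the same search can be wrapped in a small predicate. This is a minimal sketch that only looks for the `SetNetworkUp(false)` marker quoted above. The function name `saw_network_down` is hypothetical.

```shell
# Hypothetical helper: reads tailscaled journal text on stdin and succeeds
# if it contains the "network down" marker.
saw_network_down() {
    grep -Fq 'SetNetworkUp(false)'
}

# On a live host (requires systemd):
#   journalctl -b -u tailscaled --no-pager | saw_network_down \
#       && echo "tailscaled started before the network was up"
```

A match only shows that the network was down at some point during this boot. It does not by itself prove that Tailscale failed to recover.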
|
||||||
|
|
||||||
|
### Fix
|
||||||
|
|
||||||
|
Deploy a systemd drop-in so that `tailscaled` starts only after `network-online.target` is reached:
|
||||||
|
|
||||||
|
```ini
# /etc/systemd/system/tailscaled.service.d/override.conf
[Unit]
After=network-online.target
Wants=network-online.target
```
|
||||||
|
|
||||||
|
Then reload and restart:
|
||||||
|
|
||||||
|
```bash
systemctl daemon-reload
systemctl restart tailscaled
```
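
To check that the drop-in was picked up, the unit's merged properties can be inspected. The sketch below assumes the space-separated `After=` and `Wants=` lines that `systemctl show` prints. The helper name `has_network_online_dep` is hypothetical.

```shell
# Hypothetical helper: succeeds if the given `systemctl show` output lists
# network-online.target in both its After= and Wants= lines.
has_network_online_dep() {
    props="$1"
    printf '%s\n' "$props" | grep -Eq '^After=(.* )?network-online\.target( |$)' &&
    printf '%s\n' "$props" | grep -Eq '^Wants=(.* )?network-online\.target( |$)'
}

# On a live host (requires systemd):
#   has_network_online_dep "$(systemctl show tailscaled.service -p After -p Wants)" \
#       && echo "drop-in active"
```

Note that `network-online.target` only delays startup when a wait-online service is enabled for the host's network stack, for example `systemd-networkd-wait-online.service` or `NetworkManager-wait-online.service`. Which one applies to each host is not stated here and should be verified.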
|
||||||
|
|
||||||
|
### Affected Hosts
|
||||||
|
|
||||||
|
All hosts where Tailscale is the primary access path. Particularly impactful on VPS hosts with slow interface bringup. Both Fedora and Ubuntu hosts are affected.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Prevention
|
## Prevention
|
||||||
|
|
||||||
- Set root passwords on all Hetzner hosts for emergency console access
|
- Set root passwords on all VPS hosts for emergency console access
|
||||||
- Ansible playbook `configure_tailscale_ssh_only.yml` includes both directives as of commit `7ef182b`
|
- Ansible playbooks deploy both fixes automatically:
|
||||||
|
- `configure_tailscale_network_wait.yml` — tailscaled network-online dependency (all hosts)
|
||||||
## Affected Hosts
|
- `configure_tailscale_ssh_only.yml` — ssh.socket Tailscale dependency (Ubuntu only)
|
||||||
|
|
||||||
Ubuntu hosts using `configure_tailscale_ssh_only.yml`: majorlinux, dcaprod-hetzner. Fedora hosts (majordiscord) use firewall rules for SSH restriction — not affected.
|
|
||||||
|
|
||||||
## References
|
## References
|
||||||
|
|
||||||
- [[dcaprod#2026-05-19 — SSH unreachable due to ssh.socket race condition with Tailscale]]
|
- [[dcaprod#2026-05-19 — SSH unreachable due to ssh.socket race condition with Tailscale]]
|
||||||
|
- [[majordiscord#2026-05-19 — Tailscale boot race: unreachable after Ansible reboot]]
|
||||||
- [[majorlinux#2026-05-19 — ssh.socket override patched: added Tailscale dependency]]
|
- [[majorlinux#2026-05-19 — ssh.socket override patched: added Tailscale dependency]]
|
||||||
- Ansible: `configure_tailscale_ssh_only.yml`
|
- Ansible: `configure_tailscale_ssh_only.yml`, `configure_tailscale_network_wait.yml`
|
||||||
|
|
|
||||||