# Tailscale Boot Race Conditions (SSH Unreachable After Reboot)

Two related race conditions can make a host unreachable via Tailscale after reboot. Both stem from systemd services starting before Tailscale or the network is ready.

---
## Race 1: ssh.socket Binds Before Tailscale Is Up (Ubuntu)

### Symptom

SSH to a host via its Tailscale IP times out. `tailscale ping` works and `tailscale status` shows `active; direct`, but nothing accepts connections on port 22. If the root password is unset, the Hetzner console offers no way in either.

### Cause

Ubuntu 24.04 uses systemd **socket activation** for SSH (`ssh.socket` instead of a persistent `ssh.service`). When the socket override binds to a Tailscale IP, it can start *before* `tailscaled.service` is ready. The bind may succeed initially (the Tailscale state file caches the IP), but a later Tailscale reconnect or interface reset silently invalidates the bound address, and SSH dies with no recovery path.
### Diagnosis

```bash
# From another host:
tailscale ping <IP>             # succeeds: host is up
ssh root@<IP>                   # times out: sshd not listening

# After gaining console access or reboot:
systemctl status ssh.socket     # check Listen: address
journalctl -b -1 -u ssh         # likely empty: sshd never spawned
journalctl -b -1 -u ssh.socket  # socket started before tailscaled
```
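Once console access is available, the "is anything listening on the Tailscale IP?" question can be answered mechanically. The helper below is a minimal sketch, not part of the playbooks: the function name `listens_on` and the address `100.64.0.10` are placeholders, and it assumes the usual `ss -tln` layout where the fourth column is the local `address:port`.

```bash
#!/bin/sh
# Hypothetical helper (not from the playbooks): read `ss -tln` output on
# stdin and succeed only if the given address:port is a local listener.
listens_on() {
    # Column 4 of `ss -tln` is "Local Address:Port"; NR > 1 skips the header.
    awk -v want="$1" 'NR > 1 && $4 == want { found = 1 } END { exit !found }'
}

# Usage on the affected host (100.64.0.10 is a placeholder Tailscale IP):
#   ss -tln | listens_on "100.64.0.10:22" || echo "nothing listening on the Tailscale IP"
```

A non-zero exit here, combined with a successful `tailscale ping`, matches the Race 1 symptom: the host is up but the socket never bound.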
### Fix (current — 2026-05-31)

`After=tailscaled.service` orders ssh.socket after the service becomes `active`, **not** after the `tailscale0` interface actually has an IPv4 address. tailscaled flips to active within a second of starting, but the kernel doesn't have the address bound to the interface until DERP relays connect and the control plane confirms the node. If ssh.socket attempts `ListenStream=<TS IP>:22` in that window, the bind fails with `Cannot assign requested address`, the socket goes into a failed state, and there is no automatic retry.

The proper gate is a dedicated readiness service that **waits for the tailscale0 IPv4 address to exist** before letting ssh.socket bind:
```ini
# /etc/systemd/system/tailscale-wait-ready.service
[Unit]
Description=Wait until tailscale0 has an IPv4 address
After=tailscaled.service
Requires=tailscaled.service
ConditionPathExists=/usr/sbin/ip

[Service]
Type=oneshot
RemainAfterExit=yes
TimeoutStartSec=120
ExecStart=/usr/bin/bash -c 'for i in $(seq 1 120); do ip -4 -o addr show tailscale0 2>/dev/null | grep -q "inet " && exit 0; sleep 1; done; exit 1'

[Install]
WantedBy=multi-user.target
```
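The one-line `ExecStart` loop is compact but hard to read. The sketch below spells out the same polling logic as two small shell functions. The names `wait_until` and `has_ipv4` are illustrative and do not appear in the deployed unit.

```bash
#!/bin/sh
# Readable form of the polling loop used in tailscale-wait-ready.service.
# wait_until TRIES CMD...: run CMD up to TRIES times, one second apart.
# Returns 0 as soon as CMD succeeds, 1 if it never does.
wait_until() {
    tries="$1"; shift
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# Succeeds when the named interface has at least one IPv4 address.
has_ipv4() {
    ip -4 -o addr show "$1" 2>/dev/null | grep -q "inet "
}

# Usage, equivalent to the unit's ExecStart:
#   wait_until 120 has_ipv4 tailscale0
```

Because each iteration costs slightly more than one second, 120 tries can overrun `TimeoutStartSec=120`, so systemd's timeout may fire before the loop's own `exit 1`. Either way the unit ends up failed and anything that `Requires=` it does not start.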
```ini
# /etc/systemd/system/ssh.socket.d/override.conf
[Unit]
After=tailscale-wait-ready.service
Requires=tailscale-wait-ready.service

[Socket]
ListenStream=
ListenStream=<TAILSCALE_IP>:22
```

Reload + restart:

```bash
systemctl daemon-reload
systemctl enable tailscale-wait-ready.service
systemctl restart ssh.socket
ss -tlnp | grep :22  # verify bound to Tailscale IP
```

!!! note "Evolution of this fix"
    - **2026-05-19 v1**: `After=tailscaled.service` + `BindsTo=tailscaled.service`. Worked initially but caused a shutdown-time ordering cycle.
    - **2026-05-23 v2**: `BindsTo` swapped for `Requires` to break the cycle. Fixed the cycle but did **not** wait for `tailscale0` to actually have an IP, only for `tailscaled` to be active. Hosts continued losing SSH after some reboots (intermittent, depending on which side won the race).
    - **2026-05-31 v3**: Added `tailscale-wait-ready.service` to gate ssh.socket on the interface having an address. This is the current canonical fix.

!!! warning "Do NOT use BindsTo"
    `BindsTo=tailscaled.service` creates a **systemd ordering cycle** during shutdown: `basic.target → sockets.target → ssh.socket → tailscaled.service → basic.target`. Systemd breaks the cycle by deleting jobs unpredictably, which can prevent `ssh.socket` from starting on the next boot. Use `Requires=` for startup ordering without the bidirectional lifecycle coupling.
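When systemd detects an ordering cycle and drops a job to break it, it logs that in the journal, so a host suspected of carrying a cycle can be checked directly. The filter below is a small sketch. The function name `cycle_lines` is hypothetical, and the two patterns are the message fragments systemd normally emits for this, so treat them as an assumption to verify against your systemd version.

```bash
#!/bin/sh
# Hypothetical triage helper: filter journal text on stdin down to the
# lines systemd logs when it detects and breaks an ordering cycle.
cycle_lines() {
    grep -E 'Found ordering cycle|deleted to break ordering cycle'
}

# Usage on a host that lost SSH after a reboot:
#   journalctl -b    | cycle_lines   # current boot
#   journalctl -b -1 | cycle_lines   # previous boot, including its shutdown
```

Any output naming `ssh.socket` points at the cycle rather than the bind race. No output suggests looking at the `Cannot assign requested address` path instead.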
### Affected Hosts

Ubuntu hosts locked via the `tailscale` role (`ssh_only_ubuntu` task, formerly `configure_tailscale_ssh_only.yml`): majorlinux, dcaprod-hetzner, tttpod-hetzner, majortoot-hetzner.

> [!danger] The Ubuntu playbook shipped the cycle pattern until 2026-06-07
> Despite the 2026-06-04 resolution above, `configure_tailscale_ssh_only.yml` in the repo kept deploying the `[Unit] Requires=tailscale-wait-ready.service` gate on **ssh.socket** (the cycle-causer) and never added the ssh.service gate, so re-running it *re-armed* the ordering cycle. Caught 2026-06-07: it clobbered majorlinux's hand-fix, and **majortoot-hetzner was found already armed** with the latent cycle (it would have lost SSH on its next reboot). Both were restored/defused; the playbook was corrected in MajorAnsible `e0d35aa` (gate on ssh.service, dependency-free socket).
>
> **Fleet audited & reconciled 2026-06-07:** dcaprod-hetzner and tttpod-hetzner already had the dependency-free socket but were **missing `tailscale-wait-ready.service`** (their ssh.service gate referenced a non-existent unit, so it was inert: a latent *bind* race, not a cycle); the corrected playbook was applied to both, deploying the service and activating the gate. teelia uses **Tailscale SSH** (no sshd, ssh.socket/ssh.service disabled) and is immune to both races. All Ubuntu hosts now run the same pattern: dependency-free `ssh.socket` bind + `ssh.service` readiness gate + `tailscale-wait-ready.service`.

> [!warning] Fedora hosts are NOT automatically immune (corrected 2026-06-07)
> The firewalld method (`configure_tailscale_ssh_only_fedora.yml`) binds sshd on `0.0.0.0:22` and enforces Tailscale-only access via the firewall, so it has no dependency on the Tailscale address, **unless** a host also carries a leftover manual `ListenAddress <tailscale-ip>` drop-in (`/etc/ssh/sshd_config.d/tailscale-only.conf`) from the pre-firewall lockdown. Then sshd.service hits the same boot bind-race (`Bind to port 22 on <ts-ip> failed: Cannot assign requested address`) and flaps every reboot. Hit on **majordiscord 2026-06-07**; fixed by removing the redundant drop-in (the firewall stays the enforcing layer). The Fedora playbook now removes it automatically (MajorAnsible `b4a9090`).
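
A leftover `ListenAddress` drop-in of the kind described above can be found before the next reboot. The following is a hedged sketch only: the function name `listenaddress_dropins` is hypothetical, and it assumes drop-ins live in the standard `/etc/ssh/sshd_config.d` directory with a `.conf` suffix.

```bash
#!/bin/sh
# Hypothetical audit helper: print sshd drop-in files that set an active
# (uncommented) ListenAddress. Defaults to the standard drop-in directory;
# pass another path to check a copy. sshd keywords are case-insensitive,
# hence grep -i.
listenaddress_dropins() {
    grep -lEi '^[[:space:]]*ListenAddress[[:space:]]' \
        "${1:-/etc/ssh/sshd_config.d}"/*.conf 2>/dev/null
}

# Usage on a Fedora host using the firewalld lockdown:
#   listenaddress_dropins && echo "leftover ListenAddress drop-in found"
#   sshd -T | grep -i '^listenaddress'   # effective listen addresses
```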
---

## Race 2: tailscaled Starts Before Network Is Online (All Hosts)

### Symptom

Host reboots but never appears on Tailscale. `tailscale ping` times out entirely. SSH is dead because Tailscale never connects. The host is up (accessible via provider console) but isolated from the Tailscale network.

### Cause

`tailscaled.service` ships with `After=network-pre.target`, which fires *before* the network interface has an IP. On VPS hosts (especially Hetzner), the interface can take several seconds to come online. Tailscale starts, sees no network (`SetNetworkUp(false)`, `link state: defaultRoute= ifs={} v4=false v6=false`), fails DNS bootstrap and DERP relay connections, and gets stuck, never retrying.
### Diagnosis

```bash
# From Hetzner console or another access method:
journalctl -b -u tailscaled | grep -E "SetNetworkUp|link state|error|DERP"
# Look for:
#   magicsock: SetNetworkUp(false)
#   link state: interfaces.State{defaultRoute= ifs={} v4=false v6=false}
#   health: Tailscale could not connect to any relay server
```

### Fix

Deploy a systemd drop-in to wait for full network connectivity:

```ini
# /etc/systemd/system/tailscaled.service.d/override.conf
[Unit]
After=network-online.target
Wants=network-online.target
```

Then reload and restart:

```bash
systemctl daemon-reload
systemctl restart tailscaled
```
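One way to confirm the drop-in took effect is to check the merged unit's `After=` list. This is a minimal sketch, and the function name `orders_after_network_online` is hypothetical. Keep in mind that `network-online.target` only delays startup when a wait-online service is enabled (for example `systemd-networkd-wait-online.service` or `NetworkManager-wait-online.service`), so that is worth checking on each host as well.

```bash
#!/bin/sh
# Hypothetical check: read `systemctl show -p After <unit>` output on stdin
# and succeed if network-online.target is among the listed units.
orders_after_network_online() {
    # Split "After=a b c" into one token per line, then match exactly.
    tr ' =' '\n\n' | grep -qx 'network-online.target'
}

# Usage:
#   systemctl show -p After tailscaled.service | orders_after_network_online \
#       && echo "tailscaled is ordered after network-online.target"
```

`systemctl cat tailscaled.service` shows the same information in file form, with the drop-in listed after the vendor unit.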
### Affected Hosts

All hosts where Tailscale is the primary access path. Particularly impactful on VPS hosts with slow interface bringup. Both Fedora and Ubuntu hosts are affected.

---
## Prevention

- Set root passwords on all VPS hosts for emergency console access
- The `tailscale` role deploys all fixes automatically (run via `tailscale.yml` / `site.yml`):
    - `network_wait` task: tailscaled network-online dependency (all hosts)
    - `ssh_only_ubuntu` task: dependency-free ssh.socket bind + ssh.service readiness gate + `tailscale-wait-ready.service` (Ubuntu group)
    - `ssh_only_fedora` task: firewalld Tailscale-only lockdown; removes any leftover `ListenAddress` drop-in (Fedora group)
## References

- [[dcaprod#2026-05-19 — SSH unreachable due to ssh.socket race condition with Tailscale]]
- [[majordiscord#2026-05-19 — Tailscale boot race: unreachable after Ansible reboot]]
- [[majorlinux#2026-05-19 — ssh.socket override patched: added Tailscale dependency]]
- [[dcaprod#2026-05-23 — SSH unreachable again: BindsTo ordering cycle in ssh.socket override]]
- [[majorlinux#2026-05-31 — ssh.socket race recurrence post-reboot (Requires= insufficient; added wait-ready gate)]]
- [[majortoot#2026-05-31 — ssh.socket race post-reboot on majortoot-hetzner (during cutover night)]]
- Ansible: the `tailscale` role (`tailscale.yml`), with `network_wait` + `ssh_only_ubuntu`/`ssh_only_fedora` tasks consolidated from the former `configure_tailscale_*` playbooks (MajorAnsible `656302e`)