diff --git a/02-selfhosting/security/ansible-flat-playbooks-to-roles.md b/02-selfhosting/security/ansible-flat-playbooks-to-roles.md
new file mode 100644
index 0000000..47edb84
--- /dev/null
+++ b/02-selfhosting/security/ansible-flat-playbooks-to-roles.md
@@ -0,0 +1,130 @@
+---
+title: "Migrating Flat Ansible Playbooks to Roles (Safely)"
+domain: selfhosting
+category: security
+tags: [ansible, roles, refactor, fleet, migration, fail2ban, infrastructure]
+status: published
+created: 2026-06-18
+updated: 2026-06-18
+---
+# Migrating Flat Ansible Playbooks to Roles (Safely)
+
+## Overview
+
+A fleet repo tends to grow a sprawl of flat `configure_*.yml` playbooks — one per subsystem, plus near-duplicates for variants (e.g. ~10 `configure_fail2ban_*` playbooks), all sharing a single overloaded top-level `templates/` directory. It works, but it resists reuse: there is no clean `defaults/` precedence, no encapsulation, and no way to compose a host's full configuration in one place.
+
+Ansible **roles** fix this — but migrating a *live* fleet is where it gets dangerous. The risk is not the refactor itself; it's accidentally changing deployed behaviour while you "just reorganize." This article covers the incremental, regression-free approach used to migrate an 11-host fleet, including the two techniques that keep it safe: **byte-identical migration** and **capture-based reconciliation**.
+
+> This is a process/pattern article. For the specific roles in this fleet, see the internal runbook. The techniques here generalize to any flat-playbook → role migration.
+
+## Decide What Becomes a Role vs. What Stays a Playbook
+
+Not everything should be a role. Draw the line by purpose:
+
+| Becomes a role | Stays a playbook |
+|---|---|
+| Reusable host **configuration** (a subsystem you converge to a desired state) | **Ops / one-off** actions: `update`, `reboot`, `harden`, `bootstrap`, `provision`, `fix_*`, `verify_*` |
+| Has templates/files, defaults, handlers | Orchestrators that just `import_playbook` other things |
+| Applied repeatedly and idempotently | Run-once or run-as-needed remediation |
+
+Roles get the standard `roles//` layout (`tasks/`, `defaults/`, `handlers/`, `templates/`, `files/`, `meta/`). Name them after the **subsystem noun** (`fail2ban`, `clamav`, `firewall`) — drop the `configure_` verb prefix.
+
+## The Incremental Loop (one role per branch)
+
+Migrate **one subsystem per branch** and validate before merging. This keeps every change small enough to diff by eye and roll back cleanly:
+
+1. `git mv` the templates/files into `roles//` so **git tracks them as renames** (history preserved, 100% rename score).
+2. Move task bodies into `roles//tasks/` (split by lifecycle: install → service → config → verify).
+3. Lift tunables into `roles//defaults/main.yml`; keep per-host overrides in `group_vars`/`host_vars`.
+4. Add a thin entry playbook `.yml` (`hosts: ` + `roles: []`).
+5. Validate with `--check --diff` against a single host **before** merging.
+6. Merge, then move to the next subsystem.
+
+## Technique 1: Byte-Identical Migration
+
+When the goal is "reorganize without changing behaviour," **prove** it. After moving a playbook into a role, the rendered task bodies should be identical to the original. Verify with a normalized diff against `main`:
+
+```bash
+# Compare the role's task body against the original flat playbook,
+# ignoring only comments/whitespace you intend to change.
+git show main:configure_clamav.yml > /tmp/old.yml
+# ...extract the task list from roles/clamav/tasks/*.yml and diff
+diff <(yq '.[] | .tasks' /tmp/old.yml) <(cat roles/clamav/tasks/*.yml)
+```
+
+The acceptance bar: `--check --diff` against a real host returns **`changed=0`** (or only the diffs you explicitly intended, like a doc-comment line). If a "faithful" migration shows unexpected `changed=N`, you altered behaviour — stop and reconcile before merging. Templates moved via `git mv` show as **100% renames** in `git show --stat`, which is your proof the deployed content is unchanged.
+
+## Technique 2: Consolidating Near-Duplicates with Feature Flags
+
+The big win is collapsing a family of near-duplicate playbooks (the ~10 `configure_fail2ban_*`) into **one role with flag-gated task files**:
+
+```yaml
+# group_vars/.yml — hosts self-select which jails/components they get
+fail2ban_jail_sshd: true
+fail2ban_jail_wordpress: true
+fail2ban_jail_nginx_bad_request: false
+```
+
+```yaml
+# roles/fail2ban/tasks/main.yml
+- import_tasks: jail_wordpress.yml
+  when: fail2ban_jail_wordpress | default(false)
+```
+
+> **Critical gotcha — key flags to inventory GROUPS, not `ansible_os_family`.** It is tempting to gate OS-specific task files on `ansible_os_family == 'Debian'`. Don't. Inventory groups frequently include hosts the *original playbooks deliberately excluded* (e.g. a LAN-only Debian box that should get the network-wait step but **not** the public SSH bind, or a WSL host in the `fedora` group that must be skipped). Keep the original curated host patterns and set the flag per play/group. Keying on `os_family` silently widens a play's host set and is exactly how a "refactor" pushes config to a host that never had it.
+
+## Technique 3: Capture-Based Reconciliation (the safety net)
+
+This is the one that prevents an outage. Sometimes a role gets written as a **fresh re-implementation** of a subsystem rather than a faithful move — a cleaner `jail.local`, new drop-ins, a different default set. It may even be merged into `site.yml`. The trap: that role has **never been rolled out**, and its config *diverges* from what's actually deployed.
+
+Running it would push divergent config to a live, security-sensitive subsystem (intrusion protection, firewall) across the whole fleet on the next `harden.yml`.
+
+The check that catches it:
+
+```bash
+ansible-playbook fail2ban.yml --check --diff --limit
+# Divergent role => changed=8-12 per host + failures (missing filters/timers)
+# Faithful role => changed=0, failed=0
+```
+
+**Capture-based reconciliation** is the fix: instead of pushing the role's idea of "correct," bring the **role into parity with the live, working config** first. Capture what's actually deployed, fold it into the role's templates/defaults until `--check` is clean fleet-wide, *then* switch the orchestrator over and retire the old playbooks. Order of operations:
+
+1. **Decide the source of truth** — the live config or the new role. For security subsystems, the live (working) config wins.
+2. **Reconcile** the role to match live until `--check` shows `changed=0, failed=0` on every host.
+3. **Roll out host-by-host** with real runs; verify the service restarts cleanly and (for fail2ban) jails are actually active.
+4. **Only then** delete the old playbooks, rewire `harden.yml`/`bootstrap.yml`, and remove the orphaned top-level templates.
+
+Never delete the old mechanism until the new one is proven converged everywhere. "It's in `site.yml`" is not the same as "it's been rolled out."
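
The `changed=0, failed=0` bar in step 2 can be checked mechanically rather than by eye. The following is a minimal sketch, not part of the original fleet tooling: `recap_is_clean` is a hypothetical helper name, and it assumes the standard `PLAY RECAP` summary that `ansible-playbook` prints at the end of a run. It also treats `unreachable` hosts as not converged, which is an added assumption beyond the two counters named above.

```bash
# Hypothetical helper: read ansible-playbook output on stdin and succeed
# only if every host in the PLAY RECAP reports changed=0, unreachable=0
# and failed=0. An output with no recap lines at all counts as a failure.
recap_is_clean() {
    awk '
        /^PLAY RECAP/ { in_recap = 1; next }
        in_recap && /changed=/ {
            hosts++
            if ($0 !~ /changed=0([^0-9]|$)/ ||
                $0 !~ /unreachable=0([^0-9]|$)/ ||
                $0 !~ /failed=0([^0-9]|$)/) {
                # First field of a recap line is the host name.
                print "NOT CONVERGED: " $1
                dirty++
            }
        }
        END { if (hosts == 0 || dirty > 0) exit 1 }
    '
}

# Example use (left commented out so sourcing this file runs nothing):
#   ansible-playbook fail2ban.yml --check --diff | recap_is_clean \
#       && echo "role is at parity with live config"
```

Gating the rollout on the helper's exit status keeps "reconciled" from becoming a judgment call: any host that would change, fail, or cannot be reached blocks the switch-over.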
+
+## Composition: `site.yml`, `harden.yml`, `bootstrap.yml`
+
+Once subsystems are roles, compose them with thin orchestrators that `import_playbook` the role entry points — so each subsystem keeps a **single source of truth** for its host mapping:
+
+```yaml
+# site.yml — day-to-day fleet convergence, in dependency order
+- import_playbook: swap.yml
+- import_playbook: tailscale.yml
+- import_playbook: ssh_hardening.yml
+- import_playbook: firewall.yml
+- import_playbook: fail2ban.yml
+- import_playbook: clamav.yml
+```
+
+Order matters: base layer (swap) → networking (tailscale) → access (ssh_hardening) → perimeter (firewall) → intrusion protection (fail2ban) → antivirus (clamav). Bootstrap-only roles (guest agent, root password, provisioning prerequisites) belong in `bootstrap.yml`, not `site.yml`.
+
+## Verification Checklist
+
+- [ ] Templates moved with `git mv` (show as 100% renames)
+- [ ] `--check --diff` on a real host = `changed=0` (or only intended diffs)
+- [ ] Consolidation flags keyed to **inventory groups**, not `ansible_os_family`
+- [ ] Re-implemented roles reconciled to live parity **before** rollout (no surprise `changed=N`)
+- [ ] Security subsystems rolled out host-by-host with service-active verification
+- [ ] Old playbooks/templates deleted **only after** the role is converged fleet-wide
+- [ ] Orchestrators (`site.yml`/`harden.yml`/`bootstrap.yml`) rewired; stale references swept
+
+## Related
+
+- [SSH Hardening Fleet-Wide with Ansible](ssh-hardening-ansible-fleet.md)
+- [ClamAV Fleet Deployment with Ansible](clamav-fleet-deployment.md)
+- [Firewall Hardening with firewalld on Fedora Fleet](firewalld-fleet-hardening.md)
+- [Standardizing unattended-upgrades with Ansible](ansible-unattended-upgrades-fleet.md)
diff --git a/SUMMARY.md b/SUMMARY.md
index 8f5edc8..13bc6d8 100644
--- a/SUMMARY.md
+++ b/SUMMARY.md
@@ -57,6 +57,7 @@ updated: 2026-05-15T09:00
   * [Fail2ban Custom Jail: Nginx Bad Request Detection](02-selfhosting/security/fail2ban-nginx-bad-request-jail.md)
   * [Fail2ban Custom Jail: Apache Bad Request Detection](02-selfhosting/security/fail2ban-apache-bad-request-jail.md)
   * [SSH Hardening Fleet-Wide with Ansible](02-selfhosting/security/ssh-hardening-ansible-fleet.md)
+  * [Migrating Flat Ansible Playbooks to Roles (Safely)](02-selfhosting/security/ansible-flat-playbooks-to-roles.md)
   * [ClamAV Fleet Deployment with Ansible](02-selfhosting/security/clamav-fleet-deployment.md)
   * [Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts](02-selfhosting/security/fail2ban-digest-mode-fleet.md)
   * [Apache CVE-2026-23918 — HTTP/2 Double Free Mitigation](02-selfhosting/security/apache-cve-2026-23918-http2-mitigation.md)