Merge branch 'code/majormac/ansible-roles-migration-article'

2026-06-18 14:32:02 -04:00 · 2026-06-18 14:32:02 -04:00 · 2bed2cbae3
commit 2bed2cbae3
parent 4fa5e33d93 ebdb28e9e2
2 changed files with 131 additions and 0 deletions
--- a/02-selfhosting/security/ansible-flat-playbooks-to-roles.md
+++ b/02-selfhosting/security/ansible-flat-playbooks-to-roles.md
@@ -0,0 +1,130 @@
+---
+title: "Migrating Flat Ansible Playbooks to Roles (Safely)"
+domain: selfhosting
+category: security
+tags: [ansible, roles, refactor, fleet, migration, fail2ban, infrastructure]
+status: published
+created: 2026-06-18
+updated: 2026-06-18
+---
+# Migrating Flat Ansible Playbooks to Roles (Safely)
+
+## Overview
+
+A fleet repo tends to grow a sprawl of flat `configure_*.yml` playbooks — one per subsystem, plus near-duplicates for variants (e.g. ~10 `configure_fail2ban_*` playbooks), all sharing a single overloaded top-level `templates/` directory. It works, but it resists reuse: there is no clean `defaults/` precedence, no encapsulation, and no way to compose a host's full configuration in one place.
+
+Ansible **roles** fix this, but migrating a *live* fleet is where it gets dangerous. The risk is not the refactor itself; it's accidentally changing deployed behaviour while you "just reorganize." This article covers the incremental, regression-free approach used to migrate an 11-host fleet, including the three techniques that keep it safe: **byte-identical migration**, **flag-gated consolidation**, and **capture-based reconciliation**.
+
+> This is a process/pattern article. For the specific roles in this fleet, see the internal runbook. The techniques here generalize to any flat-playbook → role migration.
+
+## Decide What Becomes a Role vs. What Stays a Playbook
+
+Not everything should be a role. Draw the line by purpose:
+
+| Becomes a role | Stays a playbook |
+|---|---|
+| Reusable host **configuration** (a subsystem you converge to a desired state) | **Ops / one-off** actions: `update`, `reboot`, `harden`, `bootstrap`, `provision`, `fix_*`, `verify_*` |
+| Has templates/files, defaults, handlers | Orchestrators that just `import_playbook` other things |
+| Applied repeatedly and idempotently | Run-once or run-as-needed remediation |
+
+Roles get the standard `roles/<name>/` layout (`tasks/`, `defaults/`, `handlers/`, `templates/`, `files/`, `meta/`). Name them after the **subsystem noun** (`fail2ban`, `clamav`, `firewall`) — drop the `configure_` verb prefix.
+
+## The Incremental Loop (one role per branch)
+
+Migrate **one subsystem per branch** and validate before merging. This keeps every change small enough to diff by eye and roll back cleanly:
+
+1. `git mv` the templates/files into `roles/<name>/` so **Git detects them as renames** (history stays followable with `git log --follow`, similarity index 100%).
+2. Move task bodies into `roles/<name>/tasks/` (split by lifecycle: install → service → config → verify).
+3. Lift tunables into `roles/<name>/defaults/main.yml`; keep per-host overrides in `group_vars`/`host_vars`.
+4. Add a thin entry playbook `<name>.yml` (`hosts: <group>` + `roles: [<name>]`).
+5. Validate with `--check --diff` against a single host **before** merging.
+6. Merge, then move to the next subsystem.
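+
+Steps 3 and 4 might look like the following for a ClamAV role. This is a minimal sketch: the variable names, their values, and the `clamav_hosts` group are illustrative assumptions, not the fleet's actual settings.
+
+```yaml
+# roles/clamav/defaults/main.yml
+# Lowest-precedence tunables; group_vars/host_vars override these.
+# Both variable names below are hypothetical.
+clamav_scan_schedule: "daily"
+clamav_freshclam_checks_per_day: 12
+---
+# clamav.yml
+# Thin entry playbook: the host pattern lives here, not inside the role.
+- name: Configure ClamAV
+  hosts: clamav_hosts   # hypothetical inventory group
+  become: true
+  roles:
+    - clamav
+```
+
+Keeping the host pattern in the entry playbook means the role itself stays host-agnostic, which is what lets an orchestrator import it later without re-deciding who gets it.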
+
+## Technique 1: Byte-Identical Migration
+
+When the goal is "reorganize without changing behaviour," **prove** it. After moving a playbook into a role, the task bodies should be identical to the original. Verify with a normalized diff against `main`:
+
+```bash
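+# Compare the role's task body against the original flat playbook,
+# ignoring only comments/whitespace you intend to change.
+# Assumes mikefarah yq v4 and a single-play playbook.
+git show main:configure_clamav.yml > /tmp/old.yml
+# Extract the original task list and diff it against the role's task files.
+# The glob concatenates alphabetically; list the files explicitly in lifecycle
+# order (and leave out a main.yml that only imports) for a clean diff.
+diff <(yq '.[] | .tasks' /tmp/old.yml) <(cat roles/clamav/tasks/*.yml)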
+```
+
+The acceptance bar: `--check --diff` against a real host reports **`changed=0`** in the play recap (or only the diffs you explicitly intended, like a doc-comment line). If a "faithful" migration shows an unexpected `changed=N`, you altered behaviour: stop and reconcile before merging. Templates moved via `git mv` show as **100% renames** in `git show --stat --summary`, which proves the template source is unchanged; the clean `--check --diff` run is what proves the rendered result is unchanged too.
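+
+To make the acceptance bar fail loudly instead of relying on someone reading the recap, a throwaway verification play can render a migrated template in check mode and assert that nothing would change. A sketch, assuming a hypothetical `clamav_hosts` group and hypothetical template and destination paths:
+
+```yaml
+# verify_clamav_migration.yml (hypothetical file name)
+- name: Prove a migrated template renders identically
+  hosts: clamav_hosts                                # hypothetical inventory group
+  become: true
+  tasks:
+    - name: Render the role template without writing anything
+      ansible.builtin.template:
+        src: roles/clamav/templates/clamd.conf.j2   # hypothetical; relative to the playbook directory
+        dest: /etc/clamav/clamd.conf                # hypothetical destination
+      check_mode: true    # forces check mode for this task even on a real run
+      diff: true
+      register: rendered
+
+    - name: Fail if the migration changed the rendered file
+      ansible.builtin.assert:
+        that:
+          - rendered is not changed
+        fail_msg: "Rendered config differs from what is deployed; reconcile before merging."
+```
+
+Because the template task is pinned to check mode, the play is safe to run with or without `--check`.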
+
+## Technique 2: Consolidating Near-Duplicates with Feature Flags
+
+The big win is collapsing a family of near-duplicate playbooks (the ~10 `configure_fail2ban_*`) into **one role with flag-gated task files**:
+
+```yaml
+# group_vars/<group>.yml — hosts self-select which jails/components they get
+fail2ban_jail_sshd: true
+fail2ban_jail_wordpress: true
+fail2ban_jail_nginx_bad_request: false
+```
+
+```yaml
+# roles/fail2ban/tasks/main.yml
+- import_tasks: jail_wordpress.yml
+  when: fail2ban_jail_wordpress | default(false)
+```
+
+> **Critical gotcha — key flags to inventory GROUPS, not `ansible_os_family`.** It is tempting to gate OS-specific task files on `ansible_os_family == 'Debian'`. Don't. Inventory groups frequently include hosts the *original playbooks deliberately excluded* (e.g. a LAN-only Debian box that should get the network-wait step but **not** the public SSH bind, or a WSL host in the `fedora` group that must be skipped). Keep the original curated host patterns and set the flag per play/group. Keying on `os_family` silently widens a play's host set and is exactly how a "refactor" pushes config to a host that never had it.
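+
+One way to keep the original curated host patterns is to set each flag on the play that applies the role, rather than deriving it from facts inside the role. A sketch, with hypothetical group names (`web_public`, `lan_only`, `wsl`) standing in for whatever patterns the old playbooks used:
+
+```yaml
+# fail2ban.yml
+# One play per curated host pattern; flags are set here, not derived from facts.
+- name: fail2ban for public web hosts
+  hosts: "web_public:!wsl"      # hypothetical groups; copy the pattern from the old playbook
+  become: true
+  roles:
+    - role: fail2ban
+      vars:
+        fail2ban_jail_sshd: true
+        fail2ban_jail_wordpress: true
+
+- name: fail2ban for LAN-only hosts
+  hosts: lan_only               # hypothetical group
+  become: true
+  roles:
+    - role: fail2ban
+      vars:
+        fail2ban_jail_sshd: true
+        fail2ban_jail_wordpress: false
+```
+
+Running `ansible-playbook fail2ban.yml --list-hosts` before and after the refactor is a cheap way to confirm the host set did not widen.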
+
+## Technique 3: Capture-Based Reconciliation (the safety net)
+
+This is the one that prevents an outage. Sometimes a role gets written as a **fresh re-implementation** of a subsystem rather than a faithful move — a cleaner `jail.local`, new drop-ins, a different default set. It may even be merged into `site.yml`. The trap: that role has **never been rolled out**, and its config *diverges* from what's actually deployed.
+
+Running it would push divergent config to a live, security-sensitive subsystem (intrusion protection, firewall) across the whole fleet on the next `harden.yml` run.
+
+The check that catches it:
+
+```bash
+ansible-playbook fail2ban.yml --check --diff --limit <host>
+# Divergent role => changed=8-12 per host + failures (missing filters/timers)
+# Faithful role  => changed=0, failed=0
+```
+
+**Capture-based reconciliation** is the fix: instead of pushing the role's idea of "correct," bring the **role into parity with the live, working config** first. Capture what's actually deployed, fold it into the role's templates/defaults until `--check` is clean fleet-wide, *then* switch the orchestrator over and retire the old playbooks. Order of operations:
+
+1. **Decide the source of truth** — the live config or the new role. For security subsystems, the live (working) config wins.
+2. **Reconcile** the role to match live until `--check` shows `changed=0, failed=0` on every host.
+3. **Roll out host-by-host** with real runs; verify the service restarts cleanly and (for fail2ban) jails are actually active.
+4. **Only then** delete the old playbooks, rewire `harden.yml`/`bootstrap.yml`, and remove the orphaned top-level templates.
+
+Never delete the old mechanism until the new one is proven converged everywhere. "It's in `site.yml`" is not the same as "it's been rolled out."
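+
+The capture step can be done with a small read-only play that pulls the deployed files back to the control node, where they can be diffed against the role's templates. A sketch, assuming a hypothetical `fail2ban_hosts` group, the conventional `/etc/fail2ban` path, and a local `captured/` directory:
+
+```yaml
+# capture_fail2ban.yml (hypothetical file name)
+- name: Capture live fail2ban config for reconciliation
+  hosts: fail2ban_hosts          # hypothetical inventory group
+  become: true
+  tasks:
+    - name: Find the deployed config files
+      ansible.builtin.find:
+        paths: /etc/fail2ban
+        patterns:
+          - "*.local"
+          - "*.conf"
+        recurse: true
+      register: live_files
+
+    - name: Fetch each file to the control node
+      ansible.builtin.fetch:
+        src: "{{ item.path }}"
+        dest: captured/
+      loop: "{{ live_files.files }}"
+      loop_control:
+        label: "{{ item.path }}"
+```
+
+With the default `flat: false`, `fetch` stores each file under `captured/<inventory_hostname>/etc/fail2ban/...`, so per-host differences stay visible when you fold the captured config into the role.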
+
+## Composition: `site.yml`, `harden.yml`, `bootstrap.yml`
+
+Once subsystems are roles, compose them with thin orchestrators that `import_playbook` the role entry points — so each subsystem keeps a **single source of truth** for its host mapping:
+
+```yaml
+# site.yml — day-to-day fleet convergence, in dependency order
+- import_playbook: swap.yml
+- import_playbook: tailscale.yml
+- import_playbook: ssh_hardening.yml
+- import_playbook: firewall.yml
+- import_playbook: fail2ban.yml
+- import_playbook: clamav.yml
+```
+
+Order matters: base layer (swap) → networking (tailscale) → access (ssh_hardening) → perimeter (firewall) → intrusion protection (fail2ban) → malware scanning (clamav). Bootstrap-only roles (guest agent, root password, provisioning prerequisites) belong in `bootstrap.yml`, not `site.yml`.
+
+## Verification Checklist
+
+- [ ] Templates moved with `git mv` (show as 100% renames)
+- [ ] `--check --diff` on a real host = `changed=0` (or only intended diffs)
+- [ ] Consolidation flags keyed to **inventory groups**, not `ansible_os_family`
+- [ ] Re-implemented roles reconciled to live parity **before** rollout (no surprise `changed=N`)
+- [ ] Security subsystems rolled out host-by-host with service-active verification
+- [ ] Old playbooks/templates deleted **only after** the role is converged fleet-wide
+- [ ] Orchestrators (`site.yml`/`harden.yml`/`bootstrap.yml`) rewired; stale references swept
+
+## Related
+
+- [SSH Hardening Fleet-Wide with Ansible](ssh-hardening-ansible-fleet.md)
+- [ClamAV Fleet Deployment with Ansible](clamav-fleet-deployment.md)
+- [Firewall Hardening with firewalld on Fedora Fleet](firewalld-fleet-hardening.md)
+- [Standardizing unattended-upgrades with Ansible](ansible-unattended-upgrades-fleet.md)
--- a/SUMMARY.md
+++ b/SUMMARY.md
@@ -57,6 +57,7 @@ updated: 2026-05-15T09:00
    * [Fail2ban Custom Jail: Nginx Bad Request Detection](02-selfhosting/security/fail2ban-nginx-bad-request-jail.md)
    * [Fail2ban Custom Jail: Apache Bad Request Detection](02-selfhosting/security/fail2ban-apache-bad-request-jail.md)
    * [SSH Hardening Fleet-Wide with Ansible](02-selfhosting/security/ssh-hardening-ansible-fleet.md)
+    * [Migrating Flat Ansible Playbooks to Roles (Safely)](02-selfhosting/security/ansible-flat-playbooks-to-roles.md)
    * [ClamAV Fleet Deployment with Ansible](02-selfhosting/security/clamav-fleet-deployment.md)
    * [Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts](02-selfhosting/security/fail2ban-digest-mode-fleet.md)
    * [Apache CVE-2026-23918 — HTTP/2 Double Free Mitigation](02-selfhosting/security/apache-cve-2026-23918-http2-mitigation.md)