Merge branch 'code/majormac/ansible-roles-migration-article'
commit 2bed2cbae3
2 changed files with 131 additions and 0 deletions

02-selfhosting/security/ansible-flat-playbooks-to-roles.md
@@ -0,0 +1,130 @@
---
title: "Migrating Flat Ansible Playbooks to Roles (Safely)"
domain: selfhosting
category: security
tags: [ansible, roles, refactor, fleet, migration, fail2ban, infrastructure]
status: published
created: 2026-06-18
updated: 2026-06-18
---

# Migrating Flat Ansible Playbooks to Roles (Safely)

## Overview

A fleet repo tends to grow a sprawl of flat `configure_*.yml` playbooks: one per subsystem, plus near-duplicates for variants (e.g. ~10 `configure_fail2ban_*` playbooks), all sharing a single overloaded top-level `templates/` directory. It works, but it resists reuse: there is no clean `defaults/` precedence, no encapsulation, and no way to compose a host's full configuration in one place.

Ansible **roles** fix this, but migrating a *live* fleet is where it gets dangerous. The risk is not the refactor itself; it's accidentally changing deployed behaviour while you "just reorganize." This article covers the incremental, regression-free approach used to migrate an 11-host fleet, including the two techniques that keep it safe: **byte-identical migration** and **capture-based reconciliation**.

> This is a process/pattern article. For the specific roles in this fleet, see the internal runbook. The techniques here generalize to any flat-playbook → role migration.

## Decide What Becomes a Role vs. What Stays a Playbook

Not everything should be a role. Draw the line by purpose:

| Becomes a role | Stays a playbook |
|---|---|
| Reusable host **configuration** (a subsystem you converge to a desired state) | **Ops / one-off** actions: `update`, `reboot`, `harden`, `bootstrap`, `provision`, `fix_*`, `verify_*` |
| Has templates/files, defaults, handlers | Orchestrators that just `import_playbook` other things |
| Applied repeatedly and idempotently | Run-once or run-as-needed remediation |

Roles get the standard `roles/<name>/` layout (`tasks/`, `defaults/`, `handlers/`, `templates/`, `files/`, `meta/`). Name them after the **subsystem noun** (`fail2ban`, `clamav`, `firewall`) and drop the `configure_` verb prefix.

## The Incremental Loop (one role per branch)

Migrate **one subsystem per branch** and validate before merging. This keeps every change small enough to diff by eye and roll back cleanly:

1. `git mv` the templates/files into `roles/<name>/` so **git detects them as renames** (history stays followable, 100% similarity).
2. Move task bodies into `roles/<name>/tasks/` (split by lifecycle: install → service → config → verify).
3. Lift tunables into `roles/<name>/defaults/main.yml`; keep per-host overrides in `group_vars`/`host_vars`.
4. Add a thin entry playbook `<name>.yml` (`hosts: <group>` + `roles: [<name>]`).
5. Validate with `--check --diff` against a single host **before** merging.
6. Merge, then move to the next subsystem.

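
Steps 3 and 4 of the loop might look like the following for a `clamav` role. This is a minimal sketch: the variable names, their values, and the `clamav_hosts` inventory group are hypothetical placeholders, not taken from the fleet described here.

```yaml
# roles/clamav/defaults/main.yml
# Hypothetical tunables. Role defaults have the lowest precedence,
# so group_vars/host_vars override them without touching the role.
clamav_scan_paths:
  - /home
clamav_scan_hour: 3
---
# clamav.yml (thin entry playbook)
# `clamav_hosts` is an assumed inventory group name.
- name: Converge ClamAV
  hosts: clamav_hosts
  become: true
  roles:
    - clamav
```

The entry playbook stays deliberately empty of logic: it only maps a host group to a role, which is what lets an orchestrator import it later without duplicating the host mapping.
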
## Technique 1: Byte-Identical Migration

When the goal is "reorganize without changing behaviour," **prove** it. After moving a playbook into a role, the task bodies should be identical to the original. Verify with a normalized diff against `main`:

```bash
# Compare the role's task body against the original flat playbook,
# ignoring only comments/whitespace you intend to change.
git show main:configure_clamav.yml > /tmp/old.yml
# Extract the task list from the old playbook and from roles/clamav/tasks/*.yml.
# Both sides pass through yq (assumed here: mikefarah yq v4) so formatting is comparable;
# document separators between task files on the role side are expected noise.
diff <(yq '.[] | .tasks' /tmp/old.yml) <(yq '.' roles/clamav/tasks/*.yml)
```

The acceptance bar: `--check --diff` against a real host returns **`changed=0`** (or only the diffs you explicitly intended, like a doc-comment line). If a "faithful" migration shows unexpected `changed=N`, you altered behaviour. Stop and reconcile before merging. Templates moved via `git mv` show as **100% renames** in `git show --stat --summary`, which proves the template source is unchanged; the `--check --diff` run is what proves the rendered, deployed result is unchanged too.

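
The same acceptance bar can be asserted for a single template, which helps when a full `--check` run is noisy. A minimal sketch, assuming a hypothetical `parity.yml` task file inside the role, a hypothetical `clamd.conf.j2` template, and an assumed destination path; `check_mode`, `diff`, and `include_role` with `tasks_from` are standard Ansible features.

```yaml
# roles/clamav/tasks/parity.yml (hypothetical task file)
- name: Render the role template without writing it
  ansible.builtin.template:
    src: clamd.conf.j2             # assumed template name
    dest: /etc/clamav/clamd.conf   # assumed destination path
  check_mode: true                 # never modify the host, even on a real run
  diff: true                       # print what would change
  register: rendered

- name: Fail if the rendered file would differ from the deployed one
  ansible.builtin.assert:
    that:
      - rendered is not changed
    fail_msg: "Rendered config differs from the deployed file; reconcile before merging."
---
# verify_clamav_parity.yml (hypothetical playbook)
- name: Check template parity for the clamav role
  hosts: clamav_hosts              # assumed inventory group
  become: true
  tasks:
    - name: Run only the parity tasks from the role
      ansible.builtin.include_role:
        name: clamav
        tasks_from: parity.yml
```

Running the tasks through `include_role` rather than a bare play keeps the role's `defaults/` and `templates/` lookup in effect, so the template renders with the same variables it would get in a real run.
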
## Technique 2: Consolidating Near-Duplicates with Feature Flags

The big win is collapsing a family of near-duplicate playbooks (the ~10 `configure_fail2ban_*` playbooks) into **one role with flag-gated task files**:

```yaml
# group_vars/<group>.yml: hosts self-select which jails/components they get
fail2ban_jail_sshd: true
fail2ban_jail_wordpress: true
fail2ban_jail_nginx_bad_request: false
```

```yaml
# roles/fail2ban/tasks/main.yml
- import_tasks: jail_wordpress.yml
  when: fail2ban_jail_wordpress | default(false)
```

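
One way to keep the opt-in explicit is to declare every flag in the role defaults as `false`, so a host only gets a jail when its group or host vars say so. A sketch using the three flag names from the `group_vars` example; the file path follows the standard role layout, and treating "off" as the default is an assumption about the desired policy.

```yaml
# roles/fail2ban/defaults/main.yml
# Everything is off unless group_vars/host_vars switch it on.
fail2ban_jail_sshd: false
fail2ban_jail_wordpress: false
fail2ban_jail_nginx_bad_request: false
```

With the defaults in place the `| default(false)` guard in `tasks/main.yml` becomes a belt-and-braces measure rather than a requirement, and the defaults file doubles as a list of every component the role can manage.
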
> **Critical gotcha: key flags to curated inventory GROUPS, not `ansible_os_family`.** It is tempting to gate OS-specific task files on `ansible_os_family == 'Debian'`. Don't. An OS-wide match (the fact itself, or a broad OS-named group) frequently includes hosts the *original playbooks deliberately excluded* (e.g. a LAN-only Debian box that should get the network-wait step but **not** the public SSH bind, or a WSL host in the `fedora` group that must be skipped). Keep the original curated host patterns and set the flag per play/group. Keying on `os_family` silently widens a play's host set and is exactly how a "refactor" pushes config to a host that never had it.

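
Setting the flags per play, against the original host pattern, could look like this. The group names `webservers` and `lan_only` are hypothetical; the point of the sketch is only that the exclusion from the old playbook survives the move.

```yaml
# fail2ban.yml (entry playbook); group names are hypothetical
- name: fail2ban on public web hosts
  # Keep the curated pattern from the old playbook, exclusions included.
  # Do not replace it with a condition on ansible_os_family.
  hosts: "webservers:!lan_only"
  become: true
  roles:
    - role: fail2ban
      vars:
        fail2ban_jail_sshd: true
        fail2ban_jail_wordpress: true
```

Whether flags live in the play (as here) or in `group_vars` is a judgment call: play-level vars keep the host pattern and its flags visible side by side, while `group_vars` keeps the entry playbook thinner.
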
## Technique 3: Capture-Based Reconciliation (the safety net)

This is the one that prevents an outage. Sometimes a role gets written as a **fresh re-implementation** of a subsystem rather than a faithful move: a cleaner `jail.local`, new drop-ins, a different default set. It may even be merged into `site.yml`. The trap: that role has **never been rolled out**, and its config *diverges* from what's actually deployed.

Running it would push divergent config to a live, security-sensitive subsystem (intrusion protection, firewall) across the whole fleet on the next `harden.yml` run.

The check that catches it:

```bash
ansible-playbook fail2ban.yml --check --diff --limit <host>
# Divergent role => changed=8-12 per host + failures (missing filters/timers)
# Faithful role => changed=0, failed=0
```

**Capture-based reconciliation** is the fix: instead of pushing the role's idea of "correct," bring the **role into parity with the live, working config** first. Capture what's actually deployed, fold it into the role's templates/defaults until `--check` is clean fleet-wide, *then* switch the orchestrator over and retire the old playbooks. Order of operations:

1. **Decide the source of truth**: the live config or the new role. For security subsystems, the live (working) config wins.
2. **Reconcile** the role to match live until `--check` shows `changed=0, failed=0` on every host.
3. **Roll out host-by-host** with real runs; verify the service restarts cleanly and (for fail2ban) jails are actually active.
4. **Only then** delete the old playbooks, rewire `harden.yml`/`bootstrap.yml`, and remove the orphaned top-level templates.

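
The capture step can itself be a small read-only playbook. A minimal sketch, assuming a hypothetical `fail2ban_hosts` inventory group and a local `captured/` directory; `/etc/fail2ban` is the usual fail2ban config location, but confirm it for your distribution.

```yaml
# capture_fail2ban.yml (hypothetical playbook name)
- name: Capture the live fail2ban config before touching the role
  hosts: fail2ban_hosts          # assumed inventory group
  become: true
  gather_facts: false
  tasks:
    - name: Find the deployed config files
      ansible.builtin.find:
        paths: /etc/fail2ban
        patterns:
          - "*.local"
          - "*.conf"
        recurse: true
      register: live_files

    - name: Copy each file to the control node
      ansible.builtin.fetch:
        src: "{{ item.path }}"
        dest: captured/          # stored as captured/<hostname>/etc/fail2ban/...
      loop: "{{ live_files.files }}"
      loop_control:
        label: "{{ item.path }}"
```

Because `fetch` nests files under the inventory hostname by default, the captured tree can be diffed host against host, which shows where the "live" config is itself inconsistent before you pick what the role should render.
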
Never delete the old mechanism until the new one is proven converged everywhere. "It's in `site.yml`" is not the same as "it's been rolled out."

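
For the host-by-host rollout, a play-level `serial` keeps the blast radius to one machine at a time. This is a sketch under assumptions: the playbook name and the `fail2ban_hosts` group are hypothetical, and the jail check relies on `fail2ban-client status` listing active jail names in its output.

```yaml
# rollout_fail2ban.yml (hypothetical playbook name)
- name: Roll out the reconciled fail2ban role one host at a time
  hosts: fail2ban_hosts          # assumed inventory group
  become: true
  serial: 1                      # finish one host before starting the next
  any_errors_fatal: true         # stop the rollout on the first failure
  roles:
    - fail2ban
  post_tasks:
    - name: Ask fail2ban which jails are running
      ansible.builtin.command: fail2ban-client status
      register: f2b_status
      changed_when: false        # read-only check, never reports a change

    - name: Require the sshd jail to be listed when it is enabled
      ansible.builtin.assert:
        that:
          - "'sshd' in f2b_status.stdout"
      when: fail2ban_jail_sshd | default(false)
```

Handlers notified by the role run before `post_tasks`, so the status check sees the service after any restart rather than before it.
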
## Composition: `site.yml`, `harden.yml`, `bootstrap.yml`

Once subsystems are roles, compose them with thin orchestrators that `import_playbook` the role entry points, so each subsystem keeps a **single source of truth** for its host mapping:

```yaml
# site.yml: day-to-day fleet convergence, in dependency order
- import_playbook: swap.yml
- import_playbook: tailscale.yml
- import_playbook: ssh_hardening.yml
- import_playbook: firewall.yml
- import_playbook: fail2ban.yml
- import_playbook: clamav.yml
```

Order matters: base layer (swap) → networking (tailscale) → access (ssh_hardening) → perimeter (firewall) → intrusion protection (fail2ban) → malware scanning (clamav). Bootstrap-only roles (guest agent, root password, provisioning prerequisites) belong in `bootstrap.yml`, not `site.yml`.

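
A rewired `harden.yml` can reuse the same entry playbooks instead of repeating host mappings. Which subsystems belong in `harden.yml` is an assumption in this sketch; the file names are the ones from `site.yml`.

```yaml
# harden.yml (sketch): imports the same entry playbooks as site.yml,
# so each host mapping is still defined once, in <name>.yml.
- ansible.builtin.import_playbook: ssh_hardening.yml
- ansible.builtin.import_playbook: firewall.yml
- ansible.builtin.import_playbook: fail2ban.yml
```

Keeping orchestrators as plain import lists means a change to a subsystem's host group happens in one file, and every orchestrator picks it up.
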
## Verification Checklist

- [ ] Templates moved with `git mv` (show as 100% renames)
- [ ] `--check --diff` on a real host = `changed=0` (or only intended diffs)
- [ ] Consolidation flags keyed to **inventory groups**, not `ansible_os_family`
- [ ] Re-implemented roles reconciled to live parity **before** rollout (no surprise `changed=N`)
- [ ] Security subsystems rolled out host-by-host with service-active verification
- [ ] Old playbooks/templates deleted **only after** the role is converged fleet-wide
- [ ] Orchestrators (`site.yml`/`harden.yml`/`bootstrap.yml`) rewired; stale references swept

## Related

- [SSH Hardening Fleet-Wide with Ansible](ssh-hardening-ansible-fleet.md)
- [ClamAV Fleet Deployment with Ansible](clamav-fleet-deployment.md)
- [Firewall Hardening with firewalld on Fedora Fleet](firewalld-fleet-hardening.md)
- [Standardizing unattended-upgrades with Ansible](ansible-unattended-upgrades-fleet.md)

@@ -57,6 +57,7 @@ updated: 2026-05-15T09:00
* [Fail2ban Custom Jail: Nginx Bad Request Detection](02-selfhosting/security/fail2ban-nginx-bad-request-jail.md)
* [Fail2ban Custom Jail: Apache Bad Request Detection](02-selfhosting/security/fail2ban-apache-bad-request-jail.md)
* [SSH Hardening Fleet-Wide with Ansible](02-selfhosting/security/ssh-hardening-ansible-fleet.md)
* [Migrating Flat Ansible Playbooks to Roles (Safely)](02-selfhosting/security/ansible-flat-playbooks-to-roles.md)
* [ClamAV Fleet Deployment with Ansible](02-selfhosting/security/clamav-fleet-deployment.md)
* [Fail2Ban Digest Mode — Fleet-Wide Quiet Alerts](02-selfhosting/security/fail2ban-digest-mode-fleet.md)
* [Apache CVE-2026-23918 — HTTP/2 Double Free Mitigation](02-selfhosting/security/apache-cve-2026-23918-http2-mitigation.md)