Migrating Flat Ansible Playbooks to Roles (Safely)

Overview

A fleet repo tends to grow a sprawl of flat configure_*.yml playbooks — one per subsystem, plus near-duplicates for variants (e.g. ~10 configure_fail2ban_* playbooks), all sharing a single overloaded top-level templates/ directory. It works, but it resists reuse: there is no clean defaults/ precedence, no encapsulation, and no way to compose a host's full configuration in one place.

Ansible roles fix this, but migrating a live fleet is where it gets dangerous. The risk is not the refactor itself; it is accidentally changing deployed behaviour while you "just reorganize." This article covers the incremental, regression-free approach used to migrate an 11-host fleet, including the two techniques that keep it safe (byte-identical migration and capture-based reconciliation) and a third for consolidating near-duplicate playbooks.

This is a process/pattern article. For the specific roles in this fleet, see the internal runbook. The techniques here generalize to any flat-playbook → role migration.

Decide What Becomes a Role vs. What Stays a Playbook

Not everything should be a role. Draw the line by purpose:

| Becomes a role | Stays a playbook |
| --- | --- |
| Reusable host configuration (a subsystem you converge to a desired state) | Ops / one-off actions: `update`, `reboot`, `harden`, `bootstrap`, `provision`, `fix_`, `verify_` |
| Has templates/files, defaults, handlers | Orchestrators that just `import_playbook` other things |
| Applied repeatedly and idempotently | Run-once or run-as-needed remediation |

Roles get the standard roles/<name>/ layout (tasks/, defaults/, handlers/, templates/, files/, meta/). Name them after the subsystem noun (fail2ban, clamav, firewall) — drop the configure_ verb prefix.
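The layout above can be scaffolded with a few lines of shell. This is a minimal sketch, not part of the original migration: `scaffold_role` is a hypothetical helper name, and seeding each `main.yml` with an empty YAML document is an assumption about how you want to start.

```shell
# scaffold_role NAME: create the standard role skeleton under roles/NAME/.
# Hypothetical helper; files that already exist are left untouched.
scaffold_role() {
    name=$1
    if [ -z "$name" ]; then
        echo "usage: scaffold_role <name>" >&2
        return 2
    fi
    # The six directories named in the standard layout.
    for dir in tasks defaults handlers templates files meta; do
        mkdir -p "roles/$name/$dir"
    done
    # Seed the main.yml entry points as empty YAML documents, if missing.
    for dir in tasks defaults handlers meta; do
        main="roles/$name/$dir/main.yml"
        [ -e "$main" ] || printf '%s\n' '---' > "$main"
    done
}
```

Running `scaffold_role fail2ban` from the repo root gives `git mv` a destination for the templates and files that follow.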

The Incremental Loop (one role per branch)

Migrate one subsystem per branch and validate before merging. This keeps every change small enough to diff by eye and roll back cleanly:

1. git mv the templates/files into roles/<name>/ so git detects them as renames (history preserved, 100% similarity score).
2. Move task bodies into roles/<name>/tasks/ (split by lifecycle: install → service → config → verify).
3. Lift tunables into roles/<name>/defaults/main.yml; keep per-host overrides in group_vars/host_vars.
4. Add a thin entry playbook <name>.yml (hosts: <group> + roles: [<name>]).
5. Validate with --check --diff against a single host before merging.
6. Merge, then move to the next subsystem.
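The thin entry playbook from the loop above is small enough to generate. The sketch below assumes a hypothetical helper `write_entry_playbook`; the role name, group name, host name and template path in the usage comments are placeholders, not names from this fleet.

```shell
# write_entry_playbook ROLE GROUP: write ROLE.yml, a play that maps one
# inventory group to one role and does nothing else. Hypothetical helper.
write_entry_playbook() {
    role=$1
    group=$2
    cat > "$role.yml" <<EOF
---
# Thin entry point: maps the curated host group to the role.
- hosts: $group
  roles:
    - $role
EOF
}

# Typical use inside a per-role branch (all names are placeholders):
#   git mv templates/jail.local.j2 roles/fail2ban/templates/
#   write_entry_playbook fail2ban webservers
#   ansible-playbook fail2ban.yml --check --diff --limit web1
```

Keeping the entry playbook this thin means the host mapping lives in exactly one file per subsystem, which is what the orchestrators later import.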

Technique 1: Byte-Identical Migration

When the goal is "reorganize without changing behaviour," prove it. After moving a playbook into a role, the rendered task bodies should be identical to the original. Verify with a normalized diff against main:

# Compare the role's task body against the original flat playbook,
# ignoring only comments/whitespace you intend to change.
git show main:configure_clamav.yml > /tmp/old.yml
# Run both sides through yq so they are normalized the same way. If the role
# splits its tasks across several files, compare them in execution order.
diff <(yq '.[] | .tasks' /tmp/old.yml) <(yq '.' roles/clamav/tasks/main.yml)

The acceptance bar: --check --diff against a real host returns changed=0 (or only the diffs you explicitly intended, like a doc-comment line). If a "faithful" migration shows unexpected changed=N, you altered behaviour: stop and reconcile before merging. Templates moved via git mv show as 100% renames in git show --summary (R100 in git show --name-status), which proves the template sources are unchanged; the changed=0 check run covers the rendered result.
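The acceptance bar can be turned into a gate that fails when any host is not converged. This is a sketch under assumptions: `recap_is_clean` is a hypothetical helper, and it relies on the usual `PLAY RECAP` line shape (`host : ok=N changed=N unreachable=N failed=N ...`) with colour disabled, which is normally the case when the output is piped.

```shell
# recap_is_clean: read ansible-playbook output on stdin and succeed only if
# at least one PLAY RECAP host line is present and every such line reports
# changed=0, unreachable=0 and failed=0. Hypothetical helper.
recap_is_clean() {
    awk '
        / : ok=[0-9]+/ {
            hosts++
            if ($0 !~ / changed=0([^0-9]|$)/ ||
                $0 !~ / unreachable=0([^0-9]|$)/ ||
                $0 !~ / failed=0([^0-9]|$)/) {
                print "not converged: " $0 > "/dev/stderr"
                bad++
            }
        }
        END { if (hosts == 0 || bad > 0) exit 1 }
    '
}

# Usage (host name is a placeholder):
#   ansible-playbook clamav.yml --check --diff --limit web1 | recap_is_clean \
#       && echo "safe to merge"
```

The gate also fails when no recap line is seen at all, so a run that aborts early is not mistaken for a clean one. It does not allow for intended diffs; those still need a review by eye.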

Technique 2: Consolidating Near-Duplicates with Feature Flags

The big win is collapsing a family of near-duplicate playbooks (the ~10 configure_fail2ban_*) into one role with flag-gated task files:

# group_vars/<group>.yml — hosts self-select which jails/components they get
fail2ban_jail_sshd: true
fail2ban_jail_wordpress: true
fail2ban_jail_nginx_bad_request: false

# roles/fail2ban/tasks/main.yml
- import_tasks: jail_wordpress.yml
  when: fail2ban_jail_wordpress | default(false)

Critical gotcha — key flags to inventory GROUPS, not ansible_os_family. It is tempting to gate OS-specific task files on ansible_os_family == 'Debian'. Don't. Inventory groups frequently include hosts the original playbooks deliberately excluded (e.g. a LAN-only Debian box that should get the network-wait step but not the public SSH bind, or a WSL host in the fedora group that must be skipped). Keep the original curated host patterns and set the flag per play/group. Keying on os_family silently widens a play's host set and is exactly how a "refactor" pushes config to a host that never had it.
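One way to catch a silently widened host set is to compare what the old playbook and the new entry playbook would target, using `ansible-playbook --list-hosts`. The sketch below assumes the usual listing layout (a `hosts (N):` line followed by one indented host per line); `hosts_in_listing` and `widened_hosts` are hypothetical helpers, and the playbook names in the usage comments are illustrative.

```shell
# hosts_in_listing: read `ansible-playbook --list-hosts` output on stdin and
# print the unique host names, sorted, one per line. Hypothetical helper.
hosts_in_listing() {
    awk '
        /^ *hosts \([0-9]+\):/ { in_hosts = 1; next }
        /^ *play #/            { in_hosts = 0 }
        /^playbook:/           { in_hosts = 0 }
        in_hosts && NF         { print $1 }
    ' | sort -u
}

# widened_hosts OLD NEW: print hosts that appear in NEW but not in OLD.
# Both files must be sorted, which hosts_in_listing guarantees.
widened_hosts() {
    comm -13 "$1" "$2"
}

# Usage (playbook names are illustrative):
#   ansible-playbook configure_fail2ban_wordpress.yml --list-hosts \
#       | hosts_in_listing > /tmp/old.hosts      # on main
#   ansible-playbook fail2ban.yml --list-hosts \
#       | hosts_in_listing > /tmp/new.hosts      # on the role branch
#   widened_hosts /tmp/old.hosts /tmp/new.hosts  # should print nothing
```

Any host printed by `widened_hosts` is one the refactor would newly touch and deserves an explicit decision before merging. Note that this compares play-level targeting only; it does not see tasks gated by `when:`.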

Technique 3: Capture-Based Reconciliation (the safety net)

This is the one that prevents an outage. Sometimes a role gets written as a fresh re-implementation of a subsystem rather than a faithful move — a cleaner jail.local, new drop-ins, a different default set. It may even be merged into site.yml. The trap: that role has never been rolled out, and its config diverges from what's actually deployed.

Running it would push divergent config to a live, security-sensitive subsystem (intrusion protection, firewall) across the whole fleet on the next harden.yml run.

The check that catches it:

ansible-playbook fail2ban.yml --check --diff --limit <host>
# Divergent role => changed=8-12 per host + failures (missing filters/timers)
# Faithful role  => changed=0, failed=0

Capture-based reconciliation is the fix: instead of pushing the role's idea of "correct," bring the role into parity with the live, working config first. Capture what's actually deployed, fold it into the role's templates/defaults until --check is clean fleet-wide, then switch the orchestrator over and retire the old playbooks. Order of operations:

1. Decide the source of truth: the live config or the new role. For security subsystems, the live (working) config wins.
2. Reconcile the role to match live until --check shows changed=0, failed=0 on every host.
3. Roll out host-by-host with real runs; verify the service restarts cleanly and (for fail2ban) jails are actually active.
4. Only then delete the old playbooks, rewire harden.yml/bootstrap.yml, and remove the orphaned top-level templates.
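The capture step in the order of operations above can be sketched with an ad-hoc `ansible.builtin.fetch`, which stores each copy under `<dest>/<hostname>/<src path>`. `capture_config` and `group_captures` are hypothetical helpers, and `/etc/fail2ban/jail.local` and the `all` pattern in the usage comments are assumed examples; add `-b` to the `ansible` call if the file is not readable by the connecting user.

```shell
# capture_config PATTERN SRC [DEST]: pull SRC from every host matching PATTERN.
# ansible.builtin.fetch writes each copy to DEST/<hostname>/<SRC>.
capture_config() {
    ansible "$1" -m ansible.builtin.fetch -a "src=$2 dest=${3:-captured}/"
}

# group_captures SRC [DEST]: print "<checksum> <hostname>" per captured copy,
# sorted so hosts with identical live config sit next to each other.
# cksum (a CRC) is used only for grouping, not for security.
group_captures() {
    src=$1
    dest=${2:-captured}
    for copy in "$dest"/*"$src"; do
        [ -f "$copy" ] || continue
        host=${copy#"$dest"/}
        host=${host%%/*}
        printf '%s %s\n' "$(cksum < "$copy" | cut -d' ' -f1)" "$host"
    done | sort
}

# Usage (path and pattern are assumed examples):
#   capture_config all /etc/fail2ban/jail.local
#   group_captures /etc/fail2ban/jail.local
```

Each distinct checksum is one variant of live config the role has to reproduce, through defaults plus group_vars/host_vars, before `--check` can come back clean.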

Never delete the old mechanism until the new one is proven converged everywhere. "It's in site.yml" is not the same as "it's been rolled out."
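"Proven converged" for fail2ban includes the jails actually being active after a real run. The sketch below parses the `Jail list:` line of `fail2ban-client status`; `active_jails` and `jails_missing` are hypothetical helpers, and the jail names in the usage comment are assumed from the flag names earlier in this article.

```shell
# active_jails: read `fail2ban-client status` output on stdin and print the
# active jail names, one per line. Hypothetical helper.
active_jails() {
    sed -n 's/.*Jail list:[[:space:]]*//p' |
        tr ',' '\n' |
        sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
        grep -v '^$'
}

# jails_missing EXPECTED...: read the same status output on stdin, print each
# expected jail that is not active, and return non-zero if any is missing.
jails_missing() {
    active=$(active_jails)
    rc=0
    for jail in "$@"; do
        if ! printf '%s\n' "$active" | grep -qxF -e "$jail"; then
            echo "$jail"
            rc=1
        fi
    done
    return $rc
}

# Usage (host and jail names are placeholders):
#   ssh web1 sudo fail2ban-client status | jails_missing sshd wordpress
```

A non-empty result after a rollout is a reason to stop before touching the next host.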

Composition: `site.yml`, `harden.yml`, `bootstrap.yml`

Once subsystems are roles, compose them with thin orchestrators that import_playbook the role entry points — so each subsystem keeps a single source of truth for its host mapping:

# site.yml — day-to-day fleet convergence, in dependency order
- import_playbook: swap.yml
- import_playbook: tailscale.yml
- import_playbook: ssh_hardening.yml
- import_playbook: firewall.yml
- import_playbook: fail2ban.yml
- import_playbook: clamav.yml

Order matters: base layer (swap) → networking (tailscale) → access (ssh_hardening) → perimeter (firewall) → intrusion protection (fail2ban) → malware scanning (clamav). Bootstrap-only roles (guest agent, root password, provisioning prerequisites) belong in bootstrap.yml, not site.yml.
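Once old playbooks start being deleted, an orchestrator can end up importing a file that no longer exists. A minimal sketch of a stale-reference check follows; `missing_imports` is a hypothetical helper, and it assumes one `import_playbook` entry per line with a literal path (no variables), resolved relative to the orchestrator's directory.

```shell
# missing_imports ORCHESTRATOR...: print "<orchestrator>: <target>" for every
# import_playbook target that does not exist, and return non-zero if any
# were found. Hypothetical helper.
missing_imports() {
    missing=$(
        for orch in "$@"; do
            dir=$(dirname "$orch")
            # Accepts both import_playbook and ansible.builtin.import_playbook.
            sed -n 's/^[[:space:]]*-[[:space:]]*[a-z.]*import_playbook:[[:space:]]*//p' "$orch" |
                tr -d "\"'" |
                while read -r target; do
                    [ -f "$dir/$target" ] || printf '%s: %s\n' "$orch" "$target"
                done
        done
    )
    [ -z "$missing" ] && return 0
    printf '%s\n' "$missing"
    return 1
}

# Usage:
#   missing_imports site.yml harden.yml bootstrap.yml
```

This only checks that the files exist; `ansible-playbook site.yml --syntax-check` remains the check that they parse.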

Verification Checklist

- Templates moved with git mv (show as 100% renames)
- --check --diff on a real host = changed=0 (or only intended diffs)
- Consolidation flags keyed to inventory groups, not ansible_os_family
- Re-implemented roles reconciled to live parity before rollout (no surprise changed=N)
- Security subsystems rolled out host-by-host with service-active verification
- Old playbooks/templates deleted only after the role is converged fleet-wide
- Orchestrators (site.yml/harden.yml/bootstrap.yml) rewired; stale references swept
