majorlinux c358e0dfea restic runbook: document the snapshot-group-per-path-set gotcha

Changing a host's restic_paths spawns a new snapshot group (restic
groups by host+paths), so old and new path-sets each keep their own
retention lineage. Surfaced while extending majorlab's backup scope.

2026-06-21 12:33:56 -04:00

7.8 KiB


App-Consistent Fleet Backups with restic + Backblaze B2

A repeatable pattern for backing up a mixed fleet (Ubuntu + Fedora, VPS + homelab, bare services + Docker) to Backblaze B2 with restic — encrypted, deduplicated, and app-consistent (databases are dumped before the snapshot, not copied live). Driven by Ansible and a per-host systemd timer.

The Short Answer

Per host, nightly: dump every database to a staging dir → restic backup that staging dir plus the data paths → apply retention → wipe staging. A monthly timer runs restic prune. Anything that fails emails the admin. One B2 bucket holds a separate repo per host at b2:<bucket>:<hostname>.

Retention is --keep-daily 7 --keep-weekly 4 --keep-monthly 6 (~6 months of history).

Why dump databases first

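Copying a live database's files (/var/lib/mysql, a running SQLite file, a Postgres data dir) gives you a crash-consistent copy at best, restorable only if you're lucky. Logical dumps give a consistent point-in-time view instead (for mysqldump --single-transaction, that holds for InnoDB tables; MyISAM tables can still change mid-dump):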

MySQL / MariaDB: mysqldump --single-transaction --routines --triggers --databases <db>
PostgreSQL: pg_dump -Fc <db> (custom format) via the postgres system user (peer auth)
SQLite: sqlite3 <file> ".backup '<out>'" — uses the online backup API, safe against a running writer
Dockerized DBs: docker exec <container> sh -c '<dump cmd>', letting the container's own shell expand its root-password env var

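restic then backs up the dump files. Plain-text SQL dumps dedupe well: restic splits files into content-defined chunks, so only the chunks that changed upload each night. pg_dump -Fc output is compressed by default, so it tends to dedupe less well than plain SQL.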
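For the Dockerized case, the quoting is the part that is easy to get wrong. The following is a minimal sketch, assuming a MySQL/MariaDB container that keeps its root password in MYSQL_ROOT_PASSWORD and ships a mysqldump binary (some newer MariaDB images name it mariadb-dump instead). The function name dump_docker_mysql and the names in the usage comment are hypothetical.

```shell
# Dump one database from a MySQL/MariaDB container into a file on the host.
# The single-quoted script is handed verbatim to the container's shell, so
# $MYSQL_ROOT_PASSWORD is expanded inside the container, never on the host.
# The database name travels as a positional argument ($1) to avoid splicing
# it into the script text.
dump_docker_mysql() {
  local container=$1 db=$2 out=$3
  docker exec "$container" sh -c \
    'exec mysqldump -uroot -p"$MYSQL_ROOT_PASSWORD" --single-transaction --routines --triggers --databases "$1"' \
    sh "$db" > "$out"
}

# Usage (hypothetical container and database names):
#   dump_docker_mysql castopod-db castopod_db "$STAGING/mysql-castopod_db.sql"
```

Because the password never appears in the host's command line, it also stays out of the host's process list and shell history.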

Repository layout

One private B2 bucket (e.g. majorshouse-backups).
One repo per host: b2:majorshouse-backups:<hostname>.
The application key needs read + write + delete for the bucket. restic deletes objects during forget/prune, so a pure append-only key will break retention. (True append-only requires splitting forget/prune onto a separate maintenance key — a worthwhile hardening step, but not the default.)
Credentials live in an EnvironmentFile (/etc/restic/restic-env, mode 0600, root): RESTIC_REPOSITORY, RESTIC_PASSWORD, B2_ACCOUNT_ID, B2_ACCOUNT_KEY.
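If you ever need to lay the credentials file down by hand (the Ansible role normally does this), something like the sketch below keeps the permissions tight from the first byte. The function name write_restic_env is hypothetical, and the CHANGE_ME values are placeholders rather than real defaults.

```shell
# Write a restic EnvironmentFile skeleton with owner-only permissions.
# Values are placeholders; real secrets should come from your vault.
write_restic_env() {
  local dest=$1 bucket=$2 host=$3
  (
    umask 077   # a newly created file is never group- or world-readable
    cat > "$dest" <<EOF
RESTIC_REPOSITORY=b2:${bucket}:${host}
RESTIC_PASSWORD=CHANGE_ME
B2_ACCOUNT_ID=CHANGE_ME
B2_ACCOUNT_KEY=CHANGE_ME
EOF
  )
  chmod 0600 "$dest"   # also tightens a file that already existed
}

# Usage (as root):
#   write_restic_env /etc/restic/restic-env majorshouse-backups "$(hostname -s)"
```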

The backup script (shape)

set -uo pipefail
STAGING=/var/backups/restic-staging
rm -rf "$STAGING"; mkdir -p "$STAGING"; chmod 700 "$STAGING"

# per-engine dumps into $STAGING ...
mysqldump --single-transaction --routines --triggers --databases wordpress > "$STAGING/mysql-wordpress.sql"
sudo -u postgres pg_dump -Fc mastodon_production            > "$STAGING/pg-mastodon_production.dump"
sqlite3 /opt/phantombot/config/phantombot.db ".backup '$STAGING/sqlite-phantombot.db'"

restic backup --tag fleet-backup --host "$(hostname -s)" \
  "$STAGING" /var/www /etc/letsencrypt --exclude /path/to/already-offsite/media

restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6
rm -rf "$STAGING"

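Wrap each step so a failure mails the admin and aborts (don't silently back up a half-state). The shape above sets -u and pipefail but not -e, so nothing aborts on its own; the wrapper has to check each exit status. On hosts where the mail CLI is absent, pipe a message to /usr/sbin/sendmail -t instead.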
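One way to build that wrapper is sketched below. It is not the deployed script: run_step, notify_admin, and ADMIN_EMAIL are hypothetical names, and the mail/sendmail fallback simply follows the note above.

```shell
ADMIN_EMAIL="${ADMIN_EMAIL:-root}"   # hypothetical: where failure mail goes

# Send a short message with mail(1) if present, else hand-build one for sendmail.
notify_admin() {
  local subject=$1 body=$2
  if command -v mail >/dev/null 2>&1; then
    printf '%s\n' "$body" | mail -s "$subject" "$ADMIN_EMAIL"
  else
    printf 'To: %s\nSubject: %s\n\n%s\n' "$ADMIN_EMAIL" "$subject" "$body" \
      | /usr/sbin/sendmail -t
  fi
}

# Run one step; on failure, mail the admin and abort with the step's exit code.
run_step() {
  local label=$1 rc
  shift
  "$@" && return 0
  rc=$?
  notify_admin "restic backup FAILED: $label" \
    "Step '$label' exited with status $rc; backup aborted."
  exit "$rc"
}

# Usage: steps that redirect output need a small function of their own.
#   dump_wordpress() { mysqldump --single-transaction --routines --triggers \
#     --databases wordpress > "$STAGING/mysql-wordpress.sql"; }
#   run_step "mysqldump wordpress" dump_wordpress
#   run_step "restic backup" restic backup --tag fleet-backup "$STAGING" /var/www
```

Aborting with the failed step's own exit code means systemd marks the unit as failed, so the failure is visible in systemctl status even if the mail never arrives.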

systemd units

A oneshot service + a timer. Stagger OnCalendar per host to spread B2 load, and always set RESTIC_CACHE_DIR (see Gotchas):

# restic-backup.service
[Service]
Type=oneshot
EnvironmentFile=/etc/restic/restic-env
Environment=RESTIC_CACHE_DIR=/var/cache/restic
ExecStart=/usr/local/sbin/restic-backup.sh
Nice=10
IOSchedulingClass=idle

# restic-backup.timer
[Timer]
OnCalendar=*-*-* 02:30:00
RandomizedDelaySec=20m
Persistent=true
[Install]
WantedBy=timers.target

A second restic-prune.timer runs restic prune monthly (OnCalendar=*-*-01 04:00:00).

Restore procedure

The whole point. From the target host (or any host with the repo creds):

# load repo + B2 creds without echoing them
set -a; . /etc/restic/restic-env; set +a

restic snapshots                      # list; note the snapshot ID or use 'latest'

# restore specific paths to a scratch dir (never restore in place blindly)
restic restore latest --target /tmp/restore \
  --include /var/backups/restic-staging \
  --include /var/www/html/wp-config.php

# verify before doing anything with it
ls -la /tmp/restore/var/backups/restic-staging/
head -1 /tmp/restore/var/backups/restic-staging/mysql-wordpress.sql   # "-- MySQL dump 10.13 ..."

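To recover a database, restore the dump, then load it: mysql < mysql-<db>.sql (the dump was taken with --databases, so it carries its own CREATE DATABASE and USE statements and always loads into the original database name), pg_restore -d <db> pg-<db>.dump (the target database must already exist), or stop the app and copy the SQLite file back. Test restores periodically: a backup you've never restored is a hope, not a backup. Restore the highest-stakes data (password manager, mail) first in any drill.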
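For a drill, a slightly stronger check than head -1 can be scripted. The sketch below assumes the staging filename prefixes used in the backup script (mysql-, pg-, sqlite-); the function name verify_dump is hypothetical, and the checks are sanity tests, not proof that a dump will load cleanly.

```shell
# Sanity-check a restored dump before loading it anywhere.
# Dispatches on the staging filename prefixes used by the backup script.
verify_dump() {
  local f=$1
  [ -s "$f" ] || { echo "empty or missing: $f" >&2; return 1; }
  case "$(basename "$f")" in
    mysql-*.sql)
      # mysqldump normally ends a finished dump with a "-- Dump completed" line
      tail -n 1 "$f" | grep -q '^-- Dump completed' ;;
    pg-*.dump)
      # listing the table of contents fails if the archive header is unreadable
      pg_restore --list "$f" >/dev/null ;;
    sqlite-*.db)
      # integrity_check prints a single "ok" for a healthy database
      [ "$(sqlite3 "$f" 'PRAGMA integrity_check;')" = "ok" ] ;;
    *)
      echo "unknown dump type: $f" >&2; return 1 ;;
  esac
}

# Usage:
#   for f in /tmp/restore/var/backups/restic-staging/*; do
#     verify_dump "$f" || echo "FAILED: $f"
#   done
```

The real test is still loading the dump into a scratch database; this only catches the cheap failures (empty files, truncated SQL) early.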

Adding a host

Add it to the backups inventory group.

Give it a host_vars scope — which DBs to dump and which paths to back up:

restic_backup_oncalendar: "*-*-* 02:40:00"   # stagger
restic_mysql_dbs: [castopod_db]
restic_paths: [/var/www/html/castopod]
restic_excludes: [/var/www/html/castopod/public/media]   # already offsite

Run the playbook against that host. The role installs restic, deploys the script + units, runs restic init if the repo doesn't exist yet, and enables the timers.
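The "init if absent" step can be done by hand with a common idiom, sketched here. How the role actually implements it is not shown in this runbook, so treat this as an assumption; the function name ensure_repo is hypothetical, and it expects the restic environment to be loaded already.

```shell
# Initialise the repository only when it cannot be opened.
# `restic cat config` exits non-zero if the repo does not exist. It also fails
# on bad credentials or a network error, in which case `restic init` fails too
# rather than clobbering anything.
ensure_repo() {
  if restic cat config >/dev/null 2>&1; then
    echo "repository already initialised"
  else
    restic init
  fi
}

# Usage:
#   set -a; . /etc/restic/restic-env; set +a
#   ensure_repo
```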

Gotchas & Notes

RESTIC_CACHE_DIR is mandatory under systemd. System services like this one run with no $HOME, so restic can't find its cache and warns "unable to locate cache directory: neither $XDG_CACHE_HOME nor $HOME are defined". Without the cache, every run has to pull repository metadata (indexes, snapshot trees) back down from B2 instead of reading it locally, which makes runs slow and adds download charges. Point it at /var/cache/restic in the unit.
sqlite3 may not be installed. A host that runs a SQLite-backed app (e.g. a bot) often lacks the sqlite3 CLI, because the app links the library directly. Install it (package sqlite3 on Ubuntu, sqlite on Fedora) wherever restic_sqlite_paths is set, or the .backup step fails.
Docker DB password env-var names vary. Don't assume: the MariaDB image may use MYSQL_ROOT_PASSWORD (not MARIADB_ROOT_PASSWORD), and a Postgres container's superuser is whatever POSTGRES_USER is set to — reference "$POSTGRES_USER" rather than hardcoding postgres. Check with docker exec <c> sh -c 'env | grep -oE "^(MYSQL|MARIADB|POSTGRES)_[A-Z_]*"' (name only).
B2 key needs delete capability. Otherwise forget/prune fail. Scope the key to the bucket; reach for per-host namePrefix-restricted keys for blast-radius isolation.
Exclude data that's already offsite. Media already synced to object storage (S3/B2 via the app or rclone) should be --excluded so you don't pay to store it twice.
First upload is slow, the rest are fast. The initial snapshot reads and uploads everything; subsequent runs only ship changed blocks. For a large first run, fire it detached and watch from a transient unit that emails you on completion.
Keep secrets out of git. The repo password and B2 key belong in an Ansible vault (committed encrypted), referenced into the role — never in plaintext vars.
Changing a host's backup paths starts a new snapshot group. restic forget groups snapshots by host+paths by default, so adding or removing a path on an existing host creates a separate lineage: the old path-set and the new one each retain their own 7d/4w/6m snapshots, and restic snapshots shows both. Expected, not a bug — but it means the old-path snapshots age out on their own schedule rather than being superseded. To collapse everything into one retention bucket, run forget with --group-by host (be deliberate: it then treats any path-set on that host as the same group).
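For the snapshot-group case, the collapse can be previewed before anything is removed. The sketch below reuses the fleet retention policy from this runbook; the function name forget_by_host is hypothetical, and it assumes the restic environment is already loaded.

```shell
# Apply the fleet retention policy with every path-set on the host treated as
# one group. Extra arguments (such as --dry-run) are passed through to restic.
forget_by_host() {
  restic forget --group-by host \
    --keep-daily 7 --keep-weekly 4 --keep-monthly 6 "$@"
}

# Preview first, then run for real:
#   restic snapshots --group-by host,paths   # shows the separate lineages
#   forget_by_host --dry-run                 # lists what would be kept/removed
#   forget_by_host
```

Running the dry run first matters here because grouping by host alone lets a snapshot of the new path-set push out the last snapshot of the old one.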
