MajorLinux 4e63d8546c mastodon: document S3 ACL upload failures + bulk avatar restore

New article mastodon-s3-acl-upload-failures.md: a BucketOwnerEnforced S3
bucket plus a stale S3_PERMISSION/S3_ACL in .env.production makes every
Mastodon upload fail with AccessControlListNotSupported, silently. Covers
symptoms (incl. why a missing object returns 403 not 404), diagnosis,
the fix (S3_PERMISSION= empty, public read via bucket policy), recovery,
a synthetic-write health check, and Ansible enforcement.

Extend mastodon-prune-profiles-trap.md: add a "Bulk restore at scale"
procedure (list existing keys, null missing DB refs, enqueue
RedownloadAvatar/HeaderWorker), a "storage-level deletion without DB
de-ref" section, and a stronger recommendation to disable automated
profile pruning (and scheduled accounts refresh --all) entirely.

Link both from SUMMARY.md and the selfhosting index.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-01 15:45:23 -04:00

Mastodon — The `--prune-profiles` Trap and How to Recover

If you administer a Mastodon instance and run tootctl media remove --prune-profiles on a schedule, you're probably introducing a long-running cosmetic regression that no one will be able to explain when it happens.

This article documents what the flag actually does, why the missing avatars don't auto-recover, and the smallest tool you can ship to fix things on demand.

TL;DR

tootctl media remove --prune-profiles deletes the cached avatars and headers of remote accounts not refreshed within --days=N from your S3/local storage, and clears accounts.avatar_file_name (and header_file_name) in the database.
Mastodon does not re-fetch avatars when a client views a profile. Re-fetch happens only on incoming Update ActivityPub activities or via an explicit tootctl accounts refresh.
Quiet remote accounts therefore stay broken — sometimes for weeks — after a prune.
The disk savings are modest (≈250 KB per account on average), and the cosmetic damage lands on quiet accounts. With --include-follows, that means exactly the accounts you care about most: your follows.
Most admins should drop --prune-profiles and --remove-headers from cron and refresh on demand instead.

What the flags actually do

tootctl media remove has three distinct modes:

| Invocation | Target | Default `--days` |
|---|---|---|
| `tootctl media remove` | remote media attachments (images/video in posts) | 7 |
| `tootctl media remove --prune-profiles` | remote avatars and headers | 7 |
| `tootctl media remove --remove-headers` | remote headers only | 7 |

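Each mode deletes the files from your storage backend and nulls the corresponding accounts.avatar_file_name / header_file_name columns. For the two profile modes, --days is measured against when the account was last webfingered and last updated, not against the file's age, and accounts that follow or are followed by a local account are skipped unless you also pass --include-follows. The two profile flags are mutually exclusive; passing both produces: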

```text
--prune-profiles and --remove-headers should not be specified simultaneously
```

If your cron script combines them, tootctl refuses to run and the avatar/header pruning never happens. Nobody reads cron output, so this fails silently, and the first time you fix the line you'll prune everything that has accumulated since the instance was created.
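A cheap guard against that failure mode is to lint the cron script before deploying it. The sketch below is a hypothetical helper (conflicting_prune_lines is our own name, not part of tootctl) that reports the line numbers where a tootctl media remove call passes both flags. It is deliberately naive: it drops everything after a `#` and does not follow backslash line continuations.

```ruby
# Hypothetical pre-deploy lint for a media-prune cron script: report the line
# numbers of any `tootctl media remove` call that passes both profile flags.
# Naive on purpose: strips everything after "#" and ignores "\" continuations.
def conflicting_prune_lines(script_text)
  script_text.each_line.with_index(1).filter_map do |line, lineno|
    code = line.sub(/#.*/, "")
    next unless code.include?("tootctl media remove")
    next unless code.include?("--prune-profiles") && code.include?("--remove-headers")

    lineno
  end
end
```

Usage would be something like `conflicting_prune_lines(File.read("/path/to/media-prune.sh"))` (the path is a placeholder); an empty array means no line trips the mutual-exclusion check.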

Why the pictures don't come back

Mastodon's media-recovery model is event-driven, not lazy. The triggers that cause a remote avatar to be re-fetched are:

The remote actor emits an Update ActivityPub activity — typically when they edit their profile, change avatar, change display name, etc.
Less reliably, certain Create activities on accounts whose actor state appears stale.
Manual: tootctl accounts refresh user@instance.tld, the web UI's "Refresh profile" button (gear menu on the profile page), or admin actions touching the actor record.

What does not trigger a re-fetch:

Loading the profile in any client (web, iOS app, Ivory, Tusky, Toot!, etc.).
Liking, replying to, boosting, or following toots from the user.
Viewing the user in your followers/following list.

This is why you see broken avatars consistently across every client and device — the asset is missing on your server, and your clients are all faithfully fetching from the same broken URL.

Active accounts re-emit Update activities reasonably often, so they self-heal over hours/days. Quiet accounts, accounts on small or down instances, and accounts whose owners simply don't update their profiles can stay broken indefinitely.

Recovery on demand

Single account:

```bash
sudo -u mastodon -H bash -c '
  cd /home/mastodon/live
  export RAILS_ENV=production
  export PATH=/home/mastodon/.rbenv/bin:/home/mastodon/.rbenv/shims:$PATH
  bin/tootctl accounts refresh user@instance.tld
'
```

For your local user's follows, a small wrapper that finds only accounts with broken avatars whose origin actually advertises one:

```bash
#!/bin/bash
# refresh-my-follows.sh: repopulate broken avatars for a local user's
# follows. Idempotent. Skips accounts whose origin has no avatar (e.g.,
# users who never set one) and headers entirely (many users have none).
# Run as the mastodon user, e.g. sudo -u mastodon -H ./refresh-my-follows.sh alice
set -euo pipefail

export PATH="/home/mastodon/.rbenv/bin:/home/mastodon/.rbenv/shims:$PATH"
export RAILS_ENV=production
cd /home/mastodon/live

USER_TO_REFRESH="${1:-yourusername}"

# Print user@domain for every remote follow with an empty avatar cache whose
# origin advertises an avatar; the grep drops any log noise from rails runner.
accts=$(bin/rails runner "
  acct = Account.find_by(username: %q($USER_TO_REFRESH), domain: nil)
  abort %q(no such local account) unless acct
  acct.following
    .where.not(domain: nil)
    .where(avatar_file_name: nil)
    .where.not(avatar_remote_url: [nil, ''])
    .pluck(:username, :domain)
    .each { |u, d| puts %Q(#{u}@#{d}) }
" | grep -E '^[^[:space:]@]+@[^[:space:]@]+$' || true)

count=$(printf '%s\n' "$accts" | grep -cv '^$' || true)
echo "Found $count remote follows with missing avatar"

i=0
while IFS= read -r a; do
  [ -z "$a" ] && continue
  i=$((i+1))
  printf '[%d/%d] refresh %s ... ' "$i" "$count" "$a"
  # </dev/null stops tootctl from reading the here-string that feeds this loop.
  if bin/tootctl accounts refresh "$a" </dev/null >/dev/null 2>&1; then
    echo OK
  else
    echo FAIL
  fi
done <<< "$accts"
```

Three things in that WHERE clause matter:

avatar_file_name: nil — local cache is empty, so we need to fetch.
domain: not nil — only remote accounts have cached avatars to repopulate.
avatar_remote_url: [nil, ''] excluded — if the origin actor object has no avatar, refresh will not populate anything. Including these accounts means the script selects and retries them on every run without ever making progress.
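To reason about edge cases without a database, the three conditions can be mirrored in plain Ruby. RemoteAccount below is a hypothetical stand-in struct, not Mastodon's Account model, and needs_avatar_refresh? is our own name; it returns true only for the rows the rails runner query above would print.

```ruby
# Plain-Ruby mirror of the wrapper's three query conditions. RemoteAccount is
# a hypothetical stand-in, not Mastodon's Account model.
RemoteAccount = Struct.new(:username, :domain, :avatar_file_name, :avatar_remote_url,
                           keyword_init: true)

def needs_avatar_refresh?(acct)
  return false if acct.domain.nil?               # local account: no remote cache
  return false unless acct.avatar_file_name.nil? # cache already populated
  url = acct.avatar_remote_url
  !(url.nil? || url.empty?)                      # origin must advertise an avatar
end

follows = [
  RemoteAccount.new(username: "alice", domain: "example.social",
                    avatar_file_name: nil, avatar_remote_url: "https://example.social/a.png"),
  RemoteAccount.new(username: "bob", domain: "example.social",
                    avatar_file_name: nil, avatar_remote_url: ""),
]
to_refresh = follows.select { |a| needs_avatar_refresh?(a) }.map { |a| "#{a.username}@#{a.domain}" }
```

Here bob is excluded because his origin advertises no avatar, so refreshing him could never fill the cache.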

Bulk restore at scale

When the breakage is large — a bad prune across the whole instance, or a storage-level deletion (see the next section) — refreshing follows one at a time isn't enough. The generalized procedure:

List the keys that actually exist in storage, so you only touch the broken ones.
For each account whose current avatar/header key is absent, null the *_file_name (the redownload workers skip accounts that still have a file name) and enqueue the worker.
Let Sidekiq's pull queue drain.

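```ruby
# Run inside the Mastodon app, e.g.:
#   RAILS_ENV=production bin/rails runner restore_missing.rb
require "aws-sdk-s3"
require "set"

# Same credentials Mastodon uses. S3_ENDPOINT is only needed for
# non-AWS providers (MinIO, Wasabi, etc.).
opts = {
  region: ENV["S3_REGION"],
  access_key_id: ENV["AWS_ACCESS_KEY_ID"],
  secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"],
}
opts[:endpoint] = ENV["S3_ENDPOINT"] if ENV["S3_ENDPOINT"].present?
c = Aws::S3::Client.new(**opts)
b = ENV.fetch("S3_BUCKET")

# Collect every existing key under a prefix, following pagination.
def keys(c, b, prefix)
  s = Set.new
  t = nil
  loop do
    r = c.list_objects_v2(bucket: b, prefix: prefix, continuation_token: t, max_keys: 1000)
    r.contents.each { |o| s << o.key }
    break unless r.is_truncated
    t = r.next_continuation_token
  end
  s
end

avset = keys(c, b, "cache/accounts/avatars/")
hdset = keys(c, b, "cache/accounts/headers/")

Account.where.not(domain: nil)
       .where("avatar_file_name IS NOT NULL OR header_file_name IS NOT NULL")
       .find_each(batch_size: 1000) do |a|
  # DB says the file is cached and the origin advertises one, but the key is
  # gone. Storage keys have no leading "/", so strip it from the Paperclip path.
  if a.avatar_file_name.present? && a.avatar_remote_url.present? &&
     !avset.include?(a.avatar.path.sub(%r{^/}, ""))
    a.update_column(:avatar_file_name, nil) # the worker skips rows with a file name
    RedownloadAvatarWorker.perform_async(a.id)
  end
  if a.header_file_name.present? && a.header_remote_url.present? &&
     !hdset.include?(a.header.path.sub(%r{^/}, ""))
    a.update_column(:header_file_name, nil)
    RedownloadHeaderWorker.perform_async(a.id)
  end
end
```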
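On a large instance it's worth doing a dry run that only counts what the loop above would re-enqueue. The sketch below factors the two conditions into a pure function over plain rows. Row, nonblank? and restore_plan are hypothetical names; avatar_key/header_key stand in for a.avatar.path / a.header.path as used above.

```ruby
require "set"

# Hypothetical stand-in for the fields the bulk-restore loop reads. avatar_key
# and header_key play the role of a.avatar.path / a.header.path.
Row = Struct.new(:id, :avatar_file_name, :avatar_remote_url, :avatar_key,
                 :header_file_name, :header_remote_url, :header_key,
                 keyword_init: true)

# Local helper, deliberately not named present? so it cannot clobber
# ActiveSupport's Object#present? if pasted into rails runner.
def nonblank?(s)
  !s.nil? && !s.strip.empty?
end

# Returns [[id, :avatar], [id, :header], ...]: every attachment the real loop
# would null and re-enqueue. Touches neither the database nor Sidekiq.
def restore_plan(rows, avatar_keys, header_keys)
  rows.each_with_object([]) do |r, plan|
    if nonblank?(r.avatar_file_name) && nonblank?(r.avatar_remote_url) &&
       !avatar_keys.include?(r.avatar_key.delete_prefix("/"))
      plan << [r.id, :avatar]
    end
    if nonblank?(r.header_file_name) && nonblank?(r.header_remote_url) &&
       !header_keys.include?(r.header_key.delete_prefix("/"))
      plan << [r.id, :header]
    end
  end
end
```

Inside rails runner you could build Rows from the same find_each batches and print the plan size before switching to the mutating loop; evaluating per batch avoids holding every account in memory at once.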

Notes:

Listing existing keys first means you re-fetch only what's missing, instead of re-downloading every avatar — which would re-bloat a bucket you may have just trimmed.
The workers return early if *_file_name is present, which is why you must update_column(..., nil) before enqueuing.
Avatars are small (tens of KB each), so re-fetching the whole missing set typically adds a few GB and a few hours of Sidekiq pull work. Headers are larger but still modest.
Origins that deleted the avatar after you cached it return 404 — the permanent, irrecoverable tail.

Broader failure: storage-level deletion without DB de-ref

--prune-profiles is one way avatars vanish, but it at least nulls the database column, so the account re-fetches on its next Update. The more dangerous variant is deleting objects directly in your storage backend — a manual aws s3 rm, an S3 lifecycle expiration rule, a bucket migration that doesn't copy everything, or any "cost cleanup" done outside tootctl. Those delete the file but leave accounts.avatar_file_name set, pointing at an object that no longer exists.

Why it's worse:

The DB still thinks the avatar is present, and the redownload workers skip the account (*_file_name is non-null) — so it never self-heals until an Update arrives.
It can hit every remote account at once, not just quiet ones.
It looks identical to the S3-ACL upload bug; see Mastodon on S3 — Silent Upload Failures. Tell them apart by testing a fresh upload: if new uploads fail, it's the ACL bug; if new uploads succeed and only old objects are gone, it was a one-off deletion.

Recover with the bulk restore procedure above. Prevent it by never deleting Mastodon media at the storage level: prune attachments through tootctl media remove (which derefs the DB and re-fetches on demand) and leave avatars/headers alone.
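The "do new uploads succeed?" test can be scripted as a synthetic write. This is a minimal sketch under stated assumptions: client is anything with Aws::S3::Client-style put_object/delete_object; acl mirrors S3_PERMISSION, with an empty value meaning no ACL header (the fix the S3 upload-failures article recommends); probe_upload and the healthcheck/ prefix are hypothetical.

```ruby
require "securerandom"

# Hypothetical synthetic-write probe. `client` is anything with
# Aws::S3::Client-style put_object/delete_object (a real client in production,
# a fake in tests). `acl` mirrors S3_PERMISSION: nil or "" sends no ACL.
def probe_upload(client, bucket, acl: nil, prefix: "healthcheck/")
  key = "#{prefix}probe-#{SecureRandom.hex(8)}"
  params = { bucket: bucket, key: key, body: "ok" }
  params[:acl] = acl unless acl.nil? || acl.empty?

  begin
    client.put_object(**params)
  rescue StandardError => e
    # AWS SDK service errors expose the S3 error code via #code.
    code = e.respond_to?(:code) ? e.code.to_s : ""
    return code == "AccessControlListNotSupported" ? :acl_rejected : :upload_failed
  end

  # Best-effort cleanup; a failed delete does not change the verdict.
  begin
    client.delete_object(bucket: bucket, key: key)
  rescue StandardError
    nil
  end
  :uploads_ok
end
```

Pass whatever ACL your Mastodon configuration actually sends. :acl_rejected points at the ACL bug; :uploads_ok while old objects are still missing points at a one-off storage-level deletion.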

Why `header_file_name IS NULL` is a bad signal

A naive script will treat both avatar_file_name IS NULL and header_file_name IS NULL as "broken." Don't.

Roughly 20% of Mastodon users never set a custom header — the default blank header isn't represented as a file, so header_file_name is legitimately NULL for them. After a tootctl accounts refresh, the field stays NULL because there is genuinely nothing to fetch. A script with OR header_file_name IS NULL will retry these accounts forever and never make progress.

Avatars are different: nearly all real users set one, so avatar_file_name IS NULL with a non-empty avatar_remote_url is a reliable "broken and fixable" signal.
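The difference shows up over repeated runs. The toy model below uses made-up accounts (alice never set a header) and a refresh! that, like a real refresh, can only populate what the origin advertises; it is an illustration, not Mastodon code. The naive OR selector keeps picking alice on every run, while the avatar-only selector converges to zero.

```ruby
# Toy model with made-up accounts: a refresh can only populate what the
# origin advertises, just as a real refresh cannot fetch a header that
# was never set.
Acct = Struct.new(:name, :origin_avatar, :origin_header, :avatar_file, :header_file)

def refresh!(a)
  a.avatar_file = "cached" unless a.origin_avatar.nil?
  a.header_file = "cached" unless a.origin_header.nil?
end

NAIVE   = ->(a) { a.avatar_file.nil? || a.header_file.nil? }
CORRECT = ->(a) { a.avatar_file.nil? && !a.origin_avatar.nil? }

def sample_accounts
  [
    Acct.new("alice", "a.png", nil, nil, nil),            # broken avatar, no header at origin
    Acct.new("bob", "b.png", "h.png", "cached", "cached"), # healthy
  ]
end

# Run the select-then-refresh cycle `runs` times; return how many accounts
# the selector would still pick afterwards.
def pending_after_runs(accounts, selector, runs)
  runs.times { accounts.select(&selector).each { |a| refresh!(a) } }
  accounts.count(&selector)
end

puts pending_after_runs(sample_accounts, NAIVE, 5)   # → 1 (alice, forever)
puts pending_after_runs(sample_accounts, CORRECT, 5) # → 0
```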

The cron decision

If your weekly media-prune cron currently looks like:

```bash
bin/tootctl media remove --days=7 --concurrency=5
bin/tootctl media remove --prune-profiles --days=7 --concurrency=5
bin/tootctl media remove --remove-headers --days=7 --concurrency=5
bin/tootctl preview_cards remove --days=30 --concurrency=5
```

Consider deleting the middle two lines. The attachment prune is the real disk-saver (gigabytes per week on a busy instance). The avatar prune is small (~250 KB per remote account) and damages your UX. The header prune frees somewhat more per account, since headers are larger than avatars, but it carries the same UX cost and is rarely worth it.

Stronger recommendation: after being bitten more than once, the safest policy is to disable automated profile/header pruning entirely — and reconsider scheduled tootctl accounts refresh --all, which re-fetches every profile and is destructive when uploads are failing at the time. Keep only a deliberate, occasional attachment prune if bucket size demands it. Pair that with a synthetic upload monitor (see Mastodon on S3 — Silent Upload Failures) so any future regression is caught in hours instead of by a user weeks later.

Edge cases

Origin-side 404: the actor object advertises an avatar URL, but the URL itself returns 404. Your local cache stays empty no matter how many times you refresh. Only the origin user can fix it (re-upload). The script above will keep retrying these on every run; if that bothers you, add a "tried within last N hours" filter.
Suspended accounts: tootctl accounts refresh returns OK on suspended accounts but does not download media. They'll stay broken, which is correct behavior.
Sidekiq backlog: the avatar fetch is queued as a Sidekiq job, not done synchronously. If your pull queue is deep, you'll see a delay between "OK" and the avatar actually appearing in the database.
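For the origin-side 404 case, one way to implement the "tried within last N hours" filter is a small ledger keyed by user@domain. The sketch below is hypothetical (RefreshLedger and the state-file location are our own choices): call due? before refreshing an account, record! after each attempt, and save! at the end of the run.

```ruby
require "json"
require "time"

# Hypothetical retry throttle for the origin-side 404 case: remember when each
# user@domain was last refreshed and skip it if that was within `hours`.
# The state-file path is up to you; it must be writable by the mastodon user.
class RefreshLedger
  def initialize(path)
    @path = path
    @seen = File.exist?(path) ? JSON.parse(File.read(path)) : {}
  end

  def due?(acct, hours:, now: Time.now)
    last = @seen[acct]
    last.nil? || (now - Time.iso8601(last)) >= hours * 3600
  end

  # getutc returns a copy, so the caller's Time object is not mutated.
  def record!(acct, now: Time.now)
    @seen[acct] = now.getutc.iso8601
  end

  # Write to a temp file and rename, so a crash never leaves half-written JSON.
  def save!
    tmp = "#{@path}.tmp"
    File.write(tmp, JSON.pretty_generate(@seen))
    File.rename(tmp, @path)
  end
end
```

Accounts that refresh successfully drop out of the query on their own, so the ledger only ends up suppressing the ones that keep failing.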

See also

Mastodon Instance Tuning — broader perf notes for self-hosters
Mastodon DB Maintenance — what to run on a schedule and when
Mastodon Federation — how the actor refresh fits into the larger federation model
