MajorLinux 4e63d8546c mastodon: document S3 ACL upload failures + bulk avatar restore

New article mastodon-s3-acl-upload-failures.md: a BucketOwnerEnforced S3
bucket plus a stale S3_PERMISSION/S3_ACL in .env.production makes every
Mastodon upload fail with AccessControlListNotSupported, silently. Covers
symptoms (incl. why a missing object returns 403 not 404), diagnosis,
the fix (S3_PERMISSION= empty, public read via bucket policy), recovery,
a synthetic-write health check, and Ansible enforcement.

Extend mastodon-prune-profiles-trap.md: add a "Bulk restore at scale"
procedure (list existing keys, null missing DB refs, enqueue
RedownloadAvatar/HeaderWorker), a "storage-level deletion without DB
de-ref" section, and a stronger recommendation to disable automated
profile pruning (and scheduled accounts refresh --all) entirely.

Link both from SUMMARY.md and the selfhosting index.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-01 15:45:23 -04:00

Mastodon on S3 — Silent Upload Failures When the Bucket Disables ACLs

If your Mastodon instance stores media on S3 and you switch the bucket to Object Ownership = BucketOwnerEnforced (which AWS now recommends, and which the console nudges you toward), every media upload can start failing silently unless you also remove the object-ACL setting from .env.production. New avatars, headers, and attachments stop appearing; old ones keep working; nothing obvious is logged. This article is the diagnosis and fix.

TL;DR

BucketOwnerEnforced disables ACLs entirely on the bucket. Any request that carries an x-amz-acl header is rejected with AccessControlListNotSupported: The bucket does not allow ACLs.
Mastodon (via Paperclip) attaches x-amz-acl to every upload if S3_PERMISSION (or S3_ACL) is set in .env.production. The common value S3_PERMISSION=public-read — or a migration leftover like S3_PERMISSION=private — triggers the rejection.
Result: every new upload fails, but the database row is still updated, so Mastodon believes it has the file. The object never lands → broken image. Objects written before the bucket changed keep serving fine, which masks the problem.
Fix: set S3_PERMISSION= (empty) and remove any S3_ACL= line, then restart mastodon-web + mastodon-sidekiq. Public read is now served by the bucket policy, not per-object ACLs.

Symptoms

Newly changed avatars and headers show as broken images; attachments on new posts fail to display.
Avatars that were cached before the bucket setting changed still work — so "some work, some don't."
tootctl and the web UI report success; Sidekiq doesn't obviously error.
Direct fetch of a broken object's URL returns 403 AccessDenied (not 404 — see below).

Why a missing object returns 403, not 404

A typical Mastodon S3 bucket policy grants public s3:GetObject but not s3:ListBucket. Without ListBucket, S3 hides whether a key exists: a GET on a missing key returns 403 AccessDenied, identical to a permissions denial. So "403" here usually means the object isn't there, not the object is forbidden. This is why the failure reads like a permissions problem when it's really a failed write.
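The ambiguity described above can be written down as a small lookup. This is a minimal sketch, not part of Mastodon or the AWS SDK: the helper name `interpret_get_status` and the `can_list` flag are hypothetical and exist only to make the reasoning explicit.

```ruby
# Interpret the HTTP status of a GET on an S3 object URL.
# can_list: whether the caller has s3:ListBucket on the bucket
# (false for anonymous requests under a typical public-read-only policy).
def interpret_get_status(status, can_list: false)
  case status
  when 200 then :exists
  when 404 then :missing
  when 403
    # Without ListBucket, S3 answers 403 for missing keys as well, so the
    # status alone cannot separate "missing" from "forbidden".
    can_list ? :forbidden : :missing_or_forbidden
  else
    :unknown
  end
end
```

For an anonymous fetch of a broken avatar URL, `interpret_get_status(403)` yields `:missing_or_forbidden`, which is why the diagnosis below checks the key with the instance's own credentials instead of trusting the public status code.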

Diagnosis

Run these with the instance's own S3 credentials (e.g. via bin/rails runner, which loads .env.production):

require "aws-sdk-s3"
c = Aws::S3::Client.new(region: ENV["S3_REGION"],
                        access_key_id: ENV["AWS_ACCESS_KEY_ID"],
                        secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"])
b = ENV["S3_BUCKET"]

# 1. Is the bucket ACL-disabled?
puts c.get_bucket_ownership_controls(bucket: b).ownership_controls.rules.map(&:object_ownership).inspect
#   => ["BucketOwnerEnforced"]   <-- ACLs are OFF

# 2. Does an upload WITH an ACL fail, and WITHOUT one succeed?
begin
  c.put_object(bucket: b, key: "tmp/acltest", body: "x", acl: "public-read")
  puts "PUT+acl: OK"                               # the bucket still accepts ACLs
  c.delete_object(bucket: b, key: "tmp/acltest")   # clean up the test object
rescue => e
  puts "PUT+acl FAILS: #{e.class} / #{e.message}"   # AccessControlListNotSupported
end
c.put_object(bucket: b, key: "tmp/noacltest", body: "x")   # succeeds
c.delete_object(bucket: b, key: "tmp/noacltest")

# 3. Confirm a "broken" avatar's object is actually missing
key = Account.find_by(username: "someuser", domain: "remote.tld").avatar.path.sub(%r{^/}, "")
begin
  c.head_object(bucket: b, key: key)
  puts "EXISTS"
rescue Aws::S3::Errors::NotFound, Aws::S3::Errors::Forbidden
  # HEAD answers 404 when the credentials have s3:ListBucket, 403 when they do not
  puts "MISSING"
end

If #1 shows BucketOwnerEnforced and #2 shows the ACL'd PUT failing while the plain PUT succeeds, you've confirmed it.

Check .env.production for the offending settings:

grep -E '^S3_(ACL|PERMISSION|NO_INHERIT)' /home/mastodon/live/.env.production
# S3_ACL=private          <-- remove
# S3_PERMISSION=private    <-- set empty
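The same check can be scripted in Ruby, for example to run it as part of a monitoring job. This is a minimal sketch under a stated assumption: the file uses plain `KEY=value` lines, so quoted values and `export` prefixes are not handled. The helper name `acl_offenders` is hypothetical.

```ruby
# Return the ACL-related settings in a dotenv-style file that have a
# non-empty value, i.e. the ones that would put x-amz-acl on uploads.
# Assumption: plain KEY=value lines; quoting and `export` are not handled.
def acl_offenders(env_text)
  env_text.each_line.filter_map do |line|
    m = line.strip.match(/\A(S3_PERMISSION|S3_ACL)=(.*)\z/)
    next unless m                        # other keys and commented lines

    value = m[2].strip
    [m[1], value] unless value.empty?    # an empty value is the safe state
  end.to_h
end

sample = <<~ENV
  S3_ENABLED=true
  S3_PERMISSION=private
  # S3_ACL=public-read
  S3_ACL=
ENV

offenders = acl_offenders(sample)
puts offenders.empty? ? "OK: no ACL settings" : "Fix these: #{offenders.keys.join(', ')}"
```

Against the real file you would pass `File.read("/home/mastodon/live/.env.production")` instead of the sample string.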

The fix

Edit .env.production:
- S3_PERMISSION= (empty — Paperclip then sends no x-amz-acl header)
- remove or comment out any S3_ACL= line
Restart so the env is reloaded: systemctl restart mastodon-sidekiq mastodon-web
Verify the previously-failing write path now works — reprocess any existing avatar and confirm it serves 200:

a = Account.local.where.not(avatar_file_name: nil).first   # a local account that actually has an avatar
a.avatar.reprocess!     # used to raise AccessControlListNotSupported; now succeeds

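Public readability is now provided by the bucket policy (grant s3:GetObject on arn:aws:s3:::your-bucket/* to Principal: "*"). For that policy to take effect, the Block Public Access settings that govern policies (BlockPublicPolicy and RestrictPublicBuckets) must be off at both the account and the bucket level; the two ACL-related settings can stay on, since ACLs are no longer used. You do not need per-object ACLs at all.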
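For reference, the policy described above can be generated as JSON. This is a sketch, not the author's exact policy: the helper name `public_read_policy` and the `Sid` value are illustrative, and `your-bucket` is the same placeholder used in the text.

```ruby
require "json"

# Build a bucket policy that grants anonymous read on every object.
# The Sid is an arbitrary label; "your-bucket" below is a placeholder.
def public_read_policy(bucket)
  {
    "Version" => "2012-10-17",
    "Statement" => [
      {
        "Sid" => "PublicReadGetObject",
        "Effect" => "Allow",
        "Principal" => "*",
        "Action" => "s3:GetObject",
        "Resource" => "arn:aws:s3:::#{bucket}/*"
      }
    ]
  }
end

puts JSON.pretty_generate(public_read_policy("your-bucket"))
```

The printed JSON can be pasted into the bucket's policy editor, or applied with the SDK's `put_bucket_policy(bucket:, policy:)` using credentials that are allowed to change bucket policy (the instance's upload credentials usually should not be). Note that it grants only s3:GetObject, not s3:ListBucket, which is what produces the 403-for-missing behaviour described earlier.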

Recovering the avatars that broke while it was failing

Any media that failed to upload during the broken window is gone from S3 while the DB still references it. Because Mastodon's redownload workers skip accounts whose *_file_name is already set, you must null the dead reference first, then enqueue the worker. See Mastodon — The --prune-profiles Trap and How to Recover for the bulk procedure.

Don't let it happen silently again — monitor uploads

The worst part of this bug is the silence. Add a periodic synthetic write check that uploads a tiny object with the app's own credentials, confirms it, deletes it, and alerts on failure:

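require "aws-sdk-s3"
s3 = Aws::S3::Client.new(region: ENV["S3_REGION"])   # credentials are read from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
b  = ENV["S3_BUCKET"]
s3.put_object(bucket: b, key: "health/upload-check", body: "ok")  # no acl
s3.head_object(bucket: b, key: "health/upload-check")
s3.delete_object(bucket: b, key: "health/upload-check")
# any exception -> email an alert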

Pair it with an HTTP check that your local account avatars all return 200 (they always should). Run both every few hours from cron. A regression then pages you in hours instead of being discovered by a user weeks later.
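The HTTP side of that pairing could look like the sketch below. It is one possible shape, not a prescribed implementation: `HEAD_STATUS` and `failing_urls` are hypothetical names, and the fetcher is injectable so the logic can be exercised without touching the network.

```ruby
require "net/http"
require "uri"

# Default fetcher: send a HEAD request and return the status code as an Integer.
HEAD_STATUS = lambda do |url|
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri).code.to_i
  end
end

# Return a hash of url => status for every URL that did not answer 200.
# Network errors are reported by exception class name instead of a status.
def failing_urls(urls, fetch: HEAD_STATUS)
  urls.each_with_object({}) do |url, failures|
    status = begin
      fetch.call(url)
    rescue StandardError => e
      e.class.name
    end
    failures[url] = status unless status == 200
  end
end
```

Inside `bin/rails runner`, the URL list could come from something like `Account.local.where.not(avatar_file_name: nil).map { |a| a.avatar.url }`; treat that query as an assumption and adapt it to your instance. Alert whenever `failing_urls(urls)` is non-empty.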

Ansible enforcement

If you manage the host with Ansible, enforce the safe values so a future template render can't reintroduce the ACL header:

- name: Ensure S3_PERMISSION is empty (no x-amz-acl on uploads)
  ansible.builtin.lineinfile:
    path: /home/mastodon/live/.env.production
    regexp: '^S3_PERMISSION='
    line: 'S3_PERMISSION='
  notify: Restart Mastodon services

- name: Remove any active S3_ACL line (ACLs unsupported on this bucket)
  ansible.builtin.lineinfile:
    path: /home/mastodon/live/.env.production
    regexp: '^S3_ACL=.+'
    state: absent
  notify: Restart Mastodon services

Mastodon — The --prune-profiles Trap and How to Recover — the other way avatars go missing, plus the bulk-restore script
Mastodon Post-Install Hardening (Permissions + Account)
AWS S3 Cost Management — pruning attachments to control bucket size (safely)
