mastodon: document S3 ACL upload failures + bulk avatar restore
New article mastodon-s3-acl-upload-failures.md: a BucketOwnerEnforced S3 bucket plus a stale S3_PERMISSION/S3_ACL in .env.production makes every Mastodon upload fail with AccessControlListNotSupported, silently. Covers symptoms (incl. why a missing object returns 403 not 404), diagnosis, the fix (S3_PERMISSION= empty, public read via bucket policy), recovery, a synthetic-write health check, and Ansible enforcement.

Extend mastodon-prune-profiles-trap.md: add a "Bulk restore at scale" procedure (list existing keys, null missing DB refs, enqueue RedownloadAvatar/HeaderWorker), a "storage-level deletion without DB de-ref" section, and a stronger recommendation to disable automated profile pruning (and scheduled accounts refresh --all) entirely.

Link both from SUMMARY.md and the selfhosting index.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
155651c373
commit
4e63d8546c
4 changed files with 205 additions and 1 deletions
|
|
@ -37,6 +37,7 @@ Guides for running your own services at home, including Docker, reverse proxies,
|
|||
- [Mastodon DB Maintenance](services/mastodon-db-maintenance.md)
|
||||
- [Mastodon Federation](services/mastodon-federation.md)
|
||||
- [Mastodon `--prune-profiles` Trap](services/mastodon-prune-profiles-trap.md)
|
||||
- [Mastodon on S3 — Silent Upload Failures](services/mastodon-s3-acl-upload-failures.md)
|
||||
- [Ghost SMTP via Mailgun](services/ghost-smtp-mailgun-setup.md)
|
||||
- [Updating n8n Docker](services/updating-n8n-docker.md)
|
||||
- [Claude Code Remote Control](services/claude-code-remote-control.md)
|
||||
|
|
|
|||
|
|
@ -8,7 +8,7 @@ tags:
|
|||
- self-hosting
|
||||
- troubleshooting
|
||||
created: 2026-05-07
|
||||
updated: 2026-05-07
|
||||
updated: 2026-06-01
|
||||
---
|
||||
|
||||
# Mastodon — The `--prune-profiles` Trap and How to Recover
|
||||
|
|
@ -122,6 +122,68 @@ Three things in that WHERE clause matter:
|
|||
- `domain: not nil` — only remote accounts have cached avatars to repopulate.
|
||||
- `avatar_remote_url: [nil, '']` excluded — if the origin actor object has no avatar, refresh will not populate anything. Including these accounts puts the script in an infinite-retry loop on every run.
|
||||
|
||||
## Bulk restore at scale
|
||||
|
||||
When the breakage is large — a bad prune across the whole instance, or a storage-level deletion (see the next section) — refreshing follows one at a time isn't enough. The generalized procedure:
|
||||
|
||||
1. List the keys that actually exist in storage, so you only touch the broken ones.
|
||||
2. For each account whose current `avatar`/`header` key is **absent**, null the `*_file_name` (the redownload workers skip accounts that still have a file name) and enqueue the worker.
|
||||
3. Let Sidekiq's `pull` queue drain.
|
||||
|
||||
```ruby
require "aws-sdk-s3"
require "set"

c = Aws::S3::Client.new(
  region: ENV["S3_REGION"],
  access_key_id: ENV["AWS_ACCESS_KEY_ID"],
  secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"]
)
b = ENV["S3_BUCKET"]

# Collect every key under a prefix, following pagination.
def keys(c, b, prefix)
  s = Set.new
  t = nil
  loop do
    r = c.list_objects_v2(bucket: b, prefix: prefix, continuation_token: t, max_keys: 1000)
    r.contents.each { |o| s << o.key }
    break unless r.is_truncated
    t = r.next_continuation_token
  end
  s
end

avset = keys(c, b, "cache/accounts/avatars/")
hdset = keys(c, b, "cache/accounts/headers/")

Account.where.not(domain: nil)
       .where("avatar_file_name IS NOT NULL OR header_file_name IS NOT NULL")
       .find_each(batch_size: 1000) do |a|
  # Avatar recorded in the DB but absent from storage: clear the reference, then re-fetch.
  if a.avatar_file_name.present? && a.avatar_remote_url.present? &&
     !avset.include?(a.avatar.path.sub(%r{^/}, ""))
    a.update_column(:avatar_file_name, nil)
    RedownloadAvatarWorker.perform_async(a.id)
  end
  # Same check for the header image.
  if a.header_file_name.present? && a.header_remote_url.present? &&
     !hdset.include?(a.header.path.sub(%r{^/}, ""))
    a.update_column(:header_file_name, nil)
    RedownloadHeaderWorker.perform_async(a.id)
  end
end
```
|
||||
|
||||
Notes:
|
||||
|
||||
- Listing existing keys first means you re-fetch only what's missing, instead of re-downloading every avatar — which would re-bloat a bucket you may have just trimmed.
|
||||
- The workers return early if `*_file_name` is present, which is why you must `update_column(..., nil)` before enqueuing.
|
||||
- Avatars are small (tens of KB each), so re-fetching the whole missing set typically adds a few GB and a few hours of Sidekiq `pull` work. Headers are larger but still modest.
|
||||
- Origins that deleted the avatar after you cached it return 404 — the permanent, irrecoverable tail.
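
Before running the destructive part, it can help to count how many accounts would be touched. The following is a minimal dry-run sketch in plain Ruby, with no Rails or AWS calls. `missing_ids` is a hypothetical helper, not part of Mastodon; it assumes you have already gathered the recorded paths per account and the set of keys that exist in storage.

```ruby
require "set"

# Hypothetical helper (not part of Mastodon): given { account_id => storage path }
# and the Set of keys that exist in the bucket, return the ids whose object is absent.
def missing_ids(paths_by_id, existing_keys)
  paths_by_id.each_with_object([]) do |(id, path), out|
    next if path.nil? || path.empty?   # no file recorded, nothing to restore
    key = path.sub(%r{\A/}, "")        # recorded paths may carry a leading slash
    out << id unless existing_keys.include?(key)
  end
end

existing = Set["cache/accounts/avatars/1/original/a.png"]
paths = {
  1 => "/cache/accounts/avatars/1/original/a.png", # present in storage
  2 => "cache/accounts/avatars/2/original/b.png",  # missing from storage
  3 => nil                                         # never had an avatar
}
p missing_ids(paths, existing) # prints [2]
```

If the count looks far larger than expected, that is a hint to re-check the key prefix before nulling anything.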
|
||||
|
||||
## Broader failure: storage-level deletion without DB de-ref
|
||||
|
||||
`--prune-profiles` is one way avatars vanish, but it at least nulls the database column, so the account re-fetches on its next `Update`. The **more dangerous** variant is deleting objects directly in your storage backend — a manual `aws s3 rm`, an S3 lifecycle expiration rule, a bucket migration that doesn't copy everything, or any "cost cleanup" done outside `tootctl`. Those delete the file but leave `accounts.avatar_file_name` **set**, pointing at an object that no longer exists.
|
||||
|
||||
Why it's worse:
|
||||
|
||||
- The DB still thinks the avatar is present, and the redownload workers skip the account (`*_file_name` is non-null), so it does not self-heal until an `Update` arrives.
|
||||
- It can hit **every** remote account at once, not just quiet ones.
|
||||
- It looks identical to the S3-ACL upload bug — see [Mastodon on S3 — Silent Upload Failures](mastodon-s3-acl-upload-failures.md). Tell them apart by checking whether new uploads succeed (ACL bug) versus only old objects being gone (a one-off deletion).
|
||||
|
||||
Recover with the [bulk restore](#bulk-restore-at-scale) procedure above. **Prevent** it by never deleting Mastodon media at the storage level: prune *attachments* through `tootctl media remove` (which derefs the DB and re-fetches on demand) and leave avatars/headers alone.
|
||||
|
||||
## Why `header_file_name IS NULL` is a bad signal
|
||||
|
||||
A naive script will treat both `avatar_file_name IS NULL` and `header_file_name IS NULL` as "broken." Don't.
|
||||
|
|
@ -143,6 +205,8 @@ bin/tootctl preview_cards remove --days=30 --concurrency=5
|
|||
|
||||
Consider deleting the middle two lines. The attachment prune is the real disk-saver (gigabytes per week on a busy instance). The avatar prune is small (~250 KB per remote account) and damages your UX. The header prune is even smaller and rarely worth it.
|
||||
|
||||
**Stronger recommendation:** after being bitten more than once, the safest policy is to **disable automated profile/header pruning entirely** — and reconsider scheduled `tootctl accounts refresh --all`, which re-fetches every profile and is destructive when uploads are failing at the time. Keep only a deliberate, occasional **attachment** prune if bucket size demands it. Pair that with a synthetic upload monitor (see [Mastodon on S3 — Silent Upload Failures](mastodon-s3-acl-upload-failures.md)) so any future regression is caught in hours instead of by a user weeks later.
|
||||
|
||||
## Edge cases
|
||||
|
||||
- **Origin-side 404:** the actor object advertises an avatar URL, but the URL itself returns 404. Your local cache stays empty no matter how many times you refresh. Only the origin user can fix it (re-upload). The script above will keep retrying these on every run; if that bothers you, add a "tried within last N hours" filter.
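
One possible shape for the "tried within last N hours" filter, as a sketch only: `due_for_retry` and the `last_tried` map are hypothetical, and how the timestamps are persisted (a file, Redis, a DB column) is left open.

```ruby
# Hypothetical cooldown filter: `last_tried` maps account id => Time of the last attempt.
# Returns only the ids that were never tried or whose last attempt is older than the cooldown.
def due_for_retry(ids, last_tried, now: Time.now, cooldown_hours: 24)
  cutoff = now - cooldown_hours * 3600
  ids.select do |id|
    t = last_tried[id]
    t.nil? || t <= cutoff
  end
end

now = Time.at(1_000_000)
tried = { 1 => now - 3600, 2 => now - 48 * 3600 }
p due_for_retry([1, 2, 3], tried, now: now) # prints [2, 3]
```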
|
||||
|
|
|
|||
138
02-selfhosting/services/mastodon-s3-acl-upload-failures.md
Normal file
|
|
@ -0,0 +1,138 @@
|
|||
---
|
||||
title: Mastodon on S3 — Silent Upload Failures When the Bucket Disables ACLs
|
||||
description: Why a BucketOwnerEnforced S3 bucket plus a stale S3_PERMISSION/S3_ACL in .env.production makes every Mastodon media upload fail with AccessControlListNotSupported, how to diagnose it, and how to fix and monitor it.
|
||||
domain: selfhosting
|
||||
category: services
|
||||
tags:
|
||||
- mastodon
|
||||
- fediverse
|
||||
- self-hosting
|
||||
- aws
|
||||
- s3
|
||||
- paperclip
|
||||
- troubleshooting
|
||||
status: published
|
||||
created: 2026-06-01
|
||||
updated: 2026-06-01
|
||||
---
|
||||
|
||||
# Mastodon on S3 — Silent Upload Failures When the Bucket Disables ACLs
|
||||
|
||||
If your Mastodon instance stores media on S3 and you switch the bucket to **Object Ownership = `BucketOwnerEnforced`** (which AWS now recommends, and which the console nudges you toward), every media upload can start failing **silently** unless you also remove the object-ACL setting from `.env.production`. New avatars, headers, and attachments stop appearing; old ones keep working; nothing obvious is logged. This article is the diagnosis and fix.
|
||||
|
||||
## TL;DR
|
||||
|
||||
- `BucketOwnerEnforced` **disables ACLs entirely** on the bucket. Any request that carries an `x-amz-acl` header is rejected with `AccessControlListNotSupported: The bucket does not allow ACLs`.
|
||||
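- Mastodon (via Paperclip) attaches `x-amz-acl` to every upload whenever `S3_PERMISSION` (or `S3_ACL`) resolves to a non-empty value. The common value `S3_PERMISSION=public-read`, or a migration leftover like `S3_PERMISSION=private`, triggers the rejection. Mastodon's configuration documentation lists `public-read` as the default for `S3_PERMISSION`, so deleting the line is not the same as setting it to empty.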
|
||||
- Result: **every new upload fails**, but the database row is still updated, so Mastodon believes it has the file. The object never lands → broken image. Objects written *before* the bucket changed keep serving fine, which masks the problem.
|
||||
- **Fix:** set `S3_PERMISSION=` (empty) and remove any `S3_ACL=` line, then restart `mastodon-web` + `mastodon-sidekiq`. Public read is now served by the **bucket policy**, not per-object ACLs.
|
||||
|
||||
## Symptoms
|
||||
|
||||
- Newly-changed avatars/headers show broken; attachments on new posts fail to display.
|
||||
- Avatars that were cached **before** the bucket setting changed still work — so "some work, some don't."
|
||||
- `tootctl` and the web UI report success; Sidekiq doesn't obviously error.
|
||||
- Direct fetch of a broken object's URL returns **403 AccessDenied** (not 404 — see below).
|
||||
|
||||
## Why a missing object returns 403, not 404
|
||||
|
||||
A typical Mastodon S3 bucket policy grants public `s3:GetObject` but **not** `s3:ListBucket`. Without `ListBucket`, S3 hides whether a key exists: a `GET` on a **missing** key returns **403 AccessDenied**, identical to a permissions denial. So "403" here usually means *the object isn't there*, not *the object is forbidden*. This is why the failure reads like a permissions problem when it's really a failed write.
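
The rule above can be written down as a small triage helper. This is an illustrative sketch, not anything Mastodon or the AWS SDK provides; `interpret_get` and its return symbols are made up for this example.

```ruby
# Hypothetical triage helper: how to read the HTTP status of a GET on a media
# object, depending on whether the caller holds s3:ListBucket on the bucket.
def interpret_get(status, can_list_bucket:)
  case status
  when 200 then :present
  when 404 then :missing
  when 403 then can_list_bucket ? :forbidden : :missing_or_forbidden
  else :unknown
  end
end

p interpret_get(403, can_list_bucket: false) # prints :missing_or_forbidden
p interpret_get(404, can_list_bucket: true)  # prints :missing
```

For anonymous browser requests against a typical public-read policy, `can_list_bucket` is false, so a 403 alone cannot tell you which case you are in. That is why the Diagnosis step checks the object with the instance's own credentials.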
|
||||
|
||||
## Diagnosis
|
||||
|
||||
Run these with the instance's own S3 credentials (e.g. via `bin/rails runner`, which loads `.env.production`):
|
||||
|
||||
```ruby
require "aws-sdk-s3"
c = Aws::S3::Client.new(region: ENV["S3_REGION"],
                        access_key_id: ENV["AWS_ACCESS_KEY_ID"],
                        secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"])
b = ENV["S3_BUCKET"]

# 1. Is the bucket ACL-disabled?
puts c.get_bucket_ownership_controls(bucket: b).ownership_controls.rules.map(&:object_ownership).inspect
# => ["BucketOwnerEnforced"]   <-- ACLs are OFF

# 2. Does an upload WITH an ACL fail, and WITHOUT one succeed?
begin
  c.put_object(bucket: b, key: "tmp/acltest", body: "x", acl: "public-read")
  puts "PUT+acl: OK"
  c.delete_object(bucket: b, key: "tmp/acltest") # clean up if the ACL'd write went through
rescue => e
  puts "PUT+acl FAILS: #{e.class} / #{e.message}"   # AccessControlListNotSupported
end
c.put_object(bucket: b, key: "tmp/noacltest", body: "x")   # succeeds
c.delete_object(bucket: b, key: "tmp/noacltest")

# 3. Confirm a "broken" avatar's object is actually missing
key = Account.find_by(username: "someuser", domain: "remote.tld").avatar.path.sub(%r{^/}, "")
begin
  c.head_object(bucket: b, key: key)
  puts "EXISTS"
rescue Aws::S3::Errors::NotFound, Aws::S3::Errors::Forbidden
  # Forbidden covers credentials without s3:ListBucket, where a missing key reads as 403
  puts "MISSING"
end
```
|
||||
|
||||
If #1 shows `BucketOwnerEnforced` and #2 shows the ACL'd PUT failing while the plain PUT succeeds, you've confirmed it.
|
||||
|
||||
Check `.env.production` for the offending settings:
|
||||
|
||||
```bash
|
||||
grep -E '^S3_(ACL|PERMISSION|NO_INHERIT)' /home/mastodon/live/.env.production
|
||||
# S3_ACL=private <-- remove
|
||||
# S3_PERMISSION=private <-- set empty
|
||||
```
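
If you would rather check this from a script than by eye, the same test can be sketched in plain Ruby. `acl_offenders` is a hypothetical helper written for this article; it only looks at the two variable names discussed here and treats an empty or empty-quoted value as safe.

```ruby
# Hypothetical checker: scan .env.production text and report active settings
# that would make uploads carry an ACL. Commented-out lines are ignored.
def acl_offenders(env_text)
  env_text.each_line.filter_map do |line|
    m = line.strip.match(/\A(S3_PERMISSION|S3_ACL)=(.*)\z/)
    next unless m
    value = m[2].strip.gsub(/\A["']|["']\z/, "") # drop surrounding quotes, if any
    "#{m[1]}=#{value}" unless value.empty?
  end
end

sample = <<~ENV
  S3_ENABLED=true
  S3_ACL=private
  S3_PERMISSION=
  # S3_PERMISSION=public-read
ENV
p acl_offenders(sample) # prints ["S3_ACL=private"]
```

In practice you would pass it `File.read("/home/mastodon/live/.env.production")` and fail a deploy check when the result is non-empty.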
|
||||
|
||||
## The fix
|
||||
|
||||
1. Edit `.env.production`:
|
||||
- `S3_PERMISSION=` (empty — Paperclip then sends no `x-amz-acl` header)
|
||||
- remove/comment any `S3_ACL=` line
|
||||
2. Restart so the env is reloaded: `systemctl restart mastodon-sidekiq mastodon-web`
|
||||
3. Verify the previously-failing write path now works — reprocess any existing avatar and confirm it serves 200:
|
||||
|
||||
```ruby
|
||||
a = Account.local.first
|
||||
a.avatar.reprocess! # used to raise AccessControlListNotSupported; now succeeds
|
||||
```
|
||||
|
||||
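Public readability is now provided by the **bucket policy** (grant `s3:GetObject` on `arn:aws:s3:::your-bucket/*` to `Principal: "*"`). For that policy to take effect, the **Block Public Access** settings that restrict public bucket policies (`BlockPublicPolicy` and `RestrictPublicBuckets`) must be off; the two ACL-related settings no longer matter once ACLs are disabled. You do **not** need per-object ACLs at all.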
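
For reference, a policy of that shape can be generated as follows. This is a sketch: `your-bucket` is the placeholder from the paragraph above, and the `Sid` is an arbitrary label chosen for this example.

```ruby
require "json"

# Builds a public-read bucket policy document as a Ruby hash.
# The Sid is an arbitrary label; only Effect/Principal/Action/Resource matter.
def public_read_policy(bucket)
  {
    "Version" => "2012-10-17",
    "Statement" => [
      {
        "Sid" => "PublicReadGetObject",
        "Effect" => "Allow",
        "Principal" => "*",
        "Action" => "s3:GetObject",
        "Resource" => "arn:aws:s3:::#{bucket}/*"
      }
    ]
  }
end

puts JSON.pretty_generate(public_read_policy("your-bucket"))
```

The resulting JSON can be pasted into the console's bucket policy editor, or passed as the `policy:` string to `put_bucket_policy` on an `Aws::S3::Client`. Note that it grants `s3:GetObject` only, not `s3:ListBucket`, which is what produces the 403-for-missing behaviour described earlier.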
|
||||
|
||||
### Recovering the avatars that broke while it was failing
|
||||
|
||||
Any media that failed to upload during the broken window is gone from S3 while the DB still references it. Because Mastodon's redownload workers **skip accounts whose `*_file_name` is already set**, you must null the dead reference first, then enqueue the worker. See [Mastodon — The `--prune-profiles` Trap and How to Recover](mastodon-prune-profiles-trap.md#bulk-restore-at-scale) for the bulk procedure.
|
||||
|
||||
## Don't let it happen silently again — monitor uploads
|
||||
|
||||
The worst part of this bug is the silence. Add a periodic **synthetic write check** that uploads a tiny object with the app's own credentials, confirms it, deletes it, and alerts on failure:
|
||||
|
||||
```ruby
# c and b are the S3 client and bucket name, built as in the Diagnosis snippet
c.put_object(bucket: b, key: "health/upload-check", body: "ok")   # no acl
c.head_object(bucket: b, key: "health/upload-check")
c.delete_object(bucket: b, key: "health/upload-check")
# any exception -> email an alert
```
|
||||
|
||||
Pair it with an HTTP check that your **local** account avatars all return 200 (they always should). Run both every few hours from cron. A regression then pages you in hours instead of being discovered by a user weeks later.
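
A slightly fuller sketch of the write check, structured so it can be exercised without touching S3. `upload_check` and `FakeS3` are hypothetical names for this example; in production the `client` argument would be an `Aws::S3::Client`, and the alerting hook is left to you.

```ruby
# Sketch of the synthetic write check. `client` is anything that responds to
# put_object, head_object and delete_object with keyword arguments.
# Returns [ok, message] instead of raising, so the caller decides how to alert.
def upload_check(client, bucket, key: "health/upload-check")
  client.put_object(bucket: bucket, key: key, body: "ok") # deliberately no acl:
  client.head_object(bucket: bucket, key: key)
  client.delete_object(bucket: bucket, key: key)
  [true, "upload check passed"]
rescue StandardError => e
  [false, "upload check FAILED: #{e.class}: #{e.message}"]
end

# Stand-in client used only to exercise the function locally.
class FakeS3
  def initialize(fail_put: false)
    @fail_put = fail_put
    @objects = {}
  end

  def put_object(bucket:, key:, body:)
    raise "AccessControlListNotSupported" if @fail_put
    @objects[[bucket, key]] = body
  end

  def head_object(bucket:, key:)
    @objects.fetch([bucket, key])
  end

  def delete_object(bucket:, key:)
    @objects.delete([bucket, key])
  end
end

ok, msg = upload_check(FakeS3.new, "example-bucket")
puts msg
```

Returning a status pair rather than raising keeps the cron wrapper simple: exit non-zero and send the message when `ok` is false.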
|
||||
|
||||
## Ansible enforcement
|
||||
|
||||
If you manage the host with Ansible, enforce the safe values so a future template render can't reintroduce the ACL header:
|
||||
|
||||
```yaml
- name: Ensure S3_PERMISSION is empty (no x-amz-acl on uploads)
  ansible.builtin.lineinfile:
    path: /home/mastodon/live/.env.production
    regexp: '^S3_PERMISSION='
    line: 'S3_PERMISSION='
  notify: Restart Mastodon services

- name: Remove any active S3_ACL line (ACLs unsupported on this bucket)
  ansible.builtin.lineinfile:
    path: /home/mastodon/live/.env.production
    regexp: '^S3_ACL=.+'
    state: absent
  notify: Restart Mastodon services
```
|
||||
|
||||
## Related
|
||||
|
||||
- [Mastodon — The `--prune-profiles` Trap and How to Recover](mastodon-prune-profiles-trap.md) — the other way avatars go missing, plus the bulk-restore script
|
||||
- [Mastodon Post-Install Hardening (Permissions + Account)](mastodon-post-install-hardening.md)
|
||||
- [AWS S3 Cost Management](../cloud/aws-s3-cost-management.md) — pruning attachments to control bucket size (safely)
|
||||
|
|
@ -40,6 +40,7 @@ updated: 2026-05-15T09:00
|
|||
* [Mastodon Instance Tuning](02-selfhosting/services/mastodon-instance-tuning.md)
|
||||
* [Mastodon Post-Install Hardening (Permissions + Account)](02-selfhosting/services/mastodon-post-install-hardening.md)
|
||||
* [Mastodon — The `--prune-profiles` Trap and How to Recover](02-selfhosting/services/mastodon-prune-profiles-trap.md)
|
||||
* [Mastodon on S3 — Silent Upload Failures (BucketOwnerEnforced/ACLs)](02-selfhosting/services/mastodon-s3-acl-upload-failures.md)
|
||||
* [Ghost Email Configuration with Mailgun](02-selfhosting/services/ghost-smtp-mailgun-setup.md)
|
||||
* [Claude Code Remote Control — Mobile Access to a Persistent Host Session](02-selfhosting/services/claude-code-remote-control.md)
|
||||
* [Linux Server Hardening Checklist](02-selfhosting/security/linux-server-hardening-checklist.md)
|
||||
|
|
|
|||