title: LoRA adapter — GGUF conversion fails with 'config.json not found'
domain: troubleshooting
category: gpu-display
tags: lora, qlora, gguf, llama.cpp, unsloth, fine-tuning, qwen
status: published
created: 2026-04-30
updated: 2026-04-30

LoRA adapter — GGUF conversion fails with 'config.json not found'

Problem

After a QLoRA fine-tune, you point llama.cpp/convert_hf_to_gguf.py at the training output directory and it crashes immediately:

FileNotFoundError: [Errno 2] No such file or directory:
  '/path/to/training-runs/<run>/final/config.json'

The output directory looks fine — it contains:

adapter_config.json
adapter_model.safetensors  (~150 MB for a 7B base)
chat_template.jinja
tokenizer_config.json
tokenizer.json

But there is no config.json, and adapter_model.safetensors is only ~150 MB, far smaller than the ~14 GB you'd expect for a full Qwen2.5-7B 16-bit checkpoint.
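A quick back-of-the-envelope check shows how large the gap is. This is a minimal sketch assuming a nominal 7 billion parameters and 2 bytes per parameter for 16-bit weights; the function name is illustrative and not part of any library.

```python
def checkpoint_size_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of a dense checkpoint in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


full_gb = checkpoint_size_gb(7e9)  # assumption: nominal 7B parameters at 16-bit
adapter_gb = 0.15                  # the ~150 MB adapter file from the listing above

print(f"full checkpoint: ~{full_gb:.0f} GB")
print(f"adapter share of a full checkpoint: {adapter_gb / full_gb:.1%}")
```

A weights file that is roughly one percent of the expected size is a strong hint that the directory holds an adapter rather than a model.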

Root cause

After a LoRA/QLoRA training run, model.save_pretrained() saves only the adapter weights, not a merged full-precision model. convert_hf_to_gguf.py expects a full Hugging Face model directory and reads config.json to identify the architecture. Adapter-only directories don't have one.

You need to merge the LoRA adapter into the base model first, then point the GGUF converter at the merged dir.

Solution

Quick fix — inline merge step

Insert this block between training completion and convert_hf_to_gguf.py:

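from unsloth import FastLanguageModel

adapter = "/path/to/training-runs/<run>/final"   # adapter-only training output
merged  = "/path/to/training-runs/<run>/merged"  # destination for the merged model

# Load the adapter on top of its base model.
model, tok = FastLanguageModel.from_pretrained(
    model_name=adapter,
    max_seq_length=2048,
    load_in_4bit=True,
)
# Fold the LoRA weights into the base model and write a full 16-bit checkpoint.
model.save_pretrained_merged(merged, tok, save_method="merged_16bit")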

Then run the GGUF converter against the merged dir, not the adapter dir:

python3 llama.cpp/convert_hf_to_gguf.py /path/to/training-runs/<run>/merged \
  --outfile model-f16.gguf --outtype f16

The merged dir will contain config.json, model-00001-of-00004.safetensors (multiple shards totaling the full base model size), generation_config.json, etc.

Cleaner fix — use a wrapper

If you do this often, encapsulate it:

  1. Wrapper Python script accepts --adapter, --output, --skip-merge, --all-quants
  2. Step 1: load adapter via FastLanguageModel.from_pretrained(), call save_pretrained_merged()
  3. Step 2: subprocess convert_hf_to_gguf.py on the merged dir
  4. Step 3: subprocess llama-quantize for each requested quant

This is what ~/corpus/scripts/convert_gguf.py does on MajorRig (rewritten 2026-04-09 for the MajorTwin v7b cycle).
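The outline above could be sketched roughly as follows. This is not the actual convert_gguf.py; it is a minimal illustration under stated assumptions: the quant list behind --all-quants, the default of a single Q4_K_M quant, the output file names, and the `merged` subdirectory are all hypothetical choices. The merge reuses the same Unsloth calls as the quick fix, and llama-quantize is assumed to be on PATH.

```python
import argparse
import subprocess
from pathlib import Path

QUANTS = ["Q4_K_M", "Q5_K_M", "Q8_0"]  # assumption: what --all-quants expands to


def build_commands(merged: Path, output: Path, quants: list[str]) -> list[list[str]]:
    """Return the converter and quantizer command lines in run order."""
    f16 = output / "model-f16.gguf"
    cmds = [[
        "python3", "llama.cpp/convert_hf_to_gguf.py", str(merged),
        "--outfile", str(f16), "--outtype", "f16",
    ]]
    for q in quants:
        # llama-quantize takes: input file, output file, quant type
        cmds.append(["llama-quantize", str(f16), str(output / f"model-{q}.gguf"), q])
    return cmds


def merge_adapter(adapter: Path, merged: Path) -> None:
    """Merge the LoRA adapter into its base model (same calls as the quick fix)."""
    from unsloth import FastLanguageModel  # lazy import: only needed when merging

    model, tok = FastLanguageModel.from_pretrained(
        model_name=str(adapter), max_seq_length=2048, load_in_4bit=True
    )
    model.save_pretrained_merged(str(merged), tok, save_method="merged_16bit")


def main(argv=None) -> None:
    p = argparse.ArgumentParser(description="Merge a LoRA adapter, then build GGUF files")
    p.add_argument("--adapter", type=Path, required=True)
    p.add_argument("--output", type=Path, required=True)
    p.add_argument("--skip-merge", action="store_true",
                   help="reuse an existing <output>/merged directory")
    p.add_argument("--all-quants", action="store_true")
    args = p.parse_args(argv)

    merged = args.output / "merged"  # assumption: merged model lives under --output
    if not args.skip_merge:
        merge_adapter(args.adapter, merged)
    quants = QUANTS if args.all_quants else QUANTS[:1]
    for cmd in build_commands(merged, args.output, quants):
        subprocess.run(cmd, check=True)  # stop at the first failing step
```

Calling `main()` from a script entry point runs the whole chain. Keeping the command construction in `build_commands` means the merge-then-convert order can be inspected without a GPU or a llama.cpp checkout.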

Why this trips people up

  • Unsloth and PEFT both save adapter-only by default after trainer.save_model() or model.save_pretrained(). There's no warning that downstream tools expect a merged model.
  • The training output looks complete — there's a tokenizer.json, a chat_template.jinja, and a non-trivial .safetensors. It feels like a checkpoint.
  • If a pipeline calls convert_gguf.py (which includes the merge) and someone later reimplements Step 4 inline, bypassing the wrapper, the merge step is silently lost. This is what happened in MajorTwin v8c (Apr 30, 2026); see majortwin-v8b-plan#Pipeline Bug + Fix (2026-04-30).

Verification checklist

After training, before running the GGUF converter, verify the directory you're pointing at:

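| File | Adapter-only dir | Merged dir |
|---|---|---|
| adapter_config.json | yes | no |
| adapter_model.safetensors | yes (~150 MB / 7B) | no |
| config.json | no | yes |
| model-*.safetensors (sharded) | no | yes (~14 GB / 7B) |
| generation_config.json | no | yes |
| tokenizer.json | yes | yes |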

If you see only the left column, you need to merge before converting.
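The checklist can be automated as a preflight check before the converter runs. This is a minimal sketch, assuming the presence of config.json and adapter_config.json is enough to tell the two layouts apart; `classify_model_dir` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def classify_model_dir(path: Path) -> str:
    """Classify a directory as 'merged', 'adapter-only', or 'unknown'."""
    if (path / "config.json").is_file():
        return "merged"        # has the file convert_hf_to_gguf.py reads first
    if (path / "adapter_config.json").is_file():
        return "adapter-only"  # merge before converting
    return "unknown"


# Demo on a throwaway directory that mimics an adapter-only training output.
with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp)
    (run / "adapter_config.json").write_text("{}")
    (run / "tokenizer.json").write_text("{}")
    print(classify_model_dir(run))
```

In a pipeline, failing fast when the result is not "merged" turns the converter's FileNotFoundError into an explicit "merge first" message.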

Resuming a failed pipeline without re-training

The adapter is small and self-contained. If your pipeline crashes at the GGUF step, you do NOT need to retrain — the LoRA adapter at <run>/final/ is intact. Write a resume wrapper that runs only:

  1. Merge (save_pretrained_merged)
  2. F16 conversion (convert_hf_to_gguf.py)
  3. Quantization (llama-quantize)
  4. Deploy

This saves the cost of however many GPU-hours the training took. See ~/corpus/scripts/resume_v8c_step4.sh on MajorRig for an example.
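One way to structure such a resume wrapper is to decide which steps remain from the artifacts already on disk. The sketch below is not resume_v8c_step4.sh; it assumes a hypothetical layout in which the merged model lives in <run>/merged and the GGUF files sit directly in the run directory, and it always leaves the deploy step to the caller.

```python
import tempfile
from pathlib import Path


def remaining_steps(run_dir: Path, quant: str = "Q4_K_M") -> list[str]:
    """List the post-training steps still to run, based on files that exist."""
    if not (run_dir / "final" / "adapter_config.json").is_file():
        raise FileNotFoundError(f"no adapter in {run_dir / 'final'}; nothing to resume from")
    steps = []
    if not (run_dir / "merged" / "config.json").is_file():
        steps.append("merge")
    if not (run_dir / "model-f16.gguf").is_file():      # assumed F16 output name
        steps.append("convert")
    if not (run_dir / f"model-{quant}.gguf").is_file():  # assumed quant output name
        steps.append("quantize")
    steps.append("deploy")  # not detectable from local files, so always listed
    return steps


# Demo: a run that finished training but crashed before the merge.
with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp)
    (run / "final").mkdir()
    (run / "final" / "adapter_config.json").write_text("{}")
    print(remaining_steps(run))
```

Because the check starts from the adapter, a run whose training completed can always be resumed, and a run with no adapter fails loudly instead of producing an empty GGUF pipeline.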

Maintenance

  • 2026-04-30 — Created after MajorTwin v8c pipeline failed Step 4. Root-caused, patched, resumed.