| title | domain | category | tags | status | created | updated |
|---|---|---|---|---|---|---|
| LoRA adapter: GGUF conversion fails with 'config.json not found' | troubleshooting | gpu-display | | published | 2026-04-30 | 2026-04-30 |
LoRA adapter — GGUF conversion fails with 'config.json not found'
Problem
After a QLoRA fine-tune, you point llama.cpp/convert_hf_to_gguf.py at the training output directory and it crashes immediately:
FileNotFoundError: [Errno 2] No such file or directory:
'/path/to/training-runs/<run>/final/config.json'
The output directory looks fine — it contains:
adapter_config.json
adapter_model.safetensors (~150 MB for a 7B base)
chat_template.jinja
tokenizer_config.json
tokenizer.json
There is no config.json, though, and adapter_model.safetensors is only ~150 MB, far smaller than the ~14 GB you'd expect for a full Qwen2.5-7B 16-bit checkpoint.
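The size gap is easy to sanity-check with arithmetic. The sketch below assumes roughly 7.6 billion parameters for a Qwen2.5-7B-class model (a headline figure, not something measured in this pipeline) and 2 bytes per parameter for 16-bit weights; the helper name is made up for illustration.

```python
def checkpoint_size_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate on-disk size of dense model weights, in GiB."""
    return n_params * bytes_per_param / 2**30


# Assumption: ~7.6e9 parameters for a Qwen2.5-7B-class base model.
full_16bit = checkpoint_size_gib(7.6e9)
print(f"expected full 16-bit checkpoint: ~{full_16bit:.1f} GiB")
```

A directory whose only weights file is around 150 MB is about two orders of magnitude short of that, which is a strong hint that it holds adapter weights only.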
Root cause
model.save_pretrained() after a LoRA/QLoRA train saves only the adapter weights, not a merged full-precision model. convert_hf_to_gguf.py expects a full HuggingFace model directory — it reads config.json to identify the architecture. Adapter-only directories don't have one.
You need to merge the LoRA adapter into the base model first, then point the GGUF converter at the merged dir.
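The adapter directory does record which base model it belongs to: PEFT-style adapters store it under the base_model_name_or_path key in adapter_config.json. A minimal sketch for reading it follows; the function name is hypothetical and the code assumes the standard PEFT layout.

```python
import json
from pathlib import Path
from typing import Optional


def base_model_of(adapter_dir: str) -> Optional[str]:
    """Return the base model recorded in adapter_config.json, if present."""
    cfg_path = Path(adapter_dir) / "adapter_config.json"
    if not cfg_path.is_file():
        return None  # not an adapter directory (or a non-PEFT layout)
    cfg = json.loads(cfg_path.read_text())
    return cfg.get("base_model_name_or_path")
```

If this returns a model name, the directory is an adapter that still needs its base model before it can be converted.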
Solution
Quick fix — inline merge step
Insert this block between training completion and convert_hf_to_gguf.py:
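from unsloth import FastLanguageModel

# Adapter-only directory written by training, and the target for the merged model
adapter = "/path/to/training-runs/<run>/final"
merged = "/path/to/training-runs/<run>/merged"

# Loading the adapter dir pulls in the base model named in adapter_config.json
model, tok = FastLanguageModel.from_pretrained(
    model_name=adapter,
    max_seq_length=2048,
    load_in_4bit=True,
)

# Merge the LoRA weights into the base and write a full 16-bit HF checkpoint
model.save_pretrained_merged(merged, tok, save_method="merged_16bit")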
Then run the GGUF converter against the merged dir, not the adapter dir:
python3 llama.cpp/convert_hf_to_gguf.py /path/to/training-runs/<run>/merged \
    --outfile model-f16.gguf --outtype f16
The merged dir will contain config.json, model-00001-of-00004.safetensors (multiple shards totaling the full base model size), generation_config.json, etc.
Cleaner fix — use a wrapper
If you do this often, encapsulate it:
- Wrapper Python script accepts --adapter, --output, --skip-merge, --all-quants
- Step 1: load adapter via FastLanguageModel.from_pretrained(), call save_pretrained_merged()
- Step 2: subprocess convert_hf_to_gguf.py on the merged dir
- Step 3: subprocess llama-quantize for each requested quant
This is what ~/corpus/scripts/convert_gguf.py does on MajorRig (rewritten 2026-04-09 for the MajorTwin v7b cycle).
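As a rough illustration of that shape (not the actual convert_gguf.py, which is not reproduced here), the sketch below parses the four flags and builds the Step 2 and Step 3 command lines. The output file names, the llama.cpp location, and the quant list used for --all-quants are assumptions; the merge in Step 1 would be the inline snippet from the quick fix.

```python
import argparse
from pathlib import Path

# Assumption: an example quant set for --all-quants; adjust to taste.
ALL_QUANTS = ["Q4_K_M", "Q5_K_M", "Q8_0"]


def parse_args(argv=None):
    p = argparse.ArgumentParser(description="Merge a LoRA adapter and convert to GGUF")
    p.add_argument("--adapter", required=True, help="adapter-only training output dir")
    p.add_argument("--output", required=True, help="dir for merged model and GGUF files")
    p.add_argument("--skip-merge", action="store_true", help="reuse an existing merged dir")
    p.add_argument("--all-quants", action="store_true", help="produce every quant in ALL_QUANTS")
    return p.parse_args(argv)


def build_commands(merged_dir, out_dir, quants, llama_cpp="llama.cpp"):
    """Return the converter and quantizer invocations as argv lists."""
    out = Path(out_dir)
    f16 = out / "model-f16.gguf"
    # Step 2: full-precision GGUF from the merged HF directory
    cmds = [[
        "python3", f"{llama_cpp}/convert_hf_to_gguf.py", str(merged_dir),
        "--outfile", str(f16), "--outtype", "f16",
    ]]
    # Step 3: one llama-quantize call per requested quant type
    for q in quants:
        cmds.append(["llama-quantize", str(f16), str(out / f"model-{q}.gguf"), q])
    return cmds
```

Each returned list can be passed to subprocess.run(cmd, check=True) so that a failing step stops the pipeline instead of being silently skipped.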
Why this trips people up
- Unsloth and PEFT both save adapter-only by default after trainer.save_model() or model.save_pretrained(). There's no warning that downstream tools expect a merged model.
- The training output looks complete: there's a tokenizer.json, a chat_template.jinja, and a non-trivial .safetensors. It feels like a checkpoint.
- If a pipeline originally ran convert_gguf.py (which merges) and someone later reimplements Step 4 inline, skipping the wrapper, the merge step is silently lost. This is what happened in MajorTwin v8c (Apr 30, 2026); see majortwin-v8b-plan#Pipeline Bug + Fix (2026-04-30).
Verification checklist
After training, before running the GGUF converter, verify the directory you're pointing at:
| File | Adapter-only dir | Merged dir |
|---|---|---|
| adapter_config.json | ✅ | ❌ |
| adapter_model.safetensors | ✅ (~150 MB / 7B) | ❌ |
| config.json | ❌ | ✅ |
| model-*.safetensors (sharded) | ❌ | ✅ (~14 GB / 7B) |
| generation_config.json | ❌ | ✅ |
| tokenizer.json | ✅ | ✅ |
If the directory matches the Adapter-only column (adapter files present, no config.json), you need to merge before converting.
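The checklist can also be automated as a pre-flight check before calling the converter. This is a minimal sketch that looks only at file names, under the assumption that the directory follows the layouts in the table; the function name and return labels are made up for illustration.

```python
from pathlib import Path


def classify_model_dir(path: str) -> str:
    """Classify a directory as 'merged', 'adapter-only', or 'unknown'."""
    d = Path(path)
    has_config = (d / "config.json").is_file()
    has_adapter = (d / "adapter_config.json").is_file()
    # Matches model.safetensors and sharded model-0000N-of-0000M.safetensors,
    # but not adapter_model.safetensors.
    has_weights = any(d.glob("model*.safetensors"))
    if has_config and has_weights:
        return "merged"
    if has_adapter and not has_config:
        return "adapter-only"
    return "unknown"
```

A pipeline step could refuse to run convert_hf_to_gguf.py unless this returns "merged", which turns the late FileNotFoundError into an early, explicit failure.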
Resuming a failed pipeline without re-training
The adapter is small and self-contained. If your pipeline crashes at the GGUF step, you do NOT need to retrain — the LoRA adapter at <run>/final/ is intact. Write a resume wrapper that runs only:
- Merge (save_pretrained_merged)
- F16 conversion (convert_hf_to_gguf.py)
- Quantization (llama-quantize)
- Deploy
This saves the cost of however many GPU-hours the training took. See ~/corpus/scripts/resume_v8c_step4.sh on MajorRig for an example.
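One way to structure such a resume step is to decide what is left from the files already on disk. The sketch below is not the contents of resume_v8c_step4.sh; the layout it assumes (final/ and merged/ under the run directory, GGUF files named model-f16.gguf and model-<quant>.gguf next to them) and the default quant are hypothetical.

```python
from pathlib import Path


def remaining_steps(run_dir, quants=("Q4_K_M",)):
    """List the post-training steps whose outputs are not on disk yet."""
    run = Path(run_dir)
    # Without the adapter there is nothing to resume from.
    if not (run / "final" / "adapter_config.json").is_file():
        raise FileNotFoundError(f"no adapter found under {run / 'final'}")
    steps = []
    if not (run / "merged" / "config.json").is_file():
        steps.append("merge")
    if not (run / "model-f16.gguf").is_file():
        steps.append("convert-f16")
    for q in quants:
        if not (run / f"model-{q}.gguf").is_file():
            steps.append(f"quantize-{q}")
    steps.append("deploy")  # always re-run; it leaves no marker file in this sketch
    return steps
```

Because the check is based on file existence alone, a half-written output from a crashed step would be treated as complete; deleting partial files before resuming avoids that.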
Related
- qwen-14b-oom-3080ti — base model size choice on a 12GB GPU
- majortwin-v8b-plan — v8c pipeline architecture and resume
Maintenance
- 2026-04-30 — Created after MajorTwin v8c pipeline failed Step 4. Root-caused, patched, resumed.