- Fixed 4 broken markdown links (bad relative paths in See Also sections) - Corrected n8n port binding to 127.0.0.1:5678 (matches actual deployment) - Updated SnapRAID article with actual majorhome paths (/majorRAID, disk1-3) - Converted 67 Obsidian wikilinks to relative markdown links or plain text - Added YAML frontmatter to 35 articles missing it entirely - Completed frontmatter on 8 articles with missing fields Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
64 lines
2.1 KiB
Markdown
64 lines
2.1 KiB
Markdown
---
|
|
title: "Qwen2.5-14B OOM on RTX 3080 Ti (12GB)"
|
|
domain: troubleshooting
|
|
category: gpu-display
|
|
tags: [gpu, vram, oom, qwen, cuda, fine-tuning]
|
|
status: published
|
|
created: 2026-04-02
|
|
updated: 2026-04-02
|
|
---
|
|
# Qwen2.5-14B OOM on RTX 3080 Ti (12GB)
|
|
|
|
## Problem
|
|
|
|
When attempting to run or fine-tune **Qwen2.5-14B** on an NVIDIA RTX 3080 Ti with 12GB of VRAM, the process fails with an Out of Memory (OOM) error:
|
|
|
|
```
|
|
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate X GiB (GPU 0; 12.00 GiB total capacity; Y GiB already allocated; Z GiB free; ...)
|
|
```
|
|
|
|
The 12GB VRAM limit is hit during the initial model load or immediately upon starting the first training step.
|
|
|
|
## Root Causes
|
|
|
|
1. **Model Size:** A 14B parameter model in FP16/BF16 requires ~28GB of VRAM just for the weights.
|
|
2. **Context Length:** High context lengths (e.g., 4096+) significantly increase VRAM usage during training.
|
|
3. **Training Overhead:** Even with QLoRA (4-bit quantization), the overhead of gradients, optimizer states, and activations can exceed 12GB for a 14B model.
|
|
|
|
---
|
|
|
|
## Solutions
|
|
|
|
### 1. Pivot to a 7B Model (Recommended)
|
|
|
|
For a 12GB GPU, a 7B parameter model (like **Qwen2.5-7B-Instruct**) is the sweet spot. It provides excellent performance while leaving enough VRAM for high context lengths and larger batch sizes.
|
|
|
|
- **VRAM Usage (7B QLoRA):** ~6-8GB
|
|
- **Pros:** Stable, fast, supports long context.
|
|
- **Cons:** Slightly lower reasoning capability than 14B.
|
|
|
|
### 2. Aggressive Quantization
|
|
|
|
If you MUST run 14B, use 4-bit quantization (GGUF or EXL2) for inference only. Training 14B on 12GB is not reliably possible even with extreme offloading.
|
|
|
|
```bash
|
|
# Example Ollama run (uses 4-bit quantization by default)
|
|
ollama run qwen2.5:14b
|
|
```
|
|
|
|
### 3. Training Optimizations (if attempting 14B)
|
|
|
|
If you have no choice but to try 14B training:
|
|
- Set `max_seq_length` to 512 or 1024.
|
|
- Use `Unsloth` (it is highly memory-efficient).
|
|
- Enable `gradient_checkpointing`.
|
|
- Set `per_device_train_batch_size = 1`.
|
|
|
|
---
|
|
|
|
## Maintenance
|
|
|
|
Keep your NVIDIA drivers and CUDA toolkit updated. On Windows (MajorRig), ensure WSL2 has sufficient memory allocation in `.wslconfig`.
|
|
|
|
---
|