Files
MajorWiki/05-troubleshooting/gpu-display/qwen-14b-oom-3080ti.md
MajorLinux 6592eb4fea wiki: audit fixes — broken links, wikilinks, frontmatter, stale content (66 files)
- Fixed 4 broken markdown links (bad relative paths in See Also sections)
- Corrected n8n port binding to 127.0.0.1:5678 (matches actual deployment)
- Updated SnapRAID article with actual majorhome paths (/majorRAID, disk1-3)
- Converted 67 Obsidian wikilinks to relative markdown links or plain text
- Added YAML frontmatter to 35 articles missing it entirely
- Completed frontmatter on 8 articles with missing fields

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 11:16:29 -04:00

2.1 KiB

title, domain, category, tags, status, created, updated
title domain category tags status created updated
Qwen2.5-14B OOM on RTX 3080 Ti (12GB) troubleshooting gpu-display
gpu
vram
oom
qwen
cuda
fine-tuning
published 2026-04-02 2026-04-02

Qwen2.5-14B OOM on RTX 3080 Ti (12GB)

Problem

When attempting to run or fine-tune Qwen2.5-14B on an NVIDIA RTX 3080 Ti with 12GB of VRAM, the process fails with an Out of Memory (OOM) error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate X GiB (GPU 0; 12.00 GiB total capacity; Y GiB already allocated; Z GiB free; ...)

The 12GB VRAM limit is hit during the initial model load or immediately upon starting the first training step.

Root Causes

  1. Model Size: A 14B parameter model in FP16/BF16 requires ~28GB of VRAM just for the weights.
  2. Context Length: High context lengths (e.g., 4096+) significantly increase VRAM usage during training.
  3. Training Overhead: Even with QLoRA (4-bit quantization), the overhead of gradients, optimizer states, and activations can exceed 12GB for a 14B model.

Solutions

For a 12GB GPU, a 7B parameter model (like Qwen2.5-7B-Instruct) is the sweet spot. It provides excellent performance while leaving enough VRAM for high context lengths and larger batch sizes.

  • VRAM Usage (7B QLoRA): ~6-8GB
  • Pros: Stable, fast, supports long context.
  • Cons: Slightly lower reasoning capability than 14B.

2. Aggressive Quantization

If you MUST run 14B, use 4-bit quantization (GGUF or EXL2) for inference only. Training 14B on 12GB is not reliably possible even with extreme offloading.

# Example Ollama run (uses 4-bit quantization by default)
ollama run qwen2.5:14b

3. Training Optimizations (if attempting 14B)

If you have no choice but to try 14B training:

  • Set max_seq_length to 512 or 1024.
  • Use Unsloth (it is highly memory-efficient).
  • Enable gradient_checkpointing.
  • Set per_device_train_batch_size = 1.

Maintenance

Keep your NVIDIA drivers and CUDA toolkit updated. On Windows (MajorRig), ensure WSL2 has sufficient memory allocation in .wslconfig.