majorwiki/05-troubleshooting/wsl2-pytorch-checkpoint-windows-filesystem-deadlock.md
majorlinux 52ca8a0413 wiki: batch update — 4 new articles + 4 updates
2026-05-25 13:55:10 -04:00

---
title: "WSL2: PyTorch Training Deadlocks on Windows Filesystem Checkpoint Saves"
domain: troubleshooting
category: wsl2
tags: [wsl2, pytorch, huggingface, training, llm, checkpoint, windows, ntfs, deadlock, majortwin]
status: published
created: 2026-05-23
updated: 2026-05-23
---
# WSL2: PyTorch Training Deadlocks on Windows Filesystem Checkpoint Saves
## Problem
A Hugging Face Trainer / Unsloth fine-tuning run starts successfully, logs training steps for a while, then freezes completely. The tqdm progress bar stops advancing, GPU utilization drops to near-zero, but the training process stays alive at 100% CPU with the full model loaded in VRAM. No new checkpoint directories appear.
**Confirming it's a checkpoint deadlock:**
```bash
# Check if training is frozen — same step count + elapsed time across checks
tmux capture-pane -t <session> -p | tail -5
sleep 60
tmux capture-pane -t <session> -p | tail -5
# GPU idle despite process alive
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader
# No new checkpoint directories written
ls -lt /mnt/d/your/training/output/ | head -10
```
If the tqdm step count is identical in both captures and the newest directory timestamp is from a previous run, the checkpoint save is most likely deadlocked.
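The directory check can also be scripted. The following is a minimal sketch, not part of the original workflow: `newest_entry_age` and `looks_stalled` are hypothetical helper names, the 1800-second default is an arbitrary assumption that should be set above your `save_steps` interval, and the code relies on GNU `find` and `date`.

```bash
#!/usr/bin/env bash
# Print the age in seconds of the most recently modified entry
# directly inside a directory. Fails if the directory has no entries.
newest_entry_age() {
    local dir="$1" newest now
    # %T@ prints the modification time as epoch seconds with a fraction (GNU find)
    newest="$(find "$dir" -mindepth 1 -maxdepth 1 -printf '%T@\n' 2>/dev/null | sort -n | tail -n 1)"
    [ -n "$newest" ] || return 1
    now="$(date +%s)"
    echo $(( now - ${newest%.*} ))
}

# Succeed if the newest entry is older than the threshold in seconds.
# The default of 1800 is an assumption, not a recommended value.
# Also returns non-zero when the age cannot be determined.
looks_stalled() {
    local age
    age="$(newest_entry_age "$1")" || return 1
    [ "$age" -gt "${2:-1800}" ]
}
```

For example, `looks_stalled /mnt/d/your/training/output 3600 && echo "no new checkpoint for over an hour"`. A stale directory is only a hint; combine it with the tmux and `nvidia-smi` checks before concluding the run is frozen.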
---
## Root Cause
WSL2's `/mnt/d/` paths go through the **virtio-9p filesystem driver** to reach the host Windows NTFS volume. Large sequential writes — like saving a multi-GB PyTorch checkpoint (optimizer states, model weights, scheduler, RNG state) — can deadlock when:
- A Windows process (antivirus, VSS, Windows Search) holds a lock on the output directory
- The Windows virtual disk hits write pressure from concurrent activity
The Linux process blocks in a kernel `write()` syscall waiting for virtio-9p to acknowledge the write. The process is alive and spinning at 100% CPU in the kernel, but no userspace progress occurs. This is distinct from OOM kills (which log clearly) and out-of-disk errors (which exit cleanly).
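To see whether an output path is exposed to this, check which filesystem backs it. The sketch below is illustrative only: the helper names are hypothetical, the list of filesystem types treated as Windows-backed (`9p`, `drvfs`, `virtiofs`) is an assumption that may vary by WSL version, and `df --output` requires GNU coreutils.

```bash
#!/usr/bin/env bash
# Print the filesystem type backing a path (GNU df).
fs_type_of() {
    df --output=fstype -- "$1" 2>/dev/null | tail -n 1
}

# Succeed if the filesystem type suggests a Windows-backed WSL2 mount.
# Assumption: these are the types WSL2 reports for /mnt/<drive>/ mounts.
is_windows_backed() {
    case "$1" in
        9p|drvfs|virtiofs) return 0 ;;
        *) return 1 ;;
    esac
}

# Warn (and return non-zero) if a path sits on a Windows-backed mount.
check_output_path() {
    local fstype
    fstype="$(fs_type_of "$1")"
    if is_windows_backed "$fstype"; then
        echo "WARNING: $1 is on $fstype (Windows-backed); checkpoint saves may stall" >&2
        return 1
    fi
    echo "OK: $1 is on ${fstype:-unknown}"
}
```

Calling `check_output_path "$TRAIN_OUT"` at the top of a launch script is one way to catch a Windows-backed output directory before any GPU time is spent.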
---
## Fix: Train on Linux-Native Storage
Keep all training I/O on Linux ext4 (`~/`), and copy final artifacts to Windows only after training completes.
### Change output paths
```bash
# Before
TRAIN_OUT="/mnt/d/corpus/training-runs/v8i"
GGUF_OUT="/mnt/d/corpus/models"
# After — Linux-native for training
TRAIN_OUT="/home/majorlinux/corpus/training-runs/v8i"
GGUF_OUT="/home/majorlinux/corpus/models"
```
The WSL2 home directory lives on a Linux ext4 `.vhdx` managed by WSL2 — writes here bypass virtio-9p entirely.
### Copy to Windows after training finishes
```bash
cp "$GGUF_OUT/majortwin-v8i-q4-k-m.gguf" "/mnt/d/corpus/models/"
cp "$GGUF_OUT/majortwin-v8i-q4-k-m.gguf" "/mnt/d/MajorTwin/06-Models/"
```
Single large-file copies to `/mnt/d/` complete reliably — it's repeated checkpoint saves during training that deadlock.
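If you want some assurance that the copy landed intact, a checksum comparison is one option. This is a sketch under assumptions: `copy_verified` is a hypothetical helper, it needs `sha256sum` from GNU coreutils, and reading the destination back immediately may be served from cache rather than from the Windows disk.

```bash
#!/usr/bin/env bash
# Copy a file into a directory, then compare SHA-256 digests of
# the source and the copy. Returns non-zero on any failure.
copy_verified() {
    local src="$1" dest_dir="$2" dest src_sum dest_sum
    dest="$dest_dir/$(basename -- "$src")"
    cp -- "$src" "$dest" || return 1
    # Reading via stdin keeps the file name out of the digest line
    src_sum="$(sha256sum < "$src")" || return 1
    dest_sum="$(sha256sum < "$dest")" || return 1
    if [ "$src_sum" != "$dest_sum" ]; then
        echo "checksum mismatch: $dest" >&2
        return 1
    fi
    echo "verified: $dest"
}
```

Usage would look like `copy_verified "$GGUF_OUT/majortwin-v8i-q4-k-m.gguf" "/mnt/d/corpus/models"`.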
### Kill a stuck training process
```bash
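# Ask the trainer to exit; pkill -f matches against the full command line
pkill -f 'train_v3.py'
sleep 2
# Close the tmux session that hosted the run
tmux kill-session -t majortwin_v8i
# Should show low utilization and <1GB memory used
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader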
```
The checkpoint files from the previous run in `/mnt/d/` are untouched: the deadlock prevents new writes but does not corrupt existing data.
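If the trainer does not exit after the first signal, a common pattern is to wait briefly and then escalate to `SIGKILL`. The function below is a hedged sketch: `stop_process` is a hypothetical helper and the 10-second default timeout is an arbitrary assumption. A process in uninterruptible sleep acts on signals only once its blocked syscall returns, so even `SIGKILL` may not take effect immediately.

```bash
#!/usr/bin/env bash
# Send SIGTERM to a PID, wait up to a timeout (seconds), then send SIGKILL.
stop_process() {
    local pid="$1" timeout="${2:-10}" waited=0
    # If the signal cannot be delivered, the process is already gone
    kill -TERM "$pid" 2>/dev/null || return 0
    # kill -0 only tests whether the process still exists
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            kill -KILL "$pid" 2>/dev/null
            break
        fi
        sleep 1
        waited=$(( waited + 1 ))
    done
}
```

For example, `stop_process "$(pgrep -f 'train_v3.py' | head -n 1)" 15` targets the first matching process and waits up to 15 seconds before escalating.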
---
## Why Previous Runs May Have Worked
The deadlock is not guaranteed. It depends on Windows-side state at the moment a checkpoint is saved. Contributing factors include:
- Antivirus scanning newly created checkpoint files
- Windows Search indexing the output directory
- VSS snapshot in progress
- Concurrent Windows desktop I/O
A run on a quiet machine may succeed; the same run during normal desktop use may deadlock.
---
## Confirming the Fix
```bash
# Watch for checkpoint directories appearing at each save_steps interval
watch -n 30 'ls -lt ~/corpus/training-runs/v8i/ | head -8'
# GPU should be active (85-99%) during training steps
nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader
```
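Rather than watching the listing by eye, you can read off the highest saved step. The sketch below assumes the Hugging Face Trainer naming of `checkpoint-<step>` directories; `latest_checkpoint_step` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Print the highest step number among checkpoint-<step> directories.
# Returns non-zero if no such directory exists.
latest_checkpoint_step() {
    local dir="$1" d step latest=-1
    for d in "$dir"/checkpoint-*/; do
        [ -d "$d" ] || continue          # skip the unexpanded glob
        step="${d%/}"                    # drop the trailing slash
        step="${step##*checkpoint-}"     # keep only the step suffix
        case "$step" in
            ''|*[!0-9]*) continue ;;     # ignore non-numeric suffixes
        esac
        if [ "$step" -gt "$latest" ]; then
            latest="$step"
        fi
    done
    [ "$latest" -ge 0 ] || return 1
    echo "$latest"
}
```

Running `latest_checkpoint_step ~/corpus/training-runs/v8i` twice, one save interval apart, should print an increasing number if saves are completing.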
---
## Notes
- Setting `save_strategy="no"` in TrainingArguments eliminates checkpoint saves entirely — useful as a diagnostic to confirm this is the cause, at the cost of no crash recovery.
- `torch.compile()` / `torch._inductor` can add hours of CPU-bound kernel compilation before the first training step. Long startup + eventual freeze together can make a session look permanently stuck when they're actually two separate issues.
- This applies to any large sequential WSL2→Windows write, not just PyTorch — large `rsync` or `tar` to `/mnt/<drive>/` can also stall.
---
## Related
- [[wsl2-rebuild-fedora43-training-env]] — Full WSL2 training environment setup
- [[wsl2-backup-powershell]] — Backing up WSL2 virtual disks from PowerShell
- [[ansible-wsl2-world-writable-mount-ignores-cfg]] — Other WSL2 filesystem quirks