title: WSL2: PyTorch Training Deadlocks on Windows Filesystem Checkpoint Saves
domain: troubleshooting
category: wsl2
tags: wsl2, pytorch, huggingface, training, llm, checkpoint, windows, ntfs, deadlock, majortwin
status: published
created: 2026-05-23
updated: 2026-05-23

WSL2: PyTorch Training Deadlocks on Windows Filesystem Checkpoint Saves

Problem

A Hugging Face Trainer / Unsloth fine-tuning run starts successfully, logs training steps for a while, then freezes completely. The tqdm progress bar stops advancing, GPU utilization drops to near-zero, but the training process stays alive at 100% CPU with the full model loaded in VRAM. No new checkpoint directories appear.

Confirming it's a checkpoint deadlock:

# Check if training is frozen — same step count + elapsed time across checks
tmux capture-pane -t <session> -p | tail -5
sleep 60
tmux capture-pane -t <session> -p | tail -5

# GPU idle despite process alive
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader

# No new checkpoint directories written
ls -lt /mnt/d/your/training/output/ | head -10

If the tqdm step count is identical both times and the newest directory timestamp is from a previous run, the save is deadlocked.
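
The manual comparison above can be scripted. The following is a minimal sketch, not part of the original workflow: `last_step` and `is_frozen` are hypothetical helper names, and the pattern assumes the captured text contains a tqdm-style `step/total` counter such as `120/1000`. Other `N/M` strings in the pane (dates, for instance) would confuse it.

```shell
# Hypothetical helper: print the last "step/total" counter found in the text.
last_step() {
    printf '%s\n' "$1" | grep -oE '[0-9]+/[0-9]+' | tail -n 1
}

# Hypothetical helper: succeed (exit 0) when both snapshots show the same,
# non-empty step counter, i.e. training made no progress between captures.
is_frozen() {
    frozen_before=$(last_step "$1")
    frozen_after=$(last_step "$2")
    [ -n "$frozen_before" ] && [ "$frozen_before" = "$frozen_after" ]
}

# Usage sketch (session name is a placeholder, as in the commands above):
#   a=$(tmux capture-pane -t <session> -p | tail -5)
#   sleep 60
#   b=$(tmux capture-pane -t <session> -p | tail -5)
#   is_frozen "$a" "$b" && echo "no progress in 60s"
```

A single matching pair is only a hint: a slow step or a long evaluation pass also leaves the counter unchanged, so it is worth combining this with the GPU and directory checks.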


Root Cause

WSL2's /mnt/d/ paths go through the virtio-9p filesystem driver to reach the host Windows NTFS volume. Large sequential writes — like saving a multi-GB PyTorch checkpoint (optimizer states, model weights, scheduler, RNG state) — can deadlock when:

  • A Windows process (antivirus, VSS, Windows Search) holds a lock on the output directory
  • The Windows virtual disk hits write pressure from concurrent activity

The Linux process blocks in a kernel write() syscall waiting for virtio-9p to acknowledge the write. The process is alive and spinning at 100% CPU in the kernel, but no userspace progress occurs. This is distinct from OOM kills (which are recorded in the kernel log) and out-of-disk errors (which fail the write with an explicit error and end the run).
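
One way to look at where the process is sitting is to read its entry under /proc. This is a rough sketch assuming a Linux /proc filesystem; `parse_state` and `proc_report` are hypothetical names. It only reports what the kernel exposes, and how the reported state maps onto this particular hang is an assumption rather than something the original diagnosis recorded.

```shell
# Hypothetical helper: read a /proc/<pid>/status document on stdin and
# print the one-letter process state (R = running, S = sleeping,
# D = uninterruptible sleep, and so on).
parse_state() {
    awk '/^State:/ { print $2; exit }'
}

# Hypothetical wrapper: print the state and kernel wait channel for a PID.
# /proc/<pid>/wchan may read "0" when the process is not waiting in the kernel.
proc_report() {
    printf 'state=%s wchan=%s\n' \
        "$(parse_state < "/proc/$1/status")" \
        "$(cat "/proc/$1/wchan" 2>/dev/null)"
}

# Usage sketch:
#   proc_report "$(pgrep -f 'train_v3.py' | head -n 1)"
```

A state that stays at R or D across repeated checks while the step counter is frozen is consistent with a process stuck inside the kernel rather than in Python code.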


Fix: Train on Linux-Native Storage

Keep all training I/O on Linux ext4 (~/), and copy final artifacts to Windows only after training completes.

Change output paths

# Before
TRAIN_OUT="/mnt/d/corpus/training-runs/v8i"
GGUF_OUT="/mnt/d/corpus/models"

# After — Linux-native for training
TRAIN_OUT="/home/majorlinux/corpus/training-runs/v8i"
GGUF_OUT="/home/majorlinux/corpus/models"

The WSL2 home directory lives on an ext4 filesystem inside a .vhdx virtual disk managed by WSL2, so writes here bypass virtio-9p entirely.
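
To keep a Windows path from slipping back into the launch script, a guard can run before training starts. The sketch below uses hypothetical function names and a deliberately simple assumption: any path of the form /mnt/<single letter>/... is treated as a Windows drive mount. It does not resolve symlinks or custom mount points.

```shell
# Hypothetical helper: succeed when the path looks like a WSL2 Windows
# drive mount, i.e. /mnt/<one letter> or anything beneath it.
on_windows_mount() {
    case "$1" in
        /mnt/[a-zA-Z] | /mnt/[a-zA-Z]/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical guard: fail with a message if the output path is Windows-backed.
require_linux_native() {
    if on_windows_mount "$1"; then
        echo "refusing to train into $1: Windows-mounted path" >&2
        return 1
    fi
}

# Usage sketch near the top of a launch script:
#   require_linux_native "$TRAIN_OUT" || exit 1
#   require_linux_native "$GGUF_OUT"  || exit 1
```

For a check that follows the real mount table instead of the path string, `findmnt -no FSTYPE -T "$TRAIN_OUT"` prints the filesystem type backing a path; on a typical WSL2 setup that should read `ext4` for the home directory and `9p` for a Windows drive.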

Copy to Windows after training finishes

cp "$GGUF_OUT/majortwin-v8i-q4-k-m.gguf" "/mnt/d/corpus/models/"
cp "$GGUF_OUT/majortwin-v8i-q4-k-m.gguf" "/mnt/d/MajorTwin/06-Models/"

Single large-file copies to /mnt/d/ complete reliably — it's repeated checkpoint saves during training that deadlock.
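
Since the Windows copy is now the only one on that side, it can be worth comparing it against the source after the copy. A minimal sketch, with `copy_verified` as a hypothetical helper name; it reads both files back in full with `cmp`, so it adds one extra read pass over the artifact.

```shell
# Hypothetical helper: copy a file into a directory, then confirm the
# destination is byte-identical to the source.
copy_verified() {
    cv_src=$1
    cv_dest="$2/$(basename "$1")"
    cp "$cv_src" "$cv_dest" || return 1
    cmp -s "$cv_src" "$cv_dest"
}

# Usage sketch with the paths from this article:
#   copy_verified "$GGUF_OUT/majortwin-v8i-q4-k-m.gguf" "/mnt/d/corpus/models" \
#       || echo "copy failed or differs from source" >&2
```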

Kill a stuck training process

kill $(pgrep -f 'train_v3.py')
sleep 2
tmux kill-session -t majortwin_v8i
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader
# Should show low utilization and <1GB memory used

The original checkpoint files from the previous run in /mnt/d/ are untouched — the deadlock prevents writes, it does not corrupt existing data.
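
If a plain `kill` does not end the process, the same steps can be wrapped in a helper that waits and then escalates. This is a sketch only: `stop_pid` is a hypothetical name and the grace period is an arbitrary default. A process blocked in an uninterruptible kernel wait may ignore even SIGKILL until the pending I/O returns, in which case running `wsl --shutdown` from the Windows side is the remaining option.

```shell
# Hypothetical helper: send SIGTERM, wait up to <grace> seconds for the
# process to disappear, then send SIGKILL. Succeeds if the PID is gone.
stop_pid() {
    sp_pid=$1
    sp_grace=${2:-10}
    kill -TERM "$sp_pid" 2>/dev/null || return 0   # already gone
    sp_i=0
    while [ "$sp_i" -lt "$sp_grace" ]; do
        kill -0 "$sp_pid" 2>/dev/null || return 0  # exited after SIGTERM
        sleep 1
        sp_i=$((sp_i + 1))
    done
    kill -KILL "$sp_pid" 2>/dev/null
    sleep 1
    ! kill -0 "$sp_pid" 2>/dev/null
}

# Usage sketch:
#   stop_pid "$(pgrep -f 'train_v3.py' | head -n 1)" 10
```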


Why Previous Runs May Have Worked

The deadlock is not guaranteed. It depends on Windows-side state at checkpoint save time. Factors:

  • Antivirus scanning newly created checkpoint files
  • Windows Search indexing the output directory
  • VSS snapshot in progress
  • Concurrent Windows desktop I/O

A run on a quiet machine may succeed; the same run during normal desktop use may deadlock.


Confirming the Fix

# Watch for checkpoint directories appearing at each save_steps interval
watch -n 30 'ls -lt ~/corpus/training-runs/v8i/ | head -8'

# GPU should be active (85-99%) during training steps
nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader
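
Instead of reading the `ls` output by eye, the newest checkpoint can be picked out by step number. The sketch below assumes the Hugging Face Trainer's `checkpoint-<step>` directory naming; `latest_checkpoint` is a hypothetical helper name.

```shell
# Hypothetical helper: print the highest-numbered checkpoint-<step>
# directory name inside the given run directory, or nothing if none exist.
latest_checkpoint() {
    (
        cd "$1" 2>/dev/null || exit 1
        ls -d checkpoint-* 2>/dev/null | sort -t- -k2,2n | tail -n 1
    )
}

# Usage sketch: run once, wait one save_steps interval, run again.
#   latest_checkpoint ~/corpus/training-runs/v8i
```

If the printed name advances between two calls spaced roughly one save interval apart, saves are completing.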

Notes

  • Setting save_strategy="no" in TrainingArguments eliminates checkpoint saves entirely — useful as a diagnostic to confirm this is the cause, at the cost of no crash recovery.
  • torch.compile() / torch._inductor can add hours of CPU-bound kernel compilation before the first training step. A long startup followed by a later freeze can make a session look permanently stuck, when these are actually two separate issues.
  • This applies to any large sequential WSL2→Windows write, not just PyTorch — large rsync or tar to /mnt/<drive>/ can also stall.