---
title: "WSL2: PyTorch Training Deadlocks on Windows Filesystem Checkpoint Saves"
domain: troubleshooting
category: wsl2
tags: [wsl2, pytorch, huggingface, training, llm, checkpoint, windows, ntfs, deadlock, majortwin]
status: published
created: 2026-05-23
updated: 2026-05-23
---

# WSL2: PyTorch Training Deadlocks on Windows Filesystem Checkpoint Saves

## Problem

A Hugging Face Trainer / Unsloth fine-tuning run starts successfully, logs training steps for a while, then freezes completely. The tqdm progress bar stops advancing and GPU utilization drops to near zero, but the training process stays alive at 100% CPU with the full model loaded in VRAM. No new checkpoint directories appear.

**Confirming it's a checkpoint deadlock:**

```bash
# Check whether training is frozen: same step count and elapsed time across both checks
# (replace <session> with the tmux session name)
tmux capture-pane -t <session> -p | tail -5
sleep 60
tmux capture-pane -t <session> -p | tail -5

# GPU idle despite process alive
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader

# No new checkpoint directories written
ls -lt /mnt/d/your/training/output/ | head -10
```

If the tqdm step count is identical both times and the newest directory timestamp is from a previous run, the save is deadlocked.

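The "newest directory timestamp" check can also be scripted so it does not depend on reading `ls` output by eye. The following is a minimal sketch, not part of the original workflow: the helper names `newest_entry_age` and `is_stale` are hypothetical, and it assumes GNU `find` (for `-printf '%T@'`). A sensible threshold depends on how long one `save_steps` interval takes in your run, for example `is_stale /mnt/d/your/training/output 1800`.

```bash
# newest_entry_age DIR: print the age, in seconds, of the most recently
# modified entry directly inside DIR. Fails if DIR is empty or missing.
newest_entry_age() {
    local dir="$1" newest now
    # GNU find: %T@ prints the modification time as seconds since the epoch
    newest=$(find "$dir" -mindepth 1 -maxdepth 1 -printf '%T@\n' 2>/dev/null | sort -n | tail -n 1)
    [ -n "$newest" ] || return 1
    now=$(date +%s)
    echo $(( now - ${newest%.*} ))   # drop the fractional part before subtracting
}

# is_stale DIR MAX_AGE: succeed when nothing in DIR changed in the last MAX_AGE
# seconds. Returns 2 when the age cannot be determined (empty or missing DIR).
is_stale() {
    local age
    age=$(newest_entry_age "$1") || return 2
    [ "$age" -gt "$2" ]
}
```
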
---

## Root Cause

WSL2's `/mnt/d/` paths go through the **virtio-9p filesystem driver** to reach the host Windows NTFS volume. Large sequential writes, such as saving a multi-GB PyTorch checkpoint (optimizer states, model weights, scheduler, RNG state), can deadlock when:

- A Windows process (antivirus, VSS, Windows Search) holds a lock on the output directory
- The Windows virtual disk hits write pressure from concurrent activity

The Linux process blocks in a kernel `write()` syscall waiting for virtio-9p to acknowledge the write. The process is alive and spinning at 100% CPU in the kernel, but no userspace progress occurs. This is distinct from OOM kills (which are logged clearly) and out-of-disk errors (which exit cleanly).

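Two quick checks can support this diagnosis: which filesystem backs the output path, and what state the kernel reports for the training process. This is a sketch under assumptions: `fstype_of` and `proc_state` are hypothetical helper names, `df --output` needs GNU coreutils, and the expected values (`9p` for `/mnt/d`, `ext4` for the home directory) describe a typical WSL2 setup rather than a guarantee. A state that stays `R` or `D` while the step counter does not move would be consistent with the behavior described above.

```bash
# fstype_of PATH: print the filesystem type backing PATH (GNU df).
# Example: fstype_of /mnt/d   versus   fstype_of "$HOME"
fstype_of() {
    df --output=fstype "$1" | tail -n 1
}

# proc_state PID: print the one-letter process state from /proc/PID/status
# (R = running, S = sleeping, D = uninterruptible wait, usually I/O).
# Example: proc_state "$(pgrep -f 'train_v3\.py' | head -n 1)"
proc_state() {
    awk '/^State:/ {print $2}' "/proc/$1/status"
}
```
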
---

## Fix: Train on Linux-Native Storage

Keep all training I/O on Linux ext4 (`~/`), and copy final artifacts to Windows only after training completes.

### Change output paths

```bash
# Before: outputs on the Windows drive
TRAIN_OUT="/mnt/d/corpus/training-runs/v9"
GGUF_OUT="/mnt/d/corpus/models"

# After: Linux-native for training
TRAIN_OUT="/home/majorlinux/corpus/training-runs/v8i"
GGUF_OUT="/home/majorlinux/corpus/models"
```

The WSL2 home directory lives on a Linux ext4 `.vhdx` managed by WSL2, so writes here bypass virtio-9p entirely.

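To keep a later edit from quietly pointing the output back at a Windows drive, a launch script could refuse to start when an output path sits under `/mnt/<letter>/`. This is an illustrative sketch: `is_windows_mount` and `require_linux_native` are hypothetical names, the check is a plain string match (it does not resolve symlinks or relative paths), and it assumes the default WSL automount root of `/mnt`.

```bash
# is_windows_mount PATH: succeed if PATH is under a WSL drive mount (/mnt/<letter>)
is_windows_mount() {
    case "$1" in
        /mnt/[a-zA-Z]|/mnt/[a-zA-Z]/*) return 0 ;;
        *) return 1 ;;
    esac
}

# require_linux_native NAME PATH: report an error and fail for Windows-backed paths
require_linux_native() {
    if is_windows_mount "$2"; then
        echo "error: $1=$2 is on a Windows mount; use a path under \$HOME instead" >&2
        return 1
    fi
}
```

A training wrapper might call `require_linux_native TRAIN_OUT "$TRAIN_OUT" || exit 1` before starting the run.
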
### Copy to Windows after training finishes

```bash
cp "$GGUF_OUT/majortwin-v8i-q4-k-m.gguf" "/mnt/d/corpus/models/"
cp "$GGUF_OUT/majortwin-v8i-q4-k-m.gguf" "/mnt/d/MajorTwin/06-Models/"
```

Single large-file copies to `/mnt/d/` complete reliably; it is the repeated checkpoint saves during training that deadlock.

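If you want some assurance that the copy on the Windows side is complete, the digests of source and destination can be compared after the copy. This is a minimal sketch with a hypothetical helper name, `copy_verified`. It catches truncated or failed copies, but the read-back may be served from cache, so it is not proof of what is on disk.

```bash
# copy_verified SRC DEST_DIR: copy SRC into DEST_DIR, then compare SHA-256 digests
copy_verified() {
    local src="$1" dest_dir="$2" dest src_sum dest_sum
    dest="$dest_dir/$(basename "$src")"
    cp -- "$src" "$dest" || return 1
    # Hash via stdin so the output does not include the file name
    src_sum=$(sha256sum < "$src") || return 1
    dest_sum=$(sha256sum < "$dest") || return 1
    if [ "$src_sum" != "$dest_sum" ]; then
        echo "checksum mismatch: $dest" >&2
        return 1
    fi
}
```

For example: `copy_verified "$GGUF_OUT/majortwin-v8i-q4-k-m.gguf" "/mnt/d/corpus/models"`.
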
### Kill a stuck training process

```bash
# Stop the training script (matched against its command line), then close its tmux session
pkill -f 'train_v3\.py'
sleep 2
tmux kill-session -t majortwin_v8i
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader
# Should show low utilization and <1GB memory used
```

The original checkpoint files from the previous run in `/mnt/d/` are untouched: the deadlock prevents new writes but does not corrupt existing data.

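A fixed `sleep 2` may not be long enough for a process holding a large model to exit. One option is to send `SIGTERM`, poll until the process is gone, and escalate to `SIGKILL` after a timeout. This is a sketch only: `stop_process` is a hypothetical helper, and a process stuck in an uninterruptible kernel wait may ignore even `SIGKILL` until the pending I/O returns, in which case restarting WSL from Windows with `wsl --shutdown` may be the remaining option.

```bash
# stop_process PID [TIMEOUT]: send SIGTERM, wait up to TIMEOUT seconds
# (default 10) for the process to exit, then send SIGKILL.
stop_process() {
    local pid="$1" timeout="${2:-10}" waited=0
    kill -TERM "$pid" 2>/dev/null || return 0   # process is already gone
    while kill -0 "$pid" 2>/dev/null; do        # kill -0 only tests for existence
        if [ "$waited" -ge "$timeout" ]; then
            kill -KILL "$pid" 2>/dev/null
            break
        fi
        sleep 1
        waited=$(( waited + 1 ))
    done
}

# Example: stop_process "$(pgrep -f 'train_v3\.py' | head -n 1)" 30
```
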
---

## Why Previous Runs May Have Worked

The deadlock is not guaranteed. It depends on Windows-side state at checkpoint save time. Contributing factors:

- Antivirus scanning newly created checkpoint files
- Windows Search indexing the output directory
- VSS snapshot in progress
- Concurrent Windows desktop I/O

A run on a quiet machine may succeed; the same run during normal desktop use may deadlock.

---

## Confirming the Fix

```bash
# Watch for checkpoint directories appearing at each save_steps interval
watch -n 30 'ls -lt ~/corpus/training-runs/v8i/ | head -8'

# GPU should be active (85–99%) during training steps
nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader
```

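Instead of watching the listing, a script can report the highest checkpoint step written so far and compare it with the step shown in the progress bar. This sketch relies on the Hugging Face Trainer convention of naming checkpoint directories `checkpoint-<step>`; the helper name `latest_checkpoint_step` is hypothetical.

```bash
# latest_checkpoint_step DIR: print the highest N among DIR/checkpoint-N
# directories. Fails if no such directory exists.
latest_checkpoint_step() {
    local dir="$1" path step latest=-1
    for path in "$dir"/checkpoint-*/; do
        [ -d "$path" ] || continue          # the glob matched nothing
        step="${path%/}"                    # drop the trailing slash
        step="${step##*checkpoint-}"        # keep only the part after "checkpoint-"
        case "$step" in
            ''|*[!0-9]*) continue ;;        # skip names that are not a plain number
        esac
        if [ "$step" -gt "$latest" ]; then
            latest="$step"
        fi
    done
    [ "$latest" -ge 0 ] || return 1
    echo "$latest"
}

# Example: latest_checkpoint_step ~/corpus/training-runs/v8i
```
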
---

## Notes

- Setting `save_strategy="no"` in `TrainingArguments` eliminates checkpoint saves entirely. This is useful as a diagnostic to confirm the cause, at the cost of no crash recovery.
- `torch.compile()` / `torch._inductor` can add hours of CPU-bound kernel compilation before the first training step. A long startup followed by a freeze can make a session look permanently stuck, even though the two are separate issues.
- This applies to any large sequential WSL2→Windows write, not just PyTorch: a large `rsync` or `tar` to `/mnt/<drive>/` can also stall.

---

## Related

- [[wsl2-rebuild-fedora43-training-env]]: Full WSL2 training environment setup
- [[wsl2-backup-powershell]]: Backing up WSL2 virtual disks from PowerShell
- [[ansible-wsl2-world-writable-mount-ignores-cfg]]: Other WSL2 filesystem quirks