Optimizing GPU Utilization: A Combinatorial Optimization Problem
In deep learning, GPUs are not always “fully loaded”. Even with powerful graphics cards (such as the RTX 5070 Ti, A100, etc.), it is common to see GPU utilization rates as low as 20–40%. This low utilization is usually not due to the model being too small, but to data flow bottlenecks in the training pipeline between “CPU ↔ GPU ↔ Memory ↔ Disk”.
To keep the GPU continuously computing with minimal waiting, a dynamic balance must be achieved across multiple dimensions such as memory usage, computational density, data loading speed, and parallelism. This constitutes a typical constrained combinatorial optimization problem.
Mechanism Determining GPU Utilization
GPU utilization can be approximately represented as:
$$U_{\text{GPU}} \approx \frac{T_{\text{compute}}}{T_{\text{compute}} + T_{\text{wait}}}$$
Where:
- $T_{\text{compute}}$ is the time for the GPU to perform forward and backward propagation;
- $T_{\text{wait}}$ is the time spent waiting for the CPU to prepare data, data transfer, and synchronization.
The goal is to minimize $T_{\text{wait}}$, which means keeping the GPU continuously “busy”. However, $T_{\text{wait}}$ is determined by multiple factors:
| Stage | Main Parameters | Impact |
|---|---|---|
| Data Loading (CPU Side) | `num_workers`, `prefetch_factor`, `pin_memory` | Affects data reading and prefetching speed |
| Memory Transfer (CPU→GPU) | PCIe bandwidth, `pin_memory` | Affects transfer latency |
| Computation Stage (GPU Side) | `batch_size`, `hidden_dim`, `sequence_length`, `mixed_precision` | Affects computation time per iteration |
| Optimizer and Gradient Update | `gradient_accumulation_steps`, `optimizer_state` | Determines memory and bandwidth usage |
| System Resource Limitations | Operating System (Windows/Linux), number of CPU cores, RAM size | Determines the number of parallel processes |
These factors are interrelated, and any adjustment of a single parameter may affect the performance balance of other modules. For example:
- Increasing `batch_size` can enhance parallelism but may lead to out-of-memory errors;
- Increasing `num_workers` can alleviate data waiting but will increase memory usage;
- Enabling `mixed_precision` can save memory, but numerical stability must be ensured.
This makes it impossible to optimize GPU utilization by tuning a single variable; the interactions between parameters have to be considered together.
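To make the compute/wait split from this section concrete, the sketch below times how long a loop waits for each batch versus how long the step itself runs, then reports compute time as a share of total time. The names `gpu_utilization`, `profile_loop`, and `slow_batches` are hypothetical helpers introduced for illustration, and `time.sleep` stands in for real loading and training work. With a real GPU, CUDA kernels launch asynchronously, so `step_fn` would need to end with `torch.cuda.synchronize()` for the timings to mean anything.

```python
import time


def gpu_utilization(t_compute, t_wait):
    """Approximate utilization as compute time over total step time."""
    total = t_compute + t_wait
    return t_compute / total if total > 0 else 0.0


def profile_loop(batches, step_fn):
    """Split wall-clock time into waiting-for-data and running-the-step.

    batches: any iterable yielding batches (for example a DataLoader).
    step_fn: callable that runs one training step on a batch.
    """
    t_wait = t_compute = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)      # time spent here is waiting for data
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)            # time spent here is the training step
        t2 = time.perf_counter()
        t_wait += t1 - t0
        t_compute += t2 - t1
    return t_compute, t_wait


def slow_batches(n, delay):
    """Hypothetical data source: each batch takes `delay` seconds to load."""
    for i in range(n):
        time.sleep(delay)
        yield i


t_c, t_w = profile_loop(slow_batches(5, 0.02), lambda batch: time.sleep(0.01))
print(f"compute={t_c:.3f}s wait={t_w:.3f}s "
      f"utilization={gpu_utilization(t_c, t_w):.0%}")
```

A measurement like this only says where the time goes; which parameter to change next still depends on the interactions described above.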
Modeling Approach from a Combinatorial Optimization Perspective
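We can abstract GPU performance tuning as the following optimization problem:
$$\max_{\theta} \; \frac{T_{\text{compute}}(\theta)}{T_{\text{compute}}(\theta) + T_{\text{wait}}(\theta)} \quad \text{s.t.} \quad M_{\text{GPU}}(\theta) \le M_{\text{GPU}}^{\max}, \;\; M_{\text{RAM}}(\theta) \le M_{\text{RAM}}^{\max}$$
Where:
- $\theta$ is a set of decision variables (e.g., batch_size, num_workers, prefetch_factor, etc.);
- $T_{\text{compute}}(\theta)$ is the computation time as a function of $\theta$;
- $T_{\text{wait}}(\theta)$ is the data waiting time as a function of $\theta$;
- $M_{\text{GPU}}(\theta)$ is GPU memory consumption, bounded by the available GPU memory $M_{\text{GPU}}^{\max}$;
- $M_{\text{RAM}}(\theta)$ is system memory consumption, bounded by the available RAM $M_{\text{RAM}}^{\max}$.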
This problem has the following characteristics:
- Non-linearity: Terms like `hidden_dim²` and `batch × seq` make compute and memory cost grow non-linearly in the parameters;
- Multi-constraint: Memory, storage, and thread counts are all limited;
- Discreteness: Parameters like batch_size and num_workers must be integers;
- Interactivity: There are multiplicative relationships between variables;
- Empirical: Actual bottlenecks depend on the operating system and driver implementations.
Therefore, this problem cannot be solved using simple gradient descent or linear programming, but requires methods such as heuristic search or Bayesian optimization.
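As a minimal sketch of the heuristic-search idea, the example below runs a random search over a small discrete configuration space and keeps the feasible configuration with the highest share of step time spent computing. Everything numeric here is an assumption made for illustration: the cost formulas, the constants, and the 16 GB / 32 GB limits are invented, and in practice each cost function would be replaced by timing a few real training steps.

```python
import random

# Hypothetical resource limits and search space (illustrative values only).
GPU_MEM_LIMIT_GB = 16.0
RAM_LIMIT_GB = 32.0
SEARCH_SPACE = {
    "batch_size": [16, 32, 64, 128, 256],
    "num_workers": [0, 2, 4, 8, 16],
    "prefetch_factor": [2, 4, 8],
}


def t_compute(cfg):
    # Assumed model: fixed overhead plus a term linear in batch size (s/step).
    return 0.005 + 0.0004 * cfg["batch_size"]


def t_wait(cfg):
    # Assumed model: loading cost is split across workers. With workers,
    # loading overlaps with compute, so only the excess counts as waiting.
    workers = cfg["num_workers"]
    load = 0.002 * cfg["batch_size"] / max(1, workers)
    if workers == 0:
        return load
    return max(0.0, load - t_compute(cfg))


def gpu_mem_gb(cfg):
    return 2.0 + 0.08 * cfg["batch_size"]


def ram_gb(cfg):
    # Assumed model: each worker keeps prefetch_factor batches in flight.
    in_flight = cfg["num_workers"] * cfg["prefetch_factor"]
    return 4.0 + 0.01 * cfg["batch_size"] * in_flight


def feasible(cfg):
    return gpu_mem_gb(cfg) <= GPU_MEM_LIMIT_GB and ram_gb(cfg) <= RAM_LIMIT_GB


def utilization(cfg):
    tc, tw = t_compute(cfg), t_wait(cfg)
    return tc / (tc + tw)


def random_search(n_trials=200, seed=0):
    """Sample configurations at random and keep the best feasible one."""
    rng = random.Random(seed)
    best_cfg, best_u = None, -1.0
    for _ in range(n_trials):
        cfg = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        if not feasible(cfg):
            continue  # violates a memory constraint
        u = utilization(cfg)
        if u > best_u:
            best_cfg, best_u = cfg, u
    return best_cfg, best_u


best_cfg, best_u = random_search()
print(best_cfg, round(best_u, 3))
```

Random search is only the simplest member of the heuristic family; a Bayesian optimizer would reuse the same `feasible` and `utilization` functions but choose the next configuration from the results seen so far.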
The Root of Data Flow Bottlenecks: Asynchronous CPU–GPU Systems
The computational speed of GPUs far exceeds that of CPUs. A typical training loop in PyTorch looks like this:
for x, y in dataloader:
    x, y = x.to(device), y.to(device)  # Copy from CPU to GPU
    optimizer.zero_grad()              # Reset gradients from the previous step
    y_pred = model(x)                  # Forward propagation on GPU
    loss = criterion(y_pred, y)        # Compute loss on GPU
    loss.backward()                    # Backward propagation on GPU
    optimizer.step()                   # Update parameters
Bottlenecks often occur at:
- Slow DataLoader loading (I/O or insufficient CPU preprocessing);
- GPU waiting for data (batches not yet transferred to GPU memory);
- Frequent CPU↔GPU synchronization (blocking calls);
- Insufficient system memory (prefetching failure).
Whenever the next batch is not yet in place, the GPU sits idle and has to wait for the data stream. This means:
★
The upper limit of GPU utilization depends not only on the computing power of the graphics card but also on the efficiency of CPU data preparation.
Therefore, optimization should start from the perspective of maximizing “GPU busy time” to reduce idle time.
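The loader-side parameters named in this article can be collected in one place, as in the sketch below. `loader_kwargs` is a hypothetical helper, not part of PyTorch; it only builds a dictionary of arguments for `torch.utils.data.DataLoader`. It leaves out `prefetch_factor` and `persistent_workers` when `num_workers` is 0, because PyTorch rejects those options without worker processes. The default values are assumptions to be tuned, not recommendations.

```python
def loader_kwargs(num_workers, prefetch_factor=2, use_cuda=True):
    """Build keyword arguments for torch.utils.data.DataLoader."""
    kwargs = {
        "num_workers": num_workers,
        # Pinned (page-locked) host memory allows asynchronous copies to the GPU.
        "pin_memory": use_cuda,
    }
    if num_workers > 0:
        # Both options are only valid when worker processes exist.
        kwargs["prefetch_factor"] = prefetch_factor
        kwargs["persistent_workers"] = True
    return kwargs


print(loader_kwargs(8, prefetch_factor=4))

# Intended use (requires PyTorch; `dataset` and `device` are placeholders):
#   loader = DataLoader(dataset, batch_size=64, shuffle=True, **loader_kwargs(8))
#   for x, y in loader:
#       x = x.to(device, non_blocking=True)  # async copy from pinned memory
#       y = y.to(device, non_blocking=True)
```

`non_blocking=True` only helps when the source tensor is in pinned memory, which is why it is paired with `pin_memory=True` here.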
Optimization Goal: Keeping the GPU Always “Busy”
In an ideal state:
- The CPU and DataLoader continuously push data to the GPU;
- Batch computation on the GPU overlaps with data loading on the CPU;
- The training process has no waiting or memory bottlenecks.
At this point, GPU utilization approaches 100%, but the following conditions must be met:
- Memory Fully Occupied;
- Compute Dense;
- Async Stream for Data Transfer;
- Balanced Pipeline between CPU and GPU.
These four conditions are constrained by different hyperparameters, leading to a high-dimensional combinatorial space that requires trade-offs. Thus, this type of optimization problem is structurally similar to problems like “optimal portfolio allocation” and “risk budget optimization”:
★
We aim to find a combination of parameters that maximizes the overall throughput of the system under limited resources.
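The overlap between loading and computing described in this section can be illustrated without a GPU. In the sketch below, a background thread fills a small queue while the main loop consumes it; `time.sleep` stands in for both disk I/O and the forward/backward pass, and all function names are hypothetical. It is a toy model of what DataLoader worker processes do, not a replacement for them.

```python
import queue
import threading
import time


def prefetch(iterable, buffer_size=2):
    """Yield items while a background thread loads the next ones."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for item in iterable:
            q.put(item)       # blocks when the buffer is full
        q.put(sentinel)       # signals the end of the data

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item


def load_batches(n, delay):
    for i in range(n):
        time.sleep(delay)     # stands in for disk I/O and preprocessing
        yield i


def run(batches, step_delay):
    """Consume all batches, sleeping step_delay per batch as fake compute."""
    start = time.perf_counter()
    seen = []
    for batch in batches:
        time.sleep(step_delay)
        seen.append(batch)
    return seen, time.perf_counter() - start


serial_seen, serial_t = run(load_batches(10, 0.02), 0.02)
overlap_seen, overlap_t = run(prefetch(load_batches(10, 0.02)), 0.02)
print(f"serial: {serial_t:.2f}s, overlapped: {overlap_t:.2f}s")
```

When loading and compute take about the same time per batch, the overlapped run should take roughly half as long as the serial one, because each stage hides the other's latency.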
Linux GPU Acceleration under WSL2: From Principles to Practice
Why WSL2 is Needed
Windows uses the spawn model to start worker processes:
# Windows (spawn)
Each worker starts independently → data must be pickled → workers ≤ 8
Whereas Linux uses the fork model:
# Linux (fork)
Parent process memory is copied → no need to pickle large data → workers can reach 16–32
In practical training, this difference means:
- GPU utilization on Windows is often **20–30%**;
- On WSL2/Linux, it can be increased to **60–80%**;
- And the “pickle errors” and “DataLoader hangs” caused by spawn largely disappear.
| Issue | Performance in Windows | Solution in WSL2 |
|---|---|---|
| DataLoader Multi-process Bottleneck | spawn mode, slow pickle | fork mode, fast memory copy |
| Low GPU Utilization | 20–30% | Increased to 60–80% |
| Insufficient Memory Utilization | Often idle >70% | Continuously maintain high occupancy |
| Complex CUDA Environment Configuration | Complicated driver + Toolkit installation | One pip command completes |
| File Isolation | System files do not intercommunicate | Fully shared |
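Which start methods a given environment offers can be checked directly from Python's standard library. The sketch below uses only `multiprocessing` and `sys`; `describe_start_methods` is a hypothetical helper name. One caveat: the default on Linux depends on the Python version (Python 3.14 moved it from fork to forkserver), so it is worth checking rather than assuming.

```python
import multiprocessing as mp
import sys


def describe_start_methods():
    """Report which process start methods this platform offers."""
    return {
        "platform": sys.platform,
        "available": mp.get_all_start_methods(),
        # Note: calling get_start_method() fixes the default for this process.
        "default": mp.get_start_method(),
    }


info = describe_start_methods()
print(info)
```

On Windows the available list contains only spawn, while Linux (including WSL2) also offers fork and forkserver. PyTorch's DataLoader accepts a `multiprocessing_context` argument when the method has to be chosen explicitly.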
WSL2 GPU Architecture and Principles
WSL2 (Windows Subsystem for Linux 2) is a lightweight virtualization environment that runs a native Linux kernel in Windows.
┌──────────────────────────────┐
│ Windows 11 System Layer │
│ ├─ NVIDIA Driver (RTX 5070 Ti) │
│ └─ GPU Virtualization Layer │
└──────────────┬───────────────┘
↓ GPU Passthrough
┌──────────────────────────────┐
│ Ubuntu (WSL2) │
│ ├─ Uses Windows GPU Driver │
│ ├─ Supports CUDA, cuDNN, PyTorch │
│ └─ Performance ≈ Native Linux 95–99% │
└──────────────────────────────┘
Its greatest advantages are:
- No need to install a Linux NVIDIA driver or the full CUDA Toolkit inside WSL2
- Directly use Windows drivers for GPU acceleration
- Supports all features of PyTorch/TensorFlow
- File system interoperability with Windows
- Performance nearly equivalent to native Linux
WSL2 Configuration and Usage Steps (Tested)
1. Install WSL2
# PowerShell (Administrator)
wsl --install
This automatically completes the following steps:
- Enable WSL feature
- Install Linux kernel and Ubuntu
- Set username and password after reboot
If installation fails, you can enable the feature manually:
dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart
wsl --set-default-version 2
2. Install Python and PyTorch
sudo apt update && sudo apt upgrade -y
sudo apt install python3.10 python3.10-venv python3-pip -y
# Check GPU availability
nvidia-smi # Display RTX 5070 Ti information
# Install PyTorch (CUDA 12.1 wheels; Blackwell cards such as the RTX 5070 Ti need a CUDA 12.8+ build, e.g. the cu128 index)
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Verification:
python3 -c "import torch; print(torch.cuda.is_available())"
# Output True indicates success
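A slightly fuller check than the one-liner is sketched below. `cuda_report` is a hypothetical helper that collects the PyTorch version, the CUDA version the wheel was built with, the device name, and its compute capability, then runs one small kernel. The last step matters because `torch.cuda.is_available()` returning True does not by itself prove that the installed build contains kernels for the card's architecture. The function returns early if PyTorch is not installed.

```python
def cuda_report():
    """Collect basic CUDA facts from PyTorch, or note that it is missing."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = torch.cuda.get_device_capability(0)
        report["supported_archs"] = torch.cuda.get_arch_list()
        # Launch one tiny kernel; an architecture mismatch raises here.
        x = torch.ones(1024, device="cuda")
        report["kernel_ok"] = bool((x + x).sum().item() == 2048.0)
    return report


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

If the device's compute capability does not appear among `supported_archs`, the wheel was built without kernels for that GPU and a newer CUDA build of PyTorch is needed.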
3. File System Interoperability and Project Execution
# Access Windows from WSL2
cd /mnt/d/quant_research
# Run training
python3 main.py
# Monitor GPU in real-time
watch -n 1 nvidia-smi
Windows ↔ WSL2 File Access:
From WSL2 to Windows: /mnt/c/, /mnt/d/
From Windows to WSL2: \\wsl$\Ubuntu\home\<user>\
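The drive-letter mapping above is mechanical, so a script that receives Windows-style paths can translate them itself. The sketch below assumes the default WSL2 automount root `/mnt`; `windows_to_wsl` is a hypothetical helper, and inside WSL2 the built-in `wslpath` command does the same job from the shell.

```python
from pathlib import PurePosixPath, PureWindowsPath


def windows_to_wsl(path, mount_root="/mnt"):
    """Map a Windows drive-letter path to its WSL2 mount path."""
    p = PureWindowsPath(path)
    # Accept only plain drive letters such as "D:", not UNC shares.
    if len(p.drive) != 2 or p.drive[1] != ":":
        raise ValueError(f"expected a drive-letter path, got {path!r}")
    drive = p.drive[0].lower()
    # p.parts[0] is the drive anchor; the rest are directory and file names.
    return str(PurePosixPath(mount_root, drive, *p.parts[1:]))


print(windows_to_wsl(r"D:\quant_research\main.py"))
# → /mnt/d/quant_research/main.py
```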