Optimizing GPU Utilization: A Combinatorial Optimization Problem

In deep learning, GPUs are not always “fully loaded”. Even with powerful graphics cards (such as the RTX 5070 Ti or A100), it is common to see GPU utilization as low as 20–40%. This low utilization is usually not because the model is too small, but because of data flow bottlenecks in the training pipeline along the “CPU ↔ GPU ↔ Memory ↔ Disk” path.

To keep the GPU continuously computing with minimal waiting, a dynamic balance must be achieved across multiple dimensions such as memory usage, computational density, data loading speed, and parallelism. This constitutes a typical constrained combinatorial optimization problem.

Mechanism Determining GPU Utilization

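GPU utilization can be approximately represented as:

$$U \approx \frac{T_{\text{compute}}}{T_{\text{compute}} + T_{\text{wait}}}$$

Where:

$T_{\text{compute}}$ is the time for the GPU to perform forward and backward propagation;
$T_{\text{wait}}$ is the time spent waiting for the CPU to prepare data, data transfer, and synchronization.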
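
As a minimal sketch of this ratio, the helper below computes utilization from a compute time and a waiting time per iteration. The function name and the timings are hypothetical values chosen for illustration, not measurements.

```python
def gpu_utilization(t_compute: float, t_wait: float) -> float:
    """Fraction of wall-clock time the GPU spends computing."""
    total = t_compute + t_wait
    if total <= 0:
        raise ValueError("total time must be positive")
    return t_compute / total


# Hypothetical per-iteration timings in milliseconds.
print(gpu_utilization(t_compute=30.0, t_wait=70.0))  # 0.3
print(gpu_utilization(t_compute=30.0, t_wait=10.0))  # 0.75
```

With the compute time fixed, cutting the waiting time from 70 ms to 10 ms lifts utilization from 30% to 75%, which is why the waiting term is the main target.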

The goal is to minimize the waiting time $T_{\text{wait}}$, which means keeping the GPU continuously “busy”. However, $T_{\text{wait}}$ is determined by multiple factors:

| Stage | Main Parameters | Impact |
| --- | --- | --- |
| Data Loading (CPU Side) | `num_workers`, `prefetch_factor`, `pin_memory` | Affects data reading and prefetching speed |
| Memory Transfer (CPU→GPU) | PCIe bandwidth, `pin_memory` | Affects transfer latency |
| Computation Stage (GPU Side) | `batch_size`, `hidden_dim`, `sequence_length`, `mixed_precision` | Affects computation time per iteration |
| Optimizer and Gradient Update | `gradient_accumulation_steps`, `optimizer_state` | Determines memory and bandwidth usage |
| System Resource Limitations | Operating System (Windows/Linux), number of CPU cores, RAM size | Determines the number of parallel processes |
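
The data-loading parameters in the table map onto keyword arguments of PyTorch's `DataLoader`. The helper below is a sketch that only assembles those arguments; its name, the default of one worker per core capped at 8, and the choice to enable `persistent_workers` are assumptions for illustration. It reflects one rule that is easy to trip over: `prefetch_factor` and `persistent_workers` are only accepted when `num_workers` is greater than zero.

```python
import os


def dataloader_kwargs(batch_size, num_workers=None, prefetch_factor=2, use_cuda=True):
    """Build keyword arguments for torch.utils.data.DataLoader."""
    if num_workers is None:
        # Hypothetical starting point: one worker per CPU core, capped at 8.
        num_workers = min(8, os.cpu_count() or 1)
    kwargs = {
        "batch_size": batch_size,
        "num_workers": num_workers,
        "pin_memory": use_cuda,  # page-locked host memory speeds up CPU→GPU copies
    }
    if num_workers > 0:
        # These two options are only valid when worker processes are used.
        kwargs["prefetch_factor"] = prefetch_factor
        kwargs["persistent_workers"] = True
    return kwargs


print(dataloader_kwargs(batch_size=64, num_workers=4))
# Usage (requires PyTorch): DataLoader(dataset, **dataloader_kwargs(64, 4))
```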

These factors are interrelated, and any adjustment of a single parameter may affect the performance balance of other modules. For example:

Increasing batch_size can enhance parallelism but may lead to out-of-memory errors;
Increasing num_workers can alleviate data waiting but will increase memory usage;
Enabling mixed_precision can save memory but requires care to preserve numerical stability (for example, loss scaling with FP16).

This makes GPU utilization optimization impossible to achieve by tuning a single variable; the interactions between parameters must be considered together.
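
To make the memory side of these trade-offs concrete, the sketch below computes the size of a single activation tensor of shape (batch, sequence, hidden) at different precisions. The shape values are hypothetical, and a real model holds many such tensors plus weights, gradients, and optimizer state, so this is only a lower bound on the effect.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2}


def activation_bytes(batch_size, sequence_length, hidden_dim, dtype="fp32"):
    """Size in bytes of one (batch, sequence, hidden) activation tensor."""
    return batch_size * sequence_length * hidden_dim * BYTES_PER_ELEMENT[dtype]


MIB = 1024 ** 2
for batch_size in (32, 64):
    for dtype in ("fp32", "fp16"):
        size = activation_bytes(batch_size, 512, 1024, dtype)
        print(f"batch_size={batch_size:3d} {dtype}: {size / MIB:.0f} MiB")
```

Doubling batch_size doubles the tensor, while switching from FP32 to FP16 halves it, so the two parameters can offset each other within the same memory budget.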

Modeling Approach from a Combinatorial Optimization Perspective

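We can abstract GPU performance tuning as the following optimization problem:

$$\max_{x} \; \frac{T_{\text{compute}}(x)}{T_{\text{compute}}(x) + T_{\text{wait}}(x)} \quad \text{s.t.} \quad M_{\text{GPU}}(x) \le M_{\text{GPU}}^{\max}, \quad M_{\text{RAM}}(x) \le M_{\text{RAM}}^{\max}$$

Where:

$x$ is a set of decision variables (e.g., batch_size, num_workers, prefetch_factor, etc.);
$T_{\text{compute}}(x)$ is a function of computation time;
$T_{\text{wait}}(x)$ is a function of data waiting time;
$M_{\text{GPU}}(x)$ is GPU memory consumption, bounded by the available GPU memory $M_{\text{GPU}}^{\max}$;
$M_{\text{RAM}}(x)$ is system memory consumption, bounded by the available RAM $M_{\text{RAM}}^{\max}$.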

This problem has the following characteristics:

Non-linearity: Terms like hidden_dim² and batch × seq make compute cost and memory use non-linear in the decision variables;
Multi-constraint: Memory, storage, and thread counts are all limited;
Discreteness: Parameters like batch_size and num_workers must be integers;
Interactivity: There are multiplicative relationships between variables;
Empirical: Actual bottlenecks depend on the operating system and driver implementations.

Therefore, this problem cannot be solved with plain gradient descent or linear programming; it calls for methods such as heuristic search or Bayesian optimization.
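
The structure of the search can be sketched on a toy problem. The example below enumerates a small discrete grid, discards configurations that violate the memory limits, and keeps the one with the highest throughput. It uses brute-force enumeration as a stand-in for heuristic or Bayesian search, and every coefficient in the cost model is a made-up illustrative value; in practice `evaluate` would run and time a few real training iterations.

```python
from itertools import product

# All coefficients below are made-up illustrative values, not measurements.
GPU_MEM_LIMIT_MB = 16_000
RAM_LIMIT_MB = 32_000


def evaluate(batch_size, num_workers, prefetch_factor):
    """Return (throughput, feasible) under a toy cost model."""
    t_compute = 5.0 + 0.4 * batch_size              # ms per iteration on the GPU
    t_load = 2.0 * batch_size / num_workers         # ms to prepare one batch on the CPU
    t_wait = max(0.0, t_load - t_compute)           # loading overlaps with compute
    gpu_mem = 4_000 + 90 * batch_size               # MB
    ram = 2_000 + 3 * batch_size * num_workers * prefetch_factor  # MB
    feasible = gpu_mem <= GPU_MEM_LIMIT_MB and ram <= RAM_LIMIT_MB
    throughput = batch_size / (t_compute + t_wait)  # samples per ms
    return throughput, feasible


def search(batch_sizes, worker_counts, prefetch_factors):
    """Enumerate the discrete grid and keep the best feasible configuration."""
    best = None
    for config in product(batch_sizes, worker_counts, prefetch_factors):
        throughput, feasible = evaluate(*config)
        if feasible and (best is None or throughput > best[0]):
            best = (throughput, config)
    return best


best = search([16, 32, 64, 128, 256], [2, 4, 8, 16], [2, 4])
print(best)
```

In this toy model the largest batch size is rejected by the GPU memory limit, and adding workers stops helping once loading is fully hidden behind compute. For larger spaces, the enumeration would be replaced by random or Bayesian search over the same `evaluate` function.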

The Root of Data Flow Bottlenecks: Asynchronous CPU–GPU Systems

The computational speed of GPUs far exceeds that of CPUs. A typical PyTorch training loop looks like this:

for x, y in dataloader:
    x, y = x.to(device), y.to(device)  # Copy inputs and targets from CPU to GPU
    optimizer.zero_grad()              # Clear gradients from the previous step
    y_pred = model(x)                  # Forward propagation on GPU
    loss = criterion(y_pred, y)        # Compute loss on GPU
    loss.backward()                    # Backward propagation on GPU
    optimizer.step()                   # Update parameters

Bottlenecks often occur at:

Slow DataLoader loading (I/O or insufficient CPU preprocessing);
GPU waiting for data (batches not yet transferred to GPU memory);
Frequent CPU↔GPU synchronization (blocking calls);
Insufficient system memory (prefetching failure).
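
Before tuning anything, it helps to measure which of these bottlenecks dominates. The sketch below splits the wall-clock time of a loop into time spent waiting for the next batch and time spent in the training step. `profile_loop` and `slow_loader` are hypothetical helpers, and the sleeps stand in for real loading and compute.

```python
import time


def profile_loop(batches, step_fn):
    """Split wall-clock time into waiting-for-data and compute portions."""
    t_wait = t_compute = 0.0
    iterator = iter(batches)
    while True:
        start = time.perf_counter()
        try:
            batch = next(iterator)          # blocks while the loader prepares data
        except StopIteration:
            break
        loaded = time.perf_counter()
        step_fn(batch)                      # forward, backward, optimizer step
        # With a real CUDA model, call torch.cuda.synchronize() here so that
        # asynchronous GPU work is included in the compute time.
        done = time.perf_counter()
        t_wait += loaded - start
        t_compute += done - loaded
    return t_wait, t_compute


def slow_loader(n, delay):
    """Stand-in for a DataLoader whose batches take `delay` seconds to prepare."""
    for i in range(n):
        time.sleep(delay)
        yield i


t_wait, t_compute = profile_loop(slow_loader(5, 0.02), lambda batch: time.sleep(0.01))
print(f"wait={t_wait:.3f}s compute={t_compute:.3f}s "
      f"utilization={t_compute / (t_compute + t_wait):.0%}")
```

If the waiting share is large, the data pipeline is the bottleneck; if it is close to zero, further gains have to come from the compute side.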

However fast the GPU is, it sits idle until the next batch is fully in place. This means:

The upper limit of GPU utilization depends not only on the computing power of the graphics card but also on the efficiency of CPU data preparation.

Therefore, optimization should start from the perspective of maximizing “GPU busy time” to reduce idle time.

Optimization Goal: Keeping the GPU Always “Busy”

In an ideal state:

The CPU and DataLoader continuously push data to the GPU;
Batch computation on the GPU overlaps with data loading on the CPU;
The training process has no waiting or memory bottlenecks.
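
The overlap described here can be illustrated without any GPU. The sketch below wraps an iterable in a background thread that loads a few items ahead, so the consumer rarely has to wait. It is a simplified stand-in for what DataLoader worker processes do, not PyTorch's implementation, and the `prefetch` helper is hypothetical. It does not propagate exceptions raised in the producer thread.

```python
import queue
import threading


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable`, loading up to `buffer_size` ahead in a thread."""
    sentinel = object()
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        for item in iterable:
            buffer.put(item)       # blocks when the buffer is full
        buffer.put(sentinel)       # signal that the source is exhausted

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()        # blocks only if the producer has fallen behind
        if item is sentinel:
            return
        yield item


for batch in prefetch(range(5), buffer_size=2):
    print(batch)  # the next batches are already being loaded while this one runs
```

The bounded queue is the design choice that matters: it caps memory use at `buffer_size` batches, which is the same trade-off `prefetch_factor` controls.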

At this point, GPU utilization approaches 100%, but the following conditions must be met:

Memory Fully Occupied;
Compute Dense;
Async Stream for Data Transfer;
Balanced Pipeline between CPU and GPU.

These four conditions are constrained by different hyperparameters, leading to a high-dimensional combinatorial space that requires trade-offs. Thus, this type of optimization problem is structurally similar to problems like “optimal portfolio allocation” and “risk budget optimization”:

We aim to find a combination of parameters that maximizes the overall throughput of the system under limited resources.

Linux GPU Acceleration under WSL2: From Principles to Practice

Why WSL2 is Needed

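On Windows, Python multiprocessing starts worker processes with the spawn method:

# Windows (spawn)
Each worker starts independently → data must be pickled → workers ≤ 8

Whereas Linux uses the fork method:

# Linux (fork)
Workers inherit the parent's memory (copy-on-write) → no need to pickle large data → workers can reach 16–32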

In practical training, this difference means:

GPU utilization on Windows is often **20–30%**;
On WSL2/Linux, it can be increased to **60–80%**;
And there will be no more “pickle errors” or “DataLoader hangs”.
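
The start method and the pickling requirement can both be checked from Python. The sketch below reports the platform's default start method and tests whether an object could be sent to a spawned worker. The helper names are hypothetical. A common source of pickle errors under spawn is a dataset or collate function that holds a lambda or a locally defined function.

```python
import multiprocessing as mp
import pickle


def default_start_method():
    """Return the platform's default start method (the first supported one)."""
    return mp.get_all_start_methods()[0]


def is_picklable(obj):
    """Check whether `obj` could be sent to a spawned worker process."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


print("default start method:", default_start_method())
print("list picklable:  ", is_picklable([1, 2, 3]))
print("lambda picklable:", is_picklable(lambda x: x))
```

The default depends on both the operating system and the Python version, so printing it is more reliable than assuming it.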

| Issue | Behavior on Windows | Solution in WSL2 |
| --- | --- | --- |
| DataLoader Multi-process Bottleneck | spawn mode, slow pickle | fork mode, fast memory copy |
| Low GPU Utilization | 20–30% | Increased to 60–80% |
| Insufficient Memory Utilization | Often more than 70% idle | Sustained high occupancy |
| Complex CUDA Environment Configuration | Complicated driver + Toolkit installation | A single pip command |
| File Isolation | File systems are separate | Fully shared |

WSL2 GPU Architecture and Principles

WSL2 (Windows Subsystem for Linux 2) is a lightweight virtualization environment that runs a native Linux kernel in Windows.

┌────────────────────────────────────────┐
│ Windows 11 System Layer                │
│ ├─ NVIDIA Driver (RTX 5070 Ti)         │
│ └─ GPU Virtualization Layer            │
└──────────────┬─────────────────────────┘
               ↓ GPU Passthrough
┌────────────────────────────────────────┐
│ Ubuntu (WSL2)                          │
│ ├─ Uses Windows GPU Driver             │
│ ├─ Supports CUDA, cuDNN, PyTorch       │
│ └─ Performance ≈ Native Linux 95–99%   │
└────────────────────────────────────────┘

Its greatest advantages are:

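No need to install a Linux NVIDIA driver inside WSL2, and PyTorch's pip wheels bundle the CUDA runtime, so a separate CUDA Toolkit is not required
Directly use Windows drivers for GPU acceleration
Supports GPU training with PyTorch and TensorFlow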
File system interoperability with Windows
Performance nearly equivalent to native Linux

WSL2 Configuration and Usage Steps (Tested)

1. Install WSL2

# PowerShell (Administrator)
wsl --install

This automatically completes the following steps:

Enable WSL feature
Install Linux kernel and Ubuntu
Set username and password after reboot

If installation fails, you can enable the feature manually:

dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart
wsl --set-default-version 2
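
Once the distribution is running, a script can confirm that it is executing inside WSL rather than native Windows. The sketch below relies on the fact that WSL kernels report a release string containing "microsoft"; the helper names and the sample release strings are illustrative. From PowerShell, `wsl -l -v` lists the installed distributions together with their WSL version.

```python
import platform


def looks_like_wsl(kernel_release: str) -> bool:
    """WSL kernels report a release string that contains 'microsoft'."""
    return "microsoft" in kernel_release.lower()


def running_in_wsl() -> bool:
    """True when the current interpreter runs on a Linux kernel provided by WSL."""
    return platform.system() == "Linux" and looks_like_wsl(platform.release())


print(running_in_wsl())
```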

2. Install Python and PyTorch

sudo apt update && sudo apt upgrade -y
sudo apt install python3.10 python3.10-venv python3-pip -y

# Check GPU availability
nvidia-smi  # Display RTX 5070 Ti information

# Install PyTorch (CUDA 12.8 build; RTX 50-series GPUs are not supported by the cu121 wheels)
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

Verification:

python3 -c "import torch; print(torch.cuda.is_available())"
# Output True indicates success
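
A slightly fuller check can be sketched as a small script. It reports the PyTorch version, the CUDA version the wheel was built against, and the detected device, and it degrades gracefully when PyTorch is not installed. The function name and the report keys are hypothetical.

```python
import importlib.util


def cuda_report():
    """Collect basic CUDA facts, tolerating a missing PyTorch install."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report
    import torch

    report["torch_version"] = torch.__version__
    report["cuda_build"] = torch.version.cuda          # None for CPU-only wheels
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = torch.cuda.get_device_capability(0)
        report["supported_archs"] = torch.cuda.get_arch_list()
    return report


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

If the device's compute capability has no matching entry in the supported architectures, the installed wheel was probably built without kernels for that GPU, even though `torch.cuda.is_available()` returns True.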

3. File System Interoperability and Project Execution

# Access Windows from WSL2
cd /mnt/d/quant_research

# Run training
python3 main.py

# Monitor GPU in real-time
watch -n 1 nvidia-smi

Windows ↔ WSL2 File Access:

From WSL2 to Windows: /mnt/c/, /mnt/d/
From Windows to WSL2: \\wsl$\Ubuntu\home\<user>\
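
The drive-letter mapping is mechanical, so it can be sketched in a few lines. The helper below converts a Windows path into the corresponding /mnt/ path; its name and the sample paths are hypothetical. Inside WSL2, the built-in `wslpath` utility performs the same conversion.

```python
from pathlib import PurePosixPath, PureWindowsPath


def windows_to_wsl(path: str) -> str:
    """Map a drive-letter Windows path to its /mnt/<drive>/ location in WSL2."""
    win = PureWindowsPath(path)
    if len(win.drive) != 2 or win.drive[1] != ":":
        raise ValueError(f"expected a drive-letter path, got {path!r}")
    drive_letter = win.drive[0].lower()
    return str(PurePosixPath("/mnt", drive_letter, *win.parts[1:]))


print(windows_to_wsl(r"D:\quant_research"))           # /mnt/d/quant_research
print(windows_to_wsl(r"C:\Users\me\data\train.csv"))  # /mnt/c/Users/me/data/train.csv
```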
