Understanding 'Load Average' in Linux

1. Introduction: Why is the system slow when CPU usage is low?

Early one morning, a monitoring alert goes off: “Server load average spiked to 20!” You quickly log into the server:

$ uptime
 12:30:01 up 30 days,  5:22,  2 users,  load average: 20.12, 18.45, 15.67

Next, check top:

%Cpu(s):  2.1%us,  0.5%sy,  0.0%ni, 97.0%id,  0.3%wa,  0.0%hi,  0.1%si,  0.0%st

The CPU is 97% idle! So why is the load so high, and why is the system so slow that web pages will not open?

This confuses many beginners: “High load ≠ High CPU”.

In this article, we will look at what Linux Load Average is, how to check it, and how to analyze it.

2. What is ‘Load Average’?

Official Definition (Simplified):

Load Average = Average number of processes in the system that are in either the ‘runnable’ state or the ‘uninterruptible sleep’ state

Note! It is not CPU usage, but rather a measure of the overall “busyness” of the system.

Breaking down the two key states:

State	Description	Common Scenarios
Runnable	Currently using CPU or waiting for CPU scheduling	CPU-intensive tasks (e.g., calculations, compilation)
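Uninterruptible Sleep (D State)	Process blocked inside the kernel, usually waiting for disk I/O or a network filesystem such as NFS, and cannot be interrupted by signals (an ordinary wait on a network socket is an interruptible sleep and does not count)	Slow disk, NFS hang, hardware failure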

💡 Key Point: Even if the CPU is idle, as long as a large number of processes are stuck in I/O (for example, slow disk response), the load will spike!

3. What do the three numbers represent?

load average: 1.20, 0.95, 0.78

These three numbers represent:

1-minute load average
5-minute load average
15-minute load average

They are Exponentially Weighted Moving Averages (EWMA), with more recent data having higher weight.

How to determine if it is ‘too high’

Rule of Thumb:

Load < Number of CPU cores → System is healthy
Load ≈ Number of CPU cores → System is fully loaded but manageable
Load > Number of CPU cores × 2 → Potential performance bottleneck, needs investigation!

For example: On a 4-core server, if the load consistently exceeds 8, it’s time to be cautious.

⚠️ Note: Do not judge by the absolute value alone! Occasional spikes are normal; a load that stays high is what deserves attention.
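The rule of thumb above can be sketched as a small Python helper. This is only an illustration: the function names and the three labels are made up for this example, and treating the whole band between 1× and 2× the core count as "fully loaded" is an assumption, since the rule itself does not say where that band ends.

```python
import os


def classify_load(load: float, cores: int) -> str:
    """Classify one load value against the core count using the rule of thumb."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if load > cores * 2:
        return "investigate"
    if load >= cores:
        # Assumption: everything from 1x to 2x the core count is "fully loaded".
        return "fully loaded"
    return "healthy"


def classify_current() -> dict:
    """Classify the live 1/5/15-minute values (Unix-like systems only)."""
    cores = os.cpu_count() or 1
    labels = ("1min", "5min", "15min")
    return {name: classify_load(value, cores)
            for name, value in zip(labels, os.getloadavg())}


if __name__ == "__main__":
    # The uptime values from the introduction, on a hypothetical 4-core server.
    for value in (20.12, 18.45, 15.67):
        print(value, classify_load(value, 4))
```

On a real host, classify_current() applies the same check to the values reported by the operating system. Looking at the 5-minute and 15-minute results rather than the 1-minute one helps to ignore short spikes.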

4. In Practice: How to locate the cause of high load?

Step 1: Confirm if the load is really high

$ uptime
$ cat /proc/loadavg
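/proc/loadavg holds five fields: the three load averages, a pair of the form runnable/total counting scheduling entities (tasks), and the PID of the most recently created process. A minimal Python sketch for reading it from a script might look like this; the LoadAvg type and both function names are invented for the example.

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    one: float       # 1-minute load average
    five: float      # 5-minute load average
    fifteen: float   # 15-minute load average
    runnable: int    # scheduling entities runnable right now
    total: int       # scheduling entities that exist in total
    last_pid: int    # PID of the most recently created process


def parse_loadavg(text: str) -> LoadAvg:
    """Parse the single line found in /proc/loadavg."""
    one, five, fifteen, entities, last_pid = text.split()
    runnable, total = entities.split("/")
    return LoadAvg(float(one), float(five), float(fifteen),
                   int(runnable), int(total), int(last_pid))


def read_loadavg(path: str = "/proc/loadavg") -> LoadAvg:
    """Read and parse the live file (Linux only)."""
    with open(path) as handle:
        return parse_loadavg(handle.read())


if __name__ == "__main__":
    # A made-up sample line using the load values from the introduction.
    print(parse_loadavg("20.12 18.45 15.67 3/512 12345\n"))
```

The runnable field is an instantaneous count, so it jumps around far more than the averaged values next to it.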

Step 2: Check which processes are causing the issue

# Check all processes in D state (uninterruptible sleep)
$ ps aux | awk '$8 ~ /^D/ {print $0}'

Example output:

root      1234  0.0  0.0      0     0 ?        D    10:00   0:05 [kworker/2:1]
user      5678  0.0  0.1 123456 7890 ?        D    10:05   0:02 ./my_app

If you see a large number of D state processes, it is likely an I/O issue (disk, storage, NFS, etc.).
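The same check can be scripted by reading the state letter straight from /proc/<pid>/stat, which is handy for logging how many processes are stuck over time. The sketch below is one possible approach rather than a replacement for ps; the function names are hypothetical, and it only looks at processes, not at their individual threads, even though the load average counts threads too.

```python
import os
from collections import Counter
from typing import List, Tuple

Task = Tuple[int, str, str]  # (pid, command name, state letter)


def parse_stat(stat_line: str) -> Task:
    """Extract pid, command name and state from the text of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left])
    comm = stat_line[left + 1:right]
    state = stat_line[right + 1:].split()[0]
    return pid, comm, state


def scan_tasks(proc_root: str = "/proc") -> List[Task]:
    """Collect (pid, comm, state) for every process currently listed (Linux only)."""
    tasks = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                tasks.append(parse_stat(handle.read()))
        except OSError:
            continue  # the process exited while we were scanning
    return tasks


def summarize(tasks: List[Task]) -> Tuple[Counter, List[Task]]:
    """Count tasks per state and list the ones in D state."""
    counts = Counter(state for _, _, state in tasks)
    blocked = [task for task in tasks if task[2] == "D"]
    return counts, blocked


if __name__ == "__main__":
    # Shortened, made-up stat lines modelled on the example output above.
    sample = [
        "1234 (kworker/2:1) D 2 0 0 0 -1",
        "5678 (my_app) D 1 5678 5678 0 -1",
        "4321 (sshd) S 1 4321 4321 0 -1",
    ]
    counts, blocked = summarize([parse_stat(line) for line in sample])
    print(dict(counts))
    for pid, comm, _ in blocked:
        print(pid, comm)
```

On a live system, summarize(scan_tasks()) gives the per-state counts; a D count that stays high across several runs points in the same direction as the ps output.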

Step 3: Check CPU and I/O usage

# Real-time monitoring of CPU, memory, I/O
$ top          # Press '1' to see per-CPU usage
$ iostat -x 2  # Check disk %util and await
$ dstat        # Comprehensive monitoring (recommended)

Key points to focus on:

iostat %util > 90%? → The disk is likely saturated
await (average I/O wait time; newer sysstat versions report it as r_await and w_await) > 20ms? → Slow disk
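To see where these two numbers come from, here is a rough Python sketch that derives them from two snapshots of /proc/diskstats. It assumes the usual field layout (reads completed, time spent reading, writes completed, time spent writing, and time spent doing I/O, all times in milliseconds) and is meant to illustrate the arithmetic, not to reproduce iostat exactly. All function names are made up for the example.

```python
import time
from typing import Dict, Tuple


def parse_diskstats_line(line: str) -> Dict[str, int]:
    """Pick the counters needed for %util and await from one /proc/diskstats line."""
    fields = line.split()  # fields[0..2] are major, minor and device name
    return {
        "reads": int(fields[3]),      # reads completed
        "read_ms": int(fields[6]),    # time spent reading (ms)
        "writes": int(fields[7]),     # writes completed
        "write_ms": int(fields[10]),  # time spent writing (ms)
        "io_ms": int(fields[12]),     # time the device was busy doing I/O (ms)
    }


def util_and_await(before: Dict[str, int], after: Dict[str, int],
                   interval_s: float) -> Tuple[float, float]:
    """Return (%util, average wait per request in ms) between two snapshots."""
    delta = {key: after[key] - before[key] for key in before}
    ios = delta["reads"] + delta["writes"]
    util_pct = 100.0 * delta["io_ms"] / (interval_s * 1000.0)
    await_ms = (delta["read_ms"] + delta["write_ms"]) / ios if ios else 0.0
    return util_pct, await_ms


def snapshot(path: str = "/proc/diskstats") -> Dict[str, Dict[str, int]]:
    """Read the counters for every device (Linux only)."""
    with open(path) as handle:
        return {line.split()[2]: parse_diskstats_line(line) for line in handle}


def sample_devices(interval_s: float = 2.0) -> Dict[str, Tuple[float, float]]:
    """Take two live snapshots interval_s apart and compute both metrics per device."""
    first = snapshot()
    time.sleep(interval_s)
    second = snapshot()
    return {name: util_and_await(first[name], second[name], interval_s)
            for name in second if name in first}


if __name__ == "__main__":
    # Two made-up snapshots of one device, taken 2 seconds apart.
    before = parse_diskstats_line("8 0 sda 1000 0 8000 500 2000 0 16000 3000 0 1500 3500")
    after = parse_diskstats_line("8 0 sda 1100 0 8800 1500 2100 0 16800 6000 0 3400 7500")
    util_pct, await_ms = util_and_await(before, after, 2.0)
    print(f"util={util_pct:.1f}% await={await_ms:.1f}ms")
```

On a Linux host, sample_devices() returns the pair for every device, which can then be compared against the 90% and 20ms thresholds above.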

Step 4: In-depth analysis (advanced)

# Check I/O of specific processes
$ pidstat -d 2

# Check if kernel threads (like kswapd, jbd2) are active
$ top -p $(pgrep -d',' kswapd)

# Check for a large number of zombie processes (they do not affect load, but they should be cleaned up)
$ ps aux | awk '$8 ~ /^Z/'
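For a closer look at one suspicious process, the kernel's per-process I/O accounting in /proc/<pid>/io can be sampled directly. The following Python sketch turns two readings of its read_bytes and write_bytes counters into kB/s figures. The function names are hypothetical, and reading this file for a process owned by another user normally requires root.

```python
from typing import Dict, Tuple


def parse_proc_io(text: str) -> Dict[str, int]:
    """Parse the 'name: value' lines of /proc/<pid>/io into a dict."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def io_rates_kb(before: Dict[str, int], after: Dict[str, int],
                interval_s: float) -> Tuple[float, float]:
    """Return (read kB/s, write kB/s) between two readings."""
    read_kb = (after["read_bytes"] - before["read_bytes"]) / 1024 / interval_s
    write_kb = (after["write_bytes"] - before["write_bytes"]) / 1024 / interval_s
    return read_kb, write_kb


def read_proc_io(pid: int) -> Dict[str, int]:
    """Read the live counters for one process (Linux only)."""
    with open(f"/proc/{pid}/io") as handle:
        return parse_proc_io(handle.read())


if __name__ == "__main__":
    # Two made-up readings of the same process, taken 2 seconds apart.
    first = parse_proc_io("read_bytes: 0\nwrite_bytes: 0\n")
    second = parse_proc_io("read_bytes: 2097152\nwrite_bytes: 1048576\n")
    print(io_rates_kb(first, second, 2.0))
```

Calling read_proc_io() twice with a short sleep in between gives the two readings for a real PID, for example one of the D state processes found in Step 2.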

5. Kernel Calculation Mechanism: Exponential Moving Average (EMA)

5.1 Why not a simple average?

The kernel does not use a simple average of the process count over the past minute, because that would require storing every sample in the window (with one sample every 5 seconds, that is 12 samples for 1 minute and 180 for 15 minutes). Instead, the kernel uses the EMA algorithm, which has the following characteristics:

Efficient calculation: Only needs to save the previous value, O(1) complexity
Sensitive to trends: More recent data has greater weight, allowing for quicker reflection of system changes

5.2 Core Algorithm Implementation

Kernel source location: kernel/sched/loadavg.c in current kernels (older kernels kept this code in kernel/sched/core.c)

// Exponentially weighted moving average in fixed-point arithmetic
// (simplified from the kernel source; FIXED_1 = 1 << FSHIFT represents 1.0)
static unsigned long calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
    load *= exp;                      // Decay the historical value
    load += active * (FIXED_1 - exp); // Blend in the current active count
    load += 1UL << (FSHIFT - 1);      // Round to nearest
    return load >> FSHIFT;            // Shift right to return to fixed-point scale
}

// Called roughly every 5 seconds (LOAD_FREQ)
void calc_global_load(void)
{
    long active = atomic_long_read(&calc_load_tasks); // Current R + D task count
    active = active > 0 ? active * FIXED_1 : 0;       // Convert the count to fixed point

    // Update the 1/5/15-minute load averages
    avenrun[0] = calc_load(avenrun[0], EXP_1, active);
    avenrun[1] = calc_load(avenrun[1], EXP_5, active);
    avenrun[2] = calc_load(avenrun[2], EXP_15, active);
}

Key parameters:

active: Total number of R + D processes at the current moment
EXP_1/5/15: Exponential decay coefficients corresponding to the time constants of 1/5/15 minutes
avenrun[]: Global array that stores load values for three time periods
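To get a feel for how these averages react, the update step can be replayed in user space. The Python sketch below mirrors the fixed-point arithmetic of calc_load with the kernel's constants (FSHIFT = 11, so FIXED_1 = 2048, and decay coefficients of 1884, 2014 and 2037, which are 2048 × e^(-5/60), e^(-5/300) and e^(-5/900) rounded). The simulate function and the constant-demand scenario are assumptions made for illustration; a real system's active count changes from sample to sample.

```python
import math

FSHIFT = 11                # bits of fractional precision
FIXED_1 = 1 << FSHIFT      # 1.0 in fixed point (2048)
LOAD_INTERVAL_S = 5        # seconds between samples


def decay_constant(window_s: int) -> int:
    """Fixed-point value of e^(-interval/window) for a given averaging window."""
    return round(FIXED_1 * math.exp(-LOAD_INTERVAL_S / window_s))


EXP_1 = decay_constant(60)     # 1-minute coefficient
EXP_5 = decay_constant(300)    # 5-minute coefficient
EXP_15 = decay_constant(900)   # 15-minute coefficient


def calc_load(load: int, exp: int, active: int) -> int:
    """One EMA update step; load and active are both fixed-point values."""
    load = load * exp + active * (FIXED_1 - exp)
    load += 1 << (FSHIFT - 1)  # round to nearest
    return load >> FSHIFT


def simulate(active_tasks: int, steps: int, exp: int, start: int = 0) -> float:
    """Feed a constant number of active tasks for `steps` samples (hypothetical scenario)."""
    load = start
    active = active_tasks * FIXED_1  # convert the count to fixed point
    for _ in range(steps):
        load = calc_load(load, exp, active)
    return load / FIXED_1


if __name__ == "__main__":
    # 20 active tasks held constant for one minute (12 samples), starting from idle.
    for label, exp in (("1 min", EXP_1), ("5 min", EXP_5), ("15 min", EXP_15)):
        print(label, round(simulate(20, 12, exp), 2))
```

After one minute of constant demand, the 1-minute value has covered only roughly 63% of the distance to the true task count, and the 5-minute and 15-minute values lag further behind. This is why a sudden burst shows up in the first number well before the other two.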

6. Common High Load Scenarios and Solutions

Scenario	Symptoms	Solutions
Disk Performance Bottleneck	High load, high CPU idle, iostat %util=100%	Upgrade to SSD, optimize database queries, add caching
NFS/GlusterFS Hang	Large number of D state processes, unresponsive mount point	Check network, server load, timeout settings
Insufficient Memory + Frequent Swapping	kswapd high usage, load spikes	Add memory, reduce application memory usage
CPU Intensive Tasks Piling Up	High load, high CPU us/sy	Throttle, scale up, optimize algorithms
Kernel Bugs or Hardware Failures	Random D state, unable to kill	Upgrade kernel, check RAID/SMART

7. Conclusion

Load Average ≠ CPU Usage
High load may be caused by a busy CPU or by blocked I/O
D state processes are key clues to I/O issues
Investigation approach: uptime → ps aux → iostat/dstat → pidstat
Core principle: Do not look at metrics in isolation. Analyze them in context, considering CPU, memory, disk, network, and application logs.
