The Ultimate Guide to Disk IO Monitoring in Linux: From iostat to iotop

🚀 When the system lags, database queries slow down, or file transfers take too long, the issue often lies with disk IO. Master these IO monitoring tools to uncover performance bottlenecks!

🔍 Disk IO: A Key Dimension of Performance Analysis

In performance optimization there is a classic saying: "The CPU waits for IO, and IO waits for the disk." When applications run into performance problems, disk IO is often the most overlooked bottleneck, and one of the most critical.

Imagine these scenarios:

  • Database queries suddenly become abnormally slow
  • File server response times keep increasing
  • Virtual machines take a long time to start
  • Big data processing tasks stall

These symptoms often share one root cause: a disk IO bottleneck. This guide walks through the main IO monitoring tools in Linux, from the basic iostat to more modern tools, to help you build a complete IO performance monitoring system.

📊 First Generation: iostat – The Pioneer of IO Statistics

Background

iostat (Input/Output Statistics) is a core component of the sysstat toolkit. Since its first release in 1999, it has been the preferred tool for Linux system administrators for IO performance analysis. It provides system-level disk IO statistics and is the cornerstone for understanding system IO behavior.

Core Functions and Output Interpretation

# Basic usage
iostat

# Display extended statistics
iostat -x

# Update every 2 seconds, showing 5 times
iostat -x 2 5

Detailed Output Analysis

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.50    0.00    1.25   15.75    0.00   80.50

Device            r/s     w/s     rkB/s    wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1         45.20   28.60    1840.5   1145.2    2.40    5.80   5.05  16.88     8.50   12.30   0.85    40.73    40.04   4.50   33.20
sda              0.20    2.40       8.0     96.0    0.00    0.20   0.00   7.69    15.00   25.50   0.06    40.00    40.00  12.50    3.00

Key Metrics Explained:

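CPU Section:

  • %iowait: Percentage of time the CPUs were idle while the system had outstanding disk IO requests (critical metric)
  • %system: Percentage of CPU time spent in kernel mode
  • %idle: Percentage of time the CPUs were idle with no outstanding disk IO requests

Disk Section:

  • r/s, w/s: Number of read/write requests completed per second
  • rkB/s, wkB/s: Amount of data read/written per second (kB)
  • rrqm/s, wrqm/s: Number of read/write requests merged per second before being sent to the device
  • r_await, w_await: Average time in milliseconds for read/write requests to be served, including time spent in the queue
  • %util: Percentage of elapsed time during which the device had at least one request in flight (critical bottleneck metric; SSDs, NVMe drives, and RAID arrays serve requests in parallel, so a high value alone does not prove saturation)
  • aqu-sz: Average number of requests queued or in flight
  • svctm: Deprecated and unreliable; recent sysstat releases no longer print it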
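
iostat derives these figures from the kernel counters in /proc/diskstats. As a rough sketch of that arithmetic, two snapshots taken a known interval apart are enough to reproduce the main columns. The function names calc_io_rates and sample_device are made up for this illustration, and the code assumes the standard /proc/diskstats field layout with 512-byte sectors:

```shell
#!/bin/bash
# calc_io_rates INTERVAL_SECONDS
# Reads two /proc/diskstats lines for the same device from stdin
# (older snapshot first) and prints iostat-style rates.
calc_io_rates() {
    awk -v t="$1" '
        NR == 1 { for (i = 4; i <= 14; i++) prev[i] = $i; next }
        NR == 2 {
            rd = $4 - prev[4]    # reads completed during the interval
            wr = $8 - prev[8]    # writes completed during the interval
            printf "dev=%s r/s=%.2f w/s=%.2f ", $3, rd / t, wr / t
            # Sector counters are in 512-byte units, so sectors / 2 = kB
            printf "rkB/s=%.2f wkB/s=%.2f ", ($6 - prev[6]) / 2 / t, ($10 - prev[10]) / 2 / t
            # Await = milliseconds spent on requests / requests completed
            printf "r_await=%.2f w_await=%.2f ", (rd ? ($7 - prev[7]) / rd : 0), (wr ? ($11 - prev[11]) / wr : 0)
            # Util = milliseconds the device was busy / elapsed milliseconds
            printf "util=%.2f\n", ($13 - prev[13]) / (t * 1000) * 100
        }'
}

# sample_device DEVICE [INTERVAL_SECONDS]
# Takes two live snapshots of one device and feeds them to calc_io_rates.
sample_device() {
    local dev=$1 interval=${2:-1}
    {
        awk -v d="$dev" '$3 == d' /proc/diskstats
        sleep "$interval"
        awk -v d="$dev" '$3 == d' /proc/diskstats
    } | calc_io_rates "$interval"
}
```

For example, `sample_device sda 5` prints one line of rates for sda averaged over five seconds. Seeing the raw arithmetic also explains why %util only measures how long the device was busy, not how much parallel capacity it has left.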

iostat Best Practices

# 1. System IO overview monitoring
iostat -x 1

# 2. Specific device monitoring
iostat -x nvme0n1 2

# 3. Show only disk statistics (no CPU)
iostat -d 2

# 4. Display NFS statistics (older sysstat releases only; newer ones dropped -n in favor of nfsiostat)
iostat -n 2

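# 5. Generate IO performance report
#!/bin/bash
echo "=== IO Performance Report $(date) ==="
# -y skips the first report, which holds averages since boot
iostat -xy 1 10 | awk '
    # Return a named column of the current row as a number (0 if the column is absent)
    function val(name) { return (name in col) ? $col[name] + 0 : 0 }
    /^avg-cpu/ { getline; cpu_iowait = $4 + 0; next }
    /^Device/ {
        # Look columns up by header name: positions differ between sysstat versions
        for (i = 1; i <= NF; i++) col[$i] = i
        in_dev = 1; next
    }
    NF == 0 { in_dev = 0; next }
    in_dev {
        if (val("%util") > 80) print "Warning: " $1 " utilization too high: " val("%util") "%"
        if (val("w_await") > 100) print "Warning: " $1 " write latency too high: " val("w_await") "ms"
        if (val("r_await") > 50) print "Warning: " $1 " read latency too high: " val("r_await") "ms"
    }
    END {
        if (cpu_iowait > 20) print "Warning: CPU IO wait too high: " cpu_iowait "%"
    }
'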

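# 6. Historical data collection script
#!/bin/bash
# Long-term IO performance monitoring
while true; do
    # Recompute the name on every pass so the log rolls over at midnight
    LOG_FILE="/var/log/iostat_$(date +%Y%m%d).log"
    echo "=== $(date) ===" >> "$LOG_FILE"
    iostat -xy 1 1 >> "$LOG_FILE"  # -y: report the last second, not averages since boot
    sleep 300  # Record every 5 minutes
done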

Performance Analysis Guide

Health Status Indicators (rules of thumb, not hard limits):

  • %util < 80%: Device utilization is normal
  • iowait < 10%: IO wait time is reasonable
  • await < 20ms: Response time is good
  • aqu-sz < 2: Queue length is moderate (NVMe devices are designed for much deeper queues)

Identifying Performance Bottlenecks:

  • %util > 90%: Disk is nearing saturation
  • iowait > 20%: CPUs sit idle for a large share of the time waiting for IO
  • await > 100ms: IO response time is too long
  • rrqm/s + wrqm/s very low: Too much random IO, low merge rate
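
The thresholds above can be folded into a small helper for use in scripts. This is only a sketch: the function name classify_io and the three labels are invented for this example, and the cut-off values are the rules of thumb from the lists above, so tune them for your own hardware.

```shell
#!/bin/bash
# classify_io UTIL IOWAIT AWAIT_MS AQU_SZ
# Prints "bottleneck", "healthy", or "watch" for one set of readings.
classify_io() {
    awk -v util="$1" -v iowait="$2" -v await="$3" -v aqu="$4" 'BEGIN {
        if (util > 90 || iowait > 20 || await > 100)
            print "bottleneck"      # any single bottleneck threshold exceeded
        else if (util < 80 && iowait < 10 && await < 20 && aqu < 2)
            print "healthy"         # every health indicator satisfied
        else
            print "watch"           # in between: worth keeping an eye on
    }'
}
```

For instance, `classify_io 95 5 30 4` reports a bottleneck because %util is above 90, even though the other readings are moderate.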

🎯 Second Generation: iotop – The Revolution of Process-Level IO Monitoring

Technical Breakthrough

The emergence of iotop fills the gap in process-level IO monitoring in Linux systems. It borrows the interface design concept from the top command but focuses on IO performance analysis, allowing administrators to quickly identify which processes are consuming IO resources.

Installation and Basic Usage

# Install on Ubuntu/Debian
sudo apt install iotop

# Install on CentOS/RHEL  
sudo yum install iotop
# or sudo dnf install iotop

# Requires root privileges
sudo iotop

Interface Interpretation and Features

Total DISK READ :      12.45 M/s | Total DISK WRITE :       8.67 M/s
Actual DISK READ:      12.45 M/s | Actual DISK WRITE:       8.67 M/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO%    COMMAND
   1234    be/4 mysql      8.45 M/s     2.34 M/s  0.00 %  85.20 % mysqld --defaults-file=/etc/mysql/my.cnf
   5678    be/4 root       2.34 M/s     4.56 M/s  0.00 %  45.67 % python3 /opt/backup/backup_script.py
   9012    be/4 www-data   1.23 M/s     1.45 M/s  0.00 %  23.45 % nginx: worker process
   3456    be/4 user       0.43 M/s     0.32 M/s  0.00 %  12.34 % rsync -av /home/user/docs/ /backup/

Interface Elements Explained:

  • Total DISK READ/WRITE: Bandwidth between processes and the kernel block layer
  • Actual DISK READ/WRITE: Bandwidth between the kernel block layer and the hardware (the two can differ because of caching and IO reordering)
  • TID: Thread ID
  • PRIO: IO scheduling priority (be=best effort, rt=real time, idle=idle)
  • IO%: Percentage of time the thread spent waiting on IO
  • COMMAND: Process command line
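
When iotop is not installed, per-process counters can still be read from /proc/<pid>/io. The sketch below is an approximation of what iotop shows, not a reimplementation of it: iotop collects its data through the kernel's taskstats interface, while this example simply diffs read_bytes and write_bytes. The function names read_io_bytes and pid_io_rate are made up for illustration, and reading another user's /proc/<pid>/io normally requires root.

```shell
#!/bin/bash
# read_io_bytes FILE
# Prints "read_bytes write_bytes" from a /proc/<pid>/io style file.
read_io_bytes() {
    awk '$1 == "read_bytes:"  { r = $2 }
         $1 == "write_bytes:" { w = $2 }
         END { print r + 0, w + 0 }' "$1"
}

# pid_io_rate PID [INTERVAL_SECONDS]
# Samples the counters twice and prints the average rate in kB/s.
# INTERVAL_SECONDS must be a whole number.
pid_io_rate() {
    local pid=$1 interval=${2:-1} before after
    before=$(read_io_bytes "/proc/$pid/io")
    sleep "$interval"
    after=$(read_io_bytes "/proc/$pid/io")
    # ${var% *} keeps the first number, ${var#* } keeps the second
    echo "pid=$pid read=$(( (${after% *} - ${before% *}) / 1024 / interval ))kB/s write=$(( (${after#* } - ${before#* }) / 1024 / interval ))kB/s"
}
```

A call such as `sudo bash -c '. ./pidio.sh; pid_io_rate 1234 5'` (with pidio.sh being a hypothetical file holding the functions) would report the five-second average for PID 1234.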

iotop Advanced Usage and Practical Applications

# 1. Show only processes with IO activity
sudo iotop -o

# 2. Show processes instead of threads
sudo iotop -P

# 3. Show cumulative IO statistics
sudo iotop -a

# 4. Set update interval (seconds)
sudo iotop -d 0.5

# 5. Batch mode (suitable for scripts)
sudo iotop -b -n 5

# 6. Show only processes of a specific user
sudo iotop -u mysql

# 7. Interactive mode shortcuts
# o - Toggle to show only processes with IO
# p - Toggle process/thread display mode
# a - Toggle between cumulative/current IO
# r - Reverse sort
# q - Exit

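# 8. IO hotspot process monitoring script
#!/bin/bash
echo "=== IO Hotspot Process Monitoring $(date) ==="
# -k forces kB/s units; -qqq drops the summary and column header lines
sudo iotop -b -n 1 -o -k -qqq | head -20 | awk '
    # Fields: 4 = read rate, 6 = write rate, 10 = IO %, 12 = first word of the command
    # (fields 5, 7, and 11 hold the units)
    $4 + 0 > 0 || $6 + 0 > 0 {
        printf "Process: %-20s Read: %8s K/s Write: %8s K/s IO%%: %6s\n", $12, $4, $6, $10
    }
'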

# 9. Database IO analysis script
#!/bin/bash
# Specifically monitor IO activities related to databases
sudo iotop -b -n 1 | grep -E "(mysql|postgres|mongod|redis)" |
while IFS= read -r line; do
    echo "Database IO: $line"
    # Can add alert logic
done

# 10. Establishing IO performance baseline
#!/bin/bash
# Establish system IO performance baseline
BASELINE_FILE="/tmp/io_baseline_$(date +%Y%m%d_%H%M%S).log"
echo "Establishing IO baseline, monitoring for about 30 minutes..."
for i in {1..360}; do
    echo "=== Sample #${i} $(date) ===" >> "$BASELINE_FILE"
    sudo iotop -b -n 1 -o >> "$BASELINE_FILE"
    sleep 5
done
echo "Baseline data saved to: $BASELINE_FILE"

Fault Diagnosis Practical Cases

# Case 1: Database Performance Issue Diagnosis
echo "=== Database IO Performance Analysis ==="
# -P: one row per process; -k: rates in kB/s so they can be summed; -qqq: no header lines
sudo iotop -b -n 5 -P -k -qqq -u mysql | awk '
    /mysqld/ {
        total_read += $4     # read rate in kB/s (field 5 is the unit)
        total_write += $6    # write rate in kB/s
        samples++
    }
    END {
        if(samples > 0) {
            avg_read = total_read / samples / 1024
            avg_write = total_write / samples / 1024
            print "MySQL Average Read Speed:", avg_read, "MB/s"
            print "MySQL Average Write Speed:", avg_write, "MB/s"
            if(avg_read > 100) print "Warning: MySQL read IO too high"
            if(avg_write > 50) print "Warning: MySQL write IO too high"
        }
    }
'

# Case 2: Backup Task IO Impact Analysis
#!/bin/bash
echo "Analyzing the impact of backup tasks on system IO..."

# Print the "Total DISK READ" rate (value and unit) from one iotop sample
total_read() {
    sudo iotop -b -n 1 | awk -F'[:|]' '/^Total DISK READ/ {gsub(/^[ \t]+|[ \t]+$/, "", $2); print $2; exit}'
}

BEFORE=$(total_read)
echo "System IO before backup: $BEFORE"

# Start backup task
sudo rsync -av /data/ /backup/ &
BACKUP_PID=$!

sleep 10
DURING=$(total_read)
echo "System IO during backup: $DURING"

wait $BACKUP_PID
AFTER=$(total_read)
echo "System IO after backup: $AFTER"

🌐 Third Generation: Modern IO Monitoring Ecosystem

iftop – Visual Monitoring of Network IO

Although primarily used for network monitoring, iftop plays an important role in IO analysis, especially in network storage and distributed system environments.

# Install iftop
sudo apt install iftop  # Ubuntu/Debian
sudo yum install iftop  # CentOS/RHEL

# Monitor network interface
sudo iftop -i eth0

# Display port information
sudo iftop -P

# Do not perform DNS resolution (improves performance)
sudo iftop -n

iftop Interface Interpretation:

                    12.5Kb  25.0Kb  37.5Kb  50.0Kb  62.5Kb
┌───────────────────┴───────┴───────┴───────┴───────┴
server1.example.com  => client1.example.com  1.25Mb  2.34Mb  1.89Mb
                     <=                      890Kb   1.45Mb  1.23Mb
server1.example.com  => client2.example.com  567Kb   890Kb   678Kb
                     <=                      234Kb   456Kb   345Kb
──────────────────────────────────────────────────────────────────
TX:             cum:      15.6MB   peak rate:    3.45Mb
RX:                       8.9MB                  2.34Mb
TOTAL:                   24.5MB                  5.79Mb

nmon – Comprehensive Performance Monitoring Tool

# Install nmon
sudo apt install nmon  # Ubuntu/Debian

# Start nmon
nmon

# Shortcuts in nmon
# d - Disk IO statistics
# n - Network statistics
# m - Memory statistics
# c - CPU statistics
# t - Top processes
# q - Exit

# Data collection mode
nmon -f -s 30 -c 120  # Sample every 30 seconds for 1 hour

dstat – A Modern System Statistics Tool

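# Install dstat (no longer maintained upstream; some distributions ship a compatible replacement under the same name)
sudo apt install dstat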

# Basic usage
dstat

# Detailed IO monitoring
dstat -d -D sda,sdb

# Comprehensive monitoring
dstat -cdngy 5

# Also write the statistics to a CSV file
dstat --output system_stats.csv 5

dstat Output Example:

--total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw 
  2   1  96   1   0|  45k  128k|   0     0 |   0     0 |1.2k  2.5k
  3   2  94   1   0|  67k  156k| 234B  456B|   0     0 |1.4k  2.8k
  1   1  97   1   0|  23k   89k| 123B  234B|   0     0 |1.1k  2.2k

📈 Tool Selection and Usage Strategies

Performance Monitoring Tool Comparison

Tool   | Applicable Scenarios                 | Advantages                               | Disadvantages
iostat | System-level IO monitoring           | Lightweight, comprehensive, long history | No process-level details
iotop  | Process-level IO analysis            | Intuitive, real-time, easy to use        | Requires root privileges
iftop  | Network IO monitoring                | Strong network visualization             | Limited to network
nmon   | Comprehensive performance monitoring | Comprehensive features, graphical        | High learning cost
dstat  | Multi-dimensional statistics         | Flexible, customizable                   | Complex output

Monitoring Strategy Recommendations

Daily Operations Monitoring:

# System health check script
#!/bin/bash
echo "=== System IO Health Check $(date) ==="

# 1. Check overall system IO status
echo "--- System IO Overview ---"
iostat -x 1 3 | tail -10

# 2. Check IO hotspot processes
echo "--- IO Hotspot Processes ---"
sudo iotop -b -n 1 -o | head -10

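# 3. Check disk utilization
# -d limits output to devices (otherwise the CPU row's %idle is mistaken for %util);
# -y skips the since-boot report
echo "--- Disk Utilization Alerts ---"
iostat -dxy 1 1 | awk '$1 != "Device" && $NF + 0 > 80 {print "Warning: " $1 " utilization " $NF "%"}'

# 4. Check IO wait time
echo "--- CPU IO Wait Check ---"
iostat -cy 1 1 | awk '/avg-cpu/ {getline; if($4 + 0 > 10) print "Warning: IO wait time too high " $4 "%"}'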

Performance Bottleneck Diagnosis:

# Comprehensive IO performance diagnosis script
#!/bin/bash
echo "=== Deep IO Performance Diagnosis ==="

echo "1. System IO Load Assessment"
iostat -x 1 5

echo -e "\n2. Process IO Ranking"
sudo iotop -b -n 1 | head -20

echo -e "\n3. Request Queue Size Limits"
# grep prints the file name with each value, so the device is identifiable
grep . /sys/block/*/queue/nr_requests

echo -e "\n4. IO Scheduler Status"
for disk in /sys/block/sd* /sys/block/nvme*; do
    if [ -d "$disk" ]; then
        echo "$(basename "$disk"): $(cat "$disk/queue/scheduler")"
    fi
done

echo -e "\n5. Block Device IO Counters (device, reads completed, writes completed)"
awk '{print $3, $4, $8}' /proc/diskstats

🚀 Advanced IO Optimization Techniques

IO Scheduler Optimization

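# View current IO scheduler (the active one is shown in brackets)
cat /sys/block/sda/queue/scheduler

# Temporarily change IO scheduler (requires root)
# Kernels 5.0 and later offer none, mq-deadline, kyber, and bfq;
# older kernels used noop, deadline, and cfq
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler

# Permanently change (use a udev rule; /etc/fstab cannot set the scheduler)
echo 'ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="mq-deadline"' | sudo tee /etc/udev/rules.d/60-io-scheduler.rules

# Scheduler selection for different scenarios (legacy names in parentheses)
# SSD/NVMe: none or mq-deadline (noop or deadline)
# Traditional hard drives: bfq or mq-deadline (cfq)
# Database servers: mq-deadline (deadline)
# Desktop systems: bfq (cfq)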
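
The scheduler file lists every available scheduler and marks the active one with square brackets, which makes it easy to audit a whole machine. The following is a minimal sketch; active_scheduler and report_schedulers are names chosen for this example, and the report assumes each device exposes queue/scheduler and queue/rotational under /sys/block.

```shell
#!/bin/bash
# active_scheduler
# Reads the contents of a queue/scheduler file on stdin and prints the
# bracketed entry. A line without brackets is printed unchanged.
active_scheduler() {
    sed -e 's/.*\[\([^]]*\)\].*/\1/'
}

# report_schedulers
# Prints one line per block device: name, rotational flag (1 = spinning
# disk, 0 = SSD/NVMe), and the active scheduler.
report_schedulers() {
    local dev
    for dev in /sys/block/*; do
        [ -r "$dev/queue/scheduler" ] || continue
        printf '%s rotational=%s scheduler=%s\n' \
            "$(basename "$dev")" \
            "$(cat "$dev/queue/rotational")" \
            "$(active_scheduler < "$dev/queue/scheduler")"
    done
}
```

Running `report_schedulers` before and after a change is a quick way to confirm that a udev rule was actually applied to every matching device.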

Filesystem Tuning

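# ext4 filesystem optimization
# Note: data=writeback trades safety for speed; recently written files can contain stale data after a crash
mount -o noatime,data=writeback /dev/sda1 /data

# XFS filesystem optimization
mount -o noatime,logbufs=8,logbsize=256k /dev/sda1 /data

# Adjust kernel writeback parameters (requires root; use sysctl or /etc/sysctl.d to persist across reboots)
echo 5 > /proc/sys/vm/dirty_background_ratio
echo 10 > /proc/sys/vm/dirty_ratio
echo 3000 > /proc/sys/vm/dirty_writeback_centisecs  # units are 1/100 s, so this is 30 seconds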
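
To get a feel for what these percentages mean on a given machine, they can be converted into megabytes. The sketch below is an approximation under stated assumptions: the kernel applies the ratios to its own notion of dirtyable memory, which this example stands in for with MemAvailable from /proc/meminfo. The function names dirty_threshold_mb and show_dirty_thresholds are invented for illustration.

```shell
#!/bin/bash
# dirty_threshold_mb MEM_KB RATIO_PERCENT
# Converts a vm.dirty_*_ratio percentage into megabytes of dirty pages.
dirty_threshold_mb() {
    echo $(( $1 * $2 / 100 / 1024 ))
}

# show_dirty_thresholds
# Prints the approximate thresholds for the running system.
show_dirty_thresholds() {
    local avail_kb
    avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
    echo "Background flushing starts near: $(dirty_threshold_mb "$avail_kb" "$(cat /proc/sys/vm/dirty_background_ratio)") MB"
    echo "Writers are throttled near:      $(dirty_threshold_mb "$avail_kb" "$(cat /proc/sys/vm/dirty_ratio)") MB"
}
```

With 16 GiB available, `dirty_threshold_mb 16777216 10` gives 1638, meaning that roughly 1.6 GB of dirty data can pile up before writers are forced to wait. On hosts with slow disks that can translate into long stalls, which is the usual argument for lowering the ratios.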

Application Layer Optimization Recommendations

# Database IO optimization checklist
echo "=== Database IO Optimization Recommendations ==="
echo "1. Check innodb_buffer_pool_size setting"
echo "2. Enable innodb_flush_method=O_DIRECT"
echo "3. Adjust innodb_log_file_size"
echo "4. Consider using SSD storage for redo log"
echo "5. Monitor slow query logs"

# Web application IO optimization
echo "=== Web Application IO Optimization ==="
echo "1. Enable static file caching"
echo "2. Use CDN to distribute static resources"
echo "3. Optimize image and resource sizes"
echo "4. Implement application-level caching"
echo "5. Consider using in-memory databases"

📊 Performance Benchmarking and Capacity Planning

IO Performance Benchmarking

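# Use dd for basic IO testing
# Put the test file on the disk you want to measure; /tmp is often tmpfs (RAM-backed)
echo "=== Disk Write Performance Test ==="
dd if=/dev/zero of=/tmp/testfile bs=1M count=1024 oflag=direct

echo "=== Disk Read Performance Test ==="
dd if=/tmp/testfile of=/dev/null bs=1M iflag=direct

# Remove the 1 GB test file afterwards
rm -f /tmp/testfile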

# Use fio for professional IO testing
sudo apt install fio

# Random read test
fio --name=random-read --rw=randread --size=1G --bs=4k --ioengine=libaio --iodepth=32 --direct=1

# Random write test  
fio --name=random-write --rw=randwrite --size=1G --bs=4k --ioengine=libaio --iodepth=32 --direct=1

# Mixed read/write test
fio --name=mixed-rw --rw=randrw --rwmixread=70 --size=1G --bs=4k --ioengine=libaio --iodepth=32 --direct=1
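
fio reports both IOPS and bandwidth, and the two are tied together by the block size: bandwidth = IOPS x block size. A small sketch of that conversion, handy when comparing a 4k random result against a vendor's sequential figure, is shown here. The helper names iops_to_mibps and mibps_to_iops are made up for this example.

```shell
#!/bin/bash
# iops_to_mibps IOPS BLOCK_SIZE_KIB
# Bandwidth in MiB/s delivered by a given IOPS figure at a given block size.
iops_to_mibps() {
    awk -v iops="$1" -v bs="$2" 'BEGIN { printf "%.2f\n", iops * bs / 1024 }'
}

# mibps_to_iops MIB_PER_SEC BLOCK_SIZE_KIB
# IOPS needed to sustain a given bandwidth at a given block size.
mibps_to_iops() {
    awk -v m="$1" -v bs="$2" 'BEGIN { printf "%d\n", m * 1024 / bs }'
}
```

For example, `iops_to_mibps 50000 4` prints 195.31: a device doing 50,000 random 4k operations per second moves under 200 MiB/s, which is why small-block tests are judged by IOPS and large-block tests by bandwidth.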

Capacity Planning Script

#!/bin/bash
# IO capacity planning analysis
echo "=== IO Capacity Planning Analysis ==="

# Historical IO peak analysis
echo "--- IO Peak of the Past 7 Days ---"
for i in {1..7}; do
    DATE=$(date -d "-$i day" +%Y%m%d)
    if [ -f "/var/log/iostat_$DATE.log" ]; then
        # Read device rows only (between a Device header and the next blank line); %util is the last column
        MAX_UTIL=$(awk '/^Device/ {dev=1; next} NF==0 {dev=0} dev && $NF+0 > max {max=$NF+0} END {print max+0}' "/var/log/iostat_$DATE.log")
        echo "$(date -d "-$i day" +%m/%d): Highest Utilization ${MAX_UTIL}%"
    fi
done

# Growth trend prediction
echo -e "\n--- IO Growth Trend Prediction ---"
# -y skips the since-boot report; adjust the /^nvme/ pattern to match your devices
CURRENT_AVG=$(iostat -dxy 1 10 | awk '/^nvme/ {sum+=$NF; samples++} END {if (samples > 0) print sum/samples; else print 0}')
echo "Current average utilization: ${CURRENT_AVG}%"

# Capacity planning recommendations
if (( $(echo "$CURRENT_AVG > 60" | bc -l) )); then
    echo "Recommendation: Consider upgrading storage system or optimizing IO"
elif (( $(echo "$CURRENT_AVG > 40" | bc -l) )); then
    echo "Recommendation: Monitor closely, prepare expansion plan"
else
    echo "Recommendation: IO capacity is sufficient, continue monitoring"
fi
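
The script above reports the current average but does not project it forward. One simple way to do that, sketched here under the assumption that utilization grows roughly linearly, is a least-squares line through the daily peaks. The function name days_until_threshold is invented for this example, and its input is one peak value per line, oldest first.

```shell
#!/bin/bash
# days_until_threshold LIMIT
# Reads one utilization value per line (oldest first), fits a straight
# line through them, and estimates the days left until LIMIT is reached.
days_until_threshold() {
    awk -v limit="$1" '
        NF { n++; sx += n; sy += $1; sxy += n * $1; sxx += n * n }
        END {
            if (n < 2) { print "need at least two samples"; exit }
            slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
            intercept = (sy - slope * sx) / n
            if (slope <= 0) { print "no upward trend"; exit }
            # Day on which the fitted line crosses the limit, counted from the last sample
            days = (limit - intercept) / slope - n
            if (days < 0) days = 0
            printf "slope=%.2f days_until_%d=%.0f\n", slope, limit, days
        }'
}
```

Feeding it five daily peaks of 40, 42, 44, 46, and 48 with `days_until_threshold 80` yields a slope of 2 points per day and 16 days of headroom. Real workloads are rarely this linear, so treat the result as a prompt to investigate rather than a forecast.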

🔍 Fault Diagnosis Practical Cases

Case One: Database Slow Query Issue

# Problem Phenomenon: MySQL query suddenly slows down
echo "=== Database Slow Query IO Diagnosis ==="

# 1. Check overall system IO status
echo "--- System IO Status ---"
iostat -x 1 5

# 2. Locate MySQL process IO usage
echo "--- MySQL IO Usage Analysis ---"
sudo iotop -b -n 5 -u mysql

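# 3. Check disk utilization related to MySQL
echo "--- Data Directory Disk Status ---"
df -h /var/lib/mysql
# Resolve the device backing the data directory (e.g. /dev/sda1 becomes sda1);
# -p ALL makes iostat list partitions as well as whole disks
MYSQL_DEV=$(basename "$(df -P /var/lib/mysql | awk 'NR==2 {print $1}')")
iostat -x -p ALL 1 3 | grep -w "$MYSQL_DEV"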

# 4. Analyze the correlation between slow queries and IO
echo "--- Slow Query and IO Correlation Analysis ---"
sudo tail -100 /var/log/mysql/mysql-slow.log | grep -A 5 -B 5 "Query_time"

Case Two: Slow System Startup

# System Startup IO Bottleneck Analysis
echo "=== Startup Process IO Analysis ==="

# 1. List the services that took longest to start
systemd-analyze blame | head -10

# 2. Check system service IO usage
sudo iotop -b -n 10 | grep systemd

# 3. Review kernel storage messages logged during startup
echo "--- Storage Messages During Startup ---"
dmesg | grep -i "ata\|scsi\|nvme" | tail -20

# 4. Optimization Recommendations
echo "=== Optimization Recommendations ==="
echo "1. Check startup services"
echo "2. Consider using SSD"
echo "3. Optimize filesystem mount options"
echo "4. Adjust IO scheduler"

Case Three: Backup Task Impact on Performance

# Backup Task IO Impact Assessment
echo "=== Backup Task IO Impact Analysis ==="

# Monitor IO status before, during, and after backup
backup_io_monitor() {
    local phase=$1
    echo "--- $phase IO Status ---"
    iostat -x 1 3 | tail -10
    echo "--- $phase Hotspot Processes ---"
    sudo iotop -b -n 1 -o | head -10
    echo "--- $phase System Load ---"
    uptime
    echo ""
}

echo "Starting to monitor the impact of backup tasks on IO..."
backup_io_monitor "Before Backup"

# Start backup task (example)
echo "Starting backup task..."
# rsync -av --progress /data/ /backup/ &
# BACKUP_PID=$!

sleep 30
backup_io_monitor "During Backup"

# wait $BACKUP_PID
sleep 30
backup_io_monitor "After Backup"

echo "=== Optimization Recommendations ==="
echo "1. Use ionice to lower backup process IO priority"
echo "2. Perform backups during off-peak hours"
echo "3. Use incremental backups to reduce IO"
echo "4. Consider using a dedicated backup network"
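
The first recommendation can be tried with the ionice utility from util-linux. The wrapper below is a sketch: run_low_io_priority is a made-up name, and the approach assumes an IO scheduler that honors priorities (BFQ does; with none the setting has little or no effect).

```shell
#!/bin/bash
# run_low_io_priority COMMAND [ARGS...]
# Runs COMMAND in the "idle" IO class (-c3), so it only gets disk time
# when no other process is asking for it.
run_low_io_priority() {
    (
        # Lower the priority of this subshell; best effort, so the command
        # still runs if ionice is missing or the request is refused
        ionice -c3 -p "$BASHPID" 2>/dev/null || true
        # Replace the subshell with the command, which inherits the IO class
        exec "$@"
    )
}
```

A backup would then be started as `run_low_io_priority rsync -av /data/ /backup/`. The wrapper returns the exit status of the command it ran, so it can be dropped into existing scripts without changing their error handling.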

🎯 Future Outlook: IO Monitoring in the Cloud-Native Era

IO Monitoring in Container Environments

# Docker Container IO Monitoring
docker stats --format "table {{.Container}}\t{{.BlockIO}}\t{{.MemUsage}}" --no-stream

# K8s Pod Resource Monitoring (kubectl top reports CPU and memory only, not disk IO)
kubectl top pods --sort-by=cpu
kubectl describe node | grep -A 5 "Allocated resources"

# cAdvisor Integrated Monitoring
curl http://localhost:8080/api/v1.3/containers/
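
On hosts that use cgroup v2, per-container block IO counters are also available straight from the cgroup's io.stat file, without any extra tooling. The parser below is a minimal sketch; parse_io_stat is a name chosen for this example, and the cgroup path in the usage note is an assumption that depends on the container runtime and cgroup driver.

```shell
#!/bin/bash
# parse_io_stat
# Reads cgroup v2 io.stat lines on stdin (one per device, in the form
# "MAJ:MIN rbytes=... wbytes=... rios=... wios=... ...") and prints the
# cumulative bytes read and written per device in MiB.
parse_io_stat() {
    awk '{
        r = w = 0
        for (i = 2; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "rbytes") r = kv[2]
            if (kv[1] == "wbytes") w = kv[2]
        }
        printf "%s read_MiB=%.2f write_MiB=%.2f\n", $1, r / 1048576, w / 1048576
    }'
}
```

With Docker and the systemd cgroup driver, the file typically lives under a path like /sys/fs/cgroup/system.slice/docker-<container-id>.scope/io.stat, so usage would look like `parse_io_stat < /sys/fs/cgroup/system.slice/docker-<container-id>.scope/io.stat`. The values are cumulative since the cgroup was created, so take two readings and subtract to get a rate.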

Cloud Storage IO Monitoring

  • AWS CloudWatch: EBS volume IO monitoring
  • Azure Monitor: Disk performance metrics
  • Google Cloud Monitoring: Persistent disk monitoring
  • Prometheus + Grafana: Self-built monitoring stack

Emerging Technology Trends

  • NVMe over Fabrics: Networked NVMe storage
  • Storage Class Memory: Persistent memory that sits between DRAM and SSDs in speed and cost
  • AI-driven Performance Optimization: Intelligent IO scheduling
  • Edge Computing Storage: Distributed storage architecture

📝 Summary and Best Practices

Linux disk IO monitoring is a key aspect of system performance optimization. From traditional iostat to modern visualization tools, each tool has its unique value and application scenarios.

Monitoring Strategy Recommendations

  1. Layered Monitoring: System-level → Process-level → Application-level
  2. Key Metrics: Utilization, wait time, queue depth, IOPS
  3. Alert Mechanism: Set reasonable thresholds and notification methods
  4. Historical Analysis: Establish performance baselines, trend analysis
  5. Continuous Optimization: Regularly evaluate and adjust monitoring strategies

Recommended Tool Combinations

  • Daily Monitoring: iostat + iotop
  • In-depth Analysis: nmon + dstat
  • Automation: Script integration + alerts
  • Visualization: Grafana + Prometheus
  • Fault Diagnosis: Comprehensive use of all tools

Remember, optimizing IO performance is not a one-time process; it requires continuous monitoring, analysis, and tuning. Master these monitoring tools to handle IO performance issues with ease and become a true Linux performance optimization expert!

What IO performance issues have you encountered in your daily work? Share your experiences and solutions in the comments!
