Introduction
Starting with a painful downtime incident.
The response time of the online business system skyrocketed from 200ms to 8 seconds, yet the CPU usage (65%), memory (30% free), and disk I/O appeared “normal”.
The core issue: judging system health from a single metric ignores how complex Linux performance is. It behaves like a symphony in which CPU, memory, and I/O have to be read together.
Background Explanation
Why is performance tuning so important?
The Real Cost of Performance Issues
For every additional second in page load time, the conversion rate drops by 7%; 53% of mobile users will abandon a page that takes more than 3 seconds to load; severe performance failures can lead to millions in business losses.
Typical Performance Bottleneck Scenarios
- Traffic surges during major e-commerce promotions (Double 11/618, when traffic rises to 10-20 times the normal level)
- Slow database queries causing avalanches (a single unoptimized SQL query can bring down the system)
- Memory leaks as a chronic poison (Full GC and out-of-memory errors in Java applications)
- I/O bottlenecks as invisible killers (performance drops sharply during log writing and data backups)
Core Methodology
Three-step positioning method.
Step One: Global Scan
Quick diagnosis in 10 minutes.
The Golden Three Commands (can be wrapped in an alias or shell function named health):
- uptime: check load trends
- dmesg | tail: check recent kernel messages (for example OOM kills or I/O errors)
- vmstat 1: check overall resource usage
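One possible way to wrap the three commands is a small shell function rather than a plain alias, so that the sampling is bounded. This is only a sketch: the name health comes from the text above, while the 20-line tail and the five vmstat samples are arbitrary assumptions.

```shell
# health: one-shot triage wrapper (sketch; line and sample counts are assumptions).
health() {
    echo "== uptime =="
    uptime
    echo "== last kernel messages =="
    # dmesg may require root when kernel.dmesg_restrict=1
    dmesg 2>/dev/null | tail -n 20
    echo "== vmstat, 5 one-second samples =="
    vmstat 1 5
}
```

Placed in a shell startup file such as ~/.bashrc, it can then be run simply as health.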
Load Judgment Techniques:
- 1-minute load > 5-minute > 15-minute: problem worsening
- 15-minute load > 5-minute > 1-minute: problem alleviating
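The comparison above can be sketched as a small helper. load_trend is a hypothetical name, and the "steady-or-mixed" label for every other ordering is an assumption added for completeness; with no arguments it reads the three averages from /proc/loadavg.

```shell
# load_trend [m1 m5 m15]: classify the direction of the load averages.
load_trend() {
    if [ "$#" -lt 3 ]; then
        # /proc/loadavg starts with the 1-, 5- and 15-minute averages
        read -r m1 m5 m15 rest < /proc/loadavg
        set -- "$m1" "$m5" "$m15"
    fi
    awk -v m1="$1" -v m5="$2" -v m15="$3" 'BEGIN {
        if (m1 + 0 > m5 + 0 && m5 + 0 > m15 + 0)
            print "worsening"
        else if (m15 + 0 > m5 + 0 && m5 + 0 > m1 + 0)
            print "easing"
        else
            print "steady-or-mixed"
    }'
}
```

For example, load_trend 8.2 4.1 2.0 reports a worsening trend, because the most recent minute is the busiest.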
Step Two: Layered Deep Dive
Precisely locate bottlenecks.
CPU Bottleneck Positioning
Three Analysis Tools:
- top: check overall CPU usage
- mpstat -P ALL 1: check usage of each CPU core
- pidstat -u 1: check CPU usage by process
Real case: an 8-core server showed total CPU usage of only about 12.5% yet responded slowly → mpstat revealed a single core at 100% utilization, which is exactly one eighth of total capacity (single-threaded program bottleneck).
Solution: Use taskset to bind CPU cores; refactor the program to be multi-threaded.
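When mpstat is not installed, the same per-core imbalance can be derived from two snapshots of /proc/stat. The following is a minimal sketch, assuming the usual field order on each cpuN line (user, nice, system, idle, iowait, irq, softirq, steal); core_busy is a hypothetical helper name.

```shell
# core_busy BEFORE AFTER: per-core busy % between two /proc/stat snapshots.
core_busy() {
    awk '
        /^cpu[0-9]+ / {
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i   # user..steal
            idle = $5 + $6                                    # idle + iowait
            if (FNR == NR) {            # first file: remember the baseline
                t0[$1] = total; i0[$1] = idle
                next
            }
            if (!($1 in t0)) next       # core missing from the baseline
            dt = total - t0[$1]
            di = idle - i0[$1]
            if (dt > 0) printf "%s %.1f\n", $1, 100 * (dt - di) / dt
        }
    ' "$1" "$2"
}
```

A possible invocation: cat /proc/stat > /tmp/s1; sleep 1; cat /proc/stat > /tmp/s2; core_busy /tmp/s1 /tmp/s2. One core staying near 100 while the others sit idle points to a single-threaded hot spot, and taskset -cp <core> <pid> can then pin that process to a chosen core.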
Memory Bottleneck Positioning
Combination Analysis:
- free -h: check memory overview
- cat /proc/meminfo: detailed memory information
- slabtop: kernel memory usage
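Pitfalls to Avoid: a low "free" value in the output of free ≠ insufficient memory (Linux uses otherwise idle memory for buffers and page cache and reclaims it on demand).
Correct Judgment: Available memory ≈ free + buffers + cache (on kernel 3.14 and later, prefer the MemAvailable field in /proc/meminfo, shown as the "available" column of free); sar -r 1 to check memory trends; sar -W 1 to check swap-in/swap-out trends (frequent swapping indicates insufficient memory). Note that lowercase sar -w reports task creation and context switches, not swapping.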
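The available-memory estimate can be scripted directly against /proc/meminfo. This is a sketch under the assumption that the kernel's own MemAvailable figure is preferred when present, with the free + buffers + cache sum as the fallback; mem_available_kb is a hypothetical helper name.

```shell
# mem_available_kb [FILE]: estimate reclaimable-plus-free memory in kB.
mem_available_kb() {
    awk '
        $1 == "MemAvailable:" { avail = $2; have = 1 }
        $1 == "MemFree:"      { free = $2 }
        $1 == "Buffers:"      { buffers = $2 }
        $1 == "Cached:"       { cached = $2 }
        END {
            if (have) print avail                  # kernel estimate
            else      print free + buffers + cached  # rough fallback
        }
    ' "${1:-/proc/meminfo}"
}
```

Called without arguments it reads the live /proc/meminfo; the optional file argument only exists so that the logic can be checked against saved copies.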
Optimization Techniques:
- Adjust swappiness (a value of 10 or lower is recommended): echo 10 > /proc/sys/vm/swappiness
- Clear cache (use with caution, since performance dips until the cache is repopulated): sync && echo 3 > /proc/sys/vm/drop_caches
- Huge page memory optimization: echo 2048 > /proc/sys/vm/nr_hugepages
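Writes to /proc/sys take effect immediately but are lost on reboot, so it can help to review the intended changes before applying them. Below is a hedged sketch of a dry-run wrapper around sysctl -w; tune_vm and the APPLY switch are hypothetical, and the default values simply repeat the ones from the list above.

```shell
# tune_vm [SWAPPINESS] [HUGEPAGES]: print the sysctl commands (dry run).
# Set APPLY=1 and run as root to actually execute them.
tune_vm() {
    for kv in "vm.swappiness=${1:-10}" "vm.nr_hugepages=${2:-2048}"; do
        if [ "${APPLY:-0}" = "1" ]; then
            sysctl -w "$kv"
        else
            echo "sysctl -w $kv"
        fi
    done
}
```

To make the settings survive a reboot, the same key=value pairs would normally go into a file under /etc/sysctl.d/.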
I/O Bottleneck Positioning
Analysis Tools:
- iostat -x 1: disk I/O statistics
- blktrace: I/O tracing tool
Key Metric Interpretation:
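- %util: sustained near 100% → disk saturation (on SSDs and RAID arrays that serve requests in parallel, %util alone can overstate saturation, so read it together with await)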
- await: average wait time exceeding 10ms needs attention
- r_await/w_await: read/write latency, to determine if it’s a read/write issue
Real case: MySQL server %util at 50% but await at 200ms → large number of random small I/Os causing issues → adjusted innodb_flush_method + increased SSD cache.
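The metric checks above can be sketched as a filter over iostat -x output. Because column positions differ between sysstat versions, this sketch looks columns up by header name and assumes the r_await, w_await and %util columns are present; io_flags and its default thresholds (10 ms latency, 90% utilization) are illustrative assumptions.

```shell
# io_flags [LAT_MS] [UTIL_PCT]: read `iostat -x` text on stdin and print
# devices whose read/write latency or utilization exceeds the thresholds.
io_flags() {
    awk -v lat="${1:-10}" -v util="${2:-90}" '
        $1 ~ /^Device/ {                 # header row: map names to columns
            for (i = 1; i <= NF; i++) col[$i] = i
            in_dev = 1
            next
        }
        NF == 0 { in_dev = 0; next }     # blank line ends the device table
        in_dev && ("r_await" in col) && ("w_await" in col) && ("%util" in col) {
            r = $(col["r_await"]); w = $(col["w_await"]); u = $(col["%util"])
            if (r + 0 > lat + 0 || w + 0 > lat + 0 || u + 0 > util + 0)
                printf "%s r_await=%s w_await=%s util=%s\n", $1, r, w, u
        }
    '
}
```

A possible invocation is LC_ALL=C iostat -x 1 2 | io_flags 10 90, where LC_ALL=C keeps the decimal separator as a dot. A device like the MySQL disk in the case above would be flagged on latency even though its %util is only 50%.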
Step Three: Comprehensive Tuning
Systematic solutions.
Core Parameter Optimization Checklist:
- Network optimization (high concurrency scenarios): adjust net.ipv4.tcp_max_syn_backlog and other kernel parameters
- File handle limits: set soft nofile / hard nofile (the maximum number of open file handles per process) in /etc/security/limits.conf
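Before changing any of these parameters it is worth recording the current values. A read-only sketch follows; show_limits and nofile_at_least are hypothetical helper names, and the 65535 figure in the comment is an assumed example rather than a recommendation from the text.

```shell
# show_limits: print the current values of the checklist parameters.
show_limits() {
    echo "nofile soft: $(ulimit -n)"
    echo "nofile hard: $(ulimit -Hn)"
    if [ -r /proc/sys/net/ipv4/tcp_max_syn_backlog ]; then
        echo "tcp_max_syn_backlog: $(cat /proc/sys/net/ipv4/tcp_max_syn_backlog)"
    fi
}

# nofile_at_least N: succeed when the soft open-file limit is >= N.
# limits.conf entries have the form "* soft nofile 65535" (assumed value).
nofile_at_least() {
    cur="$(ulimit -n)"
    [ "$cur" = "unlimited" ] || [ "$cur" -ge "$1" ]
}
```

For instance, nofile_at_least 65535 || echo "raise nofile" can serve as a pre-deployment check.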
Experience Sharing
My tuning toolbox.
- Establishing Performance Baselines: Use sar to establish a 7×24 hour performance baseline (run /usr/lib64/sa/sa1 1 1 from cron every minute to record one sample each time; sa2 -A generates the daily report)
- Automated Alert Scripts: Monitor load average and automatically collect top / iostat data when thresholds are exceeded
- Stress Testing and Validation: Use stress (CPU/memory stress testing) and fio (I/O stress testing) to validate tuning effects
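The alerting idea in the toolbox can be sketched as follows. This is a minimal illustration, not a finished monitoring script: load_exceeds and collect_if_overloaded are hypothetical names, and the default output directory is an assumption.

```shell
# load_exceeds LOAD THRESHOLD: succeed when LOAD > THRESHOLD (floats allowed).
load_exceeds() {
    awk -v l="$1" -v t="$2" 'BEGIN { exit !(l + 0 > t + 0) }'
}

# collect_if_overloaded THRESHOLD [OUTDIR]: snapshot top and iostat
# when the 1-minute load average is above THRESHOLD.
collect_if_overloaded() {
    threshold="$1"
    outdir="${2:-/tmp/perf-snapshots}"      # assumed default location
    read -r load1 rest < /proc/loadavg
    load_exceeds "$load1" "$threshold" || return 0
    mkdir -p "$outdir"
    stamp="$(date +%Y%m%d-%H%M%S)"
    top -b -n 1 > "$outdir/top-$stamp.txt" 2>&1
    if command -v iostat >/dev/null 2>&1; then
        iostat -x 1 2 > "$outdir/iostat-$stamp.txt" 2>&1
    fi
    return 0
}
```

Run from cron every minute with a threshold chosen relative to the number of cores, this leaves timestamped evidence behind for each overload episode.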
Trends and Extensions
The future of performance tuning.
- eBPF, a revolution in performance analysis: finer-grained performance monitoring without modifying application code (e.g., using bpftrace to trace system call latency)
- Intelligent Operations: Combining machine learning for performance prediction, automatic tuning, and root cause analysis of anomalies
- Challenges in Cloud-Native Environments: New dimensions such as container resource limits, container network performance, and K8s scheduling optimization
Conclusion
Continuous optimization, never-ending.
Performance tuning is not a one-time task, but a cycle of “establishing monitoring → setting baselines → continuous optimization → validating effects”.