Linux Performance Tuning: A Comprehensive Guide to Troubleshooting Memory Issues

1. Introduction

In previous articles, we have discussed the principles of memory, the use of swap, types of caches, and tools for diagnosing memory leaks. Now, we have a basic understanding of memory. When encountering memory issues in work or study, how should we troubleshoot or where should we start? This article mainly discusses the thought process for troubleshooting memory issues.

If you haven’t read the previous articles, it is recommended to check them out first:

Linux Performance Tuning: About Memory

Linux Performance Tuning: Why Swap Usage Increased

Linux Performance Tuning: Understanding Caches in Memory

Linux Performance Tuning: How to Quickly Locate and Handle Memory Leaks?

Linux Performance Tuning: Detailed Usage of Memory Analysis Tools memleak-bpfcc and valgrind

Generally, when we suspect a memory issue, the first indication will come from memory-related alarm metrics, which reflect the problem immediately. With these metrics, we can choose appropriate tools for further analysis. After in-depth analysis with the tools, we can usually quickly determine where the problem lies and take action.

2. Memory Metrics

Based on the previous descriptions, the memory metrics we need to pay attention to are as follows:System Memory Metrics: Total memory, free memory, used memory, buffer/cache usage, cache hit rate, page faults, slab reclaimable memory, etc.Process Memory Metrics: Virtual memory (VIRT), resident memory/physical memory (RES/RSS), shared memory (SHR), process memory usage percentage (%MEM), exclusive memory (approximately RES – SHR), virtual memory size used by the process (VSZ), etc.Swap Memory: Total memory, used memory, remaining memory, swap in/out speed.

3. Viewing Tools

Having understood the above metrics, we need corresponding tools to view accurate data in a timely manner to determine where the problem lies.

root@test:~# free -wh
              total        used        free      shared     buffers       cache   available
Mem:          1.0Ti        48Gi       803Gi       3.0Mi       2.0Gi       152Gi       954Gi
Swap:          39Gi          0B        39Gi

The free command is the most commonly used tool to check memory, allowing you to quickly see memory usage, how much is used, and how much is remaining.

Column meanings: total: Total memory used: Amount of memory used (used = total – free – buffers – cache) free: Amount of memory not in use shared: Total memory shared by multiple processes (usually 0 or very small) buffers: Amount of memory used by kernel buffers cache: Amount of memory used for page cache and slab available: Estimated amount of memory available for starting new applications (excluding swap space)

top - 11:34:44 up 526 days, 20:11,  1 user,  load average: 0.84, 0.92, 1.04
Tasks: 1343 total,   1 running, 1342 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.4 us,  0.0 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 1031349.+total, 823056.8 free,  50146.1 used, 158146.7 buff/cache
MiB Swap:  40960.0 total,  40960.0 free,      0.0 used. 976896.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                    
   4095 mysql     20   0 6189264   1.6g 550648 S  21.5   0.2 204441:48 mysqld                                                                                                                                    
   2732 root      20   0  141524  34936   9324 S  17.2   0.0 194545:07 node_exporter                                                                                                                              
1129876 root      20   0 7095304   1.1g  54208 S   5.6   0.1   1770:30 prometheus                                                                                                                                  
2936909 root      20   0 5411276 231104  56112 S   2.0   0.0  19108:45 wsssr_defence_s

The top command is also an important tool for viewing memory, allowing you to see real-time changes in memory. For more detailed information, you can use the htop command, which needs to be installed separately.

1) Fourth line (Memory): MiB Mem : Total memory. free: Free memory in MiB. used: Used memory in MiB. buff/cache: Memory used for buffers and caches in MiB. 2) Fifth line (Swap): MiB Swap: Total swap space. free: Free swap space in MiB. used: Used swap space in MiB. avail Mem: Estimated available memory (including unused and reclaimable from buffers and caches) in MiB. 3) Real-time process display PID: Process ID. USER: Process owner. PR: Priority. NI: Nice value, negative values indicate high priority, positive values indicate low priority. VIRT: Virtual memory usage (in KiB). RES: Resident memory size (in KiB). SHR: Shared memory size (in KiB). S: Process state (S=sleeping, R=running, etc.). %CPU: CPU usage percentage. %MEM: Memory usage percentage. TIME+: Total CPU time used by the process, accurate to one-hundredth of a second. COMMAND: Process name.

root@test:~# ps -aux |head -5
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.1  0.0 171128 13972 ?        Ss    2024 1505:02 /sbin/init nopti
root           2  0.0  0.0      0     0 ?        S     2024   8:41 [kthreadd]
root           3  0.0  0.0      0     0 ?        I<    2024   0:00 [rcu_gp]
root           4  0.0  0.0      0     0 ?        I<    2024   0:00 [rcu_par_gp]

The ps command can be used to view the memory usage of processes.

Column meanings are as follows: USER: Owner of the process (which user started it) PID: Process ID %CPU: Percentage of CPU used by the process %MEM: Percentage of memory used by the process VSZ: Size of virtual memory used by the process (in KB) RSS: Size of physical memory used by the process (in KB) TTY: Terminal on which the process is running (if ?, it means not associated with a terminal) STAT: Process state (e.g., S=sleeping, R=running, Z=zombie, etc.) START: Process start time TIME: Total CPU time used by the process COMMAND: Command that started the process

root@test:~# cat /proc/meminfo
...
Slab:           24081144 kB
SReclaimable:   18618148 kB
SUnreclaim:      5462996 kB
...
AnonHugePages:   3555328 kB
...
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

/proc/meminfo is an important location for some special memory metrics, allowing you to view slab kernel memory and configured huge page memory.

SReclaimable: Reclaimable slab, kernel cache that can be reclaimedSUnreclaim: Unreclaimable slab, memory that must be retained by the kernelAnonHugePages: Anonymous huge pages, memory used by transparent huge pagesHugePages_Total: Number of free huge pages, number of unused huge pagesHugePages_Free: Total number of huge pages, number of configured huge pagesHugepagesize: Size of each huge page, 2MB per huge page

root@test:~# vmstat 1 2
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 842544896 2081680 160133632    0    0     0     4    0    0  1  0 99  0  0
 2  0      0 842544192 2081680 160133744    0    0     0    92 3923 5803  2  0 98  0  0

The vmstat tool can be used to view the dynamic changes in memory metrics, where 1 indicates the time interval and 2 indicates the number of times.

procs (Processes): r: Number of processes in the run queue (processes running and waiting for CPU) b: Number of processes waiting for I/O (processes in uninterruptible sleep state) memory (Memory, in KB): swpd: Size of used virtual memory (swap space, in KB) free: Size of free physical memory (in KB) buff: Size of memory used as buffers (in KB) cache: Size of memory used as caches (in KB) swap (Swap space, in KB/sec): si: Amount of data read into swap space from disk (KB/sec) so: Amount of data written from swap space to disk (KB/sec) io (I/O, in blocks/sec): bi: Number of blocks received from block devices (blocks/sec, reading from disk) bo: Number of blocks sent to block devices (blocks/sec, writing to disk) system (System): in: Number of interrupts per second (including clock interrupts) cs: Number of context switches per second cpu (CPU usage, percentage): us: Percentage of CPU time used by user processes sy: Percentage of CPU time used by kernel processes id: Percentage of idle CPU time wa: Percentage of CPU time waiting for I/O st: Percentage of CPU time stolen (usually in virtualized environments, CPU time occupied by other virtual machines)

root@test:~# sar -r 1 3
Linux 5.4.0-59-generic (jnai1asan01)    11/05/2025      _x86_64_        (128 CPU)

02:36:42 PM kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
02:36:43 PM 842489592 1000321788  45882776      4.34   2081692 141547176  53245784      4.85 142730504  44375344      1200
02:36:44 PM 842489340 1000321652  45883056      4.34   2081692 141547236  53245784      4.85 142730656  44375460      1260
02:36:45 PM 842489772 1000322360  45882228      4.34   2081692 141547508  53245784      4.85 142730568  44375676      1536

The sar -r command can be used to view the specific situation of system memory, focusing on cache usage and page accounting.

kbmemfree: Amount of free physical memory (KB). kbavail: Amount of available memory (KB), this is an estimate that includes free memory and reclaimable cache (excluding swap space). kbmemused: Amount of used physical memory (KB). Note that this value may include memory used by the kernel, including cache and buffers. %memused: Percentage of used memory. kbbuffers: Amount of memory occupied by kernel buffers (KB). kbcached: Amount of memory occupied by kernel cached data (KB). kbcommit: Total amount of memory required for the current workload (KB). This is an estimate that includes used memory and memory not yet used but expected to be needed (such as malloc allocations that have not actually been used). %commit: Percentage of kbcommit to total memory (including swap space). kbactive: Amount of active memory (KB), i.e., memory that has been accessed recently. kbinact: Amount of inactive memory (KB), i.e., memory that has not been accessed recently and may be reclaimed for other uses. kbdirty: Amount of dirty memory waiting to be written back to disk (KB).

In addition to the above common tools, there are also cachetop (to view process cache and buffer hit rate information), pmap (to view the distribution of specific processes). Additionally, important memory leak checking tools include memleak-bpfcc and valgrind.

4. Memory Issue Troubleshooting Approach

Now that we understand the relevant memory metrics and how to obtain these metrics using tools, it is clear that these metrics do not exist in isolation; they are interrelated. This also indicates that the underlying principles of memory in the system are fixed, and various tools reflect memory usage from different perspectives.

When a memory issue is detected, we cannot simply use tools to check all metrics one by one, as this would waste time and be unnecessary. Based on the characteristics of memory metrics, analyzing memory issues from a global to local perspective and in a layered manner would be a better choice. Let’s summarize a good troubleshooting approach.

First, the most direct manifestation of a memory issue is excessive usage that does not meet expectations. Generally, we first use the free and top commands to check the overall memory usage. It is not necessarily a problem if the remaining memory decreases; it could also be due to cache usage. When we find that a lot of memory is occupied by cache, we can use vmstat or sar to observe the trend of cache changes to confirm whether there is unreasonable growth. If the cache continues to increase, we can use cachetop or slabtop, as well as /proc/meminfo, to see which processes are occupying these caches. After identifying the processes, we can use pmap to analyze the memory usage within those processes.

Then, if cache usage is normal, we should check whether swap has been utilized. If it has, it may indicate insufficient memory configuration. We can use vmstat to observe the changes in swap; if swap is constantly being swapped in and out, it indicates that memory configuration is indeed insufficient. Additionally, there may be processes whose memory usage is increasing significantly, leading to overall memory usage growth. We can use top and ps tools to find and analyze the relevant processes.

Furthermore, if memory growth cannot be alleviated and continues to increase, resulting in OOM (Out of Memory) situations, we can conclude that there may be processes experiencing memory overflow. At this point, we can follow the previous steps to identify the relevant processes and then use memleak-bpfcc and valgrind to analyze the leaking processes and their stack traces.

Finally, sometimes memory issues may not be caused by programs but could also be due to hardware failures, such as network card failures leading to abnormal data transmission, which in turn raises memory cache. Alternatively, disk I/O read/write anomalies may cause significant increases in memory usage by processes. We need to make a comprehensive judgment. Future articles will continue to expand on the impact of CPU, disk, and network on performance. Stay tuned!

Leave a Comment