Reflections and Insights on Performance Testing

Introduction: In actual work, we often encounter situations where hardware materials are replaced, among which eMCP is quite common. This includes the selection of materials at the beginning of a project, comparing multiple materials for testing, or during project iterations when old materials are no longer produced, verifying whether new materials can meet requirements. Here, the author summarizes methods and thoughts based on personal work experience and learning during the testing process.

1. About eMCP

eMCP is a more advanced storage device compared to eMMC. It integrates eMMC and LPDDR into one package, reducing size while also minimizing circuit link design, primarily used in smartphones priced above a thousand yuan.

2. Performance Testing and Reflections on Issues

There are many methods for testing IO read and write speeds, such as: the commonly used dd command under Linux, iozone, fio, and H2testw under Windows. Below, we introduce several common testing tools, the issues encountered during their use, thoughts, applicable scenarios, and analysis methods.

Using dd Command and Reflections on Issues

The dd command is a commonly used command for testing IO, with the advantage of being convenient and quick, requiring no download or push of testing tools, allowing for quick preliminary assessments. Thus, it can be used to test the sequential read and write capabilities of disks. It can write to files and raw devices.

/mnt/sdcard # dd --helpUsage: dd [OPERAND]...  or:  dd OPTIONCopy a file, converting and formatting according to the operands.  bs=BYTES        read and write up to BYTES bytes at a time (default: 512);                  overrides ibs and obs  cbs=BYTES       convert BYTES bytes at a time  conv=CONVS      convert the file as per the comma separated symbol list  count=N         copy only N input blocks  ibs=BYTES       read up to BYTES bytes at a time (default: 512)  if=FILE         read from FILE instead of stdin  iflag=FLAGS     read as per the comma separated symbol list  obs=BYTES       write BYTES bytes at a time (default: 512)  of=FILE         write to FILE instead of stdout  oflag=FLAGS     write as per the comma separated symbol list  seek=N          skip N obs-sized blocks at start of output  skip=N          skip N ibs-sized blocks at start of input  status=LEVEL    The LEVEL of information to print to stderr;                  'none' suppresses everything but error messages,                  'noxfer' suppresses the final transfer statistics,                  'progress' shows periodic transfer statistics

Detailed Explanation:

if=xxx Read from xxx, e.g., if=/dev/zero, this device endlessly provides 0 (does not generate read disk IO).
of=xxx Write to xxx, can write to files or raw devices. For example, of=/dev/null, the “black hole”, equivalent to a write-only file. All content written to it will be lost forever (does not generate write disk IO).
bs=1M Size of each read or write, i.e., block size.
count=xxx Total number of blocks to read/write.

Familiarizing with Two Special Devices:

/dev/null: Recycle bin, bottomless pit.
/dev/zero: Generates characters.

Based on the –help prompt, we can summarize a commonly used testing template:

# Write Speed dd if=/dev/zero of=/sdcard/test bs=1M count=1000
# Read Speed dd if=/sdcard/test of=/dev/null bs=1M count=1000

Preliminary Testing

# Write Speed Test 1GB/mnt/sdcard # dd if=/dev/zero of=/sdcard/test bs=1M count=10001000+0 records in1000+0 records out1048576000 bytes (1.0 GB, 1000 MiB) copied, 89.1365 s, 11.8 MB/s
# Write Speed Test 100M/mnt/sdcard # dd if=/dev/zero of=/sdcard/test bs=1M count=100100+0 records in100+0 records out104857600 bytes (105 MB, 100 MiB) copied, 3.11532 s, 33.7 MB/s

Here, a significant discrepancy in test speed was discovered.

Reflections on Issues

Issue 1: For the same device, what is the actual test result we need when the speed difference between the 1GB and 100MB tests is too large?

To address this issue, we first need to understand the memory caching mechanism. Simply put, it is whether the dd command has allowed the system to actually write the file to the disk before completion.

Then we can conduct the following test:

/mnt/sdcard # free -h              total        used        free      shared  buff/cache   availableMem:           958M        257M        505M        2.0M        196M        684MSwap:          255M          0B        255M/mnt/sdcard # dd if=/dev/zero of=/sdcard/test bs=1M count=100100+0 records in100+0 records out104857600 bytes (105 MB, 100 MiB) copied, 1.88379 s, 55.7 MB/s/mnt/sdcard # free -h              total        used        free      shared  buff/cache   availableMem:           958M        257M        393M        2.0M        307M        683MSwap:          255M          0B        255M

Therefore, the above command merely reads this 128MB of data into the memory buffer (write cache). So the speed you get will be extremely fast. Because what dd provides you is only the read speed; the system only starts writing data to the disk after dd completes, but you won’t see that speed.

At this point, we find the parameter –help reveals this option:

fdatasync  physically write output file data before finishing

By adding this parameter, the dd command will actually perform a “sync” operation at the end, so the time you get reflects the time needed to read the 128MB of data into memory and write it to disk, making the calculated time more realistic.

/mnt/sdcard # dd bs=1M count=128 if=/dev/zero of=test conv=fdatasync128+0 records in128+0 records out134217728 bytes (134 MB, 128 MiB) copied, 8.65137 s, 15.5 MB/s

Issue 2: By using the method of adding the parameter conv=fdatasync, is the write cache (write cache) also utilized?

We can conduct the following test:

/mnt/sdcard # free -h              total        used        free      shared  buff/cache   availableMem:           958M        258M        504M        2.0M        196M        683MSwap:          255M          0B        255M/mnt/sdcard # dd bs=1M count=128 if=/dev/zero of=test conv=fdatasync128+0 records in128+0 records out134217728 bytes (134 MB, 128 MiB) copied, 7.93997 s, 16.9 MB/s/mnt/sdcard # free -h              total        used        free      shared  buff/cache   availableMem:           958M        258M        360M        2.0M        339M        682MSwap:          255M

From the test results, the cache has clearly increased. This means that the write cache was indeed utilized.

Issue 3: How can we skip the write cache?

I found two different answers online to conduct the following tests:

oflag=direct
oflag=dsync

First, let’s look at the explanation from –help

direct    use direct I/O for datadsync     use synchronized I/O for data

From the explanation, they seem similar. Let’s conduct a test:

Test 1:

/mnt/sdcard # free -h              total        used        free      shared  buff/cache   availableMem:           958M        257M        503M        2.0M        197M        683MSwap:          255M          0B        255M/mnt/sdcard # dd bs=1M count=128 if=/dev/zero of=test oflag=direct128+0 records in128+0 records out134217728 bytes (134 MB, 128 MiB) copied, 8.62906 s, 15.6 MB/s/mnt/sdcard # free -h              total        used        free      shared  buff/cache   availableMem:           958M        257M        502M        2.0M        198M        683MSwap:          255M          0B        255M

Test 2:

/mnt/sdcard # free -h              total        used        free      shared  buff/cache   availableMem:           958M        258M        504M        2.0M        196M        683MSwap:          255M          0B        255M/mnt/sdcard # dd bs=1M count=128 if=/dev/zero of=test oflag=dsync128+0 records in128+0 records out134217728 bytes (134 MB, 128 MiB) copied, 9.72236 s, 13.8 MB/s/mnt/sdcard # free -h              total        used        free      shared  buff/cache   availableMem:           958M        258M        361M        2.0M        338M        682MSwap:          255M

From the test results, oflag=direct bypassed the memory cache. oflag=dsync did not. However, the speed was indeed slower because each time it synchronized the I/O, it had to write this 1M to disk before reading the next 1M, repeating this 128 times. Thus, the speed was very slow, and it hardly utilized the write cache.

Issue 4: What command is more reasonable for us to use in actual testing?

This mainly depends on our testing goals, summarized as:

# Test the most realistic file write speed dd bs=50M count=100 if=/dev/zero of=test conv=fdatasync
# Test cache write speed dd if=/dev/zero of=/sdcard/test bs=1M count=100# (bs count values depend on the actual device Mem situation)
# Skip memory cache dd bs=1M count=128 if=/dev/zero of=test oflag=direct

Knowledge Extension

About Buff/Cache in Memory

Let’s take another look at the output of the free command:

/mnt/sdcard # free -h              total        used        free      shared  buff/cache   availableMem:           958M        259M        475M        2.0M        224M        682MSwap:          255M          0B        255M

By just using the dd command to write files, we can see the changes in buff/cache. However, it is important to distinguish between Buffer and cache, which is not as easy to understand as physical memory Mem and swap Swap.

According to online encyclopedia explanations, Buffer is a buffer area, while cache is a cache. From a literal understanding, it is not easy to distinguish. We can use /proc/meminfo to break it down.

/mnt/sdcard # cat /proc/meminfoMemTotal:         981744 kBMemFree:          486620 kBMemAvailable:     698236 kBBuffers:           10532 kBCached:           187600 kBSwapCached:            0 kBActive:           166944 kBInactive:         267388 kBActive(anon):     102548 kBInactive(anon):   135700 kBActive(file):      64396 kBInactive(file):   131688 kBUnevictable:           0 kBMlocked:               0 kBSwapTotal:        262140 kBSwapFree:         262140 kBDirty:             65580 kBWriteback:             0 kBAnonPages:        236288 kBMapped:            49336 kBShmem:              2060 kBSlab:              44112 kBSReclaimable:      31368 kBSUnreclaim:        12744 kBKernelStack:        3280 kBPageTables:         1856 kBNFS_Unstable:          0 kBBounce:                0 kBWritebackTmp:          0 kBCommitLimit:      753012 kBCommitted_AS:     805532 kBVmallocTotal:   258867136 kBVmallocUsed:           0 kBVmallocChunk:          0 kB

From the explanation of man free:

buffers     Memory used by kernel buffers (Buffers in /proc/meminfo)
cache       Memory used by the page cache and slabs (Cached and SReclaimable in /proc/meminfo)
buff/cache      Sum of buffers and cache

So buff/cache in free refers to Buffers + Cached + SReclaimable. The data comes from /proc/meminfo.

Continuing with man proc, we can obtain detailed documentation on the proc filesystem. Translated, it states:

Buffers is temporary storage for raw disk blocks, used to cache disk data, usually not particularly large (around 20MB). This allows the kernel to concentrate scattered writes and optimize disk writing uniformly, such as merging multiple small writes into a single large write, etc.
Cached is the page cache for files read from disk, used to cache data read from files. Thus, when accessing these file data next time, it can be quickly retrieved from memory without needing to access the slow disk again.
SReclaimable is part of the Slab. Slab consists of two parts, with the reclaimable part recorded by SReclaimable; while the unreclaimable part is recorded by SUnreclaim.

Testing

Clearing System Cache

/ # free -h              total        used        free      shared  buff/cache   availableMem:           958M        259M        475M        2.0M        224M        682MSwap:          255M          0B        255M/ # echo 3 > /proc/sys/vm/drop_caches/ # free -h              total        used        free      shared  buff/cache   availableMem:           958M        259M        635M        2.0M         64M        681MSwap:          255M          0B        255M/ #

Execute dd if=/dev/zero of=./test bs=1M count=500 Observe the changes in memory and I/O

/mnt/sdcard #  vmstat 1procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st 1  0      0 650608    948  64324    0    0     3   284  216  248  5 21 71  3  0 1  0      0 650444    948  64460    0    0     0     0 1581 2006  5 20 75  0  0 2  0      0 650640    948  64460    0    0     0     0 1518 1958  5 20 74  0  0 1  0      0 650428    948  64460    0    0     0     0 1542 1997  5 21 74  0  0 2  0      0 624732   1716  85624    0    0  1873    12 7329 7395  6 32 58  4  0 2  0      0 566844   1772 145052    0    0    10 12606 8838 6537  7 51 43  0  0 1  1      0 558848   1780 153820    0    0     2     0 3158 2955  5 25 51 19  0 2  0      0 554000   1780 158068    0    0     4 4154 2968 3202  5 23 55 17  0 3  0      0 508076   1876 203236    0    0     6 37030 15177 13524  6 43 45  5  0 2  0      0 470896   1932 241664    0    0     6 66952 9652 8827  5 37 46 11  0 1 1      0

Both Buffer and Cache are increasing, but it is evident that Cache is increasing much faster. This indicates that data is being cached in Cache during file writing.
The results of multiple I/O writes add up to the 500MB of data that dd needs to write. Only after this is completed do we obtain the actual write speed we wish to test.

The greater advantage of Cached will be reflected in iozone rewrite tests.

Summary of iozone Experience and Reflections on Issues

About iozone

iozone is a benchmark tool for file systems that can test the read and write performance of file systems under different operating systems. It can test various modes such as Read, write, re-read, re-write, read backwards, read strided, fread, fwrite, random read, pread, mmap, aio_read, aio_write, etc. When testing, please ensure that the size of the test file is larger than your memory (ideally twice the size of memory), otherwise Linux will cache the read and write content, resulting in very unrealistic values. (Excerpt from encyclopedia)

There are too many explanations online for the installation and use of iozone, so I won’t take up too much space here and will go straight to the point.

Common Testing Commands

# Write Speed
./iozone -i0 -Rab ./test-iozone-write.xls -g 2G -n 1M -w -e -C 
# Read Speed
./iozone -i1 -Rab ./test-iozone-write.xls -g 1G -n 1M -w -e -C

Testing Script

Because of the large number of tests, it is generally recommended to organize a testing script for convenience in executing a large number of testing tasks.

clear    echo "a. automatic mode"    echo "b. write/rewrite "    echo "c. read/re-read"    echo "d. random-read/write"    echo "e. fwrite"    echo "f. fread"    echo "please input (a-e) to select function"    read letter    case $letter in      "a")        iozone -Rab /tmp/test-iozone-auto.xls -g 2G -n 1M -w -e -C       ;;      "b")        iozone -i0 -Rab /tmp/test-iozone-write.xls -g 2G -n 1M -w -e -C       ;;      "c")        perf_proc/iozone -i1 -Rab /tmp/test-iozone-read.xls -g 2G -n 1M -w -e -C       ;;      "d")        perf_proc/iozone -i2 -Rab /tmp/test-iozone-random-rw.xls -g 2G -n 1M -w -e -C       ;;      "e")        perf_proc/iozone -i3 -Rab /tmp/test-iozone-fwrite.xls -g 2G -n 1M -w -e -C       ;;      "f")        perf_proc/iozone -i4 -Rab /tmp/test-iozone-fread.xls -g 2G -n 1M -w -e -C       ;;      *)      ;;    esac  ;;

Testing Analysis

Here is a write report from an automated test result, with columns (file size KB) and rows (reclen KB)

Table: Device Writer Test Records

Issue 1: In actual testing, with a larger test range and data volume, how can we better analyze?

To intuitively feel the trend, we can draw a surface plot. Simple analysis can be done directly using Excel’s charting function by selecting a surface chart. Other tools like gnuplot can also be used. The author here uses Excel’s charting function.

Figure: Device Re-writer Test Records Surface Plot

When the file is smaller than 262M, the actual test is the process of reading from disk into memory, which is the write cache process, equivalent to the above:

dd if=/dev/zero of=/sdcard/test bs=1M count=100

When the file is larger than 524M, the actual test results are close to the real disk IO performance.

Next, let’s look at the re-writer’s test records:

Table: Device Re-writer Test Records

Figure: Device Re-writer Test Records Surface Plot

Speed differences are more evident in rewrites. When the file is smaller than 262M, it directly reads from the cache, achieving peak speed. When it is 524M or larger, the write and rewrite speeds show almost no difference.

Issue 2: In actual testing, with so many values, which value should we focus on?

Answer: All of them are important. In tests involving replacing eMCP, both performance values should be considered. If replacing eMMC, the focus is generally on disk IO performance (specifics depend on the test summary).

Testing Summary:

In tests involving the replacement of eMCP, after completing IO, Mem stress tests and performance tests, we ultimately need to return to functionality. In cases where functionality tests are all OK, we should compare performance in real functional scenarios, such as:

OTA upgrade speed;
Flashing speed;
Factory reset speed;
Performance of other core device functions;

In actual testing, there have been instances where the disk performance of Material A was better than that of Material B, but the OTA upgrade speed was significantly lower than that of Material B. Therefore, comparative testing needs to be evaluated and analyzed from multiple dimensions.

Finally, attached is hardcore performance testing material, a 50-page case study on performance bottlenecks.

Content includes:

1. Performance Analysis Steps

2. Performance Analysis Commands

3. Common Performance Issues Summary

4. Bottleneck Analysis Summary and Cases

A total of 50 pages of material, this resource is sufficient for understanding knowledge about performance bottlenecks!