Performance Optimization in Practice: A 30x Speedup with SIMD Instructions (Global Threshold Binarization Example)

In day-to-day development we often need to process large amounts of data, for example binarizing an image or testing values against a threshold. This article walks through a practical example in which SIMD instructions deliver roughly a 30x speedup.

Comparison of Three Methods

First, let’s look at the most basic method:

    // Requires: using System; using System.Diagnostics;
    static void Test1(int size)
    {
        byte[] data = new byte[size];
        Random rnd = new Random();
        rnd.NextBytes(data);

        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < size; i++)
        {
            // Scalar baseline: one comparison and one store per byte.
            if (data[i] >= 100) data[i] = 255;
            else data[i] = 0;
        }
        stopwatch.Stop();
        Trace.WriteLine($"Test1 elapsed: {stopwatch.ElapsedMilliseconds}ms");
    }

This method is simple and straightforward, processing each byte individually, but it is not efficient.

Next is the improved version, which uses the cross-platform Vector256 API:

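    // Requires: using System; using System.Diagnostics; using System.Runtime.InteropServices;
    //           using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
    static unsafe void Test2(int size)
    {
        byte* data = (byte*)Marshal.AllocHGlobal(size);
        Random rnd = new Random();
        rnd.NextBytes(new Span<byte>(data, size));

        Vector256<byte> threshold = Vector256.Create((byte)100);
        Vector256<byte> zero = Vector256<byte>.Zero;
        Vector256<byte> maxValue = Vector256.Create((byte)255);
        int vsize = Vector256<byte>.Count; // 32 bytes per vector

        var stopwatch = Stopwatch.StartNew();
        int i = 0;
        // Stop at the last full vector so the loop never reads or writes past the buffer.
        for (; i <= size - vsize; i += vsize)
        {
            byte* temp = data + i;
            Vector256<byte> mask = Vector256.GreaterThanOrEqual(Avx.LoadVector256(temp), threshold);
            Vector256<byte> result = Vector256.ConditionalSelect(mask, maxValue, zero);
            Avx.Store(temp, result);
        }
        // Scalar tail for sizes that are not a multiple of 32.
        for (; i < size; i++) data[i] = data[i] >= 100 ? (byte)255 : (byte)0;
        stopwatch.Stop();
        Trace.WriteLine($"Test2 elapsed: {stopwatch.ElapsedMilliseconds}ms");
        Marshal.FreeHGlobal((IntPtr)data);
    }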

Finally, the optimized version, which calls the AVX2 intrinsics directly:

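    // Requires: using System; using System.Diagnostics; using System.Runtime.InteropServices;
    //           using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
    static unsafe void Test3(int size)
    {
        byte* data = (byte*)Marshal.AllocHGlobal(size);
        Random rnd = new Random();
        rnd.NextBytes(new Span<byte>(data, size));

        // SubtractSaturate clamps at 0, so subtracting 99 (threshold - 1) yields 0
        // exactly for values below 100. Subtracting 100 would also zero the value 100
        // and disagree with the ">= 100" test used in Test1 and Test2.
        Vector256<byte> thresholdMinusOne = Vector256.Create((byte)99);
        Vector256<byte> zero = Vector256<byte>.Zero;
        Vector256<byte> maxValue = Vector256.Create((byte)255);
        int vsize = Vector256<byte>.Count; // 32 bytes per vector

        var stopwatch = Stopwatch.StartNew();
        int i = 0;
        // Stop at the last full vector so the loop never reads or writes past the buffer.
        for (; i <= size - vsize; i += vsize)
        {
            byte* temp = data + i;
            Vector256<byte> v = Avx.LoadVector256(temp);
            Vector256<byte> diff = Avx2.SubtractSaturate(v, thresholdMinusOne);
            Vector256<byte> mask = Avx2.CompareEqual(diff, zero); // 0xFF where v < 100
            mask = Avx2.Xor(mask, maxValue);                      // invert: 0xFF where v >= 100
            Avx.Store(temp, mask);
        }
        // Scalar tail for sizes that are not a multiple of 32.
        for (; i < size; i++) data[i] = data[i] >= 100 ? (byte)255 : (byte)0;
        stopwatch.Stop();
        Trace.WriteLine($"Test3 elapsed: {stopwatch.ElapsedMilliseconds}ms");
        Marshal.FreeHGlobal((IntPtr)data);
    }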

Performance Test Results

When processing 32 << 25 bytes of data (2^30 bytes, i.e. 1 GiB), the three methods take the following times:

Test1 (Basic Loop): Approximately 3673ms
Test2 (Vector256): Approximately 124ms
Test3 (AVX2 Instruction): Approximately 119ms

Test3 is roughly 30 times faster than Test1 (3673 / 119 ≈ 31)! This improvement is quite remarkable.

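Note: Why is Test2 slightly slower than Test3? The gap is only about 4% (124ms versus 119ms), so part of it may be measurement noise. One plausible contributor is that AVX2 has no unsigned byte greater-than-or-equal instruction, so Vector256.GreaterThanOrEqual cannot map to a single instruction. In addition, where Vector256 is not hardware accelerated, the software fallback for Vector256.GreaterThanOrEqual is implemented in terms of Vector128.GreaterThanOrEqual, which in turn falls back to Vector64.GreaterThanOrEqual, and each level incurs a performance penalty.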
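Because AVX2 only offers equality and signed greater-than comparisons for bytes, an unsigned "greater than or equal" has to be composed from other instructions. The following is a minimal sketch of one common way to do this by hand, assuming .NET 7 or later; the class and method names are hypothetical, and this is not necessarily the sequence the JIT emits for Vector256.GreaterThanOrEqual.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class UnsignedCompareSketch
{
    // Returns 0xFF in every lane where value >= threshold (unsigned), 0x00 elsewhere.
    // Callers must check Avx2.IsSupported first; without AVX2 these intrinsics
    // throw PlatformNotSupportedException.
    public static Vector256<byte> GreaterThanOrEqualMask(
        Vector256<byte> value, Vector256<byte> threshold)
    {
        // max(value, threshold) equals value exactly when value >= threshold.
        Vector256<byte> max = Avx2.Max(value, threshold); // unsigned byte maximum
        return Avx2.CompareEqual(max, value);             // per-byte equality
    }
}
```

For byte data the returned mask is already the binarized output, since an all-ones lane is 0xFF, i.e. 255, so no separate select step is needed.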

In-Depth Technical Analysis

SIMD: Single Instruction Multiple Data

SIMD allows us to perform the same operation on multiple data elements simultaneously. In our example:

Traditional Method: Processes 1 byte at a time
AVX2 Method: Processes 32 bytes at a time (256 bits / 8 bits)

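This 32-to-1 ratio is the theoretical basis for the roughly 30x performance increase. The measured figure also depends on factors such as memory bandwidth and how well the branch in the scalar loop is predicted on random data.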
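The lane arithmetic can be checked directly from the runtime. This is a small sketch, assuming .NET 7 or later; LaneCountSketch and its methods are illustrative names that do not appear in the original benchmark.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class LaneCountSketch
{
    // Number of full-vector iterations needed to cover `size` bytes,
    // ignoring any tail shorter than one vector.
    public static int FullVectorIterations(int size) => size / Vector256<byte>.Count;

    public static void Print()
    {
        // Count is the number of byte lanes in a 256-bit vector: 256 / 8.
        Console.WriteLine(Vector256<byte>.Count);          // 32
        Console.WriteLine(FullVectorIterations(32 << 25)); // 33554432
    }
}
```

For the 2^30-byte buffer used in the tests, the vectorized loops therefore run 2^25 iterations instead of 2^30.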

The Ingenuity of Test3

The implementation of Test3 is quite clever:

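Saturating Subtraction: Avx2.SubtractSaturate clamps at 0 instead of wrapping around, so subtracting threshold - 1 (that is, 99) leaves 0 for every value below the threshold and a nonzero value for every value greater than or equal to it. Subtracting the threshold itself (100) would also turn the value 100 into 0 and misclassify it
Comparison Operation: Comparing the difference with 0 yields 0xFF in the positions that were originally below the threshold and 0x00 elsewhere
Bitwise Inversion: An XOR with all ones inverts that result, giving the final mask: 255 where the value is greater than or equal to the threshold, 0 otherwise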

This sequence contains no data-dependent branches, so it avoids branch mispredictions and fully leverages the advantages of parallel computation.
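The three steps can be modelled one byte lane at a time in ordinary scalar code, which makes the behaviour at the threshold boundary easy to check. This is a minimal sketch; SaturatingTrickModel and its methods are hypothetical names, and the subtrahend is a parameter so that subtracting 99 and subtracting 100 can be compared.

```csharp
using System;

public static class SaturatingTrickModel
{
    // Scalar model of one lane of Avx2.SubtractSaturate on unsigned bytes:
    // the result is clamped at 0 instead of wrapping around.
    public static byte SubtractSaturate(byte a, byte b) => (byte)Math.Max(a - b, 0);

    // Applies the three steps of Test3 to a single byte.
    public static byte BinarizeLane(byte value, byte subtrahend)
    {
        byte diff = SubtractSaturate(value, subtrahend);
        byte mask = diff == 0 ? (byte)0xFF : (byte)0x00; // models CompareEqual(diff, zero)
        return (byte)(mask ^ 0xFF);                       // models Xor with all ones
    }

    public static void Print()
    {
        Console.WriteLine(BinarizeLane(99, 99));   // 0:   99 is below the threshold of 100
        Console.WriteLine(BinarizeLane(100, 99));  // 255: 100 meets the threshold
        Console.WriteLine(BinarizeLane(100, 100)); // 0:   subtracting 100 misclassifies 100
    }
}
```

The last call shows why the subtrahend matters: with 100 as the subtrahend, the value 100 is treated as below the threshold, which differs from the ">= 100" rule of the scalar loop.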

Practical Application Scenarios

This optimization is particularly useful in the following scenarios:

Image Processing: Binarization, threshold segmentation
Data Filtering: Real-time data stream processing
Scientific Computing: Large-scale numerical calculations
Game Development: Particle systems, physics calculations

Optimization Suggestions

Memory Alignment: Align the buffer to the vector size (32 bytes for Vector256) so that loads and stores never straddle a cache line; Marshal.AllocHGlobal does not guarantee 32-byte alignment
Loop Unrolling: Process two or more vectors per iteration to reduce loop overhead and give the CPU more independent work
Cache Warm-Up: Where possible, process data while it is still in the CPU cache
Instruction Selection: Choose the optimal instructions based on what the target CPU supports

Conclusion

With the SIMD instruction set, we achieved a roughly 30-fold performance increase. Although this optimization requires a deeper understanding of hardware, the benefits it brings when processing large-scale data are immense.

In modern software development, understanding the underlying hardware characteristics is becoming increasingly important. Whether it is the AVX2 instruction set or other hardware acceleration technologies, they provide us with powerful tools for optimizing program performance.

Technical optimization is an endless journey; sometimes, a change in perspective can reveal new realms of performance!

Note: Actual performance improvements may vary based on hardware configuration and data characteristics; specific testing in the target environment is recommended. AVX2 instructions require a CPU that supports AVX2, such as Intel Haswell and later architectures.
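Since AVX2 is not available on every CPU, production code would normally check for it at run time and fall back to the scalar loop. The following is a minimal sketch of such a dispatch, assuming .NET 7 or later; BinarizeDispatchSketch is a hypothetical name, and the vector path uses the cross-platform Vector256 API from Test2 instead of raw pointers.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class BinarizeDispatchSketch
{
    // Sets every byte to 255 if it is >= threshold, otherwise to 0.
    public static void Binarize(Span<byte> data, byte threshold)
    {
        int i = 0;
        if (Avx2.IsSupported && data.Length >= Vector256<byte>.Count)
        {
            Vector256<byte> t = Vector256.Create(threshold);
            int lastFull = data.Length - Vector256<byte>.Count;
            for (; i <= lastFull; i += Vector256<byte>.Count)
            {
                Vector256<byte> v = Vector256.LoadUnsafe(ref data[i]);
                // Each mask lane is all ones (0xFF = 255) or all zeros,
                // so the mask is already the binarized output.
                Vector256<byte> mask = Vector256.GreaterThanOrEqual(v, t);
                mask.StoreUnsafe(ref data[i]);
            }
        }
        // Scalar path: the whole buffer without AVX2, otherwise only the tail.
        for (; i < data.Length; i++)
            data[i] = data[i] >= threshold ? (byte)255 : (byte)0;
    }
}
```

Calling BinarizeDispatchSketch.Binarize(buffer, 100) on a byte[] gives the same result on machines with and without AVX2; only the speed differs.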