Optimization Techniques for C/C++ Code

Optimization Techniques for C/C++ Code

I stumbled upon a short article from 2014 that introduces how to optimize performance in C/C++. Although the original text is focused on ray tracing in graphics, it still provides valuable guidance. I translated it into English, hoping it will help everyone write high-quality code. After each suggestion in the text, I added remarks based on modern compiler optimizations to update the advice.

For the original article, please refer to: Tips for Optimizing C/C++ Code

1. Remember Amdahl’s Law

Optimization Techniques for C/C++ Code

  • Here, <span>func-cost</span> represents the percentage of the total runtime of the program that the function <span>func</span> occupies, and <span>func-speedup</span> indicates the speedup achieved for that function.
  • Therefore, if we optimize a function <span>TriangleIntersect()</span> that takes up <span>40%</span> of the program’s runtime, and its runtime is halved, the overall performance of the program will improve by <span>25%</span>. Please see the calculation below.Optimization Techniques for C/C++ Code
  • This formula tells us that code that is used infrequently (such as scene loaders) may not need optimization (or only minimal optimization).
  • In summary: “Make frequently used code faster, and ensure infrequently used code is correct.”

Remarks:

Amdahl’s Law states that the overall performance improvement of a system is limited by the parts that cannot be optimized. Even if a certain part improves significantly, if it occupies a small percentage of the overall, the total performance gain will be limited. Therefore, optimization should focus on the “hot” code that takes the most time.

2. Ensure code correctness before optimizing!

  • This does not mean you should spend 8 weeks writing a fully functional ray tracer and then spend another 8 weeks optimizing it!
  • Optimization of the ray tracer should be done in stages.
  • First, write correct code, then optimize the functions that are called frequently.
  • Finally, perform performance analysis to identify bottlenecks and eliminate them through optimization or improved algorithms. Often, improving the algorithm will significantly change where the bottleneck lies, possibly shifting it to functions you did not expect. This is why explicit optimization of all known high-frequency functions is necessary.

Remarks:

This still applies. Premature optimization can waste time and introduce bugs. Modern development processes emphasize ensuring correctness first, then using performance analysis tools to locate bottlenecks, and finally optimizing purposefully.

3. The code experts I know say they spend at least twice as much time optimizing code as they do writing it

Remarks:

In modern development, performance tuning often takes longer than feature development. It is recommended to use performance analysis tools (such as perf, gprof, VTune) to locate bottlenecks and avoid blind optimization. Also, set optimization goals to conduct optimization work purposefully.

4. Jumps/branch operations are costly, so minimize their use

  • Function calls, aside from stack memory operations, also require two jumps.
  • Prefer iteration over recursion.
  • Use inline implementations for short functions to eliminate the overhead of function calls.
  • Move loops inside function calls, for example, change
    for(i=0; i < 100;i++)
      DoSomething();
    

    to

    DoSomething() {
        for(i=0; i < 100; i++)
        {
            ...
        }
    }
    
  • Long <span>if...else if...else if...else if...</span> chains can lead to many jumps at the end of the chain (besides checking each condition one by one).
    • If possible, use a <span>switch</span> statement, as compilers can sometimes optimize it into a single jump table lookup.
    • If a <span>switch</span> statement cannot be used, place the most common branches at the beginning of the condition chain.

Remarks:

Modern CPUs have strong branch prediction capabilities, but mispredictions still incur performance losses. Organizing branches reasonably and reducing deep nesting is still beneficial, but there is no need to worry excessively. Modern compilers will automatically optimize some branches.

5. Pay attention to the order of array indexing

  • Two-dimensional and higher-dimensional arrays are still stored in memory in a one-dimensional manner. This means that (for C/C++ arrays)<span>array[i][j]</span> and <span>array[i][j+1]</span> are stored adjacently, while <span>array[i][j]</span> and <span>array[i+1][j]</span> may be far apart.
  • Accessing data in the order of physical memory storage can significantly improve code execution speed (sometimes by an order of magnitude or more)!
  • When modern CPUs load data from memory into cache, they do not just fetch a single value. They fetch a block of memory that includes the requested data and its adjacent data (cache line). This means that when <span>array[i][j]</span> enters the CPU cache, <span>array[i][j+1]</span> is likely already in the cache, while <span>array[i+1][j]</span> may still be in memory.

Remarks:

This is still important. Accessing data in memory order (improving cache hit rates) has a significant impact on performance. The cache mechanisms of modern CPUs make this even more critical. However, for C++ programs, it is recommended to use standard library containers like <span>std::vector</span> and <span>std::array</span><span>, rather than C-style arrays.</span>

6. Consider instruction-level parallelism

  • Although many applications still rely on single-threaded execution, modern CPUs have achieved parallelism on a single core. This means a single CPU may simultaneously execute 4 floating-point multiplication operations, wait for 4 memory requests, and perform comparison operations for upcoming branches.
  • To fully utilize this instruction parallelism, code blocks (i.e., between jump instructions) need to contain enough independent instructions to keep the CPU fully utilized.
  • Consider using loop unrolling.
  • Consider using inline functions.

Remarks:

Modern compilers can automatically perform instruction scheduling and out-of-order execution, but writing code that is free of data dependencies and can be parallelized still helps the compiler generate efficient instructions.

7. Avoid or reduce the number of local variables

  • Local variables are usually stored on the stack. However, if there are few enough, they can be stored in registers instead. In this case, the function not only gains the advantage of fast access to data stored in registers but also avoids the overhead of creating a stack frame.
  • (But never convert all to global variables!)

Remarks:

Modern compilers will automatically allocate active variables to registers. There is no need to deliberately reduce local variables; maintaining code readability is more important, but this does not encourage the abuse of local variables.

8. Reduce the number of function parameters

  • The reason is the same as reducing local variables—they are also stored on the stack.

Remarks:

Modern compilers will automatically optimize parameter passing. Having too many parameters may affect readability and stack space, but there is no need to deliberately reduce the number of parameters for performance.

9. Structures should be passed by reference rather than by value

  • In ray tracers, I have never seen a case where structures need to be passed by value (even simple structures like vectors, points, and colors are no exception).

Remarks:

This still applies. Large objects should be passed by reference or pointer to avoid unnecessary copies. Small POD types (like int, float, std::pair) can be passed by value without issue.

10. If a function does not need to return a value, do not define a return type

Remarks:

This still applies. Although modern compilers will optimize unused return values, it is still recommended to always explicitly declare the return type (like void) to enhance code standards, readability, and maintainability.

11. Avoid or minimize the use of type casting

  • Integer and floating-point instructions often operate on different registers, so type casting requires data copying.
  • Short types (like char and short) still occupy a full register and need to be padded to 32/64 bits before being stored back to memory and converted back to a smaller type (this overhead needs to be weighed against the additional memory cost of larger data types).

Remarks:

This still applies. Frequent type casting can lead to additional instructions and performance losses, as well as affect code readability.

12. Be cautious when declaring C++ object variables

  • Prefer initialization over assignment
    Color c(black);
    

    is faster than

    Color c;
    c = black;
    

    .

Remarks:

Prefer using initializer lists; modern compilers can optimize the initialization process. It is recommended to use uniform initialization (from C++11 onwards) with <span>{}</span>.

13. Design the default constructor of classes to be as small as possible

  • Especially for simple and frequently used classes (like color, vector, point, etc.).
  • These default constructors are often called in unexpected situations.
  • Use constructor initializer lists. Use
    Color::Color() : r(0), g(0), b(0)
    {}
    

    instead of

    Color::Color() {
        r = g = b = 0;
    }
    

Remarks:

This still applies. Classes that frequently create objects should avoid complex default construction logic, and using initializer lists is more efficient.

14. Use shift operations >> and << instead of integer multiplication and division whenever possible

Remarks:

Modern compilers will automatically optimize operations like multiplying by 2 or dividing by 2 into shifts. There is no need to manually replace them; keep the expression semantics clear.

15. Be cautious when using lookup functions.

  • Many advocate using precomputed values for complex functions (like trigonometric functions). However, in ray tracing, this is often unnecessary. Memory lookups are costly (and the cost continues to rise), and recalculating trigonometric functions is often as fast as retrieving values from memory (especially considering that lookup tables can pollute the CPU’s cache structure).
  • However, in other scenarios, lookup tables can be very useful. For GPU programming, complex functions are often better served by lookup tables.

Remarks:

With the improved cache and branch prediction capabilities of modern CPUs, lookup tables may not be faster than direct calculations. Only adopt them when performance analysis measurements prove that lookup tables are superior.

16. For most classes, use operators +=, -=, *=, and /= instead of operators +, -, *, and /

  • Simple operations require creating an anonymous temporary intermediate object.
  • For example:
    Vector v = Vector(1,0,0) + Vector(0,1,0) + Vector(0,0,1);
    

    This code will create five unnamed temporary vectors:<span>Vector(1,0,0)</span>, <span>Vector(0,1,0)</span>, <span>Vector(0,0,1)</span>, <span>Vector(1,0,0) + Vector(0,1,0)</span>, and <span>Vector(1,0,0) + Vector(0,1,0) + Vector(0,0,1)</span>.

  • The slightly longer code:
    Vector v(1,0,0); v+= Vector(0,1,0); v+= Vector(0,0,1);
    

    only creates two temporary vectors:<span>Vector(0,1,0)</span> and <span>Vector(0,0,1)</span>. This saves 6 function calls (3 constructors and 3 destructors).

Remarks:

This still applies. Compound assignment operators can reduce the creation of temporary objects and improve performance.

17. For basic data types, use operators +, -, *, and / instead of operators +=, -=, *=, and /=

Remarks:

Modern compilers have optimized operations for basic types very well, and there is no significant performance difference between the two forms. Code writing should prioritize readability.

18. Delay the declaration of local variables

  • Declaring an object variable inevitably involves a function call (constructor).
  • If a variable is only used in specific scenarios (like inside an <span>if</span> statement), then declare it only when truly needed, ensuring the constructor is called only when the variable is used.

Remarks:

Modern compilers can optimize variable scope, but delaying declaration can improve code readability and reduce unnecessary constructions.

19. For objects, use the prefix operator ++obj instead of the postfix operator obj++

  • This may not be an issue in ray tracers.
  • The postfix operator creates a temporary copy of the object (thus calling the constructor and destructor additionally), while the prefix operator does not require a temporary copy.

Remarks:

This still applies. The prefix form can avoid unnecessary temporary objects, especially for custom types.

20. Be cautious when using templates

  • Different instantiation scenarios may require different optimization strategies!
  • The standard template library has been thoroughly optimized, but if you plan to implement an interactive ray tracer, it is still advisable to avoid using it.
  • Why? Only by implementing it yourself will you master its underlying algorithms and clarify the optimal way to use the code.
  • More importantly, in my experience, the debug build of the STL library runs slowly. This is usually not a problem, but when you use the debug version for performance analysis, you will encounter trouble. You will find that STL constructors, iterators, etc., occupy more than <span>15%</span> of the runtime, making the performance analysis output harder to interpret.

Remarks:

The modern STL is highly optimized, and there is no need to avoid it unless in extreme performance scenarios. The debug version of STL has lower performance, while the release version has minimal impact.

21. Avoid dynamic memory allocation during calculations

  • Dynamic memory is very suitable for storing scenes and other data that do not change during calculations.
  • However, in most systems, dynamic memory allocation requires a locking mechanism to control access to the allocator. For multi-threaded applications that use dynamic memory, the waiting time during memory allocation and release can actually lead to performance degradation!
  • Even for single-threaded applications, the overhead of allocating memory on the heap is higher than on the stack. The operating system needs to compute to find the required size of the memory block.

Remarks:

This still applies. Frequent allocation/release of memory can affect performance; it is recommended to reuse objects or use memory pools.

22. Find and utilize information related to system memory caching

  • If a data structure can fit within a single cache line, then processing the entire class only requires one data read from main memory.
  • Ensure all data structures are aligned to cache line boundaries (if a data structure and cache line are both 128 bytes, but one byte of the structure is in one cache line and the remaining 127 bytes are in another cache line, then performance will still be poor).

Remarks:

This still applies. Proper data structure layout and alignment can improve cache hit rates; modern compilers can automatically optimize some alignments, but manual optimization is still valuable.

23. Avoid unnecessary data initialization

  • If you need to initialize large blocks of memory, consider using the <span>memset()</span> function.

Remarks:

This still applies. For initializing large contiguous memory, it is recommended to use <span>memset</span> or <span>std::fill</span>; modern compilers can optimize some initialization operations.

24. Try to adopt early termination in loops and early return mechanisms in functions

  • Consider the problem of a ray intersecting a triangle. A common case is that the ray passes through the triangle without intersecting, so optimizations should be made for this scenario.
  • If you choose to let the ray intersect the triangle plane, you can immediately return when the intersection point t value is negative. This allows about half of the ray-triangle intersection calculations to skip the barycentric coordinate calculations, significantly improving efficiency! Once it is confirmed that there is no intersection, the intersection function should terminate immediately.
  • Similarly, some loops can terminate early. For example, when casting shadow rays, as soon as any occlusion point is found, you can exit the intersection detection program immediately without calculating the nearest intersection position.

Remarks:

This still applies. Early termination can reduce unnecessary calculations and improve performance.

25. Always simplify equations on paper!

  • In many equations, certain terms will cancel each other out—either always or only under specific conditions.
  • The compiler cannot discover these opportunities for equation simplification, but you can. Sometimes, eliminating a few costly operations in an inner loop can enhance program speed more than spending days optimizing other parts.

Remarks:

This still applies. Manually simplifying expressions can reduce computational load, and compilers find it difficult to automatically discover all mathematical simplifications. Using AI tools can enhance efficiency in this area.

26. The differences between integer operations, fixed-point operations, 32-bit floating-point operations, and 64-bit double-precision operations are smaller than you think

  • In modern CPUs, the throughput of floating-point operations is roughly equivalent to that of integer operations. In compute-intensive programs like ray tracing, the cost difference between integer and floating-point operations is negligible. Therefore, you do not need to deliberately choose integer operations.
  • Double-precision floating-point operations are not necessarily slower than single-precision floating-point operations, especially on 64-bit machines. I have seen ray tracers using full double-precision floating-point operations run faster than those using full single-precision floating-point operations on the same machine. But I have also seen the opposite.

Remarks:

Modern CPU performance for floating-point and integer operations is similar. Unless there are special requirements, there is no need to deliberately use integers instead of floating-point numbers.

27. Consider restructuring mathematical expressions to eliminate costly operations

  • <span>sqrt()</span> functions can often be avoided, especially in scenarios where the same result can be obtained by comparing squared values.
  • If you need to divide by <span>x</span> frequently, consider calculating 1 and then multiplying by the result. This method has significantly improved vector normalization efficiency (reducing 3 divisions), but it has recently been found to be somewhat mediocre. However, when the number of divisions exceeds 3, this method still has advantages.
  • If you need to perform loops, ensure that calculations that yield constant results are completed outside the loop.
  • Consider whether you can incrementally calculate a variable within the loop (rather than recalculating that variable from scratch each time).

Remarks:

This still applies. Reducing high-cost operations (like division and square roots) helps improve performance, especially in loops and hot code.

Supplement: Best Practices Under Modern Compiler Optimization

Optimization Techniques for C/C++ Code

  1. Keep code simple and trust the compiler Simple code is easy to maintain, and smart modern compilers will quietly help you with many optimization tasks.

  2. Prefer passing by (const) reference or returning (const) reference Exceptions: simple data types (like int, bool, float, etc.) should use pass by value or return value.

  3. Use the inline keyword Inline small functions. Inline frequently called functions.

  4. Remove redundant code Delete unused code, refactor and streamline redundant code (ensure sufficient unit tests for regression testing to avoid introducing new issues that lead to code instability).

  5. Use performance analysis tools Utilize <span>perf</span>, <span>gprof</span>, <span>valgrind</span>, <span>clang-tidy</span>, and <span>VTune</span> to locate real bottleneck hot code, avoiding blind optimization. To avoid interference, only profile hot code.

  6. Choose appropriate data structures and algorithms The choice of algorithms and data structures has a far greater impact on performance than micro-level code optimization.

  7. Utilize modern C++ features Use <span>move</span> semantics, smart pointers, range for loops, concurrency libraries, etc., to enhance performance and safety.

  8. Enable compiler optimization options For release versions, it is recommended to use <span>-O2</span> or <span>-O3</span>, combined with <span>-march=native</span>, <span>-flto</span>, and other advanced optimizations.

  9. Focus on multithreading and concurrency Modern CPUs are multi-core; properly utilizing multithreading and concurrency libraries can significantly enhance performance.

  10. Maintain code readability and maintainability Optimization should prioritize readability; excessive manual optimization may lead to maintenance difficulties.

  11. Explicitly specify standard versions Use <span>-std=c++17</span> or higher standards to leverage new features for performance and safety.

  12. Utilize compiler warnings and static analysis Always enable <span>-Wall -Wextra</span>, and regularly check code quality with static analysis tools.

Optimization Techniques for C/C++ Code

Further Reading

  • Effective Modern C++
  • C++ Performance Optimization Handbook
  • Programming Techniques to Improve C++ Performance
  • The Art of Writing Efficient Programs
  • C++ High Performance 2nd Edition
  • CPU Performance Analysis and Optimization

At the end, here are some discount links

Advanced C++23 Programming

https://u.jd.com/YDbpGGG

C++20 Template Metaprogramming

https://u.jd.com/YGbSvDi

Optimization Techniques for C/C++ Code

Leave a Comment