Source | Embedded Miscellaneous
In embedded development, resources are always scarce. Insufficient memory, slow execution speed, high power consumption… do these issues often trouble you?
Today, I will share several code optimization techniques that have been validated in practice!
1. Time Efficiency Optimization
Avoid Floating Point Operations
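// Slow version on MCUs without an FPU: floating point operation (result in volts)
float calculate_voltage(int adc_value) {
    return adc_value * 3.3f / 4096.0f;
}
// Fast version: fixed-point operation (result in millivolts, not volts)
int calculate_voltage_fast(int adc_value) {
    // Right shift by 12 replaces division by 4096; the long constant keeps
    // the product from overflowing where int is only 16 bits wide
    return (int)((adc_value * 3300L) >> 12);
}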
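The fixed-point idea can be made a little more explicit about units and integer widths. The following is a minimal sketch, assuming a 12-bit ADC and a 3.3 V reference; the name adc_to_millivolts is hypothetical and not part of any vendor library.

```c
#include <stdint.h>

/* Hypothetical helper: converts a 12-bit ADC reading (0..4095) to millivolts,
 * assuming a 3.3 V reference. The 32-bit intermediate holds the largest
 * product (4095 * 3300 = 13,513,500), which would overflow a 16-bit int. */
static inline uint16_t adc_to_millivolts(uint16_t adc_value)
{
    uint32_t scaled = (uint32_t)adc_value * 3300u;
    return (uint16_t)(scaled >> 12);   /* same as dividing by 4096 */
}
```

Working in millivolts keeps three digits of resolution without ever touching floating point; convert to volts only when formatting for display.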
Reduce Function Calls
// Slow version: frequent function calls
for (int i = 0; i < 1000; i++) {
    set_led_state(i % 2);
}
// Fast version: inline expansion
for (int i = 0; i < 1000; i++) {
    if (i % 2) {
        GPIO_SetBits(GPIOA, GPIO_Pin_5);
    } else {
        GPIO_ResetBits(GPIOA, GPIO_Pin_5);
    }
}
2. Space Efficiency Optimization
Select Appropriate Data Types
// Memory-wasting approach
struct sensor_data {
    int temperature; // Only needs -40 to 125
    int humidity;    // Only needs 0 to 100
    int pressure;    // Only needs 300 to 1100
};
// Memory-saving approach
struct sensor_data_optimized {
    int8_t temperature; // -128 to 127, sufficient
    uint8_t humidity;   // 0 to 255, sufficient
    uint16_t pressure;  // 0 to 65535, sufficient
};
// Memory saving: reduced from 12 bytes to 4 bytes (assuming a 4-byte int)
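Narrow types only help if the layout really is as small as expected and values are never silently truncated on the way in. A minimal sketch of both checks, assuming a C11 compiler; the helper clamp_to_int8 is a hypothetical name introduced here for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct sensor_data_optimized {
    int8_t   temperature;  /* -40 to 125 fits in -128..127 */
    uint8_t  humidity;     /* 0 to 100 fits in 0..255 */
    uint16_t pressure;     /* 300 to 1100 fits in 0..65535 */
};

/* Breaks the build if padding ever pushes the struct past 4 bytes. */
static_assert(sizeof(struct sensor_data_optimized) == 4,
              "sensor_data_optimized should occupy 4 bytes");

/* Hypothetical helper: saturates instead of wrapping when a wider
 * value is stored into the 8-bit temperature field. */
static inline int8_t clamp_to_int8(int value)
{
    if (value > INT8_MAX) {
        return INT8_MAX;
    }
    if (value < INT8_MIN) {
        return INT8_MIN;
    }
    return (int8_t)value;
}
```

Saturating at the boundary is one possible policy; rejecting out-of-range readings may suit some sensors better.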
Use Unions to Save Space
// Communication protocol data packet
typedef union {
    struct {
        uint8_t header[4];
        uint8_t cmd;
        uint8_t data[32];
        uint8_t checksum;
    } packet;
    uint8_t raw_data[38];
} comm_frame_t;
// Can be accessed by byte or by structure
comm_frame_t frame;
frame.packet.cmd = 0x01;                          // Structured access
send_data(frame.raw_data, sizeof(frame.raw_data)); // Byte array access (38 bytes)
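The byte view is also handy for anything that has to walk the whole frame, such as a checksum. The sketch below assumes a simple 8-bit additive checksum, which is my assumption and not something the protocol above specifies; frame_checksum is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef union {
    struct {
        uint8_t header[4];
        uint8_t cmd;
        uint8_t data[32];
        uint8_t checksum;
    } packet;
    uint8_t raw_data[38];
} comm_frame_t;

/* Every member is a uint8_t, so no padding is expected; verify it anyway. */
static_assert(sizeof(comm_frame_t) == 38, "unexpected padding in comm_frame_t");

/* Assumed checksum scheme (not from the original protocol):
 * 8-bit sum of every byte except the trailing checksum field. */
static uint8_t frame_checksum(const comm_frame_t *frame)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < sizeof(frame->raw_data) - 1; i++) {
        sum = (uint8_t)(sum + frame->raw_data[i]);
    }
    return sum;
}
```

A sender would fill the fields through frame.packet, store the result in frame.packet.checksum, and then transmit frame.raw_data.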
3. Trade Space for Time
Scenario
Suppose you need to count how many 1 bits a 4-bit value (0x0 to 0xF) contains. The traditional method loops over each bit.
Regular Approach
int count_ones_slow(unsigned char data) {
    int cnt = 0;
    unsigned char temp = data & 0xf;
    for (int i = 0; i < 4; i++) {
        if (temp & 0x01) {
            cnt++;
        }
        temp >>= 1;
    }
    return cnt;
}
Optimized Approach
// Pre-computed lookup table (const uint8_t: 16 bytes, and it can be placed in flash)
static const uint8_t ones_table[16] = {
    0, 1, 1, 2, 1, 2, 2, 3,
    1, 2, 2, 3, 2, 3, 3, 4
};
int count_ones_fast(unsigned char data) {
    return ones_table[data & 0xf];
}
Performance Comparison:
- Traditional method: 4 loop iterations, each with a bit test and a shift
- Lookup method: only 1 array access
Applicable Scenarios: complex calculations, trigonometric functions, CRC checks, and other cases where the input range is small enough to tabulate.
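The same small table can serve wider inputs by splitting them into nibbles, which keeps the table at 16 entries instead of 256. A minimal sketch, where count_ones_byte is a hypothetical name:

```c
#include <stdint.h>

/* Same values as the 16-entry table above. */
static const uint8_t ones_table[16] = {
    0, 1, 1, 2, 1, 2, 2, 3,
    1, 2, 2, 3, 2, 3, 3, 4
};

/* Hypothetical extension: counts the set bits of a full byte
 * with two table lookups, one per nibble. */
static inline uint8_t count_ones_byte(uint8_t data)
{
    return (uint8_t)(ones_table[data & 0x0Fu] + ones_table[data >> 4]);
}
```

This is the usual space-for-time dial: a 256-entry table would need one lookup instead of two, at sixteen times the memory.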
4. Use Flexible Arrays
Problems with Traditional Pointer Method
typedef struct {
    uint16_t head;
    uint8_t id;
    uint8_t type;
    uint8_t length;
    uint8_t *value; // Pointer method
} protocol_old_t;
Problems:
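- Requires two memory allocations (one for the struct, one for the data)
- Memory is not contiguous, and every access to the data costs an extra pointer dereference
- Memory release is error-prone: the data and the struct must be freed separately and in the right order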
The Elegance of Flexible Arrays
typedef struct {
    uint16_t head;
    uint8_t id;
    uint8_t type;
    uint8_t length;
    uint8_t value[]; // Flexible array member (C99), must be the last member
} protocol_new_t;
Advantages:
- Single allocation, contiguous memory
- Faster access speed
- Simpler memory management
- Avoids memory leak risks
Usage Example:
// Allocate struct + data space
protocol_new_t *p = malloc(sizeof(protocol_new_t) + data_len);
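A slightly fuller sketch of that allocation, with the failure check and payload copy included. The helper name protocol_create and the 0xAA55 frame marker are assumptions made purely for illustration.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint16_t head;
    uint8_t  id;
    uint8_t  type;
    uint8_t  length;
    uint8_t  value[];   /* flexible array member (C99) */
} protocol_new_t;

/* Hypothetical helper: allocates header and payload in one block and
 * copies the payload in. Returns NULL if the allocation fails. */
static protocol_new_t *protocol_create(uint8_t id, uint8_t type,
                                       const uint8_t *data, uint8_t data_len)
{
    protocol_new_t *p = malloc(sizeof(*p) + data_len);
    if (p == NULL) {
        return NULL;
    }
    p->head   = 0xAA55u;    /* assumed frame marker, illustrative only */
    p->id     = id;
    p->type   = type;
    p->length = data_len;
    if (data_len > 0) {
        memcpy(p->value, data, data_len);
    }
    return p;               /* release later with a single free(p) */
}
```

Because header and payload share one block, ownership is a single pointer, and one free() call releases everything.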
5. Use Bit Manipulation
Bit Fields: A Memory-Saving Trick
How would you manage 8 flag bits?
Memory-Wasting Approach:
struct flags_waste {
    unsigned char flag1;
    unsigned char flag2;
    unsigned char flag3;
    unsigned char flag4;
    unsigned char flag5;
    unsigned char flag6;
    unsigned char flag7;
    unsigned char flag8; // Total 8 bytes!
};
Memory-Efficient Approach:
struct flags_smart {
    unsigned char flag1:1; // Only uses 1 bit
    unsigned char flag2:1;
    unsigned char flag3:1;
    unsigned char flag4:1;
    unsigned char flag5:1;
    unsigned char flag6:1;
    unsigned char flag7:1;
    unsigned char flag8:1; // Total 1 byte!
} flags;
Memory Saving: reduced from 8 bytes to 1 byte! Keep in mind that bit-field layout is implementation-defined, so check the compiler documentation before mapping bit fields onto hardware registers or wire formats.
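When the exact bit positions matter, for example in a status byte sent over a bus, explicit masks are a common alternative to bit fields. A minimal sketch; the FLAG_BIT macro and the flag_* helpers are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask for bit n of an 8-bit flag byte; n must be in 0..7. */
#define FLAG_BIT(n)  ((uint8_t)(1u << (n)))

static inline void flag_set(uint8_t *flags, unsigned n)
{
    *flags |= FLAG_BIT(n);
}

static inline void flag_clear(uint8_t *flags, unsigned n)
{
    *flags &= (uint8_t)~FLAG_BIT(n);
}

static inline bool flag_test(uint8_t flags, unsigned n)
{
    return (flags & FLAG_BIT(n)) != 0;
}
```

This still stores eight flags in one byte, and the position of each flag is fixed by the code rather than by the compiler.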
Bit Operations: Replacing Multiplication and Division
Slow Version:
uint32_t val = 1024;
uint32_t doubled = val * 2; // Multiplication (optimizing compilers usually turn this into a shift on their own)
uint32_t halved = val / 2; // Division
Fast Version:
uint32_t val = 1024;
uint32_t doubled = val << 1; // Left shift 1 bit = multiply by 2
uint32_t halved = val >> 1; // Right shift 1 bit = divide by 2 (exact for unsigned values only)
6. Loop Unrolling – Reducing Jumps
Overhead of Traditional Loops
// Each loop has jump overhead
for (int i = 0; i < 4; i++) {
    process(array[i]);
}
Efficient Version After Unrolling
// Direct execution, no jump overhead
process(array[0]);
process(array[1]);
process(array[2]);
process(array[3]);
Advanced Unrolling: Parallel Computation
Regular Version:
long calc_sum_slow(int *arr0, int *arr1) {
    long sum = 0;
    for (int i = 0; i < 1000; i++) {
        sum += (long)arr0[i] * arr1[i]; // Serial computation: each add waits for the previous one
    }
    return sum;
}
Optimized Version:
long calc_sum_fast(int *arr0, int *arr1) {
    long sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    for (int i = 0; i < 1000; i += 4) { // Same 1000 elements in 250 iterations
        sum0 += (long)arr0[i+0] * arr1[i+0]; // Four independent accumulators can overlap
        sum1 += (long)arr0[i+1] * arr1[i+1];
        sum2 += (long)arr0[i+2] * arr1[i+2];
        sum3 += (long)arr0[i+3] * arr1[i+3];
    }
    return sum0 + sum1 + sum2 + sum3;
}
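The unrolled loop above relies on the element count being a multiple of four. For arbitrary lengths, a tail loop picks up the leftovers. A minimal sketch, where dot_product_unrolled is a hypothetical name and the fixed-width types are my choice:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical generalization: 4-way unrolled dot product for any length n. */
static int64_t dot_product_unrolled(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    size_t i = 0;

    /* Main body: four independent accumulators per iteration. */
    for (; i + 4 <= n; i += 4) {
        sum0 += (int64_t)a[i + 0] * b[i + 0];
        sum1 += (int64_t)a[i + 1] * b[i + 1];
        sum2 += (int64_t)a[i + 2] * b[i + 2];
        sum3 += (int64_t)a[i + 3] * b[i + 3];
    }

    /* Tail: the 0 to 3 elements left over when n is not a multiple of 4. */
    for (; i < n; i++) {
        sum0 += (int64_t)a[i] * b[i];
    }

    return sum0 + sum1 + sum2 + sum3;
}
```

Whether this beats the plain loop depends on the core and the compiler flags, so it is worth timing both on the target.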
7. Inline Functions – Eliminating Function Call Overhead
The Hidden Cost of Function Calls
Each function call has overhead:
- Parameter passing (in registers or on the stack)
- Jump instructions
- Stack frame management
- Return address saving
The Magic of Inline Functions
static inline void toggle_led(uint8_t pin) {
    PORT ^= (1u << pin);
}
// Typically expanded in place by the compiler (inline is a hint, not a guarantee)
toggle_led(LED_PIN);
Applicable Scenarios:
- Frequently called small functions
- Functions on critical paths
- Simple utility functions
8. Data Type Optimization
The Art of Loop Variables
Inefficient Approach:
char i; // Overflows if N exceeds the range of char; narrow types can also cost extra sign/zero-extension instructions
for (i = 0; i < N; i++) {
    // ...
}
Efficient Approach:
int i; // Native word size, easier for the compiler to optimize
for (i = 0; i < N; i++) {
    // ...
}
Data Type Selection Principles
- Loop Index: Prefer using int
- Storage Optimization: Use char if possible instead of int
- Compute Intensive: Avoid unnecessary floating-point operations
- Type Conversion: Reduce implicit type conversions
9. Loop Optimization Strategies
The Art of Nested Loop Arrangement
Inefficient Arrangement (long loop on the outer layer):
for (row = 0; row < 100; row++) {     // Outer loop 100 times
    for (col = 0; col < 5; col++) {   // Inner loop 5 times
        sum += a[row][col];
    }
}
// The inner loop is set up and exited 100 times
Efficient Arrangement (long loop on the inner layer):
for (col = 0; col < 5; col++) {       // Outer loop 5 times
    for (row = 0; row < 100; row++) { // Inner loop 100 times
        sum += a[row][col];
    }
}
// The inner loop is set up and exited only 5 times
// Caveat: this order walks a[row][col] column by column; on cores with a data cache,
// the row-by-row order can be faster, so measure both
Early Exit Strategy
Inefficient Method (execute complete loop):
bool found = false;
for (int i = 0; i < 10000; i++) {
    if (list[i] == target) {
        found = true; // Continue looping even after found!
    }
}
Efficient Method (exit immediately upon finding):
bool found = false;
for (int i = 0; i < 10000; i++) {
    if (list[i] == target) {
        found = true;
        break; // Exit immediately
    }
}
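Wrapping the search in a function makes the early exit even more natural, because return leaves the loop and hands back the position in one step. A minimal sketch; find_first is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the index of the first element equal to
 * target, or -1 if the list does not contain it. */
static ptrdiff_t find_first(const int32_t *list, size_t count, int32_t target)
{
    for (size_t i = 0; i < count; i++) {
        if (list[i] == target) {
            return (ptrdiff_t)i;   /* returning ends the loop immediately */
        }
    }
    return -1;
}
```

The caller gets both answers at once: whether the value exists and where it is.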
10. Structure Memory Alignment Optimization
The Impact of Memory Alignment
Unoptimized Version:
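struct waste_memory {
    char a;  // 1 byte, followed by 1 byte of padding
    short b; // 2 bytes, needs 2-byte alignment
    char c;  // 1 byte, followed by 3 bytes of padding
    int d;   // 4 bytes, needs 4-byte alignment
    char e;  // 1 byte, followed by 3 bytes of tail padding
};
// Actual usage: 16 bytes (typical 32-bit ABI: 2-byte short, 4-byte int)
Optimized Version:
struct save_memory {
    char a;  // 1 byte
    char c;  // 1 byte
    short b; // 2 bytes, already aligned
    int d;   // 4 bytes, already aligned
    char e;  // 1 byte, followed by 3 bytes of tail padding
};
// Actual usage: 12 bytes, saving 25%!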
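These sizes depend on the target's ABI, so it can be worth letting the compiler confirm them. A minimal sketch, assuming a C11 compiler and a typical ABI with a 2-byte short and a 4-byte int, each naturally aligned:

```c
#include <assert.h>
#include <stddef.h>

struct waste_memory { char a; short b; char c; int d; char e; };
struct save_memory  { char a; char c; short b; int d; char e; };

/* Build-time checks; they fail on targets whose alignment rules differ
 * from the assumed ABI, which is exactly when the numbers need rechecking. */
static_assert(sizeof(struct waste_memory) == 16, "expected 16 bytes with padding");
static_assert(sizeof(struct save_memory) == 12, "expected 12 bytes after reordering");
static_assert(offsetof(struct waste_memory, d) == 8, "d should sit after 3 padding bytes");
static_assert(offsetof(struct save_memory, d) == 4, "d should follow b with no padding");
```

On 8-bit cores without alignment requirements, both structs may come out at 9 bytes, in which case the reordering brings no saving.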
Notes
Optimization Principles
Do not optimize for the sake of optimization; do not sacrifice code readability.
- Measure first, then optimize: Use tools to find the real bottlenecks
- Trade-offs: Performance vs readability vs maintainability
- Incremental optimization: Optimize hotspot code first
- Validate results: Test correctness after optimization
Good optimization is not about showing off skills, but finding the best balance under constraints.
What other optimization techniques have you used in embedded development?