
Defensive Programming

The reliability of an embedded product naturally depends on its hardware. But once the hardware is fixed and no third-party testing is available, code written with a defensive-programming mindset tends to be more stable.

Defensive programming starts with recognizing the flaws and pitfalls of the C language. C performs very few runtime checks, so the programmer has to reason carefully about the code and add checks where they are needed. The other core idea of defensive programming is to assume that the code runs on unreliable hardware: external interference may disrupt the order of program execution, alter data stored in RAM, and so on.

1 Functions with Parameters Must Check the Validity of Passed Arguments

A programmer may unintentionally pass an incorrect argument, and strong external interference may modify the arguments being passed or cause a function to be called with random arguments. Therefore, before the function body executes, first make sure that the actual arguments are valid.

#include <stddef.h>                 /* NULL */

int exam_fun( unsigned char *str )
{
    if( str != NULL )               /* Validate the argument before using it */
    {
        /* Normal processing code */
        return 0;                   /* Illustrative success code */
    }
    else
    {
        /* Error handling code */
        return -1;                  /* Illustrative error code */
    }
}

2 Carefully Check Function Return Values

Error codes returned by functions should be handled thoroughly and carefully, and errors should be logged where necessary.

char *DoSomething(…)
{
    char *p;

    p = malloc(1024);
    if (p == NULL)          /* Check the function return value */
    {
        UARTprintf(…);      /* Print error message */
        return NULL;
    }
    return p;
}

3 Prevent Pointer Out-of-Bounds

When an address is calculated dynamically, make sure the result is reasonable and points to a meaningful location. In particular, when a pointer into a structure or array is incremented or otherwise changed, it must still point into the same structure or array.
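As a minimal sketch of this rule, the helper below validates an index before any pointer is formed, since in C even computing a pointer far outside an array is undefined behavior. The names `table`, `TABLE_LEN`, and `table_entry()` are hypothetical and not part of the original article.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_LEN 16u

static uint16_t table[TABLE_LEN];   /* hypothetical data table */

/* Return a pointer to table[base + step], or NULL if that element would
 * lie outside the array. The index is checked before the pointer is
 * formed, and the check itself cannot overflow. */
uint16_t *table_entry(size_t base, size_t step)
{
    if (base >= TABLE_LEN || step >= TABLE_LEN - base)
    {
        return NULL;                /* would leave the array */
    }
    return &table[base + step];
}
```

A caller then treats a NULL result as an error instead of dereferencing a pointer that has wandered out of the table.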

4 Prevent Array Out-of-Bounds

Array out-of-bounds access has been discussed at length earlier. Because C does not check array bounds, the application has to check for out-of-bounds access explicitly. The following example can be used when receiving communication data in an interrupt.

#define REC_BUF_LEN 100
unsigned char RecBuf[REC_BUF_LEN];
// Other code
void Uart_IRQHandler(void)
{
    static unsigned int RecCount = 0;   // Receive data length counter
    // Other code
    if (RecCount < REC_BUF_LEN)         // Check that the index stays inside the array
    {
        RecBuf[RecCount] = …;           // Get data from hardware
        RecCount++;
        // Other code
    }
    else
    {
        // Error handling code
    }
    // Other code
}

Boundary checks are also necessary when calling some library functions. For example, memset(RecBuf,0,len) fills the first len bytes of the memory area pointed to by RecBuf with 0. If len is not checked, the call can clear memory beyond the end of the RecBuf array:

#include <string.h>                 // memset

#define REC_BUF_LEN 100
unsigned char RecBuf[REC_BUF_LEN];

if (len <= REC_BUF_LEN)             // len == REC_BUF_LEN clears the whole array and is still in bounds
{
    memset(RecBuf, 0, len);         // Clear the RecBuf array
}
else
{
    // Handle error
}

5 Mathematical Operations

5.1 Is Checking for Zero Divisor Enough?

Checking whether the divisor is zero before a division has almost become common practice, but is checking for a zero divisor enough?

Consider the division of two integers. For a 32-bit signed long variable, the value range is -2147483648 to +2147483647. The result of -2147483648 / -1 should be +2147483648, but that value is outside the range a signed long can represent. So besides checking for a zero divisor, we also need to check whether the division overflows.

#include <limits.h>
signed long sl1, sl2, result;
/* Initialize sl1 and sl2 */
if ((sl2 == 0) || (sl1 == LONG_MIN && sl2 == -1))
{
    // Handle error
}
else
{
    result = sl1 / sl2;
}

5.2 Check for Arithmetic Overflow

Integer addition, subtraction, and multiplication can all potentially overflow. When discussing undefined behavior, a code snippet for checking signed integer addition overflow was provided. Here is a code snippet for checking unsigned integer addition overflow:

#include <limits.h>
unsigned int a, b, result;
/* Initialize a, b */
if (UINT_MAX - a < b)
{
    // Handle overflow
}
else
{
    result = a + b;
}
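The same pattern extends to unsigned subtraction and multiplication, which the text above also names as operations that can overflow. The following is a minimal sketch; the helper names `usub_checked()` and `umul_checked()` are hypothetical and do not appear in the original article.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Store a - b in *result and return true, or return false if the
 * difference would wrap around below zero. */
bool usub_checked(unsigned int a, unsigned int b, unsigned int *result)
{
    if (result == NULL || a < b)
    {
        return false;
    }
    *result = a - b;
    return true;
}

/* Store a * b in *result and return true, or return false if the
 * product does not fit in an unsigned int. */
bool umul_checked(unsigned int a, unsigned int b, unsigned int *result)
{
    if (result == NULL || (a != 0u && b > UINT_MAX / a))
    {
        return false;
    }
    *result = a * b;
    return true;
}
```

Both checks are done before the operation, so no wrapped value is ever produced.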

Embedded hardware generally has no floating-point unit, and floating-point operations are relatively rare in embedded systems. Floating-point overflow checks also rely heavily on C library support, so they are not discussed here.

5.3 Check for Shifts

When discussing undefined behavior, it was mentioned that shifting by a negative amount, or by an amount greater than or equal to the number of bits in the operand, is undefined behavior, and that right-shifting a negative signed number gives an implementation-defined result. It was also recommended not to perform bit operations on signed numbers at all. Even for unsigned numbers, however, the shift amount must be checked against the number of bits in the operand. Below is a code snippet for checking left shifts of unsigned integers:

#include <limits.h>                 /* CHAR_BIT */

unsigned int ui1;
unsigned int ui2;
unsigned int uresult;

/* Initialize ui1, ui2 */
if (ui2 >= sizeof(unsigned int) * CHAR_BIT)
{
    // Handle error
}
else
{
    uresult = ui1 << ui2;
}

6 Use Hardware Watchdog if Available

If all other measures fail, the watchdog may be the last line of defense. Its principle is very simple, yet it can greatly improve the reliability of a device. If the device has a hardware watchdog, be sure to write a driver for it.

Enable the watchdog as early as possible

Between power-on reset and the moment the watchdog is enabled, interference may cause the device to skip the watchdog initialization code, leaving the watchdog disabled. Enabling the watchdog as early as possible reduces this probability.

Do not feed the watchdog in interrupts unless there are other linked measures

If the watchdog is fed in an interrupt, interference may leave the program stuck in or around that interrupt, which keeps feeding the watchdog even though the main program no longer runs, so the watchdog never fires. Feeding the watchdog in an interrupt is acceptable only when it is tied to a flag that the main program sets.

The feeding interval of the watchdog depends on product requirements, not a specific time

The characteristics of the product determine the feeding interval. For devices with no safety or real-time requirements, the interval can be fairly loose, but it should not be so long that users notice the hang and the user experience suffers. For devices designed for safety and real-time control, the principle is to reset as quickly as possible; otherwise accidents may occur.
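The flag-linked feeding described above can be sketched as follows. This is only an illustration under stated assumptions: `wdt_feed()`, `main_loop_checkpoint()`, and `timer_isr()` are hypothetical names, and `wdt_feed()` just counts calls here, whereas on real hardware it would write the watchdog's feed sequence.

```c
#include <stdbool.h>
#include <stdint.h>

static volatile bool main_loop_alive = false;   /* set by the main loop */
static uint32_t feed_count = 0;                 /* stands in for the hardware watchdog */

/* Hypothetical hardware hook: a real driver would write the feed sequence here. */
static void wdt_feed(void)
{
    feed_count++;
}

/* Called once per pass of the main loop to prove it is still running. */
void main_loop_checkpoint(void)
{
    main_loop_alive = true;
}

/* Called from a periodic timer interrupt. The watchdog is fed only if the
 * main loop has checked in since the previous interrupt. */
void timer_isr(void)
{
    if (main_loop_alive)
    {
        main_loop_alive = false;
        wdt_feed();
    }
}
```

If the main loop hangs, the flag is never set again, the interrupt stops feeding, and the watchdog resets the device.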

In the second phase of its mission, the Clementine spacecraft was scheduled to fly from the Moon to the asteroid Geographos in deep space. However, because of a software defect, the spacecraft stopped operating for 20 minutes on its way to the asteroid. It never reached the asteroid, and because a control thruster fired for 11 minutes, its power supply was reduced and it could no longer be controlled remotely. The mission was ended, wasting resources and money.

“I was shocked by the failure of the Clementine space mission. It could have been avoided with a simple hardware watchdog timer, but due to the tight development schedule at the time, the programmers did not have time to write the program to enable it,” Ganssle said.

Unfortunately, the NEAR spacecraft ran into the same kind of problem in 1998. Because the programmers had not adopted these recommendations, 29 kilograms of reserve fuel were wasted when the thruster deceleration system failed. This, too, could have been avoided by programming a watchdog timer, which shows that learning from other programmers' mistakes is not easy.

7 Key Data Should Have Multiple Backups, Use Voting Method for Data Retrieval

Data in RAM may be altered by interference, so critical system data should be protected. Critical data includes global variables, static variables, and data areas that need protection. Backup copies should not sit next to the original data, so their placement should not be left to the compiler's default allocation; the programmer should specify the storage areas explicitly.

RAM can be divided into three areas: the first stores the original code, the second stores the inverted (ones' complement) code, and the third stores the XOR code, with some unused RAM left between the areas for isolation. The compiler's “scatter loading” mechanism can be used to place variables in these areas. To read a value, read all three copies and vote, taking the value on which at least two copies agree.

Suppose the device's RAM starts at 0x10000000, the original code is to be stored in 0x10000000 to 0x10007FFF, the inverted code in 0x10009000 to 0x10009FFF, and the XOR code in 0x1000B000 to 0x1000BFFF. The compiler's scatter loading can then be set as follows:

LR_IROM1 0x00000000 0x00080000  {    ; load region size_region
  ER_IROM1 0x00000000 0x00080000  {  ; load address = execution address
   *.o (RESET, +First)
   *(InRoot$$Sections)
   .ANY (+RO)
  }
  RW_IRAM1 0x10000000 0x00008000  {  ; Store original code
   .ANY (+RW +ZI )
  }

  RW_IRAM3 0x10009000 0x00001000  {  ; Store inverted (ones' complement) code
   .ANY (MY_BK1)
  }

  RW_IRAM2 0x1000B000 0x00001000  {  ; Store XOR code
   .ANY (MY_BK2)
  }
}

If a critical variable needs multiple backups, it can be defined as follows. The three variables are placed in three non-contiguous RAM areas and initialized to the original code, the inverted (ones' complement) code, and the XOR code (with mask 0xAAAAAAAA) of the initial value 0.

uint32 plc_pc=0;                                                       // Original code
__attribute__((section("MY_BK1"))) uint32 plc_pc_not=~0x0;              // Inverted (ones' complement) code
__attribute__((section("MY_BK2"))) uint32 plc_pc_xor=0x0^0xAAAAAAAA;    // XOR code

When writing this variable, all three locations must be updated. When reading it, read all three values and vote, taking the value on which at least two copies agree.

Why use an XOR code rather than the two's complement code? MDK stores integers in two's complement form, and the two's complement representation of a non-negative number is identical to its original code. In that case the original copy and the two's complement copy are the same, which not only provides no redundancy but actually harms reliability. For example, if interference clears the RAM holding a non-zero integer, the original copy and the two's complement copy still agree with each other, so 2-out-of-3 voting would accept the corrupted value 0 as correct data.
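The write and voted read described in this section might look like the sketch below. The function names `protected_write()` and `protected_read()` and the three `copy_*` variables are hypothetical. Plain static variables are used so that the sketch is portable; in a real project each copy would be placed in its own RAM region as shown earlier.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define XOR_MASK 0xAAAAAAAAu

static uint32_t copy_raw;   /* original code */
static uint32_t copy_inv;   /* inverted (ones' complement) code */
static uint32_t copy_xor;   /* XOR code */

/* Update all three copies together. */
void protected_write(uint32_t value)
{
    copy_raw = value;
    copy_inv = ~value;
    copy_xor = value ^ XOR_MASK;
}

/* Decode the three copies and take the value on which at least two agree.
 * On success the copies are rewritten to repair a single corrupted one.
 * Returns false if all three disagree or value is NULL. */
bool protected_read(uint32_t *value)
{
    uint32_t a = copy_raw;
    uint32_t b = ~copy_inv;
    uint32_t c = copy_xor ^ XOR_MASK;
    uint32_t voted;

    if (value == NULL)
    {
        return false;
    }
    if (a == b || a == c)
    {
        voted = a;
    }
    else if (b == c)
    {
        voted = b;
    }
    else
    {
        return false;           /* no majority: caller must handle the error */
    }
    protected_write(voted);     /* repair the odd one out */
    *value = voted;
    return true;
}
```

With this encoding, RAM that has been cleared to all zeros decodes to three different values, so the vote fails instead of silently returning 0.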

8 Backup Storage for Non-Volatile Memory

Non-volatile memory includes but is not limited to Flash, EEPROM, and ferroelectric memory. Simply reading back what was written for verification is not enough. Under strong interference, data in non-volatile memory may become corrupted, and a power failure during a write will cause data loss. If interference makes the program run into the function that writes non-volatile memory, the stored data will be scrambled. A reliable approach is to divide the non-volatile memory into several areas and write each piece of data to these areas in different forms. To read a value, read all of the copies and vote, taking the value on which the most copies agree.

9 Software Locks

For initialization sequences or function calls that must happen in a certain order, interlocking, which is essentially a software lock, can be used to guarantee the calling order or to guarantee that every function is called. In addition, software locks can be placed on safety-critical statements (statements, not functions), so that only code holding a specific key can execute them. Put simply, safety-critical code must not be executable under a single condition; an additional flag should be required.

For example, when writing data to Flash, we check whether the data is valid and whether the write address is valid, and calculate the sector to be written. Then we call the Flash write subroutine, which checks whether the sector address and the data length are valid and then writes the data to Flash. Because the Flash write statement is safety-critical code, the program locks it: it can be executed only if the correct key is held. This way, even if the program runs into the Flash write subroutine by mistake, the risk of an erroneous write is greatly reduced.

/****************************************************************************
* Name: RamToFlash()
* Function: Copy data from RAM to FLASH, command code 51.
* Entry parameters: dst        Target address, i.e., FLASH starting address, on a 512-byte boundary
*           src        Source address, i.e., RAM address. Address must be word-aligned
*           no         Number of bytes to copy, 512/1024/4096/8192
*           ProgStart  Software lock flag
* Exit parameters: IAP return value (paramout buffer) CMD_SUCCESS, SRC_ADDR_ERROR, DST_ADDR_ERROR,
* SRC_ADDR_NOT_MAPPED, DST_ADDR_NOT_MAPPED, COUNT_ERROR, BUSY, sector not selected
****************************************************************************/
void  RamToFlash(uint32 dst, uint32 src, uint32 no,uint8 ProgStart)
{
    PLC_ASSERT("Sector number",(dst>=0x00040000)&&(dst<=0x0007FFFF));
    PLC_ASSERT("Copy bytes number is 512",(no==512));
    PLC_ASSERT("ProgStart==0xA5",(ProgStart==0xA5));

    paramin[0] = IAP_RAMTOFLASH;             // Set command word
    paramin[1] = dst;                        // Set parameters
    paramin[2] = src;
    paramin[3] = no;
    paramin[4] = Fcclk/1000;
    if(ProgStart==0xA5)                     // Execute critical code only when the software lock flag is correct
    {
        iap_entry(paramin, paramout);       // Call IAP service program
        ProgStart=0;
    }
    else
    {
        paramout[0]=PROG_UNSTART;
    }
}

This code programs the internal Flash of the LPC1778. The call to the IAP routine, iap_entry(paramin, paramout), is the safety-critical code, so before executing it the function checks a dedicated safety lock flag, ProgStart. The Flash programming operation is performed only if this flag has the expected value. If the program runs into this function unexpectedly, the ProgStart flag will be wrong and the Flash will not be programmed.
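The interlocking of an ordered call sequence mentioned at the start of this section can be sketched in the same spirit. The step functions and key values below are hypothetical: each step runs only if the previous step left the expected key behind, and then stores the key for the next step.

```c
#include <stdbool.h>
#include <stdint.h>

#define KEY_STEP1_DONE 0x5Au    /* hypothetical key values */
#define KEY_STEP2_DONE 0xA5u

static uint8_t init_key = 0u;   /* key handed from one step to the next */

bool init_step1(void)
{
    /* ... first initialization stage ... */
    init_key = KEY_STEP1_DONE;
    return true;
}

bool init_step2(void)
{
    if (init_key != KEY_STEP1_DONE)
    {
        return false;           /* called out of order: refuse to run */
    }
    /* ... second initialization stage ... */
    init_key = KEY_STEP2_DONE;
    return true;
}

bool init_step3(void)
{
    if (init_key != KEY_STEP2_DONE)
    {
        return false;           /* called out of order: refuse to run */
    }
    /* ... final initialization stage ... */
    init_key = 0u;              /* consume the key so the step cannot repeat */
    return true;
}
```

A program counter that jumps into `init_step3()` by accident finds the wrong key and returns without doing anything.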

10 Communication

Data errors on communication lines are relatively serious: the longer the line and the harsher the environment, the more errors occur. Leaving aside the effects of hardware and environment, our software should be able to recognize erroneous communication data. Some practical measures are:

Limit the number of bytes per frame when formulating protocols

The more bytes a frame contains, the more likely it is to contain an error and the more data is invalidated when it does. For this reason, Ethernet limits the data in each frame to 1500 bytes, the high-reliability CAN bus limits each frame to 8 data bytes, and the Modbus protocol widely used on RS485 limits a frame to 256 bytes. It is therefore recommended that an in-house RS485 protocol also limit each frame to no more than 256 bytes.

Use multiple checks

Enable parity checking in the program. For applications whose frames exceed 16 bytes, it is recommended to add at least a CRC16 check.
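The article does not say which CRC16 variant to use. As one possible sketch, the bitwise routine below uses the CRC-16/MODBUS parameters (reflected polynomial 0xA001, initial value 0xFFFF); the function name `crc16_modbus()` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16/MODBUS over len bytes of data.
 * Returns the initial value 0xFFFF for an empty or NULL buffer. */
uint16_t crc16_modbus(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    size_t i;
    int bit;

    if (data == NULL)
    {
        return crc;
    }
    for (i = 0; i < len; i++)
    {
        crc ^= data[i];
        for (bit = 0; bit < 8; bit++)
        {
            if (crc & 1u)
            {
                crc = (uint16_t)((crc >> 1) ^ 0xA001u);
            }
            else
            {
                crc >>= 1;
            }
        }
    }
    return crc;
}
```

The sender appends the CRC to the frame low byte first. The receiver can then run the same routine over the whole frame, CRC included, and accept the frame only if the result is 0.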

Add additional checks

1) Add buffer overflow checks. Data reception is usually done in interrupts, and the compiler cannot detect whether the buffer overflows, so manual checks are required, as described in the earlier section on array out-of-bounds access.

2) Add timeout checks. If part of a frame has been received and the rest does not arrive for a long time, treat the frame as invalid and restart reception. This check is optional and depends on the protocol, but the buffer overflow check is mandatory. The reason is that, for protocols that check a frame header, the host may send a frame header and then suddenly lose power. After rebooting, the host starts sending a new frame, but the slave has already received the previous, incomplete frame header, so it treats the host's new frame header as ordinary data. As a result, the data length field may be read as a very large value (for example, a frame of 1000 bytes), and a large amount of data is needed to fill that length, which hurts response time. Worse, if the program has no buffer overflow check, the buffer may overflow, with disastrous consequences.
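The two checks above can be combined in the byte-reception routine. The sketch below is an illustration only: `rx_on_byte()`, the buffer size, and the 50 ms gap are assumptions, and the millisecond tick is passed in by the caller rather than read from a real timer.

```c
#include <stdbool.h>
#include <stdint.h>

#define RX_BUF_LEN        64u
#define RX_GAP_TIMEOUT_MS 50u   /* assumed maximum gap inside one frame */

static uint8_t  rx_buf[RX_BUF_LEN];
static uint32_t rx_count   = 0u;
static uint32_t rx_last_ms = 0u;

/* Call for every received byte with the current millisecond tick.
 * Returns false if the byte was dropped because the buffer is full. */
bool rx_on_byte(uint8_t byte, uint32_t now_ms)
{
    /* Timeout check: a long gap means the partial frame is stale, so
     * discard it and treat this byte as the start of a new frame.
     * Unsigned subtraction stays correct when the tick wraps around. */
    if (rx_count > 0u && (uint32_t)(now_ms - rx_last_ms) > RX_GAP_TIMEOUT_MS)
    {
        rx_count = 0u;
    }
    rx_last_ms = now_ms;

    /* Buffer overflow check: never write past the end of rx_buf. */
    if (rx_count >= RX_BUF_LEN)
    {
        return false;
    }
    rx_buf[rx_count++] = byte;
    return true;
}
```

With this in place, a half-received frame header left over from a host reboot is thrown away as soon as the next frame arrives after the gap.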

Retransmission mechanism

If communication data errors are detected, there should be a retransmission mechanism to resend the erroneous frame.

11 Detection and Confirmation of Digital Input

Digital inputs are easily disturbed by sharp pulse interference, which can cause false actions if it is not filtered out. In general, the digital input signal should be sampled several times and judged logically, and accepted only once it has been confirmed.
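One common way to do this is a counting filter: a new level is accepted only after it has been seen on several consecutive samples. The sketch below assumes four samples and uses hypothetical names (`debounce_t`, `debounce_update()`); the raw level is passed in by the caller instead of being read from a real pin.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEBOUNCE_SAMPLES 4u     /* assumed number of consecutive samples */

typedef struct {
    bool    stable;             /* last confirmed input level */
    uint8_t count;              /* consecutive samples that differ from it */
} debounce_t;

/* Feed one raw sample (call periodically) and return the confirmed level.
 * The confirmed level changes only after DEBOUNCE_SAMPLES consecutive
 * samples disagree with it; a shorter pulse is ignored. */
bool debounce_update(debounce_t *d, bool raw)
{
    if (raw == d->stable)
    {
        d->count = 0u;          /* the disturbance ended: start over */
    }
    else if (++d->count >= DEBOUNCE_SAMPLES)
    {
        d->stable = raw;
        d->count = 0u;
    }
    return d->stable;
}
```

The sampling period and sample count together set the shortest pulse that is accepted as a real change.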

12 Digital Output

Writing a digital output only once is not safe, because interference may flip the state of the output. Refreshing the output repeatedly is an effective way to guard against such level flips.
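A minimal sketch of output refreshing keeps the intended state in a shadow variable and rewrites the port from it periodically. Here `port_reg` merely stands in for a real GPIO output register, and `dout_write()` and `dout_refresh()` are hypothetical names.

```c
#include <stdint.h>

static volatile uint32_t port_reg;  /* stands in for a real GPIO output register */
static uint32_t port_shadow;        /* intended output state */

/* Set the outputs and remember the intended state. */
void dout_write(uint32_t value)
{
    port_shadow = value;
    port_reg = value;
}

/* Call periodically (for example from the main loop) to rewrite the
 * outputs, undoing any level that interference has flipped. */
void dout_refresh(void)
{
    port_reg = port_shadow;
}
```

The shadow variable itself lives in RAM, so in a demanding application it could be protected with the backup-and-vote method from section 7.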

13 Saving and Restoring Initialization Information

The values of microprocessor registers may also be changed by external interference. Peripheral initialization values have to remain in registers for a long time, which makes them the most likely to be corrupted. Because data in Flash is relatively hard to corrupt, the initialization information can be written to Flash in advance. When the program is idle, compare the initialization-related register values with the stored values, and if an illegal change is found, restore the registers from Flash.

The 4.3-inch LCD screen currently used by the company has only average immunity to interference. If the cable between the display and the controller is too long, or if electrostatic discharge or fast transient bursts are applied to a device using this display, the screen may show garbage or turn white. To address this, the display's initialization data can be saved in Flash. While the program runs, it periodically reads the current values of the display's registers and compares them with the values stored in Flash. If they differ, the display is reinitialized. The verification source code is given below for reference.

Define data structure:

typedef struct {
    uint8_t  lcd_command;           // LCD register
    uint8_t  lcd_get_value[8];      // Values written to registers during initialization
    uint8_t  lcd_value_num;         // Number of values written to registers during initialization
} lcd_redu_list_struct;

Define a const structure array that holds the initial values of some of the LCD registers. These values depend on how the specific application initializes the display and are not necessarily the ones in the table below. Because it is const, this array is normally stored in Flash.

/* LCD part register setting value list */
lcd_redu_list_struct const lcd_redu_list_str[]=
{
  {SSD1963_Get_Address_Mode,{0x20}                                   ,1}, /*1*/
  {SSD1963_Get_Pll_Mn      ,{0x3b,0x02,0x04}                         ,3}, /*2*/
  {SSD1963_Get_Pll_Status  ,{0x04}                                   ,1}, /*3*/
  {SSD1963_Get_Lcd_Mode    ,{0x24,0x20,0x01,0xdf,0x01,0x0f,0x00}     ,7}, /*4*/
  {SSD1963_Get_Hori_Period ,{0x02,0x0c,0x00,0x2a,0x07,0x00,0x00,0x00},8}, /*5*/
  {SSD1963_Get_Vert_Period ,{0x01,0x1d,0x00,0x0b,0x09,0x00,0x00}     ,7}, /*6*/
  {SSD1963_Get_Power_Mode  ,{0x1c}                                   ,1}, /*7*/
  {SSD1963_Get_Display_Mode,{0x03}                                   ,1}, /*8*/
  {SSD1963_Get_Gpio_Conf   ,{0x0F,0x01}                              ,2}, /*9*/
  {SSD1963_Get_Lshift_Freq ,{0x00,0xb8}                              ,2}, /*10*/
};

The function is implemented as follows. It walks through every command in the structure array and every initial value under each command. As soon as one value is wrong, it leaves the loops and carries out reinitialization and recovery.

The MY_DEBUGF macro in this function is my own debugging function, which prints debugging information over the serial port. The details will be discussed in the fifth part.

With this function, I can monitor over a long period which commands and which bits of the display are most easily disturbed. The program uses a much-demonized keyword: goto. Most C books are wary of goto, but you should use your own judgment.

In this function, what other method besides goto can leave nested loops so simply and efficiently?

/**
* LCD display redundancy
* Call this program once every period
*/
void lcd_redu(void)
{
    uint8_t  tmp[8];
    uint32_t i,j;
    uint32_t lcd_init_flag;

    lcd_init_flag =0;
    for(i=0;i<sizeof(lcd_redu_list_str)/sizeof(lcd_redu_list_str[0]);i++)
    {
        LCD_SendCommand(lcd_redu_list_str[i].lcd_command);
        uyDelay(10);
        for(j=0;j<lcd_redu_list_str[i].lcd_value_num;j++)
        {
            tmp[j]=LCD_ReadData();
            if(tmp[j]!=lcd_redu_list_str[i].lcd_get_value[j])
            {
                lcd_init_flag=0x55;
                /* Adjacent string literals are concatenated, so the message can span two lines */
                MY_DEBUGF(MENU_DEBUG,("Read lcd register value does not match expected, command:0x%x, parameter %d, "
                    "expected value:0x%x, actual read value:0x%x\n",lcd_redu_list_str[i].lcd_command,j+1,
                    lcd_redu_list_str[i].lcd_get_value[j],tmp[j]));
                goto handle_lcd_init;
            }
        }
    }

handle_lcd_init:
    if(lcd_init_flag==0x55)
    {
        // Reinitialize LCD
        // Some necessary recovery measures
    }
}

14 Traps

8051-core microcontrollers have no corresponding hardware support, so traps can only be set up purely in software to catch a runaway program. ARM7 and Cortex-M series microcontrollers already have various exceptions built into the hardware, and the software should provide trap handlers for these hardware exceptions so that errors can be located quickly or even recovered from.

15 Blocking Handling

Programmers sometimes use a statement such as while(!flag); to block until a flag changes, for example while waiting for a byte to finish transmitting on the serial port. Such code is risky: if the flag never changes for some reason, the system hangs.

A program with good redundancy sets a timeout timer that forces the program out of the while loop after a certain time.
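A bounded wait could be sketched as below. The name `wait_flag()` is hypothetical, and the limit is expressed as a number of polls to keep the sketch portable; a real driver would more likely compare against a hardware timer so that the timeout does not depend on CPU speed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Poll *flag until it becomes true or max_polls further polls have been
 * used up. Returns true if the flag was seen set, false on timeout or
 * if flag is NULL. */
bool wait_flag(volatile const bool *flag, uint32_t max_polls)
{
    if (flag == NULL)
    {
        return false;
    }
    for (;;)
    {
        if (*flag)
        {
            return true;
        }
        if (max_polls == 0u)
        {
            return false;       /* timed out: the caller must handle the error */
        }
        max_polls--;
    }
}
```

The caller replaces `while(!flag);` with a call such as `wait_flag(&flag, limit)` and handles a false result as a transmission error.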

The W32.Blaster.Worm outbreak of August 11, 2003 caused global economic losses of up to $500 million. The worm exploited a logic flaw in the remote procedure call interface of the Windows Distributed Component Object Model: the loop in the GetMachineName() function had an insufficient termination condition.

The original code is simplified as follows:

HRESULT GetMachineName ( WCHAR *pwszPath,
WCHAR wszMachineName[MAX_COMPUTERNAME_LENGTH_FQDN+1])
{
       WCHAR *pwszServerName = wszMachineName;
       WCHAR *pwszTemp = pwszPath + 2;
       while ( *pwszTemp != L'\\' )           /* This loop termination condition is insufficient */
             *pwszServerName++= *pwszTemp++;
       /*… */
}

Microsoft's security patch MS03-026 resolved this issue by giving the loop in GetMachineName() a sufficient termination condition.

A simplified fix looks like this (it is not the code of the Microsoft patch):

HRESULT GetMachineName( WCHAR *pwszPath,
WCHAR wszMachineName[MAX_COMPUTERNAME_LENGTH_FQDN+1])
{
       WCHAR *pwszServerName = wszMachineName;
       WCHAR *pwszTemp = pwszPath + 2;
       WCHAR *end_addr = pwszServerName + MAX_COMPUTERNAME_LENGTH_FQDN;
       while ((*pwszTemp != L'\\') && (*pwszTemp != L'\0')
              && (pwszServerName < end_addr))  /* Sufficient termination condition */
             *pwszServerName++= *pwszTemp++;
       /*… */
}

Copyright Notice: This article is sourced from the internet, and the copyright belongs to the original author. For copyright issues, please contact for deletion.
