Key Points and Ideas for Embedded Software Reliability Design

1 Print Error Messages
Error messages printed over the serial port expose bugs in the design so that they can be corrected.
unsigned int WriteData(unsigned int addr)
{
    if ((addr >= BASE_ADDR) && (addr <= END_ADDR))
    {
        … /* Address is valid, process it */
    }
    else
    {
        /* Address error, print error message */
        UARTprintf("File %s at line %d encountered address error while writing data, error address: 0x%x\n", __FILE__, __LINE__, addr);
        … /* Error handling code */
    }
}

2 Are Actual Parameters Valid?

Before executing the function body, check that the arguments passed in are valid.
int exam_fun(unsigned char *str)
{
    if (str != NULL)
    {
        // The assumption "pointer is not null" holds
        ... // Normal processing code
    }
    else
    {
        UARTprintf(...); // Print error message
        ... // Error handling code
    }
}

3 Check Function Return Values

All error codes returned by functions must be carefully handled and logged if necessary.
#include <stdlib.h>
char *DoSomething(...)
{
    char *p;
    p = malloc(1024);
    if (p == NULL)
    {
        /* Check function return value */
        UARTprintf(...); /* Print error message */
        return NULL;
    }
    return p;
}

4 Prevent Pointer Out-of-Bounds

When calculating an address dynamically, make sure the result is reasonable and points to a meaningful location. In particular, a pointer into a structure or array must still point inside the same structure or array after it is incremented or otherwise changed.
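A minimal sketch of this rule, assuming a hypothetical sample buffer `samples` and a hypothetical accessor `sample_at()` (neither appears in the original): the index arithmetic is validated before the pointer is formed, so the result either points inside the array or is NULL.

```c
#include <stddef.h>
#include <stdint.h>

#define SAMPLE_COUNT 8u   /* hypothetical buffer size */

static uint16_t samples[SAMPLE_COUNT];

/* Returns a pointer to samples[index + offset] when that element lies inside
 * the array, or NULL when the computed position would leave the array. */
uint16_t *sample_at(size_t index, size_t offset)
{
    if (index >= SAMPLE_COUNT)
    {
        return NULL;                 /* starting position is already invalid */
    }
    if (offset >= (SAMPLE_COUNT - index))
    {
        return NULL;                 /* moving by offset would leave the array */
    }
    return &samples[index + offset]; /* still inside the same array */
}
```

Checking the index rather than the finished pointer avoids ever computing an address outside the array.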

5 Prevent Array Out-of-Bounds

Explicitly check array indices in application code. The following example receives communication data in an interrupt handler:
#define REC_BUF_LEN 100
unsigned char RecBuf[REC_BUF_LEN];
… // Other code
void Uart_IRQHandler(void)
{
    static unsigned int RecCount = 0; // Receive data length counter
    … // Other code
    if (RecCount < REC_BUF_LEN)
    {
        RecBuf[RecCount] = …; // Get data from hardware
        RecCount++;
        … // Other code
    }
    else
    {
        UARTprintf(...); // Print error message
        … // Other error handling code
    }
    …
}
When using some library functions, boundary checks are also necessary:
#include <string.h>
#define REC_BUF_LEN 100
unsigned char RecBuf[REC_BUF_LEN];
if (len <= REC_BUF_LEN) // len may equal the array size when clearing the whole array
{
    memset(RecBuf, 0, len); // Clear the RecBuf array
}
else
{
    // Handle error
}

6 Mathematical Operations

When dividing two integers, check not only for a zero divisor but also for overflow: dividing the most negative value by -1 gives a result that cannot be represented.
#include <limits.h>
signed long sl1, sl2, result;
/* Initialize sl1 and sl2 */
if ((sl2 == 0) || ((sl1 == LONG_MIN) && (sl2 == -1)))
{
    // Handle error
}
else
{
    result = sl1 / sl2;
}
Addition overflow check:
a) Unsigned Addition
#include <limits.h>
unsigned int a, b, result;
/* Initialize a, b */
if (UINT_MAX - a < b)
{
    // Handle overflow
}
else
{
    result = a + b;
}
b) Signed Addition
#include <limits.h>
signed int a, b, result;
/* Initialize a, b */
if ((a > 0 && INT_MAX - a < b) || (a < 0 && INT_MIN - a > b))
{
    // Handle overflow
}
else
{
    result = a + b;
}
Multiplication overflow check:
a) Unsigned Multiplication
#include <limits.h>
unsigned int a, b, result;
/* Initialize a, b */
if ((a != 0) && (UINT_MAX / a < b))
{
    // Handle overflow
}
else
{
    result = a * b;
}
b) Signed Multiplication
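#include <limits.h>
signed int a, b, result;
long long tmp;
/* Initialize a, b */
/* Multiply in a wider type so the product itself cannot overflow.
   This assumes long long is wider than int on the target. */
tmp = (long long)a * (long long)b;
if ((tmp > INT_MAX) || (tmp < INT_MIN))
{
    // Handle overflow
}
else
{
    result = (int)tmp;
}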

7 Set Dynamic Code Checks

During program design, set dynamic code checks according to system requirements.
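One common form of dynamic check is a run-time assertion that stays in the shipped firmware and reports a violated assumption instead of halting. The sketch below is only an illustration: the macro `RUNTIME_CHECK`, the counter `check_failures`, and the function `set_speed()` with its 0..6000 limit are hypothetical, and `printf` stands in for a serial print routine such as UARTprintf.

```c
#include <stdio.h>

static unsigned int check_failures = 0u; /* hypothetical failure counter */

/* Records and reports one failed check. */
static void report_check_failure(const char *msg, const char *file, int line)
{
    check_failures++;
    printf("Check failed: %s (%s:%d)\n", msg, file, line);
}

/* Evaluates cond at run time and reports when it does not hold. */
#define RUNTIME_CHECK(msg, cond)                                   \
    do {                                                           \
        if (!(cond)) {                                             \
            report_check_failure((msg), __FILE__, __LINE__);       \
        }                                                          \
    } while (0)

/* Hypothetical setter: the check reports, the code still rejects bad input. */
int set_speed(int rpm)
{
    RUNTIME_CHECK("rpm within 0..6000", (rpm >= 0) && (rpm <= 6000));
    if ((rpm < 0) || (rpm > 6000))
    {
        return -1; /* refuse the out-of-range request */
    }
    /* ... apply the new speed ... */
    return 0;
}
```

The check only reports; the function still handles the bad value explicitly, so behavior does not depend on whether the check is compiled in.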

8 Compiler Semantic Checks

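Compilers perform only limited semantic checks. They generally do not catch problems such as array out-of-bounds accesses, invalid pointers, or overflowing results, so the programmer must check for them.
Check for infinite loops. If code like the following appears in a rarely executed branch, it can cause seemingly inexplicable hangs or restarts.
a. unsigned char i;
   for (i = 0; i < 256; i++) { … }  // With an 8-bit unsigned char, i never reaches 256
b. unsigned char i;
   for (i = 10; i >= 0; i--) { … }  // i is unsigned, so i >= 0 is always true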
Adding a semicolon after an if statement changes the program logic, and the compiler will cooperate to cover it up, sometimes not even giving a warning. The code is as follows:
if (a > b);  // A semicolon was mistakenly added here
    a = b;   // This line of code is always executed
The compiler also ignores extra spaces and line breaks, so code like the following may not produce an adequate warning:
if (n < 3)
    return    // A semicolon is missing here
logrec.data = x[0];
logrec.time = x[1];
logrec.code = x[2];
The intention was to return immediately when n<3, but the return statement is missing its terminating semicolon, so the compiler reads it as return logrec.data=x[0];. C allows an expression after return, so the code is accepted. As a result, when n>=3 the assignment logrec.data=x[0] is never executed, leaving a hidden defect in the program.

9 Backup Critical Data in Multiple Regions, Use Voting Method for Data Retrieval

Data in RAM may change under interference, so critical system data must be protected. Critical data includes global variables, static variables, and data areas that need protection. The backups must not be left for the compiler to place next to the original data; the programmer should specify their locations.
RAM can be divided into three areas: the first stores the original value, the second stores its bitwise inverse, and the third stores the value XORed with a fixed pattern. A certain amount of "blank" RAM should be reserved between the areas for isolation. The compiler's "scatter loading" mechanism can be used to place variables in these areas. When reading, retrieve all three copies and use voting to take the value on which at least two copies agree.
If a critical variable needs multiple backups, it can be defined as follows. The three variables are placed in three non-contiguous RAM areas and initialized with the original value, its bitwise inverse, and the value XORed with the 0xAA pattern (0xAAAAAAAA for a 32-bit variable).
uint32 plc_pc = 0;                                                        // Original value
__attribute__((section("MY_BK1"))) uint32 plc_pc_not = ~0x0;              // Bitwise inverse
__attribute__((section("MY_BK2"))) uint32 plc_pc_xor = 0x0 ^ 0xAAAAAAAA;  // XOR with 0xAAAAAAAA
When writing this variable, all three locations must be updated. When reading it, read all three values and take the one on which at least two copies agree.
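The write and voting read described above might look like the following sketch. The names `protected_write()`, `protected_read()`, and the three `copy_*` variables are hypothetical. To keep the sketch compilable on any host, the copies are ordinary statics; on a real target they would be placed in separate, non-adjacent RAM sections.

```c
#include <stdint.h>

#define XOR_MASK 0xAAAAAAAAu

/* Three encoded copies of one critical value (initial value 0). */
static uint32_t copy_plain    = 0u;
static uint32_t copy_inverted = 0xFFFFFFFFu;
static uint32_t copy_xored    = XOR_MASK;

/* Updates all three copies, each in its own encoding. */
void protected_write(uint32_t value)
{
    copy_plain    = value;
    copy_inverted = ~value;
    copy_xored    = value ^ XOR_MASK;
}

/* Decodes the three copies and votes.
 * Returns 0 and stores the value in *out when at least two copies agree,
 * or -1 when all three disagree. */
int protected_read(uint32_t *out)
{
    uint32_t a = copy_plain;
    uint32_t b = ~copy_inverted;
    uint32_t c = copy_xored ^ XOR_MASK;

    if ((a == b) || (a == c))
    {
        *out = a;
    }
    else if (b == c)
    {
        *out = b;
    }
    else
    {
        return -1;            /* no majority: the data cannot be trusted */
    }
    protected_write(*out);    /* repair any copy that was outvoted */
    return 0;
}
```

Rewriting all copies after a successful vote is a design choice of this sketch: it stops a single corrupted copy from lingering until a second one fails.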

10 Data Storage in Non-Volatile Memory

Non-volatile memory includes, but is not limited to, Flash, EEPROM, and ferroelectric memory. Simply reading data back after writing it is not sufficient verification. Strong interference can corrupt data already stored in non-volatile memory, and a power loss during a write will lose data. If interference causes the program to jump into the function that writes non-volatile memory, the stored data will be corrupted.
A reliable method is to divide non-volatile memory into multiple zones and write each piece of data to every zone in a different form. When reading, read all copies and use voting to take the value that appears most often.
To guard against the program jumping into the write function because of interference, software locks and strict entry checks should also be used. Relying solely on writing data to multiple zones is not sufficient; erroneous writes should be intercepted at the source.

11 Software Locks

Interlocking is one way, though not the only way, to implement a software lock. For initialization sequences or function calls that must happen in a certain order, interlocking ensures that the order is respected and that every function is called; this is essentially a software lock. In addition, for safety-critical statements (individual statements, not whole functions), a software lock can be set so that only code holding a specific key can execute them.
For example, when writing data to Flash, the caller first checks that the data and the write address are valid and calculates the sector to be written. It then calls the Flash write subroutine, which validates the sector address and data length before writing. Because the statements that actually write Flash are critical code, the program locks them: the correct key must be presented for the write to proceed. This way, even if the program jumps into the Flash write subroutine by accident, the risk of an erroneous write is greatly reduced.
/****************************************************************
* Name: RamToFlash()
* Function: Copy RAM data to FLASH, command code 51.
* Entry Parameters: dst       Target address, i.e., FLASH starting address. Must be on a 512-byte boundary
*                   src       Source address, i.e., RAM address. Address must be word-aligned
*                   no        Number of bytes to copy, which can be 512/1024/4096/8192
*                   ProgStart Software lock flag
* Exit Parameters: IAP return value (paramout buffer): CMD_SUCCESS, SRC_ADDR_ERROR, DST_ADDR_ERROR,
*                  SRC_ADDR_NOT_MAPPED, DST_ADDR_NOT_MAPPED, COUNT_ERROR, BUSY, Sector not selected
****************************************************************/
void RamToFlash(uint32 dst, uint32 src, uint32 no, uint8 ProgStart)
{
    PLC_ASSERT("Sector number", (dst >= 0x00040000) && (dst <= 0x0007FFFF));
    PLC_ASSERT("Copy bytes number is 512", (no == 512));
    PLC_ASSERT("ProgStart==0xA5", (ProgStart == 0xA5));
    paramin[0] = IAP_RAMTOFLASH; // Set command word
    paramin[1] = dst;            // Set parameters
    paramin[2] = src;
    paramin[3] = no;
    paramin[4] = Fcclk / 1000;
    if (ProgStart == 0xA5) // Execute critical code only when the software lock flag is correct
    {
        iap_entry(paramin, paramout); // Call IAP service program
        ProgStart = 0;
    }
    else
    {
        paramout[0] = PROG_UNSTART;
    }
}
This code programs the internal Flash of the LPC1778. The call to the IAP routine, iap_entry(paramin, paramout), is the safety-critical statement. Before executing it, the function checks the software lock flag ProgStart and performs the Flash programming operation only if the flag holds the expected value. If the program enters this function unexpectedly, the ProgStart flag will be wrong and the Flash will not be programmed.
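The interlocking of ordered calls mentioned at the start of this section can be sketched in a similar way. Everything here is hypothetical: the three initialization steps, their names, and the step codes are chosen only for illustration. Each function runs only if the previous step has left its code behind, so a skipped or out-of-order call is rejected.

```c
#include <stdint.h>

/* Hypothetical step codes; non-trivial bit patterns are harder to
 * reproduce by accident than 0, 1, 2. */
#define STEP_NONE  0x00u
#define STEP_CLOCK 0x5Au
#define STEP_GPIO  0xA5u
#define STEP_UART  0x3Cu

static uint8_t init_step = STEP_NONE; /* last step that completed */

int init_clock(void)
{
    if (init_step != STEP_NONE) { return -1; }  /* must be the first call */
    /* ... configure the clock ... */
    init_step = STEP_CLOCK;
    return 0;
}

int init_gpio(void)
{
    if (init_step != STEP_CLOCK) { return -1; } /* clock must be done first */
    /* ... configure the pins ... */
    init_step = STEP_GPIO;
    return 0;
}

int init_uart(void)
{
    if (init_step != STEP_GPIO) { return -1; }  /* pins must be done first */
    /* ... configure the UART ... */
    init_step = STEP_UART;
    return 0;
}

/* Returns 1 only when every step ran, in order. */
int init_complete(void)
{
    return init_step == STEP_UART;
}
```

The main program can call `init_complete()` before entering normal operation and refuse to continue if any step was missed.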

12 Error Checking for Communication Data

  • Limit the number of bytes per frame when defining protocols:

The more bytes a frame contains, the greater the chance of an error and the more data is wasted when one occurs. For this reason, Ethernet limits the payload of each frame to 1500 bytes, the high-reliability CAN bus (classical CAN) limits each frame to 8 data bytes, and the Modbus protocol widely used on RS485 limits a frame to 256 bytes. When defining an internal communication protocol over RS485, it is recommended that a frame not exceed 256 bytes.

  • Use multiple checks:

Enable parity checking. For applications whose frames exceed 16 bytes, it is recommended to add at least a CRC16 check.

  • Add additional checks:

1) Add buffer overflow checks. Data reception is usually handled in an interrupt, and the compiler cannot detect whether the buffer overflows, so the check must be written by hand, as described in the earlier section on preventing array out-of-bounds.

2) Add timeout checks. If part of a frame has been received and the rest does not arrive within a set time, treat the frame as invalid and restart reception. Whether a timeout check is needed depends on the protocol, but the buffer overflow check is mandatory. The reason is that, for protocols that use a frame header, the host may suddenly lose power right after sending the header. After rebooting, the host starts sending a new frame, but the slave is still waiting for the rest of the previous one, so it may treat the new header as ordinary data. The data length field may then take a very large value (for example, 1000 bytes), and the slave waits for that much data to arrive, which hurts response time. If the program also lacks a buffer overflow check, the buffer is likely to overflow.

  • Retransmission Mechanism:

If communication data errors are detected, a retransmission mechanism should be in place to resend the erroneous frame.
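As an illustration of the CRC16 check recommended in this section, here is a minimal bitwise sketch. The text does not say which CRC16 variant is meant; this sketch assumes the one used by Modbus RTU (initial value 0xFFFF, reflected polynomial 0xA001, CRC appended low byte first). The function names `crc16_modbus()` and `frame_crc_ok()` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Computes the Modbus-style CRC16 of len bytes, one bit at a time. */
uint16_t crc16_modbus(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++)
    {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
        {
            if (crc & 0x0001u)
            {
                crc = (uint16_t)((crc >> 1) ^ 0xA001u);
            }
            else
            {
                crc >>= 1;
            }
        }
    }
    return crc;
}

/* Checks a received frame whose last two bytes are the CRC, low byte first.
 * Returns 1 when the CRC matches, 0 otherwise. */
int frame_crc_ok(const uint8_t *frame, size_t len)
{
    if (len < 3u)
    {
        return 0; /* too short to hold any payload plus the CRC */
    }
    uint16_t received = (uint16_t)(frame[len - 2u] |
                                   ((uint16_t)frame[len - 1u] << 8));
    return crc16_modbus(frame, len - 2u) == received;
}
```

A frame that fails `frame_crc_ok()` would be discarded and handled by the retransmission mechanism. A table-driven version is faster where interrupt time is tight, at the cost of a 512-byte lookup table.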

13 Detection and Confirmation of Digital (Switch) Inputs

Digital (switch) inputs are easily affected by spike interference, which can cause erroneous actions if it is not filtered out. In general, the input signal should be sampled several times and a logical judgment applied until its state is confirmed. The samples should be separated by a time interval that depends on the maximum switching frequency of the input, generally not less than 1 ms.
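A minimal sketch of such multi-sample confirmation, assuming the function is called once per sampling interval (for example from a 1 ms timer tick) with the raw pin level. The type `switch_filter_t`, the function `switch_filter_update()`, and the choice of three confirming samples are hypothetical.

```c
#include <stdint.h>

#define CONFIRM_SAMPLES 3u /* hypothetical: samples needed to accept a change */

typedef struct {
    uint8_t stable; /* last confirmed state, 0 or 1 */
    uint8_t count;  /* consecutive samples that differ from stable */
} switch_filter_t;

/* Feeds one raw sample into the filter and returns the confirmed state.
 * A change is accepted only after CONFIRM_SAMPLES consecutive samples
 * disagree with the current confirmed state. */
uint8_t switch_filter_update(switch_filter_t *f, uint8_t raw)
{
    raw = raw ? 1u : 0u;
    if (raw == f->stable)
    {
        f->count = 0u;         /* a short spike ended: discard it */
    }
    else
    {
        f->count++;
        if (f->count >= CONFIRM_SAMPLES)
        {
            f->stable = raw;   /* change confirmed */
            f->count = 0u;
        }
    }
    return f->stable;
}
```

A spike shorter than `CONFIRM_SAMPLES` sampling intervals never reaches the confirmed state, at the cost of delaying real changes by the same amount.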

14 Digital (Switch) Outputs

Writing a digital (switch) output only once is not safe, because interference may flip the state of the output. Refreshing the output repeatedly is an effective way to prevent such level flips.

15 Saving and Restoring Initialization Information

Microprocessor register values can also be changed by external interference. Peripheral configuration registers must hold their initialization values for a long time, which makes them the most vulnerable. Since data in Flash is relatively resistant to corruption, the initialization values can be written to Flash in advance. When the program is idle, it compares the initialization-related registers against these values and, if an illegal change is found, restores them from Flash.
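The compare-and-restore step could be organized as a table of register addresses and expected values, as in the sketch below. All names are hypothetical, and the two `fake_*` variables stand in for real peripheral registers so that the sketch compiles on any host. On a typical microcontroller toolchain a `const` table like this is placed in Flash.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per protected register: where it is and what it should hold. */
typedef struct {
    volatile uint32_t *reg;
    uint32_t expected;
} reg_init_t;

/* Stand-ins for peripheral registers (hypothetical). */
static volatile uint32_t fake_uart_baud;
static volatile uint32_t fake_timer_ctrl;

static const reg_init_t init_table[] = {
    { &fake_uart_baud,  0x0000001Au },
    { &fake_timer_ctrl, 0x00000003u },
};
#define INIT_TABLE_LEN (sizeof init_table / sizeof init_table[0])

/* Writes every expected value; used once at start-up. */
void apply_init_table(void)
{
    for (size_t i = 0; i < INIT_TABLE_LEN; i++)
    {
        *init_table[i].reg = init_table[i].expected;
    }
}

/* Called when idle: restores any register that no longer matches.
 * Returns the number of registers that had to be repaired. */
unsigned int scrub_registers(void)
{
    unsigned int repaired = 0u;
    for (size_t i = 0; i < INIT_TABLE_LEN; i++)
    {
        if (*init_table[i].reg != init_table[i].expected)
        {
            *init_table[i].reg = init_table[i].expected;
            repaired++;
        }
    }
    return repaired;
}
```

Only registers that read back exactly what was written belong in such a table; status bits and write-only registers would produce false alarms.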

16 While Loop

A program with good redundancy sets a timeout timer for a while loop that waits on an event, so that the loop is forcibly exited after a certain amount of time.
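A minimal sketch of a bounded wait follows. For simplicity it uses an iteration counter in place of a hardware timeout timer, and the flag `device_ready`, the limit `WAIT_LIMIT`, and the function `wait_for_ready()` are all hypothetical. On a real target the flag would be a hardware status bit and the limit would be derived from a timer.

```c
#include <stdint.h>

#define WAIT_LIMIT 1000u /* hypothetical upper bound on polling iterations */

/* Stand-in for a hardware status flag (hypothetical). */
static volatile uint8_t device_ready = 0u;

/* Polls device_ready, but gives up after WAIT_LIMIT iterations.
 * Returns 0 when the device became ready, -1 on timeout. */
int wait_for_ready(void)
{
    uint32_t remaining = WAIT_LIMIT;

    while (!device_ready)
    {
        if (remaining == 0u)
        {
            return -1; /* timed out: let the caller handle the error */
        }
        remaining--;
    }
    return 0;
}
```

The caller decides what a timeout means, for example retrying, reinitializing the peripheral, or logging the error.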

17 System Self-Check

Perform self-checks of the CPU, RAM, Flash, external non-volatile memory (which retains data after power loss), and other circuits.
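As one example of a RAM self-check, the sketch below writes two complementary patterns to each word, verifies them, and then restores the original content. It is a simple illustration under assumptions, not a complete memory test: the function name `ram_pattern_test()` is hypothetical, and on a real target interrupts that use the tested area would need to be disabled while a word holds a test pattern.

```c
#include <stddef.h>
#include <stdint.h>

/* Tests words[0..words-1] starting at start with 0x55555555 and 0xAAAAAAAA.
 * The original content of each word is restored afterwards.
 * Returns 0 when every word passed, -1 at the first failure. */
int ram_pattern_test(volatile uint32_t *start, size_t words)
{
    static const uint32_t patterns[2] = { 0x55555555u, 0xAAAAAAAAu };

    for (size_t i = 0; i < words; i++)
    {
        uint32_t saved = start[i];          /* keep the live data */
        for (size_t p = 0; p < 2u; p++)
        {
            start[i] = patterns[p];
            if (start[i] != patterns[p])
            {
                start[i] = saved;
                return -1;                  /* a bit is stuck */
            }
        }
        start[i] = saved;                   /* put the live data back */
    }
    return 0;
}
```

The two patterns together drive every bit to both 0 and 1, which catches stuck bits but not every kind of fault, such as coupling between addresses.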
