Common Fault Tolerance Designs in Embedded Code

If a large embedded project has no fault tolerance design, can you imagine the consequences?
Experienced readers can picture it: such a project accumulates bugs, and some of them are very difficult to trace.
Today, let's discuss some common fault tolerance design methods in embedded code.

Using Assertions (Assert)

What is an assertion (Assert)? Let's illustrate this with an example.

Consider the following array and function:
int Array[5] = {0xA1, 0xB2, 0xC3, 0xD4, 0xE5};
int Fun(char i)
{
    return Array[i];
}
If you call the Fun function like this, do you think it will cause an error?
int a;
a = Fun(8);

The index 8 is outside the five-element array, so the call reads memory past the end of Array. Experienced readers will have guessed that an assertion inside Fun, checking that i is within range, catches this mistake at the point where it happens.
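As a minimal sketch of that idea, the version below adds an assertion to a range-checked copy of Fun. The names FunChecked, FunSafe, and ARRAY_LEN are illustrative and do not come from any library. Because the standard assert macro is compiled out when NDEBUG is defined, the sketch also includes a variant that reports the problem through its return value.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_LEN 5u

static const int Array[ARRAY_LEN] = {0xA1, 0xB2, 0xC3, 0xD4, 0xE5};

/* Development-time check: an out-of-range index such as 8 stops the
   program at the assertion instead of reading past the array. */
int FunChecked(size_t i)
{
    assert(i < ARRAY_LEN);
    return Array[i];
}

/* Run-time check that stays active in release builds:
   returns 0 on success and -1 if the arguments are invalid. */
int FunSafe(size_t i, int *out)
{
    if (out == NULL || i >= ARRAY_LEN) {
        return -1;
    }
    *out = Array[i];
    return 0;
}
```

With these definitions, FunChecked(8) fails the assertion in a debug build, while FunSafe(8, &a) returns -1 and leaves a untouched.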

Assertions (Assert) are one of the most common fault tolerance designs in code, and many source code libraries use them, such as the STM32 peripheral library:

void GPIO_Init(GPIO_TypeDef* GPIOx, GPIO_InitTypeDef* GPIO_InitStruct)
{
  /* Check the parameters */
  assert_param(IS_GPIO_ALL_PERIPH(GPIOx));
  assert_param(IS_GPIO_MODE(GPIO_InitStruct->GPIO_Mode));
  assert_param(IS_GPIO_PIN(GPIO_InitStruct->GPIO_Pin));
  /* ... */
}
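For reference, the assert_param macro in the STM32 peripheral library is enabled by defining USE_FULL_ASSERT, and on failure it calls a user-supplied assert_failed function with the file name and line number. The sketch below follows that pattern on a host build. Pwm_SetDuty, IS_DUTY_PERCENT, and the g_ variables are hypothetical names used only for illustration, and a real assert_failed would more likely log the location and halt.

```c
#include <stdint.h>

#define USE_FULL_ASSERT   /* remove this line to compile the checks away */

void assert_failed(uint8_t *file, uint32_t line);

#ifdef USE_FULL_ASSERT
  #define assert_param(expr) \
      ((expr) ? (void)0 : assert_failed((uint8_t *)__FILE__, __LINE__))
#else
  #define assert_param(expr) ((void)0)
#endif

/* Hypothetical bookkeeping so a failure can be inspected on a host build. */
static uint32_t g_assert_count;
static uint32_t g_assert_line;

void assert_failed(uint8_t *file, uint32_t line)
{
    (void)file;
    g_assert_line = line;
    g_assert_count++;
    /* On a target, a typical choice is to log the location and halt here. */
}

/* Hypothetical parameter check in the style of IS_GPIO_MODE(). */
#define IS_DUTY_PERCENT(x) ((x) <= 100u)

static uint8_t g_duty;

void Pwm_SetDuty(uint8_t percent)
{
    assert_param(IS_DUTY_PERCENT(percent));
    if (percent > 100u) {
        percent = 100u;   /* stay within range even if the check is compiled out */
    }
    g_duty = percent;
}
```

Calling Pwm_SetDuty(150) in this sketch records one assertion failure and clamps the duty cycle to 100.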

Clearly Defined Return Values and Error Codes

Widely used protocol stacks, peripheral libraries, and operating systems mostly have well-designed APIs whose functions return values that report whether an operation succeeded or failed. A common convention uses 0 to indicate success and non-zero values to indicate specific error codes.

For example, the task creation function in the µC/OS-II RTOS:

INT8U OSTaskCreate (void (*task)(void *p_arg),
                    void *p_arg,
                    OS_STK *ptos,
                    INT8U prio)
{
    OS_STK *psp;
    INT8U err;
#if OS_CRITICAL_METHOD == 3u                 /* Allocate storage for CPU status register */
    OS_CPU_SR cpu_sr = 0u;
#endif

#ifdef OS_SAFETY_CRITICAL_IEC61508
    if (OSSafetyCriticalStartFlag == OS_TRUE) {
        OS_SAFETY_CRITICAL_EXCEPTION();
        return (OS_ERR_ILLEGAL_CREATE_RUN_TIME);
    }
#endif

#if OS_ARG_CHK_EN > 0u
    if (prio > OS_LOWEST_PRIO) {             /* Make sure priority is within allowable range */
        return (OS_ERR_PRIO_INVALID);
    }
#endif
    OS_ENTER_CRITICAL();
    if (OSIntNesting > 0u) {                 /* Make sure we don't create the task from within an ISR */
        OS_EXIT_CRITICAL();
        return (OS_ERR_TASK_CREATE_ISR);
    }
    /* ... */
}

Designing reasonable return values and error codes for your own functions also makes the code more robust and, in particular, makes bugs easier to locate.
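As a small illustration of the same convention in application code, here is a sketch of a byte ring buffer whose functions return 0 on success and a specific non-zero code on failure. The type and function names (buf_status_t, ring_buf_t, RingBuf_Put, RingBuf_Get) are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum {
    BUF_OK = 0,      /* operation succeeded */
    BUF_ERR_NULL,    /* a required pointer was NULL */
    BUF_ERR_FULL,    /* no room for another byte */
    BUF_ERR_EMPTY    /* nothing to read */
} buf_status_t;

#define BUF_CAPACITY 4u

typedef struct {
    uint8_t data[BUF_CAPACITY];
    size_t  head;    /* index of the oldest byte */
    size_t  count;   /* number of bytes stored */
} ring_buf_t;

buf_status_t RingBuf_Put(ring_buf_t *rb, uint8_t byte)
{
    if (rb == NULL) {
        return BUF_ERR_NULL;
    }
    if (rb->count == BUF_CAPACITY) {
        return BUF_ERR_FULL;
    }
    rb->data[(rb->head + rb->count) % BUF_CAPACITY] = byte;
    rb->count++;
    return BUF_OK;
}

buf_status_t RingBuf_Get(ring_buf_t *rb, uint8_t *out)
{
    if (rb == NULL || out == NULL) {
        return BUF_ERR_NULL;
    }
    if (rb->count == 0u) {
        return BUF_ERR_EMPTY;
    }
    *out = rb->data[rb->head];
    rb->head = (rb->head + 1u) % BUF_CAPACITY;
    rb->count--;
    return BUF_OK;
}
```

A caller can then tell "buffer full" apart from "bad pointer" and react to each case, instead of guessing from a generic failure.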

Logging

Why do we need logs? Recording detailed information about an error, including when it occurred, where in the code, and why, helps trace and analyze bugs after they happen.

When we first learn embedded systems, one of the first things we learn is a print function such as printf, and printing is itself a basic form of logging.

In addition to storing logs locally, we can also use printf to send output to another terminal (such as a host computer), which then stores the log.
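A log entry that carries the time, location, and reason could look like the sketch below. It assumes a millisecond tick counter (g_tick_ms) that a real project would take from SysTick or an RTC, and it uses puts as a stand-in for a UART write or a flash record. Log_Write, LOG_ERROR_MSG, and the g_ variables are hypothetical names.

```c
#include <stdarg.h>
#include <stdio.h>

typedef enum { LOG_INFO = 0, LOG_WARN, LOG_ERROR } log_level_t;

/* Hypothetical tick source; updated elsewhere in a real project. */
static unsigned long g_tick_ms;

/* Last formatted entry, kept so it can be stored or inspected. */
static char g_last_log[128];

/* Formats "[time] LEVEL file:line message" and emits it. */
void Log_Write(log_level_t level, const char *file, int line,
               const char *fmt, ...)
{
    static const char *const names[] = {"INFO", "WARN", "ERROR"};
    va_list args;
    int n = snprintf(g_last_log, sizeof g_last_log, "[%lu ms] %s %s:%d ",
                     g_tick_ms, names[level], file, line);

    if (n < 0) {
        g_last_log[0] = '\0';            /* formatting failed: emit an empty line */
    } else if ((size_t)n < sizeof g_last_log) {
        va_start(args, fmt);             /* append the message after the prefix */
        vsnprintf(g_last_log + n, sizeof g_last_log - (size_t)n, fmt, args);
        va_end(args);
    }
    puts(g_last_log);                    /* stand-in for UART or flash output */
}

/* Convenience macro that fills in the call site automatically. */
#define LOG_ERROR_MSG(...) Log_Write(LOG_ERROR, __FILE__, __LINE__, __VA_ARGS__)
```

A call such as LOG_ERROR_MSG("timeout on channel %d", 3) then records where the error was raised without the caller typing the file name or line number.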

Fatal Bug Restart Strategy

When the software hits a fatal error, such as a HardFault or a memory management fault (MemManage), we can choose a restart strategy.

Which restart method to use depends on the actual situation of the project. Two common options are a core reset and a system reset.

1. Core Reset

A core reset resets only the Cortex-M core; UART and other on-chip peripherals are not reset.

The Cortex-M core documentation describes it as follows: setting the VECTRESET bit in the Application Interrupt and Reset Control Register (AIRCR) resets the processor core without resetting other on-chip facilities.

The core reset function (modified from the core code) is as follows:

void NVIC_CoreReset(void)
{
  __DSB();
  SCB->AIRCR = ((0x5FA << SCB_AIRCR_VECTKEY_Pos)      |
                (SCB->AIRCR & SCB_AIRCR_PRIGROUP_Msk) |
                SCB_AIRCR_VECTRESET_Msk);              // Set VECTRESET
  __DSB();
  while(1) { __NOP(); }
}

2. System Reset

A system reset is requested through a different bit of the same register, SYSRESETREQ, and it resets the whole chip (except for the backup area).

System reset function:

void NVIC_SysReset(void)
{
  __DSB();
  SCB->AIRCR = ((0x5FA << SCB_AIRCR_VECTKEY_Pos)      |
                (SCB->AIRCR & SCB_AIRCR_PRIGROUP_Msk) |
                SCB_AIRCR_SYSRESETREQ_Msk);            // Set SYSRESETREQ
  __DSB();
  while(1) { __NOP(); }
}
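A restart is more useful when the firmware can tell afterwards why it restarted. One possible approach, sketched below under assumptions, is to write a small fault record before calling the reset function and read it back early at boot. On a target the record would have to live in memory that survives the reset, such as the backup area mentioned above; here it is a plain static variable so the logic can run on a host. All names (fault_record_t, Fault_Record, Fault_ReadAtBoot, FAULT_MAGIC) are hypothetical.

```c
#include <stdint.h>

#define FAULT_MAGIC 0xFA17C0DEu   /* hypothetical marker for a valid record */

typedef enum {
    FAULT_NONE = 0,
    FAULT_HARD,        /* recorded from the HardFault handler */
    FAULT_MEMMANAGE    /* recorded from the MemManage handler */
} fault_reason_t;

typedef struct {
    uint32_t magic;        /* FAULT_MAGIC once the record has been initialized */
    uint32_t reason;       /* fault_reason_t of the most recent fault */
    uint32_t reset_count;  /* number of fault-triggered resets so far */
} fault_record_t;

/* Assumption: on a target this object is placed in reset-surviving memory. */
static fault_record_t g_fault_record;

/* Call from a fault handler just before requesting the reset. */
void Fault_Record(fault_reason_t reason)
{
    if (g_fault_record.magic != FAULT_MAGIC) {
        g_fault_record.magic = FAULT_MAGIC;
        g_fault_record.reset_count = 0u;
    }
    g_fault_record.reason = (uint32_t)reason;
    g_fault_record.reset_count++;
}

/* Call early at boot: returns the fault behind the previous reset, if any,
   and clears it so the next normal boot reports FAULT_NONE. */
fault_reason_t Fault_ReadAtBoot(void)
{
    fault_reason_t reason = FAULT_NONE;

    if (g_fault_record.magic == FAULT_MAGIC) {
        reason = (fault_reason_t)g_fault_record.reason;
        g_fault_record.reason = (uint32_t)FAULT_NONE;
    }
    return reason;
}
```

In this sketch a HardFault handler would call Fault_Record(FAULT_HARD) and then the reset function, and the startup code could log the value returned by Fault_ReadAtBoot(). The reset_count field gives the application a way to notice repeated fault resets.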

Static Analysis Tools

Static analysis tools check the code for potential issues such as uninitialized variables, memory leaks, and buffer overflows. These tools can detect many issues before the code is ever run, thus improving code quality.

Although this is not strictly a fault tolerance design, it is an important part of the development process, and it can sometimes contribute more than conventional fault tolerance designs.
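To make the kinds of findings concrete, the sketch below shows a small averaging function after two typical defects have been fixed. The comments describe the earlier, defective form that a static analyzer would be expected to report. The function name Samples_Average and the constant SAMPLE_COUNT are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define SAMPLE_COUNT 8u

/* Defective form (for comparison):
 *   uint32_t sum;                      <- used without being initialized
 *   for (i = 0; i <= SAMPLE_COUNT; i++)  <- reads one element past the end
 * The corrected form below initializes the accumulator and stops at the
 * last valid index. */
uint32_t Samples_Average(const uint16_t samples[SAMPLE_COUNT])
{
    uint32_t sum = 0u;

    for (size_t i = 0u; i < SAMPLE_COUNT; i++) {
        sum += samples[i];
    }
    return sum / SAMPLE_COUNT;
}
```

Both defects compile without error and may appear to work in testing, which is why a tool that inspects the source is a useful complement to the run-time measures above.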

Finally, given how many ways code can go wrong, coding standards are just as important as the fault tolerance designs above.
When you write code, which fault tolerance designs do you consider? Feel free to leave a comment.

Author | strongerHuang

WeChat Official Account | strongerHuang