[Introduction] After writing so many microcontroller programs, you see the watchdog every day. Are you taking care of your watchdog properly? Just keep feeding it, and as long as it doesn’t bark, everything is fine, right? Is it really that simple? In fact, it may not be as straightforward as you think…
What is a Watchdog?
A watchdog, also known as a watchdog timer, is essentially a timing circuit or software timer mechanism.
Working Principle:
The hardware basis of a watchdog is a counter that is set to a certain initial timing value and then decrements to zero. The software is responsible for periodically resetting the counter to its initial timing value to ensure that the count never reaches zero. If it does reach zero, it indicates that some fault has occurred, and corresponding measures must be taken, such as restarting or entering a fail-safe state, depending on the system’s design.
During normal operation, the microcontroller, processor, or thread periodically resets the watchdog timer’s timing value, while the timer continuously counts in the background. If the timing period expires without being fed again, the watchdog barks, indicating that something unusual has happened! At this point, the watchdog issues commands externally to execute corresponding actions. What these actions are depends on the actual system design. Common watchdog chips will send a reset signal to the microcontroller or processor, while for software timers, the specific actions can vary widely, depending on the safety strategy employed.
In simple terms, resetting the counter is called feeding the watchdog; the timing value is the dog food. The watchdog eats the food continuously, and if it is not fed again before the food runs out, it barks, sending a warning. Conversely, a system that operates normally always keeps its watchdog well fed, so the dog never barks out of hunger.
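As a rough illustration of the counting mechanism, here is a minimal sketch that models the watchdog in plain C so it can run on a host. The type watchdog_t and the functions wdt_init, wdt_feed, and wdt_tick are hypothetical names chosen for this example; on a real chip the counting happens in hardware and feeding is usually a write to a chip-specific register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a watchdog down-counter. */
typedef struct {
    uint32_t reload;   /* initial timing value (a full bowl of dog food) */
    uint32_t count;    /* current counter value */
    bool     barked;   /* latched once the counter has reached zero */
} watchdog_t;

/* Start the watchdog with a non-zero initial timing value. */
static void wdt_init(watchdog_t *w, uint32_t reload)
{
    assert(reload > 0u);
    w->reload = reload;
    w->count  = reload;
    w->barked = false;
}

/* Feed: put the counter back to its initial timing value. */
static void wdt_feed(watchdog_t *w)
{
    w->count = w->reload;
}

/* One timer tick: count down and bark when zero is reached.
   Returns true once the watchdog has barked. */
static bool wdt_tick(watchdog_t *w)
{
    if (!w->barked && w->count > 0u && --w->count == 0u) {
        w->barked = true;
    }
    return w->barked;
}
```

In this model the bark is only a latched flag; what it triggers (a reset, a fail-safe state) is left to the system design, as described above.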
Note: I have seen articles refer to resetting the watchdog timer as kicking the dog. Well, that’s not very nice; we should treat the watchdog better and call it feeding instead~~
The watchdog mechanism plays a crucial role in electronic systems. For example, if a Mars rover's program hung and there were no watchdog circuit, the rover would effectively be lost. Just imagine the scenario: unable to communicate or wake up, it would quickly become space junk~~
What Errors Can It Monitor?
- Stack or heap overflow, causing the program to crash
- A certain segment of the program fails to return or enters an infinite loop
- Strong electromagnetic interference corrupting data, leading to system anomalies; if this sounds far-fetched, think of the many electronic systems in military or aerospace fields that routinely operate in strong electromagnetic interference
- System crashes caused by bugs
- Deadlocks in multitasking systems
- ……
There are countless reasons, but don’t panic! You have a good watchdog to help you; let the watchdog clean up the mess. In a complex embedded system, it is impossible to guarantee that there are no bugs, but by using a watchdog, you can ensure that no bug will indefinitely hang the system.
What to Do When the Watchdog Barks?
What are the common handling strategies?
- System Reset: Most people have experienced this; when the system hangs, what do you do? Restart. It reminds me of a Liu Huan song: if life could be restarted, how wonderful it would be, but it can't! If you're interested, give it a listen~~
- Fail-Safe: This is often referred to as fail-safe mode. It means that even if the device suffers a fatal failure, it must not cause a safety incident. To put it bluntly, even if it crashes, it should not harm anyone else. An example makes this easier to grasp: if an elevator is descending and the watchdog detects a program anomaly, the safe action is to stop the motor immediately; otherwise, it would free-fall, and that would be disastrous. This principle is reflected in the IEC 61508 functional safety standard, as well as in medical and automotive safety standards.
- Here is a recommended practice: after the chip resets, use the value of the chip's reset status register to count watchdog reset events. If this type of reset occurs three times in a row, the conservative approach is to switch the system to a safe state or display an error message, thus avoiding infinite restarts. How to do this? Taking IAR as an example, you can declare the counter as a variable that the startup code does not initialize (in IAR, the keyword is __no_init); its value then survives a reset and is lost only when power is removed. For example: __no_init int wdtResetCounter;
- ….depends on the specific design strategy
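The reset-counting practice above could be sketched roughly as follows. This is a host-runnable illustration under stated assumptions, not a definitive implementation: boot_check, boot_action_t, and WDT_RESET_LIMIT are hypothetical names, and reading the reset status register is chip-specific, so the result is passed in as a plain flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WDT_RESET_LIMIT 3   /* "three times in a row", as in the text */

typedef enum { BOOT_NORMAL, BOOT_SAFE_STATE } boot_action_t;

/*
 * Decide what to do at startup.
 * counter   : points at a variable that survives a reset; with IAR this
 *             could be the `__no_init int wdtResetCounter;` from the text.
 * wdt_reset : true if the reset status register reports a watchdog reset
 *             (how to read that register depends on the chip, not shown).
 */
static boot_action_t boot_check(int *counter, bool wdt_reset)
{
    assert(counter != NULL);

    if (!wdt_reset) {
        /* Power-on or other reset: the counter may hold garbage, and the
           streak of watchdog resets is broken either way. */
        *counter = 0;
        return BOOT_NORMAL;
    }

    if (*counter < WDT_RESET_LIMIT) {
        (*counter)++;
    }
    return (*counter >= WDT_RESET_LIMIT) ? BOOT_SAFE_STATE : BOOT_NORMAL;
}
```

Clearing the counter on any non-watchdog reset matters because an uninitialized variable holds an arbitrary value after power-on.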
If we want the system to recover quickly, the initialization after a watchdog reset should be shorter than the normal power-on initialization, which means skipping some of the device's self-checks. In some systems, however, it is better to perform a comprehensive self-check, because the root cause of the watchdog timeout may itself be a hardware anomaly.
How to Feed the Watchdog?
For bare-metal programs, I recommend the following two handling strategies: fault detection feeding and enhanced fault detection feeding.
Fault Detection Feeding
For a bare-metal microcontroller program, you can check some critical runtime states at the moment you feed the watchdog, such as stack depth, buffer status, and the hardware in critical functional chains (sensors, actuators, etc.). If any of these states is abnormal, record the error state and put the device into a functionally safe state.
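A minimal sketch of fault detection feeding, assuming the runtime state has already been collected into a struct. Every name here (health_t, fault_t, MIN_STACK_FREE, check_health, feed_if_healthy, hw_wdt_feed) is hypothetical, the threshold is an arbitrary example value, and hw_wdt_feed only increments a counter in place of the real, chip-specific feed operation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Snapshot of the runtime state checked before feeding (hypothetical). */
typedef struct {
    uint32_t stack_free_bytes;  /* remaining stack headroom */
    uint32_t buffer_used;       /* bytes currently queued in a buffer */
    uint32_t buffer_size;       /* capacity of that buffer */
    bool     sensor_ok;         /* critical sensor responds */
    bool     actuator_ok;       /* critical actuator responds */
} health_t;

typedef enum {
    FAULT_NONE = 0, FAULT_STACK, FAULT_BUFFER, FAULT_SENSOR, FAULT_ACTUATOR
} fault_t;

#define MIN_STACK_FREE 64u      /* assumed threshold, tune per system */

static unsigned feed_count;     /* stands in for the real feed operation */
static fault_t  last_fault;     /* recorded error state */

static void hw_wdt_feed(void) { feed_count++; }

/* Return the first fault found, or FAULT_NONE if all checks pass. */
static fault_t check_health(const health_t *h)
{
    assert(h != NULL);
    if (h->stack_free_bytes < MIN_STACK_FREE) return FAULT_STACK;
    if (h->buffer_used > h->buffer_size)      return FAULT_BUFFER;
    if (!h->sensor_ok)                        return FAULT_SENSOR;
    if (!h->actuator_ok)                      return FAULT_ACTUATOR;
    return FAULT_NONE;
}

/* Feed only when every check passes; otherwise record the fault.
   Returns true if the watchdog was fed. */
static bool feed_if_healthy(const health_t *h)
{
    fault_t f = check_health(h);
    if (f != FAULT_NONE) {
        last_fault = f;         /* the caller then enters the safe state */
        return false;
    }
    hw_wdt_feed();
    return true;
}
```

When feed_if_healthy returns false, the main loop would switch to its safe state rather than keep running normally.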
Enhanced Fault Detection Feeding
What is enhanced fault detection feeding? It adds sequence detection to the feeding step. There is a paradigm in IEC 61508 called sequence check, which might sound a bit strange. Just look at the diagram, and you'll understand immediately.
This involves setting a sequence marker for each key functional block of the main function. If a block finds the sequence incorrect, perform safe fault handling; if it is correct, continue with the next block. When feeding the watchdog, check whether the whole sequence is correct; if it is, feed it; otherwise, perform error handling, or simply let the watchdog bark.
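One way the sequence markers might look in code is sketched below. It assumes three key blocks per pass of the main loop, and seq_mark, seq_feed, and SEQ_STEPS are hypothetical names. On an error this sketch simply stops feeding and lets the watchdog bark, which is one of the two options mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SEQ_STEPS 3u           /* number of key blocks in the main loop (assumed) */

static uint8_t  seq_next;      /* index of the block expected to run next */
static bool     seq_error;     /* latched when the sequence is violated */
static unsigned feed_count;    /* stands in for the real feed operation */

/* Each key block calls this with its own index when it runs. */
static void seq_mark(uint8_t step)
{
    assert(step < SEQ_STEPS);
    if (step != seq_next) {
        seq_error = true;      /* block skipped, repeated, or out of order */
    } else {
        seq_next++;
    }
}

/* Called at the end of each pass: feed only after a complete, ordered pass.
   Returns true if the watchdog was fed. */
static bool seq_feed(void)
{
    bool ok = !seq_error && (seq_next == SEQ_STEPS);
    if (ok) {
        feed_count++;
    } else {
        seq_error = true;      /* stay latched so the watchdog will bark */
    }
    seq_next = 0;              /* start the next pass from block 0 */
    return ok;
}
```

A main loop would call seq_mark(0), seq_mark(1), and seq_mark(2) inside its three key blocks and seq_feed() at the end of each pass.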
For multitasking real-time systems, there are some different requirements:
- Detect whether the operating system is running correctly
- Detect if there are infinite loops in all tasks
- Detect deadlocks involving two or more tasks
- Detect if certain low-priority tasks cannot run due to high-priority tasks occupying the CPU
- ….
Mother Dog with Puppies Feeding Method
This name sounds a bit crude, haha. To make it easier to understand, let’s call it that. Let’s look at a diagram first and then explain:
Implementation Strategy Description:
watchdogTask can be seen as a doghouse where a group of dogs live, with the hardware watchdog as the mother dog and the software watchdogs of the sub-tasks as the puppies. Each sub-task must feed its own software watchdog once per loop cycle (of course, in an actual implementation, fault detection feeding can be added as well). In each loop of watchdogTask, all software watchdogs are decremented; if any of them reaches zero, that software watchdog barks and exception handling is required (reset or enter fail-safe mode). If none of the software watchdogs has timed out, watchdogTask feeds the hardware watchdog (which may be internal to the microcontroller or an external chip).
In actual implementation, attention must be paid to:
- watchdogTask should be given the highest priority
- Each loop should call os_delay to suspend for a certain time and yield CPU time to other tasks. The suspension time must be shorter than the hardware watchdog's timeout.
- Tasks’ priorities should be reasonably arranged
- Feeding the hardware watchdog from interrupt handlers or from any function other than watchdogTask is strictly prohibited.
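The doghouse scheme can be sketched in plain C as follows. This is an illustration under assumptions, not tied to any particular RTOS: NUM_TASKS, the reload values, and the function names are hypothetical, and hw_feed_count stands in for feeding the real hardware watchdog. In a real system, watchdogTask would call watchdog_task_cycle() once per loop and then os_delay, and access to the shared counters would need whatever protection the RTOS requires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS 3u   /* number of monitored sub-tasks (assumed) */

/* The puppies: one down-counter per sub-task, in watchdogTask cycles. */
static uint16_t soft_wdt[NUM_TASKS];
static const uint16_t soft_wdt_reload[NUM_TASKS] = { 5u, 10u, 20u }; /* assumed */

static unsigned hw_feed_count;   /* stands in for feeding the mother dog */

/* Give every software watchdog its full timing value. */
static void soft_wdt_init(void)
{
    for (uint32_t i = 0; i < NUM_TASKS; i++) {
        soft_wdt[i] = soft_wdt_reload[i];
    }
}

/* Called by sub-task `id` once per loop cycle. */
static void soft_wdt_feed(uint32_t id)
{
    assert(id < NUM_TASKS);
    soft_wdt[id] = soft_wdt_reload[id];
}

/* One watchdogTask cycle: decrement every puppy, and feed the hardware
   watchdog only if none has reached zero. Returns true if it was fed;
   false means some software watchdog barked. */
static bool watchdog_task_cycle(void)
{
    bool all_ok = true;
    for (uint32_t i = 0; i < NUM_TASKS; i++) {
        if (soft_wdt[i] > 0u) {
            soft_wdt[i]--;
        }
        if (soft_wdt[i] == 0u) {
            all_ok = false;
        }
    }
    if (all_ok) {
        hw_feed_count++;
    }
    return all_ok;
}
```

Giving each sub-task its own reload value lets slow tasks and fast tasks be supervised with different deadlines while only watchdogTask touches the hardware watchdog.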
How Often Should the Watchdog Bark?
The Pain of Being Too Short
If the watchdog timeout is set too short, the system can easily misjudge a healthy program as hung, leading to frequent resets or needless entries into fail-safe mode. The quality of any safety chain depends on its weakest link, and a timeout that is too short becomes that weak link. The firmware's loop time is not constant: when there are many external asynchronous events or nested interrupts, it can fluctuate significantly. Therefore, the worst case must be considered: how long can one pass through the system's loop take?
The Harm of Being Too Long
One method is to choose a timeout interval of several seconds. This strategy suits cases where you only want to reset a truly hung system and do not wish to study the system's timing in detail. It is a robust method. However, a long timeout means faults are detected late and recovery is slow, which is a problem for systems that must recover quickly, especially in high-safety scenarios such as nuclear power systems, automotive electronic systems, and medical device systems.
Therefore, in actual design, it is necessary to consider the worst-case scenario and try to choose a relatively short timeout duration, finding a balance between the two.
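One possible way to approach this balance is to measure the longest loop time actually observed and add a safety margin. The sketch below assumes loop durations come from a free-running timer (not shown); loop_time_record, wdt_timeout_suggest, and the margin parameter are hypothetical. A measured maximum is only a lower bound on the true worst case, so the margin should be chosen generously.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t worst_loop_us;   /* longest main-loop time seen so far */

/* Record one measured loop duration in microseconds. */
static void loop_time_record(uint32_t elapsed_us)
{
    if (elapsed_us > worst_loop_us) {
        worst_loop_us = elapsed_us;
    }
}

/* Suggest a timeout: observed worst case scaled by margin_percent
   (150 means 1.5x), clamped to the longest timeout the hardware supports. */
static uint32_t wdt_timeout_suggest(uint32_t margin_percent, uint32_t hw_max_us)
{
    assert(margin_percent >= 100u);
    uint64_t t = (uint64_t)worst_loop_us * margin_percent / 100u;
    return (t > hw_max_us) ? hw_max_us : (uint32_t)t;
}
```

For example, with an observed worst case of 1200 µs and a 150 percent margin, the suggested timeout is 1800 µs, provided the hardware allows it.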
In Summary
The watchdog strategy is not limited to microcontroller programming; it is also widely used in embedded Linux and even in databases. How to use the watchdog effectively is a very important topic when designing a robust electronic system.
Authorized transfer from: Embedded Guest House / Author: Yijun