[Introduction] If you write microcontroller programs, you deal with the watchdog every day. But are you raising your dog correctly? Just keep feeding it so that it never barks, right? Is it really that simple? In fact, it might not be as simple as you think...
What is a Watchdog?
The watchdog, also known as the watchdog timer, is essentially a timing circuit or software timer mechanism.
Working Principle:
The hardware basis of the watchdog is a counter, which is set to a certain initial timing value and then decrements to zero. The software is responsible for frequently resetting the counter to its initial timing value to ensure that the count never reaches zero. If it does reach zero, it indicates a fault has occurred, and corresponding measures must be taken, either a restart or entering a fail-safe state, depending on the system’s design.
During normal operation, the microcontroller, processor, or thread periodically resets the watchdog timer’s timing value, while the timer keeps counting in the background. If the timing period expires without the dog being fed again, the dog barks, indicating that something unusual has occurred! At this point, the dog sends out a signal that triggers a corresponding action. What exactly this action is depends on the actual system design. Common watchdog chips send a reset signal to the microcontroller or processor, while for software timers the specific action can vary widely, depending on the safety strategy adopted.
In simple terms, resetting the timer is called feeding the dog, and the timing value is the dog food. The dog continuously digests the food in its stomach; if it isn’t fed again before the food runs out, it barks for food, sending out a warning. Conversely, a system that is functioning normally keeps its watchdog well fed, and the dog stays quiet.
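To make the countdown concrete, here is a minimal sketch of the principle as a pure software model. The names watchdog_t, wdt_init, wdt_feed and wdt_tick are made up for illustration and do not belong to any real chip or library; on real hardware the counter lives in the watchdog peripheral and the "bark" is usually a reset signal.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a watchdog down-counter. */
typedef struct {
    uint32_t reload;  /* initial timing value (the "dog food") */
    uint32_t count;   /* current countdown value */
    bool     barked;  /* latched once the count reaches zero */
} watchdog_t;

/* Load the counter with its initial timing value. */
void wdt_init(watchdog_t *w, uint32_t reload)
{
    w->reload = reload;
    w->count  = reload;
    w->barked = false;
}

/* Feed the dog: restore the counter to its initial value. */
void wdt_feed(watchdog_t *w)
{
    w->count = w->reload;
}

/* Called once per timer tick. Returns true once the dog has barked. */
bool wdt_tick(watchdog_t *w)
{
    if (w->count > 0) {
        w->count--;
    }
    if (w->count == 0) {
        w->barked = true;  /* timeout: reset or fail-safe would follow */
    }
    return w->barked;
}
```

With a reload value of 3, two ticks followed by a feed keep the dog quiet; three ticks without a feed make it bark.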
Note: I’ve seen articles refer to resetting the watchdog timer as kicking the dog (kick watchdog), which is not ideal; we should treat the dog well, so let’s call it feeding instead.
The watchdog mechanism plays a very important role in electronic systems. Here’s an extreme example: without a watchdog circuit, a hung program on a Mars rover would mean losing communication for good. Imagine that scenario: unable to communicate or wake up, the rover turns into space debris.
What Errors Can It Monitor?
- Stack or heap overflow, program runaway
- A certain piece of code fails to return or enters an infinite loop
- Strong electromagnetic interference damaging data, causing system anomalies; you might find this hard to understand, but think of many electronic systems in military or aerospace fields that often operate in environments with strong electromagnetic interference
- System crash due to bugs
- Deadlock in multitasking systems
- ……
The possible causes are countless, but don’t panic! You have a good watchdog helping you; let the watchdog clean up the mess. In a complex embedded system, it’s impossible to guarantee there are no bugs, but by using a watchdog, you can ensure that no bug will hang the system indefinitely.
What to Do When the Dog Barks?
What are the common handling strategies?
- System reset. Most people know from experience what to do when a system hangs: restart it. It reminds me of Liu Huan’s song “From the Beginning Again”; how great it would be if life could be restarted, but it can’t! Interested? Take a listen.
- Fail-safe, known in English as fail-safe mode. It means that even if the device suffers a fatal fault, it must not cause a safety incident. To put it bluntly, even if it hangs, it should not harm anything else. This is easier to grasp with an example: if an elevator is descending and the watchdog detects a program anomaly, the safe action is to stop the motor immediately; otherwise the car could keep descending out of control, leading to disaster. This principle is reflected in the IEC 61508 functional safety standard, as well as in medical and automotive safety standards.
- Reset counting, a recommended practice: after the chip resets, read the chip’s reset status register and count the watchdog reset events. If a watchdog reset happens three times in a row, the conservative approach is to switch the system to a safe state or display an error message, thus avoiding endless restarts. How to do this? Using IAR as an example, declare the counter so that the startup code does not initialize it (IAR calls this __no_init); its value then survives a reset and is lost only when power is removed: __no_init int wdtResetCounter;
- ….depends on specific design strategies
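The reset-counting idea can be sketched roughly as follows. Only __no_init and wdtResetCounter come from the text above; reset_cause_t, wdt_boot_check, wdt_mark_healthy and the NO_INIT macro are hypothetical names for illustration, and how the reset cause is actually read depends on your chip’s reset status register. Because a __no_init variable holds garbage after power-on, this sketch clears the counter whenever the reset was not caused by the watchdog.

```c
#include <stdbool.h>

/* On IAR the counter is excluded from startup initialization.
 * On any other compiler this sketch falls back to an ordinary variable,
 * which is only useful for trying the logic out on a PC. */
#if defined(__IAR_SYSTEMS_ICC__)
#define NO_INIT __no_init
#else
#define NO_INIT
#endif

#define WDT_RESET_LIMIT 3  /* "three times in a row" */

/* Hypothetical reset causes, decoded from the reset status register. */
typedef enum { RESET_POWER_ON, RESET_WATCHDOG, RESET_OTHER } reset_cause_t;

NO_INIT int wdtResetCounter;

/* Call early at boot. Returns true when the system should go to a
 * safe state instead of starting normally. */
bool wdt_boot_check(reset_cause_t cause)
{
    if (cause == RESET_WATCHDOG) {
        wdtResetCounter++;
    } else {
        wdtResetCounter = 0;  /* power-on: old value is meaningless */
    }
    return wdtResetCounter >= WDT_RESET_LIMIT;
}

/* Call after the system has run normally for a while, so that only
 * consecutive watchdog resets are counted. */
void wdt_mark_healthy(void)
{
    wdtResetCounter = 0;
}
```

When wdt_boot_check returns true, the startup code would skip normal operation and enter the safe state or show the error message; what "running normally for a while" means before calling wdt_mark_healthy is a design decision for the specific system.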
If we want the system to recover quickly, we should make the initialization after a watchdog reset shorter than the normal power-on initialization, which means skipping some of the device’s self-checks. Of course, in some systems it’s best to run a comprehensive self-check anyway, because the root cause of the watchdog timeout may itself be a hardware anomaly.
How to Feed the Dog Specifically?
For bare-metal programs, I recommend the following two handling strategies: fault detection feeding and enhanced fault detection feeding.
Fault Detection Feeding
For a bare-metal microcontroller program, you can check some key runtime states every time you feed the dog, such as stack depth, buffer status, and the hardware in the key functional chain (sensors, actuators, etc.). If any of these states is abnormal, record the error status and put the device into a functionally safe state.
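A minimal sketch of fault detection feeding might look like the following. Everything here is hypothetical: system_health_t, checked_feed, hw_wdt_feed and enter_safe_state are illustrative names, and the two stub functions only set variables so the logic can be tried on a PC. On a real target they would write the watchdog reload register and drive the outputs to their safe values.

```c
#include <stdbool.h>
#include <stddef.h>

/* Snapshot of the key runtime states checked before each feed. */
typedef struct {
    size_t stack_used;       /* bytes of stack currently in use */
    size_t stack_limit;      /* maximum allowed stack usage */
    size_t buffer_fill;      /* bytes currently queued in the buffer */
    size_t buffer_capacity;  /* buffer size */
    bool   sensor_ok;        /* key sensor responds plausibly */
    bool   actuator_ok;      /* key actuator reports no fault */
} system_health_t;

typedef enum {
    FEED_OK = 0,
    FEED_ERR_STACK,
    FEED_ERR_BUFFER,
    FEED_ERR_HARDWARE
} feed_result_t;

static unsigned      hw_feed_count;  /* stub: counts hardware feeds */
static feed_result_t last_error;     /* recorded error status */
static bool          in_safe_state;

static void hw_wdt_feed(void)      { hw_feed_count++; }
static void enter_safe_state(void) { in_safe_state = true; }

/* Feed the dog only if every check passes; otherwise record the
 * error and switch to the safe state without feeding. */
feed_result_t checked_feed(const system_health_t *h)
{
    feed_result_t r = FEED_OK;

    if (h->stack_used > h->stack_limit) {
        r = FEED_ERR_STACK;
    } else if (h->buffer_fill > h->buffer_capacity) {
        r = FEED_ERR_BUFFER;
    } else if (!h->sensor_ok || !h->actuator_ok) {
        r = FEED_ERR_HARDWARE;
    }

    if (r == FEED_OK) {
        hw_wdt_feed();
    } else {
        last_error = r;
        enter_safe_state();
    }
    return r;
}
```

The main loop would fill in a system_health_t once per iteration and call checked_feed in the one place where it used to feed the dog unconditionally.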
Enhanced Fault Detection Feeding
What is enhanced fault detection feeding? It is feeding with sequence detection. There is a paradigm called sequence check in IEC 61508, which might sound a bit strange; just look at the picture, and you’ll understand immediately.
This involves setting a sequence marker for each key functional block of the main function. If the sequence goes wrong, perform safe fault handling; if it is correct, continue with the next block. When it is time to feed the dog, check whether the sequence is correct: if so, feed; if not, perform error handling, or simply let the dog bark.
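The sequence marker scheme described above can be sketched like this. The step names, seq_checkpoint and seq_feed are hypothetical and only illustrate the idea; the counter seq_feed_count stands in for the real hardware feed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical key functional blocks of the main loop, in order. */
enum {
    SEQ_START         = 0,
    SEQ_READ_INPUTS   = 1,
    SEQ_COMPUTE       = 2,
    SEQ_WRITE_OUTPUTS = 3   /* last block before feeding */
};

static uint8_t  seq_marker = SEQ_START;
static bool     seq_fault  = false;
static unsigned seq_feed_count;  /* stub: counts hardware feeds */

/* Each key block calls this on entry with its own step id.
 * A block that runs out of order latches a fault. */
void seq_checkpoint(uint8_t step)
{
    if (step == seq_marker + 1) {
        seq_marker = step;
    } else {
        seq_fault = true;
    }
}

/* Feed only if every block ran exactly once and in order.
 * Returns false when the sequence was wrong; the caller can then do
 * error handling or simply let the dog bark. */
bool seq_feed(void)
{
    bool ok = !seq_fault && (seq_marker == SEQ_WRITE_OUTPUTS);

    if (ok) {
        seq_feed_count++;        /* the real feed would go here */
        seq_marker = SEQ_START;  /* arm the check for the next loop */
    }
    return ok;
}
```

In this sketch a fault stays latched once detected, so a loop that skipped a block can never feed again; whether to latch or to recover is a choice for the specific safety strategy.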
For multitasking real-time systems, there are some different requirements:
- Detect whether the operating system is running correctly
- Detect if there are any infinite loops among all tasks
- Detect deadlocks involving two or more tasks
- Detect if certain low-priority tasks cannot run due to high-priority tasks occupying the CPU
- ….
Mother Dog with Puppies Feeding Method
This name sounds a bit crude, haha. To facilitate understanding, let’s call it that; let’s show a picture first before explaining:
Implementation Strategy Description:
watchdogTask can be seen as a doghouse where a group of dogs lives: the hardware watchdog is the mother dog, and the software watchdogs of the sub-tasks are the puppies. Each sub-task needs to feed its own dog once in each loop cycle (of course, in an actual implementation this can also be combined with task-level fault detection feeding). In each cycle of watchdogTask, every software watchdog is decremented; if any of them counts down to zero, that soft dog barks, and exception handling must be performed (reset or enter fail-safe mode). If none of the software dogs has run out, watchdogTask feeds the hardware watchdog (which may be built-in or an external chip).
In actual implementation, attention must be paid to:
- watchdogTask should be given the highest priority
- Each loop should call os_delay for a certain time so that other tasks get CPU time to run. The suspension time should be less than the maximum hardware watchdog timeout.
- Task priorities must be arranged reasonably
- Feeding the hardware dog from interrupt handlers or any other function is strictly prohibited.
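Here is a rough sketch of the mother-dog-with-puppies scheme, written without any RTOS so that the logic can be tried on a PC. The names soft_wdt_t, soft_wdt_init, soft_wdt_feed and watchdog_task_cycle are hypothetical, and the hardware feed is replaced by a counter. In a real system watchdog_task_cycle would be the body of the watchdogTask loop, followed by the os_delay call, and access to the shared counters would need whatever protection your RTOS requires.

```c
#include <stdint.h>

#define NUM_TASKS 3  /* number of monitored sub-tasks (assumed) */

/* One software watchdog (puppy) per sub-task. */
typedef struct {
    uint16_t reload;  /* allowed watchdogTask cycles between feeds */
    uint16_t count;   /* remaining cycles */
} soft_wdt_t;

static soft_wdt_t soft_dogs[NUM_TASKS];
static unsigned   hw_feed_count;  /* stub: counts mother-dog feeds */

/* Give a sub-task its own software watchdog. */
void soft_wdt_init(int task_id, uint16_t reload_cycles)
{
    soft_dogs[task_id].reload = reload_cycles;
    soft_dogs[task_id].count  = reload_cycles;
}

/* Called by each sub-task once per loop cycle. */
void soft_wdt_feed(int task_id)
{
    soft_dogs[task_id].count = soft_dogs[task_id].reload;
}

/* One cycle of watchdogTask: decrement every puppy, and feed the
 * hardware dog only if none of them has run out.
 * Returns -1 when all tasks are healthy, otherwise the id of the
 * first task whose software watchdog expired. */
int watchdog_task_cycle(void)
{
    int expired = -1;

    for (int i = 0; i < NUM_TASKS; i++) {
        if (soft_dogs[i].count > 0) {
            soft_dogs[i].count--;
        }
        if (soft_dogs[i].count == 0 && expired < 0) {
            expired = i;  /* this soft dog barks */
        }
    }
    if (expired < 0) {
        hw_feed_count++;  /* the real hardware feed would go here */
    }
    return expired;
}
```

Each task can get a different reload value, so a slow low-priority task is allowed more watchdogTask cycles between feeds than a fast one. When the function returns a task id, the caller knows which task hung and can log it before resetting or entering fail-safe mode.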
How Long Before the Dog Barks?
The Pain of Being Too Short
If the watchdog timeout is set too short, the system is prone to false alarms, which may lead to frequent resets or frequent entry into fail-safe mode. The reliability of any safety chain depends on its weakest link, and a timeout that is too short becomes that weak link. The firmware’s loop time is dynamic: when there are many external asynchronous events or nested interrupts, it can fluctuate significantly, so the timeout must be based on the worst-case time the system needs to complete one loop.
The Harm of Being Too Long
One method is to choose an interval of several seconds. This strategy suits cases where you only want to reset a truly hung system and don’t want to investigate the system’s timing in detail; it’s a robust method. However, a long timeout means slow fault detection, which is a problem for systems that require rapid recovery, especially in high-safety scenarios such as nuclear power systems, automotive electronic systems, and medical device systems.
Therefore, in actual design, it is necessary to consider the worst-case scenario and try to choose a relatively short timing duration, seeking a balance between the two.
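As a rough illustration of this balance, the sketch below records the worst loop time seen so far and derives a timeout from it. The names loop_stats_t, loop_stats_record and wdt_suggest_timeout_ms are hypothetical, and both the safety margin and the upper bound on detection time are assumptions that each project has to pick for itself; measuring alone does not replace a proper worst-case analysis.

```c
#include <stdint.h>

/* Longest main-loop time observed so far, in milliseconds. */
typedef struct {
    uint32_t worst_ms;
} loop_stats_t;

/* Record one measured loop time, keeping only the worst case. */
void loop_stats_record(loop_stats_t *s, uint32_t loop_ms)
{
    if (loop_ms > s->worst_ms) {
        s->worst_ms = loop_ms;
    }
}

/* Suggest a timeout: worst-case loop time plus a margin in percent,
 * but never longer than the slowest fault detection the system can
 * tolerate. Returns 0 when the worst-case loop already takes at
 * least that long, meaning no timeout satisfies both limits. */
uint32_t wdt_suggest_timeout_ms(const loop_stats_t *s,
                                uint32_t margin_percent,
                                uint32_t max_detect_ms)
{
    if (s->worst_ms >= max_detect_ms) {
        return 0;
    }
    uint64_t t = (uint64_t)s->worst_ms * (100u + margin_percent) / 100u;
    if (t > max_detect_ms) {
        t = max_detect_ms;
    }
    return (uint32_t)t;
}
```

For example, a worst-case loop of 40 ms with a 50 percent margin gives 60 ms, which is short enough for fast detection but still above the worst case. A return value of 0 is a hint that the loop itself needs to be restructured before any timeout can work.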
In Summary
The watchdog strategy is widely used, not only in microcontroller programming but also in embedded Linux and databases. How to use the watchdog sensibly is a very important topic when designing a robust electronic system.