Source: Segger
Edited by: strongerHuang
On January 25, 1994, the Clementine spacecraft was launched, a NASA satellite used to test sensors and spacecraft components under long-term exposure to the space environment. For want of a few lines of watchdog code, its mission was lost on May 7, 1994.
After two months of lunar mapping, Clementine left lunar orbit for its next target, the near-Earth asteroid Geographos. Soon afterwards, however, one of its onboard computers malfunctioned, effectively preventing NASA from operating the spacecraft and causing one of its thrusters to fire uncontrollably.
NASA spent 20 minutes trying to revive the system, but to no avail. A hardware reset command eventually brought Clementine back online, but it was too late: the spacecraft had exhausted all its fuel, and the rest of the mission had to be cancelled.
Afterwards, the team responsible for Clementine’s software wished they had used a hardware watchdog timer, because their software timeouts had proved insufficient.
1. How Can Watchdogs Help?
A watchdog is a hardware device that can be directly integrated into a microcontroller (MCU) or connected externally to the microcontroller. Its primary purpose is to execute error handling (usually a hardware reset) when it can be safely assumed that the system has hung or is executing incorrectly.
The main component of a watchdog is a counter, which is initialized to a certain value and then counts down toward zero. Software must regularly reset this counter to its initial value so that it never reaches zero. If the counter does reach zero, the watchdog assumes a failure and takes action, typically a CPU reset. The watchdog is therefore a last resort, to be relied on only when all other methods have failed, just as with Clementine.
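The counter behaviour described above can be modelled on a host PC in a few lines of C. This is only a sketch: the type `SimWatchdog` and the functions `wdt_init`, `wdt_feed`, and `wdt_tick` are hypothetical names made up for illustration and do not correspond to any real MCU register interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a watchdog down-counter (not real hardware). */
typedef struct {
    uint32_t reload;   /* value the counter is reset to when fed */
    uint32_t counter;  /* current count; zero means "assume the system hung" */
    bool     expired;  /* latched once the counter has reached zero */
} SimWatchdog;

/* Start the watchdog with the given reload value. */
void wdt_init(SimWatchdog *w, uint32_t reload)
{
    w->reload  = reload;
    w->counter = reload;
    w->expired = false;
}

/* Called by the application to show it is still alive.
 * Once the watchdog has expired, feeding no longer helps. */
void wdt_feed(SimWatchdog *w)
{
    if (!w->expired) {
        w->counter = w->reload;
    }
}

/* Called once per watchdog clock tick.
 * Returns true when a real watchdog would trigger its action (e.g. a reset). */
bool wdt_tick(SimWatchdog *w)
{
    if (w->expired) {
        return true;
    }
    if (w->counter > 0u) {
        w->counter--;
    }
    if (w->counter == 0u) {
        w->expired = true;
    }
    return w->expired;
}
```

With a reload value of 3, the model survives as long as `wdt_feed()` is called at least once every two ticks; three ticks in a row without a feed make `wdt_tick()` return true.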
2. How to Feed the Watchdog
Using a watchdog timer correctly takes more than simply restarting the counter (a step commonly referred to as “feeding” or “kicking” the watchdog). Developers must also choose the watchdog’s timeout period carefully, so that the watchdog can intervene before a faulty system performs any irreversible, harmful operation.
In simple applications, especially those not using an RTOS, developers typically feed the watchdog from the main loop. This only requires configuring an appropriate initial counter value: any value that exceeds the worst-case execution time of the entire main loop by at least one timer cycle will do. This is generally a fairly robust approach. Some systems need immediate recovery, but for those that only need to ensure they never hang indefinitely, it will certainly get the job done.
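As a rough illustration of that rule, the sketch below derives a reload value from an assumed worst-case loop time and then simulates a main loop that feeds once per pass. The helper names `wdt_reload_for` and `run_main_loop`, and the idea of measuring loop time in watchdog ticks, are assumptions made for this example only.

```c
#include <stdint.h>

/* Hypothetical helper: reload value = worst-case loop time, rounded up to
 * whole watchdog ticks, plus one tick of margin. tick_us must be nonzero. */
uint32_t wdt_reload_for(uint32_t worst_case_loop_us, uint32_t tick_us)
{
    uint32_t ticks = (worst_case_loop_us + tick_us - 1u) / tick_us;
    return ticks + 1u;
}

/* Simulates a bare-metal main loop. Pass i of the loop takes loop_ticks[i]
 * watchdog ticks, and the watchdog is fed once at the end of every pass.
 * reload must be nonzero.
 * Returns the index of the pass during which the watchdog would have
 * expired, or -1 if every pass finished in time. */
int run_main_loop(uint32_t reload, const uint32_t *loop_ticks, int passes)
{
    uint32_t counter = reload;

    for (int i = 0; i < passes; i++) {
        /* The watchdog keeps counting down while the loop body runs. */
        for (uint32_t t = 0u; t < loop_ticks[i]; t++) {
            counter--;
            if (counter == 0u) {
                return i;          /* a real system would reset here */
            }
        }
        counter = reload;          /* feed the watchdog once per pass */
    }
    return -1;
}
```

For example, a worst-case loop time of 4500 µs with a 1000 µs watchdog tick rounds up to 5 ticks, so the reload value becomes 6. A pass that unexpectedly takes 9 ticks would then trip the simulated watchdog, while passes of up to 5 ticks never do.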
3. In a Multi-tasking (RTOS) Environment
In more complex systems, especially multi-tasking ones, threads can hang for various reasons. Some threads may legitimately not run for long periods, such as communication threads waiting to receive data. Feeding the watchdog regularly and cleanly while also ensuring that every task remains in good condition is a major challenge for developers of these systems. They need to pay attention to aspects such as the following:
- Whether the operating system is functioning correctly.
- Whether high-priority tasks are exhausting the CPU, completely blocking the execution of low-priority tasks.
- Whether a deadlock is occurring that prevents the execution of one or more tasks.
- Whether the task routines are executed correctly and completely.
Developers also need to ensure that any changes made to the source code (whether a dedicated watchdog task or specific modifications to the monitored tasks) are small and efficient, so that they disrupt the system as little as possible.
4. Utilizing Watchdog Support in RTOS
Some RTOSes (such as SEGGER’s embOS) come with built-in watchdog support, which simplifies watchdog handling and reduces development time.
There are many ways to implement a hardware watchdog in an RTOS, and I remember sharing some with you before. In fact, once you understand the basic principles, you can design one yourself. For example: keep a “watchdog counter” for each task; if any task’s counter exceeds its configured time, take corrective action; otherwise, feed the hardware watchdog.
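A do-it-yourself version of this per-task counter idea might look roughly like the following. It is a minimal host-side sketch, not a production design: `TaskMonitor`, `monitor_register`, `monitor_alive`, and `monitor_tick` are hypothetical names, and a real RTOS port would also need to protect the shared state against concurrent access.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_TASKS 8

/* Per-task software watchdog counter (hypothetical design). */
typedef struct {
    uint32_t timeout;  /* ticks allowed between two check-ins */
    uint32_t elapsed;  /* ticks since the task last checked in */
} TaskWd;

typedef struct {
    TaskWd slots[MAX_TASKS];
    int    count;
} TaskMonitor;

void monitor_init(TaskMonitor *m)
{
    m->count = 0;
}

/* Registers a task with its own timeout; returns its id, or -1 on error. */
int monitor_register(TaskMonitor *m, uint32_t timeout_ticks)
{
    if (m->count >= MAX_TASKS || timeout_ticks == 0u) {
        return -1;
    }
    m->slots[m->count].timeout = timeout_ticks;
    m->slots[m->count].elapsed = 0u;
    return m->count++;
}

/* Called by a task to signal that it is still running correctly. */
void monitor_alive(TaskMonitor *m, int id)
{
    if (id >= 0 && id < m->count) {
        m->slots[id].elapsed = 0u;
    }
}

/* Called once per tick by the monitoring context.
 * Returns true only if every registered task is within its timeout;
 * the caller should feed the hardware watchdog only in that case. */
bool monitor_tick(TaskMonitor *m)
{
    bool all_ok = true;

    for (int i = 0; i < m->count; i++) {
        TaskWd *t = &m->slots[i];
        if (t->elapsed < t->timeout) {
            t->elapsed++;              /* saturates at the timeout */
        }
        if (t->elapsed >= t->timeout) {
            all_ok = false;            /* this task missed its deadline */
        }
    }
    return all_ok;
}
```

Because the hardware watchdog is fed only when `monitor_tick()` returns true, a single stuck or starved task is enough to let the hardware counter run out, which is the behaviour a multi-tasking system usually wants.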
Of course, some operating systems come with built-in watchdog functionality that only requires calling API functions. In embOS, for instance, tasks can easily register themselves with the embOS watchdog module, each with its own timeout period. A task then signals that it is executing correctly by calling a simple embOS API function. A single further embOS API call checks whether all monitored tasks have signaled their correct execution within their specified timeout periods; this call can be made from a dedicated watchdog task, from OS_Idle(), from the regular OS-internal timer interrupt service routine, or from any other ISR.
The user only needs to provide and register two functions: the first performs the hardware-specific feeding of the watchdog, and the second specifies further actions to be taken when the watchdog counter reaches zero. This makes it possible, for example, to log more information about the system state to Flash before executing a hardware reset or taking any other measure.
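The two-function pattern can be sketched generically as a pair of function pointers. The code below is not the embOS API; `WdHooks`, `wd_dispatch`, and the two example hooks are hypothetical names, and the counters merely stand in for a real register write and a real Flash log.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*WdCallback)(void);

/* The two user-supplied functions (hypothetical structure, not embOS). */
typedef struct {
    WdCallback feed_hw;     /* hardware-specific feeding of the watchdog */
    WdCallback on_expired;  /* extra actions when a task missed its timeout */
} WdHooks;

/* Calls exactly one hook, depending on the result of the task check.
 * Returns true if the hardware watchdog was fed. */
bool wd_dispatch(const WdHooks *hooks, bool all_tasks_ok)
{
    if (all_tasks_ok) {
        if (hooks->feed_hw != NULL) {
            hooks->feed_hw();
        }
        return true;
    }
    if (hooks->on_expired != NULL) {
        hooks->on_expired();
    }
    return false;
}

/* Example hooks: counters stand in for real hardware access and logging. */
int g_feed_count    = 0;
int g_expired_count = 0;

void example_feed_hw(void)
{
    g_feed_count++;     /* a real port would write the watchdog reload register */
}

void example_on_expired(void)
{
    g_expired_count++;  /* a real port might log system state to Flash here */
}
```

Keeping the hardware access behind `feed_hw` means the monitoring logic stays portable, while `on_expired` gives the application one last chance to record diagnostics before the hardware watchdog resets the system.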
5. Finally
When starting to design and develop applications using watchdogs, decide early on how you intend to use the watchdog and consider the available tools to help you implement it faster. At the very least, you don’t want to be stuck in “space,” right?