In critical settings such as industrial control and server applications, program stability directly affects business continuity and system reliability. Imagine a server program that silently stops responding: the result can be data loss and service interruption. This article introduces the “Watchdog” technique, which acts like a tireless guardian: it monitors program status continuously and recovers automatically when something goes wrong.

What is a Watchdog?

The Watchdog technique originated in embedded systems as a hardware-level monitoring mechanism and later evolved into software implementations. Simply put, a Watchdog is like a timer alarm: the monitored program must periodically “feed the dog” (send a heartbeat signal). If no heartbeat arrives within a set time, the Watchdog triggers a recovery mechanism (usually restarting the program). This design addresses problems such as program hangs, infinite loops, and resource exhaustion.

In C++ development, a software Watchdog can be implemented with multithreading. This article analyzes how to build a flexible and reliable Watchdog system based on actual code.

Core Design of the C++ Watchdog

Our Watchdog implementation is encapsulated in a class, with a modular structure that keeps it flexible and extensible.
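Before looking at the class in detail, the feed-and-timeout idea described above can be reduced to a few lines. The sketch below is illustrative only: HeartbeatTimer, Feed, and Expired are hypothetical names, not part of the article's Watchdog class, and the sketch works inside a single process rather than through a heartbeat file.

```cpp
#include <atomic>
#include <chrono>

// Hypothetical helper (not the article's Watchdog class): remembers when it
// was last "fed" and reports whether the timeout has elapsed since then.
class HeartbeatTimer {
public:
    using Clock = std::chrono::steady_clock;  // monotonic, unaffected by clock changes

    explicit HeartbeatTimer(std::chrono::milliseconds timeout)
        : m_timeout(timeout),
          m_lastFeed(Clock::now().time_since_epoch().count()) {}

    // Called by the monitored code to signal "I am still alive".
    void Feed() {
        m_lastFeed.store(Clock::now().time_since_epoch().count());
    }

    // Called by the monitor: true if no feed arrived within the timeout.
    // The caller may pass an explicit time point, which keeps the check testable.
    bool Expired(Clock::time_point now = Clock::now()) const {
        const Clock::duration last(m_lastFeed.load());
        return now - Clock::time_point(last) > m_timeout;
    }

private:
    std::chrono::milliseconds m_timeout;
    std::atomic<Clock::rep> m_lastFeed;  // tick count of the last feed
};
```

A worker thread would call Feed() once per loop iteration, while a monitoring thread would poll Expired() at its own interval and start recovery when it returns true. The timestamp is stored in a std::atomic so both threads can touch it without a lock.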
The core class Watchdog includes the following key components:

Configuration Structure: Customizing Monitoring Parameters

The behavior of the Watchdog is configured through the Config structure. Its key parameters are:

• Timeout (timeout): The maximum time (in milliseconds) the program may go without a heartbeat; exceeding it triggers recovery.
• Check Interval (checkInterval): The interval at which the Watchdog checks the heartbeat.
• Heartbeat File Path (heartbeatFile): The file that stores the heartbeat timestamp.
• Main Program Path (mainProgram): The path of the monitored program, used to restart it after an anomaly.

These parameters can be adjusted as needed; for example, a shorter timeout suits scenarios that require high responsiveness.

Dual Mode Design: Master and Slave Cooperation

To achieve reliable monitoring, the system uses a “master-slave” cooperation mechanism, defined by the Mode enumeration:

• Master Mode: Monitors the heartbeat file, determines whether the program has timed out, and executes recovery operations.
• Slave Mode: Runs inside the monitored program and periodically updates the heartbeat file as a liveness signal.

This separation keeps the monitoring side and the monitored side independent, so that a single failure cannot take down the whole monitoring system.

Core Methods: Closing the Monitoring Loop

The Watchdog class provides complete lifecycle management and functional interfaces:

• Start()/Stop(): Start and stop the Watchdog.
• SetTimeoutCallback(): Set a custom handler to run after a timeout.
• SetLogCallback(): Configure how logs are output, to make troubleshooting easier.
• log(): A unified log output interface that supports custom log callbacks.
• Private methods: The core logic for heartbeat checks, program restarts, and so on.

Master and Slave: Dual Mode Cooperation Mechanism

The cooperation between the two modes is the key to reliable operation. The core process is as follows.

Slave Mode: Periodically Sending Heartbeats

A program running in Slave mode writes the current timestamp (in milliseconds) to the heartbeat file at the interval set by checkInterval. The code logic is as follows:

1. After entering the loop, sleep for the configured check interval.
2. Attempt to open the heartbeat file and write the current timestamp.
3. If the file operation fails, log the error.

For example, with checkInterval=2000ms, the Slave updates the heartbeat file every 2 seconds, telling the Master, “I am still running normally.”

Master Mode: Real-time Monitoring and Recovery

As the monitoring center, Master mode has a more complex workflow:

1. Read the latest timestamp from the heartbeat file at the configured interval.
2. Calculate the difference between the current time and the last heartbeat time.
3. If the difference exceeds the timeout threshold, treat the program as abnormal:
   ◦ If a custom callback function is set, execute the callback logic.
   ◦ Otherwise, execute the default recovery process: first stop the abnormal program, then restart it.
4. Catch exceptions throughout the process so that the monitoring logic itself stays stable.

For example, with timeout=10000ms and checkInterval=5000ms, the Master checks every 5 seconds, and if it finds that the heartbeat has not been updated for more than 10 seconds, it triggers a program restart.

Practical Usage: Code Example Analysis

The following example shows how to use the Watchdog system.

Master Mode Configuration Example

    // Configuration parameters
    Watchdog::Config config = {
        std::chrono::milliseconds(10 * 1000), // Timeout of 10 seconds
        std::chrono::milliseconds(5 * 1000),  // Check every 5 seconds
        "heartbeat.txt",                      // Heartbeat file path
        "WatchDogSlaveTest.exe"               // Monitored program
    };

    // Create the Watchdog instance
    Watchdog watchdog(config);

    // Set log callback (output to file)
    watchdog.SetLogCallback(WatchDogLog);

    // Set timeout callback (custom handling logic)
    watchdog.SetTimeoutCallback(OnTimeout);

    // Start Master mode monitoring
    watchdog.Start(Watchdog::Master);

Slave Mode Configuration Example

    Watchdog::Config config = {
        std::chrono::milliseconds(10 * 1000), // Timeout is not used in Slave mode
        std::chrono::milliseconds(2 * 1000),  // Update heartbeat every 2 seconds
        "heartbeat.txt",                      // Heartbeat file shared with the Master
        "WatchDogSlaveTest.exe"               // Program path is not used in Slave mode
    };

    Watchdog watchdog(config);
    watchdog.SetLogCallback(WatchDogLog);

    // Start Slave mode to send heartbeats
    watchdog.Start(Watchdog::Slave);

As the example shows, Slave mode only needs to send heartbeats, while Master mode requires the full configuration of monitoring parameters and recovery strategy.

Key Technical Points of the Watchdog

This C++ Watchdog implementation relies on several techniques to ensure reliability.

Thread-Safe Design

A std::atomic<bool> variable, m_running, controls the running state of the thread. This avoids race conditions on the flag in a multithreaded environment and keeps start/stop operations safe.

Cross-Platform Compatibility

The program restart commands are adapted to Windows and Linux through conditional compilation:

• Windows systems use the taskkill and start commands.
• Linux systems use pkill and the background operator &.

This design allows the Watchdog to work in both operating system environments.

Flexible Callback Mechanism

Two extension interfaces, the log callback and the timeout callback, allow developers to:

• Output logs to files, databases, or monitoring systems.
• Implement custom recovery logic (such as sending alert emails or performing specific cleanup operations).

Conclusion: Ensuring Program Stability

Watchdog technology is an important means of improving program reliability, and it is especially suitable for long-running service programs, industrial control software, and similar scenarios. The C++ Watchdog implementation introduced in this article has the following advantages:

• Lightweight and efficient: Built on the standard library, with no third-party dependencies.
• Flexible configuration: Adapts to different scenarios through parameter adjustments.
• Easy to extend: The callback mechanism supports custom business logic.
• Cross-platform compatibility: Supports both Windows and Linux systems.

In actual development, set the timeout according to the importance of the business (usually 3-5 times the program's normal response time) and pair the Watchdog with a comprehensive logging system. Whether for personal projects or enterprise-level applications, a Watchdog mechanism can significantly reduce losses caused by program anomalies, which makes it a reliability technique well worth mastering.
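As a closing illustration, the file-based heartbeat exchange between Slave and Master described earlier could look roughly like the sketch below. The article does not show its private methods, so NowMs, WriteHeartbeat, ReadHeartbeat, and IsTimedOut are hypothetical names, and treating a missing or unreadable heartbeat file as a timeout is an assumption of this sketch rather than confirmed behavior of the original class.

```cpp
#include <chrono>
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helpers; the article's own private methods are not shown.

// Current wall-clock time in milliseconds since the system_clock epoch.
inline std::int64_t NowMs() {
    using namespace std::chrono;
    return duration_cast<milliseconds>(
               system_clock::now().time_since_epoch()).count();
}

// Slave side: overwrite the heartbeat file with a timestamp.
// Returns false if the file could not be opened or written.
inline bool WriteHeartbeat(const std::string& path, std::int64_t timestampMs) {
    std::ofstream out(path, std::ios::trunc);
    if (!out) return false;
    out << timestampMs;
    out.flush();
    return static_cast<bool>(out);
}

// Master side: read the last timestamp from the heartbeat file.
// Returns false if the file is missing or does not contain a number.
inline bool ReadHeartbeat(const std::string& path, std::int64_t& outMs) {
    std::ifstream in(path);
    return static_cast<bool>(in >> outMs);
}

// Master side: true if the heartbeat is older than the timeout.
// Assumption: an unreadable heartbeat file counts as a timeout.
inline bool IsTimedOut(const std::string& path,
                       std::chrono::milliseconds timeout,
                       std::int64_t nowMs) {
    std::int64_t last = 0;
    if (!ReadHeartbeat(path, last)) return true;
    return nowMs - last > timeout.count();
}
```

In this sketch the Slave loop would call WriteHeartbeat(path, NowMs()) after each sleep, and the Master loop would call IsTimedOut(path, timeout, NowMs()) at every check. Because the file is truncated before the new timestamp is written, a Master that reads at exactly the wrong moment could see an empty file; writing to a temporary file and renaming it over the heartbeat file is one common way to avoid that.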