Original: https://mp.weixin.qq.com/s/8tBqZ3G7reXkHOXj4rxMMA

1. Introduction

The Hibernate and Sleep functions are core features of Linux Power Management (PM). They serve similar purposes: pause usage -> save context -> power off the system to save energy -> ... -> restore the system -> restore context -> continue usage.

This article introduces these two functions from the perspective of the interfaces provided by the kernel to user space, and subsequent articles will analyze their implementation logic and execution actions.

2. Terminology Related to Hibernate and Sleep

▆ Hibernate and Sleep

These are abstractions of Linux power management from the user’s perspective: tangible features that users can see. What they have in common is that the system’s operational context is saved before the system is suspended, and operation resumes after the system is restored, as if nothing had happened. They differ in where the context is saved, how system restoration is triggered, and how they are implemented.

▆ Suspend

This term has two meanings. First, at the low-level implementation layer, it is the general name for the Hibernate and Sleep functions, meaning that the system is suspended. Depending on where the context is saved, it divides into Suspend to Disk (STD, i.e., Hibernate, where the context is saved on the hard disk) and Suspend to RAM (STR, a type of Sleep, where the context is saved in RAM). Second, it refers to the code-level implementation of the Sleep function, found in the “kernel/power/suspend.c” file.

▆ Standby

Standby is a special case of the Sleep function, which can be translated as “napping”.

Normal Sleep (STR) will place the CPU in a low-power state (usually Sleep) after processing the context, via architecture-dependent code. In reality, depending on the different requirements for power consumption and sleep wake-up time, the CPU may provide various low-power states. For instance, besides Sleep, it may offer a Standby state, where the CPU is in a light sleep mode and will wake up immediately at any disturbance.

▆ Wakeup

This is the first time we formally introduce the concept of Wakeup. We have mentioned restoring the system several times; in the kernel this is called Wakeup. On the surface, waking up seems simple: whether from hibernation, sleep, or napping, some stimulus must bring the system back to its normal state. The complexity lies in which stimuli are able to do so.

In the animal kingdom, a rise in temperature may be the only stimulus that can wake animals from hibernation. In contrast, a kick or an alarm clock can wake us from sleep. For napping, any disturbance can wake us up.

In the computing world, during hibernation (Hibernate), the entire system’s power is turned off, so the only way to wake up is by using the Power button. However, during sleep, to shorten the wake-up time, not all power is turned off. Additionally, to enhance user experience, certain important devices (like keyboards) are usually kept powered, allowing them to wake the system.

These deliberately retained devices that can wake the system are collectively called Wakeup sources. Selecting the Wakeup sources is a key part of PM design work, especially for functions like Sleep and Standby.

3. Software Architecture and Module Overview

3.1 Software Architecture

The software architecture in the kernel can be roughly divided into three levels, as shown in the diagram below:

[Figure: Introduction to Hibernate and Sleep Functions in Linux Power Management]

1) API Layer, an abstraction layer describing the APIs exposed to user space.

There are two types of APIs here: one type involves the Hibernate and Sleep functions (global APIs), including actual functionality, test functionality, debug functionality, etc., provided through sysfs and debugfs; the other type is specific to Hibernate (STD APIs), provided through sysfs and character devices.

2) PM Core, the core logic layer of power management, located in the kernel/power/ directory, includes multiple sub-modules such as main functionality (main), STD, STR & Standby, and auxiliary functions (assistant).

The main functionality is primarily responsible for implementing the logic related to global APIs, providing corresponding APIs to user space;

STD includes sub-modules like hibernate, snapshot, swap, block_io, etc., responsible for implementing STD functionality and hardware-independent logic;

STR & Standby includes sub-modules like suspend and suspend_test, responsible for implementing STR, Standby, and other hardware-independent logic.

3) PM Driver, the power management driver layer, involves architecture-independent drivers, architecture-dependent drivers, device models, and various device drivers.

3.2 User Space Interfaces

3.2.1 /sys/power/state

state is a file in sysfs and the core interface of PM, implemented in “kernel/power/main.c”. It is used to put the system into a specified Power State (a power mode such as Hibernate, Sleep, or Standby). At the low level, the different power management functions are implemented by switching between Power States.

Reading this file returns the current Power States supported by the system, in string form. In the kernel, there are two types of Power States: one related to Hibernate, named “disk”; besides “disk”, the kernel defines three other states in “kernel/power/suspend.c” in an array format, as follows:

    const char *const pm_states[PM_SUSPEND_MAX] = {
        [PM_SUSPEND_FREEZE]  = "freeze",
        [PM_SUSPEND_STANDBY] = "standby",
        [PM_SUSPEND_MEM]     = "mem",
    };

The explanations for these Power States are as follows:

▆ freeze

This Power State does not involve specific Hardware or Driver; it merely freezes all processes, including user space processes and kernel threads. Compared to the familiar “hibernate” and “sleep”, we can call it “resting” (it is easy to imagine that the energy savings are limited).

[Note: The earlier descriptions did not mention this State because in earlier kernels it was only a step within Sleep, Hibernate, and the other functions, and it was split out as a separate state only recently. Another reason is that its power savings are modest, so its usage scenarios are limited.]

▆ standby, which is the Standby state described in Chapter 2.

▆ mem, which refers to the Sleep function commonly discussed, also described as STR in Chapter 2, Suspend to RAM.

▆ disk, which refers to the Hibernate function, also described as STD in Chapter 2, Suspend to Disk.

Writing a specific Power State string will set the system to that mode.
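As a rough illustration of the read-then-write usage just described, the C sketch below checks whether a state name is listed before writing it. The function names pm_state_listed() and pm_request_state() are hypothetical, and the path is passed in as a parameter so the code can be tried against an ordinary file. Writing to the real /sys/power/state requires root and actually suspends the machine.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Return true if `state` appears as a whole word in `states_line`, the
 * space-separated text read from the state file (e.g. "freeze mem disk").
 */
bool pm_state_listed(const char *states_line, const char *state)
{
    size_t want = strlen(state);
    const char *p = states_line;

    while (*p != '\0') {
        p += strspn(p, " \t\n");              /* skip separators */
        size_t len = strcspn(p, " \t\n");     /* length of the next token */
        if (len > 0 && len == want && strncmp(p, state, len) == 0)
            return true;
        p += len;
    }
    return false;
}

/*
 * Read the supported states from `path` and, if `state` is listed, write it
 * back to request the transition. Returns 0 on success, -1 on any failure.
 */
int pm_request_state(const char *path, const char *state)
{
    char line[128];
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fgets(line, sizeof line, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);

    if (!pm_state_listed(line, state))
        return -1;

    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fputs(state, f) >= 0;
    /* fclose() flushes the buffered write, so a write error shows up here. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

A caller would use it as pm_request_state("/sys/power/state", "mem"). Checking the list first avoids writing a state that the platform does not support.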

3.2.2 /sys/power/pm_trace

PM Trace provides trace records during the power management process. The “CONFIG_PM_TRACE” option (kernel/power/Kconfig) determines whether it is compiled into the kernel, and the “/sys/power/pm_trace” file controls whether it is enabled at runtime.

3.2.3 /sys/power/pm_test

PM test is used to test power management functions. The “CONFIG_PM_DEBUG” option (kernel/power/Kconfig) determines whether it is compiled into the kernel. The core idea is:

▆ The power management process is divided into multiple steps in sequence, such as core, platform, devices, etc. These steps are referred to as PM Test Levels.

▆ The system saves the current PM Test Level in a global variable (pm_test_level). The value of this variable can be obtained and modified through the “/sys/power/pm_test” file.

▆ After each power management step ends, PM test code is inserted, which takes the current execution step as a parameter and checks whether the current PM Test Level matches the execution step. If they match, it indicates that the step executed successfully. For testing purposes, after successful execution, the system will print test information and exit the PM process after waiting for a while.

▆ Developers can modify the global Test Level to purposefully test whether the steps of interest execute successfully.
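The check described in the list above can be modeled in a few lines of user-space C. This is a simplified sketch, not the kernel code: test_point() and run_suspend_steps() are hypothetical names, the step order (freezer first, core last) is an assumption, and the waiting period is left out.

```c
#include <stdio.h>

/* Test levels, named after the entries in the kernel's pm_tests[] array. */
enum test_level {
    TEST_NONE, TEST_CORE, TEST_CPUS, TEST_PLATFORM, TEST_DEVICES, TEST_FREEZER,
};

static const char *const level_names[] = {
    "none", "core", "processors", "platform", "devices", "freezer",
};

/* Models the global variable that /sys/power/pm_test reads and writes. */
static enum test_level pm_test_level = TEST_NONE;

/* Returns 1 when the step that just completed is the one selected for testing. */
static int test_point(enum test_level step)
{
    if (pm_test_level != step)
        return 0;
    printf("test: step '%s' completed, backing out\n", level_names[step]);
    /* The kernel waits for a while here; the delay is omitted in this model. */
    return 1;
}

/*
 * Walk the steps in an assumed execution order and return the last step
 * that ran. TEST_NONE never matches a step, so the whole sequence runs.
 */
enum test_level run_suspend_steps(enum test_level selected)
{
    static const enum test_level order[] = {
        TEST_FREEZER, TEST_DEVICES, TEST_PLATFORM, TEST_CPUS, TEST_CORE,
    };
    enum test_level last = TEST_NONE;

    pm_test_level = selected;
    for (size_t i = 0; i < sizeof order / sizeof order[0]; i++) {
        last = order[i];        /* the real step would execute here */
        if (test_point(last))
            break;              /* leave the PM process early */
    }
    return last;
}
```

Calling run_suspend_steps(TEST_DEVICES) stops after the devices step, while run_suspend_steps(TEST_NONE) runs every step through to core.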

As mentioned above, this file is used to obtain and modify the PM Test Level, with specific Level information defined in “kernel/power/main.c” in the following format (the specific meaning is quite simple, and it is very clear when looking at the related code, so we won’t elaborate here):

    static const char * const pm_tests[__TEST_AFTER_LAST] = {
        [TEST_NONE]     = "none",
        [TEST_CORE]     = "core",
        [TEST_CPUS]     = "processors",
        [TEST_PLATFORM] = "platform",
        [TEST_DEVICES]  = "devices",
        [TEST_FREEZER]  = "freezer",
    };

3.2.4 /sys/power/wakeup_count

This interface is related only to the Sleep function, so it is controlled by the “CONFIG_PM_SLEEP” option (kernel/power/Kconfig). It exists to solve the synchronization problem between Sleep and Wakeup.

We know that after the system goes to sleep, it can be awakened by the retained Wakeup source. In today’s CPU architecture, waking up the system means waking up the CPU, and the only way to wake up the CPU is through an interrupt generated by the Wakeup source (referred to as Wakeup event in the kernel). The kernel must ensure that the behavior of Sleep/Wakeup operates correctly under various states, as follows:

▆ When the system is in sleep state, a Wakeup event occurs. At this point, the system should wake up directly. This is straightforward.

▆ When a Wakeup event occurs during the process of entering sleep, the system should abandon entering sleep.

This is harder to achieve. Suppose a Wakeup event occurs after user space writes to “/sys/power/state” but before the kernel performs the freeze operation. At that point user space programs may still be handling the event, or may have handled it only partially, yet the kernel assumes the event has been processed and does not abandon the sleep action.

As a result, a user space program that no longer wants to sleep after the Wakeup event is put to sleep anyway, and the system stays asleep until the next Wakeup event arrives.

To solve the above problem, the kernel provides a wakeup_count mechanism, in conjunction with “/sys/power/state”, to achieve synchronization during the Sleep process. The operational behavior of this mechanism is as follows:

▆ wakeup_count is used by the kernel to store the current count of Wakeup events that have occurred.

▆ Before writing to state to switch states, a user space program should first read wakeup_count and then write the value it obtained back to wakeup_count.

▆ The kernel will compare the written count with the current count. If they do not match, it indicates that a new Wakeup event has occurred between the read/write operations, and the kernel will return an error.

▆ If the user space program detects a write error, it cannot continue with subsequent actions and must handle the corresponding event and wait to read/write wakeup_count again.

▆ If the two counts match, the write succeeds and the kernel records the count at that moment as a snapshot. As the suspend action proceeds, the kernel checks the current count against the snapshot; if they differ, it terminates the suspend.

▆ If the user space program detects a successful write, it can continue writing to the state to initiate a state switch, which is safe at this point.
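The handshake in the list above can be sketched as a small user-space model. It does not touch the real sysfs file and is not the kernel implementation: struct wakeup_model and every function name are hypothetical, chosen only to mirror the steps described.

```c
#include <stdbool.h>

/* Simplified model of the kernel-side state; all names are hypothetical. */
struct wakeup_model {
    unsigned int count;        /* Wakeup events seen so far */
    unsigned int saved_count;  /* snapshot taken on a successful write */
    bool snapshot_valid;       /* true once a write has succeeded */
};

/* A Wakeup source reported an event. */
void model_wakeup_event(struct wakeup_model *m)
{
    m->count++;
}

/* User space reads wakeup_count. */
unsigned int model_read_count(const struct wakeup_model *m)
{
    return m->count;
}

/* User space writes the value back; fails if events arrived since the read. */
bool model_write_count(struct wakeup_model *m, unsigned int value)
{
    if (value != m->count) {
        m->snapshot_valid = false;
        return false;
    }
    m->saved_count = value;    /* record the snapshot */
    m->snapshot_valid = true;
    return true;
}

/* Checked while suspending: true means the suspend must be terminated. */
bool model_wakeup_pending(const struct wakeup_model *m)
{
    return m->snapshot_valid && m->count != m->saved_count;
}

/* User-space side: returns true when it is safe to go on and write state. */
bool try_prepare_suspend(struct wakeup_model *m)
{
    unsigned int seen = model_read_count(m);
    /* ... user space finishes handling outstanding events here ... */
    return model_write_count(m, seen);
}
```

In this model, an event that arrives between the read and the write makes the write fail, and an event that arrives after a successful write is caught later by model_wakeup_pending().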

3.2.6 /sys/power/image_size

This interface is also specific to STD. We know that the principle of STD is to save the current running context to the system’s disk (such as NAND Flash or hard disk) and then choose an appropriate method to power off or restart the system. Saving the context requires storage space, not only on the disk but also in memory for swapping or buffering.

This interface sets or reads an upper limit on the size of the image that STD creates, in bytes. The limit is best effort: the kernel tries to keep the image within it but does not guarantee this.
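Reading such a byte value from user space is a plain file read followed by a number parse. The sketch below assumes the file holds a single decimal number; read_size_attr() is a hypothetical helper name, and the path is a parameter so that it can be tried on any file.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read a single non-negative decimal value from a sysfs-style attribute
 * file. Returns 0 and stores the value in *out on success, -1 otherwise.
 */
int read_size_attr(const char *path, unsigned long long *out)
{
    char buf[64];
    char *end;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof buf, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);

    /* Reject signs and leading junk; strtoull() would accept "-5". */
    if (buf[0] < '0' || buf[0] > '9')
        return -1;

    errno = 0;
    unsigned long long value = strtoull(buf, &end, 10);
    if (errno != 0 || (*end != '\n' && *end != '\0'))
        return -1;

    *out = value;
    return 0;
}
```

For example, read_size_attr("/sys/power/image_size", &bytes) would fetch the current setting on a kernel built with STD support.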

3.2.7 /sys/power/reserved_size

reserved_size sets how much memory, in bytes, to reserve for the allocations that device drivers make in their ->freeze() and ->freeze_noirq() callbacks during STD, so that those allocations do not fail.

3.2.8 /sys/power/resume

This interface is also specific to STD. Normally, after rebooting, the kernel reads the image saved on the disk during the later initialization process and restores the system. This interface provides a method to manually read the image and restore the system from user space.

Typically, this is used while the system is running normally and another image needs to be loaded and executed.

3.2.9 debugfs/suspend_stats

This interface provides statistical information about the suspend process to user space in the form of debugfs, including: the number of successful attempts, the number of failures, the number of freeze failures, etc.
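A user-space tool can pick individual counters out of this file once it has been read into memory. The sketch below assumes the statistics are laid out as one "name: value" pair per line (for example "success: 20"); stats_lookup() is a hypothetical helper name.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find the line "<key>: <number>" in `text` and store the number in *out.
 * Returns 0 on success, -1 if the key is absent or has no numeric value.
 */
int stats_lookup(const char *text, const char *key, long *out)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        /* Skip any indentation in front of the key. */
        const char *p = line + strspn(line, " \t");

        /* Require the colon right after the key so "fail" != "failed_freeze". */
        if (strncmp(p, key, klen) == 0 && p[klen] == ':')
            return sscanf(p + klen + 1, "%ld", out) == 1 ? 0 : -1;

        line = strchr(line, '\n');
        if (line != NULL)
            line++;             /* step past the newline */
    }
    return -1;
}
```

Matching the colon immediately after the key keeps a short counter name from matching a longer one that begins with the same letters.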

3.2.10 /dev/snapshot

This interface is also specific to STD. It provides software STD operations to user space in the form of character devices. We will describe this in detail in subsequent articles.