Linux Power Management (4) – Power Management Interface

Original: https://mp.weixin.qq.com/s/JX5PS7Cxs9mRzzl96cwzRg

1. Introduction

A significant portion of Linux power management deals with functionalities such as Hibernate, Suspend, and Runtime PM. These functionalities are based on a similar logic, known as the “Power Management Interface.” The code for this interface is implemented in files such as “include/linux/pm.h” and “drivers/base/power/main.c.” Its main functions are: to define Device PM-related callback functions for various drivers to implement; and to provide a unified PM operation function for the PM core logic to call.

Therefore, before analyzing functionalities like Hibernate, Suspend, and Runtime PM, it is essential to familiarize oneself with the PM Interface, which is the primary purpose of this article.

2. Device PM Callbacks

In a system, devices are the most numerous and also the largest consumers of power, making device power management a core aspect of Linux power management. The most critical operation in device power management is: at the appropriate time (e.g., when not in use or paused), to set the device to a reasonable state (e.g., off or asleep). This is the purpose of device PM callbacks: to define a unified method for devices to enter similar states in a coordinated manner at specific times (one can imagine the command “one two one” during military training).

In older kernel versions, these PM callbacks were scattered across large data structures in the device model, such as suspend, suspend_late, resume, and resume_late in struct bus_type, and suspend and resume in struct device_driver/struct class/struct device_type. Clearly, this lacked good encapsulation, as the increasing complexity of devices meant that simple suspend and resume could no longer meet power management needs, necessitating the expansion of PM callbacks, which would inevitably require changes to these data structures.

Thus, in newer kernel versions, these callbacks have been unified into a single data structure—struct dev_pm_ops—where the upper-level data structures only need to include this structure. This way, if PM callbacks need to be added or modified, the upper-level structures do not need to be changed (this is a vivid embodiment of abstraction and encapsulation in software design, elegantly akin to art). Of course, to maintain compatibility with older designs, the aforementioned suspend/resume type callbacks are still retained, but their use is discouraged, and this article will not cover them.

Every Linux engineer familiar with older kernel versions will likely be taken aback by the complexity of struct dev_pm_ops! Just take a look:

    /* include/linux/pm.h, line 276 in linux-3.10.29 */
    struct dev_pm_ops {
            int (*prepare)(struct device *dev);
            void (*complete)(struct device *dev);
            int (*suspend)(struct device *dev);
            int (*resume)(struct device *dev);
            int (*freeze)(struct device *dev);
            int (*thaw)(struct device *dev);
            int (*poweroff)(struct device *dev);
            int (*restore)(struct device *dev);
            int (*suspend_late)(struct device *dev);
            int (*resume_early)(struct device *dev);
            int (*freeze_late)(struct device *dev);
            int (*thaw_early)(struct device *dev);
            int (*poweroff_late)(struct device *dev);
            int (*restore_early)(struct device *dev);
            int (*suspend_noirq)(struct device *dev);
            int (*resume_noirq)(struct device *dev);
            int (*freeze_noirq)(struct device *dev);
            int (*thaw_noirq)(struct device *dev);
            int (*poweroff_noirq)(struct device *dev);
            int (*restore_noirq)(struct device *dev);
            int (*runtime_suspend)(struct device *dev);
            int (*runtime_resume)(struct device *dev);
            int (*runtime_idle)(struct device *dev);
    };

From the perspective of the Linux PM Core, these callbacks are not complex, as the PM Core’s job is to call the corresponding callbacks at specific power management stages. For example, during the suspend/resume process, the PM Core calls them in this order: “prepare -> suspend -> suspend_late -> suspend_noirq -> [system sleeps until wakeup] -> resume_noirq -> resume_early -> resume -> complete.”
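The call order above can be modeled in a few lines of user-space C. This is only a sketch, not kernel code: struct device is a hypothetical stand-in with a trace buffer, the demo_* callbacks and model_suspend_resume_cycle are invented names, and error rollback is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct device;

/* Cut-down model of struct dev_pm_ops: suspend/resume path only. */
struct dev_pm_ops {
    int  (*prepare)(struct device *dev);
    void (*complete)(struct device *dev);
    int  (*suspend)(struct device *dev);
    int  (*resume)(struct device *dev);
    int  (*suspend_late)(struct device *dev);
    int  (*resume_early)(struct device *dev);
    int  (*suspend_noirq)(struct device *dev);
    int  (*resume_noirq)(struct device *dev);
};

/* Hypothetical stand-in for the kernel's struct device. */
struct device {
    const struct dev_pm_ops *pm;
    char trace[128];            /* order in which callbacks ran */
};

/* Append a stage name to the device's trace, separated by '>'. */
static int note(struct device *dev, const char *stage)
{
    size_t room = sizeof(dev->trace) - strlen(dev->trace) - 1;

    if (dev->trace[0] != '\0' && room > 0) {
        strncat(dev->trace, ">", room);
        room--;
    }
    strncat(dev->trace, stage, room);
    return 0;
}

static int demo_prepare(struct device *d)       { return note(d, "prepare"); }
static int demo_suspend(struct device *d)       { return note(d, "suspend"); }
static int demo_suspend_late(struct device *d)  { return note(d, "suspend_late"); }
static int demo_suspend_noirq(struct device *d) { return note(d, "suspend_noirq"); }
static int demo_resume_noirq(struct device *d)  { return note(d, "resume_noirq"); }
static int demo_resume_early(struct device *d)  { return note(d, "resume_early"); }
static int demo_resume(struct device *d)        { return note(d, "resume"); }
static void demo_complete(struct device *d)     { note(d, "complete"); }

const struct dev_pm_ops demo_pm_ops = {
    .prepare = demo_prepare,             .complete = demo_complete,
    .suspend = demo_suspend,             .resume = demo_resume,
    .suspend_late = demo_suspend_late,   .resume_early = demo_resume_early,
    .suspend_noirq = demo_suspend_noirq, .resume_noirq = demo_resume_noirq,
};

/* Model of the PM core: run every stage in order, skipping NULL callbacks.
 * Simplification: the first error is returned as-is, with no rollback. */
int model_suspend_resume_cycle(struct device *dev)
{
    const struct dev_pm_ops *ops;
    size_t i;

    assert(dev != NULL && dev->pm != NULL);
    ops = dev->pm;

    int (*stages[])(struct device *) = {
        ops->prepare, ops->suspend, ops->suspend_late, ops->suspend_noirq,
        /* the system would sleep here until a wakeup event */
        ops->resume_noirq, ops->resume_early, ops->resume,
    };

    for (i = 0; i < sizeof(stages) / sizeof(stages[0]); i++) {
        if (stages[i]) {
            int error = stages[i](dev);
            if (error)
                return error;
        }
    }
    if (ops->complete)
        ops->complete(dev);
    return 0;
}
```

Running the cycle on a device whose pm pointer is &demo_pm_ops leaves the eight stage names in dev.trace, in the order listed above.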


| Stage | Call Order | Key Operations | Notes |
|---|---|---|---|
| prepare | 1 | Notify processes/kernel threads to freeze; pause user space processes. | Irreversible operations begin; new process creation is denied. |
| suspend | 2 | The device driver's .suspend() callback is called; disable device functionalities (e.g., network, screen). | The driver must save the state; failure may lead to rollback. |
| suspend_late | 3 | Core system components suspend (e.g., clock, interrupt controller); power off non-critical devices. | Ensure dependency order (e.g., turn off peripherals before the clock). |
| suspend_noirq | 4 | Disable all interrupts (IRQ); save the last hardware state (e.g., CPU registers). | Last chance to handle hardware state, after which the CPU stops executing instructions. |
| Hardware Sleep | - | CPU enters low power state (e.g., ACPI S3); only wake-up sources remain (e.g., power button, RTC). | Resume from resume_noirq. |
| resume_noirq | 5 | Restore the interrupt controller; initialize critical hardware (e.g., CPU, memory controller). | Must be strictly symmetrical with suspend_noirq to avoid inconsistent hardware states. |
| resume_early | 6 | Enable clocks and basic peripherals; unfreeze core components. | Does not involve user space, only kernel-level recovery. |
| resume | 7 | The device driver's .resume() callback is called; restore device functionalities (e.g., GPU, storage). | The driver must restore the state prior to suspend. |
| complete | 8 | Unfreeze user space processes; send PM_POST_SUSPEND notification. | User interaction resumes (e.g., screen lights up). |

However, since these callbacks need to be implemented by specific device drivers, it requires driver engineers to clearly understand the usage scenarios of these callbacks, whether they need to be implemented, and how to implement them, which is the complexity of struct dev_pm_ops. The Linux kernel documentation for struct dev_pm_ops is already quite detailed, but understanding the usage scenarios of each callback and the underlying reasoning is not an easy task.
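To make the driver side more concrete, here is a minimal user-space sketch of a driver that implements only the callbacks it needs. Everything in it is hypothetical: the "foo" device, its single modeled register, and the reduced struct dev_pm_ops are illustrations, not real kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

struct device;

struct dev_pm_ops {                 /* only the members used below */
    int (*suspend)(struct device *dev);
    int (*resume)(struct device *dev);
    int (*suspend_noirq)(struct device *dev);
    int (*resume_noirq)(struct device *dev);
};

/* Hypothetical device: one config "register" and a power switch. */
struct device {
    unsigned int reg;        /* live hardware register (modeled) */
    unsigned int saved_reg;  /* copy kept while the device is off */
    int powered;
};

/* Save state first, then power down: the register is lost once off. */
int foo_suspend(struct device *dev)
{
    assert(dev != NULL);
    dev->saved_reg = dev->reg;
    dev->reg = 0;
    dev->powered = 0;
    return 0;
}

/* Power up first, then restore the state saved by foo_suspend(). */
int foo_resume(struct device *dev)
{
    assert(dev != NULL);
    dev->powered = 1;
    dev->reg = dev->saved_reg;
    return 0;
}

/* Callbacks the driver does not need simply stay NULL. */
const struct dev_pm_ops foo_pm_ops = {
    .suspend = foo_suspend,
    .resume  = foo_resume,
};
```

The point of the sketch is the shape, not the details: a simple device typically fills in a matching suspend/resume pair and leaves the late/noirq variants unset unless its hardware requires them.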

The freezing mentioned above exists in both Linux and Android, but each has its own mechanism. The two are compared below.

| Dimension | Linux PM Freezing | Android CachedAppOptimizer Freezing |
|---|---|---|
| Design Goal | Prepare for system sleep (Suspend-to-RAM/Disk) to ensure processes do not interfere with hardware state saving. | Optimize memory and power consumption; freeze background applications (Cached Apps) to reduce resource usage. |
| Trigger Timing | Triggered only when the system enters sleep (e.g., echo mem > /sys/power/state). | Managed dynamically by system services (ActivityManagerService), triggered based on memory pressure or policy. |
| Frozen Objects | All user space processes (including critical processes). | Only background application processes (lower priority cached processes). |
| Freezing Depth | Completely suspend process scheduling (process enters TASK_UNINTERRUPTIBLE state). | Partial freezing (e.g., stop executing code but retain process structure). |
| Recovery Method | All processes are uniformly unfrozen after the system wakes up. | Unfreeze on demand (e.g., user switches back to the application or broadcast triggers). |
| Kernel Dependency | Depends on the kernel’s freezer subsystem (CONFIG_FREEZER). | Depends on Android-specific Binder and Process management mechanisms. |

| Technology | Linux PM Freezing | Android CachedAppOptimizer |
|---|---|---|
| Process Scheduling Control | ✓ (PF_FROZEN) | ✓ (SIGSTOP / cgroup) |
| Binder Freezing | | ✓ (Android specific) |
| Memory Management | | ✓ (onTrimMemory event-driven) |
| User Space Notification | | ✓ (via AMS broadcast) |

3. Representation of Device PM Callbacks in the Device Model

Many data structures in the Linux device model contain the struct dev_pm_ops variable, as follows:

    struct bus_type {
            ...
            const struct dev_pm_ops *pm;
            ...
    };

    struct device_driver {
            ...
            const struct dev_pm_ops *pm;
            ...
    };

    struct class {
            ...
            const struct dev_pm_ops *pm;
            ...
    };

    struct device_type {
            ...
            const struct dev_pm_ops *pm;
    };

    struct device {
            ...
            struct dev_pm_info      power;
            struct dev_pm_domain    *pm_domain;
            ...
    };

The pm pointer in structures like bus_type, device_driver, class, and device_type is relatively easy to understand, similar to the old suspend/resume callbacks. We will focus on the power and pm_domain variables in the device structure.

◆ Power Variable

power is a variable of type struct dev_pm_info, also defined in “include/linux/pm.h.” The power variable mainly stores PM-related states, such as the current power_state, whether it can be woken up, whether preparation has been completed, whether suspension has been completed, etc.

◆ pm_domain Pointer

In the current kernel, the struct dev_pm_domain structure only contains a struct dev_pm_ops ops.

The so-called PM Domain (power domain) is specific to the “device.” Structures like bus_type, device_driver, class, and device_type essentially represent device drivers, and it is reasonable for device drivers to be responsible for power management operations. However, in the kernel, for various reasons, it is allowed for devices without drivers to exist, so how is power management for these devices handled? It is achieved through the device’s power domain.

4. Operation Functions of Device PM Callbacks

While defining the device PM callbacks data structure, the kernel also defines a large number of operation APIs for convenience. These APIs are divided into two categories.

◆ General auxiliary APIs that directly call the corresponding callback of the driver bound to the specified device’s pm pointer, as follows:

    extern int pm_generic_prepare(struct device *dev);
    extern int pm_generic_suspend_late(struct device *dev);
    extern int pm_generic_suspend_noirq(struct device *dev);
    extern int pm_generic_suspend(struct device *dev);
    extern int pm_generic_resume_early(struct device *dev);
    extern int pm_generic_resume_noirq(struct device *dev);
    extern int pm_generic_resume(struct device *dev);
    extern int pm_generic_freeze_noirq(struct device *dev);
    extern int pm_generic_freeze_late(struct device *dev);
    extern int pm_generic_freeze(struct device *dev);
    extern int pm_generic_thaw_noirq(struct device *dev);
    extern int pm_generic_thaw_early(struct device *dev);
    extern int pm_generic_thaw(struct device *dev);
    extern int pm_generic_restore_noirq(struct device *dev);
    extern int pm_generic_restore_early(struct device *dev);
    extern int pm_generic_restore(struct device *dev);
    extern int pm_generic_poweroff_noirq(struct device *dev);
    extern int pm_generic_poweroff_late(struct device *dev);
    extern int pm_generic_poweroff(struct device *dev);
    extern void pm_generic_complete(struct device *dev);

For example, pm_generic_prepare checks if the dev->driver->pm->prepare interface exists, and if it does, it directly calls it and returns the result.
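The check-then-call logic just described fits in a few lines. The following is a user-space sketch under reduced, hypothetical struct definitions; model_pm_generic_prepare and busy_prepare are names made up for this illustration, and returning 0 when no callback exists is an assumption of the model.

```c
#include <assert.h>
#include <stddef.h>

struct device;

/* Reduced stand-ins for the kernel structures, for illustration only. */
struct dev_pm_ops    { int (*prepare)(struct device *dev); };
struct device_driver { const struct dev_pm_ops *pm; };
struct device        { struct device_driver *driver; };

/* Returns the driver's prepare result, or 0 when any link is missing. */
int model_pm_generic_prepare(struct device *dev)
{
    struct device_driver *drv;

    assert(dev != NULL);
    drv = dev->driver;
    if (drv && drv->pm && drv->pm->prepare)
        return drv->pm->prepare(dev);
    return 0;
}

/* Hypothetical driver callback that refuses to prepare. */
int busy_prepare(struct device *dev)
{
    (void)dev;
    return -16;   /* the value of -EBUSY on Linux */
}
```

A device with no bound driver therefore counts as successfully prepared in this model, while a driver that returns an error has that error passed straight through to the caller.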

◆ APIs related to overall power management behavior, aimed at combining various independent power management actions into a simpler function, as follows:

    #ifdef CONFIG_PM_SLEEP
    extern void device_pm_lock(void);
    extern void dpm_resume_start(pm_message_t state);
    extern void dpm_resume_end(pm_message_t state);
    extern void dpm_resume(pm_message_t state);
    extern void dpm_complete(pm_message_t state);

    extern void device_pm_unlock(void);
    extern int dpm_suspend_end(pm_message_t state);
    extern int dpm_suspend_start(pm_message_t state);
    extern int dpm_suspend(pm_message_t state);
    extern int dpm_prepare(pm_message_t state);

    extern void __suspend_report_result(const char *function, void *fn, int ret);

    #define suspend_report_result(fn, ret)                                  \
            do {                                                            \
                    __suspend_report_result(__func__, fn, ret);             \
            } while (0)

    extern int device_pm_wait_for_dev(struct device *sub, struct device *dev);
    extern void dpm_for_each_dev(void *data, void (*fn)(struct device *, void *));

The functions and actions of these APIs are as follows.

dpm_prepare executes all devices’ “->prepare() callback(s)” with the following internal actions:

1) Traverse dpm_list, sequentially retrieving device pointers hanging on that list.

【Note 1: When adding a device (device_add), the device model calls the device_pm_add interface to add the device to the global list dpm_list for subsequent traversal operations.】

2) Call the internal interface device_prepare to perform the actual prepare action. This interface will return the execution result.

3) If execution fails, print an error message.

4) If execution succeeds, set dev->power.is_prepared (a member of the struct dev_pm_info variable mentioned above) to TRUE, indicating that the device has been prepared. At the same time, add the device to the dpm_prepared_list (this list stores all devices that are in the prepared state).
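The four steps above can be sketched as a user-space model. It is a simplification under stated assumptions: an array stands in for the kernel's linked lists, a per-device prepare pointer stands in for device_prepare, the list a device sits on is tracked with a hypothetical enum, and the walk stops at the first failure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum pm_list { ON_DPM_LIST, ON_DPM_PREPARED_LIST };

struct device {
    const char *name;
    int (*prepare)(struct device *dev);  /* stands in for device_prepare() */
    int is_prepared;                     /* models dev->power.is_prepared */
    enum pm_list list;                   /* which list the device is on */
};

/* Walk the devices in order; assumed to stop at the first failure. */
int model_dpm_prepare(struct device **dpm_list, size_t count)
{
    size_t i;

    assert(dpm_list != NULL || count == 0);
    for (i = 0; i < count; i++) {
        struct device *dev = dpm_list[i];
        int error = dev->prepare ? dev->prepare(dev) : 0;

        if (error) {
            fprintf(stderr, "%s: prepare failed: %d\n", dev->name, error);
            return error;
        }
        dev->is_prepared = 1;
        dev->list = ON_DPM_PREPARED_LIST;
    }
    return 0;
}

/* Hypothetical callbacks used to exercise the model. */
int prepare_ok(struct device *dev)   { (void)dev; return 0; }
int prepare_fail(struct device *dev) { (void)dev; return -5; }
```

With three devices where the second one fails, only the first ends up marked as prepared and moved to the prepared list; the remaining devices stay where they were.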

The internal interface device_prepare’s actions are as follows:

1) Determine if the device is a syscore device based on dev->power.syscore. If so, return directly (as syscore devices are handled separately).

2) During the prepare phase, call the pm_runtime_get_noresume interface to disable Runtime suspend functionality to avoid issues with not being able to wake up normally due to Runtime suspend. This functionality will be re-enabled during completion.

【Note 2: The implementation of pm_runtime_get_noresume is straightforward; it simply increments the power variable's reference count (dev->power.usage_count), and Runtime PM will determine whether to enable Runtime PM functionality based on whether this count is greater than zero.】

3) Call the device_may_wakeup interface to determine if the device has a wakeup source (dev->power.wakeup) and whether it is allowed to wake up (dev->power.can_wakeup), recording whether the device is a wakeup path (stored in dev->power.wakeup_path).

【Note 3: The wake-up functionality of a device refers to the ability of the system to be awakened by certain devices while in low power states (e.g., suspend, hibernate). This is part of the power management process.】

4) Obtain the callback function for prepare based on priority. Since the device model has multiple levels such as bus, driver, and device, the prepare interface may be implemented at any level. The priority order means that as long as a higher priority level registers prepare, it will be used preferentially, and lower priority prepares will not be used. The priority order is: dev->pm_domain->ops, dev->type->pm, dev->class->pm, dev->bus->pm, dev->driver->pm (this priority order also applies to other callbacks).

5) If a valid prepare function is obtained, call it and return the result.
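The priority lookup in steps 4 and 5 can be illustrated with a short user-space sketch. The struct definitions are reduced, hypothetical stand-ins, model_pick_prepare is an invented name, and the rule "first level with a non-NULL prepare wins" follows the description above; the kernel's exact fall-through behavior may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

struct device;
typedef int (*pm_prepare_t)(struct device *dev);

/* Reduced stand-ins for the device model structures. */
struct dev_pm_ops    { pm_prepare_t prepare; };
struct dev_pm_domain { struct dev_pm_ops ops; };
struct device_type   { const struct dev_pm_ops *pm; };
struct class         { const struct dev_pm_ops *pm; };
struct bus_type      { const struct dev_pm_ops *pm; };
struct device_driver { const struct dev_pm_ops *pm; };

struct device {
    struct dev_pm_domain *pm_domain;
    const struct device_type *type;
    struct class *class;
    struct bus_type *bus;
    struct device_driver *driver;
};

/* Return the highest-priority prepare callback, or NULL if none exists.
 * Order: pm_domain, type, class, bus, driver. */
pm_prepare_t model_pick_prepare(const struct device *dev)
{
    assert(dev != NULL);
    if (dev->pm_domain && dev->pm_domain->ops.prepare)
        return dev->pm_domain->ops.prepare;
    if (dev->type && dev->type->pm && dev->type->pm->prepare)
        return dev->type->pm->prepare;
    if (dev->class && dev->class->pm && dev->class->pm->prepare)
        return dev->class->pm->prepare;
    if (dev->bus && dev->bus->pm && dev->bus->pm->prepare)
        return dev->bus->pm->prepare;
    if (dev->driver && dev->driver->pm && dev->driver->pm->prepare)
        return dev->driver->pm->prepare;
    return NULL;
}

/* Hypothetical callbacks; distinct return values identify the level. */
int domain_prepare(struct device *dev) { (void)dev; return 1; }
int bus_prepare(struct device *dev)    { (void)dev; return 4; }
int driver_prepare(struct device *dev) { (void)dev; return 5; }
```

In this model a device that has both a bus-level and a driver-level prepare uses the bus one, and attaching a PM domain with its own prepare overrides both.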

dpm_suspend executes all devices’ “->suspend() callback(s)” with internal actions similar to dpm_prepare:

1) Traverse dpm_prepared_list (filled by dpm_prepare) from its tail, retrieving the device pointers hanging on that list in turn.

2) Call the internal interface device_suspend to perform the actual suspend action. This interface will return the execution result.

3) If suspend fails, record the device's name in the failed-device array of struct suspend_stats and print an error message.

4) Finally, move the device from dpm_prepared_list to dpm_suspended_list.

The internal interface device_suspend’s actions are similar to those of device_prepare, and will not be described further here.

dpm_suspend_start sequentially executes dpm_prepare and dpm_suspend actions.

dpm_suspend_end sequentially executes all devices’ “->suspend_late() callback(s)” and all devices’ “->suspend_noirq() callback(s).” The actions are similar to those described above and will not be elaborated further.
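How these two wrappers chain the individual phases can be sketched as follows. This is a user-space model, not the kernel implementation: each device-wide phase is represented by a hypothetical function pointer, all names are invented for the illustration, and any rollback the kernel performs on failure is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*phase_fn)(void);

/* One function per device-wide phase (hypothetical stand-ins for
 * dpm_prepare, dpm_suspend and the late/noirq passes). */
struct pm_phases {
    phase_fn prepare;
    phase_fn suspend;
    phase_fn suspend_late;
    phase_fn suspend_noirq;
};

/* Run two phases back to back; skip the second if the first fails. */
static int run_in_order(phase_fn first, phase_fn second)
{
    int error;

    assert(first != NULL && second != NULL);
    error = first();
    return error ? error : second();
}

int model_dpm_suspend_start(const struct pm_phases *p)
{
    return run_in_order(p->prepare, p->suspend);
}

int model_dpm_suspend_end(const struct pm_phases *p)
{
    return run_in_order(p->suspend_late, p->suspend_noirq);
}

/* Demo phases that append one letter each to a global log. */
char phase_log[64];

static int log_phase(const char *name, int ret)
{
    strncat(phase_log, name, sizeof(phase_log) - strlen(phase_log) - 1);
    return ret;
}

int ok_prepare(void)       { return log_phase("P", 0); }
int bad_prepare(void)      { return log_phase("p", -1); }
int ok_suspend(void)       { return log_phase("S", 0); }
int ok_suspend_late(void)  { return log_phase("L", 0); }
int ok_suspend_noirq(void) { return log_phase("N", 0); }
```

Calling model_dpm_suspend_start and then model_dpm_suspend_end runs the four phases in the order prepare, suspend, suspend_late, suspend_noirq; if the prepare phase fails, the suspend phase is never entered.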

dpm_resume, dpm_complete, dpm_resume_start, and dpm_resume_end are wake-up actions in the power management process, similar to the dpm_suspend_xxx series of interfaces. They will not be described further.
