This article analyzes the current community and product status of Large Folios, a hot project in the Linux kernel, and predicts future trends. The technical content discussed here comes from contributions by various companies, including Google, OPPO, ARM, Nvidia, Samsung, Huawei, and Alibaba.
Authors | Barry Song, Yu Zhao
Editor | Meng Yidan
Produced by | CSDN (ID: CSDNnews)
In the Linux kernel, a folio can contain one page or multiple pages. When a folio contains multiple pages, we call it a large folio, generally known as 大页 ("large page") in the Chinese community. Adopting large folios can potentially bring numerous benefits, such as:
1. Reduced TLB misses. For example, much hardware supports PMD mapping, allowing a 2MB large folio to occupy only one TLB entry; some hardware supports contiguous PTE mapping, such as ARM64, where CONT-PTE lets 16 consecutive pages occupy just one TLB entry.
2. Fewer page faults. For instance, if do_anonymous_page() directly requests a large folio and maps it via CONT-PTE after a page fault occurs on one PTE, the remaining 15 PTEs will no longer take page faults.
3. A smaller LRU and cheaper memory reclamation. Reclaiming one large folio incurs a lower reverse-mapping cost than reclaiming the same memory as many small folios individually; theoretically this applies to try_to_unmap_one() as well.
4. Opportunities to compress/decompress at a larger granularity in zRAM/zsmalloc, reducing CPU utilization during compression/decompression and improving the compression ratio. For example, compressing an entire 64KiB large folio is significantly more advantageous than compressing it as 16 small 4KiB folios.

In the Linux kernel's overall memory management, large folios coexist with small folios (which have only one page). For example, the folios on an LRU list may be large or small; the memory in a process's VMA may be a mix of large and small folios; and different offsets in a file's pagecache may correspond to either small or large folios.
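To make the distinction concrete, here is a minimal kernel-style sketch (not from the article) using the folio helpers folio_test_large(), folio_order(), and folio_nr_pages() from <linux/mm.h>:

```c
#include <linux/mm.h>

/* Print whether a folio is large and how many pages it covers. */
static void inspect_folio(struct folio *folio)
{
	if (folio_test_large(folio)) {
		/* a large folio covers 2^order physically contiguous pages */
		pr_info("large folio: order=%u nr_pages=%ld\n",
			folio_order(folio), folio_nr_pages(folio));
	} else {
		/* order-0: exactly one base page */
		pr_info("small folio: a single page\n");
	}
}
```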
File Page Large Folios

The Linux community has developed multiple file systems that support large folios for file pages. These file systems inform the page cache layer through mapping_set_large_folios() that they support large folios:
afs
bcachefs
erofs (non-compressed files)
xfs
Meanwhile, the page cache layer is aware of this and, when mapping_large_folio_support() is true, allows large folios to be allocated to fill the pagecache's xarray:
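A minimal sketch of how the two sides fit together; the filesystem name myfs and both wrapper functions are hypothetical, while mapping_set_large_folios() and mapping_large_folio_support() are the real pagemap helpers:

```c
#include <linux/fs.h>
#include <linux/pagemap.h>

/* Filesystem side: declare large folio support once per inode,
 * typically when the inode's address_space is set up. */
static void myfs_init_mapping(struct inode *inode)
{
	mapping_set_large_folios(inode->i_mapping);
}

/* Pagecache side (simplified): only allocate a large folio to fill
 * the xarray when the mapping has opted in. */
static bool myfs_may_use_large_folio(struct address_space *mapping)
{
	return mapping_large_folio_support(mapping);
}
```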
Currently, the file systems that support file-page large folios are still very limited, so large folios cannot yet be used in many industries; for example, f2fs and erofs compressed files, both widely used in the mobile industry, lack support. We can see that Zhang Yi from Huawei is completing a patchset, "ext4: use iomap for regular file's buffered IO path and enable large folio", seeking iomap and large folio support for ext4. The performance data provided by Zhang Yi demonstrates, to some extent, the benefits of file systems supporting large folios:
Anonymous Page Large Folios

In the community, Ryan Roberts from ARM is the main initiator of this project and one of the main contributors to the related patchsets. Several anonymous-page patchset topics are currently in flight: some have been merged, some are iterating in Andrew Morton's mm tree, and some are still under community discussion or in their infancy.

1. Multi-size THP for anonymous memory, contributed by Ryan Roberts (ARM) [2]

This patchset allows anonymous memory to request PTE-mapped large folios of multiple sizes when page faults occur. The kernel's original THP mainly targeted the PMD-mapped 2MiB size; now that various sizes are supported, we refer to multi-size THP as mTHP. The /sys/kernel/mm/transparent_hugepage directory now contains multiple hugepages-<size> subdirectories:
For example, you can enable 64KiB large folios by writing "always" (or "madvise") to /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled:
Then, when a page fault occurs, do_anonymous_page() can request a 64KiB mTHP and set all 16 PTEs at once through set_ptes():
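A heavily simplified sketch of that flow, not the real do_anonymous_page(): the hypothetical fault_in_64k() below omits the per-order fallback loop, locking, and rmap/LRU bookkeeping, and assumes a ~6.9-era vma_alloc_folio() signature:

```c
#include <linux/mm.h>
#include <linux/pgtable.h>
#include <linux/sizes.h>

static vm_fault_t fault_in_64k(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	/* align the fault address down to a 64KiB boundary in the VMA */
	unsigned long addr = ALIGN_DOWN(vmf->address, SZ_64K);
	struct folio *folio;
	pte_t entry;

	/* order-4 folio = 16 pages = 64KiB on a 4KiB-page system */
	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 4, vma, addr, true);
	if (!folio)
		return VM_FAULT_OOM; /* real code falls back to smaller orders */

	entry = mk_pte(&folio->page, vma->vm_page_prot);
	/* one call installs all 16 PTEs, so the other 15 never fault;
	 * the real code first repositions vmf->pte to the block's first PTE */
	set_ptes(vma->vm_mm, addr, vmf->pte, entry, folio_nr_pages(folio));
	return 0;
}
```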
The remaining 15 PTEs will no longer take page faults. Ryan's patchset keeps mTHP ABI-compatible with the original THP; for example, the existing MADV_HUGEPAGE and MADV_NOHUGEPAGE still apply to mTHP.

2. Transparent contiguous PTEs for user mappings, contributed by Ryan Roberts (ARM) [3]

This patchset mainly enables mTHP to automatically utilize ARM64's CONT-PTE: if the PFNs behind 16 naturally aligned PTEs are physically contiguous, the CONT bit is set so that they occupy only one TLB entry. The exciting part of Ryan's patchset is that the core mm layer does not need to be aware of CONT-PTE at all (not every hardware ARCH has this optimization): the PTE-related APIs remain fully compatible with mm, while the ARM64 arch implementation automatically adds or removes the CONT bit. For example, if 16 PTEs initially meet the CONT conditions and someone then unmaps one of them, or mprotect changes the attributes of some of them so that CONT is no longer satisfied, contpte_try_unfold(), called from set_ptes(), automatically removes the CONT bit:
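Conceptually, the arm64 side wraps the generic helpers roughly as follows; this is an illustrative sketch, with contpte_try_fold()/contpte_try_unfold() and __ptep_get()/__set_ptes() named after the patchset, while the real logic handles many more cases:

```c
/* Illustrative sketch of the arm64 approach (not the actual code):
 * core mm keeps calling set_ptes() as usual; the CONT bit is managed
 * entirely below this line. */
static void set_ptes(struct mm_struct *mm, unsigned long addr,
		     pte_t *ptep, pte_t pte, unsigned int nr)
{
	/* modifying PTEs inside an existing CONT block? unfold first,
	 * i.e. clear the CONT bit before the block is changed */
	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));

	__set_ptes(mm, addr, ptep, pte, nr);

	/* if 16 naturally aligned PTEs now map 16 contiguous PFNs,
	 * fold them: set the CONT bit so they share one TLB entry */
	contpte_try_fold(mm, addr, ptep, pte);
}
```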
The adoption of CONT-PTE has effectively improved the performance of some benchmarks, such as kernel compilation:
3. Swap-out mTHP without splitting, contributed by Ryan Roberts (ARM) [4]

When reclaiming memory in vmscan.c, this patchset no longer splits an mTHP into small folios (unless the large folio is already on the _deferred_list, which indicates it is probably partially unmapped); instead, it requests multiple swap slots as a whole and writes the whole folio to the swapfile. There is a problem here, however: when add_to_swap() asks for nr_pages contiguous swap slots, the swapfile may have become fragmented, the allocation can fail, and the code still has to fall back to splitting:
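A condensed sketch of that decision, not the actual vmscan.c code; add_to_swap() and split_folio_to_list() are real kernel functions, but the error handling here is simplified:

```c
#include <linux/huge_mm.h>
#include "swap.h"	/* internal mm/ header declaring add_to_swap() */

/* Try to swap out a (possibly large) folio without splitting it. */
static bool try_swap_out_whole(struct folio *folio, struct list_head *list)
{
	/* asks for folio_nr_pages() contiguous, aligned swap slots */
	if (add_to_swap(folio))
		return true;

	/* swapfile too fragmented for a batch of slots: fall back */
	if (split_folio_to_list(folio, list))
		return false;	/* split failed; folio stays unreclaimed */

	/* folio is now order-0; a single slot is much easier to find */
	return add_to_swap(folio);
}
```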
I believe swapfile fragmentation will be an important topic for the community in the future. Chris Li (Google) has some thoughts in Swap Abstraction "the pony" [5], and more discussion may take place at LSF/MM/BPF in Salt Lake City in May 2024.

4. mm: support large folios swap-in, contributed by Chuanhua Han (OPPO) and Barry Song (OPPO consultant) [6]

This patchset aims to let do_swap_page() also swap memory in directly in the form of large folios, reducing page faults on the do_swap_page() path. Just as importantly, if the swap-in path does not support mTHP, Ryan Roberts' earlier work means an mTHP can be swapped out as a whole but come back as something other than mTHP. In scenarios such as Android and embedded systems, where swap-out happens frequently, losing the advantages of mTHP overnight because of a swap-out is quite unreasonable. Theoretically, there are three possible paths for mTHP support on the swap-in path:
a. Directly hitting a large folio in the swapcache;
b. Swap-in on the SWP_SYNCHRONOUS_IO path for synchronous devices;
c. Swap-in on the swapin_readahead() path for asynchronous devices, or when __swap_count(entry) != 1.
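For path b, the gating condition already exists in today's do_swap_page(); the check below, shown as a standalone helper for clarity, mirrors it:

```c
#include <linux/swap.h>

/* Path b applies when the backing device is synchronous (e.g. zRAM)
 * and we are the only mapper of this swap entry. */
static bool swapin_may_use_sync_path(swp_entry_t entry)
{
	struct swap_info_struct *si = swp_swap_info(entry);

	return data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
	       __swap_count(entry) == 1;
}
```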
Currently, the patchset targets paths a and b for mobile and embedded scenarios that use zRAM, and support for path c is expected to be developed next. The part likely to be merged earliest is path a's handling of refault cases for large folio swap-in [7].

5. mTHP-friendly compression in zsmalloc and zram based on multi-pages, contributed by Tangquan Zheng (OPPO) [8]

The idea of this patchset is to compress and decompress anonymous memory at a larger granularity during swap-out/swap-in, which can greatly reduce CPU utilization and improve the compression ratio. The cover letter presents data showing that, for the same raw data, compressing in 64KiB units rather than 4KiB units takes significantly less time and produces a noticeably smaller result. Concretely, if a 64KiB mTHP is handed to zRAM, it is compressed directly as one 64KiB unit; a 128KiB mTHP is decomposed into two 64KiB units for compression (without Tangquan's work, 128KiB would be compressed as 32 separate 4KiB pages), as in the sketch below.
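A sketch of that decomposition rule; zram_compress_unit() is a hypothetical stand-in for the patchset's real zRAM changes:

```c
#include <linux/mm.h>
#include <linux/sizes.h>

#define ZRAM_COMP_UNIT	SZ_64K

/* hypothetical helper: compress one multi-page chunk of the folio
 * as a single buffer */
int zram_compress_unit(struct folio *folio, size_t offset, size_t len);

/* Compress a large folio in 64KiB units rather than 4KiB pages:
 * a 64KiB mTHP needs one call and a 128KiB mTHP two calls, instead
 * of 16 or 32 per-page calls. */
static int zram_write_large_folio(struct folio *folio)
{
	size_t off, size = folio_size(folio);
	int err;

	for (off = 0; off < size; off += ZRAM_COMP_UNIT) {
		err = zram_compress_unit(folio, off, ZRAM_COMP_UNIT);
		if (err)
			return err;
	}
	return 0;
}
```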
The aforementioned works by Ryan, Chuanhua, Barry, and others make swap-out and swap-in operate at large folio granularity, providing a practical foundation for Tangquan's work.

6. TAO: THP Allocator Optimizations, contributed by Yu Zhao (Google) [9]

This patchset (also a topic at LSF/MM/eBPF) aims to address the metadata cost of large folios and the fragmentation commonly encountered when allocating them. After running complex applications for a long time, the buddy allocator tends to fragment because of non-movable pages, making it hard to keep providing contiguous physical memory. If over 90% of large folio requests fall back to 4KB pages, the mTHP work described above essentially cannot deliver its benefits.

TAO (also the English transliteration of the Chinese word 道) is based on the idea that the 4KB page is a relic from decades ago that no longer matches contemporary high-performance hardware and user-space software; 4KB pages exist mainly for backward compatibility. The design of the operating system's memory management should therefore focus on optimizing large folios while remaining ABI-compatible with 4KB pages and other kernel components (such as SLAB). By analogy, 4KB pages are like the DMA zone, which exists so that devices from the 1980s can continue to function. Based on this idea, memory can be abstracted into two policy zones: a backward-compatible 4KB zone and a large folio zone better suited to contemporary software and hardware. The former primarily serves 4KB allocations but can also serve large folio allocations without guarantees; the latter serves only large folio allocations and guarantees a minimum THP coverage.

The advantage of this design is its convenience. Specifically, TAO integrates well with MGLRU, enabling targeted reclamation for both 4KB pages and large folios: 4KB allocations only need to reclaim from the 4KB zone; large folio allocations reclaim first from the large folio zone, and if that cannot satisfy the allocation, they can also reclaim from the 4KB zone while performing compaction.

TAO also naturally extends HVO (HugeTLB Vmemmap Optimization) to THP, reducing the struct page overhead of a 2MB THP to one-eighth of what it was.

TAO's concluding section (see the link above) presents an interesting new concept: thinking about THP in terms of fungibility in finance (the property that one unit of an asset can stand in for another in settling a debt). For example, if the user of a 2MB THP cannot fully exploit its value, memory management should exchange that 2MB THP for 512 non-contiguous 4KB pages. This process is called THP shattering. It looks similar to the existing THP splitting, but its essence is "stealing the beams and replacing the pillars": the data is migrated away so that the original THP is preserved, unsplit, for users with genuine needs. The concept can also be applied to future 1GB THP. Once a THP has been split, the existing collapse mechanism must allocate a new THP and copy data into it. For 2MB THP, the allocation and copy may not be a significant problem; for a future 1GB THP, both would be unacceptable.
Therefore, the only feasible solution is THP recovery: keep the not-yet-reallocated pages of the split 1GB THP in place, copy the data of the already-reallocated pages out to separate 4KB pages, and then return the original 1GB physical area to THP use. For 2MB THP, the four combinations above can be summarized in a 2×2 matrix:

              destruction    construction
in place      splitting      recovery
by copying    shattering     collapse
7. THP_SWAP support for ARM64 SoC with MTE, contributed by Barry Song (OPPO consultant) [10]

This patchset addresses saving and restoring ARM64 MTE tags for large folios during whole-folio swap-out and swap-in, allowing ARM64 hardware with MTE to also benefit from mTHP's whole-folio swap-out and swap-in.

8. mm: add per-order mTHP alloc and swpout counters, contributed by Barry Song (OPPO consultant) [11]

Now that mTHP has developed this far, counting and debugging facilities have become essential; otherwise the whole of mTHP remains an opaque black box to users. The patchset Barry is currently contributing implements two sets of counters:

1) per-order mTHP allocation success and failure counts, to show whether mTHP is genuinely still effective in the system and whether buddy fragmentation is causing mTHP allocation failures;
2) per-order mTHP SWPOUT and FALLBACK counts, to show whether the swap partition is fragmented, making it hard to allocate the contiguous swap slots needed for mTHP swap-out.

The patchset adds a stats directory under each size's sysfs directory, /sys/kernel/mm/transparent_hugepage/hugepages-<size>/stats, to present the counts:
9. Split a folio to any lower order folios, contributed by Zi Yan (Nvidia) [12]

Previously, large folios could only be split into order-0 small folios; Zi Yan's patchset allows splitting them to any lower order. The patchset provides a debugfs interface that accepts a pid, a virtual address range, and a target order, after which the kernel splits the folios in the specified range to the designated order.
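A userspace sketch of driving that debugfs interface; treat the exact "pid,start,end,order" write format as an assumption to verify against the patch, and the pid/addresses below are made up:

```c
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/kernel/debug/split_huge_pages", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* ask the kernel to split pid 1234's folios in
	 * [0x7000000000, 0x7000200000) down to order-2 (16KiB) folios */
	fprintf(f, "1234,0x7000000000,0x7000200000,2\n");
	fclose(f);
	return 0;
}
```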
Zi Yan has demonstrated an application scenario: in Pankaj Raghav (Samsung)'s "enable bs > ps in XFS" patchset [13], this split-to-non-order-0 work can be put to effective use:
Pankaj Raghav's work supports block sizes greater than the page size in XFS; with Zi Yan's work, splitting a large folio still yields large folios (of lower order, but satisfying the mapping_min_folio_order requirement).

10. Support multi-size THP numa balancing, contributed by Baolin Wang (Alibaba) [14]

Baolin's patchset extends memory migration under NUMA balancing to mTHP, allowing mTHP to be scanned and migrated. Since mTHP is larger than a 4KiB small folio, it is theoretically more prone to false sharing, and frequent migration can cause memory to ping-pong between NUMA nodes. Algorithmically, therefore, mTHP temporarily borrows the two-stage filter mechanism of PMD-mapped THP. With Baolin's patchset, autonuma-benchmark performance improves significantly:
11. mm/madvise: enhance lazyfreeing with mTHP in madvise_free, contributed by kernel enthusiast Lance Yang [15]
This patchset enables the MADV_FREE mechanism to stop splitting mTHP folios. Previously, MADV_FREE had to split large folios; now it does not, significantly improving the speed of MADV_FREE calls on large folios (the author also believes this will ultimately speed up the reclamation of lazyfree folios on the LRU list):
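For context, MADV_FREE itself is the long-standing userspace API shown below (standard Linux, nothing patchset-specific); what Lance's patchset changes is only that the kernel no longer splits the backing large folio when handling it:

```c
#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64 * 1024;	/* one 64KiB mTHP worth of memory */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	p[0] = 1;	/* fault it in; may now be backed by one large folio */

	/* mark the range disposable: the kernel may reclaim it lazily
	 * without writing it to swap; contents are lost if reclaimed */
	if (madvise(p, len, MADV_FREE))
		return 1;
	return 0;
}
```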
The Real Products of Large Folios

As of this writing, mature commercial deployments of mTHP are still rare in the community. However, before the community mTHP project matured, OPPO implemented dynamic large pages in kernels 4.19, 5.10, 5.15, and 6.1, and shipped them in a large number of phones in 2023. At the 2023 China Linux Kernel Developers Conference, Chuanhua Han (OPPO) presented the software architecture and benefits of OPPO's dynamic large page project [16]. Compared with the community project, which supports multiple mTHP sizes, the dynamic large pages deployed in OPPO phones mainly target the 64KiB size, which can utilize CONT-PTE. The following diagram illustrates a typical do_anonymous_page() page-fault flow:
Chuanhua presented the software architecture diagram of OPPO’s dynamic large pages at CLK2023:
This architecture diagram has several highlights:

1. Comprehensive modifications to do_anonymous_page, do_wp_page, do_swap_page, and the swap-out path, meaning the solution supports large folio allocation at page-fault time, CoW at large folio granularity, and swap-out/swap-in at large folio granularity;
2. A large page pool: pooling amortizes the overhead of large folio allocation and provides a high allocation success rate;
3. Dual LRU: large folios and small folios live on separate LRUs rather than being mixed together, so both can be reclaimed efficiently and neither blocks the other (for example, if you urgently need a large folio but the tail of the LRU holds 100 small folios, reclaiming those 100 still yields no large folio);
4. zsmalloc and zRAM support large and small folios simultaneously, enjoying the higher compression ratio and lower CPU utilization of large folios.

The talk also presented the benefit data of OPPO phones adopting dynamic large pages. Benchmarks:
User experience:
The Future of Large Folios

Several predictions about the future:

1. More file systems will support large folio mapping.
2. A practical mechanism similar to TAO is needed to guarantee a high success rate for large folio allocation.
3. Large folio swap-in needs to be supported in mainline.
4. Large folios will enable more parallelism, for example in combination with hardware compression/decompression offload; a single large folio may be offloaded for fast multi-threaded compression and decompression.
5. zswap may also need large folio support in the future; currently, zswap does not support mTHP.
6. Swap fragmentation, or storing a swapped-out mTHP in non-contiguous swap slots. Currently, swapping out an mTHP as a whole requires nr_pages contiguous and naturally aligned swap slots.
7. Balancing mTHP's performance gains against the potential increase in memory waste from fragmentation. Because mTHP is coarser-grained than small folios, it may allocate memory that is never accessed. However, Tangquan's zRAM/zsmalloc work also shows another possibility: large folios are not necessarily a source of waste and can even become a source of memory savings, echoing the principle of mutual generation and mutual restraint.
8. Properly handling the fragmentation waste and performance loss that can result from user space partially unmapping a large folio. Since user space typically only understands the base page size, it may not know that the underlying memory is already a large folio, and may munmap, mprotect, or madvise in ways that are not aligned with the large folio. For example, if user space munmaps the first 60KiB of a 64KiB large folio, the remaining 4KiB may pin the entire 64KiB large folio for a long time.

Author Introduction:

Barry Song: a long-time frontline Linux kernel developer and OPPO consultant; author of projects such as per-numa CMA, SCHED_CLUSTER, and ARM64 BATCHED_UNMAP_TLB_FLUSH.
Yu Zhao: a well-known developer in the Linux kernel community and Google Staff Software Engineer; author of projects such as Multi-Gen LRU and POSIX_FADV_NOREUSE.

References
[1] https://lwn.net/Articles/956575/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://lore.kernel.org/all/[email protected]/
[4] https://lore.kernel.org/linux-mm/[email protected]/
[5] https://lore.kernel.org/linux-mm/CAF8kJuMQ7qBZqdHHS52jRyA-ETTfHnPv+V9ChaBsJ_q_G801Lw@mail.gmail.com/
[6] https://lore.kernel.org/linux-mm/[email protected]/
[7] https://lore.kernel.org/linux-mm/[email protected]/
[8] https://lore.kernel.org/linux-mm/[email protected]/
[9] https://lore.kernel.org/all/[email protected]/
[10] https://lore.kernel.org/linux-mm/[email protected]/
[11] https://lore.kernel.org/linux-mm/[email protected]/
[12] https://lore.kernel.org/all/[email protected]/
[13] https://lore.kernel.org/linux-mm/[email protected]/
[14] https://lore.kernel.org/all/[email protected]/
[15] https://lore.kernel.org/all/[email protected]/
[16] https://github.com/ChinaLinuxKernel/CLK2023/blob/main/%E5%88%86%E8%AE%BA%E5%9D%9B1%EF%BC%88%E5%86%85%E5%AD%98%E7%AE%A1%E7%90%86%EF%BC%89/8%20%20%E5%8A%A8%E6%80%81%E5%A4%A7%E9%A1%B5%EF%BC%9A%E5%9F%BA%E4%BA%8EARM64%20contiguous%20PTE%E7%9A%8464KB%20HugePageLarge%20Folios%E2%80%94%E2%80%94%E9%9F%A9%E4%BC%A0%E5%8D%8E.pptx