Linux 7.0 Speeds Up Reclaiming File-Backed Large Folios By 50-75%
#Infrastructure


Chips Reporter
3 min read

Alibaba engineer Baolin Wang's batched unmapping patches for large folios deliver substantial performance gains in Linux 7.0: a 75% improvement on a 32-core Arm64 server and a 50%+ improvement on x86 systems.

Linux 7.0 introduces a significant memory management optimization that dramatically accelerates the reclaiming of file-backed large folios. The improvement comes from a series of patches developed by Alibaba engineer Baolin Wang, which implement batched unmapping for large folios - a change that shows performance gains of 50-75% depending on the architecture.

The Problem With Sequential Reference Checking

The core issue addressed by these patches is how Linux handled reference checking for large folios before this change: the folio_referenced_one() function checks the young flag for each page table entry (PTE) sequentially. While this approach works adequately for single pages and small folios, it becomes a significant bottleneck for large folios, which can map hundreds of PTEs each.
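To make the pattern concrete, the sketch below models the sequential check in plain user-space C. It is not the kernel's folio_referenced_one(); the struct, function names, and folio size here are invented purely to illustrate the one-check-per-PTE loop described above.

```c
/*
 * Minimal user-space model of the per-PTE pattern described above.
 * NOT kernel code: fake_pte and count_and_clear_young_sequential()
 * are invented names for illustration only.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define PTES_PER_LARGE_FOLIO 512   /* e.g. a 2MB folio of 4KB pages */

struct fake_pte {
    bool young;                    /* models the hardware "accessed" bit */
};

/* Sequential check: one test (and one clear) per PTE, so the cost
 * grows directly with the number of pages in the folio. */
static int count_and_clear_young_sequential(struct fake_pte *ptes, size_t nr)
{
    int referenced = 0;
    for (size_t i = 0; i < nr; i++) {
        if (ptes[i].young) {
            ptes[i].young = false;
            referenced++;
        }
    }
    return referenced;
}

int main(void)
{
    struct fake_pte ptes[PTES_PER_LARGE_FOLIO] = {0};

    ptes[0].young = true;          /* pretend a couple of pages were touched */
    ptes[100].young = true;

    printf("referenced PTEs: %d\n",
           count_and_clear_young_sequential(ptes, PTES_PER_LARGE_FOLIO));
    return 0;
}
```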

Baolin Wang identified this inefficiency during performance analysis of clean file-backed large folio reclamation, where the folio_referenced() function emerged as a major performance hotspot. Because the reference check walks one PTE at a time, the work grows linearly with folio size, making large folios disproportionately expensive to reclaim.

Arm Architecture's Partial Solution

Interestingly, the Arm64 architecture already had a partial optimization in place: it supports contiguous page table entries, and the kernel can batch operations over these CONT_PTE_SIZE-sized ranges. However, such batches cover only portions of a large folio, leaving significant performance potential untapped.

Wang's insight was to extend this batched operation concept to encompass entire large folios, even when they exceed the contiguous range limitations. This approach leverages the fact that modern memory management often deals with large, contiguous memory allocations that can be processed more efficiently as batches rather than individual entries.
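Some rough arithmetic shows why whole-folio batching matters. The sketch below assumes the common arm64 configuration of 4KB base pages, 64KB contiguous-PTE ranges, and a 2MB large folio; these are typical values chosen for illustration, not figures taken from the patch series.

```c
/*
 * Illustrative arithmetic only: how many batch operations it takes to cover
 * one large folio when batches are limited to a contiguous-PTE range versus
 * when the whole folio is handled as a single batch. The sizes below are
 * typical arm64/4K values, assumed for the example.
 */
#include <stdio.h>

int main(void)
{
    const long page_size     = 4096;             /* 4KB base pages */
    const long cont_pte_size = 16 * page_size;   /* 64KB contiguous-PTE range */
    const long folio_size    = 2 * 1024 * 1024;  /* a 2MB large folio */

    long ptes_per_folio   = folio_size / page_size;
    long chunks_per_folio = folio_size / cont_pte_size;

    printf("per-PTE checks        : %ld operations per folio\n", ptes_per_folio);
    printf("CONT_PTE-sized batches: %ld operations per folio\n", chunks_per_folio);
    printf("whole-folio batch     : 1 operation per folio\n");
    return 0;
}
```

Batching only within contiguous-PTE ranges still leaves dozens of operations per 2MB folio; treating the entire folio as a single batch removes that remaining per-chunk overhead.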

Performance Results

The performance improvements from these patches are substantial and architecture-dependent:

  • Arm64 32-core server: 75% performance improvement
  • x86 machine: 50%+ performance improvement

These tests involved allocating 10GB of clean file-backed folios using mmap() within a memory cgroup, then attempting to reclaim 8GB of these file-backed folios through the memory.reclaim interface. The batched unmapping approach showed dramatic improvements over the sequential reference checking method.
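A rough reproduction of that setup can be driven directly against the cgroup v2 interface. The sketch below is an assumption-laden outline rather than the exact test from the patch series: the data-file path and cgroup path are hypothetical, the file must already exist at 10GB or larger on a filesystem that can use large folios, and the process is assumed to already be running inside the target memory cgroup.

```c
/*
 * Rough sketch of the benchmark setup described above. Paths are
 * hypothetical, error handling is minimal, and a 64-bit system is assumed.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define DATA_FILE "/mnt/test/10g.dat"                   /* hypothetical path */
#define CGROUP    "/sys/fs/cgroup/test/memory.reclaim"  /* hypothetical cgroup */
#define MAP_SIZE  (10UL * 1024 * 1024 * 1024)           /* 10GB mapping */

int main(void)
{
    int fd = open(DATA_FILE, O_RDONLY);
    if (fd < 0) { perror("open data file"); return 1; }

    /* Map the file read-only and touch every page so the page cache holds
     * clean (never-dirtied) file-backed memory. */
    unsigned char *p = mmap(NULL, MAP_SIZE, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    volatile unsigned char sum = 0;
    long page = sysconf(_SC_PAGESIZE);
    for (size_t off = 0; off < MAP_SIZE; off += (size_t)page)
        sum += p[off];
    (void)sum;

    /* Ask the kernel to reclaim 8GB from this cgroup (cgroup v2 interface). */
    int rfd = open(CGROUP, O_WRONLY);
    if (rfd < 0) { perror("open memory.reclaim"); return 1; }
    if (write(rfd, "8G", strlen("8G")) < 0)
        perror("write memory.reclaim");

    close(rfd);
    munmap(p, MAP_SIZE);
    close(fd);
    return 0;
}
```

Writing "8G" to memory.reclaim asks the kernel to proactively reclaim that much memory from the cgroup; since only clean file-backed pages are mapped, this largely exercises the reclaim path these patches optimize.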

Why This Matters for Modern Linux

The timing of this optimization is particularly relevant given the increasing adoption of folios throughout the Linux kernel. Folios represent a more flexible and efficient way to manage memory compared to the traditional page-based approach. As more kernel subsystems transition to using folios, optimizations that improve folio operations have compounding benefits across the entire system.

Memory reclamation is a critical operation in Linux, particularly in containerized environments and systems with memory pressure. Faster reclamation means better memory utilization, reduced latency during memory pressure events, and improved overall system responsiveness. For cloud providers and large-scale deployments - including Alibaba's own infrastructure - these improvements can translate to significant operational benefits.

Technical Implementation

The batched unmapping implementation works by grouping reference checks and unmapping operations for multiple PTEs within a large folio. Rather than iterating through each PTE individually, the kernel can process them in batches, reducing per-entry function-call overhead and improving memory access patterns.
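As a toy illustration of that difference, the model below pays a fixed "setup" cost once per batch instead of once per PTE. The names and the setup function are invented for the example; the real kernel change batches rmap-walk, young-bit, and TLB-flush work, not a counter.

```c
/*
 * Toy model of per-PTE versus per-batch overhead. NOT kernel code:
 * expensive_setup() stands in for the per-call work (locking, walk setup,
 * flush) that the batched approach amortizes over a whole large folio.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define NR_PTES 512

struct fake_pte { bool young; };

static unsigned long setup_calls;   /* counts how often the "expensive" setup runs */

static void expensive_setup(void)
{
    setup_calls++;
}

/* One setup per PTE: models the sequential path. */
static int check_sequential(struct fake_pte *ptes, size_t nr)
{
    int referenced = 0;
    for (size_t i = 0; i < nr; i++) {
        expensive_setup();
        if (ptes[i].young) { ptes[i].young = false; referenced++; }
    }
    return referenced;
}

/* One setup per batch: models the batched path for a whole large folio. */
static int check_batched(struct fake_pte *ptes, size_t nr)
{
    int referenced = 0;
    expensive_setup();
    for (size_t i = 0; i < nr; i++)
        if (ptes[i].young) { ptes[i].young = false; referenced++; }
    return referenced;
}

int main(void)
{
    struct fake_pte ptes[NR_PTES] = {0};

    setup_calls = 0;
    check_sequential(ptes, NR_PTES);
    printf("sequential: %lu setup calls\n", setup_calls);

    setup_calls = 0;
    check_batched(ptes, NR_PTES);
    printf("batched   : %lu setup calls\n", setup_calls);
    return 0;
}
```

Running it shows 512 setup calls for the sequential model versus one for the batched model, which is the kind of per-element overhead reduction the patches target, alongside the architecture-specific PTE handling discussed above.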

The reported gain was largest on the 32-core Arm64 test system, which also benefits from arm64's contiguous-PTE support. The x86 improvement, while lower at 50%+, still represents a substantial gain for a core memory management operation.

Looking Forward

These patches, merged during the Linux 7.0 merge window, demonstrate how targeted optimizations to fundamental kernel operations can yield significant performance dividends. As Linux continues to evolve toward more efficient memory management primitives like folios, we can expect to see more optimizations that leverage batch operations and reduce per-element overhead.

The work also highlights the valuable contributions from industry engineers working on real-world performance challenges. Alibaba's involvement in kernel development continues to produce practical improvements that benefit the entire Linux ecosystem.

For system administrators and developers working with memory-intensive applications or large-scale deployments, Linux 7.0's improved folio reclamation performance offers tangible benefits in terms of memory efficiency and system responsiveness under load.
