The need for large page support on 32 bit ARM
Storage space has become more and more affordable to a point that it is now possible to have multiple hard drives of dozens of terabytes in a single consumer-grade device. With a few 10 TiB hard drives and thanks to RAID technology, storage capacities that exceed 16 or 32 TiB can easily be reached and at a relatively low cost.
However, a number of consumer NAS systems used in the field today are still based on 32 bit ARM processors. The problem is that, with Linux on a 32 bit system, it’s only possible to address up to 16 TiB of storage space. This is still true even with the ext4 filesystem, even though it uses 64 bit pointers.
We were lucky to have a customer contracting us to update older Large Page Support patches to a recent version of the Linux kernel. This set of patches are one way of overcoming this 16 TiB limitation for ARM 32-bit systems. Since updating this patch series was a non trivial task, we are happy to share the results of our efforts with the community, both through this blog post and through a patch series we posted to the Linux ARM kernel mailing list: ARM: Add support for large kernel page (from 8K to 64K).
How Large Page Support works
The 16 TiB limitation comes from the use of
page->index which is a
pgoff_t offset type corresponding to
unsigned long. This limits us to a 32-bit page offsets, so with 4 KiB physical pages, we end up with a maximum of 16 TiB. A way to address this limitation is to use larger physical pages. We can reach 32 TiB with 8 KiB pages, 64 TiB with 16 KiB pages and up to 256 TiB with 64 KiB pages.
Before going further, the ARM32 Page Tables article from Linus Walleij is a good reference to understand how the Linux kernel deals with ARM32 page tables. In our case, we are only going to cover the non LPAE case. As explained there, the way the Linux kernel sees the page tables actually doesn’t match reality. First, the kernel deals with 4 levels of page tables while on hardware there are only 2 levels. In addition, while the ARM32 hardware stores only 256 PTEs in Page Tables, taking up only 1 KB, Linux optimizes things by storing in each 4 KB page two sets of 256 PTEs, and two sets of shadow PTEs that are used to store additional metadata needed by Linux about each page (such as the dirty and accessed/young bits). So, there is already some magic between what is presented to the Linux virtual memory management subsystem, and what is really programmed into the hardware page tables. To support large pages, the idea is to go further in this direction by emulating larger physical pages.
Our series (and especially patch 5: ARM: Add large kernel page support) proposes to pretend to have larger hardware pages. The ARM 32-bit architecture only supports 4 KiB or 64 KiB page sizes, but we would like to support intermediate values of 8 KiB, 16 KiB and 32 KiB as well. So what we do to support 8 KiB pages is that we tell Linux the hardware has 8 KiB pages, but in fact we simply use two consecutive 4 KiB pages at the hardware level that we manipulate and configure simultaneously. To support 16 KiB pages, we use 4 consecutive 4 KiB pages, for 32 KiB pages, we use 8 consecutive pages, etc. So really, we “emulate” having larger page sizes by grouping 2, 4 or 8 pages together. Adding this feature only required a few changes in the code, mainly dealing with ranges of pages every time we were dealing with a single page. Actually, most of the code in the series is about making it possible to modify the hard coded value of the hardware page size and fixing the assumptions associated to such a fixed value.
In addition to this emulated mechanism that we provide for 8 KiB, 16 KiB, 32 KiB and 64 KiB pages, we also added support for using real hardware 64 KiB pages as part of this patch series.
Overall the number of changes is very limited (271 lines added, 13 lines removed), and allows to use much larger storage devices. Here is the diffstat of the full patch series:
arch/arm/include/asm/elf.h | 2 +- arch/arm/include/asm/fixmap.h | 3 +- arch/arm/include/asm/page.h | 12 ++++ arch/arm/include/asm/pgtable-2level-hwdef.h | 8 +++ arch/arm/include/asm/pgtable-2level.h | 6 +- arch/arm/include/asm/pgtable.h | 4 ++ arch/arm/include/asm/shmparam.h | 4 ++ arch/arm/include/asm/tlbflush.h | 21 +++++- arch/arm/kernel/entry-common.S | 13 ++++ arch/arm/kernel/traps.c | 10 +++ arch/arm/mm/Kconfig | 72 +++++++++++++++++++++ arch/arm/mm/fault.c | 19 ++++++ arch/arm/mm/mmu.c | 22 ++++++- arch/arm/mm/pgd.c | 2 + arch/arm/mm/proc-v7-2level.S | 72 ++++++++++++++++++++- arch/arm/mm/tlb-v7.S | 14 +++- 16 files changed, 271 insertions(+), 13 deletions(-)
This patch series is running in production now on some NAS devices from a very popular NAS brand.
Limitations and alternatives
The submission of our patch series is recent but this feature has actually been running for years on many NAS systems in the field. Our new series is based on the original patchset, with the purpose of submitting it to the mainline kernel community. However, there is little chance that it will ever be merged into the mainline kernel.
The main drawback of this approach are large pages themselves: as each file in the page cache uses at least one page, the memory wasted increases as the size of the pages increases. For this reason, Linus Torvalds was against similar series proposed in the past.
To show how much memory is wasted, Arnd Bergmann ran some numbers to measure the page cache overhead for a typical set of files (Linux 5.7 kernel sources) for 5 different page sizes:
|Page size (KiB)||4||8||16||32||64|
|page cache usage (MiB)||1,023.26||1,209.54||1,628.39||2,557.31||4,550.88|
|factor over 4K pages||1.00x||1.18x||1.59x||2.50x||4.45x|
We can see that while a factor of 1.18 is acceptable for 8 KiB pages, a 4.45 multiplier looks excessive with 64 KiB pages.
Actually, to make it possible to address large volumes on 32 bit ARM, another solution was pointed out during the review of our series. Instead of using larger pages which have an impact on the entire system, an alternative is to modify the way the filesystem addresses the memory by using 64 bits
pgoff_t offsets. This has already been implemented in vendor kernels running in some NAS systems, but this has never been submitted to mainline developers.