Gregory is an embedded Linux, kernel and realtime engineer at Bootlin, which he joined in 2010. Since 2002, he has acquired vast on the field experience in porting and operating embedded Linux, in particular for industrial and transportation customers.
Storage space has become more and more affordable to a point that it is now possible to have multiple hard drives of dozens of terabytes in a single consumer-grade device. With a few 10 TiB hard drives and thanks to RAID technology, storage capacities that exceed 16 or 32 TiB can easily be reached and at a relatively low cost.
However, a number of consumer NAS systems used in the field today are still based on 32 bit ARM processors. The problem is that, with Linux on a 32 bit system, it’s only possible to address up to 16 TiB of storage space. This is still true even with the >ext4 filesystem, even though it uses 64 bit pointers.
We were lucky to have a customer contracting us to update older Large Page Support patches to a recent version of the Linux kernel. This set of patches are one way of overcoming this 16 TiB limitation for ARM 32-bit systems. Since updating this patch series was a non trivial task, we are happy to share the results of our efforts with the community, both through this blog post and through a patch series we posted to the Linux ARM kernel mailing list: ARM: Add support for large kernel page (from 8K to 64K).
How Large Page Support works
The 16 TiB limitation comes from the use of page->index which is a pgoff_t offset type corresponding to unsigned long. This limits us to a 32-bit page offsets, so with 4 KiB physical pages, we end up with a maximum of 16 TiB. A way to address this limitation is to use larger physical pages. We can reach 32 TiB with 8 KiB pages, 64 TiB with 16 KiB pages and up to 256 TiB with 64 KiB pages.
Before going further, the ARM32 Page Tables article from Linus Walleij is a good reference to understand how the Linux kernel deals with ARM32 page tables. In our case, we are only going to cover the non LPAE case. As explained there, the way the Linux kernel sees the page tables actually doesn’t match reality. First, the kernel deals with 4 levels of page tables while on hardware there are only 2 levels. In addition, while the ARM32 hardware stores only 256 PTEs in Page Tables, taking up only 1 KB, Linux optimizes things by storing in each 4 KB page two sets of 256 PTEs, and two sets of shadow PTEs that are used to store additional metadata needed by Linux about each page (such as the dirty and accessed/young bits). So, there is already some magic between what is presented to the Linux virtual memory management subsystem, and what is really programmed into the hardware page tables. To support large pages, the idea is to go further in this direction by emulating larger physical pages.
Our series (and especially patch 5: ARM: Add large kernel page support) proposes to pretend to have larger hardware pages. The ARM 32-bit architecture only supports 4 KiB or 64 KiB page sizes, but we would like to support intermediate values of 8 KiB, 16 KiB and 32 KiB as well. So what we do to support 8 KiB pages is that we tell Linux the hardware has 8 KiB pages, but in fact we simply use two consecutive 4 KiB pages at the hardware level that we manipulate and configure simultaneously. To support 16 KiB pages, we use 4 consecutive 4 KiB pages, for 32 KiB pages, we use 8 consecutive pages, etc. So really, we “emulate” having larger page sizes by grouping 2, 4 or 8 pages together. Adding this feature only required a few changes in the code, mainly dealing with ranges of pages every time we were dealing with a single page. Actually, most of the code in the series is about making it possible to modify the hard coded value of the hardware page size and fixing the assumptions associated to such a fixed value.
In addition to this emulated mechanism that we provide for 8 KiB, 16 KiB, 32 KiB and 64 KiB pages, we also added support for using real hardware 64 KiB pages as part of this patch series.
Overall the number of changes is very limited (271 lines added, 13 lines removed), and allows to use much larger storage devices. Here is the diffstat of the full patch series:
This patch series is running in production now on some NAS devices from a very popular NAS brand.
Limitations and alternatives
The submission of our patch series is recent but this feature has actually been running for years on many NAS systems in the field. Our new series is based on the original patchset, with the purpose of submitting it to the mainline kernel community. However, there is little chance that it will ever be merged into the mainline kernel.
The main drawback of this approach are large pages themselves: as each file in the page cache uses at least one page, the memory wasted increases as the size of the pages increases. For this reason, Linus Torvalds was against similar series proposed in the past.
To show how much memory is wasted, Arnd Bergmann ran some numbers to measure the page cache overhead for a typical set of files (Linux 5.7 kernel sources) for 5 different page sizes:
Page size (KiB)
page cache usage (MiB)
factor over 4K pages
We can see that while a factor of 1.18 is acceptable for 8 KiB pages, a 4.45 multiplier looks excessive with 64 KiB pages.
Actually, to make it possible to address large volumes on 32 bit ARM, another solution was pointed out during the review of our series. Instead of using larger pages which have an impact on the entire system, an alternative is to modify the way the filesystem addresses the memory by using 64 bits pgoff_t offsets. This has already been implemented in vendor kernels running in some NAS systems, but this has never been submitted to mainline developers.
The Embedded Linux Conference Europe edition 2019 took place a few weeks ago in Lyon, France, and no less than 7 engineers from Bootlin attended the conference. We would like to highlight a selection of talks that Bootlin engineers found interesting. We asked each of the 7 engineers who attended the event to pick one talk they liked, and make a small write-up about it. Of course, many other talks were interesting and what makes a talk interesting is very subjective!
Introduction to HyperBus Memory Devices, by Vignesh Raghavendra
Talk selected by Gregory Clement
Vignesh started his talk by presenting the HyperBus from the hardware point of view. It is a bus using 8 data lines, using either a single or a differential clock as well as a bi-directional data strobe. These last two features clearly indicate that the designers of this bus seek high bandwidth. Two types of memory are available: HyperRAM and HyperFlash and the talk focused on the second one.
The read throughput of the HyperFlash can reach 400MB/s, it is compatible with SPI flash and is an alternative to the octal SPI NOR flashes. Then Vignesh presented the transactions done on the bus for the flash, which is very similar to what is done on SPI bus. He also compares the traditional parallel CFI flashes to the HyperFlash. And finally he describes the 2 types of controllers: Dedicated HyperBus Controllers and Multi IO Serial controllers.
In the last part of the talk, Vignesh presented the recently add kernel features, and future improvements.
This talk was a good introduction to this new bus, covering the hardware parts as well as software support in the Linux kernel.
Open Source Graphics 101: Getting Started, by Boris Brezillon
Talk selected by Paul Kocialkowski
During this talk, Boris provides a comprehensive and accessible overview of the graphics stack that supports GPUs in systems based on the Linux kernel. He also provides insight about the inner workings and architecture of GPUs. Although Boris defines himself as a not (yet) experienced Graphics developer, his talk contains all the elements needed to get a clear first idea on the topic as he covers hardware, kernel and userspace aspects.
To begin with, he explains how the GPU pipeline is split into multiple stages that are needed one after the other to generate a final image from a set of 3D models. The first stage is geometry and involves operations on the vertices that compose the 3D models, followed by rasterization where the view of the 3D scene is materialized on a 2D viewport, producing the end image. He also presents the concept of shaders: they are small dedicated programs that run on the GPU to make each stage configurable and flexible in order to produce the exact wanted result.
He then provides details about how GPU architectures implement massive parallelization to provide efficient results and also details some of the pitfalls that can occur with this approach. After that, Boris presents how the main CPU interacts with the GPU, introducing the concept of a command stream to submit jobs to the GPU.
With all these concepts laid out comes the time for him to present how the software stack is organized to support GPUs. After a general overview, specific aspects are presented. This includes graphics APIs such as OpenGL and Vulkan (and how they follow distinct paradigms) but also covers topics related to Mesa internals such as intermediate representations or windowing system integration. Kernel aspects are not forgotten either and a rationale regarding the (unusual) kernel/userspace separation in place for GPUs is also provided to clarify prominent design choices.
This talk succeeds at providing an introduction to GPUs and 3D rendering that can be understood without specific graphics knowledge while also giving a good idea of how the supporting software stack is organized. It is highly recommended for anyone interested in learning more on these topics!
Learning the Linux Kernel Configuration Space: Results and Challenges, Matthieu Acher
Talk selected by Michael Opdenacker
TuxML (Linux and Machine Learning) is an open-source research project aiming at exploring the Linux kernel configuration space through machine learning. With more than 15,000 configuration parameters, the Linux kernel now has up to 106000 possible configurations. Compare this figure to 1080, the approximate number of atoms in the universe.
As it is not possible to test all such configurations (all the more as each takes about 10 minutes to compile), the goal of the project is to predict “interesting” configurations, that could expose distinct bugs.
Starting from random configurations (from make randconfig), they use statistical learning to eventually pinpoint sets of parameters causing build failures, and avoid testing configurations that are expected to fail. This way, TUXML can be more efficient in exploring the configuration space and find bugs.
That’s typically where researchers can help us engineers and Linux kernel contributors. You need a solid theoretical background in machine learning to process data efficiently.
On Linux 4.13, the research team has managed to explore more than 15,000 different configurations through more than 95,000 hours of computing, eventually to find (and fix) 16 bugs in Linux. Some of these bugs may not come as a surprise for experienced kernel developers, but some others could expose unexpected issues that a human user may not find spontaneously.
They are also trying to use their data to predict the impact of configuration parameters on kernel size, but it turns out that size is hard to predict. At least, they managed to find “influential” options, some expected ones, and some less expected ones, deserving further investigation.
This project looks definitely useful for improving the test coverage of the Linux kernel, by working smarter than trying to compile purely random configurations. Your help is needed for testing, investigating and fixing kernel bugs, and giving your feedback.
Sergio gave a talk about debugging with an interesting approach: he started by acknowledging that today, printk is very widely used to do serious debugging but that in some cases it would be much more efficient to use other tools. Indeed there are plenty of open source tools available out there so why don’t we use them?
He presented a table indicating, from his point of view, which tool would best fit a given problem and then enumerated a few tips and commands that everybody can use to understand what went wrong in their kernel, for instance after a panic.
Is addr2line the best way to avoid printk messages right after a panic? or maybe the Linux script faddr2line? or even GDB? Maybe you don’t have access to the panic trace yet, in this case you could be interested in looking at pstore or kdump?
Or maybe an issue will more efficiently be hunted with tracing, in this case Sergio shown several static and dynamic options: using tracepoints, kprobes, ftrace, and proposed many others.
Lock-ups and memory leaks are also covered in the slides (see below), but not in the video because unfortunately the 35 minutes slot allowed was not enough for Sergio to detail all these interesting debugging methods. We wish he had more time to give all his feedback around these underused -while powerful- tools!
Supporting Video (de)serializers in Linux: Challenges and Works in Progress
Talk selected by Thomas Petazzoni
Luca Ceresoli’s talk at this ELCE is a good example of an interesting talk, as it combines an introduction to new hardware, what is the status of the support for this hardware in Linux, and what are the challenges to overcome to complete the integration of the hardware support in the kernel, with some open discussion.
Luca’s talk was about the support for video serializers and deserializers, with a focus on camera support. Cameras are usually connected to a system-on-chip using a parallel interface, a MIPI CSI interface or some LVDS interface. However, these interfaces only work for very short distances between the system-on-chip and the camera, and may not work well in electromagnetically noisy environments. For such situations, there are some technical solutions that consists in serializing the camera stream in a fast robust link (typically a coax cable) and then deserialize it before it is captured by the SoC through a standard camera interface. This fast robust link of course transports the stream data itself, but can also transport control information (GPIO status, I2C bus to talk to the remote camera sensor).
There are two main technologies today implementing this: the GMSL technology from Maxim and the FDP-LinkIII technology from Texas Instruments. Luca’s focus is on the latter technology, since that’s what he has been working on for the past months.
After this hardware introduction, Luca gave a status of the different patch series that have been posted by various contributors (himself included) on the Linux kernel mailing lists: some preliminary support for GMSL has been posted by Kieran Bingham, and some preliminary support for FDP-LinkIII has been posted by Luca.
Luca then presented the ideal implementation to support these interfaces, but then quickly dived into the troubles and tribulations: there is no support for stream multiplexing in Video4Linux currently, there is no support for parts of a V4L pipeline going faulty, and there is no support for hotplugging in V4L.
Then, there are some challenges with how to handle the remote I2C bus offered by those serializers/deserializers. Since camera sensors often have the same I2C address, the serializers/deserializers often have some sort of “solution” to this: an I2C switch in the GMSL (de)serializers, and a translation table for I2C addresses in FDP-LinkIII (de)serializers. Luca discussed how these are currently supported in Linux.
At the end of the talk, quite a bit of discussion took place, both about the V4L issues and the I2C issues raised in Luca’s talk. Overall, it was a useful talk if you’re interested in this specific topic.
As someone not very familiar with the V4L2 Framework, I was pleasantly surprised that Hans’ talk was done in such a way that both experienced and beginner developers were able to follow.
Hans started with a quick introduction to the various concepts of video encoding and decoding that needs to be understood to follow the highly technical explanations of the current status of stateless and stateful codecs support.
Besides describing the technical challenge of implementing such support, Hans gave a good overview of the challenges that are faced by the community, focusing on the necessity of having good testing tools, such as the new vicodec driver.
He described the complexity of implementing support for Stateless decoders (where the hardware decoder doesn’t keep track of the state, this has to be done in software), and explained that the new Request API is a good step towards achieving such support, with 2 decoders already supported in the staging area.
Hans then explained the userspace APIs that are to be used when dealing with Stateless decoders, starting some interesting discussions along the way.
All in all, such a talk is a good example of how we can use events such as ELCE to both give good technical insight on existing frameworks, but also to trigger discussions about the ongoing and future work amongst the active developers and maintainers that are brought together by the event.
He starts by explaining the use case for building both FreeRTOS and a Linux system using the same build system, in this case OpenEmbedded.
He then shows the meta-freertos layer he developped to get OpenEmbedded to build FreeRTOS. The toolchain he used is fairly classic with GCC, binutils, gdb. The main difference is that newlib is used as the C library.
meta-freertos then depends on previous work that has been done, integrating a newlib and libgloss recipes in oe-core. The core of meta-freertos is then a class, freertos-app.bbclass, allowing to abstract many details allowing to build a FreeRTOS application and image for the target. A poky-freertos distribution configuration is also provided.
Alejandro then demoes multiple FreeRTOS applications.
Finally, he goes over multiconfig, the multiple configuration build dependencies, allowing OpenEmbedded to build an image using a configuration but depending on tasks using a different configuration. In other words, this allows to build a Linux system image after building a FreeRTOS application so it can be included in the image. This is very useful in the case Linux is running on the application processor and needs to load FreeRTOS on a smaller processor.
Authenticate and Encrypted Storage on Embedded Linux
Talk selected by Kamel Bouhara
Jan’s talk is introducing us to the current authentication and encryption methods that go on top of the Linux storage stack.
He started reminding us some basic crypto terminologies and then depicted all the existing technologies by the storage they fit into.
For block device storage dm-verity is a good choice to verify integrity of read-only filesystems and the verification is done on each node of a hash tree. For a file or application based verification fsverity is a more relevant tool as it allows on-demand verification.
On raw NAND devices, the integrity should be checked using the UBI filesystem associated to an HMAC or image signature authentication with a root key. This solution can be completed with fscrypt to encrypt specific data on the filesytem.
For the encryption stage, Jan mentioned the ecryptfs project, which is not maintained anymore and could be well replaced by fscrypt which allows to hold several keys in the same filesystem in a multi-user environment. It is therefore a good alternative to dm-crypt which is a block-based based encryption solution used on large block devices and it is not protected against replay attacks using old blocks.
For a TPM based authentication, he recommendeds using the kernel integrity subsystem called IMA/EVM, which is a layout on top of other filesystem, the project Keylime is good example for this: https://sched.co/TLCY.
Jan shared some good practices on how to manage the Master key storage like not using a password based key, if possible use hardware capabilities like ARM TrustZone and OPTEE and use of a verified boot and key wrapping for the master key.
After Toulouse and Orange, Lyon is the third city chosen for opening a Bootlin office. Since September 1st of this year (2017), Alexandre Belloni and Grégory Clement have been working more precisely in Oullins close to the subway and the train station. It is the first step to make the Lyon team grow, with the opportunity to welcome interns and engineers.
Their new desks are already crowded by many boards running our favorite system.
At the beginning of October a Kickstarter campaign was launched to fund the development of a low-cost board based on one of the latest Marvell ARM 64-bit SoC: the Armada 3700. While being under $50, the board would allow using most of the Armada 3700 features:
We pushed the initial support of this SoC to the mainline Linux kernel 6 months ago, and it landed in Linux 4.6. There are still a number of hardware features that are not yet supported in the mainline kernel, but we are actively working on it. As an example, support for the PCIe controller was merged in Linux 4.8, released last Sunday. According to the Kickstarter page the first boards would be delivered in January 2017 and by this time we hope to have managed to push more support for this SoC to the mainline Linux kernel.
We have been working on the mainline support of the Marvell SoC for 4 years and we are glad to see at last the first board under $50 using this SoC. We hope it will help expanding the open source community around this SoC family and will bring more contributions to the Marvell EBU SoCs.
The real time page I wrote for Atmel was finally released on the Linux4Sam Atmel Wiki. The purpose of this page was to help new comers to use real time features with Atmel CPUs and to present the state of the real time support.
Here are some figures associated to this work:
On this page I present the results of more than 300 hours of benchmarks!
During the setup and the tuning tests ran for more than 600 hours.
Analysis and formatting took a few dozen hours of work.
The benchmarks have been run on 3 boards, 3 flavors of Linux (vanilla, PREEMPT-RT patches, Xenomai co-kernel approach), and 2 kinds of tests (timer-based and GPIO-based)