Linux 5.18 released, Bootlin contributions inside

Linux 5.18 has been released a bit over a week ago. As usual, we recommend the resources provided by LWN.net (part 1 and part 2) and KernelNewbies.org to get an overall view of the major features and improvements of this Linux kernel release.

Bootlin engineers have collectively contributed 80 patches to this Linux kernel release, making us the 28th contributing company according to these statistics.

  • Alexandre Belloni, as the RTC subsystem maintainer, continued to improve the overall subsystem, and migrate drivers to new features and mechanisms introduced in the core RTC subsystem
  • Clément Léger contributed a new RTC driver that allows to use the RTC exposed by the OP-TEE Trusted Execution Environment, as well as a few other fixes
  • Hervé Codina and Luca Ceresoli contributed some fixes: Hervé to the dw-edma dmaengine driver, and Luca to the Rockchip RK3308 pinctrl driver
  • Miquèl Raynal, as the MTD subsystem co-maintainer, contributed the remainder of his work to generalize the support of ECC handling, and allow both parallel and SPI NAND to use either software ECC, on-die ECC, or ECC done by a dedicated controller. Included in this work is a new driver for the Macronix external ECC engine, in drivers/mtd/nand/ecc-mxic.c
  • Miquèl Raynal also made a few contributions to the 802.15.4 part of the networking stack, and we have more contributions in this area coming up.
  • Paul Kocialkowski contributed a small fix to Allwinner Device Tree files, and another attempt at fixing an issue with the display panel detection/probing in the DRM subsystem

Linux 5.17 released: Bootlin contributions

Linux 5.17 has been released last Sunday. As usual, the best coverage of what is part of this release comes from LWN (part 1 and part 2), as well as KernelNewbies (unresponsive at the time of this writing) or CNX Software (for an ARM/RISC-V/MIPS focused description).

Bootlin contributed just 34 patches to this release, which isn’t a lot by the number of patches, but in fact includes a number of important new features. Also, we have many more contributions being discussed on the mailing lists or in preparation. For this 5.17 release here are the highlights of our contributions:

  • Alexandre Belloni, as the maintainer of the RTC subsystem, contributed one improvement to an RTC driver
  • Clément Léger improved the Microchip Ocelot Ethernet switch driver performance by implementing FDMA support. This allows network packets that are going from the switch to the CPU, or from the CPU to the switch to be received/sent in a much more efficient fashion than before. The Microchip Ocelot Ethernet switch driver was developed and upstreamed several years ago by Bootlin, see our previous blog post.
  • Clément Léger also contributed smaller fixes: a bug fix in the core software node code, and one PHY driver fix.
  • Hervé Codina implemented support for GPIO interrupts on the old ST Spear320 platform.
  • Maxime Chevallier contributed mqprio support to the Marvell Ethernet MAC mvneta driver, which was the topic of a previous blog post
  • Miquèl Raynal contributed a brand new NAND controller driver, for the NAND controller found in the Renesas RZ/N1 SoC. We expect to contribute to many more aspects of the Renesas RZ/N1 Linux kernel support in the next few months.
  • Miquèl Raynal contributed a few Device Tree changes enabling the ADC on the Texas Instruments AM473x platform, after contributing the driver changes a few releases ago.
  • Miquèl Raynal started contributing some improvements to the 802.15.4 Linux kernel stack, and we also have many more changes in the pipe for this Linux kernel subsystem.
  • Thomas Perrot added support for the Sierra EM919X modem to the existing MHI PCI driver.

Here is the full list of our contributions:

Announcing buildroot-external-st, Buildroot support for STM32MP1 platforms

STM32MP1 Discovery Kit 2Back in 2019, ST released a brand new processor family, the STM32MP1, whose members are currently based on a dual Cortex-A7 to run Linux combined with one Cortex-M4 to run bare-metal applications, together with a wide range of peripherals.

Following the release of this new platform, Bootlin ported its Embedded Linux and Yocto training courses to be available on STM32MP1, and also published a long series of tutorials showing how to use Buildroot to build an embedded Linux system on STM32MP1: part 1, part 2, part 3, part 4, part 5, part 6 and part 7.

We are happy to announce that we have partnered with ST to develop an improved support of Buildroot on STM32MP1, which is materialized by a Buildroot BR2_EXTERNAL available on Github at https://github.com/bootlin/buildroot-external-st. In Buildroot, a BR2_EXTERNAL is an extension of the core Buildroot, with additional configurations and/or packages.

This BR2_EXTERNAL tree is an extension of Buildroot 2021.02, which provides four example Buildroot configuration to easily get started on STM32MP1 platforms:

  • st_stm32mp157a_dk1, building a basic Linux system for the STM32MP1 Discovery Kit 1 platform, with a minimal Linux root filesystem
  • st_stm32mp157c_dk2, building a basic Linux system for the STM32MP1 Discovery Kit 2 platform, with a minimal Linux root filesystem
  • st_stm32mp157a_dk1_demo, building a much more featureful Linux system for the STM32MP1 Discovery Kit 1 platform, with Linux root filesystem that allows to run Qt5 applications with OpenGL acceleration on the HDMI output, that supports audio, demonstrates the usage of the Cortex-M4, uses OP-TEE, and more.
  • st_stm32mp157c_dk2_demo, building a much more featureful Linux system for the STM32MP1 Discovery Kit 2 platform, with Linux root filesystem that allows to run Qt5 applications with OpenGL acceleration on both the integrated DSI display and HDMI output, that supports audio, WiFi and Bluetooth, demonstrates the usage of the Cortex-M4, uses OP-TEE, and more.

This BR2_EXTERNAL is designed to work with Buildroot 2021.02, with only a small set of modifications, which since then have been integrated in upstream Buildroot, ensuring that STM32MP1 users can directly use upstream Buildroot for their projects.

There is extensive documentation on how to use this BR2_EXTERNAL tree as well as how to test the various features of the STM32MP1 platform: using STM32CubeProgrammer, using Device Tree generated from STM32 CubeMX, using the Cortex-M4, testing display support, using Qt5, using WiFi, using Bluetooth, using audio and using OP-TEE. We have also documented the internals of the BR2_EXTERNAL components.

This Buildroot support is using the latest software components from the recently released 3.1 BSP from ST (see release notes), so it is based on Linux 5.10, U-Boot 2020.10, TF-A 2.4 and OP-TEE 3.12. We will keep this BR2_EXTERNAL updated with newer releases of the ST BSP.

If you face any issue while using this Buildroot support for STM32MP1, you can use the issue tracker of the Github project, or use the ST community forums. Bootlin can also provide commercial support on Linux on STM32MP1 platforms, as well as training courses.

Bootlin contributions to Linux 5.14 and 5.15

It’s been a while we haven’t posted about Bootlin contributions to the Linux kernel, and in fact missed both the Linux 5.14 and Linux 5.15 releases, which we will cover in this blog post.

Linux 5.14 was released on August 29, 2021. The usual KernelNewbies.org page and the LWN articles on the merge window (part 1 and part 2) provide the best summaries of the new features and hardware support offered by this release.

Linux 5.15 on the other hand was released on November 1, 2021. Here as well, we have a great KernelNewbies.org article and LWN articles on the merge window (part 1 and part 2).

In total for those two releases, Bootlin contributed 79 commits, in various areas:

  • Alexandre Belloni, as the RTC subsystem maintainer, contributed 9 patches improving various aspects of RTC drivers, the RTC subsystem or Device Tree bindings for RTC
  • Clément Léger contributed a small improvement to the at_xdmac driver used on Microchip ARM platforms
  • Hervé Codina enabled Ethernet support on the old ST Spear320 SoC, by leveraging the existing stmmac Ethernet controller driver
  • Maxime Chevallier fixed a small issue with the Ethernet PHY on the i.MX6 Solidrun system-on-module
  • Miquèl Raynal added support for NV-DDR timings in the MTD susbsystem. This allows to improve performance with NAND flash memories that support those timings. Their usage is specifically implemented in the Arasan NAND controller driver, which Miquèl contributed back in Linux 5.8. See our previous blog post on this topic for more details
  • Miquèl Raynal added support for yet another NAND controller driver, the ARM PL35x, which is used for example on Xilinx Zynq 7000. See our previous blog post on this topic.
  • Miquèl Raynal added support for NAND chips with large pages (larger than 4 KB) to the OMAP GPMC driver.
  • Miquèl Raynal made a few fixes to the IIO driver for the max1027 ADC.
  • Paul Kocialkowski contributed a few patches to enable usage of the Hantro video decoder driver on the Rockchip PX30 processor.
  • Thomas Perrot contributed one patch to enable usage of the Flex Timers on i.MX7, and one to fix an issue in the PL022 SPI controller driver.

And now, as usual the complete list of our contributions to Linux 5.14 and 5.15:

The backbone of a Linux Industrial I/O driver

As part of recent projects, we had to dig into the Linux kernel Industrial I/O (IIO) subsystem with the goals of supporting a new ADC and adding new features to an existing driver. These tasks involved quite a few discussions between our engineering team and the IIO maintainers and reviewers. The aim of this blog post is to summarize the substance of these explanations to help others understand how an IIO kernel driver works and interacts with the core IIO subsystem.

Disclaimer: The IIO core is huge and keeps evolving. The aim of this article is not to cover it entirely, but at least explain our knowledge of how to use its basic features for common situations.

What is IIO?

The Industrial I/O subsystem covers any type of device that is commonly called as a “sensor”: ADCs, IMUs, temperature sensors, accelerometers, pressure sensors, potentiometers, light sensors, proximity sensors, etc (as well as few actuators, which I will on purpose disregard in this blog post). All these devices, besides measuring truly different physical components of our three dimensional world, end-up sharing quite a few properties. Any of these sensors must first be configured in order to know what must be measured and possibly how. When adequate, the device must be triggered in order to start converting. When the requested samples are ready, there must be some kind of signaling involved in order for the user to retrieve and process the data.

When thinking about the generic interfaces which could be needed by all these devices, it is quite straightforward to list:

  • The configuration before sampling
  • The triggering mechanism
  • The signaling for an end of conversion situation
  • The reading of the samples
  • The advertisement of the data

Registering an IIO device

The IIO core manipulates struct iio_dev * objects which inherits from struct device. This object should be allocated by the device driver with devm_iio_device_alloc(), providing the size of the driver’s internal structure as second argument. The allocated area dedicated for this internal pointer can be then retrieved with iio_priv().

This iio_dev structure must then be filled with a number of information:

  • The name of the device
  • A set of struct iio_info operations, typically a hook to read one or multiple samples on demand, optionally be able to write to the device, etc.
  • A set of supported modes, such as INDIO_DIRECT_MODE, which is used when samples can be retrieved at any time by the user from sysfs.
  • A scan mask, namely available_scan_masks which defines what are the possible/impossible scan combinations when requesting a read. Typically, a device might be configured to scan all of its internal channels from 1 to N. This can be described with a list of GENMASK(X, 0), with X ranging from 0 to the maximum number of channels. When the user will request a given set of channels, the IIO core will go through all the available masks registered by the driver and pick the first one that contains the desired channels. The selected mask will be available to the driver through the active_scan_mask entry of the iio_dev structure. If ‘anything goes’ and the devices has no restriction regarding which channel(s) can be scanned, this field should be skipped.
  • A definition of all the possible channels, including the type of physical measurements the device is able to perform (IIO_VOLTAGE, IIO_CURRENT, IIO_TEMPERATURE, IIO_STEPS, IIO_ROT, etc), the channel index and the data format.

Here is an example of channel description and below the meaning of these fields.

struct iio_chan_spec chan1 = {
	.type = IIO_VOLTAGE,
	.indexed = 0,
	.channel = index,
	.info_mask_separate = BIT(IIO_CHAN_INFO_RAW),
	.info_mask_shared_by_type = BIT(IIO_CHAN_INFO_SCALE),
	.scan_index = 1,
	.scan_type = {
		.sign = 'u',
		.realbits = 10,
		.storagebits = 16,
		.shift = 2,
		.endianness = IIO_BE,
	},
}
  • .info_mask_separate indicates an entry in sysfs that will be present for this the channel. In this case, IIO_CHAN_INFO_RAW indicates it is going to be the raw value of the sample.
  • .info_mask_shared_by_type indicates an entry in sysfs that will be shared by all channels of the same type. IIO_CHAN_INFO_SCALE means that there will be a common voltage scale sysfs entry shared by all the voltage raw entries. If the device was also able to read a temperature, we would also have had a single file indicating the scale for all the temperature samples.
  • The .scan_type field in the example indicates that values are provided as 16-bit big-endian samples that must be shifted by two bits. The full scale range is 0-1023. This conversion only applies to the buffer reading path: raw values directly read from sysfs and returned by the ->read_raw() hook (see below) should be converted by the driver itself.

Once the device fully described (and initialized, of course), the driver must register it with devm_iio_device_register().

Scaling factors

The int (*read_raw)(struct iio_dev *indio_dev, struct iio_chan_spec const *chan, int *val, int *val2, long mask) callback will be executed when reading either of the ‘raw value’ or ‘scale’ files from sysfs.

The type of data that must be returned is provided in the mask parameter: IIO_CHAN_INFO_RAW to retrieve the raw measurement or IIO_CHAN_INFO_SCALE to retrieve the scaling parameters, based on the scale information available in the iio_chan_spec structure that describes the channel.

For the IIO_CHAN_INFO_RAW case, most drivers return an IIO_VAL_INT type which can be simply “returned” into the *val argument. It is however possible to return a fixed point number, in this case the logic explained right after applies.

For the IIO_CHAN_INFO_SCALE case, the return value indicates what type of scaling should be done. In most cases here a fixed point value will be used so *val and *val2 will carry the scaling parameters. Here are two examples:

  1. IIO_TEMP example:
    *val = 1;
    *val2 = 8;
    ret = IIO_VAL_FRACTIONAL;
    

    The full scale sample value should be multiplied by 1/8 in order to get Celcius degrees.

  2. IIO_VOLTAGE example:

    *val = 2500;
    *val2 = 10;
    ret = IIO_VAL_FRACTIONAL_LOG2;
    

    The full scale range is a 10-bit value mapped to a 0-2500mV input level, said otherwise the scaling factor should be 2500 / 1024. The core will automatically do the computation of this factor and return 2,44140625 to the user in order to get milli-Volts.

Sampling

There are two basic common cases here.

In simple situations, a “single” on-demand read was issued by user-space directly by reading /sys/bus/devices/iio:device/in_<type><index>_raw. In this case the ->read_raw() callback should handle basically all the steps necessary to get a measurement, as detailed in our introduction.

However, user-space can also pick a more advanced way of interacting with the measurement device, called triggers.

A trigger is a specific configuration of the device which will sample a number of channels upon a specific event. This event might be the user requesting it from userspace with a so called software trigger, it might also be an external hardware event, or a periodic signal, or an internal continuous read mode… There are many ways of triggering a sensor and they are all covered by the subsystem.

Many devices cannot handle both modes at the same time. The only situation where this might work smoothly is when a device provides a hardware FIFO where you can read from (or a ‘latest value’ register) while not disrupting the FIFO read back. Otherwise, it will be needed, in order to avoid collisions between these two modes, to verify that exclusive access to the device is granted with a call to iio_device_claim_direct_mode() when starting a direct mode operation. As this helper grabs a mutex, it should be only called from process context and always be balanced with a call to iio_device_release_direct_mode().

IIO interoperability model

In the IIO core these four concepts are used:

  • IIO device: the hardware part which produces samples
  • IIO trigger: the signaling capability to request a conversion start
  • IIO buffers: where to store the samples
  • IIO events: threshold detectors

Even though a single hardware device might have hardware support for all these features, they must be described and handled separately so that, when applicable, other IIO devices might use them as well, eg. IIO device 2 could start a conversion upon IIO device 1 trigger state change. In practice it is not always possible but the way the API is built should lead us to keep things well separated anyway.

Registering a trigger

A struct iio_trigger must be allocated by devm_iio_trigger_alloc(), giving the new trigger a name.

The trigger should then receive a set of operations (struct iio_trigger_ops) with at least ->set_trigger_state() implemented, in order to switch on and off the trigger. One can use iio_trigger_set_drvdata() in order to link private data with the trigger and get this pointer back from the trigger callbacks.

Once initialized, devm_iio_trigger_register() will register the IIO trigger. This trigger will appear as a dedicated IIO device in sysfs.

It is likely that an IRQ will need to be registered as part of the trigger initialization step: the driver must be notified somehow that the trigger was toggled. If the asynchronous signaling is tied to a “trigger change” condition, which is the easiest situation, then it is advised to provide iio_trigger_generic_data_rdy_poll() as hard IRQ handler. This helper will just call iio_trigger_poll() and return.

You may of course want to handle more than this but in any case the rule is clear, triggers, buffers and devices should be fully separated. Hence, do not directly handle any data from this handler: an IIO trigger is only supposed to indicate a hardware transition, no more.

The call to iio_trigger_poll() will effectively go through the IIO internal interrupt tree, find the device that is connected to the trigger which fired and call the relevant handler in order to request the waiting device to process the data (which may be identical or different than the triggering device).

In the case of the device being limited to, for instance, an End Of Conversion (EOC) interrupt, you should still consider this signal as being suitable for being registered as a trigger. Yes, this might imply an additional delay between the hardware toggling and the IRQ being fired which is not ideal, but from a software point of view, the split between driver code and core logic will let other IIO devices use this IRQ as a trigger with no additional change needed to your code.

Note: There is one exception here. When a device does not provide any visible per-scan interrupt and the software has only access to some kind of FIFO watermark events, the whole trigger + buffer representation is swapped with a pure buffer-only implementation.

Registering triggered buffers

If the device itself is able to provide fast samples, the driver should also register a buffer, with iio_triggered_buffer_setup(). Both a hard IRQ handler and a threaded IRQ handler can be registered, as well as additional callbacks called before and after enabling and disabling the buffers in order to eg. configure the requested channels based on the current ->active_scan_mask.

Upon a trigger condition, these are the handlers that might be chosen by the core if the trigger is connected to your device!

The hard IRQ handler might be used to eg. save timestamps. The threaded IRQ handler is dedicated to the data processing. Depending of the type of trigger (iio_trigger_using_own()) the driver must decide whether it should start a conversion manually or if the data is waiting somewhere in a hardware FIFO, ready to be retrieved.

The final step is to push the samples into the core’s buffers. This should not be done manually. Let’s say that the user requested channels 0, 1 and 3 while the selected scan mask was including all channels from 0 to 4. Just calling iio_push_to_buffers() is the solution: the core knows that it will receive five samples of 16 bits, it also knows that the user only requested three of them and will automatically pick the right ones.

With all these IIO objects registered, you should be able to properly interact with the core and the other drivers, providing trigger capabilities to third party devices, or benefiting from other’s triggers.

What if my design lacks trigger capabilities?

You can still use triggers by enabling IIO_CONFIGFS (enables the configuration interface) and IIO_SW_TRIGGER. Then, you can either choose to trigger your scans from userspace with a simple file write, thanks to CONFIG_IIO_SYSFS_TRIGGER, or leverage timers to get periodic scans with CONFIG_IIO_HRTIMER_TRIGGER.

As an example, here is how to create a sysfs trigger:

# echo 0 > /sys/bus/iio/devices/iio_sysfs_trigger/add_trigger
# cat /sys/bus/iio/devices/iio_sysfs_trigger/trigger0/name
sysfstrig0

And here is how to create a timer based software trigger:

# mkdir -p /config
# mount -t configfs none /config
# mkdir /config/iio/triggers/hrtimer/my_5ms_trigger
# cat /sys/bus/iio/devices/trigger0/name
my_5ms_trigger
# echo 200 > /sys/bus/iio/devices/trigger0/sampling_frequency

How to use triggers and buffers from userspace

Just for the reference, linking an IIO trigger to an IIO device is as simple as:

# cat /sys/bus/iio/devices/trigger0/name > /sys/bus/iio/devices/iio:device0/trigger/current_trigger

The next step is to configure the channels that should be scanned:

# echo 1 > /sys/bus/iio/devices/iio:device0/scan_elements/in_voltage0_en
# echo 1 > /sys/bus/iio/devices/iio:device0/scan_elements/in_voltage1_en
# echo 1 > /sys/bus/iio/devices/iio:device0/scan_elements/in_voltage3_en

Starting the sampling process is managed with:

# echo 1 > /sys/bus/iio/devices/iio:device0/buffer/enable

In the case of a sysfs software trigger, it is the user’s responsibility to timely run:

# echo 1 > /sys/bus/iio/devices/trigger0/trigger_now

The samples are available to be read in /dev/iio\:device0.

The decoding process before the scaling operation must be performed by the userspace, following the content of:

# cat /sys/bus/iio/devices/iio:device0/scan_elements/in_voltage0_type
be:u12/16>>2 

Which in this case would mean that each sample is 16 bits wide, values should be considered big-endian, shifted twice before being considered as an unsigned 12-bit value.

# od -t x1 /dev/iio\:device0
0000000 08 06

Should be interpreted as 0x806 >> 2 = 0x201, which should be then multipled by the scaling factor in order to get the final mV value.

Conclusion

While contributing to this (relatively new) subsystem, we discovered a number of interesting features and design choices which would really benefit from a much tougher in-kernel documentation as most of the available information explains how to use IIO (with libiio or configfs) more than how to write a decent and properly shaped IIO device driver. As the subsystem is still pretty recent, it is valid to look at existing drivers to make design choices, but that is not a magic solution as no device never fully matches any software API anyway, and sensors unfortunately do not escape from that sticky rule.

We want to warmly thank Jonathan Cameron, IIO founder and maintainer, for his precious feedback on the mailing list, as well as his valuable review and contribution to this blog post.

We hope this article will help you go through this API and if it does, please mind letting us know by dropping a comment in the section down below!

Bootlin contributions to Linux 5.13

After finally publishing about our Linux 5.12 contributions and even though Linux 5.14 was just released yesterday, it’s hopefully still time to talk about our contributions to Linux 5.13. Check out the LWN articles about the merge window to get the bigger picture about this release: part 1 and part 2.

In terms of Bootlin contributions, this was a much more quiet release than Linux 5.12, with just 28 contributions. The main highlights are:

  • The usual round of RTC subsystem updates from its maintainer Alexandre Belloni
  • A large amount of improvements in the MTD subsystem by its co-maintainer Miquèl Raynal, continuing his effort to improve the ECC handling in the MTD subsystem. See Miquèl’s talk at ELCE 2020 for more details on this effort: slides and video.
  • A small fix for an annoying regression in the musb USB gadget controller driver.

Even though we contributed just 28 commits to this release, as maintainers, some of us also reviewed and merged code from other contributors: Miquèl Raynal as the MTD co-maintainer merged 63 patches, Alexandre Belloni merged 22 patches, and Grégory Clement 6 patches.

Here are the details of our contributions to Linux 5.13:

Bootlin contributions to Linux 5.12

Yes, Linux 5.13 was released yesterday, but we never published the blog post detailing our contributions to Linux 5.12, so let’s do this now! First of all the usual links to the excellent LWN.net articles on the 5.12 merge window: part 1 and part 2.

LWN.net also published an article with Linux 5.12 development statistics, and two Bootlin engineers made their way to the statistics: Alexandre Belloni in the list of top contributors by number of changesets, with 69 commits, and Paul Kocialkowski in the list of top contributors by number of changed lines, with over 6000 lines changed.

Here are the highlights of our contributions:

  • Addition of a new driver for the Silvaco I3C master controller. This was contributed by Miquèl Raynal, who became the maintainer for this driver. Bootlin has pioneered support for I3C in Linux, by introducing the complete drivers/i3c subsystem a few years ago, together with the first controller driver, for a Cadence IP, see our blog post from 2018.
  • Addition of two new camera sensor drivers, one for the Omnivision OV5648 and another for the Omnivision OV8865. These were contributed by Paul Kocialkowski.
  • Implementation of mqprio support in the Marvell Ethernet controller driver mvneta, see this commit. As explained in the tc-mqprio man page, the MQPRIO qdisc is a simple queuing discipline that allows mapping traffic flows to hardware queue ranges using priorities and a configurable priority to traffic class mapping. This was contributed by Maxime Chevallier
  • Improvements in the IIO driver for the ms58xx family of sensors, contributed by Alexandre Belloni.
  • The final removal of the atmel_tclib code, which has been replaced by proper drivers for the TCB timers on Atmel/Microchip ARM platforms over the past few releases, also by Alexandre Belloni.
  • As usual, a large amount of fixes and improvements in the RTC subsystem, by its maintainer Alexandre Belloni.

Here is the detailed list of our contributions to this release:

Bootlin contributions to Linux 5.11

Linux 5.11 was released quite some time ago now, but it’s never too late to have a look at Bootlin contributions to this release. As usual, we recommend reading the LWN articles on the 5.11 merge window: part 1 and part2. Also of interest is the Kernelnewbies page for 5.11.

Here are the main highlights of our contributions:

  • Alexandre Belloni, as the maintainer of the RTC subsystem, continued making numerous improvements and fixes to RTC drivers
  • On the support for Microchip ARM platforms, Alexandre Belloni switched the PWM atmel-tcb driver to a new Device Tree binding and added SAMA5D2 support, he did some improvements to the IIO driver for the Microchip ADC, and continued to remove platform_data support from Microchip drivers as all platforms are now converted to the Device Tree.
  • Alexandre Belloni contributed a new Simple Audio Mux driver for the ALSA subsystem, which can be used to control simple audio multiplexers driven using GPIOs, that allows to select which of their input line is connected to the output line.
  • Grégory Clement added support for several new MIPS platforms from Microchip: Luton, Serval and Jaguar2. All those platforms include a MIPS core, a few peripherals and more importantly an Ethernet switch. For now the support only includes the base platform support, but we are working on the switchdev driver for the Ethernet switch.
  • Miquèl Raynal, maintainer of the NAND subsystem and co-maintainer of the MTD subsystem, contributed numerous changes to the ECC support in the MTD subsystem, making it more generic so that it can be used not just for parallel NAND flashes, but also SPI NAND flashes. For more details, see the talk from Miquèl Raynal on this topic.

In addition to those 95 patches that we authored and contributed, several Bootlin engineers being maintainers of different subsystems of the Linux kernel reviewed and merged patches from other contributors:

  • Miquèl Raynal, as the NAND maintainer and MTD co-maintainer, reviewed and merged 67 patches from other contributors
  • Alexandre Belloni, as the RTC, I3C and Microchip ARM/MIPS platforms maintainer, reviewed and merged 47 patches from other contributors
  • Grégory Clement, as the Marvell EBU platform co-maintainer, reviewed and merged 33 patches from other contributors

Here is the detailed list of our contributions to Linux 5.11:

Large Page Support for NAS systems on 32 bit ARM

The need for large page support on 32 bit ARM

Storage space has become more and more affordable to a point that it is now possible to have multiple hard drives of dozens of terabytes in a single consumer-grade device. With a few 10 TiB hard drives and thanks to RAID technology, storage capacities that exceed 16 or 32 TiB can easily be reached and at a relatively low cost.

However, a number of consumer NAS systems used in the field today are still based on 32 bit ARM processors. The problem is that, with Linux on a 32 bit system, it’s only possible to address up to 16 TiB of storage space. This is still true even with the ext4 filesystem, even though it uses 64 bit pointers.

We were lucky to have a customer contracting us to update older Large Page Support patches to a recent version of the Linux kernel. This set of patches are one way of overcoming this 16 TiB limitation for ARM 32-bit systems. Since updating this patch series was a non trivial task, we are happy to share the results of our efforts with the community, both through this blog post and through a patch series we posted to the Linux ARM kernel mailing list: ARM: Add support for large kernel page (from 8K to 64K).

How Large Page Support works

The 16 TiB limitation comes from the use of page->index which is a pgoff_t offset type corresponding to unsigned long. This limits us to a 32-bit page offsets, so with 4 KiB physical pages, we end up with a maximum of 16 TiB. A way to address this limitation is to use larger physical pages. We can reach 32 TiB with 8 KiB pages, 64 TiB with 16 KiB pages and up to 256 TiB with 64 KiB pages.

Before going further, the ARM32 Page Tables article from Linus Walleij is a good reference to understand how the Linux kernel deals with ARM32 page tables. In our case, we are only going to cover the non LPAE case. As explained there, the way the Linux kernel sees the page tables actually doesn’t match reality. First, the kernel deals with 4 levels of page tables while on hardware there are only 2 levels. In addition, while the ARM32 hardware stores only 256 PTEs in Page Tables, taking up only 1 KB, Linux optimizes things by storing in each 4 KB page two sets of 256 PTEs, and two sets of shadow PTEs that are used to store additional metadata needed by Linux about each page (such as the dirty and accessed/young bits). So, there is already some magic between what is presented to the Linux virtual memory management subsystem, and what is really programmed into the hardware page tables. To support large pages, the idea is to go further in this direction by emulating larger physical pages.

Our series (and especially patch 5: ARM: Add large kernel page support) proposes to pretend to have larger hardware pages. The ARM 32-bit architecture only supports 4 KiB or 64 KiB page sizes, but we would like to support intermediate values of 8 KiB, 16 KiB and 32 KiB as well. So what we do to support 8 KiB pages is that we tell Linux the hardware has 8 KiB pages, but in fact we simply use two consecutive 4 KiB pages at the hardware level that we manipulate and configure simultaneously. To support 16 KiB pages, we use 4 consecutive 4 KiB pages, for 32 KiB pages, we use 8 consecutive pages, etc. So really, we “emulate” having larger page sizes by grouping 2, 4 or 8 pages together. Adding this feature only required a few changes in the code, mainly dealing with ranges of pages every time we were dealing with a single page. Actually, most of the code in the series is about making it possible to modify the hard coded value of the hardware page size and fixing the assumptions associated to such a fixed value.

In addition to this emulated mechanism that we provide for 8 KiB, 16 KiB, 32 KiB and 64 KiB pages, we also added support for using real hardware 64 KiB pages as part of this patch series.

Overall the number of changes is very limited (271 lines added, 13 lines removed), and allows to use much larger storage devices. Here is the diffstat of the full patch series:

 arch/arm/include/asm/elf.h                  |  2 +-
 arch/arm/include/asm/fixmap.h               |  3 +-
 arch/arm/include/asm/page.h                 | 12 ++++
 arch/arm/include/asm/pgtable-2level-hwdef.h |  8 +++
 arch/arm/include/asm/pgtable-2level.h       |  6 +-
 arch/arm/include/asm/pgtable.h              |  4 ++
 arch/arm/include/asm/shmparam.h             |  4 ++
 arch/arm/include/asm/tlbflush.h             | 21 +++++-
 arch/arm/kernel/entry-common.S              | 13 ++++
 arch/arm/kernel/traps.c                     | 10 +++
 arch/arm/mm/Kconfig                         | 72 +++++++++++++++++++++
 arch/arm/mm/fault.c                         | 19 ++++++
 arch/arm/mm/mmu.c                           | 22 ++++++-
 arch/arm/mm/pgd.c                           |  2 +
 arch/arm/mm/proc-v7-2level.S                | 72 ++++++++++++++++++++-
 arch/arm/mm/tlb-v7.S                        | 14 +++-
 16 files changed, 271 insertions(+), 13 deletions(-)

This patch series is running in production now on some NAS devices from a very popular NAS brand.

Limitations and alternatives

The submission of our patch series is recent but this feature has actually been running for years on many NAS systems in the field. Our new series is based on the original patchset, with the purpose of submitting it to the mainline kernel community. However, there is little chance that it will ever be merged into the mainline kernel.

The main drawback of this approach are large pages themselves: as each file in the page cache uses at least one page, the memory wasted increases as the size of the pages increases. For this reason, Linus Torvalds was against similar series proposed in the past.

To show how much memory is wasted, Arnd Bergmann ran some numbers to measure the page cache overhead for a typical set of files (Linux 5.7 kernel sources) for 5 different page sizes:

Page size (KiB) 4 8 16 32 64
page cache usage (MiB) 1,023.26 1,209.54 1,628.39 2,557.31 4,550.88
factor over 4K pages 1.00x 1.18x 1.59x 2.50x 4.45x

We can see that while a factor of 1.18 is acceptable for 8 KiB pages, a 4.45 multiplier looks excessive with 64 KiB pages.

Actually, to make it possible to address large volumes on 32 bit ARM, another solution was pointed out during the review of our series. Instead of using larger pages which have an impact on the entire system, an alternative is to modify the way the filesystem addresses the memory by using 64 bits pgoff_t offsets. This has already been implemented in vendor kernels running in some NAS systems, but this has never been submitted to mainline developers.

Videos and slides from Bootlin talks at Embedded Linux Conference Europe 2020

The Embedded Linux Conference Europe took place online last week. While we definitely missed the experience of an in-person event, we strongly participated to this conference with no less than 7 talks on various topics showing Bootlin expertise in different fields: Linux kernel development in networking, multimedia and storage, but also build systems and tooling. We’re happy to be publishing now the slides and videos of our talks.

From the camera sensor to the user: the journey of a video frame, Maxime Chevallier

Download the slides: PDF, source.

OpenEmbedded and Yocto Project best practices, Alexandre Belloni

Download the slides: PDF, source.

Supporting hardware-accelerated video encoding with mainline Linux, Paul Kocialkowski

Download the slides: PDF, source.

Building embedded Debian/Ubuntu systems with ELBE, Köry Maincent

Download the slides: PDF, source.

Understand ECC support for NAND flash devices in Linux, Miquèl Raynal

Download the slides: PDF, source.

Using Visual Studio Code for Embedded Linux Development, Michael Opdenacker

Download the slides: PDF, source.

Precision Time Protocol (PTP) and packet timestamping in Linux, Antoine Ténart

Download the slides: PDF, source.