Multi-queue improvements in Linux kernel Ethernet driver mvneta

In the past months, the Linux kernel driver for the Ethernet MAC found in a number of Marvell SoCs, mvneta, has seen quite a few improvements. Lorenzo Bianconi brought support for XDP operations on non-linear buffers, a follow-up to the already-great XDP support that offers very good performance on that controller. Russell King contributed improved, more generic and easier to maintain phylink support, to deal with the variety of embedded use-cases.

At that point, it’s getting difficult to squeeze more performance out of this controller. However, we still have some tricks we can use to improve some use-cases, so in the past months we’ve worked on implementing QoS features on mvneta, through the use of mqprio.

Multi-queue support

A simple Ethernet NIC (Network Interface Controller) could be described as a controller that handles some protocol-level aspects of the Ethernet standard, where incoming and outgoing packets are enqueued in a dedicated ring-buffer that serves as an interface between the controller and the kernel.

The buffer containing packets that need to be sent is called the Transmit Queue, often written as txq. It’s fed by the kernel, where the NIC driver converts struct sk_buff objects, called skbs, which represent packets throughout their journey in the kernel, into buffers that are enqueued in the txq.

On the ingress side, the Receive Queue, written rxq, is fed by the MAC, and dequeued by the NIC driver, which creates skbs containing the payload and attached metadata.
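To make this a bit more concrete, here is a minimal sketch of what the transmit hook of a NIC driver can look like. This is not mvneta’s actual code: the netdev API calls are real, but the my_nic_* types and helpers are hypothetical:

/* Minimal sketch of a driver transmit hook: the kernel hands us an skb,
 * and we turn it into a buffer enqueued in the hardware txq. */
static netdev_tx_t my_nic_start_xmit(struct sk_buff *skb,
                                     struct net_device *dev)
{
        struct my_nic_priv *priv = netdev_priv(dev);

        if (my_nic_txq_full(&priv->txq)) {
                /* Ring full: stop the queue, the stack will retry later */
                netif_stop_queue(dev);
                return NETDEV_TX_BUSY;
        }

        /* Enqueue a descriptor pointing to the packet data */
        my_nic_txq_enqueue(&priv->txq, skb);

        return NETDEV_TX_OK;
}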

In the end, every incoming or outgoing packet is enqueued and dequeued in a typical first-in first-out fashion. When the queue is full, packets are dropped; everything is neat, simple and deterministic.

However, typical use-cases are never simple. Most modern NICs have several queues in TX and RX. On the receive side, it’s useful to have multiple queues for performance reasons. When receiving packets, we can spread the traffic across multiple queues (with RSS for example), and have one CPU core dedicated to each queue. More complex use-cases can be expressed, such as flow steering, which you can learn about in the kernel documentation.

On the transmit side, it’s a bit less obvious why we want to have multiple queues. If the CPU is the bottleneck in terms of performance, having multiple TX queues won’t help much. Still, there are ways to benefit from having multiple TX queues on a multi-CPU system with XPS.

However, when the line-rate is the bottleneck, having multiple queues gets useful. By sorting the outgoing traffic by priority and assigning each priority to a TX queue, the NIC can pick the next packet to send from the highest-priority queue until it’s empty, and then move on to the next priority. That way, we implement what’s called QoS (Quality of Service), where we care about high-priority traffic making it through the controller.

QoS itself is useful for various reasons. Besides the obvious prioritization of high-value streams for not-so-neutral networking, we can also imagine tagging management traffic as high-priority, to guarantee the ability to remotely access a machine under heavy traffic.

Other applications can be in the realm of Time Sensitive Networking, where it’s important that time-sensitive traffic gets sent right away, using the high-priority queues as shortcuts to bypass the low-priority, over-crowded queues.

NICs can use more advanced algorithms to pick the queue to dequeue the next packet from, in a process called arbitration, to give low-priority traffic a chance to get sent even when high-priority traffic takes most of the bandwidth.

Typical algorithms range from strict priority-based FIFO to more flexible Weighted Round-Robin arbitration, to allow at least a few low-priority packets to leave the NIC.
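As an illustration, strict priority arbitration boils down to the following toy model, a software sketch of what the hardware does in silicon (the queue_* helpers are hypothetical):

/* Toy model of strict-priority arbitration: always dequeue from the
 * highest-priority non-empty queue. Queues are indexed from lowest (0)
 * to highest priority. */
static struct packet *arbiter_pick_next(struct queue *queues, int nqueues)
{
        int q;

        for (q = nqueues - 1; q >= 0; q--)
                if (!queue_empty(&queues[q]))
                        return queue_dequeue(&queues[q]);

        return NULL; /* nothing to send */
}

A Weighted Round-Robin arbiter replaces this loop with a scheme that also visits lower-priority queues every so often, proportionally to their weights.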

In the case of mvneta, we’re limited in terms of what we can do in the receive path. Most of the work we’ve done focused on the transmit side.

Traffic Prioritization and Arbitration

Multi-queue support on the TX path is a three-fold process.

First, we have to know which packets are high-priority and which aren’t, by assigning a value to the skb->priority field.

There are several ways of doing so: using iptables, nftables, tc, socket parameters through SO_PRIORITY, or switching parameters. This is outside the scope of this article; we’ll assume that incoming packets are already prioritized through one of the above mechanisms.
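As a small illustration of the socket-level mechanism, an application can tag its own traffic with the SO_PRIORITY socket option; here is a minimal sketch:

#include <sys/socket.h>

/* Tag all traffic sent on this socket with priority 5: the value ends
 * up in skb->priority for every packet the socket emits. */
int set_socket_priority(int fd)
{
        int prio = 5;

        return setsockopt(fd, SOL_SOCKET, SO_PRIORITY,
                          &prio, sizeof(prio));
}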

Once the packet comes into the network stack as an skb, we need to assign it to a TX queue. That’s where tc-mqprio comes into play.

In that second step, we build a mapping between priorities and queues. This mapping is done through an intermediate level, the Traffic Classes. A Traffic Class is a mapping between N priorities and M transmit queues. We therefore have 2 levels of mappings to configure:

  1. The priority to Traffic Class mapping (N to 1 mapping)
  2. The Traffic Class to TX queue mapping (1 to M mapping)

All of this is done in a single tc command. We’re not going to dive too deep into the tc tool itself, but we’ll still take a quick look at how tc-mqprio works.

Here’s an example:

tc qdisc add dev eth0 parent root handle 100 mqprio num_tc 8 map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 1

Let’s break this down a bit.

Queuing Disciplines (QDiscs) are the algorithms that select how and when traffic is enqueued and dequeued on a NIC. QDiscs benefit from great software support, but can also be offloaded to hardware, as we’ll see.

So the first part of the command is about attaching the mqprio QDisc to the network interface:

tc qdisc add dev eth0 parent root handle 100 mqprio

After that, we configure the traffic classes. Here, we create 8 traffic classes:

num_tc 8

And then we map each priority to a class. In this list, the n-th element corresponds to the class you want to assign to priority n. Here, we map the first 8 priorities each to a dedicated class. Priorities that aren’t assigned a class are mapped to class 0.

map 0 1 2 3 4 5 6 7

Finally, we map each class to a queue. Classes can only be assigned a contiguous set of queues, so the only parameters needed for the mapping are the number of queues assigned to the class (the number before the @), and the offset in the queue list (the number after the @). We specify one mapping per class, the m-th element in the list being the mapping for class number m. In the following example, we simply assign one queue per traffic class:

queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7

Under the hood, tc-mqprio will actually assign one QDisc per queue and map the Traffic Classes to these QDiscs, so that we can still hook other tc configurations on a per-queue basis.

Finally, we enable hardware offloading of the prioritization:

hw 1

The last part of the work consists of configuring the hardware itself to correctly prioritize each queue. That’s done by the NIC driver, which gets the configuration from the tc hooks.
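On the driver side, this configuration arrives through the ndo_setup_tc hook of struct net_device_ops. Here is a simplified sketch of its shape; the my_nic_* helper is hypothetical, and a real driver such as mvneta also handles the shaper parameters we’ll see below:

/* Simplified sketch of a driver ndo_setup_tc hook receiving the mqprio
 * configuration from the tc command shown above. */
static int my_nic_setup_tc(struct net_device *dev, enum tc_setup_type type,
                           void *type_data)
{
        struct tc_mqprio_qopt *qopt = type_data;
        int tc;

        if (type != TC_SETUP_QDISC_MQPRIO)
                return -EOPNOTSUPP;

        /* qopt->prio_tc_map[] holds the priority to class mapping;
         * qopt->offset[] and qopt->count[] hold the class to queue
         * mapping, matching the "queues M@O" syntax. */
        for (tc = 0; tc < qopt->num_tc; tc++)
                my_nic_assign_queues_to_tc(dev, tc, qopt->offset[tc],
                                           qopt->count[tc]);

        return 0;
}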

If we take a look at the hardware, we see that the queues are exposed to the kernel, which enqueues packets into them. Each queue then has a Bandwidth Limiter, followed by the arbitration layer. Finally, we have one last global Rate Limiter, and the path then continues to the egress blocks of the controller.

The arbiter configuration is easy: it’s just a matter of enabling strict priority arbitration in a few registers.

Let’s summarize what the stack looks like:

[Diagram: the resulting prioritization stack]

Traffic shaping

Being able to prioritize outgoing traffic is good, but we can step it up by also limiting the rate of each queue. That way, we can neatly organize and control how we want to use our bandwidth, not at a per-packet level but on a real bits/s basis.

Controlling the rate of individual flows or queues is called Traffic Shaping. When configuring a shaper, not only do we care about the desired maximum rate, but also about how we deal with traffic bursts. We can smooth them by tightly controlling how and when we send packets, or we can allow some burstiness by being more lenient about how often we enforce the rate limitation.

In the mvneta hardware, we have 2 levels of shaping: one level of per-queue shapers, and one per-port shaper. They can all be controlled semi-independently, with some parameters being shared between all shapers.

With mqprio, the shapers are configured on a per-Traffic-Class basis. Since the hardware only allows per-queue shaping, we enforce in the driver that one traffic class is assigned to only one queue.

A final configuration with mqprio looks like this:

[Diagram: a final mqprio configuration]

Most shapers use a variant of the Token Bucket Filter algorithm. The principle is the following:

Each queue has a Token Bucket, with a limited capacity. When a packet needs to be sent, one token per bit in the packet gets taken from the bucket. If the bucket is empty, transmission stops. The bucket then gets refilled at a configurable rate, with a configurable amount of tokens.

Configuring the TBF for each queue boils down to setting a few parameters:

– The Bucket Size, sometimes called burst, buffer or maxburst.

It should be big enough to accommodate the required shaping rate, but not too big. If the bucket is too big and only very slow traffic is going out, the bucket is going to fill up to its full size. When you suddenly have a lot of traffic to send, you’ll first get a huge burst, while the bucket empties, before the traffic actually starts getting limited. Unlike the tc-tbf interface, tc-mqprio doesn’t allow changing the burst size, so it’s up to the driver to set it.

– The Refill Rate, sometimes called tick, defining how often you refill the bucket.
– The Refill Value, defining how many tokens you put back into the bucket at each refill.

These two, combined together, define the actual rate limitation. The usual approach is to pick a fixed value for one of the parameters, and vary the other one to select the desired rate. In the case of mvneta, we chose to use a fixed refill rate, and change the refill value. This means that we have a limited resolution in the rates we can express: in our case, a 10 kbps resolution, allowing us to cover any rate limitation between 10 kbps and 5 Gbps, in 10 kbps increments.

– One final parameter is the minburst or MTU parameter, which controls the minimum amount of tokens required to allow packet transmission.
Since transmission stops when the bucket is empty, we could end up with an empty bucket in the middle of a packet being transmitted. The link partner may not be too happy about that, so it’s common to set the Maximum Transmission Unit as the minimum amount of tokens required to trigger transmission.

That way, we are sure that when we start sending a packet, we’ll always have enough tokens in the bucket to send the full packet. We can play a bit with this parameter if we want to buffer small packets and send them in a short burst when we have enough. In our case, we simply configured it as the MTU, which is good enough for most cases.
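To tie these parameters together, here is a minimal software model of the TBF logic described above. The hardware implements the same idea in silicon; the names here are ours, not mvneta’s:

/* Minimal software model of a Token Bucket Filter, counting tokens
 * in bits. */
struct tbf {
        unsigned long long tokens;     /* current bucket level */
        unsigned long long bucket_sz;  /* "burst": maximum bucket level */
        unsigned long long refill_val; /* tokens added at each tick */
        unsigned long long minburst;   /* minimum level to start a packet */
};

/* Called at each refill tick: with a fixed tick frequency F, the
 * resulting rate limit is refill_val * F bits per second, which is
 * why varying refill_val selects the rate in fixed increments. */
static void tbf_refill(struct tbf *t)
{
        t->tokens += t->refill_val;
        if (t->tokens > t->bucket_sz)
                t->tokens = t->bucket_sz;
}

/* Returns 1 if a packet of pkt_bits may start transmission now,
 * consuming its tokens; 0 if it must wait for more refills. */
static int tbf_try_send(struct tbf *t, unsigned long long pkt_bits)
{
        if (t->tokens < t->minburst)
                return 0;

        t->tokens -= pkt_bits;
        return 1;
}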

In the end, the gains are not necessarily measurable in terms of raw performance, but thanks to mqprio and traffic shaping, we now have better control over the traffic that comes out of our interface.

An example configuration would be:

# Shaping and priority configuration
tc qdisc add dev eth0 parent root handle 100 mqprio \
num_tc 8 map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
hw 1 mode channel shaper bw_rlimit \
max_rate 1Gbit 500Mbit 250Mbit 125Mbit 50Mbit 25Mbit 25Mbit 25Mbit

# Assign HTTP traffic a priority of 1, to limit it to 500Mbps
iptables -t mangle -A POSTROUTING -p tcp --dport 80 -j CLASSIFY \
--set-class 0:1

There are tons of other use-cases, for example limiting per-VLAN speeds, or on the contrary making sure that traffic on a specific VLAN has the highest priority, with all of that mostly offloaded to the hardware itself!

Bootlin contributions to Linux 5.14 and 5.15

It’s been a while since we last posted about Bootlin contributions to the Linux kernel: we in fact missed both the Linux 5.14 and Linux 5.15 releases, which we will cover in this blog post.

Linux 5.14 was released on August 29, 2021. The usual KernelNewbies.org page and the LWN articles on the merge window (part 1 and part 2) provide the best summaries of the new features and hardware support offered by this release.

Linux 5.15 on the other hand was released on November 1, 2021. Here as well, we have a great KernelNewbies.org article and LWN articles on the merge window (part 1 and part 2).

In total for those two releases, Bootlin contributed 79 commits, in various areas:

  • Alexandre Belloni, as the RTC subsystem maintainer, contributed 9 patches improving various aspects of RTC drivers, the RTC subsystem or Device Tree bindings for RTC
  • Clément Léger contributed a small improvement to the at_xdmac driver used on Microchip ARM platforms
  • Hervé Codina enabled Ethernet support on the old ST Spear320 SoC, by leveraging the existing stmmac Ethernet controller driver
  • Maxime Chevallier fixed a small issue with the Ethernet PHY on the i.MX6 Solidrun system-on-module
  • Miquèl Raynal added support for NV-DDR timings in the MTD subsystem. This allows improving performance with NAND flash memories that support those timings. Their usage is specifically implemented in the Arasan NAND controller driver, which Miquèl contributed back in Linux 5.8. See our previous blog post on this topic for more details
  • Miquèl Raynal added support for yet another NAND controller driver, the ARM PL35x, which is used for example on Xilinx Zynq 7000. See our previous blog post on this topic.
  • Miquèl Raynal added support for NAND chips with large pages (larger than 4 KB) to the OMAP GPMC driver.
  • Miquèl Raynal made a few fixes to the IIO driver for the max1027 ADC.
  • Paul Kocialkowski contributed a few patches to enable usage of the Hantro video decoder driver on the Rockchip PX30 processor.
  • Thomas Perrot contributed one patch to enable usage of the Flex Timers on i.MX7, and one to fix an issue in the PL022 SPI controller driver.


Initial Allwinner V3 ISP support in mainline Linux

Introduction

Several months ago, Bootlin announced ongoing work on MIPI CSI-2 support for the Allwinner A31/V3 and A83T platforms in mainline Linux, as well as support for the Omnivision OV8865 and OV5648 image sensors. This effort has been a success: while the sensor patches have since been integrated in mainline Linux, the MIPI CSI-2 controller patches are still on their way towards inclusion.

Since then we had the opportunity to take things further and start tackling the next steps for advanced camera support in mainline Linux on Allwinner SoCs!

With MIPI CSI-2 support and proper sensor drivers available in V4L2, we were able to capture the raw Bayer data provided by the sensors. But this data does not constitute the final picture that can be displayed or encoded into a file: a number of enhancement and transformation steps are required to achieve the visually-pleasing result that users typically expect.

These steps are quite calculation-intensive and it does not make sense to implement them with a software pipeline, especially with rates of 25, 30 or even 60 frames per second that are typically expected for video recording.

An open-source and upstream driver for the Allwinner ISP

Allwinner SoCs that support MIPI CSI-2 also include an Image Signal Processor hardware unit, a dedicated accelerator for enhancing and transforming raw data received from sensors.

[Figure: features of the ISP as described in the Allwinner V3 datasheet]

Since support for this ISP was implemented using a non-free blob in Allwinner SDKs, this area remained unsupported in mainline Linux… Until now!

Thanks to some help from Allwinner, we were able to implement a proper V4L2 driver for the Allwinner ISP found in the Allwinner V3, completely open-source, with no binary blob involved. This work was recently submitted upstream, with a first revision totaling more than 8000 new lines of code, which comes together with a significant rework of the Allwinner camera interface driver to make it usable with or without the ISP, and including the new MIPI CSI-2 support which we had submitted previously. We are very happy to keep contributing to advancing fully open-source Allwinner SoCs support in mainline Linux and help tackle some of the remaining areas there!

Our currently proposed driver for the Allwinner ISP only supports a limited set of features: debayering with coefficients and 2D noise filtering. These features were sufficient for our use case, and allowed us to offload the computationally intensive debayering process to a dedicated hardware accelerator.

As the driver for now relies on a specific user-space API that does not yet cover all aspects of the ISP, the driver was submitted to the Linux kernel staging area and will probably stay there until all ISP features are properly described.

Our work on this advanced camera support, including the ISP driver, has been described in the talk we have given earlier this week at the Embedded Linux Conference, for which the slides are already available.

[Slides: Advanced camera support on Allwinner SoCs with Mainline Linux]

Additional features and future work

The Allwinner ISP supports many more features beyond just debayering and noise filtering. For example, it supports statistics to implement 3A algorithms (auto-focus, auto-exposure and auto-white-balance), which are necessary to avoid manual configuration of scene-specific parameters. These could typically be implemented in libcamera, the community free software project that supports complex image capture pipelines and ISPs.

As a result, Bootlin would be very interested in continuing this work and bringing this driver to a more advanced state. So if you have a project that could help move this topic forward, do not hesitate to contact us about it!

Bootlin contributions to Linux 5.13

After finally publishing about our Linux 5.12 contributions and even though Linux 5.14 was just released yesterday, it’s hopefully still time to talk about our contributions to Linux 5.13. Check out the LWN articles about the merge window to get the bigger picture about this release: part 1 and part 2.

In terms of Bootlin contributions, this was a much more quiet release than Linux 5.12, with just 28 contributions. The main highlights are:

  • The usual round of RTC subsystem updates from its maintainer Alexandre Belloni
  • A large amount of improvements in the MTD subsystem by its co-maintainer Miquèl Raynal, continuing his effort to improve the ECC handling in the MTD subsystem. See Miquèl’s talk at ELCE 2020 for more details on this effort: slides and video.
  • A small fix for an annoying regression in the musb USB gadget controller driver.

Even though we contributed just 28 commits to this release, as maintainers, some of us also reviewed and merged code from other contributors: Miquèl Raynal as the MTD co-maintainer merged 63 patches, Alexandre Belloni merged 22 patches, and Grégory Clement 6 patches.


Mainline Linux support for the ARM Primecell PL35X NAND controller

It has been more than 7 years since the first draft of a Linux kernel driver for the ARM Primecell PL35X NAND controller was posted on a public mailing list. Maybe because of a lack of time, each new version was delayed so much that it actually needed another iteration just to catch up with the latest internal API changes in the MTD subsystem (quite a number of them happened in the last 2-3 years). The NAND controller itself is part of an ARM Primecell Static Memory bus Controller (SMC), which increases the overall complexity. Finally, the way commands and data are shared with the memory controller is very specific to the SMC. All these technical points probably played against Xilinx engineers, and Bootlin was contracted in 2021 to finalize the work of getting the ARM Primecell PL35X NAND controller driver into the upstream Linux kernel.

Static Memory Controller principles

[Figure: SMC diagram from the TRM]

The SMC can interface with two different memory types: NAND or SRAM/NOR. As it features two memory slots, this means that it can drive two memories, but they must be of the same type. When handling NAND devices, a hardware ECC engine is available to perform on-the-fly correction.

As only a single type of memory device can be plugged in at a time (either two SRAM/NORs or two NANDs), we don’t need to share a lot of controls with the SRAM/NOR controllers. So in the end the memory bus driver is almost an empty envelope that relies on the child controller driver to do the job.

Interactions with a memory device

On the CPU side, the controller has two main interfaces: APB and AXI.

The APB interface works like any regular interface: the CPU sees registers that it can access with diverse read and write operations, which will effectively read and write the content of the 32-bit registers located at the desired addresses. This is how the driver configures the device type, the timings, the possible ECC configuration and so forth. All the initial SMC configuration is done through the APB interface.

The AXI interface does not quite work like this. Instead of featuring a set of registers at a fixed address, in which the content of the command, address and data cycles would be written in order to be forwarded to the memory device, the AXI interface reserves a notable range in the addressable space. In particular, the offset targeted by an AXI write depends on the type of action that must be performed and on the content of the action:

  • When requesting the controller to send command and address cycles to the memory device, the datasheet refers to it as the “command phase”.
  • When doing I/Os, i.e. actually reading from/writing to the memory device, the datasheet calls this the “data phase”.

Both the command and data phases use regular AXI reads/writes, but the offsets and values differ from usual accesses.

Command phase

When the driver wants to send command cycles, it must perform one or two register writes. The address of the write operation in the AXI address space must target a specific offset. This offset encodes several pieces of information:

  • A specific bit is set to tell the SMC that it must enter a command phase.
  • Part of the offset is made of the shifted values of the different command opcodes for the memory device.
  • Part of the offset encodes the number of address cycles to perform on the NAND bus.

The payload of the AXI write contains the value of the address cycles that should be forwarded to the memory device. If there are more than 4 address cycles (which is quite common today), then a second AXI write containing the remaining address cycles as payload must happen at the same offset as before.

/*
 * Define the offset in the AXI address space where to write with:
 * - the bit indicating the command phase
 * - the number of address cycles
 * - the command opcode
 */
cmd_addr = PL35X_SMC_CMD_PHASE |
           PL35X_SMC_CMD_PHASE_NADDRS(naddr_cycles) |
           PL35X_SMC_CMD_PHASE_CMD0(NAND_CMD_XXXX);

/* Define the payload with the address bytes */
for (i = 0, row = 0; row < nrows; i++, row++) {
        [...]
        if (row < 4)
                addr1 |= PL35X_SMC_CMD_PHASE_ADDR(row, addr);
        else
                addr2 |= PL35X_SMC_CMD_PHASE_ADDR(row - 4, addr);
}

/* Send the command and address cycles */
writel(addr1, nfc->io_regs + cmd_addr);
if (naddr_cycles > 4)
        writel(addr2, nfc->io_regs + cmd_addr);

Data phase

The data phase is a bit easier to understand: several AXI reads or writes will be performed at a specific offset. The payload matches our expectations: it is actually the data that we want to read from or write to the device. However, the offset in the AXI address space is again a bit counter-intuitive:

  • It contains a specific bit, like for the command phase, to inform the controller that the data phase must be entered.
  • It also contains shifted values of different flags regarding the ECC configuration. The thing is, this offset will change at the end of the I/O operation, because the last chunk of data must always be handled differently, due to the ECC calculations that must be manually started. We end up reading or writing physically contiguous data by accessing two completely different offsets.

/* I/O transfers: simple case */
for (i = 0; i < buf_end; i++) {
        data_phase_addr = PL35X_SMC_DATA_PHASE;
        if (i + 1 == buf_end)
                data_phase_addr += PL35X_SMC_DATA_PHASE_ECC_LAST;

        writel(buf32[i], nfc->io_regs + data_phase_addr);
}

But what happens if a command cycle must be sent at the end of a data transfer (the typical case of a PAGE_WRITE)? While it would certainly be more logical to perform an additional command phase AXI write, merging the data and command phases on the last access was deemed more optimized. Here is what it looks like:

/* I/O transfers: less straightforward situation */
for (i = 0; i < buf_end; i++) {
        data_phase_addr = PL35X_SMC_DATA_PHASE;
        if (i + 1 == buf_end)
                data_phase_addr +=
                    PL35X_SMC_DATA_PHASE_ECC_LAST |
                    PL35X_SMC_CMD_PHASE_CMD1(NAND_CMD_PAGEPROG) |
                    PL35X_SMC_CMD_PHASE_CMD1_VALID;

        writel(buf32[i], nfc->io_regs + data_phase_addr);
}

Of course, none of this is highly unreadable, but at the very least, these accesses are quite uncommon.

A memory bus driver and a NAND controller driver

As explained earlier, this SMC controller can support different types of memories, and this has called for a Device Tree representation where the SMC controller is one node, and the memories connected to it are represented as sub-nodes. So, the Device Tree representation of the SMC controller, used with its NAND controller, looks like this:

    smcc: memory-controller@e000e000 {
      compatible = "arm,pl353-smc-r2p1", "arm,primecell";
      reg = <0xe000e000 0x0001000>;
      clock-names = "memclk", "apb_pclk";
      clocks = <&clkc 11>, <&clkc 44>;
      ranges = <0x0 0x0 0xe1000000 0x1000000 /* Nand CS region */
                0x1 0x0 0xe2000000 0x2000000 /* SRAM/NOR CS0 region */
                0x2 0x0 0xe4000000 0x2000000>; /* SRAM/NOR CS1 region */
      #address-cells = <2>;
      #size-cells = <1>;

      nfc0: nand-controller@0,0 {
        compatible = "arm,pl353-nand-r2p1";
        reg = <0 0 0x1000000>;
        #address-cells = <1>;
        #size-cells = <0>;
      };
    };

So, we first have a node for the SMC controller itself, memory-controller@e000e000, which will allow probing the memory bus driver located at drivers/memory/pl353-smc.c. This driver is very simple: it enables the clocks necessary for the SMC to work, and then it probes the first child device that matches either the cfi-flash or arm,pl353-nand-r2p1 compatible strings. In the latter case (which is illustrated in our example), the NAND controller driver at drivers/mtd/nand/raw/pl35x-nand-controller.c will be probed; it maps the two memory areas (accessed through APB and AXI) and accesses them to program the NAND controller.
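For reference, here is a heavily simplified sketch of the probe logic of the memory bus driver, loosely based on drivers/memory/pl353-smc.c (clock handling and error paths omitted; pl353_smc_supported_children stands for the driver’s match table):

/* Heavily simplified sketch: enable the SMC, then create the platform
 * device for the first supported child node (NAND or CFI flash); the
 * child's own driver then does the real work. */
static int pl353_smc_probe(struct amba_device *adev, const struct amba_id *id)
{
        struct device_node *child;

        /* ... enable the "memclk" and APB clocks here ... */

        for_each_available_child_of_node(adev->dev.of_node, child) {
                if (!of_match_node(pl353_smc_supported_children, child))
                        continue;

                return of_platform_device_create(child, NULL, &adev->dev) ?
                       0 : -ENODEV;
        }

        return -ENODEV;
}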

Now in the mainline Linux kernel

Starting from the latest version posted by Xilinx, Miquèl Raynal, Bootlin’s NAND/MTD expert, performed a massive cleanup of the memory bus driver and the NAND controller driver, entirely rewrote the binding file (in YAML schema!) and, three versions later, with the support of Xilinx engineers and the acknowledgements of Rob Herring and Krzysztof Kozlowski, managed to finally close the story. The driver is now part of Linux 5.14-rc, and will therefore be in the final Linux 5.14 release in a few weeks!

GPIO Aggregator, a virtual gpio chip

GPIOs are obviously widely used in embedded systems, and many of them are typically driven directly by Linux kernel drivers for interrupt lines, reset lines, or other control lines used to connect with various peripherals. However, a number of GPIOs are sometimes directly driven by user-space applications. Historically, the Linux kernel has provided a sysfs interface, in /sys/class/gpio to allow such direct control. But in recent years, this sysfs interface has been superseded by a new user-space interface based on /dev/gpiochip* character devices.

This new interface has numerous advantages over the previous /sys/class/gpio interface. However, one drawback is that it creates one device file per GPIO chip, which means that access rights are defined per GPIO chip, and not per GPIO.

For this reason, in Linux 5.8, Geert Uytterhoeven contributed the GPIO aggregator mechanism. It allows grouping a number of GPIOs into a virtual GPIO chip, visible as an additional /dev/gpiochip*. Its documentation can be found in Documentation/admin-guide/gpio/gpio-aggregator.rst.

The list of GPIOs that are part of this new virtual GPIO chip is defined in the Device Tree. Another interesting thing is that, as with any GPIO controller, the lines can be named, and then queried by user-space applications based on their name, using the libgpiod library.

Configuration and Device Tree description

To have GPIO Aggregator support in your kernel, simply configure:

CONFIG_GPIO_AGGREGATOR=y

Add a gpio-aggregator node in your Device Tree source. For instance, the following DTS snippet declares an aggregator with several GPIO lines:

gpio-aggregator {
    pinctrl-names = "default";
    pinctrl-0 = <&gpio_pins>;
    compatible = "gpio-aggregator";

    gpios = <&gpio3 4 GPIO_ACTIVE_HIGH>,
            <&gpio2 4 GPIO_ACTIVE_HIGH>,
            <&gpio1 28 GPIO_ACTIVE_HIGH>,
            <&gpio1 29 GPIO_ACTIVE_HIGH>,
            <&gpio2 0 GPIO_ACTIVE_HIGH>,
            <&gpio2 1 GPIO_ACTIVE_HIGH>,
            <&gpio3 8 GPIO_ACTIVE_HIGH>;

    gpio-line-names = "line_a", "line_b", "line_c", "line_d",
            "line_e", "line_f", "line_g";
};

In this example, line 4 of gpio controller gpio3 is used and is named “line_a”, line 4 of gpio controller gpio2 is used and is named “line_b”, and so on up to line 8 of gpio controller gpio3.

Usage from user-space

From userspace we can see the GPIO chip and its aggregated lines:

# gpioinfo
...
gpiochip6 - 7 lines:
    line 0: "line_a" unused input active-high
    line 1: "line_b" unused input active-high
    line 2: "line_c" unused input active-high
    line 3: "line_d" unused input active-high
    line 4: "line_e" unused input active-high
    line 5: "line_f" unused input active-high
    line 6: "line_g" unused input active-high
...

We can look up the gpio chip and line number corresponding to a line name:

# gpiofind 'line_b'
gpiochip6 1

We can access a GPIO line by its name:

# gpioget $(gpiofind 'line_b')
1
#
# gpioset $(gpiofind 'line_e')=1
# gpioset $(gpiofind 'line_e')=0
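The same lookup by name can be done programmatically. Here is a minimal sketch using the libgpiod v1 C API, with error handling reduced to the bare minimum:

#include <gpiod.h>
#include <stdio.h>

int main(void)
{
        struct gpiod_line *line;
        int value;

        /* Find the line by the name given in gpio-line-names */
        line = gpiod_line_find("line_b");
        if (!line)
                return 1;

        /* Request it as an input and read it, like gpioget does */
        if (gpiod_line_request_input(line, "example") < 0)
                return 1;

        value = gpiod_line_get_value(line);
        printf("line_b = %d\n", value);

        gpiod_line_close_chip(line);
        return 0;
}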

We can change the GPIO chip device file ownership to allow a user or group to access the attached lines:

# ls -al /dev/gpiochip*
crw------- 1 root root 254, 0 Jan 1 00:00 /dev/gpiochip0
crw------- 1 root root 254, 1 Jan 1 00:00 /dev/gpiochip1
crw------- 1 root root 254, 2 Jan 1 00:00 /dev/gpiochip2
crw------- 1 root root 254, 3 Jan 1 00:00 /dev/gpiochip3
crw------- 1 root root 254, 4 Jan 1 00:00 /dev/gpiochip4
crw------- 1 root root 254, 5 Jan 1 00:00 /dev/gpiochip5
crw------- 1 root root 254, 6 Jan 1 00:00 /dev/gpiochip6
#
# chown root:users /dev/gpiochip6
# chmod 660 /dev/gpiochip6
# ls -al /dev/gpiochip*
crw------- 1 root root 254, 0 Jan 1 00:00 /dev/gpiochip0
crw------- 1 root root 254, 1 Jan 1 00:00 /dev/gpiochip1
crw------- 1 root root 254, 2 Jan 1 00:00 /dev/gpiochip2
crw------- 1 root root 254, 3 Jan 1 00:00 /dev/gpiochip3
crw------- 1 root root 254, 4 Jan 1 00:00 /dev/gpiochip4
crw------- 1 root root 254, 5 Jan 1 00:00 /dev/gpiochip5
crw-rw---- 1 root users 254, 6 Jan 1 00:00 /dev/gpiochip6
#

The GPIO chip created by the aggregator can be retrieved from sysfs:

# ls -1 /sys/bus/platform/devices/gpio-aggregator/
driver
driver_override
gpio
gpiochip6
modalias
of_node
power
subsystem
uevent
# 
# cat /sys/bus/platform/devices/gpio-aggregator/gpiochip6/dev
254:6
#

Conclusion

A GPIO Aggregator can be used to group a subset of GPIO lines, name them, access them by their names and manage access control to the virtual gpio chip created by the aggregator. On an embedded system, this can simplify the management and usage of individual GPIO lines.

Bootlin contributions to Linux 5.12

Yes, Linux 5.13 was released yesterday, but we never published the blog post detailing our contributions to Linux 5.12, so let’s do this now! First of all the usual links to the excellent LWN.net articles on the 5.12 merge window: part 1 and part 2.

LWN.net also published an article with Linux 5.12 development statistics, and two Bootlin engineers made their way into the statistics: Alexandre Belloni in the list of top contributors by number of changesets, with 69 commits, and Paul Kocialkowski in the list of top contributors by number of changed lines, with over 6000 lines changed.

Here are the highlights of our contributions:

  • Addition of a new driver for the Silvaco I3C master controller. This was contributed by Miquèl Raynal, who became the maintainer for this driver. Bootlin has pioneered support for I3C in Linux, by introducing the complete drivers/i3c subsystem a few years ago, together with the first controller driver, for a Cadence IP, see our blog post from 2018.
  • Addition of two new camera sensor drivers, one for the Omnivision OV5648 and another for the Omnivision OV8865. These were contributed by Paul Kocialkowski.
  • Implementation of mqprio support in the Marvell Ethernet controller driver mvneta, see this commit. As explained in the tc-mqprio man page, the MQPRIO qdisc is a simple queuing discipline that allows mapping traffic flows to hardware queue ranges using priorities and a configurable priority to traffic class mapping. This was contributed by Maxime Chevallier.
  • Improvements in the IIO driver for the ms58xx family of sensors, contributed by Alexandre Belloni.
  • The final removal of the atmel_tclib code, which has been replaced by proper drivers for the TCB timers on Atmel/Microchip ARM platforms over the past few releases, also by Alexandre Belloni.
  • As usual, a large amount of fixes and improvements in the RTC subsystem, by its maintainer Alexandre Belloni.


LTP: Linux Test Project, Bootlin contributions

Introduction

The Linux Test Project is a project that develops and maintains a large test suite helping to validate the reliability, robustness and stability of the Linux kernel and related features. LTP has mainly been developed by companies such as IBM, Cisco, Fujitsu, SUSE and RedHat, with a focus on desktop distributions.

On the embedded side, both the openembedded-core Yocto layer and Buildroot have packages that allow using LTP on embedded targets. However, for a recent project, we tried in practice to run the full LTP test suite on an i.MX8 based platform running a Linux system built with Yocto. It turned out that LTP was apparently not very often tested on Busybox-based embedded systems, and we faced a number of issues. In addition to reporting various bugs/issues to the upstream LTP project, we also contributed a number of fixes and improvements.

Our contributions received a very warm welcome in the LTP community, which turned out to be very open and responsive. We hope that these contributions will encourage others to use LTP, and help make sure it continues to work well on embedded platforms.

Quick start guide

At the time of this writing, LTP has more than 3800 tests written by the community, including about 1000 network-related tests. The tests are grouped together in categories described by files in the runtest/ folder. Based on this, two scenarios of tests are defined, default and network, which are described by two files in the scenario_groups/ folder. These two scenarios simply list the categories of tests that need to be executed.

Here are the contents of the default and network scenario files:

$ cat scenario_groups/default 
syscalls
fs
fs_perms_simple
fsx
dio
io
mm
ipc
sched
math
nptl
pty
containers
fs_bind
controllers
filecaps
cap_bounds
fcntl-locktests
connectors
power_management_tests
hugetlb
commands
hyperthreading
can
cpuhotplug
net.ipv6_lib
input
cve
crypto
kernel_misc
uevent
$ cat scenario_groups/network 
can
net.features
net.ipv6
net.ipv6_lib
net.tcp_cmds
net.multicast
net.rpc
net.nfs
net.rpc_tests
net.tirpc_tests
net.sctp
net_stress.appl
net_stress.broken_ip
net_stress.interface
net_stress.ipsec_dccp
net_stress.ipsec_icmp
net_stress.ipsec_sctp
net_stress.ipsec_tcp
net_stress.ipsec_udp
net_stress.multicast
net_stress.route

Once you have LTP built and installed on your board thanks to the appropriate OpenEmbedded or Buildroot package, you can run these two test scenarios with the following commands (-n selects the network one):

$ cd /opt/ltp
$ ./runltp
$ ./runltp -n

Then take a look at the contents of the result and output directories.

For more information on building or running LTP please read this readme.

Bootlin contributions to Linux 5.11

Linux 5.11 was released quite some time ago now, but it’s never too late to have a look at Bootlin contributions to this release. As usual, we recommend reading the LWN articles on the 5.11 merge window: part 1 and part 2. Also of interest is the KernelNewbies page for 5.11.

Here are the main highlights of our contributions:

  • Alexandre Belloni, as the maintainer of the RTC subsystem, continued making numerous improvements and fixes to RTC drivers
  • On the support for Microchip ARM platforms, Alexandre Belloni switched the PWM atmel-tcb driver to a new Device Tree binding and added SAMA5D2 support, made some improvements to the IIO driver for the Microchip ADC, and continued to remove platform_data support from Microchip drivers, as all platforms are now converted to the Device Tree.
  • Alexandre Belloni contributed a new Simple Audio Mux driver for the ALSA subsystem, which can be used to control simple audio multiplexers driven using GPIOs, selecting which of their input lines is connected to the output line.
  • Grégory Clement added support for several new MIPS platforms from Microchip: Luton, Serval and Jaguar2. All those platforms include a MIPS core, a few peripherals and more importantly an Ethernet switch. For now the support only includes the base platform support, but we are working on the switchdev driver for the Ethernet switch.
  • Miquèl Raynal, maintainer of the NAND subsystem and co-maintainer of the MTD subsystem, contributed numerous changes to the ECC support in the MTD subsystem, making it more generic so that it can be used not just for parallel NAND flashes, but also SPI NAND flashes. For more details, see the talk from Miquèl Raynal on this topic.

In addition to those 95 patches that we authored and contributed, several Bootlin engineers being maintainers of different subsystems of the Linux kernel reviewed and merged patches from other contributors:

  • Miquèl Raynal, as the NAND maintainer and MTD co-maintainer, reviewed and merged 67 patches from other contributors
  • Alexandre Belloni, as the RTC, I3C and Microchip ARM/MIPS platforms maintainer, reviewed and merged 47 patches from other contributors
  • Grégory Clement, as the Marvell EBU platform co-maintainer, reviewed and merged 33 patches from other contributors


Large Page Support for NAS systems on 32 bit ARM

The need for large page support on 32 bit ARM

Storage space has become more and more affordable, to the point that it is now possible to have multiple hard drives of dozens of terabytes in a single consumer-grade device. With a few 10 TiB hard drives and thanks to RAID technology, storage capacities that exceed 16 or 32 TiB can easily be reached, at a relatively low cost.

However, a number of consumer NAS systems used in the field today are still based on 32 bit ARM processors. The problem is that, with Linux on a 32 bit system, it’s only possible to address up to 16 TiB of storage space. This is still true even with the ext4 filesystem, even though it uses 64 bit pointers.

We were lucky to have a customer contract us to update older Large Page Support patches to a recent version of the Linux kernel. This set of patches is one way of overcoming the 16 TiB limitation on ARM 32-bit systems. Since updating this patch series was a non-trivial task, we are happy to share the results of our efforts with the community, both through this blog post and through a patch series we posted to the Linux ARM kernel mailing list: ARM: Add support for large kernel page (from 8K to 64K).

How Large Page Support works

The 16 TiB limitation comes from the use of page->index, which is a pgoff_t offset type corresponding to unsigned long. This limits us to 32-bit page offsets, so with 4 KiB physical pages, we end up with a maximum of 2^32 pages × 4 KiB per page = 16 TiB. A way to address this limitation is to use larger physical pages. We can reach 32 TiB with 8 KiB pages, 64 TiB with 16 KiB pages and up to 256 TiB with 64 KiB pages.

Before going further, the ARM32 Page Tables article from Linus Walleij is a good reference to understand how the Linux kernel deals with ARM32 page tables. In our case, we are only going to cover the non-LPAE case. As explained there, the way the Linux kernel sees the page tables actually doesn’t match reality. First, the kernel deals with 4 levels of page tables while on hardware there are only 2 levels. In addition, while the ARM32 hardware stores only 256 PTEs in Page Tables, taking up only 1 KB, Linux optimizes things by storing in each 4 KB page two sets of 256 PTEs, and two sets of shadow PTEs that are used to store additional metadata needed by Linux about each page (such as the dirty and accessed/young bits). So, there is already some magic between what is presented to the Linux virtual memory management subsystem, and what is really programmed into the hardware page tables. To support large pages, the idea is to go further in this direction by emulating larger physical pages.

Our series (and especially patch 5: ARM: Add large kernel page support) proposes to pretend to have larger hardware pages. The ARM 32-bit architecture only supports 4 KiB or 64 KiB page sizes, but we would like to support the intermediate values of 8 KiB, 16 KiB and 32 KiB as well. So what we do to support 8 KiB pages is that we tell Linux the hardware has 8 KiB pages, while in fact we simply use two consecutive 4 KiB pages at the hardware level, which we manipulate and configure simultaneously. To support 16 KiB pages, we use 4 consecutive 4 KiB pages, for 32 KiB pages, we use 8 consecutive pages, etc. So really, we “emulate” having larger page sizes by grouping 2, 4 or 8 pages together. Adding this feature only required a few changes in the code, mainly dealing with ranges of pages everywhere we were previously dealing with a single page. Actually, most of the code in the series is about making it possible to modify the hard-coded value of the hardware page size, and fixing the assumptions associated with such a fixed value.
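As a side note, the page size the kernel uses boils down to a single constant from which everything else is derived. Here is a simplified sketch of the relevant definitions; the upstream ones live in arch/arm/include/asm/page.h:

/* The kernel derives the page size from PAGE_SHIFT: 12 gives the usual
 * 4 KiB pages. Emulating 8 KiB pages essentially means raising this
 * shift to 13, with the kernel grouping two 4 KiB hardware pages
 * underneath. (Simplified from arch/arm/include/asm/page.h.) */
#define PAGE_SHIFT      12
#define PAGE_SIZE       (1UL << PAGE_SHIFT)
#define PAGE_MASK       (~(PAGE_SIZE - 1))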

In addition to this emulated mechanism that we provide for 8 KiB, 16 KiB, 32 KiB and 64 KiB pages, we also added support for using real hardware 64 KiB pages as part of this patch series.

Overall the number of changes is very limited (271 lines added, 13 lines removed), and allows using much larger storage devices. Here is the diffstat of the full patch series:

 arch/arm/include/asm/elf.h                  |  2 +-
 arch/arm/include/asm/fixmap.h               |  3 +-
 arch/arm/include/asm/page.h                 | 12 ++++
 arch/arm/include/asm/pgtable-2level-hwdef.h |  8 +++
 arch/arm/include/asm/pgtable-2level.h       |  6 +-
 arch/arm/include/asm/pgtable.h              |  4 ++
 arch/arm/include/asm/shmparam.h             |  4 ++
 arch/arm/include/asm/tlbflush.h             | 21 +++++-
 arch/arm/kernel/entry-common.S              | 13 ++++
 arch/arm/kernel/traps.c                     | 10 +++
 arch/arm/mm/Kconfig                         | 72 +++++++++++++++++++++
 arch/arm/mm/fault.c                         | 19 ++++++
 arch/arm/mm/mmu.c                           | 22 ++++++-
 arch/arm/mm/pgd.c                           |  2 +
 arch/arm/mm/proc-v7-2level.S                | 72 ++++++++++++++++++++-
 arch/arm/mm/tlb-v7.S                        | 14 +++-
 16 files changed, 271 insertions(+), 13 deletions(-)

This patch series is running in production now on some NAS devices from a very popular NAS brand.

Limitations and alternatives

The submission of our patch series is recent but this feature has actually been running for years on many NAS systems in the field. Our new series is based on the original patchset, with the purpose of submitting it to the mainline kernel community. However, there is little chance that it will ever be merged into the mainline kernel.

The main drawback of this approach is large pages themselves: as each file in the page cache uses at least one page, the memory wasted increases as the size of the pages increases. For this reason, Linus Torvalds was against similar series proposed in the past.

To show how much memory is wasted, Arnd Bergmann ran some numbers to measure the page cache overhead for a typical set of files (Linux 5.7 kernel sources) for 5 different page sizes:

Page size (KiB)            4          8          16         32         64
Page cache usage (MiB)     1,023.26   1,209.54   1,628.39   2,557.31   4,550.88
Factor over 4 KiB pages    1.00x      1.18x      1.59x      2.50x      4.45x

We can see that while a factor of 1.18 is acceptable for 8 KiB pages, a 4.45 multiplier looks excessive with 64 KiB pages.

Actually, to make it possible to address large volumes on 32-bit ARM, another solution was pointed out during the review of our series. Instead of using larger pages, which have an impact on the entire system, an alternative is to modify the way the filesystem addresses memory, by using 64-bit pgoff_t offsets. This has already been implemented in vendor kernels running on some NAS systems, but it has never been submitted to mainline developers.