Linux 6.12 released, Bootlin contributions inside

Linux 6.12 was released this past weekend, pretty much as expected after 7 release candidates. As usual, we recommend that our readers go through the amazing LWN.net articles covering the 6.12 merge window (part 1, part 2) to get a high-level overview of the major new features and improvements in this 6.12 release. One of the prominent improvements in this release, as far as the embedded industry is concerned, is obviously the merge of the final bits enabling PREEMPT_RT… which has already led us to update our Real-Time Linux with PREEMPT_RT training course.

As usual, Bootlin contributed to this release: with 118 commits merged, we are once again among the top 20 contributing companies! In addition to contributing our own code, several of our engineers are also maintainers, and as part of that role they review and merge patches from other contributors. For this 6.12 release:

  • Alexandre Belloni, as the RTC and I3C subsystems maintainer, merged 29 patches from other contributors
  • Miquèl Raynal, as the MTD subsystem co-maintainer, merged 24 patches from other contributors
  • Grégory Clement, as the Marvell platform maintainer, merged 4 patches from other contributors

Overall, 13 active Bootlin engineers made contributions to this release. With a total staff of 24 people, this means that more than half of our team contributed to the Linux kernel for 6.12, a good indication of our strong focus on Linux kernel development and upstreaming!

Now, looking at our contributions to 6.12 from a high-level perspective:

  • Hervé Codina was our most prolific contributor, with 38 patches. Hervé put the final touches on a massive effort to support a very complex audio topology on fairly old Freescale/NXP platforms. These last bits of work focused on improving the drivers he had submitted to drivers/soc/fsl/qe to also support the MPC8312 processor, which has a slightly different QMC hardware block. We hope to soon publish a very detailed blog post on this work, which spanned several years and finally came to an end with these last patches!
  • Maxime Chevallier was the second most prolific contributor, with 25 patches merged. He mostly got his Ethernet PHY topology work merged, which he had presented at Netdev last year and at Linux Plumbers this year. He also converted the fs_enet Ethernet MAC driver to the phylink API.
  • Miquèl Raynal was (as usual) active in the MTD subsystem, where he mostly focused on adding continuous read support to the SPI NAND subsystem. Continuous read is an optimization that improves performance when reading from contiguous pages. The cover letter states: overall the speed gains are impressive: up to 45% when reading a block in 1-1-4 mode, where a substantial time is lost waiting for the chip to be ready.
  • Luca Ceresoli got his improvements to the dapm-graph and decode_stacktrace.sh tools merged.
  • Thomas Bonnefille, who worked as an engineering intern at Bootlin and recently joined as a full-time employee, saw his very first Linux kernel driver merged, an IIO driver for the ADC found in Sophgo RISC-V processors: drivers/iio/adc/sophgo-cv1800b-adc.c.
  • Thomas Richard improved power management support in the PCI code used on Texas Instruments platforms.
  • Théo Lebrun managed to get the pinctrl and reset drivers for the Mobileye EyeQ MIPS processors merged. He also worked with Thomas Richard on the Texas Instruments PCI power management effort.
  • Alexis Lothoré did quite a bit of work on eBPF tests, funded by the eBPF Foundation: he converted test_xdp_veth, test_dev_cgroup, get_current_cgroup_id_user, test_cgroup_storage and test_skb_cgroup_id_user to the test_progs infrastructure.
  • Romain Gantois made a single contribution, but it adds a new feature to the TI TLV320AIC31xx audio codec driver: the ability to load filter coefficients from a firmware file, making it possible to fine-tune the codec configuration.
  • Alexandre Belloni, Bastien Curutchet, Köry Maincent and Louis Chauvet also got a small number of patches merged, mostly fixes or small improvements. A patch from Clément Léger, who left Bootlin more than a year ago, also made its way upstream.

Here are the complete details of our 118 contributions to this kernel release:

Back from Netdev 0x17

At Bootlin, we focus on Embedded Linux development and support, and these embedded devices often have a network interface, be it an Ethernet port, a wireless chip or some other kind of communication channel that falls under the Linux networking stack’s framework.

So it’s always interesting to see what the rest of the community is working on, and meet in real life people we interact with on the netdev mailing list.

That’s why this year, Alexis Lothoré and Maxime Chevallier flew to Vancouver to participate in the Netdev conference, a 5-day event organised by the Netdev Society, a small non-profit run by volunteers dedicated to holding this event.

Bootlin at Netdev

Most talks at Netdev do not directly cover topics we’re actively working on, but it’s always refreshing to see new exciting technologies that could trickle down to the embedded world a few years from now. It is also always interesting to stay up to date on the challenges encountered by other parts of the networking industry, at scales very different from the ones we are used to.

We learned for example what CXL is about, what it brings, and the efforts being made to design new networking hardware around this technology to change the way we think about datacenter networking.

When we attended Netdev 0x13 in 2019, QUIC was one of the hot topics. This year, Homa was under the spotlight with talks on what it is, and how this new protocol could address some of TCP’s problems.

Like in all previous editions, we learned about the progress made on TC and its future, new ways of bypassing the kernel stack, BPF integration in the kernel, and XDP, which keeps getting more and more powerful.

Another hot topic in the kernel is the introduction of the Rust language, and the network subsystem is a pretty relevant target for the new features brought by the language. As a consequence, Rust subsystem maintainers Miguel Ojeda and Wedson Almeida Filho gave an overview of the benefits of Rust compared to traditional C code, and then showed a step-by-step implementation of a kernel-side TCP server module. While this example is not perfectly representative of the classic network-related drivers we usually write, it was a nice showcase of the current state of kernel API abstractions in Rust.

We also discovered the new use-case that is now driving most datacenter networking efforts, which is, without surprise, AI and Machine Learning. It turns out that if you want your ChatGPT to answer with up-to-date replies without having to wait for too long, you need a powerful and well-organized datacenter for the training part, and networking engineering plays a big part in keeping all those GPUs fed at a relevant pace.

This led to the devmem TCP effort, which felt a bit familiar to us as it uses dma-buf, which we also sometimes use in multimedia pipelines. The ML and AI topic was introduced to us by the wonderful keynote session given by Manya Ghobadi, who captivated the audience with how AI and ML work, what AI workloads require in terms of network traffic scheduling and datacenter topology, and computing hardware that uses optical computing.

On the final day, we even had a visit from Jakub Kicinski (one of the co-maintainers of the Linux networking tree), who presented what he had been working on and gave us an update on the netdev development statistics (his main point being that we do need to review more patches).

For the first time, there was a talk from Bootlin at Netdev, as Maxime presented one of the topics he’s been working on lately: improving multi-PHY and multi-port interfaces support. Although it was one of the only talks focusing on the low-level aspects of the Ethernet stack, it triggered some discussions and interest from the community, which will help further improve the ongoing work.

The slides and videos of the event will be published at some point in the future; we will of course mention it to our readers when they become available.

We’ll conclude this short feedback by thanking once again the Netdev Board members, organizers, speakers and the audience for this great event.

We’ll come back 🙂

Multi-queue improvements in Linux kernel Ethernet driver mvneta

In the past months, the Linux kernel driver for the Ethernet MAC found in a number of Marvell SoCs, mvneta, has seen quite a few improvements. Lorenzo Bianconi brought support for XDP operations on non-linear buffers, a follow-up to the already-great XDP support that offers very nice performance on that controller. Russell King contributed improved, more generic and easier to maintain phylink support, to deal with the variety of embedded use-cases.

At that point, it’s getting difficult to squeeze more performance out of this controller. However, we still have some tricks we can use to improve some use-cases, so in the past months we’ve worked on implementing QoS features in mvneta, through the use of mqprio.

Multi-queue support

A simple Ethernet NIC (Network Interface Controller) could be described as a controller that handles some protocol-level aspect of the Ethernet standard, where the incoming and outgoing packets would be enqueued in a dedicated ring-buffer that serves as an interface between the controller and the kernel.

The buffer containing packets that need to be sent is called the Transmit Queue, often written txq. It’s fed by the kernel, where the NIC driver converts struct sk_buff objects, called skbs, which represent packets throughout their journey in the kernel, into buffers that are enqueued in the txq.

On the ingress side, the Receive Queue, written rxq, is fed by the MAC and dequeued by the NIC driver, which creates skbs containing the payload and attached metadata.

In the end, every incoming or outgoing packet is enqueued and dequeued in a typical first-in first-out fashion. When the queue is full, packets are dropped, everything is neat, simple and deterministic.

However, typical use-cases are never simple. Most modern NICs have several queues in TX and RX. On the receive side, it’s useful to have multiple queues for performance reasons. When receiving packets, we can spread the traffic across multiple queues (with RSS for example), and have one CPU core dedicated to each queue. More complex use-cases can be expressed, such as flow steering, which you can learn about in the kernel documentation.

On the transmit side, it’s a bit less obvious why we want multiple queues. If the CPU is the bottleneck in terms of performance, having multiple TX queues won’t help much. Still, there are ways to benefit from multiple TX queues on a multi-CPU system, with XPS.

However, when the line rate is the bottleneck, having multiple queues becomes useful. By sorting the outgoing traffic by priority and assigning each priority to a TX queue, the NIC can pick the next packet to send from the highest-priority queue until it’s empty, and then move on to the next priority. That way, we implement what’s called QoS (Quality of Service), where we care about high-priority traffic making it through the controller.

QoS itself is useful for various reasons. Besides the obvious prioritization of high-value streams for not-so-neutral networking, we can also imagine tagging management traffic as high-priority, to guarantee the ability to remotely access a machine under heavy traffic.

Other applications are in the realm of Time Sensitive Networking, where it’s important that time-sensitive traffic gets sent right away, using the high-priority queues as shortcuts to bypass the low-priority, overcrowded queues.

NICs can use more advanced algorithms to pick the queue to dequeue the packet from, in a process called arbitration, to give low-priority traffic a chance to get sent even when high-priority traffic takes most of the bandwidth.

Typical algorithms range from strict priority-based FIFO to more flexible Weighted Round-Robin arbitration, which allows at least a few low-priority packets to leave the NIC.
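As a rough illustration of the difference between the two schemes, here is a small, purely software model (with arbitrary queue counts, backlogs and weights; in practice the decision is made by the NIC hardware):

/* Simplified software model of two TX arbitration schemes, for
 * illustration only: the real arbitration happens inside the NIC. */
#include <stdio.h>

#define NUM_QUEUES 8

/* Number of packets waiting in each TX queue; by convention here,
 * queue 0 is the highest priority. Values are arbitrary. */
static int backlog[NUM_QUEUES] = { 2, 0, 0, 5, 0, 0, 0, 30 };

/* Strict priority: always serve the highest-priority non-empty queue. */
static int pick_strict(void)
{
    for (int q = 0; q < NUM_QUEUES; q++)
        if (backlog[q] > 0)
            return q;
    return -1; /* nothing left to send */
}

/* Weighted round-robin: in each round, serve up to weight[q] packets
 * from queue q, so low-priority queues still get some bandwidth even
 * when high-priority queues are busy. */
static void wrr_round(void)
{
    static const int weight[NUM_QUEUES] = { 8, 7, 6, 5, 4, 3, 2, 1 };

    for (int q = 0; q < NUM_QUEUES; q++)
        for (int budget = weight[q]; budget > 0 && backlog[q] > 0; budget--) {
            backlog[q]--;
            printf("WRR: dequeued one packet from queue %d\n", q);
        }
}

int main(void)
{
    printf("strict priority would serve queue %d first\n", pick_strict());
    wrr_round();
    return 0;
}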

In the case of mvneta, we’re limited in terms of what we can do in the receive path. Most of the work we’ve done focused on the transmit side.

Traffic Prioritization and Arbitration

Multi-queue support on the TX path is a three-fold process.

First, we have to know which packets are high-priority and which aren’t, by assigning a value to the skb->priority field.

There are several ways of doing so: using iptables, nftables, tc, socket parameters through SO_PRIORITY, or switching parameters. This is outside the scope of this article; we’ll assume that packets entering the network stack are already prioritized through one of the above mechanisms.
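For illustration, the SO_PRIORITY mechanism is the simplest to demonstrate: the minimal userspace sketch below (the priority value is an arbitrary example) makes every packet sent through the socket carry the chosen value in skb->priority.

/* Minimal example: tag all traffic sent on this socket with priority 3,
 * so that the kernel sets skb->priority accordingly (Linux-specific). */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int prio = 3; /* arbitrary example priority */
    int sock = socket(AF_INET, SOCK_DGRAM, 0);

    if (sock < 0) {
        perror("socket");
        return 1;
    }

    if (setsockopt(sock, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio)) < 0) {
        perror("setsockopt(SO_PRIORITY)");
        return 1;
    }

    /* ... send() / sendto() as usual: outgoing skbs now carry priority 3 ... */

    close(sock);
    return 0;
}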

Once a packet enters the network stack as an skb, we need to assign it to a TX queue. That’s where tc-mqprio comes into play.

In that second step, we build a mapping between priorities and queues. This mapping is done through an intermediate level, the Traffic Classes. A Traffic Class is a mapping between N priorities and M transmit queues. We therefore have two levels of mapping to configure (illustrated by the small sketch after the list):

  1. The priority to Traffic Class mapping (N to 1 mapping)
  2. The Traffic Class to TX queue mapping (1 to M mapping)
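
To make these two levels concrete, here is a tiny standalone sketch (with arbitrary example tables, not values taken from a real driver) that resolves a packet priority to its traffic class and then to its queue range:

/* Illustrative only: resolve a packet priority to its TX queue range
 * through the two mqprio mappings (priority -> traffic class -> queues).
 * The tables below are arbitrary examples. */
#include <stdio.h>

#define NUM_PRIO 8
#define NUM_TC   8

/* N-to-1 mapping: priority -> traffic class */
static const int prio_tc_map[NUM_PRIO] = { 0, 1, 2, 3, 4, 5, 6, 7 };

/* 1-to-M mapping: traffic class -> contiguous queue range (count@offset) */
static const int tc_queue_count[NUM_TC]  = { 1, 1, 1, 1, 1, 1, 1, 1 };
static const int tc_queue_offset[NUM_TC] = { 0, 1, 2, 3, 4, 5, 6, 7 };

int main(void)
{
    for (int prio = 0; prio < NUM_PRIO; prio++) {
        int tc = prio_tc_map[prio];

        printf("priority %d -> class %d -> %d queue(s) starting at queue %d\n",
               prio, tc, tc_queue_count[tc], tc_queue_offset[tc]);
    }
    return 0;
}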

All of this is done in one single tc command. We’re not going to dive too deep into the tc tool itself, but we’ll still see a bit of how tc-mqprio works.

Here’s an example:

tc qdisc add dev eth0 parent root handle 100 mqprio num_tc 8 map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 1

Let’s break this down a bit.

Queuing Disciplines (QDiscs) are algorithms used to select how and when traffic is enqueued and dequeued on a NIC. They benefit from great software support, but can also be offloaded to hardware, as we’ll see.

So the first part of the command is about attaching the mqprio QDisc to the network interface:

tc qdisc add dev eth0 parent root handle 100 mqprio

After that, we configure the traffic classes. Here, we create 8 traffic classes:

num_tc 8

Then we map each priority to a class. In this list, the n-th element is the class assigned to priority n. Here, we map each of the first 8 priorities to a dedicated class. Priorities that aren’t assigned a class are mapped to class 0.

map 0 1 2 3 4 5 6 7

Finally, we map each class to a queue range. Classes can only be assigned a contiguous set of queues, so the only parameters needed for the mapping are the number of queues assigned to the class (the number before the @) and the offset in the queue list (after the @). We specify one mapping per class, the m-th element in the list being the mapping for class number m. In the following example, we simply assign one queue per traffic class:

queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7

Under the hood, tc-mqprio will actually assign one QDisc per queue and map the Traffic Classes to these QDiscs, so that we can still hook other tc configurations on a per-queue basis.

Finally, we enable hardware offloading of the prioritization:

hw 1

The last part of the work consists in configuring the hardware itself to correctly prioritize each queue. That’s done by the NIC driver, which gets the configuration from the tc hooks.

If we take a look at the hardware, we see that the queues are exposed to the kernel, which enqueues packets into them. Each queue then has a Bandwidth Limiter, followed by the arbitration layer. Finally, we have one last global Rate Limiter, and the path then continues to the egress blocks of the controller.

The arbiter configuration is easy: it’s just a matter of enabling strict priority arbitration in a few registers.
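
On the driver side, the entry point for this configuration is the ndo_setup_tc callback. The following is a simplified, hedged sketch of what such a handler can look like; it is not the actual mvneta code, and foo_set_queue_prio() / foo_set_queue_rate() are hypothetical helpers standing in for the register accesses described above.

/* Hedged sketch of a driver-side mqprio handler, not the actual mvneta
 * code. foo_set_queue_prio() and foo_set_queue_rate() are hypothetical
 * placeholders for the hardware register accesses. */
#include <linux/netdevice.h>
#include <net/pkt_cls.h>

static void foo_set_queue_prio(struct net_device *dev, u16 queue, u8 prio)
{
    /* arbiter registers: set the priority of this queue */
}

static void foo_set_queue_rate(struct net_device *dev, u16 queue, u64 max_rate)
{
    /* per-queue token bucket registers: program the rate limit */
}

static int foo_setup_mqprio(struct net_device *dev,
                            struct tc_mqprio_qopt_offload *mqprio)
{
    u8 num_tc = mqprio->qopt.num_tc;
    int tc;

    if (num_tc > dev->num_tx_queues)
        return -EINVAL;

    netdev_set_num_tc(dev, num_tc);

    for (tc = 0; tc < num_tc; tc++) {
        /* One contiguous range of queues per traffic class */
        netdev_set_tc_queue(dev, tc, mqprio->qopt.count[tc],
                            mqprio->qopt.offset[tc]);

        /* Program the queue priority and, when a bw_rlimit shaper is
         * requested, the per-queue rate limiter. */
        foo_set_queue_prio(dev, mqprio->qopt.offset[tc], tc);
        if (mqprio->shaper == TC_MQPRIO_SHAPER_BW_RATE)
            foo_set_queue_rate(dev, mqprio->qopt.offset[tc],
                               mqprio->max_rate[tc]);
    }

    return 0;
}

/* Wired into the driver's struct net_device_ops as .ndo_setup_tc */
static int foo_setup_tc(struct net_device *dev, enum tc_setup_type type,
                        void *type_data)
{
    switch (type) {
    case TC_SETUP_QDISC_MQPRIO:
        return foo_setup_mqprio(dev, type_data);
    default:
        return -EOPNOTSUPP;
    }
}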

Let’s summarize what the stack looks like:

Traffic shaping

Being able to prioritize outgoing traffic is good, but we can step it up by also limiting the rate on each queue. That way, we can neatly organize and control how we want to use our bandwidth, not at a per-packet level but on a bits/s basis.

Controlling the rate of individual flows or queues is called Traffic Shaping. When configuring a shaper, we focus not only on the desired maximum rate, but also on how we deal with traffic bursts. We can smooth them by tightly controlling how and when we send packets, or we can allow some burstiness by being more lenient on how often we enforce the rate limitation.

In the mvneta hardware, we have two levels of shaping: one level of per-queue shapers, and one per-port shaper. They can all be controlled semi-independently, some parameters being shared between all shapers.

With mqprio, the shapers are configured on a per-Traffic-Class basis. Since the hardware only allows per-queue shaping, we enforce in the driver that one traffic class is assigned to only one queue.

A final configuration with mqprio looks like this:

Most shapers use a variant of the Token Bucket Filter algorithm. The principle is the following:

Each queue has a Token Bucket, with a limited capacity. When a packet needs to be sent, one token per bit in the packet gets taken from the bucket. If the bucket is empty, transmission stops. The bucket then gets refilled at a configurable rate, with a configurable amount of tokens.
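
To make the principle concrete, here is a tiny userspace simulation of a token bucket (a sketch with arbitrary values, counting tokens in bits; this is not the hardware implementation):

/* Tiny token-bucket simulation, for illustration only. Tokens are
 * counted in bits, and all values are arbitrary examples. */
#include <stdbool.h>
#include <stdio.h>

struct token_bucket {
    long capacity;     /* bucket size ("burst"), in tokens (bits) */
    long tokens;       /* current fill level */
    long refill_value; /* tokens added at every refill tick */
};

/* Called at a fixed refill rate, e.g. from a periodic timer. */
static void tb_refill(struct token_bucket *tb)
{
    tb->tokens += tb->refill_value;
    if (tb->tokens > tb->capacity)
        tb->tokens = tb->capacity;
}

/* Returns true if the packet may be sent, consuming one token per bit. */
static bool tb_try_send(struct token_bucket *tb, long packet_bits)
{
    if (tb->tokens < packet_bits)
        return false; /* not enough tokens: transmission stops */
    tb->tokens -= packet_bits;
    return true;
}

int main(void)
{
    struct token_bucket tb = { .capacity = 24000, .tokens = 24000,
                               .refill_value = 4000 };
    long pkt = 1500 * 8; /* a 1500-byte packet, in bits */

    for (int i = 0; i < 6; i++) {
        printf("packet %d: %s (tokens left: %ld)\n", i,
               tb_try_send(&tb, pkt) ? "sent" : "delayed", tb.tokens);
        tb_refill(&tb);
    }
    return 0;
}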

Configuring the TBF for each queue boils down to setting a few parameters:

– The Bucket Size, sometimes called burst, buffer or maxburst.

It should be big enough to accommodate the required shaping rate, but not too big. If the bucket is too big and only very slow traffic is going out, the bucket will fill up to its full size. When you suddenly have a lot of traffic to send, you’ll first get a huge burst, while the bucket empties, before the traffic actually starts being limited. Unlike the tc-tbf interface, tc-mqprio doesn’t allow changing the burst size, so it’s up to the driver to set it.

– The Refill Rate, sometimes called tick, defining how often you refill the bucket.
– The Refill Value, defining how many tokens you put back into the bucket at each refill.

These two, combined together, define the actual rate limit. The approach taken is to fix one of the parameters and vary the other one to select the desired rate. In the case of mvneta, we chose to use a fixed refill rate and to change the refill value. This means that we have a limited resolution in the rates we can express: in our case, a 10 kbps resolution, allowing us to cover any rate limit between 10 kbps and 5 Gbps, in 10 kbps increments.

– One final parameter is the minburst or MTU parameter, which controls the minimum amount of tokens required to allow packet transmission.
Since transmission stops when the bucket is empty, we could end up with an empty bucket in the middle of a packet being transmitted. The link partner may not be too happy about that, so it’s common to set the Maximum Transmission Unit as the minimum amount of tokens required to trigger transmission.

That way, we are sure that when we start sending a packet, we’ll always have enough tokens in the bucket to send the full packet. We can play a bit with this parameter if we want to buffer small packets and send them in a short burst when we have enough. In our case, we simply configured it as the MTU, which is good enough for most cases.
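
To make the refill arithmetic above concrete, here is a tiny worked example. The 100 µs refill period is a made-up figure chosen purely for illustration (it is not the actual mvneta setting), but it does yield the 10 kbps granularity mentioned above:

/* Worked example of the refill arithmetic: with a fixed refill period,
 * rate = refill_value / period, so selecting a rate means selecting a
 * refill value, quantized to (1 token / period). All values here are
 * illustrative and do not reflect the real mvneta registers. */
#include <stdio.h>

int main(void)
{
    const long long refill_period_ns = 100000;   /* 100 us, fixed */
    const long long resolution_bps =
        1000000000LL / refill_period_ns;         /* 1 bit/period = 10 kbps */
    const long long target_bps = 500000000;      /* desired: 500 Mbps */

    /* Refill value: bits added to the bucket at every refill period. */
    long long refill_value = target_bps / resolution_bps;
    long long actual_bps = refill_value * resolution_bps;

    printf("resolution: %lld bps, refill value: %lld bits/period, actual rate: %lld bps\n",
           resolution_bps, refill_value, actual_bps);
    return 0;
}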

In the end, the gains are not necessarily measurable in terms of raw performance, but thanks to mqprio and traffic shaping, we can now have better control over the traffic that comes out of our interface.

An example configuration would be:

# Shaping and priority configuration
tc qdisc add dev eth0 parent root handle 100 mqprio \
num_tc 8 map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
hw 1 mode channel shaper bw_rlimit \
max_rate 1Gbit 500Mbit 250Mbit 125Mbit 50Mbit 25Mbit 25Mbit 25Mbit

# Assign HTTP traffic a priority of 1, to limit it to 500Mbps
iptables -t mangle -A POSTROUTING -p tcp --dport 80 -j CLASSIFY \
--set-class 0:1

There are tons of other use-cases, for example limiting per-VLAN speeds, or on the contrary making sure that traffic on a specific VLAN has the highest priority, with all of that mostly offloaded to the hardware itself!

Videos and slides of Bootlin presentations at FOSDEM 2021

The videos from Bootlin’s presentations earlier this month at FOSDEM 2021 are now publicly available. Once again, FOSDEM was a busy event, even if it was online for once. As in most technical conferences, Bootlin engineers volunteered to share their experience and research by giving two talks.

Maxime Chevallier – Network Performance in the Linux Kernel, Getting the most out of the Hardware

Abstract: The networking stack is one of the most complex and optimized subsystems in the Linux kernel, and for a good reason. Between the wide range of applications and the complexity and variety of networking hardware, getting good performance while keeping the stack easily usable from userspace has been a long-standing challenge.

Nowadays, complex Network Interface Controllers (NICs) can be found even on small embedded systems, bringing powerful features that were previously found only in the server world closer to day-to-day users.

This is a good opportunity to dive into the Linux Networking stack, to discover what is used to make networking as fast as possible, both by using all the features of the hardware and by implementing some clever software tricks.

In this talk, we cover these various techniques, ranging from simple batch processing with NAPI, queue management with RSS, RPS, XPS and so on, and flow steering and filtering with ethtool and TC, to finish with the newest big change that is XDP.

We dive into these various techniques and see how to configure them to squeeze the most out of your hardware, and discover that what was previously in the realm of datacenters and huge computers can now also be applied to embedded Linux development.

Here are PDF slides for this presentation.

Michael Opdenacker – Embedded Linux from Scratch in 45 minutes, on RISC-V

Abstract: Discover how to build your own embedded Linux system completely from scratch. In this presentation and tutorial, we show how to build a custom toolchain (Buildroot), bootloader (OpenSBI / U-Boot) and kernel (Linux) that one can run on a system with the new RISC-V open Instruction Set Architecture, emulated by QEMU. We also show how one can build a minimal root filesystem by oneself thanks to the BusyBox project. The presentation ends by showing how to control the system remotely through a tiny webserver. The approach is to provide only the files that are strictly necessary. That’s all the interest of embedded Linux: one can really control and understand everything that runs on the system, and see how simple the system can be. That’s much easier than trying to understand how a GNU/Linux system works from a distribution as complex as Debian!

The presentation also shares details about what’s specific to the RISC-V architecture, in particular about the various stages of the boot process. This presentation shares all the hardware (!), source code build instructions and demo binaries needed to reproduce everything at home, and add specific improvements. Most of the details are also useful to people using other hardware architectures (in particular arm and arm64).

It’s probably the first time a tutorial manages to show so many aspects of embedded Linux in less than an hour. See for yourself! At the very least, it’s certainly the first one demonstrating how to boot Linux from U-Boot in a RISC-V system emulated by QEMU.

Here are PDF slides for this presentation.

Feedback from the Netdev 0x13 conference

The Netdev 0x13 conference took place last week in Prague, Czech Republic. As we work on a variety of networking topics as part of our Linux kernel contributions, Bootlin engineers Maxime Chevallier and Antoine Ténart went to meet with the Linux networking community and to see a lot of interesting sessions. It’s the third time we’ve enjoyed attending the Netdev conference (after Netdev 2.1 and Netdev 2.2) and as always, it was a blast!

The 3-day conference started with a first day of workshops and tutorials. We enjoyed learning how to be the cool kids thanks to the XDP hands-on tutorial, where Jesper Brouer and Toke Høiland-Jørgensen cooked us a number of lessons to progressively learn how to write and load XDP programs. This was the first trial run of the tutorial, which is meant to be extended and used as material to go through the XDP basics. The instructions are all available on GitHub.

We then had the chance to attend the TC workshop, where face-to-face discussions and presentations of the traffic control hot topics currently being worked on took place. The session caught our attention, as the topic is related to subjects currently being worked on at Bootlin.

Being used to working on embedded systems, seeing the problems network developers face can sometimes come as a surprise. During the TC workshop, Vlad Buslov presented his recent work on removing TC flower’s dependency on the global rtnl lock, which is an issue when you have a million classification rules to update quickly.

We also went to the hardware offload workshop. The future of network offload APIs and support in the Linux kernel was discussed, with topics ranging from ASIC support to switchdev advanced use-cases or offloading XDP. This was very interesting to us, as we do work on various networking engines providing many offloading facilities to the kernel.

The next two days were a collection of talks presenting the recent advances in the networking subsystem of the Linux kernel, as well as current issues and real-world examples of recent functionalities being leveraged.

As always, XDP was brought up, with a presentation of XDP offloading using virtio-net, recent advances in combining XDP and hardware offloading techniques, and feedback from Cloudflare on using XDP in their in-house DDoS mitigation solution.

But we also got to see other topics, such as SO_TIMESTAMPING being used for performance analytics. In this talk, Soheil Hassas Yeganeh presented how the kernel timestamping facilities can be used to track individual packets within the networking stack for performance analysis and debugging. This was nice to see, as we have worked on enabling hardware timestamping in networking engines and PHYs for our clients.

Another hot topic this year was the QUIC protocol, which was presented in detail in the very good QUIC tutorial by Jana Iyengar. Since this protocol is fairly new, it was brought up in several sessions from a lot of interesting angles.

Although QUIC was not the main subject of Alissa Cooper’s keynote on Open Source, the IETF, and You, she explained how QUIC is an example of a protocol designed alongside its implementations, with a tight feedback loop between the protocol specification and its usage in real life. Alissa shared Jana’s point on how middle-boxes are a problem when designing and deploying new protocols, and explained that an approach to overcome this “ossification” is to encrypt the protocol headers themselves and to document the invariants of the non-encrypted parts.

A consequence of having a flexible protocol is that it is not meant to be implemented in the kernel. However, Maciej Machnikowski and Joshua Hay explained that it is still possible to offload some of the processing to hardware, which sparked interesting discussions with the audience on how to do so.

Conclusion

The Netdev 0x13 conference was well organized and very pleasant to attend. The content was deeply technical and allowed us to stay up-to-date with the latest developments. We also had interesting discussions and came back with lots of ideas to explore.

Thanks for organizing Netdev, we had an amazing time!