The second edition of the Alpine Linux Persistent Storage Summit (ALPSS) happened two weeks ago in the Lizumerhütte Alpine lodge. Close to Innsbruck, Austria, the lodge resides in an amazingly beautiful valley. Completely separated from the rest of the world in Winter, this year edition was marked by the absence of data network access, intensifying the feeling of isolation, stimulating the exchanges between attendees. To strengthen the representation of MTD developers at this event, Bootlin sent two of his engineers: Boris Brezillon and Miquèl Raynal, respectively MTD and NAND maintainers in the Linux kernel.
NVMe, open-channel and zoned namespaces
While almost all the ~30 attendees work on storage support that are based on NAND flashes, a majority work on domains targeting high-performances, where power-cuts are not the issue but the latency and throughput are. Far beyond our embedded world, people are working hard on the parallelization and the standardization of high-speed interfaces (SCSI, NVMe). In the end, we all have to make the software deals with the NAND-specific constraints of the underlying storage device.
Disclaimer: This is a short summary (not exhaustive) of the “high-performance” world talks as we could understand them. This is probably not 100% accurate as the topics discussed are, currently, out of our domain of expertise. Corrections are welcome.
Matias Bjørling (Western Digital) and Christoph Hellwig presented new NVMe commands to manage NVMe zones. While zones need write order to be preserved, the Linux multi-queue block I/O queueing mechanism (blk-mq) cannot enforce this. Bart van Assche (Google) and Damien Le Moal (Western Digital) proposed a draft to reorder writes at the blk-mq layer. While this solution was not very well received, it opened the discussion on how the issue should be addressed. Bart van Assche also presented his work on copy offload mechanism in Linux, which could for instance serve to fast copy entire zones. His work could be also useful to Stephen Bates who works on PCIe peer-to-peer and talked on how he wants to eg. enable DMA between SSDs. Still on the topic of DMA and performances, Idan Burstein (Mellanox) exposed the cutting-edge features he worked on to improve Remote DMA (RDMA) performances.
MTD was also present to the party
Probably the easier part to understand for us, embedded people.
Boris Brezillon and Miquèl Raynal gave a talk on their recent work support for SPI memories in Linux (and U-Boot, but this will be more detailed at ELCE in October). Boris wrote a new SPI-NAND layer, converting MTD requests into SPI exchanges, giving the flow of commands to the (also brand new) SPI-mem layer to standardize how to speak with SPI controller drivers from both SPI-NAND and SPI-NOR stacks. Cleaning work is still needed on the SPI-NOR side as well as the addition of new features like direct mapping, XIP (that was discussed after the talk), the addition of support for more chips and the conversion to SPI-mem of more SPI controllers. The slides are available online, see also our previous blog post on this topic.
Richard Weinberger (from Sigma Star GmbH, and co-maintainer of MTD and UBI/UBIFS) updated us about the level of power-cut testing available to challenge the MTD stack. Tracing is possible to get closer to the failing sequence but one big problem is to replay the sequence and reproduce the issue. Tracking down untested code path is very important to keep UBI/UBIFS as reliable as possible: this is what is generally the most important when using SPI/parallel NAND devices.
Richard’s co-worker David Gstir also works on UBI/UBIFS, but on the authentication side. Bringing filesystem authentication to UBIFS could have been simple but during his introduction he disqualified most of the alternatives he had (dm-verity, fs-verity, …). Fun-fact about fs-verity, authentication would have work on the file’s contents, but not on the inodes themselves. Hence, the file’s content could not be changed, but the file itself could still be moved. So, a brand new solution has been implemented for UBIFS, upstreaming ongoing.
Original ideas presented
Benchmarking real hardware was somehow not adapted to Damien Le Moal experiments. He hacked QEMU to add the possibility to tune CPU latency so that he could compare easily the latency on in-memory data processing paths. WIP.
Johannes Thumshirn (SUSE Labs), as a side project, started reversing APFS, Apple’s new filesystem. The firm promised two years ago to release the implementation of its filesystem so that computers running Microsoft or Linux could mount it. So far nothing happened, that is why, without even a Mac in hand, he started spending nights hex-dumping structures from a filesystem image he got, reverse-engineering the content with the help of research papers already produced. The first results are there, he can now ls and cat random files!
And after talks and hiking: time to BOFs
A bit before the official BOFs time MTD folks gathered around Hans Holmberg (CNEX Labs) to carefully listen about how pblk works, a “Physical block device” FTL for SSDs supporting open-channel that could give ideas to some of them. Why not an entirely open-source SSD running Linux with its own FTL?
Finally, between all the interesting discussions that happened, we could mention the need for a generic NVMe-oF (NVMe over Fabric) discovery protocol raised by Hannes Reinecke (SUSE Labs), and the possible evolution of the MTD stack to integrate an I/O scheduler to provide much better (and parallelized) performances exposed by Boris Brezillon.
All attendees agreed this format of conference is really pleasant, the surrounding helping a lot to the general wellness and the success of this year’s edition of the ALPSS. We will definitely try to make it next year!
As discussed in our previous blog post, Bootlin had again a strong presence at the Embedded Linux Conference North-America, with 8 attendees, 5 talks, one BoF and two E-ALE tutorial sessions.
In this blog post, we would like to highlight a number of talks from the conference that we found interesting. Each Bootlin engineer who attended the conference has selected one talk, and gives his/her feedback about this talk.
Device Tree BoF – Frank Rowand
Talk selected by Michael Opdenacker
The Device Tree BoF (Birds of a Feather session, which means an informal session about a technical topic, allowing participants to openly share questions and information) has been part of Embedded Linux Conferences for at least 2 or 3 years. For me, it has always been a good source of updates about the topic.
Frank started by sharing details about the Device Tree Workshop held in October in Prague, a one day meet-up and workshop for Device Tree contributors (like Maxime Ripard and Thomas Petazzoni from Bootlin who were invited), to address issues and plan for the next months. Slides and notes can be found on elinux.org.
Frank then went on by mentioning utilities, such as:
scripts/dtc/dt_to_config. It is not very new in the mainline kernel, but useful to generate a kernel configuration suitable for the devices present on your platform, in case you didn’t know this tool exists.
There’s an upcoming patch adding options to dtc (--annotate --full) to keep track of the line numbers in the device tree sources. This helps to locate in which DT source file a given property value comes from. The patch was idle for some time but Julia Lawall volunteered to take care of it. Thanks to her updates, this feature should be accepted in mainline soon.
The device tree compiler in mainline has also been augmented with further build checks. You can now use them by adding W=1 to make dtb. Here is an example for the Beagle Bone Black dtb:
make W=1 am335x-boneblack.dtb
arch/arm/boot/dts/am335x-boneblack.dtb: Warning (unit_address_vs_reg): Node /ocp/i2c@44e0b000/tda19988 has a reg or ranges property, but no unit name
arch/arm/boot/dts/am335x-boneblack.dtb: Warning (unit_address_vs_reg): Node /ocp/i2c@44e0b000/tda19988/ports/port@0 has a unit name, but no reg property
arch/arm/boot/dts/am335x-boneblack.dtb: Warning (unit_address_vs_reg): Node /ocp/i2c@44e0b000/tda19988/ports/port@0/endpoint@0 has a unit name, but no reg property
arch/arm/boot/dts/am335x-boneblack.dtb: Warning (unit_address_vs_reg): Node /ocp/ethernet@4a100000/slave@4a100200 has a unit name, but no reg property
arch/arm/boot/dts/am335x-boneblack.dtb: Warning (unit_address_vs_reg): Node /ocp/ethernet@4a100000/slave@4a100300 has a unit name, but no reg property
arch/arm/boot/dts/am335x-boneblack.dtb: Warning (unit_address_vs_reg): Node /ocp/lcdc@4830e000/port/endpoint@0 has a unit name, but no reg property
arch/arm/boot/dts/am335x-boneblack.dtb: Warning (simple_bus_reg): Node /ocp/l4_wkup@44c00000/prcm@200000/clocks missing or empty reg/ranges property
arch/arm/boot/dts/am335x-boneblack.dtb: Warning (simple_bus_reg): Node /ocp/l4_wkup@44c00000/prcm@200000/clockdomains missing or empty reg/ranges property
arch/arm/boot/dts/am335x-boneblack.dtb: Warning (simple_bus_reg): Node /ocp/l4_wkup@44c00000/scm@210000/scm_conf@0/clocks missing or empty reg/ranges property
arch/arm/boot/dts/am335x-boneblack.dtb: Warning (simple_bus_reg): Node /ocp/l4_wkup@44c00000/scm@210000/clockdomains missing or empty reg/ranges property
As far as I am concerned, the most interesting news remained the one that since Linux 4.15, device tree overlays are now easier to code. You no longer have to define weird “fragment” elements. You can now directly write normal nodes and use labels. The syntax is now exactly the same as for regular device tree sources!
For more details, grab the slides and if you event want to follow the discussions that happened that day, watch the video.
Tutorial: Introduction to Reverse Engineering – Mike Anderson
Talk selected by Quentin Schulz
Mike presented in an unusual 2-hour-long slot the different techniques to reverse-engineer things and the different reasons why you’d do so. After a mandatory disclaimer that reverse engineering may be illegal in some regions of the world, he introduced the different tools that anyone aspiring to reverse engineer should possess: from the obvious logic analyzer, multimeter, screwdrivers to the surprising heat gun. He then gave the first and very important step of the reverse engineering process: gathering information about the product by looking for patents, the FCC registration, manufacturer as well as carefully opening its case to examine the different components (maybe with the help of a microscope).
Later, Mike gave the multiple ways to retrieve the firmwares from the product, from the soldering of a JTAG interface to the downloading from the official website. He then offered some tools that can be used to dive into binaries and start the guessing game, and he finished his talk with an example of a reverse engineering of a protocol which required a lot of guessing and social reverse engineering.
Mike’s talk was pleasant to attend because of the high-level presentation of how to do reverse engineering while giving a quick real-life example.
Graphics Performance Analysis with FrameRetrace: A Responsive UI for ApiTrace – Mark Janes, Intel
Talk selected by Boris Brezillon
I first heard of FrameRetrace when Eric Anholt asked us to add support for the VC4 GPU to this tool, and my experience with it had been rather frustrating in that I was mainly struggling to make it work on a not yet supported architecture instead of being a simple user. So, when I saw that Mark was giving a talk on FrameRetrace usefulness and how to use it, I figured I couldn’t miss it. Well, those who looked carefuly at the schedule know I couldn’t attend it because I was giving my talk at the same time, but I did see it at FOSDEM a month before, and I’m pretty sure not much has changed since then.
Mark first described the GPU debugging/perf-anlysis tools ecosystem, saying that each GPU vendor has its own proprietary tools which most of the time are only supported on Windows. FrameRetrace is an attempt at providing a tool that exposes similar features while being open-source, cross-platform, and easily extensible to new hardware. This project is actually based on an exisiting project called ApiTrace, which it uses to capture OpenGL traces. The new feature that is interesting in FrameRetrace is that you can select the frame you want to replay, get all the hardware perf counters exposed by the GPU for this specific rendering job in order to figure out what is going wrong and then play with the OpenGL code to see how you can make things better and replay the rendering job with your local modification to see if it actually solves the problem.
I must admit I was really impressed by the demo, and now I understand why Eric (and others in the community) would like to have their GPU properly supported in FrameRetrace. It really looks like the kind of tool you don’t know you need until you’ve tested it, but once you do, you can’t do without it.
The Salmon Diet: Up-Streaming Drivers as a Form of Optimization – Gilad Ben-Yossef
Talk selected by Miquèl Raynal
Gilad was hired about a year ago to become the maintainer of the ARM® TrustZone® CryptoCell® device driver. Until now this driver was out of tree until it has been decided to upstream it. Here starts Gilad’s story.
It appeared that the right way to make it upstream was to go through the staging tree and the whole process around it. It was the first time for him to do it that way, that is why he felt it was interesting to share his experience.
At the beginning of his talk, he recalls that the driver was actually working, people already relied on it. Plus, it was released under the GPL. While all of this could make you think it was clean enough, Gilad realized that people who wrote it actually did not think about upstreaming and almost every patch to clean that driver removed more lines than it added, shrinking step by step the driver until 30% of the lines were removed!
Of course, removing the existing hardware abstraction macros was something to do, as well as running and correcting the whole checkpatch.pl output, but there are plenty of other good habits that one can adapt to his own situation, explained and well illustrated all along this talk.
Measuring and Summarizing Latencies using the Trace Event Subsystem – Tom Zanussi
Talk selected by Maxime Chevallier
Having some experience dealing with RT topics on Linux, I was looking forward to seeing Tom’s talk about these tracing features.
He gave really good examples on how to use the existing ‘latency histogram’ traces to get a cyclictest-like metric of wakeup latencies by measuring the time between sched_waking and sched_switch, and explaining how this could be re-used for other measurements such as network latencies.
What he presented was more than just having a trace in a buffer when a function is called. The tracing subsystem allows the use of handlers to perform actions when an event occurs. As an example, he demonstrated how to use the onmax handler to accumulate the maximum wakeup latency observed, each time saving crucial pieces of information on the execution context.
He then went on to describe the next-level features that are being merged, namely function events by Steven Rostedt, and Inter events by Tom himself. They allow the user to use any of the kernel functions as tracepoints, and build complex events and traces to pinpoint really specific sequences.
I recommend to read this LWN article on inter-event tracing, and of course have a look at Tom’s talk and slides.
Steering Xenomai into the Real-Time Linux Future – Jan Kiszka
Talk selected by Thomas Petazzoni
In this talk, long-term Xenomai developer and Siemens engineer Jan Kiszka gave a very interesting status of the Xenomai project and its roadmap. He started by refreshing the audience about what Xenomai is: an RTOS-to-Linux portability framework. It comes in two flavors: a co-kernel extension for a patched Linux kernel, and as libraries running for native Linux (including PREEMPT_RT). He then went on to compare the respective advantages and drawbacks of the two flavors, citing accurate modeling of legacy RTOS behavior and strong separation of real-time vs. non real-time code as the key advantages of the co-kernel approach.
Jan then summarized the history of Xenomai, from the early days as a sub-project of RTAI to the current status of Xenomai 3.0, released in 2015 after more than 5 years of development. However, even though Xenomai is widely used in the industry, its development relies on just a few individuals. Siemens is a heavy user of Xenomai, and in 2017, they started a discussion: should they migrate away from Xenomai or invest into it. Siemens made the decision to invest in the project. The same year, Xenomai main developer Philippe Gerum published an e-mail RTnet, Analogy and the elephant in the room also calling for help to maintain some parts of Xenomai.
Following those discussions, some changes were decided in the Xenomai project: Philippe Gerum will step back from the project lead, and Jan Kiszka will take over his role.
Regarding the I-Pipe kernel patch (which allows to support the co-kernel approach), the Xenomai project will discontinue a number of architectures (nios2, SH, Blackfin, PPC64, ARM < v7) and will only maintain patches for the latest Linux kernel LTS, in order to reduce the maintenance effort.
Jan announced that Xenomai 3.0.7 is soon to be released, that Xenomai 3.1 will introduce ARM64 support, and that Xenomai 2 is unmaintained and therefore users should migrate to Xenomai 3. He also gave a status on the driver stacks, citing that RTnet needs more love, and that Analogy is orphaned and needs a new maintainer.
Towards the end of his talk, Jan then started discussing the more distant future of Xenomai. The future version of Xenomai has the goal of improving the integration of the co-kernel approach, to simplify maintenance and possibly provide a chance to be upstreamed in Linux. This new approach will be split in two elements, called Dovetail (interrupt routing, co-kernel hooks) and Steely (co-kernel implementation). He made it clear that this is currently not production-ready at all. He noted that this new implementation allows a significant reduction of the code base, about 50% smaller than the current implementation. The code is already available in two Git repositories: Dovetail and Steely.
All in all, Jan’s talk was a very interesting one, providing a good coverage of Xenomai’s status and future. The video of his talk is definitely worth watching, and the slides are also available.
An Introduction to Asymmetric Multiprocessing: When this Architecture can be a Game Changer and How to Survive It – Nicola La Gloria & Laura Nao
Talk selected by Mylène Josserand
In this talk, Nicola La Gloria and Laura Nao from Kynetics presented how to handle communication between a micro-controller running bare metal code and a CPU with a full OS (such as GNU/Linux or Android).
They showed the different approaches for communication (supervised or not supervised: i.e. CPU and MCU can communicate using an hypervisor or directly) and presented the Inter-Processing Communication.
After this introduction, Laura told us about their use-case, running on an NXP i.MX7, which comprises a Cortex-M4 micro-controller and a Cortex-A7 processor: the MCU retrieves data from a sensor, which will be displayed by the CPU.
She gave feedback on how they implemented this communication and the different mechanisms used (Message Unit, RPMsg, RDC, kexec/kdump, etc). The explanation of the different mechanisms was really interesting, and particularly relevant for those who had never heard about them.
They did a short tutorial and gave some tips that would definitely be appreciated by people who start this kind of project. And finally, they did a demonstration of all the work they have done.
So if you are interested in the subject or even only for your general knowledge, have a look at their talk (video and slides).
System-in-Package Technology: Making it Easier to Build Your Own Linux Computer – Erik Welsh & Jason Kridner
Talk selected by Alexandre Belloni
Eric Welsh started to talk about how software influences hardware design and why open source hardware matters. He then presented the System-in-Package technology and in particular the Octavo OSD3358. It includes a TI AAM335x SoC, DDR3 SDRAM, a PMIC and all the related power circuitry, components which are always required. This allows the hardware engineers to concentrate on the added value of the final product.
Great pictures and videos of the SiP internals and its manufacturing process were shown.
Jason then came on stage to present the OSD3358 integration on the PocketBeagle.
Eric finally explained how easy it is to assemble the OSD3358 on a PCB, even by hand with a video to prove it. He finally concluded by summing up the benefits of using a SiP: easy bring-up, lower cost of PCB, easy manufacturing and migration from SBC prototyping to custom PCB.
It was quite enlightening for software engineers as it showed the hardware internals with some great details.
Have a look at the video and slides.
In conclusion, this was again a really nice opportunity to share and acquire knowledge from other engineers deeply involved in the Open Source community, as well as meeting people that we sometimes know only by their name on a mailing list. Next year this event will happen in Monterey Bay, California (March 19 – 21, 2019). See you there!
Over the last months, Bootlin engineers Boris Brezillon and Miquèl Raynal have been working on rewriting the NAND controller driver used on a large number of Marvell SoCs. This NAND controller driver had grown very complicated, and Miquèl’s adventure in this rework led him to contribute a new interface to the NAND framework, in order to simplify implementing NAND controller drivers for complex NAND controllers. In this blog post, Miquèl summarizes the original issue, and how it is solved by the ->exec_op() interface he has contributed.
The NAND framework is the layer between the generic MTD layer and the NAND controller drivers. Its purpose is to handle MTD requests and transform them into understandable NAND operations the controller will have to send to the NAND chip.
For general information about NANDs, the reader is invited to read the ONFI specification (Open NAND Flash Interface) which defines the most common NAND operations.
Interacting with a NAND chip
Raw NANDs (so-called “parallel NANDs”) are slave devices waiting for instructions from the controller. An operation is a sequence of instructions usually referred as “command” (CMD), “addresses” (ADDR), and “data” cycles (DATA_IN/DATA_OUT) and sometimes wait periods (WAITRDY). Some everyday operations any NAND enthusiast should know by heart are, for instance:
How it was handled in the Linux kernel
Today, a majority of NAND controlller drivers implement the ->cmd_ctrl() hook. It aimed to be a very small function, designed to just send command and address cycles independently, usually embedding some very controller-specific logic. This hook was supposed to be called by a function of higher level from the NAND core, ->cmdfunc(). In addition to calling ->cmd_ctrl() to send command and address cycles, the core would also call ->read|write_byte|word|buf() hooks to actually move data from the NAND controller and the memory (the DATA parts in the diagram above).
This approach worked very well with simple NAND controllers, which are just able to send command and address cycles one at a time to the NAND chip, without any extra intelligence. However, NAND controllers have become more and more complex and now can handle higher-level operations, usually to provide higher performance. For example, a NAND controller may provide an operation that would do all of the command and address cycles of a read-page operation in one-go. Some controllers even support only those higher-level operations, and are not able to simply do the basic operation of sending one command cycle or one data cycle. To handle such controllers, their drivers were overloading the ->cmdfunc() hook directly, circumventing the generic NAND core implementation of ->cmdfunc(). This is a first drawback: it is no longer possible to easily add logic to the NAND core to support new NAND operations, because some drivers overload the ->cmdfunc() logic. Worse, ->cmdfunc() doesn’t provide some information such as the length of the data transfer, which some controllers actually need in order to run the desired operation. NAND controller drivers started to have complicated state machines just to work around the NAND framework limitations.
Some driver-specific implementations of this hook started diverging from the original one, giving maintainers a lot of pain to maintain the whole subsystem, specifically when they needed to introduce additional vendor-specific operations support. These implementations were not only diverse but also incomplete, sometimes buggy and most importantly, developers had to guess the data that would probably be moved by the core after that, which is clearly a symptom that the framework was not fitting the user needs anymore.
The ->exec_op() era
The NAND subsystem maintainers decided to switch to a new approach, based on a new hook called ->exec_op(), implemented by NAND controller drivers and called by the generic NAND core. The logic behind that name is to provide to every controller a generic interface that can easily be extended and exposes the overall NAND operation to be performed. This way, the driver can optimize depending on the controller capabilities without the need of a complex state machine as ->cmdfunc() was.
All major NAND generic raw operations like reset, reading the NAND ID, selecting a set of timings, reading/writing data and so on found their place into small internal functions named nand_[operation]_op().
From the NAND controller driver point of view, an array of instructions is received for each operation. The controller then needs to parse these instructions, decides if it can handle the overall operation, splits the operation if needed, and executes what is requested.
Using the ->exec_op() interface is as simple as declaring a list with the controller capabilities, each entry of this array having a callback function knowing the overall operation that will actually handle all the logic. The NAND core was enhanced with a proper parser that one may use in his driver to handle the callback selection logic.
The ->exec_op() interface in the NAND core has been accepted and merged upstream, and will be part of Linux 4.16. The first driver converted to this new interface was obviously the NAND controller driver used on Marvell platforms, pxa3xx_nand. It has been rewritten as marvell_nand, and will also be part of Linux 4.16. Even though the new driver is longer (by lines of code) than the previous one, it supports additional features (such as raw read and write operations), allows the NAND core to pass custom commands to the NAND chip, and has a logic that is a lot less complicated.
Miquèl has also worked on converting the fsmc_nand driver to ->exec_op(), but this work hasn’t been merged yet. In the community, Stefan Agner has taken on the task to convert the vf610_nfc driver to this new approach.
Bootlin is proud to have contributed such enhancements to the Linux kernel, and hopes to see other developers contribute to this subsystem in the near future, by migrating their favorite NAND controller driver to ->exec_op()!