The second edition of the Alpine Linux Persistent Storage Summit (ALPSS) happened two weeks ago in the Lizumerhütte Alpine lodge. Close to Innsbruck, Austria, the lodge resides in an amazingly beautiful valley. Completely separated from the rest of the world in Winter, this year edition was marked by the absence of data network access, intensifying the feeling of isolation, stimulating the exchanges between attendees. To strengthen the representation of MTD developers at this event, Bootlin sent two of his engineers: Boris Brezillon and Miquèl Raynal, respectively MTD and NAND maintainers in the Linux kernel.
NVMe, open-channel and zoned namespaces
While almost all the ~30 attendees work on storage support that are based on NAND flashes, a majority work on domains targeting high-performances, where power-cuts are not the issue but the latency and throughput are. Far beyond our embedded world, people are working hard on the parallelization and the standardization of high-speed interfaces (SCSI, NVMe). In the end, we all have to make the software deals with the NAND-specific constraints of the underlying storage device.
Disclaimer: This is a short summary (not exhaustive) of the “high-performance” world talks as we could understand them. This is probably not 100% accurate as the topics discussed are, currently, out of our domain of expertise. Corrections are welcome.
Matias Bjørling (Western Digital) and Christoph Hellwig presented new NVMe commands to manage NVMe zones. While zones need write order to be preserved, the Linux multi-queue block I/O queueing mechanism (blk-mq) cannot enforce this. Bart van Assche (Google) and Damien Le Moal (Western Digital) proposed a draft to reorder writes at the blk-mq layer. While this solution was not very well received, it opened the discussion on how the issue should be addressed. Bart van Assche also presented his work on copy offload mechanism in Linux, which could for instance serve to fast copy entire zones. His work could be also useful to Stephen Bates who works on PCIe peer-to-peer and talked on how he wants to eg. enable DMA between SSDs. Still on the topic of DMA and performances, Idan Burstein (Mellanox) exposed the cutting-edge features he worked on to improve Remote DMA (RDMA) performances.
MTD was also present to the party
Probably the easier part to understand for us, embedded people.
Boris Brezillon and Miquèl Raynal gave a talk on their recent work support for SPI memories in Linux (and U-Boot, but this will be more detailed at ELCE in October). Boris wrote a new SPI-NAND layer, converting MTD requests into SPI exchanges, giving the flow of commands to the (also brand new) SPI-mem layer to standardize how to speak with SPI controller drivers from both SPI-NAND and SPI-NOR stacks. Cleaning work is still needed on the SPI-NOR side as well as the addition of new features like direct mapping, XIP (that was discussed after the talk), the addition of support for more chips and the conversion to SPI-mem of more SPI controllers. The slides are available online, see also our previous blog post on this topic.
Richard Weinberger (from Sigma Star GmbH, and co-maintainer of MTD and UBI/UBIFS) updated us about the level of power-cut testing available to challenge the MTD stack. Tracing is possible to get closer to the failing sequence but one big problem is to replay the sequence and reproduce the issue. Tracking down untested code path is very important to keep UBI/UBIFS as reliable as possible: this is what is generally the most important when using SPI/parallel NAND devices.
Richard’s co-worker David Gstir also works on UBI/UBIFS, but on the authentication side. Bringing filesystem authentication to UBIFS could have been simple but during his introduction he disqualified most of the alternatives he had (dm-verity, fs-verity, …). Fun-fact about fs-verity, authentication would have work on the file’s contents, but not on the inodes themselves. Hence, the file’s content could not be changed, but the file itself could still be moved. So, a brand new solution has been implemented for UBIFS, upstreaming ongoing.
Original ideas presented
Benchmarking real hardware was somehow not adapted to Damien Le Moal experiments. He hacked QEMU to add the possibility to tune CPU latency so that he could compare easily the latency on in-memory data processing paths. WIP.
Johannes Thumshirn (SUSE Labs), as a side project, started reversing APFS, Apple’s new filesystem. The firm promised two years ago to release the implementation of its filesystem so that computers running Microsoft or Linux could mount it. So far nothing happened, that is why, without even a Mac in hand, he started spending nights hex-dumping structures from a filesystem image he got, reverse-engineering the content with the help of research papers already produced. The first results are there, he can now
cat random files!
And after talks and hiking: time to BOFs
A bit before the official BOFs time MTD folks gathered around Hans Holmberg (CNEX Labs) to carefully listen about how pblk works, a “Physical block device” FTL for SSDs supporting open-channel that could give ideas to some of them. Why not an entirely open-source SSD running Linux with its own FTL?
Finally, between all the interesting discussions that happened, we could mention the need for a generic NVMe-oF (NVMe over Fabric) discovery protocol raised by Hannes Reinecke (SUSE Labs), and the possible evolution of the MTD stack to integrate an I/O scheduler to provide much better (and parallelized) performances exposed by Boris Brezillon.
All attendees agreed this format of conference is really pleasant, the surrounding helping a lot to the general wellness and the success of this year’s edition of the ALPSS. We will definitely try to make it next year!