Last year, a number of Bootlin engineers attended the Linux Plumbers conference. This year again, Bootlin will participate in the event, with engineer Antoine Ténart traveling to Vancouver, Canada, on November 13-15 for this conference.
The end of August has arrived, bringing an end to Paul’s engineering internship at Bootlin, which focused on bringing mainline Linux support to the VPU found on Allwinner platforms. Over the past six months, we have worked hard to reach the goals announced in the project’s crowdfunding campaign, and we delivered most of the main goals last month.
Reaching the end of the funding: a status on where we stand
We have now exhausted the budget provided by the crowdfunding campaign. Both Maxime Ripard’s time (he worked mainly on H264 decoding and helped with DRM topics) and Paul’s internship are over, so the remaining work will be done on a best-effort basis, without direct funding. This will therefore be the last weekly update, but we will publish updates from time to time when interesting progress is made.
Here is a quick summary of our current status, compared to what was promised during our Kickstarter campaign:
Making sure that the codec works on the older Allwinner SoCs that are still widely used: A10, A13, A20, A33, R8 and R16. This goal is fully met;
Polishing the existing MPEG2 decoding support to make it fully production ready. This goal is fully met;
Implementing H264 video decoding. This goal is fully met with base H264 decoding support implemented. However, a number of more advanced H264 features have not been implemented, and therefore additional improvements could be made;
Modifying the Allwinner display driver in order to be able to directly display the decoded frames instead of converting and copying those frames. This goal is fully met.
Providing a user-space library that is easy to integrate into popular open-source video players. This goal is partially met. We do provide a user-space library that offers a VA-API implementation. However, integration with popular video players turned out to be much more challenging than expected, and we only offer Kodi integration at this point. See below for details;
Upstreaming those changes to the official Linux kernel. This goal is in progress, on both the VPU driver side and DRM improvements side;
Supporting the newer Allwinner SoCs (H3, H5, A64). This goal is partially met, since H3 is supported, but not yet H5 and A64;
H265 video decoding support. This goal is fully met with base H265 decoding support implemented. Like H264, a number of more advanced features have not been implemented, so there is room for more work.
The most challenging topic: integration with open source video players
The major pitfalls we encountered are related to integrating our accelerated video decoding pipeline with multimedia players. Reaching a production-ready state there will require extra work beyond the scope of the VPU campaign.
We considered a number of options for integrating with a desktop environment under Xorg, which was especially tricky for the oldest Allwinner platforms where the VPU outputs a tiled YUV format. The chain of required operations includes untiling, colorspace conversion (from YUV to RGB), scaling and composition.
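Here is a minimal per-pixel sketch of the colorspace conversion step in this chain. It assumes limited-range BT.601 YUV input and uses the widely used 8-bit fixed-point approximation of the BT.601 coefficients. The function names are hypothetical, and this is not code from our pipeline. Running such code on the CPU for every pixel of every frame, in addition to untiling and scaling, gives an idea of the cost of a software-only path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scale a fixed-point value (8 fractional bits) back to 0..255,
 * clamping out-of-range results instead of letting them wrap. */
static uint8_t clamp_shift8(int32_t v)
{
    if (v < 0)
        return 0;
    v >>= 8;
    return v > 255 ? 255 : (uint8_t)v;
}

/* Convert one limited-range BT.601 YUV sample to RGB.
 * 298/256 ~ 1.164, 409/256 ~ 1.596, 100/256 ~ 0.391,
 * 208/256 ~ 0.813, 516/256 ~ 2.018; +128 rounds to nearest. */
static void yuv_to_rgb_bt601(uint8_t y, uint8_t u, uint8_t v, uint8_t rgb[3])
{
    assert(rgb != NULL);
    int32_t c = 298 * ((int32_t)y - 16);
    int32_t d = (int32_t)u - 128;
    int32_t e = (int32_t)v - 128;

    rgb[0] = clamp_shift8(c + 409 * e + 128);           /* R */
    rgb[1] = clamp_shift8(c - 100 * d - 208 * e + 128); /* G */
    rgb[2] = clamp_shift8(c + 516 * d + 128);           /* B */
}
```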
We first relied on the main CPU for all the required operations (including NEON-backed untiling routines). This becomes unbearably slow as soon as scaling is involved.
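Below is a plain C sketch of the untiling step. It makes two assumptions about the layout. First, the decoded plane is stored as 32x32-byte tiles, laid out left to right, then top to bottom. Second, the bytes inside each tile are stored row by row. This matches our understanding of the tiled format used by the older Allwinner VPUs, but it should be checked against the hardware. untile_plane() is a hypothetical name. A NEON version would vectorize the inner copies rather than change the addressing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE_W 32
#define TILE_H 32

/*
 * Assumed layout: 32x32-byte tiles, each stored contiguously (row-major
 * inside the tile), tiles ordered left to right then top to bottom.
 * The tiled buffer is padded up to whole tiles in both directions.
 */
static void untile_plane(const uint8_t *tiled, uint8_t *linear,
                         size_t width, size_t height, size_t stride)
{
    assert(stride >= width);
    size_t tiles_per_row = (width + TILE_W - 1) / TILE_W;

    for (size_t y = 0; y < height; y++) {
        size_t tile_row = y / TILE_H;
        size_t y_in_tile = y % TILE_H;

        for (size_t tx = 0; tx < tiles_per_row; tx++) {
            /* Start of pixel row y_in_tile inside tile (tx, tile_row). */
            const uint8_t *src = tiled
                + (tile_row * tiles_per_row + tx) * (TILE_W * TILE_H)
                + y_in_tile * TILE_W;
            size_t x = tx * TILE_W;
            /* The last tile of a row may be only partly visible. */
            size_t n = width - x < TILE_W ? width - x : TILE_W;

            memcpy(linear + y * stride + x, src, n);
        }
    }
}
```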
We then tried to bring in the GPU to accelerate the untiling, colorspace conversion, scaling and composition operations. Although we wrote a shader-based untiler, the Mali blobs did not allow importing the raw frame data on a byte-by-byte basis. In practice, this made GPU acceleration unusable for our use case. Bringing in the GPU only for the final composition step (which should be possible with GBM-enabled blobs) could, however, bring some speedup.
Another lead is to use the Xv extension of the X11 API. It fits the bill for using the Display Engine hardware to accelerate these operations, but this interface is now quite old and increasingly deprecated. It also only allows sub-optimal use cases, such as displaying one video at a time.
We also investigated the situation for media players that can run without a display server, which removes the need for the composition step and allows using the Display Engine hardware directly, through the DRM interface.
We succeeded at bringing up support for the Kodi mediacenter, by adding the required bits to implement a zero-copy pipeline.
We worked on getting GStreamer to correctly pipe VAAPI-based decoding to the DRM-enabled kmssink without going through the GPU, but did not end up with any functional result, so significant work remains in that area.
Going further: what happens now?
Here are the topics that we intend to continue work on in this best-effort mode and complete by the end of 2018, as promised in our crowdfunding campaign:
Ensure the base Cedrus Linux kernel driver gets merged;
Ensure the H264 decoding support in the Cedrus driver gets merged;
Ensure the H265 decoding support in the Cedrus driver gets merged;
Ensure the DRM driver improvements get merged;
Enable VPU support on H5 and A64.
Here are other topics that we do not intend to work on without additional funding. Individuals who want to see some progress on those topics are invited to contribute and join the effort of improving Allwinner VPU support in upstream Linux. Companies interested in those features can also contact us.
Additional H264 and H265 decoding features: interlaced video support (H264 and H265), quantization matrices (H265), 10-bit (H265), 4K resolution (H265);
Other codecs beyond MPEG2, H264 and H265, such as VP8;
Encoding support;
Additional work on GStreamer integration or X.org integration.
Thanks
Once again, we would like to thank all the individuals and companies who participated in our crowdfunding campaign and made this project possible. Software development projects always involve uncertainties. Despite this, we are very happy to have delivered the vast majority of the goals within the expected time frame, while publishing weekly updates on our progress. It was a new experience for Bootlin, and we hope to renew this experience for other Linux kernel upstream developments in the future!
Starting last year, we have been working on the Microsemi VSC7513 and VSC7514 MIPS processors.
They have a 500 MHz MIPS 24KEc CPU and the usual DDR, UART, I2C and SPI controllers. More interestingly, they also have an 8- or 10-port Gigabit Ethernet switch that can offload common network bridging operations to the hardware. As is usual for this kind of product, the vendor-provided SDK (called WebStaX) used to configure the switch runs in userspace. It uses a custom in-kernel UIO driver to talk to the hardware.
However, this has now changed, as we have submitted support for the platform and the switch to the upstream Linux kernel.
The whole driver, based on the switchdev Linux kernel subsystem, is about 5700 lines long.
Thanks to this work, it is now possible to use standard Linux user-space tools to configure the switch. For example, the following commands bridge two switch ports and offload the forwarding to the hardware:
ip link add name br0 type bridge
ip link set dev sw0p0 master br0
ip link set dev sw0p1 master br0
ip link set dev br0 up
To achieve hardware offloading, the driver needs to:
configure port forwarding, i.e. to which ports the frames coming from a particular port should be forwarded;
handle the MAC table, which records the port through which each machine can be reached; the broadcast and multicast MAC addresses also have to be installed;
handle the STP port state: whether the port is allowed to forward frames or learn new MAC addresses.
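The three responsibilities above can be modelled in software, as in the sketch below. This is a deliberately simplified, hypothetical model, not the driver code, which programs the switch registers instead. It has three parts:
- a per-port forwarding mask;
- a small MAC table filled by source-address learning;
- a reduced set of STP states that decide whether a port learns and forwards.

The model simply floods broadcast, multicast and unknown unicast destinations. The real switch instead relies on entries installed in its MAC table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NPORTS 10
#define MAC_TABLE_SIZE 64

/* Simplified STP port states (listening omitted). */
enum stp_state { STP_DISABLED, STP_BLOCKING, STP_LEARNING, STP_FORWARDING };

struct mac_entry {
    uint8_t mac[6];
    int port;
    bool valid;
};

struct sw_model {
    enum stp_state state[NPORTS];
    uint16_t fwd_mask[NPORTS];   /* bit p set: frames received here may go to port p */
    struct mac_entry table[MAC_TABLE_SIZE];
};

static int mac_lookup(const struct sw_model *s, const uint8_t mac[6])
{
    for (int i = 0; i < MAC_TABLE_SIZE; i++)
        if (s->table[i].valid && memcmp(s->table[i].mac, mac, 6) == 0)
            return s->table[i].port;
    return -1;
}

static void mac_learn(struct sw_model *s, const uint8_t mac[6], int port)
{
    struct mac_entry *free_slot = NULL;

    for (int i = 0; i < MAC_TABLE_SIZE; i++) {
        struct mac_entry *e = &s->table[i];
        if (e->valid && memcmp(e->mac, mac, 6) == 0) {
            e->port = port;      /* station moved: update its port */
            return;
        }
        if (!e->valid && !free_slot)
            free_slot = e;
    }
    if (!free_slot)
        return;                  /* table full: do not learn this address */
    memcpy(free_slot->mac, mac, 6);
    free_slot->port = port;
    free_slot->valid = true;
}

/* Return the bitmask of egress ports for a frame received on in_port. */
static uint16_t sw_rx(struct sw_model *s, int in_port,
                      const uint8_t dst[6], const uint8_t src[6])
{
    assert(in_port >= 0 && in_port < NPORTS);

    if (s->state[in_port] == STP_LEARNING || s->state[in_port] == STP_FORWARDING)
        mac_learn(s, src, in_port);
    if (s->state[in_port] != STP_FORWARDING)
        return 0;

    /* Forwarding matrix, minus the ingress port and non-forwarding ports. */
    uint16_t allowed = s->fwd_mask[in_port] & (uint16_t)~(1u << in_port);
    for (int p = 0; p < NPORTS; p++)
        if (s->state[p] != STP_FORWARDING)
            allowed &= (uint16_t)~(1u << p);

    /* Group (broadcast/multicast) or unknown unicast destinations: flood. */
    int out = (dst[0] & 1) ? -1 : mac_lookup(s, dst);
    if (out < 0)
        return allowed;
    return allowed & (uint16_t)(1u << out);
}
```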
VLANs are configured using ip and bridge:
ip link set dev br0 type bridge vlan_filtering 1
bridge vlan add dev sw0p0 vid 1 pvid untagged
bridge vlan add dev sw0p1 vid 1
bridge vlan add dev sw0p0 vid 30
bridge vlan add dev sw0p1 vid 30
Here, the driver configures the VIDs on each port and what to do with them (tag, untag, forward).
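The tag/untag/forward decisions can be sketched with a hypothetical per-port model that mirrors the bridge commands above. Each port has three pieces of state:
- a port VID (PVID), used to classify untagged frames;
- a membership set;
- an untagged set, used on egress.

The names are illustrative only. The switch implements this logic in its own VLAN tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VID_COUNT 4096

struct port_vlan {
    uint16_t pvid;                    /* 0: untagged frames are dropped */
    uint8_t member[VID_COUNT / 8];    /* VIDs allowed on this port */
    uint8_t untagged[VID_COUNT / 8];  /* VIDs sent without a tag on egress */
};

static bool vid_test(const uint8_t *map, uint16_t vid)
{
    return map[vid / 8] & (1u << (vid % 8));
}

static void vid_set(uint8_t *map, uint16_t vid)
{
    map[vid / 8] |= (uint8_t)(1u << (vid % 8));
}

/* Rough equivalent of "bridge vlan add dev PORT vid VID [pvid] [untagged]". */
static void vlan_add(struct port_vlan *p, uint16_t vid, bool pvid, bool untagged)
{
    assert(vid > 0 && vid < VID_COUNT - 1);
    vid_set(p->member, vid);
    if (untagged)
        vid_set(p->untagged, vid);
    if (pvid)
        p->pvid = vid;
}

/* Ingress classification: returns the frame's VID, or -1 to drop it. */
static int vlan_ingress(const struct port_vlan *p, bool tagged, uint16_t tag_vid)
{
    uint16_t vid = tagged ? tag_vid : p->pvid;
    if (vid == 0 || vid >= VID_COUNT || !vid_test(p->member, vid))
        return -1;
    return vid;
}

/* Egress decision: returns false to drop, otherwise tells whether to tag. */
static bool vlan_egress(const struct port_vlan *p, uint16_t vid, bool *tag)
{
    assert(vid < VID_COUNT);
    if (!vid_test(p->member, vid))
        return false;
    *tag = !vid_test(p->untagged, vid);
    return true;
}
```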
Configuring link aggregation is also done with ip:
ip link add name aggr0 type bond mode 802.3ad
ip link set dev eth_yellow master aggr0
ip link set dev eth_blue master aggr0
The driver has to configure the aggregated ports and the balancing mode. It also has to ensure that the switch forwards the control frames (LACPDUs) to the CPU, so that Linux knows the state of the links.
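The short sketch below illustrates both duties. is_lacpdu() recognizes LACPDUs in an untagged frame using three fields:
- the IEEE 802.3 Slow Protocols destination address (01:80:c2:00:00:02);
- EtherType 0x8809;
- LACP subtype 1.

lag_pick_link() is a hypothetical layer-2 hash in the spirit of the Linux bonding driver's "layer2" transmit hash policy. It is not the switch's actual hashing algorithm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* IEEE 802.3 Slow Protocols destination MAC, EtherType and LACP subtype. */
static const uint8_t SLOW_PROTO_MAC[6] = {0x01, 0x80, 0xc2, 0x00, 0x00, 0x02};
#define SLOW_PROTO_ETHERTYPE 0x8809
#define SLOW_SUBTYPE_LACP    0x01

/* Should this (untagged) Ethernet frame be trapped to the CPU as an LACPDU? */
static bool is_lacpdu(const uint8_t *frame, size_t len)
{
    if (len < 15)
        return false;
    uint16_t ethertype = (uint16_t)(frame[12] << 8 | frame[13]);
    return memcmp(frame, SLOW_PROTO_MAC, 6) == 0 &&
           ethertype == SLOW_PROTO_ETHERTYPE &&
           frame[14] == SLOW_SUBTYPE_LACP;
}

/* Hypothetical layer-2 balancing: all frames between the same two stations
 * hash to the same member link, so a flow is never reordered. */
static unsigned lag_pick_link(const uint8_t src[6], const uint8_t dst[6],
                              unsigned n_links)
{
    assert(n_links > 0);
    return (unsigned)(src[5] ^ dst[5]) % n_links;
}
```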
IGMP snooping is a simple feature where the switch pushes new multicast addresses to the CPU. Linux can then install the corresponding MAC addresses in the table, which avoids forwarding multicast frames to all the switch ports. In our case, it is simply enabled through a single register when multicasting is enabled on the bridge.
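The MAC addresses installed for IPv4 multicast groups follow the standard mapping from RFC 1112: the prefix 01:00:5e followed by the low 23 bits of the group address. A minimal sketch is shown below; the helper name is hypothetical. Because 5 bits of the group address are dropped, 32 IPv4 groups share each MAC address. The MAC table can therefore only filter per MAC address, not per group.

```c
#include <assert.h>
#include <stdint.h>

/* Map an IPv4 multicast group (host byte order) to its Ethernet MAC:
 * 01:00:5e followed by the low 23 bits of the group (RFC 1112). */
static void ipv4_mcast_to_mac(uint32_t group, uint8_t mac[6])
{
    assert((group >> 28) == 0xe);      /* must be in 224.0.0.0/4 */
    mac[0] = 0x01;
    mac[1] = 0x00;
    mac[2] = 0x5e;
    mac[3] = (group >> 16) & 0x7f;     /* bit 23 of the group is dropped */
    mac[4] = (group >> 8) & 0xff;
    mac[5] = group & 0xff;
}
```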
The switch supports further features that remain to be worked on: PTP timestamping, QoS and packet filtering, to name a few. We have already implemented PTP support and will be submitting it upstream in the near future.
To learn more about the inner workings of switchdev, you can refer to Alexandre Belloni’s ELC talk.
If you’re interested in upstream Linux kernel support for other Ethernet switches, do not hesitate to contact us!
This week has seen great advancements in H265 support, following up on the work conducted during the past weeks. The first item to debug was support for bi-directional predictive frames (also known as B frames), which was still broken last week. This required some adaptations in our standalone test tool v4l2-request-test so that it displays the decoded frames in the right order. With bi-directional prediction, the display order no longer matches the decoding order, which is the order in which the coded frames are stored in the bitstream.
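The reordering that v4l2-request-test needed can be sketched as a small reorder buffer keyed on the picture order count (POC). This is an illustrative model, not the tool's code, and it makes two assumptions. First, the stream is a single sequence without POC resets. Second, the maximum reorder depth is similar to the sps_max_num_reorder_pics value signalled in H265. Whenever more frames are waiting than that depth allows, the model outputs the frame with the lowest POC.

```c
#include <assert.h>

#define MAX_PENDING 16

/* Frames decoded but not yet displayed, identified by their POC. */
struct reorder_buf {
    int poc[MAX_PENDING];
    int count;
    int max_reorder;   /* assumed to play the role of sps_max_num_reorder_pics */
};

/* Remove and return the smallest POC still pending. */
static int pop_min_poc(struct reorder_buf *r)
{
    assert(r->count > 0);
    int best = 0;
    for (int i = 1; i < r->count; i++)
        if (r->poc[i] < r->poc[best])
            best = i;
    int poc = r->poc[best];
    r->poc[best] = r->poc[--r->count];
    return poc;
}

/* Add a frame in decoding order. Returns 1 and stores the next frame to
 * display in *out when too many frames are pending, 0 otherwise. */
static int reorder_push(struct reorder_buf *r, int poc, int *out)
{
    assert(r->max_reorder >= 0 && r->max_reorder < MAX_PENDING);
    r->poc[r->count++] = poc;
    if (r->count > r->max_reorder) {
        *out = pop_min_poc(r);
        return 1;
    }
    return 0;
}

/* At the end of the stream, output all pending frames in display order. */
static int reorder_flush(struct reorder_buf *r, int *out)
{
    int n = 0;
    while (r->count > 0)
        out[n++] = pop_min_poc(r);
    return n;
}
```

For example, frames decoded with POCs 0, 4, 2, 1, 3, 8, 6, 5, 7 and a depth of 2 come out as 0 to 8 in order.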
With the images displayed in the right order, debugging was a matter of comparing the configuration register values written by our driver with the reference provided by libvdpau-sunxi. However, this was not enough: a specific buffer has to be provided for each frame, in which the decoder stores extra metadata related to bi-directional frame prediction. With this buffer set, the situation vastly improved, and only minor issues remained to be resolved.
This led to properly decoding our reference H265 video, which contains I, P and B frames! A few more videos were also tested to spot possible bugs, and they were eventually decoded correctly too. Of course, due to the many possible combinations of H265 features, we may still be missing some corner cases, but the bulk of H265 support is well in place at this point.
We moved on to adding support for H265 in libva-v4l2-request, which allows the integration of the codec with media players such as VLC and Kodi. We hit a few hiccups during the bringup :
But we managed to fix the integration of H265 to behave properly :
So H265 is now integrated in our pipeline, and we are ready to submit the patches adding H265 support to the Cedrus driver, which should happen next week.
Bootlin engineer Maxime Ripard will be attending the X.org Developer Conference 2018, from September 26 to September 28 in A Coruña, Spain. This conference is the main event to discuss Linux graphics and display related topics and meet the Linux kernel and userspace developers working in this field.
At Bootlin, Maxime has been involved over the last few years in a number of display related developments:
He is the initial author and the maintainer of the DRM display controller driver for the Allwinner processors, to which he has progressively added numerous features over the years, including parallel RGB support, HDMI support, DSI support and TV-out support, for many different Allwinner platforms.
He has worked on enabling OpenGL support on Allwinner platforms using the open-source kernel driver and the closed-source binary blob provided by ARM, making OpenGL work using a mainline and upstream Linux kernel on Allwinner hardware. As part of this, Maxime designed and upstreamed a Device Tree binding to describe the Mali GPU and maintains sunxi-mali, a fork of the ARM-provided kernel driver for Mali, modified to work with the upstream Linux kernel.
Maxime has been involved in setting up automated testing of the Raspberry Pi display subsystem, using the Chamelium platform and the intel-gpu-tools test suite. See our blog post on this topic.
As part of Bootlin’s work on the Linux support for the Allwinner VPU (funded by our crowdfunding campaign earlier this year), Maxime got involved in issues related to feeding the output of the VPU into the display pipeline found on Allwinner platforms.
A group of Linux kernel developers is organizing the Alpine Linux Persistence and Storage Summit on September 11-14. This meeting of kernel developers will discuss the hot topics in Linux storage and file systems, such as persistent memory, NVMe, multi-pathing, raw or open-channel flash and I/O scheduling.
Bootlin engineers Boris Brezillon, who is the co-maintainer of the MTD subsystem in the Linux kernel, and Miquèl Raynal, who is the co-maintainer of the NAND subsystem in the Linux kernel, will be attending this event. Through this participation, Bootlin is supporting the work done by its engineers acting as Linux kernel maintainers: they will have the chance to meet other kernel developers and discuss the current issues and future of storage-related subsystems. After the event, we will be reporting on our blog about the discussions that took place.
The first task tackled this week was solving the bit offset issue encountered last week. We found out that ffmpeg keeps this offset internally in bits, but rounds it up to a byte-aligned value before passing it to VAAPI. When we used the internal value in bits instead, our VPU successfully decoded the H265 frame. After looking at the values for a few distinct frames, it became clear that the offset matched the beginning of a Golomb-coded compressed sequence, starting with a 1 bit and followed by zeros, as a prefix code. Detecting this pattern appears to work reliably for the H265 videos we could test.
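The detection described above can be sketched as follows. This is a minimal illustration, not our actual code, and it rests on two assumptions. First, the bits just before the byte boundary reported by ffmpeg are a single 1 bit followed by zero bits; this looks like the H265 byte_alignment() syntax. Second, bits are numbered MSB-first. Under these assumptions, the bit offset is the position of the lowest set bit of the preceding byte. find_bit_offset() is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: byte_offset is the slice data offset rounded up to a
 * byte boundary. Assuming the bits just before it are "1 then zeros", return
 * the bit offset (MSB-first) of that 1 bit, or -1 if the preceding byte is
 * zero, i.e. the assumed pattern is not present.
 */
static long find_bit_offset(const uint8_t *buf, size_t byte_offset)
{
    assert(buf != NULL && byte_offset > 0);
    uint8_t last = buf[byte_offset - 1];
    if (last == 0)
        return -1;

    /* Count the zero bits that follow the 1 bit, up to the boundary. */
    int trailing_zeros = 0;
    while (!(last & (1u << trailing_zeros)))
        trailing_zeros++;

    return (long)(byte_offset - 1) * 8 + (7 - trailing_zeros);
}
```

For example, if the byte before the boundary is 0x94 (binary 10010100) at index 1, the pattern starts at bit 5 of that byte, which gives a bit offset of 13.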
This paved the way for properly decoding intra-coded (I) H265 frames without any hardcoded value left in the code. With that in place, it was only a small stretch to decode a few seconds of video made of I frames!
Of course, intra-coded frames are rare in H265 videos. Since they do not use any temporal compression technique, they are larger in size. Predicted frames, which use references to already-decoded frames, make up the vast majority of H265 videos. Prediction is either forward only (P frames) or both forward and backward (B frames). Supporting these prediction modes requires significant driver-side work. In particular, the driver has to handle the reference lists themselves and the metadata associated with each frame in them, such as prediction weight coefficients. On the framework side, V4L2 controls also had to be introduced to provide the required plumbing for these features.
As of today, we have successfully implemented support for P frames, while B frames are still a work in progress. To illustrate our progress, we decoded the same video in v4l2-request-test (at nominal and half speed) with the two prediction modes.
With I and P frames, the video is decoded correctly.
Some more work seems to be required for B frames.
Next week will be the opportunity to move forward on B frames decoding!
As always, LWN.net provided interesting coverage of this release cycle’s merge window, highlighting the most important changes: the first half of the 4.18 merge window and the rest of the 4.18 merge window. For 4.18 alone, Bootlin contributed a total of 190 patches. According to KPS, this puts us in 13th place in the ranking of the most contributing companies.
Also according to LWN statistics, Bootlin engineer Alexandre Belloni is the 9th most active developer in terms of changed lines for this release, with a total of 6801. This release also includes the first contribution of Paul Kocialkowski, our intern working on the Allwinner VPU driver, as a Bootlin team member. Finally, we’re proud to see the Linux kernel’s NAND subsystem welcoming Miquèl Raynal as a co-maintainer.
Antoine Ténart converted the PPv2 Ethernet controller driver found on Marvell 7K and 8K platforms to phylink, and added support for the 1000baseX and 2500baseX modes,
Boris Brezillon contributed the new spi-mem layer, which reworks how SPI memories are supported. It allows regular SPI controller drivers to be used not only for regular SPI devices, but also for SPI NOR and SPI NAND memories. See our detailed blog post on this topic,
For the RTC subsystem, Alexandre Belloni fixed a race condition that could happen in the probe function of a few drivers, and made a few drivers define a range of supported dates,
Bootlin engineers are not only contributors, but also maintainers of various subsystems in the Linux kernel, which means they are involved in the process of reviewing, discussing and merging patches contributed to those subsystems:
Maxime Ripard, as the Allwinner platform co-maintainer, merged 38 patches from other contributors
Boris Brezillon, as the MTD/NAND maintainer, merged 76 patches from other contributors
Alexandre Belloni, as the RTC and Microsemi maintainer and Atmel platform co-maintainer, merged 32 patches from other contributors
Grégory Clement, as the Marvell EBU co-maintainer, merged 17 patches from other contributors
Here is the commit by commit detail of our contributions to 4.18:
This week started with the preparation of a new revision of the Cedrus VPU driver, after significant feedback was received on the version posted two weeks ago. Thanks to careful testing by community member Jernej Škrabec, a number of decoding issues were discovered in version 6 of the driver. Some of these issues are related to MPEG2 decoding. Others are related to the use of the VPU untiling block, which affects all codecs alike.
Some updates were also included on the MPEG2 controls side, in order to bring them closer to the raw bitstream parameters. Some parameters (that are not exposed by VAAPI) were also added, making the V4L2 controls broader than what is strictly required for our VPU.
Regarding H265, progress was slow this week due to a mismatch between values provided by VAAPI and what our VPU expects. More specifically, VAAPI provides a byte-aligned value for the offset to the coded video data in the slice (which also includes a header with metadata) while our VPU expects a bit-aligned value that does not match the value provided by VAAPI. We are hard at work to figure out a solution to this issue, but it is not straightforward. In addition, the reference libvdpau-sunxi code does not set that offset explicitly, as it is reached after parsing the header through the VPU itself. In our case, the parsing is done in userspace so the use case differs.
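A bit-level offset is needed at all because most slice header fields are variable-length unsigned Exp-Golomb codes, ue(v). As a result, the parsed header generally ends at an arbitrary bit position rather than on a byte boundary. The minimal ue(v) reader sketched below is for illustration only: it is not code from libva-v4l2-request or ffmpeg, and the names are hypothetical. Its pos field, counted in bits, holds exactly the kind of information that is lost when the offset is rounded up to a byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct bit_reader {
    const uint8_t *buf;
    size_t size;   /* buffer size in bytes */
    size_t pos;    /* current position in bits, MSB-first */
};

/* Read one bit, or return -1 past the end of the buffer. */
static int read_bit(struct bit_reader *br)
{
    if (br->pos >= br->size * 8)
        return -1;
    int bit = (br->buf[br->pos / 8] >> (7 - br->pos % 8)) & 1;
    br->pos++;
    return bit;
}

/* Decode ue(v): N leading zeros, a 1 bit, then N info bits;
 * value = 2^N - 1 + info. Returns -1 on truncated or invalid input. */
static int64_t read_ue(struct bit_reader *br)
{
    assert(br != NULL && br->buf != NULL);
    int zeros = 0;
    int bit;

    while ((bit = read_bit(br)) == 0)
        if (++zeros > 32)
            return -1;
    if (bit < 0)
        return -1;

    uint64_t info = 0;
    for (int i = 0; i < zeros; i++) {
        bit = read_bit(br);
        if (bit < 0)
            return -1;
        info = (info << 1) | (uint64_t)bit;
    }
    return (int64_t)(((uint64_t)1 << zeros) - 1 + info);
}
```

For example, the bits 1, 010, 011, 00100 (bytes 0xa6 0x40) decode to 0, 1, 2 and 3, leaving the reader at bit 12.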
Following on last week’s progress, this week was also focused on bringing the required plumbing for H265 support into our video decoding pipeline. Thanks to register dumps obtained last week from libvdpau-sunxi, we quickly hacked together support for decoding a single intra frame (with no dependency on any other frame) by replaying the dumped register write sequence. Once decoding that single frame worked with the hardcoded register values, we progressively replaced these values with actual register field definitions. These fields have to be configured with the appropriate metadata for the frame, as parsed from the H265 bitstream.
As a result, the next step was integrating the required metadata information as dedicated V4L2 controls. Since these controls have to be as generic as possible (in order to fit well with future V4L2 stateless VPU drivers), we carefully looked at the metadata fields that the bitstream offers and considered the elements that VAAPI provides in userspace as well as the information that our VPU needs specifically. It appears that some fields required by our VPU are not exposed by VAAPI directly, so a few tricks were needed along the way.
At this point, we have a first draft of the controls. It allows decoding the intra-coded frame that we dumped last week, using the metadata provided through the controls instead of hardcoded values.
More work is required to include support for other types of frame coding, namely B and P predictive frames. Next week’s focus will be set on decoding a series of intra-coded frames and moving on to supporting predictive frames. Thankfully, the work done by Bootlin engineer Maxime Ripard when adding support for H264 makes the whole process considerably easier, since H265 resembles H264 in many aspects.