This week’s progress in our VPU driver development effort was focused on two main tasks: submitting the sixth revision of the Cedrus VPU driver series to the mainline Linux kernel and starting the work on H265 decoding.
The patch series for this new iteration of the driver was submitted on Wednesday and contains both functional and cosmetic changes. Most notably, we implemented support for video-specific quantization matrices in MPEG2, one of the final extension bits we were missing until then, but also cleaned up the register definitions for the driver. At this point, there are no undocumented registers or fields left, which makes the overall understanding of the hardware interactions much more straightforward. The driver was also moved to staging drivers, not because it was deemed of poor quality but rather because V4L2 maintainers want to keep the ability to change the controls that our driver is using even after it is merged.
Aside of this work, we started looking into H265 decoding, that was also already implemented in libvdpau-sunxi for the downstream modified version of the Linux kernel provided by Allwinner for the H3 (still based on Linux 3.4 to this day, which was released in 2012). After setting up a board with this kernel and libvdpau-sunxi, we were able to dump the register access made by libvdpau-sunxi, providing a reference for bringing up H265 support in the Cedrus VPU driver!
With a few weeks of delay, we are proud to announce the delivery of the main goals of our crowdfunding campaign dedicated at adding upstream Linux support for the Allwinner video decoding hardware.
After several months of hard work by Bootlin engineer Maxime Ripard and intern Paul Kocialkowski, we now have a working demo of Kodi running with our VPU driver on top of a mainline 4.18-rc kernel. Both MPEG2 and H264 are supported, with a fully-optimized pipeline between the VPU and the display side that does not involve any buffer copy or extra transformation that the hardware cannot offload. These results were possible thanks to the previous efforts carried out by the linux-sunxi community, and especially the libvdpau-sunxi project.
Here were the main goals defined in our crowdfunding campaign, which we promised to deliver end of June 2018, and their status in our delivery:
Making sure that the codec works on the older Allwinner SoCs: A10, A13, A20, A33, R8 and R16.. This goal is fully met, with more features than planned: the Cedrus driver was brought up on the A10, A13, A20, A33 and H3. Therefore, we included H3 support in this delivery, even though it was originally only part of one of the stretch goals. The R8 is the same as an A13 and the R16 is the same as an A33, so they are supported as well.
Polishing the existing MPEG2 decoding support to make it fully production ready. This goal is fully met: we have done much more testing of the MPEG2 decoding, and both the Linux kernel code and user-space code supporting MPEG2 has been significantly improved and cleaned up.
Implementing H264 video decoding, since H264 is by far one of the most popular video codec.. This goal is fully met: H264 decoding support has been added to both the Linux kernel driver and the user-space library, including high-profile H264 support. However, the H264 support is still very recent and we expect that additional debugging and improvements will be needed.
Modifying the Allwinner display driver in order to be able to directly display the decoded frames instead of converting and copying those frames. This goal is fully met: the Allwinner DRM driver has received a number of patches to ensure we can use one of the several planes to directly display the video frames in the format provided by the VPU. Support for hardware scaling has also been fixed to work properly. Those patches have already been contributed to the upstream Linux kernel. The work on the A20 and A33 display driver was done by Bootlin, while the work on the H3 was done by other developers of the community.
Providing a user-space library easy to integrate in the popular open-source video players. This goal is partially met: while we are providing a libva-v4l2-request user-space libraries that can in theory be used by all libva capable video players, the actual integration with video players is for now only working completely with Kodi. We have started efforts to make it work with both VLC and GStreamer, but the work has not been complete due to various challenges detailed below. This area was definitely much more challenging than we initially expected.
Upstreaming those changes to the official Linux kernel. This goal is almost met: we have posted 5 iterations of the Cedrus Linux kernel driver, each time using new versions of the Request API patches, helping improve this API along the way. While our patches have not been merged yet, because the Request API itself hasn’t been merged, they have received significant review from the V4L developers, and we believe our patches are not far from being merged.
All in all, despite the numerous challenges encountered over the last few months, we are happy to see that we have been able to deliver most of the goals completely, and we are not too far off for the few goals that haven’t yet been fully met. As we will discuss below, we will continue to work in the next months on completing those unfinished steps, and on the stretch goals that received enough funding.
Reaching this level of support was not a straightforward journey, as our road was paved with various obstacles that are presented below.
Media Request API
In order to add support for the VPU found on Allwinner platforms, some internal plumbing is necessary in the Video4Linux2 (v4l2) framework, the video framework in Linux. While V4L2 gained support for a specific class of VPUs, so-called “stateful” (where the video bitstream is passed directly to the hardware controller) thanks to the Memory2Memory API, this is not sufficient for our hardware. Indeed, Allwinner platforms come with a “stateless” VPU, where the video needs to be parsed beforehand to extract the frame data and its associated metadata, and then passed to the hardware. V4L2 lacked an API for synchronizing the frame data and associated metadata, although it had been in development for a long time and known as the Request API.
Our work on Cedrus contributed to revive the flame for this API, that saw its development accelerated over the past months thanks to the commitment of individuals such as Alexandre Courbot, Hans Verkuil and Sakari Ailus. We had the opportunity to report various issues and suggest fixes over its development process, which were integrated so that all the required bits for our driver are now in. The API is finally mature and appears to be quite stable, so there is no known blocker left for its integration in the kernel.
Cedrus V4L2 Driver
The first version of the Cedrus driver originally developed in 2016 by Florent Revest as part of an internship at Bootlin was based on an old version of the Request API. We therefore started by porting it to the latest version of the API and kept publishing new revisions as development of the Request API happened. We also received useful feedback from the community in the process. Here are the different iterations of the Cedrus driver that have been sent as part of this crowdfunded effort:
In addition to those patch series adding the driver itself, an additional patch series was sent to bring H264 support.
The development of the driver itself was not the most cumbersome part of the process, although it brought some challenges. For instance, we had to rework buffer management after discovering a limitation in the hardware, where the luminance and chrominance planes of our destination buffers need to be kept close in memory. We also had to bring in a workqueue (later replaced by a threaded IRQ) for the needs of the M2M API, which comes with performance drawbacks, although this issue is in the process of being resolved.
In order to test the VPU driver in a fully-controlled environment, we developed a standalone testing tool: v4l2-request-test (formerly cedrus-frame-test) that implements all the V4L2 userspace APIs needed for our VPU, including M2M and the Request API. This tool includes frame data and metadata dumps from actual videos, with the ability to decode these frames one-by-one. The tool was tremendously helpful for debugging the driver as well as adding support for H264. Since the userspace APIs involved properly abstract the hardware, this tool can be used to bring up and develop other VPU drivers that rely on the V4L2 Request API!
In order to provide integration with actual video players, we developed libva-v4l2-request (formerly libva-cedrus): a VAAPI backend that supports the V4L2 M2M and Request APIs. It currently supports both MPEG2 and H264 and will be extended as support for new formats is added. Just like v4l2-request-test, libva-v4l2-request aims at using the kernel APIs involved in a generic way, that should suit other Request API-based VPU drivers.
In the long run, it is likely that players will integrate direct support for the Request API (for instance, through ffmpeg). In the meantime, this allows interfacing with media players through two major interfaces: buffer derivation where the destination frames are copied (and converted to a regular image format when the VPU cannot do it on its own) or dma-buf, without any copy.
Zero-copy Pipeline Integration with EGL (Mali GPUs): VLC and GStreamer
In order to reach the best performance we can achieve, we focused on pipelines where no buffer copy is involved, on popular players: VLC and GStreamer. Since the X.org display server does not easily permit piping the VPU output to a dedicated plane on the Display Engine side, we investigated the use of the GPU. GPU support on Allwinner platforms still requires proprietary blobs at this point, such as the ones recently made available by Bootlin. We hope that the Lima project will soon bring a fully free alternative that will be integrated with both upstream kernel and upstream userspace components.
We did not have much luck when dealing with the tiled VPU output format, that the GPU cannot handle directly. Although we wrote a GPU shader for untiling (that works properly with regular GL implementations), the Mali GPU blobs did not behave as expected when it came to importing the tiled output frame. There is a chance that platforms that can output a regular image format (A33 and onwards) will be able to deal with piping the VPU and the GPU for accelerated scaling and colorspace conversion, but we did not test this option at this point.
Zero-copy Pipeline Integration with DRM (Display Engine): GStreamer and Kodi
Although involving the GPU in the pipeline was not a realistic possibility with the tiled VPU output format, various players support a direct DRM video output, that uses the Display Engine directly to pipe the video. Alas, it means that no window composition is possible, so this cannot be integrated with desktop environments. Instead, the players run standalone in their own virtual terminal.
We initially looked at using GStreamer this way but soon decided to prioritize Kodi (formerly XBMC), the popular mediacenter application. It was a struggle to integrate our pipeline (through libva-v4l2-request, via ffmpeg) in Kodi, although DRM video output support was there already. We eventually managed to get a usable result out of it, although there are areas left to improve!
LibreELEC Image Release with Kodi
In order to showcase the delivery of our main VPU crowdfunding campaign goals, we cooked a release of LibreELEC that supports the A20, A33 and H3 SoCs! It consists of a LibreELEC root filesystem (excluding the kernel and boot software) that works in conjunction with our latest linux-cedrus kernel tree.
Source code is of course available through our repositories, marked with the release-2018-07 tag.
Instructions to deploy the software on a compatible board are available on the linux-sunxi community wiki!
We have tackled many of the tasks on our plate at this point, but there are still items that need to be worked on:
posting new series of the Cedrus driver and H264 support until it is merged;
supporting H265 in our driver and userspace components;
supporting the ARM64 SoCs that come with version 2 of the Display Engine design, namely the H5 and A64;
contributing to the integration of our code in upstream Kodi and LibreELEC;
integrating a dma-buf and DRM pipeline with GStreamer.
We would like to thank all the individuals and companies who have supported this project by participating to our crowdfunding campaign, but also the linux-sunxi community members who did the initial reverse engineering of the Cedrus VPU and who worked with us during the development of this driver as well as the members of the V4L2 community who worked on the Request API and reviewed our patches.
Following up on last week’s efforts on the video players integration front, Kodi remained our core focus. With a LibreELEC setup in place, it was possible to start tackling VAAPI integration. This was not such a straightforward task, since various assumptions were in place. For instance, it was assumed that VAAPI support was only relevant for x86 platforms and it seems pretty clear that VAAPI integration in general was done with x86 in mind. This is particularly illustrated by the fact that the VAAPI video rendering pipeline relies on the GPU for all transformations and composition. This is a typical setup for x86, as the use of planes on these platforms was progressively replaced by a GPU-centric approach. Since our goal with Kodi is to use DRM/KMS planes in place of the GPU, this did not fit well. Moreover, the GPU import format required for dma-buf is simply not supported by the Mali blob (as we found out some weeks ago when working with VLC and the GLES untiler) and this is the only setup that Kodi currently supports for VAAPI.
There is still definitely hope, as Kodi supports a DRM Prime renderer that uses DRM/KMS planes in place of the GPU but does not support VAAPI in its current form. More specifically, it uses ffmpeg to get a dma-buf handle (through the AV_PIX_FMT_DRM_PRIME format from ffmpeg), that is not available as-is. In order to get this sort of pipeline with VAAPI, multiple steps have to be taken. A hardware acceleration context has to be brought up to select the VAAPI acceleration method instead of regular software decoding. This exposes the AV_PIX_FMT_VAAPI format from ffmpeg, which is still not good to feed the Kodi DRM Prime renderer. This has to be converted to AV_PIX_FMT_DRM_PRIME using ffmpeg helpers. As a result, some plumbing is required in Kodi and this work is still work in progress at the moment.
In parallel to the work on players, our Sunxi-Cedrus VPU driver was rebased on top of the latest version of the media request API from Hans Verkuil. It was the occasion to spot various bugs in this latest iteration, that were rapidly tackled thanks to Hans’ availability. The required follow-up patches were posted on the request API branch and will be part of its next revision. Regarding our driver itself, a great number of comments from our previous patchset were taken into account and integrated. We now have another iteration of the series ready, that we will publish soon. The tasks list for the driver itself keeps shrinking and we are getting closer and closer to the point where the driver is ready to be merged!
On the H264 front, good progress has been made this week too. Early this week, we’ve been able to play a baseline profile video without any particular quirks anymore. Some time was thus spent on cleaning up and refactoring the driver, libva-dump and cedrus-frame-test tools in order to support both the MPEG2 and H264 codecs, a feature that was dropped due to many hacks during the development. We then took the occasion to start the discussion on the linux-media mailing list by sending a preliminary version of the patches. We then worked on the real libva-cedrus, adding the support for H264. Most of the code is there now, but unfortunately isn’t functional yet. Some debugging will be on the agenda next week 🙂
The work conducted this week on the video output side was focused on writing a shader for untiling the MB32 NV12-based format used by the VPU to output frames. This brought various challenges, some of which are presented below.
Since GLES and EGL are generic APIs that are not tied to a particular driver implementation, it made sense to start writing the shader on an x86 Intel-based device with GPU support in Mesa 3D (and speed-up the development time). The first step to the process was to display the raw pixel values from the luminance plane through the shader. Actually, two shaders are required: one for the vertex processor and one for the fragment (pixel) processor of the GPU. The former is in charge of applying geometrical operations to the vertices (the points that define the 3D mesh) while the latter defines the color for each rendered pixel from that mesh. In our case, the mesh is simply a rectangle that matches our window size. The tiled NV12 luminance plane is uploaded to the GPU as a 1-byte-per-pixel texture, which allows addressing each component separately. However, the coordinates for the texture are normalized by the GPU, so coordinates to retrieve texels (texture pixels) form the texture sampler are specified as decimal values. This makes it tricky to ensure that the right value is retrieved, especially given that the GPU might apply various filtering techniques (that are a really good thing to have when dealing with actual textures for 3D models, though).
Setting up the vertex and fragment shaders to linearly display the pixels from the tiled format results in a mangled display (as expected):
With some extra work (and quirks for ensuring that the right texel is picked on tile edges), the luminance component was finally displayed correctly:
Next up was the chrominance component, that required importing a second dedicated texture. First tries lead to funky-looking coloring of the frame:
Until the shader was corrected to end up with a good-looking picture:
Real trouble began when porting this work to the Mali, that does not behave the same when it comes to texture uploads (and requires line-by-line upload for 1-byte-per-pixel formats). Since we are aiming at DMAbuf import instead of (slow) texture upload, no time was spent coping with the difference. The main issue with DMAbuf import is that the usual one-byte-per-pixel format (described by the DRM_FORMAT_R8fourcc code) is simply not supported by the GPU, leaving only RGB and YUV as options, that do not directly fit the bill. We are still investigating ways to make our texture available to the GPU’s texture sampler without extra copies (or with copies that can fit our bill in terms of performance).
We also worked on the H264 decoding in the kernel, and some progress was made. The libva-dump and cedrus-frame-test ports are now done, and we’ve been able to run cedrus-frame-test on 32 frames without any hiccups… Unfortunately, while the VPU reports the frames as properly decoded, the contents of the output buffer is blank, which is obviously not great. Since then, we have simplified the test to have a single frame decoded, and compared the register write sequence between libvdpau-sunxi and our kernel code. This has allowed us to find some bugs in the driver, but the current state is still that we can’t decode a frame. We shouldn’t be very far now though, so stay tuned for our next status update!
Early February, we announced the launch of our first Kickstarter campaign, whose goal was to add support for the video decoding unit found in most Allwinner processors to the official Linux kernel.
We were very pleased to see that in just one day, enough companies and individuals participated to fund the main goal of the campaign, collecting more than 17,600 EUR. And now that the campaign is over since March 18, we are even more pleased, as we have reached a funding level sufficient to cover not only our main goal, but also our two first stretch goals.
Thanks to the participation of all those companies and individuals, we will be able to:
Deliver our main goal, which we expect to deliver by the end of June 2018, which includes:
Making sure that the codec works on the older Allwinner SoCs that are still widely used: A10 (Cubieboard), A13 (A13-Olinuxino), A20 (Cubieboard 2, A20-Olinuxino), A33 (A33-Olinuxino, BananaPi M2-Magic), R8 (CHIP) and R16 (NES and Super NES classic). Support for the newer SoCs (H3, H5 and A64) requires more work, and is part of our first stretch goal below.
Polishing the existing MPEG2 decoding support to make it fully production ready.
Implementing H264 video decoding, since H264 is by far one of the most popular video codec.
Modifying the Allwinner display driver in order to be able to directly display the decoded frames instead of converting and copying those frames, which is very inefficient from a CPU consumption point of view
Providing a user-space library easy to integrate in the popular open-source video players
Upstreaming those changes to the official Linux kernel
Deliver our two first stretch goals, which we expect to deliver by the end of 2018, which includes:
Supporting the newer Allwinner SoCs, such as the H3 (Most of the Orange Pis, Nano Pi M1, ..), H5 (Orange Pi PC2, NanoPi NEO2, …) and A64 (Pine64, BananaPi M64, …).
H265 video decoding support
Unfortunately, our third stretch goal, which consisted in adding support for H264 encoding was not reached, so we don’t know yet if we will have enough time to look into this topic.
However, the work on the other topics has already started, with our intern Paul Kocialkowski has started to work since March 1st on this Allwinner VPU effort, and Maxime Ripard has started this week. We have already published, will continue to publish a report every week: week 10, week 11 and week 12. You can follow the progress of this project by reading our blog, our Twitter feed or the Kickstarter updates. You can also read the Sunxi-cedrus Wiki page to get all the details about this project and its progress.
Beyond the specific topic of the Allwinner VPU support, we are very happy to see that the model of funding upstream Linux kernel work through crowd-funding has worked. Most Kickstarter projects, in exchange for the participants contribution, provide to the participant a specific product (a book, a device) that only benefits to the contributor. Here, the result of this campaign will be shared freely with everybody, both Kickstarter contributors and non-contributors, and we’re proud to see that our experience has convinced numerous companies and individuals to support our project. Of course, we will be organizing in the near future the shipment of the t-shirts as well as the beer drinking sessions with Bootlin engineer Maxime Ripard.
To conclude, we would like to thank all our participants (we’re naming only the ones who baked at a level above 16 EUR, as above this level contributors are going to be mentioned in the CREDITS file, which clarifies their intent to be publicly named). First a number of companies supported our work: OrangePi, Libre Computer, neutis.io, FriendlyArm, Pine64, Olimex and of course a huge number of individuals: Abe Lacker, Adam Morris, Adam Oberbeck, Alerino Reis, Alex, Alexander A. Istomin, Alexander Kamm, Alexandru Nedel, Alex Kaplan, Amarpreet Minhas, André, Andreas, Andreas Färber, Andreas Rozek, Andre Przywara, Andrew Langley, Angel Rua Amo, Anssi Kolehmainen, Antony, Appreciation of Efforts, Aron Somodi, Artur Huhtaniemi, Atsushi Sasaki, Bastien Nocera, Bavay, Benjamin Glass, Benjamin Larsson, Ben Young, Bernard D’Havé, Bert Lindner, Bert Vermeulen, BESSIERES MARC, Biji-san, Bob Black, brot, Bruce Shipman, Butterkeks, Carla Sella, Carl Wall, Carsten Tolkmit, cbrocas, chae, Charlie Bruce, Christian Gnägi, Christian Pellegrin, Christian Stalp, Christophe Vaillot, Christoph Kröppl, Conan Kudo (ニール・ゴンパ), D1don, Dale Cousins, Daniel, Daniel Hrynczenko, Daniel Kulesz, Daniel Mühlbachler, David Pottage, David Willmore, defsy, Denis Bodor, DESSARD Guy, Dimitrios Bogiatzoules, Dominique Dumont, Doyle Young, Dubouil, Eelco Wesemann, eineki, Emil Karlson, Emmanuel Fusté, Erdem MEYDANLI, Eric des Courtis, Eric Jensen, Eric Koorn, Éric Périé, Erik, erikf, Evaryont, Fabian Korak, Felix Eickeler, Flo, Florian Beier, Florian Kempf, Frank, Frank van Kesteren, Frederir, G40, Gabor, Gabriel Ortiz, Garrett Gee, Georg Ottinger, Gerald Hochegger, Geralt, ghostpatch, Gianpaolo Macario, Giulio Benetti, Guenther Gassner, Guilhem, Guilhem Saurel, hackman, Hamish, Hanno Helge, Hans-Frieder Vogt, Heinz Thölecke, Henrik Kuhn, hook, Hugh Reynolds, Ian Daniher, iav, Inapplicable, Ingo Strauch, Ioan Rogers, Irvel Nduva, James, James Cloos, James Valleroy, jan koopmanschap, Jared Smith, Jari-Matti, Jarkko Pöyry, Jasper Horn, jean, Jean-Pierre Rivière, Jeffrey Sites, Jernej, Jerome Hanoteau, JK, John Kelley, Johnny Sorocil, Juanjo Marin, Jussi Pakkanen, Justin Ross, Karl Palsson, Kazım Rıfat Özyılmaz, Kean, Kevin Fowlks, Kevin Read, kicklix, Kiesel, Koen Kooi, Korbinian Probst, kratz00, Laurent GUERBY, Lee Donaghy, liushuyu, Logicite, luigi, Lukas Schauer, lzrmzz, Maksims Matjakubovs, Manuel, Marcel Sarge, Marcus Cooper, Mario Villarreal, Mark Dietzer, Markus Härer, Martijn Bosgraaf, mateuszkj, Mathias Brossard, mathieu, Matsumoto Kenichi, Matthew Zhang, Matthias, Matthias Lamm, Matt Mets, Maxime Brousse, Me, MESNIL Mikaël, Michael Gregorowicz, Michael Thalmeier, Michal Zatloukal, Mirko Vogt, mouren, N/A, naguirre, Neil Davenport, Nick Crasci, Nick Richards, Oleksij Rempel, oliver, Oliver Heyme, OSAKANA TARO, othiman, ozcoder, Pablo, Patrick, Paul Philippov, Paul Sykes, Per Larsson, Peter Gnodde, Peter Robinson, Philip-Dylan Gleonec, Phipli, Phoenix Chen, Pierce Lopez, Priit Laes, Prisma, Rainer Stober, Reignier, René Kliment, Reto Haeberli, Ricardo Salveti de Araujo, Richard Cote, Richard Ferlazzo, Riku Voipio, Robert Lukierski, Robert McQueen, roens, Rui Gu, Ryan Casey, Salvatore Bognanni, Samuel Frederick, Scott Devaney, Sebastian Krzyszkowiak, Sébastien DA ROCHA, Sergey Kopalkin, Sertac Tulluk, Shelby Cruver, Shervin Emami, SIMANCAS, Simon Josefsson, Spas Kyuchukov, ssam, Stan, Stanislav Bogatyrev, Stas, Stefan Bethke, Stefan Monnier, Steffen Elste, Stephan, Stephan Bärwolf, Stephen Kelly, Steven Seifried, Stokes Gresh, Sven Kasemann, SvOlli, Tarjei Solvang Tjønn, Tetsuyuki Kobayashi, Texier Pierre-jean, Thomas Monjalon, Thomas Samson, Tim Symossek, Todd Zebert, Tomas Virgl, tpc010, Tyler Style, Valentin Hăloiu, valhalla, Vasily Evseenko, Vitaly Shukela, Xavier Duret, Yannick Allard, Yves Serrano, Zoltan Herpai, ZotoPatate, zym060050.
Just over a week ago, I started my internship focused on adding upstream Linux kernel support for the Allwinner VPU at Bootlin’s Toulouse office. The team has been super-friendly and very helpful to help me get settled and I’m definitely happy about moving to Toulouse for the occasion!
This first week of work was focused on studying and rebasing the work done by Florent Revest a year and a half ago. As a main development target, I went for an A33-based board, the SinA33 from Sinlinx. Florent’s patches for the sunxi-cedrus driver were rebased against the latest release candidate version of Linus’ tree, v4.16-rc4.
The driver was then adapted to use the latest version of the V4L2 request API, a crucial piece of plumbing needed to provide coherency between setting specific controls for the media stream and the input/output buffers that these controls are related to. A few bugs needed fixing along the way, in order to avoid memory corruptions (use-after-free) and to properly schedule the VPU to run when a request is submitted. With these fixes the driver was ready, so it was sent for review on the linux-media mailing list. On the userspace side, the cedrus-specific libva was also updated to use the latest version of the request API.
The next step in the pipeline is to use a common buffer for the VPU’s decoded frame and the display controller’s plane, using dmabuf. This should bring a significant performance improvement and eventually allow for hardware-based scaling when decoding videos through the standard DRM/KMS interfaces. However, this requires adding support for the specific format used by the VPU (a multiplanar NV12 format with 32×32 tiles) into the display controller code.