Following up on last week’s efforts on the video players integration front, Kodi remained our core focus. With a LibreELEC setup in place, it was possible to start tackling VAAPI integration. This was not such a straightforward task, since various assumptions were in place. For instance, it was assumed that VAAPI support was only relevant for x86 platforms and it seems pretty clear that VAAPI integration in general was done with x86 in mind. This is particularly illustrated by the fact that the VAAPI video rendering pipeline relies on the GPU for all transformations and composition. This is a typical setup for x86, as the use of planes on these platforms was progressively replaced by a GPU-centric approach. Since our goal with Kodi is to use DRM/KMS planes in place of the GPU, this did not fit well. Moreover, the GPU import format required for dma-buf is simply not supported by the Mali blob (as we found out some weeks ago when working with VLC and the GLES untiler) and this is the only setup that Kodi currently supports for VAAPI.
There is still definitely hope, as Kodi supports a DRM Prime renderer that uses DRM/KMS planes in place of the GPU but does not support VAAPI in its current form. More specifically, it uses ffmpeg to get a dma-buf handle (through the AV_PIX_FMT_DRM_PRIME format from ffmpeg), that is not available as-is. In order to get this sort of pipeline with VAAPI, multiple steps have to be taken. A hardware acceleration context has to be brought up to select the VAAPI acceleration method instead of regular software decoding. This exposes the AV_PIX_FMT_VAAPI format from ffmpeg, which is still not good to feed the Kodi DRM Prime renderer. This has to be converted to AV_PIX_FMT_DRM_PRIME using ffmpeg helpers. As a result, some plumbing is required in Kodi and this work is still work in progress at the moment.
In parallel to the work on players, our Sunxi-Cedrus VPU driver was rebased on top of the latest version of the media request API from Hans Verkuil. It was the occasion to spot various bugs in this latest iteration, that were rapidly tackled thanks to Hans’ availability. The required follow-up patches were posted on the request API branch and will be part of its next revision. Regarding our driver itself, a great number of comments from our previous patchset were taken into account and integrated. We now have another iteration of the series ready, that we will publish soon. The tasks list for the driver itself keeps shrinking and we are getting closer and closer to the point where the driver is ready to be merged!
On the H264 front, good progress has been made this week too. Early this week, we’ve been able to play a baseline profile video without any particular quirks anymore. Some time was thus spent on cleaning up and refactoring the driver, libva-dump and cedrus-frame-test tools in order to support both the MPEG2 and H264 codecs, a feature that was dropped due to many hacks during the development. We then took the occasion to start the discussion on the linux-media mailing list by sending a preliminary version of the patches. We then worked on the real libva-cedrus, adding the support for H264. Most of the code is there now, but unfortunately isn’t functional yet. Some debugging will be on the agenda next week 🙂
The work conducted this week on the video output side was focused on writing a shader for untiling the MB32 NV12-based format used by the VPU to output frames. This brought various challenges, some of which are presented below.
Since GLES and EGL are generic APIs that are not tied to a particular driver implementation, it made sense to start writing the shader on an x86 Intel-based device with GPU support in Mesa 3D (and speed-up the development time). The first step to the process was to display the raw pixel values from the luminance plane through the shader. Actually, two shaders are required: one for the vertex processor and one for the fragment (pixel) processor of the GPU. The former is in charge of applying geometrical operations to the vertices (the points that define the 3D mesh) while the latter defines the color for each rendered pixel from that mesh. In our case, the mesh is simply a rectangle that matches our window size. The tiled NV12 luminance plane is uploaded to the GPU as a 1-byte-per-pixel texture, which allows addressing each component separately. However, the coordinates for the texture are normalized by the GPU, so coordinates to retrieve texels (texture pixels) form the texture sampler are specified as decimal values. This makes it tricky to ensure that the right value is retrieved, especially given that the GPU might apply various filtering techniques (that are a really good thing to have when dealing with actual textures for 3D models, though).
Setting up the vertex and fragment shaders to linearly display the pixels from the tiled format results in a mangled display (as expected):
With some extra work (and quirks for ensuring that the right texel is picked on tile edges), the luminance component was finally displayed correctly:
Next up was the chrominance component, that required importing a second dedicated texture. First tries lead to funky-looking coloring of the frame:
Until the shader was corrected to end up with a good-looking picture:
Real trouble began when porting this work to the Mali, that does not behave the same when it comes to texture uploads (and requires line-by-line upload for 1-byte-per-pixel formats). Since we are aiming at DMAbuf import instead of (slow) texture upload, no time was spent coping with the difference. The main issue with DMAbuf import is that the usual one-byte-per-pixel format (described by the DRM_FORMAT_R8fourcc code) is simply not supported by the GPU, leaving only RGB and YUV as options, that do not directly fit the bill. We are still investigating ways to make our texture available to the GPU’s texture sampler without extra copies (or with copies that can fit our bill in terms of performance).
We also worked on the H264 decoding in the kernel, and some progress was made. The libva-dump and cedrus-frame-test ports are now done, and we’ve been able to run cedrus-frame-test on 32 frames without any hiccups… Unfortunately, while the VPU reports the frames as properly decoded, the contents of the output buffer is blank, which is obviously not great. Since then, we have simplified the test to have a single frame decoded, and compared the register write sequence between libvdpau-sunxi and our kernel code. This has allowed us to find some bugs in the driver, but the current state is still that we can’t decode a frame. We shouldn’t be very far now though, so stay tuned for our next status update!
Early February, we announced the launch of our first Kickstarter campaign, whose goal was to add support for the video decoding unit found in most Allwinner processors to the official Linux kernel.
We were very pleased to see that in just one day, enough companies and individuals participated to fund the main goal of the campaign, collecting more than 17,600 EUR. And now that the campaign is over since March 18, we are even more pleased, as we have reached a funding level sufficient to cover not only our main goal, but also our two first stretch goals.
Thanks to the participation of all those companies and individuals, we will be able to:
Deliver our main goal, which we expect to deliver by the end of June 2018, which includes:
Making sure that the codec works on the older Allwinner SoCs that are still widely used: A10 (Cubieboard), A13 (A13-Olinuxino), A20 (Cubieboard 2, A20-Olinuxino), A33 (A33-Olinuxino, BananaPi M2-Magic), R8 (CHIP) and R16 (NES and Super NES classic). Support for the newer SoCs (H3, H5 and A64) requires more work, and is part of our first stretch goal below.
Polishing the existing MPEG2 decoding support to make it fully production ready.
Implementing H264 video decoding, since H264 is by far one of the most popular video codec.
Modifying the Allwinner display driver in order to be able to directly display the decoded frames instead of converting and copying those frames, which is very inefficient from a CPU consumption point of view
Providing a user-space library easy to integrate in the popular open-source video players
Upstreaming those changes to the official Linux kernel
Deliver our two first stretch goals, which we expect to deliver by the end of 2018, which includes:
Supporting the newer Allwinner SoCs, such as the H3 (Most of the Orange Pis, Nano Pi M1, ..), H5 (Orange Pi PC2, NanoPi NEO2, …) and A64 (Pine64, BananaPi M64, …).
H265 video decoding support
Unfortunately, our third stretch goal, which consisted in adding support for H264 encoding was not reached, so we don’t know yet if we will have enough time to look into this topic.
However, the work on the other topics has already started, with our intern Paul Kocialkowski has started to work since March 1st on this Allwinner VPU effort, and Maxime Ripard has started this week. We have already published, will continue to publish a report every week: week 10, week 11 and week 12. You can follow the progress of this project by reading our blog, our Twitter feed or the Kickstarter updates. You can also read the Sunxi-cedrus Wiki page to get all the details about this project and its progress.
Beyond the specific topic of the Allwinner VPU support, we are very happy to see that the model of funding upstream Linux kernel work through crowd-funding has worked. Most Kickstarter projects, in exchange for the participants contribution, provide to the participant a specific product (a book, a device) that only benefits to the contributor. Here, the result of this campaign will be shared freely with everybody, both Kickstarter contributors and non-contributors, and we’re proud to see that our experience has convinced numerous companies and individuals to support our project. Of course, we will be organizing in the near future the shipment of the t-shirts as well as the beer drinking sessions with Bootlin engineer Maxime Ripard.
To conclude, we would like to thank all our participants (we’re naming only the ones who baked at a level above 16 EUR, as above this level contributors are going to be mentioned in the CREDITS file, which clarifies their intent to be publicly named). First a number of companies supported our work: OrangePi, Libre Computer, neutis.io, FriendlyArm, Pine64, Olimex and of course a huge number of individuals: Abe Lacker, Adam Morris, Adam Oberbeck, Alerino Reis, Alex, Alexander A. Istomin, Alexander Kamm, Alexandru Nedel, Alex Kaplan, Amarpreet Minhas, André, Andreas, Andreas Färber, Andreas Rozek, Andre Przywara, Andrew Langley, Angel Rua Amo, Anssi Kolehmainen, Antony, Appreciation of Efforts, Aron Somodi, Artur Huhtaniemi, Atsushi Sasaki, Bastien Nocera, Bavay, Benjamin Glass, Benjamin Larsson, Ben Young, Bernard D’Havé, Bert Lindner, Bert Vermeulen, BESSIERES MARC, Biji-san, Bob Black, brot, Bruce Shipman, Butterkeks, Carla Sella, Carl Wall, Carsten Tolkmit, cbrocas, chae, Charlie Bruce, Christian Gnägi, Christian Pellegrin, Christian Stalp, Christophe Vaillot, Christoph Kröppl, Conan Kudo (ニール・ゴンパ), D1don, Dale Cousins, Daniel, Daniel Hrynczenko, Daniel Kulesz, Daniel Mühlbachler, David Pottage, David Willmore, defsy, Denis Bodor, DESSARD Guy, Dimitrios Bogiatzoules, Dominique Dumont, Doyle Young, Dubouil, Eelco Wesemann, eineki, Emil Karlson, Emmanuel Fusté, Erdem MEYDANLI, Eric des Courtis, Eric Jensen, Eric Koorn, Éric Périé, Erik, erikf, Evaryont, Fabian Korak, Felix Eickeler, Flo, Florian Beier, Florian Kempf, Frank, Frank van Kesteren, Frederir, G40, Gabor, Gabriel Ortiz, Garrett Gee, Georg Ottinger, Gerald Hochegger, Geralt, ghostpatch, Gianpaolo Macario, Giulio Benetti, Guenther Gassner, Guilhem, Guilhem Saurel, hackman, Hamish, Hanno Helge, Hans-Frieder Vogt, Heinz Thölecke, Henrik Kuhn, hook, Hugh Reynolds, Ian Daniher, iav, Inapplicable, Ingo Strauch, Ioan Rogers, Irvel Nduva, James, James Cloos, James Valleroy, jan koopmanschap, Jared Smith, Jari-Matti, Jarkko Pöyry, Jasper Horn, jean, Jean-Pierre Rivière, Jeffrey Sites, Jernej, Jerome Hanoteau, JK, John Kelley, Johnny Sorocil, Juanjo Marin, Jussi Pakkanen, Justin Ross, Karl Palsson, Kazım Rıfat Özyılmaz, Kean, Kevin Fowlks, Kevin Read, kicklix, Kiesel, Koen Kooi, Korbinian Probst, kratz00, Laurent GUERBY, Lee Donaghy, liushuyu, Logicite, luigi, Lukas Schauer, lzrmzz, Maksims Matjakubovs, Manuel, Marcel Sarge, Marcus Cooper, Mario Villarreal, Mark Dietzer, Markus Härer, Martijn Bosgraaf, mateuszkj, Mathias Brossard, mathieu, Matsumoto Kenichi, Matthew Zhang, Matthias, Matthias Lamm, Matt Mets, Maxime Brousse, Me, MESNIL Mikaël, Michael Gregorowicz, Michael Thalmeier, Michal Zatloukal, Mirko Vogt, mouren, N/A, naguirre, Neil Davenport, Nick Crasci, Nick Richards, Oleksij Rempel, oliver, Oliver Heyme, OSAKANA TARO, othiman, ozcoder, Pablo, Patrick, Paul Philippov, Paul Sykes, Per Larsson, Peter Gnodde, Peter Robinson, Philip-Dylan Gleonec, Phipli, Phoenix Chen, Pierce Lopez, Priit Laes, Prisma, Rainer Stober, Reignier, René Kliment, Reto Haeberli, Ricardo Salveti de Araujo, Richard Cote, Richard Ferlazzo, Riku Voipio, Robert Lukierski, Robert McQueen, roens, Rui Gu, Ryan Casey, Salvatore Bognanni, Samuel Frederick, Scott Devaney, Sebastian Krzyszkowiak, Sébastien DA ROCHA, Sergey Kopalkin, Sertac Tulluk, Shelby Cruver, Shervin Emami, SIMANCAS, Simon Josefsson, Spas Kyuchukov, ssam, Stan, Stanislav Bogatyrev, Stas, Stefan Bethke, Stefan Monnier, Steffen Elste, Stephan, Stephan Bärwolf, Stephen Kelly, Steven Seifried, Stokes Gresh, Sven Kasemann, SvOlli, Tarjei Solvang Tjønn, Tetsuyuki Kobayashi, Texier Pierre-jean, Thomas Monjalon, Thomas Samson, Tim Symossek, Todd Zebert, Tomas Virgl, tpc010, Tyler Style, Valentin Hăloiu, valhalla, Vasily Evseenko, Vitaly Shukela, Xavier Duret, Yannick Allard, Yves Serrano, Zoltan Herpai, ZotoPatate, zym060050.
Just over a week ago, I started my internship focused on adding upstream Linux kernel support for the Allwinner VPU at Bootlin’s Toulouse office. The team has been super-friendly and very helpful to help me get settled and I’m definitely happy about moving to Toulouse for the occasion!
This first week of work was focused on studying and rebasing the work done by Florent Revest a year and a half ago. As a main development target, I went for an A33-based board, the SinA33 from Sinlinx. Florent’s patches for the sunxi-cedrus driver were rebased against the latest release candidate version of Linus’ tree, v4.16-rc4.
The driver was then adapted to use the latest version of the V4L2 request API, a crucial piece of plumbing needed to provide coherency between setting specific controls for the media stream and the input/output buffers that these controls are related to. A few bugs needed fixing along the way, in order to avoid memory corruptions (use-after-free) and to properly schedule the VPU to run when a request is submitted. With these fixes the driver was ready, so it was sent for review on the linux-media mailing list. On the userspace side, the cedrus-specific libva was also updated to use the latest version of the request API.
The next step in the pipeline is to use a common buffer for the VPU’s decoded frame and the display controller’s plane, using dmabuf. This should bring a significant performance improvement and eventually allow for hardware-based scaling when decoding videos through the standard DRM/KMS interfaces. However, this requires adding support for the specific format used by the VPU (a multiplanar NV12 format with 32×32 tiles) into the display controller code.
Back in 2012, Bootlin (formerly Free Electrons) engineer Maxime Ripard pioneered the support for Allwinner processors in the official Linux kernel. Today, thanks to the contributions of numerous developers around the world and our involvement, there is very good support for a large number of Allwinner processors in the Linux kernel, to the point where actual Allwinner-based products are shipping with the mainline kernel.
Despite this major effort, there is one area that has remained unsupported in the mainline kernel: the video decoding and encoding engine, which allows to accelerate in hardware the decoding and encoding of popular codecs such as MPEG2, MPEG4 or H264. Last summer, we successfully implemented a prototype, supporting MPEG2 decoding and partially MPEG4 decoding.
Today, we are launching a crowdfunding campaign to fund the remainder of the development: finishing MPEG4 decoding support, implementing H264 decoding, optimizing the rendering of video frames in cooperation with the display driver, and upstreaming the driver. We also have additional goals of supporting H265, encoding support, and additional Allwinner SoCs.
In the vendor-provided kernel, this video decoding/encoding unit is supported by a kernel driver that uses a non-standard user-space API, in conjunction with a binary-only userspace blob. Fortunately, a number of people have done an enormous reverse engineering effort, which we have leveraged for our existing prototype, and which we intend to use to continue the development of this upstream driver. Both Maxime Ripard and our intern Paul Kocialkowski will be working on this crowdfunded project.
This is our first crowdfunding campaign to fund upstream Linux kernel development, and we are interested in seeing how much interest there is in such a financing model. Help us making this a success by spreading the word!
Over the last few years, and most recently with the support for the C.H.I.P platform, Free Electrons has been heavily involved in initiating and improving the support in the mainline Linux kernel for the Allwinner ARM processors. As of today, a large number of hardware features of the Allwinner processors, especially the older ones such as the A10 or the A13 used in the CHIP, are usable with the mainline Linux kernel, including complex functionality such as display support and 3D acceleration. However, one feature that was still lacking is proper support for the Video Processing Unit (VPU) that allows to accelerate in hardware the decoding and encoding of popular video formats.
The internship resulted in a new sunxi-cedrus driver, a Video4Linux memory-to-memory decoder kernel driver and a corresponding VA-API backend, which allows numerous userspace applications to use the decoding capabilities. Both projects have both been published on Github:
Currently, the combination of the kernel driver and VA-API backend supports MPEG2 and MPEG4 decoding only. There is for the moment no support for encoding, and no support for H264, though we believe support for both aspects can be added within the architecture of the existing driver and VA-API backend.
A first RFC patchset of the kernel driver has been sent to the linux-media mailing list, and a complete documentation providing installation information and architecture details has been written on the linux-sunxi’s wiki.
Here is a video of VLC playing a MPEG2 demo video on top of this stack on the Next Thing’s C.H.I.P: