Even though the bulk of the development on the Allwinner VPU support is done, we are still working on completing the upstreaming of the kernel driver, and some progress has been made recently on this topic:
On September 10, core Video4Linux developer Hans Verkuil sent a pull request to Video4Linux maintainer Mauro Carvalho Chehab to get the Cedrus driver merged. This means we’re getting closer and closer to having the driver merged. Unfortunately, some last-minute issues were found in the patch series, so this pull request wasn’t merged.
On September 13, Bootlin engineer Maxime Ripard sent a new iteration of the Cedrus driver, version 10, which addresses those issues.
In addition to this progress on the Linux kernel driver upstreaming process, we also moved forward with delivering the perks to the companies and individuals who supported our campaign:
A CREDITS file has been added to the libva-v4l2-request base, thanking all our backers who pledged more than 16 EUR.
The T-shirts for the backers who pledged more than 128 EUR have been sent to those in the EU. We are also working on sending the T-shirts to those outside the EU, but it takes a bit more time due to the need for customs declarations. Don’t hesitate to take a picture of yourself with the T-shirt, and post it on Twitter with the hashtag #VPULinuxDriverSupporter.
The end of August has arrived, bringing an end to Paul’s engineering internship at Bootlin, focused on bringing mainline Linux support for the VPU found on Allwinner platforms. Over the past six months, we have worked hard to reach the goals announced in the project’s crowdfunding campaign and we were able to deliver most of the main goals last month.
Reaching the end of the funding: a status on where we stand
We have now exhausted the budget that was provided through the crowdfunding campaign: both Maxime Ripard’s time (who worked mainly on the H264 decoding and helping with DRM topics) and Paul’s internship are over, and therefore the remaining work will be done on a best-effort basis, without direct funding. This will therefore be the last weekly update, but we will be publishing updates once in a while when interesting progress is made.
Here is a quick summary of our current status, compared to what was promised during our Kickstarter campaign:
Making sure that the codec works on the older Allwinner SoCs that are still widely used: A10, A13, A20, A33, R8 and R16. This goal is fully met;
Polishing the existing MPEG2 decoding support to make it fully production ready. This goal is fully met;
Implementing H264 video decoding. This goal is fully met with base H264 decoding support implemented. However, a number of more advanced H264 features have not been implemented, and therefore additional improvements could be made;
Modifying the Allwinner display driver in order to be able to directly display the decoded frames instead of converting and copying those frames. This goal is fully met;
Providing a user-space library easy to integrate in the popular open-source video players. This goal is partially met. We do provide a user-space library that offers a VA-API implementation, however the integration with popular video players turned out to be a lot more challenging than expected, and we only offer Kodi integration at this point. See below for details;
Upstreaming those changes to the official Linux kernel. This goal is in progress, on both the VPU driver side and DRM improvements side;
Supporting the newer Allwinner SoCs (H3, H5, A64). This goal is partially met, since H3 is supported, but not yet H5 and A64;
H265 video decoding support. This goal is fully met with base H265 decoding support implemented. Like H264, a number of more advanced features have not been implemented, so there is room for more work.
The most challenging topic: integration with open source video players
The major pitfalls that we encountered are related to integrating our accelerated video decoding pipeline with multimedia players. They will require extra work out of the scope of the VPU campaign to reach a production-ready state.
We considered a number of options for integrating with a desktop environment under Xorg, which was especially tricky for the oldest Allwinner platforms where the VPU outputs a tiled YUV format. The chain of required operations includes untiling, colorspace conversion (from YUV to RGB), scaling and composition.
We first resorted to the main CPU for all the required operations (including NEON-backed untiling routines), which becomes unbearably slow as soon as scaling is involved in the process.
We tried to bring in the GPU to accelerate the untiling, colorspace conversion, scaling and composition operations involved. Although we wrote a shader-based untiler, the Mali blobs did not allow for importing the raw frame data on a byte-by-byte basis. This made GPU acceleration unusable for our use case in practice. Bringing in the GPU for the final composition step only (which should be possible with GBM-enabled blobs) could however bring some speedup.
Another lead is to use the Xv extension of the X11 API, that fits the bill for using the Display Engine hardware to accelerate these operations, but this interface is quite old now and increasingly deprecated. It also only allows sub-optimal use cases, with one video at a time.
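To give a sense of the per-pixel work that the colorspace conversion step in this chain represents when done on the CPU, here is a minimal sketch in Python. It assumes limited-range BT.601 coefficients, which is an assumption made for illustration (the matrix actually needed depends on the video), and the function names are hypothetical.

```python
def clamp8(value):
    """Round a float and clamp it to the 0..255 range of an 8-bit channel."""
    return max(0, min(255, int(round(value))))


def yuv_to_rgb(y, u, v):
    """Convert one YUV pixel to 8-bit RGB.

    Assumption: limited-range BT.601 coefficients, chosen for illustration.
    """
    c = y - 16
    d = u - 128
    e = v - 128
    r = 1.164 * c + 1.596 * e
    g = 1.164 * c - 0.392 * d - 0.813 * e
    b = 1.164 * c + 2.017 * d
    return clamp8(r), clamp8(g), clamp8(b)


# Video black and video white in limited range.
print(yuv_to_rgb(16, 128, 128))
print(yuv_to_rgb(235, 128, 128))
```

Running this arithmetic for every pixel of every frame, on top of untiling and scaling, is the kind of load that makes offloading to dedicated hardware attractive.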
We also investigated the situation for media players that can run without a display server, which removes the need for the composition step and allows using the Display Engine hardware directly, through the DRM interface.
We succeeded at bringing up support for the Kodi mediacenter, by adding the required bits to implement a zero-copy pipeline.
We worked on getting GStreamer to correctly pipe VAAPI-based decoding to the DRM-enabled kmssink without going through the GPU, but did not end up with any functional result, so significant work remains in that area.
Going further: what will happen now?
Here are the topics that we intend to continue work on in this best-effort mode and complete by the end of 2018, as promised in our crowdfunding campaign:
Ensure the base Cedrus Linux kernel driver gets merged;
Ensure the H264 decoding support in the Cedrus driver gets merged;
Ensure the H265 decoding support in the Cedrus driver gets merged;
Ensure the DRM driver improvements get merged;
Enable VPU support on H5 and A64.
Here are other topics that we do not intend to work on without additional funding. Individuals who want to see some progress on those topics are invited to contribute and join the effort of improving Allwinner VPU support in upstream Linux. Companies interested in those features can also contact us.
Additional H264 and H265 decoding features: interlaced video support (H264 and H265), quantization matrices (H265), 10-bit (H265), 4K resolution (H265);
Other codecs beyond MPEG2, H264 and H265, such as VP8;
Additional work on GStreamer integration or X.org integration.
Once again, we would like to thank all the individuals and companies who participated in our crowdfunding campaign, and made this project possible. We are very happy to see that despite the uncertainties involved in all software development projects, we have been able to deliver the vast majority of the goals, within the expected time frame, while delivering weekly updates of our progress. It was a new experience for Bootlin, and we hope to renew this experience for other Linux kernel upstream developments in the future!
This week has seen great advancements in H265 support, following up on the work conducted during the past weeks. The first item to debug was support for bi-directional predictive frames (AKA B frames) which was broken last week. This required some adaptation in our standalone test tool v4l2-request-test in order to display the decoded frames in the right order. With bi-directional prediction, the display order no longer matches the decoding order, in which the coded frames are stored in the bitstream.
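The difference between decoding order and display order can be sketched as follows. This is an illustration only, not how v4l2-request-test is written: it assumes each decoded frame carries a picture order count (the `poc` field name is hypothetical) that gives its position in display order.

```python
from dataclasses import dataclass


@dataclass
class DecodedFrame:
    decode_index: int  # position in the bitstream (decoding order)
    poc: int           # picture order count (hypothetical field name)
    kind: str          # "I", "P" or "B"


def display_order(frames):
    """Return the frames sorted by picture order count, i.e. display order."""
    return sorted(frames, key=lambda frame: frame.poc)


# A P frame is decoded before the two B frames that reference it,
# even though it is displayed after them.
frames = [
    DecodedFrame(decode_index=0, poc=0, kind="I"),
    DecodedFrame(decode_index=1, poc=3, kind="P"),
    DecodedFrame(decode_index=2, poc=1, kind="B"),
    DecodedFrame(decode_index=3, poc=2, kind="B"),
]

print([frame.kind for frame in display_order(frames)])
```

With only I and P frames the two orders coincide, which is why the reordering only became necessary when B frames were tackled.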
With the images displayed in the right order, the debugging process was a matter of comparing the configuration register values written by our driver with the reference provided by libvdpau-sunxi, but it was not enough. A specific buffer has to be provided for each frame for the decoder to store extra meta-data related to bi-directional frame prediction. With the buffer set, the situation vastly improved and only minor issues had to be resolved.
This led to properly decoding our reference H265 video that contains I, P and B frames! A few more videos were also tested to spot possible bugs and were eventually decoded correctly too. Of course, due to the many possible combinations of H265 features, it is possible that we are still missing some corner cases, but the bulk of H265 support is well in place at this point.
We moved on to adding support for H265 in libva-v4l2-request, which allows the integration of the codec with media players such as VLC and Kodi. We hit a few hiccups during the bringup:
But we managed to fix the integration of H265 to behave properly:
So H265 is now integrated in our pipeline and we are ready to submit the patches introducing its support for the Cedrus driver, which should come around next week.
The first task that was tackled this week was solving the bit offset issue encountered last week. We found out that ffmpeg provides VAAPI with a byte-aligned value after rounding it up from an internal offset it keeps in bits. When trying to use the internal value in bits, our VPU would succeed at decoding the H265 frame. After looking at the values for a few distinct frames, it became clear that the offset matched the beginning of a Golomb-coded compressed sequence, starting with a 1 bit and followed by zeros, as a prefix code. Detecting this pattern appears to work reliably for the H265 videos we could test.
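The information lost by this rounding can be illustrated with a small sketch. The helper names are hypothetical and this is not the detection logic used in our code: it only shows that a byte-aligned value rounded up from a bit offset leaves eight possible bit offsets, which is why an extra hint such as the bit pattern described above is needed to pick the right one.

```python
def bits_to_bytes_rounded_up(bit_offset):
    """Round a bit offset up to the next whole byte."""
    return (bit_offset + 7) // 8


def candidate_bit_offsets(byte_offset):
    """List every bit offset that rounds up to the given byte offset."""
    if byte_offset == 0:
        return [0]
    first = (byte_offset - 1) * 8 + 1
    return list(range(first, byte_offset * 8 + 1))


# An internal offset of 43 bits is reported as 6 bytes...
print(bits_to_bytes_rounded_up(43))
# ...and 6 bytes could stand for any of these eight bit offsets.
print(candidate_bit_offsets(6))
```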
This paved the way for properly decoding intra-coded (I) H265 frames without any hardcoded value left in the code. With that in place, it was only a small stretch to decode a few seconds of video made of I frames!
Of course, intra-coded frames are rare in H265 videos since they do not use any temporal compression technique and are thus larger in size. Predicted frames (using references from already-decoded frames) compose the vast majority of H265 videos. Prediction takes place either as forward prediction (P frames) or as both forward and backward prediction (B frames). Supporting these prediction modes requires significant driver-side work, especially to handle the metadata (such as prediction weight coefficients) associated with each frame in the reference lists and the lists themselves. On the framework side, V4L2 controls also had to be introduced to bring the required plumbing for these features.
As of today, we successfully implemented support for P frames while B frames are still a work in progress. To illustrate our progress, the same video can be seen decoded in v4l2-request-test (at nominal and half speed), with the two prediction modes:
With I and P frames, the video is decoded correctly:
Some more work seems to be required for B frames:
Next week will be the opportunity to move forward on B frames decoding!
This week started with the preparation of a new revision of the Cedrus VPU driver, after significant feedback was received on the version posted two weeks ago. Thanks to the careful testing carried out by community member Jernej Škrabec, a number of decoding issues were discovered in version 6 of the driver. This includes issues related to MPEG2 decoding but also to the use of the VPU untiling block, which affects all codecs alike.
Some updates were also included on the MPEG2 controls side, in order to bring them closer to the raw bitstream parameters. Some parameters (that are not exposed by VAAPI) were also added, making the V4L2 controls broader than what is strictly required for our VPU.
Regarding H265, progress was slow this week due to a mismatch between values provided by VAAPI and what our VPU expects. More specifically, VAAPI provides a byte-aligned value for the offset to the coded video data in the slice (which also includes a header with metadata) while our VPU expects a bit-aligned value that does not match the value provided by VAAPI. We are hard at work to figure out a solution to this issue, but it is not straightforward. In addition, the reference libvdpau-sunxi code does not set that offset explicitly, as it is reached after parsing the header through the VPU itself. In our case, the parsing is done in userspace so the use case differs.
Following on last week’s progress, this week was also focused on bringing the required plumbing for H265 support in our video decoding pipeline. Thanks to register dumps obtained last week from libvdpau-sunxi, it was possible to quickly hack together support for decoding a single intra frame (with no dependency on any other frame), by replaying the dumped register write sequence. Once decoding that single frame worked with the hardcoded register values, we progressively replaced these values with actual register field definitions, that have to be configured with the appropriate metadata for the frame, that is parsed from the H265 bitstream.
As a result, the next step was integrating the required metadata information as dedicated V4L2 controls. Since these controls have to be as generic as possible (in order to fit well with future V4L2 stateless VPU drivers), we carefully looked at the metadata fields that the bitstream offers and considered the elements that VAAPI provides in userspace as well as the information that our VPU needs specifically. It appears that some fields required by our VPU are not exposed by VAAPI directly, so a few tricks were needed along the way.
At this point, we have a first draft for the controls, which allows decoding the intra-coded frame that we dumped last week, but using the metadata provided through the controls instead of hardcoded values:
More work is required to include support for other types of frame coding, namely B and P predictive frames. Next week’s focus will be set on decoding a series of intra-coded frames and moving on to supporting predictive frames. Thankfully, the work done by Bootlin engineer Maxime Ripard when adding support for H264 makes the whole process considerably easier, since H265 resembles H264 in many aspects.
This week’s progress in our VPU driver development effort was focused on two main tasks: submitting the sixth revision of the Cedrus VPU driver series to the mainline Linux kernel and starting the work on H265 decoding.
The patch series for this new iteration of the driver was submitted on Wednesday and contains both functional and cosmetic changes. Most notably, we implemented support for video-specific quantization matrices in MPEG2, one of the final extension bits we were missing until then, but also cleaned up the register definitions for the driver. At this point, there are no undocumented registers or fields left, which makes the overall understanding of the hardware interactions much more straightforward. The driver was also moved to staging drivers, not because it was deemed of poor quality but rather because V4L2 maintainers want to keep the ability to change the controls that our driver is using even after it is merged.
Aside from this work, we started looking into H265 decoding, which was already implemented in libvdpau-sunxi for the downstream modified version of the Linux kernel provided by Allwinner for the H3 (still based on Linux 3.4 to this day, which was released in 2012). After setting up a board with this kernel and libvdpau-sunxi, we were able to dump the register accesses made by libvdpau-sunxi, providing a reference for bringing up H265 support in the Cedrus VPU driver!
With a few weeks of delay, we are proud to announce the delivery of the main goals of our crowdfunding campaign dedicated to adding upstream Linux support for the Allwinner video decoding hardware.
After several months of hard work by Bootlin engineer Maxime Ripard and intern Paul Kocialkowski, we now have a working demo of Kodi running with our VPU driver on top of a mainline 4.18-rc kernel. Both MPEG2 and H264 are supported, with a fully-optimized pipeline between the VPU and the display side that does not involve any buffer copy or extra transformation that the hardware cannot offload. These results were possible thanks to the previous efforts carried out by the linux-sunxi community, and especially the libvdpau-sunxi project.
Here were the main goals defined in our crowdfunding campaign, which we promised to deliver end of June 2018, and their status in our delivery:
Making sure that the codec works on the older Allwinner SoCs: A10, A13, A20, A33, R8 and R16. This goal is fully met, with more features than planned: the Cedrus driver was brought up on the A10, A13, A20, A33 and H3. Therefore, we included H3 support in this delivery, even though it was originally only part of one of the stretch goals. The R8 is the same as an A13 and the R16 is the same as an A33, so they are supported as well.
Polishing the existing MPEG2 decoding support to make it fully production ready. This goal is fully met: we have done much more testing of the MPEG2 decoding, and both the Linux kernel code and user-space code supporting MPEG2 have been significantly improved and cleaned up.
Implementing H264 video decoding, since H264 is by far one of the most popular video codecs. This goal is fully met: H264 decoding support has been added to both the Linux kernel driver and the user-space library, including high-profile H264 support. However, the H264 support is still very recent and we expect that additional debugging and improvements will be needed.
Modifying the Allwinner display driver in order to be able to directly display the decoded frames instead of converting and copying those frames. This goal is fully met: the Allwinner DRM driver has received a number of patches to ensure we can use one of the several planes to directly display the video frames in the format provided by the VPU. Support for hardware scaling has also been fixed to work properly. Those patches have already been contributed to the upstream Linux kernel. The work on the A20 and A33 display driver was done by Bootlin, while the work on the H3 was done by other developers of the community.
Providing a user-space library easy to integrate in the popular open-source video players. This goal is partially met: while we are providing a libva-v4l2-request user-space library that can in theory be used by all libva-capable video players, the actual integration with video players is for now only working completely with Kodi. We have started efforts to make it work with both VLC and GStreamer, but the work has not been completed due to various challenges detailed below. This area was definitely much more challenging than we initially expected.
Upstreaming those changes to the official Linux kernel. This goal is almost met: we have posted 5 iterations of the Cedrus Linux kernel driver, each time using new versions of the Request API patches, helping improve this API along the way. While our patches have not been merged yet, because the Request API itself hasn’t been merged, they have received significant review from the V4L developers, and we believe our patches are not far from being merged.
All in all, despite the numerous challenges encountered over the last few months, we are happy to see that we have been able to deliver most of the goals completely, and we are not too far off for the few goals that haven’t yet been fully met. As we will discuss below, we will continue to work in the next months on completing those unfinished steps, and on the stretch goals that received enough funding.
Reaching this level of support was not a straightforward journey, as our road was paved with various obstacles that are presented below.
Media Request API
In order to add support for the VPU found on Allwinner platforms, some internal plumbing is necessary in the Video4Linux2 (v4l2) framework, the video framework in Linux. While V4L2 gained support for a specific class of VPUs, so-called “stateful” (where the video bitstream is passed directly to the hardware controller) thanks to the Memory2Memory API, this is not sufficient for our hardware. Indeed, Allwinner platforms come with a “stateless” VPU, where the video needs to be parsed beforehand to extract the frame data and its associated metadata, and then passed to the hardware. V4L2 lacked an API for synchronizing the frame data and associated metadata, although it had been in development for a long time and known as the Request API.
Our work on Cedrus helped revive interest in this API, which saw its development accelerate over the past months thanks to the commitment of individuals such as Alexandre Courbot, Hans Verkuil and Sakari Ailus. We had the opportunity to report various issues and suggest fixes over its development process, which were integrated so that all the required bits for our driver are now in. The API is finally mature and appears to be quite stable, so there is no known blocker left for its integration in the kernel.
Cedrus V4L2 Driver
The first version of the Cedrus driver originally developed in 2016 by Florent Revest as part of an internship at Bootlin was based on an old version of the Request API. We therefore started by porting it to the latest version of the API and kept publishing new revisions as development of the Request API happened. We also received useful feedback from the community in the process. Here are the different iterations of the Cedrus driver that have been sent as part of this crowdfunded effort:
In addition to those patch series adding the driver itself, an additional patch series was sent to bring H264 support.
The development of the driver itself was not the most cumbersome part of the process, although it brought some challenges. For instance, we had to rework buffer management after discovering a limitation in the hardware, where the luminance and chrominance planes of our destination buffers need to be kept close in memory. We also had to bring in a workqueue (later replaced by a threaded IRQ) for the needs of the M2M API, which comes with performance drawbacks, although this issue is in the process of being resolved.
In order to test the VPU driver in a fully-controlled environment, we developed a standalone testing tool: v4l2-request-test (formerly cedrus-frame-test) that implements all the V4L2 userspace APIs needed for our VPU, including M2M and the Request API. This tool includes frame data and metadata dumps from actual videos, with the ability to decode these frames one-by-one. The tool was tremendously helpful for debugging the driver as well as adding support for H264. Since the userspace APIs involved properly abstract the hardware, this tool can be used to bring up and develop other VPU drivers that rely on the V4L2 Request API!
In order to provide integration with actual video players, we developed libva-v4l2-request (formerly libva-cedrus): a VAAPI backend that supports the V4L2 M2M and Request APIs. It currently supports both MPEG2 and H264 and will be extended as support for new formats is added. Just like v4l2-request-test, libva-v4l2-request aims at using the kernel APIs involved in a generic way, that should suit other Request API-based VPU drivers.
In the long run, it is likely that players will integrate direct support for the Request API (for instance, through ffmpeg). In the meantime, this allows interfacing with media players through two major interfaces: buffer derivation where the destination frames are copied (and converted to a regular image format when the VPU cannot do it on its own) or dma-buf, without any copy.
Zero-copy Pipeline Integration with EGL (Mali GPUs): VLC and GStreamer
In order to reach the best performance we can achieve, we focused on pipelines where no buffer copy is involved, on popular players: VLC and GStreamer. Since the X.org display server does not easily permit piping the VPU output to a dedicated plane on the Display Engine side, we investigated the use of the GPU. GPU support on Allwinner platforms still requires proprietary blobs at this point, such as the ones recently made available by Bootlin. We hope that the Lima project will soon bring a fully free alternative that will be integrated with both upstream kernel and upstream userspace components.
We did not have much luck when dealing with the tiled VPU output format, that the GPU cannot handle directly. Although we wrote a GPU shader for untiling (that works properly with regular GL implementations), the Mali GPU blobs did not behave as expected when it came to importing the tiled output frame. There is a chance that platforms that can output a regular image format (A33 and onwards) will be able to deal with piping the VPU and the GPU for accelerated scaling and colorspace conversion, but we did not test this option at this point.
Zero-copy Pipeline Integration with DRM (Display Engine): GStreamer and Kodi
Although involving the GPU in the pipeline was not a realistic possibility with the tiled VPU output format, various players support a direct DRM video output, that uses the Display Engine directly to pipe the video. Alas, it means that no window composition is possible, so this cannot be integrated with desktop environments. Instead, the players run standalone in their own virtual terminal.
We initially looked at using GStreamer this way but soon decided to prioritize Kodi (formerly XBMC), the popular mediacenter application. It was a struggle to integrate our pipeline (through libva-v4l2-request, via ffmpeg) in Kodi, although DRM video output support was there already. We eventually managed to get a usable result out of it, although there are areas left to improve!
LibreELEC Image Release with Kodi
In order to showcase the delivery of our main VPU crowdfunding campaign goals, we cooked a release of LibreELEC that supports the A20, A33 and H3 SoCs! It consists of a LibreELEC root filesystem (excluding the kernel and boot software) that works in conjunction with our latest linux-cedrus kernel tree.
Source code is of course available through our repositories, marked with the release-2018-07 tag.
Instructions to deploy the software on a compatible board are available on the linux-sunxi community wiki!
We have tackled many of the tasks on our plate at this point, but there are still items that need to be worked on:
posting new series of the Cedrus driver and H264 support until it is merged;
supporting H265 in our driver and userspace components;
supporting the ARM64 SoCs that come with version 2 of the Display Engine design, namely the H5 and A64;
contributing to the integration of our code in upstream Kodi and LibreELEC;
integrating a dma-buf and DRM pipeline with GStreamer.
We would like to thank all the individuals and companies who have supported this project by participating in our crowdfunding campaign, but also the linux-sunxi community members who did the initial reverse engineering of the Cedrus VPU and who worked with us during the development of this driver, as well as the members of the V4L2 community who worked on the Request API and reviewed our patches.
This week was the occasion to send out version 5 of the Sunxi-Cedrus VPU driver, that uses version 16 of the media requests API. The API contains the necessary internal plumbing for tying specific metadata (exposed as v4l2 controls, that are structures of data set by userspace) about the current video frame to decode with the associated source buffer (that is extracted in slices from the raw video bitstream and contains the frame’s encoded data). Adding this feature to the Linux kernel paves the way for supporting stateless VPUs such as Allwinner’s Video Engine, that are found in various ARM platforms. With version 16, a number of reliability issues were fixed and we were able to run decoding tests for hours without hitting any error!
This new version of our driver contains several improvements, that are presented in the cover letter of the series. Most notably, it brings support for the H3 (which uses the second revision of the Allwinner’s Display Engine hardware block) and exposes linear YUV output in addition to the tiled output format. The issue related to H264 decoding failing because of the luma and chroma planes being too distant in memory was fixed by allocating contiguous buffers for the destination frames. However, this required significant changes in our display pipeline, which was the occasion to rework both cedrus-frame-test and libva-cedrus to handle various scenarios for buffer and planes matching and avoid hardcoded values that are specific to our pipeline. This opens the way to making these tools generic users of the V4L2 and DRM APIs, without any particular tie to our specific platform and setup.
We also spent some time figuring out the reason for the various artifacts found on the A20 when using the display scaler. It turned out to be caused by a missing register setting, and by one register where the documented value was offset by one, resulting in the last line of the picture repeating itself.
Once done, we switched to working on the issue we mentioned last week with H264. After testing a few ideas, we now have the H264 high profile working with libva-dump and cedrus-frame-test. The next step will be to port the new code to handle the reference frames to libva-cedrus, and hopefully we will be able to have this in our usual players.
This week, significant time was dedicated to preparing a new revision of the Sunxi-Cedrus VPU kernel driver. This new version (that was started last week) based on version 15 of the media requests API brought about a number of challenges. First off, integrating the recently-tested VPU-side untiling of the destination buffers required a significant rewrite of the part in charge of managing formats and buffers. The part of our driver handling V4L2 controls (that are used to submit the frame metadata) was also significantly reworked to allow validating that the frame metadata has indeed been submitted by userspace before launching a decode run. An initial implementation of this was brought up and discussed with V4L2 maintainer Hans Verkuil, who is backing (and baking) the requests API series. He came up with a specific patch that should allow properly implementing this detection at the right time (when checking the media request’s validity, instead of at the start of the run). Hans also solved various reliability issues that we were experiencing when using the requests API with our driver. As a result, he posted version 16 of the requests API series with these fixes. We are hoping that this version will be one of the final iterations of this long-awaited series!
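The idea of checking for the frame metadata when the request is validated, rather than when the decode run starts, can be modelled with a small sketch. This is a conceptual Python model only, not the kernel API: the `Request` class, the `submit` function and the `frame_metadata` control name are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical name for the control carrying the per-frame metadata.
REQUIRED_CONTROLS = {"frame_metadata"}


@dataclass
class Request:
    """Toy model of a request tying controls to a source buffer."""
    controls: dict = field(default_factory=dict)
    source_buffer: Optional[bytes] = None

    def validate(self):
        """Reject the request early if metadata or frame data is missing."""
        missing = REQUIRED_CONTROLS - self.controls.keys()
        if missing:
            raise ValueError(f"missing controls: {sorted(missing)}")
        if self.source_buffer is None:
            raise ValueError("missing source buffer")


def submit(request, pending):
    """Validate at submission time, so a decode run never starts without metadata."""
    request.validate()
    pending.append(request)
```

In this model, a request without its metadata is refused before it ever reaches the queue of pending decode runs, so the error is reported to userspace at submission time.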
While rebasing H264 support, we experienced a strange issue where the destination buffers were sometimes corrupted and sometimes not. All the hardware configuration (register writes) was exactly the same, except for the buffer addresses (that naturally tend to change depending on allocation order in the related CMA memory pool). After some investigation, we discovered that when the luma and chroma planes of the destination buffer are too distant in memory, a corruption happens. It may be that some offset is used in the hardware at some point and that it is not coded on enough bits to represent a large gap. The way to work around this is to make sure that all the planes of our destination buffer are allocated contiguously. In practice, this means that we need a single allocation for each whole destination buffer (with the size of its two planes), ensuring that there is no gap between the planes.
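The single-allocation layout can be sketched as a size and offset computation. This assumes a 4:2:0 format with a full-resolution luma plane followed by one interleaved chroma plane at half the size, and it ignores any alignment or tiling padding the hardware may require; the function name is hypothetical.

```python
def contiguous_plane_layout(width, height):
    """Compute offsets for luma and chroma planes sharing one allocation.

    Assumption: 4:2:0 subsampling, one luma plane and one interleaved
    chroma plane, no alignment padding.
    """
    luma_size = width * height
    chroma_size = luma_size // 2
    return {
        "luma_offset": 0,
        # The chroma plane starts right where the luma plane ends: no gap.
        "chroma_offset": luma_size,
        "total_size": luma_size + chroma_size,
    }


print(contiguous_plane_layout(1280, 720))
```

Since both planes live in one allocation, the distance between them is fixed by the frame size instead of depending on where the allocator happened to place two separate buffers.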
The work has continued on H264, and especially on adding support for High Profile decoding. My test video showed a limitation in our current code however, due to what seems to be a limitation of the libva API. Indeed, the H264 codec relies on a decoded picture buffer (DPB) that holds the previously decoded pictures that might be used as reference frames to decode the current frame. The kernel interface needs that DPB, and our driver will also need it to perform some ID assignment for the current frame. However, libva only gives the list of frames needed to decode the current frame, and not the whole DPB. That leads to a situation where subsequent frames using the same set of reference frames will be assigned the same ID, which obviously doesn’t work very well. Most of the week has been spent trying to evaluate how we can address that issue, and to start implementing a solution that would be based on a cache of the reference frames passed to our libva driver.
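One possible shape for such a cache is sketched below. It is an illustration under assumptions, not the solution being implemented: the class name, the capacity of 16 slots and the eviction of the least recently referenced entry are all choices made for the example.

```python
from collections import OrderedDict


class ReferenceFrameCache:
    """Remember which slot ID each surface was given across frames."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        # Maps a surface handle to its slot ID, least recently used first.
        self.slots = OrderedDict()

    def assign(self, current, references):
        """Return a slot ID for `current` that no cached frame is using."""
        # Frames referenced now are marked as recently used.
        for ref in references:
            if ref in self.slots:
                self.slots.move_to_end(ref)
        # A surface being decoded into again loses its previous slot.
        self.slots.pop(current, None)
        # Assumed policy: when full, evict the least recently used entry.
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)
        used = set(self.slots.values())
        slot = next(i for i in range(self.capacity) if i not in used)
        self.slots[current] = slot
        return slot


cache = ReferenceFrameCache(capacity=4)
first = cache.assign("surface0", [])
second = cache.assign("surface1", ["surface0"])
third = cache.assign("surface2", ["surface0"])
print(first, second, third)
```

Because the cache persists between frames, the second and third frames above receive different IDs even though they are decoded with the same reference list.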