Allwinner VPU campaign one-year anniversary

Crowdfunding Campaign

It has been over a year since we launched a crowdfunding campaign to fund the development of an upstream Linux kernel driver and userspace support for the Allwinner VPU. The funding campaign was a frank success, with over 400 backers contributing a total of over 30 k€ out of the 17,6 k€ set as the initial goal. This enabled us to work on additional stretch goals, namely support for new Allwinner SoCs and H.265 decoding support.

Initial Development

Work on the Allwinner VPU started back in March 2018, being the main topic of my 6-month internship at our office in Toulouse. Bootlin engineer and long-time Allwinner Linux kernel maintainer Maxime Ripard rapidly joined the effort hands-on, to bring up support for H.264 decoding. Aspects covered by our effort include the kernel driver (cedrus), VAAPI userspace library (libva-v4l2-request) as well as testing tools (v4l2-request-test, libva-dump) and various upstream projects such as VLC and GStreamer.

We worked hard to deliver the campaign’s goals and submitted numerous revisions of the base cedrus driver along the way. By July, we announced the delivery of the campaign’s main goals (although some goals were not fully met, as explained in the associated blog post) and accompanied it with a release (tagged release-2018-07 in our git repositories).

By the end of August, we had added support for MPEG-2, H.264 and H.265 for first-generation Allwinner SoCs (and the H3 in addition), including support for accelerated display of decoded frames in the DRM driver. See our detailed blog post presenting the status at that point. Still, our changes had yet to be included in the Linux kernel.

End of the Year Status

We kept working intermittently on VPU support over the following months and manged to get the Cedrus driver accepted in Linux at the same time as the media request API. We also continued to work on submitting new versions of the series adding H.264 and H.265 support to our driver. Last but not least, we worked on adding support for the H5, A64 and A10 platforms, which were missing from the initial delivery. A dedicated blog post presents the status at the end of 2018.

Recent Developments

In 2019, Bootlin has been continuing the effort to maintain the driver and get the remaining patch series integrated in the mainline Linux kernel. We managed to get the remaining patches for DRM support merged and they will be included in Linux 5.1!

Regarding codecs support, there are still discussions happening around the H.264 and H.265 series which are now at their sixth and third revisions respectively. We are hoping that the situation will settle and that these series will be merged (in staging) as soon as possible.

March 2019 Release

With modifications taking place in the (unstable) kernel interface and userspace being updated accordingly, it became quite hard for users to properly pick the kernel and userspace components that work together. Because of that, we decided to make a new release (tagged release-2019-03 in our git repositories). It packs an updated kernel tree (based on the next media tree with our ongoing patch series applied atop) and matching versions of libva-v4l2-request and v4l2-request-test.

External Contributions

We received a few contributions along the way, such as support for the H6 SoC in the cedrus driver (that should make it to Linux 5.2) and a few minor fixes for the driver. We also received and reviewed improvement to our v4l2-request-test testing tool.

More Improvements to Raspberry Pi Display Testing

Raspberry Pi Display Support and IGT

We have been working with Raspberry Pi for quite some time, especially on areas related to the display side. Our work is part of a larger ongoing effort to move away from using the VC4 firmware for display operations and use native Linux drivers instead, which interact with the hardware directly. This transition is a long process, which requires bringing the native drivers to a point where they are efficient and reliable enough to cover most use cases of Raspberry Pi users.

Continuous Integration (CI) plays an important role in that process, since it allows detecting regressions early in the development cycle. This is why we have been tasked with improving testing in IGT GPU Tools, the test suite for the DRM subsystem of the kernel (which handles display). We already presented the work we conducted for testing various pixel formats with IGT on the Raspberry Pi’s VC4 last year. Since then, we have continued the work on IGT and brought it even further.

Improving YUV and Adding Tiled Pixel Formats Support

We continued the work on pixel formats by generalizing support for YUV buffers and reworking the format conversion helpers to support most of the common YUV formats instead of a reduced number of them. This lead to numerous commits that were merged in IGT:

In the meantime, we have also added support for testing specific tiling modes for display buffers. Tiling modes indicate that the pixel data is laid out in a different fashion than the usual line-after-line linear raster order. It provides more efficient data access to the hardware and yields better performance. They are used by the GPU (T tiling) or the VPU (SAND tiling). This required introducing a few changes to IGT as well as adding helpers for converting to the tiling modes, which was done in the following commits:

DRM Planes Support

The display engine hardware used on the Raspberry Pi allows displaying multiple framebuffers on-screen, in addition to the primary one (where the user interface lives). This feature is especially useful to display video streams directly, without having to perform the composition step with the CPU or GPU. The display engine offers features such as colorspace conversion (for converting YUV to RGB) and scaling, which are usually quite intensive tasks. In the Linux kernel’s DRM subsystem, this ability of the display engine hardware is exposed through DRM planes.

Displaying multiple DRM planes

We worked on adding support for testing DRM planes with the Chamelium board, with a fuzzing test that selects randomized attributes for the planes. Our work lead to the introduction of a new test in IGT:

Dealing with Imperfect Outputs

With the Chamelium, there are two major ways of finding out whether the captured display is correct or not:

  • Comparing the captured frame’s CRC with a CRC calculated from the reference frame;
  • Comparing the pixels in the captured and reference frames.

While the first method is the fastest one (because the captured frame’s CRC is calculated by the Chamelium board directly), it can only work if the framebuffer and the reference are guaranteed to be pixel-perfect. Since HDMI is a digital interface, this is generally the case. But as soon as scaling or colorspace conversion is involved, the algorithms used by the hardware do not result in the exact same pixels as performing the operation on the reference with the CPU.

Because of this issue, we had to come up with a specific checking method that excludes areas where there are such differences. Since our display pattern resembles a colorful checkerboard with solid-filled areas, most of the differences are only noticeable at the edges of each color block. As a result, we introduced a checking method that excludes the checkerboard edges from the comparison.

Detecting the edges (in blue) of a multi-plane pattern

This method turned out to provide good results and very few incorrect results after some tweaking. It was contributed to IGT with commit:

Underrun Detection

We also worked on implementing display pipeline underrun detection in the kernel’s VC4 DRM driver. Underruns occur when too much pixel data is provided (e.g. because of too many DRM planes enabled) and the hardware can’t keep up. In addition, a bandwidth filter was also added to reject configurations that would likely lead to an underrun. This lead to a few commits that were already merged upstream:

We prepared tests in IGT to ensure that the underruns are correctly reported, that the bandwidth protection does its job and that both are consistent. This test was submitted for review with patch:

Allwinner VPU support in mainline Linux end of the year status update

The end of 2018 is approaching, which is the right time to provide an update on where we stand with the Allwinner VPU support in Linux. The promise of our Kickstarter campaign was to deliver all funded stretch goals by the end of 2018. As we’ll explain below, all the goals have been met, and only the upstreaming process is not completely finished for some features.

With the Linux 4.20 release happening just a few days ago, the core of the Cedrus driver and the media Requests API that supports it are finally available through Linus Torvalds’ tree! They provide the required interface for hardware-accelerated MPEG-2 decoding on the A13, A20, A33 and H3 Allwinner SoCs.

Cedrus driver in staging

Discussions regarding the details of the stateless video decoding interface offered by the V4L2 API are still taking place, which explains why our driver was integrated as a staging media driver. It was also decided to hide the associated interface for MPEG-2 decoding from the public kernel headers for now. This way, the interface is not considered stable and can keep evolving as discussions are still taking place. Because the kernel has a policy of never breaking backward compatibility, these interfaces will hardly evolve once they are declared stable, so it is crucial to ensure they are ready when that happens. This is especially true given the level of complexity associated with video decoding and the various pitfalls paving that road.

Status of H264 and H265 decoding support

The code for H264 and H265 decoding is already developed and functional, has already been submitted for upstream inclusion but it hasn’t been merged yet. However, with the decoding interfaces considered unstable for now, it should become easier to bring-in H.264 and H.265 decoding support to mainline. Support for these two codecs in the API and our driver is still under review and we will continue sending new iterations for them, with the suggested changes and fixes as well as adaptations for the still-evolving decoding interface. More specifically, our series need to be updated to the latest proposal for identifying reference frames, which is now based on the timestamps of the buffers.

Platform support: H5, A64 and A10 on the horizon

Regarding the platforms supported by our driver, the patches for H5 and A64 support were accepted since our last update and they will make it to Linux 4.21! We are also looking to send out support for the A10, which was missing from our previous series (but shouldn’t cause any particular trouble).

DRM driver improvements

The adaptations for the DRM driver allowing to display the raw decoded frames from the VPU on older platforms were also finalized, with the first half of the series already queued for 4.21. Subsequent changes in the series are still waiting for more reviews as they affect common parts of the framework.

Conclusion

Overall, we are slowly closing down on this project and hoping to get the remaining pieces of our work integrated in mainline Linux as soon as possible, as there seems to be very few obstacles left in the way.

Allwinner VPU support in mainline Linux November status update

Since our previous update back in September, we continued the work to reach the goals set by our crowdfunding campaign and made a number of steps forward. First, we are happy to announce that the core of the Cedrus driver was approved by the linux-media maintainers! It followed the final version of the media request API (the required piece of media framework plumbing necessary for our driver).

Both the API and our driver were merged in time for Linux 4.20, that is currently at the release candidate stage and will be released in a few weeks. The core of the Cedrus driver that is now in Linus’ tree supports hardware-accelerated video decoding for the MPEG-2 codec. We have even already seen contributions from the community, including minor fixes and improvements!

We have also been following-up on the other features covered by our crowdfunding campaign and made good progress on bringing them forward:

  • The series bringing H.264 decoding to our driver was updated for a second revision on November 15, rebased atop the upcoming Linux release and including a number of fixes as well as documentation;
  • H.265 decoding support followed with a second version sent on November 23, based on the updated H.264 series and bringing various minor improvements over the first iteration;
  • The patch series for the display engine DRM driver that adds support for the tiled YUV format used by the VPU was also updated, significantly reworked and submitted again on November 23;
  • Finally, we submitted a patch series adding support for the A64 and H5 Allwinner SoCs in the Cedrus VPU driver on November 15.
A64 and H5 boards provided by our sponsors Olimex and Libretech, now supported in Cedrus

With these patch series well on their way, we are closer than ever to delivering the remaining goals of the crowdfunding campaign!

Final weekly status update for Allwinner VPU support in mainline Linux (week 35)

The end of August has arrived, bringing an end to Paul’s engineering internship at Bootlin, focused on bringing mainline Linux support for the VPU found on Allwinner platforms. Over the past six months, we have worked hard to reach the goals announced in the project’s crowdfunding campaign and we were able to deliver most of the main goals last month.

Since last month delivery, we made great progress on supporting the H265 codec, one of the stretch goals that were funded during the campaign. A dedicated patch series introducing support for it was submitted to the linux-media mailing list earlier this week, as well as a new iteration of the base Cedrus VPU driver. As the Request API is on the verge of integrating the Linux kernel, our VPU driver should follow pretty soon.

Reaching the end of the funding: a status on where we stand

We have now exhausted the budget that was provided through the crowdfunding campaign: both Maxime Ripard’s time (who worked mainly on the H264 decoding and helping with DRM topics) and Paul’s internship are over, and therefore the remaining work will be done on a best-effort basis, without direct funding. This will therefore be the last weekly update, but we will be publishing updates once in a while when interesting progress is made.

Here is a quick summary of our current status, compared to what was promised during our Kickstarter campaign:

  • Making sure that the codec works on the older Allwinner SoCs that are still widely used: A10, A13, A20, A33, R8 and R16. This goal is fully met;
  • Polishing the existing MPEG2 decoding support to make it fully production ready. This goal is fully met;
  • Implementing H264 video decoding. This goal is fully met with base H264 decoding support implemented. However, a number of more advanced H264 features have not been implemented, and therefore additional improvements could be made;
  • Modifying the Allwinner display driver in order to be able to directly display the decoded frames instead of converting and copying those frames. This goal is fully met.
  • Providing a user-space library easy to integrate in the popular open-source video players. This goal is partially met. We do provide a user-space library that offers a VA-API implementation, however the integration with popular video players turned out to be a lot more challenging than expected, and we only offer Kodi integration at this point. See below for details;
  • Upstreaming those changes to the official Linux kernel. This goal is in progress, on both the VPU driver side and DRM improvements side;
  • Supporting the newer Allwinner SoCs (H3, H5, A64). This goal is partially met, since H3 is supported, but not yet H5 and A64;
  • H265 video decoding support. This goal is fully met with base H265 decoding support implemented. Like H264, a number of more advanced features have not been implemented, so there is room for more work.

The most challenging topic: integration with open source video players

The major pitfalls that we encountered are related to integrating our accelerated video decoding pipeline with multimedia players. They will require extra work out of the scope of the VPU campaign to reach a production-ready state.

We considered a number of options for integrating with a desktop environment under Xorg, which was especially tricky for the oldest Allwinner platforms where the VPU outputs a tiled YUV format. The chain of required operations includes untiling, colorspace conversion (from YUV to RGB), scaling and composition.

  • We first resorted to the main CPU for all the required operations (including NEON-backed untiling routines), which becomes unbearably slow as soon as scaling is involved in the process.
  • We tried to bring-in the GPU for accelerating the untiling, colorspace conversion, scaling and composition operations involved. Although we wrote a shader-based untiler, the Mali blobs did not allow for importing the raw frame data on a byte-by-byte basis. This made GPU acceleration unusable for our use case in practice. Bringing-in the GPU for the final composition step only (that should be possible with GBM-enabled blobs) could however bring some speedup.
  • Another lead is to use the Xv extension of the X11 API, that fits the bill for using the Display Engine hardware to accelerate these operations, but this interface is quite old now and increasingly deprecated. It also only allows sub-optimal use cases, with one video at a time.

We also investigated the situation for media players that can run without a display server, which removes the need for the composition step and allows using the Display Engine hardware directly, through the DRM interface.

  • We succeeded at bringing up support for the Kodi mediacenter, by adding the required bits to implement a zero-copy pipeline.
  • We worked on getting GStreamer to correctly pipe VAAPI-based decoding to the DRM-enabled kmssink without going through the GPU, but did not end up with any functional result, so significant work remains in that area.

Going further: what will happen now ?

Here are the topics that we intend to continue work on in this best-effort mode and complete by the end of 2018, as promised in our crowdfunding campaign:

  • Ensure the base Cedrus Linux kernel driver gets merged;
  • Ensure the H264 decoding support in the Cedrus driver gets merged;
  • Ensure the H265 decoding support in the Cedrus driver gets merged;
  • Ensure the DRM driver improvements get merged;
  • Enable VPU support on H5 and A64.

Here are other topics that we do not intend to work on without additional funding. Individuals who want to see some progress on those topics are invited to contribute and join the effort of improving Allwinner VPU support in upstream Linux. Companies interested in those features can also contact us.

  • Additional H264 and H265 decoding features: interlaced video support (H264 and H265), quantization matrices (H265), 10-bit (H265), 4K resolution (H265);
  • Other codecs beyond MPEG2, H264 and H265, such as VP8;
  • Encoding support;
  • Additional work on GStreamer integration or X.org integration.

Thanks

Once again, we would like to thank all the individuals and companies who participated to our crowdfunding campaign, and made this project possible. We are very happy to see that despite the uncertainties involved in all software development projects, we have been to deliver the vast majority of the goals, within the expected time frame, while delivering weekly updates of our progress. It was a new experience for Bootlin, and we hope to renew this experience for other Linux kernel upstream developments in the future!

Allwinner VPU support in mainline Linux status update (week 34)

This week has seen great advancements in H265 support, following up on the work conducted during the past weeks. The first item to debug was support for bi-directional predictive frames (AKA B frames) which was broken last week. This required some adaptation in our standalone test tool v4l2-request-test in order to display the decoded frames in the right order. With bi-directional prediction, the display order no longer matches the decoding order, in which the coded frames are stored in the bitstream.

With the images displayed in the right order, the debugging process was a matter of comparing the configuration register values written by our driver with the reference provided by libvdpau-sunxi, but it was not enough. A specific buffer has to be provided for each frame for the decoder to store extra meta-data related to bi-directional frame prediction. With the buffer set, the situation vastly improved and only minor issues had to be resolved.

This lead to properly decoding our reference H265 video that contains I, P and B frames! A few more videos were also tested to spot possible bugs and were eventually decoded correctly too. Of course, due to the many possible combinations of H265 features, it is possible that we are still missing some corner cases, but the bulk of H265 support is well in place at this point.

We moved on to adding support for H265 in libva-v4l2-request, which allows the integration of the codec with media players such as VLC and Kodi. We hit a few hiccups during the bringup :

Hiccups when integrating H265 with VAAPI

But we managed to fix the integration of H265 to behave properly :

H265 decoding working properly with VAAPI

So H265 is now integrated in our pipeline and we are ready to submit the patches introducing its support for the Cedrus driver, which should come around next week.

Allwinner VPU support in mainline Linux status update (week 33)

The first task that was tackled this week was solving the bit offset issue encountered last week. We found out that ffmpeg provides VAAPI with a byte-aligned value after rounding it up from an internal offset it keeps in bits. When trying to use the internal value in bits, our VPU would succeed at decoding the H265 frame. After looking at the values for a few distinct frames, it became clear that the offset matched the beginning of a Golomb-coded compressed sequence, starting with a 1 bit and followed by zeros, as a prefix code. Detecting this pattern appears to work reliably for the H265 videos we could test.

This paved the way for properly decoding intra-coded (I) H265 frames without any hardcoded value left in the code. With that in place, it was only a small stretch to decode a few seconds of video made of I frames!

Of course, intra-coded frames are rare in H265 videos since they do not use any temporal compression technique and are thus larger in size. Predicted frames (using references from already-decoded frames) compose the vast majority of H265 videos. Prediction takes places either for forward prediction (P frames) or both forward and backward prediction (B frames). Supporting these prediction modes requires significant driver-side work, especially to handle the metadata (such as prediction weight coefficients) associated to each frame in the reference lists and the lists on their own. On the framework side, V4L2 controls also had to be introduced to bring the required plumbing for these features.

As of today, we successfully implemented support for P frames while B frames are still work in progress. To illustrate our progress, the same video can be seen decoded in v4l2-request-test (at nominal and half speed), with the two prediction modes :

  • With I and P frames, the video is decoded correctly:
  • Some more work seems to be required for B frames:

Next week will be the opportunity to move forward on B frames decoding!

Allwinner VPU support in mainline Linux status update (week 32)

This week started with the preparation of a new revision of the Cedrus VPU driver, after significant feedback was received on the version posted two weeks ago. Thanks to the careful testing carried out by community member Jernej Škrabec, a number of decoding issues were discovered in version 6 of the driver. This includes issues related to MPEG2 decoding but also to the use of the VPU untiling block, that affects all codecs indifferently.

The latest version of the driver (that was posted on Thursday) brings-in the required fixes for properly supporting the videos for which decoding errors were spotted.

Some updates were also included on the MPEG2 controls side, in order to bring them closer to the raw bitstream parameters. Some parameters (that are not exposed by VAAPI) were also added, making the V4L2 controls broader than what is strictly required for our VPU.

Regarding H265, progress was slow this week due to a mismatch between values provided by VAAPI and what our VPU expects. More specifically, VAAPI provides a byte-aligned value for the offset to the coded video data in the slice (which also includes a header with metadata) while our VPU expects a bit-aligned value that does not match the value provided by VAAPI. We are hard at work to figure out a solution to this issue, but it is not straightforward. In addition, the reference libvdpau-sunxi code does not set that offset explicitly, as it is reached after parsing the header through the VPU itself. In our case, the parsing is done in userspace so the use case differs.

Allwinner VPU support in mainline Linux status update (week 31)

Following on last week’s progress, this week was also focused on bringing the required plumbing for H265 support in our video decoding pipeline. Thanks to register dumps obtained last week from libvdpau-sunxi, it was possible to quickly hack together support for decoding a single intra frame (with no dependency on any other frame), by replaying the dumped register write sequence. Once decoding that single frame worked with the hardcoded register values, we progressively replaced these values with actual register field definitions, that have to be configured with the appropriate metadata for the frame, that is parsed from the H265 bitstream.

As a result, the next step was integrating the required metadata information as dedicated V4L2 controls. Since these controls have to be as generic as possible (in order to fit well with future V4L2 stateless VPU drivers), we carefully looked at the metadata fields that the bitstream offers and considered the elements that VAAPI provides in userspace as well as the information that our VPU needs specifically. It appears that some fields required by our VPU are not exposed by VAAPI directly, so a few tricks were needed along the way.

At this point, we have a first draft for the controls, that allow decoding the intra-coded frame that we dumped last week, but using the metadata provided through the controls instead of hardcoded values :

The very first H265 frame decoded with the Cedrus VPU driver! Can you guess which animation movie it is from?

More work is required to include support for other types of frame coding, namely B and P predictive frames. Next week’s focus will be set on decoding a series of intra-coded frames and moving on to supporting predictive frames. Thankfully, the work done by Bootlin engineer Maxime Ripard when adding support for H264 makes the whole process considerably easier, since H265 resembles H264 in many aspects.

Allwinner VPU support in mainline Linux status update (week 30)

This week’s progress in our VPU driver development effort was focused on two main tasks: submitting the sixth revision of the Cedrus VPU driver series to the mainline Linux kernel and starting the work on H265 decoding.

The patch series for this new iteration of the driver was submitted on Wednesday and contains both functional and cosmetic changes. Most notably, we implemented support for video-specific quantization matrices in MPEG2, one of the final extension bits we were missing until then, but also cleaned up the register definitions for the driver. At this point, there are no undocumented registers or fields left, which makes the overall understanding of the hardware interactions much more straightforward. The driver was also moved to staging drivers, not because it was deemed of poor quality but rather because V4L2 maintainers want to keep the ability to change the controls that our driver is using even after it is merged.

Aside of this work, we started looking into H265 decoding, that was also already implemented in libvdpau-sunxi for the downstream modified version of the Linux kernel provided by Allwinner for the H3 (still based on Linux 3.4 to this day, which was released in 2012). After setting up a board with this kernel and libvdpau-sunxi, we were able to dump the register access made by libvdpau-sunxi, providing a reference for bringing up H265 support in the Cedrus VPU driver!