Allwinner VPU support in mainline Linux status update (week 20)

With DMABUF support tested, it has become possible for Paul to start the work on integrating a GPU-based video output pipeline with Sunxi-Cedrus. Using the GPU should greatly improve performance when it comes to displaying the video frames. As of now, we have been using software untiling, software YUV-to-RGB colorspace conversion and software scaling. We are looking to replace these steps with GPU-based untiling, colorspace conversion and scaling. Shaders are used to implement these operations: they are small programs that are compiled on-the-fly for the GPU’s very specialized instruction set. Most players embed shaders (in their source form, using the GL shading language) for usual operations like colorspace conversion and scaling. However, these players are not ready to handle untiling as of now (or even be notified that the format returned by our VAAPI backend is tiled).

The first step in our plan is to get VLC to cooperate with the X11 flavor of the Mali proprietary blobs that Allwinner has released in the past so that we can use GPU support for colorspace and scaling. This is still a blocking point as of now. Then, we will look into crafting a shader for untiling the VPU output frame and integrating it with our libva-cedrus VAAPI backend.

As a sidenote, the free software Lima driver is being prepared for a first RFC series, bringing the first bits of mainline Linux kernel support for Mali GPUs of the Utgard generation. So even though work on the GPU only concerns the proprietary blob for now, the work will eventually become useful to the free software driver as well.

We have also tested Sunxi-Cedrus on the H3 and started looking at integrating the display part (which differs from earlier SoCs by using a revised display engine: DE2). However, since this is a strech goal of the fundraiser and that we have many other tasks left to tack among our main goals, this is by far not our priority at the moment.

We finally worked more on the libva-dump and cedrus-frame-test for H264, which will hopefully allow us to test our first H264 decoding next week!

Stay tuned for our next progress update!

Allwinner VPU support in mainline Linux status update (week 19)

This week has seen considerably less advancement than the ones before it due to bank holidays in France. Nevertheless, we managed to prepare and send V3 of the Sunxi-Cedrus Linux kernel driver on Monday. While this new version contains several incremental improvements, a number of tasks (described in the series’ cover letter) have yet to be completed before the driver can be merged in mainline Linux.

Maxime continued to work on the H264 support. The big part of the kernel has been done, and he then moved on to convert libva-dump to be able to dump also H264 buffers. Most of that part has been done as well, so the next item will be to convert cedrus-frame-test to be able to test H264 frames, and see where that takes us.

Paul kept working on DMABUF support, which is now refined and ready both on the kernel side and on the userspace side with cedrus-frame-test. There is now a single DMABUF handle used per buffer plane (instead of per-plane) which allows having all components of the frame displayed correctly. Because there is now as many buffers for display as there are for decoding, it is necessary to register framebuffers associated with each imported buffer and cycle the framebuffers in multi-buffering page-flipping. To tackle this, we have started implementing atomic modesetting in cedrus-frame-test, allowing to set the framebuffer to use per-plane.

Finally, some attention was given to the integration of our video decoding pipeline with the Mali GPU, especially to target Kodi support.

Stay tuned for our next update!

Allwinner VPU support in mainline Linux status update (week 18)

This week, Paul continued working on DMABUF support and succeeded at exporting a buffer allocated by the Sunxi-Cedrus driver on the v4l2 side and importing it on the drm side via DMABUF. Although DMABUF support is still a work in progress in cedrus-frame-test and beyond the current level of support we have with GStreamer, the kernel side of things should be ready.

Another excerpt of the Big Buck Bunny video, in 1080p

Test coverage was also improved this week, with significantly more MPEG2 videos tested (including a standard DVD) in different resolutions up to 1080p. Some feedback from the community was also received and a first issue report will need to be investigated. Regarding platform support, initial testing of the A13 was undertaken. Although the VPU driver works apparently just as well on the A13 as already-tested platforms, the DRM driver adaptation (on the display side) for untiling VPU output buffers appears to be broken and will need to be further investigated.

In other news, a new version of the media request API has been submitted without the RFC tag (after 12 previous iterations). While we’ve been testing this new version along the course of its development, we are also taking the occasion to rebase our Sunxi-Cedrus VPU driver on top of this new version and take the received feedback into account.

Maxime continued the work on H264, and almost finished a first draft for the kernel driver side. Most of the code should be there now, the next steps are going to be making sure that no parts are missing and starting to test with cedrus-frame-test!

Allwinner VPU support in mainline Linux status update (week 17)

This week started off with numerous reviews received on the patchset introducing the Sunxi-Cedrus VPU driver. Lots of constructive comments, questions and improvements were discussed, which will help improve the driver for the next iteration of the series. Changes to other drivers will also have to be implemented, in particular to the SRAM controller found on Allwinner platforms, which needs to handle access to the SRAM by the VPU.

Maxime worked on refactoring needed to ease the support for the H264, rebasing on the latest version submitted upstream and making sure that everything still works fine. He eventually pushed them in our 4.17 branch on github, and will now focus on landing H264 support itself.

The work carried out by Paul this week was focused on the libva-cedrus VAAPI backend, which supports the Sunxi-Cedrus kernel driver on the userspace side. The backend is used by VLC (when it is configured to use VAAPI for video decoding) to play MPEG2 videos such as the ones available from the Linaro sample media. libva-cedrus was significantly improved over the week, with around 80 commits featuring a major cleanup of the code that includes, along with other changes:

  • coding style harmonization
  • proper error checking and reporting instead of assertions
  • the removal of the unsupported MPEG4 code
  • the introduction of dedicated v4l2 helpers based on those developed for cedrus-frame-test
  • the reorganization of v4l2 source and destination buffers management, where both are now tied to a specific surface and kept in sync
  • the update of the definitions to match the latest patchset
  • the implementation of the final rendering at picture end time

This work significantly improved the compatibility with VLC, which was previously dropping several frames. With these changes, VLC is now properly showing the decoded videos playing close to 25 fps when there is no software scaling involved. The performance is not as good with VLC as it is with cedrus-frame-test, which uses a dedicated DRM plane directly while VLC and libva-cedrus use the software untiling code and buffer copies to display each frame.

VLC playing the Big Buck Bunny video with libva-cedrus

Some attention was also given to GStreamer over the week. Although compatibility with our VAAPI backend and display pipeline is not there yet, the VAAPI backend rewrite allowed moving forward and GStreamer now displays the first decoded frame. While the operations for decoding the frames are correctly scheduled, they are only requested to be displayed sporadically, with no effect on the screen. This issue will need to be further investigated before a basic decoding pipeline can be used with GStreamer, with a video output either to a regular X window or to a DRM plane directly. MPV was also tested out this week, without much success in coordinating the rendering and display parts involved with the VAAPI pipeline. Thus, MPV support will also require more investigation before it can be properly supported.

While we initially decided to focus on GStreamer for implementing DMABUF buffer sharing between the VPU and the display engine, cedrus-frame-test (the standalone userspace implementation supporting the Sunxi-Cedrus VPU driver) allows us to directly work on implementing DMABUF support. So even though GStreamer does not work with libva-cedrus at this point, DMABUF support was started in a dedicated branch of the cedrus-frame-test repository. DMABUF is currently failing on the kernel side, when validating the page number of the requested DMA buffer. In this area as well, further investigation and work will be needed.

In the meantime, the Sunxi-Cedrus page on the linux-sunxi wiki was updated with the latest status of Sunxi-Cedrus support, instructions to build and install libva-cedrus and cedrus-frame-test as well as configure VLC for decoding MPEG2 videos. Feedback and test reports are welcome, especially regarding videos that are not decoded properly and show visual artifacts. The community around Sunxi-Cedrus hangs out on the #linux-sunxi and #cedrus channels of the freenode IRC network, so it is the best place to ask questions and discuss all things related to Sunxi-Cedrus!

Stay tuned for next week’s progress report!

Allwinner VPU support in mainline Linux status update (week 16)

As announced last week, the second revision of the Sunxi-Cedrus driver patchset was submitted for review earlier this week. While this new revision is based on the latest version of the request API, it also includes several fixes for corner-cases of this new API, especially to use it in the context of a M2M driver. Regarding the driver itself, significant reworks were carried out (including both functional and cosmetic changes) and the driver is now more stable. It was tested on the A33 and A20 so far and works nicely on both.

The standalone tool that was developed for testing the driver, called cedrus-frame-test, has seen various improvements that allow reliably testing the Sunxi-Cedrus driver. The tool is now in a state where it can be used nicely from the command line and includes the first few frames of our reference Big Buck Bunny MPEG2 video. It also implements timestamping to have a clear idea of how long frame decoding and frame display take. A target number of frames per seconds can also be set, with error messages printed when the target fps could not be met. Finally, a dummy libVA backend was written to easily dump slices and frame metadata from videos: libva-dump.

cedrus-frame-test displaying Big Buck Bunny frames decoded with Sunxi-Cedrus

Instructions to setup the kernel driver as well as cedrus-frame-test from our trees will be made available on the linux-sunxi wiki page dedicated to Sunxi-Cedrus very soon.

At this point, the time spent decoding each video frame is rather satisfying (around 5 ms as a ballpark figure) for our 854×480 demo video. We are still doing a hard copy of each frame to feed it to the display driver: that’s where the current bottleneck is. There is work left to be done in that area, first by implementing DMAbuf and also by using proper page flipping in cedrus-frame-test. We are also hitting a display issue with 4.16 on the A20, although that problem might have been fixed in 4.17 already.

Next week will be focused on (finally) adding DMAbuf support and getting libVA in shape to work with the new Sunxi-Cedrus kernel driver under VLC and GStreamer. The final patch of the first GStreamer adaptation series submitted some weeks ago was recently merged in GStreamer.

Allwinner VPU support in mainline Linux status update (week 15)

This week, Paul worked on preparing a new version of the patch series introducing support for the Sunxi-Cedrus VPU kernel driver, based upon the latest version of the Request API as submitted for review by Hans Verkuil on Monday. In order to make it easier to test the kernel driver, a standalone tool was written to decode a single frame (that was dumped beforehand). Support for displaying the decoded frame directly into a DRM plane was also added later this week, providing direct visual feedback. Finally, significant work was put into our libVA backend, that saw a significant rewrite of the memory-management logic related to video buffers.

We plan to prepare and release this new standalone tool as well as the libVA improvements when the kernel driver patch series is ready for submission, sometime next week. Specific instructions to get this up and running will also be made available on the Sunxi-Cedrus page of the linux-sunxi wiki, for one of the supported platforms. So far, we have tested the series on A33 and A20, but it is very likely that A10 and A13 will work just as well.

On his side, Maxime continued his effort on the H264 decoding. He first looked at the Chrome OS kernel and userspace code to drive the Rockchip SoCs VPU. This code is of interest because it’s basically the only stack so far that is functional, used and based on the Request API since Google is especially involved in the development of that API. He then went on with mapping the request API controls for H264 to the code for H264 decoding in the libvdpau-sunxi code that already provides an implementation for the Allwinner VPU. He then started to write some kernel code to add support for the kernel part of the API.

Allwinner VPU support in mainline Linux status update (week 14)

This week has seen progress on the GStreamer front: the segmentation fault that was a blocker when interfacing our libva backend with gstreamer-vaapi was investigated and understood. An associated bug was reported to the GNOME bug tracker, with all the gory details attached. Since the issue was in fact due to the assembly routine imported from libvdpau-sunxi smashing the malloc heap metadata, the bug report was closed. Using valgrind proved very useful for diagnosing the issue. Since we are not going to keep using the software-based untiling method through this assembly code, no time was spent on investigating and fixing the issue there.

The integration of changes in GStreamer for our case is also moving forward, with the submission of newer iterations of the related patches when requested. There are still issues left to fix with GStreamer, although things are looking better and better. Plenty of small bugs and mistakes have been identified and resolved in both our kernel driver and VAAPI backend in the process.

A new version of the request API has also been submitted for review and comments by Hans Verkuil. We have started rebasing our VPU driver on top of this new version and hope to send out this new version sometime next week.

We also started to look into H264, mostly by setting up a good test scenario for H264 (using the libvdpau-sunxi and an Allwinner kernel), making sure that it can actually decode H264 videos (which it does), and building similar setup with the mainline kernel and our libva implementation.

Stay tuned for more development updates related to Allwinner VPU support in mainline Linux!

Allwinner VPU support in mainline Linux status update (week 13)

While this week’s goal was set to supporting dmabuf in our Sunxi-Cedrus VAAPI implementation with GStreamer, progress was made on GStreamer support alone. The first milestone was displaying the video from videotestsrc, which produces a sample test output, to kmssink, which handles video output directly with DRM planes.

The Gstreamer pipeline for the test
The working result on display

This required specific adaptation for GStreamer to cope with the Allwinner DRM driver. More specifically, changes that allow the Allwinner DRM driver in kmspipe as well as devices that don’t sit on a PCI bus in gtreamer-vaapi were submitted. Patches to fix both these issues were sent for review earlier today and the first one was already accepted and merged. Thanks to Nicolas Dufresne for his help in the process!

On the VAAPI side, things are more complicated and a number of different areas of the sunxi-cedrus VAAPI backend had to be reworked. For instance, the slices constituting MPEG2 frames are not submitted in the same way by VLC and GStreamer. Buffers management is also done slightly differently between these two users of libVA. The net result is that a segmentation fault caused by a memory management mishaps is occurring with GStreamer and is still being investigated.

This week was also the occasion to dig into the partial reference code that Allwinner published in 2015 (to grasp a better understanding of the MPEG2 decoding process with the VPU) and start updating the register documentation on the linux-sunxi wiki, where some fields documented by the reference code were still marked as unknown. The Sunxi-Cedrus page was also updated to reflect the current status and effort. These resources will get updated as development happens. More precisely, we’d like to provide instructions to deploy this work, once it has reached a decent level of usability. Stay tuned for our next update!

Allwinner VPU support in mainline Linux status update (week 12)

Following up on the work started last week, I finished implementing initial support for displaying the NV12-based tiled format (that we shall call MB32-tiled NV12). The frame, that was dumped from the VPU, is now correctly displayed on the screen (after adapting scaling coefficients that needed specific tweaking for this use case).

The result can be shown in the following picture, where our Big Buck Bunny has the right coloring:

Scaling is also supported for the tiled format, so the frame can be shown in full screen without resorting to software scaling.

A series of patches supporting these features was sent for review on the dri-devel mailing list, where it already got some feedback from Maxime Ripard (who maintains the sun4i DRM driver impacted by these patches) as well as other members of the community! There is already enough material to craft a second version and send it again for review.

Significant time was spent figuring out the DRM, KMS, DRI and X11 graphics pipeline (as well as specific details of the inner workings of display hardware) and how to properly integrate the overlay DRM plane with all this. We are evaluating all our options here before spending time on a specific implementation. Of course, we are trying to keep things as generic as possible and avoid introducing platform-specific code in userspace, but there are also challenges to overcome in this regard. On the Wayland side, things are looking much brighter as compositors such as Weston have support for managing hardware planes directly, so there should be less work required.

Finally, I started working on dmabuf support, that I am testing with gstreamer‘s kmssink, that allows outputting directly in a hardware plane. Once this work is ready, we’ll be able to get an idea of the performance of the VPU when it is not limited by software-based untiling and compositing. Stay tuned for updates in this direction!

Allwinner VPU support in mainline Linux status update (week 11)

After the initial submission of the Sunxi-Cedrus driver last week, I spent most of this week looking into the sun4i DRM (Direct Rendering Manager) driver. The driver is in charge of handling the display pipeline on Allwinner SoCs. Tight integration of the VPU and the display pipeline is required in order to achieve decent video playback performance. That is because the output format of the VPU is a 32×32 tiled format based on NV12, a YUV420 semi-planar format, with one plane for the Y component (luminance) and one plane for the interleaved UV components (chrominance). While NV12 is a standard format for video output, the tiling is rather specific to the VPU, so the frames have to be untiled before they can be used. This operation, when done in software, is rather slow. Moreover, software-based compositing of the decoded frames is also a bottleneck that impacts the overall performance.

In order to circumvent these issues, we will be using the display engine itself to untile the VPU output frames and show the untiled frames directly in a dedicated hardware plane, that is then composed with the primary plane. This requires several features and especially support for the display engine’s frontend, that has the required components to untile and decode the frames. Partial support for the frontend was recently contributed by Maxime Ripard and is on its way to landing in the mainline Linux kernel, providing a base for my VPU-related work. Maxime’s patches allow scaling hardware planes (among other things), a feature that will be very useful for scaling videos to the screen size in hardware rather than software (which is another major bottleneck for performance).

Support for untiling the VPU frames is approaching completion (luminance is correctly decoded while chrominance is not yet correctly handled).

Decoding the MB32 tiled format with sun4i-drm

Once the frames are properly shown on screen, it’ll be time to make sure that dmabuf works as expected, which will allow us to send buffers from the VPU to the display engine without any copy, thus improving performance.

We should be making good progress on this topic over the upcoming week and start contributing patches to the sun4i DRM driver, so stay tuned for our next status update!