Allwinner VPU support in mainline Linux status update (week 30)

This week’s progress in our VPU driver development effort was focused on two main tasks: submitting the sixth revision of the Cedrus VPU driver series to the mainline Linux kernel and starting the work on H265 decoding.

The patch series for this new iteration of the driver was submitted on Wednesday and contains both functional and cosmetic changes. Most notably, we implemented support for video-specific quantization matrices in MPEG2, one of the last extensions we were still missing, and also cleaned up the register definitions of the driver. At this point, there are no undocumented registers or fields left, which makes the hardware interactions much more straightforward to understand. The driver was also moved to the staging tree, not because it was deemed of poor quality, but because the V4L2 maintainers want to keep the ability to change the controls used by our driver even after it is merged.

Aside from this work, we started looking into H265 decoding, which is already implemented in libvdpau-sunxi on top of the downstream kernel provided by Allwinner for the H3 (still based on Linux 3.4, released in 2012, to this day). After setting up a board with this kernel and libvdpau-sunxi, we were able to dump the register accesses made by libvdpau-sunxi, providing a reference for bringing up H265 support in the Cedrus VPU driver!

Delivery of Allwinner VPU driver main goals

With a few weeks of delay, we are proud to announce the delivery of the main goals of our crowdfunding campaign dedicated to adding upstream Linux support for the Allwinner video decoding hardware.

After several months of hard work by Bootlin engineer Maxime Ripard and intern Paul Kocialkowski, we now have a working demo of Kodi running with our VPU driver on top of a mainline 4.18-rc kernel. Both MPEG2 and H264 are supported, with a fully-optimized pipeline between the VPU and the display side that does not involve any buffer copy or extra transformation that the hardware cannot offload. These results were possible thanks to the previous efforts carried out by the linux-sunxi community, and especially the libvdpau-sunxi project.

The Cedrus VPU driver running on the A33 and H3

Here are the main goals defined in our crowdfunding campaign, which we promised to deliver by the end of June 2018, and their status in our delivery:

  • Making sure that the VPU driver works on the older Allwinner SoCs: A10, A13, A20, A33, R8 and R16. This goal is fully met, with more features than planned: the Cedrus driver was brought up on the A10, A13, A20, A33 and H3. Therefore, we included H3 support in this delivery, even though it was originally only part of one of the stretch goals. The R8 is the same as an A13 and the R16 is the same as an A33, so they are supported as well.
  • Polishing the existing MPEG2 decoding support to make it fully production ready. This goal is fully met: we have done much more testing of the MPEG2 decoding, and both the Linux kernel code and the user-space code supporting MPEG2 have been significantly improved and cleaned up.
  • Implementing H264 video decoding, since H264 is by far one of the most popular video codecs. This goal is fully met: H264 decoding support has been added to both the Linux kernel driver and the user-space library, including high-profile H264 support. However, the H264 support is still very recent and we expect that additional debugging and improvements will be needed.
  • Modifying the Allwinner display driver in order to be able to directly display the decoded frames instead of converting and copying those frames. This goal is fully met: the Allwinner DRM driver has received a number of patches to ensure we can use one of the several planes to directly display the video frames in the format provided by the VPU. Support for hardware scaling has also been fixed to work properly. Those patches have already been contributed to the upstream Linux kernel. The work on the A20 and A33 display driver was done by Bootlin, while the work on the H3 was done by other developers of the community.
  • Providing a user-space library easy to integrate in the popular open-source video players. This goal is partially met: while we are providing a libva-v4l2-request user-space library that can in theory be used by all libva-capable video players, the actual integration with video players is for now only fully working with Kodi. We have started efforts to make it work with both VLC and GStreamer, but the work has not been completed due to various challenges detailed below. This area was definitely much more challenging than we initially expected.
  • Upstreaming those changes to the official Linux kernel. This goal is almost met: we have posted 5 iterations of the Cedrus Linux kernel driver, each time using new versions of the Request API patches, helping improve this API along the way. While our patches have not been merged yet, because the Request API itself hasn’t been merged, they have received significant review from the V4L developers, and we believe our patches are not far from being merged.

All in all, despite the numerous challenges encountered over the last few months, we are happy to see that we have been able to deliver most of the goals completely, and we are not too far off for the few goals that haven’t yet been fully met. As we will discuss below, we will continue to work in the next months on completing those unfinished steps, and on the stretch goals that received enough funding.

Reaching this level of support was not a straightforward journey, as our road was paved with various obstacles that are presented below.

Media Request API

In order to add support for the VPU found on Allwinner platforms, some internal plumbing is necessary in Video4Linux2 (V4L2), the video framework in Linux. While V4L2 gained support for a specific class of VPUs, so-called "stateful" VPUs (where the raw video bitstream is passed directly to the hardware, which handles the parsing itself), thanks to the Memory2Memory API, this is not sufficient for our hardware. Indeed, Allwinner platforms come with a "stateless" VPU, where the video bitstream needs to be parsed beforehand to extract the frame data and its associated metadata, which are then passed to the hardware. V4L2 lacked an API for synchronizing the frame data with its associated metadata, although such an API, known as the Request API, had been in development for a long time.
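To make the synchronization problem more concrete, here is a minimal sketch of how a userspace decoder is expected to use the Request API: a request ties one bitstream buffer to the controls carrying the frame metadata, and both are applied atomically when the request is queued. Since the API was still being finalized at the time, the ioctls and fields below reflect the proposed uAPI; the codec-specific control ID and payload are placeholders, and error handling is omitted.

```c
#include <sys/ioctl.h>
#include <linux/media.h>
#include <linux/videodev2.h>

/* Minimal sketch: tie per-frame metadata (a control) and a bitstream buffer
 * together in one media request, then queue the request. The control ID and
 * payload are placeholders for the codec-specific slice parameters exposed
 * by the driver; error handling is omitted. */
int decode_one_frame(int media_fd, int video_fd,
                     void *slice_params, unsigned int slice_params_size,
                     struct v4l2_buffer *src_buf)
{
    int request_fd;

    /* Allocate a request from the media device. */
    ioctl(media_fd, MEDIA_IOC_REQUEST_ALLOC, &request_fd);

    /* Attach the frame metadata to the request as an extended control. */
    struct v4l2_ext_control ctrl = {
        .id = 0 /* hypothetical codec-specific control ID */,
        .size = slice_params_size,
        .ptr = slice_params,
    };
    struct v4l2_ext_controls ctrls = {
        .which = V4L2_CTRL_WHICH_REQUEST_VAL,
        .request_fd = request_fd,
        .count = 1,
        .controls = &ctrl,
    };
    ioctl(video_fd, VIDIOC_S_EXT_CTRLS, &ctrls);

    /* Queue the OUTPUT (bitstream) buffer as part of the same request. */
    src_buf->flags |= V4L2_BUF_FLAG_REQUEST_FD;
    src_buf->request_fd = request_fd;
    ioctl(video_fd, VIDIOC_QBUF, src_buf);

    /* Submit the request: the driver starts the decode run once both the
     * buffer and its controls are available. Completion can be waited for
     * by polling the request fd. */
    ioctl(request_fd, MEDIA_REQUEST_IOC_QUEUE);

    return request_fd;
}
```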

Our work on Cedrus helped revive this API, whose development accelerated over the past months thanks to the commitment of individuals such as Alexandre Courbot, Hans Verkuil and Sakari Ailus. We had the opportunity to report various issues and suggest fixes during its development, which were integrated, so that all the bits required for our driver are now in. The API is finally mature and appears to be quite stable, so there is no known blocker left for its integration in the kernel.

Cedrus V4L2 Driver

The first version of the Cedrus driver, originally developed in 2016 by Florent Revest as part of an internship at Bootlin, was based on an old version of the Request API. We therefore started by porting it to the latest version of the API and kept publishing new revisions as the Request API evolved. We also received useful feedback from the community in the process. Here are the different iterations of the Cedrus driver that have been sent as part of this crowdfunded effort:

In addition to those patch series adding the driver itself, an additional patch series was sent to bring H264 support.

The development of the driver itself was not the most cumbersome part of the process, although it brought some challenges. For instance, we had to rework buffer management after discovering a limitation in the hardware, where the luminance and chrominance planes of our destination buffers need to be kept close in memory. We also had to bring in a workqueue (later replaced by a threaded IRQ) for the needs of the M2M API, which comes with performance drawbacks, although this issue is in the process of being resolved.
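For reference, the M2M interrupt-handling constraint boils down to a pattern roughly like the following: the hard IRQ handler only acknowledges the interrupt, while buffer completion and the call into the M2M framework happen in the IRQ thread. This is a simplified sketch using hypothetical names and structures, not the actual Cedrus code.

```c
#include <linux/interrupt.h>
#include <media/v4l2-fh.h>
#include <media/v4l2-mem2mem.h>
#include <media/videobuf2-v4l2.h>

/* Hypothetical device and context structures, for illustration only. */
struct my_vpu_dev {
	struct v4l2_m2m_dev *m2m_dev;
	void __iomem *regs;
};

struct my_vpu_ctx {
	struct v4l2_fh fh;		/* fh.m2m_ctx is the M2M context */
	struct my_vpu_dev *dev;
};

static irqreturn_t my_vpu_irq(int irq, void *data)
{
	/* Acknowledge/disable the interrupt in hardware here. */
	return IRQ_WAKE_THREAD;
}

static irqreturn_t my_vpu_irq_thread(int irq, void *data)
{
	struct my_vpu_dev *dev = data;
	struct my_vpu_ctx *ctx = v4l2_m2m_get_curr_priv(dev->m2m_dev);
	struct vb2_v4l2_buffer *src, *dst;

	src = v4l2_m2m_src_buf_remove(ctx->fh.m2m_ctx);
	dst = v4l2_m2m_dst_buf_remove(ctx->fh.m2m_ctx);

	v4l2_m2m_buf_done(src, VB2_BUF_STATE_DONE);
	v4l2_m2m_buf_done(dst, VB2_BUF_STATE_DONE);

	/* Tell the M2M framework the job is over so the next one can start. */
	v4l2_m2m_job_finish(dev->m2m_dev, ctx->fh.m2m_ctx);

	return IRQ_HANDLED;
}

/* In probe():
 *	devm_request_threaded_irq(dev, irq, my_vpu_irq, my_vpu_irq_thread,
 *				  0, "my-vpu", vpu);
 */
```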

Standalone Testing

In order to test the VPU driver in a fully-controlled environment, we developed a standalone testing tool: v4l2-request-test (formerly cedrus-frame-test) that implements all the V4L2 userspace APIs needed for our VPU, including M2M and the Request API. This tool includes frame data and metadata dumps from actual videos, with the ability to decode these frames one-by-one. The tool was tremendously helpful for debugging the driver as well as adding support for H264. Since the userspace APIs involved properly abstract the hardware, this tool can be used to bring up and develop other VPU drivers that rely on the V4L2 Request API!

VAAPI Backend

In order to provide integration with actual video players, we developed libva-v4l2-request (formerly libva-cedrus): a VAAPI backend that supports the V4L2 M2M and Request APIs. It currently supports both MPEG2 and H264 and will be extended as support for new formats is added. Just like v4l2-request-test, libva-v4l2-request aims at using the kernel APIs involved in a generic way, which should suit other Request API-based VPU drivers.

In the long run, it is likely that players will integrate direct support for the Request API (for instance, through ffmpeg). In the meantime, this allows interfacing with media players through two mechanisms: buffer derivation, where the destination frames are copied (and converted to a regular image format when the VPU cannot provide one on its own), or dma-buf, without any copy.
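As an illustration of the first mechanism, here is a rough sketch of what the buffer-derivation path looks like from a player's point of view, using standard libva calls; whether a given player follows exactly this sequence is an assumption, and error handling is omitted.

```c
#include <string.h>
#include <va/va.h>

/* Sketch of the "buffer derivation" path: after the decode run, the player
 * maps the decoded surface and copies the pixels out. The dma-buf path
 * instead exports a handle to the same surface without any copy. */
static int copy_decoded_frame(VADisplay dpy, VASurfaceID surface,
                              void *dst, size_t dst_size)
{
    VAImage image;
    void *data;
    size_t size;

    /* Wait for the decode run targeting this surface to complete. */
    vaSyncSurface(dpy, surface);

    /* Derive a VAImage that directly describes the surface memory. */
    vaDeriveImage(dpy, surface, &image);
    vaMapBuffer(dpy, image.buf, &data);

    /* Copy (and possibly convert) the frame into the player's own buffer. */
    size = dst_size < image.data_size ? dst_size : image.data_size;
    memcpy(dst, data, size);

    vaUnmapBuffer(dpy, image.buf);
    vaDestroyImage(dpy, image.image_id);
    return 0;
}
```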

Zero-copy Pipeline Integration with EGL (Mali GPUs): VLC and GStreamer

In order to reach the best performance we can achieve, we focused on pipelines where no buffer copy is involved, with two popular players: VLC and GStreamer. Since the X.org display server does not easily permit piping the VPU output to a dedicated plane on the Display Engine side, we investigated the use of the GPU. GPU support on Allwinner platforms still requires proprietary blobs at this point, such as the ones recently made available by Bootlin. We hope that the Lima project will soon bring a fully free alternative that will be integrated with both upstream kernel and upstream userspace components.

We did not have much luck when dealing with the tiled VPU output format, which the GPU cannot handle directly. Although we wrote a GPU shader for untiling (which works properly with regular GL implementations), the Mali GPU blobs did not behave as expected when it came to importing the tiled output frame. There is a chance that platforms that can output a regular image format (A33 and onwards) will be able to pipe the VPU to the GPU for accelerated scaling and colorspace conversion, but we have not tested this option at this point.

Zero-copy Pipeline Integration with DRM (Display Engine): GStreamer and Kodi

Although involving the GPU in the pipeline was not a realistic possibility with the tiled VPU output format, various players support direct DRM video output, which uses the Display Engine directly to pipe the video. Alas, this means that no window composition is possible, so it cannot be integrated with desktop environments. Instead, the players run standalone in their own virtual terminal.

We initially looked at using GStreamer this way but soon decided to prioritize Kodi (formerly XBMC), the popular mediacenter application. It was a struggle to integrate our pipeline (through libva-v4l2-request, via ffmpeg) in Kodi, although DRM video output support was there already. We eventually managed to get a usable result out of it, although there are areas left to improve!

LibreELEC Image Release with Kodi

In order to showcase the delivery of our main VPU crowdfunding campaign goals, we cooked a release of LibreELEC that supports the A20, A33 and H3 SoCs! It consists of a LibreELEC root filesystem (excluding the kernel and boot software) that works in conjunction with our latest linux-cedrus kernel tree.

Source code is of course available through our repositories, marked with the release-2018-07 tag.
Instructions to deploy the software on a compatible board are available on the linux-sunxi community wiki!

Remaining Tasks

We have tackled many of the tasks on our plate at this point, but there are still items that need to be worked on:

  • posting new series of the Cedrus driver and H264 support until it is merged;
  • supporting H265 in our driver and userspace components;
  • supporting the ARM64 SoCs that come with version 2 of the Display Engine design, namely the H5 and A64;
  • contributing to the integration of our code in upstream Kodi and LibreELEC;
  • integrating a dma-buf and DRM pipeline with GStreamer.

Thanks

We would like to thank all the individuals and companies who have supported this project by participating in our crowdfunding campaign, but also the linux-sunxi community members who did the initial reverse engineering of the Cedrus VPU and who worked with us during the development of this driver, as well as the members of the V4L2 community who worked on the Request API and reviewed our patches.

Allwinner VPU support in mainline Linux status update (week 28)

This week was the occasion to send out version 5 of the Sunxi-Cedrus VPU driver, which uses version 16 of the media request API. The API contains the necessary internal plumbing for tying the metadata needed to decode the current video frame (exposed as V4L2 controls, which are data structures set by userspace) to the associated source buffer (which is extracted in slices from the raw video bitstream and contains the frame's encoded data). Adding this feature to the Linux kernel paves the way for supporting stateless VPUs such as Allwinner's Video Engine, which are found in various ARM platforms. With version 16, a number of reliability issues were fixed and we were able to run decoding tests for hours without hitting any error!

This new version of our driver contains several improvements, which are presented in the cover letter of the series. Most notably, it brings support for the H3 (which uses the second revision of Allwinner's Display Engine hardware block) and exposes linear YUV output in addition to the tiled output format. The issue of H264 decoding failing because the luma and chroma planes were too distant in memory was fixed by allocating contiguous buffers for the destination frames. However, this required significant changes in our display pipeline, which was the occasion to rework both cedrus-frame-test and libva-cedrus to handle various scenarios for buffer and plane matching, and to avoid hardcoded values that are specific to our pipeline. This opens the way to making these tools generic users of the V4L2 and DRM APIs, without any particular tie to our specific platform and setup.

H264 support (for the baseline profile) was also merged in our cedrus kernel tree, as well as in the cedrus-frame-test and libva-cedrus master branches. With that, it was possible to cook a LibreELEC build with support for our updated libva-cedrus, which allows decoding H264 videos! The result is presented in the video below, which runs on an A33 and shows Sintel, an animation movie made with free software by the Blender Foundation and released under the Creative Commons Attribution 3.0 Generic license:

We also spent some time figuring out the reason for the various artifacts seen on the A20 when using the display scaler. It turned out to be a missing register write, plus one register whose documented value was off by one, resulting in the last line of the picture being repeated.

Once done, we switched to working on the issue we mentioned last week with H264. After testing a few ideas, we now have H264 high profile working with libva-dump and cedrus-frame-test. The next step will be to port the new code handling the reference frames to libva-cedrus, and hopefully we will then be able to use it in our usual players.

Allwinner VPU support in mainline Linux status update (week 27)

This week, significant time was dedicated to preparing a new revision of the Sunxi-Cedrus VPU kernel driver. This new version (started last week), based on version 15 of the media request API, brought a number of challenges. First off, integrating the recently-tested VPU-side untiling of the destination buffers required a significant rewrite of the part in charge of managing formats and buffers. The part of our driver handling V4L2 controls (which are used to submit the frame metadata) was also significantly reworked, to allow validating that the frame metadata has indeed been submitted by userspace before launching a decode run. An initial implementation of this was brought up and discussed with V4L2 maintainer Hans Verkuil, who is backing (and baking) the request API series. He came up with a specific patch that should allow properly implementing this detection at the right time (when checking the media request's validity, instead of at the start of the run). Hans also solved various reliability issues that we were experiencing when using the request API with our driver. As a result, he posted version 16 of the request API series with these fixes. We are hoping that this version will be one of the final iterations of this long-awaited series!

While rebasing H264 support, we experienced a strange issue where the destination buffers were sometimes corrupted and sometimes not. All the hardware configuration (register writes) was exactly the same, except for the buffer addresses (which naturally tend to change depending on the allocation order in the related CMA memory pool). After some investigation, we discovered that corruption happens when the luma and chroma planes of the destination buffer are too distant in memory. It may be that some offset is used in the hardware at some point and is not coded on enough bits to represent a large gap. The way to work around this is to make sure that all the planes of our destination buffer are allocated contiguously. In practice, this means that we need a single allocation for each whole destination buffer (with the size of its two planes), ensuring that there is no gap between the planes.
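Concretely, the resulting layout of a destination buffer can be sketched as follows. The alignment values are assumptions for illustration; the hardware's real constraints may differ.

```c
#include <stddef.h>

/* Simplified sketch of the contiguous NV12 destination layout: a single
 * allocation holds both planes, so the chroma plane starts right after the
 * luma plane and no gap can appear between them. */
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((a) - 1))

struct dst_buffer_layout {
	size_t luma_size;
	size_t chroma_size;
	size_t chroma_offset;
	size_t total_size;	/* size of the single allocation */
};

static void nv12_contiguous_layout(unsigned int width, unsigned int height,
				   struct dst_buffer_layout *l)
{
	size_t stride = ALIGN_UP(width, 32);		/* alignment assumed */
	size_t aligned_height = ALIGN_UP(height, 32);

	l->luma_size = stride * aligned_height;		/* 1 byte per pixel */
	l->chroma_size = l->luma_size / 2;		/* interleaved CbCr, 4:2:0 */
	l->chroma_offset = l->luma_size;		/* chroma follows luma directly */
	l->total_size = l->luma_size + l->chroma_size;	/* one allocation, no gap */
}
```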

A random bug in the development of Sunxi-Cedrus with funky colors

Work has continued on H264, especially on adding support for High Profile decoding. However, our test video showed a limitation in our current code, due to what seems to be a limitation of the libva API. Indeed, the H264 codec relies on a decoded picture buffer (DPB) that holds the previously decoded pictures that might be used as reference frames to decode the current frame. The kernel interface needs that DPB, and our driver also needs it to perform some ID assignment for the current frame. However, libva only gives the list of frames needed to decode the current frame, and not the whole DPB. That leads to a situation where subsequent frames, using the same set of reference frames, are assigned the same ID, which obviously doesn't work very well. Most of the week was spent evaluating how we can address that issue, and starting to implement a solution based on a cache of the reference frames passed to our libva driver.
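To illustrate the idea, here is a hypothetical sketch of such a reference-frame cache, which hands out a stable slot index per surface. This only illustrates the approach under consideration, not the actual libva-cedrus code.

```c
#include <va/va.h>

/* Hypothetical reference-frame cache: map a VASurfaceID to a stable slot
 * index, so that a surface that keeps being used as a reference keeps the
 * same ID across frames, even though libva only hands us the per-frame
 * reference list. Slots must be initialized to VA_INVALID_SURFACE. */
#define DPB_SLOTS 16

struct dpb_cache {
    VASurfaceID slots[DPB_SLOTS];
};

static int dpb_slot_for(struct dpb_cache *cache, VASurfaceID surface)
{
    int i, free_slot = -1;

    for (i = 0; i < DPB_SLOTS; i++) {
        if (cache->slots[i] == surface)
            return i;                       /* already tracked: same ID */
        if (cache->slots[i] == VA_INVALID_SURFACE && free_slot < 0)
            free_slot = i;
    }

    if (free_slot >= 0)
        cache->slots[free_slot] = surface;  /* new reference: assign a slot */

    return free_slot;
}
```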

Allwinner VPU support in mainline Linux status update (week 26)

This week, on the video player integration side, we focused on the last remaining bits of Kodi integration with our pipeline, with support for the H3 SoC. Unlike its predecessors, the H3 does the untiling directly in the VPU itself instead of in the display engine, which makes the overall integration easier. Allwinner platforms that come with version 2 of the display engine only support linear YUV formats, which are piped directly from the VPU. This makes our pipeline for these platforms slightly different from previous generations, since the VPU needs to be configured properly and no DRM format modifier needs to be carried around on the software side.

Thankfully, the code for enabling this output was already implemented by the linux-sunxi community in libvdpau-sunxi and only minor changes had to be added, in order to use the NV12 YUV format for the luma and chroma planes. Thanks to the help of the LibreELEC community members interested in Allwinner devices, who are as always very helpful when it comes to working with Kodi on Allwinner platforms, it was possible to pipe everything together on the H3 (and the DE2 DRM driver even received an initial patch for z-pos support, which allows re-ordering layers for alpha blending).

The result is visible in the following video, which shows Elephants Dream, an animation movie made with free software by the Blender Foundation and released under the Creative Commons Attribution 2.5 Generic license:

On the V4L2 kernel driver side, a rebase of the Sunxi-Cedrus driver on top of the latest kernel release candidate, 4.18-rc2, was completed, and a number of features are being included in the series, including H3 support and support for the untiled NV12 video output. We expect to send a new iteration of the kernel driver patches next week.

We also discovered that the VPU untiling block is available on the A33 too, which will make supporting the A33 easier as well.

Finally, there was no progress on the H264 front this week, as Maxime was away on a business trip teaching a training course. Progress on H264 will resume next week.

Allwinner VPU support in mainline Linux status update (week 25)

This week started off with the submission of the fourth revision of the Sunxi-Cedrus VPU driver for review. Many improvements were squashed into this new version and the driver is closer than ever to being merged. With the media request API in a nearly-ready state, things are really coming together on the kernel side.

On the userspace side, our standalone testing tool cedrus-frame-test received a number of improvements, starting with Maxime's H264 work, which was rebased and integrated in the master branch. Atomic modesetting support with DRM planes was also completed and merged. It allowed completing dma-buf support in the tool, implementing a zero-copy pipeline. With asynchronous page-flipping, performance is getting really good, with only a few milliseconds required to schedule a flip and no buffer duplication involved!
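Schematically, the zero-copy display step of such a pipeline boils down to importing the decoded buffer as a DRM framebuffer and scheduling an asynchronous flip, roughly as in the sketch below. This is an illustration of the libdrm calls involved, not the actual cedrus-frame-test code; the plane and property IDs are assumed to have been looked up beforehand, the atomic client capability to have been enabled, and error handling is omitted.

```c
#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>
#include <drm_fourcc.h>

/* Import the decoder's dma-buf as a DRM framebuffer and schedule an
 * asynchronous page-flip on a plane. fb_id_prop is the plane's FB_ID
 * property, assumed to have been resolved with drmModeObjectGetProperties().
 * The stride is assumed equal to the width for simplicity. */
static int show_decoded_frame(int drm_fd, int dmabuf_fd,
                              uint32_t plane_id, uint32_t fb_id_prop,
                              uint32_t width, uint32_t height)
{
    uint32_t handle, fb_id;
    uint32_t handles[4] = { 0 }, pitches[4] = { 0 }, offsets[4] = { 0 };
    drmModeAtomicReq *req;

    /* Import the dma-buf: no copy, the VPU buffer is scanned out directly. */
    drmPrimeFDToHandle(drm_fd, dmabuf_fd, &handle);

    handles[0] = handle;            /* luma plane */
    handles[1] = handle;            /* chroma plane, same buffer */
    pitches[0] = width;
    pitches[1] = width;
    offsets[1] = width * height;    /* chroma starts right after luma */

    drmModeAddFB2(drm_fd, width, height, DRM_FORMAT_NV12,
                  handles, pitches, offsets, &fb_id, 0);

    /* Schedule the flip asynchronously: the commit returns immediately and
     * a page-flip event is delivered once the new frame is on screen. */
    req = drmModeAtomicAlloc();
    drmModeAtomicAddProperty(req, plane_id, fb_id_prop, fb_id);
    drmModeAtomicCommit(drm_fd, req,
                        DRM_MODE_ATOMIC_NONBLOCK | DRM_MODE_PAGE_FLIP_EVENT,
                        NULL);
    drmModeAtomicFree(req);

    return 0;
}
```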

Regarding integration with Kodi, we moved forward with the code using ffmpeg's hwcontext for decoding with VAAPI and mapping the VAAPI output to DRM, ending up in Kodi's DRM Prime renderer. The display pipeline is pretty much the same as in cedrus-frame-test with atomic modesetting and dma-buf, except that Kodi uses an extra plane on top for displaying its controls and interface.

There are still some configuration issues to work on for display (and perhaps some kind of corruption happening on the display engine's side), as illustrated in the following picture:

This week has seen some good H264 progress too! Our libva implementation has been tested, and while we encountered a VLC bug that makes it drop the first few seconds, once past that point, every frame of a baseline profile H264 video is decoded properly. We discussed this with the VLC developers, and since it also affects H264 software decoding, we will probably turn this into a bug report (and hopefully a bug fix!).

We therefore started to work on implementing High Profile support. We went back to the method we were using when first developing the baseline profile support: we dumped the register accesses made by libvdpau-sunxi while decoding the video on an Allwinner 3.4 kernel, and compared them with the register accesses we were doing. This is still at a very early stage, so we don't have much to show for now, but stay tuned for more news!

Allwinner VPU support in mainline Linux status update (week 24)

Integration with video players

Following up on last week’s efforts on the video players integration front, Kodi remained our core focus. With a LibreELEC setup in place, it was possible to start tackling VAAPI integration. This was not such a straightforward task, since various assumptions were in place. For instance, it was assumed that VAAPI support was only relevant for x86 platforms and it seems pretty clear that VAAPI integration in general was done with x86 in mind. This is particularly illustrated by the fact that the VAAPI video rendering pipeline relies on the GPU for all transformations and composition. This is a typical setup for x86, as the use of planes on these platforms was progressively replaced by a GPU-centric approach. Since our goal with Kodi is to use DRM/KMS planes in place of the GPU, this did not fit well. Moreover, the GPU import format required for dma-buf is simply not supported by the Mali blob (as we found out some weeks ago when working with VLC and the GLES untiler) and this is the only setup that Kodi currently supports for VAAPI.

There is still definitely hope, as Kodi supports a DRM Prime renderer that uses DRM/KMS planes in place of the GPU, but does not support VAAPI in its current form. More specifically, it uses ffmpeg to get a dma-buf handle (through the AV_PIX_FMT_DRM_PRIME format from ffmpeg), which is not available as-is from a VAAPI decoder. In order to get this sort of pipeline with VAAPI, multiple steps have to be taken. A hardware acceleration context has to be brought up to select the VAAPI acceleration method instead of regular software decoding. This exposes the AV_PIX_FMT_VAAPI format from ffmpeg, which is still not suitable for feeding Kodi's DRM Prime renderer. It has to be converted to AV_PIX_FMT_DRM_PRIME using ffmpeg helpers. As a result, some plumbing is required in Kodi, and this is still a work in progress at the moment.
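Assuming an ffmpeg build with libdrm support, the conversion step can be sketched roughly as follows; whether Kodi ends up using exactly this helper is an assumption on our side, and error handling is reduced to a minimum.

```c
#include <libavutil/frame.h>
#include <libavutil/hwcontext.h>
#include <libavutil/pixfmt.h>

/* Rough sketch of the conversion step: a decoded frame in AV_PIX_FMT_VAAPI
 * is mapped to AV_PIX_FMT_DRM_PRIME so that a DRM-based renderer can scan
 * it out directly. This is a simplified illustration of the ffmpeg helpers
 * involved, not Kodi's actual code. */
static AVFrame *map_vaapi_to_drm_prime(AVFrame *vaapi_frame)
{
    AVFrame *drm_frame = av_frame_alloc();

    /* Request a DRM PRIME mapping of the existing VAAPI surface: no copy,
     * the resulting frame carries dma-buf file descriptors. */
    drm_frame->format = AV_PIX_FMT_DRM_PRIME;
    if (av_hwframe_map(drm_frame, vaapi_frame, AV_HWFRAME_MAP_READ) < 0) {
        av_frame_free(&drm_frame);
        return NULL;
    }

    return drm_frame;
}
```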

In parallel to the work on players, our Sunxi-Cedrus VPU driver was rebased on top of the latest version of the media request API from Hans Verkuil. It was the occasion to spot various bugs in this latest iteration, which were rapidly tackled thanks to Hans' availability. The required follow-up patches were posted on the request API branch and will be part of its next revision. Regarding our driver itself, a great number of comments from our previous patchset were taken into account and integrated. We now have another iteration of the series ready, which we will publish soon. The task list for the driver itself keeps shrinking and we are getting closer and closer to the point where the driver is ready to be merged!

H264 support

On the H264 front, good progress has been made this week too. Early this week, we were able to play a baseline profile video without any particular quirks. Some time was thus spent cleaning up and refactoring the driver, libva-dump and cedrus-frame-test in order to support both the MPEG2 and H264 codecs, a capability that had been dropped due to the many hacks accumulated during development. We took this occasion to start the discussion on the linux-media mailing list by sending a preliminary version of the patches, and then worked on the real libva-cedrus, adding support for H264. Most of the code is there now, but unfortunately it isn't functional yet. Some debugging will be on the agenda next week 🙂

Allwinner VPU support in mainline Linux status update (week 23)

On the player integration side, the goals for this week covered Kodi support for our beloved Allwinner platforms (with upstream components, of course). But first, a few words as a follow-up to last week's work on the MB32 untiling GPU shader. A specific commit related to texture uploading on the Mali400 was spotted in the Mer project, fixing an issue apparently very similar to our own. Alas! It didn't help with our case and did not lead to any improvement.

While the shader untiler is required for accelerated X11 display with the GPU, Kodi offers direct DRM/KMS support (the Kernel Mode-Setting part of the Direct Rendering Manager, which deals with on-screen display). This means that we can use the DRM work from months ago to untile the VPU buffers directly with the display engine. This is sometimes even faster than the GPU, especially for 4K content!

However, Kodi is a complex piece of software that requires significant integration. Its support in Buildroot definitely reveals that complexity, which is gracefully abstracted by the build system. On top of that, the Kodi target platform for using DRM/KMS, called GBM (we'll get back to this acronym in a bit), is not supported by most build systems (Buildroot included), with the exception of LibreELEC, which is used by the developers contributing to this Kodi target. After an intense struggle, it became clear that LibreELEC was the only reasonable and sane way to go for supporting GBM. Thanks to the huge help and incredible availability of the community of LibreELEC developers interested in Allwinner support, it was possible to finally bootstrap a working installation (which does not interface with our VAAPI backend yet):

Big Buck Bunny with Kodi on the ALL-H3-CC, without VAAPI integration yet

In order to provide high performance and a pleasant experience, Kodi heavily relies on the GPU, which it drives through the EGL and GLES interfaces. EGL, in charge of the display part, has to be connected to the native windowing system of the target in use, which can be X11 or Wayland/GBM. GBM, which stands for Generic Buffer Management, is an abstracted API for graphics-related memory management. It allows abstracting memory allocators such as GEM (the Graphics Execution Manager used in conjunction with DRM) through a consistent and unified interface that is, like EGL and GLES, independent of the system and hardware implementations. Kodi uses GBM directly to allocate buffers shared between the GPU and the DRM subsystem.
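As an illustration, bringing up EGL on top of GBM for a DRM/KMS application typically looks like the following sketch. Error handling is omitted, the card node path is only an example, and some EGL implementations require the eglGetPlatformDisplayEXT entry point instead of the plain eglGetDisplay call used here.

```c
#include <fcntl.h>
#include <gbm.h>
#include <EGL/egl.h>

/* Minimal sketch of EGL on top of GBM: GBM allocates the buffers that are
 * shared between the GPU and the DRM subsystem. */
static EGLDisplay egl_display_from_gbm(int *out_drm_fd,
                                       struct gbm_device **out_gbm)
{
    int drm_fd = open("/dev/dri/card0", O_RDWR);
    struct gbm_device *gbm = gbm_create_device(drm_fd);

    /* The GBM device acts as the EGL native display. */
    EGLDisplay dpy = eglGetDisplay((EGLNativeDisplayType)gbm);
    eglInitialize(dpy, NULL, NULL);

    *out_drm_fd = drm_fd;
    *out_gbm = gbm;
    return dpy;
}

/* A scanout-capable surface shared between GPU rendering and DRM:
 *     struct gbm_surface *surf =
 *         gbm_surface_create(gbm, width, height, GBM_FORMAT_XRGB8888,
 *                            GBM_BO_USE_SCANOUT | GBM_BO_USE_RENDERING);
 * The EGL window surface is then created on top of this GBM surface, and
 * each frame's front buffer can be turned into a DRM framebuffer for
 * scanout. */
```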

This requires explicit cooperation from the EGL implementation in use, the Mali blobs in our case. Sadly, the blobs available for the A10/A13/A20 and A33 platforms do not provide the GBM interface. Still, LibreELEC offers support for the H3 platform, so it was selected as the primary target for bringing up Kodi with the GBM target. Thanks to Libre Computer, we received a (significant) number of boards for our development purposes, including H3 boards that were directly useful in this effort!

The H264 effort has also seen some great progress this week. We finally got the first frame of Big Buck Bunny to decode on Monday, and gradually improved libva-dump, cedrus-frame-test and our kernel driver to fix the bugs that were found along the way. The libvdpau-sunxi authors, and Jens Kuske in particular, provided some great feedback on how the reference list, the decoded picture buffer, and the Video Engine in general work. We can now play a video with only I-frames without any hiccups (that we noticed), and P-frame support is slowly getting into shape. We can decode the first 4 frames of Big Buck Bunny without any issue, and the fifth is reported as decoded, but, well, see below for yourself… Obviously this will need a bit more work, and testing with other videos and with B-frames. But this is good news!

Allwinner VPU support in mainline Linux status update (week 22)

Integration with video players

The work conducted this week on the video output side was focused on writing a shader for untiling the MB32 NV12-based format used by the VPU to output frames. This brought various challenges, some of which are presented below.

Since GLES and EGL are generic APIs that are not tied to a particular driver implementation, it made sense to start writing the shader on an x86 Intel-based device with GPU support in Mesa 3D (and speed up development). The first step of the process was to display the raw pixel values from the luminance plane through the shader. Actually, two shaders are required: one for the vertex processor and one for the fragment (pixel) processor of the GPU. The former is in charge of applying geometrical operations to the vertices (the points that define the 3D mesh) while the latter defines the color for each rendered pixel of that mesh. In our case, the mesh is simply a rectangle that matches our window size. The tiled NV12 luminance plane is uploaded to the GPU as a 1-byte-per-pixel texture, which allows addressing each component separately. However, the coordinates for the texture are normalized by the GPU, so the coordinates used to retrieve texels (texture pixels) from the texture sampler are specified as fractional values. This makes it tricky to ensure that the right value is retrieved, especially given that the GPU might apply various filtering techniques (which are a really good thing to have when dealing with actual textures for 3D models, though).
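For reference, the addressing that the shader has to reproduce can be sketched on the CPU as follows, assuming 32x32-byte tiles stored linearly one after another in row-major tile order, as documented by the linux-sunxi community. The real layout may involve additional alignment constraints, so this illustrates the principle rather than being a reference implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Map a linear (x, y) position to its offset in the tiled luminance plane,
 * assuming 32x32-byte tiles stored one after another in row-major tile
 * order. tiled_width is the plane width aligned to 32. */
static size_t mb32_offset(unsigned int x, unsigned int y,
                          unsigned int tiled_width)
{
    unsigned int tiles_per_row = tiled_width / 32;
    unsigned int tile_x = x / 32, in_x = x % 32;
    unsigned int tile_y = y / 32, in_y = y % 32;
    size_t tile_index = (size_t)tile_y * tiles_per_row + tile_x;

    return tile_index * 32 * 32 + (size_t)in_y * 32 + in_x;
}

/* Untile one luminance plane into a linear destination buffer. */
static void untile_luma(const uint8_t *src, uint8_t *dst,
                        unsigned int width, unsigned int height,
                        unsigned int tiled_width)
{
    for (unsigned int y = 0; y < height; y++)
        for (unsigned int x = 0; x < width; x++)
            dst[(size_t)y * width + x] = src[mb32_offset(x, y, tiled_width)];
}
```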

Setting up the vertex and fragment shaders to linearly display the pixels from the tiled format results in a mangled display (as expected):

Thanks to the documentation made available by the linux-sunxi community, it was possible to rapidly draft a formula for getting the right texel location, which produced mixed results:

With some extra work (and quirks for ensuring that the right texel is picked on tile edges), the luminance component was finally displayed correctly:

Next up was the chrominance component, which required importing a second dedicated texture. First tries led to funky-looking coloring of the frame:

Until the shader was corrected to end up with a good-looking picture:

Real trouble began when porting this work to the Mali, which does not behave the same way when it comes to texture uploads (and requires line-by-line upload for 1-byte-per-pixel formats). Since we are aiming at dma-buf import instead of (slow) texture upload, no time was spent coping with the difference. The main issue with dma-buf import is that the usual one-byte-per-pixel format (described by the DRM_FORMAT_R8 fourcc code) is simply not supported by the GPU, leaving only RGB and YUV as options, which do not directly fit the bill. We are still investigating ways to make our texture available to the GPU's texture sampler without extra copies (or with copies that can fit our bill in terms of performance).

H264 support

We also worked on H264 decoding in the kernel, and some progress was made. The libva-dump and cedrus-frame-test ports are now done, and we've been able to run cedrus-frame-test on 32 frames without any hiccups… Unfortunately, while the VPU reports the frames as properly decoded, the content of the output buffer is blank, which is obviously not great. Since then, we have simplified the test to decode a single frame, and compared the register write sequence between libvdpau-sunxi and our kernel code. This has allowed us to find some bugs in the driver, but the current state is still that we can't decode a frame. We shouldn't be very far off now though, so stay tuned for our next status update!

Allwinner VPU support in mainline Linux status update (week 21)

This week’s effort was focused on getting VLC to accelerate its video output using the Mali proprietary blobs. More specifically, two distinct interfaces are involved: EGL, that allows interacting with the platform’s windowing system (in our case, X11) and GLES, that is in charge of the rendering operations. While VLC already had support for both of these interfaces, it initially failed to create and use its GL-backed video output module with the Mali GPU blobs. Although everything indicated that it should have been working, the GLES calls were failing while EGL was setup and behaving correctly. The issue at hand was directly related to VLC using Qt for its interface. Because the Qt build used on the development boards was targeting GL support instead of GLES, it needed to import GL symbols that have the same name as GLES equivalents. Since Qt was loaded after the video output module, it would override the matching GLES symbols with GL symbols (from Mesa, not the blob).

With the help of Thomas Guillem, a few patches were crafted to fix the issue and sent out to the VLC developers. Some more revisions of these patches will be needed before the fix can land in the VLC tree, but it should make it sooner or later.

With VLC fixed, it was time to start looking at accelerating our pipeline with the GPU. VLC already includes GPU shaders for NV12 to RGB conversion as well as scaling and rotation, but does not have support for our tiled format. This is why we need a shader on our VAAPI backend side to accelerate the untiling operation. While the shader is currently a work in progress, further work is also required to properly export the resulting untiled buffer as a dma-buf handle for VLC. Since the GPU blob does not support dma-buf export, we will need to implement a standalone GBM provider compatible with our DRM driver, which will handle allocating surfaces (instead of the armsoc DDX that is currently used for accelerated graphics on X) and exporting dma-buf handles to them when needed. Generally speaking, this will also allow standardizing and sanitizing the integration of the Mali blobs with the rest of the system.

Stay tuned for our next update!