
More OpenGL binaries for the Mali support on Allwinner platforms with mainline Linux

Back in September, we announced the availability of Mali userspace blobs that provide OpenGL acceleration on Allwinner platforms using the mainline Linux kernel. Back then, only the r6p2 version of the Mali blobs was available, with only the x11 and fbdev backends, and only for 32-bit ARM. Following that announcement, we kept talking with Allwinner to get more binaries released and make them more useful. Two major categories were missing from the previous batch of binaries Allwinner allowed us to distribute: Wayland and arm64 flavours.

After some discussions, Allwinner provided us this week with additional Mali blobs, covering Wayland support, ARM64, and also newer versions for some of them. Overall, we now provide:

r6p2 version, ARM 32 bits, X11
r6p2 version, ARM 32 bits, fbdev
r6p2 version, ARM 32 bits, Wayland (new)
r6p2 version, ARM 64 bits, X11 (new)
r6p2 version, ARM 64 bits, fbdev (new)
r6p2 version, ARM 64 bits, Wayland (new)
r8p1 version, ARM 32 bits, fbdev (new)
r8p1 version, ARM 64 bits, fbdev (new)

We pushed everything to our GitHub repository, enjoy! See our previous blog post for instructions on how to use those blobs.

Those binary blobs are useful because they make fully working OpenGL acceleration possible on Allwinner platforms today: as a stability test, we recently ran a Qt5 application doing OpenGL rendering 24/7 on an Allwinner A33 platform for 1.5 months without interruption. Of course, in the long term, we are following the progress of the Lima project, which will provide a completely free and open-source solution for OpenGL acceleration on Allwinner platforms.

Bleeding edge toolchains updated

Since last year, our site toolchains.bootlin.com has provided a large selection of ready-to-use cross-compilation toolchains, covering a wide range of CPU architectures and C libraries. Today, we deployed a new update to all our bleeding-edge toolchains. Those toolchains are now based on:

GCC 8.1.0
Binutils 2.30
glibc 2.27 (plus fixes) or uClibc-ng 1.0.30 or musl 1.1.19
Linux headers 4.14
GDB 8.1


All our 77 bleeding-edge toolchains built successfully with those component versions, and many of them received runtime testing under Qemu. We would like to give special thanks to Romain Naour from Smile, who contributed a lot to this update by adding GCC 8.1 and GDB 8.1 support in Buildroot, and fixing a number of issues discovered when building those toolchains.

We will continue to regularly update our toolchains, and we are very interested in receiving feedback about those toolchains, to fix any issue or extend the range of configurations that are covered. Do not hesitate to get in touch!

Allwinner VPU support in mainline Linux status update (week 25)

This week started off with the submission of the fourth revision of the Sunxi-Cedrus VPU driver for review. Many improvements were squashed into this new version and the driver is closer than ever to being merged. With the media request API in a nearly-ready state, things are really coming together on the kernel side.

On the userspace side, our standalone testing tool cedrus-frame-test received a number of improvements, starting with Maxime’s H264 work that was rebased and integrated in the master branch. Atomic modesetting support with DRM planes was also completed and merged. This made it possible to complete dma-buf support in the tool, implementing a zero-copy pipeline. With asynchronous page-flipping, performance is getting really good, with only a few milliseconds required to schedule the flip and no buffer duplication involved!

Regarding integration with Kodi, we moved forward with the code using ffmpeg’s hwcontext for decoding with VAAPI and mapping from the VAAPI output to DRM, ending up through Kodi’s DRMMPrimeRenderer. The display pipeline is pretty much the same as cedrus-frame-test with atomic modesetting and dma-buf, only that Kodi uses an extra plane on top for displaying its controls and interface.

There are still some configuration issues to work on for display (and perhaps some kind of corruption happening on the display engine’s side), as illustrated in the following picture:

This week has seen some good H264 progress too! Our libva implementation has been tested, and while we encountered a VLC bug that makes it drop the first few seconds, once past that bug, every frame of a baseline profile H264 video is decoded properly. We’ve discussed this with VLC developers, and since it also affects H264 software decoding, we will probably turn this into a bug report (and hopefully a bug fix!).

We therefore started to work on implementing high profile support. We went back to the method we were using when first developing the baseline profile support: we dumped the register accesses made by libvdpau-sunxi while decoding the video on an Allwinner 3.4 kernel, and compared them with the register accesses our own code was making. This is at a very early stage, so we don’t have much to show for now, but stay tuned for more news!

Allwinner VPU support in mainline Linux status update (week 24)

Integration with video players

Following up on last week’s efforts on the video players integration front, Kodi remained our core focus. With a LibreELEC setup in place, it was possible to start tackling VAAPI integration. This was not such a straightforward task, since various assumptions were in place. For instance, it was assumed that VAAPI support was only relevant for x86 platforms and it seems pretty clear that VAAPI integration in general was done with x86 in mind. This is particularly illustrated by the fact that the VAAPI video rendering pipeline relies on the GPU for all transformations and composition. This is a typical setup for x86, as the use of planes on these platforms was progressively replaced by a GPU-centric approach. Since our goal with Kodi is to use DRM/KMS planes in place of the GPU, this did not fit well. Moreover, the GPU import format required for dma-buf is simply not supported by the Mali blob (as we found out some weeks ago when working with VLC and the GLES untiler) and this is the only setup that Kodi currently supports for VAAPI.

There is still definitely hope, as Kodi supports a DRM Prime renderer that uses DRM/KMS planes in place of the GPU, but it does not support VAAPI in its current form. More specifically, it uses ffmpeg to get a dma-buf handle (through the AV_PIX_FMT_DRM_PRIME format from ffmpeg), which is not available as-is with VAAPI. In order to get this sort of pipeline with VAAPI, multiple steps have to be taken. A hardware acceleration context has to be brought up to select the VAAPI acceleration method instead of regular software decoding. This exposes the AV_PIX_FMT_VAAPI format from ffmpeg, which is still not suitable for feeding the Kodi DRM Prime renderer. This has to be converted to AV_PIX_FMT_DRM_PRIME using ffmpeg helpers. As a result, some plumbing is required in Kodi and this is still a work in progress at the moment.

In parallel to the work on players, our Sunxi-Cedrus VPU driver was rebased on top of the latest version of the media request API from Hans Verkuil. This was an opportunity to spot various bugs in this latest iteration, which were rapidly tackled thanks to Hans’ availability. The required follow-up patches were posted on the request API branch and will be part of its next revision. Regarding our driver itself, a great number of comments on our previous patchset were taken into account and integrated. We now have another iteration of the series ready, which we will publish soon. The task list for the driver itself keeps shrinking and we are getting closer and closer to the point where the driver is ready to be merged!

H264 support

On the H264 front, good progress has been made this week too. Early this week, we were able to play a baseline profile video without any particular quirks. Some time was thus spent on cleaning up and refactoring the driver, libva-dump and cedrus-frame-test tools in order to support both the MPEG2 and H264 codecs, a feature that had been dropped due to many hacks during the development. We then took the opportunity to start the discussion on the linux-media mailing list by sending a preliminary version of the patches. We then worked on the real libva-cedrus, adding support for H264. Most of the code is there now, but unfortunately it isn’t functional yet. Some debugging will be on the agenda next week 🙂

Allwinner VPU support in mainline Linux status update (week 23)

On the players integration side, the goals for this week covered Kodi support for our beloved Allwinner platforms (of course, with upstream). But first, a few words as a follow-up to last week’s work on the MB32 untiling GPU shader. A specific commit related to texture uploading on the Mali400 was spotted in the MER project, fixing an issue apparently very similar to our own. Alas! It didn’t help with our case and did not lead to any improvement.

While the shader untiler is required for accelerated X11 display with the GPU, Kodi offers direct DRM/KMS support (the Kernel Mode-Setting part of the Direct Rendering Manager, which deals with on-screen display). This means that we can use the DRM work from months ago for untiling the VPU buffers directly with the display engine. This is sometimes even faster than the GPU, especially for 4K content!

However, Kodi is a complex piece of software that requires significant integration. Its support in Buildroot definitely reveals that complexity, that is gracefully abstracted by the build system. On top of that, the Kodi target platform for using DRM/KMS, called GBM (we’ll get back to this acronym in a bit) is not supported in most build systems (Buildroot included), with the exception of LibreELEC, that is used by the developers contributing to this Kodi target. After an intense struggle, it became clear that LibreELEC was the only reasonable and sane way to go for supporting GBM. Thanks to the huge help and incredible availability of the community of LibreELEC developers interested in Allwinner support, it was possible to finally bootstrap a working installation (that does not interface with our VAAPI backend yet):

Big Buck Bunny with Kodi on the ALL-H3-CC, without VAAPI integration yet

In order to provide high performance and a pleasant experience, Kodi heavily relies on the GPU, which is supported through the EGL and GLES interfaces. EGL, in charge of the display part, has to be connected to the native windowing system of the target in use, which can be X11 or Wayland/GBM. GBM, which stands for Generic Buffer Management, is an abstracted API for graphics-related memory management. It allows abstracting memory allocators such as GEM (the Graphics Execution Manager used in conjunction with DRM) through a consistent and unified interface that is, as for EGL and GLES, independent from the system and hardware implementations. Kodi uses GBM directly to allocate buffers shared between the GPU and the DRM subsystem.

This requires explicit cooperation from the EGL implementation in use, the Mali blobs in our case. Sadly, the blobs available for the A10/A13/A20 and A33 platforms do not provide the GBM interface. Still, LibreELEC offers support for the H3 platform and so it was selected as a primary target for setting up Kodi support for the GBM target. Thanks to Libre Computer, we received a (significant) number of boards for our development purposes, including H3 boards that were directly useful in this effort!

The H264 effort has also seen some great progress this week. We finally got the first frame of Big Buck Bunny to be decoded on Monday, and gradually improved libva-dump, cedrus-frame-test and our kernel driver to fix the bugs that were found along the way. The libvdpau-sunxi authors, and Jens Kuske in particular, provided some great feedback on how the reference list, decoded picture buffers, and the Video Engine in general work. We can now play a video with only I-frames without any hiccups (that we know of), and the P-frame support is slowly getting into shape. We can decode the first 4 frames of Big Buck Bunny without any issue, and the fifth is reported as decoded, but, well, see below for yourself… Obviously this will need a bit more work, and testing with other videos and with B-frames. But this is good news!

Linux 4.17 released, Bootlin contributions

4.17 was released last Sunday, so it’s time for our highlight article to see daylight.

As always, LWN.net provided interesting coverage of this release cycle’s merge window, highlighting the most important changes: the first half of the 4.17 merge window and the rest of the 4.17 merge window. For 4.17 alone, Bootlin contributed a total of 331 patches, which puts us in 10th place in the ranking of most contributing companies according to both LWN and KPS.

Also according to LWN statistics, Bootlin’s engineer Alexandre Belloni is the 6th most active developer in terms of changesets for this release with a total of 124 commits, almost a percent of the total number.

The main highlights of our contributions are:

For Marvell platforms:
- Antoine Ténart improved the inside-secure crypto accelerator driver by adding support for hmac(sha256) and hmac(sha224) algorithms. This driver is used on Marvell Armada 3700 and 7K/8K.
- Antoine Ténart and Gregory Clement also contributed a number of fixes to the inside-secure crypto accelerator driver,
- Maxime Chevallier added hardware offloading support for VLAN filtering to the mvpp2 network driver. This driver is used on Marvell Armada 375 and 7K/8K.
- Maxime Chevallier also contributed a number of fixes to the mvpp2 network driver,
- Miquèl Raynal migrated boards from PXA3xx NAND driver to the NAND driver he developed and got merged in 4.16,
- Gregory Clement contributed a rework of the clocks representation on Marvell Armada 7K/8K that led to the updates of several Device Trees,
In the MTD subsystem, both Boris Brezillon and Miquèl Raynal contributed numerous fixes,
For Microsemi Ocelot platforms:
- Alexandre Belloni added support for Microsemi Ocelot SoC’s reset,
- Alexandre Belloni added support for Microsemi Ocelot SoC’s interrupt controller,
- Alexandre Belloni added basic support for Microsemi Ocelot: core support, SoC Device Tree and board Device Tree. Alexandre is therefore the maintainer of the Microsemi Ocelot support.
For RaspberryPi platforms, Boris Brezillon contributed the exposing of performance counters of VC4 DRM driver to userspace and numerous fixes to VC4 DRM driver,
For Allwinner platforms:
- Quentin Schulz added support for ADC and battery power supply on the X-Powers AXP813 PMIC,
- Quentin Schulz also contributed support for cpufreq for the Allwinner A83T,
- Maxime Ripard added support for YUV planes in the DRM driver of Allwinner SoCs,
- Maxime Ripard contributed a few fixes and improvements to the DRM driver of Allwinner SoCs,
- Maxime Ripard also contributed a few improvements to the MMC core to be able to comply with Allwinner guidelines,
For the RTC subsystem, Alexandre Belloni moved rtc_nvmem_register() out of the core so drivers can better fine-tune their initialization, and fixed a few issues along the way (most notably a possible race condition),
For the SuperH architecture, Thomas Petazzoni contributed a few fixes for the PCIe controller of the Renesas SH7786,
For the ASoC subsystem, Mylène Josserand added support for the TI PCM1789 DAC.

Bootlin engineers are not only contributors, but also maintainers of various subsystems in the Linux kernel, which means they are involved in the process of reviewing, discussing and merging patches contributed to those subsystems:

Maxime Ripard, as the Allwinner platform co-maintainer, merged 97 patches from other contributors
Boris Brezillon, as the MTD/NAND maintainer, merged 87 patches from other contributors
Alexandre Belloni, as the RTC maintainer and Atmel platform co-maintainer, merged 46 patches from other contributors
Grégory Clement, as the Marvell EBU co-maintainer, merged 14 patches from other contributors

Here is the commit by commit detail of our contributions to 4.17:

spi-mem: bringing some consistency to the SPI memory ecosystem

In this article, we would like to introduce our work on the spi-mem Linux kernel framework, which will make it possible to reuse SPI controller drivers for both SPI NOR devices and regular SPI devices, as well as SPI NAND devices in the future.

From SPI to Dual, Quad and Octo SPI

In the good old days, SPI was a simple protocol, with only 3 signals shared by all devices present on the bus:

MISO: Master In Slave Out
MOSI: Master Out Slave In
SCLK: Serial Clock

and 1 extra signal per device to select the device we want to communicate with:

SS: Slave Select (also called CS for Chip Select sometimes)

But then SPI memories appeared. It started with small and relatively slow ones, like dataflash, EEPROMs and SRAMs, and progressively moved to rather large SPI NORs and SPI NANDs. As usual when it comes to dealing with memories, we want to get the best performance out of them. SPI bus limitations quickly became the bottleneck, so vendors decided to add more I/O lines and make the MISO/MOSI lines bi-directional. Nowadays we see SPI controllers that support up to 8 I/O lines. For those who are familiar with the terms, that’s what we call DualSPI, QuadSPI and OctoSPI.

In order to use all I/O lines when doing a master to slave or slave to master transfer, there must be some kind of contract between the slave and the master so that both of them know when they can receive or transmit data on the I/O lines and how many of them they should use. This is done through a predefined set of operations exposed by the slave that the master has to follow to enter a specific transmit or receive state. A SPI memory operation is usually composed of:

1 byte opcode encoding the operation to be executed (but be prepared for 2 bytes opcode, they are coming)
0 to N address bytes whose meaning is dependent on the opcode (can be an absolute memory address or something else)
0 to N dummy bytes to leave the slave enough time to enter the specific state requested through the opcode. Again, the number of dummy bytes depends on the opcode
0 to N IN or OUT data bytes, the direction depends on the opcode
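To make the structure above more concrete, here is a minimal Python sketch of such an operation. The `SpiMemOp` class and its field names are hypothetical and only serve as an illustration; this is not the kernel data structure. The example values (opcode 0x0B, 3 address bytes sent most significant byte first, one dummy byte) are assumptions modelled on a typical SPI NOR fast read.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpiMemOp:
    """Illustrative model of one SPI memory operation (not the kernel struct)."""
    opcode: int            # 1-byte opcode
    addr: int = 0          # address value, only meaningful if addr_nbytes > 0
    addr_nbytes: int = 0   # 0 to N address bytes
    dummy_nbytes: int = 0  # 0 to N dummy bytes
    data_out: bytes = b""  # OUT data (master to slave), empty for IN operations
    data_in_len: int = 0   # number of IN data bytes expected from the slave

    def phases(self) -> List[Tuple[str, bytes]]:
        """Return the non-empty phases of the operation, in bus order."""
        result = [("opcode", bytes([self.opcode]))]
        if self.addr_nbytes:
            # Assumption: address bytes are sent most significant byte first
            result.append(("addr", self.addr.to_bytes(self.addr_nbytes, "big")))
        if self.dummy_nbytes:
            result.append(("dummy", b"\xff" * self.dummy_nbytes))
        if self.data_out:
            result.append(("data-out", self.data_out))
        elif self.data_in_len:
            # For IN data the master only clocks; zeros are a length placeholder
            result.append(("data-in", bytes(self.data_in_len)))
        return result


# Hypothetical fast-read style operation: opcode, 3 address bytes,
# 1 dummy byte, then 4 bytes read back from the device.
read_op = SpiMemOp(opcode=0x0B, addr=0x001000, addr_nbytes=3,
                   dummy_nbytes=1, data_in_len=4)
for name, payload in read_op.phases():
    print(name, payload.hex())
```

Phases with a length of zero simply disappear from the list, which matches the "0 to N" wording used for the address, dummy and data parts.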

Note that, while this protocol tends to be used by memory devices, there’s nothing restricting it to memories, and I wouldn’t be surprised if some FPGA were using the same kind of transactions for things that are not memory oriented at all.

The Linux SPI ecosystem

Linux has been supporting Dual and Quad SPI mode for quite some time already (since v3.12), and SPI device drivers could specify the number of I/O lanes for each SPI transfer. This way, a SPI memory operation could be broken up into several SPI transfers, each of them using a pre-defined number of I/O lanes.
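As a rough illustration of why the number of I/O lanes matters, the following Python sketch counts the clock cycles needed by an operation split into several transfers. It assumes single data rate (one bit per lane per clock cycle) and expresses the dummy phase as a byte sent on one lane; the helper names and the 256-byte read are made up for the example.

```python
def transfer_cycles(nbytes: int, lanes: int) -> int:
    """Clock cycles needed to move nbytes over the given number of I/O lanes."""
    if lanes not in (1, 2, 4, 8):
        raise ValueError("lanes must be 1, 2, 4 or 8")
    # Assumption: single data rate, one bit per lane per clock cycle
    return nbytes * 8 // lanes


def op_cycles(transfers):
    """Sum the cycles of an operation split into (nbytes, lanes) transfers."""
    return sum(transfer_cycles(nbytes, lanes) for nbytes, lanes in transfers)


# Hypothetical read of 256 bytes: opcode (1 byte), address (3 bytes) and
# one dummy byte on a single lane, then the data on 1 or 4 lanes.
single = [(1, 1), (3, 1), (1, 1), (256, 1)]
quad = [(1, 1), (3, 1), (1, 1), (256, 4)]

print(op_cycles(single))  # 2088
print(op_cycles(quad))    # 552
```

With these assumptions, moving only the data phase to four lanes already cuts the total cycle count by almost a factor of four, since the data phase dominates the operation.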

That worked just fine until some IP vendors decided to make their SPI controllers smarter and embed some kind of high-level interface to execute SPI memory operations in a single step instead of splitting them into several transfers (actually, most SPI controllers are even smarter than that and allow you to directly map a SPI memory in the CPU address space, but let’s keep that for a future post). At this point, we needed to give more control to the SPI controller so that it could decide what to do exactly, without having to reconstruct a SPI memory operation out of a group of SPI transfers.

At that time, the decision was to dedicate those controllers to one single task: controlling SPI NORs (which were the only users of quad and dual mode back then), and the SPI NOR framework was created for this reason.

Due to this decision, we now have in Linux a SPI NOR framework which connects SPI NOR controller drivers to the SPI NOR logic on one side (spi-nor subsystem), and we have regular SPI controller drivers which can do basic SPI transfers (spi subsystem). However, from a hardware point of view, advanced SPI controllers that provide special features for SPI NOR can often also do basic transfers, and therefore control regular SPI devices. Unfortunately, with the current split between the spi-nor and spi subsystems, if a SPI controller is supported by a driver in the spi-nor subsystem, it cannot be used to communicate with regular SPI devices normally managed by the spi subsystem.

As a partial solution to this problem, the ->spi_flash_read() operation was added to struct spi_controller. This allowed regular SPI controller drivers in the spi subsystem to provide an optimized way to read from a SPI NOR memory, which is used by the generic SPI NOR driver m25p80. However, this solution is partial, as it only optimizes reading and is limited to SPI NORs.

Current SPI NOR stack

In the current stack, we have:

The SPI NOR framework, which knows the protocol to talk to SPI NOR memories. This framework relies on an interface listed in struct spi_nor, which is implemented by:
- Specialized SPI NOR controllers, which support advanced SPI controllers dedicated to SPI NORs.
- The m25p80 driver, which provides the same interface, but on top of dumb/regular SPI controller drivers, with the possible optimization of ->spi_flash_read()

What led us to propose the SPI memory interface?

We’ve seen before that the SPI NOR case was already properly supported thanks to the SPI NOR framework. But NORs are not the only memory devices you’ll find on a SPI bus, SPI NANDs are becoming more and more popular.
Peter Pan proposed a framework to handle SPI NAND devices which followed the SPI NOR model: SPI controllers had to implement the SPI NAND controller interface in order to control SPI NANDs. But when we got more deeply involved in this development, we quickly realized how messy it would be to follow this path, because that meant forcing SPI controllers to implement both the SPI NOR and SPI NAND interfaces if they wanted to be able to control both kinds of devices. And what will happen when the trend moves to SPI NVRAM, or to any other kind of storage that manufacturers decide to put on a SPI bus? Adding yet another interface that SPI controllers would again have to implement didn’t sound like a good idea.

So instead, we decided to take the problem from the other end, and tried to figure out what SPI NANDs and SPI NORs have in common.
The instruction set is different, the behavior and constraints are different (mainly due to NOR vs NAND differences), but they both follow the same SPI memory operation semantics when interacting with the device, the same ones advanced controllers are trying to optimize.

The SPI memory layer is just a way for SPI controller drivers to be passed high-level SPI memory operations instead of letting them handle SPI transfers and try to optimize them themselves. It also simplifies SPI memory drivers’ life, since all they have to care about is sending SPI memory operations based on the SPI memory specs. No need for a complex, constantly evolving, memory dependent interface.

SPI memory stack

With this new stack, both SPI NOR and SPI NAND can be supported on top of the same SPI controller drivers. The m25p80 driver is modified to use the spi-mem interface instead of the limited ->spi_flash_read() interface. For now, we still have dedicated SPI NOR controller drivers, but the goal is to get rid of them, and implement them as regular SPI controllers in drivers/spi. Help and contributions in this direction are very welcome!

What does the SPI memory API look like?

The SPI memory API is exposed in include/linux/spi/spi-mem.h.

SPI device drivers that want to use the SPI memory API should declare themselves as spi_mem_drivers and implement the ->probe() and ->remove() functions.
They will be passed a spi_mem object which is just a thin wrapper around a spi_device object. The reason we have a different object is because we want to be able to extend the spi_mem object and attach more information to it if required (like the type of memory, the memory organization and other kind of information advanced SPI controllers might want to know).
When a driver wants to execute a SPI memory operation, it will fill a spi_mem_op struct and call spi_mem_exec_op(). One can also test if a SPI controller is supporting a specific memory operation with spi_mem_supports_op() or try to split data transfers so that they don’t exceed the max transfer size supported by the controller with spi_mem_adjust_op_size().

Now, let’s have a look at the controller side of things. A SPI controller driver that wants to optimize SPI memory operations can implement the spi_mem_ops interface, which contains 3 methods that directly match the user API:

->exec_op(): execute the memory operation or return -ENOTSUPP if it’s not supported
->supports_op(): just check if the memory operation is supported
->adjust_op_size(): adjust the size of the data transfer of a memory operation to cope with alignment and max FIFO size constraints

Note that when spi_mem_ops is not implemented, the core will add generic support for this feature by creating SPI messages formed of several SPI transfers, just as the generic SPI NOR controller driver (named m25p80) was doing before.
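The interplay between these three methods and the generic fallback can be modelled with a short Python sketch. This is only a toy model of the control flow described above, not kernel code: the `FifoController` class, its 64-byte limit, the dictionary-based operation and the `NotSupported` exception (standing in for -ENOTSUPP) are all assumptions made for the illustration.

```python
class NotSupported(Exception):
    """Stands in for the -ENOTSUPP error code in this sketch."""


class FifoController:
    """Hypothetical controller with a 64-byte data FIFO, handling reads only."""
    MAX_DATA = 64

    def supports_op(self, op):
        return op["dir"] == "in"

    def adjust_op_size(self, op):
        # Shrink the data phase so that it fits in the FIFO
        op["nbytes"] = min(op["nbytes"], self.MAX_DATA)

    def exec_op(self, op):
        if not self.supports_op(op):
            raise NotSupported()
        return f"exec_op {op['nbytes']}B @0x{op['addr']:x}"


def generic_exec(op):
    """Fallback: describe the operation as one message made of several transfers."""
    return ["opcode", "addr", f"data-{op['dir']} {op['nbytes']}B"]


def read_all(ctrl, addr, length):
    """Read `length` bytes, letting the controller shrink each operation."""
    log = []
    while length:
        op = {"dir": "in", "addr": addr, "nbytes": length}
        if ctrl is not None:
            ctrl.adjust_op_size(op)
            log.append(ctrl.exec_op(op))
        else:
            # No optimized implementation: fall back to plain SPI transfers
            log.append(generic_exec(op))
        addr += op["nbytes"]
        length -= op["nbytes"]
    return log


print(read_all(FifoController(), 0x1000, 150))
print(read_all(None, 0x1000, 150))
```

In this model, a 150-byte read on the FIFO-limited controller turns into three operations of 64, 64 and 22 bytes, while the fallback path handles it as a single message of several transfers.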

As you can see, the API is pretty straightforward, so hopefully, more SPI memory drivers will be converted to use it instead of manually creating SPI messages containing several SPI transfers.

Current status

A number of things have already been contributed and merged, scheduled to be part of the 4.18 Linux kernel release:

The spi-mem layer itself, in commit c36ff266dc82f4ae797a6f3513c6ffa344f7f1c7.
The two SPI controller drivers that were implementing the ->spi_flash_read interface now implement the spi-mem interface instead: bcm-qspi and ti-qspi.
The ->spi_flash_read interface is removed in commit c1f5ba70decfc2f35edcc10505e3e78fb528d212.
The m25p80 driver is modified to use the spi-mem layer in commit 4120f8d158ef904fb305b27e4a4524649faf3096.

What’s next?

Advanced SPI controllers can do more than just optimizing the execution of SPI memory operations, they can often hide all the complexity of memory accesses behind a directly-mapped IOMEM region, which, every time it is accessed, triggers a SPI memory operation on the bus and retrieves or sends the data for you, thus behaving like a memory that would be placed on a parallel memory bus. As you can imagine, this allows for even higher throughput and less CPU time consumed for SPI mem op management, but it’s also something that is hard to expose in a generic way. We have posted on the linux-mtd mailing list a proposal to support such a direct mapping capability.

As detailed earlier, another challenging topic is the conversion of all SPI NOR controller drivers to the SPI mem model so that all QSPI controllers are really exposed as SPI controllers and not SPI NOR controllers. This is likely to take a bit of time since we currently have 10 drivers in drivers/mtd/spi-nor and we’re only aware of 2 of them being converted to the SPI mem approach (fsl-quadspi and atmel-quadspi).

Allwinner VPU support in mainline Linux status update (week 22)

Integration with video players

The work conducted this week on the video output side was focused on writing a shader for untiling the MB32 NV12-based format used by the VPU to output frames. This brought various challenges, some of which are presented below.

Since GLES and EGL are generic APIs that are not tied to a particular driver implementation, it made sense to start writing the shader on an x86 Intel-based device with GPU support in Mesa 3D (and speed up development). The first step in the process was to display the raw pixel values from the luminance plane through the shader. Actually, two shaders are required: one for the vertex processor and one for the fragment (pixel) processor of the GPU. The former is in charge of applying geometrical operations to the vertices (the points that define the 3D mesh) while the latter defines the color for each rendered pixel from that mesh. In our case, the mesh is simply a rectangle that matches our window size. The tiled NV12 luminance plane is uploaded to the GPU as a 1-byte-per-pixel texture, which allows addressing each component separately. However, the coordinates for the texture are normalized by the GPU, so coordinates to retrieve texels (texture pixels) from the texture sampler are specified as fractional values. This makes it tricky to ensure that the right value is retrieved, especially given that the GPU might apply various filtering techniques (that are a really good thing to have when dealing with actual textures for 3D models, though).
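The difficulty with normalized coordinates can be illustrated outside of any shader with a few lines of Python. This is a sketch of the usual approach (aiming at the center of a texel), not the code of our shader; the function names are made up and nearest filtering is assumed.

```python
def texel_center(index: int, size: int) -> float:
    """Normalized coordinate of the center of texel `index` along one axis."""
    return (index + 0.5) / size


def nearest_texel(coord: float, size: int) -> int:
    """Texel index selected by nearest filtering for a normalized coordinate."""
    return min(int(coord * size), size - 1)


width = 1280
for x in (0, 1, 639, 1279):
    u = texel_center(x, width)
    # Sampling at the texel center always lands back on the intended texel
    assert nearest_texel(u, width) == x
    print(x, u)
```

Aiming at the center leaves half a texel of margin on each side, whereas a coordinate such as `x / width` sits exactly on the boundary between two texels, where rounding or filtering can pick the neighbour.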

Setting up the vertex and fragment shaders to linearly display the pixels from the tiled format results in a mangled display (as expected):

Thanks to the documentation made available by the linux-sunxi community, it was possible to rapidly draft a formula for getting the right texel location, which produced mixed results:

With some extra work (and quirks for ensuring that the right texel is picked on tile edges), the luminance component was finally displayed correctly:
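For readers who wonder what such a formula looks like, here is a CPU-side Python sketch of the address computation. It is not the shader itself, and the layout is an assumption: 32x32 tiles of 1-byte samples, tiles stored one after the other from left to right and top to bottom, rows stored in order inside each tile, and a plane padded to whole tiles. The function names are hypothetical.

```python
TILE = 32  # assumed tile width and height, in samples


def tiled_offset(x: int, y: int, width: int) -> int:
    """Byte offset of luminance sample (x, y) in the assumed tiled layout."""
    tiles_per_row = (width + TILE - 1) // TILE  # width rounded up to whole tiles
    tile_index = (y // TILE) * tiles_per_row + (x // TILE)
    return tile_index * TILE * TILE + (y % TILE) * TILE + (x % TILE)


def untile(tiled: bytes, width: int, height: int) -> bytes:
    """Convert a tiled plane to a linear one, sample by sample (slow but clear)."""
    return bytes(tiled[tiled_offset(x, y, width)]
                 for y in range(height) for x in range(width))


# Two tiles side by side: the first one filled with 0, the second with 1
tiled = bytes([0]) * (TILE * TILE) + bytes([1]) * (TILE * TILE)
linear = untile(tiled, 64, 32)
print(linear[:64].hex())
```

In the linear result, each 64-sample row is made of 32 samples from the first tile followed by 32 samples from the second one. A shader has to perform the same computation in reverse for every output pixel, on top of the coordinate normalization issue.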

Next up was the chrominance component, which required importing a second dedicated texture. First tries led to funky-looking coloring of the frame:

Until the shader was corrected to end up with a good-looking picture:

Real trouble began when porting this work to the Mali, that does not behave the same when it comes to texture uploads (and requires line-by-line upload for 1-byte-per-pixel formats). Since we are aiming at DMAbuf import instead of (slow) texture upload, no time was spent coping with the difference. The main issue with DMAbuf import is that the usual one-byte-per-pixel format (described by the DRM_FORMAT_R8 fourcc code) is simply not supported by the GPU, leaving only RGB and YUV as options, that do not directly fit the bill. We are still investigating ways to make our texture available to the GPU’s texture sampler without extra copies (or with copies that can fit our bill in terms of performance).

H264 support

We also worked on H264 decoding in the kernel, and some progress was made. The libva-dump and cedrus-frame-test ports are now done, and we’ve been able to run cedrus-frame-test on 32 frames without any hiccups… Unfortunately, while the VPU reports the frames as properly decoded, the contents of the output buffer are blank, which is obviously not great. Since then, we have simplified the test to have a single frame decoded, and compared the register write sequence between libvdpau-sunxi and our kernel code. This has allowed us to find some bugs in the driver, but the current state is still that we can’t decode a frame. We shouldn’t be very far now though, so stay tuned for our next status update!

Allwinner VPU support in mainline Linux status update (week 21)

This week’s effort was focused on getting VLC to accelerate its video output using the Mali proprietary blobs. More specifically, two distinct interfaces are involved: EGL, which allows interacting with the platform’s windowing system (in our case, X11), and GLES, which is in charge of the rendering operations. While VLC already had support for both of these interfaces, it initially failed to create and use its GL-backed video output module with the Mali GPU blobs. Although everything indicated that it should have been working, the GLES calls were failing while EGL was set up and behaving correctly. The issue at hand was directly related to VLC using Qt for its interface. Because the Qt build used on the development boards was targeting GL support instead of GLES, it needed to import GL symbols that have the same name as GLES equivalents. Since Qt was loaded after the video output module, it would override the matching GLES symbols with GL symbols (from Mesa, not the blob).

With the help of Thomas Guillem, a few patches were crafted to fix the issue and sent out to the VLC developers. Some more revisions of these patches will be needed for the fix to be integrated into the VLC tree, but it should land sooner or later.

With VLC fixed, it was time to start looking at accelerating our pipeline with the GPU. VLC already includes GPU shaders for NV12 to RGB conversion as well as scaling and rotation, but does not have support for our tiled format. This is why we need a shader on our VAAPI backend side to accelerate the untiling operation. While the shader is currently a work in progress, further work is also required to properly export the resulting untiled buffer as a DMABUF handle for VLC. Since the GPU blob does not support DMABUF export, we will need to implement a standalone GBM provider compatible with our DRM driver, which will handle allocating surfaces (instead of the armsoc DDX that is currently used for accelerated graphics on X) and exporting DMABUF handles to them when needed. Generally speaking, this will also allow standardizing and sanitizing the integration of the Mali blobs with the rest of the system.

Stay tuned for our next update!

Allwinner VPU support in mainline Linux status update (week 20)

With DMABUF support tested, it has become possible for Paul to start the work on integrating a GPU-based video output pipeline with Sunxi-Cedrus. Using the GPU should greatly improve performance when it comes to displaying the video frames. As of now, we have been using software untiling, software YUV-to-RGB colorspace conversion and software scaling. We are looking to replace these steps with GPU-based untiling, colorspace conversion and scaling. Shaders are used to implement these operations: they are small programs that are compiled on-the-fly for the GPU’s very specialized instruction set. Most players embed shaders (in their source form, using the GL shading language) for usual operations like colorspace conversion and scaling. However, these players are not ready to handle untiling as of now (or even be notified that the format returned by our VAAPI backend is tiled).
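As an idea of the per-pixel work that the software path performs and that a shader would take over, here is a minimal Python sketch of a YUV-to-RGB conversion. It assumes the common BT.601 limited-range coefficients; the text above does not state which conversion matrix our pipeline actually uses, and the function names are made up.

```python
def clamp(value: float) -> int:
    """Round and clamp a color component to the 0..255 range."""
    return max(0, min(255, int(round(value))))


def yuv_to_rgb(y: int, u: int, v: int):
    """Convert one pixel, assuming BT.601 limited-range YUV (an assumption)."""
    c = 1.164 * (y - 16)  # luminance, rescaled from the 16..235 range
    d = u - 128           # blue-difference chrominance, centered on zero
    e = v - 128           # red-difference chrominance, centered on zero
    return (clamp(c + 1.596 * e),
            clamp(c - 0.392 * d - 0.813 * e),
            clamp(c + 2.017 * d))


print(yuv_to_rgb(16, 128, 128))   # black
print(yuv_to_rgb(235, 128, 128))  # white
```

Doing this on the CPU for every pixel of every frame, on top of untiling and scaling, is what makes the software path slow. The same few multiplications and additions map naturally to a fragment shader.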

The first step in our plan is to get VLC to cooperate with the X11 flavor of the Mali proprietary blobs that Allwinner has released in the past so that we can use GPU support for colorspace and scaling. This is still a blocking point as of now. Then, we will look into crafting a shader for untiling the VPU output frame and integrating it with our libva-cedrus VAAPI backend.

As a sidenote, the free software Lima driver is being prepared for a first RFC series, bringing the first bits of mainline Linux kernel support for Mali GPUs of the Utgard generation. So even though work on the GPU only concerns the proprietary blob for now, the work will eventually become useful to the free software driver as well.

We have also tested Sunxi-Cedrus on the H3 and started looking at integrating the display part (which differs from earlier SoCs by using a revised display engine: DE2). However, since this is a stretch goal of the fundraiser and we have many other tasks left to tackle among our main goals, this is by far not our priority at the moment.

Finally, we worked more on libva-dump and cedrus-frame-test for H264, which will hopefully allow us to test our first H264 decoding next week!

Stay tuned for our next progress update!