Linux 4.17 released, Bootlin contributions

Drawing from Mylène Josserand,
based on a picture from Samuel Blanc under CC-BY-SA

4.17 was released last Sunday, so it’s time for our highlight article to see daylight.

As always, LWN.net published interesting coverage of this release cycle’s merge window, highlighting the most important changes: the first half of the 4.17 merge window and the rest of the 4.17 merge window. For 4.17 alone, Bootlin contributed a total of 331 patches, which puts us in 10th place in the ranking of most contributing companies according to both LWN and KPS.

Also according to LWN statistics, Bootlin engineer Alexandre Belloni is the 6th most active developer in terms of changesets for this release with a total of 124 commits, almost one percent of the total number.

The main highlights of our contributions are:

Bootlin engineers are not only contributors, but also maintainers of various subsystems in the Linux kernel, which means they are involved in the process of reviewing, discussing and merging patches contributed to those subsystems:

  • Maxime Ripard, as the Allwinner platform co-maintainer, merged 97 patches from other contributors
  • Boris Brezillon, as the MTD/NAND maintainer, merged 87 patches from other contributors
  • Alexandre Belloni, as the RTC maintainer and Atmel platform co-maintainer, merged 46 patches from other contributors
  • Grégory Clement, as the Marvell EBU co-maintainer, merged 14 patches from other contributors

Here is the commit by commit detail of our contributions to 4.17:

spi-mem: bringing some consistency to the SPI memory ecosystem

In this article, we would like to introduce our work on the spi-mem Linux kernel framework, which will allow SPI controller drivers to be reused for both SPI NOR devices and regular SPI devices, as well as SPI NAND devices in the future.

From SPI to Dual, Quad and Octo SPI

In the good old days, SPI was a simple protocol, with only 3 signals shared by all devices present on the bus:

  • MISO: Master In Slave Out
  • MOSI: Master Out Slave In
  • SCLK: Serial Clock

and 1 extra signal per device to select the device we want to communicate with:

  • SS: Slave Select (also called CS for Chip Select sometimes)

But then SPI memories appeared. It started with small and relatively slow ones, like dataflash, EEPROMs and SRAMs, and progressively moved to rather large SPI NORs and SPI NANDs. As usual when it comes to dealing with memories, we want to get the best performance out of them. SPI bus limitations quickly became the bottleneck, so vendors decided to add more I/O lines and make the MISO/MOSI lines bi-directional. Nowadays we see SPI controllers that support up to 8 I/O lines. For those who are familiar with the terms, that’s what we call DualSPI, QuadSPI and OctoSPI.

In order to use all I/O lines when doing a master to slave or slave to master transfer, there must be some kind of contract between the slave and the master so that both of them know when they can receive or transmit data on the I/O lines and how many of them they should use. This is done through a predefined set of operations exposed by the slave that the master has to follow to enter a specific transmit or receive state. A SPI memory operation is usually composed of:

  • 1 opcode byte encoding the operation to be executed (but be prepared for 2-byte opcodes, they are coming)
  • 0 to N address bytes whose meaning is dependent on the opcode (can be an absolute memory address or something else)
  • 0 to N dummy bytes to leave the slave enough time to enter the specific state requested through the opcode. Again, the number of dummy bytes depends on the opcode
  • 0 to N IN or OUT data bytes, the direction depends on the opcode
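To make the four parts of an operation more concrete, here is a minimal Python sketch of such an operation and of the bytes the master sends before the data phase. The SpiMemOp class and its field names are hypothetical and only model the description above, they are not the kernel structure. The opcode value 0x0B (the "fast read" command found on many SPI NOR chips) and the 0xFF filler used for dummy bytes are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class SpiMemOp:
    """Hypothetical model of one SPI memory operation (not the kernel struct)."""
    opcode: int              # 1-byte opcode
    addr: int = 0            # address value, its meaning depends on the opcode
    addr_nbytes: int = 0     # 0 to N address bytes
    dummy_nbytes: int = 0    # 0 to N dummy bytes
    data_nbytes: int = 0     # 0 to N data bytes
    data_out: bool = False   # True: master to slave, False: slave to master

    def header(self) -> bytes:
        """Bytes sent before the data phase: opcode, address, then dummy bytes."""
        addr = self.addr.to_bytes(self.addr_nbytes, "big")
        # The content of dummy bytes does not matter, 0xFF is an arbitrary filler.
        dummy = b"\xff" * self.dummy_nbytes
        return bytes([self.opcode]) + addr + dummy


# Illustrative values: opcode 0x0B, a 3-byte address and one dummy byte,
# followed by 256 bytes read from the device.
read_op = SpiMemOp(opcode=0x0B, addr=0x001234, addr_nbytes=3,
                   dummy_nbytes=1, data_nbytes=256)
print(read_op.header().hex())
```

An operation without address, dummy or data bytes simply collapses to its opcode byte.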

Note that, while this protocol tends to be used by memory devices, there’s nothing restricting it to memories, and I wouldn’t be surprised if some FPGA were using the same kind of transactions for things that are not memory oriented at all.

The Linux SPI ecosystem

Linux has been supporting Dual and Quad SPI mode for quite some time already (v3.12), and SPI device drivers could specify the number of I/O lanes for each SPI transfer. This way, a SPI memory operation could be broken up into several SPI transfers, each of them using a pre-defined number of I/O lanes.
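As a rough illustration of this splitting, the Python sketch below breaks one memory operation into one transfer per phase, each with its own lane count. The Transfer tuple, the split_op() helper and the (1, 1, 4) default (single-lane opcode and address, quad data phase) are assumptions made for this example, not kernel APIs.

```python
from typing import List, NamedTuple


class Transfer(NamedTuple):
    """Hypothetical stand-in for one SPI transfer."""
    nbytes: int
    lanes: int       # number of I/O lanes used: 1, 2 or 4
    direction: str   # "tx" (master to slave) or "rx" (slave to master)


def split_op(addr_nbytes: int, dummy_nbytes: int, data_nbytes: int,
             data_is_read: bool, lanes=(1, 1, 4)) -> List[Transfer]:
    """Break one memory operation into per-phase transfers.

    `lanes` gives the lane count for the opcode, for the address and dummy
    bytes, and for the data phase.
    """
    op_lanes, addr_lanes, data_lanes = lanes
    transfers = [Transfer(1, op_lanes, "tx")]          # the opcode byte
    if addr_nbytes:
        transfers.append(Transfer(addr_nbytes, addr_lanes, "tx"))
    if dummy_nbytes:
        transfers.append(Transfer(dummy_nbytes, addr_lanes, "tx"))
    if data_nbytes:
        direction = "rx" if data_is_read else "tx"
        transfers.append(Transfer(data_nbytes, data_lanes, direction))
    return transfers


for transfer in split_op(addr_nbytes=3, dummy_nbytes=1, data_nbytes=256,
                         data_is_read=True):
    print(transfer)
```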

That worked just fine until some IP vendors decided to make their SPI controllers smarter and embed some kind of high-level interface to execute SPI memory operations in a single step instead of splitting it in several transfers (actually, most SPI controllers are even smarter than that and allow you to directly map a SPI memory in the CPU address space, but let’s keep that for a future post). At this point, we needed to give more control to the SPI controller so that it could decide what to do exactly, without having to reconstruct a SPI memory operation out of a group of SPI transfers.

At that time, the decision was to dedicate those controllers to one single task: controlling SPI NORs (which were the only users of quad and dual mode back then), and the SPI NOR framework was created for this reason.

Due to this decision, we now have in Linux a SPI NOR framework which connects SPI NOR controller drivers to the SPI NOR logic on one side (spi-nor subsystem), and we have regular SPI controller drivers which can do basic SPI transfers (spi subsystem). However, from a hardware point of view, advanced SPI controllers that provide special features for SPI NOR can often also do basic transfers, and therefore control regular SPI devices. Unfortunately, with the current split between the spi-nor and spi subsystems, if a SPI controller is supported by a driver in the spi-nor subsystem, it cannot be used to communicate with regular SPI devices normally managed by the spi subsystem.

As a partial solution to this problem, the ->spi_flash_read() operation was added to struct spi_controller. This allowed regular SPI controller drivers in the spi subsystem to provide an optimized way to read from a SPI NOR memory, which is used by the generic SPI NOR driver m25p80. However, this solution is partial, as it only optimizes reading and is limited to SPI NORs.

Current SPI NOR stack

In the current stack, we have:

  • The SPI NOR framework, which knows the protocol to talk to SPI NOR memories. This framework relies on an interface listed in struct spi_nor, which is implemented by:
    • Specialized SPI NOR controllers, which support advanced SPI controllers dedicated to SPI NORs.
    • The m25p80 driver, which provides the same interface, but on top of dumb/regular SPI controller drivers, with the possible optimization of ->spi_flash_read()

What led us to propose the SPI memory interface?

We’ve seen before that the SPI NOR case was already properly supported thanks to the SPI NOR framework. But NORs are not the only memory devices you’ll find on a SPI bus, SPI NANDs are becoming more and more popular.
Peter Pan proposed a framework to handle SPI NAND devices which followed the SPI NOR model: SPI controllers had to implement the SPI NAND controller interface in order to control SPI NANDs. But when we got more deeply involved in this development, we quickly realized how messy it would be to follow this path, because that meant forcing SPI controllers to implement both the SPI NOR and SPI NAND interfaces if they want to be able to control both kinds of devices. And what will happen when the trend shifts to SPI NVRAM or any other kind of storage manufacturers decide to put on a SPI bus? Adding yet another interface that SPI controllers would again have to implement didn’t sound like a good idea.

So instead, we decided to take the problem from the other end, and tried to figure out what SPI NANDs and SPI NORs have in common.
The instruction set is different, the behavior and constraints are different (mainly due to NOR vs NAND differences), but they both follow the same SPI memory operation semantics when interacting with the device, the same ones advanced controllers are trying to optimize.

The SPI memory layer is just a way for SPI controller drivers to be passed high-level SPI memory operations instead of letting them handle SPI transfers and try to optimize them themselves. It also simplifies SPI memory drivers’ life, since all they have to care about is sending SPI memory operations based on the SPI memory specs. No need for a complex, constantly evolving, memory dependent interface.

SPI memory stack

With this new stack, both SPI NOR and SPI NAND can be supported on top of the same SPI controller drivers. The m25p80 driver is modified to use the spi-mem interface instead of the limited ->spi_flash_read() interface. For now, we still have dedicated SPI NOR controller drivers, but the goal is to get rid of them, and implement them as regular SPI controllers in drivers/spi. Help and contributions in this direction are very welcome!

What does the SPI memory API look like?

The SPI memory API is exposed in include/linux/spi/spi-mem.h.

SPI device drivers that want to use the SPI memory API should declare themselves as spi_mem_drivers and implement the ->probe() and ->remove() functions.
They will be passed a spi_mem object which is just a thin wrapper around a spi_device object. The reason we have a different object is because we want to be able to extend the spi_mem object and attach more information to it if required (like the type of memory, the memory organization and other kind of information advanced SPI controllers might want to know).
When a driver wants to execute a SPI memory operation, it will fill a spi_mem_op struct and call spi_mem_exec_op(). One can also test if a SPI controller is supporting a specific memory operation with spi_mem_supports_op() or try to split data transfers so that they don’t exceed the max transfer size supported by the controller with spi_mem_adjust_op_size().
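The typical pattern on the driver side is a loop that adjusts the size of each operation, executes it, and moves on to the next chunk. Here is a small Python model of that loop. It only mimics the idea: adjust_op_size() is reduced to a clamp against an assumed maximum size, and exec_read() is a hypothetical callback standing in for the execution of one read operation.

```python
def adjust_op_size(requested: int, max_size: int) -> int:
    """Stand-in for the size adjustment: clamp to the controller's limit."""
    return min(requested, max_size)


def read_in_chunks(exec_read, start: int, length: int, max_size: int) -> bytes:
    """Read `length` bytes from `start`, one memory operation per chunk.

    `exec_read(addr, nbytes)` is a hypothetical callback that executes one
    read operation and returns the bytes received. `max_size` must be > 0.
    """
    out = bytearray()
    addr = start
    remaining = length
    while remaining > 0:
        nbytes = adjust_op_size(remaining, max_size)
        out += exec_read(addr, nbytes)
        addr += nbytes
        remaining -= nbytes
    return bytes(out)


# Fake memory and fake controller, used only to exercise the loop.
memory = bytes(range(256)) * 4
calls = []


def fake_exec_read(addr, nbytes):
    calls.append((addr, nbytes))
    return memory[addr:addr + nbytes]


data = read_in_chunks(fake_exec_read, start=10, length=100, max_size=64)
print(calls)
```

With a 64-byte limit, the 100-byte read is carried out as two operations, and the caller never needs to know about the controller's constraint.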

Now, let’s have a look at the controller side of things. A SPI controller that wants to optimize SPI memory operations can implement the spi_mem_ops interface, which contains 3 methods that directly match the user API:

  • ->exec_op(): execute the memory operation or return -ENOTSUPP if it’s not supported
  • ->supports_op(): just check if the memory operation is supported
  • ->adjust_op_size(): adjust the size of the data transfer of a memory operation to cope with alignment and max FIFO size constraints

Note that when spi_mem_ops is not implemented, the core will add generic support for this feature by creating SPI messages formed of several SPI transfers, just as the generic SPI NOR controller driver (named m25p80) was doing before.
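The dispatch logic can be modeled in a few lines of Python. This is a simplified sketch and not the kernel code: the controller objects, the NotSupported exception (playing the role of -ENOTSUPP) and the generic_exec callback are all hypothetical, and falling back to the generic path when the optimized one reports an unsupported operation is an assumption of this model.

```python
class NotSupported(Exception):
    """Plays the role of the -ENOTSUPP error code in this model."""


def exec_op(controller, op, generic_exec):
    """Run `op` through the controller's optimized path when available.

    `controller` may define `supports_op(op)` and `exec_op(op)`;
    `generic_exec(op)` is the fallback that would build regular transfers.
    """
    optimized = getattr(controller, "exec_op", None)
    supports = getattr(controller, "supports_op", None)
    if optimized is not None and (supports is None or supports(op)):
        try:
            return optimized(op)
        except NotSupported:
            pass  # assumed behavior: fall through to the generic path
    return generic_exec(op)


class DumbController:
    """Controller without any memory-operation interface."""


class ReadAccelerator:
    """Controller that only optimizes reads."""

    def supports_op(self, op):
        return op == "read"

    def exec_op(self, op):
        if op != "read":
            raise NotSupported(op)
        return "optimized read"


def generic(op):
    return "generic " + op


print(exec_op(DumbController(), "read", generic))
print(exec_op(ReadAccelerator(), "read", generic))
print(exec_op(ReadAccelerator(), "write", generic))
```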

As you can see, the API is pretty straightforward, so hopefully, more SPI memory drivers will be converted to use it instead of manually creating SPI messages containing several SPI transfers.

Current status

A number of things have already been contributed and merged, scheduled to be part of the 4.18 Linux kernel release:

What’s next?

Advanced SPI controllers can do more than just optimizing the execution of SPI memory operations, they can often hide all the complexity of memory accesses behind a directly-mapped IOMEM region, which, every time it is accessed, triggers a SPI memory operation on the bus and retrieves or sends the data for you, thus behaving like a memory that would be placed on a parallel memory bus. As you can imagine, this allows for even higher throughput and less CPU time consumed for SPI mem op management, but it’s also something that is hard to expose in a generic way. We have posted on the linux-mtd mailing list a proposal to support such a direct mapping capability.

As detailed earlier, another challenging topic is the conversion of all SPI NOR controller drivers to the SPI mem model so that all QSPI controllers are really exposed as SPI controllers and not SPI NOR controllers. This is likely to take a bit of time since we currently have 10 drivers in drivers/mtd/spi-nor and we’re only aware of 2 of them being converted to the SPI mem approach (fsl-quadspi and atmel-quadspi).

Allwinner VPU support in mainline Linux status update (week 22)

Integration with video players

The work conducted this week on the video output side was focused on writing a shader for untiling the MB32 NV12-based format used by the VPU to output frames. This brought various challenges, some of which are presented below.
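To give an idea of what untiling involves, here is a small Python sketch of the address computation for the luminance plane. It assumes a layout of 32x32-byte tiles stored left to right then top to bottom, with the pixels inside each tile stored row by row and the plane width padded to a whole number of tiles. That layout is an assumption made for this example, and the formula actually used in the shader may differ in its details.

```python
TILE = 32  # assumed tile edge, in pixels (and bytes) for the luminance plane


def tiled_offset(x: int, y: int, width: int) -> int:
    """Byte offset of luminance pixel (x, y) in the assumed tiled layout."""
    tiles_per_row = (width + TILE - 1) // TILE   # width padded to whole tiles
    tile_index = (y // TILE) * tiles_per_row + (x // TILE)
    inside = (y % TILE) * TILE + (x % TILE)      # row-major inside the tile
    return tile_index * TILE * TILE + inside


def untile_luma(tiled: bytes, width: int, height: int) -> bytes:
    """Return the luminance plane in plain row-major order."""
    return bytes(tiled[tiled_offset(x, y, width)]
                 for y in range(height) for x in range(width))


# Round trip on a synthetic 64x32 plane: tile it, then untile it again.
WIDTH, HEIGHT = 64, 32
linear = bytes((x + 3 * y) % 256 for y in range(HEIGHT) for x in range(WIDTH))
tiled = bytearray(WIDTH * HEIGHT)
for y in range(HEIGHT):
    for x in range(WIDTH):
        tiled[tiled_offset(x, y, WIDTH)] = linear[y * WIDTH + x]
print(untile_luma(bytes(tiled), WIDTH, HEIGHT) == linear)
```

A fragment shader performs the same computation per rendered pixel, with the extra difficulty that texture coordinates are normalized.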

Since GLES and EGL are generic APIs that are not tied to a particular driver implementation, it made sense to start writing the shader on an x86 Intel-based device with GPU support in Mesa 3D (and speed up the development time). The first step in the process was to display the raw pixel values from the luminance plane through the shader. Actually, two shaders are required: one for the vertex processor and one for the fragment (pixel) processor of the GPU. The former is in charge of applying geometrical operations to the vertices (the points that define the 3D mesh) while the latter defines the color for each rendered pixel from that mesh. In our case, the mesh is simply a rectangle that matches our window size. The tiled NV12 luminance plane is uploaded to the GPU as a 1-byte-per-pixel texture, which allows addressing each component separately. However, the coordinates for the texture are normalized by the GPU, so coordinates to retrieve texels (texture pixels) from the texture sampler are specified as decimal values. This makes it tricky to ensure that the right value is retrieved, especially given that the GPU might apply various filtering techniques (that are a really good thing to have when dealing with actual textures for 3D models, though).
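The usual way to address one exact texel with normalized coordinates is to aim at its center. The Python sketch below shows the arithmetic for one axis, assuming nearest filtering. The helper names are made up for this example, and the 854-texel width is simply borrowed from the demo video used elsewhere in these updates.

```python
def texel_center(index: int, size: int) -> float:
    """Normalized coordinate of the center of texel `index` along one axis."""
    return (index + 0.5) / size


def nearest_texel(coord: float, size: int) -> int:
    """Texel index selected by nearest filtering for a normalized coordinate."""
    return min(int(coord * size), size - 1)


WIDTH = 854
print(texel_center(0, WIDTH), texel_center(WIDTH - 1, WIDTH))
print(all(nearest_texel(texel_center(i, WIDTH), WIDTH) == i
          for i in range(WIDTH)))
```

Sampling at i / size instead lands exactly on the boundary between two texels, which is where rounding and filtering can pick the wrong neighbor.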

Setting up the vertex and fragment shaders to linearly display the pixels from the tiled format results in a mangled display (as expected):

Thanks to the documentation made available by the linux-sunxi community, it was possible to rapidly draft a formula for getting the right texel location, which produced mixed results:

With some extra work (and quirks for ensuring that the right texel is picked on tile edges), the luminance component was finally displayed correctly:

Next up was the chrominance component, which required importing a second dedicated texture. First tries led to funky-looking coloring of the frame:

Until the shader was corrected to end up with a good-looking picture:

Real trouble began when porting this work to the Mali, which does not behave the same when it comes to texture uploads (and requires line-by-line upload for 1-byte-per-pixel formats). Since we are aiming at DMAbuf import instead of (slow) texture upload, no time was spent coping with the difference. The main issue with DMAbuf import is that the usual one-byte-per-pixel format (described by the DRM_FORMAT_R8 fourcc code) is simply not supported by the GPU, leaving only RGB and YUV as options, which do not directly fit the bill. We are still investigating ways to make our texture available to the GPU’s texture sampler without extra copies (or with copies that can fit our bill in terms of performance).

H264 support

We also worked on the H264 decoding in the kernel, and some progress was made. The libva-dump and cedrus-frame-test ports are now done, and we’ve been able to run cedrus-frame-test on 32 frames without any hiccups… Unfortunately, while the VPU reports the frames as properly decoded, the contents of the output buffer are blank, which is obviously not great. Since then, we have simplified the test to have a single frame decoded, and compared the register write sequence between libvdpau-sunxi and our kernel code. This has allowed us to find some bugs in the driver, but the current state is still that we can’t decode a frame. We shouldn’t be very far now though, so stay tuned for our next status update!

Allwinner VPU support in mainline Linux status update (week 21)

This week’s effort was focused on getting VLC to accelerate its video output using the Mali proprietary blobs. More specifically, two distinct interfaces are involved: EGL, which allows interacting with the platform’s windowing system (in our case, X11) and GLES, which is in charge of the rendering operations. While VLC already had support for both of these interfaces, it initially failed to create and use its GL-backed video output module with the Mali GPU blobs. Although everything indicated that it should have been working, the GLES calls were failing while EGL was set up and behaving correctly. The issue at hand was directly related to VLC using Qt for its interface. Because the Qt build used on the development boards was targeting GL support instead of GLES, it needed to import GL symbols that have the same name as GLES equivalents. Since Qt was loaded after the video output module, it would override the matching GLES symbols with GL symbols (from Mesa, not the blob).

With the help of Thomas Guillem, a few patches were crafted to fix the issue and sent out to the VLC developers. Some more revisions of these patches will be needed for the fix to be integrated into the VLC tree, but it should land sooner or later.

With VLC fixed, it was time to start looking at accelerating our pipeline with the GPU. VLC already includes GPU shaders for NV12 to RGB conversion as well as scaling and rotation, but does not have support for our tiled format. This is why we need a shader on our VAAPI backend side to accelerate the untiling operation. While the shader is currently work in progress, further work is also required to properly export the resulting untiled buffer as a DMABUF handle for VLC. Since the GPU blob does not support dmabuf export, we will need to implement a standalone GBM provider compatible with our DRM driver, that will handle allocating surfaces (instead of the armsoc DDX that is currently used for accelerated graphics on X) and exporting DMABUF handles to them when needed. Generally speaking, this will also allow standardizing and sanitizing the integration of the Mali blobs with the rest of the system.

Stay tuned for our next update!

Allwinner VPU support in mainline Linux status update (week 20)

With DMABUF support tested, it has become possible for Paul to start the work on integrating a GPU-based video output pipeline with Sunxi-Cedrus. Using the GPU should greatly improve performance when it comes to displaying the video frames. As of now, we have been using software untiling, software YUV-to-RGB colorspace conversion and software scaling. We are looking to replace these steps with GPU-based untiling, colorspace conversion and scaling. Shaders are used to implement these operations: they are small programs that are compiled on-the-fly for the GPU’s very specialized instruction set. Most players embed shaders (in their source form, using the GL shading language) for usual operations like colorspace conversion and scaling. However, these players are not ready to handle untiling as of now (or even be notified that the format returned by our VAAPI backend is tiled).
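For reference, the colorspace conversion step boils down to a small amount of arithmetic per pixel. The Python sketch below uses the common BT.601 limited-range coefficients, which is an assumption made for this example since the matrix used by our software path is not detailed here. A conversion shader performs the same computation on the GPU for every pixel.

```python
def clamp8(value: float) -> int:
    """Round and clamp a value to the 0..255 range of an 8-bit component."""
    return max(0, min(255, int(round(value))))


def ycbcr_to_rgb(y: int, cb: int, cr: int):
    """Convert one 8-bit YCbCr pixel to RGB (assumed BT.601, limited range)."""
    c = 1.164 * (y - 16)   # luminance, rescaled from 16..235 to full range
    d = cb - 128           # blue-difference chrominance, centered on zero
    e = cr - 128           # red-difference chrominance, centered on zero
    return (clamp8(c + 1.596 * e),
            clamp8(c - 0.392 * d - 0.813 * e),
            clamp8(c + 2.017 * d))


print(ycbcr_to_rgb(16, 128, 128))    # darkest legal value, no chrominance
print(ycbcr_to_rgb(235, 128, 128))   # brightest legal value, no chrominance
```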

The first step in our plan is to get VLC to cooperate with the X11 flavor of the Mali proprietary blobs that Allwinner has released in the past so that we can use GPU support for colorspace and scaling. This is still a blocking point as of now. Then, we will look into crafting a shader for untiling the VPU output frame and integrating it with our libva-cedrus VAAPI backend.

As a sidenote, the free software Lima driver is being prepared for a first RFC series, bringing the first bits of mainline Linux kernel support for Mali GPUs of the Utgard generation. So even though work on the GPU only concerns the proprietary blob for now, the work will eventually become useful to the free software driver as well.

We have also tested Sunxi-Cedrus on the H3 and started looking at integrating the display part (which differs from earlier SoCs by using a revised display engine: DE2). However, since this is a stretch goal of the fundraiser and we have many other tasks left to tackle among our main goals, this is by far not our priority at the moment.

We finally worked more on the libva-dump and cedrus-frame-test for H264, which will hopefully allow us to test our first H264 decoding next week!

Stay tuned for our next progress update!

Testing pixel formats on the RaspberryPi

As part of our ongoing work with the RaspberryPi Foundation, we’ve been working on a number of display-related topics recently. Besides the work done by my colleague Boris Brezillon on improving the kernel side support for a number of features (such as the GPU performance counters support, memory management improvements, etc.), I’ve been working on improving the CI infrastructure for display driver testing.

Indeed, the current workflow is not automated at all and doesn’t allow us to detect breakages in the display driver. We thus needed to improve that. To do so, we’ve relied on a board developed by Google as part of the ongoing CI effort on ChromeOS that is called the Chamelium. The Chamelium is based on an ARM board powered by an Altera SoC+FPGA platform that Google extended with an extension board with video connectivity: VGA, HDMI and DisplayPort. They then developed a firmware for the FPGA to allow the board to emulate a screen.

A RaspberryPi and the Chamelium

Using this, you can simulate improper EDIDs, simulate hotplug events, HDCP screens, etc. and see how the device under test reacts to that. One of the interesting things you can do with it is to dump a CRC of the frame received on the display link, or a raw capture of a given number of frames. The usefulness of such a feature is obvious for a CI effort: you connect the device to test over HDMI, VGA or DP to the Chamelium, and then you can set up a test pattern on the device you want to test, capture the frame received on the other end, and compare the two frames. In an ideal scenario, the two are identical, and if your driver has a regression, you’ll notice as the two frames would no longer be identical.
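The comparison itself is conceptually very simple, as the Python sketch below shows. It uses zlib's CRC-32 purely as a stand-in, since the checksum actually computed by the Chamelium is not described here, and the frame contents are synthetic.

```python
import zlib


def frame_crc(frame: bytes) -> int:
    """Checksum of a raw frame (zlib CRC-32 used as a stand-in)."""
    return zlib.crc32(frame)


def frames_match(sent: bytes, captured: bytes) -> bool:
    """Compare the test pattern with the captured frame through their CRCs."""
    return frame_crc(sent) == frame_crc(captured)


# Synthetic example: a tiny "frame" and a copy with a single corrupted byte.
reference = bytes([0x10, 0x20, 0x30]) * 64
corrupted = bytearray(reference)
corrupted[5] ^= 0xFF

print(frames_match(reference, reference))
print(frames_match(reference, bytes(corrupted)))
```

Comparing checksums rather than full captures keeps the amount of data exchanged with the board small, at the cost of not telling you where the two frames differ.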

The intel-gpu-tools suite (also called i-g-t), even though historically named with a not-so-generic name, is a standard test suite for the DRM subsystem in Linux. Last summer, support for the Chamelium was introduced for exactly this setup, where intel-gpu-tools would set up a test pattern, ask the Chamelium for a CRC of the frames it received and do the comparison.

This was working fine, and after a quick test on the RaspberryPi, it turned out to work on non-Intel hardware out of the box. However, the test was actually quite simple: while it was testing all the resolutions exposed, it was only testing a single pixel format, and we wanted to do more in order to catch regressions in less common formats, and ideally the RaspberryPi proprietary formats as well.

When it comes to pixel formats, there are two main families involved:

  • the RGB formats, sometimes prefixed with an A (for alpha, the opacity) or X (for padding). These formats have a separate value for each primary color, encoded on the number of bits given after the RGB prefix. XRGB1555 for example is a 16-bit format (1 + 5 + 5 + 5), with 1 bit of padding, and 5 bits each for red, green and blue, in that order.
  • the YCbCr formats (also abusively called YUV), based on the property of the human eye that it perceives changes in luminosity better than changes in color. They thus store the luminance (Y) and chrominance (Cb and Cr) in separate fields, with possibly a different number of bits. While RGB is usually preferred by computer graphics, video is very fond of the YCbCr formats since you can compress the Cb and Cr fields, resulting in a denser pixel format, without degrading the image quality too much.
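As a quick illustration of how such a format name maps to bits, here is a Python sketch that packs and unpacks an XRGB1555 pixel, with the padding bit in the most significant position followed by red, green and blue. The helper names are made up for this example.

```python
def pack_xrgb1555(r: int, g: int, b: int) -> int:
    """Pack three 5-bit components into a 16-bit XRGB1555 value."""
    for component in (r, g, b):
        if not 0 <= component <= 31:
            raise ValueError("components must fit in 5 bits")
    # Bit 15 is padding and left at zero, then 5 bits each for R, G and B.
    return (r << 10) | (g << 5) | b


def unpack_xrgb1555(pixel: int):
    """Extract the (r, g, b) components, ignoring the padding bit."""
    return ((pixel >> 10) & 0x1F, (pixel >> 5) & 0x1F, pixel & 0x1F)


print(hex(pack_xrgb1555(31, 0, 0)))   # pure red
print(unpack_xrgb1555(pack_xrgb1555(1, 2, 3)))
```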

The format initially supported by intel-gpu-tools was the XRGB8888 (8 bits of padding, 8 bits for red, green and blue, in that order). The RaspberryPi supports the RGB formats XRGB8888, ARGB8888, ABGR8888, XBGR8888, RGB565, BGR565, ARGB1555, XRGB1555, RGB888, BGR888.

As we said, i-g-t was by contrast using only the XRGB8888 format for the test pattern. This unfortunately was based on a few assumptions, the first one being that the test pattern would be generated with Cairo. However, Cairo supports a very limited range of formats. Of the formats supported on the RaspberryPi, Cairo only supported ARGB8888, XRGB8888 and RGB565. This was obviously not enough, but we didn’t really want to extend Cairo since our goal was to be able to run the test suite on as many devices as possible. One option would have been to update the version of Cairo in use to support a larger number of formats, but that was not considered to be the most appropriate solution. We thus evaluated our options, and it turned out pixman supports most of the RGB formats, and it was already a dependency of intel-gpu-tools.

So in a patch series that we submitted recently to the intel-gpu-tools project, we:

  • create an API to allow the core i-g-t functions that handle the buffers to let us simply map the underlying DRM buffer in order to access it, without having to use Cairo and its limited pixel format support
  • rework the code a bit to be able to use Cairo when relevant, and then fallback to Pixman if the format isn’t supported. Pixman list of formats supported isn’t ideal either, especially in the YCbCr family, but we focused on RGB first. In order to allow for additional fallbacks, we hid it behind an API so that it’s transparent to the users
  • create a custom pattern solely for the Chamelium test, which was needed to deal with the difference of sampling size for each color component
  • glue those functions into the Chamelium test suite and add one sub-test for each format, so that we can detect both regressions in handling the format itself, but also regressions in the list of formats exposed
  • add a VC4 test suite, extended with Chamelium based tests
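The fallback logic described in this list can be pictured with the following Python sketch. It is only a model of the idea and not the i-g-t code: the Cairo set comes from the formats mentioned above, while the Pixman set is an illustrative assumption, since the exact list of formats it handles is not given here.

```python
# Formats Cairo can draw to, as listed in the text.
CAIRO_FORMATS = {"ARGB8888", "XRGB8888", "RGB565"}

# Assumed set for illustration only: Pixman covers most of the RGB formats.
PIXMAN_FORMATS = CAIRO_FORMATS | {"ABGR8888", "XBGR8888", "ARGB1555",
                                  "XRGB1555", "RGB888", "BGR888"}

# Backends are tried in order, so Cairo is used whenever it is relevant.
BACKENDS = [("cairo", CAIRO_FORMATS), ("pixman", PIXMAN_FORMATS)]


def pick_backend(pixel_format: str):
    """Return the first backend able to handle the format, or None."""
    for name, formats in BACKENDS:
        if pixel_format in formats:
            return name
    return None


for fmt in ("XRGB8888", "XRGB1555", "NV12"):
    print(fmt, pick_backend(fmt))
```

Hiding the choice behind one function is what makes additional fallbacks possible later without touching the tests themselves.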

All this work has then been submitted to the intel-gpu-tools mailing list for review, and while the development was done on a RaspberryPi, it should benefit the whole community.

Allwinner VPU support in mainline Linux status update (week 19)

This week has seen considerably less advancement than the ones before it due to bank holidays in France. Nevertheless, we managed to prepare and send V3 of the Sunxi-Cedrus Linux kernel driver on Monday. While this new version contains several incremental improvements, a number of tasks (described in the series’ cover letter) have yet to be completed before the driver can be merged in mainline Linux.

Maxime continued to work on the H264 support. Most of the kernel part has been done, and he then moved on to converting libva-dump so that it can also dump H264 buffers. Most of that part has been done as well, so the next item will be to convert cedrus-frame-test to be able to test H264 frames, and see where that takes us.

Paul kept working on DMABUF support, which is now refined and ready both on the kernel side and on the userspace side with cedrus-frame-test. There is now a single DMABUF handle used per buffer plane (instead of per-plane) which allows having all components of the frame displayed correctly. Because there are now as many buffers for display as there are for decoding, it is necessary to register framebuffers associated with each imported buffer and cycle the framebuffers in multi-buffering page-flipping. To tackle this, we have started implementing atomic modesetting in cedrus-frame-test, which allows setting the framebuffer to use per-plane.

Finally, some attention was given to the integration of our video decoding pipeline with the Mali GPU, especially to target Kodi support.

Stay tuned for our next update!

Allwinner VPU support in mainline Linux status update (week 18)

This week, Paul continued working on DMABUF support and succeeded at exporting a buffer allocated by the Sunxi-Cedrus driver on the v4l2 side and importing it on the drm side via DMABUF. Although DMABUF support is still a work in progress in cedrus-frame-test and beyond the current level of support we have with GStreamer, the kernel side of things should be ready.

Another excerpt of the Big Buck Bunny video, in 1080p

Test coverage was also improved this week, with significantly more MPEG2 videos tested (including a standard DVD) in different resolutions up to 1080p. Some feedback from the community was also received and a first issue report will need to be investigated. Regarding platform support, initial testing of the A13 was undertaken. Although the VPU driver works apparently just as well on the A13 as already-tested platforms, the DRM driver adaptation (on the display side) for untiling VPU output buffers appears to be broken and will need to be further investigated.

In other news, a new version of the media request API has been submitted without the RFC tag (after 12 previous iterations). While we’ve been testing this new version along the course of its development, we are also taking the occasion to rebase our Sunxi-Cedrus VPU driver on top of this new version and take the received feedback into account.

Maxime continued the work on H264, and almost finished a first draft for the kernel driver side. Most of the code should be there now, the next steps are going to be making sure that no parts are missing and starting to test with cedrus-frame-test!

Allwinner VPU support in mainline Linux status update (week 17)

This week started off with numerous reviews received on the patchset introducing the Sunxi-Cedrus VPU driver. Lots of constructive comments, questions and improvements were discussed, which will help improve the driver for the next iteration of the series. Changes to other drivers will also have to be implemented, in particular to the SRAM controller found on Allwinner platforms, which needs to handle access to the SRAM by the VPU.

Maxime worked on the refactoring needed to ease H264 support, rebasing on the latest version submitted upstream and making sure that everything still works fine. He eventually pushed the changes to our 4.17 branch on GitHub, and will now focus on landing H264 support itself.

The work carried out by Paul this week was focused on the libva-cedrus VAAPI backend, which supports the Sunxi-Cedrus kernel driver on the userspace side. The backend is used by VLC (when it is configured to use VAAPI for video decoding) to play MPEG2 videos such as the ones available from the Linaro sample media. libva-cedrus was significantly improved over the week, with around 80 commits featuring a major cleanup of the code that includes, along with other changes:

  • coding style harmonization
  • proper error checking and reporting instead of assertions
  • the removal of the unsupported MPEG4 code
  • the introduction of dedicated v4l2 helpers based on those developed for cedrus-frame-test
  • the reorganization of v4l2 source and destination buffers management, where both are now tied to a specific surface and kept in sync
  • the update of the definitions to match the latest patchset
  • the implementation of the final rendering at picture end time

This work significantly improved the compatibility with VLC, which was previously dropping several frames. With these changes, VLC is now properly showing the decoded videos playing close to 25 fps when there is no software scaling involved. The performance is not as good with VLC as it is with cedrus-frame-test, which uses a dedicated DRM plane directly while VLC and libva-cedrus use the software untiling code and buffer copies to display each frame.

VLC playing the Big Buck Bunny video with libva-cedrus

Some attention was also given to GStreamer over the week. Although compatibility with our VAAPI backend and display pipeline is not there yet, the VAAPI backend rewrite allowed moving forward and GStreamer now displays the first decoded frame. While the operations for decoding the frames are correctly scheduled, they are only requested to be displayed sporadically, with no effect on the screen. This issue will need to be further investigated before a basic decoding pipeline can be used with GStreamer, with a video output either to a regular X window or to a DRM plane directly. MPV was also tested out this week, without much success in coordinating the rendering and display parts involved with the VAAPI pipeline. Thus, MPV support will also require more investigation before it can be properly supported.

While we initially decided to focus on GStreamer for implementing DMABUF buffer sharing between the VPU and the display engine, cedrus-frame-test (the standalone userspace implementation supporting the Sunxi-Cedrus VPU driver) allows us to directly work on implementing DMABUF support. So even though GStreamer does not work with libva-cedrus at this point, DMABUF support was started in a dedicated branch of the cedrus-frame-test repository. DMABUF is currently failing on the kernel side, when validating the page number of the requested DMA buffer. In this area as well, further investigation and work will be needed.

In the meantime, the Sunxi-Cedrus page on the linux-sunxi wiki was updated with the latest status of Sunxi-Cedrus support, instructions to build and install libva-cedrus and cedrus-frame-test as well as configure VLC for decoding MPEG2 videos. Feedback and test reports are welcome, especially regarding videos that are not decoded properly and show visual artifacts. The community around Sunxi-Cedrus hangs out on the #linux-sunxi and #cedrus channels of the freenode IRC network, so it is the best place to ask questions and discuss all things related to Sunxi-Cedrus!

Stay tuned for next week’s progress report!

Allwinner VPU support in mainline Linux status update (week 16)

As announced last week, the second revision of the Sunxi-Cedrus driver patchset was submitted for review earlier this week. While this new revision is based on the latest version of the request API, it also includes several fixes for corner-cases of this new API, especially to use it in the context of a M2M driver. Regarding the driver itself, significant reworks were carried out (including both functional and cosmetic changes) and the driver is now more stable. It was tested on the A33 and A20 so far and works nicely on both.

The standalone tool that was developed for testing the driver, called cedrus-frame-test, has seen various improvements that allow reliably testing the Sunxi-Cedrus driver. The tool is now in a state where it can be used nicely from the command line and includes the first few frames of our reference Big Buck Bunny MPEG2 video. It also implements timestamping to have a clear idea of how long frame decoding and frame display take. A target number of frames per second can also be set, with error messages printed when the target fps could not be met. Finally, a dummy libVA backend was written to easily dump slices and frame metadata from videos: libva-dump.
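The timestamping and target fps check can be summarized with the Python sketch below. It is a simplified model and not the code of cedrus-frame-test (which is not shown here): the helper names and the sample durations are made up for the example.

```python
import time


def timed(step):
    """Run `step()` and return its result with the elapsed time in seconds."""
    start = time.monotonic()
    result = step()
    return result, time.monotonic() - start


def frame_budget(target_fps: float) -> float:
    """Time available for each frame, in seconds, at the target rate."""
    return 1.0 / target_fps


def late_frames(durations, target_fps: float):
    """Indices of the frames whose processing time exceeded the budget."""
    budget = frame_budget(target_fps)
    return [index for index, duration in enumerate(durations)
            if duration > budget]


# Made-up per-frame durations (decode plus display), in seconds.
durations = [0.005, 0.050, 0.030]
for index in late_frames(durations, target_fps=25):
    print("frame %d missed the 25 fps target" % index)
```

At 25 fps the budget is 40 ms per frame, so only the second frame of this sample is reported.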

cedrus-frame-test displaying Big Buck Bunny frames decoded with Sunxi-Cedrus

Instructions to set up the kernel driver as well as cedrus-frame-test from our trees will be made available on the linux-sunxi wiki page dedicated to Sunxi-Cedrus very soon.

At this point, the time spent decoding each video frame is rather satisfying (around 5 ms as a ballpark figure) for our 854×480 demo video. We are still doing a hard copy of each frame to feed it to the display driver: that’s where the current bottleneck is. There is work left to be done in that area, first by implementing DMAbuf and also by using proper page flipping in cedrus-frame-test. We are also hitting a display issue with 4.16 on the A20, although that problem might have been fixed in 4.17 already.

Next week will be focused on (finally) adding DMAbuf support and getting libVA in shape to work with the new Sunxi-Cedrus kernel driver under VLC and GStreamer. The final patch of the first GStreamer adaptation series submitted some weeks ago was recently merged in GStreamer.