benchmarks – Bootlin

Super fast Linux splashscreen

Here’s a simple trick that I recently rediscovered when I worked on a boot time reduction project for a customer. It’s not rocket science, but you may not be aware of it.

Our customer was using fbv to display its logo right after the system booted. This is a way to show that the system is available while you’re starting the system’s main application:

fbv -d 1 /root/logo.bmp > /dev/null 2>&1

With Grabserial and using simple instrumentation with messages issued on the serial console before and after running the command, we found that this command was taking 878 ms to execute. The customer’s system had an AT91SAM9263 ARM SOC from Atmel, running at 200 MHz.

Even if fbv is a simple program (22 KB on ARM, compiled with shared libraries), decoding the logo image is still expensive. Here’s a way to get this compute cost out of your boot sequence. All you have to do is display your logo on your framebuffer, and then capture the framebuffer contents in a file:

fbv -d 1 /root/logo.bmp
cp /dev/fb0 /root/logo.fb

The new file is now a little bigger, 230400 bytes instead of 76990. However, displaying your boot logo can now be done by a simple copy:

dd if=/root/logo.fb of=/dev/fb0 bs=230400 count=1 > /dev/null 2>&1

This command now runs in only 54 ms. That’s only 6% of the initial execution time! The advantage of this approach is that it works with any kind of framebuffer pixel format, as long as you have at least one program that knows how to write to your own framebuffer.

Note that the dd command was used to read and write the logo in one shot, rather than copying in multiple chunks. We found that the equivalent cp and cat commands were slightly slower. Of course, the benchmark results will vary from one system to another. Our customer had heavily optimized their NOR flash access time. If you run this on a very slow storage device, using a much faster CPU, the time to display the logo may be several impacted by the time taken to read a bigger file from slower storage.

To get even better performance, another trick is to compress the framebuffer contents with LZO (supported by BusyBox), which is very fast at decompressing, and requires very little memory to run:

lzop -9 /root/logo.fb

The new /root/logo.fb.lzop file is now only 2987 bytes big. Of course, the compression rate will depend on your logo image. In our case, the splashscreen contains mostly white space and a simple monochrome company logo. The new command to put in your startup scripts is now:

lzopcat /root/logo.fb.lzo > /dev/fb0

The execution time is now just 52.5 ms! With a faster CPU, the time reduction would have been even bigger.

The ultimate trick for having a real and possibly animated splashscreen would be to implement your own C program, directly writing to the framebuffer memory in mmap() mode. Here’s a nice tutorial showing how easy it can be.

Report on extensive real-time Linux benchmarks on AT91

The real time page I wrote for Atmel was finally released on the Linux4Sam Atmel Wiki. The purpose of this page was to help new comers to use real time features with Atmel CPUs and to present the state of the real time support.

Here are some figures associated to this work:

On this page I present the results of more than 300 hours of benchmarks!

During the setup and the tuning tests ran for more than 600 hours.

Analysis and formatting took a few dozen hours of work.

The benchmarks have been run on 3 boards, 3 flavors of Linux (vanilla, PREEMPT-RT patches, Xenomai co-kernel approach), and 2 kinds of tests (timer-based and GPIO-based)

Linux on ARM: xz kernel decompression benchmarks

I recently managed to find time to clean up and submit my patches for xz kernel compression support on ARM, which I started working on back in November, during my flight to Linaro Connect. However, it was too late as Russell King, the ARM Linux maintainer, alreadyaccepted a similar patch, about 3 weeks before my submission. The lesson I learned was that checking a git tree is not always sufficient. I should have checked the mailing list archives too.

The good news is that xz kernel compression support should be available in Linux 3.4 in a few months from now. xz is a compression format based on the LZMA2 compression algorithm. It can be considered as the successor of lzma, and achieves even better compression ratios!

Before submitting my patches, I ran a few benchmarks on my own implementation. As the decompressing code is the same, the results should be the same as if I had used the patches that are going upstream.

Benchmark methodology

For both boards I tested, I used the same pre 3.3 Linux kernel from Linus Torvalds’ mainline git tree. I also used the U-boot bootloader in both cases.

I used the very useful grabserial script from Tim Bird. This utility reads messages coming out of the serial line, and adds timestamps to each line it receives. This allow to measure time from the earliest power on stages, and doesn’t slow down the target system by adding instrumentation to it.

Our benchmarks just measure the time for the bootloader to copy the kernel to RAM, and then the time taken by the kernel to uncompress itself.

Loading time is measured between “reading uImage” and “OK” (right before “Starting kernel”) in the bootloader messages.

Compression time measured between “Uncompressing Linux” and “done”:

~/bin/grabserial -v -d /dev/ttyUSB0 -e 15 -t -m "Uncompressing Linux" -i "done," > booting-lzo.log

Benchmarks on OMAP4 Panda

The Panda board has a fast dual Cortex A9 CPU (OMAP 4430) running at 1 GHz. The standard way to boot this board is from an MMC/SD card. Unfortunately, the MMC/SD interface of the board is rather slow.

In this case, we have a fast CPU, but with rather slow storage. Therefore, the time taken to copy the kernel from storage to RAM is expected to have a significant impact on boot time.

This case typically represents todays multimedia and mobile devices such as phones, media players and tablets.

Compression	Size	Loading time	Uncompressing time	Total time
gzip	3355768	2.213376	0.501500	2.714876
lzma	2488144	1.647410	1.399552	3.046962
xz	2366192	1.566978	1.299516	2.866494
lzo	3697840	2.471497	0.160596	2.632093
None	6965644	4.626749	0	4.626749

Results on Calao Systems USB-A9263 (AT91)

The USB-A9263 board from Calao Systems has a cheaper and much slower AT91SAM9263 CPU running at 200 MHz.

Here we are booting from NAND flash, which is the fastest way to boot a kernel on this board. Note that we are using the nboot command from U-boot, which guarantees that we just copy the number of bytes specified in the uImage header.

In this case, we have a slow CPU with slow storage. Therefore, we expect both the kernel size and the decompression algorithm to have a major impact on boot time.

This case is a typical example of industrial systems (AT91SAM9263 is still very popular in such applications, as we can see from customer requests), booting from NAND storage operating with a 200 to 400 MHz CPU.

Compression	Size	Loading time	Uncompressing time	Total time
gzip	2386936	5.843289	0.935495	6.778784
lzma	1794344	4.465542	6.513644	10.979186
xz	1725360	4.308605	4.816191	9.124796
lzo	2608624	6.351539	0.447336	6.798875
None	4647908	11.080560	0	11.080560

Lessons learned

Here’s what we learned from these benchmarks:

lzo is still the best solution for minimum boot time. Remember, lzo kernel compression was merged by Bootlin.
xz is always better than lzma, both in terms of image size. Therefore, there’s no reason to stick to lzma compression if you used it.
Because of their heavy CPU usage, lzma and xz remain pretty bad in terms of boot time, on most types of storage devices. On systems with a fast CPU, and very slow storage though, xz should be the best solution
On systems with a fast CPU, like the Panda board, boot time with xz is actually pretty close to lzo, and therefore can be a very interesting compromise between kernel size and boot time.
Using a kernel image without compression is rarely a worthy solution, except in systems with a very slow CPU. This is the case of CPUs emulated on an FPGA (typically during chip development, before silicon is available). In this particular case, copying to memory is directly done by the emulator, and we just need CPU cycles to start the kernel.

Update on flash filesystems

Reviewing new possibilities for flash filesystems – My slides at ELCE 2008

With the release of Linux 2.6.27, including the new UBIFS filesystem for MTD storage, embedded Linux system developers now have multiple choices for their flash storage devices. As far as it is concerned, JFFS2 has also been improved and now has support for LZO compression, which makes uncompressing faster. So, how to choose between JFFS2, YAFFS2, and UBIFS?

To help our customers and the community make the right decision, I measured how these filesystems compare in terms of mount time, access time, read and write speed, as well as CPU usage in several corner cases and with different flash chip sizes.

I showed the results during the Embedded Linux Conference Europe event. Besides sharing lessons learned from these experiments, my presentation also introduced each filesystem and its implementation. I also gave advice for flash based block storage (such as Compact Flash and Solid State disks), to reduce the number of writes and avoid damaging flash blocks.

As usual, Bootlin slides are available under the Creative Commons BY-SA license: flash-filesystems.pdf (PDF), flash-filesystems.odp (Open Document Format).

The main finding is that UBIFS outperforms both JFFS2 and YAFFS2 in almost all corner cases. As shown by the benchmarks, it has consistently good mount time, and read/write performance. If your products are using a recent kernel, and are still based on JFFS2, you should definitely try UBIFS and get significant performance benefits, in particular for boot time, as mounting a JFFS2 root filesystem can take several seconds!

The advent of UBIFS also questions the relevance of YAFFS2. YAFFS2 used to be a good alternative to JFFS2, but unlike UBIFS, it doesn’t support compression. Then, why choose YAFFS2, when a apparently superior alternative is available?

The only case in which JFFS2 can still make sense if when you have very small partitions, sizing just a few megabytes. In this case, the overhead from UBI, the erase-block management layer below UBIFS, is no longer negligible. You will be able to pack much less data than with JFFS2. In this case, you can still improve JFFS2’s performance by using some of its new features (more details in the presentation).

SquashFS is also another great alternative, as shown by my benchmarks. It’s true it is a block filesystem, but since it is read-only, and there is no problem to use it on a write-once mtdblock device. You should really consider it for the read-only parts in your system, though it is advisable to use it on top of UBI, to make its blocks participate to wear-leveling and bad block management. Again, you will find more details in my presentation.

The presentation also mentions LogFS, which is also a promising filesystem for flash storage. Unfortunately, LogFS is not available yet for recent kernels. Stay tuned and I will benchmark it as soon this situation changes.