Faster boot: starting Linux directly from AT91bootstrap

Reducing start-up time looks like one of the most discussed topics nowadays, for both embedded and desktop systems. Typically, the boot process consists of three steps: AT91SAM9263 CPU

  • First-stage bootloader
  • Second-stage bootloader
  • Linux kernel

The first-stage bootloader is often a tiny piece of code whose sole purpose is to bring the hardware in a state where it is able to execute more elaborate programs. On our testing board (CALAO TNY-A9260), it’s a piece of code the CPU stores in internal SRAM and its size is limited to 4Kib, which is a very small amount of space indeed. The second-stage bootloader often provides more advanced features, like downloading the kernel from the network, looking at the contents of the memory, and so on. On our board, this second-stage bootloader is the famous U-Boot.

One way of achieving a faster boot is to simply bypass the second-stage bootloader, and directly boot Linux from the first-stage bootloader. This first-stage bootloader here is AT91bootstrap, which is an open-source bootloader developed by Atmel for their AT91 ARM-based SoCs. While this approach is somewhat static, it’s suitable for production use when the needs are simple (like simply loading a kernel from NAND flash and booting it), and allows to effectively reduce the boot time by not loading U-Boot at all. On our testing board, that saves about 2s.

As we have the source, it’s rather easy to modify AT91bootstrap to suit our needs. To make things easier, we’ll boot using an existing U-Boot uImage. The only requirement is that it should be an uncompressed uImage, like the one automatically generated by make uImage when building the kernel (there’s not much point using such compressed uImage files on ARM anyway, as it is possible to build self-extractible compressed kernels on this platform).

Looking at the (shortened) main.c, the code that actually boots the kernel looks like this:

int main(void)
{
/* ================== 1st step: Hardware Initialization ================= */
/* Performs the hardware initialization */
hw_init();

/* Load from Nandflash in RAM */
load_nandflash(IMG_ADDRESS, IMG_SIZE, JUMP_ADDR);

/* Jump to the Image Address */
return JUMP_ADDR;
}

In the original source code, load_nandflash actually loads the second-stage bootloader, and then jumps directly to JUMP_ADDR (this value can be found in U-Boot as TEXT_BASE, in the board-specific file config.mk. This is the base address from which the program will be executed). Now, if we want to load the kernel directly instead of a second-level bootloader, we need to know a handful of values:

  • the kernel image address (we will reuse IMG_ADDRESS here, but one could
    imagine reading the actual image address from a fixed location in NAND)
  • the kernel size
  • the kernel load address
  • the kernel entry point

The last three values can be extracted from the uImage header. We will not hard-code the kernel size as it was previously the case (using IMG_SIZE), as this would lead to set a maximum size for the image and would force us to copy more data than necessary. All those values are stored as 32 bits bigendian in the header. Looking at the struct image_header declaration from image.h in the uboot-mkimage sources, we can see that the header structure is like this:

typedef struct image_header {
uint32_t    ih_magic;    /* Image Header Magic Number    */
uint32_t    ih_hcrc;    /* Image Header CRC Checksum    */
uint32_t    ih_time;    /* Image Creation Timestamp    */
uint32_t    ih_size;    /* Image Data Size        */
uint32_t    ih_load;    /* Data     Load  Address        */
uint32_t    ih_ep;        /* Entry Point Address        */
uint32_t    ih_dcrc;    /* Image Data CRC Checksum    */
uint8_t        ih_os;        /* Operating System        */
uint8_t        ih_arch;    /* CPU architecture        */
uint8_t        ih_type;    /* Image Type            */
uint8_t        ih_comp;    /* Compression Type        */
uint8_t        ih_name[IH_NMLEN];    /* Image Name        */
} image_header_t;

It’s quite easy to determine where the values we’re looking for actually are in the uImage header.

  • ih_size is the fourth member, hence we can find it at offset 12
  • ih_load and ih_ep are right after ih_size, and therefore can be found at offset 16 and 20.

A first call to load_nandflash is necessary to get those values. As the data we need are contained within the first 32 bytes, that’s all we need to load at first. However, some space is required in memory to actually store the data. The first-stage bootloader is running in internal SRAM, so we can pick any location we want in SDRAM. For the sake of simplicity, we’ll choose PHYS_SDRAM_BASEhere, which we define to the base address of the on-board SDRAM in the CPU address space. Then, a second call will be necessary to load the entire kernel image at the right load address.

Then all we need to do is:

#define be32_to_cpu(a) ((a)[0] << 24 | (a)[1] << 16 | (a)[2] << 8 | (a)[3])
#define PHYS_SDRAM_BASE 0x20000000

int main(void)
{
unsigned char *tmp;
unsigned long jump_addr;
unsigned long load_addr;
unsigned long size;

hw_init();

load_nandflash(IMG_ADDRESS, 0x20, PHYS_SDRAM_BASE);

/* Setup tmp so that we can read the kernel size */
tmp = PHYS_SDRAM_BASE + 12;
size = be32_to_cpu(tmp);

/* Now, load address */
tmp += 4;
load_addr = be32_to_cpu(tmp);

/* And finally, entry point */
tmp += 4;
jump_addr = be32_to_cpu(tmp);

/* Load the actual kernel */
load_nandflash(IMG_ADDRESS, size, load_addr - 0x40);

return jump_addr;
}

Note that the second call to load_nandflash could in theory be replaced by:

load_nandflash(IMG_ADDRESS + 0x40, size + 0x40, load_addr);

However, this will not work. What happens is that load_nandflash starts reading at an address aligned on a page boundary, so even when passing IMG_ADDRESS+0x40 as a first argument, reading will start at IMG_ADDRESS, leading to a failure (writes have to aligned on a page boundary, so it is safe to assume that IMG_ADDRESS is actually correctly aligned).

The above piece of code will silently fail if anything goes wrong, and does no checking at all – indeed, the binary size is very limited and we can’t afford to put more code than what is strictly necessary to boot the kernel.

Update on flash filesystems

Reviewing new possibilities for flash filesystems – My slides at ELCE 2008

With the release of Linux 2.6.27, including the new UBIFS filesystem for MTD storage, embedded Linux system developers now have multiple choices for their flash storage devices. As far as it is concerned, JFFS2 has also been improved and now has support for LZO compression, which makes uncompressing faster. So, how to choose between JFFS2, YAFFS2, and UBIFS?

To help our customers and the community make the right decision, I measured how these filesystems compare in terms of mount time, access time, read and write speed, as well as CPU usage in several corner cases and with different flash chip sizes.

I showed the results during the Embedded Linux Conference Europe event. Besides sharing lessons learned from these experiments, my presentation also introduced each filesystem and its implementation. I also gave advice for flash based block storage (such as Compact Flash and Solid State disks), to reduce the number of writes and avoid damaging flash blocks.

As usual, Bootlin slides are available under the Creative Commons BY-SA license: flash-filesystems.pdf (PDF), flash-filesystems.odp (Open Document Format).

The main finding is that UBIFS outperforms both JFFS2 and YAFFS2 in almost all corner cases. As shown by the benchmarks, it has consistently good mount time, and read/write performance. If your products are using a recent kernel, and are still based on JFFS2, you should definitely try UBIFS and get significant performance benefits, in particular for boot time, as mounting a JFFS2 root filesystem can take several seconds!

The advent of UBIFS also questions the relevance of YAFFS2. YAFFS2 used to be a good alternative to JFFS2, but unlike UBIFS, it doesn’t support compression. Then, why choose YAFFS2, when a apparently superior alternative is available?

The only case in which JFFS2 can still make sense if when you have very small partitions, sizing just a few megabytes. In this case, the overhead from UBI, the erase-block management layer below UBIFS, is no longer negligible. You will be able to pack much less data than with JFFS2. In this case, you can still improve JFFS2’s performance by using some of its new features (more details in the presentation).

SquashFS is also another great alternative, as shown by my benchmarks. It’s true it is a block filesystem, but since it is read-only, and there is no problem to use it on a write-once mtdblock device. You should really consider it for the read-only parts in your system, though it is advisable to use it on top of UBI, to make its blocks participate to wear-leveling and bad block management. Again, you will find more details in my presentation.

The presentation also mentions LogFS, which is also a promising filesystem for flash storage. Unfortunately, LogFS is not available yet for recent kernels. Stay tuned and I will benchmark it as soon this situation changes.