A new tool to ease kernel maintainer life

When you are involved in mainlining or maintaining some kernel code, a non negligible part of your time is spent checking patches or patchsets themselves . It is not the most interesting part but it is truly necessary to help merging in kernel code, or to make sure you don’t break anything, for example building with an incompatible configuration.

Aiaiai developed by Artem Bityutskiy is a tool to do most of this task for you! It uses other checking tools and scripts such as sparse, coccinelle and checkpatch.pl, and comes with its own set of tools and scripts. I don’t know what does “aiaiai” stands for, but in French it sounds like “Ouch Ouch Ouch”, the sound you could make if you forget to use this tool 😉

PS: On my G+ post, Yegor Yefremov pointed that “aiaiai” means something like “tsk tsk!” (shame on you!) in Russian.

How to boot an uncompressed Linux kernel on ARM

This is a quick post to share my experience booting uncompressed Linux kernel images, during the benchmarks of kernel compression options, and no compression at all was one of these options.

It is sometimes useful to boot a kernel image with no compression. Though the kernel image is bigger, and takes more time to copy from storage to RAM, the kernel image no longer has to be decompressed to RAM. This is useful for systems with a very slow CPU, or very little RAM to store both the compressed and uncompressed images during the boot phase. The typical case is booting CPUs emulated by FPGA, during processor development, before the final silicon is out. For example, I saw a Cortex A15 chip boot at 11 MHz during Linaro Connect Q2.11 in Budapest. At this clock frequency, booting a kernel image with no compression saves several minutes of boot time, reducing development and test time. Note that with such hardware emulators, copying the kernel image to RAM is cheap, as it is done by the emulator from a file given by the user, before starting to emulate the system.

Building a kernel image with no compression on ARM is easy, but only once you know where the uncompressed image is and what to do! For people who have never done that before, I’m sharing quick instructions here.

To generate your uncompressed kernel image, all you have to do is run the usual make command. The file that you need is arch/arm/boot/Image.

Depending on the bootloader that you use, this could be sufficient. However, if you use U-boot, you still need to put this image in a uImage container, to let U-boot know about details such as how big the image is, what its entry point is, whether it is compressed or not… The problem is you can’t run make uImage any more to produce this container. That’s because Linux on ARM has no configuration option to keep the kernel uncompressed, and the uImage file would contain a compressed kernel.

Therefore, you have to create the uImage by invoking the mkimage command manually. To do this without having to guess the right mkimage parameters, I recommend to run make V=1 uImage once:

$ make V=1 uImage
...
  Kernel: arch/arm/boot/zImage is ready
  /bin/bash /home/mike/linux/scripts/mkuboot.sh -A arm -O linux -T kernel -C none -a 0x80008000 -e 0x80008000 -n 'Linux-3.3.0-rc6-00164-g4f262ac' -d arch/arm/boot/zImage arch/arm/boot/uImage
Image Name:   Linux-3.3.0-rc6-00164-g4f262ac
Created:      Thu Mar  8 13:54:00 2012
Image Type:   ARM Linux Kernel Image (uncompressed)
Data Size:    3351272 Bytes = 3272.73 kB = 3.20 MB
Load Address: 80008000
Entry Point:  80008000
  Image arch/arm/boot/uImage is ready

Don’t be surprised if the above message says that the kernel is uncompressed (corresponding to -C none). If we told U-boot that the image is already compressed, it would take care of uncompressing it to RAM before starting the kernel image.

Now, you know what mkimage command you need to run. Just invoke this command on the Image file instead of zImage (you can directly replace mkuboot.sh by mkimage):

$ mkimage -A arm -O linux -T kernel -C none -a 0x80008000 -e 0x80008000 -n 'Linux-3.3.0-rc6-00164-g4f262ac' -d arch/arm/boot/Image arch/arm/boot/uImage
Image Name:   Linux-3.3.0-rc6-00164-g4f262ac
Created:      Thu Mar  8 14:02:27 2012
Image Type:   ARM Linux Kernel Image (uncompressed)
Data Size:    6958068 Bytes = 6794.99 kB = 6.64 MB
Load Address: 80008000
Entry Point:  80008000

Now, you can use your uImage file as usual.

Linux on ARM: xz kernel decompression benchmarks

I recently managed to find time to clean up and submit my patches for xz kernel compression support on ARM, which I started working on back in November, during my flight to Linaro Connect. However, it was too late as Russell King, the ARM Linux maintainer, alreadyaccepted a similar patch, about 3 weeks before my submission. The lesson I learned was that checking a git tree is not always sufficient. I should have checked the mailing list archives too.

The good news is that xz kernel compression support should be available in Linux 3.4 in a few months from now. xz is a compression format based on the LZMA2 compression algorithm. It can be considered as the successor of lzma, and achieves even better compression ratios!

Before submitting my patches, I ran a few benchmarks on my own implementation. As the decompressing code is the same, the results should be the same as if I had used the patches that are going upstream.

Benchmark methodology

For both boards I tested, I used the same pre 3.3 Linux kernel from Linus Torvalds’ mainline git tree. I also used the U-boot bootloader in both cases.

I used the very useful grabserial script from Tim Bird. This utility reads messages coming out of the serial line, and adds timestamps to each line it receives. This allow to measure time from the earliest power on stages, and doesn’t slow down the target system by adding instrumentation to it.

Our benchmarks just measure the time for the bootloader to copy the kernel to RAM, and then the time taken by the kernel to uncompress itself.

  • Loading time is measured between “reading uImage” and “OK” (right before “Starting kernel”) in the bootloader messages.
  • Compression time measured between “Uncompressing Linux” and “done”:
    ~/bin/grabserial -v -d /dev/ttyUSB0 -e 15 -t -m "Uncompressing Linux" -i "done," > booting-lzo.log

Benchmarks on OMAP4 Panda

The Panda board has a fast dual Cortex A9 CPU (OMAP 4430) running at 1 GHz. The standard way to boot this board is from an MMC/SD card. Unfortunately, the MMC/SD interface of the board is rather slow.

In this case, we have a fast CPU, but with rather slow storage. Therefore, the time taken to copy the kernel from storage to RAM is expected to have a significant impact on boot time.

This case typically represents todays multimedia and mobile devices such as phones, media players and tablets.

Compression Size Loading time Uncompressing time Total time
gzip 3355768 2.213376 0.501500 2.714876
lzma 2488144 1.647410 1.399552 3.046962
xz 2366192 1.566978 1.299516 2.866494
lzo 3697840 2.471497 0.160596 2.632093
None 6965644 4.626749 0 4.626749

Results on Calao Systems USB-A9263 (AT91)

The USB-A9263 board from Calao Systems has a cheaper and much slower AT91SAM9263 CPU running at 200 MHz.

Here we are booting from NAND flash, which is the fastest way to boot a kernel on this board. Note that we are using the nboot command from U-boot, which guarantees that we just copy the number of bytes specified in the uImage header.

In this case, we have a slow CPU with slow storage. Therefore, we expect both the kernel size and the decompression algorithm to have a major impact on boot time.

This case is a typical example of industrial systems (AT91SAM9263 is still very popular in such applications, as we can see from customer requests), booting from NAND storage operating with a 200 to 400 MHz CPU.

Compression Size Loading time Uncompressing time Total time
gzip 2386936 5.843289 0.935495 6.778784
lzma 1794344 4.465542 6.513644 10.979186
xz 1725360 4.308605 4.816191 9.124796
lzo 2608624 6.351539 0.447336 6.798875
None 4647908 11.080560 0 11.080560

Lessons learned

Here’s what we learned from these benchmarks:

  • lzo is still the best solution for minimum boot time. Remember, lzo kernel compression was merged by Bootlin.
  • xz is always better than lzma, both in terms of image size. Therefore, there’s no reason to stick to lzma compression if you used it.
  • Because of their heavy CPU usage, lzma and xz remain pretty bad in terms of boot time, on most types of storage devices. On systems with a fast CPU, and very slow storage though, xz should be the best solution
  • On systems with a fast CPU, like the Panda board, boot time with xz is actually pretty close to lzo, and therefore can be a very interesting compromise between kernel size and boot time.
  • Using a kernel image without compression is rarely a worthy solution, except in systems with a very slow CPU. This is the case of CPUs emulated on an FPGA (typically during chip development, before silicon is available). In this particular case, copying to memory is directly done by the emulator, and we just need CPU cycles to start the kernel.

Linux Kernel Development, third edition

Linux Kernel Development, by Robert Love, 3rd edition

Linux Kernel Development is a book authored by Robert Love, a famous kernel developer. Contrary to the very famous Linux Device Drivers book, Linux Kernel Development is not oriented towards driver development, but instead covers how the core Linux kernel works.

Having this knowledge is not absolutely necessary to write Linux device drivers, but having a good overall understanding of the kernel always help to understand what’s going on in your system, even at the application level. In July this year, the third edition of Linux Kernel Development has been published, which upgrades the book contents to kernel version 2.6.34, a good opportunity to have a new look at the book that Bootlin received a few weeks ago.

After a quick introduction to the kernel sources (configuring, building, organization of the source tree), the book immediately dives into kernel internals:

  • Process management: how the kernel represents processes and their state, how processes are created inside the kernel, how threads are handled, are processes are terminated.
  • Process scheduling: a full chapter dedicated to the Linux kernel process scheduler. The new CFS scheduler is of course covered in great detail, with large portion of commented source code, for those who want to understand the fine details of the scheduler. Topics such as process sleep/wake-up, preemption, real-time scheduling policies are also covered.
  • System calls are then covered: how they are implemented, how parameters are passed from userspace to the kernel, etc. The call path from your user-space application down to the kernel is well explained in this chapter.
  • Kernel data structures: a generic chapter which details the kernel API for linked lists, queues, maps, and binary trees. Those APIs are omni-present inside the kernel, and it’s therefore a good idea to know how they work, both for understand existing code and for writing new code.
  • Interrupts: how interrupts are handled and how one can write an interrupt handler. Unfortunately, the newly introduced threaded interrupt handlers are not covered, but it’s true that their usage is not yet very widespread inside the mainline kernel.
  • Bottom halves and deferring work, a topic closely related to interrupt handling. It covers bottom halves, softirqs, tasklets and workqueues.
  • Kernel synchronization: two chapters are dedicated to this topic. First a chapter detailing why synchronization is needed, what are the sources of concurrency and what should be protected against concurrent access. And then a chapter detailing the mechanisms provided by the kernel to implement proper synchronization: atomic operations, spin locks, reader-writer spin locks, semaphores, reader-writer semaphores, mutexes, completion variables, sequential locks, preemption disabling, ordering and barriers
  • Timers and time management details how the kernel manages time: ticks, jiffies counter, timers, delaying execution of code are covered in this chapter. There are unfortunately no details about the clocksource and clockevents infrastructure, and no details about how timers and high-resolution timers are implemented. Contrary to other chapters that go fairly deep into the implementation details, this one mostly only covers the API to time management rather than the internals.
  • Memory management is the topic of the following chapter: physical memory management with the page allocator and the physical zones, then the kmalloc, vmalloc and SLAB allocators are covered. High-memory mappings, a topic specific to 32 bits architectures having more than a gigabyte of RAM is also covered in detail. The per-cpu interface is also covered, and will help those who want to understand parts of the kernel that have been optimized for scalability on multiple CPUs.
  • The Virtual Filesystem, with its different objects: superblock, inode, dentry and file is covered in good detail.
  • The block layer is then covered, with a description of the role of the bio structure, the request queues, and the I/O schedulers.
  • Then, the book goes back to more details about the internals of memory management: the mm_struct memory descriptor, the virtual memory areas (so called VMAs) and how they relate with the mmap/munmap system calls. The next chapter continues with a detailed description of the page cache implementation.
  • A fairly strange chapter called “Devices and modules” gives some information about kernel modules (how to build them, how to use module parameters, how dependencies are handled), then covers the internal of the device model (kobjects, ktypes, ksets, krefs) and finally sysfs. Just like the chapter covering the device model in the third edition of Linux Device Drivers, I think it totally misses the point. All the kobject, ktypes, ksets and krefs stuff is very low-level plumbing used by the Linux device model, but it is not exactly what the driver developer needs to interact with in the first place. In my opinion, a good description of the device model should rather explain what struct bus_type, struct driver and struct device are, how they are specialized for the different bus types in pci_driver, pci_device, usb_driver, usb_device, platform_driver, platform_device, and how the registration/probing of drivers and devices is done. I’ve recently given a talk about this topic, the video is in French, but the slides are in English.
  • Debugging is then covered: printk of course, but also oopses, debugging-related kernel options, the magic SysRq key, kernel debuggers, etc.
  • A rather generic chapter about Portability is then proposed, and finally a chapter about Patches, Hacking and the community that details the kernel community, the kernel coding style, how to generate and submit patches, etc.

All in all, Linux Kernel Development remains very good reading. I particularly appreciate the writing style of Robert Love, who manages to make a deeply-technical book interesting and easy to read. Of course, there are some topics in the kernel in which I had to dive myself and for which I’d expect to see more details in this book, but giving every possible detail about a huge beast such as the Linux kernel in just 400 pages is not possible!

Linux device drivers architecture talk at Libre Software Meeting

recursive device modelThomas Petazzoni gave a talk on the Linux kernel architecture for device drivers at the Libre Software Meeting in Bordeaux, France. While the talk was given in French, the materials are in English and can therefore benefit a larger audience. The talk seems to have been well-received, especially from people already having a basic Linux kernel development experience. The topics covered are part of our Linux Kernel development training, and are also usually very appreciated from the trainees already having Linux kernel experience.

The idea of the talk is to give an overview of how device drivers fit into the kernel, both to expose their functionality to upper layers (such as a network device driver exposes itself to the kernel network infrastructure) and to detect/access the hardware using the device/driver model, which is quite hard to understand from the source code only.

The talk went through the following sections :

  • First a basic introduction to device drivers: how devices are seen from userspace applications, and how a simple, raw, character driver can be implemented. It allowed to expose the principle of operations and their similarity with methods in object-oriented programming, and the principle of registration to an upper-layer infrastructure
  • Then, an introduction to what I call « kernel frameworks », i.e kernel subsystems that specialize a general device type (i.e character device) into a particular device type (i.e serial port device, framebuffer device, etc.). The talk illustrates this with the framebuffer core and the serial port core.
  • Finally, an explanation about the device model: bus drivers, adapter drivers and device drivers. I started with the example of the USB bus: being a dynamically-enumerated bus, it provides a good illustration of the device model principles. At the end, I explained how the device model works for the devices embedded into a SoC using the platform drivers/devices mechanism

The slides for this talk are no longer available, but their updates are now integrated in our Linux kernel and driver development training materials which are freely available.

New jobs at Bootlin

Penguin worksLooking for kernel and embedded Linux experts

Bootlin is looking for experienced members of the Free Software community to satisfy increasing demand for development, consulting and training on embedded Linux and on the Linux kernel.

One thing that distinguishes our positions from others is that contributing to the community will be part of your objectives.

All the details can be found on our careers page.

Linux 2.6.33 features for embedded systems

Interesting features for embedded Linux system developers

Penguin workerLinux 2.6.33 was out on Feb. 24, 2010, and to incite you to try this new kernel in your embedded Linux products, here are features you could be interested in.

The first news is the availability of the LZO algorithm for kernel and initramfs compression. Linux 2.6.30 already introduced LZMA and BZIP2 compression options, which could significantly reduce the size of the kernel and initramfs images, but at the cost of much increased decompression time. LZO compression is a nice alternative. Though its compression rate is not as good as that of ZLIB (10 to 15% larger files), decompression time is much faster than with other algorithms. See our benchmarks. We reduced boot time by 200 ms on our at91 arm system, and the savings could even increase with bigger kernels.

This feature was implemented by my colleague Albin Tonnerre. It is currently available on x86 and arm (commit, commit, commit, commit), and according to Russell King, the arm maintainer, it should become the default compression option on this platform. This compressor can also be used on mips, thanks to Wu Zhangjin (commit).

For systems lacking RAM resources, a new useful feature is Compcache, which allows to swap application memory to a compressed cache in RAM. In practise, this technique increases the amount of RAM that applications can use. This could allow your embedded system or your netbook to run applications or environments it couldn’t execute before. This technique can also be a worthy alternative to on-disk swap in servers or desktops which do need a swap partition, as access performance is much improved. See this LWN.net article for details.

This new kernel also carries lots of improvements on embedded platforms, especially on the popular TI OMAP platform. In particular, we noticed early support to the IGEPv2 board, a very attractive platform based on the TI OMAP 3530 processor, much better than the Beagle Board for a very similar price. We have started to use it in customer projects, and we hope to contribute to its full support in the mainline kernel.

Another interesting feature of Linux 2.6.33 is the improvements in the capabilities of the perf tool. In particular, perf probe allows to insert Kprobes probes through the command line. Instead of SystemTap, which relied on kernel modules, perf probe now relies on a sysfs interface to pass probes to the kernel. This means that you no longer need a compiler and kernel headers to produce your probes. This made it difficult to port SystemTap to embedded platforms. The arm architecture doesn’t have performance counters in the mainline kernel yet (other architectures do), but patches are available. This carries the promise to be able to use probe tools like SystemTap at last on embedded architectures, all the more if SystemTap gets ported to this new infrastructure.

Other noticeable improvements in this release are the ability to mount ext3 and ext2 filesystems with just an ext4 driver, a lightweight RCU implementation, as well as the ability to change the default blinking cursor that is shown at boot time.

Unfortunately, each kernel release doesn’t only carry good news. Android patches got dropped from this release, because of a lack of interest from Google to maintain them. These are sad news and a threat for Android users who may end up without the ability to use newer kernel features and releases. Let’s hope that Google will once more realize the value of converging with the mainline Linux community. I hope that key contributors that this company employs (Andrew Morton in particular) will help to solve this issue.

As usual, this was just a selection. You will probably find many other interesting features on the Linux Changes page for Linux 2.6.33.

LZO kernel compression

As Michael stated in his review of the interesting features in Linux 2.6.30, new compression options have been recently added to the kernel. We therefore decided to have a look at those compression methods, from a compression ratio and decompression speed point of view.

This comparison will be based on “self-extractible kernels”, that is, kernel images containing bootstrap code allowing them to extract a compressed image. As underlined in the previous article, this approach is not used on all architectures. Blackfin, notably, chose a different path and compresses the whole kernel image, without including bootstrap code. While this has the clear advantage of making compression much simpler with respect to kernel code, it forces decompression out to the bootloader code.

Each of those methods has its advantages. Indeed, the Blackfin approach relies on the bootloader to provide the necessary functions, so that may be a problem to do things like bypassing u-boot to reduce the boot time. On the other hand, implementing it only once in the bootloader (as architecture-independent code) makes it unnecessary to write the low-level bootstrap code for each architecture in the kernel, which is surely interesting on virtually all architectures, the notable exceptions being x86 and ARM.

Gzip (also known as Zlib or inflate) has been the traditional (and, as a matter of fact, only) method used to compress kernels. Consequently, we’ll use it as the reference in the following tests. Our test environment is as follows:

ARM9 AT91SAM9263 CPU, 200MHz, using the mainline arch/arm/configs/usb-a9263.config.

This comparison includes figures for LZO, a new kernel image compression method that I have contributed to the Linux sources, and which hopefully will make its way into the mainline kernel. LZO support in the kernel is only new for kernel decompression, as it is already used by JFFS2 and UBIFS. LZO is a stream-oriented algorithm, and although its compression ratio is lower than that of gzip, decompression is lightning-fast, as we will soon find out.

So, here are the figures, average on 20 boots with each compression method:

Uncompressed 3.24Mo 200%
LZO 1.76Mo 0.552s 70% 109%
Gzip 1.62Mo 0.775s 100% 100%
LZMA 1.22Mo 5.011s 646% 75%

Bzip2 has not been tested here: the low-level bootstrap file, head.S, only allocates 64Kb for use by malloc() on ARM. Some quick tests showed that the kernel would not extract with less than 3.5Mib of malloc() space. That would require to modify head.S so that malloc can use more memory, which we will not do here. However, given that enough memory is usable on the system, one could well use bzip2. All the other algorithms performed the extraction using the standard 64Kib malloc space.

From the results, we can clearly see that LZMA is nearly unsuitable for our system, and should be considered only if the space constraints for storing the kernel are so tight that we can’t afford to use more space that was is strictly necessary.

LZO looks like a good candidate when it comes to speeding up the boot process, at the expense of some (almost neglectable) extra space. Gzip is close to LZO when it comes to size, although extraction is not as fast. That means that unless you’re hitting corner cases, like only having enough space for a Gzip compressed image but not for one made with LZO, choosing the latter is probably a safe bet.

Besides, the LZO-compressed kernel size is about 54% the size of the uncompressed kernel. As the kernel load time varies linearly with its size, load time for an uncompressed kernel doubles. While 0.55s are won because there’s no need to run a decompression algorithm, you spend twice as much time loading the kernel. This time is not negligible at all compared to the decompression time. Indeed, loading the uncompressed image takes roughly 0.8s. That means that at the cost of slowing down the boot process by 0.15s (compared to an uncompressed kernel), one gets a kernel image which is roughly twice as small. Rather nice, isn’t it?

Faster boot: starting Linux directly from AT91bootstrap

Reducing start-up time looks like one of the most discussed topics nowadays, for both embedded and desktop systems. Typically, the boot process consists of three steps: AT91SAM9263 CPU

  • First-stage bootloader
  • Second-stage bootloader
  • Linux kernel

The first-stage bootloader is often a tiny piece of code whose sole purpose is to bring the hardware in a state where it is able to execute more elaborate programs. On our testing board (CALAO TNY-A9260), it’s a piece of code the CPU stores in internal SRAM and its size is limited to 4Kib, which is a very small amount of space indeed. The second-stage bootloader often provides more advanced features, like downloading the kernel from the network, looking at the contents of the memory, and so on. On our board, this second-stage bootloader is the famous U-Boot.

One way of achieving a faster boot is to simply bypass the second-stage bootloader, and directly boot Linux from the first-stage bootloader. This first-stage bootloader here is AT91bootstrap, which is an open-source bootloader developed by Atmel for their AT91 ARM-based SoCs. While this approach is somewhat static, it’s suitable for production use when the needs are simple (like simply loading a kernel from NAND flash and booting it), and allows to effectively reduce the boot time by not loading U-Boot at all. On our testing board, that saves about 2s.

As we have the source, it’s rather easy to modify AT91bootstrap to suit our needs. To make things easier, we’ll boot using an existing U-Boot uImage. The only requirement is that it should be an uncompressed uImage, like the one automatically generated by make uImage when building the kernel (there’s not much point using such compressed uImage files on ARM anyway, as it is possible to build self-extractible compressed kernels on this platform).

Looking at the (shortened) main.c, the code that actually boots the kernel looks like this:

int main(void)
{
/* ================== 1st step: Hardware Initialization ================= */
/* Performs the hardware initialization */
hw_init();

/* Load from Nandflash in RAM */
load_nandflash(IMG_ADDRESS, IMG_SIZE, JUMP_ADDR);

/* Jump to the Image Address */
return JUMP_ADDR;
}

In the original source code, load_nandflash actually loads the second-stage bootloader, and then jumps directly to JUMP_ADDR (this value can be found in U-Boot as TEXT_BASE, in the board-specific file config.mk. This is the base address from which the program will be executed). Now, if we want to load the kernel directly instead of a second-level bootloader, we need to know a handful of values:

  • the kernel image address (we will reuse IMG_ADDRESS here, but one could
    imagine reading the actual image address from a fixed location in NAND)
  • the kernel size
  • the kernel load address
  • the kernel entry point

The last three values can be extracted from the uImage header. We will not hard-code the kernel size as it was previously the case (using IMG_SIZE), as this would lead to set a maximum size for the image and would force us to copy more data than necessary. All those values are stored as 32 bits bigendian in the header. Looking at the struct image_header declaration from image.h in the uboot-mkimage sources, we can see that the header structure is like this:

typedef struct image_header {
uint32_t    ih_magic;    /* Image Header Magic Number    */
uint32_t    ih_hcrc;    /* Image Header CRC Checksum    */
uint32_t    ih_time;    /* Image Creation Timestamp    */
uint32_t    ih_size;    /* Image Data Size        */
uint32_t    ih_load;    /* Data     Load  Address        */
uint32_t    ih_ep;        /* Entry Point Address        */
uint32_t    ih_dcrc;    /* Image Data CRC Checksum    */
uint8_t        ih_os;        /* Operating System        */
uint8_t        ih_arch;    /* CPU architecture        */
uint8_t        ih_type;    /* Image Type            */
uint8_t        ih_comp;    /* Compression Type        */
uint8_t        ih_name[IH_NMLEN];    /* Image Name        */
} image_header_t;

It’s quite easy to determine where the values we’re looking for actually are in the uImage header.

  • ih_size is the fourth member, hence we can find it at offset 12
  • ih_load and ih_ep are right after ih_size, and therefore can be found at offset 16 and 20.

A first call to load_nandflash is necessary to get those values. As the data we need are contained within the first 32 bytes, that’s all we need to load at first. However, some space is required in memory to actually store the data. The first-stage bootloader is running in internal SRAM, so we can pick any location we want in SDRAM. For the sake of simplicity, we’ll choose PHYS_SDRAM_BASEhere, which we define to the base address of the on-board SDRAM in the CPU address space. Then, a second call will be necessary to load the entire kernel image at the right load address.

Then all we need to do is:

#define be32_to_cpu(a) ((a)[0] << 24 | (a)[1] << 16 | (a)[2] << 8 | (a)[3])
#define PHYS_SDRAM_BASE 0x20000000

int main(void)
{
unsigned char *tmp;
unsigned long jump_addr;
unsigned long load_addr;
unsigned long size;

hw_init();

load_nandflash(IMG_ADDRESS, 0x20, PHYS_SDRAM_BASE);

/* Setup tmp so that we can read the kernel size */
tmp = PHYS_SDRAM_BASE + 12;
size = be32_to_cpu(tmp);

/* Now, load address */
tmp += 4;
load_addr = be32_to_cpu(tmp);

/* And finally, entry point */
tmp += 4;
jump_addr = be32_to_cpu(tmp);

/* Load the actual kernel */
load_nandflash(IMG_ADDRESS, size, load_addr - 0x40);

return jump_addr;
}

Note that the second call to load_nandflash could in theory be replaced by:

load_nandflash(IMG_ADDRESS + 0x40, size + 0x40, load_addr);

However, this will not work. What happens is that load_nandflash starts reading at an address aligned on a page boundary, so even when passing IMG_ADDRESS+0x40 as a first argument, reading will start at IMG_ADDRESS, leading to a failure (writes have to aligned on a page boundary, so it is safe to assume that IMG_ADDRESS is actually correctly aligned).

The above piece of code will silently fail if anything goes wrong, and does no checking at all – indeed, the binary size is very limited and we can’t afford to put more code than what is strictly necessary to boot the kernel.