NIOS II is a 32-bit RISC embedded processor architecture designed by Altera for its FPGA families, such as Cyclone III and Cyclone IV. Because it is a soft-core architecture, you can use Altera’s Quartus Prime design software to adjust the CPU configuration to your needs and instantiate it in the FPGA. You can customize parameters such as the instruction and data cache sizes, and enable or disable the MMU, the FPU, and so on. For us embedded Linux engineers, a very interesting aspect is that both the Linux kernel and the U-Boot bootloader, in their official versions, support the NIOS II architecture.
Recently, one of our customers designed a custom NIOS II platform, and we are working on porting the mainline U-Boot bootloader and the mainline Linux kernel to this platform. The U-Boot porting went fine, and quickly allowed us to load and start a Linux kernel. However, the Linux kernel was crashing very early with:
[ 0.000000] Linux version 4.5.0-00007-g1717be9-dirty (rperier@archy) (gcc version 4.9.2 (Altera 15.1 Build 185) ) #74 PREEMPT Fri Apr 22 17:43:22 CEST 2016
[ 0.000000] bootconsole [early0] enabled
[ 0.000000] early_console initialized at 0xe3080000
[ 0.000000] BUG: failure at mm/bootmem.c:307/__free()!
[ 0.000000] Kernel panic - not syncing: BUG!
This BUG() comes from the __free() function in mm/bootmem.c. The bootmem allocator is a simple page-based allocator used very early in the Linux kernel initialization for the very first allocations, even before the regular buddy page allocator and other allocators such as kmalloc are available. We were slightly surprised to hit a BUG in a generic part of the kernel, and immediately suspected some platform-specific issue, like an invalid load address for our kernel, or invalid link address, or other ideas like this. But we quickly came to the conclusion that everything was looking good on that side, and so we went on to actually understand what this BUG was all about.
The NIOS II memory initialization code in arch/nios2/kernel/setup.c does the following:
bootmap_size = init_bootmem_node(NODE_DATA(0),
                                 min_low_pfn, PFN_DOWN(PHYS_OFFSET),
                                 max_low_pfn);
[...]
free_bootmem(memory_start, memory_end - memory_start);
The first call, init_bootmem_node(), initializes the bootmem allocator, which primarily consists of allocating a bitmap with one bit per page. The entire bootmem bitmap is set to 0xff via a memset() during this initialization:
static unsigned long __init init_bootmem_core(bootmem_data_t *bdata,
        unsigned long mapstart, unsigned long start, unsigned long end)
{
        [...]
        mapsize = bootmap_bytes(end - start);
        memset(bdata->node_bootmem_map, 0xff, mapsize);
        [...]
}
After doing the bootmem initialization, the NIOS II architecture code calls free_bootmem() to mark all the memory pages as available, except the ones that contain the kernel itself. To achieve this, the __free() function (which is the one triggering the BUG) clears the bits corresponding to the pages to be marked as free. When clearing each bit, the function checks that it was previously set, and if that is not the case, fires the BUG:
static void __init __free(bootmem_data_t *bdata,
        unsigned long sidx, unsigned long eidx)
{
        [...]
        for (idx = sidx; idx < eidx; idx++)
                if (!test_and_clear_bit(idx, bdata->node_bootmem_map))
                        BUG();
}
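To make the interaction between the two functions concrete, here is a minimal userspace sketch of the same idea. It is not the kernel code: the `model_*` names are hypothetical, the bit helper is a non-atomic stand-in for `test_and_clear_bit()`, and instead of calling BUG() the free routine returns the index of the first page that was already marked free.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for the kernel's test_and_clear_bit(): clears bit
 * idx and reports whether it was set beforehand. Not atomic. */
static bool model_test_and_clear_bit(size_t idx, unsigned long *map)
{
    unsigned long mask = 1UL << (idx % MODEL_BITS_PER_LONG);
    unsigned long *word = &map[idx / MODEL_BITS_PER_LONG];
    bool was_set = (*word & mask) != 0;

    *word &= ~mask;
    return was_set;
}

/* Marks every page as reserved, the way init_bootmem_core() does it. */
static void model_init(unsigned long *map, size_t nbytes)
{
    memset(map, 0xff, nbytes);
}

/* Frees pages [sidx, eidx). Returns -1 on success, or the index of the
 * first page whose bit was already clear (where the kernel would BUG()). */
static long model_free(unsigned long *map, size_t sidx, size_t eidx)
{
    assert(sidx <= eidx);
    for (size_t idx = sidx; idx < eidx; idx++)
        if (!model_test_and_clear_bit(idx, map))
            return (long)idx;
    return -1;
}
```

With a correctly initialized bitmap, `model_free()` succeeds on the first call for a given range; it only reports a failure if a bit in the range was already clear, which is exactly the condition the kernel treats as fatal.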
So to summarize, we were in a situation where a bitmap is memset to 0xff, but almost immediately afterwards, a function that clears some bits finds that some of the bits are already cleared. Sounds odd, doesn’t it?
We started by double-checking that the address of the bitmap was the same in the initialization function and in the __free function, verifying that the code was not overwriting the bitmap, and ruling out other obvious issues. But everything looked alright. So we simply dumped the bitmap after it was initialized by memset to 0xff, and to our great surprise, we found that the bitmap was in fact initialized with the pattern 0xff00ff00 and not 0xffffffff. This obviously explained why we were hitting this BUG(): the buffer was simply not properly initialized. At first, we really couldn’t believe this: how is it possible that something as essential as memset() in Linux was not doing its job properly?
On the NIOS II platform, memset() has an architecture-specific implementation, available in arch/nios2/lib/memset.c. For buffers smaller than 8 bytes, this memset implementation uses a simple naive loop, iterating byte by byte. For larger buffers, it uses a more optimized implementation written in inline assembly, which writes data in blocks of 4 bytes rather than 1 byte at a time to speed up the memset.
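The two strategies can be sketched in portable C as follows. This is only an illustration of the general technique, not the code in arch/nios2/lib/memset.c: the function names are hypothetical, and the word stores go through a 4-byte `memcpy()` to stay within standard C aliasing rules.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-by-byte fill, comparable to the naive loop used for small buffers. */
static void *sketch_memset_bytes(void *s, int c, size_t count)
{
    unsigned char *p = s;

    while (count--)
        *p++ = (unsigned char)c;
    return s;
}

/* Word-at-a-time fill: align the pointer, store 4 bytes per iteration,
 * then finish the remaining tail byte by byte. */
static void *sketch_memset_words(void *s, int c, size_t count)
{
    unsigned char *p = s;
    /* Replicate the byte into all four bytes of a 32-bit word. */
    uint32_t fill = (uint32_t)(unsigned char)c * 0x01010101u;

    /* Leading bytes until p is 4-byte aligned. */
    while (count > 0 && ((uintptr_t)p & 3) != 0) {
        *p++ = (unsigned char)c;
        count--;
    }
    /* Bulk of the buffer: one 4-byte store per iteration. */
    while (count >= 4) {
        memcpy(p, &fill, sizeof(fill));
        p += 4;
        count -= 4;
    }
    /* Trailing bytes that do not fill a whole word. */
    while (count--)
        *p++ = (unsigned char)c;
    return s;
}
```

Everything in the word-based variant depends on `fill` really containing the byte four times: if that word is wrong, every 4-byte store writes the wrong pattern while the byte-wise head and tail remain correct.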
We quickly tested a workaround that consisted in using the naive implementation for all buffer sizes, and it solved the problem: we had a booting kernel, all the way to the point where it mounts a root filesystem! So clearly, it’s the optimized implementation in assembly that had a bug.
After some investigation, we found out that the bug was in the very first instructions of the assembly code. The following piece of assembly is supposed to create a 4-byte value that repeats 4 times the 1-byte pattern passed as an argument to memset:
/* fill8 %3, %5 (c & 0xff) */
" slli %4, %5, 8\n"
" or %4, %4, %5\n"
" slli %3, %4, 16\n"
" or %3, %3, %4\n"
This code takes as input in %5 the one-byte pattern, and is supposed to return in %3 the 4-byte pattern. It goes through the following logic:
- Stores in %4 the initial pattern shifted left by 8 bits. Provided an initial pattern of 0xff, %4 should now contain 0xff00.
- Does a logical or between %4 and %5, which leads to %4 containing 0xffff.
- Stores in %3 the 2-byte pattern shifted left by 16 bits. %3 should now contain 0xffff0000.
- Does a logical or between %3 and %4, i.e. between 0xffff0000 and 0xffff, which gives the expected 4-byte pattern 0xffffffff.
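The four steps above can be written as plain C, one statement per instruction. The function name `fill8_sketch` is hypothetical, and the explicit masking of the input to one byte is an assumption based on the `(c & 0xff)` comment in the assembly source. The values in the comments are for an input of 0xff.

```c
#include <stdint.h>

/* Replicates the low byte of c into all four bytes of a 32-bit word,
 * mirroring the slli/or/slli/or sequence. Each operand has its own
 * variable, which is what the assembly source intends. */
static uint32_t fill8_sketch(int c)
{
    uint32_t in = (uint32_t)c & 0xff;  /* %5: one-byte pattern (assumed masked) */
    uint32_t tmp;                      /* %4: work register */
    uint32_t out;                      /* %3: result */

    tmp = in << 8;     /* slli %4, %5, 8   gives 0x0000ff00 */
    tmp = tmp | in;    /* or   %4, %4, %5  gives 0x0000ffff */
    out = tmp << 16;   /* slli %3, %4, 16  gives 0xffff0000 */
    out = out | tmp;   /* or   %3, %3, %4  gives 0xffffffff */
    return out;
}
```

Note that the second step only works because `tmp` and `in` are different storage locations: it reads the original byte after `tmp` has already been written.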
When you look at the source code, it looks perfectly fine, so our source code review didn’t spot the problem. However, when looking at the actual compiled code disassembled, we got:
34: 280a923a slli r5,r5,8
38: 294ab03a or r5,r5,r5
3c: 2808943a slli r4,r5,16
40: 2148b03a or r4,r4,r5
Here r5 gets used for both %4 and %5. Due to this, the final pattern stored in r4 is 0xff00ff00 instead of the expected 0xffffffff.
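The effect of the register choice can be reproduced with a small model of a register file, where the caller plays the role of the compiler's register allocator. This is a sketch for illustration only; `run_fill8` and its parameters are hypothetical names, and only the four instructions discussed here are modeled.

```c
#include <stdint.h>

/* Executes the four instructions on a model register file. The arguments
 * out, tmp and in are the register numbers (0..31) assigned to the
 * operands %3, %4 and %5 respectively. */
static uint32_t run_fill8(uint32_t c, int out, int tmp, int in)
{
    uint32_t r[32] = {0};

    r[in] = c & 0xff;            /* input byte already in its register */
    r[tmp] = r[in] << 8;         /* slli %4, %5, 8  */
    r[tmp] = r[tmp] | r[in];     /* or   %4, %4, %5 */
    r[out] = r[tmp] << 16;       /* slli %3, %4, 16 */
    r[out] = r[out] | r[tmp];    /* or   %3, %3, %4 */
    return r[out];
}
```

Calling `run_fill8(0xff, 4, 5, 5)` models the allocation seen in the disassembly (%3 in r4, %4 and %5 both in r5) and yields 0xff00ff00, because the first shift overwrites the input byte before the or can read it. With three distinct registers, for example `run_fill8(0xff, 7, 8, 5)`, the result is 0xffffffff.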
Now, if we take a look at the output operands, %4 is declared with the "=r" constraint, i.e. as an output operand. How can we prevent the compiler from reusing the corresponding register for another operand? As explained in this document, "=r" does not prevent gcc from using the same register for an output operand (%4) and an input operand (%5). By adding the & constraint modifier (in addition to "=r"), we mark the operand as early-clobber: we tell the compiler that this register is written before the assembly code has finished reading its inputs, and so it cannot be shared with an input operand.
With this change, we get the following assembly output:
34: 2810923a slli r8,r5,8
38: 4150b03a or r8,r8,r5
3c: 400e943a slli r7,r8,16
40: 3a0eb03a or r7,r7,r8
Which is much better, and correctly produces the 0xffffffff pattern when 0xff is provided as the initial 1-byte pattern to memset.
In the end, the final patch only adds one character to adjust the inline assembly constraint and gets the proper behavior from gcc:
diff --git a/arch/nios2/lib/memset.c b/arch/nios2/lib/memset.c
index c2cfcb1..2fcefe7 100644
--- a/arch/nios2/lib/memset.c
+++ b/arch/nios2/lib/memset.c
@@ -68,7 +68,7 @@ void *memset(void *s, int c, size_t count)
"=r" (charcnt), /* %1 Output */
"=r" (dwordcnt), /* %2 Output */
"=r" (fill8reg), /* %3 Output */
- "=r" (wrkrega) /* %4 Output */
+ "=&r" (wrkrega) /* %4 Output only */
: "r" (c), /* %5 Input */
"0" (s), /* %0 Input/Output */
"1" (count) /* %1 Input/Output */
This patch was sent upstream to the NIOS II kernel maintainers as "[PATCH v2] nios2: memset: use the right constraint modifier for the %4 output operand", and has already been applied by the NIOS II maintainer.
We were quite surprised to find a bug in some common code for the NIOS II architecture: we were assuming it would have already been tested on enough platforms and with enough compilers/situations to not have such issues. But all in all, it was a fun debugging experience!
It is worth mentioning that in addition to this bug, we found another bug affecting NIOS II platforms, in the asm-generic implementation of the futex_atomic_cmpxchg_inatomic() function, which was causing some preemption imbalance warnings during the futex subsystem initialization. We also sent a patch for this problem, which has also been applied already.