The Dwarves Beneath the Kernel: Forging BTF for eBPF

This blog post is the third installment in our eBPF blog post series, following our posts about eBPF selftests and eBPF trampolines.

In the previous blog post, we discussed how eBPF trampolines are dynamically generated to allow hooking tracing programs to functions’ entry and/or exit. Each trampoline is tailored specifically for the target function on which we want to hook programs: it is then able to read the function context (e.g. function arguments and return value) and to pass it to the hooked programs. However, there is one detail that we did not address: how does the trampoline generator know the exact function layout? To be able to generate trampolines that can read and store the function arguments, the trampoline generator needs two details about each argument: its location (is it in a register, and if so, which one? Or is it on the stack, and if so, at which offset?) and its size. Parsing the function machine code is not enough to learn about those, and even if it were, compiler optimizations would obscure this kind of information even more. What if, besides the actual executable code, the kernel image could carry some data about its internal functions? In this post, we will dive into the DWARF debug information format, and the BPF Type Format (BTF) derived from it to support such a purpose.

The debug sections and the DWARF format

If you take a look at the ELF file for a kernel image (namely: the vmlinux file that you can find at the root directory of the kernel sources after a build) and if you made sure to enable CONFIG_DEBUG_INFO before the build, you will see that your final kernel image contains some additional debug sections:

$ readelf -S vmlinux|grep debug
[43] .debug_aranges PROGBITS 0000000000000000 02c84000
[44] .debug_info PROGBITS 0000000000000000 02cac780
[45] .debug_abbrev PROGBITS 0000000000000000 0f6fb674
[46] .debug_line PROGBITS 0000000000000000 0fc7aeca
[47] .debug_frame PROGBITS 0000000000000000 113af008
[48] .debug_str PROGBITS 0000000000000000 11648098
[49] .debug_line_str PROGBITS 0000000000000000 11a3cbd3
[50] .debug_loclists PROGBITS 0000000000000000 11a51dd2
[51] .debug_rnglists PROGBITS 0000000000000000 13410650

Those sections contain a lot of debug information that, while not being actually loaded and used at runtime by the kernel, can be consumed by external tools, for example by a debugger, or some binutils program like addr2line. Those sections contain debug information arranged in a well-defined format, the DWARF format. This format gathers details to establish a link between the machine code embedded in the ELF file, and the corresponding source code from which it has been generated. It is based on tags, representing basic entities in the code (functions, data structures, etc), and properties inside each of those tags: number of function arguments, argument names, source code location of the definition, and so on.

While the tools mentioned above generally consume the “raw” binary version of this data, we can extract a more readable version of it, for example by using objdump on the generated kernel ELF file:

$ objdump -g vmlinux
[...]
<1><220d342>: Abbrev Number: 92 (DW_TAG_subprogram)
   <220d343> DW_AT_name : (indirect string, offset: 0x10f033):\
     __sys_bpf
   <220d347> DW_AT_decl_file : 1
   <220d347> DW_AT_decl_line : 5990
   <220d349> DW_AT_decl_column : 12
   <220d34a> DW_AT_prototyped : 1
   <220d34a> DW_AT_type : <0x21e467d>
   <220d34e> DW_AT_low_pc : 0xffffffff81464c40
   <220d356> DW_AT_high_pc : 0x2db9
   <220d35e> DW_AT_frame_base : 1 byte block: 9c
     (DW_OP_call_frame_cfa)
   <220d360> DW_AT_call_all_calls: 1
   <220d360> DW_AT_sibling : <0x2213bae>
<2><220d364>: Abbrev Number: 72 (DW_TAG_formal_parameter)
   <220d365> DW_AT_name : cmd
   <220d369> DW_AT_decl_file : 1
   <220d369> DW_AT_decl_line : 5990
   <220d36b> DW_AT_decl_column : 35
   <220d36c> DW_AT_type : <0x21e5bcf>
   <220d370> DW_AT_location : 0x4db54c (location list)
   <220d374> DW_AT_GNU_locviews: 0x4db4ae
<2><220d378>: Abbrev Number: 36 (DW_TAG_formal_parameter)
   <220d379> DW_AT_name : (indirect string, offset: 0xa19e4): uattr
   <220d37d> DW_AT_decl_file : 1
   <220d37d> DW_AT_decl_line : 5990
   <220d37f> DW_AT_decl_column : 49
   <220d380> DW_AT_type : <0x21f6023>
   <220d384> DW_AT_location : 0x4db83d (location list)
   <220d388> DW_AT_GNU_locviews: 0x4db7ef
<2><220d38c>: Abbrev Number: 36 (DW_TAG_formal_parameter)
   <220d38d> DW_AT_name : (indirect string, offset: 0x4f5e8): size
   <220d391> DW_AT_decl_file : 1
   <220d391> DW_AT_decl_line : 5990
   <220d393> DW_AT_decl_column : 69
   <220d394> DW_AT_type : <0x21e4609>
   <220d398> DW_AT_location : 0x4dba98 (location list)
   <220d39c> DW_AT_GNU_locviews: 0x4dba16
[...]

Each tag has a unique identifier, as well as a type. The first tag we see, for example, is a DW_TAG_subprogram tag with the ID 0x220d342. This tag has a DW_AT_name property containing the value __sys_bpf: we then know that we are looking at the prototype for the __sys_bpf function (the entry point for the bpf syscall). The following tags directly relate to this function: they have the type DW_TAG_formal_parameter and thus represent the arguments consumed by the function. We can also observe that some tag properties actually refer to other tags: if we analyze the cmd parameter tag, we see that its type, represented by the DW_AT_type property, has the value 0x21e5bcf. We can navigate further in the DWARF dump and find the corresponding tag:

<1><21e4609>: Abbrev Number: 139 (DW_TAG_base_type)
   <21e460b> DW_AT_byte_size : 4
   <21e460c> DW_AT_encoding : 7 (unsigned)
   <21e460d> DW_AT_name : (indirect string, offset: 0xd1b0):
     unsigned int

[...]

<1><21e5bcf>: Abbrev Number: 77 (DW_TAG_enumeration_type)
   <21e5bd0> DW_AT_name : (indirect string, offset: 0x110b24):
     bpf_cmd
   <21e5bd4> DW_AT_encoding : 7 (unsigned)
   <21e5bd5> DW_AT_byte_size : 4
   <21e5bd5> DW_AT_type : <0x21e4609>
   <21e5bd9> DW_AT_decl_file : 63
   <21e5bda> DW_AT_decl_line : 937
   <21e5bdc> DW_AT_decl_column : 6
   <21e5bdd> DW_AT_sibling : <0x21e5cd2>

We then learn that the cmd parameter is in fact a bpf_cmd enum, encoded in tag 0x21e5bcf, which in turn is represented with an unsigned int, encoded in tag 0x21e4609.
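The walk we just did by hand can be sketched in a few lines of Python. This is only an illustration: the dictionary below is a hand-copied subset of the objdump excerpt, and the names DIES and type_chain are hypothetical helpers, not part of any DWARF library.

```python
# Tags copied by hand from the objdump excerpt, keyed by their ID.
# Only the attributes needed for the walk are kept.
DIES = {
    0x220d364: {"tag": "DW_TAG_formal_parameter", "name": "cmd",
                "type": 0x21e5bcf},
    0x21e5bcf: {"tag": "DW_TAG_enumeration_type", "name": "bpf_cmd",
                "byte_size": 4, "type": 0x21e4609},
    0x21e4609: {"tag": "DW_TAG_base_type", "name": "unsigned int",
                "byte_size": 4},
}


def type_chain(dies, offset):
    """Follow DW_AT_type references from `offset` until a tag has none."""
    chain = []
    while offset is not None:
        die = dies[offset]
        chain.append((die["tag"], die["name"]))
        offset = die.get("type")  # None when the tag has no DW_AT_type
    return chain


for tag, name in type_chain(DIES, 0x220d364):
    print(f"{tag}: {name}")
```

Starting from the cmd parameter, the loop visits the bpf_cmd enumeration and stops at the unsigned int base type, which has no further DW_AT_type reference.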

Among all those details, we can also spot some DW_AT_decl_file, DW_AT_decl_line and DW_AT_decl_column attributes: they describe the declaration location of the element encoded by the parent tag. This is how userspace tools such as debuggers are able, for example, to understand where to insert a breakpoint when the provided syntax involves a source code location!

There are many other details that can be encoded in the DWARF format, and the curious reader eager to learn more can take a look at the official DWARF specification to get familiar with the role of each tag and property. In any case, it seems that this format describes the program with enough details to answer parts of our initial challenge (allowing our JIT compilers to know about the function layout to properly generate eBPF trampolines). There is still one major issue though with the DWARF debug format, which prevents us from using it at runtime:

$ readelf -S vmlinux
Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .text             PROGBITS         ffffffff81000000  00200000
       0000000000f97db0  0000000000000000  AX       0     0     4096
[...]
  [44] .debug_info       PROGBITS         0000000000000000  02cbb980
       000000000cb38291  0000000000000000           0     0     1

The .debug_info section alone in a basic x86_64 vmlinux file is as large as 203MB (size 0xcb38291), which is 13 times the size of the code section (.text, size 0xf97db0)! While this does not sound so high for standard laptop and desktop machines, many constrained systems cannot afford it, because of the cost on disk and/or RAM. We then need an intermediate format that is lean enough to be embedded in the final image (and loaded at runtime), while conveying enough information to enable trampoline generation: that’s where the BTF information comes into play.
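For readers who want to check these figures, here is a small Python sketch that converts the hexadecimal Size fields from the readelf output into megabytes. The two values are copied from the dump; the helper name section_size is made up for this example.

```python
MIB = 1024 * 1024


def section_size(hex_size):
    """Convert the hexadecimal Size field printed by readelf -S to bytes."""
    return int(hex_size, 16)


debug_info = section_size("000000000cb38291")  # .debug_info
text = section_size("0000000000f97db0")        # .text

print(f".debug_info: {debug_info / MIB:.1f} MiB")  # 203.2 MiB
print(f".text:       {text / MIB:.1f} MiB")        # 15.6 MiB
print(f"ratio:       {debug_info / text:.1f}")     # 13.0
```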

The BTF data format

The BPF Type Format, or BTF, is a binary data format that has been created to improve the eBPF developer experience. It encodes the details of the software data types in a form small enough to be loaded and manipulated at runtime, which lets developers bring a new variety of features on top of the existing eBPF infrastructure: smarter APIs to write programs more easily, better debugging output (for example, source code hints rather than raw eBPF bytecode in verifier dumps), new attach types for eBPF programs… This last point is actually the one involving our trampolines: by reading the corresponding BTF data, the kernel is able to know about the layout and expectations of the target function on which we want to hook a program, and so it is able to generate the properly tailored trampoline to transfer execution to the eBPF program(s) to trigger on the function execution.
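To make "binary data format" a bit more concrete, here is a minimal Python sketch that decodes the fixed-size header found at the start of a BTF blob, following the struct btf_header layout described in the kernel BTF documentation. It assumes little-endian data and runs on a synthetic header built on the spot; parse_btf_header is a hypothetical helper name.

```python
import struct

BTF_MAGIC = 0xEB9F

# struct btf_header: magic (u16), version (u8), flags (u8), then
# hdr_len, type_off, type_len, str_off, str_len (all u32).
# Little-endian is an assumption of this sketch.
HEADER = struct.Struct("<HBBIIIII")
FIELDS = ("magic", "version", "flags", "hdr_len",
          "type_off", "type_len", "str_off", "str_len")


def parse_btf_header(blob):
    """Decode the header at the start of a raw BTF blob into a dict."""
    values = HEADER.unpack_from(blob)
    header = dict(zip(FIELDS, values))
    if header["magic"] != BTF_MAGIC:
        raise ValueError(f"not BTF data (magic={header['magic']:#x})")
    return header


# Synthetic header for demonstration: 48 bytes of types, 16 of strings.
sample = HEADER.pack(BTF_MAGIC, 1, 0, HEADER.size, 0, 48, 48, 16)
print(parse_btf_header(sample))
```

The type and string offsets are relative to the end of the header. On a kernel built with BTF support, the same function should be usable on the first bytes of /sys/kernel/btf/vmlinux, though that is not exercised here.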

How is this information actually generated and embedded in the kernel? The answer lies in the kernel build system, at the very end of a build. The link-vmlinux.sh script checks whether CONFIG_DEBUG_INFO_BTF is enabled, and if so, it calls the gen_btf function, which will in turn call the pahole program to generate the BTF data. This data is not generated ex nihilo: it is derived from the DWARF data previously generated by the compiler. The data generated by pahole is then appended to the kernel image, in dedicated sections.

pahole (“poke a hole”) is a tool initially designed to print and manipulate data structure layouts: it is for example able to identify and apply cache line usage optimizations in C structure layouts. To perform such a task, it parses the DWARF information from a binary to learn about the size and location of each data structure. As it is already fluent in DWARF, this tool has been extended with BTF data generation capability. One can for example manually generate and attach BTF data to a kernel image with the following command:

pahole -J -j \
--btf_features=encode_force \
--btf_features=var \
--btf_features=float \
--btf_features=enum64 \
--btf_features=decl_tag \
--btf_features=type_tag \
--btf_features=optimized_func \
--btf_features=consistent_func \
--btf_features=decl_tag_kfuncs \
--btf_features=attributes \
--lang_exclude=rust \
vmlinux

The core option in this long command line is the -J parameter, which enables BTF information generation. Most of the other parameters enable specific features (these flags exist to manage compatibility between specific kernel and pahole versions). The generated data won’t be visible in the console, as it is directly appended to the final kernel image:

$ readelf -S vmlinux
Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
[...]
  [15] .BTF              PROGBITS         ffffffff82585000  01785000
       000000000046a644  0000000000000000   A       0     0     1

One can still get a human-readable dump of the BTF data, for example with bpftool:

$ bpftool btf dump file vmlinux
[...]
[21] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED
[...]
[12571] ENUM 'bpf_cmd' encoding=UNSIGNED size=4 vlen=40
'BPF_MAP_CREATE' val=0
'BPF_MAP_LOOKUP_ELEM' val=1
'BPF_MAP_UPDATE_ELEM' val=2
'BPF_MAP_DELETE_ELEM' val=3
'BPF_MAP_GET_NEXT_KEY' val=4
'BPF_PROG_LOAD' val=5
'BPF_OBJ_PIN' val=6
'BPF_OBJ_GET' val=7
'BPF_PROG_ATTACH' val=8
[...]
[43497] FUNC_PROTO '(anon)' ret_type_id=21 vlen=3
        'cmd' type_id=12571
        'uattr' type_id=11892
        'size' type_id=9
[43498] FUNC '__sys_bpf' type_id=43497 linkage=static

We see that this format still contains plenty of useful information that can be used to generate eBPF trampolines able to parse and transfer function arguments, while being far more compact than DWARF: the previous readelf dump shows that the .BTF section is about 4.4MB (size 0x46a644), far less than the 203MB of DWARF data measured earlier. This makes it tolerable to embed BTF data in the binary image that will run on the system. We also see the same kind of mechanism as in DWARF, with every data type identified by an ID (the numbers between square brackets), and data types referring to each other to allow de-duplicating the information.
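The ID-based cross-references can be illustrated with a short Python sketch. The dictionary is a hand-copied subset of the bpftool dump; types 11892 and 9 are referenced by the prototype but do not appear in the excerpt, so the sketch leaves them unresolved rather than guessing. BTF_TYPES, describe and describe_func are names invented for this example.

```python
# Entries copied by hand from the bpftool dump, keyed by type ID.
BTF_TYPES = {
    21: {"kind": "INT", "name": "int", "size": 4},
    12571: {"kind": "ENUM", "name": "bpf_cmd", "size": 4},
    43497: {"kind": "FUNC_PROTO", "ret_type_id": 21,
            "params": [("cmd", 12571), ("uattr", 11892), ("size", 9)]},
    43498: {"kind": "FUNC", "name": "__sys_bpf", "type_id": 43497},
}


def describe(types, type_id):
    """Render one sized type, or a placeholder if it is not in the excerpt."""
    entry = types.get(type_id)
    if entry is None:
        return f"<type {type_id}, not in excerpt>"
    return f"{entry['kind']} '{entry['name']}' ({entry['size']} bytes)"


def describe_func(types, func_id):
    """Resolve FUNC -> FUNC_PROTO -> return and parameter types."""
    func = types[func_id]
    proto = types[func["type_id"]]
    lines = [f"{func['name']} returns {describe(types, proto['ret_type_id'])}"]
    for name, type_id in proto["params"]:
        lines.append(f"  {name}: {describe(types, type_id)}")
    return lines


print("\n".join(describe_func(BTF_TYPES, 43498)))
```

Each lookup is a plain ID dereference, which is what lets a single 'int' entry be shared by every function or structure that uses it.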

The kernel is able to use BTF data when manipulating tracing programs thanks to the following steps:

Thanks to this whole process, eBPF users can attach tracing programs to arbitrary functions in the kernel, without having to care about low-level details like argument locations: the kernel resolves those at program load/attach time thanks to BTF data.

A major concern: generating correct BTF data

While this whole mechanism allows eBPF users to easily write and use tracing programs, it makes generating correct BTF data essential. For most functions in the kernel, generating BTF data is straightforward, the main challenge being to correctly interpret the calling convention of the platform expected to run the eBPF programs. For example, if we consider a simple kernel function running on an x86_64 machine and consuming 3 integer values as parameters, we can check the System V ABI (implemented by the kernel for x86 platforms) and learn that the trampoline will have to retrieve those arguments respectively from registers %rdi, %rsi and %rdx. If the trampoline is also generated for a fexit program, it will be able to capture the function return value by reading the %rax register just after the function execution. But there are of course many less trivial cases:

  • For functions consuming a lot of arguments, we may not have enough registers to pass them all: some of them must then be passed on the stack.
  • The number and size of registers available for argument passing vary between architectures.
  • Some architectures will fill as many registers as possible with arguments, some others will stop filling registers and pass the rest on the stack as soon as one argument does not fit in a register. The former case leads to a kind of argument reordering, since a later argument can land in a register while an earlier one sits on the stack.
  • When arguments are passed on the stack, they must respect some alignment constraints dictated by the target architecture.
  • Developers sometimes customize the generated code with compiler attributes, like __attribute__((packed)) or __attribute__((aligned(X))): those will alter the corresponding function arguments location and/or size.
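The simple case and the first bullet can be sketched in Python as follows. This is a deliberately simplified model of the x86_64 System V convention: it assumes every argument is an integer or pointer of at most 8 bytes, and ignores structures, floating-point values and the other corner cases listed above. The function name locate_int_args is hypothetical.

```python
# Integer/pointer argument registers of the x86_64 System V convention.
INT_ARG_REGS = ["%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9"]


def locate_int_args(names):
    """Map each argument name to a register or a stack location.

    Simplification: every argument is an integer or pointer of at most
    8 bytes. Stack offsets are relative to %rsp at function entry, where
    the return address occupies the first 8 bytes.
    """
    locations = {}
    stack_offset = 8  # skip the return address
    for index, name in enumerate(names):
        if index < len(INT_ARG_REGS):
            locations[name] = INT_ARG_REGS[index]
        else:
            locations[name] = f"{stack_offset}(%rsp)"
            stack_offset += 8
    return locations


print(locate_int_args(["cmd", "uattr", "size"]))
print(locate_int_args([f"arg{i}" for i in range(8)]))
```

With three arguments everything fits in registers; with eight, the last two spill to the stack, which is the kind of detail a trampoline has to get right.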

The case of compiler attributes is an example of a major challenge when generating BTF data: while those alterations are properly conveyed by the DWARF information, they are lost when generating BTF data, so we are clueless about the exact location of some specific arguments. One could argue that the BTF format should be updated to include those details, but that would slowly turn it into a new kind of DWARF format, with the same shortcomings. Unfortunately, any uncertainty about an argument location is unacceptable: it means that the generated trampoline could read the wrong memory, which in turn would lead to a variety of random bugs and crashes.
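To get a feel for how much an attribute like __attribute__((packed)) can change a layout, here is a rough Python sketch of C-like structure layout rules. It is not what pahole or the kernel does: the layout function and the example structure (a 1-byte flag followed by an 8-byte value) are hypothetical, and the model only covers natural alignment versus fully packed.

```python
def round_up(value, align):
    """Round `value` up to the next multiple of `align`."""
    return (value + align - 1) // align * align


def layout(fields, packed=False):
    """Return ({name: offset}, total_size) for a C-like struct.

    `fields` is a list of (name, size, alignment) tuples. With
    packed=True every alignment is treated as 1, which models the
    effect of __attribute__((packed)).
    """
    offsets = {}
    offset = 0
    max_align = 1
    for name, size, align in fields:
        if packed:
            align = 1
        offset = round_up(offset, align)  # insert padding if needed
        offsets[name] = offset
        offset += size
        max_align = max(max_align, align)
    return offsets, round_up(offset, max_align)  # tail padding


# Hypothetical struct: a 1-byte flag followed by an 8-byte value.
FIELDS = [("flag", 1, 1), ("value", 8, 8)]
print(layout(FIELDS))               # ({'flag': 0, 'value': 8}, 16)
print(layout(FIELDS, packed=True))  # ({'flag': 0, 'value': 1}, 9)
```

The same two fields take 16 bytes with natural alignment and 9 bytes when packed, so code that assumes one layout while the compiler used the other would read the value field at the wrong offset.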

As part of the effort funded by the eBPF Foundation, we had the opportunity at Bootlin to participate in some interesting discussions and to contribute fixes to properly address such cases:

As the eBPF framework keeps evolving, new corner cases like this one will likely continue to appear, and so both the kernel and the related tools (pahole) will likely need to continue receiving updates to handle those in the future.

Alexis Lothoré

Author: Alexis Lothoré

Alexis has worked at Bootlin as an embedded Linux engineer since 2023. He has packaged full Linux distributions for a variety of devices, mostly IoT devices.
