This blog post is the third installment in our eBPF blog post series, following our posts about eBPF selftests and eBPF trampolines.
In the previous blog post, we discussed how eBPF trampolines are dynamically generated to allow hooking tracing programs to the entry and/or exit of functions. Each trampoline is tailored specifically to the target function on which we want to hook programs: it is then able to read the function context (e.g. function arguments and return value) and pass it to the hooked programs. However, there is one detail that we did not address: how does the trampoline generator know the exact layout of the function? To generate trampolines that can read and store the function arguments, the trampoline generator needs many details about each argument: its location (is it in a register, and if so, which one? Or is it on the stack, and if so, at which offset?) and its size. Parsing the function’s machine code is not enough to learn about those, and even if it were, compiler optimizations would obscure this kind of information even more. What if, besides the actual executable code, the kernel image could carry some data about its internal functions? In this post, we will dive into the DWARF debug information format, and the BPF Type Format (BTF) derived from it to serve this purpose.
The debug sections and the DWARF format
If you take a look at the ELF file for a kernel image (namely: the vmlinux file that you can find at the root directory of the kernel sources after a build) and if you made sure to enable CONFIG_DEBUG_INFO before the build, you will see that your final kernel image contains some additional debug sections:
$ readelf -S vmlinux | grep debug
  [43] .debug_aranges    PROGBITS         0000000000000000  02c84000
  [44] .debug_info       PROGBITS         0000000000000000  02cac780
  [45] .debug_abbrev     PROGBITS         0000000000000000  0f6fb674
  [46] .debug_line       PROGBITS         0000000000000000  0fc7aeca
  [47] .debug_frame      PROGBITS         0000000000000000  113af008
  [48] .debug_str        PROGBITS         0000000000000000  11648098
  [49] .debug_line_str   PROGBITS         0000000000000000  11a3cbd3
  [50] .debug_loclists   PROGBITS         0000000000000000  11a51dd2
  [51] .debug_rnglists   PROGBITS         0000000000000000  13410650
Those sections contain a lot of debug information that, while not being actually loaded and used at runtime by the kernel, can be consumed by external tools, for example by a debugger, or some binutils program like addr2line. Those sections contain debug information arranged in a well-defined format, the DWARF format. This format gathers details to establish a link between the machine code embedded in the ELF file, and the corresponding source code from which it has been generated. It is based on tags, representing basic entities in the code (functions, data structures, etc), and properties inside each of those tags: number of function arguments, argument names, source code location of the definition, and so on.
While the tools mentioned above generally consume the “raw” binary version of this data, we can extract a more readable version of it, for example by using objdump on the generated kernel ELF file:
$ objdump -g vmlinux
[...]
 <1><220d342>: Abbrev Number: 92 (DW_TAG_subprogram)
    <220d343>   DW_AT_name        : (indirect string, offset: 0x10f033): __sys_bpf
    <220d347>   DW_AT_decl_file   : 1
    <220d347>   DW_AT_decl_line   : 5990
    <220d349>   DW_AT_decl_column : 12
    <220d34a>   DW_AT_prototyped  : 1
    <220d34a>   DW_AT_type        : <0x21e467d>
    <220d34e>   DW_AT_low_pc      : 0xffffffff81464c40
    <220d356>   DW_AT_high_pc     : 0x2db9
    <220d35e>   DW_AT_frame_base  : 1 byte block: 9c (DW_OP_call_frame_cfa)
    <220d360>   DW_AT_call_all_calls: 1
    <220d360>   DW_AT_sibling     : <0x2213bae>
 <2><220d364>: Abbrev Number: 72 (DW_TAG_formal_parameter)
    <220d365>   DW_AT_name        : cmd
    <220d369>   DW_AT_decl_file   : 1
    <220d369>   DW_AT_decl_line   : 5990
    <220d36b>   DW_AT_decl_column : 35
    <220d36c>   DW_AT_type        : <0x21e5bcf>
    <220d370>   DW_AT_location    : 0x4db54c (location list)
    <220d374>   DW_AT_GNU_locviews: 0x4db4ae
 <2><220d378>: Abbrev Number: 36 (DW_TAG_formal_parameter)
    <220d379>   DW_AT_name        : (indirect string, offset: 0xa19e4): uattr
    <220d37d>   DW_AT_decl_file   : 1
    <220d37d>   DW_AT_decl_line   : 5990
    <220d37f>   DW_AT_decl_column : 49
    <220d380>   DW_AT_type        : <0x21f6023>
    <220d384>   DW_AT_location    : 0x4db83d (location list)
    <220d388>   DW_AT_GNU_locviews: 0x4db7ef
 <2><220d38c>: Abbrev Number: 36 (DW_TAG_formal_parameter)
    <220d38d>   DW_AT_name        : (indirect string, offset: 0x4f5e8): size
    <220d391>   DW_AT_decl_file   : 1
    <220d391>   DW_AT_decl_line   : 5990
    <220d393>   DW_AT_decl_column : 69
    <220d394>   DW_AT_type        : <0x21e4609>
    <220d398>   DW_AT_location    : 0x4dba98 (location list)
    <220d39c>   DW_AT_GNU_locviews: 0x4dba16
[...]
Each tag has a unique identifier, as well as a type. The first tag we see, for example, is a DW_TAG_subprogram tag with the ID 0x220d342. This tag has a DW_AT_name property containing the value __sys_bpf: we then know that we are looking at the prototype for the __sys_bpf function (the entry point for the bpf syscall). The following tags directly relate to this function: they have the type DW_TAG_formal_parameter and thus represent the arguments consumed by the function. We can also observe that some tag properties actually refer to other tags: if we analyze the cmd parameter tag, we see that its type, represented by the DW_AT_type property, has the value 0x21e5bcf. We can navigate further in the DWARF dump and find the corresponding tag:
 <1><21e4609>: Abbrev Number: 139 (DW_TAG_base_type)
    <21e460b>   DW_AT_byte_size   : 4
    <21e460c>   DW_AT_encoding    : 7 (unsigned)
    <21e460d>   DW_AT_name        : (indirect string, offset: 0xd1b0): unsigned int
[...]
 <1><21e5bcf>: Abbrev Number: 77 (DW_TAG_enumeration_type)
    <21e5bd0>   DW_AT_name        : (indirect string, offset: 0x110b24): bpf_cmd
    <21e5bd4>   DW_AT_encoding    : 7 (unsigned)
    <21e5bd5>   DW_AT_byte_size   : 4
    <21e5bd5>   DW_AT_type        : <0x21e4609>
    <21e5bd9>   DW_AT_decl_file   : 63
    <21e5bda>   DW_AT_decl_line   : 937
    <21e5bdc>   DW_AT_decl_column : 6
    <21e5bdd>   DW_AT_sibling     : <0x21e5cd2>
We then learn that the cmd parameter is in fact a bpf_cmd enum, encoded in tag 0x21e5bcf, which in turn is represented with an unsigned int, encoded in tag 0x21e4609.
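The way tags reference each other can be illustrated with a small sketch. It is not a DWARF parser: the three entries below are copied by hand from the dumps above, and the dictionary layout and the type_chain helper are assumptions made purely for illustration.

```python
# Minimal in-memory model of the three tags discussed above.
# Keys are the tag IDs from the dump; only a few properties are kept.
DIES = {
    0x220D364: {"tag": "DW_TAG_formal_parameter", "name": "cmd", "type": 0x21E5BCF},
    0x21E5BCF: {"tag": "DW_TAG_enumeration_type", "name": "bpf_cmd",
                "byte_size": 4, "type": 0x21E4609},
    0x21E4609: {"tag": "DW_TAG_base_type", "name": "unsigned int", "byte_size": 4},
}


def type_chain(dies, offset):
    """Follow DW_AT_type references until a tag without one is reached."""
    chain = []
    seen = set()  # guards against reference loops in malformed input
    while offset is not None and offset not in seen:
        seen.add(offset)
        die = dies[offset]
        chain.append((offset, die["tag"], die["name"]))
        offset = die.get("type")
    return chain


for offset, tag, name in type_chain(DIES, 0x220D364):
    print(f"<{offset:x}> {tag}: {name}")
```

Starting from the cmd parameter, the loop visits the bpf_cmd enumeration and stops at the unsigned int base type, which has no DW_AT_type of its own.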
Among all those details, we can also spot some DW_AT_decl_file, DW_AT_decl_line and DW_AT_decl_column properties: those attributes describe the declaration location of the element encoded by the parent tag. This is how userspace tools such as debuggers are able, for example, to work out where to insert a breakpoint when the provided syntax involves a source code location!
There are many other details that can be encoded in the DWARF format, and the curious reader eager to learn more can take a look at the official DWARF specification to get familiar with the role of each tag and property. In any case, it seems that this format describes the program with enough details to answer parts of our initial challenge (allowing our JIT compilers to know about the function layout to properly generate eBPF trampolines). There is still one major issue though with the DWARF debug format, which prevents us from using it at runtime:
$ readelf -S vmlinux
Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .text             PROGBITS         ffffffff81000000  00200000
       0000000000f97db0  0000000000000000  AX       0     0     4096
[...]
  [44] .debug_info       PROGBITS         0000000000000000  02cbb980
       000000000cb38291  0000000000000000           0     0     1
The .debug_info section alone in a basic x86_64 vmlinux file is about 203 MiB large (0xcb38291 bytes). This is 13 times the size of the code section (0xf97db0 bytes, about 15.6 MiB)! Unfortunately, while this value does not sound so high for standard laptop and desktop machines, it cannot be afforded on many constrained systems, because of the cost on disk and/or in RAM. We then need an intermediate format that is lean enough to be embedded in the final image (and loaded at runtime), while conveying enough information to enable trampoline generation: that’s where the BTF information comes into play.
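For readers who want to check the arithmetic, the figures can be recomputed from the hexadecimal Size column of the readelf output above. This is only a quick sketch; the variable names are ours.

```python
MIB = 1024 * 1024

# Section sizes copied from the readelf output above (hexadecimal byte counts).
text_size = 0xF97DB0
debug_info_size = 0xCB38291

print(f".text:       {text_size / MIB:.1f} MiB")            # 15.6 MiB
print(f".debug_info: {debug_info_size / MIB:.1f} MiB")      # 203.2 MiB
print(f"ratio:       {debug_info_size / text_size:.1f}x")   # 13.0x
```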
The BTF data format
The BPF Type Format, or BTF, is a binary data format that has been created to improve the eBPF developer experience: by encoding the software data types details into a format that leads to data small enough to be loaded and manipulated at runtime, developers can bring a new variety of features on top of the existing eBPF infrastructure: smarter APIs to write programs more easily, better debugging output (for example, source code hints rather than raw eBPF bytecode in verifier dumps), new attach types for eBPF programs… This last point is actually the one involving our trampolines: by reading the corresponding BTF data, the kernel is able to know about the layout and expectations of the target function on which we want to hook a program, and so it is able to generate the properly tailored trampoline to transfer execution to the eBPF program(s) to trigger on the function execution.
How is this information actually generated and embedded in the kernel? The answer lies in the kernel build system and is triggered at the end of a build. The link-vmlinux.sh script checks whether CONFIG_DEBUG_INFO_BTF is enabled, and if so, it calls the gen_btf function, which will in turn call the pahole program to generate the BTF data. This data is not generated ex nihilo: it is derived from DWARF data previously generated by our compiler. The data generated by pahole is then appended to the kernel image, in dedicated sections.
pahole (“poke a hole”) is a tool initially designed to print and manipulate data structure layouts: it is for example able to identify and apply cache line usage optimizations in C structure layouts. To perform such a task, it parses the DWARF information from a binary to learn about the size and location of each data structure. As it is already fluent in DWARF, this tool was then extended with BTF generation capability. One can for example manually generate BTF data and attach it to a kernel image with the following command:
pahole -J -j \
    --btf_features=encode_force \
    --btf_features=var \
    --btf_features=float \
    --btf_features=enum64 \
    --btf_features=decl_tag \
    --btf_features=type_tag \
    --btf_features=optimized_func \
    --btf_features=consistent_func \
    --btf_features=decl_tag_kfuncs \
    --btf_features=attributes \
    --lang_exclude=rust \
    vmlinux
The core option in this long command line is the -J parameter, which enables BTF generation. Most of the other parameters enable specific BTF features; they exist to manage compatibility between specific kernel and pahole versions. The generated data won’t be visible directly in the console, as it is appended directly to the final kernel image:
$ readelf -S vmlinux
Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
[...]
  [15] .BTF              PROGBITS         ffffffff82585000  01785000
       000000000046a644  0000000000000000   A       0     0     1
One can still get a human-readable dump of the BTF data, for example with bpftool:
$ bpftool btf dump file vmlinux
[...]
[21] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED
[...]
[12571] ENUM 'bpf_cmd' encoding=UNSIGNED size=4 vlen=40
    'BPF_MAP_CREATE' val=0
    'BPF_MAP_LOOKUP_ELEM' val=1
    'BPF_MAP_UPDATE_ELEM' val=2
    'BPF_MAP_DELETE_ELEM' val=3
    'BPF_MAP_GET_NEXT_KEY' val=4
    'BPF_PROG_LOAD' val=5
    'BPF_OBJ_PIN' val=6
    'BPF_OBJ_GET' val=7
    'BPF_PROG_ATTACH' val=8
    [...]
[43497] FUNC_PROTO '(anon)' ret_type_id=21 vlen=3
    'cmd' type_id=12571
    'uattr' type_id=11892
    'size' type_id=9
[43498] FUNC '__sys_bpf' type_id=43497 linkage=static
We see that this format still contains plenty of useful information that can be used to generate eBPF trampolines that are able to parse and transfer function arguments, while being far more compact than DWARF: the previous readelf dump shows that the BTF data size is a bit less than 5 MiB (0x46a644 bytes), which is much better than the 203 MiB of DWARF data measured earlier. This makes it tolerable to embed BTF data in the binary image that will run on the system. We also see the same kind of mechanism seen in DWARF, with all data types represented with an ID (the numbers between square brackets), and different data types referring to each other to allow de-duplicating the information.
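To illustrate how these IDs chain together, here is a minimal sketch that parses a few lines of the textual dump and walks from the __sys_bpf FUNC entry to its prototype. It assumes the text layout bpftool prints (one type per line, parameters on indented continuation lines); the regular expressions and the parse_dump helper are ours, not part of any tool, and only cover the handful of line shapes shown here.

```python
import re

# Excerpt of the dump above: one type per line, parameters on indented lines.
DUMP = """\
[21] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED
[43497] FUNC_PROTO '(anon)' ret_type_id=21 vlen=3
\t'cmd' type_id=12571
\t'uattr' type_id=11892
\t'size' type_id=9
[43498] FUNC '__sys_bpf' type_id=43497 linkage=static
"""

TYPE_RE = re.compile(r"^\[(\d+)\] (\w+) '([^']*)'(.*)$")
PARAM_RE = re.compile(r"^\s+'([^']*)' type_id=(\d+)")


def parse_dump(text):
    """Return {type_id: {"kind", "name", "attrs", "params"}} for the lines we understand."""
    types = {}
    current = None
    for line in text.splitlines():
        m = TYPE_RE.match(line)
        if m:
            # The remainder of a type line is a list of key=value pairs.
            attrs = dict(kv.split("=", 1) for kv in m.group(4).split())
            current = {"kind": m.group(2), "name": m.group(3),
                       "attrs": attrs, "params": []}
            types[int(m.group(1))] = current
            continue
        m = PARAM_RE.match(line)
        if m and current is not None:
            current["params"].append((m.group(1), int(m.group(2))))
    return types


types = parse_dump(DUMP)
func = next(t for t in types.values()
            if t["kind"] == "FUNC" and t["name"] == "__sys_bpf")
proto = types[int(func["attrs"]["type_id"])]
ret = types[int(proto["attrs"]["ret_type_id"])]
print(ret["name"], func["name"], proto["params"])
```

The FUNC entry only carries a name and a reference to its FUNC_PROTO, which in turn references the return type and one type per parameter: each type is described once and shared by every entry that uses it.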
The kernel is able to use BTF data when manipulating tracing programs thanks to the following steps:
- The user first loads their eBPF program into the kernel through the BPF system call. When the kernel handles the BPF_PROG_LOAD sub-command, it checks whether the attach type of the loaded program needs BTF data, and if so, it parses and loads its own BTF data from the relevant section of the kernel binary loaded in memory. For tracing programs, it also checks that a BTF ID representing the targeted attach point has been provided.
- The kernel will then use its own BTF data and the passed BTF ID to derive a “function model” representing the target function: this model contains for example the list of arguments consumed by the function and the location of each of those.
- Next, the user will request to attach the program to the target function. The trampoline generator is called: it uses the function model to generate the relevant trampoline able to call the BPF program when the target function is executed. The trampoline is then attached to the function entry.
Thanks to this whole process, eBPF users can attach tracing programs to arbitrary functions in the kernel, without having to care about low level details like arguments locations: the kernel will resolve those at program load/attach time thanks to BTF data.
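The “function model” step can be pictured with a toy sketch. It is loosely inspired by the kernel's struct btf_func_model but is not a transcription of it: the FuncModel class, the distill helper, the type table and the demo_func prototype are all hypothetical, and the sizes assume a 64-bit target.

```python
from dataclasses import dataclass

# Hypothetical, hand-written type table: {type_id: (kind, name, size_in_bytes)}.
TYPES = {
    1: ("INT", "int", 4),
    2: ("INT", "long", 8),
    3: ("PTR", "(anon)", 8),
}

# Hypothetical prototype: int demo_func(int a, long b, void *c)
PROTO = {"ret_type_id": 1, "params": [("a", 1), ("b", 2), ("c", 3)]}


@dataclass
class FuncModel:
    """What a trampoline generator needs to know about the target function."""
    ret_size: int
    arg_names: list
    arg_sizes: list

    @property
    def nr_args(self):
        return len(self.arg_sizes)


def distill(proto, types):
    """Build a FuncModel from a prototype by looking up each type's size."""
    names = [name for name, _ in proto["params"]]
    sizes = [types[type_id][2] for _, type_id in proto["params"]]
    return FuncModel(types[proto["ret_type_id"]][2], names, sizes)


model = distill(PROTO, TYPES)
print(model.nr_args, model.arg_sizes, model.ret_size)  # 3 [4, 8, 8] 4
```

The model deliberately keeps only sizes and counts: turning them into concrete registers or stack offsets is left to the architecture-specific trampoline generator.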
A major concern: generating correct BTF data
While this whole mechanism allows eBPF users to easily write and use tracing programs, it makes generating correct BTF data essential. For most functions in the kernel, generating BTF data is straightforward, the main challenge being to correctly interpret the calling convention of the platform expected to run the eBPF programs. For example, if we consider a simple kernel function running on an x86_64 machine and taking three integer parameters, we can check the System V ABI (implemented by the kernel for x86 platforms) and learn that the trampoline will have to retrieve those arguments from registers %rdi, %rsi and %rdx respectively. If the trampoline is also generated for a fexit program, it will be able to capture the function return value by reading the %rax register just after the function execution. But there are of course many less trivial cases:
- For functions consuming a lot of arguments, we may not have enough registers to pass them all: some of them must then be passed on the stack.
- The number and size of registers available for argument passing vary between architectures.
- Some architectures will fill as many registers as possible with arguments, while others will stop filling registers and pass the rest on the stack as soon as one argument does not fit in a register. The former case leads to some kind of “arguments un-ordering”.
- When arguments are passed on the stack, they must respect some alignment constraints dictated by the target architecture.
- Developers sometimes customize the generated code with compiler attributes such as __attribute__((packed)) or __attribute__((aligned(X))): these alter the location and/or size of the corresponding function arguments.
This last point is an example of a major challenge when generating BTF data: while those alterations are properly conveyed by the DWARF information, they are lost when generating BTF data; we are then clueless about the exact location of some specific arguments (one could argue that the BTF format should then be updated to include those details, but it would then slowly turn it into some new kind of DWARF format, and so bring the same shortcomings). Unfortunately, having uncertainties about an argument location is unacceptable: it means that the generated trampoline could read some wrong memory, which in turn would lead to a variety of random bugs and crashes.
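To make the calling-convention discussion above more concrete, here is a minimal sketch of the easy case only: integer arguments under the x86_64 System V convention, where the first six go in registers and the rest are spilled to the stack. The assign_int_args helper is ours, and it deliberately ignores structures, floating-point values and packed or aligned types, which are precisely the cases that make things difficult.

```python
# Integer argument registers of the x86_64 System V calling convention, in order.
INT_ARG_REGS = ["%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9"]


def assign_int_args(nr_args):
    """Return the location of each integer argument as seen at function entry."""
    locations = []
    for index in range(nr_args):
        if index < len(INT_ARG_REGS):
            locations.append(INT_ARG_REGS[index])
        else:
            # At entry, (%rsp) holds the return address; stack arguments
            # follow it in 8-byte slots.
            offset = 8 * (index - len(INT_ARG_REGS) + 1)
            locations.append(f"{offset}(%rsp)")
    return locations


print(assign_int_args(3))  # ['%rdi', '%rsi', '%rdx']
print(assign_int_args(8))
```

With eight arguments, the last two no longer fit in registers and end up at 8(%rsp) and 16(%rsp). Even this simple rule is architecture-specific, and it stops being sufficient as soon as an argument is a structure whose layout has been altered.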
As part of the effort funded by the eBPF Foundation, we had the opportunity at Bootlin to participate in some interesting discussions and to contribute fixes to properly address such cases:
- The issue was first reported on ARM64, and handled by refusing to attach programs to functions consuming structures passed on the stack.
- The issue is in fact not specific to ARM64 and affects all architectures supporting eBPF trampolines. A new series was sent to apply the same kind of constraint to all the affected JIT compilers. But the outcome was not very satisfying, and maintainers instead suggested not encoding any BTF information at all for those problematic functions, to prevent any faulty trampoline generation.
- Not encoding the details of those functions is then something to be addressed in pahole, for which a new series was submitted. This series went through multiple revisions, as the first one took an overly simplistic approach (skipping any function consuming a struct passed on the stack): the merged version actually relies on an algorithm already present in pahole to try to detect whether a struct layout or location has been explicitly altered at build time before skipping it.
- Once the pahole series was merged, the constraint was removed from the ARM64 JIT compiler in the kernel.
As the eBPF framework keeps evolving, new corner cases like this one will likely continue to appear, and so both the kernel and the related tools (pahole) will likely need to continue receiving updates to handle those in the future.