Performance Improvements for FreeBSD kernel debugging

We previously explored FreeBSD userspace coredumps. Backtrace’s debugging platform supports FreeBSD kernel coredumps too, and their traces share many features. They are constructed somewhat differently, and in the process of adding support for them, we found a way to improve performance for automated programs accessing them.

Read on to learn how information is extracted from a FreeBSD kernel core, and how we improved performance for this mechanism.

Generating kernel core files

A kernel core is typically only generated in exceptional circumstances. Unlike userspace processes, kernel routines cannot fault without sacrificing the machine’s availability. This means things like page faults and illegal instructions inside the kernel stop the machine, instead of just one process. At that point, in most cases, it is only usable enough to inspect its state in a debugger, or to generate a core file.

On FreeBSD, kernel core files can be saved for any machine that has dumpdev=AUTO in its /etc/rc.conf, and a sufficiently large swap partition, configured at boot time.

On FreeBSD, a kernel core file can be generated by intentionally crashing:

For machines that are configured to enter the on-line kernel debugger (aka kdb), the following command is also needed, at the system console:

Otherwise, the kernel’s dump routines will automatically save the core file to a swap partition, if configured.

Upon reboot, the savecore rc startup script will automatically identify the core file in the swap partition, and save it to /var/crash by default.

What’s in a FreeBSD kernel core?

A FreeBSD kernel core file can be formatted in several different ways. This depends on which type of dump was performed. Full core dumps are ELF files, similar in structure to userspace core files.

However, as RAM size grew, this became more difficult to manage. In 2006, FreeBSD introduced minidumps, which are much smaller without making the core file useless. This has been the default dump type since FreeBSD 6.0.

Unlike userspace cores & full kernel cores, minidumps are not ELF files. They have a custom structure that starts with a header, and are followed by several sections of data explained by the header. For example, on amd64:

The version element allows minidump formats to change over time. At the time of this article, this has only happened once for amd64.

The next three uint32_t elements describe the size of the three sections of memory included in the core file:

  • The msgbuf (“message buffer”) section contains the contents that would be printed by dmesg if it were run at the time the core file was saved.
  • The bitmap section tells the debugger which pages of physical memory are included in the pmap section. This means bit N indicates the presence of the page covering 4K bytes starting at N * 4K, in physical memory.
  • The pmap section consists of a sparse dump of all pages of physical memory. Each page in this section has its bit set in the above bitmap.

The uint64_t elements describe the boundaries of the two kernel virtual address segments. On amd64, every process’s virtual address space has these sections reserved for the kernel.

Kernel virtual address space layout

There are two sections of kernel virtual address (KVA) space:

  • The kernel map. This is the primary map, used for critical kernel context. In particular, all thread stacks are stored here. Kernel map addresses require using page tables in order to find which physical page corresponds to a particular virtual address.
  • The direct map. This is an amd64-specific feature; it takes advantage of the 64-bit address space to enable looking up the corresponding physical pages of memory directly, via bitmask. All UMA-managed pages are mapped this way.

The minidump, strictly speaking, only includes physical pages that are in the kernel map (which includes the page tables for it). Because the direct map allows accessing all physical pages in memory, all kernel map addresses will also have direct map addresses. However, the converse is not true.

The minidump includes the direct map address range so whenever a debugger encounters a direct map address, it can attempt to look up their corresponding physical address.

In addition to partitioning the KVA space in this manner, kernel subsystems can indicate whether a particular page should not be ‘dumpable’. For minidumps, these non-dumpable pages are also not included.

Looking up information in a kernel core

Let’s try something a bit more concrete. The kernel debugger (kgdb) can be run like this:

kgdb provides a basic stack trace for the faulting thread. This is an example of what it looks like when forced as described earlier. Because the trace includes a frame for kdb_sysctl_panic (the handler for the command we ran), the cause of this particular crash is clear.

Let’s extract a bit more information by running bt (short for backtrace) on the faulting thread:

Let’s see more information about the current thread. Frame #8 includes the td local. We can use this to print out the name of the thread:

To access this information, kgdb asks libkvm for the location in the core file for the virtual address 0xfffff800031f0284. Because the debugger has access to the DWARF metadata explaining the structure of td, it would have figured out that it needs to access this address rather than the td base address 0xfffff800031f0000.

Libkvm will first determine whether the virtual address lies within the kernel or direct maps. If it lies in the kernel map, libkvm will consult the page table pages to discover the corresponding physical address. If it lies in the direct map, it can simply mask off the direct map base address. If neither of these applies, the address is illegal. This process is encapsulated by va_to_pa, or “virtual address to physical address”.

Once the physical address is determined, libkvm consults the core file’s bitmap to figure out where in the core file it is located.

One way libkvm can scale better

Recall that minidumps include a sparse bitmap indicating the pages that are included. These pages are dumped sequentially in the last section. Because they are sparse in a not entirely predictable way, figuring the offset into the dump for a particular physical address cannot be reduced to a trivial formula.

When minidumps were originally created, the solution adopted was to create a hash table that maps physical addresses (one per 4K page) to dump file offsets. This hash table was initialized with 1024 buckets. A minidump with 8GB of physical pages would, therefore, have 2 million entries with an average of 2048 per bucket. This is the best case in such a scenario.

On top of this high load factor, the offset lookup access pattern is not randomly distributed, which means frequently-accessed addresses can sometimes be particularly expensive to look up. This is reflected in automated processes taking significantly longer to run.

A valid formula for calculating the offset looks like this (where pa is the physical address):

offset_for(pa) = bits_set_before(pa) * 4096

Here bits_set_before represents the number of bits set in the bitmap before the page containing the indicated physical address. So the trick is finding a better solution for this function.

The “number of bits set” has a special term: the Hamming weight, or population count. So bits_set_before is just giving us the Hamming weight for the minidump’s bitmap, from bit 0 to bit N-1.

It would be inefficient to count all of those bits every time, hence the hash table implementation.

Caching the Hamming weight

Dumps will never have a predictable pattern in this bitmap. However, we can scan the bitmap and generate a lookup table that indicates the number of bits set. By doing so, we convert the expensive operation (“get number of bits set before bit N”) into a cheap one (“lookup pre-computed value for bit N”).

We have a trade-off to make: We could store this value for every bit, but that would consume a lot of memory. We can instead substitute a small amount of CPU time at lookup time to compute the number of bits set between a cached value and the desired bit.

Modern CPUs offer a POPCNT instruction that returns the number of bits set in a fixed width integer value. There are also generic equivalents available. Older and non-x86 systems supported by FreeBSD also use libkvm, so any optimization must not affect their functionality. GCC and Clang both support a __bitcount64 primitive to bridge this gap. So that’s what we actually used to perform the calculation.

We’ve already reduced the number of counts from N to N/64. We can go one step farther. Given a page N, where X <= N < Y, we can also compute the number of bits forwards (from X) and backwards (from Y). This reduces the number of counts by half again.

Finally, in order to further limit the memory usage of this cache, we can call __bitcount64 a few times. By adopting a cache interval of 1024 bits, we require, on average (in a perfectly random access pattern), 4 counts worth of CPU time, and one count per 1024 bits in memory.

Putting it together

To illustrate the relationship between the cache and the core’s bitmap:

While generating the cache, each value is computed in turn. Cache value A is the total count of all bits set through A1023 inclusive, B is the total of all bits set through B1023 inclusive, and so forth.

To find the offset for page B122, we look up the cached count A, then compute and add the Hamming weight between bits B0 and B121, to find the total number of pages before page B122.

Similarly, to find the offset for page B700, we look up the cached count C, then compute the number of bits set between B700 and B1023, and subtract that.

The offset calculation now looks like this:

First, we find the physical address for the requested virtual address. Then we determine the cache popcount bin, rounding up to the next bin in case we’re counting down. popcount encapsulates both counting up and down as well as counting the bits in the given range. This in turn is converted to a relative offset simply by multiplying by page size of 4096. Finally, the relative offset is added to the sparse page array’s base offset (pmap_base_offset), to provide the location of the requested address.

This means the formula is now just an array index, which is O(1), while popcount is a small number of CPU instructions not linearly related to the requested bit number N. This compares to previously, where each page required a separate hash table entry, and each lookup required computing a hash on the address and an O(log n) search of the matching hash table bucket.

Results

For typical manual debugger use, the impact of this change isn’t noticeable, which is probably why the hash table implementation has been in use for 10 years. However, for any automated debugging process, the extra latency adds up quickly.

The most obvious use case for stock FreeBSD is the crashinfo utility. This program is run by the savecore utility mentioned in the introduction. crashinfo is a script that processes a kernel core dump, and runs other utilities that can operate on core dumps, including: ps, vmstat, pstat, iostat, ipcs, nfsstat, netstat, and fstat.

For a sample 8GB kernel core file (generated on a 128GB server), crashinfo improves from (as reported by /usr/bin/time -l):

to:

So in this case, crashinfo now takes 75% less time and 30% less memory.

The benefits are similarly obvious for automated debugger frameworks like the one developed by Backtrace.

With a manually-driven debugger like kgdb, a developer enters commands and processes the output manually. This means only a handful of lookups are required in a relatively long (to a CPU) amount of time. Backtrace debuggers, on the other hand, issue thousands of such lookups per second.

Epilogue

Backtrace began shipping a version of this performance improvement in ptrace in February 2016. This enables us to also offer significantly faster tracing of FreeBSD kernel cores to customers running current and older releases of FreeBSD.

On July 17, 2016, our work improving libkvm scaling was committed to FreeBSD/head. It will ship with FreeBSD 12.0.

By | 2016-08-25T16:42:00+00:00 August 25th, 2016|Backtrace, Technical Details|