FreeBSD Userspace Coredumps

A core represents the state of a process at a point in time. It contains all the information that an engineer needs in order to inspect the process and its state even after the process has exited. This information includes thread information, mapped memory, register state and more. By using a debugger with the core file, engineers can interact with and inspect the state of the process as if they had attached a debugger to the process at the time when the core file was generated.

Ever wonder what exactly a core dump contains and how debuggers interact with it? This post is for you. We will explore how cores are generated and how software interacts with them.

Generating Core Files

Both FreeBSD and Linux initiate a core dump for a process when that process receives certain unhandled signals. Both create core dumps for SIGQUIT, SIGILL, SIGTRAP, SIGABRT, SIGFPE, SIGSEGV, SIGBUS, SIGSYS, and SIGEMT; Linux also creates core dumps for SIGXCPU and SIGXFSZ. Users can also initiate core dumps manually using tools such as gcore. Each operating system supports a variety of core file formats, but the default and most common format is ELF, so we will focus on that.

ELF Core Files

Let’s start by generating a core file on FreeBSD (examples in this post refer to FreeBSD unless otherwise indicated):

Because the core file is an ELF file, we can inspect it using the readelf tool:

Note that the ELF file type is “Core file”. Although the ELF specification has a specific type value reserved for representing core files, it unfortunately does not provide any specifications regarding the contents of such files, and formal specifications in this regard are hard to come by. In practice, both Linux and FreeBSD store most data about the process in various ELF note sections. Let’s see what notes are present in the core file we just created:
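As a minimal sketch of how a tool might walk this note data, the following Python assumes the standard ELF note record layout: three 32-bit words (name size, descriptor size, type) followed by the name and the descriptor, each padded to a 4-byte boundary. The helpers iter_notes and make_note are hypothetical names introduced for illustration, and the sample bytes are fabricated input rather than data from a real core file.

```python
import struct

def iter_notes(blob, little_endian=True):
    """Yield (name, type, desc) for each note record in a note segment's bytes."""
    fmt = "<III" if little_endian else ">III"
    pos = 0
    while pos + 12 <= len(blob):
        namesz, descsz, ntype = struct.unpack_from(fmt, blob, pos)
        pos += 12
        name = blob[pos:pos + namesz].rstrip(b"\0").decode("ascii")
        pos += (namesz + 3) & ~3   # the name is padded to a 4-byte boundary
        desc = blob[pos:pos + descsz]
        pos += (descsz + 3) & ~3   # the descriptor is padded the same way
        yield name, ntype, desc

def make_note(name, ntype, desc):
    """Build one note record; used here only to fabricate sample input."""
    raw_name = name.encode("ascii") + b"\0"
    def pad(b):
        return b + b"\0" * (-len(b) % 4)
    header = struct.pack("<III", len(raw_name), len(desc), ntype)
    return header + pad(raw_name) + pad(desc)

# Two fabricated notes: type 3 (NT_PRPSINFO) and type 1 (NT_PRSTATUS).
sample = make_note("FreeBSD", 3, b"psinfo") + make_note("FreeBSD", 1, b"status-bytes")

for name, ntype, desc in iter_notes(sample):
    print(name, ntype, len(desc))
```

In a real core file the bytes passed to iter_notes would be read from the file at the offset and size given by the NOTE program header.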

General Process Information

The first note is a binary blob representing an instance of struct prpsinfo, defined on FreeBSD in /usr/include/sys/procfs.h:

This struct just contains some general information about the process; most of the meaty data is stored elsewhere.

Thread Information

The next notes are NT_PRSTATUS, NT_FPREGSET and NT_THRMISC, each appearing twice. These note types contain various kinds of data about threads, and there is a separate instance of each for every thread in the process. Because this process had two threads when we generated the core file, we see two instances of each type. Each NT_PRSTATUS note contains a binary blob representing an instance of struct prstatus, defined in the same header file:

Here we have information that allows us to determine the state of each thread, most notably register values and signal number (if applicable). The register set here only contains values for general registers and certain control registers. For example, on FreeBSD running on x86-64, gregset_t is defined as:

The values of floating point registers are contained in the NT_FPREGSET sections. The NT_THRMISC sections really only contain the name for each thread:

libprocstat

All of the note sections we have seen so far are common between FreeBSD and Linux, although the exact details of the internal structs may vary. However, on FreeBSD, after these notes we see a lot of notes starting with NT_PROCSTAT. These notes are meant to be opaque to general users and accessible only via libprocstat, a FreeBSD library whose basic API is documented in the libprocstat(3) manual page. On Linux, these notes are replaced with equivalent structs whose contents are transparent to users. Most of the NT_PROCSTAT notes correspond directly with libprocstat API calls. For example, the NT_PROCSTAT_PROC note contains the data that is exposed by procstat_getprocs. This call returns an array of struct kinfo_proc, which is defined on FreeBSD in /usr/include/sys/user.h and contains a wide variety of metadata about a thread, such as signal masks, stack size and start time.

Memory Mappings

So far we have enough information to examine the threads in a process and determine what state they are in. However, we still don’t have any information about the process’ memory, so we can’t yet determine the values of any variables (except those stored in registers). FreeBSD provides information about mapped memory segments via the procstat_getvmmap call, while Linux stores it as binary data in a note of type NT_FILE. This information is logically equivalent to the output of procstat -v on FreeBSD or the contents of /proc/<pid>/maps on Linux. The FreeBSD procstat_getvmmap call returns an array of the following struct:

Let’s create a simple program using the libprocstat API to examine the process’ virtual memory map and look at the output (Note: do not use this code in production; for brevity’s sake it does not include any error checking or cleanup):

Now we know the addresses of the process’ mapped memory segments and, where appropriate, which file the segments were mapped from. Suppose that we’re interested in actually reading data from these segments. Both Linux and FreeBSD store the contents of (some of) these as ELF file segments. Let’s take a look at the ELF program headers in our core file, which provide information about the file’s segments:

The first segment is of type NOTE, and it contains all of the note data we’ve been looking at so far. However, there are also many segments with ELF type LOAD. Let’s look specifically at the starting virtual memory addresses for the LOAD segments:
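As a rough sketch of how such a listing could be produced programmatically, the following Python assumes a little-endian 64-bit core file and the standard Elf64_Phdr layout. The function name load_segments is hypothetical, and the sample program headers (including their addresses) are made up for illustration; they are not taken from the core file discussed in this post.

```python
import struct

PT_LOAD, PT_NOTE = 1, 4
# Elf64_Phdr: p_type, p_flags, p_offset, p_vaddr, p_paddr, p_filesz, p_memsz, p_align
PHDR = struct.Struct("<IIQQQQQQ")

def load_segments(phdr_table, count):
    """Return (vaddr, offset, filesz, memsz) for each PT_LOAD program header."""
    result = []
    for i in range(count):
        (p_type, p_flags, p_offset, p_vaddr,
         p_paddr, p_filesz, p_memsz, p_align) = PHDR.unpack_from(phdr_table, i * PHDR.size)
        if p_type == PT_LOAD:
            result.append((p_vaddr, p_offset, p_filesz, p_memsz))
    return result

# Fabricated table: one NOTE header followed by two LOAD headers.
table = b"".join([
    PHDR.pack(PT_NOTE, 4, 0x1000, 0, 0, 0x200, 0x200, 4),
    PHDR.pack(PT_LOAD, 6, 0x2000, 0x400000, 0, 0x1000, 0x1000, 0x1000),
    PHDR.pack(PT_LOAD, 6, 0x3000, 0x7f0000000000, 0, 0x2000, 0x2000, 0x1000),
])

for vaddr, offset, filesz, memsz in load_segments(table, 3):
    print(hex(vaddr), hex(offset), hex(filesz))
```

For a real file, the table would be read starting at the e_phoff offset from the ELF header, with the entry count taken from e_phnum.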

Which segments are included?

Note that the above addresses all correspond to virtual memory segments reported by the libprocstat API. However, some segments returned by libprocstat are missing in this list, such as those starting at 0x800895000, 0x800be5000 and 0x801893000. On FreeBSD, there are several criteria that mapped memory segments must satisfy in order to be included in a core file:

  1. They must have at least one of read, write, and execute permissions. If ELF legacy coredump mode is enabled (via the elf64_legacy_coredump or elf32_legacy_coredump sysctl), then the segment must have both read and write permissions.
  2. They must not have been marked as exempt from core dumps. Users can mark memory maps as exempt from core dumps using the MAP_NOCORE flag with mmap or the MADV_NOCORE flag with madvise.
  3. They must not be submaps. In practice, this means that kernel submaps, such as signal trampolines, will be excluded.
  4. They must be backed by physical memory (either on disk or in volatile memory). For instance, segments backed by devices or files in a procfs filesystem will not be included in a core file.
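The four criteria above can be sketched as a single predicate. This is only an illustration of the rules as described in this post, not the kernel's actual code: the Mapping class, its field names and the included_in_core function are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Mapping:
    """Hypothetical stand-in for one mapped memory segment."""
    readable: bool
    writable: bool
    executable: bool
    nocore: bool = False         # MAP_NOCORE or MADV_NOCORE was applied
    is_submap: bool = False      # e.g. a kernel submap
    memory_backed: bool = True   # False for device- or procfs-backed mappings

def included_in_core(m, legacy_coredump=False):
    """Apply the four inclusion criteria listed above to one mapping."""
    if legacy_coredump:
        # Criterion 1, legacy mode: both read and write are required.
        if not (m.readable and m.writable):
            return False
    elif not (m.readable or m.writable or m.executable):
        # Criterion 1, default mode: at least one permission is required.
        return False
    if m.nocore:                 # criterion 2
        return False
    if m.is_submap:              # criterion 3
        return False
    if not m.memory_backed:      # criterion 4
        return False
    return True

heap = Mapping(readable=True, writable=True, executable=False)
text = Mapping(readable=True, writable=False, executable=True, nocore=True)
print(included_in_core(heap), included_in_core(text))
```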

Executable segments from libraries and other binary files are typically mapped with the MAP_ENTRY_NOCOREDUMP flag (the kernel-internal equivalent of MAP_NOCORE), so they are not included in core dumps. Data from these segments is sometimes of interest to libraries such as libthread_db and libunwind, but these segments can be quite large and their data is read-only, so if needed we can just read the data directly from the binary files on disk. The bad news is that if these files have been modified since the core file was created, we won’t be able to read data from their segments (or worse, we will read incorrect data). The good news is that these segments contain only executable instructions and not variable data, so their absence won’t prevent us from determining the values of any variables.

Reading from process memory

For memory segments stored in the core file, the Offset and FileSiz fields in the program header tell us where to actually read the data for the mapping. Per the ELF specification, the Offset field indicates the offset within the ELF file where the segment starts, and the FileSiz field indicates how many bytes the segment occupies within the file (MemSiz gives its size in memory). For example, suppose we wanted to read a variable whose address in the original process was 0x8008b0100. First, we have to find the virtual memory segment containing the data we want based on start and end addresses. In this case it is the segment 0x8008b0000-0x8008b9000. Then we have to determine where to find the contents of that segment. We can find this from the corresponding ELF segment with the matching start address.

This segment starts at file offset 0x2f000, and the position of the desired memory in that segment is 0x8008b0100 - 0x8008b0000, or 0x100. After adding these together, we get an overall position of 0x2f100 in the core file, which is where we can find this variable’s value.
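The same arithmetic can be written as a small lookup. This is a sketch only: the function name core_offset and the tuple layout are hypothetical, and the segment size of 0x9000 is simply the difference between the end and start addresses given above.

```python
def core_offset(segments, addr):
    """Translate a process virtual address into an offset within the core file.

    segments: iterable of (vaddr, file_offset, size) tuples, one per LOAD segment.
    Returns None if no dumped segment contains the address.
    """
    for vaddr, file_offset, size in segments:
        if vaddr <= addr < vaddr + size:
            return file_offset + (addr - vaddr)
    return None

# The segment from the example: 0x8008b0000-0x8008b9000 at file offset 0x2f000.
segments = [(0x8008b0000, 0x2f000, 0x8008b9000 - 0x8008b0000)]

print(hex(core_offset(segments, 0x8008b0100)))  # 0x2f100
```

Reading the variable is then a matter of seeking to that offset in the core file and reading as many bytes as the variable's type requires.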

Memory segments in other files

As mentioned earlier, some memory segments are not actually included in the core file. In the case of executable segments, we can instead read the relevant data directly from the binary file. On FreeBSD the kve_path field of struct kinfo_vmentry gives us the path of the file to read from and the kve_offset field tells us the offset of the segment within that file. On Linux, each entry in the NT_FILE note has corresponding fields. Given these pieces of information we can apply the same logic as above to find the target memory location.
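A minimal sketch of that lookup for the FreeBSD case follows. The VmEntry tuple mirrors only the kve_start, kve_end, kve_offset and kve_path fields of struct kinfo_vmentry; the function name locate_in_file, the library path and the addresses are hypothetical values chosen for illustration.

```python
from collections import namedtuple

# Subset of struct kinfo_vmentry fields needed for the lookup.
VmEntry = namedtuple("VmEntry", "kve_start kve_end kve_offset kve_path")

def locate_in_file(entries, addr):
    """Return (path, file_offset) for an address inside a file-backed mapping."""
    for e in entries:
        if e.kve_start <= addr < e.kve_end and e.kve_path:
            return e.kve_path, e.kve_offset + (addr - e.kve_start)
    return None

# Made-up entry: a library text segment mapped from offset 0x1000 of its file.
entries = [VmEntry(0x800a00000, 0x800a10000, 0x1000, "/usr/local/lib/libexample.so.1")]

path, offset = locate_in_file(entries, 0x800a00040)
print(path, hex(offset))
```

With the path and offset in hand, the bytes can be read by opening the file in binary mode, seeking to the offset and reading the desired length, subject to the caveat that the file on disk may have changed since the core was written.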

Limitations

ELF core files store memory mapping information as program headers, and the field in the ELF file header indicating the number of program headers is stored as a 16-bit unsigned integer, even in the 64-bit version of ELF. Therefore any program with more than (2^16) - 1 (65535) memory maps would cause an overflow in this field. Older core dump implementations did not account for this and as a result, core files appeared to be missing many of their memory segments. Newer versions of the Linux kernel get around this by setting this field (e_phnum) to 0xffff for any map count that would overflow, and then storing the actual count in the sh_info field (stored as a 32-bit unsigned integer) of the first section header. However, the problem still persists in FreeBSD.
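A reader that wants to cope with the Linux workaround could look roughly like the following sketch. It assumes a little-endian 64-bit ELF image and the standard Elf64_Ehdr and Elf64_Shdr layouts; the function name program_header_count is hypothetical and the sample image is fabricated.

```python
import struct

EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # Elf64_Ehdr
SHDR = struct.Struct("<IIQQQQIIQQ")        # Elf64_Shdr
PN_XNUM = 0xffff                           # e_phnum value meaning "see sh_info"

def program_header_count(image):
    """Return the real number of program headers in an ELF64 image."""
    fields = EHDR.unpack_from(image, 0)
    e_shoff, e_phnum = fields[6], fields[10]
    if e_phnum != PN_XNUM:
        return e_phnum
    # Extended numbering: the count lives in sh_info of the first section header.
    sh_info = SHDR.unpack_from(image, e_shoff)[7]
    return sh_info

# Fabricated image: an ELF header followed directly by one section header.
ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)   # 64-bit, little-endian
ehdr = EHDR.pack(ident, 4, 62, 1, 0, 0, EHDR.size, 0,
                 EHDR.size, 56, PN_XNUM, SHDR.size, 1, 0)
shdr0 = SHDR.pack(0, 0, 0, 0, 0, 0, 0, 70000, 0, 0)

print(program_header_count(ehdr + shdr0))
```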

Signal Information

Another very important part of process state is information about signals sent to the process. When a signal is caught by a user-supplied signal handler, the stack trace will make it clear that a signal is present because the stack will contain a frame for the signal trampoline. However, when the operating system initiates a core dump because of an unhandled signal such as SIGABRT or SIGSEGV, the stack trace will provide no indication of a signal. Linux core files provide a note of type NT_SIGINFO for each thread; each of these notes is an instance of siginfo_t, which contains the signal number, code and additional signal-specific details (such as the faulting address in the case of SIGSEGV).

Signals on FreeBSD

FreeBSD does not provide such detailed information. Instead, as shown earlier, the struct prstatus provided for each thread includes a field int pr_cursig whose value represents the number of the signal received by the process (or 0 if there is no signal present). It does not provide any further information about the signal, though. This field has the same value across all threads, so if a process dumps core due to an unhandled SIGABRT, for instance, every single struct prstatus will have its pr_cursig field set to SIGABRT. However, if a specific thread causes a signal (for example, a thread dereferences a NULL pointer, causing a SIGSEGV), then that thread will be the first one listed in the core file.
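The heuristic described above can be sketched as follows, assuming the NT_PRSTATUS notes have already been decoded into dictionaries in the order they appear in the core file. Only pr_cursig comes from struct prstatus as described in this post; the thread_id key and the function name describe_signal are hypothetical.

```python
def describe_signal(prstatus_notes):
    """Guess the signal and originating thread from decoded NT_PRSTATUS notes.

    prstatus_notes: list of dicts in core-file order, each with 'pr_cursig'
    and a hypothetical 'thread_id' key.
    """
    if not prstatus_notes:
        return None
    first = prstatus_notes[0]
    if first["pr_cursig"] == 0:
        return None   # no signal was pending when the core was written
    # Every thread reports the same pr_cursig, so the value alone does not
    # identify the thread; the first note listed is the best available hint.
    return {"signal": first["pr_cursig"], "likely_thread": first["thread_id"]}

notes = [
    {"thread_id": 100234, "pr_cursig": 11},   # made-up IDs; 11 is SIGSEGV
    {"thread_id": 100198, "pr_cursig": 11},
]
print(describe_signal(notes))
```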

Use for Debugging

The information discussed so far is typically sufficient for traditional debugging purposes, i.e. determining the cause of a specific fault and inspecting the state of a process at the time of that fault. Some work is required for synthesizing this information into a useful form, such as unwinding stack frames and determining the values of specific variables, but in terms of the raw information that needs to be stored in a core file, nothing else is really needed. Instructions for synthesizing the information described above are typically contained in DWARF debug information, either in the original executable file or in a standalone file containing only this debug information.

Comparison with Backtrace Snapshots

The Backtrace snapshot format is extensible and, among other things, contains a detailed callstack (compiler optimizations included), objects such as variables and threads, and system information. Our debuggers try to determine which regions of memory are required to get to the root cause and are much more selective in determining what to persist in a snapshot. In contrast, core files contain raw data such as register values and memory contents. The additional information in Backtrace files enables various forms of large-scale trend analysis across multiple crashes. Let’s compare time and memory used to generate each type of file for a Chromium process running on Linux on my laptop:

                      Time (s)    Memory (kB)
Coredump                 58.73         468825
Backtrace Snapshot        1.65            365

The Backtrace file was generated more than 35 times faster and occupies less than 0.1% of the disk space of the core file, despite also analyzing DWARF debug information on the fly, evaluating variable values and performing various types of analysis on the resulting data. For instance, in addition to information about the traced process, the Backtrace trace file also collects system-level data such as operating system version, overall system memory usage, and other data that could be useful to engineers analyzing a crash after the fact; identifies common classes of crashes such as NULL pointer dereferences and stack overflows; and annotates specific variables that could be of special interest to engineers.

Another advantage of the Backtrace file format is that it is entirely self-contained and can be viewed on any machine with Backtrace tools, unlike core files, which typically must be viewed in the same environment in which they were created. The snapshot can also have all sorts of blobs attached to it so engineers are able to recreate the faulting environment assets in one command.

Posted October 3rd, 2015 | Backtrace, Technical Details