Breakpoint is a series of articles where we follow engineers along the journey of debugging the most challenging bugs they’ve encountered over the years. These war stories, in turn, help inform our product roadmap. In this article we hear from Paul Khuong, former principal engineer at AppNexus, that uses Backtrace for monitoring their C and C++ real-time advertising systems. Among other things, Paul is known for his work with the SBCL compiler and binary search.
Enter Paul Khuong
My worst debugging experiences come from small “can’t go wrong” improvements to widely used infrastructure code that end up triggering bugs in larger applications. Over time I’ve learned to agree with Dr. Theodore Woodward: “When you hear hoofbeats, think of horses not zebras.” When a code change causes an application to crash, the bug is probably in that change. But what if the modification is so small that it’s obviously correct — not just to you, but to the whole team? That’s the story of my hunt for the 34th bit.
My obviously correct change boils down to a two-line patch to our bit vector implementation (
ck_bitmap in Concurrency Kit). We went from the following:
On our platform, x86-64/Linux, unsigned are 32 bit integers, so the second layout saves space for two reason: we don’t need padding between n_entries and bitmap to restore natural alignment (to 64 bits), and the bitmap itself can use less space by allocating, e.g., three unsigned (total of 96 bits) instead of two uint64_t (total of 128 bits). If bitsets were always heap allocated, this would only save a marginal amount of space. However, we also support inline structs for bit vectors of a known size; I hoped to save space for small inline bitmaps.
[ align @ 64 ] [ n_entries ] [ padding ] [ 64 bit | word ] [ 64 bit | word ] [ n_entries ] [ 32 bit word ] [32 bit word] [32 bit word]
I also had to adjust the way we allocated memory and indexed into the bitmap, but that was mostly a matter of updating a few type aliases. This tighter implementation of bit vectors passed all of its regression tests, did fine in a limited deployment of our real application, and easily flew through code review. More importantly, our bit vector implementation was so simple that it was (and still is) obviously correct.
Everyone on the team was feeling pretty confident about the next release of our application: we only had a couple bugfixes and some simple improvements. That’s until the application’s crash rate increased by a factor of 100. We rolled back to the previous version right away and all knew we were now in trouble.
Everyone on the team was feeling pretty confident about the next release of our application […] until the application’s crash rate increased by a factor of 100
We quickly determined that reverting the bit vector changes fixed the crash condition. Although I had recently checked our code base for array overflows involving bitwise operations, I decided to look again. I found more bugs, but nothing relevant to the new crashes. I even tried to insert bound checks; that uncovered nothing and still crashed mysteriously. Other members of the team looked at other suspect sections of code, and they too found a few lurking bugs. We fixed these bugs as we found them, but still had not uncovered the cause of the new, frequent, crashes. We also examined the crash reports, and it looked like we were always accessing a stale or bad pointer in the middle of an array of pointers. That left us with three facts:
1. Changing the layout of bit vectors causes crashes.
2. The crashes happen before any out of bounds write in bitmaps.
3. The immediate cause of each crash is a bad pointer in an array of pointers.
We had to debug a crash, but had no hypothesis as to how that bad pointer ever appeared in our array of pointers. In other words, we had no way of debugging backward from the crash to pinpoint the root cause of the bug. We were really in trouble.
The Debugging Begins
Being unable to debug from the crash to the root cause left us with three choices: think hard, add instrumentation and hope to inch closer to the root cause, or try to debug forward, from the changes to the cause. Thinking hard is a high risk strategy, and two engineers were already experimenting with various heap checking tools and working hard to add instrumentation, so I chose to try and pinpoint exactly what part of my bitset changes triggered crashes. I had already checked for bounds violation that would, until now, have been hidden by larger bitmaps. There’s always the possibility of a compiler/linker bug, but, except when working on a compiler, my batting average with that hypothesis is almost nil. That left the layout and the size of other objects in memory.
The size of objects seemed a likely candidate. Our memory allocator, jemalloc, bins allocations in size classes, and the bit vector commit shrunk a few common objects down by a class. That could let loose all sorts of new memory reuse patterns and expose old bugs with stale pointers. I tried padding the tail of every struct that was now smaller. No effect on crashes. We nevertheless made progress during that time. We realized that, although the array of pointers was supposed to be sorted, the bad pointer always appeared out of order: all the crashes were caused by pointers with the 34th bit cleared when it should be set.
[ 0x7fa7a3403930 ] [ 0x7fa5a34042a0 ] [ 0x7fa7a3404a00 ] [ 0x7fa7a341a360 ] ...
0x7fa7a3404a00 is out of order, because the 34th bit is 0 instead of 1.
That looked like the effect of out of bounds writes to a bitset, but we’d already looked for those… if anything, this last clue left me more confused. The only remaining idea, adjusting the layout of bitmap structs, was less palatable: I’d have to change the bitmap implementation, which introduces more difference between my test environment and the actual crash. Reverting to the previous layout would obviously fix things but not give me any additional information. I had to find intermediate changes. I started by rewriting the bitset struct into:
and by always rounding up the size of the bitmap to an even number of unsigned ints. No crash, as expected. I now only had to reduce the changes further. I saw two possibilities: keep the padding and stop rounding up, or remove the padding and keep rounding the size of the bitmap. In the first case, the struct would look like:
unsigned bitmap; /* Allow odd length. */
and, in the second case, we’d have:
unsigned bitmap; /* Only even length. */
Both versions suffered from 34th bit crashes.
Finally, I realized the common characteristic of robust layouts: the bitmap struct had 64 bit alignment! I simply forced 64 bit alignment on inline bitset structs… and the crashes went away. That was progress: I could now binary search through our codebase, locally enabling or disabling 64 bit alignment on inline bitsets.
After some mechanical work, I found the single faulty struct. It looked like
[persistent data for a callback];
Aligning the inline bitset struct to 64 bits pushed the callback data down by 32 bits. Our 34th bit corruption affected an array of pointers; the same buggy code would instead clear the second bit of a pointer, and, due to ABI requirements, that bit is always zero to begin with. Now that we knew the callback to blame, it took less than an hour (and four brains) to figure out that the callback code failed to participate in our safe memory reclamation scheme, which meant that it could trigger for objects after they had been deallocated.
That callback performed bitwise operations on a machine integer and usually cleared the same bit, causing our consistent single-bit corruption. That’s the kind of bugs that heap checking tools usually detect quickly. Yet, one of the first things we did was let valgrind run overnight, and it crashed without reporting anything useful. I believe the issue is that the bugs involved hardware concurrency in a section of code with such an allocation rate that the guilty stale pointer regularly pointed in the middle of a new valid object.
I find that the 34th bit takes the crown as my debugging nightmare for three reasons: the cause first appeared to be a change set that was obviously correct; all my usual tools were useless, so I had to resort to binary search over source code; it was an old bug that came back with a vengeance after a trivial change. The 34th bit bug was that a callback would flip one bit in an array of pointers. We’d had (what we believed to be) rare heap corruption in the objects pointed to by that array for a while. However, given that that was the one bug left after a debugging sprint and that occurrences were so rare, it made sense to focus on more pressing issues. Once the bitset changes landed, though, we couldn’t ignore dozens of crashes an hour anymore. That’s what I find most irritating about this bug: a series of reasonable decisions let a minor bug fester into a fairly serious one in production. The upside is that I work with awesome developers, so such epic (and frustrating) debugging sessions are productive team efforts and, more importantly, rare events.
Exit Paul Khuong
Backtrace helps investigate issues such as this with a flexible query system and deep introspection into fault state of faulting applications that includes memory corruption detection and other forms of error detection based on run-time state. Backtrace is powered by a purpose-built analytical database that allows users to ask a wide array of questions on attributes (user-submitted and auto-generated) derived from crashes, including, the fault address. For example, deriving the distribution of the faulting addresses is a two-click operation in Backtrace. This can be used to derive interesting patterns to get a head start on root cause analysis. Learn more at https://backtrace.io and follow us on Twitter (@0xcd03) for more Breakpoint.