Faster Backtraces With Backtrace I/O

Backtrace I/O is building a turn-key infrastructure platform to detect, aggregate, analyze and collaboratively fix software bugs of all types for even the most demanding software applications. We are taking a unique approach to the problem, from how backtraces are generated to how crashes are stored and analyzed. As engineers working on enterprise software, we find backtraces exceptionally useful. In production, backtraces can provide key insights into real-world performance. In bug reports, backtraces and basic environmental data are often the only things engineers have to go on when determining and fixing the root cause of a crash. Backtraces with insufficient information quickly become useless, leading engineers to draw incorrect conclusions and waste significant time and effort. Today’s tools can fail to perform for demanding applications, and we would like to fix that.

To build the best platform for crash analysis, we know we must begin with the best tracers and crash data formats. In this post we explore some of the performance advantages of our tracers compared to other tools and announce some of our first shareware.

A major component of our platform is the advanced tracer and tracing framework, which provide features such as temporal analysis, modern crash deduplication (in cooperation with our database) and more.

The advanced tracer relies on our core backtrace library, and today we’re happy to provide public access to prerelease binaries of bt, a lightweight tracer built on that same library. We do not expect this version of the lightweight tracer to be bug-free. You can find it at our Shareware page. Please contact us at signup@backtrace.io to request access to our platform and advanced features as we make pieces available. If you have any feature requests or bug reports, please contact us at team@backtrace.io so we can make things happen.

The lightweight tracer is a barebones, modern replacement for the traditional pstack and gstack system utilities. Its purpose is to exercise our core client-side debugging libraries without any advanced features. These core libraries provide fast DWARF parsing, executable-format handling and backtrace generation, including a framework for non-blocking server-side debugging. This version supports FreeBSD and Linux on x86-64, with plenty of other platforms on the way. Read on for performance results and usage.

Performance

Let’s begin by demonstrating performance on several complex real-world applications. The metrics we primarily care about are peak resident memory usage and time to completion. We compare gstack (a GDB wrapper on RHEL), lldb (of the LLVM project), glider (of GIMLI) and the Backtrace I/O bt tool. The significant performance differential comes from the specialized tracer functionality in our core libraries.

These tests were run on Debian 7.5.0 with a Linux 3.2.0 kernel, on a 4-core Intel Core i7-2600K at 3.40GHz with 16GB of RAM. Relevant caches were warmed before measurement. GDB 7.4.1 was used; we verified that GDB 7.7 on the same machine exhibits nearly identical performance. LLDB was built from trunk (r217927) with optimizations enabled.

Even though the bt tool supports DWARF2, DWARF3 and DWARF4, DWARF3 was enforced throughout so that glider could run for comparison. GCC has defaulted to DWARF4 since version 4.8, which we support.
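For example, DWARF3 output can be forced at compile time with GCC (a minimal sketch; test.c stands in for the actual source):

    $ gcc -gdwarf-3 -O2 -o test test.c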

In the tables below, the memory usage column refers to peak resident memory in megabytes and total time refers to the time to generate a backtrace in seconds. Both glider and bt access all stack-reachable variables and crawl through them by default. Only a summary backtrace (versus a full detailed backtrace) is requested from gstack (gdb) and thread backtrace all (lldb) in these tests unless otherwise stated. Results are taken from several runs after caches were warmed. bt and gdb are on parity with line-number mapping information, while glider is inaccurate. LLDB failed to reliably unwind past signal frames on Linux, and some variable forms were unsupported. These performance numbers also exclude the fancier optimizations our advanced tracer utilizes.


IceWeasel (Firefox)

The following are relevant statistics for Mozilla IceWeasel 24.8.0 with 670 mapped segments and 37 threads. There is approximately 1.3GB worth of DWARF3 debug data across these objects.

iceweasel   Memory Usage   Total Time
gdb         1313.50MB      15.41s
glider      1128.08MB      2.82s
lldb        1950.71MB      54.45s
bt          217.62MB       0.29s

LibreOffice

The following are relevant statistics for LibreOffice 3.5 with 726 mapped segments and 4 threads. There is approximately 2.1GB worth of DWARF3 debug data across these objects (with only a small subset immediately relevant to the trace). LLDB failed to load several debug files in this scenario, so its data has been omitted; we may revisit lldb performance with LibreOffice at a future date.

libreoffice   Memory Usage   Total Time
gdb           1355.69MB      19.13s
glider        360.37MB       5.90s
bt            91.08MB        0.14s

Chromium

The following are relevant statistics for Chromium 35.0.1916.114 with 466 mapped segments and 1 thread. There is approximately 2.6GB worth of debug data in a single executable here.

chromium   Memory Usage   Total Time
gdb        2634.71MB      54.00s
glider     3017.08MB      5.63s
lldb       3810.00MB      130.65s
bt         461.15MB       0.61s

Impact of Thread Count

Modern large-scale processes can scale to thousands, if not tens of thousands, of threads, hundreds of gigabytes of memory usage and tens of thousands of memory mappings. The following is from an artificial crash scenario across a variable number of threads, with 60000 mappings and a stack frame depth of 10 for every thread. The program was compiled with DWARF3 for comparison to glider, using GCC 4.7.2 at the -O2 optimization level; some variables were optimized away completely. Unfortunately, we did not have time to scale the test up further, but users have reported 3 second trace times on a 30K+ thread, 500GB+ RSS process with almost 200K map entries. Both glider and bt emit more information than GDB bt full output, as they actually crawl the stack and chase pointers.

Here are the results comparing glider, gdb, lldb and bt by total time spent, in seconds. Frame variables were not enumerated for lldb or gdb in this scenario. Both glider and bt walk the stack and crawl memory up to a depth of 20. glider currently doesn’t have a succinct format. The bt-nv column represents bt with only line-number information utilized (similar to default lldb backtrace behavior), enabled by passing the --no-variables option.

Threads   glider   gdb    lldb   bt-nv   bt
32        3.45     0.14   0.18   0.12    0.15
64        5.59     0.26   0.18   0.12    0.20
128       9.88     0.51   0.42   0.13    0.29
256       18.49    0.98   0.37   0.15    0.46
512       35.78    1.97   1.35   0.19    0.82
1024      70.79    3.94   2.74   0.29    1.58
2048      142.69   8.17   5.15   0.50    3.06

[Figure: total trace time for glider, gdb, lldb, bt-nv and bt as the thread count grows]

Our core libraries are fast, and this demonstrates some of the spare cycles our advanced tracer will take advantage of to provide cool features. Our platform provides rich DWARF support for extracting many compound expressions from registers, memory and more. Our stack is built on very fast parsers, efficient memory management, a custom libunwind (expect patches upstream as soon as we expand platform support) and fast data structures (some from the Concurrency Kit library).

Usage

The bt program is fairly barebones. Usage is demonstrated through examples. Output has been truncated for brevity in some cases.

Print help
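The exact help flag is our assumption here; we show the conventional spelling:

    $ bt --help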

Trace a target process

Simply specify the target process ID.
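For example, to trace the process with ID 1337 (an illustrative PID):

    $ bt 1337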

Omit variable information

Specify the --no-variables option.
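For example (PID illustrative):

    $ bt --no-variables 1337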

Trace specific threads

Our tracers have robust support for non-stop tracing of multithreaded software. In this mode, only a subset of threads is stopped for tracing. This allows for minimal intrusion and is part of what allows our advanced tracer to perform temporal analysis for certain classes of bugs. Simply pass a comma-separated list of thread identifiers to the --thread option. The --no-variables option is also passed here for brevity.
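For example, to stop and trace only threads 1 and 4 (thread identifiers and PID are illustrative):

    $ bt --no-variables --thread 1,4 1337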

Output current instruction

The --assembly option outputs the current assembly instruction. The advanced tracer will support blobs. The --no-variables and --thread options were used for brevity.
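For example (thread identifier and PID illustrative):

    $ bt --assembly --no-variables --thread 1 1337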

Output register values

If sufficient debug information is present to unwind register values, those values can be very useful. Use the --registers option to enable this. The usefulness of register values depends on the ABI of your platform. Output is truncated for brevity.
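For example (PID illustrative):

    $ bt --registers 1337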

Limit memory crawl depth

The -m option limits the maximum crawl depth. A dereference or an enumeration of an aggregate type’s members is considered a crawl operation.
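For example, to stop crawling past a depth of 2 (the depth value is illustrative):

    $ bt -m 2 1337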

Output types

Types are output in storage order with the --types option. Crawl depth is limited for brevity here.
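For example (crawl depth and PID illustrative):

    $ bt --types -m 1 1337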

Output memory map

The --memory-map option outputs the target process memory map.
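For example (PID illustrative):

    $ bt --memory-map 1337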

Use cached maps file

Massive processes may have a very large number of map entries. In some cases, we have observed that reading /proc/pid/maps becomes a significant bottleneck. For this reason, users may cache map dumps (among other assets) and re-use them for future tracing sessions. To use this option, specify the --map option.
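A sketch of the workflow, assuming --map takes the path of a previously saved maps file (the exact argument format is our assumption, and the PID is illustrative):

    $ cat /proc/1337/maps > maps.cache
    $ bt --map maps.cache 1337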

And more…

There are more options available. The advanced tracer is where things get really interesting and we look forward to sharing some of those features with the public soon.

Availability

If you’re interested in our blazingly fast platform and rich debugging features, please request access at signup@backtrace.io. The bt tool is available for download at our Shareware page. This simple tool is just the tip of the iceberg. All our software is available under commercial-friendly licenses.
