
Error Report Analysis from the Command Line

Backtrace now includes a completely new storage and indexing subsystem that enables engineers to easily slice and dice hundreds of attributes in real time, so they can better triage and investigate errors across their ecosystem, all from the comfort of their command-line environment. When an application crash occurs, there are hundreds of data points that may be relevant to the fault. These range from application-specific attributes, such as version or request type, to crucial fault data, such as crashing address, fault type, and garbage collector statistics, to environment data, such as system memory utilization.

Read on to learn how you can interact with Backtrace from the command line for crash report and error investigation.

In this blog post, we will walk through several examples of interacting with a Backtrace object store, containing crashes across all environments, from the command line. morgue is the command-line frontend to the Backtrace object store. Simply install it with npm install -g backtrace-morgue to give it a spin. Additional documentation is available on GitHub.
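
Once installed, point morgue at your coronerd instance and authenticate; the URL below is a placeholder for your own endpoint:

    $ morgue login https://yourcompany.sp.backtrace.io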

Summary Information

Let’s get a quick summary of errors in the coronerd project over the last month. Specifically, we would like the overall number of occurrences, a histogram of affected versions, the number of unique machines affected by the fault, and a count of unique errors. In this particular case, 39 errors occurred, but they are different manifestations of only 15 unique errors.
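
One query can answer all of these at once. The sketch below assumes the project submits version and hostname attributes; adjust the attribute names and age specifier to your deployment:

    $ morgue list coronerd --age=1M --histogram=version \
        --unique=hostname --unique=fingerprint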

Unique Crashes over the Last Two Weeks

The Backtrace platform analyzes incoming crashes so that they are grouped by uniqueness. For example, below we request a list of all unique crashes over the last two weeks. There are a total of nine crashes, but only two unique crashes.
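
Backtrace identifies each unique crash by its fingerprint attribute, so grouping on fingerprint collapses duplicates. A sketch, with --head=callstack pulling a representative callstack for each group:

    $ morgue list coronerd --age=2w --factor=fingerprint --head=callstack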

The coronerd project has 1470 crashes from development and production environments over the last two years. Backtrace deduplicates these crashes to a more manageable backlog of 67 crashes.
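
To see just the deduplicated totals over that longer window, count unique fingerprints instead:

    $ morgue list coronerd --age=2y --unique=fingerprint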

User Impact

It isn’t always easy to assess user impact on servers concurrently handling hundreds, if not thousands, of requests per second. A crash may occur once a day when 100,000 concurrent user sessions are active on a server, while another may occur 100 times a day when there are only 5 concurrent user sessions. With Backtrace, you can easily gain visibility into such patterns so you can better understand impact.

Below, we request a list of the top three unique crashes over the last 30 days, sorted by the number of transactions affected by each crash. We do this by requesting a sum of the number of sessions for every unique group of crashes, and sorting by that sum.
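
A sketch of such a query, assuming crashes carry a sessions attribute recording the number of active sessions at fault time; the semicolon-delimited --sort form for aggregated columns may vary across morgue versions, so check morgue --help if it is rejected:

    $ morgue list coronerd --age=30d --factor=fingerprint \
        --sum=sessions --sort="-;sessions;sum" --limit=3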

Finding patterns from callstacks

We have encountered general instability within the watcher subsystem in one of our projects. It would be useful to know whether there’s a relationship between the crashes; perhaps the root cause is the same. We know that the faults involve invalid memory accesses, but is there a pattern we’re missing? Let’s request some aggregations on any fault whose crashing callstack calls into the watcher subsystem. We would like to better understand the pattern of crashing memory addresses, the commits associated with the crashing applications, classifiers (memory reads versus writes), process uptime distribution, and resident memory usage.
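
A query in this spirit is sketched below. fault.address, classifiers, process.age, and vm.rss.size are standard Backtrace attributes; commit stands in for whatever custom attribute your build pipeline uses to record the revision:

    $ morgue list coronerd --filter=callstack,regular-expression,watcher \
        --histogram=fault.address --histogram=commit --histogram=classifiers \
        --histogram=process.age --histogram=vm.rss.size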

Narrowing this down further, we see that all instances of a null dereference occur only in callstacks involving timer callbacks into the watcher subsystem. We also see that the first instance of the fault was introduced two months ago! We can now bisect commits from that time window to determine what could cause a null dereference.
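
Filters can be stacked to isolate exactly these instances (multiple --filter arguments are ANDed together); the classifier and callstack patterns below are illustrative, and widening --age lets you find the earliest occurrence:

    $ morgue list coronerd --age=1y \
        --filter=classifiers,regular-expression,null \
        --filter=callstack,regular-expression,timer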

Normalize Dumps

Backtrace supports multiple crash formats, from our own proprietary dump format to ubiquitous formats such as minidump. It can be useful to have these objects translated to JSON in order to perform queries across instances of dumps, such as extracting the values of different instances of a variable.

For example, let’s translate objects to compressed JSON buffers and extract the first mapped memory region from each.
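
A sketch for a single object, assuming your coronerd exposes a json.gz resource for processed objects; the object ID 0 is a placeholder, and the .pmap[0] path is hypothetical, since the field holding the memory map depends on the source dump format:

    $ morgue get coronerd 0 --resource=json.gz -o crash.json.gz
    $ gunzip crash.json.gz
    $ jq '.pmap[0]' crash.json

Repeating this for each object ID lets you compare the region across dumps.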

Group crashes by any attribute

Want to group crashes by customer, or filter on a customer and callstack? Or maybe group by callstack? That’s easy too.

Let’s get a list of all detected double-free conditions broken down by environment.
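
Assuming crashes carry an environment attribute and that the classifier is named double-free (morgue describe coronerd lists the attributes available in your project):

    $ morgue list coronerd \
        --filter=classifiers,regular-expression,double-free \
        --factor=environment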

List all crashes from my development machine

Let’s say I want a breakdown of all crashes from my development machine for the coronerd application, along with a breakdown of the types of crashes encountered.
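
This is a filter on the hostname attribute combined with a classifier histogram; replace my-dev-box with your own machine’s hostname:

    $ morgue list coronerd --filter=hostname,equal,my-dev-box \
        --histogram=classifiers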

Breaking down classifiers can be very useful for capturing patterns in how a fault manifests.

Breakdown of all crashes by host over the last week

morgue can be used to quickly identify impact and drill down into complex patterns to improve time to resolution.
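
For example, a per-host breakdown of the last week’s crashes, with classifier histograms and unique-error counts for each host (attribute names may vary by deployment):

    $ morgue list coronerd --age=1w --factor=hostname \
        --histogram=classifiers --unique=fingerprint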

Conclusion

morgue allows you to conveniently drill down into crashes to quickly determine impact and investigate patterns that would otherwise be missed. This helps you mitigate damage to customer satisfaction and improve mean time to resolution.

Interested in giving it a try? Sign up for Backtrace at https://backtrace.io/create.