Crash Deduplication: Triaging Effectively

Depending on your application, crash reports may arrive at a rate of anywhere from a few to many thousands a day. Regardless of scale, fatal errors slip through the cracks, and even for the ones you do catch, it is difficult to understand crash impact and to tell unique crashes apart from groups of duplicates. Your ability to quickly triage and prioritize these fatal errors is crucial to acting on them with urgency. Triage and prioritization rely on determining impact, such as which users are affected by a crash or which crash has the highest impact on revenue. Backtrace helps you do this effectively through its deduplication system.

Backtrace automatically groups crashes using a heuristic algorithm (built on a state machine) so that identical crashes are matched together. By default, this grouping allows you to determine the number of impacted hosts or users. With the help of attributes, it is possible to attach additional pieces of data to crash reports in order to triage by such things as application load, number of impacted versions, and more.
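As a rough illustration of how grouping plus attributes turns into impact numbers, here is a minimal Python sketch. It is not Backtrace's implementation or API: the report dictionaries, the `fingerprint` key (standing in for whatever group identifier deduplication assigned), and the `impact_by_group` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical crash reports. 'fingerprint' stands in for the group identifier
# assigned by deduplication; the remaining keys are attached attributes.
reports = [
    {"fingerprint": "f1", "hostname": "web-1", "version": "1.2.0"},
    {"fingerprint": "f1", "hostname": "web-2", "version": "1.2.0"},
    {"fingerprint": "f1", "hostname": "web-1", "version": "1.3.0"},
    {"fingerprint": "f2", "hostname": "web-3", "version": "1.3.0"},
]


def impact_by_group(reports, attributes):
    """Return, per group, the report count and the number of distinct values per attribute."""
    counts = defaultdict(int)
    distinct = defaultdict(lambda: defaultdict(set))
    for report in reports:
        group = report["fingerprint"]
        counts[group] += 1
        for attr in attributes:
            if attr in report:
                distinct[group][attr].add(report[attr])
    return {
        group: {"count": counts[group],
                **{attr: len(distinct[group][attr]) for attr in attributes}}
        for group in counts
    }


summary = impact_by_group(reports, ["hostname", "version"])
for group, impact in sorted(summary.items()):
    print(group, impact)
```

Sorting the resulting groups by distinct hosts or distinct versions, rather than by raw report count, is one way to prioritize by impact.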

[table id=1 /]

Deduplication

Simplistic callstack-based grouping algorithms lead to systems with either too fine-grained or too coarse-grained grouping. If your grouping algorithm is too fine-grained, the same bug impacting a large number of users will appear as many different “unique” bugs impacting a small set of users.  If your grouping algorithm is too coarse-grained, the impact of a single bug will be inflated, leading to incorrect prioritization and wasted resources.

Our approach to finding a balance between these two extremes relies on a state-based deduplication algorithm. The operation of this state machine is governed by an enumerated set of rules for normalization, skipping frames, and choosing the level of information to include for any given frame (object, line number, function, etc.). The state machine also lets us transform input callstacks for pretty-printing as well as for crash signature generation. Enterprise customers are able to modify these rulesets for their own applications. Both cloud and enterprise customers benefit from frequent updates to these rules, which improve out-of-the-box grouping.
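To make the idea of a rule-driven state machine more concrete, here is a small Python sketch. It is an assumption-laden toy, not Backtrace's actual algorithm or rule format: the `Frame` type, the two states, the skip and normalization rules, and the two-frame limit are all invented for illustration. It skips leading error-handling frames, strips template arguments from function names, and builds the signature from object and function while leaving out line numbers.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    function: str
    obj: str                    # library or executable containing the frame
    line: Optional[int] = None  # carried along but not used in the signature


# Hypothetical rules, invented for this sketch.
SKIP_RULES = [re.compile(p) for p in (r"^raise$", r"^abort$", r"^log_fatal$")]
NORMALIZE_RULES = [(re.compile(r"<.*>"), "")]  # drop template arguments
MAX_FRAMES = 2


def normalize(name: str) -> str:
    for pattern, replacement in NORMALIZE_RULES:
        name = pattern.sub(replacement, name)
    return name


def signature_frames(stack: List[Frame]) -> List[str]:
    """Walk the stack innermost-first through two states: SKIPPING, then COLLECTING."""
    state = "SKIPPING"
    kept = []
    for frame in stack:
        name = normalize(frame.function)
        if state == "SKIPPING":
            if any(rule.search(name) for rule in SKIP_RULES):
                continue  # still inside generic error-handling frames
            state = "COLLECTING"
        kept.append(f"{frame.obj}!{name}")  # object and function, no line number
        if len(kept) == MAX_FRAMES:
            break
    return kept


def fingerprint(stack: List[Frame]) -> str:
    return hashlib.sha256("|".join(signature_frames(stack)).encode("utf-8")).hexdigest()


crash_a = [Frame("abort", "libc.so.6"), Frame("log_fatal", "program.exe", 88),
           Frame("Queue<int>::pop", "program.exe", 120), Frame("worker_loop", "program.exe", 40)]
crash_b = [Frame("raise", "libc.so.6"), Frame("abort", "libc.so.6"),
           Frame("log_fatal", "program.exe", 91),
           Frame("Queue<Job>::pop", "program.exe", 120), Frame("worker_loop", "program.exe", 44)]
crash_c = [Frame("abort", "libc.so.6"), Frame("log_fatal", "program.exe", 88),
           Frame("Cache<int>::evict", "program.exe", 15), Frame("worker_loop", "program.exe", 40)]

print(signature_frames(crash_a))
print(fingerprint(crash_a) == fingerprint(crash_b))
print(fingerprint(crash_a) == fingerprint(crash_c))
```

In this toy, `crash_a` and `crash_b` land in the same group despite different leading frames, template arguments, and line numbers, while `crash_c` stays separate. The list returned by `signature_frames` could also serve as a cleaned-up stack for display.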

Capabilities

Below is the set of capabilities of our flexible deduplication system.

[table id=3 /]

Backtrace Project View

Screenshot of our project view, showing the crash groups affecting the starlox project

 

Backtrace vs. Conventional Deduplication

Group by first application frame

Some systems will group according to the first application frame. This quickly starts to fall apart for many reasons including internal application error handling, faults in external libraries, and more. More likely than not, such a system is too coarse-grained to be useful.

Take the following example for a program called program.exe. The callstack of the crashing thread is abort ← application_abort ← a ← b (innermost frame first), where application_abort is the first application frame. Competing systems will group by application_abort! This function is invoked in almost all cases where the application is explicitly aborting, leading to grossly ineffective deduplication.

This mechanism breaks down for any commonly-used functions, not just error handling functions. For example, let’s say a bug was introduced that leads to a fault in a commonly called utility function. These systems would group these faults by the utility function rather than the caller.

Last but not least, these mechanisms rule out callstack analysis for non-deterministic bugs, where there may be hundreds of different callstacks for the same bug. Backtrace retains the relevant portions of the stack, which enables advanced statistical analysis and visualizations on faulting callstacks.

As noted above, Backtrace selects multiple relevant frames from the full callstack, not just the first frame, as inputs to its deduplication algorithm.
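The following Python sketch contrasts the two approaches on toy data. The stacks, the `APP_MODULE` constant, and both key functions are hypothetical; in particular, `multi_frame_key` is a deliberately simple stand-in for "use several frames" and not the selection logic Backtrace applies.

```python
from collections import Counter

APP_MODULE = "program.exe"

# Toy callstacks as (module, function) pairs, innermost frame first.
crashes = [
    [("libc", "abort"), (APP_MODULE, "application_abort"), (APP_MODULE, "a"), (APP_MODULE, "b")],
    [("libc", "abort"), (APP_MODULE, "application_abort"), (APP_MODULE, "c"), (APP_MODULE, "d")],
    [("libc", "abort"), (APP_MODULE, "application_abort"), (APP_MODULE, "a"), (APP_MODULE, "b")],
]


def first_application_frame(stack):
    """Grouping key used by 'first application frame' systems."""
    for module, function in stack:
        if module == APP_MODULE:
            return function
    return None


def multi_frame_key(stack, depth=3):
    """Simplified alternative: 'depth' frames starting at the first application frame."""
    for i, (module, _) in enumerate(stack):
        if module == APP_MODULE:
            return tuple(function for _, function in stack[i:i + depth])
    return ()


naive_groups = Counter(first_application_frame(s) for s in crashes)
multi_groups = Counter(multi_frame_key(s) for s in crashes)

print(naive_groups)  # every crash collapses into the application_abort group
print(multi_groups)  # the a/b path and the c/d path are kept apart
```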

Group by callstack

On the other end of the spectrum is pure callstack-based grouping. This mechanism tends to be too fine-grained, leading to inaccurate aggregation of faults. Modern applications have a high degree of non-determinism both in their surrounding platform libraries as well as application code. In an event-based system, the same function could be invoked by an event loop processor in many different ways. If there is a hang condition, there are many different locations the hang may manifest.
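A small Python sketch of the event-loop case, using invented function names: one faulting function is reached through three different paths, so keying on the full callstack reports three "unique" crashes. The `leading_frames_key` helper with a depth of 1 is only there for contrast. On its own it would swing back toward grouping that is too coarse.

```python
# Toy callstacks, innermost frame first. The same fault in parse_request is
# reached through different event-loop paths. All names are hypothetical.
stacks = [
    ["parse_request", "on_readable", "event_loop", "main"],
    ["parse_request", "on_timer", "event_loop", "main"],
    ["parse_request", "retry_request", "on_timer", "event_loop", "main"],
]


def full_stack_key(stack):
    """Pure callstack-based grouping: every frame is part of the key."""
    return tuple(stack)


def leading_frames_key(stack, depth=1):
    """Contrast case: only the innermost 'depth' frames are part of the key."""
    return tuple(stack[:depth])


full_groups = {full_stack_key(s) for s in stacks}
leading_groups = {leading_frames_key(s) for s in stacks}

print(len(full_groups), "groups when keyed on the full callstack")
print(len(leading_groups), "group(s) when keyed on the innermost frame")
```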

Some systems attempt to improve on this through restrictions such as only considering application frames. This also starts to break down as crucial application libraries end up being completely ignored for the purposes of fault aggregation.

Backtrace’s deduplication system intelligently determines which frames to use to avoid situations like this. This filtering mechanism can be adjusted according to your preferences.

Group by error type or exception message

Some systems will group simply by the type of error condition or an exception message. It goes without saying that this is insufficient for the vast majority of real-world faults. An exception message may be as generic as “failed to complete user action”. Since the grouping is too coarse-grained, triage and prioritization are ineffective on these systems.
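A brief Python sketch of why a generic message makes a poor grouping key. The error records and the `group_by` helper are hypothetical: two unrelated failures share one message, so message-only grouping merges them, while adding even the faulting function to the key keeps them apart.

```python
from collections import defaultdict

# Hypothetical error reports: same generic message, different faulting code paths.
errors = [
    {"message": "failed to complete user action", "stack": ["save_profile", "handle_request"]},
    {"message": "failed to complete user action", "stack": ["charge_card", "handle_request"]},
    {"message": "failed to complete user action", "stack": ["save_profile", "handle_request"]},
]


def group_by(items, key):
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)


by_message = group_by(errors, lambda e: e["message"])
by_message_and_frame = group_by(errors, lambda e: (e["message"], e["stack"][0]))

print(len(by_message), "group when keyed on the message alone")
print(len(by_message_and_frame), "groups when the faulting function is included")
```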

Signature Lists

Other systems approach this problem by using callstack-based grouping with giant lists of functions to include or exclude for the purposes of deduplication. Unfortunately, these systems are not flexible enough to handle compiler-generated names, non-deterministic callback interfaces, and more. Backtrace has flexibility built in that allows a few simple rules that will fit a majority of use-cases without resorting to giant lists that require frequent maintenance.
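To illustrate the maintenance problem with compiler-generated names, the Python sketch below compares an explicit name list with a single pattern rule. The function names are invented; the `.constprop.N`, `.isra.N`, and `.part.N` suffixes are the clone suffixes GCC appends to specialized copies of a function. The rule syntax here is plain regular expressions chosen for this sketch, not Backtrace's rule format.

```python
import re

# An explicit list has to name every variant the compiler happens to emit.
SIGNATURE_LIST = {"handle_event", "handle_event.constprop.0"}

# One pattern rule covering the function and any chain of GCC clone suffixes.
SIGNATURE_RULE = re.compile(r"^handle_event(\.(constprop|isra|part)\.\d+)*$")

# Names observed in a later build (hypothetical).
observed = [
    "handle_event",
    "handle_event.constprop.0",
    "handle_event.isra.3",
    "handle_event.part.2.constprop.7",
    "handle_events",  # a different function that must not match
]

list_matches = [name for name in observed if name in SIGNATURE_LIST]
rule_matches = [name for name in observed if SIGNATURE_RULE.match(name)]

print("list:", list_matches)
print("rule:", rule_matches)
```

The list misses the two clones it has never seen, whereas the rule covers them without matching the unrelated `handle_events`.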

Conclusion

We hope this post has given you some insight into the Backtrace deduplication system and how it differs from other systems. The Backtrace deduplication system is built on a state machine with a set of configurable rules that transform callstacks for pretty-printing and more accurate grouping. Be on the lookout for follow-up posts about how we are improving our rulesets through statistical analysis of real-world data. We’ll also explore other factors in triage and prioritization, such as our classification and attribute subsystems. Our classification system can help you prioritize bugs based on the type of error (e.g. memory corruption or security hole), ultimately helping you to stop ticking time bombs before they blow up.

If you’re interested in trying out our deduplication system on your own data, sign up for Backtrace here and start submitting your crashes. We’d like your feedback about how our deduplication results match the way you triage and prioritize crashes.

If you want to learn more and talk to us in person, feel free to contact us at support@backtrace.io.

2018-04-09T18:53:46+00:00 | June 1st, 2017 | Engineering, Features, Technical Details