Error Monitoring and Crash Reporting 101

To stay competitive, you need to increase the rate of application development and delivery while still ensuring a great end user experience. This means capturing and fixing bugs before your users find them, and getting visibility into instability in production. To achieve this, you can either combine multiple tools or adopt an Error Monitoring and Crash Reporting solution. I want to address the latter and give a general overview for anyone considering implementing one. The following is a general What, Why, When, and How to help guide your next steps.

What is Error Monitoring and Crash Reporting?

An Error Monitoring and Crash Reporting solution lets you monitor application instability and provides the information you need to prioritize and resolve bugs. Such solutions provide notifications of errors and capture data about the crash that is critical to the triage and resolution process. They are also an important part of automating the debugging process.

Why do you need Error Monitoring and Crash Reporting?

A great end user experience starts with you delivering a stable product. Developers spend ~50% of their time fixing bugs, and much of that time goes to manual work. An Error Monitoring and Crash Reporting solution automates the debugging process, providing tools to prioritize bugs and debug them faster. This helps you reduce the impact of future bugs on your users, and by spending less time debugging, you can spend more time building.

In addition, automating the aggregation of your error data across platforms gives you visibility into your production instability. This means you can get ahead of critical errors before they impact more users and your brand. The contextual information (attributes, tags, etc.) helps you prioritize critical errors against the fixes already queued for the next release.

Finally, with an end-to-end solution, you can better measure metrics like mean time to detection (MTTD) and mean time to resolution (MTTR). Since an Error Monitoring and Crash Reporting solution can start measuring the moment an error is captured, it can track the steps until the issue is triaged and resolved in your ticketing service. This not only gives you visibility into your process, but also helps you measure efficiency and make optimizations. By capturing contextual information like number of users, version, platform, etc., you can answer questions like:

  • How many users were impacted by a unique bug?
  • Which bugs had the highest user impact?
  • How many versions were impacted by a unique bug?
  • Which platforms were impacted by a unique bug?
  • What’s the impact of a bug across my stack?

Being able to easily answer questions like these can help you establish metrics to start measuring the impact of bugs on your users and improve user experience.
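
To make these questions concrete, here is a minimal Python sketch of how impact and resolution time could be computed from captured error data. The event fields (`bug`, `user`, `version`, `platform`), the `lifecycle` timestamps, and the function names are all hypothetical and chosen for illustration; they do not reflect any particular product's data model.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical error events; the field names are assumptions for illustration.
events = [
    {"bug": "null-deref", "user": "u1", "version": "1.0", "platform": "windows"},
    {"bug": "null-deref", "user": "u2", "version": "1.1", "platform": "linux"},
    {"bug": "null-deref", "user": "u1", "version": "1.1", "platform": "linux"},
    {"bug": "oom", "user": "u3", "version": "1.1", "platform": "android"},
]

# Hypothetical timestamps: when each bug was first captured and when its
# ticket was marked resolved.
lifecycle = {
    "null-deref": {"captured": datetime(2018, 4, 1, 9, 0),
                   "resolved": datetime(2018, 4, 1, 15, 0)},
    "oom": {"captured": datetime(2018, 4, 2, 9, 0),
            "resolved": datetime(2018, 4, 2, 11, 0)},
}


def impact_by_bug(events):
    """Count distinct users, versions, and platforms seen for each unique bug."""
    seen = defaultdict(lambda: {"user": set(), "version": set(), "platform": set()})
    for event in events:
        for key in ("user", "version", "platform"):
            seen[event["bug"]][key].add(event[key])
    return {bug: {key: len(values) for key, values in dims.items()}
            for bug, dims in seen.items()}


def mean_time_to_resolution(lifecycle):
    """Average time from first capture to resolution across all bugs."""
    durations = [t["resolved"] - t["captured"] for t in lifecycle.values()]
    return sum(durations, timedelta()) / len(durations)


impact = impact_by_bug(events)
# The bug that affected the most distinct users.
worst_bug = max(impact, key=lambda bug: impact[bug]["user"])
mttr = mean_time_to_resolution(lifecycle)
```

Here the measurement starts at capture time, as described above. Measuring detection time in the same way would additionally need a record of when the error first occurred, which this sketch does not assume.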

When should I enable Error Monitoring and Crash Reporting?

Unless you haven’t written a single line of code, there’s no wrong time to adopt a solution.

  • If you’re releasing a product for the first time, it is critical that you have a solution in place to help deliver a more stable product and get ahead of critical errors.
  • If you’re already in production, you can integrate a solution into your Dev and QA environments, then make it part of an upcoming release.

How do I enable Error Monitoring and Crash Reporting?

There are two things you can do: 1) Use a third party solution. 2) Build your own.

Build vs. Buy: The decision to go either route involves multiple considerations. Are you looking to enable instrumentation for your team? Are you looking to improve MTTD and MTTR? What kind of insights are you looking to get from your error data? Will this tool be used only by developers, or will QA, executives, and product managers need to use it too? Do you want developer time spent building a solution? Do you want your developers spending time maintaining and updating a solution?

Here are some key factors to consider in your decision making process:

  1. Automated Aggregation: Can this tool aggregate error and crash data (minidumps and coredumps) across your platforms? How about across your application stack? Is it capturing all error data, or is it only giving you some of it? At a minimum, you want to capture all error data across platforms for one application.
  2. Automated Symbolication: Can this tool automatically attach debug symbols? If it does, will it accept my symbol format? If it doesn’t, is it OK to attach them manually? Do I need to integrate a private symbol server? Attaching debug symbols ensures the error data is human readable, and it also ensures you get accurate deduplication. At a minimum, you’ll need a mechanism to attach symbols automatically, because attaching them manually means your developers will waste valuable time sorting through all the noise before the errors are deduplicated.
  3. Automated Deduplication: Can this tool group similar errors together? If it does, what techniques is it using to do the grouping? Is it too fine-grained or too coarse? Do I want custom grouping? Grouping reports into unique errors reduces the noise and ensures that your team is only working on the errors that matter. You can either use what’s available out of the box or use solutions that give you the flexibility to adjust the grouping algorithms. The latter allows you to customize grouping rules to fit the unique characteristics of your application.
  4. Contextual Information: Can this tool capture metadata? How about log files? Can I add custom metadata? You want to collect as much data about the error as possible to give your developers more contextual information. This includes things like dumps, log files, attributes, etc., and it can have a significant impact on the prioritization and triaging process.
  5. Integration with Existing Tools: Can this tool give me immediate notifications of errors and automate triaging? Can these be triggered for unique errors? This ensures your team is aware of production errors as they happen and that tickets are automatically created. To take it a step further, you can consider filtering on your metadata to customize triggers. For example, you can notify one user of errors with a particular version, or assign those errors to that user.
  6. Analytics: Can this tool query my data? How flexible are the queries? Does it provide rich visualizations? Is it easy to use for technical and non-technical employees alike? This is critical if you want to measure impact not only across users, but also across versions, platforms, graphics cards, etc. This can also help you prioritize which bugs to address first, identify critical bugs, and identify trends in your error data.
  7. Overhead: Does this tool require developer time for maintenance and support? Is developer time better spent on my own product? This is a key factor to consider in a build vs. buy scenario because it comes down to ROI. Ultimately, developer time has a higher ROI when it’s spent on your own product. Using a third-party solution also means you don’t have to maintain, support, or upgrade an existing solution, or try to keep up with what’s “new”.
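
To illustrate how deduplication, contextual information, and triggers fit together, here is a minimal Python sketch. It assumes one common grouping approach, hashing the top few frames of an already symbolicated call stack, and a trigger that fires only for the first report of a new group whose attributes match a filter. The names `fingerprint`, `Grouper`, and `should_notify`, the report layout, and the sample frames are all hypothetical; real tools use their own, typically more sophisticated, grouping rules.

```python
import hashlib


def fingerprint(frames, depth=3):
    """Hash the top `depth` frame names (innermost first) into a group key."""
    top = "|".join(frames[:depth])
    return hashlib.sha256(top.encode("utf-8")).hexdigest()[:12]


class Grouper:
    """Groups crash reports that share a fingerprint (hypothetical helper)."""

    def __init__(self, depth=3):
        self.depth = depth
        self.groups = {}  # fingerprint -> list of reports

    def add(self, report):
        """Store the report and return (fingerprint, is_new_group)."""
        key = fingerprint(report["frames"], self.depth)
        is_new = key not in self.groups
        self.groups.setdefault(key, []).append(report)
        return key, is_new


def should_notify(report, is_new, rule):
    """Fire only for the first report of a group whose attributes match the rule."""
    attrs = report.get("attributes", {})
    return is_new and all(attrs.get(k) == v for k, v in rule.items())


# Hypothetical reports: frames are listed innermost first.
reports = [
    {"frames": ["memcpy", "parse_header", "handle_request", "main"],
     "attributes": {"version": "2.1"}},
    {"frames": ["memcpy", "parse_header", "handle_request", "worker_loop"],
     "attributes": {"version": "2.1"}},
    {"frames": ["free", "close_session", "main"],
     "attributes": {"version": "2.0"}},
]

grouper = Grouper(depth=3)
rule = {"version": "2.1"}  # only alert on new errors seen in version 2.1
notified = []
for report in reports:
    key, is_new = grouper.add(report)
    if should_notify(report, is_new, rule):
        notified.append(key)
```

With `depth=3`, the first two reports fall into the same group and only one notification is recorded. Raising `depth` to 4 would split them into separate groups, which shows how a single parameter can move grouping between too coarse and too fine-grained.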

Whether you make products for the Enterprise or Consumer industry, there will be instability in your software application. To stay competitive, you need to increase the velocity of your app team, but you have to do this while delivering valuable products that provide a great end user experience. To address this, consider implementing an Error Monitoring and Crash Reporting solution for your team.

Backtrace’s Turnkey Error Monitoring and Crash Reporting System

The Backtrace platform is a single solution to automatically monitor, capture, organize, and analyze your application errors to resolve them faster.

[Figure: A diagram of our crash reporting system]

Backtrace captures detailed dumps of failed application state, automates deep analysis of the data and highlights important classifiers, and archives this in a powerful database. All of this seamlessly plugs into existing workflow tools like Slack, JIRA, and PagerDuty. Improve your time to detection and resolution, increase your team’s efficiency, and take control of errors with Backtrace.

 

April 26th, 2018 | Backtrace