Recently, one of our clients, Circonus, published a blog post about a fault they experienced in one of their small but mission-critical components. We were excited to see Backtrace deeply involved in shortening their time to resolution. The experience is a great example of why sophisticated software organizations need more than just monitoring and reporting to build a robust error management process. Deep introspection, analysis, automation, and workflow integration are all critical to ensuring that systems come back up and stay up with minimal impact.
Error Management Use Case
Circonus uses the Backtrace Platform to detect and capture faults throughout their global network. As soon as a fault happens, the Backtrace Platform notifies Circonus directly in their Slack channel. Messages sent by Backtrace (example below) contain detailed information captured from the application at the time of the fault, along with highlighted signals, so that their team can immediately start investigating and work toward the root cause.
It took them just 2 minutes from the first fault notice to confirm there was a problem, and only 12 minutes to push a fix. Not a bad MTTD and MTTR (mean time to detection and mean time to resolution). In the words of Theo, the Circonus CEO: “Backtrace was able to reduce a highly inconsistent hours-to-days process to minutes.”
We’re fortunate to have technically savvy customers integrating our debugging platform as a critical part of their error management workflow. Having them publicly share their experiences is icing on the cake.