Error Report Analysis from the Command Line

Backtrace now includes a completely new storage and indexing subsystem that lets engineers slice and dice hundreds of attributes in real time, so they can more easily triage and investigate errors across their ecosystem, all from the comfort of their command-line environment. When an application crashes, hundreds of data points may be relevant to the fault: application-specific attributes such as version or request type, crucial fault data such as the crashing address, fault type, or garbage collector statistics, and environment data such as system memory utilization.

Read on to learn how you can interact with Backtrace from the command line for crash report and error investigation.

In this blog post, we will walk through several examples of interacting with a Backtrace object store, containing crashes across all environments, from the command line. morgue is the command-line frontend to the Backtrace object store. Simply install it with npm install -g backtrace-morgue to give it a spin. Additional documentation is available on GitHub.

Summary Information

Let’s get a quick summary of errors in the coronerd project over the last month. Specifically, we would like the overall number of occurrences, a histogram of affected versions, the number of unique machines affected by the fault, and a count of unique errors. In this particular case, 39 errors occurred, but they are just different manifestations of the same 15 underlying errors.

$ morgue list coronerd --unique=hostname --histogram=version --unique=fingerprint --age=2M
* ▂▁▁▁▅▁▂▂▁▁▄▅▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▇▃▁▁ 5 days ago
Date: Thu Dec 01 2016 12:21:27 GMT-0500 (EST)
Tue Jan 24 2017 09:28:34 GMT-0500 (EST)
Occurrences: 39
histogram(version):
1.1.1 ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 10
1.0.77 ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 9
1.0.74 ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 5
1.0.69 ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 4
1.0.54 ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 4
1.0.59 ▆▆▆▆▆▆▆▆▆▆▆▆ 3
1.0.60 ▆▆▆▆▆▆▆▆ 2
0.15.0 ▆▆▆▆▆▆▆▆ 2
unique(hostname): 6
unique(fingerprint): 15

sbahra: backtrace/coronerd as of 2M ago [261 us]
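Under the hood, a query like this is essentially a group-by over indexed crash attributes. As a rough Python sketch of the idea (not morgue's actual implementation, and using made-up records):

```python
from collections import Counter

# Hypothetical crash records; in practice these fields are attributes
# indexed by the Backtrace object store.
crashes = [
    {"version": "1.1.1", "hostname": "a", "fingerprint": "f1"},
    {"version": "1.1.1", "hostname": "b", "fingerprint": "f2"},
    {"version": "1.0.77", "hostname": "a", "fingerprint": "f1"},
]

occurrences = len(crashes)                                      # Occurrences
version_histogram = Counter(c["version"] for c in crashes)      # histogram(version)
unique_hostnames = len({c["hostname"] for c in crashes})        # unique(hostname)
unique_fingerprints = len({c["fingerprint"] for c in crashes})  # unique(fingerprint)

print(occurrences, dict(version_histogram), unique_hostnames, unique_fingerprints)
```

Each --histogram and --unique flag simply adds another aggregation of this kind to the query.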

Unique Crashes over the Last Week

The Backtrace platform analyzes incoming crashes so that they are grouped by uniqueness. For example, below we request a list of all unique crashes over the last week. There are a total of nine crashes, but only two unique crashes.

$ morgue list coronerd --template=unique --age=1w
0ed1deb3d97312a18d2b1ba85f05... ▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ 6 days ago
Date: Mon Jan 23 2017 16:11:36 GMT-0500 (EST)
Mon Jan 23 2017 16:12:14 GMT-0500 (EST)
Occurrences: 2
head(application): coronerd
callstack:
stream_impl_ssl_close ← cr_stream_close ←
graceful_connection_close ← cr_io_enter ← cr_pu_enter ← httpd_server ←
cr_pu_spawn_thread
head(classifiers): segv stop null memory.read
unique(hostname): 1
histogram(version):
1.1.1 ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 2

34adf08139c073c51357abcb4c60... ▁▁▁▁█▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ 5 days ago
Date: Mon Jan 23 2017 16:20:42 GMT-0500 (EST)
Tue Jan 24 2017 09:28:34 GMT-0500 (EST)
Occurrences: 7
head(application): coronerd
callstack:
abort ← cr_disk_buffer_map ← cr_buffer_map ← cr_buffer_write ←
url_handler_post_multipart ← call_url_handler_for_read ←
multipart_part_data_cb ← multipart_parser_execute ← parse_body ←
http_parser_execute ← request_read ← cr_io_enter ← cr_pu_enter ←
storage_dp_slave ← cr_pu_spawn_thread
head(classifiers): abort stop
unique(hostname): 1
histogram(version):
1.1.1 ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 7

sbahra: backtrace/coronerd as of 1w ago [354 us]
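The deduplication above boils down to grouping crash reports by their fingerprint. A minimal sketch of that idea, with made-up fingerprints mirroring the output above (occurrence counts of 2 and 7):

```python
from collections import defaultdict

# Nine hypothetical crash reports collapsing into two unique crashes.
reports = ["0ed1deb3"] * 2 + ["34adf081"] * 7

groups = defaultdict(int)
for fingerprint in reports:
    groups[fingerprint] += 1  # one group per unique fingerprint

print(len(reports), len(groups), dict(groups))
# nine total crashes, two unique fingerprints
```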

The coronerd project has 1470 crashes from development and production environments over the last two years. Backtrace deduplicates these crashes into a more manageable backlog of 67 unique crashes.

User Impact

It isn’t always easy to assess user impact on servers that concurrently handle hundreds if not thousands of requests per second. A crash may occur once a day while 100,000 concurrent user sessions are active on a server, while another may occur 100 times a day when there are only 5 concurrent user sessions. Backtrace gives you visibility into such patterns so you can better understand impact.

Below, we request a list of the top three unique crashes over the last 30 days, sorted by the number of user sessions affected by each crash. We do this by requesting the sum of sessions for every unique group of crashes, and sorting by that sum.

$ morgue list project --factor=fingerprint --sum=sessions --sort=sessions --head=callstack --limit=3
5e30884abec3896a68bc72e8235a... ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁██ 33 minutes ago
Date: Mon Jan 23 2017 16:13:26 GMT-0500 (EST)
Tue Jan 31 2017 13:49:17 GMT-0500 (EST)
Occurrences: 3
callstack:
stream_impl_ssl_close ← cr_stream_close ← cr_stream_destroy ←
cr_httpd_session_destroy ← cr_io_enter ← cr_pu_enter ← httpd_server ←
cr_pu_spawn_thread
sum(sessions): 308519 sessions

0ed1deb3d97312a18d2b1ba85f05... ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁ 1 week ago
Date: Mon Jan 23 2017 16:11:36 GMT-0500 (EST)
Mon Jan 23 2017 16:12:14 GMT-0500 (EST)
Occurrences: 2
callstack:
stream_impl_ssl_close ← cr_stream_close ←
graceful_connection_close ← cr_io_enter ← cr_pu_enter ← httpd_server ←
cr_pu_spawn_thread
sum(sessions): 260360 sessions

e78e5c51c1493fffccc72c6eb899... ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁ 1 day ago
Date: Mon Jan 30 2017 10:09:48 GMT-0500 (EST)
Occurrences: 1
callstack:
epoll_wait ← cr_io_enter ← cr_pu_enter ← cr_fork_start ←
cr_pu_spawn_thread
sum(sessions): 233847 sessions
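The sum-and-sort above is a weighted ranking: group by fingerprint, sum the sessions attribute per group, and order descending. A rough illustration in Python, with hypothetical records (each carrying the number of live sessions at the time of the fault):

```python
from collections import defaultdict

crashes = [
    {"fingerprint": "5e30884a", "sessions": 100000},
    {"fingerprint": "5e30884a", "sessions": 208519},
    {"fingerprint": "0ed1deb3", "sessions": 260360},
    {"fingerprint": "e78e5c51", "sessions": 233847},
]

impact = defaultdict(int)
for c in crashes:
    impact[c["fingerprint"]] += c["sessions"]  # --sum=sessions per --factor=fingerprint

# --sort=sessions --limit=3: highest-impact crash groups first
top = sorted(impact.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)
```

Note how a crash with only a handful of occurrences can still rank first once session counts are taken into account.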

Finding patterns from callstacks

We have encountered general instability within the watcher subsystem in one of our projects. It would be useful to know if there’s a relationship between the crashes; perhaps the root cause is the same. We know that the fault involves invalid memory accesses, but is there a pattern that we’re missing? Let’s request some aggregations on any fault whose crashing callstack calls into the watcher subsystem. I’d like to better understand the pattern of crashing memory addresses, the commits associated with the crashing applications, the classifiers (memory reads versus writes), the process up-time distribution, and resident memory usage.

$ morgue list coronerd --age=2y --filter=callstack,regular-expression,watcher --histogram=classifiers --histogram=fault.address --bin=process.age,1024 --histogram=commit --bin=vm.rss.size,1024 --histogram=timestamp --unique=hostname
* ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁█ 14 hours ago
Date: Thu Dec 08 2016 11:40:57 GMT-0500 (EST)
Tue Jan 31 2017 22:57:48 GMT-0500 (EST)
Occurrences: 5
histogram(timestamp):
Tue Jan 31 2017 22:57:45 GMT-0500 (EST) ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 1
Tue Jan 31 2017 22:57:48 GMT-0500 (EST) ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 1
Fri Jan 13 2017 18:27:47 GMT-0500 (EST) ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 1
Thu Dec 08 2016 12:03:36 GMT-0500 (EST) ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 1
Thu Dec 08 2016 11:40:57 GMT-0500 (EST) ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 1
histogram(classifiers):
segv stop null memory.read ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 2
null memory.read ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 1
segv stop memory.read ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 1
segv stop memory.write ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 1
histogram(fault.address):
0xe0 ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 3
0x6563617270 ▆▆▆▆▆▆▆▆▆▆▆▆▆ 1
0x465bc1 ▆▆▆▆▆▆▆▆▆▆▆▆▆ 1
histogram(commit):
bef6c66720c514e75cbe1460a401f4e15881604f ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 2
a3da61355f0c5cc68ff8d18e580eb3c80e3f461b ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 1
12806d38ac0520c0b981c1e23b0db19c32f0343e ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 1
unique(hostname): 3
bin(process.age):
1351 sec 1732 sec ▆▆▆▆▆ 1
84409 sec 84790 sec ▆▆▆▆▆ 1
104602 sec 104983 sec ▆▆▆▆▆▆▆▆▆▆ 2
bin(vm.rss.size):
12996 kB 17525 kB ▆▆▆▆▆▆▆▆▆▆ 1
44699 kB 49228 kB ▆▆▆▆▆▆▆▆▆▆ 1
49228 kB 53757 kB ▆▆▆▆▆▆▆▆▆▆ 1
551947 kB 556476 kB ▆▆▆▆▆▆▆▆▆▆ 1

sbahra: backtrace/coronerd as of 2y ago [404 us]
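The --bin aggregation buckets a numeric attribute into fixed-width ranges and counts occurrences per range, which is how the process.age and vm.rss.size distributions above are produced. A rough sketch of the idea in Python, with an arbitrary bin width and made-up process ages (the bin-width parameter here is illustrative, not morgue's exact semantics):

```python
from collections import Counter

WIDTH = 1000  # arbitrary bin width in seconds, for illustration only

process_ages = [1351, 84409, 104602, 104900, 104950]  # hypothetical values

# Map each value to the lower bound of its bucket, then count per bucket.
bins = Counter((age // WIDTH) * WIDTH for age in process_ages)
for low in sorted(bins):
    print(f"{low} sec - {low + WIDTH} sec: {bins[low]}")
```

Clusters in a binned view like this make it easy to spot, for example, crashes that only occur after long process up-times or at high resident memory usage.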

Narrowing this down further, we see that all instances of a null dereference occur only in callstacks involving timer callbacks into the watcher subsystem. We also see that the first instance of the fault was introduced two months ago! We can quickly narrow down and bisect commits from that time window to determine what could cause a null dereference.

$ morgue list coronerd --age=2y --filter=callstack,regular-expression,watcher_ --template=unique --limit=3
2aeffdda940268b84d4a1e6a7f78... ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁ 2 months ago
Date: Thu Dec 08 2016 11:40:57 GMT-0500 (EST)
Occurrences: 1
head(application): coronerd
callstack:
ck_pr_fas_uint ← watcher_consume ← watcher_interval ←
cr_timerset_dispatch_current_level1_bucket ←
cr_timerset_advance_n_seconds ← cr_timerset_dispatch_through ←
cr_timerset_dispatch_through_now ← cr_io_enter ← cr_pu_enter ←
watcher_base ← cr_pu_spawn_thread
head(classifiers): segv stop null memory.read
unique(hostname): 1
histogram(commit):
bef6c66720c514e75cbe1460a401f4e15881604f ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 1
histogram(version):
1.0.54 ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 1

3cad3c2efb8bae6c991c105b2bb3... ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁ 2 months ago
Date: Thu Dec 08 2016 12:03:36 GMT-0500 (EST)
Occurrences: 1
head(application): coronerd
callstack:
ck_pr_fas_uint ← ck_spinlock_fas_trylock ← watcher_consume ←
watcher_interval ← cr_timerset_dispatch_current_level1_bucket ←
cr_timerset_advance_n_seconds ← cr_timerset_dispatch_through ←
cr_timerset_dispatch_through_now ← cr_io_enter ← cr_pu_enter ←
watcher_base ← cr_pu_spawn_thread
head(classifiers): segv stop memory.write
unique(hostname): 1
histogram(commit):
bef6c66720c514e75cbe1460a401f4e15881604f ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 1
histogram(version):
1.0.54 ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 1

63067a2e4fbd150cb79de6d7ed4e... ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ 3 weeks ago
Date: Fri Jan 13 2017 18:27:47 GMT-0500 (EST)
Occurrences: 1
head(application): coronerd
callstack:
cr_project_watcher_insert ← watcher_event ←
_watcher_event_1_handler ← _bpg_object_create ← handle_action_create ←
handle_action ← cr_bpg_handler_bpg ← route_task ← cr_pu_ptp_drain ←
cr_io_enter ← cr_pu_enter ← main
head(classifiers): segv stop memory.read
unique(hostname): 1
histogram(commit):
12806d38ac0520c0b981c1e23b0db19c32f0343e ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 1
histogram(version):
1.0.77 ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 1

Normalize Dumps

Backtrace supports multiple crash formats, from our own proprietary dump format to ubiquitous formats such as minidump. It can be useful to have these objects translated to JSON in order to perform queries across instances of dumps, such as extracting the value of a variable across different instances.

For example, let’s translate a dump to a compressed JSON buffer and extract the first mapped memory region of the object.

$ morgue get coronerd 5c0 --resource=json.gz | gunzip -c | json_reformat | jq .pmap[0]
{
"low": "0x400000",
"high": "0x654000",
"mode": "r-xp",
"path": "/opt/backtrace/sbin/coronerd"
}
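Once a dump is in JSON form, scripting over it is straightforward. For instance, the size of the mapped region returned above can be computed from its low and high addresses:

```python
import json

# The first pmap entry from the jq output above.
region = json.loads("""{
    "low": "0x400000",
    "high": "0x654000",
    "mode": "r-xp",
    "path": "/opt/backtrace/sbin/coronerd"
}""")

# Addresses are hex strings; parse them to compute the region size.
size = int(region["high"], 16) - int(region["low"], 16)
executable = "x" in region["mode"]
print(f"{region['path']}: {size} bytes, executable={executable}")
```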

Group crashes by any attribute

Want to group crashes by customer, or filter on a customer and callstack? Or maybe group by callstack? That’s easy too.

$ morgue list blackhole --age=3y --factor=dc
nym2 ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▂█▁▁▁▁▁▁▁▁ 8 months ago
Date: Wed Mar 25 2015 18:02:14 GMT-0400 (EDT)
Wed Jun 15 2016 10:22:03 GMT-0400 (EDT)
Occurrences: 22892

lax1 ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▃▁▁█▂▁▁▄▁▁▂▁▁▁▁▁▁ 8 months ago
Date: Wed Mar 25 2015 16:59:05 GMT-0400 (EDT)
Tue Jun 14 2016 11:23:56 GMT-0400 (EDT)
Occurrences: 7138

ams1 ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▂▂▁▁▄▁▁▁█▁▁▁▁▁▁▁▁ 8 months ago
Date: Fri Mar 27 2015 09:12:33 GMT-0400 (EDT)
Tue Jun 14 2016 17:56:39 GMT-0400 (EDT)
Occurrences: 3815

fra1 ▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▅▃▁▁▅▁▁▁▃▁▁▁▁▁▁▁▁ 8 months ago
Date: Wed May 06 2015 16:18:16 GMT-0400 (EDT)
Mon Jun 13 2016 13:17:44 GMT-0400 (EDT)
Occurrences: 1664

sbahra: yolo/blackhole as of 3y ago [11940 us]

Let’s get a list of all detected double-free conditions broken down by environment.

$ morgue list blackhole --age=3y --filter=classifiers,regular-expression,double-free --histogram=environment
* ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄█▆▁▁▁▁▁▁ 8 months ago
Date: Wed Dec 09 2015 14:53:57 GMT-0500 (EST)
Tue Jun 14 2016 16:57:10 GMT-0400 (EDT)
Occurrences: 572
histogram(environment):
prod ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 506
testrestream ▆▆▆▆▆ 62
test50934 3
test54902 1

sbahra: yolo/blackhole as of 3y ago [2910 us]

List all crashes from my development machine

Let’s say I want a breakdown of all crashes from my development machine for the coronerd application, along with a breakdown on the types of crashes encountered.

$ morgue list blackhole --age=2y --filter=hostname,regular-expression,mymachine --filter=application,equal,coronerd --histogram=classifiers
* ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▃█▃▁▁ 2 weeks ago
Date: Mon Jun 13 2016 15:43:54 GMT-0400 (EDT)
Tue Jan 17 2017 14:37:53 GMT-0500 (EST)
Occurrences: 115
histogram(classifiers):
null memory.read ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 28
null memory.write ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 17
invalid-free null memory.write ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 11
invalid-free ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 11
invalid-free null memory.read ▆▆▆▆▆▆▆ 5
memory.read ▆▆▆▆ 3
memory.write ▆ 1
invalid-free null memory.write double-free ▆ 1
null memory.write invalid-free ▆ 1
memory.write si_user ▆ 1

sbahra: backtrace/blackhole as of 2y ago [508 us]
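Conceptually, stacking --filter flags like this narrows the record set before aggregation: a regular-expression match on one attribute, an equality match on another, then a histogram over what remains. A small illustration with hypothetical records:

```python
import re
from collections import Counter

# Hypothetical crash records to illustrate the two filters above.
crashes = [
    {"hostname": "mymachine.dev", "application": "coronerd", "classifiers": "null memory.read"},
    {"hostname": "mymachine.dev", "application": "coronerd", "classifiers": "invalid-free"},
    {"hostname": "prod-7",        "application": "coronerd", "classifiers": "null memory.read"},
    {"hostname": "mymachine.dev", "application": "other",    "classifiers": "memory.write"},
]

matched = [
    c for c in crashes
    if re.search("mymachine", c["hostname"])  # regular-expression filter
    and c["application"] == "coronerd"        # equality filter
]
breakdown = Counter(c["classifiers"] for c in matched)  # histogram(classifiers)
print(len(matched), dict(breakdown))
```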

Breaking down classifiers can be very useful to capture patterns on how a fault manifests.

$ morgue list coronerd --filter=callstack,regular-expression,watcher_ --age=3M --histogram=classifiers
* ▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁█ 15 hours ago
Date: Thu Dec 08 2016 11:40:57 GMT-0500 (EST)
Tue Jan 31 2017 22:57:48 GMT-0500 (EST)
Occurrences: 5
histogram(classifiers):
segv stop null memory.read ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 2
null memory.read ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 1
segv stop memory.read ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 1
segv stop memory.write ▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆ 1

sbahra: backtrace/coronerd as of 3M ago [394 us]

Breakdown of all crashes by host over the last week

morgue can be used to quickly identify impact and drill down into complex patterns to improve time to resolution.

$ morgue list blackhole --age=1w --factor=hostname
testing.sp.backtrace.io ▂▁▁▂█▅▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁ 2 days ago
Date: Wed Jan 25 2017 15:00:26 GMT-0500 (EST)
Mon Jan 30 2017 14:44:47 GMT-0500 (EST)
Occurrences: 785

roneill ▁▁▁▁▁██▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁ 2 days ago
Date: Thu Jan 26 2017 17:58:57 GMT-0500 (EST)
Mon Jan 30 2017 14:36:23 GMT-0500 (EST)
Occurrences: 4

wandrews.backtrace.io ▁▁▁▁▁▁▁▄█▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ 5 days ago
Date: Fri Jan 27 2017 06:48:24 GMT-0500 (EST)
Fri Jan 27 2017 18:45:22 GMT-0500 (EST)
Occurrences: 4

skylake ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁ 2 days ago
Date: Mon Jan 30 2017 20:58:37 GMT-0500 (EST)
Mon Jan 30 2017 21:06:43 GMT-0500 (EST)
Occurrences: 6

sbahra: backtrace/blackhole as of 1w ago [534 us]

Conclusion

morgue allows you to conveniently drill down into crashes so you can quickly determine impact and investigate patterns that would otherwise be missed. This helps you mitigate damage to customer satisfaction and improve mean time to resolution.

Interested in giving it a try? Sign up for Backtrace at https://backtrace.io/create.

February 15th, 2017 | Backtrace, Engineering, Technical Details