Breakpoint is a series of articles where we follow engineers along the journey of debugging the most challenging bugs they’ve encountered over the years. These war stories, in turn, help inform our product roadmap. Last time, we talked to Paul Khuong about his worst debugging experience. In this article, we’ll hear a story regarding a multi-threaded server from George Neville-Neil, a computer scientist, author, and practicing software engineer. He is a FreeBSD Core Team Member and President of the FreeBSD Foundation, all while running a consulting company that builds high-speed, low-latency systems for the financial services sector.
Multi-threading Server Madness
Since the use of multicore and multiprocessor systems rose in the 1990s, engineers have been actively creating tools and best practices that allow for the issue-free use of these architectures and parallel programming paradigms. A prominent goal has been to create methods that exploit the speed and flexibility of multi-threading while avoiding the common drawbacks of multiple threads operating on a single set of data. One of the most reliable and most common of these methods is the use of a thread-safe language. These include languages and libraries that proactively avoid many of the challenges that come with multi-threaded programming: race conditions, deadlocks, and synchronization errors.
While thread-safe languages are common now, engineers still need to think at the scale of the whole system to understand how multiple threads might be used within it. Teams and individuals have to stay vigilant and test that the pieces of their code work together as a unit rather than stomping on each other. Errors and inconsistencies related to multi-threading are hard to test for and difficult to find or replicate in a debugger.
What happens when pieces are missed? When multi-threading is not safe, the consequences can be disastrous, as George Neville-Neil describes below:
Enter George Neville-Neil
There is no magic method to make a large and complex system work
Probably my favorite example of not thinking clearly about threaded programming was a group that wanted to speed up a system they had developed, which included a client and a server component. The system was already deployed, but when it was scaled up to handle more clients, the server, which could only handle one request at a time, couldn’t serve as many clients as was called for. The solution, of course, was to multi-thread the server, which the team dutifully did. A thread pool was created, and each thread handled a single request and sent back an answer to a client. The new server was deployed and more clients could now be served. Just one thing was left out when the new server was multi-threaded: the concept of a transaction identifier. In the original deployment all of the requests were handled in a single-threaded manner, which meant that a reply to request N could not be processed before request N-1, but once the system was multi-threaded it was possible for a single client to issue multiple requests and for the replies to return out of order. A transaction ID would have allowed the client to match its requests to the replies, but this was not considered, and, when the server was not under peak load, no problems occurred. The testing of the system did not expose the server to a peak load, and so the problem was not noticed until the system had been completely deployed.
Unhappily, the system in question was serving banking information, which meant that a small but non-zero number of users wound up seeing not their own account information but that of other customers, resulting in not just the embarrassment of the development team, but the shutdown of their project and, in several cases, firings. Alas, the firings were not out of cannons, which I always felt was a pity.
What you ought to notice about this story is that it has nothing to do with inter-thread locking, which is what most people think of when they’re told that a piece of code is multi-threaded. There is no magic method to make a large and complex system work, threaded or not. The system must be well understood in total, and the side effects of possible error states must be well understood. Threaded programs and multi-core processors don’t make things more dangerous per se; they just increase the damage when you get it wrong.
Exit George Neville-Neil
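To make the missing piece concrete, here is a minimal Python sketch of the transaction-ID idea from George's story. It is not the code from the system he describes; the names `serve`, `run_client`, and `txn_id`, the reply format, and the use of an in-process queue in place of a network connection are all assumptions made for illustration. The server side runs in a thread pool and may finish requests in any order, while the client keeps a table of outstanding requests and uses the ID echoed in each reply to pair it with the right request.

```python
import queue
import random
import time
from concurrent.futures import ThreadPoolExecutor


def serve(request, replies):
    """Hypothetical server worker: handles one request and echoes its txn_id."""
    # Random delay so that replies are likely to come back out of order.
    time.sleep(random.uniform(0, 0.01))
    replies.put({
        "txn_id": request["txn_id"],  # echoed so the client can match the reply
        "payload": f"balance of {request['account']}",
    })


def run_client(accounts, workers=4):
    """Hypothetical client: tags each request and matches replies by txn_id."""
    replies = queue.Queue()  # stands in for the shared connection to the server
    pending = {}             # txn_id -> account the request was about

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for txn_id, account in enumerate(accounts, start=1):
            pending[txn_id] = account
            pool.submit(serve, {"txn_id": txn_id, "account": account}, replies)
    # Leaving the with-block waits for all workers, so every reply is queued.

    matched = {}
    while pending:
        reply = replies.get()
        # Match on the echoed ID, never on arrival order.
        account = pending.pop(reply["txn_id"])
        matched[account] = reply["payload"]
    return matched


if __name__ == "__main__":
    print(run_client(["alice", "bob", "carol"]))
```

Without the `txn_id` field, the client in this sketch would have nothing but arrival order to go on, which is exactly the assumption that stopped holding once the server handled requests concurrently.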
Want to hear more wisdom and stories from George? You can follow him on Twitter or catch up on the column he writes for ACM as his alter ego, Kode Vicious. George’s columns are a delightful mix of common sense and technical wisdom, perfectly emulating his (self-described) persona: “Fool with a heart of gold. Always willing to teach, but unwilling to teach those who are not willing to learn.” In his column, and on the various podcasts and guest blogs he frequents, George is passionate and outspoken about his projects, making listening to him a delight. Be sure to follow him to keep up!
How does Backtrace fit in?
Backtrace helps you investigate and prioritize errors and crashes in your production and development environments. It provides a flexible query system and integrations with your existing tools, such as Slack, Jira, and PagerDuty, to help you detect and resolve issues faster. If you’re part of a software team, work adjacent to a software team, or have projects of your own, this is your chance to learn how Backtrace will exceed your crash reporting expectations and go beyond error monitoring. You can learn more about how Backtrace can support your needs on our product pages, and follow us on Twitter for more Breakpoint.