Somewhere in downtown Manhattan, a combined team of VPs, developers, and operations staff nervously anticipates the market open. After months of testing, they are about to push their trading system into production. As scripted, the market opens and the first trade is made. The team celebrates the big milestone and returns apprehensively to the screens, watching for any anomalies. After a while, the tension fades as the trading day proceeds smoothly.
Joe, a member of the operations team, is checking on several components of the system while eating lunch at his desk. Something is amiss: the message queue in the trade reporting service is growing every second. Trades are coming in much faster than trade reports are going out.
A murmur grows in the operations team, and then the VPs start to take notice. A quick calculation reveals what they all feared: by the end of the trading day, the queue will be so large that most of the trades will not be reported until after the trade report deadline. This will result in serious fines for the broker.
A crowd builds around Joe’s desk as the VPs press the developers about their options for speeding up trade report processing. It turns out the trade reporting piece was developed by AfterTrade, a software firm in Europe, and everybody there has already left for the day. The after-hours support line answers, but the developers at AfterTrade are unreachable. One of the VPs calls in a financial technology consultant to bring a different perspective to the situation.
The situation becomes more dire as the trading day proceeds toward the close. A market spike at the end of the day puts the final trades in the queue. The worst part is that the entire thing will start again in the morning.
A long day becomes a long night for the combined development and operations teams. The consultant sits down at Joe’s desk and starts digging into the details of the third-party service. After searching for hours, he finds a nugget of hope: a critical database query that can be optimized with an index. They apply the fix and watch the queue drain ten times faster than before. Today’s deadline is long past, but there is hope for tomorrow.
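The kind of fix the consultant found can be sketched in miniature. The schema and column names below are purely illustrative (the original AfterTrade query is unknown); the sketch uses SQLite to show how adding an index on the filtered column changes a full-table scan into an index search:

```python
import sqlite3

# Illustrative stand-in for the trade reporting backlog: a table of trades,
# repeatedly queried for the ones not yet reported. Schema and names are
# hypothetical, not from the actual system in the story.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades (id INTEGER PRIMARY KEY, symbol TEXT, reported INTEGER)"
)
conn.executemany(
    "INSERT INTO trades (symbol, reported) VALUES (?, ?)",
    [("ACME", i % 2) for i in range(1000)],
)

query = "SELECT id FROM trades WHERE reported = 0"

# Before the fix: the planner must scan every row on each poll.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]

# The fix: an index on the column the hot query filters by.
conn.execute("CREATE INDEX idx_trades_reported ON trades (reported)")

# After the fix: the planner searches the index instead of scanning.
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]

print(plan_before)  # a table scan
print(plan_after)   # a search using idx_trades_reported
```

The exact wording of the query plan varies by SQLite version, but the shift from a scan to an index search is the same effect that let the queue drain an order of magnitude faster.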
The names and some details of the story above were changed to protect the innocent, but this type of story is all too common. We have been architecting well-factored systems down into components, services, and now microservices for years. Problems like the bottleneck in the story above still happen largely because of the tools we use and the way we use them. Many of the services in these types of event-driven architectures are built using the classical single-threaded consumer model.
With the advent of serverless infrastructure, developers have a better tool in the toolbox for these types of architectures. The traditional approach has been to build a microservice that consumes messages one at a time: the microservice is always on call, but its fixed capacity leaves it at the mercy of the queue it serves. With serverless technology, the microservice is reduced to a single function, the unit of work the microservice performed. Instead of running one microservice (or a small fixed pool of them) against the queue, the function can be run on any number of distributed nodes. Infrastructure providers such as AWS, GCP, and Azure hide the details of how computing power is allocated and functions are started. The result is a programming paradigm that can scale almost instantly to meet burst demand while costing the same as, or even less than, a traditional architecture.
Serverless infrastructure was not around when this story took place. Had the trade reporting service been implemented as a serverless function, there would have been no bottleneck: as trades came in, functions would fire up to generate trade reports, and when the bursts at market open and market close occurred, the infrastructure provider would scale up to meet the demand. Now that we have a tool like serverless infrastructure, it is important for us as technologists to use it correctly. Whenever I weigh whether serverless functions fit a problem, this story comes to mind.