Don't load-balance your (invisible) queues!
24. September 2021
A queue is a device for delaying. A governor is used to control a flow. Control engineering tells you that you shouldn't add a delay element to a control loop, as that may cause oscillations. That's easy, right? Well, only if you know about your queues. There might be some invisible ones lurking, waiting to get you.
Hi, this is the first post in a series of technical write-ups of problems (and their solutions) we encountered in IT. We hope you enjoy reading our findings!
Some time ago there was an IT system that was upgraded to a new software version. Shortly afterwards it started to malfunction as soon as the load started to increase: requests were being answered for some time, then a lot of them timed out, then requests were answered again, and so on.
The design looked quite innocent - a load balancer distributing requests to a range of middleware servers that did I/O against a database. How could that oscillate?
After some investigation we realized that the middleware had a pool of threads processing the requests and a pool of database connections - but the two pools were not the same size, probably to limit the number of simultaneously active database connections.
That in itself wouldn't have been a problem… but the middleware tried to cope with overload not by holding the request in its thread until a connection became free, but by putting it on an internal queue for later processing!
Unfortunately that implied that threads were becoming free to start processing another request - and the load balancer happily obliged, passing data to the same server again and again… but these newly handed out requests just blocked on the database connection pool again, as the actual transaction throughput on a single server was lower than the incoming request rate.
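To make the effect easier to see, here is a toy model of a single middleware server. It is only a sketch: the class, the pool size and the arrival and completion rates are made-up assumptions, and it assumes the load balancer judges a server by how busy it looks, which is our reading of the behaviour described above rather than a documented setting.

```python
from collections import deque


class Middleware:
    """Toy model of one middleware server (all numbers are assumptions)."""

    def __init__(self, db_connections=4):
        self.db_connections = db_connections
        self.in_db = 0                # requests currently holding a DB connection
        self.hidden_queue = deque()   # requests parked for later processing

    def accept(self, request):
        if self.in_db < self.db_connections:
            self.in_db += 1
        else:
            # Pool exhausted: park the request. The worker thread is free
            # again at once, so the server still looks idle from outside.
            self.hidden_queue.append(request)

    def tick(self, completions):
        """Finish up to `completions` transactions, then refill from the queue."""
        self.in_db -= min(completions, self.in_db)
        while self.hidden_queue and self.in_db < self.db_connections:
            self.hidden_queue.popleft()
            self.in_db += 1

    def visible_load(self):
        # What a load-based balancer could observe from the outside.
        return self.in_db

    def real_backlog(self):
        # What the server actually still owes its clients.
        return self.in_db + len(self.hidden_queue)


def simulate(server, ticks, arrivals_per_tick, completions_per_tick):
    """Feed the server a constant load and report (visible, real) backlog."""
    for t in range(ticks):
        for i in range(arrivals_per_tick):
            server.accept((t, i))
        server.tick(completions_per_tick)
    return server.visible_load(), server.real_backlog()
```

With arrivals slightly above the completion rate, for example `simulate(Middleware(db_connections=4), ticks=10, arrivals_per_tick=3, completions_per_tick=2)`, the visible load can never exceed the pool size, while the real backlog keeps growing inside the queue nobody is watching.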
But what actually triggered the breakdown? The load curves show more or less constant load for some time!
We could only blame statistical variation. While the input load looked constant, at some point a small fluctuation exhausted the database connection pool at just the right (wrong) time… so a thread put a request on the queue and the avalanche started. And as the upstream machines didn't get a valid answer, they ran into a timeout - and simply re-issued their queries, causing even more load…
Due to another design flaw, the middleware didn't discard queued items that were already older than the defined QoS guarantee - instead it still processed them once CPU (and a database connection) became available again, only to discover that the result could no longer be sent because the TCP stream had already been closed…
So it was only a question of when this would happen, not whether.
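For illustration, this is roughly what discarding stale items could look like. It is not what we ended up doing (we removed the queue altogether), and `DeadlineQueue`, `max_age_seconds` and the injectable `clock` are hypothetical names invented for this sketch.

```python
import time
from collections import deque


class DeadlineQueue:
    """FIFO queue that silently drops requests older than a maximum age."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self.max_age = max_age_seconds   # assumed to mirror the QoS guarantee
        self.clock = clock               # injectable so it can be tested
        self._items = deque()

    def put(self, request):
        # Remember when the request was queued.
        self._items.append((self.clock(), request))

    def get(self):
        """Return the oldest request still within its deadline, or None."""
        while self._items:
            enqueued_at, request = self._items.popleft()
            if self.clock() - enqueued_at <= self.max_age:
                return request
            # Too old: the client has given up already, so doing the work
            # would only burn CPU and a database connection for nothing.
        return None
```

Dropping expired items at least keeps the server from spending scarce database connections on answers nobody is waiting for, although it does nothing about the hidden queue itself.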
We applied two fixes to remedy the situation:
- No more queueing in the middleware
- Changing the load balancer to use round-robin. As there was only one request type, every response should take about the same time, which makes round-robin a good choice here.
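The two fixes can be sketched together as follows. This is a minimal illustration, not our production code: `Middleware`, `Overloaded`, `RoundRobinBalancer` and the `run_query` callback are hypothetical names, and rejecting a request immediately is only one way to avoid queueing (blocking inside the thread would be another).

```python
import threading
from itertools import cycle


class Overloaded(Exception):
    """Raised instead of queueing when no database connection is free."""


class Middleware:
    def __init__(self, name, db_connections):
        self.name = name
        self._db_slots = threading.BoundedSemaphore(db_connections)

    def handle(self, request, run_query):
        # Fix 1: no internal queue. If the pool is exhausted, fail at once
        # so the overload is visible instead of hidden behind a delay.
        if not self._db_slots.acquire(blocking=False):
            raise Overloaded(self.name)
        try:
            return run_query(request)
        finally:
            self._db_slots.release()


class RoundRobinBalancer:
    def __init__(self, servers):
        servers = list(servers)
        if not servers:
            raise ValueError("need at least one server")
        self._rotation = cycle(servers)

    def pick(self):
        # Fix 2: strict rotation, regardless of how busy a server looks.
        return next(self._rotation)
```

Because round-robin ignores how idle a server appears, a server that is quietly falling behind no longer attracts extra traffic, which breaks the feedback loop described above.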
The conclusion is: queues might hide in unexpected locations – and can potentially wreck your entire load-balancing strategy.