Incident Report: Cascading Performance Problems
From last Sunday afternoon through Monday evening and into Monday night, performance problems degraded the Kolab Now service, up to and including services becoming unavailable.
So what caused this? It may have been a combination of factors.
Lately, at about 17:00 UTC on workdays, our infrastructure receives a flood of email messages from third parties. Normally, we can deal with the flood and incur only a slight delay in the delivery of those and other regular messages. It tends to take 5-15 minutes for the flood to subside and all messages to be processed and delivered.
Naturally, we are investigating what causes this almost-daily flood, why it is a flood in the first place, and what we might be able to do about it.
On Monday afternoon, however, this flood was quite a bit larger than any we had seen and dealt with before. This is factor #1.
Our distributed anti-spam filtering environment uses a database to share state between the various nodes. This is a costly exercise in terms of database processing, but it is how we get to mark messages as spam, and as such it is essential. Processing a massive number of messages against a central database incurs a performance penalty on said database. This is factor #2.
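To give a sense of the shape of the problem, here is a minimal, hypothetical sketch (not our actual filtering code, schema, or database): every message that passes through the filter costs at least one read and several writes against the shared state store, so database load scales directly with message volume.

```python
import sqlite3

# A single local database stands in for the shared state store; in reality,
# every filtering node talks to the same central backend.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE sender_reputation (sender TEXT PRIMARY KEY, score REAL);
    CREATE TABLE token_stats (token TEXT PRIMARY KEY, seen INTEGER);
""")

def filter_message(sender, tokens):
    """Score one message against shared state; each message costs several DB round-trips."""
    cur = db.cursor()
    # One read per message: look up the sender's shared reputation.
    cur.execute("SELECT score FROM sender_reputation WHERE sender = ?", (sender,))
    row = cur.fetchone()
    score = row[0] if row else 0.0
    # Several writes per message: keep token statistics in sync for all nodes.
    for token in tokens:
        cur.execute("UPDATE token_stats SET seen = seen + 1 WHERE token = ?", (token,))
        if cur.rowcount == 0:
            cur.execute("INSERT INTO token_stats (token, seen) VALUES (?, 1)", (token,))
    db.commit()
    return score

# A flood of N messages therefore means N reads plus N * len(tokens) writes
# against the same backend, all arriving within a short window.
for i in range(10_000):
    filter_message(f"host{i % 50}.example.net", ["unsubscribe", "invoice", "urgent"])
```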
On Sunday afternoon, usually one of the quietest times, we chose to upgrade some database servers that had, over a longer period of time, slowly but surely been running out of disk space. We run a MariaDB Galera cluster, so the general approach is to add a new node, make it available, and expire an old node. All said and done, we did this twice. This is factor #3a, and it needs a little more background.
Naturally, we use an enterprise-grade storage solution for most email traffic and database servers, among other things, while using something called tmpfs (which resides in memory rather than on disk) where appropriate. This enterprise-grade storage provides us with “storage tiers”: a distinction between “expensive, small and fast” disks and “cheap, large and slow” disks, if you will. Migration between these two tiers happens automatically, based on the frequency of use of extents. This is factor #3b.
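For illustration only (the actual promotion policy belongs to the storage vendor, and the thresholds below are invented): frequency-based tiering means an extent has to accumulate enough accesses before it is promoted, so data that has just landed on a brand-new database node starts out on the slow tier.

```python
from collections import defaultdict

# Conceptual sketch only: the real promotion policy is the storage array's,
# and these numbers are made up for illustration.
FAST_TIER_LATENCY_MS = 0.5    # "expensive, small and fast"
SLOW_TIER_LATENCY_MS = 8.0    # "cheap, large and slow"
PROMOTION_THRESHOLD = 100     # accesses observed before an extent is promoted

access_counts = defaultdict(int)
fast_tier = set()

def read_extent(extent_id):
    """Record an access and return the latency this read would see."""
    access_counts[extent_id] += 1
    if extent_id not in fast_tier and access_counts[extent_id] >= PROMOTION_THRESHOLD:
        fast_tier.add(extent_id)  # promotion only happens after sustained use
    return FAST_TIER_LATENCY_MS if extent_id in fast_tier else SLOW_TIER_LATENCY_MS

# Extents belonging to a freshly added database node start out "cold", so the
# first reads all pay slow-tier latency until the tiering logic catches up.
latencies = [read_extent("db-extent-42") for _ in range(150)]
print(f"first read: {latencies[0]} ms, after warm-up: {latencies[-1]} ms")
```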
The last factor worth mentioning is perhaps best referred to as a ‘cascading effect’. One part of the infrastructure performing badly may cause another part to perform badly, until all the dominoes fall over. In the case of Kolab Now, overloaded database servers on disk volumes that had not yet been promoted to the “expensive, small and fast” storage tier may have been that first domino.
So, how does this cascade? Well:
- Tens of thousands of messages are getting delivered
- Messages are processed against a relatively costly database backend that the storage topology has not yet been optimized for
- Other service components incur additional disk latency as a result
- Queues build up waiting for queues that have already built up (the sketch after this list illustrates the effect)
- Dominoes have fallen, and those that have not yet fallen are about to tip over
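As a back-of-the-envelope sketch (the rates below are invented for illustration, not measured figures): as long as the degraded, database-bound filter processes fewer messages per minute than arrive, the backlog grows every minute instead of draining, and every component queued behind it inherits the delay.

```python
# Back-of-the-envelope numbers, invented for illustration: once the degraded,
# database-bound filter handles fewer messages per minute than arrive, the
# backlog grows instead of draining, and everything queued behind it waits too.
ARRIVALS_PER_MIN = 1200       # inbound flood
NORMAL_RATE_PER_MIN = 1500    # filter throughput with warm, fast-tier storage
DEGRADED_RATE_PER_MIN = 400   # throughput while the database sits on the slow tier

def backlog_after(arrival_rate, service_rate, minutes):
    """Messages still waiting after `minutes`, assuming constant rates."""
    backlog = 0
    for _ in range(minutes):
        backlog = max(0, backlog + arrival_rate - service_rate)
    return backlog

print("normal:  ", backlog_after(ARRIVALS_PER_MIN, NORMAL_RATE_PER_MIN, 30))    # 0: the queue drains
print("degraded:", backlog_after(ARRIVALS_PER_MIN, DEGRADED_RATE_PER_MIN, 30))  # 24000 and still climbing
```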
We are still examining changes in I/O patterns for certain components of our service, and we may have more detailed information later on about what we are doing to address the issue.