Incident Report: Storage Outage at Kolab Now
On Monday, August 23rd 2021, at around 04:00 UTC, Kolab Now suffered a catastrophic storage outage. This post outlines what happened, when it happened, and what happens next.
First, a timeline, drawing on the analysis that has taken place so far; all times are approximate:
Sunday, 16:00 UTC | Hypervisors start reporting iSCSI connection interruptions, but they do not impact production. These interruptions continue throughout the evening and night.
Monday, 04:00 UTC | The storage appliance sighs and gives up, triggering failures in the production environment and setting off the alarm bells.
Monday, 04:30 UTC | Our personnel have woken up and, armed with a cup of coffee, start diagnosing the problem. It is quickly determined that a visit to the datacenter is required. All hands are called on deck.
Monday, 07:00 UTC | The first senior engineer makes it on site. Initial troubleshooting results and candidate courses of action are exchanged with the second senior engineer, who is still commuting.
Monday, 08:00 UTC | IBM is contacted for additional support.
Monday, 11:00 UTC | It takes quite a while to reach higher-level support, but once we gain access to the right knowledge inside the IBM organization, it is established what to investigate and how to plan the contingency.
Monday, 13:00 UTC | A reasonably safe course of action is plotted, one that will end in a non-disastrous situation.
Monday, 14:00 UTC | Further information is exchanged and IBM starts investigating the root cause.
Monday, 17:00 UTC | Recovery is done. Mail (and other data) is flowing again. Access remains limited to give the system room to handle the full day's backlog of data.
Monday, 18:00 UTC | The data flow is subsiding, and access to all systems is opened for users.
Tuesday, 06:30 UTC | IBM confirms the root cause and the combination of contributing factors.
Tuesday, 14:00 UTC | All failing disks have been replaced and/or reseated.
Root Cause Analysis
Background: one of our older storage devices consists of two so-called canisters controlling volumes, each of which is backed by one or more RAID arrays (so-called “mdisks”).
The system uses compression, and while it can typically cope with regular I/O plus one disk failure causing a RAID array rebuild, its compute capacity can at times be exhausted, causing I/O delays to exceed the watchdog thresholds and the canister to reset. In the event at hand, two simultaneous disk failures overloaded the canisters too frequently, and both canisters ended up taking themselves out (Monday, 04:00 UTC). With both canisters down, all I/O to the storage becomes unavailable.
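To make that failure mode concrete, here is a minimal sketch of the watchdog mechanism described above. The thresholds and latency numbers are invented for illustration; the appliance's real firmware logic is IBM's, not this:

    # Illustrative model only: the numbers are invented, not the appliance's.
    # Each canister resets itself when I/O latency stays above its watchdog
    # threshold; if both canisters are overloaded, no controller is left.

    WATCHDOG_MS = 30_000          # hypothetical: maximum tolerated I/O delay
    BASE_LATENCY_MS = 5_000       # hypothetical: compression under normal load
    REBUILD_PENALTY_MS = 15_000   # hypothetical: extra latency per RAID rebuild

    def canister_survives(active_rebuilds: int) -> bool:
        """A canister stays up only while its I/O latency is under the watchdog."""
        latency = BASE_LATENCY_MS + active_rebuilds * REBUILD_PENALTY_MS
        return latency < WATCHDOG_MS

    # One failed disk: 5s + 15s = 20s, under the 30s watchdog; the canister holds.
    print(canister_survives(active_rebuilds=1))   # True

    # Two simultaneous failures: 5s + 30s = 35s. Both canisters see the same
    # overload, both reset themselves, and all I/O to the storage stops.
    print(canister_survives(active_rebuilds=2))   # False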
No data was lost during Monday's events.
Mail that was sent to you during the outage was delayed. Some messages came through right away on Monday evening, when the services were available again, but most were delayed for some hours. The length of the delay and the retry interval are at the discretion of the sending mail server's configuration. Some servers are configured to retry frequently, starting shortly after a delivery failure, and typically keep retrying for up to 5 days; others retry only once a day for 5 days. Any non-delivery or significant delivery delay should have resulted in the sender being notified with a delay message.
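For illustration, here is a sketch of what one common sender-side retry schedule looks like. The backoff values mirror widely used defaults (Postfix, for instance, backs off from 5 minutes up to roughly 67 minutes and bounces after 5 days), but every sending server picks its own:

    # Sketch of one typical sender-side retry schedule; the exact intervals
    # and the 5-day cutoff are entirely at the sending server's discretion.

    MIN_BACKOFF_S = 300                 # first retry ~5 minutes after failure
    MAX_BACKOFF_S = 4000                # cap the backoff at ~67 minutes
    QUEUE_LIFETIME_S = 5 * 24 * 3600    # give up and bounce after 5 days

    def retry_schedule():
        """Yield the elapsed seconds at which delivery is re-attempted."""
        elapsed, backoff = 0, MIN_BACKOFF_S
        while elapsed + backoff <= QUEUE_LIFETIME_S:
            elapsed += backoff
            yield elapsed
            backoff = min(backoff * 2, MAX_BACKOFF_S)

    attempts = list(retry_schedule())
    print(f"{len(attempts)} retries over 5 days; first few, in minutes:",
          [t // 60 for t in attempts[:5]])

A server at the other, slower end of the spectrum would simply retry on a 24-hour interval, for 5 attempts in total.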
Preventing Recurrence
As we’ve mentioned, the storage that failed on Monday is an older part of our infrastructure. We have been working on moving away from this device for a while, but since your data is a massive quantity of ‘Handle With Care’ objects, we are not opting for fast and cheap procedures. Overall, it will take a significant amount of time to reduce the reliance on this piece of storage and move everything over to our new NVMe flash storage with HSM encryption.
As we press the Publish button on this report, we are immediately starting the move of the database components to the new infrastructure. This can be done as a replication exercise, so there is no need to wait for a service window. You will hear more about the planning and execution of the next activities in the coming days.
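For the curious, a move-by-replication boils down to letting a copy on the new infrastructure catch up with the live one, then cutting over once the replication lag reaches zero. Below is a minimal sketch of such a lag check, assuming a MySQL-style replica; the host name and credentials are hypothetical placeholders, and our actual tooling differs:

    # Hedged sketch: poll a MySQL-style replica on the new storage until it
    # has caught up, at which point traffic can be cut over without a
    # service window. Host and credentials below are placeholders.
    import time
    import mysql.connector  # pip install mysql-connector-python

    conn = mysql.connector.connect(
        host="db-replica.new-storage.internal",   # hypothetical replica host
        user="monitor",
        password="...",                           # placeholder
    )
    cur = conn.cursor(dictionary=True)

    while True:
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
        lag = status["Seconds_Behind_Master"]
        if lag == 0:
            print("Replica has caught up; safe to cut over.")
            break
        print(f"Replica lagging {lag} seconds; waiting...")
        time.sleep(10)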
Should you have any questions or concerns in this context, please contact Support.