Incident report: “Stability!”

Welcome back!

We were finally able to bring our blog back online. Over the past two weeks we dealt with a long-lasting incident. We did what we could to ensure that users could keep using their primary services while we worked on the systems, but we know there was unplanned downtime. Since the blog itself was a victim of the situation, we could not keep users informed as well as we wanted (we really do not like X, "formerly known as Twitter"), so here, finally, is the incident report from beginning to end.

On the morning of August 17 (CEST), an old piece of supposedly redundant storage hardware started reporting that it needed service. Shortly after the first messages the canisters gave up, and the status went from critical to non-functional. Although the data was indeed stored redundantly, we lost all access to it due to a firmware malfunction.

For the last three years, the operations team has been aware of the risk involved with this old piece of hardware and has spent a lot of time moving important user data off it. This means that no user data was in danger during the incident. However, the old storage was still backing too many internal systems: data-less front-ends, control and facilitation services, the front page, and, unfortunately, internal DNS. This in turn made other systems unavailable, such as IMAP, Submission, the web servers, Bifrost, the blog and the knowledge base.

Thanks to the preparation already done for the migration and a relentless effort by the operations team, we were able to rebuild the necessary infrastructure for all core services. After 20 hours the primary Kolab Now services were restored: webmail, IMAP, CalDAV/CardDAV, ActiveSync, access to data, and the support email address support@kolabnow.com. All users were able to log in and use the systems at that point.

For much of the supporting infrastructure, such as the blog, the front page, the knowledge base and Bifrost, the situation was more complex, and further progress depended on assistance and firmware from the vendor.

After a lot of back and forth with the vendor's support, we were fortunately able to restore the data from the failing storage system. However, simply restarting all services on the legacy infrastructure was clearly not an option, so every service needed to be migrated. While this rather time-consuming process was ongoing, a second situation developed.

In the late evening of August 30 (CEST) the internal MX servers stopped delivering email. On the morning of August 31 the team decided to rip off the band-aid and push through a rebuild and migration of the mail traffic system. This work had been planned for later this year, and pulling it forward presented limited risk, but it unfortunately forced users off their services once again. It did, however, bring the systems back into an updated, balanced production mode around 15:00 CEST the same day.

All mail sent to users during the two service outages was delayed, but not lost: the sending servers queued the mail for later delivery. It can take up to five days for a sending server to retry delivery.
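For the curious, here is a minimal sketch of how this deferred-delivery behaviour typically works on the sending side. The values below (a first retry after roughly 15 minutes, backing off to a few hours between attempts, and giving up only after about five days) are illustrative assumptions about common mail server defaults, not a description of any specific sender's or of our own configuration.

```python
from datetime import timedelta

# Illustrative, assumed values only: they mimic common sender-side defaults,
# not Kolab Now's or any particular mail server's actual configuration.
QUEUE_LIFETIME = timedelta(days=5)     # how long the sender keeps a deferred message queued
FIRST_RETRY = timedelta(minutes=15)    # interval before the first retry
MAX_INTERVAL = timedelta(hours=4)      # cap on the back-off between retries


def retry_schedule():
    """Yield the elapsed time of each delivery attempt until the queue lifetime runs out."""
    elapsed, interval = timedelta(0), FIRST_RETRY
    while elapsed < QUEUE_LIFETIME:
        elapsed += interval
        yield elapsed
        # Back off: wait longer between attempts, up to the cap.
        interval = min(interval * 2, MAX_INTERVAL)


if __name__ == "__main__":
    attempts = list(retry_schedule())
    print(f"{len(attempts)} delivery attempts, last one after {attempts[-1]}")
```

The point of the sketch is simply that a receiving outage of a few hours or even a couple of days stays well within the sender's retry window, which is why mail from the outage periods arrived late rather than bouncing.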

We understand the frustration that this long-lasting incident and the unplanned downtime caused, and we truly regret the inconvenience it brought upon our users. During the past year we have focused heavily on pushing fixes and new developments to the platform, perhaps at the cost of stability. That has led to a series of unplanned outages, more than usual. Going forward we will focus on keeping the system in a more stable state, and on migrating all services to new infrastructure in a timely fashion. New features that are ready will of course still be deployed, but with great care and with transparent communication to our users.

We will be doing our best to avoid a situation like this in the future.