Incident Report: Backend Down
Earlier this morning, at 04:38 UTC, one of the twenty-two IMAP backends in production stopped serving its mail spool, showing Input/Output errors on its disk. Our Standard Operating Procedure is to examine log files, flush VM caches, stop the virtual machine, and start it back up again. The restart occurred at 05:48 UTC. The IMAP backend in question did not come back up cleanly.
Users will have noticed errors, and various clients treat different errors differently; most prominently, some mobile devices would proceed to ask the user for the account password. This incident has nothing to do with your account, its password or its integrity.
What? No Mailboxes?
When the IMAP backend came back up, it initially did so without any mailboxes — meaning it did have all the underlying payload information, but it wasn’t listing any of the mailboxes. Its services were shut off again immediately, at 05:50 UTC, and recovery of the mailboxes database could then begin.
About That Recovery
None of the available copies of the mailboxes database was viable: Cyrus IMAP tends to provide its own backups, but the last valid backup had already been overwritten. Similarly, no viable copy of the folder annotations database was available any longer. More on this last little nugget later.
The mailboxes database would therefore have to be recovered from the underlying payload information: the mail spool contains a directory hierarchy that reflects eligible entries for the mailboxes database. Directories needed to be listed, information about the mailboxes needed to be pulled from so-called cyrus.header files, it needed to be output in a very particular format, verified, converted, verified again, loaded, verified again, and tested.
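The first step of that recovery, listing the spool directories that hold a cyrus.header file, can be sketched roughly as follows. This is a minimal illustration and not the procedure we actually ran: the function names are hypothetical, the spool layout is simplified, and the sketch only collects candidate paths. It does not parse cyrus.header files or produce the format the mailboxes database expects.

```python
import os


def find_mailbox_dirs(spool_root):
    """Yield (relative_dir, header_path) for each directory containing a cyrus.header file."""
    for dirpath, dirnames, filenames in os.walk(spool_root):
        dirnames.sort()  # walk subdirectories in a stable order
        if "cyrus.header" in filenames:
            rel = os.path.relpath(dirpath, spool_root)
            yield rel, os.path.join(dirpath, "cyrus.header")


def candidate_entries(spool_root):
    """Return a sorted list of spool-relative paths that look like mailboxes.

    Assumption: a directory counts as a mailbox candidate only if it holds
    a cyrus.header file. Mapping these paths to real mailbox names is a
    separate step that is not shown here.
    """
    return sorted(rel.replace(os.sep, "/") for rel, _ in find_mailbox_dirs(spool_root))
```

A list like this is only a starting point; each candidate still has to be checked against its header file and verified before anything is loaded.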
Time to Recovery
The recovery effort lasted until 08:31 UTC, amounting to a little under 4 hours of downtime.
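For reference, the intervals between the timestamps in this report can be worked out with a short sketch. The helper name is hypothetical, and it assumes all times fall on the same UTC day.

```python
from datetime import datetime

FMT = "%H:%M"


def minutes_between(start, end):
    """Return whole minutes between two same-day HH:MM timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Timestamps taken from this report (all UTC).
total = minutes_between("04:38", "08:31")       # first I/O errors to recovery
until_restart = minutes_between("04:38", "05:48")
rebuild = minutes_between("05:50", "08:31")     # services shut off to recovery

print(f"total downtime: {total // 60}h{total % 60:02d}m")
```

The total comes to 233 minutes, of which 161 were spent rebuilding the mailboxes database.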
Have we been able to fully recover? No. I mentioned the folder annotations database; this database contains metadata on folders, including:
- The type of groupware content contained within the folder, such as ‘event’ for your Calendar, and ‘contact’ for your Address Book(s),
- Whether the groupware folder is the “default” for that type of groupware content, by adding a ‘.default’ suffix to the private version of that same annotation,
- Whether or not the folder is synchronized with any of your ActiveSync devices.
This sort of information is regrettably lost. We’ve re-instated the folder types for some of the typical folder names (e.g. “Calendar” is most likely a calendar, and not a “Files” folder, so we’ve set the folder type back to ‘event’, etc.), but regrettably, we do not speak all languages in the world, and some of your folders have names you supply yourselves.
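The name-based guessing described above amounts to something like the sketch below. The lookup table is a hypothetical, English-only example and not the list we actually used; only the ‘event’ and ‘contact’ types and the ‘.default’ suffix come from this report.

```python
# Hypothetical table of typical folder names; real folders may be named
# in any language or chosen freely by the user, in which case no guess is made.
TYPICAL_NAMES = {
    "calendar": "event",
    "contacts": "contact",
    "address book": "contact",
}


def guess_folder_type(folder_name, is_default=False):
    """Return a guessed folder type, or None if the name is not recognized."""
    base = TYPICAL_NAMES.get(folder_name.strip().lower())
    if base is None:
        return None  # unknown name: leave the folder type unset
    return base + ".default" if is_default else base
```

For example, `guess_folder_type("Calendar")` yields `event`, while a folder called “Kalender” is not recognized and would have to be restored by hand.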
Our support staff is currently drafting a knowledge base article to help our customers re-register ActiveSync devices and restore groupware folder types where needed.
On behalf of the Kolab Now team, I apologize for any inconvenience, and we’ll be working to ensure this sort of incident does not recur.
Further Updates
For real time updates, follow @kolabops.
One further remaining issue that went uncaught caused some OOM conditions, potentially killing the process for existing connections.
— Kolab Operations (@kolabops) October 17, 2017
Opened access back up, our support staff is writing up some knowledge base articles. #kolabupdate
— Kolab Operations (@kolabops) October 17, 2017