Incident Report: Thursday, 20:25 – Friday 01:15 UTC
Last night, a failure in the storage layer made most of our services unavailable. The week before, we replaced a failed hard drive. The week before that, a so-called Virtual Fabric Adapter failed, causing a hypervisor to shut itself off. Since the most recent incident caused the most serious downtime, that is where we start our reporting.
Late on Thursday, at about 20:25 UTC, one of the two controllers for our primary, most important storage device issued an alert, reporting that it had corrupted its own software files.
This causes the controller to take itself out of service. Multipath provides a secondary path to the same storage (and a tertiary and a quaternary one, though half of those use the same disk I/O controller), but the failover does not happen at the very same instant the controller goes away.
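To illustrate why four paths do not amount to four independent routes, here is a minimal sketch. The path names and the two-paths-per-canister layout are assumptions made for illustration; they are not taken from our actual multipath configuration.

```python
# Hypothetical layout: four paths to the same storage, two per controller
# (canister). The names are illustrative, not real device names.
PATHS = {
    "path0": "canister1",
    "path1": "canister1",
    "path2": "canister2",
    "path3": "canister2",
}


def surviving_paths(paths, failed_controller):
    """Return the names of paths that do not go through the failed controller."""
    return sorted(name for name, ctrl in paths.items() if ctrl != failed_controller)


def independent_routes(paths):
    """Count distinct controllers, i.e. routes that can fail independently."""
    return len(set(paths.values()))


print(surviving_paths(PATHS, "canister1"))  # ['path2', 'path3']
print(independent_routes(PATHS))            # 2
```

Under these assumptions, losing one controller takes out half of the paths at once, and until the remaining ones are activated, I/O has nowhere to go.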
As a result, some systems find themselves unable to perform I/O, however briefly the problem persists, and remount their filesystems read-only. For a service that takes more data in than it lets out, read-only filesystems are not great.
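A guest that has fallen back to a read-only filesystem can be spotted from its mount table. The following is a minimal sketch, assuming a Linux guest whose mount table follows the `/proc/mounts` format; the function name, the filesystem type filter, and the sample data are illustrative and not part of our tooling.

```python
def read_only_mounts(mounts_text, fs_types=("ext3", "ext4", "xfs")):
    """Return mount points of the given filesystem types mounted read-only.

    Each line of mounts_text is expected in /proc/mounts format:
    device, mount point, filesystem type, options, dump, pass.
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type in fs_types and "ro" in options.split(","):
            result.append(mount_point)
    return result


# Illustrative sample; on a real guest, read the text from /proc/mounts.
SAMPLE = """\
/dev/mapper/vg-root / ext4 rw,relatime 0 0
/dev/mapper/vg-data /srv/data ext4 ro,relatime 0 0
proc /proc proc rw,nosuid 0 0
"""

print(read_only_mounts(SAMPLE))  # ['/srv/data']
```

On a live system the same function could be fed `open("/proc/mounts").read()`. Filtering on filesystem type keeps pseudo-filesystems out of the result.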
When we received the alert on Thursday at 20:25 UTC, we almost immediately blocked access to services using I/O, and attempted to initiate recovery for individual virtual machines. This involved:
- Mitigating flapping multipath activation,
- Discovering that manually deactivating certain multipath paths would not be permanent, but would cause less flapping,
- Restoring the now secondary I/O controller (called a canister in IBM Storwize V7000 parlance) using the now primary canister, which turned out to be trickier than you might think,
- Having restored complete multipath disk I/O, powering off and reconfiguring individual virtualization guests to ensure their I/O took the then-current, preferred path,
- Ensuring our monitoring checks would all report everything being A-OK.
This might illustrate why it took a little under 5 hours, with the clock rolling over into the early hours of Friday, September 21st, before we could declare full recovery. Not a good way to start the autumn!