Incident Report: Storage Failure

At 10:23 UTC this morning, Wednesday January 29th, our environment experienced a catastrophic storage failure. Resolving the underlying problem took approximately 80 minutes, and full service was restored approximately 60 minutes after that, at 12:48 UTC.

The storage that failed is our “fast and expensive” layer (itself made up of three layers). It holds the operating system disks and the database files necessary for maintaining performance. Without going into too much detail, these are mainly index files and caches.

Two initial errors, on paths A & B, occurred within a span of just under two minutes, and the controller was kicked offline. Paths C & D failed 18 seconds later, and its node was kicked offline another 25 seconds after that. The failure unfolded so quickly that, by the time monitoring alerts started arriving on our phones, we had no time whatsoever to remedy the situation.

As the control interfaces remained available to us, we could relatively quickly diagnose the situation, re-assign I/O groups to a functioning node, restore the dysfunctional node, and cycle back over. I say “relatively quickly” because 80 minutes may sound like it's not quick at all. But when you touch these types of systems, holding that sort of data and being that important to restoring function and service, you “measure twice or more, then cut”.

As soon as we felt confident the storage would come back up, we suspended external access to all services except the web client. Otherwise, the recovery scenario at hand would have left us with a slow, at times dysfunctional environment and a longer path to recovery. Recovering the systems and services that needed it, and double-checking that they functioned, was completed some 120 minutes into the incident. We then started to open up some services for inbound and outbound message traffic, followed by all other interfaces within the next 20 minutes.
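For readers who want to line the durations up against the clock, here is a minimal sketch that converts the approximate offsets mentioned in this report into UTC times. The milestone labels and the names `MILESTONES`, `clock_time` and `timeline` are made up for this illustration, and the offsets are the rounded figures from the text, so the results land within a few minutes of the stated 12:48 UTC rather than exactly on it.

```python
# Incident start as stated in the report: 10:23 UTC.
INCIDENT_START = (10, 23)

# Approximate offsets in minutes after the start, taken from the report.
# The labels are illustrative, not official milestone names.
MILESTONES = [
    ("storage failure begins", 0),
    ("underlying storage problem resolved", 80),
    ("recovery and double-checking completed", 120),
    ("all interfaces reopened", 140),
]


def clock_time(start, offset_minutes):
    """Return an HH:MM string for start + offset (assumes the same day)."""
    total = start[0] * 60 + start[1] + offset_minutes
    hours, minutes = divmod(total, 60)
    return f"{hours:02d}:{minutes:02d}"


def timeline(start=INCIDENT_START, milestones=MILESTONES):
    """Pair each milestone label with its approximate clock time."""
    return [(clock_time(start, offset), label) for label, offset in milestones]


if __name__ == "__main__":
    for when, label in timeline():
        print(f"{when} UTC (approx.)  {label}")
```

Because every offset is an approximation, the computed times are a reading aid only and not a log of the actual events.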

Further investigation as to the root cause of the storage failure is ongoing.

No data was hurt in the making of this incident.

We apologize for the inconvenience.