Incident Report: Various Types of Failure Symptoms
On Friday March 3rd, 2023, from approximately 01:00 UTC to approximately 14:40 UTC, numerous users of our service experienced a variety of issues.
A complete listing of the symptoms experienced is infeasible at this time, but logins to the web-client interface and the cockpit interface will have failed, and people will not have received emails in time.
In this incident, recovery turned out to be particularly tricky. Please allow us to elaborate.
At this point, we can only describe symptoms that likely contributed to the incident, not its root causes.
At or around 01:00 UTC, a far larger than normal number of requests reached our database infrastructure and could not be served within a reasonable time. This coincided with an increase in the number of threads for services provided by other components within our infrastructure. We expect that increase to be a consequence of the database congestion, and the additional requests from those threads (hammering) to have in turn made the congestion worse.
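As an illustration only, the feedback loop described above (congestion leading to hammering, which adds to the congestion) can be sketched with a toy model. Nothing here is taken from our infrastructure: the function name `simulate`, its parameters, and all numbers are hypothetical and chosen purely to show the shape of the effect.

```python
def simulate(capacity, base_load, spike_load, spike_ticks, ticks, retry_factor):
    """Toy model of a service backlog; every value here is hypothetical.

    Each tick, new requests arrive and the service handles at most
    `capacity` of them. Unserved requests carry over as backlog, and each
    backlogged request causes `retry_factor` extra requests on the next
    tick (a stand-in for other components piling on more threads).
    """
    backlog = 0
    history = []
    for tick in range(ticks):
        # Elevated arrivals during the spike, normal arrivals afterwards.
        arrivals = spike_load if tick < spike_ticks else base_load
        # Extra requests caused by work that is already waiting.
        retries = int(backlog * retry_factor)
        demand = backlog + arrivals + retries
        backlog = max(0, demand - capacity)
        history.append(backlog)
    return history


if __name__ == "__main__":
    # Same short spike twice: once without, once with the retry feedback.
    print("no feedback:  ", simulate(100, 80, 150, 3, 12, 0.0))
    print("with feedback:", simulate(100, 80, 150, 3, 12, 0.5))
```

In this sketch the backlog drains on its own once the spike ends when there is no feedback, but keeps growing after the spike when waiting requests generate further requests. That is consistent with congestion persisting long after whatever triggered it, though it says nothing about what the trigger was.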
We do not yet understand well enough exactly which wire got tripped.
What we do understand is our lengthy path to recovery:
- In examining the database services’ underlying directories, files, and disk drives, we did not see anything of particular concern.
- In examining the performance of the databases’ tables, we found no more than a few tables on which queries appeared to have gotten stuck.
- In exploring recovery options, our approach was, in hindsight, too kid-gloved, clouded in part, I suppose, by our desire to understand the root cause.
- After determining that the many aforementioned approaches had failed, we executed a very rough and rudimentary approach, which resulted in recovery.
One further note is that the “rough and rudimentary” approach was well understood to work, but was for a long time dismissed along the lines of “really? is resetting things necessary? nah, surely not”. We therefore sought various ways to avoid it, until we decided we no longer would, and service was restored within the hour from that point.
We sincerely apologize for the inconvenience.