Incident Report: Hypervisor Failure
This weekend, at approximately 12:00 UTC on Sunday, an issue on one of the hypervisors went unnoticed for too long and was finally resolved on Monday morning. This post explains what happened, why it happened, and what we’re going to do to address the situation.
Let’s start at the beginning: the management system of one of our hypervisors detected a hardware defect and kicked the machine, causing it to reboot. Normally, this is of no major consequence and creates a brief but noticeable amount of down-time for the services that the hypervisor in question hosts exclusively.
In the case at hand though, the hypervisor had come back up with its network cards under different names. As you may be aware, network interfaces are no longer called 'eth0' and 'eth1'; instead, the name encodes where the card sits in the machine, and 'enp123s0f4' is one such example. As you can imagine, this interferes with the networking setup: bonded interfaces for high-availability and load-balancing, the separation of management, storage and application networks, bridges, 802.1q interfaces, virtual switches. Long story short, configuring or re-configuring all of this requires a bit of focus and attention, but more importantly, time.
This caused the services on the hypervisor to remain unavailable even after a good ol’ “turn it off and on again”. In principle, the renaming of these network peripherals is a feature that comes with this enterprise hardware. The side-effects it brings, however, prolong the road to recovery.
The secondary cause for the delay in our response is that the hypervisor in question ran one of the monitoring systems used to monitor Kolab Now services in production. This clearly prevented that monitoring system from reaching out to us; we need monitors for the monitors, and then monitors for those. Turtles all the way down, if you will. However, we have twelve separate monitoring systems in place, and ten of them ran on different infrastructure. I have yet to investigate why the four I would have expected to alert us to the problem didn’t.
The tertiary cause is the kicker: this particular hypervisor also ran our ticketing system, so no questions to support@kolabnow.com could reach the staff on duty.
What to do about it?
First, we need to guarantee that our monitoring systems remain able both to monitor the infrastructure and to alert our staff on duty, and everyone else if needed. I will check which monitoring systems are currently configured and allowed to contact staff, and ensure these twelve systems monitor one another, roughly along the lines sketched below.
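To give an idea of what “monitoring one another” means in practice, here is a minimal sketch in Python, with made-up hostnames and a placeholder alert function rather than our actual tooling: each monitoring host periodically probes its peers and raises an alert when one of them stops answering.

    import socket

    # Hypothetical peer list; the real monitoring hosts and check endpoints differ.
    MONITOR_PEERS = [
        "monitor01.example.org",
        "monitor02.example.org",
        "monitor03.example.org",
    ]
    CHECK_PORT = 443        # assume each peer exposes something TCP-reachable
    TIMEOUT_SECONDS = 5

    def peer_is_up(host: str) -> bool:
        """Return True if a TCP connection to the peer succeeds within the timeout."""
        try:
            with socket.create_connection((host, CHECK_PORT), timeout=TIMEOUT_SECONDS):
                return True
        except OSError:
            return False

    def alert(message: str) -> None:
        # Placeholder: in practice this would page the staff on duty
        # (e-mail, SMS, push), not merely print to a log.
        print(f"ALERT: {message}")

    def cross_check(self_name: str) -> None:
        # Run on every monitoring host, so each monitor watches all the others.
        for peer in MONITOR_PEERS:
            if peer == self_name:
                continue
            if not peer_is_up(peer):
                alert(f"{self_name}: monitoring peer {peer} is unreachable")

    cross_check(socket.getfqdn())

The important property is that the check runs from every monitoring host against every other one, so the loss of any single host still gets noticed by a peer that can reach us.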
Second, we need to make our ticketing system redundant. I’m not yet sure how to achieve this, because clustering Phabricator may turn out to be tricky. Furthermore, it should be considered secondary to sorting out the apparent failure of the monitors that monitor the monitors.