Incident report: Failing front ends

On Friday, January 21, at 15:48 UTC, users observed that the main page https://kolabnow.com/ became inaccessible and returned an error: 503 – Server unavailable. The webclient (https://kolabnow.com/apps) was still available, but with seriously degraded performance, to the point where the service would give up and show the ‘Maintenance’ page.

IMAP, SMTP, ActiveSync, and $DAV were also affected by the failing front ends.

The service was unavailable for about 50 minutes, during which time all mail, incoming and outgoing, was queued. When the service became available again, the queued mail first had to be delivered before new mail could be sent or received. This caused users to experience a delay in mail traffic.

Although services were available during that delivery delay, the error message shifted from ‘503’ to ‘401’, so some users were presented with a page that wrongly claimed this was a planned maintenance outage. That information was incorrect, and we are going to change the text on that error page.

A database server in the front end cluster had inadvertently locked transactions across the cluster, which prevented the cluster from processing further requests. The cluster functionality is monitored and under configuration management control, and an admin should have been notified as soon as the error occurred. This didn’t happen. Once the error was discovered and the faulty database server was removed, everything returned to working as expected.

Although the Kolab Now systems are heavily monitored, monitoring is only part of any solution. Someone needs to be notified by the monitoring system when something happens that requires manual intervention. The Kolab Now Operations team realized that the existing notification system (which has served us well over the years) was lacking reliability, and we have started implementing a new, more modern notification system. While this implementation is under way, it is being improved every day, and we expect it to be fully functional by the end of Q1.