Posts Tagged ‘Incident Report’

Incident Report: Failing front ends

Posted on: January 26th, 2022

On Friday, January 21 at 15:48 UTC, users observed that the main page https://kolabnow.com/ had become inaccessible and returned an error: 503 – Server unavailable. The webclient (https://kolabnow.com/apps) was still available, but with seriously degraded performance, to the point where the service would give up and display the ‘Maintenance’ page.

IMAP, SMTP, ActiveSync, and $DAV were also impacted by the failing front ends.

The service was unavailable for about 50 minutes, during which time all mail, inbound and outbound, was queued. When the service became available again, those queues first had to be delivered before new mail could be sent or received, so users experienced delays in their mail traffic.

> Continue Reading

Incident Report: WebDAV/CalDAV/CardDAV services unavailable

Posted on: January 17th, 2022

On Monday morning at 03:22 CET the Kolab Now $DAV services stopped working.

While resolving a performance issue with $DAV connections, we moved servers from an old infrastructure to a new one, and with them the rotation of their logs. Unfortunately, a small detail was missed while configuring the new servers, which caused the log rotation to attempt a reload of httpd even though that was not needed. The reload did not succeed and timed out, leaving the new DAV servers in a maintenance state; unavailable.
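For context on the mechanism: log rotation tools commonly run a post-rotation hook that tells the web server to reopen its log files. A minimal sketch of such a logrotate stanza (the paths, directives, and reload command here are illustrative assumptions, not our actual configuration):

```
# /etc/logrotate.d/httpd -- illustrative sketch only
/var/log/httpd/*.log {
    daily
    missingok
    notifempty
    sharedscripts
    postrotate
        # If this reload hangs or times out, the rotation run can leave
        # httpd in a broken state -- the failure mode described above.
        /bin/systemctl reload httpd.service > /dev/null 2>&1 || true
    endscript
}
```

A hook like this should only be installed on servers that actually need it; on the new DAV servers the reload was unnecessary, and its failure is what took the services down.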

This happened at ~03:22 CET. Because these DAV servers were new, they were covered only by standard monitoring, without alert notifications, so no one was notified of the problem until it was discovered at 07:47 CET. The configuration was promptly corrected, the new DAV servers came back online right away, and they were added to the monitoring notifications. At this time everything should be available and working as expected.

We apologize for any inconvenience that this may have caused.

Incident Report: Network Outage at Kolab Now

Posted on: October 5th, 2021

On Tuesday 2021-10-05, between 10:20 UTC and 10:40 UTC, a network issue took our firewalls offline, and Kolab Now was down. Our Operations team was on the case right away and corrected the issue quickly, so the downtime was very limited.

When the dust had settled, it turned out that one of our hypervisors had stopped one of its redundant storage blocks because it had been unavailable for too long, and a group of ~20 users were unable to see the mail in their folders. Again, our Operations team reacted quickly and restored the connections to the redundant storage. Everything is back up and running for everyone now, and no data was in danger at any point throughout the incident.

If you are one of the few users who had the problem after the downtime, or if you experienced trouble during the outage around lunchtime, then we truly apologize for the inconvenience it caused.

If you have any questions or concerns in this context, please contact support.

Incident Report: Storage Outage at Kolab Now

Posted on: August 25th, 2021

On Monday, August 23rd 2021 at around 04:00 UTC, Kolab Now suffered a catastrophic outage related to its storage. This post will outline what happened, when it happened, and what happens next.

> Continue Reading

Incident Report: IMAP backend server out of memory

Posted on: June 8th, 2021

On Monday night, the 7th of June 2021 at 18:29 UTC, a process on one of our many backend servers was consuming memory faster than it could release it, which made the server run out of memory and stop responding. The server was serving IMAP for a limited group of users, who were in turn impacted by the incident.

As the server was heavily monitored, alarms went off with the staff, but they received little attention: the incident happened in a low-traffic period, and the systems are built to recover from such situations on their own. Servers in this class usually just restore themselves and keep working. This time, however, the server did not come back up in a timely manner, and the affected users saw their mailboxes freeze and their mail become unavailable.

The staff realized that something was not as it should be and went on to manually restore the situation. At 20:27 UTC the server was back up and running, and all mailboxes were available again.

No data was in danger, and mail delivery to the impacted mailboxes continued during the incident.

We apologize for any inconvenience that this incident may have caused.


Incident Report: Network Interruption

Posted on: May 28th, 2021

At 06:00 UTC on Wednesday May 26th Kolab Now fell silent. All connections were dropped. What happened?

> Continue Reading

Incident Report: Lock down of firewalls

Posted on: April 28th, 2021

At 10:10 UTC this morning, Wednesday April 28th, parts of our environment were being updated. The updates included Security-Enhanced Linux (SELinux) configuration on one layer of our firewalls. Unfortunately, this new configuration locked up those firewalls, and all traffic was blocked for a group of users.

The problem was confirmed corrected at 10:36 UTC.

No data was in any danger of being compromised during the incident.

We apologize for the inconvenience.

Incident Report: Storage Failure

Posted on: January 29th, 2020

At 10:23 UTC this morning, Wednesday January 29th, our environment experienced a catastrophic storage failure. The time to resolution for this underlying problem was approximately 80 minutes, and full service was restored approximately 60 minutes thereafter — 12:48 UTC.

> Continue Reading

Incident Report: Cascading Performance Problems

Posted on: March 5th, 2019

From last Sunday afternoon onward, through Monday evening and into the Monday night, performance problems degraded the Kolab Now service, up to and including services becoming unavailable.

> Continue Reading

Incident Report: Thursday, 20:25 – Friday 01:15 UTC

Posted on: September 21st, 2018

Last night, a failure in the storage layer caused most of our services to be unavailable. In the week before, we had replaced a failed hard drive. In the week before that, a so-called Virtual Fabric Adapter had failed, causing a hypervisor to shut itself off. Since the most recent incident caused the more serious downtime, that’s where we’ll start our reporting.

> Continue Reading