Posts Tagged ‘Incident Report’

Incident: Service outage

Posted on: November 6th, 2024 by

Kolab Now is currently experiencing a networking infrastructure interruption. We apologize for the inconvenience while we investigate the issue.
We will update this blog post as soon as more information is available.

2024-11-06 @ 05:44 UTC: This issue was triggered by one of our hypervisors spontaneously rebooting. Most services have been restored, and the Operations team is working through the remaining issues. Users can log in and use the facilities.
2024-11-06 @ 07:17 UTC: The incident has been resolved.

Incident: DATABASE ERROR!

Posted on: October 29th, 2024 by

A database issue has just presented itself, and our operations team is investigating to find the cause and fix it.

Users who try to make use of the webclient will get the message:

DATABASE ERROR!
Unable to connect to the database!
Please contact your server-administrator.

Operations have a good lead on the issue, and we expect everything to be back online shortly.

You can follow the situation here on this blog.

2024-10-29 @ 10:22 UTC: The root cause of the issue has been identified as a problem with a sync routine in the database cluster. The Operations team is working to get synchronization back in order. Meanwhile, the webclient is back: users can log in and use the facilities.

Please keep an eye on this blog, as there might be knock-on performance issues. The synchronization uses a lot of resources and will most probably slow down the systems while it is running.

Incident report: Spool overflow filling up disk..

Posted on: July 2nd, 2024 by

In the early evening (CEST) of Monday 2024-07-01, a spammer attempted to use a Kolab Now account to send out large amounts of spam. The Kolab Now exit spam filter sorted out the spam and redirected it, as it is supposed to do, and none of the spam was sent out. The spammer was stubborn, however, and kept sending, which gradually filled up a disk and thereby blocked traffic. Due to ongoing maintenance on the monitoring, the full disk was unfortunately not discovered until Tuesday morning, when the problem was immediately corrected and queued mails were again flowing in both directions.

The problem caused a group of users (about 30%) to be unable to receive mail, and sent mail was queued until the space was again freed up and spooling was possible. No mail should have been lost during the incident.

The missing monitoring has been put back into action, and the Kolab Now Engineering team is evaluating changes that will prevent the situation from repeating.
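
For readers curious what a basic disk-space check of this kind can look like, here is a minimal sketch in Python. It is an illustration only and not the monitoring Kolab Now actually runs: the function names and the 80% and 95% thresholds are hypothetical, chosen just to show the idea of warning before a spool disk is completely full.

```python
import shutil

# Hypothetical thresholds; the values used in production are not stated in this report.
WARN_FRACTION = 0.80
CRIT_FRACTION = 0.95


def classify_usage(used_bytes: int, total_bytes: int) -> str:
    """Return 'ok', 'warning' or 'critical' for the given disk usage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    fraction = used_bytes / total_bytes
    if fraction >= CRIT_FRACTION:
        return "critical"
    if fraction >= WARN_FRACTION:
        return "warning"
    return "ok"


def check_path(path: str) -> str:
    """Classify the filesystem that holds `path` (for example a mail spool directory)."""
    usage = shutil.disk_usage(path)  # named tuple with total, used and free bytes
    return classify_usage(usage.used, usage.total)


if __name__ == "__main__":
    # "/" is only a placeholder; a real check would point at the spool volume.
    print(check_path("/"))
```

A check like this only helps if it runs on a schedule and its result reaches someone, which is exactly the part that was missing during the maintenance window.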

We apologize for any inconvenience that this incident may have caused.

Incident report: One external submission server overwhelmed by spam flood..

Posted on: January 8th, 2024 by

On 2024-01-07 a spammer sent a large flood of mail through one of the external submission servers. The server stopped the spam mails and saved them to a separate holding queue to make room for other users.

It took a while for the reporting to reach the operations team, but as soon as the issue was known it was swiftly resolved at ~19:00. In the meantime, however, the server ran out of space, and some users (those who hit that server) would have seen sending and receiving fail.

We apologize for the inconvenience that this issue has caused, and will focus on improving the reporting to also cover this specific issue.

Incident report: Service not restarting automatically on failure

Posted on: September 19th, 2023 by

In the very late hours (UTC) of 2023-09-18, some users found that they could not receive mail or perform administrative tasks such as making payments or creating new users. When they tried, they were given an error message: ‘Internal Server error’. This lasted until the morning (UTC) of 2023-09-19.

The issue was caused by the in-memory data store (Redis) running out of memory.

> Continue Reading

Incident report: “Stability!”

Posted on: September 1st, 2023 by

Welcome back!

We were finally able to bring our blog back. During the past 2 weeks we had a long-lasting incident. We did what we could to make sure that users could use their primary services while we worked on the systems, but we know that there was unplanned downtime. As the blog was a victim of the situation, we couldn’t inform users as well as we wanted (we really do not like X (‘formerly known as Twitter’)), so here, finally, is the incident report from beginning to end.

> Continue Reading

Incident Report: Various Types of Failure Symptoms

Posted on: March 3rd, 2023 by

On Friday March 3rd, 2023, from approximately 01:00 UTC to approximately 14:40 UTC, numerous users of our service experienced a variety of issues.

A complete listing of the symptoms experienced is infeasible at this time, but logins to the web-client interface and the cockpit interface will have failed, and people will not have received emails in time.

> Continue Reading

Incident report: ‘Gateway Timed-out’ at login

Posted on: January 14th, 2023 by

On Saturday January 14th, 2023, from approximately 03:47:12 UTC a group of Kolab Now users observed an error stating: “Gateway time-out” when trying to log in to the webclient, the dashboard, or connect with any desktop or mobile client. The Kolab Now main page https://kolabnow.com/ and all other services (Support, blog and knowledge base) were available with no issues for these same users.

The issue lasted until approximately 08:04:22 UTC, and was caused by a db server in one of the clusters being stuck in a large operation. The issue caused trouble with the login procedure, but had no impact on the mail flow. No mail was lost or delayed during the incident.

> Continue Reading

Incident report: db server stuck in large operation..

Posted on: June 7th, 2022 by

On Monday June 06 2022, from approximately 11:24 UTC a group of Kolab Now users observed an error stating: “Gateway time-out” when trying to log in to the webclient, the dashboard, or connect with any desktop or mobile client. The Kolab Now main page https://kolabnow.com/ and all other services (Support, blog and knowledge base) were available with no issues for these same users.

The issue lasted for about an hour, until approximately 12:19 UTC, and was caused by a db server in one of the clusters being stuck in a large operation. The issue caused trouble with the login procedure, but had no impact on the mail flow. No mail was lost or delayed during the incident.

The operations team was alerted by the monitoring system to the stuck db server in the cluster, and had hands on keyboards right away. Some users in the group saw service return earlier than others.

Further investigation and action is going into reevaluating the amount of resources assigned to the db clusters and other servers for large operations of this type.

If you were among the group of users that was impacted by this, then please accept our apologies for any inconvenience caused.

Incident report: Failing front ends

Posted on: January 26th, 2022 by

On Friday January 21 at 15:48 UTC users observed that the main page https://kolabnow.com/ became inaccessible and returned an error: 503 – Server unavailable. The webclient (https://kolabnow.com/apps) was still available, but with seriously degraded performance, to the point where the service would give up and show the ‘Maintenance’ page.

IMAP, SMTP, ActiveSync, and $DAV were also impacted by the failing frontends.

The service was unavailable for about 50 minutes, during which time all mails, in and out, were queued. When the service was available again, those mail queues first had to be delivered before new mail could be sent or received. This caused users to experience a delay in the traffic.

> Continue Reading