Incident report: Service not restarting automatically on failure
Late in the evening (UTC) of 2023-09-18, some users were unable to receive mail or perform administrative tasks such as making payments or creating new users. Attempts to do so returned the error message ‘Internal Server error’. The problem lasted until the morning (UTC) of 2023-09-19.
The issue was caused by the in-memory data store (Redis) running out of memory.
As described in earlier blog posts, the Kolab Now team is currently focusing its energy on improving the stability and reliability of our infrastructure. As part of the preparatory work, the Redis system is currently running in an interim location (it is not a critical system for data safety), where it unfortunately lacked the monitoring and self-healing capabilities needed for a swift recovery.
For the impacted users, the incoming mail exchange was unable to receive and queue mail, so those messages were delayed and remained queued on the sending servers.
As soon as the issue was discovered, the root cause was quickly determined and fixed. The interim system was also reconfigured to automatically restart the service should it fail again. Once Redis is in its final location, it will again be under full monitoring, including self-healing.
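For readers curious what "automatically restart the service should it fail" means in practice, here is a minimal sketch of the general idea in Python. It is only an illustration: this report does not describe the actual mechanism used on the interim system, and a production setup would normally rely on the operating system's service manager rather than a script. The function name `run_with_restart` and its parameters are hypothetical.

```python
import subprocess
import sys
import time


def run_with_restart(cmd, max_restarts=3, delay_seconds=0.0):
    """Run cmd and start it again whenever it exits with a non-zero status.

    Returns the number of restarts that were needed before a clean exit.
    Raises RuntimeError once max_restarts restarts have been used up.
    (Hypothetical helper for illustration, not the mechanism from the report.)
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(cmd)
        if exit_code == 0:
            # Clean exit: nothing left to supervise.
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(
                f"command still failing after {restarts} restarts "
                f"(last exit code {exit_code})"
            )
        restarts += 1
        # Short pause so a permanently broken service is not restarted in a tight loop.
        time.sleep(delay_seconds)


if __name__ == "__main__":
    # Stand-in command that exits cleanly right away; a real service would be long-running.
    print(run_with_restart([sys.executable, "-c", "pass"]))
```

The restart limit matters: restarting without a cap can hide a persistent fault such as memory exhaustion, which is why restarts are usually paired with monitoring.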
At this time, all systems are running normally. Please contact support with any questions or concerns about this incident.