Unexpected performance degradation
Incident Report for Mews
Postmortem

Problem

Users were unable to sign in to the application. The issue was first reported at 2021-07-12 14:16:00 UTC. The problem manifested as a "Transaction has timed out" error.

Action

After the tech team was notified and had identified the root cause, the server was rolled back to the previous version. This resolved the problem and the system became operational again. The issue persisted for approximately 30 minutes. Deployment of new versions was disabled until a fix was ready, to avoid reintroducing the problem. The actual fix was created around 2021-07-12 18:00:00 UTC and subsequently deployed.

Causes

The incident was caused by a change that enabled sign-in with multiple emails (allowing users to sign in with any email associated with their account). To support this, the database query that loads users by email was changed. The new query was suboptimal and performed poorly once deployed. It did not finish within the time limit, hence the "Transaction has timed out" error.
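
This report does not include the actual query or schema, so the following is only a hypothetical sketch of what a lookup by any associated email might look like, and of how a query plan can be checked before such a change ships. The table names, the index name, and the use of SQLite are all assumptions made for illustration. They do not describe the production database.

```python
import sqlite3

# Hypothetical schema: one user can own several email addresses.
SCHEMA = """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE user_emails (
        user_id INTEGER NOT NULL REFERENCES users(id),
        email   TEXT NOT NULL
    );
    -- Assumed index: lets the lookup seek by email instead of scanning.
    CREATE UNIQUE INDEX idx_user_emails_email ON user_emails(email);
"""

# Load a user by any email associated with the account.
LOOKUP_SQL = """
    SELECT u.id, u.name
    FROM user_emails AS e
    JOIN users AS u ON u.id = e.user_id
    WHERE e.email = ?
"""


def find_user_by_email(conn, email):
    """Return (id, name) for the account that owns `email`, or None."""
    return conn.execute(LOOKUP_SQL, (email,)).fetchone()


def query_plan(conn, email):
    """Return the human-readable steps SQLite plans to run for the lookup."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + LOOKUP_SQL, (email,)).fetchall()
    return [row[-1] for row in rows]  # the last column holds the plan detail


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Alice')")
conn.executemany(
    "INSERT INTO user_emails (user_id, email) VALUES (?, ?)",
    [(1, "alice@example.com"), (1, "alice@work.example.com")],
)
```

Calling `find_user_by_email(conn, "alice@work.example.com")` returns the same account as the primary address. Inspecting `query_plan(conn, ...)` shows whether the lookup uses the index or falls back to a full scan, which is the kind of check that can catch a slow query before it reaches users.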

Solutions

To prevent similar issues in the future, we’ve decided to add more gradual deployment to our CI/CD pipeline. We will first deploy the new version to a small portion of our users while the old version is still running. We will monitor the deployment and, if there are no issues, gradually increase the percentage of requests served by the new version. If any issues arise, such as poor performance, we will roll back to the previous version and fix the issue.
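
The gradual rollout described above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline. The function names, the stage percentages, and the `is_healthy` callback (standing in for whatever monitoring decides that a stage shows no issues) are all hypothetical.

```python
import hashlib
from typing import Callable, Sequence


def routes_to_new_version(user_id: str, percent: int) -> bool:
    """Decide deterministically whether a user is served by the new version.

    Each user is hashed into one of 100 buckets. Buckets below `percent`
    get the new version, so raising `percent` only ever adds users.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def gradual_rollout(stages: Sequence[int], is_healthy: Callable[[int], bool]) -> int:
    """Step through the traffic percentages in `stages`.

    Returns the percentage finally served by the new version. A result of 0
    means a stage failed its health check and the rollout was rolled back.
    """
    served = 0
    for percent in stages:
        if not is_healthy(percent):
            return 0  # roll back: all traffic returns to the old version
        served = percent
    return served
```

For example, `gradual_rollout([5, 25, 50, 100], is_healthy)` would stop and return 0 at the first stage where the health check reports a problem such as poor performance. With a rollout like this, a slow query would be expected to affect only the initial small share of requests.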

Posted Sep 01, 2021 - 10:54 CEST

Resolved
The incident has been resolved and the system has been operational for some time. We continue to monitor the situation and work on internal improvements to prevent this from happening again.
Posted Jul 12, 2021 - 17:24 CEST
Monitoring
A fix has been implemented and the system is operational. We are monitoring the situation.
Posted Jul 12, 2021 - 16:56 CEST
Update
We are continuing to work on a fix for this issue.
Posted Jul 12, 2021 - 16:34 CEST
Identified
We are currently experiencing unexpected performance degradation. We have identified the root cause and are working on a fix.
Posted Jul 12, 2021 - 16:28 CEST
This incident affected: Operations, Guest Journey, Business Intelligence, Payments, Open API, and Marketplace.