Users were not able to sign in to the application. The issue was first reported at 2021-07-12 14:16:00 UTC. The problem manifested as a "Transaction has timed out" error.
After the tech team was notified and the root cause was identified, the server was rolled back to the previous version. This solved the problem and the system was operational again. The issue persisted for approximately 30 minutes. Deployment of new versions was disabled until the fix was ready, to avoid reintroducing the problem. The actual fix was ready around 2021-07-12 18:00:00 UTC and was then deployed.
The incident was caused by a change that enabled sign-in with multiple emails (allowing users to sign in with any email associated with their account). To support this, the database query that loads users by email was changed. The new query performed poorly and did not finish within the time limit, hence the "Transaction has timed out" error.
To prevent similar issues in the future, we’ve decided to add a more gradual deployment process to our CI/CD pipeline. We will first deploy the new version to a small share of our users while the old version is still running. We will monitor the deployment and, if no issues appear, gradually increase the percentage of requests served by the new version. If any issues appear, such as poor performance, we will roll back to the previous version and fix the problem.
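
The gradual rollout described above could be sketched roughly as follows. This is a minimal illustration, not our actual pipeline: the class name CanaryRollout, the traffic steps (5, 25, 50, 100) and the 1% error-rate threshold are all hypothetical values chosen for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class CanaryRollout:
    """Tracks what share of requests the new version should serve."""
    steps: tuple = (5, 25, 50, 100)  # hypothetical traffic percentages
    max_error_rate: float = 0.01      # hypothetical health threshold (1%)
    step_index: int = 0
    rolled_back: bool = False

    @property
    def percent(self) -> int:
        # After a rollback the new version receives no traffic at all.
        return 0 if self.rolled_back else self.steps[self.step_index]

    def route(self) -> str:
        """Pick which version serves a single request."""
        return "new" if random.random() * 100 < self.percent else "old"

    def evaluate(self, errors: int, requests: int) -> str:
        """Advance one step if the new version looks healthy, else roll back."""
        if self.rolled_back:
            return "rolled_back"
        error_rate = errors / requests if requests else 0.0
        if error_rate > self.max_error_rate:
            self.rolled_back = True
            return "rolled_back"
        if self.step_index < len(self.steps) - 1:
            self.step_index += 1
        return "promoted" if self.percent == 100 else "advanced"


rollout = CanaryRollout()
print(rollout.percent)                              # starts at the first step
print(rollout.evaluate(errors=0, requests=1000))    # healthy: move to next step
print(rollout.evaluate(errors=50, requests=1000))   # 5% errors: roll back
print(rollout.percent, rollout.route())
```

In practice the health check would likely look at more than the error rate, for example query latency, since the incident above showed up as timeouts rather than as immediate failures.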