First of all, we would like to apologize for all the problems this issue caused you. This was our failure, and we will do everything we can to prevent anything similar from occurring again. Here is a full postmortem describing the root causes and the solutions we will put in place.
Users were unable to load the reservation module for 1 hour and 19 minutes, between 18:43 and 20:02 UTC on September 24th. This issue affected all of our clients.
After we determined the cause of the problem, we fixed it and deployed an update to the production environment.
- The investigation of the problem was delayed because of a parallel issue with our monitoring platform, which looked much more serious and therefore took priority. In the end, it turned out to be a false alarm; further details on why this happened are described here: https://status.mews.li/incidents/js5lm6k286vc
- We have automated tests of the system, but they didn't catch this problem. We also perform manual smoke tests of the whole system; however, they were not conducted before the deployment of the update that caused the issue.
- New features of Commander are not tested independently; they are tested only once finalized, on the development environment. This applies to both frontend and backend applications.
- The bug (in the frontend application) was caught by testers on a playground environment. We fixed the issue and deployed the fix to the development environment 7 days before the incident occurred. The fix had not yet been deployed to production; it was still staged on the development environment. However, the issue was marked as fixed in our issue tracking software, and marking an issue as fixed does not mean it has been deployed. Later, backend changes that depended on the frontend fix were deployed to production on the assumption that the fix would already be there. So the root cause here was the coordination of development and deployment across multiple teams, combined with human error: a person interpreted an issue that had been fixed for 7 days as deployed.
- We have fixed our availability monitoring platform to work independently of the application that is being served to end users.
- We will extend our automated test suite to cover this scenario. We will also update our deployment process to require manual smoke tests of the version that is being deployed.
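To illustrate the kind of check the extended test suite needs, here is a minimal sketch of an availability test for the reservation module. The marker string and the health rule are hypothetical placeholders, not our real endpoints or tooling; the key idea is that a blank page returned with HTTP 200, as happened during this incident, must still count as a failure.

```python
# Sketch of an automated availability check for the reservation module.
# The MARKER value is a hypothetical placeholder for a string that the
# fully rendered page must contain.
from urllib.request import urlopen
from urllib.error import URLError

MARKER = "reservation-module"

def response_is_healthy(status: int, body: str, marker: str = MARKER) -> bool:
    # An empty or partial 200 response (like the one served during this
    # incident) must still fail, so we require the marker as well as the
    # status code.
    return status == 200 and marker in body

def module_loads(url: str, timeout: float = 10.0) -> bool:
    # Fetch the page and apply the health rule; network errors count as down.
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return response_is_healthy(resp.status, body)
    except (URLError, OSError):
        return False
```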
- We will improve our backend infrastructure to support testing by the QA team before a feature is deployed to the development environment. That means we will automatically provision a whole environment for each feature being developed, so that it can be tested during development and before it even reaches the development environment. That should reduce the need for coordination between the backend and frontend teams. For the majority of frontend applications, we already had this process in place; for Commander, we finalized it last week.
- We will update our issue resolution workflow and the integration of our system with our issue tracking software to precisely track the deployment time of each issue. That should eliminate the confusion around fixed vs. deployed issue states. On top of that, we will enable continuous delivery of the Commander frontend application, which again eliminates some of the coordination problems, because with continuous delivery a fixed issue effectively means a deployed issue (up to a few minutes' difference). Note that this is only possible thanks to having per-feature environment provisioning solved for all frontend applications. In the case of the backend application, we have been working towards continuous delivery for some time already, and this postmortem only adds to the priority of that effort. We will increase the priority of per-feature environments for backend applications, because that is the main precondition for full continuous delivery.
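The fixed-vs-deployed distinction above can be sketched as a small data model. This is a simplified illustration under assumed names, not our real issue-tracker integration: the point is that "fixed" is a single event, while "deployed" is recorded per environment by the deployment pipeline, so production deployment can never be inferred from the fixed state alone.

```python
# Minimal sketch of per-environment deployment tracking for an issue.
# Class and field names are hypothetical, for illustration only.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Optional

@dataclass
class Issue:
    key: str
    fixed_at: Optional[datetime] = None
    # environment name -> timestamp of the deployment that included the fix
    deployed_at: Dict[str, datetime] = field(default_factory=dict)

    def mark_fixed(self) -> None:
        # Set by a developer when the code change is merged.
        self.fixed_at = datetime.now(timezone.utc)

    def mark_deployed(self, environment: str) -> None:
        # Set automatically by the deployment pipeline, never by hand.
        self.deployed_at[environment] = datetime.now(timezone.utc)

    def is_deployed_to(self, environment: str) -> bool:
        # "Fixed" alone is not enough -- exactly the confusion behind
        # this incident.
        return environment in self.deployed_at
```

With this model, the human error from the root cause above becomes impossible to make silently: asking "is the frontend fix in production?" queries `is_deployed_to("production")`, not the fixed flag.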