Unexpected outage

Incident Report for Mews

Postmortem

Problem

On Friday 29th of May 2020 around 8:40 CET we deployed a version of Commander application that had an unexpected error thrown inside an initialization part of the application's code, which then prevented it completely from loading and thus making the Commander application unavailable.

Action

The issue was first noted around 8:47 by customer care. Notified developer picked up the issue and immediately reverted to previous working version of the application. The application was operational again at 8:59.

Causes

The primary cause of the error was a collision of changes in 2 different feature branches merged to master around same time. The first branch was doing consolidation of a class instance creation utility, moving it to different location in the codebase. The second branch was introducing a new class that used the utility from an old location. Because the first branch wasn't aware of any other usage (i.e. the new class) and the second branch wasn't aware of the consolidation, no conflict prevented them from being merged into the master. However, when assembled in master together we ended up with the new class trying to invoke a nonexistent function, thus throwing an unexpected error. Because the instance is created during the application initialization it crashed the execution completely and once deployed it was perceived as the application not loading.
There is however more harmful secondary cause that was revealed by the issue - even though the second branch was merged into the master, it shouldn't have passed a static code analysis preceding the deployment. The newly introduced class was importing a nonexistent export (the moved utility) and that should have been recognized as an error. It turns out that our linter is not recognizing this kind of error properly. Preliminary analysis didn't reveal any obvious misconfiguration, the issue seems to be more intricate and probably connected to our monorepo setup.
None of the existing end-to-end tests for Commander have managed to caught the issue. These tests are not targeted to specific problems but rather test the overall application functionality from a user’s perspective. In case of a failure of any of them, the app is prevented from being deployed. Unfortunately, it didn’t happen in this case.
Last potential cause could be seen in the process itself that allows merging branches that are not up to date, i.e. not including the latest master. This requirement can be enforced by GitHub setting, however it was turned off in the past as the perceived gains were vastly outweighed by effectivity issues. This is connected to the duration of our builds that can take up to 30 minutes on average and its successful completions is required as another check before merging to master is allowed. Thus, by enforcing the latest master check, once a PR is merged, the rest of them needs to be updated and rebuilded again. This makes the flow very attention heavy as the orchestration of merging PRs can occupy and distract one developer whole day.

In this case however, even if the check would be enabled, it wouldn't prevent the problem because of the secondary cause. Nevertheless, we want to investigate enabling the check as another layer of security against similar issues.

Solutions

Until the linter is fixed properly, the frontend developers will be instructed to take extra caution and inform rest of the team when changing location of files, especially in case of common utilities.
We will investigate and fix the linter configuration to recognize imports of missing exports as an error, this will prevent any similar deployments to happen in future.
Additional extensive end-to-end tests will be added to Commander to help us catch any similar issues in future sooner.
We will implement various improvements in our build pipeline to improve the build duration time. This should allow us to enable the latest master check as an extra layer of security.

Posted Jun 16, 2020 - 14:48 CEST

Resolved

This incident has been resolved.

Posted Jun 02, 2020 - 15:53 CEST

Monitoring

We mitigated most serious issues and are continuously monitoring the platform.

Posted Jun 02, 2020 - 12:47 CEST

Investigating

We are experiencing an unexpected outage, our team is currently investigating it.

Posted Jun 02, 2020 - 12:27 CEST

This incident affected: Operations, Guest Journey, Payments, and Open API.