Distributor wasn't available
Incident Report for Mews
Postmortem

Cause

The code of our web products is structured so that we can easily share a lot of common code between them. However, this setup means that any change in the shared code triggers a new version release of every dependent product, which is exactly what happened in this case.

The offending change initially consisted of a change in Commander and a rather small addition of new code to one of our shared modules. The addition also triggered a new version release of Distributor. However, when processed by Distributor's build configuration, it introduced a fatal error into Distributor's startup code: the code minifier removed a specific function name that the rest of the code relied upon.
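To illustrate this class of failure, here is a minimal sketch in TypeScript. It is not the actual Distributor code: the names `register`, `start` and `startApp` are hypothetical, and the minifier is only simulated by giving the same function a shortened name. It shows how code that looks up a function by its `name` property can fail at startup once that name is changed during the build.

```typescript
// Hypothetical illustration only; none of these names come from Distributor.
type Handler = () => string;

// Stores a handler under its function name. This is the fragile part:
// Function.prototype.name reflects whatever name survives the build.
function register(registry: Map<string, Handler>, fn: Handler): void {
    registry.set(fn.name, fn);
}

// Startup code that expects a handler to be registered as "startApp".
function start(registry: Map<string, Handler>): string {
    const handler = registry.get("startApp");
    if (handler === undefined) {
        throw new Error('Startup handler "startApp" not found');
    }
    return handler();
}

// Unminified build: the function keeps its name, so startup succeeds.
const original = new Map<string, Handler>();
register(original, function startApp() {
    return "started";
});

// Simulated minified build: the same function, renamed to "a".
const minified = new Map<string, Handler>();
register(minified, function a() {
    return "started";
});
```

With these definitions, `start(original)` returns `"started"`, while `start(minified)` throws before the application can load. Registering handlers under an explicit string key, rather than under `fn.name`, is one way to avoid depending on names that a build step may alter.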

Because the development task was originally tied to Commander, the Distributor issue was unfortunately missed by our QA team and slipped into production unnoticed.

Solution

  • Re-enable automatic end-to-end tests during build. They could have prevented such a complete failure of Distributor. We already have automatic end-to-end tests in place, but they are not currently run because of a technical issue: they cause our build server to hang, i.e. it waits forever for the tests to be reported as finished. The issue has already been identified, but the fix consists of multiple configuration changes carried out in multiple steps and has not been finished yet. Its priority has been increased and the tests will be re-enabled as soon as possible.

  • Set up an automatic notifier that reports which products will be released by a given code change. We already identify all affected products during the build, and we can easily propagate this information back to our QA team so that they can ensure none of the affected products is broken before it is released to production.

  • Research possibilities for automatic notification of fatal failures. Because the error was introduced in Distributor's startup code, the regular error reporting had not been initialized yet, so the failure could not be reported back to us. This prolonged the reaction time unnecessarily, and we believe it can be mitigated in the future.

  • Write down a guide to the process of resolving fatal errors in production. Thanks to recent improvements in our deployment configuration, we can revert to the previous version of an application in a matter of seconds, which makes the time to fix after identification effectively instant. Unfortunately, this option was not used in this case: the fix was deployed by regular means, i.e. by producing a new version with the offending code removed, which further prolonged the time to fix. By making this knowledge more visible as part of a written process, we hope it will be used next time to shorten the time to fix.
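The notifier idea from the list above can be sketched as a walk over a dependency graph. This is a minimal sketch under stated assumptions, not our build tooling: the graph shape, the function `affectedProducts`, and the module names `shared-module`, `commander-ui` and `OtherProduct` are all hypothetical. Given the modules touched by a change, it returns every product that depends on any of them, directly or transitively.

```typescript
// Maps each product or module to the modules it depends on directly.
type DependencyGraph = Record<string, string[]>;

// Returns the products that would be re-released by a change to `changed`.
function affectedProducts(
    graph: DependencyGraph,
    products: string[],
    changed: string[],
): string[] {
    const changedSet = new Set(changed);

    // Depth-first search; `seen` guards against cycles and repeated work.
    const dependsOnChange = (node: string, seen: Set<string>): boolean => {
        if (changedSet.has(node)) return true;
        if (seen.has(node)) return false;
        seen.add(node);
        return (graph[node] || []).some((dep) => dependsOnChange(dep, seen));
    };

    return products.filter((product) => dependsOnChange(product, new Set<string>()));
}

// Hypothetical graph: both Commander and Distributor use the shared module.
const graph: DependencyGraph = {
    Commander: ["commander-ui", "shared-module"],
    Distributor: ["shared-module"],
    OtherProduct: [],
    "commander-ui": [],
    "shared-module": [],
};
const products = ["Commander", "Distributor", "OtherProduct"];

// A change to the shared module affects both dependent products.
const released = affectedProducts(graph, products, ["shared-module"]);
```

In this sketch `released` contains both `Commander` and `Distributor`, which is the information QA would need in order to check Distributor even though the task was tied to Commander.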

Posted May 21, 2019 - 10:34 CEST

Resolved
At approximately 11:45 CEST we deployed code that introduced a fatal error in Distributor, preventing it from loading.

The issue was identified immediately afterwards and steps were taken to remove the offending code from production. The fix was deployed at approximately 13:30 CEST.
Posted May 20, 2019 - 11:45 CEST