Degraded performance
Incident Report for Mews
Postmortem

Problem

Between 09:30 and 09:41 UTC on September 29th, application backend performance was degraded and some requests timed out.

Action

We immediately scaled up the backend to compensate for the two misbehaving instances. After a couple of minutes, the problem resolved on its own.

Causes

Two of the backend instances crashed and, after they rebooted, requests were routed to them before they were fully initialized. As a result, they responded to these requests more slowly, so subsequent requests were queued rather than processed immediately. In addition, much of the load on these instances came from many websocket clients reconnecting at once.
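
To illustrate the reconnect pressure described above: when many clients lose their connection at the same moment and all retry immediately, they arrive at the backend as a single burst. One common mitigation, shown here only as a general sketch and not as the fix chosen for this incident, is exponential backoff with random jitter on the client side. The function name `reconnect_delay` and the values of `base` and `cap` are assumptions made for illustration.

```python
import random
from typing import Optional


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                    rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    Hypothetical helper: the upper bound doubles with every failed attempt
    (capped at `cap`), and the actual delay is drawn uniformly between 0 and
    that bound so that clients do not all retry at the same instant.
    """
    source = rng or random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


# Example: delays for the first five attempts of one client.
delays = [reconnect_delay(attempt) for attempt in range(5)]
print(delays)
```

Because each client draws its own random delay, reconnects are spread over a time window instead of hitting the recovering instances all at once.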

Solutions

Apart from better crash monitoring, we will be implementing several solutions:

  • Have the load balancer forward requests only to fully initialized and healthy instances.
  • Optimize the websocket client authentication flow.
  • Route application / API / websocket requests to different backends, increasing resiliency and making sure that an incident in one area does not impact the entire system.
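
The first item in the list above can be sketched roughly as follows. This is a minimal illustration of readiness-based routing, assuming each instance exposes a readiness probe that fails until startup has finished. The `Instance` class, its fields, and the `routable` function are hypothetical names and do not describe the actual load balancer configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    """Hypothetical backend instance as seen by a load balancer."""
    name: str
    initialized: bool = False  # assumed to be set once startup work is done
    healthy: bool = True

    def readiness_status(self) -> int:
        """HTTP-style status code a readiness probe would return."""
        return 200 if self.initialized and self.healthy else 503


def routable(instances: List[Instance]) -> List[Instance]:
    """Keep only the instances whose readiness probe succeeds."""
    return [i for i in instances if i.readiness_status() == 200]


pool = [
    Instance("backend-1", initialized=True),
    Instance("backend-2", initialized=False),  # just rebooted, still starting up
    Instance("backend-3", initialized=True, healthy=False),
]
print([i.name for i in routable(pool)])  # prints ['backend-1']
```

With a gate like this, an instance that has just rebooted receives no traffic until it reports itself ready, so slow responses during initialization do not cause requests to queue up behind it.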
Posted Oct 29, 2021 - 14:19 CEST

Resolved
This incident has been resolved.
Posted Sep 29, 2021 - 12:59 CEST
Monitoring
The system is healthy again and the performance is back to normal. We have identified the root cause and we are monitoring the situation.
Posted Sep 29, 2021 - 12:21 CEST
Investigating
We are currently experiencing degraded system performance.
Posted Sep 29, 2021 - 11:46 CEST
This incident affected: Operations, Guest Journey, Business Intelligence, Payments, Open API, and Marketplace.