Between 09:30 and 09:41 UTC on September 29th, backend performance was degraded and some requests timed out.
We immediately scaled up the backend to compensate for the two misbehaving instances, and within a couple of minutes the backlog cleared and latencies returned to normal.
Two of the backend instances crashed, and after rebooting, requests were routed to them before they had fully initialized. In that state they responded slowly, so subsequent requests queued up instead of being processed immediately. Much of the pressure on these instances came from many websocket clients reconnecting at the same time.
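The simultaneous websocket reconnects are a thundering-herd problem: every client retries on the same schedule, so the recovering instances absorb a spike instead of a trickle. One common client-side mitigation (a sketch of the general technique, not necessarily the exact fix we will ship) is exponential backoff with full jitter, so reconnect attempts spread out over time; the `reconnect_delay` helper below is hypothetical:

```python
import random

def reconnect_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before reconnect attempt `attempt` (0-indexed).

    Exponential backoff bounds the delay window (base * 2**attempt, capped),
    and full jitter picks a uniformly random delay inside that window so
    clients that disconnected together do not all reconnect together.
    """
    window = min(cap, base * (2 ** attempt))
    return random.uniform(0, window)

# Example: successive attempts get progressively wider, randomized delays.
for attempt in range(5):
    print(f"attempt {attempt}: wait {reconnect_delay(attempt):.2f}s")
```

Even with jitter on the client side, the server still needs to refuse traffic until it is ready (the routing problem above); the two mitigations are complementary.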
Apart from better crash monitoring, there are several solutions we will be implementing: