US-EAST-1 region outage 21st of April

Quora is down and Reddit is in emergency read-only mode. So this one is quite severe!

According to the first investigation (from the AWS health dashboard), the cause of the outage was a networking event that triggered re-mirroring of a large number of EBS volumes. This led to capacity problems in the affected region. There were also problems with one control plane, which made it difficult to create new EBS volumes and instances. In router architecture, the control plane is the part responsible for drawing the network map, in case you did not know… I certainly did not know before. In the AWS context, though, the term more likely refers to the management services that handle requests such as creating volumes and instances.

Of course, plenty of other services are impacted by the outage, and I guess this is a great time to see how different services have been designed to survive the degradation of underlying components. Quora is totally dead (well, there is a notification to users) and Reddit is in read-only mode. My points go to Reddit, as they have managed to fail gracefully to a cached read-only mode.
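
To make the idea of failing gracefully a bit more concrete, here is a minimal Python sketch of a service that keeps serving cached reads and rejects writes when its backend goes away. It is not Reddit's actual implementation; every name in it (`DegradableService`, `InMemoryBackend`, `BackendUnavailable`, `ReadOnlyMode`) is made up for illustration, and the backend is just an in-memory stand-in for a real data store.

```python
class BackendUnavailable(Exception):
    """Raised by the (hypothetical) backend when it cannot be reached."""


class ReadOnlyMode(Exception):
    """Raised when a write is attempted while the service is degraded."""


class InMemoryBackend:
    """Stand-in for a real data store; flip `up` to simulate an outage."""

    def __init__(self):
        self.data = {}
        self.up = True

    def get(self, key):
        if not self.up:
            raise BackendUnavailable(key)
        return self.data[key]

    def put(self, key, value):
        if not self.up:
            raise BackendUnavailable(key)
        self.data[key] = value


class DegradableService:
    """Serves reads from a cache and refuses writes while the backend is down."""

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}          # last known good value per key
        self.read_only = False   # True while the backend looks unhealthy

    def read(self, key):
        try:
            value = self.backend.get(key)
        except BackendUnavailable:
            # Degrade: remember that we are in trouble, serve stale data if any.
            self.read_only = True
            if key in self.cache:
                return self.cache[key]
            raise
        # Backend answered, so refresh the cache and leave read-only mode.
        self.cache[key] = value
        self.read_only = False
        return value

    def write(self, key, value):
        if self.read_only:
            raise ReadOnlyMode("writes are disabled while the backend is degraded")
        try:
            self.backend.put(key, value)
        except BackendUnavailable:
            self.read_only = True
            raise ReadOnlyMode("backend went away during the write") from None
        self.cache[key] = value
```

In this sketch only a successful read switches the service back to normal mode, so writes stay blocked until something has confirmed that the backend is healthy again. A real site would more likely use a separate health check and a shared cache, but the shape of the fallback is the same.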

Funny thing: just today I was reading a text by James Hamilton which is spot on for this situation. I must say I am surprised Quora did not have a failover to a different location, as the other US location seems to be OK.
