Recovery


30
Jan 12

Recovering a non-responsive AWS instance

I could not SSH into one of my AWS instances last evening, and it wasn’t serving any pages either. The AWS Management Console said it was up, though. Rebooting did not help. The second reboot did not help either. Shutting it down and starting it again did not help. I was running out of tricks here!

For some reason, the instance had been running at 100% CPU utilization for days.

(I had better do some monitoring in the future!)

Even though CPU usage had dropped after the restart, the instance would not accept any connections. The only options I could think of were to ask on the AWS forum or to move the running volume to a new instance, since the instance was EBS-based. I decided to go with a new volume, hoping the database would not mind too much. The steps I needed were:

  1. Snapshot the running volume
  2. Create a new volume from the snapshot in the same availability zone
  3. Start a new instance with the "Launch more like this" option
  4. Shut down the new instance
  5. Detach the volume on the new instance
  6. Attach the volume created from the snapshot to the new instance (it needs the correct attachment information, like /dev/sda1)
  7. Start the new instance
  8. Disassociate the Elastic IP from the old instance
  9. Associate the correct Elastic IP with the new instance
  10. Test and hope for the best
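The steps above can be sketched as a script. This is only a rough sketch, assuming the boto3 library (which the post does not mention; everything here was originally done in the console) and an `ec2` client created with `boto3.client("ec2")`. The function name and all parameters are hypothetical, step 3 (launching the new instance) is assumed to have been done already, and steps 8 and 9 are collapsed into a single `associate_address` call with `AllowReassociation=True`, which assumes an Elastic IP identified by an allocation ID.

```python
def recover_to_new_instance(ec2, old_volume_id, new_instance_id,
                            new_instance_root_volume_id, allocation_id,
                            availability_zone, device="/dev/sda1"):
    """Move the data of an unresponsive EBS-backed instance to a new one.

    `ec2` is assumed to be a boto3 EC2 client; all IDs are supplied by
    the caller. Returns the ID of the volume created from the snapshot.
    """
    # Step 1: snapshot the running volume and wait until it completes.
    snapshot_id = ec2.create_snapshot(
        VolumeId=old_volume_id, Description="recovery snapshot"
    )["SnapshotId"]
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # Step 2: create a new volume from the snapshot in the same zone.
    volume_id = ec2.create_volume(
        SnapshotId=snapshot_id, AvailabilityZone=availability_zone
    )["VolumeId"]
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Step 4: shut down the new instance (launched beforehand, step 3).
    ec2.stop_instances(InstanceIds=[new_instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[new_instance_id])

    # Step 5: detach the volume the new instance was launched with.
    ec2.detach_volume(VolumeId=new_instance_root_volume_id)
    ec2.get_waiter("volume_available").wait(
        VolumeIds=[new_instance_root_volume_id]
    )

    # Step 6: attach the recovered volume under the same device name.
    ec2.attach_volume(
        VolumeId=volume_id, InstanceId=new_instance_id, Device=device
    )
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])

    # Step 7: start the new instance.
    ec2.start_instances(InstanceIds=[new_instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[new_instance_id])

    # Steps 8 and 9: move the Elastic IP from the old instance to the new.
    ec2.associate_address(
        AllocationId=allocation_id,
        InstanceId=new_instance_id,
        AllowReassociation=True,
    )
    return volume_id
```

Step 10 stays manual: once the function returns, SSH in and check that the site and the database came back as expected.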

This actually worked and did not even take much time. Really cool, actually, when I think about it and imagine having had a physical server instead…


21
Apr 11

US-EAST-1 region outage 21st of April

Quora is down and Reddit is in emergency read-only mode. This is quite severe, then!

According to the first investigation (from the AWS health dashboard), the reason for the outage was a networking event which caused a large number of EBS volumes to be re-mirrored. This caused capacity problems in the affected region. There were also problems with one control plane, which made it difficult to create new EBS volumes and instances. In router architecture, the control plane is the part responsible for drawing the network map, if you did not know it… I certainly did not know before.

Of course, plenty of other services are impacted by the outage, and I guess this is a great time to see how different services have been designed to withstand the degradation of some underlying components. Quora is totally dead (well, there is a notification to users) and Reddit is in read-only mode. I give my points to Reddit, as they have managed to fail gracefully to a cached read-only mode.
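To make the idea of failing gracefully to a cached read-only mode a bit more concrete, here is a toy sketch. It is purely illustrative and says nothing about how Reddit actually implements it: the class names, the `available` flag, and the use of `ConnectionError` to signal an outage are all assumptions made up for this example.

```python
class InMemoryBackend:
    """Stand-in for a primary datastore; `available` simulates an outage."""

    def __init__(self):
        self.data = {}
        self.available = True

    def read(self, key):
        if not self.available:
            raise ConnectionError("backend unavailable")
        return self.data[key]

    def write(self, key, value):
        if not self.available:
            raise ConnectionError("backend unavailable")
        self.data[key] = value


class ReadOnlyFallback:
    """Serve cached reads and refuse writes while the backend is down."""

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}
        self.read_only = False

    def read(self, key):
        try:
            value = self.backend.read(key)
        except ConnectionError:
            # Backend is down: switch to read-only and serve from cache.
            self.read_only = True
            if key in self.cache:
                return self.cache[key]
            raise
        self.read_only = False
        self.cache[key] = value
        return value

    def write(self, key, value):
        try:
            self.backend.write(key, value)
        except ConnectionError:
            # Refuse the write with a clear error instead of hanging.
            self.read_only = True
            raise RuntimeError("site is in read-only mode") from None
        self.read_only = False
        self.cache[key] = value
```

With this wrapper, a page that was read or written before the outage still loads from the cache while the backend is unavailable, and write attempts fail fast with a clear message.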

Funnily enough, just today I was reading a text by James Hamilton that is spot on for this situation. I must say I am surprised Quora did not have a failover to a different location, as the other location in the US seems to be OK.


20
May 10

EBS-based instance problems

The instance I run this blog on was slightly impacted a few days ago. All of a sudden I could not SSH into the instance, and Apache was running painfully slowly. It did not really work at all. While I was already fantasizing that my really super awesome new web-2.0-youtube-facebook-twitter crossbreed vKaiser.com had gotten some traction and was overloaded by the publicity, I ended up on the AWS site to check the service status. The service status was fine and my hopes were still high. Then the truth hit me: there were others in the forums with similar issues. An EBS-based instance becomes unresponsive and a reboot does not help. You can’t take a snapshot of the EBS volume either, but stopping and starting might help. You just have to be prepared for the instance to go down very, very slowly.

So, as I could not take a snapshot and was not particularly interested in using snapshots that were a few days old, I decided to just shut down the instance and give it the time it needed. Eventually, the server went down and I could restart it just fine. Situation back to normal. This incident could of course have been avoided easily by having a backup system ready, or even a load-balanced setup if I had the money to run one.

No luck in getting traction.


8
Apr 10

vKaiser.com

I’ve been neglecting the blog for a while and feel sorry about that. The spring has been busy and will most likely stay that way: some bachelor parties and weddings, and I am also going to be a dad at the beginning of June! The boy is already kicking strong!

But I also have some new cloud-related things to tell you about. Since the blog isn’t exactly driving much traffic and I had some free CPU resources, I started a new project, vKaiser.com, which is a more Web 2.0 oriented site. Well, an imitation of YouTube but with heavy connections to social media sites like Facebook and Twitter. The site is by no means ready, but you are welcome to check it out – with Firefox. IE7 is OK too if you are not in compatibility mode. Interesting things to mention are the storage of the videos and thumbnails in S3 and the possibility of using CloudFront too.

And just to make this post a bit more cloud-related and not just a pitch for my new site, here is a short story of what happened at one point during development. I had the Facebook Connect module as well as the Drupal for Facebook module installed (yes, I ended up running Drupal as the CMS), but I had not enabled the Facebook Connect module, since Drupal for Facebook does essentially the same thing of connecting with your Facebook credentials. Or should do. I had, and still have, problems with the module, as it forwards to a page which can’t be found but still, after a few refreshes, actually logs in. Anyway, I went and enabled the Facebook Connect module while Drupal for Facebook had the same functionality enabled, to see if the other module would work a bit better.

Sure enough, after enabling the module I was staring at a white browser screen with an Internal Server Error 500 and no access to the admin interface at all. What to do then? Should I mess with the database? Remove some modules and run update.php? Well, I could not even access the update page. Luckily, I was running the site on an EBS-based image! I had a week-old snapshot of the volume (yeah, a bit old, but I did not mind), so all I had to do was get the static files out of the bad volume, create a volume from the snapshot, shut down the instance, detach the bad volume and attach the new volume. Boot up. For some reason a reboot was also needed before I could see the log in the AWS EC2 console. Reattach the Elastic IP, copy the static files and I was back in business. Restore time below 10 minutes.

I love EC2.


2
Nov 09

Lottery and Cloud Computing

We have a lottery draw every Saturday. It’s quite traditional in Finland and people are really active in playing it. Last week we had about 6.9 million euros for the lucky person getting all seven numbers right. Finns played the game for a total of over 18 million euros, which was a new record. So it came Sunday and the clock was approaching 8.45 PM. The draw was on. I was one of the suckers with my own numbers in as well. I was playing the online version of the game and did not have my numbers anywhere other than on veikkaus.fi, which is the online game portal of the gaming monopoly in Finland.

The draw was over and the clock showed about 9 PM. I was trying to log in to the portal but could not even get the front page to open. The site was down. A few minutes later the site was still down. It eventually took an hour until the site was functional again. This is a prime example of a site with a highly variable traffic load, and I can’t help but wonder if cloud computing could help in accommodating the variable load. This is just a thought, since I don’t have any idea of the application architecture of veikkaus.fi or whether it would even be legally possible to burst the excess traffic to, let’s say, Amazon. There are connectors to online banking facilities in veikkaus.fi, for example, which might make cloud bursting difficult. It would be interesting, though, if this were possible. This is actually not the first time veikkaus.fi has not worked right after the draw, and I bet there are plenty of people eager to check the results and annoyed at not being able to. Come on, I might be a millionaire and have to wait for this site to load!

I would imagine cloud bursting to be difficult, but by no means impossible if there is a will to do it.

Pauli Haikonen