Amazon AWS


30
Jan 12

Recovering a non-responsive AWS instance

I could not SSH into one of my AWS instances last evening and it wasn’t serving any pages either. The AWS management console said it was up, though. Rebooting did not help. The second reboot did not help either. Shutting down and starting again did not help. I was running out of tricks here!

For some reason, the instance had been running at 100% CPU utilization for days:

(I had better do some monitoring in the future!)

Even though the CPU usage had dropped after the restart, the instance would not accept any connections. The only things I could think of were to ask on the AWS forum or to move the running volume to a new instance, since the instance was an EBS-backed one. I decided to go with a new volume, hoping the database would not mind too much. The steps I needed to do were:

  1. Snapshot the running volume
  2. Create a new volume out of the snapshot in the same availability zone
  3. Start a new instance with the "Launch more like this" option
  4. Shut down the new instance
  5. Detach the volume on the new instance
  6. Attach the volume which was created from the snapshot to the new instance (need to have the correct attachment information, like /dev/sda1)
  7. Start the new instance
  8. Disassociate the Elastic IP from the old instance
  9. Associate the Elastic IP with the new instance
  10. Test and hope for the best
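
The steps above can also be scripted. The following is only a rough sketch, assuming the boto3 library (which did not exist when this was written) and an EC2 client passed in as `ec2`. The function name and its parameters are made up for illustration. Step 3 is assumed to have been done already in the console, and the Elastic IP handling assumes the address has an allocation ID, as is the case in a VPC.

```python
def move_volume_to_new_instance(ec2, old_volume_id, new_instance_id,
                                new_root_volume_id, zone, elastic_ip,
                                device="/dev/sda1"):
    """Hypothetical helper: ec2 is assumed to be boto3.client("ec2")."""
    # Step 1: snapshot the volume of the broken instance.
    snap_id = ec2.create_snapshot(
        VolumeId=old_volume_id, Description="recovery snapshot")["SnapshotId"]
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap_id])

    # Step 2: create a new volume from the snapshot in the same zone.
    vol_id = ec2.create_volume(
        SnapshotId=snap_id, AvailabilityZone=zone)["VolumeId"]
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol_id])

    # Step 4: shut down the freshly launched instance.
    ec2.stop_instances(InstanceIds=[new_instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[new_instance_id])

    # Step 5: detach the root volume the new instance came with.
    ec2.detach_volume(VolumeId=new_root_volume_id, InstanceId=new_instance_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=[new_root_volume_id])

    # Step 6: attach the restored volume under the same device name.
    ec2.attach_volume(VolumeId=vol_id, InstanceId=new_instance_id, Device=device)
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[vol_id])

    # Step 7: start the new instance again.
    ec2.start_instances(InstanceIds=[new_instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[new_instance_id])

    # Steps 8 and 9: move the Elastic IP over to the new instance.
    address = ec2.describe_addresses(PublicIps=[elastic_ip])["Addresses"][0]
    if "AssociationId" in address:
        ec2.disassociate_address(AssociationId=address["AssociationId"])
    ec2.associate_address(
        AllocationId=address["AllocationId"], InstanceId=new_instance_id)

    return vol_id
```

Step 10 stays manual. The waiters are there because most of these calls return before the operation has actually finished.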

This actually worked and did not even take much time. Really cool, when I think about it and imagine I would have had a physical server instead…


8
Dec 11

AWS reboots, oh the drama

I, as well as many others, received an email from Amazon today about the need to reboot one of my instances. Actually, Twitter was already aware of this and was a bit upset about it. For me, this was the second time since 2009 that Amazon has asked me to reboot one of my instances. Once the hardware was degraded, and now this. I would say it’s quite a decent score, since I have averaged something like five instances running all the time.

I am not upset; on the contrary, I am happy that AWS is keeping the infrastructure up to date, whatever the reason for the reboot. Besides, systems should be designed so that rebooting an instance does not take the service down, unless you accept that it does (like I do).

The way AWS informed its customers felt OK. At first there were of course just rumours, but then I received an email stating the need, which gave an acceptable amount of time to react. When I logged in to the AWS Dashboard, I saw this kind of message:

Scheduled Events

Which had a link to further information:

And even more information:

There was an option to do the reboot right away if I wanted, so I did it. After the reboot, I looked at the instance in the dashboard, but the notification icon was still there. I would have thought it would disappear. Then I had a look at the details of the event, and it actually had [Completed] written in front of the event:

Which now probably means it’s ok and I am done with this.
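
The same check can be made from a script. This is a small sketch, assuming the response shape of boto3's `describe_instance_status` call, where each scheduled event has a `Description` that starts with `[Completed]` once it is done. The function name is made up for illustration.

```python
def pending_events(status_response):
    """Return (instance_id, code, description) for events not yet completed.

    status_response is assumed to be the dict returned by
    boto3.client("ec2").describe_instance_status(...).
    """
    pending = []
    for status in status_response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            description = event.get("Description", "")
            # Completed events stay visible for a while with this prefix.
            if not description.startswith("[Completed]"):
                pending.append(
                    (status["InstanceId"], event.get("Code"), description))
    return pending
```

An empty list would then mean there is nothing left to react to, even if the icon is still shown in the dashboard.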


19
Nov 11

My new best AWS feature, CloudFormation

I just realized AWS has a feature called CloudFormation, which allows users to script their technology stack in convenient and easily understood JSON-formatted text files, which can then be used to deploy the stack over and over again, always the same way. Fantastic! This eases the burden of managing a bunch of customized AMIs or other ways of introducing custom features to the AMIs. I wonder how I did not notice this feature before. It even has a tab in the AWS Management Console. There are also some sample templates, which for example install Drupal or a basic Ruby Hello World example.

As a test, I ran the Drupal installation script and I have to say this was by far the easiest Drupal installation I have ever done. It took 5 minutes from start to finish, most of which was just waiting for the deployment to finish. Absolutely great! One minor thing to remember is that key pairs are not available in all Regions: my keys were not available in US East (Virginia), which caused the stack deployment to fail with no explanation other than that the key was not found… I was of course first thinking of a typo in the key name. The other thing is that the user must know the instance type name, such as t1.micro, while a drop-down menu would be great.
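
To give an idea of what such a template looks like, here is a minimal sketch that builds one as a Python dict and prints it as JSON. It is not one of the AWS sample templates. The resource name `WebServer`, the function name and the image ID are placeholders. The `AWS::EC2::KeyPair::KeyName` parameter type and `AllowedValues` are features of current CloudFormation that make the console check the key pair in the chosen Region and offer a list of instance types.

```python
import json


def build_template(image_id, allowed_types=("t1.micro", "m1.small")):
    """Return a minimal single-instance CloudFormation template as a dict."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Single EC2 instance (illustrative sketch)",
        "Parameters": {
            # Must name a key pair that exists in the Region being deployed to.
            "KeyName": {
                "Type": "AWS::EC2::KeyPair::KeyName",
                "Description": "Existing EC2 key pair for SSH access",
            },
            # Restricting the values saves the user from typing the type name.
            "InstanceType": {
                "Type": "String",
                "Default": allowed_types[0],
                "AllowedValues": list(allowed_types),
            },
        },
        "Resources": {
            "WebServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": image_id,  # placeholder, AMI IDs differ per Region
                    "InstanceType": {"Ref": "InstanceType"},
                    "KeyName": {"Ref": "KeyName"},
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_template("ami-EXAMPLE"), indent=2))
```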

There is also the possibility to modify an existing stack, which is actually a relatively new feature. This makes it even more usable. It would be interesting to see if I could build a stack for a simple Aegir installation, as lately that’s the platform I have been installing the most, and doing the manual installation has become kind of boring. CloudFormation would help a lot with that!


16
May 11

Amazon Web Services used in Sony PSN attack

Today’s breaking news has been Bloomberg’s story about the Sony PSN attack having been conducted using Amazon Web Services. I read the story and feel confused: how on earth can the source of the servers be of any relevance if the attackers were using a public cloud provider? Come on, Amazon can’t, and really should not, follow what its customers do with their servers. This whole thing Bloomberg is writing about is like saying the bank was robbed with a Smith & Wesson and it was Smith & Wesson’s fault.

Of course, there will be a subpoena to get all the information on the account that was used, and I guess they had to use some stolen credit card as well, which is interesting. Also, the statement in Bloomberg’s article about anyone being able to go and get an AWS account anonymously is not totally true. Maybe it can be managed somehow with a stolen credit card, but it’s not an anonymous service as such. And how are you going to prevent that “flaw” in the system, the possibility of using stolen cards and false identities? Scan your ID and send that as well, or visit them at AWS personally? Huh?

At the end of the article, there is a thought-provoking paragraph on “Rethinking the Cloud”, because a cloud can also be used for malicious purposes. Yep. I’ll think about this for a while…

Thinking…

Thinking…

…and it should not matter for the most part. Say the whole of AWS were used only for attacks, the service level degraded and my IPs got blacklisted; then I would probably switch to some other provider. But right now, I am not the least bit worried. I have my application and the service level I need in a good and healthy balance.


21
Apr 11

US-EAST-1 region outage 21st of April

Quora is down, Reddit is in emergency read-only mode. This is quite severe, then!

According to the first investigation (from the AWS health dashboard), the reason for the outage was a networking event which caused a large number of EBS volumes to be re-mirrored. This caused capacity problems in the affected region. There were also problems with one control plane, which made it difficult to create new EBS volumes and instances. A control plane is a piece of router architecture which is responsible for drawing the network map, if you did not know it… I certainly did not know before.

Of course, there are plenty of other services impacted by the outage, and I guess this is a great time to see how different services have been designed to sustain a degradation of some underlying components. Quora is totally dead (well, there is the notification to users) and Reddit is in read-only mode. I give my points to Reddit, as they have managed to fail gracefully to a cached read-only mode.

Funny thing, just today I was reading a text by James Hamilton which is spot on for this situation. I have to say I am surprised Quora did not have a failover to a different location, as the other location in the US seems to be OK.
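
Failing gracefully to a cached read-only mode can be sketched in a few lines. This is only an illustration of the general idea and certainly not how Reddit has built it. The class name is made up, and `primary` is assumed to be any object with `get` and `put` methods that raise `ConnectionError` when the backing store is unreachable.

```python
class ReadOnlyFallback:
    """Serve reads from a cache when the primary store is unreachable."""

    def __init__(self, primary, cache):
        self.primary = primary  # hypothetical store with get(key) / put(key, value)
        self.cache = cache      # plain dict holding the last values seen

    def read(self, key):
        try:
            value = self.primary.get(key)
        except ConnectionError:
            # Degraded mode: answer from the cache, possibly stale or missing.
            return self.cache.get(key)
        self.cache[key] = value
        return value

    def write(self, key, value):
        try:
            self.primary.put(key, value)
        except ConnectionError:
            # Refuse writes instead of taking the whole service down.
            raise RuntimeError("read-only mode: writes are disabled") from None
        self.cache[key] = value
```

The point of the design is that reads keep working from whatever was cached before the failure, while writes fail clearly rather than silently disappearing.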