Posts Tagged: Recovery


30
Jan 12

Recovering a non-responsive AWS instance

I could not SSH into one of my AWS instances last evening, and it wasn’t serving any pages either. The AWS Management Console said it was up, though. Rebooting did not help. The second reboot did not help either. Shutting down and starting again did not help. I was running out of tricks here!

For some reason, the instance had been running at 100% CPU utilization for days.

(I had better set up some monitoring in the future!)

Even though the CPU usage had dropped after the restart, the instance would not accept any connections. The only options I could think of were to ask on the AWS forum or, since the instance was EBS-based, to move the running volume onto a new instance. I decided to go with the new volume and hoped the database would not mind too much. The steps I needed to do were:

  1. Snapshot the running volume
  2. Create a new volume from the snapshot in the same availability zone
  3. Start a new instance with "Launch more like this"
  4. Shut down the new instance
  5. Detach the volume from the new instance
  6. Attach the volume created from the snapshot to the new instance (the attachment information, like /dev/sda1, has to be correct)
  7. Start the new instance
  8. Disassociate the Elastic IP from the old instance
  9. Associate the correct Elastic IP with the new instance
  10. Test and hope for the best
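I did all of this by hand in the console, but the steps above can be sketched in code as well. The following is only a rough sketch, assuming a client object that behaves like the boto3 EC2 client (a modern SDK, not what I used at the time). The function name and all of its parameters are made up for illustration. Step 3 (launching the new instance) is assumed to be done already, and steps 8 and 9 are folded into one call that assumes a VPC-style Elastic IP with an allocation ID.

```python
def swap_in_recovered_volume(ec2, broken_volume_id, new_instance_id,
                             stock_volume_id, zone, allocation_id,
                             device="/dev/sda1"):
    """Sketch of steps 1-2 and 4-9; `ec2` is assumed to act like a boto3 EC2 client.

    stock_volume_id is the root volume the freshly launched instance came with.
    """
    def wait(name, **kwargs):
        # Block until AWS reports the given state before moving on.
        ec2.get_waiter(name).wait(**kwargs)

    # 1. Snapshot the volume of the broken instance.
    snapshot_id = ec2.create_snapshot(VolumeId=broken_volume_id)["SnapshotId"]
    wait("snapshot_completed", SnapshotIds=[snapshot_id])

    # 2. Create a new volume from the snapshot in the same availability zone.
    volume_id = ec2.create_volume(SnapshotId=snapshot_id,
                                  AvailabilityZone=zone)["VolumeId"]
    wait("volume_available", VolumeIds=[volume_id])

    # 4. Shut down the new instance (step 3, launching it, is assumed done).
    ec2.stop_instances(InstanceIds=[new_instance_id])
    wait("instance_stopped", InstanceIds=[new_instance_id])

    # 5. Detach the root volume the new instance came with.
    ec2.detach_volume(VolumeId=stock_volume_id)
    wait("volume_available", VolumeIds=[stock_volume_id])

    # 6. Attach the recovered volume under the same device name.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=new_instance_id,
                      Device=device)
    wait("volume_in_use", VolumeIds=[volume_id])

    # 7. Start the new instance.
    ec2.start_instances(InstanceIds=[new_instance_id])
    wait("instance_running", InstanceIds=[new_instance_id])

    # 8-9. Move the Elastic IP from the old instance to the new one.
    ec2.associate_address(AllocationId=allocation_id,
                          InstanceId=new_instance_id,
                          AllowReassociation=True)
    return volume_id
```

With boto3 installed and credentials configured, this would be called with something like boto3.client("ec2") as the first argument. Step 10 stays manual either way.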

This actually worked and did not even take much time. It is really cool when I think about it and imagine what this would have been like with a physical server instead…


8
Apr 10

vKaiser.com

I’ve been neglecting the blog for a while and feel sorry about that. The spring has been busy and will most likely stay that way: some bachelor parties and weddings, and I am also going to be a dad at the beginning of June! The boy is already kicking strong!

But I also have some new cloud-related things to tell you about. Since the blog isn’t exactly driving much traffic and I had some free CPU resources, I started a new project, vKaiser.com, which is a more Web 2.0 oriented site. Well, an imitation of YouTube, but with heavy connections to social media sites like Facebook and Twitter. The site is by no means ready, but you are welcome to check it out – with Firefox. IE7 is OK too if you are not in compatibility mode. One interesting detail is that the videos and thumbnails are stored in S3, with the possibility of using CloudFront too.

And just to make this post a bit more cloud-related and not only a pitch for my new site, here is a short story of what happened at one point during development. I had both the Facebook Connect module and the Drupal for Facebook module installed (yes, I ended up running Drupal as the CMS), but I had not enabled Facebook Connect, since Drupal for Facebook does essentially the same thing of logging you in with your Facebook credentials. Or should do. I had, and still have, problems with the module: it forwards to a page which can’t be found, but after a few refreshes it actually logs in. Anyway, I went and enabled the Facebook Connect module, while Drupal for Facebook still had the same functionality enabled, to see if the other module would work a bit better.

Sure enough, after enabling the module I was staring at a white browser screen with an Internal Server Error 500 and no access to the admin interface at all. What to do then? Should I mess with the database? Remove some modules and run update.php? Well, I could not even access the update page. Luckily, I was running the site on an EBS-based image! I had a week-old snapshot of the volume (yeah, a bit old, but I did not mind), so all I had to do was get the static files out of the bad volume, create a volume from the snapshot, shut down the instance, detach the bad volume and attach the new volume. Boot up. For some reason a reboot was also needed before I could see the log in the AWS EC2 console. Reattach the Elastic IP, copy the static files back and I was back in business. Restore time: under 10 minutes.

I love EC2.


28
Oct 09

Backing up and restoring an Amazon instance, take 2

Today I realized you have to be really careful about where your Amazon instances get placed if you are using reserved instances. I did the math and figured out that buying a reserved instance would probably be ideal for me, since I rarely need to scale, other than the few times I might be testing how to scale.

So I bought the m1.small reserved instance and had it placed in the best available zone automatically. It went to eu-west-1b. Then I created my first instance and started enjoying my 0.04 per hour charge. The account activity showed something else, though: the instance was running at the normal 0.11 per hour rate. Then I had a look at where exactly my instance had been placed. It was eu-west-1a, not 1b, even though that had been chosen automatically as well. Sigh. My EBS volume was in 1a too. I am not sure if you can actually even mix those. I would guess not.
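To put a number on it, here is a small sketch of what the mismatch costs. It assumes, as my account activity suggested, that the reserved rate only applies when the instance type and the availability zone both match the reservation. The function name and data layout are made up for illustration, and the rates are the ones quoted above.

```python
RESERVED_RATE = 0.04   # per hour, the reserved rate mentioned in this post
ON_DEMAND_RATE = 0.11  # per hour, the normal rate mentioned in this post


def hourly_rate(reservation, instance):
    """Return the reserved rate only if type and zone both match (assumed rule)."""
    same_type = reservation["type"] == instance["type"]
    same_zone = reservation["zone"] == instance["zone"]
    return RESERVED_RATE if same_type and same_zone else ON_DEMAND_RATE


reservation = {"type": "m1.small", "zone": "eu-west-1b"}
misplaced = {"type": "m1.small", "zone": "eu-west-1a"}  # where my instance landed
moved = {"type": "m1.small", "zone": "eu-west-1b"}      # where it should be

hours = 24 * 30  # roughly one month
extra = (hourly_rate(reservation, misplaced) - hourly_rate(reservation, moved)) * hours
print(f"Extra cost over 30 days in the wrong zone: {extra:.2f}")
```

So leaving the instance in the wrong zone would cost about 0.07 extra for every hour it runs, on top of what the reservation itself cost.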

So I pretty much had to do the same thing again that I did yesterday. I wanted to take another shot at that fstab issue to see if it could be solved, so I decided to bundle the image again. I also snapshotted the EBS volume, created a volume from the snapshot and placed the volume in eu-west-1b. I got my bundle uploaded to S3, so I proceeded to launch it. Once again, the AWS management console was lagging a bit and did not offer me the small instance, nor any RAM disk ID or kernel ID either. Waiting a bit helped here, and I got my IDs and the small instance as well. I don’t even know what those IDs are, but I guess they are somewhat important…

I was then able to boot up my instance in the glorious eu-west-1b availability zone. I attached the EBS volume created from the snapshot, but the fstab was once again missing the mountpoint, so I had to add that and reboot. MySQL and Apache started fine this time and the site was up. The last things to do were to point the Elastic IP at the new host, shut down the old server and clean up the old snapshots and EBS volumes. The whole thing took around an hour while sitting on the sofa and watching the evening news at the same time.

I had a look at my Account Activity, but so far there has not been any new information about the 0.04 charge… I really hope this was not all for nothing! 
 

Pauli Haikonen


27
Oct 09

Backing up and restoring an Amazon AWS instance

On Sunday I got my virtual server running in Amazon EC2 and it has been happily running there since. I have done some homework and know not to rely on Amazon keeping the instance running forever. I should expect it to fail. At the beginning this felt suspicious… I should expect the server to fail? What kind of a service is that? After giving this some thought I eventually realized this is of course how we should expect all IT systems to behave. It just happens that IT people mostly tend to rely on good luck instead of having a tested backup scheme, not to mention a tested disaster recovery plan.

Amazon forces you to think in a different way. The virtual machines run on not-so-high-end server hardware, which means that the server can fail at any given moment and the poor sysadmin has to figure a way out of the situation. The good thing is that Amazon provides some tools for the task as well, for example the Amazon Simple Storage Service (S3) and Amazon EBS, which provide persistent storage for files used by the instances. The characteristic difference between these two is that an instance can mount an EBS volume, whereas S3 works through REST and SOAP interfaces, which makes EBS fast but expensive and S3 slow but cheap.

In my previous post, I set up my first instance and application running in the cloud. I was a bit lucky that the system stayed up until today, because I had not done the basic steps of bundling my instance and snapshotting the volume. To do this, a few things have to be done.

  1. Check whether you have ec2-ami-tools installed on the instance you are bundling. Install them if they are missing.
  2. Move your private key (pk-something.pem) and certificate (cert-something.pem) files to the instance’s /mnt directory (this is fine since /mnt will not be bundled).
  3. Use the ec2-bundle-vol command to build the bundle (for example: ec2-bundle-vol -d /mnt/image -k pk.pem -c cert.pem -u "AWS account id" -r i386). If you are using the standard AMI, you might get a note like the one below, but this is no reason for concern, as the bundling will most likely still complete:

    NOTE: rsync with preservation of extended file attributes failed. Retrying rsync
    without attempting to preserve extended file attributes…
    NOTE: rsync seemed successful but exited with error code 23. This probably means
    that your version of rsync was built against a kernel with HAVE_LUTIMES defined,
    although the current kernel was not built with this option enabled. The bundling
    process will thus ignore the error and continue bundling. If bundling completes
    successfully, your image should be perfectly usable. We, however, recommend that
    you install a version of rsync that handles this situation more elegantly.

  4. Use the ec2-upload-bundle command to upload the bundle to S3 (ec2-upload-bundle -b myimages -m image/image.manifest.xml -a "Access Key ID" -s "Secret Access Key").
  5. Register your private AMI. I did this through the AWS management console.
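The two commands from steps 3 and 4 are easy to mistype, so here is a minimal sketch that only assembles them, using the same flags as above. The helper names are made up, and every value passed in (account ID, keys, bucket, paths) is a placeholder, not a real credential.

```python
import shlex


def bundle_vol_command(dest_dir, private_key, cert, account_id, arch="i386"):
    """Build the ec2-bundle-vol argument list with the flags used in step 3."""
    return ["ec2-bundle-vol", "-d", dest_dir, "-k", private_key,
            "-c", cert, "-u", account_id, "-r", arch]


def upload_bundle_command(bucket, manifest, access_key_id, secret_access_key):
    """Build the ec2-upload-bundle argument list with the flags used in step 4."""
    return ["ec2-upload-bundle", "-b", bucket, "-m", manifest,
            "-a", access_key_id, "-s", secret_access_key]


# Placeholder values only; substitute your own before running anything.
bundle = bundle_vol_command("/mnt/image", "pk.pem", "cert.pem", "AWS account id")
upload = upload_bundle_command("myimages", "image/image.manifest.xml",
                               "Access Key ID", "Secret Access Key")

# shlex.join quotes arguments that contain spaces, so the output is paste-safe.
print(shlex.join(bundle))
print(shlex.join(upload))
```

On the instance itself, each list could be handed to subprocess.run(cmd, check=True) instead of being printed. Passing a list rather than one long string avoids the quoting problems you get when a value contains spaces.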

If all went well, you should now have your private AMI created and ready to be provisioned. I proceeded to test my setup by first taking a snapshot of the EBS volume. I did this by running xfs_freeze -f /mountpoint and then snapshotting the volume through the AWS console. Of course, I should have done a sync and a database lock too, but I decided to live dangerously since this is just a test setup. After the snapshot was completed I unfroze the partition and terminated the running instance.

I started provisioning the replacement server, but to my surprise I did not have the option of a small instance anymore. The choices started from large. I was puzzled and did not really want to provision a large instance. It could have been something in the web GUI, so I changed my region from EU to US and back, and suddenly the small one was in the list as well. Great!

It’s fantastic to realize how easy it is to scale vertically with Amazon AWS, though it’s not too different from VMware ESX, which goes along the same lines: shut down, change the VM properties and boot up with more RAM and/or CPU.
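The freeze, snapshot, unfreeze sequence is the part where forgetting the last step hurts the most, since a frozen filesystem blocks all writes. A minimal sketch of a wrapper that always unfreezes, even if the snapshot step fails, could look like this. The helper name is made up, xfs_freeze -u is the unfreeze counterpart of the -f flag used above, and the sync I skipped is included. The database lock is still left out.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def frozen_filesystem(mountpoint, run=subprocess.run):
    """Flush and freeze an XFS filesystem, and always unfreeze it afterwards."""
    run(["sync"], check=True)                          # flush pending writes first
    run(["xfs_freeze", "-f", mountpoint], check=True)  # freeze
    try:
        yield
    finally:
        # Runs even if the snapshot step raised an exception.
        run(["xfs_freeze", "-u", mountpoint], check=True)  # unfreeze


def dry_run(cmd, check=True):
    """Stand-in runner that only prints, so the sketch works without root or XFS."""
    print("would run:", " ".join(cmd))


with frozen_filesystem("/mountpoint", run=dry_run):
    print("take the EBS snapshot here (console or API)")
```

Dropping the run=dry_run argument makes it execute the real commands, which needs root and an XFS filesystem at the given mountpoint.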

The server was launched and I could log on. Apache and MySQL did not start, though, because the EBS volume was not attached. I proceeded to attach it and gave the server another reboot, but the services still did not start. I then went to look for the mountpoint, which had disappeared. After adding it and rebooting once more, the services were running happily! I also had to manually assign the Elastic IP to the new instance. I suppose the mount point information is not included in the bundle, but I have to investigate this further.
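Re-adding the mount point by hand after every restore gets old quickly. Here is a small sketch of a check that adds the fstab line only when it is missing. The device name /dev/sdf and the mount point /vol are placeholders I made up for the example, and the xfs type is assumed because the volume was frozen with xfs_freeze. The function works on text only, so nothing is written to /etc/fstab here.

```python
def ensure_fstab_entry(fstab_text, device, mountpoint,
                       fstype="xfs", options="defaults"):
    """Return fstab_text with an entry for mountpoint, adding one if missing."""
    for line in fstab_text.splitlines():
        fields = line.split()
        # Skip blank lines and comments; field 2 of an fstab line is the mount point.
        if len(fields) >= 2 and not fields[0].startswith("#") and fields[1] == mountpoint:
            return fstab_text  # already present, leave the file untouched
    if fstab_text and not fstab_text.endswith("\n"):
        fstab_text += "\n"
    return fstab_text + f"{device} {mountpoint} {fstype} {options} 0 0\n"


current = "/dev/sda1 / ext3 defaults 1 1\n"  # example content, not my real fstab
print(ensure_fstab_entry(current, "/dev/sdf", "/vol"), end="")
```

Running it twice gives the same result, so it would be safe to call from a boot or restore script that then writes the text back to /etc/fstab.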

I now have my system running again. I also have a little more confidence in Amazon AWS. This effort required a good deal of manual work, though. The next step would be to automate some of this and to create a recovery plan in case the server fails.

Pauli Haikonen