Rizzo and Self: Amazon EBS Volume Failure Notice

Ever wonder what happens when Amazon loses your data? I got this email the other day:

Dear Jeffrey C Rizzo,

Your volume experienced a failure due to multiple failures of the underlying hardware components and we were unable to recover it.
Although EBS volumes are designed for reliability, backed by multiple physical drives, we are still exposed to durability risks caused by concurrent hardware failures of multiple components, before our systems are able to restore the redundancy. We publish our durability expectations on the EBS detail page here (http://aws.amazon.com/ebs).


Sincerely,
EBS Support

Fortunately, there was nothing of importance on that volume, and I had a snapshot that was unaffected by the failure. Still, it's a reminder that just because something is in "the cloud" doesn't mean it's safe.

On its EBS detail page, Amazon says that a single volume with less than 20GB of data changed since the last snapshot should expect an annual failure rate of between 0.1% and 0.5%. In other words, between 1 and 5 of every 1,000 volumes should fail each year, where failure is defined as complete loss of the volume. This is better than commodity hard drive failure rates, but still in the range where failures can and do happen to ordinary users.
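To put those percentages in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes that volumes fail independently and that the annual failure rate stays constant from year to year. Both are simplifications of mine, not something Amazon states, and the function names are made up for illustration.

```python
def expected_failures(volumes: int, afr: float) -> float:
    """Expected number of volumes lost in one year at annual failure rate `afr`."""
    return volumes * afr


def prob_at_least_one_failure(volumes: int, afr: float, years: float = 1.0) -> float:
    """Chance of losing at least one of `volumes` volumes within `years`.

    Assumes independent failures and a constant annual failure rate.
    """
    survive_one = (1.0 - afr) ** years   # one volume survives the whole period
    survive_all = survive_one ** volumes  # every volume survives
    return 1.0 - survive_all


if __name__ == "__main__":
    # The published range: 0.1% to 0.5% per year.
    for afr in (0.001, 0.005):
        print(f"AFR {afr:.1%}:")
        print(f"  expected losses per 1000 volumes: {expected_failures(1000, afr):.1f}")
        print(f"  P(at least one loss, 10 volumes, 3 years): "
              f"{prob_at_least_one_failure(10, afr, years=3):.2%}")
```

The point is that a small per-volume rate adds up quickly once you run several volumes for several years.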

A simple way to limit your exposure is to snapshot often. EBS snapshots are stored on Amazon's S3 service, which is designed for 99.999999999% durability and 99.99% availability. If a volume fails, you need only roll back to the most recent snapshot, and you lose only the data that changed in the interim. A more robust, high-availability solution would be to mirror writes to volumes in different availability zones, but that's a project for another day.
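As a minimal sketch of "snapshot often", the function below takes a new snapshot only when the newest existing one is older than a chosen age. It expects an EC2 client such as the one returned by `boto3.client("ec2")` and uses that client's `describe_snapshots` and `create_snapshot` calls. The function name, the 24-hour default, and the description string are my own choices for illustration, and it ignores result pagination, which should not matter for a single volume with a modest number of snapshots.

```python
from datetime import datetime, timedelta, timezone


def snapshot_if_stale(ec2, volume_id, max_age=timedelta(hours=24), now=None):
    """Create a snapshot of `volume_id` unless a recent enough one exists.

    `ec2` is assumed to be an EC2 client, e.g. boto3.client("ec2").
    Returns the new snapshot ID, or None if no snapshot was needed.
    """
    now = now or datetime.now(timezone.utc)

    # Look up this account's existing snapshots of the volume.
    resp = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[{"Name": "volume-id", "Values": [volume_id]}],
    )
    start_times = [snap["StartTime"] for snap in resp["Snapshots"]]

    # The newest snapshot is still fresh: nothing to do.
    if start_times and now - max(start_times) <= max_age:
        return None

    created = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"periodic snapshot of {volume_id}",
    )
    return created["SnapshotId"]
```

Run from cron or a similar scheduler, something like `snapshot_if_stale(boto3.client("ec2"), "vol-...")` would bound how much changed data a volume failure can cost you. The volume ID here is a placeholder.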

Keep your data safe!


22 April 2012
