Re: An architectural error
@Jim - You are confusing the EBS service with S3.
Amazon's S3 service was completely unaffected by this; that is the service that stores data in multiple locations (zones) within a region. You don't 'choose' those locations - that is taken care of automatically.
The affected service was EBS, which is essentially a network-attached drive mirrored onto two devices. So an EBS volume has some redundancy, but it is all within the same zone. You should never, ever rely on an EBS volume not failing completely.
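To make that concrete, here is a minimal sketch (using the modern Python SDK, boto3; the volume ID and region are made up) of snapshotting a volume, since snapshots are stored regionally via S3 rather than in a single zone:

import boto3

# Snapshot an EBS volume so a copy exists outside the volume itself.
# Snapshots go to S3, which is regionally durable, not tied to one zone.
# The volume ID below is hypothetical.
ec2 = boto3.client("ec2", region_name="us-east-1")

snap = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",
    Description="nightly backup of app-data volume",
)
print("Started snapshot:", snap["SnapshotId"])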
Architectural best practice (whether in the cloud or not) is, where possible, not to rely on a single device or server. Those with well-engineered setups were unaffected by this outage: the EBS volumes they already had in their secondary zone carried on working. Yes, you couldn't create/attach/detach/snapshot volumes in the other zones because the common control plane was overloaded, but the volumes that already existed went on just fine.
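For anyone who wants the concrete version, a minimal sketch (again boto3; the zone names, size and tags are purely illustrative) of provisioning the same data volume in two availability zones, with keeping the copies in sync left to the application or database layer:

import boto3

# Provision the same data volume in two availability zones so a
# single-zone EBS failure doesn't take out both copies.
ec2 = boto3.client("ec2", region_name="us-east-1")

ZONES = ["us-east-1a", "us-east-1b"]  # primary and secondary zones

volumes = []
for zone in ZONES:
    resp = ec2.create_volume(
        AvailabilityZone=zone,
        Size=100,          # GiB, illustrative
        VolumeType="gp3",
        TagSpecifications=[{
            "ResourceType": "volume",
            "Tags": [{"Key": "role", "Value": "app-data"}],
        }],
    )
    volumes.append(resp["VolumeId"])

print("Created volumes:", volumes)
# EBS itself only mirrors within one zone; replication between the two
# volumes (e.g. database replicas) is the application's job.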
The biggest problems here are:
(a) Communication - commenters have universally agreed that this was poor.
(b) Some customers (around 2.5%) with RDS (database) services running in multiple zones were affected. This shouldn't have happened, and Amazon have admitted it was partly related to network traffic and partly due to a bug.
(c) The control plane overload affected other zones.
(d) This was caused by human error.
The fact that a whole zone failed? Not good, but not unexpected. I've never met a data centre that hasn't had some sort of power or network failure affecting multiple servers. Normally something causes the power to fail over to UPS/generators (e.g. testing!) and the failover doesn't work :-)