An architectural error
The concept behind Amazon's storage is that there are typically three copies of any data element, with one at a remote site. This particular failure appears to have occurred because the "remote" site chosen for most users was just in another part of the same data center, rather than truly independent of the local copy (sharing no internal network with it).
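To make the point concrete, here is a minimal sketch (my own simplification, not Amazon's actual placement logic) of what "truly remote" ought to mean: a replica set should only be accepted if at least one copy shares neither the building nor the internal network of the primary.

    # Hypothetical illustration of replica placement rules; not Amazon's code.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Site:
        name: str
        building: str          # physical facility the copy lives in
        network_segment: str   # internal network the copy depends on

    def is_truly_remote(primary, replica):
        """A copy only counts as 'remote' if it shares neither the building
        nor the internal network of the primary copy."""
        return (primary.building != replica.building
                and primary.network_segment != replica.network_segment)

    def replica_set_is_safe(copies):
        """Accept a replica set only if at least one copy is genuinely
        independent of the first."""
        primary, others = copies[0], copies[1:]
        return any(is_truly_remote(primary, s) for s in others)

    # The failure mode at issue: the 'remote' copy is merely another part
    # of the same data center, on the same internal network.
    local   = Site("zone-a", building="DC1", network_segment="dc1-core")
    nearby  = Site("zone-b", building="DC1", network_segment="dc1-core")
    distant = Site("zone-c", building="DC2", network_segment="dc2-core")

    print(replica_set_is_safe([local, nearby, nearby]))    # False: no real independence
    print(replica_set_is_safe([local, nearby, distant]))   # True: one genuinely remote copy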
Whether this was a misunderstanding of the concept of "remote," or a misrepresentation that zones do not share resources, is only part of the issue. There is a real weakness in the Amazon model: if a large number of copies is lost, the system MUST rebuild new replicas somewhere, both to meet the published SLA and to satisfy their own architectural requirements.
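That obligation can be sketched as a simple control loop (again hypothetical; the real control plane is far more elaborate): any volume whose replica count falls below target immediately queues a rebuild, which is exactly how a correlated loss turns into a storm.

    # Hypothetical sketch of the re-replication obligation; the real
    # control plane is far more complex.
    TARGET_REPLICAS = 3

    def rebuild_requests(volumes):
        """volumes maps volume id -> number of currently live copies.
        Every volume below target queues a rebuild for its missing copies,
        so a correlated outage emits a rebuild request per affected volume."""
        requests = []
        for volume_id, live_copies in volumes.items():
            missing = TARGET_REPLICAS - live_copies
            if missing > 0:
                requests.append((volume_id, missing))
        return requests

    # If an apparent zone loss removes one copy from a million volumes,
    # the system is suddenly obliged to create a million new replicas.
    volumes = {f"vol-{i}": 2 for i in range(1_000_000)}
    print(len(rebuild_requests(volumes)))   # 1000000 rebuild requests at once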
That obligation means they must over-provision storage by enough to absorb the loss of the largest data center they have (not zones, whole locations!) and, where it gets really painful, provide compute resources in the management nodes and communications capacity throughout the system for the bandwidth needed to handle literally billions of replication requests.
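Some back-of-envelope arithmetic, with numbers that are purely my own guesses rather than anything Amazon has published, shows the scale of what that implies:

    # Back-of-envelope arithmetic with made-up numbers, purely to show the
    # scale of the argument; none of these figures are Amazon's.
    largest_dc_capacity_pb = 50        # assumed data held in the largest data center, in petabytes
    rebuild_window_hours   = 24        # assumed window in which redundancy must be restored

    spare_capacity_pb = largest_dc_capacity_pb            # idle space to absorb a whole site
    bits_to_move = largest_dc_capacity_pb * 1e15 * 8      # petabytes -> bits
    required_gbps = bits_to_move / (rebuild_window_hours * 3600) / 1e9

    print(f"Spare capacity held idle: {spare_capacity_pb} PB")
    print(f"Sustained replication bandwidth: {required_gbps:,.0f} Gb/s")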
I believe this is beyond current technology or financial reach. Amazon should have had two more layers of logic before allowing the storm to explode. First, they should have detected this as a multi-zone, data-center outage rather than a myriad of individual recovery requests. That would have allowed throttling to kick in, possibly slowing replication almost to a halt.

The second layer is also obvious. They should have had logic, plus an independent back-channel network, to detect that the problem storage had NOT failed but had only become inaccessible. That would have allowed business to continue, using other data centers for storage, compute, or both. Any write transactions destined for the (healthy, but offline) affected replicas could be journaled and synced later. (Note that, in this current disaster, restoring sync after several days of partial outage must have been fun!)
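A rough sketch of the two layers I'm describing, with every name and threshold invented for illustration, might look like this:

    # Hypothetical sketch of the two extra layers argued for above; all
    # names and thresholds are invented for illustration.
    import time
    from collections import deque

    RECENT_WINDOW_S = 60       # look at failure reports from the last minute
    STORM_THRESHOLD = 10_000   # above this rate, assume a correlated outage

    failure_reports = deque()  # (timestamp, volume_id) pairs

    def report_failure(volume_id):
        failure_reports.append((time.time(), volume_id))

    def correlated_outage():
        """Layer 1: recognise a multi-zone or data-center event from the
        sheer rate of failure reports, so rebuilds can be throttled or
        paused instead of exploding into a storm."""
        cutoff = time.time() - RECENT_WINDOW_S
        while failure_reports and failure_reports[0][0] < cutoff:
            failure_reports.popleft()
        return len(failure_reports) > STORM_THRESHOLD

    def alive_on_backchannel(node):
        """Layer 2: probe the 'failed' storage over an independent
        management network. Stubbed here; a real check would contact
        the node directly."""
        return True   # assumption for the sketch

    def should_rebuild(node):
        """During a correlated event, a replica that still answers on the
        back-channel is treated as inaccessible, not failed, so no rebuild
        is queued for it."""
        return not (correlated_outage() and alive_on_backchannel(node))

    journal = []   # writes destined for healthy-but-offline replicas

    def journal_write(volume_id, data):
        """Journal the write and replay it once the partition heals,
        rather than triggering re-replication."""
        journal.append((volume_id, data))

    report_failure("vol-1")
    print(should_rebuild("storage-node-17"))   # True: one failure is just churn

The design choice that matters is the default: during a correlated event the answer flips from "rebuild now" to "wait, journal, and resync," which is cheap insurance compared with billions of replication requests.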
Disaster-proofing and recovery are a major rationale for the Amazon data structure. That they failed the test of a very predictable failure scenario makes me wonder whether the technology team had explored this adequately, and whether they understand the beast they created well enough. I've been having discussions with grid, HPC, and storage people for over 7 years on subjects like this, and I've concluded there is a gap in understanding of failure modes in large clusters. I've seen this arise mainly from peer pressure and corporate culture (it's hard to be a naysayer and critic), and it's also hard to critique your own design adequately. That's likely what happened here. I hope we all learn the lesson, or cloud deployment will lose out to FUD.