Re: Have we heard the full story?
From the various things I have seen so far, the issue appears to be related to the scale of the US-EAST-1 region - it's AWS's biggest. My guess is that's because of the size of the East coast business, but I'm surprised more companies haven't moved to US-EAST-2. Reddit commentards have noted that most of the AWS issues in recent years have centred around US-EAST-1.
The problem appears to have been triggered by a network issue and resulted in one of the North Virginia DCs basically going offline. The migration of workloads to the other two DCs within the region then resulted in "high error rates", which I guess means network links or storage bandwidth were overloaded by the migrations. If there were any other issues that compounded the problem (e.g. maintenance or faults/outages on inter-DC connections) then it may be a case of "n+1 or more links isn't sufficient to cope with the potential issues we see".
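To put rough numbers on the n+1 point, here is a back-of-envelope sketch in Python. Everything in it is my own assumption for illustration: three equally sized DCs, load spread evenly over the survivors, and made-up utilisation figures. The function name is hypothetical and none of this comes from AWS.

```python
def utilisation_after_failure(sites: int, utilisation: float, failed: int = 1) -> float:
    """Per-site utilisation once the failed sites' load is spread evenly over the survivors.

    Assumes every site is the same size and was equally loaded beforehand.
    """
    survivors = sites - failed
    if survivors <= 0:
        raise ValueError("no surviving sites to take the load")
    # Total load is sites * utilisation; it now has to fit on the survivors.
    return utilisation * sites / survivors


if __name__ == "__main__":
    # Illustrative pre-failure utilisation figures, not real AWS data.
    for before in (0.5, 0.6, 0.7):
        after = utilisation_after_failure(3, before)
        print(f"{before:.0%} before -> {after:.0%} after, overloaded={after > 1.0}")
```

On those assumptions, losing one DC of three pushes 50% more load onto each survivor before you count the migration traffic itself, so anything running much above two-thirds utilisation beforehand has no headroom left.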
The post-mortem is due on Monday - should be an interesting read.