Re: Credit where it's due
While I can't give examples specific to this incident in the absence of a deep dive, there are whole classes of failures that can and do wipe out decentralized systems without needing a single point of failure. These are in fact the bane of large-scale systems, as they are often quite subtle as well.
Redundancy is only one step, and it has a price: you pay in complexity, and often in reliability, in exchange for availability. Running a datacenter full of these systems already pushes hard on the limits of manageable complexity. Now scale that up to a global cloud.
The scarier class of bugs are the ones that show up when all of that redundant gear starts talking to each other. One of the nastier ones I saw involved a message-passing library passing an error as a message. A bug caused that message to crash the receiving machine, which generated another error message. In a non-HA environment this would propagate around and probably crash a third of the message daemons in the cluster. In an HA environment, as the system tried to restart the failing processes, waves of crashes were constantly being broadcast, until not only was the message-passing code wrecked, but a bunch of the HA traffic flooded as well, triggering a second, similar problem on the management network that took down most of the stuff across the DC until someone started dumping traffic on the network core.
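To make the feedback loop concrete, here's a minimal sketch in Python. The names (Daemon, POISON, ha_restart) and the restart policy are entirely hypothetical; this is not the actual library from the story, just the shape of the failure: one poison error message, receivers that crash on it and emit more of them, and an HA layer that blindly restarts everything with inboxes intact.

    # Hypothetical sketch of a poison-message crash cascade under blind HA
    # restarts. Not the real system from the anecdote above.
    from collections import deque

    NUM_DAEMONS = 9
    POISON = "ERR"  # an error report that, due to a bug, crashes its receiver

    class Daemon:
        def __init__(self, ident):
            self.ident = ident
            self.inbox = deque()
            self.alive = True

        def step(self, cluster):
            """Process one message; crashing broadcasts a fresh error message."""
            if not self.alive or not self.inbox:
                return
            msg = self.inbox.popleft()
            if msg == POISON:
                self.alive = False  # the bug: the error message itself crashes us
                # Crashing emits another error report to every peer,
                # turning one bad message into N more.
                for peer in cluster:
                    if peer is not self:
                        peer.inbox.append(POISON)

    def ha_restart(cluster):
        """HA layer blindly restarts dead daemons, inbox (and poison) intact."""
        for d in cluster:
            if not d.alive:
                d.alive = True

    cluster = [Daemon(i) for i in range(NUM_DAEMONS)]
    cluster[0].inbox.append(POISON)  # a single initial error message

    for tick in range(5):
        for d in cluster:
            d.step(cluster)
        crashed = sum(not d.alive for d in cluster)
        pending = sum(len(d.inbox) for d in cluster)
        print(f"tick {tick}: {crashed} crashed, {pending} poison messages queued")
        ha_restart(cluster)  # each restart wave re-triggers the crash loop

Each restart wave re-drains inboxes that are already full of poison, so the crash count and the queued-message count both climb every tick. Nothing in the loop stops it; something external, like dropping traffic at the core, has to break the cycle.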
As a consequence of avoiding those kaiju-sized problems, my onsite deployment has had better downtime numbers than most of the major cloud services we use, including Google, for 8 of the last 10 years. That said, some of our systems aren't even redundant, and the org can tolerate them being offline long enough for us to grab a spare server chassis, rack it, and restore its services.
Going on a SPOF hunt isn't always the best use of resources. At the scale Fastly operates, there really shouldn't be many left (though, for example, each of their datacenters presumably still has only one roof).