Re: Network failures
Risking to sound (un)pleasantly pedantic, I still have to say that the examples given are not only completely predictable, these are simple textbook examples of bad system design. Taleb does not need to be involved at all.
Configuring 2 NTP servers is a Bad Practice because 2 NTP servers cannot form a quorum to protect against the standard problem of a false ticker. The recommended and optimal minimum is 3, however 1 is still better than 2 because if the 2 differ significantly, it is difficult to impossible to determine which one is the false ticker.
Some badly designed black box systems only allow for a maximum of 2 NTP servers being configured; in this special case the importance of the system might prompt using a cluster of monitored anycast NTP servers for high availability; for less demanding cases using a single DNS record to something from pool.ntp.org will ensure enough availability without false tickers (while adding the Internet access and DNS dependencies).
Having a split-brain failure scenario in a geographically distributed firewall cluster is so common that it is usually specifically tested in any sane DR plan. This, again, is a glaring example of bad network design, implementation or operation. No black swan magic is necessary, just build better systems or hire a professional.
Real-world problems with highly available systems are usually multi-staged and are caused by a chain of unfortunate events, every single one of which would not have had the devastating effects. Simple, non-trivial failure scenarios, however, do exist. Something from personal experience that immediately comes to mind:
- A resilient firewall cluster in a very large company is exposed to external non-malicious network conditions triggering a bug in the firewall code and the primary active firewall reboots as a result. The firewall cluster fails over, the secondary firewall is exposed to the same conditions and the same bug and reboots as well while the primary firewall still boots up. The process repeats resulting in noticeable outage until the unsuspecting external influence is removed.
- A well-maintained but apparently defective dual-PSU device in a large datacentre short circuits without any external cause resulting in 2 feeds tripping and powering off the whole row of racks as well as a few devices not surviving through it.
Cheers to all the IT infrastructure fellas, whichever fancy name you are called now!