Reply to post: Network failures

Excuse me, what just happened? Resilience is tough when your failure is due to a 'sequence of events that was almost impossible to foresee'

Phil O'Sophical Silver badge

Network failures

In hindsight, this was completely predictable

Doesn't require much hindsight: it's a close relative of the "Byzantine Generals problem" described by Leslie Lamport 40 years ago (strictly, two sites joined by an unreliable link is the simpler "Two Generals problem"), and should be well-known to anyone working on highly-available systems. With only two sites, in the event of network failure it's provably impossible for either of them to know what action to take, because neither can tell a dead peer from a dead link. That's why such configurations always need a third site/device, with a voting system based around quorum. That's standard for local HA systems, but harder to do with geographically separate systems for DR, because network failures are more frequent and often not independent.
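For anyone who wants the quorum rule spelled out, here is a minimal sketch in Python. The function names and the "one vote per site/device" model are my own assumptions for illustration, not any particular cluster product's API.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """True if this site can see a strict majority of all votes (its own included)."""
    return reachable_votes > total_votes // 2


def decide_role(reachable_votes: int, total_votes: int) -> str:
    """Hypothetical policy: only a site holding quorum may act as primary."""
    if has_quorum(reachable_votes, total_votes):
        return "may-be-primary"
    return "stand-down"


# Two sites, link cut: each side sees only its own vote, so neither has a majority.
print(decide_role(reachable_votes=1, total_votes=2))

# Two sites plus a third voter: the side that still reaches the third voter
# holds 2 of 3 votes; the isolated side holds 1 of 3 and must stand down.
print(decide_role(reachable_votes=2, total_votes=3))
print(decide_role(reachable_votes=1, total_votes=3))
```

The point of the strict majority is that two disjoint groups can never both hold it, so at most one side of a partition carries on as primary.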

In that case, best Business Continuity practice is not to do an automatic primary-secondary failover, but to have a person (the BC manager) in the loop. That person is alerted to the situation and can gather enough additional info (maybe just phone calls to the site admins, acting as a separate "network link") to decide which site should be Primary. After that, the transition should be automated to reduce the likelihood of error.
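A rough sketch of that "human decides, automation executes" split, again in Python. The site names, the Role enum and plan_failover are all hypothetical; a real setup would hang alerting and the actual role switch off something like this.

```python
from enum import Enum
from typing import Dict, Optional


class Role(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"


def plan_failover(roles: Dict[str, Role],
                  operator_choice: Optional[str]) -> Dict[str, Role]:
    """Return the new role map; nothing changes until a person names the primary."""
    if operator_choice is None:
        # Alert has been raised, but the BC manager has not decided yet.
        return dict(roles)
    if operator_choice not in roles:
        raise ValueError(f"unknown site: {operator_choice}")
    # Decision made by a human; the transition itself is mechanical.
    return {
        site: Role.PRIMARY if site == operator_choice else Role.SECONDARY
        for site in roles
    }


current = {"site_a": Role.PRIMARY, "site_b": Role.SECONDARY}
print(plan_failover(current, None))       # unchanged while waiting for a decision
print(plan_failover(current, "site_b"))   # site_b promoted, site_a demoted
```

The software never picks a primary on its own here; it only applies the choice once someone with out-of-band information has made it.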
