Reply to post • Network failures • The Register Forums

Saturday 12th June 2021 13:55 GMT Phil O'Sophical

Network failures

In hindsight, this was completely predictable

Doesn't require much hindsight: it's an example of the "Byzantine Generals problem" described by Leslie Lamport 40 years ago, and should be well known to anyone working on highly available systems. With only two sites, in the event of a network failure it's provably impossible for either of them to know what action to take, because neither can tell whether the other site has died or only the link between them has. That's why such configurations always need a third site/device, with a voting system based around quorum. That's standard for local HA systems, but harder to do with geographically separate systems for DR, because network failures are more frequent and often not independent.
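To make the quorum point concrete, here is a minimal sketch of a strict-majority check. The function name `has_quorum` and the site labels are made up for illustration and are not taken from any particular HA product.

```python
def has_quorum(reachable_voters: int, total_voters: int) -> bool:
    """Return True if this partition holds a strict majority of all voters.

    reachable_voters counts the local node plus every voter it can still reach.
    """
    if not 0 <= reachable_voters <= total_voters:
        raise ValueError("reachable_voters must be between 0 and total_voters")
    return reachable_voters > total_voters // 2


# Two sites, link down: each site can see only itself (1 of 2 voters).
two_site = [has_quorum(1, 2), has_quorum(1, 2)]

# Two sites plus a third voter (hypothetical witness), A-B link down,
# witness still reachable from A only: A sees 2 of 3, B sees 1 of 3.
three_site = {"A": has_quorum(2, 3), "B": has_quorum(1, 3)}

print(two_site)    # [False, False]
print(three_site)  # {'A': True, 'B': False}
```

With two voters neither side can ever win the vote after a split, so both must stop or both risk running as Primary. With three, at most one side can hold a majority.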

In that case, best Business Continuity practice is not to do an automatic primary-secondary failover, but to have a person (the BC manager) in the loop. That person is alerted to the situation and can gather enough additional info (maybe just phone calls to the site admins, which act as a separate "network link") to take the decision about which site should be Primary. After that, the transition should be automated to reduce the likelihood of error.
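The human-in-the-loop flow can be sketched roughly as follows. This is only an illustration under assumptions: the class `FailoverCoordinator`, its method names, and the site names are hypothetical, and a real system would page someone and run actual promotion scripts rather than flip values in a dict.

```python
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"


class FailoverCoordinator:
    """Hypothetical coordinator: detection alerts a person, it never fails over."""

    def __init__(self, roles):
        self.roles = dict(roles)       # site name -> Role
        self.decision_pending = False  # True while waiting for the BC manager

    def on_link_failure(self):
        # Deliberately no role change here: only raise the alert.
        self.decision_pending = True

    def confirm_primary(self, site):
        # Called once the BC manager has decided which site should be Primary.
        if not self.decision_pending:
            raise RuntimeError("no failover decision is pending")
        if site not in self.roles:
            raise KeyError(site)
        # From here on the transition is automated, to avoid manual slips.
        for name in self.roles:
            self.roles[name] = Role.PRIMARY if name == site else Role.SECONDARY
        self.decision_pending = False


coord = FailoverCoordinator({"london": Role.PRIMARY, "leeds": Role.SECONDARY})
coord.on_link_failure()
roles_before_decision = dict(coord.roles)  # unchanged until a human decides
coord.confirm_primary("leeds")
```

The split is the point: the decision is manual, the execution is not.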

Excuse me, what just happened? Resilience is tough when your failure is due to a 'sequence of events that was almost impossible to foresee'
