Reply to post • Excuse me, what just happened? Resilience is tough when your failure is due to a 'sequence of events that was almost impossible to foresee' • The Register Forums

Monday 14th June 2021 18:18 GMT Abominator

Hardware resilience is almost always the biggest problem.

You have a disk failure in the RAID array, but the disk does not fail cleanly enough for the array to react, so a) the array fails to drop it and b) RAID I/O performance is dragged down two orders of magnitude while the array limps on.

Same for network failover with bonded network interfaces. I have seen hundreds of cases where a) the backup switch was never properly configured and nobody had properly tested a switch failure, or b) bad drivers in the bonded interface failed to switch over the ports.

It's better to be cheap and simple and rely on software failover, provided the software is designed for failover. It's much easier to validate: just start killing processes randomly.
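
For what it's worth, here is a minimal sketch of the "kill processes randomly" check in Python. Everything in it is an assumption for illustration: the workers are stand-in processes that just sleep, and `start_worker`, `supervise`, `kill_random` and `chaos_round` are hypothetical names, not part of any real failover product. It only shows the shape of the test: kill something at random, let the supervisor react, then check that the pool is back to full strength.

```python
import random
import subprocess
import sys

# Hypothetical stand-in for a real service process: it just sleeps.
WORKER_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]


def start_worker():
    """Launch one stand-in worker process."""
    return subprocess.Popen(WORKER_CMD)


def supervise(workers):
    """Replace every worker that has exited; return how many were restarted."""
    restarted = 0
    for i, proc in enumerate(workers):
        if proc.poll() is not None:  # the process has exited
            workers[i] = start_worker()
            restarted += 1
    return restarted


def kill_random(workers, rng):
    """Hard-kill one randomly chosen worker and wait until it is gone."""
    victim = rng.choice(workers)
    victim.kill()
    victim.wait()
    return victim.pid


def chaos_round(n_workers=3, n_kills=5, seed=0):
    """Kill workers at random and return (restarts, workers alive at the end)."""
    rng = random.Random(seed)
    workers = [start_worker() for _ in range(n_workers)]
    try:
        restarts = 0
        for _ in range(n_kills):
            kill_random(workers, rng)
            restarts += supervise(workers)
        alive = sum(p.poll() is None for p in workers)
        return restarts, alive
    finally:
        # Clean up so the test never leaves stray processes behind.
        for p in workers:
            p.kill()
            p.wait()


if __name__ == "__main__":
    restarts, alive = chaos_round()
    print(f"restarts={restarts} alive={alive}")
```

In a real setup the supervisor would be whatever failover mechanism the software already has, and the check at the end would be a request against the service rather than a process count. The point is that the whole loop runs in seconds on a laptop, which is not something you can say about pulling a disk or a switch.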
