Reply to post:

Excuse me, what just happened? Resilience is tough when your failure is due to a 'sequence of events that was almost impossible to foresee'

Abominator

Hardware resiliency is almost always the biggest problem.

You have a disk failure in the RAID array, but the disk does not quite fail properly enough for a) the RAID array to drop it out of the array, so b) the RAID IO performance is dragged down two orders of magnitude but limps on.

Same for network failover with bonded network interfaces. I have seen hundreds of cases where a) the backup switch was never properly configured and nobody had properly tested a switch failure, or b) bad drivers in the bonded interface failed to switch over the ports.

It's better to be cheap and simple and rely on software failover, provided the software is designed for failover. It's much easier to validate. Just start killing processes randomly.
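
To show what I mean by killing processes randomly, here is a rough sketch and nothing more. The worker is a hypothetical stand-in that just sleeps, and `spawn_worker`, `supervise`, `kill_random_worker` and `run_kill_test` are names I made up for illustration. A real test would kill your actual service processes and check that your actual failover logic picks up the slack.

```python
import random
import subprocess
import sys


def spawn_worker():
    # Hypothetical stand-in for a real service process: it only sleeps.
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )


def supervise(workers):
    """Replace every worker that has exited; return the number restarted."""
    restarted = 0
    for i, proc in enumerate(workers):
        if proc.poll() is not None:  # a non-None return code means it exited
            workers[i] = spawn_worker()
            restarted += 1
    return restarted


def kill_random_worker(workers, rng):
    """Hard-kill one randomly chosen worker and wait until it is gone."""
    victim = rng.choice(workers)
    victim.kill()
    victim.wait()
    return victim.pid


def run_kill_test(n_workers=3, rounds=5, seed=0):
    """Kill one random worker per round and check the pool recovers.

    Returns (total restarts, whether all workers were alive after each round).
    """
    rng = random.Random(seed)
    workers = [spawn_worker() for _ in range(n_workers)]
    restarts = 0
    all_alive = True
    try:
        for _ in range(rounds):
            kill_random_worker(workers, rng)
            restarts += supervise(workers)
            all_alive = all_alive and all(p.poll() is None for p in workers)
    finally:
        # Clean up so the test leaves no stray processes behind.
        for proc in workers:
            proc.kill()
            proc.wait()
    return restarts, all_alive


if __name__ == "__main__":
    print(run_kill_test())
```

The point is that the whole failure path fits in a loop you can run as often as you like, which is a lot harder to arrange for a half-dead disk or a misconfigured backup switch.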
