Hardware resiliency is almost always the biggest problem.
You have a disk failure in the RAID array, but the disk does not fail cleanly enough, so a) the RAID never drops it out of the array and b) RAID I/O performance is dragged down by two orders of magnitude but limps on.
Same for network failover with bonded network interfaces. I have hundreds of cases where either a) the backup switch was never properly configured and nobody had properly tested a switch failure, or b) bad drivers for the bonded interface failed to switch over the ports.
It's better to be cheap and simple and rely on software failover, provided the software is designed for failover. It's much easier to validate. Just start killing processes randomly.
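
As a rough illustration of "killing processes randomly", here is a minimal Python sketch of such a drill. Everything in it is an assumption made for the example: the workers are dummy sleeping processes standing in for real services, the names `spawn_workers`, `kill_random` and `run_drill` are hypothetical, and the optional `check` callback is a placeholder for whatever health check your system actually has. It assumes a POSIX system, since it uses SIGKILL.

```python
import os
import random
import signal
import subprocess
import sys


def spawn_workers(n):
    """Start n dummy workers that just sleep (stand-ins for real services)."""
    cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
    return [subprocess.Popen(cmd) for _ in range(n)]


def kill_random(workers, rng):
    """SIGKILL one randomly chosen live worker; return it, or None if none are alive."""
    alive = [w for w in workers if w.poll() is None]
    if not alive:
        return None
    victim = rng.choice(alive)
    os.kill(victim.pid, signal.SIGKILL)  # abrupt death, no clean shutdown
    victim.wait()  # reap the process so poll() reports it as dead
    return victim


def run_drill(n_workers=3, n_kills=2, seed=None, check=None):
    """Kill n_kills random workers; call check() after each kill if one is given.

    Returns (killed_pids, surviving_pids). check is a hypothetical hook that
    should raise if the system as a whole stopped working.
    """
    rng = random.Random(seed)
    workers = spawn_workers(n_workers)
    try:
        killed = []
        for _ in range(n_kills):
            victim = kill_random(workers, rng)
            if victim is None:
                break
            killed.append(victim.pid)
            if check is not None:
                check()
        survivors = [w.pid for w in workers if w.poll() is None]
        return killed, survivors
    finally:
        # Clean up whatever is still running so the drill leaves nothing behind.
        for w in workers:
            if w.poll() is None:
                w.kill()
            w.wait()
```

Calling `run_drill(n_workers=3, n_kills=2)` kills two of the three dummy workers and returns their PIDs along with the PID of the survivor. In a real drill you would point this at your actual service processes and pass a `check` that confirms the service still answers requests.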