Root cause?
I can't see much sensible analysis here or in the main article.
For instance, when I worked for another vendor, we once had a major failure on a storage array that caused a set of web transactions to be lost. The system was essentially a public-facing web application that took details and allocated a unique identifying number.
The array had failed and we were hauled in to explain to the Customer's Senior Management why our equipment was such crap.
Superficially it was our fault. A controller had failed and the whole enterprise storage system went down. It was back up and running within 10 minutes of the failure.
Analysis definitely showed that a controller had failed and then the whole storage system shut down.
BUT
1. 18 hours previously, a controller had started to fail. The appropriate warnings went out to the operator consoles etc. No one acted on them. This was government, so no phone-home capability.
2. The array automatically switched over to the redundant controller and carried on.
3. Unfortunately, this controller also started to fail, notifying the consoles etc. No one acted, AGAIN.
4. This controller then failed over. In the 18 hours since the original fault, the first board could have been replaced, but it hadn't been. A logic error in the code had the now sole surviving controller fail over to the already-failed original controller (there's a rough sketch of that flaw further down). The storage then crashed.
5. The (onsite) engineer had the storage system back up and running 10 minutes later with the spare parts that were already on site!
6. The customer was furious that the system had issued a bunch of receipt numbers to customers but recorded no data for those receipts.
BUT
7. If you are running a database with two-phase commit etc., how do you lose a transaction in a well-designed application?
It turned out the customer was using some of that new-fangled web .NET stuff. They were having trouble getting performance out of the gazillion 1RU rack boxes used for the system, so they cached the transactions in memory and trickle-fed them to database storage. That let users get their special receipt number without waiting for the data to reach the database. In effect, they were issuing receipts for uncommitted data the moment the user hit the submit button.
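To make that concrete, here's a minimal sketch (in Python rather than .NET, with invented names, since I obviously never saw their code) of the difference between what they built and what integrity requires: the receipt number must not exist until the commit does.

    import queue
    import sqlite3
    import uuid

    pending = queue.Queue()   # in-memory buffer -- gone if the box dies

    def submit_flawed(details):
        """What they built: answer immediately, persist later."""
        receipt = str(uuid.uuid4())
        pending.put((receipt, details))   # trickle-fed to the database afterwards
        return receipt                    # receipt issued for uncommitted data

    def submit_correct(details, conn):
        """Integrity version: the receipt only exists once the commit has."""
        receipt = str(uuid.uuid4())
        with conn:                        # sqlite3: commits on success, rolls back on error
            conn.execute("INSERT INTO receipts (id, details) VALUES (?, ?)",
                         (receipt, details))
        return receipt

    # demo
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE receipts (id TEXT PRIMARY KEY, details TEXT)")
    print(submit_correct("some form data", conn))

The flawed version is faster, of course, which is the whole seduction - right up until the box holding the in-memory buffer goes away.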
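And going back to point 4: the failover bug boils down to a missing sanity check. This is a purely hypothetical sketch - real controller firmware is nothing this simple - but the shape of the logic error is the same: hand over to the partner without asking whether the partner is still alive.

    class Controller:
        def __init__(self, name, healthy=True):
            self.name = name
            self.healthy = healthy

    def fail_over(active, standby):
        """Naive failover: always hand control to the partner board."""
        return standby                      # never checks standby.healthy

    def fail_over_checked(active, standby):
        """What it should have done: refuse to hand over to a dead board."""
        if standby.healthy:
            return standby
        raise RuntimeError("no healthy controller left - degrade, don't hand over")

    # The incident timeline:
    a = Controller("A", healthy=False)      # failed 18 hours earlier, never replaced
    b = Controller("B", healthy=False)      # second board now failing too
    print(fail_over(b, a).name)             # happily 'fails over' to dead controller A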
Who is at fault here?
1. Storage Vendor - there was a bug in the failover code - so yes.
2. Operations - they didn't have proper operating processes for the system; good process would have avoided the problem entirely.
3. Solution Architects - for designing a Bozo solution with no integrity; there were plenty of other bits of kit whose failure would have produced the same result.
4. Customer upper management - for not having a clue about either the solution or adequate operating procedures, and for throwing good money after a bad idea.
5. Controller provider - perhaps it was a bad batch? The boards had been running for a long time before this fault.
6. Dinky Field Service contracts? Not likely in this case: the engineer was actually there and had parts available. Ops never notified him of the problem.
7. All of the above
Just saying it's the HDS kit that failed is like blaming Boeing for the crash when the pilot was drunk and the airline had outsourced maintenance to Peru (no offence intended to Peruvian aircraft maintainers or pilots).