About 7 years ago (before HP acquired 3PAR) I had a big outage on one of my 3PAR arrays. It took about a week to recover everything (actual array downtime was about 5 hours), as the bulk of the data was on a NAS platform from a vendor that had gone bust earlier in the year and we had not had time to migrate off of it. In short, the incident report said:
"Root cause has been identified as a single disk drive (PD94) having a very rare read inconsistency issue. As the first node read the invalid data, it caused the node to panic and invoke the powerfail process. During the node down recovery process another node panicked as it encountered the same invalid data causing a multi-node failure scenario that lead to the InServ invoking the powerfail process.
After PD94 was returned, 3PAR’s drive failure analysis team re-read the data in the special area where ‘pd diag’ wrote specific data, and again verified that what was written to the media is what 3PAR expected (was written by 3PAR tool), confirming the failure analysis that the data inconsistency developed during READ operations. In addition, 3PAR extracted the ‘internal HDD’ log from this drive and had Seagate review it for anomalies. Seagate could not find any issues with this drive based on log analysis."
Since then, the Gen4 and Gen5 platforms have added a lot of internal integrity checking (Gen5 extends that to host communications as well). The platform that had the issue above was Gen3, the last of which went totally end of support in November 2016; I have one such system currently on 3rd-party support.
The outage above did not affect the company's end-user transactions, just back-end reporting (which was the bulk of the business), so people weren't getting updated data, but consumer requests were fine since they were isolated.
I was on a support call with 3PAR for about 5 hours that night until the array was declared fully operational again (I gave them plenty of time for diagnostics). It was the best support experience I have ever had, even to this day.
I learned that day that while striping your data across every resource in an array can give great performance and scalability, it also has its downsides when data goes bad.
At another company back in 2004 we had an EMC Clariion CX600 suffer a double controller failure, which resulted in 36 hours of downtime for our Oracle systems. I wasn't in charge of storage back then, so I don't know what the cause of the failure was, though the guy who was in charge of storage later told me he believes it was his fault for misconfiguring something that allowed the 2nd controller to go down after the first had failed. I don't know how that can happen, as I have never configured such a system myself.
3PAR by default will distribute data across shelves so you can lose an entire disk shelf and not have any loss of data availability (unless that shelf takes out enough I/O capacity that it hurts you).
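For anyone curious what that shelf-level distribution looks like, here is a minimal sketch. It is purely illustrative and not 3PAR's actual chunklet placement logic; the function and parameter names are made up. The idea is that each RAID set draws at most one member from any given shelf, so losing a whole shelf removes at most one member from each set, which the set's own redundancy can absorb:

    def place_raid_sets(shelves, drives_per_shelf, set_width):
        """Yield RAID sets as lists of (shelf, drive) pairs, one drive per shelf."""
        if set_width > shelves:
            raise ValueError("set width cannot exceed the number of shelves")
        remaining = [drives_per_shelf] * shelves   # free drives left on each shelf
        next_drive = [0] * shelves                 # next unused drive index per shelf
        while True:
            # Greedily pick the set_width shelves with the most free drives so
            # usage stays balanced and each set gets one member per shelf.
            candidates = sorted(range(shelves), key=lambda s: remaining[s],
                                reverse=True)[:set_width]
            if remaining[candidates[-1]] == 0:
                break  # not enough distinct shelves left to build another full set
            members = [(s, next_drive[s]) for s in candidates]
            for s in candidates:
                next_drive[s] += 1
                remaining[s] -= 1
            yield members

    if __name__ == "__main__":
        for raid_set in place_raid_sets(shelves=4, drives_per_shelf=6, set_width=4):
            print(raid_set)
        # Losing any single shelf removes at most one member from each set,
        # so the redundancy inside the set keeps the data available.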
That was by far the biggest issue I have had on 3PAR arrays as a customer for the past 11 years now, but they handled it well and have done things to address it going forward. I am still a (loyal) customer today; I have had other issues over the years, though nothing remotely resembling that one.
I have realized over the past decade that storage is really complicated, and have come to understand (years ago, of course) why people invest so much in it.
I certainly don't like knowing there are still issues out there, but at the same time, if such issues exist in such a widely deployed and tested platform, it makes me even more wary of considering a system with less deployment or testing (which is naturally what I would expect from smaller-scale vendors).
At that same company we had another outage on our earlier storage system, provided by BlueArc (long before HDS bought them). Fortunately that was a scheduled outage and we took all of our systems offline so they could do the offline upgrade. Where BlueArc failed is that they hit a problem which blocked the upgrade (and could not roll back), and they had no escalation policy at their company. So we sat for about 6 hours while the on-site support guy could not get anybody to help him back at BlueArc. My co-worker who was responsible for it finally got tired of waiting (I think he wasn't aware on-site support couldn't get help) and started raising hell at BlueArc. They fixed the issue. A couple of months later the CEO sent us a letter apologizing and said they had implemented an escalation policy at that time.