Re: 3PAR losing data
Neither of those events was the fault of 3PAR technology. The 2nd was the fault of humans working with the hardware who were managed by HP, so HP was at fault (and owned up to it). The first was also the fault of humans, but not those working with the hardware.
I have suffered 2 largish-scale SAN outages since I have been working with them (though for the first one I wasn't responsible for the storage). The first was EMC; our websites were down for a good 35 hours or so while we recovered (lots of corrupted data). The cause was a double controller failure (a software fault, I believe). As to what caused the double controller failure I am not sure, but the storage admin blamed himself after the fact; apparently a configuration setting "allowed" the 2nd controller to fail. Nobody was working on the system at the time of the failure, it was a Sunday afternoon (I recall the Oracle DBA told me he was driving to lunch and almost got into a car accident when I sent out the alarm showing I/O errors on the Oracle servers). I don't know the specifics. The hardware did not have to be replaced, from what I recall (this was 2004).
The second failure was a 3PAR failure (2010); downtime was about 4-5 hrs. The root cause was a Seagate SATA hard disk (in an array of ~200 disks) that began silently corrupting data: it would acknowledge disk writes but then mess up the data on reads. It took several hours for the situation to become critical, and given that the system distributes data over all disks by default, one disk doing bad things can wreak havoc. We had a few cases of data being corrupted, and then later that night the controller responsible for that disk panicked; the 2nd controller took over, saw the same problem, and panicked too (it was a 4-controller array, but the 3PAR architecture has disks managed by pairs of controllers). That particular array wasn't responsible for front end operations (the front end servers were all self contained, with no external dependencies of any kind), but it did take out back end data processing. It was the best support experience I have ever had (this outage was before HP acquired 3PAR; support has not been as good since). From the incident report (2010):
"After PD94 was returned, 3PAR’s drive failure analysis team re-read the data in the special area where ‘pd diag’ wrote specific data, and again verified that what was written to the media is what 3PAR expected (was written by 3PAR tool) confirming the failure analysis that the data inconsistency developed during READ operations. In addition, 3PAR extracted the ‘internal HDD’ log from this drive and had Seagate review it for anomalies. Seagate could not find any issues with this drive based on log analysis. "
I learned a LOT during that outage, both from the outage itself and from recovering after the fact.
That particular scenario, I believe, was addressed in the 3PAR Gen4 systems (~2011?) when they started having end-to-end checksums on everything internal to the array, and extended even further in Gen5 with checksums all the way from the host to the disk.
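The idea behind end-to-end checksums can be sketched roughly like this (a toy illustration of the general technique, not 3PAR's actual on-disk format or code): a checksum computed when the block is written travels with the block, so a drive that acknowledges writes but corrupts the data on reads gets caught before bad data reaches the host.

```python
import zlib

# Toy sketch of checksum-on-write / verify-on-read. A checksum computed
# at write time is stored alongside the block; corruption introduced
# anywhere on the read path is detected instead of silently returned.
# (Hypothetical illustration only -- not 3PAR's implementation.)

def write_block(store, lba, data):
    # Compute the CRC once, at write time, and keep it with the data.
    store[lba] = (data, zlib.crc32(data))

def read_block(store, lba):
    data, expected_crc = store[lba]
    if zlib.crc32(data) != expected_crc:
        # A drive that ACKs writes but corrupts reads fails loudly here,
        # rather than handing garbage to the application.
        raise IOError("checksum mismatch on LBA %d" % lba)
    return data

store = {}
write_block(store, 0, b"payload")
assert read_block(store, 0) == b"payload"

# Simulate the failure mode from the outage: data silently altered on
# the media/read path while the stored checksum is still the original.
data, crc = store[0]
store[0] = (b"corrupt", crc)
try:
    read_block(store, 0)
except IOError as e:
    print("detected:", e)
```

Extending the checksum "all the way from the host to the disk" just means the host computes it and the disk verifies it (and vice versa), so no layer in between can corrupt data undetected.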
In both outages, neither company had any sort of backup system to take over the load; the array itself was a single point of failure (even though these arrays are generally highly redundant internally). I'd bet 80% of the time companies deploying these do it like this, for budget reasons alone.
I had a controller fail (technically just the hard disk in the controller that holds the OS) mid software upgrade on a 3PAR F200 system (2 controllers only, end of life now for 2 years). The system never went completely down, but write performance really goes down on two-controller arrays that use disk drives when a controller is out. The situation was annoying in that it took HP about 26 hours to resolve the issue, because the replacement controller didn't have the same OS version (and refused to join the cluster) and the on-site tech had problems with his laptop crashing every 30 minutes from the USB serial connector.
But really, all you need to do is look at the change logs for these systems (or any other complex system) and many times you'll find some really scary bugs being fixed.
Having been a customer for 12 years, I know MANY stories, good and bad, about 3PAR stuff over the years, as you may guess. All things considered, I am still a very satisfied customer; most of that (90%) is because of the core technology. I'm less satisfied with the level of support HP gives out these days, but that aspect wasn't unexpected after 3PAR was acquired by a big company.
I have a few 3PAR arrays today, and all of the company's critical data is on them, though I don't have as much time to work with them as I used to (I am the only one in the company who does work with them, though). They just sit back and run and run, like the rest of the infrastructure. The oldest 3PAR is also part of our first infrastructure and has been online since 12/19/2011. I'm hoping to retire it soon and replace it with something current, but I don't see it happening this year.
I have learned to be MUCH more conservative about what I do with storage, though. Obviously, LONG gone are the days when I thought "hey, this disk array has two controllers and does RAID, so it's the same as this other one that has two controllers and does RAID".