Australian prisoner-tracking system brought down by 3PAR defects

Defects in HPE 3PAR storage area networks caused “a series of abnormal outages” that left systems used to track prisoners on release from incarceration in the Australian state of New South Wales (NSW) unavailable. Prisoner tracking devices and software in the state are provided by a British company called Buddi, …

  1. Stoneshop
    Facepalm

    Right

    “The issue was escalated to global executive management at HP with a number of enterprises seeking a resolution to the 3PAR defect.”

    And now to find the remaining techie who knows how to read the 3PAR dumps and has the docs to try and make sense of them.

  2. David Roberts
    WTF?

    Verified as stable?

    Presumably they didn't bother doing this before. Or perhaps this is publicity speak for "hasn't fallen over again yet".

  3. Anonymous Coward
    Anonymous Coward

    yeah, never happened anywhere else in the world, eh HP.

  4. Bob Wheeler
    FAIL

    System outages and faults are infrequent

    Define infrequent?

    In the last 14 years working on SANs, I've only had one outage that impacted live production systems.

    1. Anonymous Crowbar

      Re: System outages and faults are infrequent

      Same as, and about the same time frame.

      And that outage was caused by admin error [me, when I knocked the power cable to the secondary container while replacing a failed one in the rack :O ].

    2. TRT

      Re: System outages and faults are infrequent

      Ah, but does HP support still cover a system if it's been jail broken?

    3. Cynic_999

      Re: System outages and faults are infrequent

      The article states what all the outages were, and ISTM that they were indeed infrequent and pretty short-lived. As also mentioned, the people being monitored would not have known that there was an outage, and so were not in a position to take advantage of it, nor was any outage long enough for people to have become aware and taken advantage. (I can't see that the shortest outage of 12 minutes would have been of any concern even if everyone being monitored were given advance notice).

      It makes no sense to ensure a system is 100% reliable when the application will work perfectly well with 99.9% reliability. That final 0.1% costs huge amounts of money.
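
      (Back-of-envelope, as a rough Python sketch - my arithmetic, not anything from the article: what each availability level actually permits in downtime per year.)

        # Annual downtime permitted at each availability level.
        HOURS_PER_YEAR = 365 * 24  # 8760

        for availability in (0.99, 0.999, 0.9999):
            downtime_minutes = HOURS_PER_YEAR * 60 * (1 - availability)
            print(f"{availability:.2%} availability allows "
                  f"{downtime_minutes:.0f} minutes of downtime a year")

      (99.9% already allows nearly nine hours a year of downtime; the outages in the article don't come close to spending that budget.)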

      1. Anonymous Coward
        Anonymous Coward

        Re: System outages and faults are infrequent

        Bollocks. It's not difficult to architect something that's close to 100% reliable. In the vast majority of situations where I have seen failure (many), the root cause has been spending a lot of money on hardware and software and spending as little as possible on its implementation. People who don't understand the systems they are implementing or managing, and/or are paid too little to care.

    4. Anonymous Coward
      Anonymous Coward

      Re: System outages and faults are infrequent

      >>In the last 14 years working on SANs, I've only had one outage that impacted live production systems.

      On the contrary, at one time I was seeing live system impacts on an almost daily basis.

      I was running a level 3 support team though.

  5. DontFeedTheTrolls
    Headmaster

    Dear El Reg

    I know it's a well-known abbreviation in Australia, but please stop shortening New South Wales. It looks too much like NSFW, which immediately attracts attention from co-workers.

    Regards

    J

    1. David Harper 1

      Well thanks a lot

      I won't be able to see NSFW now without thinking it stands for New South F***ing Wales.

  6. Missing Semicolon Silver badge
    Facepalm

    3PAR losing data

    ... again?

    http://www.theregister.co.uk/2016/10/19/kcls_strand_data_centre_down/

    https://www.theregister.co.uk/2017/06/08/ato_hpe_outage_report/

    1. Anonymous Coward
      Anonymous Coward

      Re: 3PAR losing data

      There's a fundamental difference between a brief interruption of IO service and losing data. This article only talks about the former; the latter, which is what the title of your post alleges, is a much more serious offense within the SAN world.

    2. Nate Amsden

      Re: 3PAR losing data

      Neither of those events was the fault of 3PAR technology. The 2nd was the fault of humans working with the hardware, who were managed by HP, so HP was at fault (and owned up to it). The first was also the fault of humans, but not those working with the hardware.

      I have suffered two largish-scale SAN outages since I have been working with them (though for the first one I wasn't responsible for the storage). The first was EMC; our websites were down for a good 35 hours or so while we recovered (lots of corrupted data). The cause was a double controller failure (a software fault, I believe). As to what caused the double controller failure I am not sure, but the storage admin blamed himself after the fact: apparently a configuration setting "allowed" the 2nd controller to fail. Nobody was working on the system at the time of the failure; it was a Sunday afternoon (I recall the Oracle DBA told me he was driving to lunch and almost got into a car accident when I sent the alarm showing I/O errors on the Oracle servers). I don't know the specifics. The hardware did not have to be replaced, from what I recall (this was 2004).

      The second failure was a 3PAR failure (2010); downtime was about 4-5 hrs. The root cause was a Seagate SATA hard disk (in an array of ~200 disks) that began silently corrupting data: it would acknowledge disk writes but then mess up the data on the reads. It took several hours for the situation to become critical; given the nature of the system, which distributes data over all disks by default, one disk doing bad things can wreak havoc. We had a few cases of data being corrupted, and then later that night the controller responsible for that disk panic'd, and then the 2nd controller took over, saw the same problem and panic'd too (it was a 4-controller array, but the 3PAR architecture has disks being managed by pairs of controllers). That particular array wasn't responsible for front-end operations (front-end servers were all self-contained, with no external dependencies of any kind), but it did take out back-end data processing. It was the best support experience I have ever had (this outage was before HP acquired 3PAR; support has not been as good since). From the incident report (2010):

      "After PD94 was returned, 3PAR’s drive failure analysis team re-read the data in the special area where ‘pd diag’ wrote specific data, and again verified that what was written to the media is what 3PAR expected (was written by 3PAR tool) confirming the failure analysis that the data inconsistency developed during READ operations. In addition, 3PAR extracted the ‘internal HDD’ log from this drive and had Seagate review it for anomalies. Seagate could not find any issues with this drive based on log analysis. "

      I learned a LOT from that outage, both from the outage itself and from recovering after the fact.

      That particular scenario, I believe, was addressed in the 3PAR Gen4 systems (~2011?) when they started having end-to-end checksums on everything internal to the array, and extended it even further in Gen5, with checksums all the way from the host to the disk.
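
      (To illustrate the mechanism rather than 3PAR's actual implementation, here's a toy Python sketch of a read-path checksum, the kind of guard that catches a drive which acknowledges writes honestly but corrupts reads.)

        import zlib

        class ChecksummedStore:
            # Toy block store: a CRC is recorded at write time and
            # re-verified on every read, so a drive that acks writes
            # honestly but corrupts reads is caught, not trusted.
            def __init__(self):
                self.blocks = {}  # block_id -> (data, crc)

            def write(self, block_id, data):
                self.blocks[block_id] = (data, zlib.crc32(data))

            def read(self, block_id):
                data, expected = self.blocks[block_id]
                if zlib.crc32(data) != expected:
                    # A real array would fail the drive and rebuild from
                    # parity rather than hand the bad block to the host.
                    raise IOError(f"silent corruption on block {block_id}")
                return data

        store = ChecksummedStore()
        store.write(94, b"payload")            # write is acknowledged fine
        _, crc = store.blocks[94]
        store.blocks[94] = (b"corrupt", crc)   # ...but the read path lies
        try:
            store.read(94)
        except IOError as err:
            print(err)                         # detected, not propagated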

      In both outages, neither company had any sort of backup system to take over the load; the array itself was a single point of failure (even though these arrays are generally highly redundant internally). I'd bet 80% of the time companies deploying them do it like this just for budget reasons alone.
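
      (Removing that single point of failure is conceptually simple, even if the second array isn't cheap - a hedged sketch with invented names, nothing vendor-specific; either half could be e.g. a ChecksummedStore from the sketch above.)

        class MirroredVolume:
            # Toy SPOF remover: every write goes synchronously to two
            # independent arrays, and reads fail over to the survivor.
            def __init__(self, primary, secondary):
                self.primary = primary
                self.secondary = secondary

            def write(self, block_id, data):
                self.primary.write(block_id, data)
                self.secondary.write(block_id, data)

            def read(self, block_id):
                try:
                    return self.primary.read(block_id)
                except IOError:
                    return self.secondary.read(block_id)

      (The budget objection is real, of course: that is twice the storage bill for capacity you hope never to need.)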

      I had a controller fail (technically just the hard disk in the controller that has the OS on it) mid-software-upgrade on a 3PAR F200 system (2 controllers only, end of life now for 2 years). The system never went completely down, but write performance really goes down on two-controller arrays that use disk drives when a controller is out. The situation was annoying in that it took HP about 26 hours to resolve the issue, because the replacement controller didn't have the same OS version (and refused to join the cluster) and the on-site tech had problems with his laptop crashing every 30 minutes from the USB serial connector.

      But really all you need to do is look at the change logs for these systems (or any other complex system) and many times you'll find some really scary bugs being fixed.

      Having been a customer for 12 years, you may guess that I know MANY stories, good and bad, about 3PAR stuff over the years. All things considered, I am still a very satisfied customer, and most of that (90%) is because of the core technology. I'm less satisfied with the level of support HP gives out these days, but the support aspect wasn't unexpected after the acquisition by a big company.

      I have a few 3PAR arrays today, and all of the company's critical data is on them, though I don't have as much time to work with them as I used to (I am the only one in the company who works with them, though). They just sit back and run and run, like the rest of the infrastructure. The oldest 3PAR is also part of our first infrastructure and has been online since 12/19/2011. I'm hoping to retire it soon and replace it with something current, but I don't see it happening this year.

      Though I have learned to be MUCH more conservative about what I do with storage; obviously LONG gone are the days when I thought "hey, this disk array has two controllers and does RAID, it's the same as this other one that has two controllers and does RAID".

      1. Anonymous Coward
        Anonymous Coward

        Re: 3PAR losing data

        The old 3PARs are good. It's the new models that have problems.

        1. TRT

          Re: 3PAR losing data

          As I understand it, the KCL incident was apparently caused when one of the 4 controller modules was hot-swapped by an HP engineer for a replacement module that was flashed with a different version of the firmware. The consequences of running the system with conflicting versions of controller firmware were known before the engineer acted, so whose fault was that one, eh? The incident was compounded by a lack of DR testing by KCL staff, and a panicked decision to erase the array and restore, but the initial outage was pure HPE.
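
          (The missing guard is almost embarrassingly simple to state in code - a hypothetical sketch, with invented names, not anything HPE ships: compare the replacement module's firmware with what the cluster is running before it is allowed to join.)

            def check_firmware_before_swap(cluster_versions, replacement_version):
                # Refuse the hot-swap unless the replacement module runs
                # exactly the firmware the surviving controllers run.
                running = set(cluster_versions)
                if len(running) != 1:
                    raise RuntimeError(f"cluster already mixed: {running}")
                (expected,) = running
                if replacement_version != expected:
                    raise RuntimeError(
                        f"replacement runs {replacement_version}, cluster "
                        f"runs {expected}: reflash the module first")

            # check_firmware_before_swap(["3.2.1"] * 3, "3.1.2") refuses
            # the swap before a mismatched module can join the cluster.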

  7. Yet Another Anonymous coward Silver badge

    Cruel and unusual punishment

    Make the prisoners work at HPE

  8. Androgynous Cow Herd

    HPE "Enterprise class"

    HPE also sells Nimble - better availability numbers, telemetry-enabled proactive support via InfoSight - but HPE keeps thinking that Nimble is SMB and 3PAR is enterprise.

    With Nimble, even if the first event happened, data would be available to identify the root cause and prevent outages #2, #3, and #4.

    Despite what marketing tells you, 3PAR will NEVER be integrated into InfoSight to the level Nimble is, because Nimble was built from the beginning with InfoSight in mind. 3PAR wasn't, and inserting the sensors into the code base now can never be completed, especially since Rod is no longer there to oversee the effort.

    1. J. Cook Silver badge

      Re: HPE "Enterprise class"

      We did a very risky thing back in 2012 here at [RedactedCo] and put Nimble CS220G arrays in at the two production sites that make [RedactedCo]'s money. For production. Using a (poorly designed) transactional system that has a tendency to use brute-force-and-ignorance style SQL queries.

      Both arrays have been absolute champs - one of them has had... two? three? media failures (2 HDD, one SSD), and the other had one of its controllers not come back after an upgrade. None of these failures even slowed them down.

      My only worry with the HPE acquisition is that our support experience will go downhill.

  9. yogidude

    Just sayin'

    https://3parug.com/viewtopic.php?f=25&t=2515

    And

    https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00022399en_us

    Apparently if you had a few of these drives in your StoreServ, once the error rate started to increase (read: inevitable pre-failure), array performance was sooo bad it was effectively an outage. Everyone had to leave the room while the regional manager from HPE discussed options with your account exec. Definitely a situation HPE wanted to 'manage' carefully. Give me an XP any day.
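
    (The underlying idea is easy enough to sketch - the counter names and thresholds below are invented for illustration, not HPE's: watch each drive's media-error rate over a sliding window and flag the slow diers before they drag the array down.)

      from collections import deque

      class DriveErrorWatch:
          # Toy pre-failure monitor: keep a sliding window of hourly
          # media-error counts per drive and flag any drive whose total
          # crosses a threshold, well before performance collapses.
          def __init__(self, window_hours=24, threshold=5):
              self.threshold = threshold
              self.window_hours = window_hours
              self.samples = {}  # drive_id -> deque of hourly counts

          def record(self, drive_id, errors_this_hour):
              dq = self.samples.setdefault(
                  drive_id, deque(maxlen=self.window_hours))
              dq.append(errors_this_hour)

          def drives_to_replace(self):
              return [d for d, dq in self.samples.items()
                      if sum(dq) >= self.threshold]

    (Feed it from whatever per-drive counters the array exposes; the point is acting on the trend rather than waiting for the drive to die outright.)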

    1. Anonymous Coward
      Anonymous Coward

      Re: Just sayin'

      So are we to assume only HPE was supplied drives from this manufacturer? Given there are only two manufacturers remaining for spinning disk, in all likelihood these same drive types were supplied to pretty much every array vendor, and as such all will be affected to some extent.
