RMA the controllers?
Not much detail here. The first thing that comes to mind is: with so many errors, why didn't the ATO RMA the controllers when the errors started popping up?
The HPE 3PAR SANs that twice failed at the Australian Taxation Office had warned of outages for months, but HPE decided the arrays were in no danger of major failure. Combined with decisions to place the SANs' recovery software on the SANs themselves, and HPE's configuration of the SANs for speed, not resilience, the failures …
Because the WHOLE array job, inclusive of management, was OUTSOURCED to HPE, and the ATO decided it was a good cost-saving measure NOT to have any staff directly involved in managing its critical infrastructure.
This is the usual story when you outsource BOTH the job and the control over the quality of the job to the same company. The results are always the same, but for some reason this does not stop imbeciles with MBA degrees from doing it(*).
(*) Not surprising - if the job + QA are in the same place, a cost saving can always be claimed by cutting both the job and the QA. That cost saving is not available if the QA is external to the job.
"Because the WHOLE array job, inclusive of management, was OUTSOURCED to HPE, and the ATO decided it was a good cost-saving measure NOT to have any staff directly involved in managing its critical infrastructure."
To be fair, the OEM should have staff more capable of looking after their kit than the client will. In this case it appears HPE does not fall into this bucket. It also appears they made some pretty clueless choices.
As the article stated, the controllers were fine. The cables/connectors were more the issue (well, really the issue was how it was physically set up). HP obviously did a poor job; the people who installed the system were not knowledgeable enough. It's not THAT complicated.
When my 7450 was installed, a third party (the distributor) did the actual installation, even though I bought installation services. Anyway, that guy cabled the system wrong. It is just a total of 8U with 4 enclosures, but he did it wrong. The system was working, but it was me who noticed at the CLI that one of the enclosures was not right and one of the SAS ports on the controller was unused. So I had them come back and fix it (the fix had zero impact on operations, though the array was not in use yet anyway). I believe even before the fix full array availability was there; that shelf was just configured in a loop with another shelf instead of being directly attached to the controller (which gives higher performance).
A customer unwilling to spend money on a proper backup system doesn't help either (the companies I've worked for haven't done so either).
At the end of the day this problem had absolutely nothing to do with 3PAR tech and was entirely implementation-specific. Still HP's fault of course, since they set it up. I heard that the specific person responsible for the cabling stuff is not with HP anymore.
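For what it's worth, spotting that sort of mis-cabling doesn't need anything exotic once you've written down what each enclosure's ports are actually connected to. A rough sketch of the sanity check, purely illustrative (the enclosure names and the connection list here are invented, not taken from any real array):

    # Illustrative only: flag drive enclosures that reach a controller via
    # another enclosure (daisy-chain/loop) instead of being attached directly.
    connections = {
        # enclosure : what its SAS ports are cabled to (hypothetical data)
        "cage0": ["node0:sas0", "node1:sas0"],   # direct to both controllers - good
        "cage1": ["node0:sas1", "node1:sas1"],   # direct to both controllers - good
        "cage2": ["cage1", "node1:sas2"],        # one path goes via cage1 - loop config
        "cage3": ["cage2", "cage2"],             # no direct controller path at all
    }

    def is_direct(target: str) -> bool:
        """A connection is 'direct' if it lands on a controller node port."""
        return target.startswith("node")

    for cage, targets in connections.items():
        direct = sum(is_direct(t) for t in targets)
        if direct == len(targets):
            status = "OK (directly attached on all paths)"
        elif direct > 0:
            status = "WARNING: partly daisy-chained - works, but slower"
        else:
            status = "ERROR: no direct controller path"
        print(f"{cage}: {status}")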
In addition to HA, the real lesson is that you need to exercise internal control over an outsourced job or outsource the control to a DIFFERENT supplier.
The idiot who procured the current solution did not do that. The idiot who will procure the next cloudy solution will not do that either - it will still be a block outsource tender for the "whole thing" and will result in the same clusterf*** at some point in the future.
More often I've seen a SAN vendor (not HP) dismiss alerts as unimportant. As a customer you then have to keep insisting on parts being replaced, and eventually they'll do it.
I can imagine that if they didn't pressure them, they effectively ended up with a SAN with one failed/unreliable component and thus no redundancy. If one more thing then happens, the entire SAN may go down.
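A back-of-the-envelope way to see why that matters, with made-up availability numbers rather than anything vendor-specific:

    # Toy numbers: assume each component is independently up 99.5% of the time.
    a = 0.995

    pair_ok = 1 - (1 - a) ** 2      # redundant pair: both must fail to lose service
    single_ok = a                   # one component already dead: no redundancy left

    print(f"healthy redundant pair : {pair_ok:.6f}")    # ~0.999975
    print(f"running on one survivor: {single_ok:.6f}")  # back down to 0.995

The moment one half of a redundant pair is quietly broken, you're paying for redundancy you no longer have.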
As for the cloud, statistically it's only a matter of time before a major outage happens. The number of storage-related failure notifications and outages I have seen is alarming (kudos to them for full disclosure though).
In my opinion you're best off running your own datacenter and keeping everyone sharp, even though it's obviously no fun pointing out everyone's mistakes and sometimes cancelling contracts when vendors don't improve.
"I can imagine that if they didn't pressure them"
ATO doesn't "micro-manage" what HPE does. For example, ATO tells the panel that it wants a high-speed storage system. HPE goes and investigates, implements and installs the system and bills ATO. What the SANs do is up to HPE to maintain, as per contract.
"In my opinion you're best off running your own datacenter and keeping everyone sharp,"
The ATO is in the "business" of collecting taxes. It's not really "strong" in IT. Its entire IT system is outsourced to a whole lot of mobs, and it's like a dog's breakfast after a cyclone. It is a total mess and I'm surprised it still works.
"hey what about doing proper DR testing"
When I was there, the ATO's "DR testing" was done on paper. Everybody sat behind a big table, shuffling papers written with "what ifs" to see "what breaks". Nobody from the ATO or EDS had the guts to conduct a full-blown DR exercise that involved powering down any equipment.
"What the SANs do is up to HPE to maintain, as per contract."
And when it all goes wrong you have someone to point the finger at. But also, when it all goes wrong that's all you can do. Your staff can just sit there twiddling their thumbs hoping that somebody, somewhere is fixing it and all the while your organisation grinds to a halt.
You don't have your own staff dealing with it as their one and only top-priority task, as opposed to it being just another job, albeit a top-priority one, for an outsourced supplier. There's a difference.
I don't know if HPE was sole source here, but I'll bet it was a competitive bid situation, and they didn't include stuff the RFP didn't call for. They probably had some people say "hey what about doing proper DR testing" etc. but the bid manager doesn't want to include that because if you come up with a high bid that's rejected they get no bonus. Better to have a smaller bonus selling a cheaper solution that's bought, even if it isn't the right solution. The team selling the solution gets paid when the deal is done, and doesn't suffer the consequences if it all goes tits up a year or two down the road.
It is rather like the perverse incentives for realtors. I see a lot of houses that will go on the market and be sold within a week around my neighborhood. To me that says they were underpriced, but that's in the realtor's interest - and the realtor is the "expert" in the market who will recommend what price you should list at. They'd rather put a house on the market and have it sell quickly so they can get on with listing/selling other houses, than have a house on the market for three months that gets shown to 20 prospective buyers before someone is willing to buy at a 5% higher price. The realtor is happy to give up the 5% higher commission in exchange for a quick transaction that lets them move on to selling more underpriced houses per year, but you as the homeowner don't want to give up that 5% which represents tens of thousands of dollars...
Agreed, it's not an IF they will fail, it's a when. And if more people understood how miraculous it is that they even work to begin with, and the thousands of possible failure points, there would be a LOT more emphasis on BC/DR and a lot fewer of these events. I eventually got to where I couldn't sleep at night anymore thinking about all the possible ways things could go south, and the older they got, the worse the stress, because you know the lifespan of a LOT of the parts is 5 years, and they're installed in large batches. If you have 100 drives installed, and they were from the same lot, and 3 fail at month 50, guess what is going to happen next month? And yes, a lot of shops hold on to that hardware longer than the recommended 3-5 years.
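To put rough numbers on the "same lot" worry (the figures below are invented for illustration, not vendor reliability data): when failures start clustering late in life, the handful you've just seen is a preview, not a fluke.

    # Toy projection: a batch of 100 drives from one lot has shown
    # 0, 0, 1, 3 failures in months 47-50; crudely extrapolate the trend.
    failures_by_month = {47: 0, 48: 0, 49: 1, 50: 3}

    recent = [failures_by_month[m] for m in sorted(failures_by_month)][-2:]
    trend = recent[-1] - recent[-2]              # crude month-on-month growth
    projected_next_month = recent[-1] + trend

    survivors = 100 - sum(failures_by_month.values())
    print(f"drives still running: {survivors}")
    print(f"naive projection for month 51: ~{projected_next_month} failures")
    # The real point: correlated wear-out means RAID rebuilds start overlapping,
    # which is exactly when the next failure hurts the most.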
This isn't even the *first* time I've heard of people regretting storing their critical SAN recovery stuff on the SAN, FFS :-(
We need some kind of yearly idiot test, to go alongside the Workstation Assessment and Health and Safety crap people have to do..
Maybe I should put a sudoku on the corporate login...
Conned.
You certainly were.
Those array diagnostics only help if you a) enable them, b) monitor them, c) trend them to see if stuff is getting worse, and d) actually do something constructive if it is.
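A minimal sketch of what c) looks like in practice, assuming you've already been dumping the error counters somewhere regularly (the component names and numbers here are invented for illustration, not 3PAR-specific output):

    # Illustrative trending: flag any component whose error rate keeps climbing.
    # Each list is a weekly snapshot of a cumulative error counter (made-up data).
    history = {
        "port 0:2:1": [0, 0, 1, 1, 1, 2],
        "port 1:2:1": [3, 9, 21, 40, 77, 130],   # this one is getting worse fast
        "cage2 link": [0, 0, 0, 0, 0, 0],
    }

    for name, counts in history.items():
        deltas = [b - a for a, b in zip(counts, counts[1:])]
        worsening = all(d2 >= d1 for d1, d2 in zip(deltas, deltas[1:])) and deltas[-1] > 0
        if worsening:
            print(f"{name}: error rate is accelerating ({deltas}) - raise a case NOW")
        elif sum(deltas):
            print(f"{name}: some errors ({deltas}) - keep watching")
        else:
            print(f"{name}: clean")

None of which helps, of course, if nobody acts on the output - which is point d).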
As for the "cloud" being safer. Let's ask Imgur, Medium, the Docker Registry HubRunkeeper, Trello, and Yahoo webmail, all users of Amazon's East 1 data center, how well that worked for them
I wonder if the sales team that sold the solution is still there?
Usually I see sales teams over-commit on capability to close the order. On the other hand, we have evaluation committees that blindly follow the recommendation. You'd be surprised how many of the people running the datacenter, and the so-called "IT" folks, are very poor on knowledge.
•Did not include available, automated technical resilience and data/system recovery features (such as 3PAR Recovery Manager and Peer Persistence)
These are licensable options; if they didn't want to pay for them, they weren't going to get them.
•Did not define or test “recovery procedures for applications in the event of a complete SAN outage”
It's not for the Vendor to define these, that's down to the Business to sort out.
•Did not define or verify “Processes for data reconciliation in the event of an outage of this nature”
Again, not for the Vendor to define.
To be frank, the whole "Report" smacks of an ATO arse-covering exercise. There's only slightly less fiction coming out of the Trump Administration.
(10 Year 3PAR Customer)
I think the root of the problem is clearly defined: "Full automated fail‑over for the entire suite of applications and services in the event of a complete Sydney array failure had not been considered to be cost‑effective."
They didn't think it cost-effective to properly design for automated failover. In other words, this government project wasn't worth the effort. So, now that it has indeed failed, one of two things should happen: either the ATO has the balls to publicly state that a few days of downtime of its services is no big deal and resumes as before, or it wimps out, decides that it should actually implement full automated fail-over and, as a consequence, fires the top management moron who decided and signed off on the document stating it wasn't cost-effective in the first place.
My guess ? The ATO meekly implements proper failover procedures and nobody gets fired.
"But the ATO was not fully operational again for eight days because recovery tools were stored on the same SAN that had just failed so spectacularly. That decision also made failover to a spare SAN on another site impossible."
Why is that? If they replicated the data to a second array it shouldn't have been an issue to fail over; after all, the failures wouldn't propagate, and a manual failover should have been preferable to an eight-day recovery. It sounds like they only replicated a subset of the applications to the second site, which obviously wasn't enough on its own to bring up a functional system. Instead they relied on local recovery for some data, using tooling that unfortunately ended up sitting on the same array that had failed.
TBH it sounds like the usual combination of design trade-offs made to meet a price.
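That replication gap is the sort of thing you can audit long before the bad day, incidentally. A sketch of the idea, with invented volume names (and assuming you can pull a volume list from each site into plain text):

    # Illustrative: every production volume should have a replica at the DR site.
    primary_volumes = {"tax-db-01", "tax-db-02", "portal-web", "recovery-tools", "batch-etl"}
    replicated_to_dr = {"tax-db-01", "portal-web"}          # made-up subset

    unprotected = sorted(primary_volumes - replicated_to_dr)
    if unprotected:
        print("NOT replicated - these disappear with the primary array:")
        for vol in unprotected:
            print(f"  {vol}")
    else:
        print("every volume has a DR copy")

If "recovery-tools" ever shows up on the unprotected list, you've found the ATO's problem before it finds you.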
The problem was, as stated in the article, that it was a complete service contract with HPE. It was up to HPE to specify what performance, redundancy, and disaster recovery were required for the customer's needs. So, err, no, it was not (in this specific case) for the business to sort out.
Earlier comments re sales monkeys getting their cut and never being seen again are pertinent here.
I am sure 3PAR is great. But HPE screwed up. And of course the ATO, as was also noted in an earlier comment, should have run the QA separately from the admin so that shortcuts were not taken.
Of course, your mileage may vary
"It's not for the Vendor to define these, that's down to the Business to sort out."
No, when you outsource the complete management of your storage, it's usually entirely for the vendor in question to specify what hardware/software is required. You would be signing up to an SLA with, say, minimum IOPS performance per TB...
"We had two 3pars and peer persistence, in different rooms. Both hung at the same time"
So what did HP say?
"hundreds of vms went read-only"
Ah, so it didn't hang then. Sounds like someone ran out of disk space on a thin-provisioned system. That's a well-known problem with 3PAR - the fix is to insert someone competent between the chair and the keyboard...
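For the uninitiated, the failure mode is simple arithmetic: thin provisioning lets you promise the hosts far more space than physically exists, and the moment the physical pool fills, every volume in it stops accepting writes. A toy illustration (all numbers invented):

    # Toy thin-provisioning check: promised capacity vs what's physically there.
    physical_pool_tb = 100.0
    exported_volumes_tb = [40, 35, 30, 25, 20]   # what the hosts think they own
    written_tb = 92.0                            # actually consumed so far

    overcommit = sum(exported_volumes_tb) / physical_pool_tb
    used_pct = 100 * written_tb / physical_pool_tb

    print(f"overcommit ratio : {overcommit:.1f}x")   # 1.5x promised vs physical
    print(f"pool utilisation : {used_pct:.0f}%")
    if used_pct > 85:
        print("WARNING: add capacity or reclaim space before the pool fills")
        print("         (when it does, volumes go read-only - not 'hung')")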
Sure it's a waste of space, and space ain't free.
Indeed, but this is easy to justify to those of the bean-counterish persuasion. Simply show the cost in terms of productivity and lost revenue for common issues arising from under-resourced systems of this type, and I'm sure the purse strings will open.
Late to the party, so I never took mine off.
That's called not having a quorum, and it's standard for most clusters to prevent data corruption. There are plenty of ways you can force that situation through poor configuration, misunderstanding the failure modes and other infrastructure failures, so blaming the array(s) for doing their job and protecting your data integrity is unproductive without digging into the detail properly.
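For anyone not living with clusters daily: the quorum rule itself is trivial, the subtlety is in how you lose it. A minimal sketch of the majority test, nothing vendor-specific:

    # Minimal quorum check: a partition may only keep serving writes if it can
    # see a strict majority of the voting members; otherwise it freezes itself
    # (pauses / goes read-only) rather than risk split-brain data corruption.
    def has_quorum(visible_votes: int, total_votes: int) -> bool:
        return visible_votes > total_votes // 2

    # Two arrays plus a witness = 3 votes. Lose the inter-site link AND the
    # witness, and each array sees only itself: neither side has quorum.
    print(has_quorum(visible_votes=2, total_votes=3))  # True  - keeps running
    print(has_quorum(visible_votes=1, total_votes=3))  # False - hangs / read-only

So "both hung at the same time" can simply mean both sides correctly refused to carry on alone.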
In the case of the ATO it appears they mandated very little in terms of availability and business continuity requirements, and chose performance and cost over resilience. The operations side seems to have repeatedly ignored errors, and the system wasn't even configured to dial home, despite this being a standard feature of the array. There obviously wasn't a proper DR solution in place, otherwise they would have failed over rather than spend days recovering data from backups, and even that method of recovery was hampered by its reliance on data housed on the failed array.
Pretty much a perfect storm for any solution..
"The Reg imagines readers will be keen to know which company's kit gets corrupted firmware when SANs crash"
They're not necessarily talking about corrupted firmware. I don't know any specific details about the FUBAR implementation at the ATO, but many years ago when I still did hardware I worked on DEC kit containing SWXCR RAID controllers, which were part of the DEC StorageWorks tech taken over by Compaq and then HP.
I remember once having to flash a customer's drives' firmware due to firmware issues (probably DEC branded Seagate wide SCSI back then), because if there was a mismatch between what a drive was doing and what the controller thought it was doing, under certain circumstances the controller could mark a perfectly serviceable disk as bad and drop it from the array, and however much swearing and jumping up and down you did, it would refuse to mark it good again and bring it online and back into the array (unless you reinitialised the disk, wiping the data.)
In a failure scenario, where you've got other real hardware errors, this is disastrous as you can lose the whole array (and kiss your data goodbye.)
At this point you find you need a clean pair of trousers, and discover just how good your customer's DR strategy is...
Au contraire monsieur.
Sir Loin of Beef had a 3Par 7250 model QZF rev 1 with seven UFS400 rev B drive trays cabled in configuration A.
The ATO had a 3Par 7250 model QZF rev 2 with nine UFS400 rev C drive trays cabled in configuration B.
These are totally different and in no way related, so the failures are completely unrelated and unique, have never happened before and HPE are totally telling the truth.
@DavidRa
Not to mention that here in the colonies, the array was effectively mounted upside down, where Sir Loin's array was mounted in the correct orientation. I'm sure they'll also quote a number of other descriptive quirks that make the great land of Oz not fit the statistical criteria of other "similar yet unique and unrelated" failures..
Cable issues happen quite often... unfortunately.
But the fact is, cost cutting is the root problem. Not having proper backups (or putting all the money into performance and trashing a proper backup strategy), poor people, poor technology, poor process. A killer every time. Perhaps cloud will fix the issue, perhaps not.
True story - an org I worked with had a mission-critical customer go down for 2 weeks. They declined install services. Upon root cause analysis, the installer had tried to use a 10GbE cable in place of Fibre Channel. That was it. Seriously? Wow. I mean, you don't need any technical expertise here. You may as well follow Lego instructions.
Going to the cloud could be an answer. Paying your people better, or getting higher-quality talent, could be another. Perhaps listen to the vendor when they strongly urge you to pay for and utilize services (being a good vendor also means knowing your customer's people intimately and understanding when their skills are poor... so darn well listen to them when they say pony up for services, or else!!).
The article won't tell the full story, but the more this keeps happening, the more orgs will move to Amazon to remove variables from the equation.
EDS won the initial ATO outsourcing contract in 1996/7 and was bought by HP, a hardware vendor, in 2008; the services business ended up in HPE, and this year's merger with CSC created yet another entity, DXC.
Now we have a hardware vendor acting as an I.T. services provider, which creates confusion over their preferences for new hardware.
1. I've never seen one byte of data stored on a SAN; they are NETWORKS only. So why do the ATO / HPE claim their "SAN" stores data and has drives?
Either the HPE & ATO don't know the difference between a Network and Storage Array or Device (!), or they are deliberately mis-using terms for a reason. This is not a rookie mistake.
2. Sure, cables fail, but all your cables don't suddenly need replacing after 6-12 months, nor do they suddenly become "stressed" if properly installed and maintained.
From the scale of the task, we can infer large numbers of cables, perhaps all or most, were replaced.
Why is the ATO (or HPE/DXC) insisting on replacement of the 3PAR hardware and on a full forensic tear-down and investigation, including of all the "fibre optic cables", if there has been no unusual event or physical damage?
That action wouldn't be justified on technical grounds alone, but would result from a serious legal dispute between client and vendor over liability. EDS was known for its...
If the root cause suspected is physical damage, whoever ordered the action that caused the damage will be responsible.
Yet nothing explaining this unprecedented forensic examination is contained in the ATO report.
3. If the ATO is talking "Cloud", it's quite bizarre.
The 3PAR 20850 device named is a low-latency, high-performance All-Flash Storage Array that must be locally connected to hosts to be usable.
The ATO already extensively uses VMware to manage its workload and move executing instances between hosts, even Datacentres.
They already run a "Cloud", so is this code for outsourcing these operations to another supplier - which would mean breaking the existing contract with years to run?
I've not seen the EDS / HPE / DXC contracts, but I'd expect breaking them would cost the ATO dearly in time and money.
The ATO refers to the 3PAR Storage Array as a "20850 SAN" - if you look up the device, it's an All-Flash Array, not a Storage Area Network.
This mis-use of the term "SAN" is consistent - they report they replaced "an EMC SAN" with the 3PAR.
EMC have always just made Storage Arrays, never switches and Fibre Channel network gear.
Mentioned in passing, as a label in "Figure 1", is another device, "XP7". There is an HP storage array with that product designator that provides petabytes of capacity using HDDs, not flash.
How does this relate to the system, given it gets mentioned just once? Why is the XP7 included at all in the diagram if not part of the functional system?
What devices & cables are the SAN (Network) and what are the Storage Devices?
Why do the ATO & HPE conflate these specific terms? "Ignorance" is as damning as "Obscuring".
What about HBA's, switches, routers & "directors" that are in the network? We hear nothing of them.
Are they using Fibre Channel (16Gbps?) or FCoE with 10Gbps ethernet interfaces (or faster)?
If it's FCoE, are they using HP or CISCO as their fabric? Or someone else entirely?
Between which devices were the fibre problems reported by SNMP?
Cables aren't active devices; they don't log errors themselves. Only the devices that attach to them can detect & report errors.
We hear "cable", but is that a permanent cable between patch panels or patch lead(s)? (panel to panel, panel to Array, panel to Host)
So what devices were logging the SNMP errors over the 6 months prior to the first outage?
Where do they sit within the environment?
If we had 3PAR HBAs logging internal errors on their links to the drives (local, non-SAN connections), that's very different from a host connecting via the network to the 3PAR controller.
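Either way, the usual trick for pinning the blame on a cable rather than a device is to compare the error counters at both ends of each link. A sketch of that logic, with invented link names and counters (nothing here comes from the ATO report):

    # Illustrative: CRC / loss-of-signal style counters polled from the devices
    # at each end of a fibre link. Errors climbing at BOTH ends points at the
    # cable (or optics); errors at only one end points at that device's port.
    links = {
        "node0:fc1 <-> switch1:p7": {"end_a_errors": 412, "end_b_errors": 398},
        "node1:fc1 <-> switch2:p7": {"end_a_errors": 0,   "end_b_errors": 275},
        "host42:hba0 <-> switch1:p12": {"end_a_errors": 0, "end_b_errors": 0},
    }

    for link, c in links.items():
        a, b = c["end_a_errors"], c["end_b_errors"]
        if a and b:
            verdict = "both ends erroring -> suspect cable/optics on this path"
        elif a or b:
            verdict = "one end only -> suspect that port/transceiver"
        else:
            verdict = "clean"
        print(f"{link}: {verdict}")

Which is exactly why it matters which devices were raising those SNMP errors, and where they sit.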
The ATO report takes special care to mention "data paths", disk drives and "SAS" (Serial Attached SCSI), but never cares to provide any sort of explanation of their importance/relevance or a diagram of connections.
No, that's not restricted information in a simplified document. No reason to suppress that level of detail - they've disclosed other very specific technical details.
Over decades, I've never seen an optically connected drive; the HDDs & SSDs I've seen have only ever had _copper_ connectors. SAS is electrically similar to SATA, but allows dual ports and daisy-chaining.
The "state of the art" is fibre connections between backplanes, into which the devices are plugged with copper connectors.
That'd make sense for either SSD or HDD drives in either 3PAR 20850 or XP7.
It's where you'd expect "data paths" to run: from Array Controllers to drives/shelves. That makes these "stressed" cables internal to the 3PAR Array, not part of the Network.
Hosts connect to the SAN via at least dual connections, while on the other side, the Array controller has multiple connections to the SAN router / director for performance & reliability.
Were the fibre cables that caused errors on "data paths" internal to the 3PAR Flash Storage Array or within the SAN (network)? HPE & the ATO keeps that obscured.
The whole point of the 3PAR, in fact any Array, is to hide the individual devices from the hosts and create virtual error-free devices of any size.
With Network attached Storage Arrays, there aren't any direct "data paths" from a host to a drive - where the errors described occurred.
I had a 3Par array which failed. Totally. In as much as I had ripped it out of the rack, fed all of the drives into a mobile shredding unit and smashed the shelves with a sledgehammer. In September.
In March we got a phone call asking if the array was having problems, as it hadn't checked in.
This was on supposed top tier paid for support.
This was also after the process supplied to balance out the data distribution, when adding a new magazine of disks to 2 shelves, did no such thing, and writes were purely going to the limited number of new disks. It took 6 attempts by HP to fix it, including several demands that we pay for service packs to support this process, before they sent us an extract from the user guide with the same command they'd had us type in.