Nah, it was that home-made sheep-poo generated methane/electric converter again.
Munted, Kiwi style!
The boss of Air New Zealand has launched an astonishing attack on IBM after a catastrophic system crash crippled the airline and left passengers stranded. The massive IBM letdown could see the vendor turfed out of its contract with the New Zealand flag carrier. The chaos was down to a crash at the airline's mainframe, which …
everyone in NZ IT knows there are few alternatives to the large Computer Corporates (IBM, EDS, etc) and that decisions like this are made on the golf course, not the board-room. Air NZ will huff and puff but eventually stick with the same vendor, with a few minor concessions gained.
Big Blue, pile of poo!
He really should blame himself. Why was it outsourced in the first place when mainframe uptime is so critical to the business? Having responsibility within your own company with people who understand the business and care is not always a bad idea... Ah, I see, they could save a few quid and surely nothing could ever go wrong....
I'd be interested to see some of the background to this.
What response does the airline actually pay for? It's amazing how many companies pay for slow response times and then shout when they don't get an instant response. If you pay for 8x5x4 and the server goes down at 5pm on Friday then tough, you're getting an engineer on Monday.
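To put numbers on the 8x5x4 point, here is a rough sketch of how the response clock runs under that kind of cover. The 09:00 to 17:00, Monday to Friday window, the function name and the sample dates are all assumptions for illustration; a real contract would spell out its own hours and holidays.

```python
from datetime import datetime, time, timedelta

# Assumed coverage window for an "8x5" contract: 09:00-17:00, Monday to Friday.
COVER_START = time(9, 0)
COVER_END = time(17, 0)


def response_deadline(reported: datetime, response_hours: float = 4) -> datetime:
    """Latest time an engineer is due, counting only covered hours.

    Hypothetical helper: the clock runs only inside the coverage window,
    so a call logged at closing time on Friday starts counting on Monday.
    """
    remaining = timedelta(hours=response_hours)
    t = reported
    while True:
        day_start = datetime.combine(t.date(), COVER_START)
        day_end = datetime.combine(t.date(), COVER_END)
        if t.weekday() < 5 and t < day_end:  # weekday, before close of cover
            t = max(t, day_start)            # clock cannot start before opening
            available = day_end - t
            if remaining <= available:
                return t + remaining
            remaining -= available           # carry the rest to the next covered day
        # Jump to the opening time of the next calendar day.
        t = datetime.combine(t.date() + timedelta(days=1), COVER_START)


# A Friday 5pm failure (9 October 2009 was a Friday).
print(response_deadline(datetime(2009, 10, 9, 17, 0)))
```

Under these assumptions the Friday 5pm call is not due a response until Monday lunchtime, which is the whole point: the contract, not the engineer, sets the pace.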
Do they pay for the highest level of resilience? Again a lot of companies are quoted for loads of resilience (co-location, multiple generators and UPS, multiple network routes, etc) and decide it's unnecessary, then blame the supplier when a failure occurs that would have been prevented by the proposed resilient solution.
After a slating like that I hope IBM come back with a detailed response.
Q: What kind of an idiot outsources a business-critical service and then lets it run without ever actually testing that the outsourcer can deliver the contractually agreed service to the full extent as agreed, in any meaningful way ?
A: A clueless one who had a nice lunch or three down the golf course or the Lodge. Hopefully soon to be a jobless one, but somehow I doubt it.
There was a similar case some time ago where a very large telco's most critical asset went belly up and they shouted at the provider because the redundant system didn't come up.
Of course, it wasn't until after the news coverage died down that said provider noticed that their customer had been mucking about in places they shouldn't muck.
That said, as a former IBM employee I can easily see this happening.
Sounds more like a case of process creep to me.
Customer: Something just blew up, log an incident!
Call centre: Certainly sir we will get right on it.
IM: Things blowing up tend to be problems not incidents, pass this back to the call centre.
Call centre: It seems this is a problem not an incident, would you like us to open a problem ticket?
Customer: I don’t care, I just want it fixed.
PM: Well it does certainly look like a problem, but I think we will need to engage CM to raise an RFC to cover the work.
CM: The RFC needs customer and architect approval, we will get right on getting it. Oh, but it's 5 PM so the CM for this account has gone home, we will need to pass this back to the call centre to engage the OOH CM service.
Etc. Etc. Etc. Etc.
5 hours later the 3rd party that looks after the power systems is engaged to come and fix it.
It's a zSystem from IBM, it only has one level of support: 4 hour.
Now, the real question: Why did the vendor deploy a zSystem on a single power system, and why wasn't there a second z chassis sitting next to it (or better yet in a separate building) replicating the LPARs for resiliency?
We've got 4 z10s in our building, several z9s and a few older z8s, not to mention a nice big 595 or 3. Each is run to 2 completely separate power systems from 2 separate power companies, and backed up by 3 generator systems (only 1 of which needs to be running, 2 of which are constantly spinning a flywheel). We don't have real time offsite resiliency (in construction now), but if any main system goes down, with the exception of a few smaller business units who don't pay for that level of redundancy, it doesn't go down...
Since zSystems are sold on MIPS, the secondary chassis sitting there doing nothing costs very little. You only pay as you go in CPU minutes, so an idle chassis costs very little. Replication licenses and LPAR configuration are not free, but a $10M mainframe can be replicated for another $1-2M, not the more than double that replicating x86 and Px hardware costs... Segregated power systems to core infrastructure in a server room are old hat at this point as well. I'm surprised IBM even OFFERED a support contract without fully segregated power systems (since losing power on a zSystem is BAD, real bad, for the hardware and integrated cooling systems). They are the primary reason all our DCs have fully redundant separated power and fully redundant separated cooling as well.
Regarding criticism of outsourcing system management to IBM, I doubt that it was done to save money. When is IBM _ever_ cheaper than doing it yourself? Perhaps they felt the best way to get proper mainframe expertise was to hire the company that built it. It isn't like they hired the Geek Squad from Best Buy.
The guarded comments show that IBM is not ready to announce the findings of its incident review, i.e. to identify root cause and thereby a responsible party. Even if this was installed at an IBM facility with a contract of xx hours support, there may well be emails which point to the customer's responsibility for the design or the current state of the environmental (i.e. power in this case) systems. Or maybe the contract just says "4 hours from when power is restored".
It's rare that IBM would not jump to it and attempt to fix things quickly in the case of a major sev1 for an important customer, even if the contract in place or other context meant they could turn their back.
But perhaps there was a more important customer in that data centre than Air NZ.
Yes, this is all groundless speculation, just like the trolls further up....
I think that the fat salesman Sam P should just tell the geezer to chill out, grab a Sheila, and put another prawn on the barbie. That should calm him down !
Nice writeup. Maybe El Reg ought to get some occasional content from people who (still) have a clue about how things are done properly and who are lucky enough to work at places where people (specifically, management) understand the difference between cost and value, between "trendy" and "good business investment". I guess El Reg have your email?
As already suggested, yet another example for the "Reasons NOT to Outsource" list.
Does anyone who is NOT a Beancounter actually have anything good to say about Outsourcing? Any Techies, Ops, Proggies or SysAdmins actually benefitted from being Outsourced? Any "shop-floor" customer-facing Managers happy?
Or is it just the Top Neddys, Accountants and Shareholders who still think Outsourcing really is the best way to go?
(hmm, maybe Air NZ should also think about getting a nickname that is not related to their national, FLIGHTLESS, bird - even if "Kiwi Air" is rather apt!)
Every vendor and outsourcer in town will be rubbing their hands and I can see a few lunches and rounds of golf coming up.
IBM probably lose points for attitude. But were they to blame for what happened? We all get bagged for every bad management decision that customers make.
Generators and UPSs are notoriously twitchy and need to be checked regularly. I think in the case of Air New Zealand it should be monthly, but someone will have to pay for that. Every component should be tested at irregular intervals to make sure this stuff doesn't happen.
Hmmm my evil plan to take over the world is working mowhahahahahaha !!!!!
"A generator failure Sunday at an IBM data center in Auckland, New Zealand crippled key services for Air New Zealand, prompting the airline’s CEO to publicly chastise Big Blue for the failure. The data center outage crashed airport check-in systems, as well as on-line bookings and call center systems Sunday morning, affecting more than 10,000 passengers and throwing airports into disarray.
The problem occurred during planned maintenance at IBM’s Newton data center in Auckland. A generator failed during the maintenance window, dropping power to parts of the data center, including the mainframe operations supporting Air New Zealand’s ticketing. IBM says service was restored to most clients within an hour, but local media reports say Air New Zealand’s ticketing kiosks were offline for up to six hours."
"IBM expressed its regrets and said the likely cause was a failed oil pressure sensor on a backup generator during a scheduled maintenance session"
I say let's blame the manufacturer of the generator... it does make you wonder why they didn't have more power redundancy in the data center... although from the article it sounds like they probably do, it's just that the section of the data center that went down was connected to that particular generator that failed. As to the Kiosks not coming back up for six hours, hard to say who's to blame for that... someone else may be responsible for the maintenance of those and not IBM.
Calm down you lot !
I've worked on sites where system and backup facilities for them have been considered totally separately to the power supply requirements ! Same with the air conditioning, etc.
So you get :-
1) IBM on a costly fast response support contract with expected high availability.
2) Power supply by ACME whoever, who are barely intelligent enough to know what a volt is, let alone how to work the said generator that is argued to be the cause of the delays to restoration of service here. Worse, they are on a 1 week response contract on account of power failures being infrequent, and usually of short duration so the UPS can keep the systems up for a few minutes. The need for the generator is considered low.
Bet Air NZ took 2) above as their low cost option, in which case how can Big Blue be expected to fix a system that has no power, AND is outside their schedule of support deliverables ?
Just speculation, but having seen how all this stuff gets farmed out on lowest cost basis I would not be surprised it has happened here !
My 2p worth, for what it's really worth :-), probably more than the ACME electric contract, but not much.
Now I really must go get that pint.
"CEO Rob Fyfe placed the blame squarely on IBM"
Speaking from experience, I'm absolutely positive that IBM's salesdroids tried to sell Mr. Fyfe large portions of redundancy ... but it looks to me like Mr. Fyfe, or one of Mr. Fyfe's minions, turned it down. From my perspective, Mr. Fyfe only has Mr. Fyfe to blame ...
“The cause of [Sunday’s] power outage at the Newton Data Centre has not been fully determined. IBM's primary focus was to rapidly restore services to our clients, and in particular to Air New Zealand. IBM immediately engaged a team of 32 local IT professionals supported by global colleagues and management to restore impacted client systems. Services to most clients were restored within an hour of the outage.
“We have already engaged an independent expert to conduct a thorough investigation into the cause of the outage, however the likely cause appears to have been a failed oil pressure sensor on a backup generator. We regret any inconvenience caused to our clients or their customers.”
that the primary systems were undergoing extensive maintenance, the secondary ones were what was hit by the power outage.
The next failover level is known as "paper". Staff voluntarily came in from off duty (annual leave etc) to help out without even being asked.
I'm waiting for the dust to settle too.
RE: Tough Talk, what about Fonterra ditching Gen-i and jumping into bed with Everything Done Slowly? Sometimes vendors _do_ get changed.
I work for a major IT Service Provider.
Isn't this Air NZ scenario just another example of corporate greed?
The outsourcer squeezes the service provider so hard that they cannot possibly deliver what they promise. In turn the service provider cuts back on staff so that they can make a profit. Morale within the service provider organisation plummets to the extent that no-one gives a stuff.
When will these financial managers masquerading as leaders start to exhibit real leadership and start building morale within their organisations instead of concentrating on reaching financial targets?
The majority of employees want to take pride in their work and deliver quality service but their so-called leaders prevent them from doing so with their short term policies.
The sooner we get rid of accountants out of executive leader positions in technical organisations and replace them with engineers who have financial skills the better off everyone (including our clients) will be. Maybe then we can get some real balance between people and profits.
...as a contractor to AirNZ. The data centre is a 1980's relic that seriously needs upgrading. Two years ago they were at the limit of their power distribution; every time we added a new load we needed to remove another one first and get sign off from the Data Centre manager, who worked for IBM, even though the DC is owned by AirNZ. There was a culture of deferring spend as long as possible and increasing the power feeds to the DC had been thrown in the too hard basket. As a result of this culture we had a room full of mainly empty racks, or at the least low density servers if any were actually full.
BTW no beer fridge, no alcohol allowed on AirNZ premises unless it was a Koru lounge or actually on an aircraft, and they don't provide staff in that DC with any plates, cups, mugs or cutlery, it's all BYO!
IBM is a service provider now. It's what they do and it's how they make their money. You pay IBM money and they provide and take care of your systems for you. It's a perfectly sensible thing to do, mainframe experts aren't exactly growing on trees these days and IBM made the damned machine. If they did indeed screw this one up (and even if they didn't) this kind of failure isn't going to be good for their core business.
Outsourcing isn't always bad, I don't do my own electrical work or plumbing I outsource that to someone with the relevant expertise. It's not cheaper, but it's usually a lot better.
Certainly outsourcing normal business functions is stupid and generally bites you in the long run. Outsourcing things which you can't do to people who can isn't stupid though. Having enough staff on hand to provide 24 hour support to a mainframe is expensive, you essentially need four people to cover it fully. If you have only one mainframe it's generally not cost effective for the company or the staff. IBM has a lot of mainframes and can provide that support.
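The "four people" figure is easy to sanity-check with back-of-envelope arithmetic. The 40-hour week and the 85% availability allowance for leave and sickness below are assumptions for illustration, not figures from any real roster.

```python
import math

HOURS_PER_WEEK = 24 * 7  # 168 hours of cover needed for one seat, around the clock


def staff_needed(hours_per_person: float = 40.0, availability: float = 1.0) -> float:
    """Full-time equivalents needed to keep one person on shift at all times.

    availability is the assumed fraction of contracted hours actually worked
    once leave and sickness are taken out (1.0 means nobody is ever away).
    """
    return HOURS_PER_WEEK / (hours_per_person * availability)


print(round(staff_needed(), 1))                     # bare minimum, no leave at all
print(math.ceil(staff_needed(availability=0.85)))   # whole heads once leave is allowed for
```

With nobody ever away, the sum comes out at a little over four people; allow for holidays and sick days and it rounds up to five, which only strengthens the case that a single-mainframe shop struggles to justify the headcount.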
"Does anyone who is NOT a Beancounter actually have anything good to say about Outsourcing?"
Yes... It's economies of scale. For smaller organisations, not having to pay 2 IT staff (and contractors when stuff is beyond the capability of staff) is a much cheaper option.
And why 2 staff? Because they need to go on holiday sometime, and will get sick sometime.
Yep, there's nothing like intractability for negotiating a discount/price rise, depending which side you're on.
IBM's negotiating point: without us, it'll take you months to get serviceable again, if you don't go bankrupt first.
From ANZ's point of view: are you trying to bankrupt us? We'll sue your little blue socks off.
Must laugh, didn't have these problems when the world was run by Commodore 64's or Atari's.
Still that's progress/modernisation/technology/globalisation for you.
Thank goodness for Ryanair, at least you expect to get shafted one way or another, so it's no let down then.
"Must laugh, didn't have these problems when the world was run by Commodore 64's or Atari's."
Uh ... Neal 5, I think you'll find that IBM-style mainframes "ran the world" long before those two companies existed, "ran the world" during their heyday, still "run the world" today, and will continue to "run the world" into the foreseeable future.
Not trying to stick up for BigBlue, just pointing out the obvious.
"CEO Rob Fyfe placed the blame squarely on IBM in an email, which inevitably hit the media almost immediately." Some anonymous coward must have FWD'd it to NZ Herald or something, as there is no other followup from Air NZ management.
All said and done, Air NZ signed up - whether they knew what they were getting or didn't care will now be up for public speculation. Disappoints me being a NZer that with so much Kiwi IT and management expertise around that my country's airline hits the (IT) headlines for the wrong reasons. Air Timbuktu yes, Air NZ no.
The news that the Diesel Gen. did not kick in is no surprise. I have worked for IBM for many years, and they do not understand 'true disaster recovery' tests, believing that a 'paper exercise' is all that is needed. Even for the millennium they never carried out any real tests to determine if any 'what if' scenario defined on paper had any merit.
DR testing costs money. IBM does not spend money, believing that the money saved is worth the risk. If I were a CEO looking for an out-sourcing solution I would insist on regular & 'real' DR testing being a part of any contract.