Christ on a bike....
...OK, if this is a hardware failure as they say it is, how the hell can they not have DR up and running in a few mins? OK, maybe an hour while they um and ah about the pros and cons of going DR, but 21 hrs!
NetApp has said none of its technology was involved in the Australian Virgin Blue airline reservation system crash on Sunday. On Sunday and Monday a 21-hour outage of the New Skies system, hosted at Navitaire's Sydney data centre, caused mass delays to Virgin Blue's customers as the airline reverted to a manual …
The RAMSAN is not an enterprise storage device in any shape or form. It is almost exclusively used in HPTC environments where performance rules ahead of availability. What do you expect when you have a disk subsystem that doesn't really have any accredited application or vendor support?
Why they would buy a low-latency performance device and then hang it behind a NAS-TY appliance is beyond me - very daft architecture.
That the whole solution broke is not a surprise, just how long it took.
If you want good reliability, buy an enterprise disk subsystem; you get what you pay for.
"The RAMSAN is not an enterprise storage device in any shape or form. It is almost exclusively used in HPTC environments where performance rules ahead of availability....."
/Looks down datacenter aisle in Fortune 500 company to where two RAMSANs are racked as part of an HA Oracle RAC solution.
/Remembers that the solution has been performant, resilient and reliable.
/Wonders if the poster is smoking an illegal substance.....
".....Why they would buy a low-latency performance device and then hang it behind a NAS-TY appliance is beyond me - very daft architecture....."
/Decides the poster is lucid after all, if a little misinformed about the RAMSAN kit, has to agree that the idea of a RAMSAN behind a NAS is just silly.
Happens over and over again.
$(SOME_CONSULTANT) designs a new application, which is regarded mission critical,
but due to all the different design and implementation bugs on all layers of the design,
database, middleware, data conversion routines, frontend, network, performance tends to suffer.
Who's going to be blamed? "The storage device, the storage device, it's hellishly slow!"
"We need a latency no bigger than 1 ms!!! 10 ms and we're dead!!!"
While spending 500ms or more in the crappy frontend..
That's how devices like RAMSAN come into play where they really shouldn't.
HA concepts typically don't help to speed up application performance either; in the best case, they are neutral.
Since all of these projects run out of time, rigorous tests of failure and failover scenarios tend to get "forgotten". Until <something ugly> hits the fan.
But then the question isn't "How do we prevent this from happening again?"
The question is always "Whom can we blame, since obviously it's not our fault. It simply can't be. We attended one project meeting after the other!"
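To put the "1 ms or we're dead" complaint above into numbers, here is a rough sketch. It assumes one storage round trip per request, which is a simplification of mine; only the 1 ms, 10 ms and 500 ms figures come from the comment.

```python
# Rough latency budget for a single request, in milliseconds.
# Assumption (not from the thread): each request makes exactly one
# storage round trip, and nothing else contributes to the wait.
FRONTEND_MS = 500.0  # "500ms or more in the crappy frontend"

def end_to_end_ms(storage_ms: float, frontend_ms: float = FRONTEND_MS) -> float:
    """Total wait: frontend time plus one storage round trip."""
    return frontend_ms + storage_ms

slow = end_to_end_ms(10.0)  # storage answering in 10 ms
fast = end_to_end_ms(1.0)   # storage answering in 1 ms
saving_pct = (slow - fast) / slow * 100

print(f"10 ms storage: {slow:.0f} ms total")
print(f" 1 ms storage: {fast:.0f} ms total")
print(f"saving: {saving_pct:.1f}% of the end-to-end time")
```

Under those assumptions, a tenfold faster storage device shaves under 2% off what the user actually waits for.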
I have seen a very long outage on a NetApp. That despite visits from NetApp professional services staff trying to fix the damn thing. To say they fixed it is a bit misleading. A disk shelf, the redundant controllers and every disk in the shelf were replaced.
The root cause was never satisfactorily identified. The problem's symptoms were that several disks in quick succession failed, which used up all the spare disks.
We learned that:
1. Copying a large SATA disk takes several hours.
2. Running a WAFL check takes several hours (can't be done at the same time as a disk copy).
3. NetApp had no viable cover in London over a weekend, the PSE came from up north somewhere.
And this is how "outages" can end up being 21 hours. Sometimes rolling forwards on a DR plan is not trivial. If it is thought that the fix will be reasonably quick, then it is sometimes preferable to stay with what you have. Especially as the "outage" was not particularly service affecting as things simply queued elsewhere.
It is a big question. At what stage should a DR plan be kicked off? Is it "as soon as there is a problem"? Or is it "at some point into the outage"? Probably the second, but that just raises a further question of "when"?
But we learned the hard way, and we now have detailed knowledge of the time involved just to do certain things and run certain checks. If the same problem happened (unlikely, but it has happened once so it is not impossible), the DR plan would be "now".
I have no idea if NetApp were in any way responsible for the outage in this article, nor do I insinuate that they were.
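One way to frame the "when do we go DR" question above is to compare the estimated in-place repair time with the time to fail over. A minimal sketch follows; every duration in it is a made-up placeholder (the comment only says "several hours"), and the step names and the simple "go if repair is slower" rule are my assumptions, not anyone's actual DR plan.

```python
# All durations are hypothetical placeholders, in hours.
REPAIR_STEPS_HOURS = {
    "copy large SATA disk": 6.0,
    "filesystem check": 5.0,       # cannot overlap with the disk copy
    "wait for engineer to arrive": 4.0,
}
DR_FAILOVER_HOURS = 2.0  # hypothetical time to bring the DR site up

def estimated_repair_hours(step_hours):
    """Steps that cannot run at the same time simply add up."""
    return sum(step_hours.values())

def should_invoke_dr(step_hours, dr_failover_hours):
    """Go to DR when fixing in place is expected to take longer than failing over."""
    return estimated_repair_hours(step_hours) > dr_failover_hours

repair = estimated_repair_hours(REPAIR_STEPS_HOURS)
decision = should_invoke_dr(REPAIR_STEPS_HOURS, DR_FAILOVER_HOURS)
print(f"estimated in-place repair: {repair:.1f} h, DR failover: {DR_FAILOVER_HOURS:.1f} h")
print("invoke DR now" if decision else "stay and fix in place")
```

The hard part in real life is knowing the step durations before the outage, which is exactly the knowledge the poster says was learned the hard way.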
Well now, I'd just like to thank all the architectural posters on here who apparently know bugger all about edge cases. Thanks for the amusement.
Just because your app runs fine on your big fat general purpose EMC/HDS rig (as do 95% of mine) does not mean that there is no place for something different if the performance profile demands it. There must have been a v good reason for the airline going V-Series and TMS. It's about IOPS and latency. Ever tried getting 150k random IOPS out of a 2TB dataset? Try it with a VMAX or USP and you'll be short stroked to hell. Try using SSD in a VMAX or HDS and you'll flatten the DAs. TMS is tuned for performance over availability and management, but it is damn fast if you need that kind of throughput. Mitigate it by running NetApp's SyncMirror plus SnapMirror and you'll have an incredibly fast resilient rig that costs a bomb. But, if you need those 150k IOPS it works very well indeed.
So, just because *you* don't need it in your shop doesn't mean other folks don't.
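For anyone wondering what "short stroked to hell" looks like in numbers, here is a back-of-envelope sketch. The 150k IOPS and 2TB figures are from the comment above; the per-drive IOPS and drive size are rule-of-thumb assumptions of mine, and RAID overhead and cache hits are ignored.

```python
import math

TARGET_IOPS = 150_000   # from the comment
DATASET_GB = 2_000      # 2TB dataset, from the comment

# Assumptions (not from the thread): a spinning drive delivering about
# 180 random IOPS, sold in 300 GB units.
IOPS_PER_DISK = 180
DISK_SIZE_GB = 300

# Spindles needed purely to reach the IOPS target.
disks_needed = math.ceil(TARGET_IOPS / IOPS_PER_DISK)

# Raw capacity you end up buying, and how little of it holds data.
raw_capacity_gb = disks_needed * DISK_SIZE_GB
used_pct = DATASET_GB / raw_capacity_gb * 100

print(f"disks needed for IOPS: {disks_needed}")
print(f"raw capacity bought:   {raw_capacity_gb / 1000:.1f} TB")
print(f"capacity actually used: {used_pct:.1f}%")
```

With those assumed figures you buy hundreds of spindles and use less than 1% of each, which is the case being made for a RAM/flash device when the IOPS are really needed.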
"After tying with EMC in the last Quality Awards, NetApp prevails this time to nudge out EMC and IBM, which tied for second.
NetApp Inc. continues to shed its "NAS-only" image with another impressive outing in the Storage magazine/SearchStorage.com Quality Awards for enterprise arrays."
I was talking to someone from Virgin Blue in the week leading up to this fiasco, and I was informed they had just closed the books on a new release that was due to go out.
And then, that weekend, the system crashes. And it gets blamed on a HW failure in an SSD *array*, for <deity>'s sake.
Coincidence? Possibly. But I have my doubts - I see (in my mind's eye) a platoon of Anderso- sorry, Accenture consultants panicking and swearing black and blue that they can patch-fix the problem with the new version (instead of doing what *any* decent release management team should have done - revert the change on what is a *mission-critical* system and *then* try to figure out what went wrong).
VB screwed the pooch on this one, in more ways than one - bad DR management, bad PR management (they should have booked passengers on other flights, rather than keeping them in hotel rooms - might be more money lost, but it's better PR: "look, we care enough about you getting to where you need to go that we'll put you in a competitor's aeroplane") and, of course, using Accenture in the first place.
Anonymous, because I don't want to get the VB person I was talking to in trouble.
Come on, all of that Virgin Blue infrastructure was deployed on Windows/.NET - that alone should be a pretty good giveaway that pretty bad choices were made for an otherwise mission-critical environment. Who in their right mind would run a high profile mission-critical ticketing system on the tinkertoy OS that is Windows? Blech.
As a twice weekly Ryanair commuter I now understand why their booking system was down for "Essential system maintenance" for an entire working day last week. Given their 25 minute flight turn-around is the shortest in the business I imagine Mr O'Leary has SLAs with Navitaire to match.... Somehow I don't see him being as 'understanding' as Mr Branson when it comes to compensation.
Sad but true. I assure you that someone presented them an option that would allow for rapid recovery in the event of any type of failure...and the customer didn't think it was worth the money.
Your failure to buy insurance is not the fault of the guy who tried to sell it to you.
Putting this behind a V-filer can only be the result of serious Kool-aid drinking. NetApp flavored.
".....I assure you that someone presented them an option that would allow for rapid recovery in the event of any type of failure...and the customer didn't think it was worth the money....." Well, budgets are budgets. My usual arse-covering tactic is to go to the board with three options - call them bronze, silver and gold if you like. Bronze is what I definitely don't want - it is cheap, just about does the core job, but with lots of caveats (such as poor DR). Silver is what I really want but I know will cause the beancounters heartburn. The gold option is the Rolls-Royce job, absolutely bullet-proof and with every box gilded. The main function of the gold option is to make the silver one look more reasonable. The way to avoid the board going for the cheaper bronze option is to draw up a table showing the cost to the business of different types of failure, and show that the bronze option will incur bonus-threatening or even boardroom-job-threatening levels of loss to the company. If you can translate the technical impact into money lost it's usually a winner. Of course, they may reject all three options and send you away to try again, but if they do take the bronze option (as VB seem to have) you can point out that you did warn them. I pinched that technique from a vendor rep after he let one too many secrets out of the bag down the boozer!
".....Your failure to buy insurance is not the fault of the guy who tried to sell it to you....." Actually, it can be. You see, most resellers/vendors advise on what they sell, and some will advise you on what is best for them rather than what is best for you. I'm not saying let the vendors advise on each other's solutions, that would just leave you swimming in a sea of FUD, but I would suggest you get more than just one option from more than just one reseller/vendor. It's all very well for trolls like me to say "hp uber alles" because the hp solutions often work best for my company, but every company is different and you may find another vendor is the better option for you (even NetApp!). Unfortunately, a reseller or vendor is even less independent as their paycheque depends on actually not telling you the better option if it isn't what they sell.