Nice to see RBS not using ITIL. *cough*
'Mainframe blowout' knackered millions of RBS, NatWest accounts
A hardware fault in one of the Royal Bank of Scotland Group's mainframes prevented millions of customers from accessing their accounts last night. A spokesman said an unspecified system failure was to blame after folks were unable to log into online banking, use cash machines or make payments at the tills for three hours on …
-
-
-
Thursday 7th March 2013 15:14 GMT jai
to add some detail, if it's the same one, that would suggest that no one did any due diligence, or lessons learned, or root cause analysis or any of a dozen other service delivery buzzwords that all basically mean, "wtf happened and how do we make sure it doesn't happen again?"
you'll always get new issues that break things. such is the way of IT. no system is 100% perfect. you just have to put in as much monitoring/alerting and as many backup systems as you can afford, to ensure the impact of any outage to your business-critical systems is as minimal as possible.
-
-
-
Thursday 7th March 2013 14:22 GMT Bumpy Cat
I doubt it
I work somewhere that has a much smaller IT department and a much smaller IT budget than RBS, but it would take the failure of multiple hardware devices to knock a key service out. What kind of Mickey mouse setup do they have that a hardware failure can take down their core services for hours?
-
Thursday 7th March 2013 14:34 GMT IT Hack
Re: I doubt it
Indeed. My point re ITIL...
It beggars belief that this has happened. Oh wait...no actually it doesn't. Seen a simple electrical fault take down a Tier Two dc...hit 6 companies hosting live revenue generating services. One company nearly went tits up.
Of course at RBS I expect a director to be promoted to a VP type position for this cock up.
-
Thursday 7th March 2013 14:36 GMT Anonymous Coward
Re: I doubt it
It would take the failure of multiple pieces of hardware to take down an IBM zServer, but that doesn't mean it can't happen. The only thing you can be sure of with a system is that it will eventually fail.
To accuse them of running a "Mickey mouse" operation suggests that you've no idea how big or complex the IT setup at RBS is. I believe they currently have the largest "footprint" of zServers in Europe, and that's before even mentioning the vast amount of other hardware on a globally distributed network.
Small IT = Easy.
Big IT = Exponentially more complicated.
-
-
Thursday 7th March 2013 17:44 GMT John Smith 19
Re: I doubt it
"... and why the fuck are they allowed to have BOTH a banking licence and limited liability? ... mutter mutter .... moan ..."
You forgot that UK banks have "preferred creditor" status, so are one of the first in line if a company is declared bankrupt. Because it's to protect the widows, orphans and other children (hence my icon).
Which can happen when a bank asks them to repay their overdraft now, for example.
-
-
Thursday 7th March 2013 14:56 GMT Phil O'Sophical
Re: I doubt it
> It would take failure of multiple pieces of hardware to take down an IBM zServer, that doesn't mean it can't happen.
It also assumes someone noticed the first failure. I remember our DEC service bod (it was a while ago :) ) complaining about a customer who'd had a total cluster outage after a disk controller failed. Customer was ranting & raving about the useless "highly available" hardware they'd spent so much money on.
Investigation showed that one of the redundant controllers had failed three months before, but none of the system admins had been checking logs or monitoring things. The spare controller took over without a glitch, no-one noticed, and it was only when it failed that the system finally went down.
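The missing piece was nothing clever, just a periodic check that shouts the moment you're down to your last controller. Something along these lines (a rough sketch only, nothing DEC ever shipped, and the status source is invented):

import time

def read_controller_status():
    # Stand-in for whatever the real kit exposes (logs, SNMP, a CLI).
    # Returns controller name -> "ok" or "failed".
    return {"controller-a": "ok", "controller-b": "failed"}   # example data

def check_redundancy(statuses, alert):
    healthy = [name for name, state in statuses.items() if state == "ok"]
    failed = [name for name, state in statuses.items() if state != "ok"]
    if not healthy:
        alert("OUTAGE: no healthy controllers left")
    elif failed:
        # Service is still up on the spare -- exactly the state nobody noticed.
        alert("DEGRADED: running without redundancy, failed: %s" % failed)

if __name__ == "__main__":
    while True:
        check_redundancy(read_controller_status(), alert=print)
        time.sleep(300)   # every five minutes, not once a quarter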
-
Thursday 7th March 2013 15:24 GMT jai
Re: I doubt it
i was once told an anecdote. the major investment bank they worked for failed in its overnight payment processing repeatedly, every night. Eventually they determined it was happening at exactly the same time each night. so they upgraded the ram/disks. patched the software. replaced the whole server. nothing helped.
Finally, head of IT decides enough is enough, takes a chair and a book to the data centre, sits in front of the server all night long to see what it's doing.
and at the time when the batch had failed every night previously, the door to the server room opens, the janitor comes in with a hoover, looks for a spare power socket, finds none, so unplugs the nearest and plugs in the hoover. yes, you guessed it, the plug that was unplugged was the power lead for the server in question.
just because you're a big firm, doesn't mean you don't get taken out by the simplest and stupidest of things
-
-
Thursday 7th March 2013 17:03 GMT Anonymous Coward
Urban Legend ...
"i was once told an anecdote .. the door to the server room opens, the janitor comes in with a hoover, looks for a spare power socket, finds none, so unplugs the nearest and plugs in the hoover. yes, you guessed it, plug that was unplugged was the power lead for the server in question."
I read something similar, only it was set in the ICU of a hospital and what they unplugged was the ventilator ...
-
This post has been deleted by its author
-
Friday 8th March 2013 12:37 GMT JimC
Re: Urban Legend ...
Although I imagine the story has grown in the telling, the "handy power socket" is certainly something I've experienced first hand in end-user situations. I remember in one office having to go round labelling the appropriate dirty power sockets "vacuum cleaners only" to try to prevent the staff plugging IT equipment into them and thereby leaving the cleaner to grab any handy socket on the clean power system for the vacuum cleaner...
-
Friday 8th March 2013 13:51 GMT Iain 14
Re: Urban Legend ...
“I read something similar, only it was set in the ICU of a hospital and what they unplugged was the ventilator ...”
Yup - famous Urban Legend. The hospital setting dates back to a South African newspaper "story" in 1996, but the UL itself goes back much further.
http://www.snopes.com/horrors/freakish/cleaner.asp
-
-
Thursday 7th March 2013 18:54 GMT Matt Bryant
Re: Jai Re: I doubt it
"......and at the time when the batch failed every night previously, the door to the server room opens, the janitor comes in with a hoover, looks for a spare power socket, finds none, so unplugs the nearest and plugs in the hoover. yes, you guessed it, plug that was unplugged was the power lead for the server in question....." Yeah, and I have some prime real estate in Florida if you're interested. A Hoover or the like would be on a normal three-pin, whilst a proper server would be on a C13/19 or 16/32A commando plug. It would also have probably at least two PSUs so two power leads, unplugging one would not kill it.
-
-
Thursday 7th March 2013 23:31 GMT Anonymous Coward
Re: I doubt it
"It would take failure of multiple pieces of hardware to take down an IBM zServer, that doesn't mean it can't happen. The only thing you can be sure of with a system is that the system will eventually fail."
A hardware failure taking down a System z is supremely unlikely. I think the real-world mean time between mainframe outages is something like once in 50 years. Even if it was a hardware failure (which would mean a series of hardware component failures all at the same time in the system), IBM has an HA solution in a class of its own: geographically dispersed parallel sysplex. You can intentionally blow up a mainframe, or the entire data center, in that HA design and it will be functionally transparent to the end user. A system might fail, but the environment never should.
-
-
Friday 8th March 2013 12:19 GMT Dominic Connor, Quant Headhunter
Re: I doubt it
We used to have a Stratus. They worked on the principle that SysOps *do* forget to look at logs, the irony being that if all the components are individually reliable, then humans being humans won't worry about it so much.
So Stratus machines phoned home when a part died and that meant an engineer turning up with a new bit before the local SysOps had noticed it had died.
That's not a cheap way of doing things of course, but at some level that's how you do critical systems. When a critical component fails the system should attract the attention of the operators.
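In outline it amounts to something like this (a sketch of the idea only, not Stratus's actual mechanism, and the endpoint is made up):

import json
import urllib.request

SERVICE_ENDPOINT = "https://example.invalid/service-call"   # placeholder

def component_failed(site_id, component_id):
    # Called by the fault-handling layer the moment a redundant part dies,
    # so the vendor dispatches an engineer whether or not anyone read the logs.
    ticket = {
        "site": site_id,
        "component": component_id,
        "severity": "part-replacement",
        "note": "system still running on remaining redundancy",
    }
    req = urllib.request.Request(
        SERVICE_ENDPOINT,
        data=json.dumps(ticket).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# component_failed("dc-london-1", "psu-2")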
That leads me back to seeing this as yet another failure of IT management at RBS.
If the part failed, then there should have been an alert of such a nature that the Ops could not miss it. A manager might not write that himself, but his job is to make sure someone does.
The Ops should be motivated, trained and managed to act rapidly and efficiently. Again, this is a management responsibility.
All hardware fails; all you can do is buy a lower probability of system failure, so the job of senior IT management at RBS is not, as they seem to think, playing golf and greasing up to other members of the "management team", but delivering a service that actually works.
No hardware component can be trusted. I once had to deal with a scummy issue where a cable that lived in a duct just started refusing to pass signals along. The dust on the duct showed it had not been touched or even chewed by rats, it had just stopped. Never did find out why.
-
-
Thursday 7th March 2013 22:56 GMT Anonymous Coward
Re: I doubt it
"I work somewhere that has a much smaller IT department and a much smaller IT budget than RBS, but it would take the failure of multiple hardware devices to knock a key service out."
Yes, and we are talking about a mainframe. It is near impossible to knock a mainframe offline with a simple "hardware failure." Those systems are about 14-way redundant in the first place, so it isn't as though an OSA or some other component failed and knocked the mainframe offline. Even if the data center flooded or the system disappeared using magic, almost all of these mega-mainframes have a parallel sysplex/HyperSwap configuration, which is a bulletproof HA design. If system A falls off the map, the secondary system picks up the I/O in real time, so why didn't that happen? I am interested to hear the details.
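For the avoidance of doubt about what "picks up the I/O in real time" means, the shape of it is roughly this (a toy sketch, nowhere near actual Parallel Sysplex or HyperSwap internals, which live far below this level):

import random

class Member:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        return "%s committed: %s" % (self.name, request)

def route(request, members):
    # Both members are active; work goes to any healthy one.
    healthy = [m for m in members if m.alive]
    if not healthy:
        raise RuntimeError("no members left -- the state RBS apparently hit")
    return random.choice(healthy).handle(request)

if __name__ == "__main__":
    plex = [Member("sysA"), Member("sysB")]
    print(route("debit 10.00 acct 1234", plex))
    plex[0].alive = False                          # system A falls off the map
    print(route("credit 10.00 acct 1234", plex))   # sysB carries on, no call-outs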
-
-
Thursday 7th March 2013 14:34 GMT M Gale
I thought one of the features of a mainframe...
...was umpteen levels of redundancy? One CPU "cartridge" goes pop? Fine. Rip it out of the backplane and stuff another one in, when you've got one to stuff in there.
Dual (or more) PSUs, RAID arrays.. and yet this happens. Oh well. Wonder what RBS's SLAs say about this?
They do have SLAs for those likely-hired-from-someone-probably-IBM machines, don't they?
-
Thursday 7th March 2013 15:59 GMT Velv
Re: I thought one of the features of a mainframe...
Multiple hardware components are fine as long as it is a discrete hardware failure.
Firmware, microcode or whatever you want to call it can also fail, and even when you're running alleged different versions at different sites they could have the same inherent fault.
The only true way to have resilience is for the resilient components to be made by different vendors using different components (which is what Linx/Telehouse has with Juniper, Cisco, Foundry and others for their network cores). IBM mainframes don't work this way.
-
Thursday 7th March 2013 23:53 GMT Anonymous Coward
Re: I thought one of the features of a mainframe...
"The only true way to have resilience is for the resilient components to be made by different vendors using different components (which is what Linx/Telehouse has with Jupiter, Cisco, Foundry and others for their network cores). IBM mainframes don't work this way"
Yeah, I suppose that is true... although you are more likely to have constant integration issues with many vendors in the environment, even if you are protected against the blue moon event of a system wide fault spreading across the environment. By protecting yourself against the possible, but extremely unlikely, big problem, you guarantee yourself a myriad of smaller problems all the time.
-
Friday 8th March 2013 11:51 GMT Anonymous Coward
Re: I thought one of the features of a mainframe...
Except for one thing: the RBS mainframe is not "a mainframe", it's a cluster of (14 was the last number I heard) mainframes, all with multiple CPUs. This failure probably is not a single point of failure; it's a total system failure of the IT hardware and the processes used to manage it.
-
Thursday 7th March 2013 14:48 GMT Mike Smith
I reckon the other source had it spot on
"the bank’s IT procedures will in some way require system administrators to understand a problem before they start flipping switches."
Naturally. However, let's not forget the best-of-breed world-class fault resolution protocol that's been implemented to ensure a right-first-time customer-centric outcome.
That protocol means that a flustercluck of management has to be summoned to an immediate conference call. That takes time - dragging them out of bed, out of the pub, out of the brothel, er, gentlemen's club and so on. Next, they have to dial into the conference call. They wait while everyone joins. Then the fun begins:
Manager 1: "Ok what's this about?"
Operator: "The mainframe's shat itself, we need to fail over NOW. Can you give the OK, please?"
Manager 2: "Hang on a minute. What's the problem exactly?"
Operator: "Disk controller's died."
Manager 3: "Well, can't you fix it?"
Operator: "Engineer's on his way, but this is a live system. We need to fail over NOW."
Manager 4: "All right, all right. Let's not get excited. Why can't we just switch it off and switch it on again? That's what you IT Crowd people do, isn't it?"
Operator: "Nggggg!"
Manager 1: "I beg your pardon?"
Operator: (after deep breath): "We can't just switch it off and on again. Part of it's broken. Can I fail it over now, please?"
Manager 2: "Well, where's your change request?"
Operator: "I've just called you to report a major failure. I haven't got time to do paperwork!"
Manager 3: "Well, I'm not sure we should agree to this. There are processes we have to follow."
Manager 4: "Indeed. We need to have a properly documented change request, impact assessment from all stakeholders and a timeframe for implementation AND a backout plan. Maybe you should get all that together and we'll reconvene in the morning?"
Operator: "For the last bloody time, the mainframe's dead. This is an emergency!"
Manager 1: "Well, I'm not sure of the urgency, but if it means so much to you..."
Manager 2: "Tell you what. Do the change, write it up IN FULL and we'll review it in the morning. But it's up to you to make sure you get it right, OK"
Operator: "Fine, thanks."
<click>
Manager 3: "He's gone. Was anyone taking minutes?"
Manager 4: "No. What a surprise. These techie types just live on a different planet."
Manager 1: "Well, I'm off to bed now. I'll remember this when his next appraisal's due. Broken mainframe indeed. Good night."
Manager 2: "Yeah, night."
Manager 3: "Night."
Manager 4: "Night."
-
Thursday 7th March 2013 15:24 GMT Anonymous Coward
Re: I reckon the other source had it spot on
@Mike - that may well be what you think happens, but I've experienced financial services IT recovery management and it's a lot more along the lines of:
Bunch of experts in the hardware, OS, software, Network, Storage and Backup get on call to discuss, chaired by a trained professional recovery manager.
You tend to get panicky engineers who identified the problem saying a disk controller has died, and we must change it now, NOW, do you hear?
The recovery manager will typically ask: "Why did it fail? What are the risks of putting another one in? Do we have scheduled maintenance running at the moment? Has there been a software update? Can someone confirm that going to DR is an option? Are we certain that we understand what we're seeing? What is the likelihood of the remaining disk controller failing?"
The last thing you want to do is failover to DR at the flick of a switch, because that may well make things worse. Let me assure you, this isn't the sort of situation where people bugger off back to bed before it's fixed and expect to have a job in the morning.
-
Thursday 7th March 2013 15:32 GMT IT Hack
Re: I reckon the other source had it spot on
AC 15:24 - this.
Not only financial services btw.
I'm not sure I miss those midnight calls... in some ways quite fun to sort shit out, but on the flip side the pressure to get it right first time is immense.
However it's not just flipping the bit... it's also very much understanding the impact of that decision. If you fail over an entire DC you need to really be able to explain why...
-
Thursday 7th March 2013 15:33 GMT Mike Smith
Re: I reckon the other source had it spot on
"Bunch of experts in the hardware, OS, software, Network, Storage and Backup get on call to discuss, chaired by a trained professional recovery manager."
Well, quite. That's exactly what should happen. Been there myself, admittedly not in financial services.
I've seen it done properly, and it's precisely as you describe.
And I've seen it done appallingly, with calls derailed by people who knew next to nothing about the problem, but still insisted on adding value by not keeping their traps shut.
I guess I'm just too old and cynical these days :-)
-
Friday 8th March 2013 12:23 GMT Field Marshal Von Krakenfart
Re: I reckon the other source had it spot on
"Bunch of experts in the hardware, OS, software, Network, Storage and Backup get on call to discuss, chaired by a trained professional recovery manager."
With update meetings every half hour, which is why you also need team leads and project managers, so there is somebody to go to the meeting and say "NO, the techs are still working on it".
-
-
Thursday 7th March 2013 17:05 GMT Anonymous Coward
Re: I reckon the other source had it spot on
> The last thing you want to do is failover to DR at the flick of a switch, because that may well make things worse.
I spend so much time trying to convince customers of that, and many of them still won't get past "but we need automatic failover to the DR site". We refuse to do it, the field staff cobble something together with a script, and it all ends in tears.
-
-
-
Thursday 7th March 2013 17:15 GMT Anonymous Coward
Re: I reckon the other source had it spot on
"throw in a couple of 3rd parties and you've got them all pointing fingers at each other as well to add into the mix."
"RBS - Data Consultant - Accenture"
I reckon whoever wrote that never actually worked in a real IT environment ...
-
-
-
Thursday 7th March 2013 17:13 GMT Anonymous Coward
Re: I reckon the other source had it spot on
> Which is how it might work if you have manual intervention required.
For DR you should have manual intervention required.
For simple HA when the sites are close enough to be managed by the same staff, have guaranteed independent redundant networking links, etc. then, yes, you can do automatic failover.
For proper DR, with sites far enough apart that a disaster at one doesn't touch the other, you have far more to deal with than just the IT stuff, and there you must have a person in the loop. How often have you watched TV coverage of a disaster where even the emergency services don't know what the true situation is for hours (9/11 or Fukushima, anyone?)? Having the IT stuff switching over by itself while you're still trying to figure out what the hell has happened will almost always just make the disaster worse.
For example, ever switched over to another call center, when all the staff there are sleeping obliviously in their beds? Detected a site failure which hasn't happened, due to a network fault, and switched the working site off? There is a reason that the job of trained business continuity manager exists. We aren't at the stage where (s)he can be replaced by an expert system yet, let alone by a dumb one.
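If it helps, the whole argument boils down to a gate like this (a toy illustration; the names and prompt are invented, and in real life the confirmation is a phone call to a business continuity manager, not input()):

def site_looks_dead(probe_failures, threshold=3):
    # Network probes alone cannot tell a dead site from a dead link.
    return probe_failures >= threshold

def fail_over_to_dr(confirm):
    # confirm() must reach a human who knows the wider situation.
    if not confirm("Primary site appears down. Activate the DR site?"):
        return "holding -- awaiting human assessment"
    return "DR activation started"

def ask_operator(question):
    return input(question + " [y/N] ").strip().lower() == "y"

if __name__ == "__main__":
    if site_looks_dead(probe_failures=5):
        print(fail_over_to_dr(confirm=ask_operator))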
-
Friday 8th March 2013 00:11 GMT Anonymous Coward
Re: I reckon the other source had it spot on
"Highly available mainframe plex's like RBS run active/active across multiple sites."
Exactly, no one should have had to call anyone. The mainframe should have moved to the second active system in real time. The only call that would have been made is the system calling IBM to tell an engineer to come and replace whatever was broken.
-
-
This post has been deleted by its author
-
-
Thursday 7th March 2013 23:02 GMT M Gale
Re: I reckon the other source had it spot on
"So all, technicians are incredibly overworked but still infallible, whilst all managers are lazy and incompetent. Procedures are completely unnecessary. Just get rid of all managers and procedures and everything will be fantastic."
And irate operators really do poison their bosses with halon. Oh come on, that has got to be the best rant I've seen in a while. Possibly it might have come from personal experience, as far as you know.
Wherever it came from, I think Mike Smith needs to be hired as Simon Travaglia's ghost writer for when he's off sick and a new BOFH episode needs writing up. That was awesome.
-
-
Friday 8th March 2013 14:25 GMT Chris007
Re: I reckon the other source had it spot on @Mike Smith
Wow - having worked at RBS, this is not far from what has happened on some recovery calls I was involved in. Systems were down and some manager would actually request a change be raised BEFORE fixing the issue - anybody who knows the RBS change system *coughInfomancough* knows it is not the quickest system in the world.
Have a pint for reminding me what I escaped from.
-
-
Thursday 7th March 2013 15:08 GMT Anonymous Coward
"In theory, the banking group’s disaster-recovery procedures should have kicked in straight away without a glitch in critical services."
Easy to be judgemental, but a DR failover is normally a controlled failover which takes a number of hours; you need to be 100% sure the data is in a consistent state to be able to switch across. I've seen failovers that have gone wrong and it's a million times worse being left halfway between both!
It's unlikely any system is truly active/active across all of its parts.
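In miniature, the check you want before promoting the standby copy is something like this (figures and names invented, just to show why "flick the switch" can lose data):

def safe_to_promote(primary_last_commit, standby_applied_commit,
                    standby_consistent, max_lag=0):
    # Promote the standby only if it has applied everything and passes its
    # own consistency check; otherwise you go live on a broken copy.
    lag = primary_last_commit - standby_applied_commit
    if not standby_consistent:
        return False, "standby copy failed its consistency check"
    if lag > max_lag:
        return False, "standby is %d transactions behind -- would lose data" % lag
    return True, "standby can be promoted"

if __name__ == "__main__":
    ok, reason = safe_to_promote(primary_last_commit=1204331,
                                 standby_applied_commit=1204327,
                                 standby_consistent=True)
    print(ok, reason)   # False: four in-flight transactions would be lost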
-
Friday 8th March 2013 00:19 GMT Anonymous Coward
"Its unlikely any system is truly active/active across all of its parts"
That is the beauty of the System z though. It is not a distributed system which requires 8 HA software layers from 8 different vendors, none of which are aware of each other, to be perfectly in sync. It is truly active/active across all of its parts because there are not that many parts, or third party parts to sync. Parallel sysplex, IBM mainframe HA, manages the whole process.
-
Friday 8th March 2013 01:20 GMT david 12
banking group’s disaster-recovery procedures
This was the bank where the disaster-recovery procedures went disastrously wrong.
I'm going to at least consider the possibility that this time they were told they COULD NOT start a disaster-recovery procedure until everything was turned off and backed up.
-
Thursday 7th March 2013 15:12 GMT Dan 55
"procedures should have kicked in straight away without a glitch in critical services"
These would be the disaster-recovery procedures that RBS said were given a good seeing to after last year's balls up so this kind of thing would never happen again.
(I know a hardware problem and a batch problem aren't related, but the procedures that they follow if something happens which brings down the bank's service probably are.)
-
Thursday 7th March 2013 23:20 GMT Anonymous Coward
Re: "procedures should have kicked in straight away without a glitch in critical services"
Disaster recovery doesn't help a damn if (a) your data is replicated immediately to DR and is buggered or (b) it's quicker to recover in the same site than to flip to the remote site.
The assumption some people have that DR is a simple flick to a remote site and everything springs to life in 5 minutes is horribly, horribly flawed in so many ways.
-
Friday 8th March 2013 09:22 GMT SE
Re: "procedures should have kicked in straight away without a glitch in critical services"
"The assumption some people have that DR is a simple flick to a remote site and everything springs to life in 5 minutes is horribly, horribly flawed in so many ways."
Indeed, though vendors often like to give the impression that this is how their solutions work.
-
-
-
Thursday 7th March 2013 15:27 GMT Anonymous Coward
Alternatively...
...mainframe is duplicated across 2 towers. Or more.
Part 1: Tower 1 taken out for patching/maintenance/a laugh/upgrade.
Part 2: At the same time, somebody yanks the 3-phase on Tower 2 by mistake. As it was night time, it was probably to plug in a hoover (yes, I know... it's a joke). Or the main breaker blows. Or the entire DC where Tower 2 resides goes on holiday to the Bermuda Triangle.
Part 3: Royal shitstorm getting Tower 2 back online after an unclean shutdown, with everything rolled back, then rolled forward again. It was probably noticed *in milliseconds* when it went down, but "switching it on and off again" doesn't really work on a transactional Z. It takes time.
Part 4: Parallel recovery is to cancel the work on Tower 1 and get it back online, while rolling forward all the "in transit" stuff from Tower 2.
Part 5: Meanwhile CA7 guys run about cancelling/rescheduling batch jobs.
Edit: Part 6: All the secondary services are restarted - mainly the application server layer, and any rollbacks are replayed now the back end is back.
-
Thursday 7th March 2013 15:52 GMT Anonymous Coward
Tandem
I don't know much about these things but I recall many years ago a relative working in banking IT who mentioned that some of the banks used Tandem hardware that provided a continuous and automatic redundancy, but the machines cost a lot more. As I said, I know nothing about mainframes but these days do all mainframes provide such features or not? If not, is that why we get such incidents? Any insights most welcome.
-
Thursday 7th March 2013 15:57 GMT Anonymous Coward
Re: Tandem
Just found this on Wiki, which partly answers my question. Tandem no longer seems to exist as an independent company, but it perhaps proves my point about the cheaper options: shouldn't the use of Tandem-like solutions cope with these failures and not require a committee to have a conference call about the failure?
"Tandem Computers, Inc. was the dominant manufacturer of fault-tolerant computer systems for ATM networks, banks, stock exchanges, telephone switching centers, and other similar commercial transaction processing applications requiring maximum uptime and zero data loss. The company was founded in 1974 and remained independent until 1997. It is now a server division within Hewlett Packard.
Tandem's NonStop systems use a number of independent identical processors and redundant storage devices and controllers to provide automatic high-speed "failover" in the case of a hardware or software failure.
To contain the scope of failures and of corrupted data, these multi-computer systems have no shared central components, not even main memory. Conventional multi-computer systems all use shared memories and work directly on shared data objects. Instead, NonStop processors cooperate by exchanging messages across a reliable fabric, and software takes periodic snapshots for possible rollback of program memory state."
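As far as I can make out from that description, the concept is roughly the following (a crude sketch of the idea only, certainly not Tandem/HP code):

class BackupProcess:
    def __init__(self):
        self.checkpoint = None

    def receive_checkpoint(self, state):
        # The only coupling is this message -- no shared memory to corrupt.
        self.checkpoint = dict(state)

    def take_over(self):
        # Resume from the last state the primary told us about.
        return dict(self.checkpoint) if self.checkpoint else {}

class PrimaryProcess:
    def __init__(self, backup):
        self.state = {"balance": 100}
        self.backup = backup

    def apply(self, amount):
        self.state["balance"] += amount
        self.backup.receive_checkpoint(self.state)   # checkpoint after each step

if __name__ == "__main__":
    backup = BackupProcess()
    primary = PrimaryProcess(backup)
    primary.apply(-25)
    # primary "fails" here; the backup resumes from the checkpointed state
    print(backup.take_over())   # {'balance': 75}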
-
-
Thursday 7th March 2013 17:27 GMT hugo tyson
Re: Tandem
Yeah, but the main thing about Tandem/HP NonStop systems is every CPU is duplicated, all memory is duplicated, and for every operation if the two results don't match the (dual)CPU in question STOPS. It's very keen on stopping; it's only a huge mound of failover software and redundant power and duplication that makes a *system* very keen on continuing; individual parts stop quite readily.
Of course, the intended market is OLTP, so the goal is to make sure that the decrement to your bank balance is the right answer; if two paired hardware CPUs and their memory give different answers, that pair of CPUs stops and a whole 'nother hardware set attempts the same transaction.
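In pseudo-code the "keen on stopping" bit amounts to no more than this (illustrative only; the real comparison happens in hardware on every operation):

class PairStopped(Exception):
    pass

def lockstep(result_a, result_b):
    # Compare the two halves of the pair; any disagreement halts this pair
    # and a whole other pair retries the transaction.
    if result_a != result_b:
        raise PairStopped("results diverged -- this pair stops, another carries on")
    return result_a

if __name__ == "__main__":
    print(lockstep(100 - 30, 100 - 30))   # 70, both halves agree
    try:
        lockstep(100 - 30, 100 - 31)      # simulated hardware fault on one half
    except PairStopped as stop:
        print(stop)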
-
-
-
Thursday 7th March 2013 17:23 GMT Phil O'Sophical
Re: Tandem
Tandem hardware was Fault Tolerant, not Highly Available. There were other players, like Stratus and Sun, in that area.
FT hardware duplicates the systems inside a single box, perhaps three CPUs, three disk controllers, three network cards, etc. They all do the same work on the same data, and if they get different results there's a majority vote to decide who's right.
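In software terms the vote is nothing more exotic than this (purely illustrative; real FT kit does it in hardware per instruction or per I/O):

from collections import Counter

def vote(results):
    # Take whichever value at least two of the three units agree on.
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority -- all three disagree, the system stops")
    return value

if __name__ == "__main__":
    print(vote([42, 42, 42]))   # all units healthy
    print(vote([42, 41, 42]))   # one faulty unit outvoted, service continues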
It provides excellent protection against actual hardware failure, like a CPU or memory chip dying, but offers no protection at all against an external event, or operator error. Just like using RAID with disks, which protects against disk failure, but it isn't a replacement for having a backup if someone deletes the wrong file by mistake.
It is expensive: you're paying for three systems but getting the performance of one, and given the reliability of most systems these days it isn't used much outside the aviation/space/nuclear/medical world, where even the time to switch over to a backup can be fatal. There's a reason that none of the companies who made FT systems managed to survive as independent entities.
-
Friday 8th March 2013 00:53 GMT Anonymous Coward
Re: Tandem
"As I said, I know nothing about mainframes but these days do all mainframes provide such features or not? If not, is that why we get such incidents?"
Yes, IBM mainframe's parallel sysplex is the gold standard in HA - basically the system upon which all other clusters have been based. The systems read/write I/O in parallel so both systems A and B (or more than two if you choose) have perfect data integrity and can process I/O in parallel. If one of those systems goes away, the others continue handling I/O with no disruptions. There is also a geographically dispersed parallel sysplex option which can provide out-of-region DR, in case the data center blows up or something, at wire speed with log shipping; it is also active/active, but it takes a few seconds, literally, before the I/O on the wire is written and the DR site takes over. In theory, we should never get such incidents, but, like anything, people can misimplement the HA solution... which seems to have happened here.
-
-
-
Thursday 7th March 2013 16:28 GMT Anonymous Coward
You Hit the Nail
RBS had to "retain talent" in the trading rooms and pay them several 100k of bonus per year. And let them bet the entire bank to get the short term results for obtaining said bonuses. On the long run it crashed the bank and "in order to become profitable again", experienced British engineers and specialists were replaced by Indians with 1/10th of wage and 1/100th of experience/skill/actual value.
But you know what ? That is the whole purpose of modern finance - suck the host white until it is dead, then leave the carcass for the next host. IT people are considered part of the host organism.
Grab yourself a history book and see how that played out between 1929 and 1945.
Picture of firebombed city.
-
-
-
-
Friday 8th March 2013 08:42 GMT Silverburn
Re: Prevented millions from accessing their accounts?
9pm at night with no card transactions and no ATM. Anyone out for dinner or drinks using RBS was knackered.
That would limit it to politicians and Traders, as everyone else in the country is too skint to eat out midweek. Then again, the politicians probably wouldn't be paying for it anyway - it's all on "expenses". So it's just the traders then. No biggie.
-
-
-
Thursday 7th March 2013 16:12 GMT FutureShock999
High Availability and Resilience
This should NOT have been about DR or backups. This should have been handled as part of any high-availability, RESILIENT cluster system design. I've designed and architected HA on IBM SP2 supercomputer clusters and can well attest that it works - our "system test" was walking the floor of the data centre randomly pulling drive controller cables and CPU boards out of their sockets, while the core systems kept running processes without failing! And that was 10+ years ago - I find it appalling that a live banking system would not be engineered to have the same degree of _resilience_. Don't talk in terms of how many minutes of downtime it will have per year - it should be engineered to tolerate the failure of x number of disks, y number of controllers, and z number of processors within a chassis/partition/etc. before service fails. For a live financial system, those should be the metrics that are quoted, not reliability alone.
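Quoting it the way I mean would look something like this (numbers invented, just to show the shape of the metric):

# How many of each component class the design survives losing before the
# service fails -- that is the figure to quote, not minutes per year.
TOLERATES = {"disks": 2, "disk_controllers": 1, "processors": 1}

def still_up(failures):
    # True if every failure count is within what the design tolerates.
    return all(failures.get(part, 0) <= limit for part, limit in TOLERATES.items())

if __name__ == "__main__":
    print(still_up({"disks": 1}))                         # True
    print(still_up({"disks": 1, "disk_controllers": 2}))  # False: design exceeded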
-
Thursday 7th March 2013 16:44 GMT Jim McCafferty
Just Joshing...
The Government have said they will sell their holding in RBS when the stock price reaches a certain level. What if someone decided they didn't like that idea? This little incident will put that sale in doubt.
After the last mainframe blow out - one would have thought the place would have been running a bit better - I take it other banks aren't experiencing similar outages?
-
Thursday 7th March 2013 16:54 GMT Anonymous Coward
A mainframe hardware fault !
"A hardware fault in one of the Royal Bank of Scotland Group's mainframes prevented millions of customers from accessing their accounts last night.
Assuming this were the case, they must have multiple redundent systems, mustn't they? On the other hand, maybe someone ran the wrong backup procedure ... again ! ! !
"This fault may have been something as simple as a corrupted hard drive, broken disk controller or interconnecting hardware."
No, mainframes have multiple harddrives, disk controllers and error detection and correction circuits ...
-
Friday 8th March 2013 00:58 GMT Anonymous Coward
Re: A mainframe hardware fault !
"No, mainframes have multiple harddrives, disk controllers and error detection and correction circuits ..."
Yes, and they are clustered systems, so if one system bombs, even with all the fault tolerant architecture, another system in the cluster (or parallel sysplex in mainframe vernacular) should pick up the load. As with any cluster, you may take a performance hit, but it should never just go down.
-
-
Thursday 7th March 2013 18:21 GMT despairing citizen
Banks Fail Again
Unless their data centre was a smoking hole in the ground, outages of live systems are unacceptable.
Even if their data centre was nuked, the bank should have continued running its live services from an alternate location, with minimal "down time".
The bank is paid very handsomely by its customers for services, and "off-lining" several billion pounds of the UK economy for 3 hours is completely unacceptable.
Whilst normally I personally think less legislation is a "good thing", HMG really needs to kick the regulator to remind them that being a "fit and proper" organisation to hold a banking licence should include whether they can actually deliver the service reliably.
-
Thursday 7th March 2013 19:21 GMT Matt Bryant
Probable key factor in the outage - "IBM mainframes don't fail!"
I have had (stupid) people say to me statements like "Here, this is our DR plan, but don't worry about reading it, we have an IBM mainframe and IBM told us it will never fail" (note - IBM are VERY careful not to make that legally binding statement in their sales pitch, but they are happy to leave you with the impression). I have called directors at three in the morning to tell them the bizz is swimming in the brown stuff because we have had a mainframe stop/pop/do-the-unexpected, and after a few moments of bewildered silence at the other end of the line you get those immortal words: "But it's a mainframe....?" Half the problem is people are so lulled by the IBM sales pitch they just don't stop to think ANYTHING MANMADE IS FALLIBLE, so when something does go wrong there is an inertia due to an inability to accept the simple fact stuff breaks, whether it has an IBM badge or not. I bet half the delay in solving the RBS outage was simply down to people getting past that inertia.
-
Friday 8th March 2013 01:40 GMT Anonymous Coward
Re: Probable key factor in the outage - "IBM mainframes don't fail!"
"Half the problem is people are so lulled by the IBM sales pitch they just don't stop to think ANYTHING MANMADE IS FALLIBLE"
Yes, anything manmade is fallible. The IBM mainframe is fault-tolerant: redundant hardware which can be used dynamically in the case of a component failure. But it is also a clustered system, parallel sysplex. Parallel sysplex is in place specifically because the systems might fail for whatever reason, e.g. hw failure, software error, the data center blows up. I/O is processed in parallel across multiple systems so if one is unavailable, the other mainframes can immediately pick up the I/O. The IBM coupling facilities which make it possible for server time protocols to work in parallel are brilliant. No system hardware failure should ever take down a mainframe environment, unless you implemented parallel sysplex incorrectly. It is like having Oracle RAC implemented incorrectly and blaming the outage on a single server failure. If RAC is implemented correctly, a server failure should not matter. I highly doubt any IBM rep told anyone the mainframe is infallible and never goes down at a hardware level, if for no other reason than they wanted to sell parallel sysplex software.
"I have called directors at three in the morning to tell them the bizz is swimming in the brown stuff because we have had a mainframe stop/pop/do-the-unexpected, and after a few moments of bewildered silence at the other end of the line you get those immortal words: "But it's a mainframe....?"
I doubt that ever happened, but, if it did, they asked the right question. If properly implemented, that should never happen. Much like if you were to call your Director at three in the morning to tell them that the RAC cluster is down because a server failed, they would say "But it's a RAC cluster.....?"
-
Friday 8th March 2013 08:36 GMT Matt Bryant
Re: Probable key factor in the outage - "IBM mainframes don't fail!"
".....Yes, anything manmade is fallible....." But - don't tell me - IBM mainframes are made by The Gods, right?
"..... IBM mainframe is fault tolerant, redundant hardware....." Ignoring my own experience, this story goes to show you are completely and wilfully blind!
"......No system hardware failure should ever take down a mainframe...." So the event never happened, it was all just a fairy tale, right? You know I mentioned stupid people earlier that said stuff like "forget DR, it's on an IBM mainframe", well please take a bow, Mr Stupid.
-
Saturday 9th March 2013 00:34 GMT Anonymous Coward
Re: Probable key factor in the outage - "IBM mainframes don't fail!"
"Ignoring my own experience, this story goes to show you are completely and wilfully blind!"
Look at a System z data sheet. Every critical component is triple redundant. That certainly doesn't mean, in and of itself, that the system can't go down. It just means a hardware component failure is less likely to take down a system than a hardware failure in a system which is not fault tolerant. It is not an HA solution at all... which is why IBM created parallel sysplex, the HA solution.
"So the event never happened, it was all just a fairy tale, right? You know I mentioned stupid people earlier that said stuff like "forget DR, it's on an IBM mainframe", well please take a bow, Mr Stupid."
I didn't say this event didn't happen. I said that if parallel sysplex had been implemented correctly, it would be impossible for a hardware failure in a single mainframe to take down the cluster. It is possible that RBS did not have parallel sysplex on this application or that it was not implemented correctly. No individual system failure *should* ever take down a mainframe environment is what I wrote, that is assuming you have IBM HA solution in place. If you don't have the HA solution in place, sure, mainframes can go down like any other system... less likely than an x86 or lower end Unix server due to its fault tolerance, but it certainly can happen if it is stand alone. My point was, as this was clearly ultra mission critical, why wasn't parallel sysplex implemented? It should have been done as a matter of course, every other major bank that I know of runs their ATM apps in parallel sysplex, most in geographically dispersed parallel sysplex.
-
Saturday 9th March 2013 10:57 GMT Matt Bryant
Re: AC Re: Probable key factor in the outage - "IBM mainframes don't fail!"
"Look at a System z data sheet......" AC, the data sheet is just part of the IBM sales smoke and mirrors routine - "it can't fail, it's an IBM mainframe and the data sheet says it is triple redundant". You're just proving the point about people that cannot move forward because they're still unable to deal with the simple fact IBM mainframes can and do break. The data sheet is just a piece of paper, the RBS event is reality, you need to understand the difference. Fail!
-
Sunday 10th March 2013 00:52 GMT Anonymous Coward
Re: AC Probable key factor in the outage - "IBM mainframes don't fail!"
"The data sheet is just a piece of paper, the RBS event is reality, you need to understand the difference. "
You need to understand the difference between hardware level fault tolerance and high availability. Two different concepts.
IBM mainframes run most of the world's truly mission critical systems, e.g. banks, airlines, governments, etc. To my knowledge, these all run in parallel sysplex without exception. If anyone thought that a mainframe didn't go down just because the hardware was built so well/redundant, there would be no point in all of these organizations implementing parallel sysplex. Even if you have a 132 way redundant system, it will still likely need to be taken down for OS upgrades or another software layer upgrade that requires an IPL. Not having a hardware issue because of hardware layer redundancy is only one small part of HA.
-
Sunday 10th March 2013 17:54 GMT Matt Bryant
Re: AC Re: AC Probable key factor in the outage - "IBM mainframes don't fail!"
".....You need to understand the difference between hardware level fault tolerance and high availability. Two different concepts....." No, what YOU need to understand is both mean SFA to the business, what matters to them is that they keep serving customers and making money. The board don't give two hoots how I keep the services running, be it by highly available systems or winged monkeys, they really don't give a toss as long as the money keeps rolling in. RBS had a service outage, reputedly because of a mainframe hardware issue, and it cost them directly in lost service to customers and indirectly in lost reputation, simple as that. You can quote IBM sales schpiel until you're blue in the face, it doesn't mean jack compared to the headlines. Get out of the mainframe bubble and try looking at how the business works.
-
-
-
-
-
Friday 8th March 2013 09:44 GMT Anonymous Coward
Re: Probable key factor in the outage - "IBM mainframes don't fail!"
> If properly implemented, that should never happen.
I hope to God you're never implementing systems I have to rely on.
Let me guess, your code also has lots of:
/* We can never get here */
return;
> Much like if you were to call your Director at three in the morning to tell them that the RAC cluster is down because a server failed, they would say "But it's a RAC cluster.....?"
And we all know that RAC clusters never, ever, go down. It's amazing that Oracle even bothers to sell support for them isn't it?
-
Saturday 9th March 2013 00:48 GMT Anonymous Coward
Re: Probable key factor in the outage - "IBM mainframes don't fail!"
"And we all know that RAC clusters never, ever, go down. It's amazing that Oracle even bothers to sell support for them isn't it?"
Yes, it is possible to have some error in the clustering software which takes down the entire cluster, be the clustering software RAC, Hadoop, or Parallel Sysplex. *But that is not where RBS said the issue occurred.* If they had said "a parallel sysplex issue" and not a "hardware failure", then it is possible that they had the right architecture but the software bugged out or was improperly implemented. My point is: this application clearly should have been running in parallel sysplex as an ultra-mission-critical app. That is the architecture for nearly all mainframe apps. Therefore, saying a "hardware failure" caused their entire ATM network and all other transactional systems to go down makes no sense. They were either not running this in sysplex, in which case... why not, or they did not report the issue correctly and it was much more than a "hardware failure."
-
-
-
Friday 8th March 2013 14:58 GMT Roland6
Re: Probable key factor in the outage - "IBM mainframes don't fail!"
"I have called directors at three in the morning to tell them the bizz is swimming in the brown stuff because we have had a mainframe stop/pop/do-the-unexpected"
There was a time when the Director would have called you to ask why they had to get the news of a fault from IBM and not from their own IT organisation...
Either IBM customer service has gone downhill or they've decided it's better business to be friends with the IT organisation.
-
Friday 8th March 2013 15:50 GMT Matt Bryant
Re: Roland6 Re: Probable key factor in the outage - "IBM mainframes don't fail!"
".....There was time when the Director would of called you to ask why they had to get the news of a fault from IBM and not their own IT organisation..." If you're implying the typical IBM response was to worry about cuddling up to senior management rather than fixing the problem then nothing has changed. But you should also know it is the first rule of BOFHdom that you should always know more than those above you. Dial-home services and the like should always have the BOFH as contact so you are in control of the flow of information uphill, so as to make sure that when the brown stuff comes rolling downhill it is not on your side. Your role has probably already been short-listed for being outsourced if you haven't mastered such basics.
-
-
-
Friday 8th March 2013 09:19 GMT dbbloke
Failover
So they didn't have:
Some kind of HDR (high availability server) to seamlessly swap over to?
A SDS secondary shared disk server to failover to?
An Enterprise replication machine somewhere?
A RSS Remote standby machine, in another location or even the cloud?
Probably only Informix does this well (and would give me some work); Oracle tries but it's problematic.
I wonder if it was a database or application issue, or what. Doesn't sound like a network issue. SANs are bulletproof as well. I would assume there is a more sinister reason, given the state of banking.
Mainframe - no wonder it fails; almost nobody is alive who knows how to maintain the OS. I've tried and it's super command-line unfriendly.
Banks are sadly expertless. I know loads with terrible DBAs running Mickey Mouse systems with no failover. The more you know, the more you wonder how it works AT ALL.
-
Friday 8th March 2013 09:22 GMT Anonymous Coward
Probably too scared to take immediate action
The operators were probably unwilling to make any failover call following the almighty bollocking they will have received after last year's fubar.
They will (quite rightly) have kicked the decision up the chain to those earning the salary for having the responsibility.
-
Friday 8th March 2013 09:31 GMT 1052-STATE
These aren't small shops... they're mainframe *complexes*
A fair amount of tosh is being written here - such as the claim that ALL failovers are non-immediate. I ran some of the world's largest realtime systems (banking, airlines) for 15 years, and it's imperative an immediate, seamless failover is there the second you need it. Realtime loads were switched from one mainframe complex to another on a different continent in less than five seconds - with zero downtime.
See "TPF" on Wikipedia. (aka Transaction Processing Facility)
-
Friday 8th March 2013 10:50 GMT Anonymous Coward
I am a broadcast engineer, not an IT guy, and I've seen some IT guys spectacularly fail under pressure. Accidentally cut off services to an entire country? Fine, don't run around like a headless chicken; get it working and then you can stress, not the other way around. Also, sometimes the answer isn't to fix the problem but to just get the system working: you are providing a critical service to the public, you can fix the problem later. Sometimes getting it working does involve fixing the problem, sometimes you just need to patch around it and schedule the fix. It isn't amateur bodging, it is maintaining a critical service at all costs.
I previously worked for a major broadcaster's technology division; the broadcaster wanted to reduce its headcount, talk of "leveraging" etc., and we were sold to a major IT outsourcing company. Now, although the sale was supposed to buy the IT and phones, they saw "broadcast communications" and someone wet themselves with excitement. Massive connectivity infrastructure, lots of racks of equipment, 24x7 operation with flashy consoles and, most importantly of all, high-margin contracts: an IT director's wet dream (it was cool). So they asked the broadcaster if they could also take that department in the same purchase. "Are you sure?... Okay." What the IT outsourcing people didn't realise was that with valuable contracts came great responsibility. We never had *any* measurable outages; changeovers happened in a flash. Hardware resilience: n+1? No thanks, we'll have 2n or at least 3a+2b. Resilient power? Grid, gas turbine, diesel & UPS, plus manual bypass changeover switches!
The thing was, some of this isn't unfamiliar to IT people who do real DR, but what created the biggest fuss? They refused to acknowledge that the IT response time for some users (the 24x7x365 ops team) had to be less than 4 hours. Surely you can wait 4 hours to get your email back? Surely you can do without your login for a few hours? Your Exchange account has zero size and can't send mail? Can you send us a mail to report the fault?
If the people supporting you don't understand you, then how can you be effective?
-
Friday 8th March 2013 14:35 GMT Chris007
@AC 10:50 GMT
"Also, sometimes the answer isn't to fix the problem but to just get the system working, you are providing a critical service to the public, you can fix the problem later"
Having been at the sharp edge (in a certain mega large organisation) I can tell you that 99% of teccies would like to take this course of action but 99% of the time they are stopped by [glory hunting] managers.
We have a name for them - "Visibility Managers".
They didn't want anything happening until very senior managers had seen them involved, so they could take all the credit. Once the very senior managers had disappeared (fault worked around or fixed etc.) the "Visibility Manager" would very quickly become the "Invisibility Manager" and f**k off.
-
-
Friday 8th March 2013 12:45 GMT Mick Sheppard
Redundant != no outages
I worked at a place that ran their databases from a tier 2 storage array. This had redundant everything, dual controllers, power supplies, paths to disk, paths to the SAN etc.
We had disk failures that the system notified us about, and we hot-replaced the disks with the array re-laying out the data dynamically. We had a controller failure that we were notified about and the engineer came to replace, again without an outage.
We then had two separate incidents that caused complete outages. The first was a disk that failed in a way that for some reason took out both controllers. It shouldn't happen but did. The second was down to a firmware issue in the controllers that under a particular combination of actions on the array caused a controller failure. With both controllers running the same firmware the failure cascaded from one to the other and took out the array.
So, whilst it's trendy to be cynical, these complex redundant systems aren't infallible, and when they do fail it can take a while to work out what has happened and what needs to be done to get things operational again.
-
Saturday 9th March 2013 01:01 GMT Anonymous Coward
Re: Redundant != no outages
Definitely, I think people are confusing fault tolerance with high availability. There is overlap, but they are different concepts.
Fault tolerance just means a bunch of extra hardware is in place so that if a NIC, or whatever, fails, another will pick up for it. It says nothing about downtime other than that you have added protection in the single category of hardware failures. If you need to upgrade the OS, even on the most fault-tolerant system known to man, it will likely require an outage. That is why you need an HA solution in place, if no downtime is a requirement. A high-availability solution will be running a parallel system with real-time data integrity so that it can immediately pick up I/O if another system in the HA environment goes down, either scheduled or unscheduled. For instance, Tandem NonStop was supremely fault tolerant, but not necessarily highly available. Google's home-brew 1U x86 servers have zero fault tolerance, but their Hadoop cluster makes it a highly available environment. IBM mainframe has both: it is fault-tolerant hardware, but you can also add parallel sysplex, which provides high availability.
-
-
Friday 8th March 2013 14:46 GMT Roland6
Scary the lack of any real knowledge being shown here
"I work somewhere that has a much smaller IT department and a much smaller IT budget than RBS, but it would take the failure of multiple hardware devices to knock a key service out. What kind of Mickey mouse setup do they have that a hardware failure can take down their core services for hours?"
From reading the comments, I'm concerned about the total lack of any real knowledge of real world enterprise computing demonstrated by many and hence the above comment would seem to be the sub-text to many comments.
Setting up and running an IBM Parallel Sysplex with only 6 zSeries in it, distributed across 3 sites, was complex, let alone 14+. Plus I suspect that not all systems were running at capacity, mainly due to the hardware and software licensing costs (believe it or not, for some software you pay not for the CPU it actually runs on but for the TOTAL active CPU in the Sysplex), hence it would have taken time to call the engineers out, bring additional capacity on-line, move load within the Sysplex and confirm all is well before re-opening the system to customers; that is assuming the fault really was on a mainframe and not on a supporting system. Also it should not be assumed that the mainframe that failed was only running the customer accounts application, hence other (potentially more critical) applications could also have failed. From companies I've worked with, 2~3 hours to restore the mainframe environment to 'normal' operation, out of hours, would be within SLA.
Yes, with smaller systems, with significantly lower loads, operating costs and support requirements, different styles of operation are possible to achieve high availability and low failover times.
-
Saturday 9th March 2013 03:11 GMT Anonymous Coward
Re: Scary the lack of any real knowledge being shown here
While it is costly and complex, you can definitely have a real-time failover with parallel sysplex, even with extreme I/O volumes. PS was built for that purpose. Do you mean 2-3 hours to restore sysplex equilibrium while the apps stay online, or 2-3 hours with the system completely down, provided it's out of hours?
-