'Mainframe blowout' knackered millions of RBS, NatWest accounts

A hardware fault in one of the Royal Bank of Scotland Group's mainframes prevented millions of customers from accessing their accounts last night. A spokesman said an unspecified system failure was to blame after folks were unable to log into online banking, use cash machines or make payments at the tills for three hours on …

COMMENTS

This topic is closed for new posts.
  1. IT Hack

    Nice to see RBS not using ITIL. *cough*

    1. Anonymous Coward
      Anonymous Coward

      Crap Bonuses = Only Crap Staff Stay = Crap Services...

      1. Anonymous Coward
        Anonymous Coward

        Fuck off. Like the people getting paid millions in bonuses have got anything to do with the actual day to day running of the bank's IT services. Apart from taking the credit when things don't go wrong, obv...

  2. Rampant Spaniel

    Is it better that it is a new failure rather than a repeat of an old one?

    1. Anonymous Coward
      Anonymous Coward

      Yes.

      Next question.

      1. jai

        to add some detail, if it's the same one, that would suggest that no one did any due diligence, or lessons learned, or root cause analysis or any of a dozen other service delivery buzzwords that all basically mean, "wtf happened and how do we make sure it doesn't happen again?"

        you'll always get new issues that break things. such is the way of IT. no system is 100% perfect. you just have to put in as much monitoring/alerting and backup systems as you can afford to ensure any impact from outage to your business critical systems is as minimal as possible.

    2. Anonymous Dutch Coward
      Coat

      New failure better than repeat failure?

      Yep, they're innovating.

      1. Anonymous Coward
        Anonymous Coward

        This failure is *not* related to the previous one.

        Oh no.

        We are capable of lots of different failures.

  3. Bumpy Cat
    FAIL

    I doubt it

    I work somewhere that has a much smaller IT department and a much smaller IT budget than RBS, but it would take the failure of multiple hardware devices to knock a key service out. What kind of Mickey mouse setup do they have that a hardware failure can take down their core services for hours?

    1. IT Hack

      Re: I doubt it

      Indeed. My point re ITIL...

      It beggars belief that this has happened. Oh wait...no actually it doesn't. Seen a simple electrical fault take down a Tier Two dc...hit 6 companies hosting live revenue generating services. One company nearly went tits up.

      Of course at RBS I expect a director to be promoted to a VP type position for this cock up.

      1. TheVogon
        Mushroom

        Re: I doubt it

        You mean demoted? Director is higher than VP.

        1. Anonymous Coward
          Anonymous Coward

          Re: I doubt it

          "You mean demoted? Director is higher than VP."

          Well in a sane world that might be true but one very large (70000+) UK company I worked for had :

          minions

          me

          associate director

          director

          VP

          Head of dept

          Head of Research

          Board level directors

    2. Anonymous Coward
      Anonymous Coward

      Re: I doubt it

      It would take the failure of multiple pieces of hardware to take down an IBM zServer, but that doesn't mean it can't happen. The only thing you can be sure of with a system is that the system will eventually fail.

      To accuse them of a "Mickey mouse" operation suggests that you've no idea how big or complex the IT setup at RBS is. I believe they currently have the largest "footprint" of zServers in Europe, and that's before even mentioning the vast amount of other hardware on a globally distributed network.

      Small IT = Easy.

      Big IT = Exponentially more complicated.

      1. S4qFBxkFFg

        Re: I doubt it

        Incidentally, this could be another argument that RBS is just too big.

        ... and why the fuck are they allowed to have BOTH a banking licence and limited liability? ... mutter mutter .... moan ...

        1. John Smith 19 Gold badge
          Childcatcher

          Re: I doubt it

          "... and why the fuck are they allowed to have BOTH a banking licence and limited liability? ... mutter mutter .... moan ..."

          You forgot that UK banks have "preferred creditor" status, so are one of the first in line if a company is declared bankrupt. Because it's to protect the widows, orphans and other children (hence my icon).

          Which can be when a bank asks them to repay their overdraft now for example.

          1. Brad Ackerman
            Stop

            Re: I doubt it

            That was an episode of Monk.

      2. Phil O'Sophical Silver badge

        Re: I doubt it

        > It would take failure of multiple pieces of hardware to take down an IBM zServer, that doesn't mean it can't happen.

        It also assumes someone noticed the first failure. I remember our DEC service bod (it was a while ago :) ) complaining about a customer who'd had a total cluster outage after a disk controller failed. Customer was ranting & raving about the useless "highly available" hardware they'd spent so much money on.

        Investigation showed that one of the redundant controllers had failed three months before, but none of the system admins had been checking logs or monitoring things. The spare controller took over without a glitch, no-one noticed, and it was only when it failed that the system finally went down.

        1. jai

          Re: I doubt it

          i was once told an anecdote. the major investment bank they worked for failed in its overnight payment processing repeatedly, every night. Eventually they determined it was happening at exactly the same time each night. so they upgraded the ram/disks. patched the software. replaced the whole server. nothing helped.

          Finally, head of IT decides enough is enough, takes a chair and a book to the data centre, sits in front of the server all night long to see what it's doing.

          and at the time when the batch failed every night previously, the door to the server room opens, the janitor comes in with a hoover, looks for a spare power socket, finds none, so unplugs the nearest and plugs in the hoover. yes, you guessed it, plug that was unplugged was the power lead for the server in question.

          just because you're a big firm, doesn't mean you don't get taken out by the simplest and stupidest of things

          1. Gaius
            WTF?

            Re: I doubt it

            And no-one noticed the reboots in the syslog? No-one noticed the uptime looked a bit funny when they ran top (or equivalent)? No-one came in the next morning and wondered why their remote logins had dropped?

            I call shenanigans.

          2. Anonymous Dutch Coward
            Facepalm

            Re: I doubt it

            Recycled urban legend methinks. I heard the one about a server stuck under a desk with the janitor etc.

            1. Gaius
              FAIL

              Re: I doubt it

              Not to mention that all the kit in a DC is powered by sockets *inside the racks*.

          3. Anonymous Coward
            Anonymous Coward

            Urban Legend ...

            "i was once told an anecdote .. the door to the server room opens, the janitor comes in with a hoover, looks for a spare power socket, finds none, so unplugs the nearest and plugs in the hoover. yes, you guessed it, plug that was unplugged was the power lead for the server in question."

            I read something similar, only it was set in the ICU of a hospital and what they unplugged was the ventilator ...

            1. This post has been deleted by its author

            2. JimC
              Facepalm

              Re: Urban Legend ...

              Although I imagine the story has grown in the telling, the "handy power socket" is certainly something I've experienced first hand in end user situations. I remember in one office having to go round labelling the appropriate dirty power sockets "vacuum cleaners only" in order to try and prevent the staff plugging IT equipment into them, thus leaving the cleaner grabbing any handy socket on the clean power system for the vacuum cleaner...

            3. Iain 14

              Re: Urban Legend ...

              “I read something similar, only it was set in the ICU of a hospital and what they unplugged was the ventilator ...”

              Yup - famous Urban Legend. The hospital setting dates back to a South African newspaper "story" in 1996, but the UL itself goes back much further.

              http://www.snopes.com/horrors/freakish/cleaner.asp

          4. Matt Bryant Silver badge
            Facepalm

            Re: Jai Re: I doubt it

            "......and at the time when the batch failed every night previously, the door to the server room opens, the janitor comes in with a hoover, looks for a spare power socket, finds none, so unplugs the nearest and plugs in the hoover. yes, you guessed it, plug that was unplugged was the power lead for the server in question....." Yeah, and I have some prime real estate in Florida if you're interested. A Hoover or the like would be on a normal three-pin, whilst a proper server would be on a C13/19 or 16/32A commando plug. It would also have probably at least two PSUs so two power leads, unplugging one would not kill it.

          5. Fatman

            Re: so unplugs the nearest and plugs in the hoover.

            WROK PALCE prevents that one by using LOCKING plugs on ALL of its servers. (For those of you on the "other" side of the pond, locking plugs are completely incompatible with standard US power cords.)

        2. BrentRBrian
          Mushroom

          Re: I doubt it

          The biggest failure point on a Z is the idiots running loose around it.

          I asked a console operator on an Amdahl 470 what the button labelled IPL was for. He said "IDIOTS PUSH LOAD".

      3. Anonymous Coward
        Anonymous Coward

        Re: I doubt it

        "It would take failure of multiple pieces of hardware to take down an IBM zServer, that doesn't mean it can't happen. The only thing you can be sure of with a system is that the system will eventually fail."

        A hardware failure taking down a System z is supremely unlikely. I think the real world mean time for mainframe outages is once in 50 years. Even if it was a hardware failure (which would mean a series of hardware component failures all at the same time in the system), IBM has an HA solution in a class of its own: geographically dispersed parallel sysplex. You can intentionally blow up a mainframe or the entire data center in that HA design and it will be functionally transparent to the end user. A system might fail, but the environment never should.

      4. Duffaboy

        I've Said it before and witnessed it again this week

        Walking through a data center only 2 days ago I noticed a failed drive on a SAN box. No alerts, and no one doing a physical check every morning and afternoon either, so I'm not surprised.

      5. Dominic Connor, Quant Headhunter

        Re: I doubt it

        We used to have a Stratus. They worked on the principle that SysOps *do* forget to look at logs, the irony being that if all the components are individually reliable, then humans being humans won't worry about them so much.

        So Stratus machines phoned home when a part died and that meant an engineer turning up with a new bit before the local SysOps had noticed it had died.

        That's not a cheap way of doing things of course, but at some level that's how you do critical systems. When a critical component fails the system should attract the attention of the operators.

        That leads me back to seeing this as yet another failure of IT management at RBS.

        If the part failed, then there should have been an alert of such a nature that the Ops could not miss it. A manager might not write that himself, but his job is to make sure someone does.

        The Ops should be motivated, trained and managed to act rapidly and efficiently. Again this is a management responsibility.

        All hardware fails; all you can do is buy a lower probability of system failure, so the job of senior IT management at RBS is not, as they seem to think, playing golf and greasing up to other members of the "management team", but delivering a service that actually works.

        No hardware component can be trusted. I once had to deal with a scummy issue where a cable that lived in a duct just started refusing to pass signals along. The dust on the duct showed it had not been touched or even chewed by rats, it had just stopped. Never did find out why.
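
        A minimal sketch of the sort of "alert the Ops cannot miss when a part dies" check I mean (a toy Python sketch, purely illustrative; the component names and the alerting hook are invented, not any real product):

          # Illustrative only: poll the redundant parts and page someone the
          # moment one dies, instead of hoping a human reads the logs.
          import time

          # Hypothetical inventory; in real life this would come from the
          # vendor's management interface, not a hard-coded dict.
          COMPONENTS = {"disk_controller_a": True, "disk_controller_b": True}

          def component_healthy(name):
              # Stand-in for a real health probe (SNMP, IPMI, vendor CLI, ...).
              return COMPONENTS[name]

          def raise_alert(name):
              # Stand-in for the dial-home / paging step the Ops cannot miss.
              print(f"ALERT: {name} has failed - dispatch an engineer with a spare")

          def watch(poll_seconds=60):
              already_reported = set()
              while True:
                  for name in COMPONENTS:
                      if not component_healthy(name) and name not in already_reported:
                          already_reported.add(name)
                          raise_alert(name)
                  time.sleep(poll_seconds)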

    3. jonathanb Silver badge

      Re: I doubt it

      One where the PHB will only release funds for repairs when there is an actual service failure.

    4. Anonymous Coward
      Anonymous Coward

      Re: I doubt it

      "I work somewhere that has a much smaller IT department and a much smaller IT budget than RBS, but it would take the failure of multiple hardware devices to knock a key service out."

      Yes, and we are talking about a mainframe. It is near impossible to knock a mainframe offline with a simple "hardware failure." Those systems are about 14 way redundant in the first place, so it isn't as though an OSA or another component corrupted and knocked the mainframe offline. Even if the data center flooded or the system disappeared using magic, almost all of these mega-mainframes have a parallel sysplex/HyperSwap configuration which is a bulletproof HA design. If system A falls off the map, the secondary system picks up the I/O in real time, so why didn't that happen.... I am interested to hear the details.

  4. Rob
    Coat

    Sounds like...

    ... the first error was the PFY and this one is the BOFH.

    Give me my coat quick before someone lumbers out of the halon mist brandishing a cattle prod.

  5. M Gale

    I thought one of the features of a mainframe...

    ...was umpteen levels of redundancy? One CPU "cartridge" goes pop? Fine. Rip it out of the backplane and stuff another one in, when you've got one to stuff in there.

    Dual (or more) PSUs, RAID arrays.. and yet this happens. Oh well. Wonder what RBS's SLAs say about this?

    They do have SLAs for those likely-hired-from-someone-probably-IBM machines, don't they?

    1. Anonymous Coward
      Anonymous Coward

      Re: I thought one of the features of a mainframe...

      There are umpteen levels of redundancy, that doesn't mean that outages don't happen on occasion.

    2. Velv

      Re: I thought one of the features of a mainframe...

      Multiple hardware components are fine as long as it is a discreet hardware failure.

      Firmware, microcode or whatever you want to call it can also fail, and even when you're running alleged different versions at different sites they could have the same inherent fault.

      The only true way to have resilience is for the resilient components to be made by different vendors using different components (which is what Linx/Telehouse has with Jupiter, Cisco, Foundry and others for their network cores). IBM mainframes don't work this way

      1. Martin
        Headmaster

        ...so long as it is a DISCRETE hardware failure...

        I won't do my usual sigh - this one is a bit more subtle than your/you're.

        Discrete - separate

        Discreet - circumspect.

        To remember, the e's are discrete.

      2. Anonymous Coward
        Anonymous Coward

        Re: I thought one of the features of a mainframe...

        "The only true way to have resilience is for the resilient components to be made by different vendors using different components (which is what Linx/Telehouse has with Jupiter, Cisco, Foundry and others for their network cores). IBM mainframes don't work this way"

        Yeah, I suppose that is true... although you are more likely to have constant integration issues with many vendors in the environment, even if you are protected against the blue moon event of a system wide fault spreading across the environment. By protecting yourself against the possible, but extremely unlikely, big problem, you guarantee yourself a myriad of smaller problems all the time.

    3. Anonymous Coward
      Anonymous Coward

      Re: I thought one of the features of a mainframe...

      Except for 1 thing, the RBS mainframe is not "a mainframe", it's a cluster of (14 was the last number I heard) mainframes, all with multiple CPUs. This failure probably is not a single point of failure, it's a total system failure of the IT hardware and the processes used to manage it.

  6. Mike Smith
    FAIL

    I reckon the other source had it spot on

    "the bank’s IT procedures will in some way require system administrators to understand a problem before they start flipping switches."

    Naturally. However, let's not forget the best-of-breed world-class fault resolution protocol that's been implemented to ensure a right-first-time customer-centric outcome.

    That protocol means that a flustercluck of management has to be summoned to an immediate conference call. That takes time - dragging them out of bed, out of the pub, out of the brothel gentlemen's club and so on.

    Next, they have to dial into the conference call. They wait while everyone joins. Then the fun begins:

    Manager 1: "Ok what's this about?"

    Operator: "The mainframe's shat itself, we need to fail over NOW. Can you give the OK, please?"

    Manager 2: "Hang on a minute. What's the problem exactly?"

    Operator: "Disk controller's died."

    Manager 3: "Well, can't you fix it?"

    Operator: "Engineer's on his way, but this is a live system. We need to fail over NOW."

    Manager 4: "All right, all right. Let's not get excited. Why can't we just switch it off and switch it on again? That's what you IT Crowd people do, isn't it?"

    Operator: "Nggggg!"

    Manager 1: "I beg your pardon?"

    Operator: (after deep breath): "We can't just switch it off and on again. Part of it's broken. Can I fail it over now, please?"

    Manager 2: "Well, where's your change request?"

    Operator: "I've just called you to report a major failure. I haven't got time to do paperwork!"

    Manager 3: "Well, I'm not sure we should agree to this. There are processes we have to follow."

    Manager 4: "Indeed. We need to have a properly documented change request, impact assessment from all stakeholders and a timeframe for implementation AND a backout plan. Maybe you should get all that together and we'll reconvene in the morning?"

    Operator: "For the last bloody time, the mainframe's dead. This is an emergency!"

    Manager 1: "Well, I'm not sure of the urgency, but if it means so much to you..."

    Manager 2: "Tell you what. Do the change, write it up IN FULL and we'll review it in the morning. But it's up to you to make sure you get it right, OK"

    Operator: "Fine, thanks."

    <click>

    Manager 3: "He's gone. Was anyone taking minutes?"

    Manager 4: "No. What a surprise. These techie types just live on a different planet."

    Manager 1: "Well, I'm off to bed now. I'll remember this when his next appraisal's due. Broken mainframe indeed. Good night."

    Manager 2: "Yeah, night."

    Manager 3: "Night."

    Manager 4: "Night."

    1. Anonymous Coward
      Anonymous Coward

      Re: I reckon the other source had it spot on

      @Mike - that may well be what you think happens, but I've experienced financial services IT recovery management and it's a lot more along the lines of:

      Bunch of experts in the hardware, OS, software, Network, Storage and Backup get on call to discuss, chaired by a trained professional recovery manager.

      You tend to get panicky engineers who identified the problem saying a disk controller has died, and we must change it now, NOW, do you hear?

      The recovery manager will typically ask "Why did it fail, what are the risks of putting another one in, do we have scheduled maintenance running at the moment, has there been a software update, can someone confirm that going to DR is an option, are we certain that we understand what we're seeing? What is the likelihood of the remaining disk controller failing?"

      The last thing you want to do is failover to DR at the flick of a switch, because that may well make things worse. Let me assure you, this isn't the sort of situation where people bugger off back to bed before it's fixed and expect to have a job in the morning.

      1. IT Hack

        Re: I reckon the other source had it spot on

        AC 15:24 - this.

        Not only financial services btw.

        I'm not sure I miss those midnight calls... in some ways it was quite fun to sort shit out, but on the flip side the pressure to get it right first time is immense.

        However it's not just flipping the bit... it's also very much understanding the impact of that decision. If you fail over an entire DC you need to really be able to explain why...

      2. Mike Smith

        Re: I reckon the other source had it spot on

        "Bunch of experts in the hardware, OS, software, Network, Storage and Backup get on call to discuss, chaired by a trained professional recovery manager."

        Well, quite. That's exactly what should happen. Been there myself, admittedly not in financial services.

        I've seen it done properly, and it's precisely as you describe.

        And I've seen it done appallingly, with calls derailed by people who knew next to nothing about the problem, but still insisted on adding value by not keeping their traps shut.

        I guess I'm just too old and cynical these days :-)

        1. Field Marshal Von Krakenfart
          Meh

          Re: I reckon the other source had it spot on

          "Bunch of experts in the hardware, OS, software, Network, Storage and Backup get on call to discuss, chaired by a trained professional recovery manager."

          With update meetings every half hour, which is why you also need team leads and project managers, so there is somebody to go to the meeting and say "NO, the techs are still working on it"

      3. Anonymous Coward
        Anonymous Coward

        Re: I reckon the other source had it spot on

        > The last thing you want to do is failover to DR at the flick of a switch, because that may well make things worse.

        I spend so much time trying to convince customers of that, and many of them still won't get past "but we need automatic failover to the DR site". We refuse to do it, the field staff cobble something together with a script, and it all ends in tears.

    2. Anonymous Coward
      Anonymous Coward

      Re: I reckon the other source had it spot on

      http://www.emptylemon.co.uk/jobs/view/345349

      throw in a couple of 3rd parties and you've got them all pointing fingers at each other as well to add into the mix.

      1. Anonymous Coward
        IT Angle

        Re: I reckon the other source had it spot on

        "throw in a couple of 3rd parties and you've got them all pointing fingers at each other as well to add into the mix."

        "RBS - Data Consultant - Accenture"

        I reckon whoever wrote that never actually worked in a real IT environment ...

    3. Velv

      Re: I reckon the other source had it spot on

      Which is how it might work if you have manual intervention required.

      Highly available mainframe plexes like RBS's run active/active across multiple sites.

      1. Anonymous Coward
        Anonymous Coward

        Re: I reckon the other source had it spot on

        > Which is how it might work if you have manual intervention required.

        For DR you should have manual intervention required.

        For simple HA when the sites are close enough to be managed by the same staff, have guaranteed independent redundant networking links, etc. then, yes, you can do automatic failover.

        For proper DR, with sites far enough apart that a disaster at one doesn't touch the other, you have far more to deal with than just the IT stuff, and there you must have a person in the loop. How often have you watched TV coverage of a disaster when even the emergency services don't know what the true situation is for hours (9/11 or Fukushima, anyone?)? Having the IT stuff switching over by itself while you're still trying to figure out what the hell has happened will almost always just make the disaster worse.

        For example, ever switched over to another call center, when all the staff there are sleeping obliviously in their beds? Detected a site failure which hasn't happened, due to a network fault, and switched the working site off? There is a reason that the job of trained business continuity manager exists. We aren't at the stage where (s)he can be replaced by an expert system yet, let alone by a dumb one.
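
        If it helps to picture the rule of thumb, here it is as a toy Python sketch (invented names, not any real product): automatic failover only within the local HA pair; anything cross-site waits for a named human to say go.

          # Toy decision guard: local HA failover is automatic, cross-site DR is not.
          def decide_failover(failed_site, target_site, same_metro, human_approved=False):
              """Purely illustrative; returns the action to take."""
              if same_metro:
                  # Close sites, shared staff, independent redundant links assumed:
                  # safe to fail over automatically.
                  return f"auto-failover {failed_site} -> {target_site}"
              if human_approved:
                  # Proper DR: a business continuity manager has confirmed the
                  # disaster is real and the remote site is actually ready.
                  return f"manual DR failover {failed_site} -> {target_site}"
              # Otherwise do nothing except shout for a human.
              return "hold: page the business continuity manager"

          print(decide_failover("DC-A", "DC-B", same_metro=True))
          print(decide_failover("DC-A", "DR-SITE", same_metro=False))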

      2. Anonymous Coward
        Anonymous Coward

        Re: I reckon the other source had it spot on

        "Highly available mainframe plex's like RBS run active/active across multiple sites."

        Exactly, no one should have had to call anyone. The mainframe should have moved to the second active system in real time. The only call that would have been made is system calling IBM to tell an engineer to come and replace whatever was broken.

    4. andy mcandy
      Thumb Up

      Re: I reckon the other source had it spot on

      That's so good, and accurate, that I have just printed it out and stuck it on the kitchen wall as a reminder to the ever expanding mass of PHBs I have the (mis)fortune of working with/for/alongside.

      Made my day :)

      1. Dave Lawton

        Re: I reckon the other source had it spot on

        @ andy mcandy

        Just for clarification (I wrote clarifiction first, is that a word ? speeling chucker didn't barf at it), it is Mike Smith's post you are referring to, if not which one please ?

    5. This post has been deleted by its author

    6. Anonymous Coward
      Anonymous Coward

      Re: I reckon the other source had it spot on

      You, sir, have just summed up my job with scary accuracy!

      (Are you in my team....?)

    7. Anonymous Coward
      Anonymous Coward

      Re: I reckon the other source had it spot on

      So all technicians are incredibly overworked but still infallible, whilst all managers are lazy and incompetent. Procedures are completely unnecessary. Just get rid of all managers and procedures and everything will be fantastic.

      1. M Gale
        Thumb Up

        Re: I reckon the other source had it spot on

        "So all, technicians are incredibly overworked but still infallible, whilst all managers are lazy and incompetent. Procedures are completely unnecessary. Just get rid of all managers and procedures and everything will be fantastic."

        And irate operators really do poison their bosses with halon. Oh come on, that has got to be the best rant I've seen in a while. Possibly it might have come from person experience, as far as you know.

        Wherever it came from, I think Mike Smith needs to be hired as Simon Travaglia's ghost writer for when he's off sick and a new BOFH episode needs writing up. That was awesome.

        1. M Gale
          Facepalm

          Re: I reckon the other source had it spot on

          Personal experience, even. And too late to use the edit button too.

          Damn my fat fingers.

          (Maybe that's what someone said shortly after the start of the RBS outage?)

    8. Chris007
      Pint

      Re: I reckon the other source had it spot on @Mike Smith

      wow - having worked at RBS this is not far from what has happened on some recovery calls I was involved in. Systems were down and some manager would actually request a change be raised BEFORE fixing the issue - anybody who knows the RBS change system *coughInfomancough* knows it is not the quickest system in the world.

      Have a pint for reminding me what I escaped from.

    9. RegGuy1 Silver badge
      Coat

      Re: I reckon the other source had it spot on

      Wot no down votes? So are there no managers who frequent el Reg? Or maybe, just maybe, could the techies be right?

      [Coat -- system's buggered so I'm off down the pub.]

  7. Anonymous Coward
    Anonymous Coward

    "In theory, the banking group’s disaster-recovery procedures should have kicked in straight away without a glitch in critical services."

    Easy to be judgemental, but a DR failover is normally a controlled failover which takes a number of hours; you need to be 100% sure the data is in a consistent state to be able to switch across. I've seen failovers that have gone wrong and it's a million times worse being left halfway between both!

    It's unlikely any system is truly active/active across all of its parts

    1. Yet Another Commentard

      In theory...

      Of course, in theory reality and theory are identical. In reality, they are not.

    2. Anonymous Coward
      Anonymous Coward

      "Its unlikely any system is truly active/active across all of its parts"

      That is the beauty of the System z though. It is not a distributed system which requires 8 HA software layers from 8 different vendors, none of which are aware of each other, to be perfectly in sync. It is truly active/active across all of its parts because there are not that many parts, or third party parts to sync. Parallel sysplex, IBM mainframe HA, manages the whole process.

    3. david 12 Silver badge

      banking group’s disaster-recovery procedures

      This was the bank where the disaster-recovery procedures went disastrously wrong.

      I'm going to at least consider the possibility that this time they were told they COULD NOT start a disaster-recovery procedure until everything was turned off and backed up.

  8. Anonymous Coward
    Anonymous Coward

    Microsoft let their own cert expire and it blew up their cloud

    Just because it's a small avoidable failure, doesn't mean it can't have a catastrophic effect

  9. Dan 55 Silver badge
    Meh

    "procedures should have kicked in straight away without a glitch in critical services"

    These would be the disaster-recovery procedures that RBS said were given a good seeing to after last year's balls up so this kind of thing would never happen again.

    (I know a hardware problem and a batch problem aren't related, but the procedures that they follow if something happens which brings down the bank's service probably are.)

    1. Anonymous Coward
      Anonymous Coward

      Re: "procedures should have kicked in straight away without a glitch in critical services"

      Disaster recovery doesn't help a damn if (a) your data is replicated immediately to DR and is buggered or (b) it's quicker to recover in the same site than flip to remote site.

      The assumption some people have that DR is a simple flick to a remote site and everything springs to life in 5 minutes is horribly, horribly flawed in so many ways.

      1. SE

        Re: "procedures should have kicked in straight away without a glitch in critical services"

        "The assumption some people have that DR is a simple flick to a remote site and everything springs to life in 5 minutes is horribly, horribly flawed in so many ways."

        Indeed, though vendors often like to give the impression that this is how their solutions work.

  10. Anonymous Coward
    Anonymous Coward

    Alternatively...

    ...mainframe is duplicated across 2 towers. Or more.

    Part 1: Tower 1 taken out for patching/maintenance/a laugh/upgrade

    Part 2: At the same time, somebody yanks the 3 phase on Tower 2 by mistake. As it was night time, it was probably to plug in a hoover (yes, I know... it's a joke). Or the main breaker blows. Or the entire DC where Tower 2 resides goes on holiday to the Bermuda Triangle.

    Part 3: Royal shitstorm getting Tower 2 back online after an unclean shutdown, with everything rolled back, then rolled forward again. It was probably noticed *in milliseconds* when it went down, but "switching it on and off again" doesn't really work on a transactional Z. It takes time.

    Part 4: Parallel recovery is to cancel the work on Tower 1 and get it back online, while rolling forward all the "in transit" stuff from Tower 2.

    Part 5: Meanwhile CA7 guys run about cancelling/rescheduling batch jobs.

    Edit: Part 6: All the secondary services are restarted - mainly the application server layer, and any rollbacks are replayed now the back end is back.
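
    For the non-mainframe crowd, the "rolled back, then rolled forward" bit in Parts 3 and 4 is just classic write-ahead-log recovery. A toy Python version (nothing to do with the real CICS/DB2 internals, obviously):

      # Toy crash recovery: replay a write-ahead log after an unclean shutdown.
      # Committed work is rolled forward; uncommitted work is discarded.
      log = [
          ("T1", "set", "acct42", 100),
          ("T1", "commit", None, None),
          ("T2", "set", "acct42", 999),   # T2 never committed before the crash
      ]

      def recover(log):
          committed = {txid for txid, op, _, _ in log if op == "commit"}
          state = {}
          for txid, op, key, value in log:
              if op == "set" and txid in committed:
                  state[key] = value          # roll forward committed writes
              # writes from uncommitted transactions are simply not applied,
              # which is the "rolled back" part
          return state

      print(recover(log))   # {'acct42': 100}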

    1. Anonymous Coward
      Anonymous Coward

      Re: Alternatively...

      No battery backup?

  11. Anonymous Coward
    Anonymous Coward

    Tandem

    I don't know much about these things but I recall many years ago a relative working in banking IT who mentioned that some of the banks used Tandem hardware that provided a continuous and automatic redundancy, but the machines cost a lot more. As I said, I know nothing about mainframes but these days do all mainframes provide such features or not? If not, is that why we get such incidents? Any insights most welcome.

    1. Anonymous Coward
      Anonymous Coward

      Re: Tandem

      Just found this on Wiki; it partly answers my question. Tandem no longer seems to exist, but it perhaps proves my point about cheaper options, and that use of Tandem-like solutions should cope with these failures without requiring a committee to have a conference call about the failure?

      "Tandem Computers, Inc. was the dominant manufacturer of fault-tolerant computer systems for ATM networks, banks, stock exchanges, telephone switching centers, and other similar commercial transaction processing applications requiring maximum uptime and zero data loss. The company was founded in 1974 and remained independent until 1997. It is now a server division within Hewlett Packard.

      Tandem's NonStop systems use a number of independent identical processors and redundant storage devices and controllers to provide automatic high-speed "failover" in the case of a hardware or software failure.

      To contain the scope of failures and of corrupted data, these multi-computer systems have no shared central components, not even main memory. Conventional multi-computer systems all use shared memories and work directly on shared data objects. Instead, NonStop processors cooperate by exchanging messages across a reliable fabric, and software takes periodic snapshots for possible rollback of program memory state."
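
      For the curious, the "cooperate by exchanging messages ... periodic snapshots for possible rollback" bit boils down to something like this toy primary/backup pair in Python (nothing to do with the real NonStop APIs, just the shape of the idea):

        # Toy process pair: the primary checkpoints its state to a backup by
        # message, so the backup can take over from the last snapshot.
        class Backup:
            def __init__(self):
                self.snapshot = {}

            def receive_checkpoint(self, state):
                # In the real thing this arrives over a reliable message fabric;
                # the two sides share no memory at all.
                self.snapshot = dict(state)

            def take_over(self):
                print("backup resuming from snapshot:", self.snapshot)

        class Primary:
            def __init__(self, backup):
                self.state = {"balance": 0}
                self.backup = backup

            def apply(self, delta):
                self.state["balance"] += delta
                self.backup.receive_checkpoint(self.state)  # periodic in real life

        backup = Backup()
        primary = Primary(backup)
        primary.apply(100)
        primary.apply(-25)
        # Primary "fails" here; the backup carries on from the last checkpoint.
        backup.take_over()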

      1. Arbee

        Re: Tandem

        Actually, Tandem (or HP NonStop as it is now called) is still in wide use. At the institution I work at (UK) it's the backend to the ATM network. I went to a seminar about it and was told their last outage was in 1991!

        1. hugo tyson
          Stop

          Re: Tandem

          Yeah, but the main thing about Tandem/HP NonStop systems is every CPU is duplicated, all memory is duplicated, and for every operation if the two results don't match the (dual)CPU in question STOPS. It's very keen on stopping; it's only a huge mound of failover software and redundant power and duplication that makes a *system* very keen on continuing; individual parts stop quite readily.

          Of course, the intended market is OLTP, so the goal is to make sure that the decrement to your bank balance is the right answer; if two paired hardware CPUs and their memory give different answers, that pair of CPUs stops and a whole 'nother hardware set attempts the same transaction.

    2. Phil O'Sophical Silver badge
      Stop

      Re: Tandem

      Tandem hardware was Fault Tolerant, not Highly Available. There were other players, like Stratus and Sun, in that area.

      FT hardware duplicates the systems inside a single box, perhaps three CPUs, three disk controllers, three network cards, etc. They all do the same work on the same data, and if they get different results there's a majority vote to decide who's right.

      It provides excellent protection against actual hardware failure, like a CPU or memory chip dying, but offers no protection at all against an external event, or operator error. Just like using RAID with disks, which protects against disk failure, but it isn't a replacement for having a backup if someone deletes the wrong file by mistake.

      It is expensive: you're paying for three systems but getting the performance of one, and given the reliability of most systems these days it isn't used much outside the aviation/space/nuclear/medical world, where even the time to switch over to a backup can be fatal. There's a reason that none of the companies who made FT systems managed to survive as independent entities.
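
      The majority vote part is easy to picture with a toy Python sketch (illustrative only):

        # Triple modular redundancy in miniature: run the same operation on
        # three replicas and accept the answer at least two agree on.
        from collections import Counter

        def vote(results):
            value, count = Counter(results).most_common(1)[0]
            if count < 2:
                raise RuntimeError("no majority - stop rather than return bad data")
            return value

        # Two healthy replicas outvote the one that has gone wrong.
        print(vote([42, 42, 41]))   # -> 42; replica 3 gets flagged for service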

    3. Anonymous Coward
      Anonymous Coward

      Re: Tandem

      "As I said, I know nothing about mainframes but these days do all mainframes provide such features or not? If not, is that why we get such incidents?"

      Yes, the IBM mainframe's parallel sysplex is the gold standard in HA. Basically the system upon which all other clusters have been based. The systems read/write I/O in parallel so both system A and B (or more than two if you choose) have perfect data integrity and can process I/O in parallel. If one of those systems goes away, the others continue handling I/O with no disruptions. There is also a geographically dispersed parallel sysplex option which can provide out of region DR, in case the data center blows up or something, at wire speed with log shipping which is also active/active, but it takes a few seconds, literally, before the I/O on the wire is written and the DR site takes over. In theory, we should never get such incidents, but, like anything, people can misimplement the HA solution... which seems to have happened here.

  12. LPF

    Ok my two pence

    The zSeries is not some mickey mouse piece of hardware; this is built to have downtime measured in minutes per year.

    In the rush to get back to profits, using slash and burn tactics, did they kick out too many British-based banking staff?

    1. Anonymous Coward
      Flame

      You Hit the Nail

      RBS had to "retain talent" in the trading rooms and pay them several 100k of bonus per year. And let them bet the entire bank to get the short term results for obtaining said bonuses. In the long run it crashed the bank and, "in order to become profitable again", experienced British engineers and specialists were replaced by Indians with 1/10th of the wage and 1/100th of the experience/skill/actual value.

      But you know what ? That is the whole purpose of modern finance - suck the host white until it is dead, then leave the carcass for the next host. IT people are considered part of the host organism.

      Grab yourself a history book and see how that played out between 1929 and 1945.

      Picture of firebombed city.

  13. Candy

    Prevented millions from accessing their accounts?

    I had no idea that their customers were such night owls. More likely that thousands were affected by the outage, no?

    1. nsld
      FAIL

      Re: Prevented millions from accessing their accounts?

      9pm at night with no card transactions and no ATM. Anyone out for dinner or drinks using RBS was knackered.

      Certainly going to have an effect on many people.

      1. Silverburn

        Re: Prevented millions from accessing their accounts?

        9pm at night with no card transactions and no ATM. Anyone out for dinner or drinks using RBS was knackered.

        That would limit it to politicians and Traders, as everyone else in the country is too skint to eat out midweek. Then again, the politicians probably wouldn't be paying for it anyway - it's all on "expenses". So it's just the traders then. No biggie.

  14. ukgnome
    Trollface

    It does seem odd that the auxiliary systems didn't take the load whilst the primary was out of action.

  15. FutureShock999
    Boffin

    High Availability and Resilience

    This should NOT have been about DR or backups. This should have been handled as part of any high-availability, RESILIENT cluster system design. I've designed and architected HA on IBM SP2 supercomputer clusters and can well attest that it works - our "system test" was walking the floor of the data centre randomly pulling drive controller cables and CPU boards out of their sockets, while having the core systems still running processes without failing! And that was 10+ years ago - I find it appalling that a live banking system would not be engineered to have the same degree of _resilience_. Don't talk in terms of how many minutes of downtime it will have per year - it should be engineered to tolerate the failure of x number of disks, y number of controllers, and z number of processors (within a chassis/partition/etc.) before the system fails. For a live, financial system, those should be the metrics that are quoted, not reliability alone.
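
    As a back-of-the-envelope illustration of quoting it that way round, here is a toy Python sketch (made-up failure probabilities, obviously):

      # Chance that a subsystem with n redundant parts, needing at least k
      # working, is down in some window - assuming independent part failures.
      from math import comb

      def p_subsystem_down(n, k, p_part_fails):
          # Down only if more than n - k parts fail in the window.
          return sum(comb(n, f) * p_part_fails**f * (1 - p_part_fails)**(n - f)
                     for f in range(n - k + 1, n + 1))

      # e.g. 3 controllers, any 1 sufficient, each with a 1% chance of failing
      # in the window: the subsystem only dies if all 3 do.
      print(p_subsystem_down(3, 1, 0.01))   # about 1e-06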

    1. Anonymous Coward
      Anonymous Coward

      Re: High Availability and Resilience

      Exactly, this is an HA design or implementation issue.

    2. MH Golfer

      Re: High Availability and Resilience

      Definitely an HA issue and it should be automated/orchestrated.

  16. Caff

    Tandem

    There would be a Tandem line between the mainframe and the ATM network. If it goes down or out of sync it can take some coordinated restarts to bring it back up.

  17. Joey
    Joke

    Common problem

    It's that 16K RAM pack on the back. If you wobble it, you can lose all your data.

  18. Jim McCafferty

    Just Joshing...

    The Government have said they will sell their holding in RBS when the stock price reaches a certain level. What if someone decided they didn't like that idea? This little incident will put that sale in doubt.

    After the last mainframe blow out - one would have thought the place would have been running a bit better - I take it other banks aren't experiencing similar outages?

  19. Anonymous Coward
    Anonymous Coward

    Old joke

    That's envelope number 2 used up.

  20. Anonymous Coward
    FAIL

    A mainframe hardware fault !

    "A hardware fault in one of the Royal Bank of Scotland Group's mainframes prevented millions of customers from accessing their accounts last night.

    Assuming this were the case, they must have multiple redundent systems, mustn't they? On the other hand, maybe someone ran the wrong backup procedure ... again ! ! !

    "This fault may have been something as simple as a corrupted hard drive, broken disk controller or interconnecting hardware."

    No, mainframes have multiple hard drives, disk controllers and error detection and correction circuits ...

    1. Anonymous Coward
      Anonymous Coward

      Re: A mainframe hardware fault !

      "No, mainframes have multiple harddrives, disk controllers and error detection and correction circuits ..."

      Yes, and they are clustered systems, so if one system bombs, even with all the fault tolerant architecture, another system in the cluster (or parallel sysplex in mainframe vernacular) should pick up the load. As with any cluster, you may take a performance hit, but it should never just go down.

  21. despairing citizen
    FAIL

    Banks Fail Again

    Unless their data centre was a smoking hole in the ground, outages of live systems are unacceptable.

    Even if their data centre was nuked, the bank should have continued running its live services from an alternate location, with minimal "down time"

    The bank is paid very handsomely by its customers for services, and "off lining" several billion pounds of the UK economy for 3 hours is completely unacceptable.

    Whilst normally I personally think less legislation is a "good thing", HMG really needs to kick the regulator to remind them that the "fit and proper" test for an organisation to hold a banking licence should include whether it can actually deliver the service reliably.

    1. Anonymous Coward
      Anonymous Coward

      Re: Banks Fail Again

      I think it's unacceptable that society has got to a point that computers are *that* big and *that* important and the job *can't* be done by humans.

  22. Matt Bryant Silver badge
    Boffin

    Probable key factor in the outage - "IBM mainframes don't fail!"

    I have had (stupid) people say to me statements like "Here, this is our DR plan, but don't worry about reading it, we have an IBM mainframe and IBM told us it will never fail" (note - IBM are VERY careful not to make that legally binding statement in their sales pitch, but they are happy to leave you with the impression). I have called directors at three in the morning to tell them the bizz is swimming in the brown stuff because we have had a mainframe stop/pop/do-the-unexpected, and after a few moments of bewildered silence at the other end of the line you get those immortal words: "But it's a mainframe....?" Half the problem is people are so lulled by the IBM sales pitch they just don't stop to think ANYTHING MANMADE IS FALLIBLE, so when something does go wrong there is an inertia due to an inability to accept the simple fact stuff breaks, whether it has an IBM badge or not. I bet half the delay in solving the RBS outage was simply down to people getting past that inertia.

    1. Anonymous Coward
      Anonymous Coward

      Re: Probable key factor in the outage - "IBM mainframes don't fail!"

      "Half the problem is people are so lulled by the IBM sales pitch they just don't stop to think ANYTHING MANMADE IS FALLIBLE"

      Yes, anything manmade is fallible. The IBM mainframe is fault tolerant, redundant hardware which can be dynamically used in the case of a component failure, but it is also a clustered system, parallel sysplex. Parallel sysplex is in place specifically because the systems might fail for whatever reason, e.g. hw failure, software error, data center blows up. I/O is processed in parallel across multiple systems so if one is unavailable, the other mainframes can immediately pick up the I/O. The IBM coupling facilities which make it possible for server time protocols to work in parallel are brilliant. No system hardware failure should ever take down a mainframe environment, unless you implemented parallel sysplex incorrectly. It is like having Oracle RAC implemented incorrectly and blaming the outage on a single server failure. If RAC is implemented correctly, a server failure should not matter. I highly doubt any IBM rep told anyone the mainframe is infallible and never goes down at a hardware level, if for no other reason than they wanted to sell parallel sysplex software.

      "I have called directors at three in the morning to tell them the bizz is swimming in the brown stuff because we have had a mainframe stop/pop/do-the-unexpected, and after a few moments of bewildered silence at the other end of the line you get those immortal words: "But it's a mainframe....?"

      I doubt that ever happened, but, if it did, they asked the right question. If properly implemented, that should never happen. Much like if you were to call your Director at three in the morning to tell them that the RAC cluster is down because a server failed, they would say "But it's a RAC cluster.....?"

      1. Matt Bryant Silver badge
        FAIL

        Re: Probable key factor in the outage - "IBM mainframes don't fail!"

        ".....Yes, anything manmade is fallible....." But - don't tell me - IBM mainframes are made by The Gods, right?

        "..... IBM mainframe is fault tolerant, redundant hardware....." Ignoring my own experience, this story goes to show you are completely and wilfully blind!

        "......No system hardware failure should ever take down a mainframe...." So the event never happened, it was all just a fairy tale, right? You know I mentioned stupid people earlier that said stuff like "forget DR, it's on an IBM mainframe", well please take a bow, Mr Stupid.

        1. Anonymous Coward
          Anonymous Coward

          Re: Probable key factor in the outage - "IBM mainframes don't fail!"

          "Ignoring my own experience, this story goes to show you are completely and wilfully blind!"

          Look at a System z data sheet. Every critical component is triple redundant. That certainly doesn't mean, in and of itself, that the system can't go down. It just means a hardware component failure is less likely to take down a system than a hardware failure in a system which is not fault tolerant. It is not an HA solution at all... which is why IBM created parallel sysplex, the HA solution.

          "So the event never happened, it was all just a fairy tale, right? You know I mentioned stupid people earlier that said stuff like "forget DR, it's on an IBM mainframe", well please take a bow, Mr Stupid."

          I didn't say this event didn't happen. I said that if parallel sysplex had been implemented correctly, it would be impossible for a hardware failure in a single mainframe to take down the cluster. It is possible that RBS did not have parallel sysplex on this application or that it was not implemented correctly. No individual system failure *should* ever take down a mainframe environment is what I wrote, that is assuming you have IBM HA solution in place. If you don't have the HA solution in place, sure, mainframes can go down like any other system... less likely than an x86 or lower end Unix server due to its fault tolerance, but it certainly can happen if it is stand alone. My point was, as this was clearly ultra mission critical, why wasn't parallel sysplex implemented? It should have been done as a matter of course, every other major bank that I know of runs their ATM apps in parallel sysplex, most in geographically dispersed parallel sysplex.

          1. Matt Bryant Silver badge
            FAIL

            Re: AC Re: Probable key factor in the outage - "IBM mainframes don't fail!"

            "Look at a System z data sheet......" AC, the data sheet is just part of the IBM sales smoke and mirrors routine - "it can't fail, it's an IBM mainframe and the data sheet says it is triple redundant". You're just proving the point about people that cannot move forward because they're still unable to deal with the simple fact IBM mainframes can and do break. The data sheet is just a piece of paper, the RBS event is reality, you need to understand the difference. Fail!

            1. Anonymous Coward
              Anonymous Coward

              Re: AC Probable key factor in the outage - "IBM mainframes don't fail!"

              "The data sheet is just a piece of paper, the RBS event is reality, you need to understand the difference. "

              You need to understand the difference between hardware level fault tolerance and high availability. Two different concepts.

              IBM mainframes run most of the world's truly mission critical systems, e.g. banks, airlines, governments, etc. To my knowledge, these all run in parallel sysplex without exception. If anyone thought that a mainframe didn't go down just because the hardware was built so well/redundant, there would be no point in all of these organizations implementing parallel sysplex. Even if you have a 132 way redundant system, it will still likely need to be taken down for OS upgrades or another software layer upgrade that requires an IPL. Not having a hardware issue because of hardware layer redundancy is only one small part of HA.

              1. Matt Bryant Silver badge
                FAIL

                Re: AC Re: AC Probable key factor in the outage - "IBM mainframes don't fail!"

                ".....You need to understand the difference between hardware level fault tolerance and high availability. Two different concepts....." No, what YOU need to understand is both mean SFA to the business, what matters to them is that they keep serving customers and making money. The board don't give two hoots how I keep the services running, be it by highly available systems or winged monkeys, they really don't give a toss as long as the money keeps rolling in. RBS had a service outage, reputedly because of a mainframe hardware issue, and it cost them directly in lost service to customers and indirectly in lost reputation, simple as that. You can quote IBM sales schpiel until you're blue in the face, it doesn't mean jack compared to the headlines. Get out of the mainframe bubble and try looking at how the business works.

      2. Anonymous Coward
        Anonymous Coward

        Re: Probable key factor in the outage - "IBM mainframes don't fail!"

        > If properly implemented, that should never happen.

        I hope to God you're never implementing systems I have to rely on.

        Let me guess, your code also has lots of:

        /* We can never get here */

        return;

        > Much like if you were to call your Director at three in the morning to tell them that the RAC cluster is down because a server failed, they would say "But it's a RAC cluster.....?"

        And we all know that RAC clusters never, ever, go down. It's amazing that Oracle even bothers to sell support for them isn't it?

        1. Anonymous Coward
          Anonymous Coward

          Re: Probable key factor in the outage - "IBM mainframes don't fail!"

          "And we all know that RAC clusters never, ever, go down. It's amazing that Oracle even bothers to sell support for them isn't it?"

          Yes, it is possible to have some error in the clustering software which takes down the entire cluster, be the clustering software RAC, Hadoop, or Parallel Sysplex. *But that is not where RBS said the issue occurred.* If they had said "a parallel sysplex issue" and not a "hardware failure" then it is possible that they had the right architecture but the software bugged out or was improperly implemented. My point is: this application clearly should have been running in parallel sysplex as an ultra mission critical app. That is the architecture for nearly all mainframe apps. Therefore, saying a "hardware failure" caused their entire ATM network and all other transactional systems to go down makes no sense. They were either not running this in sysplex, in which case... why not, or they did not report the issue correctly and it was much more than a "hardware failure."

    2. Roland6 Silver badge

      Re: Probable key factor in the outage - "IBM mainframes don't fail!"

      "I have called directors at three in the morning to tell them the bizz is swimming in the brown stuff because we have had a mainframe stop/pop/do-the-unexpected"

      There was a time when the Director would have called you to ask why they had to get the news of a fault from IBM and not their own IT organisation...

      Either IBM customer service has gone downhill or they've decided it's better business to be friends with the IT organisation.

      1. Matt Bryant Silver badge
        Facepalm

        Re: Roland6 Re: Probable key factor in the outage - "IBM mainframes don't fail!"

        ".....There was time when the Director would of called you to ask why they had to get the news of a fault from IBM and not their own IT organisation..." If you're implying the typical IBM response was to worry about cuddling up to senior management rather than fixing the problem then nothing has changed. But you should also know it is the first rule of BOFHdom that you should always know more than those above you. Dial-home services and the like should always have the BOFH as contact so you are in control of the flow of information uphill, so as to make sure that when the brown stuff comes rolling downhill it is not on your side. Your role has probably already been short-listed for being outsourced if you haven't mastered such basics.

        1. Roland6 Silver badge

          Re: Roland6 Probable key factor in the outage - "IBM mainframes don't fail!"

          Matt, upvoted because I agree with you.

    3. Fatman
      FAIL

      Re: ...was simply down to people getting past that inertia.

      A well placed kick with a steel toed boot might help!

  23. Maverick
    FAIL

    and the reality is that

    NW online banking is down again this morning

    Got in, then it told me I'd pressed the back page key and kicked me out.

    let's see what today's excuse will be . . . let's get going with that break-up, HMG, eh?

  24. dbbloke

    Failover

    So they didn't have:

    Some kind of HDR (high-availability data replication) secondary to seamlessly swap over to?

    An SDS (shared-disk secondary) server to fail over to?

    An Enterprise Replication target somewhere?

    An RSS (remote standalone secondary) machine, in another location or even the cloud?

    Probably only Informix does this well (and it would give me some work); Oracle tries, but it's problematic. See the sketch below for the general, vendor-neutral idea.
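
    To make that "seamless swap" concrete, here's a minimal, entirely hypothetical sketch in Python (made-up hostnames and port, plain TCP probes, nothing Informix- or vendor-specific) of the client-side half of the idea: try the primary, fall back to a standby. Real HDR/RSS failover is driven by the database engine itself; this only shows why the application shouldn't have to care which box is alive.

        import socket

        # Hypothetical endpoints - a primary and a remote standby (think HDR/RSS pair).
        ENDPOINTS = [
            ("primary.example.internal", 9088),
            ("standby.example.internal", 9088),
        ]

        def first_reachable(endpoints, timeout=2.0):
            """Return the first endpoint that accepts a TCP connection, else None."""
            for host, port in endpoints:
                try:
                    with socket.create_connection((host, port), timeout=timeout):
                        return host, port
                except OSError:
                    continue  # dead, unreachable or refusing - try the next candidate
            return None

        if __name__ == "__main__":
            target = first_reachable(ENDPOINTS)
            print("connect to:", target if target else "nothing reachable - wake the DBA")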

    I wonder if it was database/application or what. It doesn't sound like a network issue, and SANs are bulletproof as well. I would assume there is a more sinister reason, given the state of banking.

    Mainframe - no wonder it fails; almost nobody is still around who knows how to maintain the OS. I've tried, and it's super command-line unfriendly.

    Banks are sadly expertless; I know loads with terrible DBAs running Mickey Mouse systems with no failover. The more you know, the more you wonder how it works AT ALL.

  25. Anonymous Coward
    Anonymous Coward

    Probably too scared to take immediate action

    The operators were probably unwilling to make any failover call following the almighty bollocking they will have received after last year's fubar.

    They will (quite rightly) have kicked the decision up the chain to those earning the salary for having the responsibility.

  26. 1052-STATE
    Facepalm

    These aren't small shops.... they're mainframe *complexes*

    Fair amount of tosh being written - such as the claim that failovers are never immediate. I ran some of the world's largest realtime systems (banking, airlines) for 15 years, and it's imperative that an immediate, seamless failover is there the second you need it. Realtime loads were switched from one mainframe complex to another on a different continent in less than five seconds - with zero downtime.

    See "TPF" on Wikipedia. (aka Transaction Processing Facility)

  27. Anonymous Coward
    Anonymous Coward

    I am a broadcast engineer, not an IT guy, and I've seen some IT guys spectacularly fail under pressure. Accidentally cut off services to an entire country? Fine - don't run around like a headless chicken; get it working and then you can stress, not the other way around. Also, sometimes the answer isn't to fix the problem but just to get the system working: you are providing a critical service to the public, and you can fix the problem later. Sometimes getting it working does involve fixing the problem; sometimes you just need to patch around it and schedule the fix. It isn't amateur bodging, it is maintaining a critical service at all costs.

    I previously worked for a major broadcaster's technology division; the broadcaster wanted to reduce its headcount, there was talk of "leveraging" etc., and we were sold to a major IT outsourcing company. Now, although the sale was only supposed to cover the IT and phones, they saw "broadcast communications" and someone wet themselves with excitement. Massive connectivity infrastructure, lots of racks of equipment, 24x7 operation with flashy consoles and, most importantly of all, high-margin contracts: an IT director's wet dream (it was cool). So they asked the broadcaster if they could also take that department in the same purchase - "Are you sure?... Okay." What the IT outsourcing people didn't realise was that with valuable contracts came great responsibility. We never had *any* measurable outages, and changeovers happened in a flash. Hardware resilience: n+1? No thanks, we'll have 2n, or at least 3a+2b. Resilient power? Grid, gas turbine, diesel & UPS, plus manual bypass changeover switches!

    The thing was, some of this isn't unfamiliar to IT people who do real DR, but what created the biggest fuss? They refused to acknowledge that the IT response time for some users (the 24x7x365 ops team) had to be less than 4 hours. Surely you can wait 4 hours to get your email back? Surely you can do without your login for a few hours? Your Exchange account has zero size and can't send mail? Can you send us a mail to report the fault?

    If the people supporting you don't understand you, then how can you be effective?

    1. Chris007
      Flame

      @AC 10:50 GMT

      "Also, sometimes the answer isn't to fix the problem but to just get the system working, you are providing a critical service to the public, you can fix the problem later"

      Having been at the sharp end (in a certain mega-large organisation), I can tell you that 99% of techies would like to take this course of action, but 99% of the time they are stopped by [glory-hunting] managers.

      We have a name for them - "Visibility Managers".

      They didn't want anything happening until very senior managers had seen them involved, so they could take all the credit. Once the very senior managers had disappeared (fault worked around or fixed, etc.), the "Visibility Manager" would very quickly become the "Invisibility Manager" and f**k off.

  28. Mick Sheppard

    Redundant != no outages

    I worked at a place that ran its databases from a tier-2 storage array. This had redundant everything: dual controllers, power supplies, paths to disk, paths to the SAN, etc.

    We had disk failures that the system notified us about, and we hot-replaced the disks while the array re-laid out the data dynamically. We had a controller failure that we were notified about, and the engineer came to replace it, again without an outage.

    We then had two separate incidents that caused complete outages. The first was a disk that failed in a way that, for some reason, took out both controllers. It shouldn't happen, but it did. The second was down to a firmware issue in the controllers that, under a particular combination of actions on the array, caused a controller failure. With both controllers running the same firmware, the failure cascaded from one to the other and took out the array.

    So, whilst it's trendy to be cynical, these complex redundant systems aren't infallible, and when they do fail it can take a while to work out what has happened and what needs to be done to get things operational again.

    1. Anonymous Coward
      Anonymous Coward

      Re: Redundant != no outages

      Definitely, I think people are confusing fault tolerance with high availability. There is overlap, but they are different concepts.

      Fault tolerance just means a bunch of extra hardware is in place so that if a NIC, or whatever, fails, another will take over for it. It says nothing about downtime beyond adding protection in the single category of hardware failures. If you need to upgrade the OS, even the most fault-tolerant system known to man will likely require an outage. That is why you need an HA solution in place if no downtime is a requirement. A high-availability solution runs a parallel system with real-time data integrity so that it can immediately pick up I/O if another system in the HA environment goes down, whether scheduled or unscheduled. For instance, Tandem NonStop was supremely fault tolerant, but not necessarily highly available. Google's home-brew 1U x86 servers have zero fault tolerance, but their Hadoop cluster makes the environment highly available. The IBM mainframe has both: it is fault-tolerant hardware, and you can also add Parallel Sysplex, which provides high availability.
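
      A back-of-envelope sketch of why the distinction matters, with made-up numbers and the (generous) assumption that the two systems fail independently - which, as the firmware story above shows, they sometimes don't:

          # Hypothetical figures: one fault-tolerant box vs. an HA pair of such boxes.
          single_box = 0.999                   # availability incl. planned work, e.g. OS upgrades
          ha_pair = 1 - (1 - single_box) ** 2  # service is up if either system is up

          def downtime_hours_per_year(availability):
              return (1 - availability) * 365 * 24

          print(f"single fault-tolerant box: {downtime_hours_per_year(single_box):.2f} h/year")
          print(f"HA pair of such boxes:     {downtime_hours_per_year(ha_pair):.4f} h/year")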

  29. Roland6 Silver badge

    Scary the lack of any real knowledge being shown here

    "I work somewhere that has a much smaller IT department and a much smaller IT budget than RBS, but it would take the failure of multiple hardware devices to knock a key service out. What kind of Mickey mouse setup do they have that a hardware failure can take down their core services for hours?"

    From reading the comments, I'm concerned by the total lack of real knowledge of real-world enterprise computing demonstrated by many; the above comment would seem to be the sub-text to a lot of them.

    Setting up and running an IBM Parallel Sysplex with only 6 zSeries in it, distributed across 3 sites, was complex, let alone 14+. Plus, I suspect that not all systems were running at full capacity, mainly due to hardware and software licensing costs (believe it or not, for some software you pay not for the CPU it actually runs on, but for the TOTAL active CPU in the Sysplex). Hence it would have taken time to call the engineers out, bring additional capacity on-line, move load within the Sysplex and confirm all is well before re-opening the system to customers - and that is assuming the fault really was on a mainframe and not on a supporting system. Also, it should not be assumed that the failed mainframe was only running the customer accounts application, hence other (potentially more critical) applications could also have failed. From the companies I've worked with, 2~3 hours to restore the mainframe environment to 'normal' operation, out of hours, would be within SLA.
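
    As a purely illustrative, numbers-made-up sketch of why that licensing model discourages keeping spare capacity active:

        # Entirely hypothetical figures, just to show the shape of the sum:
        # the licence is priced on TOTAL active CPUs across the Sysplex.
        active_cpus_per_site = [4, 4, 6]   # three sites in the sysplex (made up)
        price_per_active_cpu = 50_000      # per year (made up)

        total_active = sum(active_cpus_per_site)
        bill = total_active * price_per_active_cpu
        print(f"{total_active} active CPUs -> {bill:,} per year,"
              " even if the product only ever runs on one box")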

    Yes, with smaller systems - with significantly lower loads, operating costs and support requirements - different styles of operation are possible to achieve high availability and low failover times.

    1. Anonymous Coward
      Anonymous Coward

      Re: Scary the lack of any real knowledge being shown here

      While it is costly and complex, you can definitely have real-time failover with Parallel Sysplex, even with extreme I/O volumes; PS was built for that purpose. Do you mean 2-3 hours to restore sysplex equilibrium while the apps stay online, or 2-3 hours with the system completely down at any time after hours?

This topic is closed for new posts.
