Resilience is overrated when it's not advertised

Nothing ruins a weekend like failed failover, which is why every Friday The Register brings readers a new instalment of On Call, the column in which we celebrate the readers whose recreation is ruined by rotten resilience regimes. This week, meet “Brad” who once worked for a company that provided criminal justice apps to …

  1. Anonymous Anti-ANC South African Coward Silver badge

    Fallback fault-tolerant

    Are modern failover systems much better and more resilient than their ancient counterparts?

    1. Wellyboot Silver badge

      Re: Fallback fault-tolerant

      They can be.

      The backup system having the same capacity as the primary unit certainly helps!

    2. jake Silver badge

      Re: Fallback fault-tolerant

      The ancient counterparts worked just fine when spec'ed and installed and maintained properly.

      Just like the modern kit.

    3. Anonymous Coward
      Anonymous Coward

      Re: Fallback fault-tolerant

      Many years ago, a city council in the north of England had a department running a pair of NetWare 3 servers in an SFT cluster. The nodes were called Zig and Zag after a couple of characters on a breakfast TV show. One day, Zag had a permanent and irreparable hardware failure, leaving the cluster running only on Zig. Anecdotally, the users said that performance had improved - although it was only in that state for a few months before we installed the replacements with far more boring and forgettable names.

      1. Michael Strorm Silver badge

        Dem girls, dem girls, yeah yeah yeah

        According to Wikipedia, Zig was the name of the stupid one and Zag less so - as far as The Big Breakfast went - but apparently they'd been more the other way round when they first appeared on Irish TV, where they originated.

        Maybe the thick server stopped getting in the way of the good one?

        1. Flightmode

          Re: Dem girls, dem girls, yeah yeah yeah

          > Dem girls, dem girls, dey all love me!

          Welp, that weekend is ruined. Might as well make it full-blown.

          Ziggyman and Zagamuffin are in the house in full effect!

          https://www.youtube.com/watch?v=7bv_36P_f-w

      2. Anonymous Coward
        Anonymous Coward

        Re: Fallback fault-tolerant

        A boffin at a lab I worked at had a couple of PCs used to connect to kit when away at sea: Burt and Ernie.

        1. Korev Silver badge
          Boffin

          Re: Fallback fault-tolerant

          The FACS machines in the lab at my previous site were named after Thomas the Tank Engine characters.

          1. The commentard formerly known as Mister_C Silver badge

            Re: Fallback fault-tolerant

            'Twas Star Trek characters at a site I worked at in the 90s (i.e. mid DS9). The sysadmin awarded himself "Q". The launch of Voyager helped as the company expanded.

            I used WestWallaby as the domain name for my machines at home for a while. Gromit for my normal PC, McGraw for the linux box (see what I did there) and Preston for the overclocked monster that kept malfunctioning. I gave Mister_C senior a rebuilt PC and he never worked out why it was called Wallace...

            1. jake Silver badge

              Re: Fallback fault-tolerant

              Naming systems after Tolkien characters officially became old after I ran across the fifth server named "Bilbo" in a single day (two at Berkeley, one each at Stanford, San Jose State and Mission College). That was in roughly 1980.

              1. that one in the corner Silver badge

                Re: Fallback fault-tolerant

                At uni, all the shared-resource printouts came back as output from Orac.

                1. This post has been deleted by its author

                  1. Killfalcon

                    Re: Fallback fault-tolerant

                    The oldest mainframes at my place are all named after comedians or Wind in the Willows characters, though as they only allow four-character names, some liberties were taken. Over the long decades we've lost ENRI, HRDY and RATY, but we still have ERIC, STAN, and TOAD.

                2. shawn.grinter

                  Re: Fallback fault-tolerant

                  When I worked for a certain oil company our pair of IBM mainframes were called Zen and Orac

                3. Elongated Muskrat Silver badge

                  Re: Fallback fault-tolerant

                  At one place I worked at, the development file server was called "JCN," which was a reference to HAL from 2001: A Space Odyssey (HAL was "one letter up" from IBM; JCN is "one letter down").

              2. Eclectic Man Silver badge

                Re: Fallback fault-tolerant Tolkien names

                In a collection of Terry Pratchett's articles about life, he recounts a book signing where one lady approached him and he asked her name. She mumbled something which he could not hear, so he asked again; another mumble. On the third time of asking it transpired her given name was Galadriel. He asked if she had been born on a Welsh commune. She said not: it was a caravan in Cornwall, but basically hippy parents.

                Lots of rock climbs at a place called 'Goblin Combe' are named after Tolkien people and places especially those on Owl Rock and Orthanc : https://www.ukclimbing.com/logbook/crags/goblin_combe-44/

                1. Elongated Muskrat Silver badge

                  Re: Fallback fault-tolerant Tolkien names

                  Lots of rock climbs at a place called 'Goblin Combe' are named after Tolkien people and places

                  This I did not know, but the rule about Goblin Combe is: don't tell people about it, or it'll be swarming with people ruining it, like anywhere nice in Bristol.

                  1. Eclectic Man Silver badge
                    Facepalm

                    Re: Fallback fault-tolerant Tolkien names

                    Elongated Muskrat: the rule about Goblin Combe is: don't tell people about it, or it'll be swarming with people ruining it

                    Ooops! (Sorry.)

              3. Paul Hovnanian Silver badge
                Coat

                Re: Fallback fault-tolerant

                Not sure if that naming could cause problems on a network.

                Was it a Tolkien ring?

              4. Anonymous Coward
                Anonymous Coward

                Re: Fallback fault-tolerant

                I have a whole suite of algorithms named after Tolkien characters and places. Rauros, Gimli, Gloin, Dimrost, etc.

                The tradition was started back in the 70's when most of the experimental work informing current practices was done and turned into software on an S/360 mainframe. Who am I to argue with carrying on the naming convention?

                We still have the user manuals, and they include glorious estimates for the cost of carrying out a cycle.

                Physics doesn't change (much), so the algorithms live on, thankfully mostly migrated to something more recent.

            2. John Robson Silver badge

              Re: Fallback fault-tolerant

              Worked at one place which used composers.

              For relatively small networks I quite like elements.

            3. Anonymous Coward
              Anonymous Coward

              Re: Fallback fault-tolerant

              We used to name our servers after birds; there are surprisingly many bird names, which is helpful. The bigger servers were named after bigger birds, e.g. ostrich, emu etc. You didn't want to get allocated a server called wren!

              We did manage to get away with having a server named chough for a while, and had several of the tit family. We didn't manage to get a booby past our managers.

              Better make this anonymous as some of them are still on the network!

              1. molletts

                Re: Fallback fault-tolerant

                I seem to recall workstation names including "fanny" and "breast" in one of the computer labs at uni. I'm sure the IT team would have claimed they were randomly-chosen according to an entirely innocent documented pattern of names if asked.

                I've had pairs of servers called Romulus & Remus and Castor & Pollux in the past, before boring but informative names like "host1" and "host2" became de rigueur.

          2. Paul Hovnanian Silver badge

            Re: Fallback fault-tolerant

            We had a group which acquired an HP-UX system. They named it Homer.

            Hinting that we should continue with the series when we received our system, we happily named it Ulysses.

            I didn't realize they meant the cartoon. D'oh!

            1. Elongated Muskrat Silver badge

              Re: Fallback fault-tolerant

              Should have gone with one of these, rather than one of his characters.

              If nothing else, it should be amusing watching non Greek-speakers trying to pronounce Creophylus correctly.

          3. Peter Ford

            Re: Fallback fault-tolerant

            I set up the HP-UX cluster in my Ph.D. lab with TTTE names - Gordon was the big machine; Edward, Henry, James and Thomas were the smaller ones. When we got a shiny SGI Iris in the lab, the Prof said we had too many boys' names and it needed something more feminine - if there had been two new machines they'd have been Annie and Clarabel, but in the end the purple case (and the fact that it was a crystallography lab) resulted in 'Amethyst'.

            Slightly later, OUCS had machines named after colours, and the main multi-user servers were 'black' and 'white'. They were superseded by 'sable' and 'ermine', which somehow transitioned to a mustelid theme, and there was (I think) a 'wolverine' and a 'weasel' after that...

            The company I moved to had Arthurian Legend names - Arthur, Merlin, Guinevere, Morgana...

            Later, when I took on the system admin of that company, I went very boring and used NATO phonetic alphabet names for the sudden proliferation of VMs.

        2. Anonymous Coward
          Anonymous Coward

          Re: Fallback fault-tolerant

          At work I have Pinky, TheBrain and Snowball. Used mostly for testing - AKA ACME Research Labs....

          I did have a period of using Beavis and Butt-Head names and insults for host names - a backup server called fartknocker, for example.

        3. swm

          Re: Fallback fault-tolerant

          At the school where I taught the file server was named Mordor but was eventually replaced with Gondor.

        4. FIA Silver badge

          Re: Fallback fault-tolerant

          A boffin at a lab I worked at had a couple of PCs used to connect to kit when away at sea: Burt and Ernie

          Boat and Ernie surely?

          1. Peter Ford

            Re: Fallback fault-tolerant

            If they were from Sunderland, that's pretty much the same thing

        5. Anonymous Coward
          Anonymous Coward

          Re: Fallback fault-tolerant

          I do remember a couple in a school named Neo and Scooby

      3. Stevie

        Re: Fallback fault-tolerant

        In the mid-90s I grew very tired of visiting sites where the resilient nodes were named "Calvin" and "Hobbes".

        Hard to smile at a joke you've heard many times before.

        1. BartyFartsLast Silver badge

          Re: Fallback fault-tolerant

          There's always a firewall called Cerberus surely?

          1. Anonymous Coward
            Anonymous Coward

            Re: Fallback fault-tolerant

            Used to deal with an evil piece of security software with that name. It was terrifyingly easy to completely lock yourself out of a machine with it. Not good memories!

      4. Anonymous Coward
        Anonymous Coward

        Re: Fallback fault-tolerant

        I once named a machine 'icle' when I set it up, and that made it into dev literature before I had a chance to change it when my brain eventually woke up (don't force emergency jobs on me before coffee, due to the potential for consequences).

        I never had the heart to tell them that I only used the second part of the word because I got bored using the first half - "test" ..

        :)

        I was also once told by security auditors on another, rather large project to remove the version number from the sendmail HELO prompt (in the days we still used sendmail), so we recompiled it to say "biscuit" instead. I honestly have no recollection why on Earth I chose that, but even after I left that project I could see that it remained in place for quite a few years by means of a simple telnet to port 25.

        There are now a few people who know who I am :)

      5. brotherelf

        Re: Fallback fault-tolerant

        Well yes, synchronous operations will do that.

      6. Prst. V.Jeltz Silver badge

        Re: Fallback fault-tolerant

        I did much googling when setting up the SSIDs at my house. The correct comedy name for two nodes of something or other is in fact:

        "Wallace & Gromit"

    4. Anonymous Coward
      Anonymous Coward

      Re: Fallback fault-tolerant

      Software is generally the limiting factor; the hardware is definitely capable. If well executed, the combination can be very effective.

      Lots of databases are transactional in nature, so if properly designed, operations interrupted mid-execution by a failure can be backed out by your failover unit and repeated.

      It does require something somewhat more robust, however - a bunch of Python scripts gluing glorified spreadsheets together does not a resilient system make.
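
      (For the curious, a minimal Python/SQLite sketch of that idea, with a made-up accounts table: the unit of work either commits in full or rolls back, so a surviving node can safely replay it.)

      import sqlite3

      def transfer(db_path, from_acct, to_acct, amount):
          # Hypothetical example: an "accounts" table with id and balance columns.
          conn = sqlite3.connect(db_path)
          try:
              with conn:  # opens a transaction; commits on success, rolls back on any error
                  conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                               (amount, from_acct))
                  conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                               (amount, to_acct))
          finally:
              conn.close()

      # If the primary dies mid-transaction nothing is committed, so the failover
      # node can simply run transfer() again with the same arguments.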

  2. Pascal Monett Silver badge

    Failover backup redlining

    The issue is not the failover, it's the idiot who designed a failover system with fewer resources than the production system.

    If you design a failover, that server needs the exact same configuration as the one it is replacing.

    Not doing that is stupid, and this was the result.

    I would have thought that you wouldn't need a degree in computer science to understand that. Apparently, you do.

    1. Korev Silver badge
      Coat

      Re: Failover backup redlining

      Yeah, the person who designed this is criminal...

      1. Korev Silver badge
        Coat

        Re: Failover backup redlining

        Someone should beat them...

        1. Anonymous Custard Silver badge
          Coat

          Re: Failover backup redlining

          You could predict that backup server would plod...

          1. Will Godfrey Silver badge

            Re: Failover backup redlining

            Well, the backup performance was certainly arrested!

            1. Eclectic Man Silver badge
              Unhappy

              Re: Failover backup redlining

              Steady on, chaps!

              The person who designed it probably specified an equal system, but then got overruled by 'management' who wanted to save some cash, and possibly reasoned that a failover system would only be used for a short time while the main system was 'fixed' of whatever malady had occurred. It is whoever authorised the lower-powered back-up system who deserves your opprobrium. I mean, who has not come across systems hampered by management's failure to shell out the necessary dosh?

              1. John Brown (no body) Silver badge

                Re: Failover backup redlining

                Or, possibly they DID buy a matched pair, but as development and feature creep went on, realised they needed more resources and someone forgot or vetoed the upgrade on the backup machine.

                1. Screwed

                  Re: Failover backup redlining

                  Or they used some memory from the backup server when they found they needed more in the prime server.

      2. aerogems Silver badge
        Coat

        Re: Failover backup redlining

        Wonder if they developed a phobia regarding anything copper after that.

    2. Peter Gathercole Silver badge

      Re: Failover backup redlining

      ..."same configuration"...

      Whilst this is desirable, it may not always be required.

      If you say up front in the requirements that the backup server does not need to maintain the same performance, just as long as it can carry the load, then it could be smaller. But this would need to be communicated to the user base: when in failover, the service will be slower (and you probably want some indication that the service is running on the backup server, so users can see why it is running slow).

      I've been in situations where this has been the decision made (and the client has had a load-shedding process to make sure that the essential parts of the service work at the expense of some of the others). It's a risk decision between cost and failover capability.

      Not having a fail-back process is probably more of an issue in these cases, though.

      1. Richard 12 Silver badge

        Re: Failover backup redlining

        In the Real World (tm) this is absolutely everywhere.

        - Emergency lighting is not as bright.

        - UPS and backup generators don't carry the whole load.

        - Traffic diversions are onto smaller, slower roads.

        - Backup Internet connections have less bandwidth and increased latency (e.g. cellular)

        It's the normal way of doing redundancy.

        1. Captain Scarlet
          Childcatcher

          Re: Failover backup redlining

          You have backup generators O_O

          Posh

        2. Anonymous Coward
          Anonymous Coward

          Re: Failover backup redlining

          We had a new software system going in that would eventually, if accepted, be our main driver for the business. The system was on evaluation and brand new; it had a dedicated server setup with supposed redundancy. This redundancy was provided by two identical boxes* and an identical RAID setup. Or that's what was supposed to be the case. One night I came back to the office to collect my brolly and found 'Rob', one of our senior tech people and lead on this project, sitting at his desk with a coffee; the bin had evidence of other, earlier coffees. I wondered why he hadn't been in the boozer with the rest of us.

          I wondered for only a short period, as he told me in a very irritated voice that he was monitoring a RAID repair/rebuild on this much-vaunted new system. Then I noticed the services screen which showed the health of all the IT kit, and the screen was red, indicating a problem that had taken something down. On closer look it turned out to be the new system, and I asked why this was down given the redundancy we'd all been told about. "Because they haven't installed that yet, so we're running on just one box* and I have to stay until it's working again." Apparently they were super confident of their system, and so hadn't installed the backup system at the same time; that was a few weeks away.

          He'd phoned the US tech support as they were still working at that hour, and had been told what to do by a teenager who sounded like Jeremy Freedman, the squeaky-voiced teenager from The Simpsons. Suffice to say it didn't take long for Rob to recommend we not proceed with a full rollout.

          *or group of boxes can’t remember

        3. Anonymous Coward
          Anonymous Coward

          Re: Failover backup redlining

          > - UPS and backup generators don't carry the whole load.

          > - Traffic diversions are onto smaller, slower roads

          > It's the normal way of doing redundancy.

          Those are not examples of redundancy - fallbacks, yes, redundancy, no. The traffic diversion example should make that very clear.

          1. doublelayer Silver badge

            Re: Failover backup redlining

            It depends on what they're supposed to be redundant to. A UPS and generator are not supposed to be redundant to mains power for the entire office, but are designed to be redundant for the important servers. Other roads are generally designed to be redundant for emergency vehicles and average traffic, not all the cars that can go down a larger road. If it's designed to be redundant for every purpose the original one was used for, then you're right. If not, you may not be.

      2. Wellyboot Silver badge

        Re: Failover backup redlining

        The load-shedding part is vital; that requires a decision as to what's not important enough to spend money on.

        1. ColinPa Silver badge

          Re: Failover backup redlining

          And of course you need to decide in advance what load you will shed, and have automation to do it, which you have tested.

          1. Doctor Syntax Silver badge

            Re: Failover backup redlining

            And make sure the users know this and understand the implications.

            1. Flightmode

              Re: Failover backup redlining

              AND not end up in a situation where users - because things are slower - keep pounding the servers by refreshing their requests again and again and again and...

            2. Eclectic Man Silver badge
              Joke

              Re: Failover backup redlining

              Dr Syntax: And make sure the users know this and understand the implications.

              Are you mad?

              Don't tell the users anything, they'll want explanations and promises. They'll complain to you that 'it doesn't work' or 'it's too slow' or 'You said this would be OK and it isn't.' They will ask 'Why isn't it working?' and 'How long before I can use it again?' and 'Can I just print this out before you shut the whole thing down again?'

              Shakes head in disbelief at some people's naïveté.*

              *Accents and spelling courtesy of Apple spell-checker suggestion. No, I have no idea how to get two dots over a lower case i.

              1. jake Silver badge

                Re: Failover backup redlining

                " have no idea how to get two dots overt a lower case i."

                Not a Black Sabbath fan, I take it?

                Perhaps I'm showing my age.

                Or maybe I'm just paranoïd.

      3. An_Old_Dog Silver badge

        Re: Failover backup redlining

        A major problem here is when management, when shown the cash outlay for two identically-resourced computers, says they can save money on the second server by making it less capable, and promises the techies they won't be expected to provide identical performance in failover mode - promises which are immediately forgotten and later denied by said management.

        (see: https://www.youtube.com/watch?v=xWBFWw3XubQ)

    3. Wellyboot Silver badge

      Re: Failover backup redlining

      What's the betting that 'We're only using a third of the capacity on average so we don't need a full size backup' was part of the conversation.

      Beancounters at work again?

      1. Dave314159ggggdffsdds Silver badge

        Re: Failover backup redlining

        If everything is a conspiracy to you, the problem isn't with the world...

        1. This post has been deleted by its author

        2. Anonymous Coward
          Anonymous Coward

          Re: Failover backup redlining

          "If everything is a conspiracy to you, the problem isn't with the world..."

          Correct, just with the ones who are out to get me !!!

          :)

          1. Eclectic Man Silver badge
            Joke

            Re: Failover backup redlining

            Dear Coward,

            We're not out to get you, we're out to get everyone.

            MWhahahaha

        3. SCP

          Re: Failover backup redlining

          Just because you're paranoid it doesn't mean they aren't out to get you.

          1. Anonymous Coward
            Anonymous Coward

            Re: Failover backup redlining

            I believe it is..

            Just because you're *not* paranoid, it doesn't mean they aren't out to get you.

        4. Doctor Syntax Silver badge

          Re: Failover backup redlining

          Paranoia is the price of freedom. Vigilance is not enough. - Len Deighton

        5. Montreal Sean
          Black Helicopters

          Re: Failover backup redlining

          That's what THEY want you to think!

      2. Anonymous Coward
        Anonymous Coward

        Re: Failover backup redlining

        Where I'm working at the moment, we have a crazy situation. We have pairs of large database servers in active/active mode, with each individual server sized to support the entire load of the pair on its own.

        I look at the on-going CPU and memory utilization in absolute horror sometimes as we are only really using 25-30% of the resource available most of the time (and sometimes much less). The lifetime cost of all this unused CPU and memory must be terrific, but even when set up like this, the DBAs still want more resource (why are DBAs like this?). This is in an environment where we can allocate and de-allocate resource dynamically should the need arise.

        This is compounded by the per-core licensing model of the database software (you can guess which this is!) which also puts constraints on dynamic sizing of the systems.

        1. Doctor Syntax Silver badge

          Re: Failover backup redlining

          The key here is "most of the time". If it rises closer to 100% some of the time that "some" might be quite important. And things might get scary when that happens. I ended up spending a few Friday lunch-times* watching a server engine eat up more and more memory (due to a badly written 3rd party program which I eventually managed to get fixed) and having to allocate memory on the fly. If it overran it crashed and left a nice mess to clean up. If you don't want to spend your time doing that then going along with the sizing might be a good idea.

          * Nice scheduling of the weekly invoice run, manglement.

          1. Doctor Syntax Silver badge

            Re: Failover backup redlining

            I should add that the disks were mirrored at the controller level and again in software - i.e. quadrupled. We never had a disk failure but the backup tape drive failed fairly regularly.

          2. Anonymous Coward
            Anonymous Coward

            Re: Failover backup redlining

            I feel I must add that the times that the systems use more than the 30% are few and far between (and are also to do with the administrative services, not the application load, and normally only affect one of the two systems).

            I understand that if you want normal levels of service even when you have a system down, it is necessary to overspec. the systems, but in this case we're talking possibly 28 unused Power 8 processors dedicated to the two system images running the main database, and something like 500GB of unused memory (these systems were installed 8 years ago when Power 8 was current, so very expensive). This is a lot of resource, and is even more when you consider that all of those AAAA class processes need per-processor licenses for the (very expensive - you can guess which one) database!

            I estimate that we could recover maybe 40% of the resource (certainly of the CPU resource - memory maybe a little less) without the service even blinking when the load was being carried on just one system. It would be busy, but the biggest constraint would probably be the I/O bandwidth because of the limited number of disk paths (I have pointed this out as the main bottleneck from the measured stats, but the primary DBA is listened to more than me).

            Of course, the biggest constraint is the stupid license conditions on the DB license that prevents us using the dynamic resource allocation that these systems have without incurring punitive charges for unnecessary DB licenses!

        2. An_Old_Dog Silver badge

          "Wasted" Computer Resources

          1. Payment for those "wasted" computer resources is akin to insurance payments: you're paying now to mitigate possible severe bad consequences later. This is especially-relevant to retail-related systems which experience transaction peaks around certain holidays and times of the year.

          2. You're not paying as much as you fear for that usually-unused capacity: CPU cycle, RAM, and hard disc storage costs are pretty much continually going down.

          1. Anonymous Coward
            Anonymous Coward

            Re: "Wasted" Computer Resources @An_Old_Dog

            I don't disagree with you, but these servers are overspec'd to an unnecessary degree IMHO.

            During maintenance work, we occasionally carry the load on a single server, and even when we do, there is spare resource on that system.

            The problem I have is that when carrying the load on a single system, it logs significant amounts of what is simplistically called "I/O wait" time, which the DBA looks at and immediately demands more CPU resource (because, taking that into consideration, the CPUs appear to be running at 100%), regardless of the number of times I tell him that it is an indication of the disk subsystem being overwhelmed, not a lack of CPU!
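
            (A minimal sketch of the distinction, assuming Linux and the psutil package: high iowait means the CPUs are sitting idle waiting on storage, not that they're out of capacity.)

            import psutil

            t = psutil.cpu_times_percent(interval=5)   # sample the CPU time split over 5 seconds
            busy = t.user + t.system
            print(f"busy {busy:.1f}%  iowait {t.iowait:.1f}%  idle {t.idle:.1f}%")
            if t.iowait > busy:
                print("CPUs are mostly waiting on the disk paths - adding cores won't help")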

        3. el_oscuro

          Re: Failover backup redlining

          I am a DBA and I like it when my servers are at 20-30% CPU usage. That means my database is working well and the application is well designed. If you are at or near 100%, there is something wrong with your database and/or application. One of our databases was routinely getting hammered and sat at 100%; performance was terrible. And I found out why: the application was executing a query to look up a static value over 30 million times during the login process. Stupid shit like that is how you get to 100% on a database server, and no DBA wants to see anything like that.
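
          (A hypothetical illustration of the fix, not the actual application: look the static value up once per process and cache it, rather than once per login.)

          import sqlite3
          from functools import lru_cache

          conn = sqlite3.connect("app.db")  # imagined config database

          @lru_cache(maxsize=None)
          def static_config_value(key):
              # One round-trip per distinct key for the life of the process,
              # instead of one per login attempt.
              row = conn.execute("SELECT value FROM config WHERE key = ?", (key,)).fetchone()
              return row[0] if row else None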

          1. Anonymous Coward
            Anonymous Coward

            Re: Failover backup redlining

            There's a world of difference between 20-30% and 100%.

            I appreciate that if you have a database that allows ad-hoc queries to be run, it is always worth having more headroom (because it's remarkably easy to write a bad, inefficient query that will just consume resource like crazy). But if you are mostly running canned queries on a production server, it should be possible to push to 50-60% without compromising the system, maybe bring this down to 40-50% if this is an active-active database cluster with failover, to cover for failover situations.

    4. DS999 Silver badge

      Re: Failover backup redlining

      it's the idiot who designed a failover system with less resources than the production system

      That may have been deliberate to save money (the difference could easily have been tens of thousands of dollars or more back then) with the idea that you'd only be limping along on the backup server for a short time since you would immediately call in the vendor to fix the primary server.

      Of course that would require fully operational failover, including the VERY important piece of notification that failover occurred! Otherwise how are you supposed to know there's something broken?

    5. An_Old_Dog Silver badge

      Secondary Server Needs to be Identically-Resourced as the Primary Server

      Degrees and certifications are irrelevant to knowing that the primary and secondary systems should be identically resourced.

      A job in management, or a job which requires sucking up to management, can far too easily remove a person's give-a-shits about possible bad consequences when higher-level management wants to save money (so they can spend it on "more-important" things).

    6. chris street

      Re: Failover backup redlining

      There's enough detail in here for me to know exactly what server was being talked about, nearly thirty years later...

      Firstly, manglement wouldn't let us have a more expensive failover backup - so that's the first one. With the cost of the machines at the time it was hardly surprising either. Secondly, where is it writ that it must be the same size and specification? The failover box was designed to provide an emergency level of service so you could do the very basics; the custody suite always had the option of running on pen and paper, which they used to do when Oracle 6.5 took one of its regular sulks and refused to work.

      The problem with that is getting a copper to appreciate that emergency use might result in a lower level of service, and also getting them to care a **** about it when they did understand. They were some of the most obstreperous and wilfully malicious users that I have ever encountered.

  3. Korev Silver badge
    Coat

    "A couple of years later we moved to Sun servers

    Did they use Solaris Jails?

    1. jake Silver badge

      Zones. Solaris isn't BSD anymore.

      1. Korev Silver badge
        Facepalm

        Good point, not so good for puns though...

      2. that one in the corner Silver badge

        Easy to forget, guess Korev just zoned out.

  4. jake Silver badge

    Resilience is futile.

    Prepare to be discombobulated.

    1. that one in the corner Silver badge

      Re: Resilience is futile.

      We shall make your technological distinctiveness our own.

      No, wait, hang on, that's too much, where were you keeping all this!

  5. trevorde Silver badge

    The horror...

    Worked on a system where we had redundant failover servers, both quite well specced. Only problem was our software was so flaky we had to have a separate watchdog timer to reboot the system when our software stopped heartbeating. This happened a lot. The most reliable part of the system was the 'database' which was a set of text files, kept on a shared network drive (no sniggering at the back, please).

    1. Anonymous Coward
      Anonymous Coward

      Re: The horror...

      We had a pair of Sun E450s purchased from a reseller, but Sun insisted that we have official Solaris Cluster training. This was held on-site so the instructor decided that we could do real-world practicals rather than lab-based ones. Except we couldn't as the reseller had cabled everything wrongly and made a mess of the IP addresses too!

      It took the instructor an hour or so of head scratching and troubleshooting before realising what was wrong and fixing it for us, after which it worked flawlessly.

      Except the developers used the redundant server to develop (don't ask, I don't know why!) and every month or so we had a failover as the developers made a mistake. Eventually, they got given their own little server to use and the failovers stopped.

      1. Anonymous Coward
        Anonymous Coward

        Re: The horror...

        "Except the developers used the redundant server to develop (don't ask, I don't know why!)"

        I know exactly why .... someone with 'clout BUT no IT experience' reasoned 'there is a box that is not used' and said use that instead of funding kit for the developers.

        Been there, given all the warnings and lived to tell the tale.

        [Basically, told the powers that be that if you use the server and it fails over you will lose all the current work being done by the developers. Put in place whatever systems to limit loss BUT it is on YOUR head not mine !!!]

        Fun times !!!

        :)

  6. TWB

    Proper resilience

    Years ago, when I worked in a Big Broadcasting Corporation and we were heading towards automation, some of my colleagues visited a bank (I believe) to see their set-up. They had a proper pair of 'systems' and could, and did, switch over from one to the other frequently. This meant that both systems were well maintained, up to date and usually ready to go.

    Of course we still went down the "main and backup" way of thinking - where even if the "backup" was specced the same as the main, it was considered a bit second class and never got the love of the main. If we ever went over to the backup, there was always a push to go back to the main as soon as possible.

    Much better if you want proper resilience to have X and Y systems which are the same spec and truly considered, treated and used equally.

    1. Peter Gathercole Silver badge

      Re: Proper resilience

      The difference there is that the Bank would probably lose money in a failover situation if they weren't prepared for it, which makes justifying the initial outlay and ongoing costs a lot easier, whereas a Big Broadcasting Corporation, probably funded by the public and accountable to financial audit and public scrutiny of their accounts, only has a reputational loss to consider.

      1. TWB

        Re: Proper resilience

        The point I was trying to make was that very often we did get equivalent systems on main and backup, but the mentality was that the backup was considered "not as good".

        1. claimed Silver badge

          Re: Proper resilience

          Should call them mirrors, x and y

          Using 1, 2 or a, b still implies preference but x and y are pretty much neutral. Actually, even more fun: b and d!

          1. An_Old_Dog Silver badge

            Re: Proper resilience

            Or call them Salt and Pepper, or Pepsi and Coke, or Bread and Butter ...

            1. KittenHuffer Silver badge

              Re: Proper resilience

              In the case of these servers they should have called them Chalk and Cheese!

          2. collinsl Silver badge

            Re: Proper resilience

            Speak to the Royal Navy - A and B turrets were at the front of the ship, X and Y were at the back.

            X and Y turrets were traditionally manned by the Royal Marines, whereas the others were manned by the gunners of the ship.

            Definitely some preference there!

    2. trindflo Silver badge

      Re: Proper resilience

      Another way I implemented once upon a time was to have multiple servers that could handle a request, and a mechanism by which one of the machines could claim the request. The initial request was forwarded to multiple servers and whichever machine got to the request first handled it. This redundancy was duplicated at multiple levels. If one of the machines went down you wouldn't notice it. The system was reliable enough that it became something people forgot about, and we needed to add monitoring to report there was a problem or we would not know until all the redundant machines were down.

      I never heard if this was considered standard practice. It seemed somewhat like a RAID disk array. Maybe it was a RAIS? Same concept, except for servers?
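
      (Something along those lines can be sketched as below - a hypothetical Python/SQLite illustration, not the actual mechanism described above: whichever node's compare-and-swap UPDATE lands first wins the claim.)

      import socket
      import sqlite3

      NODE = socket.gethostname()

      def claim_next(conn):
          """Atomically claim one queued request; returns its id, or None if nothing is queued."""
          with conn:  # single transaction
              row = conn.execute(
                  "SELECT id FROM requests WHERE state = 'queued' LIMIT 1").fetchone()
              if row is None:
                  return None
              claimed = conn.execute(
                  "UPDATE requests SET state = 'working', claimed_by = ? "
                  "WHERE id = ? AND state = 'queued'",
                  (NODE, row[0])).rowcount
              return row[0] if claimed else None

      # Each node polls claim_next(); a dead node simply never wins a claim,
      # so the survivors carry on without anyone noticing.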

      1. Killfalcon

        Re: Proper resilience

        I did that by mistake once. The 'claim' was originally meant to just tell the users which requests were in progress and which were queued (and could in principle be cancelled/modified).

        Then one time someone accidentally started the process twice, and... nothing went wrong. Each process just grabbed a task for itself and claimed it, and they only polled every two seconds so there weren't any issues with collisions.

        In the long run we ended up running it on two machines to double the throughput, with hardly any extra effort needed. Great feeling.

  7. Howard Sway Silver badge

    .... and it had never been tested

    Well, it had been by the time he got to the office.

    At least they learnt from it afterwards by starting to test properly. Hopefully whichever management thought that they could do backups / failover on the cheap learnt that too.

  8. Anonymous Coward
    Anonymous Coward

    In all my years (16) working with AIX I never met an HA cluster which hadn't been broken into individual servers, without HA/clustered applications. Or maybe the IBM sales rep sold the customer an HA solution under the premise of "you need to have high availability applications!", but the customer never got them, or was too cheap to pay for the cluster licenses, I dunno.

    This, of course, was carried out without deactivating the HA/floating IPs/shared LUNs. So whenever a reboot or network change came in, the inevitable high-severity ticket was dispatched.

    A particularly cumbersome setup I had to recover once involved several individual Oracle instances (not RAC), each with their own floating IP, all poorly documented by the existing sysadmins and DBAs, which took a lot of head scratching to make sure the IP aliases were assigned to the right NIC/node.

  9. disgruntled yank Silver badge

    IP addresses?

    It has been a long time since I sat at a D210 terminal, but it seems to me that even at the end of my involvement with the MV/Eclipse systems they were using DG's own Xodiac networking. This does not, of course, affect the burden of the story - but what would the comments section be without a bit of gratuitous pedantry?

    I never got to work with anyone prosperous enough to hook DG servers together to fail over.

    1. Antron Argaiv Silver badge
      Thumb Up

      Re: IP addresses?

      Former DG employee here (#15490). I was on the Westboro team that designed the D200. We got a nice trip to Austin TX to hand the design over to the new Austin terminal group. DG was an interesting first job out of uni. I stayed there for 14 years.

    2. An_Old_Dog Silver badge

      Re: IP addresses?

      I had a DG Nova/4 for a while, and vaguely recall reading somewhere that one could cable two CPUs to share the same disk box (a 6045? It had 5+5 MB: DP0 was the lower, fixed platter, and DP1 was the upper, removable [in-a-cartridge] platter). Did you ever see or hear of such a thing being used?

  10. Gerhard den Hollander

    Wouldn't he have noticed?

    That the server had only half the expected CPU and only half the expected memory?

    Or was he administering so many machines he didn't know how much memory/CPU there was supposed to be in this one?

    I'm assuming that since there was telnet access to the machine, he could have checked the amount of memory and the number of CPUs?

  11. Boris the Cockroach Silver badge
    Facepalm

    I have one phrase

    for you all

    'London Ambulance service'

    For those not in the know, they introduced a computerised system to look after the service. Due to a software fault the system fell over after 3-4 months. However, the software and hardware guys who'd built the system had said to themselves "It might fall over... so let's design it to fall over to a backup server".... if only the beancounters hadn't cancelled buying the backup server because of the unneeded cost......

    1. Doctor Syntax Silver badge

      Re: I have one phrase

      I think if I'd been in that one I'd have given the BCs a written warning including something along the lines of "I am warning you that not providing the backup is likely to result in loss of human life. When this happens I will personally attend any Coroner's Court and give evidence of this warning, and will name you in that evidence."

      1. doublelayer Silver badge

        Re: I have one phrase

        This assumes that they knew about that and that they didn't mind losing their jobs that day. If they created the software, made the test systems, then handed it to someone else to deploy in production, they wouldn't know whether the servers were present. If the person who received the task of putting it in production is aware that there will be no backup server and assumes the developers were informed, they may assume that it has been handled in some way. It's possible nobody was in a position to make that ultimatum. Which brings us to the problem with ultimatums: managers who like commanding don't like being threatened by employees, and employees know it. People who aren't ready to leave tend not to be that blunt with people who won't listen, so it's more likely to have sounded like this:

        Manager: I've decided we don't need the two servers. One should be enough and save our budget.

        IT: The system is designed for two, so I suggest we go along with those requirements.

        Manager: I've decided that we won't. Make it work.

        IT: If something goes wrong, this might have safety implications.

        Manager: Then do your job well and make sure that server stays up.

        IT: We can't guarantee that, though. Two servers would...

        Manager: I can find an IT person who will guarantee it. Should I look for them?

        1. Anonymous Coward
          Anonymous Coward

          Re: I have one phrase

          It's a black mark on the industry that the manager can find someone who will make that guarantee. It should be like trying to find a pilot who thinks wings are optional.

          1. Richard Pennington 1
            FAIL

            Re: I have one phrase

            ... a pilot who thinks the other wing is optional.

            1. collinsl Silver badge

              Re: I have one phrase

              Jets with more than two engines were originally the only ones allowed to fly over the Atlantic because most three-engined planes can fly perfectly well on two engines, but older two-engined planes could not maintain height on one sufficiently long to divert to the nearest airport.

              This analogy would work better as a pilot suggesting that two engines were enough, back in those early days.

          2. doublelayer Silver badge

            Re: I have one phrase

            That's somewhat true, because everyone in IT knows that, even with two servers, there is some chance that something will affect them all. However, it's less difficult to find someone who says they're reasonably confident that they can make the server stay up without having a second one, especially if the manager keeps reducing the number of facts they tell them about how critical the thing is.

        2. Anonymous Coward
          Anonymous Coward

          Re: I have one phrase

          Never suggest, only advise, or strongly advise. Make sure it's recorded. As long as it's in black and white you did as you are employed to do.

          If people don't listen to your advice make sure it's recorded and then it's no longer your problem.

          Always make sure that if an "I told you so" situation occurs, you can back it up.

          Better still, have someone impartial write a risk assessment first.

  12. Gordon 11

    I remember a failover cluster for a server to run an Oracle database.

    Whereas the cost covered configuring the system and database to failover, it didn't cover anything else we had running there. Such as the Web server - and other bits.

    So, after a quick course on how to do it (and, thankfully, the documentation) I wrote the scripts to do that. But I was never able to test them before putting them into place (the system was already running by then...)

    Over the next year my scripts all worked, but the vendor supplied ones (or our own central IT config) always seemed to have some problem.

    So after a year we had two separate systems - the downtime caused by the failover's failures wasn't worth it.

    1. el_oscuro

      I once had a project like that - took 2 weeks to set up Oracle failsafe on Windows. And in the end, my entire output of the gig was a single Windows .BAT file and instructions on how to use it. The project was a success.

  13. Stevie

    Bah!

    I once had the pleasure of watching someone demonstrate the resilience of Veritas multi-path volume management.

    Unplugged one of the arrays - no problem.

    Plugged it back in - massive problems, as the system started yelling about duplicate network addresses.

    Seems that the system would fall over, but could not get back up.

    Some wag suggested hanging a "life alert" on the frame, but was made to sit in the uncooperative corner.

  14. Eclectic Man Silver badge
    Childcatcher

    This story and thread ...

    ... merely confirm me in my belief that 'The Register' and the reader comments pages should be MANDATORY READING for all MPs, Peers of the Realm, and all PPE and MBA students.

    1. Boris the Cockroach Silver badge

      Re: This story and thread ...

      How are they going to fit in reading this among their other studies such as

      Evading the blame

      Hiding bribes

      Not telling the truth

      dodging questions

      Lining one's friends' pockets

      Networking the right people

      Kissing the right arse

      And

      Suing the newspapers

      1. John Brown (no body) Silver badge

        Re: This story and thread ...

        You forgot studying "tractor" websites in The House :-)

        1. Anonymous Custard Silver badge
          Gimp

          Re: This story and thread ...

          Also giving cushy "jobs" to relatives.

          Except for the job of their secretary or assistant, which is for someone young and cute and who get given something else...

  15. Not now John, I’ve gotta get on with this

    I remember the early days of Sun failover systems: failover worked perfectly, but the journalling file system wasn't really ready for prime time and nobody used it, so the result was that failover takes 2-3 seconds, while the resulting fsck when the backup machine rips the disks away from the primary takes 4-6 hours! It was almost always quicker to fix the hardware issue in the primary server than to allow a failover to occur.

  16. Grogan Silver badge

    If he can use telnet from his "hour away" office, couldn't he telnet in to his office from home, then telnet to the remote server? Nobody would even know. I did shit like that back in the 90's. If I couldn't access something on a campus network, I'd telnet somewhere I could (home, ISP shell account etc.). I always had my own telnet client on a floppy disk too, for Windows computers.

    I mean, how "secure" is the access anyway? It's just telnet, so all it would be is an address mask (and a user account, of course) on the remote server.

    1. doublelayer Silver badge

      The 1990s and today have different networking and security concepts, but maybe there is still some similarity. Maybe the server couldn't be accessed from the public internet and had to be controlled from something on the corporate network, but they hadn't set up a VPN.

  17. trindflo Silver badge

    confused by tech that worked when it wasn't supposed to?

    A DHCP server is a good one. I've seen it a few times: a router with an internal DHCP server got used to create a separate network for testing, and then someone bridged it to the main company network.

    Anyone who had powered on their machine before the test router was added would have a good address. Anyone who booted after that and asked for a dynamically assigned IP might get an address from either the real DHCP server or the test one. If they got an address from the test network, things might seem to work depending on what other machines were powered up in the test network. IT can recognise the symptoms, but AFAIK it can be a challenge to figure out where the rogue router is.
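
    (One way to at least enumerate the culprits, as a sketch: broadcast a DHCP DISCOVER and see how many servers answer. Needs root and the scapy package; adapted from the standard scapy recipe for spotting rogue DHCP servers.)

    from scapy.all import (Ether, IP, UDP, BOOTP, DHCP, srp, conf,
                           get_if_raw_hwaddr)

    conf.checkIPaddr = False  # offers come back from the server's own IP, not the broadcast address
    _, hw = get_if_raw_hwaddr(conf.iface)

    discover = (Ether(dst="ff:ff:ff:ff:ff:ff") /
                IP(src="0.0.0.0", dst="255.255.255.255") /
                UDP(sport=68, dport=67) /
                BOOTP(chaddr=hw) /
                DHCP(options=[("message-type", "discover"), "end"]))

    answered, _ = srp(discover, multi=True, timeout=5, verbose=False)
    for _, offer in answered:
        print("DHCP offer from", offer[IP].src)  # more than one answer means a rogue server somewhere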

  18. Herby

    Backup power...

    I was told a story (yes, second-hand) about a power failure at a telephone central office. The power failed one day. Well, the central office has batteries that take over in such cases, but they don't last too long, so there is a nice natural-gas-powered turbine to generate power for the duration of the power failure. This is a good thing. To start said turbine, it has a nice large air tank to spin the beast up. Super easy. So the chief flips the switch to start the turbine, and it spins up nicely. Now to switch it over to power the "plant". Well, over the course of time lots of things had been plugged in, and while the power company could nicely handle the load, the gas turbine generator could not. So the whole thing failed, and they were back to square one. Not to worry, the air-start system had enough for two starts (gotta love the capacity). So, another start - and a wise man would re-supply the air-start system BEFORE switching the power over. Nope, he just flipped the switch and the same failure happened. With only two starts available, they had to sit and watch as the central office died around them. A big lesson learned the hard, hard, hard way.

    I wasn't informed whether procedures were changed, but I suspect so!

  19. ariels-again

    STONITH

    A shame the architecture wasn't STONITH, so our operator didn't get to expand the acronym in front of the police. It was always a horrible idea, but I'm sure the police would have enjoyed hearing that the problem was that one node didn't shoot the other node in the head.

  20. FeRDNYC

    Do it Yourself (by Bill Sutton)

    Oh, I-B-M, DEC, and Honeywell, H-P, D-G, and Wang,

    Amdahl, NEC, and N-C-R, they don't know anything

    They make big bucks for systems so they never want it known,

    That you can build a mainframe from the things you find at home.

  21. Evil Auditor Silver badge
    FAIL

    Have you ever been confused by tech that worked when it wasn't supposed to?

    Only that one time when I shut down a machine that didn't have failover capability and it kept responding... Triple checked ever thereafter.

  22. el_oscuro

    Data General and EMC

    Way back in the 90's, I used to work on Data General servers. They were pretty nice, but NT4 didn't really support them, and required loading special drivers at boot to recognize the SCSI drives.

    And then there was EMC (known as "Even More Complex"). I had to administer those SANs with symcli, and that process is not for the faint of heart. Make one mistake and you could potentially corrupt a critical database. I never made any, but that was because I double- and triple-checked every single command, all while ensuring we had good backups and verifying standby databases were in sync.

    When my contract ended, that was the job I was glad I no longer had. My replacement wasn't so lucky and corrupted a database. Of course there were no backups and the only standby was out of sync. I got called back to help with the recovery, and it was a mess. The outage lasted several days.

    1. Adrian Harvey

      Re: Data General and EMC

      Is symcli the one where the “commands” are all just hex codes? I remember seeing the engineers use that on an EMC Symmetrix a long time ago, it seemed like an awful way to work. Very stable and great support, but not easy to make changes on.

  23. Silverburn

    A man after my own heart.

    "Brad's first tactic was to stay in bed for a bit – he wasn't allowed to remote into these servers and hoped the problem would go away by itself."

    This hits closer to home than I would like to admit.

  24. wimton@yahoo.com
    FAIL

    More fail over lore.

    A military lab was surrounded by a moat. For resilience, it was powered by 2 electricity cables, crossing the moat at different locations, with automatic failover.

    One day the moat had to be dredged out. The dredger cut one of the cables; the failover worked perfectly and nobody noticed. The dredger proceeded with its work, till it also cut the second cable....

    1. jake Silver badge

      Re: More fail over lore.

      That's why all major redundant systems that I design have failure indicator panels with both flashing lights and sonalerts, usually placed in a couple of different locations.

    2. Eclectic Man Silver badge
      Facepalm

      Re: More fail over lore.

      Was it this one:

      "The Daily Telegraph in 2009 exposed [Douglas] Hogg for claiming upwards of £2,000 of taxpayers' money for the purposes of "cleaning the moat" of his country estate, Kettlethorpe Hall; "

      ?

      https://en.wikipedia.org/wiki/Douglas_Hogg#:~:text=The%20Daily%20Telegraph%20in%202009,parliamentary%20expenses%20scandal%2C%20although%20it

  25. Anonymous Coward
    Anonymous Coward

    Fail, but over?

    And then there are systems designed with good intentions... a lovely resilient clustered pair of servers, but the application was coded to address one server name, not the cluster name; a standby generator that had its mains-sensing switch on the output side; another generator that had its fuel supply pump wired to the non-essential mains; a lovely battery set-up that unfortunately included the building lifts as essential supply... This is why we weep into our beer.
