Excuse me, what just happened? Resilience is tough when your failure is due to a 'sequence of events that was almost impossible to foresee'

When designing systems that our businesses will rely on, we do so with resilience in mind. Twenty-five years ago, technologies like RAID and server mirroring were novel and, in some ways, non-trivial to implement; today this is no longer the case and it is a reflex action to procure multiple servers, LAN switches, firewalls, …

  1. cantankerous swineherd Silver badge

    putting all your eggs in the internet basket is the opposite of resilient.

    1. Blackjack Silver badge

      Is always good to have offline backups.

      Also the fact most of the Internet can be brought down just by having wrong dates seems quite stupid.

      1. DJV Silver badge

        Wrong dates

        Well, I've certainly been on some wrong dates - I'm glad to report that I don't think they ever brought the internet down though.

        1. Gene Cash Silver badge
          Coat

          Re: Wrong dates

          I started with TRS-80s and S-100 machines, but obviously, I'm dating myself. Which is good, because nobody else will.

          1. DJV Silver badge

            Re: I'm dating myself

            And the conversation is probably better!

        2. Ken G
          Trollface

          Re: Wrong dates

          If you go on the internet you don't need to go on dates, and vice versa.

  2. Pascal Monett Silver badge
    Thumb Up

    Great article

    Pure wisdom as well.

    I will keep the link as a reference in all my discussions about testing components with my customers.

  3. ColinPa

    Do not have your experts on hand

    As well as testing for failure, ensure that the people are considered as part of this. Keep the experts who know what they are doing out of it - they can watch as the other members of the team manage the fail overs. You can watch as much as you like - you only learn when you have to do it. If the team is screwing up - let them - up to a point. Part of learning is how to undig the hole.

    The expert may know to do A, B, C - but the documentation only says do A, C.

    In one situation someone had to go into a box to press the reset button - but the only person with the key to the box was the manager/senior team leader.

    Having junior staff do the work, while the senior staff watch is also good as it means someone who knows is watching for unusual events.

    1. DS999 Silver badge

      Re: Do not have your experts on hand

      Speaking of random system failures, I've just seen something I've never seen at The Register. The post I'm replying to currently has -1 downvotes! How appropriate for this article lol

      1. Ken Moorhouse Silver badge

        Re: -1 downvotes

        They were doing system maintenance earlier. Maybe something to do with that?

      2. DJV Silver badge

        Re: Do not have your experts on hand

        Yes, it's still -1 - I'm tempted to click it to see what happens but, then again, I don't want to be the guy that did something similar to what happened with Fastly!

        1. Ken Moorhouse Silver badge

          Re: DJV threatens dire crisis. TheRegister intervenes to save planet from extinction

          Yesterday a major crisis was averted when TheRegister reset the counters after DJV threatened "to click it to see what happens", referring to a negative Downvote count on a post.

          Now 11 Upvotes 0 Downvotes.

          Phew, that was close.

  4. A Non e-mouse Silver badge

    Partial Failures

    Partial failures are the hardest to spot and defend against. So many times I see high-availability systems die because they failed in an obscure and non-obvious way.

    And as we make our systems more complex & interconnected (ironically to try and make them more resilient!) they become more susceptible to catastrophic outages caused by partial failures.

    1. Brewster's Angle Grinder Silver badge

      Re: Partial Failures

      I was thinking about Google's insights into chip misbehaviour. You can't write your code defensively against the possibility that arithmetic has stopped working.

      Likewise, as a consumer of a clock: you've just got to assume it's monotonically increasing, haven't you? (And if you do check, have you now opened up a vulnerability should we ever get a negative leap second?) That said, my timing code nearly always checks for positive durations. But its response is to throw an exception. Which is just to switch one catastrophically bad thing for another.
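A minimal sketch of the "check for positive durations and throw" approach described above. The names `NegativeDurationError`, `elapsed_seconds` and `timed` are hypothetical, invented for illustration; the only real API used is Python's `time.monotonic()`, which is documented not to go backwards and is therefore a safer default than wall-clock time for measuring intervals.

```python
import time


class NegativeDurationError(RuntimeError):
    """Raised when a measured duration is negative, i.e. the clock went backwards."""


def elapsed_seconds(start, end):
    """Return end - start, refusing to hand back a negative duration."""
    duration = end - start
    if duration < 0:
        # Fail loudly and record both readings so the cause can be diagnosed.
        raise NegativeDurationError(
            f"clock went backwards: start={start!r}, end={end!r}"
        )
    return duration


def timed(func, clock=time.monotonic):
    """Run func() and return (result, seconds taken).

    The clock is injectable (an assumption made for testability) so a
    misbehaving time source can be simulated.
    """
    start = clock()
    result = func()
    end = clock()
    return result, elapsed_seconds(start, end)
```

The exception is still a failure, but it names the two clock readings involved, which is the improvement the reply below points out.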

      1. Filippo Silver badge

        Re: Partial Failures

        You're switching one catastrophically bad and very hard to diagnose thing for one that's catastrophically bad but which I assume leaves a log with the exact problem somewhere. That's a pretty good improvement.

      2. Claptrap314 Silver badge

        Re: Partial Failures

        Actually, I had to write code to defend against integer store not working.

        It was a fun bit of work. Pure assembler, of course. But such things CAN be done--if you know the failure mode in advance.

        Not the point of the story, I know. But intricate does not mean impossible.

        1. Brewster's Angle Grinder Silver badge

          Re: Partial Failures

          I didn't mean to imply it was impossible; only that it was impractical.

          I guess dodgy memory is read back and verify. You sometimes used to have to do that with peripherals. Although I suppose it gets more tasty if there are caches between the CPU and main memory.

          For arithmetic, you could perform the calculation multiple times in different ways and compare the results. But I'm not doubling or trebling the size of the code and wasting all that extra time just in case I've got a bum CPU. (Particularly if I've already spent a lot of time optimising the calculation to minimise floating point rounding errors.) In the real world, you can't write code on the off chance the CPU is borked.

          1. fireflies

            Re: Partial Failures

            Surely you could have the code perform a calculation and then compare it against a constant that holds the expected answer - if it deviates from that constant, something has gone terribly wrong - not a 100% guarantee but it would reduce the risk somewhat?
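The known-answer check suggested above might look something like this sketch. The test vectors and the names `KNOWN_ANSWERS` and `self_test` are made up for illustration; as the comment says, passing such a test is not a guarantee, it only catches faults that happen to affect the chosen calculations.

```python
import math

# Hypothetical known-answer vectors: (function, arguments, expected result).
# The expected values are stored as constants, computed ahead of time.
KNOWN_ANSWERS = [
    (lambda a, b: a * b, (123456789, 987654321), 121932631112635269),
    (lambda a, b: a + b, (2**40, 2**40), 2**41),
    (math.sqrt, (144.0,), 12.0),
]


def self_test(vectors=KNOWN_ANSWERS):
    """Return True only if every calculation reproduces its stored answer."""
    return all(func(*args) == expected for func, args, expected in vectors)


if __name__ == "__main__":
    if not self_test():
        raise SystemExit("arithmetic self-test failed: do not trust this machine")
```

In a compiled language the inputs would need to come from somewhere the optimiser cannot see, otherwise the whole check may be folded away at build time.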

            1. adam 40 Silver badge
              Happy

              Re: Partial Failures

              Then you could optimise the code to return your constant - win-win!

  5. Julz Silver badge

    The

    Only place I ever did any work for which took an approach similar to that being proposed by the article was a bank which switched from its primary to its secondary data centers once a month, flip flopping one to the other, over and over. Well in fact it was a bit more complicated, as other services went the other way and there were three data centers but you get the idea. The kicker being that even they had issues when they had to do it for real due to a plane hitting a rather tall building in New York.

    1. TaabuTheCat

      Re: The

      When I worked in Texas we used to do bi-annual real production failovers to our backup DC, run for a couple of days and then fail everything back. It was how we knew we were prepared for hurricane season. Interesting how many orgs don't consider failback as part of the exercise. One of my peers (at another org) said management considered the risk too high for "just a test". I'm thinking it's actually quite rare to find companies doing real failover/failback testing in their production environments.

      1. Jc (the real one)

        Re: The

        Many years ago I was talking to a big bank about DR/HA. Shortly before, they had run their fail over test and went to the pub to celebrate the successful procedure. Suitably refreshed, they went back to the office to perform the fail back, only to discover that one of their key steps was missing (and code needed to be written) so they had to carry on running at the DR site for several months until the next scheduled downtime window.

        Never forget the fail back

        Jc

      2. Claptrap314 Silver badge

        Re: The

        Heh. For the bigs (I was at G), this is called "Tuesday". (Not Friday, though. Never stir up trouble on a Friday.)

        One of the things that the cloud detractors fail to "get" is just how much goes into 4+9s of resilience. We would deliberately fail DCs as part of training exercises. Also, because a DC was going into a maintenance window. Also, because the tides were a bit high right now, so traffic was slow over there.

        The thing is, we didn't "fail over" so much as "move that traffic elsewhere". Increasing capacity on a live server is almost always, well, more safe than just dumping a bunch of traffic on a cold server.

    2. DS999 Silver badge

      Re: The

      I've done a lot of consulting that involved high availability and DR over the years, and companies are surprisingly much more willing to spend money on redundant infrastructure than to schedule the time to properly test it on a regular basis and make sure that money was well spent.

      I can only guess those in charge figure "I signed off on paying for redundant infrastructure, if it doesn't actually work I'm off the hook and can point my finger elsewhere" so why approve testing? It worked when it was installed so it will work forever, right?

      I can't count the number of "walkthroughs" and "tabletop exercises" I've seen that are claimed to count as a DR test. The only thing those are good for is making sure that housekeeping stuff like the contact and inventory list are up to date, and making sure those involved actually remember what is in the DR plan.

      Clients don't like to hear it, but before go live of a DR plan I think you need to repeat the entire exercise until you can go through the plan step by step without having anyone have to do a single thing that isn't listed in the plan as their responsibility, and without anyone feeling there is any ambiguity in any step. Once you've done that, then you need to get backup/junior people (ideally with no exposure to the previous tests) to do it on their own - that's the real test that nothing has been left out or poorly documented. Depending on vacation schedules and job changes, you might be forced to rely on some of them when a disaster occurs - having them "looking over the shoulder" of their senior during a test is NOT GOOD ENOUGH.

      When a disaster occurs, everyone is under a great deal of stress and steps that are left out "because they are obvious" or responsibilities that are left ambiguous are the #1 cause of problems.

      1. robn

        Re: The

        At a bank I worked at (last century!), the disaster recovery exercise chose a random bunch of employees associated with the systems to participate in the recovery exercise. Those not chosen were considered victims of the disaster, and could not provide any input to the exercise.

        1. batfink Silver badge

          Re: The

          This.

          The prospect of losing critical staff in the disaster is often not even considered. It's not something any of us would ever like to see happen, and therefore people don't even like to think about it, but in DR thinking it has to be counted.

          One of the DR scenarios in a place I used to work was a plane hitting the primary site during the afternoon shift changeover (we were near a major airport). All the operational staff were at the Primary, and the Secondary was dark. Therefore that would immediately take out 2/3 of the staff, plus most of the management layer. The DR Plan still had to work.

          1. DS999 Silver badge

            Re: The

            When I did DR planning at my first managerial job long ago (a university) the entire DR consisted of having a copy of the full backups go home with a long tenured employee of mine, and then he'd bring back the previous week's full backup tapes. We both agreed that if there was a disaster big enough to take out both our university building and his house some 4 miles away, no one would really care about the loss of data - and it wouldn't be our problem as we'd likely be dead.

            A bigger concern than the staff actually dying would be the staff is alive, but unable to connect remotely and unable to physically reach work. This actually happened where I was consulting in August 2003 when the big east coast power outage hit. I was fortunately not around when it happened to deal with it, but such a widespread power outage meant no one could connect remotely. Some were able to make it in person, but others were stuck because they were like me and drive their cars down near the 'E' before filling - which doesn't allow you to get anywhere when the power's out because gas pumps don't work!

            Something like a major earthquake would not only take out power but damage enough roads due to bridge collapses, debris on the road, cars that run out of gas blocking it etc. that travel would become near impossible over any distance. There might be some routes that are open, but without any internet access there'd be no way for people to find out which routes are passable and which are not without risking it themselves.

            1. Claptrap314 Silver badge

              Re: The

              Which is why I don't consider a Western Oregon datacenter to be a very good backup site for one in Silicon Valley. If you prepare for earthquakes, you better understand that they are NOT randomly distributed.

              I also don't like backups up & down the (US) East coast because hurricanes have been known to take a tour of our Eastern seaboard. Same on the Gulf.

              1. DS999 Silver badge

                Re: The

                An Oregon datacenter is fine for backing up California. No earthquakes in that region will affect both places. The only place where a single earthquake might affect both is the midwest, if the New Madrid fault has another big slip.

                Now with hurricanes you have a point, a single hurricane could affect the entire eastern seaboard from Florida to Massachusetts. One hurricane wouldn't affect ALL those states (it would lose too much power) but it could affect any two of them.

                1. Claptrap314 Silver badge

                  Re: The

                  Earthquakes cluster my friend. And there is a fault system that runs up & down the West coast. Not to mention the Great Big One.

            2. adam 40 Silver badge
              FAIL

              Re: The

              Is that one of those "write only" tapes that IT people keep in safes?

              I don't know how many times over the years I've asked for a deleted directory to be restored from tape, to get "sorry, the restore failed".

              1. Claptrap314 Silver badge

                Re: The

                Check out Zen and the Art of Computer Programming. This mode of failure, and the means to prevent it, have been known for decades.

      2. Anonymous Coward

        Re: The

        Project vs Ops costs. The business case usually included the up front spend for the redundant tin, but after go-live when it's handed to ops, they're looking at ways to shave cost.

    3. HildyJ Silver badge
      Boffin

      Re: The

      Systems need to be designed and configured to allow for manually triggering failover and failback.

      Triggering it, running the failover system for a week (to include day/night and weekday/weekend) should be part of your monthly cycle. The same should hold true for any redundant ancillary equipment (e.g. ACs, UPSs).

      Some things will never be anticipated but you can be better prepared for when they happen.

      1. JerseyDaveC

        Re: The

        Absolutely right. And the thing is, the more you test things, the more comfortable you become with the concept of testing by actually downing stuff. You need to be cautious not to get complacent, of course, and start carelessly skipping steps of the run-book, but you feel a whole lot less trepidatious doing your fifth or sixth monthly test than you did doing the first :-)

        One does have to be careful, though, that if you implement a manual trigger the behaviour must be *exactly* identical to what would happen with a real outage. As a former global network manager I've seen link failover tested by administratively downing the router port, when actually killing the physical connection caused different behaviour.

      2. DS999 Silver badge

        Re: The

        That's great for those who have a full complement of redundant infrastructure, but few companies are willing to spend millions building infrastructure that sits idle except for the most critical business applications that have no tolerance for downtime during a disaster.

        Other applications might do stuff like keep the development/QA/etc. stuff in another datacenter and designate that as production DR - obviously you can't do what you suggest without halting development for a week. Some companies contract with a DR specialist that has equipment ready to go but they charge for its use (as well as of course being ready for its use) so you wouldn't run your production on it except in a true DR scenario.

        What you suggest makes sense in theory but no one is willing to risk it due to all the unknowns - let's say a change is made to how monthly or quarterly closing is done, and that change isn't properly integrated into the DR system. You fail over to DR then your month end closing blows up. Sure, it served a purpose by alerting you to the improper change, but in a really complex environment this kind of thing will happen often enough that the guys in charge will eventually put a stop to that sort of "throw us in the fire and see if we burn" type of testing.

        At least when you have a true DR event, everyone is watching things really closely and they are expecting things like this that fell through the cracks so they can be responded to more quickly because you already have the right people on a 24x7 crisis call. In the non-true DR scenario unless the fix is obvious to whoever gets paged about it, you're going to end up having to round up people and open a conference line to resolve an issue that didn't have to happen, and the application owner involved is going to question why the heck we need to go looking for problems and roping him into a call on the weekend that didn't need to happen.

        1. Claptrap314 Silver badge
          Megaphone

          Re: The

          This is what you are paying AWS or GCP to do. They have the equipment, and they know the magic to keep the price down.

          As I stated before, one of the big tricks is to have your traffic going ten places with the ability to function just fine if it has to be focused on only seven or eight of them. You see that the overhead cost is now less than 50%--but only after you reach the kinds of scale that they have.
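The arithmetic behind that "less than 50%" figure can be sketched as follows. The function name `spare_capacity_overhead` is made up for illustration, and the model is deliberately crude: it assumes every site has equal capacity and that load spreads evenly.

```python
def spare_capacity_overhead(total_sites, sites_needed):
    """Extra capacity bought, as a fraction of the capacity actually required.

    Assumes equal-sized sites: sites_needed of them can carry the full load,
    and the remainder exist only to absorb failures.
    """
    if not 0 < sites_needed <= total_sites:
        raise ValueError("need 0 < sites_needed <= total_sites")
    return (total_sites - sites_needed) / sites_needed


if __name__ == "__main__":
    for total, needed in [(2, 1), (10, 8), (10, 7)]:
        overhead = spare_capacity_overhead(total, needed)
        print(f"{total} sites, {needed} needed: {overhead:.0%} overhead")
```

Under these assumptions a classic active/standby pair carries 100% overhead, while ten sites that can cope on eight or seven carry 25% or roughly 43%.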

    4. Anonymous Coward

      Re: The

      Fully support the idea of switching "active / standby" to "standby / active" on a regular basis.

      Switches, routers, WAN links, DC's, EVERYTHING.

      1. Claptrap314 Silver badge

        Re: The

        Much better is to have traffic flowing all the time to N+2, and redirect to only N as necessary.

    5. Anonymous Coward

      Re: The

      I thought it was the same bank as I'd worked for except they only had 2 data centres and they developed the practice after a 747 landed on one of them in 1992. They actually ran 50:50 from each data centre and exchanged which location a particular half of the service ran from every two weeks.

  6. This post has been deleted by its author

  7. amanfromMars 1 Silver badge

    Surely not too mad and/or rad to understand???

    today this is no longer the case and it is a reflex action to procure multiple servers, LAN switches, firewalls, and the like to build resilient systems.

    'Tis sad, and surely something you should be thinking about addressing, that the systems they build all fail the same though fielding and following those processes. Such suggests there is an abiding persistent problem in present standardised arrangements.

    Future procedures may very well need to be somewhat different.

  8. Ken Moorhouse Silver badge
    Coat

    "sequence of events that was almost impossible to foresee"

    B-b-but these times that were issued by the Time Server had been seen before, twenty years ago.

    1. John Brown (no body) Silver badge

      Re: "sequence of events that was almost impossible to foresee"

      I was thinking more along the lines of, it's a time server. Surely unless it's a stratum-1 atomic clock, then it should be checking the time against outside sources. And surely even Stratum-1 clocks check against their peers. The only thing I can imagine here is that it's getting time from GPS or similar and then adjusting what it gets for timezones and somehow the factory default managed to think it was in a timezone 20 years away. But then a time server should really be on UTC. Timezones ought to be a "user level" thing.

      1. Jc (the real one)

        Re: "sequence of events that was almost impossible to foresee"

        In many organisations, time doesn't have to be *exactly* correct, just the same at all systems. But still, only a single time source?

        Jc

      2. KarMann Silver badge
        Boffin

        Re: "sequence of events that was almost impossible to foresee"

        GPS week number rollover:

        …is a phenomenon that happens every 1024 weeks, which is about 19.6 years. The Global Positioning system broadcasts a date, including a weekly counter that is stored in only ten binary digits. The range is therefore 0–1023. After 1023 an integer overflow causes the internal value to roll over, changing to zero again.

        I'd bet that's what happened here, rather than a pure factory default time. The last rollover was in April 2019, so I bet the default of this time server was to assume it was in the August 1999–April 2019 block, and it just hadn't been rebooted since before April 2019. See the article for a similar bork-bork-bork candidate picture.

        I guess back in 1980, they weren't too concerned about the likes of the Y2K problem yet. And lucky us, the next week rollover will be in 2038, the icing on the Unix time Y2K38 cake, if about ten months later.
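To make the rollover concrete, here is a small sketch of how a 10-bit week number maps to a calendar date, and what happens when a receiver assumes the wrong 1024-week block. The GPS epoch of 6 January 1980 is a known fact; the helper names `era_start` and `week_to_date`, and the word "era" for a 1024-week block, are labels chosen for this example. Whether the time server in the article failed this way is the commenter's guess, not something shown here.

```python
from datetime import date, timedelta

GPS_EPOCH = date(1980, 1, 6)  # start of GPS week 0
WEEKS_PER_ERA = 1024          # a 10-bit counter holds 0..1023


def era_start(era):
    """First day of the given 1024-week block (era 0 starts at the GPS epoch)."""
    return GPS_EPOCH + timedelta(weeks=era * WEEKS_PER_ERA)


def week_to_date(broadcast_week, assumed_era):
    """Date on which the broadcast week begins, given the era the receiver assumes."""
    if not 0 <= broadcast_week < WEEKS_PER_ERA:
        raise ValueError("broadcast week must fit in 10 bits")
    return era_start(assumed_era) + timedelta(weeks=broadcast_week)


if __name__ == "__main__":
    for era in range(4):
        print(f"era {era} starts {era_start(era)}")
    # The same broadcast week, read with the right and the wrong era:
    right = week_to_date(100, assumed_era=2)
    wrong = week_to_date(100, assumed_era=1)
    print(f"correct: {right}, stale assumption: {wrong}, error: {right - wrong}")
```

The broadcast value alone cannot distinguish the two readings, so a receiver that assumes the previous block is off by exactly 1024 weeks (7168 days, about 19.6 years).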

        1. Claptrap314 Silver badge
          FAIL

          Re: "sequence of events that was almost impossible to foresee"

          I'm having trouble processing all the levels of wrong involved here. Just wow.

      3. adam 40 Silver badge
        IT Angle

        Ross 154

        Is around 9.6 light years away from Earth, so the return-trip time would put its timezone about 20 years ago...

  9. Phil O'Sophical Silver badge

    Network failures

    In hindsight, this was completely predictable

    Doesn't require much hindsight, it's an example of the "Byzantine Generals problem" described by Leslie Lamport 40 years ago, and should be well-known to anyone working on highly-available systems. With only two sites, in the event of network failure it's provably impossible for either of them to know what action to take. That's why such configurations always need a third site/device, with a voting system based around quorum. Standard for local HA systems, but harder to do with geographically-separate systems for DR because network failures are more frequent and often not independent.

    In that case best Business Continuity practice is not to do an automatic primary-secondary failover, but to have a person (the BC manager) in the loop. That person is alerted to the situation and can gather enough additional info (maybe just phone calls to the site admins, a separate "network link") to take the decision about which site should be Primary. After that the transition should be automated to reduce the likelihood of error.
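A toy model of the quorum rule described above, assuming the network has split into groups of nodes that can only talk within their own group. The function name `partition_with_quorum` and the site labels are hypothetical; real cluster managers add fencing, leases and much more.

```python
def partition_with_quorum(partitions, total_voters):
    """Return the one group holding a strict majority of voters, or None.

    partitions: iterable of sets, each a group of voters that can still
    reach one another. At most one group can hold more than half the
    votes, so at most one side is ever allowed to act as primary.
    """
    for group in partitions:
        if len(group) > total_voters // 2:
            return group
    return None


if __name__ == "__main__":
    # Two sites, link between them down: neither side has a majority.
    print(partition_with_quorum([{"site-a"}, {"site-b"}], total_voters=2))
    # Add a third voter (a witness): the side that still sees it wins.
    print(partition_with_quorum([{"site-a", "witness"}, {"site-b"}], total_voters=3))
```

With two voters the function returns `None` for any split, which is the point made above: something outside the pair, whether a third site or a person, has to break the tie.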

    1. Muppet Boss Bronze badge
      Pint

      Re: Network failures

      At the risk of sounding (un)pleasantly pedantic, I still have to say that the examples given are not only completely predictable, these are simple textbook examples of bad system design. Taleb does not need to be involved at all.

      Configuring 2 NTP servers is a Bad Practice because 2 NTP servers cannot form a quorum to protect against the standard problem of a false ticker. The recommended and optimal minimum is 3, however 1 is still better than 2 because if the 2 differ significantly, it is difficult to impossible to determine which one is the false ticker.
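The two-versus-three argument can be illustrated with a simplified majority vote over measured clock offsets. This is not the selection algorithm real NTP implementations use; the function name `find_falsetickers`, the tolerance and the server labels are all assumptions made for this sketch.

```python
def find_falsetickers(offsets, tolerance=0.1):
    """Identify time sources that disagree with the majority.

    offsets: dict mapping a source name to its measured offset in seconds.
    Returns the set of sources outside the largest agreeing group, or None
    when no group forms a strict majority and nothing can be decided.
    """
    names = list(offsets)
    best_group = []
    for candidate in names:
        # Everyone whose offset is within tolerance of this candidate's.
        group = [n for n in names
                 if abs(offsets[n] - offsets[candidate]) <= tolerance]
        if len(group) > len(best_group):
            best_group = group
    if len(best_group) <= len(names) // 2:
        return None
    return set(names) - set(best_group)


if __name__ == "__main__":
    twenty_years = 20 * 365 * 86400.0
    print(find_falsetickers({"a": 0.01, "b": -0.02, "c": -twenty_years}))
    print(find_falsetickers({"a": 0.01, "c": -twenty_years}))
```

With three sources the odd one out is identified; with two that disagree the function can only report that it cannot tell which is wrong.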

      Some badly designed black box systems only allow for a maximum of 2 NTP servers being configured; in this special case the importance of the system might prompt using a cluster of monitored anycast NTP servers for high availability; for less demanding cases using a single DNS record to something from pool.ntp.org will ensure enough availability without false tickers (while adding the Internet access and DNS dependencies).

      Having a split-brain failure scenario in a geographically distributed firewall cluster is so common that it is usually specifically tested in any sane DR plan. This, again, is a glaring example of bad network design, implementation or operation. No black swan magic is necessary, just build better systems or hire a professional.

      Real-world problems with highly available systems are usually multi-staged and are caused by a chain of unfortunate events, every single one of which would not have had the devastating effects. Simple, non-trivial failure scenarios, however, do exist. Something from personal experience that immediately comes to mind:

      - A resilient firewall cluster in a very large company is exposed to external non-malicious network conditions triggering a bug in the firewall code and the primary active firewall reboots as a result. The firewall cluster fails over, the secondary firewall is exposed to the same conditions and the same bug and reboots as well while the primary firewall still boots up. The process repeats resulting in noticeable outage until the unsuspecting external influence is removed.

      - A well-maintained but apparently defective dual-PSU device in a large datacentre short circuits without any external cause resulting in 2 feeds tripping and powering off the whole row of racks as well as a few devices not surviving through it.

      Cheers to all the IT infrastructure fellas, whichever fancy name you are called now!

  10. Anonymous Coward

    Strange how so many of these 'impossible to predict' failure modes, could have been predicted if someone had run through the following checklist:

    What happens if the service fails to respond?

    What happens if the service intermittently responds?

    What happens if the service responds slowly?

    What happens if the service responds with incorrect data?
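One way to turn the four questions above into code is a caller that treats each case as an explicit failure. This is only a sketch: `call_with_checks`, `ServiceError`, and the `fetch` and `validate` callables are hypothetical names, and the slowness check only measures a call after it returns, so a call that hangs forever would still need a timeout inside `fetch` itself.

```python
import time


class ServiceError(Exception):
    """The dependency did not give a usable answer."""


def call_with_checks(fetch, validate, attempts=3, max_seconds=1.0,
                     clock=time.monotonic):
    """Call fetch() and return its reply only if it is timely and plausible."""
    last_error = None
    for _ in range(attempts):
        start = clock()
        try:
            reply = fetch()
        except Exception as exc:
            # No response at all, or an intermittent failure: try again.
            last_error = exc
            continue
        if clock() - start > max_seconds:
            # It answered, but too slowly to be trusted.
            last_error = ServiceError("response too slow")
            continue
        if not validate(reply):
            # It answered promptly with data that fails a sanity check.
            last_error = ServiceError(f"implausible reply: {reply!r}")
            continue
        return reply
    raise ServiceError(f"service unusable after {attempts} attempts") from last_error
```

For a time source, `validate` could be as simple as rejecting any timestamp earlier than the date the software was built.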

    1. Anonymous Coward

      What? Only four questions?

      @AC

      What happens when the last devops delivery isn't correctly documented?

      What happens when the "agile" development team is working three shifts in three continents....and isn't coordinating the documentation?

      What happens when there ISN'T any documentation?

      What happens when a critical infrastructure component is a bought out service (see Fastly, Cloudflare)?

      ...........

      ........... and so on............

      1. DCdave
        Flame

        Re: What? Only four questions?

        Documentation? I think I recognise this word from the last century when as a tester my developer boss told me "the code is the documentation".

    2. graeme leggett

      That was my thinking partway through reading.

      The first business took steps to mitigate total breakdown of the time server but not a malfunction.

      And presumably the interoffice networking example was tested by switching off one unit (the event that was being mitigated against) rather than stopping the heartbeat (the trigger for the backup to take over)

      The lesson being that in an ideal world one ought to consider all the things that can happen. I recall a story that Feynman while on Manhattan project in order to give the impression he was involved in discussion pointed to a symbol on a process diagram and asked some innocent question about it. Upon which an engineer present recognised that it was a dangerous weak point.

      Now all I have to do is remember this for my own work....

  11. Paul Crawford Silver badge

    NTP

    For classic NTP operation it is recommended that you have 4 or more time servers configured on each client so they can detect problems including a broken/false clock source. That could be costly in hardware, so you might have 1 or 2 local servers from GPS that offer precise time due to low symmetric LAN delays and back it up with ones across the internet at large that can catch one of the GPS going massively stupid but only offer accuracy, on their own, to several/tens of milliseconds.

    1. Anonymous Coward

      Re: NTP

      That's a very clever idea. Unix people really did think of everything, decades ago

    2. Anonymous Coward

      Re: NTP

      Proper NTP (not "Simple NTP") also refuses to skew the clock (and throws up errors) if the time appears to be more than an hour wrong - because that's almost certainly due to something badly broken. It also never gives you discontinuous time - the clock is skewed to slow it down or speed it up; time never goes backwards, and it never jumps.

      "Simple" NTP is not proper time sync, it just blindly accepts what it is told. If systems started believing a date 20 years ago, they were not set up with proper time sync.

    3. Mage Silver badge
      Alert

      Re: NTP

      We need to stop using GPS as a cheap clock. Atomic clocks are not so expensive now and a decent solar flare or a system error can take out GPS based time world wide, then DTT, Mobile, DAB, Internet and other stuff needlessly fails. Jamming can take out or spoof local use of GPS.

      A big enough solar flare is a certainty, we just don't know when. GPS should only be used for navigation, and critical systems should have some sort of alternative navigation, even if only dead reckoning (inertial) that's normally corrected by GPS and can take over.

      1. Paul Crawford Silver badge

        Re: NTP

        You still need to sync the atomic clocks together in the first place, and to keep them agreeing afterwards (depending on the level of time accuracy you need)!

        For that you need something like GPS to do it, so really it comes down to how many will pay extra for an atomic clock reference oscillator in addition to the GPS receiver and outdoor antenna, etc. Many should do it, if they are running essential services, but usually the bean counters say no...

        1. Cynic_999 Silver badge

          Re: NTP

          "

          You still need to sync the atomic clocks together in the first place, and to keep them agreeing afterwards (depending on the level of time accuracy you need)!

          "

          Well yes, it does depend on the level of accuracy you need, but there will be very few cases where a static atomic clock on Earth will need to be adjusted during its lifetime. Even if you require the clock to be accurate to within 1 µs (one millionth of a second), an atomic clock would only need adjusting every 100 years. The variation in propagation delay from time source to the point where the time is used is a lot more than 1 µs in any networked computer system I can think of, other than systems where precise time is part of the input data it is working on (e.g. the embedded CPU in a GPS receiver).

    4. Anonymous Coward

      Re: NTP

      > "For classic NTP operation it is recommended that you have 4 or more time servers configured on each client so they can detect problems including a broken/false clock source. That could be costly in hardware, so you might have 1 or 2 local servers from GPS that offer precise time due to low symmetric LAN delays and back it up with ones across the internet at large that can catch one of the GPS going massively stupid but only offer accuracy, on their own, to several/tens of milliseconds."

      I thought the rule of thumb was 3+ NTP servers: if you have only 2 and they disagree, you know one of them is wrong but not which one, whereas if you have 3 (or more) and one of them differs from the other 2, your NTP software knows which is the wrong one. It's the same reason the Space Shuttle ran the same code on several redundant computers that worked on consensus - the computers that agreed could out-vote the (obviously wrong) odd one out.
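
A minimal sketch of that voting argument, assuming each source is reduced to a single clock offset in milliseconds and that "agreeing" means being within a fixed tolerance. This is plain majority voting for illustration, not the selection algorithm that real NTP software uses; the function name and the tolerance are hypothetical.

```python
from typing import List, Optional, Sequence


def rejected_sources(offsets_ms: Sequence[float],
                     tolerance_ms: float) -> Optional[List[int]]:
    """Return indices of sources outvoted by a majority, or None when
    no majority exists and the wrong source cannot be identified."""
    n = len(offsets_ms)
    # For each source, count how many *other* sources agree with it.
    agree = [
        sum(1 for j in range(n)
            if j != i and abs(offsets_ms[i] - offsets_ms[j]) <= tolerance_ms)
        for i in range(n)
    ]
    # A source is trusted if it plus its supporters form a strict majority.
    trusted = [i for i in range(n) if agree[i] + 1 > n / 2]
    if not trusted:
        return None
    return [i for i in range(n) if i not in trusted]


if __name__ == "__main__":
    print(rejected_sources([0.0, 500.0], 50.0))        # two sources disagree
    print(rejected_sources([1.0, 2.0, 900.0], 50.0))   # third is outvoted
```

With two disagreeing sources the function returns None, which is the "you know one is wrong but not which" case; with three, the odd one out is identified.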

      As for cost, not that much these days: 3 Raspberry Pis (with PoE adaptors and RTC add-on) plus GPS, DCF77, and MSF radio receivers and there's your 3 local Stratum 1 servers getting reference time from disparate sources - space, near Frankfurt, and Cumbria respectively. No point using the mobile network for time as (a) it's ultimately coming from GPS and (b) from memory it doesn't provide much accuracy (no milliseconds, not even sure if it provides seconds).

      The above kit would cost about 300-400 pounds in total. Low enough to keep spares...

  12. Paul Crawford Silver badge

    Fail-over failure

    We used to have one of the SunOracle storage servers with the dual heads configured as active/passive and linked via both an Ethernet cable and a pair of RS232 lines. That was, allegedly, so it could synchronise configuration via the Ethernet link and had the RS232 as a final check on connectivity to avoid the "split brain" problem of both attempting to become master at once.

    It was an utterly useless system. In the 5+ years we had it as primary storage it failed over a dozen times for various reasons and only occasionally did the passive head take over. We complained and raised a bug report with Oracle and they just said it was "working as designed" because it was only to take over if there was a kernel panic on the active head. Failing to serve files, its sole purpose in life, due to partial borking was not considered a problem apparently.

    The conclusion we reached was that paying for professional systems from big companies is a waste of time. Sure, we had a soft, strong and absorbent maintenance SLA, but we would have had less trouble with a single-head home-made FreeNAS server and a watchdog daemon running.

    1. DJV Silver badge

      Re: Fail-over failure

      In your case Oracle was obviously the main point of failure (but that will come as no surprise to any reader of the Register).

    2. Anonymous Coward

      Re: Fail-over failure

      Those systems run Solaris internally, and when they were being developed the Solaris Cluster team offered to help them embed the 10+ year old tried, tested & very reliable clustering software into them.

      The usual "Not Invented Here" attitude that was prevalent in that part of Sun prevailed, and they insisted that they were quite capable of doing the job themselves in a more modern fashion, despite having no background in high availability. We all saw the results.

  13. JassMan Silver badge

    Black swan event

    Having emigrated to the land of kangaroos before I was old enough to even know the name of a swan, I grew up believing all swans were black. On returning to my birthplace once I was old enough to earn my airfare, I discovered they also come in white. For a while, I just thought it was peculiar to see so many albino birds together, until I realised that all European swans are white. Of the two varieties, I would say the white are the slightly more unpredictable, so what is this black swan event thing?

    1. KarMann Silver badge
      Holmes

      Re: Black swan event

      The phrase was coined by Juvenal back in the 2nd century CE, discussing the problem of induction, whilst Europeans weren't aware of your black swans until 1697. By then, the name had stuck, with an added bonus of highlighting the problem that just because you haven't seen a specific counterexample doesn't mean it can't exist.

    2. AdamT

      Re: Black swan event

      They are rare but a few examples in the UK:

      https://www.nationaltrust.org.uk/chartwell/features/chartwells-black-swans

      ( https://goo.gl/maps/3YinKfXzMmdyZRg16 )

      Dawlish (on south west coast)

      ( https://goo.gl/maps/XRJyKZN2kz8ETg259 )

      I think both sets are "natural but monitored" i.e. they are not captive or bred, and arrived under their own steam but they are looked after and encouraged in their current locations.

    3. Val Halla

      Re: Black swan event

      It's probably fair to assume the indigenous races never expected a white man event either.

  14. TheBadja

    There was once a major outage on Sydney Trains because of an intermittent failure on a router in the signal box. The IT network was configured in failover with a pair of routers in load sharing.

    One router failed - and the second router took up the load. Should have been fine and the failed router could have been reloaded at leisure.

    However, the failed router, now without load, restored itself and took up its part of the IP traffic - and then failed under load. Repeat multiple times per second and packets were being lost everywhere.

    This stopped the trains as this disconnected the signal box from centralised rail control.

    Trains stopped for hours as the IT network technical staff couldn’t track what was going on, then needed to physically replace the faulty unit. Of course, couldn’t get there by train, so they had to drive there in traffic built up due to no trains.
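
One common guard against this kind of flapping is to latch a device out of service after repeated failures until someone clears it by hand. The sketch below is only an illustration of that idea, with hypothetical class and parameter names and arbitrary thresholds; it says nothing about how the network in this story was actually configured.

```python
from collections import deque


class FlapLatch:
    """Keep a repeatedly failing device out of the pool until reset."""

    def __init__(self, max_failures: int, window_s: float) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self._failures = deque()  # timestamps of recent failures
        self.latched = False

    def record_failure(self, now: float) -> None:
        self._failures.append(now)
        # Forget failures that fall outside the sliding window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self.latched = True  # stays set until an operator intervenes

    def may_rejoin(self) -> bool:
        return not self.latched

    def operator_reset(self) -> None:
        self._failures.clear()
        self.latched = False


if __name__ == "__main__":
    latch = FlapLatch(max_failures=3, window_s=60.0)
    for t in (0.0, 0.5, 1.0):
        latch.record_failure(t)
    print(latch.may_rejoin())  # the flapping router is now held out
```

A simple "stable for N seconds" timer would not be enough here, because the router in the story looked healthy whenever it carried no load.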

  15. Manolo
    Black Helicopters

    Bourne?

    Isn't "Excuse me, what just happened?" a line from one of the Jason Bourne films?

  16. Anonymous Coward

    A customer's mainframe comms front-end was split over three identical devices. Dual networking kit before them ensured that incoming new connections were automatically rotated between the three comms boxes to give load sharing and resilience. It all worked well during fail-over testing and into live service. Even two boxes failing would leave full capability.

    A few months later one comms box fell over - followed by the other two a few minutes later. A while later - same again.

    The problem turned out to be that a customer department had set up real-time monitoring of connection response times. Every so often their test rig would automatically just make a connection and log the response time. As no data was passed, the mainframe didn't see the connection. Unfortunately their test kit left each connection live in the comms box. A box bug meant the number of orphaned connections in each box slowly built up - and eventually the box crashed.

    The rotating load sharing meant that all the three boxes approached the critical point at the same time. When a box crashed its users were automatically reconnected to the other two boxes - and that tipped them over the edge.
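
The effect of strict rotation can be shown with a toy model. It assumes every probe leaks exactly one connection and that a box crashes at a fixed limit; the limit of 1000 and the function name are hypothetical, chosen only to make the point.

```python
from typing import List


def leaked_at_first_crash(boxes: int = 3, limit: int = 1000) -> List[int]:
    """Rotate leaking probe connections across the boxes and return the
    per-box leak counts at the moment the first box reaches the limit."""
    leaked = [0] * boxes
    probe = 0
    while True:
        box = probe % boxes      # strict round-robin rotation
        leaked[box] += 1         # the probe's connection is never closed
        if leaked[box] >= limit:
            return leaked
        probe += 1


if __name__ == "__main__":
    print(leaked_at_first_crash())
```

When the first box hits its limit, the others are a single connection short of theirs, so the reconnecting users are more than enough to finish them off.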

  17. Anonymous Coward

    One of my house MSF radio synchronised clocks was showing an obviously wrong time this week. To save battery power they normally receive to resynchronise only once a day. Forcing it into receiving mode fixed the problem.

    The MSF time transmissions send data representing one second - every second. It needs the specific 60 contiguous transmissions to build up the time and date etc. Each second's data has very poor protection against bit corruption. It would appear the clock takes its daily sync from one minute's apparently complete transmission - and then stops receiving.

    My home-built timer uses an Arduino and a DS3231 TCXO RTC module. It also receives MSF signals when available. It only resynchronises the RTC when three consecutive minutes' MSF transmissions correlate with each other. The remaining RTC battery voltage is automatically checked once a week.
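
The "three consecutive minutes must correlate" rule can be sketched as below. This is a guess at the logic rather than the commenter's actual firmware, written in Python rather than Arduino code, and it assumes each received minute has already been decoded into a timestamp.

```python
from datetime import datetime, timedelta
from typing import Sequence


def frames_agree(frames: Sequence[datetime], needed: int = 3) -> bool:
    """True if the last `needed` decoded minute frames are each exactly
    one minute after the previous one."""
    if len(frames) < needed:
        return False
    recent = frames[-needed:]
    return all(later - earlier == timedelta(minutes=1)
               for earlier, later in zip(recent, recent[1:]))


if __name__ == "__main__":
    t0 = datetime(2021, 1, 1, 12, 0)
    good = [t0 + timedelta(minutes=m) for m in range(3)]
    # A corrupted hour field in the third frame breaks the sequence.
    bad = good[:2] + [t0 + timedelta(hours=1, minutes=2)]
    print(frames_agree(good), frames_agree(bad))
```

A single corrupted frame then delays the resync by a few minutes instead of setting the RTC to a wrong time.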

  18. aje21
    FAIL

    Fail-back is just as important as fail-over

    Doing DR pre-testing with a client in a non-production environment to make sure the steps looked right for when we did actual DR testing in production, the only issue we hit was during the fail-back a week after the fail-over when the DBA picked the wrong backup to restore from the secondary to the primary data centre and nobody noticed until after the replication from primary to secondary had been turned back on that we'd lost the week of data from the fail-over...

    Another reason for testing fail-over and fail-back with a sensible time separation because if it had been done on the same day we may not have noticed the mix-up of backup files.

  19. Hero Protagonist
    Devil

    When it comes to imagining failure modes…

    …remember that Murphy has a better imagination than you

    1. Bruce Ordway

      Re: When it comes to imagining failure modes…

      >>Murphy has a better imagination than you

      This reminds me of an upgrade where (I believed) I had every potential for failure fully accounted for.

      I swapped out an HP K400 with an L1000 over the weekend.

      Sunday afternoon... congratulations all around.

      Monday morning.... chaos, no file access for users.

      I quickly discovered that a power supply for an unrelated file server had failed early AM.

      Even though a coincidence... this and any other issue was blamed on that weekend upgrade for several days.

      Lesson learned... I now avoid making public announcements of IT upgrades/projects.

      So that I'm more likely to get credit for "the fixing" rather than "the breaking".

  20. Anonymous Coward

    Avoidable

    Put it all in the Cloud, and blame other people's computers when it goes wrong.

  21. Abominator

    Hardware resiliency is almost always the biggest problem.

    You have a disk failure in the RAID array, but the disk does not fail cleanly enough for the array to drop it, so a) the failing disk stays in the array and b) RAID I/O performance is dragged down by two orders of magnitude while the array limps on.

    Same for network failover with bonded network interfaces. I have seen hundreds of cases where a) the backup switch was never properly configured and nobody had properly tested a switch failure, or b) bad drivers in the bonded interface failed to switch over the ports.

    It's better to be cheap and simple and rely on software failover, provided the software is designed for failover. It's much easier to validate. Just start killing processes randomly.

  22. tpobrienjr

    Never test, or test at first install?

    Decades ago, at a Big Oil multinational, I was put in charge of the EDI transaction processor. It had been designed with redundancy, fail-over hardware, etc., all the best stuff. I asked when the recovery had been last tested. At first install, I was told. So I scheduled a test shutdown and restart. Twelve hours after the shutdown, it was back up. Quite scary. I won't do that again. I think when downsizing the IT, they had downsized the documentation.

  23. gerdesj Silver badge
    Childcatcher

    Yes, absolutely

    "Could monitoring tools have been put in place to see issues like this when they happen? Yes, absolutely, but the point is that to do so one would first need to identify the scenarios as something that could happen."

    No, that's crap, and anyone who has managed a monitoring system for more than a few years will tell you so. Time sync is one of the two things that always go wrong. DNS is the other one, and so is SSL certificate expiration, and the other things that you wonder how you wrote that script for but which still seems to work years later 8)

    The writer of the article is probably quite knowledgeable but clearly not time served.

  24. Joe Gurman

    Unexpected points of failure

    I worked for many years in a largish organization that allowed us a small data center. So small that its kludgey HVAC system was totally inadequate for the amount of gear (mainly RAID racks, but a number of servers, too) stuffed into the place. Despite years of complaints, the Borg refused to respond by adding cooling capacity.

    The most spectacular failure, which we had foreseen, was when one of the frequent electrical storms we have in the summers hereabouts took out not only the HVAC, but also the electronics for the keycard on the (totally inappropriate) ordinary wooden-core door to the data center. Such things are meant to fail open, but instead (of course), it failed locked. Wouldn't have been a long-term problem but for three other foreseeable failures: the HVAC unit in the data center needed to be restarted manually (of course), and the Borg had refused my repeated requests for a manual (key) lock override for the keycard system. So with temperatures outside in the vicinity of 35 C, and all the machines inside the data center up and running - and no way to turn them off remotely because the network switch had failed off as well - we were faced with how to get through the damn door and start powering things off.

    Life got more interesting when the electricians showed up, popped a dropped ceiling panel outside our door, and found - nothing. The keycard lock electronics were... elsewhere, but nobody knew where, as no one had an accurate set of drawings (fourth unexpected failure?). So we called in the carpenters to cut through the door. "Are you sure you want us to cut the part of the door with the lock away from the rest of the door? Then you won't be able to lock the door." Feeling a little like Group Captain Mandrake speaking to Col. Bat Guano in Dr. Strangelove, I omitted the expletives I felt like using and simply replied, "Please do it, and do it now, before everything in there becomes a pile of molten slag." They did, we got in, powered off everything but the minimal set of mission-critical hardware, and tried to restart the in-room HVAC unit. No joy, as it had destroyed a belt in failing (not an unexpected maneuver, as it had happened a few times before in "normal" operation). Took the Borg a few days to find and install a replacement.

    Meanwhile, the head-scratching electricians had been wandering up and down our hallway, popping ceiling panels to look for the missing keycard PCB. One of them got the bright idea of checking the panel outside the dual, glass doors to a keycarded office area 10 meters or so down the hall. Sure enough, there were two keycard PCBs up there: one for the glass doors, and one for our door. No one could figure out why it had been installed that way.

    And a few days later, the carpentry shop had a new door for us.

    But wait - it gets better (worse, really). The Borg decided we'd been right all along (scant comfort at that point) and decided to increase our cooling capacity.... by adding a duct to an overspecced blower for a conference room at the far end of the hallway. That's right, now we had two single points of HVAC failure. But the unexpected failure came when, despite the wrapping of our racks in plastic as they pulled, cut, reconfigured, &c, ceiling panels and installed intake and outflow hardware, as well as the new ducting, we got a snowstorm of fiberglass all over our racks (fortunately powered down over the work period). We cleaned up as best we could, but after a week or two, our NetApp RAID controllers started failing, at unpredictable intervals (iirc we had eight of the monsters at the time). It turned out the fibers were getting sucked into the power supply fans, and then - bzerrrt - the power supplies would short out. Being NetApp gear, they were all redundant and hot swappable.... until we, and NetApp, ran out of power supplies for such ancient gear. We managed to find previously owned units on eBay (which required an act of Whoever to relax the normal rule against sourcing from them) to complete our preventive swapout of all the remaining, operational power supplies we knew were going to fail.

    So many, unexpected failure modes.

  25. MOH

    "... but if management have any sense..."

    I think I see the problem

  26. rh16181618190224
    Alert

    How complex systems fail

    This was written a long time ago, and it is still true. https://how.complexsystems.fail/

    Happily I'm now retired and just have to worry about my dual raspberry Pis.

  27. TerjeMathisen

    NTP to the rescue!

    This happens to be my particular area - I have been a member of the NTP Hackers team for 25+ years now, and we have designed this protocol to survive a large number of "Byzantine Generals", i.e. servers that will serve up the wrong time ("falsetickers").

    The idea is that you need 4 independent servers to handle one willful (or accidental) falseticker, and 7 to handle two such at the same time.
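
Those two figures match the classic 3f + 1 bound from Byzantine agreement. The sketch below assumes that is the rule being applied here; the function names are illustrative.

```python
def min_servers(falsetickers: int) -> int:
    """Servers needed to outvote the given number of falsetickers,
    assuming the 3f + 1 Byzantine bound."""
    return 3 * falsetickers + 1


def max_falsetickers(servers: int) -> int:
    """Largest number of falsetickers the given server count tolerates
    under the same assumption."""
    return max(0, (servers - 1) // 3)


if __name__ == "__main__":
    for f in (1, 2):
        print(f, "falseticker(s) ->", min_servers(f), "servers")
    print(max_falsetickers(10), "falsetickers tolerated with 10 servers")
```

Under that assumption a set of 10 servers would tolerate up to three falsetickers at once.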

    If you use the official NTPD distribution (all unix systems + windows), then you can configure all this with a single line

    server pool.ntp.org

    in your ntp.conf file, and your server will get 10 randomly selected servers from areas that are reasonably close to you. Every hour the ntpd daemon will re-sort the list of these 10 servers, discard the pair that has performed the worst compared to the median consensus, and ask the pool DNS service for more servers to replace them with.

    BTW, servers that suddenly start to give out 20 year old timestamps happen every 1024 weeks, which is how often the GPS 10-bit week counter rolls over. Every time this happens, we find a few more old GPS units that fail to handle it properly. :-(
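
The rollover arithmetic is easy to reproduce. The sketch below works in calendar dates only and ignores the leap-second offset between GPS time and UTC. `reported_date` models a hypothetical receiver that assumes a fixed 1024-week era, which is one plausible way such units go wrong rather than a description of any specific device.

```python
from datetime import date, timedelta
from typing import List

GPS_EPOCH = date(1980, 1, 6)   # start of GPS week 0
ROLLOVER_WEEKS = 2 ** 10       # a 10-bit week counter wraps after 1024 weeks


def rollover_dates(count: int) -> List[date]:
    """Calendar dates of the first `count` week-counter rollovers."""
    return [GPS_EPOCH + timedelta(weeks=ROLLOVER_WEEKS * n)
            for n in range(1, count + 1)]


def reported_date(true_date: date, assumed_era: int) -> date:
    """Date shown by a receiver that sees only the 10-bit week number
    and always assumes the same 1024-week era."""
    days = (true_date - GPS_EPOCH).days
    wrapped_week = (days // 7) % ROLLOVER_WEEKS   # what the counter carries
    return GPS_EPOCH + timedelta(
        weeks=assumed_era * ROLLOVER_WEEKS + wrapped_week, days=days % 7)


if __name__ == "__main__":
    print(rollover_dates(2))
    # A unit hard-wired to the 1999-2019 era, on the day of the rollover:
    print(reported_date(date(2019, 4, 7), assumed_era=1))
```

1024 weeks is about 19.6 years, which is where the roughly 20-year-old timestamps come from.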

    Terje Mathisen

    "Almost all programming can be viewed as an exercise in caching"

    1. dmesg
      Pint

      Re: NTP to the rescue!

      Thank you and your team! NTP is one of the most essential pieces of software in a data center, addressing a thorny problem with layers of complexity and nuance, and yet is a snap to set up and run -- and it does its job perfectly. Well done!

  28. Tom Paine
    Mushroom

    Good piece, but

    ...if management have any sense

    That's quite a load bearing "if".

  29. dmesg
    Mushroom

    An interesting workplace, Dr. Falken ...

    "If management have any sense, they will be persuadable that an approved outage during a predictable time window with the technical team standing by and watching like hawks is far better than an unexpected but entirely foreseeable outage when something breaks for real and the resilience turns out not to work."

    Nope. Gotta have all systems up 24/7/365, you know. Can't look like laggards with scheduled downtime, now, can we?

    Forget about routine downtime. We had to beg and plead for ad hoc scheduled maintenance windows. We tended to get them after a failure brought down the campus (and of course, we made good use of the failure downtime as well). But upper Administrators' memories were even shorter than our budget, and it would happen again a few months later.

    Thank $DEITY for the NTP team knowing what they were doing. It was easy to bring independent local NTP servers on line ("Is it really this easy?? We must be doing something wrong"). We put in three or four, each synced independently to four or five NTP pool servers, but capable of keeping good time for several days if the internet crapped out. The sane NTP setup resulted in a noticeable drop in gremlins across our servers, particularly the LDAP "cluster".

    That LDAP setup was a treat: three machines configured for failover. Supposedly. One had never been configured properly and was an OS and LDAP version behind the others, but the other two wouldn't work unless the first was up. Failover didn't work. It was a cluster in multiple senses of the word, and everyone who set it up had departed for greener pastures. We didn't dare try to fix it; it was safer to not touch it and just reboot it when it failed. Actually, we wanted to fix it, but who has time for learning, planning, and executing a change amidst all the fire fighting?

    <digressive rant>

    Besides, fixing it wasn't really necessary, since the higher ups decided we were going to have a nice new nifty Active Directory service to replace it. Problem is, AD has a baked-in domain-naming convention ... and the name it wanted was already in use ... by the LDAP servers. We had to bring in a consulting service to design the changeover and help implement it. No problem, eh? Well, they were actually extremely competent and efficient but the mess that the previous IT staff had left was so snarled that the project was only three-quarters implemented when I left a year later. At least it had cloud-based redundancy, and failover seemed to work.

    The reason for switching to AD? Officially, compatibility with authentication interfaces for external services (which, it turns out, could usually do straight LDAP too). Reading between the lines: it finally dawned on the previous team what a mess they'd made with LDAP and rather than redo it right they went after a new shiny. When they left there was an opportunity to kill the AD project, but a new reason arose, just before I came on board: the college president liked Outlook and the higher-ups decided that meant we had to use M$ back-end software.

    </rant>

    We also had dual independent AC units for the server room. Mis-specced. When one was operational it wasn't quite enough to cool the room. When the second kicked in it overcooled the room. If both ran too long it overcooled the AC equipment room as well, and both AC units iced up. Why would it cool the AC room? Why indeed. The machine room was in a sub-basement with no venting to the outside. The machine room vented into the AC equipment room, and that vented into the sub-basement hallway.

    When the AC units froze up, cue a call to Maintenance to find out where they'd taken our mobile AC unit this time. Then the fun of wheeling it across campus, down a freight elevator that occasionally blew its fuse between floors, into the machine room, then attaching the jury-rigged ducting. It could have been worse. We had our main backup server and tape drive in a telecomms room in another location, and that place didn't have redundant AC. It regularly failed, and for security's sake whoever was on-call got to spend the night by the open door with a couple big fans growling at the stars.

    It was a matter of luck that one of our team had been an HVAC tech in a previous life and he was able to at least minimize the problems, and tell the Facilities staff what we really needed when the building was renovated.

    Oh, do you want to hear about that whole-building floor-to-ceiling renovation? About the time a contractor used a disk grinder to cut through a pipe, including its asbestos cladding, shutting the whole building down for a month while it was cleaned up? With no (legal) access to the machine or AC room for much of that month? Another time, grasshopper.

    <rant redux>

    The college president commissioned an external review of the IT department to find out why we had so many outages, in preparation for firing the head of IT. The report came back dropping a 16-ton weight right on her for mismanagement. Politely worded but unmistakable. She tried to quash it but everyone knew it was on her desk. She then tried to get the most damning parts rewritten but the author wouldn't budge and eventually it all came out. Shortly afterward an all-IT-hands meeting was held where the President appeared (I was told she almost had to be dragged) and stated that we'd begin addressing the problems with band-aids, then move on to rubber bands. Band-aids. That was the exact word she used. I lasted another half year or so, but that was the clear beginning of the end.

    The college is also my alma mater, and I have many fond memories of student days. But I don't respond to their donation pleas any more.

    </rant>

Biting the hand that feeds IT © 1998–2021