Excuse me, what just happened? Resilience is tough when your failure is due to a 'sequence of events that was almost impossible to foresee'

When designing systems that our businesses will rely on, we do so with resilience in mind. Twenty-five years ago, technologies like RAID and server mirroring were novel and, in some ways, non-trivial to implement; today this is no longer the case and it is a reflex action to procure multiple servers, LAN switches, firewalls, …

  1. cantankerous swineherd

    putting all your eggs in the internet basket is the opposite of resilient.

    1. Blackjack Silver badge

      It's always good to have offline backups.

      Also, the fact that most of the Internet can be brought down just by having the wrong dates seems quite stupid.

      1. DJV Silver badge

        Wrong dates

        Well, I've certainly been on some wrong dates - I'm glad to report that I don't think they ever brought the internet down though.

        1. Gene Cash Silver badge
          Coat

          Re: Wrong dates

          I started with TRS-80s and S-100 machines, but obviously, I'm dating myself. Which is good, because nobody else will.

          1. DJV Silver badge

            Re: I'm dating myself

            And the conversation is probably better!

        2. Ken G Silver badge
          Trollface

          Re: Wrong dates

          If you go on the internet you don't need to go on dates, and vice versa.

  2. Pascal Monett Silver badge
    Thumb Up

    Great article

    Pure wisdom as well.

    I will keep the link as a reference in all my discussions about testing components with my customers.

  3. ColinPa Silver badge

    Do not have your experts on hand

    As well as testing for failure, ensure that the people are considered as part of this. Keep the experts who know what they are doing out of it - they can watch as the other members of the team manage the failovers. You can watch as much as you like - you only learn when you have to do it. If the team is screwing up - let them - up to a point. Part of learning is how to undig the hole.

    The expert may know to do A, B, C - but the documentation only says do A, C.

    In one situation someone had to go into a box to press the reset button - but the only person with the key to the box was the manager/senior team leader.

    Having junior staff do the work while the senior staff watch is also good, as it means someone who knows is watching for unusual events.

    1. DS999 Silver badge

      Re: Do not have your experts on hand

      Speaking of random system failures, I've just seen something I've never seen at The Register. The post I'm replying to currently has -1 downvotes! How appropriate for this article lol

      1. Ken Moorhouse Silver badge

        Re: -1 downvotes

        They were doing system maintenance earlier. Maybe something to do with that?

      2. DJV Silver badge

        Re: Do not have your experts on hand

        Yes, it's still -1 - I'm tempted to click it to see what happens but, then again, I don't want to be the guy that did something similar to what happened with Fastly!

        1. Ken Moorhouse Silver badge

          Re: DJV threatens dire crisis. TheRegister intervenes to save planet from extinction

          Yesterday a major crisis was averted when TheRegister reset the counters after DJV threatened "to click it to see what happens", referring to a negative Downvote count on a post.

          Now 11 Upvotes 0 Downvotes.

          Phew, that was close.

  4. A Non e-mouse Silver badge

    Partial Failures

    Partial failures are the hardest to spot and defend against. So many times I see high-availability systems die because they failed in an obscure and non-obvious way.

    And as we make our systems more complex & interconnected (ironically to try and make them more resilient!) they become more susceptible to catastrophic outages caused by partial failures.

    1. Brewster's Angle Grinder Silver badge

      Re: Partial Failures

      I was thinking about Google's insights into chip misbehaviour. You can't write your code defensively against the possibility that arithmetic has stopped working.

      Likewise, as a consumer of a clock: you've just got to assume it's monotonically increasing, haven't you? (And if you do check, have you now opened up a vulnerability should we ever get a negative leap second?) That said, my timing code nearly always checks for positive durations. But its response is to throw an exception. Which is just to switch one catastrophically bad thing for another.
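
      A minimal sketch (Python; names hypothetical) of the positive-duration check described above: if the measured interval comes out negative, raise rather than silently report nonsense, which at least leaves a clear record of what went wrong.

      ```python
      import time

      def timed(fn, *args, **kwargs):
          """Run fn and return (result, elapsed_seconds), refusing to report
          a negative interval if the wall clock has stepped backwards."""
          start = time.time()              # wall clock: can jump under NTP corrections
          result = fn(*args, **kwargs)
          elapsed = time.time() - start
          if elapsed < 0:
              # Trading one catastrophic failure for another, as noted above,
              # but this one at least leaves an explicit, searchable error.
              raise RuntimeError(f"negative duration {elapsed!r}: did the clock go backwards?")
          return result, elapsed

      # For pure interval timing, time.monotonic() avoids the problem entirely,
      # because the OS guarantees it never goes backwards:
      t0 = time.monotonic()
      sum(range(1000))
      print(time.monotonic() - t0)
      ```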

      1. Filippo Silver badge

        Re: Partial Failures

        You're switching one catastrophically bad and very hard to diagnose thing for one that's catastrophically bad but which I assume leaves a log with the exact problem somewhere. That's a pretty good improvement.

      2. Claptrap314 Silver badge

        Re: Partial Failures

        Actually, I had to write code to defend against integer store not working.

        It was a fun bit of work. Pure assembler, of course. But such things CAN be done--if you know the failure mode in advance.

        Not the point of the story, I know. But intricate does not mean impossible.

        1. Brewster's Angle Grinder Silver badge

          Re: Partial Failures

          I didn't mean to imply it was impossible; only that it was impractical.

          I guess with dodgy memory it's read back and verify. You sometimes used to have to do that with peripherals. Although I suppose it gets more tasty if there are caches between the CPU and main memory.

          For arithmetic, you could perform the calculation multiple times in different ways and compare the results. But I'm not doubling or trebling the size of the code and wasting all that extra time just in case I've got a bum CPU. (Particularly if I've already spent a lot of time optimising the calculation to minimise floating point rounding errors.) In the real world, you can't write code against the off chance that the CPU is borked.
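
          A toy sketch (Python; names hypothetical) of the "compute it more than one way and compare" idea, with exactly the doubled cost complained about above.

          ```python
          import math

          def paranoid_hypot(x: float, y: float) -> float:
              """Compute sqrt(x*x + y*y) two independent ways and cross-check.
              Roughly doubles the work, and only helps for moderate inputs where
              both formulations are numerically well-behaved."""
              a = math.hypot(x, y)            # library routine
              b = math.sqrt(x * x + y * y)    # naive re-derivation
              # Tolerate ordinary floating-point disagreement; anything larger
              # suggests the arithmetic itself (or memory) is misbehaving.
              if not math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12):
                  raise ArithmeticError(f"inconsistent results: {a!r} vs {b!r}")
              return a

          print(paranoid_hypot(3.0, 4.0))     # 5.0 on a healthy CPU
          ```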

          1. fireflies

            Re: Partial Failures

            Surely you could have the code perform a calculation and then compare it against a constant that holds the expected answer - if it deviates from that constant, something has gone terribly wrong - not a 100% guarantee but it would reduce the risk somewhat?

            1. adam 40
              Happy

              Re: Partial Failures

              Then you could optimise the code to return your constant - win-win!

  5. Julz

    The

    Only place I ever did any work for which took an approach similar to that being proposed by the article was a bank which switched from its primary to its secondary data centers once a month, flip-flopping from one to the other, over and over. Well, in fact it was a bit more complicated, as other services went the other way and there were three data centers, but you get the idea. The kicker being that even they had issues when they had to do it for real, due to a plane hitting a rather tall building in New York.

    1. TaabuTheCat

      Re: The

      When I worked in Texas we used to do bi-annual real production failovers to our backup DC, run for a couple of days and then fail everything back. It was how we knew we were prepared for hurricane season. Interesting how many orgs don't consider failback as part of the exercise. One of my peers (at another org) said management considered the risk too high for "just a test". I'm thinking it's actually quite rare to find companies doing real failover/failback testing in their production environments.

      1. Jc (the real one)

        Re: The

        Many years ago I was talking to a big bank about DR/HA. Shortly before, they had run their failover test and gone to the pub to celebrate the successful procedure. Suitably refreshed, they went back to the office to perform the failback, only to discover that one of their key steps was missing (and code needed to be written), so they had to carry on running at the DR site for several months until the next scheduled downtime window.

        Never forget the fail back

        Jc

      2. Claptrap314 Silver badge

        Re: The

        Heh. For the bigs (I was at G), this is called "Tuesday". (Not Friday, though. Never stir up trouble on a Friday.)

        One of the things that the cloud detractors fail to "get" is just how much goes into 4+9s of resilience. We would deliberately fail DCs as part of training exercises. Also, because a DC was going into a maintenance window. Also, because the tides were a bit high right now, so traffic was slow over there.

        The thing is, we didn't "fail over" so much as "move that traffic elsewhere". Increasing capacity on a live server is almost always, well, safer than just dumping a bunch of traffic on a cold server.

    2. DS999 Silver badge

      Re: The

      I've done a lot of consulting that involved high availability and DR over the years, and companies are, surprisingly, much more willing to spend money on redundant infrastructure than to schedule the time to properly test it on a regular basis and make sure that money was well spent.

      I can only guess those in charge figure "I signed off on paying for redundant infrastructure, if it doesn't actually work I'm off the hook and can point my finger elsewhere" so why approve testing? It worked when it was installed so it will work forever, right?

      I can't count the number of "walkthroughs" and "tabletop exercises" I've seen that are claimed to count as a DR test. The only thing those are good for is making sure that housekeeping stuff like the contact and inventory list are up to date, and making sure those involved actually remember what is in the DR plan.

      Clients don't like to hear it, but before go-live of a DR plan I think you need to repeat the entire exercise until you can go through the plan step by step without anyone having to do a single thing that isn't listed in the plan as their responsibility, and without anyone feeling there is any ambiguity in any step. Once you've done that, then you need to get backup/junior people (ideally with no exposure to the previous tests) to do it on their own - that's the real test that nothing has been left out or poorly documented. Depending on vacation schedules and job changes, you might be forced to rely on some of them when a disaster occurs - having them "look over the shoulder" of their senior during a test is NOT GOOD ENOUGH.

      When a disaster occurs, everyone is under a great deal of stress and steps that are left out "because they are obvious" or responsibilities that are left ambiguous are the #1 cause of problems.

      1. robn

        Re: The

        At a bank I worked at (last century!), the disaster recovery exercise chose a random bunch of employees associated with the systems to participate in the recovery exercise. Those not chosen were considered victims of the disaster, and could not provide any input to the exercise.

        1. batfink

          Re: The

          This.

          The prospect of losing critical staff in the disaster is often not even considered. It's not something any of us would ever like to see happen, and therefore people don't even like to think about it, but in DR thinking it has to be counted.

          One of the DR scenarios in a place I used to work was a plane hitting the primary site during the afternoon shift changeover (we were near a major airport). All the operational staff were at the Primary, and the Secondary was dark. Therefore that would immediately take out 2/3 of the staff, plus most of the management layer. The DR Plan still had to work.

          1. DS999 Silver badge

            Re: The

            When I did DR planning at my first managerial job long ago (a university) the entire DR consisted of having a copy of the full backups go home with a long tenured employee of mine, and then he'd bring back the previous week's full backup tapes. We both agreed that if there was a disaster big enough to take out both our university building and his house some 4 miles away, no one would really care about the loss of data - and it wouldn't be our problem as we'd likely be dead.

            A bigger concern than the staff actually dying would be the staff being alive, but unable to connect remotely and unable to physically reach work. This actually happened where I was consulting in August 2003 when the big east coast power outage hit. I was fortunately not around when it happened to deal with it, but such a widespread power outage meant no one could connect remotely. Some were able to make it in person, but others were stuck because they were like me and run their cars down near 'E' before filling up - which doesn't allow you to get anywhere when the power's out, because gas pumps don't work!

            Something like a major earthquake would not only take out power but damage enough roads due to bridge collapses, debris on the road, cars that run out of gas blocking it etc. that travel would become near impossible over any distance. There might be some routes that are open, but without any internet access there'd be no way for people to find out which routes are passable and which are not without risking it themselves.

            1. Claptrap314 Silver badge

              Re: The

              Which is why I don't consider a Western Oregon datacenter to be a very good backup site for one in Silicon Valley. If you prepare for earthquakes, you'd better understand that they are NOT randomly distributed.

              I also don't like backups up & down the (US) East coast because hurricanes have been known to take a tour of our Eastern seaboard. Same on the Gulf.

              1. DS999 Silver badge

                Re: The

                An Oregon datacenter is fine for backing up California. No earthquakes in that region will affect both places. The only place where a single earthquake might affect both is the Midwest, if the New Madrid fault has another big slip.

                Now with hurricanes you have a point, a single hurricane could affect the entire eastern seaboard from Florida to Massachusetts. One hurricane wouldn't affect ALL those states (it would lose too much power) but it could affect any two of them.

                1. Claptrap314 Silver badge

                  Re: The

                  Earthquakes cluster my friend. And there is a fault system that runs up & down the West coast. Not to mention the Great Big One.

            2. adam 40
              FAIL

              Re: The

              Is that one of those "write only" tapes that IT people keep in safes?

              I don't know how many times over the years I've asked for a deleted directory to be restored from tape, to get "sorry, the restore failed".

              1. Claptrap314 Silver badge

                Re: The

                Check out Zen and the Art of Computer Programming. This mode of failure, and the means to prevent it, have been known for decades.

      2. Anonymous Coward
        Anonymous Coward

        Re: The

        Project vs Ops costs. The business case usually included the up front spend for the redundant tin, but after go-live when it's handed to ops, they're looking at ways to shave cost.

    3. Anonymous Coward
      Boffin

      Re: The

      Systems need to be designed and configured to allow for manually triggering failover and failback.

      Triggering it and running the failover system for a week (to include day/night and weekday/weekend) should be part of your monthly cycle. The same should hold true for any redundant ancillary equipment (e.g. ACs, UPSs).

      Some things will never be anticipated but you can be better prepared for when they happen.

      1. JerseyDaveC

        Re: The

        Absolutely right. And the thing is, the more you test things, the more comfortable you become with the concept of testing by actually downing stuff. You need to be cautious not to get complacent, of course, and start carelessly skipping steps of the run-book, but you feel a whole lot less trepidatious doing your fifth or sixth monthly test than you did doing the first :-)

        One does have to be careful, though, that if you implement a manual trigger the behaviour must be *exactly* identical to what would happen with a real outage. As a former global network manager I've seen link failover tested by administratively downing the router port, when actually killing the physical connection caused different behaviour.

      2. DS999 Silver badge

        Re: The

        That's great for those who have a full complement of redundant infrastructure, but few companies are willing to spend millions building infrastructure that sits idle except for the most critical business applications that have no tolerance for downtime during a disaster.

        Other applications might do stuff like keep the development/QA/etc. stuff in another datacenter and designate that as production DR - obviously you can't do what you suggest without halting development for a week. Some companies contract with a DR specialist that has equipment ready to go but they charge for its use (as well as of course being ready for its use) so you wouldn't run your production on it except in a true DR scenario.

        What you suggest makes sense in theory but no one is willing to risk it due to all the unknowns - let's say a change is made to how monthly or quarterly closing is done, and that change isn't properly integrated into the DR system. You fail over to DR then your month end closing blows up. Sure, it served a purpose by alerting you to the improper change, but in a really complex environment this kind of thing will happen often enough that the guys in charge will eventually put a stop to that sort of "throw us in the fire and see if we burn" type of testing.

        At least when you have a true DR event, everyone is watching things really closely and they are expecting things like this that fell through the cracks, so they can be responded to more quickly because you already have the right people on a 24x7 crisis call. In the non-true DR scenario, unless the fix is obvious to whoever gets paged about it, you're going to end up having to round up people and open a conference line to resolve an issue that didn't have to happen, and the application owner involved is going to question why the heck we need to go looking for problems and rope him into a call on the weekend that didn't need to happen.

        1. Claptrap314 Silver badge
          Megaphone

          Re: The

          This is what you are paying AWS or GCP to do. They have the equipment, and they know the magic to keep the price down.

          As I stated before, one of the big tricks is to have your traffic going ten places with the ability to function just fine if it has to be focused on only seven or eight of them. You see that the overhead cost is now less than 50%--but only after you reach the kinds of scale that they have.
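
          A back-of-the-envelope for that overhead claim, assuming identically sized sites and evenly spread traffic (numbers illustrative, not from the source):

          ```python
          def spare_overhead(total_sites: int, sites_needed: int) -> float:
              """Fraction of extra capacity carried when the full load must still
              fit on sites_needed of total_sites identically provisioned sites."""
              return total_sites / sites_needed - 1.0

          print(spare_overhead(2, 1))    # classic active/passive pair: 1.00 (100% spare)
          print(spare_overhead(10, 8))   # ten places, eight must suffice: 0.25 (25% spare)
          print(spare_overhead(10, 7))   # ten places, seven must suffice: ~0.43 (43% spare)
          ```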

    4. Anonymous Coward
      Anonymous Coward

      Re: The

      Fully support the idea of switching "active / standby" to "standby / active" on a regular basis.

      Switches, routers, WAN links, DCs, EVERYTHING.

      1. Claptrap314 Silver badge

        Re: The

        Much better is to have traffic flowing all the time to N+2, and redirect to only N as necessary.

    5. Anonymous Coward
      Anonymous Coward

      Re: The

      I thought it was the same bank as I'd worked for except they only had 2 data centres and they developed the practice after a 747 landed on one of them in 1992. They actually ran 50:50 from each data centre and exchanged which location a particular half of the service ran from every two weeks.

  6. This post has been deleted by its author

  7. amanfromMars 1 Silver badge

    Surely not too mad and/or rad to understand???

    today this is no longer the case and it is a reflex action to procure multiple servers, LAN switches, firewalls, and the like to build resilient systems.

    'Tis sad, and surely something you should be thinking about addressing, that the systems they build all fail the same way despite fielding and following those processes. Such suggests there is an abiding, persistent problem in present standardised arrangements.

    Future procedures may very well need to be somewhat different.

  8. Ken Moorhouse Silver badge
    Coat

    "sequence of events that was almost impossible to foresee"

    B-b-but these times that were issued by the Time Server had been seen before, twenty years ago.

    1. John Brown (no body) Silver badge

      Re: "sequence of events that was almost impossible to foresee"

      I was thinking more along the lines of: it's a time server. Surely unless it's a stratum-1 atomic clock, it should be checking the time against outside sources. And surely even stratum-1 clocks check against their peers. The only thing I can imagine here is that it's getting time from GPS or similar and then adjusting what it gets for timezones, and somehow the factory default managed to think it was in a timezone 20 years away. But then a time server should really be on UTC. Timezones ought to be a "user level" thing.

      1. Jc (the real one)

        Re: "sequence of events that was almost impossible to foresee"

        In many organisations, time doesn't have to be *exactly* correct, just the same on all systems. But still, only a single time source?

        Jc

      2. KarMann Silver badge
        Boffin

        Re: "sequence of events that was almost impossible to foresee"

        GPS week number rollover:

        …is a phenomenon that happens every 1024 weeks, which is about 19.6 years. The Global Positioning System broadcasts a date, including a week counter that is stored in only ten binary digits. The range is therefore 0–1023. After 1023, an integer overflow causes the internal value to roll over, changing to zero again.

        I'd bet that's what happened here, rather than a pure factory default time. The last rollover was in April 2019, so I bet the default of this time server was to assume it was in the August 1999–April 2019 block, and it just hadn't been rebooted since before April 2019. See the article for a similar bork-bork-bork candidate picture.

        I guess back in 1980, they weren't too concerned about the likes of the Y2K problem yet. And lucky us, the next week rollover will be in 2038, the icing on the Unix time Y2K38 cake, if about ten months later.
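
        A minimal sketch (Python; helper hypothetical) of how a receiver resolves the 10-bit week counter against a "pivot" date, and why a stale pivot lands roughly 19.6 years in the past:

        ```python
        from datetime import datetime, timedelta, timezone

        GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)   # GPS week 0 began here
        ROLLOVER = timedelta(weeks=1024)                        # 10-bit week counter wraps

        def resolve_gps_week(week_mod_1024: int, pivot: datetime) -> datetime:
            """Return the first date for this (wrapped) week number on or after
            the pivot. Trusting a stale pivot is the 20-years-ago failure mode."""
            candidate = GPS_EPOCH + timedelta(weeks=week_mod_1024)
            while candidate < pivot:
                candidate += ROLLOVER
            return candidate

        week = 100   # the same raw counter value, resolved against two pivots:
        print(resolve_gps_week(week, datetime(1999, 8, 22, tzinfo=timezone.utc)))  # ~2001
        print(resolve_gps_week(week, datetime(2019, 4, 7, tzinfo=timezone.utc)))   # ~2021
        ```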

        1. Claptrap314 Silver badge
          FAIL

          Re: "sequence of events that was almost impossible to foresee"

          I'm having trouble processing all the levels of wrong involved here. Just wow.

      3. adam 40
        IT Angle

        Ross 154

        Is around 9.6 light years away from Earth, so the return-trip time would put its timezone about 20 years ago...

  9. Phil O'Sophical Silver badge

    Network failures

    In hindsight, this was completely predictable

    Doesn't require much hindsight, it's an example of the "Byzantine Generals problem" described by Leslie Lamport 40 years ago, and should be well-known to anyone working on highly-available systems. With only two sites, in the event of network failure it's provably impossible for either of them to know what action to take. That's why such configurations always need a third site/device, with a voting system based around quorum. Standard for local HA systems, but harder to do with geographically-separate systems for DR because network failures are more frequent and often not independent.

    In that case best Business Continuity practice is not to do an automatic primary-secondary failover, but to have a person (the BC manager) in the loop. That person is alerted to the situation and can gather enough additional info (maybe just phone calls to the site admins, a separate "network link") to take the decision about which site should be Primary. After that the transition should be automated to reduce the likelihood of error.
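
    A minimal sketch (Python; names hypothetical) of the quorum idea: with a third witness site, neither side of a two-site split can promote itself unless it can still see a strict majority.

    ```python
    SITES = {"site_a", "site_b", "witness"}   # the third "witness" site breaks ties

    def may_act_as_primary(me: str, reachable_peers: set[str]) -> bool:
        """True only if this node plus the peers it can reach form a majority,
        so a node cut off from the others can never promote itself."""
        votes = len({me} | (reachable_peers & SITES))
        return votes > len(SITES) // 2

    # Link between site_a and site_b is cut; site_a can still see the witness:
    print(may_act_as_primary("site_a", {"witness"}))   # True: keeps/takes the primary role
    print(may_act_as_primary("site_b", set()))         # False: must not promote itself
    ```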

    1. Muppet Boss
      Pint

      Re: Network failures

      At the risk of sounding (un)pleasantly pedantic, I still have to say that the examples given are not only completely predictable, they are simple textbook examples of bad system design. Taleb does not need to be involved at all.

      Configuring 2 NTP servers is a Bad Practice because 2 NTP servers cannot form a quorum to protect against the standard problem of a false ticker. The recommended and optimal minimum is 3; however, 1 is still better than 2, because if the 2 differ significantly it is difficult or impossible to determine which one is the false ticker.

      Some badly designed black box systems only allow for a maximum of 2 NTP servers being configured; in this special case the importance of the system might prompt using a cluster of monitored anycast NTP servers for high availability; for less demanding cases using a single DNS record to something from pool.ntp.org will ensure enough availability without false tickers (while adding the Internet access and DNS dependencies).
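
      As a sketch of why three sources work where two cannot (Python; names hypothetical), the lone false ticker is simply the outlier from the majority:

      ```python
      import statistics

      def consensus_time(readings: dict[str, float], tolerance: float = 0.5) -> float:
          """Pick a consensus Unix time from several NTP-style readings.
          With two sources that disagree there is no majority to appeal to,
          which is the 1-vs-2-vs-3 point made above."""
          if len(readings) < 3:
              raise ValueError("need at least 3 sources to out-vote a false ticker")
          median = statistics.median(readings.values())
          agreeing = [t for t in readings.values() if abs(t - median) <= tolerance]
          return statistics.mean(agreeing)

      # One server reporting a time roughly 20 years in the past gets voted off:
      print(consensus_time({"ntp1": 1_700_000_000.0,
                            "ntp2": 1_700_000_000.1,
                            "ntp3": 1_068_800_000.0}))
      ```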

      Having a split-brain failure scenario in a geographically distributed firewall cluster is so common that it is usually specifically tested in any sane DR plan. This, again, is a glaring example of bad network design, implementation or operation. No black swan magic is necessary, just build better systems or hire a professional.

      Real-world problems with highly available systems are usually multi-staged and caused by a chain of unfortunate events, no single one of which would have had such devastating effects on its own. Simple yet non-trivial failure scenarios do exist, however. Something from personal experience that immediately comes to mind:

      - A resilient firewall cluster in a very large company is exposed to external non-malicious network conditions that trigger a bug in the firewall code, and the primary active firewall reboots as a result. The firewall cluster fails over, the secondary firewall is exposed to the same conditions and the same bug, and reboots as well while the primary firewall is still booting up. The process repeats, resulting in a noticeable outage until the unsuspected external influence is removed.

      - A well-maintained but apparently defective dual-PSU device in a large datacentre short-circuits without any external cause, tripping both feeds and powering off the whole row of racks, with a few devices not surviving it.

      Cheers to all the IT infrastructure fellas, whichever fancy name you are called now!
