putting all your eggs in the internet basket is the opposite of resilient.
Excuse me, what just happened? Resilience is tough when your failure is due to a 'sequence of events that was almost impossible to foresee'
When designing systems that our businesses will rely on, we do so with resilience in mind. Twenty-five years ago, technologies like RAID and server mirroring were novel and, in some ways, non-trivial to implement; today this is no longer the case and it is a reflex action to procure multiple servers, LAN switches, firewalls, …
COMMENTS
-
Saturday 12th June 2021 08:11 GMT ColinPa
Do not have your experts on hand
As well as testing for failure, ensure that the people are considered as part of this. Keep the experts who know what they are doing out of it - they can watch as the other members of the team manage the failovers. You can watch as much as you like - you only learn when you have to do it. If the team is screwing up, let them - up to a point. Part of learning is how to undig the hole.
The expert may know to do A, B, C - but the documentation only says do A, C.
In one situation someone had to go into a box to press the reset button - but the only person with the key to the box was the manager/senior team leader.
Having junior staff do the work while the senior staff watch is also good, as it means someone who knows the system is watching for unusual events.
-
-
-
-
Sunday 13th June 2021 05:56 GMT Ken Moorhouse
Re: DJV threatens dire crisis. TheRegister intervenes to save planet from extinction
Yesterday a major crisis was averted when TheRegister reset the counters after DJV threatened "to click it to see what happens", referring to a negative Downvote count on a post.
Now 11 Upvotes 0 Downvotes.
Phew, that was close.
-
-
-
-
Saturday 12th June 2021 09:37 GMT A Non e-mouse
Partial Failures
Partial failures are the hardest to spot and defend against. So many times I see high-availability systems die because they failed in an obscure and non-obvious way.
And as we make our systems more complex & interconnected (ironically to try and make them more resilient!) they become more susceptible to catastrophic outages caused by partial failures.
-
Sunday 13th June 2021 12:01 GMT Brewster's Angle Grinder
Re: Partial Failures
I was thinking about Google's insights into chip misbehaviour. You can't write your code defensively against the possibility that arithmetic has stopped working.
Likewise, as a consumer of a clock: you've just got to assume it's monotonically increasing, haven't you? (And if you do check, have you now opened up a vulnerability should we ever get a negative leap second?) That said, my timing code nearly always checks for positive durations. But its response is to throw an exception. Which just swaps one catastrophically bad thing for another.
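For what it's worth, the check-and-throw pattern being described might look like this minimal Python sketch (my own illustration; nothing here says what language the commenter's timing code is in):

    import time

    def timed_call(fn, *args):
        # Measure fn's runtime with the monotonic clock, which is supposed to be
        # immune to NTP steps and timezone changes.
        start = time.monotonic()
        result = fn(*args)
        elapsed = time.monotonic() - start
        if elapsed < 0:
            # "Impossible" with a monotonic clock - but if it does happen, the timing
            # data is garbage, so fail loudly. As noted above, this just swaps one
            # catastrophically bad outcome for another.
            raise RuntimeError("negative duration measured: %r" % elapsed)
        return result, elapsed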
-
Monday 14th June 2021 23:28 GMT Claptrap314
Re: Partial Failures
Actually, I had to write code to defend against integer store not working.
It was a fun bit of work. Pure assembler, of course. But such things CAN be done--if you know the failure mode in advance.
Not the point of the story, I know. But intricate does not mean impossible.
-
Tuesday 15th June 2021 11:21 GMT Brewster's Angle Grinder
Re: Partial Failures
I didn't mean to imply it was impossible; only that it was impractical.
I guess dodgy memory is a case of read back and verify. You sometimes used to have to do that with peripherals. Although I suppose it gets more tasty if there are caches between the CPU and main memory.
For arithmetic, you could perform the calculation multiple times in different ways and compare the results. But I'm not doubling or trebling the size of the code and wasting all that extra time just in case I've got a bum CPU. (Particularly if I've already spent a lot of time optimising the calculation to minimise floating point rounding errors.) In the real world, you can't write code on the off chance the CPU is borked.
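As a toy illustration of why that's impractical, a "compute it twice and compare" guard in Python might look like the sketch below (my own example, not something anyone here is actually advocating). Note that it doubles the work and still needs a tolerance, because the two orderings legitimately round differently:

    def checked_sum(values, tolerance=1e-9):
        # Do the same calculation two different ways as a crude guard against a
        # CPU (or memory) that silently gets arithmetic wrong.
        forward = sum(values)                    # left-to-right
        backward = sum(reversed(list(values)))   # same work, different order
        if abs(forward - backward) > tolerance * max(1.0, abs(forward)):
            raise ArithmeticError("redundant calculations disagree - suspect hardware")
        return forward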
-
-
-
Saturday 12th June 2021 09:48 GMT Julz
The
Only place I ever did any work for which took an approach similar to the one proposed by the article was a bank which switched from its primary to its secondary data centers once a month, flip-flopping one to the other, over and over. Well, in fact it was a bit more complicated, as other services went the other way and there were three data centers, but you get the idea. The kicker being that even they had issues when they had to do it for real, due to a plane hitting a rather tall building in New York.
-
Saturday 12th June 2021 13:55 GMT TaabuTheCat
Re: The
When I worked in Texas we used to do bi-annual real production failovers to our backup DC, run for a couple of days and then fail everything back. It was how we knew we were prepared for hurricane season. Interesting how many orgs don't consider failback as part of the exercise. One of my peers (at another org) said management considered the risk too high for "just a test". I'm thinking it's actually quite rare to find companies doing real failover/failback testing in their production environments.
-
Sunday 13th June 2021 14:03 GMT Jc (the real one)
Re: The
Many years ago I was talking to a big bank about DR/HA. Shortly before, they had run their failover test and gone to the pub to celebrate the successful procedure. Suitably refreshed, they went back to the office to perform the failback, only to discover that one of their key steps was missing (and code needed to be written), so they had to carry on running at the DR site for several months until the next scheduled downtime window.
Never forget the failback
Jc
-
Monday 14th June 2021 23:36 GMT Claptrap314
Re: The
Heh. For the bigs (I was at G), this is called "Tuesday". (Not Friday, though. Never stir up trouble on a Friday.)
One of the things that the cloud detractors fail to "get" is just how much goes into 4+9s of resilience. We would deliberately fail DCs as part of training exercises. Also, because a DC was going into a maintenance window. Also, because the tides were a bit high right now, so traffic was slow over there.
The thing is, we didn't "fail over" so much as "move that traffic elsewhere". Increasing capacity on a live server is almost always, well, safer than just dumping a bunch of traffic on a cold server.
-
-
Saturday 12th June 2021 17:22 GMT DS999
Re: The
I've done a lot of consulting that involved high availability and DR over the years, and companies are surprisingly much more willing to spend money on redundant infrastructure than to schedule the time to properly test it on a regular basis and make sure that money was well spent.
I can only guess those in charge figure "I signed off on paying for redundant infrastructure, if it doesn't actually work I'm off the hook and can point my finger elsewhere" so why approve testing? It worked when it was installed so it will work forever, right?
I can't count the number of "walkthroughs" and "tabletop exercises" I've seen that are claimed to count as a DR test. The only thing those are good for is making sure that housekeeping stuff like the contact and inventory list are up to date, and making sure those involved actually remember what is in the DR plan.
Clients don't like to hear it, but before go-live of a DR plan I think you need to repeat the entire exercise until you can go through the plan step by step without anyone having to do a single thing that isn't listed in the plan as their responsibility, and without anyone feeling there is any ambiguity in any step. Once you've done that, then you need to get backup/junior people (ideally with no exposure to the previous tests) to do it on their own - that's the real test that nothing has been left out or poorly documented. Depending on vacation schedules and job changes, you might be forced to rely on some of them when a disaster occurs - having them look over their senior's shoulder during a test is NOT GOOD ENOUGH.
When a disaster occurs, everyone is under a great deal of stress and steps that are left out "because they are obvious" or responsibilities that are left ambiguous are the #1 cause of problems.
-
-
Monday 14th June 2021 09:35 GMT batfink
Re: The
This.
The prospect of losing critical staff in the disaster is often not even considered. It's not something any of us would ever like to see happen, and therefore people don't even like to think about it, but in DR thinking it has to be counted.
One of the DR scenarios in a place I used to work was a plane hitting the primary site during the afternoon shift changeover (we were near a major airport). All the operational staff were at the Primary, and the Secondary was dark. Therefore that would immediately take out 2/3 of the staff, plus most of the management layer. The DR Plan still had to work.
-
Monday 14th June 2021 16:15 GMT DS999
Re: The
When I did DR planning at my first managerial job long ago (a university) the entire DR consisted of having a copy of the full backups go home with a long tenured employee of mine, and then he'd bring back the previous week's full backup tapes. We both agreed that if there was a disaster big enough to take out both our university building and his house some 4 miles away, no one would really care about the loss of data - and it wouldn't be our problem as we'd likely be dead.
A bigger concern than the staff actually dying would be the staff being alive but unable to connect remotely and unable to physically reach work. This actually happened where I was consulting in August 2003 when the big east coast power outage hit. I was fortunately not around when it happened to deal with it, but such a widespread power outage meant no one could connect remotely. Some were able to make it in person, but others were stuck because they were like me and drive their cars down near the 'E' before filling up - which doesn't allow you to get anywhere when the power's out, because gas pumps don't work!
Something like a major earthquake would not only take out power but damage enough roads due to bridge collapses, debris on the road, cars that run out of gas blocking it etc. that travel would become near impossible over any distance. There might be some routes that are open, but without any internet access there'd be no way for people to find out which routes are passable and which are not without risking it themselves.
-
Monday 14th June 2021 23:41 GMT Claptrap314
Re: The
Which is why I don't consider a Western Oregon datacenter to be a very good backup site for one in Silicon Valley. If you prepare for earthquakes, you better understand that they are NOT randomly distributed.
I also don't like backups up & down the (US) East coast because hurricanes have been known to take a tour of our Eastern seaboard. Same on the Gulf.
-
Tuesday 15th June 2021 21:21 GMT DS999
Re: The
An Oregon datacenter is fine for backing up California. No earthquakes in that region will affect both places. The only place where a single earthquake might affect both is the Midwest, if the New Madrid fault has another big slip.
Now with hurricanes you have a point: a single hurricane could affect the entire eastern seaboard from Florida to Massachusetts. One hurricane wouldn't affect ALL those states (it would lose too much power) but it could affect any two of them.
-
-
-
-
-
-
Saturday 12th June 2021 20:52 GMT Anonymous Coward
Re: The
Systems need to be designed and configured to allow for manually triggering failover and failback.
Triggering it and running the failover system for a week (to include day/night and weekday/weekend) should be part of your monthly cycle. The same should hold true for any redundant ancillary equipment (e.g. ACs, UPSs).
Some things will never be anticipated but you can be better prepared for when they happen.
-
Saturday 12th June 2021 21:11 GMT JerseyDaveC
Re: The
Absolutely right. And the thing is, the more you test things, the more comfortable you become with the concept of testing by actually downing stuff. You need to be cautious not to get complacent, of course, and start carelessly skipping steps of the run-book, but you feel a whole lot less trepidatious doing your fifth or sixth monthly test than you did doing the first :-)
One does have to be careful, though, that if you implement a manual trigger the behaviour must be *exactly* identical to what would happen with a real outage. As a former global network manager I've seen link failover tested by administratively downing the router port, when actually killing the physical connection caused different behaviour.
-
Sunday 13th June 2021 19:26 GMT DS999
Re: The
That's great for those who have a full complement of redundant infrastructure, but few companies are willing to spend millions building infrastructure that sits idle except for the most critical business applications that have no tolerance for downtime during a disaster.
Other applications might do stuff like keep the development/QA/etc. stuff in another datacenter and designate that as production DR - obviously you can't do what you suggest without halting development for a week. Some companies contract with a DR specialist that has equipment ready to go but they charge for its use (as well as of course being ready for its use) so you wouldn't run your production on it except in a true DR scenario.
What you suggest makes sense in theory but no one is willing to risk it due to all the unknowns - let's say a change is made to how monthly or quarterly closing is done, and that change isn't properly integrated into the DR system. You fail over to DR then your month end closing blows up. Sure, it served a purpose by alerting you to the improper change, but in a really complex environment this kind of thing will happen often enough that the guys in charge will eventually put a stop to that sort of "throw us in the fire and see if we burn" type of testing.
At least when you have a true DR event, everyone is watching things really closely and they are expecting things like this that fell through the cracks so they can be responded to more quickly because you already have the right people on a 24x7 crisis call. In the non-true DR scenario unless the fix is obvious to whoever gets paged about it, you're going to end up having to round up people and open a conference line to resolve an issue that didn't have to happen, and the application owner involved is going to question why the heck we need to go looking for problems and roping him into a call on the weekend that didn't need to happen.
-
Monday 14th June 2021 23:44 GMT Claptrap314
Re: The
This is what you are paying AWS or GCP to do. They have the equipment, and they know the magic to keep the price down.
As I stated before, one of the big tricks is to have your traffic going ten places with the ability to function just fine if it has to be focused on only seven or eight of them. You see that the overhead cost is now less than 50%--but only after you reach the kinds of scale that they have.
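Rough arithmetic behind that claim (my numbers; the comment only gives the shape of it): if full load fits on seven of ten sites, the spare capacity you pay for is about 43%; on eight of ten it's 25%; versus 100% for a classic idle primary/secondary pair.

    def overhead(total_sites, sites_needed_for_full_load):
        # Spare capacity expressed as a fraction of the capacity you actually need.
        return total_sites / sites_needed_for_full_load - 1

    print(overhead(10, 7))   # ~0.43 -> ~43% overhead
    print(overhead(10, 8))   # 0.25  -> 25% overhead
    print(overhead(2, 1))    # 1.0   -> 100% overhead for an idle primary/secondary pair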
-
-
-
Tuesday 15th June 2021 16:10 GMT Anonymous Coward
Re: The
I thought it was the same bank as the one I'd worked for, except they only had 2 data centres and they developed the practice after a 747 landed on one of them in 1992. They actually ran 50:50 from each data centre and exchanged which location a particular half of the service ran from every two weeks.
-
-
This post has been deleted by its author
-
Saturday 12th June 2021 13:55 GMT amanfromMars 1
Surely not too mad and/or rad to understand???
today this is no longer the case and it is a reflex action to procure multiple servers, LAN switches, firewalls, and the like to build resilient systems.
'Tis sad, and surely something you should be thinking about addressing, that the systems they build all fail the same way despite fielding and following those processes. Such suggests there is an abiding, persistent problem in present standardised arrangements.
Future procedures may very well need to be somewhat different.
-
-
Sunday 13th June 2021 12:20 GMT John Brown (no body)
Re: "sequence of events that was almost impossible to foresee"
I was thinking more along the lines of: it's a time server. Surely unless it's a stratum-1 atomic clock, it should be checking the time against outside sources. And surely even stratum-1 clocks check against their peers. The only thing I can imagine here is that it's getting time from GPS or similar and then adjusting what it gets for timezones, and somehow the factory default managed to think it was in a timezone 20 years away. But then a time server should really be on UTC. Timezones ought to be a "user level" thing.
-
Sunday 13th June 2021 15:58 GMT KarMann
Re: "sequence of events that was almost impossible to foresee"
…is a phenomenon that happens every 1024 weeks, which is about 19.6 years. The Global Positioning System broadcasts a date, including a weekly counter that is stored in only ten binary digits. The range is therefore 0–1023. After 1023, an integer overflow causes the internal value to roll over, changing to zero again.
I'd bet that's what happened here, rather than a pure factory default time. The last rollover was in April 2019, so I bet the default of this time server was to assume it was in the August 1999–April 2019 block, and it just hadn't been rebooted since before April 2019. See the article for a similar bork-bork-bork candidate picture.
I guess back in 1980, they weren't too concerned about the likes of the Y2K problem yet. And lucky us, the next week rollover will be in 2038, the icing on the Unix time Y2K38 cake, if about ten months later.
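The arithmetic, as a small Python sketch (my own illustration of the rollover, not anyone's actual firmware): the receiver only gets the low ten bits of the week number, so it has to supply the rollover count itself, and a stale built-in default is exactly how a box ends up roughly 19.6 years adrift.

    from datetime import datetime, timedelta, timezone

    GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)  # start of GPS week 0

    def gps_week_to_date(ten_bit_week, rollover_count):
        # The broadcast week number is modulo 1024; the receiver must know (or
        # assume) how many rollovers have already happened.
        full_week = rollover_count * 1024 + ten_bit_week
        return GPS_EPOCH + timedelta(weeks=full_week)

    print(gps_week_to_date(100, 1))  # assumes the Aug 1999 - Apr 2019 block -> a 2001 date
    print(gps_week_to_date(100, 2))  # correct post-April-2019 block         -> a 2021 date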
-
-
Saturday 12th June 2021 13:55 GMT Phil O'Sophical
Network failures
In hindsight, this was completely predictable
Doesn't require much hindsight; it's an example of the "Byzantine Generals problem" described by Leslie Lamport 40 years ago, and should be well known to anyone working on highly-available systems. With only two sites, in the event of a network failure it's provably impossible for either of them to know what action to take. That's why such configurations always need a third site/device, with a voting system based around quorum. Standard for local HA systems, but harder to do with geographically-separate systems for DR, because network failures are more frequent and often not independent.
In that case best Business Continuity practice is not to do an automatic primary-secondary failover, but to have a person (the BC manager) in the loop. That person is alerted to the situation and can gather enough additional info (maybe just phone calls to the site admins, a separate "network link") to take the decision about which site should be Primary. After that the transition should be automated to reduce the likelihood of error.
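A minimal sketch of the quorum rule being described (Python, purely illustrative): each site keeps, or takes, the primary role only while it can see a strict majority of the voters, so an isolated site stands down instead of causing split-brain.

    def may_act_as_primary(voters_reachable, total_voters=3):
        # With two sites plus a third witness, a node that has lost contact with
        # both the peer and the witness sees only itself (1 of 3) and must stand
        # down; a node that can still reach the witness (2 of 3) may take over.
        return voters_reachable > total_voters // 2

    print(may_act_as_primary(1))  # False - isolated site stands down
    print(may_act_as_primary(2))  # True  - site plus witness form a majority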
-
Monday 14th June 2021 19:17 GMT Muppet Boss
Re: Network failures
At the risk of sounding (un)pleasantly pedantic, I still have to say that the examples given are not only completely predictable, they are simple textbook examples of bad system design. Taleb does not need to be involved at all.
Configuring 2 NTP servers is a Bad Practice, because 2 NTP servers cannot form a quorum to protect against the standard problem of a false ticker. The recommended and optimal minimum is 3; however, 1 is still better than 2, because if the 2 differ significantly it is difficult or impossible to determine which one is the false ticker.
Some badly designed black box systems only allow for a maximum of 2 NTP servers being configured; in this special case the importance of the system might prompt using a cluster of monitored anycast NTP servers for high availability; for less demanding cases using a single DNS record to something from pool.ntp.org will ensure enough availability without false tickers (while adding the Internet access and DNS dependencies).
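A toy Python sketch of why three sources work and two don't (this is only the shape of the idea, not ntpd's actual selection algorithm): with three or more candidates the median simply outvotes one false ticker, whereas with two there is nothing to break the tie.

    import statistics

    def agreed_time(candidate_times, max_spread=0.5):
        # Take the median of the candidate timestamps and average the sources
        # that agree with it; a single wildly wrong clock gets outvoted.
        if len(candidate_times) < 3:
            raise ValueError("need at least 3 sources to outvote a false ticker")
        median = statistics.median(candidate_times)
        survivors = [t for t in candidate_times if abs(t - median) <= max_spread]
        return statistics.fmean(survivors)

    # Two good sources and one false ticker stuck years in the past:
    print(agreed_time([1_623_600_000.0, 1_623_600_000.2, 1_000_000_000.0]))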
A split-brain failure in a geographically distributed firewall cluster is so common that it is usually specifically tested for in any sane DR plan. This, again, is a glaring example of bad network design, implementation or operation. No black swan magic is necessary - just build better systems or hire a professional.
Real-world problems with highly available systems are usually multi-staged and are caused by a chain of unfortunate events, every single one of which would not have had the devastating effects. Simple, non-trivial failure scenarios, however, do exist. Something from personal experience that immediately comes to mind:
- A resilient firewall cluster in a very large company is exposed to external non-malicious network conditions triggering a bug in the firewall code, and the primary active firewall reboots as a result. The firewall cluster fails over, the secondary firewall is exposed to the same conditions and the same bug and reboots as well, while the primary firewall is still booting up. The process repeats, resulting in a noticeable outage until the unsuspecting external influence is removed.
- A well-maintained but apparently defective dual-PSU device in a large datacentre short-circuits without any external cause, resulting in 2 feeds tripping and powering off the whole row of racks, as well as a few devices not surviving it.
Cheers to all the IT infrastructure fellas, whichever fancy name you are called now!
-