putting all your eggs in the internet basket is the opposite of resilient.
Excuse me, what just happened? Resilience is tough when your failure is due to a 'sequence of events that was almost impossible to foresee'
When designing systems that our businesses will rely on, we do so with resilience in mind. Twenty-five years ago, technologies like RAID and server mirroring were novel and, in some ways, non-trivial to implement; today this is no longer the case and it is a reflex action to procure multiple servers, LAN switches, firewalls, …
COMMENTS
-
Saturday 12th June 2021 08:11 GMT ColinPa
Do not have your experts on hand
As well as testing for failure, ensure that the people are considered as part of this. Keep the experts who know what they are doing out of it - they can watch as the other members of the team manage the failovers. You can watch as much as you like - you only learn when you have to do it. If the team is screwing up - let them - up to a point. Part of learning is how to undig the hole.
The expert may know to do A, B, C - but the documentation only says do A, C.
In one situation someone had to go into a box to press the reset button - but the only person with the key to the box was the manager/senior team leader.
Having junior staff do the work, while the senior staff watch is also good as it means someone who knows is watching for unusual events.
-
-
-
-
Sunday 13th June 2021 05:56 GMT Ken Moorhouse
Re: DJV threatens dire crisis. TheRegister intervenes to save planet from extinction
Yesterday a major crisis was averted when TheRegister reset the counters after DJV threatened "to click it to see what happens", referring to a negative Downvote count on a post.
Now 11 Upvotes 0 Downvotes.
Phew, that was close.
-
-
-
-
Saturday 12th June 2021 09:37 GMT A Non e-mouse
Partial Failures
Partial failures are the hardest to spot and defend against. So many times I see high-availability systems die because they failed in an obscure and non-obvious way.
And as we make our systems more complex & interconnected (ironically to try and make them more resilient!) they become more susceptible to catastrophic outages caused by partial failures.
-
Sunday 13th June 2021 12:01 GMT Brewster's Angle Grinder
Re: Partial Failures
I was thinking about Google's insights into chip misbehaviour. You can't write your code defensively against the possibility that arithmetic has stopped working.
Likewise, as a consumer of a clock: you've just got to assume it's monotonically increasing, haven't you? (And if you do check, have you now opened up a vulnerability should we ever get a negative leap second?) That said, my timing code nearly always checks for positive durations. But its response is to throw an exception. Which is just to switch one catastrophically bad thing for another.
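(A minimal sketch of that sort of duration check, in Python purely for illustration; timed_call is my own hypothetical helper, and it logs and clamps rather than throwing, which is one way out of the "swap one catastrophe for another" trap:)
import time

def timed_call(fn, *args):
    # Measure how long fn takes using the monotonic clock, which is immune
    # to wall-clock steps from NTP corrections, DST changes or GPS rollovers.
    start = time.monotonic()
    result = fn(*args)
    elapsed = time.monotonic() - start
    if elapsed < 0:
        # "Can't happen" with a monotonic source, but if it does, log and
        # clamp instead of throwing, so the timing path can't take the
        # application down with it.
        print("warning: clock went backwards, clamping duration to 0")
        elapsed = 0.0
    return result, elapsed

result, secs = timed_call(sum, range(1_000_000))
print(f"sum took {secs:.6f}s")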
-
Monday 14th June 2021 23:28 GMT Claptrap314
Re: Partial Failures
Actually, I had to write code to defend against integer store not working.
It was a fun bit of work. Pure assembler, of course. But such things CAN be done--if you know the failure mode in advance.
Not the point of the story, I know. But intricate does not mean impossible.
-
Tuesday 15th June 2021 11:21 GMT Brewster's Angle Grinder
Re: Partial Failures
I didn't mean to imply it was impossible; only that it was impractical.
I guess the defence against dodgy memory is read back and verify. You sometimes used to have to do that with peripherals. Although I suppose it gets more tasty if there are caches between the CPU and main memory.
For arithmetic, you could perform the calculation multiple times in different ways and compare the results. But I'm not doubling or trebling the size of the code and wasting all that extra time just in case I've got a bum CPU. (Particularly if I've already spent a lot of time optimising the calculation to minimise floating point rounding errors.) In the real world, you can't write code on the off chance the CPU is borked.
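(Purely as an illustration of the "compute it several ways and compare" idea, a hypothetical Python sketch; the tolerance has to absorb legitimate rounding differences as well as a genuinely bum CPU:)
import math
import random

def kahan_sum(xs):
    # Compensated (Kahan) summation: the same mathematical result computed a
    # different way, so a flaky CPU is unlikely to corrupt both paths alike.
    total = 0.0
    c = 0.0
    for x in xs:
        y = x - c
        t = total + y
        c = (t - total) - y
        total = t
    return total

def paranoid_sum(xs, rel_tol=1e-9):
    a = sum(xs)                    # plain left-to-right summation
    b = kahan_sum(reversed(xs))    # different order and different algorithm
    if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-6):
        raise RuntimeError(f"sum mismatch: {a!r} vs {b!r} - bad hardware or bad tolerance")
    return a

data = [random.uniform(-1, 1) for _ in range(100_000)]
print(paranoid_sum(data))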
-
-
-
Saturday 12th June 2021 09:48 GMT Julz
The
Only place I ever did any work for which took an approach similar to that being proposed by the article was a bank which switched from its primary to its secondary data centers once a month, flip-flopping one to the other, over and over. Well, in fact it was a bit more complicated, as other services went the other way and there were three data centers, but you get the idea. The kicker being that even they had issues when they had to do it for real due to a plane hitting a rather tall building in New York.
-
Saturday 12th June 2021 13:55 GMT TaabuTheCat
Re: The
When I worked in Texas we used to do bi-annual real production failovers to our backup DC, run for a couple of days and then fail everything back. It was how we knew we were prepared for hurricane season. Interesting how many orgs don't consider failback as part of the exercise. One of my peers (at another org) said management considered the risk too high for "just a test". I'm thinking it's actually quite rare to find companies doing real failover/failback testing in their production environments.
-
Sunday 13th June 2021 14:03 GMT Jc (the real one)
Re: The
Many years ago I was talking to a big bank about DR/HA. Shortly before, they had run their fail over test and went to the pub to celebrate the successful procedure. Suitably refreshed, they went back to the office to perform the fail back, only to discover that one of their key steps was missing (and code needed to be written) so they had to carry on running at the DR site for several months until the next scheduled downtime window.
Never forget the fail back
Jc
-
Monday 14th June 2021 23:36 GMT Claptrap314
Re: The
Heh. For the bigs (I was at G), this is called "Tuesday". (Not Friday, though. Never stir up trouble on a Friday.)
One of the things that the cloud detractors fail to "get" is just how much goes into 4+9s of resilience. We would deliberately fail DCs as part of training exercises. Also, because a DC was going into a maintenance window. Also, because the tides were a bit high right now, so traffic was slow over there.
The thing is, we didn't "fail over" so much as "move that traffic elsewhere". Increasing capacity on a live server is almost always, well, more safe than just dumping a bunch of traffic on a cold server.
-
-
Saturday 12th June 2021 17:22 GMT DS999
Re: The
I've done a lot of consulting that involved high availability and DR over the years, and companies are surprisingly much more willing to spend money on redundant infrastructure than to schedule the time to properly test it on a regular basis and make sure that money was well spent.
I can only guess those in charge figure "I signed off on paying for redundant infrastructure, if it doesn't actually work I'm off the hook and can point my finger elsewhere" so why approve testing? It worked when it was installed so it will work forever, right?
I can't count the number of "walkthroughs" and "tabletop exercises" I've seen that are claimed to count as a DR test. The only thing those are good for is making sure that housekeeping stuff like the contact and inventory list are up to date, and making sure those involved actually remember what is in the DR plan.
Clients don't like to hear it, but before go-live of a DR plan I think you need to repeat the entire exercise until you can go through the plan step by step without anyone having to do a single thing that isn't listed in the plan as their responsibility, and without anyone feeling there is any ambiguity in any step. Once you've done that, then you need to get backup/junior people (ideally with no exposure to the previous tests) to do it on their own - that's the real test that nothing has been left out or poorly documented. Depending on vacation schedules and job changes, you might be forced to rely on some of them when a disaster occurs - having them "look over the shoulder" of their senior during a test is NOT GOOD ENOUGH.
When a disaster occurs, everyone is under a great deal of stress and steps that are left out "because they are obvious" or responsibilities that are left ambiguous are the #1 cause of problems.
-
-
Monday 14th June 2021 09:35 GMT batfink
Re: The
This.
The prospect of losing critical staff in the disaster is often not even considered. It's not something any of us would ever like to see happen, and therefore people don't even like to think about it, but in DR thinking it has to be counted.
One of the DR scenarios in a place I used to work was a plane hitting the primary site during the afternoon shift changeover (we were near a major airport). All the operational staff were at the Primary, and the Secondary was dark. Therefore that would immediately take out 2/3 of the staff, plus most of the management layer. The DR Plan still had to work.
-
Monday 14th June 2021 16:15 GMT DS999
Re: The
When I did DR planning at my first managerial job long ago (a university) the entire DR consisted of having a copy of the full backups go home with a long tenured employee of mine, and then he'd bring back the previous week's full backup tapes. We both agreed that if there was a disaster big enough to take out both our university building and his house some 4 miles away, no one would really care about the loss of data - and it wouldn't be our problem as we'd likely be dead.
A bigger concern than the staff actually dying would be the staff being alive, but unable to connect remotely and unable to physically reach work. This actually happened where I was consulting in August 2003 when the big east coast power outage hit. I was fortunately not around when it happened to deal with it, but such a widespread power outage meant no one could connect remotely. Some were able to make it in person, but others were stuck because, like me, they run their cars down near the 'E' before filling up - which doesn't allow you to get anywhere when the power's out because gas pumps don't work!
Something like a major earthquake would not only take out power but damage enough roads due to bridge collapses, debris on the road, cars that run out of gas blocking it etc. that travel would become near impossible over any distance. There might be some routes that are open, but without any internet access there'd be no way for people to find out which routes are passable and which are not without risking it themselves.
-
Monday 14th June 2021 23:41 GMT Claptrap314
Re: The
Which is why I don't consider an Western Oregon datacenter to be a very good backup site for one in Silicon Valley. If you prepare for earthquakes, you better understand that they are NOT randomly distributed.
I also don't like backups up & down the (US) East coast because hurricanes have been known to take a tour of our Eastern seaboard. Same on the Gulf.
-
Tuesday 15th June 2021 21:21 GMT DS999
Re: The
An Oregon datacenter is fine for backing up California. No earthquakes in that region will affect both places. The only place where a single earthquake might affect both is the midwest, if the New Madrid fault has another big slip.
Now with hurricanes you have a point, a single hurricane could affect the entire eastern seaboard from Florida to Massachusetts. One hurricane wouldn't affect ALL those states (it would lose too much power) but it could affect any two of them.
-
-
-
-
-
-
Saturday 12th June 2021 20:52 GMT Anonymous Coward
Re: The
Systems need to be designed and configured to allow for manually triggering failover and failback.
Triggering it and running the failover system for a week (to include day/night and weekday/weekend) should be part of your monthly cycle. The same should hold true for any redundant ancillary equipment (e.g. ACs, UPSs).
Some things will never be anticipated but you can be better prepared for when they happen.
-
Saturday 12th June 2021 21:11 GMT JerseyDaveC
Re: The
Absolutely right. And the thing is, the more you test things, the more comfortable you become with the concept of testing by actually downing stuff. You need to be cautious not to get complacent, of course, and start carelessly skipping steps of the run-book, but you feel a whole lot less trepidatious doing your fifth or sixth monthly test than you did doing the first :-)
One does have to be careful, though, that if you implement a manual trigger the behaviour must be *exactly* identical to what would happen with a real outage. As a former global network manager I've seen link failover tested by administratively downing the router port, when actually killing the physical connection caused different behaviour.
-
Sunday 13th June 2021 19:26 GMT DS999
Re: The
That's great for those who have a full complement of redundant infrastructure, but few companies are willing to spend millions building infrastructure that sits idle except for the most critical business applications that have no tolerance for downtime during a disaster.
Other applications might do stuff like keep the development/QA/etc. stuff in another datacenter and designate that as production DR - obviously you can't do what you suggest without halting development for a week. Some companies contract with a DR specialist that has equipment ready to go but they charge for its use (as well as of course being ready for its use) so you wouldn't run your production on it except in a true DR scenario.
What you suggest makes sense in theory but no one is willing to risk it due to all the unknowns - let's say a change is made to how monthly or quarterly closing is done, and that change isn't properly integrated into the DR system. You fail over to DR then your month end closing blows up. Sure, it served a purpose by alerting you to the improper change, but in a really complex environment this kind of thing will happen often enough that the guys in charge will eventually put a stop to that sort of "throw us in the fire and see if we burn" type of testing.
At least when you have a true DR event, everyone is watching things really closely and they are expecting things like this that fell through the cracks so they can be responded to more quickly because you already have the right people on a 24x7 crisis call. In the non-true DR scenario unless the fix is obvious to whoever gets paged about it, you're going to end up having to round up people and open a conference line to resolve an issue that didn't have to happen, and the application owner involved is going to question why the heck we need to go looking for problems and roping him into a call on the weekend that didn't need to happen.
-
Monday 14th June 2021 23:44 GMT Claptrap314
Re: The
This is what you are paying AWS or GCP to do. They have the equipment, and they know the magic to keep the price down.
As I stated before, one of the big tricks is to have your traffic going ten places with the ability to function just fine if it has to be focused on only seven or eight of them. You see that the overhead cost is now less than 50%--but only after you reach the kinds of scale that they have.
-
-
-
Tuesday 15th June 2021 16:10 GMT Anonymous Coward
Re: The
I thought it was the same bank as I'd worked for except they only had 2 data centres and they developed the practice after a 747 landed on one of them in 1992. They actually ran 50:50 from each data centre and exchanged which location a particular half of the service ran from every two weeks.
-
-
This post has been deleted by its author
-
Saturday 12th June 2021 13:55 GMT amanfromMars 1
Surely not too mad and/or rad to understand???
today this is no longer the case and it is a reflex action to procure multiple servers, LAN switches, firewalls, and the like to build resilient systems.
'Tis sad, and surely something you should be thinking about addressing, that the systems they build all fail the same though fielding and following those processes. Such suggests there is an abiding persistent problem in present standardised arrangements.
Future procedures may very well need to be somewhat different.
-
-
Sunday 13th June 2021 12:20 GMT John Brown (no body)
Re: "sequence of events that was almost impossible to foresee"
I was thinking more along the lines of, it's a time server. Surely unless it's a stratum-1 atomic clock, it should be checking the time against outside sources. And surely even stratum-1 clocks check against their peers. The only thing I can imagine here is that it's getting time from GPS or similar and then adjusting what it gets for timezones, and somehow the factory default managed to think it was in a timezone 20 years away. But then a time server should really be on UTC. Timezones ought to be a "user level" thing.
-
Sunday 13th June 2021 15:58 GMT KarMann
Re: "sequence of events that was almost impossible to foresee"
…is a phenomenon that happens every 1024 weeks, which is about 19.6 years. The Global Positioning system broadcasts a date, including a weekly counter that is stored in only ten binary digits. The range is therefore 0–1023. After 1023 an integer overflow causes the internal value to roll over, changing to zero again.
I'd bet that's what happened here, rather than a pure factory default time. The last rollover was in April 2019, so I bet the default of this time server was to assume it was in the August 1999–April 2019 block, and it just hadn't been rebooted since before April 2019. See the article for a similar bork-bork-bork candidate picture.
I guess back in 1980, they weren't too concerned about the likes of the Y2K problem yet. And lucky us, the next week rollover will be in 2038, the icing on the Unix time Y2K38 cake, if about ten months later.
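(A rough Python sketch of the rollover arithmetic, purely illustrative and with names of my own invention: the broadcast week is only known modulo 1024, so a receiver resolves it against a "cannot be earlier than" date baked into its firmware, and a unit with a pre-April-2019 reference date snaps back roughly 19.6 years:)
from datetime import datetime, timedelta, timezone

GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)   # start of GPS week 0
WEEK = timedelta(weeks=1)

def gps_week_to_date(week_mod_1024, not_before):
    # Resolve the 10-bit week counter against a "can't be earlier than"
    # reference date, e.g. the firmware build date. A unit that was never
    # updated keeps an old reference and snaps back ~19.6 years at rollover.
    candidate = GPS_EPOCH + week_mod_1024 * WEEK
    while candidate < not_before:
        candidate += 1024 * WEEK     # jump forward one whole 1024-week era
    return candidate

# A box whose baked-in reference date predates the April 2019 rollover
old_firmware = datetime(1999, 8, 22, tzinfo=timezone.utc)
print(gps_week_to_date(60, not_before=old_firmware))    # lands in the 1999-2019 era
# The same broadcast week resolved by firmware with a post-April-2019 reference
print(gps_week_to_date(60, not_before=datetime(2019, 4, 7, tzinfo=timezone.utc)))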
-
-
Saturday 12th June 2021 13:55 GMT Phil O'Sophical
Network failures
In hindsight, this was completely predictable
Doesn't require much hindsight, it's an example of the "Byzantine Generals problem" described by Leslie Lamport 40 years ago, and should be well-known to anyone working on highly-available systems. With only two sites, in the event of network failure it's provably impossible for either of them to know what action to take. That's why such configurations always need a third site/device, with a voting system based around quorum. Standard for local HA systems, but harder to do with geographically-separate systems for DR because network failures are more frequent and often not independent.
In that case best Business Continuity practice is not to do an automatic primary-secondary failover, but to have a person (the BC manager) in the loop. That person is alerted to the situation and can gather enough additional info (maybe just phone calls to the site admins, a separate "network link") to take the decision about which site should be Primary. After that the transition should be automated to reduce the likelihood of error.
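(A toy sketch in Python, not anyone's actual BC procedure: with three sites a strict majority can elect a primary, while a two-site split brain gets no quorum and escalates to the human:)
from collections import Counter

def choose_primary(votes):
    # Each reachable site reports which site it believes should be primary.
    # Act only on a strict majority; otherwise do nothing automatic and
    # alert a human (the BC manager) instead of guessing.
    if not votes:
        return None
    site, count = Counter(votes.values()).most_common(1)[0]
    if count > len(votes) / 2:
        return site
    return None

# With three sites, an isolated site cannot win an election on its own:
print(choose_primary({"A": "A", "B": "A", "C": "A"}))   # -> 'A'
print(choose_primary({"A": "A", "B": "B"}))             # -> None (two-site split brain)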
-
Monday 14th June 2021 19:17 GMT Muppet Boss
Re: Network failures
At the risk of sounding (un)pleasantly pedantic, I still have to say that the examples given are not only completely predictable, these are simple textbook examples of bad system design. Taleb does not need to be involved at all.
Configuring 2 NTP servers is a Bad Practice because 2 NTP servers cannot form a quorum to protect against the standard problem of a false ticker. The recommended and optimal minimum is 3, however 1 is still better than 2 because if the 2 differ significantly, it is difficult to impossible to determine which one is the false ticker.
Some badly designed black box systems only allow for a maximum of 2 NTP servers being configured; in this special case the importance of the system might prompt using a cluster of monitored anycast NTP servers for high availability; for less demanding cases using a single DNS record to something from pool.ntp.org will ensure enough availability without false tickers (while adding the Internet access and DNS dependencies).
Having a split-brain failure scenario in a geographically distributed firewall cluster is so common that it is usually specifically tested in any sane DR plan. This, again, is a glaring example of bad network design, implementation or operation. No black swan magic is necessary, just build better systems or hire a professional.
Real-world problems with highly available systems are usually multi-staged and are caused by a chain of unfortunate events, every single one of which would not have had the devastating effects. Simple, non-trivial failure scenarios, however, do exist. Something from personal experience that immediately comes to mind:
- A resilient firewall cluster in a very large company is exposed to external non-malicious network conditions triggering a bug in the firewall code and the primary active firewall reboots as a result. The firewall cluster fails over, the secondary firewall is exposed to the same conditions and the same bug and reboots as well while the primary firewall still boots up. The process repeats resulting in noticeable outage until the unsuspecting external influence is removed.
- A well-maintained but apparently defective dual-PSU device in a large datacentre short circuits without any external cause resulting in 2 feeds tripping and powering off the whole row of racks as well as a few devices not surviving through it.
Cheers to all the IT infrastructure fellas, whichever fancy name you are called now!
-
-
Saturday 12th June 2021 14:10 GMT Anonymous Coward
Strange how so many of these 'impossible to predict' failure modes could have been predicted if someone had run through the following checklist:
What happens if the service fails to respond?
What happens if the service intermittently responds?
What happens if the service responds slowly?
What happens if the service responds with incorrect data?
-
Saturday 12th June 2021 15:10 GMT Anonymous Coward
What? Only four questions?
@AC
What happens when the last devops delivery isn't correctly documented?
What happens when the "agile" development team is working three shifts in three continents....and isn't coordinating the documentation?
What happens when there ISN'T any documentation?
What happens when a critical infrastructure component is a bought out service (see Fastly, Cloudflare)?
...........
........... and so on............
-
Monday 14th June 2021 09:49 GMT graeme leggett
That was my thinking partway through reading.
The first business took steps to mitigate total breakdown of the time server but not a malfunction.
And presumably the interoffice networking example was tested by switching off one unit (the event that was being mitigated against) rather than stopping the heartbeat (the trigger for the backup to take over).
The lesson being that in an ideal world one ought to consider all the things that can happen. I recall a story that Feynman, while on the Manhattan Project, pointed to a symbol on a process diagram and asked some innocent question about it, purely to give the impression he was involved in the discussion. Upon which an engineer present recognised that it was a dangerous weak point.
Now all I have to do is remember this for my own work....
-
-
Saturday 12th June 2021 16:25 GMT Paul Crawford
NTP
For classic NTP operation it is recommended that you have 4 or more time servers configured on each client so they can detect problems including a broken/false clock source. That could be costly in hardware, so you might have 1 or 2 local servers from GPS that offer precise time due to low symmetric LAN delays and back it up with ones across the internet at large that can catch one of the GPS going massively stupid but only offer accuracy, on their own, to several/tens of milliseconds.
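(By way of illustration only, a hypothetical ntp.conf along those lines; the hostnames are placeholders, not anything from the article:)
# Local stratum-1 boxes disciplined by GPS: precise thanks to low, symmetric LAN delay
server ntp1.example.local iburst prefer
server ntp2.example.local iburst
# Internet servers: only millisecond-ish accuracy on their own, but enough of them
# to outvote a local GPS source that has gone massively stupid
pool 0.pool.ntp.org iburst
pool 1.pool.ntp.org iburst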
-
Saturday 12th June 2021 21:47 GMT Anonymous Coward
Re: NTP
Proper NTP (not "Simple NTP") also refuses to skew the clock (and throws up errors) if the time appears to be more than an hour wrong - because that's almost certainly due to something badly broken. It also never gives you discontinuous time - the clock is skewed to slow it down or speed it up; time never goes backwards, and it never jumps.
"Simple" NTP is not proper time sync, it just blindly accepts what it is told. If systems started believing a date 20 years ago, they were not set up with proper time sync.
-
Sunday 13th June 2021 09:36 GMT Mage
Re: NTP
We need to stop using GPS as a cheap clock. Atomic clocks are not so expensive now and a decent solar flare or a system error can take out GPS based time world wide, then DTT, Mobile, DAB, Internet and other stuff needlessly fails. Jamming can take out or spoof local use of GPS.
A big enough solar flare is a certainty, we just don't know when. GPS should only be used for navigation, and critical systems should have some sort of alternative navigation, even if only dead reckoning (inertial) that's normally corrected by GPS and can take over.
-
Monday 14th June 2021 09:05 GMT Paul Crawford
Re: NTP
You still need to sync the atomic clocks together in the first place, and to keep them agreeing afterwards (depending on the level of time accuracy you need)!
For that you need something like GPS to do it, so really it comes down to how many will pay extra for an atomic clock reference oscillator in addition to the GPS receiver and outdoor antenna, etc. Many should do it, if they are running essential services, but usually the bean counters say no...
-
Tuesday 15th June 2021 13:41 GMT Cynic_999
Re: NTP
"
You still need to sync the atomic clocks together in the first place, and to keep them agreeing afterwards (depending on the level of time accuracy you need)!
"
Well yes, it does depend on the level of accuracy you need, but there will be very few cases where a static atomic clock on Earth will need to be adjusted during its lifetime. Even if you require the clock to be accurate to within 1µs (one millionth of a second), an atomic clock would only need adjusting every 100 years. The variation in propagation delay from time source to the point where the time is used is a lot more than 1µs in any networked computer system I can think of, other than systems where precise time is part of the input data it is working on (e.g. the embedded CPU in a GPS receiver).
-
-
-
Monday 14th June 2021 18:48 GMT Anonymous Coward
Re: NTP
> "For classic NTP operation it is recommended that you have 4 or more time servers configured on each client so they can detect problems including a broken/false clock source. That could be costly in hardware, so you might have 1 or 2 local servers from GPS that offer precise time due to low symmetric LAN delays and back it up with ones across the internet at large that can catch one of the GPS going massively stupid but only offer accuracy, on their own, to several/tens of milliseconds."
Thought the rule-of-thumb is 3+ NTP servers: if you have only 2 and they disagree you know one of them is wrong but not which one, whereas if you have 3 (or more) and one of them differs from the other 2 your NTP software knows which is the wrong one. It's the same reason the Space Shuttle had 3 computers running the same code that worked on consensus - the 2 computers that agreed could out-vote the 3rd (obviously wrong) one.
As for cost, not that much these days: 3 Raspberry Pis (with PoE adaptors and RTC add-on) plus GPS, DCF77, and MSF radio receivers and there's your 3 local Stratum 1 servers getting reference time from disparate sources - space, near Frankfurt, and Cumbria respectively. No point using the mobile network for time as (a) it's ultimately coming from GPS and (b) from memory it doesn't provide much accuracy (no milliseconds, not even sure if it provides seconds).
The above kit would cost about 300-400 pounds in total. Low enough to keep spares...
-
Saturday 12th June 2021 16:32 GMT Paul Crawford
Fail-over failure
We used to have one of the Sun/Oracle storage servers with the dual heads configured as active/passive and linked via both an Ethernet cable and a pair of RS232 lines. That was, allegedly, so it could synchronise configuration via the Ethernet link and had the RS232 as a final check on connectivity to avoid the "split brain" problem of both attempting to become master at once. It was an utterly useless system. In the 5+ years we had it as primary storage it failed over a dozen times for various reasons and only occasionally did the passive head take over. We complained and raised a bug report with Oracle and they just said it was "working as designed" because it was only to take over if there was a kernel panic on the active head. Failing to serve files, its sole purpose in life, due to partial borking was not considered a problem apparently.
The conclusion we had was that paying for professional systems from big companies is a waste of time. Sure we had a soft, strong and absorbent maintenance SLA, but we would have had less trouble with a single-head home-made FreeNAS server and a watchdog daemon running.
-
Sunday 13th June 2021 08:23 GMT Anonymous Coward
Re: Fail-over failure
Those systems run Solaris internally, and when they were being developed the Solaris Cluster team offered to help them embed the 10+ year old tried, tested & very reliable clustering software into them.
The usual "Not Invented Here" attitude that was prevalent in that part of Sun prevailed, and they insisted that they were quite capable of doing the job themselves in a more modern fashion, despite having no background in high availability. We all saw the results.
-
Sunday 13th June 2021 00:04 GMT JassMan
Black swan event
Having emigrated to the land of kangaroos before I was old enough to even know the name of a swan, I grew up believing all swans were black. On returning to my birthplace once I was old enough to earn my airfare, I discovered they also come in white. For a while, I just thought it was peculiar to see so many albino birds together, until I realised that all European swans are white. Of the two varieties, I would say the white are the slightly more unpredictable, so what is this black swan event thing?
-
Sunday 13th June 2021 16:12 GMT KarMann
Re: Black swan event
The phrase was coined by Juvenal back in the 2nd century CE, discussing the problem of induction, whilst Europeans weren't aware of your black swans until 1697. By then, the name had stuck, with an added bonus of highlighting the problem that just because you haven't seen a specific counterexample doesn't mean it can't exist.
-
Monday 14th June 2021 09:26 GMT Anonymous Coward
Re: Black swan event
They are rare but a few examples in the UK:
https://www.nationaltrust.org.uk/chartwell/features/chartwells-black-swans
( https://goo.gl/maps/3YinKfXzMmdyZRg16 )
Dawlish (on south west coast)
( https://goo.gl/maps/XRJyKZN2kz8ETg259 )
I think both sets are "natural but monitored" i.e. they are not captive or bred, and arrived under their own steam but they are looked after and encouraged in their current locations.
-
-
Sunday 13th June 2021 04:52 GMT TheBadja
There was once a major outage on Sydney Trains because of an intermittent failure on a router in the signal box. The IT network was configured in failover with a pair of routers in load sharing.
One router failed - and the second router took up the load. Should have been fine, and the failed router could have been reloaded at leisure.
However, the failed router, now without load, restored itself and took up its part of the IP traffic - and then failed under load. Repeat multiple times per second and packets were being lost everywhere.
This stopped the trains, as it disconnected the signal box from centralised rail control.
Trains stopped for hours as the IT network technical staff couldn’t track what was going on, then needed to physically replace the faulty unit. Of course, couldn’t get there by train, so they had to drive there in traffic built up due to no trains.
-
Sunday 13th June 2021 10:23 GMT Anonymous Coward
A customer's mainframe comms front-end was split over three identical devices. Dual networking kit before them ensured that incoming new connections were automatically rotated between the three comms boxes to give load sharing and resilience. It all worked well during fail-over testing and into live service. Even two boxes failing would leave full capability.
A few months later one comms box fell over - followed by the other two a few minutes later. A while later - same again.
The problem turned out to be a customer department had set up real-time monitoring of connection response times. Every so often their test rig would automatically just make a connection and log the response time. As no data was passed, the mainframe didn't see the connection. Unfortunately their test kit left each connection live in the comms box. A box bug meant the number of orphaned connections in each box slowly built up - and eventually the box crashed.
The rotating load sharing meant that all the three boxes approached the critical point at the same time. When a box crashed its users were automatically reconnected to the other two boxes - and that tipped them over the edge.
-
Sunday 13th June 2021 10:42 GMT Anonymous Coward
One of my house MSF radio synchronised clocks was showing an obviously wrong time this week. To save battery power they normally receive to resynchronise only once a day. Forcing it into receiving mode fixed the problem.
The MSF time transmissions send data representing one second - every second. It needs the specific 60 contiguous transmissions to build up the time and date etc. Each second's data has very poor protection against bit corruption. It would appear the clock takes its daily sync from one minute's apparently complete transmission - and then stops receiving.
My home-built timer uses an Arduino and a DS3231 TCXO RTC module. It also receives MSF signals when available. It only resynchronises the RTC when three consecutive minutes' MSF transmissions correlate with each other. The remaining RTC battery voltage is automatically checked once a week.
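(The same correlation rule sketched in Python rather than the actual Arduino code, purely as an illustration: accept a decode only when three consecutive minute frames land exactly one minute apart:)
from datetime import datetime, timedelta

ONE_MINUTE = timedelta(minutes=1)

def accept_after_three(decoded_minutes):
    # decoded_minutes: datetimes decoded from successive MSF minute frames,
    # with None for any frame that failed to decode cleanly.
    history = []
    for ts in decoded_minutes:
        history.append(ts)
        last3 = history[-3:]
        if (len(last3) == 3 and all(last3)
                and last3[1] - last3[0] == ONE_MINUTE
                and last3[2] - last3[1] == ONE_MINUTE):
            return last3[2]          # three consistent frames in a row: trust it
    return None                      # never got three clean consecutive minutes

frames = [datetime(2021, 6, 13, 10, 41), None,   # a corrupted frame breaks the run
          datetime(2021, 6, 13, 10, 43), datetime(2021, 6, 13, 10, 44),
          datetime(2021, 6, 13, 10, 45)]
print(accept_after_three(frames))    # -> 2021-06-13 10:45:00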
-
Sunday 13th June 2021 14:28 GMT aje21
Fail-back is just as important as fail-over
Doing DR pre-testing with a client in a non-production environment, to make sure the steps looked right for when we did actual DR testing in production, the only issue we hit was during the fail-back a week after the fail-over: the DBA picked the wrong backup to restore from the secondary to the primary data centre, and nobody noticed until after the replication from primary to secondary had been turned back on that we'd lost the week of data from the fail-over...
Another reason for testing fail-over and fail-back with a sensible time separation: if it had been done on the same day we might not have noticed the mix-up of backup files.
-
-
Tuesday 15th June 2021 15:36 GMT Bruce Ordway
Re: When it comes to imagining failure modes…
>>Murphy has a better imagination than you
This reminds me of an upgrade where (I believed) I had every potential for failure fully accounted for.
I swapped out an HP K400 with an L1000 over the weekend.
Sunday afternoon... congratulations all around.
Monday morning.... chaos, no file access for users.
I quickly discovered that a power supply for an unrelated file server had failed early AM.
Even though it was a coincidence... this and any other issue was blamed on that weekend upgrade for several days.
Lesson learned... I now avoid making public announcements of IT upgrades/projects.
So that I'm more likely to get credit for "the fixing" rather than "the breaking".
-
-
Monday 14th June 2021 18:18 GMT Abominator
Hardware resiliency is almost always the biggest problem.
You have a disk failure in the RAID array, but it does not quite fail properly enough for the array to drop it, so a) it never gets dropped out of the array and b) RAID IO performance is dragged down two orders of magnitude but limps on.
Same for network failover with bonded network interfaces. I have seen hundreds of cases where a) the backup switch was never properly configured and nobody had properly tested a switch failure, or b) bad drivers in the bonded interface failed to switch over the ports.
It's better to be cheap and simple and rely on software failover, in the case that the software is designed for failover. It's much easier to validate. Just start killing processes randomly.
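(A hypothetical sketch in Python of that approach; "myservice" is a placeholder, it assumes Linux pgrep, and it defaults to a dry run because you would only ever point something like this at a test environment:)
import os
import random
import signal
import subprocess
import time

TARGET = "myservice"   # placeholder name: whatever the failover is supposed to protect

def pids_of(name):
    # Look up PIDs by exact process name using pgrep (Linux).
    out = subprocess.run(["pgrep", "-x", name], capture_output=True, text=True)
    return [int(p) for p in out.stdout.split()]

def chaos_round(dry_run=True):
    # Kill one random instance and leave it to the failover machinery
    # (and the monitoring) to prove it noticed.
    pids = pids_of(TARGET)
    if not pids:
        print("no instances running - either failover already failed, or wrong box")
        return
    victim = random.choice(pids)
    print(f"chaos: would kill {TARGET} pid {victim}")
    if not dry_run:
        os.kill(victim, signal.SIGKILL)

if __name__ == "__main__":
    while True:
        chaos_round(dry_run=True)            # only ever flip this in a test environment
        time.sleep(random.randint(60, 600))  # random gap so failures aren't predictable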
-
Tuesday 15th June 2021 00:48 GMT tpobrienjr
Never test, or test at first install?
Decades ago, at a Big Oil multinational, I was put in charge of the EDI transaction processor. It had been designed with redundancy, fail-over hardware, etc., all the best stuff. I asked when the recovery had been last tested. At first install, I was told. So I scheduled a test shutdown and restart. Twelve hours after the shutdown, it was back up. Quite scary. I won't do that again. I think when downsizing the IT, they had downsized the documentation.
-
Tuesday 15th June 2021 10:34 GMT Anonymous Coward
Yes, absolutely
"Could monitoring tools have been put in place to see issues like this when they happen? Yes, absolutely, but the point is that to do so one would first need to identify the scenarios as something that could happen."
No, that's crap, and anyone who has managed a monitoring system for more than a few years will tell you so. Time sync is one of the two things that always goes wrong. DNS is the other one, and so is SSL certificate expiration, and so are all the other things covered by that script you wonder how you ever wrote but which still seems to work years later 8)
The writer of the article is probably quite knowledgeable but clearly not time served.
-
Tuesday 15th June 2021 12:12 GMT Joe Gurman
Unexpected points of failure
I worked for many years in a largish organization that allowed us a small data center. So small that its kludgey HVAC system was totally inadequate for the amount of gear (mainly RAID racks, but a number of servers, too) stuffed into the place. After years of complaint, the Borg refused to respond by adding cooling capacity.
The most spectacular failure, which we had foreseen, was when one of the frequent electrical storms we have in the summers hereabouts took out not only the HVAC, but also the electronics for the keycard on the (totally inappropriate) ordinary wooden core door to the data center. Such things are meant to fail open, but instead (of course), it failed locked. Wouldn't have been a long-term problem but for three other foreseeable failures: the HVAC unit in the data center needed to be restarted manually (of course), and the Borg had refused my repeated requests for a manual (key) lock override for the keycard system. So with temperatures outside in the vicinity of 35 C, and all the machines inside the data center up and running — and no way to turn them off remotely because the network switch had failed off as well, we were faced with how to get through the damn door and start powering things off.
Life got more interesting when the electricians showed up, popped a dropped ceiling panel outside our door, and found — nothing. The keycard lock electronics were... elsewhere, but nobody knew where, as no one had an accurate set of drawings (fourth unexpected failure?). So we called in the carpenters to cut through the door. "Are you sure you want us to cut the part of the door with the lock away from the rest of the door? Then you won't be able to lock the door." Feeling a little like Group Captain Mandrake speaking to Col. Bat Guano in Dr. Strangelove, I omitted the expletives I felt like using and simply replied, "Please do it, and do it now, before everything in there becomes a pile of molten slag." They did, we got in, powered off everything but the minimal set of mission-critical hardware, and tried to restart the in-room HVAC unit. No joy, as it had destroyed a belt in failing (not an unexpected maneuver, as it had happened a few times before in "normal" operation). Took the Borg a few days to find and install a replacement.
Meanwhile, the head-scratching electricians had been wandering up and down our hallway, popping ceiling panels to look for the missing keycard PCB. One of them got the bright idea of checking the panel outside the dual, glass doors to a keycarded office area 10 meters or so down the hall. Sure enough, there were two keycard PCBs up there: one for the glass doors, and one for our door. No one could figure out why it had been installed that way.
And a few days later, the carpentry shop had a new door for us.
But wait — it gets better (worse, really). The Borg decided we'd been right all along (scant comfort at that point) and decided to increase our cooling capacity.... by adding a duct to an overspecced blower for a conference room at the far end of the hallway. That's right, now we had two single points of HVAC failure. But the unexpected failure came when, despite the wrapping of our racks in plastic as they pulled, cut, reconfigured, &c, ceiling panels and installed intake and outflow hardware, as well as the new ducting, we got a snowstorm of fiberglass all over our racks (fortunately powered down over the work period). We cleaned up as best we could, but after a week or two, our NetApp RAID controllers started failing, at unpredictable intervals (iirc we had eight of the monsters at the time). It turned out the fibers were getting sucked into the power supply fans, and then — bzerrrt — the power supplies would short out. Being NetApp gear, they were all redundant and hot swappable.... until we, and NetApp, ran out of power supplies for such ancient gear. We managed to find previously owned units on eBay (which required an act of Whoever to relax the normal rule against sourcing from them) to complete our preventive swapout of all the remaining operational power supplies we knew were going to fail.
So many, unexpected failure modes.
-
Tuesday 15th June 2021 14:49 GMT TerjeMathisen
NTP to the rescue!
This happens to be my particular area - I have been a member of the NTP Hackers team for 25+ years now, and we have designed this protocol to survive a large number of "Byzantine Generals", i.e. servers that will serve up the wrong time ("falsetickers").
The idea is that you need 4 independent servers to handle one wilful (or accidental) falseticker, and 7 to handle two of them at the same time.
If you use the official NTPD distribution (all unix systems + windows), then you can configure all this with a single line
server pool.ntp.org
in your ntp.conf file, and your server will get 10 randomly selected servers from areas that are reasonably close to you. Every hour the ntpd daemon will re-sort the list of these 10 servers, discard the pair that has performed the worst compared to the median consensus, and ask the pool DNS service for more servers to replace them with.
BTW, servers that suddenly start to give out 20 year old timestamps happen every 1024 weeks, which is how often the GPS 10-bit week counter rolls over. Every time this happens, we find a few more old GPS units that fail to handle it properly. :-(
Terje Mathisen
"Almost all programming can be viewed as an exercise in caching"
-
Tuesday 15th June 2021 20:35 GMT dmesg
An interesting workplace, Dr. Falken ...
"If management have any sense, they will be persuadable that an approved outage during a predictable time window with the technical team standing by and watching like hawks is far better than an unexpected but entirely foreseeable outage when something breaks for real and the resilience turns out not to work."
Nope. Gotta have all systems up 24/7/365, you know. Can't look like laggards with scheduled downtime, now, can we?
Forget about routine downtime. We had to beg and plead for ad hoc scheduled maintenance windows. We tended to get them after a failure brought down the campus (and of course, we made good use of the failure downtime as well). But upper Administrators' memories were even shorter than our budget, and it would happen again a few months later.
Thank $DEITY for the NTP team knowing what they were doing. It was easy to bring independent local NTP servers on line ("Is it really this easy?? We must be doing something wrong"). We put in three or four, each synced independently to four or five NTP pool servers, but capable of keeping good time for several days if the internet crapped out. The sane NTP setup resulted in a noticeable drop in gremlins across our servers, particularly the LDAP "cluster".
That LDAP setup was a treat: three machines configured for failover. Supposedly. One had never been configured properly and was an OS and LDAP version behind the others, but the other two wouldn't work unless the first was up. Failover didn't work. It was a cluster in multiple senses of the word, and everyone who set it up had departed for greener pastures. We didn't dare try to fix it; it was safer to not touch it and just reboot it when it failed. Actually, we wanted to fix it, but who has time for learning, planning, and executing a change amidst all the fire fighting?
<digressive rant>
Besides, fixing it wasn't really necessary, since the higher ups decided we were going to have a nice new nifty Active Directory service to replace it. Problem is, AD has a baked-in domain-naming convention ... and the name it wanted was already in use ... by the LDAP servers. We had to bring in a consulting service to design the changeover and help implement it. No problem, eh? Well, they were actually extremely competent and efficient but the mess that the previous IT staff had left was so snarled that the project was only three-quarters implemented when I left a year later. At least it had cloud-based redundancy, and failover seemed to work.
The reason for switching to AD? Officially, compatibility with authentication interfaces for external services (which, it turns out, could usually do straight LDAP too). Reading between the lines: it finally dawned on the previous team what a mess they'd made with LDAP and rather than redo it right they went after a new shiny. When they left there was an opportunity to kill the AD project, but a new reason arose, just before I came on board: the college president liked Outlook and the higher-ups decided that meant we had to use M$ back-end software.
</rant>
We also had dual independent AC units for the server room. Mis-specced. When one was operational it wasn't quite enough to cool the room. When the second kicked in it overcooled the room. If both ran too long it overcooled the AC equipment room as well, and both AC units iced up. Why would it cool the AC room? Why indeed. The machine room was in a sub-basement with no venting to the outside. The machine room vented into the AC equipment room, and that vented into the sub-basement hallway.
When the AC units froze up, cue a call to Maintenance to find out where they'd taken our mobile AC unit this time. Then the fun of wheeling it across campus, down a freight elevator that occasionally blew its fuse between floors, into the machine room, then attaching the jury-rigged ducting. It could have been worse. We had our main backup server and tape drive in a telecomms room in another location, and that place didn't have redundant AC. It regularly failed, and for security's sake whoever was on-call got to spend the night by the open door with a couple big fans growling at the stars.
It was a matter of luck that one of our team had been an HVAC tech in a previous life and he was able to at least minimize the problems, and tell the Facilities staff what we really needed when the building was renovated.
Oh, do you want to hear about that whole-building floor-to-ceiling renovation? About the time a contractor used a disk grinder to cut through a pipe, including its asbestos cladding, shutting the whole building down for a month while it was cleaned up? With no (legal) access to the machine or AC room for much of that month? Another time, grasshopper.
<rant redux>
The college president commissioned an external review of the IT department to find out why we had so many outages, in preparation for firing the head of IT. The report came back dropping a 16-ton weight right on her for mismanagement. Politely worded but unmistakable. She tried to quash it but everyone knew it was on her desk. She then tried to get the most damning parts rewritten but the author wouldn't budge and eventually it all came out. Shortly afterward an all-IT-hands meeting was held where the President appeared (I was told she almost had to be dragged) and stated that we'd begin addressing the problems with band-aids, then move on to rubber bands. Band-aids. That was the exact word she used. I lasted another half year or so, but that was the clear beginning of the end.
The college is also my alma mater, and I have many fond memories of student days. But I don't respond to their donation pleas any more.
</rant>