Fallback fault-tolerant
Are modern failover systems much better and more resilient than their ancient counterparts?
Nothing ruins a weekend like failed failover, which is why every Friday The Register brings readers a new instalment of On Call, the column in which we celebrate the readers whose recreation is ruined by rotten resilience regimes. This week, meet “Brad” who once worked for a company that provided criminal justice apps to …
Many years ago, a city council in the north of England had a department running a pair of NetWare 3 servers in an SFT cluster. The nodes were called Zig and Zag after a couple of characters on a breakfast TV show. One day, Zag had a permanent and irreparable hardware failure, leaving the cluster running only on Zig. Anecdotally, the users said that performance had improved - although it was only in that state for a few months before we installed the replacements with far more boring and forgettable names.
According to Wikipedia, Zig was the name of the stupid one, and Zag less so - as far as The Big Breakfast went - but apparently it had been more the other way round when they first appeared on Irish TV, where they originated.
Maybe the thick server stopped getting in the way of the good one?
'Twas Star Trek characters at a site I worked at in the 90s (i.e. mid-DS9). The sysadmin awarded himself "Q". The launch of Voyager helped as the company expanded.
I used WestWallaby as the domain name for my machines at home for a while. Gromit for my normal PC, McGraw for the linux box (see what I did there) and Preston for the overclocked monster that kept malfunctioning. I gave Mister_C senior a rebuilt PC and he never worked out why it was called Wallace...
The oldest mainframes at my place are all named after comedians or Wind in the Willows characters, though as they only allow four-character names, some liberties were taken. Over the long decades we've lost ENRI, HRDY and RATY, but we still have ERIC, STAN, and TOAD.
In a collection of Terry Pratchett's articles about life, he recounts a book signing where one lady approached him and he asked her name. She mumbled something which he could not hear; he asked again, another mumble. On the third time of asking it transpired her given name was Galadriel. He asked if she had been born on a Welsh commune. She said not, it was a caravan in Cornwall, but basically hippy parents.
Lots of rock climbs at a place called 'Goblin Combe' are named after Tolkien people and places especially those on Owl Rock and Orthanc : https://www.ukclimbing.com/logbook/crags/goblin_combe-44/
Lots of rock climbs at a place called 'Goblin Combe' are named after Tolkien people and places
This I did not know, but the rule about Goblin Combe is: don't tell people about it, or it'll be swarming with people ruining it, like anywhere nice in Bristol.
I have a whole suite of algorithms named after Tolkien characters and places. Rauros, Gimli, Gloin, Dimrost, etc.
The tradition was started back in the 70's when most of the experimental work informing current practices was done and turned into software on an S/360 mainframe. Who am I to argue with carrying on the naming convention?
We still have the user manuals, and they include glorious estimates for the cost of carrying out a cycle.
Physics doesn't change (much) so the algorithms live on, thankfully mostly migrated to something more recent.
We used to name our servers after birds, there are surprisingly many bird names which is helpful. The bigger servers were named after bigger birds e.g. ostrich, emu etc. You didn't want to get allocated a server called wren!
We did manage to get away with having a server named chough for a while and had several of the tit family. Didn't manage to get a booby past our managers.
Better make this anonymous as some of them are still on the network!
I seem to recall workstation names including "fanny" and "breast" in one of the computer labs at uni. I'm sure the IT team would have claimed they were randomly-chosen according to an entirely innocent documented pattern of names if asked.
I've had pairs of servers called Romulus & Remus and Castor & Pollux in the past, before boring but informative names like "host1" and "host2" became de rigueur.
I set up the HP-UX cluster in my Ph.D. lab with TTTE names - Gordon was the Big machine, Edward, Henry, James and Thomas were the smaller ones. When we got a shiny SGI Iris in the lab the Prof said we had too many boys' names and it needed something more feminine - if there had been two new machines they'd be Annie and Clarabel, but in the end the purple case (and the fact that it was a Crystallography lab) resulted in 'Amethyst'.
Slightly later, OUCS had machines named after colours, and the main multi-user servers were 'black' and 'white'. They were superseded by 'sable' and 'ermine', which somehow transitioned to a Mustelidae theme and there was (I think) a 'wolverine' and a 'weasel' after that...
The company I moved to had Arthurian Legend names - Arthur, Merlin, Guinevere, Morgana...
Later when I took on the system admin of that company, I went very boring and used NATO phonetic alphabet names for the sudden proliferation of VMs
I once named a machine 'icle' when I set it up and that made it into dev literature before I had a chance to change it when my brain eventually woke up (don't force emergency jobs on me before coffee, due to the potential for consequences).
I never had the heart to tell them that I only used the second part of the word because I got bored using the first half - "test" ..
:)
I was also once told by security auditors on another, rather large project to remove the version number from the sendmail HELO prompt (in the days we still used sendmail), so we recompiled it to say "biscuit" instead. I honestly have no recollection why on Earth I chose that, but even after I left that project I could see that it remained in place for quite a few years by means of a simple telnet to port 25.
There are now a few people who know who I am :)
Software is generally the limiting factor. Hardware is definitely capable. If well executed, the combination can be very effective.
Lots of databases are transactional in nature, and so, if properly designed, operations interrupted mid-execution by a failure can be backed out by your failover unit and repeated.
It does require something somewhat more robust, though - a bunch of Python scripts gluing glorified spreadsheets together does not a resilient system make.
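A minimal sketch of that "back out and repeat" idea, assuming a hypothetical accounts table and using SQLite purely for illustration (a real failover pair would sit on a shared, replicated database):

```python
import sqlite3

def transfer(db_path: str, from_acct: int, to_acct: int, amount: int) -> None:
    """Move an amount between accounts as one transaction.

    Either both UPDATEs land or neither does, so a node failure mid-way
    leaves nothing half-done and the surviving node can simply run the
    same operation again.
    """
    con = sqlite3.connect(db_path, isolation_level=None)  # autocommit mode; we issue BEGIN/COMMIT ourselves
    con.execute("BEGIN IMMEDIATE")                        # take the write lock up front
    try:
        con.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                    (amount, from_acct))
        con.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                    (amount, to_acct))
        con.execute("COMMIT")
    except Exception:
        con.execute("ROLLBACK")   # back the interrupted work out...
        raise                     # ...and let the failover unit retry it
    finally:
        con.close()
```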
The issue is not the failover, it's the idiot who designed a failover system with fewer resources than the production system.
If you design a failover, that server needs the exact same configuration as the one it is replacing.
Not doing that is stupid, and this was the result.
I would have thought that you wouldn't need a degree in computer science to understand that. Apparently, you do.
Steady on, chaps!
The person who designed it probably specified an equal system, but then got overruled by 'management' who wanted to save some cash, and possibly reasoned that a failover system would only be used for a short time while the main system was cured of whatever malady had occurred. It is whoever authorised the lower-powered back-up system who deserves your opprobrium. I mean, who has not come across systems hampered by management's failure to shell out the necessary dosh?
..."same configuration"...
Whilst this is desirable, it may not always be required.
If you say up front in the requirements that the backup server does not need to maintain the same performance, just as long as it can carry the load, then it could be smaller. But this would need to be communicated to the user base: when in failover, the service will be slower (and you probably want some indication that the service is running on the backup server, so users can see why it is running slow).
I've been in situations where this has been the decision made (and the client has had a load-shedding process to make sure that the essential parts of the service work at the expense of some of the others). It's a risk decision between cost and failover capability.
Not having a fail-back process is probably more of an issue in these cases, though.
In the Real World (tm) this is absolutely everywhere.
- Emergency lighting is not as bright.
- UPS and backup generators don't carry the whole load.
- Traffic diversions are onto smaller, slower roads.
- Backup Internet connections have less bandwidth and increased latency (eg cellular)
It's the normal way of doing redundancy.
We had a new software system going in that would eventually, if accepted, be our main driver for the business. The system was on evaluation and brand new; it had a dedicated server setup with supposed redundancy. This redundancy was provided by 2 identical boxes* and an identical RAID setup. Or that's what was supposed to be the case. One night I came back to the office to collect my brolly and found 'Rob', one of our senior tech people and lead on this project, sitting at his desk with a coffee; the bin had evidence of other earlier coffees. I wondered why he hadn't been in the boozer with the rest of us.
I wondered for only a short period as he told me in a very irritated voice that he was monitoring a RAID repair/rebuild on this much-vaunted new system. Then I noticed the services screen which showed the health of all the IT kit, and the screen was red, indicating a problem that had taken something down. On closer look it turned out to be the new system, and I asked why this was down given the redundancy we'd all been told about. "Because they haven't installed that yet, so we're running on just one box* and I have to stay until it's working again." Apparently they were super confident of their system and so hadn't installed the backup system at the same time; that was a few weeks away.
He’d phoned the US tech support as they were still working at that hour and had been told what to do by a teenager who sounded like Jeremy Freedman, the squeaky-voiced teenager from The Simpsons. Suffice to say it didn’t take long for Rob to recommend we not proceed with a full rollout.
*or group of boxes can’t remember
> - UPS and backup generators don't carry the whole load.
> - Traffic diversions are onto smaller, slower roads
> It's the normal way of doing redundancy.
Those are not examples of redundancy - fallbacks, yes, redundancy, no. The traffic diversion example should make that very clear.
It depends on what they're supposed to be redundant to. A UPS and generator are not supposed to be redundant to mains power for the entire office but are designed to be redundant for the important servers. Other roads are generally designed to be redundant for emergency vehicles and average traffic, not all the cars that can go down a larger road. If it's designed to be redundant for every purpose the original one was used for, then you're right. If not, you may not be.
Dr Syntax: And make sure the users know this and understand the implications.
Are you mad?
Don't tell the users anything, they'll want explanations and promises. They'll complain to you that 'it doesn't work' or 'it's too slow' or 'You said this would be OK and it isn't.' They'll ask 'Why isn't it working?' and 'How long before I can use it again?' and 'Can I just print this out before you shut the whole thing down again?'
Shakes head in disbelief at some people's naïveté.*
*Accents and spelling courtesy of Apple spell-checker suggestion. No, I have no idea how to get two dots over a lower case i.
A major problem here is when management, when shown the cash outlay for two identically-resourced computers, says they can save money on the second server by making it less-capable, and promises the techies they won't be expected to provide identical performance in failover mode -- promises which are immediately forgotten-and-later-denied by said management.
(see: https://www.youtube.com/watch?v=xWBFWw3XubQ)
Where I'm working at the moment, we have a crazy situation. We have pairs of large database servers in active/active mode, with each individual server sized to support the entire load of the pair on their own.
I look at the on-going CPU and memory utilization in absolute horror sometimes as we are only really using 25-30% of the resource available most of the time (and sometimes much less). The lifetime cost of all this unused CPU and memory must be terrific, but even when set up like this, the DBAs still want more resource (why are DBAs like this?). This is in an environment where we can allocate and de-allocate resource dynamically should the need arise.
This is compounded by the per-core licensing model of the database software (you can guess which this is!) which also puts constraints on dynamic sizing of the systems.
The key here is "most of the time". If it rises closer to 100% some of the time that "some" might be quite important. And things might get scary when that happens. I ended up spending a few Friday lunch-times* watching a server engine eat up more and more memory (due to a badly written 3rd party program which I eventually managed to get fixed) and having to allocate memory on the fly. If it overran it crashed and left a nice mess to clean up. If you don't want to spend your time doing that then going along with the sizing might be a good idea.
* Nice scheduling of the weekly invoice run, manglement.
I feel I must add that the times that the systems use more than the 30% are few and far between (and are also to do with the administrative services, not the application load, and normally only affect one of the two systems).
I understand that if you want normal levels of service even when you have a system down, it is necessary to overspec the systems, but in this case we're talking possibly 28 unused Power 8 processors dedicated to the two system images running the main database, and something like 500GB of unused memory (these systems were installed 8 years ago when Power 8 was current, so very expensive). This is a lot of resource, and is even more when you consider that all of those AAAA class processors need per-processor licenses for the (very expensive - you can guess which one) database!
I estimate that we could recover maybe 40% of the resource (certainly of the CPU resource - memory maybe a little less) without the service even blinking when the load was being carried on just one system. It would be busy, but the biggest constraint would probably be the I/O bandwidth because of the limited number of disk paths (I have pointed this out as the main bottleneck from the measured stats, but the primary DBA is listened to more than me).
Of course, the biggest constraint is the stupid license conditions on the DB license that prevents us using the dynamic resource allocation that these systems have without incurring punitive charges for unnecessary DB licenses!
1. Payment for those "wasted" computer resources is akin to insurance payments: you're paying now to mitigate possible severe bad consequences later. This is especially relevant to retail-related systems which experience transaction peaks around certain holidays and times of the year.
2. You're not paying as much as you fear for that usually-unused capacity: CPU cycle, RAM capacity, and hard disc storage capacity costs are going pretty much continually down.
I don't disagree with you, but these servers are overspec'd to an unnecessary degree IMHO.
During maintenance work, we occasionally carry the load on a single server, and even when we do, there is spare resource on that system.
The problem I have is that when carrying the load on a single system, it logs significant amounts of what is simplistically called "I/O Wait" time, which the DBA looks at, and immediately demands more CPU resource (because taking that into consideration, the CPUs appear to be running at 100%), regardless of the number of times I tell him that that is an indication of the disk subsystem being overwhelmed, not lack of CPU!
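A rough sketch of the distinction being argued about, assuming Linux and the standard /proc/stat layout: sample the counters twice and report real CPU time separately from I/O wait. High iowait with modest user+system time points at the disk subsystem, not a shortage of CPUs.

```python
import time

def cpu_sample():
    # First line of /proc/stat: "cpu user nice system idle iowait irq softirq ..."
    with open("/proc/stat") as f:
        fields = [int(x) for x in f.readline().split()[1:]]
    user, nice, system, idle, iowait = fields[:5]
    return user + nice + system, iowait, sum(fields)

busy1, wait1, total1 = cpu_sample()
time.sleep(5)                          # sample interval
busy2, wait2, total2 = cpu_sample()

interval = (total2 - total1) or 1
print(f"cpu busy {100 * (busy2 - busy1) / interval:.1f}%  "
      f"iowait {100 * (wait2 - wait1) / interval:.1f}%")
```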
I am a DBA and I like it when my servers are at 20-30% CPU usage. That means my database is working well and the application is well designed. If you are at or near 100%, there is something wrong with your database and/or application. On one of our databases, it was routinely getting hammered and was at 100%. Performance was terrible. And I found out why: the application was executing a query to look up a static value over 30 million times during the login process. Stupid shit like that is how you get to 100% on a database server, and no DBA wants to see anything like that.
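A minimal sketch of the obvious fix for that particular sin (table and column names are hypothetical; any DB-API connection would do): look the static value up once and cache it, instead of issuing the same query millions of times at login.

```python
# Hypothetical sketch: pay for the query once per key, then serve every
# later lookup from memory instead of hammering the database at each login.
_cache: dict[str, str] = {}

def static_setting(db, key: str) -> str:
    if key not in _cache:
        row = db.execute(
            "SELECT value FROM settings WHERE key = ?", (key,)
        ).fetchone()
        _cache[key] = row[0]
    return _cache[key]
```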
There's a world of difference between 20-30% and 100%.
I appreciate that if you have a database that allows ad-hoc queries to be run, it is always worth having more headroom (because it's remarkably easy to write a bad, inefficient query that will just consume resource like crazy). But if you are mostly running canned queries on a production server, it should be possible to push to 50-60% without compromising the system, maybe bring this down to 40-50% if this is an active-active database cluster with failover, to cover for failover situations.
it's the idiot who designed a failover system with less resources than the production system
That may have been deliberate to save money (the difference could easily have been tens of thousands of dollars or more back then) with the idea that you'd only be limping along on the backup server for a short time since you would immediately call in the vendor to fix the primary server.
Of course that would require fully operational failover, including the VERY important piece of notification that failover occurred! Otherwise how are you supposed to know there's something broken?
Degrees and certifications are irrelevant to knowing the primary and secondary systems should be identically resourced.
A job in management, or a job which requires sucking up to management, can far too easily remove a person's give-a-shits about possible bad consequences when higher-level management wants to save money (so they can spend it on "more-important" things).
There's enough detail in here for me to know exactly what server was being talked about nearly thirty years later...
Firstly, manglement wouldn't let us have a more expensive failover backup - so that's the first one. With the cost of the machines at the time it was hardly surprising either. Secondly, where is it writ that it must be the same size and specification? The failover box was designed to provide an emergency level of service so you could do the very basics; the custody suite always had the option of running on pen and paper, which they used to do when Oracle 6.5 took one of its regular sulks and refused to work.
The problem with that is getting a copper to appreciate that emergency use might result in a lower level of service, and also getting them to care a **** about it when they did understand. They were some of the most obstreperous and wilfully malicious users that I have ever encountered.
Worked on a system where we had redundant failover servers, both quite well specced. Only problem was our software was so flaky we had to have a separate watchdog timer to reboot the system when our software stopped heartbeating. This happened a lot. The most reliable part of the system was the 'database' which was a set of text files, kept on a shared network drive (no sniggering at the back, please).
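For anyone who hasn't met one, a minimal sketch of that sort of watchdog (the path, timeout and reboot command are illustrative guesses - the real thing was presumably a hardware timer): the application touches a heartbeat file every few seconds, and this separate process reboots the box if the heartbeat goes stale.

```python
import os
import subprocess
import time

HEARTBEAT_FILE = "/var/run/app.heartbeat"   # hypothetical: the app touches this regularly
TIMEOUT_SECONDS = 30                        # how stale a heartbeat we tolerate

while True:
    try:
        age = time.time() - os.path.getmtime(HEARTBEAT_FILE)
    except FileNotFoundError:
        age = float("inf")                  # the app never heartbeated at all
    if age > TIMEOUT_SECONDS:
        # Heartbeat has gone stale: assume the software has wedged and reboot.
        subprocess.run(["/sbin/reboot"])
        break
    time.sleep(5)
```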
We had a pair of Sun E450s purchased from a reseller, but Sun insisted that we have official Solaris Cluster training. This was held on-site so the instructor decided that we could do real-world practicals rather than lab-based ones. Except we couldn't as the reseller had cabled everything wrongly and made a mess of the IP addresses too!
It took the instructor an hour or so of head scratching and troubleshooting before realising what was wrong and fixing it for us, after which it worked flawlessly.
Except the developers used the redundant server to develop (don't ask, I don't know why!) and every month or so we had a failover as the developers made a mistake. Eventually, they got given their own little server to use and the failovers stopped.
"Except the developers used the redundant server to develop (don't ask, I don't know why!)"
I know exactly why .... someone with 'clout BUT no IT experience' reasoned 'there is a box that is not used' and said use that instead of funding kit for the developers.
Been there, given all the warnings and lived to tell the tale.
[Basically, told the powers that be that if you use the server and it fails over you will lose all the current work being done by the developers. Put in place whatever systems to limit loss BUT it is on YOUR head not mine !!!]
Fun times !!!
:)
Years ago when I worked in a Big Broadcasting Corporation and we were heading towards automation, some of my colleagues visited a bank (I believe) to see their set up, and they had a proper pair of 'systems' and could and did switch over from one to the other frequently. This meant that both systems were well maintained, up to date and usually ready to go.
Of course we still went down the "main and backup" way of thinking - where even if the "backup" was specced the same as the main, it was considered a bit second class and never got the love of the main. If we ever went over to the backup, there was always a push to go back to the main as soon as possible.
Much better if you want proper resilience to have X and Y systems which are the same spec and truly considered, treated and used equally.
The difference there is that the Bank would probably lose money in a failover situation if they weren't prepared for it, which makes justifying the initial outlay and ongoing costs a lot easier, whereas a Big Broadcasting Corporation, probably funded by the public and accountable to financial audit and public scrutiny of their accounts, only has a reputational loss to consider.
Another way I implemented once upon a time was to have multiple servers that could handle a request and a mechanism by which one of the machines could claim the request. The initial request was forwarded to multiple servers and whichever machine got to the request first handled it. This redundancy was duplicated at multiple levels. If one of the machines went down you wouldn't notice it. The system was reliable enough that it became something people forgot about, and we needed to add monitoring to report there was a problem or we would not know until all the redundant machines were down.
I never heard if this was considered a standard practice. It seemed somewhat like a RAID disk array. Maybe it was a RAIS? Same concept except for servers?
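Something like it is pretty standard now under names like work queues or "competing consumers". A minimal sketch of the claim step, assuming the requests live in a shared SQL table (hypothetical schema); the atomic UPDATE means only one worker ever wins a given request:

```python
import time

def try_claim(db, request_id: int, worker: str) -> bool:
    """Atomically claim a request; True only for the worker that got there first."""
    cur = db.execute(
        "UPDATE requests SET claimed_by = ? WHERE id = ? AND claimed_by IS NULL",
        (worker, request_id),
    )
    db.commit()
    return cur.rowcount == 1

def worker_loop(db, worker: str, handle) -> None:
    while True:
        for (request_id,) in db.execute(
                "SELECT id FROM requests WHERE claimed_by IS NULL").fetchall():
            if try_claim(db, request_id, worker):
                handle(request_id)      # losers simply move on to the next request
        time.sleep(2)                   # poll interval; any surviving worker keeps going
```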
I did that by mistake once. The 'claim' was originally meant to just tell the users which requests were in progress and which were queued (and could in principle be cancelled/modified).
Then one time someone accidentally started the process twice, and... nothing went wrong. Each process just grabbed a task for itself and claimed it, and they only polled every two seconds so there weren't any issues with collisions.
In the long run we ended up running it on two machines to double the throughput, with hardly any extra effort needed. Great feeling.
In all my years (16) working with AIX I never met an HA cluster which hadn't been broken into individual servers, without HA/clustered applications. Or maybe the IBM sales rep sold the customer an HA solution under the premise of "you need to have high availability applications!", but the customer never got them, or was too cheap to pay the cluster licenses, I dunno.
This of course was carried out without deactivating the HA/floating IPs/shared LUNs. So whenever a reboot or network change came in, the inevitable high severity ticket was dispatched.
A particularly cumbersome setup I had to recover once involved several individual Oracle instances (not RAC) each with their own floating IP, all poorly documented by the existing SysAdmins and DBAs, which took a lot of head scratching to make sure the IP aliases were assigned to the right NIC/node.
It has been a long time since I sat at a D210 terminal, but it seems to me that even at the end of my involvement with the MV/Eclipse systems they were using DG's own Xodiac networking. This does not of course affect the burden of the story--but what would the comments section be without a bit of gratuitous pedantry.
I never got to work with anyone prosperous enough to hook DG servers together to fail over.
I had a DG Nova/4 for a while, and vaguely recall reading somewhere that one could cable two CPUs to share the same disk box (a 6045? It had 5+5 MB: DP0 was the lower, fixed platter, and DP1 was the upper, removable [in-a-cartridge] platter). Did you ever see or hear of such a thing being used?
That the server had only half the expected CPU and only half the expected memory?
Or was he administering so many machines he didn't know how much memory/CPU there was supposed to be in this one?
I'm assuming that since there was telnet to the machine he could have checked the amount of memory and number of CPUs?
for you all
'London Ambulance service'
For those not in the know, they introduced a computerised system to look after the service; due to a software fault the system fell over after 3-4 months. However, the software and hardware guys who'd built the system had said to themselves "It might fall over.... so let's design it to fail over to a backup server".... if only the beancounters hadn't cancelled buying the backup server because of the unneeded cost......
I think if I'd been in that one I'd have given the BCs a written warning including something on the lines of "I am warning you that not providing the backup is likely to result in loss of human life. When this happens I will personally attend any Coroner's Court and give evidence of this warning and will name you in that evidence."
This assumes that they knew about that and that they didn't mind losing their jobs that day. If they created the software, made the test systems, then handed it to someone else to deploy in production, they wouldn't know whether the servers were present. If the person who received the task of putting it in production is aware that there will be no backup server and assumes the developers were informed, they may assume that it has been handled in some way. It's possible nobody was in a position to make that ultimatum. Which brings us to the problem with ultimatums: managers who like commanding don't like being threatened by employees, and employees know it. People who aren't ready to leave tend not to be that blunt with people who won't listen, so it's more likely to have sounded like this:
Manager: I've decided we don't need the two servers. One should be enough and save our budget.
IT: The system is designed for two, so I suggest we go along with those requirements.
Manager: I've decided that we won't. Make it work.
IT: If something goes wrong, this might have safety implications.
Manager: Then do your job well and make sure that server stays up.
IT: We can't guarantee that, though. Two servers would
Manager: I can find an IT person who will guarantee it. Should I look for them?
Jets with more than two engines were originally the only ones allowed to fly over the Atlantic because most three-engined planes can fly perfectly well on two engines, but older two-engined planes could not maintain height on one sufficiently long to divert to the nearest airport.
This analogy would be better by suggesting two engines were enough in those early days.
That's somewhat true, because everyone in IT knows that, even with two servers, there is some chance that something will affect them all. However, it's less difficult to find someone who says they're reasonably confident that they can make the server stay up without having a second one, especially if the manager keeps reducing the number of facts they tell them about how critical the thing is.
Never suggest, only advise, or strongly advise. Make sure it's recorded. As long as it's in black and white you did as you are employed to do.
If people don't listen to your advice make sure it's recorded and then it's no longer your problem.
Always make sure that if an "I told you so" situation occurs, you can back it up.
Better still, have someone impartial write a risk assessment first.
I remember a failover cluster for a server to run an Oracle database.
Whereas the cost covered configuring the system and database to failover, it didn't cover anything else we had running there. Such as the Web server - and other bits.
So, after a quick course on how to do it (and, thankfully, the documentation) I wrote the scripts to do that. But I was never able to test them before putting them into place (the system was already running by then...)
Over the next year my scripts all worked, but the vendor supplied ones (or our own central IT config) always seemed to have some problem.
So after a year we had two separate systems - the downtime caused by the failover's failures wasn't worth it.
I once had the pleasure of watching someone demonstrate the resilience of Veritas multi-path volume management.
Unplugged one of the arrays - no problem.
Plugged it back in - massive problems as the system started yelling about duplicate network addresses.
Seems that the system would fall over, but could not get back up.
Some wag suggested hanging a "life alert" on the frame, but was made to sit in the uncooperative corner.
I remember the early days of Sun failover systems, failover worked perfectly but the journal file system wasn't really ready for primetime and nobody used it, so the result was: failover takes 2-3 seconds, the resulting FSCK when the backup machine rips the disks away from the primary takes 4-6 hours! It was almost always quicker to fix the hardware issue in the primary server than to allow a failover to occur.
If he can use telnet from his "hour away" office, couldn't he telnet in to his office from home, then telnet to the remote server? Nobody would even know. I did shit like that back in the 90's. If I couldn't access something on a campus network, I'd telnet somewhere I could (home, ISP shell account etc.). I always had my own telnet client on a floppy disk too for windows computers.
I mean, how "secure" is the access anyway. It's just telnet, so all it would be is an address mask (and a user account of course) on the remote server.
DHCP server is a good one. I've seen it a few times: a router with an internal DHCP server got used to create a separate network for testing, and then someone bridged it to the main company network.
Anyone who had powered on their machine before the test router was added would have a good address. Anyone who booted after that and looked for a dynamically assigned IP might get an address from either the real DHCP server or the test one. If they got an address from the test network, it might seem to work depending on what other machines were powered up in the test network. IT can recognize the symptoms, but AFAIK it can be a challenge to figure out where the router is.
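One hedged way to at least spot the rogue - a sketch using scapy, which needs root, and where the interface name is an assumption: broadcast a DHCP discover and see how many different servers answer. The offending box's MAC address at least tells you which switch port to go hunting on.

```python
from scapy.all import BOOTP, DHCP, Ether, IP, UDP, get_if_hwaddr, srp

IFACE = "eth0"                      # assumption: adjust to your NIC
mac = get_if_hwaddr(IFACE)

# A broadcast DHCPDISCOVER; every DHCP server on the segment should answer it.
discover = (
    Ether(src=mac, dst="ff:ff:ff:ff:ff:ff")
    / IP(src="0.0.0.0", dst="255.255.255.255")
    / UDP(sport=68, dport=67)
    / BOOTP(chaddr=bytes.fromhex(mac.replace(":", "")))
    / DHCP(options=[("message-type", "discover"), "end"])
)

# multi=True keeps listening after the first reply so every server gets counted.
answered, _ = srp(discover, iface=IFACE, multi=True, timeout=5, verbose=False)
for _, offer in answered:
    print(f"DHCP offer from {offer[IP].src} (MAC {offer[Ether].src})")
```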
I was told a story (yes, second hand) about a power failure at a telephone central office. So, the power failed one day. Well, the central office has batteries that take over in such cases, but they don't last too long. So, there is a nice natural-gas powered turbine to generate power for the "duration" of the power failure. This is a "good thing". To start said turbine, it has a nice large air tank to spin the beast up. Super easy. So the chief flips the switch to start the turbine up, and it starts nicely. Now to switch it over to power the "plant". Well, over the course of time lots of things had been plugged in, and while the power company could nicely handle the load, the gas turbine generator could not. So, the whole thing failed, and they were back to square one. Not to worry, the air start system had enough to do two starts (gotta love the capacity). So, another start, and a wise man would re-supply the air start system BEFORE switching the power over. Nope, he just flipped the switch and the same failure happened. Well, with only two starts available, he had to sit and watch as the central office died around him. A big lesson learned the hard hard hard way.
I wasn't informed if procedures were changed, but I suspect so!
Oh, I-B-M, DEC, and Honeywell, H-P, D-G, and Wang,
Amdahl, NEC, and N-C-R, they don't know anything
They make big bucks for systems so they never want it known,
That you can build a mainframe from the things you find at home.
Way back in the 90's, I used to work on Data General servers. They were pretty nice, but NT4 didn't really support them, and required loading special drivers at boot to recognize the SCSI drives.
And then there was EMC (known as "Even More Complex"). I had to administer those SAN's with symcli, and that process is not for the faint of heart. Make one mistake and you could potentially corrupt a critical database. I never made any, but that was because I double and triple checked every single command, all while ensuring we had good backups and verifying standby databases were in sync.
When my contract ended, that was the job I was glad I no longer had. My replacement wasn't so lucky and he corrupted a database. Of course there were no backups and the only standby was out of sync. I got called back to help with the recovery, and it was a mess. The outage lasted several days.
A military lab was surrounded by a moat. For resilience, it was powered by 2 electricity cables, crossing the moat at different locations, with automatic failover.
One day the moat had to be dredged out. The dredger cut one of the cables, the failover worked perfectly and nobody noticed. The dredger proceeded with its work, till it also cut the second cable....
Was it this one:
"The Daily Telegraph in 2009 exposed [Douglas] Hogg for claiming upwards of £2,000 of taxpayers' money for the purposes of "cleaning the moat" of his country estate, Kettlethorpe Hall; "
?
https://en.wikipedia.org/wiki/Douglas_Hogg#:~:text=The%20Daily%20Telegraph%20in%202009,parliamentary%20expenses%20scandal%2C%20although%20it
And then there are systems designed with good intentions... a lovely resilient clustered pair of servers, but the application was coded to address one server name, not the cluster name; a standby generator that had its mains-sensing switch on the output side; another generator that had its fuel supply pump wired to the non-essential mains; a lovely battery set-up that unfortunately included the building lifts as essential supply... This is why we weep into our beer.