
Nine nines and an explosion
Well, when the power goes, they eventually get a taste of reality again.
There is no point in chasing extreme uptimes when your system will eventually break down in a spectacular, if perhaps not literal, explosion.
As Friday dawns with its promise of rebooting the working week, The Register presses the button to publish another instalment of On Call – our weekly, reader-contributed column that shares real-world tales of being flummoxed by the farces they're asked to fix. This week, meet "Kane" who shared the story of the time a client …
Indeed...'absolutely no downtime' and 'nothing can be switched off' do not work together. If your goal is zero downtime, then you need to have redundancy to allow for unplanned failure, which means you do have the option to switch things off or restart them as and when required.
Presumably his thought process went as far as "what would be the best uptime number?"
I wonder what his thought process would be when presented with "Well, that will triple what we currently spend on servers, due to having to create clusters and redundancy and such."
Unfortunately, redundancy can be a difficult sell to the bean counters.
After all, it costs money, and if implemented properly, will not make a visible difference.
Yes, I know that if it is implemented properly it will increase reliability, but they would probably point to the complete lack of downtime they already have as evidence that the expense of redundancy is neither justified nor needed.
There is literally no point in the uptime of a specific system or server, if it is not in the service of the uptime/availability of a required *function*. That's exactly why you have redundancy, so that the function can continue uninterrupted even while one of the servers is being rebooted.
Otherwise it's just willy-waving of the "ooh, look how long my server has gone between reboots" variety - a vanity number that may or may not have any real-world value.
At Dartmouth, with our 100-user time-sharing system, we kept track of scheduled uptime. We counted any interruption as 15 minutes even though the system would come up much faster than that. Some months we were over 99%, but other months we got as low as 97% (mainly due to total building power failures).
Getting to 99.9% (no more than about a third of a day of downtime per year) would be quite hard. More nines would require a total redesign of the power system, redundant computers, etc.
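For anyone who wants to put numbers on the nines, the conversion is simple arithmetic; a quick sketch, assuming nothing beyond a 365-day year:

```python
# Rough downtime budget per year for a given availability target.
HOURS_PER_YEAR = 365 * 24  # ignoring leap years

for label, availability in [("two nines", 0.99),
                            ("three nines", 0.999),
                            ("five nines", 0.99999),
                            ("nine nines", 0.999999999)]:
    downtime_seconds = (1 - availability) * HOURS_PER_YEAR * 3600
    print(f"{label:>11}: {downtime_seconds / 3600:10.4f} hours/year "
          f"({downtime_seconds:12.3f} seconds)")

# three nines ~ 8.76 hours/year, i.e. roughly a third of a day
# nine nines  ~ 0.03 seconds/year, hence the title
```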
About 10 years ago we had a PC sitting in the bottom of a machine control cabinet, just a bog-standard computer, not an industrial one, which failed to boot one morning.
Upon investigation I discovered it hadn't been powered down for at least 5 years, possibly longer, yet for some reason the day before, the person using it had selected the Shutdown option on the menu. We all know that's a recipe for problems; in this case the spinning rust was completely dead and the motherboard couldn't get past POST. No real problem: fetch a PC from the spares pile, then wait a week or so for the machine manufacturer to supply us the software from the States (we couldn't download it as you needed the floppy* for the security key; not sure why it was secured, as the software would only work with their machine).
*I did say it was a while back!
Re: (we couldn't download it as you required the floppy* for the security key, not sure why it was secured as the software would only work with their machine).
I used to support a Uni computer lab with the same problem. We had proper, broadcast quality, capture cards in ten of the machines. These cards were essentially high end pentium based PCs on a card. They even required their own SCSI hard drive.
It was interesting how they handled drive access. These drives appeared to the host PC as NTFS-formatted drives, with several folders on the root: JPG, PNG, AVI and a few others. Each had folders inside for individual projects, and in those folders you would find the video for each project in the format matching the folder name. So the JPG folder would have the individual frames of the video in JPEG format, the AVI folder would have the video in AVI format. All conversion was done on the fly. It was actually a very neat system, once the user got used to it.
Anyhow, I digress. The cards were directly supported by both Adobe Premiere, and a custom version of a little known video editor, called "Speed Razor". This was a commercial application, but rare enough that when I logged a support call, I was told our ten machine lab was the largest installation in the country.
This Speed Razor required a dongle that plugged into the parallel port.. This was a massive pain in the arse, because it meant we had to install parallel port extension cables, and run them back into the machine so we could lock the dongles inside the machines. As if requiring access to a £5k video capture card wasn't enough of a restriction.
I worked with some simulation software that needed a dongle (thankfully USB, though we did have some older serial-port ones) to validate licenses, but only at boot. Because the dongles went missing a lot, I once ended up spending an hour sharing a half-dozen dongles between thirty machines, booting them in batches, after a power outage took down the whole room at once. PITA, but I got everything back up and running before the important folk got into the office!
("Power outage? What about the UPS?" I hear you ask. Well, the power went out because the UPS caught fire...)
It didn't catch fire, but I learned you need to test your UPS every so often. I had a brief power outage, a little more than the lights flickering, and my UPS shut down. Seems gel-cell lead-acid batteries are only good for a few years.
Dang, I should have put a post-it note on my UPS stating when I changed the battery.
APC UPSes are notorious for killing SLA gel batteries. This is because, in order to maintain tip-top maximum runtime (without spending money on bigger batteries), they float-charge them at 13.8V per 12-volt battery. That is right at the uppermost limit for float charging. In a few years (as little as 3 for third-party batteries) they get hot, swell, and (if left for too long) emit nasty smells.
Of course, APC would happily sell you a nice new set, premium quality, premium price, that would then be carefully cooked in 4-5 years.
Floating at 13.2V would give years more life - but reduce the headline runtime and kill the golden goose of battery replacement. No, there is no setting for float charge voltage!
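As a quick sanity check on those numbers, here's the per-cell arithmetic (the 2.25-2.30 V/cell float range below is the commonly cited guideline for sealed lead-acid, an assumption rather than gospel):

```python
# Convert a float-charge voltage for a 12 V SLA battery (6 cells in series)
# into volts per cell, and compare against the commonly quoted float range.
CELLS_PER_12V_BATTERY = 6
FLOAT_RANGE_PER_CELL = (2.25, 2.30)  # typical guideline, treated as an assumption

for battery_voltage in (13.8, 13.2):
    per_cell = battery_voltage / CELLS_PER_12V_BATTERY
    low, high = FLOAT_RANGE_PER_CELL
    verdict = ("right at the top of the range" if per_cell >= high
               else "within the gentler part of the range" if per_cell >= low
               else "below the usual float range")
    print(f"{battery_voltage} V -> {per_cell:.2f} V/cell: {verdict}")

# 13.8 V -> 2.30 V/cell (top end; maximises runtime, runs the battery warm)
# 13.2 V -> 2.20 V/cell (easier on the battery, less headline runtime)
```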
First IT job, supporting a company with a Project office... they purchased a few copies of AutoCAD for new office layout design purposes.... but it came with a dongle. A USB dongle. We ran NT4 workstations.... sigh. As if getting their SCSI-attached scanner to work under NT4 wasn't challenging enough.... they had to introduce USB to the fun and games.
I can't remember how we fixed it.... either we found some 3rd party drivers/hacks. Or put them on the fast track for the Win2000 workstation upgrade project.
Had Emerson-provided remote DeltaV training recently. Their (cloud?) VM provider had some odd network outage that meant we could access the training VMs, but the VMs couldn't see the physical USB license keys. So we could build anything we wanted - but not test it. For 2 days out of a 4.5-day training.
I hate copy protection.
Big, expensive, and very old cake-cutting machine. The software ran on DOS, and this was 2012. The machine was on its own backed-up power supply, which was really hefty and could keep the PC running for a couple of hours (everything else was left to go off).
One day some complete numpty of a "manager" pulled out the plug to shove in his laptop charger. Cue instant silence as the PC goes off and the machine enters failsafe mode with a hard stop.
Cue other managers suddenly very annoyed.
When asked why he unplugged that thing, he claimed the plug didn't say not to. But being a manager, he had looked only at the plug, not at the red sockets (red, not white) or the big sign right above them saying that these sockets must not be touched for any reason; even the repair guys didn't have the authority.
So the manager makes a bad situation worse: he shrugs and plugs the PC back in.
"Missing operating system"
The hard disc was dead, the software existed only there (the system was supplied by the machine manufacturer, who ceased to exist over a decade ago), and then there was the broken hardware, because the machine wasn't designed to be stopped that suddenly except in a dire emergency.
The company got a replacement at some expense, and we had chaos for a while because all of the stuff that was supposed to have been made couldn't be.
And the manager? I think it was politely suggested that he might not want to use them as a reference in his search for alternative employment.
Can't help but wonder what eventually happened.
Go back to the client and say, 'Fair enough, here's the cost for a new storage array; let me check if the ESXi host has the right or enough adapters.... no it doesn't, so that's a shutdown to install the HBA.'
Or 'let's see if there are any empty drive slots in the host... yes there are.... does it support hot swapping? Alas no, so that's a shutdown to install the disks!'
You're running low on space on an ESXi host - that's not good: you absolutely have to increase the storage, but your own policies are preventing you from doing so! Rock, please meet Hard Place!
"he was told his client had a policy not to allow outages."
I would have to assume that meant "not to allow planned outages" but, as per the point King Canute was making (though later misrepresented), time and tide wait for no man, nor no outage neither.
Of course the client might have been a (minor) deity or, not uncommon in this game, at least thought of themselves as such.
"Of course the client might have been a (minor) deity or, not uncommon in this game, at least thought of themselves as such."
More likely the opposite, I'd have thought. Policies exist to further the business's objectives. If a policy is getting in the way of that it sounds as if the original policy maker has long gone and been replaced by a weight to keep the chair from escaping with no understanding of why the policy exists.
I've always thought that policies should be written with a rationale so they can not be over-ridden by idiots ("This is a legal requirement") or reviewed when no longer appropriate ("This was a legal requirement but the law has changed").
"...the original policy maker has long gone and been replaced by a weight to keep the chair from escaping..."
You've reminded me of the nickname an Italian colleague had for our "systems architect". I assumed from the way he pronounced it, that "ballast" was an Italian word with respectful connotations and being used ironically for our frankly incompetent architect. Nope, he eventually explained it was the English word, since the architect's only purpose was weighing his chair down.
"he was told his client had a policy not to allow outages."
Sounds to me like a really bad case of Honour Before Reason.
Wanting to keep the machines on? Good. Wanting to keep the machines on so much that no patches and updates can ever be installed? Mind numbingly Not Good.
EVER!
On second thoughts, there may be even more moronic management policies. I shudder at the thought, but never underestimate the ability of management to create clusterfucks of epic proportions.
Maybe there should be a Most Moronic Management Policy (M3P) award, to be awarded annually.
Many years ago, the QA idiot in a company for whom I was working decreed that all electronic components would henceforth be stored in antistatic packaging. All the sensitive stuff (ICs, semiconductors, etc.) already was. He pointed out additional stuff, so the stores folk duly sighed and complied.
We had a batch of dead PCB mounted batteries not too long after that.
The packaging mimics a Faraday cage, in that the packaging is conductive to allow charge to flow around the sensitive contents rather than jumping into and through them.
If bits of the packaging are touching bits of the circuit board then it is possible for batteries with exposed terminals to discharge themselves via the packaging.
"What's to stop a motherboard with a battery in it having the battery tracks / pins shorted?"
The little sticker or pull tab keeping the battery disconnected during shipping. Most motherboards can also survive quite a lot of shorting; good ones even give different error codes/LEDs for different parts shorting. A lot of people forget to use the standoffs when installing the motherboard and short it against the case.
Considering the number of people/service techs whose experience is replacing Dell/HP/Lenovo boards, where the standoffs are pressed into the chassis and align perfectly with the motherboard,
it's not surprising to find that people skip the standoffs supplied with a barebones case/motherboard, having never needed to use them before (though you'd think the metal backplate for the motherboard to slip into the ATX case void might be a clue something was wrong.....).
I think there was an On Call/Who, Me? forum posting on this very fact more years ago than I care to recall.
Proper antistatic bags are slightly conductive throughout. Any battery that happens to touch the bag in two places will be slowly discharged, possibly slowly enough not to matter.
The main problem is that batteries are conductive, and can short each other out if they're kept loose in any bag.
Yes. They are practically flat high-ohm resistors, but the actual resistance depends on the distance between the contact points. Since those CR2032 batteries, for example, have only a 0.25 mm or less gap between + and -, draining via the anti-static packaging sounds plausible.
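A very rough back-of-envelope sketch of that drain, where the cell capacity and the bag resistances are illustrative assumptions (real bags vary by orders of magnitude):

```python
# Rough estimate: how long would an antistatic bag take to flatten a CR2032
# if it bridged the + and - terminals? All figures are illustrative assumptions.
CELL_VOLTAGE = 3.0          # volts, nominal CR2032
CELL_CAPACITY_MAH = 225.0   # typical CR2032 capacity, assumption

# Bridge resistance between the terminals depends hugely on bag type and
# contact geometry; these are order-of-magnitude guesses only.
for bag, resistance_ohms in [("metallised (silver) bag", 1e5),
                             ("dissipative (pink) bag", 1e9)]:
    current_ma = CELL_VOLTAGE / resistance_ohms * 1000
    hours_to_flat = CELL_CAPACITY_MAH / current_ma
    print(f"{bag}: ~{current_ma * 1000:.3f} uA drain, "
          f"~{hours_to_flat / 24 / 365:.1f} years to flatten (ignoring self-discharge)")

# The point: a conductive silver bag shorting a coin cell can plausibly kill it
# within its shelf life, while a high-resistance pink bag mostly cannot.
```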
Edit: Your question is not dumb.
Sadly, there are stupid questions, at least when you aren't smart enough to let them go after repeated warnings.
Some 15 years ago, working for a no longer extant maker of enterprise software, I was conducting a training class for our customers. One of them had a question that amounted to "How can I get around the license fee to use your product?" and was very upset that I wouldn't give him instructions on how to steal from my employer. So upset, in fact, that after ten minutes of insisting that I HAD to answer his question, he stormed out after writing a minor novel on the class evaluation to accompany the minimum possible rating for me.
Ratings that required my boss to follow up with the student to determine what the source of the displeasure was. Which he was more than happy to reiterate and also to share that learning how to install without paying was "the only reason my boss sent me to class."
Six weeks, one unannounced onsite license audit (read your EULA, kids!), and a seven figure (USD) invoice for unlicensed software later, the student, his boss, and his boss's boss all found themselves dejobbed.
If you use the metallic silver antistatic bags, they're very conductive, and will drain batteries fairly well. If you use the pink plastic bags, those have lower conductivity. I don't have the resistance figures memorized, but a battery will at least last a lot longer in the pink bags than it will in the silver ones.
If you do a deep dive into the use cases the pink can be better in some areas since the higher resistivity causes static charges to bleed down more slowly, hopefully slow enough to limit damage. The silver bags OTOH, can be better at directing an external charge around the contents due to the higher conductivity.
Source: I have my ESD-compliant shoes on right now.
Fun story: I once had to condition ("re-form") some electrolytic capacitors. To do that, you run them up to their rated DC voltage (with a current-limiting resistor in line) and soak them for a while. I set up a Frankenstein rig of caps on my test bench, put a warning sign over the caps, and fired up the supply. All was well for a few minutes until I received an olfactory reminder of the electrical characteristics of the ESD mat on my bench (400VDC across a couple of inches of ESD foam makes an effective heater).
Anti-static and batteries...
You see anti-static is a bit conductive. Probably more than the standby current of a CMOS clock chip. So, yes, it will drain the battery. Yes, it will take a while, and Yes, it will reduce the life of the battery (much more than normal use).
No, you don't want to do it!
Working at a company hosting a reasonably significant online-only trader. Network design, servers etc all set up to be properly redundant for potentially zero application down time. Arse-covering know-nothing senior mangler refused to allow down time regardless, even for patches - didn't want to take the grief if anything actually went wrong.
Come the day when the DC had a massive power outage, taking down everything (I'd left the company by then, but was still in touch with good friends made there, so got the story). Once power was restored, a goodly proportion of the (Cisco) switches were basically bricked: an actual bug in the Cisco OS (who'da thought) caused gear that hadn't been restarted for 'n' days (actually years, if I recall) to very permanently retire itself.
Never did find out what happened to the mangler, hope he got dumped on from a great height but suspect he weaselled his way out of it.
There was also an issue with the RAM on Catalyst switches: if a module was up too long, it would die on reboot.
One example was when a 6513 was rebooted and 3 or 4 modules died; the third-party warranty service didn't believe it, thought it was the chassis, and would only send 1 or 2 modules, which both worked.
**
Another good story: all the ASAs were finally updated to a certain code level, and a few months later Cisco said there was a bug that would shut down the firewall after 213? days (and a few did before they were upgraded to patched code).
**
It is always something, with management ignoring so much of it. A very large hosting company was using (and still may be) a firewall that went EOL about 15 years ago (and they no longer partner with the vendor). Even most of the models in the replacement line of firewalls are EOL. They are afraid to convert it to a new firewall (because it is beyond their skill level), even though they had an engineer who did the exact same conversion for a few firewalls at their previous job.
Their logic in a nutshell: it is better to let it die with no rollback option (and no support on the dead EOL product) than to attempt a cutover to newer equipment, where you do have a rollback option. On hardware that will severely impact the company when it goes down.
They will have fun converting over 50,000 lines of code when it goes down.
"he weaselled his way out of it"
Once upon a time I was a member of a small IT team (5 strong) supporting a large number of engineers on a contract.
I'm doing my normal day-to-day work when a chance comment to one of the other team members led to me finding out that one of our systems had suffered a production database deletion, and the other members of the team (and many of the engineers) were working flat out to rebuild the database ...... and the months of data lost because 'the Oracle backups had been failing due to an Oracle bug that a fix had not been published for'!
I found it so strange that I hadn't been asked to help with the recovery. In fact I hadn't even been told there was a problem.
I deduced that this was actually deliberate as I was the one person in the team that would have stood up in meetings and pointed out the plot holes in the explanation that was given as to what went wrong, and why it wasn't the fault of the mangler that ran the team. "The Emperor has no clothes!"
I have a very short list of people who I will never work with again, and that particular mangler is top of the list!
We have a handful of Microsoft SQL servers that we reboot weekly, just so they can have a micro nap and feel all refreshed. Is that a good thing? Are there disadvantages?
Are there learned things in volatile RAM that will need relearning and slow performance, for instance? (Just guessing.)
We also recycle the SQL services periodically, which is basically the same thing.
Any thoughts anyone?
Doing a reboot when you can is in my mind A Good Thing. While I mostly use Linux boxes and they have less need of rebooting for patches, I have been caught out before by a boot loader patch that borked booting but in itself had no need to reboot. Only discovered when an unplanned late night reboot occurred, doh!
After that I try to reboot after significant patching even if not called for, assuming there is not any real impact from doing so.
The other "gotcha!" is application software that has been changed and fails to properly start on boot. It may have SFA to do with the OS patching, but again a planned reboot while to responsible software person is to hand is a good policy so you have a server that is kept in "automatically recovers" mode. Because unplanned reboots happen. Due to power issues, gross administrative error, system lock-ups triggering a watchdog, etc, etc.
I've noticed problems twice with Firefox on Linux when I've downloaded a new version and not restarted Firefox. Evidently the code ends up running partially on the old version and partially on the new one. Once it affected the screen saver and once it crashed Firefox. Restarting Firefox solved these problems.
"Any thoughts anyone?"
You have a memory leak in an application / routine. It's not normal to keep rebooting servers.
I believe it is Windows, so fire up Process Explorer from Microsoft and take one snapshot per day to see which faulty process / routine is eating your RAM.
After a week, compare the snapshots.
Next, go to the faulty process and watch the handles being used; likely you will find a process / routine opening / writing to a file and not closing the handle.
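If you'd rather script the snapshots than eyeball Process Explorer, a rough sketch along these lines would do it (assumes the third-party psutil package and a Windows box; run it daily from a scheduled task and diff the output):

```python
# Log per-process memory usage so day-to-day snapshots can be compared.
# Rough sketch only: assumes `pip install psutil` and a daily scheduled task.
import csv
import datetime
import psutil

snapshot_file = f"memory_snapshot_{datetime.date.today()}.csv"

with open(snapshot_file, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["pid", "name", "working_set_mb", "num_handles"])
    for proc in psutil.process_iter(attrs=["pid", "name", "memory_info"]):
        try:
            info = proc.info
            handles = proc.num_handles()  # Windows-only counter
            writer.writerow([info["pid"], info["name"],
                             round(info["memory_info"].rss / 2**20, 1),
                             handles])
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue  # processes can vanish or be off-limits mid-scan

print(f"Wrote {snapshot_file}; compare files from different days to spot growth.")
```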
I don't believe you can do the memory-usage logging thing with SQL Server, as the process will take the maximum amount of memory assigned to it and then use it in the most optimal way to speed up the queries run against the database.
When you restart the server it effectively flushes the data that has been cached as previous queries were run. So immediately after the reboot SQL Server will (on average) run a bit slower until it builds up a cache of the most used data again.
I believe there are ways to check how SQL Server is using its reserved memory, but you would have to ask a production DBA about that. I'm merely a development DBA.
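For what it's worth, the buffer pool can be inspected with the standard DMVs; a minimal sketch in Python (pyodbc and the specific ODBC driver name are assumptions about the environment):

```python
# Peek at how much of SQL Server's memory is sitting in the buffer pool,
# per database. Sketch only: assumes pyodbc and a trusted local connection.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;Trusted_Connection=yes;"
)

query = """
SELECT DB_NAME(database_id) AS database_name,
       COUNT(*) * 8 / 1024   AS buffer_pool_mb   -- pages are 8 KB each
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY buffer_pool_mb DESC;
"""

for db_name, cached_mb in conn.execute(query):
    print(f"{db_name or 'system/resource db':30} {cached_mb:8} MB cached")
```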
All our servers are rebooted monthly, including SQL. Most of them are just on a scheduled task for the early hours of a Sunday morning but the SQL and Web servers are scheduled manually each month as they have to be done in the correct order and we try to avoid doing them if there's a possibility of significant user activity, just in case.
Yes, that's generally a good policy. The time to find out whether a system restarts is not when you are trying to restart everything after a power outage.
Yes, there are likely caches in RAM which need to be rebuilt and will increase latency for a while.
Some years back I had an angry customer (investment bank) to deal with. One of our servers had crashed and taken down a trading application - despite being in a failover cluster. What the bank then discovered was that they didn't know how to restart the service - about 20 applications on a similar number of servers. The services had to be restarted in a specific order - which they didn't know since everything had been running for several years uninterrupted.
This is when I also learnt the lesson that the more somebody shouts at you, the more likely that it's their fault.
I once worked with a company that installed a Vax cluster for redundancy at a client site. Years later it failed.
When they were called in to explain, it turned out one of the machines had failed a year previously and nobody noticed, and more importantly, nobody fixed it. The second failure took out the lot.
The lesson being to actually monitor stuff!
About fifteen years ago I worked on a warehouse management system, and for one client had to write a data export to the client's SQL Server instance. There was a long period when said instance was rebooted every night because of a bug in SQL Server that resulted in a memory leak.
I have an example of why such reboots are good: use a search engine on "RpcSs service memory usage", specifically "svchost.exe -k rpcss -p". Usually the thing runs at around 7 MB to 25 MB, maybe 100 MB on a very busy, communicative system. But there are a lot of people with a "needs 1 GB" / "8 GB" / "all my 128 GB" RAM problem related exactly to that service. And it is not the service that is buggy; it is software calling the service, reserving the resources, but never calling the .end(), .stop() or whatever. So the allocation stays, and the software keeps on calling for resources. Intel GFX drivers are my most recent example of that nonsense: I stopped all the Intel services and the problem was gone. The nice Intel-GFX-config tool didn't work any more, but who cares about that.
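The pattern described there is the classic acquire-without-release leak. A toy illustration in Python, nothing to do with RpcSs itself, just the shape of the bug:

```python
# Toy version of the acquire-without-release bug described above.
# The "service" dutifully holds whatever callers reserve; the leak is the caller's.
class Service:
    def __init__(self):
        self.reservations = []

    def reserve(self, size_mb):
        self.reservations.append(size_mb)
        return len(self.reservations) - 1  # handle

    def release(self, handle):
        self.reservations[handle] = 0

    def held_mb(self):
        return sum(self.reservations)

svc = Service()

# Buggy client: reserves on every iteration, never releases.
for _ in range(1_000):
    svc.reserve(1)

# Tidy client: pairs every reserve with a release.
for _ in range(1_000):
    handle = svc.reserve(1)
    svc.release(handle)

print(f"service is left holding {svc.held_mb()} MB")  # 1000 MB, all from the buggy client
```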
Other examples, from the other direction: I came across a Server 2012 R2 Hyper-V cluster with more than 1250 days of uptime, and nobody had noticed; it just worked. The hosts ran from 2016 to 2020 non-stop, and had over 100 Windows updates waiting in the pipeline, since those updates were needed before the "big cumulative update package" would install. Before that cluster, the record was over 800 days of uptime on a Hyper-V host.
We had a customer visit our development labs, and they said their system had been up for 5 years without going down.
We all clapped... and he said that's not the good news. In this time we have moved machine rooms twice, and upgraded all the products at least once - and still had zero down time.
This was running with CICS and DB2 on z/OS
They had multiple systems, on different boxes etc, and could transparently move the work from one system to the next.
"You could install security patches and upgrade an OS if you wanted to, but you could not reboot,"
This is a depressing reality when you have a C-suite with their heads so far up their arses they've come back up again for fresh air. The same buggers will bitch, whine, and moan when a server crashes (or eventually does get a reboot) and then fails to come up because of bad patches or bad practice screwing up the system to the point it won't reboot. Cue hours-long meetings trying to work out who did what and when, so we can attempt to work out how to fix the mess.
I have lost count of the times I've encountered this. It's worse with the attitude most of them have now "well it's in the Cloud and that's always up" - forgetting that we've (in most cases) just had to move the same crap from on-prem to Cloud....
Had a guy in our building from another business unit running three unpatched Windows 2000 boxes.
IT had been going back and forth with the guy and his entire chain of command for months. He claimed it was too important to have any downtime, his managers believed him, yadda, yadda. Finally, at about month six, we get a clear policy edict from someone important enough everyone had to listen; He is to patch or he is to be unplugged from the network.
So that's what the guy did. He installed the current service pack and then reported that he had patched. But he didn't reboot. We could even see the damned dialogs through the glass wall of his office.
On one hand, my boss was pissed. On the other, well, not his circus, not his monkey. He'd just go through the motions again and, in couple months, the guy would have a second edict to actually complete the patch or be cut off from the network.
Enter Tim. Tim took the dialog personally, like the guy was bragging he'd outsmarted us, and he would not let that stand. He spent days brainstorming how he could get into that office and click "Reboot" without being caught by the badge readers or the cameras. He thought about bribing janitorial staff, about going over the false ceiling, about manufacturing a VCR failure and popping the door, etc.
And then a memo goes around. Facilities is going to be checking lighting ballasts over the weekend, as they've had a few fail and they want to get ahead of things. Make sure to shut everything down when you leave, be aware you might not have lighting if you're working overtime, etc.
This gives Tim the perfect idea. He wanders over to facilities, says he saw the memo, and asks the manager if, since he's going to be shutting off breakers anyway, could he tag along and make sure all of them were labelled properly? That sure would make things a lot easier for both of them the next time they had to do a cubicle move or confiscate a space heater.
It only took ten minutes of Tim's Saturday for the UPS keeping those Win2K boxes on to give up the ghost and for Tim to head home a winner.
OK, yes I get your point, and big shout-out to 'Tim'. But would it not have been easier to do what the 'Big Boss' had mandated and disconnect those machines from the network?
Downloading the patches but not applying them by rebooting, surely means that they weren't applied and hence the mandate to cut them off should take effect
Or have I missed something?
"Or have I missed something?"
Yes, you have. The guy followed the *letter* of the edict from on high. He downloaded and installed the patches but didn't reboot because the edict didn't specify that. He did exactly and only what the edict specified. A reasonable person might assume that applying patches that required a reboot, then a reboot would be performed. But that's only in the "spirit" of the edict, not the letter ;-)
Well, that is the "I know it all better" answer.
But we know the reality too: hosts not correctly connected for such an HA scenario, not the right license for such an HA scenario, host versions too different, host uptime too long so it won't work, not enough local storage to even activate HA (oh wait, wasn't there an On Call story about that recently?), not enough free LAN connections, or only 100 Mbit connections free (etc; Friday complain mode).
The glory years we had with HDDs inside the machinery:
"Need to replace the HDD, it's showing signs of failure." "Nope... need the machinery up and running."
"Look, there are more errors reported on the HDD, it's only a matter of time before it goes kaputt, it will take the engineer 1 hr to swap it out." "Nope, need the machinery running."
"We need to replace the HDD now! Got bad sectors everywhere, 1 hr with the tech to fix.. got his number here." "Nope, need the machinery running."
"HDD is kaputt.. machine is dead" "Call the engineer... must have it running"
"Hello.... this is the engineer.... I'm booked up for the next 10 days... please call back later"
Boss
"Get it running now!" "Get it running now!" "Get it running now!" "Get it running now!" "Get it running now!" "Get it running now!" "Get it running now!" "Get it running now!" "Get it running now!" "Get it running now!" "Get it running now!" "Get it running now!" "Get it running now!" .............. ad infinitum
I hate my job
I do hope that you made all of the requests to replace the hardware in writing; email is perfectly admissible in court, just in case push should come to shove!
I have been in a similar situation, where my warnings to a client about what was almost certainly going to happen fell on 'deaf ears' or got no response. Until the wheels fell off the wagon, and suddenly it's all 'my fault' and they are going to sue etc....
Being able to present them with 'well, here's an email from xyz date warning you of this and the consequences, oh, and here's your reply saying you aren't prepared to pay for it', plus 'oh, here's another email warning you of..... see you in Court!' Amazing how having the evidence to hand deflates them.
I have even preempted similar situations by quietly mentioning to managers that 'I do make sure that things are documented, I am keeping copies of emails. When, not if, it all goes 'nipples north' - please don't even think about playing the legal card - because I absolutely will cut you off at the fucking knees'. Naturally, said with a smile!
Even if the lesson is: "underlings might store emails and messages documenting my stupidity, so it's not a good idea to try to pin the failure on them"
Which is why I too would always recommend keeping a paper trail. Always send those: "here's what we just discussed" emails after verbal agreements. It avoids confusion, it avoids misunderstandings and most of all it diverts the river of shit running downhill when things inevitably go pear shaped
If the boss doesn't want the machinery stopped to change an HDD, then it's down to him.
Yes indeed. But then many of them will develop selective amnesia: remember that they told you to do something, but forget that they actively prevented you from doing it. Since you were told to do it but didn't, it's all your fault. That's when you need to be able to provide evidence that you informed them of what needed doing and that they refused to allow you to do it. If you can't prove that, then it's still the boss's fault, but you are officially to blame.
Been there, done that, had the CIO copied on the "This is what I will need to do to action this dumb-ass idea of yours that doesn't actually need to happen, certainly NOT on a holiday when the floor is a mad house, I will wait for your explicit written command to proceed, starting with the machines that run the floor."
Had a reply from the CIO not even ten minutes later asking if I was really going to do that, because he knew that there was no way on this planet or any other that I'd actually go through with it. He also stated to hold off and that he'd talk with boss about Why We Don't Do Changes On The Fly In The Middle Of Busy Times. ::feral grin::
We consolidated our different divisional servers into a common corporate-run data center in the early 2000s. We had one set of servers that we realized didn't have virus scan installed. We approached the owners about an outage to install virus scan and reboot the servers. They said absolutely not, the servers were too important to have virus scan on them; they had had virus scan installed previously and it slowed the systems down too much. The servers ran critical manufacturing systems and reported build rates to the higher-ups in that division. They were about 8 years old and long past their sell-by date. Lots of back and forth, and they finally allowed it after it went all the way up to our CIO and he said either virus scan went on or the servers would be removed from the network. We then had tickets every week when the full scan kicked off. They finally started planning a refresh and got modern equipment, 2 years later.
IT in the sense of "it uses electricity so it's IT".... I was contracted to electrically fit out a small office extension that had been created from converting a garage. Sockets, lights, etc. Did so, a small ring of sockets ready to patch into the existing system. "When can I turn the power off to wire in the extension? I need about an hour." Never, nobody's allowed to turn the power off, it must be on continuously.
"Ok, enjoy your new office."
That sort of reminds me of my most recent job.
The company was migrating from some ERP I'd never heard of (Aptean), where everyone had admin rights and the controls on the data were... lax, to be charitable. I was hired as part of a company-wide shift to SAP, and was told that they wanted to bring some discipline to their data operations. All well and good, except that less than two weeks past go-live the wheels were already coming off. People decided it was just too hard to do things like decide which plant and storage location a material should be in, so they were just going to extend every material to every plant, sales org, and sloc. Any time some group ran into an issue where SAP actually demanded they stop and think about things -- you know, its whole raison d'être -- they would just whine to upper management, who'd overrule my efforts to do what I was hired to do. I ended up leaving after only a couple of months because I'm not interested in being blamed for the shitty state of the data when I'm not allowed to actually do anything about it.
The database is usually not backed up as a file or set of files. On most common database systems, the backup will back up the database itself, effectively read-only, so as not to block access. Then any transactions (data changes) made after the backup was started are backed up, usually in a separate operation.
The database itself does the backup job. Simplified: a dump at a specific point in time to some other storage. Better: constant transaction logging, backing up the database plus the transaction logs as far as possible, so that once the backup says "OK, backup X done", those logs can be played back into the restored database. Next level: constant sync between several servers, where each of them pauses for the backup, then re-syncs and continues after it is finished.
Even more is possible, but then it gets A LOT more expensive if you need more than the above.
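A toy sketch of the "base backup plus log replay" idea described above, purely illustrative and not tied to any particular database engine:

```python
# Toy model of "full backup + transaction log replay" point-in-time recovery.
# Purely illustrative; real engines do this with WAL/redo logs, not dicts.
import copy

database = {}         # the "live" database
transaction_log = []  # append-only log of (sequence_no, key, value)

def apply(db, key, value, seq):
    db[key] = value
    transaction_log.append((seq, key, value))

# Some activity, then a full backup, then more activity.
apply(database, "orders", 10, seq=1)
apply(database, "orders", 11, seq=2)

full_backup = copy.deepcopy(database)   # "backup X done" at sequence 2
backup_seq = 2

apply(database, "orders", 12, seq=3)
apply(database, "invoices", 5, seq=4)

# Disaster strikes; restore = base backup + replay of logs taken after it.
def restore(to_seq):
    restored = copy.deepcopy(full_backup)
    for seq, key, value in transaction_log:
        if backup_seq < seq <= to_seq:
            restored[key] = value
    return restored

print(restore(to_seq=3))  # {'orders': 12}                 -- point in time after seq 3
print(restore(to_seq=4))  # {'orders': 12, 'invoices': 5}  -- fully up to date
```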
I was at a client recently where we needed to restart a service, as it was causing a problem, and was told that it could not be rebooted, blah blah.
I did ask what their policies were re patching and so forth as, as pointed out here, many patches do not take effect until a reboot, and many patches won't even entertain being installed if a pre-req or dependency is not there. Worrying, given the nature of the client.
At one point in my life, I was responsible for patching a lot of kit, we had our pilot, test and production roll outs, all had a weekend reboot cycle. Servers were rebooted, checked to make sure all flags and markers indicated that no more reboots were required - if there were still flags, the server was rebooted again until clear. We then manually ran a patch check that afternoon and ran the whole process again before leaving. Patching ran overnight by default and on the Sunday we would repeat - though hardly any reboots needed at that point.
The fun part was clusters, but we implemented code to switch over onto the other nodes and rebalance, to avoid any downtime.
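For the "flags and markers" check mentioned a couple of paragraphs up, the usual Windows signals live in the registry. A rough sketch of that kind of check in Python (the key names below are the commonly used ones, but treat the exact list as an assumption and extend it for your own estate):

```python
# Rough check for the usual "reboot still required" markers on a Windows box.
# The key list below is the commonly used set, not an exhaustive or official one.
import winreg

PENDING_REBOOT_KEYS = [
    r"SOFTWARE\Microsoft\Windows\CurrentVersion\Component Based Servicing\RebootPending",
    r"SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Auto Update\RebootRequired",
]

def key_exists(path):
    try:
        winreg.CloseKey(winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path))
        return True
    except FileNotFoundError:
        return False

def pending_file_renames():
    # A non-empty PendingFileRenameOperations value also means a reboot is outstanding.
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,
                            r"SYSTEM\CurrentControlSet\Control\Session Manager") as key:
            value, _ = winreg.QueryValueEx(key, "PendingFileRenameOperations")
            return bool(value)
    except FileNotFoundError:
        return False

reboot_needed = any(key_exists(k) for k in PENDING_REBOOT_KEYS) or pending_file_renames()
print("Reboot required" if reboot_needed else "No reboot flags found")
```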