Eventually it transpired that a disk in one of the servers had failed the day previously.
So the disc copped it?
The long arm of the law is unexpectedly severed by the antics of Microsoft Exchange this week as another reader explains why some of Her Majesty's finest were once bereft of festive email. Welcome to Who, Me? Today's story, from a reader Regomized as "Sam", takes us back to the mid 2000s and happier days before Brexit, Trump …
"A backup is not a backup until it is tested/restored" is a very solid principle to live by, but:
And between polls it must be assumed to be in a degraded state.
Really? How far are you going to take this? How often do you poll? What's the first thing you do when you suspect a degraded state? Poll? You must be sat at your desk all day cycling between:
"The RAIDS broke"
"Oh wait , its ok "
"The RAIDS broke"
"Oh wait , its ok "
"The RAIDS broke"
"Oh wait , its ok "
"The RAIDS broke"
"Oh wait , its ok "
You choose your polling interval according to the length of time you are willing to live with the array in a degraded state - I'd suggest that 12 hours is too long for critical production systems, but it might be tolerated on a low-usage home setup.
You only ever *know* that the RAID was working at the previous poll (active alerting excepted, though again technically you only *know* at the timestamp of the last event, which we hope was informational).
The assumption that it's degraded isn't knowing that it is and reacting; it's the baseline which tells you how long you are OK to run in a degraded state before you jump on it.
If the answer is very small then you should be running with a warm spare drive (or several) available anyway.
If the answer is long enough for people to move around and do things then you don't need to check more often than that; the assumption that it is degraded is what defines your polling cycle.
Each polling cycle (or trap/event) either resets that counter or puts you into a "this drive has failed" state, which does require action on some timescale.
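For what it's worth, the "your tolerance for running degraded defines the polling interval" idea is only a few lines of code. A minimal sketch, assuming Linux software RAID and the /proc/mdstat status format; the interval and the print-an-alert bit are placeholders for whatever tolerance and alerting you actually have:

```python
import re
import time

POLL_INTERVAL = 15 * 60  # seconds you're prepared to run degraded without knowing

def degraded_arrays(mdstat="/proc/mdstat"):
    """Return md arrays reporting fewer working members than configured,
    e.g. a status of [4/3] [UU_U] rather than [4/4] [UUUU]."""
    with open(mdstat) as f:
        text = f.read()
    bad = []
    for name, total, working in re.findall(
            r"^(md\d+)\s*:.*?\[(\d+)/(\d+)\]", text, re.MULTILINE | re.DOTALL):
        if int(working) < int(total):
            bad.append(name)
    return bad

while True:
    failed = degraded_arrays()
    if failed:
        # Swap this for email/pager/whatever - this is the "jump on it" moment.
        print("ALERT: degraded arrays:", ", ".join(failed))
    time.sleep(POLL_INTERVAL)
```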
I was once responsible for a NAS which contained several VMs - including some which were high profile to the project. The NAS was old and had been shifted several times before it came to me. We had had a disk fail and I suggested - in writing! - that all the disks should be replaced as they were now suspect. There was, inevitably, no response.
Isn't it amazing how the MTBF can be calculated so precisely that the failure of a second disk can occur within the time taken to purchase a replacement for the first!
Of course there were no backups - it had RAID.
When I decided to purchase a Synology DS414j, I said to myself that I had to stage the purchase of disks to avoid that problem.
As a result, the first month I bought a Seagate and a Winchester 3TB disk. The second month I bought another Winchester 3TB disk, and the third month I bought the final HDD.
When I decided to replace them with 8TB disks, I did the same, staging the procurement process through three months.
I'm hoping that will spare me the kind of problem outlined in this article.
"Winchester"? Do you mean WD?
Anyway, what I have often done at home is to purchase from different suppliers. It's unlikely CPC and eBuyer (just as examples) will have discs from the same batch, and it means that other than a couple of days difference in the delivery times, I get everything at the same time. I also use RAID6 (or equivalent) so the system can survive two failed discs, "just in case". The main downside is the reduction in the amount of storage available.
M.
It will not.
There's been an enormous amount of bunk spoken about RAID, including the "RAID isn't RAID if..." tripe. I mean, a lifeboat is a lifeboat even if there's a hole in the bottom! The key point is what the technology is good for, etc...
The reason spacing out your disk purchases won't avoid the problem is that the second (or third, if you're using a RAID-6 setup with dual parity) failure event is not independent of the first. When the first drive failed, the workload on the remaining drives increased significantly, raising the probability of wear-related failures (and also the exposure to low-probability drive/firmware events that trigger a "drive failed" response).
So unless you know the real failure characteristics (mean and standard deviation of various types of failure by lot number), which you don't (mostly because no-one does, because it's not valuable information to the drive vendors and their manufacturing plants and subsystem suppliers), you have no idea whether you've expanded or contracted the event surface. For example, suppose you have two drive vendors, A and B, plus two lots of each drive, and somehow you know that A's drives become generally more prone to failure after 10,000 operational hours, while B's are good to 10,200. Suppose one of your B drives is from a slightly sub-par lot, and fails a little early, at 10,100. All your "A" drives are outside their comfort zone, and the increased workload accelerates their departure, and unhappiness may be in your future! By contrast, if you have all the same lot/brand, the first failure is a much more reliable harbinger of doom!
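To put a rough number on that non-independence, here's a back-of-envelope sketch in Python. Every figure in it is an assumption for illustration (a notional per-drive failure rate, rebuild window and a "stress" multiplier for the extra workload), not anything measured:

```python
def p_second_failure(drives_left, per_hour_rate, rebuild_hours, stress=1.0):
    """Chance that at least one surviving drive also fails during the rebuild
    window, treating each drive-hour as an independent coin flip."""
    p = per_hour_rate * stress
    return 1 - (1 - p) ** (drives_left * rebuild_hours)

base_rate = 1 / 1_000_000   # assumed ~1M-hour MTBF per drive
print(p_second_failure(8, base_rate, 24))            # array left alone
print(p_second_failure(8, base_rate, 24, stress=5))  # hammered during rebuild
```

The absolute numbers are made up; the point is the multiplier, and that's before any lot-correlated wear-out, which is what the paragraph above is really about.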
True story: back in the 90's, I was part of the engineering team designing and building a (well, the) major storage subsystem's upcoming "open systems" RAID solution. We were working through the system-level testing, including what our team leader ("Tom", because it was his name) called the "yank test" (grab drive, yank... does the system keep trucking along?)
Anyway, we were working through some issues, when marketing heard that problems existed, and wandered by to see if the issues were going to impact the reliability figures they'd decided to feature in the collateral and at the upcoming launch event. This led to an extensive and heated "discussion" as to whether "hot plug" is a reliability feature ("sexy!!!") or a maintainability one ("yawn"). Obviously, it's the latter, but I was not winning the argument...
In desperation, I turned to Tom, who had been the guy in charge of the company's IT department prior to his current position, and asked him how he, as the owner of the machine room, would handle a drive failure? His response was that the repair people (from the same company, mind) weren't going to get to TOUCH the machine until, like, Saturday at midnight and then only if he didn't have important work still running, no matter how good the alleged hot plugging feature allegedly was!
Moral of the above story and many others: when something fails, Step 1 is get a backup. Step 2 is check your backup. Steps 3 and 4 might be a repeat of steps 1 and 2... and only THEN do you do something about the failure!
(I lost the argument, and the company trumpeted a mean time before data loss of something longer than the time to the heat death of the universe or something like that, because marketing...)
[ Incidentally, another reason people like to try to get drives from different lots is the idea that a firmware update might kill a drive, and if all the drives are the same, the update will kill all of them. This only works if (a) you only ever have one drive of a given model, and (b) you update the firmware in the RAID chassis at all -- if you have a fetish for that sort of thing, pull the drive, update it, replace it, rinse and repeat! ]
"Incidentally, another reason people like to try to get drives from different lots is the idea that a firmware update might kill a drive"
Doesn't even have to be a firmware update that kills a drive, given that one reason to release firmware updates is to fix bugs (sometimes critical) in the firmware you already have on your drive...
On the wider point, I guess one reason people like the idea of populating their arrays with drives from different batches/manufacturers is that, whilst the points you make about general failures are entirely valid, what you don't seem to have touched on here are those completely out of band failures that happen every once in a while, and cause a particular model of drive to suffer significant reliability problems completely out of keeping with anything else at the time. Granted, these don't seem to occur as often as they used to, but maybe for those of us who still bear the scars of having lived through such times (I was the "lucky" owner of both an IBM Deathstar and a Seagate 7200.11 with firmware SD1A...) there's some inherent reluctance to trust in statistical modelling etc. when our gut instinct is screaming at us to never ever trust all our data to the same manufacturer/model/batch/whatever, just in case we end up being part of the next big drive reliability scandal...
That being said, I appreciate your insider perspective here - the more good quality information we have about likely causes of array failures, the easier it becomes to push our (irrational) fears to one side when deciding how to set up an array in future. Coincidentally, I'm in the process of speccing a new RAID setup for home, and was umming and aahing over whether to populate it with a set of "identical" drives, or go to the extra trouble of mixing and matching manufacturers, suppliers etc, and I think your comments here have just made my life a bit easier.
When I first had my DS415play I originally used RAID5 as most people do....later I changed that to 2 pairs of mirrored disks.
Why?
Well if one disk fails, both options are the same. However if a 2nd disk fails before the first is fixed then the RAID5 is 100% toast, whilst the mirrored option only has a 50% chance of toasting your data and needing a full rebuild and restore.
I could actually lose 3 disks and only half my stuff.....
The problem with a disk failure in RAID is that when it starts a rebuild, all of the other disks get hammered as it rebuilds copies/parity, and that can provoke another disk to give up the ghost.
The second and less obvious thing is that if the RAID does not do periodic scrubs, then the rebuild is when you discover unreadable clusters on the disk, resulting in data corruption somewhere.
Of course if you have double parity (e.g. RAID-6 or RAID-Z2) then you can cope with a double disk failure, and if you have certain file systems, with ZFS being the poster child, it tells you which files are corrupted, not just some sector number(s) that you then have to use some low-level tool to map to a file allocation. Been there, bought the T-shirt :(
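The "do scrubs before you need the redundancy" part is trivially automatable, too. A minimal sketch, assuming Linux md (which exposes a sync_action control file) and/or ZFS; the array and pool names are placeholders, and you'd normally run this from cron or a systemd timer rather than by hand:

```python
import subprocess
from pathlib import Path

def scrub_md(array="md0"):
    """Kick off a consistency check on a Linux md array (needs root)."""
    Path(f"/sys/block/{array}/md/sync_action").write_text("check")

def scrub_zfs(pool="tank"):
    """Start a ZFS scrub; `zpool status -v` will later name any corrupt files."""
    subprocess.run(["zpool", "scrub", pool], check=True)
```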
In a previous role a while ago we had a large number of Exchange 2003 servers in stretched clusters (active/passive) with the Exchange data on EMC SANs. Each node of each cluster had 4 x 80GB disks, configured with 2 disks mirrored as the C: drive and 2 mirrored as the D: drive.
Worked great over time when the odd drive failed as the relevant cluster node carried on working and popped out an email so we could get an engineer to swap the failed disk.
Wasn't so great when the RAID controller board in one of the cluster nodes popped and blew all 4 disks. The business decided it was easier (cheaper) to evict the failed node from the cluster and support the remaining node on a 'best endeavours' basis.
Similar issue we found in managing some Exchange servers where HPE's preferred architecture had been followed.
Use single SAS disks per group of replicated databases but make each one a logical array within the Smart Array (SA).
Ok, fine. Whatever.
Problem being that, for some reason, if Windows starts showing errors on the disk, i.e. unable to read a file, bad sector etc., it does NOT get flagged up by the Smart Array as a predictive failure or failed disk.
So if your Wintel guys aren't specifically monitoring for the correct NTFS event log IDs you'll be blissfully unaware of the issue. There is a Smart Array event generated saying it couldn't provide the file to Windows, but the SA doesn't class it as anything it cares about, so does nothing.
Ever since the IDE days, seeing a bad sector on a disk has meant there have already been multiple failures and spare good sectors have been mapped in until they've run out. Replace the disk immediately if you see that.
RAID 5 is pretty much redundant these days, see what I did there! With the size of arrays these days the rebuild time can get very long, increasing the chances of another disk in the array failing whilst they're all being thrashed doing the rebuild. Better off with RAID6 or, if you have the cash, RAID10, as the rebuild times are quicker.
RAID10 is almost certainly cheaper *for the same write performance*. If you're using equivalent disks, with RAID6 you need 3x the number of disks as with RAID10 to get the same write performance: with RAID6, each logical write hits 3 disks (data plus two parity blocks), and the old data and parity have to be read first so the parity can be recalculated, so that's 6 physical IOPS per logical write; with RAID10, you just write the same data to two disks, no reads required, so only 2 physical IOPS per logical write.
Sure, with RAID6 you can use smaller (cheaper) disks as your capacity penalty for RAID 6 is lower; but the more disks you need, the greater the likelihood you need additional chassis, additional networking/interconnects, additional licensing (depending on SAN manufacturer), additional rackspace, additional power etc...
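A quick back-of-envelope of where that 3x comes from, with an assumed per-disk IOPS figure (a typical nearline 7.2k spindle, not a vendor number) and a hypothetical write workload:

```python
import math

DISK_IOPS = 75  # assumed random IOPS per nearline spindle

def disks_needed(logical_write_iops, physical_ios_per_write):
    return math.ceil(logical_write_iops * physical_ios_per_write / DISK_IOPS)

target = 2_000  # hypothetical logical write IOPS requirement
print("RAID10:", disks_needed(target, 2))  # mirror: 2 writes per logical write
print("RAID6 :", disks_needed(target, 6))  # read data+P+Q, write data+P+Q
```

The capacity overheads and the chassis/power/licensing knock-ons then decide whether the 3x IOPS penalty or the extra mirror capacity actually costs you more.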
RAID6 has a non-trivial (2% or more) chance of data corruption on array rebuilds over 5TB, and that's before factoring in the increased chance of drives falling over due to the increased head thrash (assuming an unrecoverable read error rate of roughly 1 per 10^15 bits - the standard figure for spinning media)
i.e. if you're relying on RAID6 to keep large arrays error free, you're already in trouble. Ditto RAID10 - and at these sizes the errors have a statistically significant chance of being SILENT
ZFS assumes drives are crap and proceeds accordingly. It's a lot easier to do this than to gold plate turds. A 1-in-10^15 error rate simply isn't good enough at very large storage scales
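You can sanity-check the rebuild-corruption claim in a couple of lines. The unrecoverable read error (URE) rate below is the usual order-of-magnitude datasheet figure for spinning media (about 1 error per 10^15 bits for enterprise drives, 10^14 for consumer ones), not anything measured from this story:

```python
import math

def p_rebuild_hits_ure(terabytes_read, ure_per_bit=1e-15):
    """Chance of at least one URE while reading the whole array back."""
    bits = terabytes_read * 1e12 * 8
    return -math.expm1(bits * math.log1p(-ure_per_bit))  # 1 - (1-p)^bits, stably

for tb in (5, 20, 100):
    print(f"{tb:>4} TB read during rebuild: {p_rebuild_hits_ure(tb):.1%}")
```

That's a few percent at 5TB and climbs quickly from there, which is why checksumming filesystems like ZFS treat read errors as expected rather than exceptional.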
Reminds me of a Halloween costume party... One guy came holding field glasses, with a toilet seat around his neck! I asked him WTF are you... he replied "I'm a bad accident looking for a place to happen!" Simply awesome!
yeah, that second disk failure during a rebuild... my sphincter always puckered and un-puckered during a raid-rebuild
Ideally you should watch your SMART data and replace drives long before they run out of sectors or hours, but manglement never think about it that way. It's often better to talk to accountants about "preventative maintenance" and "cost of not acting"
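If it helps make the "preventative maintenance" case, the watching itself is cheap. A minimal sketch that shells out to smartctl (from smartmontools) and flags a few of the attributes that tend to precede failures; the device path, attribute list and thresholds are illustrative, and the parsing assumes the common ATA attribute table layout:

```python
import subprocess

# Illustrative watch-list and thresholds, not vendor guidance.
WATCH = {"Reallocated_Sector_Ct": 1,
         "Current_Pending_Sector": 1,
         "Offline_Uncorrectable": 1}

def smart_warnings(device="/dev/sda"):
    """Return (attribute, raw value) pairs whose raw count has crossed the threshold."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    hits = []
    for line in out.splitlines():
        parts = line.split()
        if len(parts) >= 10 and parts[1] in WATCH:
            try:
                raw = int(parts[9])  # raw value is the last column
            except ValueError:
                continue
            if raw >= WATCH[parts[1]]:
                hits.append((parts[1], raw))
    return hits

print(smart_warnings())
```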
My rule of thumb for a server with a failed RAID disk when I was hands on - make sure you have a reliable backup* before you do anything else. And if it's Exchange, where absolutely possible, stop all of the MS Exchange Services and do an offline backup. Believe me you will thank me later for those few hours of downtime that you are currently cursing.
*I appreciate it isn't always possible to prove the integrity (or even the overall usefulness) of said backups before commencing work but at least take one before you start and do everything you can to verify it, however little that might actually be - but being able to stand in front of the big cheese and demonstrate that you did everything possible to ensure you covered all bases can be a career saver.
But it's this moment when you find out that the remaining power supply can't actually cope with all the load, and fails hard.
Had this on a blade enclosure that had four supposedly-redundant PSUs. I pulled the power out of one PSU to move it to a different UPS, the enclosure tried to shunt all the load to one of the other PSUs, which promptly failed. It claimed to be 1200W, but only worked ok up until about 300W...
Indeed.
A few years ago I was looking after/inherited a datacentre "thing" for a very large, and very high profile customer. A task was to upgrade all of the firewalls to a new version of software for vulnerabilities and feature reasons. I warned that the age of the firewalls, despite them being under support from the vendor, was such that the capacitors in the PSUs may have dried out and there was risk that they may not reboot after the upgrades. Naturally I got the whole "yeah sure, what would you know" etc response.
I put all of my warnings and recommendations in writing and made sure EVERYONE saw them prior to the upgrade. All of these firewalls had two PSUs, and at least (yes, at least) one in each of them shat itself. Cue frantic support calls to the vendor to get replacement RMA PSUs.
So, when I got blamed for them all failing and this huge customer cracking the shits, I referred them all to the warnings I sent, the dates they were sent, and my suggested ways to avoid the issue. Funnily enough the blame disappeared then.
We were getting disk errors on one drive in the disk array on an AS/400. The engineer was adamant it was a cable fault so to avoid downtime I left my very experienced evening operator and him to sort it after office hours.
Sheer chaos hit as it wasn't the cable, it was the drive itself which then failed to restart. Of course because it was "just the cable" the engineer hadn't taken the drive out of the array first and as it was only striped, not mirrored, the system was f****d. To say I wasn't happy with either the engineer or the operator, both of whom should have known better, is an understatement.
Fortunately we had backups but it still took a few days to get everything fully working.
I have been a Technical Support Manager; my focus was always on maintaining service levels. I could negotiate maintenance windows, but unplanned outages were anathema. I've had many conversations with IBM engineers. I've always found that with a bit of digging, asking a few 'what if it's not that' questions would get us to the point where the engineer really wanted to be but wasn't allowed to tell me officially.
This was 15 years ago, but even then engineers were under huge pressure not to pull spares from stores unless they were definitely going to be used, as the engineering managers were trying to minimise spares costs.
A call to the Account Manager would normally result in a changed plan, often with the 'if it's not that, there will be a charge' conversation. I think I ended up paying for one disk volume which wasn't required; the new disk was just added to the array, providing some extra space. This was lower than the financial penalties that would have been extracted from my budget for an unplanned outage of that length.
As the Tech Support Manager it's your job to make sure that the change plan is properly planned, impact assessed and executed. It's also your job, having done that, to support the engineer and operator should things go wrong. There's only one person you should have been angry with, and you see his face in the bathroom mirror every morning. I know the engineer was an IBM employee, but if he is the nominated site engineer he's still a member of 'my' team. Try that approach working with your vendor support staff, whether it's hardware engineers, network specialists or devs, and you'll find they suddenly get very open with you, as you just want to work with them for a solution, not point the finger of blame.
I hate this excuse. I've seen it way too often.
How does a cable that is either inside a computer/server or in a building infrastructure fail if there has been no human or possibly animal interaction?
It has happened, but in my experience, cable faults (barring desk patch cables) are very, very rare.
Oh God.
Ultra 2 Low Voltage Differential SCSI. 68-way twisted-pair *insulation displacement cable* with accompanying brain-damaged mini-D connectors. Along with terminators that had to be blessed at a full moon with the blood of a goat to have a chance of working.
The least fun I have ever had inside a computer. Never again.
How does a cable that is either inside a computer/server or in a building infrastructure fail if there has been no human or possibly animal interaction?
When there's money to be made or field service want to point the blame elsewhere. Take a dead PC into PissyWorld and see how long it takes to be told a $$$$ gold plated "digital" cable will fix the problem.
At one place I worked, a support engineer tried and failed to get me to believe the RAID was randomly dying because cosmic rays were corrupting the controller card memory. Yet those rays somehow didn't corrupt the server's memory because that would have made the OS crash and someone might have noticed that.
Must have had the same book I did - "Upgrading and Repairing PCs" - there was a page about cosmic rays corrupting memory. The solution was to bury the computer under several meters of concrete.
Also remember reading about an IBM server - if you put in extra memory that had tinned contacts instead of gold plated, the keyboard would stop working.
Make sure you have old steel for the reinforcing rods. The Belfast C14 dating lab had the counter in a pit with a few concrete paving slabs (no reinforcing rods at all) sitting on top of old steel plates. Back then we had a shipyard just down the road so sourcing that from breakers was quite feasible. https://en.wikipedia.org/wiki/Low-background_steel
The Falklands War started when an Argentinian boat went into one of the old whaling stations on South Georgia and started dismantling it for scrap. They had permission from the Argentinian government, but not from the Falkland Islands government ... so the local magistrate arrested them.
Notes:
1. South Georgia is a dependency of the Falkland Islands.
2. The base commander for the BAS base is sworn in as a magistrate; this is usually a formality.
No, you'll just need to bury the concrete under more concrete!
It probably depends on the type of radiation - I'm sure there are people who will correct me, my knowledge of radiation is from school, but I think some types are slow-moving particles, and it was gamma that is more energetic?
RE: Tinned memory contacts
That one actually makes sense. If the mount points were gold (plated), when current runs through a gold/aluminum contact pair, they electrolyse and you get a weird gold/aluminum alloy: purple, brittle, and very non-conductive, AKA the Purple Rot.
On reflection, that would take time and cause intermittent errors first, and I don't know how that would cause the KB to stop.
I've only ever seen purple plague once - ever - in my entire career.
An audio IC manufactured in the early-mid 1970s which erupted purple goo when touched with a soldering iron. By that point (late 1980s) it was a good decade past the point where any of my cow-orkers had seen cases of it and even then it was only in older kit
By the time IBM PCs came along, purple plague was a distant memory
"a page about cosmic rays corrupting memory"
It wasn't cosmic rays that was doing it.
The incident that led to the urban legend was due to cheaping out on supply chains which resulted in source clay material contaminated with unacceptably high levels of uranium being used in IC ceramic cases - with fairly predictable results (all ceramics contain radioactives, some contain more than others)
WRT the IBM servers, the contacts may have been tinned, but the culprit was voltage and timing tolerances along with circuit loading. Back then people didn't pay nearly enough attention to such things in digital circuits
Ditto. Badly made cables happen but even those tend to only play up when subjected to actual vibration or stress (hint: freezer spray isn't just for components)
Making matters worse, connectors are only rated for "N" operations and each test cycle brings them that much closer to crapping out (this is why my test rigs frequently have sacrifice leads in them)
About 50 years ago our college time sharing system failed. It couldn't even load diagnostic tapes. Eventually it was traced to a cable from memory to CPU. So this cable was replaced and worked perfectly - but three other bits did not. Our field engineer had three spare cables at this point which he gingerly connected without cable ties etc. and got the system up. He then put in a call to have ALL cables of this manufacture replaced. A team came the next weekend and replaced hundreds of cables. Did not have a problem thereafter.
Don't tell me that cables can't go bad.
And let's not mention when your disk controllers have incompatible firmware versions... and as you replace the one failed unit out of the four with a spare that has a newer firmware version and it proceeds to wipe out everything on the disks in a very systematic fashion... I think there's a certain university I'm looking at over there...
I once had a server at a customer who screamed "Why did you do a RAID0! Now my data is lost!".
Turned out: We did set up a RAID5, but when one disk failed the RAID controller silently switched to RAID0 instead of raising an alarm and turning on the two orange warning LEDs on the server (one for the specific HDD, and one for the general warning). When the second disk failed a few months later the data was *poof* gone. Ended up in a nice little "Supplement document" from the vendor about that bug, and how urgent it is to have specific RAID controllers with specific firmware versions updated as soon as possible.
customer who screamed "Why did you do a RAID0! Now my data is lost!"
I once had a small customer who spec'd the cheapest Proliant ML310 server with two IDE (or SATA?) drives and insisted on RAID0 despite protests from me and my colleagues. I ended up honouring the request for RAID0, but I also slipped a note inside the server for future administrators to show it was the customer's idea!
the RAID controller silently switch to RAID0 instead of raising an alarm
Sounds made up so must be true! Which vendor/controller was this? Name and Shame please.
It was long ago, when Server 2003 was the newest shit: a Fujitsu Primergy, but nearly the cheapest option available, which used the onboard LSI SCSI ports with an additional LSI chip in a specific PCI64 (yes, without -e) slot for the RAID5 logic. At least the SCSI was connected to a real SCSI SCA backplane and the HDDs were real hot-plug. The lowest possible end for a real hardware RAID5. I never liked that setup instead of shelling out 100 bucks more for a real RAID controller which could do everything on its own, but at that time I was too young and dumb to speak up. They were slow, of course. And, except for that one f*up, quite reliable.
Nothing exists unless it has a minimum of one independent backup. That is the absolute minimum, for the simplest of systems, before you can say you have any data. Less than that and you just have a hope that your data is there.
This was true when I was semi-professionally supporting educational colleagues 30 odd years ago before we were given proper IT support. And it's still true now. Maybe it always will be, cloud or no cloud.
RAID is not always the answer. Disks will do what they are told, so if an operator says 'delete this file', or 'reformat this disk', the disk subsystem will do it. If it is configured for RAID, it will do it very reliably.
You still need backups.
Clearly the server was NOT ok, because in the previous paragraph we are told the server DIED and that was what started the whole fiasco in the first place. The fact a second disk died during the replacement of the failed disk is just icing on the cake. If this really was a properly configured RAID array with redundancy and not just a RAID0/JBOD, the server would have carried on working, and simply replacing the failed disk to rebuild the array would not have been the solution anyway.
Either the original story smells a bit or the re-write by El Reg got the order of events mixed up.
Okay downvoters, tell me why I'm wrong instead of just hitting a button and moving on. A server died and the diagnosis is a failed HDD in a RAID array where the solution is to replace the failed disk. Why did the server die? The whole point of RAID is that a failed disk, other than in a RAID0 config, shouldn't bring the server down; it pootles along in a degraded state. The fact they thought replacing the failed HDD would be "the fix" means they were running with redundancy, otherwise they'd already be looking for the backups (ignoring, for the time being, that a 2nd disk failed after the fact, during the array rebuild, and is not pertinent to the initial server death). I suppose it's feasible that the disk failed in such a way that it caused the controller to have a fit, but I've never seen that happen before, and if that was the case, on an Outlook mail server, there WILL be data corruption, so again, a simple HDD replacement is not going to fix it.
"ignoring, for the time being, that a 2nd disk failed after the fact, during the array rebuild and is not pertinent to the initial server death"
Not one of the downvoters but AFAICS the above statement is the problem. As I read the whole story there was no initial server death. It was the failure of the 2nd disk that was labelled as the death of the server.
According to the article...
"One of the Exchange Servers had abruptly died. "Consequently, half of the constabulary had lost email!"
Eventually it transpired that a disk in one of the servers had failed the day previously."
And then later on...
"The replacement was popped in and a rebuild was started.
"While in the process of rebuilding the array, a second disk decided it was going to join the party and while not quite failing..."
The timeline seems quite specific in the article. As I said, the 2nd disk failure is not relevant to "the server died and it was all down to a single failed HDD in a RAID"
The article has extra text indicating the real timeline, which you have missed. Here is the timeline in its original form.
Day 1: Hard drive 1 fails, server stays up.
Somewhere in the middle, probably day 2 morning: Team inserts a new drive into array to recover.
Day 2 noon: Server team leaves for party, desktop team comes in to manage things, repair still in progress.
Day 2 afternoon: Hard drive 2 fails, server goes down.
For the desktop team, it was the first thing they saw with the server, as they didn't put in the new drive. Since Sam was on that team, that was his first knowledge. It was the second drive that did it. The article has the events out of order, and the clue was "Eventually it transpired that [...] the day previously." We're getting the events from Sam's point of view, and he wasn't there from the beginning.
Nope:
Day 1: drive failed, sabres rattled.
Day 2: new drive installed, rebuild started, server team goes to Christmas party
Day 2 also: latent fault revealed on a second drive, rebuild aborts and (likely) the controller removes the second drive from the array, leaving it inoperable and probably without a way to shove it back into the degraded set so you can at least try to get a backup...
Also, earlier you mentioned you didn't know of cases where a drive failure caused a controller to have a fit... <insert hollow laugh here>.
These types of failures are horribly common (for developers), because they are extremely hard to replicate and debug. With the old shared-bus setups like SCSI it's quite easy to imagine drives behaving badly on the bus, but many controllers with point-to-point connections (SATA, SAS, FC, etc) have shared resources, so creative failures on Drive A bugger up Drives B through D by starvation. And my most common nightmares involved drives disappearing, then recovering _while you're in the process of removing them from the set_, leading to a completion event on a drive that you're trying to mark as killed and for which you've already released resources, and the sodding "intelligent" controller has a cached DMA address that it's happily going to use regardless...
I had a RAID NAS drive whose controller board got blown up by a lightning strike on the phone line that propagated through the network cables (*). I thought, 'oh, no problem, I'll put the discs in another drive.' Would it read them? Of course it bloody wouldn't.
I had other backups. But I air-gapped everything with a wifi extender from the router after that.
(*) Apart from the NAS it killed the router, 2 switches, 2 DECT base stations, and the network interface in a computer. Could've been worse though, and it wasn't even a direct strike on the line.
"But I air-gapped everything with a wifi extender from the router after that."
I had to read that again to realise you meant an electrical air-gap between the Internet and your kit rather than the more usual meaning in IT circles of "not on the network AT ALL" :-)
My preferred technique is to replace one of the drives in a RAID1 after a year. Then at least I know the two are not from the same batch.
Just never, ever use a hardware RAID controller. They use nasty, proprietary disc formats that are unreadable without the correct RAID card; and have an unfortunate tendency to attempt to rebuild an array by copying the new, pristine drive over the one that used to have all the data on it.
Or it says it's rebuilt the mirror but actually hasn't, so if disk 1 has been replaced and disk 2 then fails, bye bye array.
Looking at you HP Smart Array.
Thankfully caught that when I joined a new company since one of the first things I looked at was firmware versions on servers, RAID controllers etc.
Listed as a critical issue on release notes but nobody had noticed up to that point.
A competitor of the company I worked for in the 90s was bought up and merged, and we were proper impressed to see they'd bothered to properly partition their drives, so the first 64 sectors of each disk included a standard-ish partition table that could be read by FDISK and the like, with the RAID configuration tucked in following the table (not in a partition, which was a bit half-baked), and then the RAID storage space marked off as a partition.
Snag (to us) was that each drive access had to be manipulated to add the 64 sectors, which didn't thrill us as we spent a _lot_ of energy optimizing cylinder/head alignment, and this made that messy.
[ For people going "huh?", when you can say to Seagate "build us a new model HDD, we'll buy all you can make for a year..." you get to be able to do all sorts of optimizations that regular customers can't know about! ]
Conversation:
IT bod: The disk is failing, bad sectors everywhere, only a matter of time before it dies.
Manglement: It still runs.
IT bod: Look, just schedule an engineer to come out, do a backup and swap the failing drive over for a good one, it'll take 2 hrs.
Manglement: Costs too much, server/computer/industrial plant needed... can't afford downtime.
IT bod: Look, just let me...
Manglement: Nope nope nope nope not listening!
<Drive dies 4 days later>
Manglement: WHADDYA MEAN ITS GOING TO TAKE 3 DAYS TO GET AN ENGINEER IN? FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT ....... etc etc etc
If you've not had the above played out in front of you, are you even a real IT bod ?
"If you've not had the above played out in front of you, are you even a real IT bod ?"
Yes, because I'm the contractor who gets sent to fix it in 3 days, not the one worrying about whether your PHB is a penny pinching moron. :-)
If you want same or next day service, you need to take out a maintenance contract, otherwise you'll be further down the list of priorities than our paying customers. (Although will always try our best to get there ASAP because you might turn into a contract customer, but we'd NEVER push an off-contract job above a contract job.)
Next step is to get old and cranky and to start to learn the management lingo.
Throw in terms like "preventative maintenance schedule cycle" and "service delivery optimisation alignment window" and you will have your planned maintenance in a few seconds. A few meaningless graphs of before and after and you're half way to a promotion to PHB.
Point of order: it does nothing to stop them from throwing you under the bus. What it means is that you can throw them under a faster, heavier bus when they do it.
Of course, the next level up will then just ask "why didn't you escalate?" - thereby throwing both PHB and you under a bus so big it can be mistaken for a "B" ark.
If you do ask, I believe Murphy was an optimist.
They're not related to https://www.theregister.com/2022/07/11/aerojet_cybersecurity_whistleblower/ are they?
I mean, repeated warnings of 'this really is going to fail at some time unless you do it properly first. etc.etc. does seem a common umm failing.
Oh what's the point?
...however nothing beats a Saturday watching the app support teams crap their pants twice a year at the thought of having to reboot their precious servers to prove to the DR admins that services can be taken down and brought back safely.
As a Unix admin I do envy the Windows admins their regular reboot cycles; they exercise the system boots and startup scripts. You just don't reboot Unix boxes unless you have to, which is not often, and then it's a mad panic to patch up the init.d/services configs 'cos you forgot to check something you did 9 months ago!
"I didn't choose the IT life, it chose me."
No support stories, apart from one where my old boss (who was brought up Muslim but never followed the Qur'an) told a story of one of our users who phoned him up on his mobile on Christmas day to complain about a problem. The convo went as follows:
Boss: "It's Christmas"
User: "You are Muslim and don't celebrate Christmas"
Boss: "You aren't, and do celebrate Christmas".
And yes, while my boss was Muslim, he did celebrate Christmas. Not because he was brought up to, but because he had a wife and child who were not Muslim..
I've had a couple of Christmas parties where it would have been infinitely more interesting to watch a RAID rebuild than go to the party.
Usually what happens is that each team member hangs around with their team, eating, drinking and having fun. There is usually a Christmas meal, and we sit and eat with our own teams. After the meal, we usually head for a local pub and carry on the drinking, often with other teams.
Every few years, the admin office (who organise the parties) come up with the wonderful idea of mixing up the teams, with the idea that this sort of thing promotes inter-team relations. It really doesn't. You end up stuck on a table with a bunch of people you don't really know for 2 hours. Admittedly, while I'm quite friendly and good at helping people I don't know, I'm not too comfortable with the kind of chit-chat expected at a table. The whole situation is usually painful, embarrassing and hated by most of the party-goers.