Watch a RAID rebuild or go to a Christmas party? Tough choice

The long arm of the law is unexpectedly severed by the antics of Microsoft Exchange this week as another reader explains why some of Her Majesty's finest were once bereft of festive email. Welcome to Who, Me? Today's story, from a reader Regomized as "Sam", takes us back to the mid-2000s and happier days before Brexit, Trump …

  1. Korev Silver badge
    Coat

    Eventually it transpired that a disk in one of the servers had failed the day previously.

    So the disc copped it?

    1. Korev Silver badge
      Coat

      It sounds like the Police RAID didn't go to plan

      1. Sgt_Oddball Silver badge

        You should

        Always be wary of blue flashing lights...

        Red ones means it's all too late.

        1. 2+2=5 Silver badge
          Joke

          Re: You should

          Bloody Intel - the long ARM of the law would have been a better choice.

          Or help from Canada - the Mounties always, er, mount their disks.

  2. Mark White

    It has to be a rule of IT somewhere: any routine task will fail once you stop monitoring it.

    1. John Robson Silver badge

      More pertinently

      RAID isn't RAID if it isn't actively monitored.

      And between polls it must be assumed to be in a degraded state.

      1. Prst. V.Jeltz Silver badge
        Trollface

        Re: More pertinently

        "A backup is not a backup until it is tested/restored" is a very solid principle to live by, but:

        And between polls it must be assumed to be in a degraded state.

        Really? How far are you going to take this? How often do you poll? What's the first thing you do when you suspect a degraded state? Poll? You must be sat at your desk all day cycling between:

        "The RAID's broke"

        "Oh wait, it's ok"

        "The RAID's broke"

        "Oh wait, it's ok"

        "The RAID's broke"

        "Oh wait, it's ok"

        "The RAID's broke"

        "Oh wait, it's ok"

        1. Gene Cash Silver badge

          Re: More pertinently

          Looks like the log files from a Seagate device...

        2. Anonymous Coward
          Anonymous Coward

          Re: More pertinently

          That's up there with the apprentice automotive technician being asked to help verify if the turn signal indicators are working properly on a customer's car. "hey newguy, are the turn signals working?"

          "yes... no... yes... no... yes..."

        3. John Robson Silver badge

          Re: More pertinently

          You choose your polling interval according to the length of time you are willing to live with the array in a degraded state - I'd suggest that 12 hours is too long for critical production systems, but might be tolerated on a low usage home setup.

          You only ever *know* that the RAID was working at the previous poll (active alerting excepted, though again technically you only *know* at the timestamp of the last event, which we hope was informational).

          The assumption that it's degraded isn't knowing that it is and reacting; it's the baseline which tells you how long you are OK to be running in a degraded state before you jump on it.

          If the answer is very small then you should be running with a warm spare drive (or several) available anyway.

          If the answer is long enough for people to move around and do things then you don't need to check more often than that; the assumption that it is degraded is what defines your polling cycle.

          Each polling cycle (or trap/event) either resets that counter or puts you into a "this drive has failed" state, which does require action on some timescale.
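          That "assume degraded between polls" baseline can be sketched in a few lines of Python. All the names and thresholds here are invented for illustration; a real setup would get its last-known-good timestamp from whatever controller or monitoring tool you actually run:

```python
# Illustrative sketch of the "assume degraded between polls" baseline.
# State names and thresholds are hypothetical, not from any real tool.
OK, ASSUME_DEGRADED, ACT_NOW = "ok", "assume-degraded", "act-now"

def array_state(seconds_since_last_good_poll, poll_interval, max_degraded):
    """You only *know* the array was healthy at the last good poll.
    Within one polling cycle, trust that result; beyond your tolerated
    degraded-state window, treat the array as failed and act."""
    t = seconds_since_last_good_poll
    if t <= poll_interval:
        return OK               # within one cycle of a known-good poll
    if t <= max_degraded:
        return ASSUME_DEGRADED  # assume degraded until the next poll
    return ACT_NOW              # past tolerance: act now

# Hourly polls, tolerating 12 hours degraded (the "too long for critical
# production, maybe OK at home" figure from the comment above):
assert array_state(1800, 3600, 12 * 3600) == OK
assert array_state(7200, 3600, 12 * 3600) == ASSUME_DEGRADED
assert array_state(13 * 3600, 3600, 12 * 3600) == ACT_NOW
```

          The point of the sketch is that `max_degraded` is the number you choose first; the polling interval falls out of it.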

      2. Alan Brown Silver badge

        Re: More pertinently

        RAID is not backup, backup is not RAID

  3. Kildare
    FAIL

    It's got RAID so it can't fail. . .

    I was once responsible for a NAS which contained several VMs - including some which were high-profile to the project. The NAS was old and had been shifted several times before it came to me. We had had a disk fail and I suggested - in writing! - that all the disks should be replaced as they were now suspect. There was the inevitable no response.

    Isn't it amazing how the MTBF can be calculated so precisely that the failure of a second disk can occur within the time taken to purchase a replacement for the first failure!

    Of course there were no backups - it had RAID.

    1. Anonymous Coward
      Anonymous Coward

      Re: It's got RAID so it can't fail. . .

      "It's got RAID storage!", they said.

      "Disks from the same manufacturer, of the same type and from the same batch?", sez I.

      "Yes", they said.

      "Then you haven't got RAID," sez me.

      1. Pascal Monett Silver badge

        When I decided to purchase a Synology DS414j, I said to myself that I had to stage the purchase of disks to avoid that problem.

        As a result, the first month I bought a Seagate and a Winchester 3TB disk. The second month I bought another Winchester 3TB disk, and the third month I bought the final HDD.

        When I decided to replace them with 8TB disks, I did the same, staging the procurement process through three months.

        I'm hoping that will spare me the kind of problem outlined in this article.

        1. CountCadaver

          You mean Western Digital, I take it? Given that Winchester was an early IBM drive named in honour of the 30/30 rifle cartridge.

          1. Prst. V.Jeltz Silver badge
            Trollface

            Winchester was an early IBM drive named in honour of the 30/30 rifle cartridge

            Seems an odd thing to do, what's the link? Was it bulletproof?

            1. Russ Pitcher

              My understanding is that the drive was so named because it had 30MB of fixed storage and 30MB of removable storage, and was named after the rifle because it sounded good.

              http://www.computinghistory.org.uk/det/22711/Watford-Electronics-30MB-Winchester-Hard-Drive

              1. Minor7th

                Mr

                I thought they were so named because they were first created and developed at IBM Hursley in Winchester.

              2. Terry 6 Silver badge

                I'd always assumed it was named after the place, rather than the gun. Not for any particular reason. But then I'm not American.

              3. sniperpaddy

                Mini Winnies ?

        2. Martin an gof Silver badge

          "Winchester"? Do you mean WD?

          Anyway, what I have often done at home is to purchase from different suppliers. It's unlikely CPC and eBuyer (just as examples) will have discs from the same batch, and it means that other than a couple of days difference in the delivery times, I get everything at the same time. I also use RAID6 (or equivalent) so the system can survive two failed discs, "just in case". The main downside is the reduction in the amount of storage available.

          M.

          1. Anonymous Coward
            Anonymous Coward

            It's highly likely CPC etc. have disks from the same batch.

            Buy from different manufacturers not suppliers.

            1. Martin an gof Silver badge

              I do both. It hasn't been my experience - buying one or two discs at a time - that batch numbers are even close between unrelated companies.

              M.

        3. Pascal Monett Silver badge

          Yeah, silly me.

          WD, obviously. My memory must have fixated on one of my first hard drives.

          Sorry for that slip.

        4. Malcolm Weir Silver badge

          It will not.

          There's been an enormous amount of bunk spoken about RAID, including the "RAID isn't RAID if..." tripe. I mean, a lifeboat is a lifeboat even if there's a hole in the bottom! The key point is what the technology is good for, etc...

          The reason spacing out your disk purchases won't avoid the problem is that the second (or third, if you're using a RAID-6 setup with two-dimensional parity) failure event is not independent of the first. When the first drive failed, the workload on the remaining drives increased significantly, raising the probability of wear-related failures (and also the exposure to low-probability drive/firmware events that trigger a "drive failed" response).

          So unless you know the real failure characteristics (mean and standard deviation of various types of failure by lot number), which you don't (mostly because no-one does, because it's not valuable information to the drive vendors and their manufacturing plants and subsystem suppliers), you have no idea whether you've expanded or contracted the event surface. For example, suppose you have two drive vendors, A and B, plus two lots of each drive, and somehow you know that A's drives become generally more prone to failure after 10,000 operational hours, while B's are good to 10,200. Suppose one of your B drives is from a slightly sub-par lot, and fails a little early, at 10,100. All your "A" drives are outside their comfort zone, and the increased workload accelerates their departure, and unhappiness may be in your future! By contrast, if you have all the same lot/brand, the first failure is a much more reliable harbinger of doom!
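          That correlation argument can be illustrated with a toy Monte Carlo. Every number here is invented for illustration - none of them is a real drive statistic:

```python
import random

def rebuild_survives(n_remaining, p_fail, rng):
    """One trial: true if every remaining drive survives the rebuild window."""
    return all(rng.random() >= p_fail for _ in range(n_remaining))

rng = random.Random(1)   # fixed seed so the toy result is repeatable
TRIALS = 50_000
BASELINE = 0.01          # invented per-drive failure probability over the window

# Independent failures vs. rebuild stress tripling the per-drive hazard
# (the "workload on the remaining drives increased" effect described above):
indep = sum(rebuild_survives(7, BASELINE, rng) for _ in range(TRIALS))
stressed = sum(rebuild_survives(7, 3 * BASELINE, rng) for _ in range(TRIALS))

# The second failure being correlated with the first is exactly what
# staggered purchasing cannot fix:
assert stressed < indep
```

          With these made-up numbers, roughly 93% of unstressed rebuilds complete against roughly 81% of stressed ones; the point is the gap, not the figures.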

          True story: back in the 90's, I was part of the engineering team designing and building a (well, the) major storage subsystem's upcoming "open systems" RAID solution. We were working through the system-level testing, including what our team leader ("Tom", because it was his name) called the "yank test" (grab drive, yank... does the system keep trucking along?)

          Anyway, we were working through some issues, when marketing heard that problems existed, and wandered by to see if the issues were going to impact the reliability figures they'd decided to feature in the collateral and at the upcoming launch event. This led to an extensive and heated "discussion" as to whether "hot plug" is a reliability feature ("sexy!!!") or a maintainability one ("yawn"). Obviously, it's the latter, but I was not winning the argument...

          In desperation, I turned to Tom, who had been the guy in charge of the company's IT department prior to his current position, and asked him how he, as the owner of the machine room, would handle a drive failure? His response was that the repair people (from the same company, mind) weren't going to get to TOUCH the machine until, like, Saturday at midnight and then only if he didn't have important work still running, no matter how good the alleged hot plugging feature allegedly was!

          Moral of the above story and many others: when something fails, Step 1 is get a backup. Step 2 is check your backup. Steps 3 and 4 might be a repeat of Steps 1 and 2... and only THEN do you do something about the failure!

          (I lost the argument, and the company trumpeted a mean time before data loss of something longer than the time to the heat death of the universe or something like that, because marketing...)

          [ Incidentally, another reason people like to try to get drives from different lots is the idea that a firmware update might kill a drive, and if all the drives are the same, the update will kill all of them. This only works if (a) you only ever have one drive of a given model, and (b) you update the firmware in the RAID chassis at all -- if you have a fetish for that sort of thing, pull the drive, update it, replace it, rinse and repeat! ]

          1. ChrisC Silver badge

            "Incidentally, another reason people like to try to get drives from different lots is the idea that a firmware update might kill a drive"

            Doesn't even have to be a firmware update that kills a drive, given that one reason to release firmware updates is to fix bugs (sometimes critical) in the firmware you already have on your drive...

            On the wider point, I guess one reason people like the idea of populating their arrays with drives from different batches/manufacturers is that, whilst the points you make about general failures are entirely valid, what you don't seem to have touched on here are those completely out of band failures that happen every once in a while, and cause a particular model of drive to suffer significant reliability problems completely out of keeping with anything else at the time. Granted, these don't seem to occur as often as they used to, but maybe for those of us who still bear the scars of having lived through such times (I was the "lucky" owner of both an IBM Deathstar and a Seagate 7200.11 with firmware SD1A...) there's some inherent reluctance to trust in statistical modelling etc. when our gut instinct is screaming at us to never ever trust all our data to the same manufacturer/model/batch/whatever, just in case we end up being part of the next big drive reliability scandal...

            That being said, I appreciate your insider perspective here - the more good quality information we have about likely causes of array failures, the easier it becomes to push our (irrational) fears to one side when deciding how to set up an array in future. Coincidentally, I'm in the process of speccing a new RAID setup for home, and was umming and aahing over whether to populate it with a set of "identical" drives, or go to the extra trouble of mixing and matching manufacturers, suppliers etc, and I think your comments here have just made my life a bit easier.

          2. Pascal Monett Silver badge

            Interesting point.

            Thank you for that insight.

            I do backups on optical of everything that is really important to me.

            The NAS is basically an enormous storage zone. I can lose everything on it, I won't lose sleep over it.

            Because I have my backups.

        5. David Hicklin

          When I first had my DS415play I originally used RAID5, as most people do. Later I changed that to two pairs of mirrored disks.

          Why?

          Well, if one disk fails, both options are the same. However, if a second disk fails before the first is fixed then the RAID5 is 100% toast, whilst the mirrored option only has a one-in-three chance of toasting your data and needing a full rebuild and restore (the second failure has to hit the surviving half of the same mirror).

          I could actually lose 3 disks and only half my stuff.....
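          The two-mirrored-pairs arithmetic can be brute-forced. This enumeration (disk numbering is just for illustration) shows that a second random failure is fatal only one time in three, and that losing three disks always leaves one pair with a surviving half:

```python
from itertools import combinations

# Two mirrored pairs across four disks: data is lost iff both
# halves of some pair fail.
PAIRS = [(0, 1), (2, 3)]

def pair_dead(pair, failed):
    return pair[0] in failed and pair[1] in failed

def data_lost(failed):
    return any(pair_dead(p, failed) for p in PAIRS)

# Disk 0 has already failed; enumerate every possible second failure.
fatal_seconds = [d for d in (1, 2, 3) if data_lost({0, d})]
assert fatal_seconds == [1]   # only the mirror partner is fatal: 1 in 3

# Losing three of the four disks always kills exactly one pair, so the
# other pair's data survives - "lose 3 disks and only half my stuff".
for extra in combinations((1, 2, 3), 2):
    failed = {0, *extra}
    assert sum(pair_dead(p, failed) for p in PAIRS) == 1
```

          On the same four disks, RAID5 is lost with certainty on any second failure, so the mirrored layout trades capacity for better second-failure odds.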

    2. Paul Crawford Silver badge

      Re: It's got RAID so it can't fail. . .

      The problem with a disk failure in RAID is that when it starts a rebuild, all of the other disks get hammered as it rebuilds copies/parity, and that can provoke another disk to give up the ghost.

      The second and less obvious thing is that if the RAID does not do periodic scrubs, then the rebuild is when you discover an unreadable cluster on a surviving disk, resulting in data corruption somewhere.

      Of course if you have double parity (e.g. RAID-6 or RAID-Z2) then you can cope with a double disk failure, and if you have certain file systems, with ZFS being the poster child, it tells you which files are corrupted, not just some sector number(s) that you then have to use some low-level tool to map to a file allocation. Been there, bought the T-shirt :(

    3. simonlb
      Meh

      Re: It's got RAID so it can't fail. . .

      In a previous role a while ago we had a large number of Exchange 2003 servers in stretched clusters (active/passive) with the Exchange data on EMC SANs. Each node of each cluster had 4 x 80GB disks, configured with two disks mirrored as the C: drive and two mirrored as the D: drive.

      Worked great over time when the odd drive failed as the relevant cluster node carried on working and popped out an email so we could get an engineer to swap the failed disk.

      Wasn't so great when the RAID controller board in one of the cluster nodes popped and blew all 4 disks. The business decided it was easier (cheaper) to evict the failed node from the cluster and support the remaining node on a 'best endeavours' basis.

      1. gryphon
        Unhappy

        Re: It's got RAID so it can't fail. . .

        We found a similar issue managing some Exchange servers where HPE's preferred architecture had been followed.

        Use single SAS disks per group of replicated databases but make each one a logical array within the Smart Array (SA).

        Ok, fine. Whatever.

        Problem being that, for some reason, if Windows starts showing errors on the disk, i.e. unable to read file, bad sector etc., it does NOT get flagged up by the Smart Array as a predictive failure or failed disk.

        So if your Wintel guys aren't specifically monitoring for the correct NTFS event log IDs you'll be blissfully unaware of the issue. There is a Smart Array event generated saying it couldn't provide the file to Windows, but the SA doesn't class it as anything it cares about, so does nothing.

        Ever since the IDE days, seeing a bad sector on a disk has meant there have already been multiple failures and spare good sectors have been mapped in until they've run out. Replace the disk immediately if you see that.

    4. Anonymous Coward
      Anonymous Coward

      Re: It's got RAID so it can't fail. . .

      RAID 5 is pretty much redundant these days - see what I did there! With the size of arrays these days the rebuild time can get very long, increasing the chances of another disk in the array failing whilst they're all being thrashed doing the rebuild. Better off with RAID6 or, if you have the cash, RAID10, as the rebuild times are quicker.

      1. Anonymous Coward
        Anonymous Coward

        RAID6 sucks

        The write penalty on RAID6 is even worse than RAID5 - if you're trying to do anything even vaguely performant, I've always found RAID10 to be cheaper.

        1. Kerry Hoskin

          Re: RAID6 sucks

          RAID10 will have better performance and better rebuild times than RAID6 but it won't be cheaper!

          1. Anonymous Coward
            Anonymous Coward

            Re: RAID6 sucks

            RAID10 is almost certainly cheaper *for the same write performance* - if you're using equivalent disks, with RAID6 you need 3x the number of disks as with RAID10 to get the same write performance - with RAID6, each write is written across 3 disks, with the old value of the data on each disk being read first so the parity can be recalculated - so 6 physical IOPS per logical write IOP; with RAID10, you just write the same data to two disks, no reads required, so only 2 physical IOPS per logical write IOP.

            Sure, with RAID6 you can use smaller (cheaper) disks as your capacity penalty for RAID 6 is lower; but the more disks you need, the greater the likelihood you need additional chassis, additional networking/interconnects, additional licensing (depending on SAN manufacturer), additional rackspace, additional power etc...
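            The small-write arithmetic in that argument, as a sketch - this assumes the classic read-modify-write path with no controller-cache coalescing, which is a big real-world caveat:

```python
# Physical I/Os per logical small random write, classic read-modify-write,
# ignoring controller cache tricks (a big real-world caveat).
WRITE_AMPLIFICATION = {
    "raid10": 2,  # write the data to both mirror halves, no reads
    "raid6": 6,   # read old data + P + Q, then write new data + P + Q
    "raid5": 4,   # read old data + parity, write new data + parity
}

def physical_write_iops(level, logical_iops):
    return WRITE_AMPLIFICATION[level] * logical_iops

# The 3x disk-count claim above: to serve the same logical write load,
# RAID6 needs 3x the physical write IOPS of RAID10.
assert physical_write_iops("raid6", 1000) == 3 * physical_write_iops("raid10", 1000)
```

            Full-stripe sequential writes avoid the read-modify-write penalty entirely, which is why the gap matters most for small random writes.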

            1. Nick Ryan Silver badge

              Re: RAID6 sucks

              For the better controllers, writing to multiple disks occurs in parallel, not in serial. Therefore, for example, when writing to two disks in a RAID10 topology the overhead should be trivial, not twice the time to write.

            2. Anonymous Coward
              Anonymous Coward

              Re: RAID6 sucks

              For the same write performance, maybe, but not in absolute terms. If performance isn't an issue then RAID6 costs less than RAID10, as there are fewer disks lost to parity.

      2. Alan Brown Silver badge

        Re: It's got RAID so it can't fail. . .

        RAID6 has a non-trivial (2%) chance of more data corruption on array rebuilds over 5TB, and that's before factoring in the increased chance of drives falling over due to increased head thrash (assuming 10^-14 per-bit error rates - the typical spec for spinning media).

        IE: if you're relying on RAID6 to keep large arrays error free, you're already in trouble. Ditto RAID10 - and at these sizes the errors have a statistically significant chance of being SILENT.

        ZFS assumes drives are crap and proceeds accordingly. It's a lot easier to do this than to gold-plate turds. A 10^-14 error rate simply isn't good enough at very large storage scales.
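        A back-of-envelope for the rebuild-error risk, assuming the commonly quoted datasheet figure of one unrecoverable read error per 10^14 bits read (enterprise drives are usually rated an order of magnitude better); the drive sizes are illustrative:

```python
def p_read_error(bytes_read, per_bit_error_rate=1e-14):
    """Probability of at least one unrecoverable read error while
    reading `bytes_read` bytes, treating bit errors as independent
    (a simplification - real errors cluster)."""
    bits = bytes_read * 8
    return 1.0 - (1.0 - per_bit_error_rate) ** bits

# A RAID5 rebuild that must read 5 surviving 10TB drives reads 50TB:
p = p_read_error(50e12)
assert 0.97 < p < 0.99   # ~98%: almost certain to hit at least one URE

# The same maths at an enterprise-class 1e-15 rate is far less scary:
assert p_read_error(50e12, 1e-15) < 0.4
```

        This is the standard argument for double parity plus scrubbing on large arrays: a single URE during a single-parity rebuild is unrecoverable.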

  4. chivo243 Silver badge
    Go

    "It's a disaster waiting to happen!"

    Reminds me of a Halloween costume party... One guy came holding field glasses, with a toilet seat around his neck! I asked him WTF are you... he replied "I'm a bad accident looking for a place to happen!" Simply awesome!

    Yeah, that second disk failure during a rebuild... my sphincter always puckered and un-puckered during a RAID rebuild.

    1. Alan Brown Silver badge

      Re: "It's a disaster waiting to happen!"

      Ideally you should watch your SMART data and replace drives long before they run out of sectors or hours, but manglement never think about it that way. It's often better to talk to accountants about "preventative maintenance" and "cost of not acting"
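      One hedged sketch of acting on that advice: parse `smartctl -A` output (from smartmontools) and alarm on the standard pre-failure attributes before the drive runs out of spare sectors. The attribute names are the standard SMART ones; the zero thresholds are invented examples:

```python
# Sketch: flag a drive for proactive replacement from `smartctl -A` text.
# Attribute names are standard SMART names; the alarm thresholds here
# (anything above zero) are invented examples, not vendor guidance.
WATCHED = {"Reallocated_Sector_Ct": 0,
           "Current_Pending_Sector": 0,
           "Offline_Uncorrectable": 0}

def smart_alarms(smartctl_text):
    """Return watched attributes whose RAW_VALUE exceeds its threshold."""
    alarms = {}
    for line in smartctl_text.splitlines():
        parts = line.split()
        # Attribute rows have 10 columns; RAW_VALUE is the last one.
        if len(parts) >= 10 and parts[1] in WATCHED:
            raw = int(parts[9])
            if raw > WATCHED[parts[1]]:
                alarms[parts[1]] = raw
    return alarms

sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   095   095   036    Pre-fail  Always       -       24
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""
assert smart_alarms(sample) == {"Reallocated_Sector_Ct": 24}
```

      Wiring this into a daily cron job, with the "cost of not acting" framed for the accountants, is the preventative-maintenance pitch in practice.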

      1. Nick Ryan Silver badge

        Re: "It's a disaster waiting to happen!"

        Been there, done that, even got into "trouble" for estimating staff salary multiplied by data stored on the servers (i.e. their work) and they still didn't care. A year later two drives failed a day apart...

  5. TonyJ Silver badge

    My rule of thumb for a server with a failed RAID disk when I was hands on - make sure you have a reliable backup* before you do anything else. And if it's Exchange, where absolutely possible, stop all of the MS Exchange Services and do an offline backup. Believe me you will thank me later for those few hours of downtime that you are currently cursing.

    *I appreciate it isn't always possible to prove the integrity (or even the overall usefulness) of said backups before commencing work but at least take one before you start and do everything you can to verify it, however little that might actually be - but being able to stand in front of the big cheese and demonstrate that you did everything possible to ensure you covered all bases can be a career saver.

    1. LDS Silver badge

      When I can, I let the RAID rebuild using its firmware utility at boot - no other software pestering the disks while the array is rebuilt, especially disk-intensive software like databases. Of course the system is offline.

  6. Mayday Silver badge
    Facepalm

    RAID is not backup.

    It’s just drive redundancy, with varying degrees of fault tolerance of course.

    Now, when one disc goes, particularly when they’re from the same manufacturer, batch number etc (they need to be identical, right?) that means the rest are about to go. Get on it.

    1. Trollslayer
      Mushroom

      Re: RAID is not backup.

      And if it is the PSU goes...

      1. Peter2 Silver badge

        Re: RAID is not backup.

        Then the other one takes the load and continues going.

        Because servers have two hot swap PSU's.

        1. phuzz Silver badge

          Re: RAID is not backup.

          But it's this moment when you find out that the remaining power supply can't actually cope with all the load, and fails hard.

          Had this on a blade enclosure that had four supposedly-redundant PSUs. I pulled the power out of one PSU to move it to a different UPS, the enclosure tried to shunt all the load to one of the other PSUs, which promptly failed. It claimed to be 1200W, but only worked ok up until about 300W...

          1. Alan Brown Silver badge

            Re: RAID is not backup.

            or you have Capacitor Rot - seen that a few times

            1. Mayday Silver badge
              Alert

              Capacitors

              Indeed.

              A few years ago I was looking after/had inherited a datacentre "thing" for a very large, and very high-profile, customer. A task was to upgrade all of the firewalls to a new version of software for vulnerability and feature reasons. I warned that the age of the firewalls, despite them being under support from the vendor, was such that the capacitors in the PSUs may have dried out and there was a risk that they might not reboot after the upgrades. Naturally I got the whole "yeah sure, what would you know" etc. response.

              I put all of my warnings and recommendations in writing and made sure EVERYONE saw them prior to the upgrade. All of these firewalls had two PSUs, and at least (yes, at least) one in each of them shat itself. Cue frantic support calls to the vendor to get replacement RMA PSUs.

              So, when I got blamed for them all failing and this huge customer cracking the shits, I referred them all to the warnings I sent, the dates they were sent, and what my suggested ways to avoid the issue were. Funnily enough the blame disappeared then.

      2. chivo243 Silver badge
        Go

        Re: RAID is not backup.

        And if it is the PSU goes...

        and disk failure?

        Time to pack it in, and buy new gear...

  7. GlenP Silver badge

    IBM Engineer...

    We were getting disk errors on one drive in the disk array on an AS/400. The engineer was adamant it was a cable fault so to avoid downtime I left my very experienced evening operator and him to sort it after office hours.

    Sheer chaos hit as it wasn't the cable, it was the drive itself which then failed to restart. Of course because it was "just the cable" the engineer hadn't taken the drive out of the array first and as it was only striped, not mirrored, the system was f****d. To say I wasn't happy with either the engineer or the operator, both of whom should have known better, is an understatement.

    Fortunately we had backups but it still took a few days to get everything fully working.

    1. wyatt

      Re: IBM Engineer...

      I take it there were no risks identified in the change control they were working under along with mitigating steps then!?

    2. PM from Hell

      Re: IBM Engineer...

      I have been a Technical Support Manager; my focus was always on maintaining service levels. I could negotiate maintenance windows, but unplanned outages were anathema. I've had many conversations with IBM engineers, and I've always found that, with a bit of digging, asking a few 'what if it's not that' questions would get us to the point where the engineer really wanted to be but wasn't allowed to tell me officially.

      This was 15 years ago, but even then engineers were under huge pressure not to pull spares from stores unless they were definitely going to be used, as the engineering managers were trying to minimize spares costs.

      A call to the Account Manager would normally result in a changed plan, often with the 'if it's not that, there will be a charge' conversation. I think I ended up paying for one disk volume which wasn't required; the new disk was just added to the array, providing some extra space. This was lower than the financial penalties that would have been extracted from my budget for an unplanned outage of that length.

      As the Tech Support Manager it's your job to make sure that the change plan is properly planned, impact assessed and executed. It's also your job, having done that, to support the engineer and operator should things go wrong. There's only one person you should have been angry with, and you see his face in the bathroom mirror every morning. I know the engineer was an IBM employee, but if he is the nominated site engineer he's still a member of 'my' team. Try that approach when working with your vendor support staff, whether it's hardware engineers, network specialists or devs, and you'll find they suddenly get very open with you, as you just want to work with them for a solution, not point the finger of blame.

    3. Anonymous Coward
      Anonymous Coward

      Re: IBM Engineer...

      I hate this excuse. I've seen it way too often.

      How does a cable that is either inside a computer/server or in a building infrastructure fail if there has been no human or possibly animal interaction?

      It has happened, but in my experience, cable faults (barring desk patch cables) are very, very rare.

      1. Gene Cash Silver badge

        Re: IBM Engineer...

        cable faults ... are very, very rare

        Whoo. Let me introduce you to my boi SCSI...

        1. Fifth Horseman

          Re: IBM Engineer...

          Oh God.

          Ultra 2 Low Voltage Differential SCSI. 68-way twisted-pair *insulation displacement cable* with accompanying brain-damaged mini-D connectors. Along with terminators that had to be blessed at a full moon with the blood of a goat to have a chance of working.

          The least fun I have ever had inside a computer. Never again.

        2. Malcolm Weir Silver badge

          Re: IBM Engineer...

          And fiber (or even fibre) is fun!

          The ones that sound like they should be very rare, but in my experience aren't, are ground loops... Shades of "million to one" odds turning up nine times out of ten...

      2. Anonymous Coward
        Anonymous Coward

        Re: IBM Engineer...

        How does a cable that is either inside a computer/server or in a building infrastructure fail if there has been no human or possibly animal interaction?

        When there's money to be made or field service want to point the blame elsewhere. Take a dead PC into PissyWorld and see how long it takes to be told a $$$$ gold plated "digital" cable will fix the problem.

        At one place I worked, a support engineer tried and failed to get me to believe the RAID was randomly dying because cosmic rays were corrupting the controller card memory. Yet those rays somehow didn't corrupt the server's memory because that would have made the OS crash and someone might have noticed that.

        1. Anonymous Coward
          Anonymous Coward

          Re: IBM Engineer...

          Must have had the same book I did - "Upgrading and Repairing PCs" - there was a page about cosmic rays corrupting memory. The solution was to bury the computer under several meters of concrete.

          I also remember reading about an IBM server - if you put in extra memory that had tinned contacts instead of gold plated ones, the keyboard would stop working.

          1. Stork Silver badge

            Re: IBM Engineer...

            But won’t the radioactivity from the concrete do the same?

            1. Doctor Syntax Silver badge

              Re: IBM Engineer...

              Make sure you have old steel for the reinforcing rods. The Belfast C14 dating lab had the counter in a pit with a few concrete paving slabs (no reinforcing rods at all) sitting on top of old steel plates. Back then we had a shipyard just down the road so sourcing that from breakers was quite feasible. https://en.wikipedia.org/wiki/Low-background_steel

              1. Malcolm Weir Silver badge

                Re: IBM Engineer...

                Years ago I learned that the steel in, say, the fleet scuttled at Scapa Flow is intrinsically valuable, as it was made prior to July 16, 1945. After that date, all steel made anywhere on Earth is significantly more radioactive.

                1. MJB7

                  Re: IBM Engineer...

                  The Falklands War started when an Argentinian boat went into one of the old whaling stations on South Georgia and started dismantling it for scrap. They had permission from the Argentinian government, but not from the Falkland Islands government ... so the local magistrate arrested them.

                  Notes:

                  1. South Georgia is a dependency of the Falkland Islands.

                  2. The base commander for the BAS base is sworn in as a magistrate; this is usually a formality.

              2. Stork Silver badge

                Re: IBM Engineer...

                I was thinking of the aggregate material, often granite is used.

            2. Fifth Horseman

              Re: IBM Engineer...

              For conventional DRAM, the plastic packaging of the chip is radioactive enough to flip the odd bit here and there, hence the use of ECC.

            3. Anonymous Coward
              Anonymous Coward

              Re: IBM Engineer...

              No, you'll just need to bury the concrete under more concrete!

              It probably depends on the type of radiation - I'm sure there are people who will correct me, as my knowledge of radiation is from school, but I think some types are slow-moving particles, and it's gamma that is the more energetic one?

              1. swm Silver badge

                Re: IBM Engineer...

                Alpha particles from the case material are more capable of flipping RAM bits. Ceramic cases used to be slightly radioactive and it took a while to discover the problem. High-energy cosmics don't cause as much ionization.

          2. Swarthy Silver badge

            Re: IBM Engineer...

            RE: Tinned memory contacts

            That one actually makes sense. If the mount points were gold (plated), when current runs through a gold/aluminum contact pair they react and you get a weird gold/aluminum alloy: purple, brittle, and very non-conductive, AKA The Purple Rot.

            On reflection, that would take time, and cause intermittent errors first, and I don't know how that would cause the KB to stop.

            1. Will Godfrey Silver badge
              WTF?

              Re: IBM Engineer...

              Ummm, gold/aluminium? Where have you seen that? I've been involved in electronics in one form or another for only a decade... or 5, but that's a new one on me.

              P.S. Except aluminium used in HV distribution lines, and the disastrous use in telephone systems.

              1. Swarthy Silver badge
                Facepalm

                Re: IBM Engineer...

                Ah, my mistake, it's "purple plague".

                Here's a wiki on it

            2. Alan Brown Silver badge

              Re: IBM Engineer...

              I've only ever seen purple plague once - ever - in my entire career.

              An audio IC manufactured in the early-mid 1970s which erupted purple goo when touched with a soldering iron. By that point (late 1980s) it was a good decade past the point where any of my cow-orkers had seen cases of it and even then it was only in older kit

              By the time IBM PCs came along, purple plague was a distant memory

          3. Alan Brown Silver badge

            Re: IBM Engineer...

            "a page about cosmic rays corrupting memory"

            It wasn't cosmic rays that was doing it.

            The incident that led to the urban legend was due to cheaping out on supply chains which resulted in source clay material contaminated with unacceptably high levels of uranium being used in IC ceramic cases - with fairly predictable results (all ceramics contain radioactives, some contain more than others)

            WRT the IBM servers, the contacts may have been tinned, but the culprit was voltage and timing tolerances along with circuit loading. Back then people didn't pay nearly enough attention to such things in digital circuits

      3. Alan Brown Silver badge

        Re: IBM Engineer...

        Ditto. Badly made cables happen but even those tend to only play up when subjected to actual vibration or stress (hint: freezer spray isn't just for components)

        Making matters worse, connectors are only rated for "N" operations and each test cycle brings them that much closer to crapping out (this is why my test rigs frequently have sacrificial leads in them)

      4. swm Silver badge

        Re: IBM Engineer...

        About 50 years ago our college time sharing system failed. It couldn't even load diagnostic tapes. Eventually it was traced to a cable from memory to CPU. So this cable was replaced and worked perfectly - but three other bits did not. Our field engineer had three spare cables at this point which he gingerly connected without cable ties etc. and got the system up. He then put in a call to have ALL cables of this manufacture replaced. A team came the next weekend and replaced hundreds of cables. Did not have a problem thereafter.

        Don't tell me that cables can't go bad.

    4. Anonymous Coward
      Anonymous Coward

      Re: IBM Engineer...

      And let's not mention when your disk controllers have incompatible firmware versions... and as you replace the one failed unit out of the four with a spare that has a newer firmware version and it proceeds to wipe out everything on the disks in a very systematic fashion... I think there's a certain university I'm looking at over there...

  8. Jou (Mxyzptlk) Silver badge

    At least you got a warning!

    I once had a server at a customer who screamed "Why did you do a RAID0! Now my data is lost!".

    Turned out: We did set up a RAID5, but when one disk failed the RAID controller silently switched to RAID0 instead of raising an alarm and turning on the two orange warning LEDs on the server (one for the specific HDD, and one for the general warning). When the second disk failed a few months later the data was *poof* gone. Ended up in a nice little "Supplement document" from the vendor about that bug, and how urgent it was to update specific RAID controllers with specific firmware versions as soon as possible.

    1. Sandtitz Silver badge
      Unhappy

      Re: At least you got a warning!

      customer who screamed "Why did you do a RAID0! Now my data is lost!"

      I once had a small customer who spec'd the cheapest Proliant ML310 server with two IDE (or SATA?) drives and insisted on RAID0 despite protests from me and my colleagues. I ended up honouring the request for RAID0, but I also slipped a note inside the server for future administrators to show it was the customer's idea!

      the RAID controller silently switch to RAID0 instead of raising an alarm

      Sounds made up so must be true! Which vendor/controller was this? Name and Shame please.

      1. Jou (Mxyzptlk) Silver badge

        Re: At least you got a warning!

        Was long ago, when Server 2003 was the newest shit: Fujitsu Primergy, but nearly the cheapest option available, which uses the onboard LSI SCSI ports with an additional LSI chip in a specific PCI64 (yes, without -e) slot for the RAID5 logic. At least the SCSI was connected to a real SCSI SCA backplane and the HDDs were real hot-plug. The lowest possible end for a real hardware RAID5. I never liked that setup instead of shelling out 100 bucks more for a real RAID controller which can do everything on its own, but at that time I was too young and dumb to speak up. They were slow, of course. And, except for that one f*up, quite reliable.

    2. Jay 2

      Re: At least you got a warning!

      I'd be interested to know what make/model that controller was. A definite case of "you had one job" for the controller.

  9. Terry 6 Silver badge

    Nothing exists unless it has a minimum of one independent backup. That is the absolute minimum, for the simplest of systems, before you can say you have any data. Less than that and you just have a hope that your data is there.

    This was true when I was semi-professionally supporting educational colleagues 30 odd years ago before we were given proper IT support. And it's still true now. Maybe it always will be, cloud or no cloud.

  10. ColinPa Silver badge

    Disks do what they are told. You still need backups.

    RAID is not always the answer. Disks will do what they are told, so if an operator says 'delete this file', or 'reformat this disk', the disk subsystem will do it. If it is configured for RAID, it will do it very reliably.

    You still need backups.

  11. John Brown (no body) Silver badge

    The server was ok - after all, it could handle a failed disk.

    Clearly the server was NOT ok because in the previous paragraph we are told the server DIED, and that was what started the whole fiasco in the first place. The fact a second disk died during the replacement of the failed disk is just icing on the cake. If this really was a properly configured RAID array with redundancy, and not just RAID0/JBOD, the server would have carried on working, and simply replacing the failed disk to rebuild the array would not have been the solution anyway.

    Either the original story smells a bit or the re-write by El Reg got the order of events mixed up.

    1. John Brown (no body) Silver badge

      Re: The server was ok - after all, it could handle a failed disk.

      Okay, downvoters, tell me why I'm wrong instead of just hitting a button and moving on. A server died and the diagnosis is a failed HDD in a RAID array, where the solution is to replace the failed disk. Why did the server die? The whole point of RAID is that a failed disk, other than in a RAID0 config, shouldn't bring the server down; it pootles along in a degraded state. The fact they thought replacing the failed HDD would be "the fix" means they were running with redundancy, otherwise they'd already be looking for the backups (ignoring, for the time being, that a 2nd disk failed after the fact, during the array rebuild, which is not pertinent to the initial server death). I suppose it's feasible that the disk failed in such a way that it caused the controller to have a fit, but I've never seen that happen before, and if that was the case, on an Outlook mail server, there WILL be data corruption, so again, a simple HDD replacement is not going to fix it.

      1. Doctor Syntax Silver badge

        Re: The server was ok - after all, it could handle a failed disk.

        "ignoring, for the time being, that a 2nd disk failed after the fact, during the array rebuild and is not pertinent to the initial server death"

        Not one of the downvoters but AFAICS the above statement is the problem. As I read the whole story there was no initial server death. It was the failure of the 2nd disk that was labelled as the death of the server.

      2. Anonymous Coward
        Anonymous Coward

        Re: The server was ok - after all, it could handle a failed disk.

        If a second disk in a RAID 5 fails before the first has been replaced and rebuilt, then your array is toast.

        CompTIA do some good beginners' courses you may be interested in.
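        The arithmetic behind that limit is easy to sketch. Here is a toy XOR-parity illustration in Python (nothing like a real controller's code, which stripes and rotates parity across the drives): RAID 5 keeps one parity block per stripe, so it can solve for exactly one missing drive and no more.

        ```python
        # Toy RAID 5 parity demo -- shows why one failed drive is
        # recoverable but two are fatal.
        from functools import reduce

        def parity(blocks):
            """XOR the blocks together byte-by-byte."""
            return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

        data = [b"AAAA", b"BBBB", b"CCCC"]   # three data drives
        p = parity(data)                     # one parity drive

        # One drive lost: XOR the survivors with parity and the data comes back.
        assert parity([data[0], data[2], p]) == b"BBBB"

        # Two drives lost: one parity equation, two unknowns -- nothing left
        # to XOR against, so the array really is toast.
        ```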

        1. John Brown (no body) Silver badge

          Re: The server was ok - after all, it could handle a failed disk.

          According to the article...

          "One of the Exchange Servers had abruptly died. "Consequently, half of the constabulary had lost email!"

          Eventually it transpired that a disk in one of the servers had failed the day previously."

          And then later on...

          "The replacement was popped in and a rebuild was started.

          "While in the process of rebuilding the array, a second disk decided it was going to join the party and while not quite failing..."

          The timeline seems quite specific in the article. As I said, the 2nd disk failure is not relevant to "the server died and it was all down to a single failed HDD in a RAID"

          1. doublelayer Silver badge

            Re: The server was ok - after all, it could handle a failed disk.

            The article has extra text indicating the real timeline, which you have missed. Here is the timeline in its original form.

            Day 1: Hard drive 1 fails, server stays up.

            Somewhere in the middle, probably day 2 morning: Team inserts a new drive into array to recover.

            Day 2 noon: Server team leaves for party, desktop team comes in to manage things, repair still in progress.

            Day 2 afternoon: Hard drive 2 fails, server goes down.

            For the desktop team, it was the first thing they saw with the server, as they didn't put in the new drive. Since Sam was on that team, that was his first knowledge. It was the second drive that did it. The article has the events out of order, and the clue was "Eventually it transpired that [...] the day previously." We're getting the events from Sam's point of view, and he wasn't there from the beginning.

          2. Malcolm Weir Silver badge

            Re: The server was ok - after all, it could handle a failed disk.

            Nope:

            Day 1: drive failed, sabres rattled.

            Day 2: new drive installed, rebuild started, server team goes to Christmas party

            Day 2 also: latent fault revealed on a second drive, rebuild aborts and (likely) the controller removes the second drive from the array, leaving it inoperable and probably without a way to shove it back into the degraded set so you can at least try to get a backup...

            Also, earlier you mentioned you didn't know of cases where a drive failure caused a controller to have a fit... <insert hollow laugh here>.

            These types of failures are horribly common (for developers), because they are extremely hard to replicate and debug. With the old shared-bus setups like SCSI it's quite easy to imagine drives behaving badly on the bus, but many controllers with point-to-point connections (SATA, SAS, FC, etc) have shared resources, so creative failures on Drive A bugger up Drives B through D by starvation. And my most common nightmares involved drives disappearing, then recovering _while you're in the process of removing them from the set_, leading to a completion event on a drive that you're trying to mark as killed and for which you've therefore released resources, while the sodding "intelligent" controller has a cached DMA address that it's happily going to use regardless...

  12. NXM

    You need more than one

    I had a RAID NAS drive whose controller board got blown up by a lightning strike on the phone line that propagated through the network cables (*). I thought, 'oh, no problem, I'll put the discs in another drive.' Would it read them? Of course it bloody wouldn't.

    I had other backups. But I air-gapped everything with a wifi extender from the router after that.

    (*) Apart from the NAS it killed the router, 2 switches, 2 DECT base stations, and the network interface in a computer. Could've been worse though, and it wasn't even a direct strike on the line.

    1. John Brown (no body) Silver badge

      Re: You need more than one

      "But I air-gapped everything with a wifi extender from the router after that."

      I had to read that again to realise you meant an electrical air-gap between the Internet and your kit rather than the more usual meaning in IT circles of "not on the network AT ALL" :-)

    2. Jou (Mxyzptlk) Silver badge

      Re: You need more than one

      Those NAS boxes run Linux. Pop the discs into another Linux box and the RAID should be recognised. If you have only two drives (i.e. RAID1) even one disk is enough. Plug in, mount, copy, done.

      1. Robert Carnegie Silver badge

        Re: You need more than one

        These are hard disks that were struck by lightning - that's a special case.

  13. 43300

    Well, personally if it's a choice between watching a RAID rebuild or going to a Christmas party I'd prefer to watch the RAID!

  14. JulieM

    Pre-emptive swap

    My preferred technique is to replace one of the drives in a RAID1 after a year. Then at least I know the two are not from the same batch.

    Just never, ever use a hardware RAID controller. They use nasty, proprietary disc formats that are unreadable without the correct RAID card; and have an unfortunate tendency to attempt to rebuild an array by copying the new, pristine drive over the one that used to have all the data on it.

    1. gryphon

      Re: Pre-emptive swap

      Or it says it's rebuilt the mirror but actually hasn't, so if disk 1 has been replaced and disk 2 then fails, bye bye array.

      Looking at you HP Smart Array.

      Thankfully caught that when I joined a new company since one of the first things I looked at was firmware versions on servers, RAID controllers etc.

      Listed as a critical issue on release notes but nobody had noticed up to that point.

    2. John Brown (no body) Silver badge

      Re: Pre-emptive swap

      "proprietary disc formats that are unreadable without the correct RAID card"

      Or even, the correct RAID card but the wrong firmware revision. Nasty memories of an ancient Compaq server many years ago :-/

      1. Malcolm Weir Silver badge

        Re: Pre-emptive swap

        A competitor of the company I worked for in the 90s was bought up and merged, and we were proper impressed to see they'd bothered to properly partition their drives, so the first 64 sectors of each disk included a standard-ish partition table that could be read by FDISK and the like, with the RAID configuration tucked in following the table (not in a partition, which was a bit half-baked), and then the RAID storage space marked off as a partition.

        Snag (to us) was that each drive access had to be manipulated to add the 64 sectors, which didn't thrill us as we spent a _lot_ of energy optimizing cylinder/head alignment, and this made that messy.

        [ For people going "huh?", when you can say to Seagate "build us a new model HDD, we'll buy all you can make for a year..." you get to be able to do all sorts of optimizations that regular customers can't know about! ]

    3. Alan Brown Silver badge

      Re: Pre-emptive swap

      "They use nasty, proprietary disc formats that are unreadable without the correct RAID card"

      Not as unreadable as you may think, thankfully.... (Linux to the rescue)

  15. Boris the Cockroach Silver badge
    FAIL

    Another old

    conversation

    IT bod : the disk is failing, bad sectors everywhere, only a matter of time before it dies

    manglement: It still runs

    IT bod : look, just schedule an engineer to come out, do a backup and swap the failing drive over for a good one, it'll take 2 hrs

    Manglement : costs too much, the server/computer/industrial plant is needed... can't afford downtime.

    IT bod : look, just let me...

    Manglement : Nope nope nope nope not listening !

    <Drive dies 4 days later>

    manglement : WHADDYA MEAN ITS GOING TO TAKE 3 DAYS TO GET AN ENGINEER IN ? FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT FIX IT ....... etc etc etc

    If you've not had the above played out in front of you, are you even a real IT bod ?

    1. John Brown (no body) Silver badge

      Re: Another old

      "If you've not had the above played out in front of you, are you even a real IT bod ?"

      Yes, because I'm the contractor who gets sent to fix it in 3 days, not the one worrying about whether your PHB is a penny pinching moron. :-)

      If you want same or next day service, you need to take out a maintenance contract, otherwise you'll be further down the list of priorities than our paying customers. (Although will always try our best to get there ASAP because you might turn into a contract customer, but we'd NEVER push an off-contract job above a contract job.)

    2. Fred Daggy Bronze badge

      Re: Another old

      Next step is to get old and cranky and to start to learn the management lingo.

      Throw in terms like "preventative maintenance schedule cycle" and "service delivery optimisation alignment window" and you will have your planned maintenance in a few seconds. A few meaningless graphs of before and after and you're half way to a promotion to PHB.

    3. Alan Brown Silver badge

      Re: Another old

      After the second response, ALWAYS ask to get that stance in writing

      It prevents them from attempting to throw you under a bus when it actually happens

      1. Fred Daggy Bronze badge

        Re: Another old

        Point of order: it does nothing to stop them from throwing you under the bus. What it means is that you can throw them under a faster, heavier bus when they do it.

        Of course, the next level up will then just ask "why didn't you escalate?" - thereby throwing both PHB and you under a bus so big it can be mistaken for a "B" ark.

        If you do ask, I believe Murphy was an optimist.

  16. Eclectic Man Silver badge
    Facepalm

    Management fail

    They're not related to https://www.theregister.com/2022/07/11/aerojet_cybersecurity_whistleblower/ are they?

    I mean, repeated warnings of 'this really is going to fail at some time unless you do it properly first', etc. etc., do seem a common, umm, failing.

    Oh what's the point?

  17. Anonymous Coward
    Anonymous Coward

    Lemme guess

    Dell PERC?

  18. BenDwire Silver badge
    Pint

    ... caused enough disturbance in the force

    I see what you did there, although that may well have gone unnoticed by Left Pondians.

    Beer for all as it's the only decent thing to keep cool in a heatwave.

    1. TRT Silver badge

      Re: ... caused enough disturbance in the force

      Official vocab guidelines state that service is now the preferred term as force is deemed too aggressive.

  19. Montreal Sean

    Never utter the words

    "What could possibly go wrong?"

    Or for that matter, even think them.

    The moment they are thought or spoken, all hell will break loose.

  20. Plest Silver badge
    Unhappy

    Fricking hate DR tests...

    ...however nothing beats a Saturday watching the app support teams crap their pants twice a year at the thought of having to reboot their precious servers to prove to the DR admins that services can be taken down and brought back safely.

    As a unix admin I do envy the Windows admins their regular reboot cycles; they exercise the system boots and startup scripts. You just don't reboot unix boxes unless you have to, which is not often, and then it's a mad panic to patch up the init.d/services configs 'cos you forgot to check something you did 9 months ago!

    "I didn't choose the IT life, it chose me."

  21. Stuart Castle Silver badge

    No support stories, apart from one where my old boss (who was brought up Muslim but never followed the Qur'an) told a story of one of our users who phoned him up on his mobile on Christmas day to complain about a problem. The convo went as follows:

    Boss: "It's Christmas"

    User: "You are Muslim and don't celebrate Christmas"

    Boss: "You aren't, and do celebrate Christmas".

    And yes, while my boss was Muslim, he did celebrate Christmas. Not because he was brought up to, but because he had a wife and child who were not Muslim.

  22. Stuart Castle Silver badge

    I've had a couple of Christmas parties where it would have been infinitely more interesting to watch a RAID rebuild than go to the party.

    Usually what happens is that each team member hangs around with their team, eating, drinking and having fun. There is usually a christmas meal, and we sit and eat with our own teams. After the meal, we usually head for a local pub and carry on the drinking, often with other teams.

    Every few years, the admin office (who organise the parties) come up with the wonderful idea of mixing up the teams, the idea being that this sort of thing promotes inter-team relations. It really doesn't. You end up stuck on a table with a bunch of people you don't really know for 2 hours. Admittedly, I'm quite friendly and good at helping people I don't know, but I'm not too comfortable with the kind of chit-chat expected at a table. The whole situation is usually painful, embarrassing and hated by most of the party-goers.
