HPE fixes another SAS SSD death bug: This time, drives will conk out after 40,000 hours of operation

HPE has told customers that four kinds of solid-state drives (SSDs) in its servers and storage systems may experience failure and data loss at 40,000 hours, or 4.5 years, of operation. The IT titan said in a bulletin this month the “issue is not unique to HPE and potentially affects all customers that purchased these drives.” …

  1. Anonymous Coward
    Anonymous Coward

    That seems rather specific. Is there some sort of self-destruct code in the firmware they forgot to remove?

    1. Version 1.0 Silver badge

      When I started building RAID systems, I quickly discovered that you should not buy a bunch of disks from the same vendor, always use different batches because otherwise you often get every disk failing within a couple of weeks of each other down the line.

      1. robidy

        It's in the name, never buy HP(E) disks - RAID redundant array of inexpensive disks.

        1. Version 1.0 Silver badge

          For home users, if you are going to buy a NAS, then buy a pair of them and swap the disks to get non-sequential serial numbers to each NAS.

          You can then get a safe network backup by making the second NAS invisible on the network and use rsync to pull an incremental copy of all the data from the visible NAS every night.
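A minimal sketch of the nightly pull described above, assuming the backup NAS can run Python and reach the visible NAS over ssh. The user name, host name, and paths are hypothetical placeholders; only the rsync options `-a`, `--delete`, and `-e` are real flags.

```python
import shlex


def build_rsync_pull_command(source_host, source_path, dest_path, user="backup"):
    """Build an rsync command that pulls data from the visible NAS.

    All names here (user, host, paths) are hypothetical placeholders.
    """
    # Trailing slash on the source means "copy the contents of this directory".
    remote = f"{user}@{source_host}:{source_path.rstrip('/')}/"
    return [
        "rsync",
        "-a",         # archive mode: recurse, keep permissions, times, symlinks
        "--delete",   # remove files on the backup that were deleted at the source
        "-e", "ssh",  # transport over ssh, initiated by the backup NAS
        remote,
        dest_path,
    ]


if __name__ == "__main__":
    cmd = build_rsync_pull_command("nas1.local", "/volume1/data", "/volume1/backup/")
    # Print the command for a scheduler (e.g. cron) rather than running it here.
    print(shlex.join(cmd))
```

Note that `--delete` propagates deletions to the copy, so it is a mirror rather than a history. Drop the flag, or keep dated snapshots, if accidental deletion is part of the threat model.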

          1. batfink

            Good! Someone else as paranoid as I am about backups :)

          2. Stoneshop

            For home users, if you are going to buy a NAS,

            Buy a bare unit, and buy the disks from two sellers. Preferably multiple brands per order as well if the NAS doesn't object (it shouldn't). At the very least split the order between the NAS plus half the disks, and the other half a week or two later if you want to stick with one seller.

        2. Anonymous Coward
          Anonymous Coward

          HPE don't make disks/SSDs; they brand them.

          These are WD SSDs.

          Additionally, if you are running HPE Storage then to maintain your service agreement you HAVE to use HPE-branded disks.

          1. robidy

            The cost saving more than pays for additional hardware. Not sure if you're operating at scale but if so, paying between 8x and 20x the street price for enterprise grade drives is madness. Use the saving for redundancy.

      2. hoola Silver badge

        If you are using enterprise hardware then you get what you get. There is no option to specify the manufacturer of the actual disk in the carrier. In fact they are usually all the same. If you are really concerned then you do a rolling replacement at various points in the life of the system. Though how you do that with 1000 disks in a storage solution is interesting.

        As ever, what you do with a generic box, array control and a handful of disks is not the same at enterprise scale.

      3. Alan Brown Silver badge

        "I quickly discovered that you should not buy a bunch of disks from the same vendor, always use different batches "

        Something that seems "lost" on vendors pushing arrays into enterprise. It's extremely difficult to convince 'em to supply anything other than a boxful of identical HDDs with sequential serial numbers to drop into the chassis. Bad enough when it's a small array, a potential nightmare scenario if you're up to servers with 60+ drives in 'em. (Been there, done that - looking at you, HP)

        1. elip

          Seems to be the case with my old EMC array, but all of my Nimble arrays have non-sequential serial #s and diff vendors across built-in shelf and expansion. Choose your vendor wisely.

        2. A Non e-mouse Silver badge

          I was at a recent vendor sales promo (Unfortunately, can't remember who, maybe Netapp?) and they said that when you order discs, the order is fulfilled from multiple batches (& maybe suppliers?) to help reduce this risk.

        3. robidy

          Quite, one of the first things to ask for is a drive serial number report followed by a list of drives replaced in the past 12 months.
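A rough sketch of how a serial number report could be screened for same-batch drives. It assumes serials end in a numeric counter and that consecutive counters suggest one production batch; both are illustrative assumptions, since real serial formats differ by manufacturer. The function name and sample serials are made up.

```python
import re


def sequential_runs(serials, min_run=3):
    """Return runs of serials whose trailing digits are consecutive.

    Assumes a hypothetical format: arbitrary prefix followed by digits.
    """
    parsed = []
    for serial in serials:
        match = re.search(r"^(.*?)(\d+)$", serial)
        if match:
            parsed.append((match.group(1), int(match.group(2)), serial))
    # Sort so that drives sharing a prefix sit next to each other in order.
    parsed.sort(key=lambda item: (item[0], item[1]))

    runs, current = [], []
    for prefix, number, serial in parsed:
        extends = (
            current
            and current[-1][0] == prefix
            and number == current[-1][1] + 1
        )
        if extends:
            current.append((prefix, number, serial))
        else:
            if len(current) >= min_run:
                runs.append([item[2] for item in current])
            current = [(prefix, number, serial)]
    if len(current) >= min_run:
        runs.append([item[2] for item in current])
    return runs


if __name__ == "__main__":
    report = ["ZX1003", "ZX1001", "QQ0500", "ZX1002", "ZX2000"]
    for run in sequential_runs(report):
        print("possible same-batch drives:", run)
```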

    2. Anonymous Coward
      Trollface

      MTBF

      4.5 years of continuous server use or 9 years of playing Candy Crush. Brings back memories of MTBF.

      Brush up your backup skills.

    3. druck Silver badge

      That seems rather specific. Is there some sort of self-destruct code in the firmware they forgot to remove randomise sufficiently? FTFY

      1. gnasher729 Silver badge

        It's a bug where a huge circular buffer is used, which needs to start back at the beginning of the buffer after 40,000 hours of operation, and the code checking for the condition is wrong by one. Since they are telling people now about this, which means they will update the firmware or be responsible for it, I'd assume this is entirely bad luck.
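To illustrate the kind of mistake the comment above describes, here is a toy ring buffer with one slot per power-on hour and a wraparound check that is off by one. This is not the actual firmware, whose details are not public in this thread. The class, its `buggy` flag, and the tiny capacity are all assumptions made for the example.

```python
class CircularLog:
    """Fixed-size ring buffer with one slot per power-on hour (illustrative)."""

    def __init__(self, capacity, buggy=False):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.next_index = 0
        self.buggy = buggy

    def append(self, entry):
        # In the buggy variant this raises IndexError once the buffer has
        # filled, standing in for a firmware fault.
        self.slots[self.next_index] = entry
        self.next_index += 1
        if self.buggy:
            # Off by one: '>' lets next_index reach capacity, one past the end.
            if self.next_index > self.capacity:
                self.next_index = 0
        else:
            # Correct: wrap as soon as the index equals capacity.
            if self.next_index >= self.capacity:
                self.next_index = 0


def hours_until_failure(log, max_hours):
    """Return the hour at which append fails, or None if it never does."""
    for hour in range(1, max_hours + 1):
        try:
            log.append(hour)
        except IndexError:
            return hour
    return None


if __name__ == "__main__":
    # Capacity is scaled down from 40,000 so the demo runs instantly.
    print(hours_until_failure(CircularLog(8, buggy=True), 100))
    print(hours_until_failure(CircularLog(8, buggy=False), 100))
```

The point of the sketch is that the buggy version behaves perfectly until the first wrap, which is why a test that never simulates a full buffer would not catch it.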

        1. Version 1.0 Silver badge
          Boffin

          That's an example of my rule, "Never attribute a bug to malice when stupidity works" - circular buffers are often treated (and coded) as "easy solutions" when in fact they are complex as soon as you start actually using them.

          1. Wayland

            "Never attribute a bug to malice when stupidity works" - you're too soft.

        2. Michael Wojcik Silver badge

          I wouldn't say it's entirely bad luck. That's a condition that could be simulated in testing. I'd say the manufacturer did not do a proper job of testing their firmware.

          1. Anonymous Coward
            Anonymous Coward

            "I'd say the manufacturer did not do a proper job of testing their firmware."

            Software "corner turning" has always been fraught with potential bugs. LIFO and FIFO buffers - even if not circular eg simple backspacing on a keyboard input.

    4. Anonymous Coward
      WTF?

      Purposely?

      For once there's an HPE article where it's someone else's fault.

      I assume it's an attempt by SSD manufacturers to hard-code their assumed maximum number of read/writes. And like so many companies, they didn't bother to be honest with their users, because it would hurt sales.

    5. Wayland

      32768 a number I remember well. It was the location of the first character on a Commodore P.E.T display.

      POKE 32768,65

      would put an A there

      It's actually 2^15, so the 16th address line (A15) on a 6502

      65 being the ASCII code for A

      It sounds like there must be some sort of clock inside these SSDs, because why else would it be counting hours? Also perhaps it's supposed to die when that number runs out but it was either supposed to be a bigger number or a slower count.
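A quick check of the address arithmetic in the comment above, using nothing beyond Python built-ins. The constant name is made up for the example. Note that 32768 is 2 to the 15th power, while 2 to the 16th is 65536.

```python
SCREEN_BASE = 32768  # first screen cell on the PET, per the comment above

# 32768 is 2 to the 15th power; 2 to the 16th is 65536.
print(SCREEN_BASE == 2 ** 15)
print(2 ** 16)

# A 16-bit value with only the top bit set, so address line A15 is the one high.
print(hex(SCREEN_BASE), SCREEN_BASE.bit_length())

# 65 is the ASCII code for "A".
print(chr(65))
```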

  2. Anonymous Coward
    Anonymous Coward

    Oh well...

    Delete as appropriate

    - Brilliant timing given Covid-19

    - A great way to screw more dosh out of your customers with planned obsolescence.

    - Others not listed above

  3. Anonymous Coward
    Anonymous Coward

    just like the printer cartridges

    Just like the inkjet cartridges. They expire a couple of months after the date even when unused and completely full.

    1. Michael Wojcik Silver badge

      Re: just like the printer cartridges

      Well, no, it isn't.

      The inkjet cartridges are planned obsolescence, and they self-destruct on a programmed date, regardless of how much they've been used. The SSDs fail after doing a certain amount of (presumably useful) work, and if the comment above regarding a circular buffer is accurate, it's an actual mistake in the firmware (albeit one that should never have made it out the door).

      Inkjet cartridges (and inkjet printers) are a scam. This is a stupid bug.

      And, of course, HPE doesn't sell inkjet cartridges; that's HP Inc.

  4. Anonymous Coward
    Anonymous Coward

    Oh look it's this thread again.

  5. Anonymous Coward
    Anonymous Coward

    Dell

    Dell sent me an email with the service tag of an affected server I look after. Nice email with the tag in a url to the page to download the drivers, and the name of what you needed.

    So good on them. But then again they sell servers with 5 years warranty, so would not want them all failing in 4 years 200 days. But then I would not want all the drives to fail at the same time either.

    1. SJA

      Re: Dell

      Just make sure you don't only use those SSDs in your raid / zfs mirror/raidz and be happy that you get new devices before warranty ends.

  6. J__M__M

    I've always said

    it's best to build arrays with non-matching drives. And by always I mean never.

  7. SJA

    Planned obsolescence?

    This sounds to me rather like planned obsolescence - making products unusable after a while. The blunder seems to be that they give 5y warranty and that all fail at the same specific time instead of randomizing it.

    1. robidy

      Re: Planned obsolescence?

      Probably the bit of legacy code that made the drive go read-only when SSDs first came out, before they worked out SMART could tell you when to stop writing to a disk.

    2. Alan Brown Silver badge

      Re: Planned obsolescence?

      "The blunder seems to be that they give 5y warrant"

      Few customers buy 5 year support. 3 years is standard from most vendors unless you push hard for more (and going to 5 years usually adds a 100% premium over 3 year support contracts)

      Ergo most vendors aren't going to give a flying F* and warranty beyond that point is your problem

      (worse - most drive makers sell OEM drives with _only_ the warranty provided by the vendor - so if you bought 3 year support on a system containing drives with a supposed 5 year warranty, then 3 year warranty is what you actually get)

    3. Anonymous Coward
      Anonymous Coward

      Re: Planned obsolescence?

      Based on what people are saying about a circular buffer, it looks like it might be better to call it "Unplanned obsolescence" as it apparently wasn't intentional.

  8. Anonymous Coward
    Anonymous Coward

    Can someone explain why an SSD needs a clock.....

    ....and if it isn't a clock, then what is it counting? And why would 40,000 turn out to be a magic number?

    *

    If it's a matter of maintenance data being needed ("How old is this drive?", etc), then what's wrong with writing a date and time stamp record somewhere at install time?

    1. Wayland

      Re: Can someone explain why an SSD needs a clock.....

      It's probably 40k hours where k = 1024

      That's 32k + 8k
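The numbers in this sub-thread can be checked directly. The sketch below converts 40,000 hours into years and days (ignoring leap days, which is an assumption) and confirms that 40k with k = 1024 is 40,960, a different figure from the 40,000 quoted in the bulletin. The function and constant names are made up for the example.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: leap days ignored


def hours_to_years_days(hours):
    """Split a power-on hour count into whole years and remaining days."""
    total_days = hours / HOURS_PER_DAY
    years = int(total_days // DAYS_PER_YEAR)
    days = total_days - years * DAYS_PER_YEAR
    return years, days


if __name__ == "__main__":
    years, days = hours_to_years_days(40_000)
    print(f"40,000 hours is about {years} years and {days:.0f} days")

    # 40k with k = 1024 is 32k + 8k, which is not the same as 40,000.
    print(40 * 1024, 32 * 1024 + 8 * 1024, 40 * 1024 == 40_000)
```

On these assumptions 40,000 hours comes to roughly 4 years and 207 days of continuous operation, close to the "4.5 years" in the article summary.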

  9. Anonymous Coward
    Anonymous Coward

    HPE MTBF

    "HPE has told customers that four kinds of solid-state drives (SSDs) in its servers and storage systems may experience failure and data loss at 40,000 hours, or 4.5 years, of operation."

    Looks like a freaking unusually high MTBF for any HPE gear, these days!

    Last Superdomes I've seen installed were done with HPE staff connected to them 24X7 for a couple of weeks before they ran by themselves ...

    1. Wayland

      Re: HPE MTBF

      We don't have 12 years left, more like 3, so why bother? Maybe TPTB have had to put back the end of the world, too ambitious.

  10. Anonymous Coward
    Anonymous Coward

    I installed HPE's fix...

    ...and all of a sudden my Laserjet needed more ink, now priced at £32,768 per cartridge. I was somewhat surprised to say the least with HP reducing the price of ink in these troubled times.
