Gospel according to HPE: And lo, on the 32,768th hour did thy SSD give up the ghost

Using an HPE solid-state drive? You might want to take a look at your firmware after the computer outfit announced that some of its SSDs could auto-bork after less than four years of use. The problem affects "certain" HPE SAS Solid State Drive models once the 32,768th hour of operation has been reached and, frankly, is a bit …

  1. Jeffrey Nonken

    Not just an integer; a signed integer. Unsigned would be good to 65535.
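
    To put numbers on it - a minimal sketch in C, assuming the counter really is a plain 16-bit quantity (an inference from the 32,768 figure; HPE hasn't published the actual type):

    ```c
    #include <assert.h>
    #include <stdint.h>

    int main(void) {
        /* A signed 16-bit hour counter tops out one hour short of 32,768:
           the 32,768th hour is the first value it simply can't hold. */
        assert(INT16_MAX == 32767);

        /* An unsigned 16-bit counter would have lasted to 65,535 hours -
           roughly 7.5 years of continuous operation instead of about 3.7. */
        assert(UINT16_MAX == 65535);

        return 0;
    }
    ```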

    1. James O'Shea

      You beat me to it but were more restrained. I had to cool off a little before posting, the original version of my post would almost certainly have been moderated. Something about stakes and large quantities of gasoline. Or, alternatively, the use of a spoon the way Alan Rickard would have used it in the best line in _Robin Hood: Prince of California_.

      1. jmch Silver badge

        +1 for the "use of a spoon" quote, but surely that was Alan RickMAN?

        1. James O'Shea

          Yes, it was Rickman. Bloody stupid autocorrect struck, and I didn't notice in time. Whoever invented autocorrect also needs the spoon treatment.

    2. Steve Aubrey
      Joke

      "A signed integer" because - of course - you never know when your SSD might be used in a time machine.

      1. Ken Moorhouse Silver badge

        Re: you never know when your SSD might be used in a time machine.

        For that you need more than a signed integer.

        How about BCD? Not sure what the "D" represents... Before Christ's Death perhaps.

        1. Will Godfrey Silver badge
          Facepalm

          Re: you never know when your SSD might be used in a time machine.

          Not just a signed integer but a signed short integer. A standard 32-bit one would be:

          2,147,483,647

          1. T. F. M. Reader

            Re: you never know when your SSD might be used in a time machine.

            A short integer counting hours... Highly unusual in computer firmware, IMHO...

            1. Anonymous Coward
              Anonymous Coward

              Re: you never know when your SSD might be used in a time machine.

              Not at all if you think of it as planned obsolescence.

              Give it a lifetime of 3 years, and when it borks itself after that you say: too bad, you should have swapped it when the planned lifetime was used up.

              Because there's no other rational reason to do that.

              1. Stoneshop
                Holmes

                Oh really?

                Because there's no other rational reason to do that.

                Never attribute to malice what can be adequately explained by stupidity.

          2. MJB7
            Windows

            Re: you never know when your SSD might be used in a time machine.

            "A standard 32bit [integer]" - Get off my lawn you young rascals! I'll have you know that C and C++ standards still don't require more than 16 bits in the <tt>int</tt> type.

            Icon: About the right age.

            1. Roland6 Silver badge

              Re: you never know when your SSD might be used in a time machine.

              >"A standard 32bit [integer]" - Get off my lawn...

              Agree, anyone who knew their K&R would know that 2,147,483,647 is the minimum value of a long (signed int), although depending on machine architecture and compiler options, it might actually reserve more than 32 bits of memory (so values align on byte/word boundaries to improve memory access efficiency).

              1. DJO Silver badge

                Re: you never know when your SSD might be used in a time machine.

                "2,147,483,647 is the minimum value of a long"

                Er, wouldn't that be a bit restrictive?

                -2,147,483,648 is the minimum value if signed, 0 if not. Perhaps you meant "Maximum" value - or am I missing something subtle?
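
                A quick check of what the C standard actually guarantees - minimum *magnitudes*, not exact widths; the real widths on any given machine are implementation-defined, and any conforming compiler will pass this:

                ```c
                #include <assert.h>
                #include <limits.h>

                int main(void) {
                    /* The standard only guarantees minimum ranges:
                       int must span at least -32767..32767 (16 bits),
                       long at least -2147483647..2147483647 (32 bits).
                       Real implementations may, and usually do, give more. */
                    assert(INT_MAX >= 32767);
                    assert(LONG_MAX >= 2147483647L);
                    return 0;
                }
                ```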

        2. Anonymous Coward
          Anonymous Coward

          Re: you never know when your SSD might be used in a time machine.

          Decimal. BCD was a format used to reduce the size of code in early machines - each nibble could contain the digits 0-9 only. Some CPUs had support for addition and subtraction of BCD-coded numbers. Great for accountancy, where floating point conversions were rare in those days.

          1. Merrill

            Re: you never know when your SSD might be used in a time machine.

            The IBM System 360 Model 20 in the story about the recent move could only do 16-bit integer add, subtract and compares, but it could do the full set of arithmetic instructions on packed decimal of up to 31 digits plus sign.

            1. Anonymous Coward
              Anonymous Coward

              Re: you never know when your SSD might be used in a time machine.

              The middle word of International Business Machines is relevant here.

          2. Suricou Raven

            Re: you never know when your SSD might be used in a time machine.

            Many CPUs still do: BCD operations are still part of the x86 ISA. Not sure how often they get used though.

          3. TeeCee Gold badge

            Re: you never know when your SSD might be used in a time machine.

            in early machines

            The AS/400 / IBM i series still does it and it's inherent in EBCDIC[1] systems. I also used to think it was an "old" thing, until I found out that INFORMIX stores its decimal data types the same way.

            Subsequent investigation showed that this is still a common storage method where fixed numeric precision is critical. That's pretty much anywhere where the numbers represent money.

            [1] Extended Binary Coded Decimal Interchange Code

          4. aks

            Re: you never know when your SSD might be used in a time machine.

            We always referred to it as packed decimal. Two decimal nibbles in one byte with the final nibble being the sign. C for positive, D for negative and F for unsigned. Eight bytes is 16 nibbles giving 15 digits plus a sign. That datatype exists in SQL on all platforms. I'm not sure if it's implemented in all providers but certainly in Oracle and SQL Server. The sign related to the encoding of 80-column Hollerith cards.
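
            A sketch of the packed-decimal layout described above - two digits per byte, with the sign in the final nibble (0xC positive, 0xD negative, 0xF unsigned). The function name here is illustrative, not from any particular mainframe API:

            ```c
            #include <assert.h>
            #include <stdint.h>
            #include <stdlib.h>

            /* Decode a packed-decimal field: digit nibbles followed by a sign
               nibble. 0xC = positive, 0xD = negative, 0xF = unsigned. */
            static long unpack_decimal(const uint8_t *buf, size_t len) {
                long value = 0;
                uint8_t sign = buf[len - 1] & 0x0F;         /* last nibble is the sign */
                for (size_t i = 0; i < len; i++) {
                    value = value * 10 + (buf[i] >> 4);     /* high nibble: a digit */
                    if (i < len - 1)
                        value = value * 10 + (buf[i] & 0x0F); /* low nibble: a digit */
                }
                return sign == 0x0D ? -value : value;
            }

            int main(void) {
                /* 1234 stored as 0x01 0x23 0x4C: digits 0,1,2,3,4 then sign C. */
                uint8_t pos[] = { 0x01, 0x23, 0x4C };
                assert(unpack_decimal(pos, sizeof pos) == 1234);

                /* Same digits with sign nibble D: negative. */
                uint8_t neg[] = { 0x01, 0x23, 0x4D };
                assert(unpack_decimal(neg, sizeof neg) == -1234);
                return 0;
            }
            ```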

        3. Someone Else Silver badge

          Re: you never know when your SSD might be used in a time machine.

          BCD = Binary Coded Decimal, a.k.a the lingua franca of COBOL (may it rest forever in pieces).

          "It" being COBOL, although BCD itself would be worthy of the same fate.

  2. James O'Shea

    fucking incredible

    The idiots put a _signed, short integer_ into a counter. If it were unsigned it'd be bad enough. Why couldn't they have picked a _long_ integer? Can someone please identify the morons in question who sold those pieces of shit to HP so that we can be sure to avoid any of their other products?

    And, HP... get a firmware update out, yesterday.

    1. ArrZarr Silver badge

      Re: fucking incredible

      While sooner is probably better, a number of planes falling out of the sky (and a signed 16-bit number in a counter) in recent times have highlighted the value of solid QA.

      At this stage, HPE hasn't done much wrong (though more validation of the supplier's part that caused this mess could have brought it to light earlier), but pushing out a firmware update that breaks people's RAID arrays would turn this situation from a problem into a disaster.

      1. iGNgnorr

        Re: fucking incredible

        "pushing out a firmware update that breaks people's RAID arrays will turn this situation from a problem to a disaster"

        Unless the article has been updated, this is a reading fail. They are warning that a RAID array with a bunch of these will have them all fail at the same time *without* the firmware update.

        1. MJB7

          Re: fucking incredible

          I don't think there was a reading failure here. If the RAID arrays are not updated in time, they will indeed all fail at about the same time (as indicated in the article). However if the RAID arrays are updated too soon (because HPE have rushed out a new firmware without adequate testing), that will *also* bring down the RAID arrays.

      2. gnasher729 Silver badge

        Re: fucking incredible

        Not quite sure how QA would detect this failure - it may only appear in testing after 32,768 hours. Code reviews, however, should catch it.

        Could they have used a compiler where int = 16 bit?

      3. plrndl
        FAIL

        Validation of the supplier's part

        If HP can screw up the due diligence on a multi-billion dollar acquisition, what hope do we have for a cheap, replaceable component?

    2. Stoneshop
      Trollface

      And, HP... get a firmware update out, yesterday.

      'Yesterday' is exactly what you need a negative timestamp for.

    3. T. F. M. Reader

      Re: fucking incredible

      I know of several cases where an unsigned long counter of milliseconds (not hours, which is weird in itself) overflowed after 7 weeks and a bit. I was involved in a manufacturer's investigation of mysterious switch resets myself. The most famous case, however, is that of someone in the '90s who, for the first time in history, managed to run Windows for almost 50 days without rebooting. Well, he almost got to 50 days' uptime - the machine crashed on the 50th day because a 32-bit counter overflowed. The counter had been there for many years without anyone noticing.
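
      The arithmetic behind that one checks out - assuming, as described, a 32-bit unsigned millisecond tick (the bug commonly attributed to the Windows 95/98 uptime counter):

      ```c
      #include <assert.h>
      #include <stdint.h>

      int main(void) {
          /* A 32-bit millisecond counter wraps after 2^32 ms. */
          uint64_t wrap_ms = (uint64_t)UINT32_MAX + 1;         /* 4,294,967,296 ms */
          uint64_t wrap_days = wrap_ms / (1000ULL * 60 * 60 * 24);
          assert(wrap_days == 49);     /* ~49.7 days: "7 weeks and a bit" */

          /* And the hour counter in the HPE drives, for comparison: */
          assert(32768 / 24 == 1365);  /* days: about 3 years 9 months */
          return 0;
      }
      ```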

      1. Someone Else Silver badge

        Re: fucking incredible

        No one noticed, because no one had successfully run a Windows machine in any environment continuously for that long.

      2. Dave K

        Re: fucking incredible

        Yeah, that was the original release of Win98 if memory serves. Of course, not a commonly encountered issue as it usually crashed way before then...

    4. Someone Else Silver badge
      FAIL

      Re: fucking incredible

      More importantly, why any sort of counter? To tell service technicians when the warranty runs out? To be a source of MTBF analytics? As a time bomb so that users would have to replace SSDs every so often (think Lexmark ink cartridges...)? Can't think of a good reason why you'd even need one, much less one that horks your system at some arbitrary interval.

      1. Stoneshop
        Boffin

        Re: fucking incredible

        More importantly, why any sort of counter?

        Operating time.

        Where you actually want a more fine-grained counter to record events with some sort of timestamp. Of course you can push that out to the host where it will probably get logged against a real-world timestamp, but you also want an error log that stays with the device for when it gets returned.

      2. vtcodger Silver badge

        Re: fucking incredible

        As someone points out upthread, it's probably the S.M.A.R.T. power on hours counter which is supposed to be used to tell you when the drive exceeds 43,800 hours of operation (5 years) and you might want to think about replacing it. https://en.wikipedia.org/wiki/Power-on_hours.

        Note that 5 years was apparently considered to be an approximation of maximum reliable lifetime 25 years ago when S.M.A.R.T. was designed.

        1. Anonymous Coward
          Anonymous Coward

          Re: fucking incredible

          "Note that 5 years was apparently considered to be an approximation of maximum reliable lifetime"

          For spinning disks, 5 years is a good lifetime nowadays; end-user drives usually have a 1-year guarantee and then you are on your own. Some only 6 months, which is laughable.

          SSDs still very much depend on the generation and components used; lifetimes will probably stabilize at 3-5 years for commercial reasons.

          Why make a drive which lasts 20 years if >80% of users throw it away when they buy a new machine every 3 to 5 years?

          1. Stoneshop

            Re: fucking incredible

            Why make a drive which lasts 20 years if >80% of users throw it away when they buy a new machine every 3 to 5 years?

            Even more relevant is the (still) ever-increasing data density. If your storage racks are physically full and the next generation of SSDs can hold several times the data your current ones can, then replacing them can be quite attractive compared to carting in new racks (provided there's room and power for that)

            Note that this case concerns business storage. Few consumers will run SAS SSDs.

  3. Pascal Monett Silver badge

    32,768

    That is exactly the limit of a signed integer variable in practically every programming language under the Sun. What a coincidence.

    A programmer with any sort of experience is going to have hit that limit one way or another during his various projects, and knows that for any sort of operation outside of a basic For/Next loop or very basic mathematical calculations, integers should generally be replaced by long ints, which go a bit beyond 2 billion or up to 9 billion billion, depending on language definitions.

    It is quite obvious that the summer intern who was given the driver project did not have the required experience, nor the foresight to imagine that 4 bytes would not be enough to count uptime. It is also obvious that the senior reviewer didn't do his job properly.

    1. Anonymous Coward
      Anonymous Coward

      @Pascal Monett - Re: 32,768

      There's a subtle difference between programming and software engineering.

    2. Phil Endecott

      Re: 32,768

      Well, not really: in “practically every programming language” in 2019 an integer is 32 or 64 bits, and a long integer is 64 bits. It’s a long time since you got only 16 bits by default. You need to go out of your way to ask for a short.

      1. Simon Harris
        Coat

        Re: 32,768

        "you need to go out of your way to ask for a short."

        I wouldn't say that - there are quite a few pubs between work and home.

        Mine's the one with the whisky stains down the front. -------->

      2. NetBlackOps

        Re: 32,768

        That really depends on your embedded processor and associated compiler.

      3. Major Page Fault

        Re: 32,768

        Is this also the case in embedded programming for obscure 32 bit CPUs?

      4. Cynic_999

        Re: 32,768

        "

        Well not really, in “practically every programming language” in 2019 an integer is 32 or 64 bits

        "

        That may be true if you are programming PC applications, but if you are compiling embedded firmware for 8 or 16 bit CPUs it is definitely not.

    3. Chairo
      Pint

      @Pascal Monett - Re: 32,768

      I am quite sure that 4 bytes would have been enough, though...

    4. Someone Else Silver badge

      Re: 32,768

      A programmer with any sort of experience is going to have hit that limit some way or another during his various projects and knows that, for any sort of operation outside of basic For/Next, or very basic mathematical calculations, integers should generally be replaced by ~~long ints~~ unsigned 64-bit integers, [...]

      There, FTFY

      (Assuming, of course, that you would even need such a counter....)

  4. ArrZarr Silver badge
    Holmes

    It's worth asking if the supplier with the faulty components also supplies this component for other manufacturers.

    Anybody know?

  5. Simon Harris
    Joke

    Pity it's not a traditional drive...

    or you could reverse the polarity to the platter motor and spin it backwards for 32768 hours to reverse the clock.

    1. IGotOut Silver badge

      Re: Pity it's not a traditional drive...

      But doesn't that allow the Devil and her* horde to walk once again upon the Earth?

      * I presume the Devil is a woman as it's always a woman who is the trouble maker in religious beliefs.

      1. Anonymous Coward
        Anonymous Coward

        Re: Pity it's not a traditional drive...

        * I presume the Devil is a woman as it's always a woman who is the trouble maker in religious beliefs.

        I've often suggested God is a woman (no, not Alanis Morisette), with **permanent** PMS...

        1. W.S.Gosset

          Re: Pity it's not a traditional drive...

          Well, the Cathars believed that the Old Testament "God" was actually the Devil.

          So I guess that makes your belief Catharine.

    2. Anonymous Coward
      Anonymous Coward

      Re: Pity it's not a traditional drive...

      It's solid state though so once you've hit the limit you can flip the sign bit by turning the drive over and carrying on as normal for a further 65,536 hours. Then you'll have to flip it back again. I know it's a pain, but it only has to be done every 7½-ish years after the first 3¼-ish years "breaking in" period.

      1. David 132 Silver badge
        Happy

        Re: Pity it's not a traditional drive...

        It's solid state though so once you've hit the limit you can flip the sign bit by turning the drive over and carrying on as normal

        Stop! Do not do this! This will lead to data loss!

        FIRST, you have to use a hole-punch to mark the drive as double-sided.

        AC, you should be ashamed of yourself, giving out incomplete advice like that.

        1. Dolvaran

          Re: Pity it's not a traditional drive...

          Even then, will not the data bits fall out when you turn it over?

          1. DJV Silver badge

            Re: Pity it's not a traditional drive...

            Only if you punch the hole in the wrong place.

          2. Killfalcon Silver badge

            Re: Pity it's not a traditional drive...

            This is where good old spinning rust is better than these fancy SSDs. As a magnetic media, the bits stay stuck on even if you turn the drive upside-down

          3. fobobob

            Re: Pity it's not a traditional drive...

            We usually wrap ours in foil.

    3. Anonymous Coward
      Anonymous Coward

      Re: Pity it's not a traditional drive...

      And in years to come, dodgy resellers would have a junior with a special drill attachment doing just that...

  6. vtcodger Silver badge

    As I read it

    As I read it, it sounds like the problem is in the SSD firmware, not stuff HPE did. If so, they have to get an update from the drive manufacturer. Then HPE might want to test it to make sure that the fix doesn't bork the user/system data/code in any configuration that HPE supports. Then they have to come up with bullet-proof instructions so users can update their system.

    Might take a while. Makes one long for the days of yore when firmware was, like FIRM

    1. Simon Harris

      Re: As I read it

      "Might take a while." - hopefully not four years!

      1. Ken Moorhouse Silver badge

        Re: "Might take a while."

        No, no, time is measured in bits round here...

        "Might take a bit."

    2. Anonymous Coward
      Anonymous Coward

      Re: As I read it

      HPE put their own firmware on drives - this makes sure that the status code can only be read by the HPE server. Therefore you can put an identical drive into the caddy and into the machine, but it will tell you it's broken and will keep showing the status as degraded in the error logs and viewers.

      At least that was the case with the older Gens (8/9?) where I tried putting in SSDs that were better and a third of the price of the HPE ones, but weren't HPE certified.

      1. EveryTime

        Re: As I read it

        HP has a legacy of unique drive firmware going back decades. Their pitch was that it wasn't lock-in, it was to provide higher reliability and better QC.

        1. DJV Silver badge

          Pitch

          Aha, so it sounded like pitch but smelt of bovine excrement.

        2. Anonymous Coward
          Anonymous Coward

          Re: As I read it

          "it wasn't lock-in, it was to provide higher reliability and better QC."

          Well at least until after the 32767th hour, anyway. Even then it's not lock-in, it's lock-out.

          O M F G HPE :(

          As other commentards have enquired - if the same "bare drives" are used by other storage vendors, do similar misfeatures occur?

        3. Anonymous Coward
          Anonymous Coward

          Re: As I read it

          Not just HP - other manufacturers pulled the same BS with hard drives and add-in cards.

          Which is great when the OEM part has a bug fixed but your "trusted" supplier has to validate the updated firmware, and you get to see the inner workings of a supplier's "validation" process: basically just rebranding the OEM firmware, stamping "beta" on it if a customer needs the fix urgently, then bumping the revision and releasing the updated firmware 6+ months later, once enough customers have used it without further issues AND the number of tickets in the support queue for the issue gets too high.

          For example, Dell's NICs and HP/IBM FCoE adapters.

    3. Stoneshop

      Makes one long for the days of yore when firmware was, like FIRM

      You mean like replacing (E)PROMs?

      (where's the cobwebs icon?)

  7. Anonymous Coward
    Anonymous Coward

    Cheeky gets

    "By disregarding this notification and not performing the recommended resolution," thundered HPE, "the customer accepts the risk of incurring future related errors."

    FRO HPE, you can come into my business and perform the recommended resolution if you want. Otherwise you are quite firmly on the peg for shipping a defective product! Christ.

  8. Ken Moorhouse Silver badge

    SSD

    Solid State Doorstop

    1. My other car WAS an IAV Stryker

      Re: SSD

      Solid state: dead.

    2. Simon Harris

      Re: SSD

      Self-Shredding Drive.

  9. Camilla Smythe

    SSD Auto-Bork

    So they set a counter that kills the drive at the equivalent of an overflow?

    Personally I give zero stuffs about it being unsigned or not or for that matter how big it is. They have still installed a mechanism whereby the SSD will Auto-Bork at their imposed time limit irrespective of how much life there might be left in the drive.

    I have been advised that one way of extending SSD life is to over provision the drive. 16GB, use 12GB leave 4GB spare. I know not how much additional life time that might gain but the exercise is rendered pointless if the manufacturer wades in and Auto-Borks the drive 5 years before it was due to gasp its last.

    WTAF?

    1. Simon Harris

      Re: SSD Auto-Bork

      Could be because it's a signed number that's borking things rather than them meaning to specify a strict lifetime for the product.

      If it was unsigned, the counter *might* just roll over from 65535 to 0.

      If it was signed, when the counter rolls over from 32767 to -32768 the drive firmware might go 'hey - we've got a negative time here - must be an error code'. It's quite common to see negative values treated as error codes when a function would normally be expected to return a positive value.
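
      That failure mode can be sketched in C - with the caveat that the narrowing conversion back to a 16-bit signed type is implementation-defined, though practically every real two's-complement target wraps:

      ```c
      #include <assert.h>
      #include <stdint.h>

      int main(void) {
          int16_t hours = 32767;        /* the counter at its last good hour */
          hours = (int16_t)(hours + 1); /* hours + 1 promotes to int; the narrowing
                                           conversion wraps on 2's-complement targets */
          assert(hours == -32768);

          /* Firmware that treats negative values as error codes now sees one: */
          assert(hours < 0);
          return 0;
      }
      ```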

      1. Richard 12 Silver badge
        Boffin

        Re: SSD Auto-Bork

        Assuming a C or C++ CPU model, overflow of a signed int is Undefined Behaviour, and therefore Cannot Happen.

        The difference between something that is expected to happen and something that cannot happen, is that when something that cannot happen, happens, absolutely anything might happen as a consequence.

        1. Simon Harris

          Re: SSD Auto-Bork

          "overflow of a signed int is Undefined Behaviour, and therefore Cannot Happen."

          Undefined Behaviour does not mean something Cannot Happen.

          It *may* happen, or something else *may* happen, depending on the CPU architecture and the whim of the compiler writer in deciding what should happen in such circumstances. For a 2's complement system that doesn't trap overflows, wraparound is a common manifestation of 'undefined behaviour' - and what happens to a negative time code may also result in other undefined behaviour.

          Of course, it could be performing some other form of undefined behaviour on the overflow (e.g. an unexpected exception), or maybe the firmware uses the timecode as an array index into an event status table, and going negative has overwritten critical system information - there are lots of ways it could go wrong!

          If you live in a world where you imagine 'undefined behaviour' is something that 'cannot happen' (are you expecting at run-time an error message saying 'this cannot happen'?) you'll miss a lot of error conditions where 'things that cannot happen' do happen.

          1. Richard 12 Silver badge

            Re: SSD Auto-Bork

            RTFC

    2. Pirate Dave Silver badge
      Pirate

      Re: SSD Auto-Bork

      Agreed. The time-limited auto-bork capability doesn't really seem like a great selling point, IMHO. Like selling a car with... uh, sorry, almost started a car analogy there. Why would the firmware bork the storage just because a certain number of hours had passed? That truly doesn't make any sense at all.

  10. Henry Wertz 1 Gold badge

    So bad...

    So bad... I don't blame HPE (since they didn't write the firmware). But honestly, besides using an inappropriately small variable type, I also see it as a flaw that the firmware didn't handle rollover. One should really try to handle every corner case, even if it's "not supposed" to happen. I'm sure they didn't do the math to realise the counter would roll over in under 4 years and considered rollover an "impossible" case, but if they'd handled it anyway there would have been no story here, because the firmware would have coped. Writing robust code helps in cases like this: if the programmer makes a design or coding error, it gives them a second chance to recover - consider it a safety net.

    The second reason for writing robust code is that it helps when someone is "abusing" your code. Just extending the variable would "work", but if someone down the road decided to make this, say, a millisecond counter instead, the rollover risk would show up again - whereas it wouldn't if they'd handled rollover to begin with. People now run code on systems with something like 100x the RAM and 100x the CPU power, putting basically 100x the load on software it may never have been designed for; if it's robustly designed, it just scales right up with no drama.
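
    One classic way to make such a counter rollover-safe, sketched here for an unsigned 16-bit tick: compare readings by unsigned subtraction, which is well-defined modulo 2^16 in C, instead of comparing raw counter values:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Elapsed ticks between two readings of a free-running uint16_t counter.
       Unsigned subtraction wraps modulo 2^16, so this stays correct across
       rollover as long as the real gap is under 65,536 ticks. */
    static uint16_t elapsed(uint16_t now, uint16_t then) {
        return (uint16_t)(now - then);
    }

    int main(void) {
        assert(elapsed(100, 90) == 10);   /* the easy case */
        assert(elapsed(5, 65530) == 11);  /* still correct across the wrap */
        return 0;
    }
    ```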

    1. Anonymous Coward
      Anonymous Coward

      Re: So bad...

      "just extending the variable would "work","

      Could be the earlier version of the firmware used an 8-bit counter, and it was extended to 16 because, wow, that's a lot of hours. Probably decades worth, innit?

    2. JohnFen

      Re: So bad...

      "I don't blame HPE (since they didn't write the firmware.)"

      That's a terrible reason to not blame them. A manufacturer is responsible for everything in the product they sell, whether they created/wrote the component or not. So HPE is the correct entity to blame.

    3. Stoneshop
      FAIL

      Re: So bad...

      I don't blame HPE (since they didn't write the firmware.)

      Maybe not, but they specify it, and clearly have access to the code as they're stating they're writing a new version. You don't do that from scratch.

      There's also the bit about QA, but I can't write that down without soiling my keyboard.

      1. Anonymous Coward
        Anonymous Coward

        Re: So bad...

        HPE is not writing a new version of the code. That means they are beating up on the OEM vendor of the drive to get them to produce a fixed version of the firmware ASAP. HPE does not have the in-house expertise to write drive firmware for all the different drive models and vendors they deal with.

  11. Pen-y-gors

    "Remediate"????

    Jesus!

    1. Phil O'Sophical Silver badge

      Re: "Remediate"????

      I thought he saved, not remediated?

      1. Korev Silver badge
        1. David 132 Silver badge
          Happy

          Re: "Remediate"????

          Jesus saves, but Hoddle scores on the rebound.

          (Edit: this is the joke as I heard it in the '80s. For those of you wondering "who's 'Hoddle'?", feel free to substitute a more contemporary name.)

          1. quxinot

            Re: "Remediate"????

            Jesus saves, all others roll 4d8 for fire damage.

        2. My other car WAS an IAV Stryker
          Angel

          Re: "Remediate"????

          "In God we trust; all others pay cash."

        3. David Woodhead

          Re: "Remediate"????

          Jesus saves, but Moses invests.

  12. Anonymous Coward
    Anonymous Coward

    Does Dell have the same problem or was the firmware written by HPE?

    1. Anonymous Coward
      Anonymous Coward

      It looks like these are Samsung SSDs that are also used by Dell.

      1. pmsrodrigues

        My experience with HPE SSD stuff leads me to believe they are probably Intel.

  13. TaabuTheCat

    How quickly we forget...

    Only last time it was Crucial.

    https://www.storagereview.com/node/2676

  14. eswan

    "We are currently notifying customers of the need to install this update as soon as possible. Helping our customers to remediate this issue is our highest priority."

    Well, the ones with up-to-date service contracts at least.

  15. Anonymous Coward
    Anonymous Coward

    Let's not be negative about numbers after 32767.

  16. rcxb Silver badge

    Next time HPE tells you about their Next-generation reliability*, you'll know exactly what they mean...

    * From HPE ProLiant DL560 Gen9 Data Sheet Pg.2

    1. phuzz Silver badge

      Fortunately, because of HP's Next Generation pricing, we didn't buy SSDs from HP themselves, and instead just bought enterprise SSDs from a third party, and then bought a load of caddies.

      At the time HP were charging hundreds of quid for a single small SSD, and it was about a quarter of the cost to provide them ourselves. Sure, you don't get the warranty, but in this case it doesn't look that helpful.

  17. stiine Silver badge
    Facepalm

    Hey Richard.

    Have you seen this article?

    https://www.theregister.co.uk/2017/02/16/hpe_blames_solid_state_drive_failure_for_australian_tax_office_outage/

  18. Tromos
    Joke

    Easy to recover from

    Just leave it running for another 32768 hours and it should cycle round and be as good as new.

  19. Unicornpiss
    Alert

    I may be wrong..

    But I'd bet a few dollars or pounds that they are Intel drives. We had a number of Intel drives that never got firmware updates fail (in high-end workstations) within weeks of each other a few years back. The ones that did get updated kept going. They were quietly replaced by the PC manufacturer, but it soured me on Intel SSDs. And again, I could be wrong, but I wouldn't be surprised in the least if lessons weren't learned by Intel. Of course we had a smaller rash of Samsung failures around that period too. Really, the only ones I have no complaints about were made by Crucial/Micron. I'm not a marketing shill, I swear, but Crucial's stuff has had a much lower failure rate for us than anyone else. And a few that were unresponsive were recovered by whatever internal failsafe that is built in, simply by hooking them to power and letting them sit for a while.

    If someone knows the manufacturer of these drives for sure (or if I'm just being thick and missed it somehow in the article), I'd be curious to know.

    1. Anonymous Coward
      Anonymous Coward

      Re: I may be wrong..

      It's not Intel (which anyone can tell by comparing the sizes against Intel's ARK database: they don't make a 15.3TB drive). *Yes, I know who the ODM is; no, I'm not posting it.*

      1. Anonymous Coward
        Anonymous Coward

        Re: I may be wrong..

        Definitely not Intel, and it only affects HP. I can't quite figure out why HP feels the need to have a special firmware version that counts the hours of service though. The whole OEM special firmware version thing has always been a little smelly to me.

        1. Stoneshop

          Re: I may be wrong..

          I cant quite figure out why HP feels the need to have a special firmware version that counts the hours of service though.

          One of the reasons they put their own firmware in is to make the drive report as "HP mumblety foo blatz", so that the controller/host can nag about non-vendor drives if it wants to. Once you're there you might as well specify a few vendor-specific extensions to the error reports that all modern drives generate (and store).

          I'm not surprised that for some obscure reason related to the above a signed 16-bit int got called into existence as an hours operating counter.

    2. o0o0o

      Re: I may be wrong..

      Well, googling for VO0480JFDGT, VO0960JFDGU, VO1920JFDGV etc., does reveal a name.. However if the firmware is custom, it's unfair to blame the manufacturer.

  20. Andy The Hat Silver badge

    Is it custom firmware?

    Is the error *only* part of a custom firmware used by HP or is it also part of some standard firmware used in other branded devices?

    Be very interesting to know ... NOW!

    In the meantime, anyone with 4-year-old SSDs should ensure their backups are good and start gently panicking at the merest sniff of a disk error ...

    1. rmason

      Re: Is it custom firmware?

      If you have affected drives (SKUs on the HPE link) you don't wait. You check the uptime on them, now. Then you move the data off.

      There will be no "sniff of drive failure" here. You hit the required amount of uptime and the drives die, to a state where the data can't be recovered.

    2. Anonymous Coward
      Anonymous Coward

      Re: Is it custom firmware?

      I work for a vendor who competes with HP and has occasionally said rude things about their quality control, and this same question was raised internally. The information we received is that this issue is restricted to custom HP firmware only. This does not make HP uniquely evil; most OEMs do extensive qualification against specific hardware/firmware levels and control firmware updates via their own distribution process in case a firmware problem is detected (which is what HP is responsibly trying to do right now).

      My personal conjecture is that HP placed a counter in their firmware to track fleet age, possibly for QA tracking purposes, or possibly to make warranty claims easier to validate. Either way someone made a mistake, and HP is trying their best to fix it. This isn't the first time this kind of thing has happened, e.g. https://www.computerworld.com/article/2530543/complaints-flood-seagate-over-hard-drive-problems.html, and it won't be the last. This time it is HP; next time it might be Dell, or Lenovo, or SuperMicro, or whoever. So if you've done some basic sysadmin hygiene, like registering your hardware, allowing it to send telemetry back to the vendor, and actioning the "your system is at risk" emails they send you (instead of, say, sending them to an email address for a sysadmin who left two years ago), you should be fine.

      1. stiine Silver badge

        Re: Is it custom firmware?

        So, if it fails at 33,727 hours' uptime it's automatically out of warranty? That's a novel way to get customers to buy new hardware on a set schedule.

  21. AndrueC Silver badge
    Meh

    My first ever SSD - a Kingston unit - had a similar issue. In that case the problem manifested as the drive 'dying' every hour after approximately a year's worth of service. I was one of the early victims because I was using it in a mail server which ran 24/7/52 and it started dying roughly a year after I installed it. Luckily in that case the solution was to apply a firmware fix. No data was lost and the SSD continued to provide good service until it was replaced as part of a server upgrade.

  22. cavac

    "Power on hours" in S.M.A.R.T.?

    This looks suspiciously like the "Power On Hours" counter in S.M.A.R.T. (attribute 0x09).

  23. Cynic_999

    Not only is there an inappropriate variable type used, but the exception handling must be truly terrible. When the counter overflows, why should the firmware react by corrupting any data? Because unless the data on the EEPROMs has been corrupted, the data would be recoverable even if new firmware had to be installed via a JTAG port (which the manufacturer should provide as a free service).

    1. Richard 12 Silver badge

      Undefined Behaviour

      Perhaps the hardware exception or other out-of-range behaviour means it branches to an unpredictable (but consistent) location in the firmware and executes the instructions there, which results in it writing or erasing an unintended flash block.

      It'd only need to erase one flash block to effectively nuke the entire drive, if it were the right (wrong?) flash block. And once triggered, it'd keep doing it - perhaps running through a significant part of the flash.

      1. Stoneshop
        WTF?

        Re: Undefined Behaviour

        There's no mention of actual data corruption, but "not being able to access the drive at all" does count as "data loss" in my book.

        1. Simon Harris

          Re: Undefined Behaviour

          Here's a possible scenario which does not corrupt the user data area, but renders it, and the drive unreadable.

          Imagine that the controller has an EEPROM that is used to store a historical hour-by-hour status log (for the sake of argument, let's say each hour's status can be contained in a single byte). The status log is a circular buffer that holds the last 65536 hours' worth of status logs (or should do). Now imagine that preceding this log in the EEPROM is a block of data that contains encryption keys for the Flash ROM (for security, assume the user data on Flash ROM is encrypted) along with various other system information (e.g. flash configuration, etc.):

          Now imagine there is a function - for the sake of argument, let's say its prototype is:

          unsigned short GetHourCode();

          that gets the number of uptime hours modulo 65536 to indicate where in the table to put the status.

          But some dunce of a programmer forgets it's an unsigned short and writes:

          short hour_code = GetHourCode();

          and ignores any compiler warnings. Now imagine the next bit of code is:

          int EEPROM_address = hour_code + StatusTableOffset;

          UpdateEEPROM( EEPROM_address, status );

          where StatusTableOffset is the start of the status table in the EEPROM. All goes well until 32767 hours, but when we get to hour 32768, the GetHourCode() return value (32768) is reinterpreted by the conversion from unsigned short to signed short as -32768, meaning the EEPROM actually gets written in the configuration or encryption key area, killing the drive's ability to decrypt user data or know the correct configuration of the Flash ROM.

          No user data is corrupted - just the controller now has no idea how to get to it.
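          For the curious, that scenario condenses into a compilable sketch. Everything here is invented for illustration (the function names, the 0x10000 table offset, the layout); the real firmware internals are unknown - but the sign-conversion mechanics are standard C:

          ```c
          #include <stdio.h>

          /* Hypothetical layout: config/keys live below this offset,
             the hour-by-hour status table starts here. */
          #define STATUS_TABLE_OFFSET 0x10000

          /* Returns the power-on hours modulo 65536, as the scenario assumes. */
          static unsigned short GetHourCode(unsigned int power_on_hours) {
              return (unsigned short)(power_on_hours % 65536u);
          }

          int main(void) {
              /* Correct: unsigned short holds hour 32768 just fine. */
              unsigned short ok = GetHourCode(32768);
              int good_addr = ok + STATUS_TABLE_OFFSET;        /* 0x18000 */

              /* Buggy: the value is converted to signed short, and on
                 typical platforms 32768 wraps to -32768. */
              short hour_code = (short)GetHourCode(32768);
              int bad_addr = hour_code + STATUS_TABLE_OFFSET;  /* 0x8000 */

              printf("good: 0x%x  bad: 0x%x\n", good_addr, bad_addr);
              /* bad_addr lands 32768 bytes BELOW the status table -- in this
                 scenario, right in the configuration/encryption-key area. */
              return 0;
          }
          ```

          (Strictly, converting 32768 to a signed short is implementation-defined in C, but on the two's-complement hardware these controllers use it wraps exactly as described.)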

    2. Stoneshop
      FAIL

      When the counter overflows why should the firmware react by corrupting any data?

      The counter overflowing bricks the disk. It disappears off the bus. There's no way any controller can access data that's on it. And that's why one would need to restore off a backup, not because of some (or all) data having gotten corrupted.

      Because unless the data on the EEPROMs

      The customer data storage area is flash, not EEPROM. The controller housekeeping stuff is a bit of EEPROM integrated into the controller die. The data storage area consists of flash chips physically separate from the controller.

      has been corrupted, the data would be recoverable even if new firmware had to be installed via a JTAG port (which the manufacturer should provide as a free service).

      Easy enough. HPE issues a mandatory firmware upgrade, and if you don't install mandatory vendor-issued firmware updates and your system shits itself in whatever way a system with out-of-date firmware would, any vendor will just laugh at you.

      1. Cynic_999

        "

        The customer data storage area is flash, not EEPROM.

        "

        Erm ... "Flash" is a subset of "EEPROM", so your statement is a bit like protesting that you drive a car, not a motor vehicle.

        There is nothing to prevent the controller's housekeeping data from being stored in a special section of the main customer data array (and it probably is). Spinning rust HDDs commonly store controller variables on the same platters as the customer data, just in physical sectors that are inaccessible via the normal interface.

        1. Stoneshop
          Boffin

          Flash vs. EEPROM

          Flash is designed for high speed and high density, at the expense of large erase blocks and a smaller number of erase cycles compared to EEPROM.

          Although they are based on the same technology, EEPROM is much better suited to non-volatile storage of register-size data like counters and event flags that can change comparatively often but still need to be retained across a power cycle.

          1. Cynic_999

            Re: Flash vs. EEPROM

            My understanding is that EEPROM is a generic term that encompasses all the various types of electrically erasable non-volatile solid-state memory technologies. Flash is a specific type of EEPROM designed for high speed and many erase cycles.

            Think of EEPROM as being "Motor Vehicle" and Flash as being "Racing car". The fact that racing cars are fast does not mean that motor vehicles are slow or that racing cars are not motor vehicles.

            However, in looking for an official definition I found this https://electronicsforu.com/resources/learn-electronics/eeprom-difference-flash-memory which makes several contradictory statements such as

            "Flash is just one type of EEPROM."

            and

            "Flash uses NAND-type memory, while EEPROM uses NOR type." among other contradictions.

            So now I'm not so sure.

            1. Stoneshop

              Re: Flash vs. EEPROM

              several contradictory statements such as "Flash is just one type of EEPROM."

              Not really contradictory; as I said, both Flash and (what is now specifically called) EEPROM are based on the same technology (generic EEPROM). One (Flash) is a truck where you can load a lot of cargo but need to load and unload it by the pallet; the other (EEPROM) is a Postie van, much smaller, transporting individual packages. Both are transport vehicles.

              "Flash uses NAND-type memory, while EEPROM uses NOR type." among other contradictions.

              Different implementations of the same basic technology. Again, not actually contradictory.

    3. gap

      The JTAG probably isn't externalised, hence you'd need to crack the drive open, run the thing with extension cables, etc, etc. You'd be out of production for so long, it would be faster and cheaper to recover from a backup and keep driving.

      1. Cynic_999

        "

        ... be faster and cheaper to recover from a backup and keep driving.

        "

        But in this case there is a significant probability that the backups are stored on the same SSD model, and so have all failed at more-or-less the same time.

        The manufacturer must have *some* way of initially programming the disk, and whatever method is used could also be used to recover from a completely bricked condition. Even if the method involves soldering in a pre-programmed EEPROM or CPU, it would be possible to desolder that device and fit a new one.

        Yes, involved and time-consuming, but if the loss of data will cause the company to lose £millions, then no matter how involved, the company would do it.

        1. Stoneshop
          FAIL

          If a company is of a size that it would lose millions over a week or two (a rough estimate of the 'time consuming' restore procedure you outline) it will have a backup regime that does involve different generations on removable media, and very definitely not just a copy to another set of SSDs of the same vendor and of the same age. We're talking SAS drives, not exactly the stuff a ten-head business chooses for their data storage.

          Yes, the procedure you describe is possible but if there's a business that has to resort to that to recover their data they deserve to go under.

  24. Cynic_999

    Marketing opportunity

    Just re-brand it as a write-only device and convince users that it is a desirable feature.

    1. Stoneshop
      FAIL

      Re: Marketing opportunity

      It won't even do that.

  25. arctic_haze
    FAIL

    Fail

    The problem is not even that they used a short integer. The problem is they allowed an overflow to kill the drive.
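    Indeed, defensive handling of such a counter is nearly free. A hedged sketch, assuming the field really is 16 bits wide (the function name is invented): saturate at the maximum rather than wrapping into negative territory, so the worst case is a stuck counter, not a dead drive.

    ```c
    #include <stdint.h>
    #include <stdio.h>

    /* Saturating increment for a 16-bit power-on-hours counter.
       Once it hits UINT16_MAX it simply stays there. */
    static uint16_t next_power_on_hours(uint16_t poh) {
        return (poh == UINT16_MAX) ? UINT16_MAX : (uint16_t)(poh + 1);
    }

    int main(void) {
        /* The fateful hour: 32767 + 1 is just 32768, no sign wrap. */
        printf("%u\n", next_power_on_hours(32767));      /* 32768 */
        /* At the 16-bit ceiling the counter saturates instead of wrapping. */
        printf("%u\n", next_power_on_hours(UINT16_MAX)); /* 65535 */
        return 0;
    }
    ```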

  26. Boris the Cockroach Silver badge
    Devil

    The

    cynic in me says that the 'fault' was at the behest of the senior manglement on the basis

    "If our SSDs have no moving parts, theres very little to go wrong with them and wont need replacing every 3 years... make sure they replace the drives in that time... "

    But then, that would be far too unbelievable... wouldn't it?

  27. anthonyhegedus Silver badge

    What the fuck is 11/15?

    So HPE announces that it was notified on “11/15”? Yes, we all know what it means, but for heaven's sake, they're supposed to be a global company! Don't they know that only one country writes their dates backwards like that? They should make a bit of effort and at least write “November 15” or “15 November 2019” or something.

    Oh and the thing about the failing drives - that's just devices not fit for purpose. And saying that users accept responsibility by ignoring the notice? Would that be upheld in any court? It smacks of not giving a shit, poor quality control and general arrogance. Shame on HPE.

    1. Simon Harris
      Joke

      Re: What the fuck is 11/15?

      It's actually the 11th day of Month 15.

      If their hard drives can have impossible time codes, so can their calendars.

    2. gap

      Re: What the fuck is 11/15?

      Is that 11/15 date displayed as an octal, decimal or hex number?

      Was it stored in a 4- or 5-bit field?

  28. Anonymous Coward
    Anonymous Coward

    WTF

    ... my mind boggles at the sheer amateurism of HP in allowing a bug of this severity (like... not only does the drive crash, but apparently powering off and on doesn't fix it? It's left permanently unusable, irrecoverable, all data lost... because a transient hours-on counter overflowed. WTF???) and then somehow implying it'll be the users' fault if they don't patch it.

  29. FrenchFries!

    The good news

    Thank goodness that the HPE 3PAR frame is not affected by this. It has enough issues to deal with.

  30. DugEBug
    FAIL

    Hypothesis: Wear-leveling bug

    The flash chips in the SSDs wear out due to writing. To mitigate this, the firmware tries to spread out writes such that all of the pages in the flash get roughly the same number of writes over time. This means that there is a lot of data movement in the SSD even when it isn't actively being read from/written to. A flash page that was written once and never deleted could be moved to a different location so that new writes can use the relatively youthful page. Similarly, pages that are deleted are eventually placed in a sorted 'free list' according to the number of writes that have already taken place.

    As you can imagine, there is a heck of a lot of housekeeping going on inside that SSD - all executed according to the active firmware. If we imagine that the 'hour counter' is an integral part of the wear-leveling algorithm, havoc will be wreaked once the counter goes negative. The once properly organized lists/tables/etc. will suddenly be corrupted beyond repair. When the SSD driver asks for block XYZ, the SSD returns block WTF. There goes your file system, and your otherwise nice day.
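    If that hypothesis were right, even a trivial comparison goes wrong once the counter wraps. A purely speculative sketch (the struct and function are invented; nothing is known about the real firmware's internals): imagine wear-leveling metadata tagging each block with the hour it was last written.

    ```c
    #include <stdio.h>

    /* Hypothetical wear-leveling metadata: each block tagged with the
       power-on hour at which it was last written. */
    typedef struct { int epoch_hours; } block_meta;

    /* Naive "which block was written more recently?" comparison.
       Breaks as soon as the hour counter wraps negative. */
    static int is_newer(block_meta a, block_meta b) {
        return a.epoch_hours > b.epoch_hours;
    }

    int main(void) {
        block_meta before = { 32767 };
        /* Hour 32768 stored through a signed short wraps to -32768
           (implementation-defined, but that's what two's-complement
           hardware does). */
        block_meta after = { (short)32768 };

        /* The block written LATER now sorts as the oldest in the pool,
           so every ordering the housekeeping relies on is inverted. */
        printf("after newer than before? %d\n", is_newer(after, before));
        return 0;
    }
    ```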

    1. Cynic_999

      Re: Hypothesis: Wear-leveling bug

      I very much doubt that an hour counter is used in any way by the housekeeping task. An hour is too long a duration, and in any case the counter needs to be resettable.

  31. BenMyers
    FAIL

    How to spell U-G-L-Y

    HPE needs to come clean here and tell us which manufacturer made its SAS SSDs, because they do not manufacture their own. And that manufacturer needs to describe its part in this whole mess as well as identifying its other OEM customers. Possible manufacturers implicated in this are Seagate, Micron/Crucial and Samsung. Seagate is a relative newcomer to SSD manufacture.

  32. gap

    Why b0rk the drive?

    Using a short integer is one thing, but the fact it b0rks the drive when it overflows is insane.

    Why would the number of power on hours be used in anything that would get in the way of drive operation?

    1. goimir

      Re: Why b0rk the drive?

      Because they've always borked drives. It was just supposed to happen at a random time after 32768 hours.

      Having the data be completely unrecoverable is just a bonus to them, just like it (didn't) used to take 30 seconds to trace a phone call's origin. If the data was "kinda recoverable", spooks would be all over the company to recover data off "really important bad guy's" boxen day and night, just like the spooks used to ask for numbers traced all the time. Telco engineers got wise and told them a trace took that much time on the line. It never really was the case.

      So better just to bork it good and hard, randomly, after X hours. Because SSDs fail, and That's The Way it Was.

  33. gap

    Consequences

    The possibilities raise good reason to have backups on completely different technology.

    Imagine having an array of these things backed up to another array of them. Even with remote sites and offline backups, you could be in a world of hurt if you deployed the hardware at the same time.

    Let's hope the navy aren't using these on their nuclear subs, etc.

    1. Solmyr ibn Wali Barad

      Re: Consequences

      There is a newfangled system where data bits are stored on a long strip of magnetic media. Sounds radical, but might work.

  34. gap

    Other vendors?

    Now the fun part will be waiting for other OEMs, or better still SSD manufacturers, to issue a similar warning.

    I'll be surprised if HP is messing with custom firmware versions for those components to a degree that would break it like this, hence I'm guessing there will be more of these notifications soon.

  35. Anonymous Coward
    Anonymous Coward

    Fifty years ago an IBM 360-compatible mainframe's memory was typically 64KB. As systems evolved to 128KB there was some relief when we could just squeeze in a new compiler with the operating system. When a maximum of 1MB became available, the operating system size grew like Topsy.

    So we had live service crashes when programmers had used arithmetic operations on addresses. Those showed up when their code had to access addresses above 32KB. Having fixed those, the next boundary to cause problems was 64KB, when they had used 16-bit address storage. Going to 32-bit storage didn't solve all problems, as arithmetic operations would propagate the sign bit into all the significant bits.

    Those who don't know history are doomed to repeat the failures of the past.
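    That sign-propagation trap reproduces in a few lines even today. A sketch (the 0x9000 value is just an arbitrary address with the top bit set):

    ```c
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* A 16-bit "address" above 32K, stored in a signed halfword. */
        int16_t addr16 = (int16_t)0x9000;   /* bit 15 set */

        /* Widening to 32 bits propagates the sign bit into the
           upper half: 0xFFFF9000, not the intended 0x00009000. */
        int32_t widened = addr16;
        printf("widened: 0x%x\n", (uint32_t)widened);

        /* Masking (or using an unsigned type) keeps the address intact. */
        int32_t correct = addr16 & 0xFFFF;
        printf("correct: 0x%x\n", correct);
        return 0;
    }
    ```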
