Backblaze thinks SSDs are more reliable than hard drives

Cloudy backup and storage provider Backblaze has found that flash SSDs are more reliable than hard drives, at least as far as the boot drives deployed in its datacenters go, but cautions this could change in future as the SSDs age. These findings come from Backblaze's latest report detailing reliability statistics on the …

  1. Kev99 Silver badge

    Obviously, Backblaze has yet to need to recover any data from an SSD. It's nearly impossible, as I learned the hard way.

    1. phuzz Silver badge

      If you're at the stage of trying to recover data from dead media, then several other things have gone wrong already. Such as RAID and backups.

      (Although, as these are boot drives, I wouldn't be surprised if they're not backed up per se, but instead Backblaze have a procedure for rebuilding on a fresh drive, as if it was a brand new server)

      1. Grooke

        It's a valid concern for consumer usage, but yeah disk recovery is likely out-of-scope for Backblaze's usage.

        1. ChrisC Silver badge

          Not sure it's valid even for consumer usage - if a spinny-rust drive has failed to the point where you can't see it *at all*, then the costs involved in recovering any data from said drive are likely to be beyond the level at which the average consumer will just give themselves a metaphorical kick up the arse for not already having backed up the lost data, and take it as a lesson learned for next time.

          And IMO, the average consumer isn't likely to be aware enough of what the signs of impending drive failure can be, or to have their system configured to proactively monitor the SMART parameters for them, which means the first time they realise their drive has a problem is most likely going to be when it suddenly ceases to exist as far as their OS is concerned.

          This is all assuming the drive is even in a position to be trying to give any sort of warnings that it's on its last legs - the last drive of mine that failed did so completely out of the blue due to a component failure on the controller PCB which, although it was probably degrading over time, wasn't able to be detected by SMART or any manual observation of drive characteristics (spin up time, noise etc.), and only became apparent when, on the next power cycle, it went "phut" and rendered the drive completely inoperable.

          1. Jou (Mxyzptlk) Silver badge

            Windows 7 does give that warning. Seen it myself on an SSD and HDD in a laptop which was just about to die (the drives, not the laptop). I got a call from my brother and niece about the warning popup from Windows, and my reaction was: "Total red alert, copy everything you need onto a USB disk right about now, don't reboot, just plug in and copy!" Later we created image backups too, which showed about 450 unreadable sectors on the system SSD. They had that message on screen for a few weeks. The SSD was an OCZ; the HDD I don't remember, but it was dead too.

            Everything was migrated to one 1TB SSD, and the system survived (and was upgraded to Win10 right away too).

            I doubt that MS removed that warning on later Windows versions. I think I saw that warning on Windows XP too, but that is too far away to be sure.

            1. phuzz Silver badge

              That'll be using the SMART data, and as others have pointed out, it's entirely possible for a disk to fail with no warnings.

              1. Jou (Mxyzptlk) Silver badge

                Yeah, but that is not what the post I replied to is about. Accidentally replied to the wrong post?

                1. Martin-73 Silver badge
                  Pint

                  Uhm yes it was, I am guessing this may be a bug as it happens often to seasoned commentards... Have a beer

            2. ChrisC Silver badge

              Interesting, I've never seen Windows natively give me any sort of pre-emptive warnings about drive health, but I may be forgetting just how long ago it's been since the last gradual failure I had where SMART monitoring would have been of use - it's therefore entirely possible that 7 and later do include this natively and I've simply yet to experience it due to the only drive failure I've had since switching to 7 being the one I mentioned earlier where it went from fully functional to brick in the space of a single system power cycle.

              1. Jou (Mxyzptlk) Silver badge

                Well, my personal opinion is that Windows could warn earlier, but that would require keeping statistics over time to notice when specific numbers suddenly start to rise, and warn BEFORE the drive itself goes "Pre-Fail".

                But I do that type of monitoring with a combination of PowerShell and smartmontools, though pure PowerShell would be enough for most out there.
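
                For illustration, here's a minimal sketch of that trend-tracking idea in Python on top of smartmontools' JSON output (the commenter does the same with PowerShell). The device path, history file name and watched attribute names are assumptions for the example only; the JSON field names follow smartctl 7.x and may differ on other versions.

```python
#!/usr/bin/env python3
"""Sketch: warn when selected raw SMART values start rising, instead of
waiting for the drive to report "Pre-Fail". Assumes smartmontools >= 7.0
(for --json); device path and attribute list are illustrative only."""
import json
import subprocess
from pathlib import Path

DEVICE = "/dev/sda"                   # assumption: adjust for your system
HISTORY = Path("smart_history.json")  # previous readings live here
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Media_Wearout_Indicator"}

def read_smart(device: str) -> dict:
    # smartctl exits non-zero for several non-fatal conditions, so don't use check=True
    out = subprocess.run(["smartctl", "--json", "-a", device],
                         capture_output=True, text=True)
    return json.loads(out.stdout)

data = read_smart(DEVICE)
if not data.get("smart_status", {}).get("passed", True):
    print(f"ALERT: {DEVICE} reports overall SMART health FAILED")

# Pull the raw values of the attributes we care about (ATA drives).
current = {row["name"]: row["raw"]["value"]
           for row in data.get("ata_smart_attributes", {}).get("table", [])
           if row["name"] in WATCHED}

# Compare against the last run and flag anything that has gone up.
previous = json.loads(HISTORY.read_text()) if HISTORY.exists() else {}
for name, value in current.items():
    if name in previous and value > previous[name]:
        print(f"WARNING: {name} rose from {previous[name]} to {value} on {DEVICE}")

HISTORY.write_text(json.dumps(current))
```

                Run something like this from cron or a scheduled task; anything that trends upward between runs is the early warning the drive itself won't give you until it flags "Pre-Fail".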

            3. katrinab Silver badge
              Meh

              I had an MX500 fail in Windows Server 2019 a couple of years ago. A partial failure rather than the drive disappearing. The only warning I got was a notification that the backup had failed.

              1. Martin-73 Silver badge

                OOF, ungood. Maybe some manufacturing defect, involving an open address line or one of the storage chips on board dying? Maybe RAID *WITHIN* a drive could become a thing for enterprise-level drives? (Or maybe it already is, thinking redundant NAND chips...)

    2. Francis Boyle Silver badge

      Maybe

      they know of a good backup service.

      Seriously, any data that's stored in only one place, on whatever medium, might as well not exist.

      1. Piro Silver badge

        Re: Maybe

        Ouch, that's a hell of a take.

        The vast majority of computing wouldn't be possible. There are few systems that allow the running of memory in a mirrored configuration, and fewer still that allow entire systems to run in lockstep across separate hardware (think VMware FT).

        I agree that important data should regularly be copied to extra locations, not part of the original system, but to say all unique data in time and space is worthless is quite the thing to suggest.

        1. RichardBarrell

          Re: Maybe

          The accusation isn't that the data is worthless, it's that it's doomed.

          ...DOOOOOOMED!

          1. Sudosu Bronze badge

            Re: Maybe

            Every time I hear that....

            https://www.youtube.com/watch?v=k1umrxqCB-w

            (it's a minute in)

            1. Martin-73 Silver badge
              Thumb Up

              Re: Maybe

              Not the clip I was expecting (Dad's Army), but worthy of an upvote and maybe a watch of the movie?

    3. talk_is_cheap

      For an operation the size of Backblaze, a failed boot drive would mean installing a new drive and just running cloud-init to re-install the node.

      From the fact that they are using entry-level drives like the Crucial drive listed, it is clear that they have no concerns about the drives.

      1. VicMortimer Silver badge
        Megaphone

        Backblaze figured out many years ago that there is ZERO advantage to using 'enterprise' hard drives, because they're no more reliable than 'consumer desktop' drives. And of course the 'enterprise' drives are a lot more expensive.

        That's how their drive report got started. They did the testing, realized that they were being scammed by drive manufacturers, and published their reliability numbers because they thought it would be nice to let everybody else know just how much of a scam 'enterprise' hard drives actually are.

    4. doublelayer Silver badge

      They're almost certainly not bothered about recovery. Their business is storing lots of data, so I'm sure they're aware that getting to the point of having to recover from failed hardware means they'd have completely failed the customer. It's not that useful even in personal usage, as there are a lot of failure modes where recovery from anything, mechanical or SSD, isn't possible. By all means recover if it looks possible and could help, but never count on having that option.

      1. Sudosu Bronze badge

        I think this may be an older "article" but it shows a bit about how they set things up using disposable pods.

        https://www.backblaze.com/b2/storage-pod.html

  2. Potemkine! Silver badge

    Very interesting to have actual data. Sounds logical, as there are no physical failures expected in an SSD.

    Aren't SSDs also limited in the number of write/erase cycles?

    == Bring us Dabbsy back! ==

    1. Grooke

      That's what they're referring to with "media wearout limit". That's the next thing they want to study:

      "we'll try to confirm on how far past the media wearout limits you can push an SSD"

    2. Filippo Silver badge

      They are, but if these are boot disks, they probably don't get written to very much.

    3. JoeCool Bronze badge

      SSD is still a physical storage device

      There certainly can be failures of the electronic circuits, as well as physical and mechanical stress caused by the environment.

      1. Richard 12 Silver badge

        Re: SSD is still a physical storage device

        The published specifications are a typical bathtub curve, with the "high failure rate" end occurring at around 50-100 years old in the write-rarely, read-continuously situation that is a typical server boot drive.

        It's nice to see that the specification isn't wildly wrong.

        Storing logs on the drive will greatly shorten the lifetime; how much will depend on how the OS and drive firmware handle appending.

        We had one batch of drives with a firmware fault that killed them within a few weeks of taking logs, as they did a full block erase-write for every tiny log line...
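
        A back-of-envelope sketch of why that matters, in Python; the TBW rating, log volume and block size below are made-up illustrative numbers, not figures from the article or this thread.

```python
# Rough endurance arithmetic: lifetime ~= rated TBW / effective bytes written per day.
# Write amplification is what turns a tiny log append into a much larger flash write.

RATED_TBW = 150e12             # assumption: 150 TB written, a small consumer SSD rating
LOG_LINES_PER_DAY = 500_000    # assumption: a fairly chatty server
LINE_BYTES = 200               # assumption: average log line size
BLOCK_BYTES = 2 * 1024 * 1024  # assumption: one NAND erase block

def lifetime_years(bytes_per_append: float) -> float:
    daily_bytes = LOG_LINES_PER_DAY * bytes_per_append
    return RATED_TBW / daily_bytes / 365

# Well-behaved firmware/OS batches appends, so each line costs roughly its own size:
print(f"batched appends  : {lifetime_years(LINE_BYTES):>8.0f} years")
# The faulty firmware described above erased and rewrote a whole block per line:
print(f"a block per line : {lifetime_years(BLOCK_BYTES):>8.2f} years")
```

        With these assumed numbers the difference is thousands of years versus a few months, within an order of magnitude of the "few weeks" failure described above: the write pattern, not the read traffic, sets the lifetime.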

  3. 43300 Silver badge

    Assuming these are VM hosts, the usage on the boot drive with many hypervisors (e.g. ESXi) is going to be pretty light.

    Hyper-V Server (or the Hyper-V role on Server Core) would be heavier, but I rather doubt if they are using those.

    1. elwe

      I doubt very many of these are hypervisors. When you run a scale-out service you tend to use many whole boxes as service nodes, with no need for a hypervisor on most boxes. Typically you do have a very small number of hypervisors to run VMs for DNS, LDAP directories, syslog servers and other ancillary services.

    2. J. Cook Silver badge
      Boffin

      Not quite.

      Unless you've configured your ESXi hosts to write log files somewhere else, by default it'll write to the local datastore. If you've been booting your hosts off SD cards, your log files get written to a ramdrive due to the failure rate of SD cards that get lots of writes. (Ask anyone who runs a handful of Raspberry Pis about media failure...)

      VMware, at one point, declared that installing and booting off SD cards would no longer be allowed for ESXi 7, which was walked back after a lot of outrage. It's still planned for a 'future version', probably 8 or a dot release of 8. (Which is why I was rather annoyed that the new servers we just bought did not have any local drives on them...)

      1. DougMac

        Re: Not quite.

        Tried doing SD card boot for VMware.

        I had to replace the cards about every 4-6 months.

        Gave up and spec'd out systems with local SSD boot after that.

        Granted, some brands of server hardware seem to do better than others, but it was no surprise to me that VMware unspec'd SD card boot off the HCL.

        1. 43300 Silver badge

          Re: Not quite.

          Our previous set of servers had SD cards as boot media when new - each one had a pair for alleged redundancy. Trouble was that the failure reporting didn't work, so the first we knew was when we started getting issues first with patching, then with not booting at all.

          Fortunately the front bays were wired in, so after a few years I just bought some SSDs to use as the boot drives. No further problems after that, and I would never consider using SD cards again - either standard SSDs (as in the current servers), or BOSS cards.

  4. JoeCool Bronze badge
    Boffin

    These guys deserve a Nobel for "Science you can use"

    I've been reading their reports for years - such good discussion.

    1. ChrisC Silver badge

      Re: These guys deserve a Nobel for "Science you can use"

      And lovely to see a commercial entity releasing data like this for public use, rather than keeping it all firmly under lock and key for internal consumption only. Makes you stop and ponder just how much data (for all manner of things) is being gathered around the world and which could be of use in some way to someone other than the gatherer, but which never gets seen by anyone inside the gathering organisation...

      1. ChrisC Silver badge

        Re: These guys deserve a Nobel for "Science you can use"

        Or even *outside* the organisation, doh...

        Though given how much data is captured as a matter of routine these days, it wouldn't surprise me in the slightest if my earlier typo isn't a million miles from the truth in many cases as well.

  5. nautica Silver badge
    Stop

    'High-tech credibility' at risk here.

    "Backblaze thinks SSDs are more reliable than hard drives

    They are most definitely not.

    1. J. Cook Silver badge
      FAIL

      Re: 'High-tech credibility' at risk here.

      Citation needed, please.

      I'd like to point out that Backblaze is doing this for exactly this reason: to determine what types, brands, and models of drives are the most reliable over time in a vendor-independent manner.

    2. doublelayer Silver badge

      Re: 'High-tech credibility' at risk here.

      Do you want to back up your sentence with an explanation? Why are they certainly not, and what data do you base that on? Why does this hurt the credibility? Right now, I don't have a clue what point you were thinking of when you wrote that.

    3. ChrisC Silver badge

      Re: 'High-tech credibility' at risk here.

      Are you operating a fleet of 2500+ SSDs spread over multiple models/manufacturers? If so, feel free to share the reliability stats you've gathered to back up your claim. Otherwise, I think I'm going to have to side with Backblaze here...

  6. Ball boy Silver badge

    The choice isn't really about reliability

    No one should ever rely on a single disk for anything. Given that, the reliability (providing it is within a reasonable percentage of the previous tech) isn't that critical - but speed is: SSD is simply way, way faster than platters.

    However, if an SSD does fail for any reason (be it the memory array or just in the control logic) I very much doubt you stand much chance of getting any data back. At least with a platter there's a fighting chance some of the information can be recovered - that is, assuming you forgot rule 1: never have a single point of failure unless you're planning to fail.

    1. Richard 12 Silver badge

      Re: The choice isn't really about reliability

      Missed the point.

      Reliability matters, but not for the data (it's a boot drive, the data is a clone of a million others). It matters for the downtime.

      Losing the boot drive takes out the server for a few hours, and consumes the valuable in-person time of a technician to go and physically swap it out.

      1. Ball boy Silver badge

        Re: The choice isn't really about reliability

        I didn't miss the point - hence the "providing it is within a reasonable percentage of the previous tech" qualifier.

        If losing a boot drive is such a problem for your environment, why are you dependent on a single point of failure? See rule 1.

    2. doublelayer Silver badge

      Re: The choice isn't really about reliability

      Depending on what you're doing with it, reliability can become more important and speed isn't always critical. Losing a disk doesn't just mean losing the data on it; I'm sure anyone running a storage business is aware of that and has redundancy. It means the cost in time and money to allocate a replacement disk and add that to the array holding the data. It means an eventual call to a technician to remove the failed hardware and replace it with fresh devices. It means buying replacements faster. There are reasons people care about that.

      You don't always need speed, either. In my personal machines, the boot disk is always an SSD because speed is very important there. In my storage server, it's mechanical drives. I can deal with it taking a couple more milliseconds to fetch a file I've moved over there, and if I couldn't, I wouldn't be using a network link anyway. This allows me to have more storage in it than I could afford if it was an all-SSD setup (when I was buying disks, SSDs were running about 4-5 times as expensive per terabyte as HDDs). Although my primary consideration was financial cost, I'd definitely consider reliability more than speed.

    3. phuzz Silver badge

      Re: The choice isn't really about reliability

      It's not mentioned anywhere, but there's every chance they're running a RAID 1 for their boot drives.

  7. Jou (Mxyzptlk) Silver badge

    I had only ONE SSD failing on me.

    And that one was *drumroll* OCZ Vertex 3 240 GB. They died like flies at the end of their time, and luckily I noticed it early enough.

    No other SSD failure so far, mostly Samsung top-line (I think six of them?), some WD Blue (four of them, but only two in active use; WD = Sandisk = their own flash memory). So indeed, the reliability is so good that they get too small before they fail.

  8. Pirate Dave Silver badge
    Pirate

    Well, maybe

    Just one tiny anecdotal point - last week, our 4 year old Compellent/Dell SCv3020 finally blew its first drive. One of the SSDs. There are 10 SSDs and 20 spinning rustbuckets, all 30 of which were installed at the same time, and if memory serves, were relatively close in manufacturing date.

    Not a datacenter full of disks, and nowhere near a significant sample size, but enough to make me chuckle at the irony of Backblaze's statement. As always, YMMV.

    1. phuzz Silver badge
      Stop

      Re: Well, maybe

      Might be worth ordering some spares when you get the replacement drive. One thing I've noticed with SSDs is that all the drives in a single batch tend to fail quite close to each other.

      I've always found hard drives to be a bit more random in their failures; two hard drives with consecutive serial numbers might fail years apart, but with SSDs they're often only months apart.

  9. Henry Wertz 1 Gold badge

    Read-only workload

    I would expect SSDs to be more reliable, especially in this use as a nearly read-only medium -- (probably?) no swap enabled, minimal writes other than when the software is first installed and during software updates. Since these systems just boot off this disk and all the main work is done by (spinning rust) storage drives, I honestly assumed SSDs have a near-zero failure rate other than failure due to excessive writes. Good to see though!

    Personally, I had 2 HP SSDs die -- both $20 trashy "controllerless" 240GB SSDs. The first one, I was running some tasks that needed about 24GB of RAM on a 16GB system and used the SSD as swap -- it croaked in a month or two. Disappointed but not surprised. The second one I used as a regular data drive -- the piece of junk STILL failed in under 6 months. DO NOT BUY CONTROLLERLESS SSDs! And HP, shame for selling them! I still run mainly spinning rust, but must admit I have not had any other SSD failures, or failures of the ~24GB MMC they stuck in a few netbook-type systems (on the system my mom has with that, though, only the Ubuntu / is on the 24GB, while /home and swap are on a 750GB HDD -- works a treat, software loads fast off the MMC, and since Ubuntu is small it still has like 12GB free on it; the big stuff goes on the hard disk, which also has plenty of free space).

    1. Jou (Mxyzptlk) Silver badge

      Re: Read-only workload

      12 GB for the OS feels like A LOT in Linux terms! When did Linux get so bloated? Server 2022 is about 14 GB (normal non-core install including a small swap), and with updates it can go up to 17 GB until the next self-clean cycle - which is very little in Windows terms.

      Soon they will trade places, but I doubt Windows will shrink; rather, Ubuntu will grow.

      (And yes, I don't dare to compare with bloaty MS-Client OS-es, even though Win10 needs way less than Win7/Win8/Win11).

    2. Pirate Dave Silver badge

      Re: Read-only workload

      "And HP, shame for selling them!"

      Wasn't it HP that had the SSDs disabling themselves a year or two ago? That probably burned off what little shame they had left, so none left for your controller-less SSDs.

  10. FelixReg

    The number of failures is too low to mean much

    Title says it all. With failure rates like this, you'd want the drive counts to be well over ten thousand.
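
    A quick sketch of the statistics behind that point, using an exact Poisson interval on the annualized failure rate. The drive count and failure count below are invented for illustration, not Backblaze's actual figures, and the snippet assumes scipy is available.

```python
# Why small failure counts give wide error bars: exact (Garwood) Poisson
# confidence interval on an annualized failure rate.
from scipy.stats import chi2

def afr_confidence_interval(failures: int, drive_days: float, conf: float = 0.95):
    """Point estimate of AFR plus an exact Poisson confidence interval."""
    alpha = 1 - conf
    lower = chi2.ppf(alpha / 2, 2 * failures) / 2 if failures > 0 else 0.0
    upper = chi2.ppf(1 - alpha / 2, 2 * (failures + 1)) / 2
    drive_years = drive_days / 365
    return failures / drive_years, lower / drive_years, upper / drive_years

# e.g. 2,500 drives observed for a year with 7 failures (illustrative numbers):
point, lo, hi = afr_confidence_interval(failures=7, drive_days=2500 * 365)
print(f"AFR {point:.2%}  (95% CI {lo:.2%} - {hi:.2%})")
# roughly 0.28%, with an interval spanning ~0.11% to ~0.58%
```

    With only a handful of failures the 95% interval is about as wide as the estimate itself, so ranking drive models on these numbers is mostly noise until the fleet, or the observation time, gets much larger.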

  11. worldtraveller2

    SSD Failure

    Whilst the figures reported show SSDs to live longer than HDDs, my personal experience has been the opposite, with long-lived HDDs still carrying on in service whilst SSDs have experienced total failure.

    1. Chz

      Re: SSD Failure

      It's my personal suspicion - based entirely on anecdotal data, I should add - that SSD failures are lower than HDD for the main part of their lifetimes, but the far edge of the bathtub curve looks different. Long-running HDDs do indeed keep going. I have one that's 15 years old. No essential data, I use it as temp space. I have a suspicion that the 15 year survival rate for SSDs will be lower than for spinning rust. Pure gut instinct on that though, as Enterprise flash is at most 10 years old and I don't think they ever sold enough of the consumer drives before then to make a decent data point.

    2. Jou (Mxyzptlk) Silver badge

      Re: SSD Failure

      Needs more detail: which SSDs? There is quite a difference depending on which manufacturer you choose.

      Ranking for consumer SSD quality, for as long as SSDs have been around:

      1. Samsung, using their own chips of course.

      2. Crucial, which is the Micron consumer brand, using their own chips too.

      3. WD, which bought up Sandisk, using their own chips too.

      4. Well, it COULD be Hynix, if their website would finally work outside of UK/US/Korea/Japan. Using their own chips too.

      5. Don't care for the rest :D - that includes Intel.

      The ranking for professional server SSDs is different: "Don't care how many of them die, they have support for X years". However, I've not seen one dying yet, including the famous Intel-32767-hours bug drives.

      1. J. Cook Silver badge

        Re: SSD Failure

        The ranking for professional server SSDs is different: "Don't care how many of them die, they have support for X years".

        Correct. For server and enterprise the rule is also 'one is none, two is one, three or more is better'.

        Granted, I have seen cases where the controller falls over and takes the data on the drives with it as it goes down. (A previous boss liked to spin a tale of an EMC firmware update that proceeded to corrupt the data on the entire appliance, which is why he was very, very suspicious of firmware updates on storage appliances. I don't blame him, but the only bit of EMC storage we've ever had in house was a Data Domain, and that's been shrink-wrapped until the data on it passes retention in a couple more years...)

      2. doublelayer Silver badge

        Re: SSD Failure

        The full dataset includes not only manufacturers but specific models. It can allow you to find out in great detail what would have been the best disks to have bought five years ago, and although the SSD data is not as expansive, it should eventually provide similar levels of data for that. This is only slightly useful in determining what disks to buy now, unfortunately.

  12. Richard Pennington 1

    Sample size of 1 ...

    I am typing this on a vintage Apple machine (iMac 15,1 from 2014) which features a Fusion drive. This means that its main internal HD is actually a 1TB spinning-disk HD combined with a 120GB SSD.

    About March this year, this machine suffered repeated crashes (at intervals of 1-2 days). Long phone calls to Apple support later, the solution was: [1] all user data was backed up to Dropbox (which I was using anyway in the background to get my various machines to talk to each other); [2] delete the internal HD completely, and reinstall the system from Apple's Internet recovery system; [3] manually recover about 5 days' emails which got lost in the middle but which were captured on other machines.

    It turned out that the SSD had failed completely, but the spinning-disc HD was - and is - still in full working order. The machine now runs on just the spinning-disc HD. It has now gone over 80 days (and 3 system updates) since its last crash.
