Do SSD failures follow the bathtub curve? Ask Backblaze

Cloud-based storage and backup provider Backblaze has published the latest report on usage data gathered from its solid state drives (SSDs), asking if they show the same failure pattern as hard drives. Backblaze uses SSDs as boot drives in the server infrastructure for its Cloud Storage platform, while high-capacity rotating …

  1. baka

    these are consumer SSDs!

    Based on the capacities, these all appear to be consumer-ish SSDs. Enterprise SSDs should have an even better AFR. And bleh on Seagate and WD, their drives suck, as evidenced by the results...

    1. DoContra

      Re: these are consumer SSDs!

      Backblaze has long proved that, at their scale and/or needs, enterprisey storage is not sufficiently more reliable to be worth the price difference for them.

    2. Peter2 Silver badge

      Re: these are consumer SSDs!

      "And bleh on Seagate and WD, their drives suck, as evidenced by the results..."

      No, it doesn't show that at all. One particular drive failed after 1.4 months, and they only had one of them. This breaks their system somewhat, as it gives an annualized failure rate of 842% even though they only had a single drive of that model (so the share of that model that actually failed is 100%).

      They had ~1800 Seagate drives installed and two failures in 170 thousand drive days, compared with ~550 Crucial drives and two failures in 47 thousand drive days.

      The Seagate drives actually appear to be their favourites, and the WD ones probably have too low a sample size and runtime to be a useful comparison.
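
      For anyone who wants to check the arithmetic, here is a minimal Python sketch. It assumes the annualized failure rate is failures divided by drive-years (drive days / 365), times 100, which is how Backblaze describes its AFR as far as I understand it. The function name and the 43-day figure for the single drive (roughly 1.4 months) are my own assumptions, and the fleet numbers are the approximate ones quoted above, so the results will not match the published table exactly.

```python
def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """Return the AFR in percent: failures per drive-year of operation, times 100."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365
    return failures / drive_years * 100


# Approximate figures quoted in the comment above, not Backblaze's exact table.
# The 43 drive days for the single drive is an assumption (about 1.4 months).
fleets = {
    "Seagate (approx.)": (2, 170_000),
    "Crucial (approx.)": (2, 47_000),
    "single drive, ~1.4 months": (1, 43),
}

for name, (failures, drive_days) in fleets.items():
    afr = annualized_failure_rate(failures, drive_days)
    print(f"{name}: {afr:.2f}% AFR")
```

      With so few drive days behind it, a single failure dominates the rate, which is why the one-drive model's figure cannot sensibly be compared with the others.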

  2. Nate Amsden

    not useful results

    I'm guessing they use Seagate SSDs because they probably get a super good deal on them for buying so many Seagate HDDs.

    But not many people use them, of course, so we can't really leverage this data in a useful way. Unlike their HDD reports, I suspect people are unlikely to switch to the SSDs that Backblaze uses based on their results. Would have been nice had they used the more common vendors.

    My own personal data point for SSD reliability over the long term is using Sandisk and Samsung enterprise SAS SSDs in my 3PAR 7450 which has been running since Nov 4 2014. Most recent batch of drives was installed in that array in 2016, not many, just 32 total SSDs. Just 1 SSD failure in the first 9 years of operation. Don't recall if it was Sandisk or Samsung that failed, but in Jan of this year that one SSD just went offline without warning. Even now nothing has less than 86% wear life remaining. I physically re-arranged the disks a few months ago so can't extrapolate exactly which kind of SSD failed. All were HPE OEM of course, and at the time were super expensive something like $20k+/drive list price at least, not something Backblaze would ever use on their stuff.

    I've yet to have any of my Samsung or Intel SSDs fail on personal devices, starting with a Samsung 850 Pro, which I think was my first one many years ago. I've read reports of 980(?) Pro NVMe SSDs having issues, though mine have been fine (just about a year old, and they came with the fixed firmware already).

    1. John Brown (no body) Silver badge

      Re: not useful results

      "in Jan of this year that one SSD just went offline without warning."

      Yes, IME, SSDs have two primary failure modes: sudden and total death, often no longer even being recognised as present in the system, or switching into read-only mode due to too many unrecoverable cell/sector write failures, which means you can at least recover data if needed. My experience is as a field service tech and the devices are primarily business-grade laptops, so there is little to no monitoring, and I've no idea if either of these failure modes is predictable from SMART monitoring. For my job requirements, I don't really care or need to know this; I just need to confirm it's covered by warranty, i.e. not user damage, replace the faulty parts and get to the next customer site. Personally, I'd like to know the causes and whether proper long-term monitoring of the SMART values would show any useful predictions, but that's something the customer's IT team would have to do and in most cases they really don't care either. User data is backed up or "in the cloud" and they are happy to just replace any failed laptops the users bring in to them.

  3. David-M

    It would be interesting to assess failure rate per £ in some meaningful way, to see how much more reliable expensive ones are, as there's always the choice of cheaper ones being bought more often versus pricier ones slightly less so.

    As an extra note, I've taken to doing a lot of my work in RAM (on a RAM disk) and then just committing it at intervals (with the browser cache switched to memory only), since otherwise working on large files can create a lot of writing. For laptops, losing uncommitted work to a power cut is unlikely to be a problem; for PCs with a stable electricity supply, it will happen very rarely. I wonder if the resulting reduction in writes has any meaningful benefit in regards to lifespan.

    d

    1. Nate Amsden

      I think it'd be mostly determined by what kind of SSD you are using and what it's rated for. My previous laptop, which was in regular use from mid 2016 until Oct 2022 (very little usage since), has a 1TB SATA Samsung 850 Pro in it, rated for 300TB of written data. Samsung Magician software says it has 88.1TB written. It also has two Samsung 950 Pro NVMe drives; one has 5.6TB written and one has 18TB written. This laptop spent 98% of its life booted to Linux but does dual boot with Windows 7.

      I checked my current laptop (~1 year old), which has a pair of 2TB 980 Pros, using Linux's smartctl. One says 113GB written, and the other, which I guess is the main one, says 17.2TB written. Specs say these are rated for 1,200TB. This laptop has spent 99.9% of its life in Linux (and runs a Windows 10 LTSC VM 24/7) but does dual boot with Windows 10.

      For personal use I've only ever purchased Samsung SSDs. Important stuff always gets Pro, and non-important stuff often gets Evo (even stuff that almost never gets used; other people may choose super cheap brand SSDs for those systems). I have at least one Intel SSD (with a skull on it) in my PS4. I really don't even consider other brands at this point regardless of price/purpose (as things are cheap enough now for sure). The only exception was that I bought a few HPE OEM SSDs a while back (they were actually Intel though..).

      So in my case I don't have any concerns about write lifespan. Not all SSDs are created equal though, the 980 Pro(?) had some firmware issues which caused major problems for some folks. Fortunately not me though.
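
      To put the figures above in perspective, here is a rough Python sketch of how much of each drive's rated write endurance has been used. The function names are made up for illustration, the years-in-use values are my own approximations from the dates mentioned (about 6.25 years and about 1 year), and the projection assumes writes continue at the same average rate. The rated figure is a vendor endurance rating, not a prediction of when a drive will actually fail.

```python
def endurance_used_percent(written_tb: float, rated_tbw: float) -> float:
    """Share of the rated write endurance already consumed, in percent."""
    return written_tb / rated_tbw * 100


def years_to_rated_limit(written_tb: float, rated_tbw: float, years_in_use: float) -> float:
    """Years until the rated endurance is reached at the average write rate so far."""
    write_rate_tb_per_year = written_tb / years_in_use
    return (rated_tbw - written_tb) / write_rate_tb_per_year


# (label, TB written, rated TBW, approximate years in use); years are assumptions.
drives = [
    ("850 Pro 1TB", 88.1, 300.0, 6.25),
    ("980 Pro 2TB (main)", 17.2, 1200.0, 1.0),
]

for label, written, rated, years in drives:
    used = endurance_used_percent(written, rated)
    remaining = years_to_rated_limit(written, rated, years)
    print(f"{label}: {used:.1f}% of rating used, ~{remaining:.0f} more years at this rate")
```

      At these write rates both drives would take well over a decade to reach their rating, which fits the lack of concern about write lifespan.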

    2. Zola

      Makes me think the Backblaze figures would be more meaningful with the "drive lifetime write" figures - how can we tell if a drive failed after being absolutely thrashed to death (which would be expected) or if it failed while being mostly idle (or mostly read only), which would be highly unusual?

    3. John Brown (no body) Silver badge

      "I wonder if the resulting reduction in writes has any meaningful benefit in regards to lifespan."

      Yes. Reads are, in terms of normal usage, infinite. Writes wear out the memory cells a little every time they are written to. This is why TRIM, wear levelling and built-in over-provisioning were created. Of course, there are differences, such as single-level and multi-level cell memory, which also affect lifetime writes and therefore longevity and price. The more robust memory cell type will last longer but cost more.

      There's a fairly decent summary here.

  4. An_Old_Dog Silver badge

    What I Wanna Know ...

    Perhaps my answer is in those multiple GB of expanded Backblaze data files, but I'd like to know which, if any, S.M.A.R.T. events indicate an impending SSD failure.

    With rotating media, I've found that high reallocated sector counts indicate you'd better back up that drive if you haven't already, replace the drive, and place the old drive into the presumed-unreliable "scratch drive" pool.

    1. Nate Amsden

      Re: What I Wanna Know ...

      Everything I have read says there is nothing worthwhile in the SMART data that would indicate that the controller on the SSD is about to fail. Everything seems geared towards monitoring the health of the flash media itself, which may be all that is possible with a single point of failure in the controller itself. I'm sure controllers on HDDs fail as well; they just seem to fail at a much lower rate than the media.

      Even in my case with a tightly integrated enterprise storage system(which will proactively fail drives if more than a few errors occur), one SSD failure in 9 years, when it failed the drive just went offline with no warning. One moment it was there the next it was unreachable. When I physically moved some of the drives around earlier in the year I was worried more would fail, as many hadn't been power cycled in probably 5-6+ years, but none did. If any did I wouldn't have lost anything as I was able to instruct the array to move all data off the drives before I moved them.

    2. Spazturtle Silver badge

      Re: What I Wanna Know ...

      Most SSD failures are due to the controller dying which you can't predict. As the actual NAND chips wear out you will see the reallocated sector count rise and eventually the drive will switch to read only mode.

  5. Nik 2

    An Uncomfortable Bath

    That looks like a long way from the classic bathtub curve found in mechanical devices - there's no flat bit on the bottom, and the down and up curves are at similar gradients.

    A proper bathtub shows a very steep drop followed by a prolonged period of low failure rates and a gradual increase after that, but this looks like one of those fancy tubs that are only found in showrooms and expensive boutique hotels.

    1. TVU Silver badge

      Re: An Uncomfortable Bath

      "That looks like a long way from the classic bathtub curve found in mechanical devices - there's no flat bit on the bottom, and the down and up curves are at similar gradients"

      ^ That is a valid point to make and Backblaze has stated that their SSD sample is relatively small so I would expect to see greater deviations from what might be expected. Next year, they should have a larger SSD failure data set to work with and that might start to more closely resemble the expected failure curve.

  6. Anonymous Coward

    So I'm using five Crucial MX drives, with one being used for boot. The boot drive is failing and, upon contacting Crucial, I was told that since my system is on 24 hours a day I should have used something else. The MX drives aren't recommended for this. lol

    Cheers

  7. Mike 137 Silver badge

    Oh for a real statistician!

    "Backblaze has also produced a graph of SSD failures over time to see how well the data matches the classic bathtub curve"

    In reality, the bathtub curve represents the distribution of failures over time for a large population of one make and model of device (i.e. it dissociates infant mortality, middle lifetime and end of life failures for that specific device). The graph shown does not represent this, as it aggregates all makes and models of device and fails to allow for sub-sample sizes, the latter resulting in variable confidence across the entire sample.

    Although Backblaze seems to be the only high profile organisation offering such data (and it's good that someone does), it's a pity they're not a lot more rigorous.
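
    For reference, the textbook bathtub shape for a single device population is often sketched as the sum of three hazard terms: a decreasing one for infant mortality, a constant one for random failures and an increasing one for wear-out. The Python below is only an illustration of that idea. The Weibull shape and scale values and the constant rate are arbitrary numbers I picked to produce the shape, with time in years; none of them are fitted to Backblaze's data.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1), for t > 0."""
    if t <= 0:
        raise ValueError("t must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t: float) -> float:
    """Illustrative bathtub hazard: infant mortality + random failures + wear-out."""
    infant_mortality = weibull_hazard(t, shape=0.5, scale=200.0)  # shape < 1: falling rate
    random_failures = 0.002                                       # constant background rate
    wear_out = weibull_hazard(t, shape=5.0, scale=10.0)           # shape > 1: rising rate
    return infant_mortality + random_failures + wear_out


# Time in years; all parameters are made-up values chosen only to show the shape.
for years in (0.1, 0.5, 1, 3, 5, 7, 9):
    print(f"t = {years:>4} years: hazard = {bathtub_hazard(years):.4f} per year")
```

    Fitting terms like these separately for each make and model, with enough drives in each group, is what a graph of this kind would need before the aggregated curve says much.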

  8. YARR

    Cell level

    I would have thought that the cell level used for each SSD (SLC, MLC, TLC, QLC, etc.) would correspond to the expected number of write cycles before failure? This should be included in the table.

    Also, running virtual memory or a swap partition on an SSD will significantly add to the write wear.
