Facebook SSD failure study pinpoints mid-life burnout rate trough
Facebook engineers and Carnegie Mellon researchers have looked into SSD failure patterns and found surprising temperature and data contiguity results in the first large-scale SSD failure study. In a paper (PDF) entitled A Large-Scale Study of Flash Memory Failures in the Field they looked at SSDs used by Facebook over a four …
COMMENTS
-
This post has been deleted by its author
-
Monday 22nd June 2015 10:49 GMT Anonymous Coward
They've labelled x "usage", which is probably a passable colloquialism for whatever it is they're actually trying to express... data written or similar, one would imagine, despite the earlier implication that time is usage. I'm not sure time is really their forte...
"In a paper (PDF) entitled A Large-Scale Study of Flash Memory Failures in the Field they looked at SSDs used by Facebook over a four year period, with many millions of days of usage."
Erm, 1461 != "many millions"
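Although the two numbers aren't necessarily in conflict if "days of usage" is summed per drive across the whole fleet. A quick back-of-envelope sketch (the fleet size here is a made-up assumption, not a figure from the paper):
days_per_drive = 4 * 365 + 1              # ~1461 days in a four-year window (one leap day)
fleet_size = 10_000                       # hypothetical number of SSDs in service
drive_days = days_per_drive * fleet_size
print(drive_days)                         # 14_610_000 -> "many millions of days of usage"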
-
Monday 22nd June 2015 10:33 GMT Mage
Bathtub curve
So just a different shape, but the same issue as most other electronic products in existence.
I expect CFLs follow a similar curve: early failures due to electronics and manufacturing defects; later failures due to cooked electrolytics, gradual degradation of phosphors (brightness halves every x thousand hours), and abrupt failure due to electrode emission if turned on and off every day.
Of course the failure modes of SSDs are not the same. But I think they have been well understood for very many years. This isn't news or ground-breaking research, but then it's a press release from an exploitative content-free / free-content advert network that wants to make a walled garden.
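For illustration, the textbook bathtub shape described above can be sketched as the sum of a falling infant-mortality term, a flat random-failure term and a rising wear-out term (the constants below are arbitrary, not fitted to any SSD or CFL data):
import math

def hazard(t_hours):
    infant  = 0.05 * math.exp(-t_hours / 500.0)      # early failures dying off
    random_ = 0.001                                   # flat background failure rate
    wearout = 0.00002 * math.exp(t_hours / 5000.0)    # gradual wear-out ramp
    return infant + random_ + wearout

for t in (0, 1_000, 10_000, 30_000, 50_000):
    print(t, round(hazard(t), 5))                     # falls, flattens, then rises again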
-
Monday 22nd June 2015 10:54 GMT Anonymous Coward
I would speculate that the reason it doesn't look like a traditional bathtub curve is that the initial failures are being caused by issues like atom migration, which wouldn't be picked up in a quick test before the drive leaves the factory. Atom migration is slow, but the damage would be cumulative, so you'd see a rise in failures over weeks. Manufacturing would be designed to stop this from happening, though, so I could imagine reaching a state where there were few early-death cases left to find, and the failure rate would then drop.
-
This post has been deleted by its author
-
Monday 22nd June 2015 21:54 GMT asdf
can't resist
>The total number of errors per SSD is highly skewed, with a small fraction of SSDs accounting for a majority of the errors.
OCZ? I keed.
>So once a drive starts to show signs of failing, swap it out immediately. Funny, I've been doing that for 30 years...
Except that with SSDs, from what I have heard, there are often no signs of impending physical failure as there are with spinning rust; they just shit the bed suddenly (controller takes a dump, etc.). Much more of an all-or-nothing type of device.
-
Monday 22nd June 2015 14:01 GMT Innocent-Bystander*
Something Useful from FB
Look at Facebook putting out something that isn't a complete waste of time! I must have entered the Twilight Zone.
So the takeaway from this is: put my desktop's SSD beside a fan for longer life, and only fill it to about 50-70% capacity to let the wear-levelling algorithm do its magic.
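As a rough, toy illustration of why the free space helps (the block counts and write volume below are invented, and real wear levelling also shuffles static data, so treat this as a sketch only):
drive_blocks = 100_000           # hypothetical block count
host_writes  = 5_000_000         # hypothetical lifetime host writes, in block-sized units

for fill in (0.5, 0.7, 0.9, 0.99):
    spare = drive_blocks * (1 - fill)            # blocks the wear leveller can rotate through
    cycles_per_block = host_writes / spare       # erase cycles each spare block absorbs
    print(f"{int(fill * 100)}% full -> ~{cycles_per_block:,.0f} erase cycles per free block")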
-
Monday 22nd June 2015 23:02 GMT razorfishsl
For every 10°C cooler you run, you basically double the life of a component....
This has been known for over 40 years.....
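That rule of thumb is just a doubling per 10°C, i.e. relative life ≈ 2^(ΔT/10). A minimal sketch of it (the reference temperature is arbitrary):
def relative_life(t_reference_c, t_actual_c):
    # rule of thumb: life doubles for every 10 C below the reference temperature
    return 2 ** ((t_reference_c - t_actual_c) / 10)

print(relative_life(50, 40))   # 2.0 -> run 10 C cooler, roughly double the life
print(relative_life(50, 30))   # 4.0 -> 20 C cooler, roughly four times the life
print(relative_life(50, 60))   # 0.5 -> 10 C hotter, roughly half the life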
Let us also be aware that no manufacturer is going to ship crap drives to a big outfit.... they save that crap for the general public, so we can surmise this data is badly biased.
Next up:
"Non-contiguously-allocated data leads to higher SSD failure rates"
How would they know that?
Unless they take the chips off the SSD and look at the actual data storage on the chip, rather than going through the controller....
Just because you ASK the controller where it stuck the data is NO indication of WHERE it actually is on the chip surface or device; there is a mapping relationship in between.
So whilst the controller mapping may say:
"I stuck the data in blocks 5, 200, 70000"
the data on the chip may be at physical locations 1, 2, 3.
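That mapping point can be shown as a minimal flash-translation-layer style lookup, using the numbers above (a sketch, not any real controller's table):
logical_to_physical = {
    5:     1,   # controller reports "block 5",     cells actually at physical location 1
    200:   2,   # controller reports "block 200",   cells actually at physical location 2
    70000: 3,   # controller reports "block 70000", cells actually at physical location 3
}

for logical_block, physical_location in logical_to_physical.items():
    print(f"logical {logical_block} -> physical {physical_location}")
Asking the controller only ever gives you the left-hand column; where the bits physically sit is the controller's private business.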
Also, there is SIGNIFICANT research into read/write disturbance, which has shown that if data is written in contiguous blocks next to each other on the chip surface, it seriously stresses the silicon and causes corruption of data in the surrounding areas.
Basically, the reading/writing causes parts of the chip die to become electrically offset from the read/write amplifiers, resulting in the bit-pair boundaries being incorrectly recognised.....
Or, to put it another way, the playing field is no longer level due to charge build-up in a concentrated area, which incorrectly sets the 'floor' for the boundary recovery of bit pairs compared to other areas of the silicon. Like playing football uphill.
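A toy decode of MLC bit pairs shows that "uphill" effect: the same stored cell voltages read back differently once a local charge build-up offsets them against fixed read thresholds (the thresholds, voltages and drift value are invented for illustration):
THRESHOLDS = [1.0, 2.0, 3.0]              # three boundaries -> four states -> 2 bits per cell
BIT_PAIRS  = ["11", "10", "00", "01"]     # one arbitrary state-to-bit-pair assignment

def decode(voltage):
    state = sum(voltage > t for t in THRESHOLDS)
    return BIT_PAIRS[state]

cells = [0.6, 1.5, 2.4, 3.3]              # voltages as originally programmed
drift = 0.6                               # local charge build-up shifting the 'floor'

print([decode(v) for v in cells])           # ['11', '10', '00', '01'] -> as written
print([decode(v + drift) for v in cells])   # ['10', '00', '00', '01'] -> first two cells misread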
-