IMO a much better solution would be to use a fitter for purpose replacement for traditional RAID, such as what ZFS brings to the table.
Those who follow storage developments know that there are concerns about the viability of RAID systems. Rebuild times are so long that the chances of an unrecoverable read error (URE) occurring are dangerously high. What is true for traditional disk, however, is not necessarily true for flash. Now that traditional magnetic …
Agreed, old style RAID 1 to N, except for some mirroring, should be dead. It is false data security, because it is flawed even for small capacity arrays; rebuild time is not the only possible problem.
I've had recently bought failed drives detected by ZFS and failing drives detected before they outright failed, even memory issues; my RAID-Z2 ZFS arrays either shrugged or lost very little data. The ability to have more parity drives, hot spares, and flash buffering/caching on a spinning disk ZFS array makes this article look silly and ironically dated. A FreeNAS ZFS RAID box with more powerful and reliable ECC hardware can be built for cheaper than off-the-shelf RAID boxes with decent drive capacity support but often feeble hardware.
Flash is just too damned expensive for most huge capacity use; using flash for write buffering and read caching would be a better use of currency in most cases.
No one in their right mind is using RAID 5/6/50/60 at home with high capacity drives. The rebuild time alone makes the scenario infeasible. Since flash doesn't have the same capacity price points, there aren't many options other than RAID 0/1/10 or JBOD with healthy and vigilant backups.
If I lose a drive, I simply replace it, let the RAID recreate the stripe set and restore from backup. I can restore the 3TB of data across the network far faster than the rebuild could.
Nah, ZFS can get a _LOT_ more mileage out of current spinning rust in terms of data durability - to the point where n+2 and n+3 RAID are still quite feasible if you keep the number of disks per array within reasonable limits (e.g. 4+2 or 8+3). While there is no replacement for having good backups, reducing the frequency of having to restore a few TB of data over the internet connection helps.
Agree. I did the math recently for a new house RAID array and went for RAID 1 instead. The hassle, combined with failure probabilities and rebuild time for 5, 6, 50 or 60, is just not worth it. You are better off throwing in a couple of drives as a DIY MAID on a spare machine and doing nightly backups to that.
Nobody at home is doing what you are doing. That's because it's insane overkill. RAID 5 and RAID 6 are in fact perfectly fine for home users. Rebuilds literally do not matter to most as they happen in the background, and this article amounts to basically fear mongering, given the amount of error correction available today and the ridiculously conservative numbers of manufacturers.
This is all topped off with the fact that unless you have a particular hardware RAID build, UREs don't mean the end of all your data; it just means one single stripe is corrupted and the rebuild continues.
I find it ridiculous you think anything you've said in that comment is sane. To have such an expensive setup when most people who even care about their storage just get a Synology and are done with it is just silly.
I think the math was a bit off.
There are two sorts of problem with loss of data. One is a random failure, and there the BER is near irrelevant, since the random failure mechanism that would cause loss of data requires the SAME block to be corrupted on two drives. The probability of this is the UBE rate multiplied by the number of blocks on a drive, which is roughly. This multiplies the UBE rate by another 2.5E+08, which means it's really unlikely this would happen.
The other problem is a drive failure. Now the question is will a UBE occur during the rebuild time, if there is no parity left. 1E14 is an awful large number of bits, and no consumer operation gets even close to that per day (1E9 is more likely!). (The probability of a second killer BER is actually lower than this on real-world consumer drives, since only a failure in used space should be counted.)
The rules are different on servers
Yes, I think the maths is off. My puny system should have died by now (several times over).
It's 4TB of JBOD with backups. Every Saturday morning a cron job runs an MD5 checksum over all the files. It hasn't detected any random changes in about three years. The last time it did I replaced the disk and restored from backup. With this maths I should be getting an error every three weeks.
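For anyone wanting to try the same kind of weekly check, here is a minimal sketch of what such a job might look like. It is not the commenter's actual script; the function names (`md5_of_file`, `build_manifest`, `changed_files`) are made up for illustration, and it assumes the files are archival, so any checksum change between runs is treated as suspicious.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of one file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its MD5 digest."""
    manifest = {}
    for directory, _subdirs, files in os.walk(root):
        for name in files:
            full_path = os.path.join(directory, name)
            manifest[os.path.relpath(full_path, root)] = md5_of_file(full_path)
    return manifest


def changed_files(previous, current):
    """List files present in both manifests whose digest differs."""
    return sorted(
        path for path in previous
        if path in current and previous[path] != current[path]
    )
```

A cron job would save the manifest from each run and compare it with the previous one. Note that this only detects corruption; it takes a backup (or redundancy) to repair anything.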
"1E14 is an awful large number of bits, and no consumer operation gets even close to that per day (1E9 is more likely!)"
10^14 bits is about 11.4TiB (12.5TB). If you are using the new 6TB in 2+1 RAID5 configuration, you are statistically very likely to hit an unrecoverable sector during rebuilding. Depending on how good your RAID implementation is, you might lose just that data block, or it could be far worse - software RAID is much saner than hardware RAID in this case, with there being a decent chance the latter will just crash and burn and lose the whole array.
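The arithmetic behind that claim can be sketched as follows. This assumes the spec-sheet figure of one unrecoverable read error per 10^14 bits, treats every bit read as an independent trial (a simplification that several comments here dispute), and assumes a rebuild of a 2+1 array has to read both surviving 6TB drives in full. The function names are illustrative only.

```python
import math

URE_RATE_PER_BIT = 1e-14  # assumed spec: 1 unrecoverable error per 10^14 bits


def bits_to_tib(bits):
    """Convert a bit count to tebibytes (2**40 bytes)."""
    return bits / 8 / 2**40


def p_at_least_one_ure(bytes_read, rate_per_bit=URE_RATE_PER_BIT):
    """P(at least one URE) = 1 - (1 - rate)^bits, computed stably."""
    bits_read = bytes_read * 8
    return -math.expm1(bits_read * math.log1p(-rate_per_bit))


if __name__ == "__main__":
    print(f"10^14 bits = {bits_to_tib(1e14):.2f} TiB")
    # Rebuilding 2+1 RAID5 with 6TB drives reads the two survivors: 12 TB.
    print(f"P(URE during rebuild) = {p_at_least_one_ure(2 * 6e12):.3f}")
```

Under these assumptions 10^14 bits comes to about 11.4 TiB (12.5 TB), and the chance of hitting at least one URE while reading 12TB is a bit over 60%.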
Definitely voodoo math.
If the failure is once every 12.5 TB (as claimed), a nearly full 1TB RAID1 set where most sectors contain non-zero values (like the one I have on my old main house server) should fail during the weekly verify job at least once every few months. That set has been running for ~ 3 years now (it just got replaced by a new 2TB one). I need a few more coffees to recall the relevant formulae from probability theory, but it should be somewhere north of 95% for a failure during the 3 years of use.
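For those who would rather skip the coffee, a rough version of that calculation is sketched below. It takes the claimed rate of one error per 10^14 bits (12.5 TB) at face value, assumes each weekly verify reads 1TB (one side of the mirror; reading both sides would double the exposure), and uses a Poisson approximation. The names are illustrative.

```python
import math

URE_RATE_PER_BIT = 1e-14  # claimed spec: one error per 10^14 bits read
ONE_TB = 10**12           # bytes
WEEKS = 3 * 52            # three years of weekly verify jobs


def expected_ures(bytes_read, rate_per_bit=URE_RATE_PER_BIT):
    """Expected number of unrecoverable read errors for a given read volume."""
    return bytes_read * 8 * rate_per_bit


def p_any_ure(verifies, bytes_per_verify, rate_per_bit=URE_RATE_PER_BIT):
    """Poisson approximation: P(at least one error) = 1 - exp(-expected)."""
    expected = expected_ures(verifies * bytes_per_verify, rate_per_bit)
    return -math.expm1(-expected)


if __name__ == "__main__":
    per_week = expected_ures(ONE_TB)
    print(f"expected errors per weekly verify: {per_week:.2f}")
    print(f"expected errors over 3 years: {expected_ures(WEEKS * ONE_TB):.2f}")
    print(f"P(at least one in 3 years): {p_any_ure(WEEKS, ONE_TB):.6f}")
```

On these assumptions the spec predicts an error roughly every 12.5 weeks and a probability well above 99.99% of seeing at least one in three years, so "north of 95%" is, if anything, conservative.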
There is definitely something wrong with the cited numbers.
//So RAID 5 for consumer hard drives is dead.//
You seem to be saying that a single URE event means the entire restore has failed.
I'd always understood that the risk of failure for RAID is that too many disks crap out entirely, before you've rebuilt the set. But you seem to be saying that this is not the case.
If the failure to recover a single sector means that everything is lost, I think the priorities of the system are questionable.
I mean, I have a few very important files, and very much more stuff which I'd like to keep, but could get by without. I make extra copies of the former on other media.
If (i was using RAID and) recovery did go wrong, I'd expect it to recover everything it could, and apologise profusely for the odd file which was lost. If instead it wigs out and fails then you're better off not having it in the first place.
"If (i was using RAID and) recovery did go wrong, I'd expect it to recover everything it could, and apologise profusely for the odd file which was lost. If instead it wigs out and fails then you're better off not having it in the first place."
Except that we are talking about block level corruption, which sits underneath the FS level. Unless you are running a full-stack solution like ZFS it won't be trivial to even find out which file's block was occupying the block that failed to scrub out. With any block level corruption, there is a possibility that the entire FS might end up hosed, even if your RAID implementation is clever enough (and many aren't) to give up on the errant block and continue rebuilding the rest of the data.
The problem is compounded by the fact that most disks today come with no Time Limited Error Recovery (TLER), and those that do don't have it enabled by default. So when an unreadable block gets encountered, the disk will repeatedly try to read it while ignoring all other commands. Eventually the higher layers will time out the commands, and kick the disk out. At which point you will have lost the whole second disk from the array, and thus will quite likely need to restore the whole lot from a backup. With TLER, the disk will time out the command much sooner, before it gets kicked out of the array or off the controller.
Unless you are running a full-stack solution like ZFS it won't be trivial to even find out which file's block was occupying the block that failed to scrub out.
Which is why you'd be insane not to imo. I've had ZFS arrays survive things they really had no right to. It's impressively difficult to kill.
I agree about ZFS.
The other thing about it is that it detects corruption. So, if you take a simple 2-way mirror, you may get a single error on one disk. Most RAIDs I have seen would load balance reads between the 2 disks. So, if you read the block in question, there is a 50% chance that you will try to read the corrupted data.
Now if this is detected by the disk, fine, it will probably move over to the other disk. However, there are forms of silent data corruption which would not be detected, and you would get bad data returned to you. This could end up as simple as producing some corruption in a video you were watching, or could bomb out the entire system (as it was a system file).
With ZFS, it would notice that the file was corrupt and return the valid data from the other disk. It would also copy the valid data across so it was available from the other disk.
The problem, for me, comes when an error is detected on both disks. It handles it the right way, for most things: It throws an error and makes note that the data is unreadable. However, let us say that it is only one block in a 10GB video file. ZFS would mark the entire file as corrupt and, even though the video would probably still play fine, it is gone.
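The behaviour described above can be illustrated with a toy model. This is only a sketch of the idea (checksum stored separately from the data, verify on read, repair from a good copy), not how ZFS is actually implemented; the SHA-256 choice and the names `checksum` and `self_healing_read` are assumptions made for the example.

```python
import hashlib


def checksum(data):
    """Checksum kept apart from the data block itself (toy stand-in)."""
    return hashlib.sha256(data).hexdigest()


def self_healing_read(copies, expected_checksum):
    """Return a verified copy of a block and repair any bad mirror copies.

    copies: mutable list of bytes objects, one per side of the mirror.
    Raises IOError when no copy matches the expected checksum.
    """
    good = next(
        (copy for copy in copies if checksum(copy) == expected_checksum),
        None,
    )
    if good is None:
        # Error on both disks: report it rather than return bad data.
        raise IOError("block unreadable: every copy fails its checksum")
    for index, copy in enumerate(copies):
        if checksum(copy) != expected_checksum:
            copies[index] = good  # rewrite the bad side from the good one
    return good
```

With a plain mirror and no checksum, the reader has no way to tell which of two differing copies is the right one; here the stored checksum is what breaks the tie.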
Depends on your data if that's a bad thing or not I suppose. We throw a few hundred gig a day through some medium sized ZFS arrays (few tens of T) and it's mostly output from big scientific calculations and it really is imperative that it's _accurate_ more than it isn't lost. If we lose it, we can run the calculation again. If it's written but it's _wrong_ that could cause far more problems (Which is why we chose ZFS in the first place)
Depends on your data if that's a bad thing or not I suppose.
True, but it would be nice if there was an easy way to say "discard the bad bits but give me what's left", or even "ignore the error, just give me the (corrupted) file". However, I'd always rather know about the error, and it's only a minor issue for me. I can always restore important data from backup, and other things can normally be reacquired.
In most of the real world cases I see, the URE issue is not as much of a problem as the worst case scenarios that people evangelizing flash as the answer to just about everything like to bring up.
Mechanical failure of the drives still is the biggest issue, and as such doing a RAID 6 with Enterprise rated SAS drives (when you need maximum storage capacity) and not running your drives into the ground usually provides high success rates for rebuilds. The issue that I often see is someone running a RAID 5 array for 5+ years, then they get a failure, and during the recovery the added workload on the remaining drives causes another one to give up before the rebuild can be completed fully, as the drives are all past their expected prime lifespan.
Flash has its own set of issues, especially once you start pushing out to the 5+ year mark & different drives can have very different reliability factors depending on which corners were cut to hit a price / benchmark target.
...right next to the pile of dead flash. Once so glorified, almost deified, so full of hype, starring in countless powerpoints, having a future oh-so-bright. Yet there they are. Mere mortals like everybody else.
Oh, well. Back to work. Gotta keep the blinkenlights flashing.
This particular 3PAR array was not heavily loaded at the time (80x15k RPM disks); much of my workload has been shifted to the all-flash system since, well, it's faster and has 4 controllers (I also like the 99.9999% availability guarantee on the new system, though the previous 3PAR has had 100% uptime since its deployment 3.5 years ago). But I just wanted to show that RAID even with big drives is still quite doable with a good RAID system (like 3PAR; I think XIV is similar, though not as powerful/flexible as 3PAR).
rebuilding from a failure of a 600G 15k SAS drive in ~40 minutes.
Most/all of the drives in the system participate in the rebuild process, and latency remains low throughout. This technology obviously isn't new, easily 12-14 years old, quite possibly older.
I wrote a blog post (sorry for the link) almost 5 years ago on the topic, and even quoted someone who was using RAID 6 with ZFS who suffered total data loss due to multiple failures (in part because it took more than 1 week to rebuild the RAID):
Now I believe ZFS offers at least triple parity RAID (optional); maybe they go beyond that too.
I don't have NL drives in my current arrays, just 10k, 15k and SSD. My last 3PAR with NL drives was 5 years ago (~320 drives); it rebuilt pretty quick too even with the 1TB disks, since again the rebuild was distributed.
*very* few of my servers have internal storage of any kind.
First point has already been made - you just can't do all-flash for a lot of cost & space requirements.
Second point, as most folk will know sooner or later, HDD don't suffer from simple random bit errors, they are almost always big clusters at a time and generally much more common than the quoted BER figures would tell you.
Worse still is that most file systems don't tell you if something is corrupted, so if you do get a rebuild error on sector 1214735999 then how do you know which file to restore? Yes it is possible to work that out but it is a major PITA to do so. Furthermore, you can have errors that are not from disk surface read flaws, such as the odd firmware bug in HDDs, controller cards, etc. So you really want something that protects against all sorts of underlying errors if you have big volumes of data (or really important stuff). Enter ZFS or GPFS as your friend - they have file system checksums built in. And if it matters make sure the system has ECC memory so you don't get errors in cached data being written to disk!
The multiple day rebuild times are not such a problem in some ways just so long as another HDD doesn't fail during it. So if you have any biggish array you should start by using double parity. It is much better to have 8+2 in a stripe than 2*(4+1) in terms of protection against double errors, etc.
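The 8+2 versus 2*(4+1) comparison can be checked by brute force. The sketch below enumerates every way a given number of drives can fail out of ten and counts the combinations that lose data; it assumes each parity group tolerates a fixed number of whole-drive failures and ignores UREs. The function name is made up for the example.

```python
from itertools import combinations


def fatal_combinations(group_sizes, tolerated, failures):
    """Count failure combinations that exceed a group's parity.

    group_sizes: drives per parity group, e.g. [10] or [5, 5].
    tolerated:   drive failures each group survives (1 = single parity).
    failures:    number of simultaneously failed drives to enumerate.
    Returns (fatal, total).
    """
    drives = [
        (group, slot)
        for group, size in enumerate(group_sizes)
        for slot in range(size)
    ]
    fatal = total = 0
    for failed in combinations(drives, failures):
        total += 1
        per_group = [0] * len(group_sizes)
        for group, _slot in failed:
            per_group[group] += 1
        if any(count > tolerated for count in per_group):
            fatal += 1
    return fatal, total


if __name__ == "__main__":
    print("8+2,     two failures:", fatal_combinations([10], 2, 2))
    print("2*(4+1), two failures:", fatal_combinations([5, 5], 1, 2))
```

With the same ten drives and the same usable capacity, 8+2 survives all 45 possible double failures, while 2*(4+1) loses data in 20 of them (any pair that lands in the same group).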
Finally if you have an array make sure you regularly scrub it - most RAID implementations support this (hardware cards, Linux's MD system, ZFS, etc) and it forces the controller to read all sectors of all discs periodically so errors can be detected and probably corrected before you have a HDD fail completely (they will do the parity checks and attempt a re-write, probably forcing a sector reallocation on the flaky HDD). For consumer HDD do it every week or two, for enterprise you can probably get away with once a month.
Full disclosure; NetApp employee, vendor of high quality & properly engineered UFAs (Unified Flash Arrays). And the company that first provided its customers with flash as a cache in front of boring old spinning rust, with properly engineered protection (in September 2009).
NetApp was awarded United States Patent 7,640,484 on December 29, 2009 for triple parity raid.
Here you'll find [the math behind triple parity RAID] explained in some detail.
There are a number of other assertions in this article that I disagree with. Unfortunately, they're largely technical in nature and require a lot more space (and maths) than this comment box. Basically, we'd argue that multiplying two very small numbers together to generate a vanishingly small number isn't the whole story (in fact, if you'll pardon the pun, it's only a small fraction of it). This [NetApp Technical Report from 2007] might give a flavour. Yes, it dates from 2007, but it's still relevant. Maths is like that; pretty timeless.
Usually the biggest error made in predicting RAID failures is the presumption of uncorrelated faults. Most of us know from bitter experience that faults are much more likely to happen in a strongly correlated manner due to:
1) Manufacturing defects (or buggy firmware) that impact on a lot of disks, and you have all from the same batch...
2) A stress event prompting the failure, such as power cycling after years of up-time, or an overheating event due to fan failure, etc, that is common to most/all of the HDD in the RAID array.
So you should start by assuming HDD faults of around 5% per year and do the maths from that, not from claimed BER figures.
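As a starting point for "doing the maths" from a 5% annual failure rate, here is a minimal sketch of the chance that another drive dies while a rebuild is running. It assumes a constant failure rate and, importantly, independent failures, which is exactly the assumption criticised above; with shared batches or a common stress event the real figure will be higher. The function name and the example numbers (7 remaining drives, 7-day rebuild) are illustrative.

```python
def p_second_failure(remaining_drives, rebuild_days, afr=0.05):
    """P(at least one remaining drive fails during the rebuild window).

    Assumes a constant hazard, so one drive survives t years with
    probability (1 - afr) ** t, and that drives fail independently.
    """
    drive_years = remaining_drives * rebuild_days / 365.0
    return 1.0 - (1.0 - afr) ** drive_years


if __name__ == "__main__":
    # Hypothetical example: 8-drive array, one drive lost, week-long rebuild.
    print(f"{p_second_failure(7, 7):.4f}")
```

Under independence this comes out below 1% for the example given, which is why correlated failures, rather than the headline AFR, tend to dominate real-world array losses.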
"So you should start by assuming HDD faults of around 5% per year and do the maths from that, not from claimed BER figures."
5% is near the AFR (Annual Failure Rate) ballpark. That's total failure of the disk, not the BER.
Here is a link to the most recent analysis by Backblaze:
Bit/sector errors are going to be considerably higher (unless you count that a 1TB disk completely failing constitutes 8Terabits of errors).
It is also worth noting that AFR and BER relate to two very distinctly different failure modes. Traditional RAID protects you from complete failures (as measured by AFR), but is massively more wobbly in case of sectors duffing out. There is also a failure mode that is a subclass of the duff sectors and that is latent bit errors, which basically means the disk will feed back duff data rather than throw an error saying the sector was unreadable. This could happen for a number of reasons, including firmware bugs, phantom writes to the wrong sector, head misalignment causing the wrong sector to be read, etc. - and it happens more often than you might think. Here is a link to a very good paper on the subject:
Against these sorts of errors (by far the most dangerous kind), the _only_ available solution is a fully checksumming file system like ZFS, GPFS, or BTRFS (make sure your expectations are suitably low when trying the latter).
Ouch - the typos... have you drafted in a new copywriter/ghost writer?
Also, I think some commentards are getting het up about the wrong things: I think TP is taking a consistency viewpoint - once you've "lost" a bit (all that's needed) due to error, yes the technology will soldier on, but your data may now be worthless.
The math is indeed complex, however the same observation is why I stopped backing up our nightly multiTB GIS database to tape: Disk->Disk->tape means at least three reads/writes (including one off tape which has a higher error rate in real life backup cycles) to construct a restore point: way over the point of guaranteeing data consistency, and without a consistent database it cannot be used as we require...
And I find rebuilding my home 4*500MB NAS painful enough to be another one who will blitz rather than rebuild.
While I've been advocating that people don't touch RAID5 with a bargepole if they require their data to be safe for a very long time now, I have to say that this article gets it hopelessly wrong. RAID5 is bad enough to be totally inappropriate for very large volumes of data that have to be safe and have anything more than an utterly trivial update rate (and is only suitable for the trivial update rate stuff if a policy of reading a few sectors and writing them back again so that the whole lot gets hardware-checked regularly is adopted). However, it's pretty straightforward to keep a disc with a bad sector still in the system - don't take the disc out, find what content the sector should have and then recover that sector, not the whole disc. After too many sectors have had to be moved, then recover the whole disc - but do it while full redundancy is still available, not when it's too late. So it's not as bad as the article suggests.
If one uses RAID 10, plus sector (or track) recovery instead of disc recovery until too many sectors (or tracks) are broken, one can have a pretty viable RAID. Using Flash doesn't really help, unless the data is all just about read-only so that there are not so many writes that the disc becomes totally unreliable in too short a time. Of course one can use Flash for very rarely written data, so that the read latency is substantially lower than if hard discs were used, and if the writing is rare enough, a RAID 10 configuration is used for the data on flash, and extra reads are used to ensure that each copy of everything is read often enough to detect errors early enough but not too often (and recovery is by reassigning sectors when necessary, as it would be on HDD), Flash could improve things a bit - but I'm not aware of any flash RAID that can do that.
Switching from the article to some of the early comments: As for ZFS, it appears to me to improve error detection; it can either handle redundancy itself, or rely on redundancy in some storage mechanism it uses. I've seen nothing to convince me that it offers any advantages other than the early error detection, and it's not the only system that does that.
Biting the hand that feeds IT © 1998–2021