Crowd-sourcing interpretation of IBM RAID 5 extension paper

This topic was created by Chris Mellor 1 .

  1. Chris Mellor 1

    Crowd-sourcing interpretation of IBM RAID 5 extension paper

    The thread for comments describing and interpreting and reviewing IBM Almaden Researcher Mario Blaum's paper: "Construction of PMDS and SD Codes extending RAID 5" which can be downloaded as a pdf from here.

    My insufficiency of cerebral matter prevents me so doing. Help please.


  2. Michael

    Quick answer

    RAID 5 can handle one disk failure. If there is a sector failure also you can not recover the associated data.

    RAID 6 can handle two disk failures, or one disk plus one sector but not two disks and one sectors, or even one disk and two sectors.

    The proposition is for encoding of sectors to enable recovery of disk failure and a simultaneous disk failure. By adding additional parity to the disk contents you can recover from a disk and sector failure in RAID 5 which you could not do normally.

    1. Steve Knox

      Re: Quick answer

      More to the point, this paper is about generating a mathematical checksum function which maximizes the probability of recovering from multiple failures while minimizing the space needed to store the checksum.

      Beyond that, it's matrix algebra, which I knew for about two weeks in college, and even then gave me headaches.

  3. Anonymous Coward
    Anonymous Coward

    Well, the answer is...


    Sorry, what was the question again?

  4. Tom Womack

    As far as I can see from the paper, it's offering a technique for getting 3N capacity out of five capacity-N discs, with protection against one disc failure in conjunction with two badly-located unreadable sectors.

    (whereas RAID6 gives you protection against two whole-disc failures, but if you lose one disc and have unreadable sectors in the same place on two of the others then you've lost that sector)

    It seems to involve fifteen reads and five writes per sector write, because it works by looking at groups of sectors on each disc, whilst RAID6 requires three reads (the sector you're overwriting and the two parity sectors) and three writes, so there's a lot more bandwidth used.

    Basically this is a paper which has discovered a pretty mathematical pattern, with a dubious justification that it might be relevant for data recovery. It doesn't make sense in a world in which discs tend to fail mechanically rather than to develop individual bad sectors.

  5. Chris Mellor 1

    Comment from Forum member who has forgotten his password

    Basically the paper gives the theoretical proof and underpinnings for SD codes. SD codes are more efficient than RAID-6 in that you do not have to dedicate 2 disks for parity which tolerate the failure of one disk and one sector. ie RAID-6 protects you from a disk failure and an URE (unrecoverable read error) during a RAID rebuild.

    With SD codes, a disk and a sector within a RAID stripe are dedicated for parity which is more efficient and actually maps to how devices fail. ie Entire disks rarely fail, what is more likely is the failure of a sector within the device. The USENIX paper quoted in the comments has more information. A key quote is below

    "We name the codes “SD” for “Sector-Disk” erasure codes. They have a general design, where a system composed of n disks dedicates m disks and s sectors per stripe to coding. The remaining sectors are dedicated to data.

    The codes are designed so that the simultaneous failures of any m disks and any s sectors per stripe may be tolerated without data loss"

    The research paper offers the proof for why this is so.....


    Chris comment - perhaps we need erasure codes to recover forgotten passwords....

  6. PeteMaths

    So the first thing to understand is a simple error-correcting code. If you want to store some binary string, the simplest thing you can do is just store the string. If your storage is faulty, one failure of even one bit loses everything (or at least you have to do something clever to recover). So you could try to get round this for example by saving the same data three times. Now one bit failure can be corrected, since you can take the majority vote on each bit - in fact like this you can correct many bit failures, provided bit number i doesn't fail in more than one of the copies. But in practice if you think many bits may fail, then it's likely that for some i, bit i does fail in more than one copy and you lose everything. Usually you assume bit failures occur independently at random, and you want to be 99.99% sure that you will not lose data (or something similar, in practice with more nines). This save three times scheme isn't great protection (the bit failure probability has to be very small or you lose data) and it takes a lot of space. You can do better by doing something clever: using some `code' which transforms your binary string into some longer codeword in a more complicated way, which has the property that any two codewords are very different: to get from one to the other you have to change a lot of bits, so after a small number of failures you can still recognise that there was only one possible codeword.

    This works if you can find a set of codewords. But it's practically useless if the only way you can do it is some brute-force search: then your computer takes a century (or more) to set up your storage system. So you need some clever construction that can be computed fast, usually this is from some piece of linear algebra over finite fields. What you want is to minimise both the disc space used (given the failure protection you want) and the time taken to code and decode. Various sorts of clever constructions are well known these days; the problem is essentially solved.

    Now there is a new problem: your failures are not any more independent. If a disc fails, you lose the entire disc. If a sector fails, then other sectors on the disc are unaffected but the sector is gone. The classic codes don't work well (or at least they aren't known to work well) in this sort of model, because they would assume that failure of lots of bits is unlikely wherever those bits are and place an entire codeword in one sector of one disc - you lose the sector and the data is gone. You need a code which will cope well with this sort of thing. RAID 5 does this to a certain limited extent, but maybe even that is not good enough in (some) applications. Because you put lots of extra requirements on your code, you have to be even more clever with coming up with a code. Before this paper, there was RAID 6 (which is a more clever version of the save multiple times idea) and some brute force schemes which showed that it is possible to have such codes but which were practically useless. This paper gives something you could use in practice; it's not too wasteful and the construction is farly quick. But you would only do so if you complain about RAID 5 failures often, and this paper's method only gives you a moderate amount more protection, so it won't help you much. What you would really like is a more general scheme that lets you specify how many discs you can allow to fail before data loss. That doesn't exist yet.

  7. Chris Mellor 1

    Posted for Grant from US Army

    Very simple answer concerning the RAID 5 paper -

    The math is just to validate the findings and is immaterial to his basic premise, which is that by adding additional parity bits to a RAID 5 array, the fault tolerance of RAID 5 exceeds RAID 6, and at a much reduced cost.


    Chris comment: Wow; I wish the abstract had said that!

  8. jimplank

    Please read the FAST paper for context

    Hi Chris -- this is Jim Plank, from the University of Tennessee. I know a few of your readers have posted this already, but I'll reaffirm. The best way to put this paper into context is to read our FAST paper, "SD Codes: Erasure Codes Designed for How Storage Systems Really Fail". The FAST web site has the paper, talk slides and talk video all posted at

    The idea is to design erasure codes that tolerate the loss of entire disks, plus additional sectors, by devoting m disks and s sectors per stripe to coding. This is as opposed to RAID 6, which devotes two whole disks to coding, or Reed-Solomon coding, which devotes m out of n whole disks to coding. The intent of the SD codes is to devote less space to coding, but still to tolerate useful failure scenarios (m disks + s sectors). Whether or not SD codes are truly useful, of course, depends on a lot of factors, which many of your readers have made very insightful comments on. However, the research that Mario and I have been doing it so present them to the storage community so that they may be considered as useful alternatives to standard erasure codes.

    Reed-Solomon codes have a general construction that makes them really practical. Unfortunately, the SD codes only have general constructions in limited cases (s = 1). Mario proved those in his PMDS paper, which, like this TR, is also a difficult read. For the other cases, I threw hardware at the problem and empirically "discovered" SD codes for them -- those are summarized in Figure 4 from the FAST paper.

    What Mario has done in the tech report that you're reading, is derive a general construction for the case when m=1 and s=2, and proving that it's SD. It's not an easy read, but for good reasons, Mario wants to document it, which is why he's written the tech report. I don't really think he meant for anyone to read it, except for when we cite it either in the journal follow-on, or in the software that I'll post. Perhaps a coding theorist may want to corroborate or consult the proof.

    A follow-up to the FAST paper has been recommended to ACM Transactions on Storage, which we're submitting tomorrow. If you want, I'll make a UTK TR out of it and post it next week. In it, Mario has proven constructions for m=1, s=2 and m=2, s=2. I've derived a construction for m=3, s=2, but Mario hasn't proved that it's SD yet, but I've demonstrated that it works for all the cases I've tested. We don't include the proofs in the paper (they'll probably be put into an IBM TR -- ha ha), because they make the paper too hard to read.

    I know Mario's work is hard to read, but the work he's done with this (especially these follow-on constructions and proofs) is brilliant. It has been a sheer pleasure working with him on it, and I'm looking forward to continuing our collaboration on coding & storage.

    Sincerely -- Jim Plank

  9. Chris Mellor 1

    From Paper's author Mario Blaum


    in your note you write: "The paper, by IBM Almaden researcher Mario Blaum, professes to solve a problem where RAID 5 is insufficient to recover data when two disks fail."

    This is an incorrect statement. I never say that in the article. RAID 5 cannot recover data when two disks fail. For that you need RAID 6. The problem addressed in the paper is one disk failure and up to two "silent" failures in sectors. That is, when a disk fails and you start reconstructing using RAID 5, you may find out that you cannot read a sector, and then you have data loss.

    In order to handle this situation, people use RAID 6. But this is a very expensive solution, since you need to dedicate a whole second disk to parity. For that reason, we developed the concept of Partial MDS (PMDS) and Sector-Disk (SD) codes, which have redundancy slightly higher than RAID 5 but can handle the situation described.

    Let me point out that Microsoft researchers in the Azure system came up independently with a similar solution, though the application was different (our main motivation was arrays of flash memory). Both the IBM and the Microsoft solutions involve computer searches. A theoretical solution was an open problem, which I provide in the paper you mention. The solution involves mathematical concepts like finite fields (to which you refer ironically as a mathematical side line with no real world applicability). I will make no apologies for the math and you are certainly free to believe that this is just a mathematical curiosity. However, we recently presented our results at FAST13 (Plank, Blaum and Hafner) and there was great interest. Jim Plank and I are preparing an expanded version under request for ACM. Best regards.

    -Mario Blaum

  10. sblllllll

    RAID 5 can handle one disk failure. If there is a sector failure also you can not recover the associated data.

    RAID 6 can handle two disk failures, or one disk plus one sector but not two disks and one sectors, or even one disk and two sectors.

    The proposition is for encoding of sectors to enable recovery of disk failure and a simultaneous disk failure. By adding additional parity to the disk contents you can recover from a disk and sector failure in RAID 5 which you could not do normally.

POST COMMENT House rules

Not a member of The Register? Create a new account here.

  • Enter your comment

  • Add an icon

Anonymous cowards cannot choose their icon