Re: Fairly easy to work out
Continuing on from my post above, I'd like to explain why I think an Information Dispersal Algorithm (IDA) scheme would be better than RAID (or specifically RAID1 over RAID60) for this type of thing. At heart it's the same idea as the RAID algorithm, except that instead of distinguishing between stripe blocks and parity blocks, every block (or "share", in IDA parlance) is created equal. In RAID, the controller will usually (because it's more efficient) first try to reconstruct the data by combining the stripes, and only if there's a checksum error will it fall back on the parity blocks to repair the problem. IDA has no such distinction: it simply picks any k shares (the "quorum") from the n that are actually stored (e.g., k=2, n=4 for a RAID6-like scheme). Like RAID, it still has to detect problems in the reconstructed data and rebuild bad shares where possible.
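To make the quorum idea concrete, here's a toy k=2, n=4 split-and-combine sketch (in Python rather than the Perl my module uses, just to keep it short). It does the arithmetic mod the prime 257 instead of in GF(2^8) as a serious IDA implementation would, purely so the maths stays obvious; the function names and evaluation points are my own invention, not from any particular library.

```python
# Toy Rabin-style IDA: k=2 shares needed, n=4 shares stored.
# Each pair of data bytes (d0, d1) defines a line d0 + d1*x mod 257;
# share j is that line evaluated at x = XS[j].
P = 257
K, N = 2, 4
XS = [1, 2, 3, 4]            # distinct non-zero evaluation points, one per share

def split(data: bytes):
    """Split data into N shares; any K of them can reconstruct it."""
    if len(data) % K:
        data += b"\x00" * (K - len(data) % K)    # pad to a multiple of K
    shares = [[] for _ in range(N)]
    for i in range(0, len(data), K):
        d0, d1 = data[i], data[i + 1]
        for j, x in enumerate(XS):
            shares[j].append((d0 + d1 * x) % P)  # evaluate d0 + d1*x mod P
    return shares

def combine(pairs, length):
    """pairs: any K (x, share) tuples; length: original data length."""
    (xa, sa), (xb, sb) = pairs
    inv = pow((xa - xb) % P, P - 2, P)           # modular inverse via Fermat
    out = bytearray()
    for a, b in zip(sa, sb):
        d1 = ((a - b) * inv) % P                 # a - b = d1*(xa - xb)
        d0 = (a - d1 * xa) % P
        out += bytes([d0, d1])
    return bytes(out[:length])
```

Any two of the four shares reconstruct the original, which is exactly the "no distinguished parity block" property described above.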
So far, there doesn't appear to be much difference between RAID and IDA. The key difference lies in IDA's suitability for a distributed network storage system. Each IDA share in this sort of scheme would be stored on a completely different network node, so that instead of RAID's complicated failure modes we'd really only have two failure cases to deal with: either the network fails, or a disk or node does. What I said about investing in redundant networking capabilities in the last post applies to IDA too, and with a combination of redundant links and the right network topology we can handle a lot of transient networking failures.
We're still going to have disk or node hardware failures, some transient and some permanent. Recall that as long as k shares are available we can still reconstruct the original data, albeit at a reduced redundancy level of (n-1)/k. One of the beauties of the IDA scheme is that the redundancy level can be altered dynamically: if we detect that only n-1 shares survive, we can reconstruct the data and then regenerate a single new share to bring us back up to the n/k redundancy level. If the error turns out to have been transient, there's no harm done; we can keep the new share, and the redundancy level for that block is now (n+1)/k (i.e., we have n+1 shares, but still only need k of them to reconstruct the original data).
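That regeneration step can be sketched in a few lines of the same toy mod-257 arithmetic (the helper name and example values are mine): recover the data pair from any two surviving shares, then evaluate the share polynomial at the lost share's point.

```python
P = 257

def eval_share(d, x):
    """A share value is the polynomial d[0] + d[1]*x evaluated mod P."""
    return (d[0] + d[1] * x) % P

# Suppose shares exist at x = 1..4 and the one at x = 3 is lost.
# Recover the data pair from the survivors at x = 1 and x = 2:
d = (104, 105)                                   # example data pair ("hi")
s1, s2 = eval_share(d, 1), eval_share(d, 2)
d1 = ((s1 - s2) * pow((1 - 2) % P, P - 2, P)) % P
d0 = (s1 - d1 * 1) % P
new_share = eval_share((d0, d1), 3)              # re-evaluate at the lost point
assert new_share == eval_share(d, 3)             # identical to the lost share
```

The same trick evaluated at a brand-new point (say x = 5) is what gives you the (n+1)/k upgrade rather than a repair.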
Alternatively, if we later decide we want to increase the redundancy level of a whole collection of files, all we need to do is create new shares for each block (*note). Or we can assign different redundancy levels to different collections of files, based on how important they are. That's your backup problem solved right there, provided you have the storage. To achieve something similar with RAID, you'd end up having to build completely new RAID arrays (and copy or rebuild the files in situ) while making sure each disk has enough storage for future requirements. Managing a heterogeneous collection of RAID arrays like this (each with different disk sizes and redundancy characteristics) would be a nightmare. In contrast, the IDA scheme scales very easily: just add more networked disk nodes and adjust quota characteristics. In fact, you can mix and match, so that each node acts as storage for several different IDA collections, each with its own redundancy level, and the amount of reconfiguration needed could actually be negligible.
Besides flexibility and simplified, orthogonal failure cases, IDA is also at least as space-efficient as the equivalent RAID system. The storage requirement of a system that can survive the failure of any n-k of its n nodes is simply n/k times the size of the stored data. RAID can be less efficient, both because of unnecessary 100% mirroring (where IDA's "fractional" shares would work better) and because, to guarantee a given level of redundancy, it has to use more complex schemes than are mathematically necessary: the single-point and compound failure modes I mentioned in my first post lower the real reliability figures.
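To put a number on that n/k figure (the parameters here are just illustrative): a 10-of-14 scheme survives four node failures at 1.4x storage, while plain mirroring needs f+1 full copies to survive f losses.

```python
def ida_overhead(n, k):
    """Bytes stored per byte of data in an n-share, k-quorum IDA scheme."""
    return n / k

def mirror_overhead(failures):
    """Plain mirroring needs failures + 1 full copies to survive that many losses."""
    return failures + 1

assert ida_overhead(14, 10) == 1.4   # survives 4 failures at 1.4x storage
assert mirror_overhead(4) == 5       # the same tolerance by mirroring: 5x
```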
I've made posts about IDA before (as an AC, because I wrote this Perl module to implement it and I don't want to link to it from my regular user handle) and often people would complain about network latencies and so on. However, if you're talking about large storage clusters like this, the system absolutely needs to be networked anyway. I won't bore you with a full comparison of IDA and RAID in a networked environment, but a few other features are worth noting:
* using multicast channels, the file to be split can be sent across the network in its entirety while each node calculates and stores its own shares locally, which is pretty efficient (RAID could do this too, but disks tend to be local rather than network-attached)
* readers can request more than k shares in parallel to reduce read latency (simply use the first k shares that arrive)
* the act of distributing shares also provides a degree of cryptographic security, so storage nodes could even be hosted on the Internet, and an attacker would need to compromise at least k of them to read your files
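The parallel-read trick in the second bullet can be sketched with a thread pool (the node IDs and latencies here are simulated stand-ins, not a real network API):

```python
import concurrent.futures
import random
import time

def fetch_share(node_id):
    """Simulated fetch: real code would request the share from a storage node."""
    time.sleep(random.uniform(0.01, 0.1))    # fake, variable network latency
    return (node_id, b"share-bytes")

def first_k_shares(nodes, k):
    """Request a share from every node in parallel; keep the first k to arrive."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(fetch_share, n) for n in nodes]
        got = []
        for fut in concurrent.futures.as_completed(futures):
            got.append(fut.result())
            if len(got) == k:
                return got   # enough shares to reconstruct; ignore stragglers
    return got
```

The effective read latency becomes the k-th fastest node's response time rather than the slowest's, which is exactly why over-requesting helps.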
Sorry for the long post... IDA is a bit of a fascination for me, as you can see.
*note: IDA can vary n dynamically, but if you need to vary k you have to rebuild all the shares and discard the old ones.