Traditional RAID is outdated and dying on its feet

The video below is an interview I did with IBM storage guru Robert Murphy at SC13 in Denver. I’m still in catch-up mode after my recent near-disastrous rootkit episode. [YouTube video] In the video, Robert and I talk about how today’s typical RAID mechanisms (1, 5, 6, 10) just aren’t up to the job of protecting data against …

COMMENTS

This topic is closed for new posts.
  1. A Non e-mouse Silver badge
    WTF?

    Eh?

    According to IBM, rebuilds are radically faster (minutes rather than hours)

    Let's assume a 1TB drive. A typical speed for a HDD is 100MB/s. That gives a time of just under three hours to write 1TB of data to the disc. Still in the hours territory.

    Let's look at it the other way. Let's say the rebuild of our 1TB drive is done in 15 minutes. That requires a write speed of just over 1GB/s. No HDD is this fast, nor are the current SATA/SAS-II interfaces. (Although Wikipedia claims faster ones are available)

    Now a current SATA/SAS-II port can run at 6Gb/s or 768MB/s, so if we saturated such a port, we'd require just under 23 minutes to send the 1TB of data over the interface. (Assuming no protocol overhead)

    Once you increase the drive size, these times will proportionally get bigger.

    So how are they going to rebuild a 4TB drive in minutes?
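
    For reference, here are the same back-of-envelope sums in Python, using binary units (1TB = 2^40 bytes) and the round figures above:

      # Back-of-envelope rebuild arithmetic (binary units: 1 TB = 2**40 bytes).
      TB = 2 ** 40                 # bytes per TB
      MB = 2 ** 20                 # bytes per MB

      hdd_speed = 100 * MB         # typical HDD sequential write, bytes/s
      print(f"1TB at 100MB/s: {TB / hdd_speed / 3600:.1f} hours")               # ~2.9 hours

      target = 15 * 60             # a 15-minute rebuild, in seconds
      print(f"speed needed for a 15 min rebuild: {TB / target / MB:.0f} MB/s")  # ~1165 MB/s, i.e. just over 1GB/s

      link_speed = 768 * MB        # 6Gb/s port, ignoring protocol overhead
      print(f"1TB over a saturated 6Gb/s port: {TB / link_speed / 60:.1f} minutes")  # ~22.8 minutes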

    1. Anonymous Coward
      Anonymous Coward

      Re: Eh?

      IBM is outdated and dying on its feet.

      1. Anonymous Blowhard

        Re: Eh?

        "IBM is outdated and dying on its feet."

        I suspect that any company with almost 100 billion dollars revenue and with more Nobel prizes than a lot of countries will be around long after most of us, you included, are dead and buried.

        IBM doesn't sell much to consumers, a bit like Boeing and Airbus, but what it provides for business users is hard to replace; IBM has seen off most of the would-be pretenders to its big-data throne (DEC, BULL, HP, ICL etc.) and I can't see any competitor eating their lunch in the near future. The only credible rival in this space is Oracle (+Sun) and they are less than half the size of IBM.

    2. Malcolm 5

      Re: Eh?

      I guess if you had enough live free space you could rebuild the protection by making new copies split over multiple drives whilst bringing a new drive into live use to replace the free space?

      1. Anonymous Coward
        Anonymous Coward

        Re: Eh?

        I would assume they're rebuilding the data across the cluster, sharing the work across the drives, and then later shifting the rebuilt data to the resurrected drive. Not dissimilar to the way Hadoop and Cassandra handle deaths.

      2. Anonymous Coward
        Anonymous Coward

        Re: Eh?

        I'm still a complete novice and only a home user, but "whilst bringing a new drive into live use to replace the free space?" seems the most viable option. As soon as the disk fails, rebuild and spread the rebuild across all the free space on all the disks, while planning for the new disk to be entirely "new space". Well, not quite, as you still need parity/stripe I guess, but as close as the resources will let you stretch them.

        PS, I also love how this all sounds like an immune system or growth of an organic creature. When something dies/breaks, all the rest of the "body" works with the repair and recovery. :)

    3. ToddR

      Re: Eh?

      You only need to replace part of the drive (depending on the # of parity stripes).

    4. Anonymous Coward
      Anonymous Coward

      Re: Eh?

      I guess IBM is talking about a similar thing to ZFS where it only rebuilds used blocks on the disc. Also, on our ZFS filer with a factory configuration, we effectively have a JBOD of RAID 6 groups, so only part of the file system is affected by a failure and it doesn't end up rebuilding the whole pool.

      1. Gordan

        Re: Eh?

        "I guess IBM is talking about a similar thing to ZFS where it it only rebuilds used blocks on the disc."

        While that works when the FS is mostly empty or contains only large files in large blocks, rebuilding a mostly full vdev of small files can take much longer than rebuilding a traditional RAID, because you go from linear read/write to largely random read/write (150MB/s linear speed vs 120 IOPS, which could be 480KB/s on 4KB blocks). Then again, if your RAID is under load during the rebuild you are going to end up in the random IOPS limit anyway.
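
        To put rough numbers on that gap, here's a quick sketch using the figures above (actual resilver behaviour will vary with layout and load):

          # Sequential vs IOPS-bound resilver throughput, using the round figures above.
          KB, MB, TB = 1024, 1024**2, 1024**4

          seq_rate = 150 * MB                # sequential read/write, bytes/s
          iops, block = 120, 4 * KB          # random 4KB I/O on a busy or fragmented vdev
          rand_rate = iops * block           # = 480KB/s

          data = 1 * TB                      # allocated data to resilver
          print(f"sequential: {data / seq_rate / 3600:.1f} hours")    # ~1.9 hours
          print(f"random 4K:  {data / rand_rate / 86400:.0f} days")   # ~26 days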

    5. jabuzz

      Re: Eh?

      The way to think of it is you take, say, a 4U 60-disk enclosure stuffed full of disks, but instead of doing RAID6 over whole disks you break each disk up into, say, 1GB chunks and then do an 8D+2P RAID6 on the chunks, which you scatter over all the disks. You then treat the RAID6 chunk sets as LVM PVs, throw them into a VG and carve LVs out of it. Every disk is then left with some of its 1GB chunks free, enough to cope with however many "hotspares" you specify. When a disk then fails, all 59 remaining disks are involved in the rebuild of the damaged RAID6 chunk sets. The rebuilt bits then get scattered over the remaining disks. Obviously you need to keep track so no RAID6 set has more than one chunk on a single drive.

      NetApp have something similar in the Engenio/SANtricity range called Dynamic Disk Pools. IBM's XIV also did something similar but using RAID1 which is a bit rubbish for storage density. Of course none of this comes into play until you have a large number of disks. As such traditional RAID is not going anywhere fast for most people.

      http://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/Dynamic_Disk_Pooling_Technical_Report.pdf
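
      A toy illustration of that chunk-scatter idea (my own sketch in Python; real products use carefully balanced placement maps, not random sampling):

        # Toy declustered 8D+2P placement over a 60-disk enclosure: each stripe of
        # ten 1GB chunks lands on 10 distinct disks, so when one disk dies the
        # rebuild reads are spread over all 59 survivors.
        import random

        NUM_DISKS, STRIPE_WIDTH, NUM_STRIPES = 60, 10, 5000   # 8 data + 2 parity chunks per stripe
        random.seed(0)
        placement = [random.sample(range(NUM_DISKS), STRIPE_WIDTH) for _ in range(NUM_STRIPES)]

        failed = 17
        affected = [s for s in placement if failed in s]      # stripes that lost a chunk

        reads = [0] * NUM_DISKS                               # chunk reads needed per surviving disk
        for stripe in affected:
            for disk in stripe:
                if disk != failed:
                    reads[disk] += 1

        print(f"{len(affected)} of {NUM_STRIPES} stripes need repair; "
              f"busiest survivor reads {max(reads)} chunks rather than a whole disk's worth")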

      1. Michael Duke

        Re: Eh?

        You mean the same system HP have used in EVA and now 3Par for years?

        1. G Olson

          Re: Eh?

          "You mean the same system HP have used in EVA and now 3Par for years?"

          deja storage vu

          No matter how much marketing methods change, the storage stays the same.

    6. foo_bar_baz

      Re: Eh?

      Internet commenting is always entertaining: statements and declarations based on guesswork and conjecture. Let me indulge in some guesswork myself. Maybe the "rebuild" time is not about rebuilding a single disk, but about rebuilding the entire storage system from "degraded" to "healthy".

      Let's say your data is distributed across 100 storage nodes. Any one chunk of data in a "healthy" system is stored on n nodes where 100 > n > 2. One node dies, so the array is now degraded. To "rebuild" the system you just have to copy the chunks to free space on a sufficient number of nodes to satisfy the above requirement. Given that GPFS gets its high performance from storing files in small chunks across many nodes (not just 2 as in RAID1), it follows that rebuilding is also very fast. "Rebuilding" does not even necessarily have to involve replacing the broken node with a new one.
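
      To continue the guesswork, here's roughly what that degraded-to-healthy repair looks like in code (made-up numbers, purely illustrative):

        # Re-protecting replicated chunks after a node failure: every chunk that
        # lost a copy gets a new copy on a node that doesn't already hold one, so
        # the copy work fans out across the surviving nodes and no replacement
        # node is needed to get back to "healthy".
        import random

        NODES, CHUNKS, COPIES = 100, 10_000, 3
        random.seed(1)
        placement = {c: set(random.sample(range(NODES), COPIES)) for c in range(CHUNKS)}

        dead = 42
        writes = [0] * NODES
        for copies in placement.values():
            if dead in copies:
                copies.discard(dead)
                target = random.choice([n for n in range(NODES) if n != dead and n not in copies])
                copies.add(target)
                writes[target] += 1

        print(f"{sum(writes)} chunks re-copied, spread over {sum(1 for w in writes if w)} surviving nodes")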

      1. JEDIDIAH
        Linux

        Re: Eh?

        > Internet commenting is always entertaining,

        Yes they are. You see plenty of handwaving and eager attempts to make things as complicated as possible but no real feasible explanation of how 4TB of missing data is going to get itself rebuilt in a few minutes.

        It sounds like the best you can do is just ignore the situation and pretend that it's not really a problem.

    7. Steven Jones

      Re: Eh?

      You are quite right, you can't rebuild a 4TB single disk in minutes. It's utterly impossible. However, what you can do is have a distributed system such that the 4TB of "lost" data is actually lots of much smaller segments. Let's say 100 segments of 40GB each, each of which is part of a redundant set of some sort. Then you re-assemble those 100 x 40GB segments on 100 different disks using spare capacity. It's certainly possible (in theory) to write 40GB to a single disk in less than 10 minutes.

      That's the basic principle: you use some form of distributed redundancy scheme such that if you ever lose one disk, you can involve a very large number of disks in both the recreation of the "lost" data and in the reconstruction.

      Of course you can do this with a clever file system, but it's also certainly possible with block-mode arrays too. File systems can be rather more clever than that as it's only the active data that needs to be recovered. A native block mode device, at best, can only know what blocks have been written to via a bit map. However, doing all this requires a lot of heavy processing and data shuffling. You can't simply delegate the task to some dedicated hardware RAID controller to do in the background.

      That 4TB recovered in 10 minutes still involves (at least) reading and writing a total of 8TB of data (depending on how the redundancy is generated), and that's at least 13GB per second. With more complex, and space-efficient redundancy schemes, you may have to read several times that to recreate the data, so it would be even higher. That's very demanding, and will require a lot of CPU and memory bandwidth which also, of course, has to carry the normal workload at the same time. Of course, if you relax that 10 mins restore time, it's less demanding.

      It was my experience of storage arrays (mid and enterprise) that the processing capacity was often the limiting factor on I/O performance (depending on access patterns).
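
      Putting those numbers into a quick script (decimal units, same round figures as above):

        # Distributed-rebuild arithmetic for the example above (decimal units).
        GB, TB = 10**9, 10**12

        lost = 4 * TB                 # data that lived on the failed drive
        segments = 100                # rebuilt as 100 x 40GB segments on 100 other disks
        per_disk = lost / segments
        hdd_write = 100 * 10**6       # ~100MB/s per disk
        print(f"per-disk write: {per_disk / hdd_write / 60:.1f} minutes")       # ~6.7 min, done in parallel

        window = 10 * 60              # a 10-minute rebuild window
        io_total = 2 * lost           # at least read + write the lost data once each
        print(f"aggregate bandwidth needed: {io_total / window / GB:.1f} GB/s") # ~13.3 GB/s across the array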

      1. frobnicate
        Headmaster

        Re: Eh?

        > You are quite right, you can't rebuild a 4TB single disk in minutes. It's utterly impossible.

        It is entirely possible and reasonable. There are 2 ingredients here:

        0. Parity declustering. In a parity-declustered array, an 8+2 RAID6 layout, for example, can be used to stripe data across a large number of drives, say 100 (rather than 10, as in standard RAID). This means that only a small fraction (10% in this example) of each drive has to be read during rebuild. See Holland's thesis (http://www.pdl.cmu.edu/PDL-FTP/Declustering/Thesis.pdf) for details.

        1. Distributed spare. By allocating spare space on each device, a fraction of the *total* array bandwidth can be used for the rebuild. That is, the wider the array, the faster the rebuild.

        The funny thing is that this technology is 20 years old.
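
        A rough worked example of how those two ingredients combine (hypothetical round numbers, not anyone's measured figures):

          # Effect of parity declustering plus a distributed spare on rebuild time.
          TB, MB = 10**12, 10**6

          drive_size = 4 * TB
          disk_bw = 100 * MB            # per-drive sequential bandwidth
          drives = 100                  # width of the declustered array
          stripe_width = 10             # 8 data + 2 parity

          # Classic RAID6: the single hot spare is the write bottleneck.
          classic_hours = drive_size / disk_bw / 3600
          print(f"classic hot-spare rebuild: ~{classic_hours:.0f} hours")

          # Declustered + distributed spare: each survivor touches only about
          # stripe_width/drives of a drive's worth of data, so the rebuild runs
          # roughly drives/stripe_width times faster.
          speedup = drives / stripe_width
          print(f"declustered rebuild: ~{classic_hours * 60 / speedup:.0f} minutes "
                f"({speedup:.0f}x faster; wider or emptier arrays shrink it further)")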

    8. Anonymous Coward
      Anonymous Coward

      Re: Eh?

      Sounds like what Storage Spaces already does:

      Storage Spaces now includes the ability to automatically rebuild storage spaces from storage pool free space instead of using hot spares.

      What value does this change add?

      Rebuild times are accelerated because multiple disks in the pool can accept the data that was stored on the failed disk instead of waiting for a single hot spare to write all of the data. Additionally, hot spare drives are no longer needed, and storage pool free space can provide additional capacity and performance to the pool.

      What works differently?

      When a physical disk fails, instead of writing a copy of the data that was on the failed disk to a single hot spare, the data is copied to multiple physical disks in the pool so that the previous level of resiliency is achieved. Administrators no longer need to allocate physical disks as hot spares in the storage pool.

    9. talk_is_cheap

      Re: Eh?

      >> So how are they going to rebuild a 4TB drive in minutes?

      You don't, but if you move the RAID from the physical device level to the file system block level, you don't rebuild the single drive. What you do is spread the file system blocks that would have been found on the failed drive across all the remaining active drives (excluding the drives that already contain the block). For object storage you just do this at the object level, rather than the block level.

      I'm not sure why IBM think that this is 'new' news, as it's a feature of a number of file systems already.

    10. danolds

      Re: Eh?

      The way I understood this is that we're essentially talking about using the entire set of drives in the array (48? 64? or more) to rebuild the failed drive. This would be a lot faster, of course, than having five or ten drives rebuilding a single drive.

      As to the internal bandwidth and the limitations there, you make a good point. However, I'm pretty sure we're talking about SATA 3 rather than SATA 2, and PCIe 3 as well. I'm not positive on these points, but it makes sense to me that they'd use the latest/greatest on this new storage box.

    11. Jim 59

      and another Eh?

      Disks today are bigger than ever... When a 4TB drive fails, it takes... 20 hours to days for a single 4TB rebuild.

      Disks are bigger but faster with it. Writing a 100 MB disk to full capacity in 1990 took about the same time as writing a 4 TB disk to capacity in 2014 (a few hours in both cases). Hence rebuild times should not be increasing in quite the way IBM would like to suggest.

      1. Nigel Campbell

        Re: and another Eh?

        Disks are getting bigger a lot more quickly than they are getting faster. To a first approximation, disk capacity grows with the inverse square of the linear bit size, whereas sequential read speed grows only linearly with the inverse of the bit size. A modern hard drive takes a lot longer to fill up than a drive built in 1990.
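
        A simplified way to put numbers on that, ignoring RPM, platter count and interface changes (the 1990 baseline below is a rough guess):

          # Areal-density scaling: shrink the linear bit size by a factor k and,
          # all else being equal, capacity grows ~k**2 while sequential speed grows
          # ~k, so the time to stream a whole drive grows ~k = sqrt(capacity gain).
          def full_drive_hours(capacity_gain: float, base_hours: float = 0.05) -> float:
              """Hours to write a whole drive, scaled from a ~100MB drive of 1990."""
              return base_hours * capacity_gain ** 0.5

          # A ~100MB drive to a ~4TB drive is a ~40,000x capacity jump:
          print(f"{full_drive_hours(40_000):.0f} hours")   # ~10 hours, versus minutes in 1990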

  2. clean_state
    Facepalm

    broken RAID

    Have they fixed the broken RAID controllers that take the WHOLE ARRAY offline if they encounter a parity error during the rebuild? So a single bit flipped because of a cosmic ray, something that should have cost you one file, takes down the entire array thanks to the geniuses that made the RAID controller (in my case, it was DELL).

    1. jabuzz

      Re: broken RAID

      LSI and presumably now NetApp have patents covering that. So potentially the whole array has to go offline because just reporting a bad block on the ones with the parity errors is not possible due to patent restrictions.

      Note I have personally suffered a parity error on a RAID5 rebuild on an LSI/NetApp Engenio array. It was an interesting case; it involved some maths to work back from the block on the RAID array that was duff, through the LVM and then the ext3 file system, to find the affected files. I was then able to mark the blocks as good again in the array controller (which left them as zeros), delete the files and restore them from tape.

      Anyway don't blame Dell as that is covered by a patent.

      1. Destroy All Monsters Silver badge
        Paris Hilton

        Re: broken RAID

        Blaming Dell is covered by a patent, too?

      2. Ammaross Danan

        Re: broken RAID

        An easy way to sidestep that patent would be to do what ZFS or BTRFS does: checksum each block in addition to the usual "raid" parity/mirroring. Then, even a "RAID0" is protected from cosmic-ray-bit-flipping with a rebuild-capable checksum on each block. Of course, these two file systems just use the raid controller as a JBOD interface so the system doesn't halt up due to a bad block on a single drive anyway....
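
        A minimal sketch of that per-block checksum idea (illustrative only; ZFS and Btrfs keep checksums in their own metadata trees rather than doing anything like this):

          # Per-block checksums with self-healing from a mirror copy (toy example).
          import hashlib

          def checksum(block: bytes) -> bytes:
              return hashlib.sha256(block).digest()

          def read_block(primary: bytes, mirror: bytes, stored_sum: bytes) -> bytes:
              """Return good data, healing from the mirror if the primary copy is corrupt."""
              if checksum(primary) == stored_sum:
                  return primary
              if checksum(mirror) == stored_sum:    # primary failed its checksum
                  return mirror                     # serve (and could rewrite) the good copy
              raise IOError("both copies fail their checksum; block is lost")

          # A flipped bit in one copy is caught and healed rather than halting the array:
          block = b"a 4K block of file data"
          good_sum = checksum(block)
          flipped = b"a 4K block of fil\x00 data"   # simulated cosmic-ray corruption
          assert read_block(flipped, block, good_sum) == block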

        1. Anonymous Coward
          Anonymous Coward

          Re: broken RAID

          I cry a little inside every time I hear that mathematics has a patent on it.

  3. Cloud 9

    Interesting if cheap ...

    The NetApp E-Series uses a method called DDP (Dynamic Disk Pools) to rebuild from multiple drives in order to greatly reduce rebuild times (x 8) so this isn't so much of an innovation. However, if GPFS can do it much cheaper, then it becomes interesting.

    However, there are also plenty of other solutions that detect failing drives and pre-copy the data to spares so when the disk does go, it's ready to stand the next disk up in an instant. That should take the fear out of large capacity parity rebuild times.

    1. jabuzz

      Re: Interesting if cheap ...

      GPFS native RAID is not going to be cheap, because it is only an option if you buy your servers, disks and shelves from IBM, who are not known for selling cheap disk arrays. Last time I looked you even had to buy IBM's racks. Plus it only really works if you want a file system with your storage; if you want block storage, go look elsewhere. A Dell MD3x00 will be much, much cheaper and, with DDP, give you all the same benefits.

      1. frank fegert

        Re: Interesting if cheap ...

        Seconded on GNR not being cheap, if you care about being on a supported platform. The code for GNR is available in the standard GPFS packages, so it could technically be used on COTS hardware, but IBM decided to only support it with the GPFS-SS (x86 or P775), which IMHO makes it impossible to price competitively against "traditional" storage systems. Unless "traditional" translates to DS8k, VMAX, USP-V, etc. that is ;-)

        About a year ago I tried to get support via a DCR @ IBM for the combination of P730, IBM-OEMed Brocade FC SAN and DCS3700 (OEMed LSI, same as NetApp E5400 and MD3660). It took IBM several months to come back with:

        "GPFS Native RAID is hardware specific so it is only supported on the Power 775 and our GPFS Storage Server. That said, you can attach a GPFS Storage Server to their Power Systems via standard Ethernet and GSS is built with DCS3700 enclosures. [Name deleted] is working on a proposal that would allow existing DCS3700 customers to convert to GSS."

        So the natural advice for the IBM folks to give to me as a customer was to buy the whole shebang again, only this time labelled as GPFS-SS.

        My 2 cents is that, while GPFS is a very good and rock-solid product and GNR is also an excellent idea, IBM will - again - manage to drown an otherwise excellent product by just being the IBM of "the good old days"[TM]. Unfortunately, nowadays there are alternatives spawning left and right, so the cool crowd that IBM wants to cosy up to has not one bit of motivation to attach itself to that Wall Street-steered leech.

        1. Cloud 9

          Re: Interesting if cheap ...

          Later this year ... Lenovo GPFS

  4. bear_all
    Paris Hilton

    And it comes with a free unicorn.

    Every storage vendor has the answer to everything when they are making their sales pitch.

    I'd love to meet up with one of their post-sales guys and ask him what the caveats are.

    "So I now need to manage total storage array capacity taking into account thin provisioning, deduplication and cloning so that when a disk fails, i have enough space to re-write the data on that disk to the rest of the system? How do I manage that?"

    "How many disks can fail before i notice performance impact of RAID calculations and busy disks, or am just not able to make those kind of corrections any more?"

    Nothing game changing here, just another way of doing it.

    1. Anonymous Coward
      Anonymous Coward

      Re: And it comes with a free unicorn.

      I would hope it was either part of the system, or that it failed safely and just did a normal rebuild (not a super-fast one) when space is limited... but then again, I would hope for common sense. That, and being naive, I've probably missed all the other really difficult tasks involved.

    2. danolds

      Re: And it comes with a free unicorn.

      Damn it! I missed out on writing about the free unicorn? They didn't have a unicorn at the show, or maybe it was in one of the customer-only whisper suites.

  5. Ian Michael Gumby

    Meh!

    IBM is old news ...

    You want a distributed file system... talk to Cleversafe.

    1. Destroy All Monsters Silver badge
      Paris Hilton

      Re: Meh!

      "To date Cleversafe has been awarded over 100 patents"

      Certainly a formula for success.

    2. DavidRa
      WTF?

      Re: Meh!

      So I'll buy one for my datacenter and copy files to that filesystem with Explorer, right? What do you mean it's only in the Cloud? I want it local! I don't want to wait hours for data copies over this pathetic long distance WAN link! What about my backups (do NOT try to tell me that any form of resiliency = backup).

      And what the dickens do you mean you can't just save to it with a shared drive - a REST API!? Oh, wait, no you want to sell me a NAS gateway too - for goodness sake, man, if it's a filesystem let it store files my way!

      Stupid Cloud. It is NOT the be-all and end-all of IT (or if it is, I'm taking up flower arranging).

    3. Gordan

      Re: Meh!

      Cleversafe? You must be joking. They lost all credibility when they disappeared their earlier open-source implementation from their website. And even if they hadn't, the dispersed data storage only works when you have an incredibly fast interconnect between the nodes, which demolishes most of the use cases they were touting for it.

      For real-world use GlusterFS is a far more sensible solution.

  6. Roland6 Silver badge

    Inexpensive hardware

    "IBM has also introduced a new storage appliance, the GPFS Storage Server, which combines GPFS with inexpensive hardware to yield a storage server with high levels of data protection and high performance."

    I thought the original idea behind RAID was to use inexpensive hardware, but given the size of a disk array, using higher-quality hardware increased reliability and reduced maintenance costs. So I expect the real price difference between products won't be that great.

  7. This post has been deleted by its author

  8. Bartholomew

    You can squeeze blood from a stone if you squeeze really hard - use fingers or it will not work :)

    The price of the license to enable the filesystem under AIX may be cheaper, but I'm guessing that the cost of the many IBM-branded servers to implement it will not be. Oh, and the training for disaster recovery, just for when something unexpected goes wrong, like a data center losing a few dual power circuits to a few racks of data servers (the unthinkable). Oh, and don't forget about the ongoing platinum hardware/software/patch/security support contract(s) for when something does go wrong at 4am on a Sunday morning. Total ongoing cost of ownership is the key: cheaper initial outlay (to get a toe in the door) and then higher ongoing/upgrade costs - Beware Greeks bearing gifts (or Beware of Trojans, they're complete smegheads!).

  9. Klaviar

    Yawn..

    This is hardly news.. EMC VNX2, EMC Isilon, HP 3PAR do parallel rebuilds like this.. about time IBM.

    1. uppland2

      Re: Yawn..

      Hm, Isilon and 3PAR were both founded after GPFS became commercially available.

This topic is closed for new posts.