M.2 SSD drive format is under-rated. So why no enterprise arrays?

So the latest Samsung M.2 format drives have now hit the market. In something similar to a DIMM format, you get up to 2TB of capacity and read/write speeds of 3.5Gbit/s and 2.1Gbit/s respectively. I’ve mentioned these devices in a few recent presentations as I think the potential for such small devices could be enormous. But why …

  1. Anonymous Coward

    Samsung drives?

    Will they burst into flames?

    1. petur

      Re: Samsung drives?

      While meant as a joke, your comment is actually not far from reality.

      Just scrap the Samsung and the flames part: these M.2 modules do get hot, and cooling them is not always that easy.

  2. CheesyTheClown

    Gbit/sec?

    Maybe missed an order of magnitude somewhere?

    1. Halfmad

      Re: Gbit/sec?

      Nope, it's correct; I've got a small M.2 drive at home and it's insanely fast.

      1. TrevorH

        Re: Gbit/sec?

        Yes, Gbit/sec is wrong. It's GBytes/s.

  3. dajames

    But why're they all so small?

    They're not.

    As the article notes, M.2 drives are available in a range of sizes (they're all 22mm wide, but come in a handful of lengths from 30mm to 110mm, and a variety of thicknesses, according to Wikipedia). These new Samsung drives are all 2280 parts (80mm long), so they are NOT small, and certainly won't fit into either of the M.2-capable devices around here (an Acer Chromebook and a Thinkpad laptop), each of which expects a 2242 part.

    BTW: I do wish El Reg would mention the actual size of M.2 devices mentioned in reviews, rather than just saying "M.2" and leaving us to guess or wade through the usually-unhelpful manufacturers' spec sheets.

    1. TRT Silver badge

      Re: But why're they all so small?

      Very true. M.2 seems to refer to the electrical/data connection standard, not the physical dimensions, though the one that we couldn't use and I half-inched out of one of the new Dells seemed to fit onto an Asus gaming motherboard perfectly.

      1. petur

        Re: But why're they all so small?

        And then there are M.2 modules that talk SATA and others that talk PCIe, so make sure you check what both sides are capable of.

        1. Charles 9

          Re: But why're they all so small?

          Thing is, SATA is so last season, especially in enterprise settings, so it's safe to assume an enterprise M.2 is using PCIe x4.

  4. jms222

    Regardless of standards, you can't hot-swap something on a PCB without risking bad stuff such as shorting things out. So a definite no-no for enterprise.

    Oh and PLEASE sort your units out. You mean bytes not bits surely. For a publication that pushes stories on storage so frequently at least get your units right.

    1. Anonymous Coward

      Isn't there such a thing as hot-swap PCIe? Would that have any bearing on potentially hot-swapping M.2 (also PCIe)?

  5. Ben Norris

    +1 for lack of hotswapping. None of these internal bus flash systems will replace SAS/SATA based drives in enterprise systems until standards or implementations for hotswapping and external access are offered.

  6. Annihilator

    Failure?

    "Costs are basically focused around the number of NAND chips an SSD contains,"

    There's an argument to be had for failure rates and the costs associated. The more NAND chips you shove in a single device, the greater the chance of an overall failure. If they were treated more like individual elements of a RAID (say one NAND chip per M.2 interface, loads of M.2s), your failure costs would be dramatically reduced, as you'd only ever be replacing the NAND chip that had failed, not the entire module.
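
    To put a rough number on that intuition, here's a back-of-the-envelope sketch; the per-chip failure rate is an invented figure, purely for illustration:

    ```python
    # Probability that a module containing n NAND chips suffers at least one
    # chip failure, assuming independent chips with a hypothetical annual
    # failure rate p per chip (the 1% figure below is made up).

    def module_failure_probability(n_chips: int, p_chip: float) -> float:
        """P(at least one of n chips fails) = 1 - (1 - p)^n."""
        return 1 - (1 - p_chip) ** n_chips

    for n in (1, 4, 8, 16):
        print(f"{n:2d} chips -> {module_failure_probability(n, 0.01):.1%} chance of a failure event")
    ```

    With one chip per M.2 you'd still see the same number of chip failures overall, but each event only takes out (and only requires replacing) a single chip's worth of capacity.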

    1. richardcox13

      Re: Failure?

      That seems over-complex (and would require massively more PCIe lanes) when SSD drives already handle parts of the flash failing (and are thus over-provisioned with flash to begin with).

      Just treat a chip level failure as part of the same process. If a sufficiently large proportion of the flash is out of action then it is time for drive level replacement.

      Much like the process HDDs go through to remap bad sectors until there are no more spare sectors left on the drive.
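
      For anyone who hasn't met that remapping idea before, a toy sketch of the bookkeeping (grossly simplified; real drives do this in firmware with wear levelling layered on top, and the class here is invented for illustration):

      ```python
      # Toy model of bad-block remapping: logical blocks map 1:1 to physical
      # blocks until one goes bad, at which point it is remapped to a block
      # from a spare pool. When the spares run out, the drive is done.

      class RemappingDrive:
          def __init__(self, data_blocks: int, spare_blocks: int):
              self.remap = {}   # logical block -> spare physical block
              self.spares = list(range(data_blocks, data_blocks + spare_blocks))

          def physical_block(self, logical: int) -> int:
              return self.remap.get(logical, logical)

          def mark_bad(self, logical: int) -> bool:
              """Remap a failed block to a spare; False means 'replace the drive'."""
              if not self.spares:
                  return False
              self.remap[logical] = self.spares.pop()
              return True

      drive = RemappingDrive(data_blocks=1000, spare_blocks=2)
      print(drive.mark_bad(42), drive.physical_block(42))   # True 1001
      print(drive.mark_bad(43), drive.mark_bad(44))         # True False - out of spares
      ```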

      1. Annihilator

        Re: Failure?

        Hmm, good point. I guess it depends on what role the drive is playing, as we'd be talking about losing an entire chunk of the drive (up to 25% on just a 4-chip device - no idea what the average is) and how over-provisioned they are.

        I guess it's the equivalent of a spinner losing an entire platter.

      2. Anonymous Coward

        Re: Failure?

        "Just treat a chip level failure as part of the same process. If a sufficiently large proportion of the flash is out of action then it is time for drive level replacement."

        But what about a controller-level failure, which is both sudden and catastrophic, meaning it's up one moment and bricked the next?

  7. talk_is_cheap

    So why no enterprise arrays

    One rather key issue is that you are talking about the PCIe x4 standard, with the latest devices able to fully load the four PCIe lanes in test conditions. Such performance can overload 10Gbit Ethernet links and saturate a 40Gbit link. This in many ways just makes them too fast for use in large arrays. Why bother, when you can just deploy current SAS-based options?
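
    Rough arithmetic behind that claim (the per-lane figure is approximate, assuming a PCIe 3.0 x4 device):

    ```python
    # Rough bandwidth arithmetic for a PCIe 3.0 x4 NVMe device
    # (~0.985 GB/s usable per PCIe 3.0 lane after encoding overhead).
    lane_gb_per_s = 0.985
    device_gb_per_s = 4 * lane_gb_per_s          # ~3.9 GB/s
    device_gbit_per_s = device_gb_per_s * 8      # ~31 Gbit/s

    print(round(device_gbit_per_s), "Gbit/s")    # well beyond a 10GbE link,
                                                 # most of the way to 40GbE
    ```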

    Five of them placed on a PCIe x16 slot could make a nice RAID 5 solution, but that would need a very high-performance controller chip to be developed, with performance specs well beyond anything currently available. For every 1GB/s of write performance the controlling chip will also have to read 1GB/s from the array, XOR it, and then write the resulting parity information as well as the original data. While such a configuration may provide very high speeds, the enterprise market seems to be more focussed on DIMM-based SSD solutions, which link directly to the server's CPUs. These remove the controller-chip issue and will distribute the storage across all the CPUs in a system if required.
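
    To make that parity work concrete, here's a minimal sketch of the XOR per stripe for a five-device RAID 5 set (pure Python, so illustrative only; a real controller does this in silicon at GB/s rates):

    ```python
    # Minimal RAID 5 parity sketch for a 5-device stripe (4 data + 1 parity).
    # For a full-stripe write the parity is just the XOR of the four data
    # chunks; for small writes the controller additionally has to read back
    # old data/parity, which is the extra read bandwidth mentioned above.

    from functools import reduce

    def xor_blocks(*blocks: bytes) -> bytes:
        """Byte-wise XOR of equally sized blocks."""
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    data = [bytes([i]) * 8 for i in range(1, 5)]   # four 8-byte data chunks
    parity = xor_blocks(*data)                     # fifth device holds the parity

    # After a device failure, the missing chunk is rebuilt from the survivors:
    rebuilt = xor_blocks(data[0], data[2], data[3], parity)
    assert rebuilt == data[1]
    ```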

    The other great use for these types of drive would be as caches within arrays built with slower storage devices, but currently the write endurance is not designed for such tasks. The 960 PRO 2TB only has a 1.2PB endurance, which is not much in an enterprise write-cache deployment.
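
    For a sense of scale on that endurance figure, some rough arithmetic using the quoted 1.2PB rating and the 2TB capacity:

    ```python
    # Rough endurance arithmetic for the quoted 960 PRO 2TB figures.
    tbw_bytes = 1.2e15        # 1.2 PB total bytes written (quoted endurance)
    capacity_bytes = 2e12     # 2 TB drive

    full_drive_writes = tbw_bytes / capacity_bytes      # ~600 complete overwrites
    dwpd_over_5_years = full_drive_writes / (5 * 365)   # ~0.33 drive writes per day

    # A write cache absorbing a sustained 1 GB/s would exhaust that budget
    # in roughly two weeks:
    days_at_1GB_per_s = tbw_bytes / (1e9 * 86400)       # ~14 days
    print(full_drive_writes, round(dwpd_over_5_years, 2), round(days_at_1GB_per_s))
    ```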

    1. Anonymous Coward

      Re: So why no enterprise arrays

      You wouldn't want to use it as a cache for random data, as it doesn't improve latency much. However, using it as a cache for sequential data would allow a smaller RAM cache that is focused only on random data.

    2. Fazal Majid

      Re: So why no enterprise arrays

      What you are describing is exactly why storage arrays are going to be replaced by DAS - they are just too slow.

  8. Anonymous Coward

    Enterprises need SAS, not M.2

    Hyperscale environments are ideal for M.2 because of the abstraction layer between the hardware and the object being stored. The actual system that writes the data to disk is just performing a glorified rsync from another node in the cluster; it's not concerned with the multipathing or fencing that you would normally need from standalone storage appliances.

    Enterprise storage appliances use NVMe instead, because it's faster than M.2, and PCIe versions use the same form factor as traditional RAID and HBA cards. It's wasteful to redesign enterprise motherboards and chassis for M.2 when PCIe and SAS do the job better anyway.

  9. Anonymous Coward

    Enterprise storage appliances use NVMe instead, because it's faster than M.2 ...

    That doesn't seem correct. NVMe M.2 drives are widely available.

    1. Rainer

      Not those.

      He means NVMe PCIe flash in 2.5" form-factor.

      It's called "U.2". Formerly SFF-8639, but nobody could memorize that.

      Supermicro has a couple of 1U and 2U servers.

      You can get an "Enablement Kit" for HP DL380 Gen9 servers - but it's only for 6 bays.

      Supermicro's 2U server houses 24 of these.

  10. Crazy Operations Guy

    Build a carrier for them

    I'd like to see them put into a hot-swap carrier so they can be added and removed by pulling them from slots on the front of the machine. A 1U box could then hold 100 or more arrayed along the front panel. This would go over especially well in massive datacenters to provide in-rack ultra-high-speed storage that would only take a single U of space.

    You could even increase capacity by building 250mm-long carriers with two modules aboard: the box would then support 200 modules, and with modules at 2TB each you could cram 400TB of storage, with a raw performance of ~20 million IOPS, into a single U of space...
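
    A quick sanity check of those numbers, using the figures above plus an assumed per-module IOPS rating (none is given, so the 100k is illustrative):

    ```python
    # Sanity check of the dual-module 1U carrier idea.
    carriers_per_1u = 100
    modules_per_carrier = 2
    module_capacity_tb = 2
    module_iops = 100_000      # assumed random IOPS per module (illustrative)

    modules = carriers_per_1u * modules_per_carrier    # 200 modules
    print(modules * module_capacity_tb, "TB raw")      # 400 TB
    print(modules * module_iops / 1e6, "M IOPS")       # ~20 million IOPS
    ```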

    The one lesson I've learned when dealing with enterprise equipment is that cracking open the case is the very last thing hardware people want to do. If an M.2 module fails, it's going to put that server out of commission for an hour while the hardware engineer un-cables it, pulls it from the rack, gets it to a work surface (taking the time to figure out which module is the broken one), drags it back to the rack, re-cables it, then waits for it to start up. An externally accessible disk, even if hot-swapping isn't possible, will still only take a few seconds to pull the old disk and slide in the new one, and maybe a few minutes to shut the box down and start it back up if it isn't hot-swap.

  11. jms222

    Perhaps if they were mounted in translucent coloured plastic blocks so you could open a panel and start pulling or swapping them while your HAL9000, errant starship computer or whatever complained.

  12. Anonymous Coward

    @Chris Evans since you asked...

    There's an issue with the economics.

    From an enterprise storage standpoint... you would have to redesign your hardware to create multiple M.2 slots on your motherboard. The result is no longer a COTS product. Motherboard manufacturers could do it, but if you run the numbers... it doesn't make sense when you look at the larger picture.

    You're competing against PCIe cards, which give you a larger surface area to mount storage chips along with the four slots that already exist in most motherboards. And then there are the DIMM slots, which offer more promise because they are closer to the CPU. Neither would require the investment in customized hardware.

    This is where you run into trouble. Using DIMM slots, I can stuff in either RAM or NVRAM and get better performance. Already in the server market, Supermicro makes a 24-DIMM-slot server motherboard rated for 3TB of RAM. Now suppose you have DIMM-sized storage cards that hold 4TB per DIMM...

    You can put in 2TB of RAM and then stuff the other DIMMs with NVRAM. No spinning rust. No SSDs, unless you want a third tier of storage (RAM, NVRAM, SSD).
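
    Rough numbers for that split, assuming (purely for illustration) 256GB RDIMMs for the RAM portion and the hypothetical 4TB NVRAM DIMMs mentioned above:

    ```python
    # Illustrative split of a 24-slot board between RAM and NVRAM DIMMs.
    total_slots = 24
    ram_dimm_gb = 256     # assumed RDIMM size (illustrative)
    nvram_dimm_tb = 4     # hypothetical 4 TB NVRAM DIMM from the comment above

    ram_slots = 8         # 8 x 256 GB = 2 TB of RAM
    nvram_slots = total_slots - ram_slots
    print(f"{ram_slots * ram_dimm_gb} GB RAM in {ram_slots} slots, "
          f"{nvram_slots * nvram_dimm_tb} TB NVRAM in {nvram_slots} slots")
    ```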

    I don't want to suggest that M.2 is an evolutionary dead end... it's not, but it will be limited in terms of applications. If you were going to look at building an all-flash array today... PCIe would be a start. If you were looking to use M.2 for a custom solution... you could stand the modules up on their side (think of them as fins, with the connection to the MB vertical rather than horizontal), but you will need to allow for air flow. The problem is that if you're building a dedicated storage array... you would want hot-swap capabilities, otherwise you're going to have a lot of maintenance headaches.

    Where I see M.2 is in blade servers. Now each card is standalone, with multiple TBs of removable storage. Ultimately this would probably be replaced with DIMMs.

    In terms of evolving... in theory, you could increase the number of M.2 slots on the MB and then use some form of cabling to move the cards off the MB into a storage container, where your M.2 slots replace SATA. Here you could create hot-swap arrays instead of SATA SSDs.

  13. Fazal Majid

    PCIe hot-swap

    As a practical matter PCIe card-format SSDs are not hot-swappable, as taking a server out of the rack to remove the card usually requires powering it off.

    This does not apply to other form factors like U.2 or Thunderbolt.

  14. naive

    M.2 and insanely fast M.2 NVMe have been mainstream in high-end gaming rigs for years

    M.2 is a beautiful format. The speeds of M.2 NVMe are over 2GByte/sec for read/write.

    Given a high-end motherboard, enormous I/O bandwidths could theoretically be achieved.

    The disadvantage of M.2 is that M.2 SATA and M.2 NVMe look the same, but are physically incompatible by a few mm due to differences in the pins. In webshops they are often sold as the same product, even though M.2 NVMe is around four times faster than ordinary M.2.

    It is interesting to see how the gaming world is the driving force for innovation in hardware development, while "enterprise" server technology is stagnant due to the Xeon monopoly combined with CPU-based licensing, which excludes competitors with lower performance per socket.

  15. ntevanza

    Enterprise

    Star Trek aside, perhaps we would save some person-years by managing a better characterization of enterprise IT. Enterprise IT is Boeing and Airbus, not XtremeAir, Lockheed Martin, or SpaceX.

    In other words, the kicks are seldom where the money is. Flying an Airbus is mind-crushingly dull, and the tech is decades old. There's a reason for that.

    You can go and work for a sexy company that builds cutting edge tech, but you won't sell many planes to Lufthansa. You decide.

    A more IT-ish way to put this is that the NFRs in Enterprise are far harder to achieve than the FRs. This or that sexy tech is neither here nor there when the real differentiator is six sigma operations.

  16. TheSolderMonkey

    Actually the reason we don't use 'em is...

    Redundancy.

    The M.2 format only has a single host connection. U.2, or SFF-8639 as it's much better known, has two host connections.

    You can't build an enterprise system with M.2 devices because you can't attach two controllers, and all enterprise systems have two controllers for redundancy.

    It's perfectly possible to build a hot-pluggable M.2 with SATA or NVMe, but the latter is far more complicated. Host systems don't like their PCIe reads to go missing. If a read doesn't complete, PCIe buffer credits are consumed and the entire PCIe infrastructure crashes.

    Companies like PLX and PMC Sierra (now MicroSemi & MicroSemi respectively, how the hell did that get through monopolies and mergers?) offer PCIe switches with integral read tracking. The switch will fake read completion in the case of surprise removal. The chips will also fake a device as being present at bus scan time, so that you can deal with parts being added after boot time. They are of course chuffing expensive bits of silicon.

    Finally Xyratex (now Seagate, soon to be gone) did make a drive for enterprise boxes that had a U.2 front end and multiple M.2 devices inside. I don't know how well it sold.

    If anyone is wondering, yes, I am an enterprise storage box designer.

    1. NBNnigel

      Re: Actually the reason we don't use 'em is...

      "Host systems don't like their PCIe reads to go missing"

      Is this something inherent to the PCIe spec or just because of how manufacturers currently implement said spec?

      1. TheSolderMonkey

        Re: Actually the reason we don't use 'em is...

        We're off topic: this is surprise removal of PCIe devices, i.e. hot-unplug. NVMe is essentially a protocol running on PCIe.

        It's ingrained behaviour at many levels, starting with the PCI protocol (some of this stuff goes waaay back beyond PCIe).

        The physical and electrical layers really don't care about surprise removal. You'll do no harm to the cabling or the PHYs.

        All of the other layers of PCIe kinda get their knickers twisted with surprise removal. The protocol isn't really designed to deal with stuff going missing. PCIe is essentially memory mapped 'stuff'. PCIe has strict rules about what types of traffic can overtake other types of traffic. If something goes missing lower down in the tree, then the root complex can stall because new transactions are queued behind the transaction that will never complete. So hot unplug can make things fall over in a heap.

        Hot plug presents a different set of problems. BIOSes (or is it BIOI?) weren't designed to cope with a shifting memory map. They scan the PCIe infrastructure once during the reset sequence and build a memory map based on what's there. Surprise insertion isn't possible with standard kit: the BIOS won't have allocated space in the memory map for the NVMe drive that you have just hot-plugged into your stable running system. The BIOS doesn't know about it, it's not in the memory map, and the OS running on top doesn't stand a chance - let alone the apps running on the OS.

        So there is a new breed of PCIe switches that are capable of faking end points when they're not present.

        The switches track transactions to the endpoint and ensure that the root complex always gets a valid completion even if the end point has been unplugged. The root complex won't get valid data and will have to deal with that, but at least the PCIe infrastructure doesn't gum itself up.
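
        As a very loose software analogy of what that tracking does (a conceptual toy, not how the silicon actually works; the class and status strings are invented for illustration):

        ```python
        # Conceptual toy of a completion-tracking switch: reads in flight to a
        # downstream endpoint are remembered, and on surprise removal the switch
        # synthesises completions so the root complex never waits forever.

        class TrackingSwitch:
            def __init__(self):
                self.endpoint_present = True
                self.outstanding = {}   # tag -> address of the pending read

            def issue_read(self, tag: int, address: int):
                if not self.endpoint_present:
                    return ("completion", tag, "error")    # fake it immediately
                self.outstanding[tag] = address
                return ("forwarded", tag, address)

            def surprise_removal(self):
                """Endpoint yanked: fake a completion for every read still in flight."""
                self.endpoint_present = False
                faked = [("completion", tag, "error") for tag in self.outstanding]
                self.outstanding.clear()
                return faked

        switch = TrackingSwitch()
        switch.issue_read(tag=1, address=0x1000)
        switch.issue_read(tag=2, address=0x2000)
        print(switch.surprise_removal())   # two synthesised completions, nothing left hanging
        ```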

        These new switches are also capable of faking a device at the bus scan. So if you have a chassis with an NVMe slot, and the slot is empty at power-on, the switch will pretend that the slot is filled when the root complex does its bus scan. So the BIOS will map the device into memory and everyone is happy.

        Almost. The OS still needs to know that the device is not actually there and the switch needs to know in advance what type of device you'll be plugging in.

        So I guess you can see why these new fancy switches aren't cheap. NVMe offers some fantastic advantages over SAS, but building an enterprise system with it ain't easy. An enterprise system means dual paths to every disk, ruling out M.2. It means hot-pluggable (and, worse, hot-unpluggable) media. Doing this for NVMe means hardening a protocol that simply wasn't designed to support hot swap, then hardening the BIOS, then the OS, and possibly the application too.

        It's not impossible, it's just one of the things that makes my life fun. :)

        1. NBNnigel

          Re: Actually the reason we don't use 'em is...

          Thanks for the detailed reply. I definitely feel like I achieved my daily learning quota today :)
