* Posts by Axel Koester

10 posts • joined 8 Sep 2015

WekaIO pulls some Matrix kung fu on SPEC file system benchmark

Axel Koester

Re: FileSystem for AI

Disclaimer: IBMer here.

Interesting discussion... but I don't agree with this statement:

> ...the only distributed file system that was architected from day 1 for NVMEs and flash devices, therefore It is able to handle small files ...

The ability to handle small writes does not (and should not) solely depend on the availability of flash devices that can do the job for you. Technologies like distributed log-structured small-write buffering give decent performance gains on "old" flash technology and HDDs alike, and the same is true for NVMe devices - which mainly gain faster access and less queueing, but are still EEPROMs at the core.
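To make the idea concrete, here is a minimal toy sketch of log-structured small-write buffering: small random writes are appended to a sequential staging log and destaged to their home locations in one coalesced batch. This is an illustration of the general technique only, not any vendor's actual implementation.

```python
class LogStructuredBuffer:
    """Toy model: stage small random writes in an append-only log,
    then destage them in one large, coalesced batch."""

    def __init__(self, flush_threshold=8):
        self.log = []                   # append-only staging log (fast, sequential)
        self.flush_threshold = flush_threshold
        self.backing = {}               # stands in for the slow backing medium

    def write(self, block_addr, data):
        self.log.append((block_addr, data))   # sequential append instead of random I/O
        if len(self.log) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One coalesced destage instead of many small writes;
        # later writes to the same address win.
        for addr, data in self.log:
            self.backing[addr] = data
        self.log.clear()


buf = LogStructuredBuffer()
for i in range(10):
    buf.write(i % 3, f"v{i}")   # ten small, overlapping writes
buf.flush()
print(sorted(buf.backing.items()))  # only 3 distinct blocks reach the medium
```

The win is that the backing medium only ever sees a few large sequential operations, regardless of how small and scattered the incoming writes are - which is why the technique helps HDDs, older flash, and NVMe alike.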

So if "architected for NVMe" - i.e. not using the system block IO driver - means "we didn't bother implementing media accelerators because the media is quick enough", then this is not the correct way forward IMHO, because others will be able to copy it easily once NVMe is widespread. Intelligent distributed metadata management is a better investment.

Btw, "IOs were terminated by clients unprotected RAM" ... huh? I don't think this exists in *any* of the IBM Spectrum storage products. The secret of the exceptionally low latency in Spectrum Scale is non-blocking metadata management and parallelism, and very shallow software stacking. This worked for Terabytes, works for Petabytes, and will work for Yottabytes. Page 15 ('client - server - device roundtrip') in the CORAL presentation shows what to expect: 0.074 ms avg latency. http://files.gpfsug.org/presentations/2017/SC17/SC17-UG-CORAL_V3.pdf

Thank you for the discussion!


IBM: We're not putting XIV on the cart

Axel Koester

Why the brand "XIV" has not been adopted for the A9000

Here is why the successor of the XIV doesn't bear the XIV name... Disclaimer: IBMer here.

In case no one has noticed, the "A" in A9000 stands for Spectrum "A"ccelerate, which is the software running that cluster. It's the same software family that runs its predecessor, the XIV Gen3 cluster.

Similarly, the "V" in V9000/7000/5000 stands for Spectrum "V"irtualize, the stretched cluster storage software, and the two S's in Elastic Storage Server (ESS) also stand for Spectrum Scale, the third in the cluster software trio.

For hyperconverged deployments on storage-rich x86 servers we use bare Spectrum Accelerate (for ESXi) or Spectrum Scale (any native platform, not limited to hypervisors). VASA & OpenStack are supported. For higher stability & performance goals, we recommend the appliances (FC/IB/CAPI rather than Ethernet/IP). A single name for both variants would do.

>>> BUT <<<

The original XIV data distribution schema has been designed for large nearline disks plus SSD caches. That doesn't make sense in the All-Flash era, so we changed the data distribution layout - and with it, the name.

Plus we noticed two roadblocks in all-Flash x86 cluster storage:

First, off-the-shelf x86 nodes are not up to the task of driving dispersed RAID for large packs of dense SSDs at desirable latencies. It's like putting race horses in carriage harnesses. NVMe fabrics will resolve some of that; in the meantime we use InfiniBand.

Second, even the best SSDs eventually get depleted under enterprise workloads, and we want to avoid too many component failures at once. Also we preferred a design *without* opaque 3rd party SSD firmware mimicking disk drives, which brings serious limitations in lifetime / garbage collection control / health binning control etc.

The A9000 therefore leverages FlashSystem's Variable Stripe RAID, developed at Texas Memory Systems. Think of "Variable Stripe" as "self-healing", a feature known from the XIV - but with RAID-5 efficiency. The overall data distribution schema runs on a 2:1 ratio of x86 nodes to Flash drawers, or even 3:1 when it's just one pod (for lack of workload entanglement, among other things).

With this design, the A9000 runs global deduplication PLUS real-time compression at latencies suitable for SAP R/3 and Oracle databases. Which compress nicely at [up to] 5:1, by the way. Anyone else?

[up to] is the legal disclaimer, your mileage may vary.

XIV goes way of the dinosaurs as IBM nixes fourth-gen storage array

Axel Koester

Insider: Technical differences of XIV Gen3 vs. FlashSystem A9000

Let me offer some technical background about the differences between XIV Gen3 and its successor FlashSystem A9000, and the reasons why we changed things. (Disclaimer: IBMer here.)

First, both share the same development teams and major parts of their firmware. Notable differences include the A9000's shiny new GUI, designed for mobile.

The major reason why there is no "XIV Gen4 with faster drives" is that you can order faster or larger drives for a Gen3, or even build your own flavor of XIV deploying a "Spectrum Accelerate" service on a VMware farm - or deploy a Cloud XIV as-a-service. They're all identical in look and feel.

The original XIV data distribution schema was designed to squeeze the maximum performance out of large nearline disks leveraging SSD caches. For an All-Flash storage system, that doesn't make sense.

Plus we noticed two roadblocks in x86 cluster storage design: First, standard x86 nodes full of dense SSDs are not up to the task of driving all that capacity (plus distributed RAID) at desirable latencies; NVMe fabrics might relieve some bottlenecks in the future.

Second, even the best SSDs get depleted after heavy use, and we want to avoid having to deal with too many component failures at once. Also we preferred a design without opaque 3rd party SSD firmware mimicking disk drives, which would have serious limitations in lifetime / garbage collection control / health binning control etc.

The A9000 therefore leverages FlashSystem's Variable Stripe RAID, which is implemented in low-latency gate logic hardware. Think of "Variable Stripe" as "self-healing" - a feature known from the XIV, but with RAID-5 efficiency. On top lies a data distribution schema that uses a 2:1 ratio of x86 nodes to Flash drawers, or even 3:1 when it's just one pod (for lack of workload entanglement, among other things). The result is a system that runs global deduplication + real-time compression at latencies suitable for SAP databases and Oracle. Which also implies that incompressible or encrypted data is not ideal - so it's not a system for ANY workload. But it's definitely not restricted to VDI, like some others. I'd encourage everyone to run simulations on data samples.
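To illustrate the "variable stripe = self-healing" idea in the abstract: when a member of a parity stripe fails, the lost block is rebuilt from XOR parity and the stripe is re-formed one member narrower, so redundancy is restored immediately without waiting for a spare. This is a hedged toy model of the general concept, not IBM's actual Variable Stripe RAID implementation.

```python
def xor_parity(blocks):
    """RAID-5-style XOR parity over the data members of a stripe."""
    p = 0
    for b in blocks:
        p ^= b
    return p


class VariableStripe:
    """Toy stripe that can shrink around a failed member ("self-healing")."""

    def __init__(self, members):
        self.members = dict(enumerate(members))   # index -> data block
        self.parity = xor_parity(self.members.values())

    def recover(self, failed_idx):
        # Rebuild the lost block from parity plus the survivors.
        lost = self.parity
        for i, b in self.members.items():
            if i != failed_idx:
                lost ^= b
        return lost

    def shrink(self, failed_idx):
        # "Variable stripe": drop the failed member and recompute parity
        # over the narrower stripe. Relocating the recovered block is the
        # job of the distribution layer above (not modelled here).
        recovered = self.recover(failed_idx)
        del self.members[failed_idx]
        self.parity = xor_parity(self.members.values())
        return recovered


stripe = VariableStripe([3, 5, 9])
lost = stripe.shrink(1)      # the element holding value 5 fails
print(lost, stripe.parity)   # lost member rebuilt; parity now covers 2 members
```

The capacity of the stripe shrinks with each failure, but parity protection is back the moment the shrink completes - which is the behaviour the post describes as self-healing with RAID-5 efficiency.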

IBM’s DeepFlash 150: Got half a million bucks for a fat, fast JBOF* box?

Axel Koester

Kind of odd? Not at all.

>>Kind of odd IBM didn't use their own FlashSystem boxes for this config

Not at all, if you lift the covers. The FlashSystem is designed for high sustained write rates without hitting any write cliff - such as the log output from in-memory databases or the metadata store in Spectrum Scale. Therefore the chip technology used in IBM FlashSystems is higher grade and less dense. But it's really difficult to overwhelm it, whereas overwhelming SSDs (and getting bad response times) is simple.

*Many* larger Spectrum Scale clusters worldwide are using FlashSystems as metadata storage devices, with bulk data still going to large HDD JBODs. This is now changing: The price point of a DeepFlash 150 allows swapping out all disks for solid state storage.

It is still a good idea to place the metadata stream on a FlashSystem: A FlashSystem does 2D-RAID in ultrafast gate logic, while DeepFlash150 requires Spectrum Scale's native erasure coding software to provide device protection. For the bulk data, that process can be parallelized - but not for metadata operations.

If you decide to have both metadata and bulk files on the same DeepFlash device, it becomes an "entry" configuration, in the light of the above. Not recommended for an ultrascalable filer with heavy metadata updates, but ideal if most operations are read-only with no locking. Think of media servers, Hadoop analytics clusters, etc. An entry configuration will also happily support a collection of everyday-VMs at decent speed.


PS: A stylish GUI (in latest XIV design) comes with Spectrum Scale 4.2. No more command-line fiddling after initial setup: For screenshots, see http://www.spectrumscale.org/meet-the-devs-oxford-2016/


Disclaimer: I work for IBM. But you'll have guessed that.

EMC’s DSSD all-flash array hits the streets, boasting 10m IOPS

Axel Koester

Not ready yet: EMC to replace old tech in existing storage with DSSD

Kudos to EMC for getting DSSD to work. Questions remain about the non-standard connectivity: From my own experience at IBM working with low-latency attach ("CAPI" = cache coherent memory mapping), EMC will need to offer custom-designed applications with it to sell DSSD in sufficient quantities. Hadoop is a low-hanging fruit from a technical point of view, Spark would be more challenging.

DSSD attachment is a bit like CAPI attachment for the bare IBM FlashSystem (to Anonymous Coward -2: Yes, you can also have FlashSystems without the SVC code layer and without FC, at 90 µs latency)... It's exotic. Any SAN data center will look for FC compatibility, with InfiniBand covering special requirements. DSSD, CAPI and clones remain a scientists' playground - unless you can offer the integrated cluster appliance with it. Also you'll need to make every integrated application aware of memory-mapped storage instead of volume-mapped storage.

But then the discussion is different: A NoSQL Flash appliance like IBM's "Data Engine for NoSQL" - or potential future EMC integrated devices - will NOT compete with storage. Instead they will compete with "in-memory" databases, the most popular deployment model for low-latency analytics. And the break-even point is not at one or two servers, considering the hefty price of DSSD gear and non-standard hardware. For reference, we designed the 4U-high IBM Data Engine for NoSQL to compete with 24 generic x86 servers running in-memory NoSQL (40+TB). That's not commonplace yet, even for Apache Spark. And it's more a discussion about energy cost versus deployment agility.

My guess is that EMC will rather go the mass-market compatible way and insert DSSD into established storage products, replacing old bus technology in VMAX, XtremIO etc. Not a game changer, but reasonable progress. It's just not ready yet.

PS. (to Anonymous Coward -2: call it "hampered" with layers of storage code, but that's market demand. Try to stay below 200 microseconds *including* snap/thin/mirroring/data reduction software; SVC is a good reference).

We've just stepped out of our time machine, and we can reveal ... EMC's new kit for early 2016

Axel Koester

Success factors for low-latency attachment technologies

On a more technical note, launching a new low-latency attachment technology like DSSD is not a straightforward job. I started marketing CAPI-attached Flash at IBM two years ago, and the outcome was NOT some general purpose fast Flash storage. Read below.

CAPI, never heard?? That's because it is only prominent in big number crunchers in finance, life sciences and research: CAPI links graphical coprocessors or arithmetic FPGAs into CPU cache lines, but it can also emulate memory pages. Hence the name Coherent Accelerator Processor Interface. Cool technology. (It doesn't get much quicker than this.)

The outcome was that since you don't find common applications that understand new protocols like CAPI, NVMe or now DSSD, you'll have to write your own! Luckily some industries were just waiting for external memory that feels like "nearline RAM": it bypasses the 20,000 lines of code in the storage access layer, but comes at a fraction of the cost of true RAM.
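The programming-model difference is easy to demonstrate: memory-mapped storage lets an application touch storage bytes with load/store semantics instead of explicit read()/write() calls through the block layer. CAPI and DSSD do this with special hardware; the stdlib `mmap` sketch below only illustrates the access model, not those technologies.

```python
# Sketch of "memory-style" vs. "volume-style" access: after mapping,
# the application reads and writes storage like a byte array in RAM.
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "kvstore.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)          # pre-size the backing file

with open(path, "r+b") as f:
    mem = mmap.mmap(f.fileno(), 4096)
    mem[0:5] = b"hello"              # store: looks like a memory write
    assert mem[0:5] == b"hello"      # load: no read() syscall in sight
    mem.flush()                      # persist the mapped changes
    mem.close()
```

Applications written against read()/write() volumes see none of this for free - which is exactly why the post argues that each integrated application has to be made aware of memory-mapped storage.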

We went for the most popular rising NoSQL database, Redis, which became "BigRedis" as CAPI-Flash support was added by Redis Labs. In 2014 we announced the first IBM "Data Engine for NoSQL", a pure Redis appliance in 4U height, 2U for Power Linux and 2U for an Enterprise-grade CAPI FlashSystem. It corresponds to 24 rack servers of in-memory database capacity, up to 40 TB, while producing only 5-10% of their heat output.

The real question is, how big is this [gaming/medicine/web trade/geo science] niche of high capacity in-memory applications that will be upgraded to support "memory-style Flash" with energy consumption in mind? The (expensive) alternative is "cloud-style" deployment on an increasing number of general purpose servers which know just RAM and storage. Energy costs will decide.


More from Julich supercomputing center blogging on openpower.org... http://openpowerfoundation.org/industry-coverage/julich-tag-teams-with-ibm-nvidia-on-data-centric-computing/

Big Blue bafflement: Anyone in IBM Storage know which way is up?

Axel Koester

Re: why? ... Why "expensive" storage gear will live on - ad perpetuum

An important cost driver for enterprise storage is its extensive testing effort (>50% of the cost) and the critical support structure. This is maybe the "missing" part, besides a supposedly higher margin?

Putting some fancy code on generic x86 hardware, adding some SSDs and a GUI is the easy part; getting this combo to five nines of availability or beyond, across a fleet of ten to a hundred thousand deployments, is a totally different story. Midrange storage has done a good job evolving towards tier 1 availability levels, but software-based deployments on arbitrary hardware are quite often far away from that. Why? Because every deployment is the first of its kind, with a combination of adapters, microcodes, drivers or cables never seen before - and thus with yet unseen error combinations showing up. People who tried heavy-duty storage virtualization in software know what I'm talking about.
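For a sense of scale on what "five nines" actually demands, here is the standard availability arithmetic (a worked example of the bar the comment refers to, not a vendor figure):

```python
# Allowed downtime per year at a given number of "nines" of availability.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines in (3, 4, 5):
    availability = 1 - 10 ** -nines
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"{availability:.5f} -> {downtime_min:7.1f} min/year")

# Three nines allows ~8.8 hours of downtime per year; five nines
# allows roughly 5.3 minutes - including every firmware update,
# driver hiccup and cable fault across the whole fleet.
```

Squeezing an untested hardware combination under a five-minute annual budget is what makes the extensive testing effort (and its cost) unavoidable.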

But there are also the "good enough" use cases, which are less sensitive to outages or data loss than e.g. bank accounts. A majority of today's cloud-based workloads are of the "good enough" type, and cloud service providers run highly standardized environments where the potential multitude of error combinations will be mastered over time. Still, no cloud bank accounts so far - or only with a hardened storage layer.

Don't be fooled, there is no such thing as bug-free software (or vulnerability-free), and that includes storage software. This is not a "features" or "fancy redundancy algorithm" discussion, it's only measured in hours of operation and field maturity of each and every new 'arbitrary' setup. For my bank account, I want the opposite: not the latest and greatest, but rather as hardened as possible.

Seagate wins HP as ClusterStor array reseller, bolts on IBM Spectrum Scale

Axel Koester

Re: Don't Do That... IF YOU'RE SMALL. Larger environments, please read:

Huh, the days when academic storage environments could get along with lower reliability/availability figures are over! Today the cost of not being able to store or reproduce one-off experimental data is huge, not to speak of the reputation loss. Imagine telling a researcher that his data is lost, his samples - $50,000 apiece - were destroyed in the experiment, his residency time is over and the next available time slot is in two years? Do that three times and you're out of (funded) business!

The storage piece is not academic research, it is a strict service provider activity. Organisations beyond the do-it-yourself size protect their business with advanced redundancy techniques, like erasure-coded data rebuild without tangible impact on write or read performance. Few enterprise arrays can do that! This stuff comes native with Spectrum Scale, don't know about Lustre. [http://bit.ly/1WJ0M4o - last slide #39, impact of repair]

And while the first "pod" of a parallel file system might be harder to set up than a NAS box, that relationship reverses at larger scale: The parallel file system will still be a single instance ("single namespace, wide metadata"), while the countless NAS or SAN appliances will require substantial migration and balancing effort to replace old tech or eliminate organisational hot spots. Which is why "wide metadata" is so important, we're not talking of clustered NAS. Of course if you don't have scale out plans in the beginning, you wouldn't think that way. I'm pretty sure this happened to more than one NAS buyer.

DSSD says Violin's right: SSD format is WRONG for flash memory

Axel Koester

Re: Oh dear

Invalid argument, most likely influenced by 'FUD'... IBM FlashSystem response time including SVC is still a very quick 200 µs, plus Real-time Compression for Oracle, still 200 µs. With SVC, 5:1 Real-time Compression and active Snapshots, it's... 200 µs. Talking about transferring FC blocks back and forth. When the non-SSD back-end is real fast, you can do a lot with less.

How much will I get from an Xtremio solution, at equal TCO? (3x slower response time, 0.7 times the capacity? Therefore database workloads are not recommended, only VDI where dedup has a chance and higher latency is not harmful? SAP benchmarks, anyone?)

Note that the FlashSystem V9000 does not need a separate SVC anymore (only the predecessor did). There's also a FlashSystem 900 CAPI adapter option that feeds data right into the CPU cache line - C in CAPI stands for 'Coherent'. The same principle as NVMe, but CPU synchronous, kind of "Nearline RAM". Great for big key-value stores... google "IBM Data Engine for NoSQL".


Axel Koester

Finally seeing the light!

One by one, Enterprise Flash vendors are finally rejecting SSDs for their next generation All-Flash-Array design. Good idea, who needs disk emulation in a disk-less device?

Disclaimer: I'm working for IBM as a storage technologist, and I was already admiring Texas Memory Systems RAMSAN before it became IBM. Today's IBM FlashSystems have been designed this way since 2012. Nice reverence!

Here are my top 5 reasons why SSDs are not ideal:

1. As a developer, you don't want third-party firmware to steal time cycles and hide vital information from the chip-level, or do other unexpected things. You'll revert to SSDs only if you don't have the time to develop good endurance-enhancing algorithms of your own.

2. Wear-levelling algorithms work better the more chips they cover. Custom modules carry many chips, and the algorithms can cross module boundaries. In contrast, space inside an SSD is very limited and so is the number of NAND Flash pages for local endurance optimization. They will wear out quicker than necessary.

3. Another pain is the RAID controller required to protect against spontaneous failures: You don't gain stability or lower latency by adding internal interfaces, especially with third-party elements. Without SSDs, one can *combine* endurance optimization with hot swap module protection in RAID5 style, a very efficient code stack which fits in lightweight FPGA gate logic.

4. SSDs not only wear out quicker, they will always be phased out before their reserve capacity pool is depleted. That is because disk slots don't support variable capacity by definition. Basically you're throwing away perfectly healthy chip capacity. Custom chip modules don't have that restriction; they can "help each other out". This results in a significantly longer mixed lifetime.

5. Enterprise storage should always strive to become more reliable, faster, cheaper, and lighter on power consumption: We achieve this by removing any piece of hardware or code in the data path that can be consolidated in something smaller. As a designer, once you're talking to the raw chips, you also gain the freedom to select the chip manufacturer of *your* choice, not the one that the [commodity] SSD manufacturer currently finds to be the cheapest.
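Point 2 above - that wear-levelling works better the more chips it covers - can be shown with a toy simulation. If levelling is confined inside each SSD, a device holding a hot volume wears its own chips out far faster than a pool-wide leveller that can spread the same writes over every chip in the array. This is a simplified model for illustration, not a vendor algorithm; chip counts and the 80% hot-spot skew are made-up parameters.

```python
# Toy comparison: per-SSD wear-levelling vs. pool-wide wear-levelling
# under a skewed workload where one device receives most of the writes.
import random

random.seed(1)
CHIPS_PER_SSD, SSDS, WRITES = 4, 4, 10_000


def least_worn(pool):
    """Greedy leveller: always write to the least-worn chip."""
    return min(range(len(pool)), key=lambda i: pool[i])


# Workload: 80% of writes hit SSD 0 (a hot volume pinned to one device).
targets = [0 if random.random() < 0.8 else random.randrange(SSDS)
           for _ in range(WRITES)]

# (a) Per-SSD levelling: each SSD balances only its own 4 chips.
per_ssd = [[0] * CHIPS_PER_SSD for _ in range(SSDS)]
for t in targets:
    chips = per_ssd[t]
    chips[least_worn(chips)] += 1

# (b) Pool-wide levelling: any write may land on any of the 16 chips.
pool = [0] * (CHIPS_PER_SSD * SSDS)
for _ in targets:
    pool[least_worn(pool)] += 1

print("hottest chip, per-SSD levelling: ", max(max(s) for s in per_ssd))
print("hottest chip, pool-wide levelling:", max(pool))
```

The hottest chip under per-SSD levelling accumulates several times the wear of any chip under pool-wide levelling, so it hits its endurance limit (and forces a device replacement) much earlier - the point the list is making.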

I'm also reading:

"The biggest slow-down in applications is inefficient code, improperly used cache, and terrible architecture".

Agreed. But we also find that the cheapest countermeasure (and sometimes the only one) is to throw faster hardware at the problem: 99 out of 100 FlashSystem clients are seeking the best latency as a quick solution for slowing applications, while less than 1% are already fully utilizing the max throughput.

Be careful, we're talking about 99 µs and 10 GB/s for a single 2U FlashSystem 900 here. And there are up to 8 such 57 TB drawers in a clustered FlashSystem V9000 configuration. The latter would also support Real-Time Compression modules, packing up to five times more Oracle DB data in the same space... equals 2.2 PB high performance databases in a single rack.
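The 2.2 PB figure follows directly from the numbers in the paragraph above - 8 drawers of 57 TB with "up to" 5:1 compression, using binary prefixes:

```python
# Checking the capacity arithmetic: 8 drawers x 57 TB, up to 5:1 compression.
drawers, tb_per_drawer, compression = 8, 57, 5

raw_tb = drawers * tb_per_drawer              # 456 TB of physical flash
effective_pb = raw_tb * compression / 1024    # effective capacity in binary PB

print(raw_tb, "TB raw ->", round(effective_pb, 1), "PB effective")
# 456 TB raw -> 2.2 PB effective (matching the figure quoted above)
```

As the earlier post notes, the 5:1 ratio is an "up to" figure, so the effective capacity scales down with less compressible data.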

We're expecting that number to double with the new chip generation becoming broadly available. Rest in peace, disk arrays.

