Air Movement Device
I think they changed that acronym to a Forced Air Node
I work for a competitor, so this is going to sound bitchy, even though I think the 3PAR architecture is really clever and admirable engineering, and Nimble did a great job when they launched InfoSight a few years ago. But I can't see a single thing here that couldn't be summed up as "We've renamed 3PAR, bumped the ASICs and CPUs to handle NVMe backends, and wrapped it in some magical AI marketing sauce".
Those are all good things, but statements like
"incremental gains with NVMe or storage class memory, but there's no game-changer left there, because everybody expects the speed of flash as table stakes"
Huh? Storage class memory (SCM) has 100x (two orders of magnitude) better response times than flash. Used properly it really is a game changer, you just won't notice it so much if you put it behind a storage controller. Inside a storage controller it lets you have bigger caches, but the hottest stuff is already in DRAM, which is faster than SCM, and there's a rapidly diminishing return on making storage controller caches bigger even if you use The Magic of AI.
Then HP goes on to say ...
"Primera addresses this through the use of real-time embedded AI-driven analytics to optimise the infrastructure through predictive acceleration."
So first they say there's only incremental gains, and then they say they've "addressed" it using The Magic of AI to do something arrays have been doing since, like, forever. "Predictive acceleration" is also called read-ahead caching, and more recently automated storage tiering.
Using ML to enhance data placement algorithms is cool, but hardly game-changing, or particularly unique; real-time data placement optimisation is old tech now, and I strongly suspect there's nothing very "learny" about the data placement algorithms, which, to give the devil his due, were reputed to be pretty damn good in the old 3PAR code.
Overall this just looks like HP trying out the same marketing spin that Dell EMC has been trying to pull off with PowerMax: a big rebranding exercise without a lot of substantive technology behind it. There's nothing wrong with that, but simply promising NVMe disks and SCM some time in the future, and suggesting it is impossible to get the best out of either without integrating it back to the mothership's cloud analytics (good luck getting that through in high-security organisations like defence), seems not only dull but a little bit misleading.
Hi there, "John from NetApp" here .. which features do you think would be important? Ability to scale well beyond 8 nodes without worrying about the impact of bully workloads? Support for vVols? Support for VMware private cloud? Support for Red Hat OpenShift? Best-in-class integration with Kubernetes (going well beyond, and driving, CSI standards)? Storage I/O performance you can rely on, with real quality of service guarantees (not just limits)? Deduplication and storage efficiencies with no performance impact that actually save significant space (generally 5:1 - 10:1 before snaps, vs Nutanix at about 1.5:1 - 3:1, without needing yet _more_ DRAM and CPU overheads for the CVM)? World's best NFS and SMB file services? Ability to replicate data to AWS, Azure, GCP, IBM today? Lower TCO? ... I agree a features comparison would be a hoot, though I don't think it would be entirely fair to compare a product built for all-flash, containers, high speed networking and multi-cloud integration to something which was built to optimise the performance of a distributed filesystem in a single datacenter running on spinning rust over 1Gbit networks (must keep all data local because networks are soooo slow).
100 microseconds is a ridiculously large overhead in the world of solid state media .. the protocol overhead of NVMe is about 5 microseconds, the wire latency of electricity is about a nanosecond per 30cm, and the switch latency in Ethernet is about 200 nanoseconds ...
If you look at the old benchmarking from Chelsio in the SNIA presentation here https://www.snia.org/sites/default/files/SDC15_presentations/networking/WaelNoureddine_Implementing_%20NVMe_revision.pdf you'd see that there should only be about 8 microseconds of difference between NVMe inside a server on PCIe and the same I/O over a network .. end-to-end latency for a 4K I/O should be in the vicinity of 20 microseconds.
If you really want to geek out on this stuff, check out http://sc16.supercomputing.org/sc-archive/tech_poster/poster_files/post149s2-file3.pdf which shows that the actual latency difference between running RDMA traffic over Layer-2 vs TCP for a 4K I/O size should be about 5 to 10 microseconds if you're just measuring protocol-level differences.
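To put those components together, here's a back-of-the-envelope latency budget; the media and fabric-stack figures are my assumptions (not measurements), chosen only to line up with the ballpark numbers quoted above:

```python
# Rough latency budget for a 4K NVMe-over-fabrics read, in microseconds.
# MEDIA_US and FABRIC_STACK_US are assumptions for illustration only.
MEDIA_US = 12.0          # assumed NAND read + drive controller time
NVME_PROTOCOL_US = 5.0   # NVMe protocol overhead (quoted above)
WIRE_US = 0.1            # ~1 ns per 30 cm, call it ~30 m of cable round trip
SWITCH_US = 0.2          # one Ethernet switch hop, ~200 ns
FABRIC_STACK_US = 7.7    # assumed extra fabric software/NIC processing time

local = MEDIA_US + NVME_PROTOCOL_US
remote = local + WIRE_US + SWITCH_US + FABRIC_STACK_US
penalty = remote - local

print(f"local PCIe NVMe : {local:.1f} us")
print(f"NVMe over fabric: {remote:.1f} us")
print(f"fabric penalty  : {penalty:.1f} us")  # ~8 us, the Chelsio ballpark
```

The point being that the wire and the switch are rounding errors; almost all of the fabric penalty is software stack, which is why 100 microseconds of added overhead is indefensible.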
As a benchmark of NVMe over Fabrics using RoCE v1 or v2 vs using TCP, it's kind of uninspiring on all levels.
There are quite a few interesting omissions from the listed competitors, not just Microsoft or Azure, but also Oracle, Google, Cisco, and pretty much every Chinese/Taiwanese ODM and cloud vendor (if you haven't seen how big Aliyun is you're probably in for a shock). The last time I looked, MS alone was spending more on Azure infrastructure per year than Dell has spent paying down its billions in debt. Once you add in AWS, Google and Aliyun, the idea of Dell trying to compete head to head with them seems kind of silly, so it looks like they only want to say they're picking fights with people they think they can beat.
The trouble is, that even if they omit the Hyperscalers from their list, they really are competing with them for customer mind and wallet share, and the Hyperscalers aren't stupid enough to think Dell thinks of them as their best buddies either. If you don't believe me, grab a coffee with someone you know at AWS/Azure/GCP and ask them what they think of "VMWare Private Cloud" and stand back a little to avoid the mess as they spray warm beverage out of their nose while they're busy laughing at you.
That intrinsic tension between Dell's go-to-market and the hyper-scale cloud vendors is going to make it challenging for Dell to "dominate the data", because in order to do that they'd be better off forming some pretty good relationships with them, and the Amazure Compute Platform groups have some very clear intentions about drawing most of the data that Dell wants to dominate out of the datacenters that Dell sells into and putting it into their datacenters, under their control. In their view a customer datacenter full of Dell kit is, at best, just the new "edge" that holds a minor amount of transient data.
Oh, and one more thing, on the line "Dell Technologies’ listed competitors have all either shrunk or had other troubles of late", I feel compelled to point out that NetApp has been doing very nicely over the last year or two, and I don't think I've ever seen George K hesitate during an interview, especially when he's asked an obvious question.
I've got no beef with DDN bumping up their speeds and feeds, but it simply doesn't have much relevance to machine learning and deep learning (which is what most people mean when they talk about AI these days).
Unlike traditional HPC, AI/ML simply isn't that hungry in terms of throughput, so the improvements aren't really that relevant .. it bugs me because I have to spend too much of my time injecting some reality into the actual storage and data management requirements for ML and DL data pipelines, and weaning the HPC folks, who are increasingly being tasked with specifying ML/DL solutions, off their storage bandwidth addiction .. deconstructing hype like this gets tiring, so I get narky when I see it happen over and over again.
Also, asserting that spinning rust is denser than flash is just silly, especially when the things you do to try to make it true increase the AFR of the spinning rust and make it more difficult to swap out the drives when they do fail.
And in the interest of full disclosure, I work for NetApp with a particular focus on AI/DL solutions and data pipelines.
"900TB in 4U (225TB/U storage density – better than the SFA200NV/400NV) and, with the four supported 4U x 90 bay expansion cabs, 4.5PB in a rack. That's better storage density compared to flash and it's cheaper."
An AF700 is 4U and can hold 24 30TB drives, so raw capacity is 720TB. Then add, say, four of the 24-drive expansion shelves at 2U each (another 720TB, or 360 TB/RU, per shelf) and you've got 720 + 4 × 720 = 3,600TB, call it 3.6PB of raw storage, in 12RU.
Now take off some capacity for RAID etc, which is probably going to be more or less the same for DDN and ONTAP, then add on the 4.7:1 average savings from dedupe, compression, and clones, and you're looking at well over 10 petabytes of effective storage in 12RU vs 4.5PB in a whole rack.
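A quick sanity check of the density arithmetic (30TB drives throughout, head and shelf sizes as described above; the 4.7:1 ratio is the claimed average efficiency, applied before RAID overhead purely to get an order of magnitude):

```python
# Raw density: AFF head + four expansion shelves vs the DDN rack claim.
DRIVE_TB = 30
head_tb = 24 * DRIVE_TB           # AF700 head: 4U, 24 drives = 720 TB
shelf_tb = 24 * DRIVE_TB          # each 2U expansion shelf: 24 drives
raw_tb = head_tb + 4 * shelf_tb   # head plus four shelves
rack_units = 4 + 4 * 2            # 12 RU total

print(f"AFF raw: {raw_tb} TB in {rack_units} RU "
      f"= {raw_tb / rack_units:.0f} TB/RU")
print(f"DDN raw: 4500 TB in 42 RU = {4500 / 42:.0f} TB/RU")
print(f"AFF effective at 4.7:1 (pre-RAID): {raw_tb * 4.7:,.0f} TB")
```

Even on raw TB per rack unit, before any efficiency savings, the flash box comes out well ahead.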
Flash WINS !!!
Also ... deep learning performance is rarely about raw storage throughput .. you can choke the GPUs in a DGX-1 with less than 2 gigabytes per second of throughput, and that's with a _lightweight_ learning model; the really deep stuff rarely goes much above 500MB/sec for 8 of the biggest, baddest GPU cards ... HPC workflows and architectures != deep learning, so sorry DDN, fast'n'cheap isn't really going to be nearly as compelling in AI as it was in HPC.
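As a sanity check on the "GPUs aren't that hungry" claim, here's the rough arithmetic; the per-GPU samples/sec and record-size figures are my assumptions for a ResNet-50-class image model, purely to get an order of magnitude:

```python
# Rough I/O demand for feeding 8 GPUs during image training.
# samples_per_sec_per_gpu and avg_record_kb are assumed figures.
samples_per_sec_per_gpu = 400   # assumed training throughput per GPU
gpus = 8
avg_record_kb = 110             # assumed average JPEG/record size

mb_per_sec = samples_per_sec_per_gpu * gpus * avg_record_kb / 1024
print(f"~{mb_per_sec:.0f} MB/s keeps all {gpus} GPUs busy")
# Deeper models process fewer samples per second, so they demand even less.
```

A few hundred MB/s for a whole DGX-1 is a long way short of the tens of GB/s that HPC-style storage is built to deliver.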
1. https://www.systor.org/2017/slides/NVMe-over-Fabrics_Performance_Characterization.pdf is probably a better resource if you're really interested in the speeds and feeds of local vs iSCSI vs RDMA.
2. It's not just RDMA that makes things fast with NVMe-oF, it's the "zero copy" aspect of RDMA that gives most of the performance benefit .. NVMe/FC uses zero-copy techniques without RDMA, and some micro-benchmarks show there's only a tiny difference (a few microseconds) between the two approaches.
3. NVMe-oF consumes MUCH less CPU than any SCSI-based storage protocol (FCP, iSCSI, or even iSER, which is also RDMA based), and other efficiencies in the software stack shave off at least 20 microseconds of latency when comparing SAS vs NVMe on a local system. That protocol efficiency is enough to make accessing flash via NVMe over Fabrics faster than local SAS (the network overhead of fabric NVMe vs local NVMe is much less than 20 microseconds), and based on the benchmarks done by Samsung, you're looking at about a 10% difference in latency for local vs remote NVMe.
4. From my reading of the E8 architecture, it does a lot of caching at the host layer in the E8 agents; the actual array itself isn't that special (about the same as a NetApp EF570/EF580). If I've read the marketing material correctly, by absorbing a lot of the read I/O at the host layer you're not really seeing much benefit vs DAS from NVMe-oF as the article implies, which probably explains why the results don't show the same 10% difference in local vs remote performance seen in the testing done by Samsung .. though a bunch of them were probably throughput tests rather than random I/O tests, and in throughput there's pretty much zero difference until you saturate the network.
5. You really have to look at the end-to-end architecture. HDFS, for example, does a horrendous job of aggregating the performance of multiple devices on the same host, and distributed shared-nothing infrastructures simply don't get down to anywhere near the same level of performance as a highly engineered HA pair, especially once the write workload becomes non-trivial .. that affects pretty much every hyperconverged solution out there, and adding in NVMe over Fabrics isn't going to change that by much, because the bottlenecks are in the other parts of the stack.
6. To attach an RDMA-capable external block-level device to a DGX-1, you're going to have to use something that can attach via InfiniBand (like, say, an EF580), and as I don't think you can load external software like the E8 agent onto a DGX-1, you're going to be limited to the performance of the actual array. If you want Ethernet, then low latency scale-out NFS is still pretty much your only option, and there's a surprising amount of ML training data that turns out to be remarkably compressible, which makes the AF800 (which supports end-to-end NVMe today) the biggest, fastest storage you can easily attach to a DGX-1 today (e.g. three hundred gigabytes per second of throughput is quite achievable in a single cluster).
1. When NVMe was being designed, it was designed quite specifically as a way of communicating with devices, not raw chips, so if your definition of SSD = Solid State Device, then NVMe pretty much does need an SSD.
Of course you could do a bunch of custom work by mashing together some NAND and an FPGA or an ASIC to implement the firmware that interprets the commands coming down from NVMe and then passes them on to a media handler to do the actual I/O to the chips (with NAND that's a flash translation layer), then put the SERDES bits in and the connector to hook it onto a PCIe bus .. but at that point what you have is an SSD with an NVMe interface.
If your definition of SSD = Solid State Disk, i.e. the packaging format that looks a lot like an old-school SAS/SATA disk drive, then no, there is nothing in NVMe that dictates the use of that format; it's just an incredibly practical way of deploying solid state devices, because it works with all the existing electrical and materials-handling stuff (like hot swap in a drive enclosure) most datacenters rely on.
It's a minor point, but IBM doesn't have end-to-end NVMe in its FlashSystem products, because their proprietary flash modules don't use NVMe, and the winner of the "we built it first" bragging rights still goes to NetApp, because the EF570 announced NVMe over InfiniBand support a month or two ahead of IBM.
Of course the creators of DSSD deserve a lot of kudos for doing a lot of early work in this area and releasing a product, but it was for a fairly niche HPC-style use case which required special software to be installed on hosts, proprietary interconnects, etc, which made it unsuitable for the vast majority of enterprise use cases. The same could be argued, to a lesser extent, of the EF570 and IBM InfiniBand-based NVMe-oF implementations too. Great for HPC, where InfiniBand rules, but less so for the enterprise, where Fibre Channel dominates.
The key here is that anyone with Gen-6 Fibre Channel (released in 2016, so there's a decent number of datacenters already running Gen-6 gear) can immediately start using NVMe-oF to ONTAP 9.4 based arrays with Gen-6 FC cards in them (including the A300 and A700), which will reduce the CPU load on the hosts from I/O and give a tidy improvement in overall latency. Furthermore, if you want the goodness of full end-to-end NVMe to get those final hundred or two microseconds of latency improvement, the AF800 will give that to you today. Given that the A700 with SAS-connected SSDs easily outperformed the top-end array from a competitor using NVMe-connected SSDs, this should put the A800 firmly in the lead as the fastest enterprise-class array in the market.
The main benefits of moving to NVMe on the host (as opposed to just shoving NVMe drives into an array while still using SCSI protocols over FC or iSCSI) are
1. Lower latency
2. Lower CPU consumption on the host
3. No need to manage queue depths, because the queues are effectively infinite
None of that will make much difference if you're using disk, or are happy with 1-2 millisecond access times, or are only doing about 10,000 IOPS per host; but if you're doing some heavy-duty random access, like using your array to run training workloads for deep learning on a farm of NVIDIA DGX boxes, then those things make a big difference.
Plus, more performance and lower overheads from a straightforward software upgrade (which is what moving from FC to NVMe/FC should be) is a nice win.
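The queue-depth point can be made concrete with Little's Law (throughput ≈ outstanding I/Os ÷ latency); the specific queue depths here are illustrative assumptions, not limits of any particular HBA:

```python
def iops_ceiling(queue_depth, latency_us):
    """Little's Law: achievable IOPS = concurrency / per-I/O latency."""
    return queue_depth / (latency_us / 1_000_000)

# A SCSI path capped at (say) QD 64 against 100 us flash latency:
print(f"QD 64  : {iops_ceiling(64, 100):,.0f} IOPS ceiling")
# NVMe allows up to 64K queues of 64K commands each, so e.g. QD 1024:
print(f"QD 1024: {iops_ceiling(1024, 100):,.0f} IOPS ceiling")
```

With millisecond disk latencies those ceilings were never the bottleneck; at 100 microseconds, the queue depth cap is suddenly the thing you hit first, which is why "effectively infinite" queues matter.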
I wrote some of this up in detail here https://www.linkedin.com/pulse/how-cool-nvme-part-4-cpu-software-efficiency-john-martin/
It's not what a dinosaur is that's important, it's what a dinosaur means .. ask any 7-year-old if a bird is a "real dinosaur".
You could argue the same bird-vs-dinosaur thing as you can with mammals vs synapsids, which were megafauna that pre-dated the dinosaurs and were (mostly) wiped out in the Permian-Triassic extinction event .. which eventually gave rise to mammals. When people think of dinosaurs, they think of the big impressive versions of megafauna that rely on a stable ecosystem and which die out during extinction events. I'd agree though that it's the smaller ones, the ones that can get by on the smell of an oily rag and survive through the dominance of the next set of behemoths, that you have to look out for, even if they don't look anything near as cool as a Dimetrodon (which isn't a dinosaur).
I'll avoid my usual War-and-Peace posts here, but honestly, outside of a really small number of use cases the network should never be a significant part of your latency .. DAS approaches are generally much less efficient overall, and often slower, than networked/external storage .. case in point, most HDFS implementations get a fraction of the available performance out of locally attached HDD/SSD resources. The whole "move the compute to the data" approach was genius when you were dealing with 100Mbit Ethernet, but the compromises and the overheads involved simply don't have a good payoff when you're dealing with 100Gbit Ethernet.
Even with storage class memory (Optane, Z-NAND, etc), the difference between local and remote access on a 100Gbit RDMA network is about 2-3 microseconds, with most of that being due to having to run through the software stack twice (once on the local side and once on the remote).
Sometimes I wish people would stop being "forward thinking" by using approaches that solved a problem from a decade or more ago.
Back just before the Sydney Olympics ... I was working for Legato then; we had a serviced office in North Sydney with about six people in it covering pretty much all of South East Asia. NetBackup was growing rapidly, Veritas seemed unstoppable, they were killing everyone in the Unix backup space, and had recently purchased Backup Exec and owned the Windows and NetWare market ... the revenue and mindshare they had from their volume manager business kept them in big deals with the major array vendors, and their cluster product was beginning to get traction ... but we kept putting up the good fight.
Then I remember driving over the Harbour Bridge, and there was this bloody great Veritas sign, bang smack on top of one of the new high-rise office blocks, and I remember thinking ... bugger, they've won, game over, you just can't compete with that. A year or two later I got a tour of the place as an independent contractor, and it was hard not to be impressed with the investments they'd made in their support infrastructure, the quality of the staff, and the professionalism of the management. They deserved their success.
17 years later (damn I've been in this industry too long), the sign got changed to Symantec, and now not only is the shingle gone, but so, for the most part, is the office. Oh how the mighty have fallen.
Veritas' fortunes rose and fell along with those of Sun and the rest of the mid-range Unix ecosystem, so what happened isn't entirely surprising, but I really hope they looked after their support team; I never heard anything but good things about them.
So the proposition starts with "Say HCI is an advanced server", and then carries on to a set of conclusions based on that premise. Not a bad set of conclusions, assuming the vast majority of HCI purchasers are people who have brand loyalty to DL380s or R270s or whatever other server type/vendor you have familiarity with, but I think the premise is flawed, because there's nothing that enhanced about HCI hardware.
From my perspective (which is of course informed by, but not limited to, the perspective of my employer), having watched server vendors claim that their particular 1U or 2U server was "best suited" to VMware at VMworld/vForum events since about 2006, the VMware folks didn't really care. More often than not they were happy that server virtualisation completely commoditised the underlying hardware, so they didn't have to care.
About the only difference I could see between a typical HCI box and the 1U/2U rackmount DL380-style server with a handful of hard drives that was the mainstay of virtual server infrastructures is that HCI typically used a "4 in a box" configuration with 24 drives. I'm sure that if you locked a bunch of techs from Lenovo, HP, Dell, Fujitsu, Huawei, Quanta and Supermicro into a cage and got them to argue about the merits of their particular flavour of 4-in-a-box it might get interesting, but I don't think the VMware or business folks really care, provided they come in at a competitive price with good enough quality and service contracts.
The main enhancements with HCI are in the initial setup and expansion, the user interface, and the software-defined storage layer .. all of which are software things, not hardware things.
If server brand loyalty were a major determining factor, then I'd expect to see a MUCH bigger marketshare for VSAN on HP vs what your graphs show for Dell, even with HP's current difficulties. I suspect those old "DL380 or die" guys are still buying DL380s and not moving into HCI.
BUT when the VMware guys get tired of fighting with the server team and the storage team and decide to take their fate into their own hands, current server brand loyalties mean very little, and may even work against the incumbent vendor, as the VMware team decide to keep the old fogeys in the server team out of their newly built walled garden.
Longer term, I'd expect to see the commoditisation of compute continue, with VMware/internal cloud teams wanting to take advantage of the Taiwanese ODM economies of scale .. because the ODMs seem to be the only people really growing. HP tried to keep up with them, supplying Azure, until it turned out it wasn't profitable enough for them. It won't all go the ODMs' way of course; Dell seems pretty focussed on winning the private cloud war, as does HP, but of the US vendors Dell looks to be in a better position there because they own VMware (or maybe the other way around soon) .. but that's a VMware value prop, not a Dell hardware/organisation story.
Unlike blades, HCI doesn't seem to be an enhanced kind of server; it's an enhanced way of integrating the best parts of a software-defined datacenter stack, turning commodity compute into something more useful .. once you look at it that way, hardware incumbency starts to look kind of irrelevant.
Someone asked if anyone actually needs any more CPU, which is a fair question, because I had the same question when I had a desktop with a 386 processor running at 25MHz with 2MB of RAM and a 40MB hard disk .. it was used mostly for word processing, terminal emulation and code editing, functions which it did perfectly well.
People didn't make new CPUs because software was badly written; they did it because it drove hardware refresh sales. Inefficient software just kind of happened because the resources were cheap.
In any case, the bottlenecks are rarely CPU; more often it's memory and storage, and some people just devised atomristors, memristors operating at 50GHz that are one atom thick, have a storage density of 1 terabit per square centimetre, and look like they'll be able to be stacked in 3D configurations like NAND .. which is kind of mind-blowing for a whole variety of reasons.
Moore's Law probably started dying (or actually died) in 2006 when Dennard scaling stopped going as predicted https://en.wikipedia.org/wiki/Dennard_scaling , so the big impact from a CPU perspective is that we are now at the beginning of the "post-CMOS age", because Moore's Law was intimately tied to the falling costs of manufacturing CMOS-based transistors. If we go beyond CMOS, then we get a post-CMOS Moore's Law .. though if you make CPUs out of stuff that doesn't need power to maintain state, then there might be a big blurring effect between CPU and memory, resulting in radical shifts towards something where "processing" is an emergent function of memory, leading to direct support for neural nets and deep learning.
Interesting times ahead (still)
The difference between the NVMe and SAS protocols is about 20 microseconds, and media access times on flash drives are still about 80 microseconds on both NVMe-attached and SAS-attached drives. Hence adding NVMe-attached NAND media might give you about 20 microseconds better latency, which is good, but not really worth the hype that seems to be poured all over the NVMe discussion.
With NAND, anything offering latency lower than 100µs is going to be accessing the majority of its data from DRAM or NVDIMM or something else which isn't NAND flash .. e.g. a 30µs - 50µs write is going to NVRAM .. I don't have the numbers for an EF write, but I'm pretty sure it's in the same ballpark.
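To put the ~20 microsecond protocol saving in context of the media time (figures as above; the exact SAS and NVMe stack numbers are assumptions chosen to give a 20 µs delta):

```python
# Relative latency improvement from swapping SAS for NVMe on NAND media.
MEDIA_US = 80        # NAND read, roughly the same on either interface
SAS_STACK_US = 25    # assumed SCSI/SAS protocol + driver time
NVME_STACK_US = 5    # NVMe protocol time (~20 us less than SAS)

sas = MEDIA_US + SAS_STACK_US
nvme = MEDIA_US + NVME_STACK_US
print(f"SAS: {sas} us, NVMe: {nvme} us, "
      f"saving: {100 * (sas - nvme) / sas:.0f}%")
```

A ~20% improvement is nice, but it's the media time that dominates, which is why NVMe-attached NAND alone doesn't deliver a step change.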
The other advantages NVMe has are more PCIe lanes per drive (typically 4, vs 2 for one enterprise SAS drive) and deeper queues, which don't get used in practice and don't seem to make a difference in the vast majority of workloads .. I blogged about this here https://www.linkedin.com/pulse/how-cool-nvme-throughput-john-martin/ and here https://www.linkedin.com/pulse/how-cool-nvme-part-3-waiting-queues-john-martin/
The big benefit of NVMe from the client to the host is that it requires way less CPU to process I/O, so for environments like HPC, where you are running the CPUs really hot, giving back a bunch of CPU cores while processing millions of IOPS is a worthwhile thing. It also helps on the storage array, because processing NVMe on the target requires less CPU too, which is often the gating factor for performance on the controller. But the CPU impact of doing SCSI I/O is a relatively small portion of the overall CPU budget on an array (vs erasure coding, replication, snapshots, read-ahead algorithms, T10 checksums, etc), so reducing the I/O CPU budget is going to give a useful, but hardly revolutionary, improvement in controller utilisation, and scale-out architectures are a better long-term way of addressing the CPU vs performance issue for storage controllers.
As for the apparent fixation on needing NVMe media to get the best out of flash: even ONTAP, with inline compression and dedupe, FCP at the front end and SAS at the back end, is able to achieve significantly better latency than a certain high-profile all-flash array that purely uses NVMe media. Getting significant performance improvements will be more about software stacks and media types than about whether you can save 20 microseconds by moving from SAS to NVMe.
So, can things go faster by using NVMe end to end? Yes. Will it be a revolutionary jump? No, unless you're using something way faster than typical NAND; but if you're going to do that, you're going to want to optimise the entire I/O stack, including the drivers, volume and filesystem layers, which is where something like Plexistor comes in.
Hahahahaha - thank you for making me laugh while reading a dry article on rebranding the cloud (hmm, can't think of any other time something cloud-related has been rebranded).
There's so much that can be done with Sphincter Storage ... I wonder if it's time for another simplistic.io
"its product lacks servers, networking and system-level applications. Servers and networking are provided through the FlexPod partnership with Cisco"
NetApp HCI gives NetApp a server capability outside of the Cisco relationship ..
In the end though, it's more about the software and platform layers, such as NetApp providing the NFS capability within Azure.
Fabric is a well-known term that's been around since at least the advent of Fibre Channel .. the main difference between a "fabric" and a "network" is that a network is generally made up from a hierarchy of switches with the top layers oversubscribed .. it's usually optimised for north-south traffic.
A fabric, on the other hand, is built so that each node is more or less directly connected to the other nodes in a non-blocking way; while I am not a Network Expert, I believe these are also called Clos or spine-leaf networks. They are generally optimised for east-west traffic.
So really a fabric is just a special kind of network, but the use case is sufficiently different from the way the vast majority of networks have been designed that they're really in a class of their own.
Back in the good old days, those east-west networks were called a SAN, which stood for Server Area Network .. it just so happened that directly connecting servers across a non-blocking network architecture didn't have much of a use case outside of HPC, but using it to overcome the limitations of SCSI cabling was a no-brainer .. and so it became a "Storage Area Network" ... with scale-out everything (storage, HCI, sharding, map-reduce, etc) and RDMA, these fabrics (built from Ethernet, InfiniBand and Fibre Channel) are all coming into their own. I think it was Al Shugart who was so threatened by IBM's SSA technology that he dug up this odd HPC tech (Fibre Channel, which was meant to replace something else called FDDI), cut it down to create FC-AL, and then rallied an "anyone but IBM" standards body around it .. and lo and behold .. SNIA and the entire SAN industry were born. But that's all ancient history now.
Also, what's the matter with the word Fibre? .. unless you're bent out of shape about the spelling? And there's actually a good historical reason why Fibre Channel doesn't use American spelling.
I'll try to be clearer so you can appreciate the difference
"So basically you're stating that you occupy the space that sits between smaller HCI deployments (8-nodes or larger) and a large enterprise 3-tier solution like a FlexPod/Vblock"
No, if that's how I came across, then allow me to correct myself .. NetApp HCI scales well into the space currently occupied by FlexPod, and the small FlexPods (FlexPod Mini) also scale down into the space occupied by the sweet spot for the Gen-1 HCI products. The important thing is that NetApp HCI is not limited to the departmental scale that is the typical implementation of an individual Nutanix or VxRail cluster.
But wait... doesn't VxRack (and Nutanix for that matter) already provide the same thing (service provider scale, thousands of nodes, millions of IOPS, PBs of capacity, blah, blah)?
Not really, no .. it might be possible to configure a single VxRail-based cluster which has some impressive theoretical specs, and then run a homogeneous workload that balances nicely across all the nodes in the cluster, but running a real mixed workload typical of a datacenter would probably result in an SDS-induced meltdown at some point, especially after a node failure triggers a storm of rebalancing behaviour. I'm painfully aware that this comes across as unsubstantiated FUD, but I'm not at liberty to directly disclose the results of the interviews and market research that were done when the product was being designed, so I can't substantiate it here. By analogy though, there is a reason why VxRack (EMC's large-scale HCI offering) doesn't use VSAN as its underlying SDS layer.
As I've said earlier, the key differentiation with NetApp HCI is the SDS layer. If you're interested in that differentiation, begin with an overview of SolidFire, keeping in mind that it was built to leverage the best of flash technology (it didn't start out as a hybrid array), and think about how, in the storage world, most hybrid arrays are declining in sales while all-flash is increasing dramatically. NetApp HCI's storage technology is good enough to compete with a specialist storage array toe-to-toe and win, even without the other benefits of HCI .. I don't think you could argue the same thing for VSAN or NDS.
Odd, I heard almost exactly the same thing about All Flash FAS .. which, two years after release, generated the majority of NetApp's 20%+ AFA marketshare ($1.7 billion run rate) and is growing twice as fast as Pure or EMC. If you look at NetApp's SAN revenue growth (+12.6%) vs Dell EMC (-16.6%) or IBM (-12.2%), a lot of this is going into net new logos .. some of them very large net new logos.
NetApp no longer equals ONTAP. The industry is in one of those rare times when everything changes, and we've been planning for this opportunity for a while now. NetApp HCI is going to be a big part of that.
There was a lot of research done in preparation for this, and a stack of NDA presentations to prospective customers. The feedback was pretty consistent: the majority of Nutanix and VSAN customers hit problems at scale, mostly due to the SDS layer. One of those presentations was to what was probably one of the two or three biggest Nutanix customers worldwide, and their big storage workloads all ended up on a traditional SAN. There was also a recent Register piece from Chad Sakac of EMC who pretty much said the same thing for VxRAIL.
I take your point about the Nutanix management interface; it's a lot more than that, and it's a great piece of technology, arguably the best thing about Nutanix. There are people, however, who prefer to stay inside vSphere and use the VMware toolkits for most things. The NetApp HCI UX works really well for people with that preference.
Saying we've reinvented vBlock isn't a bad thing either. I respect all my competitors and there's a lot of excellent engineering there for the right use case and IT organisation, but really the most direct comparison to vBlock is FlexPod, which is growing rapidly (+20%) while vBlock is shrinking even faster (-30%). NetApp HCI is built to be installed and operated purely by the VM / cloud admin without any storage expertise, and it also scales in much smaller increments.
Which fancy buzzwords are you referring to exactly? Reliability, serviceability and scalability? Odd, because I thought those were infrastructure design goals. But let me answer you point by point.
1. all the existing HCI solutions have a bunch of "limitations", that are centered around flexibility, scale, and performance
That's a fair characterisation of the first generation of HCI products, though to be fair every architecture has limitations in all these areas. In the case of Gen-1 HCI, those limitations are enough to keep most implementations under 8 nodes per cluster for a single workload, before another cluster is stood up to handle a different workload. It's rare to see VDI and database workloads on the same cluster.
2) hey we've got this new unique thing that let's you scale storage and compute independently with QoS!
From an HCI perspective, a high-quality QoS implementation based on all-flash (which is pretty much required to implement guaranteed minimum IOPS), along with inline storage efficiencies, is a new and unique thing. From a SolidFire perspective this isn't new, but it is still unique in a shared-nothing software-defined storage product that is proven to work at scale.
High-quality QoS at the storage layer enables scalable, predictable, multi-tenanted infrastructure. There is a direct correlation between the quality of your QoS implementation and your ability to scale within a single cluster. QoS, however, has little to do with independent scaling of compute and storage; that feature comes from the way ElementOS has been packaged within NetApp HCI.
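For anyone who hasn't seen a floor/max/burst implementation, here's a rough sketch of the burst-credit idea in Python. To be clear, this is an illustration of the general technique, not the ElementOS implementation or API; the class, method names, and numbers are all made up:

```python
class QoSVolume:
    """Illustrative floor/max/burst QoS for one volume (hypothetical).

    - min_iops: the performance floor the scheduler must protect under
      contention (a placement/scheduling guarantee, not modelled here)
    - max_iops: the sustained ceiling
    - burst_iops: a short-term ceiling, paid for with credits accrued
      while the volume runs below max_iops
    """
    def __init__(self, min_iops, max_iops, burst_iops, burst_window_s=60):
        self.min_iops = min_iops
        self.max_iops = max_iops
        self.burst_iops = burst_iops
        # Cap on stored credits: enough to sustain a full burst for the window.
        self.max_credits = (burst_iops - max_iops) * burst_window_s
        self.credits = 0.0

    def allowed_iops(self, demanded_iops, interval_s=1.0):
        """Return the IOPS granted this interval, updating burst credits."""
        if demanded_iops <= self.max_iops:
            # Running below the sustained ceiling earns burst credits.
            earned = (self.max_iops - demanded_iops) * interval_s
            self.credits = min(self.max_credits, self.credits + earned)
            return demanded_iops
        # Above max: spend credits, up to the burst ceiling.
        extra = min(demanded_iops, self.burst_iops) - self.max_iops
        spend = min(extra * interval_s, self.credits)
        self.credits -= spend
        return self.max_iops + spend / interval_s

vol = QoSVolume(min_iops=500, max_iops=1000, burst_iops=2000)
print(vol.allowed_iops(400))    # below max: earns credits, granted in full
print(vol.allowed_iops(1800))   # above max: burst limited by earned credits
```

A plain rate limiter is just the `max_iops` clamp with no floor and no credits; the floor and burst behaviour are what make consolidation predictable.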
3) it's basically a SolidFire storage array + some compute + network...but it's in a new package!
If you'd also categorise VxRAIL as just a VSAN array + some Dell 2U servers + network, or Nutanix as just a DSF array + various compute + network, then I suppose that would be a reasonable comparison, but none of those descriptions do justice to the rest of the work all three vendors have done around integration, user experience, workflow simplification and lifecycle management that goes into the packaging of those technologies. Arguably it's the packaging you appear to be deriding that delivers most of the cost savings and simplification benefits that people value in HCI.
"So we are full circle back to 3-tier architecture and all of the limitations/cost/complexity that comes with it."
OK, so when it comes to "3-tier architecture" I'll channel Inigo Montoya and say "I don't think that term means what you think it means" (via <href>https://youtu.be/G2y8Sx4B2Sk</href>). Most people would argue that it's a software architecture design pattern that's proven itself over and over again. Unless, that is, you'd argue that separating the presentation, logic, and data layers so they can be scaled independently (much like most relatively modern datacenter infrastructure design patterns) is a bad thing, and that we should all run monolithic software on mainframes because that's simpler.
OK, leaving technical pedantry around terminology aside, and given there are people who argue that the lack of core-sharing means NetApp shouldn't use the term HCI (see my post about looks like an HCI, walks like an HCI, quacks like an HCI), let's take the whole comment.
"So we are full circle back to 3-tier architecture and all of the limitations/cost/complexity that comes with it"
No. The compute and storage are designed to be separately scalable, and because of the way it's packaged the limitations/costs/complexity are removed. That's the whole point of the work done on integration, delivery, packaging etc.
Maybe you can help us understand architecturally how this is different than a FlexPod? What benefits would NetApp HCI provide over that solution?
FlexPod (and vBlock for that matter) were built to be large, standardised infrastructure scaling units, using a scale-up storage mentality designed for traditional IT departments, and there are trade-offs compared to NetApp HCI. With FlexPod you get the flexibility to choose pretty much any server config you like, and match it with an independently managed storage array sized, configured and generally managed by a storage expert who enjoys talking about RAID levels, LUN queue depths, NFS multipathing and so on, which aligned with the way many datacenter teams were built. Nothing wrong with that; it still works really well for a lot of IT organisations, and there are lots of very large converged infrastructure deployments, because it worked a lot better than the usually messy bespoke configurations people had been doing for their tier-1 apps and virtualised workloads.
NetApp HCI scales in much smaller increments and is designed to be installed, operated, and run entirely by the VMware admin with little or no storage expertise at all. It helps a lot that SolidFire was never like a traditional array in the first place. It wasn't designed for traditional IT storage / infrastructure people; it was designed for cloud architects building scalable next-generation datacenters.
"And given the very rapid adoption by customers of HCI solutions like Nutanix, Simplivity, VSAN, VxRail...you're telling us there are no benefits/value prop there, that only NetApp HCI can provide??"
No, I never said that at all. The first generation of HCI solutions proved the value of the approach; if there was no value proposition there, NetApp never would have invested in this space. What I am saying is that customers who like HCI, but have hit the limitations of their SDS layers and would like something with better, more predictable and scalable performance, should be talking to us.
"Come on man...you're talking yourself in circles"
Not really, the message remains the same: NetApp HCI is better than first-generation HCI for customers who want to save costs by consolidating more workloads into a single HCI cluster, with guaranteed performance and better scalability for their next-generation datacenter.
I use OSX mostly so if it was a cut and paste it would have been command-C .. but as it turns out I write all my own material.
speaking of which .. I once shot a competitor who was hiding behind an anonymous coward mask in my pyjamas ... how the mask got into my pyjamas I'll never know.
I'm here all week .. try the fish.
That the other HCI vendors are beginning to sell storage-only nodes (I haven't seen them sell compute-only nodes though) validates the architectural design NetApp has taken. The consumption models around them, and the rest of the menagerie of mix'n'match node types, seem to be a lot more complex than what's being launched with NetApp HCI. It's also worth noting that most (all) of these approaches require you to purchase additional VMware licenses for the storage nodes, and they tend to push up the licensing costs of Oracle and SQL Server, which tend to charge for the total number of cores in the whole vSphere cluster just because you might run Oracle on them one day (it's dumb, but it happens).
QoS that actually works and is easy to use and change, with floor, max and burst, is different from QoS that just does rate limiting and causes unpredictable latency spikes. Plus, a lot of people are still unwilling or unable to move to the latest version of vSphere.
Lastly, there's a bunch of other strengths ElementOS brings to the table in terms of performance, scalability, replication, D/R, failure resiliency, multi-tenancy, and the ability to both grow and shrink the capacity and performance of the storage pool non-disruptively.
Even so, there are going to be times when buying servers that have exactly the right ratio of compute to memory will make more sense than buying one of the three HCI compute nodes, but that's why there are also more traditional converged infrastructure offerings within the Data Fabric .. both approaches have their strengths, you just have to understand the tradeoffs in each architecture.
If you want a technical description then I'll have to do one of my LONG posts, and the comment section isn't long enough, so this is the short version. The three main things that differentiate HCI solutions are the software-defined storage layer, the hypervisor, and the management interface.

I'm going to leave the hypervisor question aside for the moment because the vast majority of the market is VMware. That's not to say that Hyper-V, various flavours of KVM, or a hypervisorless containerised approach aren't good in their own way, but even if you don't like VMware, most people I speak to agree that ESX (especially when combined with vSphere and the rest of the VMware ecosystem) is the best hypervisor, though clearly that's not differentiating.

Some would argue that Nutanix's management interface is one of its best features; others would say that if you're already committed to VMware, learning an additional interface just makes the learning curve steeper, and that you're better off not having to context-switch outside the vSphere interface. vSAN approaches do this, as does NetApp HCI, so again, not overly differentiating, and personally I think there's room for both approaches.

That leaves us with the SDS layer, which is both the fundamental enabler of HCI and also its Achilles heel. Most Gen-1 HCI failures and meltdowns are caused by limitations in the software-defined storage layer. That's not to say that at the right scale with the right workload vSAN and Nutanix don't perform adequately, but every HCI benchmark I've ever seen had really lacklustre performance results, and that limits the use cases and the scale of HCI deployments. There isn't the space here to fairly describe the limitations of the SDS layers in VxRAIL and Nutanix, and if you're a big fan of either, you won't appreciate my calling your baby ugly. They both have strengths in particular use cases; as someone said earlier, there are always tradeoffs.
The SDS layer in NetApp HCI comes from the latest version of ElementOS. To get a detailed understanding of its architecture, check this old Tech Field Day presentation from 2014: https://www.youtube.com/watch?v=AeaGCeJfNBg. It's also worth noting that SRM support, an often-cited concern with HCI, was delivered for SolidFire at about the same time, and it doesn't require the absolute latest version of vSphere to work. Since then there have been a number of enhancements, including brilliantly implemented support for vVols (though you can still use old-fashioned datastores if you want, something I don't think VxRAIL will let you do); for more information on that, check out this series of 5 videos starting here: https://youtu.be/4CH3thsRxR8. The other relevant and large difference is support for SnapMirror, which allows integrated, low-impact backup to cloud and integration with the rest of the NetApp Data Fabric portfolio. Going into detail on the superiority of the replication technologies vs either Nutanix or VxRAIL (with or without RecoverPoint) won't fit in this comment thread, but if you're really interested I'll pull up some videos and post them.
The superiority doesn't just come from checkboxes on feature sheets; the devil, as you should know, is in the details. There's a big difference between a product feature and a product feature you actually use, and that's why it's usually better to ask an expert.
ElementOS is the software-defined storage stack. It is packaged in such a way that the CPU and memory impact of inline storage efficiency and other data services doesn't interfere with compute processing, and it reduces VMware / Oracle / SQL Server licensing costs. It supports the fine-grained incremental scale-out of HCI, the simple setup of HCI, the ability for the entire configuration to be managed by the virtualisation admin, the low TCO, the fast ROI, the "4 nodes in a 2U box with integrated storage", and the API-driven automation of HCI. If you want to disqualify it as HCI because the storage software doesn't share CPU cores with the hypervisor (which at scale is more of a benefit than a drawback, thanks to the way hypervisors and many other software products are licensed), then feel free, or call it composable infrastructure if you like. The vast majority of people who buy HCI don't buy it because of CPU core sharing; they buy it because of the incremental purchasing, easy installation, low administration costs, and better TCO and ROI than roll-your-own infrastructure. For them, if it walks like a duck, looks like a duck and quacks like a duck, then it's a duck, and likewise with HCI. They won't care about core sharing any more than they care about whether the term duck should be reserved for the genus Anas within the family Anatinae, or whether the New Zealand blue duck in the genus Hymenolaimus counts as a "real" duck, provided they both taste good with a nice orange glaze.
You can mix and match the storage nodes and compute nodes, so you could have a big compute node combined with one or more small storage nodes. That's part of the rationale behind the architecture, because it makes it easier to get the scaling ratios right.
Separate scaling of CPU and memory in a pooled configuration (similar to storage) would be interesting though, wouldn't it :-)
Disclosure NetApp Employee
There's a lot of stuff in NetApp HCI that goes well beyond what you'd find in Nutanix or VxRail, but one of the most obvious differences is that, unlike either of those solutions, NetApp HCI is designed from the ground up to run at datacenter and service provider scale, rather than being implemented as an edge / point deployment solution (land) with hopes that it can grow without too much pain (expand).
The majority of first-generation HCI solutions end up as point solutions (e.g. VDI) that rarely go beyond 8 nodes. For many, the uncertainty around safely mixing workloads on a single cluster, due to latency and throughput variability in the scale-out storage underpinnings, means that each workload gets its own cluster. This leads either to inefficiencies in management and utilisation, or to long troubleshooting exercises when applications are affected by "noisy neighbours". NetApp HCI is built on SolidFire technology, which was designed for mixed workloads at datacenter / service provider scale. The ability to safely consolidate hundreds of disparate workload types on a single, extremely scalable cluster significantly reduces administration and operational costs. This has been a design / architecture strength of SolidFire from the beginning, and NetApp HCI inherits it. All of this means you're getting enterprise levels of reliability, serviceability, performance and scale in the 1.0 release, and this is just the beginning.
There's a lot more to it than enterprise-class reliability, serviceability, performance and scale, though; there is also the world's best file services infrastructure and integration into a multi-cloud data management platform. So if you're interested in an independent architectural framework that helps you do a fair comparison between NetApp HCI and the rest of the solutions on the market, check out the following link.
Or get in contact with NetApp and ask to get a briefing from one of the Next Generation Datacenter team.
Disclosure NetApp Employee - Opinions are still my own.
Spinnaker - the acquisition resulted in the technology becoming the foundation of what was called Clustered Data ONTAP, now just ONTAP 9. Admittedly it took a while to get there, with some bumps along the road, but it's now 90% of all new ONTAP purchases, with a multi-billion dollar revenue stream. There are very few successful examples of a complete product rearchitecture; ONTAP 9 is one of them.
Decru - Did quite well until the OEM market for discrete inline encryption dried up because it all ended up in switches or devices
Alacritus - originators of the technology behind the NetApp VTL; actually sold quite well, but got killed in preparation for the attempted purchase of Data Domain.
Topio - aka ReplicatorX. Probably an unwise purchase decision: the money involved in maintaining and developing it was a large multiple of its revenue, and it had to compete for development dollars with stuff supporting this newfangled "virtualisation" thing. NetApp decided to double down on turning ONTAP into a first-class platform for VMware, and given its growth rate, that was a sound business decision. The software was ultimately licensed to a third party who's doing quite well with it.
Bycast - Acquisition resulted in significant new development of StorageGrid, now one of the fastest growing products in the portfolio with an install base and growth rate significantly higher than Bycast was doing when they were purchased.
<Name Forgotten> - nice technology, ultimately rolled into OnCommand Performance Advisor, and some of the predictive analytics went into OnCommand Insight (OCI).
Onaro - makers of what is now OnCommand Insight, which recently made it to the #1 position in storage management revenue; going from strength to strength.
SteelStore - now AltaVault; again doing quite well, especially in combination with StorageGrid, and consuming enough petabytes of cloud storage from the likes of Amazon and Azure to make them very happy with NetApp.
SolidFire - hitting the run rate expected after the acquisition, major release delivered on time, and just became the basis of the new NetApp HCI.
So, that makes for about a 66% success rate, with two of the "failures" coming from a decision to refocus investment away from backup and replication towards virtualisation at a time when virtualisation was driving a 40% annual growth rate for storage. By most corporate measures that's a much better than average acquisition record. I could point out a bunch of failed acquisitions made by EMC .. but then again, they're not around any more.
-Disclosure NetApp Employee-
Oddly enough, NetApp hasn't focussed exclusively on NAS for quite some time, and has grown its SAN marketshare nicely on the back of an all-flash growth rate double that of Pure's, off a higher base. It also provides object storage at a cost/GB about half that of S3 (Glacier is a different proposition, but that's more of a tape competitor than a NAS competitor).
Having said that, you're right that cloud is gobbling up a lot of the traditional EMC, HDS, HP, Dell and NetApp "7-mode" business, which is why the SAN/NAS disk array parts of those businesses are flat or shrinking. Offsetting that (at least for NetApp) are the fast-growing bits of the on-premises market (object, all-flash, infrastructure analytics, HCI etc).
Personally I'm skeptical that Pure will hit escape velocity and make it as an independent company, and the list of potential buyers is pretty small these days: Lenovo, SanDisk, and maybe Cisco (Whiptail 2.0). I heard that HP had a good look at them but was turned off by the future Evergreen commitments and decided to buy Nimble instead. I doubt they'll be the next Violin, but I strongly suspect the age of building a major new on-premises infrastructure provider is gone, especially when your competition isn't just incumbents like EMC and NetApp, but also Samsung, Intel, Micron, AWS, Azure, Google, and every "built from the ground up NVMe" startup that will try copying Pure's business model and start poaching their best engineers and sales folks. Building a better mousetrap can only take you so far in a business environment like that.
Time will tell.
-Disclosure NetApp Employee, Opinions are my own, not my employers-
Given this is an article about market growth and profitability rather than technology, I've got more questions than comments, and even though they're coming from a competitor's employee, they are things I'm genuinely curious about.
"Nutanix made a net loss of $112m" for the quarter and had "cash and short-term investments of $350.3m". Assuming their burn rate stays the same or worsens (as the trends on the graph suggest), my assumption is that they'll run out of operating cash within 9 months. I became the director of a business like that once, and my first agenda item at the board meeting was appointing a liquidator, because the business was clearly trading insolvently. How is this different?
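For what it's worth, the back-of-envelope arithmetic behind that 9 months, using the figures quoted above (treating quarterly net loss as a proxy for cash burn, which ignores non-cash items like stock-based compensation, so treat it as a rough bound):

```python
# Runway estimate from the quarterly figures quoted in the article.
cash = 350.3          # $m, cash and short-term investments
quarterly_burn = 112  # $m, net loss per quarter (proxy for cash burn)

runway_quarters = cash / quarterly_burn
runway_months = runway_quarters * 3
print(round(runway_months, 1))  # roughly 9.4 months
```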
I suppose they could do another capital raising, or sell some of their issued shares (I think), but wouldn't that inevitably dilute the shareholding (the reverse of a buyback)? Doesn't the apparent inevitability of that make them a bad investment prospect? And if they can do that, how deep is that bucket before they have to do another capital raising?
I've also heard (and only half understood) that it's OK to keep making losses to gain market share, but at some point you need to hit a revenue run rate that allows you to amortise your fixed expenses across that discounted cash flow (or something like that), at which point the path to profitability becomes obvious; if you don't get there before you burn through your cash, you hit the wall, kind of like running out of runway. I'd heard that Pure's CEO had put that run rate at $1B which, if true, would on their most recent results see Pure running out of money before they become profitable.
Of course I know this comes off sounding like a competitor saying "don't buy their tech because they'll run out of money and DIE!!! Just look at Violin!!!!", but from my perspective it really does look like that. Having said that, based on the recent stock price movements, the market seems to see things differently, so I'm probably wrong.
Those notions of revenue run rate, projected expenses, time to profitability, and so on are probably in their earnings disclosures, but it would be interesting if that stuff was decoded and summarised in articles like this, because it remains a mystery to me how the "never mind the losses, look at the growth!" thing is justified in terms of financial engineering. I know this isn't Seeking Alpha or some other stock journal, but if you're going to cover stuff like this for a tech audience, it would be really cool to see it explained in a way an IT engineer can make sense of.
It's called secondary flash ... but it's not actually flash, is it? It's a hybrid mix of mostly disk with a little bit of flash. I don't have an intrinsic problem with the tech, just the inaccurate marketing name.
I'm also unconvinced about performance on a cache miss during inline dedup, but that's a different issue.
Disclosure netapp employee- opinions are mine not my employer
So "adaptive flash" and "secondary flash" are the "I can't believe it's not butter" of the flash array storage world? Any way you cut it, they're hybrid flash arrays, with all the benefits and drawbacks that go along with that. Inline compression isn't hard to do, but if you're doing inline dedupe and have to verify blocks by reading them from NL-SAS drives as new writes come in, your performance drops off a cliff. Arrays that are truly optimised for secondary storage (Data Domain and AltaVault) can get away with NL-SAS only because they know they'll never see a random read or write workload, so they can ignore that use case entirely. Unless something has changed, CASL isn't like that.
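A rough model of why those verify-reads hurt on a hybrid array (the latency figures are illustrative round numbers, not measured data for any particular product):

```python
# When an inline dedupe engine sees a fingerprint match, it may need to
# read the existing block to verify it really is identical. On a hybrid
# array, a cache miss sends that read to NL-SAS. Illustrative latencies:
flash_read_ms = 0.5   # assumed cache/flash read latency
nl_sas_read_ms = 8.0  # assumed NL-SAS random read latency

def effective_verify_latency(cache_hit_rate):
    """Blended verify-read latency for a given cache hit rate."""
    return (cache_hit_rate * flash_read_ms
            + (1 - cache_hit_rate) * nl_sas_read_ms)

# As the working set outgrows the flash tier, the blended verify
# latency climbs toward the raw disk latency.
for hit in (0.99, 0.90, 0.50):
    print(hit, effective_verify_latency(hit))
```

Even a 10% miss rate pulls the blended latency above a millisecond here, which is why write paths that depend on verify-reads fall off a cliff once the cache stops absorbing them.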
I'm sure the tech is good, and that if you're careful with your workloads it should perform OK, but that's not what people expect out of a flash array, that's what you get from any well engineered and sized hybrid array.
I mostly agree with you, but even with some of the more aggressive adoption assumptions, there's still going to be multi-billion dollar spend on on-premises infrastructure for at least the next five to ten years, though that will be a shrinking market overall. Even then, based on the data I have, it will probably reach a steady state between 2025 and 2030. The reason is that there are economic and technical benefits to keeping kit on-site that have nothing to do with legacy applications; examples include latency in real-time control systems, failure domain resiliency, network and API access costs, exchange rate variations, commercial risks, prudential regulation and a bunch of other things. The question isn't either/or for public vs private, but what the right mix is today and how you change it over time as technology and economics change. There's also a changing definition of "on site": arguably your pocket is a site, and increasingly a lot of processing and supporting software infrastructure will migrate to the edge, leaving the traditional datacenter looking increasingly lonely.
That reminds me of an AWS guy I saw the other day wearing a t-shirt saying something like "Friends don't let friends build datacenters" .. it's kind of hard not to chuckle at the truth of that.
Google, Apple, Facebook, AWS, Microsoft etc all have plenty of custom-made gear, but it doesn't form 100% of their environment. The people who buy and install cloud infrastructure inside the hyperscale datacenters don't disclose what goes into them, and neither do the people who sell to them. In short, anybody who really "knows" isn't giving away the details.
Most IT provisioning practices are pretty wasteful, so that's not a particularly new problem in HCI, and I've seen some horrific utilisation rates from LUN-based provisioning. So while in theory you can be a lot more efficient with separately scalable compute, storage and network, in practice there's still lots of wastage. Also, while we're talking about theory: with a large enough number of smallish workloads and a half-decent rebalancing algorithm, the law of large numbers should fix the HCI efficiency problem over time too.
The main place where HCI justifies its approach clearly isn't in raw efficiency, performance or price numbers. HCI is often more expensive in terms of $/GB and $/CPU, even when factoring in all those "overpriced SANs"; its main saving is in operational simplicity. Some of that comes from having simple building blocks with repeatable, automated ways of scaling and deployment, but IMHO a lot more comes from collapsing the organisational silos between the storage, network, and virtualisation teams. The purchasing and consumption models for HCI are also a big improvement on building everything around the 3-4 year big-bang purchasing cycle that the big building block / scale-up solutions promote.
Having said that, HCI isn't the only way of simplifying purchasing, deployment, provisioning and other infrastructure lifecycle tasks, but it's one of the more elegant implementations I've seen over the last couple of decades, especially for on-premises equipment.
Of course you're right that the simplest way of simplifying operational complexity around infrastructure lifecycle is to use a public cloud offering, but there are valid reasons why people will keep a portion of their infrastructure on site, and ideally that infrastructure will have comparable levels of simplicity, automation and elastic scalability as its public cloud counterparts.
CI and HCI, or their eventual evolution into composable infrastructure, still seem like a good way of achieving that, and they elegantly solve a decent number of IT problems today.
Perhaps it's because Google etc only use hyper-converged configurations for the workloads where it makes sense? The whole "Google only uses hyper-converged because that's hyperscale" argument came out of the following use case:
"Files are divided into fixed-size chunks of 64 megabytes, similar to clusters or sectors in regular file systems, which are only extremely rarely overwritten, or shrunk; files are usually appended to or read."
So .. how many of your workloads look like that ??
Do you think that for other workloads they might have highly engineered, dedicated storage systems connected via high-speed networks for large portions of their infrastructure? If you look at many large-scale supercomputing implementations, you'll see that even though you can implement things like Lustre or GPFS in a hyper-converged configuration, compute and storage are often scaled separately, using node configurations optimised specifically for each purpose. Even large-scale Hadoop-style big-data analytics like EMR and Spark increasingly pull their data from network-attached storage (via S3). The whole "local disk = better" argument only stacks up so far from an economics and management point of view, especially at scale.
Part of the reason is that if you can eke out even just an additional 1% efficiency by using hardware designed and dedicated specifically for storage and data management, then for hyperscalers who spend a few billion on infrastructure every year, that makes a lot of sense. Those optimised systems are called "storage arrays", and the networks (virtual, software-defined or otherwise) which connect them to the rest of the infrastructure are non-blocking, fabric-based "storage area networks". So yeah, storage arrays and SANs are likely to be with us in one form or another for the foreseeable future.
Hyperconverged currently makes a lot of sense, but it depends on the notion of a 2U server with a chunk of CPU and memory and "local" SAS/NVMe/SATA attachments over a PCI bus (keep in mind PCI is now over 20 years old). What happens when we start to see the next generation of computing architecture built on memory interconnects like OmniPath or Gen-Z, or stuff like HP's "The Machine"? Until then, having a general-purpose, inexpensive building block for the majority of your tier-2 and tier-3 storage and compute makes a lot of sense; just don't expect it to do everything for you.
Disclosure NetApp Employee.
What people choose to include in their "flash-optimised array" definition is pretty debatable. The graph shows the data for the new 8200-based A-Series controllers (which were only just released, hence the relatively small numbers) for NetApp, but doesn't show the 8000-series-based AFF controller data. If you're familiar with the underlying tech of both, you'd have to wonder: wait, what? How can you count one but not the other? But industry analysts have their own viewpoints on what qualifies, and by the same reasoning the EF series doesn't count for some because it wasn't "built from the ground up"; likewise you don't see the all-flash VMAX numbers, or the all-flash 3PAR numbers either.
The whole "built from the ground up" argument was always based more on marketing than actual technology .. if you're interested in a humorous (but technically interesting) take on this check out Dave Wright's presentation from tech field day here ... https://www.youtube.com/watch?v=35KNCOYguBU .."Coming Clean: The Lies That Flash Storage Companies Tell"
XtremIO got a good chunk of its marketshare by aggressively disrupting DMC's own VMAX sales, which was both brave, and brilliantly executed. Having said that, if you look at the numbers since the all flash VMAX came out, the XtremIO numbers have dropped correspondingly, so it looks like DMC was quite happy to disrupt its XtremIO sales by rolling VMAX back in. I guess it's a great time to be in the forklift business :-)
- Disclaimer I work for NetApp, Opinions are my own -
I visit China as part of my job, and the speed with which they do things at massive scale is, to put it mildly, rather impressive. More to the point, a lot of the IT services and infrastructure isn't targeted at Europe or the USA. For example, many of the readers here probably haven't heard of, or at least don't appreciate the scale and impact of, applications and companies like WeChat, Ctrip, Aliyun, or UnionPay. All of these are HUGE enterprises (e.g. UnionPay does more payment transactions than Visa), and their targets aren't the overserviced, hypercompetitive markets in the USA and Europe but the developing economies where more than 2/3rds of the world's population live, and where most of the world's new demand is being created. Very few of those new global megacorps are investing heavily in GCAmazurelayerEngine .. a lot of them are using Chinese cloud providers like Aliyun or their own openstack / large scale IT automation infrastructures based on FOSS and internal development. While a lot of that is pretty much invisible to most native English speakers, from my point of view Chinese openstack-enabled corporations have already pre-emptively "busted" the western cloud monopolies in the developing economies.
I thought it was meant to be replicated to another array .. none of the explanations cover why the replica at the DR site wasn't available. Either way, having your backup infrastructure in the same failure domain as your production system is unforgivable; it's a fundamental design principle of data availability.
I've done long documented lists of failure domains and had to analyse them using things like reliability block diagrams and detailed the impact of every possible failure scenario, including RPO and RTO for government departments who were reputedly less paranoid about these things than the ATO, and that was just for an RFI .. not a production environment.
You have to assume that people will do stupid things, operator error including stuff like "oops that wasn't the test instance I just dropped" is the leading cause of data loss and downtime, so if you don't factor that into the design, you're failing in your duty of care.
Rather than blaming "the dumb users", maybe someone should be asking who designed the system in this way, and who wrote the operational procedures, and why weren't they tested.
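For anyone who hasn't done one of those reliability block diagram exercises, the core arithmetic is simple: availabilities of blocks in series multiply, and redundant (parallel) blocks only fail when all of them fail. A minimal sketch, with purely illustrative availability figures of my own:

```python
# Reliability-block-diagram basics:
#  - series: the system is up only if every block in the chain is up
#  - parallel: the system is up if at least one redundant block is up

def series(*avail):
    """Multiply availabilities of blocks connected in series."""
    a = 1.0
    for x in avail:
        a *= x
    return a

def parallel(*avail):
    """1 minus the probability that every redundant block is down."""
    down = 1.0
    for x in avail:
        down *= (1.0 - x)
    return 1.0 - down

# Illustrative numbers only: prod array replicated to a DR array in a
# *separate* failure domain, behind a shared network layer.
storage = parallel(0.999, 0.999)   # two arrays, different sites
system = series(0.9995, storage)   # network in series with the storage pair
print(round(storage, 6), round(system, 6))  # → 0.999999 0.999499
```

The point the exercise hammers home: put both arrays in the same failure domain and the `parallel()` term collapses back toward a single 0.999, which is exactly the mistake being discussed above.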
Disclosure NetApp employee - opinions and recollections are a product of my own grey matter and don't represent any official corporate position of my employer.
"Everyone now says they do 'analytics' but this is where Nimble has years of experience more than anyone else" - No .. not correct
NetApp was doing infrastructure analytics from its AutoSupport database since before Nimble existed, and used that as a competitive selling proposition. To be fair, most of that analytics was inwardly focused on product operations, for use in designing new products and identifying rare product faults across a large install base, rather than outwardly focused on customers with something integrated into their administrative workflows; Nimble certainly got that bit right. Indeed, if I recall correctly, Nimble did some very targeted talent acquisition out of the NetApp performance and analytics teams, so it's not surprising they were able to do something cool with a clean sheet of paper to work with.
While I kind of sort of agree with you, I've been told by people I trust that the economics of uranium based nuke reactors aren't that good when you factor in the whole of the lifecycle costs and the security infrastructure you need to put around enriched U235. I'm not a nuclear physicist, nor do I play one on TV, but I'm reasonably convinced that small scale fusion reactors will be viable by the time a uranium based power industry could be profitable in South Australia. Optimistically, a 100MW fusion reactor the size of a jet engine and weighing a few hundred tons, fuelled on hydrogen isotopes, should be ready for prime time before 2030.
Of course producing the tritium locally would probably require a source of lithium and a source of high energy neutrons, and THAT would make sense building near Olympic Dam .. once we get that, then you have a nice clean source of base load power that can be distributed in such a way that you could build a properly resilient transmission infrastructure, AND keep it fuelled using reasonably familiar materials handling techniques.
I agree with you though, the problem is that people are thinking about the solutions far too narrowly, and IMHO the problem isn't generation, it's transmission, and the dubious decisions about selling natural monopolies to profit-making corporations who are subject to the short term myopic quarterly reporting regime of the stock market.
"As we all know they are largely fading away mostly not refreshed the second time."
(Disclosure NetApp employee, opinions are my own etc etc)
Sorry, wrong on both counts .. from where I sit, FlexPod sales to new accounts are looking really healthy, and they don't use a rip and replace upgrade strategy in any case, so there's a lot of evolutionary growth throughout the stack over time in the existing customers. That's one of the advantages of independent scale at each layer in the stack, as compute, network and storage don't always need to be refreshed simultaneously (usually they don't; network tends to stick around the longest and compute cycles through the fastest).
I think vBlock used to be rip and replace at the end of its maintenance period, but they were built to be a large scaling unit, good in their own way, and the last time I looked they seemed to have moved towards a more FlexPod-like way of doing their lifecycles. In any case DMC seems to be more focussed on VXsomething, so maybe the vBlocks are fading at the hands of the incongruity of Dell selling Cisco servers.
"Data ONTAP is a heavily customised BSD ..."
Data ONTAP uses BSD in more or less the same way ESXi uses Linux to boot, though nobody (except maybe the Software Freedom Conservancy) would suggest that ESXi is a heavily customised Linux. Unlike ESXi, some of the BSD pieces are leveraged by ONTAP, but things like network protocol stacks and CPU schedulers are currently managed by ONTAP directly. The actual BSD bits which are part of Data ONTAP are pretty standard; at one point the lack of support for BSD in Azure was something of an impediment to getting Data ONTAP to run in there, but that time is well past.
"And NetApp has ONTAP Edge, the virtual version of ONTAP."
ONTAP Edge is an old name from when the virtual/software-delivered versions of ONTAP were primarily engineered for edge deployments. A lot of work has been done since then, including multi-node HA and better performance, so the Edge name has been retired. ONTAP now comes as ONTAP Cloud for the versions which can be purchased via the Azure and AWS cloud portals on an hourly basis, and ONTAP Select for the versions which can be purchased from any of the traditional NetApp sales channels and deployed on commercial off the shelf (COTS) hardware configurations from a variety of hardware vendors.
Biting the hand that feeds IT © 1998–2020