Re: Scale up vs. scale out
I'll grant that you have many good points. I work with quite a few different workloads. Agreed that NVMe is simply a method of remote procedure calling over the PCIe bus, as well as a great method of command queuing to solid state controllers. It is designed to be optimal for single-device access, and its queue structure has severe limitations for use in a RAID-like environment. In fact, like SCSI, it has mutated from a single-device interconnection protocol into something it really sucks at. If virtualized devices are created in ASIC, there are extreme issues regarding upgradability. If they are implemented in FPGA, there are major issues with performance, as even extremely large FPGAs have finite bandwidth resources. In addition, even using the latest processes for die fabrication, power consumption and heat issues are considerable. A hybrid design combining a high performance/low power crossbar with FPGA for implementing localized storage logic could be an option, though even with the best PCIe ASICs currently available, there will be severe bandwidth bottlenecks once expandability is considered. PCIe simply does not scale well in these conditions. Ask HPC veterans why technologies like InfiniBand still do well in high density environments for RDMA when PCIe interconnects have been around for years. SGI and Cray have consistently been strong performers by licensing technologies like QPI and custom designing better interconnects, because PCIe simply isn't good enough for scalability.
So NVMe is great for some things. For centralized storage... nope.
As for storage clustering, I'm not aware of any vendors that currently cluster past 8 controllers. That's a major problem. Let's assume that somehow a vendor has implemented all their storage and communication logic in FPGA or, dreadfully, within ASIC. They could in theory build a semi-efficient crossbar fabric to support a few dozen hosts with decent performance. It is more likely that they have implemented their ... shall we say business logic in software, which means that even if they had the biggest, baddest CPUs from Intel, their overall bandwidth at scale will be dismal. There are only so many bits per second you can transfer over a PCIe connection, and there are only so many PCIe lanes in a CPU/chipset. Because of this limitation, high performance centralized storage with only 8 nodes will never be a reality. Consider as well that, due to fabric constraints in PCIe, there will be considerable limits on the number of hosts unless something like NAT is implemented. This can be alleviated a bit by running PCIe lanes independently and doing most of the switching in software, which mostly eliminates the benefits of such a bus.
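To put rough numbers on the lane argument, here is a back-of-envelope sketch. The per-lane rate is the PCIe 3.0 figure (8 GT/s with 128b/130b encoding). The lane count, socket count and the function names are my own assumptions for illustration, not any particular vendor's design.

```python
# Back-of-envelope ceiling for a PCIe-attached storage cluster.
# PCIe 3.0 signals at 8 GT/s per lane with 128b/130b line encoding.
GT_PER_S = 8e9
ENCODING = 128 / 130


def lane_bytes_per_s():
    """Bytes/s per lane in one direction, before packet overhead."""
    return GT_PER_S * ENCODING / 8


def cluster_ceiling(lanes_per_cpu, sockets, controllers):
    """Sum of all lanes in all controllers, one direction, in bytes/s."""
    return lane_bytes_per_s() * lanes_per_cpu * sockets * controllers


# Assumed for illustration: 40 lanes per socket, 2 sockets, 8 controllers.
total = cluster_ceiling(lanes_per_cpu=40, sockets=2, controllers=8)
print(f"per lane: {lane_bytes_per_s() / 1e9:.3f} GB/s")
print(f"cluster ceiling: {total / 1e9:.0f} GB/s")
```

That figure is the sum of every lane in the cluster and can never be reached in practice. The same lanes have to face both the hosts and the drives, so the data path gets at most half of it before any software overhead is counted.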
Centralized storage has some benefits, such as easier maintenance, but to be fair, if this is an issue, you have much bigger problems. A scale-out file server environment configured with tiers makes use of centralized clusters of servers for DR, backup, snapshots, etc. You may choose to use a SAN for this, but that just strikes me as inefficient and very hard to manage. When local storage is configured properly, there is never a single copy of data, and it is accessible from all nodes at all times, with background sharding that copes well with scaling and outages. If there is an SSD failure, that means the blade failed and should be offlined for maintenance. This is no different than a failed DIMM, CPU or NIC. These aren't spinning disks; we generally know when something is going to die.
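For anyone who wants to see what "never a single copy" looks like in practice, here is a minimal sketch of replica placement. It uses rendezvous (highest-random-weight) hashing, which is my choice for illustration and not necessarily what any given product does. The node names, shard names and the two-copy default are all assumptions.

```python
import hashlib


def score(node, shard):
    """Deterministic pseudo-random weight for a (node, shard) pair."""
    digest = hashlib.sha256(f"{node}:{shard}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(shard, nodes, copies=2):
    """Pick the `copies` highest-scoring nodes to hold the shard."""
    if copies > len(nodes):
        raise ValueError("not enough nodes for the requested copies")
    ranked = sorted(nodes, key=lambda n: score(n, shard), reverse=True)
    return ranked[:copies]


# Hypothetical six-blade cluster; one blade is taken offline.
nodes = [f"node{i}" for i in range(6)]
before = place("shard-17", nodes)
survivors = [n for n in nodes if n != before[0]]
after = place("shard-17", survivors)
print("before:", before)
print("after: ", after)
```

When a blade goes offline, the surviving copy is promoted and one new copy is made in the background. Shards that had no copy on the failed blade do not move at all, which is why this kind of layout copes with outages and with adding nodes.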
You're absolutely right about blades and PCIe lanes. Currently, so far as I know, no vendor is shipping blades like this which is why I have been forced to use rack servers. Thankfully, my current project is small and shouldn't need more than 100 per data center.
I am actually doing a lot of VDI right now. But that's just 35% of the project. The rest is big storage with a database containing a little over 12 billion high resolution images with about 50,000 queries an hour requiring indexing of unstructured data (image recognition) with the intent of scaling to 200,000 queries an hour. I am designing the algorithms for the queries from request through processing with every single bit over every single wire in mind.
I have worked with things as simple as server virtualization in the past on small and gigantic scale. With almost no exception, I have never achieved better ROI with centralized storage than with localized, tiered and sharded storage.
The only thing that centralized storage ever really accomplished is simplicity. It makes it easier for people to just plug it in and play. This is of great value to many. But I see centralized NVMe being an even bigger disaster than centralized SCSI over time.