A few points
1. https://www.systor.org/2017/slides/NVMe-over-Fabrics_Performance_Characterization.pdf is probably a better resource if you're really interested in the speeds and feeds of local, vs iSCSI vs RDMA
2. It's not just RDMA that makes things fast with NVMe-F, it's the "Zero copy" aspect of RDMA that gives most of the performance benefits .. NVMe-FC used zero copy techniques, but not RDMA, some micro-benchmarks show there's a tiny (a few microseconds) of difference between the two approaches
3. NVME-F consumes MUCH less CPU than any SCSI based storage protocol (FCP, iSCSI or even iSER which is also RDMA based), and other efficiencies in the software stack shave off at least 20 microseconds of latency when comparing SAS vs NVMe on a local system. That protocol efficiency to make accessing flash via NMVe over fabrics faster than local SAS (the network overhead of local NVMe vs fabric NVMe is much less than 20 microseconds), and based on the benchmarks done by Samsung, you're looking at about a 10% difference in latency for local vs remote NVMe
3. From my reading of the E8 architecture, it does a lot of caching at the host layer in the E8 Agents, the actual array itself isn't that special (about the same as a NetApp EF570 / EF580). If I've read the marketing material correctly, by absorbing a lot of the read I/O at the host layer you're not really seeing a much benefit vs DAS from NMVe-F as the article infers, which probably explains why the results dont show the same 10% difference in local vs remote performance seen in the testing done by Samsung, though a bunch of them were probably throughput tests rather than random I/O tests, and in throughput there's pretty much zero difference until you saturate the network
4. You really have to look at the end to end architecture, HDFS for example does a horrendous job of aggregating the performance of multiple devices on the same host, and distributed shared nothing infrastructures simply dont get down to anywhere near the same level of performance as a highly engineered HA pair, especially once the write workload becomes non-trivial .. that affects pretty much every hyperconverged solution out there, and adding in NVMe over fabrics isn't going to change that by much because the bottlenecks are in the other parts of the stack.
6. Attaching an RDMA capable external block level device to a DGX-1, you're going to have to use something that can attach via infiniband (like say an EF580), and as I dont think you can load external software like the E8 agent onto a DGX-1 you're going to be limited to the performance of the actual array. If you want ethernet, then low latency scale out NFS is still pretty much your only option, and theres a surprising amount of ML training data that turns out to be remarkably compressible which makes the AF800 (which supports end to end NVMe today) the biggest fastest storage you can easily attach to an DGX-1 today (e.g. three hundred Gigabytes / second of throughput is quite achievable in a single cluster)