
Ah-ha!
Interesting to see how rack-scale integration is expected to improve performance beyond the raw uplift from the individual compute components. I'd guess it comes from better sharing of distributed memory, enabled by faster, lower-hop scale-up interconnects (less need to keep shuffling data close to each individual GPU?).
Makes me wonder if classical FP64 HPC could see similar improvements as AI from this: say a 1.6x speedup in dense-matrix HPL (mirroring training, 4x rack-level over 2.5x chip-level) and a 12x speedup in sparse-matrix HPCG (mirroring inference, 30x over 2.5x); quick arithmetic below. Currently the CPU-only Fugaku is neck-and-neck with the bigger CPU+GPU El Capitan on HPCG, so a 12x uplift for GPUs could be a huge deal (if it pans out) imho!
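Back-of-the-envelope sketch of where those numbers come from (just dividing out the chip-level uplift from the claimed rack-level speedups; the 2.5x/4x/30x figures are the ones quoted above, and mapping them onto HPL/HPCG is pure speculation on my part):

```python
# Residual "integration" factor: claimed rack-level speedup divided by
# the per-GPU generational (chip-level) gain. All figures as quoted above.
chip_uplift = 2.5       # chip-level generational uplift

rack_training = 4.0     # claimed rack-scale training speedup
rack_inference = 30.0   # claimed rack-scale inference speedup

# What's left after the raw silicon improvement is factored out
integration_dense = rack_training / chip_uplift     # 1.6x -> guessed HPL analogue
integration_sparse = rack_inference / chip_uplift   # 12x  -> guessed HPCG analogue

print(f"dense (HPL-like):   {integration_dense:.1f}x")
print(f"sparse (HPCG-like): {integration_sparse:.1f}x")
```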