... SGI Numalink interconnect, which is used to construct really, really large proper NUMA systems ...
Numalink is indeed a very impressive technology. Unfortunately, its global memory coherence magic comes at a substantial cost: you need to maintain memory directories and to devote a potentially significant fraction of your fabric to the cache-coherence traffic.
In the end, the speed of light still gets you: once the ratio of remote to local memory latency is large enough, you frequently have to rewrite your code in terms of message passing anyway, and for that, the benefits of globally coherent memory over RDMA are marginal, and for large configurations possibly negative.
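To put rough numbers on the speed-of-light point, here is a back-of-envelope sketch. Every constant apart from the speed of light is an assumption picked for illustration (the cable velocity factor, the 100 ns local DRAM latency, the cable lengths); none of it is a measurement of any SGI system, and it ignores switch hops and directory lookups entirely, which only make things worse.

```python
# Back-of-envelope: how much latency does cable length alone add to a remote access?
C_VACUUM_M_PER_NS = 0.299792458  # speed of light in vacuum, metres per nanosecond
VELOCITY_FACTOR = 0.7            # ASSUMPTION: signal speed in the cable relative to c
LOCAL_DRAM_NS = 100.0            # ASSUMPTION: latency of a local memory access


def round_trip_ns(cable_m: float) -> float:
    """Round-trip propagation delay over a cable, ignoring switches and directories."""
    return 2.0 * cable_m / (C_VACUUM_M_PER_NS * VELOCITY_FACTOR)


def average_latency_ns(remote_fraction: float, remote_ns: float) -> float:
    """Mean access latency when a given fraction of accesses go to remote memory."""
    return (1.0 - remote_fraction) * LOCAL_DRAM_NS + remote_fraction * remote_ns


if __name__ == "__main__":
    for cable_m in (1.0, 5.0, 20.0):
        # Best case: remote access costs local latency plus pure propagation delay.
        remote_ns = LOCAL_DRAM_NS + round_trip_ns(cable_m)
        ratio = remote_ns / LOCAL_DRAM_NS
        avg = average_latency_ns(0.5, remote_ns)
        print(f"{cable_m:5.1f} m: remote >= {remote_ns:6.1f} ns "
              f"(ratio {ratio:.2f}), avg at 50% remote = {avg:6.1f} ns")
```

Under these assumptions, roughly ten metres of cable adds about as much round-trip delay as a whole local DRAM access, before any switch or coherence protocol gets involved. That is the floor no interconnect can get under, and it grows with the physical size of the machine.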
At least for some large O2k and early UV systems I've seen [SGI UV moved outside of my price range some time ago], it was not unusual to fire up the whole system once for the benchmark/bragging rights, and then separate it into a cluster of smaller systems (each with better worst-case latency and bisection bandwidth) for production.
There are some very real physical reasons why we have developed storage-level hierarchies; a uniformly-addressable flat address space is not coming back, as appealing as it may sound theoretically.