Re: 256 socket Xeon
"....Yes you're talking about it, but no I'm afraid you don't know the difference between SMP and NUMA. Lets drill a bit deeper into your example, the M9000....Actually, we can start with the diagram on page 22 of the M5000, and the following sentence that says: "SPARC Enterprise M8000 and M9000 servers feature multiple system boards that connect to a common crossbar."
If you have a design where sockets on a system board only have access to limited local memory, and must traverse an interconnect, like a crossbar, to access memory on another system board, then that is a NUMA, or NUMA derived design. It's most certainly not SMP. An SMP design is where all CPUs have equal access to all memory. The problem with that is it doesn't scale well, hence the reason why NUMA was invented...."
Yes, I do know all this. I was the one talking about NUMA and SMP, wasn't I? It seems you claim that no 32-CPU SMP servers exist. If that is true, then maybe you accept that no 32-CPU Linux SMP servers exist either. So again I am correct: there are no 32-CPU Linux SMP servers.
The M9000 is not a true SMP, I know. But Sun worked hard to make it act like SMP. This shows in the memory latency: it is not great on the M9000, but it is not catastrophically bad either. The latency range is tight, with a small spread between best case and worst case. A true SMP server would have no spread at all; there would be no best-case or worst-case latency. So, in effect, the M9000 behaves like an SMP server.
Now look at a true NUMA system, such as the 8192-core Linux ScaleMP server with 64TB RAM. This server is a cluster running a single image of Linux, and like all clusters it has a very wide spread between best-case and worst-case latency:
"...I tried running a nicely parallel shared memory workload (75% efficiency on 24 cores in a 4 socket opteron box) on a 64 core ScaleMP box with 8 2-socket boards linked by infiniband. Result: horrible. It might look like a shared memory, but access to off-board bits has huge latency..."
So it does not really matter whether a server is a mix of NUMA and SMP, as long as the latency is good (because the server is well designed). If a NUMA server had extremely good latency, it would for all intents and purposes act as an SMP server, and could be used for SMP workloads.
-The Sun M9000 has a worst-case latency of 500ns. And best case... maybe(?) 200ns or so. The M9000 did 2-3 hops in the worst case, which is not that bad; you don't have to treat it as a problem when programming. In effect, it behaves as an SMP server.
-A typical Linux NUMA cluster has a worst case of something like 10,000ns or even worse. The worst-case numbers were really hilarious and made you jump in your chair (was it even 70,000ns? I don't remember, but it was really bad, and those worst-case numbers are representative of a typical cluster). In effect, you cannot program a NUMA cluster as if it were SMP; you need to program differently. If you assume the data will be accessed quickly and the data is far away in a Linux cluster, your program will grind to a halt. You need to allocate data on nearby nodes, just as in cluster programming. And if you look at the use cases and all the benchmarks on Linux NUMA servers, they are all clustered HPC workloads. None of them is used for SMP work.
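To make the comparison concrete, here is a back-of-the-envelope sketch in Python. It is only a toy model: it assumes every access is either best case or worst case, it uses the rough figures from this post (500ns and maybe 200ns for the M9000, 10,000ns worst case for a cluster), and the cluster's best case of 200ns is purely my assumption, since I have no number for it. The function names are made up for illustration.

```python
# Rough figures from this post, in nanoseconds. None of these are measurements.
M9000_BEST_NS = 200        # "maybe(?) 200ns or so"
M9000_WORST_NS = 500
CLUSTER_BEST_NS = 200      # ASSUMPTION: local access is as fast as on the M9000
CLUSTER_WORST_NS = 10_000  # "something like 10,000ns or even worse"


def mean_latency_ns(best_ns, worst_ns, remote_fraction):
    """Average latency when remote_fraction of accesses pay the worst case."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * best_ns + remote_fraction * worst_ns


def slowdown(best_ns, worst_ns, remote_fraction):
    """Average latency relative to the all-local (best-case) latency."""
    return mean_latency_ns(best_ns, worst_ns, remote_fraction) / best_ns


if __name__ == "__main__":
    for frac in (0.0, 0.1, 0.5, 1.0):
        m9000 = slowdown(M9000_BEST_NS, M9000_WORST_NS, frac)
        cluster = slowdown(CLUSTER_BEST_NS, CLUSTER_WORST_NS, frac)
        print(f"remote={frac:.0%}  M9000 x{m9000:.2f}  cluster x{cluster:.2f}")
```

With half of the accesses going remote, this model gives a 1.75x slowdown for the M9000 figures and 25.5x for the cluster figures. That is the whole point: on one machine you can ignore data placement, on the other you cannot.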
This Oracle M6 server is a set of SMP islands connected by a NUMA interconnect. I am convinced Oracle is building on the decades of experience of the Sun server people, so the M6 server will have a very small difference between best-case and worst-case latency. It will act like an SMP server, because databases are typical SMP workloads and Oracle cares strongly about database servers. The Oracle M6 server will be heavily optimized to make sure you don't have to make more than 2-3 hops to access any memory cell in the entire 96TB RAM server, so it will work fine as an SMP server for databases and other SMP workloads.
I suggest you study the RAM latency numbers for the M9000 and for all the Linux NUMA clusters. The differences are huge: 500ns worst case vs. 10,000ns, or was it 20,000ns? One can be programmed like an SMP server; the other needs to be programmed as a cluster.
So, you are wrong again.