Re: Multicore Performance Improvement for the PC?
Whether the OS can utilize all these threads depends on the workload. We distinguish between two different kinds of scaling: scale-up and scale-out.
-Scale-out workloads run entirely in parallel (embarrassingly parallel workloads); there is not much communication going on between the threads. This is HPC cluster number-crunching territory. Typically they run a tight for-loop over the same grid of points, solving the same PDE over and over again, integrating forward in time. Everything fits in the cache. All these servers are clusters, such as the SGI UV3000, supercomputers, etc. These clusters have 10,000s of cores, as they are essentially a bunch of PCs sitting on a fast switch. They are cheap: if you buy a large cluster, you just pay the price of an individual PC x the number of nodes.
Because the whole workload fits into cache, you never go out to RAM. CPU cache is around 10 ns away, RAM around 100 ns. Typically one scientist starts up a huge HPC job which takes several days to complete, so it is one user at a time. A minimal sketch of such a kernel follows below.
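Here is a rough sketch of what such a kernel looks like, as plain C with OpenMP (this is my own toy example, not taken from any real cluster code; the grid size, step count, and boundary condition are all made up for illustration):

```c
/* Toy embarrassingly parallel HPC kernel: Jacobi sweeps of the 2D
 * heat equation on a grid. Build with: gcc -O2 -fopenmp jacobi.c */
#include <stdio.h>

#define N     512     /* small enough that both grids stay cache-resident */
#define STEPS 1000

static double a[N][N], b[N][N];

int main(void)
{
    /* initialise: hot edge on one side of the grid */
    for (int i = 0; i < N; i++)
        a[i][0] = 100.0;

    for (int t = 0; t < STEPS; t++) {
        /* the tight loop: each thread updates its own rows,
         * no locks, no messages -- this is why it scales out */
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j]
                                + a[i][j-1] + a[i][j+1]);
        /* swap grids by copying (kept simple for the sketch) */
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                a[i][j] = b[i][j];
    }
    printf("centre temperature: %f\n", a[N/2][N/2]);
    return 0;
}
```

The point is the loop body: each grid point reads only its four neighbours, so the threads never take locks or exchange messages, and a small grid like this stays resident in cache.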
-Scale-up workloads have lots of communication going on. They typically run business ERP workloads, such as SAP, databases, etc. These workloads always serve many users at the same time, thousands of users or more. One user might do accounting, another payroll, etc. This means the data of all these thousands of separate users cannot fit into a CPU cache, so business workloads always go out to RAM. That means 100 ns latency or so.
Say the CPU runs at 2 GHz, i.e. 0.5 ns per cycle. If you always go out to RAM, the 100 ns latency means roughly 200 cycles per access, so the 2 GHz CPU effectively slows down to about 10 MHz (one access completed per 100 ns). I don't know if you remember 10 MHz CPUs, but they are quite slow. So business workloads, communicating a lot and waiting for other threads to sync while serving thousands of users, have large problems with scaling up.

Business servers max out at 16 or 32 CPU sockets. Every CPU needs a connection to every other CPU for fast access, and with 16 or 32 CPUs there will be a lot of connections. Say you have 32 sockets; then you need (32 choose 2) connections. That is 32*31/2 = 496 connections, i.e. quadratic growth. That is very messy. Going above 32 sockets is not doable if you require that every CPU connects directly to every other one (which you do, for fast access). Look at all the connections for this 32-socket SPARC server:
https://regmedia.co.uk/2013/08/28/oracle_sparc_m6_bixby_interconnect.jpg
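To make the arithmetic above concrete, here is a small C program (my own back-of-envelope illustration, using the rough latency figures from this post) that prints the effective instruction rate when every access stalls on cache or RAM, and the number of direct links n sockets need to be fully connected:

```c
#include <stdio.h>

int main(void)
{
    double cache_ns = 10.0;   /* ~10 ns to reach CPU cache */
    double ram_ns   = 100.0;  /* ~100 ns to reach RAM      */

    printf("nominal clock:               2000 MHz\n");
    /* one completed access per latency period: MHz = 1000 / ns */
    printf("effective rate, cache-bound: %.0f MHz\n", 1e3 / cache_ns);
    printf("effective rate, RAM-bound:   %.0f MHz\n", 1e3 / ram_ns);

    /* direct links needed so every socket reaches every other:
     * n choose 2 = n*(n-1)/2, i.e. quadratic growth */
    for (int n = 2; n <= 64; n *= 2)
        printf("%2d sockets -> %4d direct links\n", n, n * (n - 1) / 2);
    return 0;
}
```

It prints 10 MHz for the RAM-bound case and 496 links for 32 sockets, which is exactly why the picture above looks the way it does.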
So large business servers max out at 16 or 32 sockets. Clusters cannot run business workloads. The reason is that clusters have far too few connections. Clusters typically have 100s or 1000s of CPUs, and you cannot have a direct CPU-to-CPU connection between that many CPUs. So you cheat: one CPU connects to a group of other CPUs. Accessing another CPU then takes a long time, because you need to locate the correct group, then go through one CPU, and another, etc., until you reach the correct one. There are many hops. And if you try to run business workloads on a cluster, performance will drop far below that 10 MHz figure, maybe down to 2 MHz. And that is not doable.
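And for the hop penalty on a cluster, the same kind of back-of-envelope sketch applies (the 500 ns per-hop figure is purely my assumption here, just to show the trend):

```c
#include <stdio.h>

int main(void)
{
    double hop_ns = 500.0;  /* assumed per-hop network latency */

    /* every extra hop adds latency, so the effective rate of a
     * workload that stalls on each remote access keeps dropping */
    for (int hops = 1; hops <= 5; hops++) {
        double total_ns = hops * hop_ns;
        printf("%d hops: %4.0f ns -> effective %.1f MHz\n",
               hops, total_ns, 1e3 / total_ns);
    }
    return 0;
}
```

Already at one hop you are down around 2 MHz, and every additional hop makes it worse.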
So, clusters are scale-out servers, typically having 10,000s of cores and 128 TB RAM or so. They are exclusively used for HPC workloads. Supercomputers belong to this arena. They typically run Linux.
Scale-up business servers typically have 16 sockets or so. This arena belongs to RISC (SPARC / POWER / mainframe) running Solaris, AIX, or IBM z/OS. There is no Linux and no x86 here. The reason is that Linux does not scale well, and x86 does not scale well either; the largest x86 business server was, until recently, 8 sockets. Look at the business benchmarks, such as the official SAP benchmark: all the top SAP spots belong to SPARC, and x86 comes far, far below. Business workloads scale badly, so you need extraordinary servers to handle them, such as old and mature RISC servers. RISC has scaled to 32 sockets for decades; x86 has not. The largest scale-up business server on the market is the Fujitsu M10-4S, a 64-socket Solaris SPARC server.
Linux does not scale well on business workloads, because until recently there were no large x86 business servers beyond 8 sockets. How could Linux scale well when large x86 business servers did not exist?
The business arena belongs to RISC and Unix. One IBM P595 POWER6 server cost $35 million. Yes, one single server. Business servers are very lucrative and very expensive. Scalability is very, very difficult, and you have to pay a hefty premium for it. Business servers do not cost 1 PC x 32 nodes. No, the cost ramps up quadratically, because as shown above the connection count grows quadratically, and it becomes quadratically harder to scale.