Re: Many cores on power-limited package = poor single-thread performance?
These very high core count CPUs have become possible largely because the silicon processes used to manufacture them lay down very power-efficient transistors. The result is a lot of cores that can all run at once somewhere near (or at) full bore while producing only around 200 watts of heat.
It's also allowed things like memory controllers and cache to be integrated on the same die(s) to help keep the cores fed.
Both Intel and AMD have been pretty successful at judging a good balance between thermals, core count, cache size, memory bandwidth, etc. for the "average" compute workload, with AMD benefitting significantly in this quest for balance thanks to TSMC's very good silicon process.
It's a good question: isn't this what GPUs are for? Well, there is the already given answer that GPUs are good for vector processing (and correspondingly less well suited to general purpose compute). But CPU cores these days are also pretty well equipped with their own vector (SIMD) units, thanks to extensions like AVX-512. It's not clear-cut that GPUs always win on vector processing.
CPUs are very well suited to stream processing. GPUs typically have to be loaded with data transferred from CPU RAM via PCIe; the GPU then does its number crunching (in the blink of an eye), and then the result has to be DMA'ed back to CPU RAM in order for the application to deal with it. That load/unload time is quite a penalty. By contrast, one can DMA data into a CPU's RAM whilst the CPU is busily processing data elsewhere in its RAM. Provided the overall memory pressure fits within the RAM's bandwidth, the CPU can be kept busy all the time. This quite often means that the GPU isn't the "fastest" way of processing data.
A good example is the world's supercomputers: machines such as Fugaku and its predecessor, the K computer (both purely CPU based), often achieve sustained compute performance close to their benchmark scores; they cope well with data streams. The GPU-based supers are also good, but only for problems where you can load data and then do an awful lot of sums on it before moving on through the input data set.
This is why NVIDIA have NVLink: it joins networks of GPUs together without total reliance on CPU hosts to do the data transfers for them.