@bazza
"The ARM core, even today, is still about 32,000 transistors. "
That's no FPU and no SIMD instructions then.
"So if you're selling a customer Nmm^2 of silicon (and this is what drives the cost and power consumption) you're going to be giving them more ARM cores than x86 cores."
No-one sells square millimetres of silicon. They sell CPUs and these days they sell CPUs with multiple cores, but not too many because you simply can't get the data on and off fast enough to make it worthwhile. Look at Larrabee or Cell. These remain niche products because the bottleneck hasn't been CPU speed or size for some time.
"Then you add caches and other stuff."
Indeed. A modern desktop computer is a cache with an ocean of slow memory on one side and an excess of processing power on the other. Your 32,000-transistor core is going to be clocking at a few tens of megahertz (DRAM speeds) unless you spend about a million transistors on L1 and L2 caches.
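Back-of-envelope, if anyone wants to check that figure; just a sketch, with an illustrative 16 KB L1 and the classic 6-transistor SRAM cell:

#include <stdio.h>

int main(void)
{
    /* Data array of a small 16 KB L1, costed at 6 transistors per SRAM bit.
       Tags, decoders, control logic and any L2 at all only push this higher. */
    const long cache_bytes = 16 * 1024;
    const long transistors = cache_bytes * 8 * 6;

    printf("Data array alone: about %ld transistors\n", transistors); /* ~786,000 */
    return 0;
}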
"On x86 there is a translation unit from X86 to whatever internal RISCesque opcodes a modern x86 actually executes internally."
Actually this is an urban myth. There *are* a few x86 instructions that bail to microcode, but apparently CISC-y things like "add eax,[ecx+edx*8]" are implemented fully in the processor pipeline. The address generation stage has its own ALU and the argument fetch stage can talk to the L1 cache. In effect, x86 *is* the internal RISCesque opcode set.
"ARMs don't need that."
But if they are to get close to x86 performance, they'll need out-of-order execution, which will blow your 32,000-transistor budget all the way to Pluto. This is particularly true because the ARM would require multiple instructions to accomplish the "add" instruction mentioned earlier (sketched below). That's multiple live (architected) registers and multiple trips down the pipeline. If those aren't allowed to run OoO, you'll need to clock at some multiple of the Intel chip to keep step, and power consumption climbs far faster than linearly with clock speed: dynamic power scales with voltage squared times frequency, and the voltage has to rise with the clock.
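To make the comparison concrete, here's a sketch in C; the function is mine and the exact instructions depend on compiler and ABI, so treat the register choices as illustrative:

long accumulate(long acc, const long *table, long index)
{
    /* x86-64: compilers typically fold the scaled-index load into the add,
       emitting roughly "add rax, [rsi+rdx*8]" after moving acc into rax.
       The address comes out of the AGU and the operand out of the L1 cache,
       all inside the ordinary pipeline, with no microcode assist. */

    /* 32-bit ARM: the scaled index can ride along on the load, but the load
       and the add stay separate:
           LDR r3, [r1, r2, LSL #3]   @ r3 = table[index]
           ADD r0, r0, r3             @ acc += r3
       Two instructions and an extra live register per step, which is the
       pipeline traffic and register pressure described above. */

    return acc + table[index];
}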
"X86s can do almost anything, but most people just want to watch some video, play some music, do a bit of web browsing and messaging. Put a low gate count core alongside some well chosen hardware accelerators and you can get a part that much more efficiently delivers what actually customers want."
Which is great until the world starts using different codecs, which it does every few years. Then you start wondering if it wouldn't have been smarter to spend the same transistor budget on making your general-purpose CPU a little faster. Or smarter still to skip the R&D on those units altogether (costs you'd otherwise have to claw back by selling the final product at a premium) and buy an off-the-shelf solution from Intel.
"No one can argue that x86 instruction set and all the baggage that comes with it is more efficient than ARM given the overwhelming opinion of almost every phone manufacturer out there."
Phones are a very specialised segment. You can get away with a fixed number of codecs, hard-wired, and there's very little other processing to do, so a feeble ARM core is a good design choice. A feeble x86 core would be good too, but Intel simply don't offer one and so we arrive at the present market segmentation for largely historical reasons.
OTOH, for a desktop core, instruction decode is a few percent of chip area these days, so in *that* market, what you describe as "baggage" is actually lost in the noise.
Within living memory, Intel have tried to replace x86 with something they designed to be intrinsically better. It didn't make enough of a difference to be measurable. They've also made ARM chips, so if there was anything intrinsically better in *that* ISA, they'd presumably know about it. The evidence suggests that x86 just isn't bad enough to measure, let alone matter, except at the absurdly low end of the market; and with "devices" getting more and more powerful each year, that's an end of the market that is disappearing.
In fact, you could say that ARM is moving up-market simply to stay in existence. Perhaps in 5 or 10 years' time we'll look at tiny ARM chips the same way that we look at the 8042 chip: the ARM started as the CPU for a full-blown computer, then found its niche for a decade or so in less powerful products, and eventually faded out of existence as even those products evolved to require increasing amounts of processing power.
Or maybe it is the desktop (and the x86) that will be replaced by tablets (with ARMs in them for largely historical reasons).