Separate VPU and Floating Point Logic?
How quaint... And how very very very Intel.
They still do not get it.
It took fifteen years for Intel to shrink the computing power of the teraflops-busting ASCI Red massively parallel Pentium II supercomputer down to something that fits inside of a PCI-Express coprocessor card – and the Xeon Phi coprocessor is only the first step in a long journey with coprocessor sidekicks riding posse with CPUs …
Being complacent about Intel's competence has never worked out well.
Yes, it still has an x87 - an x87 borrowed from the Pentium-90, pipelined but not particularly superscalar. If you want to do arithmetic you use the VPU; if you have some little piece of setup code that desperately needs 80-bit floating point for thirty million cycles, then you can run it slowly on the x87 side and the VPU will be briefly power-gated.
With the x86 "core" occupying less than 2% of the transistor count, and a whole new vector unit, I suspect you'll have to do a fair amount of re-work to take advantage of it all. Which kinda seems to negate the original aim of the Larrabee chip: to be a bunch of x86 cores on one die, so people didn't have to learn a new processor architecture/instruction set.
You are missing the point a bit there. The core still executes x86 code, so runs the legacy ISA and can be targeted by the current compilers. The 2% refers to how much die area is used to specifically decode the x86 ISA into a processor internal representation of the opcodes. The remaining 98% of the core would look very similar if the core was executing ARM or any other ISA, that was the point of the comment. That still doesn't mean that x86 can shrink as small as a RISC ISA specifically designed for small cores, but at least the x86 ISA is not really a limiting factor in the size of _this_ core design (which is admittedly fairly small).
Actually this core is not fairly small, it's enormous. The previous generation Knights Ferry had 32 of these cores and a humongous die size of 700mm^2. Even at 22nm with 64 cores we're talking about a huge chip at ~350mm^2.
A modern GPU can achieve the same level of performance at half that die size on a 28nm process... So the 2% number is pretty meaningless - it implies the core is just way too large and too inefficient.
As with phones, being x86 compatible is totally pointless in a chip like this - even if the overhead is supposedly small. It even has an ancient x87 FPU with all its horrible flaws - that's just insane.
"...the four threads will look like HyperThreading to Linux software, but Chrysos says that the threads are really there to mask misses in the pipeline...."
It seems that the Niagara CMT CPUs' idea of masking misses in the pipeline was a novel approach, and worthwhile to copy. How many other CPUs will copy it? I mean, the idea of having many lower-clocked cores was frowned upon by IBM ("databases work best on single strong cores"). But now IBM has CPUs with many cores, clocked lower. Will IBM also copy this masking of misses, instead of trying to cram in larger and larger caches? As we all know, large caches are useless when we talk about server workloads serving thousands of clients, because all that data will never fit into a CPU cache.
Single thread performance is one important factor, but in terms of $/FLOP and FLOPS/watt, Power doesn't look that great. If you want to build a really huge system that you can afford to power and cool, guess what IBM sells you - Blue Gene, which isn't exactly a single thread speed demon...
So what? If you want single-thread performance, a POWER7 is hard to beat. If you want that and power efficiency at the same time, well, guess what: that isn't going to happen any time soon. If you want a supercomputer cluster that is power efficient, a POWER4 is a much better choice. Actually the PowerPC A2 isn't bad either.
IBM seems to understand that it isn't one size fits all.
I can barely remember the last time an interesting new sparc came out. Hopefully some time soon a new one will (but it won't be from Oracle that's for sure). It's a lovely instruction set that just happens to be highly neglected by its makers.
"It seems that the Niagara CMT CPUs' idea of masking misses in the pipeline was a novel approach, and worthwhile to copy. How many other CPUs will copy it?"
Isn't that kind of an absurd statement, given that the Pentium 4 was the first commercially available CPU to implement SMT? How is Intel copying Niagara here?
The Niagara T1 isn't even the 2nd or 3rd CPU to implement hardware multithreading; both POWER and MIPS had it before the release of the T1.
On top of that, according to Wikipedia, IBM did the initial research on multithreading back in the late sixties.
The rigid U/V pipe architecture of the P54C was very different to the much more flexible out-of-order architecture of the Pentium II. Some trawling of Andy Glew's old posts to comp.arch in Google Groups' archives will give much more info. He was a principal designer on the Pentium II.
An anthropomorphised ASCI Red wouldn't recognise the Knights Corner CPUs, for the simple reason that they don't share much common heritage - not because they have diverged so greatly from the PII.
The P54C is a P5 core. The Pentium II was a P6 core. Those are very, very different. The P6 had out-of-order execution, and was the first Intel chip to translate x86 instructions into micro-ops that were then executed on a more RISC-style core. The P54C is a plain old native x86 design where instructions are executed in order.
For building a chip with lots of cores that run predictable code, the P54C design is not a bad choice. The P6 core is much more complex and uses a lot more transistors, especially for the instruction-translation system and for handling out-of-order execution.
So yes, the P54C is close in time to the Pentium II, but about as far apart in design as two Intel cores could be.
Biting the hand that feeds IT © 1998–2022