Proprietary?
I wonder if, like everything else coming from Imagination, you'll have to offer your first-born children as sacrifices just to get access to basic programming details?
Chip designer Imagination Technologies today went public about its new processor design – the 64-bit MIPS Warrior I6400. It's an ambitious blueprint, aimed at car dashboards, digital TVs and tablets – the usual space for Imagination – all the way up to data center-grade compute, storage and networking kit. In other words, …
"Where rival ARM's shift from 32-bit ARMv7 to 64-bit ARMv8-a involved rewriting chunks of its instruction set and forcing some low-level engineers to learn a new assembly language, MIPS64 is basically MIPS32 with instructions for using 64-bit-wide data, and it runs MIPS32 code without a mode switch."
I instantly thought of the Data General Eagle and came over all warm and fuzzy.
"has the simultaneous multithreading (SMT) ... this technology essentially turns each physical core into two or four virtual cores. A hardware scheduler interleaves the virtual CPU threads into the processor's execution queues"
So, not simultaneous then - or have I misunderstood something?
> The hardware scheduler is better at scheduling tasks than end users
Unless I'm being stupid, it's an instruction scheduler, not a task scheduler, and the 'scheduling' is very basic, such as instruction interleaving (as the article says).
> but in effect you're just getting better performance out of similar hardware.
Hmm. I've heard this before but it didn't pan out IME. I kicked off some heavy processing work[*]. I ran it repeatedly, steadily upping the number of parallel threads I allowed for that query, and it scaled linearly up to the number of physical cores. Once it started using virtual cores the rise stopped, and as more virtual cores were used, performance slowly fell. However, that may have been an atypical workload. Perhaps if it had been cache-bound rather than memory-bound it might have done better. It was running on a huge dataset. Dunno.
[*] happens it was in a DB but all memory resident so the disk was never touched.
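For anyone wanting to try that kind of experiment themselves, here's a minimal sketch of the idea in C with pthreads: a memory-bound reduction over a big array, timed at increasing thread counts. The array size, kernel and thread counts are invented for illustration; it's not the original database workload, just the shape of the test.

/* Memory-bound scaling sketch: sum a large array with 1, 2, 4, 8 threads
 * and time each run. On an SMT machine you'd expect the speed-up to
 * flatten once the thread count passes the number of physical cores. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64UL * 1024 * 1024)   /* 64M doubles (~512 MB): far bigger than any cache */

static double *data;

struct slice { size_t lo, hi; double sum; };

static void *worker(void *arg)
{
    struct slice *s = arg;
    double acc = 0.0;
    for (size_t i = s->lo; i < s->hi; i++)
        acc += data[i];          /* streaming reads: memory-bound */
    s->sum = acc;
    return NULL;
}

int main(void)
{
    data = malloc(N * sizeof *data);
    if (!data) return 1;
    for (size_t i = 0; i < N; i++)
        data[i] = 1.0;

    for (int nthreads = 1; nthreads <= 8; nthreads *= 2) {
        pthread_t tid[8];
        struct slice sl[8];
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int t = 0; t < nthreads; t++) {
            sl[t].lo = N / nthreads * t;
            sl[t].hi = (t == nthreads - 1) ? N : N / nthreads * (t + 1);
            pthread_create(&tid[t], NULL, worker, &sl[t]);
        }
        double total = 0.0;
        for (int t = 0; t < nthreads; t++) {
            pthread_join(tid[t], NULL);
            total += sl[t].sum;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d thread(s): %.2f s (sum %.0f)\n", nthreads, secs, total);
    }
    free(data);
    return 0;
}

Build with -pthread and watch where the curve flattens; whether the SMT threads help or hurt depends entirely on whether the bottleneck is the memory queues or the arithmetic units.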
But you are slightly misinformed. SMT usually refers to the ability of the hardware to present multiple threads of execution to the OS so that the hardware (think ALUs and stuff like that) is kept busy even if one thread blocks (due to slow memory). So: Pseudo-simultaneous, yes. Yielding more performance out of only slightly increased HW cost, also yes.
For some more detail, consider the picture in the article. While it doesn't go into all the details about the width of the blocks, it still shows most of the relevant information. Most of the computation work happens in the blocks after the Instruction Issue Unit. From the image, you can see that each of these units is separate and they don't interact with each other. But each of them is able to do a substantial amount of work. This allows the CPU to be running two integer instructions, a floating-point instruction, a branch and a memory request all simultaneously. With a traditional architecture, we'd issue one instruction to the set of compute blocks every cycle and the others would sit idle until they got an instruction. Instead, SMT tries to keep all of those blocks busy by issuing an additional set of instructions from an unrelated program.
In this case, we're looking at a dual-issue processor, which means that it can fetch, decode and issue two instructions at the same time. Thus, the entire processor is actually capable of running two threads simultaneously, but without having to duplicate all of the heavy/expensive compute hardware.
This also leads to an explanation of why your database wouldn't scale past the number of full cores on the machine. Since a database is heavy on memory access, those queues are going to be well saturated by a single thread. Adding additional threads of the same type of task won't work well, but you could have easily fit a program that needed lots of arithmetic onto the processor without affecting your database performance substantially.
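To make the interleaving idea concrete, here's a toy C model of SMT issue. The units, latencies and the two instruction streams are entirely made up (it's nothing like the real I6400 pipeline), but it shows the mechanism: two hardware threads feed one set of execution units, and each cycle the issue logic takes whatever is ready from either thread, so a memory-stalled thread leaves room for an arithmetic-heavy one.

/* Toy SMT issue model: thread 0 is memory-heavy (like the database query),
 * thread 1 is arithmetic-heavy. Each cycle, ready instructions from either
 * thread are issued to whichever functional unit is free. */
#include <stdio.h>

enum unit { INT_ALU, FP_ALU, LOAD_STORE, NUM_UNITS };

static const char *unit_name[NUM_UNITS] = { "INT", "FP", "LD/ST" };

struct op { enum unit u; int latency; };     /* latency in cycles */

struct hw_thread {
    const struct op *ops;
    int n, next;
    int stalled_until;                       /* cycle when the current op completes */
};

int main(void)
{
    static const struct op t0_ops[] = {
        {LOAD_STORE, 4}, {INT_ALU, 1}, {LOAD_STORE, 4}, {INT_ALU, 1}, {LOAD_STORE, 4},
    };
    static const struct op t1_ops[] = {
        {INT_ALU, 1}, {FP_ALU, 1}, {INT_ALU, 1}, {FP_ALU, 1}, {INT_ALU, 1}, {FP_ALU, 1},
    };
    struct hw_thread th[2] = {
        { t0_ops, 5, 0, 0 },
        { t1_ops, 6, 0, 0 },
    };

    for (int cycle = 0; th[0].next < th[0].n || th[1].next < th[1].n; cycle++) {
        int unit_busy[NUM_UNITS] = {0};
        printf("cycle %2d:", cycle);
        for (int t = 0; t < 2; t++) {
            struct hw_thread *h = &th[t];
            if (h->next >= h->n || cycle < h->stalled_until)
                continue;                    /* thread finished, or waiting on memory */
            const struct op *o = &h->ops[h->next];
            if (unit_busy[o->u])
                continue;                    /* that unit already has work this cycle */
            unit_busy[o->u] = 1;
            h->stalled_until = cycle + o->latency;
            h->next++;
            printf("  thread %d -> %s", t, unit_name[o->u]);
        }
        printf("\n");
    }
    return 0;
}

Run it and you can see the arithmetic thread filling in the cycles the memory thread spends stalled, which is exactly the win SMT is after.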
I suppose I should have made clear that in the four-virtual-core setup (four hardware threads), they feed into two execution queues that do all the hard work simultaneously. A two-virtual-core setup feeds into one. The hardware scheduler keeps the queues topped up so something's always happening, in theory.
C.
(Posting on my day off hence no Reg badge; I cba logging into work.)
"massively parallel, array processing stuff easy to do on relatively low cost hardware."
Email from Maplin today announcing this tiny (5"x 5"x 1") 192-core beast for £200:
"The NVIDIA Jetson TK1 development kit unlocks the power of the GPU for embedded applications. Built around the revolutionary Tegra K1 SOC, it uses the same Kepler computing core designed into supercomputers around the world. It is a fully functional CUDA platform that will allow you to quickly develop and deploy compute-intensive systems for computer vision, robotics, and medicine.
NVIDIA provides the BSP and software stack, including CUDA, OpenGL 4.4, and the NVIDIA VisionWorks toolkit. With a complete suite of development and profiling tools, out-of-the-box support for cameras and other peripherals, you have everything you need to realize the future of embedded.
[snip]
NVIDIA Kepler GPU with 192 CUDA cores
NVIDIA 4-Plus-1 quad-core ARM Cortex A15 CPU
2 GB x16 memory with 64 bit width
16 GB 4.51 eMMC memory"
[continues]
http://www.zotac.com/uk/z-zone/nvidia-jetson-tk1
http://www.maplin.co.uk/p/zotac-jetson-tk1-developer-kit-a30ny
http://www.linuxuser.co.uk/reviews/zotac-nvidia-jetson-tk1-review
I don't know if I want one, but some folks might.
From article: "Ironically, MIPS and the new ARMv8-a (PDF) instruction sets are conveniently similar: for instance, they both have a fixed register that always contains a zero value, they both have tons of general purpose registers, each instruction is the same width, the program counter is not directly accessible, and so on."
I don't see anything ironic here. These are the features that actually distinguished RISC processors from CISC in the first place. Every real RISC architecture implements at least some of these, especially the fixed-width instruction format and the large number of general-purpose registers.
While it is true that these features are what is generally seen to distinguish RISC from CISC, the original MIPS design had a large part in that definition: it was (alongside the Berkeley RISC processor, which is the forefather of SPARC) basically what defined the concept.
I have long thought that ARM should have moved the PC out of the numbered registers when they moved the status register to a separate, unnumbered register. While you save a few instructions by not having to make separate instructions for saving/loading the PC, PC-relative loads, etc., most instructions that work on general registers are meaningless to use with the PC. And in all but the simplest pipelined implementations, it complicates the hardware to make special cases for R15 (the PC). So this move is hardly surprising. I'm less sure about the always-0 register. I think it would be better to avoid this (gaining an extra register), and make a few extra instructions for the cases where it would be useful, e.g., comparing to zero.
And while code density is less of an issue now than ten years ago, I think ARM should have designed a mixed 16/32-bit instruction format. For simplicity, you could require 32-bit alignment of 32-bit instructions, so you would always use 16-bit instructions in pairs, and branch targets could likewise be 32-bit aligned. For example, a 32-bit word starting with two one-bits could signify that the remaining 30 bits encode two 15-bit instructions, while all other combinations of the first two bits encode 32-bit instructions.
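To show how cheap the decode side of that scheme would be, here's a small C sketch of it. The encoding is just my reading of the proposal above (a hypothetical format, not anything ARM or MIPS actually do):

/* Hypothetical mixed 16/32-bit format: if the top two bits of a 32-bit
 * aligned fetch word are both 1, the remaining 30 bits hold a pair of
 * 15-bit instructions; any other prefix means one full 32-bit instruction. */
#include <stdint.h>
#include <stdio.h>

static void decode_word(uint32_t word)
{
    if ((word >> 30) == 0x3) {
        uint16_t first  = (word >> 15) & 0x7FFF;   /* upper 15-bit instruction */
        uint16_t second =  word        & 0x7FFF;   /* lower 15-bit instruction */
        printf("compact pair: 0x%04x, 0x%04x\n", first, second);
    } else {
        printf("full 32-bit instruction: 0x%08x\n", word);
    }
}

int main(void)
{
    decode_word(0xC1234567u);   /* top bits 11 -> two 15-bit instructions */
    decode_word(0x8BADF00Du);   /* top bits 10 -> one 32-bit instruction */
    return 0;
}

The price, of course, is that your "16-bit" instructions only get 15 bits of encoding space, and the compiler has to pair them up or pad.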
"And while code density is less of an issue now than ten years ago, [...]"
I recall compiling some programs for MIPS and some other CPUs back in the 1990s, and the MIPS executables usually turned out to be around twice as large as the i386 or VAX ones. But this was not a big deal even back then.
Coming from a background in ARM, MIPS and i386, I think that while code size is not much of an issue in some ways, it's very much an issue when you take the instruction cache into account. If MIPS code is twice the size of i386 code, then compared to the Intel chip the instruction cache is effectively halved.
I realise this is veering wildly off topic, but I'd be interested to see how 32-bit ARM compares to MIPS in this respect, given its implicit shift instructions and multiple load and stores.
" I'd be interested to see how 32-bit ARM compares to MIPS in this respect, given its implicit shift instructions and multiple load and stores."
And ARM's predicated instructions (dropped in the 64-bit version, rather unavoidably - too much state to carry around), and their Thumb instructions for high code density.
High code density is not just good for using less memory for a given task, it's also good for getting more performance out of a limited bandwidth memory system delivering the instructions.
Are you aware of the CoreMark low-level benchmarks? Might be worth a look if you're not.
Lots of factors to look at before an informed decision can be made.
Does "Nobody ever got sacked for specifying ARM?" apply yet?
"Yes, I deliberately didn't mention those because I'm not sure they actually help with code density."
I got the impression (years ago, in the days when people were speculating about EPIC) that predicated instructions were about improving performance rather than code density.
Risc64 has been around since 1994 with the introduction of the R8000.
I was in university at that time, and was one of the lucky sods who got to run my code on the state-of-the-art Silicon Graphics Power Challenge Array. Wow, it flew! Around 8 times faster than the VAX I previously used. The R8000 was the processor that skyrocketed SGI into the HPC space.
Unfortunately the R10k and its successors were not as good, mainly because the complex superscalar pipeline was hard to clock higher. I recall all the problems with getting the R10k above 195 MHz. By that time the AMD Opteron had arrived and the dark ages of MIPS had started.
The "always zero" register is surprisingly useful, both as a source ("read zero") and a destination ("discard result"). However, comparing to zero isn't one of its main uses as ARM's A64 also has a "compare immediate", which seems to be the compilers' preference for comparing with zero. Note, however, that CMP (compare) is just a synonym for SUBS (subtract and set flags) with the destination register WZR/XZR. You can change any flag-setting operation into a kind of compare by choosing WZR/XZR as the destination.
Also, although register number 31 usually means WZR/XZR, with some instructions it means SP, the stack pointer, which likewise is no longer a general-purpose register. Therefore, if you wanted 32 general-purpose registers you would have to add quite a lot of extra instructions.
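To illustrate the WZR/XZR convention, here's a toy C model of it. It's a simplified sketch (only the N and Z flags are modelled, not carry or overflow, and it's nowhere near a faithful A64 emulator): reads of register 31 return zero and writes to it vanish, so SUBS with destination 31 behaves exactly like CMP.

/* Toy model of the A64 zero register: register number 31 reads as 0 and
 * discards writes, which is why CMP is just SUBS with Rd = XZR. */
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

#define ZR 31                      /* register number 31 = WZR/XZR in this model */

static uint64_t regs[31];          /* the 31 real general-purpose registers X0-X30 */
static bool flag_n, flag_z;

static uint64_t read_reg(int r)              { return r == ZR ? 0 : regs[r]; }
static void     write_reg(int r, uint64_t v) { if (r != ZR) regs[r] = v; }

/* SUBS Rd, Rn, #imm : subtract, set flags, write result to Rd */
static void subs_imm(int rd, int rn, uint64_t imm)
{
    uint64_t result = read_reg(rn) - imm;
    flag_n = (int64_t)result < 0;
    flag_z = result == 0;
    write_reg(rd, result);
}

int main(void)
{
    regs[0] = 42;

    subs_imm(1, 0, 40);            /* subs x1, x0, #40 : keeps the result, sets flags */
    printf("x1=%llu N=%d Z=%d\n", (unsigned long long)regs[1], flag_n, flag_z);

    subs_imm(ZR, 0, 42);           /* cmp x0, #42 is literally subs xzr, x0, #42 */
    printf("N=%d Z=%d (result discarded)\n", flag_n, flag_z);
    return 0;
}

The same trick turns ANDS into TST, and so on: pick the zero register as the destination and any flag-setting op becomes a compare.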
As it is, the A64 encoding is a thing of great beauty. You hardly even need an assembler! It's like going back to the days of the 6502, when you could program with a hex editor, only occasionally referring to the single-page table of instruction encodings! ... Yeah, I exaggerate somewhat - the 5-bit register fields don't match the 4-bit fields of a hex editor - but A64 is really neat.