Great...
More buggy microcode to exploit, while adding net zero worth as nobody wants to compile architecture-specific code to utilise the new functionality.
Intel has revealed two sets of extensions coming to the x86 instruction set architecture, one to boost the performance of general purpose code and the second to provide a common vector instruction set for future chips. Some of the details were revealed on Intel’s developer website, showing the Advanced Performance Extensions ( …
"nobody wants to compile architecture-specific code to utilise the new functionality"
But the benefit is that it will ensure profitable churn for the industry - applications using the new instructions will not run at all on older CPUs, so we'll all have to 'upgrade' our hardware.
That's not what happens. What happens is that nobody dares to distribute binaries with them turned on.
If you want everyone to be able to run your binaries, all you are guaranteed on the x86_64 arch is SSE2. You can up that a little with games (SSE3), but there are still CPUs in use that don't support SSE4.2 and POPCNT, never mind AVX. So unless you do CPU detection gymnastics, you can't even use them. (And simply going by the model number or listed instructions won't do; you have to run code tests or you're going to have a lot of complaints about crashing games.)
Hardly "gymnastics" - there are specific registers that can be read that allow you to determine what parts of the instruction set are available - as the article says:
> Developer code will only need to check three fields, according to Intel: A CPUID feature bit indicating that AVX10 is supported, the AVX10 version number, and a bit indicating the maximum supported vector length.
You only need to write the detection code into one library (and then keep it updated as new material arrives) - or find someone who has already done that - and then use the flags it extracts as required.
If you want to do some gymnastics, you can set up your build system to recompile the source once per architecture variant you want to support, generating multiple object files, and also generate the single function that the rest of your code will call. That function simply checks the flags (from the library, above), sets a pointer to the appropriate one of the multiple objects, and then invokes via that pointer - you only do the check and assign on the first call, while the pointer is still null, of course. There's a rough sketch of that calling side a little further down.
Again, you set up that build the once and then just need to maintain it in sync with the identification library.
As far as tricksy coding goes, that isn't really very hard gymnastics - call it a gate-vault rather than jumping over the box.
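To make that concrete, here's a minimal sketch of the dispatch side in C, using GCC/Clang's __builtin_cpu_supports() as the stand-in for the identification library; vadd_scalar() and vadd_avx2() are made-up names for the per-architecture objects described above:

/* dispatch.c - sketch of the "check once, then call through a pointer"
 * pattern described above. vadd_scalar()/vadd_avx2() are hypothetical
 * variants built from the same source with different -m flags;
 * __builtin_cpu_supports() is the GCC/Clang runtime feature test.
 */
#include <stddef.h>

void vadd_scalar(float *dst, const float *a, const float *b, size_t n);
void vadd_avx2  (float *dst, const float *a, const float *b, size_t n);

typedef void (*vadd_fn)(float *, const float *, const float *, size_t);
static vadd_fn vadd_impl;                 /* NULL until the first call */

void vadd(float *dst, const float *a, const float *b, size_t n)
{
    if (!vadd_impl) {                     /* first call: pick a variant once */
        __builtin_cpu_init();
        vadd_impl = __builtin_cpu_supports("avx2") ? vadd_avx2 : vadd_scalar;
    }
    vadd_impl(dst, a, b, n);              /* later calls go straight via the pointer */
}

(In real code you'd also want to think about thread safety of that first-call check, or just do the detection once at startup.)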
Yes, the compiler has heuristics that can choose those instructions if you give it a target that supports them, but that doesn't necessarily mean it's going to use them in optimal places. For that, you have to write code that benefits from them and pass specific flags in your compile commands (e.g. -mavx2) for those objects - something like the variant source sketched below.
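For illustration, and assuming GCC or Clang, the per-architecture objects can literally be the same trivial source compiled twice - the VADD_NAME macro trick here is just my own convention for giving each object a distinct symbol, nothing standard:

/* vadd_variant.c - compiled twice from the same source, e.g.
 *   cc -O3        -c vadd_variant.c -o vadd_scalar.o -DVADD_NAME=vadd_scalar
 *   cc -O3 -mavx2 -c vadd_variant.c -o vadd_avx2.o   -DVADD_NAME=vadd_avx2
 * The loop is plain C; with -mavx2 the compiler is free to auto-vectorise
 * it with AVX2 instructions, without the flag it sticks to baseline SSE2.
 */
#include <stddef.h>

void VADD_NAME(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}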
Once again I ask this question: how does an original chip - be it CPU, GPU, Pi, ARM, or whatever - get coded? I can understand how a designer can add registers by modifying the microcode or whatever, but how does the original chip designer get the chip to respond to:
r
PB PC NVmxDIZC .A .X .Y SP DP DB
; 00 E012 00110000 0000 0000 0002 CFFF 0000 00
g 2000
BREAK
And then how does the designer get the microcode to respond to machine code, and so on?
So, anyone going to try and fit the answer to that one into a single comment or should we just respond with the ISBN of our favourite 800 page text book on the subject?[1]
[1] nowhere near my bookshelves at the moment; must get around to memorising those ISBNs one of these days.
PS not a dig, Kev99, that is a sensible and worthy question, just that there are quite a few things that can (ought to) be put into a decent answer to it!
(got a bit distracted, meant to get back here earlier - hope I'm not too late)
ISBN 978-0128119051 - yeah, pretty much so. "Computer Architecture: A Quantitative Approach" by John L. Hennessy, David A. Patterson. Although, just to be weird, I have the first edition of this, which is rather old now, but I looked at the later editions yesterday and came to the conclusion that the newer editions have lost some of the introductory material in order to squeeze in the more modern material! If you're not too worried about being totally up to date the older edition(s) are really cheap on Abe Books.
The same authors also have another series out on the hardware/software interface, which may be more interesting/accessible to programmers - they still delve into the architecture below the ASM opcodes (i.e. microcode) and there are now separate editions that cover "general" computers (e.g. the Intel Core i7, IBM Cell ...), ARM and RISC-V. Apparently also a MIPS edition. I haven't looked through those thoroughly, but one or more of those looks like it would be a good companion for my older "Quantitative Approach". Although second-hand 'cos they are rather pricey :-(
Also:
I haven't looked at this one, but it is apparently worth a look (and it is cheaper than the ones above): Microprocessor Architecture: From Simple Pipelines to Chip Multiprocessors by Jean-Loup Baer. 2009 vintage, it is said to concentrate on the micro(code) architecture level, rather than any particular ASM opcodes presented to the programmer. It does refer to the Alpha, P6 and Athlon as examples (vintage!).
Awesome question! Short answer: it's complicated, bordering on magic.
I have some ability to program in high level languages. I can do a little bit with assembly. In college I learned about boolean logic, including the theory, the math involved, and actual implementation with basic gates and related components.
In about my 3rd year of an Electrical and Computer Engineering degree program, I took a course on designing with MSI (medium scale integration) parts such as muxes, encoders, decoders, etc.
By the end of that course, I was finally starting to see the picture of how software really connects to hardware. I had moderately complex chips (internally formed from primitive gates, which were built from transistors, which were themselves constructed from semiconductors). That hardware reacted according to the state of data stored within the parts that could represent data.
Unfortunately, I don't understand it nearly well enough to be able to explain it to anyone.
To some extent, nobody ever really does understand it all. We specialize in our own area, and the rest is just an abstract concept. We understand the interface to the layer "above" or "below" our level and we leave the implementation details to others.
Not quite accurate.
The registers (can) live in a generic array, the "register file" - and the size of that is certainly set by the hardware design.
However, they *can* modify how the register file is accessed via the Instruction Set Architecture (i.e. the ASM opcodes we use to program the beast) and by adding/removing/modifying entries in the ISA they can change how many registers our code can see, how those are used and so forth. How much of that is possible is down to how flexible/generic their microcode design is - and I have no idea how flexible any of the Intel microarchitectures are.
Imagine a counter, which is a bunch of gates with a number of output lines that together form a number, and every input pulse causes these outputs to form the next number. Take that number, feed it to your memory bank, fetch the value in that cell. Let's say for the example it's a byte wide, so eight bits. Feed that value to your decode ROM, that for each possible input, in this case that's 256 possibilities, produces ones and zeroes at N output lines, that set up various other blocks of gates for the function the instruction is supposed to perform.
A gate, by the by, is a binary logic function, like and, or, not, or a bunch of others, sometimes with many more than one or two inputs, and outputs capable of driving one or more, sometimes many more, inputs. Built out of transistors. With them you can build function blocks that probably include a register file, at least one arithmetic unit, i/o sections, memory access (duh), and so on.
Wait for all the outputs to settle, call it a tick, and go for the next one. IE the "we're stable now, the results are now valid" signal triggers the next cycle at the program counter. (Well, usually there's a clock and the logic just has to be fast enough with the settling to keep up, but let's keep things conceptually simple.)
N can be more or less as big as you need. Though you may need to have multiple steps for a single (byte) instruction, so the decode sets up a micro-counter to run through a list of micro-instructions that together make up the one (byte) instruction, suspending the main program counter for the duration. Those again are stored in a bit of ROM with each row wide enough to drive the control lines to the various function blocks.
This is the general idea. In the pursuit of performance CPUs get a lot more complex, but for that, there's the textbooks. Computer architecture courses often study MIPS because it's nice and accessible. Or you could look up the Verilog for the "J1 FORTH CPU"; it is pretty readable and very simple, no microcode. Or look at, say, the "magic-1" homebrew CPU. Compare and contrast, get accidentally sucked in, and before you know it you've designed and built your own.
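If it helps to see that fetch/decode/execute cycle as code rather than gates, here's a toy interpreter for a made-up four-instruction machine - not any real ISA, just the counter-feeds-memory-feeds-decoder loop described above expressed in C:

/* toy_cpu.c - a made-up 4-instruction machine, purely to illustrate the
 * fetch -> decode -> execute cycle described above (not any real ISA).
 */
#include <stdio.h>
#include <stdint.h>

enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2, OP_STORE = 3 };

int main(void)
{
    uint8_t mem[256] = {
        /* a tiny program: load mem[16], add mem[17], store to mem[18], halt */
        OP_LOAD, 16, OP_ADD, 17, OP_STORE, 18, OP_HALT,
    };
    mem[16] = 2; mem[17] = 3;

    uint8_t pc  = 0;         /* the "counter" feeding the memory bank        */
    uint8_t acc = 0;         /* one register; real chips have a file of them */

    for (;;) {
        uint8_t opcode = mem[pc++];      /* fetch: counter output addresses memory */
        switch (opcode) {                /* decode: the switch plays the decode ROM */
        case OP_LOAD:  acc  = mem[mem[pc++]];  break;
        case OP_ADD:   acc += mem[mem[pc++]];  break;
        case OP_STORE: mem[mem[pc++]] = acc;   break;
        case OP_HALT:  printf("mem[18] = %u\n", mem[18]); return 0;
        }
    }
}

Everything a real CPU adds on top - pipelines, caches, microcode sequencers, register renaming - is elaboration of that one loop, which is where the textbooks come in.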
For that baseline support across many chips - are they going to make such elements of the spec available to AMD, or the Chinese X86 clones that I can't remember the name of?
Thought not. I stand to be proven wrong.
AMD was very smart when it became clear that Itanic was sinking: they offered up the x86-64 extensions as a 64-bit solution, including making the relevant parts of the spec available to competitors.
I guess those elements of the specs will be available to other CPU manufacturers - free or licensed, I don't know - but if memory serves me right and my amateurish knowledge doesn't mix things up, wasn't there trouble in the past with the (popular because really good) Intel compiler checking for a "GenuineIntel" processor and (deliberately?) delivering very suboptimal instructions/code for all non-Intel processors, ignoring completely the instruction capabilities stated by the CPU (like SSE2)? Now, when "Developer code will only need to check three fields, according to Intel: A CPUID feature bit indicating that AVX10 is supported, the AVX10 version number, and a bit indicating the maximum supported vector length.", could such things happen again?
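For what it's worth, the vendor-string check at the heart of that old dispute is trivial to perform - the sketch below (GCC/Clang's <cpuid.h>, x86 only) just reads CPUID leaf 0 and prints the string; the complaint was about runtime dispatch keying off that string instead of the feature bits the CPU actually reports:

/* vendor.c - read the CPUID vendor string (leaf 0); the string is spread
 * across EBX, EDX, ECX in that order ("GenuineIntel", "AuthenticAMD", ...).
 * Uses GCC/Clang's <cpuid.h>; x86 only.
 */
#include <stdio.h>
#include <string.h>
#include <cpuid.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;
    char vendor[13] = {0};

    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 1;                         /* CPUID leaf not available */

    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    printf("vendor: %s\n", vendor);       /* e.g. "GenuineIntel" */
    return 0;
}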
All the painful hoops they had to go through to add the new stuff whilst still keeping the entirety of the rest of the x86 model working, which is a massive pile of - random stuff - nowadays.
The i960 was attempting to get away from all that - and history demonstrated that people just wanted to keep using x86, with its limited and rather non-generic registers 'cos that is what all of the code in the world actually runs on.
Plus they seem to need to announce "big special changes" like "new AI data types" (huh?[1] you mean signed/unsigned/fixed-point values in various bitwidths? how very novel and AI'ish) rather than "guys, you can use R8 to R15 now, okay". Even if Real Programmers end up using the new registers/opcodes "just" to speed up str(n)len (which is actually a really useful thing to do, but doesn't fill a four-colour glossy on drool-proof paper very well).
[1] What follows is just my guessing; the AVX10 Version 2 release is going to be *so* exciting!
32 registers needs a 5-bit field. 16 registers needs a 4-bit field. "4" is a more "natural" number for manipulating bits - says the 8-register PDP11 programmer :)
An issue is how you fit those bits into the instruction bitfield. The Z80 doubles the number of registers compared to the 8080, but it did it by adding two "swap" instructions, so only the original half was visible at any one time. That was due to the instruction bitfields already being "full". E.g. an ALU operation is 10:alu:reg. So, 8 registers, no more bits to specify any more registers, you can't do e.g. ADD barry,other_barry, you have to do ld temp,barry, swap, add temp,barry.
The x86 instruction set will have bitfields something like XXXXXrrrr, allowing you to specify 16 registers. To specify 32 registers you need another bit. How they support 32 registers depends on whether there are any unused bits in the bitfield that the register number can expand into, and - importantly - whether existing code has consistently set the "unused" bits to zero so the new CPU doesn't inadvertently access register 16+R instead of R.
Another method is to use whole sets of unused instructions for 32-register instructions: the original instructions can only access registers 0-15; to access registers 0-31 you have to use different instructions, or prefixes (which is essentially just a longer instruction). That's the way the eZ80 and Rabbit went.
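To make the prefix idea concrete, here's a toy decode in C - an invented encoding, not the real x86 or APX one - where an optional prefix bit widens a 4-bit register field to 5 bits:

/* reg_decode.c - toy illustration of widening a register field with a
 * prefix bit (invented encoding, not the real x86/APX scheme).
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Instruction byte: xxxxRRRR - the low 4 bits name registers 0-15. */
    uint8_t insn   = 0x2B;              /* register field = 0xB = 11         */
    int     prefix = 1;                 /* optional prefix carries a 5th bit */

    unsigned reg = insn & 0x0F;         /* old decode: 16 registers max      */
    if (prefix)
        reg |= 1u << 4;                 /* prefix bit becomes bit 4: regs 16-31 */

    printf("register %u\n", reg);       /* prints 27 */
    return 0;
}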
> Surely having 10-byte sequences just to encode new register references can't be good for performance and memory usage?
It's an arguable point, but these days by far the slowest part is fetching data from DRAM, and that's most efficiently done by long burst reads into cache. Once the cache is warm, generally that's where the code will be executing from.
Some instructions being "long" are offset by common instructions being "short", and by the fact that you'd need to use multiple "short" instructions to achieve the same result as one "long" instruction.
> Isn't it time for a new CPU ISA which stops the need for these terribly inefficient encodings?
Well, ARM, MIPS and RISC-V are there for you to use. ARM seems to be doing pretty well at the moment: MacOS, Linux, Windows all run natively on it. But if ARM tighten the licensing screws too much, RISC-V could take off in its place.
Intel tried with Itanium, but failed.