
I'm very confused, I work in HPC and I'm supposed to know this stuff!
Since its introduction, AVX-512 has gotten a bit of a bad rap for being hot, power hungry, and inconsistent in its implementation and feature set. Recall that Linux kernel dev Linus Torvalds famously said he hoped the SIMD instruction set would "die a painful death." With the recent introduction of AVX10, Intel signaled its efforts …
I've taken the whole saga to mean that Intel pushed out some quite powerful chips that, if fully stretched, would munch prodigious quantities of amps, necessitating the removal of a lot of heat. Meanwhile, the server manufacturers didn't really bother dealing with that heat, not providing adequate cooling to let a machine sustain such a full-chat workload indefinitely. That might be because Intel themselves were understating the thermal requirements; it's been embarrassing enough as it is to have AMD trounce them in the marketplace, no need to underline it further by showing just how hot these things can get. They've then been forced to chop and change exactly what AVX512 is to try and rein in the thermals without actually removing AVX512 altogether. And now they're doing it some more.
There are some (specialist) system builders out there who really do understand cooling, and fulfil market niches where customers do want to max out a CPU at 100% for years at a time. With such cooling in place, and the avoidance of frequency down-shifts as a result, these CPUs are absolute belters performance-wise. But the cooling needed to achieve that is pretty prodigious.
I'm quite interested to see how AMD's implementation of AVX512 holds up. They should be a lot better off, enjoying the benefits of TSMC's silicon process.
My reading of the released docs suggests that they _still_ haven't addressed the compatibility problem, at least not in the way that Arm SVE and RISC-V V do. Yes, AVX10 has versions that work with 256-bit and with 512-bit vectors, and if you write code for the 256-bit ones it will work on 512-bit hardware, but only 256 bits at a time. And if you're silly enough to write code that uses the 512-bit registers, it will only work on the (high-end) processors that have them. So you're back at processor capability testing and multi-versioning. (And of course there are still millions of older systems that only have SSE and AVX{,2}.)
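For what it's worth, GCC and Clang can at least automate that capability testing and multi-versioning for you. A minimal sketch using the target_clones attribute; the kernel and the ISA list here are purely illustrative:

```c
#include <stddef.h>

/* One source function, several compiled clones.  The compiler also emits a
 * resolver that checks the CPU's features once at load time and wires in the
 * best clone the running processor actually supports. */
__attribute__((target_clones("default", "sse4.2", "avx2", "avx512f")))
void saxpy(float a, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

It only papers over the problem, though: the 512-bit clone still only ever runs on hardware that has it, it just saves you writing the dispatch by hand.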
Meanwhile Arm and Apple are keeping up throughput-wise by cranking up the number of NEON execution units and scheduling multiple SIMD instructions every cycle...
There are 131 cpu capability flags on my AMD Ryzen CPU. Things like cqm_mbm_total, arat, and good old sse4_2.
Now Intel (and therefore AMD) want to add a few more! The fragmentation of the market across multiple CPU models, ages, and feature sets is actually surprising.
It _is_ getting out of hand.
x86-64 SIMD is an utter shambles. There are so many different versions and variations across different CPU models.
I had a system which used SSE4.2. I looked at adding AVX support. I therefore spent time coding up AVX versions of the software.
When I benchmarked them, though, I found that AVX was actually slower than SSE4.2 despite the theoretical advantages. A bit of googling turned up the information that this was a known problem, but one that was CPU-model dependent.
Given the impracticality of implementing and benchmarking the AVX software on every model of x86 CPU out there, I binned the notion of using AVX and went back to SSE4.2. At least I know that it works.
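Out of curiosity, here's roughly what that kind of comparison looks like (not the poster's actual code, just a sketch): the same dot product written with SSE and with AVX intrinsics, timed back to back. Whether the AVX version wins genuinely depends on the CPU model, e.g. on parts that crack 256-bit ops into two 128-bit halves.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Tail elements are ignored for brevity; N below is a multiple of 8. */

__attribute__((target("sse4.2")))
static float dot_sse(const float *a, const float *b, size_t n) {
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    float t[4];
    _mm_storeu_ps(t, acc);
    return t[0] + t[1] + t[2] + t[3];
}

__attribute__((target("avx")))
static float dot_avx(const float *a, const float *b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));
    float t[8], s = 0.0f;
    _mm256_storeu_ps(t, acc);
    for (int k = 0; k < 8; k++) s += t[k];
    return s;
}

int main(void) {
    enum { N = 1 << 20, REPS = 200 };
    static float a[N], b[N];
    for (size_t i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 0.5f; }

    /* Only run the AVX path if the CPU actually has it. */
    if (!__builtin_cpu_supports("avx")) { puts("no AVX on this CPU"); return 0; }

    volatile float sink = 0.0f;
    clock_t t0 = clock();
    for (int r = 0; r < REPS; r++) sink += dot_sse(a, b, N);
    clock_t t1 = clock();
    for (int r = 0; r < REPS; r++) sink += dot_avx(a, b, N);
    clock_t t2 = clock();

    printf("SSE4.2: %ld ticks   AVX: %ld ticks   (sink=%f)\n",
           (long)(t1 - t0), (long)(t2 - t1), (float)sink);
    return 0;
}
```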
The answer is... don't code directly for the instruction set. Use a library.
Intel's MKL / IPP is intended to take the pain out of it, providing a wide range of maths and other functions. In principle, if one writes software against that library, then at run time the library knows how best to run the FFT on the specific CPU it finds itself running on. Keep that library up to date and you've always got optimum run time on any CPU. Better still, you need not multithread your code to really max out the CPU. If the FFT your software wants to run is large enough for threads to be a benefit (and they aren't always), then the library will task all the cores in the CPU to compute it on your behalf. All your source code can be single threaded, the multithreading being hidden in the library.
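To make that concrete, this is roughly the shape of an FFT call through MKL's DFTI interface, written against the documented API as a sketch rather than tested code (link against MKL however your build does). The kernel selection (SSE/AVX/AVX-512) and any threading happen inside the library; the caller never mentions an instruction set.

```c
#include <mkl_dfti.h>
#include <stdio.h>

int main(void) {
    enum { N = 4096 };
    static MKL_Complex8 x[N];   /* single-precision complex, transformed in place */
    for (int i = 0; i < N; i++) { x[i].real = (float)i; x[i].imag = 0.0f; }

    DFTI_DESCRIPTOR_HANDLE h = NULL;

    /* Describe a 1-D single-precision complex transform of length N... */
    if (DftiCreateDescriptor(&h, DFTI_SINGLE, DFTI_COMPLEX, 1, (MKL_LONG)N) != DFTI_NO_ERROR)
        return 1;

    /* ...let MKL pick the code path and thread count for this CPU... */
    if (DftiCommitDescriptor(h) != DFTI_NO_ERROR)
        return 1;

    /* ...and run it.  The dispatch is entirely the library's problem. */
    MKL_LONG status = DftiComputeForward(h, x);
    DftiFreeDescriptor(&h);

    printf("bin 0: %f %+fi\n", x[0].real, x[0].imag);
    return status == DFTI_NO_ERROR ? 0 : 1;
}
```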
There are other examples of such libraries; there are implementations of VSIPL for x86-64. GNU Radio's math library VOLK seems to be taking on a life of its own too; it's pretty immature as yet, but good in parts.
These things often cost money, but not that much, and it's all part of a trade-off. If you want easy, reliable access to peak performance on any CPU architecture over years of CPU evolution and improvement, you're either going to have to do a lot of dev work yourself or pay for someone else's library that's done it for you. The library is almost always better, cheaper and a whole lot easier, especially when it comes to support.
My software is a SIMD library. The problem is the lack of published model-specific data on the actual performance of individual SIMD instructions, so this can't be solved by writing a library. Actual performance doesn't match up with theoretical performance. Sometimes the SSE4.2 version is faster, and sometimes the AVX version is faster. Sometimes the SIMD version is no faster than the non-SIMD version, or only marginally faster. This is CPU-model-specific behaviour, not something you can just make assumptions about by reading the vendor's published instruction documents.
The only solution is to benchmark each instruction with each integer and floating point data type on actual hardware for each CPU model, and I can't afford to buy (and have no room for) every CPU model ever put out by Intel and AMD. The vendors don't publish model-specific benchmarks, and there is no authoritative third-party published data that I am aware of which gives the answer to this problem.
Now multiply this through all the different x86 SIMD systems, including mmx, sse, sse2, sse3, ssse3, sse4a, sse4.1, sse4.2, avx, avx2, and avx512. AVX512 itself has a dizzying array of different subsets, as Intel attempted to segment the market to extract the maximum revenue from each chip model.
When you look at the actual installed base of different CPU models, the only realistic course is to check for sse4.2 and use it if present, and if not, fall back to a non-SIMD algorithm.
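A sketch of that check-and-fall-back pattern, assuming GCC or Clang (the kernels here are toy stand-ins for real library routines):

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdio.h>

/* Toy kernel compiled with SSE4.2 enabled -- stand-in for the real routine. */
__attribute__((target("sse4.2")))
static float sum_sse42(const float *x, size_t n) {
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));
    float t[4];
    _mm_storeu_ps(t, acc);
    float s = t[0] + t[1] + t[2] + t[3];
    for (; i < n; i++) s += x[i];            /* scalar tail */
    return s;
}

/* Plain-C fallback for everything older. */
static float sum_scalar(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s;
}

/* The dispatch described above: one run-time feature check, then commit. */
static float sum(const float *x, size_t n) {
    return __builtin_cpu_supports("sse4.2") ? sum_sse42(x, n)
                                            : sum_scalar(x, n);
}

int main(void) {
    float v[] = { 1, 2, 3, 4, 5, 6, 7 };
    printf("%f\n", sum(v, sizeof v / sizeof v[0]));
    return 0;
}
```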
ARM is a completely different story. They have a simple and consistent SIMD instruction set instead of the absolute train wreck on x86.
Hmm, well, I'm still not sure that coding direct is the best course of action. The libraries supplied by Intel cover off the core functionality of what SSE and AVX can do, so using those library functions to build another library is probably still a good way to go.
However I understand your pain, and the solution I've seen elsewhere is that such a library is intended for specialist purposes, and they point to specific hardware as a requirement. Creating it for the general hardware case is hard work indeed.
Motorola / Freescale got Altivec right too, and IBM's Cell was effectively a lot of Altivec units all on one chip. Intel got it wrong from the start, with MMX being too tame, and only with the arrival of AVX and its FMA did Intel's SIMD start looking sensible. They deliberately held back FMA to keep Itanium alive - which did have an FMA. They finally caved in with AVX, and Itanium lost its last edge.
The problem is that, clunky though Intel has been, if one can work one's way through all that, the end result is hugely powerful, simply because of the scale of the chip. You could follow ARM's approach and get to the same level of performance, but I don't think anyone is building ARM chips that massive.
Intel's libraries and compiler are great if you only care about Intel chips.
There have been documented cases in the past where their libraries/compilers would specifically detect that they are running on non-Intel chips and deliberately run a slower version of the code, despite the CPU supporting the faster version that they use on Intel chips. This gave Intel an unfair advantage in CPU performance comparisons. I have no idea if they are still doing that or not.
For example, see this post from 2009: https://www.agner.org/optimize/blog/read.php?i=49
Well, if you have a line of CPUs all evolving an architecture dating back to the 1980s, with continuous improvements and software backwards compatibility, you're going to end up with lots of capability flags. It's only just over 3 new flags per year on average since 1980.
What alternative would one suggest?
One of the other new instruction set changes is an "architecture version" capability testing mechanism, which certainly seems like a step in the direction of making things more uniform and easier to target.
Or you could look at it as yet another thing that needs to be tested for, if you want to address older processors that don't have it.
Dark Helmet: What the Hell am I lookin’ at? When does this happen in the movie?
Sandurz: Now. Whatever you’re looking at now, is happening now.
Dark Helmet: Well, what happened to then?
Sandurz: We just passed it.
Dark Helmet: When?
Sandurz: Just now.
Dark Helmet: Well, go back to then.
Sandurz: We can’t.
Dark Helmet: Why not?
Sandurz: We already passed it.
Dark Helmet: When will then be now?
Sandurz: Soon.
The optimal method would have the code written for large vectors, with the hardware working out how to split them into multiple instructions/data to suit the execution pipeline and power efficiency. E.g. the code says 4096-bit chunks, which the hardware can process 128, 256, or 512 bits at a time. The compiler outputs the same code for all of them; the hardware microcodes how it's split up.
I think ARM have been working on it.
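That's essentially the model ARM's SVE takes (RISC-V's V extension is similar): the binary never bakes in a vector width, and the same code runs on 128-, 256-, or 512-bit implementations. A minimal vector-length-agnostic loop with SVE intrinsics, as a sketch (needs an SVE-capable compiler and CPU, e.g. -march=armv8-a+sve):

```c
#include <arm_sve.h>
#include <stdint.h>

/* Vector-length-agnostic add: the same binary runs on hardware with 128-,
 * 256-, or 512-bit SVE vectors.  svcntw() reports how many 32-bit lanes the
 * CPU actually has, and the predicate masks off the tail. */
void vla_add(float *dst, const float *a, const float *b, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32_s64(i, n);
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, dst + i, svadd_f32_x(pg, va, vb));
    }
}
```

AVX10's 256-bit vs 512-bit split, by contrast, still bakes the width into the binary, which is the complaint further up the thread.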