Re: CPU
“If you want to make modifications of ARM chips you’re not allowed to”.
A) Rubbish. Of course it’s allowed: there’s a license available to add your own special instructions to an ARM core.
B) As an ASIC designer, team lead, and manager, I’ve been involved in dozens of tech tradeoffs. There’s always some bright young thing who thinks they are the first to consider a custom instruction to accelerate a tight loop. And we always consider it. But after a careful tradeoff analysis, it always works out better to hang the logic off the bus as a memory-mapped IP block.
Multiple reasons:
1 Cost of re-optimising and re-verifying the CPU core.
2 Risk of yield problems on the modified CPU core (big one this).
3 Hanging extra logic onto the fan-out of the CPU registers increases the required drive strengths, which means extra power dissipation all the time (not just when executing the tight loop).
4 It also messes up the whole floorplan, which tends to drop the clock frequency. Have you ever wondered what an amazing coincidence it is that all the CPU dies you’ve seen are rectangular? There’s no a priori reason for the gates to pack nicely like that. Add just ten gates in the wrong place, and *everything* has to move to the other side, which completely alters the critical paths, since in many cases they are dominated by routing delay.
5 Most tight loops iterate over a block of data, possibly larger than the L1 cache. The *last* thing you want is to waste memory bus cycles pulling all that data into and out of registers, with cache misses along the way. Better to provide dedicated on-chip scratchpad SRAM that never gets swapped out, and operate on it without involving the memory bus.
6 When you’re not executing the tight loop, you can power down an IP block. You can’t separately power down a tightly coupled execution unit within the CPU core.
Academics and “researchers” often think a custom instruction is cool and makes things more “flexible” in software. It really isn’t cool, and it isn’t any more flexible than a well-designed, parameterised memory-mapped IP block either.
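For anyone who hasn’t driven one of these from software, here is a rough sketch in C of what a parameterised memory-mapped block with its own scratchpad (points 5 and 6) might look like. Everything in it is made up for illustration: the `acc_regs_t` register layout, the bit assignments, the scaled-sum operation, and the `acc_model_step` function, which merely stands in for the hardware so the code runs on a host. On real silicon the pointer would be a fixed bus address from the SoC memory map, and the power gating would normally go through a separate power controller.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register map; names, layout and bits are invented. */
#define ACC_SCRATCH_WORDS 256u
#define ACC_CTRL_START    (1u << 0)
#define ACC_STATUS_DONE   (1u << 0)

typedef struct {
    volatile uint32_t power;   /* 1 = block powered, 0 = power-gated   */
    volatile uint32_t ctrl;    /* write ACC_CTRL_START to kick off     */
    volatile uint32_t status;  /* ACC_STATUS_DONE set when finished    */
    volatile uint32_t length;  /* parameter: number of words to use    */
    volatile uint32_t coeff;   /* parameter: multiplier for each word  */
    volatile uint32_t result;  /* sum of coeff * scratch[i]            */
    volatile uint32_t scratch[ACC_SCRATCH_WORDS]; /* dedicated SRAM    */
} acc_regs_t;

/* Host-side stand-in for the hardware (assumption, not a real device):
 * when powered and started, compute the result and raise DONE. */
static void acc_model_step(acc_regs_t *r)
{
    if (r->power && (r->ctrl & ACC_CTRL_START)) {
        uint32_t sum = 0;
        for (uint32_t i = 0; i < r->length; i++) {
            sum += r->coeff * r->scratch[i];
        }
        r->result = sum;
        r->ctrl &= ~ACC_CTRL_START;
        r->status |= ACC_STATUS_DONE;
    }
}

/* Driver: power up, load scratchpad, set parameters, run, power down. */
uint32_t acc_scaled_sum(acc_regs_t *r, const uint32_t *data,
                        uint32_t n, uint32_t coeff)
{
    assert(n <= ACC_SCRATCH_WORDS);

    r->power = 1;              /* block is only powered while in use   */
    r->status = 0;

    /* A real SoC would typically fill the scratchpad by DMA; the CPU
     * copy here just keeps the sketch short. */
    for (uint32_t i = 0; i < n; i++) {
        r->scratch[i] = data[i];
    }

    r->length = n;             /* behaviour is set by parameters,      */
    r->coeff = coeff;          /* not by a new opcode                  */
    r->ctrl = ACC_CTRL_START;

    while (!(r->status & ACC_STATUS_DONE)) {
        acc_model_step(r);     /* on hardware: just poll, or sleep
                                  until an interrupt                   */
    }

    uint32_t result = r->result;
    r->power = 0;              /* gate the block off again (point 6)   */
    return result;
}
```

Calling `acc_scaled_sum(&dev, data, 4, 3)` on a zero-initialised `acc_regs_t dev` with `data = {1, 2, 3, 4}` returns 30. Note that the CPU core itself is untouched: the block lives behind the bus, and changing its behaviour means writing different values to `length` and `coeff`.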