Reply to post: Re: Intel was fudging

Monday: Intel touts 28-core desktop CPU. Tuesday: AMD turns Threadripper up to 32

Peter Gathercole Silver badge

Re: Intel was fudging

Yes, but even IBM has backed off from pushing clock speed in favour of adding more parallelism.

The Power6 processor had examples clocked at 4.75GHz, but on the following Power7 the clock speed was reduced to below 4GHz (while the number of SMT threads went from 2 to 4, and the number of cores per die also went from 2 to 4). Power8 kept the speed similar, but again increased both the SMT threads and the cores per die.

In order to drive the high clock speeds in Power6, they had to make the processor perform in-order execution of instructions. For most workloads, adding more execution units, reducing the clock speed, and putting out-of-order execution back into the equation allowed the processors to do more work overall, although single-threaded processes could end up slower.
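As a rough sketch of why that trade-off works (plain C, not tied to any particular POWER implementation): the loop below mixes a serial dependency chain with independent work, and an out-of-order core can keep the independent adds flowing while a long-latency multiply or a cache miss holds up the chain, where an in-order core simply waits.

    /* Sketch only: dependent vs independent work in a single loop. */
    double chain_and_independent(const double *a, const double *b, long n)
    {
        double dep = 1.0;   /* each multiply depends on the previous result */
        double ind = 0.0;   /* the adds are independent of the multiplies   */
        for (long i = 0; i < n; i++) {
            dep *= a[i];    /* serial dependency chain                      */
            ind += b[i];    /* an out-of-order core can keep issuing these
                               while a multiply (or a cache miss on a[i])
                               stalls the chain; an in-order core waits      */
        }
        return dep + ind;
    }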

The argument about compiler optimization really revolves around how well the compiler knows the target processor. Unfortunately, compilers generally produce generic code that will work on a range of processors in a particular family, rather than a specific model, and then rely on run-time hardware optimization (like OoO execution) to get the best out of the processor.
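As a rough sketch of the distinction (the file name is made up, and the GCC flags are shown only to illustrate the general mechanism on a POWER system, not as a build recommendation):

    /* sum.c: a trivial kernel whose generated code can differ noticeably
     * depending on how precisely the compiler is told about the target.
     *
     * Generic build, runs on any model from Power7 upwards:
     *     gcc -O2 -mcpu=power7 -mtune=power8 -c sum.c
     * Build scheduled for the exact machine doing the compiling:
     *     gcc -O2 -mcpu=native -c sum.c
     */
    double sum(const double *x, long n)
    {
        double s = 0.0;
        for (long i = 0; i < n; i++)
            s += x[i];
        return s;
    }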

In order to get the absolute maximum out of a processor, it is necessary to know how many and what type of execution units the processor has, and to write code that keeps them all busy as much of the time as possible. Knowing the cache size(s) and keeping them primed is also important. SMT or hyperthreading is really an admission that generic code cannot keep all of the execution units busy, and that you can get useful work done by having more than one thread executing on a core at the same time.
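As a minimal sketch of what "keeping the execution units busy" looks like at the source level (the unroll factor of four and the function name are purely illustrative, assuming a core with several independent floating-point pipelines):

    /* Dot product written with several independent accumulators so that
     * more than one floating-point unit can be fed per cycle. The right
     * unroll factor depends on the latency and number of FP units in the
     * actual core, which is exactly the model-specific knowledge needed. */
    double dot(const double *a, const double *b, long n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        long i;
        for (i = 0; i + 3 < n; i += 4) {
            s0 += a[i]     * b[i];      /* four independent chains: the   */
            s1 += a[i + 1] * b[i + 1];  /* core can overlap them instead  */
            s2 += a[i + 2] * b[i + 2];  /* of waiting for each add to     */
            s3 += a[i + 3] * b[i + 3];  /* complete before the next one   */
        }
        for (; i < n; i++)              /* remainder of the array         */
            s0 += a[i] * b[i];
        return (s0 + s1) + (s2 + s3);
    }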

I will admit that a very good compiler, targeting a specific processor model that it knows about in detail, is likely to be able to produce code that is a good fit. But often the compiler is not that good. You might expect the Intel compilers to reflect all Intel processor models, but my guess is that there is a lead time before the compiler catches up with the latest members of a processor family.

I know a couple of organizations that write hand-crafted Fortran (which generates very deterministic machine code, which is then examined), where the compiler optimizer rarely makes the code any faster and is often turned off so that the code executes exactly as written. This level of hand optimization is only done on code that is executed millions of times, but eliminating just one instruction from a loop run thousands of millions of times can provide useful savings in runtime.
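To give a flavour of the kind of saving involved (a made-up example, written in C rather than Fortran for brevity): hoisting one multiply out of a loop that runs thousands of millions of times is exactly the sort of single-instruction elimination described above.

    /* As first written: scale * offset is recomputed on every iteration. */
    void scale_naive(double *y, const double *x, long n,
                     double scale, double offset)
    {
        for (long i = 0; i < n; i++)
            y[i] = scale * x[i] + scale * offset;
    }

    /* Hand-optimized: the invariant product is computed once. With the
     * optimizer turned off, the code runs exactly as written, so one
     * fewer multiply per pass adds up over billions of iterations. */
    void scale_hoisted(double *y, const double *x, long n,
                       double scale, double offset)
    {
        double k = scale * offset;
        for (long i = 0; i < n; i++)
            y[i] = scale * x[i] + k;
    }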

As long as an organization believes that hand-crafted code delivers better executables, it can justify the expense of producing it. That is its choice, and a generalization about the efficiency of compiler-generated code is not a reason to stop in the face of empirical evidence to the contrary. Sometimes, when pushing the absolute limits of a system, you have no choice but to make the code as efficient as possible using whatever means are available.
