An example
A long time ago I spotted the opportunity to replace four complex multiplications (24 floating ops) by three integer adds, one table lookup, and one complex multiply.
On a VAX CPU of the early 1980s that was a big win.
On today's, it's probably a big lose, because DRAM access is so slow compared to registers. Although, if the entire lookup table would fit into the CPU cache and be accessed many times from cache, maybe not.
And as for implementing it in a GPU ... I don't do this sort of coding any more. One thing for sure, a compiler isn't going to help. You may well have to go all the way back to the maths and choose a different algorithm.