+O9
Well, I have seen large (300%) gains from optimisation: partly from profile-based optimisation, partly from letting the compiler go to the max, and partly from enabling linker optimisations. Most developers go for fastest compilation, put whatever is working into test, and then nobody wants to change anything for production.
Some optimisations can even effectively rewrite the source code, so that a +3 followed by a +4 gets turned into a single +7 machine instruction.
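To make the +3/+4 case concrete, here is a minimal sketch in C. The function name `add_three_then_four` is made up for illustration. With optimisation enabled, a typical compiler such as GCC or Clang would be expected to merge the two additions into one add of 7, though the exact instruction emitted depends on the compiler, its version and the target.

```c
#include <stdio.h>

/* Hypothetical example function: two separate additions in the source.
 * An optimising compiler is free to combine them into a single "+7",
 * because the result is the same for every input that does not overflow. */
int add_three_then_four(int x)
{
    x += 3;   /* first source-level addition  */
    x += 4;   /* second source-level addition */
    return x; /* same value as x + 7          */
}
```

One way to check on your own toolchain is to compile to assembly with and without optimisation (for example `cc -O0 -S` versus `cc -O2 -S`) and compare the body of the function.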
What cannot improve, though, is dependency on I/O. A single stream will not go any quicker, but you will probably be able to run more processes in parallel and have them all blocked on I/O... This I have also seen, but not in the last 20 years.
So, yes, I can believe gains in non-I/O-bound CPU hogs, such as in-memory machine learning, but not for anything much else.