Let me just add this one optimisation to the code ...
I was implementing an image filtering algorithm published by a competing research group, to compare it against our own shiny new algorithm. The authors of the competing algorithm weren't allowed to share their implementation due to IP issues, so in the interest of fairness we decided to do our level best to squeeze every last drop of performance out of our implementation of the competitor's algorithm, just as we had for our own, before comparing them. We had successfully reduced computing time by a factor of three on both methods when I thought of one more optimisation for our competitor's implementation. It boiled down to reusing earlier results rather than recomputing them, at no extra memory cost. Easily done. It knocked another 7% off the wall-clock time. Not much, but nice to have, and it took just ten minutes to implement.
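To illustrate the kind of reuse described here, a minimal sketch in Python: caching a per-pixel neighbourhood statistic so that repeated queries return the stored result instead of recomputing it. The function names and the 3×3 statistic are purely hypothetical, not the competitor's actual algorithm, and the cache below does cost memory, unlike the trick in the story, which reportedly reused already-allocated results.

```python
from functools import lru_cache

def make_cached_stat(image):
    """Return a function computing a 3x3 neighbourhood sum at (x, y),
    caching each result so repeated lookups are free.
    (Illustrative placeholder, not the real filter.)"""
    height, width = len(image), len(image[0])

    @lru_cache(maxsize=None)
    def stat(x, y):
        # Expensive per-pixel computation, done at most once per pixel.
        return sum(image[j][i]
                   for j in range(max(0, y - 1), min(height, y + 2))
                   for i in range(max(0, x - 1), min(width, x + 2)))

    return stat

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
stat = make_cached_stat(image)
print(stat(1, 1))  # 45: sum of the full 3x3 neighbourhood
print(stat(0, 0))  # 12: clipped 2x2 neighbourhood at the corner
```

The second call to `stat(1, 1)` would hit the cache rather than redo the loop, which is the essence of the 7% win: same answers, less recomputation.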
It worked like a charm on all images in our test set. Except for one, where it seemed to enter an infinite loop, never terminating until we killed it. I spent hours and hours trying to work out what was wrong, but ultimately had to give up. We left this last optimisation out of the final test. The 7% didn't make much difference anyway, as our algorithm was far faster and had a better worst-case computational complexity: O(N log N) vs O(N² log N).