
"If an HPC application spends much time waiting on IO then someone needs to call in a real HPC expert to give the setup a once-over, because that's a total waste of time (as you rightly point out)."
Define "much"! :)
How does a "real HPC expert" magic up no mem-waits on a 16-core Xeon running sparse matrix code with a 16-way set-associative shared L3 cache? The killer micros have taken over, and they are a lot faster than the beasts that came before them, but it's also much harder to extract peak performance from them with apps that have large memory footprints. I'm not having a dig, just pointing out that some problems are inherently awkward. :)
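
To make the sparse matrix point a bit more concrete, here is a rough sketch of a CSR matrix-vector product in Python. The names (csr_spmv, flops_per_byte) and the byte-counting model are my own illustration, not taken from any particular library, and the traffic estimate assumes the worst case where nothing in x is reused from cache.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x[col_idx[k]] is an indirect load: the address depends on
            # data (col_idx[k]) that itself has to be fetched first, so
            # the access pattern into x follows the sparsity structure.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


def flops_per_byte(nnz, n_rows, value_bytes=8, index_bytes=4):
    """Rough arithmetic intensity of CSR SpMV.

    Assumption (illustrative only): every x element is fetched from
    memory once per nonzero, i.e. no cache reuse at all.
    """
    flops = 2 * nnz  # one multiply and one add per nonzero
    traffic = (
        nnz * value_bytes            # matrix values
        + nnz * index_bytes          # column indices
        + (n_rows + 1) * index_bytes # row pointers
        + nnz * value_bytes          # gathered x elements (no reuse)
        + n_rows * value_bytes       # y written back
    )
    return flops / traffic


# Tiny 3x3 example: [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]

y = csr_spmv(values, col_idx, row_ptr, x)
intensity = flops_per_byte(nnz=len(values), n_rows=len(row_ptr) - 1)
```

Under that pessimistic model the kernel does well under one floating-point operation per byte moved, which is why this kind of code tends to wait on memory however carefully it is tuned.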