@David Dawson, this is still funny:
"Logically prove to me that the best course of action is always the hardest and I'll donate my organs to charity here and now."
Can't be done. You only discover the need for efficiency when your biggest cost becomes power consumption, and power consumption has a nasty way of becoming unsustainable and unexpandable as things grow. For example, what if your next round of upgrades means laying in a £20M power cable to run an extra £1M of servers? So it's something you will have to discover for yourself. Of course, if a business doesn't grow to such sizes then power consumption may never become the biggest issue, and the pressure to do things the hard way never builds. Facebook, Google and Twitter all did grow massively, and power is certainly their biggest issue. Google now build their own power stations and sell power to electricity consumers when the world is asleep and not doing so many searches!
"Your examples aren't particularly compelling. Facebook did indeed compile PHP. but most of the PHP execution time is in C modules anyway, so they are optimising whats left. True, this emotionally feels imperfect, but at 500 million users, it seems to be holding its own...."
So if PHP is mostly executing in C libraries, why did they feel the need to compile it in the first place? Whatever inefficiencies they had resided in their source code, not in the library routines they were using. They've saved a little by eliminating the run-time interpreter, but those inefficiencies are still there in the source code. It is holding together, but at what cost to their profit margin? Their server farm costs must be tens of millions a year, and even a small saving would likely pay for a re-write in a more run-time-efficient language.
"Given 2 theoretical architectures, one scala/ immutable message based, and the other C++ using shared memory with mutexes, semaphore and what not."
Who said anything about shared memory and mutexes? I've been using OS IPC primitives such as pipes for message passing between threads for twenty years. In modern unixes, this sort of pipe:
#include "unistd.h"
int pipe (int filedes[2]);
and on Windows:
BOOL CreatePipe(PHANDLE hReadPipe, PHANDLE hWritePipe, LPSECURITY_ATTRIBUTES lpPipeAttributes, DWORD nSize);
People need to read library references more. Pipes are fast, very scalable (on unixes and Windows they're effectively interchangeable with sockets), very easy, and quite well suited to modern CPU architectures that use high-speed serial links between CPU cores. This is message passing done at the lowest possible level with the least possible impediment to performance.
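To make that concrete, here's a minimal sketch (POSIX with pthreads; the struct msg and worker names are just made up for the example) of a main thread feeding a worker over a pipe, with no shared state at all:

/* Minimal sketch, assuming POSIX and C99: one worker thread fed through a pipe. */
#include <unistd.h>
#include <pthread.h>
#include <stdio.h>

struct msg { int id; char text[32]; };   /* illustrative message format */

static void *worker(void *arg)
{
    int fd = *(int *)arg;                /* read end of the pipe */
    struct msg m;
    /* A zero or short read means the write end was closed: time to exit. */
    while (read(fd, &m, sizeof m) == (ssize_t)sizeof m)
        printf("worker got #%d: %s\n", m.id, m.text);
    return NULL;
}

int main(void)
{
    int fds[2];                          /* fds[0] = read end, fds[1] = write end */
    pthread_t t;

    if (pipe(fds) != 0)
        return 1;
    pthread_create(&t, NULL, worker, &fds[0]);

    for (int i = 0; i < 3; i++) {
        struct msg m = { i, "" };
        snprintf(m.text, sizeof m.text, "message %d", i);
        write(fds[1], &m, sizeof m);     /* the only way we talk to the worker */
    }

    close(fds[1]);                       /* no more messages: worker sees EOF */
    pthread_join(t, NULL);
    close(fds[0]);
    return 0;
}

Messages this small fit within PIPE_BUF, so each write is atomic and each read hands back exactly one whole message; swap the pipe for a socketpair or socket and very little changes.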
The idea in one form or another has been around since 1978 and clever people have been programming that way for many decades now:
http://en.wikipedia.org/wiki/Communicating_sequential_processes
Ah, the happy days of Transputers!
"The correct architectural/ algorithmic decision here will totally rule which solution wins, not the language per se."
Yes, designing for scalability is often important, and message passing between threads and processes is a good way to scale. Most people run away from that sort of architecture to begin with, but are forced into it sooner or later. But that doesn't dictate language choice. What does dictate language choice is the pressures on the business. Scaling up eventually means power consumption becomes the single most costly thing, so C++ or something like it starts looking attractive (if painful). Not scaling up gives one the luxury of indulging in an easier language. Scala and Node.JS might make using the right sort of architecture easier, but they can't be as runtime-efficient as a compiled native application with a minimal runtime environment between app and CPU.
"Some things need hyper efficient code that keeps the power usage down; but then, why not use C? or assembly? Heck, use C and GPU/ CUDA or something else to make your system scream? Why the obsession with C++?"
One consequence of message passing in C++ is that really you don't need C++. I tend to use C, and consider threads to be "opaque objects" (I'm not using shared or global memory) with "public interfaces" (I talk to them only through their pipes / sockets). All good object orientated paradigms.
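Something like this is what I mean. A rough sketch only, assuming POSIX, with the adder_* names invented for the example: the thread is the "object", the socketpair is its "public interface", and its state never leaves its own stack.

/* Sketch of a thread as an "opaque object": no shared memory, the socket is
   the public interface, the running total is private to the thread. */
#include <sys/socket.h>
#include <pthread.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

struct adder { int fd; pthread_t tid; };          /* the caller's handle */

static void *adder_run(void *arg)                 /* the "private" side */
{
    int fd = (int)(intptr_t)arg;
    int value, total = 0;                         /* state lives on this stack only */
    while (read(fd, &value, sizeof value) == (ssize_t)sizeof value) {
        total += value;
        write(fd, &total, sizeof total);          /* reply down the same socket */
    }
    close(fd);
    return NULL;
}

static int adder_create(struct adder *a)          /* "constructor" */
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    a->fd = sv[0];
    return pthread_create(&a->tid, NULL, adder_run, (void *)(intptr_t)sv[1]);
}

static int adder_add(struct adder *a, int value)  /* a "public method" */
{
    int total = 0;
    write(a->fd, &value, sizeof value);
    read(a->fd, &total, sizeof total);
    return total;
}

static void adder_destroy(struct adder *a)        /* "destructor" */
{
    close(a->fd);                                 /* worker sees EOF and exits */
    pthread_join(a->tid, NULL);
}

int main(void)
{
    struct adder a;
    if (adder_create(&a) != 0)
        return 1;
    printf("%d\n", adder_add(&a, 2));             /* prints 2 */
    printf("%d\n", adder_add(&a, 3));             /* prints 5 */
    adder_destroy(&a);
    return 0;
}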
GPUs and CUDA/OpenCL are moderately interesting, but modern GPUs are too powerful. They're fine if you need only one or two because then that's one PC and you can keep them busy. As soon as you need more than can fit in a single PC you're in trouble, because you can't keep them fed by sending data over a LAN; you need something more exotic.
In the branch of the large scale embedded systems world I occupy, PowerPC (of all things) is still just about the best thing because of the GFLOPS per cubic foot you can achieve. Intel aren't far behind and may be on a par with PowerPC now. As I said earlier, GPUs aren't bad, but you can't efficiently bolt together more than two of them. They're OK if your application is such that you can send them a small amount of data and let them chew on it for a lengthy period, because then you can match the interconnect performance to the compute power of a GPU. In my domain the interconnect performance is as important as the compute power, and PowerPC (8641D) is very good at interconnect.
Interestingly enough I'm beginning to hear of data centre operators making enquiries about this sort of kit because of the size/power/performance ratios that can be achieved.