Re: Nothing new, kinda pathetic really
Interesting article, thanks for the link.
The problem is that, fundamentally, not all programs are efficiently reducible to large amounts of parallelism; there will always have to be a compromise in hardware between being good at executing sequential code and being good at executing parallel code. And as long as that compromise has to be made, there are always going to be nasty tricks played to make sequential code run fast. So yes, making C run well has led to architectures with Spectre and Meltdown faults, but I'm not convinced there was ever much choice, even in an ideal world where we were all hell-bent on maximum possible parallelism.
For example, if we went to the absolute extreme and split every if() statement into three parallel threads (one evaluating the condition, the other two executing the two halves of the branch, all joined up again at the end), I'm guessing we'd run into problems implementing that in silicon at sufficient scale for it to be fast and efficient, even under an Actor / CSP model such as Erlang's. Interestingly, today's sequential instruction pipelines are in effect trying to do this exact thing, but they had to take shortcuts to make it fast, leading to Meltdown / Spectre. That ought to be some sort of warning.
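Just to make it concrete, here's a toy Go sketch of that "parallelise every if()" idea (entirely illustrative, names mine). Note it's only even safe here because the arms have no side effects; the moment they do, you're into exactly the territory Spectre exploits:

```go
package main

import "fmt"

// Toy illustration only: "speculatively" run both halves of an if()
// in parallel with the condition evaluation, then join all three and
// keep the result from the arm the condition actually selected.
// Only safe because the arms are side-effect free.
func speculativeIf(cond func() bool, then, els func() int) int {
	condCh := make(chan bool)
	thenCh := make(chan int)
	elseCh := make(chan int)

	go func() { condCh <- cond() }() // thread 1: the condition
	go func() { thenCh <- then() }() // thread 2: the "taken" arm
	go func() { elseCh <- els() }()  // thread 3: the "not taken" arm

	c, t, e := <-condCh, <-thenCh, <-elseCh // the join
	if c {
		return t
	}
	return e // the losing arm's result is simply thrown away
}

func main() {
	fmt.Println(speculativeIf(
		func() bool { return 2+2 == 4 },
		func() int { return 1 },
		func() int { return 0 },
	)) // prints 1
}
```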
That article by David Chisnall is pretty good; he ends up describing a machine that "would likely support large numbers of threads, have wide vector units, and have a much simpler memory model". He's basically just described the IBM Cell processor: no cache, static RAM local to each SPE core, SPEs that were essentially all vector unit, very fast IPC (not that code running on an SPE was really a "process"), the lot. And for those who really knew how to exploit it, tremendous performance could be extracted. It took Intel many, many years before their Xeons could match it for ultimate math performance.
He's right in that some parallel models are super-easy to use.
One that isn't is shared memory / resource locking with semaphores. That's not an easy thing to code properly, or implement in hardware.
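For anyone who hasn't been bitten yet, the canonical failure is two threads taking the same two locks in opposite orders. A minimal Go sketch of the trap (illustrative only):

```go
package main

import "sync"

// The classic lock-ordering bug: two goroutines take the same two
// locks in opposite orders. In testing this usually runs fine; it
// deadlocks only on the rare interleaving where each goroutine holds
// its first lock while waiting for the other's.
func main() {
	var a, b sync.Mutex
	var wg sync.WaitGroup
	wg.Add(2)

	go func() {
		defer wg.Done()
		a.Lock() // holds a...
		b.Lock() // ...and waits for b
		b.Unlock()
		a.Unlock()
	}()

	go func() {
		defer wg.Done()
		b.Lock() // holds b...
		a.Lock() // ...and waits for a
		a.Unlock()
		b.Unlock()
	}()

	wg.Wait()
}
```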
The Actor model isn't a bad choice, but it has its own complexities. Sure, it's easy to get 200 threads all up and running and talking to each other, and have it work. What's very, very difficult is proving that you've not built in the potential for livelock or deadlock. With an Actor model system, it's perfectly possible to have an interconnection of threads where everything works until that one single day when something takes just a tiny bit longer, and the whole thing locks up. Except for trivial architectures, there is no amount of testing or analysis you can do on an Actor model system to prove it free of such bugs. The moment you have a loop in your architecture drawing...
A better formulation of the same basic idea is Communicating Sequential Processes. It's basically the same idea as the Actor model, but it differs in one crucial regard: sending / receiving a message is an execution rendezvous. A thread won't return from a send until the receiving thread has finished receiving. This makes a world of difference.
Firstly, if you have accidentally architected the thread interconnectedness so that it can livelock, deadlock, whatever, it will happen all the time, every time, in exactly the same way. This means you can spot it happening easily!
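Go's unbuffered channels implement precisely that rendezvous, so the contrast is easy to show. In this toy sketch (names mine), two threads each send before they receive. With buffered, Actor-ish channels it happens to work; make the channels unbuffered and every single run dies instantly with Go's "all goroutines are asleep - deadlock!":

```go
package main

import "fmt"

func main() {
	// Buffer size 1 is the Actor-ish, asynchronous style: both sends
	// complete into the buffers and everything works, until, in a real
	// system, a buffer somewhere fills up on a bad day. Change size to
	// 0 (CSP rendezvous) and both goroutines block in their sends on
	// every run; the runtime reports the deadlock immediately.
	const size = 1 // set to 0 for rendezvous semantics

	aToB := make(chan string, size)
	bToA := make(chan string, size)
	done := make(chan struct{})

	go func() { // thread A: sends first, then receives
		aToB <- "hello from A"
		fmt.Println("A got:", <-bToA)
		done <- struct{}{}
	}()

	go func() { // thread B: also sends first, then receives
		bToA <- "hello from B"
		fmt.Println("B got:", <-aToB)
		done <- struct{}{}
	}()

	<-done
	<-done
}
```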
The other difference is that the architecture is now algebraically analysable; indeed, there is a process calculus behind CSP, put together by Tony Hoare in the 1970s, for this very purpose. This makes it possible to mathematically prove that a system won't run into livelock or deadlock problems.
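To give a flavour of the algebra (my toy example, roughly in Hoare's notation): take two processes that each insist on doing their own event first, and compose them in parallel over the shared alphabet {a, b}; the calculus reduces the composition straight to STOP, i.e. a provable deadlock:

```latex
P = a \to b \to \mathrm{STOP} \qquad
Q = b \to a \to \mathrm{STOP} \qquad
P \parallel_{\{a,b\}} Q = \mathrm{STOP}
```

Tools such as FDR mechanise exactly this sort of deadlock-freedom check on much bigger systems.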
Personally speaking, I'm a massive fan of CSP and I've used it on some very large parallel systems over the past decades. It's never let me down, and team members have enjoyed doing it. Erlang is in much the same spirit (though strictly its asynchronous mailboxes put it closer to the Actor model), and Rust and Go support CSP-style channels directly.
The biggest problem in making any real difference is, as usual, the legacy installed base. For a machine as described by Chisnall, it means starting again from scratch with the entire world's hardware, software and firmware ecosystem.
We very nearly pulled this off in the late 1980s / early 1990s, with the Inmos Transputer. This was an all-out CSP CPU, with hardware threads, not unlike the machine Chisnall describes (minus the vector units). For a while it was looking like the only way to get faster computers was architectures like the Transputer, deployed in large collections. There was considerable girding of loins, learned articles in Byte magazine, and a lot of facing up to the prospect that yes, we were all going to have to adopt CSP / parallel programming to make further progress.
Two things went wrong. One was that, though the CSP model itself is very sweet to use, the development tooling for Transputers was biblically bad, even by the standards of the day. It was catastrophically difficult to debug code in networks of Transputers. This wasn't even CSP's fault; Inmos simply "forgot" about the need to debug, and didn't put any sensible means of doing so into the hardware design.
The other problem was that, whilst Inmos (a typical, underfunded British tech company that tried to do it all itself in-house) was messing around trying to get us all to invest in networks of Transputers running at 30MHz (max), Intel cracked the CPU clock rate problem and put out first 33MHz 486s, then 50MHz, 66MHz, 100MHz, and so on. Suddenly, to make performance gains one could just buy a new PC, et voilà! The need for parallelism disappeared more or less overnight, and the whole parallel programming movement folded up its tent and went back to sequential programming.
We've been there more or less ever since; only recently has CSP started to re-emerge in Rust, Go and Erlang. What's crazy about these (and this is really Chisnall's point) is that today we can have CSP-architected code running on top of an OS written fundamentally around a sequential / SMP model of hardware, which in turn runs on hardware that synthesizes an SMP environment using cache coherency and inter-core links. Stripped of the SMP-ness, that underlying hardware is actually a NUMA architecture very much like a network of Transputers, or not so very different to IBM's Cell processor. There are now so many layers of abstraction between good coding models and the actual execution hardware that it's probably impossible to strip them all out.
Another problem with Chisnall's proposed machine is that it only really makes ultimate sense if the number of software threads in the entire system is less than the number of hardware threads (though, to be fair, I don't think Chisnall is suggesting that). Each software thread can then be allocated to a hardware thread and left to run there for as long as it persists, which allows the hardware thread to be a very simple implementation. This is basically what an SPE in a Cell processor is.
The problem with there being more software threads than hardware threads is that you then need context switching, inter-process memory protection, etc., and suddenly the hardware thread is a much more complicated thing, if the context switch is going to be fast. On the Cell processor you basically had to do the context switch yourself, in software: loading a different bit of code into an SPE's RAM, then loading the data for it to execute against, then unloading the results and loading the code for the next piece. You could do some very neat things (e.g. move code instead of moving data), but it was all you, you, you in the source code. This made it tough, but oh so good.
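That "move code instead of moving data" trick has a cheap echo in today's CSP-ish languages; in Go you can send a closure down a channel, so the work travels to where the data already lives, SPE-style. A sketch (all names invented):

```go
package main

import "fmt"

// "Move code instead of moving data": the worker owns a big chunk of
// data and never ships it anywhere; callers send it closures to run
// against that data instead.
func main() {
	type task func(data []int) int

	tasks := make(chan task)
	results := make(chan int)

	go func() { // the "SPE": owns the data, executes whatever code arrives
		data := []int{3, 1, 4, 1, 5, 9, 2, 6}
		for t := range tasks {
			results <- t(data)
		}
	}()

	// Ship a piece of code to the data, not the data to the code.
	tasks <- func(data []int) int {
		sum := 0
		for _, v := range data {
			sum += v
		}
		return sum
	}
	fmt.Println(<-results) // 31

	tasks <- func(data []int) int {
		max := data[0]
		for _, v := range data[1:] {
			if v > max {
				max = v
			}
		}
		return max
	}
	fmt.Println(<-results) // 9
	close(tasks)
}
```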
What Is The Best Thing To Do?
That's a real poser.
Those who adhere to the sunk cost fallacy (which is most people) would say "the legacy base is too big; we've got to stick with what we've already spent a lot of time and effort on". And, given that switching to a better architecture fundamentally means breaking SMP (and all current software), there would have to be some almighty costly carrot / stick to move enough people's opinions, and I don't know what that cost is.
Basically, humanity is not well disposed to the idea of accepting short-term pain for long-term gain. Just look at the fuss over climate change... So for Chisnall's ideas to take off (which I would like to see), there'd have to be some really bad crunch point, not unlike the one we were facing in the late 1980s / early 1990s, where progress had really stopped and there was nowhere else to go.