Re: Time for NUMA, Embrace your Inner CSP
Thanks for the CSP references, best of luck with them too, though the marketdroids and 'security researchers' won't thank you for them. Maybe there's a DevOps angle on them somewhere?
No worries. I've no idea what security researchers would think, etc. Adopting CSP wholesale is pretty much a throw-everything-away-and-start-again thing, so if there is a DevOps angle it's a long way in the future!
Personally speaking I think the software world missed a huge opportunity to "get this right" at the beginning of the 1990s when Inmos Transputers (and other things like them) looked like the only option for faster computers. Then Intel cracked the clock frequency problem (40MHz, 66MHz, 100MHz, topping out at 4GHz) and suddenly the world didn't need multi-processing. Single thread performance was enough.
It's only in more recent times that multiple core CPUs have become necessary to "improve performance", but by then all our software (OSes, applications) had been written around SMP. Too late.
As far as I know, modern RISC processors have tended to be built around a memory model which does not require external memory to appear consistent across all processors at all times. So if some code wants to know that its view of memory is consistent with what every other process/processor sees, it has to take explicit action to make that happen.
Indeed, that is what memory fences are: opcodes that let software explicitly tell the hardware to "sort its coherency out before doing anything else". Rarely does one issue these oneself; they're normally included in other things like sem_post() and sem_wait(), which call them for you. The problem seems to be that the CPUs will have a go at doing the work ahead of time anyway, so that when a fence is reached in the program flow it takes less time to complete. And this is what has been exploited.
Where can readers find out more about this particular historical topic and its current relevance?
A lot of it is pre-internet, so there weren't vast repositories online to be preserved to the current day! The Meiko Computing Surface was a supercomputer based on Transputers - f**k-loads of them in a single machine. We used to have one of these at work - it had some very cool super-fast ray tracing demos (pretty good for 1990). I heard someone once used one of these to brute-force the analogue scrambling / encryption used by Sky TV back then, in real time.
The biggest barrier to adoption faced by the Transputer was development tooling; the compiler was OK, but machine config was awkward and debugging was diabolically bad. Like, really bad. Ok, it was a very difficult problem for Inmos to solve back then, but even so it was pretty horrid.
I think that this tainted the whole idea of multi-processing as a way forward. Debugging in Borland C was a complete breeze by comparison. If you wanted to get something to market fast, you didn't write it multi-thread back in those days.
Debugging a multi-threaded system is actually very easy with the right tooling, but there's simply not a lot of that around; a lot of modern debuggers are still rubbish at this. The best I've ever seen was the Solaris version of the VxWorks development tools from WindRiver. These let you have a debugger session open per thread (which is really, truly nice), instead of one debugger handling all threads (which is always just plain awkward). WindRiver tossed this away when they moved their tool chain over to Windows :-(
There was a French OS called (really scraping the memory barrel here) Coral; this was a distributed OS where different bits of it ran on different Motorola 68000 CPUs. I also recall seeing demos of QNX a loooong time ago where different bits of it were running on different computers on a network (IPC was used to join up parts of the OS, and these could just as easily be network connections).
The current relevance is that languages like Scala, Go and Rust all have CSP implementations in them. CSP can be done in modern languages on modern platforms using language fundamentals instead of an add-on library. In principle, one attraction of CSP is system scalability: your software architecture doesn't change if you take your threads and scatter them across a computer network instead of hosting them all on one computer. Links are just links. That's a very modern concept.
Unfortunately, AFAIK Scala's, Go's and Rust's CSP channels are all stuck in-process; they aren't abstract things that can be implemented as either a TCP socket, an IPC pipe, or an in-process exchange (corrections welcome from Go, Scala and Rust aficionados). I think Erlang's channels do cross networks. Erlang even includes an ASN.1 facility, which is also very ancient but super-useful for robust interfaces.
The closest we get to true scalability is ZeroMQ and NanoMsg; these let you very readily switch between joining threads up with IPC, TCP, in-process exchanges, or combinations of all of those. Redeployment across a network is pretty trivial, and they're blindingly fast (which is why I've not otherwise mentioned RabbitMQ; its broker is a bottleneck, so it doesn't scale quite as well).
I say closest - ZeroMQ and NanoMsg are Actor Model systems (asynchronous). This is fine, but it has some pitfalls that have to be carefully avoided, because they can be lurking, hidden, waiting to pounce years down the line. In contrast, CSP (which has the same pitfalls) thrusts the consequences of one's miserable architectural mistakes right in one's face the very first time you run your system during development. Perfect - bug found, can be fixed.
There's even a process calculus (a specialised algebra) that one can use to analyse the theoretical behaviour of a CSP system. This occasionally gets rolled out by those wishing to have a good proof of their system design before they write it.
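For a flavour of the notation, here's Hoare's textbook vending machine written in CSP's prefix style (from memory, so treat it as a sketch):

```
VMS    = coin -> choc -> VMS     -- accept a coin, dispense a choc, repeat forever
CUST   = coin -> choc -> CUST    -- a customer doing the complementary thing
SYSTEM = VMS || CUST             -- parallel composition, synchronising on shared events
```

Tools exist that take definitions like these and mechanically check properties such as deadlock freedom before a line of implementation code is written.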
Not bad for a 1970s computing science idea!
Open MPI is also pretty good for super-computer applications, but is more focused on maths problems than on just being a byte transport.