Threading vs. Parallelism vs. Generalisation
Firstly, how the hell can a VAX C compiler figure out what to do on which processors? Take, for example, the following snippet.
int b = 0, c, i;
for (i = 1; i < 6; i++) {
    c = b * 5;
    b = i;
    printf("%d\n", c);  /* %d, not %s: c is an int */
}
Run that through a VAX C compiler. If it were as simple as stated in the comment above, we'd have enormous massively parallel computers already.
The problem with that kind of parallelism is that if the next iteration depends on the previous one, it all breaks down. Maybe this explains why we don't see VAX computers on sale on the high street.
Talking about element management software exceeding the system's ability to compute the process that is running in parallel alongside the management of the task is simply ridiculous. Where massive parallelism is required to the extent that your management software is taking up too much time, the solution is simple: don't use any bleedin' management software!
Take as an example a computer we all know. It has a massively parallel design and is constructed of billions of processing elements which are individually very simple, yet together can handle far more powerful tasks than a Cray X1 can today.
We all have one: it's our brain.
What makes this computer special isn't simply its neurological design but, more importantly, its ability to distribute processing via lots of small adaptive tasks which can hypothesise what they are supposed to be doing based on what they are capable of doing and what data is provided. Couple that with the supportive neurotransmitter logic and its influence on dendritic communication, and you have a machine far more powerful than the old arithmetic and logic units joined together in parallel.
An ALU is important because we need to be able to do mathematical calculations quickly; this was the first requirement of a computer and the only reason for the Difference Engine. Couple the technology in an ordinary CPU with the networked routing technology of today's large parallel processors, and generalise!
You take the loose logic and chaotic communication of a neural network design and couple that with very fast maths processing elements. Data is no longer routed to processor ID x of y; instead you think around the CPU addressing issue and just say: send this to a chip, get a response, then fire the result off to another chip.

You simplify the design by having the blocks of processing data sent out into a wide blue yonder of processors, each with a private memory space and a shared memory space. The only issue left to worry about is locking, but by using the shared memory in read and write modes and allowing only read-after-write and write-after-read (plus various other logical memory management rules), you basically achieve something akin to a cross between a hardware silicon neural network and a classical computational system.

You can even have flexible processor addressing influence the memory management, so that five processors in a cluster may read and write a certain area of memory asynchronously. By giving each process block an ID which is essentially a memory address identifying which chip is running the code, you can write memory, wait until the processor dealing with block xxx reads it, write to that area again, wait until processor yyy reads _that_ memory, and write to ... so on and so forth...
Managing the processors no longer takes up any time, except the time spent waiting, which is required to avoid locking. You can't simply compile a program to do this, but you can extend existing threading models to respect the design of the system.
It's just an idea I've been playing with for a few years; never gonna pursue it though. I don't earn enough money on my meagre income ;P
K,