Wisdom for the aged
"... At some point, the OS starts to get in the way, and the OS actually becomes the bottleneck."
So he is new to computers then.
IBM's Cell chip will struggle to woo server customers looking to turbo charge certain applications because the part has a fundamental design flaw, according to AMD fellow and acceleration chief Chuck Moore. Sure, sure. Cell is a multimedia throughput dynamo and its SPEs (Synergistic Processing Elements) are just lovely. "But …
Massive parallelism has little applicability to the average computer user, who already owns more CPU horsepower than they can use.
The drive toward massive parallelism comes in small part from a desire to run a host of virtual machine sessions for server applications, and to a lesser extent from a need for faster speeds in scientific computing.
The primary driver is marketing - a means by which the various CPU manufacturers can compete for bragging rights in the industry.
My desktop here, is a duel core 64 bit AMD CPU running in 32 bit mode. It is currently, and typically, displaying a static screen and performing some keystroke processing. Over its life, the CPUs in this machine will compute a trillionth or less of their theoretical computational output.
This is the reality of the typical desktop.
Since transistor sizes are about to hit a brick wall, and since CPU clock rates hit a brick wall a decade ago, it makes the most sense for manufacturers to look toward parallelism for increasing machine speeds. Multi-core CPUs (complex CPUs) are going to hit a brick wall at something around 16 cores. Now where, the Roadmapsters ask, can the industry get more computational power?
The most obvious answer is the not-so-general-purpose silicon on the video chips - which, as a result of their existing parallelism, are already pumping out more raw calculations than several thousand general purpose CISC CPUs.
The easiest way for the hardware guys to implement massive parallelism is to exploit the existing graphics hardware - tweaking it so that it is more compiler and programmer friendly.
Intel's take on the issue is to replace the graphics CPU with a more general purpose computing engine that can handle graphics as well - tweaked for producing improved graphics performance.
Which method is more applicable to general purpose computing?
Intel's roadmap. Clearly.
Which is more flexible and more readily upgraded?
Intel's roadmap. Clearly.
Which method will win in the long term?
Intel's roadmap. Clearly.
Now the question is... Since this massive parallelization will be available to all desktops as a natural consequence of having a video processor - no matter whose roadmap is taken... How will 1,000 times more general purpose computing power than now exists on the desktop (in an exploitable form) be used, when existing CPUs are already underutilized by many orders of magnitude?
The answer to that is Distributed computing.
Desktops will be enlisted, as a matter of course, to provide distributed computational resources to network service subscribers.
Need to model the folding of a protein? Just rent the time from 2% of the unutilized computational capacity of South Korea.
Need to model the atmosphere with a resolution of a meter? Just rent the time from 15% of the North American computational infrastructure.
AI bot not smart enough for you? Just offload its peak processing requirements to the PCs along your street, or across the city.
Now which Road Map best conforms to such general purpose computing requirements?
Intel's. Clearly.
AMD has fumbled badly with the purchase of ATI, although they are in some sense better off because they have now integrated a potential rival.
AMD had better start generalizing the CPUs on their derivative ATI graphics cores very rapidly.
But even in doing so, they run into the problem of running two radically different programming paradigms, while with Intel there is essentially still only one core instruction set to deal with.
So even if first to market wins some market share, INTEL wins in the long run.
And what of the motherboard chipset? Well it's going to be shrunk to an IO block and all the computational power is going to come from the former video co-processors as well.
So future machines are going to consist of two major components: a low CPU count CISC parallel core main processor, and a rapidly reconfigurable massively parallel core for graphics and other massively parallel computing, which also uses a few cores to provide the smarts for most of the IO tasks.
Since a programming dichotomy will continue to exist between the massively parallel environment and the CISC computing environment, and since the OS is not well suited to micromanaging the allocation of alien cores and dozens of threads for various IO functions, it makes the most sense to partition the OS into two components as well, with one OS subsystem managing the massively parallel cores and another, more traditional, system managing the CISC cores.
The programming paradigm will be partitioned similarly. Traditional low thread count parallelism will be implemented within the language that targets the CISC cores, and the massively parallel components will be paged in place by the component of the OS that manages the massively parallel cores, which will run their task as a batch job and return the results to the data space of the main cores.
What you are <NOT> going to see is an intrusion of massively parallel constructs into existing programming languages. You won't see much in the way of fine grained parallelism from within the application framework.
What you will see are OS calls to load massively parallel computing configurations along with calls to load up shared memory regions with the data to process and a call to run and return the result as a batch process.
It will look something like:
CALL Configure_Sort_Structure
OSCALL Push_Current_Parallel_Configuration()
OSCALL Configure_Parallel_Sort(MaxPracticalCPU)
OSCALL Run_Parallel_Sort(SortStructure)
OSCALL Pop_Parallel_Configuration()
The programming language for the massively parallel core section of the code will look less traditional, and most probably won't even be touched by most programmers, as they will rely on a library of batch processing functions from within the multiprocessing component of the OS itself.
Pre-installed components will include, in part:
Return_Cores_Available()
Configure_3d_Engine(n_cores,Version)
Configure_2d_Engine(n_cores,Version)
Configure_Audio(n_cores,Version)
Configure_Lan(Version)
Configure_USB(Version)
Configure_Serial_ATA(Version)
Configure_Base_IO(Version)
Configure_Parallel_Search(n_cores,Version)
Configure_Parallel_Sort(n_cores,Version)
Configure_Neural_Net(n_cores,Version,height,width,depth,input_array,output_array)
Configure_RayTrace(n_cores,Version)
Etc...
And with a networked computing infrastructure you are going to have:
Request_CPU_Resources(n_cores,Application_For_Purchase_Struct)
External_Configure_Neural_Net(n_cores,Version,height,width,depth,input_array,output_array,Limiting_Price = 2 cents)
External_Link_To_Global_AI_Engine(AI_Name,Consciousness_Instance,Limiting_Price = 3 cents)
External_Submit_Stimulus_Global_AI_Engine(Sense_Struct)
External_Return_AI_Status(*Response_Struct)
Etc....
And that, my little children, is the outline of your basic computational future for the next 100 years.
AMD isn't looking at what you and I do with our laptops. My current HP, roughly 4 months old (the 2002 vintage Vaio finally rolled over), has a dual core 64 bit Athlon, running 32 bit Vista.
But AMD's thrust isn't at the commodity priced desktop/laptop. It's at the big boxes. Four quad cores with 64 GB of memory and multiple PCI-E gigabit or 10gig ethernet connections to some massive farm. That's what the Itanic was aimed at, too.
But both AMD and Intel have the same intellectual agreement, even if they won't admit it. Cell computing is neat in the lab, but, as Moore says, the OS overhead gets to be too much too fast.
I've been increasingly annoyed by the dominance of x86. RISC isn't a "programmer's nightmare" as most programmers don't even know the x86 instruction set!
I do, but I'm pretty open to learn something else anyway. But I doubt anyone doing something other than kernel modules or device drivers would really care about the underlying instruction set, as it will all wrap nicely with the standard C libs.
Switching to another main architecture would be nice, as it seems we're stuck with more of the same, and have been for almost 20 years, since the release of the 386 (ok, maybe the 486?). Had PPC prevailed, maybe the 64-bit switch would've come earlier ... but oh well...
Go Cell!!!
You spend millions designing your massively complex chip. You hope it works first time because mistakes take months to fix. You have to get back your investment from a small market, so your prices are high. These big chips sit on big motherboards that are again not the mass market, so the motherboard manufacturer has to get their non-recurring engineering costs from a niche market. Add a non-standard enclosure, and your final price is huge.
If they went for small chips there is less risk of expensive failures, and NRE costs get spread over more sales. The result is that a super-computer built out of a large number of mass produced chips is cost effective compared to one built out of a small number specially designed expensive chips. A cell phone with a server processor is never going to sell.
Via are going in a good direction. The future will be DRAM, CPU and graphics on the same chip. Phones, cameras and PDAs will use one chip each. Laptops will have four or eight of the same chips on a SODI(memory+cpu) module. Desktops and servers will have a row of SODIMCM sockets.
AMD and Intel are outstanding at producing x86 monsters. World+dog can license MIPS and ARM cores and hire a fab to crank out excellent systems on a chip. AMD and Intel have to play to their strengths, but their strengths are becoming less and less relevant.
But not because it has a Power core - rather because the Cell SPEs have a very low physical memory limit (256 KB) and no way of using virtual memory. So they are very fast DSP chips. Even Nvidia's GF8 line is better, because it can run C code (with CUDA) without hacking around memory limits.
And the OS never gets in the way, because most OS kernels are fully multithreaded (except Linux and Mac OS X); even Windows NT 4.0 was multithreaded. Not to mention that Nvidia's GPU cores can run their own VLIW RISC based OS, so they don't even need a CPU to work.
However, the best idea comes from Intel. They just put old Pentium I cores into a single chip and call it a GPU. It's not the fastest solution, but certainly the most general purpose. You can even select a single core from this array and run a whole OS on it - they call it Atom. And yes, it runs Crysis, and yes, it's a single core from Intel's upcoming new GPU array.
On the other side, while even Nvidia is pushing out their own CPUs, AMD is stuck with a classic old big x86 CPU line and a traditional GPU line.
Fascinating to read the following:
Which method is more applicable to general purpose computing?
Intel's roadmap. Clearly.
Which is more flexible and more readily upgraded?
Intel's roadmap. Clearly.
Which method will win in the long term?
Intel's roadmap. Clearly.
Now which Road Map best conforms to such general purpose computing requirements?
Intel's. Clearly.
But wait... what was said before any of this ?
"My desktop here, is a duel core 64 bit AMD CPU running in 32 bit mode."
I assume that is dual core, not two cores having a duel - but why would someone who thinks Intel are vastly superior in their approach buy AMD ?
"You have to get going first on the PowerPC chip (inside Cell), and the PowerPC core is too weak to act as the central controller."
Okay... everybody is entitled to their opinion, but they should at least have the courtesy to back it up with their reasoning. Why is it so weak to act as a controller? Why is an x86, or more specifically an AMD x86 core, better at task management?
Having written Cell code, and thus my own task management, I haven't had ANY problem with the PowerPC core. It runs pretty well, it's lightweight, and very little of what needs to be done is impacted by the lack of out-of-order execution or major branch prediction.
The whole "number of viable cores limited by the OS" thing is a complete fallacy as far as I'm concerned. The OS should not be responsible for managing the cores, mainly because of the exact problem being mentioned: OS code is slow, and cumbersome because of all the features it has to support, and therefore not adept at handling core management.
When I write Cell code under Linux, I don't use OS functionality to send code/data to a SPU. Anybody who thinks it should be done via some form of IOCTL needs to be beaten with a wet kipper. Instead, I use the Cell SDK, which accesses the Cell functionality directly, with the OS sitting on top of all that. The OS doesn't need to manage core functionality... that's just silly as far as I can tell, and leaves the system more vulnerable to attack (please, somebody correct me if they think I'm wrong).
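For anyone curious what "accesses the Cell functionality directly" looks like in practice, here's a rough from-memory sketch of the libspe2 flow on the PPE side (error handling stripped out, and the embedded SPU program name is just an example, not a real SDK symbol):

#include <libspe2.h>

/* SPU program embedded at link time (via embedspu); name is illustrative. */
extern spe_program_handle_t my_spu_kernel;

int run_on_spu(void *argp)
{
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_stop_info_t stop_info;

    /* Create an SPE context, load the program, run it to completion.
       The application code never touches an ioctl; the SDK handles
       the low-level access and the OS just schedules the context. */
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    spe_program_load(ctx, &my_spu_kernel);
    spe_context_run(ctx, &entry, 0, argp, NULL, &stop_info);
    spe_context_destroy(ctx);
    return 0;
}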
It's no surprise that an AMD employee is giving negative press about the x86 markets biggest threat. After all, he's hardly going to say "yeah, this x86 stuff is a big pink elephant that we're stuck with but Cell rocks", is he?
Flames, because this will degenerate into a flame war, with all of the coders stuck in traditional development realising that they need to adapt beyond their comfort zone for the future.
OK, if he'd been pointing and laughing at Sun's T1 or T2 designs I'd agree, but Cell? I always thought the Cell design was one of the more interesting options out there and one which might have a chance of taking some share of the x86 market.
Anyway, the geezer said we need a chip that is easy to program, with optimised compilers, capable of both strong single-threaded processing and parallelism, and I assume he'd like a large on-die cache and a massive set of registers to help with all that task switching? And maybe a bi-endian design would help with all that porting of code from older chipsets (endianness being the order in which values are loaded into registers, either high byte first or low byte first, and most architectures, and therefore their associated programs, are stuck with one), just to make porting easier. What the heck, let's chuck in an up-and-coming interface like CSI to make the whole thing fit into the same motherboards as the next generation of low-end CPUs! There, you've just bought an Intel Itanium, the most flexible core design out there, which is openly admitted to be one of the simplest porting platforms around (which is why it runs so many OSs compared to x86, SPARC, Cell or Power). Have a nice (Intel) day!
When I first met Moore, I confused him with the more inventive and far brighter Chuck Moore of Forth fame.
Comparing the two, I'm inclined to side with Chuck #1 who absolutely believes in the multicore future.
AMD has no expertise in software OR platforms, so I give Chuck #2's architectural plan no Moore than a 10% chance of success.
Or MIPS64. Or Alpha. Or Sparc64. Or even, heaven forbid, what Intel, HP, CPQ and others (even Dell) hilariously used to call "Industry Standard 64" (which most of the world called Itanic).
Volume is key, as other comments have already noted. But these days there's no volume without volume applications, and no volume applications without a volume OS. Once upon a time, in the early days of NT, there was the Advanced RISC Computing consortium (Alpha, MIPS, PPC), and for a while, there was a prospect of a critical mass actually building up. But Billco effectively killed that opportunity when he reverted NT to x86-only, and the concept of RISC as an x86 alternative in the volume market died.
The Linux community could have continued to support alternatives to x86, but why would they, especially after AMD64 showed that 64bit didn't have to be totally incompatible with IA32 (unlike Itanium)? So the "general purpose" Linux market is also largely x86 (routers, STBs, phones, etc are a distinct market).
For the foreseeable future, the Wintel ecosystem *is* the volume market, for better or worse. They have the monopoly (even more so since Apple went x86), and monopolists don't give up their monopoly without a struggle.
As far as parallelism in software goes: there are two times when most people's PCs have a chance to exploit parallelism: (1) when background-printing something with something else in the foreground (2) when doing something in the foreground while a full-system virus scan is running in the background. Anything else, such as full-system indexing in the background, is likely an irrelevance; most of the things most PCs do most of the time simply cannot be usefully parallelised, and even the complex things some PCs do some of the time cannot usefully be parallelised in general (and unless they are CPU-limited, which few are, there's no benefit to parallelising them anyway).
Parallelisation is only new and shiny in the desktop/volume market, it's old hat in the real computer market, and it's clear to those with clue that parallelisation is not going to work any miracles in the desktop market. But it fills column inches, so that's OK.
It sounds like many commenters are making too many assumptions about the nature of parallel programming. In reality, many problems are serial in nature, in that they require output from one step before starting the next. Parallelism WILL NOT HELP in these cases.
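To make that concrete, here's a minimal C sketch of my own (a toy example, not from any real program): the first function is a recurrence, so step i can't begin until step i-1 has finished, and throwing more cores at it buys nothing; the second is embarrassingly parallel and could be carved up across as many cores or GPU threads as you like.

#include <stddef.h>

/* Serial by nature: each step needs the previous step's output. */
double iterate_map(double x, int steps)
{
    for (int i = 0; i < steps; i++)
        x = 3.9 * x * (1.0 - x);   /* x[i] depends on x[i-1] */
    return x;
}

/* Embarrassingly parallel: every iteration is independent,
   so the index range can be split across any number of cores. */
void scale_all(const double *in, double *out, size_t n, double k)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * k;
}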
Specialized processors are faster not (necessarily) because they run better/parallel algorithms, but because they can eliminate the millions of gates required to make them generic and therefore shorten the electrons' path through the CPU from the input pins to the output pins.
I believe that Moore's law will continue for some time to produce results by converting software into hardware. Even without any parallelism, optimal graphics chips would be faster than any software running on optimal generic hardware. They would be far more energy efficient as well. The problem is that hardware is expensive and difficult to build compared to software.
The solution, in my opinion, is Field Programmable Gate Array technologies. These might make it possible to virtually construct specialized processors on the fly. A game requires an advanced AI processor? Flip a few million bits in the dynamic FPGA and voila! Now your software is running almost literally at the transistor level. Current FPGA technologies that I am aware of need to improve by two or three orders of magnitude to be able to represent larger problem sets, but there is a lot of potential.
Now back to parallel programming. For those problems that CAN be solved by a divide and conquer approach, simply multiplying the hardware (no matter how inefficient it may be) could significantly benefit the computational ability. This is especially true in cases where one processing unit has very little need to communicate with other processing units.
However we must watch out for problems where each processing unit requires data and synchronization from other units before every parallel processing step. This data must travel over a slow network between units. For this reason an optimal serial algorithm can sometimes run faster than an optimal parallel one.
Shared memory architectures can help mitigate IO overhead; however, shared memory is not scalable: every doubling of units requires more transistors and a longer data path between each processor and its memory. Not to mention contention on the bus, caching issues and exponential expense.
That is not to say that parallel clusters are not useful. I think they are cool, but they're not the magic bullet some people are claiming them to be. That said, the industry does need better tools in order to handle the hurdles between serial and parallel programming.
I can see how people see that 256k of local-store (LS) would be a problem, or some sort of "limiting" factor. But the truth is, if done in the right way, there's very little the programmer has to do in making the best use of the LS.
You're not limited to loading in big chunks of data to operate on at a given time, and even if that's what you want to do, you can also load in chunks of data from different areas of memory in one go.
Sure; you have to split up your operations a little more, but that's about it. It's no biggie.
And you can run straight C code on the SPE without any issues. There's a group working on a sort of software caching system for big chunks of code that wouldn't fit in only 256k. After all, as far as the SPE is concerned code is just data too, and so can be streamed in as needed.
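For illustration, the shape of that streaming approach in plain C (the chunk size and names are invented for the example; on a real SPE the memcpy()s would be asynchronous DMA transfers into local store, ideally double-buffered so compute overlaps the transfers):

#include <string.h>
#include <stddef.h>

#define CHUNK_BYTES (16 * 1024)   /* a slice that sits comfortably in a 256 KB LS */

/* Stream a large buffer through a small working area, chunk by chunk.
   On Cell the "working" buffer lives in the SPE's local store and the
   copies become DMA gets/puts; here memcpy stands in for them. */
void process_stream(char *big_buffer, size_t total, void (*kernel)(char *, size_t))
{
    static char working[CHUNK_BYTES];

    for (size_t off = 0; off < total; off += CHUNK_BYTES) {
        size_t len = (total - off < CHUNK_BYTES) ? (total - off) : CHUNK_BYTES;
        memcpy(working, big_buffer + off, len);   /* "DMA in" */
        kernel(working, len);                     /* crunch the slice */
        memcpy(big_buffer + off, working, len);   /* "DMA out" */
    }
}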
For decades, the leading edge of computer technology and the leading edge of computer consumerism have moved in parallel. There simply wasn’t enough computing power available to do the things we wanted to, so to see what we would be using at home and in the office in five years time you looked at predictions of technological developments.
Now this link is ending. For the first time in thirty years, we’re seeing manufacturers deliberately under-speccing their computer hardware to make it more attractive to the consumer (eg the Wii and the Eee PC), whilst the Macbook Air is selling very well considering its comparative technical limitations. Very soon (if not already), we’ll stop replacing our computers with faster models, but will instead replace them with cheaper, more personally tailored and better looking models. Thus the personal computer market has matured, and the technology will plateau. Parallel processing will not be particularly important in the personal computer for the foreseeable future.
In servers it will come in handy in specific situations… but look at it this way, as the processing power needed by individual network systems is dwarfed by the ability of physical server machines, virtualisation will become increasingly commonplace. If we have a large number of services running on a small number of servers, the expense of specialising the hardware on servers will actually inhibit the flexibility of the virtualisation. You don’t want to put an expensive parallel processor specialised for handling SQL databases onto each hardware unit in a set of servers that will only ever be hosting a variety of non-database related virtual servers.
I’d also go as far to suggest that the future of computing even in servers is not in parallel processing. There are two ways of viewing a multi-core processor. Either you have one computing environment with access to many processing ‘engines’ running in parallel, OR you have many co-existing computing environments each running one or two services on one or two cores. Moore seems to be seeing the future as the former – in which case the OS is clearly the problem as you need an OS to co-ordinate 50 human users trying to access a varying selection of one or two processes through a single environment running twenty different processes on 64 cores. It would be much easier to set up a system where the individual human users are only given access to the one or two virtual machines the services they need are running on.
I think this is one of those cases where basic reality will win out over the ‘ideal situation’. Water flowing downhill always takes the easiest path… and so it will be with computing. It will always be easier to adapt the new technology so that it runs the old systems than re-write the old systems to best take advantage of new hardware. Virtualisation allows software houses to make little or no adjustments to the way their software works, whilst true parallel processing would need us to throw everything away and start from scratch. Will MS spend billions creating a brand new server OS to handle 64 cores all at the same time when it can offer almost as good a server solution simply by virtualising and updating its existing server OS kernel? And which company is going to buy an expensive new server OS to parallel process on its new 64 core server when it can just virtualise the OS its already running on all its other servers for a fraction of the cost? I can't see it myself.
Speaking of the easiest path, sometimes parallelism can be your shining light.
Many number crunching problems reduce down to problems of linear algebra (i.e. matrix arithmetic). Many number crunching programs use common libraries for solving linear algebra (such as BLAS).
If AMD can produce a GPU/general CPU amalgam and proper hardware-accelerated libraries to go with it, then parallelising your programs may simply be a case of slight code-hackery and compilation with the right libraries.
I know: I've tested this principle with NVIDIA's CUDA/CUBLAS. The principle works.
(This of course depends on /if/ your problem can decompose into linear algebra.)
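For what it's worth, the "slight code-hackery" boils down to something like this (a rough sketch of a single matrix multiply through CUBLAS, error checking omitted): instead of calling your CPU BLAS sgemm, you copy the operands to the card, call the CUBLAS equivalent, and copy the result back.

#include <cuda_runtime.h>
#include <cublas_v2.h>

/* C = A * B for n x n single-precision matrices, column-major.
   The CPU-BLAS call this replaces would be a single sgemm(). */
void gpu_sgemm(const float *A, const float *B, float *C, int n)
{
    float *dA, *dB, *dC;
    const float alpha = 1.0f, beta = 0.0f;
    size_t bytes = (size_t)n * n * sizeof(float);
    cublasHandle_t handle;

    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cublasCreate(&handle);

    /* Ship the operands to the card... */
    cublasSetMatrix(n, n, sizeof(float), A, n, dA, n);
    cublasSetMatrix(n, n, sizeof(float), B, n, dB, n);

    /* ...let the GPU's parallel hardware do the heavy lifting... */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    /* ...and fetch the result back. */
    cublasGetMatrix(n, n, sizeof(float), dC, n, C, n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}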
"I assume that is dual core, not two cores having a duel - but why would someone who thinks Intel are vastly superior in their approach buy AMD ?"
I guess it was because AMD had the best low core count solution at the time. Massive parallelism on a chip for general purpose computing isn't available from either manufacturer at this time.
Learn to read. Anon.
but probably not if you are browsing the web or word-processing.
As always, it depends on your application. For home computers most people want low latency, so you have a TV card with its own image-stream processor, sound card with its own processor and massive amounts of power on the gamers' (and vista-owners') graphics card. Physics cpus are also in view, offloading further tasks from the main cpu, all in the quest to get lower latency. I don't want my TV channel-surfing to degrade a voip call just because I've put three TV shows on screen at once. In this scenario it actually makes sense to have multiple (possibly lower power) specialised processors.
In business scenarios (not desktop systems) you also have possible uses for parallelism. Running an SQL query on several million records? Line up your database records in a vector array and crank through the same comparison operation at high-speed on your GPU. IO constraints are always an issue but I suspect it will be cheaper/easier to do this than to split the job between several general purpose cores/CPUs. I would imagine the tasks would be rather limited at first, similar to the encryption accelerators we saw before general purpose CPUs could handle more vpns than anyone wanted. The difference is that it is difficult to see a point at which more database processing capacity would ever be unusable, especially at the price-points we are talking about.
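To illustrate what "line up your database records in a vector array" means, here's a toy sketch in plain C (the column names and the WHERE clause are invented): lay the table out as one flat array per column, and every row's test becomes an independent operation - exactly the shape that maps onto thousands of GPU threads.

#include <stdint.h>
#include <stddef.h>

/* Column-oriented layout: one flat array per field instead of an
   array of row structs.  Each record's test is independent of the
   others, so on a GPU every iteration simply becomes one thread. */
size_t select_orders(const int32_t *amount, const int32_t *region,
                     size_t n_rows, uint8_t *match /* out: 0/1 per row */)
{
    size_t hits = 0;
    for (size_t i = 0; i < n_rows; i++) {
        match[i] = (amount[i] > 1000) && (region[i] == 7);  /* the WHERE clause */
        hits += match[i];
    }
    return hits;
}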
Your file server or business desktop probably won't benefit from this sort of power so there is no point including it in general purpose CPUs - it just makes them more expensive and power-hungry. A small number of cores makes sense in desktop systems because apps can be load-balanced (rather than parallelised) and that is generally good enough.
If you have a large number of very disparate operations (e.g. JVMs running arbitrary processes) then multiple cores probably makes sense, though whether you achieve this with few cores many hosts or many cores in few hosts is an issue you still have to contend with.
Paris, because she knows more than I do about this...
Sorry, but the whole of this sounds an awful lot like some of the arguments that were being bandied about many years ago when the CISC vs. RISC debate was being fought over the 16/32 bit designs. The problem was that the best ideas weren't necessarily the ones that got carried forward though, goodness knows, processors such as the PowerPC, the ARM and so forth did receive some support.
The comfy little managers with their bunches of glossy brochures and assurances from certain vested interests were what muddied the pool. Not the programmers. These managers wanted to be safe in their familiar world rather than try out what could have been. And do you know what? Things haven't really changed.
"....showed that 64bit didn't have to be totally incompatible with IA32 (unlike Itanium)...." Ah, spoken like a true ignoramus! Sit down quietly, class, and I'll begin....
Right from the very first generation of HP-designed Itanium there has been the ability to run x86 code through the CPU with full backwards compatibility. At first, this was by the simple (and admittedly inelegant) trick of having a whole Pentium core embedded in each Itanium core, and when the CPU encountered some x86 code it diverted it to the embedded Pentium core. Next, HP and Intel introduced emulation software into the chip core, which allowed the OS running on the core to evaluate the x86 code in advance and decide whether it would be quicker to run the x86 code through the embedded Pentium chip or through the emulator. Then, Intel developed the emulator to the point where it always outperformed the embedded Pentium core, allowing Intel to remove the embedded Pentium core and use the die space freed up for even more cache, but still leaving each and every Itanium CPU fully capable of running x86 code.
Many different approaches to providing more computing power exist, although for the moment parallelism is needed because indium phosphide is not quite ready.
I am eagerly awaiting the availability of the FireStream 9170 from AMD, which seems to be an affordably-priced and powerful product, but it should have been out last month.
IA64, up to the 2nd gen Montecito, had logic for running IA32 x86 code in native mode, the problem being that it was way too slow. Someone at HP then wrote a software emulator that was around 2x faster (but still slow).
Since AMD came up with the 64 bit instruction set, Itanics have even larger problems reading that format.
Itanic excels when you can multithread and fit all your data into the massive on-chip 3rd level cache. Otherwise just leave it: multicore with a much higher pipeline throughput is the way to go.
As the world's first petaflop supercomputer is unveiled, based around [...drumroll...] AMD Opterons and IBM Cell processors!
Source:
http://www.nytimes.com/2008/06/09/technology/09petaflops.html?_r=1&oref=slogin
A somewhat apples and oranges comparison, I freely admit, but it does detract from the statement that the Cell architecture is an "accelerator weakling".
A few points from the posts above:
1) AMD deserves a great deal of credit for coming up with x64. At a time when Intel and HP were off running around trying to get the Itanic to float, AMD's deceptively simple approach of augmenting the instruction set that most compilers already knew was a masterstroke.
2) Fact is, CISC will generally flatten RISC in terms of perf because
a) CISC instruction streams are very dense - sometimes managing to store several instructions in the space it takes to store one RISC instruction. This keeps the CPU pipeline full whilst minimizing RAM wait times.
b) Because RISC designs were essentially benefitting from their ability to live on the outer edge of the ever increasing clock speed curve, they're now in trouble. Clock speeds aren't increasing much and won't be for some time to come. So now they have to get more done per tick - and you can't do that if your instruction set uses sparse instructions. ARM have recognized this through their compressed instruction schemes.
3) It's not about RISC vs. CISC any more. It's not about multi-core vs. single-core CPUs. It's all going to be, quite honestly, as Moore points out, about having chips with multiple special-purpose processing units on them. Some processors will be REALLY good at decoding and executing x86/x64. Others will be REALLY good at matrix math. Others will be really good at floating point. Others will be really good at DSP. Collect more and more of these things together per package and you have the future of computer processing.
My biggest worry isn't the number and diversity of these cores; it's how we build a memory and IO infrastructure that can effectively arbitrate all the competition for RAM and IO!
At the end of the day, a single execution thread already spends too much time waiting for memory and IO before getting real work done.
"That is not to say that parallel clusters are not useful. I think they are cool, but they're not the magic bullet some people are claiming them to be." - Lou Gosselin
Did your parallel computing brain come up with that all by itself?
The universe runs in parallel. But it's amazing what you can serialize when you try.
"As always, it depends on your application. For home computers most people want low latency, so you have a TV card with its own image-stream processor, sound card with its own processor and massive amounts of power on the gamers' (and vista-owners') graphics card. Physics cpus are also in view, offloading further tasks from the main cpu, all in the quest to get lower latency. I don't want my TV channel-surfing to degrade a voip call just because I've put three TV shows on screen at once. In this scenario it actually makes sense to have multiple (possibly lower power) specialised processors." - P Lee
So what you want is for the silicon on the TV card to sit idle when you don't view TV on your PC? And you want your video card's 3D hardware to sit idle while you're just doing 2D on your desktop. And you want your audio silicon to sit idle when not playing music. And your motherboard chipset that manages your hard drives, sitting idle while you are not using them.
INTEL's solution is to replace all of the computational portion of that hardware with one or more general purpose CPUs that will perform the same function <when needed> and which will be available for general purpose computing when they aren't needed to support the hardware feature set.
Why wouldn't I want my system's general computing performance to improve spectacularly when I stop using 3D video on my video card?
"At the end of the day, a single execution thread already spends too much time waiting for memory and IO before getting real work done." - Rich Turner.
Well, to solve that problem, you have to keep the data where it is more readily available: in a CPU register or in the cache. However, HLLs are specifically designed so that the concepts of registers and caches are lost.
And of course, compilers continue to optimize like crap, typically producing code that is 2 to 4 times slower than properly optimized code, and for vectorizable operations 60 times slower, 200 times slower, if not more.
A 64K instruction set has more than enough room to be able to directly address 64K of internal registers - be they scalar or vector themselves.