This is pleasing
More good news technology stories like this and I shall renew my subscription.
AMD has released details on its implementation of The Next Big Thing in processor evolution, and in the process has unleashed the TNBT of acronyms: the AMD APU (CPU+GPU) HSA hUMA. Before your eyes glaze over and you click away from this page, know that if this scheme is widely adopted, it could be of great benefit to both …
If I understand correctly, if they were to use this architecture on a "traditional" PC, would we need to worry about memory on a graphics card anymore? Or would the CPU/GPU simply split the use of my RAM as they need it? Could that also mean that more RAM = much better graphics performance?
A co-processor is a co-processor. If it can act in place of the CPU with less nonsense then that's useful regardless of whether or not the co-processor is on the same die. This is just turning a GPU into a fancier math co-processor.
Surprised it hasn't been done yet actually.
"....if they were use this architecture on a "traditional" PC, do we need to worry about memory on a graphics card anymore?...." The article specificly mentions tablets and handsets, which implies this is more a better "system on-a-chip" than a replacement for traditional gaming PC architecture.
One reason it is unlikely to upset the PC applecart is that plug-in graphics card vendors like having complete control over the discrete memory on their cards; they don't have to wait for CPU or motherboard designers to catch up. If a new memory type that works best for graphics comes out, they don't have to wait for the CPU manufacturers to redesign their memory controllers or for the mobo designers to issue a new board with new RAM slots, they simply add it to their own cards. That is the advantage of discrete graphics memory in PCs. If you had no memory on your graphics card and had to go out over the bus to main memory, performance would suck. And graphics card makers will want to use the latest and greatest, as they need to maintain a performance lead over combined designs like this or they will go out of business.
I would suggest this is more aimed at tablets and possibly virtual desktop environments, the latter seeing greater efficiencies in memory if they can pool it for all tasks.
Yes - also it's unclear what degree of control the programmer has over CPU/GPU usage decisions. It would be nice to specify that compute-intensive inner loops be executed mandatorily on the GPU, for example. I get the impression that the GPU is auto-assigned only when the CPU runs out of puff, but that's just a SWAG.
By putting the GPU behind the MMU it does technically reduce one of the video memory concerns: you could have a single graphic however many gigabytes in size, memory-map the file and call that the texture. Attempts by the GPU to read sections not currently paged in would simply raise the usual exception, which would be caught by the OS in the usual way and handled by the existing paging mechanisms. You no longer have to treat texture caching as a separate application-level task.
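Here's a minimal CPU-side sketch of that idea using plain mmap; the filename and the read-only mapping are just illustrative assumptions, not anything from the article. The point is that the OS's existing demand paging does the "texture streaming" for you, and hUMA would in principle let a GPU access fault pages in the same way.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("texture.bin", O_RDONLY);   /* illustrative filename */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* The whole file becomes one contiguous range of virtual addresses,
     * however many gigabytes it is. */
    const unsigned char *texels = mmap(NULL, st.st_size, PROT_READ,
                                       MAP_PRIVATE, fd, 0);
    if (texels == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touching bytes deep inside the file raises page faults that the OS
     * services transparently; only the pages actually read come off disk. */
    printf("first byte: %u, last byte: %u\n",
           (unsigned)texels[0], (unsigned)texels[st.st_size - 1]);

    munmap((void *)texels, st.st_size);
    close(fd);
    return 0;
}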
That said, as others have noted, the main point of the design is that when you write a parallel for loop in your language of choice to perform some vector operation (especially if it involves no branching), the GPUs can be factored into the workload just as easily as any traditional CPU cores, and perform the work much more efficiently. So writing programs that take advantage of all the available processing becomes a lot easier. Collapsing virtual memory to a single mechanism that your OS vendor has already supplied is just one example.
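For what it's worth, here is the shape of loop being described, written with plain OpenMP rather than any actual HSA tooling, so it runs on CPU threads here; the hUMA pitch is that an hUMA-aware runtime could hand the same loop, with the same pointers, to the GPU with no staging copies.

#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)

int main(void)
{
    float *a = malloc(N * sizeof *a);
    float *b = malloc(N * sizeof *b);
    float *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f; }

    /* Flat, branch-free, element-wise work: exactly the shape of job a
     * GPU is good at. Compile with -fopenmp to spread it over CPU threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] * b[i] + 1.0f;

    printf("c[42] = %f\n", c[42]);
    free(a); free(b); free(c);
    return 0;
}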
In their current APUs the GPU doesn't interact with memory in the same way as the CPU does. That's in spite of the fact that they're on the same die and ultimately share the same DDR3 memory bus. In that sense the arrangements are slightly non-uniform, and you have to copy data in order to get it from one realm to the other.
This new idea means that the GPU and CPU interact with memory in exactly the same way, and that makes a big difference. Software is simpler because a pointer in a program running on the CPU doesn't need to be converted for the GPU to be able to use it. That helps developers. More importantly, the "GPU job setup time" is effectively zero because no data has to be copied in or out first. That speeds up the overall job time.
I like it!
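To make the zero-copy point concrete, here's a toy before/after sketch. The fake_gpu_scale() function is just a CPU stand-in so the thing compiles and runs; only the shape of the two call sequences matters, and none of this is AMD's actual API.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in "GPU kernel" that simply runs on the CPU. */
static void fake_gpu_scale(float *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] *= 2.0f;
}

/* Non-hUMA style: stage the data into a separate "device" buffer, run the
 * kernel, copy the result back. The two memcpy calls are pure setup cost. */
static void scale_with_copies(float *host_data, size_t n)
{
    float *dev_buf = malloc(n * sizeof *dev_buf);
    memcpy(dev_buf, host_data, n * sizeof *dev_buf);
    fake_gpu_scale(dev_buf, n);
    memcpy(host_data, dev_buf, n * sizeof *dev_buf);
    free(dev_buf);
}

/* hUMA style: the GPU dereferences the very pointer the CPU already holds,
 * so the staging copies (and the job setup time) disappear. */
static void scale_zero_copy(float *host_data, size_t n)
{
    fake_gpu_scale(host_data, n);
}

int main(void)
{
    float v[4] = { 1, 2, 3, 4 };
    scale_with_copies(v, 4);
    scale_zero_copy(v, 4);
    printf("%g %g %g %g\n", v[0], v[1], v[2], v[3]);
    return 0;
}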
The key difference is not on the diagram. When a process on a CPU tries to access some memory, the address that the process selects is a virtual address (traditionally a 32-bit number, now often a 64-bit one). The CPU tries to convert the virtual address into a physical address (a different number, sometimes of a different size). There are several uses for this rather expensive conversion:
Each process gets its own mapping from virtual to physical addresses - this makes it very difficult for one process to scribble all over the memory that belongs to a different process (see the small fork() sketch after this list).
The total amount of virtual memory can exceed the amount of physical memory. (Some virtual addresses get marked as a problem. When a process tries to access such a virtual address, the CPU signals this as a problem to the operating system. The operating system suspends the process, assigns a physical address for the virtual address, gets the required data from disk into that physical memory then restarts the process.)
Sometimes it is just convenient - the mmap function makes a file on a disk look like some memory. If a process tries to read some of the mapped memory, the operating system ensures data from the file is there before the read instruction completes. If a process modifies the contents of mapped memory, the operating system ensures the changes occur to the file on the disk.
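A small demo of the first point, for anyone who wants to see it: after fork() the parent and child use the same virtual address but have separate mappings behind it (copy-on-write under the hood), so a write in one never shows up in the other. Standard POSIX, nothing AMD-specific.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int value = 1;

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: same virtual address as the parent, different mapping. */
        value = 42;
        printf("child:  &value=%p value=%d\n", (void *)&value, value);
        return 0;
    }

    wait(NULL);   /* let the child print first */
    /* Parent still sees 1: the child's write landed in its own physical page. */
    printf("parent: &value=%p value=%d\n", (void *)&value, value);
    return 0;
}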
In UMA, the CPU and the GPU access the same physical memory, but the GPU only understands physical addresses. When a process wants some work done by the GPU, it must ask the operating system to convert all the virtual addresses to physical addresses. This can go badly wrong because a neat block of virtual addresses could get mapped to a bunch of physical addresses scattered all over the memory map. Worse still, some of the virtual memory could map to files on a disk and not have a physical address at all. The two solutions are (plan A) to have the operating system copy the scattered data into a neat block of contiguous physical addresses, or (plan B) for the process on the CPU to anticipate the problem and request that some virtual addresses map to a neat contiguous block of physical addresses before creating the data to go there.
Plan B looks really good until you spot that the operating system might not have such a large block of physical memory unassigned. It would have to create one by suspending the processes that use a block of memory, copying the contents elsewhere, updating the virtual-to-physical maps and then resuming the suspended processes. It gets worse. That huge block of memory cannot be paged out when it is not being used, and the required contents might already be somewhere else in memory, so they will have to be copied into place instead of being mapped.
All this hassle could be avoided if the GPU understood virtual addresses. That would cut down on the expensive copying (memory bandwidth limits the speed of many graphics-intensive tasks). The downside is that it adds to the burden of the address translation hardware, which already does a huge and complicated task so fast that many programmers do not even know it is there.
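If you want to see how scattered that virtual-to-physical mapping really is, Linux exposes it in /proc/self/pagemap. This sketch prints the physical frame behind each page of an ordinary malloc'd buffer; run it as root on a recent kernel, otherwise the frame numbers come back as zeros.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t page = (size_t)sysconf(_SC_PAGESIZE);
    const size_t npages = 8;
    unsigned char *buf = malloc(npages * page);
    memset(buf, 1, npages * page);                /* touch every page */

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) { perror("open pagemap"); return 1; }

    for (size_t i = 0; i < npages; i++) {
        uintptr_t vaddr = (uintptr_t)(buf + i * page);
        uint64_t entry = 0;
        /* one 64-bit pagemap entry per virtual page */
        if (pread(fd, &entry, sizeof entry,
                  (off_t)(vaddr / page * sizeof entry)) != sizeof entry) {
            perror("pread");
            return 1;
        }
        uint64_t pfn = entry & ((1ULL << 55) - 1);  /* bits 0-54: frame number */
        printf("virtual page %zu -> physical frame 0x%llx\n",
               i, (unsigned long long)pfn);
    }
    close(fd);
    free(buf);
    return 0;
}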
Right, so that's why the Cell SPE did understand virtual system memory addresses and used those to access system memory. How else can you chase pointers, ensure security, etc., in a reasonable, portable and high-performance way?
Cell showed that that architecture could be used even for things such as intensive pointer chasing in garbage collection. (Check out their cool Cell GC work in the VEE conference!)
To work well and perform well, the right page size and TLB structure are vital.
x86 systems today work mostly with 4 kB pages (there may be a handful of TLB entries that can be used for huge 2 MB pages). Dividing a main memory of perhaps 100+ GB into 4 kB pages carries a huge overhead, and it gets even worse with a combination of CPUs and GPUs.
A 4 kB page size and 1024 TLB entries mean that you can only access 4 MB of virtual memory before you need to start replacing TLB entries (reading the virtual-to-physical translation from memory before you can access the memory itself - i.e. you double the number of memory transactions).
SPARC and POWER today support much larger page sizes (1 GB and beyond), and this is something that needs to happen on x86 too.
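The arithmetic is easy to play with; the entry count and page sizes below are illustrative round numbers, not any particular CPU's figures.

#include <stdio.h>

int main(void)
{
    const double entries    = 1024;                       /* illustrative TLB size */
    const double sizes_kb[] = { 4, 2 * 1024, 1024 * 1024 };
    const char  *names[]    = { "4 kB", "2 MB", "1 GB" };

    for (int i = 0; i < 3; i++) {
        /* TLB reach = number of entries x page size */
        double reach_mb = entries * sizes_kb[i] / 1024.0;
        printf("%s pages: %.0f entries cover %.0f MB before misses start\n",
               names[i], entries, reach_mb);
    }
    return 0;
}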
So what I really want to know is: will this actually result in a decent improvement in performance using Photoshop / Lightroom / Premiere Pro? The Mercury engine in Premiere Pro was a nice improvement, and SSDs helped with the other two a little, but these year-on-year ~10% performance jumps from Intel, with AMD struggling to keep up, are getting old. If you want me to drop a few thousand on a new workstation, cough up a decent performance jump. If they turn round with a 16-core APU with 200% of the performance of a 4770 that doesn't require its own power station or AC unit, I shall be suitably humbled and reach for my wallet.
If this comes out with a mild performance bump, on lithography two steps behind Intel, then honestly I will be disappointed. I would love for AMD to knock it out of the park; they are the only thing that keeps Intel vaguely awake in the desktop space.
It could have an impact if the software is written to take advantage of it.
What concerns me, though, is that I recall some time back (a couple of years ago) there being a WebGL exploit that could extract pieces of video RAM. Admittedly, the exact problem occurred nearly two years ago and a lot has changed since then; however, that isn't to say the same vulnerability can't exist in future software.
What makes this kind of vulnerability dangerous here though is that this sort of architecture potentially opens up your entire system memory to attack via the same vector, since video RAM is essentially your main system RAM.
The idea is not new though... the SGI O2 had a similar design, as did a lot of late-'90s desktop boards with integrated video devices.
Issues of this nature were given by Microsoft as the reason they hadn't implemented WebGL in IE for such a long time.
I believe most of the SGI machines had all of the video card mapped in the CPU memory space so everything could access everything else.
Of course, it used to be that video cards had their memory mapped into the memory space of the PC, although there wasn't as much acceleration then, so allowing the CPU a fast way to write updates to the video card made sense. Once we got 3D chips with hundreds of MB of RAM, the 32-bit memory space started getting a bit tight and they stopped doing that for all the memory. No reason a 64-bit machine couldn't allow everything to be mapped into one memory space, though, unless you want to support running 32-bit software still.
OK, not to get too down on this, but what about memory bandwidth? Maybe an extreme example, but the nVidia Quadro 6000 in my workstation uses GDDR5 memory with a bus width of 384 bits and has a 128-bit graphics engine, whereas its Xeon CPU is 64-bit and uses 64-bit-wide DDR3 SDRAM - two sets of completely different code have to run on each because the architectures are so different. Now, unless AMD are saying they're going to bump up their CPU cores to 128-bit designs with much wider and faster built-in memory controllers, that means the graphics engine actually has to accept CPU-specified memory that will be painfully slow compared to the discrete memory on a stand-alone graphics card. All in all, it may be great for tablets or handhelds, but not for PCs.
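For some rough numbers: peak bandwidth is just bus width (in bytes) times transfer rate. The rates below are illustrative round figures, not the Quadro 6000's or any specific Xeon's actual specs, and the DDR3 figure is per channel, so a multi-channel CPU memory controller narrows the gap somewhat.

#include <stdio.h>

/* peak bytes/s = (bus width in bits / 8) x transfer rate */
static double peak_gb_per_s(int bus_bits, double mega_transfers_per_s)
{
    return (bus_bits / 8.0) * mega_transfers_per_s * 1e6 / 1e9;
}

int main(void)
{
    /* Illustrative figures only, not the real specs of either part. */
    printf("384-bit GDDR5 @ 3000 MT/s: ~%.0f GB/s\n", peak_gb_per_s(384, 3000));
    printf(" 64-bit DDR3  @ 1600 MT/s: ~%.0f GB/s per channel\n",
           peak_gb_per_s(64, 1600));
    return 0;
}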
It's for their "APU"s, which are CPUs with an on-chip (possibly on-die?) GPU.
So their GPU is already using the same physical memory bus and memory hardware as the CPU.
This isn't for discrete GPUs.
Looking at the list of partners, seeing ARM is very, very interesting - GPGPU in a Cortex A* is already very cool, and this would not only add go-faster stripes but severely reduce the CPU needed.
Anybody for 2-big.2-little.loads-of-titchy?
This development has been in progress since 2006 and it's finally starting to come to fruition. It's great to see AMD leading the next big PC performance improvement. They may have had their troubles over the years, no thanks to Intel's illegal practices, but at least AMD continues to deliver the best-value products for consumers.
I personally will be buying a Kaveri powered laptop as soon as they are available for purchase. Kaveri should offer a dramatic improvement in APU performance allowing AMD to then offer mid-range and high-end desktop solutions that equal current and future discrete CPU/GPU systems, but at a lower cost with lower power consumption. It's all good for consumers.
What do Intel and nVidia have up their sleeves for CPU/GPU shared memory that will beat AMD out of the gate by a year to 18 months on a smaller process .. say 14nm ..
* HSA Foundation, of which AMD is merely one of many members along with fellow cofounders ARM, Imagination Technologies, Samsung, Texas Instruments, Qualcomm, and MediaTek.*
Notable lack of Intel .. nVidia .. you'd think Google would be interested in the tech as well ..
down votes expected .. <sigh>
The Commodore Amiga line of computers had unified memory way back in 1985. Such hardware was way ahead of its time.
I sometimes wonder how Amiga came to fruition, because there was practically nothing else that could touch it:
. 16/32-bit CPU (Motorola 68xxx series and later IBM PowerPC accelerators)
. Preemptive multi-tasking built into the hardware
. Custom chipsets for video, audio and input/output
. WIMP interface.
. Programs running in their own screens, each with their own resolution.
. Video-friendly timing, hence no RGB monitor required. Hugely popular with cable networks.
. Massive array of public-domain software
I owned several Amigas in addition to x86 hardware, and from a hardware viewpoint the Amiga was in a class of its own, even compared to a Mac or Atari ST. The thing could even run Mac software through emulation, at a fraction of the cost of a real Mac. The biggest problems I faced were the cost of upgrades, along with the public perception (due to marketing) of the Amiga as a kid's game machine rather than a creative powerhouse.
Sorry if this post was a tad off-topic. I have fond memories of this machine and the things it could do, as well as the great community who stood by it through thick and thin. Seeing innovative products like this from AMD reminds me a little of those days.
Less "unified memory" than 1 CPU (with no memory-management unit, hence the need to save often, save early and the kludge of patching up executables when loading them into memory) plus off-CPU video hardware that could access the lower 512K with DMA.
All very nice, but not exactly on the level with 2013.
The problem is that hindsight flatters things. Do you remember the beautiful Workbench? Then you look at it for real and you know ... it was nice then, but it sure ain't now.