Re: Unified Memory
If anything this could make the situation worse for NVidia. Before this, when the programmer had to do their own data transfers, the latency was explicitly there in the source code.
I got the same feeling reading the article. The latencies are still there; they're just hidden behind a software translation layer. I'll agree that doing explicit DMA or other main memory <-> device memory transfers is annoying, but we already have a technique for hiding DMA latencies (*), namely double (or multi-) buffering.
Multi-buffering can, for many problems, not only "hide" the latencies but effectively eliminate them for all but the first block transferred. If this new feature did automatic loop unrolling and transparently added multi-buffering (or even just double buffering) where it detected the pattern, that would be pretty nifty. Unfortunately, judging by the description in the article, that isn't what it's doing: all we get is blocking, full-latency access to the "shared" memory, with "shared" in quotes because it's only a software abstraction, not a hardware feature. I could be pleasantly surprised, but from the article it seems like a sop to lazy programmers, not real shared memory at all.
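To be concrete about what I mean by double buffering (with the caveat from the footnote that I'm not up to speed on CUDA, so take the API details with a grain of salt): two device buffers and two streams, so the copy for one chunk can overlap the kernel working on the previous one. The kernel, chunk sizes, and the "multiply by two" work below are placeholders of my own, and this is an untested sketch, not anything the article describes:

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void process(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;   /* stand-in for the real computation */
    }

    int main(void) {
        const int nChunks = 8;
        const int chunkElems = 1 << 20;
        const size_t chunkBytes = chunkElems * sizeof(float);

        /* Pinned host memory so the async copies can actually be DMA'd. */
        float *h;
        cudaMallocHost((void **)&h, nChunks * chunkBytes);
        for (int i = 0; i < nChunks * chunkElems; ++i) h[i] = (float)i;

        /* Two device buffers and two streams: the "double" in double buffering. */
        float *d[2];
        cudaStream_t s[2];
        for (int b = 0; b < 2; ++b) {
            cudaMalloc((void **)&d[b], chunkBytes);
            cudaStreamCreate(&s[b]);
        }

        for (int c = 0; c < nChunks; ++c) {
            int b = c & 1;                       /* alternate buffers/streams */
            float *hc = h + (size_t)c * chunkElems;
            cudaMemcpyAsync(d[b], hc, chunkBytes, cudaMemcpyHostToDevice, s[b]);
            process<<<(chunkElems + 255) / 256, 256, 0, s[b]>>>(d[b], chunkElems);
            cudaMemcpyAsync(hc, d[b], chunkBytes, cudaMemcpyDeviceToHost, s[b]);
            /* Work queued on stream b overlaps with whatever is still running
               on the other stream, so only the first copy sits fully on the
               critical path. */
        }
        cudaDeviceSynchronize();

        printf("h[1] = %f\n", h[1]);             /* expect 2.0 */

        for (int b = 0; b < 2; ++b) { cudaFree(d[b]); cudaStreamDestroy(s[b]); }
        cudaFreeHost(h);
        return 0;
    }

That's all I mean by hiding the latency: the copy for chunk N+1 runs while the kernel chews on chunk N, which is exactly the kind of thing a runtime could do transparently if it were more than a software veneer over blocking transfers.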
(*) I'm not actually up to speed on CUDA, so I'm assuming it uses DMA to do data transfers?