purportedly up to 95 percent
And as we all know, the first 95% of the port takes 95% of the effort, while the last 5% takes the other 95% of the effort.
Nvidia is facing its stiffest competition in years with new accelerators from Intel and AMD that challenge its best chips on memory capacity, performance, and price. However, it's not enough just to build a competitive part: you also have to have software that can harness all those FLOPS – something Nvidia has spent the better …
...Sucks. I have an A750 with "8GB VRAM." I put that in quotes because you can't initialize an array larger than 4GB using ipex or oneAPI. So technically it has 8GB, you just can't use it for compute. Intel, aware of their bug, has stated they have no intention to fix it. Intel Arc is useless for language models. It's almost a passable gaming card, except they refuse to support VR. I'm never buying Intel again. My next computer will have an AMD processor and Nvidia graphics. Icon because buyer's remorse.
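For anyone who wants to reproduce that complaint, here's a minimal sketch of the kind of allocation being described, assuming a PyTorch build with Intel Extension for PyTorch (ipex) installed and the Arc card exposed as the 'xpu' device; the exact failure mode will depend on the driver and ipex versions.

# Minimal sketch: try a single allocation bigger than 4 GiB on an Arc card.
# Assumes intel_extension_for_pytorch (ipex) is installed; error type and
# message will vary by driver and ipex version.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the 'xpu' backend)

if torch.xpu.is_available():
    try:
        # ~5 GiB of uint8: fits in "8 GB" of VRAM, but exceeds a 4 GiB
        # single-allocation limit if one exists.
        big = torch.empty(5 * 1024**3, dtype=torch.uint8, device="xpu")
        print("allocated", big.numel() / 1024**3, "GiB on", big.device)
    except RuntimeError as err:
        print("allocation failed:", err)
else:
    print("no xpu device visible")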
Maybe a literary sociolect of the webomorphic persuasion, as seen in https://visualstudio.microsoft.com/vs/features/cplusplus/, while some concisely prefer CPP (eg. https://www.w3schools.com/cpp/), and yet others lean towards unreadable hexadecimal percentages: https://en.wikipedia.org/wiki/C%2B%2B. A belt and suspenders principle might suggest: C+Plus%2B+Plus%2B! </ahem!>
I guess it was particularly hard to foresee the raytracing of this CUDA castle moat by king Nvidia, seeing how he spent most of his time playing video games. And yet, there it is, in all its nearly unassailable medieval glory, mostly safe from catapults, battering rams, ballistas, siege towers, and other telescoping software!
But this period might not bear long-term remembrance as history lest it be associated also with such social movements as grand fanaticism, inquisition, and related embrace of an equivocal orthodoxy, along with specialized torture methods for the heretic rebel fighters: surprise, fear, ruthless efficiency, and torture by pillow and comfy chair. Didn't expect that?
Well, nobody expects the Spanish Inquisition either! ;)
Speaking of moats, just saw this quite strikingly cool one (albeit somewhat in ruins) in "Meurtres à Château-Thierry" on French TV. It's actually at the Château de Fère en Tardenois, but not too far from Château-Thierry proper. The site features an awesome covered bridge (as seen on the photo). French Champagne is made to the East of this (eg. Epernay).
Sounds like it.
But seriously, if a hardware mfg wants to compete with another hardware mfg, isn't it up to them to make the transition path as smooth as possible? The lower the level that happens at, the wider the market.
Logically, every command could be mapped with macros, provided their hardware can somehow carry out the exact same task (see the toy sketch below).
There are 2 challenges. The external challenge is to make developers' code "just work", so it's no more of a chore to run on your hardware than on Nvidia's. The internal challenge is to maintain a commitment to improve the port until it gives both 100% compatibility (or at least nothing but a few minor tweaks, the kind a simple script could make) and at least 100% speed compared to a native Nvidia port.
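A toy sketch of that "simple script" idea, loosely in the spirit of hipify-perl's textual renaming of CUDA API calls to their HIP equivalents; the mapping table here is illustrative only, nowhere near complete.

# Toy sketch of the "map every command" idea: a textual rename pass over
# CUDA source, loosely modelled on hipify-perl. The table is illustrative,
# not complete or official.
import re

CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def port_source(src: str) -> str:
    # Whole-word replacement so cudaMemcpy doesn't mangle cudaMemcpyHostToDevice.
    for cuda_name, hip_name in CUDA_TO_HIP.items():
        src = re.sub(rf"\b{cuda_name}\b", hip_name, src)
    return src

print(port_source("cudaMalloc(&buf, n); cudaMemcpy(buf, h, n, cudaMemcpyHostToDevice);"))

The real tools handle a much longer tail (maths libraries, kernel launch syntax, inline PTX), which is exactly where that last 5% of the effort lives.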
Time will tell which of its competitors executes this well enough to look like a serious alternative to developers.
And who knows, maybe prices might start coming down?
The price of pick-axes and spades will stay high as long as the insane LLM gold rush remains in full fever.
Personally, I used to like amanfrommars' ramblings, assuming they came from his own strange mind and perhaps a hand-crafted algorithm or two. Now that any old generative engine can spew it out, it's much less entertaining (but I was tickled by the idea of "torture by comfy chair and pillow").
Well duh, ya think?
Who wants to write code like
ASR A
BCC ASC
LDA A ACIA+1
AND A #$7F
Not me. I prefer writing
Function getASC(data As String) As Integer
Dim Char As String
Char = Left(data, 1)
getASC = Asc(Char)
End Function
When I look at Assembly code, I have no effin' clue what it is supposed to be doing, whereas even a non-developer can take a guess at what my preferred code is doing.
Some of those 6502 mnemonics are invalid. Not sure why you are inserting those extra "A"...
It's LDA, not LDA A; there's no point to the second A, as the LDA already mentions the destination.
Same for AND: it always implies the A when an immediate value is used.
Ultimately, CUDA is under assault from three vectors:
First, enthusiasts and hobbyists who are unable to afford Nvidia hardware with sufficient amounts of VRAM to do anything interesting.
Second, the manufacturers themselves.
Third, and perhaps most importantly, all of the hyperscalers are investing heavily in their own accelerators. None of them can tolerate the current status quo; most can't even secure enough Nvidia chips to do what they want.
I think the long term prognosis for Nvidia remains dim, not in a "Nvidia will fail" way but a "it'll go back to being an AMD sized company and not an Amazon sized one". The moment right now feels a bit like the mid-10s when it appeared that Intel had a monopoly on x86 and nobody else could challenge it (eventually AMD would come to save the day, as it were, but arm started being taken seriously as well as a result of that time), but perhaps even more volatile. Nvidia does have a lot of the fab time booked and HBM bought which will make competition somewhat difficult, but you can already see Google and now Anthropic (via AWS Inferentia) migrating towards custom accelerators and away from GPUs.
And of course the GPU architecture isn't really purely optimized for matmul; there's a lot to be said for, say, Cerebras' wafer-scale approach (which also handily doesn't rely on HBM).
Isn't Cerebras aiming primarily for the so-called "language output inference" (better name: "statistical inference of language output") niche? Similar to another "language output inference"-niche company, "Groq"? So they are not exactly directly competing with NVidia, because NVidia hardware can be used for both "training" and "inference".
The CUDA "moat" is a classic example of the first-mover advantage in an ecosystem.
It's not just about technical parity but also developer convenience. HIPIFY, SYCL, and ROCm offer alternatives, but often require manual intervention, have compatibility issues, or lag in performance. NVIDIA's strength lies in its relatively unified ecosystem.
Custom accelerators, however, may challenge Nvidia's moat from outside rather than within.
Displacing a dominant player is always hard, whether in computing or evolution. But disruption often emerges from an unseen niche that overturns what seemed unshakable. Either that or the equivalent of a monster asteroid strike that shakes things up a bit.
I think NVIDIA's moat was more "going all in" than just first mover.
There was a while when the then-new OpenCL had a lot of interest. Certainly in academic circles it was a lot more popular than CUDA, and there was a lot about the language design that was nicer.
But nobody really committed to the HW. Everyone had an "OpenCL implementation", but they were all slightly non-standard, didn't keep up with new versions, and had terrible tooling. Some <cough>AMD<cough> had terrible unfixed bugs in core routines like FFT, but nobody cared because the market was games or laptops - OpenCL was just a checkbox to meet some purchasing requirement.
NVIDIA on the other hand bet everything on GPU compute, years before AI, not just HW but the tooling and support around CUDA.
They could try, but it likely wouldn't work. There are already open standards. Someone mentioned OpenCL. That's still around, there's your open standard, and the chips already support it. Every manufacturer can claim compliance on that. It's not their fault that people are not writing for it. Having an open standard doesn't do anything if people choose to write for the closed versions. If you try to forbid the closed versions, then you'll get a lot of complaints from all the people whose code you've just disallowed if they don't just ignore you.
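For what it's worth, the "chips already support it" part is easy to check; a minimal sketch, assuming the pyopencl package and at least one vendor OpenCL runtime (ICD) are installed.

# Minimal sketch: enumerate whatever OpenCL platforms and devices the
# installed vendor runtimes (ICDs) expose. Assumes pyopencl is installed
# alongside at least one vendor OpenCL driver.
import pyopencl as cl

for platform in cl.get_platforms():        # one entry per vendor runtime
    print(platform.name, platform.version)
    for device in platform.get_devices():  # GPUs, CPUs, accelerators
        print("  ", device.name)

The same script lists devices whether the runtime underneath is Nvidia's, AMD's, or Intel's; the gap is in the application code nobody wrote against it.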
...hasn't tried to get tensorflow/pytorch/spacy/whatever to work GPU-accelerated on NVidia vs basically anything else. Just the other day I went on AMD's website to find out how to get PyTorch to run GPU-accelerated on AMD. They wanted me to downgrade PyTorch, use their specific branch, patch a bunch of other stuff, boot with a special kernel param, etc. When I did that, it still didn't build. After hitting it with a stick a few times it built, but didn't actually run GPU-accelerated. This is PyTorch - which is the most important library for ML.
The exact same thing on Nvidia? It just builds. The latest PyTorch, the standard upstream of everything. You don't have to do anything special - it just builds and runs GPU-accelerated straight away.
If AMD or Intel or anyone else want to actually get serious they will stop putting out press releases and do the work that's necessary to get their shit to actually build and work with the 2 frameworks that actually matter. Until then I'll believe in the nvidia moat.
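To be fair, when the ROCm wheels do install cleanly the check is the same as on Nvidia, because ROCm builds of PyTorch reuse the torch.cuda namespace; a minimal sanity check follows (the hard part, as described above, is getting it to print True).

# Quick sanity check for GPU-accelerated PyTorch; works the same on CUDA
# and ROCm builds because ROCm builds reuse the torch.cuda namespace.
import torch

print("build:", torch.__version__)
print("cuda:", torch.version.cuda)   # set on CUDA builds, None on ROCm
print("hip:", torch.version.hip)     # set on ROCm builds, None on CUDA
print("gpu available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))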
The M2 chip from Apple boasts superior efficiency in power and thermal management, making it highly suitable for LLMs and general AI applications, with lots of dev-library support. Although Apple's market share in the AI hardware landscape is smaller compared to NVIDIA's dominant position (less so on servers, though AWS does offer them), it is progressively encroaching on AMD, and, well, lots of devs use Macs over x64/x86!
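As an aside on the dev-library point, PyTorch reaches the M-series GPU through its MPS (Metal Performance Shaders) backend; a minimal check, assuming a reasonably recent PyTorch on macOS.

# Minimal check for Apple-silicon GPU support in PyTorch via the MPS
# (Metal Performance Shaders) backend; assumes a recent PyTorch on macOS.
import torch

if torch.backends.mps.is_available():
    x = torch.randn(1024, 1024, device="mps")
    y = x @ x                      # matmul runs on the M-series GPU
    print("MPS ok:", y.device, y.shape)
else:
    print("MPS backend not available on this build/OS")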
That's a techie answer, not a business answer.
Nvidia gives you vendor lockin with a supplier.
M2 gives you vendor lockin with a competitor.
And pretty soon you're not their competitor anymore, you're their acquisition.
People have seen the "Embrace, extend, extinguish" game played plenty of times already.
As aggravating as Nvidia can be at times, I do have a grudging admiration for their ability to execute their plans over the past 10 years or so. Sort of like Intel was 15 years ago. AMD and Intel can't get into the market, not because their products are particularly inferior (though you could argue lack of CUDA makes them so), but because Nvidia hasn't put a foot wrong in a very long time in IT years. Even with a *better* product, when the market leader has 90% of the market it's not enough. They have to make a mistake. And I think one thing that Nvidia has and the other two don't that helps them not bugger it up is that they actually know where they're going. CUDA's dominance is the result of a near-20-year campaign to put it where it is. I don't see anything out of AMD that indicates that kind of vision on the GPU front.
I did mention Intel before to point out that it's not inevitable that Nvidia continues to eat all the pies. Hubris is a thing in IT, and Intel was certainly guilty of it. They still dominate the market of course, but they've shown their weak side and AMD continues to nibble away at their marketshare. And this is exactly the sort of boost that the ARM-based companies need to get their foot into the lucrative consumer desktop/laptop world. Maybe finally become a general purpose option in the server room, too. One wrong step out of Nvidia when AMD or Intel are having a good year could cause a seismic shift in the GPU market. Let's see what 2025 brings us.
Good article, but it misses chipStar (https://github.com/CHIP-SPV/chipStar), a compilation flow/runtime that can compile HIP and CUDA sources to an open-standard API-based heterogeneous software stack (OpenCL and SPIR-V). It doesn't support IL/binaries, though, which also means no reverse engineering is required for the compatibility layer. Its usage has increased lately for making various legacy HIP code bases open-standards compatible. The development is currently led by Argonne National Laboratory: https://www.alcf.anl.gov/events/chipstar-hip-implementation-aurora.
I'm guessing because it was started by developers, and all the mfgs preferred that they adopt their (proprietary) choices.
Congratulations HW mfgs. You got your wish.
Except for most of you it wasn't your proprietary standard that the market chose, was it?
I'd like to see them attack Nvidia's market dominance through OpenCL but that depends on how big a job it would be to shift the existing code bases to it, and of course how much of that job can be reliably* automated.
The other direction is to deliver 100% compatibility at the lowest level, so they can inherit the toolchain above that level, and of course maintain that compatibility.
2025 could be quite interesting.
*Unreliable automation is a complete f**king waste of time. You're not just re-writing your stuff, you're checking the automation's attempt to rewrite your stuff. This does not save time, unless you cross your fingers and hope your test suite will find all the errors and give you enough info to drill down to them and fix them. If anyone does this (and it works) please let me know. I've never seen it made to work properly, but there's always a first time.
OpenCL had various restrictions, each of which had some downward effect on its usage. However, it didn't end up becoming the most popular, and that's neither because someone deliberately killed it nor because it had gaping technical flaws. We can debate how large each of those effects was, but I think we also need to question why we assume it was likely to be the victor.
By being a standard, OpenCL tended to run on more things, but not as efficiently on each one. Optimizing for a specific part meant you could run faster, and for something as compute-intensive as training ML models, that was a key benefit. Any other standard is likely to have the same downside. We can try to make a standard that's close to the hardware which would reduce that difference, but it's always likely to exist. That's not necessarily a problem; although compiled languages tended to run slower than hand-coded assembly*, people still chose them for the portability or writeability advantages.
Your alternative suggestion, a standard at the lowest level, is going to have other restrictions. At that point, you're no longer going to prefer one manufacturer over another, but you also entirely eliminate any ability for a manufacturer to improve anything. If they can't change the instructions their chips understand, they can't do something as simple as allowing you to work on a larger piece of data, thus accomplishing in one instruction what would have taken multiple. Other optimizations would be prohibited. Nvidia and AMD have spent lots of money building those improvements, testing them, making hardware that can do them, and building them into compilers. By requiring a standard like the one you've described, you're effectively telling every user that they will no longer get any of that work in exchange for being able to buy AMD parts. It would be similar to mandating that every CPU manufactured must be an AMD64 one, with alternative approaches like ARM and RISC-V disallowed, but anyone who wanted could try to build an AMD64 one. Do you think users are going to be pleased with that?
* Nowadays, compilers are so good at optimizing that they often produce better code than someone working in assembly. This doesn't change the point. People can still improve on the compiler's output, and in some cases they can do it quite easily because they understand the semantics of their program's contents whereas the compiler's optimizations are general ones. It is still possible to exceed a compiler's efficiency. You just have to really want to do so and be willing to put in the substantial effort.
I think you're mistaken.
It would be the other mfgs that would be doing this work. If they added a feature then the patch would recognise it. It potentially leads to a "virtuous circle" of mfgs leap-frogging each other with tweaks to their instruction sets that would be recognized as making apps on their hardware run faster.
While this would mean developers needing to update their tools more frequently, it could lead to a more level playing field: mfgs competing on performance, which I think is what we all want.
I don't understand how that's supposed to happen. When you say "If they added a feature then the patch would recognise it", what is "the patch"? They can change their implementation in microcode or something like it, thereby improving their implementation of the allowed instructions. There is only so much you can do to improve that. They wouldn't be allowed to have new instructions because those new instructions aren't part of the standard, so if someone compiled for those new instructions, it would only work on that manufacturer's parts. What improvements, other than microcode, do you think they'd be allowed to do with a mandated standard? While we'd see some improvements with microcode, ruthless focus on microcode alone is how we got several serious CPU security bugs, because Intel and AMD wanted more performance out of the same ISA every year and kept looking for more and more hacks to get it until one of those hacks had some nasty side-effects. Even then, they were still adding some things to the ISA which sped up some classes of program, which wouldn't be allowed if they have to stick to a standard.
Nvidia's moat is not a good thing, but attempts to eliminate it by requiring Nvidia and AMD to build the same chip aren't helping. There are a few negative effects. It seems we're disagreeing about the ability to improve on a mandatory standard, which I assume is due to a miscommunication somewhere. We also have the problem that, if any improvement one makes is immediately available to its competitor, there is less benefit to the manufacturer for improving, so why bother to do so? I think the best way to fight against Nvidia's moat is what AMD is currently doing: making it easier to port things from CUDA to run on theirs as well. They're lagging, not because Nvidia did anything wrong, but because Nvidia built something that AMD didn't bother with and people wanted that thing. AMD can catch up but they have to go to some effort to do it. I don't think we are helping the users by trying to fill in that moat for AMD. There are some times where that kind of regulation is justified, but I don't think we're there, and I think attempting it nonetheless will be unwelcome to users and less effective than you expect.
I agree that CUDA was the moat back in the days we called these things GPUs. In those days we felt processing graphics and videos and smaller-scale forms of AI was a lot of data and processing. Now with LLMs, the scale of processing is a new ball game. Nvidia's new moat has everything to do with scale that often saturates Ethernet and melts components. New high-speed memory chips, and joining multiple processors on the same board or even the same silicon. Liquid cooling, DPUs, InfiniBand, and who knows how many trade secrets, patents, and strategic partnerships are needed to get the job done right. Imagine the patents owned by the creators of the robots and automated precision manufacturing required to assemble these things, and the work of sourcing reliable suppliers. Don't forget the engineers who actually have the skills and experience at building something that resembles a supercomputer and other aspects of HPC. As an example, many of us are programmers here but few of us are kernel programmers. I think CUDA is still part of the moat, but keep in mind they are building a high-performance fighter plane and not just an engine.