Nvidia won the AI training race, but inference is still anyone's game

With the exception of custom cloud silicon, like Google's TPUs or Amazon's Trainium ASICs, the vast majority of AI training clusters being built today are powered by Nvidia GPUs. But while Nvidia may have won the AI training battle, the inference fight is far from decided. Up to this point, the focus has been on building …

  1. HuBo Silver badge

    A hardware bonanza

    Good point about tokens-per-dollar, with capital and running costs for TCO! TNP has some data points on this, about how optics can beat copper in the networking part, and how at FP64 AMD CPUs and CPU+GPU provide the best bang for the buck at present ... we might need a similar analysis for FP16 too ... (or a refresher given my limited human memory!).
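    A quick back-of-envelope on that tokens-per-dollar idea - the Python sketch below amortises the capital cost of one server over its life and adds the power bill; every figure in it (price, lifetime, power draw, electricity rate, throughput, utilisation) is an illustrative assumption, not a measurement of any real system:

      # Back-of-envelope tokens-per-dollar from TCO: amortised capital cost
      # plus running (power) cost. All figures are illustrative assumptions.

      server_price_usd    = 250_000   # assumed purchase price of one inference server
      lifetime_years      = 4         # assumed depreciation window
      power_draw_kw       = 10.0      # assumed average draw under load, incl. cooling
      electricity_usd_kwh = 0.10      # assumed all-in electricity price
      throughput_tok_s    = 20_000    # assumed aggregate tokens/second for the box
      utilisation         = 0.6       # assumed fraction of time actually serving

      hours_of_life   = lifetime_years * 365 * 24
      capex_per_hour  = server_price_usd / hours_of_life
      opex_per_hour   = power_draw_kw * electricity_usd_kwh
      tokens_per_hour = throughput_tok_s * 3600 * utilisation

      tokens_per_dollar = tokens_per_hour / (capex_per_hour + opex_per_hour)
      print(f"~{tokens_per_dollar:,.0f} tokens per dollar of TCO")

    Swap in real capital and running costs and measured throughput, and the same arithmetic gives the FP16 comparison hinted at above.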

    Madhu Rangarajan's interview was also interesting on the degree to which pure CPUs might be used for inference, as a function of model size and frequency of use.

    Beyond the bigger players though, I do love the diversity of hardware being developed for inference, from Cerebras' distributed thinking, through SambaNova's novel rhythms and Tenstorrent's dataflow, to NextSilicon's free-spirited Maverick-2 mill cores, and on to d-Matrix's Digital In-Memory Compute (DIMC) tech and EnCharge's analog route, among others. There's something there for everyone it seems (and don't forget the interconnects -- they're reconfigurable on the Maverick)!

  2. Bitsminer

    Need for speed

    The Cerebras service (free for casual use but with only a couple of models available) is very, very fast. Especially compared to commercial Nvidia-equipped CPU+GPU services.

    I compare it to a video terminal versus a punch-card deck. With a fast response you can re-compose or revise your prompt quickly. With a punch-card deck you lose your train of thought, and you've probably wandered off to do something else.

    If the same or similar LLMs are available on comparable platforms, then speed wins every time. If mixture-of-experts and chain-of-thought models get popular, speed is even more important.

    1. O'Reg Inalsin Silver badge

      Re: Need for speed

      Maybe I'm a bit slow, but if I ask a solid, deep question, it's going to take me at least a few seconds to deeply understand the answer - so a delay of a second is not much compared to how long it takes to digest it on my end (reading, testing, ...). Far more important is that the answer be as short as possible to satisfy Occam's razor - fluffy verbosity is a negative feature. For that reason tokens/sec doesn't seem to be that useful a measure, because token quality is still a free variable.
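      As a rough sanity check on that intuition, the Python sketch below compares generation time with reading time; the answer length, generation rates and reading speed are all illustrative assumptions, not benchmarks of any service:

        # Time to generate an answer versus time to read it.
        # All numbers are illustrative assumptions, not benchmarks.

        answer_tokens   = 500    # assumed length of a typical answer
        tokens_per_word = 1.3    # common rule-of-thumb ratio for English text
        reading_wpm     = 250    # typical adult reading speed, words per minute

        read_seconds = (answer_tokens / tokens_per_word) / reading_wpm * 60

        for name, tok_per_s in (("fast service", 1000), ("slow service", 50)):
            gen_seconds = answer_tokens / tok_per_s
            print(f"{name}: generate in {gen_seconds:.1f}s, read in {read_seconds:.0f}s")

      With those made-up numbers, even the slow service finishes generating well before a reader finishes reading - the digestion time dominates either way.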

      I can think of exceptions, such as translation between coding languages, where the results are supposed to be reliable enough that they can be tested without being read first by a human. (Just hypothesizing here - I have always given such translations a brief read-over anyway.) But such exceptions are not the rule.

      1. adsp42

        Re: Need for speed

        I'm with you, but let's not forget that the internet, and social media in particular, is full of trivia. And cats. Not that many "solid deep questions", so we are ignored by the profit-seeking AGI propagandists.

        Very good point about Occam's razor, though it's not that easy to use for quality control.

  3. old bald white guy

    Hmm, not convinced about an inferencing-only market

    There are lots of parties chasing the inferencing market, and IMHO 1) it is not a single market and 2) it may take a lot longer to mature than people are hoping.

    The market is going to segment between the data center and a range of edge applications, from super low cost (think edge facial recognition or wake-word detection) to multiple automotive apps (smart mirrors to crash avoidance). A lot of the start-ups are building edge apps, and another set are building for the datacenter. And the hyperscalers are actively building out their own chip sets.

    I think your tech spec drivers are correct for the datacenter, but you have to add in power and 'right-sizing' for edge apps. By right-sizing I mean the appropriate number of parameters and computational capabilities relative to the app.
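    One way to make that right-sizing concrete is the first-order memory footprint of the weights; the Python sketch below just multiplies parameter count by bits per weight, and the model sizes and precisions in it are illustrative assumptions rather than recommendations:

      # First-order "right-sizing" check: memory needed just to hold the weights.
      # Model sizes and precisions below are illustrative assumptions only.

      def weight_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
          """Approximate gigabytes required for the weights alone."""
          return params_billion * 1e9 * bits_per_weight / 8 / 1e9

      for params_b in (0.1, 1, 7, 70):
          for bits in (16, 8, 4):
              gb = weight_footprint_gb(params_b, bits)
              print(f"{params_b:>5}B params @ {bits:>2}-bit: ~{gb:.2f} GB of weights")

    A 7B model at 4-bit is about 3.5 GB of weights before activations or KV cache, which is the sort of arithmetic that decides whether an edge part is 'right-sized' for the app.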

    But for inferencing-only to take off, regardless of the app, you have to have stable parameter sets (both in terms of parameter values and number of parameters). Otherwise you are still in training mode and inference-only hardware buys you nothing. Taking automotive as an example, there is still a lot of training and updating going on. That is why Nvidia still wins, along with AMD/Xilinx for other apps.

    BTW, if you are still in training mode for edge apps, you still need to get new data back from the edge to the data centers for training. You have to get that LIDAR and camera data back. Again, that points to Nvidia, because they have the codec and transmission capabilities.

    I may be being excessively skeptical here, but time will tell.

  4. jbrower

    Texas Instruments *should* be an inference player, but they missed the AI boat 10 years ago when they cut their HPC chip product line. They had 1U servers with 256 CPU cores (using PCIe accelerator cards), and they had a presence in vision and speech recognition open source groups; basically, they were on the right path. But as we've seen with Boeing, Intel, Ford, etc., when a company is run by MBAs, technology vision and clarity become a fundamental liability. Unless they target robotics and other real-time inference sectors head-on, and restart their HPC chip product line, for which they still have homegrown technology that excels in performance density (performance / power consumption / package size), they face a dim future. They will increasingly depend on their China revenue, which will continue to decline.
