Want to save the planet from AI? Chuck in an FPGA and ditch the matrix

Large language models can be made 50 times more energy efficient with alternative math and custom hardware, claim researchers at the University of California, Santa Cruz. In a paper titled "Scalable MatMul-free Language Modeling," authors Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng …

  1. Caver_Dave Silver badge
    Boffin

    Old news?

    FPGAs were the only way to run neural nets fast and efficiently enough last century.

    A new generation makes the 'startling discovery' again.

    1. Justthefacts Silver badge

      Re: Old news?

      No, I don’t think it is old news. There’s a nice subtlety here, which gets obscured by them making more than one change at a time.

      The key point is that they have replaced the “self attention” mechanism. The self-attention is/was there because it’s the only way for the algorithm to look at the whole set of text at once. But what probably hasn’t occurred to people before is that the self-attention algorithm is designed around the situation on a normal CPU, where only one register bank is available to the execution units at a time - it’s costly to shove things in and out of memory, all the way through cache.
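
      (For anyone needing a reminder, the standard self-attention under discussion is the textbook formulation - nothing specific to this paper:

      \[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \]

      where Q, K and V are linear projections of all N token embeddings, so the QK^T term is an N-by-N matrix of every-token-against-every-token scores - the all-pairs interaction the researchers are trying to avoid.)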

      But FPGAs don’t have that constraint: the execution units can see tens of thousands of tokens simultaneously - so you can design a different self-attention mechanism, the one you would have designed if the CPU constraints hadn’t been there. We didn’t know, because software people make software assumptions.

      And then, on top of that, he’s refactored the overall matrix multiplication algorithm to use what FPGAs do well: bitwise operations. That has a performance hit, but one you can mitigate by simply having more parameters, and that precision trade-off comes out differently on an FPGA than on a CPU or GPU.
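
      To make the “no multiplies” point concrete, here’s a toy sketch in plain Python (an illustration of the general idea only, not the paper’s actual FPGA kernel): once the weights are constrained to {-1, 0, 1}, a matrix-vector product needs nothing but additions, subtractions and skips.

        # Toy "MatMul-free" matrix-vector product with ternary weights.
        # Real implementations pack the {-1, 0, +1} weights into bit-planes and
        # use wide bitwise adds; this only shows why no multiplier is needed.
        def ternary_matvec(weights, x):
            """weights: rows of -1/0/+1 values; x: list of numbers."""
            out = []
            for row in weights:
                acc = 0.0
                for w, xi in zip(row, x):
                    if w == 1:       # +1 weight: accumulate
                        acc += xi
                    elif w == -1:    # -1 weight: subtract
                        acc -= xi
                    # w == 0: skip entirely, no work at all
                out.append(acc)
            return out

        W = [[1, -1, 0, 1],
             [0, 1, 1, -1]]
        x = [0.5, 2.0, -1.0, 3.0]
        print(ternary_matvec(W, x))   # -> [1.5, -2.0]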

      But no; this is new stuff, because we didn’t know about self-attention until a couple of years back. And presumably his improved self-attention algorithm tuned for FPGAs is an all-new concept we haven’t seen before. It’s nice work, I think.

      It still might not be “successful” because GPUs are so dominant that it’s easier to make progress there than on limited-supply, restricted-skill-set FPGAs, and to just throw more GPUs at the problem. That I don’t know, but $100M of GPU time per training run focuses a lot of minds, so we’ll see.

      1. Caver_Dave Silver badge
        Joke

        Re: Old news?

        OK, I'll bite. I was referring to the general case, whereas the researchers are referring just to LLMs.

        As I have been out of this field for some time, I did have to look up "self attention" mechanism.

        I fancifully thought that this was referring to the usual "trumpet blowing" hype that start-ups use to gather attention to themselves, until I researched it.

      2. that one in the corner Silver badge

        Re: Old news?

        > It still might not be “successful” because GPUs are so dominant that it’s easier to make progress there than on limited-supply, restricted-skill-set FPGAs

        Well, the next stage after proving your idea on an FPGA is to go custom ASIC, and then you get boards made that are stuffed full of ASICs and given some overall name; next thing you know, the same units turn up as chiplet subprocessors in your CPU.

        Just look at the history of blockchain mining for a recent example of the trend (with only a bit more hype, instead of new CPUs being hyped with "NPU" units they'd be appearing with BCUs instead!).

        The only problem is that if this new unit takes off fast enough, Intel, Apple et al will be left trying to figure out how to get you to trust them and buy a new CPU which has removed the NPU in favour of the new hotness. "This time, there *will* be useful software you'll want to use, and it'll be ready before we change the architecture again, honest".

        1. Justthefacts Silver badge

          Re: Old news?

          I agree that the next step would be ASIC. But like all the other AI projects, the problem is whether the function is fixed enough to do that, or whether you would be better off continuing development on flexible systems, taking a temporary hit in efficiency, and stepping onto ASIC in six months' time. Or a year. The great thing about the matrix multipliers was that *every* algorithm used them as its core tight loop, so it was safe to onboard them as hard macros. I have no idea if this follows that paradigm.

  2. jake Silver badge

    Perhaps work on the ...

    ... "hallucinations" issue first, THEN make it go faster and/or with less power.

    If you can't get trustworthy results out of the kludge, it's a dead-end anyway. No point in making it more efficient.

    1. Neil Barnes Silver badge

      Re: Perhaps work on the ...

      Ah but, never mind the quality, feel the width!

    2. Justthefacts Silver badge

      Re: Perhaps work on the ...

      Not necessarily true. He’s also changed the self-attention, which is a major thing. This isn’t just an algorithmic refactoring for efficiency. The quality and type of issues will be “different”. Can’t say whether they will be better or worse; but different.

      1. Michael Wojcik Silver badge

        Re: Perhaps work on the ...

        It'll be particularly interesting to see how this "overlay" mechanism interacts with adding recurrence to the model, as Google did in their (misnamed) "infinite context" paper, since those are both ways of incorporating a time series into what was an asynchronous stack.

        Some of the well-documented issues with transformer models were due to a lack of memory, and were often mitigated to some extent by using much larger context windows so part of the context would capture the model's recent trend. There are obvious limitations to that, and I agree that having better mechanisms for capturing the path history will likely have some effects on what error modes we see.

    3. Dave 126 Silver badge

      Re: Perhaps work on the ...

      > Perhaps work on the ... "hallucinations" issue first, THEN make it go faster and/or with less power.

      Is it not possible that having multiple systems can reduce hallucinations? If an operation is lower power and faster, then it can be run several times from 'different angles'.

      I as a human see a distant white spot against a grass background. My life experience tells me it might be a mushroom, a rock, a plastic bucket or a seagull. However, I don't act on any one line of reasoning, perception or memory. I remember that the month is May, so I rule out mushroom. I remember that I'm not in a chalk area, so a rock seems less likely. I watch it for a while and it doesn't move, so not a seagull. I walk closer to it, and confirm that it is a plastic bucket.

      I suspect we humans are hallucinating all the time, but we don't act on any one hallucination.

      1. Caver_Dave Silver badge

        Re: Perhaps work on the ...

        Excellent description

      2. Justthefacts Silver badge

        Re: Perhaps work on the ...

        Definitely. We’ve already had “Mixture of Experts”, a bunch of separate LLMs, each of which is better at different kinds of things, selected at run-time depending on the question. Now we’re doing Mixture of Agents, a bunch of separate LLMs operating in parallel, communicating with each other like people in a team, to reach a consensus answer. But we’re still at the stage where those “Agents” only communicate via their text outputs. That must be *staggeringly* inefficient, as they could use much greater bandwidth to communicate intermediate data, which would look like different brain regions interacting.

        It’s well-known that we should do that, but nobody yet has released a frontier model that way. Next six months or so, I expect.
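
        To picture the current text-only hand-off, here’s a throwaway sketch (plain Python, purely illustrative; the lambda “agents” stand in for separate LLMs):

          # Toy "Mixture of Agents" consensus at the text level: each agent returns
          # a text answer and a simple vote picks the consensus. All the richer
          # intermediate state (activations, attention) is discarded - which is
          # exactly the inefficiency complained about above.
          from collections import Counter

          def consensus_answer(agents, question):
              """agents: callables taking a question string, returning an answer string."""
              answers = [agent(question) for agent in agents]   # text in, text out
              winner, votes = Counter(answers).most_common(1)[0]
              return winner, votes / len(answers)

          # Stand-in agents; a real system would wrap separate LLMs here.
          agents = [lambda q: "42", lambda q: "42", lambda q: "41"]
          print(consensus_answer(agents, "What is the answer?"))   # -> ('42', 0.666...)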

    4. Michael Wojcik Silver badge

      Re: Perhaps work on the ...

      There are many users and applications that don't give a hoot about accuracy, or at least are willing to accommodate hallucinations and other errors as long as they fall below an (often generous) bound. Text summarization and the generation of informal documents (email messages, memos, posts to web forums) are two prominent examples.

      And there are many users who don't care about (perfect) accuracy even in applications where it does matter. Human beings are, in general, careless. Programmers who text while driving probably won't care if GitHub Copilot puts a vulnerability in the code it generates for them.

      Personally, I think there are other, far more interesting risks to the use of gen-AI, such as learned helplessness, loss of opportunity to learn, and loss of serendipity, and I won't be using the damned things. But many people will use them, and so reducing the waste in energy is useful.

  3. Pascal Monett Silver badge
    Thumb Up

    13W instead of 700

    That's almost 54 times less.

    FPGA FTW !

  4. Bebu
    Windows

    Monogamy?

    "In self attention, every element of a matrix interacts with every single other element," he said. "In our approach, one element only interacts with one other element."

    The first ("self attention") by analogy is narcissistic promiscuity and the second ("interacts with one") monogamy. ;)

    The former is both delusional and unhealthy, continuing the analogy with hallucinatory AI.

    That's 3Wh per AI query (3 J per sec for 1 hour = 10800 J* or 10.8 kJ!), or the equivalent of a 3.0V chip drawing 60.0A for 1 minute. So we are cooking the planet in order to replace the human lack of intelligence with an artificial version of that same deficiency? ;)

    * For unrecoverable faredge reformists and the recalcitrant left pondial: ~10.2 BTU, 7966 ft-lb(f), 2.58 kcal
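
    For anyone who wants to check the conversions, a quick Python one-off (figures as above):

      # 3 Wh per AI query, converted to other units
      wh = 3.0
      joules = wh * 3600                        # 1 Wh = 3600 J -> 10800 J = 10.8 kJ
      print(joules / 1055.06)                   # ~10.2 BTU
      print(joules / 1.3558, joules / 4184)     # ~7966 ft-lbf, ~2.58 kcal
      print(joules / (3.0 * 60.0))              # seconds at 3.0 V x 60.0 A (180 W) -> 60.0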

    1. MonkeyJuice Bronze badge

      Re: Monogamy?

      Sorry, I can't understand your figures. Can you put that in an officially sanctioned Reg unit of measure? How many double decker buses is that?

  5. FrogsAndChips Silver badge

    So with more efficient techniques

    They'll just throw more data and more complex models at their FPGAs and still need the same amount of power...

    1. Justthefacts Silver badge

      Re: So with more efficient techniques

      Yes. Why is that a problem? Power usage is only a problem if we aren’t getting the desired output.

      The *wrong* way to look at it is to say that a human brain requires 100W, so until LLM achieves that it is “inefficient”.

      The *right* way is to observe that we don’t own slaves any more, and human brains don’t come in jars. We’re prepared to pay humans at least $10 per hour for really low-skill tasks. That’s about 100kW in electricity usage.

      As long as an LLM is using less than 100kW, i.e. less than one full rack in a data centre, I don’t care, as long as it gets the job done. We can debate *if* and *when* it will get the job done, but that’s what we should be focusing on.

      1. Anonymous Coward
        Anonymous Coward

        Re: So with more efficient techniques

        > I don’t care as long as it gets the job done. We can debate *if* and *when* it will get the job done, but that’s what we should be focusing on.

        Well, first we should be focussing on whether the job is actually *worth* doing at that cost.

  6. munnoch Silver badge

    Just need more parameters…

    I can’t pretend to be an expert on neural nets or ML, but I’ve always been surprised by the low precision arithmetic employed in these models.

    I do a lot of financial numerical stuff, and the weirdness in FP precision even when you have lots of bits to play with is something we have to pay close attention to, otherwise the results can go very skew-whiff.

    Isn’t that even more of a problem in ML with the more limited precision? Or does the arithmetic weirdness of individual calculations just get lost in the general weirdness of the whole thing?

    And now we have these dudes saying let’s reduce the precision even further to just *three* possible values so the model becomes more resource-efficient. All they need are more parameters, i.e. a bigger model.

    Doesn’t a bigger model imply a bigger training set? And haven’t we already shoveled all the excrement produced by mankind into the existing models with what can best be described as mediocre results? Where do we get training data good enough to produce all these lovely extra parameters?

    Whenever a particular processing domain becomes a bottleneck, you usually have someone come along with an FPGA that implements an optimized version to relieve the bottleneck. After a while those ideas get absorbed into general-purpose computing and that becomes good enough. I suspect that’s all that’s happening here.

    1. HuBo Silver badge
      Windows

      Re: Just need more parameters…

      I basically agree ... though the switch from arbitrary weights in matrices (possibly FP16) to just {-1,0,1} might be viewable (conceptually) in a similar way to the replacement of Fourier Transforms with Haar Transforms (or possibly Haar Wavelets), which has decent math backing IIRC.
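
      (For context on that analogy - and only the analogy, nothing from the paper - the unnormalised Haar mother wavelet only ever takes the values +1, -1 and 0:

      \[ \psi(t) = \begin{cases} 1 & 0 \le t < \tfrac{1}{2} \\ -1 & \tfrac{1}{2} \le t < 1 \\ 0 & \text{otherwise} \end{cases} \]

      whereas the Fourier basis functions are full-precision sinusoids, which is what makes the comparison with ternary weights tempting.)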

    2. Anonymous Coward
      Anonymous Coward

      Re: Just need more parameters…

      It might not be that surprising. Compare it with the board game "Who is it", where one player selects a person and the other has to guess who it is in the fewest attempts. Typically the guessing player asks questions like "Is it a man or a woman?", "Is it a child or an adult?", "Is the person wearing glasses or not?", and if the player makes no mistakes there is only one option remaining at the end and the guessing player gets it right. The questions asked were all binary.

      Now think of a computer / ML vision version of the game where there is some uncertainty in the answers. The computer interpreting the person to be guessed isn't always sure, and responds to questions such as "Is it a man or a woman?" or "Is it a child or an adult?" with "the first answer is likely correct", "I am too uncertain to say either", or "the second answer is likely correct". Play it the way a human player would and you'll most likely end up wrong if you only use the same number of questions as the human player.

      Change it a bit and replay the game a hundred times, each time with a different set or "tree" of questions, and you'll get a lot closer if you take the most frequently chosen outcome.

      In addition, you can run each question through, for example, four different computer programs analysing the person to be guessed. For example, ask four different programs whether the person wears glasses. Combine the answers in a binary-like tree, two by two. If the first two inputs are "likely wears glasses", send that to the next stage. If the first two inputs conflict, send "too uncertain", and so on. Do the same in parallel with the remaining two inputs. Recombine the outputs of this first layer in a similar way to find the combined guess of the four different programs, and it'll be more accurate.

      In a way, by taking more inputs and more routes to evaluate those inputs, the accuracy of the guess improves even though each answer is still described in only three states. That sounds contradictory, but it doesn't need to be. The further down the decision tree you go, the more likely the outcomes become.

      So at the beginning you start with "is it a child or an adult?" and evaluate against chances of >30% child, >30% adult, unknown. At the end you finish with 40 (or however many pictures there are) questions of the form "is it person 10?" and evaluate against the chances of >95% person 10, <95% person 10, and the same for persons 11, 12...

      It doesn't work exactly like that, but it helps in understanding the possibilities.
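
      If it helps, the two-by-two recombination described above looks roughly like this (a toy sketch in Python, not anything from the paper):

        # Combine ternary answers pairwise, as in the tree described above:
        # +1 = "first answer likely", 0 = "too uncertain", -1 = "second answer likely".
        def combine(a, b):
            if a == b:
                return a      # agreement (including agreed uncertainty) passes through
            if a == 0:
                return b      # one undecided: defer to the other
            if b == 0:
                return a
            return 0          # outright conflict -> "too uncertain"

        def tree_vote(answers):
            """Reduce a list of ternary answers pairwise until one remains."""
            while len(answers) > 1:
                paired = [combine(answers[i], answers[i + 1])
                          for i in range(0, len(answers) - 1, 2)]
                if len(answers) % 2:              # odd leftover carries to next round
                    paired.append(answers[-1])
                answers = paired
            return answers[0]

        # Four programs asked "does the person wear glasses?"
        print(tree_vote([+1, +1, 0, -1]))   # -> 0: the two branches end up conflicting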

      1. Anonymous Coward
        Anonymous Coward

        Re: Just need more parameters…

        Correction: with binary choices, anything over 50% is already an indicator, so one would better evaluate against >60%.

    3. MonkeyJuice Bronze badge

      Re: Just need more parameters…

      As you quantize a model, its performance *does* decrease. Because of the natural tendency to treat the results as coming from an imperfect AI talking like a pirate, or a marketing exec, people find this more acceptable than suffering a $3k rounding error. Think of it as the salami scam - you're still losing, but it's distributed into a huge pile of disappointing coefficients. The advantage is you can conceivably fit the model into GPU RAM rather than having a bottleneck thrashing the bus, which gives you _far_ faster inference and training.
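
      As a rough picture of what squashing weights down to ternary looks like, here's a sketch of a generic "absmean"-style post-hoc quantiser (my own illustration - the models in the paper are trained with ternary weights from the start, not quantised afterwards):

        # Generic ternary quantisation sketch: one scale per matrix, codes in {-1, 0, +1}.
        import numpy as np

        def quantise_ternary(w):
            """Map a float weight matrix to int8 codes in {-1, 0, +1} plus a scale."""
            scale = np.mean(np.abs(w)) + 1e-8           # single scale for the matrix
            codes = np.clip(np.round(w / scale), -1, 1)
            return codes.astype(np.int8), scale

        w = np.random.randn(4, 4).astype(np.float32)
        codes, s = quantise_ternary(w)
        print(codes)                                    # the ternary weights
        print(np.abs(w - codes * s).mean())             # the quantisation error at issue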

      Although I've not looked at the details, it seems perfectly conceivable to have redundancy of ternary digits to alleviate the error, but you would (probably) need to train on that architecture from the start. I suspect you could dial the precision / power / performance trade-off to some extent.

      As for making LLMs less awful, more data is the conventional wisdom, but better-curated training data helps. That is far more boring and doesn't secure you as much of an R&D budget as gunning for the petabyte models, but we basically hit the point of diminishing returns a while ago, so your kind assessment of 'mediocre' feels spot on to me.

      We're only really here using GPUs because there was enough market from the games industry to propel hardware development in that direction. With 'NPUs' (whatever they are) round the corner, and other research like this one, it looks quite likely that you're right.

    4. Korev Silver badge
      Boffin

      Re: Just need more parameters…

      > I do a lot of financial numerical stuff, and the weirdness in FP precision even when you have lots of bits to play with is something we have to pay close attention to, otherwise the results can go very skew-whiff.

      I was involved in a biochemistry project that got hit by this a few years ago. Luckily for us it was very obvious as the models started outputting negative concentrations of a ligand around a receptor (which without a quick rewrite of the laws of physics can't exist).

    5. LionelB Silver badge

      Re: Just need more parameters…

      > Or does the arithmetic weirdness of individual calculations just get lost in the general weirdness of the whole thing?

      Kind of, yes. One way to look at it is that the data itself is so noisy that arithmetical imprecision is (up to a point) not that critical.

      Imagine, for instance, if a few tokens in a massive training set (and LLM training sets are epically huge) were changed. This would imply that (given absolute mathematical precision) the model parameters would be ever so slightly tweaked. But who's to say those few tokens were "necessary" or "correct" in the first place? So "tweaking" model parameters via arithmetical imprecision is analogous to the effect of small variations in the training set, which are on the whole unlikely to significantly affect the output.

      (That "on the whole", though, is a caveat: if all parameters in the model are skewed by arithmetical imprecision---as they will be---then, depending on the size and nature/quality of the training set, and the size of the model, the cumulative effect may become significant... I have no idea of the extent to which that scale/precision trade-off is quantifiable; I suspect that would be very hard to work out.)

  7. steelpillow Silver badge
    Boffin

    Ternary

    Intriguing to see ternary logic (-1, 0, 1) replacing binary in maximising efficiency. This has always been theoretical, with binary proving so much easier to manufacture that we ran with that instead. But ternary logic and memory devices are not /that/ hard, so perhaps we'll be seeing ternary array hardware in the not too distant future. The question has to be, is the move from binary to ternary most efficiently made at the software, firmware or hardware level?

    1. Michael Wojcik Silver badge

      Re: Ternary

      It hasn't always been theoretical. I occasionally run across ternary systems. IOTA, the company behind the "Tangle" Merkle graph thing that the OMG picked up as an alternative to blockchain [1], originally used ternary (for the usual reason, because 3 is closer to e than 2 is). Apparently they dropped that in the face of resistance from other implementers or something, though.
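
      (The "closer to e" argument, for reference, is the old radix-economy one: the cost of representing numbers up to N in base b scales roughly as

      \[ E(b) \;=\; b \cdot \log_b N \;=\; \frac{b}{\ln b}\,\ln N , \]

      and b / ln b is minimised at b = e ≈ 2.718, with base 3 beating base 2 by a whisker: 3/ln 3 ≈ 2.73 versus 2/ln 2 ≈ 2.89.)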

      And didn't the Russians have some ternary computers back in the Cold War days?

      [1] Which, on the one hand, makes sense: blockchain is a dumb, degenerate sort of Merkle graph, so why not use a grownup version, as e.g. git and various filesystems do? On the other, since blockchain is basically useless, why bother with a distributed ledger that's a full-fat Merkle graph either?

  8. Pete 2 Silver badge

    Good heavens Jevons!

    > AI-powered Google searches each use 3.0 Wh, ten times more than traditional Google queries

    Making AIs more efficient may not be a good thing.

    The Jevons Paradox tells us:

    occurs when technological progress increases the efficiency with which a resource is used (reducing the amount necessary for any one use), but the falling cost of use induces increases in demand enough that resource use is increased, rather than reduced

    So, for example, when steam engines were primitive and inefficient, they weren't used much. But as they improved and became cheaper, the increase in their number made them much more useful. Hence the amount of coal they (in total) used increased faster than their efficiency improved. Net result: more steam engines, and more energy used.

    1. Michael Wojcik Silver badge

      Re: Good heavens Jevons!

      Sure, Jevons is a risk. But we've already pretty much established that "AI" contains a great deal of Shiny and appeals to people's innate laziness. Those are attributes which pretty much risk saturating the market regardless of even an OOM or so in cost. I suspect "really cheap AI" isn't going to see ten times the usage of "pretty cheap AI".

      Also, while building SotA models is expensive, the power cost isn't the economic factor limiting who's building them. That seems to be a combination of the high cost and limited supply of expertise, and the diminishing returns from Yet Another Model.

  9. Anonymous Coward
    Anonymous Coward

    French art prior

    In 2016, Alemdar, Leroy, Prost-Boucle, and Pétrot, of the Grenoble CNRS, wrote "Ternary Neural Networks for Resource-Efficient AI Applications" (with an FPGA and all). And, of course, Zhu, Zhang, Sifferman, Sheaves, Wang, Richmond, Zhou, and Eshraghian, of UCSC, Davis, and LuxiTech, do not cite them in their Reference section ...

    It's well worth comparing the 3rd paragraph of the 2024 paper with the 2nd paragraph of this (uncited) 2016 one IMHO.

    1. Michael Wojcik Silver badge

      Re: French art prior

      As Justthefacts pointed out above, the replacement of self-attention with this "overlay" mechanism is probably the most important contribution of the present paper. The UCSC team may have missed the French paper (or, if you want to assume the worst, taken the idea without credit), and that's worth knowing, but it doesn't mean there's nothing novel here.

    2. diodesign (Written by Reg staff) Silver badge

      Re: French art prior

      As others have pointed out, the optimization described in this latest paper does separate it from prior research. I've added a note to the article about it.

      C.

  10. that one in the corner Silver badge

    -1, 0, 1

    That is a Tern Op for the books.

    1. steelpillow Silver badge
      Coffee/keyboard

      Re: -1, 0, 1

      There, and we all thought Baldrick was muttering about turnips. Was an idiot savant all along!
