Cheat codes for LLM performance: An introduction to speculative decoding

When it comes to AI inferencing, the faster you can generate a response, the better – and over the past few weeks, we've seen a number of announcements from chip upstarts claiming mind-bogglingly high numbers. Most recently, Cerebras claimed it had achieved an inference milestone, generating 969 tokens/sec in Meta's 405 …

  1. beast666 Silver badge

    Another LLaMe advertorial.

    1. m4r35n357 Silver badge

      Hey El Reg, have you noticed yet that your readership doesn't give a shit about your "machine learning" fluff?

      1. This post has been deleted by its author

        1. m4r35n357 Silver badge

          Why not learn what you need from technical articles or papers? It is not as if they are in short supply. This hype is nothing to do with technology, it is all smoke & mirrors.

          1. This post has been deleted by its author

      2. Anonymous Coward

        > readership doesn't give a shit

        It would be nice to see less "LLM sucks" drivel in the comments. One such comment with likes is enough. I see this in Elon-related topics too: hate and flood. We've got your point already. Informative posts, please.

        1. m4r35n357 Silver badge

          Re: > readership doesn't give a shit

          Good luck with that! This stuff is already causing more problems than it solves, and things will only get worse with fanboy support.

          People still pushing it at this point are _part_ of the problem. It really is shamefully poor technology.

        2. m4r35n357 Silver badge

          Re: > readership doesn't give a shit

          Fewer

          1. Jedit Silver badge
            Headmaster

            "Fewer"

            No, "less" is correct here. Fewer refers to objects, less refers to category. For example, I want there to be fewer conflicts and less conflict. So in this context you would say "I want to see less talk of AI drivel and fewer posts/articles on the subject".

            1. m4r35n357 Silver badge

              Re: "Fewer"

              "Less drivel" - of course you are right! I misread as "less comments".

              I shall leave my idiocy in plain sight as penance ;)

            2. This post has been deleted by its author

            3. Bilby

              Re: "Fewer"

              I have decided to just use "fewer" in all contexts, just as so many of my peers have chosen only to use "less".

              It's a lot fewer hassle for me, and if folks don't understand, well, I couldn't care fewer.

              I am quite shamefewer about it.

              1. Jedit Silver badge
                Trollface

                Re: "Fewer"

                I think we all learned a valuable feweron about language today...

        3. nobody who matters Silver badge

          Re: > readership doesn't give a shit

          > It would be nice to see less "LLM sucks" drivel in the comments

          It would be nice to see some general use case for LLMs where they don't suck, tbh.

          Maybe one day they will get there and the 'drivel' as you call it will be replaced by posts expressing awe and wonder at the brilliance of it all, but at present the technology has a very long way to go and is certainly a long, long way from being reliable enough for general use. 'Generative AI' is at present a bit like Tesla's supposed 'Full-Self-Driving' mode (ie. it isn't what it says it is!).

        4. This post has been deleted by its author

      3. Dan 55 Silver badge

        You have to distinguish between:

        - ChatGPT/Gemini/Copilot, etc... occupying whole datacentres slurping the entire internet, throwing it at the wall and seeing what sticks, and still being unable to count the Rs in strawberry

        - self-hosted LLMs which, if you feed them the right selected training materials, are probably the future for this technology

        So I find this series of articles on self-hosting LLMs an interesting read.

        1. m4r35n357 Silver badge

          If you are self hosting, why not code up a proper expert system _with your specific business logic_? ML is just flinging shit against the wall, and relying on what sticks.

          1. Dan 55 Silver badge

            Your specific business logic might be in your business's documentation, e.g. a series of red books and white papers. If you have an LLM which gives reasonable answers to staff questions (with sources so they can double-check) then you've saved time.

            1. m4r35n357 Silver badge

              I find your first sentence puzzling. Surely this is as it _should_ be, and translating docs to code should be a deterministic process. We all know that this technology is going to be most "useful" to those idiot company bosses who see documentation as an avoidable cost (in other words, pretty much all of them!).

              1. Dan 55 Silver badge

                I wasn't thinking about translating documentation to code or vice versa, which would probably not end well at all, but rather about allowing employees to ask questions about the products and services that a business sells, so having correct, up-to-date documentation is valuable in itself. Imagine asking a question about a UI feature and being told what frontend and backend code is used, the data formats and stored procedures, how data flows from one system to another, and what the chapter and verse is in the original documentation so it can be checked, and so on.
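
                In sketch form, something like this (all hypothetical: a naive keyword retriever, and llm() standing in for whatever self-hosted model you run):

                ```python
                # Sketch of "LLM over the company docs, with sources".
                DOCS = {
                    "redbook-ch3": "Orders flow via the quote API into "
                                   "the ORDERS_STAGING table, then a "
                                   "stored procedure...",
                    "whitepaper-7": "Renewals call the pricing backend...",
                }

                def retrieve(q, k=2):
                    """Rank doc chunks by crude keyword overlap."""
                    words = set(q.lower().split())
                    def score(kv):
                        return -len(words & set(kv[1].lower().split()))
                    return sorted(DOCS.items(), key=score)[:k]

                def llm(prompt):  # stub: your self-hosted model here
                    return "stubbed answer citing [redbook-ch3]"

                def answer(q):
                    ctx = "\n".join(f"[{r}] {t}" for r, t in retrieve(q))
                    return llm("Answer only from these excerpts, citing "
                               f"[ref] so staff can double-check:\n{ctx}"
                               f"\nQ: {q}")

                print(answer("Where does order data end up?"))
                ```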

                1. m4r35n357 Silver badge

                  Sounds like a "nice to have", but that is exactly the sort of procedure that is likely to go wrong IMO, or more likely just not be used. How much "training" do you think would be necessary to get that process anywhere near right? What if code or products (or anything else) change? Yes, retrain again, then patch up mistakes, etc. etc. ad infinitum.

                  I really do think the "translation" route is the right one, the docs and code are right there and can be directly compared! Why add layers of expensive and error-prone fuzzing?

                  1. Dan 55 Silver badge

                    How would a deterministic process work which can be queried in free-form English?

                    1. m4r35n357 Silver badge

                      Well I admit I know nothing at all about expert systems, but perhaps attempting to define business logic in terms of arbitrary text queries is the fundamental problem . . .

                      1. Dan 55 Silver badge

                        I'm not proposing that it be used to define anything, rather that it can generate an answer for employees about the software products or services they sell from all the business logic in all the technical and user documentation that the company has.

                        1. m4r35n357 Silver badge

                          OK, if it is bots you really really want, I have no more arguments against that . . .

                          But I still can't stand the technology or how people manage to "believe" in it!

                          1. HuBo Silver badge
                            Pint

                            Thesis -- m4r35n357: "why not code up a proper expert system"

                            Antithesis -- Dan 55: "have an LLM which gives reasonable answers"

                            Synthesis -- I think you're both right, together, that ES and LLMs have to be combined for this (as they were in IBM's Watson Jeopardy champ). LLMs are great at NLP but crap at logic. ES are great at logic but crap at NLP. Put them together to make systems that work well in both fields.

                            Now, yes, Meta's CoConuT attempts to introduce backtracking (central to Prolog) into its LLaMas, but I doubt it'll be as effective as going full hog-wild predicate calculus first-order logic through an ES (or Prolog) for the parts of the reasoning process that require logic (i.e. the actual reasoning part) -- imho (but it'll be great for the language bit).
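
                            As a toy illustration of that split (entirely hypothetical code): let the LLM side do nothing but turn free text into structured facts, and let a small rule engine do the actual reasoning.

                            ```python
                            # "LLM" side: free text -> facts (stubbed).
                            def extract_facts(text):
                                s, _, r = (text.lower().rstrip(".")
                                           .partition(" is a "))
                                return {(r, s)} if r else set()

                            # ES side: the rule man(X) -> mortal(X).
                            def forward_chain(facts):
                                for rel, x in list(facts):
                                    if rel == "man":
                                        facts.add(("mortal", x))
                                return facts

                            print(forward_chain(
                                extract_facts("Socrates is a man.")))
                            # {('man', 'socrates'),
                            #  ('mortal', 'socrates')}
                            ```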

  2. Gene Cash Silver badge

    "AI inferencing"

    It's not inferencing, it's simulating a string of words that looks close to what an answer to your query might look like, based on the training set.

    There is no thought process or reasoning behind it, and no inferring of anything.

    As humans, we can't help but be impressed by what appear to be right answers, and we assume (consciously or unconsciously) that these answers were arrived at by similar processes to the ones that we use in our own brains. But this is nowhere near the case, and it makes us all easy to fool.

    Stop peddling this snake oil.

    You're better than this, El Reg.

    1. This post has been deleted by its author

    2. O'Reg Inalsin

      Re: "AI inferencing"

      I definitely agree that LLM operation is not on the "same scale" as the human mind. Snake oil salesmen are using "learn", "think", and "infer" to shake down and rob the rest of society. Social media and other aspects of online life left society bleeding, and SmartyPants-AIs are the screw worms burrowing into those wounds for the final kill. For that reason, I had to give you a thumbs up.

      However, Armageddon aside, the meaning of "inference" as used in "statistical inference" is not a new concept:

      "From the point of view of statistical inference, the most crucial problems are those which arise in arguing from samples and their statistics back to populations and their parameters. The problems which arise here are no longer wholly deductive; conclusions cannot, in general, be drawn with certainty. Statements can be made, however, subject to risks of error of being wrong, where the error is precisely expressed in terms of probability and sampling theory." [Samuel Stanley Wilks, The Theory of Statistical Inference, 1937]

      That 87-year-old definition of "statistical inference" could be used to describe AI-LLM outputs, although "where the error is precisely expressed ..." should probably be changed to "where the error is approximated in terms of probability and sampling theory", because "error" is itself an amorphous quantity that is decided by humans on an ad hoc basis.

    3. Justthefacts Silver badge

      Re: "AI inferencing"

      What’s this “query” nonsense? If you’re using an LLM as a natural language substitute for websearch…..well, in that very limited sense, it’s probably not a good websearch tool. You’ll be wanting a search engine for that - a tool whose entire job it is to match your query to an existing body of knowledge, and simply direct you to the exact text as it was originally written without making any changes that could insert errors.

      In other news, they’ve been advertising this iPad thing, but it’s really not as good at knocking in nails as the hammer I already have.

      If however you have a really challenging Maths problem, PhD level plus, out of your field, and don’t have somebody in the top 1% of Maths PhDs to hand……try this:

      https://www.youtube.com/live/hkTpMmkVAok

      1. HuBo Silver badge
        Gimp

        Re: "AI inferencing"

        Maybe ... but remember, the William Lowell Putnam Mathematics Competition is a math contest for college students (not PhDs), and both questions and answers were posted online prior to o1 Pro "taking the test". Plus, the questions were well-written (near machine-like), without red herrings, making it easier to pattern-match answer fragments to them, and then glue those together. And Kyle admits to not understanding a bunch of the math involved in them (iirc), and therefore to being unable to actually grade those responses, except for "final answers" (which are not necessarily logically concluded through rigorous math by the o1).

        Testing o1 Pro on math, seriously, is something that more than one group of actual independent scientists might want to do, to check on accuracy and repeatability (with answers not posted online, and not otherwise available to the rotund language model), and with the introduction of mild variations (eg. red herrings) to test for robustness and sensitivity. IMHO.

  3. sabroni Silver badge
    WTF?

    but....

    if you have to wait for the big, slow thing to check the small fast thing, how are you saving time? The short time the small model takes is used as the metric, provided the slower check eventually passes?

    This is just bollocks, isn't it?

    1. HuBo Silver badge
      Windows

      Re: but....

      Yeah ... the paper linked under "suggests" (id under "discussed") in TFA has all the gory details of how this method, inspired by speculative execution in CPUs, and applicable (at least) to autoregressive models (like Transformers, aka GPTs), works. It's all down to the interplay between speculative sampling and speculative decoding that results in the speedup given by Theorem 3.8:

      S = (1 − α^(γ+1)) / ((1 − α)(γc + 1))

      with c hardware-related and close to 0, α some intrinsic property of the model and task (a value from 0 to 1), and γ+1 the number of concurrent small (speculative) models that can be run in parallel with no increase in walltime (γ = 2, 3, 5, 7, 10 in Tables 1 & 2). Table 1 shows S = 6.9X with c=0, α=0.9, γ=10, and in experiments they went up to S = 3.4X (Table 2).
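
      A quick sanity check of that in Python (just a sketch; the parameter names mirror the formula above):

      ```python
      # Theorem 3.8 expected speedup:
      # S = (1 - α^(γ+1)) / ((1 - α)(γc + 1))
      def speedup(alpha, gamma, c=0.0):
          return ((1 - alpha ** (gamma + 1))
                  / ((1 - alpha) * (gamma * c + 1)))

      print(round(speedup(0.9, 10), 1))  # 6.9, as in Table 1
      ```

      And that's the answer to the "how does checking save time" question: the big model verifies all the draft guesses in a single parallel pass, so each accepted guess is a token obtained at draft-model cost, while a miss only wastes the cheap draft work.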

      1. sabroni Silver badge
        Thumb Up

        Re: S = (1 − α^(γ+1)) / ((1 − α)(γc + 1))

        Sweet, thanks. I don't grok the maths exactly but I get the concept.

  4. Anonymous Coward

    Hallucinations matter

    A fly cannot hallucinate at a human's level. Hallucinations and dreams are hypotheses. Automatisms (fast decisions) are reliable hallucinations, which could still result in mistakes. Automatisms are necessary to move fast, because detailed analysis is slow and expensive, though sometimes possible or necessary. Impulsive purchases are a great example. The more expensive a purchase, the more regrets. Like buying a house before a market crash. Tell me humans are intelligent.

    1. Gene Cash Silver badge

      Re: Hallucinations matter

      Stop rolling your jargon dice since you don't understand half of what you said.

      Hallucinations and dreams are NOT hypotheses. A hypothesis is something you can test: "is water wet?" [sticks hand in] "yes it is" from which you evolve the theory that "water is wet"

      And there's no such thing as a "reliable hallucination"

      1. ChromenulAI

        Re: Hallucinations matter

        Your thinking about hallucinations was hallucinated under false pretenses that not everyone shares. Feel free to jump up and down screaming until your lungs bleed, because some of us specifically go out of our way to think differently than everyone else.

        Spare me your arrogance and serve me your allegiance.

      2. Charlie Clark Silver badge

        Re: Hallucinations matter

        I think someone may have been testing a model to come up with the bullshit.

    2. Anonymous Coward

      Re: Hallucinations matter

      ¡¿ ... don't hallucinate yourself a fast open fly automatism in public decisions where half of humans exceed median intelligence ... ?!

  5. Combat Epistomologist

    So ......

    .... now LLMs can be wrong faster?

    1. m4r35n357 Silver badge

      Re: So ......

      Like Max Power!

  6. Dimmer Silver badge

    As a test.

    I fed the comments into an AI and here is what it came back with:

    Conclusion:

    The discussion reflects a diverse readership: deeply technical individuals, skeptics of AI hype, and those seeking balanced, informative content. Users engage in spirited debate about the limitations, usefulness, and societal implications of AI/LLMs, often mixing sarcasm, technical depth, and community banter.

    So then I asked if it was funny:

    Yes, parts of the conversation were funny, particularly if you enjoy dry wit, sarcasm, and technical humor. The humor emerges from a mix of pedantry, playful wordplay, and snark, often in response to others’ overly serious or exaggerated takes.

    If a normal person had a choice between drinking a cold beer and reading the comment, which would they?

    A normal person would almost certainly choose the cold beer.

    The comments, while witty and entertaining for a niche audience (tech enthusiasts, grammar pedants, and AI skeptics), are dense, opinionated, and full of technical jargon. For the average person, reading through a lengthy back-and-forth about “fewer vs. less,” expert systems vs. LLMs, and AI hallucinations would feel like hard work—especially when a refreshing beer is on the table.

    The choice is simple:

    • Cold beer = immediate enjoyment, easy to process, universally appealing.

    • Comments = mental gymnastics, niche humor, and the risk of needing a second beer to recover.

    1. Burgha2

      Re: As a test.

      Seems pretty accurate, tbf

      1. This post has been deleted by its author

  7. metooaswell

    These systems are useful

    These LLM systems are indeed very useful, accurate and practical for a wide variety of problems.

    One example I implemented last week at work was a system for helping to write Cypress tests for journeys through an insurance company website. We wrote a system to fire up a browser, navigate to the site, and give the LLM the DOM of the loaded page; a human can then write something like 'accept cookies and then select that you want to purchase Dog Insurance'. The LLM takes that (along with the serialized DOM) and returns Cypress code to perform the action on the site, which we can then concatenate to our test code file and also execute to change the state of the website. We then pass in the new DOM and a new instruction: 'fill out the form with my dog Geoff's details. He is 10 years old and a border collie.'

    Using an LLM in this way generates accurate results and usable test files. The advantage over writing the tests directly oneself is that if the structure of the website changes you can just regenerate the test file at the click of a button. A human would have to go in and start looking at the DOM all over again, rewriting the tests to ensure that any DOM changes haven't affected their test commands.
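
    Roughly, the loop looks like this (a sketch only; get_dom(), llm() and run_snippet() are hypothetical stand-ins for the browser driver, the model call and the step executor):

    ```python
    # DOM -> LLM -> Cypress snippet -> append to spec ->
    # execute to advance the site -> new DOM -> repeat.
    STEPS = [
        "accept cookies and select Dog Insurance",
        "fill out the form with dog Geoff's details: "
        "10 years old, border collie",
    ]

    def get_dom(url):        # stub: would drive a real browser
        return "<html>...</html>"

    def llm(prompt):         # stub: the LLM call
        return "cy.contains('Accept').click();"

    def run_snippet(code):   # stub: run the step, return new DOM
        return "<html>...next page...</html>"

    def generate_spec(url):
        dom, body = get_dom(url), ""
        for step in STEPS:
            code = llm(f"Given this DOM:\n{dom}\nReturn only "
                       f"Cypress commands that: {step}")
            body += "  " + code + "\n"
            dom = run_snippet(code)   # site state moves on
        return ("describe('dog insurance journey', () => {\n"
                + body + "});\n")

    print(generate_spec("https://insurer.example/dog"))
    ```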

    This is just one practical use of LLMs doing something that we needed humans to do for us previously. To say they have no utility, or only trivial utility, is a very misguided way to think about this new technology.

    1. This post has been deleted by its author

  8. This post has been deleted by its author
