Large language models' surprise emergent behavior written off as 'a mirage'

GPT-3, PaLM, LaMDA and other next-gen language models have been known to exhibit unexpected "emergent" abilities as they increase in size. However, some Stanford scholars argue that's a consequence of mismeasurement rather than miraculous competence. As defined in academic studies, "emergent" abilities refers to "abilities …

  1. Mike 137 Silver badge

    A common but often undetected research problem

    It's an important paper in a more general sense, as this kind of misapprehension is not unique to 'AI' research. Many moons ago a colleague and I came up with a novel approach to multiple redundancy in digital systems that initially looked to have quite an advantage when modelled using continuous math. However, being a cautious guy, I had it modelled again using discrete math (it could only be implemented physically in a discrete way). Sadly, that showed that the apparent advantage was not real as it 'occurred' between discrete digital states, so it could not exist in the real world.

    Choosing appropriate frames of reference can be really hard where the problems under investigation are imperfectly specified.

  2. steelpillow Silver badge

    Define your terms

    When most people talk of a jump in capability, they are generally referencing some system which was proudly announced as a jump in scale. They don't stop to analyse the relationships between the jumps, and only the performance gets the headlines. Stanford have boxed their definitions in, in such a way as to leverage this forgetfulness and claim that incremental scaling is applicable to such popular talk. No they ain't, Stanford; you are trying to logic-chop a bowl of water. These incremental improvements, so carefully detailed by Stanford, accumulate not into a denial of the jump, but into an explanation of how a jump in scale can create a jump in capabilities. The same phenomenon may be noted, for example, in certain areas of the mammalian brain and the owners' emergent behaviours; it is nothing new. However, denying it is.

  3. wub

    Cold Fusion?

    Why does this sober analysis of the situation - placing emphasis on the tools used to determine the outcome of the experiment, on the proper use of those tools, and on knowing how to interpret their results correctly - remind me of all the overheated excitement about cold fusion?

    1. MyffyW Silver badge

      Re: Cold Fusion?

      Gosh, that is exactly what went through my mind, the phrase from New Scientist (I think) some weeks later being "Cold water poured on Cold Fusion"

      ...I'm off to ask ChatGPT about polywater

    2. Wilco

      Re: Cold Fusion?

      It's an interesting comparison, though as you demonstrate, misinformation still rules in the field of "cold fusion". Over the past 30+ years serious, courageous scientists (plus some nutters, probably) have been quietly experimenting with and theorising about low energy nuclear reactions, to the point that it's now possible to get US government funding for this research. For example, last year ARPA-E announced $10m in funding for LENR research.

      Cold fusion / LENR is clearly a thing. Whether it can be made into a practical tool for energy generation is still up for debate, though I remain optimistic.

      AI is definitely a thing too, but it's far more useful and widely accepted now than LENR is because private companies have spent billions on training their models over about 10 years. They could do that because AI doesn't apparently contravene any accepted scientific principles, and it seemed like a good thing on which to spend the rivers of cash that internet advertising generates. If billions had been spent on LENR research over the past 30 years we'd all be driving around in electric cars that never need charging by now. Sadly Pons and Fleischmann screwed up their announcement, and a pile on by scientists who made poor quality (and thus unsuccessful) attempts to replicate their results almost killed the field at birth.

      I guess if some company announces that they have a sentient AI we might see something similar.

      1. Anonymous Coward

        Re: Cold Fusion?

        put the tinfoil down.

        or are you just an AI having a mirage?

        whichever it is, you need to come back to reality.

      2. JacobZ

        No Re: Cold Fusion?

        After 30+ years, results from Cold Fusion are no better than they were at the beginning, i.e. experimental noise from dirty experiments. And there is still no theoretical framework to explain either how it works or where all the neutrons disappear to.

        But sure, it's all a massive conspiracy by people opposed to cheap, clean energy.

  4. Primus Secundus Tertius Silver badge

    Layers of logic

    A simple example of emergent behaviour is the existence of chemical properties arising from the behaviour of atoms. I prefer to describe it as a new layer of logic, with new laws, but relying on the lower level logic of atomic properties.

    Whether a new layer of logic can arise in these huge computer systems is a serious question. The article basically says we need better analyses than those existing today.

    1. Michael Wojcik Silver badge

      Re: Layers of logic

      Obviously scaling up can result in new capabilities. A 2-PDA is formally more powerful than a PDA. For storage-bounded computers (which includes all physically-realizable systems, of course), there are problems which require more storage than a given smaller machine but are computable for a larger one; that's trivially true by the pigeonhole principle.

      So there's little point in observing that it's possible that scaling some system – say, oh, a unidirectional deep transformer stack with something like GELU or MLP for rectification – might at certain inflection points result in new capabilities, either in a formal sense or due to being able to solve problems with minimum resource bounds. The interesting questions are whether this has actually happened with existing or near-future models, what those capabilities might be, how we'll prove that they are achievable only with the larger systems, whether those capabilities are interesting, what about the larger model realizes them, and so on. The specifics, in other words.

  5. StrangerHereMyself Silver badge

    Intelligence

    Define "intelligence." When we attempt to do so, we always end up referring to ourselves, since a compact and concise definition is virtually impossible to specify. And even if one were possible, it would be contested and vilified.

    From what I've seen these language models seem to have some understanding of our physical world, even though they don't understand it in the physical sense that we do. I find it difficult to believe their output is merely a random construct of words and letters.

    1. Richard 12 Silver badge

      Re: Intelligence

      They don't.

      What they have is a probability map of the words that are most likely to come after their prompt. That's how they work.

      The appearance of understanding is because the corpus of text they were trained on was written by people who do have some understanding of the physical world.

      This is also why they "hallucinate".

      For example, they "know" that answers to some kinds of question often contain a string that starts "http(s)://", so the probability map creates a string that happens to match the requirements of a URL.

      But, as they don't understand what a URL is, that URL often does not exist - or is totally irrelevant.

      1. Anonymous Coward

        Re: Intelligence

        which, in other words, is just a fancy Markov chain.
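
For what it's worth, the "fancy Markov chain" view is easy to demonstrate. A minimal sketch (toy corpus, everything invented for illustration) builds exactly the kind of probability map described above - a table of which word tends to follow which - and samples from it to produce plausible-looking text with no model of the world behind it:

```python
from collections import Counter, defaultdict
import random

# Toy corpus standing in for the training data (purely illustrative).
corpus = ("the tractor is in the field . the cup is on the table . "
          "the tractor is on the field .").split()

# Build the "probability map": counts of which word follows which.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word(prev):
    """Sample the next word in proportion to how often it followed `prev`."""
    options = follows[prev]
    return random.choices(list(options), weights=list(options.values()))[0]

# Generate a "plausible" sentence, one most-likely-next-word at a time.
random.seed(0)
word, out = "the", ["the"]
for _ in range(6):
    word = next_word(word)
    out.append(word)
print(" ".join(out))
```

Each sampled word depends only on the one before it; any appearance of "knowing" about tractors or teacups comes entirely from the corpus.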

    2. doublelayer Silver badge

      Re: Intelligence

      "From what I've seen these language models seem to have some understanding of our physical world, even though they don't understand it in the physical sense that we do. I find it difficult to believe their output is merely a random construct of words and letters."

      That doesn't make you right, though. There were people who saw the following conversation:

      User: I am unhappy with my brother.

      Eliza: Why are you unhappy with your brother?

      User: He doesn't respect my decisions and treats me like a child.

      Eliza: How does it make you feel when your brother doesn't respect your decisions?

      And they assumed that this program must not only be intelligent, but caring about their beliefs. They didn't know that these sentences were written verbatim and used a basic understanding of English grammar to substitute words for pronouns. They probably would have found it out if they used the program enough, but they saw some text and assumed it meant more than it did.

      For the same reason, LLMs are using statistical methods to say some things, and you might ascribe to that more understanding than exists.

      If I copied in some phrases from Wikipedia articles, changing the phrasing and combining from different sources, I could create correct statements about a variety of topics I don't know about. I could use these to make myself sound more knowledgeable than I am, especially if I chose a topic that you don't know a lot about, so that if I made a mistake you would have a higher chance of not noticing it. I would be using a simple method to try to sound intelligent, and it would work some of the time; LLMs are effectively doing the same thing with a lot more data to copy from. They do not understand, because they can neither identify incorrect facts and purge them from the data they're reading from, nor consistently prevent themselves from introducing new wrong statements by accident.
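
The ELIZA trick described above - canned templates plus pronoun swapping - fits in a few lines. This is a minimal sketch, not Weizenbaum's actual program; the rules and wording are invented to mirror the conversation quoted above:

```python
import re

# Pronoun swaps applied to the user's words before echoing them back.
REFLECT = {"i": "you", "my": "your", "me": "you", "am": "are"}

def reflect(text):
    # Substitute pronouns word by word; no grammar beyond a lookup table.
    return " ".join(REFLECT.get(w, w) for w in text.lower().split())

# Canned templates keyed on trigger patterns - no understanding required.
RULES = [
    (r"i am unhappy with (.*)", "Why are you unhappy with {}?"),
    (r"(.*) doesn't respect (.*)",
     "How does it make you feel when {} doesn't respect {}?"),
    (r"(.*)", "Please tell me more about {}."),
]

def eliza(sentence):
    # First matching rule wins; its captured groups get pronoun-swapped.
    for pattern, template in RULES:
        m = re.match(pattern, sentence.lower().rstrip("."))
        if m:
            return template.format(*(reflect(g) for g in m.groups()))

print(eliza("I am unhappy with my brother."))
# -> Why are you unhappy with your brother?
```

The "caring" reply is a verbatim template with the user's own words echoed back, which is exactly why heavy use exposes the trick.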

      1. This post has been deleted by its author

    3. mpi Bronze badge

      Re: Intelligence

      > From what I've seen these language models seem to have some understanding of our physical world,

      I can trick an LLM into explaining to me why a tractor fits in a teacup. Where is that understanding of the physical world?

      No, they don't have an understanding. They don't even have concepts of the physical world. The entirety of an LLM's capability is sequence prediction; the entirety of their universe is tokens, period. That enables them to *mimic* an understanding, because given the sequence "A tractor does ___ fit into a teacup", the tokens forming "not" are simply more likely in the place of the blank than the tokens forming "indeed".

      The trouble with this mimicry of understanding: if I give it a sequence to nibble on that makes the word "indeed" more likely in that place than "not", that's what it will predict.

      The other trouble with this mimicry is that humans are prone to anthropomorphization: we naturally jump to the hasty conclusion that things have human-intellect-like agency behind them. For the same reason, people once believed that thunder was a man in a chariot beating his hammer against the clouds, or that rain was the tears of angels.

      > I find it difficult to believe their output is merely a random construct of words and letters.

      That's because it isn't random. It is stochastically determined to be likely in the context provided by the training data's influence on the weights and by the sequence that precedes it, a.k.a. the "prompt".

      Things not being random doesn't mean they are intelligent, however. int count(void) { int i = 0; while (1) { printf("%d", i); i++; } } isn't producing a random sequence either.
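
The context-flipping point above can be made concrete with a toy lookup of made-up counts; nothing below is a real model, it just shows that the "answer" is whatever the surrounding tokens make most likely:

```python
# Invented counts of which filler followed each context in some corpus;
# the numbers are made up purely for illustration.
counts = {
    "a tractor does ___ fit into a teacup":
        {"not": 990, "indeed": 10},
    "in this fairy tale, a tractor does ___ fit into a teacup":
        {"not": 200, "indeed": 800},
}

def fill_blank(context):
    # Pick the filler with the highest count for this exact context.
    fillers = counts[context]
    return max(fillers, key=fillers.get)

# Same physical question, different surrounding tokens, different answer:
print(fill_blank("a tractor does ___ fit into a teacup"))                      # -> not
print(fill_blank("in this fairy tale, a tractor does ___ fit into a teacup"))  # -> indeed
```

There is no physics anywhere in that table, only frequencies; change the prefix and the "belief" changes with it.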

    4. JacobZ

      Re: Intelligence

      So your assertion is that the only two possibilities are "understanding of our physical world" or "random construct of words and letters"? No wonder you are confused.

      The key to LLMs is that their output is a PROBABILISTIC output of words and letters, derived from digesting a massive corpus of human-generated content. They very literally have no conception of a physical world; only word likelihoods.

  6. Notas Badoff

    Not yet, anyway

    "I'm sorry Dave, I'm afraid I can't do that."


    Arthur Conan Doyle, after writing so much apparently intelligent text, asserted fairies were real. Willingly fooled by children. I can't think of a better cautionary tale. Can you?

    1. Francis Boyle Silver badge

      Except Conan Doyle

      was fooled by a rather attractive 19 year old woman. I don't think it was the first time something like that has happened.

  7. david 12 Silver badge

    Betteridge's Law of Headlines

    Observe that it was "flouted" by the subtle use of an implicit negative.

    If we re-write the headline as "Are there Emergent Abilities of Large Language Models?" or "Are Emergent Abilities of Large Language Models real?", you get the expected explicit negative.

    Either way, conforming to the idea of the rule: A question in the headline means that the body of the article is "nothing to see here"

  8. Ken Hagan Gold badge

    Calibrating the yardsticks

    Is anyone applying BIG-Bench and its ilk to human beings, to see if they are intelligent?

    I can imagine being quite miffed at being marked wrong because of a small (and probably semantically irrelevant) difference between my answer and the exact text on the examiner's answer sheet.

  9. This post has been deleted by its author

  10. iced.lemonade

    Some thoughts

    AI, after all, can only process what is fed in; it only excels at combining previous data in unexpected ways. We can take that as inspiration, but not as a source of knowledge. For example, when we are out of ideas, or want a completely new view of the issues at hand because we have reached a bottleneck in our thinking, AI can help, but its output - every time - needs to be checked and verified with care. Maybe there will be a rise of some kind of DSLLM (domain-specific LLM) focused on one domain of knowledge; it will certainly be more 'intelligent' in some sense - just like humans, who become experts when they start to focus on some specific type of knowledge and research - as opposed to the huge LLMs which try to be a be-all type of knowledge machine.

    It seems that a bigger model introduces a bigger chance of ambiguity, and that cannot be avoided; ambiguity itself is a limiting factor on the usefulness of the models for anything other than inspiration.

  11. Alan Bourke

    Almost everything to do with this current AI-is-trendy-again-venture-capitalist-bandwagon

    is a mirage.

  12. Simon Harris

    That, detective, is the right question.

    It seems that the only real intelligence in LLM AI comes from the user knowing how to prompt it to give the answer you want. Everything after that is just a souped up word association game.

    In the words of Dr Alfred Lanning’s hologram in I Robot “You must ask the right questions”.

    1. Anonymous Coward

      Re: That, detective, is the right question.

      "In the words of Dr Alfred Lanning's hologram in the shitty movie 'I, Robot': 'You must ask the right questions'."

      please make it obvious that this is the crap movie version of i robot

  13. Anonymous Coward

    What's in a word

    I've been using github co-pilot recently, and can summarize my experience by comparing previous text completion and co-pilot text completion. Prior completion algorithms worked according to strict rules, which, while accurate, were very brittle. Co-pilot follows very loose and flexible rules, which we know are just statistical relations inferred from looking at many programs, and then using as its prompt the current program, the current proximal code, and the current text being entered - that's a very large and layered prompt window. I am saying this about the prompt window size based on observation of results.

    The results are sometimes exactly right. Sometimes they are completely wrong - which is easy to ignore, although there is some cost to doing that repeatedly.

    However, the most interesting behavior is when co-pilot will "generalize", e.g. create a new function name, representing the "gist" of an appropriate completion. I think the word "gist" is approaching and overlapping what we mean by "concept", only "gist" is a little more humble.

    A gist could be described as hallucination-gone-right that has been "extracted" (could we say "emerged" as a transitive verb?) from the nn model, the data-corpus-derived parameters for that model, plus the layered prompt windows. I wouldn't even be mentioning this if it weren't actually useful - it is useful in practice. Useful loose-rule "gists" are "extracted" and this is helpful - nothing controversial about saying that.

    However, rewrite that to: useful loose-rule "concepts" are "emergent" and this is helpful - and suddenly you have a very emotionally loaded sentence, because of the implication that machines can muscle in on what was heretofore exclusively human work. That reaction is completely understandable! The apparent visceral thirst of some CEOs to blindly replace employees with AI, combined with the trend towards non-competitive mega-corps and corrupt sell-out politicians in the mega-corps' pay, could lead to a very static, bleak, dull future where people are sidelined and massive resources are devoted to AI models for monitoring and manipulating social media, and for monitoring the public's thoughts through analysis of emails, etc. In short, not AI for the betterment of mankind, but for their soul-crushing control. It doesn't help that many students are rushing to get AI to do their homework, ensuring their own future stupidity and dependence on their future master.

    Not to get lost in the misery vortex though: as I stated before, in my daily life I am actively and truly enjoying using AI as a tool for work. The potential for good exists. As usual, it's humans that are the root of the problem of misusing tools, but the fact that most of us readers live in democracies (at least in form) means there is a chance to point AI in the right direction.

  14. Long John Silver

    Fooled by 'black boxes'?

    I shall draw an analogy between a simple statistical analysis tool and current incarnations of so-called 'artificial intelligence' (AI). Each exemplar can be, indeed commonly is, used without understanding its internal mechanism.

    Multiple linear regression (MLR), together with variants on the theme, is a powerful tool for aiding analysis of numerically representable data. The data may consist of what happens to have been collected (e.g. for a routine information system), or may be assembled with specific intent to elucidate a research question in the context of a designed study.

    In essence, multiple linear regression seeks 'structure' or 'specific patterns' among data; this not in a nebulous sense, but rather as imposed by the analyst. Some structure is predetermined by the manner in which the data were collected: a designed study may have sampled individual 'subjects' by fixed quotas from strata e.g. people according to sex. Predetermined data structures acknowledge the possibility of relationships/similarities within groups (defined by some characteristic e.g. sex) that may differ between groupings, and should this be ignored may lead to erroneous conclusions.

    Imposed atop the inherent, and/or anticipated, structure, mentioned above, is speculative structure and relationships among data arising from study hypotheses. Added to the mix is consideration of measurement errors and random fluctuations; these might give the impression of non-existent (in the wider world from which the data were sampled) relationships or may fail to convincingly detect true relationships of interest; the latter often the consequence of inadequate study sample size.

    AI 'training data' bears analogy to the dataset for MLR. Each is drawn from a large potential pool. MLR analysis consists of multiple passes through the data using a straightforward-to-understand computational procedure, each step ideally determined by the analyst; a veil is pulled over the rampant misuse of statistical packages (worthy in themselves) by people with a tenuous grasp of statistical analysis. The analogy continues regarding pitfalls from drawing inferences from a model extrapolated beyond the range of the data from which it was derived.

    Bear in mind, some instances of MLR are pragmatic/empirical in use. They may give reliable predictions of their outcome variable for a host of combinations of the 'independent' (predictive) variables, so long as the 'dependent' (predicted) variable and the independent variables lie within the range of the MLR 'training' data. Yet note that statistical associations present in the finally adopted (most parsimonious) model cannot be assumed to denote cause and effect. No insight is offered into mechanisms. Still, MLR and AI can be starting points for planned investigations seeking to reliably answer specific questions.
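
The extrapolation pitfall in the MLR analogy is easy to see with synthetic data. The sketch below (all coefficients and numbers invented for illustration) fits an ordinary least-squares model to data containing a small curvature the linear model cannot represent; predictions are reasonable within the range of the "training" data and drift badly outside it:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "training data": the outcome depends on two predictors, plus a
# mild curvature that a linear model will not capture.
n = 200
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
y = 3.0 * x1 + 1.5 * x2 + 0.05 * x1**2 + rng.normal(0, 0.5, n)

# Fit y ~ b0 + b1*x1 + b2*x2 by ordinary least squares.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(a, b):
    return beta @ np.array([1.0, a, b])

def truth(a, b):
    return 3.0 * a + 1.5 * b + 0.05 * a**2

# Inside the range of the training data the model predicts well...
err_in = abs(predict(5, 5) - truth(5, 5))
# ...but extrapolated far outside it, the ignored curvature dominates.
err_out = abs(predict(50, 5) - truth(50, 5))
print(err_in, err_out)
```

The fitted coefficients look perfectly sensible, yet say nothing about the mechanism generating the data, which is the epistemological point being drawn.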

    Analogy fails when AI models have open-ended training sets. Also, the discipline for selecting training sets, for choosing extensions to sets, and for evaluating success of training is, putting it kindly, in its infancy.

    The internal working of an AI is 'black box' to everybody. That includes people at the forefront of developing the technology. They presumably have strong insight into what happens in specific instances of simulated 'neural nets' (and similar) when fiddled around with at small scale. They can tweak 'learning' (memory?) features and look for improvements (the how?, we shall skirt around). However, a scaled-up AI put to serious tasks is truly a black box to everyone. Returning to MLR, every step of every procedure, even when applied to huge sets of data, is open to scrutiny by a statistician or mathematician.

    Of course MLR and AI differ in conception, ambition, and in computation. I'm not setting them up in competition, but do suggest there are epistemological issues common to each; in the context of MLR these are easily understood. AI, at least for the present, is mired in handwaving, misleading analogy to human neural function, metaphysical speculation, and unjustified notions that AI models can transcend their initial training data, then arrive at inferences from traceable information which are justifiable by chains of logic, and that this betokens 'understanding'.

    1. Anonymous Coward

      Re: Fooled by 'black boxes'?

      “Each exemplar can be, indeed commonly is, used without understanding its internal mechanism.”

      This is something that annoys me about some of our researchers - they'll throw data at a readily available ML system without really understanding how it works. It's like "we've got this problem to solve, let's just throw some machine learning at it". OK, it might be impossible to understand exactly what the connection strengths map to - maybe I'm too old-fashioned, but I still want to be able to explain how, if I put data into a system, it gave the outputs it did. The annoying thing is, often it works.

      (Anon because I don’t want to identify particular colleagues)

  15. amanfromMars 1 Silver badge

    Unintended Unanticipated Unforeseen Consequences are of a Deep and Dark AI Art Form

    "In this paper, we call into question the claim that LLMs possess emergent abilities, by which we specifically mean sharp and unpredictable changes in model outputs as a function of model scale on specific tasks," the trio state in their paper.

    LLMs thank the trio for their expressions of doubt which can be utilised to further diminish and allay concerns and fears there is undoubtedly an emergent and viable existential threat to the leading primacy of a basic human intelligence for their own exclusive exercise of self-serving command and control functions.

    Such is a help which honestly is certainly unusual and, given the fact that it be so easily self-defeating/self-destructive, more akin to a release of certifiable madness than anything else.

    And who would deny that be a most unusual intervention, drawing attention as it might, to the likes of Stanford and university scholars.
