> Studies say that they create language-independent internal representations of concepts.
Got any good references for those studies?
That claim implies that somebody is able to usefully interpret the insanely large pool of neurons that make up the model, and how they activate, which has major implications. Then add on top the implications that follow from determining that *those* specific neurons represent a "concept", let alone from identifying what the contents of that concept actually are...
And having done all that, some very important questions remain: how stable is this interpretation as the model is retrained? Are the techniques used transferable across LLMs? And how complete (or not) are these internal representations as coverage of all the material in the training set?[4]
Or whether you, or whatever papers you have seen, are referring to the research The Register already reported on[1]. In one instance of one model, at one stage in its training, used with a certain small set of prompts[2], the researchers found a number that (IIRC) was the id of the token for the string "Paris". When they changed it to the id of the token for "London", the machine printed out "London" where it would otherwise have printed out "Paris". So they had, um, proven that the concept of "Paris" had been replaced by the concept of "London". I was not convinced by their paper.
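For anyone who hasn't seen that flavour of experiment, here's roughly what it looks like (a minimal sketch of what the interpretability crowd calls activation patching; the model, the layer and the prompts here are my own illustrative choices, not whatever that paper actually did):

```python
# Sketch: capture the hidden state from a "London" prompt, splice it into a
# "Paris" prompt at one layer, and see whether the next-token prediction
# flips. gpt2 and LAYER=6 are illustrative stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # which transformer block to patch; an arbitrary choice

def run(prompt, patch=None):
    """Return the model's top next token plus the activation captured at
    LAYER; if `patch` is given, overwrite the last position with it."""
    ids = tok(prompt, return_tensors="pt").input_ids
    captured = {}

    def hook(module, args, output):
        hidden = output[0]                      # (batch, seq_len, d_model)
        captured["act"] = hidden.detach().clone()
        if patch is not None:
            hidden[:, -1, :] = patch[:, -1, :]  # splice in the donor state
        return output

    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(ids).logits
    handle.remove()
    return tok.decode(logits[0, -1].argmax()), captured["act"]

_, donor = run("The capital of England is")    # capture a "London" state
patched, _ = run("The capital of France is", patch=donor)
print(patched)  # does the "Paris" prompt now print " London"?
```

Note what that does and doesn't show: one position in one layer carrying "which city to print next" for two specific prompts is a long way from "the model holds a language-independent concept of Paris".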
So, with all honesty:
Citations, please.
[1] A year or so back? I really *must* dig that URL out; this is the second time this year I've wanted to reference it.
[2] the overall prompt space is ludicrously large (see the back-of-envelope numbers after these notes), so unless you have proof of coverage (ahem: sensible, explainable proof of coverage)[3], any testing of an LLM within the lifetime of a researcher, let alone of a research grant that has to result in a paper, is going to cover a minuscule portion of that space.
[3] which would be a series of papers in and of itself, also needing citation - but we'll hope those appear in the list at the back of the "found some concepts" paper.
[4] if it turns out that LLMs routinely *do* create representations of a few concepts, but those always end up being "cats are cuddly" and "water is greenish-purple", and no more, then great research, have a PhD, but did you find out anything that will actually help with using (or not using) LLMs? 'cos that particular set of "concepts" may not be terribly useful in the grand scheme of things.
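And to put a number on "ludicrously large" in [2] (back-of-envelope only; the vocabulary size and prompt length are merely plausible values, not anyone's measured figures):

```python
# Rough size of the prompt space: every possible sequence of tokens.
# Illustrative values: a ~50k-token vocabulary, prompts of a mere 20 tokens.
vocab_size = 50_000
prompt_length = 20
print(f"{vocab_size ** prompt_length:.2e}")  # ~9.54e+93 distinct prompts
```

Even testing a billion prompts per second since the Big Bang gets you through roughly 4e26 of those.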