Re: Playing by the rules while making things better?
Yes. And?
I'm assuming you are positing this as a riposte to my description, but you don't say how or why that disagrees with what I described.
Although what doesn't help is that the quote said "copy code" when it ought to have said "replicate code" - subtle difference, but important.
There are two parts to that replication, which together both agree with Tim Davis's statement and show how the LLM structure, as it stands, still makes it very difficult to generate attributions. What follows is all broad strokes (or we would be here all day) and considers an abstract chunk of code, not Tim's specific code (if for no other reason than that neither of us has chased it down). Also, I have absolutely no idea how much CS you (or anyone else reading this) have, so it is going to be really crude and simplistic and full of holes; apologies, but bear with me.
Part One
The learning process took Tim Davis's code (call it T) and started with basic lexing[1], breaking it into tokens. These are treated, along with all the tokens from the rest of the training data, with processes like[2] "take all the pairs of tokens and count how many times each particular pair occurs; reject all the zeroes, convert the counts to a fraction of the total and you have the probability of such pairs occurring in a sample of your data". You can now run this "in reverse"[3] to generate output text that precisely comports to those statistics but which will be total gibberish.
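To make that pair-counting step concrete, here is a minimal sketch in Python. The token stream is a made-up, hand-lexed fragment and the function name is mine; it illustrates the counting described above, not what any real training pipeline actually does.

    from collections import Counter

    def pair_probabilities(tokens):
        """Count adjacent token pairs and turn the counts into probabilities."""
        pairs = Counter(zip(tokens, tokens[1:]))          # every adjacent pair
        total = sum(pairs.values())
        # zero-count pairs are simply never stored; everything else is normalised
        return {pair: count / total for pair, count in pairs.items()}

    # a toy "training set": a hand-lexed fragment, tokens are just strings here
    tokens = ["for", "(", "i", "=", "0", ";", "i", "<", "n", ";", "++", "i", ")"]
    print(pair_probabilities(tokens))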
If only you could also find wider correlations, such as "every { must be followed by - stuff - and then a }". If you could, the generated output would look more realistic. You may decide to say "those are just the syntax rules of the language, I can put them in explicitly"[5]. But that is starting to get hard, as you really need all the semantic rules, which even the best compilers don't encapsulate: e.g. for the commonly occurring nested loop structure for (i=0; i<n; ++i) { for (j=i; j<n; ++j) {...} } you have to match the variables used instead of i and j, *but* not all commonly occurring nested loops start with j=i, some start with j=0. The size of the problem is getting way, way out of hand to do manually: you don't have space to store all the tables that *accurately* describe the correlations and you are unlikely to even think of all the possible correlations, let alone how to express them.
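A back-of-envelope sketch of why the table approach runs out of room (the vocabulary size is a number I have picked purely for illustration, not a figure from any actual model):

    # dense n-gram tables over a code-sized vocabulary blow up very quickly
    vocab = 50_000                       # illustrative token vocabulary size
    for n in range(2, 6):
        cells = vocab ** n
        print(f"{n}-token table: {cells:.1e} cells "
              f"(~{cells * 4 / 1e12:.1e} TB at 4 bytes per count)")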
So we turn to an algorithm that will (try to) find lots and lots of such correlations but, unlike the above, can't be guaranteed to find them all (it may not even find ones you consider obvious): this learning algorithm is generic, but by prefixing it with our above steps (i.e. reading text, lexing it) it magically becomes an LLM. But all it is doing is building up a massive graph with weightings along each transition, which is a sparse version of the infeasibly big "n-dimensional table of every correlation, accurately calculated" that we tried to create above.
Hopefully, with enough handwaving, you can see how we get to a model that generates really clever looking output *and* how that matches my earlier description of how hard it is to tag every node in the graph with the relevant attribution.
Part Two
BUT how can this ALSO replicate Tim Davis's code, apparently as a single chunk of licence-breaking text?
If we change our input reader slightly and, instead of chunky lexemes, read the text literally bit by bit, we can obviously do all the same work and get the same result (a cod-Copilot), but one that generates its output as a bitstream (which needs to be clumped back together as characters/codepoints to be legible to us) and is even less efficient in operation than the first version. The model no more and no less "understands" the bitstream than it did a lexeme stream.
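For the avoidance of doubt about what "bit by bit" means here, a tiny sketch (assuming UTF-8 text, which is my assumption for illustration, not anything known about Copilot's internals):

    def to_bits(text):
        """Turn text into a stream of 0/1 tokens, one byte at a time."""
        return [(byte >> i) & 1 for byte in text.encode("utf-8")
                for i in range(7, -1, -1)]

    def from_bits(bits):
        """Clump the bit stream back into bytes, then into legible characters."""
        byts = bytes(sum(bit << (7 - i) for i, bit in enumerate(bits[n:n + 8]))
                     for n in range(0, len(bits), 8))
        return byts.decode("utf-8", errors="replace")

    assert from_bits(to_bits("for (i = 0; i < n; ++i)")) == "for (i = 0; i < n; ++i)"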
Now go back to the first table of pairs of lexemes, ordered by probability, and consider it as now a table of bit pairs, ordered by probability. As we've reduced the token space to two, we can afford the RAM to extend this from pairs to longer sequences, and we can store this as a binary tree rather than an n-dimensional array, still sorted by probability with the empty token as the root. This structure is now strangely familiar[6] as the basis of Huffman encoding - i.e. we can see that this data could be used simply to (de)compress all of our input texts (if we fed all the original texts back in, the Huffman coding is now our lexer, and if we feed its output into the generation process instead of using the random process, ta-da, lossless compression and decompression).
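For reference, this is the textbook Huffman construction that I am hand-waving towards - built here from made-up lexeme frequencies rather than the bit-pair tree described above, just to show that the same "how often does this occur" data is all a lossless (de)compressor needs:

    import heapq

    def huffman_codes(freq):
        """Build a Huffman tree from symbol frequencies; return symbol -> bitstring."""
        # heap entries are (weight, tiebreak, subtree); a subtree is a symbol or a pair
        heap = [(w, i, sym) for i, (sym, w) in enumerate(freq.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            w1, _, left = heapq.heappop(heap)          # merge the two rarest subtrees
            w2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, count, (left, right)))
            count += 1
        codes = {}
        def walk(tree, prefix):
            if isinstance(tree, tuple):
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:
                codes[tree] = prefix or "0"
        walk(heap[0][2], "")
        return codes

    # toy frequencies for a handful of lexemes
    print(huffman_codes({"for": 10, "(": 25, ")": 25, "{": 20, "}": 18, "while": 2}))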
Expanding back out (whilst wildly waving hands even more) from a neat and complete binary tree to the sparse graph we had before, we can treat that graph as a means of compressing the inputs: if we walk it in the *correct* order, we can (almost, sort of, because it is sparse and has gaps) think of it as decompressing the original texts back out again.
It now sounds like I have contradicted myself - if it is decompressing, then it stored the original, just like any other compression system, so it did just copy Tim Davis's code into itself and spit it straight out again! So why can't it admit that and add in the attribution correctly?
Well, the "decompressed" text is a bad copy: there are all those gaps in the graph, and the random numbers used each time there is a choice mean we aren't following one fixed *correct* order - it will (likely) be different each time. And, just to be really weird, there is also the potential for inefficiencies in the learning process that make it probable that constructs we "see" as being the same are replicated multiple times, possibly differing in details that are irrelevant (to us, and to a compiler): how many ways are there to write a "for" loop in C? What about use of whitespace? "for", or rewritten as a "while"? It is entirely possible that 12 out of 17 next choices from this node all result in for loops, just not quite the same one, so most rolls of the die do the same thing (as far as we are concerned, reading the output).
How bad a copy? Good question - not helped by not seeing *precisely* what Tim Davis saw [7] and how it compared to his original: was it character perfect or did it look like it had been effectively retyped? How many runs did he try and did they all generate the same thing?
As for the attribution: The same conditions apply as in my previous comment (each node in the graph is derived from all the inputs - precisely the same way that each node in a Huffman tree is derived from all the inputs). And remember all the variations on a loop? They came from different places...
Shoot Self In Foot, Destroy Own Argument
Having said all that, there *is* the possibility that one or more traversals never actually hit a choice point and will always generate identical outputs. And one or more of those *could* have formed from a weirdly unique input text, which has now been memorised and can be said to have been copied verbatim and, as it is the only thing that could be generated once that node has been reached, it could even be attributed.
Unlikely, but not impossible. Um, could even be due to a bug (i.e. the learning process isn't actually doing learning, it *is* doing compression).[8]
Footnotes
[1] intriguingly, so far I've not read whether the lexing was adjusted from that used with the natural language models into one that includes the lexing rules we apply to programming languages (i.e. the lexers in all our compilers/interpreters) - citations gratefully accepted.
[2] or totally unlike! Sorry, it gets messy being crude, and one problem is that you can not only do the crude reward/punishment process to slowly nudge the weightings until they (sort of) match the correlations, you can also pre-process the data by using faster (aka sensible) methods to generate the early, simple (and explicitly accurate) correlations between, say, token pairs and triplets - but you don't go very deep with this because, duh, the combination space grows in size and you run out of memory. Even the initial lexing step is one such pre-process.
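If it helps, here is a deliberately crude toy of that "nudge the weightings" idea - the update rule, the learning rate and the target being plain pair frequencies are all my own simplifications for illustration, not a description of how any real model is trained:

    import random
    from collections import Counter, defaultdict

    def nudge_weights(tokens, steps=20_000, lr=0.05):
        """Crude reward/punishment: drift sparse pair weights toward observed frequencies."""
        target = Counter(zip(tokens, tokens[1:]))
        total = sum(target.values())
        weights = defaultdict(float)               # sparse: unseen pairs stay at zero
        seen = list(target)
        for _ in range(steps):
            pair = random.choice(seen)             # look at one observed pair at a time
            observed = target[pair] / total        # the "reward" signal from the data
            weights[pair] += lr * (observed - weights[pair])   # nudge towards it
        return dict(weights)

    tokens = ["for", "(", "i", "=", "0", ";", "i", "<", "n", ";", "++", "i", ")"]
    print(nudge_weights(tokens))                   # (sort of) matches the pair frequencies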
[3] one array per token, including the empty token, fill it with the pairs starting with that token; sort by probability (randomly ordering equally probable items) then tag each entry with the cumulative probability from the start of the array, beginning with zero - each entry in the array is now in the bucket between its cumulative prob and that of the next entry. Start with the empty token, loop: generate random number 0.0 to 1.0, find the corresponding bucket, output the token in that bucket, set current token to that value, stop when you hit empty token again.[4]
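The same loop written out in Python rather than prose; the toy pair table at the bottom is invented, and the empty token is spelled "" here:

    import random
    from bisect import bisect
    from collections import defaultdict

    END = ""                                       # the "empty token"

    def build_buckets(pair_probs):
        """One array per token: its successors, with cumulative-probability buckets."""
        successors = defaultdict(list)
        for (a, b), p in pair_probs.items():
            successors[a].append((b, p))
        buckets = {}
        for tok, nexts in successors.items():
            nexts.sort(key=lambda item: -item[1])  # by probability; ties in arbitrary order
            total = sum(p for _, p in nexts)
            bounds, toks, cum = [], [], 0.0
            for nxt, p in nexts:
                cum += p / total                   # upper edge of this token's bucket
                bounds.append(cum)
                toks.append(nxt)
            buckets[tok] = (bounds, toks)
        return buckets

    def generate(buckets):
        """Start at the empty token; roll the die, follow the bucket, stop at empty again."""
        out, tok = [], END
        while True:
            bounds, toks = buckets[tok]
            roll = random.random()                 # 0.0 to 1.0
            tok = toks[min(bisect(bounds, roll), len(toks) - 1)]
            if tok == END:
                return out
            out.append(tok)

    # a hand-written toy pair-probability table, with the empty token at the boundaries
    pairs = {(END, "a"): 0.25, ("a", "b"): 0.25, ("b", "a"): 0.25, ("a", END): 0.25}
    print(generate(build_buckets(pairs)))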
[4] at this point we have the old programming exercise of the "rubbish generator", which done properly uses a decent Markov Chain, of course.
[5] you may also have spotted that a parser's "follow sets" for a (programming) language can be put directly into the generator portion of your Markov Chain assignment to produce "sort-of-recognisable program sources".
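As a toy (my own hand-sketched successor table for a scrap of C, not computed from any real grammar):

    import random

    # a hand-sketched successor table: for each token, what is allowed to follow it
    follow = {
        "":    ["for"],                   # empty token: what may start a statement
        "for": ["("],
        "(":   ["i"],
        "i":   ["=", "<", "++", ")"],
        "=":   ["0"],
        "0":   [";"],
        ";":   ["i"],
        "<":   ["n"],
        "n":   [";", ")"],
        "++":  ["i"],
        ")":   ["{"],
        "{":   ["}"],
        "}":   [""],
    }

    def babble(follow):
        out, tok = [], ""
        while True:
            tok = random.choice(follow[tok])       # every step is syntactically "allowed"
            if tok == "":
                return " ".join(out)
            out.append(tok)

    print(babble(follow))    # sort-of-recognisable, rarely meaningful, C-flavoured noise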
[6] still waving my hands, remember!
[7] if you can provide citations for that detail, please, please do
[8] In which case it isn't generating a proper LLM thingie in the first place and as I've only been rabbiting on about LLMs then all my arguments still stand. Phew.