Re: Is deep learning a copyright violation?
It is popular among LLM creators and fans to try to draw an equivalence between training and human learning. The term "learning" was chosen partly to invite that association; metaphors are particularly common in this field. That does not make the two processes similar. But let's take this from the top of your comment.
Copyright has imposed several restrictions that were enforced long before computers existed. As written today, holding a copyright on a work means that you are the only one who may sell it; that it is illegal to use it without a legal copy (one purchased, borrowed, or provided by someone with permission counts; going and getting an unauthorized copy does not); that you can attach a license to the use of that work (e.g. open source and every other software license); and that these rights apply to substantial portions of the work, not just the entire thing as a unit. It is not merely the exclusive right to sell.
You maintain that an LLM does the same thing as human memory, but this is not correct. "The way an AI processes information fed into it mimics the way humans do it," for example, is just wrong. Neural networks do not mimic human neurons; we don't understand how specific human neurons work nearly well enough for that. LLM memory is built from discretely divided tokens; human brains do not work that way, and human memories are far longer than any token. An LLM cannot generalize beyond those tokens, whereas a human brain can; its difficulty doing mathematics without writing programs and having them executed demonstrates this. There are lots of differences between this software and brains, which is perfectly natural: as cool as our brains are, and as great an AI as we could create if we could actually model them, our knowledge of neuroscience and our raw computing power are insufficient to model the whole thing. The LLMs we produce do the tasks assigned to them, evidently with sufficient accuracy for the people who sell and use them, though not for me, so nothing says they have to work like a brain does. Nothing, that is, except the argument for why their use of copyrighted data is valid.
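To make the token point concrete, here is a deliberately toy sketch in Python. The greedy splitter and the tiny vocabulary are my own invention for illustration, not any real model's tokenizer, but the principle is the same: text is chopped into discrete vocabulary pieces and the model only ever sees those pieces, never raw characters.

```python
# Toy illustration (not any real tokenizer): text is split into discrete
# tokens from a fixed vocabulary before training ever begins.

def toy_tokenize(text, vocab):
    """Greedy longest-match split of `text` into pieces found in `vocab`."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):  # try longest match first
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            tokens.append(text[i])  # unknown character becomes its own token
            i += 1
    return tokens

vocab = {"un", "happi", "ness", "12", "3", " "}
print(toy_tokenize("unhappiness", vocab))  # ['un', 'happi', 'ness']
print(toy_tokenize("123", vocab))          # ['12', '3']
```

The second line hints at why arithmetic is hard for these systems: "123" is never seen as three digits, only as whatever chunks happen to exist in the vocabulary.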
Another problem is your contention about what happens to the data after it is ingested: "The information becomes an integral and inseparable part of a body of knowledge, identity and decision making (simulated in the case of an AI but based on the same principles as the organic neural networks that AI is modeled after) and does not continue to exist within that AI's memory as a separate "work" that the AI then redistributes or publishes in whole or in part." None of those parts is true. The tokens are linked as probabilities; while some of them are effectively discarded, others remain present in their original form. LLMs quote their training data frequently: sometimes when asked to, sometimes by mistake, and sometimes when answering unusual queries that don't have many associations in their trained state. A human brain might do the same, but if a human uses their brain to quote copyrighted material to an audience, that's not allowed either. The fact that an LLM may be doing this unintentionally doesn't change the result.
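To illustrate what "tokens linked as probabilities" can mean for quoting, here is a toy bigram model in Python. Real LLMs use neural networks over enormous vocabularies, so this is only a sketch of the mechanism, not of any actual system. The point is that when a token has essentially one likely successor in the training data, following the probabilities reproduces a run of the training text verbatim:

```python
# Toy bigram model: tokens linked by successor probabilities.
# When successors are nearly deterministic, generation regurgitates
# the training text word for word.
from collections import Counter, defaultdict

def train_bigrams(tokens):
    """Count, for each token, how often each other token follows it."""
    counts = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start, length):
    """Greedily follow the most probable successor from `start`."""
    out = [start]
    for _ in range(length - 1):
        successors = counts.get(out[-1])
        if not successors:
            break
        out.append(successors.most_common(1)[0][0])
    return out

text = "it was the best of times it was the worst of times".split()
model = train_bigrams(text)
print(" ".join(generate(model, "the", 4)))  # "the best of times"
```

The output is a verbatim substring of the training text, even though the model "stored" nothing but probabilities. Scaled up, this is the mechanism behind memorized passages surfacing in LLM output.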
As for the way you learned programming from a book: you are allowed to teach from a book. Handing out copies to the students without permission would be a copyright issue, but if you just read it, learned from it, and taught based on your learning, you are not violating copyright. There are three differences between you and the LLM here:
1. You presumably obtained an authorized copy before reading it. LLM authors could have done that. The many court cases show how often they chose to obtain illegal copies instead, either deliberately or by scraping sites where someone else had. Before any training takes place, that's already a violation. In many cases, this is the only violation being litigated, meaning that, whether you agree or disagree with me about the differences between an LLM and a brain, it doesn't matter because they haven't even trained on the work yet. At that point, they just have a copy sitting in their big storage array, waiting to be retrieved and fed into training later, which violates the license attached to most books, which often states that the work is "not to be stored in a retrieval system". This means that even if they did go and buy a book off the shelf, it would probably be insufficient for everything they expect to do with it and they would need a special license, but they haven't even tried that. They have shown that they know how to get one by licensing some datasets, for example paying Reddit for copies of its users' posts. I have no objection to them training on data like that, which they have permission to use (if you don't want Reddit to be able to sell your posts, read the Reddit terms and conditions and maybe don't post there). I do have a problem with data that either doesn't carry that open a license or whose use is explicitly disallowed.
2. You probably didn't learn to program from that book alone. After reading it, you wrote some code of your own, adding extra knowledge to what you taught later. You may have read more sources as well. The LLM doesn't do that. It can write code, but it has never had the experience of running it, watching the result, and judging whether the result was what was intended; it has never debugged toward a goal. It is as content (to anthropomorphize it a bit too much, but I didn't start it) to write code that doesn't compile as to write something perfect, and it doesn't adhere to the letter or the spirit of a specification except by chance. Your teaching is based on an actual goal; the LLM's is based only on what teaching looks most like the text it has already seen. In fact, the code it writes is mostly based on other code it saw, not on the content of the textbook. If the textbook says never to do something because it's a readability disaster but a lot of the code in the training data does it anyway, the LLM will very likely use that structure anyway. A human mind can easily reformat such code mentally the way the book suggests: I've seen plenty of code with bad structure or expressions, gleaned what I needed from it, and still avoided writing that way myself because I remembered the cautions from others.
3. If your teaching were based on nothing but the book, there is a chance it could be viewed as a copyright violation. If I had to teach something I knew nothing about and decided that the easiest way to do the job was to get a textbook and parrot it back to my students, that could be interpreted as a performance of the book. I might summarize and paraphrase the contents, but that is not enough to prevent the violation. In practice, nothing would really happen, because the copyright holder wouldn't know I was doing it, and for that matter neither would my students; they'd probably both just think I was a bad teacher, and nobody is very interested in proving that I was doing that specifically rather than being one of the many other kinds of bad teacher. This is the hardest argument to make about an LLM, which is probably doing it with lots of sources instead of one. I still think it is correct and a viable complaint, but the arguments about use without permission and direct quoting are more convincing.