Re: So we need DCMA 2.0?
"The conceptual problem I see is: how is my reading and internalizing a web page manifestly different from a LLM being trained with the same page?"
This again? Every time an argument like this is made, it rests on either misunderstanding or misrepresenting the facts. Starting with:
"In neither case are the page's contents reproduced or stored in the LLM or my brain (I don't have an eidetic memory.)"
They are stored and they are reproduced, often accidentally. First, they are stored in the training archives, without permission. That is an accurate, byte-for-byte copy. Then, they are partially stored inside the LLM. True, I can't, even with access to the model, run a command like "llm-extract book-title" and get the book back, but the model will often print from it verbatim. This has happened, over and over, across models and sources, whether relevant to the query or not, and it happens only somewhat less now because code has been written to suppress it, since it makes their crimes too obvious.
"A fairly simple example I would consider is where I train a LLM on the entire Public Domain corpus of The Gutenberg Project say from an offline resource (eg their 2010 DVD.)
From my reading of Gutenberg's T&C I think I would not be in conflict with any of those provisions."
You would not be in conflict with anything, even if you downloaded them fresh, specifically because the work they distribute is not copyrighted. You can do whatever you want with that data. If you do download it, though, Gutenberg would rather you used something like their Kiwix versions, so their servers aren't stressed and you get the full archive rather than the subset on the DVD.
"Posing rhetorical questions I would ask what moral or ethical lines will I have crossed at that point? And when I provide free, open access to my trained LLM? Finally when I place a paywall in front of my LLM?"
No lines at all. Public domain training content is fine to use for all purposes, commercial or otherwise. It's with other content that those lines appear, and they appear at the start. Training your model on content you don't have the rights to is both unethical and illegal.
"Finally how does one legislate ethics and morality? Extant attempts are without exception cures disastrously worse than the disease."
That's what law is. Laws are always an attempt to codify our concepts of ethics and justice. They have lots of downsides, but unless you think that no law is better, we've already decided to try.