
It's debatable whether training LLMs or other generative AI infringes copyright. But LLMs aren't trained directly from the internet; they're trained on a curated data set that comprises COPIES of the scraped copyrighted data. I'd go after the copies made for the training set if I were a lawyer.
Big tech will argue the training set is a cache. No. It's a copy.