back to article Fining Big Tech isn't working. Make them give away illegally trained LLMs as public domain

Last year, I wrote a piece here on El Reg about being murdered by ChatGPT as an illustration of the potential harms through the misuse of large language models and other forms of AI. Since then, I have spoken at events across the globe on the ethical development and use of artificial intelligence – while still waiting for …

  1. Mentat74
    Mushroom

    "Some businesses try to wriggle out of fines and other punishment"...

    And others have enough money to buy enough politicians and lawmakers so they don't get fined in the first place...

    I don't see this flagrant disregard for people's privacy and copyright ending any time soon...

    Maybe nuking them all from orbit is the only real solution ----------------------------------------------------------------------------->

    1. Anonymous Coward
      Anonymous Coward

      Small thinking

      YouTube generates enormous value to creators by network effect. But initially it was one huge copyright violation. Before you can extract value, you have to generate one. Patience, please.

      Attribution will be trivial in the future. So even authors of rarely bought books may suddenly earn their fair share. That may be a problem for popular writers, who have "monopolized" majority of earnings from human knowledge by pure luck and network effect. There are too many valuable books collecting dust. Aristotle has not earned a dime from descendance of the knowledge graph, but he should have. Now it will be possible for future thinkers.

      Value from AI will be orders of magnitude larger than any single copyrighted contribution, so paying back will be a non-issue. Majority of models will be public as free services, because of competition, and because money will be made running advanced hardware, for example. Majority of readers here would have no idea what to do with the best model, other than hoard it on a hard disk somewhere.

      As for involving bureaucracy to save us all, I understand the corrupt logic - some want to jump the train and extract value as intermediaries. Cookie warning anyone? Good luck enforcing your stolen PII. Waste of time and public resources.

      Modern luddism. Burn Alexandria's Library, burn!

      1. Richard 12 Silver badge

        Re: Small thinking

        Attribution is quite simply impossible, because none of the LLMs are capable of doing that, even in theory.

        Put very simply, they are infringement on a massive scale.

        YouTube is a publisher. It is not comparable.

        1. Mye

          Re: Small thinking

          Fixing the IP tracking is an engineering problem. However, I'm not entirely convinced it is needed. The model behind the chat window does not contain the works in literal form. It is more like turning a book into paper mache, which is permitted if they buy a single copy of the book.(doctrine of first sale)

        2. cornetman Silver badge

          Re: Small thinking

          > Attribution is quite simply impossible, because none of the LLMs are capable of doing that, even in theory.

          I'm not sure that is true. I've been trying out the Perplexity "search engine" and it seems perfectly capable of doing just that: all its claims come with web link citations to verify them. I don't know what it is doing differently to other "AI" engines though: perhaps it is running some sort of hybrid architecture?

          1. klh

            Re: Small thinking

            It's probably hybrid in the sense that the LMM parses the conversation but search is done using conventional means based on what the model gathered from the chat? That would be a sensible approach and what LMMs are generally good at.

            The popular "write me an essay" LMMs are not able to do that because of the way they are built, that information is lost during vectorization, otherwise they'd have to annotate every other word. They don't persist knowledge, just linguistic probabilities.

          2. Not_A_Hat

            Re: Small thinking

            This is likely 'Resource Augmented Generation', where it essentially compares words in your question to words in a 'table of answers', adds the most likely answers to the 'context' of the question, and then generates text based on that. Basically, it reads the top three google results into memory and writes down the citations. There's still probability involved in how it presents the answer, but usually little enough that facts won't be distorted.

            The core LLM tech is still probabilistic (random on some level) but RAG is a decent way to constrain hallucination by aligning it to an external 'source of truth'. It's pretty interesting. The 'intelligence' of the AI is still directly proportionate to the number of parameters and architecture of the model, but the dumber the model the closer it gets to just being a fancy google search filter.

            If you're interested in LLM tech, I'd highly suggest trying to run one locally; it will show the limits of the tech so much better than a polished presentation like ChatGPT ever could. A decent desktop GPU is more than enough for a small one to tinker around with. Look up KoboldCPP and Obabooga.

            1. Anonymous Coward
              Anonymous Coward

              Re: Small thinking

              It’s called Retrieval-Augmented Generation.

        3. Mostly Irrelevant

          Re: Small thinking

          They don't currently do this, but it could be done. You'd have to keep track of where the weights of the nodes in the neural network came from. It would definitely increase the complexity of the model and cost of "training" it though, which is why they aren't doing this. Also, the number of works attributed to even small pieces of text would be absolutely huge. We're talking about pages and pages for a paragraph.

          It would be a lot easier to list ALL of the input in the LLM somewhere and link to it. Said list would be a huge Wikipedia-like monster but it's more practical.

      2. Like a badger Silver badge

        Re: Small thinking

        "Value from AI will be orders of magnitude larger than any single copyrighted contribution, "

        Citation, please.

        AI currently looks like big tech's latest fad to justify their unsupportable price to earnings ratio. I can't think of a single compelling example of AI achieving anything. If it does get pushed hard enough, I'm sure some marginal benefits plus significant downsides will be offered to the masses, but the bulk of benefits will be retained by big tech. In short, at its best AI will be a wealth re-distribution tool from the masses to the owners of the technology, much like internet retailing.

        But this is a discussion forum, so persuade me (and many others hearabouts) exactly what AI is going to do for us? Oh, and no vague assertions please.

        1. Justthefacts Silver badge

          Re: Small thinking

          Ok, here’s an obvious almost trivial example: Output a complete, correct Gerber (PCB design file) based only a simple text description of the function of the board. Tell it you want six LEDs, a USB, which CPU, RAM, headers, and power. That’s all it needs to know.

          I mean what “novel intellectual input” do you think the human is adding by drawing in all this stuff, knowing how to do a simple power-supply and clock-crystal, import footprints etc? Seriously. It’s a skilled job, and I bet *you* can’t do it. You’ll need to spend probably hundreds of hours from a software background to know even the basic electronics, and then another couple hundred hours to learn then schematic capture software etc. It’s hardly the pinnacle of human invention though, as a species we’ve probably made tens of thousands like this, and they’re all basically the same.

          Or…..get GPT-o1 to do it. From scratch, given an hour of faffing with prompts, send the Gerber straight out the door via the web to a contract manufacture house, receive your completed single-board computer to your own spec in a weeks time. You don’t even need to know what process and files you need to send….. just walk it through step-by-step with GPT4 in a couple hours before lunch on Xmas Day…..

          1. doublelayer Silver badge

            Re: Small thinking

            This sounds great. I certainly do not have the knowledge, and I've had many ideas for something where the hardware design should be pretty simple. If I could have someone make me the board designs where usually the most complicated thing is getting buttons in the right place, that would be quite helpful.

            Sadly, I have seen GPT's software output and I do know how to write software. Therefore, if I ask it to produce a PCB output which I don't know enough to check, I expect that it will come back from the contract manufacturer perfect, at least the exact shape that I expected with the USB port fully connected to the processor and RAM, except the data pins won't be connected so I can't flash software to it and a couple of the LEDs installed without the necessary components. I can get working code out of GPT. If it's a really simple problem, the code often comes out correct the first time. If it's anything else, I have to do it again and again, checking manually each time, so much that it would have been faster to do it myself, and the only reason it's correct at the end is that I have the skills to check its output manually. That's why I don't trust it for anything I can't do myself and why I don't use it for things I can.

            1. Justthefacts Silver badge

              Re: Small thinking

              . My company does as I described all the time, and it does work, at least dozens of boards done, and much more complex.

              It’s an engineering project, so you need to figure out the process. A cold GPT might do as you say, but that would be stupid. You warm it up by giving it some Gerbers in the prompt window. There are hundreds freely available. Honestly, most of these boards are *absolutely* monkey-see-monkey-do. The number of sample Gerbers with the data-lines not connected up is zero, so they’ll be connected up for your use-case. Zero-shot learning might be cool, but it’s unnecessary for most use-cases. Plus you can get GPT to write the Hello World blinkenlight basic software for when the board comes back. Use GPT to drive a digital simulator, get it to fault-find before the board goes out.

              None of this is magic, it’s a tool. The job of engineers is to figure out the processes, to make the tools work.I honestly think most people are spending their entire time trying to trip it up and give stupid answers, because they’re scared of it. Either that, or they’re getting hung up on “AGI” - if it were intelligent, it could do this without being careful about the prompting, ergo it is dumb, ergo it is useless. Rather than just figuring out the sensible steps, to get the project done.

              1. doublelayer Silver badge

                Re: Small thinking

                I'm not trying to trip it up, nor am I hung up on AGI. My problem is that, if the result of the process is something I can't check, which your example is, then I have come to expect that there may be many errors. Your clarification makes me more concerned about this. You fully expect that GPT will make these mistakes unless I prompt it with something I'm supposed to get at random online, because somehow it can't do that itself. I'm guessing this works for you because, if you're building dozens of these, you probably have all the knowledge you need to look over and confirm these designs before you pay someone to make it. For someone who doesn't, it seems like a recipe for wasting money on flawed designs, of which disconnected data lines is an example, not something that has actually happened because I'm not actually building designs.

                My basis for this has been actual testing of the thing when I had software tasks to do. If someone who didn't know how to write software used GPT to write some software for them, they would not get something usable. I conclude that from several attempts to get it to write software, which I know enough about to judge its output. Its output on small tasks often contains simple yet unpredictable errors, and that's when I split it into little pieces for it which a nontechnical person will not do. I cannot give GPT to someone unfamiliar with software and expect them to get good results. I have as much familiarity with PCB design as they have with writing software, so I cannot expect good results from that. It may be that simple PCB design is simpler, with more boilerplate or fewer options for messing up, than the simple software which GPT reliably fails to write properly. In that case, maybe it is actually more capable for you than it is for me.

      3. Pascal Monett Silver badge

        Re: Small thinking

        Aristotle's work is public domain. When he was alive, he was far from poor and spent his time in seminars and teaching.

        He was quite successful in his own time. No need to transport our "modern" values to him.

      4. IGotOut Silver badge

        Re: Small thinking

        "Modern luddism"

        Another badly educated person not understanding what the Luddites were all about.

        Burn Alexandria's Library, burn!

        You may want to look into that after educating yourself about the luddites

      5. James 51

        Re: Small thinking

        So, what are the odds this 'responce' was written using an LLM?

      6. anonymous boring coward Silver badge

        Re: Small thinking

        Aristotle is dead, FYI.

  2. Doctor Syntax Silver badge

    I've a better idea. Given the amount of energy being wasted on it simply close them down. Power down the racks. Ley them go bust.

    Next up - mining cryptocurrency.

    1. Muscleguy

      Also how do you force the AI Co to continue to host their now Public Domain LLM? What’s to stop them pulling the plug or just going bust to avoid having to do it?

      1. Anonymous Coward
        Anonymous Coward

        Well there's a positive outcome, anyway.

  3. HorseflySteve Bronze badge

    Don't fine them or release the LLMs

    Get injunctions that order that the LLMs be turned off and training data deleted. Threaten company board of directors with jail for contempt if they don't.

    I know that's probably not legally possible but I can dream ..

  4. rgjnk Silver badge
    Devil

    Delete them

    Don't try to hide behind some wafer thin 'environmental' sunk cost excuse, pretending to have some morale & ethical justification for the swerve. It's feeble stuff and it's embarrassing to see it tried.

    If they've taken my intellectual property and derived something from it then it's not a remedy for me if you put that derived product out into the open. The harm still exists even if the originator isn't profiting any more.

    Delete it all - models, training sets, everything.

    Of course if your purpose in life is indirectly making a living off the back of it existing as an ongoing problem and pushing poorly formed legislation that fixes nothing I can see why a complex non-solution might appeal...

    1. Anonymous Coward
      Anonymous Coward

      "If they've taken my intellectual property and derived something from it "

      Welcome to the club. This has been the story of much of my work, but you just keep ahead of those lot.

      99.2pc ff humans are parasitic.

    2. Mye

      Re: Delete them

      A better idea is to let them remove your IP. You will see that your IP does not matter to the system's function. In other words, when taken in aggregate with all the other content, your content is insignificant and has very little value. If we get IP tracking, you will see that your IP is not unique, and you owe others for what you generated/"plagiarized".

      1. rgjnk Silver badge

        Re: Delete them

        Have you ever tried to unmix something? Entropy wins.

        The data is in there forever as an inherent part of the model and it's impossible to extract or suppress the derivative.

        The only guaranteed solution once it's trained is nuking it and trying again after sanitising the training data.

      2. Conor Stewart

        Re: Delete them

        It doesn't matter how significant the data is or is not, it is illegal to use it like that. Also if a lot of the data is used without permission and illegally then it should all be removed, so even if an individuals data is insignificant all of the illegally used data combined is definitely not insignificant. You talk like it should be an opt out system after the fact, why? Why shouldn't it be an opt in system like it morally should be?

        Since you seem to have all the answers. How do you remove training data after the model has been trained? It's not a simple task and likely isn't even possible due to how the models are trained and would involve removing the data from the training data and retraining the model which is not feasible to do every time someone wants their data removed considering the cost and time required to train the model.

        Plagiarism is deliberate copying or stealing of ideas, it has nothing to do with how unique a concept is. If someone's IP is not unique that doesn't mean they plagiarised. There are many examples of the same technology or ideas being developed independently at the same time.

      3. doublelayer Silver badge

        Re: Delete them

        That's exactly what I want. They should remove any IP they don't have the right to. If it's insignificant, then that's no problem. They are free to build an LLM out of anything to which they have legal access, whether that's public domain stuff, stuff they paid for, or stuff that someone agreed to give them for free. I just don't want them to assume that, because they found it online, that means they get to use it for free. As I've said many times, if I find a copy of their model by getting one of their computers to hand it to me, do I get to do whatever I like with it, including making lots of money, without their permission? If not, and it is not, then they should not do that to others either.

        1. catprog

          Re: Delete them

          If you read an article on the Internet that is freely available does that make your brain is now trained on that article and everything you write has to pay a royalty for?

          If not why is it diffrent for a LLM?

          1. Mimsey Borogove

            Re: Delete them

            <Re: Delete them

            If you read an article on the Internet that is freely available does that make your brain is now trained on that article and everything you write has to pay a royalty for?

            If not why is it diffrent for a LLM?>

            No one is using what my brain is "trained" with without my permission to make themselves a lot of money. If they want to, they'll have to talk to me about it. That's the difference.

      4. Anon the mouse

        Re: Delete them

        Very little value is not zero value. I don't care if my work is worth 0.003p per use (same as Spotify) to an AI company, I want paying. And if it's illegally acquired then it is the licencing fee for each infringement, which is at least £1k.

        When AI companies have to pay for what they've taken they'll soon change their tune to it our creativity being worthless.

    3. Anonymous Coward
      Anonymous Coward

      Re: Delete them

      And start by deleting the execs from these companies...

  5. Headley_Grange Silver badge

    Everyone, from tech companies through to government is looking at AI with a view to making as much money out of AI as possible. At one end there are some who really believe in it and are fighting to get in on the ground floor so they can gobble up the competition before it gets too big. At the other end there are some who think it's a bubble that will burst so they want to make and bank their fortunes as quickly as possible while the going's good. No one across the spectrum is going to help regulate AI/LLMs; it would be the equivalent of trying to hold back a gold rush and it makes no difference whether it's gold or pyrites in them thaar hills.

    1. Anonymous Coward
      Anonymous Coward

      "No one across the spectrum is going to help regulate AI/LLMs"

      Error correction is the killer that regulates LLMs. More work goes into lobotomising the AIs than any other area. Offices full of low-pay workers add hex's (name/subject blocks) all day.

      This is why o1 went from breathtaking on launch to another ChatGPT3.5 clone in 3 months. And is the cause of many hallucinations (artificial cognitive dissonance).

      The gold rush is for the pick-axes. Like web1, with Cisco and those lot coining it in no matter what the internet weather was. AI is slaughtering the work place. This year alone I've seen hundreds be laid off. All of them never guessed it would happen a few years back. But most realise they cant compete and take the pay-off.

  6. Anonymous Coward
    Anonymous Coward

    Just ban them from operating in Europe cause they will just continue to train and scan the internet outside Europe. As for disgorgement...that's going to get real spicy real quick.

  7. xyz Silver badge

    Welcome to slavery...

    In 10 years you'll having nothing but the spawn of today's LLMs sucking all the power available and pwning your every action as you sit in a cold damp puddle asking them to save your asses.

    You are tomorrow's digital serfs.

    1. Anonymous Coward
      Anonymous Coward

      Re: Welcome to slavery...

      Merry Xmas to you, old soul!

      I like your optimism on it being 10 years. 10 week more like. Or since about 2021 to be more accurate when it was first thought up.

      Humans need to stop thinking things up and telling the LLMs. Well, I do anyway.

  8. Filippo Silver badge

    I respect the author's point of view, but the proposed solution looks convoluted and pointless to me.

    First of all, there really is no legal basis for this. A derivative work of an unlicensed work does not become public domain; it just becomes illegal. You would need new legislation in order to do this. You'd need to lobby for it, and hope another lobby doesn't get loopholes into it, the whole sausage-making. It's a messy, lengthy process, even when it works.

    And then, what do you get? Training these models is staggeringly expensive. If you remove the ability to monetize them, nobody will ever make another one. There is no scenario where you get an open-source ecosystem of LLMs - not out of this proposal, I mean; there may be ways, but this isn't one. At best, they'll start making them out of licensed material, which is fine, but does not result in open-source models.

    So this proposal amounts to asking for new legislation, just to get a few models for free and then either put LLM companies out of business or force them to behave in order to keep subsequent models proprietary.

    I mean, it's okay, but why don't we just declare that LLMs are derivative works of their entire training set? That doesn't require any new legislation, it's a decent argument within existing copyright law, and the end result is that current models get deleted, after which LLM companies either go out of business or start to license training sets.

    So the only difference in result is that we don't get a few models for free - but we only have to convince judges, not politicians. Also, are we sure we want to keep those models anyway? Part of the problem is that they are full of PII. There is no fix for that.

    1. Mye

      Throwing the baby out with the bathwater

      for me AI has created two major accessibility capabilities. My hands don't work right. I use speech recognition instead of keyboards. AI is significantly better at speech recognition than Dragon. Graham really helps with fixing speech recognition errors both in grammar and in missing words. With speech recognition, I'm now able to write code again. I'm sure that speech recognition could also aid in improving the usability of computers for vision-impaired people. With appropriate peripherals, it would probably also work for mobility-impaired people.

      I suspect you and many others here are TABS (temporally able-bodied. Computers have royally forked us over, and until the introduction of chatbots, it's only gotten worse. If you're looking for an ethical and moral reason to keep developing AI systems, accessibility is my number one reason. Then there are a whole host of others when AI's beneficial such as weather forecasts, finding financial crimes, developing new medical treatments, and all the systems are based on a mixture of public domain and copyrighted information. Without that information, the system's performance is degraded to something that is not worth running, and we, as a society, lose out.

      1. Headley_Grange Silver badge

        Re: Throwing the baby out with the bathwater

        I agree, but the LLMs could do all that without ingesting, learning and mimicing every work of fiction, non-fiction, journalism, etc., couldn't they? If the only way the LLM business can grow is by stealing stuff then they haven't got a business, have they?

        It's not a defence I could use in a court of law is it? "Hey, your honour, I was just stealing the bank's money to set up my business but once I'm done a few more robberies I should be up an running and since my business will be beneficial for everyone I should be allowed to continue."

        1. Mye

          Re: Throwing the baby out with the bathwater

          That's a good question. I don't think so.. for example, with writing code, no, they can't. They need as big a training set as possible to synthesize the code matching the prompt. Gramerly it seemed that the bigger their data set, the more accurate the grammar corrections. Back the deep distant past, I had conversations with people at Dragon Systems. They told me that every generation of Dragons has an increasingly large vocabulary, allowing them to better match what is said and what should be written. from what I can tell, they were again using a precursor to an LLM for language models dependent on larger data sets. AI speech recognition is so much better than Dragon. Play with Aqua sometime and you will see what I mean. It still has problems with performance but it's getting better.

          in a different comment, I outlined a different solution than banning companies from using harvested information. take a look and tell me what you think

      2. doublelayer Silver badge

        Re: Throwing the baby out with the bathwater

        You're conflating two different types of AI, but it's not likely to convince anyone. Newer speech recognition systems that use AI, in the sense of modern machine learning strategies, are not covered here. They are not covered because those weren't trained on stolen data. They were trained on recordings and transcripts that the creators have the rights to. At least, that is the case for those we know about. It's possible that some of those have been trained on recordings they didn't have the rights to, but you wouldn't notice it the way you do with LLMs. Likewise for weather forecasting models, pattern recognition models, and the like.

        There's something all of those have in common: none of them use LLMs. I do not get a forecast by asking a chatbot for one. I do get one by using a specialized forecasting model. The argument in this article is about LLMs, the ones trained from stolen content. Thus, your most convincing arguments, the ones that bring accessibility or other societal benefits, aren't covered in the argument at all.

        1. Anonymous Coward
          Anonymous Coward

          Re: Throwing the baby out with the bathwater

          "They are not covered because those weren't trained on stolen data. They were trained on recordings and transcripts that the creators have the rights to."

          No. They don't have enough archive to train a LLM. I used to translation software a lot in 2005 and it was basic at least. Not usable but gave a quick basic understanding of the sentence. Now, the result is better than expected and is native-level.

          "There's something all of those have in common: none of them use LLMs. I do not get a forecast by asking a chatbot for one."

          So much missing in your knowledge here. Instead of having focused LLMs, have an AI that is focused on all subjects.

          This bypasses your argument.

          1. doublelayer Silver badge

            Re: Throwing the baby out with the bathwater

            My post made one major point, that there are models that are not LLMs. You seem to have missed that point. Voice recognition models are not LLMs. People who make the former out of licensed data do not have enough licensed training data to make the latter. Improvements since 2005 come from more computing power, more advanced techniques, and more time and money spent on improvements. They do not come from LLMs.

            You're welcome to try making an LLM forecast the weather better than tailored models. When you do, feel free to submit it to rigorous examination. You won't have much success, because the tailored models are going to improve their performance faster than your LLM will.

          2. Doctor Syntax Silver badge

            Re: Throwing the baby out with the bathwater

            "I used to translation software a lot in 2005 and it was basic at least. Not usable but gave a quick basic understanding of the sentence. Now, the result is better than expected and is native-level."

            So as a user of S/W trained on massive amounts of other people's work without their consent, how much are you proposing to pay those people for the benefit you obtained?

          3. Conor Stewart

            Re: Throwing the baby out with the bathwater

            All this comment shows is that you don't have as much understanding as you think you do.

          4. Anonymous Coward
            Anonymous Coward

            Re: Throwing the baby out with the bathwater

            According to Apple AI, the BBC said that Luigi Mangione shot himself. He didn't, and the BBC never said that.

            https://www.bbc.co.uk/news/articles/cd0elzk24dno

            Let's just hope you don't use the water forecasts to help plan a long distance sailing trip.

      3. Conor Stewart

        Re: Throwing the baby out with the bathwater

        Just because it benefits you that doesn't mean it should be allowed or legal.

        It is theft, plain and simple, if you applied your logic to anything else you probably wouldn't agree with it then. Is stealing money from people fine if it is given to charity or to poor people? Is stealing houses fine if they are given to poor people? If I buy one copy of a book and then print many copies of it myself and give them out is that fine? No none of that is fine, it is all theft.

        That code you are writing with speech recognition, are you actually speaking every word and piece of punctuation yourself or are you basically telling the LLM how you want other people's code stitched together?

        Everything you described can be done legally without using people's stuff without permission. Yes it won't be as easy and will be more expensive but it is possible.

        Weather forecast models shouldn't have any problem with finding legal training data, neither should medical models, neither should financial models. Your statement that, "all the systems are based on a mixture of public domain and copyrighted information", has absolutely nothing to back it up. Also the medical industry has a requirement for privacy and laws specifically made to deal with medical privacy, at least where I am if anyone outside of who is actually treating me wants access to any of my medical records for any reason then they have to ask permission.

        It has also been found through research that if you prepare the training data better and are more selective with it then it is possible to train a better or just as good but smaller models on less training data. The data can be prepared better by manual detailed labelling as one example that can be used for images.

        For good speech recognition then why do you need to use data without permission? Either hire people to create training data or ask for volunteers and market it as helping disabled people. Ask people to spend a little time, maybe even just 5 minutes, reading out some text to be used as training data. This is something that people like you could do for yourself since if it is just your hands that don't work well then you can still read and speak. Yes it would be more work than just stealing every bit of audio with a transcript that you could find online but it is the morally and ethically better way of doing it. There would also be the requirement to use the data for a specific purpose and nothing else, like if it is collected for speech recognition then it can't be used for text to speech without asking for permission again. If it is code that is collected for a code speech recognition model then it is only used for that, not to train a code generator model.

        There is no excuse for stealing, even if you believe it is being used for good. There are ways to get training data in a moral and ethical way, but these companies don't because that is more work and would cost more money.

        1. catprog

          Re: Throwing the baby out with the bathwater

          If an author reads a book and remembers parts of it is is theft if they then write a new book?

    2. Anonymous Coward
      Anonymous Coward

      > Training these models is staggeringly expensive. If you remove the ability to monetize them, nobody will ever make another one.

      That's the whole point.

      > There is no scenario where you get an open-source ecosystem of LLMs - not out of this proposal, I mean; there may be ways, but this isn't one.

      That's not the idea. The idea is to stop all this being done, but hell, if it's already been done, make what they've got public, rather than waste it entirely.

      > At best, they'll start making them out of licensed material, which is fine, but does not result in open-source models.

      That's the goal.

      The whole point is not to make a big open-sourced LLM ecosystem, it's to stop them using data they aren't licensed to use. The open-sourcing of their data is just the punishment to stop them doing it again.

  9. nobody who matters Silver badge

    <......"Fining Big-Tech isn't working"......>

    It sin't working in part because the fines are pathetically small in relation to the profits of 'Big-Tech', but mainly because thus far, they seem to be getting away with failing to pay them.

    1. Phil O'Sophical Silver badge

      Exactly. Stop fining the companies, fine the board members directly, and use the Norwegian speeding fine approach, with a substantial percentage of their annual income.

      1. Conor Stewart

        It reminds me of parking tickets. If you are rich enough then you just don't care, paying the tickets is not a problem and they wouldn't even notice the difference in their bank account, they don't even pay them themselves, they just tell their assistant to pay them. They know they will get a parking ticket but there are no real consequences so they don't care.

        Fines for large companies are just the same, the fines make no financial impact or have any consequences so the companies do not care. Or they use their expensive legal team to get out of it.

      2. LucreLout

        Stop fining the companies, fine the board members directly, and use the Norwegian speeding fine approach, with a substantial percentage of their annual income.

        While this is certainly an interesting idea, there's just one or two reality bricks to smack it in the face before we can see if it still has teeth.

        Firstly, what do you consider, presumably British, legislative reach to be? While its often news to the Americans, a nations legal reach is generally taken to extend to its territorial waters and not beyond - would you really wish to follow say, Saudi law here in the UK because they decided you should? Fining American tech CEO's for what they do in America, from England simply isn't going to work. It has no reasonable prospect of success.

        Most board members would simply take the Philip Green approach and have the assets and income derived from their work held elsewhere by a spouse, thus negating the power of any fines available. As they're not our citizens, again, you're going to struggle to actually see any of that fine income.

        I do like your thinking, but unfortunately its not remotely achievable in the real world.

    2. Pete 2 Silver badge

      Cui bono

      > relation to the profits of 'Big-Tech

      It is worth remembering that while many profitable companies are creating AI implementations, none of those AIs currently make a profit.

      The only company making real money from AI is Nvidia. And they do it without scraping anybody's data.

      So if AI offerings aren't making money from the data they scrape, it is difficult for a claim of damages to ask for a slice of that non-existant profit.

      1. ChrisElvidge Silver badge

        Re: Cui bono

        So NVidia here is the equivalent of Levi Strauss in the Gold Rush? Also Pickaxe, shovel, etc. manufacturers.

        1. Conor Stewart

          Re: Cui bono

          Pretty much, just like they were with crypto mining. They had no direct hand in crypto mining they just built and sold hardware that could be used for crypto mining. They made a huge profit that didn't come from crypto mining themselves. So Nvidia is very much like equipment manufacturers in the gold rush.

      2. Doctor Syntax Silver badge

        Re: Cui bono

        "It is worth remembering that while many profitable companies are creating AI implementations, none of those AIs currently make a profit."

        Profit is income minus expenses. Just fine them a proportion of income. Include all that investment in the income.

  10. Anonymous Coward
    Anonymous Coward

    50GWh? That’s all? For the whole model?

    I hate to equivocate too much, but a training run consuming “the equivalent power use of 4,500 homes” while running seems like a pretty good bargain to me.

    By comparison there are around 145,000,000 homes in the US, and in 2023 the US consumed, per its energy research agency [0], 11.3 quadrillion BTUs (~3,300TWh) for residential uses (including all residential energy sources, not just electricity). So over the 100 days the model trained, let’s eyeball the consumption of those homes at a little under 1,000TWh. Or 1,000,000GWh.

    50GWh consumed in training—that puts us 5 orders of magnitude under the residential consumption for that country over that period, which itself is only 15% of the total energy consumed in that overall economy.

    This is the training run—the energy-intensive “building the factory” part of delivering the product. To a back-of-envelope first estimate, the Empire State Building used 57,000 tons of steel [1], and per a random but plausible-sounding web search result, building steel today takes something like 7,500 KWh/ton or so of energy to produce. Suggesting this OpenAI model took somewhere in the neighborhood of 10% of an Empire State Building’s steel to build (ignoring the energy cost of the concrete!).

    The Empire State Building has 2.8mm sqft, so even cramming people in at 100sqft each, that suggests it’s serving on the order of 28,000 people’s office needs. OpenAI say the model they built with this energy serves 200,000,000 every week [2]. We’re firmly in apples-to-oranges territory, but still: in terms of humans served, that’s 4 orders of magnitude improvement over building something that cost 10x as much energy just for its steel.

    Longer-run, sure, inference is more energy-intensive than status-quo data center activities: it’ll still be more energy-expensive than slinging JavaScript over a wire. And already the big operators are making a splash by tendering contracts for dedicated generation capacity co-located with their bit barns, and tending to prefer nuclear supply for that purpose (however you feel about nuclear, it’s not carbon-intensive).

    But OpenAI are already serving their 200,000,000 weekly active users with extant capacity—the same RISE article estimates the total energy cost of serving the whole user base GPT-4 for a year at 91GWh. So with our remaining 90% Empire State Building-worth-of-steel energy budget, we can serve something *10^9 humans for what, 5 more years? I imagine that works out to wildly less energy than asking all those people to schlep to a library to get their questions answered.

    There are plenty of thorny problems with the AI stuff. I just don’t understand why the carbon-intensity angle is so salient for critics: these models pose hazards along so many lines, some existential—this just doesn’t feel like a significant one (and it does seem like an especially solvable one).

    Refs (I don’t have link permission):

    [0] https://www.eia.gov/energyexplained/us-energy-facts/

    [1] https://ascemetsection.org/committees/history-and-heritage/landmarks/empire-state-building

    [2] https://www.axios.com/2024/08/29/openai-chatgpt-200-million-weekly-active-users

    1. nobody who matters Silver badge

      Re: 50GWh? That’s all? For the whole model?

      <......."I just don’t understand why the carbon-intensity angle is so salient for critics: these models pose hazards along so many lines, some existential—this just doesn’t feel like a significant one (and it does seem like an especially solvable one)..".....>.

      It is indeed very easily solvable.

      Just switch them off.

      Additional: 200 000 000 users being served unreliable and inaccurate answers to multiples of that number of queries. What an utter fucking waste of time and energy!!

    2. Like a badger Silver badge

      Re: 50GWh? That’s all? For the whole model?

      "I hate to equivocate too much, but a training run consuming “the equivalent power use of 4,500 homes” while running seems like a pretty good bargain to me."

      I'd agree - except that for it to be a bargain requires some tangible benefit. In many ways I am reminded of the original Mechanical Turk, which was a con-trick to persuade people that a machine was doing clever stuff. Two hundred and fifty years later we have dispensed with the human hidden inside, but still have a machine that does clever stuff, but actually doesn't do useful stuff.

  11. Howard Sway Silver badge

    Alternative solution

    As the big tech companies have decided that everybody else's intellectual property is effectively open source, and theirs to take as they want and profit from, declare that all these companies' intellectual property is now also open source, force them to publish all their source code and let the world do what they like with it. No energy wasted, much benefit gained for the world.

    1. Anonymous Coward
      Anonymous Coward

      Re: Alternative solution

      Yes. Make their non-current versions Open Source and managed like domain names are.

      Pre ChatGPT3.5 was open. And most of the jumps made to 3.5 came from the fact it was open. As soon as they closed it, the model withered being keep alive by more GPUs.

      It isn't that it costs so much to train them. It is more the waste of it. Snapshot LLMs peaked a while back.

      The big problem is that if/when we switch to real-time AI to fix many of the niggles Commenturds have, then the leccy bill will be insane with our current models.

      The solution is to create with the AI a new model that is designed to be low-power from the get go. ATM, emulating a liquid neural network(LNN) on a LLM is the best answer. Liquid AI use something like that to compress/abstract their neural network models.

      So instead of fiddling while Rome burns and stoking the flames, let's focus on LNN and slice the necks of the power giants.

      1. Conor Stewart

        Re: Alternative solution

        If you believe in your idea then go and do it and prove it instead of just commenting on how it would be so much better with no proof.

    2. Richard 12 Silver badge

      Re: Alternative solution

      Doesn't work.

      Training these models requires an absolutely astronomical amount of resources, and even running them requires a massive amount of resources.

      Just kill them, dead.

      If a movie is made using unlicensed IP, and no settlement can be agreed or cuts made to remove the IP, that film is destroyed. End of. Doesn't matter how many millions were spent making it, it's gone.

      LLMs are the same, with the small detail that it's literally impossible to settle or remove the unlicensed IP they use.

      1. Wellyboot Silver badge

        Re: Alternative solution

        An excellent example of showing that to have real power all you need is money in vast amounts*, the movie industry using it to stifle competition by 'encouraging' laws in exactly the opposite way to IT companies that just ignore laws and stall court cases until fines are a balance sheet footnote and any 'ban on specific activity' has little relevance.

        * Politicians looking for donations (and other perks) will appear with complementary agendas to aid these endeavours.

  12. Mye

    something I wrote elsewhere.

    I know I have pissed off more than one person with my stance on IP changes necessary to accommodate the unethical behavior of AI companies. In a nutshell, all information discovered by humans is part of a common pool should be shared among all people. If you use that information to make a profit, the moral and ethical requirement is to put some of that profit into a sovereign fund (see Norway, revenue from oil fields and FRAND licensing), and any IP discovered while creating the AI system should be declared public domain.

    I concede that limited-time monopolies are needed to protect new IP discoveries and enable the discoverer to extract revenue from the market, which would incentivize the commercialization of discoveries, whether from an AI or a biological entity.

    My view is driven by my desire to gain a new perspective on the topic and have new conversations without re-treading the same cemented positions.

    I thought I was alone in this idea when I discovered this article in my morning news feed. While it starts from a different set of premises, it almost ends up with the same result.

    IMO, there are some flaws, such as preserving the current copyright model using GDPR to restrict what you can use for training and focusing on punishment rather than social benefit. However, the outcome of training results being considered public domain is good.

    1. doublelayer Silver badge

      I don't agree with either of your points, and I'm a bit surprised to see them together. You are not the first person who wants to significantly restrict or eliminate IP protections. While I think some parts of patent and copyright need to be changed, most proposed architectural changes are, in my mind, harmful. Maybe you have specifics that I would agree with, but in the general terms you've stated so far, I don't think I would agree.

      The idea that any IP-based activity should submit funds somewhere is not a problem. It's called taxes. If, however, you mean what I think you mean, that they need a special tax for the fact that there wasn't something physical involved, I disapprove of the suggestion. Physical labor is not the property of others. Intellectual labor shouldn't be either. The work of others is not free to be consumed for any reason, and making the people consuming it pay into a public fund does not cancel out the harms they've done to everyone whose work they've used without permission.

      1. Like a badger Silver badge

        If we're going to fiddle with IP protections, how about we address the disgraceful differences between IP rights and patents? Why is that M&E innovators get 20 years benefit, but some bloke who slings together a bit of software has his entire life plus 50 years? Music and media are even more outrageous at 70 years after author's death.

        Reset all rights to 30 years from the date the rights are asserted, job done.

        1. Mye

          Even 30 years is too long. At 20 years, the vast majority of IP is forgotten and abandoned. There is the issue of time from discovery to reaching the market. I would also suggest that if IP was not commercialized and actively sold/generating sales five years after registration the IP is considered abandoned and most of the public domain.

        2. doublelayer Silver badge

          "Why is that M&E innovators get 20 years benefit, but some bloke who slings together a bit of software has his entire life plus 50 years?"

          Because patent protection is a lot stronger than copyright protection. If I have a patent on something, you are not allowed to use my invention without my permission. For example, if I have patented a certain chemical, you can't make that chemical. It doesn't matter if you've made a different manufacturing system and have a different use case for it, and it doesn't matter if you've invented something you mix with it to do something else, the product isn't allowed unless you get my approval. In copyright protection, you just can't use the same form I did. If I write a piece of music, you can't write the same one, but you are free to write something similar. Patent protection is a lot more than copyright protection, and thus it has a shorter time period.

      2. Mye

        The proposed fees for a sovereign fund are not a tax; they are rent, just like the monopoly holders of copyrighted and patented IP charge rent for using that IP. The foundation of why an IP holder or user should pay rent to society is that no IP is ever generated from nothing. It's all based on the work that comes before. As I said, all IP is part of a common pool. For example, many drugs result from research funded by governments. That research is based on something discovered previously. Pharmaceutical companies take that research for free and make obscene profits. They should be paying into a pool that could be used to pay for more study or benefit society in other ways.

        The same is true of art. Artists train on the works of others and then incorporate the techniques and perspectives into their own work. This is why there are "schools" of art—a collection of artists who use the same or similar techniques to produce art. The Impressionists, Pointillists, abstract, Hudson River Valley, pastoral—art is also not limited to Homo sapiens. Earlier hominids have left behind similar works on cave walls or carved shapes. Elephants are another creature that creates art.

        We don't need to significantly restrict or change IP protections. Semiconductor manufacturers use this practice. See https://www.jedec.org/about-jedec/patent-policy. This should become standard for all protected IP. You get paid for it but you can't restrict what people do with it. This could benefit society in many ways, for example, lower-cost pharmaceuticals. Third parties could make the same drug as the original discoverer, and they can do it at a lower price than more profit. One thing we need to protect against is the equivalent of patent trolls—someone "inventing" something and claiming part of the FRAND licensing pool without actually producing something themselves.

        Yes, intellectual and physical labor are different. If you dig a ditch, you can't claim IP protection and extract rent from all ditch diggers for it. However, many people, including myself, have discovered that the results of our intellectual labor was created by somebody else earlier. This is why the patent office has a first-to-file rule. As a society, we need more intellectual honesty and effort to determine whether a piece of IP unique enough to be worth protecting. I also think we need to expand IP protections to AI-generated intellectual labor.

        1. Doctor Syntax Silver badge

          "However, many people, including myself, have discovered that the results of our intellectual labor was created by somebody else earlier. This is why the patent office has a first-to-file rule."

          If something is independently reinvented should it really have passed the originality test to receive a patent for first to file?

        2. Conor Stewart

          If any third party can just copy someone else's research then companies will just stop researching. Say a company spends millions researching a drug and then after all that research another company just comes along and makes it themselves and sells it at half the price, what do you think happens to the company that developed the drug? They can't compete, they make no profit, they spent lots of money on development so the company either or goes bankrupt or stops researching, either way the research stops. Why then would companies research if they know they will just get copied and undercut immediately and the whole process will just lose them money? This doesn't lead to society improving long term. It may be good for some people short term but long term it just means no more research.

          Your comment about extending IP protection to AI generated IP is just not a good idea in any way. LLMs are trained on data gathered without permission and everything they generate is derivative of its training data, there is nothing truly new or unique generated.

      3. Conor Stewart

        I am not the person you replied to but whilst we do need a patent and IP protection system it needs to change. In a lot of cases now it is abused and used to stifle innovation, both with companies patent trolling and patents being too long and too broad.

        In my opinion patents need to exist but need to not cover entire concepts and abstract ideas, instead they should only cover implementations and only if the company actually uses the patent. For technology now 20+ year patents on whole concepts is too much.

        As an example for pharmaceutical companies, a patent on a specific drug is absolutely fine, even long term, it is a specific implementation and other companies would only copy it, not innovate on it. However patenting a whole class of drugs or concept is different and should not be allowed in my opinion because even though the company may make drugs within that class of drugs it slows down innovation and possibly prevents good treatments from being found sooner.

        Another example is 3D printing. The entire industry was held back due to 20+ year patents by stratasys on entire concepts.

        In my opinion patents should be used to prevent people from just copying implementations, not for preventing innovation.

    2. Doctor Syntax Silver badge

      How, in this scheme, does payment get back to pay the original creators?

      Admittedly some work is created essentially pro bone to some extent - however event OSS under GPL requires anyone adding further development to make their additions available on the same terms.

    3. nemecystt

      This seems like some hare-brained half-baked Robin Hood scheme. Rob from everyone and then scatter-gun some repayments to some other people.

  13. Pete 2 Silver badge

    Who wins?

    > Household names and startups have, and still are, scraping the internet and media to train their models, typically without paying for it

    ISTM the objections that the originators have about AIs trained on their data, is that someone else us profiting, but they don't get a cut. Consequently their objections are only about money.

    If AI outfits are required to give away LLMs trained on other people's data, those people still won't see any money, so their objections aren't resolved.

    1. Mye

      Re: Who wins?

      This is why I recommended a RAND solution. on the other hand, they may only get 10^ -9 of a cent for the fragment of their work that was used but at least they get paid

      1. Doctor Syntax Silver badge

        Re: Who wins?

        "they may only get 10^ -9 of a cent for the fragment of their work"

        So you think that, involuntarily, they should be providing all the rest of the value of their work to the public gratis?

        1. doublelayer Silver badge

          Re: Who wins?

          From this and their other comments, not only do they want that, they want to strip people of even that amount almost immediately, and they want to give the AI credit for anything it outputs. Therefore, implementing all their solutions means people who make training data get nothing, except for a few people who published very recently who get next to nothing, and the people who make the AI models get tons just on volume. There's one thing they got right though, I really disagree with almost every part of their suggestions, excluding one tiny element where their nonspecific proposal is something I agree with in principle.

    2. AlexanderHanff

      Re: Who wins?

      A cut doesn't have to be money - so your argument falls apart right there. By forcing the models into the public domain - everyone has access to the benefits, not just the lawbreakers who trained the model and benefits could be a whole host of different things depending on what the models are used for.

      1. Pete 2 Silver badge

        Re: Who wins?

        > A cut doesn't have to be money - so your argument falls apart right there

        The argument is valid. We can see that because all that the originators of AI training data have asked for is cash.

        So yes, their cut does have to be money. The only point of disagreement is the price.

  14. skwdenyer

    This is a far more complex problem than it first seems.

    If I spend 3 years reading others’ history books, and then use that research to write a new history book, have I created a derivative work? Or have I simply used what I’ve read to educate myself? To learn? Should my school text books contain a waiver to enable my essays to be free from these suggested IP infractions? Where is the line?

    If the LLMs are learning from published works, they’re doing no more than you or I would given a large enough library and sufficient reading time.

    There may be a different argument to be applied to visual models, but, again, if I study enough pop art and then create my own, should I be paying royalties to Warhol’s estate?

    The proposal here seems to be to create new IP law to treat an LLM differently from a human.

    The issue really isn’t, to me, about copyright; it is a wider discussion about social good. And, in particular, whether there should be in effect an “AI tax” to in some way level the playing fields. The primary difference is one of scale - however much I learn, I can only do a day’s work every day; an AI is scalable. The value it can extract from its research and study is far greater than mine.

    As a society, do we want LLMs or not? And how narrowly are we prepared to write laws that catch OpenAI, but don’t tax individuals just for using a library to better themselves?

    1. Like a badger Silver badge

      "As a society, do we want LLMs or not?"

      How about "no"?

      The people who do want them are the tiny minority who have invested millions and are desperate for either a payback or a greater fool, and those devoid of healthy scepticism.

    2. BartyFartsLast Silver badge

      Have you brought anything new to the party or have you just reworded other people's stuff is the question you need to ask.

      As far as I see it so far there's no novel content being created by AI, it's just rehashing other people's content and not offering anything new.

      even the people using it to develop "new" drugs, molecules etc are only getting iterations of existing works albeit faster.

      So far, the only new work is the work of the people trying to find uses to justify the utter waste of energy that AI heats the planet with.

    3. AlexanderHanff

      Are you aware that when you go to University most of the time you will sign away the rights to your academic work to the institution itself? Most Universities I have attended or worked for have the same contractual clauses - so in essence - yes you already lose your rights to the derivative works you create from your education instruction.

      1. Conor Stewart

        All universities have different arrangements. The one I went to states that students retain ownership of all their coursework and research because they are not employees so their IP does not automatically become the universities.

        For employees it is different, the university owns the IP if it was created during their normal duties or if it was created using university resources. Research is either subject to those rules or can have its own arrangements. Externally funded research is different still and has individual arrangements between the researcher, company and university.

        Even in the cases where the university owns the IP the inventor still has a part in it, they still receive a portion of the profits and still have some rights. The same applies to spin off companies, they need to pay the university a portion of their income.

    4. Boris the Cockroach Silver badge
      Unhappy

      What you are forgetting is that in order to read those history books, you'll need to buy them first... thus the author gets paid, even if you borrow them from a library, the author will get paid because the library had to buy them.

      These LLMs are scraping everything they can without paying for it, followed by having a plan to make you pay for the results. and going into your next research project, you get help/data from the LLM and you realise that almost word for word comes from an article you published 2 years ago.

    5. Doctor Syntax Silver badge

      "If I spend 3 years reading others’ history books, and then use that research to write a new history book, have I created a derivative work?"

      If that's all you've done then maube you have. Where are those archive visits to access original sources? Or adding something extra - insight - that LLMs are not adding?

    6. nemecystt

      I think the point is that a person learning a subject or learning a "school" of artistry puts in work and gets some recompense for that work if they make products with it, or if they use their skills in employment for another person or company. Once you've trained an AI to do that, those people's ability to earn has been taken away, or at least significantly reduced. Companies get to have cheap results. Employees get to lose their jobs.

      This has happened over the years with manual labour type things, via mechanisation and automation. It should be a good thing - freeing up people's time, allowing them to do more interesting and rewarding things. But instead it just drove real wages down, productivity up and corporate profits up. Now the same approach can be wielded against the more interesting and rewarding things. Sigh.

  15. This post has been deleted by its author

  16. Anonymous Coward
    Anonymous Coward

    King Cnut

    History is littered with those that tried to hold back the impending tide.

    All with good intentions.

    Luddites anyone?

    1. skwdenyer

      Re: King Cnut

      Important to remember that King Cnut didn’t thijk he could hold back the tide; his disciples did. His demonstration wasn’t (to him) a failure; he *wanted* everyone to see he was no God.

      1. Anonymous Coward
        Anonymous Coward

        Re: King Cnut

        Nobody likes a smart cnut.

        (Joke)

    2. doublelayer Silver badge

      Re: King Cnut

      We may not be able to prevent LLMs from being created. That doesn't prove that the tide is useful. It doesn't prove that the intentions were wrong. It also suggests that your idea of why the Luddites were Ludditing might be wrong, but you're far from the only one to make that mistake.

      However, in this case, that's not my opinion. Make an LLM if you want. I don't find them very useful, and I'm more than happy to ban the use of them in some cases, but if you want to make one and use it to try to do your work, have at it. Just don't steal people's work to do it. You have to buy that when it's copyrighted. If you don't want to, you can use any stuff that is in the public domain and anything people agree to give you. I may not think your electricity usage is the best use of that resource, but you're paying for it, go ahead. I may not think you have the ability to make a good product, but it's your product, don't worry about me. Trying to protect people from abuses, specifically ones that have been obviously illegal for some time, is not trying to hold back a technology. You can use copyright infringement for a number of things, and some of those are things I find useful. That still doesn't justify letting you do it.

  17. sedregj Bronze badge
    Windows

    Garbage in, garbage out

    "have, and still are, scraping the internet and media to train their models"

    Multiple LLMs have been trained on basically the whole internet and they still hallucinate and talk bollocks. GPT5 is late and will be worryingly crap too.

    This is not the AI you are looking for and never will be, no matter how many billions of squids you spend on GPUs and power stations.

    It's all tulips.

    1. Like a badger Silver badge

      Re: Garbage in, garbage out

      Ah, but at the end of the the Tulip Mania, we had some pretty flowers. LLMs aren't going to even deliver that.

    2. Conor Stewart

      Re: Garbage in, garbage out

      On top of that there has been research to suggest that better preparing and selecting training data can lead to smaller models and better results. An example I read about was about a model for working with images, I can't remember what exactly it could do but they generated training data by paying people to look at an image and generate a description, they would speak their description and it would be captured by speech recognition software. They got the workers to describe the image in quite a lot of detail. After training the model was smaller and better than other models that were trained on a lot more data, like the whole Internet.

      One problem with the current approach to AI is that it has no understanding or reasoning, it is all just pattern recognition and advanced predictive text. It has no way to tell if the training data or it's responses make sense, this is why training data quality is so important. I think that the issue of LLMs hallucinating is because of how much nonsense data it is trained on, anyone who has spent time on the internet knows that it is full of incorrect information even amongst otherwise correct information. If an AI is being trained on this huge mix of correct and incorrect and contradictory information with no ability to understand or reason then of course it will get confused and generate nonsense or incorrect outputs.

      Until AI has the ability to understand and reason it will have issues with hallucinating but this can be partially avoided by being very careful what you train it on. If AI could understand and reason then it could weed out the incorrect information and check what it is outputting makes sense and is correct. However we are a long way away from AI being able to understand or reason, it is a massive jump from current "AI" technology and what we have now is basically decades old ideas (like neural networks) taken to the extreme.

  18. GNU Enjoyer
    Facepalm

    Imaginary Property does not exist

    It disappoints me greatly that the register lends support to such fraudulent claims maliciously crafted to confuse people about the legal reality; https://www.gnu.org/philosophy/not-ipr.html

    The relevant law is copyright law, but the article only mentions it once?

    Massively scraping writings and software from most websites on the internet and shoving all of such text into parrot software (LLMs) and then selling SaaSS access to such parroted output is simply mass copyright infringement.

    It is known to be copyright infringement even if a human memorizes and retypes a substantial amount of text with a few minor word choice differences then tries to license such resulting work, but suddenly it isn't copyright infringement to sell SaaSS access to output of proprietary software that is used to implement a rather lossy compression algorithm that text can be shoved into to get nice looking combinations of the input as output, or sometimes exact input text as output.

    In this case, such companies are not the copyright holder of the input works, thus they have no copyright permission to release the model to the public domain unless they were to get approval from each and every copyright holder.

    I agree that such models should simply be deleted, as it's not like they actually form a practical use (plausible looking text without any substance to it is not of any practical use aside for things like fraud).

    It really isn't that hard to sort works by compatible licenses and then only train each model on those and then include a copy of all relevant licenses and meet attribution requirements and license conditions (this will probably actually result in a higher-quality model, as reddit posts about putting plastic glue on pizza won't get inserted into the model).

    1. GNU Enjoyer
      FAIL

      Re: Imaginary Property does not exist

      Also, I can't believe I only saw this now;

      >Make illegally trained LLMs public domain as punishment

      It would not be a punishment for a business to get away with mass copyright infringement by merely ceasing to claim copyfraud restrictions over a model that the business didn't have a copyright claim on in the first place.

      Validly ending copyright restrictions by releasing a work into the public domain without other restrictions is in no way a punishment, no matter the situation.

      1. doublelayer Silver badge

        Re: Imaginary Property does not exist

        It would be a punishment, just a much smaller one than they deserve and one that harms someone else at the same time. Companies with LLMs would lose a lot of money if the models they spent tons of money on were released for free, so they would really hate it. For the other reasons, that's not good enough. They should be required to destroy the models that contain the data, destroy their copies of the data, and they can go back and train on the remainder of their training data, already confirmed to be something they have rights to, using their latest software. That probably makes for a bad model, but that's their problem.

        1. Conor Stewart

          Re: Imaginary Property does not exist

          It is punishment to the company but it is also punishment to the people who's rights are violated. It would be like if someone copied your published music or copied parts of your book and instead of being made to delete it they are just made to make it available for free. Yes it punishes the company for copying your work but it also punishes you for no reason and to most people wouldn't be an acceptable solution.

          1. doublelayer Silver badge

            Re: Imaginary Property does not exist

            I agree. It is definitely the wrong punishment. I merely wanted to indicate that it's incorrect to say that "ending copyright restrictions by releasing a work into the public domain without other restrictions is in no way a punishment, no matter the situation". It is, but it's not strong enough and it causes unnecessary and preventable harms that we shouldn't allow.

        2. GNU Enjoyer
          Headmaster

          Re: Imaginary Property does not exist

          >Companies with LLMs would lose a lot of money if the models they spent tons of money on were released for free

          Free means freedom.

          Grammatically "for free", would mean for freedom ("for free of charge" is clearly a contradiction).

          Most English readers/speakers never realise this until I point it out?

          There is a big difference between no longer being able to illegally and more importantly immorally profit from restrictive activities and losing money (most LLM companies have only lost money so far anyway).

          > already confirmed to be something they have rights to

          There are no rights - only restrictions.

          They don't need to be the copyright holder to use data for training, they can simply choose training data with a license on it and follow the license terms.

          1. doublelayer Silver badge

            Re: Imaginary Property does not exist

            Free means several things:

            1. Liberty (see freedom).

            2. Lacking something (a field free of trees).

            3. Not costing any money.

            4. Not limited physically (let it fall free).

            5. Generous or frequent.

            6. Probably other things, plus all the things it means that aren't adjectives.

            Stop claiming it means one thing. We all know what "free software" means. When we talk about it, we will use both words, and many of us will say "free/libre software" just in case it gets mistaken for software that is free(3). If someone says something like the thing you quoted, they and you both clearly understand that they were using the completely valid and more widely understood definition number 3. If they meant the free you're trying to claim as the only option, they would have said "released as free software". Your grammatical argument, that free(3) must always expand to "free of charge" is wrong, and it fits just fine in the way they'd express any other price ("for £20").

            "There are no rights - only restrictions."

            Wrong again. There are rights. If you have copyright over some code, you have a right to copyright, defended by your nation's copyright laws. "Right" has another similar meaning, the ability to do something, granted by a contract. Thus, if I sign a contract stating that I will allow you to come live in my house, you can call it perfectly grammatically, and the contract may also call it, the right to live in my house. You can negotiate for permission to do something, and while it has a slightly different meaning than the governmental right, that's a valid word for it.

            Grammatical defenses of things tend to be unconvincing, even to someone who mostly probably agrees with you.

  19. Anonymous Coward
    Anonymous Coward

    Potentially another option - poison the well?

    What you put online for, for instance, a test is your responsibility, what someone else does with it is not.

    Ergo, comments you put in a webpage or documents which remain invisible but are indiscriminately hoovered up by robots could contain all sorts of fun. This sadly does not work for visual and audio artwork like pictures and music which are already abused to produce soulless derivatives, but there ought to be quite a lot of fun to be had with ye olde text.

    Maybe THAT is an Open Source project worth collaborating on?

    I cannot see any legal approach work, especially as the US will have two Presidents as of next year, one real billionaire who is doing his best to hide evidence of misdoing and whose environmental pretense are very much in question and one who is as fake a billionaire as his orange skin tone who you can apparently just pay into his presently rather depleted bank account to have things shoved under the rug. That one is already likely to take an axe to the Justice* system to keep himself out of jail.

    * A misnomer as it rarely delivers actual justice as far as I can tell.

    1. Conor Stewart

      Re: Potentially another option - poison the well?

      Maybe you could do it with images or audio too, similar to how people hide data in the least significant bit of images (stenography). Maybe if that was done enough (maybe the 2 or 3 least significant bits) and the model is sensitive enough it could mess with the model.

    2. GNU Enjoyer
      Trollface

      Re: Potentially another option - poison the well?

      >comments you put in a webpage or documents which remain invisible but are indiscriminately hoovered up by robots could contain all sorts of fun.

      I've read about claims about the best way to ensure a webpage doesn't get fed into a LLM after scraping is to insert hidden text full of profanity and words affiliated with racism, which I figure will be effective against scrapers set up to exclude such writings.

      >THAT is an Open Source project worth collaborating on?

      For something to qualify as "open source", it needs to be software that is under a license; https://opensource.org/osd

      The most effective text I can think of is an alphabetical, or randomized collection of all know "offensive" words, but that wouldn't creative enough to qualify for copyright and therefore could not be "open source".

      Many projects that brand themself as "open source" are chock full of proprietary software and/or have a proprietary CoC, thus when I see that a project is on github and that brands itself as "open source", I lose interest in collaborating.

      I only contribute to projects that proudly brand themselves as free software, as I know that the project will always be working for freedom.

      1. doublelayer Silver badge

        Re: Potentially another option - poison the well?

        "I only contribute to projects that proudly brand themselves as free software, as I know that the project will always be working for freedom."

        Ah, one of those who think that "free software" and "open source" aren't often used as synonymous terms by people who understand that they are slightly different, but not so different that ideological battles are necessary. You're not wrong that some things are described as open source when they're not, but two things apply to those: A) all the things you list are contrary to definitions of open source, including the one you linked to and B) something calling itself "free software" has done all of the same things.

        You go on to decide that, somehow, GitHub is a way to tell these from another. I don't know why, you just do. Of course, everyone who has spent a little while actually working with this stuff understands that, to figure out how well it fits with the ideals of free software or open source, you actually have to check the details because what website they host their repos on tells you nothing. A good start is often to open the license file and see what it says. Is it a standard license you've already read? Is it one of those with an additional clause attached? Is it one of their own design? Does the one of their own design veer into restrictions on being allowed to use, modify, or distribute? Does it have a section that could best be summarized starting with "except for someone"? That doesn't answer all of it, but it tells you a lot of useful things, often good enough to toss out many of the fakers.

  20. Anonymous Coward
    Anonymous Coward

    This again?

    Wow... let's rinse and repeat. Copyright doesn't apply to transformative work. An LLM is the absolute definition of a transformative work. If someone using a LLM forces it to generate copyright material, they are the ones infringing the copyright, not the people who created the LLM. A small snippet of an LLM output would count as a quote anyway (even if not attributed) and hence again does not infringe copyright.

    There are plenty of "reaction videos" on YouTube, if someone wants to go and apply copyright law. Even those would be tough to prosecute.

    1. heyrick Silver badge

      Re: This again?

      I seem to remember that it was ruled that some music playing in the background of a clip that wasn't even the song or particularly about it had a valid copyright strike.

      I myself have had a copyright claim on some music in one of my videos - a tune that a blood pressure monitor played while doing it's thing (but the copyright holder didn't object to the use). Maybe those reaction videos are allowed because it's good publicity, not because it's "fair use"?

      Also, what you might want to call "transformative" may actually be "derivative". If I was to grab a copy of some famous book and translate it (by machine) into a different language, it won't resemble the original at all, but it's the same story, the same ideas, the same characters. Can I claim copyright on having "transformed" that book into some other language? I think you'll find, that's not how it works.

      LLMs are basically a digital blender into which anything and everything has been thrown. That doesn't make it right or legal just because the output is a mishmash that only sort of makes sense if you ask very simple questions.

  21. harrys Bronze badge

    Missing link

    LLM's.... the missing link between now and idiocracy

    absolutely fascinating

    goto any maternity ward, look around and imagine all those tiny humans 18 years from now having grown up dumbed down by LLM's

  22. Anonymous Coward
    Anonymous Coward

    Personal Info.

    Oh no, Fred has stolen a copy of my front door key.

    I know, I'll devalue his key by giving everyone a copy of my front door key.

    1. heyrick Silver badge

      Re: Personal Info.

      More to the point, if the key is broken and doesn't work, what makes anyone think the rest of us would want a copy?

  23. Groo The Wanderer - A Canuck

    Illegal I'd illegal - they should be destroyed.

  24. Bebu sa Ware
    Coat

    "harvesting fruit from their poisonous trees, gorging themselves on those fruits,"

    "... of the tree of the knowledge of good and evil, thou shalt not eat of it: for in the day that thou eatest thereof thou shalt surely die." KJV Gen.2:17

    Hope then that these perishers might... umm... perish?

  25. stiine Silver badge
    Big Brother

    Not if my data was used...

    If my data was used in an LLMs training, I would prefer that the LLM be deleted.

  26. Eivind Eklund

    You are an LLM

    Human brains are large neural networks that generate language as output. If we follow fruit of the poisoned tree: Have you watched a movie illegally, maybe a VHS brought from home and showed in your kindergarten? Your output is poisoned. Have any of your friends done it? Fruit of the poisoned tree goes to the entire evidence tree - you talked to them, your output is poisoned. If you really believe this, you'd better delete everything you've written, or release it as public domain too acid the environmental cost.

    1. Conor Stewart

      Re: You are an LLM

      Human brains are in no way comparable to LLMs. Human brains function very differently and have different functions like being able to understand and reason and have unique thoughts. LLMs are just pattern recognition. We don't even fully understand how brains work (if we did then making a copy of one would be easy) so how can you claim that brains are just LLMs?

    2. catprog

      Re: You are an LLM

      Have you watched a legally purchased vhs. That is also what they want to be prevented.

  27. Anonymous Coward
    Anonymous Coward

    So Big Tech scraped the entire pre-2025 internet, and have it safely ensconced on their private servers and LLMs. That data becomes even more valuable if they use their influence to see that the pre-2025 internet sources are de-monetized into oblivion such that it gradually vanishes from public access. All the more reason for the public internet to be as ephemeral as possible.

  28. AVR Silver badge

    How to win with delaying tactics

    Suppose some politician started pushing this idea now with the intention of getting a law in place in a couple of years. If Big Tech put some effort into it I expect they could push out the date when it finally takes effect at least a decade, probably more. Then there'd be enforcement and the courts, and assorted appeals; I doubt it'd take less than 20 years to force ChatGPT 4 to become open source if it even succeeded.

    At which point if LLMs are still big business their owners would 1) have replaced the old ChatGPT 4 with something else, 2) could have moved any affected money or operations to somewhere else in the world with more appealing laws, 3) IBG YBG. Sounds like a pyrrhic victory at best.

  29. John Robson Silver badge

    You can't just public domain works derived from copyrighted sources.

    More reasonably you could:

    a) insist that every source is identified and paid $1000 per token.

    b) insist that the model isn't used until all the relevant copyrights have expired.

    c) insist that all future models are correctly licensed, with a fine of 120% of global turnover of all parent companies.

  30. AlexanderHanff

    There is a reason I didn't focus on copyright in the article (which most commenters seems to have missed) - I am a privacy guy, I am more concerned with the privacy and data protection issues than the copyright issues; and yes I am perfectly happy to accept that others might be more concerned with copyright - and of course they are perfectly entitled to be - but as the author, my work will always be more focused on privacy than anything else.

    If you read my previous article (referenced in this article) that much should be come very clear. I originally felt that deletion was the only possible way to deal with the issues (and even sent a legal demand to OpenAI to delete GPT3.5 as it contains my personal data) but as much as some people would be happy to just ignore the environmental impact of such a move - I am not. That is why it is an OpEd - it is my opinion and my opinion will always be more focused on privacy implications than anything else.

    It is not possible to extract the personal data from the models (at least not yet), but those who broke the law by training their models on such data should absolutely not be permitted to profit from it (in my opinion).

    Thanks for all the comments though, I read them all and will continue to do so as the day progresses.

    1. rgjnk Silver badge
      Alert

      Impact

      "just ignore the environmental impact of such a move"

      That's already happened. The damage is done. The training cost is well and truly sunk. Gone.

      The actual impact of implementing the deletion is negligible. Data evaporates cheaply.

      The ongoing cost to the environment of keeping it in existence and in use however *is* significant. Assuming anyone really cares about such things.

      Then again just like the mess of GPDR etc. this opinion seems to be busy focusing on minor issues while ignoring fixing the substantive and proposing a solution that, well, isn’t one.

      If your concern truly is privacy then putting the resultant blob of data in the open is still not a solution, however much of an environmental figleaf is used to dress it. The infringement and harm exists even without the profit.

      Deletion is the quick, simple, permanent solution. No sinecures to be had off the back of a one time fix but we can't all win.

    2. JWLong Silver badge

      Privacy

      This is my major concern also. I realized about 10 years ago what Google was up to, so I began a war with them of "poisoning the well".

      So far, so good!

      If they, or anyone wants to train their mess on my info please have at it because it doesn't represent me .

    3. doublelayer Silver badge

      If your concern is privacy rather than copyright, your solution is somehow even worse. It would be much easier to get at any of your private data those models hold if those models were open source. I could start chopping out any protections that would prevent it from happily spitting out the data. I can run queries much faster until patterns start to show up. Asking for something that contains your private data to be made public because you're unhappy that it contains your private data is backwards.

      It's the same as if I said that Google should be penalized for their data extraction by taking their records on everyone they can find and making those public. It's a real penalty, as that's the data they use to claim to advertisers that they can target ads. "See, we have every site this person ever visited and everything they ever searched for, so surely we know what they will be willing to pay for." They won't get that revenue without having exclusive access to that. However, my problem is that Google has it, and that is not solved by making sure that everyone else, from Facebook to governments to criminals has it too. It is a purely negative change.

      I understand your reasons for suggesting it, but those are flawed as well. You see something that was costly to create, and you don't like getting rid of it. I can be like that as well. If something is working, even if I don't need it, I don't want to toss it into the trash. I try to find someone else who will use it, even if most of them tell me that they've got something better, because it's not dead yet. That doesn't work when the item concerned has a flaw. If, for example, I had a Samsung Galaxy Note 7 with one of the self-igniting batteries, I might be unhappy that I now have to dispose of a device that was expensive to make and so far hasn't done anything wrong. It would be dangerous to myself and others to keep using it on that basis.

    4. Conor Stewart

      You say you are more concerned about privacy than anything else, yet you aren't. If you were really concerned with privacy then you would be advocating for the deletion of these models and training data and nothing less. Instead you are concerned about the environmental impact and wanting the companies to not profit off of it. There have already been exploits to make LLMs output things they shouldn't, do you really think that the companies will continue patching these exploits if they have to make it public? Even if there is currently no way to extract personal information once the model is public it would be available forever giving people more than enough time to find ways to extract data or just to keep it archived until the technology required to do it is available.

    5. John Robson Silver badge

      "It is not possible to extract the personal data from the models (at least not yet), but those who broke the law by training their models on such data should absolutely not be permitted to profit from it (in my opinion)."

      But your proposed remedy doesn't stop them profiting, it just means others can profit as well.

      And given the size of the models, the only others who can profit are the other giant tech companies.

      1. Conor Stewart

        Exactly, it isn't like you could host it effectively on your home computer. Their business model wouldn't even change much, they would still charge for subscriptions to their cloud service where they host the model, it just means any other company can do the same using their data centres.

    6. LVPC

      There is no legal basis for your "solution" of ignoring copyright. Your "feelz" don't count. People are focused on copyright because LLMs are tools of mass rights destruction, INCLUDING privacy rights.

      Read the room.

  31. An_Old_Dog Silver badge

    Not a Helpful Solution

    Making the models public would allow World+DogCorps to copy, tweak, rename, and monetize those models, but not myself, as I couldn't afford to host them.

  32. sketharaman

    Are you new here?

    Millions of people trained on "In Search of Excellence" and created excellent companies that made humungous profits. How much share of their profits did they give to Tom Peters and Robert H. Waterman?

    Billions of people read content on zillions of websites that were accessible publicly and went on to earn diplomas and degrees and get high-paying jobs. How much share of their salaries did they give the publishers of those websites?

    Some people argue that it's different in the case of LLMs since they do the slurping and training at unprecedented scale compared to humans but I see no big difference between billions of humans doing something once and one ChatGPT doing that thing billions of times.

    So many people and companies have filed so many lawsuits against OpenAI and other GenAI / LLM companies alleging copyright infringement over the past few years. AFAIK not one has received a favorable decision in a court of law. I tend to believe that's because their complaint has no legal merit.

    1. Conor Stewart

      Re: Are you new here?

      If you read a book then someone has likely paid for that book. If you read information on a website then either it is provided for free, paid for by you or monetised through adverts.

      If a LLM uses information gathered from advert monetised websites then how much money does the website make? Only the money from the initial harvesting from the LLM which will be equivalent to one visitor and that is only if the LLM loads the adverts. The website would lose out on a lot of money if everyone accesses it's information through the LLM instead.

      People pay for what they consume one way or another, even if not obvious or indirectly and the owner sets the price they want for it. There is no need for a portion of profits or salaries because the information is paid for or free.

      Your argument is based on incorrect assumptions.

  33. heyrick Silver badge

    it’s roughly the equivalent power use of 4,500 homes over the same period

    Is my maths really bad or are you missing a zero or two? I make that a little over eleven thousand kilowatts per house over the 100 day period...or a smidgen under 46kWh per hour every hour.

    Which at the rough price I pay (charges included) is €0.34/kWh times 45kW times 24h times 100 days is... €36,720!

  34. John Savard

    Not a Remedy

    I don't think this will fly. The bad thing about "illegally trained" LLMs is that they were trained on copyrighted material. If they were trained illegally on stuff like E-mails and USENET posts, it would make sense. But otherwise giving the model away in the public domain would just compound the harm to the copyright owners.

    1. Conor Stewart

      Re: Not a Remedy

      Exactly, if the issue is copyright then this solution is the equivalent of someone copying your music or book and selling it and then being made to release it for free, yes it hurts the person infringing copyright but it also greatly harms the copyright owner.

  35. Mike VandeVelde
    Trollface

    this is one of the stupidest things I have ever read on the internet

    And now I am posting this comment after having read it. Should I be executed for copyright infringement? I was "trained" on mountains of data which includes this article in order to create this response which I am not paying the author anything for.

    How do I know that the author never read anything that I have posted on the internet? If I can prove it then should I have the author be executed for infringing my copyright, reading my internet postings and having them contribute more or less in some way to this article for which I am not being paid anything??

    "Whah, I posted something on the internet and it was responded to!!1!!!!!111!!!!"

  36. nemecystt

    Wrong solution

    While I have some sympathy for the reasoning here, I think there's still a big issue with allowing even the public to benefit for free from tools which are in part trained upon copyrighted work. Buried in all that scraped personal data is work for which artists/creatives of various types are due royalties or at least control over how their hard work is exploited. These companies seem to have no regard for any kind of IP law. Better to delete the models and enforce real (not fake green-washing) carbon cost to the companies. Make them plant forests.

  37. LVPC

    So a billion copyright violations justify a billion more?

    Giving away the models doesn't do anything to fix the copyright violations. All of does is make it permanent.

    The solution is to destroy the LLM (same as you would counterfeit merch), fine the perps (same as you would anyone selling counterfeit merch), and make restitution to those whose work was used without permission.

    The AI bubble will burst, graphics cards, storage, etc., will come out of their bubble world pricing, and the world will be a better place.

  38. Mostly Irrelevant

    I disagree with the idea that the models should be public domain, unless it's with the express agreement of EVERY copyright holder (which is functionally impossible). The remedy I personally support is to delete the offending models AND provide restitution to all copyright holders. Said restitution should be all income made off the model, not taking expenses into account, divided equally among all claimants. This should also have the needed deterrent effect of bankrupting any company selling a commercial model containing stolen IP.

    This is compatible with existing IP laws, where giving away the offending derivative work isn't because it gives companies a way to steal your work and put it in the public domain.

  39. TheMaskedMan Silver badge

    As "solutions" go, this is utter cobblers. The whole point of intellectual property laws is to protect the stuff by restricting access to those you can screw money out of One does not achieve that by giving it away.

    Of course, if the model isn't actually a breach of those laws, then forcing into the public domain wouldn't harm the IP owners, but then they'd have nothing to complain about in the first place, so it seems a bit pointless other than to give the OP something to tub thump about.

    I'm wondering why they're not also going after Google search - that scrapes the web, caches it and squirts bits of it back verbatim with every single query. And they blatantly make money from doing so. But Google bashing isn't trendy, is it? Won't give you a sexy axe to grind and a nice spot on the crusading lecture circuit.

  40. JulieM Silver badge

    Seems entirely reasonable

    It's not enough to wipe the data. The only way to be sure no-one is benefitting from "intellectual property" without authorisation is to open it up to the Public Domain so no-one at all can benefit from it.

    And then offer redress to anyone who loses their ability to monetise their creative work as a result of this, in the form of a right to sue the original perpetrator.

POST COMMENT House rules

Not a member of The Register? Create a new account here.

  • Enter your comment

  • Add an icon

Anonymous cowards cannot choose their icon

Other stories you might like