Open source licenses need to leave the 1980s and evolve to deal with AI

Free software and open source licenses evolved to deal with code in the 1970s and '80s. Today they must transform again to deal with AI models. AI was born from open source software. But the free software and open source licenses, based on copyright law and designed for software code, are not a good fit for the large language model …

  1. b0llchit Silver badge

    I'd argue the current system is "Gospel In, Garbage Out".

    1. jake Silver badge

      If you were fully aware of what's in a lot of the training data, you'd probably agree with the standard definition.

      1. b0llchit Silver badge

        Gospel is just the indoctrination of Garbage to create an illusion.

        1. Anonymous Coward

          The next sentence, "We need, the EFF concludes, open data.", seemingly concludes that the "open" part of the data is without considerable resources.

          What seems obvious to me is that companies have 2 reasons not to track the data. 1. Plausible deniability, this is huge. 2. Hardware resources. Looking at 2, what are the hardware requirements for logging a single function that supposedly has a billion parameters? How many times is that function being run in 1 second? Granted this is probably easier than I think, but no company is going to let on to anyone how easy it is for reason 1, which is enabling illegal/unethical corrupted operations (plausible deniability is HUGE in all contexts).

          As for the not knowing what's inside mumbo-jumbo/FUD... if a circuit depends on an electrical current, it can be fully traced, especially by its creator (with enough resources).

        2. jake Silver badge


          This round's on me :-)

        3. Adair Silver badge

          'Gospel'='good news', from Hellenist Greek 'εὐαγγέλιον'.

          So, there is actually a very pertinent question re the issue concerning AI/IT and data: is the input 'good news', i.e. 'true', or 'wholesome'?

          'Garbage in: garbage out' has it exactly right, but with AI offering the added piquancy of 'Gospel in: garbage out'.

  2. Flocke Kroes Silver badge

    Unsettle law

    At the moment, it is not clear what the law is or even should be. Providing an ML code generating service is a legal risk. Using an ML code generating service is a risk. I would vote for:

    *) GPL in, GPL out

    *) BSD in, BSD out

    From then on it gets more difficult:

    *) BSD+MIT in, who gets attribution?

    *) Many licences in, output code cannot safely be distributed under any license.

    1. jake Silver badge

      Re: Unsettle law

      "who gets attribution?"

      I can point to software that states upfront "This software or parts of this software are provided under one or more of the following licenses ...".

      "Providing an ML code generating service is a legal risk. Using an ML code generating service is a risk."

      Yes. And the lawyers are eventually going to figure that out. Gut feeling is we're heading for another AI Winter based on this aspect alone.

      1. Flocke Kroes Silver badge

        Re: Unsettle law

        "One or more of the following licenses" can come about in different ways:

        1) The original author(s)/copyright holder(s) selected a license consisting of "One or more of the following licenses" for the whole project.

        2) Files from different projects with different compatible licenses were included together in one project. Each file retains the license from its copyright holder.

        3) Someone has created a file by mixing together files with different licenses.

        (1) and (2) are fine. (3) is discouraged. If you come across something like that, either avoid or get proper advice from a real lawyer and not some fool like me commenting on the internet.

      2. Claptrap314 Silver badge

        Re: Unsettle law

        That would be nice. More likely, the big money behind these GOLEMs gets a section 230-styled exemption for a decade & then it's too late.

    2. Doctor Syntax Silver badge

      Re: Unsettle law

      Train it on the licences the code comes from and let it write its own. A combination of Microsoft & BSD 2-clause would be interesting. (Actually early versions of Windows did include a BSD licence for the BSD network stack.)

      1. Lil Endian Silver badge

        Re: Unsettle law

        Train it on the licences the code comes from and let it write its own.

        What next?! Allowing MPs to vote on their own pay rises? Wait, wut? Seriously!

        1. Derezed

          Re: Unsettle law

          I'll have you know the MP pay board is 100% independent and definitely not in thrall to the merry-go-round of corporate pay and mutual back scratching. Whatever our hard working MPs get paid I am 100% sure it is 100% worth every penny. 100%. /s

      2. Anonymous Coward

        Re: Unsettle law

        It will probably hallucinate something giving itself the rights to the users' organs.

    3. the spectacularly refined chap

      Re: Unsettle law

      *) BSD+MIT in, who gets attribution?

      That's missing the point: the attribution relates to the authorship, not the licence. Even historically, if you lift code from a BSD-licenced project and place it in another BSD-licenced project (or indeed GPL code in another GPL project) and do not copy over the relevant copyright notice, that is itself a licence violation, even without a change of licence.

    4. rerdavies

      Re: Unsettle law

      > *) BSD+MIT in, who gets attribution?

      Every single copyright holder gets attribution, with the license they granted you.

      > *) Many licences in, output code cannot safely be distributed under any license.

      You have to include all of them. Every copyright holder gets named. Along with every license.

      Not complicated at all. Just very tedious, and very necessary.

      The most recent project I shipped used a lot of open-source packages (none of them GPL, which I consider to be a plague upon humankind*). All told, after scanning sources, and dependent libraries, the copyright notices for my medium-sized project are 148,241 bytes long -- considerably larger when it's translated to HTML in the about box. I would estimate that about 4,000 people or groups of people have copyrights in some part of the code in my project. Yes, I did merge license text. (multiple copyrights, with only one statement of the license). There are in fact substantial variations and families of variations to "BSD" licenses (e.g. BSD 3-clause, BSD 0-clause), and many variants of the "MIT" licenses as well.
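      The merging described above (many copyright lines, one statement of each license) can be sketched roughly as follows. This is a minimal illustration, not the tooling actually used for that project: the `(copyright_line, license_text)` input shape, the function name and the sample entries are all assumptions.

```python
from collections import defaultdict


def merge_notices(entries):
    """Group copyright lines under one copy of each distinct license text.

    `entries` is an iterable of (copyright_line, license_text) pairs.
    That input shape is an assumption made for this illustration.
    """
    holders_by_license = defaultdict(list)
    for copyright_line, license_text in entries:
        # Collapse whitespace only, so reflowed copies of the same text merge
        # while genuinely different variants (say BSD 2-clause vs 3-clause)
        # stay separate. The license is emitted in this collapsed form.
        key = " ".join(license_text.split())
        if copyright_line not in holders_by_license[key]:
            holders_by_license[key].append(copyright_line)

    sections = []
    for license_text, holders in holders_by_license.items():
        sections.append("\n".join(holders) + "\n\n" + license_text)
    return "\n\n----\n\n".join(sections)


# Hypothetical sample data, not taken from any real project.
entries = [
    ("Copyright (c) 2019 Alice Example",
     "Permission is hereby granted,\nfree of charge."),
    ("Copyright (c) 2021 Bob Example",
     "Permission is hereby granted, free of charge."),
    ("Copyright (c) 2020 Carol Example",
     "Redistribution and use in source and binary forms are permitted."),
]
notice = merge_notices(entries)
print(notice)
```

      Every holder is still named, and each distinct license text appears once. Matching on whitespace-collapsed text is deliberately strict, so near-identical variants are kept apart rather than silently merged.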

      * Why is the GPL a plague upon humankind? Because it divides opensource software into the half that people genuinely made available for use by anyone, and the half that is virally tainted by the GPL which can't be used by anyone that doesn't want to become infected by a license that all the corporate lawyers I've talked to seem to consider to be uninterpretable gibberish. (Try parsing the expanding definition of what "linking" means in GPL 3, for an obvious example).

      1. heyrick Silver badge

        Re: Unsettle law

        Why is the GPL a plague upon humankind?

        + bignum

        Exactly this.

      2. Justthefacts Silver badge

        Re: Unsettle law

        This is a gross misunderstanding of the way copyright (and attribution) works. Please, please have a quick chat with a copyright lawyer. None of what you are doing is necessary, or would be helpful/protective if it were.

      3. I could be a dog really Bronze badge

        Re: Unsettle law

        Because it divides opensource software into the half that people genuinely made available for use by anyone, and the half that is virally tainted by the GPL which can't be used by anyone that doesn't want to become infected by a license that all the corporate lawyers I've talked to seem to consider to be uninterpretable gibberish.

        Let me fix that for you :

        ... by the GPL which can't be used by anyone that wants to rip off other people's work without abiding by a fairly simple obligation to pass on the freedoms to others.

        If I were to write some code, then I can choose what terms I distribute it under (or don't distribute) - that's my choice. The people that make GPL licensed code available to you chose to use a licence that (effectively) says "if you use my code that you got for free and with certain freedoms, then you must make any derivatives you create of it available to others with those same freedoms".

        Your comment suggests that you are one of those who wants to use others' works for free, embed it into something, and then profit from selling a closed system. That's actually fine if those who provided the free code use a licence that allows it because they are happy with you doing that - but some have a different viewpoint and aren't happy with a freeloader profiting from their work without passing on those freedoms to tinker.

        As to the complexity of GPL v3, well that was in response to people abusing the GPL v2 in a way that technically complied with the licence, but effectively ignored the intent behind it, cf. TiVo for example. If you aren't familiar with that, in effect TiVo used GPL v2 code but made the hardware actively block the running of anything but their own compiled version via an encryption key. I.e., they actively blocked the "freedom to modify the software yourself" because the box won't actually run it. I have mixed views on v3, but I can't let your anti-GPL disinformation go without challenge.

  3. alain williams Silver badge

    Let us hope that we can ignore licensing

    This seems to be the attitude. Grab/digest as much code as possible, learn from it, generate proprietary code as a result.

    Having to worry about the license of the code digested is hard but should be doable. It will increase costs and this is what the ML owners do not want.

    Copying of code and ignoring licenses has been going on since the year dot. If the output code is not distributed then it is hard to detect. So what is happening here is not new but just happens faster.

    1. JimC

      Re: Let us hope that we can ignore licensing

      Yep, it seems to me the only code that can legitimately be used for this game is actual public domain. But there's probably not enough of that to make it viable, so like many before they just steal.

      1. Orv Silver badge

        Re: Let us hope that we can ignore licensing

        Among other things, it's arguably impossible to release something into public domain under current copyright law, at least in the US. That was one of the motivations for Creative Commons licenses.

        1. Doctor Syntax Silver badge

          Re: Let us hope that we can ignore licensing

          "at least in the US"

          Other places exist.

          1. Orv Silver badge

            Re: Let us hope that we can ignore licensing

            Yes, but all but 14 countries are signatories to the Berne Convention, which is what created the issue. Under Berne works receive copyright protection at creation; you don't have to register them. Once the work is copyrighted there's no real mechanism for un-copyrighting them. What the CC licenses aimed to do was let you sign away certain rights in spite of retaining copyright.

    2. Claptrap314 Silver badge

      Proven business model

      Let us hope that we can ignore licensing the law. Been the Silicon Valley motto since Google got going.

  4. karlkarl Silver badge

    I don't think open-source licenses need to change at all. There have *always* been criminals that have tried to exploit it without adhering to the license. This is not a new thing.

    Companies (including AI companies) are just waking up to this wealth of functionality that has been slowly and steadily growing (whilst the proprietary stuff has burned out due to corporate lifespan or over monetization). The responsibility is on them to not break license terms. But this is what companies do; they push legal boundaries for maximum wealth.

    If anything, what the open-source movement needs is a way of *mass* detecting when, e.g., GNU-licensed code has been used, and perhaps to come up with a concept of crowd-funded lawyers to tangle up the company in breach. Perhaps we should also consider AI GPL lawyers to churn through all the many companies I am sure are in breach.

    1. localzuk Silver badge

      Agreed. This article seems to mix up needing licenses for use with AI, and open source licenses for existing code.

      The open source licenses that exist at the moment already seem to be doing their job - leading to lawsuits when companies come along and try and use it without following the rules.

      The only reason to adjust the licenses would be to weaken them, which only those who will make money from LLMs will be advocating.

  5. Mishak Silver badge

    How far do you take it?

    How do you cope with code fragments that could be "original" or found in many projects?

    for ( int i = 0; i < N; ++i ) array[ i ] = 0;

    Even if it represents an algorithm to do "X", there could be many implementations of the same algorithm out there.

    Some crazy software patent attempts have been made along the lines of "using a loop to search for...".

    1. Howard Sway Silver badge

      Re: How far do you take it?

      That's not really the question here. The answer is that you reproduce the attribution and license from whichever project you extracted that code from. Because that code was supplied as copyrighted work under the terms of that license. Those arguing that this is "too hard" have clearly never heard of databases.

      If someone wishes to create a "free to use by AI, without attribution" license and release code under it, then it would be perfectly acceptable for Copilot and similar systems to do what they're doing now with that code. The problem is that Copilot is currently wrongly assuming that all the code it ingests already has such a license.
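      As a rough illustration of the "database" point above, here is a minimal sketch of a provenance lookup. Everything in it is hypothetical: the table layout, the function names and the sample record are assumptions, and it only finds snippets that match exactly once whitespace is collapsed.

```python
import hashlib
import sqlite3


def digest(code):
    """Hash a snippet after collapsing whitespace, so reflowed copies match."""
    normalised = " ".join(code.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


# Hypothetical schema: one row per ingested snippet.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE provenance (digest TEXT, project TEXT, license TEXT, notice TEXT)"
)


def record(code, project, license_id, notice):
    """Remember which project, license and copyright notice a snippet came from."""
    db.execute(
        "INSERT INTO provenance VALUES (?, ?, ?, ?)",
        (digest(code), project, license_id, notice),
    )


def lookup(code):
    """Return (project, license, notice) rows for a snippet, or an empty list."""
    rows = db.execute(
        "SELECT project, license, notice FROM provenance WHERE digest = ?",
        (digest(code),),
    )
    return rows.fetchall()


# Made-up sample record, not from any real project.
record(
    "int clamp(int v, int lo, int hi) { return v < lo ? lo : v > hi ? hi : v; }",
    "example-utils",
    "MIT",
    "Copyright (c) 2020 Example Author",
)

reflowed = "int  clamp(int v, int lo, int hi)\n{ return v < lo ? lo : v > hi ? hi : v; }"
matches = lookup(reflowed)
print(matches)
```

      The lookup itself is trivial. What a sketch like this cannot show is whether a generated snippet that has been renamed or restructured should still count as coming from a recorded source.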

      1. 42656e4d203239 Silver badge

        Re: How far do you take it?

        >>The answer is that you reproduce the attribution and license from whichever project you extracted that code from.

        Is it though? That answer, surely, depends on what, exactly, the LLM is doing with code.

        Is a "code writing" LLM just a database of code, from which it regurgitates on demand, or does it still apply probability/Markov chains to the source query (à la "text generating" LLMs) and write original (even if looking like it was copied) code? That is, does the LLM have a table that says "if someone asks for code to do X then spit out Y" or is it creating Y' from the probabilities stored internally selected by the initial prompt and, coincidentally (probabilistically?), Y' ≈ Y?

        LLMs are "randomish Y from a given X" generators - not, in my understanding, "IF X then Y" generators and, therefore, can't breach copyright because they aren't actually copying (in the sense of copy and paste) anything; they are generating, from probability, something that looks like it is copied. Does that break copyright? Dunno (obviously I tend to think no) but IANAL and I suspect actual lawyers will get rich on that question.

        1. HMcG

          Re: How far do you take it?

          Software copyright infringement cases have been won by the copyright holder simply because the accused developer had access to the copyrighted source code. That's why, back in the day, IBM PC clone manufacturers licensed in third-party BIOS software (notably Phoenix) rather than rolling their own: the BIOS software itself was not difficult to reproduce; the difficult part was reproducing it in a legally provable clean-room environment where no previous exposure to the BIOS source code was possible. If OpenAI and its ilk won't reveal in detail all the source code the model was trained on, I suspect it's because they know they will fall foul of this.

        2. Michael Wojcik Silver badge

          Re: How far do you take it?

          LLMs are "randomish Y from a given X" generators - not, in my understanding, "IF X then Y" generators and, therefore, can't breach copyright because they aren't actually copying

          That argument would have to be tested in the courts of various jurisdictions, but I doubt it would hold up under judicial scrutiny. In the US, for example, copyright law (USC 17) makes no such distinction. Copyright violation in the US is predicated on the result, not the means.

          Under USC 17 §101, a work or portion of a work is reproduced if it "can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device". It doesn't matter how complex the reproduction process is. It would appear that if a prompt to an LLM retrieves enough of a copyrighted work to be infringing,[1] while the prompt itself is not infringing, then the LLM has infringed. Nor need it be an exact copy, because derivative works can be infringing.

          I strongly suspect courts will not be persuaded by the "but we arrived at the protected work accidentally!" argument.

          [1] And how much is that? That's one of the thorniest areas of US copyright law, of course, thanks to the complexities of the fair use aspect (§107).

      2. doublelayer Silver badge

        Re: How far do you take it?

        The problem is that the programs can't tell you where every line came from and don't copy every line verbatim from somewhere in the training data. That's why you have to assume that, if you put licensed code in, it could come back out at some point. For example, the relatively basic line given by the last comment would probably be modified in several ways:

        1. Variable names switched to deal with local conditions. "array" switched to have some semantic meaning, "N" changed to some more useful condition, and "i" changed if this is in a context where a variable called i already exists.

        2. Format changed to match other formats. Single-line for statement changed to use braces, indentation changed, internal spacing changed, and, if the condition that replaced N is long enough, the for statement split across lines.

        3. Changing this line to do something related but not identical. For example, instead of setting every location to 0, running some other initialization procedure.

        Does making some or all of these changes make the line different enough that you no longer need to attribute? It's hard to tell, because it would probably depend on which license the original line was under (the program doesn't know) and how common the line is (in this case, even quoting that line would be so generic that it couldn't be copyrighted in itself, but more important lines could). The program will not figure this out. In some ways, I don't care about these small uses of code from licensed projects. However, the licenses exist for a reason, and when they copy larger parts of the code without following the licenses, it becomes a larger issue. The solution is to exclude data before training, because otherwise it's very difficult to prove whether it was involved in creating the output.
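        To make the three kinds of change concrete, here is a minimal sketch of a crude normaliser. It is an illustration only: the tokeniser, the tiny keyword list and the renamed variants are assumptions, and structural similarity says nothing by itself about whether attribution is owed.

```python
import re

# Deliberately tiny keyword list; a real tokeniser would need far more.
KEYWORDS = {"for", "int", "if", "else", "while", "return"}
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\+\+|--|[<>=!]=|\S")


def fingerprint(code):
    """Return the token list with identifiers replaced by positional placeholders."""
    names = {}
    out = []
    for tok in TOKEN.findall(code):
        if tok in ("{", "}"):
            # Crude: drop all braces so an optional block around a single
            # statement does not change the result.
            continue
        if re.match(r"[A-Za-z_]", tok) and tok not in KEYWORDS:
            # First distinct identifier becomes id0, the next id1, and so on.
            tok = names.setdefault(tok, f"id{len(names)}")
        out.append(tok)
    return out


original = "for ( int i = 0; i < N; ++i ) array[ i ] = 0;"

# Changes 1 and 2: renamed variables, braces added, spacing and line breaks changed.
renamed = "for (int row = 0; row < rowCount; ++row) {\n    totals[row] = 0;\n}"

# Change 3: a different initialisation (init is a made-up function name).
changed = "for (int row = 0; row < rowCount; ++row) totals[row] = init(row);"

same_shape = fingerprint(original) == fingerprint(renamed)
different = fingerprint(original) != fingerprint(changed)
print(same_shape, different)
```

        Renaming and reformatting leave the same skeleton behind, while the third kind of change does not. Neither outcome tells you which license the original line was under.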

      3. Falmari Silver badge

        Re: How far do you take it?

        @Howard Sway "That's not really the question here."

        No it is a valid question and one that needs to be asked.

        Do you really think that for ( int i = 0; i < N; ++i ) array[ i ] = 0; requires attribution, or that code containing that line infringes copyright?

        To me as a programmer it is really worrying that you consider @Mishak's question unimportant.

        My last employer of 23 years was an early adopter (v1.1) of C# .Net for the Windows part of our products. So in that time I have written a fair bit of C#, 95% of which is sitting in files with their copyright comments. If snippets of code like @Mishak's example are copyrightable then all I can say is it is a good job I have retired, as I can't see how I could ever write C# for another company when I would be forever infringing my last employer's copyright.

        Don't get me wrong, I believe that MS are infringing copyright with CoPilot; it is just the training part that infringes. Snippets of code should be judged like code produced by humans, but training data is not snippets, it is the whole source, and there is little argument that that is not copyrighted.

        1. Richard 12 Silver badge

          Re: How far do you take it?

          That snippet is not, in itself, copyrightable.

          However, many snippets that size are copyrighted. Some are also patented and maybe even trademarked.

          The LLM has no way of telling which.

    2. Justthefacts Silver badge

      Re: How far do you take it?

      This code-section isn’t copyrightable anyway. It’s far too generic and obvious. So it doesn’t matter if it was actually copied or not, or from where.

      It's really important to understand that “derivative work” is not a statement that the work is “derived from”. The actual copying history doesn’t matter. People usually don’t appreciate that. “Derivative work” is a statement about three things: similarity (determined from a comparison of the outputs), “distinctiveness and novelty”, “substantial part”.

      If I take fifty copyright works each of 50 lines and munge them together, to make a working output that does something different, then the output is *not copyright*. The output work simply isn’t very similar to “an individual input work”. There’s no “class action” here. That is a statement about copyright law. The terms of any “open source” licenses simply do not come into play until copyright is accepted. You can’t legally assert control over the use of text, until you show you own it in the first place.

      This is fundamentally different from how things stand for music, or performances. People's expectations have been shaped by the use of sampling in music, but the written word is a very different branch of copyright case law. And it’s got nothing to do with software or not, it’s the fact it’s written word. It may surprise you to know that this has all been thrashed out before in history, well before the advent of software. As far back as the 1920s in fact, when Dadaists such as Marcel Duchamp used found objects, and Picasso was using collages of snippets from newspapers. The newspapers tried to assert copyright ownership, to get a slice of the sales price of the art, and got their ass handed to them in court. The exact example I gave above, taking fifty works and slicing a phrase or sentence out of each, to produce a new piece, was so common in sixties counterculture it was a trope. It was tested in court, multiple times, and found not in breach. I remember someone took sentences sliced from government propaganda, and produced an anti-Vietnam piece published in NYT.

      Modern “legal” advice on collages, usually comes from university legal departments who just can’t be arsed to defend any of this stuff in court and advise artists to avoid using copyrighted items. But every single time it has come to court, in every country, collages, found objects, and sliced text have been deemed firmly not in breach, even and up to reproducing entire newspaper pages including advertisements.

  6. Zippy´s Sausage Factory

    I'd love to add a clause forbidding use as training data. I'll probably ask a lawyer friend to make one for me and start adding that to my sites, as well as a clause stating that by using my site for any purpose they agree to be bound, perpetually, to a legal domain of my choice.

    A completely toothless remark as I can barely afford lawyers at that sort of level, but should at least put them on notice that I've thought about this stuff and that if I win Euromillions I'm probably going to get annoying.

    1. Doctor Syntax Silver badge

      Licences recognised as Open Source licences don't have such exclusion clauses. Adding one to an existing OS licence would produce a non-OS licence as a result.

      1. Zippy´s Sausage Factory

        That's true (not that I mentioned open source, though, I don't think).

      2. Anonymous Coward

        Licences recognised as Open Source licences don't have such exclusion clauses

        Open Source licences already have an exclusion clause which says "you are not licensed to use this code unless you provide proper attribution", e.g. when the AI spits out part of the original source code in its output.

      3. doublelayer Silver badge

        Adding a clause which says "No training of an AI system whatsoever" would violate the open source definition, but adding one that says "If you use this to train an AI, you must release the source to the AI" would be acceptable. I'd rather people didn't, because the more incompatible licenses that exist, the worse the tangles they can get into, and such a clause is not necessary for the producers of AI to face copyright charges. Still, if you want to make an open source license that effectively prohibits use as training data, you could do it that way.

        1. Roland6 Silver badge


          Can’t see how restricting use for AI training violates this, unless you take a wider interpretation of derived works than the authors intended.

          So perhaps a tightening of the open source definition, to clarify the assumptions made in the original drafting would not violate the definition.

          But what is clear, if you can show the AI has drawn on open source the output from the AI is also open source…

          1. doublelayer Silver badge

            I was using this part:

            The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.

            In most licenses that have been affected by this part, the meaning has been where you can run the program. However, since the open source definition has some equation of the program and its source, it equally applies to who may use the source. If you forbid someone in AI from using the source, whether that means they are not allowed to run it or to process its code somewhere, the field would be forbidden, which directly contradicts that part of the definition. You don't have to care about this; many licenses look sort of open without complying, but people who have strict preferences may object and refrain from supporting the license.

            1. Roland6 Silver badge

              I thought you were, but the underlying assumption was the use of the “program” (which you discuss). I took this to be focused on execution, not reading, which is covered by copyright (Creative Commons license?); hence my observation that the inclusion of the source in an LLM is a form of reading. However, when the AI generates some program that incorporates the original source (*) the licence doesn’t limit the fields in which that generated program can be used.

              (*) I’m ignoring the question over the extent of the source that is included in the generated output in the above, even though it is an important consideration.

              Aside: this and other discussions show just how much those working with software need to have a good appreciation of licensing and the implications of the various public licences.

              1. doublelayer Silver badge

                "when the AI generates some program that incorporates the original source (*) the licence doesn’t limit the fields in which that generated program can be used."

                No, but a license which didn't obey that part of the definition wouldn't be able to get to that point. The definition does not say that the restriction is only on the running of the software. They say "use". I think it is illustrative to consider how licenses that are considered not open source based on this rule tend to be written.

                For example, one kind that I've seen a few times is the non-military license. Yes, they say that you aren't to run the software if you're in the military, but they also mean using parts of the code. They mean that the code is simply not to be used at all, no matter how much of it you take. Their licenses could specify that it's just running it, but they don't. Saying explicitly "no reading it into an AI system" is prohibiting a use of the code, even if it's not a use that people have cared about before. As I've said, I don't think this should be a major impediment to preventing it, as existing use is already violating nearly every open source license in existence*, so I see no compelling reason to make a license the openness of which will be debated.

                * Unless someone changes the way copyright works or finds a way to argue that LLM processing shouldn't count, reading it in is still copying. The licenses have requirements on copying, even if it's only attribution.

        2. Anonymous Coward

          Sounds suspiciously like the GPL's "if you dare call a function in a GPL library, you must release your source code".

          1. I could be a dog really Bronze badge

            "if you dare call a function in a GPL library, you must release your source code"

            Which as any person actually competent to write any code would understand is not correct.

            If you dynamically link to a library and call a function in that library, then you are OK - you have not embedded any of the library code into your program. If you embed that library into your program (i.e. copy the code into it, or statically link it) then your program (or at least, that copied code) becomes a derivative work which is also covered under the GPL, and the terms of the licence mean that you are obligated to offer others the same freedoms you yourself took advantage of. This is one of the classic bits of FUD that anti-GPL people keep re-iterating, as if repeating it often enough will make it true.

            This really isn't complicated.

      4. Michael Wojcik Silver badge

        Adding one to an existing OS licence would produce a non-OS licence as a result.

        How so? It might create a license which does not meet the OSD, but to the best of my knowledge OSI does not have a trademark on the term "Open Source", so someone would be free to continue to claim such a modified license is an "Open Source license".

        1. doublelayer Silver badge

          They would be free to continue claiming it, but it might not work out for them. Courts have previously recognized "open source" as having a definition, and used the OSI's definition. For example: ruling against a license which violated the OSD but was called open source. They considered "open source" to be a technical term, and therefore claiming something to be open source when it didn't meet the definition was considered false advertising in the same way as saying something had WiFi when it only had Bluetooth would not be permitted. A quote from that article might be useful here:

          The defendants appealed, and in February the US Court of Appeals for the Ninth Circuit affirmed a lower court decision that the company's "statements regarding ONgDB as 'free and open source' versions of Neo4j EE are false."

          On Thursday, the Open Source Initiative, which oversees the Open Source Definition and the licenses based on the OSD, celebrated the appeals court decision.

          "Stop saying Open Source when it's not," the organization said in a blog post. "The US Court of Appeals for the Ninth Circuit recently affirmed a lower court decision concluding what we’ve always known: that it’s false advertising to claim that software is 'open source' when it’s not licensed under an open source license."

          In an email to The Register, Bruce Perens, creator of the Open Source Definition and open-source pioneer, observed, "This is interesting because the court enforced the 'Open Source' term even though it is not registered with USPTO as a trademark (we had no lawyers who would help us, or money, back then). This recognizes it as a technical claim which can be fraudulent when misused."

          That might not be tested and courts can always change their minds, but if your business relies on calling your license open source, I wouldn't want to rely on that.

  7. MarcoV

    The article short-circuits the discussion and only protects the corporate AI industry.

    It totally bypasses the question of whether open source authors WANT to be AI fodder.

  8. Doctor Syntax Silver badge

    Maybe the burden should be shifted to require the output of CoPilot & the rest to prove it doesn't contain or depend on the OS inputs.

    1. Ken Hagan Gold badge

      There *must* be a presumption that it does depend on the OS inputs since otherwise what is the point of including those inputs in the training data?

      1. FeepingCreature Bronze badge

        Yes, the question is whether the duplication is of the same kind as human learning or of the same kind as copying.

        I don't think anyone wants a world where people have to avoid working on GPL software at risk of poisoning their brains with unlicensed training data.

        1. Richard 12 Silver badge

          Not even that

          "Clean room" implementations are a well known thing, and the only reliable defence in many cases.

          Copying by rote is still a breach of copyright - just ask any composer.

      2. rerdavies

        There's also a presumption that you are incapable of doing anything without depending on things you have seen or read at some time in your life. But thankfully, copyright law doesn't cover that sort of dependency.

        Copyright doesn't cover ideas. It covers expression. Whether AIs are regurgitating ideas or expression... hopefully somebody else will foot the very expensive legal bills required to settle that.

        1. Michael Wojcik Silver badge

          "Whether AIs are regurgitating ideas or expression"

          – depends entirely on a specific regurgitation, at least under US copyright law. It's not a question of how the system does it; it's a question of the result.

        2. Anonymous Coward

          "Whether AIs are regurgitating ideas or expression... "

          Expressions. "AI" (any of them) has absolutely no comprehension of ideas as it operates solely on *word* or *phrase* level, i.e. expressions. It's an advanced and automated copy-paste system.

          No more, no less.

    2. Doctor Syntax Silver badge

      Some of the above discussion just prompted this thought. One of the tenets of software freedom, the principle that underlies FOSS, is the freedom to study the code. Does this - or was it intended to - include the freedom to have your LLM study the code?

      1. Michael Wojcik Silver badge

        Training an LLM on open-source software does not appear to violate any licenses that meet the OSD, or US copyright law.

        Using such an LLM to produce code which may include substantial portions of the training data does not appear to do so either, provided the result is unpublished and for personal or archival use.

        Publishing such code, in whatever form – that's where the trouble enters.

        In other words, I don't offhand see anything in any OSD-conforming license or in USC 17 that attaches to ingestion of open-source software. It's regurgitation that can fall foul of it.

        So, hypothetically, you could train your LLM on open-source, and indeed on proprietary, software, and then run it through a magical filter that rejected all infringing outputs,[1] and keep prompting that combined system until you got an output to your query. It's conceivable that US courts could still infer a derivative work there, but it would be much more of a leap from the existing language of USC 17.

        [1] Basically, an oracle that could predict with high certainty whether any court in your jurisdiction would find infringement in the candidate output. Remember, this isn't meant to be realistic; it's a hypothetical.
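
        A minimal sketch of the generate-and-filter loop described in that hypothetical. Everything here is an assumption for illustration: `generate` stands in for the LLM, `is_infringing` stands in for the (unrealistic) oracle, and the toy implementations below only compare strings against a tiny made-up set of protected snippets.

```python
import itertools
from typing import Callable, Optional


def filtered_generate(
    generate: Callable[[str], str],
    is_infringing: Callable[[str], bool],
    prompt: str,
    max_attempts: int = 10,
) -> Optional[str]:
    """Keep prompting until a candidate passes the filter, or give up."""
    for _ in range(max_attempts):
        candidate = generate(prompt)
        if not is_infringing(candidate):
            return candidate  # first output the oracle does not reject
    return None  # every attempt was rejected


# Hypothetical stand-ins, NOT a real model or a real legal test.
PROTECTED = {"int main() { return 42; }"}

_toy_outputs = itertools.cycle(
    ["int main() { return 42; }", "int main() { return 0; }"]
)


def toy_generate(prompt: str) -> str:
    # Ignores the prompt and cycles through canned outputs.
    return next(_toy_outputs)


def toy_is_infringing(candidate: str) -> bool:
    # Exact-match check only; a real oracle would have to predict court rulings.
    return candidate in PROTECTED
```

        Calling `filtered_generate(toy_generate, toy_is_infringing, "write main")` rejects the first canned output and returns the second. The loop itself is trivial; the whole difficulty of the hypothetical sits inside `is_infringing`.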

  9. Lee D Silver badge

    No they don't.

    Copyright law is quite clear.

    Just because "AI" people think that they can ignore it and just suck in everything and use it as they like for commercial purposes does not mean the licences are wrong.

    It means the AI people need to verify their training data's origin and copyright status.

    1. Anonymous Coward

      ... and they need to provide any attribution that may be required by the licenses when it spits part of it out as output.

    2. FeepingCreature Bronze badge

      I agree in exact reverse, in that I don't think training+evaluating is covered by copyright to begin with.

      But no matter which way that debate goes, the licenses are fine as they are.

      1. Lee D Silver badge

        Copyright covers distribution of derivative works.

        Anything that produces related output after having been trained on copyright material is making a derivative work.

        Letting someone else consume that media without appropriate licensing (including any attribution, etc. that's necessary under the author's chosen copyright licence) is copyright infringement.

        1. FeepingCreature Bronze badge

          I am unambiguously a thing that produces related output after having been trained on copyrighted material. Since I've worked on GPL source code, can I no longer make any non-GPL work? And which license wins out?

          1. Lee D Silver badge

            Is your code substantially similar to the copyrighted works?

            And nothing makes you "no longer making any non-GPL work".

            What you can't do is infringe your agreed-to GPL or non-GPL licences by making code "substantially similar" to code you don't hold the copyright for without the permission of the copyright holder(s).

            You can be as smarmy as you like, but hyperbolising this into GPL vs the world is a trick employed by such copyright experts as SCO (who lost) and Oracle (who also lost, but because interoperability is required, which made the question of potential copyright infringement moot, so that part was never actually tried in court).

            And yes - there's a reason NDAs exist. There's a reason non-compete clauses exist. And there's a reason why Wine and Samba don't want you working on their code if you've been exposed to Windows internal source code.

            1. FeepingCreature Bronze badge

              Great, so as long as ChatGPT doesn't produce code substantially similar to GPL'd code, we're in the clear to use it.

              Honestly, I wouldn't want anyone working on Wine to be looking at Windows source code regardless of what the law says. There's a difference between "suits you can win" and "suits you can get thrown out", and if you're fighting Microsoft that difference is probably in the millions of dollars. I think ex-MS employees working on Wine would probably be winnable, it'd just be expensive.

              1. FIA Silver badge

                Great, so as long as ChatGPT doesn't produce code substantially similar to GPL'd code, we're in the clear to use it.

                Yes. That's why people want it to attribute.

                The thing is, I can ask you if you've used GPLd code; at the moment I can't ask ChatGPT.

                Also, your ability to think means it's much more likely you'll produce original code to do something as you will (hopefully) understand the task at hand. ChatGPT doesn't think, or understand, so is much more likely to spit out infringing code. (as 'remembering' is something it does with much more fidelity than humans).

                1. FeepingCreature Bronze badge

                  I mean, where I'm coming from is that I think ChatGPT does "think", in the sense that it has conceptual understanding of code that it can deploy in novel contexts and arrangements. So if I'm asking it a halfway interesting code problem, I'm unlikely to get substantially similar code to its training set.

        2. Michael Wojcik Silver badge

          Anything that produces related output after having been trained on copyright material is making a derivative work.

          This is obviously prima facie false under at least USC 17's definition of "derivative work".

    3. Michael Wojcik Silver badge

      Copyright law is quite clear.

      213 years of jurisprudence say otherwise.

  10. Cybersaber

    Author misses the point...

    Imagine I have a tool that helps me by harming others, but the tool can't be fixed to stop that harm. I don't get to say 'well, I can't stop harming others, and I can't fix the tool, so they'll just have to suck it up and let me keep hurting their interests.' NO, you need to stop using the tool until someone figures out how to make it not harm others. I don't care how 'helpful' the tool is. This is not a balance of interests situation.

    Saying that the tool can't provide attribution or understand the licenses is irrelevant. I don't get to say 'well all these software licenses are just too hard/complicated to understand and follow, so I just get to take what I want anyway.' It then follows that I can't use a tool to do the same thing on my behalf.

    Author, I can't copy your website, or your article, and put my name on it and set it up as my own. But your position is that as soon as I write an algorithm, that I will purposely design to be unable to tell it's stealing from you... well, that's just peachy. You need to update your profession to better cope with my ability to steal your livelihood. That's your position. Enjoy being jobless soon!

  11. Phil O'Sophical Silver badge

    What about non-artificial intelligence

    What about the situation where you remove the "A" from the discussion, i.e. a smart human C programmer downloads and studies lots of Python code, and teaches themselves Python from it?

    They then write some new code, which inevitably borrows concepts from the code they trained themselves on, even if it doesn't contain direct line for line copies.

    Is that an acceptable use of the input open source code? I think most authors would be happy that others learn from them, and most of us have probably done it at some time.

    Now, try to define a licence which permits that, but doesn't allow a machine to learn from the same code. I think it would be difficult.

    1. Claptrap314 Silver badge

      Re: What about non-artificial intelligence

      And now you know why I've made a point of not studying GPL'ed software very closely.

    2. Roland6 Silver badge

      Re: What about non-artificial intelligence

      > Now, try to define a licence which permits that, but doesn't allow a machine to learn from the same code.

      I thought that was part of the problem: existing licences were written in an era when it was assumed that humans were doing the learning and copying, and hence this underlying assumption can be used against AI usage. It is the proponents of AI who want to change that assumption to include machines without any rewrite.

    3. Anonymous Coward

      Re: What about non-artificial intelligence

      " concepts from the code they trained themselves on,"

      Logical error here: "AI" does not understand concepts, it has no intelligence whatsoever. It literally handles *text* and text only. It's the equivalent of copy-pasting a piece of code and changing the variable names to hide that it's a copy.

  12. Chubango

    "AI" just needs to comply with contract law

    > I guarantee that licensing trolls will come after "your" ChatGPT and Copilot code.

    Good. Respect the terms of the license; stop hoovering up code and regurgitating it if you disagree.

  13. Anonymous Coward

    Copyright infringement is very tenuous

    …even if trained on copyrighted content, these chat bots do not reproduce large quantities of it (unless requested by the user - in which case they might as well just copy the original code)

    It’ll be difficult to get a court to declare infringement on “quotations”

    1. Paul Kinsler

      Re: Copyright infringement is very tenuous

      So, if I were to repeat myself from an earlier thread, perhaps the question is "what level of reproduction fidelity - and what reliability of obtaining such fidelity - would (or should, or might, ...) constitute legal grounds?"

      1. rerdavies

        Re: Copyright infringement is very tenuous

        Currently, the threshold gets determined on a case by case basis. But the general procedure is that you hire very expensive lawyers, who then stand toe-to-toe while burning hundred dollar bills until one of them runs out.

    2. Anonymous Coward

      Re: Copyright infringement is very tenuous

      " these chat bots do not reproduce"

      They do, that's the problem. They do not generate a single word on their own, *everything* is copied from somewhere else. All of it.

      Taking 5 copyrighted texts and combining them into one doesn't make it suddenly non-copyrighted, because the *source* is copyrighted. You believe that adding that number to thousands will change something?

  14. Anonymous Coward

    Hugging Face Hub ?

    Was that deliberately named to make me think "alien" ?

    1. Michael Wojcik Silver badge

      Re: Hugging Face Hub ?

      The company initially produced a chatbot for teenagers. I expect the name was chosen to maximize twee.

  15. steelpillow Silver badge

    The dice-and-slice photocopier

    An avant-garde artist cultivates geek friends who work for all kinds of outfits. She puts a lot of effort into building a collection of their code printouts.

    She creates artworks by photocopying these, cutting them up and making montages, in such a way that she tries to make the code sequence in the montage executable.

    Her partner does the same with sheet music, and their daughter does it with lines of poetry.

    Their son gets an AI to do all three and mash them up in a multimedia experience.

    They decide to have a joint exhibition, including LAN fest, concert and recital, to launch their art on the world.

    Shortly after the opening, the IP lawyers descend on them like a ton of bricks.

    The code, music and poetry artists argue in court that the photocopier was to blame. They get shot down in flames.

    The son argues that the AI was to blame. [Choose your own end to the story]

    1. robinsonb5

      Re: The dice-and-slice photocopier

      While it's easy to fall into the trap of assuming US-style fair use rules apply everywhere, they are at least a reasonably well thought out set of rules-of-thumb for reasoning about such scenarios.

      In your artwork analogy the most interesting points to argue would be to what degree is the new work transformative, and to what degree it usurps the need for the original work. Artwork generally has an easier time clearing those two hurdles than something which aims to exist in the same space as the source material. (Furthermore, artwork which remains in the same space as the source material - and your examples seem carefully constructed so that this is the case - will have a harder time clearing those two hurdles.)

  16. Joe Burmeister

    It's not just AI where FOSS is fighting yesterday's battles.

    Everything is becoming smart, running closed systems, even if a lot of it is FOSS based. Modern cars are a horror show software freedom wise. But so are TVs and, increasingly, every device in the house. Lots of them are security holes in your network, waiting to become part of "the internet of infected things". Most of them are spying on you, harvesting what data they can. All built to be thrown away in a few years.

    They need to be engaging with regulators and governments. Hammering home Right To Repair, privacy, competition / anti-vendor-lockins.

    Just talking to programmers and shouting at companies, isn't enough.

    1. doublelayer Silver badge

      I'm not sure what you want them to do. Licenses can't force everyone to make only what we want them to, even though I agree with your preferences. For example, the security argument. Yes, a lot of devices are improperly secured and insufficiently supported, creating security problems. An open source license will have a hard time mandating support of a commercial product when they're explicitly refusing to support their product. Legislators can make requirements on commercial products that license authors cannot. If someone does make a license that tries, people will understand how weak that license will be and will ignore it, and if many people decide to do it, it will likely be replaced by a manufacturer consortium with something under an even more permissive license.

      That is, of course, if they don't just ignore it. A lot of licenses are being violated all the time with nobody doing anything to enforce them. If you put even more important things in a license, it will still not get enforced frequently, so that's another reason to try to get regulation from regulators who will have at least some budget for trying to regulate rather than hoping that somebody will eventually go after all the people who have GPL code somewhere and don't do anything to follow the license.

  17. Andrew Williams

    AI isn’t creating anything

    It’s carrying out systematic plagiarism.

  18. CowHorseFrog

    Here we go again, giving CEOs a platform like we should all be gracious that our grand overlord parasites are sharing their wisdom, because we all know how they never take credit for the hard work of others while never actually doing anything to contribute themselves.

  19. Henry Wertz 1 Gold badge


    This is nonsense. Open source writers do not have to rewrite licenses to accommodate AI systems memorizing their entire code and then typing it out for somebody, any more than a company could just hire people with photographic memory to read some code, write it down verbatim, then claim that this is new code and not subject to the license. Clean room implementation (i.e. one person writes a description of what the code does, and a second writes code based on this description)? This is allowed. Copying the code over? It's still subject to license whether they want it to be or not, and that is as it should be.

    And, to be clear, patent trolls are patent trolls -- companies that have not invented anything, just patents and lawyers, often abusing the patent system by extending their patent(s) so they can add things that are already being done by others but get it backdated so they can falsely claim they "invented" them first (.. I think in the US this was finally fixed, but the patent system did have that ridiculous bug where you could extend a patent for years WHILE adding things to it and everything in it would be backdated to the original filing date.) People producing code under an open source license enforcing their license against people who think they can incorporate their code into proprietary products without following the license? This is not trolling this is using the copyright system as intended.

    As for developing new licenses for AI data sets -- that does make sense. It's still tricky: if an AI will spit out verbatim blocks of code without following the license of that code, putting the entire AI data set as a whole under a license does not change the fact that the code it's spitting out is still subject to the original license, whether the AI company or the people using the code the AI spat out want it to be or not.

    (To be clear, I'm not hating on the article, Steven did a fine job writing up what's going on here. I'm objecting to the AI system vendors' arguments that they should be able to pirate open source code for proprietary products because an AI read the code off, say, github, then spat it back out.)

  20. Michael Wojcik Silver badge

    My what will what now?

    I guarantee that licensing trolls will come after "your" ChatGPT and Copilot code.

    I guarantee they won't, because my ChatGPT and Copilot code does not exist, and never will.

  21. FIA Silver badge

    So where will this code base come from? I doubt many existing open source codebases will re-licence?

    Also, why do we need a new open source licence now? Didn't we need one about 10-15 years ago, or is Amazon, Facebook, Google's GPL code use 'in the spirit' of the licence?

    Oh, no, it's not that, is it? It's the "I don't want to get sued, but I don't really want to have to program either" licence?

  22. Anonymous Coward

    "Today it must again transform to deal with AI models."

    No, it must not.

    Anyone claiming that is either stupid or a paid shill, as there's literally *no intelligence whatsoever* in any "AI" system existing today. No-one has any idea how actual artificial intelligence could exist, even in theory. A language model, any of them, is literally as dumb as a brick, a text mangler existing solely (in current form) to circumvent copyright, and the right way to fix that is *not* to change copyright to allow blatant stealing.

    Because 'unauthorized copying' becomes 'stealing' when you *sell* it as your own. As *every "AI" existing* does.
