> rarely – the complaint cites a study reporting 1 percent of the time
It's just the tip!
Attorneys representing Microsoft, its GitHub subsidiary, and OpenAI have asked a judge to throw out a copyright case against GitHub's programming assistant Copilot, on the grounds the challenge against them lacks standing. To have standing – to be allowed to make a complaint to a court – a plaintiff must have suffered a harm …
And they certainly did intend to facilitate copyright infringement of attribution and GPL licences when they fed all that code containing these licences into the AI mincer. I don't think "our AI didn't look at the licenses" is a precedent they want to end up setting, as it will make their own product licences invalid when used by an AI too.
... and I wonder how long before the AI generated code comes with a Microsoft copyright notice (I presume it doesn't at the moment) and some sort of checksum-hash that allows them to trace it back to their code generator?
"Any GitHub user thus appreciates that code placed in a public repository is genuinely public,"
Public doesn't mean "free use". Have they said anywhere that they restricted the training data to appropriately-licensed repos? This seems to imply they just scraped everything that was set to public visibility....
No it's not. It's merely published. On-line published content is still subject to the specified license. Otherwise the newspapers would have no case when news aggregators republish news articles without attribution.
Everything put on GitHub is cross-licensed: under whatever the user chooses, and to Microsoft. It has to be, or it can't be uploaded. Effectively, Microsoft finally found a way to destroy the free open source model, and millions of naive coders helped them do it.
I read complaints elsewhere from people who were able to get Copilot to generate their code despite it being from private repositories. But of course Microsoft will just say, "Ooops that was an accident" and keep going their merry way.
This whole scenario is absolutely screaming "thin edge of the wedge".
Public means public. You can't put the toothpaste back in the tube. Whether you know it or not, someone out there is probably profiting off your public code without your knowledge...most likely in China.
The only reason people are going after Github / Microshaft / OpenAI et al is because they are massive, really obvious, hard to miss targets with lots of money. The people after them are after the same thing they are after.
We need a Quark icon, because there is probably a rule of acquisition somewhere in this post.
Certainly on the criminal side, there's a right to face your accuser. How can you file any lawsuit without publicly stating who the plaintiff is?
The rest, however, seems quite legitimate. Microsoft copied a HUGE amount of copyrighted works without checking the licenses to make sure that was legal, and their software is specifically designed to disseminate that copied work (typically in very small pieces, but that doesn't matter legally) without attribution. If someone did that to Microsoft's software, MS would be all over them with lawsuits. Pot, kettle?
There's no such thing as "AI" in this process. That's just advertising propaganda. What there is are human-created algorithms that are taking code and then ignoring and breaking the terms of their licence to redistribute it for profit. If the output of this Microsoft operation is not derivative code, then words no longer have meaning.
If the answer you get back is a copy of something fed to the learning algorithm then it could be a violation of the license terms, depending on what those terms are. But if it is a solution based on the aggregate learning of ingesting multiple similar solutions then I don't see the problem - but then again I'm not a copyright lawyer.
How is this different from what every developer does in the age of the Internet: you get assigned to create a new function, you don't have it in your pre-existing library, you search the Internet for examples and then write something similar using what you find as a guide. Of course you can't copy/paste the found code into a commercial application - unless the license says you can.
I don't know what "every developer" does, but what I do is put in a comment with the URL where I found the code suggestion. Of course my code is always heavily refactored, because code from the internet is crap. Even when it comes from a good developer, at a minimum it's missing error handling. But even though the code is now "mine" (whatever that means), I still put the URL, sometimes more than one URL. Later anybody (it may even be me) who wants to compare my code with the internet code should be able to find it easily.
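For illustration, here is a minimal sketch of what that kind of attribution comment might look like. The URLs and the function itself are hypothetical, made up for this example and not taken from any real project; the added error handling just reflects the point above about internet snippets usually lacking it.

```python
# Adapted from: https://example.com/snippets/parse-key-value  (hypothetical URL)
# See also:     https://example.com/answers/12345             (hypothetical URL)
# Changes from the found code: refactored, and malformed lines now raise
# an error instead of being silently ignored.
def parse_key_value(text: str) -> dict:
    """Parse 'key=value' lines into a dict, skipping blank and '#' lines."""
    result = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # nothing to parse on blank or comment lines
        if "=" not in stripped:
            raise ValueError(
                f"line {line_number}: expected 'key=value', got {stripped!r}"
            )
        key, value = stripped.split("=", 1)  # split on the first '=' only
        result[key.strip()] = value.strip()
    return result
```

Anyone comparing this with the original later only has to follow the URLs in the header comment.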
> If the answer you get back is a copy of something fed to the learning algorithm then it could be a violation of the license terms, depending on what those terms are. But if it is a solution based on the aggregate learning of ingesting multiple similar solutions then I don't see the problem - but then again I'm not a copyright lawyer.
That's why the complaint alleges this copying happens in only a subset of uses. It's precisely what they're complaining about: the cases where the output infringes on license terms.
As someone noted in another forum post a while back, this sort of thing is sometimes called "node collapse", where the model overfits for a particular case and returns training data verbatim. I think there's a very real legal issue here, and Microsoft et al are trying to weasel out of it. (The defendants arguing standing is often, though not always, a gloss for "yeah, we probably hurt someone, but you can't prove it was you!". It's a bullying move.)
I'd like to see the defense lose and lose hard here. I do not like GitHub anyway, but this was so patently an abuse of the system that they need to get penalized enough that they think twice before trying this sort of shit again. And I hope a good number of FOSS projects learn an important lesson about GitHub in the process.
I'm losing the plot over all this.
As a programmer, I don't see myself as some sort of deity challenging the boundaries of technology. I see myself as someone with a box of Lego bricks and all I'm doing is arranging them in a way that resembles a product for someone and I acknowledge that someone else could arrange bricks in the same way.
What sets me apart from other developers isn't the way I arrange the bricks, it's my understanding of the bricks in the box.
If more people over time arrange bricks the way I arrange them, because more people have been exposed to the way I have arranged bricks, that's not an infringement of copyright...that's progress.
On the other hand...
HANDS OFF THE PRECIOUS!!! *GOLLUM GOLLUM* THE PRECIOUS IS MINE!!