Just one question
Why would you publish copyrighted source code on a platform you do not control?
Github is, AFAIK, open source. If your code is copyrighted, why is it available there ?
GitHub Copilot – a programming auto-suggestion tool trained from public source code on the internet – has been caught generating what appears to be copyrighted code, prompting an attorney to look into a possible copyright infringement claim. On Monday, Matthew Butterick, a lawyer, designer, and developer, announced he is …
ALL code is copyrighted. Some jurisdictions even treat a renunciation of copyright as null and void (so technically, the only truly public-domain code would be code written 70 years before the death of its author, i.e. none of it, and the rest is not free to use at all).
The difference between open source and closed source is how they leverage copyright law to their goals.
Closed source licenses will use copyright law to make sure you can't share, modify or reuse their code.
Open source licenses will use copyright law to make sure you CAN share, modify or reuse their code on their conditions.
Where this crap AI falls foul is that it might share, modify and reuse third-party code without granting whatever rights or obligations the original license "gave" to the training set. For starters, most (if not all) open source licenses require that a copy of the license itself be given along with the source code, whether the whole work or just a part is at issue.
For MIT-like licenses, not retaining authorship notices is a copyright license violation. For GPL-like licenses it is even worse, as none of the GPL-granted rights would be passed on downstream, which is by itself a violation.
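For illustration, a minimal sketch (hypothetical module, author, and abbreviated licence text) of what "retaining the notice" means in practice. MIT-style licences require the comment block to travel with all copies or substantial portions of the code:

```python
# Hypothetical example: an MIT-licensed snippet with the notice that must
# be retained in "all copies or substantial portions" of the software.
#
# Copyright (c) 2022 Jane Example
#
# Permission is hereby granted, free of charge, to any person obtaining
# a copy of this software... (the full MIT text would continue here)
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND.

def clamp(value, lo, hi):
    """Clamp value to the inclusive range [lo, hi]."""
    return max(lo, min(hi, value))
```

Emitting `clamp` while silently dropping the header above is precisely the violation being described.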
Totally right. MS are trying to argue that because the code is publicly available, then copyright licenses no longer apply, even though most FOSS code is released under a license. So they say they're free to ignore the license.
If that's true then closed source licenses can also be ignored, and any machine code that's publicly available can also be ingested by an AI and spat out into an AI generated executable which anyone else can use for free. So I could make an "AI copilot" myself that gets an executable from https://www.microsoft.com/software-download/windows11 and use that to add a bit more code in, then distribute it however I want. After all, who cares about their copyright and license conditions, it's public availability that counts right?
They're playing a bit too fast and loose with other people's work here, they should be wary that it doesn't set any precedents that may come back to bite them.
Re: "Would that be fair use? It's not clear." It's very clear: it's written in the license that, by law, should be published alongside my code.
Anyone who really thinks publishing = giving away free has never read a book or watched a dvd.
This is "git"hub, Linus should step up and claim copyright infringement of git itself.
It's not been tested, but it would be interesting to see what punishment a judge could deal out for blatant copyright infringement on such a massive scale. They have breached a million licences in one fell swoop.
Microsoft have illegally copied my software, I want to see them punished.
I moved most code off to GitLab as soon as Microsoft took the reins. It was clear they would fsck up GitHub the moment it was announced.
Copyright means very little to most people, even honest people who write code daily, wouldn't steal copyrighted code, and would be offended by the *idea* that they might. When push comes to shove, they'll copy your IP, reformat it a bit, and call it an original work.
Don't publish your source unless you're OK with people copying chunks of it. (And then those same people acting like they wrote it, or maybe demanding that you modify it to suit their taste, or being angry because it doesn't work, or being angry because it works but they're too stupid to understand it, or..)
This post has been deleted by its author
It's not just the copyright aspects of GitHub copilot that worry me.
The tool is 100% cloud based; there is absolutely no offline version, and it brings all our coding activity online where it can be surveilled. If Copilot becomes widespread in the industry then not using it will be a productivity disadvantage compared to those who do, so we will end up having to use it. This is a *major* privacy problem; it is like a nightmare. And it means our computers are being turned into what are effectively terminals to a cloud-based AI "mainframe".
Just chilling, absolutely totally chilling. It really gives me the creeps, because of the potential for a slippery slope here. It is one of the most unnerving, scary things I have read in a long time. I hope someone develops a decent open-source offline equivalent that doesn't have to send every keystroke to the cloud.
It's only a matter of time before government, corporations, police, etc. get their hands on the data stream to analyse, "for our protection". Take my word for it, they will find a suitable excuse one way or another, it's only a matter of time. And they will first target those writing "controversial" software, maybe cryptocurrency tumblers, for example. Just wait 10 years from now and see what happens.
This might be one of the first steps towards making everything in computing cloud-centric. One by one by one, boiling the frog, over a period of roughly 20 years. So the ultimate consequence of that would be our computers would *require* an Internet connection to function, i.e. they would be just dumb terminals. And law enforcement will be trawling through our files, looking for things that offend whatever moral sensibilities are in fashion at the time. And keeping people "safe" will be the excuse for doing that. Of course, it's an excuse, one for exercising power over the general public.
Hopefully people will push back against such moves. We need to resist each and every small step, starting from the beginning. Before we end up losing all our freedom in the decades to come.
Boycotting GitHub Copilot for this reason could be even more important than the copyright issues.
Thus to prevent a general shift in power from individuals, towards those who operate cloud computing infrastructure, we must refuse to use such cloud-centric tools. And work together to create offline alternatives.
N.B. It's piss easy to migrate code off GitHub.
To set up a git server is literally
ssh myserver
git init --bare myrepo.git
(then on your own machine: git remote add origin myserver:myrepo.git)
If anyone wants community tools and pull requests go to gitlab.com.
N.B. you can install GitLab locally too.
If you have deeply embedded your whole private company, and all your private code, in Microsoft's GitHub... get out ASAP.
GitHub only makes sense because it's the biggest community; hosting _private_ code there since the Microsoft sell-out is just plain stupid IMHO.
You might think nobody got fired for choosing github, but check yourself, lots of people have been fired for accidentally exposing secrets in github repos because devs were too lazy to run
git init
"training ML systems on public data is fair use [and] the output belongs to the operator, just like with a compiler..."
By that standard you could take everything written in books and use the text to train an ML system. Then when you use the system and it "writes" "Henry Flotter and the Magical Wanderer's Gem", we'll see how long the fair use defence stands.
This is very much the issue and it's not nearly as clear-cut as your example makes out.
Most people learn by looking at what others have done, then build something themselves based on what they've learned. Whether that is a copyright violation or not depends on just how close it is to what they've seen from others.
One way of looking at Copilot is that it's a tool to make that process of looking at other people's work and using what you learn a lot more efficient. But it lacks any sort of "hang on, that's too similar to what we've seen elsewhere" filter and also hides the source of the material from the human who is using the tool, so they have no reasonable basis to assess whether the code it's just produced is a copyright violation or not.
Copilot, as an AI, is not a legal person who can be sued for the copyright violation and, naturally, the Copilot terms of use make the end user completely responsible for assessing whether the output is a copyright violation or not.
The only sane course from here is to avoid Copilot like the plague.
If it is open source you CAN copy verbatim, but you must attribute and follow the licensing requirements.
In the particular case of the pinched "sparse matrix transposition" code used as a specific example to make a legal claim, I'm inclined to believe that the designers of Co-pilot could have easily designed the system to also emit attributions for strongly related source code. The fact that they did not was a conscious decision to deny due credit and steal it for themselves, using as an alibi the faulty logic that Co-pilot is an intelligent entity, equivalent to a responsible human, that abstracted from training data and re-emitted it as original code.
" I'm inclined to believe that the designers of Co-pilot could have easily designed the system to also emit attributions for strongly related source code. "
From my understanding of how ML works, that actually sounds like a very hard problem. These algorithms typically offer very few clues for why they chose a particular solution.
Even if possible, I suspect the result would be a warning that this code may fall under any of half a dozen popular FOSS licences and dozens of other personal copyrights.
Both issues would be much more simply dealt with by always warning that "this code probably breaches someone's copyright and you are on the hook if you choose to use it". Funnily enough, that's what MS seem to have done, except they've buried it in their T&Cs rather than reminding users on each occasion.
The "why" is almost entirely irrelevant to this discussion though - the "what" and the obligations of the license set by the copyright holder are. If the ML model is horfing up a chunk of code verbatim, it ought to be pretty clear where it came from, and to maintain a link to the source of that chunk so as to properly track *and return to the party using the system* what copyright attribution is appropriate.
That *sounds* like a perfectly reasonable concept - unfortunately, the state within the ML model needn't be laid out with clearly obvious chunks of plain text that it is horfing up, instead it is spread around in a (to the eye) random pile of weightings and triggers from one layer to the next. Instead of just copying "lines 2081 to 3033 from file fred.cpp" it is going through a weird process that re-assembles that source (or something that is so like it as to be considered the same - e.g. just changing variable names).
A (not great) analogy would be a compressed text file. To the eye, you have a jumbled mess: a dictionary of fiddly bits of what is clearly text but nonsensical, and a pile of arbitrary numbers. But run the decompression process and it pulls one tiny bit from this location in the dictionary, then a couple of copies of that bit, one after the other, until, ta-da, you have a sizeable chunk of readable text. Ah ha, you say, but the "pile of arbitrary numbers" is clearly in order, and you can mark which ones refer to what part of the final output. True. Sadly, the related information in the ML model is spread randomly all over the place and, unlike with the compressor, nothing in the "learning" process knew (or needed to know) where any of it is.
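To make that (not great) analogy concrete, a rough sketch using Python's standard zlib: the compressed stream is an opaque jumble to the eye, yet it deterministically reconstructs the original byte for byte, which is exactly the provenance-preserving property an ML model's weightings lack (the byte string below is a made-up stand-in for some source code):

```python
import zlib

# A stand-in for a chunk of licensed source code (repeated so it compresses).
text = b"int transpose(int *m, int rows, int cols) { ... }" * 4

packed = zlib.compress(text)

# The compressed stream looks like an arbitrary pile of bytes...
assert packed != text
assert len(packed) < len(text)

# ...but unlike a trained model's weights, it retains enough structure to
# rebuild the original exactly, so the mapping back to the source is never lost.
assert zlib.decompress(packed) == text
```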
Trying to make ML models explain what they are doing is a topic of ongoing research. Not that the people chucking ML at everything and hoping some of it sticks care about that, or are putting their money into it.
> From my understanding of how ML works, that actually sounds like a very hard problem. These algorithms typically offer very few clues for why they chose a particular solution.
It's certainly difficult for transformer-based LLMs, or other stacked CNN architectures. There's a fair bit of research into explicable/interpretable machine learning, but it requires quite different architectures. While some attempts have been made to create explanation mechanisms for some aspects of machine learning (such as binary classifiers), there are strong arguments that this isn't going to be a useful approach for large models built using deep ANN stacks, and that we need to create interpretable models from the start.
There is already case law establishing that an AI cannot be an "inventor" with respect to patents, and that the "operator"/provider of the AI's training and output is. This presumes that the liabilities also follow the same path, which leads to Microsoft. Furthermore, the damaged party, the originator of the training data, may not even be a party to any agreements with GitHub or Microsoft.
Given that it's trained on GitHub public repositories, this language in the ToS looks relevant:
You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users
If Copilot is part of the GitHub service, then anyone who posted their code on GitHub has implicitly licensed their code for this purpose.
Note that, in this agreement, the "Service" is defined as "the applications, software, products, and services provided by GitHub, including any Beta Previews."
Correct. You can distribute your work as you see fit and Microsoft can distribute your work as they see fit. GitHub automatically puts everything under a dual license. Github is a bank that you mortgage your code with. This is not new... there is no shock... this is how it has always been.
This story prompts one primary question: *why* do I consider giving my code and insights to other humans (via GitHub or StackExchange or the like) to be alright if the thought of Microsoft-controlled Co-Pilot exploiting the same is anathema?
I think this is a story about solidarity and, quite simply, I don't hold any solidarity for corporations. For other coders, I can at least try to believe that I'm helping out a human being who may very well be living a similar life to my own – past and present. Their high-functioning thoughts may very well be being exploited and, frankly, any little helps, right?
Had Microsoft said to the open-source world that their A.I. trained on open source *was* itself also open source and, additionally, free to use for free-as-in-freedom work – and also not usable for proprietary work, from which its training data would likewise be precluded – I expect that the revulsion from the world of coding would be very much different.
Do not break the picket line! Solidarity! No open-source code for corporate parasites!
I give my code and insights away to other humans on the understanding that certain rules will be followed, as specified in the chosen license.
* If another human is able to make use of it, and respects the terms of the license, I'm happy.
* If an AI system is able to make use of it, and respects the terms of the license, I'm happy.
* If a human does the Finding Nemo Seagull thing with my code and uses it without respecting the license terms, I'm annoyed. (It's happened several times.)
* If an AI system hoovers up and absorbs my code like some kind of ethereal Katamari Damacy, and then horks up chunks of it at random with no way for the receiving human to know it was mine or what the licence terms are, I'm going to be even more annoyed.
PAAAAAhahaha. You posted it on GitHub! You realise that involved granting Microsoft a license, right? Specifically this, in the GitHub terms of service:
4. License Grant to Us
We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.
RE "4. License Grant to Us"
Call out the lawyers then, as to ...
* whether CoPilot is reasonably covered under the "We need", for xyz purpose, language.
* what the word "incidental" refers to, e.g., maybe CoPilot, maybe not.
Most people using a service like GitHub would expect "parse it into a search index" would still return the results as a link to the original repository. Again, something for lawyers to argue over.
The interesting thing about CoPilot is that it's a two-way street. You search for code by typing, and equally you return all your code to CoPilot by typing. So not only are you using someone else's software without regard to copyright, but it makes no sense to copyright your own software because Microsoft already "published" it first in CoPilot. The extinguish phase is well under way.
Well put.
There is a big difference between a) putting my effort into an application for my community, or assisting my fellow worker, and b) working for free to create a paid-for service that will assist capital in profiting off of replacing labour.
No way. That's abhorrent.
Actually, many proprietary software companies were already prohibiting their developers from using resources like stackoverflow toward the end of last decade precisely because they feared snippets of code from there might cross-contaminate and taint the licensing of their own products. Hence, I don't see why these same companies would not similarly fear co-pilot, which does so with even greater ease.
Were I to put my code into the public domain, which has been known to happen, it would be for one of two reasons.
1. I think someone might be able to learn something from it.
2. I think that someone might be able to make a useful criticism of it.
There is no number 3. "so that someone can just use it as-is without understanding how it works."
It seems to me that the non-existent reason number 3 is exactly what Copilot promotes. Some might describe it as theft.
-A.
If it was only using "public domain" code in the training set, there wouldn't be a problem.
The trouble is that (a) there's no such thing, all code ever written is copyright to someone, and (b) they hoovered up GPL, AGPL, BSD etc licenced works, but do not provide any possible way for the copilot customer to comply with the licence obligations when sections of those are vomited forth.
There is a very small amount of code that is licenced as DWTFYW or similar, but they wanted more.
Are you sure you understand what "public domain" means?
If you put anything into PD then you are explicitly telling everyone that they "can just use it as-is without understanding how it works."
There is no such thing as "theft of something in the public domain" pretty much by definition.
If all the material that Copilot had been trained on was public domain then there wouldn't be any issues with it (well, States where PD doesn't exist may be in a state of confusion over it, but in practical terms no-one would be trying to sue Github over it).
Huh? The Crown Jewels aren't public domain in any way, shape or form! What are you prattling on about?
Hang on, you think that "public domain" means "is displayed for the public to view"? Or even vice versa? No.
Try a bit of simple research, such as: https://fairuse.stanford.edu/overview/public-domain/welcome/ or https://www.bl.uk/business-and-ip-centre/articles/what-is-copyright
The author of said code agreed to the GitHub terms of service, which includes a license for Microsoft to use your code for essentially any purpose "as necessary to provide the Service" (quote from the ToS). Here 'The “Service” refers to the applications, software, products, and services provided by GitHub, including any Beta Previews.'
Do you have an alternative interpretation? Those words seem pretty clear to me. Anyone who posts code on GitHub licenses it to GitHub for the purpose of providing any service that GitHub provides - including Copilot.
That doesn't extend to the people who use Copilot of course - they're just SOL. But Microsoft is covered for their use in training Copilot.
"The" service - running a git server, issues tracking and email redirect. And buildbots if explicitly opted in.
Not "Any" service.
It's not a free pass to do whatever the **** they want.
Code that was published under a licence with an attribution clause requires attribution. Copilot does not provide that attribution, and therefore has breached the licence.
You haven't actually read the terms of service, have you?
If you had, you would have spotted this in the definitions section:
The "Service" refers to the applications, software, products, and services provided by GitHub, including any Beta Previews
Licenses mean what they say they mean, not what you'd like them to mean.
> When coding for a certain task, how many ways are there to display "Hello world"?
Did that certain task involve displaying the words "Hello world"? (With or without comma after 'Hello', according to taste.) On a limited sample of those tasks I've been assigned over the years, I'd say none at all.
-A.
> When coding for a certain task, how many ways are there to display "Hello world"?
Theoretically it's unbounded – countably infinite [1]. In practice, it's limited by the size of the machine.
You can always take a working program and produce a longer equivalent program.
Now, if you'd specified sensible ways to display "Hello world"...
[1] With a quantum computer, uncountably infinite, I think, because then you're dealing with a probabilistic TM and that gives you access to the continuum. Haven't tried to prove that, though.
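The "longer equivalent program" point is easy to demonstrate; a throwaway Python sketch (the helper name is mine) that mechanically pads a one-liner with no-ops, yielding a distinct but equivalent program at every length:

```python
import contextlib
import io

# Generate a family of textually distinct programs that all display
# "Hello world": each variant adds one more no-op statement than the last.
def hello_variant(n):
    return "pass\n" * n + 'print("Hello world")\n'

programs = [hello_variant(n) for n in range(5)]
assert len(set(programs)) == 5  # five different source texts...

# ...yet every one of them produces exactly the same output when run.
for src in programs:
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(src)
    assert buf.getvalue() == "Hello world\n"
```

Nothing stops `n` growing without bound, hence countably many equivalents (up to the machine's memory).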
If we get to the point where copyright is granular enough that a single line of code is copyrightable (eg 'cout << "Hello World";'), we are just as bad off as we would have been if Oracle had won the Java suit....
At least for the US the legal purpose of copyright is to "promote the progress of science and the useful arts" - which allowing the copyright of complete software works (libraries, programs, etc) does, but allowing the copyright of individual lines of code does not.
The question is where in between the line gets drawn, and that ends up being answered by lawyers & judges at the end of the day....
who has deliberately avoided looking at GPL'ed code because I did not want to risk contamination of what I create for my employer?
To me, this has been a lurking issue with GPL since it came out. (Yeah, I've got quite a bit of gray on me.)
As for SO, my personal experience (when I was required to get into .Net) was that the user selected & top four vote getters were clearly broken. Somewhere in the lower half of the top ten would be a suggestion that had the kernel of a useful idea. No worries about legal risk coming from that. Brain bleach recommended, however.
That is a reasonable approach to GPL code - assuming by "looking" you meant "reading" with a side order of "hard enough to understand it". After all, that is the reasoning behind the old BIOS creation efforts putting up Chinese Walls between documenters and coders, for example.
BUT be warned that you must also take that approach to any code that has an attribution clause in its licence (just in case you accidentally copy it and forget to include attribution) and definitely never sneak a peek at any third-party proprietary code you may have access to (e.g. you have a legit licence to use).
In other words, don't just use that as a way to denigrate GPL - if you need to take those precautions then do so properly and be wary of other licenced material as well.
(OTOH, that was a pretty good assessment of SO!)
The only proprietary code that I know of that might qualify was a BIOS. This was in the time frame that I had realized that I was a hacker, not a programmer. I definitely needed brain bleach after reading it. It was embarrassing to read. I can only imagine how I would feel reading that same code after 25 years as a programmer.
My fear is that I might learn some useful technique which is considered specific to the licensed code. My comprehension and memory are such that it might be an issue.
So you are going to repeat errors fixed by past programmers.
Most businesses repeat past mistakes by not learning from their history.
We stand on the shoulders of giants.
IBM published its BIOS code and all the circuit diagrams
so people could copy without copying.
This created the clone PC industry, hoo fucking raa.
I got a Chinese clone PC with its own BIOS and spare EPROMs with the IBM BIOS.
Nope.
There is an unfortunate lack these days, but a compiler's licence ought to (and in days of yore most certainly did) contain wording stating that the output retains whatever licensing was applicable to the input source. Although, by now, there may be a case for arguing that "it is accepted that the output of a compiler..." [1]
But even then, just because you've compiled some source doesn't mean you have carte blanche over the result - try freely handing out a newly compiled exe you just made from some proprietary sources you happen to have, or statically linking your company's code to a library under the (standard) LGPL.
Just because you've passed all the licenced code through an ML meat grinder doesn't mean all those licences get stripped away, especially if the grinder is spitting out whole organs untouched, except for squeezing out the blood (aka attributions).
[1] However, I'm willing to bet that there are still compilers whose sellers try to impose more restrictions, such as "you cannot give this compiled G-code to anyone else to run on their CNC-whatever, they've got to buy the compiler from us as well".
Granularity is probably everything in this. There's some trivial level of plagiarism that is acceptable. Whole files? No, clearly not. A single for loop line? By itself, probably yes (given how common for loop lines are). Somewhere in between? Grey area. Contents of the entire for loop? It depends, but someone's very good algorithm might be that single for loop content or they might just be counting spaces in a string.
All in all, it comes down to the fact that text files are pretty rubbish as a way for people to express what is important and what is not. People can't reliably work it out, which is why court cases happen (Google and Oracle), so ML won't be able to get it right either.
Compilers are dying, and language servers are rising in their place, and it is becoming increasingly dumb to store language server state in source text files (from which state has to be rebuilt every time). These are good reasons for there to be a file format that formally includes metadata about the source: copyright-critical sections, language server state, modification record (which, combined with language server data, could do a lot to make things like merges less risky), and so on.
I doubt that such a thing would happen because no one will agree on what that should be, and most wouldn't even recognise the problem in the first place.
> Compilers are dying, and language servers are rising in their place
Hmm, what? I don't see any signs that compilers are going to go away now or at any time in the foreseeable future.
> a file format that formally includes metadata about the source, like copyright critical sections, language server state
Trying to understand what this is supposed to mean, a web search for "language server state" throws up things like:
"LSIF is basically a way to persist language server state" from https://news.ycombinator.com/item?id=22446984
or
"However, if you have a simple use case where embedded content can be easily handled without context or language server state," from https://code.visualstudio.com/api/language-extensions/embedded-languages
Both of these - and all the similar hits - are just talking about using a "Language Server Protocol" server to provide super-duper features to source code editors. LSP servers contain lots of language-aware features (such as, ooh, the front ends from compilers) but are otherwise unrelated to running a good old compiler to generate your exes.
> a file format that formally includes metadata about the source
Like, in the source code file itself? Such as annotations within comments, laid out in, ooh, another formally defined language suited to the purpose - you could define one using XML; either way around - have the XML as a comment in the source, or have the source as one or more text blocks within the XML elements. That last could even lead to going the full Literate Programming route (not necessarily using Knuth's web/weave tools, as they don't look for the metadata you want to enforce).
Notice how the above formats are also - text! Text files *are* a good way to keep lots of the metadata about your source and its attributions, if only so that you can actually pass it on to the legal team. You can even keep the digital signature as text in the same field, to prove it hasn't been modified. About the only thing that I'd not automatically propose being text in the source file is the version control history (although VCS using simple text files is quite doable and has advantages, although speed isn't one of them).
What I see people most concerned about is that Copilot will generate code and copyright it all, even if it was based on open-sourced code. Let's take this to a bit more of an extreme. Suppose I want to write novels, and I publish one that I made with only a little bit of inspiration from others. Is it really fair for someone else to sue me because their book is close enough to mine? Does this mean that in order to write a book I have to check every book ever written to find out if I'm infringing on someone's 'intellectual property'? Obviously this is insane, and it also means that as a writer you would have to write books that fall just far enough from every copyrighted book. I'm wondering why the same isn't thought of software.

What if I copyrighted a basic idea like a linked list? Even if someone else thinks of the idea without ever hearing about it, is it really a good system to let them take my code down and force me to use a different data structure? I believe Adobe has patented some of the equations that their photo-editing software uses, which is part of why people haven't been able to make a good Adobe clone. Is it actually just to lock up MATH? What if I took away your 'rights' to the quadratic equation? It would be pretty horrible if I were to stop all advancement in that field of math because I declared some ideas to be exclusively mine.
This all seems to be a result of the ridiculous idea that the first person to put a license on their idea is the only person who is allowed to have it. If copyright was enforced to a greater degree then our world would be a bland, bland place. You can't seriously tell me that one murder mystery author hasn't had the exact same idea as another one.
Anyways, it's not like people will stop writing books, music, mathematical equations, and code if intellectual property didn't exist. There have been organizations writing free software for a while now, and a lot of you have whole operating systems that are built off of it, like I do. People will still pay writers if they want books and stories, pay musicians to write for them, pay them to perform, and people have been doing math for thousands of years. The world will be just fine, nay, better without constraints on the very arts that copyright law allegedly supports.
There have been many copyright cases in courts around the world about plagiarism, where the arguments were that two books/pieces of music/paintings/photos had "significant similarities".
So yes, plenty of settled case law in whichever jurisdiction you choose to name.
Claiming Fair Use outside of a courtroom is an admission of infringement. Fair Use does not apply when there is no infringement.
Fair Use is an affirmative defense: infringement is first established, and only then may a court apply the clearly delineated exceptions it is permitted to consider in decreeing fair use.
Public statements asserting fair use are baseless. The test is always in court. Software producers who make such claims need to pay better attention. Organizations that make such claims need better attorneys.
An informative example is Oracle's lawsuit against Google over the Java APIs adopted for Android without a license. Google offered a fair use defense that initially failed at the Federal Circuit (the Supreme Court later reversed, finding fair use), but either way, that determination followed *after* the determination of infringement.
There is a fair-use element that might apply in the Copilot case, although it being mechanical is going to create a difficult situation for courts, it seems to me. I am personally skeptical that a copyright violation suit can get very far, except perhaps around the willful use of covered works in the "training" of such mechanisms.
The Google situation with excerpts of works comes to mind. Since GitHub has explicit arrangements for announcements of copyright status and licenses on GitHub projects, it's odd to assert such a defense absent being taken to court. On the other hand, it is unclear who the plaintiff(s) would be in such an action.
These are the same problem.
To solve this would require a lot of work in the initial setup of the training data.
You will not be able to solve this with training data that has not been completely vetted and cleansed.
The training data needs to be correct, complete, and fully attributed.
Unfortunately, that is not something effectively done in today's mad scramble to create cool toys.
Code sample x made by author y does z using language w
Code Sample:
Code:
---
x
some piece of code
---
z
What this code does:
---
Some description of what this code does,
description of inputs,
description of outputs,
description of purpose
---
Language:
---
w
What language is this in?
---
Author:
---
y
Author of this code, Available licenses for this code.
---
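The attributed-sample template above could be sketched as a simple record type; the field names and the example entry here are purely illustrative assumptions, not any real corpus format:

```python
from dataclasses import dataclass, field

@dataclass
class CodeSample:
    """One fully vetted, fully attributed training sample.

    Fields map to the template: x = code, z = description,
    w = language, y = author. All names are illustrative.
    """
    code: str                  # x: the code itself
    description: str           # z: what it does, inputs, outputs, purpose
    language: str              # w: the programming language
    author: str                # y: who wrote it
    licenses: list[str] = field(default_factory=list)  # available licenses

# A hypothetical entry in the vetted corpus:
sample = CodeSample(
    code="def add(a, b):\n    return a + b",
    description="Adds two numbers and returns the sum.",
    language="python",
    author="jane.doe",
    licenses=["MIT"],
)
```

A corpus built from records like this keeps authorship and license information attached to every sample from the start, rather than trying to recover it after the fact.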
For creating the code generator you need to generate suggested x given z and w
Given:
---
Description of what this code needs to do
---
z
What inputs are available?
What outputs are required?
How should it behave?
---
Language:
---
What Language are we generating in?
---
This produces generated code.
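Assembling the generation request from z (the description) and w (the language) could be sketched as follows; the prompt format is an assumption for illustration, not any real model's API:

```python
def build_generation_prompt(description: str, language: str) -> str:
    """Combine a task description (z) and a target language (w)
    into a single generation request. Format is illustrative only."""
    return (
        f"Language: {language}\n"
        f"Task: {description}\n"
        "Code:\n"
    )

# A hypothetical request for the generator:
prompt = build_generation_prompt(
    "Read a CSV file and print the number of rows.",
    "python",
)
```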
Questions about the code it produces are:
Which code samples from the training data most likely contributed to the output?
Who are the authors, and what are the licensing terms, of the code samples that most likely contributed to the output?
Generating code samples is something we seem to know a bit about how to do.
Identifying which code samples most likely contributed to the output is a second challenge that could be met by further research. This would likely require extensive supervised training: sample training data --> generated sample output, which human evaluators could then compare to the training data to identify the likely contributors. Enough of this could train a model to identify likely contributing code samples given generated code and a training corpus, just as models can be developed for attribution purposes, like identifying whether some piece of prose might have been written by Shakespeare.
This can then be linked to the author and license information linked to the training code samples.
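A crude baseline for that contributor-identification step, assuming a corpus whose sample IDs carry the author and license metadata, is plain token-set similarity between the generated output and each training sample; a real system would need the learned attribution models described above:

```python
import re

def token_set(code: str) -> set[str]:
    """Split code into a set of identifier and number tokens."""
    return set(re.findall(r"[A-Za-z_][A-Za-z_0-9]*|\d+", code))

def likely_contributors(generated: str, corpus: dict[str, str], top_n: int = 3):
    """Rank corpus samples by Jaccard similarity to the generated code.

    `corpus` maps a sample id (here carrying author/license info)
    to that sample's source code.
    """
    gen_tokens = token_set(generated)
    scores = []
    for sample_id, code in corpus.items():
        toks = token_set(code)
        union = gen_tokens | toks
        score = len(gen_tokens & toks) / len(union) if union else 0.0
        scores.append((score, sample_id))
    scores.sort(reverse=True)
    return scores[:top_n]

# Hypothetical two-sample corpus; ids embed author and license.
corpus = {
    "alice/utils#MIT": "def add(a, b):\n    return a + b",
    "bob/mathlib#GPL-3.0": "def multiply(x, y):\n    return x * y",
}
ranked = likely_contributors("def add(a, b):\n    return a + b", corpus)
```

Verbatim reuse scores 1.0 here, so the top-ranked ids point straight at the authors and licenses whose terms would need to be honored.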
This is vastly more work than just training a model on any Tom, Dick, and Harry code set scraped from an essentially random source.
There is no easy shortcut, unless you are happy to live with the untoward consequences of not caring about attribution at all.
Microsoft has had a lot of source code leak over the years; how would they react if someone trained an AI to spit up hunks of their source code, using the leaks as part of the training dataset? On the other hand, it is mostly old Microsoft OS source code, so it will be riddled with critical security bugs and would need to be sanitised and purified first with something like Coverity Scan, Cppcheck, Flawfinder, Clang Static Analyzer, .... Actually, the more I think about it, the more I think it would just be a really bad idea; the only way to avoid critical security bugs dripping out of that old, nearly mummified M$ code would be to just "Nuke It From Orbit".
(* For anyone whose English is not their native tongue: "what's good for the goose is good for the gander" is an idiom for "If something is good for one person, it should be equally good for another; someone who treats another in a certain way should not complain if the same is done to them." I mention it only because idioms do not easily translate. )
Is every program you write a 'derivative work' of the OSS code you looked at while learning to write your own? Does a professor using an excerpt from the Linux kernel in a lecture cause all students' work to be derivations of Linux and bound by the GPL?
Even if all the code does is output 'Hello World'....
Because that is the logical 'end' of the argument being made here against CoPilot..
One would think the OSS world would not want to go 'there', and would want to stick with a more logical derived-work definition, wherein a derivative work has to be substantially similar to the original (e.g., the macOS Mach kernel is a derivative work of the BSD kernel, but a 'hello_world.c' program cannot be a derivative of Linux even if the author, or authoring program, examined Linux code in the process of figuring out how to write C that will actually compile)....
At least for the US, there is also Google v. Oracle to consider (and the anti-Copilot position here is remarkably similar to Oracle's losing argument over being able to copyright and license the Java APIs).
I know readers here don't like MS, but this really is not the MS spirit you all so dislike.
So to review a few scattered points and issues:
- In general it's considered fair use to train machine learning models on publicly available data.
Models do not typically memorize; although they can memorize a thing here and there, you're going for generalization.
Yes, you will sometimes get memorization of pieces of data that are incredibly common in the dataset.
If people made thousands of pictures of fanart of some popular character, it will learn to draw that character well.
If most people write some few liner utilities the same way, your code model will do it in the same way.
Google won a landmark case some years ago (the Google Books case) which concluded that even mass-scanning and indexing copyrighted books was fair use!
- For image generation models, the current legal consensus appears to be that the output is not copyrightable as the author is not a human!
If this consensus holds, it will make the legal status of generated code perhaps not one that companies like MS would like.
In fact, GitHub hosts plenty of their own leaked code, re-uploaded by users who don't really care about copyright, and many code models likely have some knowledge of proprietary internals, not just code under GPL, BSD, or other open licenses.
Most likely, the model seeing even code like that is as much fair use as you seeing that code without copying it verbatim; some people do avoid looking at code for potential legal reasons, but of course it's a big gray area.
Maybe a fun future would be one where everyone just gives up on copyright as a flawed concept and chooses to write code that is simply uncopyrightable, sort of pseudo-public-domain. If MS starts using AI for their own code writing, which they are, a lot of their internal code might not be copyrightable, or at least be a very gray area!
- Some of you are upset because copilot is closed source and paid, while being trained on open source code. Who cares? There are open source models trained on open source data and you can use those just fine.
The problem with using these, especially the better ones, is the hardware needed to run them. We need a lot more VRAM, and Nvidia and other manufacturers keep the big-VRAM models for enterprise use, at severely inflated prices.
If you want freedom to run any of these things, and there will be a lot more interesting things than even code models in the future, the average person needs a lot more powerful hardware for their own use, not just cloud.
The problem might be made even worse by the current US-China tensions: the US is blocking a large part of the world (China) from being able to buy these AI chips (and GPUs), though this may not matter unless you live there.
However, this is almost a war on general computation, and one that we should fight to keep our freedom to run anything we want on our machines. There are many small and big players who are very bothered that at some point the public will be able to run the more advanced future AI models, and who would do anything to stop this.
- From my personal viewpoint, I don't see an issue with this: you learn by reading other code and that's fine. It's how your brain works, you compress the information into concepts and you use it as you see fit.
Big machine learning models do a similar thing: they don't just memorize data unless that data repeats often or is very common (or you train them in a way that encourages such repetition).
You may claim that the way ML models learn needs too much data (for LLMs, bigger models actually learn faster, probably because they can better leverage what they learned before and there is more room to express the higher dimensionality of the data), but even then the general idea still holds.
Let's say this case goes south and MS loses (I expect them to win, much as Google won the landmark case over scanning copyrighted books); that would make copyright far stronger than it has ever been, and whatever applies to the machine would also have to apply to the human, since the learning processes will only get closer and closer.
Yes, you may view MS with dislike for how much they push their strong copyright views, but in this case, they are the ones against copyright! As someone with a strong dislike of copyright and IP laws, I know where I stand!
Them losing the case would also set even more dangerous precedents. In the future we will train models that approach human ability along dimensions that now seem weak, and the way forward is obvious (even if maybe not to many readers here): we will approach human generality, or even reach it.
Ignoring alarmist/catastrophist thinking about what happens when we reach that point, consider the idealized case: at some point you would have what is essentially a person, or something with a thinking process close to a person, who is not allowed to learn from human works, and probably not allowed to copyright its output either (the latter I don't mind, as I don't believe humans should be able to "copyright" anything either).
This takes on a very human-chauvinist perspective. I know a lot of unspoken neo-Luddites would applaud it, hoping that cases like this would stop progress by reducing financial incentives; but in the very long term we would be giving up the ability to automate science and research itself, and thus doom humanity to much slower technological progress than would be achievable if we built truly autonomous agents capable of doing such work.
This may seem like a little thing now, but precedents could get set which would change history in bad ways. I do hope things end up in the right direction; here MS is actually fighting against its own drive for more copyright, and that will probably help human technological progress by leaps and bounds.