"Open source is a cancer"
Forgot who said that, but it looks like GitHub has taken up smoking.
GitHub Copilot, Microsoft's AI-driven, pair-programming service, is already wildly popular. Microsoft broke out GitHub's revenue and subscription numbers in its latest quarterly report for the first time. GitHub now has an annual recurring revenue of $1 billion, up from a reported $200 to $300 million when it was acquired. It …
This raises the additional question: who exactly is going to assimilate WHOM? At least for the OS, a workable Windows running on Linux would be a better model...
As for open source licensing, in theory this whole situation with Copilot raises the question of what exactly counts as plagiarism. I'd say if you look at code in a book or online and then write your own it is NOT. But if a machine creates an AI model (like fractals for a photo) and then re-creates that code from the model (in a nearly identical way) it IS plagiarism. Hopefully the courts will agree.
I am not too happy with such AI writing code. I see gross obvious junior-coder mistakes in THAT future.
"I don't expect to see a definitive answer this decade."
And therein lies the problem - by the time a court finally agrees that Microsoft are a bunch of thieving b******s, they will have:
a) ripped off the code of thousands of others for their own financial gain
b) altered their ML algorithm sufficiently to still take advantage of others' work but recode it when republishing, so the theft is impossible to prove
c) deprecated the current version in favour of a v2/v3/v4 to pretend they didn't benefit all that much
We all wondered about the real reason for MS's purchase of GitHub - "By joining forces with GitHub," CEO Satya Nadella said, "we strengthen our commitment to developer freedom, openness and innovation."
At least now we know it was simply to make the theft of all those resources easier for them...
I think Microsoft should and probably will lose this fight as well, but some of your accusations are a bit weak.
"At least now we know it [the acquisition of GitHub] was simply to make the theft of all those resources easier for them..."
Come on. It's publicly available. I can clone all of that. It doesn't take an expensive acquisition to point a downloader bot at the site and start cloning all the repos meeting some criteria. If that was their reason, not only did they start their evil plan years before they started using it, but they've come up with the least efficient heist ever. This suggests their reasons were probably unrelated, given that they can and did get training data for Copilot from locations they don't own.
Don't forget...
- Basically, own Linux despite the wailings of Linus.
- Moved windows onto a Linux Kernel (why pay to maintain the kernel when others will do it for free...)
- Started issuing DMCA takedowns for people not subscribing to their business model and being an even bigger PITA than they have been since Gates went dumpster diving.
Microsoft has stolen software for as long as they have been in business. In the beginning they would just integrate their "own version" of the most popular programs running on their O/S. When sued, which has been often, they've always outlasted and/or settled for pennies on the dollars Microsoft earned from their theft of the intellectual property. It's the most lucrative play in Microsoft's playbook.
Not a surprising development (in the legal sense), and I'm not too unhappy as it's a lousy way to write code anyway.
But I wonder what the minimum code fragment is that can be considered to be copyright. Unless this gets defined clearly very soon, pretty much every developer in the world will be liable to challenge for copyright infringement once the lawyers start to get interested.
A bit of a blunt instrument?
Like the sentiment, but in this case isn't the *actual* problem the regurgitation of the inputs by the ML, more specifically, without attribution?
Because it is possible to feed code into an ML whose results do something other than just fling out chunky bits of predigested sources.
Better ideas for such an ML system exist, but the immediate thought is: how about one that looks for code that is suspiciously close to your copyright material appearing elsewhere, as though it had been spat out by Copilot?
A bit of a blunt instrument?
True, but them trawling the entire GitHub code base is also a blunt instrument insofar as it makes attribution (or blame) impossible.
As AI/ML training falls outside of the uses laid out in the FOSS licence (but is not explicitly excluded), perhaps they should allow GitHub code contributors the choice to opt in or out of having their code used for training.
"but the immediate thought is: how about one that looks for code that is suspiciously close to your copyright material appearing elsewhere, as though it had been spat out by Copilot?"
There's already software out there designed to look for plagiarism in exams and academic papers that could probably be fairly easily repurposed for the task. Depending on how it works, e.g. simply looking for matching strings of a certain minimum length, it might well work as is.
That would certainly find out if Copilot is taking existing chunks of code and regurgitating them. On the other hand, it may well show large chunks of code being reused in other FOSS without acknowledgement and maybe against licensing terms.
It's an alignment-with-errors problem, and there's a ton of existing art and ongoing research, for example in genomics.
I've known CS grad students to build detectors for this sort of thing with good F1 scores and throughput, as class projects. If you're only concerned about checking specific Copilot outputs (e.g. those that show up for a set of queries you've defined) against a fairly small codebase (e.g. your personal projects), that ought to suffice. If you want to do bulk scanning you'll likely want to partition the datasets and do some preliminary classification to make the problem tractable.
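The "matching strings of a certain minimum length" approach is easy to prototype. Here's a minimal sketch (the function names, the 20-character window and the threshold are all my illustrative assumptions, not how MOSS or any real plagiarism detector actually works):

```python
import hashlib

def normalize(code):
    """Strip all whitespace and lowercase, so cosmetic edits don't hide a match."""
    return "".join(code.split()).lower()

def fingerprints(code, k=20):
    """Hash every k-character window of the normalized text."""
    text = normalize(code)
    return {hashlib.md5(text[i:i + k].encode()).hexdigest()
            for i in range(len(text) - k + 1)}

def similarity(a, b, k=20):
    """Jaccard overlap of the two fingerprint sets: 1.0 means near-verbatim."""
    fa, fb = fingerprints(a, k), fingerprints(b, k)
    return len(fa & fb) / len(fa | fb) if fa and fb else 0.0

# Reformatting alone doesn't fool it:
original = "for (int i = 0; i < n; i++) { total += values[i] * weights[i]; }"
relaid   = "for(int i=0;i<n;i++){total+=values[i]*weights[i];}"
print(similarity(original, relaid))  # 1.0 - identical once normalized
```

Renaming the variables would defeat this particular sketch, which is why real detectors normalize identifiers too before fingerprinting.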
If/when Microsoft loses, it will be because CoPilot looked only at the code and completely ignored any software licenses. Of course, due to similar code existing under different licenses, it would have been a nightmare to include licensing. I'm not sure "it would have been really hard to do it the right way" would be a good argument in court, so they will have to come up with a different argument.
"shouldn't be too difficult to add something like"
Copyright does not work that way in most major countries - you do not have to state that you retain copyright, as you automatically own copyright over your own work. Instead, you grant people access to your work, or if you are a developer working at a company you will likely have transferred ownership of your work to the company as part of your contract.
I've been using it for around a year, and even have a subscription. 99% of the time, the code it proposes is literally just statement completion, i.e. if x or z then y... and yes, it's very handy; it provides more of a shorthand than anything.
There have been 2 occasions where it has effectively proposed something that is more than one line... and I won't deny, this blew my mind, as one of them was a fairly complex recursion function, which did save me a lot of time. But likewise, I knew how to write it myself, and it was word for word a standard recursion function.
But I'm not sure I will renew after this period expires. I'm extremely lazy, so the idea of CoPilot and AWS's CodeWhisperer is amazing, but both absolutely take the piss when it comes to respecting those who built the products which form the core of their services.
The solution to the problem is obvious and will soon be presented to us: Copilot and CodeWhisperer are the first step towards abandoning all programming languages. Indeed, any programming language is a formalized natural language, cast into a net of commonly understandable constructions. Copilot and CodeWhisperer formalize the same common, everyday language and get the same constructions, without the hassle of manual programming. Then why programming languages?
So for those two occasions where it offered something more complex, I wonder if that code traces back to one or a very few particular sources. If so, Copilot could just add comments with a URL to the attribution(s).
It's not mentioned in this article, but in a previous article someone identified their own sparse matrix inverse code, even using the same variable names - in that case it should have been simple to provide an attribution.
Adding attribution when possible seems like a very straightforward good faith no-brainer to me. Also knowing the source can be helpful to the copilot user - there may be context, quality may be inferred, if it's a library there may be useful usage pointers, etc.
Conversely, not adding attribution when it can be seems like dumbing down.
Developers should somehow encode their copyright message into their code, then look out for fragments of it in regurgitated code.
I expect though that there will be many, many copies of similar algorithms in the repository, and the AI will be able to strip out the copyright froth, by treating the code as a black box and testing outputs against input combinations.
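That black-box idea is straightforward to sketch, assuming you can actually run both pieces of code: feed the same random inputs to each and compare the outputs. (Everything below - the function names, the GCD example, the trial count - is illustrative; agreement on sampled inputs suggests, but never proves, equivalence.)

```python
import random

def behaviourally_equivalent(f, g, gen_input, trials=1000, seed=42):
    """Treat f and g as black boxes: same random inputs, compare outputs."""
    rng = random.Random(seed)
    return all(f(*args) == g(*args)
               for args in (gen_input(rng) for _ in range(trials)))

# Two superficially different sources, same underlying algorithm:
def gcd_loop(a, b):
    while b:
        a, b = b, a % b
    return a

def gcd_recursive(x, y):
    return x if y == 0 else gcd_recursive(y, x % y)

def random_pair(rng):
    return rng.randint(1, 10**6), rng.randint(1, 10**6)

print(behaviourally_equivalent(gcd_loop, gcd_recursive, random_pair))  # True
```

Which is exactly the point: such a check identifies the same algorithm regardless of whether any copyright froth, variable names or comments survived.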
Have to wonder if this is not going the way of DMCA takedowns...
So if I publish something GPLv3 to GitHub and it's deemed good enough for Copilot to use and regurgitate, I can see I have some recourse to claim copyright on the created work - but I'm not entirely sure what laws GitHub has broken. The person using Copilot might be able to make a complaint to GitHub that their service lied and led to them getting sued (or a takedown notice or whatever).
But how about if I publish something GPLv3 to GitHub, and then someone copies it to another part of GitHub but mis-licenses it? They put it up as no-licence, anyone can have it. When they create the repo they promise to GitHub that they will not break the law and that they have permission to push the code they are storing. GitHub takes the promise as true and uses the code to train their AI.
At that point, is it not up to the copyright holder to enforce their license terms on the intermediary work?
I have to hand it to MS, acquiring github was a masterful move. Here's why:
Copyright exists by default on the created work, and by default a third party (including Github / Microsoft) has no right to copy it, redistribute it or do whatever (except for any fair use provisions that might exist in your jurisdiction.)
Open source works come with a license which specifies under which conditions you may copy and distribute the work. As the GPL is fond of pointing out, you don't have to accept that license, but nothing else gives you the right to copy and distribute the work... except... the github user agreement does just that. By using github's services you give github the right to use the code however they deem necessary in providing "the service", where "the service" now includes copilot.
Where things get interesting is the question of whether or not you're actually able to give github that permission.
If you're uploading entirely your own work then you can license it to whomever you wish under whichever licenses you wish - you can even supply a buffet of contradictory licenses and let people pick one. But if your work derives from someone else's you're not free to grant permissions that contradict the original license. So if you grab some random GPL code and make some changes, you're OK to pass the result onto someone else under GPL terms, but not at liberty to give github permission to mix it into an ML meat grinder that will spit out chunks without attribution or license. Likewise the MIT licenses which require attribution - you can't waive that restriction if you didn't write all the code yourself.
Unfortunately, the github terms of service also include an indemnity clause, so if they get sued as a result of something you uploaded being absorbed into copilot, they can theoretically shift the liability onto you.
So the tinfoil-hat interpretation of the situation would be that github's value isn't in the codebase, it's in the army of fall-guys!
@robinsonb5 "Copyright exists by default on the created work"
Does it?
Section 102(b) of the Copyright Act excludes copyright protection for “any idea, procedure, process, system, method of operation.”
"In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work."
https://www.law.cornell.edu/uscode/text/17/102#b
Is reproducing the same logic that is in someone else's code copyright infringement?
> "Section 102(b) of the Copyright Act excludes copyright protection for “any idea, procedure, process, system, method of operation.”"
Copyright protects a particular expression of an idea rather than the idea itself. I don't think anyone's claiming that AI systems have any actual "understanding" of the data they process, so the concept of an idea isn't relevant here.
> "Is reproducing the same logic that is in someone else's code copyright infringement?"
No, but it pays to be extremely careful about how you reproduce that logic if you want to avoid being accused of copyright infringement - Compaq's clean room reverse engineering and re-implementation of the PC BIOS for example, where the team writing the new BIOS weren't directly exposed to the disassembly of the original. That separation doesn't (and can't) exist in the Copilot scenario.
The service does not include Copilot. That is a separately licensed product.
Aside from that, you cannot arbitrarily expand the meaning of "the service" at your whim. 99.99...% of the code ingested by Copilot was uploaded to GitHub long before anyone knew it might exist. Nearly all of it predates Microsoft's purchase of GitHub.
Otherwise El Reg can lay claim to your salary for permitting you to post here.
But you do.. it's in their terms of service
"We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video."
You don't need to use their service, but if you do then you agree to their terms...
I would even go as far to say that if a copyright holder uploaded something to GitHub then this would trump any licence the copyright holder put on the code.
"I would even go as far to say that if a copyright holder uploaded something to GitHub then this would trump any licence the copyright holder put on the code."
Doubtless that will be one of the many arguments put forward by Microsoft's legal team. Whether one thing trumps another thing seems to be the pinnacle of civil disputes.
Unless public release binds the copyright owner to that licence eternally, then the copyright owner has the right to release their work under any licence they want.
So I will retract my use of the word 'trump' and state that regardless of the licence included in the code by uploading to GitHub copyright holders are exercising their right to licence their work in whichever way they want.
I think that any claim to be "fair use", when the thing is spitting out copies of billions of lines of code, and making a billion dollars in revenue should be a very easy argument to defeat. This is why I think that the case has a good chance of succeeding : fair use is about using small amounts of copyrighted content, not harvesting it on an industrial scale.
The other main argument is about whether terms and conditions of Github take priority over established copyright law. Again, this is shaky ground for MS because copyright is a strongly established principle, and separate from licensing, which is a kind of contract.
To me, this is an important case because of the principles involved, not because of the money MS are making, and it could set some horrible precedents if it's not considered properly.
> copyright extends 70 (?) years after the death of the author; then Co-Pilot might be legal in say 150 years ?
Sounds about right except for Mexico, they have death of last author plus 100 years! Oh and Yemen which is life plus 30 years, which sounds more reasonable.
Imagine if patents had the same duration as copyright - sometime in the next century or two we would probably have our first jet engine. Maybe that could be a solution to global warming *hard to type when laughing*: longer patents. Life of inventor plus 100 years, and most of the world would still be using steam power, so yeah, maybe not a solution to global warming :) But it would be a fun fictional parallel universe for a film - Amish world. I can picture the tagline now: "Amish world - there is no Rumspringa"
IANAL (but I am guessing you aren't either :-) ). My understanding is that "fair use" has little to do with value or scale. It is the nature of the use, whether it is transforming it into something else or just reproducing it, etc.
Also, of course, "fair use" is a US legal concept - there is no such established principle in UK law.
In the UK there is a concept of Fair Dealing, which isn't quite the same. (Introductory from British Library at https://www.bl.uk/business-and-ip-centre/articles/fair-dealing-copyright-explained) Of course, it's never quite that simple. For our American cousins (and possibly others) IANAL.
So why doesn't Microsoft publish the source code for all their products online and we'll take what we need, rename variables, shift some lines of code around, add/delete a few comments and there should be no copyright dispute. We won't even fix any bugs.
I am glad to report that I did my bit many years ago to thwart these robotic coding attempts: I have some buggy, half-arsed projects on GitHub with code that I am not proud of. Whoever gets that forced into their project by the bot net is going to spend more time debugging arcane code than it would have taken to just write the stuff themselves. Hahaha haaa! Despicable!
You cannot (and should not be able to) patent software algorithms. This is about copyright (and copyleft).
The human brain doesn't work like a digital computer because it is constructed differently - it is made from ion-pumping analog circuits that are noisy and imprecise. In general human brains cannot remember code photographically - the only choice is to deeply internalize the underlying algorithm in some neural encoding, which can later be used to generate new and unique instantiations of that algorithm.
Ion pumping analog computers are coming along though - e.g., New Scientist, "‘Artificial synapse’ could make neural networks work more like brains - Networks of nanoscale resistors that work in a similar way to nerve cells in the body could offer advantages over digital machine learning" Imagine the circuitry of the human brain, but built with solid state components making it a billion times faster.
I'd shorten that down to "concept vs copy". Or in legal specifics, patent vs copyright.
It is very hard to argue (In My Bombastic Opinion) that an AI-based programming algorithm is anything more than a fuzzy data compression and expansion method. As such, the data from the program source it scanned is uncompressed and included in the output. Whereas humans, of course, would have to create something fresh the way it has been done for 100,000 or so years. OK, so your neighbor made a wheel. You can make one, too. NOT plagiarism (but maybe it violates his 'patent'). Etc.
The training data contained GPL'd code.
The GPL (with few exceptions) applies to derived works.
Copilot can thus emit GPL'd code.
Thus any product that used Copilot generated code may be subject to the GPL.
This is not a problem with the GPL, it's a fine license and ensures software freedom.
The problem is bullies like MS not respecting others' licenses.
The problem is: is Copilot emitting the same code, or code that looks alike - even very alike - but is not a verbatim copy, being instead a product of its algorithms? Because even a human programmer may write code that looks identical but is not a verbatim copy. Code is not a novel; there aren't many innovative ways to write it.
Many algorithms have common and obvious implementations. Now, if we forbid writing code that looks very alike, programmers would be forced to write bad algorithms - or we may have only one program.
So it's just the question of whether an AI merely copies or "creates"...
While the kneejerk reaction might be to find something in copyright laws to hit microsoft with, you have to remember that this could have consequences that are larger than copilot. If microsoft actually gets hit with something and we have a new precedent or law for copyright, it could end up backfiring on the little guy later. While trying to 'regulate' microsoft, you end up putting down more regulation that gets in the way of everyone else.
It almost seems this way for all of copyright, really. While it might be nice that people can't 'steal' your ideas from you, you've ended up causing a torrent of problems.
If CoPilot or competing AI systems generate ASR rather than AST for training their ML systems, is this a violation of copyright? ASR could generate source code in a number of languages from the language-invariant concept in the ASR.
LFortran, a modern LLVM-targeted Fortran compiler, generates an ASR of the source code. This enables cross-compilation to Python and other languages.
https://gitlab.com/lfortran/lfortran/-/wikis/GSoD%20Proposal%202022%20-%20LFortran%20Compiler%20Developer%20Documentation
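As a toy illustration of why a tree-level representation sidesteps surface details (this uses Python's own `ast` module, not LFortran's ASR, and the `v0, v1, ...` renaming scheme is my assumption): two snippets that differ only in variable names reduce to the identical shape.

```python
import ast

class Canonicalize(ast.NodeTransformer):
    """Rename variables to positional placeholders, so the tree captures
    the algorithm's structure rather than the author's spelling."""
    def __init__(self):
        self.names = {}

    def visit_Name(self, node):
        canon = self.names.setdefault(node.id, f"v{len(self.names)}")
        return ast.copy_location(ast.Name(id=canon, ctx=node.ctx), node)

def shape(src):
    """A name-invariant rendering of the code's syntax tree."""
    return ast.dump(Canonicalize().visit(ast.parse(src)))

a = "total = 0\nfor item in items:\n    total = total + item\n"
b = "acc = 0\nfor x in xs:\n    acc = acc + x\n"
print(shape(a) == shape(b))  # True - same algorithm, different spellings
```

Under a representation like that, "is it a verbatim copy?" stops being a question about characters and becomes a question about structure.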
How hard would it be to generate different training datasets based on license, and then fully track attribution during the training process?
"This suggested fragment of code was automagically generated, was and still is GPL 3.0 licensed, and is based on the contributions of this list of 4000 people. Both of which must be included in your code if you use the suggested code."
Microsoft also probably has enough in-house source code that they could use as a training dataset, as long as they did not mind leaking source code to their core products.
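The partitioning itself is mechanically simple once each file carries machine-readable licence metadata; the hard part, as noted, is getting that metadata right in the first place. A sketch (the `TrainingFile` record and SPDX-style strings are my assumptions, not anything GitHub actually does):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingFile:
    repo: str
    path: str
    license: str       # SPDX-style identifier, e.g. "GPL-3.0-only"
    authors: tuple     # attribution carried alongside the code

def partition_by_license(files):
    """One training bucket per licence, attribution preserved per file."""
    buckets = defaultdict(list)
    for f in files:
        buckets[f.license].append(f)
    return buckets

corpus = [
    TrainingFile("alice/tool", "main.c", "GPL-3.0-only", ("alice",)),
    TrainingFile("bob/lib", "util.py", "MIT", ("bob", "carol")),
]
buckets = partition_by_license(corpus)
print(sorted(buckets))  # ['GPL-3.0-only', 'MIT']
```

A model trained only on the MIT bucket could then emit suggestions accompanied by the per-file author lists, rather than leaving attribution impossible.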
"How hard would it be to generate different training datasets based on license, and then fully track attribution during the training process."
Very, I would have thought. Determining what licences apply to particular bits of code in a mixed project might require natural intelligence. You might end up having to limit the system to code that has been opted-in by its owner (as someone suggested a few million comments ago) and that opens up the possibility of "hostile" training data.
There may simply be no way for Copilot to be both legal and worth using. Too bad, Microsoft. The world does not owe you a business model.
Co-pilot's concepts don't appeal to me much. Most of what I work on for code is rather specialized and does not follow traditional behavior patterns because it is integrating existing systems the code has no control over. That kind of interface coding doesn't lend itself well to automation.
Personally I think they're barking up the wrong tree. I absolutely despise the "autocomplete" behavior of IDEs that auto-insert brackets and close-parens and the like. I'm a touch typist; most of those interface "enhancements" slow down my input rather dramatically by inducing typos and "alternative interpretations" of what I was typing. Maybe if I was a hunt-and-peck typist I wouldn't feel that way about autocomplete technology.
The other problem with approaches like co-pilot is they assume there is nothing more to learn and nothing more to do differently; that all you have to do is regurgitate what was done before, rephrased. I seriously doubt we're at the end of computing history...
Yes, I was going to post something similar. Copilot is in no way a pair-programming mechanism. The purpose of pair programming is vigilance, not copy-and-paste.
You could certainly train a model on common errors and have it flag those – though we already have on-the-fly static analysis without needing vast resources for a transformer architecture with a zillion parameters, so that would be Kind Of Stupid. But that would be something along the lines of mechanized pair programming. Not particularly good pair programming, since pair programming works best with agents that understand the intent of the code; but it would be closer to pair programming than Copilot is.
I have copyrighted the 0 in computing!
I have copyrighted the 1 in computing!
I AM COMING FOR YOU ALL!
Unless you are willing to buy licenses to use 0s and licenses to use 1s from me for the low low price of 200 quid/head, my lawyers will be in touch. Don't like it? Use o's and i's instead.
I wonder what the reason is for Microsoft to use GitHub/open source as a training set; surely their own codebases, which must be huge, offer enough material at least to support MS's own targets like .NET / Windows coding. Wait... no way would they give up all their codebase secrets (and reveal how shifty the code is) via AI/ML to the world - but it's okay to do that with other people's code...
Genuine fair use would be to include your own code in the training set; if you did that, it would show the world that you really believe it and walk the walk, in the fair-use sense. If MS thinks it is fair to include code from others, the others can rightfully expect to be able to use snippets of MS's own code via Copilot too.