Microsoft violates copyright.
Microsoft and GitHub have tried again to get rid of a lawsuit over alleged code copying by GitHub's Copilot programming suggestion service, arguing that generating similar code isn't the same as reproducing it verbatim. The duos' latest motion to dismiss [PDF], filed on Thursday, follows an amended complaint [PDF] from the …
I agree with the GitHub/Microsoft argument: You made it public. A computer read it and paraphrased concepts it learned, but the way generative AI works, it doesn't "copy" code.
The license on code only keeps me from taking that particular code implementation and using it. There is nothing preventing me from reimplementing the functional interface in another language or even the same language, provided I am not copying it exactly. And in the case of generic terms used as class names, even that restriction is fuzzy.
The 303 Creative decision you're unhappy about doesn't bear a lot of similarity to this case. I get that you're unhappy with the result, but really, your post just comes across as salty. 303 Creative was about a standing problem: the website designer was in the position of being unable to sue without knowingly violating the law and getting prosecuted. Colorado stipulated on the record that they WOULD prosecute the owner of 303 and send them to a 're-education camp' if they went forward with exercising their free speech rights, and the standing argument in that case would have amounted to 'of course you can sue, but only after you go get convicted first, just to have the right to sue.'
This isn't that at all. Nobody's arguing about pre-enforcement standing except the giants trying to avoid consequences for their stealing. The conduct has already happened here. This isn't preventing something from happening. It happened, and Google and MS are just trying to say 'we know it happened but you can't prove it because we're making it impossible for you to do so.'
I get that you're salty about the 303 creative decision, but it's not relevant to this case except in the broadest of senses that they'd both be cases where there were various arguments about who has standing to sue. One case would never reference the other in some later legal decision.
Model training might be fairly described as a process of encoding the whole work(s), rather like JPEG encoding a copyrighted artwork. As such it is not "fair use".
The model itself is a copyright violation.
The fact that it produces imperfect copies of its source is an encoding limitation, the same as other highly compressed copying and storage methods.
> Model training might be fairly described as a process of encoding the whole work(s), rather like JPEG encoding a copyrighted artwork
Uh, no, sorry, that is taking analogies too far. And it most certainly isn't a "fair" description.
> The fact that it produces imperfect copies of its source is an encoding limitation, the same as other highly compressed copying and storage methods.
Bad analogy. The lossy nature of JPEG is well-defined in the quantisation step, and even the worst over-compressed JPEG still spits out a recognisable copy of the *whole*. If you can suppress the ringing, a lossy image compressor will literally spit out a copy "as though you were standing x metres away" and cannot resolve the high frequency components. It is still designed with the sole purpose of representing the whole.
Note that we only apply lossy compression (of a form arguably similar to JPEG...) to audio and visual, not text. "Strip out the high frequency components" from text and you get gibberish, especially when trying to compile the results.
> The model itself is a copyright violation
I have argued before (and it was not well received, sob) that the model *might* be compared to the Huffman tree you can find in many compressors: as you traverse the tree, you hit a leaf node and spit out the tiny fragment (just a few characters, maybe a word) on that leaf. The same tree is used (in this analogy) and traversed a *lot* to finally spit out a decent chunk of material. In the decompressor, it is the input bitstream that causes a specific traversal - one bitstream outputs "Moby Dick", another "Life of Brian" - and does so reliably and repeatedly.
It is those input bitstreams, the traversal pattern, that are the copyright violation. Not the Huffman tree.
In the analogous traversal of Copilot (or another language model), the model itself should be as innocent as the simple tree (just a lot bigger and more graph-y than tree-y). So it must be whatever is guiding the traversal that is to blame, much as we blamed the bitstream above? But the traversal process is stochastic: until you finish it you have no idea precisely what you will get out. Unlike the compression example.
UNLESS, of course, in Copilot "the model" is a really crap model and the whole thing is not behaving like a "well behaved" deep learning model ought to and Copilot is all just a fake.
"and Copilot is all just a fake."
It's from Microsoft, what do you guess?
It *all* exists solely to circumvent the GPL, so the quality of the model is absolutely irrelevant. You won't see MS using *any* of their own code to teach it, that's for sure. Which should tell anyone what its actual function is: blatant copyright breach at large scale.
I mean, open source is source that is open and closed source is source that is closed.
So, yes? Microsoft don't train it on their code. This is consistent because they don't make their code available for public viewing in general.
I have projects under various open source licenses. I put them under free licenses because I want people to be able to reuse them in various ways, as covered by copyright. But as a base fact, I put the source code out in the wild, meaning people can look at the source code. The reproduction of the code is covered under various licenses, GPL or BSD; the presentation is covered under the license I grant Github ( https://docs.github.com/en/site-policy/github-terms/github-terms-of-service#5-license-grant-to-other-users ), or just simply the fact that I place the code on a public server. That is rather what they are for.
If Microsoft put the Windows source code on a public webserver for people to look at, and say Anthropic train their model on it, and Microsoft sues them, I will likewise take Anthropic's side, regardless of what the actual license of the code is. Licenses cover reproduction, the fact that you are presenting it to the public covers viewing - and, hence, training.
> Note that we only apply lossy compression (of a form arguably similar to JPEG...) to audio and visual, not text. "Strip out the high frequency components" from text and you get gibberish, especially when trying to compile the results.
Strppng sm vry hgh frqncy cmpnnts lvs nglsh txt cmpltly ntllgbl. Yr brn wll vn dcd t fr y (slwly)
Tabs/spaces/newlines are high frequency components of code. Most of those can be trivially stripped by a language-aware lossy compression algorithm with no impact at all on compilability or function, only on readability. Even that can be (mostly) restored by an automated formatter.
I'd suggest we don't do lossy compression on text mostly because it's so small compared to images and videos, and it isn't worth the bother.
> It is those input bitstreams, the traversal pattern, that are the copyright violation. Not the Hufman tree.
I'm guessing your Huffman tree isn't going to be 100s of GBs in size (the V-RAM requirements of GPT-3 - so probably higher now), and is therefore incapable of independently coding and regurgitating much larger chunks of copyrighted text than the input.
Further on the contention that lossy compression doesn't really exist for text: while it isn't - as above - cost effective to do for storage or network cost reasons, a lossy (and sometimes dramatic) reduction in text size is often desirable to save time and mental load for readers. It's just called something different in that context: text summarisation. As well as being something your teachers made you do frequently at school, there's an entire subfield of NLP dedicated to doing it algorithmically.
Abstractive summarisation aims to reduce the size of a text while preserving the most important semantic structure. This is conceptually very similar to lossy compression in the image, movie and sound domains, where you're aiming to reduce storage size while allowing reconstruction of the most readily perceived information. It is of course more challenging to do text summarisation acceptably - there is no easy shortcut equivalent to the relatively dumb transforms JPEG can use to throw away details the eye can't see (analogously that are unnecessary for the elementary exam question).
As it turns out GPT-3 has been tuned to do abstractive summarisation for book-length unseen text with near-human performance - https://arxiv.org/pdf/2109.10862.pdf - and GPT-4 will be better. There's your lossy text compressor, right there.
Except compression and summarization achieve different goals, even though in both cases the result is shorter than the input. Summarization eliminates some of the data the user would use, whereas lossy compression is designed to be decompressible into something containing all parts of the original data, with certain aspects removed for size. I can summarize a video file by cutting out chunks we don't need; that is not the same as compressing it so that all the frames remain but with fewer pixels specified.
One could do a lossy compression of text by removing some punctuation and maybe even spaces, and someone might be able to read it with some work, but that would be compression, not summarization. The two are not synonymous.
"something containing all parts of the original data with certain aspects removed for size" just means "something without all parts of the original".
Compressing at too low a bit rate happens frequently (e.g. any Netflix stream) and frequently manifests as posterisation etc. It is clearly not "containing all parts". It chucks away large chunks of colour information that I can otherwise perceive from the side of the room, and I find it intensely annoying. Thankfully audio streaming at 64-96 kbit/s is no longer a thing, but it was, and it was awful. Does this mean low bitrate compression cannot be compression under your definition?
On the other hand:
Wikipedia definition of lossy compression: "Reduces bits by removing unnecessary or less important information".
Definition of text summarisation: "Creating shorter text without removing the semantic structure"
For situations where semantic structure is the most important information in text (which are many), these are synonymous.
"arguing that generating similar code isn't the same as reproducing it verbatim"
US copyright law doesn't require exact copying for it to be a violation of copyright. For software there is a process of evaluation involving distilling the code down to its essentials and comparing it that way. That way you can't for example simply change the variable names and formatting to avoid copyright. If that were allowed then you wouldn't need AI to get around copyright law.
The real issue will likely come down to whether each bit of copying had enough code copied for it to be considered significant enough.
A common analogy is that you can't take the latest Harry Potter novel and simply change the character and place names, add a few scenes from another book, and then publish it as your own work. It's still considered to be a copy even though it's not identical.
There's a handy guide to "derivative works" from the US Copyright Office:
It clearly states "Only the owner of copyright in a work has the right to prepare, or to authorize someone else to create, an adaptation of that work". It also gives as an example "A new version of an existing computer program".
However, coming up with sufficient incontrovertible evidence against an impenetrable black box is likely to be the hard part.
How many on here have used stackoverflow?
I might use Stack Overflow to get an initial introduction to an area I'm unfamiliar with (ditto Wikipedia), but I never copy code from it because most of what I've seen there is a godforsaken cross between Chinese whispers and cargo cult programming with no real understanding of what's going on. But then, I'm very old school and hate writing code for a problem I don't understand.
I think you are expected to be able to work that out for yourself.
This isn't always possible, because classes often have really generic names that are reused many times in other libraries.
A simple example of this is Problem. There are multiple different versions of it sharded within Spring and its many sister projects.
Many of these Spring problems are the original zalando Problem copied over and modified or tweaked....
Again, it's not always clear which Problem is meant in a code snippet without the imports etc.
Obviously you have done very little programming, or only at a childish level, if you have never encountered a large project with zillions of dependencies which often have sharded libraries of different versions.
Anyone using Spring would know what I'm talking about, so go ask your mommy or daddy if you can't spell those big words.
> Anyone using Spring would
> zillions of dependencies which often have sharded libraries of different versions
I recall attending seminars from Sun when they were first trying to convince us to use Java - pushing for bulk adoption of a VM for "run everywhere" wasn't a new idea at the time, of course, but there were already rumblings about GUI standardisation (or lack thereof).
In all honesty, as we'd never had any great difficulty with writing portable C/C++ (Sun, SGI, MSDOS then Windows) and knew how not to leak memory we didn't leap into it. One project wrote a Java applet to run in the browser, but, well, the performance was crap and we just recoded it in C++: zooooom.
Since then, I've never really bothered with Java: if a project really needed it, quite ready to step in for a bit (anyone want two feet of O'Reilly Java books? JNI is *such* fun - and, um, doesn't fit "write once, run anywhere" if app devs needed to know about it!). And the ruddy stupid problems with the runtime: making it cheaper to force hospital staff to use two different computers because the suppliers of software to a major(!) customer weren't able to get two programs running on one PC!
the one: In all honesty, as we'd never had any great difficulty with writing portable C/C++ (Sun, SGI, MSDOS then Windows) and knew how not to leak memory we didn't leap into it
cow: but the world does, and since we can't replace all those other lamers, we have a problem.
the one: One project wrote a Java applet to run in the browser, but, well, the performance was crap and we just recoded it in C++: zooooom.
cow: java may have been slower back then, but at least it couldn't steal files from your computer. I'm sure your customers appreciate keeping this big hole open for the bad guys so your thing can run.
the one: Since then, I've never really bothered with Java: if a project really needed it, quite ready to step in for a bit (anyone want two feet of O'Reilly Java books? JNI is *such* fun - and, um, doesn't fit "write once, run anywhere" if app devs needed to know about it!).
cow: Java is a success because it has libraries for everything, time is money... Catch up with today and stop living in the glory days of Queen Victoria.
No. But if my project is provided under a license saying "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software" (MIT license) then you must abide by that license if you copy a "substantial portion".
Microsoft are arguing that because they mashed up large amounts of code they took, nobody can prove that a specific piece of output was copied from a specific piece of input. But that ignores that all the code they fed in came with a license in the first place, and they ignored the attribution requirements of those licenses.
A proper test would be to retrain Copilot only on code supplied without licenses containing attribution requirements, and see whether it still outputs the same code. If it can't, that proves the copyright requirements were violated.
I learn from licensed code that I read. For instance, if some GPL licensed code is my first exposure to a particularly clever way to write a for loop (maybe the `*p++ = *q++` idiom), and you retrained my brain without that code, I might not use that idiom. That doesn't mean I'm violating copyright; technique, or skill, is not copyrightable. (It may be patentable, and cursed be that fact...!) I understand the logic behind the idiom and can use it in generic situations; by truly learning it, I have made it my own.
I believe the disagreement is in general whether GPT copies content or skill.
> what's the legal situation there?
I believe CoPilot's license pushes (or at least tries to) liability for that onto the user.
So, if you end up in a position where Copilot spits out something that's covered by a patent, it's you on the hook if the patent holder finds out. Unless you've got the resources to try and sue Github/MS (or get lucky and the patent holder decides they'd rather pursue those deeper pockets)
Of course it hasn't. I combine constants all the time, because it makes it easier to understand what I'm doing. For example, I could write 691200. What does that mean? You'll just have to guess. If I write it 60*60*24*8, you can probably guess what it is, and if you want to change the number of days, it's a simple modification. Similarly, I have a few scripts that report the sizes of files, and I find it easier to divide by 1024**3 or 2**30 to get the number of gigabytes than to spell out 1073741824 every time.
It gets even more useful when the constants are more specific. If I'm allocating a certain number of bytes, then I'd much rather call malloc(40*FRAME_SIZE) than malloc(960) *. If FRAME_SIZE ever changes, it's easier to change it once than to redo all the allocations. The compiler will calculate the constant for me and store it as such anyway, so why do you think I should do that computation for it?
* Even better would be malloc(40*sizeof(frame)), but in many cases, these aren't structs and are instead strings of bytes which have to be constructed manually. Sometimes, I will create libraries so they can be structs to the user, but not always.
Absolutely, you are providing the voice of experience and good practice.
In addition to your examples, if you are into programming MCUs, the code can be made so much clearer to re-read if you set control registers using bit-structures (or macros) to combine bit positions into the required byte or word. Seeing a page of "IOCTL0 = 0x47; IOCTL1 = 0x85; ..." gets very tiresome very quickly. It's also a pain in the backside when you need to change a single setting from a constant to a variable.
For some reason, the electronics engineers really liked working out all the hex values by hand and sometimes even commented out the bitfield assignments to replace them with hex again when it was their turn to edit. What fun we had. Great bunch of lads but sometimes...
I had an MCU project, in the dark ages of, ooh, 2014 or so, where the MCU had been chosen "because we have a devkit already lying around": a derivative of a Well Known 8 bit MPU from the 1970s with bolt-on programmable I/O. Fine, I'd used the MPU before, knew its assembler (still have the relevant Sybex book, slightly foxed and very badgered) and there was a Windows-hosted C compiler for the bulk of the work. It even has an IDE with remote debugger. Yay.
Ah. The compiler is a - bit lacking (ho ho): no bit fields in structs. Not a problem: whip up macros to match the registers and pass in the relevant 1, 2 or 3 bit values, shift and OR, tada. Use enums to name the non-trivial values. Include comments for the "why" of that setting. Much readable, very SwEng. Only it won't compile. Weird error codes, and no-one on StackOverflow admitted to using it. Fiddle about - that compiled, this didn't...
Turns out the compiler's preprocessor pass has a 255 character buffer. Any macro expansion that can be completed in 254 chars (plus the NUL terminator) is fine. 256 or more chars, error message. 255 - don't look, just delete the stderr capture. Oh, and that 254 includes comments in the middle of the invocation. Luckily, I use Make instead of compiling in the IDE, so it was an easy fix to preprocess using GCC before calling the cross-compiler. Much smug, very compiles.
Yes, I check SO. I have no pride left.
From https://www.copyright.gov/circs/circ14.pdf via abend0c4
> "Only the owner of copyright in a work has the right to prepare, or to authorize someone else to create, an adaptation of that work". It also gives as an example "A new version of an existing computer program".
I can not copy a work but I can study it, see how it was constructed and use those methods.
For example, I can study the works of a living artist, a painter in oils, and learn from them how layering of colour works, when you mix paints to match a shade and when you place unmixed shades side by side to create the impression of the mix.
I can take those techniques and create a whole new piece, applying it to a subject that the other artist would never touch.
So long as the original artist allows me - and probably everyone else - to view his works, this is all perfectly legitimate, even fully expected, behaviour. I do NOT need the artist's explicit permission to do so.
A model like Copilot is *meant* (if done well) to act similarly: so long as the source code is made available to read, it is acting entirely legally and your quoted paragraph simply does not apply.
The question you (and the court) have to apply becomes: *is* Copilot acting as it is *meant* to do? Or is it just a bad/incomplete implementation of the idea of something that can synthesise code? If it *can* synthesise, at what point does it inevitably spit out "recognisable" code because that is the only known (or simply the "obvious to any practitioner") way to do the job?
The issue will likely come when the project you are working on resembles only one or a very few original works that the model was trained on. The model will likely then spit out suggestions that are very close to the original work.
The biggest problem of course is that due to the way the system works, you will have no idea when this is happening.
To take your art analogy further, imagine a very unimaginative art student whose art education consisted of being shown a selection of paintings in a museum. Now suppose you tell this budding artist to produce a new landscape containing windmills. However, that art student has never seen an actual windmill himself and had only ever seen two museum paintings containing windmills. The resulting new painting will likely contain recognizable copying from the originals. This will be the issue with software.
Going by the way that the copyright system works, I would not be surprised if Microsoft are found to be not liable, but that the people who use their Copilot service are instead found to be shouldering the full liability for any resulting copyright violations. This is because Microsoft will be feeding customers small fragments at a time, while the customer is the one assembling them into an infringing work.
Software copyright law is based heavily on copyright for books, movies, and other entertainment and educational media. Eventually generative AI will get good enough for it to be applied to that industry (it's already being used for stock images, with resulting controversy there as well). So imagine feeding 50 years of American TV situation comedies into a model and using them to produce new ones based on broad scripts without the original studios collecting royalties on them. It's not going to be allowed to happen and laws will be changed if necessary to ensure that it doesn't. Software will be affected as a byproduct of this as well.
If that is the case, it should be possible for the plaintiffs to seed copying of their code by choosing some sufficiently specialised function, then asking Copilot to output code that meets the same business requirements. Then they'd have a slam-dunk case.
The fact that they have apparently not been able to do this, makes me suspect that this is just an effort to shake down Microsoft.
> it should be possible for the plaintiffs to seed copying of their code by choosing some sufficiently specialised function, then asking Copilot to output code that meets the same business requirements.
They would have had to have done that seeding at the time Copilot was being trained: bit late to do so now (though they may catch out a future release of Copilot).
In a fashion, this is what appears to have happened with the Fast Inverse Square Root example (see the https://www.theinsaneapp.com/2021/07/github-copilot-ai-facing-criticism.html link provided by an AC)
The questioner wanted fast inverse sqrt() - well, *the* version of that routine that we all know and love is from Quake and it has been copied *everywhere* - verbatim, including the sweary comments. So no great surprise that it gets regurgitated.
HOWEVER it has also learnt that chunks of code of that size or bigger are accompanied by a comment talking about licencing, so it went ahead and generated one as well. But, probably there are lots of examples of such comments, so it can generate lots of different ones. That the licence comment didn't match the code is of no surprise at all - there is nothing to *make* it match the code! All the model does is spit out something that looks sort of like a licence comment, it has so many to choose from and the stochastic process just sent it down the path to (re)create the one shown.
The weird thing is that this mismatch occurs because Copilot explicitly doesn't do what many think it does, namely store great chunks of text that are simply strncpy()'ed to the output. If it did do that then there would be the chance of storing a ref to the appropriate licence alongside that text chunk.
A stochastic model becomes deterministic (and hence regenerative rather than generative) when it is fed too few options (or too many copies of the same option) - hardly a novel observation. It would have gone better for Microsoft if they looked for such limited chains and replaced them with canned text plus attribution. Assuming, that is, that they have enough understanding of how their huge pile of nadans actually gets traversed to spot them and are willing to spend the resources to do such a cleanup.
Actually, it gets recreated piece by piece, following the chain of "this is highly correlated to follow what has already been spat out" - and if it has only ever seen the one bit of code follow the phrase "fast inverse square root" *and* it has seen that many, many times, that chain will become predictable - which makes it a naff model in that regard, btw.
Although there is also the possibility that the bulk of the copies of the Quake code that Copilot has seen *also* get the licence comment wrong, and that has been incorporated into this very narrow (as in, singular) set of paths for generating fast inverse square root. We should try checking that (must remember when back at a proper computer).
"The biggest problem of course is that due to the way the system works, you will have no idea when this is happening."
By design, I'd like to add. Basically a black box which takes all copyrighted content and spews it out as its own. That smells like Microsoft all right.
"A model like Copilot is *meant* (if done well) to act similarly: so long as the source code is made available to read, it is acting entirely legally and your quoted paragraph simply does not apply."
Do you actually believe Microsoft *meant* to do that? If you do, I've a bridge to sell you.
I've noted before that there are reasons for models like Copilot to spit out "recognisable" code, most of which boil down to a naff model (shoved out too early, surprise) and/or the monkeys on typewriters: there is a random component selecting from "I've just put out A now that is usually followed by W, X, Y or Z, roll a die...", do that often enough and you'll be able to see strong matches with inputs.
We've heard lots of stories about people seeing their code being spat out again and maybe it happened.
But "maybe" isn't useful in court.
Why have the plaintiffs gone in with such feeble evidence, against a monolith like Microsoft?
 In fact, every line Copilot spits out *has* to be recognisable, or it won't compile (assuming that the complaints are over code that does, in fact, compile). Ah ha, that is a for-loop, I recognise it. Obviously, we are concerned about non-trivial, longer, sections of output.
 Sorry, I have to say "maybe" because so far I have not seen a fully-described example of it happening, complete with the whole conversation. Please, if you have them, give citations.
> Sorry, I have to say "maybe" because so far I have not seen a fully-described example of it happening, complete with the whole conversation. Please, if you have them, give citations.
Also this article has screenshots in case you can't get to Twitter right now (I can't).
> Why have the plaintiffs gone in with such feeble evidence, against a monolith like Microsoft?
Perhaps because, like in the aforementioned "fast inverse square root" case, as soon as Microsoft are made aware of Copilot generating clear evidence of copyright infringement, they apply a bandaid to prevent anyone else from seeing that particular evidence.
That move is not really a problem. By pointing out that it did happen and that the model was not retrained, they can prove that it can generate verbatim substantial portions of code. That it will no longer generate that one won't prevent it from generating any others, including other sections of the plaintiffs' code. Adding guards for literally every portion of their code would likely start causing problems if a lot of people asked for them to do it. That's the kind of evidence that will be needed, but most likely, the person who owns the code concerned will have to be a participant in the trial (if you get it to print some of my code but I'm not involved, you will likely not be able to sue on that basis).
The article suggests that they have some code they say was printed by this bot, but they don't want to share it because it could identify them. From previous articles, it seems that they got threatening communications from someone, which is one reason they want to stay anonymous. But it may make it difficult to make their case if they won't supply evidence, because that is difficult to distinguish from not having evidence. I'm not sure how easily that can be fixed, but they might want to find an option in order to make their position stronger.
If it can happen, then why can't the plaintiffs show examples of it happening? That's the feebleness of evidence that OP is talking about.
As to "we dursn't show the evidence", surely evidence can be kept private to the court? Happens all the time in various contexts. Unless they're saying they fear official retaliation from Microsoft itself, and if that were the case this would be a very different story.
I get the argument. Basically, I was agreeing with it. I think that, if they could provide the evidence, their case would be strong even if Microsoft hid the chunk so it wouldn't come back. That still requires them to show that a piece of their code, not a chunk from Quake that has already been copied under several licenses in many places, can be printed by the bot.
As for hiding the evidence and only showing the court, it could work, and if they really won't provide the evidence publicly, they should at least try that, but I think it will seriously weaken their case. The problem is that Microsoft will, if they can't see the code, start looking for claims as to why it shouldn't count. They could argue that the plaintiffs won't show them the code because it would prove the copying to be insubstantial or obvious (we all agree that copying a boilerplate expression or standard lines wouldn't qualify, and how would Microsoft know that wasn't the code submitted?). The court is staffed by people who can't tell obvious code from original and very clever code, so they could be swayed to either argument. Doing it publicly would help, if they could do it. That they haven't makes their case suspicious, but not automatically faulty.
You train an AI with cat data you get some form of cat data output.
You train an AI with drug molecular data you get suggestions for new drugs that may or may not be valid.
You don't get data of dogs or designs for cars from these examples, it is simply impossible.
AI trained on open source software will output open source software; it has no other references, it cannot imagine another paradigm for software development, it only "knows" of the methods by which open source software is constructed. A person or company might treat it as proprietary software, just as you might download openwrt and sell it as the basis of a closed source router. However it _is_ still open source and has a licence that should be respected.
This is the case whether or not you can identify a particular source for a piece of code output from the model. The argument that the reality changes if you cannot identify an exact copy is spurious: the output could be an amalgam of many projects, but they are all open source. Otherwise OpenAI and Microsoft have trained it on their own closed source, which we know is not the case, or they have trained it on other people's private code. Not a good outcome whichever way you look at it.
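The amalgam point can be made concrete with a deliberately tiny sketch (this is an illustration only, not how Copilot's model actually works): a bigram model trained on just two snippets can emit output that matches neither snippet exactly, yet every transition it produces is stitched verbatim from the training data.

```python
import random
from collections import defaultdict

# Hypothetical "open source" training corpus -- two short snippets.
TRAINING = [
    "def add(a, b): return a + b",
    "def mul(a, b): return a * b",
]

def train_bigrams(corpus):
    """Record which token follows which, across the whole corpus."""
    model = defaultdict(list)
    for text in corpus:
        tokens = text.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev].append(nxt)
    return model

def generate(model, start, max_len, rng):
    """Emit tokens by repeatedly sampling a recorded successor."""
    out = [start]
    for _ in range(max_len):
        choices = model.get(out[-1])
        if not choices:
            break
        out.append(rng.choice(choices))
    return " ".join(out)

model = train_bigrams(TRAINING)
sample = generate(model, "def", 8, random.Random(0))
# `sample` may blend the two snippets (e.g. pick "mul(a," but "+"),
# yet every adjacent token pair it contains appears in TRAINING.
```

The output may match no single training snippet, but it contains nothing that was not in the corpus, which is exactly the "amalgam of many projects" situation described above.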
Huh? You are claiming that there is some recognisable difference in form and structure between open source and proprietary code?
If I show you some code, you are able to decide whether it is open source or proprietary as easily as you can the difference between a cat and a dog? And from the same sort of distance (i.e. you aren't just cheating by hoping to spot an explicit copyright comment tucked away)?
> the methods by which open source software is constructed
Huh? In fact, double huh? We change methods when coding OSS? When I'm not under contract for proprietary work there is some kind of Jekyll and Hyde transformation? Or I dramatically sweep the monitors and keyboard off my desk, pulling out the hammer and chisel to start crafting the beautiful Open Source (the serifs are tricky but well worth the trouble)?
> it cannot imagine another paradigm for software development
Assuming that sentence means anything: do all programmers who write proprietary code have to train only on proprietary code, in order to learn the correct paradigm? They cannot possibly have learnt from any open source, like, say, the code examples in a "programming cookbook" or other textbook, or they will be working from entirely the wrong paradigm.
I am all for discussions about how language models generate their output and the possibilities (or lack thereof) of tagging them for attribution, but damn that was some weird shit!
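On the tagging-for-attribution point, one commonly discussed approach is fingerprinting: hash token n-grams of the training corpus, then flag any model output that shares a long enough run with it. A minimal sketch, where the function names and the window size are my own illustrative choices, not any real Copilot mechanism:

```python
def ngram_fingerprints(text, n=5):
    """Hash every sliding window of n tokens in the text."""
    tokens = text.split()
    return {hash(tuple(tokens[i:i + n])) for i in range(len(tokens) - n + 1)}

def flag_overlap(output, corpus_fingerprints, n=5):
    """True if the output shares any n-token run with the corpus."""
    return bool(ngram_fingerprints(output, n) & corpus_fingerprints)

# Hypothetical corpus line and two candidate model outputs.
prints = ngram_fingerprints("for i in range(10): total += values[i]")
flag_overlap("for i in range(10): total += values[i]", prints)  # True
flag_overlap("print('hello world')", prints)                    # False
```

A scheme like this only catches near-verbatim runs; paraphrased or restructured code slips past it, which is part of why attribution for generative models is a hard, open problem rather than a solved one.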
"Huh? You are claiming that there is some recognisable difference in form and structure between open source and proprietary code?"
No. I said that if it has only been taught OSS, doesn't know there is anything else, has no terms of reference to _anything_ else in the universe, and has no intelligence to enable invention, how can it output anything but what it knows about, which is OSS?
To all intents and purposes the output may be identical to proprietary code, but it can't invent and doesn't understand the idea of proprietary; its universe is OSS source code.
You know actual reality, so you may not be able to abstract your thoughts to understand this system's tiny reality.
I put code on StackOverflow. It's been copied twice: it's now on GitHub in two versions, without attribution.
The copies are actually better than my source code: they've been cleaned up and commented. On the other hand, it was actually original: nobody else had done it before, or even thought about how to do it before.
The lack of attribution burns a little bit.
"AI model trained to recognize functional concepts and then generate suggestions reflecting that training."
Proper BS, and the whole article derails itself badly on the very basic capabilities of "AI".
AI does not "recognize" anything, ever. That would mean intelligence, and "AI" doesn't have any. It can only take code, shuffle it a bit (to make the copyright disappear), and present it as its own.
No more, no less. No amount of huffing or puffing will add actual intelligence to a language model.
"AI" doesn't have, and never has had, the concepts of "functional" or "concept" at the larger scale: it's literally word salad, and *all of it* is copied from somewhere else.