Even robots have the right to learn from open source

If the soap opera of Microsoft's relationship with open source had a theme tune, it'd be "The Long and Winding Goad". To a company whose entire existence depended on market control, open source's radical freedoms were an existential, cancerous threat. In return, open source was only too happy to play the upstart punk movement …

  1. ComputerSays_noAbsolutelyNo Silver badge

    Well, let's hope micros~1 only trawled the public code, and not also all the private repos.

    It would be such a shame when e.g. CoPilot suggests values if you create a variable named passwordForSQL or similar.

    There's potential for information leakage. But, most probably, it's only my imagination that's running wild.

    1. John Brown (no body) Silver badge

      I was wondering how they separate the good code from the shit code. Or does it all just get fed to the AI, leaving it potentially biased and possibly liable to make incredibly bad or stupid suggestions? I wonder what the ratio between good and bad code is on GitHub?

      1. Michael Wojcik Silver badge

        Based on anecdotal experience, CVE rates, and some academic studies in related areas, I'd say the overwhelming majority of code in GitHub is crap.

        That's also true of most other repositories, of course.

        It looks like a prototype of the Codex model used by Copilot was trained on a massive amount of data: "The amount of training data is 54 million public software repositories with 179 gigabytes of unique python files", according to one source. Doing any meaningful data hygiene on that sort of big-data volume is all but impossible. So we can assume Copilot was trained on a great deal of rubbish.

        Note, too, that one of Copilot's goals is "filling in repetitive code" – a task that explicitly violates the DRY principle and suggests that a redesign was in order anyway. Copilot appears to be in significant part a tool for creating lousy code.

  2. Howard Sway Silver badge

    The complaint isn't about the use of FOSS code, it's about attribution

    Open source people aren't luddites, nor do they care who uses their code or what they use it for, AI training included. However, most licenses require attribution, if only as an act of courtesy and appreciation for whoever wrote the code. Take the permissive MIT license:

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    The beef with Microsoft here is their attitude of "we're going to take other people's work and not give them credit".

    1. Gene Cash Silver badge

      Re: The complaint isn't about the use of FOSS code, it's about attribution

      The beef with Microsoft here is their attitude of "we're going to take other people's work and not give them credit".

      Not only that, they're going to make people pay for that work. They're making a profit off my unattributed code.

      As per the article: "The result, called Copilot, is then sold to programmers as a code suggestion aid"

      1. Flocke Kroes Silver badge

        Re: profiting off your code

        What is the licence? There are "not for commercial use" licenses but they are incompatible with some of the popular FOSS licences. For example, you would not be able to link your code to GPL libraries. Also, check the github terms of service. By using github you may have promised to bake them a birthday cake as well as granting them commercial use.

        1. Alan Brown Silver badge

          Re: profiting off your code

          GPL is simple at its heart: comply with the license conditions or you're into copyright violation territory

          The license says you must show your sources or the license is void. There are no restrictions on commercial/non-commercial use other than that

          ie: You're more than welcome to make money off it, but you MUST NOT hide the origins

          If I wanted people to use my code unattributed and hidden, I'd release it under BSD license. GPL exists because of companies which kept taking code, putting it behind paywalls and claiming copyright on other people's work

      2. Mye

        Re: The complaint isn't about the use of FOSS code, it's about attribution

        Whatever it generates it's not your unattributed code. There are numerous occasions where I have found code functionally identical to what I've written on a commercial project inside an Open source project. My code came first. Should the open source project license my code from the company I worked for?

        After decades in software development, I've come to the conclusion that the same code will be generated over and over again in different contexts and different languages because we work with a very small and very finite set of ways of expressing solutions to problems in code.

        Even in cases where there are multiple solutions to a problem, the number of solutions can be counted on one or at most two hands.

        In other words, plagiarism is inevitable because the same description can cause multiple programmers to create the same code, and that is what we have here. Give Copilot a description and from that description it uses its trained network to generate code from scratch. It's not copying your code or anybody else's code. It is generating new code based on your description.

        Copilot is working just like a human programmer in that it recognizes a pattern and reapplies that pattern in new contexts. The only difference is that it is able to scan many orders of magnitude more code than you can in order to identify patterns and figure out how to generate similar code based on a description. Another way of thinking about it is that it's like you on Stack Overflow, except much, much more efficient.

        1. unimaginative Bronze badge

          Re: The complaint isn't about the use of FOSS code, it's about attribution

          There are numerous occasions where I have found code functionally identical to what I've written on a commercial project inside an Open source project. My code came first. Should the open source project license my code from the company I worked for?

          If the author of the open source code had seen yours before writing theirs, then a court might well find it is a breach of copyright.

          This is why people write clean room reimplementations of code. This applies to Copilot as much as to a human programmer, with the added danger that someone using copilot does not know when it might do this.

          Copyright law was designed for books, does not work all that well for them, and when it comes to software it is -------->

          1. FeepingCreature Bronze badge

            Re: The complaint isn't about the use of FOSS code, it's about attribution

            > If the author of the open source code had seen yours before writing theirs, then a court might well find it is a breach of copyright.

            But is that good, or bad?

            I think it's bad. People shouldn't need to tiptoe to protect the intellectual property rights of a five line for loop.

          2. Alan Brown Silver badge

            Re: The complaint isn't about the use of FOSS code, it's about attribution

            "Copyright law was designed for books"

            also, it was only valid in geographic areas (the USA famously built up its technology by ignoring European copyrights and encouraging rampant IP theft) as well as having a term limited enough to ensure that creators COULDN'T rest on their laurels and simply rely on residuals forever

            The system is broken in a lot of ways, but GPL has turned the efforts by Eisner et al to its advantage

            It should be noted that patents and copyrights were BOTH suspended in Britain by James 1st because of rampant abuses and shakedowns not that much different to those being seen today - and weren't reinstated for over a decade (with new rules more similar to what we think of when we see the words today)

            The reason for killing the system "as was"? It was harming the economy and stifling innovation

        2. Fr. Ted Crilly Silver badge

          Re: The complaint isn't about the use of FOSS code, it's about attribution

          Convergent evolution in action...

    2. Lord Baphomet

      Re: The complaint isn't about the use of FOSS code, it's about attribution

      It isn't using the software at all, it's learning from the code. Completely different thing.

    3. VoiceOfTruth Silver badge

      Re: The complaint isn't about the use of FOSS code, it's about attribution

      And how far do you go with this? In pseudo code: printf("%s\n", somevariable) is in a piece of open source code. It is also in just about every piece of code. Now some people are claiming that open source is the original of this. Is it? How would YOU know that Microsoft created its tool for this based on YOUR line of code?

      Linux is full of copiers. Look how many desktops there are trying to emulate MacOS. Hahaha. It should come with a notice: we copied the look and feel of MacOS cos we couldn't think it up ourselves.

      1. katrinab Silver badge

        Re: The complaint isn't about the use of FOSS code, it's about attribution

        1. That example isn’t sufficiently creative to get copyright protection.

        2. Even if it was, it would be covered by fair use.

        1. Michael Wojcik Silver badge

          Re: The complaint isn't about the use of FOSS code, it's about attribution

          It's also rubbish. If you write:

          printf("%s\n", somevariable);

          you should have written

          puts(somevariable);

          which has the same side effect, is clearer, is shorter, and is more efficient.
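As a rough sketch of that point in C, assuming the shorter form meant here is the standard `puts` (which writes a string followed by a newline), the two calls can be wrapped and compared. The helper names `line_via_printf` and `line_via_puts` are made up for illustration and appear nowhere in the thread.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes s plus a newline through printf.
 * Returns the number of bytes written, or -1 on an output error. */
int line_via_printf(const char *s)
{
    int n = printf("%s\n", s);   /* printf has to parse the format string first */
    return n < 0 ? -1 : n;
}

/* Hypothetical helper: writes s plus a newline through puts.
 * puts only reports success or EOF, so the byte count is derived
 * from the string length plus the newline that puts appends. */
int line_via_puts(const char *s)
{
    if (puts(s) == EOF)
        return -1;
    return (int)strlen(s) + 1;
}
```

Calling either helper with the same string puts the same bytes on stdout; the `puts` version simply skips format parsing, which is presumably the "more efficient" part of the claim.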

          And this is why training a generative model on existing code is a bad idea. It will reproduce average code, which is terrible, because most programmers write poor-to-abysmal code. The "bash out some code and put it on the Internet" (as an old Dilbert strip had it) approach to software development[1] has been a disaster for software quality and security.

          Think twice, code once.

          [1] Which is not what proper Agile development is about, so we can skip that argument, yeah?

  3. Pascal Monett Silver badge

    Let's face it

    Borkzilla destroys everything it touches.

  4. Flocke Kroes Silver badge

    Tone deaf article

    AFAIK, Microsoft are not doing anything wrong by renting out copilot, even though it is trained on source with different authors and licences.

    The problem comes from anyone using code generated by copilot. There is the minor risk of getting sued for billions for something as trivial as rangeCheck. There is the more major problem of the copyright holder's intent. Some code is written by universities funded by government grants. They often select a BSD/MIT like license so they can track where their code is used and use it as evidence that last year's grant did something productive and they should get more next year. People often select GPL so that improvements cannot be hidden in binaries and must instead be returned to the community.

    I respect the intent of Microsoft's licenses: pay up (inclusive?) or fuck off. They should respect other people's licenses by getting copilot to generate accurate attribution and licensing requirements.

    Robots have no rights at all and certainly do not have the right to ignore copyright law. If there is a problem with the law it is that it may not be possible to hold Microsoft to account for actions taken by their badly programmed robot.

    1. John Brown (no body) Silver badge

      Re: Tone deaf article

      "If there is a problem with the law it is that it may not be possible to hold Microsoft to account for actions taken by their badly programmed robot."

      It may well eventually go to court to be settled, but I suspect the outcome of a "robot" doing something illegal will put the onus on the robot owner. They programmed it. It's their problem if they didn't test for and foresee all possible outcomes. After all, if hardware fails and kills someone, the designer/company is likely to be sued if a design flaw is shown. A "robot" doing something illegal is, by definition, a design flaw. The court case, when it comes, will be a super-sized, multi-pack of popcorn event.

      1. Alan Brown Silver badge

        Re: Tone deaf article

        "but I suspect the outcome of a "robot" doing something illegal will put the onus on the robot owner."

        I'm minded of the plethora of patents for "Doing XYZ well-known and unpatentable business process ON A COMPUTER" which proliferated in the 1990s until the courts ruled they weren't novel and rapped the USPTO over the knuckles for not rigorously inspecting things.

        Unfortunately the USPTO (and other countries too) have taken the attitude that income matters more than novelty, therefore virtually every application is approved and it's up to the courts to deal with challenges. I wouldn't be at all surprised to find patents again being issued for perpetual motion machines

        Given past history, patents & etc will start being granted for "XYZ, but USING AN AI"

        We already know that "training" on existing data is a really bad idea - the way AI insurance and legal software ended up with systemic biases against various racial groups is a classic example of how it perpetuates rotten input unless that input is utterly ruthlessly audited - and also a good example of how users took the outputs at face value "because the computer said so"

        If I wanted screeds of bad code, I'd farm it out to Bangalore(*). The only advantage an AI will have in the longer term is that it will undercut even the cheapest lousy human programmers in the "why output 12 lines of solid code when 8 pages of obfuscation will keep us employed for decades?" stakes

        (*) Sooner or later some other city/country will take that crown

    2. Lord Baphomet

      Re: Tone deaf article

      There is no chance whatsoever of being sued for billions. That's a silly thing to suggest. This tool doesn't breach the licence at all. The article is spot on.

    3. Anonymous Coward

      Re: Tone deaf article

      The law will also have to address fair use. How many lines of code exceed fair use?

      For example, "Garbage In Garbage Out" was first used in a print article in 1957 which would still be under copyright. Using the phrase without attribution or payment is fair use.

      I think the article is accurate.

      1. Michael Wojcik Silver badge

        Re: Tone deaf article

        As Google LLC v. Oracle America, Inc. showed, the courts in the US, at least, are far from agreement on questions of fair use and other aspects of copyright as they apply to software. Breyer's opinion in GvO helped establish some fair-use tests for source code, and those should guide other courts (particularly the pernicious CAFC) in similar decisions; and more importantly should squelch some similar cases, much as Alice Corp. v. CLS Bank International did for software patents.


        1. Thomas and Alito dissented from the majority in GvO, and SCOTUS has shown that it no longer gives a damn about stare decisis, despite Roberts's posturing in the past.

        2. The majority opinion in GvO, and Breyer's four tests, don't establish a bright-line test. They leave much room for subjective interpretation.

        3. There are always judges who will try to wiggle around SCOTUS decisions they dislike. There are always state legislatures that will pass laws that violate SCOTUS decisions, and those will be effective until a Federal judge gets around to staying them.

        4. Anyone can sue over anything,[1] and if the case isn't kicked immediately, a defendant may be in a very difficult position until it's resolved.

        [1] Well, modulo "vexatious litigant" status and the like, but that's rare.

        1. Alan Brown Silver badge

          Re: Tone deaf article

          "the courts in the US, at least, are far from agreement on questions of fair use"

          The saga of a song being found to have violated copyright - not because of lyrics or stanzas, but because it emulated the "look and feel" of Marvin Gaye is arguably when USA copyright law jumped the shark

    4. mmaug

      Re: Tone deaf article

      Anyone remember the SCO v Linux battles? MS had its hands in deep on that one, and this is more of the same.

      We have entered the extinguish phase of the operation...

      Licenses matter and AI does not get to ignore licenses. Just because they do not have the tech to properly apply the license conditions is not a reason for them to be ignored. If they want to stick to open permissive license then their exposure is much less, but actively using copyleft'd software to learn (even if none of it appears on the output side) is problematic. And trying to hand wave it away is incredibly naive at best and dishonest at worst.

      Carrying water for a multi-billion dollar corporation, exploiting the unpaid work of others, is not a good look.

    5. veti Silver badge

      Re: Tone deaf article

      How exactly do you "attribute" everything you learned in college? It's common knowledge. More importantly, it's your knowledge now. What you do with it is up to you. You don't have to keep explaining where you got it.

      Github is Copilot's college. That's all.

  5. Tim99 Silver badge

    "Just because it's Microsoft doesn't mean it's wrong"

    …but it would be an initial working proposition.

  6. SolarDesalination

    I'm a bit more concerned about Microsoft monetizing intellectual property that is intended as a public (or common) good than them innovating with AI.

    You're framing this as a luddites vs. progress debate when the issue I think for many (if not most) is that the largest for-profit corporation in the world is making money off the goodwill and hard work of people labouring on code to better the human condition. At minimum they should be compensating FOSS developers for the data they're pulling from their codebases to train the AI model. If not, I think coders should leave the github platform and go to the co-operative platform Codeberg where the membership have a say in how the profits are distributed.

    It's almost insulting for you to frame it as them fighting yesterday's war.

    1. cornetman Silver badge

      As has been answered above, whether or not making money from an author's code is acceptable depends on the opinion of the author, and they usually pick a license that reflects their position.

      As a contributor myself, I have absolutely no problem in Microsoft (or anyone else) monetising a service which serves up code that I have written for others to use as long as they reflect the license that is granted.

      This is primarily about attribution.

    2. VoiceOfTruth Silver badge

      For your line of code with printf, here is $0.00000000000000000000000000001, cos that is what it is worth.

  7. Anonymous Coward

    Yuk-k-k-k-k-k!!!!

    Quote: "Nobody has ever expected a human programmer, trained through open source, to attribute everything that contributed to their skills, to be repeated for all new code they write."

    True, but note the word "expected". Strangely enough, many creators of new work actually do credit the people and tools which inspired/supported the work they created: editors/reviewers, mentors, people who inspired them, close family who gave emotional support, computer tools they have used, and so on. This M$ Copilot thing strips away ALL of these varied types of attribution from the work which trained the so-called AI.

    All we have left is an attribution like: "I'd like to thank M$ Copilot for help in creating this new work". Yuk-k-k-k-k-k!!!!

    1. veti Silver badge

      Re: Yuk-k-k-k-k-k!!!!

      There's a huge difference between "choosing to mention a handpicked selection of people who you think particularly inspired you to create something", versus "being obliged to list every teacher you ever had, every book you ever read and every person who ever, deliberately or not, taught you anything".

      The latter is what the critics are calling for here, and it's ridiculous.

      1. yetanotheraoc Silver badge

        Re: Yuk-k-k-k-k-k!!!!

        It's ridiculous for a human to do that. For a piece of software, it's not necessarily ridiculous. Compare the effort of mining the attribution (i.e. mentioning the source as well as mentioning the credits from the source) to the effort of mining the code itself. It seems quite do-able to me, but they didn't even try.

  8. silent_count

    A Thought Experiment

    Imagine someone trains an AI using Microsoft's source code and then distributes their Co-Penguin-Pilot AI using a creative commons license.

    The argument is that the author is not breaking an NDA, distributing proprietary code or violating Microsoft's IP because the AI is only offering snippets of code. And those snippets may have come from the 4 lines of the author's own code in the training set.

    Do you reckon there's even one of Microsoft's army of lawyers who would consider that fair play?

    1. Lord Baphomet

      Re: A Thought Experiment

      It isn't a snippet tool. It's not giving you a copy of other people's code. I do wish people would stop talking about something they haven't tried and don't understand.

      1. mmaug

        Re: A Thought Experiment

        MS paid SCO to sue IBM because there were 'similarities' in lines between ctype.h Linux and the Unix source that MS bought and gave to SCO.

        Don't think the lawyers won't go after fair-use; they've nearly destroyed fair-use in the music industry.

        1. FeepingCreature Bronze badge

          Re: A Thought Experiment

          But surely just because Microsoft are evil doesn't mean we must be evil in response.

          Is Microsoft's behavior hypocritical? Yes, it's unbelievably hypocritical, in that previously they were doing bad things and now they're trying to avail themselves of the protections of a charitable interpretation that they'd previously attacked. But charity is good no matter who avails themselves of it.

  9. thejoelr

    Huh?

    Anyway, the github license allows them to do what they want with things put there. Read your licenses before you use things.. especially if they're someone like Microsoft. My only question is if someone puts a repo there and isn't associated with the original work that lives elsewhere.

    Anyway, for anyone harboring the delusion that github wasn't really Microsoft... now is the time to move off of github.

    1. Flocke Kroes Silver badge

      Re: Huh?

      At a guess, if I put your work in github without your permission and you are unhappy about how it gets used Microsoft can try to recover their losses (if any) from me.

    2. John Brown (no body) Silver badge

      Re: Huh?

      "now is the time to move off of github."

      But is your repo really gone if you delete it? Will MS still have a secret copy stashed away for a rainy day?

      1. Anonymous Coward

        Re: Huh?

        "What has been seen can not be unseen."

        Once someone gives away something to a cloud, there's no reasonable way to prove everything about what / has been / is being / will be / done with it.

    3. Anonymous Coward

      Re: Huh?

      Then again, M$ (or anybody else) could also incorporate all of GitLab's code.

    4. katrinab Silver badge

      Re: Huh?

      Suppose I take your GPL-licensed code, modify it, and distribute those modified binaries.

      In order to comply with the requirement to distribute the source code of the modified version, I point everyone who downloads it to a GitHub repository containing the source code.

      You are not a signatory to the GitHub license agreement, but most of the code is yours. How does this work?

  10. amanfromMars 1 Silver badge

    The Bigger Deal .... Future Derivative Option for Pricing/Cost Benefit Analysis

    In IT especially, which is entirely about machines taking over human tasks, the deal with the devil has long been done.

    Hmmm? A.N.Others would be content telling you the long deal already done is in particular and peculiar regard to virtual machines taking over humans ....... which delivers a whole greater magnitude of practical and physical effect almost entirely unaffected by contrarian human input or output.

    A much better argument is that if Microsoft isn't documenting its training data well enough to identify source files, the training data itself is suspect, and undesirable outputs will be harder to diagnose. But that's a reason to avoid Copilot, not to abstain from GitHub.

    Oh? Is such an undesirable outcome not a very valid reason to hack Copilot in order to expose its inherent and/or unpleasant flaws?

    The beef with Microsoft here is their attitude of "we're going to take other people's work and not give them credit”. ..... Howard Sway

    But Howard, that is the American way, is it not? Take what you can and to hell with the rest and unintended consequences?

    It is certainly most unlikely to change any time soon, methinks, with more than just a few thinking it is all quite acceptable ..... and supported in that notion because it can be so extremely positively rewarding with mountains of flash cash to stash one of the major attractions it is so easy to be hopelessly addicted to.

    Some recognise that not as an almighty strength to be mindlessly lauded and applauded, which it surely can be, but a catastrophic weakness to be mercilessly exploited and repurposed ..... either to merciful extinction or worthy improved refactoring.

  11. heyrick Silver badge

    If it's not immoral for humans, how can it be for AIs?

    Easily. The average human isn't capable of reading the entirety of the source on Github, remembering it in exactness, and offering snippets up to everybody who asks. The AI very much is capable of exactly this.

    The human may use other people's code as a guide to how they approached a particular problem, helping them to develop their own solution. Or, yes, copy-paste a bit of code. But they're unlikely to be offering bits of other people's code to anybody else who asks. The AI, on the other hand, won't be using it to learn anything other than how to more effectively offer appropriate solutions for what somebody else is doing.

    Now, on the face of it this might not seem like a bad thing as it will be an appropriate suggestion without the human having to go and read loads of code in order to find it. However the problem is that it is presented without context. And whether you like it or not, the terms of the licence are part of that context.

    1. Lord Baphomet

      Re: If it's not immoral for humans, how can it be for AIs?

      Again, someone who hasn't actually used the tool or understood what it's doing but who still thinks it's appropriate to comment.

      It isn't a snippet tool and it isn't copying code from repos. It doesn't memorize all of github's code and it's not a snippet search engine.

    2. Fifth Horseman

      Re: If it's not immoral for humans, how can it be for AIs?

      I hear your concerns about licensing, and share them. However, I think there is a more fundamental problem here.

      We all start from other people's work to guide and inform our own, that's how we learn. In pre-WWW days we used textbooks: a good fifty percent of what I have done probably had a starting point in "Numerical Recipes" or "The Art of Electronics", aided by numerous manufacturer's application notes. Not intrinsically any more reliable than a random web page, granted, but at least they have been past an editor and a proof-reader first...

      Over the last couple of years though, I have seen more and more copy and paste programming, with little effort made to understand how the code works, whether it is really appropriate in the current application, or indeed what it actually does. In hardware, it is sometimes worse - there are some great open source designs out there, but certainly in websites targeted at the maker community, I have seen designs ranging from "won't work under any circumstances" to "will probably kill you".

      My concern here is that if the AI (god, I hate that term) algorithm has suggested the solution, there will be much less chance that the naive user will be critical of what is put in front of them. The computer said it is right, and it is better at this than me, so it must be OK? As someone has pointed out earlier, we don't know the signal to noise ratio of the GitHub codebase, and neither does the AI bot.

      Anyway, just my two pennorth. Feeling even more cynical than usual.

  12. original_rwg


    "Just because it's Microsoft doesn't mean it's wrong"

    But they'll fix it on Patch Tuesday....

  13. Lord Baphomet

    Spot On

    This article is exactly right. Until and unless OSS licences are changed to account for this novel use of source code, CoPilot is perfectly acceptable.

    CoPilot doesn't copy source code. It isn't a search engine. It doesn't offer you snippets of open source code. In fact, it does little more than what thousands of other tools do - it reads the public code in github and builds a statistical model based on what it finds. Lots of other research has been done on the source found in github and nobody has complained about any of it.

    Any complaint about this tool is spurious, and as most of these comments prove, is voiced by people who haven't actually seen and don't understand the tool.

    1. heyrick Silver badge

      Re: Spot On

      Doesn't copy source? Doesn't offer snippets?

      Please watch (from ~50 seconds) and explain to me what that is, if not working out what the user is wanting to do and then filling in a prewritten function body in order to do it.

      1. Falmari Silver badge

        Re: Spot On

        @heyrick “Doesn't offer snippets?”

        What @Lord Baphomet wrote was “It doesn't offer you snippets of open source code.”

        From that link you provided all I see is Co-Pilot auto completing or providing snippets of code. But I can’t say from looking at that video if the snippets are or are not copies of open-source code.

        What I did find interesting was from about 3:10 in, where he deleted his own manually entered code which Co-Pilot had not seen, but left the comments. He let Co-Pilot generate the code from those comments and it created basically the same code snippet, just different variable names.

        So, if the code snippet Co-Pilot produced was open-source, does that mean the code he previously wrote would also be open-source?

        1. heyrick Silver badge

          Re: Spot On

          "But I can’t say from looking at that video if the snippets are or are not copies of open-source code."

          ...and is that not the argument?

    2. heyrick Silver badge

      Re: Spot On

      "Until and unless OSS licences are changed to account for this novel use of source code, CoPilot is perfectly acceptable."

      And, for what it is worth, many open source licences contain text to the effect of "this legal bullshit must be verily copied in any and all incarnations and copies of this code".

      Well? Where's that? There's nothing wrong with the existing licences and everything wrong with CoPilot.

      Say, would CoPilot offer suggestions based upon Microsoft's own, closed, code base? If not, why not?

      1. FeepingCreature Bronze badge

        Re: Spot On

        I think that should also be fine. I guess you'd go after the person who trained the AI on it because they didn't have rights to even look at the code. Though Microsoft of course have the license to their own code.

  14. katrinab Silver badge

    As I’ve said before, in an “AI”/ML system, the training data is the source code.

    If you use someone else’s code in your product, you have to comply with the license.

    1. Falmari Silver badge

      @katrinab “If you use someone else’s code in your product, you have to comply with the license.”

      But is that code in the product, or has it just been used to produce the product? Much like using an open-source editor to code your software.

      My opinion is Microsoft were wrong to use that open-source code to train their AI/ML. But the article's argument that training an AI/ML can be viewed in the same way as a person learning is interesting and does have merit. It certainly highlights the need for rules/regulations/legislation on how public and private data can and cannot be used for training AI/ML systems.

      1. katrinab Silver badge

        The training data determines how the program responds to inputs, therefore it is the code, even if it is a bunch of jpegs.

  15. RobLang

    Microsoft didn't create CoPilot, OpenAI did

    Disappointed with El Reg for this whopper:

    "Microsoft has been industriously mining the code in the GitHub repositories and feeding it to an AI to train it in programming"

    Is wrong.

    CoPilot's AI engine is Codex, built by OpenAI (not Microsoft), a reduction of GPT-3, also created by OpenAI. Microsoft has the exclusive license for GPT-3. Codex is a reduction model of GPT-3, which is trained on the internet as a whole as a general language processor.

    I would expect an article bashing Microsoft (which I'm all for) to include some basic facts. It wasn't Microsoft trawling the web, it was Elon Musk's OpenAI. Microsoft is licensing not embracing, enhancing and extinguishing - for once. Microsoft is monetising the resultant model but then so is OpenAI.

    1. FeepingCreature Bronze badge

      Re: Microsoft didn't create CoPilot, OpenAI did

      For better or worse, Elon was not very involved in OpenAI.

  16. Anonymous Coward

    It's not a snippet / search engine

    As posted elsewhere - it's not copying people's code - it is learning how to code from examples, just like everyone else does. It may do so at a scale that is impossible by human standards, but that's where its benefit is supposed to come from - it will have a wealth of knowledge useful to a wide range of people.

    The article is spot on, IMHO; if we want (any) useful tools from an AI, it needs to learn from useful training sets. And in the case of code, an open repository of diverse projects is the obvious choice and we could all be beneficiaries of the results in due course.

    Incidentally, the UK Govt has been mulling exactly this issue over, and has decided the same way as this article:

    1. yetanotheraoc Silver badge

      Re: It's not a snippet / search engine

      "it's not copying people's code - it is learning how to code from examples"

      The "it's copying" camp simply have a lower opinion of Machine "Learning" (ML) and Artificial "Intelligence" (AI) than you do. To wit: it *is* copying, and the heuristic just determines based on context which piece of code from its massive database is the most likely to fit the context. The intelligence is in the heuristic, but the copying resides in the database. That's not at all the way humans learn; calling it learning and intelligence just dodges the issue.

  17. Plest Silver badge

    Who in their right mind would call open source a "cancerous threat"?

    Oh yes...

  18. Anonymous Coward

    GPL code is GPL code

    If Copilot was trained on any GPL code (especially AGPL), then anything it generates is also GPL and any project that uses that code must be GPL'd.

    Great for F/OSS, not so great for proprietary companies who should be considering ditching GitHub or, at the very least, banning their developers from using Copilot.

    1. Ken Hagan Gold badge

      Re: GPL code is GPL code

      If your argument stands up, then anyone who has ever learned anything from looking at GPL-ed code is forever afterwards incapable of writing non-GPL code.

      The existence of WSL therefore makes the whole of Windows GPL.

      1. matjaggard

        Re: GPL code is GPL code

        That's true but it is more nuanced than the article implies. When I was working on a project once, we'd licensed proprietary source code doing a job and we wanted to dump that licensed code - first we needed to spec the work it did and get it implemented by someone who had never seen the licensed code, because anyone who had seen it would almost certainly have learned things from it and implemented them based on what they learned. This is very similar except that previously we were asking humans to make a judgement call on when it was valid to use learnings from differently licensed code. Now this tool is using that learning without giving the human any knowledge of the original license so the human cannot make the judgement call any more. That means that the AI is left to make a moral call on how and when it is valid to show some code to the human - maybe they have included some safety to codify this judgement, but I doubt it.

  19. teebie

    you can't legislate for novel use.... use of land/aircraft ...Phonographs/sheet IP, internet/broadcast radio IP

    There are plenty of counterexamples: assault laws deal with tasers and pepper spray, harassment laws deal with online harassment, fraud laws deal with online fraud, libel with online libel, and employment laws (eventually) with Uber.

  20. carltonh

    Bad music industry analogy

    The prog rock vs. punk rock analogy is so wrong. Historically, the establishment Big Music Industry controlled AM and tried to initially keep FM illegal, and then irrelevant. It became popular ~1970 as independent media where it was outside of industry control. Therefore, prog rock developed because it was outside of industry control where they would no longer be bound by the 3-minute pop song of formulaic drivel. Musicianship finally mattered as well as creativity. At the end of the prog rock golden age, mega-industry music got control of FM radio. They created punk rock as a new formulaic 3-minute pop song that pretended to be anti-establishment to hide how extremely establishment it really was. Musicianship was no longer important, only image and thus each band and band member was highly replaceable and fully controlled in punk rock.

  21. Sam Adams the Dog

    People who think this is terrible are basically nuts

    May I remind you that it is completely legal in every way to make commercial use of FOSS code? Think of all those commercial HPC clusters running Linux.

    And there has never been a prohibition against reading FOSS code, seeing what you can learn from it, and using what you have learned when writing your own proprietary code. Yes, fuller attribution would have been gentlemanly, but there is no FOSS requirement to be gentlemanly, as anyone who has listened to Linus or Stallman over the years well knows.

    Those who point out that most FOSS code is crap are likely right, but there is no justification for the assumption that the training process assumes that it is all great. Though I don't know what the process was, it is extremely unlikely to have been that, even aside from the fact that FOSS code varies strongly from case to case in code style, safety and correctness.
