Even robots have the right to learn from open source

If the soap opera of Microsoft's relationship with open source had a theme tune, it'd be "The Long and Winding Goad". To a company whose entire existence depended on market control, open source's radical freedoms were an existential, cancerous threat. In return, open source was only too happy to play the upstart punk movement …


  1. ComputerSays_noAbsolutelyNo Silver badge
    Paris Hilton

    Well, let's hope micros~1 only trawled the public code, and not also all the private repos.

    It would be such a shame when e.g. CoPilot suggests values if you create a variable named passwordForSQL or similar.

    There's potential for information leakage. But, most probably, it's only my imagination that's running wild.

    1. John Brown (no body) Silver badge

      I was wondering how they separate the good code from the shit code. Or does it all just get fed to the AI, leaving it potentially biased and possibly liable to make incredibly bad or stupid suggestions? I wonder what the ratio between good and bad code is on GitHub?

      1. Michael Wojcik Silver badge

        Based on anecdotal experience, CVE rates, and some academic studies in related areas, I'd say the overwhelming majority of code in GitHub is crap.

        That's also true of most other repositories, of course.

        It looks like a prototype of the Codex model used by Copilot was trained on a massive amount of data: "The amount of training data is 54 million public software repositories with 179 gigabytes of unique python files", according to one source. Doing any meaningful data hygiene on that sort of big-data volume is all but impossible. So we can assume Copilot was trained on a great deal of rubbish.

        Note, too, that one of Copilot's goals is "filling in repetitive code" – a task that explicitly violates the DRY principle and suggests that a redesign was in order anyway. Copilot appears to be in significant part a tool for creating lousy code.

  2. Howard Sway Silver badge

    The complaint isn't about the use of FOSS code, it's about attribution

    Open source people aren't luddites, nor do they care who uses their code or what they use it for, AI training included. However, most licenses require attribution, if only as an act of courtesy and appreciation for whoever wrote the code. Take the permissive MIT license:

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    The beef with Microsoft here is their attitude of "we're going to take other people's work and not give them credit".

    1. Gene Cash Silver badge

      Re: The complaint isn't about the use of FOSS code, it's about attribution

      The beef with Microsoft here is their attitude of "we're going to take other people's work and not give them credit".

      Not only that, they're going to make people pay for that work. They're making a profit off my unattributed code.

      As per the article: "The result, called Copilot, is then sold to programmers as a code suggestion aid"

      1. Flocke Kroes Silver badge

        Re: profiting off your code

        What is the licence? There are "not for commercial use" licenses but they are incompatible with some of the popular FOSS licences. For example, you would not be able to link your code to GPL libraries. Also, check the github terms of service. By using github you may have promised to bake them a birthday cake as well as given commercial use.

        1. Alan Brown Silver badge

          Re: profiting off your code

          GPL is simple at its heart: comply with the license conditions or you're into copyright violation territory

          The license says you must show your sources or the license is void. There are no restrictions on commercial/non-commercial use other than that

          ie: You're more than welcome to make money off it, but you MUST NOT hide the origins

          If I wanted people to use my code unattributed and hidden, I'd release it under BSD license. GPL exists because of companies which kept taking code, putting it behind paywalls and claiming copyright on other people's work

      2. Mye

        Re: The complaint isn't about the use of FOSS code, it's about attribution

        Whatever it generates it's not your unattributed code. There are numerous occasions where I have found code functionally identical to what I've written on a commercial project inside an Open source project. My code came first. Should the open source project license my code from the company I worked for?

        After decades in software development, I've come to the conclusion that the same code will be generated over and over again in different contexts and different languages because we work with a very small and very finite set of ways of expressing solutions to problems in code.

        Even in cases where there are multiple solutions to a problem, the number of solutions can be counted on one or at most two hands.

        In other words, plagiarism is inevitable because the same description can cause multiple programmers to create the same code, and that is what we have here. Give Copilot a description, and from that description it uses its trained network to generate code from scratch. It's not copying your code or anybody else's code. It is generating new code based on your description.

        Copilot is working just like a human programmer in that it recognizes a pattern and reapplies that pattern in new contexts. The only difference is that it is able to scan many orders of magnitude more code than you can in order to identify patterns and figure out how to generate similar code based on a description. Another way of thinking about it is that it's like you on Stack Overflow, except much, much more efficient.

        1. unimaginative
          Devil

          Re: The complaint isn't about the use of FOSS code, it's about attribution

          There are numerous occasions where I have found code functionally identical to what I've written on a commercial project inside an Open source project. My code came first. Should the open source project license my code from the company I worked for?

          If the author of the open source code had seen yours before writing theirs, then a court might well find it is a breach of copyright.

          This is why people write clean room reimplementations of code. This applies to Copilot as much as to a human programmer, with the added danger that someone using copilot does not know when it might do this.

          Copyright law was designed for books, does not work all that well for them, and when it comes to software it is -------->

          1. FeepingCreature Bronze badge

            Re: The complaint isn't about the use of FOSS code, it's about attribution

            > If the author of the open source code had seen yours before writing theirs, then a court might well find it is a breach of copyright.

            But is that good, or bad?

            I think it's bad. People shouldn't need to tiptoe to protect the intellectual property rights of a five line for loop.

          2. Alan Brown Silver badge

            Re: The complaint isn't about the use of FOSS code, it's about attribution

            "Copyright law was designed for books"

            also, it was only valid in geographic areas (the USA famously built up its technology by ignoring European copyrights and encouraging rampant IP theft) as well as having a term limited enough to ensure that creators COULDN'T rest on their laurels and simply rely on residuals forever

            The system is broken in a lot of ways, but GPL has turned the efforts by Eisner et al to its advantage

            It should be noted that patents and copyrights were BOTH suspended in Britain by James I because of rampant abuses and shakedowns not that much different to those being seen today - and weren't reinstated for over a decade (with new rules more similar to what we think of when we see the words today)

            The reason for killing the system "as was"? It was harming the economy and stifling innovation

        2. Fr. Ted Crilly Bronze badge

          Re: The complaint isn't about the use of FOSS code, it's about attribution

          Convergent evolution in action...

    2. Lord Baphomet

      Re: The complaint isn't about the use of FOSS code, it's about attribution

      It isn't using the software, at all, it's preserving from the code. Completely different thing.

    3. VoiceOfTruth Silver badge

      Re: The complaint isn't about the use of FOSS code, it's about attribution

      And how far do you go with this? In pseudo code: printf("%s\n", somevariable) is in a piece of open source code. It is also in just about every piece of code. Now some people are claiming that open source is the original of this. Is it? How would YOU know that Microsoft created its tool for this based on YOUR line of code?

      Linux is full of copiers. Look how many desktops there are trying to emulate MacOS. Hahaha. It should come with a notice: we copied the look and feel of MacOS cos we couldn't think it up ourselves.

      1. katrinab Silver badge
        Megaphone

        Re: The complaint isn't about the use of FOSS code, it's about attribution

        1. That example isn’t sufficiently creative to get copyright protection.

        2. Even if it was, it would be covered by fair use.

        1. Michael Wojcik Silver badge

          Re: The complaint isn't about the use of FOSS code, it's about attribution

          It's also rubbish. If you write:

          printf("%s\n", somevariable);

          you should have written

          puts(somevariable);

          which has the same side effect, is clearer, is shorter, and is more efficient.
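A minimal sketch of the comparison above, in C. The helper names `format_line` and `print_both` are made up for illustration. It shows that both calls emit the string followed by a newline, and that the one visible difference is the return value: `printf` reports the number of characters written, while `puts` only promises a non-negative value on success.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: builds in buf the exact bytes that
 * printf("%s\n", s) would write. Returns what snprintf returns,
 * i.e. the length of the formatted text. */
int format_line(char *buf, size_t size, const char *s) {
    return snprintf(buf, size, "%s\n", s);
}

/* Hypothetical helper: prints s and a newline with both calls.
 * Returns 1 if both calls report success, 0 otherwise. */
int print_both(const char *s) {
    int a = printf("%s\n", s); /* character count, or negative on error */
    int b = puts(s);           /* non-negative on success, EOF on error */
    return a >= 0 && b != EOF;
}
```

Calling `print_both("hello")` prints the same line twice. Some compilers, GCC among them, can make this substitution themselves when optimizing, so the efficiency gain from writing `puts` by hand may be small in practice; the clarity argument stands either way.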

          And this is why training a generative model on existing code is a bad idea. It will reproduce average code, which is terrible, because most programmers write poor-to-abysmal code. The "bash out some code and put it on the Internet" (as an old Dilbert strip had it) approach to software development[1] has been a disaster for software quality and security.

          Think twice, code once.

          [1] Which is not what proper Agile development is about, so we can skip that argument, yeah?

  3. Pascal Monett Silver badge
    Coat

    Let's face it

    Borkzilla destroys everything it touches.

  4. Flocke Kroes Silver badge

    Tone deaf article

    AFAIK, Microsoft are not doing anything wrong by renting out copilot, even though it is trained on source with different authors and licences.

    The problem comes from anyone using code generated by copilot. There is the minor risk of getting sued for billions for something as trivial as rangeCheck. There is the more major problem of the copyright holder's intent. Some code is written by universities funded by government grants. They often select a BSD/MIT like license so they can track where their code is used and use it as evidence that last year's grant did something productive and they should get more next year. People often select GPL so that improvements cannot be hidden in binaries and must instead be returned to the community.
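For a sense of how trivial a range check is, here is a hypothetical C analogue. It is not the code that was argued over in court, and the names `range_status` and `range_check` are invented for this sketch. It only checks that a half-open range fits inside an array of a given length.

```c
#include <stddef.h>

/* Possible outcomes of the check. */
enum range_status { RANGE_OK = 0, RANGE_INVERTED, RANGE_OUT_OF_BOUNDS };

/* Validates the range [from, to) against an array of `length` elements.
 * Indices are unsigned, so no test for negative values is needed. */
enum range_status range_check(size_t length, size_t from, size_t to) {
    if (from > to) {
        return RANGE_INVERTED;      /* start lies after end */
    }
    if (to > length) {
        return RANGE_OUT_OF_BOUNDS; /* end runs past the array */
    }
    return RANGE_OK;
}
```

Almost anyone asked to write this would produce something very close to it, which is why a dispute over code of this size looks so out of proportion.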

    I respect the intent of Microsoft's licenses: pay up (inclusive?) or fuck off. They should respect other people's licenses by getting copilot to generate accurate attribution and licensing requirements.

    Robots have no rights at all and certainly do not have the right to ignore copyright law. If there is a problem with the law it is that it may not be possible to hold Microsoft to account for actions taken by their badly programmed robot.

    1. John Brown (no body) Silver badge

      Re: Tone deaf article

      "If there is a problem with the law it is that it may not be possible to hold Microsoft to account for actions taken by their badly programmed robot."

      It may well eventually go to court to be settled, but I suspect the outcome of a "robot" doing something illegal will put the onus on the robot owner. They programmed it. It's their problem if they didn't test for and foresee all possible outcomes. After all, if hardware fails and kills someone, the designer/company is likely to be sued if a design flaw is shown. A "robot" doing something illegal is, by definition, a design flaw. The court case, when it comes, will be a super-sized, multi-pack of popcorn event.

      1. Alan Brown Silver badge

        Re: Tone deaf article

        "but I suspect the outcome of a "robot" doing something illegal will put the onus on the robot owner."

        I'm minded of the plethora of patents for "Doing XYZ well-known and unpatentable business process ON A COMPUTER" which proliferated in the 1990s until the courts ruled they weren't novel and rapped the USPTO over the knuckles for not rigorously inspecting things.

        Unfortunately the USPTO (and other countries too) have taken the attitude that income matters more than novelty, therefore virtually every application is approved and it's up to the courts to deal with challenges. I wouldn't be at all surprised to find patents again being issued for perpetual motion machines

        Given past history, patents & etc will start being granted for "XYZ, but USING AN AI"

        We already know that "training" on existing data is a really bad idea - the way AI insurance and legal software ended up with systemic biases against various racial groups is a classic example of how it perpetuates rotten input unless that input is utterly ruthlessly audited - and also a good example of how users took the outputs at face value "because the computer said so"

        If I wanted screeds of bad code, I'd farm it out to Bangalore(*). The only advantage an AI will have in the longer term is that it will undercut even the cheapest lousy human programmers in the "why output 12 lines of solid code when 8 pages of obfuscation will keep us employed for decades?" stakes

        (*) Sooner or later some other city/country will take that crown

    2. Lord Baphomet

      Re: Tone deaf article

      There is no chance whatsoever of being sued for billions. That's a silly thing to suggest. This tool doesn't breach the licence at all. The article is spot on.

    3. Anonymous Coward
      Anonymous Coward

      Re: Tone deaf article

      The law will also have to address fair use. How many lines of code exceed fair use?

      For example, "Garbage In Garbage Out" was first used in a print article in 1957 which would still be under copyright. Using the phrase without attribution or payment is fair use.

      I think the article is accurate.

      1. Michael Wojcik Silver badge

        Re: Tone deaf article

        As Google LLC v. Oracle America, Inc. showed, the courts in the US, at least, are far from agreement on questions of fair use and other aspects of copyright as they apply to software. Breyer's opinion in GvO helped establish some fair-use tests for source code, and those should guide other courts (particularly the pernicious CAFC) in similar decisions; and more importantly should squelch some similar cases, much as Alice Corp. v. CLS Bank International did for software patents.

        However:

        1. Thomas and Alito dissented from the majority in GvO, and SCOTUS has shown that it no longer gives a damn about stare decisis, despite Roberts's posturing in the past.

        2. The majority opinion in GvO, and Breyer's four tests, don't establish a bright-line test. They leave much room for subjective interpretation.

        3. There are always judges who will try to wiggle around SCOTUS decisions they dislike. There are always state legislatures that will pass laws that violate SCOTUS decisions, and those will be effective until a Federal judge gets around to staying them.

        4. Anyone can sue over anything,[1] and if the case isn't kicked immediately, a defendant may be in a very difficult position until it's resolved.

        [1] Well, modulo "vexatious litigant" status and the like, but that's rare.

        1. Alan Brown Silver badge

          Re: Tone deaf article

          "the courts in the US, at least, are far from agreement on questions of fair use"

          The saga of a song being found to have violated copyright - not because of lyrics or stanzas, but because it emulated the "look and feel" of Marvin Gaye is arguably when USA copyright law jumped the shark

    4. mmaug

      Re: Tone deaf article

      Anyone remember the SCO v. Linux battles (GrokLaw.net)? MS had its hands in deep on that one and this is more of the same.

      We have entered the extinguish phase of the operation...

      Licenses matter and AI does not get to ignore licenses. Just because they do not have the tech to properly apply the license conditions is not a reason for them to be ignored. If they want to stick to open permissive license then their exposure is much less, but actively using copyleft'd software to learn (even if none of it appears on the output side) is problematic. And trying to hand wave it away is incredibly naive at best and dishonest at worst.

      Carrying water for a multi-billion dollar corporation, exploiting the unpaid work of others, is not a good look.

    5. veti Silver badge

      Re: Tone deaf article

      How exactly do you "attribute" everything you learned in college? It's common knowledge. More importantly, it's your knowledge now. What you do with it is up to you. You don't have to keep explaining where you got it.

      Github is Copilot's college. That's all.

  5. Tim99 Silver badge
    Joke

    "Just because it's Microsoft doesn't mean it's wrong"

    …but it would be an initial working proposition.

  6. SolarDesalination

    I'm a bit more concerned about Microsoft monetizing intellectual property that is intended as a public (or common) good than them innovating with AI.

    You're framing this as a luddites vs. progress debate when the issue I think for many (if not most) is that the largest for-profit corporation in the world is making money off the goodwill and hard work of people labouring on code to better the human condition. At minimum they should be compensating FOSS developers for the data they're pulling from their codebases to train the AI model. If not, I think coders should leave the github platform and go to the co-operative platform Codeberg where the membership have a say in how the profits are distributed.

    It's almost insulting for you to frame it as them fighting yesterday's war.

    1. cornetman Silver badge

      As has been answered above, whether or not making money from an author's code is acceptable depends on the opinion of the author, and they usually pick a license that reflects their position.

      As a contributor myself, I have absolutely no problem in Microsoft (or anyone else) monetising a service which serves up code that I have written for others to use as long as they reflect the license that is granted.

      This is primarily about attribution.

    2. VoiceOfTruth Silver badge

      For your line of code with printf, here is $0.00000000000000000000000000001, cos that is what it is worth.

  7. Anonymous Coward
    Anonymous Coward

    Yuk-k-k-k-k-k!!!!

    Quote: "Nobody has ever expected a human programmer, trained through open source, to attribute everything that contributed to their skills, to be repeated for all new code they write."

    True, but note the word "expected". Strangely enough, many creators of new work actually do credit the people and tools which inspired/supported the work they created: editors/reviewers, mentors, people who inspired them, close family who gave emotional support, computer tools they have used, and so on. This M$ Copilot thing strips away ALL of these varied types of attribution from the work which trained the so-called AI.

    All we have left is an attribution like: "I'd like to thank M$ Copilot for help in creating this new work". Yuk-k-k-k-k-k!!!!

    1. veti Silver badge

      Re: Yuk-k-k-k-k-k!!!!

      There's a huge difference between "choosing to mention a handpicked selection of people who you think particularly inspired you to create something", versus "being obliged to list every teacher you ever had, every book you ever read and every person who ever, deliberately or not, taught you anything".

      The latter is what the critics are calling for here, and it's ridiculous.

      1. yetanotheraoc Silver badge

        Re: Yuk-k-k-k-k-k!!!!

        It's ridiculous for a human to do that. For a piece of software, it's not necessarily ridiculous. Compare the effort of mining the attribution (i.e. mentioning the source as well as mentioning the credits from the source) to the effort of mining the code itself. It seems quite do-able to me, but they didn't even try.
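A rough sketch of what "mining the attribution" could mean on the data side, assuming (hypothetically) that each training sample is stored with a small provenance record. The struct, its fields and `format_attribution` are all invented for illustration and say nothing about how Copilot actually stores its data. Tracing a generated suggestion back to particular samples is a separate and harder problem that this does not address.

```c
#include <stdio.h>

/* Hypothetical provenance record kept next to a training sample. */
struct provenance {
    const char *repo_url;   /* where the sample was collected from */
    const char *license_id; /* licence name, e.g. "MIT" */
    const char *copyright;  /* copyright notice found in the source */
};

/* Writes a one-line attribution into buf.
 * Returns what snprintf returns: the length the full line needs. */
int format_attribution(char *buf, size_t size, const struct provenance *p) {
    return snprintf(buf, size, "Derived from %s (%s). %s",
                    p->repo_url, p->license_id, p->copyright);
}
```

The point is only that the licence and copyright notice sit in the same repository as the code, so collecting them at scrape time costs little compared with collecting the code itself.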

  8. silent_count
    Paris Hilton

    A Thought Experiment

    Imagine someone trains an AI using Microsoft's source code and then distributes their Co-Penguin-Pilot AI using a creative commons license.

    The argument is that the author is not breaking an NDA, distributing proprietary code or violating Microsoft's IP because the AI is only offering snippets of code. And those snippets may have come from the 4 lines of the author's own code in the training set.

    Do you reckon there's even one of Microsoft's army of lawyers who would consider that fair play?

    1. Lord Baphomet

      Re: A Thought Experiment

      It isn't a snippet tool. It's not giving you a copy of other people's code. I do wish people would stop talking about something they haven't tried and don't understand.

      1. mmaug

        Re: A Thought Experiment

        MS paid SCO to sue IBM because there were 'similarities' in lines between ctype.h Linux and the Unix source that MS bought and gave to SCO.

        Don't think the lawyers won't go after fair-use; they've nearly destroyed fair-use in the music industry.

        1. FeepingCreature Bronze badge

          Re: A Thought Experiment

          But surely just because Microsoft are evil means we must be evil in response.

          Is Microsoft's behavior hypocritical? Yes, it's unbelievably hypocritical, in that previously they were doing bad things and now they're trying to avail themselves of the protections of a charitable interpretation that they'd previously attacked. But charity is good no matter who avails themselves of it.

  9. thejoelr

    Huh?

    Wow.

    Anyway, the github license allows them to do what they want with things put there. Read your licenses before you use things.. especially if they're someone like Microsoft. My only question is if someone puts a repo there and isn't associated with the original work that lives elsewhere.

    Anyway, for anyone harboring the delusion that github wasn't really Microsoft... now is the time to move off of github.

    1. Flocke Kroes Silver badge

      Re: Huh?

      At a guess, if I put your work in github without your permission and you are unhappy about how it gets used, Microsoft can try to recover their losses (if any) from me.

    2. John Brown (no body) Silver badge

      Re: Huh?

      "now is the time to move off of github."

      But is your repo really gone if you delete it? Will MS still have a secret copy stashed away for a rainy day?

      1. Anonymous Coward
        Anonymous Coward

        Re: Huh?

        "What has been seen can not be unseen."

        Once someone gives away something to a cloud, there's no reasonable way to prove everything about what / has been / is being / will be / done with it.

    3. Anonymous Coward
      Anonymous Coward

      Re: Huh?

      Then again, M$ (or anybody else) could also incorporate all of GitLab's code.

    4. katrinab Silver badge
      Megaphone

      Re: Huh?

      Suppose I take your GPL-licensed code, modify it, and distribute those modified binaries.

      In order to comply with the requirement to distribute the source code of the modified version, I point everyone who downloads it to a GitHub containing the source code.

      You are not a signatory to the GitHub license agreement, but most of the code is yours. How does this work?

  10. amanfromMars 1 Silver badge

    The Bigger Deal .... Future Derivative Option for Pricing/Cost Benefit Analysis

    In IT especially, which is entirely about machines taking over human tasks, the deal with the devil has long been done.

    Hmmm? A.N.Others would be content telling you the long deal already done is in particular and peculiar regard to virtual machines taking over humans ....... which delivers a whole greater magnitude of practical and physical effect almost entirely unaffected by contrarian human input or output.

    A much better argument is that if Microsoft isn't documenting its training data well enough to identify source files, the training data itself is suspect, and undesirable outputs will be harder to diagnose. But that's a reason to avoid Copilot, not to abstain from GitHub.

    Oh? Is such an undesirable outcome not a very valid reason to hack Copilot in order to expose its inherent and/or unpleasant flaws?

    The beef with Microsoft here is their attitude of "we're going to take other people's work and not give them credit”. ..... Howard Sway

    But Howard, that is the American way, is it not? Take what you can and to hell with the rest and unintended consequences?

    It is certainly most unlikely to change any time soon, methinks, with more than just a few thinking it is all quite acceptable ..... and supported in that notion because it can be so extremely positively rewarding with mountains of flash cash to stash one of the major attractions it is so easy to be hopelessly addicted to.

    Some recognise that not as an almighty strength to be mindlessly lauded and applauded, which it surely can be, but a catastrophic weakness to be mercilessly exploited and repurposed ..... either to merciful extinction or worthy improved refactoring.

  11. heyrick Silver badge

    If it's not immoral for humans, how can it be for AIs?

    Easily. The average human isn't capable of reading the entirety of the source on Github, remembering it in exactness, and offering snippets up to everybody who asks. The AI very much is capable of exactly this.

    The human may use other people's code as a guide to how they approached a particular problem, helping them to develop their own solution. Or, yes, copy-paste a bit of code. But they're unlikely to be offering bits of other people's code to anybody else who asks. The AI, on the other hand, won't be using it to learn anything other than how to more effectively offer appropriate solutions for what somebody else is doing.

    Now, on the face of it this might not seem like a bad thing as it will be an appropriate suggestion without the human having to go and read loads of code in order to find it. However the problem is that it is presented without context. And whether you like it or not, the terms of the licence are part of that context.

    1. Lord Baphomet

      Re: If it's not immoral for humans, how can it be for AIs?

      Again, someone who hasn't actually used the tool or understood what it's doing but who still thinks it's appropriate to comment.

      It isn't a snippet tool and it isn't copying code from repos. It doesn't memorize all of github's code and it's not a snippet search engine.

    2. Fifth Horseman

      Re: If it's not immoral for humans, how can it be for AIs?

      I hear your concerns about licensing, and share them. However, I think there is a more fundamental problem here.

      We all start from other people's work to guide and inform our own, that's how we learn. In pre-WWW days we used textbooks: a good fifty percent of what I have done probably had a starting point in "Numerical Recipes" or "The Art of Electronics", aided by numerous manufacturer's application notes. Not intrinsically any more reliable than a random web page, granted, but at least they have been past an editor and a proof-reader first...

      Over the last couple of years though, I have seen more and more copy and paste programming, with little effort made to understand how the code works, whether it is really appropriate in the current application, or indeed what it actually does. In hardware, it is sometimes worse - there are some great open source designs out there, but certainly in websites targeted at the maker community, I have seen designs ranging from "won't work under any circumstances" to "will probably kill you".

      My concern here is that if the AI (god, I hate that term) algorithm has suggested the solution, there will be much less chance that the naive user will be critical of what is put in front of them. The computer said it is right, and it is better at this than me, so it must be OK? As someone has pointed out earlier, we don't know the signal to noise ratio of the GitHub codebase, and neither does the AI bot.

      Anyway, just my two pennorth. Feeling even more cynical than usual.

  12. original_rwg
    Coat

    Umm...

    "Just because it's Microsoft doesn't mean it's wrong"

    But they'll fix it on Patch Tuesday....

  13. Lord Baphomet

    Spot On

    This article is exactly right. Until and unless OSS licences are changed to account for this novel use of source code, CoPilot is perfectly acceptable.

    CoPilot doesn't copy source code. It isn't a search engine. It doesn't offer you snippets of open source code. In fact, it does little more than thousands of other tools do - it reads the public code in github and builds a statistical model based on what it finds. Lots of other research has been done on the source found in github and nobody has complained about any of it.

    Any complaint about this tool is spurious, and as most of these comments prove, is voiced by people who haven't actually seen and don't understand the tool.

    1. heyrick Silver badge

      Re: Spot On

      Doesn't copy source? Doesn't offer snippets?

      Please watch https://www.youtube.com/watch?v=xgYOCUtUJbs (from ~50 seconds) and explain to me what that is, if not working out what the user is wanting to do and then filling in a prewritten function body in order to do it.
