More than half of GitHub is duplicate code, researchers find

Given that code sharing is a big part of the GitHub mission, it should come as no surprise that the platform stores a lot of duplicated code: 70 per cent, a study has found. An international team of eight researchers didn't set out to measure GitHub duplication. Their original aim was to try to define the “granularity” of …

  1. Anonymous Coward

    Don't depend on it

    “If ever you have felt like you are downloading the universe when running npm install...”

    So what if I have twelve copies of tightpants.js floating around in my project? As long as they don't all try to run at the same time I'm good, right?

    1. theblackhand

      Re: Don't depend on it

      Depends how tight you like your pants....

  2. Notas Badoff

    Forking right! ?

    I see some project's code that's wrong and needs maybe three lines changed. To submit a pull request I have to fork the whole project, change those three lines, update, push, generate pull request, yada-yada. I can't actually delete/erase my fork, so those files stay 'duplicated', right?

    I'm just not quite getting the assertion that all that duplication is really duplication rather than GitHub's equivalent of softlinks.

    1. thosrtanner

      Re: Forking right! ?

      Pretty sure they can tell the difference between a forked repo and a copy and paste. It's pretty much how github works.

      Copying the whole of some JavaScript library into your repo rather than using sub repositories (or whatever they call them now) seems to have been pretty much standard practice. Not sure how the package managers work, but some things are really good at breaking APIs. And there's probably a lot of "I just need a small tweak" mentality about this copy and paste stuff.

    2. Steve the Cynic

      Re: Forking right! ?

      Um. I don't know who has failed on comprehension of github. I'm willing to accept that it's me, but isn't the process more like this?

      * git clone the project on my local machine with the github repo as a "remote"

      * create a branch and make my changes there.

      * Usual software process guff about testing my changes.

      * commit my changes to my local clone

      * push my branch to the remote

      *** Now my branch is on github's repo.

      * Submit pull request

      So there are two copies of the repo: one is on github's servers, the other on my machine.

      Feel free to tell me how totally wrong what I just said is.

      1. ibmalone Silver badge

        Re: Forking right! ?

        If you're part of the core team for the project then yes (though you'd submit a merge request for your branch, not a pull request). If you're an outsider (e.g. you use a library and there's some fix or feature you've added and are proposing they incorporate) you can't push to their repo and will usually push your changes to your own fork on github and submit a pull request to the upstream repo from there.

        1. Steve the Cynic

          Re: Forking right! ?

          OK, that makes sense; I had a nagging feeling there was something wrong with my interpretation. Ho Hum.

    3. Anonymous Coward

      Re: Forking right! ?

      "I can't actually delete/erase my fork, so those files stay 'duplicated', right?"

      Of course you can. Once you've merged into the target branch you can delete whatever you like. It's one of Git's key strengths that you don't have to keep stuff hanging around if you don't want to.

      Whether you should or do delete things is a different question.

    4. Blake St. Claire

      Re: Forking right! ?

      > I can't actually delete/erase my fork,

      Sure you can. Go to the repo settings, scroll to the bottom. Push the delete button

  3. bombastic bob Silver badge

    it's part of the design, actually

    when I wanted to submit a patch for an open source project (in this case, the Arduino IDE) I basically had to clone the repo, make my change, and then submit a pull request. Makes sense to do it that way, though, right? It _is_ the way the github docs tell you to do it, after all.

    But, is THAT the level of "copying" they refer to?

    It's a fair bet that cloning the repo [for this very purpose] only constructed pointers to particular revisions, and so it wouldn't really take up a lot of storage space on github's server.

    STILL, I'd think it would count as a "copy"...

    yeah, there IS a lot of "copying" going on, in this case. [I suppose I oughta delete my copy, now that I think of it, and the patch has been incorporated into the project]

    post-post note - someone else mentioned 'forking', possibly as I was writing this...

    1. sabroni Silver badge

      Re: it's part of the design, actually

      BOB, is THAT really YOU? What's GONE wrong WITH your KEYBOARD?

    2. ibmalone Silver badge

      Re: it's part of the design, actually

      The article seems to be about not so much how much space is taken up on github's backend, but what happens when you sample code from github, in which case forking does matter. Say you want to know how many people use mmap versus malloc for memory allocation. You pick C code at random and count. However if a large proportion are forks then popular projects are over-represented and your results are skewed towards the properties of those projects.
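
To make the sampling point concrete, here is a minimal Python sketch. The toy corpus, the repository names, and the simple `"mmap("` text check are all made up for illustration; none of it comes from the study's data or method. It compares the share of sampled files that use mmap when every forked copy is counted with the share when byte-identical files are counted only once.

```python
import hashlib

# Hypothetical toy corpus: (repository, file contents). Not real data.
corpus = [
    ("popular/lib", "p = mmap(0, n, prot, flags, fd, 0);"),
    ("fork1/lib", "p = mmap(0, n, prot, flags, fd, 0);"),
    ("fork2/lib", "p = mmap(0, n, prot, flags, fd, 0);"),
    ("alice/tool", "buf = malloc(n);"),
    ("bob/util", "data = malloc(len);"),
]


def mmap_share(files):
    """Fraction of files whose text contains a call to mmap."""
    files = list(files)
    return sum("mmap(" in text for text in files) / len(files)


def dedupe(files):
    """Yield one copy of each byte-identical file, keyed by a content hash."""
    seen = set()
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield text


texts = [text for _, text in corpus]
naive = mmap_share(texts)            # every forked copy counted separately
deduped = mmap_share(dedupe(texts))  # each distinct file counted once
print(f"naive: {naive:.2f}, deduplicated: {deduped:.2f}")
```

In this toy sample the one popular project and its two forks push the naive figure well above the deduplicated one, which is the skew described above.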

  4. Adam 1


    If I one day hate myself enough to get back into JavaScript, I'd probably want a local copy of any NPM in case some function went missing...

    Seriously though, I'd like to know a lot more about their methodology before getting my pants in a twist. I have seen dup checkers complain about nunit test cases being too similar. And yes, you probably could have extracted 2 lines of the arrange into a private method so the three test cases with those lines could share it, but then to describe those two trivial lines of code you would need to spend a month of Sundays trying to come up with a sensible name, and this is somewhat missing the point of refactoring.

  5. Charlie Clark Silver badge

    Java provided a good example: people creating a project would commit NPM libraries

    What? Since when has Java been using NPM?

    1. Anonymous Coward

      Certainly several cases in the article where Java was stated yet Javascript was meant. How many times have we seen that?

      1. mgbrown

        "Certainly several cases in the article where Java was stated yet Javascript was meant. How many times have we seen that?"

        Surely they actually meant ECMAScript?

      2. The Indomitable Gall

        I only see Java used to refer to Java and Javascript used to refer to Javascript. The figures being different for the two languages shows I'm not reading it wrong.

  6. richardcox13

    The authors misunderstand git

    > there is a lot more duplication of code that happens in GitHub that does

    > not go through the fork mechanism, and instead, goes in via copy and paste of files

    > and even entire libraries

    At face value, they appear to have checked only whether GitHub marks a repository as a fork. But it is quite easy to fork without that showing in GitHub[1]. Given that the basic model of submitting a pull request is to start with a fork, made either in GitHub or via a local repository[1], there will be a lot more forks than immediately show.

    Why would anyone go the local route? Say I'm interested in how a project works, so I clone it locally. I make a change and then realise I want to push it somewhere before creating a PR. Given that the changes are already committed, the new-remote approach is easier.

    [1] Clone GitHub repository to local; create new GitHub repository; set new GitHub as remote in local repository, push local to new remote.

    1. ibmalone Silver badge

      Re: The authors misunderstand git

      You can do that, but it makes life a little painful as having a project marked as a fork of another makes it easier to keep it up to date. Can't remember what, but I think it's re-basing that becomes difficult if your project isn't marked as a fork of the one you want to track (and you can't change this, you can only get fork status through forking on github).

  7. Herbert Meyer

    creativity ?

    All these yoofs got through school by copying each other's work. Continues into their "work". I refuse to refer to them as "adults".

  8. Nick Kew

    The mote in thine eye

    but along the way, they turned up a “staggering rate of file-level duplication” that made them change direction.

    So their own work was driven by what they discovered after they'd started. That makes it statistically worthless.

    Was the slightly-ironic sub-headline El Reg, or from the research? If the latter, I hope the tongue was firmly in the cheek.

    1. Fading

      Re: The mote in thine eye

      Using the scientific method in research seems to be a lost art......

    2. ibmalone Silver badge

      Re: The mote in thine eye

      Data descriptive rather than hypothesis driven: exploratory investigation is a valid thing to do, and statisticians are meant to check assumptions on data before using it for hypothesis tests (normality and correlation of confounders and explanatory variables being examples). Someone who's interested could now go, collect their own sample and try to confirm the finding if they really wanted. The scientific method cycle, at least in its idealised form, doesn't require that the same person execute all the steps. If we only did hypothesis-driven work, we'd still be trying out different methods to make the sun rise.

    3. The Indomitable Gall

      Re: The mote in thine eye

      The purpose of the research was to inform other research that says "Language X is the most popular language (according to Github)." Analysing Github for patterns that bias other research is completely valid, and identifying patterns is part of that.

      This is perfectly scientific, and in fact continuing with their original plan would have been the statistically worthless option, as modified file copies are at least an order of magnitude less common than verbatim file copies. The originally sought data would be valueless without this file-level data, so there was no point in pursuing the original plan.
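
The difference between verbatim and modified copies can be illustrated with a small Python sketch. The thread does not describe how the researchers matched files, so both key functions below are assumptions made for illustration: `exact_key` hashes the raw text and therefore only matches verbatim copies, while `token_key` is a crude stand-in for "same code, trivially modified" that ignores `//` comments and whitespace.

```python
import hashlib
import re


def exact_key(text):
    """Hash of the raw text: equal only for verbatim copies."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def token_key(text):
    """Hash of the token sequence, ignoring // comments and whitespace.

    Assumption: a deliberately crude normalisation, not the study's method.
    It would, for instance, mangle a string literal that contains "//".
    """
    no_comments = re.sub(r"//[^\n]*", "", text)
    tokens = re.findall(r"\w+|[^\w\s]", no_comments)
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()


# Hypothetical file contents for illustration.
original = "function add(a, b) {\n  return a + b;\n}\n"
verbatim = original
reformatted = "// copied from lib\nfunction add(a,b){ return a+b; }\n"
different = "function sub(a, b) {\n  return a - b;\n}\n"

print(exact_key(original) == exact_key(verbatim))     # verbatim copy
print(exact_key(original) == exact_key(reformatted))  # missed by the exact hash
print(token_key(original) == token_key(reformatted))  # caught by the token hash
print(token_key(original) == token_key(different))    # genuinely different code
```

An exact hash is cheap and catches the verbatim copies; the token-level key is only needed for the modified ones.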

  9. Nick Kew

    Duplication

    Now remind us.

    How much of the human brain is redundant?

    How much of the human genome is duplication?

    or even

    How much of a great artwork is duplication?

    It seems to go with the territory of being large and complex.

    1. Hollerithevo

      Re: Duplication

      Actually, very little of the human brain is redundant. That is a very old idea, and research keeps showing that what we used to think of as filler actually has essential functionality. Makes sense: the brain uses up a lot of the body's resources, and evolution tends not to focus on wasted space.

      Ditto the human genome: they are discovering that duplication is an essential component of corrections.

      No great artwork is duplication. There is a lot of second-rate art which is uninspired copying, but every piece of great art stands alone. 'The Raft of the Medusa', the Parthenon, Michelangelo's 'Pieta', Brahms' Third Symphony -- these are all mountain peaks, unique and wonderful in their own ways.

      The only place I have found identikit wastes-of-space has been middle management. Oh, and marketing copy.

      1. Nick Kew

        @Hollerithevo Re: Duplication

        Oh dear. I don't think much of your musical taste. Among symphonists contemporary with Brahms I'd put Dvorak or Tchaikovsky head, shoulders and torso above the sub-Beethoven-wannabe.

        That aside, if you look at any music, there's a lot of repetition. Sometimes identical, other times modified. Whole styles and genres are defined by how repetition works. One of the main things that distinguishes music worth listening to from a pop single is that it's not merely repetition, but development of ideas. From antiphonal echo, to the major classical forms like sonata and rondo, to the leitmotif and its many imitators, to name but a few forms spanning the centuries.

        Take the familiar repetition away and you have Stockhausen. Or let the repetition overwhelm development for longer than a pop single and you have muzak.

        Which is kind-of like github. Clone something, you have duplication. Fork and go your own way, or feed back to your upstream via pull requests, and you have different modes of development. Is not a bugfix branch just what you say of the genome: an essential component of corrections?

        I guess an in-depth study of analogies to other complex systems might look more like a PhD thesis than an El Reg post. Maybe a good halfway house could be a paper examining some aspect in depth, which El Reg could then report and commentards could debate in an ingenious self-reference reminiscent of Escher.

        Mine's a pint, please. I'll need it to take this any further.

  10. hatti


    Most of them are probably my tutorial 'Hello World' examples.

    Sorry Github

  11. tiggity Silver badge

    jQuery in own code

    Local jQuery can be useful - a couple of examples

    1. If there is no connectivity, pulling down code from jQuery online will fail; a local copy means scripts will still work. This can be useful when, e.g., your code is a product used by many people and may run on a web server set up so that the site cannot pull in data from the web and can only communicate with sessions established to it.

    2. If the user has script blocking tools enabled, having local js means they only need to allow same-domain js

  12. 2+2=5 Silver badge

    47 levels of nesting

    > including nested dependencies (nesting up to 47 levels deep was discovered, with median five)

    I can't decide whether that particular developer deserves an award or a kick up the bum.

    I certainly wouldn't want to be taking over its maintenance.
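
For a sense of what "nesting depth" means here, a minimal Python sketch follows. The project names and dependency trees are hypothetical, and the thread does not say exactly what the quoted median was taken over, so treating it as the median of each project's deepest chain is an assumption.

```python
from statistics import median


def nesting_depth(deps):
    """Length of the deepest dependency chain; an empty dict has depth 0."""
    if not deps:
        return 0
    return 1 + max(nesting_depth(children) for children in deps.values())


# Hypothetical projects: each dict maps a package name to its own nested dependencies.
projects = {
    "app-a": {"left": {"pad": {}}, "fmt": {}},
    "app-b": {"web": {"router": {"path": {"glob": {}}}}},
    "app-c": {},
}

depths = {name: nesting_depth(tree) for name, tree in projects.items()}
deepest = max(depths.values())
typical = median(depths.values())
print(depths, deepest, typical)
```

A 47-level chain is nowhere near Python's default recursion limit, so the recursive form is fine for a sketch like this.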


Biting the hand that feeds IT © 1998–2022