Open source devs consider making hogs pay for every download

I'm at the Linux Foundation Members Summit, and Sonatype's CTO Brian Fox introduced me to a new open source problem. I wouldn't have thought that was possible, but here I am. Fox, who also oversees Maven Central, the Java registry, explained that its repository site is at risk of being overwhelmed by constant downloads. The …

  1. Anonymous Coward

    Abusers eventually have to pay the piper

    This is, IMHO, shortsighted on the part of those companies. How much can it cost to run their own Git repo either on prem or in the cloud? On prem would probably cost a lot less if the traffic is as heavy as reported in the article.

    I run my own repo on a hosting site. I moved all my code there the day after MS bought GitHub. It is secure, as I've set up the firewall to only accept requests from three validated addresses. The rest get a 404.
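    At the web server layer, the 404 trick is only a few lines; a minimal sketch of the idea (the allowlist addresses are placeholders, and a real setup would pair this with actual firewall rules):

```python
# Minimal sketch: serve the repo only to validated addresses and 404
# everyone else. The three addresses are placeholders, not anyone's
# real config; a production setup would do this in the firewall too.
from http.server import BaseHTTPRequestHandler, HTTPServer

ALLOWED = {"192.0.2.10", "192.0.2.11", "192.0.2.12"}  # hypothetical

class RepoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.client_address[0] not in ALLOWED:
            self.send_error(404)  # look like nothing is here at all
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"repo content would be served here\n")

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), RepoHandler).serve_forever()
```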

    1. Headley_Grange Silver badge

      Re: Abusers eventually have to pay the piper

      I guess that if it costs more than free then they aren't interested.

    2. Claude Yeller Silver badge

      Re: Abusers eventually have to pay the piper

      "How much can it cost to run their own GIT repo either on prem or in the cloud?"

      The cost is not in running the Git repo, but in the paperwork and effort needed to get the few bucks into your budget.

      If you have to spend months arguing with PHBs and bean counters to set it up, you'd rather use the "free" version.

      1. Anonymous Coward

        Re: Abusers eventually have to pay the piper

        And if reliance on the "free" version causes a throttling SNAFU which snarls work output, will the PHB take responsibility?

        Hell no. He'll blame IT.

        It's IT's job to make the case for being in control of mission critical infrastructure. Sometimes that requires extra paperwork dealing with bosses who just don't get it. Life isn't fair.

        A smart IT department is happy to fire up that email chain in advance of the SNAFU and let the PHB say no. Then IT has the paper trail saying "we told you this was mission critical!"

      2. AnonymousCward

        There’s nothing to argue with the PHBs

        I suspect this has nothing to do with a Git repo. If it did, it would be trivial to resolve and not worth writing about. Linus catered for low-bandwidth and even zero-Internet scenarios when he built the thing!

        On the FOSS project side: Git is distributed by design. Git also correctly uses cryptography, so people can tell if the code they're receiving really did come from where it's claimed. Any project using Git could mirror a read-only copy of their repo to anywhere they want (e.g. GitHub, GitLab, Codeberg etc.) before setting up stringent bandwidth and connection limits for their own internal repository. The "but GitHub uses code to train AI, we don't want our code slurped!" crowd forgets that pretty much anyone on the planet with access to a computer and the Internet could mirror an objector's code to GitHub and the exact same thing would still happen, meaning you're better off getting something for your troubles than nothing.

        On the large-scale consumer side: Pretty much the same story. If the project isn’t mirroring elsewhere, you create your own mirror (in the case of GitHub it’s in the cloud and for free) instead and have your workers fork that to make changes first.
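        The mirroring step itself is one-time plumbing; a minimal sketch, assuming hypothetical source and mirror URLs (the `git clone --mirror` / `git push --mirror` pair is the standard recipe):

```python
# Sketch: publish a read-only mirror of an internal repo. Both URLs
# are placeholders for illustration.
import subprocess

SOURCE = "https://git.example.com/project.git"        # hypothetical
MIRROR = "git@github.com:example/project-mirror.git"  # hypothetical

# One-off: a bare, full-fidelity copy of every ref in the source
subprocess.run(["git", "clone", "--mirror", SOURCE, "project.git"], check=True)

# Periodically: refresh from the source, then push all refs outward
subprocess.run(["git", "remote", "update", "--prune"], cwd="project.git", check=True)
subprocess.run(["git", "push", "--mirror", MIRROR], cwd="project.git", check=True)
```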

        1. Claude Yeller Silver badge

          Re: There’s nothing to argue with the PHBs

          "Any project using Git could mirror a read-only copy of their repo to anywhere they want (e.g. GitHub, GitLab, Codeberg etc.)"

          Getting automatic updates for your mirror takes "work" and/or money. Both the work and money are negligible, but someone has to account for it, which again requires sign-offs. Even a $0.02 monthly payment must be accounted for.

    3. Michael Strorm Silver badge

      Re: Abusers eventually have to pay the piper

      Shouldn't that be 403 Forbidden?

      1. that one in the corner Silver badge

        Re: Abusers eventually have to pay the piper

        Nah.

        403 means the undesirables can tell there is probably something juicy waiting for them, and they'll get all angry at you/OP about being denied their rightful access.

        404 means the problem is at their end, they can't even type in a URL correctly; and (unless they are getting involved in timing attacks, which raises other questions) they won't be able to tell which URL variation *would* have worked if they were on the friends list.

    4. computing

      Re: Abusers eventually have to pay the piper

      You're right of course - companies should run their own repo. But it's a question of having access to the skills to do it.

    5. anothercynic Silver badge

      Re: Abusers eventually have to pay the piper

      It's not a Git repo that's the problem. It's running a Maven repo that is (and the same goes for the likes of npm, PyPI, etc etc etc). It's a bit involved, but infrastructure providers like AWS, Azure and Google Cloud *can* and importantly *should* run these repos on their customers' behalf. How they can help with that should be fairly straightforward, like the article says. Caching is important.

      If AWS were to run a Maven mirror/cache, they could literally redirect any calls to maven.org or maven.sonatype.com to run via their proxy. Eventually they would have a full, accurate and up-to-date copy, and the likes of Sonatype would not fall over (or get concerned about their bandwidth costs). GitHub could implement the same, given that the per-commit unit tests run on their infrastructure. AND - GitHub could suggest in their UI that users use the GitHub repo instead, to keep traffic to the likes of Sonatype low.

      It should not be difficult to do. It's just that the will to do this either does not exist ("oh, well, look, it exists there, so why should I need to do this") or, as pointed out, it's just plain ignorance.

      The added benefit of such 'pass-thru' cache systems is that the hyperscalers could also roll back specific packages (if they are poisoned by miscreants), which in turn could drop the problem of supply chain attacks through the floor.
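      A 'pass-thru' cache doesn't need to be exotic, either. A rough sketch of the idea (naive caching is safe for Maven-style repos because released artifacts are immutable; error handling and eviction omitted):

```python
# Rough sketch of a pass-through artifact cache: hit upstream once per
# path, keep the bytes on disk, serve every later request locally.
import hashlib
import pathlib
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "https://repo1.maven.org/maven2"  # Maven Central
CACHE = pathlib.Path("artifact-cache")

class CacheHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # One cache file per requested artifact path
        key = hashlib.sha256(self.path.encode()).hexdigest()
        local = CACHE / key
        if not local.exists():
            CACHE.mkdir(exist_ok=True)
            with urllib.request.urlopen(UPSTREAM + self.path) as resp:
                local.write_bytes(resp.read())  # the single upstream hit
        body = local.read_bytes()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8081), CacheHandler).serve_forever()
```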

    6. Avalanche

      Re: Abusers eventually have to pay the piper

      The problem isn't about git pulls at all; the article itself makes that same mistake as well.

      It's actually about downloading *libraries* from artifact repositories like Maven Central, NPM, and the like. Some organizations don't have their own mirror repositories, nor do they cache artifacts, which means that every time a CI pipeline runs, all dependencies are downloaded again, increasing the load and cost of such artifact repositories.

    7. coderguy

      Re: Abusers eventually have to pay the piper

      The issue isn't really about the source code; it's the many thousands of dependencies found in a typical project today.

      For example, Maven artifacts and npm or apt packages can be quite large, and they're being downloaded repeatedly by a small group of users.

      Someone needs to make these available for download. Bandwidth and storage aren't free.

  2. Juha Meriluoto

    Getting x for free and making profit from it... Capitalism, pure and simple.

    1. Anonymous Coward

      It is way past time that the government used strike-breaking laws to force open source developers to support their software with updates, registries, CDNs etc. They should absolutely be prohibited from increasing what they get paid - that would be tantamount to communism.

      Luckily business will do its part, by laying them off from their jobs, so they have more free time to support FOSS.

      1. Anonymous Coward

        Or government could like, uh, fund critical open source projects.

        How many developers and downloads could be paid for if the US bought one fewer stealth bomber?

        Just one fewer. I think we'll survive.

      2. Alumoi Silver badge

        /sarcasm

        You forgot something, buddy!

    2. Ian Johnston Silver badge

      No. Capitalism is making money from owning capital. There is a clue in the name.

    3. gaston

      Suggest a wall of shame

      Posting a list of the biggest abusers and the largest profiteers may induce more responsible behavior.

  3. IGotOut Silver badge

    Nice idea...

    ...I totally support this, but I'd fire a warning shot first and not just throttle them, but outright ban them, say for an initial 4 hours, then 8, then 24, then 48 etc.

    Think of these billionaire scum as the spoilt brats they are. Keep making them sit in the corner until they learn to behave.

    1. that one in the corner Silver badge

      Re: Nice idea...

      Trouble is, as TFA points out, how do you know who "them" is if the best you can say is that the IP address is coming from, say, AWS? And it isn't even the same IP for the same abuser's next run, as the VM was chucked away and a new one started up?

      Of course, they could try an approach of "we will block every request we think is coming from a hyperscaler", on (perfectly reasonable) grounds that your devs should be doing the ONE AND ONLY pull of that content onto a dev machine from your company's IP and then loading that into your build system's cache (whether that be a localised Git instance or any other storage). That won't stop all the abuse, of course, but it'll help improve the accuracy for a more targeted throttling/replying-with-an-invoice setup.

      1. _wojtek

        Re: Nice idea...

        just block hyperscalers - they use ephemeral VMs which always start with a "clean state" and don't bother with any proxy...

        alternatively, force the use of accounts (for Maven Central) with quotas, so the majority of users won't be affected but the 1% hogging up to 80% of resources will either optimise their pipelines or pay for the abuse...

        1. Roland6 Silver badge

          Re: Nice idea...

          Would be interesting for comparison if figures were available for GitHub and GitLab.

          Agree with quotas/thresholds for free and paid. However, the real challenge is getting people to understand that it is the source, and only the source, that is free; everything else, including getting binary install packages for your platform, is not.

        2. collinsl Silver badge

          Re: Nice idea...

          I'd block the hyperscalers apart from one IP and then apply throttling to that - it would then be up to the hyperscalers to create/apply proper proxy solutions for that IP (which they could choose, and perhaps automate changing it every hour or something if they can't keep it the same forever) so that their customers aren't affected by the throttling limits.

    2. Ken Hagan Gold badge

      Re: Nice idea...

      A temporary ban has the added advantage of looking like an outage at your end. If the miscreants start seeing their business operations disrupted by your (apparent) flakiness then they might (finally) work out that some caching at their end is necessary.

    3. DS999 Silver badge

      Isn't this announcement / discussion

      The warning shot?

      They know who they are. If it causes big issues in their business before they figure out what happened and get on board with paying, that's their problem as far as I'm concerned.

      If I was using an empty lot next door as additional outdoor storage space for the last five years without the owner complaining, it would be nice of them to tell me if they planned to have my stuff piled up in my yard or taken to the dump. But they would be under no obligation to, and I would have no one but myself to blame if their actions ended up costing me money.

      1. Anonymous Coward

        Re: Isn't this announcement / discussion

        There's a big, fat promotion waiting for whichever hyperscaler IT dude flags this for the boss, gets ahead of the problem, then follows up with the story of unprepared others getting smacked with avoidable downtime.

        Seriously.

        Give me a couple obsolete 8-core servers, 32GB RAM, a few NVMe drives of a few TB each, and 40 GbE NICs.

        That's spare kit most places and crushes the problem.

        What is IT gonna do, sell that 40 GbE card for $40 on eBay?

        1. Anonymous Coward

          Re: Isn't this announcement / discussion

          Do you have a link for that listing?

      2. Roland6 Silver badge

        Re: Isn't this announcement / discussion

        > They know who they are.

        Agree, probably need to do some analysis of the 1% to identify geographic region, company and individual (I am assuming the repository requires a login to be able to set up a project and to interact with it).

        1. doublelayer Silver badge

          Re: Isn't this announcement / discussion

          The problem is that the stuff they're talking about doesn't require a login. It's open source, so anyone who wants can have a copy, so no identifying information is collected, so if someone requests a thousand copies, it's not always easy to know who that is or, depending on how they did it, to determine that it was one person requesting a thousand times rather than twenty people requesting a thousand times between them somehow. Throttling can often work, but naming and shaming is often more difficult than the reward would justify.

      3. A Non e-mouse Silver badge
        Headmaster

        Re: Isn't this announcement / discussion

        > They know who they are

        From the fine article:

        "In one case, a department store's team of 60 developers generated more traffic than global cable modem users worldwide due to misconfigured React Native builds bypassing their Nexus repository manager."

        So, no, they don't know they're doing it.

        1. DS999 Silver badge
          FAIL

          Re: Isn't this announcement / discussion

          Whoever is paying their internet hosting bill ABSOLUTELY knows. They get stats/graphs about usage, and would know there is a big spike when that "misconfiguration" occurred and especially know there is that big spike with all of that new traffic going to a single location!

    4. anothercynic Silver badge

      Re: Nice idea...

      The likes of RIPE already do something similar - if you run a bunch of whois queries against them again and again and again, they will eventually ban your IP address for being abusive, *especially* when you repeatedly request registrant information when all you want is to know who the owner of an IP address is but don't care about the actual role/personal contact information.

      Some popular Whois modules for Perl do not set caching on by default, and do not set brevity on (i.e. don't ask for personal role info if you don't need it). Setting those options *on* reduces the hits on the likes of RIPE, ARIN, APNIC etc significantly too. Nothing's nicer (!! Not!) than discovering that your IP has been deny-listed because you never did that (been there, done that, wrote a grovelling apology to RIPE for my sin and was forgiven).

    5. coderguy

      Re: Nice idea...

      I'm fine with this; the repositories just need to require a token to be sent with each request and throttle based on that.

      Afaik, all the major repositories support some form of authorisation.

      Hobbyist usage : Some value

      Commercial usage : Some other but higher value for X currency unit
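      A sketch of what that token-based throttle could look like on the server side (tier names and rates invented purely for illustration):

```python
# Sketch: per-token rate limiting with tiered quotas. Tier names and
# rates are invented for illustration.
import time

RATES = {"hobbyist": 100, "commercial": 10_000}  # requests per hour

class TokenBucket:
    def __init__(self, per_hour: int):
        self.capacity = per_hour
        self.tokens = float(per_hour)
        self.rate = per_hour / 3600.0  # refill per second
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over quota: 429 them, queue them, or invoice them

buckets: dict[str, TokenBucket] = {}

def allow_request(token: str, tier: str) -> bool:
    # One bucket per token, sized by the tier it paid for
    bucket = buckets.setdefault(token, TokenBucket(RATES[tier]))
    return bucket.allow()
```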

  4. QET

    Those leeching buffoons who were eventually embarrassed had never heard of the concept of an on-prem caching proxy?

    I'd have thought any sufficiently large company with IT staff worth their salt would implement something like that.

    But on the other hand, companies that big too often nowadays have bean counters who concluded outsourcing it all could save the company 0.1% compared to the previous on-prem IT staff's salaries.

    1. Gene Cash Silver badge

      sufficiently large company with IT staff worth their salt

      I'm an optimist. I hope to see one someday.

    2. Anonymous Coward

      Exactly. Code has moved from local repos to GitHub, build environments have moved from Jenkins to Azure Cloud. On-prem is being gutted, and the people who worked in build, deployment, and installation are fired as a result and replaced with outsourced labour.

      Corporate thinking is that if we don't have the servers then we don't need the people. But we do, because these are the people who would have set up caching, amongst other things. Now everything is pulled in at the start of each build; only application developers might understand that caching is required (and only if they're actively checking something that is not within their domain), and the outsourced labour who replaced those who were fired won't do anything unless they're told to, which is "fix this, it's broken" rather than anything like best practice or preventative maintenance.

      1. Richard 12 Silver badge

        Worse, all the CI systems deliberately delete everything, every step, by default.

        So a pipeline that does "setup, build, deploy" on the same physical runner ends up doing three pulls every single time it runs!

        1. alisonken1

          The difference is that most CI systems are part of the repo, and they can keep track of CI pulls and "everyone else" pulls.

      2. Roland6 Silver badge

        >” only application developers might understand that caching is required”

        “Might” is a key word. Given the emphasis on cloud, and increasingly on running dev desktops in the cloud (i.e. on AWS or Azure), I suspect many will simply say “the Internet is slow” and “we need a faster connection”. It will be a long time before they appreciate that it was the way they were using the tools, and the limitations in those tools, that was the problem.

  5. Yorick Hunt Silver badge
    Holmes

    Easy solution

    Charge everyone $1 per pull. Maybe compound it by charging $2 per pull after the first thousand pulls in a month, $3 after the second thousand, and so on.

    Regular developers won't have a problem paying <$10 per week.

    The ones making millions of pulls, though...

    1. find users who cut cat tail

      Re: Easy solution

      If you started at 0.01 €, I might agree. But 100+ pulls per week – which should still not put much strain on the infrastructure – is not unusual during active and granular development. And while $100 may be the cost of a couple of coffees in the US, it is big money in other parts of the world. Taxing the poor again? For comparison, even your $10/week is about twice as much as I pay my ISP.

      1. that one in the corner Silver badge

        Re: Easy solution

        > But 100+ pulls per week – which should still not put much strain on the infrastructure – is not unusual during active and granular development

        100+ pulls of the same data in a week? Huh? When do you need to do that? And do so without enough foreknowledge of your own development "strategy" to be able to set up a local cache (e.g. just 7zip the download before doing anything else!) instead of wasting time repeating the pull!

        Two or three identical pulls because the team isn't talking to each other, maybe. Although that indicates a problem in your team.

        100+ because you are in a megacompany with loadsa separate projects that are - not talking to each other, which indicates a problem with your software architect.

        1. Anonymous Coward

          Re: Easy solution

          > 100+ because you are in a megacompany with loadsa separate projects that are - not talking to each other, which indicates a problem with your software architect.

          That's also a sign of teams operating with autonomy.

          The alternative is a bunch more useless meetings between teams which don't actually gain anything from coordination (even if it does make the PHB look "in charge").

          1. that one in the corner Silver badge

            Re: Easy solution

            Having your software architect keep control over the proliferation of external dependencies is not about continual inter-team meetings.

            Nor is *not* controlling your externals about "team autonomy". Not unless your definition of "autonomy" fundamentally includes wasting time by pointlessly duplicating effort. Effort not just for you: you are checking for all the potential non-technical issues surrounding an external, aren't you? Legalities?

            In case you don't know how it works:

            You determine the need for functionality X; hey, maybe that is your entire team or maybe it is literally just you making the determination. Then one person - just needs one, but you can take a friend - goes to the SA with these requirements, and probably one or two names of possible solutions.

            He points you at the already in use, tried and tested solution - which may be another project in the company or may be an external that is already in the company library, open source or proprietary. Plus a list of who is already using it and can help with any issues you may have, and any notes about legalities your team's project now has to abide by (e.g. the correct and consistent declaration for your software BOM; seat licensing; a statement that your project is for internal use only and cannot be released outside of the company because it is now using GPLed code and you don't - yet - have clearance to ship the company-written portion).

            If there isn't already something suitable - or you have a *really* convincing argument why there needs to be yet *another* XML parsing library added to the list - then you and the SA work together to ensure all those boxes that you don't care about, but Legal and Contracts get worked up about, are ticked. And for your sins you get added to the list of "any questions, this person has experience with the thing".

            No doubt this is all a pipe dream to many, but if you are wasting both your time and the company's EITHER duplicating effort OR getting tangled up in all those meetings, instead of being able to treat a common collection of libraries & utilities &c as a trustable source that can be accessed without involving your PHB, then something has gone wrong.

            Or you are really just a one or two man band and all of this is overkill - and you probably also are not the source of the problem to the repos that TFA is talking about.

            1. Anonymous Coward

              Re: Easy solution

              Even when I've seen this pipe dream attempted, I've never seen it really work well.

              1. that one in the corner Silver badge

                Re: Easy solution

                Yeah. I had it working for a few years - not a large dev department, but we had a number of distinct projects that would be active in parallel. I took the unofficial role of the librarian (weren't using SA terminology at the time), checking which licences were in use, generating the declarations for SBOMs from the build system blah blah.

                But I think I can say with absolute certainty that that all went by the wayside pretty much as soon as I left (except for contractual fixes to those older projects), as the Bright Young Things were having none of it. Including the person who was formally announced to be the SA across all the dev teams! I'm pretty sure that the SBOMs, if done at all, are now back to being manual, and one of the last things I did there was to discover an undeclared GPLed library where none should be... And, yes, there already were excess XML libraries in gleeful abundance, also DOM'ing when SAX would have made sense (and probably vice versa).

                I wish them well, but glad not to have to worry about this stuff now. Although - still use the same build system at home, so if I ever do let some poor fool have a copy of any of my for-fun projects, it'll come with a SBOM :-)

        2. doublelayer Silver badge

          Re: Easy solution

          Who said all the pulls were identical? The original suggestion was just charging each time the repository is pulled. I pull a lot, and most of the time, it's on repositories where I already have most of the history, but I still need the changes from yesterday from other people. This is something you have to keep in mind when making suggestions; if you say charge per pull, we'll interpret it as charge per pull. If you meant charge per clone from nothing, then you have to say that and it's still not that easy a solution because cloning and pulling look somewhat similar from the server's point of view.

          1. that one in the corner Silver badge

            Re: Easy solution

            > I pull a lot, and most of the time, it's on repositories where I already have most of the history, but I still need the changes from yesterday from other people.

            That sounds like the description of devs working together on one or more shared projects, possibly with you at the top of the heap. So, yes, you'll do a lot of pulls to keep in sync with all the work around you. Changes from yesterday? When I was working, we could be updating ourselves multiple times each day from each other's work, checking that what we were doing still fit together neatly.

            But none of that is relevant to the situation being discussed in TFA.

            > Who said all the pulls were identical?

            I made the foolish assumption that you were joining in the discussion about the issue raised by TFA, and the proposed solutions to it, when you made a statement that a few hundred pulls a week was not being abusive to the provider of a repo:

            >> So, for example, a single company might download the same code hundreds of thousands of times in a day, and the next day.

            1. doublelayer Silver badge

              Re: Easy solution

              That's the problem when using terms that have specific meanings. "Pull" has a specific meaning when talking about repositories. You made the correct assumption about what I was talking about, but if you used the word pull to describe downloading all the code in one go and definitely nothing else, that part would be foolish. If the person who suggested charging per pull didn't mean that, they should have clarified, and then this discussion would be different. Maybe we should just reset, since you were not the first person to use the word pull in this context and I wasn't the person who said it wasn't abusive. Let them argue over terminology instead.

              1. that one in the corner Silver badge

                Re: Easy solution

                It all gets very murky :-)

                You can have a VM image that is used as the base for each time the VM is started up, so each VM instance starts identically (bar the necessary differences set by config outside that image, e.g. network addresses). That sounds very sensible and controlled.

                The first week this is used, a pull is a "nice normal pull", bringing down just a small number of updates. Although, yes, each VM instance is getting the same data time after time.

                It is all working well, so no need to update the base image, doing that requires change orders to be signed off and all that hassle.

                Months - or years - later, those VMs have been working ok, no failures being reported back to the PHB's dashboard, so let them carry on. However, the repos being pulled *have* been changing, a lot, so now instead of the pull only calling for a few recent changes, each one is now calling for 95+% of the volume that a fresh clone would do. And no, the commands within the VM base are not properly using the "depth" parameters to stop all the history being dragged down (well, the entire issue is about the abusers not bothering to optimise the downloads, even though we are all shouting "you don't do it like that!"). And each new VM instance is repeating the same downloads as all its siblings.

                1. doublelayer Silver badge

                  Re: Easy solution

                  Exactly. That's what makes it hard to give a simple "pay per pull" solution to the problem, since small pulls, or ones that end up indicating that there are no changes, shouldn't cost the same as a pull that gets pretty much everything. I would mostly favor a bandwidth-limiting approach, except that it would only stop a few of the problems; anything doing IP hopping wouldn't notice, so even that doesn't always work. Users need to just build a cache for themselves, but I don't know how to make them do it.

      2. Roland6 Silver badge

        Re: Easy solution

        For a price point, I suggest looking at email, and specifically email marketing, where the per-email costs are reasonably well known.

        This would suggest a per-email cost of around £0.0032 where the volume is circa 1,000, falling to £0.0012 for a million.

        Obviously, if your repository is hosted on AWS, you could probably get some real cost metrics…

        1. Anonymous Coward

          Re: Easy solution

          Yes, when bulk email is that cheap and subject to abuse, don't be surprised when users resist or refuse to give out their real email addresses. So what does the service get? A throwaway? There are better ways to throttle than that. When 80% of traffic comes from 1% of IPs, the answer is for the service to not guarantee any specific speed. There are some good reasons for that much traffic from one IP (e.g., FOSS users on Tor) and some bad reasons (e.g., megacorps at AWS). In that case, the service can decline to throttle users over known Tor exits, while applying a stricter standard to servers hosted at AWS.

          If someone doesn't ever want to get {email, phone calls, letters} from you (or your automated system), then you don't need their {email address, phone number, mailing address}.

          We, as a society, have become progressively more numb to unwanted contact. Tech types, though, not nearly as much. If we give up that principle, then who becomes privacy's standard bearer?

          My primary email is human-only. I maintain a number of service-specific role accounts from services that might actually have reason to make contact. An "important update" from a repo isn't one of those.

          The open community has more at stake here than just bandwidth. Resisting the siren song of unnecessary account registration using personally-identifiable information sets an important example for others to follow.

    2. Doctor Syntax Silver badge

      Re: Easy solution

      How does your billing infrastructure work?

    3. Anonymous Coward

      Re: Easy solution

      Shitty solution which fucks the free user.

      You can't have billing without accounts, identity, addresses, etc.

      "Free" accounts become a cesspool of throwaway emails that becomes a giant nuisance for the casual user.

      Paid pulls run into the inevitable problem that it's a lot harder to get a name, a billing address, a phone number, etc out of a user (especially a FOSS nerd) than it is the actual $1. Sure, they might be willing to reach in their pocket, pull out a $1 bill, and put it in the jar...but there is no jar, just another platform to sign up for which requires a bunch of personal information and has a privacy policy which translates to "you're screwed."

    4. PRR Silver badge

      Re: Easy solution

      > Regular developers won't have a problem paying <$10 per week.

      In the bureaucracy I worked in, it was a lot easier to steal $1,000 than to be funded $10 as petty-cash. (Which could only be cash, not paper-work money.)

      Why not turn to the CDN experts? Cloudflare, Akamai, Google Cloud CDN, et al have capacity optimized for content delivery at low per-file cost. They have management organizations who are in the business of streamlining delivery and up-selling needy customers. They see the bigger picture that little developers have no hope of tracing: if MegaCorp is sucking hard on one developer's work, it is probably sucking on dozens of other teats, and can be confronted efficiently. CDNs can understand that some of their paid business depends on FOSS software in all sorts of ways, so supporting this ecosystem supports the CDNs' business. While present CDN policies seem to avoid this type of service, they could individually or jointly bud off FOSS/developer delivery and redundancy services.

  6. Jan Ingvoldstad

    The source is free, the CDs are not

    That’s how it went back when we paid to get source distributions in the mail, and it’s fair that we pay for the distribution media today as well.

    But I’m cool with the idea of a free-of-charge tier for hobbyists if it is at all possible to make the distinction.

    I suspect, though, that this will turn out like spamming, which is highly distributable, in low volumes per IP address and/or domain.

    1. VoiceOfTruth Silver badge

      Re: The source is free, the CDs are not

      I used to buy the FreeBSD CDs back in the day. I still have some knocking around for posterity.

      They have identified that 82% of downloads come from < 1% of IPs. The IPs no doubt move around, but mostly come from AWS, Google, and Azure. Bandwidth-throttle the whole networks.
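      Classifying traffic against the hyperscalers' published ranges is mechanical; a sketch (the CIDRs below are documentation addresses standing in for the real lists, e.g. AWS's ip-ranges.json, which are large and change regularly):

```python
# Sketch: classify a request IP against hyperscaler-published ranges
# and throttle accordingly. The CIDRs are placeholders.
import ipaddress

HYPERSCALER_RANGES = [
    ipaddress.ip_network("198.51.100.0/24"),  # stand-in for an AWS block
    ipaddress.ip_network("203.0.113.0/24"),   # stand-in for a GCP block
]

def is_hyperscaler(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in HYPERSCALER_RANGES)

def bandwidth_limit(ip: str) -> int:
    """Bytes/sec to allow: a trickle for the 1%, full speed otherwise."""
    return 125_000 if is_hyperscaler(ip) else 12_500_000
```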

      1. Roland6 Silver badge

        Re: The source is free, the CDs are not

        Given the growth in cloud development workbenches (hosted on AWS, Azure, Google) enabling a developer to seamlessly switch physical locations of work whilst retaining the state of work-in-progress, the growth in downloads from a small pool of IP addresses isn't wholly unexpected. The challenge will be separating the hobbyist from the corporate.

  7. Crypto Monad

    The problem is laziness.

    "git pull" is almost free, bandwidth-wise; to update a local copy of a respository, it only fetches the differences from your local copy.

    The problem is these people are building CI/CD pipelines which start from fresh state and do a "git clone" from scratch, every time. Not only are they fetching the latest version of everything, if they omit to do a "shallow" clone then they're also fetching the entire version history.

    The solution is simple:

    1. Keep your own git copy of the code you use, and refresh it via "git pull" periodically.

    2. Point your CI/CD at your local git copy. Clone it as many times as you like, nobody is affected.

    IMO there's no need for a pricing model. As the article says, the offenders are the big hyperscalers; they easily have the resources to do (1) and (2). In principle then, the solution is simply to block out the big consumers who keep cloning over and over again.

    However, this still requires users to register, and there's a risk of some people using throw-away registrations as a way to work around the blocks.

    1. Steve K

      IMO there's no need for a pricing model

      That doesn't solve the infrastructure cost problem, though.

    2. _wojtek

      it's less about git and more about what PRs incur - each time there is a new commit, the build fires up a new, ephemeral machine, which then pulls the whole tree of dependencies, and this is what's causing the problem... the solution would be GitHub hosting a local proxy on their servers, forcing all builds to go through it, which would easily slash the traffic to Maven Central by something like 50%, probably...

    3. kmorwath Silver badge

      Exactly. Why should a lazy sysadmin put up a mirror, or a cache, or something like that? C'mon, that's work; sysadmins install once and never change anything for fear it might break something... I do have caches and local mirrors for a lot of things; it's faster, and more secure.

  8. Anonymous Coward

    Tragedy of the commons

    Reminder that the "tragedy of the commons" was a fallacious argument thought up by a capitalist to justify the enclosure (privatisation) of common land.

    In reality, commons were administered by consensus, with people who abused their access to the commons being shunned and made social pariahs; their access to the commons, and the community in general, revoked by common consent. This proved to be an effective land management strategy for hundreds of years.

    1. Anonymous Coward

      Re: Tragedy of the commons

      By the time William Forster Lloyd published his pamphlet on how common land could be abused by commoners, the many land owners working individually or together had already fenced off most common land in England and denied people their rights of access. Then Parliament rubber-stamped enclosure.

      Nowadays the tech aristocracy want to keep open source as open for business as it is now. Most licences the tech oligarchy-backed OSI promote allow wholesale ripping off with no giving back; the exception is the established copyleft licences, which the OSI couldn't really claim are not open source (otherwise they'd lose their credibility), but they certainly don't push them as hard as the other licences. They will absolutely not entertain licences which formalise that business has to subsidise open source projects in any way. If there are subsidies for certain projects, it's on the business's terms and probably has strings attached.

      The tragedy of the open source commons is that too much access will destroy it: not access by commoners, but by business. For open source to survive long-term, there has to be some restriction, i.e. some formalisation of business subsidising open source projects, because clearly many businesses only do the absolute legal minimum, if that.

      1. Doctor Syntax Silver badge

        Re: Tragedy of the commons

        Common land in England was never in some sort of public domain. The land itself had an owner, usually the manorial lord. Certain people (e.g. householders or manorial tenants) had certain rights (e.g. pasturage, pannage, turbary, collecting dead-wood). Grazing would be limited by some form of rationing (stinting, agistment).

        There were also complications of vicinage or inter-commoning, where two communities would have common rights on the land between them. If lordship demanded boundaries, there might be differences between the neighbours' definitions of those boundaries, leading to disputes, violence and even death. The commoners taking part were, I think, often pawns and victims in disputes between their manorial lords.

        Enclosure goes back a long way, e.g. to the peak of the English population in late Edward II/early Edward III. If the population grew beyond what existing cultivation could support, extra land would be granted (assarts). When the population collapsed, e.g. in the famine of the late 1310s or the Black Death a generation later, some of these assarts were abandoned. When it rose again, more common land was enclosed, well before the Parliamentary enclosures. I'm currently trying to work out how one common came to have been largely enclosed, apparently by a neighbouring estate, mostly at least half a century before its Act. I've also seen an instance where the post-Enclosure field layout as seen on the 1st ed OS 6" map differed from that in the Commissioners' map; I suspect it was following previous encroachments which had already enclosed a substantial part of it, again from a neighbouring estate, although this time they didn't get to keep it.

        1. Like a badger Silver badge

          Re: Tragedy of the commons

          "The land itself had an owner, usually the manorial lord. Certain people (e.g. householders or manorial tenants) had certain rights (e.g. pasturage, pannage, turbary, collecting dead-wood). Grazing would be limited by some form of rationing (stinting, agistment)."

          But wasn't this all some Norman shit that the F****s put in place, a bit like the abomination of leasehold?

          1. Doctor Syntax Silver badge

            Re: Tragedy of the commons

            From what I've read local lordship was evolving before the Normans arrived and even without it rights would need to be regulated. The tragedy of the commons is what happens when they're not.

            1. Charlie Clark Silver badge

              Re: Tragedy of the commons

              It really goes back to the change from hunter-gatherers to farmers in the Stone Age, and the continuing expansion of the population through the destruction of woodland for fuel and farmland. The most prominent modern-day examples, apart from the contentious notion of the climate, would be fisheries and perhaps soon low Earth orbits.

              But it's a poor fit for describing the challenges around depending upon open source, which is a more classic principal/agent problem. Given the right incentives, I don't think it would be difficult to get businesses to contribute more actively. But I don't see shakedowns providing any such incentives.

      2. Roland6 Silver badge

        Re: Tragedy of the commons

        >” because clearly many businesses only do the absolute legal minimum, if that.”

        A principle businesses apply across the board: wages, training, adherence to regulations…

        I had a laugh a few weeks back when the CEO of one of the UK's major builders, complaining about "red tape" in a radio interview, was asked what regulations he would like to see done away with. His only answer was that planning departments were understaffed and hence unable to turn work round in a reasonable amount of time.

    2. that one in the corner Silver badge

      Re: Tragedy of the commons

      The tragedy of the commons is a description of what happens *after* the bulk of the lands have been enclosed and hence *after* the "management by consensus" is no longer taught to all the folk, replaced as standard by "do what the owner says for your day's pay".

      Add in the increased population and increased migratory workers, who no longer tend land local to their homes but are brought in for a season to work the larger enclosures, and you have lost any social power over the abusers: who cares if Lower Bagshot are shunning you this week, you'll be over at Upper Dicker tomorrow. And that is only if the Bagshottians even spot it was you amongst all the incomers who churned the Downs into mud (don't try this in Ambridge though, they have people who'll remember an incident from decades ago).

      The sudden increase in the tragedy may have come about as a result of the enclosures (and they weren't really Capitalists as we understand the term now, merely the first sprouting of those strangling weeds) but population growth and townships displacing villages was the more inevitable problem. Enclosure most certainly accelerated it and made it stand out, the stark difference between trampled Downs and pristine managed lands; which inevitably was then used as yet another argument (alongside "well, *this* army" and "God says I can", amongst others) in favour of The Man taking charge of resources that the lesser folk are clearly ill-equipped to manage.

      And whatever you believe to be the origins of (and the intent behind those origins of) the tragedy of the commons, sadly it is at this point in time a very real phenomenon, backed by the anonymity available in a crowd and an unpleasantly growing disregard for others: festival campsites to Git repository traffic.

  9. Bebu sa Ware Silver badge
    Windows

    "But I’m cool with the idea of a free-of-charge tier for hobbyists…*

    "But I’m cool with the idea of a free-of-charge tier for hobbyists if it is at all possible to make the distinction."

    I don't imagine these abusive corporate CI/CD pipelines would last long if they were persistently stalled by random delays.

    Sites like Anna's Archive have a paid tier that doesn't have the delay (up to 5 minutes) which the free mirrors impose before commencing a download.

    I imagine something like greylisting with a paid whitelist could do the job.

    Presumably the problem is going to explode with the blight of AI driving an increasing proportion of so-called software development.

  10. gerryg

    No such thing ..

    I think the economist Milton Friedman coined that one. FWIW

    1. Doctor Syntax Silver badge

      Re: No such thing ..

      "I think the economist Milton Friedman coined that one."

      What one is that one? If you mean "tragedy of the commons", no he didn't.

      1. Anonymous Coward

        Re: No such thing ..

        TANSTAAFL

        And it pre-dated both Friedman and Heinlein.

  11. glennsills@gmail.com

    Doing it right takes more effort than people think

    I think you'd need to set up a caching proxy repository for every open source repository you are accessing. https://github.com/google/goblet is an example. You don't want to simply copy the latest code in the open source repository somewhere; you need a way of tracking the updates in the original repo - so "get latest" actually means get latest. Alternatively, you could manually get the latest code from the open source repo and store it in the repository holding your application's code, but that sort of manual versioning creates problems that source code control systems were supposed to solve in the first place.

    I had no idea that caching proxy git repos were a thing. Who knew?

    1. Anonymous Coward

      Re: Doing it right takes more effort than people think

      Give it six months and the community will roll out a converged, multi-service proxy with some sort of local discovery protocol to address exactly that scenario.

      Hell, we could use one of those anyway.

      Necessity is the mother of invention.

      1. glennsills@gmail.com

        Re: Doing it right takes more effort than people think

        Yeah, I was thinking it would be pretty straightforward to create the necessary deployment tools that

        1) creates the instance of the caching repo

        2) points it to the target repo

        3) pulls from the main branch. (Assuming you don't want all the history).

        It really wouldn't matter much whether it was deployed in the cloud or locally; it could all be scripted.
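        A sketch of those three steps as a script (target URL and cache path are hypothetical):

```python
# Sketch of the three steps: create the caching repo, point it at the
# target, pull only the main branch (no history). Target URL and
# cache path are hypothetical.
import subprocess
from pathlib import Path

TARGET = "https://github.com/example/upstream.git"  # hypothetical
CACHE = Path("/srv/git-cache/upstream.git")

if not CACHE.exists():
    # Steps 1 and 2: a bare cache repo pointed at the target;
    # step 3: main branch only, depth 1, so no history comes down
    subprocess.run(
        ["git", "clone", "--bare", "--depth", "1", "--branch", "main",
         TARGET, str(CACHE)],
        check=True,
    )
else:
    # Later runs just fetch the delta into the cached branch
    subprocess.run(
        ["git", "fetch", "--depth", "1", "origin", "+main:main"],
        cwd=CACHE, check=True,
    )
```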

        1. Anonymous Coward

          Re: Doing it right takes more effort than people think

          And just add a universal pull utility that understands the intricacies of each popular source repo and preferentially seeks out local sources, hopefully on the same AS and preferably on the same network segment.

          The necessary concepts were pioneered what, two and a half decades ago? Junior IT staff weren't even alive during the Napster days, or the year after the RIAA killed it, when everything got much, much faster, thanks to innovation. All these big pulls need is an infohash equivalent.
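          The "infohash equivalent" is just content addressing; a sketch, with a plain SHA-256 of the file standing in for BitTorrent's actual info-dict hash:

```python
# Sketch: content addressing, the core of the "infohash" idea. The
# origin publishes one small hash; any peer can serve the big file,
# and the downloader verifies it, so it no longer matters who served
# the bytes.
import hashlib

def content_hash(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

# Downloader side, with published_hash obtained from the origin:
#   assert content_hash("artifact.jar") == published_hash
```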

    2. Richard 12 Silver badge

      Re: Doing it right takes more effort than people think

      It's trivial and built into every "on prem" git - it's literally how git was designed to work!

      The problem is primarily because "on prem" has become "old school".

      1. glennsills@gmail.com

        Re: Doing it right takes more effort than people think

        I think the "on prem" repo you are talking about is really just a caching proxy where you refresh the cache by hand. You are right though, that will work.

      2. Roland6 Silver badge

        Re: Doing it right takes more effort than people think

        >” The problem is primarily because "on prem" has become "old school".”

        It’s IT; the wheel is turning, so “on-prem” just needs a refresh: give it a new name, e.g. “Xyz Local”, bundle it with a subscription, and “cloud” will become “old school”…

      3. a.b

        Re: Doing it right takes more effort than people think

        This!

        Clone once, then pull. It's how we back up 78 GB of company repos every hour without pulling 78 GB every time.
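        The whole idiom fits in a few lines; a sketch, with hypothetical repo URL and backup path:

```python
# Sketch of "clone once, then pull": the first run pays for the full
# clone; every later run only fetches the delta.
import subprocess
from pathlib import Path

REPOS = ["https://git.example.com/app.git"]  # hypothetical
BACKUP_ROOT = Path("/backup/git")

for url in REPOS:
    dest = BACKUP_ROOT / url.rsplit("/", 1)[-1]
    if dest.exists():
        # Hourly case: update refs only, a tiny transfer
        subprocess.run(["git", "remote", "update", "--prune"],
                       cwd=dest, check=True)
    else:
        # First run only: the one full-size mirror clone
        subprocess.run(["git", "clone", "--mirror", url, str(dest)],
                       check=True)
```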

  12. JimmyPage Silver badge

    This is a classic example

    where "free" does not mean "of no value".

    Why can't FOSS repositories get big business to have to sign up with a credit card for when they bust the "free" limit *before* they have access to the code?

    You know, like they do to their customers.

    1. Ken Hagan Gold badge

      Re: This is a classic example

      Today? It's because they (the repos) don't know who is doing the pull.

      Tomorrow? You could insist on users signing up with an email address but we know how easy it is to acquire a new, free email address. Big Corp can simply maintain a pool of such addresses and teach their Azure pipeline to pick the next address from the pool. Hassle, but I'm sure someone will eventually write a helpful tool to automate that and put said tool on GitHub for everyone to use.

      So I think we're back to some heuristics that guess whether this new connection is a returning abuser, and throttling accordingly.

      1. Roland6 Silver badge

        Re: This is a classic example

        >” Today? It's because they (the repos) don't know who is doing the pull.”

        Implement controls, take the risk of upsetting people and they will find out.

        This problem is the same one many businesses faced when they went from free to tiered subscription: LogMeIn, TeamViewer, Dropbox …

  13. Anonymous Coward

    Reg readers need to think this one through

    Government attempts to age gate the Internet have been rightfully criticized as corrosive to privacy, because you can't age gate the Internet without forcing everyone to participate in the identity ecosystem.

    How does everyone think that repos will "corp-gate" access? Magic?

    Nope, it'll start with casual users having to create 12 different disposable accounts on 12 different platforms so they can make the occasional request. When that's not enough, they'll need to cough up a name, a billing address, and a phone number to a payment platform with a shitty privacy policy before they can even make the $1 nuisance payment.

    You can't do what you're proposing without forcing everyone into this identity ecosystem, too.

    There are two complementary solutions which aren't based on identifying every request:

    1. P2P Distribution: Large requests for popular content can go via P2P, leaving host platforms seeding metadata, signature files, and low-volume objects. Quite easy. No technical challenges to solve here, just the will to implement.

    2. Ban Abusive Networks: The flipside to "PHB doesn't want to pay for a local cache" is downtime and wasted staff effort when throttling kicks in. The PHB will pay for reliability. It only takes a few bad actors being made examples of before IT staff get the message about how to maintain reliable access to freely hosted resources. It's not hard, it's just that people are lazy. Exploit that laziness. When pulling everything from the Internet ends up causing more work, people will take the local cache route as the path of least resistance.

    That's better than identifying everyone trying to use a piece of FOSS software to crack down on, as the article notes, the 1% of IPs causing the usage problem.

    1. Doctor Syntax Silver badge

      Re: Reg readers need to think this one through

      I'd add

      3. The faster requests come in from any IP address or block of IP addresses, the slower they get served. In the event of really fast requests, when somebody starts pounding Return because they think they're not being served quickly enough, the connection is dropped and reconnections are refused for the next 10 minutes. This is made quite clear to users, as is the suggestion that if they want to make frequent pulls they set up their own cache. Alternatively, they can have a private cache set up on the server provided they pay for it, solving the problem of who pays for the infrastructure.
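      A sketch of that escalation, with the window and threshold numbers invented for illustration:

```python
# Sketch: the faster an address hammers the service, the longer it
# waits; pound hard enough and reconnections are refused for 10
# minutes. Window and threshold numbers are invented.
import time
from collections import defaultdict, deque

WINDOW = 60.0        # judge each address on its last minute
BAN_THRESHOLD = 600  # requests/minute that slams the door

history: dict[str, deque] = defaultdict(deque)
banned_until: dict[str, float] = {}

def delay_for(ip: str) -> float | None:
    """Seconds to stall this request, or None to drop the connection."""
    now = time.monotonic()
    if banned_until.get(ip, 0.0) > now:
        return None
    recent = history[ip]
    recent.append(now)
    while recent and recent[0] < now - WINDOW:
        recent.popleft()
    if len(recent) > BAN_THRESHOLD:
        banned_until[ip] = now + 600.0  # refuse reconnects for 10 minutes
        return None
    return len(recent) / 100.0  # the busier the address, the longer the stall
```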

      1. druck Silver badge

        Re: Reg readers need to think this one through

        You have to take into account consumer ISPs and NAT, which means there could be a lot of individual low-rate users appearing behind a single high-rate IP address.

  14. Anonymous Coward

    wrong tech?

    I feel this is probably about Maven Central JAR downloads, rather than git pulls?

    1. doublelayer Silver badge

      Re: wrong tech?

      I've worked for the guilty here. For the record, I was the one who eventually set up a caching service to stop the problem, though I only did this after the throttling let me say we needed to do it or our processes would start breaking. In that case, it was a git repo. We already were shallow-cloning, but it was a largish database being distributed over git, and we were cloning it from scratch periodically because a scheduled process was spinning up a temporary environment, getting the database, processing its contents (all of them so we did need the full copy), and shutting down until the next run. Cloning was easier than making a local copy, so that's how the system was first implemented.

  15. ChoHag Silver badge

    If you don't want people to take stuff for free, don't give them free stuff.

    It still isn't rocket science.

    For the benefit of the slow of thinking who seem to dominate these discussion pages of late, restrictions do not have to involve a firewall or any form of technology; they can just as easily involve words written by a friendly lawyer.

    For the even slower of thinking, yes they can. It works for my large corporate employer who go to great lengths to honour such restrictions even though nothing other than the words forces them to.

    Nothing requires anyone to give leeches easy access to their work except for the fear that they might not take it.

  16. Anonymous Coward

    Just block those IPs? Which will, as a side effect, stop this particular megacorp thieving.

  17. Jamie Jones Silver badge

    The problem ISN'T the lack of caching

    No, no, no, the underlying problem isn't that big organisations don't do local caching of the repositories - adding proper caching would be a superficial fix.

    The problem is the dumb mechanism where software loads its "live" system from non-bundled third party libraries in the first place.

    The auditing disaster of npm, Rust, Go, and others is the fact that they encourage a writing philosophy where just about everything is a third party library, so you end up with simple programs loading thousands of piddly library files whose comments are larger than the code.

    The security and reliability disasters on top of this bandwidth problem are with the systems (npm, and others) that encourage projects to download these files fresh not at compile time, but at runtime.

  18. Japhy Ryder

    BitTorrent

    ...is an auto-scaling CDN.

    Just saying...

  19. OpinionatedAhole

    Who would have thought that providing a service for free* would lead to everyone, including large companies, using the free service all the time?

    If the Maven Central folks want to stop paying the bills out of "their" (i.e., their donors') pockets, they should start charging for the service they provide, just like any real service provider does. Plain and simple.

    *in the hopes that they (the free service provider) will gain fame, become a FOSS celebrity, receive $$$,$$$ in donations, and be invited to give speeches.
