Look! A horse!
I think it's bolted...
A UK non-profit is planning to introduce a new licensing model which will allow developers of large language models to use copyrighted training data while paying the publishers it represents. The Copyright Licensing Agency (CLA) intends to launch a Generative AI Training Licence, which is set to be available in the third …
"Training AI models on copyrighted content requires permission and compensation."
Permission, why? Compensation for what, and to whom?
Those are simple questions in a context of enormous assumed certainties.
Consider the UK "Public Lending Right" which leeches upon Britain's communal libraries. When introduced, it overturned long-established principles regarding access and use of books. Had borrowers been charged at librarians' desks, there would have been an outcry; instead it passes itself off as a hidden component of 'rates', the tax paid by home occupiers and businesses. Doubtless, if it were considered remotely feasible, a 'rentier' tax would apply to private book/music/DVD lending among individuals; I made no mention of 'copying', which belongs in a different kettle of would-be monetisable-by-edict fish.
There is wailing in some quarters over people, deemed akin to those 'who steal cars', who use the 'Robin Hood' services of Anna's Archive and Sci-Hub. A notable difference among the categories of complainants about the existence of these facilities is that the former consist of publishers and authors, whilst the latter service, distributing academic literature, occasions moaning only from publishers.
During the early days of home copying of digitally encoded music onto CDs, and later of film onto DVDs, some nations introduced a levy on retail sales of blank media to 'compensate' publishers' creative accountants. Incidentally, that did not affect early attempts within the same nations to extort money from some users of the BitTorrent protocol.
Shall there be demands put before easily 'bought' legislators that all private users of the Internet are taxed to compensate allegedly creative individuals (publishers masquerading as such) for material 'filched' via Anna's Archive, Sci-Hub, and from non-rentier-compliant Internet-based AI services? Perhaps, renowned economists will be set to determine what proportion of every person's disposable income can be extracted, at source, to keep the massive cultural rentier monopolies free from worrying that their easy way of doing business is under threat.
Despite being unpalatable to some, it is undeniable that rentier economics applicable to digitally representable culture is taking its last gasp.
Indeed.
We've been here before with Napster etc...the main reason Napster et al were as successful and widespread as they were is because the open market decided that the price for music and media was too high. Had music and so on been priced fairly, I don't think piracy in that form would have taken off as wildly as it did.
In the early days of streaming services, before Netflix et al enshittified themselves, piracy was in decline because the price was fair. Now that ads have crept back in and packages have been made shittier, piracy will increase again.
In the case of LLM companies, I think the reason a lot of them decided to pirate the materials is because they thought it through and decided they'd never get a fair deal if they tried to negotiate. Thing is, copyright holders tend to value their content based on the volume they think they'll sell at the price they arbitrarily set..."well we made 20 million copies and we plan to sell them at £10 each, that's £200m; if you take away 10% of those sales, you'll cost us £20m...so we want £20m"...thing is, that "cost" is unrealised revenue...it doesn't exist. It will probably never exist...but that's their argument.
Typically the price on copyrighted materials is not the fair market price, it's the price set by the holder/publisher because there is no other way to purchase/consume the content.
The problem we have with content is not that people are stealing it, the problem is the way in which it is valued by the holders. The value is usually massively inflated, based on fuck all and entirely arbitrary.
Everyone else on planet earth expects payment based on the time they spend on something and the scarcity of their skill set...the market rate for their skills. Content creators for some reason expect a lot more than that...over and over again, for as long as their copyright is valid...until content creators come back to Earth and understand where they fit in the grand scheme of things, nobody is going to sympathise with them.
They should only be taxed if two conditions are met:
a) They charge for access to the API / models (even if a limited free tier exists, especially if the free tier is arbitrarily restrictive).
b) They don't release the models for local use. All AI should be free. No exceptions. Either everyone has equal access to it, or nobody does. We're supposed to live in a world with equal opportunity, if AI is not available for free, there is no equal opportunity in the future. It's that simple. It's the only fair way.
Otherwise I agree. If you pirate a shit load of data, the least you can do is release the models for free.
As for UBI...ehh there are arguments for and against that. I don't particularly like the idea of my basic income being at the mercy of a parliamentary debate. It'll only make people poorer and drag us back to being serfs again...corporations will find a way to subvert that and take the piss...it'll reduce their costs (because they will claim they don't have to pay as much due to the UBI) and they will price things according to what they know you're being paid...it's a lovely idea on paper, but I think the reality would be grim...I reckon we'd end up with a bigger wealth divide and more poverty.
It would be far better to use the tax raised to pay for or subsidise resources that are universally accessible. Energy, water, transport etc...or just give every one free broadband or something. It would have to be infrastructure related to benefit everyone...including the AI companies...the more we subsidise energy with AI taxes, the more profit the AI companies make and the more tax we can take with which we can subsidise more energy, which raises profits etc etc etc...this would have the knock on effect of making energy dirt cheap which drives costs down for everyone which puts more money in your pocket.
But does that mean, if the AI wanted to scrape someone's webpage, that someone could pay the non-profit to get a licence that they would then give to the LLM, who'd pay them, and then they'd pay you?
I don't get it. They won't do it for free, they'd want their own commission. So the LLM pays you 1p, the non-profit takes half, so you get 1/2p.
As far as I can tell, the system works like this:
1. This group, or someone else, gets the lucky position of being the only licensor the AI companies will deal with.
2. If you use their license, then they negotiate the price for access to your content. That price will be very low, and they will keep most of it.
3. If you didn't use their license, then clearly you weren't interested in getting funding, so no payment. Your data still gets used.
They seem to have no idea how they'd make AI companies comply with their license when they ignore much clearer rules about not using things they don't have the rights to. Neither do the nonprofits appear to have any mechanism for detecting misuse. My guess is that they're hoping to get a large group of signatories so they can try to negotiate with AI companies in bulk, but if that turns out to be true, I expect that they will fail badly and we will never hear of them again.
What would be a fair price for training?
Google has indexed some 60B web pages. LLMs need everything they can get their hands on. So I suspect they want it all.
At 1 cent per page, training an LLM would cost ~$600M in licence fees. Even at 0.01 cents per page it would still cost ~$6M.
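The per-page arithmetic can be sketched in a few lines (a back-of-envelope check only; the 60B-page figure and the two per-page rates are the illustrative numbers from this thread, not real pricing):

```python
# Back-of-envelope licence fee for training on the whole indexed web,
# using the ~60 billion pages figure quoted above.
pages = 60_000_000_000

for rate_cents in (1.0, 0.01):           # 1 cent/page and 0.01 cents/page
    cost = pages * rate_cents / 100      # convert cents to dollars
    print(f"{rate_cents:g} cents/page -> ${cost:,.0f}")
# 1 cents/page -> $600,000,000
# 0.01 cents/page -> $6,000,000
```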
At the moment, the web scraping is probably already costing websites a lot more than that in hosting and processing costs per page.
I am pretty sure keeping track of the data, accounting, and distributing the license money alone will eat up most, if not all, of the fees.
I don't think this will work.
The cost of fixing smashed locks and hotwiring materials is pretty high, I don't see how I could pay for the car as well. My secondhand car business would close, and all those thieves would be out of work!
See how ridiculous that claim sounds?
If they can't be bothered to properly licence the works they're using, then they cannot be permitted to exist. As an industry, these LLMs and diffusion models are orders of magnitude larger infringement than torrenting or Napster ever were.
"See how ridiculous that claim sounds?"
Sigh, the plan won't work. No amount of huffing and puffing will change that.
That doesn't mean the AI people are somehow in the clear. Just that this plan will not work to make it "legal".
That will not stop the robber barons. It never has.
With particular and peculiar regard to the Anonymous Coward statement .......
That will not stop the robber barons. It never has.
..... the one thing you can be sure about, and it is that which is so terrifying to all who would deny it possible and seek to prevent its rapid unilateral progress, is expert LLM training/trainers will be able to stop and disable and disassemble currently running conventional and traditional style robber barons such as those presently pontificating and masquerading in the guise of popular elected figurines of global political leaderships representative of a national collective, such as be a Keir Starmer or a Donald Trump or a Benjamin Netanyahu.
It is at least one of the major prime reasons why there is such a hullabaloo and increased attention being paid to limit simple and free from registration access to non-governmental controlled mass media platforms with their novel rogue and disruptive renegade social reprogramming abilities/facilities/utilities.
"properly licence the works"
I don't think anyone disagrees with that, the problem is in putting a value on the original work. Copyright holders always assume their work is the final bastion of originality and is worth more than mere money...and as such they pluck ridiculous figures out of the air based on arbitrary metrics.
Where it falls apart is when you add up all the amounts the rights holders want, it would equal more than the size of the economy on Earth. Therefore, the only cost effective way to proceed without having to negotiate ridiculous terms with millions of rights holders is to just pirate it and wait for them to sue you.
The only fair way to license material for AI, I think, is if a prompt is used that specifically calls for something that is specific to your work.
e.g. Let's assume Picasso is still alive and kicking, and he's got a full repertoire of copyrighted works. If I prompt a diffuser to create "a modern art masterpiece, surreal, facial features, feminine, minimalist" and it happens to spit out a Picasso derivative, then no licence is applicable...but if I prompt using "a modern art masterpiece in the style of Picasso" then my motive here is quite clear and I want the AI to take into account some potentially copyrighted material. A licence fee is due.
I don't think training an AI holds as much weight as the motive of a user prompting the model.
Taking AI out of the picture (lol) for a second...before AI existed, if I wanted an image for my website, I could just search for it on Google Images...the results may include images that legally require me to pay a licence fee to use them. If I use those copyrighted images, *I* have to pay the licence to use the image, not Google. Google doesn't have to pay anything to lead me to the image.
I don't see how an AI model is any different. Sure, it was trained on copyrighted materials, and it can produce knock-off derivative versions of the copyrighted materials, but it doesn't produce (nor contain) exact copies of the original.
"orders of magnitude larger infringement than torrenting or Napster ever were"
Is it though? I could download a complete copy of Metallica - Enter Sandman from Napster in its complete form. Once I download it, I have absolutely no motivation at all to ever pay for that track...although weirdly, as has been demonstrated many times over (and exploited by artists, including during the time of Napster), piracy can lead to higher sales for the artist...purely due to discoverability. I discovered loads of artists through Napster, and I subsequently paid for CDs because I liked them...CDs that I would never have bought had I not pirated them first...because I wouldn't have even been aware of them. On the flipside, piracy also helped me get hold of albums that just weren't available in the UK. When I was a kid, I really wanted a copy of The Gone Jackals - Bone to Pick. It wasn't generally available in the UK, you just couldn't get it...you had to request it and pay a massive premium to import it...I was quoted around £70 in the late 90s, which was fucking insane for the time, especially for a teenager...so I just pirated it instead...even then, it wasn't great...I had it in a lossless format, but the production quality was crap...like it had been recorded in a shed through a microphone shoved up the lead singer's ass...so I had to remaster it myself so it wouldn't sound shit.
As far as I am aware, nobody has extracted a complete original work from an LLM / diffuser because I don't think it is possible...therefore the model producers are not distributing copies of original works. What an LLM does contain is a huge number of vectors and parameters that have been derived from original works...which could be considered to be transformative and therefore subject to fair use.
"I don't think this will work."
What you mean is that their business model isn't financially viable if they have to fairly compensate content creators. Since it requires every last scrap of human thought to be shovelled into it before it becomes even vaguely viable then unsurprisingly the compensation bill would be excessive.
Usually when a business is faced with that situation what they do is find a different opportunity, one that can be made to work fairly and ethically, or make the process more efficient or whatever. The point is the onus is on the business to change something if they really want to carry on down that road. The alternative course of action to fuck right off and leave us all alone is also available.
It's a bit like Ford deciding that if they had to pay for the steel that goes into their cars, it would make them too expensive for consumers, so they just help themselves to it from the foundry. [To use an analogy that the MAGAs out there will understand...]
I think what they were saying is that this licence won't work even if the AI companies accepted it. And they would be right. Their financial objection is not the only reason it won't work; I think they won't even get to the stage where that would be the problem. However, if they did somehow get to that stage, it would be the problem. Distributing payments that amount to a few pennies to millions of people is so expensive that it's not worth doing, especially when the people in charge of it realise that if they try, nobody really gets anything, and if they don't, nobody really gets anything except the people in charge, who get something spendable. See also the settlements reached in most group or class action lawsuits.
I don't know about you but I value my intellectual output at more than a few pennies.
Class action lawsuits are one-off events and the injury involved is usually quite mild, e.g. a phone battery didn't last as long as it could have. A better analogy would be if you were injured due to someone's negligence and unable to lead a productive life from that point forward. You would be compensated on the basis of lost lifetime income. This is what the Bros are trying to do to us all: suck the life-force out of us and sell it back to us as a monthly subscription.
The problem isn't the distribution of the compensation, it's that the real value of what they are misappropriating is gargantuan even for their very deep pockets.
I value my product at more than that too, but I can guarantee you that if you signed up for this service, they would negotiate for access to everybody at once and they would negotiate for a value that would end up being pennies for each participant. If they tried to negotiate for individual pieces, the negotiation would never end because of how many individual pages there are in that set and how little any AI company wants to actually decide which ones they care about. If they tried to negotiate a high price for everyone, nobody would agree to pay them. Therefore, expect that that's all the payments would be worth if you agreed to be represented by this bad idea.
The point of the lawsuit settlements is that they often get settled with what looks like a large financial amount, but that amount is small per participant and participants don't get all of it. For example, you might see a settlement of ten million $local_currency_units, which, when divided by the 400k participants would give everyone 25. That isn't a large amount for what could be a large offense, but they won't get that much anyway because in practice, it actually goes like this:
Settlement amount: 10M
Lawyer's fees: 8.3M
Processor's fees: 200k
Cost of posting notices to 400k people: 50k
Remaining amount: 1.45M
Remaining amount per person: 3.625
Expenses for delivering payment of that size, per participant: 1.439
Actual amount received per participant: 2.18, sometimes in some inconvenient form like a discount voucher
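The breakdown above can be reproduced in a few lines (all figures are this comment's illustrative numbers, not data from any real settlement):

```python
# Hypothetical class-action settlement breakdown, using the
# illustrative figures from the comment above.
settlement     = 10_000_000
lawyers_fees   =  8_300_000
processor_fees =    200_000
notice_costs   =     50_000
participants   =    400_000
delivery_cost  = 1.439       # per-participant cost of sending the payment

remaining  = settlement - lawyers_fees - processor_fees - notice_costs
per_person = remaining / participants
net_payout = per_person - delivery_cost

print(remaining)     # 1450000
print(per_person)    # 3.625
print(net_payout)    # ~2.186, quoted above as 2.18, often as a voucher
```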
PRS, YouTube, Spotify and ad networks all exist.
PRS charges tens of thousands of venues an annual licence fee and divides it up among creators using some popularity info, as they've no way of tracking what was actually played. I suppose that's similar to "inferencing".
Spotify charges millions of people a monthly fee, tracks what they each play and pays those particular licence holders a fee.
YouTube is like Spotify, but ad revenue rather than plays directly. Ad networks pay the websites to show their ads.
These do pay peanuts per play, of course. Most artists get pennies (a friend got 50p a year from PRS), but because huge numbers of plays happen a fair few artists get decent money.
The "AI" simply want to take it all for free. No more, no less.
I do think the first problem they'd run into is getting any AI company to accept that they need to pay for the data they think they can have for free. However, if they somehow managed it, I do think payments are a problem because, unlike the systems you mention, there is a very different scale involved. In three of the four options you list, the payments are amalgamated over time. For example, if I put an ad network on a site of mine, they add up all the ad views for all the pages of my site over a month, then send the payment for those views to me. Meanwhile, any company who paid for this would be using a pay-once for unlimited usage policy because it is required by the system they create; they can't count the number of uses of any document because who knows what even happened to it after the training process got it. A single licensing charge divided by everyone who had documents in the set is going to produce very small amounts that aren't going to recur. That is not convenient to pay. They could be honest and find a way to pay it anyway. I'm predicting that the people who have access to the lump sum are going to realize the personal benefits of not doing that.
> A single licensing charge
Why would it be a single licensing charge? An ongoing monthly royalty payment for any creator whose works have been used for training would be a far fairer method. And the creator should be able to set whatever fee they wish, or have their works removed from the training data.
This would obviously require some independent compliance monitoring, the first step in which is for a list of all the LLM training data used to train a model to be made public. The fact that OpenAI and others are desperate to avoid such scrutiny tells us all we need to know about the legitimacy of their use of data.
"Why would it be a single licensing charge? An ongoing monthly royalty payment for any creator whose works have been used for training would be a far fairer method. And the creator should be able to set whatever fee they wish, or have their works removed from the training data."
You have to recognize the difference between something that would be fair and what these people are trying to do. A lot of things would be fair: the company is forbidden from using your content without your permission, you can set any price you like, they have to have ongoing permission, you can withdraw permission. None of those things are planned or will happen with this method because they are hoping that AI companies will voluntarily sign on to this plan. AI companies, meanwhile, are using the system of being allowed to use anything they want without anyone's permission or a requirement to pay for it. They don't want to accept any reduction in that and will only voluntarily do so if it is cheap and results in a decrease in legal risk to them.
The reason it will be a single payment is that, if it was ongoing, AI companies wouldn't agree to pay for it. They also wouldn't agree to removing it at any time because they can't remove it from their models after creation and because they have no interest in maintaining the systems necessary to find and remove it from ongoing training data at your request. The reason it will be a non-negotiated payment is that it would take forever to negotiate with each person in a group of millions for how much they want for an individual page, and because paying a fairly-negotiated amount would be more money than they have. The organization trying to sell all this content to them will either request these things and never get anyone to agree, or they will negotiate all at once for one tiny value because it means they don't end up a complete failure.
This is why this suggested method is bad. It will not achieve any of the things we need, nor are they necessary to remedy the illegal actions of AI companies. Existing copyright laws already implement all of this; it's illegal for the content to be used without permission and compensation and negotiations for those would have to be individual. All we need to make your preferences (which are also mine) happen is for courts to confirm that AI companies are not exempt from copyright law and punish them for their illegal actions. A licensing organization will not help this happen, and they will not try to organize something that makes you happy. They will try to organize something that makes them happy which will be more advantageous to the AI companies than it would be to you.
> LLMs need everything they can get their hands on.
Well, that helps fulfill the woeful objective of "plausible but wrong".
To be actually useful LLMs have to produce output that is better than mediocre. I don't see how swallowing the entire internet -- have you seen what's on the internet?* -- and regurgitating it wholesale can ever do that. Training only on high quality data would be an improvement, but throws up obvious problems for the parasites trying to exploit this stuff.
But in any case, is this really what we expect from Artificial Intelligence, that the very best it can do is parrot the data that it was trained upon? Hardly progress, is it?
-A.
* A joke, but, well, have you?
Indeed.
The vast majority of people get paid based on a time and materials basis; content creators don't. I knew a very famous musical "artist" who had a one hit wonder in the 80s; he died sometime before the pandemic...everything else they released was shit and never made any money, so the band just went their separate ways about a year or two after the hit. His royalties for this one single hit were in the hundreds of thousands of pounds per year on average; not every year was the same, but it was at least £150k a year (mostly tax free due to the way his royalties were paid and where they were paid).
I think people generally have a problem with this sort of setup, because the vast majority of people work on a time and materials basis. They don't work once or for a few years, then live off the royalties for their work for the rest of their lives. The problem is, copyright is too long. I think the efforts of Disney over the last 100 years have a lot to do with it...I think unique and original works should have some kind of short to medium term protection (like up to 5 years or something) but beyond that it should be considered public domain...there should also be some consideration given to the saturation of an artist's work with regards to the length of copyright as well...for example, if you manage to produce something that reaches millions of people or sells millions of copies, it should be considered public domain in a much shorter period of time...because at that point it basically is.
Artists should be respected and credited for their work beyond their copyright period and they absolutely should have control over who can use their work, but they shouldn't be able to milk it for a lifetime of royalties...humanity has always tended towards copyright theft and mass distribution ever since the invention of the printing press, I'm amazed we've got this far and there are still people fighting the same battle...if you create something worthwhile and it grabs the attention of a lot of people, it's going to be pirated. Especially if you put it on the Internet...at this point they're trying to fight the tide by bashing waves with their shoes...they need to at least meet the demands of the world half way...because if they don't compromise, they will just be swept out to sea.
Well, I think there's this thing called "copyright" that you might have heard of... but it's a relatively new concept so perhaps you haven't.
It's supposed to protect content producers from being ripped off. All the AI companies are conveniently ignoring it because they're backed by companies worth billions and have armies of lawyers. They are ripping off everything left, right and centre. Far more than any so-called "piracy" could ever hope to do. But most individual publishers are too small to fight it.
As for why a non-profit would want to do this? Surely it just means they charge a fee for their services but it all goes into the directors' pockets and there are no profits to distribute to shareholders. Aren't there also tax breaks for non-profits? Why wouldn't they want to do that? Plus it sounds nicer to most people. Not some greedy publicly-traded corporate entity in cahoots with some unseen business interest.
So by that analogy,
I don't need to pay for Metallica songs: https://en.wikipedia.org/wiki/Metallica_v._Napster,_Inc. What's most egregious, the fuckers just need to wait https://www.gov.uk/copyright/how-long-copyright-lasts if they don't want to pay.
The very concept that someone does not need to pay to access Intellectual property (via copyright) is so un-American, so anti-capitalist it makes you feel sorry for Ayn Rand.
No I think the Large AI companies (all seemingly American) can pay like everyone else.
The tech giants and their financiers don't give the slightest micropoo about legality just so long as they can establish a monopoly to exploit before (and if) it's banned. By which time, by definition, there's no competition to step into the gap.
This is how 21st century capitalism works. Suck it up.
-A.
Yet another attempt to make money by an organisation trying to intermediate itself between two groups to "solve" a problem that is better handled by current law. Use someone's work without permission and you pay a fine and are forced to destroy your derivative work.
The only ones that benefit from this scam are the LLM scams stealing people's work. Why? Because they are vastly exposed to extinction level lawsuits and want "greater certainty."
1) Non-profit proposes license.
2-99) Blah
100) Government mandates licensing.
101) AI mobs buy licenses.
102) AI mobs charge for services to cover licensing cost.
103) AI mob in (for the sake of argument) China says that decadent westerners can shove their licensing regime up sideways and carries on as before with leeching and free service.
104) Chinese AI pwns the world and everyone else goes titsup.com
This is the inevitable result of some bunch of idealistic twatspanners attempting to stuff the genie back into the bottle with the sink plunger.