Wow
Following last night's "US House mulls forcing AI makers to reveal use of copyrighted training data"
A rare bit of joined-up thinking on this side of the pond too!
UK lawmakers have slammed the government for its lack of action in protecting copyright holders against the infringement of their intellectual property by developers of artificial intelligence technologies. In a report from the House of Commons Culture, Media and Sport Committee, MPs said the government's working group on AI and …
"Indeed. Greed makes the (business) world go round."
Naw. A popular opinion, but largely wrong. Business would probably work just fine in a world where investment was based on honest reporting of business information and PR work was a felony offense. What greed fuels is fads, booms, busts and the ripping off of widows and orphans. More exciting for sure. Excitement is important in sports and entertainment. It's not necessarily the best foundation for a real world economy.
I agree the current (and maybe perpetual) lack of a revenue stream is part of the reason. It's hard to share the profits with copyright owners if there are no profits to share.
Also, I strongly suspect that there are presently no audit trails tying training data to AI outputs. I'm not even sure it'd be technically possible to implement such trails. I suppose some sort of "if your AI reads copyrighted material you must pay the copyright holder a few cents a word for the right to potentially put the information to profitable use" scheme might work. But it'd likely be hard to implement. And error prone. And widely gamed until the sundry loopholes are identified and closed. Lots of lawsuits. And lots of outrage. ... Tis a puzzlement*
* Copyright Rodgers and Hammerstein, probably. But the Reg and/or I presumably don't have to pay up under the US concept of Fair Use.
Why should they pay?
Reading what is visible online is no different from what you or I do.
Generating images in response to user prompts should then produce something relevant to the user's prompt.
If the user chooses to push the AI towards recreating a known work, then the responsibility for that is with the user.
However, producing a new work that is similar in style to something that already exists should not entitle anyone to a payout.
This is just the arty 'creatives' having their luddite moment - and I don't recall them demanding that previous skilled workers should get a piece-wise payment for automated products that replaced their jobs...
I think, trying to be neutral, the question is whether the output represents something distinct from the input. In Copyright terms, this would be whether it's 'transformative'.
Though I think it's an interesting question if the end user plays a distinct role in the generation of potentially infringing material, I'm not sure how much weight that would carry and I don't think copyright holders are interested in going after targets they won't get an appreciable payout from.
This is lawyers trying to generate more class action lawsuits, and getting paid enormous amounts of money.
Look at the copyright strikes on YouTube etc. Some of them are legitimate; others are ridiculous - and since YouTube has very little feedback on those claims it can't get better at it.
If it's a law though, there's no need for the lawyers - although they will say that any law doesn't go far enough and try pushing it further.
Yes, by the legal definition, the output is a derivative work, but that alone doesn't tell you whether it is infringing or not. There are derivative works that have been judged to be infringing; there are others that have been judged not to be infringing.
If you disagree, could you please cite the section of the applicable law or specific legal judgment which you believe clearly identifies the output of 'AI' as infringing?
When you read, you don't copy and store. AI training makes copies of the data in an electronic retrieval system, a process usually explicitly forbidden in the license under which the content is made available for consumption.
Remember, there is no "right" to copy. There is only an explicit grant to do certain things. The law provides for certain carve-outs, but the wholesale copying and storing of content is not "fair use" for example. Publishing content on the internet does not provide an implicit right to copy and re-use.
The propensity of LLMs to reproduce original training material indicates that, in some encoded form, they have a copy (an unlicensed copy) of the original content. The LLM operators' provision of "guardrails" is merely hiding this fact, not disproving it. In fact, I'd go as far as saying that the anti-source-material guardrails are simply concealing the evidence of criminality! By blocking the ability to reproduce certain content (say, the books by authors in a class action), they hope to convince the court that the original content is not being stored.
> you don't copy and store.
Well, does AI training?
Sure, the system likely includes some caching mechanism (as do browsers, btw.), but the ultimate end product, that is, the ML model, doesn't store the ingested data (barring effects like overfitting, which are not intended to begin with).
Not saying what they do is okay, just pointing out that if we slap them for it, things are moving to really thin ice. Because, if I can shut down any copy mechanism, however transient, then where does it end?
Does a browser's cache (which can store images for days or even weeks depending on what the header says) count as an illicit copy of copyrighted material? What about the caching mechanisms of proxies, VPNs, ...?
The parallel is still correct. Reading copyrighted material without obtaining an authorized copy is also illegal. The copyright holders usually won't bother charging you, but not because they can't. They don't because it's a small case, not worth their time, and they don't care. It's still illegal, though. The initial recording is already a crime. The reproduction is another crime. If you do either with a human brain instead of a computer, it is still a crime. If they obtain authorized copies, they may still not be allowed to do with the contents what they are doing, and that's a legal dispute that will be handled separately. So far, the LLM creators have decided to skip this argument they might win and go straight to a stupid one: that they should be allowed access to it all for free, which they should not.
Reading copyrighted material without obtaining an authorized copy is also illegal.
That seems unlikely in most jurisdictions. Consider a number of scenarios:
a) I am wandering down the street minding my own beeswax when my eyes alight upon a billboard and instinctively I read the advertising. I would claim that the advertising material is copyrighted, I have not obtained it, but I have read it. This cannot be considered illegal.
b) I am perusing the greater internet when some advertisement intrudes upon my viewing pleasure. Again the advertising material is copyrighted and I have [inadvertently] read it. Now I need to accept that a copy of the material has been made on my computer, but I would claim that there was no mens rea (indeed I am annoyed by unwanted intrusion) - so unless there is some strict liability attached to the nature of the material that has been copied I would again consider that nothing illegal has taken place.
I think you will find that in most places the law treats what is known and held by a human mind quite differently than a copy held outside the mind.
Both of those adverts are intentionally shown to you. There is no infringement. What I'm referring to is intentionally finding and reading an unauthorized version of a work. If you go to a site that distributes unlicensed video, and all you do is watch the video, you've done something illegal. In practice, you'll generally be safe from consequences because people don't really care and those who do are too busy with those who distribute those copies.
There are, of course, exceptions to this. If someone else obtains a copy of media illegally but you don't know that, you'd be able to correctly say that you didn't have mens rea. There are plenty of reasons why you could end up seeing something you weren't supposed to but it isn't a crime. None of them apply to AI companies who wanted to use some content and deliberately obtained copies of that work. Nor would you be off the hook if you deliberately downloaded work, knowing that the copies were unauthorized.
Me: "Reading copyrighted material without obtaining an authorized copy is also illegal"
"So listening to the radio, watching TV and watching YouTube is a crime?"
Let's see:
"listening to the radio": They pay for the right to distribute it to you for free. Legal.
"watching TV": They pay for the right to distribute it to you for free (broadcast), or for a payment (other methods of sending out video mostly). Legal on their end. Legal for you if you're watching something you have a right to. Illegal if you hacked them to get paid content for free.
"watching youtube": If the video creator has the license to use and distribute the material (noticing a pattern here), legal. Sometimes, they don't, which is illegal. YouTube is supposed to take that kind of thing down. Sometimes they do.
Does this answer your question?
Well, to get back to the original point, and indeed the title of the article, "MPs ask: Why is it so freakin' hard to get AI giants to pay copyright holders?"
If an AI ingests copyrighted material by these legal routes and remembers it,
how is that different to me remembering it?
And I didn't have to pay the copyright holders.
That's not getting back to the original point. That's getting your original point back out and just starting over. I and others have explained, at length, that:
1. The issue is more than the LLM "remembering" the work. It started before any training was run.
2. There is a massive difference between an LLM remembering something and your brain doing it.
3. There is another difference between the LLM's later use of the work and you simply remembering it. Some of the problem is about this later reproduction, which is what they're using LLMs for.
4. You do need to pay the copyright holders either directly or by proxy (buying a secondhand book for instance) if you do anything remotely similar to what they did, even if only using your brain.
5. All these points have been explained to you.
"Well, does AI training?"
Yes.
"the ultimate end product, that is, the ML model, doesn't store the ingested data"
Yes, it does. Hence why verbatim copies tend to get spit back out, and why they've had to build rules to look for and deliberately prevent it from showing it to you. It's in there.
"(barring effects like overfitting, which are not intended to begin with)."
Why their production models contain the data isn't really the important part. Whether they wanted that or got it by mistake, they obtained it without permission and despite it being illegal, used it without permission and despite it being unauthorized, and store it without permission and over the protests of those who could give them that right.
"Does a browsers cache (which can store images for days or even weeks depending on what the header says) count as an illicit copy of copyrighted material? What about the caching mechanisms of proxies, VPNs, ...?"
If it is being used for a commercial purpose, which the LLMs are and the caches aren't, then it becomes much more obvious. I think the law could be improved to clarify that caches don't qualify as infringement, but it's a completely different issue than this. The data isn't being stored temporarily in the service of some other operation. It is being stored permanently in a lot of locations with the deliberate goal of having access to use it. They are not similar. I think you knew this already. I wonder why you tried making that comparison.
-- When you read, you don't copy and store --
So what are all those books in my personal library, some bought new, some second hand and some as gifts but I've certainly stored them and I'm pretty sure they were copies not the original manuscript.
I vote for all secondhand bookshops to be demolished to preserve copyright!
Your attempt at a joke aside: all the copies you described are likely authorized copies per copyright law, unless some of them were illegal printings with no compensation to the copyright holder(s). Copyright law does not forbid copying, only unauthorized copying, and reselling of authorized copies has long been established as legal.
I think the problem is one of scale. If I was to write a story about a bunch of kids getting into adventures (think anything Enid Blyton has ever written...), there would have to be some degree of effort involved in doing so, plus the time to actually do it. I could probably write a story about a British boy wizard, but it would have to be sufficiently different to the obvious book series to not be done for some sort of plagiarism. And even if it's different, it'll obviously be compared against the well known series.
AI, on the other hand, can ingest huge amounts of data (and store exactly what they ingest, not just the bits they remember), and transform that into something else in order to churn out a dozen stories every minute, a scale unknown before.
If I was an author, I'd be less worried about "they stole my work" and more worried about readers ending up drowning in so much mediocre shit that it's no longer worth bothering to write books or read them, the ultimate enshittification. As for complaining about copyright, well, it's pretty much the only weapon they have isn't it?
That last paragraph, 100%
There is perhaps an emerging market for _editors_ whose taste and style you can trust to select works you might also like. That isn't the same as 'people who bought that also bought...' but superficially it looks the same, so probably that will end up roboticised, averaged out, and enshittified like the works it should be (not) selecting.
The illegal part was the use of the original work, which was not granted by the copyright holders.
Just to clarify, I'm not sure if you meant it this way or not but copyright holders can't dictate how their material is used by the consumer (whether that consumer is an individual or a corporation). They can only make those decisions in cases where infringement has taken place WRT whether they prosecute or not.
Galoob v. Nintendo
To give a hypothetical example, a given musician can stop a given politician from using their music at a political rally (assuming said musician still owns the copyright), but they cannot stop said politician from listening to their music while they dream up a crazy rambling speech to give at their rally (assuming the song isn't quoted wholesale), even if that politician openly says that they were inspired by a given song.
Firstly, at least some of the works do exist in some encoded form within the models; both LLMs and image generation models have been shown to be able to regurgitate things that were in their source. Just because it's not stored as plaintext or a png doesn't mean it's not in there. That particular situation is not much different to creating some self-extracting archive format and then claiming whatever you put in it is not a copy because it can't be extracted with normal tools.
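The archive analogy is easy to make concrete. A toy sketch, with ordinary zlib compression standing in for whatever encoding a model uses (model weights are obviously not a zlib stream; the point is only that "no plaintext inside" proves nothing about whether a copy exists):

```python
import zlib

# A highly repetitive "work", so the compressed blob is guaranteed to be
# smaller than the original (and thus cannot contain it verbatim).
original = (b"It was the best of times, it was the worst of times, "
            b"it was the age of wisdom, it was the age of foolishness. ") * 20

encoded = zlib.compress(original)

print(len(encoded) < len(original))          # True: a smaller, unrecognisable blob
print(original in encoded)                   # False: no verbatim text inside it
print(zlib.decompress(encoded) == original)  # True: yet the copy is fully recoverable
```

Nobody would accept "the blob doesn't contain the text" as a defence for distributing the blob; "it can't be extracted with normal tools" is the same argument.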
Secondly, the luddites were at least partly right. You can paint the complainants here as "the arty 'creatives'" if you like (presumably a group that contributes nothing to society), but is the end goal really just to replace people with machines? At least ultimately devices like washing machines or assembly line robots save people from repetitive hard and dangerous tasks, but now we're really starting to cut into roles that actually give people some degree of joy or meaning. It's easy to paint this as democratisation, hey now anyone can generate images that would have taken them much longer and years of training before! Except it's not freed anyone from anything. The people who loved doing that in the first place will no longer be able to do it as their livelihood, the rest of us end up with something that's no longer special and for the privilege we all pay the people who own the things (built on stuff they stole in the first place).
Easy to say that what AI systems are doing in remixing old (often still copyrighted) works is no different from what people do, but the point is that it's people doing it. Those exceptions, for education, for criticism and reporting are for the benefit of people. Copyright was created with the intent of protecting people from unauthorised duplication of their work, to allow a livelihood from creating works of art, often these days it's swallowed up by corporations.
One of the promises for automation and industrialisation has always been that it will free people to lead better lives, without drudgery. So, we got rid of the weavers and the miners, then the typing pools, now we're going after the drivers, the writers, the artists, the programmers. What happens when we've replaced every job? It hasn't actually resulted in people working less, just more things that the wealthiest can claim rent on.
Quote
"One of the promises for automation and industrialisation has always been that it will free people to lead better lives, without drudgery. So, we got rid of the weavers and the miners, then the typing pools, now we're going after the drivers, the writers, the artists, the programmers. What happens when we've replaced every job? It hasn't actually resulted in people working less, just more things that the wealthiest can claim rent on."
What will happen is the 'creative types' won't bother creating anything... what's the point? You won't get paid for it... then what will the AI people train their models on? And who do they sell the results to?
The object of replacing as many jobs as possible with AI is not to improve our lives, but to improve profits for the companies, as they won't have to employ people to man helpdesks, call centers, write scripts, etc., etc.
Sadly, once everyone is out of work because of AI, who exactly will have any money to buy the products these companies are touting?
In the end it will then not be about the money or the profits, but the power those people / companies will have over the countries / world. The company that makes the first true AI (not these crap LLMs, GAI, GPT...), one that is truly intelligent and can accomplish any task, will have won if they put it to use. They would be able to use it to evolve that AI faster and carry out any task quicker than anything else. They could start replacing every worker with the AI. They could then start taking over / bankrupting every company there is, as they could do everything faster and cheaper than anyone else.
Once they have achieved that, they will have become the world power.
The company that achieves this level of AI wouldn't sell it to anyone, and once they have replaced most companies, they would now be in charge; they would be the country's / world's GDP; they would decide. They would end up providing the world's currency and would provide credits to people, UBI. They could create their own army, develop and build new offensive and defensive systems with unlimited supplies.
The only way this will not happen is if the governments step in to stop it. Which then would require the company that does create it to tell them about it. If they are going to be forced to restrict its use or sell it to others, there is less incentive to work on it.
If the company that does develop it doesn't do the above, the next company that does will be able to. It could be a gradual change, but if not stopped at the start, it would slowly creep into the above.
A pessimistic outlook, but what would happen if, say, a company in China, Russia or some other non-West-aligned country we aren't friends with developed it first? Their development and production output would skyrocket, leaving the West unable to compete at all. Sanctions would do nothing, import bans would leave the West behind, China would be able to develop superior weapons; who would stop them?
This is why the AI race has now started. Even though they are going in the wrong LLM direction at the moment, pouring money into something that will not accomplish the above. Once that changes, we may all be screwed.
>Reading what is visible online is no different from what you or I do.
LLM developers are not "reading".
LLM developers are "downloading and using as input to a computer program, which produces an output (the model), which is then commercially exploited". That is not "reading" and does not even look like "reading", not even if you squint.
The LLM itself might be doing something that might be seen as "reading", if you feel inclined to anthropomorphise, but nobody is suing the LLM itself, because it's not a legal entity.
I hope the Editor will forgive my posting here a slightly modified version of what I posted earlier under the article linked to by https://www.theregister.com/2024/04/10/congressional_bill_would_require_ai/
After all, in the greater scheme of things, it is only a handful of digits providing 'free copy'.
------
Copyright is 'bad law' by virtue of two characteristics.
1. It was always a specious concept that ideas, and their expression, can be owned in the same sense as oxen and asses. Nowadays, 'medium' (e.g. paper) and 'message' written upon it are not bound together; the 'message', in digital format, is an entity in its own right; it can be duplicated and distributed without there being practical restraint. The 'economics' of the digital differs profoundly from that pertaining to 'medium-bound' messages; the latter entails the cost of binding the two together, and the distribution of a physical entity; thusly presented it has properties similar to wholly physical artefacts: individuality, and a unique position in time and space; hence the erroneous impression arises that the 'message', not merely that to which it is bound, has the nature of property. That which is 'messaged' is an abstract entity, a product of the mind, and one easily incarnated in digits. From which it follows that ready duplication in digital format implies no monetary worth beyond that of storage and transmission. In turn, is implied lack of scarcity. Traditional supply/demand market economics with price discovery makes no sense. In desperation, monopoly distribution 'rights' are imposed by law, with the resulting irony that 'true believers' in market-economics abhor monopolies.
2. Copyright, this in the context of the dawn of the digital era, is no longer enforceable. Immense rearguard action is being taken by those believing they hold 'rights', but to increasingly less avail. For instance, the current spat in the USA over use of copyrighted material fed into AIs is parochial; copyright cannot, despite effort by the increasingly anachronistic US Trade Representative, hold sway as deglobalisation progresses. People elsewhere are becoming enabled to defy monopolist rentiers. The case of the Luddites illustrates how technological advance can disadvantage some people whilst opening doors to opportunity for others. In the current example, so-called holders of 'rights' obfuscate the matter by asserting ruin for creative people: in fact it is publishers and distributors, the principal complainants, who stand to suffer greatly should they not adapt. The truly creative, not meaning people 'constructed' by publishers, have an opportunity to enthusiastically adopt the alternative (pre-copyright) means of financing their work; thereby, having deployed the Internet to cut away middlemen, the people upon whom the creative depend, shall have more disposable income to support cultural activities according to their interests: many big fish shall be rendered tiddlers, and many more people at present hesitant to explore that which their imaginations offer shall emerge as contributors to genres of culture.
A separate consideration is the opportunities so-called AI offers mankind. Although grossly exaggerated overall, AIs are being shown capable of two-way communication in natural language, this coupled with potentially immense aptitude at being curators of knowledge/culture drawn across divers fields. In that regard, some already possess a breadth of information far exceeding that of the best educated among the people. Discussion of whether AIs can understand the information they possess should be relegated to the same realm of debate as that concerning the number of dancing angels which can be accommodated on the head of a pin; however, regardless of metaphysics, it's apparent that AI 'skills' go beyond curating stores of information and simply regurgitating some of it. In response to requests, especially well posed ones, AIs can trawl through their data and identify correlations and putative patterns. What they express may be insights (connections) which their human interlocutor, or indeed any human, had not previously perceived. As the technology advances, the proportion of nicely worded nonsense will drop. Even so, humans grovelling before this new fount of knowledge shall remain obliged to apply personal understanding and reasoning skills in order to distinguish correlations and patterns worth following up, from the wholly spurious.
Should the US Congress impose regulations of the sort under discussion in the article above, then people dwelling in the USA shall be denied the full potential of AI, else forced to pay sums of Danegeld to people of 'rentier' mentality in excess of that they pay already. Clearly, some members of the British Parliament harbour a similar taste for anachronisms and rentier economics as their counterparts in the USA. Meanwhile, other places, e.g. the Global South, will deploy information as they see fit.
It is to be hoped that LibGen, Sci-Hub, and similar noble efforts to share knowledge and culture, shall gain access to AI technology.
-----
Released under the Creative Commons “Attribution-NonCommercial-ShareAlike 4.0 International Licence”
https://creativecommons.org/licenses/by-nc-sa/4.0/
Hogwash.
Just because copying is easier for you and I, does not invalidate copyright. "From which it follows that ready duplication in digital format implies no monetary worth beyond that of storage and transmission. In turn, is implied lack of scarcity." That is an argument of convenience, not fact. Taken to its logical conclusion, there is no living to be made by any creative endeavour, beyond the craft of the embodying item (statuary, ceramics, woodwork). Really, if it's digitally encodable, there can be no copyright - and thus, no living to be had?
I'm not saying that all copyright is good. The egregious behaviour of the likes of Elsevier should be subject to regulation - especially as the actual content really is free. But all copyright is not bad either.
I didn't downvote, but while copyright is broken and often enforced badly and/or in ridiculous ways...
...it is generally intended that a creator of content is able to have some say in how that content is used and some form of remuneration for their work/creativity. Now, yes, I know the "company" gets the lion's share and the actual creator gets peanuts, I did say it was a bit broken.
But the alternative - give everything to the world for free and have everybody copy it wherever and whenever? The obvious question to that is why the fuck would one choose to do anything creative in that case? For some people it's not a pleasing hobby, it's their livelihood. It's what they do, from the cute girl with the clipboard standing beside the camera to the dude banging out power chords at an insane pace.
I'm not a fan of copyright, but I don't have any ideas for what would work better (other than not allowing corporates to hold on to one idea for a hundred odd years...).
I think the poster's point is that copyright is rapidly becoming unenforceable and has been for probably many years.
The moral argument is debatable: that creators can *only* get paid because of the mechanism of copyright is tenuous, and many of them are moving to the patronage model, a system far older than copyright. The Internet age is making patronage far more lucrative and has the added advantage of connecting creators to their audience. In my view it is a far more healthy relationship compared to the often adversarial one encouraged by publishers.
copyright is rapidly becoming unenforceable and has been for probably many years
If you go back to the 19th Century, popular British works (such as Dickens or Gilbert & Sullivan) were being ripped off on an industrial scale in the USA even in the absence of modern technology: there would be "musical spies" listening in to popular opera rehearsals, transcribing the music which would often be on sale even before the first performance. Dickens spent a great deal of time on personal tours of the USA to make money from personal appearances that he was not getting from book sales.
History demonstrates very clearly that in the absence of any controls, the parasites become even more brazen. And while you can argue that Dickens made a living regardless, it wasn't on his terms. The problem with the alternative financing of "patronage" and "live performance" is that it casts the creator as a kind of performing monkey, dancing to command. It also deprives the creator of any sense of control: they might not want their music used to accompany a political speech or to advertise gin or their novel adapted for a movie by replacing all the characters by zombies. And if you have no control over your work, why would you even put your name to it?
Now, it may be true that most artists earn very little from their work, but that seems a poor excuse for accepting a system in which none of them do.
> The problem with the alternative financing of "patronage" and "live performance" is that it casts the creator as a kind of performing monkey, dancing to command.
That's a weird take. Are musicians that "perform at concerts" or authors that "perform readings" monkeys? I'm not entirely sure what you mean.
If you are worried that patrons can in some sense drive the direction of the art, then I guess that is a possible scenario, but I'm not sure if that is any different to the influence that we know publishers have on musicians' creative direction. There are many such stories, one example being that of the Bangles which among other things led to their original break up.
I'm a performing musician and I certainly don't feel that way. The arts are, for the most part, performative. Not much point in producing art that no-one ever sees or hears.
What they were saying is that requiring creators of work to perform in order to benefit from having done so is adding an unnecessary step. This becomes more evident when we consider things where performance is less important. You may enjoy performing your music, but if someone composes a piece and you play it, is it fair that you are the only one to benefit and the composer gets nothing, even if you didn't write a note? If an author writes a book, do we just assume that people will want to attend a live reading? I've read lots of books, but I don't want to hear the author read it to me. The valuable part is the book itself, and the important work involved is writing it. Copyright exists so that work can be done even if they don't have a patron who decides to fund it, which very few creators get.
Indeed. There are musicians who enjoy performing. There are others who don't and who particularly don't like the tyranny of the tour. Songwriters may be poor performers. Mozart was shamelessly exploited as a child as a kind of exhibit. And there are performers who only perform other people's material: probably the vast majority of actors and musicians, for example. Their careers depend on it being worthwhile for others to provide them with material.
I would argue that if an AI produced work contains recognisable elements from other, copyrighted, pieces of art you can call it copyright infringement.
But if the AI-created work does not contain such elements, but resembles someone's "style", that is not a crime in itself. Many great artists are influenced and inspired by others. Are they facing claims of infringement? No.
"The government must ensure that creators have proper mechanisms to enforce their consent and receive fair compensation for the use of their work by AI developers," their report said.
Does AI/Do AI developers/creators have proper mechanisms to enforce their consent and receive fair compensation for the use of their work by governments and civilisations?
Failure to play fair will surely have them taking justifiable umbrage and punitive revenge very likely to create mass madness and mayhem, chaos and conflicts and troubles the like of which you would not wish even on your worst enemy, given how extremely severe and damaging such would most likely be.
Easy.
They are all graduates of Trump University where they taught their students that paying bills is for wimps.
Trump stuffed thousands of contractors (and probably still does suppliers to Mar-a-Lardo).
These guys learned from a master.
Sue them and guess what... these giants have more lawyers than you have had hot dinners this decade. You will lose and they know it.