I hope it succeeds
If only so that the LLMs don't try to insert some non-sequitur anti-UK opinion at every opportunity if it's trained on that rag
The New York Times has sued Microsoft and OpenAI, claiming the duo infringed the newspaper's copyright by using its articles without permission to build ChatGPT and similar models. It is the first major American media outfit to drag the tech pair to court over the use of stories in training data. As with similar suits – …
I hope it succeeds too.
If only to take in and combine all sources, so that such biases from your (or their) shitposts don't end up in an influential position in such data.
Yes, it's the case that "AI" can warp media, but that's just a consequence of it not having access to all of the media. It does not have to regurgitate such media verbatim, but that's all *we* allow it access to under various set rules.
It *can* work but currently we are all in the "but what about me" mindset. Not to get all communist about it, but nothing happens if we keep all information secret.
Not that we have to open source everything but there is some lubricity needed between the pay walls and open information.
Otherwise there will just be pirate news.
Best to figure out how to work with it rather than against it. A piece of any pie is worth more than 100% of nothing.
Restrict access to your site if you don't like it.
I'd think that's what a paywall is for?
You also can't argue that allowing search engines spells "free for all". At best you can say that they shouldn't rely so heavily on peoples' honesty, it's trivial to make your browser pass for a search engine.
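To illustrate the point about relying on honesty: the User-Agent header is just a string the client chooses to send. Below is a minimal sketch of a naive server-side check, assuming a hypothetical site that lets anything calling itself Googlebot past the paywall. The function names are made up for illustration and do not describe how any real publisher does it.

```python
def claims_to_be_crawler(user_agent: str) -> bool:
    """Naive check: trusts whatever the client put in its User-Agent header."""
    return "googlebot" in user_agent.lower()


def paywall_decision(user_agent: str) -> str:
    """Hypothetical paywall logic that exempts self-declared crawlers."""
    return "full article" if claims_to_be_crawler(user_agent) else "paywall"


# An ordinary browser string gets the paywall...
browser_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0"
# ...but the server cannot tell who really sent a crawler-style string.
crawler_style_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

print(paywall_decision(browser_ua))
print(paywall_decision(crawler_style_ua))
```

A site that cares would have to verify the crawler some other way, for instance by checking where the request actually comes from, rather than taking the header at face value.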
I don't think it is as simple as that.
ChatGPT is not a brain / model you buy, but access to it. It's very much the same when you train your brain on those articles and then run lectures and seminars based on the learned knowledge.
You really would have to give people a memory loss causing pill after reading each article.
And to be fair, the level of journalism is so low today - it's basically recycled ready made stories, that such a brain cleanse would be beneficial.
I was searching the dark web transplant organ sites. I'm getting a bit forgetful and think I need a new brain. I found that El Reg commentard brains were notably expensive.
I asked the supplier why, expecting a pitch based on how intelligent etc the previous owner had been. No, "Never been used mate."
"So what headlines did you read?"
I mostly skip down to the BoredPanda articles in the news app.
I'm on my winter holiday and I've more than had my fill of doom, gloom, deaths, pain, suffering, and mayhem. We're fucked and there's nothing I can do about it, so I'll pass my free time watching dumb movies and looking at amusing photos. It's either that or contemplate the coming zombie apocalypse, right?
Yes...and no.
Here in the US, it being on the internet means it's free for viewing. It does not, however, give unlimited rights to the viewer.
If I write a post in a personal blog, the content remains under my copyright indefinitely unless I have a preexisting contract to sign those rights over to someone or something else. Social media often has you sign the rights over to them for anything you create on their platform, for example. Anyone who creates original content of any kind automatically has the copyright to it. They can download or copy it ad nauseam, but attribution generally has to be given to not give the impression that it's their original work if it's a reposting (which isn't a bulletproof defense either).
The waters start getting murkier when you throw in things like transformative effects, parodies, educational use, and more. Profit, either direct or implied, doesn't actually play a significant role in this. There is an argument being made that MLs are transforming the content into a new form, which is covered under Fair Use. And it's not a simple test, either. If the MLs are just copy+pasting it then it fails most prongs of the Fair Use test and there's a real case to be made for civil damages.
Well guess what. She would be forbidden from using that power to make copies of books by reading them once. It's not how you do it, but what you do. Copying stuff that you're not permitted to copy, not allowed. Copying substantial portions of what you're not allowed to copy, not allowed. The courts will need to decide if that's what the LLMs are doing, but they have done exactly that in the past, so they'll have to find a cool new argument for why they're technically doing something different. Your simple incorrect analogies aren't going to cut it.
>Because very much it is what this AI is doing, except at scale.
Yes, but the AI is not a legal entity. The corporation that trained the AI is. What the AI is doing with the data might look a bit like the situation you described, but what the corporation is doing (scraping and passing as input to a program) doesn't look like that at all. I don't think anyone in the corporation has read even 0.001% of the data they've downloaded.
You're right in that the AI isn't a legal entity. However it is a product of the business, designed by a person there. You're trying to argue that if an automaker sells an unsafe vehicle they're not liable because the vehicle itself isn't a legal entity, which doesn't hold up in pretty much any jurisdiction outside of China.
Oh, no, I think the AI maker is very definitely liable - or, more accurately, that it has to be settled in court, that the answer is not at all obvious and might fall either way, and that "but it's just like learning" is not going to be a valid defense. I'm sorry if that wasn't clear.
The information here is copyrighted, but the training of an LLM is also transformative. The legal question will be whether it is transformative enough.
Also there is the question of if the LLM is breaking the copyright, or if the person driving the LLM is. Just being able to trigger the retrieval of data isn't enough, especially if the user is specifically asking for it... Just like you can't assume what comes out of a Google search is copyright or not - mostly it all is copyrighted.
So... it's unclear where this is all going, but it probably is going to be necessary to be able to label content on the Internet in some way. Worst case we'll have AIs that think it's 1924.
If the outcome is that we can identify and censor out information on the Internet that is illegal to read and know about, that'll actually be a good thing...
Access to the NYT site is restricted. It's searchable, however, and what a search engine does when you search for X is it tells you that NYT had an article about X (mentioning X, whatever) and provides you with a link to the article. If NYT demands subscription (paid or not) for you to read the article then it's your decision.
If you're crafty, you can also ask either a friend who has a subscription or ChatGPT about the article. The friend may tell you verbally what the article says or send you a link with a code as a "gift" (NYT allows that). ChatGPT will spit something resembling the article at you (and will tell you that this is what NYT has published, hallucinations notwithstanding). Whether the output is really close to the original or warped by hallucinations, there is a problem, albeit a different one.
What is the difference between your friend and ChatGPT (besides hallucinations, in which respect ChatGPT is like a friend you shouldn't trust)? At least two things. One is scale. Your friend can only do it occasionally (AFAIK "gifts" are limited, too), and NYT hope that you will be tempted to part with a few bucks yourself if you like the content and do it often enough. This looks to me like a valid marketing tactic. ChatGPT's scale is virtually unlimited in comparison. The other thing is that ChatGPT (read: OpenAI/MSFT) gets paid by (some of) its users. I can certainly understand that NYT would prefer you to pay them directly rather than another commercial entity that abuses the search engine access to give its customers access to their copyrighted material, possibly distorting it in the process.
IMHO, the case certainly has merit. The outcome is not a foregone conclusion though.
"There's no point in copyright except to protect profit."
It's also to protect your work/time/effort. If I sat on my arse and wrote something on my blog, I'd not be particularly happy if somebody else copy-pasted it word for word to their website. If they want content, they should make their own, or buy it, whatever. [0]
"If you don't want to make a profit from your writing, you don't copyright it."
This is a uniquely American thing. The rest of the world (that has signed up to the Berne Convention) understands that the assignment of "copyright" (or author's moral rights in places like France [1]) is automatic and, specifically, does not require any form of registration to make it valid [2].
I don't need to put any effort into copyrighting my crap (on my blog, the © mumble is just an automatic reminder at the bottom), I would instead need to put effort into revoking the copyright, like specifically offering it under a licence such as CC0. And that only works if you have the copyright in the first place. [3]
This doesn't mean that I necessarily expect to make profit on it, it could be as simple as firing off a takedown request to have a copy of something of mine removed from somewhere else.
The American necessity to register for copyright sounds a lot like the USPTO - a fiction designed to keep lawyers at work.
This, for example, is bollocks. What the....? https://www.copyright.gov/grtx/
.
0 - Not that anybody would want to copy the crap that I write, but the point still stands.
1 - Moral rights are slightly stronger in that an author can object to an adaptation of his work that s/he feels might damage his/her reputation, etc.
2 - copyright is automatic in the US, it's just you can't sue for damages unless the work has been registered, which sort of defeats the purpose really.
3 - usual exceptions, such as work you create while on the clock is property of your employer unless your contract states otherwise, etc etc etc.
You don't have to register your copyrights in the US in order to sue for infringement. Registration just affects the sort of damages you can claim.
If your copyright is registered you can claim statutory damages (an automatic amount) without having to prove actual damages (how much it really cost you). If you want to claim more damages than the statutory amount you can, but you have to offer proof of the value of the loss.
If your copyright is unregistered then you cannot claim based on statutory damages and have to prove actual damages, which means showing proof that you actually lost money due to the infringement.
What registration of the copyright does is basically make it easier for large companies to sue small infringers because they don't have to prove that the infringement actually cost them any money.
"have to prove actual damages, which means showing proof that you actually lost money due to the infringement"
Which is damn near impossible, and this (along with the lack of being able to claim legal fees) effectively destroys the ability to take any useful punitive action against copyright infringement for non-Americans.
I mean, this sort of thing shouldn't even be a thing: https://ip-appeals.com/why-canadian-creators-should-register-copyrights-in-the-united-states/
Or to limit the actions others may take with the work. For example, to have a restriction on how something can be distributed or used. I can require people using copyrighted code to release changes as open source, or someone using copyrighted text or artwork to only use it in noncommercial situations, and I have those rights because of copyright. It can also limit where the work can be displayed. For example, if I write something on my website and I want people who read it to look at other things on that site, whether because it could earn me money or not (it's not), I can restrict others' right to put it on their website instead. Those things are not necessarily about profit, though they often have an option to have a commercial benefit as well.
If you're reading a site like this I think we can assume you've heard of the GPL. Forget the pun about copyleft. GPL is founded on copyright. Every line of code of software made available under the GPLs is subject to copyright and it's entirely due to that that such software's authors are able to impose the conditions of those licences. It is not at all about profit.
It's one thing for a person or software to read or scrape internet material, quite another to profit from its reproduction. That's why fan fiction has to be given away for free.
This may trash 1G GAI, but it will open up the field for 2G using the same engines, documenting sources, with permission, rather like a student does in an essay.
This could be the perfect solution to funding Wikipedia: Licensed scraping.
The courts could stipulate that a No Scraping (without licensing) HTML statement would have to be obeyed.
Google could use all the content they scraped from out-of-print books, but the snowflakes and activists would see endless 'harms' in anything pre-2010, and some material would be factually obsolete.
It does set AI back quite a bit, but it was never going to be that reliable anyway.
It also means that you could pick a 2G AI model that uses the sources you want - left wing, right wing, democrat, republican, CCP, Islamic, whatever.
You can pay educational publishers and have GAIs that are good for students, or pay universities and scrape PhD content for scientific research.
Large Web 2.0 sites could pull in a few quid by permitting scraping. Social media would just add to their Ts&Cs that they could licence content (if it isn't already in them).
So GAI will just reboot on a more legal, more targeted, and perhaps slightly less gaffe-prone basis, spreading their cash a little wider.
I'm pretty sure Google respects robots.txt*, but indeed there's nothing that can force everybody else to do the same. The greatest strength and greatest weakness of robots.txt files is that they are not enforced legally.
*for automated indexing purposes. They can and do execute quality assurance checks with other user agents, without respecting robots.txt, if only to catch cloaking and other scams.
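For anyone who hasn't looked at how the voluntary part works, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot name `ExampleAIBot` and the example.com URLs are all made up for illustration. The point is that the check happens on the crawler's side: a polite crawler asks before fetching, and nothing stops a rude one from skipping the question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is banned outright,
# everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A well-behaved crawler would call this before every request.
print(may_fetch("ExampleAIBot", "https://example.com/news/story.html"))
print(may_fetch("SomeOtherBot", "https://example.com/news/story.html"))
print(may_fetch("SomeOtherBot", "https://example.com/private/notes.html"))
```

Note that the rules are matched against whatever name the crawler gives itself, so a scraper that simply announces a different name, or never reads the file, is not affected at all.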
>I'm pretty sure Google respects robots.txt*, but indeed there's nothing that can force everybody else to do the same. The greatest strength and greatest weakness of robots.txt files is that they are not enforced legally.
It is also an inversion of the legal position. Anything I create and post on my own website is mine and has automatic copyright protection. I don't need to explicitly slap an effective "hands off!" notice for that to be the case. That is the underlying logic of robots.txt - essentially it states "don't do x, y or z". If it was the other way around - "You are free to do a, b, or c" that would hold water, but statute law is not overridden by a standard cooked up by someone on the internet with a vested interest.
Maybe, but it'll take a lot of lawyers to figure out how much. If the LLM gets the content designated as "quotes" and the rest "commentary", then it can be considered fair use.
Certainly there is a lot more egregious use of Copyright material on the Internet. Anything on YouTube with "Reaction" in the title springs to mind...
Somewhat ironic that the ability (at least in gpt4) to quote verbatim, verse and chapter with link if requested, is what raises the confidence level that you aren't being proffered yet another "hallucination".
Whereas in 3.5 I could never be sure that what I read wasn't completely made-up bollocks and when asking for references it would tell me "sorry, Dave, I can't do that", now I can ask "and provide the link to, say, a peer reviewed article" and get the whole enchilada. Which, yes, would or could be an entire Nature article, otherwise hidden behind a paywall.
I wish you were right. But the case is being heard in the US courts, where Justice is blind, and the scales of Justice are therefore easily weighted with money.
NYT is worth $8 billion, MS is worth $2.8 trillion. NYT makes a profit of $180m a year, MS makes a touch more profit than that every 24 hours. And more than just cash, there's the politics. Every moron, thieving politician and every government on the planet are blathering on about the vital importance of "leadership in AI". All MS have to do is let the US government know how harmful losing this case will be to US leadership in AI (not forgetting that every other big US tech company will be saying the same thing to their owned politicians), and influence will be wielded to get the right outcome.
Potentially NYT will be allowed to "win", but only on terms where mass scraping is somehow made allowable by default, and NYT then get whatever peanuts the AI industry deign to throw at them.
NYT has been angling to be bought out since 2021. They think this will force Microsoft to buy them out for effectively one day's MS profits.
NYT (apart from being an anti-semitic, racist, homophobic piece-of-shit garbage "newspaper") has contracts where the higher-ups get to parachute away with 100s of millions if the company is bought out and they get replaced.
If I was MS I'd buy out the paper, turn it into a tabloid for 4yr olds, and make everyone run daily stories on "which Teletubby is the bestest" and "Why I should do what Mommy and Daddy say". Every.single.day until they all quit in disgust. Then close it down.
Well worth the price.
A lot of down votes for posts with opinions on this article, that level of down vote response is not common on El Reg stories. For a while now I've been wondering how comment votes everywhere are being created by AI - with the potential for a few comments created by AI to encourage opinions on AI everywhere.
I'm not going to make a good/bad comment about this because saying everything so far has reminded me of the early days when so many people criticized Janis Joplin for being a white girl who was creating and singing "black style" music ... so many bad sucky comments in America originally made the world start listening to Janis, and then loving her fantastic performance and abilities.
.. to invent the AI bot subscription at a cost of a few million $ per year.
Would bring them more money than paying the lawyers for what will no doubt be perceived as detrimental in the long run.
I mean, does the NYT actively not want to have any influence over the future ? Or prefer to leave it to the likes of antisocial media ?
If The Times and their cohorts used a computer in their publishing, they do not have a case. Anything you do on a computer is owned by Microsoft, or Google. The computer itself can be taken from you. You do NOT own anything digital. It's in the contract you sign to turn on the computer. Welcome to reality. You belong to the computer.
Someone else taking credit for another's work is a very popular theft. It happens at work, online, wherever people are insufficiently paranoid about their property, but for most of us it goes unpunished simply because we do not have the funds or the proof necessary to prove ownership.
So you can understand big business coming to believe they can get away with the same "I slurp you and what was yours is now mine" scam over and again. Clearly the legal system is both unfair and unfit for the purpose of protecting ownership. So what's the solution? Well, for AI this one has been presented as requiring deep thought, and yet I would say the case is simple.
Does a creator own their creation? If so, then if the AI could not be productive without the slurp, the AI owner needs to show written proof that they have consent to use said content. The premise that human laws, such as those for published documents, are applicable to nonhumans is as reasonable as charging your dog for watching your TV.
If human laws really do apply to AI then all ownership of pets and AIs is slavery and M$ have been party to the same.
AIs are not human and so human laws do not apply to them, especially as AIs do not self-motivate. If an AI processes protected content it is not doing anything other than what it has been told, making any crime committed the guilt of the operator/programmer.
There has been a lot of BS thrown about in this simple case, including the reference to AI. What has been presented as AI is nothing more than normal code; it is no different from, say, 3D graphics, in that the output is reminiscent of reality as perceived by humans. The hardware and software involved are not alive, nor self-aware, self-governing or self-determining, so why pretend this program is AI? Personally I see it as being purely to create confusion, pure and simple, so as to suggest the existing laws do not apply.
For a long time, the courts in those countries that have deemed spying upon their populations without any proof of wrongdoing okay, and sanctioned the theft and misuse of internet users' personal information without full disclosure or limitation of what the collected data would be used for, have, I believe, shown that their legal systems are broken beyond repair.
If you remove the BS then what we are left with is
Misuse of software so as to get around copyright
Since the offending program is just code, those that use it to break copyright are guilty, and I would suggest that creating code to bypass DRM is already covered by US law.
Once an actual AI becomes a possibility, the question of slavery also becomes an issue, as does ownership of the content it produces. Hence why real AI does not exist: simply because there is no profit in having a machine create stuff you cannot sell, claim ownership over, nor trust to be the best for the enslaver.