Should sue them for...
...the full $10bn invested in OpenAI, and every penny of profit from AI based on data scraped without permission.
Microsoft and OpenAI were sued on Wednesday by sixteen pseudonymous individuals who claim the companies' AI products based on ChatGPT collected and divulged their personal information without adequate notice or consent. The complaint [PDF], filed in federal court in San Francisco, California, alleges the two businesses ignored …
"They systematically scraped 300 billion words from the internet, 'books, articles, websites and posts – including personal information obtained without consent.'"
Who is alleged to have obtained the information without consent? OpenAI, or the sites they scraped?
Either way, if you're going to make information about yourself available on websites and posts, you're not too concerned about who sees it, are you? Or, if you are, and you still make it available, you're too bloody stupid to be allowed access to the net.
Material in books and articles is a little more tricky, in that the subject may not have been - and probably wasn't - the author, and may not have had much say in what was written. Even so, the material is published and available to anyone who wants to find it; if the subject doesn't like what was written, they need to take that up with the author.
The situation would be vastly different if the complainants had given their data to OpenAI directly, and possibly in confidence. But to complain about them using information that is already publicly available strikes me as pure opportunism. Are the same people suing Google for indexing the pages containing their data? Or the sites that have been scraped? If not, why not?
Of course, Google has long had a policy of (eventually) removing, or at least hiding, some personal information on request. Have the complainants approached OpenAI and asked them to remove their data? Again, if not, why not?
The likelihood of anyone having suffered any real harm from the likes of ChatGPT seems pretty remote to me. I recall hearing some statistics recently (can't recall where, unfortunately) suggesting that a very large percentage of Americans had never heard of ChatGPT, and of those that had heard of it, most hadn't used it. What are the chances that the relatively small proportion of people who have used it did so with a view to asking it about the complainants' personal data, rather than getting it to spew out semi-reliable filler for their homework / website / court documents etc? Pretty much zero, I should think. But who cares about that when there's potentially money to be made from litigation?
And yet, if I found a dump of data which contained data about you and made it a lot more public, you'd still have objections and I would still be breaking the law. It does not matter that I didn't steal it in the first place, nor does it matter how the original source got the data (if they stole it or if you gave it voluntarily, you did not authorize its publication). This only applies to certain types of personal information, and the particular jurisdiction will determine whether a given piece of information gets protection or not. The people in this case are complaining about increased publication of their details, and their case will succeed or fail on that argument.
Even if this were only about collecting posts they voluntarily wrote and published, they could still attempt to block OpenAI from repeating them using copyright claims. Just because something can be accessed does not mean you have the right to distribute it. Laws exist which limit your rights in that area, both in privacy and elsewhere. In practice, people should be careful not to publish anything they wouldn't like to see abused, because a lot of people will not obey those laws. But that doesn't change the fact that they retain rights over some of that data, and they have a chance of getting a court to penalize those who violate them.
When I see they filed for $3 billion in damages, it tells me that they've found an excuse to hit the jackpot.
I'll be very interested in learning what proof they have for the judge, because their complaint is quite technical. They're going to have to prove those points, and I wonder if they have the means to do so.
"if they stole it or if you gave it voluntarily, you did not authorize its publication"
And therefore they are the ones to be sued. Having a finding in your favour would put you in a strong position should you then wish to request that Google, OpenAI et al kindly remove your unlawfully published data from their systems.
But, despite it being sensible and logical to promptly remove the source of the problem lest anyone else scrape it and publish it further, it wouldn't be nearly as profitable.
As for copyright, you are of course correct. As I recall, the DMCA contains provision for notice and takedown of infringing material, and it is a relatively simple matter to request removal of your material from sites that have copied it without permission. But, again, that isn't profitable.
I don't doubt that cases could be made in respect of both privacy and copyright, though of course making an arguable case is a long way from winning that case. It's the motivation for bringing the cases that I find questionable.
This ignores a large portion of personal information that exists on the internet without permission. For a public example, Kiwifarms as a website documents the private information of individuals largely without their consent (with exceptions). If these people had their private information scraped off of that site and used as a part of OpenAI's ChatGPT learning model, I think there's a reasonable argument that the model must be re-trained with that data omitted.
Regardless of this specific instance, I think it is a trivially provable statement that "there is information on the internet that is illegal to distribute", and OpenAI have not done a reasonable job of demonstrating that their previous dataset is actually free of any foul play. If AI researchers want to pretend that AI can learn like a human can, and as such its creations should be distinct from the training information, that still doesn't change the fact that consuming illegal content as a human is still illegal. I can't watch pirated copies of Disney films on YouTube for "training purposes".
"For a public example, Kiwifarms as a website documents the private information of individuals largely without their consent (with exceptions)."
Interesting - I don't think I've ever heard of that. Are they not inundated with notices to cease processing? Or are they based somewhere beyond the reach of lawyers?
If I give someone my photo ID, they are given it for a specific purpose. They do not then have the right to use that photo ID for whatever other purpose they desire.
It's the very basis of GDPR. Consent is context specific - you consent to a specific type of processing. Not to an open-ended "we can do whatever we want".
So, putting your info on a website should be subject only to processing as specified in their terms.
Start here: OpenAI in the past has dealt with the reproduction of personal information by filtering it.
That's two points proven:
1 - acknowledgement that they know information is private
2 - trying to clean up, but after committing the crime
I find it fascinating how quickly these companies drag people into the courts when someone so much as comes close to their patents, but they seem to be quite happy to commit large-scale theft/crime themselves.
By the way, that's just the personal information problem. There's also that pesky copyright law...
Either way, if you're going to make information about yourself available on websites and posts, you're not too concerned about who sees it, are you?
To successfully defend your personal information you have to never screw up; the attacking websites have to succeed only once, and then they've got your info.
For example, since the debut of ChatGPT, stackoverflow has apparently experienced a rapid 14% drop in visits, possibly because chatbots can answer just as well - but only because they have scraped stackoverflow and other sites of their value.
In the long run, this could drive to bankruptcy the very sites that were scraped by chatbots. But chatbots can't come up with original answers, so quality will decline. Eventually there will be some new kind of equilibrium built on the narrow margins for sites collecting real human answers, but it will be of a lower quality than before chatbots.
It mirrors, in a way, how journalism has declined since the advent of google and facebook, which now draw the advertising revenue that used to support newspapers.
SO's quality has been pathetic for years. Perhaps they have experienced 14% fewer visits because people have realised the site is full of crap?
Or maybe it's because of all the layoffs in the tech industry lately?
I thought the way LLMs worked was to analyze the probabilities of Y given X and also some context (so they can do things "in the style of...")
The way the lawsuit seems to be written implies that the LLM actually stores the data it scraped for training and regurgitates that data directly, rather than using the probabilities/Markov chains to generate the output.
Can someone please enlighten me? is an LLM just a copypasta engine, with bells and whistles, or does it actually do something a bit more clever and use word/letter frequencies and Markov chains, using the chat prompt as a seed, as I originally thought?
You are correct in your original thinking.
This claim of just dumping out copied text en masse keeps appearing: it seems to be very appealing to some (or maybe they just can't get their heads around things like MCs in the first place?).
There are plenty of ways that you *could* get "accurate" personal data out, starting with pure chance: these models are being queried many, many times and someone who actually lives at 23 Railway Cuttings, East Cheam is one day going to see that appear in a response. Or maybe your personal info is actually spread around the Internet so much that the model has pushed the correlation of "Fred" followed by "Bloggs" followed by "Mastercard 1234 5678..." up to near certainty!
You can also get *inaccurate* personal data out: oh look, ChatGPT "claims" a radio show host is an embezzler. The LLM gets it in the neck both coming and going.
Okay, the process is more involved than a Markov Chain (and a lot messier and less comprehensible, with less explanatory power), but MCs are a good starting point.
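For anyone wondering what that "starting point" looks like in practice, here's a toy word-level Markov chain in Python. To be clear, this is my own illustrative sketch (the corpus and function names are invented), not how GPT-style models actually work - they use neural next-token prediction over learned representations, not literal stored n-gram tables:

```python
import random
from collections import defaultdict

def train(text, order=2):
    """Map each n-gram of words to the words observed immediately after it."""
    words = text.split()
    table = defaultdict(list)
    for i in range(len(words) - order):
        table[tuple(words[i:i + order])].append(words[i + order])
    return table

def generate(table, seed, length=10, rng=None):
    """Walk the chain: from the current n-gram, sample one of its observed successors."""
    rng = rng or random.Random(0)
    out = list(seed)
    for _ in range(length):
        successors = table.get(tuple(out[-len(seed):]))
        if not successors:
            break  # this n-gram never appeared in the training text
        out.append(rng.choice(successors))
    return " ".join(out)

corpus = "the cat sat on the mat the cat ate the rat the cat sat on the rat"
table = train(corpus)
print(generate(table, ("the", "cat")))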
Thanks for that confirmation - I am glad I am not going senile (yet).
Almost everyone who bothers to comment or file a lawsuit seems to scream "But it stole my data" without understanding (willfully or not) how the blessed things actually work. Their uninformed rants were repeated so often that I was starting to doubt my sanity...
Pint for helping me out --->
Even if it doesn't regurgitate verbatim, I should still be able to control the usage of my own free thoughts, knowledge and other random spoutings. This is, after all, my USP. It's how I earn a living. The legislators should be working very quickly to enact this sort of limitation.
If I contribute to something like stackoverflow using my amassed years of education and experience, the default position should be that I do so only in the context of helping out with that particular problem. Whether you use my output verbatim or use it as an input into a process is really the same thing. Verbatim is just a null process. Fixing your code is a process that I can help with, and I've implicitly consented to do so by using that website. Training a random LLM is another process, but I did not explicitly consent to that (and if the website's terms and conditions say that they'll give away or sell my contribution then I'll move on to the next website).
But hey, can't wait until the LLMs are being trained on the random output of the previous generation of LLMs because all original thought has been obscured. What a shit show that will be. We have entered the atmospheric testing age. Everything is now contaminated with fallout.
Using LLMs for noble purposes like medical diagnosis will only be done with relevant, peer-reviewed data for training. These sorts of applications are not going to use reddit or stackoverflow or twitter. So removing those as sources doesn't stop the AI 'revolution'.
Great. The *one* thing that is *required* to demonstrate that there is a claim for damages is replaced by page after page of "maybe". Do they *want* this case to fail?
You could almost imagine that these pseudonymous plaintiffs are really named "OpenAI" and "Microsoft".
"Do they *want* this case to fail?"
It isn't necessary to win in order to hit the jackpot. You just have to make a big enough nuisance of yourself that it's easier for the respondent to settle than go to the trouble and expense of fighting. Part of that strategy could easily involve the generation of mass negative publicity, such as might come about through media coverage of their allegations, for example.