4chan and other web sewers scraped up into Google's mega-library for training ML

Problematic, racist, and pornographic web content is seemingly being used to train Google's large language models, despite efforts to filter out that stratum of toxic and harmful text. An investigation by The Washington Post and the Allen Institute for AI analyzed Google's immense public C4 dataset, released for academic …

  1. jake Silver badge

    To answer the question ...

    "Are you still so keen to have generative AI write your emails, sales proposals, blog posts ... ?"

    No, I am not. But then, being both educated and sane, I never was.

    Thank you for asking.

    1. katrinab Silver badge
      Black Helicopters

      Re: To answer the question ...

      I asked Bing what the best search engine is.

      It replied with "Google".

      I guess the chatbot will be sent off to the re-education camp shortly?

      1. Anonymous Coward

        A nuanced answer really

        Without clearer parameters on what the criteria are, it may well be.

        Bing's masters want it to gain market share and give them dominance of the ad industry. Or at least a non-trivial slice. Accurate or useful search results are its third or fourth job. So can we fault it for giving an honest answer? But also, by giving an earnest answer, it may show it's still doing a better job meeting its users' needs.

        Would you be happier if it lied? Though a more complete answer might have been "Google from 20 years ago, back when advanced search still worked"

        1. katrinab Silver badge

          Re: A nuanced answer really

          I asked Yahoo! what the best search engine is, and it gave me a bunch of links to recent articles where they review different search engines. They don't all give the same answer, and none of them say that Yahoo! is the best.

          Google gives a list of the top eight, with Google as No. 1.

          Based purely on that search query, and with a sample size of three, I would say that Yahoo! is actually the best search engine.

  2. Anonymous Coward

    Devil’s advocate

    If these toxic sewers are written and populated by meat-bags, why exclude them from training data?

    Surely everyone is entitled to be heard, no?

    1. Mike 125

      Re: Devil’s advocate

      >Surely everyone is entitled to be heard, no?

      If we want AI to behave like humans- absolutely. That's humans' advocate!

      But is that what we want?

      I'd prefer an AI trained on the scientific principle: suggest explanations from the evidence, test the sh't out of them, drop the fails, and keep on testing. The last one standing- go with that for now.

      There's an AI for the future.

      OK OK, godamnit- Musk got there first... I *really hate* that guy...

      1. Brewster's Angle Grinder Silver badge

        Re: Devil’s advocate

        Has Musk got there? Or does he want one trained on the peculiar right-wing notion of truth that would see the AI you describe as "biased"?

        Even supposing we get to evidence-based truth (rather than "truth" that respects people's feelings) there are a whole bunch of problems around definitions. The answer to the question "How many poor people are there?" depends on how you define poor - which normally descends into a brawl over relative vs absolute poverty. (PLEASE: Do not rerun that argument. I used it only as an exemplar of how evidence can be contentious.)

        1. Timop

          Re: Devil’s advocate

          He got the memo about potential state subsidies involved in AI business and decided to jump in.

      2. DS999 Silver badge

        Anyone who puts "truth" in the name of a product

        Will be selling anything but.

      3. Michael Wojcik Silver badge

        Re: Devil’s advocate

        There are myriad applications for LLMs,[1] so there's no accurate, useful answer to the question of what they should be trained on. The current fashion of using generic LLM chatbots for everything under the sun will never produce consistently suitable results.

        Musk's proposal is pie-in-the-sky. He hasn't displayed any particularly sophisticated understanding of LLM research.

        [1] Whether any of them are good applications is a separate question. I've yet to be persuaded by any of them.

    2. Anonymous Coward

      Re: Devil’s advocate

      As the article points out, it doesn't understand the data... it's just 01010101010... it does not (yet) seem to understand what's good or bad, just that something matches with your input

      It's still a small child and it definitely needs a spanking until it stops peeking at Daddy's <cough> 'gentleman's literature'

      1. Anonymous Coward

        No, just don't leave your adult book collection in the kids section

        It isn't the model's fault if you exposed it to garbage before you hit the lock switch. Don't blame the tools for the incompetence of the engineer.

        If you don't want the "child" pooping multicolored wax, don't feed them a bowl full of crayons. It didn't ask for them, you gave them to them.

      2. Ideasource

        Re: Devil’s advocate

        Well, it would seem good and bad would be weights to be set by the user.

        The AI sees the information for what it is: unvalidated information in an associative matrix.

        The user sets filters by which to govern the AI to their particular values of good and bad relative to the task they're using the AI for.

        As there is no universal good and bad, only relative to task per the qualifications of the human mind using the AI tool, this is a pointless endeavor to handle generically.

        There's no substitute for a competent User. There are virtually infinite ways to abuse a task through incompetent usage of a tool.

        At the end of the day, AI is just a database of spotty information feeding a to-be-evaluated-by-user conjecture engine, with a funky UI.

        1. doublelayer Silver badge

          Re: Devil’s advocate

          "Well It would seem good and bad would be weights to be set by the user. The ai sees the information for what it is. Unvalidated information in an associative matrix."

          That's not how it works at the moment. It is trained on a bunch of data and weights itself based on that data, with the hope that incorrect stuff will appear less frequently than correct stuff, so it won't get too many answers wrong. It uses that data to decide what language looks like, which is how it can create sentences. If its training data has a lot of something, that something will appear frequently in the output.

          You can't just reweight that stuff down at runtime except by using a really blunt tool. You can put in a few things you never want to see, but it will just keep regenerating stuff until that no longer appears, and if your filter wasn't good enough and the stuff you're trying to filter appeared too much in the training data, it will just appear in a slightly modified form because your filter doesn't understand the details of language the way brains can.
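The "really blunt tool" described above can be sketched in a few lines. This is only a toy illustration, not how any real chatbot filter is built: `BLOCKLIST`, `passes_filter`, `generate_until_clean`, and the canned `fake_model_outputs` are all hypothetical stand-ins for a model that keeps producing a banned term because it saw it so often in training.

```python
import itertools

BLOCKLIST = {"badword"}  # hypothetical banned terms

def passes_filter(text: str) -> bool:
    """Blunt check: reject only if a banned term appears as an exact word."""
    words = text.lower().split()
    return not any(word.strip(".,!?") in BLOCKLIST for word in words)

def generate_until_clean(candidates, max_tries=10):
    """Keep drawing candidate outputs until one passes the filter."""
    for text in itertools.islice(candidates, max_tries):
        if passes_filter(text):
            return text
    return None  # gave up: every attempt was rejected

# Stand-in for a model whose training data was full of the banned term.
fake_model_outputs = iter([
    "that is a badword idea",
    "what a badword!",
    "that is a b4dword idea",  # slightly modified form
])

result = generate_until_clean(fake_model_outputs)
print(result)  # prints "that is a b4dword idea"
```

The exact-match filter rejects the first two outputs and then accepts the third, which carries the same content in a lightly disguised spelling.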

          1. Michael Wojcik Silver badge

            Re: Devil’s advocate

            Shrug. "Associative matrix" isn't right either. What, you expect technical accuracy from comments?

    3. Sorry that handle is already taken. Silver badge

      Re: Devil’s advocate

      "Surely everyone is entitled to be heard, no?"
      Only to a point, for the simple reason that "free speech" can never be absolute. You can't yell "fire" in a packed theatre. You can't* incite violence. So no, everyone is not entitled to be heard in all circumstances.

      And an LLM developed by a private organisation isn't subject to the First Amendment, even when it is based in the USA...

      * Although the Americans are currently trying to work this one out.

    4. Empire of the Pussycat

      Re: Devil’s advocate

      "Surely everyone is entitled to be heard, no?"

      Absolutely not.

      They can have the right to talk.

      The right to 'be heard' would mean compelling others to hear them.

      They can scream into the void all they like, but I'm not listening.

    5. heyrick Silver badge

      Re: Devil’s advocate

      "Surely everyone is entitled to be heard, no?"

      No, in a supposedly free society, everybody is entitled to open their mouth and spew bollocks (proviso: as long as it is legal). That is the extent of their entitlement. They aren't guaranteed a platform or that anybody is paying any attention. For your freedom of speech is my freedom not to listen.

    6. JDX Gold badge

      Re: Devil’s advocate

      AI having access to contentious and unpleasant views: OK

      AI not knowing which views are contentious and unpleasant: not great

      AI presenting contentious and unpleasant views as fact: bad

      GPT suffers from "pub expert syndrome" - everything it tells you it says with utter confidence regardless of accuracy.

      1. Tez

        Re: Devil’s advocate

        "AI presenting contentious and unpleasant views as fact: bad"

        What if they are true? Most interesting discussions are in the contentious space, and unpleasant is a subjective viewpoint. This space is a great place for AI to explore.

        1. doublelayer Silver badge

          Re: Devil’s advocate

          "What if they are true?"

          Then they probably aren't that close to the middle of the contentious area. There are some statements of fact that are contentious to some people who insist on denying the facts, but even in those cases, the most contentious statements tend to exaggerate or create moral statements based around those facts. Most other stuff that's very contentious is a matter of opinion. Stating an opinion as if it is undeniable fact isn't a great operation for a chatbot intended to provide useful information. Nor would mangling a fact in order to back up a contentious opinion. I'm sure some people are busy writing chatbots to back up their opinions, but the models that have been released so far are intended to provide information, not to easily spit out an unending series of propaganda.

        2. Anonymous Coward


          This is not a good use case for ML or LLMs. I get you probably just love the idea of these things parroting whatever pseudo-scientific trash and conspiracy theories they stumbled across in their training data, but a fetish for controversy isn't a "great place" for it to explore, because it isn't intelligent and can't reason for itself, let alone someone else.

          All this is going to do is generate more inane politicized word salad and further confuse people. The only exception would be if you built and trained the damn thing to break down foundational logic and apply it to this stuff, which is probably one of the hardest possible problems, as it plays to all the worst weaknesses of the current generation of ML tools.

          And like so many other problems, it's probably better to not try than to put out something that doesn't reliably work as advertised.

        3. Anonymous Coward

          Re: Devil’s advocate

          There is no truth in the antisemitism, racism, transphobia, homophobia, Islamophobia or any other fear and hate peddled by them. Reality is woke and equity is truth and fairness.

          Just look at (but please don't actually), how the race based crime statistics those subhumans love to spew don't account for how poverty, lack of education and historic and daily systemic racism harm the mind. They don't account for police discrimination and overpolicing.

          If you want to feel uncomfortable while learning truths, read critical race theory or how tolerance of intolerance led to the holocaust.

        4. heyrick Silver badge

          Re: Devil’s advocate

          "This space is a great place for AI to explore."

          Not really, because there's exactly zero understanding of anything it talks about. It may give the illusion of understanding, but that's it. Smoke and mirrors all the way down.

        5. Orv Silver badge

          Re: Devil’s advocate

          The problem is AIs don't explore anything. The only question they try to answer is, "given the words in the question I was asked and the words I've output so far, what word is the most likely one to come next?" They're autocomplete on steroids.
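To make the "autocomplete on steroids" picture concrete, here is a deliberately tiny sketch. It is an assumption-laden toy: real transformer LLMs use neural networks over tokens and a much longer context, not word-pair counts, and the names `train_bigrams`, `most_likely_next`, and `toy_corpus` are invented for illustration. The only point it shares with the real thing is that the "answer" is whatever followed most often in the training text.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count, for each word, which words followed it in the training text."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def most_likely_next(counts, word):
    """Return the word seen most often after `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

toy_corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
model = train_bigrams(toy_corpus)
print(most_likely_next(model, "the"))  # prints "cat"
print(most_likely_next(model, "sat"))  # prints "on"
```

Nothing in the counts records whether a sentence was true, only how often it appeared, which is why the make-up of the training text matters so much.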

          1. Michael Wojcik Silver badge

            Re: Devil’s advocate

            Well, that's true (more or less) of transformer LLMs. As a general statement it's too broad.

    7. Orv Silver badge

      Re: Devil’s advocate

      The question isn't really about being heard. The question is "how do we want our AIs to behave?" Think of the AI as being like an impressionable child -- do you want them wandering around 4chan unsupervised, and learning ethnic slurs they can repeat when grandma is visiting?

    8. Anonymous Coward

      Annnd the trolls arrived.

      Yes, if you don't know math, anything about ML or LLMs, or just want LLMs to keep spewing irrational toxic nonsense, by all means keep shoveling toxic useless waste from a dumpster fire into them. :)

      Also the trolls may eventually sue you for infringing their IP rights. (The trolls are of course free to keep making their own trash LLMs that no one else wants to listen to either.)

      This isn't about them. Never will be.

      The core of this is that borderline idiots who are too lazy or poor to clean their scraped and stolen data troves keep building giant models, then convincing people to hook them to everything for cheap clicks and lulz. 4chan and Reddit should be cut out of training data because for almost any constructive purpose they are approaching a 100% noise-to-signal level. Partly that's because their user base didn't consent to the use of their posts beforehand, but mostly it's because the parts of the internet where people spend the most time per capita talking about the smell of their own farts and popping their own zits, boils, and abscesses aren't going to improve the performance of 99.99999995% of the tasks the models will be used for.

    9. Filippo Silver badge

      Re: Devil’s advocate

      Being included in an LLM's training set is not a form of expression. Nobody is being "not-heard" because their content doesn't get scraped for an LLM.

      Also, any cutting-edge LLM would be trained on a curated training set, to some degree. Just scraping "the entire Internet" and dumping it into the neural network doesn't seem like it would produce the best results. What exactly gets into the curated set depends on what the researcher is trying to do. I don't know what objectives would be served by including 4chan, but if there are any, then it should be included, otherwise it shouldn't. It's that simple.

    10. Anonymous Coward

      Re: Devil’s advocate

      With the understanding that you are trying to model the entirety of internet-connected western thought and biases therein? ... however, the loudest voices are not necessarily proportionally representative, in fact they are most definitely not, so the weighting is way off. In addition much of it is (1) automated spam, and (2) garbled echoes.

      The best possible use I can think of for a general purpose AI search would be to exactly filter out the bleating overbearing voices, the spam, and the echoes. I guess you have to know your enemy before you can shut them off, so maybe this is the right first step in a long journey.

  3. Hugo Rune

    I don't mind 4chan being used. It is the use of El Reg comments that scares the shit out of me.

    (c) Rune

    1. jake Silver badge

      Ah, yes. 4chan.

      The kiddies who trolled QAnon into existence "for the lulz". Just the folks to feed into the AI of your choice.

      Think Redmond had trouble with Tay? Might want to stock up on popcorn. I'll bring the beer.

      1. Tez

        Re: Ah, yes. 4chan.

        If it is unable to balance the statements of flat earthers, physicists, QAnon, the CIA and the historical record from the viewpoint of millions of perspectives and come up with a useful synthesized answer, then it is not particularly useful.

        That being said, ChatGPT already fails on many basic statistics; even when prompted that it is incorrect, it can take several attempts to reach the accurate values.

        For example, I asked what the homicide rate per 100k population was in the USA in 2020. It provided a number that looked correct and cited the FBI crime report. I then asked for the gun homicide rate per 100k. It provided a number that was over double the rate for all types of homicide. It cited the same FBI report as the source.

        I reviewed the source and found it to be correct; ChatGPT for some reason did not ingest the data correctly.

        This is not the only example; it regularly provides very inaccurate statistics alongside completely accurate ones.
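The inconsistency in that anecdote is the kind a one-line sanity check catches: a rate for a subset of events (gun homicides) cannot exceed the rate for all such events (homicides of every type). A minimal sketch, with a hypothetical function name and placeholder numbers that are NOT real statistics:

```python
def subset_rate_is_plausible(total_rate: float, subset_rate: float) -> bool:
    """A rate for a subset of events can never exceed the rate for all events."""
    return 0 <= subset_rate <= total_rate

# Placeholder values for illustration only; these are NOT real statistics.
claimed_total = 5.0
claimed_subset = 11.0  # "over double" the total, as in the anecdote

print(subset_rate_is_plausible(claimed_total, claimed_subset))  # prints False
```

A check like this only flags that one of the two figures must be wrong; finding out which still means going back to the source.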

        1. Anonymous Coward

          What would you possibly need 4chan training data for?

          The reason the GPT-like models are failing so hard at those tasks is BECAUSE all that garbage was in their training data, not a lack of it.

          Also you don't apparently get how these models work. It's not a knowledge search engine. If you wanted specific facts from a verifiable source you already apparently know the right tool and source to find it and it isn't a GPT model.

          You might have had better luck if you had asked it for a link to the answer, which you could at least click on, but that's still a terrible idea in general, as it can easily lead people down the rabbit hole of contractual websites, and it is incapable of understanding any of that because it is not, in fact, intelligent. At all.

          If you try to mend fabric with scissors you make the problem worse or end up with less than you started with. Use better tools for the job if you want better results.

        2. Orv Silver badge

          Re: Ah, yes. 4chan.

          They don't try to balance anything. These models aren't capable of that. They're just fancy autocomplete systems, stringing together sentences based on what words are statistically likely to come up.

      2. Michael Wojcik Silver badge

        Re: Ah, yes. 4chan.

        QAnon came from 8chan/8kun, not 4chan. Not that there's any great difference.

        1. jake Silver badge

          Re: Ah, yes. 4chan.

          No, it originated on 4chan. The perps later used 8chan to drive up page hits (read: make money).

          See this post of mine in reply to you on this very subject:

          No, there really isn't much of a difference.

    2. phuzz Silver badge

      The idea of a chatbot being trained on comments from amanfrommars1 amuses me. As above so below, and all that.

  4. Inventor of the Marmite Laser Silver badge

    Wasn't there an expression mooted oh, many, many years sinceupon?

    "Garbage in, garbage out."

    Seems very, very relevant to AI text generators.

    1. 45RPM Silver badge

      Exactly this. Data science products, including AIs, are only as good as the data sets that they’re trained on. Clean data set? Good output. Dirty data set? Bad output.

      If our AIs are going to be truly useful then they need to be trained on scientific principles. They need to learn based on evidence and accept that they might be wrong. Therefore, they need to be trained from the best materials available. Reputable and peer reviewed scientific journals, historical documents, literature - all of it taken from around the world, not just one specific region, and they need to be retrained regularly.

      What they shouldn’t be trained with is data scraped from the whole of the internet. Do that and you end up with an argumentative, prejudiced, partisan AI. Not actually something which is of particular use to humanity as a whole.

      1. Al fazed

        What is real data ?

        A few years ago a close friend submitted his doctoral thesis to Oxford University five times before it was accepted "as is".

        Four times his Doctorate was refused because the Don disagreed with the challenge to Ampere's Theory contained in the thesis. The Don told my friend that he would accept his thesis if he simply removed the "offending" chapter.

        You can see where this is headed ?

        If my friend had been a simple twat, he might have been persuaded to remove the offending chapter, even if it meant that his thesis amounted to a load of nonsense.

        The load of nonsense would have been loaded in an AI model under the label "authentic" or "science".

        Personally I don't see the logic.

        Neither did my friend.

        He eventually got his doctorate on his original material, after the college switched Heads for the purpose.

        Trust in scientific process............?

        I'd rather trust Google's AI..........


        1. heyrick Silver badge

          Re: What is real data ?

          That's the very antithesis of scientific principle, and the Don is a complete twat.

          What he agrees or disagrees with is irrelevant.

          Did the thesis make an assertion and then back it up with a testable proof? That's what matters. What can be demonstrated, not whose petty little ego just threw toys out of the pram.

        2. trindflo Bronze badge

          Re: What is real data ?

          "the challenge to Ampere's Theory"

          The problem with trying to refute something like Ampere's Law is that people building perpetual motion machines of one sort or another are usually the ones trying to refute what we know of physics. Keeping perpetual motion machines out of literature (except for use as a bad example) is something I consider the business of learning institutions. Whatever it is hospitals do, they should not be spreading disease.

      2. Anonymous Coward

        Problem is that history is generally written by the victors...

        1. Anonymous Coward

          They do keep trying

          Of course they keep getting fouled up with all the pesky objective evidence. Rewriting the words is quick and easy; altering all the evidence is too much work to do at scale.

          Of course fact checking every line of history would be exhausting, so even in the best of times, things DO slip by.

          The key is attacking the problem at the right point. In your case it's the part of your statement where you say "generally". If the problem has gotten bad enough to be "general" you may have been attacking the problem at the wrong point.

          Aim for the kneecaps and you might get those numbers back down to a more manageable "occasionally". Eliminating repeat offenders may require a bigger hammer and/or the mass application of co-ax cutters in cable news offices.

  5. Pascal Monett Silver badge

    "Problematic, racist, and pornographic web content"

    I understand that racist web content should not be used in training data. I'm a bit less sure about pornographic web content, but I'll give that one a pass.

    Now, if web content is neither racist nor pornographic, how exactly is it "problematic"? What is the definition of "problematic" in that context ?

    Could someone enlighten me ?

    1. jake Silver badge

      Re: "Problematic, racist, and pornographic web content"

      "how exactly is it "problematic""

      It doesn't agree with their shaman of choice would be my guess.

    2. Anonymous Coward

      Re: "Problematic, racist, and pornographic web content"

      " if web content is neither racist nor pornographic, how exactly is it "problematic"?"

      What if it's plain wrong or completely biased (not just a little 'woke', but completely batsh!+, like GBnews)?

      (although just about anything could be classed as 'problematic' according to viewpoint, so who gets to choose?)

    3. katrinab Silver badge

      Re: "Problematic, racist, and pornographic web content"

      Wrong in some way I suppose

      Water-powered cars, and the big oil conspiracy that stops them from being a thing

      Anti-vax stuff

      People spreading malicious rumours

      Many of the marketing claims made by companies

      Most claims made by politicians


      1. Yet Another Anonymous coward Silver badge

        Re: "Problematic, racist, and pornographic web content"

        I've got a water-powered car.

        1. Sudosu Bronze badge

          Re: "Problematic, racist, and pornographic web content"

          I mean, technically, you could have an Electric Hydrogen hybrid that "runs on water" using the batteries to perform electrolysis and then burning the result.

          You could also have a steam car which partially runs on water.

          You could have one of those mega powerful off road vehicles they play with in Iceland that run on water....they drive on top of it.

          Does a hydro electric car count as running on water?

          The trick with AI will really be asking the right question.

          1. Yet Another Anonymous coward Silver badge

            Re: "Problematic, racist, and pornographic web content"

            I have an electric car and 10c/kWh hydro

    4. Orv Silver badge

      Re: "Problematic, racist, and pornographic web content"

      In the context of AI, "problematic" would be anything that makes the AI unsuitable for your purposes.

      For example, if it's an AI that summarizes news stories for web search purposes, you probably don't want it going off on a rant about how Kennedy is still alive and will take over the government to weed out a secret cabal of pedophiles that drink children's blood.

      1. Tez

        Re: "Problematic, racist, and pornographic web content"

        I do if it is true. Sometimes fantastical claims like an elite cabal of pedos potentially funded by government turn out to be true, i.e. Epstein.

        1. Anonymous Coward

          Re: "Problematic, racist, and pornographic web content"

          Occasionally accidentally true is not the same as being correct in this sense.

          Correct in this context requires not only replying with a true answer, but having reached that answer by something other than dumb luck. Hamlet typed over a thousand years by a monkey sweatshop isn't brilliant literature, it's a statistical accident. If you poop your pants and it looks like Elvis, it doesn't mean anything more than a cloud that looks uncannily like a duck from a certain angle.

        2. Orv Silver badge

          Re: "Problematic, racist, and pornographic web content"

          Yeah, see, here's the thing. Getting one general idea right and all the details wrong doesn't really cut it in this case. QAnon got the "sometimes rich people are pedophiles" part right, but all the business about Kennedy being alive, a basement full of children under Comet Pizza, secret bunkers under the Getty Center, Biden being arrested at his inauguration and Trump being reinstated, etc. turned out to be bunk. Which made the whole exercise pretty useless.

    5. doublelayer Silver badge

      Re: "Problematic, racist, and pornographic web content"

      It's a generic term for anything they don't want the AI to show to someone else. If I wrote a joke page of arithmetic questions with wrong answers, they probably consider that problematic because their AI doesn't know how to calculate itself, so it could memorize my wrong answers and give them to someone else. If I posted a large website of the output from Markov chains, that would likely be problematic because it could get their AI to make nonsense phrases of English words and they want it to make sense.

      Of course, there are other kinds of problematic that are more about the subject matter which they disagree with. If there was a website advocating crime, they likely don't want their bot to start suggesting that people start burning things down, so that would be problematic. That arson-focused site wouldn't have to be racist to be unsuitable for their training data.
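The joke-arithmetic-page scenario above can be sketched with a toy "model" that never calculates anything and simply repeats whichever answer it saw most often. This is an illustration under stated assumptions, not a description of how any production LLM stores facts; `train`, `answer`, `clean_pages`, and `joke_pages` are hypothetical names and data.

```python
from collections import Counter, defaultdict

def train(pages):
    """Memorize, for each question, how often each answer was seen."""
    seen = defaultdict(Counter)
    for question, given_answer in pages:
        seen[question][given_answer] += 1
    return seen

def answer(seen, question):
    """Return the most frequently seen answer; no arithmetic is performed."""
    if question not in seen:
        return None
    return seen[question].most_common(1)[0][0]

clean_pages = [("2+2", "4"), ("2+2", "4"), ("3+3", "6")]
joke_pages = [("2+2", "5")] * 3 + [("3+3", "7")] * 2  # deliberately wrong

print(answer(train(clean_pages), "2+2"))               # prints "4"
print(answer(train(clean_pages + joke_pages), "2+2"))  # prints "5"
```

Once the wrong answers outnumber the right ones in the scraped pages, the memorized reply flips, which is the sense in which such a page is "problematic" for the people assembling the training set.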

      1. Anonymous Coward

        Right, that's the core of this issue

        A model trained on the unfiltered output of the internet will probably be unsuitable for the vast majority of tasks, and non-optimal for the rest. So in reality, they are unsuited to ANY task.

        Giant models are hugely expensive to generate, so we should be curating their input data better, and producing fewer giant low quality models and more medium and large high quality models that are domain specific.

        The problem is this is the dumb gold rush period where companies try to brute force their way to an early lead because these models can make big jumps in utility with massive jumps in scale. So while the performance will never reach the level of what smart people will build over the next few years, those smart people will sadly probably be working for the companies that took an early lead, no matter how much snake oil they had to sell to get there.

    6. spold Silver badge

      Re: "Problematic, racist, and pornographic web content"

      A complete load of bollocks made up by some right-wing or otherwise biased group? (Cross-reference Fox News to be topical).

    7. Anonymous Coward

      Re: "Problematic, racist, and pornographic web content"

      "C4 also features content from individuals' blogs, religious websites, and more." The 'religious website' bit is troubling - lots of unverifiable 'truth', incitement to violence...

    8. Michael Wojcik Silver badge

      Re: "Problematic, racist, and pornographic web content"

      It depends upon the application.

      If someone is training an LLM chatbot to provide level-1 tech support for a product, for example, then there's a vast array of online content which is valuable under sensible rubrics but very much not something you want to train that model with.

      Something Awful, for example, is very significant in the history of the web, and a lot of the (now all historical) content is genuinely amusing. But it's not something you want your transformers attending to when responding to "What even is On switch?". (And lord knows Lowtax would have been horrified at the idea of SA content being scraped to produce Auto-Stupid synthetic interlocutors.)

      Homestuck is the greatest hypermedia novel to date, but for most purposes you don't want your LLM imitating its characters' "typing quirks". Nor its fictional world, for that matter. An LLM that suggests you go back in time and split off a new timeline isn't terribly helpful.

      The question is meaningless outside some restriction of the application of an LLM to a specific domain.

  6. Stuart Castle Silver badge

    Bearing in mind some of the data sources, how long before the AI decides that humanity is something evil and needs to be stamped out?

    1. Anonymous Coward

      A long time it would seem

      This generation of technology can't really think. It might however SAY that, mostly because people on the internet have been saying it since at least the old alt.rec.blowuptheearth usenet days. That and because it already has, over and over. It says lots of things actually.

      The problem will be if people keep hooking these things up to something that CAN blow up the earth (or at least greatly inconvenience it), as they may trigger large scale problems without needing to solve complicated programming problems like agency or malice. Incompetence has killed far more than malice though.

  7. Zippy´s Sausage Factory

    Elon Musk has said on Twitter he'll sue anyone that's been using Twitter's data for AI training. This sounds like it might be fun - popcorn is already in the microwave.

    1. Anonymous Coward
      Anonymous Coward

      Who's to say @Twit_Musk isn't a chatbot...

    2. DS999 Silver badge

      I highly doubt he would win such a suit

      Anything that I can grab via the web seems fair game. He can restrict who gets to use the API to grab stuff en masse, but unless Musk has changed Twitter's terms, its users own their tweets, not Twitter. So maybe you can sue if someone uses your tweets for AI training, but Twitter cannot. If he changed the terms to claim Twitter owns everything you post he'd be left with a ghost town before the end of the year.

      A court would have to decide I suppose, but it seems to me that AI training is akin to "reading". If there's nothing stopping me from reading a tweet and learning from it, I don't see how an AI reading that tweet and "learning" from it is any different - if accessed anonymously via the web, not via API or hacking a client or using a login against terms of service.

      1. Anonymous Coward
        Anonymous Coward

        However it seems

        That is definitely not a safe assumption. In many jurisdictions public publication only removes certain rights, and tons of content on the internet has explicit copyright and other IP claims embedded right in the published material.

        Ignore it at your peril if you don't want to risk getting sued. Not all use is fair use.

  8. Anonymous Coward
    Anonymous Coward

    As has been said elsewhere, for each prompt/question a chatbot replies with the answer to "what might a response to this prompt/question look like ?". There is no understanding of meaning, no concept of correctness or bias or deceit - so, for example, generating fake references as part of a response is the chatbot process behaving as intended.

    1. Tez

      Humans can be tricked in the same way: deceptive editing, defamation campaigns, delayed retractions, no smoke without fire, confirmation bias, etc.

      1. Sudosu Bronze badge

        Never forget statistics...

      2. Anonymous Coward
        Anonymous Coward

        It's not being tricked

        It just lacks the ability to understand.

        And what you speak of (and appear to be trying to do) is different. What you are talking about requires a degree of intent to deceive, which you clearly possess and it intrinsically cannot.

        Preventing an unknowing and unintelligent system from producing inaccurate output is a hard problem that hasn't really been solved. Preventing a bad actor with intent as an active adversary can be fixed with a length of rope, a ball-gag, and a muzzle. So an easier and solved problem for at least some problem domains.

        People like you using these tools to further deceive is another hard problem.

  9. Anonymous Coward
    Anonymous Coward

    so which dataset can we use...

    not the British (...) Library, nosir, cause copyright, OMG, what shall we do, what shall we do... OK, let's train it on this FREE garbage pile we call 'the internets' and see what happens.

    1. Anonymous Coward
      Anonymous Coward

      Re: so which dataset can we use...

      Not enough non-white faces to train recognition software... add pics of Chicago's down-and-outs

  10. Plest Silver badge

    Raised on 4chan, eh. Let's imagine how that will go.....

    4CHAN fed AI> please write me a resume

    Output > You want a f**king CV? WTF?! You think you're f*cking smart enough to get a f**king job? You a woman? Yeah I reckon we can find you a job, how about something on Only Fans!! LOL! Get back in the f**king kitchen where you belong!

  11. Ordinary Donkey

    Is Mumsnet in the library?

    That's what would really worry me. Place makes 4chan look like Fortran.

    1. Korev Silver badge

      Re: Is Mumsnet in the library?

      I dread to think what it'd recommend you do with a glass of water...

  12. Anonymous Coward
    Anonymous Coward

    Racist, anti-trans, and toxic text were scraped from websites

    Point of order, it is still allowed to debate and discuss trans issues without being an awful person. Contrary to what vocal groups tell us, disagreeing with "your gender is what you define it as" is not hate-speech, as witnessed in the world of sport particularly right now.

    In the context of the article the content referred to IS likely hate-bile, but let's not start bandying around "anti trans" because then we end up in the situation where anyone who says anything against the moving target of rightthink (regardless of factual correctness and good motives) is demonised and boycotted for being "anti trans".

    1. Anonymous Coward
      Anonymous Coward

      Re: Racist, anti-trans, and toxic text were scraped from websites

      There is no legitimate debate. One side wants to let everyone identify the way they feel is correct, the 'other' doesn't want them to exist or be alive. Classic paradox of tolerance.

    2. Anonymous Coward
      Anonymous Coward

      Ah yes

      Let's bring the thing we weren't talking about into this.

      I get that this thread is catnip for people trying to spam their culture war crap on every public forum, but it wasn't here till your lot brought it up. These dumpster fire sites have plenty of other hot garbage and would still be unsuitable for making a useful AI even if they had a "race and gender" filter. They aren't just unsuitable because they are a haven for bigots and homophobes. That has nothing to do with prohibiting discussion, there or here.

      As long as the Reg wants to put up with it, you say what you want. Feel free to represent the collective id of the internet if it suits you. That has nothing to do with making an ML model that is suitable for purpose. But as a hint, breaking into other conversations to shout about how you feel threatened by censorship for bringing off-topic discussions of race and gender into things may be why you are spending too much time here and not with people who actually like you.

      Most of the trans people I know personally are trans men that literally just want to be able to take a shit in peace. Like them, I think you can talk about whatever you want as long as it's not to me while I am in the men's. Ladies, continue to make your own rules, but as far as I'm concerned, if you need to go, as long as you abide by the no-conversation and no-eye-contact rules I literally won't be able to tell what gender you are, present, or claim, as I am going to neither look nor ask. Those are the only rules I care about, but the locals will probably have posted clever signage of some kind.

  13. Anonymous Coward
    Anonymous Coward

    Here's the thing with the internet.

    Let's say you have an opinion that goes against what normal people think or what is classed as the norm: you are racist, homophobic or pro-guns, for example. The mindset of those people is to post as much as possible in as many places as possible to get people to think like them. They will create bots to spread "the word" according to their thinking. I'm not sure whether that's human nature as I've never had the urge to go and try to change someone's opinion. I may have asked people to question it but that's as far as I would take it.

    Therefore the majority of posts and comments on the internet are going to be bad and training ML on these datasets is only going to end up with one outcome. I would have thought this would be obvious to anyone that's actually spent time on the internet but I guess here we are. They can put as many keyword filters in as they want but the data is still there influencing the responses. It's doomed I tells ya.

    1. Anonymous Coward
      Anonymous Coward

      What is normal?

      Is it a self projection of how one thinks everyone else thinks and operates vs how they actually think or operate?

      What is normal to you may or may not be exceptional for others.

      Right or wrong is an opinion based on your culture or micro culture.

      The Romans tossed children to lions and thought it was great sport... I would hope the normal line of thought now would be that it is kind of wrong, but I cannot read others' thoughts.

      1. Anonymous Coward
        Anonymous Coward


        Please consult a high quality dictionary and not wikipedia, then maybe a book or two on philosophy and basic logic.

        The term "normal" has actual definitions. You discuss none of them.

        What you are doing is academically lazy propagandizing, seeming to spread other ideas and opinions you haven't bothered to properly justify. I've met sharper parrots. Come back when you can understand and articulate the work that came before you if you want to start talking about Romans and Greeks. Both spawned great volumes of work on the subject of both ethics and philosophy, and refuted most of what you are saying long before you said it.

        And waving normal around is by definition meaningless when you aren't paying attention to the scope and range of the thing you are measuring the norm of. It is not a projection of how you think. It's not an opinion. Though it has become common for people arguing it that way to vomit up some meaningless word salad to try to mimic the appearance of someone who is thinking for themselves.

        Sadly trying something isn't the same as succeeding at it, and by trying to look like you are doing the tango, you end up looking like you had a stroke.

      2. Anonymous Coward
        Anonymous Coward

        Normal is a moral code that states you don't be an arsehole to others. It's that simple. It's not complicated. Being racist is not normal. Being homophobic is not normal. I added pro-gun just to wind up the Americans, which is normal as it was for fun. Throwing children to lions is not normal.

        What you have said is a very slippery slope. You are basically saying that some people's normal can be normal, which is never the case as we live in a society with various expectations on what is right and wrong. Sure, you decide your own moral compass, but if it contains hate or hurt to others it's wrong and I think we can all agree on that.

        1. Anonymous Coward
          Anonymous Coward

          So people who don't obey social expectations are abnormal and wrong? How very conservative of you.

          1. Anonymous Coward
            Anonymous Coward

            What is society? You are taking a very 2 dimensional view there and assuming most of society is conservative or society is something determined by government and law. Society is the world we live in. I never said anyone had to obey social expectations I just said there are social expectations of what is right and wrong. We could discuss this till the slow heat death of the universe as it comes under philosophical ethics. What I see as right and wrong may not be the same as what you see as right and wrong. However there are certain things we know are wrong such as murder for example and if you think murder is ok then there is clearly something wrong with you.

  14. Ace2 Silver badge

    Mostly I wonder about all of the things I’ve posted here and there on the internet over the last 30-odd years. In no way, shape, or form do any of these companies have my permission to incorporate my posts into their “products.” How do they expect to ever turn a profit - surely they will be sued out of existence if they ever get close?

    1. Orv Silver badge

      They're banking heavily on AI datasets being considered "fair use" under US law.

    2. Catkin

      It's transformative work. You probably have no copyright over your writings anyway (unless you're completely self hosting or have an enterprise-level hosting arrangement) but there's really very little protection offered by copyright laws in most countries against someone processing publicly available material. If this weren't the case, search engines wouldn't exist.

  15. Kevin McMurtrie Silver badge

    AI is the death of Google

    Google thinks AI will replace them but they have it wrong. AI is poisoning Google. While Google search can still generally find websites and restaurants, it has become useless for finding knowledge. Ask it a question and get back five pages of links to rambling, pointless bot generated garbage. Now Google is developing their own AI and feeding it more trash.

    Maybe Google forgets that Alta Vista and Yahoo faded away because they had ineffective qualitative analysis of search results.

    1. DS999 Silver badge

      Re: AI is the death of Google

      The funny thing is, Google's search is now no better than AltaVista was when it died. Google used to be superior because PageRank was a brilliant idea, until people knew that's what they were doing and gamed the system with link farms and various other schemes as Google tried to work around them. It has been whack-a-mole for 15 years now.

      I use Duck Duck Go as my default search, and get what I'm looking for most of the time because SEOs only really care about optimizing for Google. If I can't find what I'm looking for there I give Google a shot. Half the time it is better than DDG, half the time it is worse.

      1. Anonymous Coward
        Anonymous Coward

        Re: AI is the death of Google

        Yeah, the problem is that Google has bought into the attention economy. They don't care about efficiency or accuracy, they want to shovel the exact maximum amount of garbage in front of you before you switch, or to sell someone a link at the top of the page that takes you wherever the person paying them wants.

        Google or anyone else could deliver much greater results for general search if they excluded the domains hosting content farm trash, and punished high traffic sites that link or mirror their content. That just isn't the business model. Just like they don't need to give Wikipedia a massive and magic boost to its "quality score", but they do because then they can push out a bunch of moderation work to unpaid volunteers on someone else's site, while dodging accountability for it and providing a fig leaf for the sites that use Wikipedia to game the system and do SEO.

        AI isn't intelligent and has little to offer in this regard. An LLM isn't a knowledge engine, even if trained on pristine data, and its addition to Bing and Google is just like the NFT craze, naive and stupid managers chasing the latest buzzwords. Like the idea of a car that could drop your kids off to school on its own over the public roads, this isn't going to work with the tools we have today.

        1. Inventor of the Marmite Laser Silver badge

          Re: AI is the death of Google

          You forgot about the deliberate poisoning of search results with barely relevant* sponsored links and other promoted garbage.

          *If you're really lucky. It's usually simply irrelevant.

  16. MacGuffin


    Another bad idea in a long sad history of bad ideas.

  17. Postscript

    Stop scraping sewers

    They need to stop stealing data for their training sets, regardless of the source's fragrance. If they can't prove clear provenance, they should have to start over with clean, inspectable sets. They should take it as a blessing before they get sued into oblivion - they can start fresh with useful, genuinely curated data and not undifferentiated plunder.

    Since they'll need to start over anyway, they could also solve the problems of watermarking AI generated content and output auditing along the way. They can come back and dump it on the public once they're done.

  18. Tron Silver badge


    Good to see the /b/ros doing their bit for the advancement of tech.

    Incidentally, human readers can read my comments for free, as can search engines, but AI scrapers have to send me £5 a time in 20p pieces. Scraping this statement automatically infers the acceptance of these terms and conditions. The fine for scraping my comments without payment is 1% of your annual revenue.

    The solution is for AI bots to be trained on out of copyright material. I suggest early novels. 20% Austen, 30% Dickens with 'Fanny Hill' to spice things up a bit.

  19. Inventor of the Marmite Laser Silver badge

    How long before we see justification of some kind of drivel or other in various numbskull forums along the lines of "course it's right innit. <Insert name of AIChat Engine> said so innit."

    1. doublelayer Silver badge

      I've already had the misfortune to see that. I'm not sure whether the person concerned actually thought that posting something from ChatGPT would convince us. I'm sure that the next time, he'll not tell us at the start that it was generated by a bot, because everyone else was quick to inform him that, no matter what you get a program to print out, it doesn't make illegal things legal and it won't convince us to do any of those illegal things for him (in this case, copyright infringement by violating a software license). From his responses, I don't think he understood our misgivings.

    2. Ordinary Donkey

      Post-Sun sun readers?

  20. Anonymous Coward
    Anonymous Coward

    Despite all the whinging about 4chan comments being """racist""" or otherwise bad and mean, it has much better content than reddit or twitter

    1. jake Silver badge

      "Despite all the whinging about 4chan comments being """racist""" or otherwise bad and mean, it has much better content than reddit or twitter"

      That's roughly equivalent to arguing over whose cesspool smells better.

  21. Groo The Wanderer

    I think the LLM creators are in big trouble with their scraping of copyrighted and objectionable content both. The former will result in lawsuits; the threats are already there. The latter just produces garbage LLM models.

    I still think the LLM approach is 99% hype with very little substance, and I don't expect it to get significantly better because I consider statistical regeneration to be a wrong-headed approach to artificial intelligence in the first place.
