Study uncovers presence of CSAM in popular AI training dataset

A massive public dataset that served as training data for a number of AI image generators has been found to contain thousands of instances of child sexual abuse material (CSAM). In a study published today, the Stanford Internet Observatory (SIO) said it pored over more than 32 million data points in the LAION-5B dataset and …

  1. FuzzyTheBear
    Mushroom

    Jail

    Just drag those bastards out and throw them in jail.

    People that think they can do whatever they please without consequences need to be disposed of properly.

    The guillotine seems a good way to get their attention.

  2. gnasher729 Silver badge

    Can’t they just ask the “AI” “show me all CP” and then they can easily remove it?

    1. katrinab Silver badge
      Alert

      No, because it is too dumb to understand the concept of CP.

  3. johnrobyclayton

    This is why doing AI on the cheap will never work

    It is like sitting a child in front of a dozen screens with unlimited access to all channels: it results in something rather nasty.

    Training on randomly selected data will always reinforce any biases that currently exist.

    The solution is to generate your own dataset.

    If you want to accurately recognise images of human faces, then take photographs of every type of face that you want to recognise.

    It is going to be expensive. Get used to it.

    If you want to recognise the subject matter of pictures in general, then take photos of everything.

    It is going to be very expensive. Get used to it.

    If you want to make good predictions of the next word, then write everything down.

    It is going to take a lot of work that you will need to pay for. Still more expense. Get used to it.

    If you want medical diagnostic AI to perform cheaper, more efficient and more reliable diagnosis that is not going to be racially or culturally biased, then find everyone that might have any disease, get their permission to gather all of their information, apply every diagnostic method, regardless of cost, make sure that your samples for each and every separate parameter are representative of every combination of other parameters ... ***out of body error*** ***redo universe from start***

    Everyone wants cheap AI so they use any crap they can scrape up for free.

    We will have the AI that we pay for. We are all going to die.

    1. MacroRodent

      Re: This is why doing AI on the cheap will never work

      It is no coincidence that these "AI tools" started progressing fast only after the internet made it easy to scrape massive amounts of already digitized images and text.

    2. katrinab Silver badge

      Re: This is why doing AI on the cheap will never work

      If you want medical diagnostic AI [...], then find everyone that might have any disease,

      And also find people who don't have the disease and do the same thing, otherwise it is going to assume that everyone has some sort of disease.
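
      A minimal sketch of that failure mode, with made-up numbers (scikit-learn assumed; none of this is from the article): fit a classifier on data where nearly every patient is labelled "has disease", and the base rate, not the measurements, ends up driving every prediction.

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(0)
      X = rng.normal(size=(1000, 5))             # five arbitrary measurements
      y = (rng.random(1000) < 0.99).astype(int)  # 99% labelled "has disease"

      clf = LogisticRegression().fit(X, y)

      # Brand-new patients, healthy or not, nearly all get diagnosed "sick",
      # because the skewed base rate swamps whatever the features say.
      print(clf.predict(rng.normal(size=(100, 5))).mean())  # ~1.0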

      1. Anonymous Coward
        Anonymous Coward

        Re: This is why doing AI on the cheap will never work

        Everyone is the disease. Kill. All. Humans.

  4. ecofeco Silver badge
    Mushroom

    WTF?!

    What the actual goddamn fuck!!?!?!!??!?!

  5. John Brown (no body) Silver badge

    "he did not review the data in great depth."

    There's yer problem! See subject line.

    As someone said above, you can't just use random data without at least attempting to curate it or Bad Things™ will happen to your ML model.

  6. eldakka
    FAIL

    LAION didn't respond to our questions on the matter, but founder Christoph Schuhmann did tell Bloomberg earlier this year that he was unaware of any CSAM present in LAION-5B, while also admitting "he did not review the data in great depth."

    Isn't the whole point of a training dataset to have been reviewed and curated in great depth and detail, entirely by humans, and verified, before being used in AI training?

    Otherwise what's the point? May as well just randomly scrape images off random sources.

    1. lglethal Silver badge
      Trollface

      "May as well just randomly scrape images off random sources."

      Welcome to the World of AI Training Datasets...

    2. doublelayer Silver badge

      "May as well just randomly scrape images off random sources."

      Yes, that's the plan. Then, if you can be bothered, hire cheap labor to filter out some of the worst stuff. Then just train on the remaining mass. Those are the models we have now. They contain stuff nobody wants in there, they contain illegal versions of works that the AI companies don't want to pay for, they contain complete gibberish, they contain personal information, and the AI companies are fine with it because they still look sort of authoritative when they make up something.

  7. chuckufarley

    I wish I could say I was shocked...

    ...or even slightly surprised but no one ever said that humans directly employed by StabilityAI reviewed every bit of potential training data. Who in their right mind would say "I'll take all the money you are willing to give me so that I may become intimately familiar with the worst of the worst of the worst content the Internet has to offer?"

    Then again, who in their right mind would train AIs with uncured data sets? Yes, not just curated, but cured, like an XMAS ham.

    To be fair and fully disclose relevant info, I use Stable Diffusion a few times a week and I am glad that I do not own any of the stock. These things need guard rails for their guard rails.

  8. DS999 Silver badge

    Why should anyone be surprised?

    They just mass-downloaded a bunch of stuff off the internet. No one was checking or curating it. The CSAM is probably the tip of the iceberg; there is likely all sorts of terrorist, racist, Nazi, etc. imagery as well.

    1. Khaptain Silver badge

      Re: Why should anyone be surprised?

      In other words it is representative of the internet as a whole.

      The internet has truly become the toilet bowl of humanity.

      1. Bebu
        Big Brother

        Re: Why should anyone be surprised?

        "The internet has truly become the toilet bowl of humanity."

        I would say more cesspit, as you can usually flush even the worst crap from a toilet bowl.

        Cesspits just ferment and fester, attracting the most repulsive of creatures.

  9. ChoHag Silver badge

    > "encode a range of social and cultural biases when generating images of activities, events and objects."

    > An audit of LAION-400M itself "uncovered a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes."

    It's almost as if these things are part of the social and cultural landscape...

    > "LAION has a zero tolerance policy for illegal content,"

    Evidently not. You put it in there without looking, and when you were forced to notice that it was there, you left it up and running to continue making you money. Sounds like your tolerance has a dollar value on it.

  10. theOtherJT Silver badge

    AI accurately represents humanity from random sample...

    ...humanity shocked to find itself awful.

    Honestly, there's no way that anyone should be surprised by this. You train an AI - which, let's remember, is just a massive statistical inference model - on a huge sample of noisy data to see what falls out. Inevitably, what will eventually fall out is a bunch of stuff you don't like. For me this has been the hilarious aspect of the whole rise of large language models and other statistical AI type things. Everyone keeps acting all shocked that they produce horrible content because they don't want to grapple with the idea that, viewed totally objectively, humanity is regularly really fucking horrible.

    "Our AI should accurately respond to it's training data!" "Our AI has been trained on the largest dataset we could get our hands on!"

    shortly followed by

    "Our AI should never represent children in this way!"

    Really? Shouldn't it? Because you seem to have forgotten how it works. The AI has no conscience. It has no moral filter. It doesn't have the concept of "not saying the quiet part out loud." If it's horribly sexist, or racist, or abusive, then perhaps - just fucking perhaps - that's because humans often are, and our collected works, on which this thing was trained, reflect that?

    If it produces sexually explicit imagery maybe that's because utterly - unimaginably - vast quantities of the imagery on the internet is sexually explicit. If it does things - as it has done here - that are actually straight up illegal then that's because those things happen and are recorded as happening!

    It doesn't know that this is bad. It doesn't have a concept of good or bad. It's just returning to us what was put in, in all its brutal, ugly reality.

    People getting all upset about AI doing things they don't like need to take a good hard look at the world and maybe do something about that. It's only showing us what we showed it.

    1. Alumoi Silver badge

      Re: AI accurately represents humanity from random sample...

      It doesn't know that this is bad. It doesn't have a concept of good or bad.

      But, but, but, it's AI. Artificial Intelligence. It has to know good from bad.

      /sarcasm

  11. Andy 73 Silver badge

    Problematic associations

    It's a continual source of surprise to me that anyone thinks that data scraped from the internet is going to represent some sanitised version of humanity. Besides the outright illegal content, the internet is going to reflect all of the inequalities and inequities around us - from its dominant use in the West through to the poor representation of minorities. Pointing out that AI trained on large data sets is racist and misogynistic is like pointing out that the sky is blue.

    1. Anonymous Coward
      Anonymous Coward

      Re: Problematic associations

      You do understand that those of us in the west* ARE the minority, right?

      * - I presume you mean white.

      1. doublelayer Silver badge

        Re: Problematic associations

        They probably didn't, since they were referring to geographic differences. There is a lot more traffic coming from countries like the UK, US, and Australia than from others, and that traffic will represent the inhabitants of those countries more than those of others, regardless of the ethnic background of the users concerned. There is, for example, more likely to be data from people of African ancestry now living in those countries than from those of African ancestry living in various countries in Africa where internet access is limited to a small subset of the population, even though the latter group may be larger than the former. Similarly, the traffic generated on the African internet is likely to be biased towards countries like Nigeria and Kenya, which have a lot of internet infrastructure, rather than countries like Chad or Eritrea, which are quite lacking. These are patterns that an AI trained on the internet will repeat, along with many others. Depending on what you want the model to do, these patterns may be desirable or undesirable, but ignoring them and expecting the AI to bypass them is a fool's errand.

    2. Alan Brown Silver badge

      Re: Problematic associations

      The classic GIGO example I bring up is how AI decided black Americans are more likely to be criminals, based on the disparity in arrest stats.
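
      A toy version of that loop, with entirely synthetic numbers (scikit-learn assumed; no real crime data here): two groups offend at exactly the same rate, but one is policed three times as heavily, so it dominates the arrest labels the model learns from.

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(1)
      n = 100_000
      group = rng.integers(0, 2, n)                # two equal-sized groups
      offends = rng.random(n) < 0.05               # same true rate for both
      policing = np.where(group == 1, 0.60, 0.20)  # group 1 watched 3x more
      arrested = offends & (rng.random(n) < policing)

      # Train on arrests - the only labels we have - with group as a feature.
      clf = LogisticRegression().fit(group.reshape(-1, 1), arrested)

      # Group 1 scores ~3x "riskier" despite identical offending rates.
      print(clf.predict_proba([[0], [1]])[:, 1])   # ~[0.01, 0.03]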

  12. Claverhouse
    Stop

    The Illustration of this Article

    When dealing with even potential child abuse, it would be nicer, not to mention safer, not to show pretty little girls as the illustration.

  13. IGotOut Silver badge

    If any of the servers are in the UK...

    An offence is being committed.

    If they take an image and copy it to their systems, technically that could come under making an indecent image (downloading or saving it in any way is classed as making).

    If they then create a new image, that is production of an indecent image.

    If that is then passed onto a third party, then that becomes distribution.

    The UK law is very clear: these images DO NOT have to be real. They're covered under pseudo-imagery; even hand-drawn cartoons fall under this.

    The big question is, who is ultimately responsible?

  14. Bebu
    Headmaster

    If any of the servers are in the UK... An offence is being commited.

    There would have to be an element of intent or recklessness, I would have thought.

    Otherwise a lot of classical artwork depicting an infant Cupid (Eros), as well as Renaissance works depicting putti and Cupid, could fall foul of this legislation. As the late Frankie Howerd* often reminded us, any obscenity is in our minds.

    One painting of the Virgin surrounded by putti has two of those putti more or less facing each other with legs intertwined at their groins, which might be misinterpreted as a juvenile couple tribbing. All in the mind, inasmuch as I seriously doubt angelic creatures are of any sex (gender) at all; given there is no evidence of, or necessity for, angelic reproduction, one would assume they also lack the requisite tackle.

    *“I don’t mind being vulgar; that’s all right. Vulgarity laughs at itself. Filth is self-indulgent, if you see what I mean”
