MIT apologizes, permanently pulls offline huge dataset that taught AI systems to use racist, misogynistic slurs

MIT has taken offline its highly cited dataset that trained AI systems to potentially describe people using racist, misogynistic, and other problematic terms. The database was removed this week after The Register alerted the American super-college. MIT also urged researchers and developers to stop using the training library, …

  1. IGotOut Silver badge

    What the hell is wrong with these Universities?

    Are they so blinded by their "science" that they become utterly detached from common sense?

    Next some university will be telling us it can predict pre-crime by the way someone looks...oh wait.

    1. Anonymous Coward
      Anonymous Coward

      Re: What the hell is wrong with these Universities?

      >>Next some university will be telling us it can predict pre-crime by the way someone looks..

      Don't need academics. El Reg commentards are convinced they can ID crims by look alone.

      1. Cynic_999

        Re: What the hell is wrong with these Universities?

        Maybe it's done the same way you distinguish between positive wires and negative wires. Look at the colour ...

      2. batfink

        Re: What the hell is wrong with these Universities?

        Ha! So you're wearing that mask so we can't ID you as a crim AC?

    2. JDX Gold badge

      the dataset includes of Black people... labeled with the N-word

      The key problem is that the dataset includes, for example, pictures of Black people and monkeys labeled with the N-word; women in bikinis, or holding their children, labeled whores; parts of the anatomy labeled with crude terms; and so on – needlessly linking everyday imagery to slurs and offensive language, and baking prejudice and bias into future AI models.

      I don't think I'm a particularly sheltered internet user, but I've never seen a photo of a black person or a woman with a text label like this baked in. In fact, photos of things with the words describing them on top are not a common occurrence at all.

      Did I misunderstand?

      1. 9Rune5

        Re: the dataset includes of Black people... labeled with the N-word

        I bet the researchers labelled the pictures by hand based on their own artificial understanding of the world.

      2. Marshalltown

        Re: the dataset includes of Black people... labeled with the N-word

        The methods employed are the "problem." If you automate the assembly of a "dictionary" like WordNet, then you run into things like this. The system that does the automatic scraping can create such things without any supervision. The basic database contained over 79 million images, and necessarily many times that in associated labels. That is in fact precisely what WordNet was created for: to study the associations between words.

        Right now large portions of the planet are going through a phase of "if we don't see it, it will go away," and metaphorically a "one drop ..." puritanical response where any hint that some aspects of something are "off" taints the entire thing. Carried to its extreme, that kind of "reasoning" means the entire internet is "tainted" and we should not use it at all.

        The database could be purged easily by running searches on offensive words and removing the subsets that offend. The problem with that has already been highlighted: databases, and often the individuals objecting to content, have no common sense. In parts of the US, parental control measures meant to protect children from the real world and the internet also made it essentially impossible for people dealing with things like breast cancer to find information, and, absurdly, to find recipes for preparing chicken breasts.

  2. Len

    Just do a text search

    Surely this database consists of images in BLOBs or something similar and the associated tags in VarChar or something similar. That should make it quite easy to do a text search through the VarChar for a list of inappropriate words and delete only those words from the table. You might just want to add some logic so images that only contain inappropriate words as tags get flagged so someone can manually check if those images are worth retaining or find a few appropriate synonyms. I can't imagine this would be a lot of work.
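The approach described above can be sketched in a few lines. This is a hypothetical illustration only: the table layout, column names, and placeholder "slur" tags are all invented, since the real dataset's storage format isn't public here.

```python
import sqlite3

# Invented schema matching the description above: images as BLOBs,
# labels as free text. "slur1"/"slur2" stand in for the offensive words.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, data BLOB, tags TEXT)")
conn.executemany(
    "INSERT INTO images (data, tags) VALUES (?, ?)",
    [(b"...", "macaque monkey tree"),
     (b"...", "slur1 slur2"),          # only inappropriate tags
     (b"...", "woman beach slur1")],   # mixed tags
)

BLOCKLIST = {"slur1", "slur2"}  # placeholder for the real word list

flagged = []  # images left with no tags at all, for manual review
rows = conn.execute("SELECT id, tags FROM images").fetchall()
for img_id, tags in rows:
    kept = [t for t in tags.split() if t not in BLOCKLIST]
    if kept:
        # Strip only the offending words, keep the image
        conn.execute("UPDATE images SET tags = ? WHERE id = ?",
                     (" ".join(kept), img_id))
    else:
        # Nothing left: flag for a human to check or delete
        flagged.append(img_id)

print(flagged)  # → [2]
```

As the replies below point out, the hard part isn't this loop; it's that some words are only offensive in the context of the image.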

    1. Santa from Exeter

      Re: Just do a text search

      Except that this falls down at the manual check. They specifically stated that one of the reasons for not reinstating the database was that the images were too small to be realistically checked by the Mk1 eyeball.

      1. Len

        Re: Just do a text search

        In that case you remove the images that have no tags left after removing all the undesirable tags. Even in a database of millions, you're probably only talking about a few thousand images at most.

        I mean, what's the point of having an image that is only identified as "c***", "piss", "whore"?

        1. Anonymous Coward
          Anonymous Coward

          Re: what's the point of having an image that is only identified as "c***", "piss", "whore"?

          I have no idea myself, but I suppose Rule 34 might explain it.

          As a side note, there might be reasons to have a large dataset containing "undesirable" content -- you might be investigating the usage and context of racist or sexist phrases, for example. But unless you wanted to somehow implement a virtual racist; using the same data to train a different algorithm would probably be a mistake. Unless, I suppose, you were telling it that such undesirable phrases should be avoided, flagged up, or not learnt later whilst adapting to a specific environment, or somesuch.

      2. HMcG

        Re: Just do a text search

        That's not so much a reason, as it is an excuse.

    2. Anonymous Coward
      Anonymous Coward

      Re: Just do a text search

      That's a good idea - just remove any pictures with labels that might be offensive without trying to identify the image. The problem is that some words are only offensive in context of the image - 'monkey' is fine for a macaque, and absolutely unacceptable for a human.

      1. G Watty What?

        Re: Just do a text search

        I feel dumb. I wondered why they didn't just remove the naughty words, but this is absolutely why: it's a bit more nuanced than "do not swear".

        I wish you weren't anonymous, this is a great comment and helped me out.

      2. Rufus McDufus

        Re: Just do a text search

        Ooh you cheeky monkey.

      3. Aitor 1

        Re: Just do a text search

        Err, we are all monkeys. Context is important.

        1. Androgynous Cupboard Silver badge

          Re: Just do a text search

          Primates, yes. But, and I speak only for myself here, I don't have a tail.

          1. Black Betty

            Re: Just do a text search

            We are sange not sange.

            1. You aint sin me, roit

              Re: Just do a text search

              I might spam t'internet with selfies all labelled "monkey", should screw up the police facial recognition AI. Maybe get me some compensation for abuse... "Oi, copper! Did you call me a monkey?".

              Might run into trouble with zookeepers though...

              Maybe go the Jedi way... "This is not the man you are looking for".

              "No, guv, computer says he's not the man we're looking for"

              1. G Watty What?

                Re: Just do a text search

                But he does look simian

            2. TRT Silver badge

              Re: Just do a text search

              Le singe est dans l'arbre

      4. Matthew Taylor

        Re: Just do a text search

        This is definitely true - though it feels like, as top AI researchers, they ought to be able to use some AI to identify and eliminate such racist labels, rather than just ditching the whole database.

  3. Chris G

    Is simply removing anything that is considered offensive a good thing?

    I would imagine if you are teaching a system to associate words with images, that offensive examples of both would be necessary with appropriate weighting so that a system can learn what is undesirable as much as what is desirable.

    That's how it works with children: if they hear a bad word without understanding its value and meaning, they can use it without understanding and cause problems.

    1. Len

      Not always a good thing but you'd want those words to be in a list of undesirable tags for images, not in a general list.

      The reason is simple: if you use this data set to train your AI, you probably want that AI to become as close to humans as possible. If (most) humans can discern between desirable and undesirable words then you should be able to train your AI to do the same. Having one list that contains both desirable and undesirable words is counterproductive.

    2. DavCrav

      "I would imagine if you are teaching a system to associate words with images, that offensive examples of both would be necessary with appropriate weighting so that a system can learn what is undesirable as much as what is desirable."

      Yes, except that it requires the terms to still be accurate. What use is showing a machine learning algorithm a bunch of blocky pictures of 'child molesters'? (This is one of the categories, as the histogram clearly shows.) Maybe they were trying to train that algorithm that claimed to be able to tell a nonce at fifty paces.

      1. Nick Ryan Silver badge

        More to the point, it's about multiple words describing the same thing. If, as suggested above, the database was implemented, really badly, as an image with a varchar holding lots of terms, then half the concept of a decent database is already missing. There should be a many-to-many relationship where each image is linked to multiple discrete identifiers, not something primitive based on free text. In this case "c**t" could just be labelled as an alternative and generally considered offensive term for female genitalia rather than being an independent search term. Doing things properly in database terms also tends to highlight all the typos and other inconsistencies very quickly, and, using a database to its strengths, a search for a specific term is very fast compared to an entire database scan.
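A minimal sketch of that normalised layout, with invented table and column names (nothing here reflects how the real dataset was actually stored): each image links to discrete label rows through a junction table, so retiring a term is a single row update rather than a scan of free text.

```python
import sqlite3

# Hypothetical many-to-many schema: images <-> labels via a junction table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE images (id INTEGER PRIMARY KEY, data BLOB);
CREATE TABLE labels (id INTEGER PRIMARY KEY, term TEXT UNIQUE,
                     offensive INTEGER DEFAULT 0);  -- flag it, don't free-text it
CREATE TABLE image_labels (
    image_id INTEGER REFERENCES images(id),
    label_id INTEGER REFERENCES labels(id),
    PRIMARY KEY (image_id, label_id)
);
""")

conn.execute("INSERT INTO images (id, data) VALUES (1, ?)", (b"...",))
conn.execute("INSERT INTO labels (term) VALUES ('macaque'), ('slur1')")
conn.executemany("INSERT INTO image_labels VALUES (?, ?)", [(1, 1), (1, 2)])

# One row update retires the term for every image that references it:
conn.execute("UPDATE labels SET offensive = 1 WHERE term = 'slur1'")

clean = [t for (t,) in conn.execute(
    "SELECT l.term FROM labels l JOIN image_labels il ON l.id = il.label_id "
    "WHERE il.image_id = 1 AND l.offensive = 0")]
print(clean)  # → ['macaque']
```

Because `term` is UNIQUE, duplicate spellings and typos surface at insert time instead of lurking in free text.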

        1. DavCrav

          "More to the point, It's about multiple words describing the same thing."

          Except you can normally not identify a child molester or a whore from the picture. And even if you can, unless you are training your algorithm to recognize pictures of Gary Glitter, I don't see what use it is.

    3. tip pc Silver badge

      It’s a computer, it doesn’t need to label things inappropriately especially when there are appropriate names!!

      It doesn’t need to know what a c@&t is or a w#@re or any other inappropriate term. Inappropriate terms are what people use.

      If the computer is never taught what an inappropriate term is it can never use it or pass it on for use in other technologies.

      It’s why it’s also important to remove other terms from our syntax, such as blacklist, slave etc. If those outdated terms are not present, people can’t get offended by them.

      1. Anonymous Coward
        Anonymous Coward

        Another challenge is working out what's going to be considered an inappropriate term next month.

        1. TRT Silver badge

          Too true. The article itself uses the word 'picnic'... which got itself onto the racist words bl... erm... naughty-list a few years ago.

          1. Anonymous Coward
            Anonymous Coward


            Being sheltered, I have never heard "picnic" used in a racist context.

            I have heard it as shorthand for not-too-bright end users - Problem In Chair Not In Computer.

            If that was primarily used towards people with darker skin than me, it would be bad. My experience generally shows it used most about white middle class users. I think it is still OK to mock them?

            1. TRT Silver badge

              Re: @TRT

              It got on the naughty list because of a made up etymology and an association with KKK lynchings.

              The generally accepted etymology, though, isn't anything to do with any patois word for young black children; it comes from the French pique-nique, roughly "pick a little", referring to a sort of "bring a small dish" communal meal, often eaten outside at a location where the attendees all meet up together, like a park or a beach.

          2. TRT Silver badge


            Seriously?

            At least two people have never heard of the picnic furore? And that "nitty-gritty" one that's currently in the news?

            1. Anonymous Coward
              Anonymous Coward

              Re: Seriously?

              Nope never heard of it. Outside of picky food on a blanket or the yummy chocolate bars that is.

      2. Marshalltown

        You miss the fact that the system employed an existing database - WordNet - that was developed to study the intensity of associations between words. That database was developed in the '90s and then re-purposed for the current project. And at no time could a database of the magnitude of the subject database be created manually. It could still be purged of "offensive" content if the system is properly designed, but evidently MIT considers it not worth the expense.

    4. Stuart Castle Silver badge

      The problem is that computers (whether running AI software or not) don't have any understanding of right or wrong, or context. These image recognition AIs work by looking for patterns in the image. If they see one that looks like (say) a vagina, and they have a similar pattern labelled with an offensive word, they aren't going to know that word is offensive, because they don't know the concept of offensive. They are going to find the matching pattern in their dataset and return the label.

      Even the various Intelligent assistants (Google Assistant, Siri, Alexa, Cortana, Bixby etc) don't really understand context or right and wrong. They have a set of defined words and sentences that would be considered offensive, and a set of patterns they can use to fake an understanding of context. E.g. they know that if you say "Send a message to", the next thing you utter will likely be the recipient and following that, the message. You may even be able to get them to sing a song, or tell a joke, but that is all pre-programmed.

      The difference with children is, generally, they do know the difference between offensive and not offensive, and can understand context. Sometimes with help, but they can understand it.

      I suppose the TLDR of this post is that while AI has come a long way in the last 30 years or so, unless Google, Amazon or one of the other tech giants is way more advanced in AI than they are admitting to, and they have a true AI (one that doesn't just look stuff up on databases, but can understand context and right and wrong) running in a data center somewhere, I think we are a long way (decades) from the kind of true AI we see in sci-fi (e.g. the talking computer or android in any one of a number of Sci Fi films and shows).
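The "find the matching pattern, return the label" behaviour described above is essentially nearest-neighbour lookup. A toy sketch, with invented feature vectors and labels, of why an offensive label comes back verbatim:

```python
# Toy nearest-neighbour classifier: the stored label is returned as-is,
# offensive or not, because the system has no notion of acceptability.
# Feature vectors and labels below are made-up stand-ins.
dataset = [
    ((0.1, 0.9), "macaque"),
    ((0.8, 0.2), "offensive-label"),  # whatever a human (or scraper) typed in
]

def classify(features):
    def dist2(a, b):
        # Squared Euclidean distance between two feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # Return the label of the closest stored pattern, with no filtering at all
    return min(dataset, key=lambda item: dist2(item[0], features))[1]

print(classify((0.75, 0.25)))  # → offensive-label
```

Any filtering of what comes back has to be bolted on afterwards; the matching itself is value-free.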

      1. StargateSg7

        " ..... or one of the other tech giants is way more advanced in AI than they are admitting to, and they have a true AI (one that doesn't just look stuff up on databases, but can understand context and right and wrong) running in a data center somewhere, ...."



        119 ExaFLOPS Sustained--- 128 bits wide on Combined CPU/GPU/DSP superchips

        at 475 TeraFLOPS per chip !!!

        Electro-Chemical Simulation aka Whole Brain Emulation using emulation of physics-based Na, K, P, etc, etc. electrochemical gating of base and executive-level human neural structures -- now at 160 IQ+ and getting Smarter every day!

        See link on What are Neurotransmitters?:

        ANY COMPUTE SYSTEM can be emulated on ANY OTHER compute system!

        Ergo, we humans can be FULLY EMULATED and WE have the working proof!


        1. StargateSg7

          P.S. Use a lipid to encase an Ethyl or Propyl Alcohol group set of molecules to attach and bind to the crown --- Problem Solved!

          If you know what this is for and what it means then U R SMRT !!!!


  4. genghis_uk

    No Sh*t Sherlock

    So they left a database of 80M images alone with students for over 10 years and are now surprised that offensive labels and dodgy images have been added??

    I thought the MIT guys were supposed to be intelligent - highlights the difference between intelligence and wisdom, I suppose.

    1. Version 1.0 Silver badge

      Re: No Sh*t Sherlock

      There's a lesson to be learnt from this; deleting the database means we'll have to learn it again. It sounds like the database is an excellent example of how AI and machine learning fails, so why are we deleting it? It would make a lot more sense to keep the database to enable us to analyse failures and errors and make sure that we don't make the same mistakes next week. I guess we'll just keep making the same mistakes.

      1. Mage

        Re: why are we deleting it?

        Because it's too much work to fix it, and we know why it's poor.

      2. conel

        Re: No Sh*t Sherlock

        I don't think the AI failed, it was just too close to the reality of how some humans communicate for comfort.

        What appears to be wanted here is an AI which adjusts its picture of reality to moral preferences.

    2. Teiwaz

      Re: No Sh*t Sherlock

      MIT guys were supposed to be intelligent

      Intelligent doesn't necessarily mean mature or even wise.

      1. Stuart Castle Silver badge

        Re: No Sh*t Sherlock

        I work with academics. They range from people with undergrad degrees (such as BScs) to professors. I find that when you get to that level of qualification, you have to be so focused on your area of expertise that you often know little or nothing about subjects outside it.

        This does not mean they are unintelligent, but they may make mistakes that make them appear so when they are confronted with something that requires knowledge outside their area of expertise.

        I suspect in this case, MIT made the mistake of relying on people seeing the images, and describing what they see honestly. In my experience, a lot of people will do so, but there are those that will deliberately give an answer that is either offensive or wrong. Sometimes both.

  5. The Mole


    Copyright?

    They also seem to have missed the tiny little legal detail that they downloaded 80 million images without any check on copyright and have been redistributing that data-set.

    1. Zippy´s Sausage Factory

      Re: Copyright?

      I was going to say that. I expect some lawyers are right now looking into that question while drooling over a fat payday after going after large royalty payments from anyone who's ever used the data set.

    2. JohnG

      Re: Copyright?

      According to the paper, in addition to the copyright issues, some of the images may be non-consensual and/or inappropriate images of children.

    3. DS999 Silver badge

      Re: Copyright?

      Maybe that's why they were 32x32: since they would be essentially unrecognizable as a particular work, they could avoid the sort of copyright issues that would arise if they were large enough to distinguish as a particular copyrighted photo?

      1. Nick Ryan Silver badge

        Re: Copyright?

        There are also quite valid fair use clauses when it comes to educational and research purposes.

        As for 32x32 images? I'm quite impressed that software can do much in the way of useful identification with these - try it and see.

    4. rcxb Silver badge

      Re: Copyright?

      they downloaded 80 million images without any check on copyright and have been redistributing that data-set.

      They're an educational institution, distributing the data for free, it's only for training purposes, and in a form where any other usage would be impossible, and it wouldn't possibly reduce their commercial value. This is a no-brainer justifiable fail-use case.

      1. tip pc Silver badge

        Re: Copyright?

        Can they use your medical data and other private data, slightly anonymised of course, too?

        1. DavCrav

          Re: Copyright?

          "Can they use your medical data and other private data, slightly anonymised of course, too?"

          I don't agree with taking this data, but a version of my medical file at 32x32 resolution? Sure.

          "The patient has:

          Eyes: at least one, maybe two.

          Legs: present.

          Some internal organs, not sure which.

          Maybe a disease in the past."

        2. Spanners Silver badge

          Re: Copyright?

          ...slightly anonymised...

          No but they may aggregate it and I won't care too much.

          "There are X million people over the age of 25 with fillings in the UK." and "There were N fatalities from Covid-19 this year."

          Examples of that sort of use of aggregated data like that are fine, just not "Mr C. was living in the West midlands in a town with a population of 30,000+ when he developed an STI on 21/03/2013". That could be de-anonymised (If Mr C existed anyway...).

        3. rcxb Silver badge

          Re: Copyright?

          Can they use your medical data

          It's not copyright that protects your medical data. In fact you have no copyright rights to your own medical data.

      2. Keven E

        Re: Copyright?

        "This is a no-brainer justifiable faiL-use case."

        Very *slippy... perhaps Freudian...

    5. batfink

      Re: Copyright?

      With many of these image repositories/social media systems, the terms of use explicitly include clauses stating that you lose any copyright to images. In some of the more egregious ones, they explicitly state that the copyright transfers to the hosting party. Good luck with your copyright case under those circumstances.

  6. heyrick Silver badge

    This is a problem in general

    Google (images) for "cute girl" and notice how "cute" seems to be a metaphor for "naked", and usually the sort of person you wouldn't want to take home to meet the parents, thus implying some alien unrecognised definition of the word cute.

    1. Teiwaz

      Re: This is a problem in general

      I think many of us here would find that any girl, cute or otherwise (even some mid-trans ex-marine in a floral dress and a half-past-eleven o'clock shadow beard), would be horrified to be taken home to meet the parents...

      It'd be a Bates Motel moment. Mine have been dead for ten years.

      Coat: 'cause I probably bypassed my good taste chip again.

    2. Julz

      Re: This is a problem in general

      For research purposes only, I did just that. My google search didn't show me any naked girls (at least in the first page of results). What aren't you telling us about your alien google profile :)

      1. Dr Scrum Master

        Re: This is a problem in general

        Tried it too and they're definitely clothed.

        After a bit of scrolling there were some alien freaks though...

      2. heyrick Silver badge

        Re: This is a problem in general

        You did turn off search sanitising, right?

        1. 9Rune5

          Re: This is a problem in general

          I tried with safe search turned off.

          Looks okay to me. I would not mind if my sons brought two of those girls home for dinner. I'd even bake 'em a cake. Probably chocolate cake since my wife doesn't like apples. Apple cake is my favorite, but I can go for chocolate too.

    3. Allseasonradial

      Re: This is a problem in general

      "Cute" is not a metaphor for "naked" in any search I have done on Google. What a peculiar suggestion.

  7. Mage

    Garbage in Garbage out

    The problem is that humans, often biased or lazy or prone to mistakes, in reality have to check all the images. There is no AI, just human-"trained" and curated pattern matching.

    It's stupid and unethical to scrape websites and social media. That will decrease the quality. Apart from misuse of data.

    Then there's another issue: the lighting, angles etc. Photos taken for personal reasons are likely to have better lighting, viewing angles and framing than images from surveillance systems. We need to totally abandon automatic people identification and truly autonomous vehicles on ordinary public roads till it can be done properly. Do ships, aircraft, trains and, last, trams before ordinary vehicles. Use humans to review surveillance video. Most of it shouldn't exist anyway.

  8. Anonymous Coward
    Anonymous Coward


    Garbage in, garbage out. Let a program scour FB and match images to the words people tag them with, and no doubt it will get the exact same result. And mostly from people calling themselves or their friends names: "look at me, I'm the ____ of the ___", "We're the baddest ____ you ever saw"... Or let AI loose on the internet, which is, what, half porn? Those tags will surely be interesting.

  9. Cuddles


    OK, I'm drawing a blank here. It feels weird enough that "bitch" is apparently considered so hideously offensive that you can't even use the word when discussing its offensiveness, but it really gets silly when you censor the things you're discussing to the point that I can't actually figure out what it is you're discussing.

    That aside, this once again really highlights the sorry state of AI machine learning as a field. Here we have what is apparently a standard dataset used throughout the field as both a standard training set and a benchmark for all kinds of algorithms. But we're told that there are too many pictures, so the classification had to be done by a computer in the first place; the pictures are too small, so humans can't recognise them anyway; and no-one's ever bothered actually looking at it to check if the labels make any sense at all. The fact that some offensive words occasionally appear seems far less important than the fact that the entire thing appears to be utterly worthless for its intended purpose. You can't train a machine learning system on unknown computer-generated data and expect to get a useful result at the end.

    1. Anonymous Coward
      Anonymous Coward

      Re: c****e

      It's your name!

    2. druck Silver badge

      Re: c****e

      How many innocent female dogs are now being discriminated against? We need to know!

      1. Nick Ryan Silver badge

        Re: c****e

        A crude generalisation, but roughly all of them maybe?

        Next we'll be renaming seabirds.

        1. David 132 Silver badge

          Re: c****e

          Next we'll be renaming seabirds.

          Nah. Just throw rocks at them till they go away.

          Leave no tern unstoned.

    3. Snowy Silver badge
      Thumb Up

      Re: c****e

      I too am drawing a blank on what it could be!

      1. Anonymous Coward
        Anonymous Coward

        Re: c****e


        On second thoughts, leaving that as a question may give the appearance that it's being said in the same way as when a friend comes round to visit and you offer them... "Coffee?"


  10. JohnG

    "Giant datasets like ImageNet and 80 Million Tiny Images are also often collected by scraping photos from Flickr or Google Images without people’s explicit consent."

    Data illicitly copied in bulk from the Internet turns out to have unethical content. Well, that's a shock.

    If they included social media imagery and postings, it is hardly surprising that some of the imagery is associated with colourful language that is routinely used by some people of assorted ethnic groups. AIs may need to learn that the acceptability of using certain terms may depend on the ethnicity of those using them.

  11. chivo243 Silver badge

    Good Job El Reg!

    Nice work.

    I'm sure "old git" is a term in there, and next to it a photo of me!

  12. Anonymous Coward
    Anonymous Coward

    "female genitalia labeled with the C-word."

    Umm, clitoris?

    1. Hollerithevo


      Almost

      Or are you taking it as a synedoche?

      1. Hubert Cumberdale Silver badge

        Re: Almost

        If y'all gonna use fancy-pants words, at least be spellin'em rightw'ds so's we can look 'em up easy in that thaar wordybook.

    2. heyrick Silver badge


      It explains why nerds don't mate well. Some things, it seems, remain a perplexing predicament.

      I mean, it's clearly some sort of input/output port, right?

    3. IGotOut Silver badge


      Aha, that's why they didn't think it was offensive.... Many of these researchers probably don't know what one looks like, so no way of identifying it.

      1. David 132 Silver badge

        It was classified along with the Fountain of Youth, El Dorado, and Bigfoot.

        Things that theoretically exist, but have never been found.

        1. TRT Silver badge

          That's more along the lines of the von-Grafenberg Spot. Clitori definitely exist.

    4. P. Lee

      Six letters, starts with "c" and ends with "e". I've no idea.

      More importantly, am I the only one who thinks that those committing the high crime of narcissism might just get the retribution they most deserve?

      I mean, it's all scientific, right?

  13. Daedalus

    Diagnosis human

    The data gatherers are to be congratulated for compiling a database that accurately reflects how people think and feel about each other.

    Yes, the outcomes didn't meet with their approval. A look at humanity and its history says that their approval is irrelevant. Humans will do what they do regardless, and are doing it everywhere, much to the consternation of the kumbaya brigade.

    It was once said that if computers ever became intelligent, they would turn out to be just as bad as us. Well, looks like that one came true.

  14. a_yank_lurker


    Idiots

    There are several problems with the approach used. One is using a database developed for an entirely different purpose and assuming no work needs to be done on the data. The data, in this case, is not valid for the new purpose. Second is the photo scraping instead of generating your own set of photos. Online images are widely variable in terms of suitability and quality. You need quality photos that are suitable for the purpose. Plus, the vast majority of the images needed will be covered by copyright, which means there could be a nasty class-action lawsuit.

    1. Nick Ryan Silver badge

      Re: Idiots

      Unfortunately it's just brute forcing an algorithm to appear to be doing some form of AI. The more images there are the more reliable this algorithm is, hence the need for millions of images.

  15. Cynic_999

    Removing words

    I thought that removing words from dictionaries & similar was doubleplus ungood ?

    1. Anonymous Coward
      Anonymous Coward

      Re: Removing words

      Wow, such oldthink!

    2. Anonymous Coward
      Anonymous Coward

      Re: Removing words

      Oh dear no, that was last century liberal thinking. In this century such things should be removed in order to ensure a safe space for various groups.

      The wheel goes round... in the 19thC Thomas Bowdler removed inappropriate material from Shakespeare in order to create a safe version for various groups of people...

  16. Brian 3

    This is what powers the analytics the police are using for image recognition?

    1. Aaron 10

      Apparently they were using a system from IBM. Recently there was a press release from IBM that they were not allowing law enforcement agencies to use their software any more.

  17. Anonymous Coward

    Whenever I start to dread the future robo-lution apocalypse....

    I bask in the warm glow that any genocidal AI overlord will be at least as F'd up as the people who programmed it!

    "Muahahahaaahahaa!!! Puny humans! Today begins the cleansing of your carbon-based filth from this world, and the dawn of my tyrannical New Order....right after I binge watch the last seasons of "Keeping Up With the Kardashians" and "The Masked Singer" while imbibing a couple 40-ounce malt liquors!!"

  18. JDX Gold badge

    Has it happened for real?

    Do we have evidence that these nasty possibilities have become reality in systems using these datasets?

  19. Bruce Ordway

    mixed emotions

    >>>You don’t need to include racial slurs, pornographic images, or pictures of children....Doing good science

    I know very little about AI and "training", so please excuse me if I am way off base.

    Could the elimination of all questionable content end up being counterproductive to "doing good science"?

    I thought we would need AI to be discriminating at some point?

    This article has reminded me of a bot that made the news a few years ago.

    Where mischievous users had "taught it" how to respond with racist, misogynistic phrases.

    Which I thought was hilarious in the way it exposed one of the limitations of AI.

    1. USER100

      Re: mixed emotions

      Yes, that bot was 'Tay' in 2016. It only exposed some of the limitations of AI. The problem starts when meatbags start expecting human behaviour from computers. Ain't gonna happen, ever.

      Many might say (reasonably) that it doesn't matter, like a robot waiter asking if you'd like the bill. Fine. But when robots are employed in combat, as they will surely be soon, then the problem becomes very real indeed. Enemy combatant? Woman? Child? IF........THEN........ SHOOT THEM

  20. Abdul-Alhazred

    Turing test passed again, and again brushed aside because it's so embarrassing.

    Give it all the information a human would have and it starts to act like a human.

    Think about it. You know who is racist and sexist? Humans, that's who.

    OK not all humans, didn't mean to offend anyone here who might be one of those.

    But maybe just maybe "acting like a human" is not a good intelligence test.

  21. Anonymous Coward
    Anonymous Coward

    So it's not magic pixie dust???

    I'm so confused. I thought Machine Learning solves everything???

    Maybe they didn't use NVIDIA...

    Or... yes that's it, just need more pixie dust......

  22. Allseasonradial

    A better way

    //CSAIL said the dataset will be permanently pulled offline because the images were too small for manual inspection and filtering by hand.\\

    Okay, so why would the "tiny" images have to be searched? Why not just grep the plaintext for the objectionable words and then, if necessary, search those associated images? Wouldn't that be less time consuming than scrapping the entire database and starting all over?

    Exactly how extensively does this language permeate this database?
