If you think AI labs wouldn't stoop to using scraped YouTube subtitles for training, think again

FYI: It's not just Reddit posts, books, articles, webpages, code, music, images, and so forth being used by multi-billion-dollar businesses for training neural networks. AI labs have been teaching models using subtitles scraped from at least tens of thousands of YouTube videos, much to the surprise of the footage creators. …

  1. Anonymous Coward

    Models trained on transcripts are going to be very interesting, because I'm not sure I've ever seen a transcript that aligns 1:1 with what the speaker is saying. Every transcript has at least one word or phrase used differently than what was said. I'm sure that with enough data the subtle differences won't amount to much deviation, but I'm curious whether any of it manages to throw a wrench in a model, since apparently you can just straight-up perfectly replicate things models are trained on.

    1. imanidiot Silver badge

      Depends on the source of the transcript. Some YouTubers (especially the larger ones) will have either volunteers or paid staff manually transcribing the video and time-coding it, which is generally fairly accurate. The subtitles auto-generated by YouTube itself... utter garbage for anything other than English spoken in the clearest mid-Atlantic accent.

      1. UnknownUnknown

        The terrible one on Teams Meetings seems to do better with our Indian colleagues than with various flavours of British, American and Canadian accents/dialect/enunciation/pronunciation.

        Babel Fish/Universal Translator it most certainly is not.

    2. druck Silver badge

      They'll have fun if they use some of the Japanese cartoons my kids used to watch on YouTube. They were obviously dubbed into English and subtitled by two completely different firms, so the names and phrases in each didn't match up at all. You really had to struggle to comprehend they were telling the same story, even though each vaguely matched the animation.

  2. Bendacious Silver badge

    Quality data

    As mentioned above, anyone who has tried to use the subtitles on YT will know that they are almost useless. In my experience they have a 100% record for getting technical terms wrong and cannot cope with anything outside simple words spoken with a mild US accent.

    As these subtitles are automatically generated, this is a situation discussed often in Reg comments - AI training data produced by AI. I can’t believe the data is not poisoned in many places, which kinda serves them right.

    On a separate topic, no one abuses YT creators as much as YT. They recently enabled adverts on people’s channels where the creators had chosen not to advertise, without any option to avoid it. Big channels used by children and even babies (Miss Rachel, for example) now show pre-roll and mid-roll adverts against the wishes of the channels. Evil bastards.

    1. Anonymous Coward

      Re: Quality data

      I could be wrong, but I'm *pretty* sure this is talking about human-written subtitles, not the auto-generated ones. Everyone knows the auto subtitles are trash, even YouTube, that's why you can still make your own. Some YouTubers even hire dedicated transcribers for their videos. Besides that, generated subtitles would just be a synthetic dataset anyway, so you may as well generate a real synthetic dataset of videos with subtitles instead of a half-real/half-wrong one.

      1. Anonymous Coward

        Re: Quality data

        We have been scraping transcriptions from YouTube for about 2 years now. We filter them quite easily 'cause auto-gen doesn't have punctuation, whilst human-edited does.
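
        The punctuation check described above might look something like the following. This is only a rough sketch, not the poster's actual pipeline: the function names, the sample captions, and the 0.02 threshold are all assumptions made for illustration.

```python
import re

# Sentence punctuation marks counted by this hypothetical heuristic.
# Apostrophes are deliberately excluded so contractions don't count.
_PUNCT = re.compile(r"[.,!?;:]")


def punctuation_ratio(text: str) -> float:
    """Return punctuation marks per whitespace-separated word (0.0 if empty)."""
    words = text.split()
    if not words:
        return 0.0
    return len(_PUNCT.findall(text)) / len(words)


def looks_human_edited(text: str, threshold: float = 0.02) -> bool:
    """Guess that a transcript was human-edited if it contains enough punctuation.

    The threshold is an assumed value, not one taken from the thread.
    """
    return punctuation_ratio(text) >= threshold


# Made-up sample captions, keyed by how they were (supposedly) produced.
captions = {
    "auto": "so today we are going to look at how the engine works and then take it apart",
    "human": "So, today we're going to look at how the engine works. Then we take it apart!",
}

# Keep only the transcripts that pass the heuristic.
kept = [name for name, text in captions.items() if looks_human_edited(text)]
print(kept)
```

        A filter like this would stop separating the two kinds of transcript as soon as auto-generated captions gain punctuation.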

        However, it is nearly 100% certain that auto-gen punctuation will come very soon and the reason YouTube are holding back on it is because it will be Scrape City when that happens.

        "... generated subtitles would just be a synthetic dataset anyway, so you may as well generate a real synthetic dataset of videos with subtitles instead of a half-real/half-wrong one."

        Had to ask Gemini to make a dunce's version to understand it. Not that I don't understand what you are saying, just that it is difficult to extract from the sentence. Even Gemini struggled but agreed with you.

        And the reason I pulled this one comment on my day off is that this is the first roundabout mention of synthetic intelligence (SI) on here that I have noted and one which, whilst not quite there, does capture the point of unfathomable computing.

        Let the machines make their own 'language' - not the abstract processing of data that we currently instruct them to do, after which we await the reports with just an overview of how they did it.

        The more work I do in this area the more I am reminded of the Improbability Drive.

        1. jospanner

          Re: Quality data

          YouTube auto-generated punctuation? It can barely get the words right; can’t wait to see what mess it makes of this.

          1. Brewster's Angle Grinder Silver badge
            Trollface

            Re: Quality data

            I: think. It, "will!" be? fine;

            1. Anonymous Coward

              Re: Quality data

              O ye of little faith!

              LOL.

              With attitudes like yours it is no wonder we can't have nice things anymore.

      2. Brewster's Angle Grinder Silver badge

        I agree this sounds like human-written subtitles.

        But, in this case, using AI-generated ones wouldn't be a problem, because the purpose is to label audio with text. Google will have invested $billions in developing an AI that does that. You don't have those resources. But by scraping YouTube videos, you can get the results of their training - and use it to train your model to a similar standard. Then you use your leftover resources to try and better Google's effort.

        This situation is not like training a model to write articles based on LLM-generated text. In that case, the input is connected to the output, so you get feedback skewing the results with every iteration. But this is a transition from one form to another, where you are trying to piggyback on somebody else's work.

    2. FeepingCreature

      Re: Quality data

      Try out OpenAI Whisper - it's way better than YouTube subs.

  3. Neil Barnes Silver badge

    "quantum chromodynamics" to "flat earth."

    I hope they included "aardvarks" and "ziggy" along the way.

    Isn't it odd: they promised that the internet would provide access to the whole sum of human knowledge; instead it's providing hallucinations about the same...

    1. Anonymous Coward

      Re: "quantum chromodynamics" to "flat earth."

      "they promised that the internet"

      They? Using 3rd person pronouns for someone who is within earshot is considered poor form.

      We did promise the sum of human knowledge. I personally promised the sum of human knowledge in your hand in 1999 to my girlfriend (talking about mobile internet here).

      And now, we, many on here who have built this modern world... we are working on providing hallucinations to you. LOL. Wait until you hear about self-referential paradoxes creating true/untrues that expose the limitations of formal logic systems. You are going to love it.

    2. Baird34

      Re: "quantum chromodynamics" to "flat earth."

      The 'Information Super Highway' has turned to highway robbery.

  4. Dan 55 Silver badge

    EleutherAI

    So this organisation that Apple, Salesforce, and Nvidia invest in says they generate "open source" training data which Apple, Salesforce, and Nvidia just so happen to use. Otherwise known as outsourcing blame and slapping an "open source" label on something so it sounds wholesome.

    Anyway, if they scraped data from YouTube's automatically generated subtitles then good luck to them. They're already at the stage of ML training from ML data from human-generated data so gibberish is guaranteed.

  5. Pascal Monett Silver badge

    "the internet giant puts a lot of effort into thwarting unauthorized scraping"

    Oh sure it does. That's why there's already a 5+ GB dataset of scraped data.

    Honestly guys, can't you see that you are fooling no one ? Your words are worthless because your acts have already spoken for you.

    YouTube is your site. You have no excuse not to be able to lock it down, especially when you are continually messing with YouTube downloader addons. Apparently, those things bother you a lot more than subtitle scraping, because those addons are constantly updating to cope with your messing around.

    But I get it : a downloader addon cuts out your ads and thus impacts your bottom line, and we can't have that, now can we ?

  6. Howard Sway Silver badge

    The Pile includes data pulled from internal Enron emails

    If our AI future is to be based on a mixture of emails from a huge corporate fraud and auto-generated subtitles from YouTube product unboxing videos, that future is going to be very bizarre indeed.

    Although if you're wanting to generate a sales campaign for an IoT smart meter that cons millennials out of their cash, it should do the job perfectly.

  7. vtcodger Silver badge

    Job security ... for some

    AI = Job security for whole generations of lawyers. I'm not sure whether that's good or bad. On the positive side, it keeps the wretched creatures from alternate activities. On the other hand it will likely encourage the breeding of even more of them. I'm not sure how they propagate. Spores, maybe.

  8. IGotOut Silver badge

    It would be funny if Google..

    ...kicked off about content being scraped, given their entire business model is based on using other people's work to make money.

  9. Boolian

    So, to defeat AI scraping, everyone has to be Scottish?

    1. Michael Wojcik Silver badge

      Aye, 'tis th' only way t' be shuuuuuure.

  10. Postscript

    all-purpose defense

    I'll have to remember the Apple defense and in the future get someone else to steal all my business supplies & tools for me. "I didn't steal this stuff, I'm just profiting from it!"

    1. Anonymous Coward

      Re: all-purpose defense

      You beat me to it. "I didn't steal the watches, I just bought them from the guy who did and am selling them. Nothing wrong with that, right?"

      1. JoeCool Silver badge

        A guy walks into a bar with an iPhone prototype ...

        and that didn't turn out well at all.

  11. Bebu Silver badge
    Headmaster

    What next, nutrition labels on cartons?

    I was thinking that feeding this monstrous technology the various EULAs that the twisted little minds of corporate lawyers have themselves hallucinated might be its Waterloo. ;)

    The commentard who promised his girlfriend, in 1999, the sum of all human knowledge in her hand evoked quite a different image in my mind from the nascent mobile internet.

    She gasping: "Oh! I was hoping it would be bigger."*

    Although the sum of most humans' knowledge could be inscribed on a single grain of rice in blackletter and still leave space for footnotes.

    * Why was I thinking of the late Frankie Howerd, I wonder?

  12. Michael Wojcik Silver badge

    Nutrition labels

    And how many calories of waste heat would be generated as models digested their recommended daily allowance of data before its expiration date, eh?

  13. Stevie

    Bah!

    Good luck with that.

    In every single yootoob video I've seen with subtitles, those subtitles are riddled with gibberish created by the speech-to-text software being almost good enough for primetime. The hallucination issue will fade into insignificance behind the nonsense-used-as-training-data issue.

    Naturally, the "creatives" responsible for the content are too busy (ie bone idle) to proofread their visual word salad - which is a problem I foresee getting well-and-truly out of claw when AI software starts authoring everything from stereo instructions to software modules.

    Once humans start automating things they forget the need for sanity checks or won't afford the staff to do them. "Yippee! It's Free" thinking goes all the way down to the Earth's core.

    The noise to signal ratio will look like a tangent curve plot over time, making for medical procedure documentation and legal contractese that could have been written by Donald Trump.

    1. This post has been deleted by its author

  14. This post has been deleted by its author
